Link Search Menu Expand Document
Statistical Analysis with R Workshop

Installing and loading R packages

This workshop uses the following R packages:

  • psych, a statistical analysis package
  • rstatix, a statistical analysis package
  • dplyr, a data manipulation package
  • DT, a package for creating interactive tables
  • ggplot2, a data visualization package
  • GGally, a data visualization package (extension of ggplot2)
  • table1, a package for creating summary tables
  • gtsummary, a package for creating summary tables
  • flextable, a package for tabular reporting
  • modelsummary, a package creating summary tables and plots
  • lmtest, a package for linear regression
  • stargazer, a package for exporting regression tables
  • car, a package for regression analysis
  • RColorBrewer, a color palette package
  • report, a package for reporting statistical results
  • sjPlot, a package for regression analysis tables and plots

Check whether required packages are already installed, and install any that are missing

list.of.packages <- c("psych", "rstatix","dplyr","DT","ggplot2","GGally","table1","gtsummary","flextable","modelsummary","lmtest","stargazer","car","RColorBrewer","remotes","report","sjPlot")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

Load packages into your current session

suppressPackageStartupMessages({
library(psych)
library(rstatix)
library(dplyr)
library(DT)
library(ggplot2)
library(GGally)
library(table1)
library(gtsummary)
library(flextable)
library(modelsummary)  
library(lmtest)
library(stargazer)
library(car)
library(RColorBrewer)
library(report)
library(sjPlot)})

Changing default theme for visualizations

Changing the default theme to theme_light

Suppress error messages

Uploading data

Creating a data frame scores using an in-built dataset sat.act from the psych package

scores <- sat.act

Importing data from a .csv file (located in your working directory)

# scores <- read.csv("sat.act.csv") # does not work in this case as you don't have a .csv file

Describing data structure

See the first six rows of the data frame

head(scores) 
##       gender education age ACT SATV SATQ
## 29442      2         3  19  24  500  500
## 29457      2         3  23  35  600  500
## 29498      2         3  20  21  480  470
## 29503      1         4  27  26  550  520
## 29504      1         2  33  31  600  550
## 29518      1         5  26  28  640  640

Get dimension for your dataset

dim(scores) # Get number of rows (observations) and number of columns (variables)
## [1] 700   6

Check data structure

str(scores)
## 'data.frame':    700 obs. of  6 variables:
##  $ gender   : int  2 2 2 1 1 1 2 1 2 2 ...
##  $ education: int  3 3 3 4 2 5 5 3 4 5 ...
##  $ age      : int  19 23 20 27 33 26 30 19 23 40 ...
##  $ ACT      : int  24 35 21 26 31 28 36 22 22 35 ...
##  $ SATV     : int  500 600 480 550 600 640 610 520 400 730 ...
##  $ SATQ     : int  500 500 470 520 550 640 500 560 600 800 ...

Browse dataset interactively

datatable(scores)

Data manipulation

Changing format of variables

Check current data format of the gender variable

class(scores$gender)
## [1] "integer"

Change the format of gender from integer (quantitative) to factor (categorical)

scores$gender <- as.factor(scores$gender)

Check whether gender has factor format

is.factor(scores$gender)
## [1] TRUE

Exercise 1

  • Using class command, check the current format of education
  • Using as.factor command, change education to factor.
  • Using is.factor command, check if education is defined as factor.

Check current data format of the education variable

class(scores$education)
## [1] "integer"

Change the format of education from integer (quantitative) to factor (categorical)

scores$education <- as.factor(scores$education)

Check whether education has factor format

is.factor(scores$education)
## [1] TRUE

Transforming values of categorical variables

Gender

Gender is now factor but it is still coded as “1” (men) and “2” (women) - it would be helpful for later analysis to change “1” to “men” and “2” to “women”

Replacing values

# Step 1: Change the data format to character
scores$gender <- as.character(scores$gender)

# Step 2: Replace "1" with "men" and "2" with "women"
scores$gender[scores$gender=="1"] <- "Men"
scores$gender[scores$gender=="2"] <- "Women"

# Step 3: Change the data format back to factor
scores$gender <- as.factor(scores$gender)

# Step 4: Check levels for `gender`
levels(scores$gender)
## [1] "Men"   "Women"

Descriptive statistics

Categorical data

Get frequency for the gender variable

table(scores$gender)
## 
##   Men Women 
##   247   453

Get cross-tabulation for gender and education

table(scores$gender, scores$education)
##        
##           0   1   2   3   4   5
##   Men    27  20  23  80  51  46
##   Women  30  25  21 195  87  95

Quantitative data

Find summary statistics for each quantitative variable

get_summary_stats(scores)
## # A tibble: 4 x 13
##   variable     n   min   max median    q1    q3   iqr    mad  mean     sd    se
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1 ACT        700     3    36     29    25    32     7   4.45  28.5   4.82 0.182
## 2 age        700    13    65     22    19    29    10   5.93  25.6   9.50 0.359
## 3 SATQ       687   200   800    620   530   700   170 119.   610.  116.   4.41 
## 4 SATV       700   200   800    620   550   700   150 119.   612.  113.   4.27 
## # ... with 1 more variable: ci <dbl>

Find summary statistics for ACT scores, grouped by gender

scores %>% 
  group_by(gender) %>%
  get_summary_stats(ACT)
## # A tibble: 2 x 14
##   gender variable     n   min   max median    q1    q3   iqr   mad  mean    sd
##   <fct>  <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Men    ACT        247     3    36     30    25  32.5   7.5  4.45  28.8  5.06
## 2 Women  ACT        453    15    36     29    25  32     7    4.45  28.4  4.69
## # ... with 2 more variables: se <dbl>, ci <dbl>

Generated a summary statistics table, grouped by gender

table1::table1(~ education + age + ACT + SATV + SATQ | gender, data = scores)
Men
(N=247)
Women
(N=453)
Overall
(N=700)
education
0 27 (10.9%) 30 (6.6%) 57 (8.1%)
1 20 (8.1%) 25 (5.5%) 45 (6.4%)
2 23 (9.3%) 21 (4.6%) 44 (6.3%)
3 80 (32.4%) 195 (43.0%) 275 (39.3%)
4 51 (20.6%) 87 (19.2%) 138 (19.7%)
5 46 (18.6%) 95 (21.0%) 141 (20.1%)
age
Mean (SD) 25.9 (9.74) 25.4 (9.37) 25.6 (9.50)
Median [Min, Max] 22.0 [14.0, 58.0] 22.0 [13.0, 65.0] 22.0 [13.0, 65.0]
ACT
Mean (SD) 28.8 (5.06) 28.4 (4.69) 28.5 (4.82)
Median [Min, Max] 30.0 [3.00, 36.0] 29.0 [15.0, 36.0] 29.0 [3.00, 36.0]
SATV
Mean (SD) 615 (114) 611 (112) 612 (113)
Median [Min, Max] 630 [200, 800] 620 [200, 800] 620 [200, 800]
SATQ
Mean (SD) 636 (116) 596 (113) 610 (116)
Median [Min, Max] 660 [300, 800] 600 [200, 800] 620 [200, 800]
Missing 2 (0.8%) 11 (2.4%) 13 (1.9%)

Alternative way of making a summary statistics table (using gtsummary package)

scores %>% 
  tbl_summary(by = gender)
Characteristic Men, N = 2471 Women, N = 4531
education
0 27 (11%) 30 (6.6%)
1 20 (8.1%) 25 (5.5%)
2 23 (9.3%) 21 (4.6%)
3 80 (32%) 195 (43%)
4 51 (21%) 87 (19%)
5 46 (19%) 95 (21%)
age 22 (19, 30) 22 (19, 28)
ACT 30 (25, 32) 29 (25, 32)
SATV 630 (540, 700) 620 (550, 700)
SATQ 660 (570, 720) 600 (500, 683)
Unknown 2 11

1 n (%); Median (IQR)

Show mean, SD, median + range for numeric variables, add p-values

summary_table <- scores %>% 
  tbl_summary(by = gender,
              type = all_continuous() ~ "continuous2", #to add multiple summary lines
              statistic = list(all_continuous() ~ c("{mean} ({sd})",
                                                    "{median} ({min}, {max})"), 
                               all_categorical() ~ "{n} ({p}%)"),
              digits = all_continuous() ~ 1) %>% 
  add_p()

summary_table
Characteristic Men, N = 247 Women, N = 453 p-value1
education 0.007
0 27 (11%) 30 (6.6%)
1 20 (8.1%) 25 (5.5%)
2 23 (9.3%) 21 (4.6%)
3 80 (32%) 195 (43%)
4 51 (21%) 87 (19%)
5 46 (19%) 95 (21%)
age >0.9
Mean (SD) 25.9 (9.7) 25.4 (9.4)
Median (Range) 22.0 (14.0, 58.0) 22.0 (13.0, 65.0)
ACT 0.13
Mean (SD) 28.8 (5.1) 28.4 (4.7)
Median (Range) 30.0 (3.0, 36.0) 29.0 (15.0, 36.0)
SATV 0.5
Mean (SD) 615.1 (114.2) 610.7 (112.3)
Median (Range) 630.0 (200.0, 800.0) 620.0 (200.0, 800.0)
SATQ <0.001
Mean (SD) 635.9 (116.0) 596.0 (113.1)
Median (Range) 660.0 (300.0, 800.0) 600.0 (200.0, 800.0)
Unknown 2 11

1 Pearson's Chi-squared test; Wilcoxon rank sum test

Export summary table as a word document

summary_table %>% 
  as_flex_table() %>%
  flextable::save_as_docx(path = "Summary table.docx")

See https://cran.r-project.org/web/packages/gtsummary/vignettes/tbl_summary.html for more information on how to further modify the table

Inferential Statistics

You should select statistical tests based on your hypothesis and the data type of your variable(s)

  • One sample t-test
  • Chi-square goodness of fit test
  • Two sample t-test
  • One-way ANOVA
  • Two-way ANOVA
  • Chi-square test of independence
  • Simple linear regression
  • Multiple linear regression
  • Simple logistic regression
  • Multiple logistic regression

One sample t-test

It is used to see whether the hypothesized value of the population mean matches actual value.

Question: Is the average ACT score for all participants 27?

Visualize the distribution of ACT scores using a histogram with a normal distribution curve and a line for mean

scores %>%
  ggplot(aes(x = ACT)) + 
    geom_histogram(aes(y = stat(density)), binwidth = 2, 
                   color = "black", fill = "lightblue") +
    stat_function(fun = dnorm, args = list(mean = mean(scores$ACT), 
                                           sd = sd(scores$ACT)), size = 1) +
    geom_vline(aes(xintercept = mean(ACT)), color = "red") +
    labs(title = "Histogram of ACT scores with a Normal Curve", 
         x = "ACT score", y = "Percent of observations")

Run a one sample t-test

m1 <- t.test(scores$ACT, mu = 27,
       alternative = c("two.sided"),
       conf.level = 0.95)
m1
## 
##  One Sample t-test
## 
## data:  scores$ACT
## t = 8.4862, df = 699, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 27
## 95 percent confidence interval:
##  28.18920 28.90509
## sample estimates:
## mean of x 
##  28.54714
report(m1)
## Effect sizes were labelled following Cohen's (1988) recommendations.
## 
## The One Sample t-test testing the difference between scores$ACT (mean = 28.55) and mu = 27 suggests that the effect is positive, statistically significant, and small (difference = 1.55, 95% CI [28.19, 28.91], t(699) = 8.49, p < .001; Cohen's d = 0.32, 95% CI [0.24, 0.40])

t(df=699) = 8.4862, p-value = 0.00% < 5% (reject the null) -> The average ACT score is not 27.

Chi-square goodness of fit

It is used to see whether the actual distribution (from a sample) of a categorical variable matches the expected distribution.

Question: Is gender distribution 50%-50% in this data set?

Make a relative bar-chart showing counts of observations by gender

scores %>% 
  ggplot(aes(fill = gender, x = gender)) + 
    geom_bar(aes(y = stat(count)/sum(count))) +
    geom_text(aes(label = round(stat(count)/sum(count),2), y = stat(count)/sum(count)), 
              stat = "count", position = position_stack(vjust = 0.5), size = 5) +
    scale_fill_brewer(palette = "Dark2") +
    labs(title = "Share of observations by gender",
         x = "Gender", y = "Percent of observations", fill = "Gender")

Run a Chi-square goodness of fit test

m2 <- chisq.test(table(scores$gender), p = c(0.5,0.5))
m2
## 
##  Chi-squared test for given probabilities
## 
## data:  table(scores$gender)
## X-squared = 60.623, df = 1, p-value = 6.913e-15

X-squared(df=1) = 60.623, p-value = 0.00% < 5% (reject the null) -> The gender distribution is not 50%-50%.

Two-sample t-test

It is used to see whether there are group differences in population means between two groups.

Question: Do men and women have different average SAT verbal scores?

Use box plots to compare distributions of one quantitative variable across multiple categories

scores %>%
  ggplot(aes(x = gender, y = SATV, fill = gender)) +
    geom_boxplot(outlier.size = 1) +
    stat_summary(fun = mean, geom = "point", shape = 5, size = 4) +
    scale_fill_brewer(palette = "Dark2") +
    labs(title = "Boxplots of SAT Verbal scores by gender",x="Gender", 
         y="SAT Verbal scores",fill="Gender")

Alternatively, use density plots to compare distributions

scores %>%
  ggplot(aes(x = SATV, fill = gender)) +
    geom_density(alpha = 0.5) +
    scale_fill_brewer(palette = "Dark2") +
    labs(title = "Density plots of SAT Verbal scores by gender", 
         x = "SAT Verbal scores", fill = "Gender")

Run a two sample t-test

m3 <- t.test(SATV ~ gender, var.eq = TRUE, data = scores)
m3
## 
##  Two Sample t-test
## 
## data:  SATV by gender
## t = 0.49792, df = 698, p-value = 0.6187
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.09357  21.99137
## sample estimates:
##   mean in group Men mean in group Women 
##            615.1134            610.6645
report(m3)
## Effect sizes were labelled following Cohen's (1988) recommendations.
## 
## The Two Sample t-test testing the difference of SATV by gender (mean in group Men = 615.11, mean in group Women = 610.66) suggests that the effect is negative, statistically not significant, and very small (difference = -4.45, 95% CI [-13.09, 21.99], t(698) = 0.50, p = 0.619; Cohen's d = 0.04, 95% CI [-0.11, 0.19])

t(df=698) = 0.49792, p-value = 61.87% > 5% (fail to reject the null) -> There is no statistically significant difference in average SAT verbal scores between men and women.

One-way ANOVA

It is used to determine whether there are group differences in numeric data between two or more groups.

Question: Do SAT verbal scores significantly differ by educational levels (1= HS, 2= some college degree, 3 = 2-year college degree, 4= 4-year college degree, 5= graduate work, 6=professional degree)?

Use boxplots to compare distributions of one quantitative variable across multiple categories; comparing distribution of SATV scores variable by education level

scores %>%
  ggplot(aes(x = education, y = SATV, fill = education)) +
    geom_boxplot(outlier.size = 1) +
    labs(title = "Boxplots of SAT Verbal scores by education level",
         x = "Levels of education", y = "SAT Verbal scores", fill = "Education")

Run one-way ANOVA

m4 <- aov(SATV ~ education, data = scores)
summary(m4)
##              Df  Sum Sq Mean Sq F value Pr(>F)
## education     5   80754   16151   1.269  0.275
## Residuals   694 8829392   12722
report(m4)
## For one-way between subjects designs, partial eta squared is equivalent to eta squared.
## Returning eta squared.
## The ANOVA (formula: SATV ~ education) suggests that:
## 
##   - The main effect of education is statistically not significant and very small (F(5, 694) = 1.27, p = 0.275; Eta2 = 9.06e-03, 90% CI [0.00, 0.02])
## 
## Effect sizes were labelled following Field's (2013) recommendations.

F value(df Model=5, df Residuals=694) = 1.269, p-value = 27.5% > 5% (fail to reject the null) -> There were no significant group differences in SAT verbal scores according to students’ educational levels.

We do not have to run the post hoc tests because the group differences are not significant.

Tukey’s HSD is the most popular post hoc test for comparing multiple pairings

Tukey_ANOVA_SATV_ed <- TukeyHSD(aov(SATV ~ education, data = scores), conf.level = .95) 
Tukey_ANOVA_SATV_ed
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = SATV ~ education, data = scores)
## 
## $education
##            diff        lwr       upr     p adj
## 1-0 -16.8421053  -81.11952  47.43531 0.9756332
## 2-0 -40.4860447 -105.17035  24.19826 0.4737629
## 3-0  -4.3742265  -51.28442  42.53597 0.9998182
## 4-0   0.4405034  -50.31023  51.19124 1.0000000
## 5-0   4.8883912  -45.70428  55.48106 0.9997835
## 2-1 -23.6439394  -91.98228  44.69440 0.9215425
## 3-1  12.4678788  -39.36489  64.30064 0.9833397
## 4-1  17.2826087  -38.05008  72.61529 0.9482786
## 5-1  21.7304965  -33.45725  76.91824 0.8708907
## 3-2  36.1118182  -16.22468  88.44832 0.3596722
## 4-2  40.9265481  -14.87828  96.73138 0.2906084
## 5-2  45.3744358  -10.28669 101.03556 0.1835826
## 4-3   4.8147299  -28.81095  38.44041 0.9985288
## 5-3   9.2626177  -24.12403  42.64926 0.9687311
## 5-4   4.4478878  -34.14924  43.04502 0.9994867

Optional: Some other commonly used post hoc tests are Bonferroni (more conservative) and Holm (less restrictive). Use pairwise.t.test for pairwise comparisons and indicate the adjustment method, e.g. p.adj = "holm".

pairwise.t.test(scores$SATV, scores$education, p.adj="holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  scores$SATV and scores$education 
## 
##   0    1    2    3    4   
## 1 1.00 -    -    -    -   
## 2 0.89 1.00 -    -    -   
## 3 1.00 1.00 0.64 -    -   
## 4 1.00 1.00 0.51 1.00 -   
## 5 1.00 1.00 0.30 1.00 1.00
## 
## P value adjustment method: holm

Two-way ANOVA

It is used to determine whether there are group differences in numeric data between groups characterized by two different categorical variables.

Question: Do SAT verbal scores significantly differ by educational levels and gender?

Use box plots to compare distribution of SATV scores variable by education level and gender (grouped by education level first, gender second)

scores %>%
  ggplot(aes(x = education, y = SATV, fill = gender)) +
    geom_boxplot(outlier.size = 1) +
    scale_fill_brewer(palette = "Dark2") +
    labs(title = "Boxplots of SAT Verbal scores by education level and gender",
         x = "Level of education", y = "SAT Verbal score", fill = "Gender")

Run two-way ANOVA

m5 <- aov(SATV ~ education + gender, data = scores)

summary(m5)
##              Df  Sum Sq Mean Sq F value Pr(>F)
## education     5   80754   16151   1.269  0.276
## gender        1    6740    6740   0.529  0.467
## Residuals   693 8822652   12731
report(m5)
## The ANOVA (formula: SATV ~ education + gender) suggests that:
## 
##   - The main effect of education is statistically not significant and very small (F(5, 693) = 1.27, p = 0.276; Eta2 (partial) = 9.07e-03, 90% CI [0.00, 0.02])
##   - The main effect of gender is statistically not significant and very small (F(1, 693) = 0.53, p = 0.467; Eta2 (partial) = 7.63e-04, 90% CI [0.00, 7.99e-03])
## 
## Effect sizes were labelled following Field's (2013) recommendations.
  • Education: F value(df Model=5, df Residuals=693) = 1.269, p-value = 27.6% > 5% (fail to reject the null)
  • Gender: F value(df Model=1, df Residuals=693) = 0.529, p-value = 46.7% > 5% (fail to reject the null)

There are no significant group differences in SAT verbal scores according to students’ educational levels or gender.

Chi-square test of independence

It is used to determine whether two categorical variables are dependent or independent.

Question: Is gender independent of education levels?

Make a bar-chart showing distribution of observations by gender and education

scores %>% 
  ggplot(aes(fill = gender, x = education)) + 
    geom_bar(position = "fill") +
    scale_fill_brewer(palette = "Dark2") +
    geom_text(aes(y = ..count../tapply(..count.., ..x.. ,sum)[..x..], 
                  label = scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..]) ),
                  stat = "count", position = position_stack(vjust = 0.5)) +  
    labs(title = "Distribution of observations by gender and education levels",
         x = "Education levels", y = "Number of observations", fill = "Gender") +
    coord_flip()

Run a Chi-square test of independence

m6 <- chisq.test(table(scores$gender, scores$education))
m6
## 
##  Pearson's Chi-squared test
## 
## data:  table(scores$gender, scores$education)
## X-squared = 16.085, df = 5, p-value = 0.006605

X-squared(df=5) = 16.085, p-value = 0.006% < 5% (reject the null) -> Gender and education levels are dependent.

Simple linear regression

It is used to identify a presence of a linear relationship between two quantitative variables.

Question: Is there a linear relationship between SAT Verbal and SAT Quantitative scores?

Make a scatter plot with a best-fit line

ggplot(scores, aes(x = SATQ, y = SATV)) + 
  geom_point() +
  geom_smooth(method = lm) +
  labs(title = "Scatteplot of SAT Verbal and SAT Quantitative scores",
       x = "SAT Quantitative score", y = "SAT Verbal score")
## `geom_smooth()` using formula 'y ~ x'

Run a simple linear regression

model1 <- lm(SATV ~ SATQ, data = scores)
summary(model1)
## 
## Call:
## lm(formula = SATV ~ SATQ, data = scores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -365.89  -50.57   -3.23   54.68  257.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 227.14322   17.77978   12.78   <2e-16 ***
## SATQ          0.63124    0.02863   22.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.71 on 685 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.4151, Adjusted R-squared:  0.4143 
## F-statistic: 486.2 on 1 and 685 DF,  p-value: < 2.2e-16

T-test for Beta1:

  • t-stat(SATQ) = 22.05, p-value = 0.00% < 5% (reject the null)
  • There is a linear relationship between SAT Quantitative scores and SAT Verbal scores.

ANOVA for Regression:

  • F-stat(1,685) = 486.2, p-value = 0.00% < 5% (reject the null)
  • The overall model is worthwhile.

Overall:

  • The estimated regression line equation: SATV = 227.14 + 0.63(SATQ). We would expect 0.63 points increase in SAT Verbal scores for every one point increase in SAT Quantitative score, assuming all the other variables are held constant.
  • 41.51% of the variability in the SAT verbal scores was explained by the SAT quantitative scores.

Get a formal report for the model

report(model1)
## We fitted a linear model (estimated using OLS) to predict SATV with SATQ (formula: SATV ~ SATQ). The model explains a statistically significant and substantial proportion of variance (R2 = 0.42, F(1, 685) = 486.19, p < .001, adj. R2 = 0.41). The model's intercept, corresponding to SATQ = 0, is at 227.14 (95% CI [192.23, 262.05], t(685) = 12.78, p < .001). Within this model:
## 
##   - The effect of SATQ is statistically significant and positive (beta = 0.63, 95% CI [0.58, 0.69], t(685) = 22.05, p < .001; Std. beta = 0.64, 95% CI [0.59, 0.70])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset.

Multiple linear regression

It is used to explain/predict one quantitative variable using multiple explanatory variables (one of which has to be quantitative).

Question: Can you explain/predict SAT Verbal using SAT Quantitative scores and ACT scores?

Make a correlation matrix

scores %>%
  select(SATV, SATQ, ACT) %>% 
  ggpairs(aes())

Run a multiple linear regression

model2 <- lm(SATV ~ SATQ + ACT, data = scores)
summary(model2)
## 
## Call:
## lm(formula = SATV ~ SATQ + ACT, data = scores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -376.97  -46.64    1.89   50.25  243.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 138.57027   20.25101   6.843 1.73e-11 ***
## SATQ          0.47130    0.03382  13.934  < 2e-16 ***
## ACT           6.52075    0.80963   8.054 3.56e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.93 on 684 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.4658, Adjusted R-squared:  0.4642 
## F-statistic: 298.2 on 2 and 684 DF,  p-value: < 2.2e-16

Get a formal report for the model

report(model2)
## We fitted a linear model (estimated using OLS) to predict SATV with SATQ and ACT (formula: SATV ~ SATQ + ACT). The model explains a statistically significant and substantial proportion of variance (R2 = 0.47, F(2, 684) = 298.19, p < .001, adj. R2 = 0.46). The model's intercept, corresponding to SATQ = 0 and ACT = 0, is at 138.57 (95% CI [98.81, 178.33], t(684) = 6.84, p < .001). Within this model:
## 
##   - The effect of SATQ is statistically significant and positive (beta = 0.47, 95% CI [0.40, 0.54], t(684) = 13.93, p < .001; Std. beta = 0.48, 95% CI [0.41, 0.55])
##   - The effect of ACT is statistically significant and positive (beta = 6.52, 95% CI [4.93, 8.11], t(684) = 8.05, p < .001; Std. beta = 0.28, 95% CI [0.21, 0.35])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset.

Multicollinearity

Multicollinearity may occur when an independent variable is highly correlated (r >= 0.7) with one or more of the other independent variables in the model.

What if we wanted to add more predictors to the model?

# transform education variable to be numeric
scores$education <- as.numeric(scores$education)

scores %>%
  ggpairs(aes(fill = gender, color = gender, alpha = 0.3), progress = F) +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

model3 <- lm(SATV ~ SATQ + ACT + age + education + gender, data = scores)
summary(model3)
## 
## Call:
## lm(formula = SATV ~ SATQ + ACT + age + education + gender, data = scores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -386.97  -43.23    2.44   51.91  240.16 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 136.25377   22.59310   6.031 2.68e-09 ***
## SATQ          0.47786    0.03445  13.873  < 2e-16 ***
## ACT           6.61477    0.82185   8.049 3.73e-15 ***
## age          -0.74316    0.40359  -1.841   0.0660 .  
## education     0.96598    2.70177   0.358   0.7208    
## genderWomen  16.56045    6.73492   2.459   0.0142 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.46 on 681 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.4741, Adjusted R-squared:  0.4703 
## F-statistic: 122.8 on 5 and 681 DF,  p-value: < 2.2e-16

T-tests for Beta1:

  • SATQ: p-value = 0.00% < 5% (reject the null) -> There is a statistically significant linear relationship between SATQ and SATV scores.
  • ACT: p-value = 0.00% < 5% (reject the null) -> There is a statistically significant linear relationship between ACT and SATV scores.
  • Age: p-value = 6.60% > 5% (fail to reject the null) -> A linear relationship between age and SATV scores is not statistically significant.
  • Education: p-value = 72.08% > 5% (fail to reject the null) -> A linear relationship between education and SATV scores is not statistically significant.
  • Gender: p-value = 1.42% < 5 % (reject the null) -> There is a statistically significant linear relationship between gender and SATV scores.

ANOVA for Regression: - F-stat(5,681) = 122.8, p-value = 0.00% < 5% (reject the null) - The overall model is worthwhile.

Overall: - The estimated regression line equation: SATV = 136.25 + 0.48(SATQ) + 6.61(ACT) - 0.74(Age) + 0.97(Education) + 16.56(GenderWomen). - 47.41% of the variability in the SAT verbal scores was explained by the model as a whole.

Get a formal report on your model

report(model3)
## We fitted a linear model (estimated using OLS) to predict SATV with SATQ, ACT, age, education and gender (formula: SATV ~ SATQ + ACT + age + education + gender). The model explains a statistically significant and substantial proportion of variance (R2 = 0.47, F(5, 681) = 122.80, p < .001, adj. R2 = 0.47). The model's intercept, corresponding to SATQ = 0, ACT = 0, age = 0, education = 0 and gender = Men, is at 136.25 (95% CI [91.89, 180.61], t(681) = 6.03, p < .001). Within this model:
## 
##   - The effect of SATQ is statistically significant and positive (beta = 0.48, 95% CI [0.41, 0.55], t(681) = 13.87, p < .001; Std. beta = 0.49, 95% CI [0.42, 0.56])
##   - The effect of ACT is statistically significant and positive (beta = 6.61, 95% CI [5.00, 8.23], t(681) = 8.05, p < .001; Std. beta = 0.28, 95% CI [0.21, 0.35])
##   - The effect of age is statistically non-significant and negative (beta = -0.74, 95% CI [-1.54, 0.05], t(681) = -1.84, p = 0.066; Std. beta = -0.06, 95% CI [-0.13, 4.13e-03])
##   - The effect of education is statistically non-significant and positive (beta = 0.97, 95% CI [-4.34, 6.27], t(681) = 0.36, p = 0.721; Std. beta = 0.01, 95% CI [-0.05, 0.08])
##   - The effect of gender [Women] is statistically significant and positive (beta = 16.56, 95% CI [3.34, 29.78], t(681) = 2.46, p = 0.014; Std. beta = 0.15, 95% CI [0.03, 0.26])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset.

Check Variance Inflation Factor (VIF) for the severity of multicollinearity

vif(model3)
##      SATQ       ACT       age education    gender 
##  1.600706  1.590366  1.479992  1.494064  1.051513

VIF values above 4 - moderate multicollinearity; above 10 - strong multicollinearity

F-test for nested models

Run F-test for comparing nested models

anova(model1, model2, model3)
## Analysis of Variance Table
## 
## Model 1: SATV ~ SATQ
## Model 2: SATV ~ SATQ + ACT
## Model 3: SATV ~ SATQ + ACT + age + education + gender
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    685 5150020                                  
## 2    684 4703926  1    446094 65.606 2.546e-15 ***
## 3    681 4630511  3     73415  3.599   0.01334 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-values below 5% indicate that the new model is an improvement over the previous one:

  • Model 2 is an improvement over Model 1 (coefficient for ACT is not a zero)
  • Model 3 is an improvement over Model 2 (coefficient for Gender is not a zero)

Assumptions of linear regression

  • Normality of residuals: residual errors are assumed to be normally distributed.
  • Homoscedasticity of residuals variance: residuals are assumed to have a constant variance
  • Independence of residual error terms: residuals should be independent of each other.
  • Linearity: the relationship between predictors and outcome variable is assumed to be linear.

Normality

Normal Q-Q plot

plot=plot(model3, 2)

Checking for normality using Shapiro-Wilk test:

shapiro.test(resid(model3))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model3)
## W = 0.98587, p-value = 3.383e-06

p-values below 5% indicate non-normality of residuals

Homoscedasticity

Plot positive values of residuals against fitted values

plot=plot(model3, 3)

Checking for heteroscedasticity using Breusch-Pagan Test

bptest(model3)
## 
##  studentized Breusch-Pagan test
## 
## data:  model3
## BP = 23.592, df = 5, p-value = 0.00026

p-values below 5% indicate heteroscedasticity of residuals

Independence and linearity

Residuals vs Fitted plot to check if the observed pattern is linear

plot=plot(model3, 1)

Checking for auto-correlation using Durbin-Watson Test.

durbinWatsonTest(model3)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.1095882      2.218698   0.002
##  Alternative hypothesis: rho != 0

For the independent data, DW statistic is close to 2. DW < 1 or DW > 3 may indicate the presence of autocorrelation.

p-values below 5% indicate autocorrelation of residuals.

Forest plots

Forest plot enables you to visualize regression coefficients with CIs (uses sjplot package)

plot_models(model1, model2, model3,
            rm.terms = c("genderWomen","age"), #removing two variables from the plot
            m.labels = c("Model 1","Model 2","Model 3"),
            legend.title = "Models",
            show.values = TRUE,
            vline.color = "black")

Exporting regression results

Use sjplot package to produce a table with regression results

tab_model(model1, model2, model3,
          p.style = "stars",
          dv.labels = c("Model1", "Model2", "Model3"),
          show.ci = FALSE)
  Model1 Model2 Model3
Predictors Estimates Estimates Estimates
(Intercept) 227.14 *** 138.57 *** 136.25 ***
SATQ 0.63 *** 0.47 *** 0.48 ***
ACT 6.52 *** 6.61 ***
age -0.74
education 0.97
gender [Women] 16.56 *
Observations 687 687 687
R2 / R2 adjusted 0.415 / 0.414 0.466 / 0.464 0.474 / 0.470
  • p<0.05   ** p<0.01   *** p<0.001 </td> </tr>
</table>

Use stargazer package to export regression results into your current working directory

stargazer(model1, model2, model3, 
          type = "html",
          out = "Regression results.doc",
          digits = 2)
Dependent variable:
SATV
(1) (2) (3)
SATQ 0.63*** 0.47*** 0.48***
(0.03) (0.03) (0.03)
ACT 6.52*** 6.61***
(0.81) (0.82)
age -0.74*
(0.40)
education 0.97
(2.70)
genderWomen 16.56**
(6.73)
Constant 227.14*** 138.57*** 136.25***
(17.78) (20.25) (22.59)
Observations 687 687 687
R2 0.42 0.47 0.47
Adjusted R2 0.41 0.46 0.47
Residual Std. Error 86.71 (df = 685) 82.93 (df = 684) 82.46 (df = 681)
F Statistic 486.19*** (df = 1; 685) 298.19*** (df = 2; 684) 122.80*** (df = 5; 681)
Note: </sup>p<0.1; </sup>p<0.05; </strong></em>p<0.01 </td> </tr> </table> </div> </div>

Optional

Simple logistic regression

It is used to explain/predict one binary variable using one quantitative independent variables.

Question: Can you predict one’s gender using one’s SAT Quantitative score?

Run a simple logistic regression

m7 <- glm(gender ~ SATQ, family = binomial, data = scores)
summary(m7)
## 
## Call:
## glm(formula = gender ~ SATQ, family = binomial, data = scores)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8857  -1.3047   0.8054   0.9798   1.1734  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.5427964  0.4686452   5.426 5.77e-08 ***
## SATQ        -0.0031667  0.0007407  -4.275 1.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 895.09  on 686  degrees of freedom
## Residual deviance: 875.67  on 685  degrees of freedom
##   (13 observations deleted due to missingness)
## AIC: 879.67
## 
## Number of Fisher Scoring iterations: 4
report(m7)
## We fitted a logistic model (estimated using ML) to predict gender with SATQ (formula: gender ~ SATQ). The model's explanatory power is weak (Tjur's R2 = 0.03). The model's intercept, corresponding to SATQ = 0, is at 2.54 (95% CI [1.64, 3.48], p < .001). Within this model:
## 
##   - The effect of SATQ is statistically significant and negative (beta = -3.17e-03, 95% CI [-4.64e-03, -1.74e-03], p < .001; Std. beta = -0.37, 95% CI [-0.54, -0.20])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using

Multiple logistic regression

It is used to explain/predict one binary variable using multiple explanatory variables (one of which has to be quantitative).

Question: Can you predict one’s gender using one’s SAT Quantitative and Verbal scores?

Run a multiple logistic regression

m8 <- glm(gender ~ SATQ + SATV, family = binomial, data = scores)
summary(m8)
## 
## Call:
## glm(formula = gender ~ SATQ + SATV, family = binomial, data = scores)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0323  -1.2901   0.7977   0.9594   1.4946  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.9120390  0.5066135   3.774 0.000161 ***
## SATQ        -0.0050271  0.0009834  -5.112 3.19e-07 ***
## SATV         0.0028975  0.0009699   2.987 0.002814 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 895.09  on 686  degrees of freedom
## Residual deviance: 866.51  on 684  degrees of freedom
##   (13 observations deleted due to missingness)
## AIC: 872.51
## 
## Number of Fisher Scoring iterations: 4
report(m8)
## We fitted a logistic model (estimated using ML) to predict gender with SATQ and SATV (formula: gender ~ SATQ + SATV). The model's explanatory power is weak (Tjur's R2 = 0.04). The model's intercept, corresponding to SATQ = 0 and SATV = 0, is at 1.91 (95% CI [0.93, 2.92], p < .001). Within this model:
## 
##   - The effect of SATQ is statistically significant and negative (beta = -5.03e-03, 95% CI [-6.99e-03, -3.13e-03], p < .001; Std. beta = -0.58, 95% CI [-0.81, -0.36])
##   - The effect of SATV is statistically significant and positive (beta = 2.90e-03, 95% CI [1.01e-03, 4.82e-03], p = 0.003; Std. beta = 0.33, 95% CI [0.11, 0.55])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
</div> </body> </html>