Statistical Analysis with R Workshop
Albina Gibadullina and Maria Laricheva
November 26, 2021
Installing and loading R packages
This workshop uses the following R packages:
- psych, a statistical analysis package
- rstatix, a statistical analysis package
- dplyr, a data manipulation package
- DT, a package for creating interactive tables
- ggplot2, a data visualization package
- GGally, a data visualization package (extension of ggplot2)
- table1, a package for creating summary tables
- gtsummary, a package for creating summary tables
- flextable, a package for tabular reporting
- modelsummary, a package creating summary tables and plots
- lmtest, a package for linear regression
- stargazer, a package for exporting regression tables
- car, a package for regression analysis
- RColorBrewer, a color palette package
- report, a package for reporting statistical results
- sjPlot, a package for regression analysis tables and plots
Check whether required packages are already installed, and install any that are missing
list.of.packages <- c("psych", "rstatix","dplyr","DT","ggplot2","GGally","table1","gtsummary","flextable","modelsummary","lmtest","stargazer","car","RColorBrewer","remotes","report","sjPlot")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
Load packages into your current session
suppressPackageStartupMessages({
library(psych)
library(rstatix)
library(dplyr)
library(DT)
library(ggplot2)
library(GGally)
library(table1)
library(gtsummary)
library(flextable)
library(modelsummary)
library(lmtest)
library(stargazer)
library(car)
library(RColorBrewer)
library(report)
library(sjPlot)})
Changing default theme for visualizations
Changing the default theme to theme_light
Suppress error messages
Uploading data
Creating a data frame scores
using an in-built dataset sat.act
from the psych
package
scores <- sat.act
Importing data from a .csv file (located in your working directory)
# scores <- read.csv("sat.act.csv") # does not work in this case as you don't have a .csv file
Describing data structure
See the first six rows of the data frame
head(scores)
## gender education age ACT SATV SATQ
## 29442 2 3 19 24 500 500
## 29457 2 3 23 35 600 500
## 29498 2 3 20 21 480 470
## 29503 1 4 27 26 550 520
## 29504 1 2 33 31 600 550
## 29518 1 5 26 28 640 640
Get dimension for your dataset
dim(scores) # Get number of rows (observations) and number of columns (variables)
## [1] 700 6
Check data structure
str(scores)
## 'data.frame': 700 obs. of 6 variables:
## $ gender : int 2 2 2 1 1 1 2 1 2 2 ...
## $ education: int 3 3 3 4 2 5 5 3 4 5 ...
## $ age : int 19 23 20 27 33 26 30 19 23 40 ...
## $ ACT : int 24 35 21 26 31 28 36 22 22 35 ...
## $ SATV : int 500 600 480 550 600 640 610 520 400 730 ...
## $ SATQ : int 500 500 470 520 550 640 500 560 600 800 ...
Browse dataset interactively
datatable(scores)
Data manipulation
Changing format of variables
Check current data format of the gender
variable
class(scores$gender)
## [1] "integer"
Change the format of gender
from integer (quantitative) to factor (categorical)
scores$gender <- as.factor(scores$gender)
Check whether gender
has factor format
is.factor(scores$gender)
## [1] TRUE
Exercise 1
- Using
class
command, check the current format ofeducation
- Using
as.factor
command, changeeducation
to factor. - Using
is.factor
command, check ifeducation
is defined as factor.
Check current data format of the education
variable
class(scores$education)
## [1] "integer"
Change the format of education
from integer (quantitative) to factor (categorical)
scores$education <- as.factor(scores$education)
Check whether education
has factor format
is.factor(scores$education)
## [1] TRUE
Transforming values of categorical variables
Gender
Gender
is now factor but it is still coded as “1” (men) and “2” (women) - it would be helpful for later analysis to change “1” to “men” and “2” to “women”
Replacing values
# Step 1: Change the data format to character
scores$gender <- as.character(scores$gender)
# Step 2: Replace "1" with "men" and "2" with "women"
scores$gender[scores$gender=="1"] <- "Men"
scores$gender[scores$gender=="2"] <- "Women"
# Step 3: Change the data format back to factor
scores$gender <- as.factor(scores$gender)
# Step 4: Check levels for `gender`
levels(scores$gender)
## [1] "Men" "Women"
Descriptive statistics
Categorical data
Get frequency for the gender
variable
table(scores$gender)
##
## Men Women
## 247 453
Get cross-tabulation for gender
and education
table(scores$gender, scores$education)
##
## 0 1 2 3 4 5
## Men 27 20 23 80 51 46
## Women 30 25 21 195 87 95
Quantitative data
Find summary statistics for each quantitative variable
get_summary_stats(scores)
## # A tibble: 4 x 13
## variable n min max median q1 q3 iqr mad mean sd se
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ACT 700 3 36 29 25 32 7 4.45 28.5 4.82 0.182
## 2 age 700 13 65 22 19 29 10 5.93 25.6 9.50 0.359
## 3 SATQ 687 200 800 620 530 700 170 119. 610. 116. 4.41
## 4 SATV 700 200 800 620 550 700 150 119. 612. 113. 4.27
## # ... with 1 more variable: ci <dbl>
Find summary statistics for ACT
scores, grouped by gender
scores %>%
group_by(gender) %>%
get_summary_stats(ACT)
## # A tibble: 2 x 14
## gender variable n min max median q1 q3 iqr mad mean sd
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Men ACT 247 3 36 30 25 32.5 7.5 4.45 28.8 5.06
## 2 Women ACT 453 15 36 29 25 32 7 4.45 28.4 4.69
## # ... with 2 more variables: se <dbl>, ci <dbl>
Generated a summary statistics table, grouped by gender
table1::table1(~ education + age + ACT + SATV + SATQ | gender, data = scores)
Men (N=247) | Women (N=453) | Overall (N=700) | |
---|---|---|---|
education | |||
0 | 27 (10.9%) | 30 (6.6%) | 57 (8.1%) |
1 | 20 (8.1%) | 25 (5.5%) | 45 (6.4%) |
2 | 23 (9.3%) | 21 (4.6%) | 44 (6.3%) |
3 | 80 (32.4%) | 195 (43.0%) | 275 (39.3%) |
4 | 51 (20.6%) | 87 (19.2%) | 138 (19.7%) |
5 | 46 (18.6%) | 95 (21.0%) | 141 (20.1%) |
age | |||
Mean (SD) | 25.9 (9.74) | 25.4 (9.37) | 25.6 (9.50) |
Median [Min, Max] | 22.0 [14.0, 58.0] | 22.0 [13.0, 65.0] | 22.0 [13.0, 65.0] |
ACT | |||
Mean (SD) | 28.8 (5.06) | 28.4 (4.69) | 28.5 (4.82) |
Median [Min, Max] | 30.0 [3.00, 36.0] | 29.0 [15.0, 36.0] | 29.0 [3.00, 36.0] |
SATV | |||
Mean (SD) | 615 (114) | 611 (112) | 612 (113) |
Median [Min, Max] | 630 [200, 800] | 620 [200, 800] | 620 [200, 800] |
SATQ | |||
Mean (SD) | 636 (116) | 596 (113) | 610 (116) |
Median [Min, Max] | 660 [300, 800] | 600 [200, 800] | 620 [200, 800] |
Missing | 2 (0.8%) | 11 (2.4%) | 13 (1.9%) |
Alternative way of making a summary statistics table (using gtsummary
package)
scores %>%
tbl_summary(by = gender)
Characteristic | Men, N = 2471 | Women, N = 4531 |
---|---|---|
education | ||
0 | 27 (11%) | 30 (6.6%) |
1 | 20 (8.1%) | 25 (5.5%) |
2 | 23 (9.3%) | 21 (4.6%) |
3 | 80 (32%) | 195 (43%) |
4 | 51 (21%) | 87 (19%) |
5 | 46 (19%) | 95 (21%) |
age | 22 (19, 30) | 22 (19, 28) |
ACT | 30 (25, 32) | 29 (25, 32) |
SATV | 630 (540, 700) | 620 (550, 700) |
SATQ | 660 (570, 720) | 600 (500, 683) |
Unknown | 2 | 11 |
1 n (%); Median (IQR) |
Show mean, SD, median + range for numeric variables, add p-values
summary_table <- scores %>%
tbl_summary(by = gender,
type = all_continuous() ~ "continuous2", #to add multiple summary lines
statistic = list(all_continuous() ~ c("{mean} ({sd})",
"{median} ({min}, {max})"),
all_categorical() ~ "{n} ({p}%)"),
digits = all_continuous() ~ 1) %>%
add_p()
summary_table
Characteristic | Men, N = 247 | Women, N = 453 | p-value1 |
---|---|---|---|
education | 0.007 | ||
0 | 27 (11%) | 30 (6.6%) | |
1 | 20 (8.1%) | 25 (5.5%) | |
2 | 23 (9.3%) | 21 (4.6%) | |
3 | 80 (32%) | 195 (43%) | |
4 | 51 (21%) | 87 (19%) | |
5 | 46 (19%) | 95 (21%) | |
age | >0.9 | ||
Mean (SD) | 25.9 (9.7) | 25.4 (9.4) | |
Median (Range) | 22.0 (14.0, 58.0) | 22.0 (13.0, 65.0) | |
ACT | 0.13 | ||
Mean (SD) | 28.8 (5.1) | 28.4 (4.7) | |
Median (Range) | 30.0 (3.0, 36.0) | 29.0 (15.0, 36.0) | |
SATV | 0.5 | ||
Mean (SD) | 615.1 (114.2) | 610.7 (112.3) | |
Median (Range) | 630.0 (200.0, 800.0) | 620.0 (200.0, 800.0) | |
SATQ | <0.001 | ||
Mean (SD) | 635.9 (116.0) | 596.0 (113.1) | |
Median (Range) | 660.0 (300.0, 800.0) | 600.0 (200.0, 800.0) | |
Unknown | 2 | 11 | |
1 Pearson's Chi-squared test; Wilcoxon rank sum test |
Export summary table as a word document
summary_table %>%
as_flex_table() %>%
flextable::save_as_docx(path = "Summary table.docx")
See https://cran.r-project.org/web/packages/gtsummary/vignettes/tbl_summary.html for more information on how to further modify the table
Inferential Statistics
You should select statistical tests based on your hypothesis and the data type of your variable(s)
- One sample t-test
- Chi-square goodness of fit test
- Two sample t-test
- One-way ANOVA
- Two-way ANOVA
- Chi-square test of independence
- Simple linear regression
- Multiple linear regression
- Simple logistic regression
- Multiple logistic regression
One sample t-test
It is used to see whether the hypothesized value of the population mean matches actual value.
Question: Is the average ACT score for all participants 27?
Visualize the distribution of ACT scores using a histogram with a normal distribution curve and a line for mean
scores %>%
ggplot(aes(x = ACT)) +
geom_histogram(aes(y = stat(density)), binwidth = 2,
color = "black", fill = "lightblue") +
stat_function(fun = dnorm, args = list(mean = mean(scores$ACT),
sd = sd(scores$ACT)), size = 1) +
geom_vline(aes(xintercept = mean(ACT)), color = "red") +
labs(title = "Histogram of ACT scores with a Normal Curve",
x = "ACT score", y = "Percent of observations")
Run a one sample t-test
m1 <- t.test(scores$ACT, mu = 27,
alternative = c("two.sided"),
conf.level = 0.95)
m1
##
## One Sample t-test
##
## data: scores$ACT
## t = 8.4862, df = 699, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 27
## 95 percent confidence interval:
## 28.18920 28.90509
## sample estimates:
## mean of x
## 28.54714
report(m1)
## Effect sizes were labelled following Cohen's (1988) recommendations.
##
## The One Sample t-test testing the difference between scores$ACT (mean = 28.55) and mu = 27 suggests that the effect is positive, statistically significant, and small (difference = 1.55, 95% CI [28.19, 28.91], t(699) = 8.49, p < .001; Cohen's d = 0.32, 95% CI [0.24, 0.40])
t(df=699) = 8.4862, p-value = 0.00% < 5% (reject the null) -> The average ACT score is not 27.
Chi-square goodness of fit
It is used to see whether the actual distribution (from a sample) of a categorical variable matches the expected distribution.
Question: Is gender distribution 50%-50% in this data set?
Make a relative bar-chart showing counts of observations by gender
scores %>%
ggplot(aes(fill = gender, x = gender)) +
geom_bar(aes(y = stat(count)/sum(count))) +
geom_text(aes(label = round(stat(count)/sum(count),2), y = stat(count)/sum(count)),
stat = "count", position = position_stack(vjust = 0.5), size = 5) +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Share of observations by gender",
x = "Gender", y = "Percent of observations", fill = "Gender")
Run a Chi-square goodness of fit test
m2 <- chisq.test(table(scores$gender), p = c(0.5,0.5))
m2
##
## Chi-squared test for given probabilities
##
## data: table(scores$gender)
## X-squared = 60.623, df = 1, p-value = 6.913e-15
X-squared(df=1) = 60.623, p-value = 0.00% < 5% (reject the null) -> The gender distribution is not 50%-50%.
Two-sample t-test
It is used to see whether there are group differences in population means between two groups.
Question: Do men and women have different average SAT verbal scores?
Use box plots to compare distributions of one quantitative variable across multiple categories
scores %>%
ggplot(aes(x = gender, y = SATV, fill = gender)) +
geom_boxplot(outlier.size = 1) +
stat_summary(fun = mean, geom = "point", shape = 5, size = 4) +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Boxplots of SAT Verbal scores by gender",x="Gender",
y="SAT Verbal scores",fill="Gender")
Alternatively, use density plots to compare distributions
scores %>%
ggplot(aes(x = SATV, fill = gender)) +
geom_density(alpha = 0.5) +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Density plots of SAT Verbal scores by gender",
x = "SAT Verbal scores", fill = "Gender")
Run a two sample t-test
m3 <- t.test(SATV ~ gender, var.eq = TRUE, data = scores)
m3
##
## Two Sample t-test
##
## data: SATV by gender
## t = 0.49792, df = 698, p-value = 0.6187
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.09357 21.99137
## sample estimates:
## mean in group Men mean in group Women
## 615.1134 610.6645
report(m3)
## Effect sizes were labelled following Cohen's (1988) recommendations.
##
## The Two Sample t-test testing the difference of SATV by gender (mean in group Men = 615.11, mean in group Women = 610.66) suggests that the effect is negative, statistically not significant, and very small (difference = -4.45, 95% CI [-13.09, 21.99], t(698) = 0.50, p = 0.619; Cohen's d = 0.04, 95% CI [-0.11, 0.19])
t(df=698) = 0.49792, p-value = 61.87% > 5% (fail to reject the null) -> There is no statistically significant difference in average SAT verbal scores between men and women.
One-way ANOVA
It is used to determine whether there are group differences in numeric data between two or more groups.
Question: Do SAT verbal scores significantly differ by educational levels (1= HS, 2= some college degree, 3 = 2-year college degree, 4= 4-year college degree, 5= graduate work, 6=professional degree)?
Use boxplots to compare distributions of one quantitative variable across multiple categories; comparing distribution of SATV
scores variable by education level
scores %>%
ggplot(aes(x = education, y = SATV, fill = education)) +
geom_boxplot(outlier.size = 1) +
labs(title = "Boxplots of SAT Verbal scores by education level",
x = "Levels of education", y = "SAT Verbal scores", fill = "Education")
Run one-way ANOVA
m4 <- aov(SATV ~ education, data = scores)
summary(m4)
## Df Sum Sq Mean Sq F value Pr(>F)
## education 5 80754 16151 1.269 0.275
## Residuals 694 8829392 12722
report(m4)
## For one-way between subjects designs, partial eta squared is equivalent to eta squared.
## Returning eta squared.
## The ANOVA (formula: SATV ~ education) suggests that:
##
## - The main effect of education is statistically not significant and very small (F(5, 694) = 1.27, p = 0.275; Eta2 = 9.06e-03, 90% CI [0.00, 0.02])
##
## Effect sizes were labelled following Field's (2013) recommendations.
F value(df Model=5, df Residuals=694) = 1.269, p-value = 27.5% > 5% (fail to reject the null) -> There were no significant group differences in SAT verbal scores according to students’ educational levels.
We do not have to run the post hoc tests because the group differences are not significant.
Tukey’s HSD is the most popular post hoc test for comparing multiple pairings
Tukey_ANOVA_SATV_ed <- TukeyHSD(aov(SATV ~ education, data = scores), conf.level = .95)
Tukey_ANOVA_SATV_ed
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = SATV ~ education, data = scores)
##
## $education
## diff lwr upr p adj
## 1-0 -16.8421053 -81.11952 47.43531 0.9756332
## 2-0 -40.4860447 -105.17035 24.19826 0.4737629
## 3-0 -4.3742265 -51.28442 42.53597 0.9998182
## 4-0 0.4405034 -50.31023 51.19124 1.0000000
## 5-0 4.8883912 -45.70428 55.48106 0.9997835
## 2-1 -23.6439394 -91.98228 44.69440 0.9215425
## 3-1 12.4678788 -39.36489 64.30064 0.9833397
## 4-1 17.2826087 -38.05008 72.61529 0.9482786
## 5-1 21.7304965 -33.45725 76.91824 0.8708907
## 3-2 36.1118182 -16.22468 88.44832 0.3596722
## 4-2 40.9265481 -14.87828 96.73138 0.2906084
## 5-2 45.3744358 -10.28669 101.03556 0.1835826
## 4-3 4.8147299 -28.81095 38.44041 0.9985288
## 5-3 9.2626177 -24.12403 42.64926 0.9687311
## 5-4 4.4478878 -34.14924 43.04502 0.9994867
Optional: Some other commonly used post hoc tests are Bonferroni (more conservative) and Holm (less restrictive). Use pairwise.t.test
for pairwise comparisons and indicate the adjustment method, e.g. p.adj = "holm"
.
pairwise.t.test(scores$SATV, scores$education, p.adj="holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: scores$SATV and scores$education
##
## 0 1 2 3 4
## 1 1.00 - - - -
## 2 0.89 1.00 - - -
## 3 1.00 1.00 0.64 - -
## 4 1.00 1.00 0.51 1.00 -
## 5 1.00 1.00 0.30 1.00 1.00
##
## P value adjustment method: holm
Two-way ANOVA
It is used to determine whether there are group differences in numeric data between groups characterized by two different categorical variables.
Question: Do SAT verbal scores significantly differ by educational levels and gender?
Use box plots to compare distribution of SATV
scores variable by education level and gender (grouped by education level first, gender second)
scores %>%
ggplot(aes(x = education, y = SATV, fill = gender)) +
geom_boxplot(outlier.size = 1) +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Boxplots of SAT Verbal scores by education level and gender",
x = "Level of education", y = "SAT Verbal score", fill = "Gender")
Run two-way ANOVA
m5 <- aov(SATV ~ education + gender, data = scores)
summary(m5)
## Df Sum Sq Mean Sq F value Pr(>F)
## education 5 80754 16151 1.269 0.276
## gender 1 6740 6740 0.529 0.467
## Residuals 693 8822652 12731
report(m5)
## The ANOVA (formula: SATV ~ education + gender) suggests that:
##
## - The main effect of education is statistically not significant and very small (F(5, 693) = 1.27, p = 0.276; Eta2 (partial) = 9.07e-03, 90% CI [0.00, 0.02])
## - The main effect of gender is statistically not significant and very small (F(1, 693) = 0.53, p = 0.467; Eta2 (partial) = 7.63e-04, 90% CI [0.00, 7.99e-03])
##
## Effect sizes were labelled following Field's (2013) recommendations.
- Education: F value(df Model=5, df Residuals=693) = 1.269, p-value = 27.6% > 5% (fail to reject the null)
- Gender: F value(df Model=1, df Residuals=693) = 0.529, p-value = 46.7% > 5% (fail to reject the null)
There are no significant group differences in SAT verbal scores according to students’ educational levels or gender.
Chi-square test of independence
It is used to determine whether two categorical variables are dependent or independent.
Question: Is gender independent of education levels?
Make a bar-chart showing distribution of observations by gender and education
scores %>%
ggplot(aes(fill = gender, x = education)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Dark2") +
geom_text(aes(y = ..count../tapply(..count.., ..x.. ,sum)[..x..],
label = scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..]) ),
stat = "count", position = position_stack(vjust = 0.5)) +
labs(title = "Distribution of observations by gender and education levels",
x = "Education levels", y = "Number of observations", fill = "Gender") +
coord_flip()
Run a Chi-square test of independence
m6 <- chisq.test(table(scores$gender, scores$education))
m6
##
## Pearson's Chi-squared test
##
## data: table(scores$gender, scores$education)
## X-squared = 16.085, df = 5, p-value = 0.006605
X-squared(df=5) = 16.085, p-value = 0.006% < 5% (reject the null) -> Gender and education levels are dependent.
Simple linear regression
It is used to identify a presence of a linear relationship between two quantitative variables.
Question: Is there a linear relationship between SAT Verbal and SAT Quantitative scores?
Make a scatter plot with a best-fit line
ggplot(scores, aes(x = SATQ, y = SATV)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = "Scatteplot of SAT Verbal and SAT Quantitative scores",
x = "SAT Quantitative score", y = "SAT Verbal score")
## `geom_smooth()` using formula 'y ~ x'
Run a simple linear regression
model1 <- lm(SATV ~ SATQ, data = scores)
summary(model1)
##
## Call:
## lm(formula = SATV ~ SATQ, data = scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -365.89 -50.57 -3.23 54.68 257.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 227.14322 17.77978 12.78 <2e-16 ***
## SATQ 0.63124 0.02863 22.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.71 on 685 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4151, Adjusted R-squared: 0.4143
## F-statistic: 486.2 on 1 and 685 DF, p-value: < 2.2e-16
T-test for Beta1:
- t-stat(SATQ) = 22.05, p-value = 0.00% < 5% (reject the null)
- There is a linear relationship between SAT Quantitative scores and SAT Verbal scores.
ANOVA for Regression:
- F-stat(1,685) = 486.2, p-value = 0.00% < 5% (reject the null)
- The overall model is worthwhile.
Overall:
- The estimated regression line equation: SATV = 227.14 + 0.63(SATQ). We would expect 0.63 points increase in SAT Verbal scores for every one point increase in SAT Quantitative score, assuming all the other variables are held constant.
- 41.51% of the variability in the SAT verbal scores was explained by the SAT quantitative scores.
Get a formal report for the model
report(model1)
## We fitted a linear model (estimated using OLS) to predict SATV with SATQ (formula: SATV ~ SATQ). The model explains a statistically significant and substantial proportion of variance (R2 = 0.42, F(1, 685) = 486.19, p < .001, adj. R2 = 0.41). The model's intercept, corresponding to SATQ = 0, is at 227.14 (95% CI [192.23, 262.05], t(685) = 12.78, p < .001). Within this model:
##
## - The effect of SATQ is statistically significant and positive (beta = 0.63, 95% CI [0.58, 0.69], t(685) = 22.05, p < .001; Std. beta = 0.64, 95% CI [0.59, 0.70])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset.
Multiple linear regression
It is used to explain/predict one quantitative variable using multiple explanatory variables (one of which has to be quantitative).
Question: Can you explain/predict SAT Verbal using SAT Quantitative scores and ACT scores?
Make a correlation matrix
scores %>%
select(SATV, SATQ, ACT) %>%
ggpairs(aes())
Run a multiple linear regression
model2 <- lm(SATV ~ SATQ + ACT, data = scores)
summary(model2)
##
## Call:
## lm(formula = SATV ~ SATQ + ACT, data = scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -376.97 -46.64 1.89 50.25 243.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 138.57027 20.25101 6.843 1.73e-11 ***
## SATQ 0.47130 0.03382 13.934 < 2e-16 ***
## ACT 6.52075 0.80963 8.054 3.56e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.93 on 684 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4658, Adjusted R-squared: 0.4642
## F-statistic: 298.2 on 2 and 684 DF, p-value: < 2.2e-16
Get a formal report for the model
report(model2)
## We fitted a linear model (estimated using OLS) to predict SATV with SATQ and ACT (formula: SATV ~ SATQ + ACT). The model explains a statistically significant and substantial proportion of variance (R2 = 0.47, F(2, 684) = 298.19, p < .001, adj. R2 = 0.46). The model's intercept, corresponding to SATQ = 0 and ACT = 0, is at 138.57 (95% CI [98.81, 178.33], t(684) = 6.84, p < .001). Within this model:
##
## - The effect of SATQ is statistically significant and positive (beta = 0.47, 95% CI [0.40, 0.54], t(684) = 13.93, p < .001; Std. beta = 0.48, 95% CI [0.41, 0.55])
## - The effect of ACT is statistically significant and positive (beta = 6.52, 95% CI [4.93, 8.11], t(684) = 8.05, p < .001; Std. beta = 0.28, 95% CI [0.21, 0.35])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset.
Multicollinearity
Multicollinearity may occur when an independent variable is highly correlated (r >= 0.7) with one or more of the other independent variables in the model.
What if we wanted to add more predictors to the model?
# transform education variable to be numeric
scores$education <- as.numeric(scores$education)
scores %>%
ggpairs(aes(fill = gender, color = gender, alpha = 0.3), progress = F) +
scale_fill_brewer(palette = "Dark2") +
scale_color_brewer(palette = "Dark2")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
model3 <- lm(SATV ~ SATQ + ACT + age + education + gender, data = scores)
summary(model3)
##
## Call:
## lm(formula = SATV ~ SATQ + ACT + age + education + gender, data = scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -386.97 -43.23 2.44 51.91 240.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136.25377 22.59310 6.031 2.68e-09 ***
## SATQ 0.47786 0.03445 13.873 < 2e-16 ***
## ACT 6.61477 0.82185 8.049 3.73e-15 ***
## age -0.74316 0.40359 -1.841 0.0660 .
## education 0.96598 2.70177 0.358 0.7208
## genderWomen 16.56045 6.73492 2.459 0.0142 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.46 on 681 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4741, Adjusted R-squared: 0.4703
## F-statistic: 122.8 on 5 and 681 DF, p-value: < 2.2e-16
T-tests for Beta1:
- SATQ: p-value = 0.00% < 5% (reject the null) -> There is a statistically significant linear relationship between SATQ and SATV scores.
- ACT: p-value = 0.00% < 5% (reject the null) -> There is a statistically significant linear relationship between ACT and SATV scores.
- Age: p-value = 6.60% > 5% (fail to reject the null) -> A linear relationship between age and SATV scores is not statistically significant.
- Education: p-value = 72.08% > 5% (fail to reject the null) -> A linear relationship between education and SATV scores is not statistically significant.
- Gender: p-value = 1.42% < 5 % (reject the null) -> There is a statistically significant linear relationship between gender and SATV scores.
ANOVA for Regression: - F-stat(5,681) = 122.8, p-value = 0.00% < 5% (reject the null) - The overall model is worthwhile.
Overall: - The estimated regression line equation: SATV = 136.25 + 0.48(SATQ) + 6.61(ACT) - 0.74(Age) + 0.97(Education) + 16.56(GenderWomen). - 47.41% of the variability in the SAT verbal scores was explained by the model as a whole.
Get a formal report on your model
report(model3)
## We fitted a linear model (estimated using OLS) to predict SATV with SATQ, ACT, age, education and gender (formula: SATV ~ SATQ + ACT + age + education + gender). The model explains a statistically significant and substantial proportion of variance (R2 = 0.47, F(5, 681) = 122.80, p < .001, adj. R2 = 0.47). The model's intercept, corresponding to SATQ = 0, ACT = 0, age = 0, education = 0 and gender = Men, is at 136.25 (95% CI [91.89, 180.61], t(681) = 6.03, p < .001). Within this model:
##
## - The effect of SATQ is statistically significant and positive (beta = 0.48, 95% CI [0.41, 0.55], t(681) = 13.87, p < .001; Std. beta = 0.49, 95% CI [0.42, 0.56])
## - The effect of ACT is statistically significant and positive (beta = 6.61, 95% CI [5.00, 8.23], t(681) = 8.05, p < .001; Std. beta = 0.28, 95% CI [0.21, 0.35])
## - The effect of age is statistically non-significant and negative (beta = -0.74, 95% CI [-1.54, 0.05], t(681) = -1.84, p = 0.066; Std. beta = -0.06, 95% CI [-0.13, 4.13e-03])
## - The effect of education is statistically non-significant and positive (beta = 0.97, 95% CI [-4.34, 6.27], t(681) = 0.36, p = 0.721; Std. beta = 0.01, 95% CI [-0.05, 0.08])
## - The effect of gender [Women] is statistically significant and positive (beta = 16.56, 95% CI [3.34, 29.78], t(681) = 2.46, p = 0.014; Std. beta = 0.15, 95% CI [0.03, 0.26])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset.
Check Variance Inflation Factor (VIF) for the severity of multicollinearity
vif(model3)
## SATQ ACT age education gender
## 1.600706 1.590366 1.479992 1.494064 1.051513
VIF values above 4 - moderate multicollinearity; above 10 - strong multicollinearity
F-test for nested models
Run F-test for comparing nested models
anova(model1, model2, model3)
## Analysis of Variance Table
##
## Model 1: SATV ~ SATQ
## Model 2: SATV ~ SATQ + ACT
## Model 3: SATV ~ SATQ + ACT + age + education + gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 685 5150020
## 2 684 4703926 1 446094 65.606 2.546e-15 ***
## 3 681 4630511 3 73415 3.599 0.01334 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-values below 5% indicate that the new model is an improvement over the previous one:
- Model 2 is an improvement over Model 1 (coefficient for ACT is not a zero)
- Model 3 is an improvement over Model 2 (coefficient for Gender is not a zero)
Assumptions of linear regression
- Normality of residuals: residual errors are assumed to be normally distributed.
- Homoscedasticity of residuals variance: residuals are assumed to have a constant variance
- Independence of residual error terms: residuals should be independent of each other.
- Linearity: the relationship between predictors and outcome variable is assumed to be linear.
Normality
Normal Q-Q plot
plot=plot(model3, 2)
Checking for normality using Shapiro-Wilk test:
shapiro.test(resid(model3))
##
## Shapiro-Wilk normality test
##
## data: resid(model3)
## W = 0.98587, p-value = 3.383e-06
p-values below 5% indicate non-normality of residuals
Homoscedasticity
Plot positive values of residuals against fitted values
plot=plot(model3, 3)
Checking for heteroscedasticity using Breusch-Pagan Test
bptest(model3)
##
## studentized Breusch-Pagan test
##
## data: model3
## BP = 23.592, df = 5, p-value = 0.00026
p-values below 5% indicate heteroscedasticity of residuals
Independence and linearity
Residuals vs Fitted plot to check if the observed pattern is linear
plot=plot(model3, 1)
Checking for auto-correlation using Durbin-Watson Test.
durbinWatsonTest(model3)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.1095882 2.218698 0.002
## Alternative hypothesis: rho != 0
For the independent data, DW statistic is close to 2. DW < 1 or DW > 3 may indicate the presence of autocorrelation.
p-values below 5% indicate autocorrelation of residuals.
Forest plots
Forest plot enables you to visualize regression coefficients with CIs (uses sjplot
package)
plot_models(model1, model2, model3,
rm.terms = c("genderWomen","age"), #removing two variables from the plot
m.labels = c("Model 1","Model 2","Model 3"),
legend.title = "Models",
show.values = TRUE,
vline.color = "black")
Exporting regression results
Use sjplot
package to produce a table with regression results
tab_model(model1, model2, model3,
p.style = "stars",
dv.labels = c("Model1", "Model2", "Model3"),
show.ci = FALSE)
Model1 | Model2 | Model3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Predictors | Estimates | Estimates | Estimates | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
(Intercept) | 227.14 *** | 138.57 *** | 136.25 *** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SATQ | 0.63 *** | 0.47 *** | 0.48 *** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ACT | 6.52 *** | 6.61 *** | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
age | -0.74 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
education | 0.97 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gender [Women] | 16.56 * | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Observations | 687 | 687 | 687 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
R2 / R2 adjusted | 0.415 / 0.414 | 0.466 / 0.464 | 0.474 / 0.470 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Use
|