Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple, with a straightforward workflow.
Let’s go through each step with an example using seals. Are older seals larger?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
seals <- read.csv("Labs/data/17e8ShrinkingSeals Trites 1996.csv")
seal_base <- ggplot(seals, aes(x=age.days, y=length.cm)) +
geom_point() +
theme_grey(base_size=14)
seal_base
Neat data set, no?
Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!
OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?
Nothing could be simpler. R has a function called lm(), which stands for linear model. It uses the same y ~ x formula syntax as base plot, and with it we create a fit model object.
seal_lm <- lm(length.cm ~ age.days, data=seals)
That’s it.
Now, if we want to just peek at the fit before going forward, we can use coef(), which is pretty standard across all R fit model functions.
coef(seal_lm)
## (Intercept) age.days
## 1.157668e+02 2.370559e-03
But that’s getting ahead of ourselves…
R also provides a one-stop shop for evaluating a model’s assumptions. Fit model objects can typically be checked and plotted using the performance package.
library(performance)
check_model(seal_lm)
Whoah - that’s a lot!
OK, breathe.
There’s a lot here - first, a check of whether the model produces simulated fits that look like the observed data. You can also get this plot on its own with
check_predictions(seal_lm)
Next, we can look at the fitted versus residual plot.
check_model(seal_lm, check = "ncv") |> plot()
Note the curvature of the line? Troubling, or just a consequence of high n? It’s worth thinking about. Next, we can look at information about the residuals. First, we check the homogeneity of variance.
check_heteroscedasticity(seal_lm) |> plot()
That curvilinearity at the end is again a hair concerning. But, let’s dig deeper and look at the residuals. Are they normal?
check_normality(seal_lm) |> plot()
A QQ plot of the residuals
check_normality(seal_lm) |> plot(type = "qq")
## For confidence bands, please install `qqplotr`.
Not bad!
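Note, these check_* functions also run a formal statistical test under the hood. If you call one without piping to plot(), you get a short text verdict instead - same calls as above, just without the plot:
check_heteroscedasticity(seal_lm)
check_normality(seal_lm)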
Next, outliers.
The Cook’s D values
check_outliers(seal_lm) |> plot(type = "bar")
All quite small!
And last, leverage
check_outliers(seal_lm) |> plot()
I also like to look at the histogram of residuals. There is a function called residuals() that will work on nearly any fit model object in R. So we can just…
hist(residuals(seal_lm))
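If you want a formal test to pair with the histogram, base R’s shapiro.test() works on the residuals - though note it only accepts between 3 and 5000 values, so with a data set this large we’d test a random subsample. A sketch:
# Shapiro-Wilk test on a random subsample of residuals
# (shapiro.test() only takes up to 5000 values)
shapiro.test(sample(residuals(seal_lm), 5000))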
Note, there’s also a library called modelr which can add the appropriate residuals to your data frame using a dplyr-like syntax.
library(modelr)
##
## Attaching package: 'modelr'
## The following objects are masked from 'package:performance':
##
## mae, mse, rmse
seals <- seals %>%
add_residuals(seal_lm)
head(seals)
## age.days length.cm resid
## 1 5337 131 2.5815746
## 2 2081 123 2.3001156
## 3 2085 122 1.2906334
## 4 4299 136 10.0422152
## 5 2861 122 -0.5489206
## 6 5052 131 3.2571840
Check out that new column. You can now plot your predictor versus residuals, which should show no trend - and you can use a spline with stat_smooth to check for one.
qplot(age.days, resid, data=seals) +
stat_smooth()
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
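As the warning notes, qplot() is deprecated in current ggplot2. The same plot in standard ggplot() syntax would be:
ggplot(seals, aes(x = age.days, y = resid)) +
  geom_point() +
  stat_smooth()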
And you can also add fitted values and look at fitted versus residual.
seals <- seals %>%
add_predictions(seal_lm)
qplot(pred, resid, data=seals) +
stat_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
OK, ok, everything looks fine. Now, how do we test our model?
First, F-tests! R has a base method called anova() which - well, it’s for analysis of variance, partitioning sums of squares, but it will really take on a wide variety of forms as we go forward. For now, it will produce F tables for us.
anova(seal_lm)
## Analysis of Variance Table
##
## Response: length.cm
## Df Sum Sq Mean Sq F value Pr(>F)
## age.days 1 90862 90862 2815.8 < 2.2e-16 ***
## Residuals 9663 311807 32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Boom! P values! And they are low. Simple, no?
For more information - t-tests of coefficients, \(R^2\), and more - there’s summary() - again, a function that is a go-to for nearly any fit model.
summary(seal_lm)
##
## Call:
## lm(formula = length.cm ~ age.days, data = seals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8700 -3.8279 0.0304 3.7541 21.6874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.158e+02 1.764e-01 656.37 <2e-16 ***
## age.days 2.371e-03 4.467e-05 53.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.681 on 9663 degrees of freedom
## Multiple R-squared: 0.2256, Adjusted R-squared: 0.2256
## F-statistic: 2816 on 1 and 9663 DF, p-value: < 2.2e-16
This is a lot of information to drink in - function call, distribution of residuals, coefficient t-tests, and multiple pieces of information about total fit.
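If you just want one piece of that output, note that summary() returns a list you can index into:
# pull single quantities out of the summary object
summary(seal_lm)$r.squared
summary(seal_lm)$coefficients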
We may want to get this information in a more condensed form for use in other contexts - particularly to compare against other models. For that, there’s a wonderful package called broom that sweeps up your model into easily digestible pieces.
First, the coefficient table - let’s make it pretty.
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
tidy(seal_lm)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 116. 0.176 656. 0
## 2 age.days 0.00237 0.0000447 53.1 0
Nice.
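Note, tidy() can also append confidence intervals to the coefficient table:
tidy(seal_lm, conf.int = TRUE)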
We can also do this for the F-test.
tidy(anova(seal_lm))
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 age.days 1 90862. 90862. 2816. 0
## 2 Residuals 9663 311807. 32.3 NA NA
If we want to get information about overall model fit, there’s glance().
glance(seal_lm)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.226 0.226 5.68 2816. 0 1 -30502. 61009. 61031.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
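Because glance() returns a one-row data frame, comparing models is just a row-binding exercise. As a quick sketch, here’s how you might stack it up against an intercept-only null model (seal_null is just an illustrative name):
# a null, intercept-only model for comparison
seal_null <- lm(length.cm ~ 1, data = seals)

# one row per model, side by side
bind_rows(glance(seal_null), glance(seal_lm))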
Lovely! Now, how do we visualize the fit and the error around it? In ggplot2, we can use the smoother stat_smooth in conjunction with method = "lm" to get the job done.
seal_fit_plot <- ggplot(data=seals) +
aes(x=age.days, y=length.cm) +
geom_point() +
stat_smooth(method="lm")
seal_fit_plot
## `geom_smooth()` using formula = 'y ~ x'
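Note that stat_smooth(method = "lm") shades the confidence interval of the fit. If you instead want the prediction interval - the expected spread of new observations - one approach (a sketch using base predict(), not the only way) is to compute it and draw a ribbon yourself:
# add prediction intervals (columns fit, lwr, upr) to the data
seal_pi <- cbind(seals, predict(seal_lm, interval = "prediction"))

ggplot(seal_pi, aes(x = age.days, y = length.cm)) +
  geom_point() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3) +
  stat_smooth(method = "lm")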
First, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense of the general workflow for analysis.
library(ggplot2)
fat <- read.csv("Labs/data/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")
#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) +
geom_point()
fat_plot
fat_mod <- lm(lossrate ~ leanness, data=fat)
#assumptions
check_model(fat_mod)
#f-tests of model
anova(fat_mod)
#t-tests of parameters
summary(fat_mod)
#plot with line
fat_plot +
stat_smooth(method=lm, formula=y~x)
For your first faded example, let’s look at the relationship between DEET and mosquito bites.
deet <- read.csv("Labsdata/17q24DEETMosquiteBites.csv")
deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) +
geom_point()
deet_plot
deet_mod <- lm(bites ~ dose, data=deet)
#assumptions
check_model(____)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
deet_plot +
stat_smooth(method=lm, formula=y~x)
We’ll start today with the dataset 15e1KneesWhoSayNight.csv, about an experiment to help resolve jetlag by having people shine lights at different parts of themselves to try and shift their internal clocks.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ broom::bootstrap() masks modelr::bootstrap()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ modelr::mae() masks performance::mae()
## ✖ modelr::mse() masks performance::mse()
## ✖ modelr::rmse() masks performance::rmse()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
knees <- read_csv("Labs/data/15e1KneesWhoSayNight.csv")
## Rows: 22 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): treatment
## dbl (1): shift
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can see the outcomes with ggplot2
library(ggplot2)
ggplot(knees, mapping=aes(x=treatment, y=shift)) +
stat_summary(color="red", size=1.3) +
geom_point(alpha=0.7) +
theme_bw(base_size=17)
## No summary function supplied, defaulting to `mean_se()`
As the underlying model of ANOVA is a linear one, we fit ANOVAs using lm() just as with linear regression.
knees <- read.csv("Labs/data/15e1KneesWhoSayNight.csv")
knees_lm <- lm(shift ~ treatment, data=knees)
Now, there are two things to notice here. One, note that treatment is either a character or a factor. If it is not, because we are using lm(), it will be fit like a linear regression with a continuous predictor. So, beware!
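A minimal check of the column type never hurts - something like:
# confirm treatment is a character or factor before fitting
class(knees$treatment)
# if it were numeric: knees$treatment <- as.factor(knees$treatment)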
Two, there is an ANOVA-specific model fitting function - aov().
knees_aov <- aov(shift ~ treatment, data=knees)
It’s ok, I guess, and works with a few functions that lm() objects do not. But, in general, I find it too limiting - you can’t see coefficients, etc. Boooooring.
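For comparison, summary() on an aov object gives you the ANOVA table rather than coefficient tests:
summary(knees_aov)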
Because this is an lm, we can check our assumptions as before - with one new one. First, some oldies but goodies.
# check_model() gives us a multi-panel plot of diagnostics
check_model(knees_lm)
Now, the residuals v. fitted plot lets us see how the residuals are distributed by treatment, but I often find it insufficient, as spacing on the x-axis can get odd. I could roll my own plot of residuals versus treatment but, instead, there’s a wonderful package called car - from the book Companion to Applied Regression by John Fox. I recommend it highly! It has a function called residualPlots() which is useful here.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
residualPlots(knees_lm)
Note how it both does fitted v. residuals but also a boxplot by treatment. Handy, no?
OK, so, let’s see the ANOVA table! With the function….anova()!
anova(knees_lm)
## Analysis of Variance Table
##
## Response: shift
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 7.2245 3.6122 7.2894 0.004472 **
## Residuals 19 9.4153 0.4955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now….this is a type I sums of squares test, which is fine for a 1-way ANOVA. If you want to start getting into the practice of using type II, car provides a function Anova() - note the capital A - which defaults to type II, and which I use instead. In fact, I use it all the time, as it handles a wide set of different models.
Anova(knees_lm)
## Anova Table (Type II tests)
##
## Response: shift
## Sum Sq Df F value Pr(>F)
## treatment 7.2245 2 7.2894 0.004472 **
## Residuals 9.4153 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here it matters not a whit as you get the same table.
So, there are a lot of things we can do with a fit model.
summary(knees_lm)
##
## Call:
## lm(formula = shift ~ treatment, data = knees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27857 -0.36125 0.03857 0.61147 1.06571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.30875 0.24888 -1.241 0.22988
## treatmenteyes -1.24268 0.36433 -3.411 0.00293 **
## treatmentknee -0.02696 0.36433 -0.074 0.94178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared: 0.4342, Adjusted R-squared: 0.3746
## F-statistic: 7.289 on 2 and 19 DF, p-value: 0.004472
First, notice that we get the same information as a linear regression - including \(R^2\) and the overall model F-test. This is great. We also get coefficients, but what do they mean?
Well, they are the treatment contrasts. Not super useful. R fits a model where treatment 1 is the intercept, and then the other coefficients are deviations from that initial treatment. It’s efficient, but hard to make sense of. To not get an intercept term, you need to refit the model without the intercept. You can fit a whole new model with -1 in the model formulation. Or, as I like to do to ensure I don’t frak anything up, you can update() your model. Just use . to signify what was there before.
knees_lm_no_int <- update(knees_lm, formula = . ~ . -1)
summary(knees_lm_no_int)
##
## Call:
## lm(formula = shift ~ treatment - 1, data = knees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27857 -0.36125 0.03857 0.61147 1.06571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## treatmentcontrol -0.3087 0.2489 -1.241 0.230
## treatmenteyes -1.5514 0.2661 -5.831 1.29e-05 ***
## treatmentknee -0.3357 0.2661 -1.262 0.222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared: 0.6615, Adjusted R-squared: 0.6081
## F-statistic: 12.38 on 3 and 19 DF, p-value: 0.0001021
OK - that makes more sense. For a 1-way ANOVA, we can also see treatment means using the emmeans package - much more on that next week (and later below for contrasts).
library(emmeans)
library(multcompView)
emmeans(knees_lm, ~treatment)
## treatment emmean SE df lower.CL upper.CL
## control -0.309 0.249 19 -0.830 0.212
## eyes -1.551 0.266 19 -2.108 -0.995
## knee -0.336 0.266 19 -0.893 0.221
##
## Confidence level used: 0.95
I also like this because it outputs CIs.
We see means and whether they are different from 0. But….what about post-hoc tests?
If you have a priori contrasts, you can use the contrast package to test them. You give contrast() an a list and a b list, and you get all comparisons of a v. b, in order. It’s not great syntactically, but it lets you do some pretty creative things.
contrast::contrast(knees_lm,
a = list(treatment = "control"),
b = list(treatment = "eyes"))
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 1 1.242679 0.3643283 0.4801306 2.005227 3.41 19 0.0029
Meh. 9 times out of 10 we want to do something more like a Tukey test. There is a TukeyHSD function that works on aov objects but, if you’re doing anything with an lm, it borks on you. Instead, let’s use emmeans. It is wonderful, as it’s designed to work with ANOVA and ANCOVA models with complicated structures such that, for post-hocs, it adjusts to the mean or median level of all other factors. Very handy.
knees_em <- emmeans(knees_lm, ~treatment)
contrast(knees_em,
method = "tukey")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.243 0.364 19 3.411 0.0079
## control - knee 0.027 0.364 19 0.074 0.9970
## eyes - knee -1.216 0.376 19 -3.231 0.0117
##
## P value adjustment: tukey method for comparing a family of 3 estimates
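Note, pairs() is an emmeans shorthand for exactly this set of all-pairwise, Tukey-adjusted comparisons:
pairs(knees_em)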
We don’t need to worry about many of the fancier things that emmeans does for the moment - those will become more useful with other models. But, we can look at this test a few different ways. First, we can visualize it
plot(contrast(knees_em,
method = "tukey")) +
geom_vline(xintercept = 0, color = "red", lty=2)
We can also, using our tukey method of adjustment, get “groups” - i.e., see which groups are statistically the same versus different.
library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
cld(knees_em, adjust="tukey")
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## treatment emmean SE df lower.CL upper.CL .group
## eyes -1.551 0.266 19 -2.25 -0.855 1
## knee -0.336 0.266 19 -1.03 0.361 2
## control -0.309 0.249 19 -0.96 0.343 2
##
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 3 estimates
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
This can be very useful in plotting. For example, we can use that output as a data frame for a ggplot in a few different ways.
cld(knees_em, adjust="tukey") %>%
ggplot(aes(x = treatment, y = emmean,
ymin = lower.CL, ymax = upper.CL,
color = factor(.group))) +
geom_pointrange()
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
cld(knees_em, adjust="tukey") %>%
mutate(.group = letters[as.numeric(.group)]) %>%
ggplot(aes(x = treatment, y = emmean,
ymin = lower.CL, ymax = upper.CL)) +
geom_pointrange() +
geom_text(mapping = aes(label = .group), y = rep(1, 3)) +
ylim(c(-2.5, 2))
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
knees_expanded <- left_join(knees,
cld(knees_em, adjust="tukey"))
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Joining with `by = join_by(treatment)`
ggplot(knees_expanded,
aes(x = treatment, y = shift, color = .group)) +
geom_point()
We can similarly use this to look at a Dunnett’s test, which compares against the control.
contrast(knees_em,
method = "dunnett")
## contrast estimate SE df t.ratio p.value
## eyes - control -1.243 0.364 19 -3.411 0.0057
## knee - control -0.027 0.364 19 -0.074 0.9927
##
## P value adjustment: dunnettx method for 2 tests
Note, if the “control” had not been the first treatment, you could either re-order the factor using forcats or just specify which of the levels is the control. For example, eyes is the second treatment. Let’s make it our new reference.
contrast(knees_em,
method = "dunnett", ref=2)
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24 0.364 19 3.411 0.0057
## knee - eyes 1.22 0.376 19 3.231 0.0085
##
## P value adjustment: dunnettx method for 2 tests
You can even plot these results
plot(contrast(knees_em,
method = "dunnett", ref=2)) +
geom_vline(xintercept = 0, color = "red", lty=2)
Let’s say you wanted to do all pairwise tests, but compare using a Bonferroni correction or FDR. Or none! No problem! There’s an adjust argument.
contrast(knees_em,
method = "tukey", adjust="bonferroni")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.243 0.364 19 3.411 0.0088
## control - knee 0.027 0.364 19 0.074 1.0000
## eyes - knee -1.216 0.376 19 -3.231 0.0132
##
## P value adjustment: bonferroni method for 3 tests
contrast(knees_em,
method = "tukey", adjust="fdr")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.243 0.364 19 3.411 0.0066
## control - knee 0.027 0.364 19 0.074 0.9418
## eyes - knee -1.216 0.376 19 -3.231 0.0066
##
## P value adjustment: fdr method for 3 tests
contrast(knees_em,
method = "tukey", adjust="none")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.243 0.364 19 3.411 0.0029
## control - knee 0.027 0.364 19 0.074 0.9418
## eyes - knee -1.216 0.376 19 -3.231 0.0044
Let’s try three ANOVAs! First - do landscape characteristics affect the number of generations plant species can exist before local extinction?
plants <- read.csv("Labsdata/15q01PlantPopulationPersistence.csv")
#Visualize
qplot(treatment, generations, data=plants, geom="boxplot")
#fit
plant_lm <- lm(generations ~ treatment, data=plants)
#assumptions
plot(plant_lm, which=c(1,2,4,5))
#ANOVA
anova(plant_lm)
#Tukey's HSD
contrast(emmeans(plant_lm, ~treatment), method = "tukey")
Second, how do different host types affect nematode longevity?
worms <- read.csv("Labsdata/15q19NematodeLifespan.csv")
#Visualize
qplot(treatment, lifespan, data=____, geom="____")
#fit
worm_lm <- lm(______ ~ ______, data=worms)
#assumptions
plot(______, which=c(1,2,4,5))
#ANOVA
anova(______)
#Tukey's HSD
contrast(emmeans(______, ~________), method = "tukey")
And last, how does the number of genotypes affect eelgrass productivity? Note, THERE IS A TRAP HERE. Look at your dataset before you do ANYTHING.
eelgrass <- read.csv("Labsdata/15q05EelgrassGenotypes.csv")
#Visualize
________(treatment.genotypes, shoots, data=____, geom="____")
#fit
eelgrass_lm <- __(______ ~ ______, data=________)
#assumptions
________(______, which=c(1,2,4,5))
#ANOVA
________(______)
#Tukey's HSD
contrast(________(______, ~________), method = "________")