Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple, with a straightforward workflow.

  1. Load the data.
  2. Visualize the data - just to detect problems and perform a cursory test of assumptions!
  3. Fit the model.
  4. Use the fit model to test assumptions.
  5. Evaluate the model.
  6. Visualize the fit model.
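
If it helps to see the destination first, here's the whole workflow in rough sketch form, using the seal example we walk through below (the calls here just preview the steps to come):

# a rough sketch of the full workflow
seals <- read.csv("Labs/data/17e8ShrinkingSeals Trites 1996.csv")  # 1. load
plot(length.cm ~ age.days, data = seals)                           # 2. visualize
seal_lm <- lm(length.cm ~ age.days, data = seals)                  # 3. fit
performance::check_model(seal_lm)                                  # 4. test assumptions
anova(seal_lm)                                                     # 5. evaluate...
summary(seal_lm)                                                   #    ...the model
plot(length.cm ~ age.days, data = seals); abline(seal_lm)          # 6. visualize the fit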

Let’s go through each step with an example of seals. Are older seals larger?

0. Load and visualize the data

Are Older Seals Bigger?

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

seals <- read.csv("Labs/data/17e8ShrinkingSeals Trites 1996.csv")

seal_base <- ggplot(seals, aes(x=age.days, y=length.cm)) +
  geom_point() +
  theme_grey(base_size=14)

seal_base

Neat data set, no?

Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!

1. Fitting a model

OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?

Nothing could be simpler. R has a function called lm(), which stands for linear model. It uses the same y ~ x formula syntax as base plot, and it returns a fit model object.

seal_lm <- lm(length.cm ~ age.days, data=seals)

That’s it.

Now, if we want to just peek at the fit before going forward, we can use coef(), which is pretty standard across all R fit model functions.

coef(seal_lm)
##  (Intercept)     age.days 
## 1.157668e+02 2.370559e-03

But that’s getting ahead of ourselves…

2. Evaluating Assumptions

R also provides a one-stop shop for evaluating model assumptions. Fit model objects can typically be checked with the performance package.

library(performance)
check_model(seal_lm)

Whoah - that’s a lot!

OK, breathe.

There's a lot here. First, a posterior predictive check - does the model produce fits that look like the data? You can also get this plot on its own with

check_predictions(seal_lm)

Next, we can look at the fitted and residual plot

check_model(seal_lm, check = "ncv") |> plot()

Note the curvature of the line? Troubling, or just an artifact of a high n? It's worth thinking about. Next, we can look at information about the residuals. First, we check the homogeneity of variance.

check_heteroscedasticity(seal_lm) |> plot()

That curvilinearity at the end is again a hair concerning. But, let’s dig deeper and look at the residuals. Are they normal?

check_normality(seal_lm) |> plot()

A QQ plot of the residuals

check_normality(seal_lm) |> plot(type = "qq")
## For confidence bands, please install `qqplotr`.

Not bad!

Next, outliers.

The Cook’s D values

check_outliers(seal_lm) |> plot(type = "bar")

All quite small!

And last, leverage

check_outliers(seal_lm) |> plot()

I also like to look at the histogram of residuals. There is a function called residuals() that will work on nearly any fit model object in R. So we can just…

hist(residuals(seal_lm))

Note, there’s also a library called modelr which can add the appropriate residuals to your data frame using a dplyr-like syntax.

library(modelr)
## 
## Attaching package: 'modelr'
## The following objects are masked from 'package:performance':
## 
##     mae, mse, rmse
seals <- seals %>%
  add_residuals(seal_lm)

head(seals)
##   age.days length.cm      resid
## 1     5337       131  2.5815746
## 2     2081       123  2.3001156
## 3     2085       122  1.2906334
## 4     4299       136 10.0422152
## 5     2861       122 -0.5489206
## 6     5052       131  3.2571840

Check out that new column. You can now plot your predictor versus residuals, which should show no trend - something you can check for with a spline via stat_smooth().

ggplot(seals, aes(x = age.days, y = resid)) +
  geom_point() +
  stat_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

And you can also add fitted values and look at fitted versus residual.

seals <- seals %>%
  add_predictions(seal_lm)

ggplot(seals, aes(x = pred, y = resid)) +
  geom_point() +
  stat_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

3. Putting it to the test

OK, ok, everything looks fine. Now, how do we test our model?

First, F-tests! R has a base method called anova() which - well, it's for analysis of variance, partitioning variance among sources, but it will take on a wide variety of forms as we go forward. For now, it will produce F tables for us.

anova(seal_lm)
## Analysis of Variance Table
## 
## Response: length.cm
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## age.days     1  90862   90862  2815.8 < 2.2e-16 ***
## Residuals 9663 311807      32                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Boom! P values! And they are low. Simple, no?

For more information - t-tests, R2, and more - we can use summary(), again a function that is a go-to for nearly any fit model.

summary(seal_lm)
## 
## Call:
## lm(formula = length.cm ~ age.days, data = seals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8700  -3.8279   0.0304   3.7541  21.6874 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.158e+02  1.764e-01  656.37   <2e-16 ***
## age.days    2.371e-03  4.467e-05   53.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.681 on 9663 degrees of freedom
## Multiple R-squared:  0.2256, Adjusted R-squared:  0.2256 
## F-statistic:  2816 on 1 and 9663 DF,  p-value: < 2.2e-16

This is a lot of information to drink in - function call, distribution of residuals, coefficient t-tests, and multiple pieces of information about total fit.
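
If you just want one of these pieces, note that summary() returns a list you can index into directly - for example:

seal_summary <- summary(seal_lm)
seal_summary$r.squared      # just the R^2
seal_summary$coefficients   # the coefficient table as a matrix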

We may want to get this information in a more condensed form for use in other contexts - particularly to compare against other models. For that, there's a wonderful package called broom that sweeps up your model into easily digestible pieces.

First, the coefficient table - let’s make it pretty.

library(broom)
## 
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
## 
##     bootstrap
tidy(seal_lm)
## # A tibble: 2 × 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) 116.      0.176         656.        0
## 2 age.days      0.00237 0.0000447      53.1       0

Nice.

We can also do this for the F-test.

tidy(anova(seal_lm))
## # A tibble: 2 × 6
##   term         df   sumsq  meansq statistic p.value
##   <chr>     <int>   <dbl>   <dbl>     <dbl>   <dbl>
## 1 age.days      1  90862. 90862.      2816.       0
## 2 Residuals  9663 311807.    32.3       NA       NA

If we want to get information about fit, there’s glance()

glance(seal_lm)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.226         0.226  5.68     2816.       0     1 -30502. 61009. 61031.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

4. Visualization

Lovely! Now, how do we visualize the fit and fit prediction error?

In ggplot2 we can use the smoother, stat_smooth in conjunction with method = "lm" to get the job done.

seal_fit_plot <- ggplot(data=seals) +
  aes(x=age.days, y=length.cm) +
  geom_point() +
  stat_smooth(method="lm")

seal_fit_plot
## `geom_smooth()` using formula = 'y ~ x'
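
By the way, the band that stat_smooth() draws is the confidence interval of the fit, not the prediction error for new observations. If you want prediction intervals as well, here's a minimal sketch using predict() (the ribbon layering is one choice among several):

# 95% prediction intervals for each observation
# (R will warn that predictions on current data refer to future responses)
seal_pred <- cbind(seals, predict(seal_lm, interval = "prediction"))

seal_fit_plot +
  geom_ribbon(data = seal_pred,
              aes(ymin = lwr, ymax = upr),
              alpha = 0.2)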

5. Faded Examples.

A Fat Model

First, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense of the general workflow for analysis.

library(ggplot2)
fat <- read.csv("Labs/data/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")

#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) + 
  geom_point()
fat_plot

fat_mod <- lm(lossrate ~ leanness, data=fat)

#assumptions
check_model(fat_mod)

#f-tests of model
anova(fat_mod)

#t-tests of parameters
summary(fat_mod)

#plot with line
fat_plot + 
  stat_smooth(method=lm, formula=y~x)

An Itchy Followup

For your first faded example, let’s look at the relationship between DEET and mosquito bites.

deet <- read.csv("Labs/data/17q24DEETMosquiteBites.csv")

deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) + 
  geom_point()

deet_plot

deet_mod <- lm(bites ~ dose, data=deet)

#assumptions
check_model(____)

#f-tests of model
anova(___)

#t-tests of parameters
summary(___)

#plot with line
deet_plot + 
  stat_smooth(method=lm, formula=y~x)

6. One-Way ANOVA Model

We’ll start today with the dataset 15e1KneesWhoSayNight.csv about an experiment to help resolve jetlag by having people shine lights at different parts of themselves to try and shift their internal clocks.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ broom::bootstrap() masks modelr::bootstrap()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ modelr::mae()      masks performance::mae()
## ✖ modelr::mse()      masks performance::mse()
## ✖ modelr::rmse()     masks performance::rmse()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
knees <- read_csv("Labs/data/15e1KneesWhoSayNight.csv")
## Rows: 22 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): treatment
## dbl (1): shift
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can see the outcomes with ggplot2

library(ggplot2)
ggplot(knees, mapping=aes(x=treatment, y=shift)) +
  stat_summary(color="red", size=1.3) +
    geom_point(alpha=0.7) +
  theme_bw(base_size=17)
## No summary function supplied, defaulting to `mean_se()`

6.1 LM, AOV, and Factors

As the underlying model of ANOVA is a linear one, we fit ANOVAs using lm() just as with linear regression.

knees <- read.csv("Labs/data/15e1KneesWhoSayNight.csv")

knees_lm <- lm(shift ~ treatment, data=knees)
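
As an aside, because the underlying model is linear, lm() expands treatment into 0/1 dummy variables behind the scenes. You can peek at that design matrix yourself:

head(model.matrix(knees_lm))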

Now, there are two things to notice here. One, note that treatment is either a character or a factor. If it is not, then - because we are using lm() - it will be fit as a regression on a numeric predictor, not an ANOVA. So, beware!
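
A quick defensive check - a minimal sketch, assuming the data were read in with read.csv() as above:

class(knees$treatment)   # should be "character" or "factor"
# if it came in as numeric codes, convert before fitting:
# knees$treatment <- as.factor(knees$treatment)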

Two, there is an ANOVA-specific model fitting function - aov().

knees_aov <- aov(shift ~ treatment, data=knees)

It’s ok, I guess, and works with a few functions that lm() objects do not. But, in general, I find it too limiting. You can’t see coefficients, etc. Boooooring.

6.2 Assumption Evaluation

Because this is an lm, we can check our assumptions as before - with one new one. First, some oldies but goodies.

#check_model gives us a multi-panel diagnostic plot
check_model(knees_lm)

Now, the residuals v. fitted plot lets us see how the residuals are distributed by treatment, but I often find it insufficient, as spacing on the x-axis can get odd. I could roll my own plot of residuals versus treatment, but there's a wonderful package called car - which is from the book Companion to Applied Regression by John Fox. I recommend it highly! It has a function in it called residualPlots() which is useful here.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
residualPlots(knees_lm)

Note how it both does fitted v. residuals but also a boxplot by treatment. Handy, no?

6.3 F-Tests

OK, so, let’s see the ANOVA table! With the function….anova()!

anova(knees_lm)
## Analysis of Variance Table
## 
## Response: shift
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## treatment  2 7.2245  3.6122  7.2894 0.004472 **
## Residuals 19 9.4153  0.4955                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now….this is a type I sums of squares test. Which is fine for a 1-way ANOVA. If you want to start getting into the practice of using type II, car provides a function Anova() - note the capital A - which defaults to type II and I use instead. In fact, I use it all the time, as it handles a wide set of different models.

Anova(knees_lm)
## Anova Table (Type II tests)
## 
## Response: shift
##           Sum Sq Df F value   Pr(>F)   
## treatment 7.2245  2  7.2894 0.004472 **
## Residuals 9.4153 19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here it matters not a whit as you get the same table.

6.4 Post-hoc Tests

So, there are a lot of things we can do with a fit model.

6.4.0 Summary Output

summary(knees_lm)
## 
## Call:
## lm(formula = shift ~ treatment, data = knees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27857 -0.36125  0.03857  0.61147  1.06571 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -0.30875    0.24888  -1.241  0.22988   
## treatmenteyes -1.24268    0.36433  -3.411  0.00293 **
## treatmentknee -0.02696    0.36433  -0.074  0.94178   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared:  0.4342, Adjusted R-squared:  0.3746 
## F-statistic: 7.289 on 2 and 19 DF,  p-value: 0.004472

First, notice that we get the same information as a linear regression - including \(R^2\) and an overall model F-test. This is great. We also get coefficients, but what do they mean?

Well, they are the treatment contrasts. Not super useful. R fits a model where treatment 1 is the intercept, and the other coefficients are deviations from that initial treatment. It's efficient, but hard to make sense of. To not get an intercept term, you need to refit the model without the intercept. You can fit a whole new model with -1 in the model formulation. Or, as I like to do to ensure I don't frak anything up, you can update() your model. Just use . to signify what was there before.
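
For reference, the direct version is a one-liner (the update() call below gives an identical fit):

knees_lm_no_int_direct <- lm(shift ~ treatment - 1, data = knees)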

knees_lm_no_int <- update(knees_lm, formula = . ~ . -1)

summary(knees_lm_no_int)
## 
## Call:
## lm(formula = shift ~ treatment - 1, data = knees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27857 -0.36125  0.03857  0.61147  1.06571 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## treatmentcontrol  -0.3087     0.2489  -1.241    0.230    
## treatmenteyes     -1.5514     0.2661  -5.831 1.29e-05 ***
## treatmentknee     -0.3357     0.2661  -1.262    0.222    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared:  0.6615, Adjusted R-squared:  0.6081 
## F-statistic: 12.38 on 3 and 19 DF,  p-value: 0.0001021

OK - that makes more sense. For a 1-way ANOVA, we can also see treatment means using the emmeans package - much more on that next week (and later below for contrasts).

library(emmeans)
library(multcompView)

emmeans(knees_lm, ~treatment)
##  treatment emmean    SE df lower.CL upper.CL
##  control   -0.309 0.249 19   -0.830    0.212
##  eyes      -1.551 0.266 19   -2.108   -0.995
##  knee      -0.336 0.266 19   -0.893    0.221
## 
## Confidence level used: 0.95

I also like this because it outputs CIs.
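
Incidentally, if you want explicit p-values for each mean against zero, emmeans has a test() function - a quick sketch:

test(emmeans(knees_lm, ~treatment))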

We see means and whether they are different from 0. But….what about post-hoc tests?

6.4.1 A Priori Contrasts

If you have a priori contrasts, you can use the contrast package to test them. You give contrast() an a list and a b list. Then we get all comparisons of a v. b, in order. It's not great syntactically, but it lets you do some pretty creative things.

contrast::contrast(knees_lm, 
         a = list(treatment = "control"), 
         b = list(treatment = "eyes"))
## lm model parameter contrast
## 
##   Contrast      S.E.     Lower    Upper    t df Pr(>|t|)
## 1 1.242679 0.3643283 0.4801306 2.005227 3.41 19   0.0029

6.4.2 Tukey’s HSD

Meh. 9 times out of 10 we want to do something more like a Tukey Test. There is a TukeyHSD function that works on aov objects, but, if you’re doing anything with an lm, it borks on you. Instead, let’s use emmeans. It is wonderful as it’s designed to work with ANOVA and ANCOVA models with complicated structures such that, for post-hocs, it adjusts to the mean or median level of all other factors. Very handy.
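
For reference, here is the base-R version on the aov fit from earlier - it works there, just not on an lm:

TukeyHSD(knees_aov)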

knees_em <- emmeans(knees_lm, ~treatment)

contrast(knees_em,
        method = "tukey")
##  contrast       estimate    SE df t.ratio p.value
##  control - eyes    1.243 0.364 19   3.411  0.0079
##  control - knee    0.027 0.364 19   0.074  0.9970
##  eyes - knee      -1.216 0.376 19  -3.231  0.0117
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

We don’t need to worry about many of the fancier things that emmeans does for the moment - those will become more useful with other models. But, we can look at this test a few different ways. First, we can visualize it

plot(contrast(knees_em,
        method = "tukey")) +
  geom_vline(xintercept = 0, color = "red", lty=2)

We can also, using our tukey method of adjustment, get “groups” - i.e., see which groups are statistically the same versus different.

library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
cld(knees_em, adjust="tukey")
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
##  treatment emmean    SE df lower.CL upper.CL .group
##  eyes      -1.551 0.266 19    -2.25   -0.855  1    
##  knee      -0.336 0.266 19    -1.03    0.361   2   
##  control   -0.309 0.249 19    -0.96    0.343   2   
## 
## Confidence level used: 0.95 
## Conf-level adjustment: sidak method for 3 estimates 
## P value adjustment: tukey method for comparing a family of 3 estimates 
## significance level used: alpha = 0.05 
## NOTE: If two or more means share the same grouping symbol,
##       then we cannot show them to be different.
##       But we also did not show them to be the same.

This can be very useful in plotting. For example, we can use that output as a data frame for a ggplot in a few different ways.

cld(knees_em, adjust="tukey") %>%
  ggplot(aes(x = treatment, y = emmean, 
             ymin = lower.CL, ymax = upper.CL,
             color = factor(.group))) +
  geom_pointrange() 
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons

cld(knees_em, adjust="tukey") %>%
  mutate(.group = letters[as.numeric(.group)]) %>%
  ggplot(aes(x = treatment, y = emmean, 
             ymin = lower.CL, ymax = upper.CL)) +
  geom_pointrange() +
  geom_text(mapping = aes(label = .group), y = rep(1, 3)) +
  ylim(c(-2.5, 2))
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons

knees_expanded <- left_join(knees, 
                            cld(knees_em, adjust="tukey"))
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Joining with `by = join_by(treatment)`
ggplot(knees_expanded,
       aes(x = treatment, y = shift, color = .group)) + 
  geom_point()

6.4.3 Dunnett's Test

We can similarly use this to look at a Dunnett's test, which compares each treatment against the control.

contrast(knees_em,
        method = "dunnett")
##  contrast       estimate    SE df t.ratio p.value
##  eyes - control   -1.243 0.364 19  -3.411  0.0057
##  knee - control   -0.027 0.364 19  -0.074  0.9927
## 
## P value adjustment: dunnettx method for 2 tests

Note, if the "control" had not been the first treatment, you can either re-order the factor using forcats (there's a sketch at the end of this section) or just specify which of the levels is the control. For example, eyes is the second treatment. Let's make it our new reference.

contrast(knees_em,
        method = "dunnett", ref=2)
##  contrast       estimate    SE df t.ratio p.value
##  control - eyes     1.24 0.364 19   3.411  0.0057
##  knee - eyes        1.22 0.376 19   3.231  0.0085
## 
## P value adjustment: dunnettx method for 2 tests

You can even plot these results

plot(contrast(knees_em,
        method = "dunnett", ref=2)) +
  geom_vline(xintercept = 0, color = "red", lty=2)
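
And, as promised above, if you'd rather re-order the factor itself rather than use ref, here's a forcats sketch (the object names are mine):

knees_releveled <- knees %>%
  mutate(treatment = forcats::fct_relevel(treatment, "eyes"))

knees_lm_eyes <- lm(shift ~ treatment, data = knees_releveled)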

6.4.4 Bonferroni Correction and FDR

Let's say you wanted to do all pairwise tests, but compare using a Bonferroni correction or FDR. Or none! No problem! There's an adjust argument.

contrast(knees_em,
        method = "tukey", adjust="bonferroni")
##  contrast       estimate    SE df t.ratio p.value
##  control - eyes    1.243 0.364 19   3.411  0.0088
##  control - knee    0.027 0.364 19   0.074  1.0000
##  eyes - knee      -1.216 0.376 19  -3.231  0.0132
## 
## P value adjustment: bonferroni method for 3 tests
contrast(knees_em,
        method = "tukey", adjust="fdr")
##  contrast       estimate    SE df t.ratio p.value
##  control - eyes    1.243 0.364 19   3.411  0.0066
##  control - knee    0.027 0.364 19   0.074  0.9418
##  eyes - knee      -1.216 0.376 19  -3.231  0.0066
## 
## P value adjustment: fdr method for 3 tests
contrast(knees_em,
        method = "tukey", adjust="none")
##  contrast       estimate    SE df t.ratio p.value
##  control - eyes    1.243 0.364 19   3.411  0.0029
##  control - knee    0.027 0.364 19   0.074  0.9418
##  eyes - knee      -1.216 0.376 19  -3.231  0.0044

7. ANOVA Faded Examples

Let’s try three ANOVAs! First - do landscape characteristics affect the number of generations plant species can exist before local extinction?

plants <- read.csv("Labs/data/15q01PlantPopulationPersistence.csv")

#Visualize
qplot(treatment, generations, data=plants, geom="boxplot")

#fit
plant_lm <- lm(generations ~ treatment, data=plants)

#assumptions
plot(plant_lm, which=c(1,2,4,5))

#ANOVA
anova(plant_lm)

#Tukey's HSD
contrast(emmeans(plant_lm, ~treatment), method = "tukey")

Second, how do different host types affect nematode longevity?

worms <- read.csv("Labs/data/15q19NematodeLifespan.csv")

#Visualize
qplot(treatment, lifespan, data=____, geom="____")

#fit
worm_lm <- lm(______ ~ ______, data=worms)

#assumptions
plot(______, which=c(1,2,4,5))

#ANOVA
anova(______)

#Tukey's HSD
contrast(emmeans(______, ~________), method = "tukey")

And last, how does the number of genotypes affect eelgrass productivity? Note, THERE IS A TRAP HERE. Look at your dataset before you do ANYTHING.

eelgrass <- read.csv("Labs/data/15q05EelgrassGenotypes.csv")
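
#look at your data FIRST - is treatment.genotypes categorical or continuous?
str(eelgrass)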

#Visualize
________(treatment.genotypes, shoots, data=____, geom="____")

#fit
eelgrass_lm <- __(______ ~ ______, data=________)

#assumptions
________(______, which=c(1,2,4,5))

#ANOVA
________(______)

#Tukey's HSD
contrast(________(______, ~________), method = "________")