Time for the general linear model.
Combining categorical and continuous variables is not that different from multiway ANOVA. To start with, let’s look at the neanderthal data.
library(readr)

neand <- read_csv("Labs/data/18q09NeanderthalBrainSize.csv")
head(neand)
## # A tibble: 6 × 3
## lnmass lnbrain species
## <dbl> <dbl> <chr>
## 1 4 7.19 recent
## 2 3.96 7.23 recent
## 3 3.95 7.26 recent
## 4 4.04 7.23 recent
## 5 4.11 7.22 recent
## 6 4.1 7.24 recent
We can clearly see both the categorical variable we’re interested in, and the covariate.
To get a preliminary sense of what’s going on here, do some exploratory visualization with ggplot2, why doncha!
library(ggplot2)

ggplot(data = neand,
       aes(x = lnmass,
           y = lnbrain,
           color = species)) +
  geom_point() +
  stat_smooth(method = "lm")
Now, the CIs are going to be off as this wasn’t all tested in the same model, but you begin to get a sense of whether things are parallel or not, and whether this covariate is important.
What other plots might you produce?
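One possibility (a hypothetical example, not part of the original lab) is to look at how much the two species overlap in the covariate itself, since covariate overlap matters for interpreting adjusted means:

# check how much the species overlap in body mass (the covariate)
ggplot(data = neand,
       aes(x = species, y = lnmass)) +
  geom_boxplot()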
As this is a general linear model, good olde lm() is still there for us.
neand_lm <- lm(lnbrain ~ species + lnmass, data=neand)
In addition to the usual tests, we need to make sure that the slopes really are parallel. We do that by fitting a model with an interaction and testing it (and, well, if there were an interaction, might that not be interesting in its own right?).
First, the usual:

library(performance)

check_model(neand_lm)
Looks pretty good.
To test the presumption of parallel slopes:

library(car)

neand_int <- lm(lnbrain ~ species*lnmass, data = neand)

Anova(neand_int)
## Anova Table (Type II tests)
##
## Response: lnbrain
## Sum Sq Df F value Pr(>F)
## species 0.027553 1 6.2203 0.0175 *
## lnmass 0.130018 1 29.3527 4.523e-06 ***
## species:lnmass 0.004845 1 1.0938 0.3028
## Residuals 0.155033 35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the species:lnmass term, we see that the p value is large, so we have no evidence that the slopes differ between species - we can treat them as homogeneous. We could have had heterogeneous slopes, but we’ll get to interactions later in the lab!
So, first we have our F-tests.
Anova(neand_lm)
## Anova Table (Type II tests)
##
## Response: lnbrain
## Sum Sq Df F value Pr(>F)
## species 0.027553 1 6.2041 0.01749 *
## lnmass 0.130018 1 29.2764 4.262e-06 ***
## Residuals 0.159878 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the treatment and covariate matter.
Second, we might want to compare covariate adjusted groups and/or look at covariate adjusted means.
library(emmeans)

adj_means <- emmeans(neand_lm, specs = "species",
                     by = "lnmass")

# adjusted means
adj_means
## lnmass = 4.198:
## species emmean SE df lower.CL upper.CL
## neanderthal 7.272 0.02419 36 7.223 7.321
## recent 7.342 0.01250 36 7.317 7.367
##
## Confidence level used: 0.95
#comparisons
contrast(adj_means, method="pairwise", adjust="none")
## lnmass = 4.2:
## contrast estimate SE df t.ratio p.value
## neanderthal - recent -0.0703 0.0282 36 -2.491 0.0175
If you have an interaction, this method is no longer valid. Instead, you’ll need to monkey with your posthocs (if you even want to use them - often we don’t) to look at tests at different levels of the covariate.
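For instance (a sketch, not part of the original analysis), emmeans() takes an at argument that lets you evaluate the group comparison at several chosen values of the covariate - what you’d want if you had kept the interaction model:

# hypothetical: compare species at a few values of lnmass using the
# interaction model, where the species difference depends on lnmass
adj_means_by_mass <- emmeans(neand_int, specs = "species",
                             by = "lnmass",
                             at = list(lnmass = c(4.0, 4.1, 4.2, 4.3)))

contrast(adj_means_by_mass, method = "pairwise", adjust = "none")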
Visualization is funny, as you want to make parallel lines and also get the CIs right. Rather than rely on ggplot2 to do this natively, we need to futz around a bit with generating predictions.
neand_predict <- predict(neand_lm, interval = "confidence")
head(neand_predict)
## fit lwr upr
## 1 7.243614 7.203988 7.283240
## 2 7.223761 7.178077 7.269445
## 3 7.218798 7.171538 7.266059
## 4 7.263467 7.229347 7.297586
## 5 7.298209 7.271375 7.325042
## 6 7.293246 7.265628 7.320863
So, here we have fit values along with lower and upper confidence bounds. As we have not fed predict() any new data, these values line up with our neand data frame, so we can cbind it all together and then use these values to make a prediction plot.
neand <- cbind(neand, neand_predict)

ggplot(data = neand) +
  geom_point(mapping = aes(x = lnmass, y = lnbrain, color = species)) +
  geom_line(mapping = aes(x = lnmass, y = fit, color = species)) +
  geom_ribbon(aes(x = lnmass,
                  ymin = lwr,
                  ymax = upr,
                  group = species),
              fill = "lightgrey",
              alpha = 0.5) +
  theme_bw()
And there we have nice parallel lines with model predicted confidence intervals!
I’ve provided two data sets:

1) 18e4MoleRatLayabouts.csv, looking at how caste and mass affect the energy mole rats expend

2) 18q11ExploitedLarvalFish.csv, looking at how the status of a marine area - protected or not - influences the CV around age of maturity of a number of different fish (so, age is a predictor)
Using the following workflow, analyze these data sets (a starter sketch for the mole rat data follows the workflow).
# Load the data
# Perform a preliminary visualization
# Fit an ANCOVA model
# Test assumptions and modify model if needed
# Evaluate results
# Post-hocs if you can
# Visualize results
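To get you going, here’s a minimal sketch of the first few steps for the mole rat data. The column names below are assumptions based on the description - check names(molerats) after loading and adjust.

# load and peek at the data (columns assumed: caste, lnmass, lnenergy)
molerats <- read_csv("Labs/data/18e4MoleRatLayabouts.csv")
head(molerats)

# preliminary visualization
ggplot(data = molerats,
       aes(x = lnmass, y = lnenergy, color = caste)) +
  geom_point() +
  stat_smooth(method = "lm")

# fit an ANCOVA-style model, then check assumptions and evaluate
molerat_lm <- lm(lnenergy ~ caste + lnmass, data = molerats)
check_model(molerat_lm)
Anova(molerat_lm)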
Multiple linear regression is conceptually very similar to ANCOVA. Let’s use the Keeley fire severity plant richness data to see it in action.
keeley <- read_csv("Labs/data/Keeley_rawdata_select4.csv")
head(keeley)
## # A tibble: 6 × 8
## distance elev abiotic age hetero firesev cover rich
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53.4 1225 60.7 40 0.757 3.5 1.04 51
## 2 37.0 60 40.9 25 0.491 4.05 0.478 31
## 3 53.7 200 51.0 15 0.844 2.6 0.949 71
## 4 53.7 200 61.2 15 0.691 2.9 1.19 64
## 5 52.0 970 46.7 23 0.546 4.3 1.30 68
## 6 52.0 970 39.8 24 0.653 4 1.17 34
For our purposes, we’ll focus on fire severity and plant cover as predictors.
I’m not going to lie, visualizing multiple continuous variables is as much of an art as a science. One can use colors and sizes of points, or slice up the data into chunks and facet that. Here are a few examples.
keeley_base <- ggplot(data = keeley,
                      mapping = aes(x = cover, y = rich,
                                    color = firesev)) +
  geom_point() +
  scale_color_gradient(low = "yellow", high = "red")

keeley_base

# or split up
keeley_base +
  facet_wrap(vars(cut_number(firesev, 4)))
Note the new faceting notation. Cool, no?
What other plots can you make?
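One other option (purely illustrative) is to swap which variable gets faceted and which is mapped to an aesthetic:

# facet by plant cover instead, plotting richness against fire severity
ggplot(data = keeley,
       mapping = aes(x = firesev, y = rich)) +
  geom_point() +
  facet_wrap(vars(cut_number(cover, 4)))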
Fitting is straightforward for an additive MLR model. It’s just a linear model!
keeley_mlr <- lm(rich ~ firesev + cover, data=keeley)
As for assumptions, we have the usual
check_model(keeley_mlr)
But now we also need to think about how the residuals relate to each predictor. Fortunately, there’s still residualPlots() from car.
residualPlots(keeley_mlr, test=FALSE)
Odd bow shape - but not too bad. Maybe there’s an interaction? Maybe we want to log transform something? Who knows!
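If you wanted to chase down the transformation idea, one speculative option (not something we settle on here) is to refit with a log-transformed predictor and look at the residuals again:

# speculative alternative: log-transform cover and re-check the residuals
keeley_mlr_log <- lm(rich ~ firesev + log(cover), data = keeley)
residualPlots(keeley_mlr_log, test = FALSE)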
We also want to look at multicollinearity. There are two steps for that. First, the VIF, via vif():
vif(keeley_mlr)
## firesev cover
## 1.236226 1.236226
Not bad. We might also want to look at the correlation matrix. dplyr can help us here, as we want to select down to just the relevant columns.
library(dplyr)

keeley |>
  select(firesev, cover) |>
  cor()
## firesev cover
## firesev 1.0000000 -0.4371346
## cover -0.4371346 1.0000000
We evaluate the same way as usual
Anova(keeley_mlr)
## Anova Table (Type II tests)
##
## Response: rich
## Sum Sq Df F value Pr(>F)
## firesev 1258.9 1 6.5004 0.01254 *
## cover 711.6 1 3.6744 0.05854 .
## Residuals 16849.1 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And then the coefficients and \(R^2\):
summary(keeley_mlr)
##
## Call:
## lm(formula = rich ~ firesev + cover, data = keeley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.029 -11.583 -1.655 11.759 28.722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.9358 7.0437 7.657 2.45e-11 ***
## firesev -2.5308 0.9926 -2.550 0.0125 *
## cover 9.9105 5.1701 1.917 0.0585 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 87 degrees of freedom
## Multiple R-squared: 0.1703, Adjusted R-squared: 0.1513
## F-statistic: 8.93 on 2 and 87 DF, p-value: 0.0002968
Not an amazing fit, but the fire severity coefficient is clearly different from 0, while the cover effect is marginal.
This is where things get sticky. We have two main approaches. First, visualizing with component + residual plots
crPlots(keeley_mlr, smooth=FALSE)
But the values on the y axis are… odd. We get a sense of what’s going on and the scatter after accounting for our predictor of interest, but we might want to look at, say, the effect of one variable evaluated at the mean of the other.
For that, we have visreg with gg = TRUE to generate ggplots.
library(visreg)
visreg(keeley_mlr, "firesev", gg=TRUE)
visreg(keeley_mlr, "cover", gg=TRUE)
Now the axes make far more sense, and we have a sense of the relationship.
We can actually whip this up on our own by building a grid of values - crossing() from tidyr, or a simple data.frame() - that holds firesev at its median while cover varies, and then using predict().
k_med_firesev <- data.frame(firesev = median(keeley$firesev),
                            cover = seq(0, 1.5, length.out = 100))

k_med_firesev <- cbind(k_med_firesev, predict(keeley_mlr,
                                              newdata = k_med_firesev,
                                              interval = "confidence"))

ggplot() +
  geom_point(data = keeley, mapping = aes(x = cover, y = rich)) +
  geom_line(data = k_med_firesev, mapping = aes(x = cover, y = fit), color = "blue") +
  geom_ribbon(data = k_med_firesev, mapping = aes(x = cover,
                                                  ymin = lwr,
                                                  ymax = upr),
              fill = "lightgrey", alpha = 0.5)
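As an aside, the same grid could be built with tidyr’s crossing(), which scales nicely once more than one variable needs to vary - a small sketch:

library(tidyr)

# equivalent grid: every combination of the supplied values
k_med_firesev_alt <- crossing(firesev = median(keeley$firesev),
                              cover = seq(0, 1.5, length.out = 100))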
OK, here are two datasets to work with:
1) planktonSummary.csv, showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how chlorophyll (CHLFa) is affected by other predictors. Note, you’ll probably want to log transform CHLFa.

2) SwaddleWestNile2002NCEAS_shortnames.csv, about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?

Using the following workflow, analyze these data sets (a starter sketch for the plankton data follows the workflow).
# Load the data
# Perform a preliminary visualization. Play with this and choose two predictors
# Fit a MLR model
# Test assumptions and modify model if needed
# Evaluate results
# Visualize results
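To get started on the plankton data, here is a hedged sketch of the first steps. The only column name taken from the description is CHLFa; inspect the rest after loading, and the file path assumes the same data folder as the other labs.

# load and inspect (path assumed to match the other data files)
plankton <- read_csv("Labs/data/planktonSummary.csv")
names(plankton)

# log transform chlorophyll (assumes CHLFa is strictly positive)
plankton <- plankton |>
  mutate(log_chlfa = log(CHLFa))

# scan pairwise correlations among numeric columns to help pick predictors
plankton |>
  select(where(is.numeric)) |>
  cor(use = "pairwise.complete.obs")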
Let’s start with our fire severity data. We’ll ask if age interacts with fire severity to shape plant species richness. First, we can visualize this. How do we visualize? Well, one simple way is to split points into groups by one of your two terms and see if patterns appear different.
ggplot(keeley,
       aes(x = firesev, y = rich)) +
  geom_point() +
  facet_wrap(~cut_interval(age, 3))
Note, I’m using cut_interval, as we often want to split our intervals up evenly, rather than having big boluses of data in certain groups. We could also apply stat_smooth to help give us an idea of whether things are changing in these different groups.
ggplot(keeley,
       aes(x = age, y = rich)) +
  geom_point() +
  facet_wrap(~cut_interval(firesev, 3)) +
  stat_smooth(method = "lm")
Note - the number of groups will also influence how easily you can see interactions in the data. Try 2, 3, 4, 5, and 6 (see the sketch below for one way to do this quickly). It’s going to depend on sample size and the spread of the predictor.
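One way to scan several group counts at once - a sketch assuming you have purrr and patchwork installed; the helper function here is made up for illustration:

library(purrr)
library(patchwork)

# hypothetical helper: rebuild the faceted plot for a given number of groups
plot_by_n_groups <- function(n_groups) {
  ggplot(keeley,
         aes(x = age, y = rich)) +
    geom_point() +
    facet_wrap(~cut_interval(firesev, n_groups)) +
    stat_smooth(method = "lm") +
    labs(title = paste(n_groups, "groups"))
}

# stack the 2-, 3-, and 4-group versions for comparison
map(c(2, 3, 4), plot_by_n_groups) |>
  wrap_plots(ncol = 1)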
Since we have a suspected interaction effect here, let’s look at the model
keeley_int <- lm(rich ~ firesev*age, data = keeley)
Before we even look at the usual suspects, let’s inspect our VIFs:
vif(keeley_int)
## firesev age firesev:age
## 4.879394 7.967244 15.800768
Whoah! 15! This might give you pause, but, remember, it’s the VIF for an interaction with its component linear predictors. So, no real problem here. If the VIF for one of your linear predictors were high, then maybe worry, but high VIFs for interactions and other nonlinear transforms of predictors already in the model are not a problem.
Let’s evaluate the rest.
check_model(keeley_int)
residualPlots(keeley_int)
## Test stat Pr(>|Test stat|)
## firesev 0.6176 0.5385
## age -0.6982 0.4869
## Tukey test 1.3802 0.1675
Meh, not bad.
As before, I can use F-tables to evaluate my model.
Anova(keeley_int)
## Anova Table (Type II tests)
##
## Response: rich
## Sum Sq Df F value Pr(>F)
## firesev 1361.5 1 7.8768 0.0061927 **
## age 466.9 1 2.7013 0.1039152
## firesev:age 2228.7 1 12.8938 0.0005479 ***
## Residuals 14865.1 86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I can also summarize as before to get coefficients, etc. Note, when predictors are centered, each additive coefficient, given the interaction, is the effect of that predictor on the response at the average level of the other predictor. In the uncentered model below, each additive coefficient is instead the effect of that predictor when the other variable = 0. You can back-calculate between the two if you want to convince yourself (a sketch follows the summary).
summary(keeley_int)
##
## Call:
## lm(formula = rich ~ firesev * age, data = keeley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.7559 -10.1586 0.0662 10.7169 30.1410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.93524 7.85054 5.469 4.37e-07 ***
## firesev 3.10564 1.86304 1.667 0.099157 .
## age 0.82681 0.31303 2.641 0.009809 **
## firesev:age -0.23024 0.06412 -3.591 0.000548 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.15 on 86 degrees of freedom
## Multiple R-squared: 0.268, Adjusted R-squared: 0.2425
## F-statistic: 10.5 on 3 and 86 DF, p-value: 5.934e-06
Our \(R^2\) isn’t great - but, eh, that’s what we’ve got!
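To see the centering point from above in action, here’s a sketch (not part of the original lab) that refits the interaction model on centered predictors:

library(dplyr)

# center both predictors, then refit the same interaction model
keeley_cent <- keeley |>
  mutate(firesev_c = firesev - mean(firesev),
         age_c = age - mean(age))

keeley_int_cent <- lm(rich ~ firesev_c * age_c, data = keeley_cent)

# the additive coefficients now give each predictor's effect at the
# average of the other predictor; the interaction coefficient is unchanged
summary(keeley_int_cent)

# centering also deflates the VIF of the interaction term
vif(keeley_int_cent)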
So, how do we visualize? visreg actually again has us covered in a few ways straight off the bat.
visreg(keeley_int, "age", by = "firesev", gg=TRUE)
We can change the breakpoints with a breaks argument, but, in general, this makes a nice structured ggplot for us to work with and modify further.
We can also use visreg to look at the full range of things using visreg2d.
visreg2d(keeley_int, "age", "firesev", plot.type="gg")
Nice, I had to specify an x and y variable (this is good for when I have >1 interaction), but it’s still pretty convenient.
Now, you might want to do this yourself - for example, what if you had three predictors and you wanted to look at all of them? Let’s try that, actually.
keeley_elev_int <- lm(rich ~ age*abiotic*firesev, data = keeley)
Anova(keeley_elev_int)
## Anova Table (Type II tests)
##
## Response: rich
## Sum Sq Df F value Pr(>F)
## age 206.8 1 1.4217 0.23655
## abiotic 2647.7 1 18.2012 5.295e-05 ***
## firesev 662.8 1 4.5564 0.03578 *
## age:abiotic 250.9 1 1.7251 0.19271
## age:firesev 924.6 1 6.3559 0.01364 *
## abiotic:firesev 142.2 1 0.9777 0.32567
## age:abiotic:firesev 3.3 1 0.0229 0.88011
## Residuals 11928.4 82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we have an age:firesev interaction, but an additive effect of abiotic. Neat! But, man, visualization… ugh. Fortunately, there’s a nice solution. Let’s make some fit lines using modelr. We’re going to use data_grid to make a grid of predictor values, and then use the predict function to get confidence intervals of the fit. Note, interval = "prediction" would instead give us the full range of variability in individual observations - another topic for another time.
library(modelr)

fit_data <- keeley %>%
  data_grid(abiotic = round(seq_range(abiotic, 3), 2),
            age = seq_range(age, 4),
            firesev = seq_range(firesev, 100))

fit_lines <- fit_data %>%
  predict(keeley_elev_int, newdata = ., interval = "confidence") %>%
  as_tibble() %>%
  rename(rich = fit) %>%
  bind_cols(fit_data)
Now, we can plot this!
library(stringr)

ggplot(fit_lines,
       aes(x = firesev, y = rich,
           ymin = lwr, ymax = upr,
           group = factor(age),
           color = factor(age))) +
  geom_ribbon(alpha = 0.3, color = NA) +
  geom_line() +
  facet_wrap(~str_c("abiotic: ", abiotic))
OK, let’s use the same datasets as before. Did you have any interactions?
1) planktonSummary.csv, showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how chlorophyll (CHLFa) is affected by other predictors. Note, you’ll probably want to log transform CHLFa.

2) SwaddleWestNile2002NCEAS_shortnames.csv, about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?

Using the following workflow, analyze these data sets.
# Load the data
# Perform a preliminary visualization. Play with this and choose two predictors
# Fit a MLR model
# Test assumptions and modify model if needed
# Evaluate results
# Visualize results