Time for the general linear model.
Combining categorical and continuous variables is not that different from multiway ANOVA. To start with, let’s look at the neanderthal data.
library(readr)

neand <- read_csv("Labs/data/18q09NeanderthalBrainSize.csv")
head(neand)
## # A tibble: 6 × 3
## lnmass lnbrain species
## <dbl> <dbl> <chr>
## 1 4 7.19 recent
## 2 3.96 7.23 recent
## 3 3.95 7.26 recent
## 4 4.04 7.23 recent
## 5 4.11 7.22 recent
## 6 4.1 7.24 recent
We can clearly see both the categorical variable we’re interested in, and the covariate.
To get a preliminary sense of what’s going on here, do some exploratory visualization with ggplot2, why doncha!
library(ggplot2)

ggplot(data = neand,
       aes(x = lnmass,
           y = lnbrain,
           color = species)) +
  geom_point() +
  stat_smooth(method = "lm")
Now, the CIs are going to be off as this wasn’t all tested in the same model, but you begin to get a sense of whether things are parallel or not, and whether this covariate is important.
What other plots might you produce?
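One possibility (a hypothetical example, not part of the original lab) is to look at how much the two species overlap in the covariate itself, since covariate overlap matters for interpreting adjusted means:

# check how much the species overlap in body mass (the covariate)
ggplot(data = neand,
       aes(x = species, y = lnmass)) +
  geom_boxplot()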
As this is a general linear model, good olde lm() is still there for us.
neand_lm <- lm(lnbrain ~ species + lnmass, data=neand)
In addition to the usual tests, we need to make sure that the slopes really are parallel. We do that by fitting a model with an interaction and testing it (and, well, if there were an interaction, might that not be interesting in its own right?).
First, the usual:

library(performance)

check_model(neand_lm)
Looks pretty good.
To test the presumption of parallel slopes:

library(car)

neand_int <- lm(lnbrain ~ species*lnmass, data = neand)

Anova(neand_int)
## Anova Table (Type II tests)
##
## Response: lnbrain
## Sum Sq Df F value Pr(>F)
## species 0.027553 1 6.2203 0.0175 *
## lnmass 0.130018 1 29.3527 4.523e-06 ***
## species:lnmass 0.004845 1 1.0938 0.3028
## Residuals 0.155033 35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the species:lnmass term, we see that the p value is large, so we have no evidence that the slopes differ between species - we can treat them as homogeneous. We could have had heterogeneous slopes, but we’ll get to interactions later in the lab!
So, first we have our F-tests.
Anova(neand_lm)
## Anova Table (Type II tests)
##
## Response: lnbrain
## Sum Sq Df F value Pr(>F)
## species 0.027553 1 6.2041 0.01749 *
## lnmass 0.130018 1 29.2764 4.262e-06 ***
## Residuals 0.159878 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the treatment and covariate matter.
Second, we might want to compare covariate adjusted groups and/or look at covariate adjusted means.
library(emmeans)

adj_means <- emmeans(neand_lm, specs = "species",
                     by = "lnmass")

# adjusted means
adj_means
## lnmass = 4.198:
## species emmean SE df lower.CL upper.CL
## neanderthal 7.272 0.02419 36 7.223 7.321
## recent 7.342 0.01250 36 7.317 7.367
##
## Confidence level used: 0.95
#comparisons
contrast(adj_means, method="pairwise", adjust="none")
## lnmass = 4.2:
## contrast estimate SE df t.ratio p.value
## neanderthal - recent -0.0703 0.0282 36 -2.491 0.0175
If you have an interaction, this method is no longer valid. Instead, you’ll need to monkey with your posthocs (if you even want to use them - often we don’t) to look at tests at different levels of the covariate.
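For instance (a sketch, not part of the original analysis), emmeans() takes an at argument that lets you evaluate the group comparison at several chosen values of the covariate - what you’d want if you had kept the interaction model:

# hypothetical: compare species at a few values of lnmass using the
# interaction model, where the species difference depends on lnmass
adj_means_by_mass <- emmeans(neand_int, specs = "species",
                             by = "lnmass",
                             at = list(lnmass = c(4.0, 4.1, 4.2, 4.3)))

contrast(adj_means_by_mass, method = "pairwise", adjust = "none")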
Visualization is funny, as you want to make parallel lines and also get the CIs right. Rather than rely on ggplot2 to do this natively, we need to futz around a bit with generating predictions.
neand_predict <- predict(neand_lm, interval = "confidence")
head(neand_predict)
## fit lwr upr
## 1 7.243614 7.203988 7.283240
## 2 7.223761 7.178077 7.269445
## 3 7.218798 7.171538 7.266059
## 4 7.263467 7.229347 7.297586
## 5 7.298209 7.271375 7.325042
## 6 7.293246 7.265628 7.320863
So, here we have fit values along with lower and upper confidence bounds. As we have not fed predict() any new data, these values line up with our neand data frame, so we can cbind it all together and then use these values to make a prediction plot.
neand <- cbind(neand, neand_predict)

ggplot(data = neand) +
  geom_point(mapping = aes(x = lnmass, y = lnbrain, color = species)) +
  geom_line(mapping = aes(x = lnmass, y = fit, color = species)) +
  geom_ribbon(aes(x = lnmass,
                  ymin = lwr,
                  ymax = upr,
                  group = species),
              fill = "lightgrey",
              alpha = 0.5) +
  theme_bw()
And there we have nice parallel lines with model predicted confidence intervals!
I’ve provided two data sets:

1) 18e4MoleRatLayabouts.csv, looking at how caste and mass affect the energy mole rats expend

2) 18q11ExploitedLarvalFish.csv, looking at how the status of a marine area - protected or not - influences the CV around age of maturity of a number of different fish (so, age is a predictor)
Using the following workflow, analyze these data sets (a starter sketch for the mole rat data follows the workflow).
# Load the data
# Perform a preliminary visualization
# Fit an ANCOVA model
# Test assumptions and modify model if needed
# Evaluate results
# Post-hocs if you can
# Visualize results
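To get you going, here’s a minimal sketch of the first few steps for the mole rat data. The column names below are assumptions based on the description - check names(molerats) after loading and adjust.

# load and peek at the data (columns assumed: caste, lnmass, lnenergy)
molerats <- read_csv("Labs/data/18e4MoleRatLayabouts.csv")
head(molerats)

# preliminary visualization
ggplot(data = molerats,
       aes(x = lnmass, y = lnenergy, color = caste)) +
  geom_point() +
  stat_smooth(method = "lm")

# fit an ANCOVA-style model, then check assumptions and evaluate
molerat_lm <- lm(lnenergy ~ caste + lnmass, data = molerats)
check_model(molerat_lm)
Anova(molerat_lm)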
Multiple linear regression is conceptually very similar to ANCOVA. Let’s use the Keeley fire severity plant richness data to see it in action.
keeley <- read_csv("Labs/data/Keeley_rawdata_select4.csv")
head(keeley)
## # A tibble: 6 × 8
## distance elev abiotic age hetero firesev cover rich
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53.4 1225 60.7 40 0.757 3.5 1.04 51
## 2 37.0 60 40.9 25 0.491 4.05 0.478 31
## 3 53.7 200 51.0 15 0.844 2.6 0.949 71
## 4 53.7 200 61.2 15 0.691 2.9 1.19 64
## 5 52.0 970 46.7 23 0.546 4.3 1.30 68
## 6 52.0 970 39.8 24 0.653 4 1.17 34
For our purposes, we’ll focus on fire severity and plant cover as predictors.
I’m not going to lie, visualizing multiple continuous variables is as much of an art as a science. One can use colors and sizes of points, or slice up the data into chunks and facet that. Here are a few examples.
keeley_base <- ggplot(data = keeley,
                      mapping = aes(x = cover, y = rich,
                                    color = firesev)) +
  geom_point() +
  scale_color_gradient(low = "yellow", high = "red")

keeley_base

# or split up
keeley_base +
  facet_wrap(vars(cut_number(firesev, 4)))
Note the new faceting notation. Cool, no?
What other plots can you make?
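One other option (purely illustrative) is to swap which variable gets faceted and which is mapped to an aesthetic:

# facet by plant cover instead, plotting richness against fire severity
ggplot(data = keeley,
       mapping = aes(x = firesev, y = rich)) +
  geom_point() +
  facet_wrap(vars(cut_number(cover, 4)))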
Fitting is straightforward for an additive MLR model. It’s just a linear model!
keeley_mlr <- lm(rich ~ firesev + cover, data=keeley)
As for assumptions, we have the usual
check_model(keeley_mlr)
But now we also need to think about how the residuals relate to each predictor. Fortunately, there’s still residualPlots() from car.
residualPlots(keeley_mlr, test=FALSE)
Odd bow shape - but not too bad. Maybe there’s an interaction? Maybe we want to log transform something? Who knows!
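If you wanted to chase down the transformation idea, one speculative option (not something we settle on here) is to refit with a log-transformed predictor and look at the residuals again:

# speculative alternative: log-transform cover and re-check the residuals
keeley_mlr_log <- lm(rich ~ firesev + log(cover), data = keeley)
residualPlots(keeley_mlr_log, test = FALSE)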
We also want to look at multicollinearity. There are two steps for that. First, the VIF, via vif():
vif(keeley_mlr)
## firesev cover
## 1.236226 1.236226
Not bad. We might also want to look at the correlation matrix. dplyr can help us here, as we want to select down to just the relevant columns.
library(dplyr)

keeley |>
  select(firesev, cover) |>
  cor()
## firesev cover
## firesev 1.0000000 -0.4371346
## cover -0.4371346 1.0000000
We evaluate the same way as usual
Anova(keeley_mlr)
## Anova Table (Type II tests)
##
## Response: rich
## Sum Sq Df F value Pr(>F)
## firesev 1258.9 1 6.5004 0.01254 *
## cover 711.6 1 3.6744 0.05854 .
## Residuals 16849.1 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And then the coefficients and \(R^2\):
summary(keeley_mlr)
##
## Call:
## lm(formula = rich ~ firesev + cover, data = keeley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.029 -11.583 -1.655 11.759 28.722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.9358 7.0437 7.657 2.45e-11 ***
## firesev -2.5308 0.9926 -2.550 0.0125 *
## cover 9.9105 5.1701 1.917 0.0585 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 87 degrees of freedom
## Multiple R-squared: 0.1703, Adjusted R-squared: 0.1513
## F-statistic: 8.93 on 2 and 87 DF, p-value: 0.0002968
Not an amazing fit, but the fire severity coefficient is clearly different from 0, while the cover effect is marginal.
This is where things get sticky. We have two main approaches. First, visualizing with component + residual plots
crPlots(keeley_mlr, smooth=FALSE)
But the values on the y axis are… odd. We get a sense of what’s going on and the scatter after accounting for our predictor of interest, but we might want to look at, say, the effect of one variable evaluated at the mean of the other.
For that, we have visreg with gg = TRUE to generate ggplots.
library(visreg)
visreg(keeley_mlr, "firesev", gg=TRUE)
visreg(keeley_mlr, "cover", gg=TRUE)
Now the axes make far more sense, and we have a sense of the relationship.
We can actually whip this up on our own by building a grid of values - crossing() from tidyr, or a simple data.frame() - that holds firesev at its median while cover varies, and then using predict().
k_med_firesev <- data.frame(firesev = median(keeley$firesev),
                            cover = seq(0, 1.5, length.out = 100))

k_med_firesev <- cbind(k_med_firesev, predict(keeley_mlr,
                                              newdata = k_med_firesev,
                                              interval = "confidence"))

ggplot() +
  geom_point(data = keeley, mapping = aes(x = cover, y = rich)) +
  geom_line(data = k_med_firesev, mapping = aes(x = cover, y = fit), color = "blue") +
  geom_ribbon(data = k_med_firesev, mapping = aes(x = cover,
                                                  ymin = lwr,
                                                  ymax = upr),
              fill = "lightgrey", alpha = 0.5)
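As an aside, the same grid could be built with tidyr’s crossing(), which scales nicely once more than one variable needs to vary - a small sketch:

library(tidyr)

# equivalent grid: every combination of the supplied values
k_med_firesev_alt <- crossing(firesev = median(keeley$firesev),
                              cover = seq(0, 1.5, length.out = 100))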
OK, here are two datasets to work with:
1) planktonSummary.csv, showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how chlorophyll (CHLFa) is affected by other predictors. Note, you’ll probably want to log transform CHLFa.

2) SwaddleWestNile2002NCEAS_shortnames.csv, about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?

Using the following workflow, analyze these data sets (a starter sketch for the plankton data follows the workflow).
# Load the data
# Perform a preliminary visualization. Play with this and choose two predictors
# Fit a MLR model
# Test assumptions and modify model if needed
# Evaluate results
# Visualize results
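To get started on the plankton data, here is a hedged sketch of the first steps. The only column name taken from the description is CHLFa; inspect the rest after loading, and the file path assumes the same data folder as the other labs.

# load and inspect (path assumed to match the other data files)
plankton <- read_csv("Labs/data/planktonSummary.csv")
names(plankton)

# log transform chlorophyll (assumes CHLFa is strictly positive)
plankton <- plankton |>
  mutate(log_chlfa = log(CHLFa))

# scan pairwise correlations among numeric columns to help pick predictors
plankton |>
  select(where(is.numeric)) |>
  cor(use = "pairwise.complete.obs")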
Let’s start with our fire severity data. We’ll ask if age interacts with fire severity to shape plant species richness. First, we can visualize this. How do we visualize? Well, one simple way is to split points into groups by one of your two terms and see if patterns appear different.
ggplot(keeley,
       aes(x = firesev, y = rich)) +
  geom_point() +
  facet_wrap(~cut_interval(age, 3))
Note, I’m using cut_interval, as we often want to split our intervals up evenly, rather than having big boluses of data in certain groups. We could also apply stat_smooth to help give us an idea of whether things are changing in these different groups.
ggplot(keeley,
       aes(x = age, y = rich)) +
  geom_point() +
  facet_wrap(~cut_interval(firesev, 3)) +
  stat_smooth(method = "lm")
Note - the number of groups will also influence how easily you can see interactions in the data. Try 2, 3, 4, 5, and 6 (see the sketch below for one way to do this quickly). It’s going to depend on sample size and the spread of the predictor.
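One way to scan several group counts at once - a sketch assuming you have purrr and patchwork installed; the helper function here is made up for illustration:

library(purrr)
library(patchwork)

# hypothetical helper: rebuild the faceted plot for a given number of groups
plot_by_n_groups <- function(n_groups) {
  ggplot(keeley,
         aes(x = age, y = rich)) +
    geom_point() +
    facet_wrap(~cut_interval(firesev, n_groups)) +
    stat_smooth(method = "lm") +
    labs(title = paste(n_groups, "groups"))
}

# stack the 2-, 3-, and 4-group versions for comparison
map(c(2, 3, 4), plot_by_n_groups) |>
  wrap_plots(ncol = 1)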
Since we have a suspected interaction effect here, let’s look at the model
keeley_int <- lm(rich ~ firesev*age, data = keeley)
Before we even look at the usual suspects, let’s inspect our VIFs:
vif(keeley_int)
## firesev age firesev:age
## 4.879394 7.967244 15.800768
Whoah! 15! This might give you pause, but, remember, it’s the VIF for an interaction with its component linear predictors. So, no real problem here. If the VIF for one of your linear predictors were high, then maybe worry, but high VIFs for interactions and other nonlinear transforms of predictors already in the model are not a problem.
Let’s evaluate the rest.
check_model(keeley_int)
residualPlots(keeley_int)
## Test stat Pr(>|Test stat|)
## firesev 0.6176 0.5385
## age -0.6982 0.4869
## Tukey test 1.3802 0.1675
Meh, not bad.
As before, I can use F-tables to evaluate my model.
Anova(keeley_int)
## Anova Table (Type II tests)
##
## Response: rich
## Sum Sq Df F value Pr(>F)
## firesev 1361.5 1 7.8768 0.0061927 **
## age 466.9 1 2.7013 0.1039152
## firesev:age 2228.7 1 12.8938 0.0005479 ***
## Residuals 14865.1 86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I can also summarize as before to get coefficients, etc. Note, when predictors are centered, each additive coefficient, given the interaction, is the effect of that predictor on the response at the average level of the other predictor. In the uncentered model below, each additive coefficient is instead the effect of that predictor when the other variable = 0. You can back-calculate between the two if you want to convince yourself (a sketch follows the summary).
summary(keeley_int)
##
## Call:
## lm(formula = rich ~ firesev * age, data = keeley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.7559 -10.1586 0.0662 10.7169 30.1410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.93524 7.85054 5.469 4.37e-07 ***
## firesev 3.10564 1.86304 1.667 0.099157 .
## age 0.82681 0.31303 2.641 0.009809 **
## firesev:age -0.23024 0.06412 -3.591 0.000548 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.15 on 86 degrees of freedom
## Multiple R-squared: 0.268, Adjusted R-squared: 0.2425
## F-statistic: 10.5 on 3 and 86 DF, p-value: 5.934e-06
Our \(R^2\) isn’t great - but, eh, that’s what we’ve got!
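To see the centering point from above in action, here’s a sketch (not part of the original lab) that refits the interaction model on centered predictors:

library(dplyr)

# center both predictors, then refit the same interaction model
keeley_cent <- keeley |>
  mutate(firesev_c = firesev - mean(firesev),
         age_c = age - mean(age))

keeley_int_cent <- lm(rich ~ firesev_c * age_c, data = keeley_cent)

# the additive coefficients now give each predictor's effect at the
# average of the other predictor; the interaction coefficient is unchanged
summary(keeley_int_cent)

# centering also deflates the VIF of the interaction term
vif(keeley_int_cent)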
So, how do we visualize? visreg actually again has us covered in a few ways straight off the bat.
visreg(keeley_int, "age", by = "firesev", gg=TRUE)
We can change the breakpoints with a breaks argument, but, in general, this makes a nice structured ggplot for us to work with and modify further.
We can also use visreg to look at the full range of things using visreg2d.
visreg2d(keeley_int, "age", "firesev", plot.type="gg")
Nice, I had to specify an x and y variable (this is good for when I have >1 interaction), but it’s still pretty convenient.
Now, you might want to do this yourself - for example, what if you had three predictors and you wanted to look at all of them? Let’s try that, actually.
keeley_elev_int <- lm(rich ~ age*abiotic*firesev, data = keeley)
Anova(keeley_elev_int)
## Anova Table (Type II tests)
##
## Response: rich
## Sum Sq Df F value Pr(>F)
## age 206.8 1 1.4217 0.23655
## abiotic 2647.7 1 18.2012 5.295e-05 ***
## firesev 662.8 1 4.5564 0.03578 *
## age:abiotic 250.9 1 1.7251 0.19271
## age:firesev 924.6 1 6.3559 0.01364 *
## abiotic:firesev 142.2 1 0.9777 0.32567
## age:abiotic:firesev 3.3 1 0.0229 0.88011
## Residuals 11928.4 82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we have an age:firesev interaction, but an additive effect of abiotic. Neat! But, man, visualization… ugh. Fortunately, there’s a nice solution. Let’s make some fit lines using modelr. We’re going to use data_grid to make a grid of predictor values, and then use the predict function to get confidence intervals of the fit. Note, interval = "prediction" would instead give us the full range of variability in individual observations - another topic for another time.
library(modelr)

fit_data <- keeley %>%
  data_grid(abiotic = round(seq_range(abiotic, 3), 2),
            age = seq_range(age, 4),
            firesev = seq_range(firesev, 100))

fit_lines <- fit_data %>%
  predict(keeley_elev_int, newdata = ., interval = "confidence") %>%
  as_tibble() %>%
  rename(rich = fit) %>%
  bind_cols(fit_data)
Now, we can plot this!
library(stringr)

ggplot(fit_lines,
       aes(x = firesev, y = rich,
           ymin = lwr, ymax = upr,
           group = factor(age),
           color = factor(age))) +
  geom_ribbon(alpha = 0.3, color = NA) +
  geom_line() +
  facet_wrap(~str_c("abiotic: ", abiotic))
OK, let’s use the same datasets as before. Did you have any interactions?
1) planktonSummary.csv, showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how chlorophyll (CHLFa) is affected by other predictors. Note, you’ll probably want to log transform CHLFa.

2) SwaddleWestNile2002NCEAS_shortnames.csv, about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?

Using the following workflow, analyze these data sets.
# Load the data
# Perform a preliminary visualization. Play with this and choose two predictors
# Fit a MLR model
# Test assumptions and modify model if needed
# Evaluate results
# Visualize results