Let’s begin today with our fire severity dataset from before.
library(tidyverse)
library(readr)
keeley <- read_csv("./data/Keeley_rawdata_select4.csv")
To show that many variables have a relationship with plant richness, we can reshape the data a bit and plot it
keeley %>%
  select(rich, abiotic, firesev, cover) %>%
  gather(variable, value, -rich) %>%
  ggplot(mapping = aes(x = value, y = rich)) +
  facet_wrap(~variable, strip.position = "bottom", scales = "free_x") +
  geom_point() +
  stat_smooth(method = "lm", color = "red") +
  xlab("") +
  theme_bw(base_size = 17)
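As an aside, gather() still works, but it has been superseded in tidyr. If you prefer the current interface, the same reshape can be written with pivot_longer() - a sketch you can pipe into the same ggplot call as above:
keeley %>%
  select(rich, abiotic, firesev, cover) %>%
  pivot_longer(-rich, names_to = "variable", values_to = "value")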
Or we can look at everything
pairs(keeley)
OK, what’s the right way to go?
We can start with fitting a few basic models
k_abiotic <- lm(rich ~ abiotic, data=keeley)
k_firesev <- lm(rich ~ firesev, data=keeley)
k_cover <- lm(rich ~ cover, data=keeley)
Any one of these models has a log likelihood
logLik(k_abiotic)
## 'log Lik.' -358.1043 (df=3)
and from that we can get its AIC
AIC(k_abiotic)
## [1] 722.2085
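Under the hood, AIC is nothing magical - it is just -2 times the log likelihood plus a penalty of 2 for each estimated parameter (here the intercept, slope, and residual variance - the df=3 above). We can verify by hand:
# AIC by hand: -2 * log likelihood + 2 * number of estimated parameters
ll <- logLik(k_abiotic)
-2 * as.numeric(ll) + 2 * attr(ll, "df")
## [1] 722.2085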
Simple, no?
AIC(k_abiotic)
## [1] 722.2085
AIC(k_firesev)
## [1] 736.0338
AIC(k_cover)
## [1] 738.796
OK, so, k_abiotic has the lowest AIC. But what if we want more - an AICc, or to more formally compare them? For that, we have the AICcmodavg library.
library(AICcmodavg)
AICc(k_abiotic)
## [1] 722.4876
AICc(k_firesev)
## [1] 736.3129
AICc(k_cover)
## [1] 739.075
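For reference, AICc is just AIC with a small-sample correction: 2k(k+1)/(n - k - 1) gets added, where n is the sample size and k the number of estimated parameters, a penalty that vanishes as n grows. Verifying by hand:
# AICc by hand: AIC + 2k(k+1)/(n - k - 1)
n <- nrow(keeley)                    # 90 plots
k <- attr(logLik(k_abiotic), "df")   # 3
AIC(k_abiotic) + (2 * k * (k + 1)) / (n - k - 1)
## [1] 722.4876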
Note that the AICc doesn’t change our conclusions here. If we want a larger analysis, we can get an AIC table using aictab
aictab(list(k_abiotic, k_firesev, k_cover),
       modnames = c("abiotic", "firesev", "cover"))
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## abiotic 3 722.49 0.00 1 1 -358.10
## firesev 3 736.31 13.83 0 1 -365.02
## cover 3 739.08 16.59 0 1 -366.40
Now we have more information - weights, delta AICcs, and more.
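Those weights are easy to compute by hand: each model gets exp(-delta_AICc/2), normalized so the whole set sums to one. A sketch of what aictab did:
# Akaike weights by hand
aiccs <- c(AICc(k_abiotic), AICc(k_firesev), AICc(k_cover))
delta <- aiccs - min(aiccs)              # the Delta_AICc column
exp(-delta / 2) / sum(exp(-delta / 2))   # weights: ~0.999, ~0.001, ~0.0002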
Note, if you want to see how this differs from BIC, there is also bictab()
bictab(list(k_abiotic, k_firesev, k_cover),
       modnames = c("abiotic", "firesev", "cover"))
##
## Model selection based on BIC:
##
## K BIC Delta_BIC BICWt Cum.Wt LL
## abiotic 3 729.71 0.00 1 1 -358.10
## firesev 3 743.53 13.83 0 1 -365.02
## cover 3 746.30 16.59 0 1 -366.40
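The two tables agree here, as the only difference between the criteria is the penalty per parameter: 2 for AIC, log(n) for BIC, so BIC punishes extra parameters more harshly as sample size grows. By hand:
# BIC by hand: -2 * log likelihood + log(n) * number of parameters
ll <- logLik(k_abiotic)
-2 * as.numeric(ll) + log(nrow(keeley)) * attr(ll, "df")   # the 729.71 above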
Consider the following set of alternate models
keeley_soil_fire <- lm(rich ~ elev + abiotic + hetero +
                         distance + firesev,
                       data = keeley)
keeley_plant_fire <- lm(rich ~ distance + firesev +
                          age + cover,
                        data = keeley)
keeley_soil_plant <- lm(rich ~ elev + abiotic + hetero +
                          age + cover,
                        data = keeley)
For use in the following functions, we give the list elements names so they can be used later.
mods <- list(soil_fire = keeley_soil_fire,
             plant_fire = keeley_plant_fire,
             soil_plant = keeley_soil_plant)
aictab(mods)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## soil_fire 7 692.55 0.00 0.88 0.88 -338.59
## soil_plant 7 696.57 4.01 0.12 1.00 -340.60
## plant_fire 6 709.69 17.13 0.00 1.00 -348.34
Note how the list names got subbed in without us having to supply a modnames argument. Nice, no?
To look at the importance of any one variable, we can use the importance() function, which will give us the relative variable importance across all models.
importance(mods, "firesev")
## Error in importance.AIClm(mods, "firesev"):
## Importance values are only meaningful when the number of models with and without parameter are equal
And this gives us an important lesson - to calculate variable importance, we need a balanced model set. We can resolve that in this case by adding a null model, but this does make one think much more carefully about the models used.
keeley_null <- lm(rich ~ 1, data = keeley)
mods <- list(soil_fire = keeley_soil_fire,
             plant_fire = keeley_plant_fire,
             soil_plant = keeley_soil_plant,
             null = keeley_null)
importance(mods, "firesev")
##
## Importance values of 'firesev':
##
## w+ (models including parameter): 0.88
## w- (models excluding parameter): 0.12
So, a variable importance of 88%.
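That 88% is nothing more than the summed weight of the models containing firesev. We can pull it straight out of the AICc table ourselves - a sketch, using the fact that aictab returns a data frame with Modnames and AICcWt columns:
# variable importance by hand: sum the weights of models with the term
tab <- aictab(mods)
has_firesev <- tab$Modnames %in% c("soil_fire", "plant_fire")
sum(tab$AICcWt[has_firesev])    # w+ (models with firesev): ~0.88
sum(tab$AICcWt[!has_firesev])   # w- (models without):      ~0.12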
To get the model-averaged coefficient, we use modavgShrink()
modavgShrink(mods, parm="firesev")
##
## Multimodel inference on "firesev" based on AICc
##
## AICc table used to obtain model-averaged estimate with shrinkage:
##
## K AICc Delta_AICc AICcWt Estimate SE
## soil_fire 7 692.55 0.00 0.88 -1.89 0.73
## plant_fire 6 709.69 17.13 0.00 -1.39 0.92
## soil_plant 7 696.57 4.01 0.12 0.00 0.00
## null 2 747.25 54.70 0.00 0.00 0.00
##
## Model-averaged estimate with shrinkage: -1.66
## Unconditional SE: 0.92
## 95% Unconditional confidence interval: -3.46, 0.13
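The "shrinkage" comes from treating models that lack firesev as contributing an estimate of exactly 0, so their weight pulls the average toward zero. Using the weights and estimates from the table above:
# model-averaged estimate with shrinkage, by hand
wts  <- c(0.88, 0.00, 0.12, 0.00)    # AICcWt column
ests <- c(-1.89, -1.39, 0.00, 0.00)  # firesev estimate, 0 where absent
sum(wts * ests)                      # ~ the -1.66 reported above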
To get predictions, we again use our candidate set and their weightings with modavgPred(). However, remember, because we have an ensemble of models, we need to supply values for all predictors. For the sake of examination, let's look at the relationship between fire severity and richness, holding all of our other predictors at their median value.
library(modelr)
new_data <- keeley %>%
  data_grid(distance = median(distance),
            elev = median(elev),
            abiotic = median(abiotic),
            age = median(age),
            hetero = median(hetero),
            firesev = seq_range(firesev, 100),
            cover = median(cover))
model_averaged_predictions <- modavgPred(mods, newdata = new_data)
This unfortunately produces a super ugly list, but we can turn it into a data frame
model_averaged_predictions <- as.data.frame(model_averaged_predictions)
head(model_averaged_predictions)
## type mod.avg.pred uncond.se conf.level lower.CL upper.CL
## 1 response 55.79253 3.907097 0.95 48.13476 63.45030
## 2 response 55.65803 3.838699 0.95 48.13432 63.18174
## 3 response 55.52353 3.770512 0.95 48.13346 62.91360
## 4 response 55.38903 3.702547 0.95 48.13218 62.64589
## 5 response 55.25454 3.634816 0.95 48.13043 62.37864
## 6 response 55.12004 3.567333 0.95 48.12819 62.11188
## matrix.output.mod.avg.pred matrix.output.uncond.se
## 1 55.79253 3.907097
## 2 55.65803 3.838699
## 3 55.52353 3.770512
## 4 55.38903 3.702547
## 5 55.25454 3.634816
## 6 55.12004 3.567333
## matrix.output.lower.CL matrix.output.upper.CL
## 1 48.13476 63.45030
## 2 48.13432 63.18174
## 3 48.13346 62.91360
## 4 48.13218 62.64589
## 5 48.13043 62.37864
## 6 48.12819 62.11188
model_averaged_predictions <- model_averaged_predictions %>%
  bind_cols(new_data)
So, how much does model averaging change things? Let’s compare the results to the best fit model.
best_fit_predictions <- predict(keeley_soil_fire, newdata = new_data,
                                interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(new_data)
Now let’s plot to see the results
ggplot() +
  geom_point(data = keeley, mapping = aes(x = firesev, y = rich)) +
  geom_line(data = model_averaged_predictions,
            mapping = aes(x = firesev, y = mod.avg.pred)) +
  geom_ribbon(data = model_averaged_predictions,
              mapping = aes(x = firesev, ymin = lower.CL, ymax = upper.CL),
              alpha = 0.3) +
  geom_line(data = best_fit_predictions,
            mapping = aes(x = firesev, y = fit), color = "red") +
  geom_ribbon(data = best_fit_predictions,
              mapping = aes(x = firesev, ymin = lwr, ymax = upr),
              alpha = 0.2, fill = "red") +
  theme_bw(base_size = 17)
Let’s try a few of these analyses, where the goal is to create a model set, evaluate it, and draw conclusions based on either a single model or the ensemble. Here are a few example datasets from the rethinking package.
First, the relationship where we ask if neocortex size is a predictor of milk quality (kcal.per.g) in primates.
#Load and visualize the data
library(GGally)
milk <- read_csv("./data/milk.csv")
ggpairs(milk %>% select(-species))
#build a candidate set of models
mod1 <- lm(kcal.per.g ~ neocortex + clade + log(mass),
           data = milk)
mod2 <- lm(kcal.per.g ~ neocortex + clade,
           data = milk)
mod3 <- lm(kcal.per.g ~ clade + log(mass),
           data = milk)
mod4 <- lm(kcal.per.g ~ clade,
           data = milk)
#make a list of models
modList <- list(all = mod1, brain_clade = mod2,
                mass_clade = mod3, clade_only = mod4)
#evaluate the models
aictab(modList)
#look at the importance of one predictor
importance(modList, "neocortex")
modavgShrink(modList, parm="neocortex")
#look at all coefficients
map(modList, car::Anova)
map(modList, broom::tidy)
#evaluate the predictor in context
preddata <- data_grid(milk,
                      clade = unique(clade),
                      neocortex = seq_range(neocortex, 100),
                      mass = mean(mass)) %>%
  bind_cols(as.data.frame(modavgPred(modList, newdata = .)))
ggplot(preddata,
       aes(x = neocortex, y = mod.avg.pred, color = clade,
           ymin = lower.CL, ymax = upper.CL, group = clade)) +
  geom_line() +
  geom_ribbon(alpha = 0.1, color = NA)
Whew - that’s a lot, but, hey, it’s MMI!
Let’s try something with language diversity!
#Load and visualize the data
nettle <- read_csv("./data/nettle.csv")
pairs(nettle %>% select(-country)) #get rid of country
# build a candidate set of models
# looking at number of languages as predicted by
# population size, and the average and variability of growing season
#for funsies, feel free to try a glm with a poisson error and log link
_________
_________
_________
_________
#make a list of models
modList <- list(_________)
#evaluate the models
aictab(_________)
#look at the importance of one predictor
importance(_________, "_________")
modavgShrink(_________, parm="_________")
#look at all coefficients
_________(modList, car::Anova)
_________(modList, broom::tidy)
#evaluate the predictor in context
#note, might be many things in _________
preddata <- data_grid(nettle,
_________) %>%
bind_cols(as.data.frame(modavgPred(modList, newdata = .)))
ggplot(preddata,
aes(x=_________, y=mod.avg.pred,
ymin = lower.CL, ymax = upper.CL)) +
geom_line() +
geom_ribbon(alpha=0.1)
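If you get stuck starting the model set, here is one possible opening move. Note the column names (num.lang, k.pop, mean.growing.season, sd.growing.season) are assumed from the rethinking package's version of these data - check names(nettle) first:
# a hypothetical first model - verify the column names with names(nettle)
nettle_mod1 <- lm(log(num.lang) ~ log(k.pop) + mean.growing.season,
                  data = nettle)
# the "for funsies" version: a glm with poisson error and log link
nettle_glm1 <- glm(num.lang ~ log(k.pop) + mean.growing.season,
                   family = poisson(link = "log"),
                   data = nettle)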
OK, now that you've got that down, use the following workflow on the rugged.csv dataset - a dataset that looks at log_gdp (rgdppc_2000 in the data) as a function of many predictors. Classically, the hypothesis has been that landscape ruggedness is a predictor. But there are others.
Note, if you want to look at something a bit sillier, install the rethinking package with devtools::install_github('rmcelreath/rethinking') and look at library(rethinking); data(WaffleDivorce) - what elements of society predict the number of Waffle Houses?
As for the workflow, follow the same steps as above, with either an lm or glm approach.