Let’s begin today with our fire severity dataset from before.
library(tidyverse)
library(readr)
keeley <- read_csv("./data/Keeley_rawdata_select4.csv")
To show that many variables have a relationship with plant richness, we can reshape the data a bit and plot it
keeley %>%
  select(rich, abiotic, firesev, cover) %>%
  gather(variable, value, -rich) %>%
  ggplot(mapping = aes(x = value, y = rich)) +
  facet_wrap(~variable, strip.position = "bottom", scales = "free_x") +
  geom_point() +
  stat_smooth(method = "lm", color = "red") +
  xlab("") +
  theme_bw(base_size = 17)
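As an aside, gather() still works, but it has been superseded in tidyr. If you prefer the current interface, the same reshape can be written with pivot_longer() - a sketch you can pipe into the same ggplot call as above:
keeley %>%
  select(rich, abiotic, firesev, cover) %>%
  pivot_longer(-rich, names_to = "variable", values_to = "value")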
Or we can look at everything
pairs(keeley)
OK, what’s the right way to go?
We can start with fitting a few basic models
k_abiotic <- lm(rich ~ abiotic, data=keeley)
k_firesev <- lm(rich ~ firesev, data=keeley)
k_cover <- lm(rich ~ cover, data=keeley)
Any one of these models has a log likelihood
logLik(k_abiotic)
## 'log Lik.' -358.1043 (df=3)
and from that we can get its AIC
AIC(k_abiotic)
## [1] 722.2085
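Under the hood, AIC is nothing magical - it is just -2 times the log likelihood plus a penalty of 2 for each estimated parameter (here the intercept, slope, and residual variance - the df=3 above). We can verify by hand:
# AIC by hand: -2 * log likelihood + 2 * number of estimated parameters
ll <- logLik(k_abiotic)
-2 * as.numeric(ll) + 2 * attr(ll, "df")
## [1] 722.2085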
Simple, no?
AIC(k_abiotic)
## [1] 722.2085
AIC(k_firesev)
## [1] 736.0338
AIC(k_cover)
## [1] 738.796
OK, so, k_abiotic has the lowest AIC. But what if we want more - an AICc, or to more formally compare them? For that, we have the AICcmodavg library.
library(AICcmodavg)
AICc(k_abiotic)
## [1] 722.4876
AICc(k_firesev)
## [1] 736.3129
AICc(k_cover)
## [1] 739.075
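For reference, AICc is just AIC with a small-sample correction: 2k(k+1)/(n - k - 1) gets added, where n is the sample size and k the number of estimated parameters, a penalty that vanishes as n grows. Verifying by hand:
# AICc by hand: AIC + 2k(k+1)/(n - k - 1)
n <- nrow(keeley)                    # 90 plots
k <- attr(logLik(k_abiotic), "df")   # 3
AIC(k_abiotic) + (2 * k * (k + 1)) / (n - k - 1)
## [1] 722.4876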
Note that the AICc doesn’t change our conclusions here. If we want a larger analysis, we can get an AIC table using aictab
aictab(list(k_abiotic, k_firesev, k_cover),
       modnames = c("abiotic", "firesev", "cover"))
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## abiotic 3 722.49 0.00 1 1 -358.10
## firesev 3 736.31 13.83 0 1 -365.02
## cover 3 739.08 16.59 0 1 -366.40
Now we have more information - weights, delta AICcs, and more.
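Those weights are easy to compute by hand: each model gets exp(-delta_AICc/2), normalized so the whole set sums to one. A sketch of what aictab did:
# Akaike weights by hand
aiccs <- c(AICc(k_abiotic), AICc(k_firesev), AICc(k_cover))
delta <- aiccs - min(aiccs)              # the Delta_AICc column
exp(-delta / 2) / sum(exp(-delta / 2))   # weights: ~0.999, ~0.001, ~0.0002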
Note, if you want to see how this differs from BIC, there is also bictab()
bictab(list(k_abiotic, k_firesev, k_cover),
       modnames = c("abiotic", "firesev", "cover"))
##
## Model selection based on BIC:
##
## K BIC Delta_BIC BICWt Cum.Wt LL
## abiotic 3 729.71 0.00 1 1 -358.10
## firesev 3 743.53 13.83 0 1 -365.02
## cover 3 746.30 16.59 0 1 -366.40
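The two tables agree here, as the only difference between the criteria is the penalty per parameter: 2 for AIC, log(n) for BIC, so BIC punishes extra parameters more harshly as sample size grows. By hand:
# BIC by hand: -2 * log likelihood + log(n) * number of parameters
ll <- logLik(k_abiotic)
-2 * as.numeric(ll) + log(nrow(keeley)) * attr(ll, "df")   # the 729.71 above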
Consider the following set of alternate models
keeley_soil_fire <- lm(rich ~ elev + abiotic + hetero +
                         distance + firesev,
                       data = keeley)
keeley_plant_fire <- lm(rich ~ distance + firesev +
                          age + cover,
                        data = keeley)
keeley_soil_plant <- lm(rich ~ elev + abiotic + hetero +
                          age + cover,
                        data = keeley)
For use in the following functions, we give the list elements names so they can be used later.
mods <- list(soil_fire = keeley_soil_fire,
             plant_fire = keeley_plant_fire,
             soil_plant = keeley_soil_plant)
aictab(mods)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## soil_fire 7 692.55 0.00 0.88 0.88 -338.59
## soil_plant 7 696.57 4.01 0.12 1.00 -340.60
## plant_fire 6 709.69 17.13 0.00 1.00 -348.34
Note how the list names got subbed in without us having to supply a modnames argument. Nice, no?
To look at the importance of any one variable, we can use the importance() function, which will give us the relative variable importance across all models.
importance(mods, "firesev")
## Error in importance.AIClm(mods, "firesev"):
## Importance values are only meaningful when the number of models with and without parameter are equal
And this gives us an important lesson - to calculate variable importance, we need a balanced model set. We can resolve that in this case by adding a null model, but this does make one think much more carefully about the models used.
keeley_null <- lm(rich ~ 1, data = keeley)
mods <- list(soil_fire = keeley_soil_fire,
             plant_fire = keeley_plant_fire,
             soil_plant = keeley_soil_plant,
             null = keeley_null)
importance(mods, "firesev")
##
## Importance values of 'firesev':
##
## w+ (models including parameter): 0.88
## w- (models excluding parameter): 0.12
So, a variable importance of 88%.
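That 88% is nothing more than the summed weight of the models containing firesev. We can pull it straight out of the AICc table ourselves - a sketch, using the fact that aictab returns a data frame with Modnames and AICcWt columns:
# variable importance by hand: sum the weights of models with the term
tab <- aictab(mods)
has_firesev <- tab$Modnames %in% c("soil_fire", "plant_fire")
sum(tab$AICcWt[has_firesev])    # w+ (models with firesev): ~0.88
sum(tab$AICcWt[!has_firesev])   # w- (models without):      ~0.12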
To get the model-averaged coefficient, we use modavgShrink()
modavgShrink(mods, parm="firesev")
##
## Multimodel inference on "firesev" based on AICc
##
## AICc table used to obtain model-averaged estimate with shrinkage:
##
## K AICc Delta_AICc AICcWt Estimate SE
## soil_fire 7 692.55 0.00 0.88 -1.89 0.73
## plant_fire 6 709.69 17.13 0.00 -1.39 0.92
## soil_plant 7 696.57 4.01 0.12 0.00 0.00
## null 2 747.25 54.70 0.00 0.00 0.00
##
## Model-averaged estimate with shrinkage: -1.66
## Unconditional SE: 0.92
## 95% Unconditional confidence interval: -3.46, 0.13
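The "shrinkage" comes from treating models that lack firesev as contributing an estimate of exactly 0, so their weight pulls the average toward zero. Using the weights and estimates from the table above:
# model-averaged estimate with shrinkage, by hand
wts  <- c(0.88, 0.00, 0.12, 0.00)    # AICcWt column
ests <- c(-1.89, -1.39, 0.00, 0.00)  # firesev estimate, 0 where absent
sum(wts * ests)                      # ~ the -1.66 reported above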
To get predictions, we again use our candidate set and their weightings with modavgPred(). However, remember, because we have an ensemble of models, we need to supply values for all predictors. For the sake of examination, let's look at the relationship between fire severity and richness, holding all of our other predictors at their median value.
library(modelr)
new_data <- keeley %>%
  data_grid(distance = median(distance),
            elev = median(elev),
            abiotic = median(abiotic),
            age = median(age),
            hetero = median(hetero),
            firesev = seq_range(firesev, 100),
            cover = median(cover))
model_averaged_predictions <- modavgPred(mods, newdata = new_data)
This unfortunately produces a super ugly list, but we can turn it into a data frame
model_averaged_predictions <- as.data.frame(model_averaged_predictions)
head(model_averaged_predictions)
## type mod.avg.pred uncond.se conf.level lower.CL upper.CL
## 1 response 55.79253 3.907097 0.95 48.13476 63.45030
## 2 response 55.65803 3.838699 0.95 48.13432 63.18174
## 3 response 55.52353 3.770512 0.95 48.13346 62.91360
## 4 response 55.38903 3.702547 0.95 48.13218 62.64589
## 5 response 55.25454 3.634816 0.95 48.13043 62.37864
## 6 response 55.12004 3.567333 0.95 48.12819 62.11188
## matrix.output.mod.avg.pred matrix.output.uncond.se
## 1 55.79253 3.907097
## 2 55.65803 3.838699
## 3 55.52353 3.770512
## 4 55.38903 3.702547
## 5 55.25454 3.634816
## 6 55.12004 3.567333
## matrix.output.lower.CL matrix.output.upper.CL
## 1 48.13476 63.45030
## 2 48.13432 63.18174
## 3 48.13346 62.91360
## 4 48.13218 62.64589
## 5 48.13043 62.37864
## 6 48.12819 62.11188
model_averaged_predictions <- model_averaged_predictions %>%
  bind_cols(new_data)
So, how much does model averaging change things? Let’s compare the results to the best fit model.
best_fit_predictions <- predict(keeley_soil_fire, newdata = new_data,
                                interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(new_data)
Now let’s plot to see the results
ggplot() +
  geom_point(data = keeley, mapping = aes(x = firesev, y = rich)) +
  geom_line(data = model_averaged_predictions,
            mapping = aes(x = firesev, y = mod.avg.pred)) +
  geom_ribbon(data = model_averaged_predictions,
              mapping = aes(x = firesev, ymin = lower.CL, ymax = upper.CL),
              alpha = 0.3) +
  geom_line(data = best_fit_predictions,
            mapping = aes(x = firesev, y = fit), color = "red") +
  geom_ribbon(data = best_fit_predictions,
              mapping = aes(x = firesev, ymin = lwr, ymax = upr),
              alpha = 0.2, fill = "red") +
  theme_bw(base_size = 17)
Let’s try a few of these analyses, where the goal is to create a model set, evaluate it, and draw conclusions based on either a single model or the ensemble. Here are a few example datasets from the rethinking package.
First, the relationship where we ask if neocortex size is a predictor of milk quality (kcal.per.g) in primates.
#Load and visualize the data
library(GGally)
milk <- read_csv("./data/milk.csv")
ggpairs(milk %>% select(-species))
#build a candidate set of models
mod1 <- lm(kcal.per.g ~ neocortex + clade + log(mass),
           data = milk)
mod2 <- lm(kcal.per.g ~ neocortex + clade,
           data = milk)
mod3 <- lm(kcal.per.g ~ clade + log(mass),
           data = milk)
mod4 <- lm(kcal.per.g ~ clade,
           data = milk)
#make a list of models
modList <- list(all = mod1, brain_clade = mod2,
                mass_clade = mod3, clade_only = mod4)
#evaluate the models
aictab(modList)
#look at the importance of one predictor
importance(modList, "neocortex")
modavgShrink(modList, parm="neocortex")
#look at all coefficients
map(modList, car::Anova)
map(modList, broom::tidy)
#evaluate the predictor in context
preddata <- data_grid(milk,
                      clade = unique(clade),
                      neocortex = seq_range(neocortex, 100),
                      mass = mean(mass)) %>%
  bind_cols(as.data.frame(modavgPred(modList, newdata = .)))
ggplot(preddata,
       aes(x = neocortex, y = mod.avg.pred, color = clade,
           ymin = lower.CL, ymax = upper.CL, group = clade)) +
  geom_line() +
  geom_ribbon(alpha = 0.1, color = NA)
Whew - that’s a lot, but, hey, it’s MMI!
Let’s try something with language diversity!
#Load and visualize the data
nettle <- read_csv("./data/nettle.csv")
pairs(nettle %>% select(-country)) #get rid of country
# build a candidate set of models
# looking at number of languages as predicted by
# population size, and the average and variability of growing season
#for funsies, feel free to try a glm with a poisson error and log link
_________
_________
_________
_________
#make a list of models
modList <- list(_________)
#evaluate the models
aictab(_________)
#look at the importance of one predictor
importance(_________, "_________")
modavgShrink(_________, parm="_________")
#look at all coefficients
_________(modList, car::Anova)
_________(modList, broom::tidy)
#evaluate the predictor in context
#note, might be many things in _________
preddata <- data_grid(nettle,
_________) %>%
bind_cols(as.data.frame(modavgPred(modList, newdata = .)))
ggplot(preddata,
aes(x=_________, y=mod.avg.pred,
ymin = lower.CL, ymax = upper.CL)) +
geom_line() +
geom_ribbon(alpha=0.1)
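If you get stuck starting the model set, here is one possible opening move. Note the column names (num.lang, k.pop, mean.growing.season, sd.growing.season) are assumed from the rethinking package's version of these data - check names(nettle) first:
# a hypothetical first model - verify the column names with names(nettle)
nettle_mod1 <- lm(log(num.lang) ~ log(k.pop) + mean.growing.season,
                  data = nettle)
# the "for funsies" version: a glm with poisson error and log link
nettle_glm1 <- glm(num.lang ~ log(k.pop) + mean.growing.season,
                   family = poisson(link = "log"),
                   data = nettle)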
OK, now that you've got that down, use the following workflow on the rugged.csv dataset - a dataset that looks at log_gdp (rgdppc_2000 in the data) as a function of many predictors. Classically, the hypothesis has been that landscape ruggedness is a predictor. But there are others.
Note, if you want to look at something a bit sillier, install the rethinking package with devtools::install_github('rmcelreath/rethinking') and look at library(rethinking); data(WaffleDivorce) - what elements of society predict the number of Waffle Houses?
As for the workflow, follow the same steps as above, with either an lm or glm approach.