Let’s begin today with our fire severity dataset from before.
library(tidyverse)
library(readr)
keeley <- read_csv("./data/Keeley_rawdata_select4.csv")To show that many variables have a relationship with plant richness, we can reshape the data a bit and plot it
keeley %>% select(rich, abiotic, firesev, cover) %>%
  gather(variable, value, -rich) %>%
  ggplot(mapping=aes(x=value, y=rich)) +
  facet_wrap(~variable, strip.position="bottom", scale="free_x") +
  geom_point() +
  stat_smooth(method = "lm", color="red") +
  xlab("") +
  theme_bw(base_size=17)Or we can look at everything
pairs(keeley)OK, what’s the right way to go?
We can start with fitting a few basic models
k_abiotic <- lm(rich ~ abiotic, data=keeley)
  
k_firesev <- lm(rich ~ firesev, data=keeley)
  
k_cover <- lm(rich ~ cover, data=keeley)Any one of these models has a log likelhood
logLik(k_abiotic)## 'log Lik.' -358.1043 (df=3)and from that we can get it’s AIC
AIC(k_abiotic)## [1] 722.2085Simple, no?
AIC(k_abiotic)## [1] 722.2085AIC(k_firesev)## [1] 736.0338AIC(k_cover)## [1] 738.796OK, so, k_abiotic has the lowest AIC. But what if we want more - and AICc or to more formally compare them? For that, we have the AICcmodavg library
library(AICcmodavg)
AICc(k_abiotic)## [1] 722.4876AICc(k_firesev)## [1] 736.3129AICc(k_cover)## [1] 739.075Note that the AICc doesn’t change our conclusions here. If we want a larger analysis, we can get an AIC table using aictab
aictab(list(k_abiotic, k_firesev, k_cover),
       modnames = c("abiotic", "firesev", "cover"))## 
## Model selection based on AICc:
## 
##         K   AICc Delta_AICc AICcWt Cum.Wt      LL
## abiotic 3 722.49       0.00      1      1 -358.10
## firesev 3 736.31      13.83      0      1 -365.02
## cover   3 739.08      16.59      0      1 -366.40We we have more information - weights, delta AICcs, and more.
Note, if you want to see how this differs from BIC results
bictab(list(k_abiotic, k_firesev, k_cover),
       modnames = c("abiotic", "firesev", "cover"))## 
## Model selection based on BIC:
## 
##         K    BIC Delta_BIC BICWt Cum.Wt      LL
## abiotic 3 729.71      0.00     1      1 -358.10
## firesev 3 743.53     13.83     0      1 -365.02
## cover   3 746.30     16.59     0      1 -366.40Consider the following set of alternate models
keeley_soil_fire <- lm(rich ~ elev + abiotic + hetero +
                          distance + firesev,
                  data = keeley)
keeley_plant_fire <- lm(rich ~  distance + firesev + 
                          age + cover,
                  data = keeley)
keeley_soil_plant <- lm(rich ~  elev + abiotic + hetero +
                          age + cover,
                  data = keeley)For use in the next set of following functions, giving the list names so they will be used later.
mods <- list(soil_fire = keeley_soil_fire,
             plant_fire = keeley_plant_fire,
             soil_plant = keeley_soil_plant)
aictab(mods)## 
## Model selection based on AICc:
## 
##            K   AICc Delta_AICc AICcWt Cum.Wt      LL
## soil_fire  7 692.55       0.00   0.88   0.88 -338.59
## soil_plant 7 696.57       4.01   0.12   1.00 -340.60
## plant_fire 6 709.69      17.13   0.00   1.00 -348.34Note how the names got subbed in without us having to use them. Nice, no?
To look at the importance of any one variable, we can use the importance() function which will give us the relative variable importance across all models.
importance(mods, "firesev")## Error in importance.AIClm(mods, "firesev"): 
## Importance values are only meaningful when the number of models with and without parameter are equalAnd this gives us important lesson - to calculate variable importance we need a balanced model set. We can resolve it in this case by adding a null model, but, this does make one think much more carefully about the models used.
keeley_null <- lm(rich ~ 1, data = keeley)
mods <- list(soil_fire = keeley_soil_fire,
             plant_fire = keeley_plant_fire,
             soil_plant = keeley_soil_plant,
             null = keeley_null)
importance(mods, "firesev")## 
## Importance values of 'firesev':
## 
## w+ (models including parameter): 0.88 
## w- (models excluding parameter): 0.12So, a variable importance of 88%.
To get the model averaged coefficient
modavgShrink(mods, parm="firesev")## 
## Multimodel inference on "firesev" based on AICc
## 
## AICc table used to obtain model-averaged estimate with shrinkage:
## 
##            K   AICc Delta_AICc AICcWt Estimate   SE
## soil_fire  7 692.55       0.00   0.88    -1.89 0.73
## plant_fire 6 709.69      17.13   0.00    -1.39 0.92
## soil_plant 7 696.57       4.01   0.12     0.00 0.00
## null       2 747.25      54.70   0.00     0.00 0.00
## 
## Model-averaged estimate with shrinkage: -1.66 
## Unconditional SE: 0.92 
## 95% Unconditional confidence interval: -3.46, 0.13To get predictions, we again use our candidate set and their weightings with modavgPred(). However, remember, because we have an ensemble dataset, we need to account for all predictors. For the sake of examination, let’s look at the relationship between fire severity and richness holding all of our other predictors at their median value.
library(modelr)
new_data <- keeley %>% data_grid(distance = median(distance),
                      elev = median(elev),
                      abiotic = median(abiotic),
                      age = median(age),
                      hetero = median(hetero),
                      firesev = seq_range(firesev, 100),
                      cover=median(cover))
model_averaged_predictions <- modavgPred(mods, newdata = new_data)This unfortunately produces a super ugly list, but we can turn it into a data frame
model_averaged_predictions <- as.data.frame(model_averaged_predictions)
head(model_averaged_predictions)##       type mod.avg.pred uncond.se conf.level lower.CL upper.CL
## 1 response     55.79253  3.907097       0.95 48.13476 63.45030
## 2 response     55.65803  3.838699       0.95 48.13432 63.18174
## 3 response     55.52353  3.770512       0.95 48.13346 62.91360
## 4 response     55.38903  3.702547       0.95 48.13218 62.64589
## 5 response     55.25454  3.634816       0.95 48.13043 62.37864
## 6 response     55.12004  3.567333       0.95 48.12819 62.11188
##   matrix.output.mod.avg.pred matrix.output.uncond.se
## 1                   55.79253                3.907097
## 2                   55.65803                3.838699
## 3                   55.52353                3.770512
## 4                   55.38903                3.702547
## 5                   55.25454                3.634816
## 6                   55.12004                3.567333
##   matrix.output.lower.CL matrix.output.upper.CL
## 1               48.13476               63.45030
## 2               48.13432               63.18174
## 3               48.13346               62.91360
## 4               48.13218               62.64589
## 5               48.13043               62.37864
## 6               48.12819               62.11188model_averaged_predictions <- model_averaged_predictions %>%
  bind_cols(new_data)So, how much does model averaging change things? Let’s compare the results to the best fit model.
best_fit_predictions <- predict(keeley_soil_fire, newdata = new_data,
                                interval = "confidence")  %>%
  as.data.frame() %>%
  bind_cols(new_data)Now let’s plot to see the results
ggplot() +
  geom_point(data = keeley, mapping = aes(x=firesev, y=rich)) +
  geom_line(data = model_averaged_predictions, 
            mapping = aes(x=firesev, y = mod.avg.pred)) +
  geom_ribbon(data = model_averaged_predictions, 
            mapping = aes(x=firesev, y = mod.avg.pred,
                          ymin = lower.CL, ymax = upper.CL),
            alpha = 0.3) +
  geom_line(data = best_fit_predictions, 
            mapping = aes(x=firesev, y = fit), color="red") +
  geom_ribbon(data = best_fit_predictions, 
            mapping = aes(x=firesev, y = fit,
                          ymin = lwr, ymax = upr),
            alpha = 0.2, fill = "red") +
  theme_bw(base_size=17)## Warning: Ignoring unknown aesthetics: y
## Warning: Ignoring unknown aesthetics: yLet’s try a few of these analyses where the goal is to create a model set, evaluate it, and draw conclusions based on either a single model or the ensemble. Here are a few example files from the rethinking package
First, the relationship where we ask if neocortex quality is a predictor of milk quality in primates.
#Load and visualize the data
library(GGally)
milk <- read_csv("./data/milk.csv")
ggpairs(milk %>% select(-species))
#build a candidate set of models
mod1 <- lm(kcal.per.g ~ neocortex + clade + log(mass),
           data = milk)
mod2 <- lm(kcal.per.g ~ neocortex + clade ,
           data = milk)
mod3 <- lm(kcal.per.g ~ clade + log(mass),
           data = milk)
mod4 <- lm(kcal.per.g ~ clade,
           data = milk)
#make a list of models
modList <- list(all = mod1, brain_clade = mod2, 
                mass_clade = mod3, clade_only = mod4)
#evaluate the models
aictab(modList)
#look at the importance of one predictor
importance(modList, "neocortex")
modavgShrink(modList, parm="neocortex")
#look at all coefficients
map(modList, car::Anova)
map(modList, broom::tidy)
#evalute the predictor in context
preddata <- data_grid(milk, 
                      clade = unique(clade),
                      neocortex = seq_range(neocortex, 100),
                      mass = mean(mass)) %>%
  bind_cols(as.data.frame(modavgPred(modList, newdata = .)))
ggplot(preddata,
       aes(x=neocortex, y=mod.avg.pred, color=clade,
           ymin = lower.CL, ymax = upper.CL, group=clade)) +
  geom_line() +
  geom_ribbon(alpha=0.1, color=NA)Whew - that’s a lot, but, hey, it’s MMI!
Let’s try something with language diversity!
#Load and visualize the data
nettle <- read_csv("./data/nettle.csv")
pairs(nettle %>% select(-country)) #get rid of country
# build a candidate set of models
# looking at number of languages as predicted by 
# population size, and the average and variability of growing season
#for funsies, feel free to try a glm with a poisson error and log link
_________
_________
_________
_________
#make a list of models
modList <- list(_________)
#evaluate the models
aictab(_________)
#look at the importance of one predictor
importance(_________, "_________")
modavgShrink(_________, parm="_________")
#look at all coefficients
_________(modList, car::Anova)
_________(modList, broom::tidy)
#evalute the predictor in context
#note, might be many things in _________
preddata <- data_grid(nettle, 
                      _________) %>%
  bind_cols(as.data.frame(modavgPred(modList, newdata = .)))
ggplot(preddata,
       aes(x=_________, y=mod.avg.pred,
           ymin = lower.CL, ymax = upper.CL)) +
  geom_line() +
  geom_ribbon(alpha=0.1)OK, now that you’ve got that down, use the following workflow on the rugged.csv dataset - a dataset that looks at log_gdp (rgdppc_2000 in the data) as a function of many predictors. Clasically, the hypothesis has been that landscape ruggedness is a predictor. But, there are others.
Note, if you want to look at something a bit sillier, installl the rethinking package devtools::install_github('rmcelreath/rethinking') and look at library(rethinking); data(WaffleDivorce) - what elements of society predict the number of Waffle Houses?
As for the workflow
lm or glm approach.