Joins and Geospatial Data

Map maker, map maper, make me a map!

Today we’re going to look at how we can use joins and geospatial data more explicitly to make maps. Maps are among the first data visualizations that ever occured. And some of the most powerful. They’re also one of the places where joins become incredibly important, as to put data on a map we have to join our data with a geospatial description of the map we want.

Death from Heart Disease

Today’s data set that we’ll be using is a data set of heart disease mortality from the CDC

#load the data and prep
#for some data manipulation
library(readxl)
library(dplyr)

heart_disease <- read_excel("./data/hd_all.xlsx", 
                            na="Insufficient Data")

head(heart_disease)

## # A tibble: 6 x 4
##   State   County  Death_Rate FIPS_Code
##   <chr>   <chr>        <dbl>     <dbl>
## 1 Alabama Autauga       463       1001
## 2 Alabama Baldwin       391.      1003
## 3 Alabama Barbour       533.      1005
## 4 Alabama Bibb          511.      1007
## 5 Alabama Blount        426.      1009
## 6 Alabama Bullock       483.      1011

OK, we see that we have state, county, and information on death. FIPS codes, FYI, are standardized county codes. They’ll become important in a bit.

Introducing: Maps

There are a LOT of ways to get map data into R. We’re going to begin with the simplest - grabbing it from an R package. The USAbouncaries package has boundaries of states and counties in the US through time. To learn more, check out https://lincolnmullen.com/software/usaboundaries/. It has a wonderful set of county maps that are easy to use with the us_counties() function!

#install the USAboundaries library if you
#don't have it
#devtools::install_github("ropensci/USAboundaries")
#devtools::install_github("ropensci/USAboundariesData")
library(USAboundaries) 
library(sf)

map_sf <- us_counties()

Nice! So, what is in here?

class(map_sf)

## [1] "sf"         "data.frame"

st_crs(map_sf)

## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"

OK, it is indeed an sf object.

map_sf

## Simple feature collection with 3220 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##    statefp countyfp countyns       affgeoid geoid       name lsad      aland
## 1       39      131 01074078 0500000US39131 39131       Pike   06 1140324458
## 2       46      003 01266983 0500000US46003 46003     Aurora   06 1834813753
## 3       55      035 01581077 0500000US55035 55035 Eau Claire   06 1652211310
## 4       72      145 01804553 0500000US72145 72145  Vega Baja   13  118766803
## 5       48      259 01383915 0500000US48259 48259    Kendall   06 1715747531
## 6       40      015 01101795 0500000US40015 40015      Caddo   06 3310745124
## 7       19      093 00465235 0500000US19093 19093        Ida   06 1117599859
## 8       28      071 00695759 0500000US28071 28071  Lafayette   06 1636141755
## 9       12      027 00294452 0500000US12027 12027     DeSoto   06 1649978040
## 10      31      137 00835890 0500000US31137 31137     Phelps   06 1398048574
##       awater   state_name state_abbr jurisdiction_type
## 1    9567612         Ohio         OH             state
## 2   11201379 South Dakota         SD             state
## 3   18848512    Wisconsin         WI             state
## 4   57805868  Puerto Rico         PR         territory
## 5    1496797        Texas         TX             state
## 6   30820525     Oklahoma         OK             state
## 7    1406461         Iowa         IA             state
## 8  123052156  Mississippi         MS             state
## 9    6247257      Florida         FL             state
## 10   1646534     Nebraska         NE             state
##                          geometry
## 1  MULTIPOLYGON (((-83.35353 3...
## 2  MULTIPOLYGON (((-98.80777 4...
## 3  MULTIPOLYGON (((-91.65045 4...
## 4  MULTIPOLYGON (((-66.44899 1...
## 5  MULTIPOLYGON (((-98.92015 3...
## 6  MULTIPOLYGON (((-98.62315 3...
## 7  MULTIPOLYGON (((-95.74161 4...
## 8  MULTIPOLYGON (((-89.72134 3...
## 9  MULTIPOLYGON (((-82.0565 27...
## 10 MULTIPOLYGON (((-99.64346 4...

OK! there is a kit tgere! Most notably, I want you to see that while we have county names in name and state names in state_name we also have these fp codes - known as FIPS codes. There are Federal Information Processing Standards codes used by the government, as names are… flexible. For more, see https://en.wikipedia.org/wiki/FIPS_county_code - we’ll talk about them in a bit.

To make sure all is well, let’s take a gander!

library(ggplot2)
ggplot(data=map_sf) +
  geom_sf(fill = "grey") +
  coord_sf() +
  theme_minimal()

Great!

Getting ready to join maps and data

OK, so, we want to see how death from heart disease varies by county! To do this, we’ll need to join the data frame and sf object.

First, we need to fix up the columns in our map_sf data set to state and county names.

library(tidyr)
library(stringr)

map_sf <- map_sf %>%
  rename(County = name, 
         State = state_name)

Now let’s test out the join! Before we do the big join, let’s look at mismatch. This is the perfect job for anti_join

bad_match <- anti_join(heart_disease, map_sf)

nrow(bad_match)

## [1] 18

Hey, not bad! 18! Let’s see what the problems are

bad_match[,1:2]

## # A tibble: 18 x 2
##    State                      County                          
##    <chr>                      <chr>                           
##  1 Alaska                     Prince of Wales-Outer Ketchikan 
##  2 Alaska                     Skagway-Hoonah-Angoon           
##  3 Alaska                     Wade Hampton                    
##  4 Alaska                     Wrangell-Petersburg             
##  5 Illinois                   La Salle                        
##  6 Louisiana                  La Salle                        
##  7 Maryland                   Baltimore City                  
##  8 Missouri                   St. Louis City                  
##  9 New Mexico                 Dona Ana                        
## 10 South Dakota               Shannon                         
## 11 Virginia                   Bedford City                    
## 12 Virginia                   Fairfax City                    
## 13 Virginia                   Franklin City                   
## 14 Virginia                   Richmond City                   
## 15 Virginia                   Roanoke City                    
## 16 Virgin Islands of the U.S. Saint Croix (County Equivalent) 
## 17 Virgin Islands of the U.S. Saint John (County Equivalent)  
## 18 Virgin Islands of the U.S. Saint Thomas (County Equivalent)

So - some of it is cities rather than counties. Understandable. Someo f it is the USVIs. And some of it might be simply due to funny names. Let’s look at one - that Prince of Wales county in Alaska as a case study.

bad_match[1,]

## # A tibble: 1 x 4
##   State  County                          Death_Rate FIPS_Code
##   <chr>  <chr>                                <dbl>     <dbl>
## 1 Alaska Prince of Wales-Outer Ketchikan         NA      2201

map_sf %>%
  filter(State=="Alaska") %>%
  filter(stringr::str_detect(County, "Prince")) %>%
  as_tibble()  %>%
  select(State, County)

## # A tibble: 1 x 2
##   State  County               
##   <chr>  <chr>                
## 1 Alaska Prince of Wales-Hyder

Oh! So, names are off. Huh. What if instead of joining by state and county we used something else - like a standard FIPS code! We have a small problem - the map_sf object needs a full state-county FIPS code. And then we can make the join ignore state and county using the by argument to any join.

map_sf <- map_sf %>%
  mutate(FIPS_Code = stringr::str_c(statefp, countyfp) %>% as.numeric)


anti_join(heart_disease, map_sf,
          by = "FIPS_Code")

## # A tibble: 9 x 4
##   State                      County                         Death_Rate FIPS_Code
##   <chr>                      <chr>                               <dbl>     <dbl>
## 1 Alaska                     Prince of Wales-Outer Ketchik…        NA       2201
## 2 Alaska                     Skagway-Hoonah-Angoon                 NA       2232
## 3 Alaska                     Wade Hampton                         427.      2270
## 4 Alaska                     Wrangell-Petersburg                   NA       2280
## 5 South Dakota               Shannon                              288.     46113
## 6 Virginia                   Bedford City                         330.     51515
## 7 Virgin Islands of the U.S. Saint Croix (County Equivalen…        NA      78010
## 8 Virgin Islands of the U.S. Saint John (County Equivalent)        NA      78020
## 9 Virgin Islands of the U.S. Saint Thomas (County Equivale…        NA      78030

Only 9. Were we trying to make this perfect, we’d try and figure out why, but, for now, let’s soldier on. 9 is acceptable.

What if we go the other way?

anti_join(map_sf, heart_disease,
          by = "FIPS_Code")

## Simple feature collection with 85 features and 13 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -166.2118 ymin: 17.91377 xmax: -65.22157 ymax: 63.28039
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##    statefp countyfp countyns       affgeoid geoid        County lsad      aland
## 1       72      145 01804553 0500000US72145 72145     Vega Baja   13  118766803
## 2       72      129 01804545 0500000US72129 72129   San Lorenzo   13  137547980
## 3       72      019 01804489 0500000US72019 72019  Barranquitas   13   88714041
## 4       72      061 01804511 0500000US72061 72061      Guaynabo   13   71444703
## 5       72      027 01804493 0500000US72027 72027         Camuy   13  120058277
## 6       72      135 01804548 0500000US72135 72135      Toa Alta   13   69972604
## 7       72      137 01804549 0500000US72137 72137      Toa Baja   13   60196196
## 8       72      067 01804514 0500000US72067 72067   Hormigueros   13   29384893
## 9       72      015 01804487 0500000US72015 72015        Arroyo   13   38869483
## 10      46      102 01266992 0500000US46102 46102 Oglala Lakota   06 5422847813
##      awater        State state_abbr jurisdiction_type FIPS_Code
## 1  57805868  Puerto Rico         PR         territory     72145
## 2    256425  Puerto Rico         PR         territory     72129
## 3     77812  Puerto Rico         PR         territory     72019
## 4    468234  Puerto Rico         PR         territory     72061
## 5  40430074  Puerto Rico         PR         territory     72027
## 6   1405463  Puerto Rico         PR         territory     72135
## 7  48081472  Puerto Rico         PR         territory     72137
## 8       481  Puerto Rico         PR         territory     72067
## 9  53459916  Puerto Rico         PR         territory     72015
## 10  7126221 South Dakota         SD             state     46102
##                          geometry
## 1  MULTIPOLYGON (((-66.44899 1...
## 2  MULTIPOLYGON (((-66.05255 1...
## 3  MULTIPOLYGON (((-66.35163 1...
## 4  MULTIPOLYGON (((-66.12172 1...
## 5  MULTIPOLYGON (((-66.90157 1...
## 6  MULTIPOLYGON (((-66.30366 1...
## 7  MULTIPOLYGON (((-66.2581 18...
## 8  MULTIPOLYGON (((-67.15973 1...
## 9  MULTIPOLYGON (((-66.08226 1...
## 10 MULTIPOLYGON (((-103.0011 4...

OH! Territories. There’s no data for them. That’s OK for now.

Joining maps and data

So we know that there are some territories and cities left in the heart disease data, and we don’t want to worry about them for the moment. Given that we’ve got some mismatch on both sides, we want to use an left_join - although we could also full_join to eliminate some information. Try it both ways.

Let’s also get rid of Hawaii and Alaska for the moment, as well as anything that is a territory. So, just the lower 48!

We will also need to make sure that this is indeed an sf object. Joining spatial and non-spatial objects leads to non-spatial objects!

heart_disease_map_data <- left_join(heart_disease, map_sf) %>%
  filter(jurisdiction_type != "territory") %>%
  filter(!(State %in% c("Hawaii", "Alaska"))) %>%
  st_as_sf(crs = 4326)

And now - let’s plot it!

heart_map <- ggplot(data=heart_disease_map_data, 
       mapping = aes(fill = Death_Rate)) +
  geom_sf(color = NA) +
  theme_minimal()

heart_map

Making pretty choropleths

So, this is a choropleth map. There are some issues - the scale is hard to resolve, and everything is hard to see. There are a lot of ways we could rectify it. Here are a few tricks.

First, a better scale. Maybe a gradient?

heart_map +
  scale_fill_gradient(low = "white", high = "red")

OK, not bad! Not great, but much better. Are there gradients you’d prefer?

However, the bigger problem is that we have a huge range to cover. One common way to make choropleths is to bin data into categories and go from there. Binning in combination with a nice color scale - say via Color Brewer - cab be great. Remember Color Brewer? http://colorbrewer2.org/ actually uses maps as it’s demo! And there’s a scale_fill_brewer function in ggplot2.

So let’s make bins and plot that instead!

heart_map_binned <- ggplot(data=heart_disease_map_data, 
       mapping = aes(fill = cut_interval(Death_Rate, 5))) +
  geom_sf(color = NA) +
  theme_minimal()

heart_map_binned +
    scale_fill_brewer(palette="Reds")

MUCH nicer. Now we can really start to see some hot spots.

Faded Examples

For our faded examples, today we’re going to look at data on TB incidence from the World Health organization. This look at TB for the entire planet at the country level. Let’s take a look

library(readr)

tb_world <- read_csv("./data/who_tb_data.csv")

tb_world

## # A tibble: 3,651 x 18
##    country iso2  iso3  g_whoregion  year estimated_popul… est_incidences_…
##    <chr>   <chr> <chr> <chr>       <dbl>            <dbl>            <dbl>
##  1 Afghan… AF    AFG   EMR          2000         20093756              190
##  2 Afghan… AF    AFG   EMR          2001         20966463              189
##  3 Afghan… AF    AFG   EMR          2002         21979923              189
##  4 Afghan… AF    AFG   EMR          2003         23064851              189
##  5 Afghan… AF    AFG   EMR          2004         24118979              189
##  6 Afghan… AF    AFG   EMR          2005         25070798              189
##  7 Afghan… AF    AFG   EMR          2006         25893450              189
##  8 Afghan… AF    AFG   EMR          2007         26616792              189
##  9 Afghan… AF    AFG   EMR          2008         27294031              189
## 10 Afghan… AF    AFG   EMR          2009         28004331              189
## # … with 3,641 more rows, and 11 more variables:
## #   est_incidences_per_100K_people_lwr <dbl>,
## #   est_incidences_per_100K_people_upr <dbl>, est_incidences <dbl>,
## #   est_incidences_lwr <dbl>, est_incidences_upr <dbl>,
## #   est_mortality_per_100K_people <dbl>,
## #   est_mortality_per_100K_people_lwr <dbl>,
## #   est_mortality_per_100K_people_upr <dbl>, est_mortality <dbl>,
## #   est_mortality_lwr <dbl>, est_mortality_upr <dbl>

There’s a lot going on here - in particular let’s focus on estimated insidences, incidences per 100K people, and mortality. upr and lwr denote upper and lower bounds to estimates.

So, let’s make a simple map of 2015.

Before we do that, I want to introduce you to rnaturalearth. It’s a package that I use almost daily for really nice maps. While it doesn’t have detailed county level data, it has a ton of data at the world and state/province level coupled with iso3 standardized international codes. Let’s see it in action for just ONE type of map.

library(rnaturalearth)

worldmap <- ne_countries(returnclass = "sf")

names(worldmap)

##  [1] "scalerank"  "featurecla" "labelrank"  "sovereignt" "sov_a3"    
##  [6] "adm0_dif"   "level"      "type"       "admin"      "adm0_a3"   
## [11] "geou_dif"   "geounit"    "gu_a3"      "su_dif"     "subunit"   
## [16] "su_a3"      "brk_diff"   "name"       "name_long"  "brk_a3"    
## [21] "brk_name"   "brk_group"  "abbrev"     "postal"     "formal_en" 
## [26] "formal_fr"  "note_adm0"  "note_brk"   "name_sort"  "name_alt"  
## [31] "mapcolor7"  "mapcolor8"  "mapcolor9"  "mapcolor13" "pop_est"   
## [36] "gdp_md_est" "pop_year"   "lastcensus" "gdp_year"   "economy"   
## [41] "income_grp" "wikipedia"  "fips_10"    "iso_a2"     "iso_a3"    
## [46] "iso_n3"     "un_a3"      "wb_a2"      "wb_a3"      "woe_id"    
## [51] "adm0_a3_is" "adm0_a3_us" "adm0_a3_un" "adm0_a3_wb" "continent" 
## [56] "region_un"  "subregion"  "region_wb"  "name_len"   "long_len"  
## [61] "abbrev_len" "tiny"       "homepart"   "geometry"

WHOAH! That is a ton of information. However, I’d like you to notice the iso_a3 code, analagous to the iso3 code in the WHO data. We can use it to join on.

So, 2015 - note, I’m going to left join on the worldmap in order to ensure that we retain all of the countries of the world

#get a map
worldmap <- ne_countries(returnclass = "sf")

#filter to 2015
tb_2015 <- tb_world %>% filter(year == 2015)

#join
incidence_df <- left_join(worldmap, tb_2015, by = c("iso_a3" = "iso3"))

#plot
ggplot(incidence_df, aes(fill=est_incidences)) +
  geom_sf() +
  scale_fill_gradient(low = "blue", high = "red")

Nice!

Let’s look at mortality in 2000

#get a map
worldmap <- ne_countries(returnclass = "____")


#filter to 2015
tb_2000 <- tb_world %>% filter(year == ____)

#join
incidence_df_2000 <- left_join(worldmap, ____, by = c("iso_a3" = "iso3"))

#plot
ggplot(incidence_df_2000, aes(fill=est_mortality)) +
  geom_sf() +
  scale_fill_gradient(____)

Now let’s look at incidence per 100K, but, only by interval, in 2016 - feel free to do something difference with the fill

#get a map
worldmap <- ____(returnclass = "____")


#filter to 2016
tb_2016 <- tb_world %>% ____(year == ____)

#join
incidence_df_2016 <- left_join(worldmap, ____, by = c("iso_a3" = "iso3"))

#plot
ggplot(incidence_df_2016, aes(fill=____(est_incidences_per_100K_people,5))) +
  ____() +
  scale_fill_brewer(palette = "YlOrBr",
                    guide = guide_legend("Incidences per 100K"))

Pair Programming Exercises

What does not join well? Can you fix any of them via changes in the tb or worldmap dataset? Or should you not bother?
Can you make a map showing change in mortality over four years (filter to four years, and then use facet_wrap)?

Wibbly Wobbly Visualizations

This is all well and good, but what if you want to be google maps? There are a few packages out there. leaflet and tmap are the dominant. I <3 leaflet, but I find tmap is easier to get used to, as it functions similarly to ggplot. There’s also a nice primer in Geocomputation in R. Both are great, but, for the moment, let’s try tmap!

Let’s start with this heart diesease data to see how it works in general. The basic way that tmap works is that it uses tmap_shape() to reference a dataset, and then a series of functions after to add features from that dataset - e.g., tm_polygons(). It doesn’t use NSE, so we put quotes around column names.

library(tmap)

first_tmap <- tm_shape(shp = heart_disease_map_data) +
  tm_polygons(col = "Death_Rate")

Hey, wait, I said interactive! To make it so, we call tmap_mode("view") before we run things. Right now we’re in the plot view.

tmap_mode("view")

first_tmap

OH! Cool! Play around a bit here!

You can still do a lot with tmap and the vignette shows off a lot like facets. Also, see how we add palettes. I’m using a list because we have two elements. And see tmaptools::palette_explorer() for more on colors.

tm_shape(incidence_df_2016) +
  tm_polygons(c("est_incidences", "est_mortality"),
              palette = list("magma", "viridis")) +
  tm_facets(sync = TRUE, ncol = 2)

We can also use facets in a non-interactive way that is quite useful..

tmap_mode("plot")
tm_shape(incidence_df_2016) +
  tm_polygons("est_incidences") +
  tm_facets(by = "region_un")

Exercise

Turn the following two examples:

ggplot(incidence_df, aes(fill=est_incidences)) +
  geom_sf() +
  scale_fill_gradient(low = "blue", high = "red")

ggplot(incidence_df_2016, aes(fill=cut_interval(est_incidences_per_100K_people,5))) +
  geom_sf() +
  scale_fill_brewer(palette = "YlOrBr",
                    guide = guide_legend("Incidences per 100K"))

into tmap syntax using tm_shape(), tm_polygons(), and the palette argument. Explore palettes if need be. To change the legend name, see ?tm_polygons for the proper argument. Make them interactive!

Final Lab Exercise!

OK, this is all well and good. Working alone or in groups, I’d like you to load up the coronavirus package. Remembering that a. Each row represents one report, b. Each report has a type (confirmed, death, or recovered), c. Each report comes with other geospatial information - country, state, etc.

I’d like you to visualize some aspect of the spatial distribution of coronavirus. The maximum date in the dataset isn’t current, so, just remember that. If you want to look at a more current dataset, check https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases - but you will have to look hard at the data to understand what is there. (I’m working on a method to get something more current within R along with Rami Krispin and others.)