Today we’re going to look at how we can use joins and geospatial data more explicitly to make maps. Maps are among the first data visualizations that ever occured. And some of the most powerful. They’re also one of the places where joins become incredibly important, as to put data on a map we have to join our data with a geospatial description of the map we want.
Today’s data set that we’ll be using is a data set of heart disease mortality from the CDC
#load the data and prep
#for some data manipulation
library(readxl)
library(dplyr)
heart_disease <- read_excel("./data/hd_all.xlsx",
na="Insufficient Data")
head(heart_disease)
## # A tibble: 6 x 4
## State County Death_Rate FIPS_Code
## <chr> <chr> <dbl> <dbl>
## 1 Alabama Autauga 463 1001
## 2 Alabama Baldwin 391. 1003
## 3 Alabama Barbour 533. 1005
## 4 Alabama Bibb 511. 1007
## 5 Alabama Blount 426. 1009
## 6 Alabama Bullock 483. 1011
OK, we see that we have state, county, and information on death. FIPS codes, FYI, are standardized county codes. They’ll become important in a bit.
There are a LOT of ways to get map data into R. We’re going to begin with the simplest - grabbing it from an R package. The USAbouncaries
package has boundaries of states and counties in the US through time. To learn more, check out https://lincolnmullen.com/software/usaboundaries/. It has a wonderful set of county maps that are easy to use with the us_counties()
function!
#install the USAboundaries library if you
#don't have it
#devtools::install_github("ropensci/USAboundaries")
#devtools::install_github("ropensci/USAboundariesData")
library(USAboundaries)
library(sf)
map_sf <- us_counties()
Nice! So, what is in here?
class(map_sf)
## [1] "sf" "data.frame"
st_crs(map_sf)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +datum=WGS84 +no_defs"
OK, it is indeed an sf
object.
map_sf
## Simple feature collection with 3220 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
## statefp countyfp countyns affgeoid geoid name lsad aland
## 1 39 131 01074078 0500000US39131 39131 Pike 06 1140324458
## 2 46 003 01266983 0500000US46003 46003 Aurora 06 1834813753
## 3 55 035 01581077 0500000US55035 55035 Eau Claire 06 1652211310
## 4 72 145 01804553 0500000US72145 72145 Vega Baja 13 118766803
## 5 48 259 01383915 0500000US48259 48259 Kendall 06 1715747531
## 6 40 015 01101795 0500000US40015 40015 Caddo 06 3310745124
## 7 19 093 00465235 0500000US19093 19093 Ida 06 1117599859
## 8 28 071 00695759 0500000US28071 28071 Lafayette 06 1636141755
## 9 12 027 00294452 0500000US12027 12027 DeSoto 06 1649978040
## 10 31 137 00835890 0500000US31137 31137 Phelps 06 1398048574
## awater state_name state_abbr jurisdiction_type
## 1 9567612 Ohio OH state
## 2 11201379 South Dakota SD state
## 3 18848512 Wisconsin WI state
## 4 57805868 Puerto Rico PR territory
## 5 1496797 Texas TX state
## 6 30820525 Oklahoma OK state
## 7 1406461 Iowa IA state
## 8 123052156 Mississippi MS state
## 9 6247257 Florida FL state
## 10 1646534 Nebraska NE state
## geometry
## 1 MULTIPOLYGON (((-83.35353 3...
## 2 MULTIPOLYGON (((-98.80777 4...
## 3 MULTIPOLYGON (((-91.65045 4...
## 4 MULTIPOLYGON (((-66.44899 1...
## 5 MULTIPOLYGON (((-98.92015 3...
## 6 MULTIPOLYGON (((-98.62315 3...
## 7 MULTIPOLYGON (((-95.74161 4...
## 8 MULTIPOLYGON (((-89.72134 3...
## 9 MULTIPOLYGON (((-82.0565 27...
## 10 MULTIPOLYGON (((-99.64346 4...
OK! there is a kit tgere! Most notably, I want you to see that while we have county names in name
and state names in state_name
we also have these fp codes - known as FIPS codes. There are Federal Information Processing Standards codes used by the government, as names are… flexible. For more, see https://en.wikipedia.org/wiki/FIPS_county_code - we’ll talk about them in a bit.
To make sure all is well, let’s take a gander!
library(ggplot2)
ggplot(data=map_sf) +
geom_sf(fill = "grey") +
coord_sf() +
theme_minimal()
Great!
OK, so, we want to see how death from heart disease varies by county! To do this, we’ll need to join the data frame and sf object.
First, we need to fix up the columns in our map_sf
data set to state and county names.
library(tidyr)
library(stringr)
map_sf <- map_sf %>%
rename(County = name,
State = state_name)
Now let’s test out the join! Before we do the big join, let’s look at mismatch. This is the perfect job for anti_join
bad_match <- anti_join(heart_disease, map_sf)
nrow(bad_match)
## [1] 18
Hey, not bad! 18! Let’s see what the problems are
bad_match[,1:2]
## # A tibble: 18 x 2
## State County
## <chr> <chr>
## 1 Alaska Prince of Wales-Outer Ketchikan
## 2 Alaska Skagway-Hoonah-Angoon
## 3 Alaska Wade Hampton
## 4 Alaska Wrangell-Petersburg
## 5 Illinois La Salle
## 6 Louisiana La Salle
## 7 Maryland Baltimore City
## 8 Missouri St. Louis City
## 9 New Mexico Dona Ana
## 10 South Dakota Shannon
## 11 Virginia Bedford City
## 12 Virginia Fairfax City
## 13 Virginia Franklin City
## 14 Virginia Richmond City
## 15 Virginia Roanoke City
## 16 Virgin Islands of the U.S. Saint Croix (County Equivalent)
## 17 Virgin Islands of the U.S. Saint John (County Equivalent)
## 18 Virgin Islands of the U.S. Saint Thomas (County Equivalent)
So - some of it is cities rather than counties. Understandable. Someo f it is the USVIs. And some of it might be simply due to funny names. Let’s look at one - that Prince of Wales county in Alaska as a case study.
bad_match[1,]
## # A tibble: 1 x 4
## State County Death_Rate FIPS_Code
## <chr> <chr> <dbl> <dbl>
## 1 Alaska Prince of Wales-Outer Ketchikan NA 2201
map_sf %>%
filter(State=="Alaska") %>%
filter(stringr::str_detect(County, "Prince")) %>%
as_tibble() %>%
select(State, County)
## # A tibble: 1 x 2
## State County
## <chr> <chr>
## 1 Alaska Prince of Wales-Hyder
Oh! So, names are off. Huh. What if instead of joining by state and county we used something else - like a standard FIPS code! We have a small problem - the map_sf
object needs a full state-county FIPS code. And then we can make the join ignore state and county using the by
argument to any join.
map_sf <- map_sf %>%
mutate(FIPS_Code = stringr::str_c(statefp, countyfp) %>% as.numeric)
anti_join(heart_disease, map_sf,
by = "FIPS_Code")
## # A tibble: 9 x 4
## State County Death_Rate FIPS_Code
## <chr> <chr> <dbl> <dbl>
## 1 Alaska Prince of Wales-Outer Ketchik… NA 2201
## 2 Alaska Skagway-Hoonah-Angoon NA 2232
## 3 Alaska Wade Hampton 427. 2270
## 4 Alaska Wrangell-Petersburg NA 2280
## 5 South Dakota Shannon 288. 46113
## 6 Virginia Bedford City 330. 51515
## 7 Virgin Islands of the U.S. Saint Croix (County Equivalen… NA 78010
## 8 Virgin Islands of the U.S. Saint John (County Equivalent) NA 78020
## 9 Virgin Islands of the U.S. Saint Thomas (County Equivale… NA 78030
Only 9. Were we trying to make this perfect, we’d try and figure out why, but, for now, let’s soldier on. 9 is acceptable.
What if we go the other way?
anti_join(map_sf, heart_disease,
by = "FIPS_Code")
## Simple feature collection with 85 features and 13 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -166.2118 ymin: 17.91377 xmax: -65.22157 ymax: 63.28039
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
## statefp countyfp countyns affgeoid geoid County lsad aland
## 1 72 145 01804553 0500000US72145 72145 Vega Baja 13 118766803
## 2 72 129 01804545 0500000US72129 72129 San Lorenzo 13 137547980
## 3 72 019 01804489 0500000US72019 72019 Barranquitas 13 88714041
## 4 72 061 01804511 0500000US72061 72061 Guaynabo 13 71444703
## 5 72 027 01804493 0500000US72027 72027 Camuy 13 120058277
## 6 72 135 01804548 0500000US72135 72135 Toa Alta 13 69972604
## 7 72 137 01804549 0500000US72137 72137 Toa Baja 13 60196196
## 8 72 067 01804514 0500000US72067 72067 Hormigueros 13 29384893
## 9 72 015 01804487 0500000US72015 72015 Arroyo 13 38869483
## 10 46 102 01266992 0500000US46102 46102 Oglala Lakota 06 5422847813
## awater State state_abbr jurisdiction_type FIPS_Code
## 1 57805868 Puerto Rico PR territory 72145
## 2 256425 Puerto Rico PR territory 72129
## 3 77812 Puerto Rico PR territory 72019
## 4 468234 Puerto Rico PR territory 72061
## 5 40430074 Puerto Rico PR territory 72027
## 6 1405463 Puerto Rico PR territory 72135
## 7 48081472 Puerto Rico PR territory 72137
## 8 481 Puerto Rico PR territory 72067
## 9 53459916 Puerto Rico PR territory 72015
## 10 7126221 South Dakota SD state 46102
## geometry
## 1 MULTIPOLYGON (((-66.44899 1...
## 2 MULTIPOLYGON (((-66.05255 1...
## 3 MULTIPOLYGON (((-66.35163 1...
## 4 MULTIPOLYGON (((-66.12172 1...
## 5 MULTIPOLYGON (((-66.90157 1...
## 6 MULTIPOLYGON (((-66.30366 1...
## 7 MULTIPOLYGON (((-66.2581 18...
## 8 MULTIPOLYGON (((-67.15973 1...
## 9 MULTIPOLYGON (((-66.08226 1...
## 10 MULTIPOLYGON (((-103.0011 4...
OH! Territories. There’s no data for them. That’s OK for now.
So we know that there are some territories and cities left in the heart disease data, and we don’t want to worry about them for the moment. Given that we’ve got some mismatch on both sides, we want to use an left_join
- although we could also full_join
to eliminate some information. Try it both ways.
Let’s also get rid of Hawaii and Alaska for the moment, as well as anything that is a territory. So, just the lower 48!
We will also need to make sure that this is indeed an sf object. Joining spatial and non-spatial objects leads to non-spatial objects!
heart_disease_map_data <- left_join(heart_disease, map_sf) %>%
filter(jurisdiction_type != "territory") %>%
filter(!(State %in% c("Hawaii", "Alaska"))) %>%
st_as_sf(crs = 4326)
And now - let’s plot it!
heart_map <- ggplot(data=heart_disease_map_data,
mapping = aes(fill = Death_Rate)) +
geom_sf(color = NA) +
theme_minimal()
heart_map
So, this is a choropleth map. There are some issues - the scale is hard to resolve, and everything is hard to see. There are a lot of ways we could rectify it. Here are a few tricks.
First, a better scale. Maybe a gradient?
heart_map +
scale_fill_gradient(low = "white", high = "red")
OK, not bad! Not great, but much better. Are there gradients you’d prefer?
However, the bigger problem is that we have a huge range to cover. One common way to make choropleths is to bin data into categories and go from there. Binning in combination with a nice color scale - say via Color Brewer - cab be great. Remember Color Brewer? http://colorbrewer2.org/ actually uses maps as it’s demo! And there’s a scale_fill_brewer
function in ggplot2.
So let’s make bins and plot that instead!
heart_map_binned <- ggplot(data=heart_disease_map_data,
mapping = aes(fill = cut_interval(Death_Rate, 5))) +
geom_sf(color = NA) +
theme_minimal()
heart_map_binned +
scale_fill_brewer(palette="Reds")
MUCH nicer. Now we can really start to see some hot spots.
For our faded examples, today we’re going to look at data on TB incidence from the World Health organization. This look at TB for the entire planet at the country level. Let’s take a look
library(readr)
tb_world <- read_csv("./data/who_tb_data.csv")
tb_world
## # A tibble: 3,651 x 18
## country iso2 iso3 g_whoregion year estimated_popul… est_incidences_…
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Afghan… AF AFG EMR 2000 20093756 190
## 2 Afghan… AF AFG EMR 2001 20966463 189
## 3 Afghan… AF AFG EMR 2002 21979923 189
## 4 Afghan… AF AFG EMR 2003 23064851 189
## 5 Afghan… AF AFG EMR 2004 24118979 189
## 6 Afghan… AF AFG EMR 2005 25070798 189
## 7 Afghan… AF AFG EMR 2006 25893450 189
## 8 Afghan… AF AFG EMR 2007 26616792 189
## 9 Afghan… AF AFG EMR 2008 27294031 189
## 10 Afghan… AF AFG EMR 2009 28004331 189
## # … with 3,641 more rows, and 11 more variables:
## # est_incidences_per_100K_people_lwr <dbl>,
## # est_incidences_per_100K_people_upr <dbl>, est_incidences <dbl>,
## # est_incidences_lwr <dbl>, est_incidences_upr <dbl>,
## # est_mortality_per_100K_people <dbl>,
## # est_mortality_per_100K_people_lwr <dbl>,
## # est_mortality_per_100K_people_upr <dbl>, est_mortality <dbl>,
## # est_mortality_lwr <dbl>, est_mortality_upr <dbl>
There’s a lot going on here - in particular let’s focus on estimated insidences, incidences per 100K people, and mortality. upr
and lwr
denote upper and lower bounds to estimates.
So, let’s make a simple map of 2015.
Before we do that, I want to introduce you to rnaturalearth
. It’s a package that I use almost daily for really nice maps. While it doesn’t have detailed county level data, it has a ton of data at the world and state/province level coupled with iso3 standardized international codes. Let’s see it in action for just ONE type of map.
library(rnaturalearth)
worldmap <- ne_countries(returnclass = "sf")
names(worldmap)
## [1] "scalerank" "featurecla" "labelrank" "sovereignt" "sov_a3"
## [6] "adm0_dif" "level" "type" "admin" "adm0_a3"
## [11] "geou_dif" "geounit" "gu_a3" "su_dif" "subunit"
## [16] "su_a3" "brk_diff" "name" "name_long" "brk_a3"
## [21] "brk_name" "brk_group" "abbrev" "postal" "formal_en"
## [26] "formal_fr" "note_adm0" "note_brk" "name_sort" "name_alt"
## [31] "mapcolor7" "mapcolor8" "mapcolor9" "mapcolor13" "pop_est"
## [36] "gdp_md_est" "pop_year" "lastcensus" "gdp_year" "economy"
## [41] "income_grp" "wikipedia" "fips_10" "iso_a2" "iso_a3"
## [46] "iso_n3" "un_a3" "wb_a2" "wb_a3" "woe_id"
## [51] "adm0_a3_is" "adm0_a3_us" "adm0_a3_un" "adm0_a3_wb" "continent"
## [56] "region_un" "subregion" "region_wb" "name_len" "long_len"
## [61] "abbrev_len" "tiny" "homepart" "geometry"
WHOAH! That is a ton of information. However, I’d like you to notice the iso_a3 code, analagous to the iso3 code in the WHO data. We can use it to join on.
So, 2015 - note, I’m going to left join on the worldmap in order to ensure that we retain all of the countries of the world
#get a map
worldmap <- ne_countries(returnclass = "sf")
#filter to 2015
tb_2015 <- tb_world %>% filter(year == 2015)
#join
incidence_df <- left_join(worldmap, tb_2015, by = c("iso_a3" = "iso3"))
#plot
ggplot(incidence_df, aes(fill=est_incidences)) +
geom_sf() +
scale_fill_gradient(low = "blue", high = "red")
Nice!
Let’s look at mortality in 2000
#get a map
worldmap <- ne_countries(returnclass = "____")
#filter to 2015
tb_2000 <- tb_world %>% filter(year == ____)
#join
incidence_df_2000 <- left_join(worldmap, ____, by = c("iso_a3" = "iso3"))
#plot
ggplot(incidence_df_2000, aes(fill=est_mortality)) +
geom_sf() +
scale_fill_gradient(____)
Now let’s look at incidence per 100K, but, only by interval, in 2016 - feel free to do something difference with the fill
#get a map
worldmap <- ____(returnclass = "____")
#filter to 2016
tb_2016 <- tb_world %>% ____(year == ____)
#join
incidence_df_2016 <- left_join(worldmap, ____, by = c("iso_a3" = "iso3"))
#plot
ggplot(incidence_df_2016, aes(fill=____(est_incidences_per_100K_people,5))) +
____() +
scale_fill_brewer(palette = "YlOrBr",
guide = guide_legend("Incidences per 100K"))
What does not join well? Can you fix any of them via changes in the tb or worldmap dataset? Or should you not bother?
Can you make a map showing change in mortality over four years (filter to four years, and then use facet_wrap
)?
This is all well and good, but what if you want to be google maps? There are a few packages out there. leaflet and tmap are the dominant. I <3 leaflet, but I find tmap is easier to get used to, as it functions similarly to ggplot. There’s also a nice primer in Geocomputation in R. Both are great, but, for the moment, let’s try tmap
!
Let’s start with this heart diesease data to see how it works in general. The basic way that tmap works is that it uses tmap_shape()
to reference a dataset, and then a series of functions after to add features from that dataset - e.g., tm_polygons()
. It doesn’t use NSE, so we put quotes around column names.
library(tmap)
first_tmap <- tm_shape(shp = heart_disease_map_data) +
tm_polygons(col = "Death_Rate")
Hey, wait, I said interactive! To make it so, we call tmap_mode("view")
before we run things. Right now we’re in the plot
view.
tmap_mode("view")
first_tmap
OH! Cool! Play around a bit here!
You can still do a lot with tmap
and the vignette shows off a lot like facets. Also, see how we add palettes. I’m using a list because we have two elements. And see tmaptools::palette_explorer()
for more on colors.
tm_shape(incidence_df_2016) +
tm_polygons(c("est_incidences", "est_mortality"),
palette = list("magma", "viridis")) +
tm_facets(sync = TRUE, ncol = 2)
We can also use facets in a non-interactive way that is quite useful..
tmap_mode("plot")
tm_shape(incidence_df_2016) +
tm_polygons("est_incidences") +
tm_facets(by = "region_un")
Turn the following two examples:
ggplot(incidence_df, aes(fill=est_incidences)) +
geom_sf() +
scale_fill_gradient(low = "blue", high = "red")
ggplot(incidence_df_2016, aes(fill=cut_interval(est_incidences_per_100K_people,5))) +
geom_sf() +
scale_fill_brewer(palette = "YlOrBr",
guide = guide_legend("Incidences per 100K"))
into tmap syntax using tm_shape()
, tm_polygons()
, and the palette
argument. Explore palettes if need be. To change the legend name, see ?tm_polygons
for the proper argument. Make them interactive!
OK, this is all well and good. Working alone or in groups, I’d like you to load up the coronavirus
package. Remembering that a. Each row represents one report, b. Each report has a type (confirmed, death, or recovered), c. Each report comes with other geospatial information - country, state, etc.
I’d like you to visualize some aspect of the spatial distribution of coronavirus. The maximum date in the dataset isn’t current, so, just remember that. If you want to look at a more current dataset, check https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases - but you will have to look hard at the data to understand what is there. (I’m working on a method to get something more current within R along with Rami Krispin and others.)