For the past few weeks, we’ve been grappling with data sets that we’ve almost wrestled down to size. The only problem is, what we’re left with is a big table with one score per row, and a lot of NAs. This doesn’t seem efficient, nor does it seem like a good format for plotting. Clearly, we need to TIDY THE DATA.
As we talked about a few weeks ago, there are both long and wide data formats. Long data formats have one observation and one measurement per row. So, multiple rows constitude a single observation.
These kinds of data sets are great for plotting summary information for each group and each variable aggregated together. Think of a bar plot with averages for different species taken across different sites.
Wide data has a every measurement in a single observation in a single row.
This kind of data is ideal for things like scatterplots of one measurement against another, with each observation as a single data point.
Quite often, data is recorded in a long format for efficiency. Let’s talk about this with respect to a data set of mammals. For every site a researcher visited, they wrote down the species of mammals they say, and the density of those mammals. As this was just a running tally, the data ended up in a long format:
## The mammals data frame:
## site taxon density
## 1 1 Suncus etruscus 6.2
## 2 1 Sorex cinereus 5.2
## 3 2 Myotis nigricans 11.0
## 4 3 Notiosorex crawfordi 1.2
## 5 3 Suncus etruscus 9.4
## 6 3 Myotis nigricans 9.6
To get the data yourself, run the following in your instance of R:
library(tidyverse)
#### Some data for the tidyr lecture
mammals <- data.frame(site = c(1,1,2,3,3,3),
taxon = c('Suncus etruscus', 'Sorex cinereus',
'Myotis nigricans', 'Notiosorex crawfordi',
'Suncus etruscus', 'Myotis nigricans'),
density = c(6.2, 5.2, 11.0, 1.2, 9.4, 9.6)
)
But, what if we wanted to plot the abundance of, say, Myostis nigricans against Suncus etruscus - we can’t do that from here. We need to pivot_wider
the data into a wide format. For that, there is the tidyr
library and it’s pivot_wider
function.
pivot_wider
works simply in that it take a data set, and you tell it the name of the column which contains values that will be the column names in your new data set - a so called “key” column. You also tell it which column contains the relevant numbers - the “values” column.
library(tidyr)
m_wide <- mammals %>%
pivot_wider(names_from = taxon,
values_from = density)
m_wide
## # A tibble: 3 x 5
## site `Suncus etruscus` `Sorex cinereus` `Myotis nigrican… `Notiosorex crawfo…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6.2 5.2 NA NA
## 2 2 NA NA 11 NA
## 3 3 9.4 NA 9.6 1.2
Great! We have a wide data set. But - uh oh, what are all of those NAs doing there?
We often have data sets like this. Ones where we have many measurements that we have not observed. Sometimes, they are NA - we just didn’t take them. Other times, such as in this data set, we want to fill them in with a sensible value, such as 0.
The simplest approach here would be to use the values_fill
argument in pivot_wider
. Note, we have to specify a list here, as pivot wider can work across multiple columns of values.
m_wide_0 <- mammals %>%
pivot_wider(names_from = taxon,
values_from = density,
values_fill = list(density = 0))
m_wide_0
## # A tibble: 3 x 5
## site `Suncus etruscus` `Sorex cinereus` `Myotis nigrican… `Notiosorex crawfo…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6.2 5.2 0 0
## 2 2 0 0 11 0
## 3 3 9.4 0 9.6 1.2
Perfect! Other times, though, we want to fill those in right into the long data format. For that we have the complete
function. In that function, we specify which columns we want all combinations of, and then supply a list of how new values should be filled in for other columns. If we don’t give a column name in that list, it defaults to NA. Oh, complete
function… you complete
me.
m_long_0 <- mammals %>%
complete(site, taxon, fill=list(density=0))
m_long_0
## # A tibble: 12 x 3
## site taxon density
## <dbl> <fct> <dbl>
## 1 1 Myotis nigricans 0
## 2 1 Notiosorex crawfordi 0
## 3 1 Sorex cinereus 5.2
## 4 1 Suncus etruscus 6.2
## 5 2 Myotis nigricans 11
## 6 2 Notiosorex crawfordi 0
## 7 2 Sorex cinereus 0
## 8 2 Suncus etruscus 0
## 9 3 Myotis nigricans 9.6
## 10 3 Notiosorex crawfordi 1.2
## 11 3 Sorex cinereus 0
## 12 3 Suncus etruscus 9.4
We could then take that new complete
long data frame and pivot_wider
it with abandon! Either way, now we can plot!
library(ggplot2)
ggplot(m_wide_0, mapping = aes(x=`Myotis nigricans`,
y=`Suncus etruscus`)) +
geom_point(size=1.4) +
stat_smooth() +
theme_bw()
OK, so, that’s not such an impressive plot, but you see what we’re getting at here…
Load the hadcrut data. Filter out the column month (it provides duplicate information). Now, use pivot_wider
so that each month_name
has it’s own column. Is there a relationship between January and August anomoly?
We could have pivoted by year instead. What are the rows now? Notice anything funny about the column names?
Is there a relationship between the anomoly in 1942 and 2012?
But what if your intrepid mammal researcher had been wise and recorded their data in a wide format (in truth, this is what I do all the time). And instead of a scatterplot, we wanted to compare the densities of all of the different species in a point-range plot? Well, then we’d have to go from a wide format to long.
To do this, tidyr
offers the pivot_longer
function, as in gather up your wide data into one nice long piece. pivot_longer
is a little trickier, as you specify what you want the name of the new key column to be, what you want the name of the new values column to be, and then…well, you have some options. You can either specify which columns are to be gathered up (which can take a while) or you can specify which columns you want to exclude. You can do fancy things, like specify all columns between two others with a :, but, honestly, I often specify which to exclude. Let’s see two examples.
m_long <- m_wide_0 %>%
pivot_longer(cols = -site,
names_to = "Species_name",
values_to = "Density")
m_long
## # A tibble: 12 x 3
## site Species_name Density
## <dbl> <chr> <dbl>
## 1 1 Suncus etruscus 6.2
## 2 1 Sorex cinereus 5.2
## 3 1 Myotis nigricans 0
## 4 1 Notiosorex crawfordi 0
## 5 2 Suncus etruscus 0
## 6 2 Sorex cinereus 0
## 7 2 Myotis nigricans 11
## 8 2 Notiosorex crawfordi 0
## 9 3 Suncus etruscus 9.4
## 10 3 Sorex cinereus 0
## 11 3 Myotis nigricans 9.6
## 12 3 Notiosorex crawfordi 1.2
#Another way of doing the same thing
m_long <- m_wide_0 %>%
pivot_longer(cols = `Suncus etruscus`:`Notiosorex crawfordi`,
names_to = "Species_name",
values_to = "Density")
m_long
## # A tibble: 12 x 3
## site Species_name Density
## <dbl> <chr> <dbl>
## 1 1 Suncus etruscus 6.2
## 2 1 Sorex cinereus 5.2
## 3 1 Myotis nigricans 0
## 4 1 Notiosorex crawfordi 0
## 5 2 Suncus etruscus 0
## 6 2 Sorex cinereus 0
## 7 2 Myotis nigricans 11
## 8 2 Notiosorex crawfordi 0
## 9 3 Suncus etruscus 9.4
## 10 3 Sorex cinereus 0
## 11 3 Myotis nigricans 9.6
## 12 3 Notiosorex crawfordi 1.2
And now that we have a long format, let’s plot some means of species!
ggplot(m_long, mapping=aes(x=Species_name, y=Density)) +
geom_boxplot() +
geom_point(size=2, color = "red")
With the hadcrut data with years as columns, select a single decade and then reshape it to a wide format. Plot the summary statistics for the months in that decade.
With the hadcrut data with months as columns, select two months of interest, reshape the data, and plot the trend for both years over time.
You may have noticed that we’ve been throwing around just one value column. What if you have two - say, density and average height? Or three? Or more? In other words, a not-long-not-wide-but-in-between data set. These are common.
#make randomness less random
set.seed(355)
mamh <- mammals %>%
mutate(height=rnorm(6,30,3))
What do we do, if we want to fill in zeros for all of the densities AND heights? Turns out, there are a lot of answers. This becomes a bit of fun puzzle solving, in a way. Let’s explore them, from simple to complex, to illustrate a few concepts.
First, the complete solution (HA!) We can just use complete
here. If you give multiple columns to the fill
list argument, you will be able to fill them all - and even with different values, if need be! For example, we know that if density is 0, height would be NA, right?
mamh_0_complete <- mamh %>%
complete(site, taxon,
fill = list(height = NA, density = 0))
mamh_0_complete
## # A tibble: 12 x 4
## site taxon density height
## <dbl> <fct> <dbl> <dbl>
## 1 1 Myotis nigricans 0 NA
## 2 1 Notiosorex crawfordi 0 NA
## 3 1 Sorex cinereus 5.2 36.8
## 4 1 Suncus etruscus 6.2 27.1
## 5 2 Myotis nigricans 11 24.8
## 6 2 Notiosorex crawfordi 0 NA
## 7 2 Sorex cinereus 0 NA
## 8 2 Suncus etruscus 0 NA
## 9 3 Myotis nigricans 9.6 27.7
## 10 3 Notiosorex crawfordi 1.2 29.2
## 11 3 Sorex cinereus 0 NA
## 12 3 Suncus etruscus 9.4 31.1
That’s one way to do it. Another is to make the data truly long, and work from there.
mamh_long <- mamh %>%
pivot_longer(cols = c(density, height),
names_to = "measurement",
values_to = "value")
mamh_long
## # A tibble: 12 x 4
## site taxon measurement value
## <dbl> <fct> <chr> <dbl>
## 1 1 Suncus etruscus density 6.2
## 2 1 Suncus etruscus height 27.1
## 3 1 Sorex cinereus density 5.2
## 4 1 Sorex cinereus height 36.8
## 5 2 Myotis nigricans density 11
## 6 2 Myotis nigricans height 24.8
## 7 3 Notiosorex crawfordi density 1.2
## 8 3 Notiosorex crawfordi height 29.2
## 9 3 Suncus etruscus density 9.4
## 10 3 Suncus etruscus height 31.1
## 11 3 Myotis nigricans density 9.6
## 12 3 Myotis nigricans height 27.7
Now, we have a lot of options. One would be to use complete
, albeit in a less satisfying way than before, and then pivoting wide back.
mamh_0_from_long <- mamh_long %>%
#fill in the 0s
complete(site, taxon, measurement,
fill = list(value = 0)) %>%
#pivot back to a wider frame
pivot_wider(names_from = measurement,
values_from = value)
mamh_0_from_long
## # A tibble: 12 x 4
## site taxon density height
## <dbl> <fct> <dbl> <dbl>
## 1 1 Myotis nigricans 0 0
## 2 1 Notiosorex crawfordi 0 0
## 3 1 Sorex cinereus 5.2 36.8
## 4 1 Suncus etruscus 6.2 27.1
## 5 2 Myotis nigricans 11 24.8
## 6 2 Notiosorex crawfordi 0 0
## 7 2 Sorex cinereus 0 0
## 8 2 Suncus etruscus 0 0
## 9 3 Myotis nigricans 9.6 27.7
## 10 3 Notiosorex crawfordi 1.2 29.2
## 11 3 Sorex cinereus 0 0
## 12 3 Suncus etruscus 9.4 31.1
tidyr can be even more flexible on that count. You can use unite
to turn those columns into something that’s easy to separate
later down the line. Which we do with… regular expressions! Yes, they are everywhere. Let’s take an example of height in the mammals data set.
So now we want to do everything above, but with density and height! So, we unite them, using a _
as our separator. We could have used anything, but I like _ because it’s used for so few other things with numbers. tidyr
just needs to know the name of the new column, and then which columns will be united
mamh2 <- mamh %>%
unite(measurement, density, height, sep="_")
mamh2
## site taxon measurement
## 1 1 Suncus etruscus 6.2_27.078930680398
## 2 1 Sorex cinereus 5.2_36.7693975641899
## 3 2 Myotis nigricans 11_24.80802853174
## 4 3 Notiosorex crawfordi 1.2_29.1830714736418
## 5 3 Suncus etruscus 9.4_31.1451832551941
## 6 3 Myotis nigricans 9.6_27.6589585537913
We can now proceed as normal. Let’s say, for example, that you wanted to fill in all of the empty species-site combos with zeroes. Here’s one workflow to do that.
mamh_long <- mamh2 %>%
spread(taxon, measurement, fill="0_0") %>%
gather(taxon, measurement, -site)
mamh_long
## site taxon measurement
## 1 1 Myotis nigricans 0_0
## 2 2 Myotis nigricans 11_24.80802853174
## 3 3 Myotis nigricans 9.6_27.6589585537913
## 4 1 Notiosorex crawfordi 0_0
## 5 2 Notiosorex crawfordi 0_0
## 6 3 Notiosorex crawfordi 1.2_29.1830714736418
## 7 1 Sorex cinereus 5.2_36.7693975641899
## 8 2 Sorex cinereus 0_0
## 9 3 Sorex cinereus 0_0
## 10 1 Suncus etruscus 6.2_27.078930680398
## 11 2 Suncus etruscus 0_0
## 12 3 Suncus etruscus 9.4_31.1451832551941
OK, great! We have all of the “0_0” entries (notice what I did there for the fill?), but, now we want to restore our old measurements. For that, we have separate
which takes the relevant column, the new column names in a vector, and the pattern you match to split them.
mamh_long <- mamh_long %>%
separate(measurement, into = c("density", "height"), sep="_")
mamh_long
## site taxon density height
## 1 1 Myotis nigricans 0 0
## 2 2 Myotis nigricans 11 24.80802853174
## 3 3 Myotis nigricans 9.6 27.6589585537913
## 4 1 Notiosorex crawfordi 0 0
## 5 2 Notiosorex crawfordi 0 0
## 6 3 Notiosorex crawfordi 1.2 29.1830714736418
## 7 1 Sorex cinereus 5.2 36.7693975641899
## 8 2 Sorex cinereus 0 0
## 9 3 Sorex cinereus 0 0
## 10 1 Suncus etruscus 6.2 27.078930680398
## 11 2 Suncus etruscus 0 0
## 12 3 Suncus etruscus 9.4 31.1451832551941
One note - these new columns will be treated as characters (you did split a string after all), so you’ll need to do some cleanup
mamh_long <- mamh_long %>%
mutate(density = as.numeric(density),
height = as.numeric(height))
And there you go! You can now do exciting things like plot the density-height relationship, grouping by taxon, for example.
ggplot(mamh_long,
aes(x = density, y = height, color = taxon)) +
geom_point()
OK, to make sure you got separate
and unite
down, first, separate the taxon into two columns - Genus and species (taxon is currently a latin binomial).
Add four columns to the original mammals frame. Now use any technique above (which may or may not involve unite) to fill in 0s for all four columns. There are many different ways to solve this. How few lines of code can you write?