Why Tidy?

For the past few weeks, we’ve been grappling with data sets that we’ve almost wrestled down to size. The only problem is, what we’re left with is a big table with one score per row, and a lot of NAs. This doesn’t seem efficient, nor does it seem like a good format for plotting. Clearly, we need to TIDY THE DATA.

Long versus Wide

As we talked about a few weeks ago, there are both long and wide data formats. Long data formats have one observation and one measurement per row. So, multiple rows constitude a single observation.

long from R4DS

long from R4DS

These kinds of data sets are great for plotting summary information for each group and each variable aggregated together. Think of a bar plot with averages for different species taken across different sites.

Wide data has a every measurement in a single observation in a single row.

wide from R4DS

wide from R4DS

This kind of data is ideal for things like scatterplots of one measurement against another, with each observation as a single data point.

Long to Wide

Quite often, data is recorded in a long format for efficiency. Let’s talk about this with respect to a data set of mammals. For every site a researcher visited, they wrote down the species of mammals they say, and the density of those mammals. As this was just a running tally, the data ended up in a long format:

## The mammals data frame:
##   site                taxon density
## 1    1      Suncus etruscus     6.2
## 2    1       Sorex cinereus     5.2
## 3    2     Myotis nigricans    11.0
## 4    3 Notiosorex crawfordi     1.2
## 5    3      Suncus etruscus     9.4
## 6    3     Myotis nigricans     9.6

To get the data yourself, run the following in your instance of R:

library(tidyverse)

#### Some data for the tidyr lecture
mammals <- data.frame(site = c(1,1,2,3,3,3), 
                      taxon = c('Suncus etruscus', 'Sorex cinereus', 
                                'Myotis nigricans', 'Notiosorex crawfordi', 
                                'Suncus etruscus', 'Myotis nigricans'),
                      density = c(6.2, 5.2, 11.0, 1.2, 9.4, 9.6)
)

But, what if we wanted to plot the abundance of, say, Myostis nigricans against Suncus etruscus - we can’t do that from here. We need to pivot_wider the data into a wide format. For that, there is the tidyr library and it’s pivot_wider function.

pivot_wider works simply in that it take a data set, and you tell it the name of the column which contains values that will be the column names in your new data set - a so called “key” column. You also tell it which column contains the relevant numbers - the “values” column.

library(tidyr)

m_wide <- mammals %>%
  pivot_wider(names_from = taxon,
              values_from = density)

m_wide
## # A tibble: 3 x 5
##    site `Suncus etruscus` `Sorex cinereus` `Myotis nigrican… `Notiosorex crawfo…
##   <dbl>             <dbl>            <dbl>             <dbl>               <dbl>
## 1     1               6.2              5.2              NA                  NA  
## 2     2              NA               NA                11                  NA  
## 3     3               9.4             NA                 9.6                 1.2

Great! We have a wide data set. But - uh oh, what are all of those NAs doing there?

Filling in null values

We often have data sets like this. Ones where we have many measurements that we have not observed. Sometimes, they are NA - we just didn’t take them. Other times, such as in this data set, we want to fill them in with a sensible value, such as 0.

The simplest approach here would be to use the values_fill argument in pivot_wider. Note, we have to specify a list here, as pivot wider can work across multiple columns of values.

m_wide_0 <- mammals %>%
  pivot_wider(names_from = taxon,
              values_from = density,
              values_fill = list(density = 0))
m_wide_0
## # A tibble: 3 x 5
##    site `Suncus etruscus` `Sorex cinereus` `Myotis nigrican… `Notiosorex crawfo…
##   <dbl>             <dbl>            <dbl>             <dbl>               <dbl>
## 1     1               6.2              5.2               0                   0  
## 2     2               0                0                11                   0  
## 3     3               9.4              0                 9.6                 1.2

Perfect! Other times, though, we want to fill those in right into the long data format. For that we have the complete function. In that function, we specify which columns we want all combinations of, and then supply a list of how new values should be filled in for other columns. If we don’t give a column name in that list, it defaults to NA. Oh, complete function… you complete me.

m_long_0 <- mammals %>%
  complete(site, taxon, fill=list(density=0))

m_long_0
## # A tibble: 12 x 3
##     site taxon                density
##    <dbl> <fct>                  <dbl>
##  1     1 Myotis nigricans         0  
##  2     1 Notiosorex crawfordi     0  
##  3     1 Sorex cinereus           5.2
##  4     1 Suncus etruscus          6.2
##  5     2 Myotis nigricans        11  
##  6     2 Notiosorex crawfordi     0  
##  7     2 Sorex cinereus           0  
##  8     2 Suncus etruscus          0  
##  9     3 Myotis nigricans         9.6
## 10     3 Notiosorex crawfordi     1.2
## 11     3 Sorex cinereus           0  
## 12     3 Suncus etruscus          9.4

We could then take that new complete long data frame and pivot_wider it with abandon! Either way, now we can plot!

library(ggplot2)

ggplot(m_wide_0, mapping = aes(x=`Myotis nigricans`,
                               y=`Suncus etruscus`)) +
  geom_point(size=1.4) +
  stat_smooth() +
  theme_bw()

OK, so, that’s not such an impressive plot, but you see what we’re getting at here…

Exercise with Spread

  1. Load the hadcrut data. Filter out the column month (it provides duplicate information). Now, use pivot_wider so that each month_name has it’s own column. Is there a relationship between January and August anomoly?

  2. We could have pivoted by year instead. What are the rows now? Notice anything funny about the column names?

  3. Is there a relationship between the anomoly in 1942 and 2012?

Wide to Long

But what if your intrepid mammal researcher had been wise and recorded their data in a wide format (in truth, this is what I do all the time). And instead of a scatterplot, we wanted to compare the densities of all of the different species in a point-range plot? Well, then we’d have to go from a wide format to long.

To do this, tidyr offers the pivot_longer function, as in gather up your wide data into one nice long piece. pivot_longer is a little trickier, as you specify what you want the name of the new key column to be, what you want the name of the new values column to be, and then…well, you have some options. You can either specify which columns are to be gathered up (which can take a while) or you can specify which columns you want to exclude. You can do fancy things, like specify all columns between two others with a :, but, honestly, I often specify which to exclude. Let’s see two examples.

m_long <- m_wide_0 %>%
  pivot_longer(cols = -site, 
               names_to = "Species_name", 
               values_to = "Density")
  
m_long
## # A tibble: 12 x 3
##     site Species_name         Density
##    <dbl> <chr>                  <dbl>
##  1     1 Suncus etruscus          6.2
##  2     1 Sorex cinereus           5.2
##  3     1 Myotis nigricans         0  
##  4     1 Notiosorex crawfordi     0  
##  5     2 Suncus etruscus          0  
##  6     2 Sorex cinereus           0  
##  7     2 Myotis nigricans        11  
##  8     2 Notiosorex crawfordi     0  
##  9     3 Suncus etruscus          9.4
## 10     3 Sorex cinereus           0  
## 11     3 Myotis nigricans         9.6
## 12     3 Notiosorex crawfordi     1.2
#Another way of doing the same thing
m_long <- m_wide_0 %>%
  pivot_longer(cols = `Suncus etruscus`:`Notiosorex crawfordi`,
         names_to = "Species_name", 
         values_to = "Density")

m_long
## # A tibble: 12 x 3
##     site Species_name         Density
##    <dbl> <chr>                  <dbl>
##  1     1 Suncus etruscus          6.2
##  2     1 Sorex cinereus           5.2
##  3     1 Myotis nigricans         0  
##  4     1 Notiosorex crawfordi     0  
##  5     2 Suncus etruscus          0  
##  6     2 Sorex cinereus           0  
##  7     2 Myotis nigricans        11  
##  8     2 Notiosorex crawfordi     0  
##  9     3 Suncus etruscus          9.4
## 10     3 Sorex cinereus           0  
## 11     3 Myotis nigricans         9.6
## 12     3 Notiosorex crawfordi     1.2

And now that we have a long format, let’s plot some means of species!

ggplot(m_long, mapping=aes(x=Species_name, y=Density)) +
  geom_boxplot() +
  geom_point(size=2, color = "red") 

Exercises

  1. With the hadcrut data with years as columns, select a single decade and then reshape it to a wide format. Plot the summary statistics for the months in that decade.

  2. With the hadcrut data with months as columns, select two months of interest, reshape the data, and plot the trend for both years over time.

Hey, wait, these only work on one “Value” column…

You may have noticed that we’ve been throwing around just one value column. What if you have two - say, density and average height? Or three? Or more? In other words, a not-long-not-wide-but-in-between data set. These are common.

#make randomness less random
set.seed(355)

mamh <- mammals %>%
  mutate(height=rnorm(6,30,3))

What do we do, if we want to fill in zeros for all of the densities AND heights? Turns out, there are a lot of answers. This becomes a bit of fun puzzle solving, in a way. Let’s explore them, from simple to complex, to illustrate a few concepts.

First, the complete solution (HA!) We can just use complete here. If you give multiple columns to the fill list argument, you will be able to fill them all - and even with different values, if need be! For example, we know that if density is 0, height would be NA, right?

mamh_0_complete <- mamh %>%
  complete(site, taxon,
           fill = list(height = NA, density = 0))

mamh_0_complete
## # A tibble: 12 x 4
##     site taxon                density height
##    <dbl> <fct>                  <dbl>  <dbl>
##  1     1 Myotis nigricans         0     NA  
##  2     1 Notiosorex crawfordi     0     NA  
##  3     1 Sorex cinereus           5.2   36.8
##  4     1 Suncus etruscus          6.2   27.1
##  5     2 Myotis nigricans        11     24.8
##  6     2 Notiosorex crawfordi     0     NA  
##  7     2 Sorex cinereus           0     NA  
##  8     2 Suncus etruscus          0     NA  
##  9     3 Myotis nigricans         9.6   27.7
## 10     3 Notiosorex crawfordi     1.2   29.2
## 11     3 Sorex cinereus           0     NA  
## 12     3 Suncus etruscus          9.4   31.1

That’s one way to do it. Another is to make the data truly long, and work from there.

mamh_long  <- mamh %>%
  pivot_longer(cols = c(density, height),
               names_to = "measurement",
               values_to = "value")

mamh_long
## # A tibble: 12 x 4
##     site taxon                measurement value
##    <dbl> <fct>                <chr>       <dbl>
##  1     1 Suncus etruscus      density       6.2
##  2     1 Suncus etruscus      height       27.1
##  3     1 Sorex cinereus       density       5.2
##  4     1 Sorex cinereus       height       36.8
##  5     2 Myotis nigricans     density      11  
##  6     2 Myotis nigricans     height       24.8
##  7     3 Notiosorex crawfordi density       1.2
##  8     3 Notiosorex crawfordi height       29.2
##  9     3 Suncus etruscus      density       9.4
## 10     3 Suncus etruscus      height       31.1
## 11     3 Myotis nigricans     density       9.6
## 12     3 Myotis nigricans     height       27.7

Now, we have a lot of options. One would be to use complete, albeit in a less satisfying way than before, and then pivoting wide back.

mamh_0_from_long <- mamh_long %>%
  
  #fill in the 0s
  complete(site, taxon, measurement, 
           fill = list(value = 0)) %>%
  
  #pivot back to a wider frame
  pivot_wider(names_from = measurement,
              values_from = value)

mamh_0_from_long
## # A tibble: 12 x 4
##     site taxon                density height
##    <dbl> <fct>                  <dbl>  <dbl>
##  1     1 Myotis nigricans         0      0  
##  2     1 Notiosorex crawfordi     0      0  
##  3     1 Sorex cinereus           5.2   36.8
##  4     1 Suncus etruscus          6.2   27.1
##  5     2 Myotis nigricans        11     24.8
##  6     2 Notiosorex crawfordi     0      0  
##  7     2 Sorex cinereus           0      0  
##  8     2 Suncus etruscus          0      0  
##  9     3 Myotis nigricans         9.6   27.7
## 10     3 Notiosorex crawfordi     1.2   29.2
## 11     3 Sorex cinereus           0      0  
## 12     3 Suncus etruscus          9.4   31.1

tidyr can be even more flexible on that count. You can use unite to turn those columns into something that’s easy to separate later down the line. Which we do with… regular expressions! Yes, they are everywhere. Let’s take an example of height in the mammals data set.

So now we want to do everything above, but with density and height! So, we unite them, using a _ as our separator. We could have used anything, but I like _ because it’s used for so few other things with numbers. tidyr just needs to know the name of the new column, and then which columns will be united

mamh2 <- mamh %>%
  unite(measurement, density, height, sep="_")

mamh2
##   site                taxon          measurement
## 1    1      Suncus etruscus  6.2_27.078930680398
## 2    1       Sorex cinereus 5.2_36.7693975641899
## 3    2     Myotis nigricans    11_24.80802853174
## 4    3 Notiosorex crawfordi 1.2_29.1830714736418
## 5    3      Suncus etruscus 9.4_31.1451832551941
## 6    3     Myotis nigricans 9.6_27.6589585537913

We can now proceed as normal. Let’s say, for example, that you wanted to fill in all of the empty species-site combos with zeroes. Here’s one workflow to do that.

mamh_long <- mamh2 %>%
  spread(taxon, measurement, fill="0_0") %>%
  gather(taxon, measurement, -site)

mamh_long
##    site                taxon          measurement
## 1     1     Myotis nigricans                  0_0
## 2     2     Myotis nigricans    11_24.80802853174
## 3     3     Myotis nigricans 9.6_27.6589585537913
## 4     1 Notiosorex crawfordi                  0_0
## 5     2 Notiosorex crawfordi                  0_0
## 6     3 Notiosorex crawfordi 1.2_29.1830714736418
## 7     1       Sorex cinereus 5.2_36.7693975641899
## 8     2       Sorex cinereus                  0_0
## 9     3       Sorex cinereus                  0_0
## 10    1      Suncus etruscus  6.2_27.078930680398
## 11    2      Suncus etruscus                  0_0
## 12    3      Suncus etruscus 9.4_31.1451832551941

OK, great! We have all of the “0_0” entries (notice what I did there for the fill?), but, now we want to restore our old measurements. For that, we have separate which takes the relevant column, the new column names in a vector, and the pattern you match to split them.

mamh_long <- mamh_long %>%
  separate(measurement, into = c("density", "height"), sep="_")

mamh_long
##    site                taxon density           height
## 1     1     Myotis nigricans       0                0
## 2     2     Myotis nigricans      11   24.80802853174
## 3     3     Myotis nigricans     9.6 27.6589585537913
## 4     1 Notiosorex crawfordi       0                0
## 5     2 Notiosorex crawfordi       0                0
## 6     3 Notiosorex crawfordi     1.2 29.1830714736418
## 7     1       Sorex cinereus     5.2 36.7693975641899
## 8     2       Sorex cinereus       0                0
## 9     3       Sorex cinereus       0                0
## 10    1      Suncus etruscus     6.2  27.078930680398
## 11    2      Suncus etruscus       0                0
## 12    3      Suncus etruscus     9.4 31.1451832551941

One note - these new columns will be treated as characters (you did split a string after all), so you’ll need to do some cleanup

mamh_long <- mamh_long %>%
  mutate(density = as.numeric(density),
         height = as.numeric(height))

And there you go! You can now do exciting things like plot the density-height relationship, grouping by taxon, for example.

ggplot(mamh_long,
       aes(x = density, y = height, color = taxon)) +
  geom_point() 

Exercises

  1. OK, to make sure you got separate and unite down, first, separate the taxon into two columns - Genus and species (taxon is currently a latin binomial).

  2. Add four columns to the original mammals frame. Now use any technique above (which may or may not involve unite) to fill in 0s for all four columns. There are many different ways to solve this. How few lines of code can you write?