Making Data Work for You

You have Ascended. Others… have not.

You have spent almost half a semester immersing yourself in thinking about good data practice. You may not think you have, but I guarantee that you now have a strict aversion to many data practices that you might have walked into the room with. You know, for example, that you will always endeavor to have your data in a long or wide format, with complete columns, reasonable NA values, and strong data validation. One value per cell! And so many more principles.

Your past self and, frankly, many of your collaborators, are not at this level of transcendence. However, they are fabulous scientists producing juicy data.

Today, we’re going to look at a data set that looks not unlike something you might have produced in the past and walk through the steps of how to make it into something we can analyze!

The data

The data we’re going to work with is from the McCusker lab. It’s a beautiful piece of work looking at limb regeneration in Axoltls. These amphibians have incredible limb regeneration abilities. In essence, the idea of this experiment was to see if with some application of different compounds, different kinds of limb regrowth could be stimulated.

Based on where the limb was injured and whether a particular compound was applied, the researchers saw very different results of the types of limbs being formed. We want to take their data and recreate some of their iconic plots.

There’s just one problem. The dataset is not in an easy to use format. You can download the data and take a look. There are a lot of issues - although this is perfectly good if you are only working within excel.

How do we attack datasets like this?

The first question you might have is, what do I do?! Relax. Take a breath. Look at the dataset. Note any common features. Note anything that delimits sections. What needs to be modified in this data?

Let’s break the process down into steps.

What are the columns?
Do we need to standardize column names or otherwise?
Are there extra rows/values that can be eliminated?
Do we need to fill in any missing data?
Reshape and plot!

The Columns

Let’s begin by loading in the data, as well as loading up the tidyverse.

library(tidyverse)
library(readxl)

axoltl <- read_excel("./data/ALM assay statistics.xlsx")

axoltl

## # A tibble: 151 x 19
##    ...1    `Statistics on ca… ...3   ...4    ...5   ...6     ...7   ...8   ...9 
##    <chr>   <chr>              <chr>  <chr>   <chr>  <chr>    <chr>  <chr>  <lgl>
##  1 <NA>    <NA>               <NA>   <NA>    <NA>   <NA>     <NA>   <NA>   NA   
##  2 EXPERI… Animal ID          Pheno… <NA>    <NA>   <NA>     <NA>   <NA>   NA   
##  3 <NA>    <NA>               No ca… Single… multi… multipl… Compl… NOTES  NA   
##  4 HepIII… 64 RFL             <NA>   <NA>    1      <NA>     <NA>   dista… NA   
##  5 Comple… 64 LFL             <NA>   <NA>    <NA>   1        <NA>   dista… NA   
##  6 <NA>    80 RFL             1      <NA>    <NA>   <NA>     <NA>   <NA>   NA   
##  7 <NA>    80 LFL             <NA>   <NA>    1      <NA>     <NA>   dista… NA   
##  8 <NA>    82 RFL             <NA>   <NA>    1      <NA>     <NA>   dista… NA   
##  9 <NA>    82 LFL             <NA>   <NA>    1      <NA>     <NA>   dista… NA   
## 10 <NA>    76 RFL             <NA>   <NA>    1      <NA>     <NA>   dista… NA   
## # … with 141 more rows, and 10 more variables: ...10 <dttm>, ...11 <chr>,
## #   ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>, ...16 <chr>,
## #   ...17 <chr>, ...18 <chr>, ...19 <chr>

Well that’s no good. We notice instantly that the real column names are split between row 2 and 3. We can clean this up by making a new vector of column names, and then reloading the data, starting at row 4, with the proper names. Let’s start with making some new column names

cnames <- paste(axoltl[2,], axoltl[3,])

cnames

##  [1] "EXPERIMENT NA"                                                     
##  [2] "Animal ID NA"                                                      
##  [3] "Phenotypes No cartilage"                                           
##  [4] "NA Single cartilage element"                                       
##  [5] "NA multiple symetrical elements"                                   
##  [6] "NA multiple assymetrical with only 1 stylopod element(hypomorphic)"
##  [7] "NA Complete limb (with 2 stylopod elements)"                       
##  [8] "NA NOTES"                                                          
##  [9] "NA NA"                                                             
## [10] "NA NA"                                                             
## [11] "NA NA"                                                             
## [12] "NA NA"                                                             
## [13] "NA NA"                                                             
## [14] "NA NA"                                                             
## [15] "NA NA"                                                             
## [16] "NA NA"                                                             
## [17] "NA NA"                                                             
## [18] "NA NA"                                                             
## [19] "NA NA"

OK, this is good, except for three things. 1) The “NA” or " NA“s. 2) The random”Phenotypes " from the original sheet which we don’t need. 3) Replicate column names.

We can deal with the first two using regexps! The third we can use an awesome package called janitor which helps with cleaning excel files. Particularly, the function clean_names() Note, janitor will replace " " with "_" - so, let’s also fix that.

Exercise: If you were to do this replacement, step by step, what regexps would you use?

library(janitor)

#in one row!
cnames <- cnames %>%
  str_replace_all("^NA ", "") %>%
  str_replace_all( "NA$", "") %>%
  str_replace_all("Phenotypes ", "") %>%
  make_clean_names() %>%
  str_replace_all("_", " ")

cnames

##  [1] "experiment"                                                    
##  [2] "animal id"                                                     
##  [3] "no cartilage"                                                  
##  [4] "single cartilage element"                                      
##  [5] "multiple symetrical elements"                                  
##  [6] "multiple assymetrical with only 1 stylopod element hypomorphic"
##  [7] "complete limb with 2 stylopod elements"                        
##  [8] "notes"                                                         
##  [9] "x"                                                             
## [10] "x 2"                                                           
## [11] "x 3"                                                           
## [12] "x 4"                                                           
## [13] "x 5"                                                           
## [14] "x 6"                                                           
## [15] "x 7"                                                           
## [16] "x 8"                                                           
## [17] "x 9"                                                           
## [18] "x 10"                                                          
## [19] "x 11"

Great! Now we can reload the data and use clean names! Let’s skip to line 4 as our first line of data, and use col_names to specify clean column names.

axoltl <- read_excel("./data/ALM assay statistics.xlsx",
                     skip = 4,
                     col_names = cnames)

axoltl

## # A tibble: 148 x 19
##    experiment    `animal id` `no cartilage` `single cartilag… `multiple symetri…
##    <chr>         <chr>                <dbl>             <dbl>              <dbl>
##  1 HepIII treat… 64 RFL                  NA                NA                  1
##  2 Completed 3/… 64 LFL                  NA                NA                 NA
##  3 <NA>          80 RFL                   1                NA                 NA
##  4 <NA>          80 LFL                  NA                NA                  1
##  5 <NA>          82 RFL                  NA                NA                  1
##  6 <NA>          82 LFL                  NA                NA                  1
##  7 <NA>          76 RFL                  NA                NA                  1
##  8 <NA>          76 LFL                  NA                NA                  1
##  9 <NA>          81 RFL                  NA                NA                  1
## 10 <NA>          81 LFL                   1                NA                 NA
## # … with 138 more rows, and 14 more variables:
## #   multiple assymetrical with only 1 stylopod element hypomorphic <dbl>,
## #   complete limb with 2 stylopod elements <dbl>, notes <chr>, x <lgl>,
## #   x 2 <dttm>, x 3 <chr>, x 4 <chr>, x 5 <chr>, x 6 <chr>, x 7 <chr>,
## #   x 8 <chr>, x 9 <chr>, x 10 <chr>, x 11 <chr>

WHOAH! Way better!

Now we can shave off the columns from NOTES on, which we don’t need by using select and knowing that the bad columns start with x.

axoltl <- axoltl %>%
  select(-starts_with("x"))

axoltl

## # A tibble: 148 x 8
##    experiment    `animal id` `no cartilage` `single cartilag… `multiple symetri…
##    <chr>         <chr>                <dbl>             <dbl>              <dbl>
##  1 HepIII treat… 64 RFL                  NA                NA                  1
##  2 Completed 3/… 64 LFL                  NA                NA                 NA
##  3 <NA>          80 RFL                   1                NA                 NA
##  4 <NA>          80 LFL                  NA                NA                  1
##  5 <NA>          82 RFL                  NA                NA                  1
##  6 <NA>          82 LFL                  NA                NA                  1
##  7 <NA>          76 RFL                  NA                NA                  1
##  8 <NA>          76 LFL                  NA                NA                  1
##  9 <NA>          81 RFL                  NA                NA                  1
## 10 <NA>          81 LFL                   1                NA                 NA
## # … with 138 more rows, and 3 more variables:
## #   multiple assymetrical with only 1 stylopod element hypomorphic <dbl>,
## #   complete limb with 2 stylopod elements <dbl>, notes <chr>

Lovely!

Fixing up column names

So, column names are long and unweildy. Also, capitalization is odd. Let’s make them use title capitalization Then let’s make sure each one only has only two words.

Exercise: Using regexps, how would you putt out the first two words of a phrase? Practice using stringr::sentences ot test!

names(axoltl) <- names(axoltl) %>%
  str_to_title() %>%
  str_replace("^([a-zA-Z0-9]+ [a-zA-Z0-9]+)(.*)", "\\1")

names(axoltl)

## [1] "Experiment"            "Animal Id"             "No Cartilage"         
## [4] "Single Cartilage"      "Multiple Symetrical"   "Multiple Assymetrical"
## [7] "Complete Limb"         "Notes"

Nice! String Substitution FTW!

Getting Rid of Bad Rows and Reshape

Looking at the data in Excel, we see three problems - first, the “Completed” elements in the first column. Second, rows that are entirely NA. Third, rows that are ‘totals’. The later two are easy, as they’re a filter. Actually, we don’t even have to worry about the blank rows, as readxl took care of that for us. Let’s get rid of those rows entirely based on all of the action occuring in Animal ID. We’ll make them capital at first to make life easier.

axoltl <- axoltl %>%
  mutate(`Animal Id` = str_to_upper(`Animal Id`)) %>%
  filter(!str_detect(`Animal Id`, "TOTAL"))

Nice, now, let’s scrub anything in the Experiment column. We need to do something a little funny to add in NAs - either an ifelse with an NA or, str_replace offers a different solution

axoltl <- axoltl %>%
  mutate(Experiment = str_replace(Experiment, 
                                  ".*[cC]omple.*", 
                                  NA_character_))

Filling down

OK, we have a lot of missing values - both in the Experiment column and through the rest of the data set. How do we fix all of that up? Let’s start with experiment - we can use fill from tidyr and set it to down in order to make the column fixed.

axoltl <- axoltl %>%
  fill(Experiment, .direction = "down")

Now what about the rest? Well here’s where gathering and spreading can be very useful.

Let’s think about this data. Notice that every row only has 1 once. That’s because the 1 means - this was the outcome from that treatment on this particular animal. So, what we want, is each animal to have the condition of its response. This means going from this wide format to a long format. Experiment, Animal Id, and Notes need to hang around, however, as they are unique to each animal.

axoltl_long <- axoltl %>%
  pivot_longer(cols = -c(Experiment, `Animal Id`, Notes),
               names_to = "Response",
               values_to = "Value")
  

axoltl_long

## # A tibble: 635 x 5
##    Experiment          `Animal Id` Notes                   Response        Value
##    <chr>               <chr>       <chr>                   <chr>           <dbl>
##  1 HepIII treated per… 64 RFL      distal taper            No Cartilage       NA
##  2 HepIII treated per… 64 RFL      distal taper            Single Cartila…    NA
##  3 HepIII treated per… 64 RFL      distal taper            Multiple Symet…     1
##  4 HepIII treated per… 64 RFL      distal taper            Multiple Assym…    NA
##  5 HepIII treated per… 64 RFL      distal taper            Complete Limb      NA
##  6 HepIII treated per… 64 LFL      distal tapered digits … No Cartilage       NA
##  7 HepIII treated per… 64 LFL      distal tapered digits … Single Cartila…    NA
##  8 HepIII treated per… 64 LFL      distal tapered digits … Multiple Symet…    NA
##  9 HepIII treated per… 64 LFL      distal tapered digits … Multiple Assym…     1
## 10 HepIII treated per… 64 LFL      distal tapered digits … Complete Limb      NA
## # … with 625 more rows

Nice! Two notes. First, if you wanted to replace those NAs with 0s in the wide format, you’d do that here, and then reshape back to wide. But, in this case, we’re going to eliminate anything that is NA, and then get rid of the Value column entirely, as we don’t need it.

axoltl_long <- axoltl_long %>%
  filter(!is.na(Value)) %>%
  select(-Value)

Plot!

OK! We’re ready to plot the response! We can use geom_bar() here, but, what are other ways you would visualize?

ggplot(data=axoltl_long, aes(x=Experiment, fill=Response)) +
  geom_bar() +
  theme_bw(base_size=14) +
  theme(axis.text.x = element_text(angle=45, hjust=1))