You have spent almost half a semester immersing yourself in thinking about good data practice. You may not think you have, but I guarantee that you now have a strict aversion to many data practices that you might have walked into the room with. You know, for example, that you will always endeavor to have your data in a long or wide format, with complete columns, reasonable NA values, and strong data validation. One value per cell! And so many more principles.
Your past self and, frankly, many of your collaborators, are not at this level of transcendence. However, they are fabulous scientists producing juicy data.
Today, we’re going to look at a data set that looks not unlike something you might have produced in the past and walk through the steps of how to make it into something we can analyze!
The data we’re going to work with is from the McCusker lab. It’s a beautiful piece of work looking at limb regeneration in Axoltls. These amphibians have incredible limb regeneration abilities. In essence, the idea of this experiment was to see if with some application of different compounds, different kinds of limb regrowth could be stimulated.
Based on where the limb was injured and whether a particular compound was applied, the researchers saw very different results of the types of limbs being formed. We want to take their data and recreate some of their iconic plots.
There’s just one problem. The dataset is not in an easy to use format. You can download the data and take a look. There are a lot of issues - although this is perfectly good if you are only working within excel.
The first question you might have is, what do I do?! Relax. Take a breath. Look at the dataset. Note any common features. Note anything that delimits sections. What needs to be modified in this data?
Let’s break the process down into steps.
Let’s begin by loading in the data, as well as loading up the tidyverse.
library(tidyverse)
library(readxl)
axoltl <- read_excel("./data/ALM assay statistics.xlsx")
axoltl
## # A tibble: 151 x 19
## ...1 `Statistics on ca… ...3 ...4 ...5 ...6 ...7 ...8 ...9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA
## 2 EXPERI… Animal ID Pheno… <NA> <NA> <NA> <NA> <NA> NA
## 3 <NA> <NA> No ca… Single… multi… multipl… Compl… NOTES NA
## 4 HepIII… 64 RFL <NA> <NA> 1 <NA> <NA> dista… NA
## 5 Comple… 64 LFL <NA> <NA> <NA> 1 <NA> dista… NA
## 6 <NA> 80 RFL 1 <NA> <NA> <NA> <NA> <NA> NA
## 7 <NA> 80 LFL <NA> <NA> 1 <NA> <NA> dista… NA
## 8 <NA> 82 RFL <NA> <NA> 1 <NA> <NA> dista… NA
## 9 <NA> 82 LFL <NA> <NA> 1 <NA> <NA> dista… NA
## 10 <NA> 76 RFL <NA> <NA> 1 <NA> <NA> dista… NA
## # … with 141 more rows, and 10 more variables: ...10 <dttm>, ...11 <chr>,
## # ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>, ...16 <chr>,
## # ...17 <chr>, ...18 <chr>, ...19 <chr>
Well that’s no good. We notice instantly that the real column names are split between row 2 and 3. We can clean this up by making a new vector of column names, and then reloading the data, starting at row 4, with the proper names. Let’s start with making some new column names
cnames <- paste(axoltl[2,], axoltl[3,])
cnames
## [1] "EXPERIMENT NA"
## [2] "Animal ID NA"
## [3] "Phenotypes No cartilage"
## [4] "NA Single cartilage element"
## [5] "NA multiple symetrical elements"
## [6] "NA multiple assymetrical with only 1 stylopod element(hypomorphic)"
## [7] "NA Complete limb (with 2 stylopod elements)"
## [8] "NA NOTES"
## [9] "NA NA"
## [10] "NA NA"
## [11] "NA NA"
## [12] "NA NA"
## [13] "NA NA"
## [14] "NA NA"
## [15] "NA NA"
## [16] "NA NA"
## [17] "NA NA"
## [18] "NA NA"
## [19] "NA NA"
OK, this is good, except for three things. 1) The “NA” or " NA“s. 2) The random”Phenotypes " from the original sheet which we don’t need. 3) Replicate column names.
We can deal with the first two using regexps! The third we can use an awesome package called janitor which helps with cleaning excel files. Particularly, the function clean_names()
Note, janitor will replace " " with "_" - so, let’s also fix that.
Exercise: If you were to do this replacement, step by step, what regexps would you use?
library(janitor)
#in one row!
cnames <- cnames %>%
str_replace_all("^NA ", "") %>%
str_replace_all( "NA$", "") %>%
str_replace_all("Phenotypes ", "") %>%
make_clean_names() %>%
str_replace_all("_", " ")
cnames
## [1] "experiment"
## [2] "animal id"
## [3] "no cartilage"
## [4] "single cartilage element"
## [5] "multiple symetrical elements"
## [6] "multiple assymetrical with only 1 stylopod element hypomorphic"
## [7] "complete limb with 2 stylopod elements"
## [8] "notes"
## [9] "x"
## [10] "x 2"
## [11] "x 3"
## [12] "x 4"
## [13] "x 5"
## [14] "x 6"
## [15] "x 7"
## [16] "x 8"
## [17] "x 9"
## [18] "x 10"
## [19] "x 11"
Great! Now we can reload the data and use clean names! Let’s skip to line 4 as our first line of data, and use col_names
to specify clean column names.
axoltl <- read_excel("./data/ALM assay statistics.xlsx",
skip = 4,
col_names = cnames)
axoltl
## # A tibble: 148 x 19
## experiment `animal id` `no cartilage` `single cartilag… `multiple symetri…
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 HepIII treat… 64 RFL NA NA 1
## 2 Completed 3/… 64 LFL NA NA NA
## 3 <NA> 80 RFL 1 NA NA
## 4 <NA> 80 LFL NA NA 1
## 5 <NA> 82 RFL NA NA 1
## 6 <NA> 82 LFL NA NA 1
## 7 <NA> 76 RFL NA NA 1
## 8 <NA> 76 LFL NA NA 1
## 9 <NA> 81 RFL NA NA 1
## 10 <NA> 81 LFL 1 NA NA
## # … with 138 more rows, and 14 more variables:
## # multiple assymetrical with only 1 stylopod element hypomorphic <dbl>,
## # complete limb with 2 stylopod elements <dbl>, notes <chr>, x <lgl>,
## # x 2 <dttm>, x 3 <chr>, x 4 <chr>, x 5 <chr>, x 6 <chr>, x 7 <chr>,
## # x 8 <chr>, x 9 <chr>, x 10 <chr>, x 11 <chr>
WHOAH! Way better!
Now we can shave off the columns from NOTES on, which we don’t need by using select and knowing that the bad columns start with x.
axoltl <- axoltl %>%
select(-starts_with("x"))
axoltl
## # A tibble: 148 x 8
## experiment `animal id` `no cartilage` `single cartilag… `multiple symetri…
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 HepIII treat… 64 RFL NA NA 1
## 2 Completed 3/… 64 LFL NA NA NA
## 3 <NA> 80 RFL 1 NA NA
## 4 <NA> 80 LFL NA NA 1
## 5 <NA> 82 RFL NA NA 1
## 6 <NA> 82 LFL NA NA 1
## 7 <NA> 76 RFL NA NA 1
## 8 <NA> 76 LFL NA NA 1
## 9 <NA> 81 RFL NA NA 1
## 10 <NA> 81 LFL 1 NA NA
## # … with 138 more rows, and 3 more variables:
## # multiple assymetrical with only 1 stylopod element hypomorphic <dbl>,
## # complete limb with 2 stylopod elements <dbl>, notes <chr>
Lovely!
So, column names are long and unweildy. Also, capitalization is odd. Let’s make them use title capitalization Then let’s make sure each one only has only two words.
Exercise: Using regexps, how would you putt out the first two words of a phrase? Practice using stringr::sentences ot test!
names(axoltl) <- names(axoltl) %>%
str_to_title() %>%
str_replace("^([a-zA-Z0-9]+ [a-zA-Z0-9]+)(.*)", "\\1")
names(axoltl)
## [1] "Experiment" "Animal Id" "No Cartilage"
## [4] "Single Cartilage" "Multiple Symetrical" "Multiple Assymetrical"
## [7] "Complete Limb" "Notes"
Nice! String Substitution FTW!
Looking at the data in Excel, we see three problems - first, the “Completed” elements in the first column. Second, rows that are entirely NA. Third, rows that are ‘totals’. The later two are easy, as they’re a filter. Actually, we don’t even have to worry about the blank rows, as readxl
took care of that for us. Let’s get rid of those rows entirely based on all of the action occuring in Animal ID
. We’ll make them capital at first to make life easier.
axoltl <- axoltl %>%
mutate(`Animal Id` = str_to_upper(`Animal Id`)) %>%
filter(!str_detect(`Animal Id`, "TOTAL"))
Nice, now, let’s scrub anything in the Experiment column. We need to do something a little funny to add in NAs - either an ifelse with an NA or, str_replace offers a different solution
axoltl <- axoltl %>%
mutate(Experiment = str_replace(Experiment,
".*[cC]omple.*",
NA_character_))
OK, we have a lot of missing values - both in the Experiment column and through the rest of the data set. How do we fix all of that up? Let’s start with experiment - we can use fill
from tidyr and set it to down
in order to make the column fixed.
axoltl <- axoltl %>%
fill(Experiment, .direction = "down")
Now what about the rest? Well here’s where gathering and spreading can be very useful.
Let’s think about this data. Notice that every row only has 1 once. That’s because the 1 means - this was the outcome from that treatment on this particular animal. So, what we want, is each animal to have the condition of its response. This means going from this wide format to a long format. Experiment, Animal Id, and Notes need to hang around, however, as they are unique to each animal.
axoltl_long <- axoltl %>%
pivot_longer(cols = -c(Experiment, `Animal Id`, Notes),
names_to = "Response",
values_to = "Value")
axoltl_long
## # A tibble: 635 x 5
## Experiment `Animal Id` Notes Response Value
## <chr> <chr> <chr> <chr> <dbl>
## 1 HepIII treated per… 64 RFL distal taper No Cartilage NA
## 2 HepIII treated per… 64 RFL distal taper Single Cartila… NA
## 3 HepIII treated per… 64 RFL distal taper Multiple Symet… 1
## 4 HepIII treated per… 64 RFL distal taper Multiple Assym… NA
## 5 HepIII treated per… 64 RFL distal taper Complete Limb NA
## 6 HepIII treated per… 64 LFL distal tapered digits … No Cartilage NA
## 7 HepIII treated per… 64 LFL distal tapered digits … Single Cartila… NA
## 8 HepIII treated per… 64 LFL distal tapered digits … Multiple Symet… NA
## 9 HepIII treated per… 64 LFL distal tapered digits … Multiple Assym… 1
## 10 HepIII treated per… 64 LFL distal tapered digits … Complete Limb NA
## # … with 625 more rows
Nice! Two notes. First, if you wanted to replace those NAs with 0s in the wide format, you’d do that here, and then reshape back to wide. But, in this case, we’re going to eliminate anything that is NA, and then get rid of the Value column entirely, as we don’t need it.
axoltl_long <- axoltl_long %>%
filter(!is.na(Value)) %>%
select(-Value)
OK! We’re ready to plot the response! We can use geom_bar()
here, but, what are other ways you would visualize?
ggplot(data=axoltl_long, aes(x=Experiment, fill=Response)) +
geom_bar() +
theme_bw(base_size=14) +
theme(axis.text.x = element_text(angle=45, hjust=1))