I wanted to start by going through modular programming in a bit more detail. Let’s start with the 1000 yard view, and break it into pieces.
Our goal in this lab, and as we move forward, is to take a directory of files from different buoys and pull the wave height information from them into a single unified data frame.
To do that, we need a function that will load a file and, regardless of its format, return a data frame in a single standardized format. Let’s sketch out a skeleton of what we want to happen.
#start with a year
#read in a file
#fix formatting
#return fixed file
We can then take that skeleton and fill it in with the names of functions we will write…
#start with a year as our argument
get_buoy <- function(a_year){
  #read in a file
  one_buoy <- read_buoy(a_year)
  #fix formatting
  one_buoy <- format_buoy(one_buoy)
  #return fixed file
  return(one_buoy)
}
This looks pretty good! Heck, we could even jazz it up with some pipes!
library(dplyr)
#start with a year as our argument
get_buoy <- function(a_year){
  #read in a file
  one_buoy <- read_buoy(a_year) %>%
    #fix formatting
    format_buoy()
  #return fixed file
  return(one_buoy)
}
Regardless, this makes clear that we have at least two more functions to write: read_buoy() and format_buoy(). Let’s start with the first.
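One nice side effect of sketching the top-level function first is that you can run it before the real helpers exist. Here is a minimal sketch of that idea, assuming stand-in helpers: read_buoy_stub(), format_buoy_stub(), and get_buoy_sketch() are hypothetical names invented for this illustration, and the toy data frame is made up.

```r
# Hypothetical stand-in: returns a tiny made-up data frame instead of reading a file
read_buoy_stub <- function(a_year) {
  data.frame(YY = a_year %% 100, WVHT = c(1.2, 1.1))
}

# Hypothetical stand-in: only renames one column
format_buoy_stub <- function(a_buoy_df) {
  names(a_buoy_df)[names(a_buoy_df) == "YY"] <- "YYYY"
  a_buoy_df
}

# Same shape as get_buoy(), but the helpers are passed in so they can be swapped later
get_buoy_sketch <- function(a_year,
                            reader = read_buoy_stub,
                            formatter = format_buoy_stub) {
  formatter(reader(a_year))
}

result <- get_buoy_sketch(2012)
names(result)
```

Once the real read_buoy() and format_buoy() are written, they can take the place of the stubs without touching the top-level function.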
We’ll begin by outlining read_buoy()
#take a year
#make the correct filename for that year by combining it with directory info
#read in the file
#return it
Looking at the comments above, there is nothing that we need to write a new function for. We know how to combine strings: stringr::str_c(). We know how to read in files: readr::read_csv(). So, let’s start with a year, and see how things go…
#take a year
a_year <- 2012
#make the correct filename for that year by combining it with directory info
buoy_file <- stringr::str_c("./data/buoydata/44013_", a_year, ".csv")
#read in the file
one_buoy <- readr::read_csv(buoy_file)
#return it
Hey! That worked! We can look at this or other years, and see that it works. Looking at the files we read in, we see a variety of values that stand in for NA: 99, 999, 99.00, and more. We can add those in, and wrap the above into a function!
#take a year as the argument...
read_buoy <- function(a_year){
  #make the correct filename for that year by combining it with directory info
  buoy_file <-
    stringr::str_c("./data/buoydata/44013_", a_year, ".csv")
  #read in the file
  one_buoy <- readr::read_csv(buoy_file,
                              na = c("99", "999",
                                     "99.00", "9999.00",
                                     "99.0", "9999.0",
                                     "999.0"))
  #return it
  return(one_buoy)
}
Then let’s test it -
one_buoy <- read_buoy(1992)
head(one_buoy)
## # A tibble: 6 x 16
## YY MM DD hh WD WSPD GST WVHT DPD APD MWD BAR ATMP
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl>
## 1 92 1 1 1 249 5.9 7 1.2 7.7 7.3 NA 1034. 2.1
## 2 92 1 1 2 260 5.6 6.3 1.1 7.7 7.5 NA 1034. 1.7
## 3 92 1 1 3 269 5 5.9 1.1 11.1 7.6 NA 1034. 1.7
## 4 92 1 1 4 281 5.2 6.4 1 10 8.1 NA 1034. 1.7
## 5 92 1 1 5 280 5.4 6.4 1 10 8.1 NA 1034. 1.6
## 6 92 1 1 6 271 5.6 6.3 1 10 8.3 NA 1033. 1.3
## # … with 3 more variables: WTMP <dbl>, DEWP <lgl>, VIS <lgl>
visdat::vis_dat(one_buoy)
Nice!
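If you want to poke at the two ideas in read_buoy() (building a file name, then treating sentinel codes as NA) without the real data, here is a small self-contained sketch. It is a base R analogue, not the readr version above: read_buoy_sketch(), its data_dir and na_codes arguments, and the toy file contents are all assumptions made for this illustration.

```r
# Hypothetical variant of read_buoy(): the directory is an argument, and we
# stop with a clear message if the file for that year does not exist
read_buoy_sketch <- function(a_year, data_dir,
                             na_codes = c("99", "999", "99.00")) {
  buoy_file <- file.path(data_dir, paste0("44013_", a_year, ".csv"))
  if (!file.exists(buoy_file)) {
    stop("No buoy file found for year ", a_year, ": ", buoy_file)
  }
  # na.strings is base R's counterpart to the na argument of readr::read_csv()
  read.csv(buoy_file, na.strings = na_codes)
}

# Made-up toy file written to a temporary directory
toy_dir <- tempfile("buoydata")
dir.create(toy_dir)
writeLines(c("YY,WD,WVHT,DPD",
             "92,249,1.2,99.00",
             "92,99,99.00,7.7"),
           file.path(toy_dir, "44013_1992.csv"))

toy_buoy <- read_buoy_sketch(1992, toy_dir)
toy_buoy
```

Notice that the NA codes apply to every column, so the wind direction of 99 in the second toy row also turns into NA. That is worth keeping in mind when a code like 99 could be a real measurement in some column.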
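OK, we know that every file is going to be different. We have, broadly, three problems, each of which shows up in the fixes we sketch next: the year column is named differently across files (YY or X.YY rather than YYYY), some files contain bad rows that leave an NA in the year once everything is made numeric, and some years are stored as two digits (92 rather than 1992).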
So, we want a function that will fix these. We can write a function skeleton. I’m going to write this skeleton in the form of a function, as we already know that a buoy data frame is the input.
format_buoy <- function(a_buoy_df){
  #Take the buoy data frame
  #fix the year names
  #fix the bad rows
  #fix the years
  return(a_buoy_df)
}
None of these are low-level things. We’ll need to write functions inside of functions! Modular programming! Let’s start out by sketching what each of these functions will be.
format_buoy <- function(a_buoy_df){
  #Take the buoy data frame
  a_buoy_df <- a_buoy_df %>%
    #fix the year names
    fix_year_names %>%
    #fix the bad rows
    fix_bad_rows %>%
    #fix the years
    fix_bad_years
  return(a_buoy_df)
}
Great! We are now ready for our lower level modules. Fill in the blanks…
fix_year_names()
library(stringr)
#bad names
fix_year_names <- function(a_buoy_df){
  #start with the column names
  names(a_buoy_df) <- names(a_buoy_df) %>%
    #replace YY with YYYY
    ____("^YY$", "YYYY") %>%
    #replace X.YY with YYYY
    ____("X\\.YY", "YYYY")
  ____(a_buoy_df)
}
#A test!
read_buoy(2012) %>%
  fix_year_names
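A quick aside on the two patterns in that function, since the details matter. The sketch below uses base R's sub() on a made-up vector of column names (it is not the answer to the blanks above) to show why the first pattern is anchored with ^ and $, and why the dot in the second is escaped.

```r
# Made-up column names covering the cases we care about
toy_names <- c("YY", "YYYY", "X.YY", "MM")

# Anchored: only a name that is exactly "YY" is replaced
anchored <- sub("^YY$", "YYYY", toy_names)

# Unanchored: the first "YY" found anywhere in a name is replaced,
# which mangles names that were already fine
unanchored <- sub("YY", "YYYY", toy_names)

# Escaped dot: "\\." matches a literal period rather than any character
escaped <- sub("X\\.YY", "YYYY", toy_names)

anchored
unanchored
escaped
```

With the anchors, "YYYY" is left alone. Without them it becomes "YYYYYY", which is exactly the kind of bug that only shows up when you test on more than one file.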
fix_bad_rows()
fix_bad_rows <- function(a_buoy_df){
  #start with a buoy df
  a_buoy_df <- a_buoy_df %>%
    #make everything numeric
    mutate_all(____) %>%
    #filter out rows with NAs in the year
    ____(!____(____))
  _____(a_buoy_df)
}
#A test!
read_buoy(2012) %>%
  fix_year_names %>%
  fix_bad_rows
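Why does making everything numeric help find bad rows? Anything that is not a number becomes NA when coerced. Here is a base R sketch of that logic on a made-up data frame. The toy values, including the "#yr" row, are assumptions for illustration only, and this is a different route than the dplyr blanks above.

```r
# Made-up data: the middle row holds text rather than measurements
toy <- data.frame(YYYY = c("1992", "#yr", "1992"),
                  WVHT = c("1.2", "m", "1.1"),
                  stringsAsFactors = FALSE)

# Coerce every column to numeric; text that is not a number becomes NA
# (as.numeric() warns about this, so the warning is silenced here)
toy_numeric <- as.data.frame(
  lapply(toy, function(a_col) suppressWarnings(as.numeric(a_col)))
)

# Keep only rows where the year survived the coercion
toy_clean <- toy_numeric[!is.na(toy_numeric$YYYY), ]
toy_clean
```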
fix_bad_years()
#bad years
fix_bad_years <- function(a_buoy_df){
  #start with a buoy data frame
  a_buoy_df <- ____ %>%
    #if the YYYY col is less than 1900, add 1900 to it
    ____(____ = ____(____ < 1900, YYYY+1900, ____))
  return(a_buoy_df)
}
#A test!
read_buoy(2012) %>%
  fix_year_names %>%
  fix_bad_rows %>%
  fix_bad_years
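The year fix is a vectorized "if this, then that, otherwise leave it alone". A minimal sketch on a plain made-up vector, assuming (as the comment in the function does) that any value under 1900 is a two-digit year from the 1900s:

```r
# Made-up years: one two-digit, two already four-digit
toy_years <- c(92, 1999, 2012)

# ifelse() checks each element: add 1900 where the test is TRUE,
# keep the original value where it is FALSE
fixed_years <- ifelse(toy_years < 1900, toy_years + 1900, toy_years)
fixed_years
```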
Were you testing along the way? Now that you have written all of these functions, get_buoy() has everything it needs. So if we run, say,
get_buoy(1992) %>% head()
## # A tibble: 6 x 16
## YYYY MM DD hh WD WSPD GST WVHT DPD APD MWD BAR ATMP
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1992 1 1 1 249 5.9 7 1.2 7.7 7.3 NA 1034. 2.1
## 2 1992 1 1 2 260 5.6 6.3 1.1 7.7 7.5 NA 1034. 1.7
## 3 1992 1 1 3 269 5 5.9 1.1 11.1 7.6 NA 1034. 1.7
## 4 1992 1 1 4 281 5.2 6.4 1 10 8.1 NA 1034. 1.7
## 5 1992 1 1 5 280 5.4 6.4 1 10 8.1 NA 1034. 1.6
## 6 1992 1 1 6 271 5.6 6.3 1 10 8.3 NA 1033. 1.3
## # … with 3 more variables: WTMP <dbl>, DEWP <dbl>, VIS <dbl>
Things look pretty good.
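Remember the original goal: one unified data frame across many files. Once a single-year function behaves, applying it over a vector of years and stacking the results is short. A base R sketch, assuming a stand-in toy_get_buoy() (a hypothetical function returning made-up rows) in place of the real get_buoy():

```r
# Hypothetical stand-in for get_buoy(): two made-up rows per year
toy_get_buoy <- function(a_year) {
  data.frame(YYYY = a_year, MM = 1:2, WVHT = c(1.0, 1.5))
}

years <- 1992:1994

# Apply the single-year function to each year, then stack the pieces
all_buoys <- do.call(rbind, lapply(years, toy_get_buoy))
dim(all_buoys)
```

This only works cleanly because every piece comes back with the same column names and types, which is the whole point of the formatting functions above.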
Fill in the blanks below to write a function that uses get_buoy() to read in a file, gets the monthly average wind speed (WSPD) and the lower and upper SD, then plots it.
library(ggplot2)
plot_wspd_by_month <- function(a_year){
  #get the file
  a_buoy <- get_buoy(_____)
  #calculate the mean, lower sd, and upper sd
  summarized_buoy <- wsp_mean_sd(_____)
  #plot it
  plot_summarized_buoy(_____)
}
wsp_mean_sd <- function(raw_buoy){
  #take the buoy
  _____ %>%
    #group by month
    group_by(MM) %>%
    #calculate mean, mean-1sd, mean+1sd
    summarize(mean_WSPD = mean(_____, na.rm = TRUE),
              upr_WSPD = _____ + sd(_____, na.rm = TRUE),
              lwr_WSPD = mean_WSPD - sd(_____, na.rm = TRUE))
}
plot_summarized_buoy <- function(_____){
  ggplot(buoy_to_plot,
         #x = month, yvals relate to wind speed
         aes(x = MM, y = _____, ymin = _____, ymax = upr_WSPD)) +
    #use a geom to show the mean +/- 1SD
    _____() +
    #use a geom to add connecting lines between the means
    _____()
}
Now, take that function out for a spin on different files! What do you see?
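If the plot looks odd, it helps to check the summary step by hand on numbers small enough to verify in your head. Here is a base R sketch of a by-month mean plus and minus one SD. It uses made-up wave heights rather than WSPD, and tapply() rather than the dplyr verbs in the exercise, so treat it as a cross-check and not the exercise answer.

```r
# Made-up data: three observations in each of two months
toy <- data.frame(MM = c(1, 1, 1, 2, 2, 2),
                  WVHT = c(1, 2, 3, 2, 4, 6))

# Mean and SD within each month
monthly_mean <- tapply(toy$WVHT, toy$MM, mean, na.rm = TRUE)
monthly_sd <- tapply(toy$WVHT, toy$MM, sd, na.rm = TRUE)

# Assemble one row per month with lower and upper bounds
toy_summary <- data.frame(
  MM = as.numeric(names(monthly_mean)),
  mean_WVHT = as.numeric(monthly_mean),
  lwr_WVHT = as.numeric(monthly_mean - monthly_sd),
  upr_WVHT = as.numeric(monthly_mean + monthly_sd)
)
toy_summary
```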
gust_increase_hist <- function(a_year){
  #get the cleaned buoy
  #create a long data frame with each row as a data point, measuring either
  #difference between air and water or wind speed and gust speed
  #create a plot
}
buoy_measured_diff_long <- function(a_buoy){
  #with one buoy
  #calculate differences between ATMP and WTMP as well as WSPD and GST
  #pivot to make it long
  #return the modified data
}
plot_dual_hist <- function(summarized_buoy){
  #create a ggplot with a single value as the x
  #make a histogram
  #facet by the measurement type
}
#test it out!
Which of the above functions is reusable in other scenarios?
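To see what "long" means for the middle function, here is a base R sketch on made-up numbers. It uses stack() as a stand-in for a tidyverse pivot, and the column names air_minus_water and gust_minus_wind are hypothetical labels chosen for this illustration.

```r
# Made-up measurements for two time points
toy <- data.frame(ATMP = c(2.1, 1.7), WTMP = c(5.0, 5.1),
                  WSPD = c(5.9, 5.6), GST = c(7.0, 6.3))

# One column per kind of difference (wide format)
toy_diffs <- data.frame(air_minus_water = toy$ATMP - toy$WTMP,
                        gust_minus_wind = toy$GST - toy$WSPD)

# stack() turns the wide columns into one value column plus a label column
long_diffs <- stack(toy_diffs)
names(long_diffs) <- c("value", "measurement")
long_diffs
```

Each row is now a single data point tagged with what it measures, which is the shape a faceted histogram wants.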
To do this, you will need:
A. A function that creates the data frame you will need for plotting for one state. This function will need to do the following (potentially with subfunctions - I recommend it!):
  - Read in any state’s data given a state name. Use readRDS to read in a single data file and fix up the CRS (these are all in lat/long - you want a Mollweide projection, in which distance is in meters). You’ll need st_transform for the latter, and the projection is epsg:54009 (if your version of sf does not recognize that string, the same code is registered under the ESRI authority as "ESRI:54009").
  - Calculate the number of counties.
  - Calculate the percent area of each county and then get the average and total area. You’ll need st_area() here. Note, for sf objects, when you summarize(), you also compress all of the polygons into one big single polygon… which you can then take one big area from! You’ll need to get both county and state areas!
B. Get all of the paths to the state files to iterate over.
C. Successfully iterate over all states to generate a large data frame.
D. Plot!
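Steps B and C follow the same modular pattern as the buoys: one function that handles one file, then a loop over paths. Below is a base R sketch on toy files. Everything in it is an assumption for illustration: the .rds files are made-up plain data frames (not sf objects), the county_area column stands in for what st_area() would give you, and summarize_state_file() is a hypothetical name.

```r
# Toy setup: two small made-up "state" files in a temporary directory
toy_dir <- tempfile("states")
dir.create(toy_dir)
saveRDS(data.frame(state = "Alpha", county_area = c(10, 30)),
        file.path(toy_dir, "alpha.rds"))
saveRDS(data.frame(state = "Beta", county_area = c(5, 5, 10)),
        file.path(toy_dir, "beta.rds"))

# Hypothetical one-file function: read a state, return a one-row summary
summarize_state_file <- function(a_path) {
  a_state <- readRDS(a_path)
  state_area <- sum(a_state$county_area)
  data.frame(state = a_state$state[1],
             n_counties = nrow(a_state),
             total_area = state_area,
             mean_pct_area = mean(100 * a_state$county_area / state_area))
}

# B. Get all of the paths
state_paths <- list.files(toy_dir, pattern = "\\.rds$", full.names = TRUE)

# C. Iterate over the paths and stack the one-row summaries
all_states <- do.call(rbind, lapply(state_paths, summarize_state_file))
all_states
```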