To begin my workflow, after receiving my data in .csv form from NOAA, I first needed to load it in, which I couldn't do until I set my working directory to the folder where my files are saved. I then read in the data, treating the value "-9999" in the raw .csv files as missing. Because there was so much data, it was sent in three files, which needed to be loaded individually. Setting the working directory looks something like this (the path is a placeholder, not my actual folder):
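# Point R at the folder containing the NOAA .csv files
# (hypothetical path -- substitute your own)
setwd("~/Documents/noaa_boston_data")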
data_19902016 <- read.csv("./734706.csv",
                          na.strings = c("-9999"))
data_19701989 <- read.csv("./730605.csv",
                          na.strings = c("-9999"))
data_19501969 <- read.csv("./730589.csv",
                          na.strings = c("-9999"))
To manipulate and visualize the data, I will need packages including dplyr, tidyr, and ggplot2, so all of them need to be loaded.
library(ggplot2)
library(tidyr)
library(dplyr)
Now begins the actual manipulation of the data, with the ultimate goal of combining my three sets into one well-organized data set.
I begin by looking at the data to see what I will need to change. The best way to do this is with the str() function.
str(data_19501969)
## 'data.frame': 5075 obs. of 9 variables:
## $ STATION : Factor w/ 34 levels "GHCND:USC00190166",..: 29 29 29 29 29 29 29 29 29 29 ...
## $ STATION_NAME: Factor w/ 33 levels "ARLINGTON MA US",..: 20 20 20 20 20 20 20 20 20 20 ...
## $ ELEVATION : Factor w/ 34 levels "10.1","11.9",..: 24 24 24 24 24 24 24 24 24 24 ...
## $ LATITUDE : Factor w/ 29 levels "42.05","42.08333",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ LONGITUDE : Factor w/ 33 levels "-70.8","-70.83333",..: 25 25 25 25 25 25 25 25 25 25 ...
## $ DATE : int 19581001 19581101 19581201 19590101 19590201 19590301 19590401 19590501 19590601 19590701 ...
## $ MMXT : num 59.9 52.5 32.2 35.4 34.9 43.7 58.5 73.6 73.4 81.9 ...
## $ MMNT : num 41.4 34.9 14.9 18 14.9 27.1 39 50.2 55.4 64 ...
## $ MNTM : num 50.7 43.7 23.5 26.8 25 35.4 48.7 61.9 64.4 73 ...
str(data_19701989)
## 'data.frame': 4893 obs. of 9 variables:
## $ STATION : Factor w/ 28 levels "GHCND:USC00190535",..: 25 25 25 25 25 25 25 25 25 25 ...
## $ STATION_NAME: Factor w/ 27 levels "BEDFORD HANSCOM FIELD MA US",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ ELEVATION : num 63.1 63.1 63.1 63.1 63.1 63.1 63.1 63.1 63.1 63.1 ...
## $ LATITUDE : num 42.4 42.4 42.4 42.4 42.4 ...
## $ LONGITUDE : num -71.5 -71.5 -71.5 -71.5 -71.5 ...
## $ DATE : int 19700101 19700201 19700301 19700401 19700501 19700601 19700701 19700801 19710401 19710601 ...
## $ MMXT : num 28 38.5 42.6 59 71.2 75.2 82.8 82.4 56.5 80.8 ...
## $ MMNT : num 2.7 14.2 21.2 32 45.3 52.2 58.6 54.9 32 53.1 ...
## $ MNTM : num 15.4 26.2 31.8 45.5 58.3 63.7 70.7 68.5 44.2 66.9 ...
str(data_19902016)
## 'data.frame': 6808 obs. of 9 variables:
## $ STATION : Factor w/ 33 levels "GHCND:USC00190535",..: 15 15 15 15 15 15 15 15 15 15 ...
## $ STATION_NAME: Factor w/ 31 levels "BEDFORD HANSCOM FIELD MA US",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ ELEVATION : Factor w/ 39 levels "10.1","10.7",..: 34 34 34 34 34 34 34 34 34 34 ...
## $ LATITUDE : Factor w/ 71 levels "41.95028","41.9538",..: 37 37 37 37 37 37 37 37 37 37 ...
## $ LONGITUDE : Factor w/ 67 levels "-70.8371","-70.84028",..: 63 63 63 63 63 63 63 63 63 63 ...
## $ DATE : int 19900101 19900201 19900301 19900401 19900501 19900601 19900701 19900801 19900901 19901001 ...
## $ MMXT : num 41.2 41.2 50.5 57 63.5 77.9 83.1 81.9 74.3 67.1 ...
## $ MMNT : num 23.7 18.3 25 36.1 45 56.3 61.9 61.9 49.3 44.2 ...
## $ MNTM : num 32.4 29.8 37.8 46.6 54.1 67.1 72.5 72 61.9 55.8 ...
I can immediately see some issues: many of my columns are listed as factors when they need to be numeric, and DATE is stored as an integer, so it needs to be converted to the Date class.
data_19501969 <- data_19501969 %>%
mutate(ELEVATION = as.numeric(as.character(ELEVATION)),
LATITUDE = as.numeric(as.character(LATITUDE)),
LONGITUDE = as.numeric(as.character(LONGITUDE)),
DATE = as.Date(as.character(DATE), format='%Y%m%d'))
data_19701989 <- data_19701989%>%
mutate(ELEVATION = as.numeric(as.character(ELEVATION)),
LATITUDE = as.numeric(as.character(LATITUDE)),
LONGITUDE = as.numeric(as.character(LONGITUDE)),
DATE = as.Date(as.character(DATE), format='%Y%m%d'))
data_19902016 <- data_19902016 %>%
mutate(ELEVATION = as.numeric(as.character(ELEVATION)),
LATITUDE = as.numeric(as.character(LATITUDE)),
LONGITUDE = as.numeric(as.character(LONGITUDE)),
DATE = as.Date(as.character(DATE), format='%Y%m%d'))
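Since the same conversions are applied to all three sets, they could also be wrapped in a small helper function (a sketch; fix_types is a name I've introduced, equivalent to the mutate() calls above):
# Apply the standard type conversions to one of the raw data sets,
# e.g. data_19501969 <- fix_types(data_19501969)
fix_types <- function(df) {
  df %>%
    mutate(ELEVATION = as.numeric(as.character(ELEVATION)),
           LATITUDE = as.numeric(as.character(LATITUDE)),
           LONGITUDE = as.numeric(as.character(LONGITUDE)),
           DATE = as.Date(as.character(DATE), format = "%Y%m%d"))
}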
The next step is to join my three data sets, which is easiest to do one at a time. Because the sets contain different dates in each row, I don't want to lose any information; for this reason, a full join works best.
data_1950to1989 <- full_join(data_19501969, data_19701989, by = c("STATION", "DATE", "STATION_NAME",
"MMXT", "MMNT", "MNTM", "ELEVATION",
"LATITUDE", "LONGITUDE"))
data_1950to2016 <- full_join(data_1950to1989, data_19902016, by = c("STATION", "DATE", "STATION_NAME",
"MMXT", "MMNT", "MNTM", "ELEVATION",
"LATITUDE", "LONGITUDE"))
For the purposes of manipulating the data, it also makes sense to add separate columns for Year and Month. A Day column is not necessary, since every day is listed as "1". These new columns also need to be numeric. I also removed STATION, since STATION_NAME was the only station identifier I needed, and ELEVATION, since I wasn't using it in my analysis.
data_1950to2016$Year <- format(data_1950to2016$DATE, "%Y")
data_1950to2016$Month <- format(data_1950to2016$DATE, "%m")
data_1950to2016 <- data_1950to2016 %>%
mutate(Year = as.numeric(Year),
Month = as.numeric(Month)) %>%
select(-STATION, -ELEVATION)
To tidy things up, some columns need to be renamed.
data_1950to2016 <- data_1950to2016 %>%
rename(long = LONGITUDE,
lat = LATITUDE,
Station = STATION_NAME)
For the final step of organizing my data, I want to order everything by station so that all the data for each station is grouped together; arrange() is the easiest way to do this.
data_1950to2016 <- arrange(data_1950to2016, Station, DATE)
Now that all the data has been organized, it can be plotted to see if there are any observable trends or changes. ggplot2 will work best for plotting. I want to graph changes over time, so my y-axis will be MNTM (monthly mean temperature) and my x-axis will be Year. Before making any changes, I made a raw plot to quickly see whether there appear to be any changes over time.
ggplot(data = data_1950to2016, aes(x = Year, y = MNTM, colour = Station)) +
geom_point() +
facet_wrap(~ Month) +
guides(color = "none")
To fit all the data onto one graph, a few figures need to be manipulated to best visualize the results. First, I wanted to go from monthly averages to yearly averages for each station, which I did by grouping by station and year and then taking the means. Because some stations did not have a full year's worth of measurements, the plot worked best when only complete years were used to form a true yearly average. I did this by adding a count of months to the Yearly_Mean summary, then filtering out any rows where the count was not equal to 12. NAs also needed to be removed from the Yearly_Mean column so that all the points connect.
annual_avg <- data_1950to2016 %>%
group_by(Station, Year, lat, long) %>%
summarise(Yearly_Mean = mean(MNTM), temp_count = n()) %>%
filter(temp_count == 12, !is.na(Yearly_Mean))
Finally, because I wanted an overlay plot with lines for the individual stations as well as a line for the average across all stations, I need to create yearly means for all the stations together. This involves grouping by year and computing a new mean column, y.
allstations_avg <- annual_avg %>%
group_by(Year) %>%
summarise(y = mean(Yearly_Mean)) %>%
ungroup()
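One thing worth keeping an eye on (a sketch; station_counts and n_stations are names I've added) is how many stations contribute to each year's average, since the station mix changes over the record:
# Number of stations contributing to each yearly average
station_counts <- annual_avg %>%
  group_by(Year) %>%
  summarise(n_stations = n())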
Now that I have all the data points I need, I can create my plot of yearly temperature means over time. I want lines for both the individual stations and the all-station average, with the latter lying on top of and standing out from the former. To do this, I set the color of the station lines to grey so they would appear lighter in the background, and made both the points and the line for the overall average black so they are clearly visible. Finally, to pull it all together, a linear model line was added to clearly show any slope over time.
ggplot(annual_avg, mapping = aes(x=Year, y=Yearly_Mean)) +
geom_line(color="grey", alpha=0.5, mapping=aes(group = Station)) +
theme_bw() +
geom_line(data = allstations_avg, mapping = aes(x=Year, y=y), color="black") +
geom_point(data = allstations_avg, mapping = aes(x=Year, y=y), color="black") +
stat_smooth(method="lm", fill=NA, color="red", lty=2)
For better visualization of where all the stations are located around Boston, I wanted to put them on a map.
library(maps)    # state outlines used by map_data()
library(ggmap)   # mapping extras; not strictly needed for map_data()
library(mapdata) # higher-resolution map data
boston_map <- map_data("state", region = "massachusetts")
temp_map <- ggplot(boston_map, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group)) + # group keeps separate polygons (e.g. islands) from connecting
  coord_map() +
  geom_point(data = annual_avg, aes(x = long, y = lat), color = "orange")
temp_map
To observe changes between the mean monthly maximum (MMXT) and mean monthly minimum (MMNT) temperatures, I created a new column called Fluctuation to track how drastically the average extremes were varying over time.
data_1950to2016 <- data_1950to2016 %>%
mutate(Fluctuation = (MMXT - MMNT))
Next, I wanted to view those changes using the same type of plot I used for mean temperatures over time.
fluctuation_avg <- data_1950to2016 %>%
group_by(Station, Year, lat, long) %>%
summarise(Fluctuation_Mean = mean(Fluctuation), temp_count = n()) %>%
filter(temp_count == 12, !is.na(Fluctuation_Mean))
fluc_mean <- fluctuation_avg %>%
group_by(Year) %>%
summarise(y = mean(Fluctuation_Mean)) %>%
ungroup()
ggplot(fluctuation_avg, mapping = aes(x=Year, y=Fluctuation_Mean)) +
geom_line(color="grey", alpha=0.5, mapping=aes(group = Station)) +
theme_bw() +
geom_line(data = fluc_mean, mapping = aes(x=Year, y=y), color="black")+
geom_point(data = fluc_mean, mapping = aes(x=Year, y=y), color="black") +
stat_smooth(method="lm", fill=NA, color="red", lty=2)
The climate data now needs to be analyzed. To do this, I fit linear models, which can be done for both the annual averages of the individual stations and the overall annual average of all stations. (Note that these models regress Year on temperature rather than the more conventional temperature on Year; for a simple regression the slope's p-value is identical in either direction, so the significance tests below are unaffected.)
climate_mod <- lm(Year ~ Yearly_Mean, data = annual_avg)
allstations_mod <- lm(Year ~ y, data = allstations_avg)
fluc_mean_mod <- lm(Year ~ y, data = fluc_mean)
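For reference, the more conventional formulation treats temperature as the response, so the slope reads directly as degrees per year (a sketch; climate_mod_conventional is a name I've introduced):
# Conventional direction: temperature as a function of time
climate_mod_conventional <- lm(Yearly_Mean ~ Year, data = annual_avg)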
Using the linear models, the data for the individual stations is summarized to show the standard error, t value, F statistic, and p-value.
summary(climate_mod)
##
## Call:
## lm(formula = Year ~ Yearly_Mean, data = annual_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.935 -14.102 0.382 15.675 37.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1849.8770 16.7114 110.696 < 2e-16 ***
## Yearly_Mean 2.6834 0.3376 7.948 4.18e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.66 on 1261 degrees of freedom
## Multiple R-squared: 0.0477, Adjusted R-squared: 0.04695
## F-statistic: 63.17 on 1 and 1261 DF, p-value: 4.181e-15
To show how well the actual data fit the theoretical distribution, a Q-Q plot is created. Ideally, the points would fall along a perfectly straight line.
plot(climate_mod, which = 2)
The same tests are then performed for the averages of all stations across time, as well as for the fluctuations of the extremes over time.
summary(allstations_mod)
##
## Call:
## lm(formula = Year ~ y, data = allstations_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.093 -12.517 1.826 14.351 33.278
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1754.658 103.417 16.967 <2e-16 ***
## y 4.605 2.090 2.204 0.0312 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.65 on 64 degrees of freedom
## Multiple R-squared: 0.07053, Adjusted R-squared: 0.056
## F-statistic: 4.856 on 1 and 64 DF, p-value: 0.03115
plot(allstations_mod, which = 2)
summary(fluc_mean_mod)
##
## Call:
## lm(formula = Year ~ y, data = fluc_mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.193 -13.238 2.391 14.977 39.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2131.194 63.761 33.425 <2e-16 ***
## y -7.358 3.153 -2.334 0.0228 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.57 on 64 degrees of freedom
## Multiple R-squared: 0.07841, Adjusted R-squared: 0.06401
## F-statistic: 5.445 on 1 and 64 DF, p-value: 0.02277
plot(fluc_mean_mod, which = 2)
The models produced very low R-squared values, which isn't abnormal for ecological data, since much of the variation comes from large year-to-year swings in annual temperature. The Q-Q plots also did not align perfectly linearly, though averaging all the stations together gave a better fit than the Q-Q plot of the individual stations, since averaging reduces some of the station-to-station variation.
As the trend line and p-value show, there is a clear upward overall trend in average annual temperature from 1950 to the present. A small p-value of 0.03115 indicates the change is unlikely to have occurred by chance, supporting my first hypothesis.
The average fluctuation between temperature extremes actually decreased over time, which was a bit surprising, as I had hypothesized that the difference between mean monthly maximum and mean monthly minimum temperatures would grow over time. The linear model produced a low p-value of 0.02277, suggesting the downward trend in fluctuation is not due to chance; so while the null hypothesis of no trend can be rejected, the data contradict my second hypothesis of increasing fluctuation.
For future or additional experiments, I would want to get even more detailed with my analysis. While temperature does appear to be increasing, finding exactly when and where the greatest changes are occurring would give greater insight into the cause of those changes.