To begin my workflow, after receiving my data in .csv form from NOAA, I first needed to load it in, which I couldn't do until I set my working directory to the folder where my files are saved. I then read in the data, treating the value "-9999" in the raw .csv files as missing. Because there was so much data, it was sent in three files, which needed to be loaded individually. Setting the working directory looks something like this (the path is a placeholder, not my actual folder):
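# Point R at the folder containing the NOAA .csv files
# (hypothetical path -- substitute your own)
setwd("~/Documents/noaa_boston_data")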
data_19902016 <- read.csv("./734706.csv",
                          na.strings = c("-9999"))
data_19701989 <- read.csv("./730605.csv",
                          na.strings = c("-9999"))
data_19501969 <- read.csv("./730589.csv",
                          na.strings = c("-9999"))
To manipulate and visualize the data, I will need packages including dplyr, tidyr, and ggplot2, so all of them need to be loaded.
library(ggplot2)
library(tidyr)
library(dplyr)
Now begins the actual manipulation of the data, with the ultimate goal of combining my three sets into one well-organized data set.
I begin by looking at the data to see what I will need to change. The best way to do this is with the str() function.
str(data_19501969)
## 'data.frame': 5075 obs. of 9 variables:
## $ STATION : Factor w/ 34 levels "GHCND:USC00190166",..: 29 29 29 29 29 29 29 29 29 29 ...
## $ STATION_NAME: Factor w/ 33 levels "ARLINGTON MA US",..: 20 20 20 20 20 20 20 20 20 20 ...
## $ ELEVATION : Factor w/ 34 levels "10.1","11.9",..: 24 24 24 24 24 24 24 24 24 24 ...
## $ LATITUDE : Factor w/ 29 levels "42.05","42.08333",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ LONGITUDE : Factor w/ 33 levels "-70.8","-70.83333",..: 25 25 25 25 25 25 25 25 25 25 ...
## $ DATE : int 19581001 19581101 19581201 19590101 19590201 19590301 19590401 19590501 19590601 19590701 ...
## $ MMXT : num 59.9 52.5 32.2 35.4 34.9 43.7 58.5 73.6 73.4 81.9 ...
## $ MMNT : num 41.4 34.9 14.9 18 14.9 27.1 39 50.2 55.4 64 ...
## $ MNTM : num 50.7 43.7 23.5 26.8 25 35.4 48.7 61.9 64.4 73 ...
str(data_19701989)
## 'data.frame': 4893 obs. of 9 variables:
## $ STATION : Factor w/ 28 levels "GHCND:USC00190535",..: 25 25 25 25 25 25 25 25 25 25 ...
## $ STATION_NAME: Factor w/ 27 levels "BEDFORD HANSCOM FIELD MA US",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ ELEVATION : num 63.1 63.1 63.1 63.1 63.1 63.1 63.1 63.1 63.1 63.1 ...
## $ LATITUDE : num 42.4 42.4 42.4 42.4 42.4 ...
## $ LONGITUDE : num -71.5 -71.5 -71.5 -71.5 -71.5 ...
## $ DATE : int 19700101 19700201 19700301 19700401 19700501 19700601 19700701 19700801 19710401 19710601 ...
## $ MMXT : num 28 38.5 42.6 59 71.2 75.2 82.8 82.4 56.5 80.8 ...
## $ MMNT : num 2.7 14.2 21.2 32 45.3 52.2 58.6 54.9 32 53.1 ...
## $ MNTM : num 15.4 26.2 31.8 45.5 58.3 63.7 70.7 68.5 44.2 66.9 ...
str(data_19902016)
## 'data.frame': 6808 obs. of 9 variables:
## $ STATION : Factor w/ 33 levels "GHCND:USC00190535",..: 15 15 15 15 15 15 15 15 15 15 ...
## $ STATION_NAME: Factor w/ 31 levels "BEDFORD HANSCOM FIELD MA US",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ ELEVATION : Factor w/ 39 levels "10.1","10.7",..: 34 34 34 34 34 34 34 34 34 34 ...
## $ LATITUDE : Factor w/ 71 levels "41.95028","41.9538",..: 37 37 37 37 37 37 37 37 37 37 ...
## $ LONGITUDE : Factor w/ 67 levels "-70.8371","-70.84028",..: 63 63 63 63 63 63 63 63 63 63 ...
## $ DATE : int 19900101 19900201 19900301 19900401 19900501 19900601 19900701 19900801 19900901 19901001 ...
## $ MMXT : num 41.2 41.2 50.5 57 63.5 77.9 83.1 81.9 74.3 67.1 ...
## $ MMNT : num 23.7 18.3 25 36.1 45 56.3 61.9 61.9 49.3 44.2 ...
## $ MNTM : num 32.4 29.8 37.8 46.6 54.1 67.1 72.5 72 61.9 55.8 ...
I can immediately see some issues: many of my columns are listed as factors when they need to be numeric, and DATE is stored as an integer, so it needs to be converted to the Date class.
data_19501969 <- data_19501969 %>%
mutate(ELEVATION = as.numeric(as.character(ELEVATION)),
LATITUDE = as.numeric(as.character(LATITUDE)),
LONGITUDE = as.numeric(as.character(LONGITUDE)),
DATE = as.Date(as.character(DATE), format='%Y%m%d'))
data_19701989 <- data_19701989%>%
mutate(ELEVATION = as.numeric(as.character(ELEVATION)),
LATITUDE = as.numeric(as.character(LATITUDE)),
LONGITUDE = as.numeric(as.character(LONGITUDE)),
DATE = as.Date(as.character(DATE), format='%Y%m%d'))
data_19902016 <- data_19902016 %>%
mutate(ELEVATION = as.numeric(as.character(ELEVATION)),
LATITUDE = as.numeric(as.character(LATITUDE)),
LONGITUDE = as.numeric(as.character(LONGITUDE)),
DATE = as.Date(as.character(DATE), format='%Y%m%d'))
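Since the same conversions are applied to all three sets, they could also be wrapped in a small helper function (a sketch; fix_types is a name I've introduced, equivalent to the mutate() calls above):
# Apply the standard type conversions to one of the raw data sets,
# e.g. data_19501969 <- fix_types(data_19501969)
fix_types <- function(df) {
  df %>%
    mutate(ELEVATION = as.numeric(as.character(ELEVATION)),
           LATITUDE = as.numeric(as.character(LATITUDE)),
           LONGITUDE = as.numeric(as.character(LONGITUDE)),
           DATE = as.Date(as.character(DATE), format = "%Y%m%d"))
}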
The next step is to join my three data sets, which is easiest to do one at a time. Because the sets contain different dates in each row, I don't want to lose any information; for this reason, a full join works best.
data_1950to1989 <- full_join(data_19501969, data_19701989, by = c("STATION", "DATE", "STATION_NAME",
"MMXT", "MMNT", "MNTM", "ELEVATION",
"LATITUDE", "LONGITUDE"))
data_1950to2016 <- full_join(data_1950to1989, data_19902016, by = c("STATION", "DATE", "STATION_NAME",
"MMXT", "MMNT", "MNTM", "ELEVATION",
"LATITUDE", "LONGITUDE"))
For the purposes of manipulating the data, it also makes sense to add separate columns for Year and Month. A Day column is not necessary, since every day is listed as "1". These new columns also need to be numeric. I also removed STATION, since STATION_NAME was the only station identifier I needed, and ELEVATION, since I wasn't using it in my analysis.
data_1950to2016$Year <- format(data_1950to2016$DATE, "%Y")
data_1950to2016$Month <- format(data_1950to2016$DATE, "%m")
data_1950to2016 <- data_1950to2016 %>%
mutate(Year = as.numeric(Year),
Month = as.numeric(Month)) %>%
select(-STATION, -ELEVATION)
To tidy things up, some columns need to be renamed.
data_1950to2016 <- data_1950to2016 %>%
rename(long = LONGITUDE,
lat = LATITUDE,
Station = STATION_NAME)
For the final step of organizing my data, I want to order everything by station so that all the data for each station is grouped together; arrange() is the easiest way to do this.
data_1950to2016 <- arrange(data_1950to2016, Station, DATE)
Now that all the data has been organized, it can be plotted to see if there are any observable trends or changes. ggplot2 will work best for plotting. I want to graph changes over time, so my y-axis will be MNTM (monthly mean temperature) and my x-axis will be Year. Before making any changes, I made a raw plot to quickly see whether there appear to be any changes over time.
ggplot(data = data_1950to2016, aes(x = Year, y = MNTM, colour = Station)) +
geom_point() +
facet_wrap(~ Month) +
guides(color = "none")
To fit all the data onto one graph, a few figures need to be manipulated to best visualize the results. First, I wanted to go from monthly averages to yearly averages for each station, which I did by grouping by station and year and then taking the means. Because some stations did not have a full year's worth of measurements, the plot worked best when only complete years were used to form a true yearly average. I did this by adding a count of months to the Yearly_Mean summary, then filtering out any rows where the count was not equal to 12. NAs also needed to be removed from the Yearly_Mean column so that all the points connect.
annual_avg <- data_1950to2016 %>%
group_by(Station, Year, lat, long) %>%
summarise(Yearly_Mean = mean(MNTM), temp_count = n()) %>%
filter(temp_count == 12, !is.na(Yearly_Mean))
Finally, because I wanted an overlay plot with lines for the individual stations as well as a line for the average across all stations, I need to create yearly means for all the stations together. This involves grouping by year and computing a new mean column, y.
allstations_avg <- annual_avg %>%
group_by(Year) %>%
summarise(y = mean(Yearly_Mean)) %>%
ungroup()
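One thing worth keeping an eye on (a sketch; station_counts and n_stations are names I've added) is how many stations contribute to each year's average, since the station mix changes over the record:
# Number of stations contributing to each yearly average
station_counts <- annual_avg %>%
  group_by(Year) %>%
  summarise(n_stations = n())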
Now that I have all the data points I need, I can create my plot of yearly temperature means over time. I want lines for both the individual stations and the all-station average, with the latter lying on top of and standing out from the former. To do this, I set the color of the station lines to grey so they would appear lighter in the background, and made both the points and the line for the overall average black so they are clearly visible. Finally, to pull it all together, a linear model line was added to clearly show any slope over time.
ggplot(annual_avg, mapping = aes(x=Year, y=Yearly_Mean)) +
geom_line(color="grey", alpha=0.5, mapping=aes(group = Station)) +
theme_bw() +
geom_line(data = allstations_avg, mapping = aes(x=Year, y=y), color="black") +
geom_point(data = allstations_avg, mapping = aes(x=Year, y=y), color="black") +
stat_smooth(method="lm", fill=NA, color="red", lty=2)
For better visualization of where all the stations are located around Boston, I wanted to put them on a map.
library(maps)    # state outlines used by map_data()
library(ggmap)   # mapping extras; not strictly needed for map_data()
library(mapdata) # higher-resolution map data
boston_map <- map_data("state", region = "massachusetts")
temp_map <- ggplot(boston_map, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group)) + # group keeps separate polygons (e.g. islands) from connecting
  coord_map() +
  geom_point(data = annual_avg, aes(x = long, y = lat), color = "orange")
temp_map
To observe changes between the mean monthly maximum (MMXT) and mean monthly minimum (MMNT) temperatures, I created a new column called Fluctuation to track how drastically the average extremes were varying over time.
data_1950to2016 <- data_1950to2016 %>%
mutate(Fluctuation = (MMXT - MMNT))
Next, I wanted to view those changes using the same type of plot I used for mean temperatures over time.
fluctuation_avg <- data_1950to2016 %>%
group_by(Station, Year, lat, long) %>%
summarise(Fluctuation_Mean = mean(Fluctuation), temp_count = n()) %>%
filter(temp_count == 12, !is.na(Fluctuation_Mean))
fluc_mean <- fluctuation_avg %>%
group_by(Year) %>%
summarise(y = mean(Fluctuation_Mean)) %>%
ungroup()
ggplot(fluctuation_avg, mapping = aes(x=Year, y=Fluctuation_Mean)) +
geom_line(color="grey", alpha=0.5, mapping=aes(group = Station)) +
theme_bw() +
geom_line(data = fluc_mean, mapping = aes(x=Year, y=y), color="black")+
geom_point(data = fluc_mean, mapping = aes(x=Year, y=y), color="black") +
stat_smooth(method="lm", fill=NA, color="red", lty=2)
The climate data now needs to be analyzed. To do this, I fit linear models, which can be done for both the annual averages of the individual stations and the overall annual average of all stations. (Note that these models regress Year on temperature rather than the more conventional temperature on Year; for a simple regression the slope's p-value is identical in either direction, so the significance tests below are unaffected.)
climate_mod <- lm(Year ~ Yearly_Mean, data = annual_avg)
allstations_mod <- lm(Year ~ y, data = allstations_avg)
fluc_mean_mod <- lm(Year ~ y, data = fluc_mean)
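For reference, the more conventional formulation treats temperature as the response, so the slope reads directly as degrees per year (a sketch; climate_mod_conventional is a name I've introduced):
# Conventional direction: temperature as a function of time
climate_mod_conventional <- lm(Yearly_Mean ~ Year, data = annual_avg)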
Using the linear models, the data for the individual stations is summarized to show the standard error, t value, F statistic, and p-value.
summary(climate_mod)
##
## Call:
## lm(formula = Year ~ Yearly_Mean, data = annual_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.935 -14.102 0.382 15.675 37.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1849.8770 16.7114 110.696 < 2e-16 ***
## Yearly_Mean 2.6834 0.3376 7.948 4.18e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.66 on 1261 degrees of freedom
## Multiple R-squared: 0.0477, Adjusted R-squared: 0.04695
## F-statistic: 63.17 on 1 and 1261 DF, p-value: 4.181e-15
To show how well the actual data fit the theoretical distribution, a Q-Q plot is created. Ideally, the points would fall along a perfectly straight line.
plot(climate_mod, which = 2)
The same tests are then performed for the averages of all stations across time, as well as for the fluctuations of the extremes over time.
summary(allstations_mod)
##
## Call:
## lm(formula = Year ~ y, data = allstations_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.093 -12.517 1.826 14.351 33.278
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1754.658 103.417 16.967 <2e-16 ***
## y 4.605 2.090 2.204 0.0312 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.65 on 64 degrees of freedom
## Multiple R-squared: 0.07053, Adjusted R-squared: 0.056
## F-statistic: 4.856 on 1 and 64 DF, p-value: 0.03115
plot(allstations_mod, which = 2)
summary(fluc_mean_mod)
##
## Call:
## lm(formula = Year ~ y, data = fluc_mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.193 -13.238 2.391 14.977 39.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2131.194 63.761 33.425 <2e-16 ***
## y -7.358 3.153 -2.334 0.0228 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.57 on 64 degrees of freedom
## Multiple R-squared: 0.07841, Adjusted R-squared: 0.06401
## F-statistic: 5.445 on 1 and 64 DF, p-value: 0.02277
plot(fluc_mean_mod, which = 2)
The models produced very low R-squared values, which isn't abnormal for ecological data, since much of the variation comes from large year-to-year swings in annual temperature. The Q-Q plots also did not align perfectly linearly, though averaging all the stations together gave a better fit than the Q-Q plot of the individual stations, since averaging reduces some of the station-to-station variation.
As the trend line and p-value show, there is a clear upward overall trend in average annual temperature from 1950 to the present. A small p-value of 0.03115 indicates the change is unlikely to have occurred by chance, supporting my first hypothesis.
The average fluctuation between temperature extremes actually decreased over time, which was a bit surprising, as I had hypothesized that the difference between mean monthly maximum and mean monthly minimum temperatures would grow over time. The linear model produced a low p-value of 0.02277, suggesting the downward trend in fluctuation is not due to chance; so while the null hypothesis of no trend can be rejected, the data contradict my second hypothesis of increasing fluctuation.
For future or additional experiments, I would want to get even more detailed with my analysis. While temperature does appear to be increasing, finding exactly when and where the greatest changes are occurring would give greater insight into the cause of those changes.