Mean annual air temperature in the contiguous US: 1895 to 2014

knitr::opts_chunk$set(
  echo = TRUE,
  message=FALSE,
  warning=FALSE,
  out.width = '100%',
  fig.height = 3,
  dev = "svglite"
  )

#Libraries----
library(paletteer)
library(pals)
library(ggridges)
library(kableExtra)
library(mapview)
library(sp)
library(broom)
library(tidyverse)

#Load data----
load("data/processed/temperature_comparison.R") #temp_comp

Overview

In this section we are going to compare two data sets: USCHN and PRISM mean annual temperatures. The purpose of such a comparison is to understand if we can apply PRISM data to other problems with little fear of introducing bias into any resulting analysis. PRISM data is commonly used in areas where there are no nearby weather stations (see “About the data” below).

For now I’m going to focus on the ubiquitous t-test for comparing our data, but in later iterations of this document, I’m hoping to add the non-parametric Mann-Whitney test, as well as a Bayesian t-test.

Questions

For this exercise I’m going to focus on a few questions to help guide the analysis:

What does climate data, and specifically temperature data, look like for the contiguous US from approximately 1895 to 2014?
Are observed and predicted mean annual air temperature values similar for data recorded or produced by major climate data warehouses? Specifically, how do data from the US Historical Climatology Network (observed) and Parameter-elevation Regressions on Independent Slopes Model (predicted) compare?
Is there systematic bias between observed and predicted temperature data?
Across the contiguous US, how much error should we expect from PRISM data, relative to the actual mean annual temperature?

About the data

USHCN (US Historical Climatology Network) hosts data observations from weather stations across the contiguous US going back to before 1895 in some cases (1866 is the earliest year for mean annual temperature data). The downside of this data set is that the observations are geographically discrete, meaning they only occur at specific points, and do not necessarily represent temperatures even a few hundred feet away.

PRISM (Parameter-elevation Regressions on Independent Slopes Model), on the other hand, occurs across the entire lower 48 as a grid. PRISM data is also a long-term, going back to exactly 1895. However, the spatial resolution of the data is between 800 m and 4 km, which is relatively coarse compared to the point estimates recorded by the USHCN data.

Thus, there is a potential trade-off in using each data set: temperature accuracy vs spatial coverage. And speaking of spatial coverage the figure below provides an example of the spatial distribution of both the USHCN and PRISM data sets, focusing in on the year 2014.

#Read in the prism data for 2014
prism_file <- paste0("data/raw/ushcn/PRISM_tmean_stable_4kmM2_2014_bil/",
                     "PRISM_tmean_stable_4kmM2_2014_bil.bil")
#.bil files are a kind of raster
prism_2014_df <- raster::raster(prism_file) %>%
  #Convert from a raster to SpatialPixelsDataFrame
  as("SpatialPixelsDataFrame")
#Convert to tibble
  #There is definitely a better way of coercing this thing...
prism_2014_df <- bind_cols(as_tibble(pluck(prism_2014_df, "coords")),
                           as_tibble(pluck(prism_2014_df, "data"))) %>%
  #Specify meaningful names
  setNames(c("x", "y", "temperature"))

#Keep only the 2014 data from USHCN data
temp_comp_2014 <- temp_comp %>%
  filter(year == 2014)

#Plot!
prism_2014_df %>%
  ggplot() +
  geom_raster(aes(x, y, fill = temperature)) +
  geom_point(data = temp_comp_2014, aes(longitude, latitude, fill = ushcn),
             shape = 21, size = 2, alpha = 0.8) +
  theme_void() +
  #`quickmap` allows me to use the right projection, but draw the raster at a
  # coarse resolution, so I don't have to sit around for long while the image
  # renders (unlike when I use `coord_map`).
  coord_quickmap() +
  scale_fill_distiller(palette = "Spectral", name = "Temp. (C)")

Figure: A comparison of the spatial distribution of the two data sets of interest, for a single year, 2014. The raster represents the PRISM data while the points are the weather stations from USHCN. Don’t be fooled by the size of the points, the spatial weather station temperature coverage is small.

Given both the PRISM and USHNC data are using the same legend scale, there are no obvious visual differences in mean annual temperature for 2014.

Data downloading and manipulation

The code chunks that follow contain the set of instructions used to:

Download the data from the:
1. USHNC website
2. PRISM website
Extract the PRISM information at each USHNC site
Join the two data sets.

In data science, downloading and formatting data can take more effort than the actual analysis. For this reason the code chunks below are pretty long. To see the analysis, click the “Code” buttons to the right.

#Note: For the sake reducing the amount of time it takes this site to render,
# I've broken up my analysis into scripts that do the actual work (see the R
# file below) and the .Rmd files that were rendered to make the site (e.g.,
# this file -- comparison.Rmd). This keeps me from having to re-do my analysis
# over and over.

#I found example 113 on Yihui's knitr-examples to be particularly useful in
# understanding code externalization:
# https://github.com/yihui/knitr-examples/blob/master/113-externalization.Rmd

knitr::read_chunk("scripts/draft/temperature_comparison.R")

#Header----
#Project: Earth Systems Data Science
#Purpose: Download and combine different temperature records for comparison
#File: ./scripts/temperature_comparison.R
#By: Jason Mercer
#Creation date (YYYY/MM/DD): 2019/05/10
#R Version(s): 3.5.3

#Libraries----
library(tidyverse)
library(sf)
library(sp)
library(raster)
library(prism)


#Options----
#Where to download the prism data
download_folder <- "data/temp"

#Create the download folder, if it doesn't already exist
if(!file.exists(download_folder)) {
  dir.create(download_folder)
  cat("Made folder: '", download_folder, "'.", sep = "")
} else {
  cat("Folder '", download_folder, "' already exists.", sep = "")
}

#Specify a download location for data
#Can't specify download location in functions, for some reason
options(prism.path = download_folder)


#Custom functions----
#Load in any custom functions I've made
purrr::walk(list.files("functions/R", pattern = "*.R$", full.names = TRUE,
                       recursive = FALSE), source, verbose = FALSE)


#Download data----
#*USHCN data----
#Monthly average temperature for sites in the US Historical Climatology Network
ta_mon_file <- paste0("https://cdiac.ess-dive.lbl.gov/ftp/ushcn_v2.5_monthly/",
                      "ushcn2014_FLs_52i_tavg.txt.gz")
#Station metadata (e.g., spatial locations)
station_file <- paste0("https://cdiac.ess-dive.lbl.gov/ftp/ushcn_v2.5_monthly/",
                       "ushcn-stations.txt")
#Download the data
download.file(url = ta_mon_file,
              destfile = "data/raw/ushcn/ushcn2014_FLs_52i_tavg.txt.gz")
download.file(url = station_file,
              destfile = "data/raw/ushcn/ushcn-stations.txt")

#*PRISM data----
#Download the prism data; this takes a few moments
#prism::get_prism_annual(type = "tmean", years = 1895:2014)


#Load data----
#*USHCN data----
#Station temperature data
ta_mon_raw <- read.table(file = "data/raw/ushcn/ushcn2014_FLs_52i_tavg.txt.gz",
                     as.is = TRUE,
                     colClasses = "character",
                     sep = "$")

#Station metadata
#Information from the metadata file for reading in a fixed width file
pos_start <- c(1, 8, 17, 27, 34, 37, 68, 75, 82, 89)
pos_end <- c(6, 15, 25, 32, 35, 66, 73, 80, 87, 90)
  #Note: elevation is in meters
pos_col <- c("station_id", "latitude", "longitude", "elevation", "state",
             "name", "component_1", "component_2", "component_3", "UTC")

station_raw <- read_fwf("data/raw/ushcn/ushcn-stations.txt",
                        col_positions = fwf_positions(start = pos_start,
                                                      end = pos_end,
                                                      col_names = pos_col),
                        na = c("------", "-999.9"))

#*PRISM data----
#Only one PRISM data set will be in memory at a time, to reduce overhead.
# Read-in occurs when executing the `prism_extractor` function.


#Data manipulation----
#Massage the USHCN file into something useable
  #Drop the first 17 characters -- they'll get added in later
ta_mon <- sub(pattern = "^.{17}", replacement = "",
                      x = ta_mon_raw$V1) %>%
  #Remove some superfluous stuff
  gsub(pattern = "(.{5}).{4}", replacement = "\\1,", x = .)  %>%
  sub(",(.{5}).{3}$", ",\\1", .) %>%
  #Read in the "file"; first 12 obs are month average temps
  read_csv(col_names = c(month.abb, "annual"),
           na = "-9999",
           trim_ws = TRUE) %>%
  #Keep only the annual data for each site/year combo
  dplyr::select(annual) %>%
  #Pull in the station info
  mutate(station_id = sub(pattern = "USH00([[:digit:]]+).*",
                                             replacement = "\\1",
                                             x = ta_mon_raw$V1)) %>%
  #Pull in the year
  mutate(year = as.numeric(sub("USH00[[:digit:]]+ +([[:digit:]]+).*", "\\1",
                               ta_mon_raw$V1)))

#Convert temperature from F to C
ta_mon <- ta_mon %>%
  mutate(annual = ((annual/10) - 32) * 5/9) %>%
  rename("ushcn" = "annual")

#Keep only the station metadata of interest
station <- station_raw %>%
  dplyr::select(station_id, latitude, longitude, elevation, state, name)

#Combine temperature and
ushcn_data <- full_join(station, ta_mon, by = "station_id")

#I'm guessing the CRS is 4326 base on:
# https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/
# OverviewCoordinateReferenceSystems.pdf
temper_sp <- ushcn_data %>%
  dplyr::select(longitude, latitude) %>%
  distinct() %>%
  sp::SpatialPointsDataFrame(data = ushcn_data %>%
                               distinct(longitude, latitude, station_id),
                             proj4string = CRS(sf::st_crs(4326)[[2]]))

#Convert the temper_sp to the same projection as the PRISM .bil files
temper_sp <- sp::spTransform(temper_sp,
  "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0")

#Record which folders were downloaded for the PRISM data
prism_folders <- list.dirs(download_folder, full.names = TRUE,
                           recursive = FALSE)

#Record the full name of each file in each folder that contains the raster info
prism_files <- paste0(prism_folders, "/", basename(prism_folders), ".bil")


#Data analysis----
#Read in the PRISM data! (This'll take a hot second)
prism_data_raw <- map_dfr(prism_files, prism_extractor,
                                 temper_sp["station_id"])
  
#Convert time into numeric, highlight that it is a year, and keep only the
# important stuff
prism_data <- prism_data_raw %>%
  dplyr::mutate(year = as.numeric(time)) %>%
  dplyr::select(site, year, value) %>%
  rename("prism" = "value")

#Join all the data together so it can be compared
temp_comp <- full_join(ushcn_data, prism_data,
                         by = c("station_id" = "site", "year" = "year"))


#Export----
#Save the data to be examined and plotted later
#save(temp_comp, file = "data/processed/temperature_comparison.R")


#Remove----
#Get rid of the large temporary data
#unlink(download_folder, recursive = TRUE)

Exploratory data analysis

Now that the data has been downloaded, it is time to explore it and make sure it looks okay. First, let’s look at the global distributions of the data, by source.

#Create a "fill palette" for the plot below
fp <- paletteer_d(ggsci, lanonc_lancet, 4)[c(3,4)] %>%
  setNames(c("PRISM", "USHCN"))

temp_comp %>%
  #Reformating the data into a "long" format for ggplot
  gather("source", "Temperature", ushcn, prism) %>%
  mutate(source = toupper(source)) %>%
  #Getting rid of some NAs
  drop_na() %>%
  #Plot!
  ggplot(aes(Temperature, fill = source)) +
  geom_density(alpha = 0.5) +
  labs(y = "Probability density",
       x = "Temperature (C)") +
  theme_classic() +
  scale_fill_manual(values = fp, name = NULL)

Figure: The density distributions of both data sets across all sites and years. There are no obvious outliers, suggesting the data is in reasonably good condition.

Well, the distributions of of all the data look pretty similar! That’s good. Now let’s see if we’ve got the kind of temporal coverage advertised by the two datasets. Note: exploring your data is essential. During the early stages of making this site, I didn’t download all of the PRISM data (download error) and didn’t realize it until making this plot. So remember to always plot your data!

temp_comp %>%
  rename("USHCN" = "ushcn", "PRISM" = "prism") %>%
  gather("source", "value", USHCN, PRISM) %>%
  mutate(source = factor(source, levels = c("PRISM", "USHCN"))) %>%
  drop_na() %>%
  ggplot(aes(year, value, color = latitude, group = station_id)) +
  geom_line() +
  labs(x = "Year",
       y = "Temperature (C)") +
  facet_wrap(~ source) +
  theme_classic() +
  theme(strip.background = element_blank()) +
  scale_color_paletteer_c(pals, kovesi.rainbow_bgyr_35_85_c72, direction = -1,
                          name = "Latitude\n(degrees)") +
  scale_x_continuous(expand = c(0, 0))

Figure: The temporal range of every site highlighting the latitude of that observation. As expected, we have a lot of data for the adverised range of dates. Also the latitudinal pattern of temperatures makes sense, with some deviation likely due to high elevation parts of the country (e.g., Rocky Mountains).

Since our temporal coverage looks good, let’s now examine the spatial coverage of the data. Specifically, let’s look at where the USHCN sites are located and how the mean annual temperature across years is distributed in space. This should provide us at least some sense as to whether there is systematic bias between the two data sets.

temp_comp %>%
  rename("USHCN" = "ushcn", "PRISM" = "prism") %>%
  group_by(station_id) %>%
  summarise(latitude = first(latitude),
            longitude = first(longitude),
            USHCN = mean(USHCN, na.rm = TRUE),
            PRISM = mean(PRISM, na.rm = TRUE)) %>%
  gather("source", "value", USHCN, PRISM) %>%
  mutate(source = factor(source, levels = c("PRISM", "USHCN"))) %>%
  ggplot() +
  geom_map(data = map_data("usa"), map = map_data("usa"),
           aes(map_id = region), fill = "white",
           color= "black") +
  geom_point(aes(longitude, latitude, fill = value),
             shape = 21, size = 2, alpha = 0.8) +
  facet_wrap(~source, nrow = 2) +
  theme_void() +
  coord_map() +
  scale_fill_distiller(palette = "Spectral",
                         name = "Mean\ntemperature (C)")

Figure: The mean annual air temperatures at every site for every year by source. USHCN points were used to extract PRISM data and then averaged over the available data at each site.

There are a number of other data checks we could perform, but what we’ve done is good enough for the purpose of this exercise, so let’s start our analysis.

Analyzing temperature differences in space and time

Paired t-test

#Test the hypothesis that ushcn - prism = 0
  #If estimate < 0 then the PRISM data is overestimating the temperature
  #If estimate > 0 then the PRISM data is underestimating the temperature

#This is really the workhorse code for the t-test.
temper_ttest <- temp_comp %>%
  #Split the dataframe in to lists by station id
  split(.$station_id) %>%
  #Keep only the temperature data from the two data sets
  purrr::map(dplyr::select, ushcn, prism) %>%
  #Get rid of any NAs, because `t.test` can't handle them
  purrr::map(drop_na) %>%
  #Do the t-tests!
  purrr::map(~t.test(.x$ushcn, .x$prism, alternative = "two.sided",
                     paired = TRUE)) %>%
  #The broom package has a convenient function for taking the output returned
  # from a t-test turning it into a tibble/dataframe.
    #`map_dfr` combines all the t-test information into a single dataframe and
    # uses the station ids (recorded from `split`) as a new column called
    # "station_id" allowing us to figure out where the data came from.
  purrr::map_dfr(broom::glance, .id = "station_id")

temper_tested <- temp_comp %>%
  dplyr::select(station_id, latitude, longitude, elevation) %>%
  distinct() %>%
  full_join(temper_ttest, by = "station_id") %>%
  mutate(signif = ifelse(estimate < 0, "Overestimate", "Underestimate"),
         signif = ifelse(p.value > 0.05, "Unbiased", signif),
         signif = factor(signif, levels = c("Overestimate", "Unbiased",
                                            "Underestimate"))) %>%
  mutate(estimate_NA = ifelse(p.value > 0.05, NA, estimate))

temper_tested %>%
  ggplot() +
  geom_map(data = map_data("usa"), map = map_data("usa"),
           aes(map_id = region), fill = "white",
           color= "black") +
  geom_point(aes(longitude, latitude, fill = estimate_NA),
             shape = 21, size = 2, alpha = 0.8) +
  theme_void() +
  coord_map() +
  scale_fill_paletteer_c(pals, ocean.balance, direction = -1, limits = c(-4, 4),
                         name = "Mean\ntemperature\nbias (C)",
                         na.value = "grey50")

Figure: The mean temperature bias at each site when comparing the observed (USHCN) to predicted (PRISM) mean annual air temperatures from 1895 to 2014. In this context, warm colors illustrate that the PRISM data is overestimating the true temperature, while cooler colors occur where the PRISM data is underestimating the true temperature. Gray points were not significantly different, therefore we’d expect most of the points to be gray if there were no significant differences between the two data sets. Instead we see a large number of sites that tend to be red, suggesting potential bias.

temper_tested %>%
  ggplot(aes(estimate, 1, fill = ..x..)) +
  geom_density_ridges_gradient() +
  labs(x = "Mean temperature bias (C)",
       y = "Probability density") +
  theme_classic() +
  theme(legend.position = "none") +  
  scale_x_continuous(limits = c(-4, 4), expand = c(0, 0)) +
  scale_fill_paletteer_c(pals, ocean.balance, -1, limits = c(-4, 4))

Figure: The distribution of mean bias in the temperature record between observed and predicted data. Again, warm and cool colors indicate over and underestimated temperature predictions. Note that the mean is near zero suggesting (though not conclusively) no underlying bias in temperature predictions in the PRISM data set.

temper_tested %>%
  group_by(signif) %>%
  summarise(Proportion = n()) %>%
  ungroup() %>%
  mutate(Proportion = round(Proportion/nrow(temper_tested), digits = 2)) %>%
  rename("Directionality in temperature difference" = "signif") %>%
  as_tibble() %>%
  kable(caption = "Table: The different proportions of data that were overestimated, underestimated, or unbiased. The relatively large proportion of overestimated data in combination with the last figure suggests the PRISM data may be systematically overestimating the true mean annual air temperature, assuming the USNHC data is also unbiased.") %>%
  kable_styling()

Table: The different proportions of data that were overestimated, underestimated, or unbiased. The relatively large proportion of overestimated data in combination with the last figure suggests the PRISM data may be systematically overestimating the true mean annual air temperature, assuming the USNHC data is also unbiased.
Directionality in temperature difference	Proportion
Overestimate	0.69
Unbiased	0.08
Underestimate	0.23

#Header----
#Project: Earth Systems Data Science
#Purpose: Download and combine different temperature records for comparison
#File: ./scripts/mapview_comparison.R
#By: Jason Mercer
#Creation date (YYYY/MM/DD): 2019/05/12
#R Version(s): 3.5.3

#Libraries----
library(leaflet)
library(mapview)
library(pals)
library(sp)
library(tidyverse)

#Load data----
load("data/processed/temperature_comparison.R") #temp_comp


#Data manipulation----
temper_ttest <- temp_comp %>%
  split(.$station_id) %>%
  purrr::map(dplyr::select, ushcn, prism) %>%
  purrr::map(drop_na) %>%
  purrr::map(~t.test(.x$ushcn, .x$prism, alternative = "two.sided", paired = TRUE)) %>%
  purrr::map_dfr(glance, .id = "station_id")

temper_tested <- temp_comp %>%
  dplyr::select(station_id, latitude, longitude, elevation) %>%
  distinct() %>%
  full_join(temper_ttest, by = "station_id") %>%
  mutate(signif = ifelse(estimate < 0, "Overestimate", "Underestimate"),
         signif = ifelse(p.value > 0.05, "Unbiased", signif),
         signif = factor(signif, levels = c("Overestimate", "Unbiased",
                                            "Underestimate"))) %>%
  mutate(estimate_NA = ifelse(p.value > 0.05, NA, estimate))


#Data analysis----
#Calculate an empirical density function for each site
temp_density <- temp_comp %>%
  split(.$station_id) %>%
  purrr::map(mutate, diff = ushcn - prism) %>%
  purrr::map(drop_na) %>%
  purrr::map(pull, diff) %>%
  purrr::map(density) %>%
  purrr::map(~.x[c("x", "y")]) %>%
  purrr::map_dfr(as_tibble, .id = "station_id") %>%
  split(.$station_id) %>%
  purrr::map(left_join, temper_tested, by = "station_id") %>%
  purrr::map(left_join, temp_comp %>% distinct(station_id, name, state),
             by = "station_id")

#Make a function that will generate a similar plot across all sites
density_plotter <- function(data) {
  p <- data %>%
    ggplot(aes(x, ymax = y, ymin = 0)) +
    geom_ribbon(aes(fill = estimate_NA), alpha = 0.75) +
    geom_line(aes(y = y)) +
    geom_vline(xintercept = 0, color = "gray", linetype = "dashed", size = 1) +
    geom_vline(aes(xintercept = estimate_NA)) +
    labs(x = "Temperature bias (C)",
         y = "Probability density",
         title = paste0(unique(data$station_id), ": ", unique(data$name), ", ",
                       unique(data$state))) +
    theme_classic() +
    theme(legend.position = "none") +
    scale_fill_paletteer_c(pals, ocean.balance, -1, limits = c(-4, 4),
                           na.value = "grey50")
  
  p
}

#Save all the plots as a list
density_p <- temp_density %>%
  purrr::map(density_plotter)

#Save all the plots locally (this can take a while as there are >1200 sites)
density_p %>%
  purrr::walk2(.y = names(density_p),
              ~ ggsave(filename = paste0("figures/R/comparison/", .y, ".png"),
                      plot = .x,
                      device = "png"))

#Export----

#Hardcode the static locations of the files.
  #This took toooooo long to figure out how to do.
web_files <- paste0("https://raw.githubusercontent.com/wetlandscapes/esds/master/",
                    "figures/R/comparison/", temper_ttest$station_id, ".png")
# web_files <- paste0("../figures/R/comparison/", temper_ttest$station_id, ".png")

#Generate a categorical color palette
cp <- ocean.balance(11)[c(2, 9)]
cp <- c(cp[2], "gray50", cp[1])
names(cp) <- c("Overestimate", "Unbiased", "Underestimate")

#Set some standard mapview options
mapviewOptions(basemaps = c("Esri.WorldImagery", "OpenStreetMap"),
               vector.palette = cp,
               na.color = "gray50",
               legend = TRUE,
               layers.control.pos = "topright")

#Generate 
temper_tested %>%
  left_join(temp_comp %>% distinct(station_id, name), by = "station_id") %>%
  select(longitude, latitude, signif, name) %>%
  drop_na() %>%
  rename("Temp. bias (C)" = "signif") %>%
  SpatialPointsDataFrame(cbind(.$longitude, .$latitude), data = .,
                         proj4string = CRS("+init=epsg:4326")) %>%
  mapView(zcol = "Temp. bias (C)",
          label = .$name,
          color = "gray50",
          popup = popupImage(web_files, src = "remote"))

Add an explanation

Conclusions

t-test

Based on the results of our 1,200+ pair-wise t-tests, it does look like there might be some systematic bias in mean annual temperature estimates of the PRISM data, relative to the USHCN data.