knitr::opts_chunk$set(
echo = TRUE,
message=FALSE,
warning=FALSE,
out.width = '100%',
fig.height = 3,
dev = "svglite"
)
#Libraries----
library(paletteer)
library(pals)
library(ggridges)
library(kableExtra)
library(mapview)
library(sp)
library(broom)
library(tidyverse)
#Load data----
load("data/processed/temperature_comparison.R") #temp_comp
In this section we are going to compare two data sets: USCHN and PRISM mean annual temperatures. The purpose of such a comparison is to understand if we can apply PRISM data to other problems with little fear of introducing bias into any resulting analysis. PRISM data is commonly used in areas where there are no nearby weather stations (see “About the data” below).
For now I’m going to focus on the ubiquitous t-test for comparing our data, but in later iterations of this document, I’m hoping to add the non-parametric Mann-Whitney test, as well as a Bayesian t-test.
For this exercise I’m going to focus on a few questions to help guide the analysis:
USHCN (US Historical Climatology Network) hosts data observations from weather stations across the contiguous US going back to before 1895 in some cases (1866 is the earliest year for mean annual temperature data). The downside of this data set is that the observations are geographically discrete, meaning they only occur at specific points, and do not necessarily represent temperatures even a few hundred feet away.
PRISM (Parameter-elevation Regressions on Independent Slopes Model), on the other hand, occurs across the entire lower 48 as a grid. PRISM data is also a long-term, going back to exactly 1895. However, the spatial resolution of the data is between 800 m and 4 km, which is relatively coarse compared to the point estimates recorded by the USHCN data.
Thus, there is a potential trade-off in using each data set: temperature accuracy vs spatial coverage. And speaking of spatial coverage the figure below provides an example of the spatial distribution of both the USHCN and PRISM data sets, focusing in on the year 2014.
#Read in the prism data for 2014
prism_file <- paste0("data/raw/ushcn/PRISM_tmean_stable_4kmM2_2014_bil/",
"PRISM_tmean_stable_4kmM2_2014_bil.bil")
#.bil files are a kind of raster
prism_2014_df <- raster::raster(prism_file) %>%
#Convert from a raster to SpatialPixelsDataFrame
as("SpatialPixelsDataFrame")
#Convert to tibble
#There is definitely a better way of coercing this thing...
prism_2014_df <- bind_cols(as_tibble(pluck(prism_2014_df, "coords")),
as_tibble(pluck(prism_2014_df, "data"))) %>%
#Specify meaningful names
setNames(c("x", "y", "temperature"))
#Keep only the 2014 data from USHCN data
temp_comp_2014 <- temp_comp %>%
filter(year == 2014)
#Plot!
prism_2014_df %>%
ggplot() +
geom_raster(aes(x, y, fill = temperature)) +
geom_point(data = temp_comp_2014, aes(longitude, latitude, fill = ushcn),
shape = 21, size = 2, alpha = 0.8) +
theme_void() +
#`quickmap` allows me to use the right projection, but draw the raster at a
# coarse resolution, so I don't have to sit around for long while the image
# renders (unlike when I use `coord_map`).
coord_quickmap() +
scale_fill_distiller(palette = "Spectral", name = "Temp. (C)")
Given both the PRISM and USHNC data are using the same legend scale, there are no obvious visual differences in mean annual temperature for 2014.
The code chunks that follow contain the set of instructions used to:
In data science, downloading and formatting data can take more effort than the actual analysis. For this reason the code chunks below are pretty long. To see the analysis, click the “Code” buttons to the right.
#Note: For the sake reducing the amount of time it takes this site to render,
# I've broken up my analysis into scripts that do the actual work (see the R
# file below) and the .Rmd files that were rendered to make the site (e.g.,
# this file -- comparison.Rmd). This keeps me from having to re-do my analysis
# over and over.
#I found example 113 on Yihui's knitr-examples to be particularly useful in
# understanding code externalization:
# https://github.com/yihui/knitr-examples/blob/master/113-externalization.Rmd
knitr::read_chunk("scripts/draft/temperature_comparison.R")
#Header----
#Project: Earth Systems Data Science
#Purpose: Download and combine different temperature records for comparison
#File: ./scripts/temperature_comparison.R
#By: Jason Mercer
#Creation date (YYYY/MM/DD): 2019/05/10
#R Version(s): 3.5.3
#Libraries----
library(tidyverse)
library(sf)
library(sp)
library(raster)
library(prism)
#Options----
#Where to download the prism data
download_folder <- "data/temp"
#Create the download folder, if it doesn't already exist
if(!file.exists(download_folder)) {
dir.create(download_folder)
cat("Made folder: '", download_folder, "'.", sep = "")
} else {
cat("Folder '", download_folder, "' already exists.", sep = "")
}
#Specify a download location for data
#Can't specify download location in functions, for some reason
options(prism.path = download_folder)
#Custom functions----
#Load in any custom functions I've made
purrr::walk(list.files("functions/R", pattern = "*.R$", full.names = TRUE,
recursive = FALSE), source, verbose = FALSE)
#Download data----
#*USHCN data----
#Monthly average temperature for sites in the US Historical Climatology Network
ta_mon_file <- paste0("https://cdiac.ess-dive.lbl.gov/ftp/ushcn_v2.5_monthly/",
"ushcn2014_FLs_52i_tavg.txt.gz")
#Station metadata (e.g., spatial locations)
station_file <- paste0("https://cdiac.ess-dive.lbl.gov/ftp/ushcn_v2.5_monthly/",
"ushcn-stations.txt")
#Download the data
download.file(url = ta_mon_file,
destfile = "data/raw/ushcn/ushcn2014_FLs_52i_tavg.txt.gz")
download.file(url = station_file,
destfile = "data/raw/ushcn/ushcn-stations.txt")
#*PRISM data----
#Download the prism data; this takes a few moments
#prism::get_prism_annual(type = "tmean", years = 1895:2014)
#Load data----
#*USHCN data----
#Station temperature data
ta_mon_raw <- read.table(file = "data/raw/ushcn/ushcn2014_FLs_52i_tavg.txt.gz",
as.is = TRUE,
colClasses = "character",
sep = "$")
#Station metadata
#Information from the metadata file for reading in a fixed width file
pos_start <- c(1, 8, 17, 27, 34, 37, 68, 75, 82, 89)
pos_end <- c(6, 15, 25, 32, 35, 66, 73, 80, 87, 90)
#Note: elevation is in meters
pos_col <- c("station_id", "latitude", "longitude", "elevation", "state",
"name", "component_1", "component_2", "component_3", "UTC")
station_raw <- read_fwf("data/raw/ushcn/ushcn-stations.txt",
col_positions = fwf_positions(start = pos_start,
end = pos_end,
col_names = pos_col),
na = c("------", "-999.9"))
#*PRISM data----
#Only one PRISM data set will be in memory at a time, to reduce overhead.
# Read-in occurs when executing the `prism_extractor` function.
#Data manipulation----
#Massage the USHCN file into something useable
#Drop the first 17 characters -- they'll get added in later
ta_mon <- sub(pattern = "^.{17}", replacement = "",
x = ta_mon_raw$V1) %>%
#Remove some superfluous stuff
gsub(pattern = "(.{5}).{4}", replacement = "\\1,", x = .) %>%
sub(",(.{5}).{3}$", ",\\1", .) %>%
#Read in the "file"; first 12 obs are month average temps
read_csv(col_names = c(month.abb, "annual"),
na = "-9999",
trim_ws = TRUE) %>%
#Keep only the annual data for each site/year combo
dplyr::select(annual) %>%
#Pull in the station info
mutate(station_id = sub(pattern = "USH00([[:digit:]]+).*",
replacement = "\\1",
x = ta_mon_raw$V1)) %>%
#Pull in the year
mutate(year = as.numeric(sub("USH00[[:digit:]]+ +([[:digit:]]+).*", "\\1",
ta_mon_raw$V1)))
#Convert temperature from F to C
ta_mon <- ta_mon %>%
mutate(annual = ((annual/10) - 32) * 5/9) %>%
rename("ushcn" = "annual")
#Keep only the station metadata of interest
station <- station_raw %>%
dplyr::select(station_id, latitude, longitude, elevation, state, name)
#Combine temperature and
ushcn_data <- full_join(station, ta_mon, by = "station_id")
#I'm guessing the CRS is 4326 base on:
# https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/
# OverviewCoordinateReferenceSystems.pdf
temper_sp <- ushcn_data %>%
dplyr::select(longitude, latitude) %>%
distinct() %>%
sp::SpatialPointsDataFrame(data = ushcn_data %>%
distinct(longitude, latitude, station_id),
proj4string = CRS(sf::st_crs(4326)[[2]]))
#Convert the temper_sp to the same projection as the PRISM .bil files
temper_sp <- sp::spTransform(temper_sp,
"+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0")
#Record which folders were downloaded for the PRISM data
prism_folders <- list.dirs(download_folder, full.names = TRUE,
recursive = FALSE)
#Record the full name of each file in each folder that contains the raster info
prism_files <- paste0(prism_folders, "/", basename(prism_folders), ".bil")
#Data analysis----
#Read in the PRISM data! (This'll take a hot second)
prism_data_raw <- map_dfr(prism_files, prism_extractor,
temper_sp["station_id"])
#Convert time into numeric, highlight that it is a year, and keep only the
# important stuff
prism_data <- prism_data_raw %>%
dplyr::mutate(year = as.numeric(time)) %>%
dplyr::select(site, year, value) %>%
rename("prism" = "value")
#Join all the data together so it can be compared
temp_comp <- full_join(ushcn_data, prism_data,
by = c("station_id" = "site", "year" = "year"))
#Export----
#Save the data to be examined and plotted later
#save(temp_comp, file = "data/processed/temperature_comparison.R")
#Remove----
#Get rid of the large temporary data
#unlink(download_folder, recursive = TRUE)
Now that the data has been downloaded, it is time to explore it and make sure it looks okay. First, let’s look at the global distributions of the data, by source.
#Create a "fill palette" for the plot below
fp <- paletteer_d(ggsci, lanonc_lancet, 4)[c(3,4)] %>%
setNames(c("PRISM", "USHCN"))
temp_comp %>%
#Reformating the data into a "long" format for ggplot
gather("source", "Temperature", ushcn, prism) %>%
mutate(source = toupper(source)) %>%
#Getting rid of some NAs
drop_na() %>%
#Plot!
ggplot(aes(Temperature, fill = source)) +
geom_density(alpha = 0.5) +
labs(y = "Probability density",
x = "Temperature (C)") +
theme_classic() +
scale_fill_manual(values = fp, name = NULL)
Well, the distributions of of all the data look pretty similar! That’s good. Now let’s see if we’ve got the kind of temporal coverage advertised by the two datasets. Note: exploring your data is essential. During the early stages of making this site, I didn’t download all of the PRISM data (download error) and didn’t realize it until making this plot. So remember to always plot your data!
temp_comp %>%
rename("USHCN" = "ushcn", "PRISM" = "prism") %>%
gather("source", "value", USHCN, PRISM) %>%
mutate(source = factor(source, levels = c("PRISM", "USHCN"))) %>%
drop_na() %>%
ggplot(aes(year, value, color = latitude, group = station_id)) +
geom_line() +
labs(x = "Year",
y = "Temperature (C)") +
facet_wrap(~ source) +
theme_classic() +
theme(strip.background = element_blank()) +
scale_color_paletteer_c(pals, kovesi.rainbow_bgyr_35_85_c72, direction = -1,
name = "Latitude\n(degrees)") +
scale_x_continuous(expand = c(0, 0))
Since our temporal coverage looks good, let’s now examine the spatial coverage of the data. Specifically, let’s look at where the USHCN sites are located and how the mean annual temperature across years is distributed in space. This should provide us at least some sense as to whether there is systematic bias between the two data sets.
temp_comp %>%
rename("USHCN" = "ushcn", "PRISM" = "prism") %>%
group_by(station_id) %>%
summarise(latitude = first(latitude),
longitude = first(longitude),
USHCN = mean(USHCN, na.rm = TRUE),
PRISM = mean(PRISM, na.rm = TRUE)) %>%
gather("source", "value", USHCN, PRISM) %>%
mutate(source = factor(source, levels = c("PRISM", "USHCN"))) %>%
ggplot() +
geom_map(data = map_data("usa"), map = map_data("usa"),
aes(map_id = region), fill = "white",
color= "black") +
geom_point(aes(longitude, latitude, fill = value),
shape = 21, size = 2, alpha = 0.8) +
facet_wrap(~source, nrow = 2) +
theme_void() +
coord_map() +
scale_fill_distiller(palette = "Spectral",
name = "Mean\ntemperature (C)")
There are a number of other data checks we could perform, but what we’ve done is good enough for the purpose of this exercise, so let’s start our analysis.
#Test the hypothesis that ushcn - prism = 0
#If estimate < 0 then the PRISM data is overestimating the temperature
#If estimate > 0 then the PRISM data is underestimating the temperature
#This is really the workhorse code for the t-test.
temper_ttest <- temp_comp %>%
#Split the dataframe in to lists by station id
split(.$station_id) %>%
#Keep only the temperature data from the two data sets
purrr::map(dplyr::select, ushcn, prism) %>%
#Get rid of any NAs, because `t.test` can't handle them
purrr::map(drop_na) %>%
#Do the t-tests!
purrr::map(~t.test(.x$ushcn, .x$prism, alternative = "two.sided",
paired = TRUE)) %>%
#The broom package has a convenient function for taking the output returned
# from a t-test turning it into a tibble/dataframe.
#`map_dfr` combines all the t-test information into a single dataframe and
# uses the station ids (recorded from `split`) as a new column called
# "station_id" allowing us to figure out where the data came from.
purrr::map_dfr(broom::glance, .id = "station_id")
temper_tested <- temp_comp %>%
dplyr::select(station_id, latitude, longitude, elevation) %>%
distinct() %>%
full_join(temper_ttest, by = "station_id") %>%
mutate(signif = ifelse(estimate < 0, "Overestimate", "Underestimate"),
signif = ifelse(p.value > 0.05, "Unbiased", signif),
signif = factor(signif, levels = c("Overestimate", "Unbiased",
"Underestimate"))) %>%
mutate(estimate_NA = ifelse(p.value > 0.05, NA, estimate))
temper_tested %>%
ggplot() +
geom_map(data = map_data("usa"), map = map_data("usa"),
aes(map_id = region), fill = "white",
color= "black") +
geom_point(aes(longitude, latitude, fill = estimate_NA),
shape = 21, size = 2, alpha = 0.8) +
theme_void() +
coord_map() +
scale_fill_paletteer_c(pals, ocean.balance, direction = -1, limits = c(-4, 4),
name = "Mean\ntemperature\nbias (C)",
na.value = "grey50")
temper_tested %>%
ggplot(aes(estimate, 1, fill = ..x..)) +
geom_density_ridges_gradient() +
labs(x = "Mean temperature bias (C)",
y = "Probability density") +
theme_classic() +
theme(legend.position = "none") +
scale_x_continuous(limits = c(-4, 4), expand = c(0, 0)) +
scale_fill_paletteer_c(pals, ocean.balance, -1, limits = c(-4, 4))
temper_tested %>%
group_by(signif) %>%
summarise(Proportion = n()) %>%
ungroup() %>%
mutate(Proportion = round(Proportion/nrow(temper_tested), digits = 2)) %>%
rename("Directionality in temperature difference" = "signif") %>%
as_tibble() %>%
kable(caption = "Table: The different proportions of data that were overestimated, underestimated, or unbiased. The relatively large proportion of overestimated data in combination with the last figure suggests the PRISM data may be systematically overestimating the true mean annual air temperature, assuming the USNHC data is also unbiased.") %>%
kable_styling()
Directionality in temperature difference | Proportion |
---|---|
Overestimate | 0.69 |
Unbiased | 0.08 |
Underestimate | 0.23 |
#Header----
#Project: Earth Systems Data Science
#Purpose: Download and combine different temperature records for comparison
#File: ./scripts/mapview_comparison.R
#By: Jason Mercer
#Creation date (YYYY/MM/DD): 2019/05/12
#R Version(s): 3.5.3
#Libraries----
library(leaflet)
library(mapview)
library(pals)
library(sp)
library(tidyverse)
#Load data----
load("data/processed/temperature_comparison.R") #temp_comp
#Data manipulation----
temper_ttest <- temp_comp %>%
split(.$station_id) %>%
purrr::map(dplyr::select, ushcn, prism) %>%
purrr::map(drop_na) %>%
purrr::map(~t.test(.x$ushcn, .x$prism, alternative = "two.sided", paired = TRUE)) %>%
purrr::map_dfr(glance, .id = "station_id")
temper_tested <- temp_comp %>%
dplyr::select(station_id, latitude, longitude, elevation) %>%
distinct() %>%
full_join(temper_ttest, by = "station_id") %>%
mutate(signif = ifelse(estimate < 0, "Overestimate", "Underestimate"),
signif = ifelse(p.value > 0.05, "Unbiased", signif),
signif = factor(signif, levels = c("Overestimate", "Unbiased",
"Underestimate"))) %>%
mutate(estimate_NA = ifelse(p.value > 0.05, NA, estimate))
#Data analysis----
#Calculate an empirical density function for each site
temp_density <- temp_comp %>%
split(.$station_id) %>%
purrr::map(mutate, diff = ushcn - prism) %>%
purrr::map(drop_na) %>%
purrr::map(pull, diff) %>%
purrr::map(density) %>%
purrr::map(~.x[c("x", "y")]) %>%
purrr::map_dfr(as_tibble, .id = "station_id") %>%
split(.$station_id) %>%
purrr::map(left_join, temper_tested, by = "station_id") %>%
purrr::map(left_join, temp_comp %>% distinct(station_id, name, state),
by = "station_id")
#Make a function that will generate a similar plot across all sites
density_plotter <- function(data) {
p <- data %>%
ggplot(aes(x, ymax = y, ymin = 0)) +
geom_ribbon(aes(fill = estimate_NA), alpha = 0.75) +
geom_line(aes(y = y)) +
geom_vline(xintercept = 0, color = "gray", linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = estimate_NA)) +
labs(x = "Temperature bias (C)",
y = "Probability density",
title = paste0(unique(data$station_id), ": ", unique(data$name), ", ",
unique(data$state))) +
theme_classic() +
theme(legend.position = "none") +
scale_fill_paletteer_c(pals, ocean.balance, -1, limits = c(-4, 4),
na.value = "grey50")
p
}
#Save all the plots as a list
density_p <- temp_density %>%
purrr::map(density_plotter)
#Save all the plots locally (this can take a while as there are >1200 sites)
density_p %>%
purrr::walk2(.y = names(density_p),
~ ggsave(filename = paste0("figures/R/comparison/", .y, ".png"),
plot = .x,
device = "png"))
#Export----
#Hardcode the static locations of the files.
#This took toooooo long to figure out how to do.
web_files <- paste0("https://raw.githubusercontent.com/wetlandscapes/esds/master/",
"figures/R/comparison/", temper_ttest$station_id, ".png")
# web_files <- paste0("../figures/R/comparison/", temper_ttest$station_id, ".png")
#Generate a categorical color palette
cp <- ocean.balance(11)[c(2, 9)]
cp <- c(cp[2], "gray50", cp[1])
names(cp) <- c("Overestimate", "Unbiased", "Underestimate")
#Set some standard mapview options
mapviewOptions(basemaps = c("Esri.WorldImagery", "OpenStreetMap"),
vector.palette = cp,
na.color = "gray50",
legend = TRUE,
layers.control.pos = "topright")
#Generate
temper_tested %>%
left_join(temp_comp %>% distinct(station_id, name), by = "station_id") %>%
select(longitude, latitude, signif, name) %>%
drop_na() %>%
rename("Temp. bias (C)" = "signif") %>%
SpatialPointsDataFrame(cbind(.$longitude, .$latitude), data = .,
proj4string = CRS("+init=epsg:4326")) %>%
mapView(zcol = "Temp. bias (C)",
label = .$name,
color = "gray50",
popup = popupImage(web_files, src = "remote"))
Add an explanation
Based on the results of our 1,200+ pair-wise t-tests, it does look like there might be some systematic bias in mean annual temperature estimates of the PRISM data, relative to the USHCN data.