Intermediate R Tutorial

This intermediate R tutorial covers the following topics:

Workshop Goals

  • This workshop is laid out as a deep dive into data exploration. This excercise is meant to mimick many of the challenges and procedures you might encounter analyzing your own data for research. In the beginner workshop, we quickly flew past downloading the data to get you to the point of making a plot. Now, we’re going to handle the data like you would if it was new to you: check for messiness, clean it up, and look for patterns.

Using Open Source Data Retrieval Tools

When we examined the Bow River data in the previous workshop, I just handed it to you. We can practice retrieving it ourselves!

We’ll use the tidyhydat package to access Water Survey of Canada data for the Bow River. If this is your first use of the package, you’ll have to run download_hydat(). The function hy_stations allows us to sample sites, here we specify the province of Alberta. We can then subset these sites for the name and approximate latitude. It appears the Bow River at Cochrane should be a suitable gage!

library(tidyhydat)
#download_hydat()
ABstat=tidyhydat::hy_stations(prov_terr_state_loc = "AB")
ABstat[grepl("BOW RIVER",ABstat$STATION_NAME) & ABstat$LATITUDE>51 & ABstat$HYD_STATUS=="ACTIVE",]

Next, we’ll download the data for station 05BH004.

Bow_data=(as.data.frame(hy_daily_flows(station_number = "05BH004")))
Bow_data$Date=as.Date(Bow_data$Date)
head(Bow_data)
#write.csv(x = Bow_data,file="input/AB05BH004_BowRiver_FlowQ.csv")

So now we’ve gotten the data on our own!

Introduction to ggplot

If you didn’t install ggplot2 in the last workshop, this script will check, and then load it.

for(p in c("ggplot2")){
  if(!require(p,character.only = TRUE,quietly = TRUE)){
  install.packages(p)}
  library(p,character.only = TRUE,quietly = TRUE)}

ggplot has some nice improvements over R’s base-plot. The general format is as follows: The first line specifies the data that will be used to generate a plot: here it is the dataframe containing the columns. ggplot(data=HERE)+ We can “feed” ggplot just the data after 2013 for simplicity. Notice, the “+” after the ggplot() call tells R that there is more information coming. In contrast, with plot everything was wrapped in parentheses and separated by commas. Next, we specify the geometry object we’d like to accompany the plot. geom_point() Inside our geom, we call aes(), for aesthetics. Inside aes(), we specify the names of the columns we would like on the x and y axes. Notice, unlike base-plot, we don’t need to specify the dataframe first, as in: dataframe$column. Nor do the columns need to be in parentheses.

ggplot2::ggplot(data=Bow_data)+ 
  geom_line(aes(x=Date,y=Value)) 

# syntax of <packagename>::<function name> helps to ensure the correct function is called 
# if/when multiple packages have the same function name

Using the subset function

We worked with Boolean phrases in the Beginner workshop. The subset function requires “x”, an input data frame, and a subset logical expression. We can specify that “Date” should be greater or equal to (>=) Jan 1, 2013.

?subset
## starting httpd help server ... done
nrow(Bow_data)
## [1] 38383
nrow(subset(x = Bow_data,subset = Date>="2013-01-01"))
## [1] 1826

Using ggplot

Now we’re working with a much more manageable chunk of data.

ggplot2::ggplot(data=subset(Bow_data,Date>="2013-01-01"))+ 
  geom_line(aes(x=Date,y=Value)) 

Now that the data is not so squished on the x-axis, we see the dramatic peak during the spring 2013 flood. Let’s add aesthetics to the plot like we did with plot. Add a y-axis label. Scale the y-axis by log10.

ggplot2::ggplot(data=subset(Bow_data,Date>="2013-01-01"))+ 
  geom_line(aes(x=Date,y=Value))+
  ylab(expression(paste("Flow (",m^3,"/s)")))

plot <- ggplot2::ggplot(data=subset(Bow_data,Date>="2013-01-01"))+ 
  geom_line(aes(x=Date,y=Value))+
  scale_y_log10()+
  ylab(expression(paste("Flow (",m^3,"/s)")))+
  ggtitle("Bow River Flows at Calgary")
plot

## Saving plots from ggplot

When we saved a plot in the first workshop, we had to use two functions: png and dev.off to create a blank image and then “write” the plot to that file. The method for saving a ggplot is much easier! We can simply save the plot as an object and use a function to write to file. We can specify within the ggsave function the plot, filepath, dimensions and resolution.

setwd("C:/Users/Tyler/OneDrive/Software/R_Projects/workshops/2020_forWater_Rworkshop")
ggsave(filename = "output/BowRiverFlow2.png",plot = plot,width = 5,height = 4,units = "in",dpi=200)

More data subdivisions

In order to further explore this data, we need to extract some more information from the Date field. We can add a new column to denote the Year, Month, and Day for each Date.

Bow_data_Date=as.POSIXlt(Bow_data$Date)

Bow_data$Year=Bow_data_Date$year+1900
Bow_data$Month=Bow_data_Date$mo+1
Bow_data$yday=Bow_data_Date$yday+1

The as.POSIXlt function is a weird one, but as you can see we can use the $ operator to extract values from the Dates we fed in. Frustratingly, we also need to account for where the origin date is, and add the year 1900 to all years and 1 to months and Julian Day.

With a new geom, geom_boxplot, we can look at the annual hydrograph. Note that we can’t just specify Month on the x-axis alone. Like our date issue before, ggplot doesn’t know exactly how we want to see Month represented. Is it a continuous or discreet variable? To solve this, we can specify Month as a factor. Because the data is sorted from lowest to highest date, Month 10 always comes after 9, and as.factor will thus interpret 10 as greater than 9 despite the first digit being 1.

ggplot2::ggplot(data=Bow_data)+ 
  geom_boxplot(aes(x=Month,group=Month,y=Value)) 

ggplot2::ggplot(data=Bow_data)+ 
  geom_boxplot(aes(x=as.character(Month),y=Value)) 

ggplot2::ggplot(data=Bow_data)+ 
  geom_boxplot(aes(x=as.factor(Month),y=Value)) 

ggplot2::ggplot(data=Bow_data)+ 
  geom_boxplot(aes(x=as.factor(Month),y=Value),outlier.shape = ".") 

Generally, we want to specify data types within the dataframe itself.

Bow_data$Month=factor(Bow_data$Month,levels=1:12)

ggplot2::ggplot(data=Bow_data)+ 
  geom_boxplot(aes(x=Month,y=Value),outlier.shape = ".")+
  ylab(expression(paste("Daily Flow (",m^3,"/s)")))+
  ggtitle("Bow River Flows at Calgary")+
  scale_y_log10(limits=c(10,2000))

We can easily add more geom objects to a plot.

ggplot(data=subset(Bow_data,Year=="2013"))+ 
  geom_line(aes(x=yday,y=Value))+ 
  geom_point(aes(x=yday,y=Value)) 

ggplot(data=Bow_data)+ 
  geom_smooth(aes(x=yday,y=Value))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot()+ 
  geom_smooth(data=Bow_data,aes(x=yday,y=Value))+
  geom_line(data=subset(Bow_data,Year=="2013"),aes(x=yday,y=Value))+
  scale_y_log10()+
  ylab(expression(paste("Daily Flow (",m^3,"/s)")))+
  xlab("Julian Date")+
  ggtitle("Bow River Flows at Calgary, 2013")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Admittedly, I think the out-of-the-box ggplot is less nice than base-plot. So why do we advocate for it so heavily?? The coding of ggplot is strange looking at first, but the plus “+” functionality means we can add many more elements to a plot with ease.

Let’s first try to imitate R’s base-plot.

plot2013=ggplot(data=subset(Bow_data,Year=="2013"),
                aes(x=Date,y=Value)
                )+ 
  geom_line(col="gray")+ 
  geom_point()+
  scale_y_log10()+
  ylab(expression(paste("Flow (",m^3,"/s)")))+
  ggtitle("Bow River at Cochrane\nStream Flow, 2013")+
  theme(
    panel.background = element_rect(fill="white",colour = 1), # 1 is black by default
    #Note: R can use the European spelling of colour and color interchangeably
    panel.grid = element_blank(),
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank()
  )
plot2013

setwd("C:/Users/Tyler/OneDrive/Software/R_Projects/workshops/2020_forWater_Rworkshop")
ggsave(filename = "output/BowRiverFlow_2013.png",plot = plot2013,width = 5,height = 4,units = "in",dpi=200)

So imitating base-plot took a lot more code. Some of these elements look similar to base-plot in their functionality: ylab, xlab, ggtitle. theme() is the most challenging, and versatile, of elements within ggplot. While we add all plot elements to the plot with “+”, inside theme(), all elements are again separated by commas “,”, like in geom_point().

Now we can add much smaller amounts of code to dramatically change the data visualization.

plot2013+
  geom_point(aes(x=Date,y=Value,col=Value))+
  scale_color_gradientn(colors=rev(rainbow(5)))+
  theme(legend.position = "n")

Notice, ggplot allows us to create objects that contain our plots, here plot2013. These objects are called grobs, and are exclusive to ggplot. By adding more lines of code to the grob, it re-creates it with the new elements either on top or adding to the theme.

One more thing about this most recent plot: Notice how we specified col=Low.temperature. Not only are x and y variables, but ggplot can map multiple aspects of the data onto various plot-able features, such as color, shape, fill, size, or line type. This makes ggplot extremely useful for colorful and rich data visualizations. Let’s explore some more possibilities.

Creative Data Visualizations

We’ve seen the pattern of flows in just one year. Let’s visualize ALL the data.

ggplot(data=Bow_data)+ 
  geom_point(aes(x=yday,y=Value),col="gray")+
  geom_smooth(aes(x=yday,y=Value))+  
  ylab(expression(paste("Flow (",m^3,"/s)")))+
  ggtitle("Bow River at Cochrane")+
  theme(panel.background = element_rect(fill="white",colour = 1), 
    panel.grid = element_blank(),plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank())
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

sqcolorplot = ggplot(data=Bow_data)+
  geom_tile(aes(x=yday,y=Year,fill=log10(Value)),col="transparent")+
  scale_fill_gradientn(
    name=expression(paste("Flow (",m^3,"/s)")),
    breaks=1:3,labels=10^c(1:3),
    colors=rev(rainbow(5)),na.value = "white"
  )+
  ggtitle("Bow River at Cochrane")+
  xlab("Julian Day")+ylab("Year")+
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5))
sqcolorplot

setwd("C:/Users/Tyler/OneDrive/Software/R_Projects/workshops/2020_forWater_Rworkshop")
ggsave(filename = "output/BowRiverFlow3.png",plot = sqcolorplot,width = 5,height = 4,units = "in",dpi=200)
Bow_data$theta=-2*pi*Bow_data$yday/365+pi/2
Bow_data$radius=(Bow_data$Year+Bow_data$yday/365-1910)/100
Bow_data$x=cos(Bow_data$theta)*Bow_data$radius
Bow_data$y=sin(Bow_data$theta)*Bow_data$radius
labels=data.frame(x=c(0,1.05,0,-1.1),y=c(1.1,0,-1.05,0),
                  lab=c("Jan","Apr","July","Oct"))

circolorplot = ggplot(data=Bow_data)+
  geom_point(aes(x=x,y=y,col=log10(Value)),size=0.5)+
  geom_text(data=labels,aes(x=x,y=y,label=lab))+
  coord_equal(ratio=1)+
  scale_color_gradientn(
    name=expression(paste("Flow (",m^3,"/s)")),
    breaks=1:3,labels=10^c(1:3),
    colors=rev(rainbow(5)),na.value = "white"
  )+
  ggtitle("Bow River at Cochrane")+
  theme(
    panel.background = element_blank(),panel.grid = element_blank(),
    plot.title = element_text(hjust = 0.5),axis.title = element_blank(),
    axis.text = element_blank(),axis.ticks = element_blank()
  )
setwd("C:/Users/Tyler/OneDrive/Software/R_Projects/workshops/2020_forWater_Rworkshop")
ggsave(filename = "output/BowRiverFlow4.png",plot = circolorplot,width = 5,height = 4,units = "in",dpi=200)
circolorplot

Using facet_wrap

There appears to be some thrend showing increased flows in winter. Before 1950, flows were lower (bluer, cool colors) in winter and there were more large floods in spring (hot colors). After 1950, winter flows had increased and summer peaks were lower. Maybe a dam was installed to curb flooding and supplement low flows? Let’s investigate trends by month.

monthplot = ggplot(data=Bow_data,aes(x=Year,y=Value))+
  geom_point(col="gray")+geom_smooth(se=F)+
  facet_wrap(~Month,scales="free_y")+
  ggtitle("Bow River at Cochrane")+
  scale_x_continuous(breaks=c(1920,1970,2020))+
  xlab("Year")+
  ylab(expression(paste("Flow (",m^3,"/s)")))+
  theme(plot.title = element_text(hjust = 0.5))
setwd("C:/Users/Tyler/OneDrive/Software/R_Projects/workshops/2020_forWater_Rworkshop")
ggsave(filename = "output/BowRiverFlow5.png",plot = monthplot,width = 5,height = 4,units = "in",dpi=200)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
monthplot
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'