Beginner R Tutorial

This beginner R tutorial covers basic R syntax, installing and exploring packages, handling data frames, reading data from a csv file, and plotting.

P.S. This document was generated using R Markdown. This is a great tool for code transparency and data analysis, because the code blocks, code outputs, and your comments are “knitted” into a single document! Ask us more about this.

Basic R Syntax

Basic Arithmetic

Let’s try some basic math to show that R can be used as an overly fancy calculator. We see common operators like addition, multiplication, and division. In a script in RStudio, you can place your cursor on a line and press Ctrl+Enter to run that one line (Cmd+Enter on a Mac). You can also highlight multiple lines, or select everything with Ctrl+A, and then press Ctrl+Enter. The Run button in the upper right of the script pane will run the current line or the highlighted lines as well.

Note: when you see the pound sign (#) or “hashtag” in a line of code, what follows to the right is a text comment that does not affect the code. Commenting is an important part of remembering what complex lines of code are meant to do!

1+1
## [1] 2
2*2
## [1] 4
4/3
## [1] 1.333333
5 %% 2 # modulus operator; remainder after division is returned by the function
## [1] 1
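
A few more operators follow the same pattern. The short sketch below is an addition to the original session; it shows exponentiation, integer division, and how parentheses change the order of operations.

```r
2^3          # exponentiation
## [1] 8
7 %/% 2      # integer division: how many whole times 2 fits into 7
## [1] 3
1 + 2 * 3    # multiplication is done before addition
## [1] 7
(1 + 2) * 3  # parentheses force the addition first
## [1] 9
```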

If you’re completely new to coding or “scripting”, you may wonder what the main gist of R is. Like C++, Java, Python, and Matlab, R supports object-oriented programming, and everything you work with in R is an object. For our purposes, that means we will be storing data within objects of various shapes, sizes, and properties, and using functions to operate on those objects to perform analyses.

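Let’s explore R’s simplest object type, a single value, which this tutorial calls an atom (strictly speaking, R stores it as a vector of length one). R is dynamically typed: you do not declare a variable’s data type, and you can re-assign a value of a different type to the same variable at your whim.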

One final note. The arrow you see below (<-) assigns the value at the tail of the arrow to the object at its point. The arrow is directional, so it can also point to the right (->), which comes in handy, but you will often see the equal sign (=) used as a left-only equivalent.

a <- 1
a
## [1] 1
a = 1
a # see that <- and = gave the same result here
## [1] 1
5 -> a
a
## [1] 5
b <- a+1 # This is our first example of assigning the result of an operation to a new object
b
## [1] 6

Intro to Functions

You’ve seen data stored in objects. So how can we do more than just perform arithmetic on these objects? R operates by loading and creating objects and modifying them, or parts of them, with functions. Functions have the following format: function(x), where x is an object, and “running” this function will produce some output.

To understand how functions work, let’s first create our own. Creating your own custom functions is a way to streamline your research and perform ever-faster analyses. We’ll work with R’s built-in functions in a moment.

Here is a function that we will name “squared”. The input is some value, x, and the output is x squared (x^2).

squared=function(x){x^2}
squared(3)
## [1] 9
squared(a)
## [1] 25

Note: The curly braces “{ }” surround the code performed by a function (they can be left out only when the body is a single expression). They cannot be interchanged with the parentheses “( )”. Parentheses are used to define the input to a function, but they can also be used to specify order of operations in arithmetic or complex calculations.

Recall that a had a value of 5, so the function returns a value of 25. Like we did with arithmetic operations, we can assign the output of a function to a new object.

c=squared(b)
c
## [1] 36
# optional parameter
exponentit <- function(a,b=2) {
  a^b
}
exponentit(3)
## [1] 9
exponentit(3,3)
## [1] 27
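
Arguments can also be matched by name rather than by position. The sketch below repeats the exponentit definition from above so that it runs on its own; the specific input values are only illustrative.

```r
# same definition as above, repeated so this block is self-contained
exponentit <- function(a, b = 2) {
  a^b
}
exponentit(a = 2, b = 5)  # arguments matched by name
## [1] 32
exponentit(b = 5, a = 2)  # order does not matter when names are given
## [1] 32
exponentit(5, 2)          # without names, position decides: 5^2
## [1] 25
```

Naming arguments makes a call easier to read once a function has more than one or two inputs.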

Data Exploration

# what class of variable is it?
class(b)
## [1] "numeric"
# what 'typeof' value is b?
typeof(b) # 'double' is a type of number (numeric value)
## [1] "double"
b <- "hello, world"
b
## [1] "hello, world"
# this has now changed from numeric/double
typeof(b)
## [1] "character"
# convert between data types
as.numeric("501") +1
## [1] 502
round(5.6) # round to integer
## [1] 6
floor(5.6) # round down to integer
## [1] 5
# as.numeric(b) # returns NA with a warning, because "hello, world" is not a number
# as.character(a) + "_" # will produce an error
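
When text cannot be read as a number, as.numeric() returns NA. A minimal sketch of how one might check for that, using a made-up variable name (not_a_number) and suppressWarnings() to keep the output quiet:

```r
not_a_number <- "hello, world"   # illustrative value, not from the session above
converted <- suppressWarnings(as.numeric(not_a_number))
converted
## [1] NA
is.na(converted)                 # TRUE means the conversion failed
## [1] TRUE
is.numeric("501")                # a number inside quotes is still text
## [1] FALSE
is.numeric(as.numeric("501"))    # after conversion it is numeric
## [1] TRUE
```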

Now that we’ve started re-assigning values to variables, it’s useful to recall their new values. On the upper-right portion of your screen in RStudio, the Environment tab shows the values of all user-defined variables, and it even reminds us of our user-defined functions.

Working with Text

R has extensive tools to read, write, and modify strings of text. Here are just two examples. We will come back to working with text once we start generating labels for plots.

# write a statement with any format - substitute values from any type
paste0(as.character(a)," coconuts in my pocket")
## [1] "5 coconuts in my pocket"
sprintf("I have $%.2f in my pocket, that is %s",a,"cool")
## [1] "I have $5.00 in my pocket, that is cool"
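
A few other base R text functions are worth knowing. This is a small sketch with made-up inputs (the variable n_coconuts is hypothetical):

```r
n_coconuts <- 5                    # illustrative value
paste(n_coconuts, "coconuts")      # paste() inserts a space by default
## [1] "5 coconuts"
paste("Bow", "River", sep = "_")   # choose your own separator
## [1] "Bow_River"
toupper("flow")                    # change to upper case
## [1] "FLOW"
nchar("hello, world")              # count characters, including the space and comma
## [1] 12
```

paste0() used above is simply paste() with no separator.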

Vectors and Matrices

Recall that a single-valued variable, such as the ones above, is called an atom. Multiple atoms come together to form a one-dimensional sequence of values, called a vector. Multiple vectors form a matrix or data frame.

c() is the concatenate function, taking multiple objects separated by commas and creating a new vector. We saw above how we can use addition or multiplication on an atom. We can also perform a single operation on all values in a vector. Let’s apply arithmetic and our custom “squared” function to the new vector a.

a <- c(1,2,3,4,5)
a
## [1] 1 2 3 4 5
b <- a/2
b
## [1] 0.5 1.0 1.5 2.0 2.5
b=squared(a)
b
## [1]  1  4  9 16 25
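
Operations between two vectors of the same length also work element by element, and many built-in functions summarize a whole vector at once. A short sketch with two made-up vectors, v and w:

```r
v <- c(1, 2, 3, 4, 5)
w <- c(10, 20, 30, 40, 50)
v + w      # first with first, second with second, and so on
## [1] 11 22 33 44 55
sum(v)     # add up all values
## [1] 15
mean(v)    # average of all values
## [1] 3
```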

Getting Help on Commands

You’ll see below several functions that are new to you. A useful tool is the Help database built into R and RStudio. For a function such as matrix, look up its documentation. The help() function can show you the Description and Usage of a function. Putting “?” before a function name does the same. If you don’t quite know the name of a function, two question marks (“??”) followed by a keyword will search for all topics related to that keyword. Finally, if you place your cursor within the function name and press F1 on your keyboard, the documentation will again come up.

rep(1,5)
## [1] 1 1 1 1 1
seq(1,4)
## [1] 1 2 3 4
# get help on a command, like the replicate function 'rep()'
help(rep)
## starting httpd help server ... done
?rep
?seq

# search for functions by name or keyword
??variance

# see R code directly
var
## function (x, y = NULL, na.rm = FALSE, use) 
## {
##     if (missing(use)) 
##         use <- if (na.rm) 
##             "na.or.complete"
##         else "everything"
##     na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs", 
##         "everything", "na.or.complete"))
##     if (is.na(na.method)) 
##         stop("invalid 'use' argument")
##     if (is.data.frame(x)) 
##         x <- as.matrix(x)
##     else stopifnot(is.atomic(x))
##     if (is.data.frame(y)) 
##         y <- as.matrix(y)
##     else stopifnot(is.atomic(y))
##     .Call(C_cov, x, y, na.method, FALSE)
## }
## <bytecode: 0x0000000014db48e8>
## <environment: namespace:stats>
sqrt
## function (x)  .Primitive("sqrt")

For more resources, check out the CRAN website (https://cran.r-project.org/) for information, documentation, etc. StackOverflow and similar forums are also extremely helpful for R!

Use help to figure out the arguments to the functions “rep” and “seq”.

c(a,rep(1,5))
##  [1] 1 2 3 4 5 1 1 1 1 1
seq(1,10,by=2)
## [1] 1 3 5 7 9
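
The help pages list further arguments for both functions. The sketch below tries a few of them (times, each, length.out, and a negative by); the input values are only illustrative.

```r
rep(c(1, 2), times = 3)     # repeat the whole vector three times
## [1] 1 2 1 2 1 2
rep(c(1, 2), each = 2)      # repeat each value twice
## [1] 1 1 2 2
seq(0, 1, length.out = 5)   # five evenly spaced values from 0 to 1
## [1] 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)         # count downwards with a negative step
## [1] 10  7  4  1
```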

Subsetting Vectors

Now that we’ve started working with objects of increasing size, it’s important to know how to extract specific values. To do this, we use the square brackets “[ ]”. They are not to be confused with () or {}, which each have their own special uses, as discussed above. A useful tool when examining large objects is the length() function.

# sequence from 1 to 30, jumping by a value of 2 (instead of the default 1)
a <- seq(1,30,by=2)
a
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29
length(a)
## [1] 15
# subset by location
a[1] # first value in the vector
## [1] 1
a[c(1,2)] # first and second value in the vector
## [1] 1 3
a[c(1,2,3)] # note the use of c() to specify multiple values
## [1] 1 3 5
1:3 # the colon ":" is a useful operator that works like seq() with by=1
## [1] 1 2 3
a[1:3] 
## [1] 1 3 5
a[-1]  # all values without the first one
##  [1]  3  5  7  9 11 13 15 17 19 21 23 25 27 29
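
length() combines nicely with subsetting, and head() and tail() are shortcuts for the ends of a vector. A small sketch, using a copy of the same sequence under the made-up name odds so that it runs on its own:

```r
odds <- seq(1, 30, by = 2)   # same values as a above
odds[length(odds)]           # last value, whatever the length is
## [1] 29
head(odds, 3)                # first three values
## [1] 1 3 5
tail(odds, 2)                # last two values
## [1] 27 29
which(odds == 7)             # position of the value 7
## [1] 4
```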

Using Booleans

Boolean (comparison) operators include greater than (>), less than (<), and equal to (==). Other operators are greater than or equal to (>=), less than or equal to (<=), and not equal to (!=). Each comparison returns TRUE or FALSE.

1>3
## [1] FALSE
1<3
## [1] TRUE
1<=3 # less than or equal to
## [1] TRUE
1!=3
## [1] TRUE
1==1
## [1] TRUE

Note: <= looks very similar to <- but they are nothing alike.

We can use Booleans to perform subsetting operations.

a>5 # which values are greater than 5?
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE
a[a>5] # same test as above, but returns the values of a that meet the criteria
##  [1]  7  9 11 13 15 17 19 21 23 25 27 29
a[a %% 2 > 0]
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29
a[a %% 5 > 0] 
##  [1]  1  3  7  9 11 13 17 19 21 23 27 29
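
Conditions can be combined with & (and) and | (or). The following sketch again uses a stand-alone copy of the sequence, named odds here for illustration:

```r
odds <- seq(1, 30, by = 2)
odds[odds > 5 & odds < 15]   # both conditions must be TRUE
## [1]  7  9 11 13
odds[odds < 5 | odds > 25]   # either condition may be TRUE
## [1]  1  3 27 29
sum(odds > 5)                # TRUE counts as 1, so this counts the matches
## [1] 12
which(odds > 25)             # positions (not values) that meet the condition
## [1] 14 15
```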

2D Data: Matrices and Data Frames

Now that we’ve worked with 1-Dimensional vectors, let’s build up to two dimensions.

Notice the new function “dim()”, which gives us the dimensions of the object, like “length()” did for vectors.

a=1:3
b=squared(a)
c=data.frame(a,b)
c
typeof(c)
## [1] "list"
dim(c) # returns rows, then columns
## [1] 3 2
# d <- matrix(fill value for matrix, nrow = number of rows, ncol = number of columns)
d <- matrix(NA,nrow=4,ncol=4)
d
##      [,1] [,2] [,3] [,4]
## [1,]   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA
typeof(d)
## [1] "logical"
dim(d)
## [1] 4 4

Here we’ve built a data frame, “c”, with columns “a” and “b”. “d” is an empty matrix, with 4 rows and 4 columns. While both data types are two-dimensional, matrices are easier to operate on across both dimensions. Data frames are mostly used when you are operating across columns.

Subsetting both of these data types works the same. Like with vectors, we use the square brackets “[ ]”, but instead of one value, we list the desired row and column, separated by a comma.

c[2,2]
## [1] 4
d[2,2]
## [1] NA
c[2,] # leaving an entry blank will return the entire row or column specified
c[,2]
## [1] 1 4 9
d[2,]
## [1] NA NA NA NA
d[,2] 
## [1] NA NA NA NA

An important distinction: while “calling” a row of a data frame will return a mini data frame with just the one row, a row of a matrix will return a vector. This shows the flexibility of matrices.
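
This difference can be checked with class(). The sketch below builds its own small data frame (df) and matrix (m), both hypothetical names, and also shows the drop = FALSE argument, which keeps the matrix shape when a single row is extracted.

```r
df <- data.frame(a = 1:3, b = (1:3)^2)   # same shape as the data frame above
m  <- matrix(1:6, nrow = 3)
class(df[2, ])              # a row of a data frame is still a data frame
## [1] "data.frame"
class(m[2, ])               # a row of a matrix is a plain vector
## [1] "integer"
dim(m[2, , drop = FALSE])   # drop = FALSE keeps it as a 1 x 2 matrix
## [1] 1 2
class(df[, 2])              # a single column of a data frame is a vector
## [1] "numeric"
```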

d[,2] <- rep(2,4)
d[3,] <- seq(1,4)
d
##      [,1] [,2] [,3] [,4]
## [1,]   NA    2   NA   NA
## [2,]   NA    2   NA   NA
## [3,]    1    2    3    4
## [4,]   NA    2   NA   NA
d[3:4,1:2] # subset across multiple rows, columns
##      [,1] [,2]
## [1,]    1    2
## [2,]   NA    2
# sequence built with the seq() function
seq1 <- seq(1, 6)
# fill a matrix with 2 rows; values are placed column by column
a <- matrix(seq1, nrow = 2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Data frames have an additional method of subsetting, by using the names of the columns within them. The “$” sign can be used to directly call the named column.

names(c)
## [1] "a" "b"
c$a
## [1] 1 2 3
c[,"a"]
## [1] 1 2 3
c[,"b"]
## [1] 1 4 9
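
The “$” sign can also create a column. A minimal sketch, using a fresh copy of the data frame under the made-up name df and a hypothetical new column called ratio:

```r
df <- data.frame(a = 1:3, b = (1:3)^2)
df$ratio <- df$b / df$a   # build a new column from existing ones
names(df)
## [1] "a"     "b"     "ratio"
df$ratio
## [1] 1 2 3
df[["b"]]                 # double brackets with a name also return the column
## [1] 1 4 9
```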

Running Commands

a <- rnorm(n = 100,mean=0,sd=2) # draw 100 random values from a normal distribution (your numbers will differ)
mean(a)
## [1] 0.1189092
var(a)
## [1] 4.200929
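
Because rnorm() draws random numbers, the mean and variance change on every run. If you need repeatable results, one common approach is to set the random seed first. In this sketch the seed value 42 and the names x1 and x2 are arbitrary choices:

```r
set.seed(42)                            # any integer works as a seed
x1 <- rnorm(n = 100, mean = 0, sd = 2)
set.seed(42)                            # resetting the seed replays the same draws
x2 <- rnorm(n = 100, mean = 0, sd = 2)
identical(x1, x2)
## [1] TRUE
all.equal(sd(x1), sqrt(var(x1)))        # standard deviation is the square root of variance
## [1] TRUE
```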

Explore More Packages!

R has thousands of packages to offer, which is one of the main advantages of getting to know R. This section will show you how to easily install and get to know new packages.

Installing and exploring a new package

# install ggplot2 with dependencies from CRAN (internet connection required)
install.packages("ggplot2")

# load the library into your R session
library(ggplot2)

# explore the package contents
head(ls("package:ggplot2"),10)
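
Installing a package every time a script runs is slow, so it can help to check whether it is already there. The helper below, is_installed, is a hypothetical name introduced for this sketch; it wraps base R’s requireNamespace().

```r
# hypothetical helper: TRUE if a package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}
is_installed("stats")   # the stats package ships with R
## [1] TRUE
# install only when needed (left commented out; needs an internet connection)
# if (!is_installed("ggplot2")) install.packages("ggplot2")
```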

Handling Data

library(ggplot2)
data(diamonds)  # load data frame from ggplot2 package
head(diamonds)  # preview data
class(diamonds)   # class of an object
## [1] "tbl_df"     "tbl"        "data.frame"
typeof(diamonds)  # type of object
## [1] "list"
names(diamonds)   # names inside the object
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
str(diamonds)   # view structure of diamonds
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds) # summarize dataset
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 
nrow(diamonds)  # number of rows (lots!)
## [1] 53940
ncol(diamonds)  # number of columns
## [1] 10
# subset particular data from data frame
head(diamonds$carat)
## [1] 0.23 0.21 0.23 0.29 0.31 0.24
mean(diamonds$price)
## [1] 3932.8
mydb <- diamonds[diamonds$price > 3000,]

Reading in data from csv file

We’ll work with data from a very common data source: the Water Survey of Canada. First, let’s see what folder we are in and what files there are.

getwd()
list.files()
Bow_data <- read.csv(file = "input/AB05BH004_BowRiver_FlowQ.csv")
head(Bow_data)
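
If you do not have the input file at hand, you can still practice the same steps. This sketch writes a tiny made-up table to a temporary csv file and reads it back; the values and the names demo, demo_file, and demo_in are invented for illustration and are not Water Survey of Canada data.

```r
# a tiny made-up table (illustrative values only)
demo <- data.frame(
  Date  = c("2020-01-01", "2020-01-02", "2020-01-03"),
  Value = c(10.5, NA, 12.0)
)
demo_file <- tempfile(fileext = ".csv")   # temporary file path
write.csv(demo, demo_file, row.names = FALSE)
demo_in <- read.csv(file = demo_file, stringsAsFactors = FALSE)
dim(demo_in)
## [1] 3 2
class(demo_in$Date)      # dates arrive as plain text
## [1] "character"
is.na(demo_in$Value)     # missing values are read back as NA
## [1] FALSE  TRUE FALSE
```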

Note, the standard unit for Flow from WSC is m^3/s. We can find more information here:

browseURL("https://wateroffice.ec.gc.ca/report/real_time_e.html?stn=05BH004")

Explore this dataset

We’ve used the head function already to see that our dataset includes a station number, date, and values for stream flow. In the next tutorial we’ll talk about wide and tall data and why the format we see in front of us is a very efficient way to store and read data. Use the str function to examine the structure of this dataset.

str(Bow_data)
## 'data.frame':    38383 obs. of  6 variables:
##  $ X             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ STATION_NUMBER: chr  "05BH004" "05BH004" "05BH004" "05BH004" ...
##  $ Date          : chr  "1911-01-01" "1911-01-02" "1911-01-03" "1911-01-04" ...
##  $ Parameter     : chr  "Flow" "Flow" "Flow" "Flow" ...
##  $ Value         : num  19.5 18.1 17 17 NA ...
##  $ Symbol        : chr  "B" "B" "B" "B" ...

The “Value” column holds the only measured numeric data (the integer “X” column appears to be just a row index). As we’ll see in the next section though, we need more information to plot a timeseries of Flow.
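
Notice the NA in the “Value” column: missing values affect summary functions. The sketch below copies the first five values as str() displays them (so they may be rounded) into a made-up vector named flow.

```r
flow <- c(19.5, 18.1, 17, 17, NA)   # first values as displayed by str()
mean(flow)                          # any NA makes the result NA
## [1] NA
mean(flow, na.rm = TRUE)            # na.rm = TRUE drops missing values first
## [1] 17.9
sum(is.na(flow))                    # count the missing values
## [1] 1
```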

Explore the plot function

What fun is data if you can’t visualize it? Make a plot of the daily streamflow data.

P.S. your data is now in a data frame, produced by the read.csv function, called ‘Bow_data’. So it can be accessed with the $ character.

plot(Bow_data$Value)

Using the plot function did not give us the best image, but from it we can see that there are about 38,000 data values (we also learned this from str), that values range from 0 to about 1,500, and that they cluster largely at low values. This agrees with what we might expect for a timeseries of streamflow.

Generally to make a plot with x- and y-axes we need two vectors. In the first plot, the data index, or the row number, became the x axis. We could see if we can put the “Date” column on the x axis.

class(Bow_data$Date)
## [1] "character"

This doesn’t work … The column is stored as text, so we need to redefine the class of “Date”.

#Bow_data$Date=as.Date(Bow_data$Date)
Bow_data$Date=as.Date(Bow_data$Date,format="%Y-%m-%d")

typeof(Bow_data$Date)
## [1] "double"
class(Bow_data$Date)
## [1] "Date"

With the as.Date function, we can tell R to recognize what was originally a character vector as a Date. With the format argument, we can explicitly say that the first value is the year, separated by a dash from the month and then from the day.
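
The format argument handles other layouts too, and Dates support arithmetic because R stores them as a number of days (which is why typeof() reports "double"). A short sketch with made-up date strings and the hypothetical names d1 and d2:

```r
d1 <- as.Date("1911-01-05", format = "%Y-%m-%d")
d2 <- as.Date("05/01/1911", format = "%d/%m/%Y")   # day/month/year text
d1 == d2                            # same day, different text layout
## [1] TRUE
format(d1, "%Y")                    # pull out the year as text
## [1] "1911"
d1 + 30                             # date arithmetic works in days
## [1] "1911-02-04"
as.numeric(as.Date("1970-01-02"))   # days since 1970-01-01
## [1] 1
```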

Let’s return to the plot function, and specify the axis vectors, and change some plot characteristics!

# plot with lines and better axis labels
plot(
    y = Bow_data$Value, 
    x = Bow_data$Date,
    type='l', # specifies a line graph
    las=1, # specifies the orientation of axis text
    ylab=expression(paste("Flow (",m^3,"/s)")),
    xlab=NA,
    main="Bow River Flows at Calgary"
)

There are many ways to add special characters to plots. The “expression(paste())” formulation is one.

Feel free to check out the plotting options. Try searching help on the plot function to bedazzle your plot, for example with the col argument, which changes the colour of the plotted data.

Want to output your plot to an image file? Try this out.

png(filename = "output/BowRiverFlow1.png") # opens a png file to draw into
# everything you plot here goes into your file
plot(
    y = Bow_data$Value, 
    x = Bow_data$Date,
    type='l', # specifies a line graph
    las=1, # specifies the orientation of axis text
    ylab=expression(paste("Flow (",m^3,"/s)")),
    xlab=NA,
    main="Bow River Flows at Calgary"
)
dev.off() # close the file so the plot is written to disk

You should see an image file (.png) in the output folder inside your working directory (i.e. the one shown by the getwd() command). Ta-daaa!

list.files(path="output")
## [1] "BowRiverFlow1.png"
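
The same pattern works for any plot. This sketch writes a simple made-up curve to a temporary folder, sets the image size in pixels, and uses the col argument; the file name demo_plot.png and the variable out_file are hypothetical.

```r
out_file <- file.path(tempdir(), "demo_plot.png")     # temporary location
png(filename = out_file, width = 800, height = 500)   # size in pixels
plot(
    x = 1:10,
    y = (1:10)^2,
    type = 'l',     # line graph
    col = "blue",   # colour of the plotted line
    las = 1,
    xlab = "x",
    ylab = "x squared"
)
invisible(dev.off())   # close the file; invisible() hides the device message
file.exists(out_file)
## [1] TRUE
```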