1 Introduction

1.1 Libraries we will use

Load packages:

library(tidyverse)
library(lubridate)

1.2 Lecture overview

The programming unit will introduce you to tools that tell your computer to do the same or similar things repeatedly, without your having to write the same code repeatedly (i.e., iteration). The code you write to repeat a task can also behave differently depending on conditions in the data or on options you specify.

Paraphrasing Will Doyle:

“Computers love to do the same thing over and over. It’s their favorite thing to do. Learn to make your computer happy.”

The 3 core foci of this unit are:

  • Iteration (loops)
  • Conditionals (if, if/else)
  • Functions

But more than learning these things, this unit is about developing a more formal, rigorous understanding of programming concepts so that you can become a more powerful programmer. Towards that end, we will be reading chapters from Wickham’s free textbook Advanced R.

In fact, please spend 10 minutes reading Chapter 1 (sections 1.1 through 1.5).

2 Foundational concepts

See here for a review of data structures and types.

2.1 Subsetting elements

What is subsetting?

  • Subsetting refers to isolating particular elements of an object
  • Subsetting operators can be used to select/exclude elements (e.g., variables, observations)
  • There are three subsetting operators: [], [[]], $
  • These operators function differently based on vector types (e.g., atomic vectors, lists, dataframes)


For the examples in the next few subsections, we will be working with the following named atomic vector, named list, and dataframe:

  • Create named atomic vector called v with 4 elements

    v <- c(a = 10, b = 20, c = 30, d = 40)
    v
    #>  a  b  c  d 
    #> 10 20 30 40
  • Create named list called l with 4 elements

    l <- list(a = TRUE, b = c("a", "b", "c"), c = list(1, 2), d = 10L)
    l
    #> $a
    #> [1] TRUE
    #> 
    #> $b
    #> [1] "a" "b" "c"
    #> 
    #> $c
    #> $c[[1]]
    #> [1] 1
    #> 
    #> $c[[2]]
    #> [1] 2
    #> 
    #> 
    #> $d
    #> [1] 10
  • Create dataframe called df with 4 columns and 3 rows

    df <- data.frame(
      a = c(11, 21, 31),
      b = c(12, 22, 32),
      c = c(13, 23, 33),
      d = c(14, 24, 34)
    )
    df
    #> # A tibble: 3 × 4
    #>       a     b     c     d
    #>   <dbl> <dbl> <dbl> <dbl>
    #> 1    11    12    13    14
    #> 2    21    22    23    24
    #> 3    31    32    33    34


2.1.1 Subsetting using []

The [] operator:

  • Subsetting an object using [] returns an object of the same type
    • E.g., Using [] on an atomic vector returns an atomic vector, using [] on a list returns a list, etc.
  • The returned object will contain the element(s) you selected
  • Object attributes are retained when using [] (e.g., name attribute)

Six ways to subset using []:

  1. Use positive integers to return elements at specified index positions
  2. Use negative integers to exclude elements at specified index positions
  3. Use logical vectors to return elements where corresponding logical is TRUE
  4. Empty vector [] returns original object (useful for dataframes)
  5. Zero vector [0] returns empty object (useful for testing data)
  6. If object is named, use character vectors to return elements with matching names

Example: Using positive integers with []

Selecting a single element: Specify the index of the element to subset

# Select 1st element from numeric vector (note that `names` attribute is retained)
v[1]
#>  a 
#> 10

# Subsetted object will be of type `numeric`
class(v[1])
#> [1] "numeric"

# Select 1st element from list (note that `names` attribute is retained)
l[1]
#> $a
#> [1] TRUE

# Subsetted object will be a `list` containing the element
class(l[1])
#> [1] "list"


Selecting multiple elements: Specify the indices of the elements to subset using c()

# Select 3rd and 1st elements from numeric vector
v[c(3,1)]
#>  c  a 
#> 30 10

# Subsetted object will be of type `numeric`
class(v[c(3,1)])
#> [1] "numeric"

# Select 1st element three times from list
l[c(1,1,1)]
#> $a
#> [1] TRUE
#> 
#> $a
#> [1] TRUE
#> 
#> $a
#> [1] TRUE

# Subsetted object will be a `list` containing the elements
class(l[c(1,1,1)])
#> [1] "list"

Example: Using negative integers with []

Excluding a single element: Specify the index of the element to exclude

# Exclude 1st element from numeric vector
v[-1]
#>  b  c  d 
#> 20 30 40

# Subsetted object will be of type `numeric`
class(v[-1])
#> [1] "numeric"


Excluding multiple elements: Specify the indices of the elements to exclude using -c()

# Exclude 1st and 3rd elements from list
l[-c(1,3)]
#> $b
#> [1] "a" "b" "c"
#> 
#> $d
#> [1] 10

# Subsetted object will be a `list` containing the remaining elements
class(l[-c(1,3)])
#> [1] "list"

Example: Using logical vectors with []

If the logical vector is the same length as the object, then each element in the object whose corresponding position in the logical vector is TRUE will be selected:

v
#>  a  b  c  d 
#> 10 20 30 40
# Select 2nd and 3rd elements from numeric vector
v[c(FALSE, TRUE, TRUE, FALSE)]
#>  b  c 
#> 20 30

# Subsetted object will be of type `numeric`
class(v[c(FALSE, TRUE, TRUE, FALSE)])
#> [1] "numeric"


If the logical vector is shorter than the object, then the elements in the logical vector will be recycled:

# This is equivalent to `l[c(FALSE, TRUE, FALSE, TRUE)]`, thus retaining 2nd and 4th elements
l[c(FALSE, TRUE)]
#> $b
#> [1] "a" "b" "c"
#> 
#> $d
#> [1] 10

# Subsetted object will be a `list` containing the elements
class(l[c(FALSE, TRUE)])
#> [1] "list"


We can also write expressions that evaluate to either TRUE or FALSE:

  • This is the sort of thing we do when we filter observations based on variable values (e.g., show me the observations where the statement var1>=30 is TRUE)

# `v > 20` evaluates to `c(FALSE, FALSE, TRUE, TRUE)`, so this is equivalent to `v[c(FALSE, FALSE, TRUE, TRUE)]`
v[v > 20]
#>  c  d 
#> 30 40
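
To make the step from expression to subset explicit, here is a minimal sketch that first prints the logical vector produced by the comparison and then combines two conditions. It recreates the vector v from above so that it runs on its own; the thresholds 10 and 40 are arbitrary values chosen for illustration.

```r
# Recreate the named atomic vector so this block is self-contained
v <- c(a = 10, b = 20, c = 30, d = 40)

# The comparison itself returns a named logical vector, one value per element
v > 20
#>     a     b     c     d 
#> FALSE FALSE  TRUE  TRUE

# Conditions can be combined with & (and) or | (or) before subsetting
v[v > 10 & v < 40]
#>  b  c 
#> 20 30
```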

Example: Using empty vector []

An empty vector [] just returns the original object:

# Original atomic vector
v[]
#>  a  b  c  d 
#> 10 20 30 40

# Original list
l[]
#> $a
#> [1] TRUE
#> 
#> $b
#> [1] "a" "b" "c"
#> 
#> $c
#> $c[[1]]
#> [1] 1
#> 
#> $c[[2]]
#> [1] 2
#> 
#> 
#> $d
#> [1] 10

# Original dataframe
df[]
#> # A tibble: 3 × 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    11    12    13    14
#> 2    21    22    23    24
#> 3    31    32    33    34

Example: Using zero vector [0]

A zero vector [0] just returns an empty object of the same type as the original object:

# Empty named atomic vector
v[0]
#> named numeric(0)

# Empty named list
l[0]
#> named list()

# Empty dataframe
df[0]
#> # A tibble: 3 × 0

Example: Using element names with []

We can select a single element or multiple elements by their name(s):

# Equivalent to v[2]
v["b"]
#>  b 
#> 20

# Equivalent to l[c(1, 3)]
l[c("a", "c")]
#> $a
#> [1] TRUE
#> 
#> $c
#> $c[[1]]
#> [1] 1
#> 
#> $c[[2]]
#> [1] 2


2.1.2 Subsetting using [[]]

The [[]] operator:

  • We can only use [[]] to extract a single element rather than multiple elements
  • Subsetting an object using [[]] returns the selected element itself, which might not be of the same type as the original object
    • E.g., Using [[]] to select a list element that is a numeric vector will return that numeric vector and not a list containing that numeric vector, like what [] would return
      • Let x be a list with 3 elements (Think of it as a train with 3 cars)
      • x[1] will be a list containing the 1st element, which is a numeric vector (i.e., train with the 1st car)
      • x[[1]] will be the numeric vector itself (i.e., the objects within the 1st car)
      • Source: Subsetting from R for Data Science
  • Object attributes are removed when using [[]]
    • E.g., Using [[]] on a named object returns just the selected element itself without the name attribute
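
The train analogy above can be sketched in code. The list x below is a hypothetical example (its contents are made up for illustration); the point is only the difference between what [] and [[]] return.

```r
# Hypothetical list with 3 elements (the "train" with 3 cars)
x <- list(c(1, 2, 3), "hello", TRUE)

# [] returns a shorter train: a list that holds the 1st car
str(x[1])
#> List of 1
#>  $ : num [1:3] 1 2 3

# [[]] returns the contents of the 1st car: the numeric vector itself
str(x[[1]])
#>  num [1:3] 1 2 3

# Because x[[1]] is a plain vector, we can subset it again
x[[1]][2]
#> [1] 2
```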


Two ways to subset using [[]]:

  1. Use a positive integer to return an element at the specified index position
  2. If object is named, use a character string to return an element with the specified name

Example: Using positive integer with [[]]
# Select 1st element from numeric vector (note that `names` attribute is gone)
v[[1]]
#> [1] 10

# Subsetted element is `numeric`
class(v[[1]])
#> [1] "numeric"

# Select 1st element from list (note that `names` attribute is gone)
l[[1]]
#> [1] TRUE

# Subsetted element is `logical`
class(l[[1]])
#> [1] "logical"

Example: Using element name with [[]]
# Equivalent to v[[2]]
v[["b"]]
#> [1] 20

# Subsetted element is `numeric`
class(v[["b"]])
#> [1] "numeric"

# Equivalent to l[[2]]
l[["b"]]
#> [1] "a" "b" "c"

# Subsetted element is `character` vector
class(l[["b"]])
#> [1] "character"


2.1.3 Subsetting using $

The $ operator:

  • obj_name$element_name is shorthand for obj_name[["element_name"]]
  • This operator only works on lists (including dataframes) and not on atomic vectors

Example: Subsetting with $

Subsetting a list with $:

# Equivalent to l[["b"]]
l$b
#> [1] "a" "b" "c"

# Subsetted element is `character` vector
class(l$b)
#> [1] "character"


Since dataframes are just a special kind of named list, it would work the same way:

# Equivalent to df[["d"]]
df$d
#> [1] 14 24 34
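
One practical difference between $ and [[]] matters once element names are stored in a variable, as they will be inside loops. The sketch below recreates the list l from above; the variable element_name is a hypothetical name introduced only for this illustration.

```r
# Recreate the named list so this block is self-contained
l <- list(a = TRUE, b = c("a", "b", "c"), c = list(1, 2), d = 10L)

# Hypothetical variable holding the name of the element we want
element_name <- "b"

# `$` looks for an element literally called "element_name", which does not exist
l$element_name
#> NULL

# `[[]]` evaluates the variable first, so it finds the element named "b"
l[[element_name]]
#> [1] "a" "b" "c"
```

So when the name of the element is stored in another object, [[]] is the operator to reach for.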


2.1.4 Subsetting dataframes

Subsetting dataframes with [], [[]], and $:

  • Subsetting dataframes works the same way as lists because dataframes are just a special kind of named list, where we can think of each element as a column
    • df_name[<column(s)>] returns a dataframe containing the selected column(s), with its attributes retained
    • df_name[[<column>]] or df_name$<column> returns the column itself, without any attributes
  • In addition to the normal way of subsetting, we are also allowed to subset dataframes by cell(s)
    • df_name[<row(s)>, <column(s)>] returns the selected cell(s)
      • If a single cell is selected, or cells from the same column, then these would be returned as an object of the same type as that column (similar to how [[]] normally works)
      • Otherwise, the subsetted object would be a dataframe, as we’d normally expect when using []
    • df_name[[<row>, <column>]] returns the selected cell
      • This is equivalent to selecting a single cell using df_name[<row(s)>, <column(s)>]

Example: Subsetting dataframe column(s) with []

We can subset dataframe column(s) the same way we have subsetted atomic vector or list element(s):

df
#> # A tibble: 3 × 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    11    12    13    14
#> 2    21    22    23    24
#> 3    31    32    33    34

# Select 1st column from dataframe (note that `names` attribute is retained)
df[1]
#> # A tibble: 3 × 1
#>       a
#>   <dbl>
#> 1    11
#> 2    21
#> 3    31

# Subsetted object will be a `data.frame` containing the column
class(df[1])
#> [1] "data.frame"

# Exclude 1st and 3rd columns from dataframe (note that `names` attribute is retained)
df[-c(1,3)]
#> # A tibble: 3 × 2
#>       b     d
#>   <dbl> <dbl>
#> 1    12    14
#> 2    22    24
#> 3    32    34

# Subsetted object will be a `data.frame` containing the remaining columns
class(df[-c(1,3)])
#> [1] "data.frame"

Example: Subsetting dataframe column with [[]] and $

We can select a single dataframe column the same way we have subsetted a single atomic vector or list element:

# Select 1st column from dataframe by its index (note that `names` attribute is gone)
df[[1]]
#> [1] 11 21 31

# Subsetted column is `numeric` vector
class(df[[1]])
#> [1] "numeric"

# Equivalently, we could've selected 1st column by its name
df[["a"]]
#> [1] 11 21 31

# Equivalently, we could've selected 1st column using `$`
df$a
#> [1] 11 21 31

Example: Subsetting dataframe cell(s) with []

If we select a single cell by specifying its row and column, we will get back the element itself, not in a dataframe:

# Selects cell in 1st row and 2nd col
df[1, 2]
#> [1] 12

# Subsetted cell is of type `numeric`
class(df[1, 2])
#> [1] "numeric"

# Equivalently, we could select using column name instead of index
df[1, "b"]
#> [1] 12


Similarly, if we select cells from the same column, we will get back the elements themselves, not in a dataframe:

# Selects cells from the 2nd col
df[c(1,3), 2]
#> [1] 12 32

# Subsetted cells are of type `numeric`
class(df[c(1,3), 2])
#> [1] "numeric"

# Selects all cells from the 2nd col
df[, 2]
#> [1] 12 22 32

# Subsetted column is of type `numeric`
class(df[, 2])
#> [1] "numeric"
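
If you would rather get a dataframe back even when the selected cells come from a single column, base R's [] accepts a drop argument. The following is a small sketch using the same example dataframe, recreated here so the block runs on its own.

```r
# Recreate the example dataframe so this block is self-contained
df <- data.frame(
  a = c(11, 21, 31),
  b = c(12, 22, 32),
  c = c(13, 23, 33),
  d = c(14, 24, 34)
)

# Default behavior: a single column selected by cell is simplified to a vector
class(df[, 2])
#> [1] "numeric"

# drop = FALSE keeps the result as a dataframe with one column
class(df[, 2, drop = FALSE])
#> [1] "data.frame"
dim(df[, 2, drop = FALSE])
#> [1] 3 1
```

Note that this simplification applies to objects created with data.frame(); tibbles keep the result as a tibble when subset with [].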


However, if we select cells from the same row, or cells across multiple rows and columns, we will get back a dataframe that contains the selected cells:

# Selects cells from the 2nd row
df[2, c("a", "c")]
#> # A tibble: 1 × 2
#>       a     c
#>   <dbl> <dbl>
#> 1    21    23

# Subsetted cells are returned as a dataframe
class(df[2, c("a", "c")])
#> [1] "data.frame"

# Selects all cells from the 2nd row
df[2, ]
#> # A tibble: 1 × 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    21    22    23    24

# Subsetted row is returned as a dataframe
class(df[2, ])
#> [1] "data.frame"

# Selects cells from multiple rows and columns
df[1:2, c("a", "c")]
#> # A tibble: 2 × 2
#>       a     c
#>   <dbl> <dbl>
#> 1    11    13
#> 2    21    23

# Subsetted cells are returned as a dataframe
class(df[1:2, c("a", "c")])
#> [1] "data.frame"

Example: Subsetting dataframe cell with [[]]

With [[]], we are only allowed to select a single cell:

# Selects cell in 1st row and 2nd col
df[[1, 2]]
#> [1] 12

# Subsetted cell is of type `numeric`
class(df[[1, 2]])
#> [1] "numeric"

# This is equivalent to using `[]`
df[1, 2]
#> [1] 12


2.2 Prerequisite concepts

Several functions and concepts are used frequently when creating loops and/or functions.

2.2.1 Sequences

What are sequences?

  • (Loose) definition: A sequence is a list of numbers in ascending or descending order
  • Sequences can be created using the : operator or seq() function

Example: Creating sequences using :

# Sequence from -5 to 5
-5:5
#>  [1] -5 -4 -3 -2 -1  0  1  2  3  4  5

# Sequence from 5 to -5
5:-5
#>  [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5


The seq() function:

?seq

# SYNTAX AND DEFAULT VALUES
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
  • Function: Generate a sequence
  • Arguments
    • from: The starting value of sequence
    • to: The end (or maximal) value of sequence
    • by: Increment of the sequence

Example: Creating sequences using seq()

# Sequence from 10 to 15, by increment of 1 (default)
seq(from=10, to=15)
#> [1] 10 11 12 13 14 15

# Explicitly specify increment of 1 (equivalent to above)
seq(from=10, to=15, by=1)
#> [1] 10 11 12 13 14 15

# Sequence from 100 to 150, by increment of 10
seq(from=100, to=150, by=10)
#> [1] 100 110 120 130 140 150
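
Two further uses of seq() that follow from the syntax shown above, sketched here with arbitrary values: a negative by produces a descending sequence, and length.out asks for a fixed number of evenly spaced values instead of a fixed increment.

```r
# Descending sequence: a negative increment counts down from `from` toward `to`
seq(from = 10, to = 1, by = -3)
#> [1] 10  7  4  1

# Fixed number of values: 5 evenly spaced numbers between 0 and 1
seq(from = 0, to = 1, length.out = 5)
#> [1] 0.00 0.25 0.50 0.75 1.00
```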


2.2.2 Length

The length() function:

?length

# SYNTAX
length(x)
  • Function: Returns the number of elements in the object
  • Arguments
    • x: The object to find the length of


Example: Using length() to find number of elements in v

# View the atomic vector
v
#>  a  b  c  d 
#> 10 20 30 40

# Use `length()` to find number of elements
length(v)
#> [1] 4


Example: Using length() to find number of elements in df

Remember that dataframes are just lists where each element is a column, so the number of elements in a dataframe is just the number of columns it has:

# View the dataframe
df
#> # A tibble: 3 × 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    11    12    13    14
#> 2    21    22    23    24
#> 3    31    32    33    34

# Use `length()` to find number of elements (i.e., columns)
length(df)
#> [1] 4


When we subset a dataframe using [] (i.e., select column(s) from the dataframe), the length of the subsetted object is the number of columns we selected:

# Subset one column
df[1]
#> # A tibble: 3 × 1
#>       a
#>   <dbl>
#> 1    11
#> 2    21
#> 3    31

# Length is one
length(df[1])
#> [1] 1

# Subset three columns
df[1:3]
#> # A tibble: 3 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1    11    12    13
#> 2    21    22    23
#> 3    31    32    33

# Length is three
length(df[1:3])
#> [1] 3


When we subset a dataframe using [[]] (i.e., isolate a specific column in the dataframe), the length of the subsetted object is the number of elements in the atomic vector (i.e., the number of rows in the dataframe):

# Isolate a specific column
df[[2]]
#> [1] 12 22 32

# Length is number of elements in that column (i.e., number of rows in dataframe)
length(df[[2]])
#> [1] 3
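
As a quick cross-check of the points above, the sketch below compares length() with the base R functions ncol() and nrow() on the same example dataframe (recreated here so the block runs on its own).

```r
# Recreate the example dataframe so this block is self-contained
df <- data.frame(
  a = c(11, 21, 31),
  b = c(12, 22, 32),
  c = c(13, 23, 33),
  d = c(14, 24, 34)
)

# length() of a dataframe counts columns, the same as ncol()
length(df)
#> [1] 4
ncol(df)
#> [1] 4

# length() of a single column counts rows, the same as nrow()
length(df[[2]])
#> [1] 3
nrow(df)
#> [1] 3
```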


2.2.3 Sequences and length

When writing loops, it is very common to create a sequence from 1 to the length (i.e., number of elements) of an object.


Example: Generating a sequence from 1 to length of v

# There are 4 elements in the atomic vector
v
#>  a  b  c  d 
#> 10 20 30 40
length(v)
#> [1] 4

# Use `:` to generate a sequence from 1 to 4
1:length(v)
#> [1] 1 2 3 4

# Use `seq()` to generate a sequence from 1 to 4
seq(1, length(v))
#> [1] 1 2 3 4


There is also a function seq_along() that makes it easier to generate a sequence from 1 to the length of an object.


The seq_along() function:

?seq_along

# SYNTAX
seq_along(x)
  • Function: Generates a sequence from 1 to the length of the input object
  • Arguments
    • x: The object to generate the sequence for


Example: Generating a sequence from 1 to length of df

# There are 4 elements (i.e., columns) in the dataframe
df
#> # A tibble: 3 × 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    11    12    13    14
#> 2    21    22    23    24
#> 3    31    32    33    34

# Use `seq_along()` to generate a sequence from 1 to 4
seq_along(df)
#> [1] 1 2 3 4

# which gives us the same thing as this:
1:length(df)
#> [1] 1 2 3 4
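
The two approaches differ in one case worth knowing about: an object of length zero. The sketch below uses a hypothetical empty vector, empty_vec, to show the difference.

```r
# Hypothetical empty object (length 0)
empty_vec <- numeric(0)

# `:` counts down from 1 to 0, producing two index positions that do not exist
1:length(empty_vec)
#> [1] 1 0

# seq_along() returns an empty sequence, so a loop over it would run zero times
seq_along(empty_vec)
#> integer(0)
```

This is one reason to prefer seq_along() when the sequence will be used to loop over an object whose length is not known in advance.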

3 Iteration

What is iteration?

  • Iteration is the repetition of some process or operation
    • E.g., Iteration can help with “repeating the same operation on different columns, or on different datasets” (From R for Data Science)
  • Looping is the most common way to iterate

3.1 Loop basics

What are loops?

  • Loops execute some set of commands multiple times
  • Each time the loop executes the set of commands is an iteration
  • The below loop iterates 4 times


Example: Printing each element of the vector c(1,2,3,4) using a loop

# There are 4 elements in the vector
c(1,2,3,4)
#> [1] 1 2 3 4

# Iterate over each element of the vector
for(i in c(1,2,3,4)) {
  print(i)  # Print out each element
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4


When to write loops?

  • Broadly, rationale for writing loop:
    • Do not duplicate code
    • Can make changes to code in one place rather than many
  • When to write a loop:
    • Grolemund and Wickham say don’t copy and paste more than twice
    • If you find yourself doing this, consider writing a loop or function
  • Don’t worry about knowing all the situations in which you should write a loop
    • Rather, you’ll be creating an analysis dataset or analyzing data and you will notice there is some task that you are repeating over and over
    • Then you’ll think, “Oh, I should write a loop or function for this”

3.2 Components of a loop

How to write a loop?

  • We can build loops using the for() function
  • The loop sequence goes inside the parentheses of for()
  • The loop body goes inside the pair of curly brackets ({}) that follows for()
for(i in c(1,2,3,4)) {  # Loop sequence
  print(i)  # Loop body
}


Components of a loop:

  1. Sequence: Determines what to “loop over”
    • In the above example, the sequence is i in c(1,2,3,4)
    • This creates a temporary/local object named i (could name it anything)
    • Each iteration of the loop will assign a different value to i
    • c(1,2,3,4) is the set of values that will be assigned to i
      • In the first iteration, the value of i is 1
      • In the second iteration, the value of i is 2, etc.
  2. Body: What commands to execute for each iteration of the loop
    • In the above example, the body is print(i)
    • Each time through the loop (i.e., each iteration), the body prints the value of object i

3.2.1 Ways to write loop sequence

You may see the loop sequence being written in slightly different ways. For example, these three loops all do the same thing:

  • Looping over the vector c(1,2,3)

    c(1,2,3)
    #> [1] 1 2 3
    
    for(z in c(1,2,3)) {  # Loop sequence
      print(z)  # Loop body
    }
    #> [1] 1
    #> [1] 2
    #> [1] 3
  • Looping over the sequence 1:3

    1:3
    #> [1] 1 2 3
    
    for(z in 1:3) {  # Loop sequence
      print(z)  # Loop body
    }
    #> [1] 1
    #> [1] 2
    #> [1] 3
  • Looping over the object num_sequence

    num_sequence <- 1:3
    num_sequence
    #> [1] 1 2 3
    
    for(z in num_sequence) {  # Loop sequence
      print(z)  # Loop body
    }
    #> [1] 1
    #> [1] 2
    #> [1] 3

3.2.2 Printing values in loop body

When building a loop, it is useful to print out information to understand what the loop is doing.

  • In my opinion, the MOST IMPORTANT tool for learning how to write loops (and how to write functions) is knowing how to print out the value of the object(s) associated with each iteration

Using print() to print a single object z

  • Using print() to show the value of object(s) within an iteration is not the best approach because
    • print() can only print one object per line

    • print() can’t include additional text that tells you what stuff is

      for(z in c(1,2,3)) {
        print(z)
      }
      #> [1] 1
      #> [1] 2
      #> [1] 3


The best way to print object(s) associated with each iteration is wrapping the str_c() function within the writeLines() function.

  • I recommend spending a few minutes reviewing the str_c() function
    • Help: ?str_c
    • See the section on str_c() in the “Intro to strings, dates, and time” lecture of Rclass1
  • Why wrap str_c() within writeLines()?
    • Within a loop body (or function body), a call to str_c() by itself will not print output
    • writeLines(str_c(...)) forces whatever is returned by str_c() to be printed
  • Using str_c() and writeLines() to concatenate and print multiple items:

      for(z in c(1,2,3)) {
        writeLines(str_c("object z=", z))
      }
      #> object z=1
      #> object z=2
      #> object z=3



  • Note: Using writeLines() by itself to print a single object z (code not run)
    • This approach won’t work because writeLines() can only write character objects, and z is numeric

      for(z in c(1,2,3)) {
        writeLines(z)
      }
  • Note: Using str_c() without wrapping it in writeLines()
    • A call to str_c() inside a loop body (or function body) will not print output

      for(z in c(1,2,3)) {
        str_c("object z=", z)
      }
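
If you do want to print a bare number with writeLines(), one option is to convert it to character first. The sketch below uses only base R so that it runs without loading any packages; paste0() stands in for str_c() here and, for this simple case, produces the same text.

```r
# Convert the numeric value to character so writeLines() accepts it
for (z in c(1, 2, 3)) {
  writeLines(as.character(z))
}
#> 1
#> 2
#> 3

# Base R alternative to writeLines(str_c(...)) for labeling the value
for (z in c(1, 2, 3)) {
  writeLines(paste0("object z=", z))
}
#> object z=1
#> object z=2
#> object z=3
```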

3.2.3 Student exercise

  1. Create a numeric vector that contains the birth years of your family members
    • E.g., birth_years <- c(1944,1950,1981,2016)
  2. Write a loop that calculates the current year minus birth year and prints this number for each member of your family
    • Within this loop, you will create a new variable that calculates current year minus birth year

Solutions
birth_years <- c(1944,1950,1981,2016,2019)
birth_years
#> [1] 1944 1950 1981 2016 2019

for(y in birth_years) {  # Loop sequence
  writeLines(str_c("object y=", y))  # Loop body
  z <- 2023 - y
  writeLines(str_c("value of 2023 minus ", y, " is ", z))
}
#> object y=1944
#> value of 2023 minus 1944 is 79
#> object y=1950
#> value of 2023 minus 1950 is 73
#> object y=1981
#> value of 2023 minus 1981 is 42
#> object y=2016
#> value of 2023 minus 2016 is 7
#> object y=2019
#> value of 2023 minus 2019 is 4
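
The solution above hard-codes the year 2023. As a variation, the sketch below reads the current year from the system clock so the loop stays correct in later years. The object names current_year and age are hypothetical, and paste0() is used in place of str_c() only so the block runs without loading packages. The printed values depend on when the code is run, so no output is shown.

```r
birth_years <- c(1944, 1950, 1981, 2016, 2019)

# Current year taken from the system date, stored as an integer
current_year <- as.integer(format(Sys.Date(), "%Y"))

for (y in birth_years) {
  # Difference in years only; ignores whether the birthday has already passed
  age <- current_year - y
  writeLines(paste0("value of ", current_year, " minus ", y, " is ", age))
}
```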

3.3 Ways to loop over a vector

There are 3 ways to loop over elements of an object:

  1. Looping over the element (contents) (approach we have used so far)
  2. Looping over names of the elements
  3. Looping over numeric indices associated with element position (approach recommended by Grolemund and Wickham)


For the examples in the next few subsections, we will be working with the following named atomic vector and dataframe:

  • Create named atomic vector called vec

    vec <- c(a = 5, b = -10, c = 30)
    vec
    #>   a   b   c 
    #>   5 -10  30
    • element contents are: 5, -10, 30
    • element names are: a, b, c
    • indices of element position are: 1, 2, 3
  • Create dataframe called df with randomly generated data, 3 columns (vars) and 4 rows (obs)

    set.seed(12345) # so we all get the same variable values
    df <- tibble(a = rnorm(4), b = rnorm(4), c = rnorm(4))
    str(df)
    #> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
    #>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
    #>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
    #>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817

3.3.1 Looping over elements

Syntax: for (i in object_name)

  • This approach iterates over each element in the object
  • The value of i is equal to the element’s content (rather than its name or index position)


Example: Looping over elements in vec

vec  # View named atomic vector object
#>   a   b   c 
#>   5 -10  30

for (i in vec) {
  writeLines(str_c("value of object i=",i))
  writeLines(str_c("object i has: type=", typeof(i), "; length=", length(i), "; class=", class(i), "\n"))  # "\n" adds line break
}
#> value of object i=5
#> object i has: type=double; length=1; class=numeric
#> 
#> value of object i=-10
#> object i has: type=double; length=1; class=numeric
#> 
#> value of object i=30
#> object i has: type=double; length=1; class=numeric


Example: Looping over elements in df

df  # View dataframe object
#> # A tibble: 4 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82

# show contents of element, outside of a loop
  # each element of the dataframe is a vector that contains one element for each observation
  str(df[1]) # single bracket
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
  str(df[[1]]) # double bracket
#>  num [1:4] 0.586 0.709 -0.109 -0.453

for (i in df) {
  writeLines(str_c("value of object i=",i))
  writeLines(str_c("object i has: type=", typeof(i), "; length=", length(i), "; class=", class(i), "\n"))  # "\n" adds line break
}
#> value of object i=0.585528817843856
#> value of object i=0.709466017509524
#> value of object i=-0.109303314681054
#> value of object i=-0.453497173462763
#> object i has: type=double; length=4; class=numeric
#> 
#> value of object i=0.605887455840394
#> value of object i=-1.81795596770373
#> value of object i=0.630098551068391
#> value of object i=-0.276184105225216
#> object i has: type=double; length=4; class=numeric
#> 
#> value of object i=-0.284159743943371
#> value of object i=-0.919322002474128
#> value of object i=-0.116247806352002
#> value of object i=1.81731204370422
#> object i has: type=double; length=4; class=numeric

Example: Calculating column averages for df by looping over columns


The dataframe df is a list object, where each element is a vector (i.e., column):

df  # View dataframe object
#> # A tibble: 4 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82

for (i in df) {
  writeLines(str_c("value of object i=", i))
  writeLines(str_c("mean value of object i=", mean(i, na.rm = TRUE), "\n"))
}
#> value of object i=0.585528817843856
#> value of object i=0.709466017509524
#> value of object i=-0.109303314681054
#> value of object i=-0.453497173462763
#> mean value of object i=0.183048586802391
#> 
#> value of object i=0.605887455840394
#> value of object i=-1.81795596770373
#> value of object i=0.630098551068391
#> value of object i=-0.276184105225216
#> mean value of object i=-0.21453851650504
#> 
#> value of object i=-0.284159743943371
#> value of object i=-0.919322002474128
#> value of object i=-0.116247806352002
#> value of object i=1.81731204370422
#> mean value of object i=0.124395622733679

3.3.2 Looping over names

Syntax: for (i in names(object_name))

  • To use this approach, elements in the object must have name attributes
  • This approach iterates over the names of each element in the object
  • names() returns a vector of the object’s element names
  • The value of i is equal to the element’s name (rather than its content or index position)
  • But note that it is still possible to access the element’s content inside the loop:
    • Access element contents using object_name[i]
      • Same object type as object_name; retains attributes (e.g., name attribute)
    • Access element contents using object_name[[i]]
      • Removes level of hierarchy, thereby removing attributes
      • Approach recommended by Wickham because it isolates value of element


Example: Looping over element names in vec

vec  # View named atomic vector object
#>   a   b   c 
#>   5 -10  30
names(vec)  # View names of atomic vector object
#> [1] "a" "b" "c"

for (i in names(vec)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  #str(vec[i])  # Access element contents using []
  str(vec[[i]])  # Access element contents using [[]]
}
#> 
#> value of object i=a; type=character
#>  num 5
#> 
#> value of object i=b; type=character
#>  num -10
#> 
#> value of object i=c; type=character
#>  num 30


Example: Looping over element names in df

df  # View dataframe object
#> # A tibble: 4 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82
names(df)  # View names of dataframe object (i.e., column names)
#> [1] "a" "b" "c"

# show using name to print contents, outside of a loop
str(df["a"]) # single bracket
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
str(df[["a"]]) # double bracket
#>  num [1:4] 0.586 0.709 -0.109 -0.453

for (i in names(df)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  #str(df[i])  # Access element contents using []
  str(df[[i]])  # Access element contents using [[]]
}
#> 
#> value of object i=a; type=character
#>  num [1:4] 0.586 0.709 -0.109 -0.453
#> 
#> value of object i=b; type=character
#>  num [1:4] 0.606 -1.818 0.63 -0.276
#> 
#> value of object i=c; type=character
#>  num [1:4] -0.284 -0.919 -0.116 1.817

Example: Calculating column averages for df by looping over column names
str(df)  # View structure of dataframe object
#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817


Remember that we can use [[]] to access element contents by their name:

for (i in names(df)) {
  writeLines(str_c("mean of element named ", i, " = ", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element named a = 0.183048586802391
#> mean of element named b = -0.21453851650504
#> mean of element named c = 0.124395622733679


If we tried completing the task using [] to access the element contents, it would not work: mean() only takes numeric or logical vectors as input, and df[i] returns a dataframe object, so mean() returns NA with a warning instead of the column mean:

for (i in names(df)) {
  writeLines(str_c("mean of element named", i, "=", mean(df[i], na.rm = TRUE)))
  
  # print(class(df[i]))
}
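
To see the difference outside of a loop, here is a minimal sketch with a small hypothetical dataframe (the values are made up). In current versions of R, calling mean() on a dataframe returns NA and issues a warning; suppressWarnings() is used below only to keep the output short.

```r
# Hypothetical small dataframe for illustration
df_small <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# df_small["a"] is a dataframe, so mean() cannot average it and returns NA
result_single <- suppressWarnings(mean(df_small["a"]))
result_single
#> [1] NA

# df_small[["a"]] is a numeric vector, so mean() works as expected
result_double <- mean(df_small[["a"]])
result_double
#> [1] 2
```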

3.3.3 Looping over indices

Syntax: for (i in 1:length(object_name)) OR for (i in seq_along(object_name))

  • This approach iterates over the index positions of each element in the object
  • There are two ways to create the loop sequence:
    • length() returns the number of elements in the input object, which we can use to create a sequence of index positions (i.e., 1:length(object_name))
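    • seq_along() returns a sequence of numbers that represent the index positions for all elements in the input object (i.e., equivalent to 1:length(object_name) whenever the object has at least one element; for a zero-length object, seq_along() returns an empty sequence while 1:length(object_name) returns c(1, 0))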
  • The value of i is equal to the element’s index position (rather than its content or name)
  • But note that it is still possible to access the element’s content inside the loop:
    • Access element contents using object_name[i]
      • Same object type as object_name; retains attributes (e.g., name attribute)
    • Access element contents using object_name[[i]]
      • Removes level of hierarchy, thereby removing attributes
      • Approach recommended by Wickham because it isolates value of element
  • Similarly, we can access the element’s name by its index using names(object_name)[i] or names(object_name)[[i]]
    • In this case, using [[]] and [] are equivalent because names() returns an unnamed vector, which does not have any attributes
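
The two ways of building the loop sequence differ only for zero-length objects. A short sketch of that edge case, using a hypothetical empty vector called empty_vec:

```r
empty_vec <- numeric(0)  # hypothetical zero-length vector

1:length(empty_vec)    # counts down from 1 to 0
#> [1] 1 0
seq_along(empty_vec)   # empty sequence
#> integer(0)

# Count how many times each loop body runs
runs_colon <- 0
for (i in 1:length(empty_vec)) runs_colon <- runs_colon + 1

runs_seq_along <- 0
for (i in seq_along(empty_vec)) runs_seq_along <- runs_seq_along + 1

runs_colon
#> [1] 2
runs_seq_along
#> [1] 0
```

Because the seq_along() loop runs zero times on an empty object, it is the safer default when the object might be empty.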


Example: Looping over the index positions of vec

vec  # View named atomic vector object
#>   a   b   c 
#>   5 -10  30
length(vec)  # View length of atomic vector object
#> [1] 3
1:length(vec)  # Create sequence from `1` to `length(vec)`
#> [1] 1 2 3

for (i in 1:length(vec)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  #str(vec[i])  # Access element contents using []
  str(vec[[i]])  # Access element contents using [[]]
}
#> 
#> value of object i=1; type=integer
#>  num 5
#> 
#> value of object i=2; type=integer
#>  num -10
#> 
#> value of object i=3; type=integer
#>  num 30


Example: Looping over the index positions of df

df  # View dataframe object
#> # A tibble: 4 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82
seq_along(df)  # Equivalent to `1:length(df)`
#> [1] 1 2 3

for (i in seq_along(df)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  str(df[i])  # Access element contents using []
  str(df[[i]])  # Access element contents using [[]]
}
#> 
#> value of object i=1; type=integer
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#>  num [1:4] 0.586 0.709 -0.109 -0.453
#> 
#> value of object i=2; type=integer
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#>  num [1:4] 0.606 -1.818 0.63 -0.276
#> 
#> value of object i=3; type=integer
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817
#>  num [1:4] -0.284 -0.919 -0.116 1.817


We could also access the element’s name by its index:

names(df)  # View names of dataframe object (i.e., column names)
#> [1] "a" "b" "c"
names(df)[[2]]  # We can access any element in the names vector by its index
#> [1] "b"
   #names(df)[2]  # same as above

# Incorporate the above line into the loop
for (i in 1:length(df)) {
  writeLines(str_c("i=", i, "; name=", names(df)[[i]]))
}
#> i=1; name=a
#> i=2; name=b
#> i=3; name=c

Example: Calculating column averages for df by looping over column indices


Use i in seq_along(df) to loop over the column indices and [[]] to access column contents:

str(df)  # View structure of dataframe object
#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817

for (i in seq_along(df)) {
  writeLines(str_c("mean of element at index position ", i, " = ", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element at index position 1 = 0.183048586802391
#> mean of element at index position 2 = -0.21453851650504
#> mean of element at index position 3 = 0.124395622733679

3.3.4 Summary

There are 3 ways to loop over elements of an object:

  1. Looping over the elements
  2. Looping over names of the elements
  3. Looping over numeric indices associated with element position
    • Grolemund and Wickham recommend this approach (#3) because given an element’s index position, we can also extract the element name (#2) and value (#1)

for (i in seq_along(df)) {
  writeLines(str_c("\n", "i=", i))  # element's index position
  
  name <- names(df)[[i]]  # element's name (what we looped over in approach #2)
  writeLines(str_c("name=", name))
  
  value <- df[[i]]  # element's value (what we looped over in approach #1)
  writeLines(str_c("value=", value))
}
#> 
#> i=1
#> name=a
#> value=0.585528817843856
#> value=0.709466017509524
#> value=-0.109303314681054
#> value=-0.453497173462763
#> 
#> i=2
#> name=b
#> value=0.605887455840394
#> value=-1.81795596770373
#> value=0.630098551068391
#> value=-0.276184105225216
#> 
#> i=3
#> name=c
#> value=-0.284159743943371
#> value=-0.919322002474128
#> value=-0.116247806352002
#> value=1.81731204370422

3.4 Modifying vs. creating object

Grolemund and Wickham differentiate between two types of tasks loops accomplish:

  1. Modifying an existing object
    • E.g., Looping through a set of variables in a dataframe to:
      • Modify these variables OR
      • Create new variables (within the existing dataframe object)
    • When writing loops in Stata/SAS/SPSS, we are usually modifying an existing object because these programs typically only have one object (a dataset) open at a time
  2. Creating a new object
    • E.g., Creating an object that has summary statistics for each variable, which can be the basis for a table or graph, etc.
    • The new object will often be a vector of results based on looping through elements of a dataframe
    • In R (as opposed to Stata/SAS/SPSS), creating a new object is very common because R can hold many objects at the same time

3.4.1 Modifying an existing object

How to modify an existing object?

  • Recall that we can directly access elements in an object (e.g., atomic vector, lists) using [[]]. We can use this same notation to modify the object.
  • Even though atomic vectors can also be modified with [], Wickham recommends using [[]] in all cases to make it clear we are working with a single element (From R for Data Science)

Example: Modifying an existing atomic vector

Recall our named atomic vector vec from the previous examples:

vec
#>   a   b   c 
#>   5 -10  30

We can loop over the index positions and use [[]] to modify the object:

for (i in seq_along(vec)) {
  vec[[i]] <- vec[[i]] * 2  # Double each element
}

vec
#>   a   b   c 
#>  10 -20  60

Example: Modifying an existing dataframe

Recall our dataframe df from the previous examples:

df
#> # A tibble: 4 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82

We can loop over the index positions and use [[]] to modify the object:

for (i in seq_along(df)) {
  df[[i]] <- df[[i]] * 2  # Double each element
}

df
#> # A tibble: 4 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  1.17   1.21  -0.568
#> 2  1.42  -3.64  -1.84 
#> 3 -0.219  1.26  -0.232
#> 4 -0.907 -0.552  3.63

3.4.2 Creating a new object

So far our loops have two components:

  1. Sequence
  2. Body

When we create a new object to store the results of a loop, our loops have three components:

  1. Sequence
  2. Body
  3. Output (This is the new object that will store the results created from your loop)


Grolemund and Wickham recommend using vector() to create this new object prior to writing the loop (rather than creating the new object within the loop):

“Before you start loop…allocate sufficient space for the output. This is very important for efficiency: if you grow the for loop at each iteration using c() (for example), your for loop will be very slow.”


The vector() function:

?vector

# SYNTAX AND DEFAULT VALUES
vector(mode = "logical", length = 0)
  • Function: Creates a new vector object of the given length and mode
  • Arguments
    • mode: Type of vector to create (e.g., "logical", "numeric", "list")
    • length: Length of the vector
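
A quick sketch of what vector() returns for a few common modes (base R only; the object names are made up for illustration). Each mode has its own "empty" starting value:

```r
lgl <- vector(mode = "logical", length = 2)
lgl
#> [1] FALSE FALSE

num <- vector(mode = "numeric", length = 3)
num
#> [1] 0 0 0

chr <- vector(mode = "character", length = 2)
chr
#> [1] "" ""

# A list is useful when each result may have a different type or length
lst <- vector(mode = "list", length = 2)
str(lst)
#> List of 2
#>  $ : NULL
#>  $ : NULL
```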

Example: Creating a new object to store dataframe column averages

Recall the previous example where we calculated the mean value of each column in dataframe df:

str(df)
#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 1.171 1.419 -0.219 -0.907
#>  $ b: num [1:4] 1.212 -3.636 1.26 -0.552
#>  $ c: num [1:4] -0.568 -1.839 -0.232 3.635

for (i in seq_along(df)) {
  writeLines(str_c("mean of element at index position ", i, " = ", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element at index position 1 = 0.366097173604781
#> mean of element at index position 2 = -0.42907703301008
#> mean of element at index position 3 = 0.248791245467358


Let’s create a new object to store these column averages. Specifically, we’ll create a new numeric vector whose length is equal to the number of columns in df:

output <- vector(mode = "numeric", length = length(df))

output # print
#> [1] 0 0 0
class(output)  # Specified by `mode` argument in `vector()`
#> [1] "numeric"
length(output)  # Specified by `length` argument in `vector()`
#> [1] 3


We can loop over the index positions of df and use [[]] to modify output:

for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]], na.rm = TRUE)  # Mean of df[[1]] assigned to output[[1]], etc.
}

output
#> [1]  0.3660972 -0.4290770  0.2487912
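
Because we loop over index positions, the column names are still available through names(). As a sketch, we could label each stored result with its column name. The data frame toy_df and the object col_means below are made up for illustration:

```r
# Hypothetical toy data frame, used only for this illustration
toy_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Create the output object before the loop and copy over the column names
col_means <- vector(mode = "numeric", length = length(toy_df))
names(col_means) <- names(toy_df)

for (i in seq_along(toy_df)) {
  col_means[[i]] <- mean(toy_df[[i]], na.rm = TRUE)
}

# Results can now be looked up by column name
col_means[["b"]]
#> [1] 20
```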

3.5 Summary

The general recipe for how to write a loop:

  1. Complete the task for one instance outside a loop (this is akin to writing the body of the loop)

  2. Write the sequence of the loop

  3. Modify the parts of the loop body that need to change with each iteration

  4. If you are creating a new object to store output of the loop, create this object outside of the loop

  5. Construct the loop
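
As one possible walk-through of this recipe, the sketch below counts missing values in each column of a made-up data frame (toy_df and n_missing are hypothetical names; base R only):

```r
# Hypothetical data frame with some missing values
toy_df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6), z = c(7, 8, 9))

# Step 1: complete the task for one instance outside a loop
sum(is.na(toy_df[["x"]]))
#> [1] 1

# Step 4: create the output object outside of the loop
n_missing <- vector(mode = "integer", length = length(toy_df))

# Steps 2, 3, and 5: write the sequence, replace the hard-coded "x" with i,
# and put the pieces together
for (i in seq_along(toy_df)) {
  n_missing[[i]] <- sum(is.na(toy_df[[i]]))
}

n_missing
#> [1] 1 2 0
```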


When to write a loop vs a function

It’s usually obvious when you are duplicating code, but unclear whether you should write a loop or whether you should write a function.

  • Often, a repeated task can be completed with a loop or a function

In my experience, loops are better for repeated tasks when the individual tasks are very similar to one another

  • E.g., a loop that reads in datasets from individual years; each dataset you read in differs only by directory and name
  • E.g., a loop that converts negative values to NA for a set of variables
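
A minimal sketch of the second example, assuming a made-up data frame survey in which negative values are missing-value codes (all names and values here are hypothetical):

```r
# Hypothetical data where negative values (e.g., -9, -8) mean "missing"
survey <- data.frame(
  id = 1:4,
  income = c(50, -9, 72, -8),
  age = c(34, 29, -9, 51)
)

vars_to_clean <- c("income", "age")  # variables to modify

# The task is identical for every variable, so a loop fits well
for (v in vars_to_clean) {
  survey[[v]][survey[[v]] < 0] <- NA
}

survey$income
#> [1] 50 NA 72 NA
survey$age
#> [1] 34 29 NA 51
```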

Because functions can have many arguments, functions are better when the individual tasks differ substantially from one another

  • E.g., a function that runs regression and creates formatted results table
    • Function allows you to specify (as function arguments): dependent variable; independent variables; what model to run, etc.

Note:

  • Can embed loops within functions; can call functions within loops
  • But for now, just try to understand basics of functions and loops

4 Conditional execution

What is conditional execution?

  • A condition is some statement that evaluates to either TRUE or FALSE; for example, 1 > 5 evaluates to FALSE
  • Conditional execution is the running of specific blocks of code based on some condition
    • E.g., If the number is even, run this block of code. Otherwise, run the other block of code, etc.
  • We can write if, else if, and else statements to run code conditionally (covered in upcoming sections)
  • This is useful because it allows for decision-making in the code

Credit: Decision Making in C / C++, GeeksforGeeks

4.1 Conditions

What is a condition?

  • A condition is any expression that can evaluate to TRUE or FALSE
  • Your condition should have a length of 1 (in R 4.2.0 and later, a longer condition is an error; earlier versions issued a warning and only looked at the first element)
  • Based on the condition, different blocks of your code will be run


Any expression that has a length of 1 and can evaluate to either TRUE or FALSE can be used as the condition:

# This expression evaluates to `TRUE`
2 + 2 == 4
#> [1] TRUE

# It is of type `logical`
typeof(2 + 2 == 4)
#> [1] "logical"

# It has length of `1`
length(2 + 2 == 4)
#> [1] 1

Some functions return a logical, so you might also see a function call being used as the condition:

# This function call returns `FALSE` because the string "NA" is not the missing value `NA`
is.na("NA")
#> [1] FALSE

# It is of type `logical`
typeof(is.na("NA"))
#> [1] "logical"

# It has length of `1`
length(is.na("NA"))
#> [1] 1
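
When a comparison involves a vector with more than one element, the result is also longer than 1. One common way to reduce it to a single TRUE or FALSE is any() or all(). A short sketch with a hypothetical vector x:

```r
x <- c(3, -1, 7)  # hypothetical values

x < 0           # length 3, so not usable as a condition on its own
#> [1] FALSE  TRUE FALSE

any(x < 0)      # TRUE if at least one element is TRUE
#> [1] TRUE
all(x < 0)      # TRUE only if every element is TRUE
#> [1] FALSE

length(any(x < 0))
#> [1] 1
```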

4.2 if statement conditions

What are if statement conditions?

  • if statements allow you to conditionally execute certain blocks of code depending on whether some condition(s) is TRUE
  • The condition goes inside of the parentheses in if() and the block of code to execute goes between the curly brackets ({})
  • The condition must evaluate to either TRUE or FALSE (i.e., be of type logical)
  • The condition should have length of 1
if (condition) {
  # code executed when condition is TRUE
}


The block of code is executed if the condition evaluates to TRUE:

if (TRUE) {
  writeLines("This block is executed.")
}
#> This block is executed.

Note that the block of code below yields the exact same result as the one above because the condition 1==1 evaluates to TRUE:

if (1==1) {
  writeLines("This block is executed.")
}
#> This block is executed.

The block of code is not executed if the condition evaluates to FALSE:

if (FALSE) {
  writeLines("This block is not executed.")
}

Example: Condition that evaluates to TRUE

Remember that any expression that has a length of 1 and can evaluate to either TRUE or FALSE can be used as the condition:

# This expression evaluates to `TRUE`
2 + 2 == 4
#> [1] TRUE

# It is of type `logical`
typeof(2 + 2 == 4)
#> [1] "logical"

# It has length of `1`
length(2 + 2 == 4)
#> [1] 1

# We can use it as the if statement condition
if (2 + 2 == 4) {
  writeLines("This block is executed because `2 + 2 == 4` evaluates to `TRUE`.")
}
#> This block is executed because `2 + 2 == 4` evaluates to `TRUE`.

Example: Condition that evaluates to FALSE

Recall that some functions return a logical, so you might also see a function call being used as the condition:

# This function call returns `FALSE` because the string "NA" is not the missing value `NA`
is.na("NA")
#> [1] FALSE

# It is of type `logical`
typeof(is.na("NA"))
#> [1] "logical"

# It has length of `1`
length(is.na("NA"))
#> [1] 1

# We can use it as the if statement condition
if (is.na("NA")) {
  writeLines("This block is not executed because the condition evaluates to `FALSE`.")
}

# double negative equals positive!
if (!is.na("NA")) {
  writeLines("This block is not not executed because `!FALSE` evaluates to `TRUE`!")
}
#> This block is not not executed because `!FALSE` evaluates to `TRUE`!

4.2.1 || and &&

How to combine multiple logical expressions in a condition?

  • Recall that a logical expression is of type logical and has a length of 1
  • An if statement condition can be made up of multiple logical expressions
  • We can use || (or) and && (and) to combine multiple logical expressions
  • “Never use | or & in an if statement: these are vectorised operations that apply to multiple values (that’s why you use them in filter())” (From R for Data Science)
    • Vectorised operations apply to each pair of corresponding elements of the vectors and return a vector:

      c(TRUE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
      #> [1]  TRUE  TRUE FALSE
      • 1st element of each vector: TRUE or TRUE is TRUE
      • 2nd element of each vector: TRUE or FALSE is TRUE
      • 3rd element of each vector: FALSE or FALSE is FALSE
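    • Whereas || and && are meant for logical expressions of length 1. In the R version used here, they only look at the first element of each vector and issue a warning; from R 4.3.0 onward, this is an error: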

      c(TRUE, TRUE, FALSE) || c(TRUE, FALSE, FALSE)
      #> Warning in c(TRUE, TRUE, FALSE) || c(TRUE, FALSE, FALSE): 'length(x) = 3 > 1'
      #> in coercion to 'logical(1)'
      #> [1] TRUE


When using || (or), the block of code is executed if any of the conditions evaluates to TRUE:

if (condition1 || condition2 || condition3) {
  # code executed when any of the conditions is TRUE
}

When using && (and), the block of code is executed if all of the conditions evaluate to TRUE:

if (condition1 && condition2 && condition3) {
  # code executed when all of the conditions are TRUE
}

Example: Combining multiple logical expressions using ||

When using || (or), the block of code is executed if any of the conditions evaluates to TRUE:

# This block is executed because at least 1 condition is `TRUE`
if (TRUE || FALSE) {
  writeLines("This block is executed.")
}
#> This block is executed.

# This block is not executed because both logical expressions evaluate to `FALSE`
if (is.na("NA") || 2 + 2 == 5) {
  writeLines("This block is not executed.")
}

Example: Combining multiple logical expressions using &&

When using && (and), the block of code is executed if all of the conditions evaluate to TRUE:

# This block is not executed because not all conditions are `TRUE`
if (TRUE && FALSE) {
  writeLines("This block is not executed.")
}

# This block is executed because all logical expressions evaluate to `TRUE`
if (!is.na("NA") && 2 + 2 == 4) {
  writeLines("This block is executed.")
}
#> This block is executed.
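
|| and && also evaluate their arguments from left to right and stop as soon as the result is known (often called short-circuit evaluation). This lets us guard a check that would otherwise fail. A small sketch with a hypothetical input x that may be NULL:

```r
x <- NULL  # hypothetical input that might be missing

# !is.null(x) is FALSE, so && stops there;
# `x > 0` (which would have length 0) is never evaluated
is_positive_null <- !is.null(x) && x > 0
is_positive_null
#> [1] FALSE

x <- 5
# !is.null(x) is TRUE, so the right-hand side is evaluated too
is_positive_five <- !is.null(x) && x > 0
is_positive_five
#> [1] TRUE
```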

4.3 else statements

What are else statements?

  • After the if block, you can include an else block that will be executed if the if block did not execute
  • In other words, the else block is executed if the if statement’s condition is not met
if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}

Example: Using if-else statement

Recall the function dir.exists() that checks if a directory exists:

getwd()
#> [1] "C:/Users/ozanj/Documents/rclass2/lectures/programming"
list.files()
#> [1] "my_new_directory" "programming.html" "programming.Rmd"

directory <- "my_new_directory"
dir.exists(directory)
#> [1] TRUE


Let’s take a look at using an if-else statement to create the directory (using dir.create()) only if it doesn’t currently exist:

if (dir.exists(directory)) {
  writeLines(str_c("The directory '", directory, "' already exists."))
} else {
  dir.create(directory)
  writeLines(str_c("Created directory '", directory, "'."))
}
#> The directory 'my_new_directory' already exists.

# Check that directory is created
list.files()
#> [1] "my_new_directory" "programming.html" "programming.Rmd"


If we try running this code again, the if block would be executed because the directory already exists:

dir.exists(directory)
#> [1] TRUE

if (dir.exists(directory)) {
  writeLines(str_c("The directory '", directory, "' already exists."))
} else {
  dir.create(directory)
  writeLines(str_c("Created directory '", directory, "'."))
}
#> The directory 'my_new_directory' already exists.

Example: Using if-else statement with loop

We can loop over multiple directory names and for each, create the directory only if it does not already exist:

directories <- c("scripts", "dictionaries", "output")
directories
#> [1] "scripts"      "dictionaries" "output"

for (i in directories) {
  if (dir.exists(i)) {
    writeLines(str_c("The directory '", i, "' already exists."))
  } else {
    dir.create(i)
    writeLines(str_c("Created directory '", i, "'."))
  }
}
#> Created directory 'scripts'.
#> Created directory 'dictionaries'.
#> Created directory 'output'.

# Check that directories are created
list.files()
#> [1] "dictionaries"     "my_new_directory" "output"           "programming.html"
#> [5] "programming.Rmd"  "scripts"


If we try running the code again, the if block would be executed during each iteration of the loop because all the directories already exist:

for (i in directories) {
  if (dir.exists(i)) {
    writeLines(str_c("The directory '", i, "' already exists."))
  } else {
    dir.create(i)
    writeLines(str_c("Created directory '", i, "'."))
  }
}
#> The directory 'scripts' already exists.
#> The directory 'dictionaries' already exists.
#> The directory 'output' already exists.

4.4 else if statements

What are else if statements?

  • Between the if block and the else block, you can include additional block(s) using else if; each one gets executed if its condition is met and none of the previous blocks got executed
  • In other words, at most 1 block will ever execute in an if/else if/else chain (exactly 1 when the chain ends with an else block)

if (condition) {
  # run this code if condition TRUE
} else if (condition) {
  # run this code if previous condition FALSE and this condition TRUE
} else if (condition) {
  # run this code if both previous conditions FALSE and this condition TRUE
} else {
  # run this code if all previous conditions FALSE
}

Example: Using else if statement

Using the diamonds dataset available from ggplot2 (part of tidyverse), let’s create a vector of 5 diamond prices:

prices <- unique(diamonds$price)[23:27]
str(prices)
#>  int [1:5] 405 552 553 554 2757


Let’s loop through the prices vector and print whether each is affordable (under $500), pricey (between $500 and $1000), or too expensive ($1000 and up):

for (i in prices) {
  if (i < 500) {
    writeLines(str_c("This diamond costs $", i, " and is affordable."))
  } else if (i >= 500 && i < 1000) {
    writeLines(str_c("This diamond costs $", i, " and is pricey..."))
  } else {
    writeLines(str_c("This diamond costs $", i, " and is too expensive!"))
  }
}
#> This diamond costs $405 and is affordable.
#> This diamond costs $552 and is pricey...
#> This diamond costs $553 and is pricey...
#> This diamond costs $554 and is pricey...
#> This diamond costs $2757 and is too expensive!


Remember that each subsequent else if statement will only be considered if all previous blocks did not run (i.e., their conditions were not met). This means we can simplify i >= 500 && i < 1000 to i < 1000 in the else if condition:

for (i in prices) {
  if (i < 500) {
    writeLines(str_c("This diamond costs $", i, " and is affordable."))
  } else if (i < 1000) {
    writeLines(str_c("This diamond costs $", i, " and is pricey..."))
  } else {
    writeLines(str_c("This diamond costs $", i, " and is too expensive!"))
  }
}
#> This diamond costs $405 and is affordable.
#> This diamond costs $552 and is pricey...
#> This diamond costs $553 and is pricey...
#> This diamond costs $554 and is pricey...
#> This diamond costs $2757 and is too expensive!
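
The simplification works only because the conditions are checked in order from smallest to largest cutoff. As a sketch of what goes wrong otherwise, the loop below deliberately checks the larger cutoff first, using a few made-up prices (toy_prices and labels_wrong_order are hypothetical names):

```r
toy_prices <- c(405, 552, 2757)  # hypothetical prices

labels_wrong_order <- vector(mode = "character", length = length(toy_prices))

for (i in seq_along(toy_prices)) {
  if (toy_prices[[i]] < 1000) {
    labels_wrong_order[[i]] <- "pricey"
  } else if (toy_prices[[i]] < 500) {
    # Never reached: any price under 500 already matched the first condition
    labels_wrong_order[[i]] <- "affordable"
  } else {
    labels_wrong_order[[i]] <- "too expensive"
  }
}

# The $405 diamond is labelled "pricey" rather than "affordable"
labels_wrong_order
```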

4.5 Processing time

Especially when working with large datasets, the time it takes for your code to run can really add up, so it is important to look for ways to optimize code such that it runs most efficiently. We can use system.time() to measure how long it takes for some code to run.

The system.time() function:

?system.time

# SYNTAX AND DEFAULT VALUES
system.time(expr, gcFirst = TRUE)
  • Function: Returns CPU (and other) times that expr used
  • Arguments
    • expr: Valid R expression to be timed


For the below examples, we’ll use this numeric atomic vector called prices that is equal to the price of each diamond in the diamonds dataframe:

prices <- diamonds$price
str(prices)  # 53,940 diamond prices
#>  int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...

Example: Allocating sufficient space for output before loop

Let’s take a look at an example of using a loop to calculate the z-score for each diamond price and storing the scores in a vector. First, we’ll calculate the mean and standard deviation of the prices:

m <- mean(prices, na.rm=TRUE)
s <- sd(prices, na.rm=TRUE)


[Method 1] Growing the vector inside the loop using c()

  • “Whenever you use c(), append(), cbind(), rbind(), or paste() to create a bigger object, R must first allocate space for the new object and then copy the old object to its new home. If you’re repeating this many times, like in a for loop, this can be quite expensive.” (From Advanced R)
z_prices <- c()

system.time(
  for (i in 1:length(prices)) {
    z_prices <- c(z_prices, (prices[i] - m)/s)
  }
)
#>    user  system elapsed 
#>    3.69    1.72    5.41


[Method 2] Creating the output vector before loop (Recommended)

  • “Before you start the loop, you must always allocate sufficient space for the output. This is very important for efficiency” (From R for Data Science)
  • As seen, we can do that by first creating the z_prices object using vector() before the loop
z_prices <- vector("double", length(prices))

system.time(
  for (i in 1:length(prices)) {
    z_prices[i] <- (prices[i] - m)/s
  }
)
#>    user  system elapsed 
#>       0       0       0

Example: Vectorising your code

What does it mean to “vectorise your code”?

  • “Vectorising is about taking a ‘whole object’ approach to a problem, thinking about vectors, not scalars.” (From Advanced R)
  • Often, this means avoiding loops and using vectorised functions instead (e.g., use if_else() function instead of if-else statement inside a for loop)
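
As a small illustration of the "whole object" approach, the z-score calculation can be written without a loop because arithmetic operators in R are already vectorised. The sketch below uses randomly generated stand-in prices (toy_prices is a hypothetical object, not the diamonds data) and checks that both approaches agree:

```r
set.seed(1)
toy_prices <- sample(300:19000, size = 1000, replace = TRUE)  # hypothetical prices

m <- mean(toy_prices)
s <- sd(toy_prices)

# Loop version: one element at a time
z_loop <- vector("double", length(toy_prices))
for (i in seq_along(toy_prices)) {
  z_loop[[i]] <- (toy_prices[[i]] - m) / s
}

# Vectorised version: operate on the whole vector at once
z_vec <- (toy_prices - m) / s

all.equal(z_loop, z_vec)
#> [1] TRUE
```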

To see the difference, let’s look at the example of classifying diamond prices as affordable or expensive.


[Method 1] Using if-else statement inside a for loop

output <- vector("character", length(prices))

system.time(
  for (i in 1:length(prices)) {
    if (prices[i] < 500) {  # compare the price, not the index position
      output[i] <- str_c("This diamond costs $", prices[i], " and is affordable.")
    } else {
      output[i] <- str_c("This diamond costs $", prices[i], " and is too expensive!")
    }
  }
)
#>    user  system elapsed 
#>    1.78    0.00    1.79


[Method 2] Using the vectorised if_else() function (Recommended)

system.time(
  output <- if_else(prices < 500,
                   str_c("This diamond costs $", prices, " and is affordable."),
                   str_c("This diamond costs $", prices, " and is too expensive!")
                   )
)
#>    user  system elapsed 
#>    0.06    0.00    0.06

Example: Using multiple if statements vs. if/else if/else statements

[Method 1] Using multiple if statements inside a for loop

  • Look out for situations like the below where we can use if/else if/else statements instead of multiple if statements
  • With multiple if statements, each of the if conditions need to be checked for every diamond price
output <- vector("integer", length(prices))

system.time(
  for (i in 1:length(prices)) {
    if (prices[i] < 200) {  # compare the price, not the index position
      output[i] <- 1
    }
    if (prices[i] >= 200 && prices[i] < 400) {
      output[i] <- 2
    }
    if (prices[i] >= 400 && prices[i] < 600) {
      output[i] <- 3
    }
    if (prices[i] >= 600 && prices[i] < 800) {
      output[i] <- 4
    }
    if (prices[i] >= 800 && prices[i] < 1000) {
      output[i] <- 5
    }
    if (prices[i] >= 1000 && prices[i] < 1500) {
      output[i] <- 6
    }
    if (prices[i] >= 1500 && prices[i] < 2000) {
      output[i] <- 7
    }
    if (prices[i] >= 2000) {
      output[i] <- 8
    }
  }
)
#>    user  system elapsed 
#>    0.03    0.00    0.03


[Method 2] Using if/else if/else statements inside a for loop

  • With if/else if/else statements, not all conditions below will be checked (only up to when one of the blocks get executed)
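  • Thus, we would expect a reduction in the processing time compared to Method 1, and more so the more if statements there are. With timings this small (hundredths of a second), a single run of system.time() is too noisy to show the difference reliably
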
output <- vector("integer", length(prices))

system.time(
  for (i in 1:length(prices)) {
    if (prices[i] < 200) {  # compare the price, not the index position
      output[i] <- 1
    } else if (prices[i] < 400) {
      output[i] <- 2
    } else if (prices[i] < 600) {
      output[i] <- 3
    } else if (prices[i] < 800) {
      output[i] <- 4
    } else if (prices[i] < 1000) {
      output[i] <- 5
    } else if (prices[i] < 1500) {
      output[i] <- 6
    } else if (prices[i] < 2000) {
      output[i] <- 7
    } else {
      output[i] <- 8
    }
  }
)
#>    user  system elapsed 
#>    0.03    0.00    0.04


[Method 3] Using the vectorised ifelse() function

  • Note that using a vectorised function when possible would still be the fastest
  • But there can be a “trade-off between code speed and code readability”, as nested ifelse() calls are hard to read (From Efficient R programming)

system.time(
  output <- ifelse(prices < 200, 1, ifelse(prices < 400, 2, ifelse(prices < 600, 3, 
                   ifelse(prices < 800, 4, ifelse(prices < 1000, 5, ifelse(prices < 1500, 6, 
                   ifelse(prices < 2000, 7, 8)))))))
)
#>    user  system elapsed 
#>    0.00    0.02    0.01
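
One possible way to keep the speed of a vectorised approach while avoiding deep nesting is the base R function findInterval(), which reports how many cutoffs each value has reached. This is an alternative technique, not the approach used above. The sketch below uses a few made-up prices (toy_prices), including values that sit exactly on the cutoffs, and compares the result with the nested ifelse() version:

```r
# Hypothetical prices, including values exactly on the cutoffs
toy_prices <- c(150, 200, 399, 400, 650, 999, 1000, 1750, 2000, 5000)

# Nested ifelse() version, as in Method 3
nested <- ifelse(toy_prices < 200, 1, ifelse(toy_prices < 400, 2, ifelse(toy_prices < 600, 3,
          ifelse(toy_prices < 800, 4, ifelse(toy_prices < 1000, 5, ifelse(toy_prices < 1500, 6,
          ifelse(toy_prices < 2000, 7, 8)))))))

# findInterval() counts how many cutoffs are <= each price (0 through 7),
# so adding 1 gives categories 1 through 8
cutoffs <- c(200, 400, 600, 800, 1000, 1500, 2000)
binned <- findInterval(toy_prices, cutoffs) + 1

binned
#> [1] 1 2 2 3 4 5 6 7 8 8
all(nested == binned)
#> [1] TRUE
```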

5 Functions

5.1 What are functions

What are functions?

  • Functions are pre-written bits of code that accomplish some task
  • Functions allow you to “automate” tasks that you perform more than once
  • We can call functions whenever we want to use them, again and again

Functions generally follow three sequential steps:

  1. Take in input object(s)
  2. Process the input
  3. Return a new object, which may be a vector, data-frame, plot, etc.

5.1.1 Functions from packages written by others

We’ve been working with functions all quarter.


Example: The sum() function

?sum
  1. Input: Takes in a vector of elements (class must be numeric or logical)
  2. Processing: Calculates the sum of elements
  3. Return: Returns a numeric vector of length 1 whose value is the sum of the input vector
    • You can use str() to investigate the return value of the function
# Apply sum() to atomic vector
sum(c(1,2,3))
#> [1] 6
sum(c(1,2,3)) %>% str()
#>  num 6

5.1.2 User-written functions

What are “user-written functions”? [my term]

  • User-written functions are functions you write to perform some specific task
  • It can often be for a data-manipulation or analysis task specific to your project

Like all functions, user-written functions usually follow three steps:

  1. Take in one or more input object(s)
  2. Process the input
    • This may include utilizing existing functions from other packages, for example sum() or length()
  3. Return a new object


We can think of user-written functions as analogous to mathematical functions that take one or more input variables

  • Below, we create the function f_xy, which takes two input variables: x and y

f_xy <- function(x,y) {
  # function body
  x^2 + y
}

f_xy(x=2,y=2)
#> [1] 6
f_xy(x=3,y=1)
#> [1] 10

#what the function returns
f_xy(x=3,y=1) %>% str() # a numeric atomic vector of length 1
#>  num 10

# assign the output of function to an object
z <- f_xy(x=3,y=1)
z
#> [1] 10


Examples of what we might want to write a function for:

  • Write a function to read in annual data, then call function for each year
  • Create interactive maps. e.g., see maps from policy report on off-campus recruiting by public research universities
    • Link to maps HERE
    • R code for interactive maps developed by Karina Salazar, modified by Crystal Han
  • Ben Skinner recommends [paraphrasing] writing “short functions that do one thing and do it well”


When to write a function?

  • Since functions are reusable pieces of code, they allow you to “automate” tasks that you perform more than once

    • E.g., Check to see if IPEDS data file is already downloaded. If not, then download the data.
    • E.g., Write function that runs regression model and creates results table
  • The alternative to writing a function to perform some specific task (aside from loops/iteration) is to copy and paste the code each time you want to perform a task

  • Wickham and Grolemund chapter 19.2:

    “You should consider writing a function whenever you’ve copied and pasted a block of code more than twice”

  • Darin Christensen (professor, UCLA public policy) refers to the programming mantra DRY (“Don’t Repeat Yourself”)

    “Functions enable you to perform multiple tasks (that are similar to one another) without copying the same code over and over”


Why write functions to complete a task? (as opposed to the copy-and-paste approach)

  • As task requirements change (and they always do!), you only need to revise code in one place rather than many places
  • Reduce errors that are common in copy-and-paste approach (e.g., forgetting to change variable name or variable value)
  • Functions give you an opportunity to make continual improvements to completing a task
    • E.g., you realize your function does not work for certain input values, so you modify function to handle those values


How to approach writing functions? (broad recipe)

  1. Experiment with performing the task outside of a function
    • Experiment with performing task with different sets of inputs
    • Often you must revise this code later, because an approach that worked outside a function does not always work within a function
  2. Write the function
  3. Test the function
    • Try to “break” it
  4. Continual improvement. As you use the function, make continual improvements going back-and-forth between steps 1-3

5.2 Basics of writing functions

Often, the functions we write will utilize existing functions from Base R and other R packages. For example, we will create a function named z_score() that calculates how many standard deviations an observation is from the mean. Our z_score() function will use the existing Base R mean() and sd() functions.

We will avoid creating user-written functions that utilize Tidyverse functions, particularly functions from the dplyr package such as group_by(). The reason is that including certain Tidyverse/dplyr functions in a user-written function requires knowledge of some advanced programming skills that we have not introduced yet. For more explanation, see here and here.

Therefore, when teaching how to write functions that perform data manipulation tasks, we will use a “Base R approach” rather than a “Tidyverse approach.”

5.2.1 Components of a function

The function() function tells R that you are writing a function:

# To find help file for function():
  ?`function` # But help file is not a helpful introduction


function_name <- function(arg1, arg2, arg3) {
  # function body
}

Three components of a function:

  1. Function name
    • Define a function using function() and give it a name using the assignment operator <-
  2. Function arguments (sometimes called “inputs”)
    • Inputs that the function takes; they go inside the parentheses of function()
      • Can be vectors, data frames, logical statements, strings, etc.
    • In the above hypothetical code, the function took three inputs arg1, arg2, arg3, but we could have written:
      • function(x, y, z) or function(Larry, Curly, Moe)
    • In the “function call,” you specify values to assign to these function arguments
  3. Function body
    • What the function does to the inputs
    • Function body goes inside the pair of curly brackets ({}) that follows function()
    • Above hypothetical function doesn’t do anything, but your function can return a value (covered in later section)
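
To make the three components concrete, here is a small sketch. The function add_two() is an illustrative name, not part of the lecture. Base R's formals() and body() let us inspect the arguments and body of any function we define.

```r
# 1. Function name: add_two
# 2. Function arguments: x and y
# 3. Function body: the code inside the curly brackets
add_two <- function(x, y) {
  x + y
}

names(formals(add_two))  # argument names
#> [1] "x" "y"
body(add_two)            # prints the function body
add_two(x = 1, y = 2)    # function call: values are assigned to the arguments
#> [1] 3
```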

5.2.3 z_score() function

The z-score for an observation i is the number of standard deviations that the observation lies from the mean:

  • \(z_i = \frac{x_i - \bar{x}}{sd(x)}\)

Task: Write function called z_score() that calculates the z-score for each element of a vector

# Expected output
z_score(c(1, 2, 3, 4, 5))
#> [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111
Step 1: Perform task outside of function

Create a vector of numbers we’ll use to calculate z-score:

v <- c(1, 2, 3, 4, 5)
v
#> [1] 1 2 3 4 5

typeof(v)
#> [1] "double"
class(v)
#> [1] "numeric"
length(v)
#> [1] 5

v[1]  # 1st element of v
#> [1] 1
v[4]  # 4th element of v
#> [1] 4


We can calculate the z-score using the Base R mean() and sd() functions:

  • \(z_i = \frac{x_i - \bar{x}}{sd(x)}\)
mean(v)
#> [1] 3
sd(v)
#> [1] 1.581139


Calculate z-score for some value:

(1-mean(v))/sd(v)
#> [1] -1.264911
(4-mean(v))/sd(v)
#> [1] 0.6324555


Calculate z-score for particular elements of vector v:

v[1]
#> [1] 1
(v[1]-mean(v))/sd(v)
#> [1] -1.264911

v[4]
#> [1] 4
(v[4]-mean(v))/sd(v)
#> [1] 0.6324555


Calculate z_i for all elements of vector v:

v
#> [1] 1 2 3 4 5
(v-mean(v))/sd(v)
#> [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

Step 2: Write the function

Write function to calculate z-score for all elements of the vector:

z_score <- function(x) {
  (x - mean(x))/sd(x)
}
  1. Function name
    • Function name is z_score
  2. Function arguments
    • z_score() function takes an object x as input to calculate the z-score for
  3. Function body
    • Body of z_score() calculates z-score of input (i.e., for each element of x, calculate difference between value of element and mean value of elements, then divide by standard deviation of elements)
    • Return value: A numeric vector containing z-scores calculated from input


Test/call the function:

z_score(x = c(1, 2, 3, 4, 5))
#> [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

  # investigate what function returns
  z_score(x = c(1, 2, 3, 4, 5)) %>% str()
#>  num [1:5] -1.265 -0.632 0 0.632 1.265

v
#> [1] 1 2 3 4 5
z_score(x = v)
#> [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

seq(20, 25)
#> [1] 20 21 22 23 24 25
z_score(x = seq(20, 25))
#> [1] -1.3363062 -0.8017837 -0.2672612  0.2672612  0.8017837  1.3363062

  #you could even create a new object whose values are the output/return of the function
  z_object <- z_score(x = c(1, 2, 3, 4, 5))
  z_object
#> [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111
  z_object %>% str()
#>  num [1:5] -1.265 -0.632 0 0.632 1.265



Task: Improve the z_score() function by trying to break it

Test 1: Handling NA values

Let’s see what happens when we try passing in a vector containing NA to our z_score() function:

w <- c(NA, 1:5, NA)
w
#> [1] NA  1  2  3  4  5 NA
z_score(x=w)
#> [1] NA NA NA NA NA NA NA


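What went wrong? By default, mean() and sd() return NA whenever the input contains an NA value, so every z-score becomes NA. Let’s revise our function to handle NA values by setting na.rm=TRUE: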

z_score <- function(x) {
  (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)
}

w
#> [1] NA  1  2  3  4  5 NA
z_score(w)
#> [1]         NA -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111         NA
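
One way to sanity-check the revised function is to compare it against Base R's scale(), which also centers by the mean and divides by the standard deviation while omitting NA values from those calculations. This is a sketch for cross-checking only, not the approach the lecture builds on.

```r
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

w <- c(NA, 1:5, NA)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
all.equal(z_score(w), as.numeric(scale(w)))
#> [1] TRUE
```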

Test 2: Applying function to variables from a dataframe

Create dataframe called df:

set.seed(12345) # set "seed" so we all get the same "random" numbers
df <- tibble(
  a = c(NA,rnorm(5)),
  b = c(NA,rnorm(5)),
  c = c(NA,rnorm(5))
)
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"
df
#> # A tibble: 6 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1 NA     NA     NA    
#> 2  0.586 -1.82  -0.116
#> 3  0.709  0.630  1.82 
#> 4 -0.109 -0.276  0.371
#> 5 -0.453 -0.284  0.520
#> 6  0.606 -0.919 -0.751

# subset a data frame w/ one element, using []
df["a"] 
#> # A tibble: 6 × 1
#>        a
#>    <dbl>
#> 1 NA    
#> 2  0.586
#> 3  0.709
#> 4 -0.109
#> 5 -0.453
#> 6  0.606
str(df["a"])
#> tibble [6 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:6] NA 0.586 0.709 -0.109 -0.453 ...

# subset values of an element using [[]] or $
df[["a"]]
#> [1]         NA  0.5855288  0.7094660 -0.1093033 -0.4534972  0.6058875
str(df[["a"]])
#>  num [1:6] NA 0.586 0.709 -0.109 -0.453 ...

df$a
#> [1]         NA  0.5855288  0.7094660 -0.1093033 -0.4534972  0.6058875
str(df$a)
#>  num [1:6] NA 0.586 0.709 -0.109 -0.453 ...


Experiment with components of z-score, outside of a function:

mean(df[["a"]], na.rm=TRUE)  # mean of variable "a"
#> [1] 0.2676164
sd(df[["a"]], na.rm=TRUE)  # std dev of variable "a"
#> [1] 0.5178803

mean(df$a, na.rm=TRUE)  # mean of variable "a"
#> [1] 0.2676164
sd(df$a, na.rm=TRUE)  # std dev of variable "a"
#> [1] 0.5178803

# Would these work?
  # mean(df["a"], na.rm=TRUE)  # mean of variable "a"
  # sd(df["a"], na.rm=TRUE)  # std dev of variable "a"

# Manually calculate z-score for second observation in variable "a"
df$a[2]
#> [1] 0.5855288
(df$a[2] - mean(df$a, na.rm=TRUE))/sd(df$a, na.rm=TRUE)
#> [1] 0.6138725

# Manually calculate z-score for all observations in variable "a"
df$a
#> [1]         NA  0.5855288  0.7094660 -0.1093033 -0.4534972  0.6058875
df$a %>% length()
#> [1] 6
(df$a - mean(df$a, na.rm=TRUE))/sd(df$a, na.rm=TRUE)
#> [1]         NA  0.6138725  0.8531888 -0.7278124 -1.3924329  0.6531840


Apply z_score() function to variables in dataframe:

# z_score() function to calculate z-score for each obs of variable "a"
df$a
#> [1]         NA  0.5855288  0.7094660 -0.1093033 -0.4534972  0.6058875
z_score(x = df$a)
#> [1]         NA  0.6138725  0.8531888 -0.7278124 -1.3924329  0.6531840
z_score(x = df[["a"]])
#> [1]         NA  0.6138725  0.8531888 -0.7278124 -1.3924329  0.6531840


# This approach doesn't work:
  # z_score(x = df["a"]) 
  # Why?:
    # df["a"] is a dataframe with one variable
    # you can't apply mean() or sd() functions to list/data frame object, only numeric atomic vector

# z-score for each obs of variable "b"
z_score(x = df[["b"]])
#> [1]         NA -1.4182167  1.2847832  0.2841184  0.2753122 -0.4259971

  # investigate the object returned by the function call
  z_score(x = df[["b"]]) %>% str()
#>  num [1:6] NA -1.418 1.285 0.284 0.275 ...
  
  # could create a new object whose values are the output/return of the function
  z_object <- z_score(x = df[["b"]])
  z_object %>% str()
#>  num [1:6] NA -1.418 1.285 0.284 0.275 ...
  
  # could even create new object that is a new variable in data frame
  df$b_z <- z_score(x = df[["b"]])
  
  df[["b_z"]] <- z_score(x = df[["b"]]) # same same
  
  
  df %>% glimpse()
#> Rows: 6
#> Columns: 4
#> $ a   <dbl> NA, 0.5855288, 0.7094660, -0.1093033, -0.4534972, 0.6058875
#> $ b   <dbl> NA, -1.8179560, 0.6300986, -0.2761841, -0.2841597, -0.9193220
#> $ c   <dbl> NA, -0.1162478, 1.8173120, 0.3706279, 0.5202165, -0.7505320
#> $ b_z <dbl> NA, -1.4182167, 1.2847832, 0.2841184, 0.2753122, -0.4259971
  df$b_z %>% str()
#>  num [1:6] NA -1.418 1.285 0.284 0.275 ...
  
  df$b_z <- NULL # delete variable b_z



Task: Use the z_score() function to create a new variable that is the z-score version of a variable

Example 1: Creating a new z-score variable for the df dataframe

First, briefly review how to create and delete variables using Base R approach:

df
#> # A tibble: 6 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1 NA     NA     NA    
#> 2  0.586 -1.82  -0.116
#> 3  0.709  0.630  1.82 
#> 4 -0.109 -0.276  0.371
#> 5 -0.453 -0.284  0.520
#> 6  0.606 -0.919 -0.751

df$c_plus2 <- df$c + 2  # create variable equal to "c" plus 2
df
#> # A tibble: 6 × 4
#>        a      b      c c_plus2
#>    <dbl>  <dbl>  <dbl>   <dbl>
#> 1 NA     NA     NA       NA   
#> 2  0.586 -1.82  -0.116    1.88
#> 3  0.709  0.630  1.82     3.82
#> 4 -0.109 -0.276  0.371    2.37
#> 5 -0.453 -0.284  0.520    2.52
#> 6  0.606 -0.919 -0.751    1.25

df$c_plus2 <- NULL  # remove variable "c_plus2"
df
#> # A tibble: 6 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1 NA     NA     NA    
#> 2  0.586 -1.82  -0.116
#> 3  0.709  0.630  1.82 
#> 4 -0.109 -0.276  0.371
#> 5 -0.453 -0.284  0.520
#> 6  0.606 -0.919 -0.751


Use z_score() function to create a new variable that equals the z-score of another variable.

  • Simply calling the z_score() function does not create a new variable:
z_score(x = df$c)
#> [1]           NA -0.510074390  1.525451514  0.002476613  0.159953743
#> [6] -1.177807481
df
#> # A tibble: 6 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1 NA     NA     NA    
#> 2  0.586 -1.82  -0.116
#> 3  0.709  0.630  1.82 
#> 4 -0.109 -0.276  0.371
#> 5 -0.453 -0.284  0.520
#> 6  0.606 -0.919 -0.751


  • Instead of modifying the z_score() function so that the variable is assigned within the function, the preferred approach is to call the z_score() function after the assignment operator <-:
df$c_z <- z_score(x = df$c)

# examine data frame
df
#> # A tibble: 6 × 4
#>        a      b      c      c_z
#>    <dbl>  <dbl>  <dbl>    <dbl>
#> 1 NA     NA     NA     NA      
#> 2  0.586 -1.82  -0.116 -0.510  
#> 3  0.709  0.630  1.82   1.53   
#> 4 -0.109 -0.276  0.371  0.00248
#> 5 -0.453 -0.284  0.520  0.160  
#> 6  0.606 -0.919 -0.751 -1.18
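
Because z_score() returns a vector rather than modifying the data frame, the same assignment pattern can be repeated for several variables with a loop. The sketch below uses a small made-up data frame, df_demo, and Base R's paste0() to build the new variable names; both are assumptions for illustration.

```r
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Small illustrative data frame (values are made up)
df_demo <- data.frame(a = c(NA, 1, 2, 3), b = c(10, 20, NA, 30))

for (v in c("a", "b")) {
  new_name <- paste0(v, "_z")                   # e.g., "a_z"
  df_demo[[new_name]] <- z_score(df_demo[[v]])  # assign outside the function
}

names(df_demo)
#> [1] "a"   "b"   "a_z" "b_z"
df_demo$a_z
#> [1] NA -1  0  1
```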

Example 2: Creating a new z-score variable for the recruiting dataset

We can apply our function to a “real” dataset too:

#load dataset with one obs per recruiting event
load(url("https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/recruit_event_somevars.RData"))

df_event_small <- df_event[1:10,] %>% # keep first 10 observations
  select(instnm,univ_id,event_type,med_inc) # keep 4 vars

df_event_small
#> # A tibble: 10 × 4
#>    instnm      univ_id event_type med_inc
#>    <chr>         <int> <chr>        <dbl>
#>  1 UM Amherst   166629 public hs   71714.
#>  2 UM Amherst   166629 public hs   89122.
#>  3 UM Amherst   166629 public hs   70136.
#>  4 UM Amherst   166629 public hs   70136.
#>  5 Stony Brook  196097 public hs   71024.
#>  6 USCC         218663 private hs  71024.
#>  7 UM Amherst   166629 private hs  71024.
#>  8 UM Amherst   166629 public hs   97225 
#>  9 UM Amherst   166629 private hs  97225 
#> 10 UM Amherst   166629 public hs   77800.

#show observations for variable med_inc
df_event_small$med_inc
#>  [1] 71713.5 89121.5 70136.5 70136.5 71023.5 71023.5 71023.5 97225.0 97225.0
#> [10] 77799.5

#calculate z-score of variable med_inc (without assignment)
z_score(x = df_event_small$med_inc)
#>  [1] -0.60825958  0.91982879 -0.74668992 -0.74668992 -0.66882834 -0.66882834
#>  [7] -0.66882834  1.63116060  1.63116060 -0.07402556

#assign new variable equal to the z-score of med_inc
df_event_small$med_inc_z <- z_score(x = df_event_small$med_inc)

#inspect
df_event_small %>% head(5)
#> # A tibble: 5 × 5
#>   instnm      univ_id event_type med_inc med_inc_z
#>   <chr>         <int> <chr>        <dbl>     <dbl>
#> 1 UM Amherst   166629 public hs   71714.    -0.608
#> 2 UM Amherst   166629 public hs   89122.     0.920
#> 3 UM Amherst   166629 public hs   70136.    -0.747
#> 4 UM Amherst   166629 public hs   70136.    -0.747
#> 5 Stony Brook  196097 public hs   71024.    -0.669



Task: Improve the z_score() function by first checking whether input x is valid

Step 1: Breaking current function with invalid input

Current function:

z_score <- function(x) {
  (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)
}
#?mean
#?sd


What kind of input is our current function limited to?

  • z_score() function does simple arithmetic and utilizes the mean() and sd() functions
  • mean() and sd() functions require x to be a numeric (or logical) atomic vector
    • z_score() function will break if the input x is not an atomic vector
    • z_score() function will break if the input x is not a numeric/logical atomic vector
#function works on below numeric atomic vector

  str(df_event_small$med_inc)
  str(df_event_small[["med_inc"]]) # same same

#function doesn't work if input is a list/dataframe

  str(df_event_small["med_inc"]) # investigate object

  z_score(x = df_event_small["med_inc"]) # try applying z_score function to object

#function doesn't work if x is not a numeric vector
  str(df_event_small$instnm)
  
  z_score(x = df_event_small$instnm)

Step 2: Modify the function to handle invalid inputs

We could modify z_score() by using conditional statements to calculate the z-score only if input object x is a numeric or logical atomic vector. We test this with is.numeric() and is.logical() rather than comparing class(x) to a string, because class(x) can have length greater than one (e.g., a tibble has three classes). A condition of length greater than one inside || triggers a warning in R 4.2 and an error in R 4.3 and later:

z_score <- function(x) {
  # is.numeric() is TRUE for double and integer vectors, FALSE for data frames, lists, and character vectors
  if (is.numeric(x) || is.logical(x)) {
    (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)
  }
}


We no longer run into errors if we supply an invalid input:

# Test with list/dataframe input
str(df_event_small["med_inc"])
#> tibble [10 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ med_inc: num [1:10] 71714 89122 70137 70137 71024 ...

z_score(x = df_event_small["med_inc"])

  #investigate what this function call returns
  z_score(x = df_event_small["med_inc"]) %>% str()
#>  NULL
  
# Test with character vector input
str(df_event_small$instnm)
#>  chr [1:10] "UM Amherst" "UM Amherst" "UM Amherst" "UM Amherst" ...

z_score(x = df_event_small$instnm)

  #investigate what this function call returns
  z_score(x = df_event_small$instnm) %>% str()
#>  NULL


Note that our function would return NULL if the input was invalid, so the new variable would not be created if we used <-:

str(df_event_small$instnm)
#>  chr [1:10] "UM Amherst" "UM Amherst" "UM Amherst" "UM Amherst" ...

# Invalid character vector input returns `NULL`
typeof(z_score(x = df_event_small$instnm))
#> [1] "NULL"

# We would not see new variable/column `instnm_z`
df_event_small$instnm_z <- z_score(x = df_event_small$instnm)
df_event_small %>% head(5)
#> # A tibble: 5 × 5
#>   instnm      univ_id event_type med_inc med_inc_z
#>   <chr>         <int> <chr>        <dbl>     <dbl>
#> 1 UM Amherst   166629 public hs   71714.    -0.608
#> 2 UM Amherst   166629 public hs   89122.     0.920
#> 3 UM Amherst   166629 public hs   70136.    -0.747
#> 4 UM Amherst   166629 public hs   70136.    -0.747
#> 5 Stony Brook  196097 public hs   71024.    -0.669
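
Silently returning NULL can hide mistakes, as the missing instnm_z column shows. An alternative sketch is to raise an error with Base R's stop() when the input is invalid. The name z_score_strict() and the wording of the error message are hypothetical.

```r
z_score_strict <- function(x) {
  if (!(is.numeric(x) || is.logical(x))) {
    stop("`x` must be a numeric or logical atomic vector")
  }
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

z_score_strict(c(1, 2, 3))
#> [1] -1  0  1

# tryCatch() captures the error message instead of halting the script
tryCatch(
  z_score_strict(c("a", "b")),
  error = function(e) conditionMessage(e)
)
#> [1] "`x` must be a numeric or logical atomic vector"
```

With this version, an assignment such as `df$new_var <- z_score_strict(df$some_character_var)` would fail loudly rather than quietly creating nothing.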


5.2.4 Student exercise [OPTIONAL]

Some common tasks when working with survey data:

  • Identify number of observations with NA values for a specific variable
  • Identify number of observations with negative values for a specific variable
  • Replace negative values with NA for a specific variable


5.2.4.1 num_negative() function

Task: Write function called num_negative()

  • Write a function that counts the number of observations with negative values for a specific variable
  • Apply this function to variables from dataframe df (created below)
  • Adapted from Ben Skinner’s Programming 1 R Workshop HERE
# Sample dataframe `df` that contains some negative values
df
#> # A tibble: 100 × 4
#>       id   age sibage parage
#>    <int> <dbl>  <dbl>  <dbl>
#>  1     1    17      8     49
#>  2     2    15    -97     46
#>  3     3   -97    -97     53
#>  4     4    13     12     -4
#>  5     5   -97     10     47
#>  6     6    12     10     52
#>  7     7   -99      5     51
#>  8     8   -97     10     55
#>  9     9    16      6     51
#> 10    10    16    -99     -8
#> # … with 90 more rows


Recommended steps:

  • Perform task outside of function
    • HINT: sum(data_frame_name$var_name<0)
  • Write function
  • Apply/test function on variables

Step 1: Perform task outside of function
names(df) # identify variable names
#> [1] "id"     "age"    "sibage" "parage"
df$age # print observations for a variable
#>   [1]  17  15 -97  13 -97  12 -99 -97  16  16 -98  20 -99  20  11  20  12  17
#>  [19]  19  17 -97 -99  12  13  11  15  20  14 -99  11  20 -98  11 -98  12  16
#>  [37]  12  18  12  19  12 -97  20  17  11  19  19  12 -98  11  15  18  15 -98
#>  [55]  15  19 -97  13 -98  16  13  12  16  19 -99  19 -98  13 -97  20  15  19
#>  [73]  15  12  18 -99  18 -98 -98 -98 -97  12  14  19 -97  11  20  18  14 -99
#>  [91]  15  20 -97  14  14  19  18  17  20  15

#BaseR
sum(df$age<0) # count number of obs w/ negative values for variable "age"
#> [1] 27

Step 2: Write function
num_negative <- function(x){
  sum(x<0)
}

Step 3: Apply function
num_negative(df$age)
#> [1] 27
num_negative(df$sibage)
#> [1] 22


5.2.4.2 num_missing() function

In survey data, negative values often indicate the reason a value is missing:

  • E.g., -8 refers to “didn’t take survey”
  • E.g., -7 refers to “took survey, but didn’t answer this question”

Task: Write function called num_missing()

  • Write a function that counts number of missing observations for a variable and allows you to specify which values are associated with missing for that variable. This function will take two arguments:
    • x: The variable (e.g., df$sibage)
    • miss_vals: Vector of values you want to treat as “missing” for that variable
      • Values to associate with missing for df$age: -97,-98,-99
      • Values to associate with missing for df$sibage: -97,-98,-99
      • Values to associate with missing for df$parage: -4,-7,-8


Recommended steps:

  • Perform task outside of function
    • HINT: sum(data_frame_name$var_name %in% c(-4,-5))
  • Write function
  • Apply/test function on variables

Step 1: Perform task outside of function
sum(df$age %in% c(-97,-98,-99))
#> [1] 27

Step 2: Write function
num_missing <- function(x, miss_vals){

  sum(x %in% miss_vals)
}

Step 3: Apply function
num_missing(df$age,c(-97,-98,-99))
#> [1] 27
num_missing(df$sibage,c(-97,-98,-99))
#> [1] 22
num_missing(df$parage,c(-4,-7,-8))
#> [1] 17
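
The third task listed at the start of this exercise, replacing missing-value codes with NA, can follow the same pattern. The sketch below is one possible approach; replace_missing() is a hypothetical name and age_demo is made-up data.

```r
# Hypothetical helper: replace user-specified "missing" codes with NA
replace_missing <- function(x, miss_vals) {
  x[x %in% miss_vals] <- NA
  x  # return the modified vector
}

age_demo <- c(17, 15, -97, 13, -99)  # made-up values
replace_missing(age_demo, c(-97, -98, -99))
#> [1] 17 15 NA 13 NA
sum(is.na(replace_missing(age_demo, c(-97, -98, -99))))
#> [1] 2
```

As with z_score(), the function returns a new vector, so you would assign the result back to the variable (e.g., with <-) to change the data frame.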


5.3 Function arguments

5.3.1 Default values

What are default values for arguments?

  • The default value for an argument is the value that will be used if the argument value was not supplied during the function call
  • When writing the function, you can specify the default value for an argument using name=value
  • Most Base R functions and functions from other packages specify default values for one or more arguments


Example: str_c() function

The str_c() function has default values for sep and collapse:

  • Syntax
    • str_c(..., sep = "", collapse = NULL)
  • Arguments
    • ...: One or more character vectors to join, separated by commas
    • sep: String to insert between input vectors
      • Default value: sep = ""
    • collapse: Optional string used to combine input vectors into single string
      • After joining vectors into single string within each element, should resulting elements be combined into a single string? If so, what string to insert between elements?
      • Default value: collapse = NULL is to not combine elements into a single string
# We want to join the following two vectors element-wise into a single character vector
c("a","b")
#> [1] "a" "b"
c(1,2)
#> [1] 1 2

# manually specifying default values
str_c(c("a", "b"), c(1, 2), sep = "", collapse = NULL)
#> [1] "a1" "b2"

# If we don't specify `sep` and `collapse`, they take the default values
str_c(c("a", "b"), c(1, 2))
#> [1] "a1" "b2"

# specify value for `sep` that overrides default value
str_c(c("a", "b"), c(1, 2), sep = "~")
#> [1] "a~1" "b~2"
length(str_c(c("a", "b"), c(1, 2), sep = "~")) # resulting vector has length = 2
#> [1] 2

# specify value for `collapse` that overrides default
str_c(c("a", "b"), c(1, 2), collapse = "|")
#> [1] "a1|b2"
length(str_c(c("a", "b"), c(1, 2), collapse = "|"))  # resulting vector has length = 1
#> [1] 1

# specify alternative values for both `sep` and `collapse`
  #str_c(c("a", "b"), c(1, 2), sep = "~", collapse = "|")

Example: Adding a default value to our z_score() function

Recall the z_score() function we developed previously, where we wrote this function to remove NA values prior to calculating z-score:

z_score <- function(x) {
  (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)
}

w <- c(NA, 1:5, NA)
w
#> [1] NA  1  2  3  4  5 NA
z_score(w)
#> [1]         NA -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111         NA


We could add an argument (named na) that specifies whether NAs should be removed prior to calculating z-scores:

z_score <- function(x, na) {
  (x - mean(x, na.rm=na))/sd(x, na.rm=na)
}

w
#> [1] NA  1  2  3  4  5 NA
z_score(x=w, na=TRUE)
#> [1]         NA -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111         NA
z_score(x=w, na=FALSE)
#> [1] NA NA NA NA NA NA NA
#z_score(w) # error: argument "na" is missing, with no default


We could also add a default value for the na argument. Following a conservative approach, we’ll specify the default value as FALSE, which means that any NA values in input vector x will result in a z-score of NA for all observations:

z_score <- function(x, na = FALSE) {
  (x - mean(x, na.rm=na))/sd(x, na.rm=na)
}

w
#> [1] NA  1  2  3  4  5 NA

z_score(x=w) # uses default value of FALSE
#> [1] NA NA NA NA NA NA NA
z_score(w, na= FALSE) # manually specify default value
#> [1] NA NA NA NA NA NA NA
z_score(w, na = TRUE) # override default value
#> [1]         NA -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111         NA
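
The same idea applies to the num_missing() function from the student exercise. As a sketch, we could make the most common missing codes the default for miss_vals and override them only when a variable uses different codes. The choice of default and the demo vectors below are assumptions for illustration.

```r
num_missing <- function(x, miss_vals = c(-97, -98, -99)) {
  sum(x %in% miss_vals)
}

age_demo    <- c(17, 15, -97, 13, -99)  # made-up values
parage_demo <- c(49, -4, 53, -8, 47)    # made-up values

num_missing(age_demo)                                # uses default miss_vals
#> [1] 2
num_missing(parage_demo, miss_vals = c(-4, -7, -8))  # overrides default
#> [1] 2
```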


5.3.2 Dot-dot-dot (...)

Many functions take an arbitrary number of arguments, including:

  • select()

    #?select
    select(df_event,instnm,univ_id,event_type,med_inc) %>% names()
    #> [1] "instnm"     "univ_id"    "event_type" "med_inc"
  • sum()

    #?sum
    sum(3,3,2,2,1,1)
    #> [1] 12
  • str_c()

    #?str_c
    
    # 1 character vector as input
    str_c(c("a", "b", "c"))
    #> [1] "a" "b" "c"
    
    # 2 character vectors as input
    str_c(c("a", "b", "c"), " is for ")
    #> [1] "a is for " "b is for " "c is for "
    
    # 3 character vectors as input
    str_c(c("a", "b", "c"), " is for ", c("apple", "banana", "coffee"))
    #> [1] "a is for apple"  "b is for banana" "c is for coffee"


All of these functions rely on a special argument ... (pronounced “dot-dot-dot”)

  • Dot-dot-dot (...) allows a function to take an arbitrary number of arguments

  • Wickham and Grolemund chapter 19.5.3 states:

    “... captures any number of arguments that aren’t otherwise matched.”


When writing functions, there are two primary uses of including ... arguments:

  1. A means of allowing the function to take an arbitrary number of arguments, as in the select() and sum() functions
  2. When we write our own function with the special argument ..., we can pass those inputs into another function that takes ... (e.g., str_c())


Example: Adding dot-dot-dot (...) as function argument

Recall the first iteration of our print_hello() function, which basically just printed a name that we specified in function call. Let’s modify the function to make it take an arbitrary number of names to greet:

  • Function that only took one argument

    # Define function
    print_hello1 <- function(x) {  
      str_c("Hello ", x, "!") 
    }
    
    # Call function
    print_hello1(x="Ozan")
    #> [1] "Hello Ozan!"
  • Modify function to take an arbitrary number of names to greet

    # Define function
    print_hello2 <- function(...) {  # The function accepts an arbitrary number of inputs
      str_c("Hello ", str_c(..., sep = ", "), "!")  # Pass the `...` to `str_c()`
    }
    
    # Call function
    print_hello2("Dasher", "Dancer", "Prancer", "Vixen")
    #> [1] "Hello Dasher, Dancer, Prancer, Vixen!"
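
Both uses of ... can be sketched with Base R alone. The function names count_inputs() and z_score_dots() are hypothetical. The second function forwards whatever is supplied in ... to mean() and sd(), so na.rm can be passed through without being a named argument of our function.

```r
# Use 1: capture an arbitrary number of inputs with list(...)
count_inputs <- function(...) {
  length(list(...))
}
count_inputs("a", "b", "c")
#> [1] 3

# Use 2: forward `...` to other functions (here, na.rm reaches mean() and sd())
z_score_dots <- function(x, ...) {
  (x - mean(x, ...)) / sd(x, ...)
}
z_score_dots(c(NA, 1, 2, 3), na.rm = TRUE)
#> [1] NA -1  0  1
z_score_dots(c(NA, 1, 2, 3))  # nothing forwarded, so mean() and sd() use na.rm = FALSE
#> [1] NA NA NA NA
```

One trade-off of forwarding ... is that a misspelled argument name may be passed along silently rather than caught, so explicit named arguments (as in the na example earlier) are often easier to debug.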

5.3.3 Checking values

How to handle invalid inputs?

  • As seen previously in the z_score() example, one way to check for invalid inputs is using conditional statements
  • “It’s good practice to check important preconditions, and throw an error (with stop()), if they are not true” (R for Data Science)
    • Especially in the case where the invalid input does not cause the function to break, but gives an unintended output instead, we want to explicitly raise an error so this does not go unnoticed


stop() function (base R):

  • The stop() function “stops execution of the current expression and executes an error action”
?stop

# SYNTAX AND DEFAULT VALUES
stop(..., call. = TRUE, domain = NULL)

Example: Using stop() to check invalid name input to print_hello() function

Recall the original print_hello() function. It will not print a greeting if NA is supplied as the input:

print_hello <- function(x) {
  str_c("Hello, world. My name is", x, sep = " ", collapse = NULL)
}

print_hello("ozan")
#> [1] "Hello, world. My name is ozan"
print_hello(NA)
#> [1] NA


We can raise an error with a custom message if the input is NA:

print_hello <- function(x) {
  if (is.na(x)) {
    stop("`x` must not be `NA`")
  }
  
  str_c("Hello, world. My name is", x, sep = " ", collapse = NULL)
}

print_hello(x="ozan")
print_hello(x=NA) # raises an error with the message: `x` must not be `NA`


Example: Using stop() to check invalid date input to print_hello() function

Recall the version of print_hello() function that prints both the user’s name and age. It will not work properly if the birthdate input is not supplied in month-day-year format:

print_hello <- function(x, y) {
  age <- floor(as.numeric(as.duration(today() - mdy(y)), "years"))
  
  str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ", collapse = NULL)
}

print_hello(x = "Sumru Jaquette-Nasiali", y="04/05/2019") # this works
#> [1] "Hello, world. My name is Sumru Jaquette-Nasiali and I am 3 years old"

print_hello(x = "Sumru Jaquette-Nasiali", y="2019/04/05") # this does not
#> Warning: All formats failed to parse. No formats found.
#> [1] NA


We can raise an error with a custom message if the birthdate is not in the right format:

print_hello <- function(x, y) {
  if (is.na(mdy(y))) {
    stop("`y` must be in month-day-year format")
  }
  
  age <- floor(as.numeric(as.duration(today() - mdy(y)), "years"))
  
  str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ", collapse = NULL)
}

print_hello(x = "Sumru Jaquette-Nasiali", y = "04/05/2019")
print_hello(x = "Sumru Jaquette-Nasiali", y = "2019/04/05") # raises an error with the message: `y` must be in month-day-year format


We can also add the check for the name input as well:

print_hello <- function(x, y) {
  # Check name input `x`
  if (is.na(x)) {
    stop("`x` must not be `NA`")
  }
  
  # Check birthdate input `y`
  if (is.na(mdy(y))) {
    stop("`y` must be in month-day-year format")
  }
  
  age <- floor(as.numeric(as.duration(today() - mdy(y)), "years"))
  
  str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ", collapse = NULL)
}
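
When testing functions that call stop(), it can help to capture the error message with Base R's tryCatch() so that the rest of a script keeps running. The sketch below uses a hypothetical greet() function with Base R's paste0() rather than the lecture's print_hello(). It also checks type and length before is.na(), since a condition of length greater than one inside if () is itself an error in R 4.2 and later.

```r
# Hypothetical function with three precondition checks
greet <- function(name) {
  if (!is.character(name)) {
    stop("`name` must be a character vector")
  }
  if (length(name) != 1) {
    stop("`name` must have length 1")
  }
  if (is.na(name)) {
    stop("`name` must not be `NA`")
  }
  paste0("Hello, ", name, "!")
}

greet("Ozan")
#> [1] "Hello, Ozan!"

# Capture the error message rather than stopping the script
tryCatch(greet(NA_character_), error = function(e) conditionMessage(e))
#> [1] "`name` must not be `NA`"
tryCatch(greet(c("a", "b")), error = function(e) conditionMessage(e))
#> [1] "`name` must have length 1"
```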


5.4 Return values

Recall that functions generally follow three sequential steps:

  1. Take in input object(s)
  2. Process the input
  3. Return a new object, which may be a vector, data-frame, plot, etc.

5.4.1 Implicit returns

What are return values?

  • Just as functions can take inputs (i.e., arguments), functions can also return values as output
  • The last statement that the function evaluates will be automatically (i.e., implicitly) returned
    • e.g., if you want a function to return a data frame named df, you could have the last line of the function be this:
      • df
  • We can use the assignment operator <- to store returned values in a new object for future use

Recall the print_hello() function:

# Define function
print_hello <- function() {
  "Hello!"  # The last statement in the function is returned
}

# Call function
print_hello() %>% str()
#>  chr "Hello!"
h <- print_hello()  # We can show that `print_hello()` returns a value by storing it in `h`
h                   # `h` stores the value "Hello!"
#> [1] "Hello!"

5.4.2 Explicit returns

How can we explicitly return values from the function?

  • We can use return() to explicitly return a value from our function
  • This is commonly used when we want to return from the function early (e.g., inside an if block)
  • There can be multiple return() in a function
  • Returning from a function means exiting the function, so no other code below the point of return would be run

Recall the print_hello() function:

# Define function
print_hello <- function() {
  return("Hello!")   # Explicitly return "Hello!"
  print("Goodbye!")  # Since this is after `return()`, it never gets run
}

# Call function
print_hello()
#> [1] "Hello!"
h <- print_hello()  # `print_hello()` returns "Hello!"
h
#> [1] "Hello!"
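
A common use of an early return() is a guard clause that handles a special case at the top of the function and exits before the main calculation. The following is a sketch with a hypothetical safe_ratio() function:

```r
# Hypothetical example: return early when the denominator is zero
safe_ratio <- function(x, y) {
  if (y == 0) {
    return(NA_real_)  # exit immediately; the division below is never run
  }
  x / y  # implicitly returned when the guard clause is not triggered
}

safe_ratio(10, 2)
#> [1] 5
safe_ratio(10, 0)
#> [1] NA
```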

Example: Writing a function with multiple returns

Recall the previous example where we assess the prices of diamonds from the diamonds dataset from ggplot2. Let’s move the if/else if/else blocks inside of a function, then call the function from inside the loop.

As seen below, the last statement that the function evaluates (i.e., whichever if/else if/else block is run) will be implicitly returned:

assess_price <- function(price) {
  if (price < 500) {
    str_c("This diamond costs $", price, " and is affordable.")
  } else if (price < 1000) {
    str_c("This diamond costs $", price, " and is pricey...")
  } else {
    str_c("This diamond costs $", price, " and is too expensive!")
  }
}

assess_price(price=450)
#> [1] "This diamond costs $450 and is affordable."
assess_price(price=1050) %>% str()
#>  chr "This diamond costs $1050 and is too expensive!"

prices <- unique(diamonds$price)[23:27]
prices
#> [1]  405  552  553  554 2757
for (i in prices) {
  writeLines(assess_price(i))
}
#> This diamond costs $405 and is affordable.
#> This diamond costs $552 and is pricey...
#> This diamond costs $553 and is pricey...
#> This diamond costs $554 and is pricey...
#> This diamond costs $2757 and is too expensive!


But if we were to have another line after the conditional part, then that would be implicitly returned instead, since it is now the last statement in the function:

assess_price <- function(price) {
  if (price < 500) {
    str_c("This diamond costs $", price, " and is affordable.")
  } else if (price < 1000) {
    str_c("This diamond costs $", price, " and is pricey...")
  } else {
    str_c("This diamond costs $", price, " and is too expensive!")
  }
  
  "I can't afford that."  # This is now the last statement in the function that will be returned
}

for (i in prices) {
  writeLines(assess_price(i))
}
#> I can't afford that.
#> I can't afford that.
#> I can't afford that.
#> I can't afford that.
#> I can't afford that.


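We can use return() to explicitly return early from the function. Below, the first two branches return early, while the else branch only prints its message, so the function continues on to the last statement: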

assess_price <- function(price) {
  if (price < 500) {
    return(str_c("This diamond costs $", price, " and is affordable."))  # Return early
  } else if (price < 1000) {
    return(str_c("This diamond costs $", price, " and is pricey..."))  # Return early
  } else {
    writeLines(str_c("This diamond costs $", price, " and is too expensive!"))
  }
  
  "I can't afford that."
}

for (i in prices) {
  writeLines(assess_price(i))
}
#> This diamond costs $405 and is affordable.
#> This diamond costs $552 and is pricey...
#> This diamond costs $553 and is pricey...
#> This diamond costs $554 and is pricey...
#> This diamond costs $2757 and is too expensive!
#> I can't afford that.
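
To see the difference between what this last version prints and what it returns, we can store the result instead of passing it to writeLines(). The sketch below redefines the function under a hypothetical name, assess_price_v3(), and assumes base R's paste0() in place of str_c() so that it runs without loading any packages:

```r
assess_price_v3 <- function(price) {
  if (price < 500) {
    return(paste0("This diamond costs $", price, " and is affordable."))
  } else if (price < 1000) {
    return(paste0("This diamond costs $", price, " and is pricey..."))
  } else {
    # Printed as a side effect; this is NOT the return value
    writeLines(paste0("This diamond costs $", price, " and is too expensive!"))
  }

  "I can't afford that."  # Returned only when no earlier return() was reached
}

msg <- assess_price_v3(2757)
#> This diamond costs $2757 and is too expensive!
msg
#> [1] "I can't afford that."
```

The "too expensive" line appears while the function runs, but the only thing stored in msg is the last statement.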

5.4.3 Returning multiple values

How can we return multiple values from a function?

  • A function in R can return only one object
  • To return multiple values, one workaround is to create a list containing these items and return the list

Example: Returning multiple values from a function using a list

Let’s say we have the following function that filters the diamonds dataset by color, then generates some information on multiple characteristics (i.e., cut and clarity). For now, it is printing a frequency table for each characteristic to the screen:

diamond_info_by_color <- function(color) {
  # `.env$color` refers to the function argument; a bare `color == color`
  # would compare the column with itself and keep every row
  df <- diamonds %>% filter(color == .env$color)
  
  print(table(df$cut))
  print(table(df$clarity))
}

diamond_info_by_color(color = 'E')
#> 
#>      Fair      Good Very Good   Premium     Ideal 
#>       224       933      2400      2337      3903 
#> 
#>   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
#>  102 1713 2426 2470 1281  991  656  158
diamond_info_by_color('E') %>% str() # what is returned
#> 
#>      Fair      Good Very Good   Premium     Ideal 
#>       224       933      2400      2337      3903 
#> 
#>   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
#>  102 1713 2426 2470 1281  991  656  158 
#>  'table' int [1:8(1d)] 102 1713 2426 2470 1281 991 656 158
#>  - attr(*, "dimnames")=List of 1
#>   ..$ : chr [1:8] "I1" "SI2" "SI1" "VS2" ...


If we want to return the frequency tables from the function (i.e., return multiple objects), we can do so by combining them together into a single list and returning that list:

diamond_info_by_color <- function(color) {
  # `.env$color` refers to the function argument; a bare `color == color`
  # would compare the column with itself and keep every row
  df <- diamonds %>% filter(color == .env$color)
  
  list(cut_table = table(df$cut), clarity_table = table(df$clarity))  # implicitly return list
}

diamond_info_by_color('E')
#> $cut_table
#> 
#>      Fair      Good Very Good   Premium     Ideal 
#>       224       933      2400      2337      3903 
#> 
#> $clarity_table
#> 
#>   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
#>  102 1713 2426 2470 1281  991  656  158
diamond_info_by_color('E') %>% str()
#> List of 2
#>  $ cut_table    : 'table' int [1:5(1d)] 224 933 2400 2337 3903
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:5] "Fair" "Good" "Very Good" "Premium" ...
#>  $ clarity_table: 'table' int [1:8(1d)] 102 1713 2426 2470 1281 991 656 158
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:8] "I1" "SI2" "SI1" "VS2" ...


We can then store the returned list in an object using <-, and access the individual elements within the list using [[]] or $:

# Store returned list in `info`
info <- diamond_info_by_color('E')

info %>% str()
#> List of 2
#>  $ cut_table    : 'table' int [1:5(1d)] 224 933 2400 2337 3903
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:5] "Fair" "Good" "Very Good" "Premium" ...
#>  $ clarity_table: 'table' int [1:8(1d)] 102 1713 2426 2470 1281 991 656 158
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:8] "I1" "SI2" "SI1" "VS2" ...

# Access individual elements of the list
info[['cut_table']]
#> 
#>      Fair      Good Very Good   Premium     Ideal 
#>       224       933      2400      2337      3903


# Can also store individual elements in new objects
clarity_table <- info$clarity_table
clarity_table
#> 
#>   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
#>  102 1713 2426 2470 1281  991  656  158
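
The same pattern works for any mix of objects, not just tables. As a small sketch with made-up data, the hypothetical summarize_values() function below returns a number, a vector, and a logical in one named list:

```r
# Hypothetical function: bundles three different results into one list
summarize_values <- function(x) {
  list(
    mean = mean(x),
    range = range(x),
    has_missing = anyNA(x)
  )
}

stats <- summarize_values(c(2, 4, 6, 8))
stats$mean
#> [1] 5
stats[["range"]]
#> [1] 2 8
stats$has_missing
#> [1] FALSE
```

Naming the list elements lets the caller pull out each piece by name rather than by position.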

5.4.4 Pipeable functions

What are pipeable functions?

  • Pipeable functions return an object of the same type as the first argument they accept
  • This allows us to chain functions that accept/return objects of the same type using pipes (%>%)
    • E.g., The filter() and select() functions from tidyverse both accept a dataframe as the first argument and return a modified dataframe, so they can be chained together

      df %>% filter(...) %>% select(...)

Wickham distinguishes between 2 types of pipeable functions (R for Data Science, Chapter 19.6.2):

  • Transformation: Function takes the object passed in as the first argument and returns a modified version of it
    • E.g., filter() or select() functions from tidyverse
  • Side-effects: Function does not modify the first argument
    • E.g., A function that “performs an action on the object, like drawing a plot or saving a file”
    • The first argument should still be returned from these functions so that they remain pipeable

Example: Writing a pipeable function

Pipeable functions work not only with dataframes but with any type of object, such as an atomic vector. For example:

vec <- c(1, 2, 3, 4)
vec
#> [1] 1 2 3 4


#  These functions accept a vector as the first argument, modify it, then return it
add_two <- function(v) {
  v + 2
}

vec
#> [1] 1 2 3 4
add_two(v=vec)
#> [1] 3 4 5 6
vec %>% add_two() # same
#> [1] 3 4 5 6


times_three <- function(v) {
  v * 3
}

vec
#> [1] 1 2 3 4
times_three(v=vec)
#> [1]  3  6  9 12
vec %>% times_three() # same
#> [1]  3  6  9 12


We can chain together the functions to perform the operations in order:

vec
#> [1] 1 2 3 4
vec %>% add_two()
#> [1] 3 4 5 6
vec %>% add_two() %>% times_three()
#> [1]  9 12 15 18

vec
#> [1] 1 2 3 4
vec %>% times_three()
#> [1]  3  6  9 12
vec %>% times_three() %>% add_two()
#> [1]  5  8 11 14
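
add_two() and times_three() are both transformation functions. A side-effect function can be sketched in the same style. The hypothetical print_length() below prints a message and then returns its first argument with invisible(), so that it can still sit in the middle of a pipe:

```r
library(magrittr)  # provides %>%

add_two <- function(v) {
  v + 2
}

# Hypothetical side-effect function: reports the length, leaves `v` unchanged
print_length <- function(v) {
  writeLines(paste("Length:", length(v)))
  invisible(v)  # Return the input without auto-printing it
}

vec <- c(1, 2, 3, 4)
vec %>% print_length() %>% add_two()
#> Length: 4
#> [1] 3 4 5 6
```

Without the invisible(v) line, print_length() would return the result of writeLines(), which is NULL, and the next function in the chain would have nothing to work with.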


6 What to learn next


Writing functions and loops that utilize tidyverse functions

  • Including certain Tidyverse/dplyr functions in a user-written function requires some programming concepts that we did not introduce in this lecture
    • For more explanation, see here and here.
  • For a fairly digestible explanation of how to write user-defined functions that utilize tidyverse functions, read the Programming with dplyr vignette
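
As a brief preview of one such concept: inside dplyr verbs like filter(), a bare name is looked up among the dataframe's columns before the function's arguments. The sketch below uses a small made-up tibble and two hypothetical functions to show how the .env pronoun can be used to refer to a function argument that shares its name with a column:

```r
library(dplyr)

# Made-up data for illustration
stones <- tibble(color = c("D", "E", "E", "F"), price = c(300, 450, 900, 1200))

# `color == color` compares the column with itself, so every row is kept
filter_masked <- function(color) {
  stones %>% filter(color == color)
}

# `.env$color` refers to the function argument rather than the column
filter_by_color <- function(color) {
  stones %>% filter(color == .env$color)
}

nrow(filter_masked("E"))
#> [1] 4
nrow(filter_by_color("E"))
#> [1] 2
```

Giving the argument a name that does not match any column is another way to avoid the clash.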



Replacing loops with “map” functions from the purrr package and/or “apply” functions from Base R

The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you (Wickham, chapter 21.5)

  • The purrr package, which is part of Tidyverse, provides a family of “map” functions that replace the need for writing loops
  • To learn more about purrr map functions, read section 21.4 and section 21.5 of R for Data Science by Wickham
  • purrr map functions are similar to the “apply family of functions” from Base R
    • Check out this tutorial on apply functions
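
As a rough sketch of that pattern, the code below saves a result for each element first with a for loop, then with map_chr() from purrr, and finally with sapply() from Base R. The label_price() helper and its price cutoff are made up for illustration:

```r
library(purrr)

# Hypothetical helper that labels a single price
label_price <- function(price) {
  if (price < 500) "affordable" else "expensive"
}

prices <- c(405, 552, 2757)

# Loop version: pre-allocate a character vector, then fill it in
labels_loop <- character(length(prices))
for (i in seq_along(prices)) {
  labels_loop[i] <- label_price(prices[i])
}

# purrr version: map_chr() applies the function to each element
# and returns a character vector
labels_map <- map_chr(prices, label_price)

# Base R version
labels_apply <- sapply(prices, label_price)

identical(labels_loop, labels_map)
#> [1] TRUE
identical(labels_map, labels_apply)
#> [1] TRUE
```

All three produce the same character vector; the map and apply versions simply handle the bookkeeping of creating and filling the output for us.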