1 Introduction

1.1 Libraries we will use

Load packages:

library(tidyverse)
library(lubridate)

1.2 Lecture overview

The programming unit introduces tools that tell your computer to do the same or similar things over and over, without your having to write the code over and over (i.e., iteration). The code you write can also behave differently depending on conditions in the data or on options you specify.

Paraphrasing Will Doyle:

“Computers love to do the same thing over and over. It’s their favorite thing to do. Learn to make your computer happy.”

The 3 core foci of this unit are:

  • Iteration (loops)
  • Conditionals (if, if/else)
  • Functions

But more than learning these things, this unit is about developing a more formal, rigorous understanding of programming concepts so that you can become a more powerful programmer. To that end, we will be reading chapters from Wickham’s free textbook Advanced R.

In fact, please spend 10 minutes reading Chapter 1 (sections 1.1 through 1.5).

2 Foundational concepts

See here for a review of data structures and types.

2.1 Subsetting elements

What is subsetting?

  • Subsetting refers to isolating particular elements of an object
  • Subsetting operators can be used to select/exclude elements (e.g., variables, observations)
  • There are three subsetting operators: [], [[]], $
  • These operators function differently based on vector types (e.g., atomic vectors, lists, dataframes)


For the examples in the next few subsections, we will be working with the following named atomic vector, named list, and dataframe:

  • Create named atomic vector called v with 4 elements

    v <- c(a = 10, b = 20, c = 30, d = 40)
    v
    #>  a  b  c  d 
    #> 10 20 30 40
  • Create named list called l with 4 elements

    l <- list(a = TRUE, b = c("a", "b", "c"), c = list(1, 2), d = 10L)
    l
    #> $a
    #> [1] TRUE
    #> 
    #> $b
    #> [1] "a" "b" "c"
    #> 
    #> $c
    #> $c[[1]]
    #> [1] 1
    #> 
    #> $c[[2]]
    #> [1] 2
    #> 
    #> 
    #> $d
    #> [1] 10
  • Create dataframe called df with 4 columns and 3 rows

    df <- data.frame(
      a = c(11, 21, 31),
      b = c(12, 22, 32),
      c = c(13, 23, 33),
      d = c(14, 24, 34)
    )
    df
    #> # A tibble: 3 x 4
    #>       a     b     c     d
    #>   <dbl> <dbl> <dbl> <dbl>
    #> 1    11    12    13    14
    #> 2    21    22    23    24
    #> 3    31    32    33    34


2.1.1 Subsetting using []

The [] operator:

  • Subsetting an object using [] returns an object of the same type
    • E.g., Using [] on an atomic vector returns an atomic vector, using [] on a list returns a list, etc.
  • The returned object will contain the element(s) you selected
  • Object attributes are retained when using [] (e.g., name attribute)

Six ways to subset using []:

  1. Use positive integers to return elements at specified index positions
  2. Use negative integers to exclude elements at specified index positions
  3. Use logical vectors to return elements where corresponding logical is TRUE
  4. Empty vector [] returns original object (useful for dataframes)
  5. Zero vector [0] returns empty object (useful for testing data)
  6. If object is named, use character vectors to return elements with matching names

Example: Using positive integers with []

Selecting a single element: Specify the index of the element to subset

# Select 1st element from numeric vector (note that `names` attribute is retained)
v[1]
#>  a 
#> 10

# Subsetted object will be of type `numeric`
class(v[1])
#> [1] "numeric"

# Select 1st element from list (note that `names` attribute is retained)
l[1]
#> $a
#> [1] TRUE

# Subsetted object will be a `list` containing the element
class(l[1])
#> [1] "list"


Selecting multiple elements: Specify the indices of the elements to subset using c()

# Select 3rd and 1st elements from numeric vector
v[c(3,1)]
#>  c  a 
#> 30 10

# Subsetted object will be of type `numeric`
class(v[c(3,1)])
#> [1] "numeric"

# Select 1st element three times from list
l[c(1,1,1)]
#> $a
#> [1] TRUE
#> 
#> $a
#> [1] TRUE
#> 
#> $a
#> [1] TRUE

# Subsetted object will be a `list` containing the elements
class(l[c(1,1,1)])
#> [1] "list"

Example: Using negative integers with []

Excluding a single element: Specify the index of the element to exclude

# Exclude 1st element from numeric vector
v[-1]
#>  b  c  d 
#> 20 30 40

# Subsetted object will be of type `numeric`
class(v[-1])
#> [1] "numeric"


Excluding multiple elements: Specify the indices of the elements to exclude using -c()

# Exclude 1st and 3rd elements from list
l[-c(1,3)]
#> $b
#> [1] "a" "b" "c"
#> 
#> $d
#> [1] 10

# Subsetted object will be a `list` containing the remaining elements
class(l[-c(1,3)])
#> [1] "list"
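
One caveat worth sketching: positive and negative integers cannot be mixed in a single [] call. The snippet below is a minimal illustration that recreates v from above; tryCatch() is used only to capture the error message instead of halting the script, and the object name mixed is a placeholder chosen for this sketch.

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

# Excluding by position works when all integers are negative
v[-c(1, 3)]

# Mixing a negative and a positive integer raises an error;
# tryCatch() captures the message so the script keeps running
mixed <- tryCatch(
  v[c(-1, 2)],
  error = function(e) conditionMessage(e)
)
mixed
```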

Example: Using logical vectors with []

If the logical vector is the same length as the object, then each element in the object whose corresponding position in the logical vector is TRUE will be selected:

# Select 2nd and 3rd elements from numeric vector
v[c(FALSE, TRUE, TRUE, FALSE)]
#>  b  c 
#> 20 30

# Subsetted object will be of type `numeric`
class(v[c(FALSE, TRUE, TRUE, FALSE)])
#> [1] "numeric"


If the logical vector is shorter than the object, then the elements in the logical vector will be recycled:

# This is equivalent to `l[c(FALSE, TRUE, FALSE, TRUE)]`, thus retaining 2nd and 4th elements
l[c(FALSE, TRUE)]
#> $b
#> [1] "a" "b" "c"
#> 
#> $d
#> [1] 10

# Subsetted object will be a `list` containing the elements
class(l[c(FALSE, TRUE)])
#> [1] "list"


We can also write expressions that evaluate to either TRUE or FALSE:

# `v > 20` evaluates to `c(FALSE, FALSE, TRUE, TRUE)`, so this is equivalent to `v[c(FALSE, FALSE, TRUE, TRUE)]`
v[v > 20]
#>  c  d 
#> 30 40
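
As a small sketch of how such expressions can be built up (recreating v from above), the comparison can be inspected on its own, combined with & or |, or converted to integer positions with which():

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

# The comparison itself is a logical vector, one value per element
v > 20

# Conditions can be combined with & (and) or | (or)
v[v > 10 & v < 40]

# which() converts the logical vector to integer index positions
which(v > 20)
```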

Example: Using empty vector []

An empty vector [] just returns the original object:

# Original atomic vector
v[]
#>  a  b  c  d 
#> 10 20 30 40

# Original list
l[]
#> $a
#> [1] TRUE
#> 
#> $b
#> [1] "a" "b" "c"
#> 
#> $c
#> $c[[1]]
#> [1] 1
#> 
#> $c[[2]]
#> [1] 2
#> 
#> 
#> $d
#> [1] 10

# Original dataframe
df[]
#> # A tibble: 3 x 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    11    12    13    14
#> 2    21    22    23    24
#> 3    31    32    33    34

Example: Using zero vector [0]

A zero vector [0] just returns an empty object of the same type as the original object:

# Empty named atomic vector
v[0]
#> named numeric(0)

# Empty named list
l[0]
#> named list()

# Empty dataframe
df[0]
#> # A tibble: 3 x 0

Example: Using element names with []

We can select a single element or multiple elements by their name(s):

# Equivalent to v[2]
v["b"]
#>  b 
#> 20

# Equivalent to l[c(1, 3)]
l[c("a", "c")]
#> $a
#> [1] TRUE
#> 
#> $c
#> $c[[1]]
#> [1] 1
#> 
#> $c[[2]]
#> [1] 2


2.1.2 Subsetting using [[]]

The [[]] operator:

  • We can only use [[]] to extract a single element rather than multiple elements
  • Subsetting an object using [[]] returns the selected element itself, which might not be of the same type as the original object
    • E.g., Using [[]] to select a list element that is a numeric vector will return that numeric vector and not a list containing that numeric vector, like what [] would return
      • Let x be a list with 3 elements (Think of it as a train with 3 cars)
      • x[1] will be a list containing the 1st element, which is a numeric vector (i.e., train with the 1st car)
      • x[[1]] will be the numeric vector itself (i.e., the objects within the 1st car)
      • Source: Subsetting from R for Data Science
  • Object attributes are removed when using [[]]
    • E.g., Using [[]] on a named object returns just the selected element itself without the name attribute
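
The train analogy can be sketched in a few lines. The list x and the object names car and contents below are illustrative assumptions, not objects used elsewhere in this lecture:

```r
# A list with 3 elements (the "train" with 3 cars)
x <- list(c(1, 2, 3), "a", TRUE)

car <- x[1]         # still a list (a train with only the 1st car)
contents <- x[[1]]  # the numeric vector itself (what is inside the car)

class(car)
class(contents)

# Arithmetic works on the contents, which is a plain numeric vector
sum(contents)
```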


Two ways to subset using [[]]:

  1. Use a positive integer to return an element at the specified index position
  2. If object is named, use a character string to return the element with the specified name

Example: Using positive integer with [[]]

# Select 1st element from numeric vector (note that `names` attribute is gone)
v[[1]]
#> [1] 10

# Subsetted element is `numeric`
class(v[[1]])
#> [1] "numeric"

# Select 1st element from list (note that `names` attribute is gone)
l[[1]]
#> [1] TRUE

# Subsetted element is `logical`
class(l[[1]])
#> [1] "logical"

Example: Using element name with [[]]

# Equivalent to v[[2]]
v[["b"]]
#> [1] 20

# Subsetted element is `numeric`
class(v[["b"]])
#> [1] "numeric"

# Equivalent to l[[2]]
l[["b"]]
#> [1] "a" "b" "c"

# Subsetted element is `character` vector
class(l[["b"]])
#> [1] "character"


2.1.3 Subsetting using $

The $ operator:

  • obj_name$element_name is shorthand for obj_name[["element_name"]]
  • This operator only works on lists (including dataframes) and not on atomic vectors

Example: Subsetting with $

Subsetting a list with $:

# Equivalent to l[["b"]]
l$b
#> [1] "a" "b" "c"

# Subsetted element is `character` vector
class(l$b)
#> [1] "character"


Since dataframes are just a special kind of named list, it would work the same way:

# Equivalent to df[["d"]]
df$d
#> [1] 14 24 34
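
Two limitations of $ can be sketched as follows (recreating v and l from above). The variable elem_name is a hypothetical placeholder; tryCatch() is used only to capture the error message:

```r
v <- c(a = 10, b = 20, c = 30, d = 40)
l <- list(a = TRUE, b = c("a", "b", "c"), c = list(1, 2), d = 10L)

# `$` on an atomic vector raises an error; capture the message with tryCatch()
res <- tryCatch(v$a, error = function(e) conditionMessage(e))
res

# `$` takes the name literally, so it cannot use a name stored in a variable
elem_name <- "b"   # hypothetical variable holding an element name
l$elem_name        # NULL: no element is literally called "elem_name"
l[[elem_name]]     # returns the element named "b"
```

The second point is one reason [[]] tends to be more convenient than $ when the element name is held in a variable.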


2.1.4 Subsetting dataframes

Subsetting dataframes with [], [[]], and $:

  • Subsetting dataframes works the same way as lists because dataframes are just a special kind of named list, where we can think of each element as a column
    • df_name[<column(s)>] returns a dataframe containing the selected column(s), with its attributes retained
    • df_name[[<column>]] or df_name$<column> returns the column itself, without any attributes
  • In addition to the normal way of subsetting, we are also allowed to subset dataframes by cell(s)
    • df_name[<row(s)>, <column(s)>] returns the selected cell(s)
      • If a single cell is selected, or cells from the same column, then these would be returned as an object of the same type as that column (similar to how [[]] normally works)
      • Otherwise, the subsetted object would be a dataframe, as we’d normally expect when using []
    • df_name[[<row>, <column>]] returns the selected cell
      • This is equivalent to selecting a single cell using df_name[<row(s)>, <column(s)>]

Example: Subsetting dataframe column(s) with []

We can subset dataframe column(s) the same way we have subsetted atomic vector or list element(s):

# Select 1st column from dataframe (note that `names` attribute is retained)
df[1]
#> # A tibble: 3 x 1
#>       a
#>   <dbl>
#> 1    11
#> 2    21
#> 3    31

# Subsetted object will be a `data.frame` containing the column
class(df[1])
#> [1] "data.frame"

# Exclude 1st and 3rd columns from dataframe (note that `names` attribute is retained)
df[-c(1,3)]
#> # A tibble: 3 x 2
#>       b     d
#>   <dbl> <dbl>
#> 1    12    14
#> 2    22    24
#> 3    32    34

# Subsetted object will be a `data.frame` containing the remaining columns
class(df[-c(1,3)])
#> [1] "data.frame"

Example: Subsetting dataframe column with [[]] and $

We can select a single dataframe column the same way we have subsetted a single atomic vector or list element:

# Select 1st column from dataframe by its index (note that `names` attribute is gone)
df[[1]]
#> [1] 11 21 31

# Subsetted column is `numeric` vector
class(df[[1]])
#> [1] "numeric"

# Equivalently, we could've selected 1st column by its name
df[["a"]]
#> [1] 11 21 31

# Equivalently, we could've selected 1st column using `$`
df$a
#> [1] 11 21 31

Example: Subsetting dataframe cell(s) with []

If we select a single cell by specifying its row and column, we will get back the element itself, not in a dataframe:

# Selects cell in 1st row and 2nd col
df[1, 2]
#> [1] 12

# Subsetted cell is of type `numeric`
class(df[1, 2])
#> [1] "numeric"

# Equivalently, we could select using column name instead of index
df[1, "b"]
#> [1] 12


Similarly, if we select cells from the same column, we will get back the elements themselves, not in a dataframe:

# Selects cells from the 2nd col
df[c(1,3), 2]
#> [1] 12 32

# Subsetted cells are of type `numeric`
class(df[c(1,3), 2])
#> [1] "numeric"

# Selects all cells from the 2nd col
df[, 2]
#> [1] 12 22 32

# Subsetted column is of type `numeric`
class(df[, 2])
#> [1] "numeric"


However, if we select cells from the same row, or cells across multiple rows and columns, we will get back a dataframe that contains the selected cells:

# Selects cells from the 2nd row
df[2, c("a", "c")]
#> # A tibble: 1 x 2
#>       a     c
#>   <dbl> <dbl>
#> 1    21    23

# Subsetted cells are returned as a dataframe
class(df[2, c("a", "c")])
#> [1] "data.frame"

# Selects all cells from the 2nd row
df[2, ]
#> # A tibble: 1 x 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    21    22    23    24

# Subsetted row is returned as a dataframe
class(df[2, ])
#> [1] "data.frame"

# Selects cells from multiple rows and columns
df[1:2, c("a", "c")]
#> # A tibble: 2 x 2
#>       a     c
#>   <dbl> <dbl>
#> 1    11    13
#> 2    21    23

# Subsetted cells are returned as a dataframe
class(df[1:2, c("a", "c")])
#> [1] "data.frame"

Example: Subsetting dataframe cell with [[]]

With [[]], we are only allowed to select a single cell:

# Selects cell in 1st row and 2nd col
df[[1, 2]]
#> [1] 12

# Subsetted cell is of type `numeric`
class(df[[1, 2]])
#> [1] "numeric"

# This is equivalent to using `[]`
df[1, 2]
#> [1] 12
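
If the simplification to a vector is not wanted, base R's drop argument can be set to FALSE. The sketch below recreates the data.frame from above. Note that tibbles behave differently: single-bracket subsetting of a tibble returns a tibble by default.

```r
df <- data.frame(
  a = c(11, 21, 31),
  b = c(12, 22, 32),
  c = c(13, 23, 33),
  d = c(14, 24, 34)
)

# Default: a single column selected with [row, col] is simplified to a vector
class(df[, 2])

# drop = FALSE keeps the result as a data.frame
class(df[, 2, drop = FALSE])
class(df[1, 2, drop = FALSE])
```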


2.2 Prerequisite concepts

Several functions and concepts are used frequently when creating loops and/or functions.

2.2.1 Sequences

What are sequences?

  • (Loose) definition: A sequence is a list of numbers in ascending or descending order
  • Sequences can be created using the : operator or seq() function

Example: Creating sequences using :

# Sequence from -5 to 5
-5:5
#>  [1] -5 -4 -3 -2 -1  0  1  2  3  4  5

# Sequence from 5 to -5
5:-5
#>  [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5


The seq() function:

?seq

# SYNTAX AND DEFAULT VALUES
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
  • Function: Generate a sequence
  • Arguments
    • from: The starting value of sequence
    • to: The end (or maximal) value of sequence
    • by: Increment of the sequence

Example: Creating sequences using seq()

# Sequence from 10 to 15, by increment of 1 (default)
seq(from=10, to=15)
#> [1] 10 11 12 13 14 15

# Explicitly specify increment of 1 (equivalent to above)
seq(from=10, to=15, by=1)
#> [1] 10 11 12 13 14 15

# Sequence from 100 to 150, by increment of 10
seq(from=100, to=150, by=10)
#> [1] 100 110 120 130 140 150
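
Two further uses of the arguments shown in the syntax above, as a brief sketch: a negative by for a descending sequence, and length.out to request a number of evenly spaced values instead of an increment.

```r
# Descending sequence: `by` must be negative when `from` is larger than `to`
seq(from = 10, to = 1, by = -3)

# Ask for 5 evenly spaced values between 0 and 1 instead of giving an increment
seq(from = 0, to = 1, length.out = 5)
```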


2.2.2 Length

The length() function:

?length

# SYNTAX
length(x)
  • Function: Returns the number of elements in the object
  • Arguments
    • x: The object to find the length of


Example: Using length() to find number of elements in v

# View the atomic vector
v
#>  a  b  c  d 
#> 10 20 30 40

# Use `length()` to find number of elements
length(v)
#> [1] 4


Example: Using length() to find number of elements in df

Remember that dataframes are just lists where each element is a column, so the number of elements in a dataframe is just the number of columns it has:

# View the dataframe
df
#> # A tibble: 3 x 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    11    12    13    14
#> 2    21    22    23    24
#> 3    31    32    33    34

# Use `length()` to find number of elements (i.e., columns)
length(df)
#> [1] 4


When we subset a dataframe using [] (i.e., select column(s) from the dataframe), the length of the subsetted object is the number of columns we selected:

# Subset one column
df[1]
#> # A tibble: 3 x 1
#>       a
#>   <dbl>
#> 1    11
#> 2    21
#> 3    31

# Length is one
length(df[1])
#> [1] 1

# Subset three columns
df[1:3]
#> # A tibble: 3 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1    11    12    13
#> 2    21    22    23
#> 3    31    32    33

# Length is three
length(df[1:3])
#> [1] 3


When we subset a dataframe using [[]] (i.e., isolate a specific column in the dataframe), the length of the subsetted object is the number of rows in the dataframe:

# Isolate a specific column
df[[2]]
#> [1] 12 22 32

# Length is number of elements in that column (i.e., number of rows in dataframe)
length(df[[2]])
#> [1] 3
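
For dataframes, base R also provides nrow(), ncol(), and dim(), which can be less ambiguous than length(). A short sketch, recreating the data.frame from above:

```r
df <- data.frame(
  a = c(11, 21, 31),
  b = c(12, 22, 32),
  c = c(13, 23, 33),
  d = c(14, 24, 34)
)

length(df)  # number of columns (elements of the list)
ncol(df)    # same as length(df) for a dataframe
nrow(df)    # number of rows
dim(df)     # number of rows, then number of columns
```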


2.2.3 Sequences and length

When writing loops, it is very common to create a sequence from 1 to the length (i.e., number of elements) of an object.


Example: Generating a sequence from 1 to length of v

# There are 4 elements in the atomic vector
v
#>  a  b  c  d 
#> 10 20 30 40

# Use `:` to generate a sequence from 1 to 4
1:length(v)
#> [1] 1 2 3 4

# Use `seq()` to generate a sequence from 1 to 4
seq(1, length(v))
#> [1] 1 2 3 4


There is also a function seq_along() that makes it easier to generate a sequence from 1 to the length of an object.


The seq_along() function:

?seq_along

# SYNTAX
seq_along(x)
  • Function: Generates a sequence from 1 to the length of the input object
  • Arguments
    • x: The object to generate the sequence for


Example: Generating a sequence from 1 to length of df

# There are 4 elements (i.e., columns) in the dataframe
df
#> # A tibble: 3 x 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    11    12    13    14
#> 2    21    22    23    24
#> 3    31    32    33    34

# Use `seq_along()` to generate a sequence from 1 to 4
seq_along(df)
#> [1] 1 2 3 4
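
One practical difference between the two approaches shows up when the object is empty. The sketch below uses a hypothetical zero-length vector named empty: 1:length(empty) counts down from 1 to 0, whereas seq_along(empty) returns a zero-length sequence.

```r
empty <- numeric(0)  # hypothetical zero-length vector

colon_seq <- 1:length(empty)   # same as 1:0, which counts down: 1 0
along_seq <- seq_along(empty)  # integer(0)

length(colon_seq)  # 2, so a loop over this sequence would run twice
length(along_seq)  # 0, so a loop over this sequence would not run at all
```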

3 Iteration

What is iteration?

  • Iteration is the repetition of some process or operation
    • E.g., Iteration can help with “repeating the same operation on different columns, or on different datasets” (From R for Data Science)
  • Looping is the most common way to iterate

3.1 Loop basics

What are loops?

  • Loops execute some set of commands multiple times
  • Each time the loop executes the set of commands is an iteration
  • The below loop iterates 4 times


Example: Printing each element of the vector c(1,2,3,4) using a loop

# There are 4 elements in the vector
c(1,2,3,4)
#> [1] 1 2 3 4

# Iterate over each element of the vector
for(i in c(1,2,3,4)) {
  print(i)  # Print out each element
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4


When to write loops?

  • Broadly, the rationale for writing a loop:
    • Do not duplicate code
    • Can make changes to code in one place rather than many
  • When to write a loop:
    • Grolemund and Wickham say don’t copy and paste more than twice
    • If you find yourself doing this, consider writing a loop or function
  • Don’t worry about knowing all the situations in which you should write a loop
    • Rather, you’ll be creating an analysis dataset or analyzing data and you will notice there is some task that you are repeating over and over
    • Then you’ll think, “Oh, I should write a loop or function for this”

3.2 Components of a loop

How to write a loop?

  • We can build loops using the for() function
  • The loop sequence goes inside the parentheses of for()
  • The loop body goes inside the pair of curly brackets ({}) that follows for()
for(i in c(1,2,3,4)) {  # Loop sequence
  print(i)  # Loop body
}


Components of a loop:

  1. Sequence: Determines what to “loop over”
    • In the above example, the sequence is i in c(1,2,3,4)
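    • This creates an object named i (could name it anything); in R this object is not local to the loop, so it still exists after the loop ends, holding the last value assigned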
    • Each iteration of the loop will assign a different value to i
    • c(1,2,3,4) is the set of values that will be assigned to i
      • In the first iteration, the value of i is 1
      • In the second iteration, the value of i is 2, etc.
  2. Body: What commands to execute for each iteration of the loop
    • In the above example, the body is print(i)
    • Each time through the loop (i.e., iteration), body prints the value of object i

3.2.1 Ways to write loop sequence

You may see the loop sequence being written in slightly different ways. For example, these three loops all do the same thing:

  • Looping over the vector c(1,2,3)

    for(z in c(1,2,3)) {  # Loop sequence
      print(z)  # Loop body
    }
    #> [1] 1
    #> [1] 2
    #> [1] 3
  • Looping over the sequence 1:3

    for(z in 1:3) {  # Loop sequence
      print(z)  # Loop body
    }
    #> [1] 1
    #> [1] 2
    #> [1] 3
  • Looping over the object num_sequence

    num_sequence <- 1:3
    for(z in num_sequence) {  # Loop sequence
      print(z)  # Loop body
    }
    #> [1] 1
    #> [1] 2
    #> [1] 3

3.2.2 Printing values in loop body

When building a loop, it is useful to print out information to understand what the loop is doing.

For example, the two loops below are essentially the same, but the second approach is preferable because it more clearly prints out what object we are working with inside the loop:

  • Using print() to print a single object z (print() also outputs element number [1] while writeLines() does not)

    for(z in c(1,2,3)) {
      print(z)
    }
    #> [1] 1
    #> [1] 2
    #> [1] 3
  • Using str_c() and writeLines() to concatenate and print multiple items

    for(z in c(1,2,3)) {
      writeLines(str_c("object z=", z))
    }
    #> object z=1
    #> object z=2
    #> object z=3

3.2.3 Student exercise

  1. Create a numeric vector that contains the birth years of your family members
    • E.g., birth_years <- c(1944,1950,1981,2016)
  2. Write a loop that calculates the current year minus birth year and prints this number for each member of your family
    • Within this loop, you will create a new variable that calculates current year minus birth year

Solutions

birth_years <- c(1944,1950,1981,2016)
birth_years
#> [1] 1944 1950 1981 2016

for(y in birth_years) {  # Loop sequence
  writeLines(str_c("object y=", y))  # Loop body
  z <- 2021 - y
  writeLines(str_c("value of 2021 minus ", y, " is ", z))
}
#> object y=1944
#> value of 2021 minus 1944 is 77
#> object y=1950
#> value of 2021 minus 1950 is 71
#> object y=1981
#> value of 2021 minus 1981 is 40
#> object y=2016
#> value of 2021 minus 2016 is 5
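
The solution above hard-codes 2021. As an alternative sketch, the current year can be taken from the system clock with base R. The names current_year and age are placeholders, and paste0() is used here as a base R stand-in for str_c() so the snippet runs without loading packages:

```r
birth_years <- c(1944, 1950, 1981, 2016)

# Current year from the system clock (instead of typing the year by hand)
current_year <- as.integer(format(Sys.Date(), "%Y"))

for (y in birth_years) {
  age <- current_year - y
  writeLines(paste0("value of ", current_year, " minus ", y, " is ", age))
}
```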

3.3 Ways to loop over a vector

There are 3 ways to loop over elements of an object:

  1. Looping over the elements (approach we have used so far)
  2. Looping over names of the elements
  3. Looping over numeric indices associated with element position (approach recommended by Grolemund and Wickham)


For the examples in the next few subsections, we will be working with the following named atomic vector and dataframe:

  • Create named atomic vector called vec

    vec <- c(a = 5, b = -10, c = 30)
    vec
    #>   a   b   c 
    #>   5 -10  30
  • Create dataframe called df with randomly generated data, 3 columns (vars) and 4 rows (obs)

    set.seed(12345) # so we all get the same variable values
    df <- tibble(a = rnorm(4), b = rnorm(4), c = rnorm(4))
    str(df)
    #> tibble [4 x 3] (S3: tbl_df/tbl/data.frame)
    #>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
    #>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
    #>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817

3.3.1 Looping over elements

Syntax: for (i in object_name)

  • This approach iterates over each element in the object
  • The value of i is equal to the element’s content (rather than its name or index position)


Example: Looping over elements in vec

vec  # View named atomic vector object
#>   a   b   c 
#>   5 -10  30

for (i in vec) {
  writeLines(str_c("value of object i=",i))
  writeLines(str_c("object i has: type=", typeof(i), "; length=", length(i), "; class=", class(i),
      "\n"))  # "\n" adds line break
}
#> value of object i=5
#> object i has: type=double; length=1; class=numeric
#> 
#> value of object i=-10
#> object i has: type=double; length=1; class=numeric
#> 
#> value of object i=30
#> object i has: type=double; length=1; class=numeric


Example: Looping over elements in df

df  # View dataframe object
#> # A tibble: 4 x 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82

# Show contents of an element, outside of a loop
# Each element of the dataframe is a vector that contains one element for each observation
str(df[1])  # single bracket
#> tibble [4 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
str(df[[1]])  # double bracket
#>  num [1:4] 0.586 0.709 -0.109 -0.453

for (i in df) {
  writeLines(str_c("value of object i=",i))
  writeLines(str_c("object i has: type=", typeof(i), "; length=", length(i), "; class=", class(i),
      "\n"))  # "\n" adds line break
}
#> value of object i=0.585528817843856
#> value of object i=0.709466017509524
#> value of object i=-0.109303314681054
#> value of object i=-0.453497173462763
#> object i has: type=double; length=4; class=numeric
#> 
#> value of object i=0.605887455840394
#> value of object i=-1.81795596770373
#> value of object i=0.630098551068391
#> value of object i=-0.276184105225216
#> object i has: type=double; length=4; class=numeric
#> 
#> value of object i=-0.284159743943371
#> value of object i=-0.919322002474128
#> value of object i=-0.116247806352002
#> value of object i=1.81731204370422
#> object i has: type=double; length=4; class=numeric

Example: Calculating column averages for df by looping over columns

The dataframe df is a list object, where each element is a vector (i.e., column):

df  # View dataframe object
#> # A tibble: 4 x 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82

for (i in df) {
  writeLines(str_c("value of object i=", i))
  writeLines(str_c("mean value of object i=", mean(i, na.rm = TRUE), "\n"))
}
#> value of object i=0.585528817843856
#> value of object i=0.709466017509524
#> value of object i=-0.109303314681054
#> value of object i=-0.453497173462763
#> mean value of object i=0.183048586802391
#> 
#> value of object i=0.605887455840394
#> value of object i=-1.81795596770373
#> value of object i=0.630098551068391
#> value of object i=-0.276184105225216
#> mean value of object i=-0.21453851650504
#> 
#> value of object i=-0.284159743943371
#> value of object i=-0.919322002474128
#> value of object i=-0.116247806352002
#> value of object i=1.81731204370422
#> mean value of object i=0.124395622733679

3.3.2 Looping over names

Syntax: for (i in names(object_name))

  • To use this approach, elements in the object must have name attributes
  • This approach iterates over the names of each element in the object
  • names() returns a vector of the object’s element names
  • The value of i is equal to the element’s name (rather than its content or index position)
  • But note that it is still possible to access the element’s content inside the loop:
    • Access element contents using object_name[i]
      • Same object type as object_name; retains attributes (e.g., name attribute)
    • Access element contents using object_name[[i]]
      • Removes level of hierarchy, thereby removing attributes
      • Approach recommended by Wickham because it isolates value of element


Example: Looping over names of elements in vec

vec  # View named atomic vector object
#>   a   b   c 
#>   5 -10  30
names(vec)  # View names of atomic vector object
#> [1] "a" "b" "c"

for (i in names(vec)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  str(vec[i])  # Access element contents using []
  str(vec[[i]])  # Access element contents using [[]]
}
#> 
#> value of object i=a; type=character
#>  Named num 5
#>  - attr(*, "names")= chr "a"
#>  num 5
#> 
#> value of object i=b; type=character
#>  Named num -10
#>  - attr(*, "names")= chr "b"
#>  num -10
#> 
#> value of object i=c; type=character
#>  Named num 30
#>  - attr(*, "names")= chr "c"
#>  num 30


Example: Looping over names of elements in df

df  # View dataframe object
#> # A tibble: 4 x 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82
names(df)  # View names of dataframe object (i.e., column names)
#> [1] "a" "b" "c"

# show using name to print contents, outside of a loop
str(df["a"]) # single bracket
#> tibble [4 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
str(df[["a"]]) # double bracket
#>  num [1:4] 0.586 0.709 -0.109 -0.453

for (i in names(df)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  str(df[i])  # Access element contents using []
  str(df[[i]])  # Access element contents using [[]]
}
#> 
#> value of object i=a; type=character
#> tibble [4 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#>  num [1:4] 0.586 0.709 -0.109 -0.453
#> 
#> value of object i=b; type=character
#> tibble [4 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#>  num [1:4] 0.606 -1.818 0.63 -0.276
#> 
#> value of object i=c; type=character
#> tibble [4 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817
#>  num [1:4] -0.284 -0.919 -0.116 1.817

Example: Calculating column averages for df by looping over column names

str(df)  # View structure of dataframe object
#> tibble [4 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817


Remember that we can use [[]] to access element contents by their name:

for (i in names(df)) {
  writeLines(str_c("mean of element named ", i, " = ", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element named a = 0.183048586802391
#> mean of element named b = -0.21453851650504
#> mean of element named c = 0.124395622733679


If we tried completing the task using [] to access the element contents, we would not get the column means: mean() only takes numeric or logical vectors as input, and df[i] returns a dataframe object, so mean() returns NA with a warning:

for (i in names(df)) {
  writeLines(str_c("mean of element named", i, "=", mean(df[i], na.rm = TRUE)))
  
  # print(class(df[i]))
}
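
A minimal sketch of the difference, using a small stand-in dataframe (an assumption, not the lecture's df). In current versions of R, passing a dataframe to mean() produces a warning and NA rather than a numeric result. suppressWarnings() is used here only to keep the output quiet.

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))  # small stand-in dataframe

# Single bracket: mean() receives a dataframe, warns, and returns NA
with_single <- suppressWarnings(mean(df["a"], na.rm = TRUE))
with_single

# Double bracket: mean() receives the numeric column and returns its mean
with_double <- mean(df[["a"]], na.rm = TRUE)
with_double
```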

3.3.3 Looping over indices

Syntax: for (i in 1:length(object_name)) OR for (i in seq_along(object_name))

  • This approach iterates over the index positions of each element in the object
  • There are two ways to create the loop sequence:
    • length() returns the number of elements in the input object, which we can use to create a sequence of index positions (i.e., 1:length(object_name))
    • seq_along() returns a sequence of numbers that represent the index positions for all elements in the input object (i.e., equivalent to 1:length(object_name))
  • The value of i is equal to the element’s index position (rather than its content or name)
  • But note that it is still possible to access the element’s content inside the loop:
    • Access element contents using object_name[i]
      • Same object type as object_name; retains attributes (e.g., name attribute)
    • Access element contents using object_name[[i]]
      • Removes level of hierarchy, thereby removing attributes
      • Approach recommended by Wickham because it isolates value of element
  • Similarly, we can access the element’s name by its index using names(object_name)[i] or names(object_name)[[i]]
    • In this case, using [[]] and [] are equivalent because names() returns an unnamed vector, which does not have any attributes


Example: Looping over indices of elements in vec

vec  # View named atomic vector object
#>   a   b   c 
#>   5 -10  30
length(vec)  # View length of atomic vector object
#> [1] 3
1:length(vec)  # Create sequence from `1` to `length(vec)`
#> [1] 1 2 3

for (i in 1:length(vec)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  str(vec[i])  # Access element contents using []
  str(vec[[i]])  # Access element contents using [[]]
}
#> 
#> value of object i=1; type=integer
#>  Named num 5
#>  - attr(*, "names")= chr "a"
#>  num 5
#> 
#> value of object i=2; type=integer
#>  Named num -10
#>  - attr(*, "names")= chr "b"
#>  num -10
#> 
#> value of object i=3; type=integer
#>  Named num 30
#>  - attr(*, "names")= chr "c"
#>  num 30


Example: Looping over indices of elements in df

df  # View dataframe object
#> # A tibble: 4 x 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82
seq_along(df)  # Equivalent to `1:length(df)`
#> [1] 1 2 3

for (i in seq_along(df)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  str(df[i])  # Access element contents using []
  str(df[[i]])  # Access element contents using [[]]
}
#> 
#> value of object i=1; type=integer
#> tibble [4 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#>  num [1:4] 0.586 0.709 -0.109 -0.453
#> 
#> value of object i=2; type=integer
#> tibble [4 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#>  num [1:4] 0.606 -1.818 0.63 -0.276
#> 
#> value of object i=3; type=integer
#> tibble [4 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817
#>  num [1:4] -0.284 -0.919 -0.116 1.817


We could also access the element’s name by its index:

names(df)  # View names of dataframe object (i.e., column names)
#> [1] "a" "b" "c"
names(df)[[2]]  # We can access any element in the names vector by its index
#> [1] "b"

# Incorporate the above line into the loop
for (i in 1:length(df)) {
  writeLines(str_c("i=", i, "; name=", names(df)[[i]]))
}
#> i=1; name=a
#> i=2; name=b
#> i=3; name=c

Example: Calculating column averages for df by looping over column indices

Use i in seq_along(df) to loop over the column indices and [[]] to access column contents:

str(df)  # View structure of dataframe object
#> tibble [4 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#>  $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#>  $ c: num [1:4] -0.284 -0.919 -0.116 1.817

for (i in seq_along(df)) {
  writeLines(str_c("mean of element at index position ", i, " = ", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element at index position 1 = 0.183048586802391
#> mean of element at index position 2 = -0.21453851650504
#> mean of element at index position 3 = 0.124395622733679

3.3.4 Summary

There are 3 ways to loop over elements of an object:

  1. Looping over the elements
  2. Looping over names of the elements
  3. Looping over numeric indices associated with element position (approach recommended by Grolemund and Wickham)
    • Grolemund and Wickham recommend this approach (#3) because given an element’s index position, we can also extract the element name (#2) and value (#1)

for (i in seq_along(df)) {
  writeLines(str_c("\n", "i=", i))  # element's index position
  
  name <- names(df)[[i]]  # element's name (what we looped over in approach #2)
  writeLines(str_c("name=", name))
  
  value <- df[[i]]  # element's value (what we looped over in approach #1)
  writeLines(str_c("value=", value))  # value is a whole column, so this prints one line per row
}
#> 
#> i=1
#> name=a
#> value=0.585528817843856
#> value=0.709466017509524
#> value=-0.109303314681054
#> value=-0.453497173462763
#> 
#> i=2
#> name=b
#> value=0.605887455840394
#> value=-1.81795596770373
#> value=0.630098551068391
#> value=-0.276184105225216
#> 
#> i=3
#> name=c
#> value=-0.284159743943371
#> value=-0.919322002474128
#> value=-0.116247806352002
#> value=1.81731204370422
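
For comparison, the three approaches can be written side by side. This is a minimal sketch on a hypothetical named vector x (it has the same values as the original vec); base paste0() stands in for str_c() so the block runs without loading any packages:

```r
x <- c(a = 5, b = -10, c = 30)  # hypothetical named vector, for illustration only

# Approach 1: loop over the elements themselves (names are not available)
for (value in x) {
  writeLines(paste0("value=", value))
}

# Approach 2: loop over the element names (values recovered with [[name]])
for (name in names(x)) {
  writeLines(paste0("name=", name, "; value=", x[[name]]))
}

# Approach 3: loop over index positions (name and value both recoverable)
for (i in seq_along(x)) {
  writeLines(paste0("i=", i, "; name=", names(x)[[i]], "; value=", x[[i]]))
}
```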

3.4 Modifying vs. creating object

Grolemund and Wickham differentiate between two types of tasks that loops accomplish:

  1. Modifying an existing object
    • E.g., Looping through a set of variables in a dataframe to:
      • Modify these variables OR
      • Create new variables (within the existing dataframe object)
    • When writing loops in Stata/SAS/SPSS, we are usually modifying an existing object because these programs typically only have one object (a dataset) open at a time
  2. Creating a new object
    • E.g., Creating an object that has summary statistics for each variable, which can be the basis for a table or graph, etc.
    • The new object will often be a vector of results based on looping through elements of a dataframe
    • In R (as opposed to Stata/SAS/SPSS), creating a new object is very common because R can hold many objects at the same time

3.4.1 Modifying an existing object

How to modify an existing object?

  • Recall that we can directly access elements in an object (e.g., atomic vector, lists) using [[]]. We can use this same notation to modify the object.
  • Even though atomic vectors can also be modified with [], Wickham recommends using [[]] in all cases to make it clear we are working with a single element (from R for Data Science)

Example: Modifying an existing atomic vector

Recall our named atomic vector vec from the previous examples:

vec
#>   a   b   c 
#>   5 -10  30

We can loop over the index positions and use [[]] to modify the object:

for (i in seq_along(vec)) {
  vec[[i]] <- vec[[i]] * 2  # Double each element
}

vec
#>   a   b   c 
#>  10 -20  60

Example: Modifying an existing dataframe

Recall our dataframe df from the previous examples:

df
#> # A tibble: 4 x 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.586  0.606 -0.284
#> 2  0.709 -1.82  -0.919
#> 3 -0.109  0.630 -0.116
#> 4 -0.453 -0.276  1.82

We can loop over the index positions and use [[]] to modify the object:

for (i in seq_along(df)) {
  df[[i]] <- df[[i]] * 2  # Double every value in column i
}

df
#> # A tibble: 4 x 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  1.17   1.21  -0.568
#> 2  1.42  -3.64  -1.84 
#> 3 -0.219  1.26  -0.232
#> 4 -0.907 -0.552  3.63
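
The same pattern also covers the other case listed earlier, creating new variables within an existing dataframe. A minimal sketch, assuming a hypothetical dataframe called scores and a made-up "_x2" naming suffix (base data.frame() and paste0() are used so the block runs without loading any packages):

```r
scores <- data.frame(a = c(1, -2, 3), b = c(-4, 5, 6))  # hypothetical data

# names(scores) is evaluated once, before the first iteration,
# so the columns added inside the loop are not looped over themselves
for (nm in names(scores)) {
  new_nm <- paste0(nm, "_x2")          # hypothetical naming scheme
  scores[[new_nm]] <- scores[[nm]] * 2  # new column; original column unchanged
}

names(scores)
#> [1] "a"    "b"    "a_x2" "b_x2"
```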

3.4.2 Creating a new object

So far our loops have two components:

  1. Sequence
  2. Body

When we create a new object to store the results of a loop, our loops have three components:

  1. Sequence
  2. Body
  3. Output (This is the new object that will store the results created from your loop)


Grolemund and Wickham recommend using vector() to create this new object prior to writing the loop (rather than creating the new object within the loop):

“Before you start the loop … allocate sufficient space for the output. This is very important for efficiency: if you grow the for loop at each iteration using c() (for example), your for loop will be very slow.”


The vector() function:

?vector

# SYNTAX AND DEFAULT VALUES
vector(mode = "logical", length = 0)

  • Function: Creates a new vector object of the given length and mode
  • Arguments
    • mode: Type of vector to create (e.g., "logical", "numeric", "list")
    • length: Length of the vector
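
A short sketch of what vector() returns for a few common modes. Each new vector is filled with the "empty" value for its type:

```r
vector(mode = "logical", length = 2)    # FALSE FALSE
vector(mode = "numeric", length = 3)    # 0 0 0
vector(mode = "character", length = 2)  # "" ""
vector(mode = "list", length = 2)       # list with two NULL elements
```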

Example: Creating a new object to store dataframe column averages

Recall the previous example where we calculated the mean value of each column in dataframe df. The values are now twice as large because we doubled every column of df in the previous section:

str(df)
#> tibble [4 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:4] 1.171 1.419 -0.219 -0.907
#>  $ b: num [1:4] 1.212 -3.636 1.26 -0.552
#>  $ c: num [1:4] -0.568 -1.839 -0.232 3.635

for (i in seq_along(df)) {
  writeLines(str_c("mean of element at index position ", i, " = ", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element at index position 1 = 0.366097173604781
#> mean of element at index position 2 = -0.42907703301008
#> mean of element at index position 3 = 0.248791245467358


Let’s create a new object to store these column averages. Specifically, we’ll create a new numeric vector whose length is equal to the number of columns in df:

output <- vector(mode = "numeric", length = length(df))
class(output)  # Specified by `mode` argument in `vector()`
#> [1] "numeric"
length(output)  # Specified by `length` argument in `vector()`
#> [1] 3


We can loop over the index positions of df and use [[]] to modify output:

for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]], na.rm = TRUE)  # Mean of df[[1]] assigned to output[[1]], etc.
}

output
#> [1]  0.3660972 -0.4290770  0.2487912
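
To make the quoted advice concrete, the sketch below contrasts growing a vector with c() against filling a preallocated vector. The dataframe dat and the object names grown and prealloc are hypothetical; both loops give the same numbers, and the difference is only in how much copying R does along the way:

```r
dat <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))  # hypothetical data

# Growing: start empty and extend with c() at each iteration
# (works, but the vector is rebuilt every time, which is slow for long loops)
grown <- c()
for (i in seq_along(dat)) {
  grown <- c(grown, mean(dat[[i]], na.rm = TRUE))
}

# Preallocating: create the full-length output first, then fill it in
prealloc <- vector(mode = "numeric", length = length(dat))
for (i in seq_along(dat)) {
  prealloc[[i]] <- mean(dat[[i]], na.rm = TRUE)
}
names(prealloc) <- names(dat)  # optional: label each result with its column name

grown
#> [1] 2 5 8
prealloc
#> a b c 
#> 2 5 8
```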

3.5 Summary

The general recipe for how to write a loop:

  1. Complete the task for one instance outside a loop (this is akin to writing the body of the loop)

  2. Write the sequence of the loop

  3. Modify the parts of the loop body that need to change with each iteration

  4. If you are creating a new object to store output of the loop, create this object outside of the loop

  5. Construct the loop
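
The recipe can be traced on a small task: converting negative values to NA for a set of variables. This is a sketch under assumptions; the dataframe survey and its columns are hypothetical, and step 4 does not apply because the loop modifies an existing object instead of creating a new one:

```r
survey <- data.frame(age = c(25, -9, 40), income = c(-8, 50000, 62000))  # hypothetical data

# Step 1: complete the task for one instance, outside a loop
ifelse(survey[["age"]] < 0, NA, survey[["age"]])

# Step 2: write the sequence
#   for (i in seq_along(survey))

# Step 3: replace the part that changes between instances ("age") with i

# Step 4: not needed here (we modify survey; no new output object)

# Step 5: construct the loop
for (i in seq_along(survey)) {
  survey[[i]] <- ifelse(survey[[i]] < 0, NA, survey[[i]])
}

survey  # the negative value in each column is now NA
```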


When to write a loop vs a function

It’s usually obvious when you are duplicating code, but less obvious whether the fix should be a loop or a function.

  • Often, a repeated task can be completed with a loop or a function

In my experience, loops are better for repeated tasks when the individual tasks are very similar to one another:

  • E.g., a loop that reads in datasets from individual years; each dataset you read in differs only by directory and name
  • E.g., a loop that converts negative values to NA for a set of variables

Because functions can have many arguments, functions are better when the individual tasks differ substantially from one another

  • E.g., a function that runs a regression and creates a formatted results table
    • The function allows you to specify (as function arguments): dependent variable; independent variables; what model to run, etc.
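
A rough sketch of the contrast, not a definitive rule. The directory layout ("data/survey_<year>.csv"), the dataframe dat, and the function summarize_var() are all hypothetical; the loop only builds file paths, so nothing is read from disk:

```r
# Loop: near-identical tasks that differ only by year
years <- 2018:2020
paths <- vector(mode = "character", length = length(years))
for (i in seq_along(years)) {
  paths[[i]] <- file.path("data", paste0("survey_", years[[i]], ".csv"))  # hypothetical layout
}
paths

# Function: tasks that differ in several ways, each controlled by an argument
summarize_var <- function(data, var, stat = mean, digits = 2) {
  round(stat(data[[var]], na.rm = TRUE), digits)
}

dat <- data.frame(x = c(1, 2, 6), y = c(10, NA, 30))  # hypothetical data
summarize_var(dat, "x")                 # mean of x
summarize_var(dat, "y", stat = median)  # median of y, ignoring the NA
```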

Note:

  • You can embed loops within functions and call functions within loops
  • But for now, just try to understand the basics of functions and loops

4 Conditional execution

What is conditional execution?

  • Conditional execution is the running of specific blocks of code based on some condition
    • E.g., If the number is even, run this block of code. Otherwise, run the other block of code, etc.
  • We can write if-(else if)-else statements to run code conditionally (covered in upcoming sections)
  • This is useful because it allows for decision-making in the code
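
As a first taste, the even/odd example above might look like the sketch below. The variable names number and parity are hypothetical:

```r
number <- 4  # hypothetical input value

if (number %% 2 == 0) {  # %% is the remainder operator; remainder 0 means even
  parity <- "even"
} else {
  parity <- "odd"
}

parity
#> [1] "even"
```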