Load packages:
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.2
#> Warning: package 'tidyr' was built under R version 4.2.2
#> Warning: package 'readr' was built under R version 4.2.2
#> Warning: package 'purrr' was built under R version 4.2.2
#> Warning: package 'dplyr' was built under R version 4.2.2
#> Warning: package 'stringr' was built under R version 4.2.2
library(lubridate)
#> Warning: package 'lubridate' was built under R version 4.2.2
The programming unit introduces tools that tell your computer to do the same or similar things over and over, without your having to write the code over and over (i.e., iteration). The code you write to repeat things can also behave differently depending on conditions in the data or on things you specify.
Paraphrasing Will Doyle:
“Computers love to do the same thing over and over. It’s their favorite thing to do. Learn to make your computer happy.”
The 3 core foci of this unit are:
But more than learning these things, this unit is about developing a more formal, rigorous understanding of programming concepts so that you can become a more powerful programmer. Toward that end, we will be reading chapters from Wickham’s free textbook Advanced R.
In fact, please spend 10 minutes reading Chapter 1 (sections 1.1 through 1.5).
See here for a review of data structures and types.
What is subsetting?
Subsetting operators: [], [[]], $
For the examples in the next few subsections, we will be working
with the following named atomic vector, named list, and dataframe:
Create named atomic vector called v with 4 elements
v <- c(a = 10, b = 20, c = 30, d = 40)
v
#> a b c d
#> 10 20 30 40
Create named list called l with 4 elements
l <- list(a = TRUE, b = c("a", "b", "c"), c = list(1, 2), d = 10L)
l
#> $a
#> [1] TRUE
#>
#> $b
#> [1] "a" "b" "c"
#>
#> $c
#> $c[[1]]
#> [1] 1
#>
#> $c[[2]]
#> [1] 2
#>
#>
#> $d
#> [1] 10
Create dataframe called df with 4 columns and 3 rows
df <- data.frame(
  a = c(11, 21, 31),
  b = c(12, 22, 32),
  c = c(13, 23, 33),
  d = c(14, 24, 34)
)
df
#> # A tibble: 3 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 11 12 13 14
#> 2 21 22 23 24
#> 3 31 32 33 34
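Subsetting using []
The [] operator:
- Subsetting an object using [] returns an object of the same type
  - E.g., using [] on an atomic vector returns an atomic vector, using [] on a list returns a list, etc.
- Object attributes are retained when using [] (e.g., name attribute)
Six ways to subset using []:
1. Positive integers select the elements at the given index positions
2. Negative integers exclude the elements at the given index positions
3. Logical vectors select the elements whose corresponding value is TRUE
4. Empty vector [] returns original object (useful for dataframes)
5. Zero vector [0] returns empty object (useful for testing data)
6. Character vectors select elements by name (if the object is named)
Using positive integers with []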
Selecting a single element: Specify the index of the element to subset
# Select 1st element from numeric vector (note that `names` attribute is retained)
v[1]
#> a
#> 10
# Subsetted object will be of type `numeric`
class(v[1])
#> [1] "numeric"
# Select 1st element from list (note that `names` attribute is retained)
l[1]
#> $a
#> [1] TRUE
# Subsetted object will be a `list` containing the element
class(l[1])
#> [1] "list"
Selecting multiple elements: Specify the indices of the elements to subset using c()
# Select 3rd and 1st elements from numeric vector
v[c(3,1)]
#> c a
#> 30 10
# Subsetted object will be of type `numeric`
class(v[c(3,1)])
#> [1] "numeric"
# Select 1st element three times from list
l[c(1,1,1)]
#> $a
#> [1] TRUE
#>
#> $a
#> [1] TRUE
#>
#> $a
#> [1] TRUE
# Subsetted object will be a `list` containing the elements
class(l[c(1,1,1)])
#> [1] "list"
Using negative integers with []
Excluding a single element: Specify the index of the element to exclude as a negative integer
# Exclude 1st element from numeric vector
v[-1]
#> b c d
#> 20 30 40
# Subsetted object will be of type `numeric`
class(v[-1])
#> [1] "numeric"
Excluding multiple elements: Specify the indices of the elements to exclude using -c()
# Exclude 1st and 3rd elements from list
l[-c(1,3)]
#> $b
#> [1] "a" "b" "c"
#>
#> $d
#> [1] 10
# Subsetted object will be a `list` containing the remaining elements
class(l[-c(1,3)])
#> [1] "list"
Using logical vectors with []
If the logical vector is the same length as the object, then each element in the object whose corresponding position in the logical vector is TRUE will be selected:
v
#> a b c d
#> 10 20 30 40
# Select 2nd and 3rd elements from numeric vector
v[c(FALSE, TRUE, TRUE, FALSE)]
#> b c
#> 20 30
# Subsetted object will be of type `numeric`
class(v[c(FALSE, TRUE, TRUE, FALSE)])
#> [1] "numeric"
If the logical vector is shorter than the object, then the elements in the logical vector will be recycled:
# This is equivalent to `l[c(FALSE, TRUE, FALSE, TRUE)]`, thus retaining 2nd and 4th elements
l[c(FALSE, TRUE)]
#> $b
#> [1] "a" "b" "c"
#>
#> $d
#> [1] 10
# Subsetted object will be a `list` containing the elements
class(l[c(FALSE, TRUE)])
#> [1] "list"
We can also write expressions that evaluate to either TRUE or FALSE (e.g., select elements where var1>=30 is TRUE):
# `v > 20` is evaluated element by element, so this is equivalent to `v[c(FALSE, FALSE, TRUE, TRUE)]`
v[v > 20]
#> c d
#> 30 40
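The same logical-expression idea carries over to selecting dataframe rows, which this section covers further below. The following is only a minimal sketch: `scores`, `keep`, and `high` are made-up names for illustration and do not come from the lecture.

```r
# Hypothetical dataframe used only for this illustration
scores <- data.frame(id = 1:4, var1 = c(10, 30, 45, 20))

# `scores$var1 >= 30` evaluates to c(FALSE, TRUE, TRUE, FALSE)
keep <- scores$var1 >= 30

# Put the logical vector in the row position to keep the matching rows
high <- scores[keep, ]
high$id
```

Storing the logical vector in its own object first (here `keep`) makes it easy to inspect which elements will be selected before subsetting.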
Using empty vector []
An empty vector [] just returns the original object:
# Original atomic vector
v[]
#> a b c d
#> 10 20 30 40
# Original list
l[]
#> $a
#> [1] TRUE
#>
#> $b
#> [1] "a" "b" "c"
#>
#> $c
#> $c[[1]]
#> [1] 1
#>
#> $c[[2]]
#> [1] 2
#>
#>
#> $d
#> [1] 10
# Original dataframe
df[]
#> # A tibble: 3 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 11 12 13 14
#> 2 21 22 23 24
#> 3 31 32 33 34
Using zero vector [0]
A zero vector [0] just returns an empty object of the same type as the original object:
# Empty named atomic vector
v[0]
#> named numeric(0)
# Empty named list
l[0]
#> named list()
# Empty dataframe
df[0]
#> # A tibble: 3 × 0
Using element names with []
We can select a single element or multiple elements by their name(s):
# Equivalent to v[2]
v["b"]
#> b
#> 20
# Equivalent to l[c(1, 3)]
l[c("a", "c")]
#> $a
#> [1] TRUE
#>
#> $c
#> $c[[1]]
#> [1] 1
#>
#> $c[[2]]
#> [1] 2
Subsetting using [[]]
The [[]] operator:
- We can only use [[]] to extract a single element rather than multiple elements
- Subsetting an object using [[]] returns the selected element itself, which might not be of the same type as the original object
  - E.g., using [[]] to select a list element that is a numeric vector will return that numeric vector and not a list containing that numeric vector, like what [] would return
  - Let x be a list with 3 elements (think of it as a train with 3 cars)
    - x[1] will be a list containing the 1st element, which is a numeric vector (i.e., train with the 1st car)
    - x[[1]] will be the numeric vector itself (i.e., the objects within the 1st car)
- Object attributes are removed when using [[]]
  - E.g., using [[]] on a named object returns just the selected element itself without the name attribute
Two ways to subset using [[]]:
1. A positive integer selects the element at that index position
2. A character string selects the element with that name (if the object is named)
Using positive integers with [[]]
# Select 1st element from numeric vector (note that `names` attribute is gone)
v[[1]]
#> [1] 10
# Subsetted element is `numeric`
class(v[[1]])
#> [1] "numeric"
# Select 1st element from list (note that `names` attribute is gone)
l[[1]]
#> [1] TRUE
# Subsetted element is `logical`
class(l[[1]])
#> [1] "logical"
Using element names with [[]]
# Equivalent to v[[2]]
v[["b"]]
#> [1] 20
# Subsetted element is `numeric`
class(v[["b"]])
#> [1] "numeric"
# Equivalent to l[[2]]
l[["b"]]
#> [1] "a" "b" "c"
# Subsetted element is `character` vector
class(l[["b"]])
#> [1] "character"
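The train analogy from the description of [[]] can be tried out directly. This is a small sketch under assumptions: the list `x` and its contents are invented for illustration, with a numeric vector as the 1st element as in the analogy.

```r
# Hypothetical list with 3 elements (a "train" with 3 cars)
x <- list(c(1.5, 2.5), "cargo", TRUE)

car <- x[1]         # list of length 1: the train with only the 1st car
contents <- x[[1]]  # numeric vector: the objects within the 1st car

class(car)       # "list"
class(contents)  # "numeric"

# Arithmetic works on the contents of the car
total <- sum(contents)

# sum(car) would raise an error, because `car` is still a list
```

This is why [[]] is usually what you want when the next step is a computation on the element, while [] is what you want when the next step still expects a list.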
Subsetting using $
The $ operator:
- obj_name$element_name is shorthand for obj_name[["element_name"]]
Subsetting a list with $:
# Equivalent to l[["b"]]
l$b
#> [1] "a" "b" "c"
# Subsetted element is `character` vector
class(l$b)
#> [1] "character"
Since dataframes are just a special kind of named list, it would work the same way:
# Equivalent to df[["d"]]
df$d
#> [1] 14 24 34
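One difference between $ and [[]] that is worth knowing for plain lists: $ accepts a unique partial name, while [[]] requires an exact match by default. The sketch below uses a made-up list called `settings`; tibbles are stricter about partial names, so treat this as a statement about base R lists only.

```r
# Hypothetical named list used only for this illustration
settings <- list(alpha = 0.05, beta = "two-sided")

exact <- settings[["alpha"]]  # exact name: returns 0.05
partial <- settings$al        # `$` partially matches "alpha": returns 0.05
strict <- settings[["al"]]    # `[[` needs an exact name: returns NULL
```

Because partial matching can hide typos, writing out the full element name with either operator is the safer habit.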
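Subsetting dataframes with [], [[]], and $:
- Dataframes can be subsetted like lists, where each element is a column
  - df_name[<column(s)>] returns a dataframe containing the selected column(s), with its attributes retained
  - df_name[[<column>]] or df_name$<column> returns the column itself, without any attributes
- Dataframes can also be subsetted by cell(s)
  - df_name[<row(s)>, <column(s)>] returns the selected cell(s)
    - A single cell, or cells from the same column, come back as a vector (similar to how [[]] normally works)
    - Otherwise, the result is a dataframe (similar to how [] normally works)
  - df_name[[<row>, <column>]] returns the selected cell
    - Equivalent to selecting a single cell with df_name[<row(s)>, <column(s)>]
Subsetting dataframes with []
We can subset dataframe column(s) the same way we have subsetted atomic vector or list element(s):
df
#> # A tibble: 3 × 4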
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 11 12 13 14
#> 2 21 22 23 24
#> 3 31 32 33 34
# Select 1st column from dataframe (note that `names` attribute is retained)
df[1]
#> # A tibble: 3 × 1
#> a
#> <dbl>
#> 1 11
#> 2 21
#> 3 31
# Subsetted object will be a `data.frame` containing the column
class(df[1])
#> [1] "data.frame"
# Exclude 1st and 3rd columns from dataframe (note that `names` attribute is retained)
df[-c(1,3)]
#> # A tibble: 3 × 2
#> b d
#> <dbl> <dbl>
#> 1 12 14
#> 2 22 24
#> 3 32 34
# Subsetted object will be a `data.frame` containing the remaining columns
class(df[-c(1,3)])
#> [1] "data.frame"
Subsetting dataframes with [[]] and $
We can select a single dataframe column the same way we have subsetted a single atomic vector or list element:
# Select 1st column from dataframe by its index (note that `names` attribute is gone)
df[[1]]
#> [1] 11 21 31
# Subsetted column is `numeric` vector
class(df[[1]])
#> [1] "numeric"
# Equivalently, we could've selected 1st column by its name
df[["a"]]
#> [1] 11 21 31
# Equivalently, we could've selected 1st column using `$`
df$a
#> [1] 11 21 31
Subsetting dataframe cells with []
If we select a single cell by specifying its row and column, we will get back the element itself, not in a dataframe:
# Selects cell in 1st row and 2nd col
df[1, 2]
#> [1] 12
# Subsetted cell is of type `numeric`
class(df[1, 2])
#> [1] "numeric"
# Equivalently, we could select using column name instead of index
df[1, "b"]
#> [1] 12
Similarly, if we select cells from the same column, we will get back the elements themselves, not in a dataframe:
# Selects cells from the 2nd col
df[c(1,3), 2]
#> [1] 12 32
# Subsetted cells are of type `numeric`
class(df[c(1,3), 2])
#> [1] "numeric"
# Selects all cells from the 2nd col
df[, 2]
#> [1] 12 22 32
# Subsetted column is of type `numeric`
class(df[, 2])
#> [1] "numeric"
However, if we select cells from the same row, or cells across multiple rows and columns, we will get back a dataframe that contains the selected cells:
# Selects cells from the 2nd row
df[2, c("a", "c")]
#> # A tibble: 1 × 2
#> a c
#> <dbl> <dbl>
#> 1 21 23
# Subsetted cells are returned as a dataframe
class(df[2, c("a", "c")])
#> [1] "data.frame"
# Selects all cells from the 2nd row
df[2, ]
#> # A tibble: 1 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21 22 23 24
# Subsetted row is returned as a dataframe
class(df[2, ])
#> [1] "data.frame"
# Selects cells from multiple rows and columns
df[1:2, c("a", "c")]
#> # A tibble: 2 × 2
#> a c
#> <dbl> <dbl>
#> 1 11 13
#> 2 21 23
# Subsetted cells are returned as a dataframe
class(df[1:2, c("a", "c")])
#> [1] "data.frame"
Subsetting dataframe cells with [[]]
With [[]], we are only allowed to select a single cell:
# Selects cell in 1st row and 2nd col
df[[1, 2]]
#> [1] 12
# Subsetted cell is of type `numeric`
class(df[[1, 2]])
#> [1] "numeric"
# This is equivalent to using `[]`
df[1, 2]
#> [1] 12
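When a single column selected with `df[<rows>, <column>]` should stay a dataframe instead of collapsing to a vector, base R dataframes accept a `drop = FALSE` argument. This is a minimal sketch that rebuilds the same 3 x 4 `df` used above; `as_vector` and `as_frame` are illustrative names. It describes base `data.frame` behavior; tibbles do not collapse to a vector here in the first place.

```r
# Same construction as the 3 x 4 dataframe `df` used above
df <- data.frame(
  a = c(11, 21, 31),
  b = c(12, 22, 32),
  c = c(13, 23, 33),
  d = c(14, 24, 34)
)

as_vector <- df[, 2]               # collapses to a numeric vector
as_frame <- df[, 2, drop = FALSE]  # stays a dataframe with 1 column

class(as_vector)  # "numeric"
class(as_frame)   # "data.frame"
```

Keeping the result type predictable matters most inside loops and functions, where the number of selected columns may vary from one call to the next.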
Several functions and concepts are used frequently when creating loops and/or functions.
What are sequences?
- Sequences can be created using the : operator or seq() function
Example: Creating sequences using :
# Sequence from -5 to 5
-5:5
#> [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
# Sequence from 5 to -5
5:-5
#> [1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
The seq() function:
?seq
# SYNTAX AND DEFAULT VALUES
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
length.out = NULL, along.with = NULL, ...)
- from: The starting value of sequence
- to: The end (or maximal) value of sequence
- by: Increment of the sequence
Example: Creating sequences using seq()
# Sequence from 10 to 15, by increment of 1 (default)
seq(from=10, to=15)
#> [1] 10 11 12 13 14 15
# Explicitly specify increment of 1 (equivalent to above)
seq(from=10, to=15, by=1)
#> [1] 10 11 12 13 14 15
# Sequence from 100 to 150, by increment of 10
seq(from=100, to=150, by=10)
#> [1] 100 110 120 130 140 150
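The `length.out` argument shown in the syntax above is an alternative to `by`: you state how many values you want and `seq()` works out the increment. A brief sketch (the object names `evenly_spaced` and `countdown` are only for illustration):

```r
# Five evenly spaced values from 0 to 1; the increment is computed for us
evenly_spaced <- seq(from = 0, to = 1, length.out = 5)
evenly_spaced
#> [1] 0.00 0.25 0.50 0.75 1.00

# Decreasing sequence: `by` must be negative when `from` is larger than `to`
countdown <- seq(from = 10, to = 0, by = -5)
countdown
#> [1] 10  5  0
```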
The length() function:
?length
# SYNTAX
length(x)
- x: The object to find the length of
Example: Using length() to find number of elements in v
# View the atomic vector
v
#> a b c d
#> 10 20 30 40
# Use `length()` to find number of elements
length(v)
#> [1] 4
Example: Using length() to find number of elements in df
Remember that dataframes are just lists where each element is a column, so the number of elements in a dataframe is just the number of columns it has:
# View the dataframe
df
#> # A tibble: 3 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 11 12 13 14
#> 2 21 22 23 24
#> 3 31 32 33 34
# Use `length()` to find number of elements (i.e., columns)
length(df)
#> [1] 4
When we subset a dataframe using [] (i.e., select column(s) from the dataframe), the length of the subsetted object is the number of columns we selected:
# Subset one column
df[1]
#> # A tibble: 3 × 1
#> a
#> <dbl>
#> 1 11
#> 2 21
#> 3 31
# Length is one
length(df[1])
#> [1] 1
# Subset three columns
df[1:3]
#> # A tibble: 3 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 11 12 13
#> 2 21 22 23
#> 3 31 32 33
# Length is three
length(df[1:3])
#> [1] 3
When we subset a dataframe using [[]] (i.e., isolate a specific column in the dataframe), the length of the subsetted object is the number of elements in the atomic vector (i.e., the number of rows in the dataframe):
# Isolate a specific column
df[[2]]
#> [1] 12 22 32
# Length is number of elements in that column (i.e., number of rows in dataframe)
length(df[[2]])
#> [1] 3
When writing loops, it is very common to create a sequence from 1 to the length (i.e., number of elements) of an object.
Example: Generating a sequence from 1 to length of v
# There are 4 elements in the atomic vector
v
#> a b c d
#> 10 20 30 40
length(v)
#> [1] 4
# Use `:` to generate a sequence from 1 to 4
1:length(v)
#> [1] 1 2 3 4
# Use `seq()` to generate a sequence from 1 to 4
seq(1, length(v))
#> [1] 1 2 3 4
There is also a function seq_along() that makes it easier to generate a sequence from 1 to the length of an object.
The seq_along() function:
?seq_along
# SYNTAX
seq_along(x)
- x: The object to generate the sequence for
Example: Generating a sequence from 1 to length of df
# There are 4 elements (i.e., columns) in the dataframe
df
#> # A tibble: 3 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 11 12 13 14
#> 2 21 22 23 24
#> 3 31 32 33 34
# Use `seq_along()` to generate a sequence from 1 to 4
seq_along(df)
#> [1] 1 2 3 4
# which gives us the same thing as this:
1:length(df)
#> [1] 1 2 3 4
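The two approaches only differ when the object is empty, which is a case worth checking before relying on `1:length()` in a loop. A short sketch, assuming a zero-length vector named `empty` (a made-up example, such as the result of a filter that matched nothing):

```r
# Hypothetical empty vector
empty <- numeric(0)

bad <- 1:length(empty)    # 1:0 counts DOWN, giving c(1, 0)
good <- seq_along(empty)  # integer(0), i.e., nothing to loop over

# Count how many times each loop body would run
count_bad <- 0
for (i in 1:length(empty)) count_bad <- count_bad + 1

count_good <- 0
for (i in seq_along(empty)) count_good <- count_good + 1

count_bad   # 2 iterations, even though there are no elements
count_good  # 0 iterations
```

For that reason `seq_along()` is generally the safer choice when the length of the object is not known in advance.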
What is iteration?
What are loops?
Example: Printing each element of the vector c(1,2,3,4) using a loop
# There are 4 elements in the vector
c(1,2,3,4)
#> [1] 1 2 3 4
# Iterate over each element of the vector
for(i in c(1,2,3,4)) {
  print(i) # Print out each element
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
When to write loops?
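How to write a loop?
- We can build loops using the for() function
- The loop sequence goes inside the parentheses of for()
- The loop body goes inside the pair of curly brackets ({}) that follows for()
for(i in c(1,2,3,4)) { # Loop sequence
  print(i) # Loop body
}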
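Components of a loop:
1. Sequence: determines what to loop over
  - In the above example, the sequence is i in c(1,2,3,4)
  - This creates an object named i (could name it anything)
  - Each iteration of the loop assigns a different value to i
  - c(1,2,3,4) is the set of values that will be assigned to i
    - In the first iteration, the value of i is 1
    - In the second iteration, the value of i is 2, etc.
2. Body: the code to execute in each iteration
  - In the above example, the body is print(i)
  - Each iteration prints the current value of i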
You may see the loop sequence being written in slightly different ways. For example, these three loops all do the same thing:
Looping over the vector c(1,2,3)
c(1,2,3)
#> [1] 1 2 3
for(z in c(1,2,3)) { # Loop sequence
  print(z) # Loop body
}
#> [1] 1
#> [1] 2
#> [1] 3
Looping over the sequence 1:3
1:3
#> [1] 1 2 3
for(z in 1:3) { # Loop sequence
  print(z) # Loop body
}
#> [1] 1
#> [1] 2
#> [1] 3
Looping over the object num_sequence
num_sequence <- 1:3
num_sequence
#> [1] 1 2 3
for(z in num_sequence) { # Loop sequence
  print(z) # Loop body
}
#> [1] 1
#> [1] 2
#> [1] 3
When building a loop, it is useful to print out information to understand what the loop is doing.
Using print() to print a single object z
Using print() to show the value of object(s) within an iteration is not the best approach because:
- print() can only print one object per line
- print() can’t include additional text that tells you what stuff is
for(z in c(1,2,3)) {
  print(z)
}
#> [1] 1
#> [1] 2
#> [1] 3
A more informative way to print object(s) associated with each iteration is wrapping the str_c() function within the writeLines() function.
- str_c() function
  - ?str_c
  - Review str_c from the “Intro to strings, dates, and time” lecture of Rclass1
- Why wrap str_c() within writeLines()?
  - Within a loop body or function body, the str_c() function by itself will not print output
  - writeLines(str_c(...)) forces whatever is returned by str_c() to be printed
Using str_c() and writeLines() to concatenate and print multiple items
- Within a loop body or function body, in order to print output returned by str_c() you must wrap str_c() within the writeLines() function
for(z in c(1,2,3)) {
  writeLines(str_c("object z=", z))
}
#> object z=1
#> object z=2
#> object z=3
Using writeLines() by itself to print a single object z (code not run); this approach won’t work because writeLines() can only write character objects
for(z in c(1,2,3)) {
  writeLines(z)
}
Note: Using str_c() without wrapping in writeLines()
- A str_c() call that is within a loop body (or function body) will not print output
for(z in c(1,2,3)) {
  str_c("object z=", z)
}
birth_years <- c(1944,1950,1981,2016,2019)
birth_years
#> [1] 1944 1950 1981 2016 2019
for(y in birth_years) { # Loop sequence
  writeLines(str_c("object y=", y)) # Loop body
  z <- 2023 - y
  writeLines(str_c("value of 2023 minus ", y, " is ", z))
}
#> object y=1944
#> value of 2023 minus 1944 is 79
#> object y=1950
#> value of 2023 minus 1950 is 73
#> object y=1981
#> value of 2023 minus 1981 is 42
#> object y=2016
#> value of 2023 minus 2016 is 7
#> object y=2019
#> value of 2023 minus 2019 is 4
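There are 3 ways to loop over elements of an object:
1. Looping over the elements themselves
2. Looping over the names of the elements
3. Looping over the index positions of the elements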
For the examples in the next few subsections, we will be working
with the following named atomic vector and dataframe:
Create named atomic vector called vec
vec <- c(a = 5, b = -10, c = 30)
vec
#> a b c
#> 5 -10 30
Create dataframe called df with randomly generated data, 3 columns (vars) and 4 rows (obs)
set.seed(12345) # so we all get the same variable values
df <- tibble(a = rnorm(4), b = rnorm(4), c = rnorm(4))
str(df)
#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#> $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#> $ c: num [1:4] -0.284 -0.919 -0.116 1.817
Syntax: for (i in object_name)
- This approach iterates over each element in the object
- The value of i is equal to the element’s content (rather than its name or index position)
Example: Looping over elements in vec
# View named atomic vector object
vec
#> a b c
#> 5 -10 30
for (i in vec) {
  writeLines(str_c("value of object i=",i))
  writeLines(str_c("object i has: type=", typeof(i), "; length=", length(i), "; class=", class(i), "\n")) # "\n" adds line break
}
#> value of object i=5
#> object i has: type=double; length=1; class=numeric
#>
#> value of object i=-10
#> object i has: type=double; length=1; class=numeric
#>
#> value of object i=30
#> object i has: type=double; length=1; class=numeric
Example: Looping over elements in df
# View dataframe object
df
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 0.586 0.606 -0.284
#> 2 0.709 -1.82 -0.919
#> 3 -0.109 0.630 -0.116
#> 4 -0.453 -0.276 1.82
# show contents of element, outside of a loop
# each element of the dataframe is a vector that contains one element for each observation
str(df[1]) # single bracket
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:4] 0.586 0.709 -0.109 -0.453
str(df[[1]]) # double bracket
#> num [1:4] 0.586 0.709 -0.109 -0.453
for (i in df) {
  writeLines(str_c("value of object i=",i))
  writeLines(str_c("object i has: type=", typeof(i), "; length=", length(i), "; class=", class(i), "\n")) # "\n" adds line break
}
#> value of object i=0.585528817843856
#> value of object i=0.709466017509524
#> value of object i=-0.109303314681054
#> value of object i=-0.453497173462763
#> object i has: type=double; length=4; class=numeric
#>
#> value of object i=0.605887455840394
#> value of object i=-1.81795596770373
#> value of object i=0.630098551068391
#> value of object i=-0.276184105225216
#> object i has: type=double; length=4; class=numeric
#>
#> value of object i=-0.284159743943371
#> value of object i=-0.919322002474128
#> value of object i=-0.116247806352002
#> value of object i=1.81731204370422
#> object i has: type=double; length=4; class=numeric
Example: Calculating column averages for df by looping over columns
The dataframe df is a list object, where each element is a vector (i.e., column):
# View dataframe object
df
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 0.586 0.606 -0.284
#> 2 0.709 -1.82 -0.919
#> 3 -0.109 0.630 -0.116
#> 4 -0.453 -0.276 1.82
for (i in df) {
  writeLines(str_c("value of object i=", i))
  writeLines(str_c("mean value of object i=", mean(i, na.rm = TRUE), "\n"))
}
#> value of object i=0.585528817843856
#> value of object i=0.709466017509524
#> value of object i=-0.109303314681054
#> value of object i=-0.453497173462763
#> mean value of object i=0.183048586802391
#>
#> value of object i=0.605887455840394
#> value of object i=-1.81795596770373
#> value of object i=0.630098551068391
#> value of object i=-0.276184105225216
#> mean value of object i=-0.21453851650504
#>
#> value of object i=-0.284159743943371
#> value of object i=-0.919322002474128
#> value of object i=-0.116247806352002
#> value of object i=1.81731204370422
#> mean value of object i=0.124395622733679
Syntax: for (i in names(object_name))
- names() returns a vector of the object’s element names
- The value of i is equal to the element’s name (rather than its content or index position)
- The element’s content can still be accessed inside the loop:
  - object_name[i]: same object type as object_name; retains attributes (e.g., name attribute)
  - object_name[[i]]: the element itself, without attributes
Example: Looping over element names in vec
# View named atomic vector object
vec
#> a b c
#> 5 -10 30
names(vec) # View names of atomic vector object
#> [1] "a" "b" "c"
for (i in names(vec)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  #str(vec[i]) # Access element contents using []
  str(vec[[i]]) # Access element contents using [[]]
}
#>
#> value of object i=a; type=character
#> num 5
#>
#> value of object i=b; type=character
#> num -10
#>
#> value of object i=c; type=character
#> num 30
Example: Looping over element names in df
# View dataframe object
df
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 0.586 0.606 -0.284
#> 2 0.709 -1.82 -0.919
#> 3 -0.109 0.630 -0.116
#> 4 -0.453 -0.276 1.82
names(df) # View names of dataframe object (i.e., column names)
#> [1] "a" "b" "c"
# show using name to print contents, outside of a loop
str(df["a"]) # single bracket
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:4] 0.586 0.709 -0.109 -0.453
str(df[["a"]]) # double bracket
#> num [1:4] 0.586 0.709 -0.109 -0.453
for (i in names(df)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  #str(df[i]) # Access element contents using []
  str(df[[i]]) # Access element contents using [[]]
}
#>
#> value of object i=a; type=character
#> num [1:4] 0.586 0.709 -0.109 -0.453
#>
#> value of object i=b; type=character
#> num [1:4] 0.606 -1.818 0.63 -0.276
#>
#> value of object i=c; type=character
#> num [1:4] -0.284 -0.919 -0.116 1.817
Example: Calculating column averages for df by looping over column names
str(df) # View structure of dataframe object
#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#> $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#> $ c: num [1:4] -0.284 -0.919 -0.116 1.817
Remember that we can use [[]] to access element contents by their name:
for (i in names(df)) {
  writeLines(str_c("mean of element named ", i, " = ", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element named a = 0.183048586802391
#> mean of element named b = -0.21453851650504
#> mean of element named c = 0.124395622733679
If we tried completing the task using [] to access the element contents, mean() would return NA with a warning instead of the column average, because mean() only takes numeric or logical vectors as input and df[i] returns a dataframe object:
for (i in names(df)) {
  writeLines(str_c("mean of element named", i, "=", mean(df[i], na.rm = TRUE)))
  # print(class(df[i]))
}
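The difference between the two access styles can be checked outside a loop. The sketch below uses a small made-up dataframe called `scores` (not the lecture's `df`) and reflects the behavior of `mean()` in the R versions I am familiar with, where a dataframe input produces `NA` plus a warning.

```r
# Small dataframe constructed only for this illustration
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# `[[ ]]` returns the column itself, so `mean()` receives a numeric vector
mean_ok <- mean(scores[["a"]])

# `[ ]` returns a one-column dataframe, which `mean()` cannot average;
# suppressWarnings() hides the "argument is not numeric or logical" warning
mean_bad <- suppressWarnings(mean(scores["a"]))

class(scores[["a"]])  # "numeric"
class(scores["a"])    # "data.frame"
```

Checking `class()` of whatever you pass to a function, as in the commented-out `print(class(df[i]))` line above, is a quick way to diagnose this kind of problem.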
Syntax: for (i in 1:length(object_name)) OR for (i in seq_along(object_name))
- length() returns the number of elements in the input object, which we can use to create a sequence of index positions (i.e., 1:length(object_name))
- seq_along() returns a sequence of numbers that represent the index positions for all elements in the input object (i.e., equivalent to 1:length(object_name) for a non-empty object)
- The value of i is equal to the element’s index position (rather than its content or name)
- The element’s content can still be accessed inside the loop:
  - object_name[i]: same object type as object_name; retains attributes (e.g., name attribute)
  - object_name[[i]]: the element itself, without attributes
- The element’s name can be accessed with names(object_name)[i] or names(object_name)[[i]]
  - [[]] and [] are equivalent here because names() returns an unnamed vector, which does not have any attributes
Example: Looping over indices (element positions) of vec
# View named atomic vector object
vec
#> a b c
#> 5 -10 30
length(vec) # View length of atomic vector object
#> [1] 3
1:length(vec) # Create sequence from `1` to `length(vec)`
#> [1] 1 2 3
for (i in 1:length(vec)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  #str(vec[i]) # Access element contents using []
  str(vec[[i]]) # Access element contents using [[]]
}
#>
#> value of object i=1; type=integer
#> num 5
#>
#> value of object i=2; type=integer
#> num -10
#>
#> value of object i=3; type=integer
#> num 30
Example: Looping over indices (column positions) of df
# View dataframe object
df
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 0.586 0.606 -0.284
#> 2 0.709 -1.82 -0.919
#> 3 -0.109 0.630 -0.116
#> 4 -0.453 -0.276 1.82
seq_along(df) # Equivalent to `1:length(df)`
#> [1] 1 2 3
for (i in seq_along(df)) {
  writeLines(str_c("\nvalue of object i=", i, "; type=", typeof(i)))
  str(df[i]) # Access element contents using []
  str(df[[i]]) # Access element contents using [[]]
}
#>
#> value of object i=1; type=integer
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#> num [1:4] 0.586 0.709 -0.109 -0.453
#>
#> value of object i=2; type=integer
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#> $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#> num [1:4] 0.606 -1.818 0.63 -0.276
#>
#> value of object i=3; type=integer
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#> $ c: num [1:4] -0.284 -0.919 -0.116 1.817
#> num [1:4] -0.284 -0.919 -0.116 1.817
We could also access the element’s name by its index:
names(df) # View names of dataframe object (i.e., column names)
#> [1] "a" "b" "c"
names(df)[[2]] # We can access any element in the names vector by its index
#> [1] "b"
#names(df)[2] # same as above
# Incorporate the above line into the loop
for (i in 1:length(df)) {
  writeLines(str_c("i=", i, "; name=", names(df)[[i]]))
}
#> i=1; name=a
#> i=2; name=b
#> i=3; name=c
Example: Calculating column averages for df by looping over column indices
Use i in seq_along(df) to loop over the column indices and [[]] to access column contents:
str(df) # View structure of dataframe object
#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:4] 0.586 0.709 -0.109 -0.453
#> $ b: num [1:4] 0.606 -1.818 0.63 -0.276
#> $ c: num [1:4] -0.284 -0.919 -0.116 1.817
for (i in seq_along(df)) {
  writeLines(str_c("mean of element at index position", i, "=", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element at index position1=0.183048586802391
#> mean of element at index position2=-0.21453851650504
#> mean of element at index position3=0.124395622733679
There are 3 ways to loop over elements of an object: by element, by name, and by index position. Looping over index positions gives access to all three:
for (i in seq_along(df)) {
  writeLines(str_c("\n", "i=", i)) # element's index position
  name <- names(df)[[i]] # element's name (what we looped over in approach #2)
  writeLines(str_c("name=", name))
  value <- df[[i]] # element's value (what we looped over in approach #1)
  writeLines(str_c("value=", value))
}
#>
#> i=1
#> name=a
#> value=0.585528817843856
#> value=0.709466017509524
#> value=-0.109303314681054
#> value=-0.453497173462763
#>
#> i=2
#> name=b
#> value=0.605887455840394
#> value=-1.81795596770373
#> value=0.630098551068391
#> value=-0.276184105225216
#>
#> i=3
#> name=c
#> value=-0.284159743943371
#> value=-0.919322002474128
#> value=-0.116247806352002
#> value=1.81731204370422
Grolemund and Wickham differentiate between two types of tasks loops accomplish: modifying an existing object and creating a new object.
How to modify an existing object?
- Recall that we can directly access elements in an object using [[]]. We can use this same notation to modify the object.
- Even though atomic vectors can also be accessed with [], Wickham recommends using [[]] in all cases to make it clear we are working with a single element (from R for Data Science)
Recall our named atomic vector vec from the previous examples:
vec
#> a b c
#> 5 -10 30
We can loop over the index positions and use [[]] to modify the object:
for (i in seq_along(vec)) {
  vec[[i]] <- vec[[i]] * 2 # Double each element
}
vec
#> a b c
#> 10 -20 60
Recall our dataframe df from the previous examples:
df
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 0.586 0.606 -0.284
#> 2 0.709 -1.82 -0.919
#> 3 -0.109 0.630 -0.116
#> 4 -0.453 -0.276 1.82
We can loop over the index positions and use [[]] to modify the object:
for (i in seq_along(df)) {
  df[[i]] <- df[[i]] * 2 # Double each element
}
df
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1.17 1.21 -0.568
#> 2 1.42 -3.64 -1.84
#> 3 -0.219 1.26 -0.232
#> 4 -0.907 -0.552 3.63
So far our loops have two components: the sequence and the body.
When we create a new object to store the results of a loop, our loops have three components: the sequence, the body, and the output (the object that stores the results).
Grolemund and Wickham recommend using vector() to create this new object prior to writing the loop (rather than creating the new object within the loop):
“Before you start loop…allocate sufficient space for the output. This is very important for efficiency: if you grow the for loop at each iteration using c() (for example), your for loop will be very slow.”
The vector() function:
?vector
# SYNTAX AND DEFAULT VALUES
vector(mode = "logical", length = 0)
- mode: Type of vector to create (e.g., "logical", "numeric", "list")
- length: Length of the vector
Recall the previous example where we calculated the mean value of each column in dataframe df:
str(df)
#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:4] 1.171 1.419 -0.219 -0.907
#> $ b: num [1:4] 1.212 -3.636 1.26 -0.552
#> $ c: num [1:4] -0.568 -1.839 -0.232 3.635
for (i in seq_along(df)) {
  writeLines(str_c("mean of element at index position", i, "=", mean(df[[i]], na.rm = TRUE)))
}
#> mean of element at index position1=0.366097173604781
#> mean of element at index position2=-0.42907703301008
#> mean of element at index position3=0.248791245467358
Let’s create a new object to store these column averages. Specifically, we’ll create a new numeric vector whose length is equal to the number of columns in df:
output <- vector(mode = "numeric", length = length(df))
# print
output
#> [1] 0 0 0
class(output) # Specified by `mode` argument in `vector()`
#> [1] "numeric"
length(output) # Specified by `length` argument in `vector()`
#> [1] 3
We can loop over the index positions of df and use [[]] to modify output:
for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]], na.rm = TRUE) # Mean of df[[1]] assigned to output[[1]], etc.
}
output
#> [1] 0.3660972 -0.4290770 0.2487912
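To make the contrast in the quotation concrete, here is a side-by-side sketch of the "growing" pattern and the preallocated pattern. It uses a small made-up dataframe named `scores` so that it runs on its own; `grown` and `filled` are illustrative names. Both loops produce the same values, and the difference is only in how the output object is built.

```r
# Small dataframe constructed only for this illustration
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 5))

# Growing pattern: the output is copied and extended on every iteration
grown <- c()
for (i in seq_along(scores)) {
  grown <- c(grown, mean(scores[[i]]))
}

# Preallocated pattern: create the full-length output first, then fill it
filled <- vector(mode = "numeric", length = length(scores))
for (i in seq_along(scores)) {
  filled[[i]] <- mean(scores[[i]])
}

identical(grown, filled)
```

With three columns the speed difference is negligible; the habit matters when the loop runs over thousands of elements.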
The general recipe for how to write a loop:
Complete the task for one instance outside a loop (this is akin to writing the body of the loop)
Write the sequence of the loop
Modify the parts of the loop body that need to change with each iteration
If you are creating a new object to store output of the loop, create this object outside of the loop
Construct the loop
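As one possible walk-through of the recipe, the sketch below counts missing values in each column of a made-up dataframe. The names `survey`, `age`, `income`, and `n_missing` are assumptions chosen for illustration, not part of the lecture data.

```r
# Hypothetical dataframe with some missing values
survey <- data.frame(age = c(25, NA, 40), income = c(NA, NA, 50000))

# Complete the task for one instance outside a loop
sum(is.na(survey[["age"]]))

# Create the object that will store the output, outside of the loop
n_missing <- vector(mode = "numeric", length = length(survey))

# Write the sequence, swap the hard-coded "age" for the index i,
# and construct the loop
for (i in seq_along(survey)) {
  n_missing[[i]] <- sum(is.na(survey[[i]]))
}
n_missing
```

Getting the single-instance version right first means that any later error is most likely in the sequence or in the part that changes with each iteration.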
It’s usually obvious when you are duplicating code, but unclear whether you should write a loop or whether you should write a function.
In my experience, loops are better for repeated tasks when the individual tasks are very similar to one another (e.g., an operation involving NA applied to a set of variables)
Because functions can have many arguments, functions are better when the individual tasks differ substantially from one another
Note:
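What is conditional execution?
- Conditional execution means running certain code only when some condition(s) are met
- The condition(s) must evaluate to either TRUE or FALSE; for example 1>5 evaluates to FALSE
- We use if, else if, and else statements to run code conditionally (covered in upcoming sections)
Figure credit: Decision Making in C / C++, GeeksforGeeks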
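What is a condition?
- A condition is an expression that evaluates to either TRUE or FALSE
- The condition must have a length of 1 (otherwise older versions of R warn that only the first element will be used, and R 4.2.0 and later raise an error)
Any expression that has a length of 1 and can evaluate to either TRUE or FALSE can be used as the condition: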
# This expression evaluates to `TRUE`
2 + 2 == 4
#> [1] TRUE
# It is of type `logical`
typeof(2 + 2 == 4)
#> [1] "logical"
# It has length of `1`
length(2 + 2 == 4)
#> [1] 1
Some functions return a logical, so you might also see a function call being used as the condition:
# This function call returns `FALSE` because the string "NA" is not the missing value `NA`
is.na("NA")
#> [1] FALSE
# It is of type `logical`
typeof(is.na("NA"))
#> [1] "logical"
# It has length of `1`
length(is.na("NA"))
#> [1] 1
if statement conditions
What are if statement conditions?
if statements allow you to conditionally execute certain blocks of code depending on whether some condition(s) is TRUE
The condition goes inside the parentheses of if() and the block of code to execute goes between the curly brackets ({})
The condition must evaluate to either TRUE or FALSE (i.e., be of type logical)
The condition must have a length of 1
if (condition) {
# code executed when condition is TRUE
}
The block of code is executed if the condition evaluates to TRUE:
if (TRUE) {
  writeLines("This block is executed.")
}
#> This block is executed.
Note that the block of code below yields exactly the same result as the one above because the condition evaluates to TRUE:
if (1==1) {
  writeLines("This block is executed.")
}
#> This block is executed.
The block of code is not executed if the condition evaluates to FALSE:
if (FALSE) {
writeLines("This block is not executed.")
}
TRUE
Remember that any expression that has a length of 1 and can evaluate to either TRUE or FALSE can be used as the condition:
# This expression evaluates to `TRUE`
2 + 2 == 4
#> [1] TRUE
# It is of type `logical`
typeof(2 + 2 == 4)
#> [1] "logical"
# It has length of `1`
length(2 + 2 == 4)
#> [1] 1
# We can use it as the if statement condition
if (2 + 2 == 4) {
  writeLines("This block is executed because `2 + 2 == 4` evaluates to `TRUE`.")
}
#> This block is executed because `2 + 2 == 4` evaluates to `TRUE`.
FALSE
Recall that some functions return a logical, so you might also see a function call being used as the condition:
# This function call returns `FALSE` because the string "NA" is not the missing value `NA`
is.na("NA")
#> [1] FALSE
# It is of type `logical`
typeof(is.na("NA"))
#> [1] "logical"
# It has length of `1`
length(is.na("NA"))
#> [1] 1
# We can use it as the if statement condition
if (is.na("NA")) {
  writeLines("This block is not executed because the condition evaluates to `FALSE`.")
}
# double negative equals positive!
if (!is.na("NA")) {
  writeLines("This block is not not executed because `!FALSE` evaluates to `TRUE`!")
}
#> This block is not not executed because `!FALSE` evaluates to `TRUE`!
|| and &&
How to combine multiple logical expressions in a condition?
Recall that a condition must be of type logical and have a length of 1
An if statement condition can be made up of multiple logical expressions
Use || (or) and && (and) to combine multiple logical expressions
“You should never use | or & in an if statement: these are vectorised operations that apply to multiple values (that’s why you use them in filter())” (From R for Data Science)
Vectorised operations apply to each pair of corresponding elements of the vectors and return a vector:
c(TRUE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
#> [1] TRUE TRUE FALSE
TRUE or TRUE is TRUE
TRUE or FALSE is TRUE
FALSE or FALSE is FALSE
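Whereas || and && expect operands of length 1. In R 4.2.x (the version used for the output below), passing longer vectors gives a warning and only the first element of each vector is used; from R 4.3.0 onward it is an error: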
c(TRUE, TRUE, FALSE) || c(TRUE, FALSE, FALSE)
#> Warning in c(TRUE, TRUE, FALSE) || c(TRUE, FALSE, FALSE): 'length(x) = 3 > 1'
#> in coercion to 'logical(1)'
#> [1] TRUE
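When a check naturally produces a logical vector longer than 1, one common way to turn it into a valid condition is to collapse it with any() or all() first. The sketch below is only an illustration; the vector x and the object names are hypothetical.

```r
x <- c(5, NA, 12)  # hypothetical vector for illustration

# is.na(x) has length 3, so it cannot be used directly as an if condition.
# any() and all() each collapse a logical vector to a single TRUE or FALSE.
has_missing <- any(is.na(x))              # TRUE if at least one element is NA
all_positive <- all(x > 0, na.rm = TRUE)  # TRUE if every non-missing element is > 0

if (has_missing && all_positive) {
  writeLines("x has missing values, but all observed values are positive.")
}
```

Both sides of && now have length 1, so the condition is valid in any R version.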
When using || (or), the block of code is executed if any of the conditions evaluates to TRUE:
if (condition1 || condition2 || condition3) {
  # code executed when any of the conditions is TRUE
}
When using && (and), the block of code is executed if all of the conditions evaluate to TRUE:
if (condition1 && condition2 && condition3) {
  # code executed when all of the conditions are TRUE
}
||
When using || (or), the block of code is executed if any of the conditions evaluates to TRUE:
# This block is executed because at least 1 condition is `TRUE`
if (TRUE || FALSE) {
  writeLines("This block is executed.")
}
#> This block is executed.
# This block is not executed because both logical expressions evaluate to `FALSE`
if (is.na("NA") || 2 + 2 == 5) {
  writeLines("This block is not executed.")
}
&&
When using && (and), the block of code is executed if all of the conditions evaluate to TRUE:
# This block is not executed because not all conditions are `TRUE`
if (TRUE && FALSE) {
  writeLines("This block is not executed.")
}
# This block is executed because all logical expressions evaluate to `TRUE`
if (!is.na("NA") && 2 + 2 == 4) {
  writeLines("This block is executed.")
}
#> This block is executed.
else statements
What are else statements?
After an if block, you can include an else block that will be executed if the if block did not execute
In other words, the else block is executed if the if statement’s condition is not met
if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}
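One detail of this syntax is easy to miss: at the top level of a script or the console, R treats the if statement as complete once it reaches the closing curly bracket, so an else that starts on its own line is a syntax error. Keeping else on the same line as the closing bracket avoids this. The object names below are hypothetical and used only for illustration.

```r
temperature <- 35  # hypothetical value for illustration

if (temperature > 30) {
  status <- "hot"
} else {  # `else` sits on the same line as the closing bracket of the if block
  status <- "not hot"
}
status
```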
Recall the function dir.exists() that checks if a directory exists:
getwd()
#> [1] "C:/Users/ozanj/Documents/rclass2/lectures/programming"
list.files()
#> [1] "my_new_directory" "programming.html" "programming.Rmd"
directory <- "my_new_directory"
dir.exists(directory)
#> [1] TRUE
Let’s take a look at using an if-else statement to create the directory (using dir.create()) only if it doesn’t currently exist:
if (dir.exists(directory)) {
  writeLines(str_c("The directory '", directory, "' already exists."))
} else {
  dir.create(directory)
  writeLines(str_c("Created directory '", directory, "'."))
}
#> The directory 'my_new_directory' already exists.
# Check that directory is created
list.files()
#> [1] "my_new_directory" "programming.html" "programming.Rmd"
If we try running this code again, the if block would be executed because the directory already exists:
dir.exists(directory)
#> [1] TRUE
if (dir.exists(directory)) {
  writeLines(str_c("The directory '", directory, "' already exists."))
} else {
  dir.create(directory)
  writeLines(str_c("Created directory '", directory, "'."))
}
#> The directory 'my_new_directory' already exists.
We can loop over multiple directory names and, for each, create the directory only if it does not already exist:
directories <- c("scripts", "dictionaries", "output")
directories
#> [1] "scripts" "dictionaries" "output"
for (i in directories) {
  if (dir.exists(i)) {
    writeLines(str_c("The directory '", i, "' already exists."))
  } else {
    dir.create(i)
    writeLines(str_c("Created directory '", i, "'."))
  }
}
#> Created directory 'scripts'.
#> Created directory 'dictionaries'.
#> Created directory 'output'.
# Check that directories are created
list.files()
#> [1] "dictionaries" "my_new_directory" "output" "programming.html"
#> [5] "programming.Rmd" "scripts"
If we try running the code again, the if block would be executed during each iteration of the loop because all the directories already exist:
for (i in directories) {
  if (dir.exists(i)) {
    writeLines(str_c("The directory '", i, "' already exists."))
  } else {
    dir.create(i)
    writeLines(str_c("Created directory '", i, "'."))
  }
}
#> The directory 'scripts' already exists.
#> The directory 'dictionaries' already exists.
#> The directory 'output' already exists.
else if statements
What are else if statements?
Between if blocks and else blocks, you can include additional block(s) using else if that get executed if their condition is met and none of the previous blocks got executed
At most one block is executed in an if/else if/else chain
if (condition) {
  # run this code if condition TRUE
} else if (condition) {
  # run this code if previous condition FALSE and this condition TRUE
} else if (condition) {
  # run this code if both previous conditions FALSE and this condition TRUE
} else {
  # run this code if all previous conditions FALSE
}
else if
statement
Using the diamonds dataset available from ggplot2 (part of tidyverse), let’s create a vector of 5 diamond prices:
prices <- unique(diamonds$price)[23:27]
str(prices)
#> int [1:5] 405 552 553 554 2757
Let’s loop through the prices vector and print whether each is affordable (under $500), pricey (between $500 and $1000), or too expensive ($1000 and up):
for (i in prices) {
  if (i < 500) {
    writeLines(str_c("This diamond costs $", i, " and is affordable."))
  } else if (i >= 500 && i < 1000) {
    writeLines(str_c("This diamond costs $", i, " and is pricey..."))
  } else {
    writeLines(str_c("This diamond costs $", i, " and is too expensive!"))
  }
}
#> This diamond costs $405 and is affordable.
#> This diamond costs $552 and is pricey...
#> This diamond costs $553 and is pricey...
#> This diamond costs $554 and is pricey...
#> This diamond costs $2757 and is too expensive!
Remember that each subsequent else if statement will only be considered if all previous blocks did not run (i.e., their conditions were not met). This means we can simplify i >= 500 && i < 1000 to i < 1000 in the else if condition:
for (i in prices) {
  if (i < 500) {
    writeLines(str_c("This diamond costs $", i, " and is affordable."))
  } else if (i < 1000) {
    writeLines(str_c("This diamond costs $", i, " and is pricey..."))
  } else {
    writeLines(str_c("This diamond costs $", i, " and is too expensive!"))
  }
}
#> This diamond costs $405 and is affordable.
#> This diamond costs $552 and is pricey...
#> This diamond costs $553 and is pricey...
#> This diamond costs $554 and is pricey...
#> This diamond costs $2757 and is too expensive!
Especially when working with large datasets, the time it takes for your code to run can really add up, so it is important to look for ways to make your code run as efficiently as possible. We can use system.time() to measure how long it takes for some code to run.
The system.time() function:
?system.time
# SYNTAX AND DEFAULT VALUES
system.time(expr, gcFirst = TRUE)
Function: Return CPU (and other) times that expr used
Arguments:
expr: Valid R expression to be timed
For the examples below, we’ll use this numeric atomic vector called prices that is equal to the price of each diamond in the diamonds dataframe:
prices <- diamonds$price
str(prices) # 53,940 diamond prices
#> int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
Let’s take a look at an example of using a loop to calculate the z-score for each diamond price and storing the scores in a vector. First, we’ll calculate the mean and standard deviation of the prices:
m <- mean(prices, na.rm=TRUE)
s <- sd(prices, na.rm=TRUE)
[Method 1] Growing the vector inside the loop using c()
“Whenever you use c(), append(), cbind(), rbind(), or paste() to create a bigger object, R must first allocate space for the new object and then copy the old object to its new home. If you’re repeating this many times, like in a for loop, this can be quite expensive.” (From Advanced R)
z_prices <- c()

system.time(
  for (i in 1:length(prices)) {
    z_prices <- c(z_prices, (prices[i] - m)/s)
  }
)
#> user system elapsed
#> 3.69 1.72 5.41
[Method 2] Creating the output vector before loop (Recommended)
Create the z_prices object using vector() before the loop
z_prices <- vector("double", length(prices))

system.time(
  for (i in 1:length(prices)) {
    z_prices[i] <- (prices[i] - m)/s
  }
)
#> user system elapsed
#> 0 0 0
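A related detail about the loop sequence: 1:length(x) and seq_along(x) give the same sequence for any non-empty vector, but they differ when the vector is empty. The sketch below illustrates the difference; the objects empty, n_iter_colon, and n_iter_seq are hypothetical names used only here.

```r
empty <- numeric(0)  # hypothetical empty input

1:length(empty)   # 1:0 counts down, giving the two values 1 and 0
seq_along(empty)  # an empty sequence

# Count how many times each loop body runs
n_iter_colon <- 0
for (i in 1:length(empty)) n_iter_colon <- n_iter_colon + 1

n_iter_seq <- 0
for (i in seq_along(empty)) n_iter_seq <- n_iter_seq + 1

c(colon = n_iter_colon, seq_along = n_iter_seq)
```

With seq_along() the loop body is skipped entirely for an empty input, which is usually the intended behavior.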
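What does it mean to “vectorise your code”?
Vectorising means using a function that operates on an entire vector at once rather than on one element at a time (e.g., using the if_else() function instead of an if-else statement inside a for loop)
To see the difference, let’s look at the example of classifying diamond prices as affordable or expensive.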
[Method 1] Using if-else statement inside a for loop
output <- vector("character", length(prices))

system.time(
  for (i in 1:length(prices)) {
    if (prices[i] < 500) { # compare the price at position i, not the index i itself
      output[i] <- str_c("This diamond costs $", prices[i], " and is affordable.")
    } else {
      output[i] <- str_c("This diamond costs $", prices[i], " and is too expensive!")
    }
  }
)
#> user system elapsed
#> 1.78 0.00 1.79
[Method 2] Using the vectorised if_else() function (Recommended)
system.time(
  output <- if_else(prices < 500,
    str_c("This diamond costs $", prices, " and is affordable."),
    str_c("This diamond costs $", prices, " and is too expensive!")
  )
)
#> user system elapsed
#> 0.06 0.00 0.06
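The same idea applies to arithmetic: R's arithmetic operators are already vectorised, so a z-score calculation does not need a loop at all. The sketch below uses a small hypothetical vector called values (not the diamonds data) to check that the looped and vectorised versions agree.

```r
values <- c(326, 400, 554, 2757, 18823)  # hypothetical prices for illustration
m_values <- mean(values)
s_values <- sd(values)

# Loop version: one element at a time, output vector created before the loop
z_loop <- vector("double", length(values))
for (i in seq_along(values)) {
  z_loop[i] <- (values[i] - m_values) / s_values
}

# Vectorised version: the subtraction and division apply to every element at once
z_vec <- (values - m_values) / s_values

all.equal(z_loop, z_vec)
```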
if statements vs. if/else if/else statements
[Method 1] Using multiple if statements inside a for loop
It is generally more efficient to use if/else if/else statements instead of multiple if statements
With multiple if statements, each of the if conditions needs to be checked for every diamond price
output <- vector("integer", length(prices))

system.time(
  for (i in 1:length(prices)) {
    # each condition tests the price at position i, not the index i itself
    if (prices[i] < 200) {
      output[i] <- 1
    }
    if (prices[i] >= 200 && prices[i] < 400) {
      output[i] <- 2
    }
    if (prices[i] >= 400 && prices[i] < 600) {
      output[i] <- 3
    }
    if (prices[i] >= 600 && prices[i] < 800) {
      output[i] <- 4
    }
    if (prices[i] >= 800 && prices[i] < 1000) {
      output[i] <- 5
    }
    if (prices[i] >= 1000 && prices[i] < 1500) {
      output[i] <- 6
    }
    if (prices[i] >= 1500 && prices[i] < 2000) {
      output[i] <- 7
    }
    if (prices[i] >= 2000) {
      output[i] <- 8
    }
  }
)
#> user system elapsed
#> 0.03 0.00 0.03
[Method 2] Using if/else if/else statements inside a for loop
With if/else if/else statements, not all conditions will be checked (only up to when one of the blocks gets executed)
The difference grows with the number of if statements there are
output <- vector("integer", length(prices))

system.time(
  for (i in 1:length(prices)) {
    # each condition tests the price at position i, not the index i itself
    if (prices[i] < 200) {
      output[i] <- 1
    } else if (prices[i] < 400) {
      output[i] <- 2
    } else if (prices[i] < 600) {
      output[i] <- 3
    } else if (prices[i] < 800) {
      output[i] <- 4
    } else if (prices[i] < 1000) {
      output[i] <- 5
    } else if (prices[i] < 1500) {
      output[i] <- 6
    } else if (prices[i] < 2000) {
      output[i] <- 7
    } else {
      output[i] <- 8
    }
  }
)
#> user system elapsed
#> 0.03 0.00 0.04
[Method 3] Using the vectorised if_else() function
Nested if_else() statements are hard to read (From Efficient R programming)
The code below uses the base R ifelse() function, which is also vectorised
system.time(
  output <- ifelse(prices < 200, 1, ifelse(prices < 400, 2, ifelse(prices < 600, 3,
    ifelse(prices < 800, 4, ifelse(prices < 1000, 5, ifelse(prices < 1500, 6,
    ifelse(prices < 2000, 7, 8)))))))
)
#> user system elapsed
#> 0.00 0.02 0.01
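If the nesting becomes hard to read, one possible alternative (not the approach used in this lecture) is the base R findInterval() function, which is also vectorised and takes the cut points as a single vector. The sketch below uses a small hypothetical vector example_prices so that it runs on its own.

```r
# Hypothetical prices chosen to hit several categories
example_prices <- c(150, 200, 399, 650, 999, 1200, 1999, 2000, 18000)

# Lower bounds of categories 2 through 8 (same cut points as the loops above)
breaks <- c(200, 400, 600, 800, 1000, 1500, 2000)

# findInterval() counts how many break points are <= each price (0 to 7),
# so adding 1 yields the category number (1 to 8)
category <- findInterval(example_prices, breaks) + 1L
category
#> [1] 1 2 2 4 5 6 7 8 8
```

Adding or moving a cut point then means editing one vector rather than re-nesting function calls.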
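What are functions?
Functions generally follow three sequential steps:
Take in input object(s)
Process the input
Return a new object
We’ve been working with functions all quarter.
Example: The sum() function
?sum
Input: a vector of elements (numeric or logical)
Processing: calculate the sum of the elements
Return: a vector of length 1 whose value is the sum of the input vector
Can use str() to investigate the return value of the function
# Apply sum() to atomic vector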
sum(c(1,2,3))
#> [1] 6
sum(c(1,2,3)) %>% str()
#> num 6
What are “user-written functions”? [my term]
Like all functions, user-written functions usually follow three steps: take in input, process the input, and return output
You write them yourself, in contrast to existing functions such as sum() or length()
Can think of user-written functions as analogous to mathematical functions that take one or more input variables
For example, the function f_xy takes two input variables: x and y
f_xy <- function(x,y) {
  # function body
  x^2 + y
}
f_xy(x=2,y=2)
#> [1] 6
f_xy(x=3,y=1)
#> [1] 10
#what the function returns
f_xy(x=3,y=1) %>% str() # a numeric atomic vector of length 1
#> num 10
# assign the output of function to an object
z <- f_xy(x=3,y=1)
z
#> [1] 10
Examples of what we might want to write a function for:
When to write a function?
Since functions are reusable pieces of code, they allow you to “automate” tasks that you perform more than once
The alternative to writing a function to perform some specific task (aside from loops/iteration) is to copy and paste the code each time you want to perform a task
Wickham and Grolemund chapter 19.2:
“You should consider writing a function whenever you’ve copied and pasted a block of code more than twice”
Darin Christenson (professor, UCLA public policy) refers to the programming mantra DRY (“Don’t Repeat Yourself”)
“Functions enable you to perform multiple tasks (that are similar to one another) without copying the same code over and over”
Why write functions to complete a task? (as opposed to the copy-and-paste approach)
How to approach writing functions? (broad recipe)
Often, the functions we write will utilize existing functions from Base R and other R packages. For example, create a function named z_score() that calculates how many standard deviations an observation is from the mean. Our z_score() function will use the existing Base R mean() and sd() functions.
We will avoid creating user-written functions that utilize Tidyverse functions, particularly functions from the dplyr package such as group_by(). The reason is that including certain Tidyverse/dplyr functions in a user-written function requires knowledge of some advanced programming skills that we have not introduced yet. For more explanation, see here and here.
Therefore, when teaching how to write functions that perform data manipulation tasks, we will use a “Base R approach” rather than a “Tidyverse approach.”
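The function() function tells R that you are writing a function:
# To find help file for function():
?`function` # But help file is not a helpful introduction

function_name <- function(arg1, arg2, arg3) {
  # function body
}
Three components of a function:
Function name: define a function using function() and give it a name using the assignment operator <-
Function arguments: the inputs that the function takes go inside the parentheses of function()
Above, the arguments are named arg1, arg2, arg3, but we could have written: function(x, y, z) or function(Larry, Curly, Moe)
Function body: what the function does to the inputs goes inside the curly brackets ({}) that follow function()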
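To make the three components concrete, the sketch below defines a small function and then inspects its arguments and body with the base R functions formals() and body(). The function add_two and its argument names are hypothetical, chosen only for this illustration.

```r
# Hypothetical function used only to label the three components
add_two <- function(first, second) {  # name: add_two; arguments: first, second
  first + second                      # body: what the function does with its inputs
}

names(formals(add_two))  # the argument names
body(add_two)            # the function body
add_two(first = 1, second = 2)
#> [1] 3
```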
print_hello() function
Task: Write function called print_hello() that prints "Hello, world"
# Expected output
print_hello()
#> [1] "Hello, world"
We want to print "Hello, world":
"Hello, world"
#> [1] "Hello, world"
Alternative approaches to perform task outside of function:
print("Hello, world")
#> [1] "Hello, world"
str_c("Hello, world")
#> [1] "Hello, world"
str_c("Hello, world", sep = "", collapse = NULL)
#> [1] "Hello, world"
writeLines(str_c("Hello, world", sep = "", collapse = NULL))
#> Hello, world
# Define function called `print_hello()`
print_hello <- function() { # This function takes no arguments
  "Hello, world" # The body of the function simply prints "Hello, world"
}
# Call function
print_hello()
#> [1] "Hello, world"
# Investigate return value
print_hello() %>% str()
#> chr "Hello, world"
Function name: print_hello()
Function arguments: the print_hello() function doesn’t take any arguments
Function body: print_hello() simply prints "Hello, world"
Function return value: "Hello, world"
Task: Modify print_hello() to take a name as input and print "Hello, world. My name is <name>"
# Expected output
print_hello("Ozan Jaquette")
#> [1] "Hello, world. My name is Ozan Jaquette"
Say we want to print "Hello, world. My name is Ozan Jaquette":
"Hello, world. My name is Ozan Jaquette"
#> [1] "Hello, world. My name is Ozan Jaquette"
Remember that we eventually want the name to be an input to our function, so let’s create a separate object, x, to store name:
x <- "Ozan Jaquette"
str_c("Hello, world. My name is", x, sep = " ", collapse = NULL)
#> [1] "Hello, world. My name is Ozan Jaquette"
# Modify `print_hello()` function
print_hello <- function(x) { # This function takes 1 argument
  # In the body, use `str_c()` to concatenate greeting and name
  str_c("Hello, world. My name is", x, sep = " ", collapse = NULL)
}
# Call function
print_hello(x = "Ozan Jaquette")
#> [1] "Hello, world. My name is Ozan Jaquette"
print_hello("Ozan Jaquette")
#> [1] "Hello, world. My name is Ozan Jaquette"
# Investigate return value
print_hello(x = "Ozan Jaquette") %>% str()
#> chr "Hello, world. My name is Ozan Jaquette"
Function name: print_hello()
Function arguments: the print_hello() function takes a name as input
Function body: print_hello() prints "Hello, world. My name is <name>"
Function return value: "Hello, world. My name is <name>"
Task: Modify print_hello() to take a name and birthdate as inputs and print "Hello, world. My name is <name> and I am <age> years old"
# Expected output
print_hello("Ozan Jaquette", "01/16/1979")
#> [1] "Hello, world. My name is Ozan Jaquette and I am 44 years old"
Use mdy() from the lubridate package to help handle birthdates:
y <- "01/16/1979"
y
#> [1] "01/16/1979"
y %>% str()
#> chr "01/16/1979"
mdy(y)
#> [1] "1979-01-16"
mdy(y) %>% str()
#> Date[1:1], format: "1979-01-16"
Using today() to get today’s date, we can calculate an age given a birthdate:
today()
#> [1] "2023-03-02"
# Calculate difference
today() - mdy(y)
#> Time difference of 16116 days
str(today() - mdy(y))
#> 'difftime' num 16116
#> - attr(*, "units")= chr "days"
# Convert to duration
as.duration(today() - mdy(y))
#> [1] "1392422400s (~44.12 years)"
as.duration(today() - mdy(y)) %>% str()
#> Formal class 'Duration' [package "lubridate"] with 1 slot
#> ..@ .Data: num 1.39e+09
# Create age in years as numeric vector
as.numeric(as.duration(today() - mdy(y)), "years")
#> [1] 44.1232
as.numeric(as.duration(today() - mdy(y)), "years") %>% str()
#> num 44.1
floor(as.numeric(as.duration(today() - mdy(y)), "years"))
#> [1] 44
str(floor(as.numeric(as.duration(today() - mdy(y)), "years")))
#> num 44
Putting it all together, let’s print the name and age in years:
x <- "Ozan Jaquette"
y <- "01/16/1979"

age <- floor(as.numeric(as.duration(today() - mdy(y)), "years"))

str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ", collapse = NULL)
#> [1] "Hello, world. My name is Ozan Jaquette and I am 44 years old"
# Modify `print_hello()` function
print_hello <- function(x, y) { # This function takes 2 arguments
  # In the body, calculate age
  age <- floor(as.numeric(as.duration(today() - mdy(y)), "years"))

  # Use `str_c()` to concatenate greeting, name, and age
  str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ", collapse = NULL)
}
# Call function
print_hello(x = "Ozan Jaquette", y = "01/16/1979")
#> [1] "Hello, world. My name is Ozan Jaquette and I am 44 years old"
print_hello(x = "Kartal Jaquette", y = "01/24/1983")
#> [1] "Hello, world. My name is Kartal Jaquette and I am 40 years old"
print_hello(x = "Dan Jaquette", y = "10/29/1950")
#> [1] "Hello, world. My name is Dan Jaquette and I am 72 years old"
print_hello(x = "Sumru Erkut", y = "06/15/1944")
#> [1] "Hello, world. My name is Sumru Erkut and I am 78 years old"
print_hello(x = "Sumru Jaquette-Nasiali", y = "04/05/2019")
#> [1] "Hello, world. My name is Sumru Jaquette-Nasiali and I am 3 years old"
# Investigate return value
print_hello(x = "Ozan Jaquette", y = "01/16/1979") %>% str()
#> chr "Hello, world. My name is Ozan Jaquette and I am 44 years old"
Function name: print_hello()
Function arguments: the print_hello() function takes a name and birthdate as inputs
Function body: print_hello() prints "Hello, world. My name is <name> and I am <age> years old"
Function return value: "Hello, world. My name is <name> and I am <age> years old"
Task: Test/break print_hello() by passing in birthdate using a different format
print_hello(x = "Sumru Jaquette-Nasiali", y = "04/05/2019") # this works
print_hello(x = "Sumru Jaquette-Nasiali", y = "2019/04/05") # this does not
If we wanted to make additional improvements to print_hello(), we could modify the function to allow date of birth to be entered using several alternative formats (e.g., "04/05/2019" or "2019/04/05"). Alternatively,
we could throw a custom error to warn users to use correct inputs (see below).
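As a rough sketch of the custom-error idea, the helper below checks the format of the birthdate before doing anything else and stops with an explicit message if it does not match. The helper name check_birthdate is hypothetical, and it uses base R as.Date() instead of mdy() so that the example runs without loading any packages; it is also stricter than mdy(), which accepts other separators.

```r
# Hypothetical helper, not part of the lecture code
check_birthdate <- function(y) {
  # Require the pattern mm/dd/yyyy before any date arithmetic is attempted
  if (!grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", y)) {
    stop("`y` must be a birthdate in mm/dd/yyyy format, e.g. \"04/05/2019\"")
  }
  as.Date(y, format = "%m/%d/%Y")
}

check_birthdate("04/05/2019")
# check_birthdate("2019/04/05")  # stops with the custom error message
```

A check like this could be placed at the top of the print_hello() body so that the user sees a clear message instead of a parsing problem.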
z_score() function
The z-score for an observation i is the number of standard deviations it is away from the mean:
$z_i = \frac{x_i - \bar{x}}{sd(x)}$
Task: Write function called z_score() that calculates the z-score for each element of a vector
# Expected output
z_score(c(1, 2, 3, 4, 5))
#> [1] -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111
Create a vector of numbers we’ll use to calculate z-score:
v <- c(1, 2, 3, 4, 5)
v
#> [1] 1 2 3 4 5
typeof(v)
#> [1] "double"
class(v)
#> [1] "numeric"
length(v)
#> [1] 5
v[1] # 1st element of v
#> [1] 1
v[4] # 4th element of v
#> [1] 4
We can calculate the z-score using the Base R mean() and sd() functions:
mean(v)
#> [1] 3
sd(v)
#> [1] 1.581139
Calculate z-score for some value:
(1-mean(v))/sd(v)
#> [1] -1.264911
(4-mean(v))/sd(v)
#> [1] 0.6324555
Calculate z-score for particular elements of vector v:
v[1]
#> [1] 1
(v[1]-mean(v))/sd(v)
#> [1] -1.264911
v[4]
#> [1] 4
(v[4]-mean(v))/sd(v)
#> [1] 0.6324555
Calculate z_i for all elements of vector v:
v
#> [1] 1 2 3 4 5
(v-mean(v))/sd(v)
#> [1] -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111
Write function to calculate z-score for all elements of the vector:
z_score <- function(x) {
  (x - mean(x))/sd(x)
}
Function name: z_score()
Function arguments: the z_score() function takes an object x as input to calculate the z-score for
Function body: z_score() calculates z-score of input (e.g., for each element of x, calculate difference between value of element and mean value of elements, then divide by standard deviation of elements)
Test/call the function:
z_score(x = c(1, 2, 3, 4, 5))
#> [1] -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111
# investigate what function returns
z_score(x = c(1, 2, 3, 4, 5)) %>% str()
#> num [1:5] -1.265 -0.632 0 0.632 1.265
v
#> [1] 1 2 3 4 5
z_score(x = v)
#> [1] -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111
seq(20, 25)
#> [1] 20 21 22 23 24 25
z_score(x = seq(20, 25))
#> [1] -1.3363062 -0.8017837 -0.2672612 0.2672612 0.8017837 1.3363062
#you could even create a new object whose values are the output/return of the function
z_object <- z_score(x = c(1, 2, 3, 4, 5))
z_object
#> [1] -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111
z_object %>% str()
#> num [1:5] -1.265 -0.632 0 0.632 1.265
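One way to sanity-check a function like this is to confirm two properties that z-scores should have (mean of 0 and standard deviation of 1) and to compare the result against the base R scale() function, which standardizes in the same way by default. The sketch below redefines z_score() so that it runs on its own; the object names v_check and z_check are hypothetical.

```r
# Same definition as above, repeated so the example is self-contained
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

v_check <- c(1, 2, 3, 4, 5)  # hypothetical input
z_check <- z_score(v_check)

mean(z_check)  # should be (numerically) 0
sd(z_check)    # should be 1

# scale() returns a one-column matrix, so convert it to a plain vector first
all.equal(z_check, as.numeric(scale(v_check)))
```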
Task: Improve the z_score() function by trying to break it
NA values
Let’s see what happens when we try passing in a vector containing NA to our z_score() function:
w <- c(NA, seq(1:5), NA)
w
#> [1] NA 1 2 3 4 5 NA
z_score(x=w)
#> [1] NA NA NA NA NA NA NA
What went wrong? By default, mean() and sd() return NA when the input contains NA, so every z-score is NA. Let’s revise our function to handle NA values:
z_score <- function(x) {
  (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)
}
w
#> [1] NA 1 2 3 4 5 NA
z_score(w)
#> [1] NA -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111 NA
Create dataframe called df:
set.seed(12345) # set "seed" so we all get the same "random" numbers
df <- tibble(
  a = c(NA,rnorm(5)),
  b = c(NA,rnorm(5)),
  c = c(NA,rnorm(5))
)
class(df)
#> [1] "tbl_df" "tbl" "data.frame"
df
#> # A tibble: 6 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 NA NA NA
#> 2 0.586 -1.82 -0.116
#> 3 0.709 0.630 1.82
#> 4 -0.109 -0.276 0.371
#> 5 -0.453 -0.284 0.520
#> 6 0.606 -0.919 -0.751
# subset a data frame w/ one element, using []
df["a"]
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 NA
#> 2 0.586
#> 3 0.709
#> 4 -0.109
#> 5 -0.453
#> 6 0.606
str(df["a"])
#> tibble [6 × 1] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:6] NA 0.586 0.709 -0.109 -0.453 ...
# subset values of an element using [[]] or $
df[["a"]]
#> [1] NA 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
str(df[["a"]])
#> num [1:6] NA 0.586 0.709 -0.109 -0.453 ...
df$a
#> [1] NA 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
str(df$a)
#> num [1:6] NA 0.586 0.709 -0.109 -0.453 ...
Experiment with components of z-score, outside of a
function:
mean(df[["a"]], na.rm=TRUE) # mean of variable "a"
#> [1] 0.2676164
sd(df[["a"]], na.rm=TRUE) # std dev of variable "a"
#> [1] 0.5178803
mean(df$a, na.rm=TRUE) # mean of variable "a"
#> [1] 0.2676164
sd(df$a, na.rm=TRUE) # std dev of variable "a"
#> [1] 0.5178803
# Would these work?
# mean(df["a"], na.rm=TRUE) # mean of variable "a"
# sd(df["a"], na.rm=TRUE) # std dev of variable "a"
# Manually calculate z-score for second observation in variable "a"
df$a[2]
#> [1] 0.5855288
(df$a[2] - mean(df$a, na.rm=TRUE))/sd(df$a, na.rm=TRUE)
#> [1] 0.6138725
# Manually calculate z-score for all observations in variable "a"
df$a
#> [1] NA 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
df$a %>% length()
#> [1] 6
(df$a - mean(df$a, na.rm=TRUE))/sd(df$a, na.rm=TRUE)
#> [1] NA 0.6138725 0.8531888 -0.7278124 -1.3924329 0.6531840
Apply z_score() function to variables in dataframe:
# z_score() function to calculate z-score for each obs of variable "a"
df$a
#> [1] NA 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
z_score(x = df$a)
#> [1] NA 0.6138725 0.8531888 -0.7278124 -1.3924329 0.6531840
z_score(x = df[["a"]])
#> [1] NA 0.6138725 0.8531888 -0.7278124 -1.3924329 0.6531840
# This approach doesn't work:
# z_score(x = df["a"])
# Why?:
# df["a"] is a dataframe with one variable
# you can't apply mean() or sd() functions to list/data frame object, only numeric atomic vector
# z-score for each obs of variable "b"
z_score(x = df[["b"]])
#> [1] NA -1.4182167 1.2847832 0.2841184 0.2753122 -0.4259971
# investigate the object returned by the function call
z_score(x = df[["b"]]) %>% str()
#> num [1:6] NA -1.418 1.285 0.284 0.275 ...
# could create a new object whose values are the output/return of the function
z_object <- z_score(x = df[["b"]])
z_object %>% str()
#> num [1:6] NA -1.418 1.285 0.284 0.275 ...
# could even create new object that is a new variable in data frame
df$b_z <- z_score(x = df[["b"]])
df[["b_z"]] <- z_score(x = df[["b"]]) # equivalent to the line above

df %>% glimpse()
#> Rows: 6
#> Columns: 4
#> $ a <dbl> NA, 0.5855288, 0.7094660, -0.1093033, -0.4534972, 0.6058875
#> $ b <dbl> NA, -1.8179560, 0.6300986, -0.2761841, -0.2841597, -0.9193220
#> $ c <dbl> NA, -0.1162478, 1.8173120, 0.3706279, 0.5202165, -0.7505320
#> $ b_z <dbl> NA, -1.4182167, 1.2847832, 0.2841184, 0.2753122, -0.4259971
df$b_z %>% str()
#> num [1:6] NA -1.418 1.285 0.284 0.275 ...
df$b_z <- NULL # delete variable b_z
Task: Use the z_score() function to create a new variable that is the z-score version of a variable
df dataframe
First, briefly review how to create and delete variables using Base R approach:
df
#> # A tibble: 6 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 NA NA NA
#> 2 0.586 -1.82 -0.116
#> 3 0.709 0.630 1.82
#> 4 -0.109 -0.276 0.371
#> 5 -0.453 -0.284 0.520
#> 6 0.606 -0.919 -0.751
df$c_plus2 <- df$c + 2 # create variable equal to "c" plus 2

df
#> # A tibble: 6 × 4
#> a b c c_plus2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 NA NA NA NA
#> 2 0.586 -1.82 -0.116 1.88
#> 3 0.709 0.630 1.82 3.82
#> 4 -0.109 -0.276 0.371 2.37
#> 5 -0.453 -0.284 0.520 2.52
#> 6 0.606 -0.919 -0.751 1.25
df$c_plus2 <- NULL # remove variable "c_plus2"

df
#> # A tibble: 6 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 NA NA NA
#> 2 0.586 -1.82 -0.116
#> 3 0.709 0.630 1.82
#> 4 -0.109 -0.276 0.371
#> 5 -0.453 -0.284 0.520
#> 6 0.606 -0.919 -0.751
Use z_score() function to create a new variable that equals the z-score of another variable.
Simply calling the z_score() function does not create a new variable:
z_score(x = df$c)
#> [1] NA -0.510074390 1.525451514 0.002476613 0.159953743
#> [6] -1.177807481
df
#> # A tibble: 6 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 NA NA NA
#> 2 0.586 -1.82 -0.116
#> 3 0.709 0.630 1.82
#> 4 -0.109 -0.276 0.371
#> 5 -0.453 -0.284 0.520
#> 6 0.606 -0.919 -0.751
Rather than modifying the z_score() function so that the variable is assigned within the function, the preferred approach is to call the z_score() function after the assignment operator <-:
df$c_z <- z_score(x = df$c)

# examine data frame
df
#> # A tibble: 6 × 4
#> a b c c_z
#> <dbl> <dbl> <dbl> <dbl>
#> 1 NA NA NA NA
#> 2 0.586 -1.82 -0.116 -0.510
#> 3 0.709 0.630 1.82 1.53
#> 4 -0.109 -0.276 0.371 0.00248
#> 5 -0.453 -0.284 0.520 0.160
#> 6 0.606 -0.919 -0.751 -1.18
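Because the function returns a vector and assignment happens outside of it, the same call can be combined with a loop from earlier in this unit to create z-score versions of several variables at once. The sketch below is one possible way to do this; the data frame toy and the _z naming convention are hypothetical choices for illustration.

```r
# Same definition as above, repeated so the example is self-contained
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical data frame for illustration
toy <- data.frame(a = c(1, 2, 3, NA), b = c(10, 20, 60, 30))

# names(toy) is evaluated once, before the first iteration, so the new
# columns added inside the loop are not themselves looped over
for (var in names(toy)) {
  toy[[paste0(var, "_z")]] <- z_score(toy[[var]])
}
names(toy)
#> [1] "a"   "b"   "a_z" "b_z"
```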
We can apply our function to a “real” dataset too:
#load dataset with one obs per recruiting event
load(url("https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/recruit_event_somevars.RData"))
df_event_small <- df_event[1:10,] %>% # keep first 10 observations
  select(instnm,univ_id,event_type,med_inc) # keep 4 vars
df_event_small
#> # A tibble: 10 × 4
#> instnm univ_id event_type med_inc
#> <chr> <int> <chr> <dbl>
#> 1 UM Amherst 166629 public hs 71714.
#> 2 UM Amherst 166629 public hs 89122.
#> 3 UM Amherst 166629 public hs 70136.
#> 4 UM Amherst 166629 public hs 70136.
#> 5 Stony Brook 196097 public hs 71024.
#> 6 USCC 218663 private hs 71024.
#> 7 UM Amherst 166629 private hs 71024.
#> 8 UM Amherst 166629 public hs 97225
#> 9 UM Amherst 166629 private hs 97225
#> 10 UM Amherst 166629 public hs 77800.
#show observations for variable med_inc
df_event_small$med_inc
#> [1] 71713.5 89121.5 70136.5 70136.5 71023.5 71023.5 71023.5 97225.0 97225.0
#> [10] 77799.5
#calculate z-score of variable med_inc (without assignment)
z_score(x = df_event_small$med_inc)
#> [1] -0.60825958 0.91982879 -0.74668992 -0.74668992 -0.66882834 -0.66882834
#> [7] -0.66882834 1.63116060 1.63116060 -0.07402556
#assign new variable equal to the z-score of med_inc
df_event_small$med_inc_z <- z_score(x = df_event_small$med_inc)

#inspect
df_event_small %>% head(5)
#> # A tibble: 5 × 5
#> instnm univ_id event_type med_inc med_inc_z
#> <chr> <int> <chr> <dbl> <dbl>
#> 1 UM Amherst 166629 public hs 71714. -0.608
#> 2 UM Amherst 166629 public hs 89122. 0.920
#> 3 UM Amherst 166629 public hs 70136. -0.747
#> 4 UM Amherst 166629 public hs 70136. -0.747
#> 5 Stony Brook 196097 public hs 71024. -0.669
Task: Improve the z_score() function by first checking whether input x is valid
Current function:
z_score <- function(x) {
  (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)
}
#?mean
#?sd
What kind of input is our current function limited to?
The z_score() function does simple arithmetic and utilizes the mean() and sd() functions
The mean() and sd() functions require x to be a numeric (or logical) atomic vector
The z_score() function will break if the input x is not an atomic vector
The z_score() function will break if the input x is not a numeric/logical atomic vector
#function works on below numeric atomic vector
str(df_event_small$med_inc)
str(df_event_small[["med_inc"]]) # same same
#function doesn't work if input is a list/dataframe
str(df_event_small["med_inc"]) # investigate object
z_score(x = df_event_small["med_inc"]) # try applying z_score function to object
#function doesn't work if x is not a numeric vector
str(df_event_small$instnm)
z_score(x = df_event_small$instnm)
We could modify z_score() by using conditional statements to calculate the z-score only if input object x is the appropriate class of object:
z_score <- function(x) {
  if (class(x) == "numeric" || class(x) == "logical") {
    (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)
  }
}
We no longer run into errors if we supply an invalid input, although the dataframe input triggers warnings because class(x) has length 3 for a tibble:
# Test with list/dataframe input
str(df_event_small["med_inc"])
#> tibble [10 × 1] (S3: tbl_df/tbl/data.frame)
#> $ med_inc: num [1:10] 71714 89122 70137 70137 71024 ...
z_score(x = df_event_small["med_inc"])
#> Warning in class(x) == "numeric" || class(x) == "logical": 'length(x) = 3 > 1'
#> in coercion to 'logical(1)'
#> Warning in class(x) == "numeric" || class(x) == "logical": 'length(x) = 3 > 1'
#> in coercion to 'logical(1)'
#investigate what this function call returns
z_score(x = df_event_small["med_inc"]) %>% str()
#> Warning in class(x) == "numeric" || class(x) == "logical": 'length(x) = 3 > 1'
#> in coercion to 'logical(1)'
#> Warning in class(x) == "numeric" || class(x) == "logical": 'length(x) = 3 > 1'
#> in coercion to 'logical(1)'
#> NULL
# Test with character vector input
str(df_event_small$instnm)
#> chr [1:10] "UM Amherst" "UM Amherst" "UM Amherst" "UM Amherst" ...
z_score(x = df_event_small$instnm)
#investigate what this function call returns
z_score(x = df_event_small$instnm) %>% str()
#> NULL
Note that our function would return NULL if the input was invalid, so the new variable would not be created if we used <-:
str(df_event_small$instnm)
#> chr [1:10] "UM Amherst" "UM Amherst" "UM Amherst" "UM Amherst" ...
# Invalid character vector input returns `NULL`
typeof(z_score(x = df_event_small$instnm))
#> [1] "NULL"
# We would not see new variable/column `instnm_z`
df_event_small$instnm_z <- z_score(x = df_event_small$instnm)
df_event_small %>% head(5)
#> # A tibble: 5 × 5
#> instnm univ_id event_type med_inc med_inc_z
#> <chr> <int> <chr> <dbl> <dbl>
#> 1 UM Amherst 166629 public hs 71714. -0.608
#> 2 UM Amherst 166629 public hs 89122. 0.920
#> 3 UM Amherst 166629 public hs 70136. -0.747
#> 4 UM Amherst 166629 public hs 70136. -0.747
#> 5 Stony Brook 196097 public hs 71024. -0.669
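The class(x) == "numeric" check has two limitations worth noting. First, class() returns a vector of length 3 for a tibble, which is what triggers the warnings above; in R 4.3.0 and later, using || with a length-3 condition is an error rather than a warning. Second, integer vectors have class "integer", so the check treats them as invalid. A minimal sketch of one alternative, assuming we are happy to accept any numeric or logical vector (the name z_score_checked is hypothetical and not part of the lecture):

```r
# Hypothetical variant of z_score(); the name z_score_checked is illustrative
z_score_checked <- function(x) {
  # is.numeric() and is.logical() each return a single TRUE/FALSE,
  # even when x is a data frame or a list
  if (is.numeric(x) || is.logical(x)) {
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  } else {
    NULL # same behaviour as above: invalid input returns NULL
  }
}

z_score_checked(c(1L, 2L, 3L))              # integer vector is accepted
#> [1] -1  0  1
z_score_checked(data.frame(a = c(1, 2, 3))) # data frame: NULL, no warning
#> NULL
z_score_checked(c("a", "b"))                # character vector: NULL
#> NULL
```

Returning NULL silently keeps the behaviour of the version above; raising an error for invalid input is another option.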
Some common tasks when working with survey data:
NA values for a specific variable
NA for a specific variable
num_negative() function
Task: Write function called num_negative()
df (created below)
# Sample dataframe `df` that contains some negative values
df
#> # A tibble: 100 × 4
#> id age sibage parage
#> <int> <dbl> <dbl> <dbl>
#> 1 1 17 8 49
#> 2 2 15 -97 46
#> 3 3 -97 -97 53
#> 4 4 13 12 -4
#> 5 5 -97 10 47
#> 6 6 12 10 52
#> 7 7 -99 5 51
#> 8 8 -97 10 55
#> 9 9 16 6 51
#> 10 10 16 -99 -8
#> # … with 90 more rows
Recommended steps:
sum(data_frame_name$var_name<0)
names(df) # identify variable names
#> [1] "id" "age" "sibage" "parage"
df$age # print observations for a variable
#> [1] 17 15 -97 13 -97 12 -99 -97 16 16 -98 20 -99 20 11 20 12 17
#> [19] 19 17 -97 -99 12 13 11 15 20 14 -99 11 20 -98 11 -98 12 16
#> [37] 12 18 12 19 12 -97 20 17 11 19 19 12 -98 11 15 18 15 -98
#> [55] 15 19 -97 13 -98 16 13 12 16 19 -99 19 -98 13 -97 20 15 19
#> [73] 15 12 18 -99 18 -98 -98 -98 -97 12 14 19 -97 11 20 18 14 -99
#> [91] 15 20 -97 14 14 19 18 17 20 15
#BaseR
sum(df$age<0) # count number of obs w/ negative values for variable "age"
#> [1] 27
num_negative <- function(x){
  sum(x<0) # count elements of x that are negative
}
num_negative(df$age)
#> [1] 27
num_negative(df$sibage)
#> [1] 22
num_missing() function
In survey data, negative values often indicate the reason for missing values:
-8 refers to “didn’t take survey”
-7 refers to “took survey, but didn’t answer this question”
Task: Write function called num_missing()
x: The variable (e.g., df$sibage)
miss_vals: Vector of values you want to associate with “missing” variable
df$age: -97,-98,-99
df$sibage: -97,-98,-99
df$parage: -4,-7,-8
Recommended steps:
sum(data_frame_name$var_name %in% c(-4,-5))
sum(df$age %in% c(-97,-98,-99))
#> [1] 27
num_missing <- function(x, miss_vals){
  sum(x %in% miss_vals)
}
num_missing(df$age,c(-97,-98,-99))
#> [1] 27
num_missing(df$sibage,c(-97,-98,-99))
#> [1] 22
num_missing(df$parage,c(-4,-7,-8))
#> [1] 17
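Once the missing-value codes are known, the same %in% test can be used to recode them to NA. The following is a minimal sketch rather than part of the lecture's solution: the helper name set_missing and the example values are made up for illustration.

```r
# Hypothetical helper: recode user-specified missing-value codes to NA
set_missing <- function(x, miss_vals) {
  x[x %in% miss_vals] <- NA # logical subsetting selects the coded values
  x                         # return the modified vector
}

age <- c(17, 15, -97, 13, -99) # made-up values for illustration
age_clean <- set_missing(age, c(-97, -98, -99))
age_clean
#> [1] 17 15 NA 13 NA
sum(is.na(age_clean)) # number of values recoded to NA
#> [1] 2
```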
What are default values for arguments?
name=value
Example: str_c() function
The str_c() function has default values for sep and collapse:
str_c(..., sep = "", collapse = NULL)
...: One or more character vectors to join, separated by commas
sep: String to insert between input vectors
sep = ""
collapse: Optional string used to combine input vectors into single string
collapse = NULL is to not combine elements into a single string
# We want to join the following two vectors element-wise into a single character vector
c("a","b")
#> [1] "a" "b"
c(1,2)
#> [1] 1 2
# manually specifying default values
str_c(c("a", "b"), c(1, 2), sep = "", collapse = NULL)
#> [1] "a1" "b2"
# If we don't specify `sep` and `collapse`, they take the default values
str_c(c("a", "b"), c(1, 2))
#> [1] "a1" "b2"
# specify value for `sep` that overrides default value
str_c(c("a", "b"), c(1, 2), sep = "~")
#> [1] "a~1" "b~2"
length(str_c(c("a", "b"), c(1, 2), sep = "~")) # resulting vector has length = 2
#> [1] 2
# specify value for `collapse` that overrides default
str_c(c("a", "b"), c(1, 2), collapse = "|")
#> [1] "a1|b2"
length(str_c(c("a", "b"), c(1, 2), collapse = "|")) # resulting vector has length = 1
#> [1] 1
# specify alternative values for both `sep` and `collapse`
#str_c(c("a", "b"), c(1, 2), sep = "~", collapse = "|")
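For completeness, here is what the commented-out call returns when both defaults are overridden. This is a short sketch; the object name both is only for illustration.

```r
library(stringr)

# `sep` joins the inputs element-wise first; `collapse` then joins the results
both <- str_c(c("a", "b"), c(1, 2), sep = "~", collapse = "|")
both
#> [1] "a~1|b~2"
length(both) # collapsed into a single string
#> [1] 1
```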
z_score() function
Recall the z_score() function we developed previously, where we wrote this function to remove NA values prior to calculating z-score:
z_score <- function(x) {
  (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)
}
w <- c(NA, seq(1:5), NA)
w
#> [1] NA 1 2 3 4 5 NA
z_score(w)
#> [1] NA -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111 NA
We could add an argument (named na) that specifies whether NAs should be removed prior to calculating z-scores:
z_score <- function(x, na) {
  (x - mean(x, na.rm=na))/sd(x, na.rm=na)
}
w
#> [1] NA 1 2 3 4 5 NA
z_score(x=w, na=TRUE)
#> [1] NA -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111 NA
z_score(x=w, na=FALSE)
#> [1] NA NA NA NA NA NA NA
#z_score(w) # error: argument "na" is missing, with no default
We could also add a default value for the na argument. Following a conservative approach, we’ll specify the default value as FALSE, which means that any NA values in input vector x will result in a z-score of NA for all observations:
z_score <- function(x, na = FALSE) {
  (x - mean(x, na.rm=na))/sd(x, na.rm=na)
}
w
#> [1] NA 1 2 3 4 5 NA
z_score(x=w) # uses default value of FALSE
#> [1] NA NA NA NA NA NA NA
z_score(w, na= FALSE) # manually specify default value
#> [1] NA NA NA NA NA NA NA
z_score(w, na = TRUE) # override default value
#> [1] NA -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111 NA
Dot-dot-dot (...)
Many functions take an arbitrary number of arguments, including:
select()
#?select
select(df_event,instnm,univ_id,event_type,med_inc) %>% names()
#> [1] "instnm" "univ_id" "event_type" "med_inc"
sum()
#?sum
sum(3,3,2,2,1,1)
#> [1] 12
str_c
#?str_c
# 1 character vector as input
str_c(c("a", "b", "c"))
#> [1] "a" "b" "c"
# 2 character vectors as input
str_c(c("a", "b", "c"), " is for ")
#> [1] "a is for " "b is for " "c is for "
# 3 character vectors as input
str_c(c("a", "b", "c"), " is for ", c("apple", "banana", "coffee"))
#> [1] "a is for apple" "b is for banana" "c is for coffee"
All of these functions rely on a special argument ... (pronounced “dot-dot-dot”)
Dot-dot-dot (...) allows a function to take an arbitrary number of arguments
Wickham and Grolemund chapter 19.5.3 states:
“... captures any number of arguments that aren’t otherwise matched.”
When writing functions, there are two primary uses of including ... arguments:
select() and sum() functions
..., we can pass those inputs into another function that takes ... (e.g., str_c())
...) as function argument
Recall the first iteration of our print_hello() function, which basically just printed a name that we specified in the function call. Let’s modify the function to make it take an arbitrary number of names to greet:
Function that only took one argument
# Define function
print_hello1 <- function(x) {
  str_c("Hello ", x, "!")
}
# Call function
print_hello1(x="Ozan")
#> [1] "Hello Ozan!"
Modify function to take an arbitrary number of names to greet
# Define function
print_hello2 <- function(...) { # The function accepts an arbitrary number of inputs
  str_c("Hello ", str_c(..., sep = ", "), "!") # Pass the `...` to `str_c()`
}
# Call function
print_hello2("Dasher", "Dancer", "Prancer", "Vixen")
#> [1] "Hello Dasher, Dancer, Prancer, Vixen!"
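The ... argument can also sit alongside named arguments, collecting whatever is left over and forwarding it to another function. A minimal base R sketch, assuming a hypothetical wrapper called mean_rounded that passes any extra arguments on to mean():

```r
# Hypothetical wrapper: anything not matched to `x` or `digits`
# is captured by `...` and forwarded unchanged to mean()
mean_rounded <- function(x, digits = 1, ...) {
  round(mean(x, ...), digits = digits)
}

mean_rounded(c(1, 2, NA))               # nothing forwarded, so mean() returns NA
#> [1] NA
mean_rounded(c(1, 2, NA), na.rm = TRUE) # na.rm = TRUE is forwarded to mean()
#> [1] 1.5
```

The wrapper never names na.rm itself, so any other argument that mean() accepts (for example trim) can be passed the same way.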
How to handle invalid inputs?
z_score() example, one way to check for invalid inputs is using conditional statements
stop()), if they are not true” (R for Data Science)
stop() function (base R):
stop() function “stops execution of the current expression and executes an error action”
?stop
# SYNTAX AND DEFAULT VALUES
stop(..., call. = TRUE, domain = NULL)
stop() to check invalid name input to print_hello() function
Recall the original print_hello() function. It will not print a greeting if NA is supplied as the input:
print_hello <- function(x) {
  str_c("Hello, world. My name is", x, sep = " ", collapse = NULL)
}
print_hello("ozan")
#> [1] "Hello, world. My name is ozan"
print_hello(NA)
#> [1] NA
We can raise an error with a custom message if the input is NA:
print_hello <- function(x) {
  if (is.na(x)) {
    stop("`x` must not be `NA`")
  }
  str_c("Hello, world. My name is", x, sep = " ", collapse = NULL)
}
print_hello(x="ozan")
print_hello(x=NA)
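To see the custom message without halting a script, the failing call can be wrapped in tryCatch(). The sketch below is a base R illustration, not the lecture's function: greet is a hypothetical stand-in that uses paste() so it needs no packages, and call. = FALSE is one of the stop() arguments listed in the syntax above.

```r
# Hypothetical base R stand-in for print_hello(), so the sketch needs no packages
greet <- function(x) {
  if (is.na(x)) {
    # call. = FALSE leaves the function call out of the error message
    stop("`x` must not be `NA`", call. = FALSE)
  }
  paste("Hello, world. My name is", x)
}

# tryCatch() captures the error so that the script keeps running
result <- tryCatch(greet(NA), error = function(e) conditionMessage(e))
result
#> [1] "`x` must not be `NA`"
```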
stop() to check invalid date input to print_hello() function
Recall the version of print_hello() function that prints both the user’s name and age. It will not work properly if the birthdate input is not supplied in month-day-year format:
print_hello <- function(x, y) {
  age <- floor(as.numeric(as.duration(today() - mdy(y)), "years"))

  str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ", collapse = NULL)
}
print_hello(x = "Sumru Jaquette-Nasiali", y="04/05/2019") # this works
#> [1] "Hello, world. My name is Sumru Jaquette-Nasiali and I am 3 years old"
print_hello(x = "Sumru Jaquette-Nasiali", y="2019/04/05") # this does not
#> Warning: All formats failed to parse. No formats found.
#> [1] NA
We can raise an error with a custom message if the birthdate is not in the right format:
print_hello <- function(x, y) {
  if (is.na(mdy(y))) {
    stop("`y` must be in month-day-year format")
  }
  age <- floor(as.numeric(as.duration(today() - mdy(y)), "years"))

  str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ", collapse = NULL)
}
print_hello(x = "Sumru Jaquette-Nasiali", y = "04/05/2019")
print_hello(x = "Sumru Jaquette-Nasiali", y = "2019/04/05")
We can also add the check for the name input as well:
print_hello <- function(x, y) {
  # Check name input `x`
  if (is.na(x)) {
    stop("`x` must not be `NA`")
  }
  # Check birthdate input `y`
  if (is.na(mdy(y))) {
    stop("`y` must be in month-day-year format")
  }
  age <- floor(as.numeric(as.duration(today() - mdy(y)), "years"))

  str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ", collapse = NULL)
}
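One possible refinement, sketched here under a hypothetical name (print_hello_checked), is to parse the date once and reuse the result. Passing quiet = TRUE to mdy() suppresses the "All formats failed to parse" warning, so only the custom error message is shown.

```r
library(lubridate)
library(stringr)

# Hypothetical variant; the name print_hello_checked is illustrative
print_hello_checked <- function(x, y) {
  # Check name input `x`
  if (is.na(x)) {
    stop("`x` must not be `NA`")
  }
  # Parse the birthdate once; quiet = TRUE suppresses the parse warning
  birthdate <- mdy(y, quiet = TRUE)
  if (is.na(birthdate)) {
    stop("`y` must be in month-day-year format")
  }
  age <- floor(as.numeric(as.duration(today() - birthdate), "years"))
  str_c("Hello, world. My name is", x, "and I am", age, "years old", sep = " ")
}

# Capture the error message instead of halting
bad_date <- tryCatch(
  print_hello_checked(x = "Sumru Jaquette-Nasiali", y = "2019/04/05"),
  error = function(e) conditionMessage(e)
)
bad_date
```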
Recall that functions generally follow three sequential steps:
What are return values?
df, you could have the last line of the function be this:
df
<- to store returned values in a new object for future use
Recall the print_hello() function:
# Define function
print_hello <- function() {
  "Hello!" # The last statement in the function is returned
}
# Call function
print_hello() %>% str()
#> chr "Hello!"
h <- print_hello() # We can show that `print_hello()` returns a value by storing it in `h`
h # `h` stores the value "Hello!"
#> [1] "Hello!"
How can we explicitly return values from the function?
return() to explicitly return a value from our function
if block)
return() in a function
Recall the print_hello() function:
# Define function
print_hello <- function() {
  return("Hello!") # Explicitly return "Hello!"
  print("Goodbye!") # Since this is after `return()`, it never gets run
}
# Call function
print_hello()
#> [1] "Hello!"
h <- print_hello() # `print_hello()` returns "Hello!"
h
#> [1] "Hello!"
Recall the previous example where we assess the prices of diamonds from the diamonds dataset from ggplot2. Let’s move the if/else if/else blocks inside of a function, then call the function from inside the loop.
As seen below, the last statement that the function evaluates (i.e., whichever if/else if/else block is run) will be implicitly returned:
assess_price <- function(price) {
  if (price < 500) {
    str_c("This diamond costs $", price, " and is affordable.")
  } else if (price < 1000) {
    str_c("This diamond costs $", price, " and is pricey...")
  } else {
    str_c("This diamond costs $", price, " and is too expensive!")
  }
}
assess_price(price=450)
#> [1] "This diamond costs $450 and is affordable."
assess_price(price=1050) %>% str()
#> chr "This diamond costs $1050 and is too expensive!"
prices <- unique(diamonds$price)[23:27]
prices
#> [1] 405 552 553 554 2757
for (i in prices) {
  writeLines(assess_price(i))
}
#> This diamond costs $405 and is affordable.
#> This diamond costs $552 and is pricey...
#> This diamond costs $553 and is pricey...
#> This diamond costs $554 and is pricey...
#> This diamond costs $2757 and is too expensive!
But if we were to have another line after the conditional part, then that would be implicitly returned instead, since it is now the last statement in the function:
assess_price <- function(price) {
  if (price < 500) {
    str_c("This diamond costs $", price, " and is affordable.")
  } else if (price < 1000) {
    str_c("This diamond costs $", price, " and is pricey...")
  } else {
    str_c("This diamond costs $", price, " and is too expensive!")
  }
  "I can't afford that." # This is now the last statement in the function that will be returned
}
for (i in prices) {
  writeLines(assess_price(i))
}
#> I can't afford that.
#> I can't afford that.
#> I can't afford that.
#> I can't afford that.
#> I can't afford that.
We can use return() to explicitly return early from the function:
assess_price <- function(price) {
  if (price < 500) {
    return(str_c("This diamond costs $", price, " and is affordable.")) # Return early
  } else if (price < 1000) {
    return(str_c("This diamond costs $", price, " and is pricey...")) # Return early
  } else {
    writeLines(str_c("This diamond costs $", price, " and is too expensive!"))
  }
  "I can't afford that."
}
for (i in prices) {
  writeLines(assess_price(i))
}
#> This diamond costs $405 and is affordable.
#> This diamond costs $552 and is pricey...
#> This diamond costs $553 and is pricey...
#> This diamond costs $554 and is pricey...
#> This diamond costs $2757 and is too expensive!
#> I can't afford that.
How can we return multiple values from a function?
Let’s say we have the following function that filters the diamonds dataset by color, then generates some information on multiple characteristics (i.e., cut and clarity). For now, it is printing a frequency table for each characteristic to the screen:
diamond_info_by_color <- function(color) {
  # Note: inside filter(), both sides of `color == color` refer to the column, so no rows are dropped
  df <- diamonds %>% filter(color == color)

  print(table(df$cut))
  print(table(df$clarity))
}
diamond_info_by_color(color = 'E')
#>
#> Fair Good Very Good Premium Ideal
#> 1610 4906 12082 13791 21551
#>
#> I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
#> 741 9194 13065 12258 8171 5066 3655 1790
diamond_info_by_color('E') %>% str() # what is returned
#>
#> Fair Good Very Good Premium Ideal
#> 1610 4906 12082 13791 21551
#>
#> I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
#> 741 9194 13065 12258 8171 5066 3655 1790
#> 'table' int [1:8(1d)] 741 9194 13065 12258 8171 5066 3655 1790
#> - attr(*, "dimnames")=List of 1
#> ..$ : chr [1:8] "I1" "SI2" "SI1" "VS2" ...
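The cut counts printed above sum to 53,940, which is every row of diamonds, so the filter is not actually restricting the data to color E. Inside filter(), the name color is looked up in the dataframe first, so color == color compares the column with itself and is always TRUE. A minimal sketch of the problem and one possible fix, using a small made-up tibble (gems) and a hypothetical argument name (color_value) that does not clash with a column name:

```r
library(dplyr)

# Made-up data for illustration (not the diamonds dataset)
gems <- tibble(
  color = c("D", "E", "E", "F"),
  cut = c("Fair", "Good", "Ideal", "Good")
)

# Same pattern as above: both sides of `==` refer to the column
filter_masked <- function(color) {
  gems %>% filter(color == color)
}

# Possible fix: give the argument a name that is not a column name
filter_by_color <- function(color_value) {
  gems %>% filter(color == color_value)
}

nrow(filter_masked("E"))   # every row is kept
#> [1] 4
nrow(filter_by_color("E")) # only rows with color "E"
#> [1] 2
```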
If we want to return the frequency tables from the function (i.e., return multiple objects), we can do so by combining them together into a single list and returning that list:
diamond_info_by_color <- function(color) {
  df <- diamonds %>% filter(color == color)

  list(cut_table = table(df$cut), clarity_table = table(df$clarity)) # implicitly return list
}
diamond_info_by_color('E')
#> $cut_table
#>
#> Fair Good Very Good Premium Ideal
#> 1610 4906 12082 13791 21551
#>
#> $clarity_table
#>
#> I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
#> 741 9194 13065 12258 8171 5066 3655 1790
diamond_info_by_color('E') %>% str()
#> List of 2
#> $ cut_table : 'table' int [1:5(1d)] 1610 4906 12082 13791 21551
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:5] "Fair" "Good" "Very Good" "Premium" ...
#> $ clarity_table: 'table' int [1:8(1d)] 741 9194 13065 12258 8171 5066 3655 1790
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:8] "I1" "SI2" "SI1" "VS2" ...
We can then store the returned list in an object using <-, and access the individual elements within the list using [[]] or $:
# Store returned list in `info`
info <- diamond_info_by_color('E')
info %>% str()
#> List of 2
#> $ cut_table : 'table' int [1:5(1d)] 1610 4906 12082 13791 21551
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:5] "Fair" "Good" "Very Good" "Premium" ...
#> $ clarity_table: 'table' int [1:8(1d)] 741 9194 13065 12258 8171 5066 3655 1790
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:8] "I1" "SI2" "SI1" "VS2" ...
# Access individual elements of the list
info[['cut_table']]
#>
#> Fair Good Very Good Premium Ideal
#> 1610 4906 12082 13791 21551
# Can also store individual elements in new objects
clarity_table <- info$clarity_table
clarity_table
#>
#> I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
#> 741 9194 13065 12258 8171 5066 3655 1790
What are pipeable functions?
%>%)
E.g., The filter() and select() functions from tidyverse both accept a dataframe as the first argument and return a modified dataframe, so they can be chained together
df %>% filter(...) %>% select(...)
Wickham distinguishes between 2 types of pipeable functions (Chapter 19.6.2)
filter() or select() functions from tidyverse
Pipeable functions work not only with dataframes but with any object, such as an atomic vector. For example:
vec <- c(1, 2, 3, 4)
vec
#> [1] 1 2 3 4
# These functions accept a vector as the first argument, modify it, then return it
add_two <- function(v) {
  v + 2
}
vec
#> [1] 1 2 3 4
add_two(v=vec)
#> [1] 3 4 5 6
vec %>% add_two() # same
#> [1] 3 4 5 6
times_three <- function(v) {
  v * 3
}
vec
#> [1] 1 2 3 4
times_three(v=vec)
#> [1] 3 6 9 12
vec %>% times_three() # same
#> [1] 3 6 9 12
We can chain together the functions to perform the operations in
order:
vec
#> [1] 1 2 3 4
vec %>% add_two()
#> [1] 3 4 5 6
vec %>% add_two() %>% times_three()
#> [1] 9 12 15 18
vec
#> [1] 1 2 3 4
vec %>% times_three()
#> [1] 3 6 9 12
vec %>% times_three() %>% add_two()
#> [1] 5 8 11 14
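The functions above are transformations: they return a modified copy of their input. The other type Wickham describes is the side-effect function, which performs an action (such as printing) and should invisibly return its first argument so that the pipeline can continue. A minimal sketch, where report_length is a hypothetical name:

```r
library(magrittr) # provides %>%

# Hypothetical side-effect function: prints a message,
# then returns its input invisibly so it can sit mid-pipeline
report_length <- function(v) {
  cat("Vector has", length(v), "elements\n")
  invisible(v)
}

# Transformation function, as defined above
add_two <- function(v) {
  v + 2
}

result <- c(1, 2, 3, 4) %>% report_length() %>% add_two()
#> Vector has 4 elements
result
#> [1] 3 4 5 6
```

Without invisible(v), report_length() would return the value of cat(), which is NULL, and the next step in the pipe would have nothing to work with.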
Writing functions and loops that utilize tidyverse functions
Tidyverse/dplyr functions in a user-written function requires some programming concepts that we did not introduce in this lecture
Replacing loops with “map” functions from the purrr package and/or “apply” functions from Base R
The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you (Wickham, chapter 21.5)
purrr package, which is part of Tidyverse, creates a family of functions called “map” functions that replace the need for writing loops
purrr map functions, read section 21.4 and section 21.5 of R for Data Science by Wickham
purrr map functions are similar to the “apply family of functions” from Base R
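As a rough illustration of that pattern, the sketch below applies the same function to each element of a vector three ways: with a for loop, with purrr's map_dbl(), and with base R's sapply(). The function square and the input values are made up for this example.

```r
library(purrr)

square <- function(v) v^2 # hypothetical function to apply to each element
inputs <- c(1, 2, 3)

# Loop: pre-allocate the output, then fill it element by element
out_loop <- vector("double", length(inputs))
for (i in seq_along(inputs)) {
  out_loop[[i]] <- square(inputs[[i]])
}

# purrr: map_dbl() applies the function to each element and returns a double vector
out_map <- map_dbl(inputs, square)

# Base R: sapply() does the same and simplifies the result to a vector
out_sapply <- sapply(inputs, square)

out_map
#> [1] 1 4 9
```

All three approaches give the same result here; the map and apply versions simply hide the bookkeeping of creating and filling the output vector.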