1 Introduction

Load packages:

#install.packages('tidyverse')
library(tidyverse)

Load data:

#data we used in the last problem set
load(url('https://github.com/anyone-can-cook/rclass1/raw/master/data/recruiting/recruit_event_somevars.RData'))  

#data we are using in this lecture 
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))

Resources used to create this document:

2 Operator Precedence and Associativity

It is important to understand operator precedence and associativity when working with data in general, but particularly when we use logical operators like & or | to filter data.


Operator Precedence

Operator Precedence refers to the order in which operators are executed. Operators with higher precedence are evaluated first and operators with the lowest precedence are evaluated last. If you recall PEMDAS, multiplication * takes precedence over addition +.

1 + 2 * 6
## [1] 13

If we add parentheses to the expression, the answer changes.

(1 + 2) * 6
## [1] 18
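
The same idea extends beyond + and *. Here is a small sketch of a few other precedence rules in R; the variable names are made up for illustration, and the values in the comments follow from the rules documented in ?Syntax.

```r
# Exponentiation (^) binds tighter than *, which binds tighter than +
arith <- 2 + 3 * 4 ^ 2        # 4^2 = 16, then 3 * 16 = 48, then 2 + 48 = 50

# Unary minus has LOWER precedence than ^, so this is -(2^2)
neg_power <- -2 ^ 2           # -4
paren_power <- (-2) ^ 2       # 4

# Arithmetic is evaluated before comparison operators
compare <- 1 + 2 > 2          # (1 + 2) > 2, which is TRUE

print(c(arith, neg_power, paren_power))
print(compare)
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.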


Operator Associativity

When operators have the same precedence, the order in which they are evaluated is determined by associativity.

Operators with the same precedence follow the associativity defined for their operator group. In R, operators can be left-associative, right-associative, or have no associativity. Left-associative operators are evaluated from left to right, right-associative operators are evaluated from right to left, and operators with no associativity do not follow any predefined order.

Credit: R Operator Precedence

8 / 4 / 2
## [1] 1
8 / (4 / 2)
## [1] 4

Credit: R Operator Precedence and Associativity, Operator Syntax and Precedence
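
The division example above is left-associative. As a rough sketch of the other case, the exponent operator ^ and the assignment operator <- are right-associative in R. The variable names below are hypothetical and only serve the illustration.

```r
# Left-associative: subtraction is evaluated from left to right
left_default <- 10 - 4 - 3        # (10 - 4) - 3 = 3
left_grouped <- 10 - (4 - 3)      # 10 - 1 = 9

# Right-associative: ^ is evaluated from right to left
right_default <- 2 ^ 3 ^ 2        # 2 ^ (3 ^ 2) = 2 ^ 9 = 512
right_grouped <- (2 ^ 3) ^ 2      # 8 ^ 2 = 64

# Assignment with <- is also right-associative: y gets 5 first, then x gets y
x <- y <- 5

print(c(left_default, left_grouped, right_default, right_grouped, x, y))
```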


R logical operator precedence

According to the R operator precedence documentation credited below, & takes precedence over |. Both logical operators are left-associative, meaning they are evaluated from left to right.

Credit: R Operator Precedence
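
Before turning to real data, here is a minimal sketch with hand-made logical vectors. The names in_state, is_public, and is_large are made up for this illustration; they stand in for three filter conditions.

```r
# Scalar case: & is evaluated before |
scalar_default <- TRUE | FALSE & FALSE      # TRUE | (FALSE & FALSE) = TRUE
scalar_grouped <- (TRUE | FALSE) & FALSE    # TRUE & FALSE = FALSE

# Vector case: three hypothetical conditions for four observations
in_state  <- c(TRUE, TRUE,  FALSE, FALSE)
is_public <- c(TRUE, FALSE, TRUE,  FALSE)
is_large  <- c(FALSE, FALSE, TRUE, TRUE)

no_parens <- in_state & is_public | is_large      # same as (in_state & is_public) | is_large
explicit  <- (in_state & is_public) | is_large
grouped   <- in_state & (is_public | is_large)    # parentheses change the grouping

print(no_parens)   # TRUE FALSE TRUE TRUE
print(grouped)     # TRUE FALSE FALSE FALSE
print(identical(no_parens, explicit))
```

Without parentheses, observations 3 and 4 are kept even though in_state is FALSE, because the | condition is not restricted by in_state.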


Now let’s see an example using df_event, the data frame from problem set 3.

Why do we get different results?

nrow(subset(df_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000))
## [1] 2911
nrow(subset(df_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000)))
## [1] 2151

Let’s try to put our code into words.

The first line of code reads: return observations where the recruiting visit took place at a public high school in California, Florida, or Massachusetts, OR at a private high school with more than 1,000 students in any state. Because & is evaluated before |, the state condition applies only to the public high school condition.

tail(subset(df_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000, select = c(event_state, event_type, total_students_pri)))
## # A tibble: 6 x 3
##   event_state event_type total_students_pri
##   <chr>       <chr>                   <dbl>
## 1 HI          private hs               1383
## 2 HI          private hs               1880
## 3 OR          private hs               1300
## 4 OR          private hs               1300
## 5 OR          private hs               1300
## 6 WA          private hs               1132

The second line of code reads: return observations where the recruiting visit took place in California, Florida, or Massachusetts AND at either a public high school or a private high school with more than 1,000 students. The parentheses force the | to be evaluated first, so the state condition applies to both school types.

tail(subset(df_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000), select = c(event_state, event_type, total_students_pri)))
## # A tibble: 6 x 3
##   event_state event_type total_students_pri
##   <chr>       <chr>                   <dbl>
## 1 CA          public hs                  NA
## 2 CA          public hs                  NA
## 3 CA          public hs                  NA
## 4 CA          public hs                  NA
## 5 CA          public hs                  NA
## 6 CA          public hs                  NA
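
Notice that the public high school rows above are kept even though total_students_pri is NA. A small sketch of why, using a made-up three-row data frame (toy is hypothetical; only the column names are borrowed from df_event): in R, TRUE | NA is TRUE, FALSE | NA is NA, and subset() drops rows where the condition is NA.

```r
# Hypothetical toy data, not the real df_event
toy <- data.frame(
  event_type = c("public hs", "private hs", "private hs"),
  total_students_pri = c(NA, 1300, NA),
  stringsAsFactors = FALSE
)

# Row 1: TRUE  | NA   = TRUE  (already TRUE, the NA does not matter)
# Row 2: FALSE | TRUE = TRUE
# Row 3: FALSE | NA   = NA    (unknown)
keep <- toy$event_type == "public hs" | toy$total_students_pri > 1000
print(keep)

# subset() keeps only rows where the condition is TRUE, so row 3 is dropped
kept <- subset(toy, event_type == "public hs" | total_students_pri > 1000)
print(nrow(kept))   # 2
```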

3 Practice with pipes

wwlist data frame

  • De-identified list of prospective students purchased by Western Washington University from College Board
  • We collected these data using public records requests
dim(wwlist)
## [1] 268396     41
names(wwlist)
##  [1] "receive_date"           "psat_range"             "state"                 
##  [4] "zip9"                   "for_country"            "sex"                   
##  [7] "hs_ceeb_code"           "hs_name"                "hs_city"               
## [10] "hs_state"               "hs_grad_date"           "ethn_code"             
## [13] "homeschool"             "firstgen"               "zip5"                  
## [16] "pop_total_zip"          "pop_white_zip"          "pop_black_zip"         
## [19] "pop_asian_zip"          "pop_latinx_zip"         "pop_nativeam_zip"      
## [22] "pop_nativehawaii_zip"   "pop_multirace_zip"      "pop_otherrace_zip"     
## [25] "med_inc_zip"            "school_type"            "merged_hs"             
## [28] "school_category"        "total_12"               "total_students"        
## [31] "fr_lunch"               "pop_total_state"        "pop_white_state"       
## [34] "pop_black_state"        "pop_nativeam_state"     "pop_asian_state"       
## [37] "pop_nativehawaii_state" "pop_otherrace_state"    "pop_multirace_state"   
## [40] "pop_latinx_state"       "med_inc_state"
#glimpse(wwlist)
#str(wwlist)

Let’s use select(), filter(), and arrange() to do the following without pipes (the Base R approach in this lecture), first with nested function calls and then with intermediate objects:

  • Sort wwlist descending by total_students
  • Select the following variables: hs_state, hs_city, hs_name, school_type, total_students
  • Filter for private schools (school_type == "private")
  • Print the first 10 observations
head(select(arrange(filter(wwlist, school_type == "private"), desc(total_students)), hs_state, hs_city, hs_name, school_type, total_students), n = 10)
## # A tibble: 10 x 5
##    hs_state hs_city   hs_name                      school_type total_students
##    <chr>    <chr>     <chr>                        <chr>                <int>
##  1 GA       Norcross  James Madison High School    private               4500
##  2 GA       Norcross  James Madison High School    private               4500
##  3 IL       Lansing   American School              private               3356
##  4 <NA>     <NA>      <NA>                         private               2014
##  5 NC       Charlotte Charlotte Country Day School private               1603
##  6 NC       Charlotte Providence Day School        private               1603
##  7 NC       Charlotte Charlotte Country Day School private               1603
##  8 NC       Charlotte Providence Day School        private               1603
##  9 NC       Charlotte Providence Day School        private               1603
## 10 NC       Charlotte Charlotte Country Day School private               1603
df_temp <- filter(wwlist, school_type == "private")
df_temp2 <- arrange(df_temp, desc(total_students))
head(select(df_temp2, hs_state, hs_city, hs_name, school_type, total_students), n = 10)
## # A tibble: 10 x 5
##    hs_state hs_city   hs_name                      school_type total_students
##    <chr>    <chr>     <chr>                        <chr>                <int>
##  1 GA       Norcross  James Madison High School    private               4500
##  2 GA       Norcross  James Madison High School    private               4500
##  3 IL       Lansing   American School              private               3356
##  4 <NA>     <NA>      <NA>                         private               2014
##  5 NC       Charlotte Charlotte Country Day School private               1603
##  6 NC       Charlotte Providence Day School        private               1603
##  7 NC       Charlotte Charlotte Country Day School private               1603
##  8 NC       Charlotte Providence Day School        private               1603
##  9 NC       Charlotte Providence Day School        private               1603
## 10 NC       Charlotte Charlotte Country Day School private               1603
rm(df_temp, df_temp2)

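Now let’s use pipes (%>%). Note that this version selects state rather than hs_state and hs_city, so the output has four columns instead of five.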

wwlist %>%
  filter(school_type == "private") %>%
  arrange(desc(total_students)) %>%
  select(state, hs_name, school_type, total_students) %>%
  head(n = 10)
## # A tibble: 10 x 4
##    state hs_name                      school_type total_students
##    <chr> <chr>                        <chr>                <int>
##  1 SD    James Madison High School    private               4500
##  2 HI    James Madison High School    private               4500
##  3 MT    American School              private               3356
##  4 HI    <NA>                         private               2014
##  5 NC    Charlotte Country Day School private               1603
##  6 NC    Providence Day School        private               1603
##  7 NC    Charlotte Country Day School private               1603
##  8 NC    Providence Day School        private               1603
##  9 NC    Providence Day School        private               1603
## 10 NC    Charlotte Country Day School private               1603
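
To see that the pipe only changes how the code is written, not what it computes, here is a minimal sketch on a made-up tibble (toy and its values are hypothetical; only the column names mirror wwlist). It assumes the dplyr package is installed, which is part of the tidyverse loaded at the top of this document.

```r
library(dplyr)

# Hypothetical toy data, not the real wwlist
toy <- tibble(
  hs_name = c("A", "B", "C", "D"),
  school_type = c("private", "public", "private", "private"),
  total_students = c(300L, 1200L, 900L, 150L)
)

# Nested calls: read from the inside out
nested <- head(
  select(
    arrange(filter(toy, school_type == "private"), desc(total_students)),
    hs_name, total_students
  ),
  n = 2
)

# Pipes: read from top to bottom, each step feeds the next
piped <- toy %>%
  filter(school_type == "private") %>%
  arrange(desc(total_students)) %>%
  select(hs_name, total_students) %>%
  head(n = 2)

print(piped)
print(all.equal(nested, piped))
```

Both versions filter to the three private schools, sort them by total_students in descending order, and keep the top two (C, then A).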


Your turn

Use select(), filter(), and arrange() to do the following using both the Base R and tidyverse approaches:

  • Sort wwlist descending by med_inc_zip
  • Select the following variables: hs_state, hs_city, hs_name, school_type, med_inc_zip, ethn_code, med_inc_state
  • Filter for public schools (school_type == "public") in the state of New York (hs_state == "NY")
  • Print the first 10 observations

Base R

Tidyverse using pipes


Now let’s use select(), filter(), and arrange() to do the following using both the Base R and tidyverse approaches:

  • Sort wwlist descending by med_inc_zip
  • Select the following variables: hs_state, hs_city, hs_name, school_type, med_inc_zip, ethn_code, med_inc_state
  • Filter for public schools (school_type == "public") where the med_inc_zip is less than the med_inc_state
  • Print the first 10 observations

Base R

Tidyverse using pipes


Bonus question

  • Write down a question you have about the data.
  • Using any or all of the functions select(), filter(), and arrange(), how would you subset and sort the data to answer your question?

Write down your question below:

Now try to work through your question.