1 Introduction

Load packages:

#install.packages('tidyverse')
library(tidyverse)

Load data:

#data we used in the last problem set
load(url('https://github.com/anyone-can-cook/rclass1/raw/master/data/recruiting/recruit_event_somevars.RData'))  

#data we are using in this lecture 
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))

Resources used to create this document:

2 Operator Precedence and Associativity

It is important to understand operator precedence and associativity when working with data in general, but particularly when we use logical operators like & or | to filter data.

Operator Precedence

Operator Precedence refers to the order in which operators are executed. Operators with higher precedence are evaluated first and operators with the lowest precedence are evaluated last. If you recall PEMDAS, multiplication * takes precedence over addition +.

1 + 2 * 6

## [1] 13

If we add parentheses to the expression, the answer changes.

(1 + 2) * 6

## [1] 18
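Precedence surprises are not limited to + and *. The sketch below uses only base R and shows two other cases from R's precedence table (documented under ?Syntax): exponentiation binds tighter than unary minus, and the sequence operator : binds tighter than arithmetic. The variable n is made up for illustration.

```r
# Exponentiation is evaluated before unary minus
-2^2        # -4, read as -(2^2)
(-2)^2      # 4

# The sequence operator : is evaluated before subtraction
n <- 3      # hypothetical value, for illustration only
1:n - 1     # 0 1 2, read as (1:n) - 1
1:(n - 1)   # 1 2
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.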

Operator Associativity

When an expression contains several operators with the same precedence, associativity determines the order in which they are executed.

Operators with the same precedence follow the associativity defined for their operator group. In R, an operator can be left-associative, right-associative, or non-associative. Left-associative operators are evaluated from left to right, right-associative operators are evaluated from right to left, and non-associative operators follow no predefined order. Division, for example, is left-associative, so 8 / 4 / 2 is evaluated as (8 / 4) / 2.

Credit: R Operator Precedence

8 / 4 / 2

## [1] 1

8 / (4 / 2)

## [1] 4

Credit: R Operator Precedence and Associativity, Operator Syntax and Precedence
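The division example above is left-associative. As a complementary sketch in base R, exponentiation and assignment with <- are right-associative, so they are grouped from the right. The names a and b are made up for illustration.

```r
# Left-associative: grouped from the left
10 - 4 - 3    # 3, read as (10 - 4) - 3

# Right-associative: grouped from the right
2^3^2         # 512, read as 2^(3^2)
(2^3)^2       # 64

# Assignment with <- is also right-associative
a <- b <- 5   # read as a <- (b <- 5), so both a and b are 5
```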

R logical operator precedence

In R, & takes precedence over |. Both logical operators are left-associative, meaning that they are evaluated from left to right.

Credit: R Operator Precedence
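A minimal way to check this rule is to combine plain TRUE and FALSE values in base R and compare the result with and without parentheses:

```r
# & is evaluated before |
TRUE | FALSE & FALSE     # TRUE, read as TRUE | (FALSE & FALSE)
(TRUE | FALSE) & FALSE   # FALSE

# ! is evaluated before &
!TRUE & FALSE            # FALSE, read as (!TRUE) & FALSE
!(TRUE & FALSE)          # TRUE
```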

Now let’s see an example using df_event, the data frame from problem set 3.

Why do the following two calls return different results?

nrow(subset(df_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000))

## [1] 2911

nrow(subset(df_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000)))

## [1] 2151

Let’s try to put our code into words.

The first call reads: return observations where the recruiting visit took place at a public high school in California, Florida, or Massachusetts, OR at a private high school in any state where the total number of students is greater than 1000. Because & is evaluated before |, the state condition applies only to the public high schools.

tail(subset(df_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000, select = c(event_state, event_type, total_students_pri)))

## # A tibble: 6 x 3
##   event_state event_type total_students_pri
##   <chr>       <chr>                   <dbl>
## 1 HI          private hs               1383
## 2 HI          private hs               1880
## 3 OR          private hs               1300
## 4 OR          private hs               1300
## 5 OR          private hs               1300
## 6 WA          private hs               1132

The second call reads: return observations where the recruiting visit took place in California, Florida, or Massachusetts AND the school is either a public high school or a private high school where the total number of students is greater than 1000. The parentheses force | to be evaluated first, so the state condition applies to every row.

tail(subset(df_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000), select = c(event_state, event_type, total_students_pri)))

## # A tibble: 6 x 3
##   event_state event_type total_students_pri
##   <chr>       <chr>                   <dbl>
## 1 CA          public hs                  NA
## 2 CA          public hs                  NA
## 3 CA          public hs                  NA
## 4 CA          public hs                  NA
## 5 CA          public hs                  NA
## 6 CA          public hs                  NA
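The same behavior can be reproduced on a small data frame where every row can be checked by hand. This is only a sketch: the data frame toy and its values are invented, and only the column names mirror df_event. It also shows why public high schools with a missing total_students_pri are kept: TRUE | NA evaluates to TRUE.

```r
# Hypothetical toy data; the values are invented for illustration
toy <- data.frame(
  event_state = c("CA", "CA", "HI", "OR", "FL"),
  event_type = c("public hs", "private hs", "private hs", "private hs", "private hs"),
  total_students_pri = c(NA, 1200, 1383, 800, 500),
  stringsAsFactors = FALSE
)

in_states   <- toy$event_state %in% c("CA", "FL", "MA")
is_public   <- toy$event_type == "public hs"
big_private <- toy$total_students_pri > 1000   # NA for the public school

# Without parentheses: (in_states & is_public) | big_private
no_parens <- in_states & is_public | big_private
# With parentheses: in_states & (is_public | big_private)
with_parens <- in_states & (is_public | big_private)

toy[no_parens, ]     # rows 1, 2, 3: the HI private school is kept
toy[with_parens, ]   # rows 1, 2: only schools in CA, FL, or MA

# Row 1 is kept in both cases because TRUE | NA is TRUE
TRUE | NA
```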

3 Practice with pipes

wwlist data frame

De-identified list of prospective students purchased by Western Washington University from College Board
We collected these data using public records requests

dim(wwlist)

## [1] 268396     41

names(wwlist)

##  [1] "receive_date"           "psat_range"             "state"                 
##  [4] "zip9"                   "for_country"            "sex"                   
##  [7] "hs_ceeb_code"           "hs_name"                "hs_city"               
## [10] "hs_state"               "hs_grad_date"           "ethn_code"             
## [13] "homeschool"             "firstgen"               "zip5"                  
## [16] "pop_total_zip"          "pop_white_zip"          "pop_black_zip"         
## [19] "pop_asian_zip"          "pop_latinx_zip"         "pop_nativeam_zip"      
## [22] "pop_nativehawaii_zip"   "pop_multirace_zip"      "pop_otherrace_zip"     
## [25] "med_inc_zip"            "school_type"            "merged_hs"             
## [28] "school_category"        "total_12"               "total_students"        
## [31] "fr_lunch"               "pop_total_state"        "pop_white_state"       
## [34] "pop_black_state"        "pop_nativeam_state"     "pop_asian_state"       
## [37] "pop_nativehawaii_state" "pop_otherrace_state"    "pop_multirace_state"   
## [40] "pop_latinx_state"       "med_inc_state"

#glimpse(wwlist)
#str(wwlist)

Let’s use select(), filter(), and arrange() to do the following without pipes (the approach this lecture calls Base R), first as nested function calls and then with temporary objects:

Sort wwlist descending by total_students
Select the following variables: hs_state, hs_city, hs_name, school_type, total_students
Filter for private schools (school_type == "private")
Print the first 10 observations

head(select(arrange(filter(wwlist, school_type == "private"), desc(total_students)), hs_state, hs_city, hs_name, school_type, total_students), n = 10)

## # A tibble: 10 x 5
##    hs_state hs_city   hs_name                      school_type total_students
##    <chr>    <chr>     <chr>                        <chr>                <int>
##  1 GA       Norcross  James Madison High School    private               4500
##  2 GA       Norcross  James Madison High School    private               4500
##  3 IL       Lansing   American School              private               3356
##  4 <NA>     <NA>      <NA>                         private               2014
##  5 NC       Charlotte Charlotte Country Day School private               1603
##  6 NC       Charlotte Providence Day School        private               1603
##  7 NC       Charlotte Charlotte Country Day School private               1603
##  8 NC       Charlotte Providence Day School        private               1603
##  9 NC       Charlotte Providence Day School        private               1603
## 10 NC       Charlotte Charlotte Country Day School private               1603

df_temp <- filter(wwlist,school_type == "private")
df_temp2 <- arrange(df_temp,desc(total_students))
head(select(df_temp2, hs_state, hs_city, hs_name, school_type, total_students),n=10)

## # A tibble: 10 x 5
##    hs_state hs_city   hs_name                      school_type total_students
##    <chr>    <chr>     <chr>                        <chr>                <int>
##  1 GA       Norcross  James Madison High School    private               4500
##  2 GA       Norcross  James Madison High School    private               4500
##  3 IL       Lansing   American School              private               3356
##  4 <NA>     <NA>      <NA>                         private               2014
##  5 NC       Charlotte Charlotte Country Day School private               1603
##  6 NC       Charlotte Providence Day School        private               1603
##  7 NC       Charlotte Charlotte Country Day School private               1603
##  8 NC       Charlotte Providence Day School        private               1603
##  9 NC       Charlotte Providence Day School        private               1603
## 10 NC       Charlotte Charlotte Country Day School private               1603

rm(df_temp,df_temp2)

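Now let’s use pipes (%>%). Note that this version selects state instead of hs_state and hs_city, so its output differs from the two versions above: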

wwlist %>%
  filter(school_type == "private") %>%
  arrange(desc(total_students)) %>%
  select(state, hs_name, school_type, total_students) %>%
  head(n = 10)

## # A tibble: 10 x 4
##    state hs_name                      school_type total_students
##    <chr> <chr>                        <chr>                <int>
##  1 SD    James Madison High School    private               4500
##  2 HI    James Madison High School    private               4500
##  3 MT    American School              private               3356
##  4 HI    <NA>                         private               2014
##  5 NC    Charlotte Country Day School private               1603
##  6 NC    Providence Day School        private               1603
##  7 NC    Charlotte Country Day School private               1603
##  8 NC    Providence Day School        private               1603
##  9 NC    Providence Day School        private               1603
## 10 NC    Charlotte Country Day School private               1603
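The pipe passes the object on its left as the first argument of the function on its right, so x %>% f(y) is equivalent to f(x, y). The sketch below checks this on a small tibble; the tibble toy and its values are invented for illustration, and only the column names come from wwlist.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
toy <- tibble(
  hs_name = c("A", "B", "C", "D"),
  school_type = c("private", "public", "private", "private"),
  total_students = c(300L, 1200L, 900L, 150L)
)

# Nested calls: read from the inside out
nested <- head(arrange(filter(toy, school_type == "private"), desc(total_students)), n = 2)

# Pipes: read from top to bottom
piped <- toy %>%
  filter(school_type == "private") %>%
  arrange(desc(total_students)) %>%
  head(n = 2)

nested$hs_name   # "C" "A"
piped$hs_name    # "C" "A"
```

Both versions run the same three steps in the same order; the piped version lists them in the order they are executed.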

Your turn

Use select(), filter(), and arrange() to do the following using both the Base R and tidyverse approaches:

Sort wwlist descending by med_inc_zip
Select the following variables: hs_state, hs_city, hs_name, school_type, med_inc_zip, ethn_code, med_inc_state
Filter for public schools (school_type == "public") in the state of New York (hs_state == "NY")
Print the first 10 observations

Base R

Tidyverse using pipes

Now let’s use select(), filter(), and arrange() to do the following using both the Base R and tidyverse approaches:

Sort wwlist descending by med_inc_zip
Select the following variables: hs_state, hs_city, hs_name, school_type, med_inc_zip, ethn_code, med_inc_state
Filter for public schools (school_type == "public") where the med_inc_zip is less than the med_inc_state
Print the first 10 observations
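As a hint rather than a solution, the sketch below shows on invented data how filter() can compare one column with another. The tibble toy and its values are hypothetical; only the column names come from wwlist. Conditions separated by commas inside filter() are combined with AND, and rows where a condition evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
toy <- tibble(
  hs_name = c("A", "B", "C", "D"),
  school_type = c("public", "public", "private", "public"),
  med_inc_zip = c(50000, 90000, 40000, NA),
  med_inc_state = c(60000, 60000, 60000, 60000)
)

below_state <- toy %>%
  # Keep public schools whose zip-code income is below the state income
  filter(school_type == "public", med_inc_zip < med_inc_state) %>%
  arrange(desc(med_inc_zip))

below_state$hs_name   # "A": B is above the state value, C is private, D is NA
```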

Base R

Tidyverse using pipes

Bonus question

Write down a question you have about the data.
Using any or all of the following functions select(), filter(), arrange(), how would you go about subsetting and sorting the data to answer your question?

Write down your question below:

Now try to work through your question.
