Load packages:
#install.packages('tidyverse')
library(tidyverse)
Load data:
#data we used in the last problem set
load(url('https://github.com/anyone-can-cook/rclass1/raw/master/data/recruiting/recruit_event_somevars.RData'))  
#data we are using in this lecture 
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))
Resources used to create this document:
It is important to understand operator precedence and associativity when working with data in general, but particularly when we use logical operators like & or | to filter data.
Operator Precedence
Operator Precedence refers to the order in which operators are executed. Operators with higher precedence are evaluated first and operators with the lowest precedence are evaluated last. If you recall PEMDAS, multiplication * takes precedence over addition +.
1 + 2 * 6
## [1] 13
If we add parentheses to the expression, our answer changes.
(1 + 2) * 6
## [1] 18
Operator Associativity
When operators have the same precedence, the order in which they are executed is determined by associativity.
Operators with the same precedence follow the associativity defined for their operator group. In R, operators can be left-associative, right-associative, or non-associative. Left-associative operators are evaluated from left to right, right-associative operators are evaluated from right to left, and non-associative operators do not follow any predefined order.
Credit: R Operator Precedence
8 / 4 / 2
## [1] 1
8 / (4 / 2)
## [1] 4
Credit: R Operator Precedence and Associativity, Operator Syntax and Precedence
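The division example shows a left-associative operator. As a complementary sketch, the block below contrasts it with exponentiation (^), which is right-associative in R. The variable names are illustrative only.

```r
# Left-associative: division groups from the left
left_default  <- 8 / 4 / 2       # evaluated as (8 / 4) / 2
left_explicit <- (8 / 4) / 2

# Right-associative: exponentiation groups from the right
right_default  <- 2 ^ 3 ^ 2      # evaluated as 2 ^ (3 ^ 2), i.e. 2 ^ 9
right_explicit <- 2 ^ (3 ^ 2)
left_forced    <- (2 ^ 3) ^ 2    # parentheses force left grouping, i.e. 8 ^ 2

print(c(left_default, left_explicit))
print(c(right_default, right_explicit, left_forced))
```

When in doubt about how R will group an expression, adding explicit parentheses makes the intended order unambiguous.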
R logical operator precedence
According to R’s operator precedence rules, & takes precedence over |, so a & b | c is evaluated as (a & b) | c. Both logical operators are left-associative, meaning they are evaluated from left to right.
Credit: R Operator Precedence
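A minimal sketch of this rule, first on single logical values and then on a small made-up data frame. The data frame df_toy and its values are hypothetical; the column names only loosely mirror the recruiting data.

```r
# & binds tighter than |, so this is TRUE | (FALSE & FALSE)
no_parens   <- TRUE | FALSE & FALSE
with_parens <- (TRUE | FALSE) & FALSE
print(c(no_parens, with_parens))

# Hypothetical toy data (values are made up for illustration)
df_toy <- data.frame(
  event_state        = c("CA", "CA", "HI", "HI"),
  event_type         = c("public hs", "private hs", "private hs", "public hs"),
  total_students_pri = c(NA, 1200, 1383, NA)
)

# Without parentheses: (state AND public hs) OR more than 1000 students
n_default <- nrow(subset(
  df_toy,
  event_state == "CA" & event_type == "public hs" | total_students_pri > 1000
))

# With parentheses: state AND (public hs OR more than 1000 students)
n_grouped <- nrow(subset(
  df_toy,
  event_state == "CA" & (event_type == "public hs" | total_students_pri > 1000)
))

# Note: TRUE | NA is TRUE, so a public hs row with a missing
# total_students_pri can still satisfy the condition
print(c(n_default, n_grouped))
```

In this toy data the unparenthesized filter keeps the Hawaii private school while the parenthesized one does not, because the state condition only applies to everything once the | clause is grouped.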
Now let’s see an example using the data frame from problem set 3.
Why do we get different results?
nrow(subset(df_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000))
## [1] 2911
nrow(subset(df_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000)))
## [1] 2151
Let’s try to put our code into words.
The first expression reads: return observations where the recruiting visit took place at a public high school in California, Florida, or Massachusetts, OR at a private high school with more than 1,000 students in any state. Because & is evaluated before |, the state condition applies only to the public high school condition.
tail(subset(df_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000, select = c(event_state, event_type, total_students_pri)))
## # A tibble: 6 x 3
##   event_state event_type total_students_pri
##   <chr>       <chr>                   <dbl>
## 1 HI          private hs               1383
## 2 HI          private hs               1880
## 3 OR          private hs               1300
## 4 OR          private hs               1300
## 5 OR          private hs               1300
## 6 WA          private hs               1132
The second expression reads: return observations where the recruiting visit took place in California, Florida, or Massachusetts, at either a public high school OR a private high school with more than 1,000 students. The parentheses force | to be evaluated first, so the state condition applies to both school types.
tail(subset(df_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000), select = c(event_state, event_type, total_students_pri)))
## # A tibble: 6 x 3
##   event_state event_type total_students_pri
##   <chr>       <chr>                   <dbl>
## 1 CA          public hs                  NA
## 2 CA          public hs                  NA
## 3 CA          public hs                  NA
## 4 CA          public hs                  NA
## 5 CA          public hs                  NA
## 6 CA          public hs                  NA
wwlist data frame
dim(wwlist)
## [1] 268396     41
names(wwlist)
##  [1] "receive_date"           "psat_range"             "state"                 
##  [4] "zip9"                   "for_country"            "sex"                   
##  [7] "hs_ceeb_code"           "hs_name"                "hs_city"               
## [10] "hs_state"               "hs_grad_date"           "ethn_code"             
## [13] "homeschool"             "firstgen"               "zip5"                  
## [16] "pop_total_zip"          "pop_white_zip"          "pop_black_zip"         
## [19] "pop_asian_zip"          "pop_latinx_zip"         "pop_nativeam_zip"      
## [22] "pop_nativehawaii_zip"   "pop_multirace_zip"      "pop_otherrace_zip"     
## [25] "med_inc_zip"            "school_type"            "merged_hs"             
## [28] "school_category"        "total_12"               "total_students"        
## [31] "fr_lunch"               "pop_total_state"        "pop_white_state"       
## [34] "pop_black_state"        "pop_nativeam_state"     "pop_asian_state"       
## [37] "pop_nativehawaii_state" "pop_otherrace_state"    "pop_multirace_state"   
## [40] "pop_latinx_state"       "med_inc_state"
#glimpse(wwlist)
#str(wwlist)
Let’s use select(), filter(), and arrange() to do the following using the Base R approach (nested function calls, without pipes):
- Sort wwlist descending by total_students
- Select hs_state, hs_city, hs_name, school_type, total_students
- Filter private high schools (school_type == "private")
head(select(arrange(filter(wwlist, school_type == "private"), desc(total_students)), hs_state, hs_city, hs_name, school_type, total_students), n = 10)
## # A tibble: 10 x 5
##    hs_state hs_city   hs_name                      school_type total_students
##    <chr>    <chr>     <chr>                        <chr>                <int>
##  1 GA       Norcross  James Madison High School    private               4500
##  2 GA       Norcross  James Madison High School    private               4500
##  3 IL       Lansing   American School              private               3356
##  4 <NA>     <NA>      <NA>                         private               2014
##  5 NC       Charlotte Charlotte Country Day School private               1603
##  6 NC       Charlotte Providence Day School        private               1603
##  7 NC       Charlotte Charlotte Country Day School private               1603
##  8 NC       Charlotte Providence Day School        private               1603
##  9 NC       Charlotte Providence Day School        private               1603
## 10 NC       Charlotte Charlotte Country Day School private               1603
The same result, using temporary objects instead of nested calls:
df_temp <- filter(wwlist, school_type == "private")
df_temp2 <- arrange(df_temp, desc(total_students))
head(select(df_temp2, hs_state, hs_city, hs_name, school_type, total_students), n = 10)
## # A tibble: 10 x 5
##    hs_state hs_city   hs_name                      school_type total_students
##    <chr>    <chr>     <chr>                        <chr>                <int>
##  1 GA       Norcross  James Madison High School    private               4500
##  2 GA       Norcross  James Madison High School    private               4500
##  3 IL       Lansing   American School              private               3356
##  4 <NA>     <NA>      <NA>                         private               2014
##  5 NC       Charlotte Charlotte Country Day School private               1603
##  6 NC       Charlotte Providence Day School        private               1603
##  7 NC       Charlotte Charlotte Country Day School private               1603
##  8 NC       Charlotte Providence Day School        private               1603
##  9 NC       Charlotte Providence Day School        private               1603
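## 10 NC       Charlotte Charlotte Country Day School private               1603
rm(df_temp, df_temp2)
Now let’s use pipes (%>%). Note that this version selects state and hs_name rather than hs_state, hs_city, and hs_name, so the output columns differ from those above.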
wwlist %>%
  filter(school_type == "private") %>%
  arrange(desc(total_students)) %>%
  select(state, hs_name, school_type, total_students) %>%
  head(n = 10)
## # A tibble: 10 x 4
##    state hs_name                      school_type total_students
##    <chr> <chr>                        <chr>                <int>
##  1 SD    James Madison High School    private               4500
##  2 HI    James Madison High School    private               4500
##  3 MT    American School              private               3356
##  4 HI    <NA>                         private               2014
##  5 NC    Charlotte Country Day School private               1603
##  6 NC    Providence Day School        private               1603
##  7 NC    Charlotte Country Day School private               1603
##  8 NC    Providence Day School        private               1603
##  9 NC    Providence Day School        private               1603
## 10 NC    Charlotte Country Day School private               1603
Your turn
Use select(), filter(), and arrange() to do the following using both the Base R & tidyverse approach:
- Sort wwlist descending by med_inc_zip
- Select hs_state, hs_city, hs_name, school_type, med_inc_zip, ethn_code, med_inc_state
- Filter public high schools (school_type == "public") in the state of New York (hs_state == "NY")
Base R
Tidyverse using pipes
Now let’s use select(), filter(), and arrange() to do the following using both the Base R & tidyverse approach:
- Sort wwlist descending by med_inc_zip
- Select hs_state, hs_city, hs_name, school_type, med_inc_zip, ethn_code, med_inc_state
- Filter public high schools (school_type == "public") where med_inc_zip is less than med_inc_state
Base R
Tidyverse using pipes
Bonus question
Come up with a question of your own about the wwlist data. Using select(), filter(), and arrange(), how would you go about subsetting and sorting the data to answer your question?
Write down your question below:
Now try to work through your question.
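As a starting point, here is a rough sketch of the general pattern on a small made-up data frame. The object df_toy, its values, and the example question are all hypothetical; only the column names are borrowed from wwlist. Swap in wwlist and your own conditions.

```r
library(dplyr)

# Hypothetical toy data: column names mirror wwlist, values are made up
df_toy <- tibble(
  hs_state       = c("NY", "CA", "NY", "TX", "CA"),
  school_type    = c("public", "public", "private", "public", "private"),
  total_students = c(1800L, 2500L, 400L, 900L, 1200L)
)

# Example question: which public high schools have more than 1,000 students,
# listed from largest to smallest?
result <- df_toy %>%
  filter(school_type == "public", total_students > 1000) %>%  # subset rows
  arrange(desc(total_students)) %>%                           # sort descending
  select(hs_state, total_students)                            # keep columns

print(result)
```

Writing the question in words first, then mapping each clause to filter(), arrange(), or select(), tends to make the pipeline easier to check.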