Load packages:
#install.packages('tidyverse')
library(tidyverse)
Load data:
#data we used in the last problem set
load(url('https://github.com/anyone-can-cook/rclass1/raw/master/data/recruiting/recruit_event_somevars.RData'))
#data we are using in this lecture
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))
It is important to understand operator precedence and associativity when working with data in general, but particularly when we use logical operators like & or | to filter data.
Operator Precedence
Operator precedence refers to the order in which operators are evaluated. Operators with higher precedence are evaluated first, and operators with the lowest precedence are evaluated last. If you recall PEMDAS, multiplication * takes precedence over addition +.
1 + 2 * 6
## [1] 13
If we add parentheses to the expression, our answer changes.
(1 + 2) * 6
## [1] 18
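A few more cases make the same point. This is a minimal sketch in plain R; the numbers and object names are arbitrary and chosen only for illustration. The last two lines show a case that often surprises people: exponentiation ^ binds more tightly than unary minus.

```r
# multiplication is evaluated before addition
a <- 2 + 3 * 4      # 2 + 12 = 14

# parentheses force the addition to happen first
b <- (2 + 3) * 4    # 5 * 4 = 20

# exponentiation binds more tightly than unary minus
c1 <- -2^2          # -(2^2) = -4
c2 <- (-2)^2        # 4

print(c(a, b, c1, c2))
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.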
Operator Associativity
When you are working with operators that have the same precedence, the order of evaluation is determined by associativity.
Operators with the same precedence follow the associativity defined for their operator group. In R, operators can be left-associative, right-associative, or have no associativity. Left-associative operators are evaluated from left to right, right-associative operators are evaluated from right to left, and operators with no associativity do not follow any predefined order.
Credit: R Operator Precedence
8 / 4 / 2
## [1] 1
8 / (4 / 2)
## [1] 4
Credit: R Operator Precedence and Associativity, Operator Syntax and Precedence
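The division example above is left-associative. For contrast, here is a short sketch of two right-associative operators in R, exponentiation ^ and assignment <-. The object names are made up for illustration.

```r
# exponentiation is right-associative: evaluated from right to left
right_assoc <- 2^3^2      # 2^(3^2) = 2^9 = 512

# parentheses force left-to-right evaluation instead
forced_left <- (2^3)^2    # 8^2 = 64

# assignment is also right-associative: y gets 5 first, then x gets y
x <- y <- 5

print(c(right_assoc, forced_left, x, y))
```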
R logical operator precedence
According to R's operator precedence rules, & takes precedence over |. Both logical operators are left-associative, meaning they are evaluated from left to right.
Credit: R Operator Precedence
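A small sketch with plain TRUE/FALSE values shows what this means in practice. Because & is evaluated before |, the expression without parentheses groups the & part first. The object names are illustrative only.

```r
# & is evaluated first, so this is TRUE | (FALSE & FALSE)
no_parens <- TRUE | FALSE & FALSE        # TRUE | FALSE = TRUE

# parentheses force | to be evaluated first
with_parens <- (TRUE | FALSE) & FALSE    # TRUE & FALSE = FALSE

print(c(no_parens, with_parens))
```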
Now let’s see an example using the data frame from problem set 3 (df_event).
Why do we get different results?
nrow(subset(df_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000))
## [1] 2911
nrow(subset(df_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000)))
## [1] 2151
Let’s try to put our code into words.
The first expression reads: return observations where the recruiting visit took place at a public high school in California, Florida, or Massachusetts, OR where the visit took place at a private high school with more than 1000 students, in any state.
tail(subset(df_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000, select = c(event_state, event_type, total_students_pri)))
## # A tibble: 6 x 3
## event_state event_type total_students_pri
## <chr> <chr> <dbl>
## 1 HI private hs 1383
## 2 HI private hs 1880
## 3 OR private hs 1300
## 4 OR private hs 1300
## 5 OR private hs 1300
## 6 WA private hs 1132
The second expression reads: return observations where the recruiting visit took place in California, Florida, or Massachusetts AND was at either a public high school or a private high school with more than 1000 students.
tail(subset(df_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000), select = c(event_state, event_type, total_students_pri)))
## # A tibble: 6 x 3
## event_state event_type total_students_pri
## <chr> <chr> <dbl>
## 1 CA public hs NA
## 2 CA public hs NA
## 3 CA public hs NA
## 4 CA public hs NA
## 5 CA public hs NA
## 6 CA public hs NA
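To see the difference row by row, here is a minimal sketch on a tiny made-up data frame. toy_event is a hypothetical stand-in for df_event with invented values; only the column names are borrowed from the real data.

```r
# hypothetical stand-in for df_event; all values are invented
toy_event <- data.frame(
  event_state = c("CA", "CA", "OR", "OR", "FL"),
  event_type = c("public hs", "private hs", "private hs", "private hs", "private hs"),
  total_students_pri = c(NA, 1500, 1300, 400, 800),
  stringsAsFactors = FALSE
)

# without parentheses: (state AND public hs) OR (more than 1000 private students)
no_parens <- subset(toy_event, event_state %in% c("CA", "FL", "MA") & event_type == "public hs" | total_students_pri > 1000)

# with parentheses: state AND (public hs OR more than 1000 private students)
with_parens <- subset(toy_event, event_state %in% c("CA", "FL", "MA") & (event_type == "public hs" | total_students_pri > 1000))

nrow(no_parens)
nrow(with_parens)
```

The Oregon private school with 1300 students is kept only by the version without parentheses, because there the state condition applies only to the public high school part of the expression.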
wwlist data frame
dim(wwlist)
## [1] 268396 41
names(wwlist)
## [1] "receive_date" "psat_range" "state"
## [4] "zip9" "for_country" "sex"
## [7] "hs_ceeb_code" "hs_name" "hs_city"
## [10] "hs_state" "hs_grad_date" "ethn_code"
## [13] "homeschool" "firstgen" "zip5"
## [16] "pop_total_zip" "pop_white_zip" "pop_black_zip"
## [19] "pop_asian_zip" "pop_latinx_zip" "pop_nativeam_zip"
## [22] "pop_nativehawaii_zip" "pop_multirace_zip" "pop_otherrace_zip"
## [25] "med_inc_zip" "school_type" "merged_hs"
## [28] "school_category" "total_12" "total_students"
## [31] "fr_lunch" "pop_total_state" "pop_white_state"
## [34] "pop_black_state" "pop_nativeam_state" "pop_asian_state"
## [37] "pop_nativehawaii_state" "pop_otherrace_state" "pop_multirace_state"
## [40] "pop_latinx_state" "med_inc_state"
#glimpse(wwlist)
#str(wwlist)
Let’s use select(), filter(), and arrange() to do the following using the Base R approach (nested function calls, no pipes):
Use the data frame wwlist
Sort descending by total_students
Select the variables hs_state, hs_city, hs_name, school_type, total_students
Filter for private high schools (school_type == "private")
head(select(arrange(filter(wwlist, school_type == "private"), desc(total_students)), hs_state, hs_city, hs_name, school_type, total_students), n = 10)
## # A tibble: 10 x 5
## hs_state hs_city hs_name school_type total_students
## <chr> <chr> <chr> <chr> <int>
## 1 GA Norcross James Madison High School private 4500
## 2 GA Norcross James Madison High School private 4500
## 3 IL Lansing American School private 3356
## 4 <NA> <NA> <NA> private 2014
## 5 NC Charlotte Charlotte Country Day School private 1603
## 6 NC Charlotte Providence Day School private 1603
## 7 NC Charlotte Charlotte Country Day School private 1603
## 8 NC Charlotte Providence Day School private 1603
## 9 NC Charlotte Providence Day School private 1603
## 10 NC Charlotte Charlotte Country Day School private 1603
#same task, broken into steps with temporary objects
df_temp <- filter(wwlist, school_type == "private")
df_temp2 <- arrange(df_temp, desc(total_students))
head(select(df_temp2, hs_state, hs_city, hs_name, school_type, total_students), n = 10)
## # A tibble: 10 x 5
## hs_state hs_city hs_name school_type total_students
## <chr> <chr> <chr> <chr> <int>
## 1 GA Norcross James Madison High School private 4500
## 2 GA Norcross James Madison High School private 4500
## 3 IL Lansing American School private 3356
## 4 <NA> <NA> <NA> private 2014
## 5 NC Charlotte Charlotte Country Day School private 1603
## 6 NC Charlotte Providence Day School private 1603
## 7 NC Charlotte Charlotte Country Day School private 1603
## 8 NC Charlotte Providence Day School private 1603
## 9 NC Charlotte Providence Day School private 1603
## 10 NC Charlotte Charlotte Country Day School private 1603
rm(df_temp,df_temp2)
Now let’s use pipes (%>%):
wwlist %>% filter(school_type == "private") %>%
arrange(desc(total_students)) %>%
select(state, hs_name, school_type, total_students) %>%
head(n = 10)
## # A tibble: 10 x 4
## state hs_name school_type total_students
## <chr> <chr> <chr> <int>
## 1 SD James Madison High School private 4500
## 2 HI James Madison High School private 4500
## 3 MT American School private 3356
## 4 HI <NA> private 2014
## 5 NC Charlotte Country Day School private 1603
## 6 NC Providence Day School private 1603
## 7 NC Charlotte Country Day School private 1603
## 8 NC Providence Day School private 1603
## 9 NC Providence Day School private 1603
## 10 NC Charlotte Country Day School private 1603
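The nested and piped styles are two ways of writing the same sequence of steps. The sketch below checks this on a small made-up tibble. toy_ww is a hypothetical stand-in for wwlist with invented values, and library(dplyr) is used here because it is the part of the tidyverse that provides these functions.

```r
library(dplyr)  # part of the tidyverse; provides filter(), arrange(), select(), %>%

# hypothetical stand-in for wwlist; all values are invented
toy_ww <- tibble(
  hs_name = c("A", "B", "C", "D"),
  school_type = c("private", "public", "private", "private"),
  total_students = c(300L, 2000L, 1200L, 800L)
)

# nested calls: read from the inside out
nested <- head(select(arrange(filter(toy_ww, school_type == "private"), desc(total_students)), hs_name, total_students), n = 2)

# pipes: read from top to bottom, one step per line
piped <- toy_ww %>%
  filter(school_type == "private") %>%
  arrange(desc(total_students)) %>%
  select(hs_name, total_students) %>%
  head(n = 2)

# both approaches return the same rows in the same order
identical(nested, piped)
```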
Your turn
Use select(), filter(), and arrange() to do the following using both the Base R and tidyverse approaches:
Use the data frame wwlist
Sort descending by med_inc_zip
Select the variables hs_state, hs_city, hs_name, school_type, med_inc_zip, ethn_code, med_inc_state
Filter for public high schools (school_type == "public") in the state of New York (hs_state == "NY")
Base R
Tidyverse using pipes
Now let’s use select(), filter(), and arrange() to do the following using both the Base R and tidyverse approaches:
Use the data frame wwlist
Sort descending by med_inc_zip
Select the variables hs_state, hs_city, hs_name, school_type, med_inc_zip, ethn_code, med_inc_state
Filter for public high schools (school_type == "public") where med_inc_zip is less than med_inc_state
Base R
Tidyverse using pipes
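One possible way to approach this exercise is sketched below on a small made-up tibble rather than on wwlist itself. toy_ww and its values are hypothetical, and only a few of the requested variables are included. The point it illustrates is that filter() can compare two columns row by row (med_inc_zip < med_inc_state) and that conditions separated by commas are combined with AND.

```r
library(dplyr)

# hypothetical stand-in for wwlist; all values are invented
toy_ww <- tibble(
  hs_name = c("A", "B", "C", "D"),
  school_type = c("public", "public", "private", "public"),
  med_inc_zip = c(40000, 90000, 30000, 55000),
  med_inc_state = c(60000, 60000, 60000, 60000)
)

# nested calls (what this lecture calls the Base R approach)
nested <- select(arrange(filter(toy_ww, school_type == "public", med_inc_zip < med_inc_state), desc(med_inc_zip)), hs_name, school_type, med_inc_zip, med_inc_state)

# tidyverse using pipes
piped <- toy_ww %>%
  filter(school_type == "public", med_inc_zip < med_inc_state) %>%
  arrange(desc(med_inc_zip)) %>%
  select(hs_name, school_type, med_inc_zip, med_inc_state)

piped
```

To apply the same pattern to the real data, wwlist would take the place of toy_ww and the select() call would list all of the variables named in the exercise.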
Bonus question
Using select(), filter(), and arrange(), how would you go about subsetting and sorting the data to answer your question? Write down your question below:
Now try to work through your question.