Across the tidyverse: processing across rows
1 Introduction
1.1 Libraries we will use today
“Load” the package we will use today (output omitted)
- you must run this code chunk
library(tidyverse)
If a package is not yet installed, you must install it before you can load it. Install in the “console” rather than in the .Rmd file
- Generic syntax:
install.packages("package_name")
- Install “tidyverse”:
install.packages("tidyverse")
Note: when we load a package, the name of the package is not in quotes; but when we install a package, the name of the package is in quotes:
install.packages("tidyverse")
library(tidyverse)
1.2 Data we will use today
Data on off-campus recruiting events by public universities
- Object: df_event
- One observation per university, recruiting event
rm(list = ls()) # remove all objects
#load dataset with one obs per recruiting event
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
#load("../../data/recruiting/recruit_event_allvars.Rdata")
1.3 What we are doing today
Processing across variables vs. processing across observations
In today’s lecture, I’ll use the terms “observations” and “rows” interchangeably. Creation of analysis datasets often requires calculations across obs.
Examples (the first one is sketched in code right after this list):
- You have a dataset with one observation per student-term and want to create a variable of credits attempted per term
- You have a dataset with one observation per student-term and want to create a variable of GPA for the semester or cumulative GPA for all semesters
- Number of off-campus recruiting events a university makes to each state
- Average household income at visited versus non-visited high schools
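To make the first example concrete: if each row were a course a student attempted in a term (df_courses and its variables are hypothetical, not part of today's data), credits attempted per student-term could be computed with group_by() and summarize(), which we cover below:
df_courses <- tibble(
  student_id = c(1, 1, 1, 2, 2),
  term       = c("fall", "fall", "spring", "fall", "fall"),
  credits    = c(4, 3, 4, 5, 3)
)
df_courses %>%
  group_by(student_id, term) %>%              # one group per student-term
  summarize(credits_attempted = sum(credits)) # total credits attempted that term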
Creating graphs and tables of descriptive stats usually requires calculations across obs
Example: Want to create a graph that shows number of recruiting events by event “type” (e.g., public HS, private HS) for each university
- Start with df_event dataset that has one observation per university, recruiting event
- Create a new data frame object that has one observation per university and event type and has a variable for the number of events
  - this variable is calculated by counting the number of rows in each combination of university and event type
- This new data frame object is the input for creating the desired graph
So far, we have focused on “processing across variables”
- Performing calculations across columns (i.e., vars), typically within a row (i.e., observation)
- Example: percent free-reduced lunch (above)
Visits by UC Berkeley to public high schools
#> # A tibble: 5 × 6
#> school_id state tot_stu_pub fr_lunch pct_fr_lunch med_inc
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 340882002126 NJ 1846 29 0.0157 178732
#> 2 340147000250 NJ 1044 50 0.0479 62288
#> 3 340561003796 NJ 1505 298 0.198 100684.
#> 4 340165005124 NJ 1900 43 0.0226 160476.
#> 5 341341003182 NJ 1519 130 0.0856 144346
Processing across obs (focus of today’s lecture)
- Performing calculations across rows (i.e., obs), often within a column (i.e., variable)
- Example: Average household income of visited high schools, by state
Strategy for teaching processing across obs
In tidyverse, the group_by() and summarize() functions are the primary means of performing calculations across observations
- Usually, processing across observations requires using group_by() and summarize() together
  - group_by() typically not very useful by itself
  - summarize() [with or without group_by()] can be helpful for creating summary statistics that are the inputs for tables or graphs you create
How we’ll teach:
- introduce group_by() and summarize() separately
  - goal: you understand what each function does
- then we’ll combine them
2 group_by() and summarize()
2.1 group_by
Description: “group_by() takes an existing data frame and converts it into a grouped data frame where operations are performed ‘by group’. ungroup() removes grouping.”
- part of dplyr package within tidyverse; not part of Base R
- works best with pipes %>% and the summarize() function [described below]
Basic syntax: group_by(.data, ...)
- .data argument refers to name of data frame
- ... argument refers to names of “group_by” variables, separated by commas
- Can “group by” one or many variables
- Typically, “group_by” variables are character, factor, or integer variables
Possible “group by” variables in df_event data:
- university name/id; event type (e.g., public HS, private HS); state
Example: in df_event, create frequency count of event_type [output omitted]
names(df_event)
#without group_by()
df_event %>% count(event_type)
df_event %>% count(instnm)
#group_by() university
df_event %>% group_by(instnm) %>% count(event_type)
By itself group_by() doesn’t do much; it just prints the data
- Below, group df_event data by university, event type, and event state
#group_by (without pipes)
group_by(df_event, univ_id, event_type, event_state)
#group_by (with pipes)
df_event %>% group_by(univ_id, event_type, event_state)
df_event %>% group_by(univ_id, event_type, event_state) %>% glimpse()
But once an object is grouped, all subsequent functions are run separately “by group”
- recall that count() counts the number of observations by group
# count number of observations in group, ungrouped data
df_event %>% count()
#group by and then count obs
df_event %>% group_by(univ_id) %>% count()
df_event %>% group_by(univ_id) %>% count() %>% glimpse()
#df_event %>% group_by(univ_id) %>% count() %>% View()
df_event %>% group_by(univ_id, event_type) %>% count()
df_event %>% group_by(univ_id, event_type) %>% count() %>% glimpse()
#df_event %>% group_by(univ_id, event_type) %>% count() %>% View()
2.1.1 Grouping not retained unless you assign it
Below, we’ll use the class() function to show whether a data frame is grouped
- will talk more about class() next week, but for now, just think of it as a function that provides information about an object
  - similar to typeof(), but class() provides different info about the object (quick check below)
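As a quick check of the difference (class(df_event) is shown in the chunk below; typeof() of any data frame is "list"):
typeof(df_event) # a data frame/tibble is stored as a list of columns
#> [1] "list"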
Grouping is not retained unless you assign it
class(df_event)
#> [1] "tbl_df" "tbl" "data.frame"
df_event %>% group_by(univ_id, event_type, event_state)
df_event_grp <- df_event %>% group_by(univ_id, event_type, event_state) # using pipes
class(df_event_grp)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
2.1.2 Un-grouping an object
Use ungroup(object) to un-group grouped data
class(df_event_grp)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
df_event_grp <- ungroup(df_event_grp)
class(df_event_grp)
#> [1] "tbl_df" "tbl" "data.frame"
rm(df_event_grp)
2.1.3 Student exercise group_by()
- Group by “instnm” and get a frequency count.
  - How many rows and columns do you have? What does the number of rows mean?
- Now group by “instnm” and “event_type” and get a frequency count.
  - How many rows and columns do you have? What does the number of rows mean?
- Bonus: In the same code chunk, group by “instnm” and “event_type”, but this time filter for observations where “med_inc” is greater than 75000 and get a frequency count.
Solutions group_by()
Group by “instnm” and get a frequency count.
- How many rows and columns do you have? What does the number of rows mean?
df_event %>%
  group_by(instnm) %>%
  count()
#> # A tibble: 16 × 2
#> # Groups:   instnm [16]
#>    instnm          n
#>    <chr>       <int>
#>  1 Arkansas      994
#>  2 Bama         4258
#>  3 CU Boulder   1439
#>  4 Cinci         679
#>  5 Kansas       1014
#>  6 NC State      640
#>  7 Pitt         1225
#>  8 Rutgers      1135
#>  9 S Illinois    549
#> 10 Stony Brook   730
#> 11 UC Berkeley   879
#> 12 UC Irvine     539
#> 13 UGA           827
#> 14 UM Amherst    908
#> 15 UNL          1397
#> 16 USCC         1467
Now group by “instnm” and “event_type” and get a frequency count.
- How many rows and columns do you have? What does the number of rows mean?
df_event %>%
  group_by(instnm, event_type) %>%
  count()
#> # A tibble: 80 × 3
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type      n
#>    <chr>    <chr>       <int>
#>  1 Arkansas 2yr college    32
#>  2 Arkansas 4yr college    14
#>  3 Arkansas other         112
#>  4 Arkansas private hs    222
#>  5 Arkansas public hs     614
#>  6 Bama     2yr college   127
#>  7 Bama     4yr college   158
#>  8 Bama     other         608
#>  9 Bama     private hs    963
#> 10 Bama     public hs    2402
#> # ℹ 70 more rows
Bonus: Group by “instnm” and “event_type”, but this time filter for observations where “med_inc” is greater than 75000 and get a frequency count.
df_event %>%
  group_by(instnm, event_type) %>%
  filter(med_inc > 75000) %>%
  count()
#> # A tibble: 80 × 3
#> # Groups: instnm, event_type [80]
#> instnm event_type n
#> <chr> <chr> <int>
#> 1 Arkansas 2yr college 7
#> 2 Arkansas 4yr college 3
#> 3 Arkansas other 30
#> 4 Arkansas private hs 99
#> 5 Arkansas public hs 303
#> 6 Bama 2yr college 21
#> 7 Bama 4yr college 42
#> 8 Bama other 249
#> 9 Bama private hs 477
#> 10 Bama public hs 1478
#> # ℹ 70 more rows
2.2 summarize()
summarize() function
Description: summarize() calculates across rows, then collapses into a single row
- summarize() creates scalar vars summarizing variables of an existing data frame
- if you first group the data frame using group_by(), summarize() creates summary vars separately for each group, returning an object with one row per group
- if the data frame is not grouped, summarize() will result in one row
Syntax: summarize(.data, ...)
- .data: a data frame; omit if using summarize() after pipe %>%
- ...: Name-value pairs of summary functions, separated by commas
  - “name” will be the name of the new variable you create
  - “value” should be an expression that returns a single value, like min(x), n()
  - variable names do not need to be placed within quotes
Value (what summarize() returns/creates)
- Object of same class as .data; object will have one obs per “by group”
Useful functions (i.e., “helper functions”)
- Standalone functions called within summarize(), e.g., mean(), n()
- e.g., count function n() takes no arguments; returns number of rows in group
2.2.1 Investigate objects created by summarize()
Example: Count total number of events
- function n() from dplyr package (dplyr::n) returns the size of the current group
#?n # dplyr function n() gives current group size
summarize(df_event, num_events=n()) # without pipes
#> # A tibble: 1 × 1
#> num_events
#> <int>
#> 1 18680
df_event %>% summarize(num_events=n()) # with pipes
#> # A tibble: 1 × 1
#> num_events
#> <int>
#> 1 18680
df_event %>% summarize(num_events=n()) %>% str() # use str to see what pipe returned
#> tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#> $ num_events: int 18680
#df_event %>% summarize(num_events=n()) %>% View()
Example: What is the max value of med_inc across all events
- function max() from base package (base::max) returns max value
#?max # base R function max() returns max value
df_event %>% summarize(max_inc = max(med_inc, na.rm = TRUE))
#> # A tibble: 1 × 1
#> max_inc
#> <dbl>
#> 1 250001
df_event %>% summarize(max_inc = max(med_inc, na.rm = TRUE)) %>% str()
#> tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#> $ max_inc: num 250001
Example: Count total number of events AND max value of median income
df_event %>% summarize(
  num_events=n(),
  max_inc=max(med_inc, na.rm = TRUE)
)
#> # A tibble: 1 × 2
#> num_events max_inc
#> <int> <dbl>
#> 1 18680 250001
# show object returned by pipe
df_event %>% summarize(
  num_events=n(),
  max_inc=max(med_inc, na.rm = TRUE)
) %>% str()
#> tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
#> $ num_events: int 18680
#> $ max_inc : num 250001
#df_event %>% summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) %>% View()
Example: Count total number of events AND max value of median income
We can use assignment to keep the object created by summarize()
df_event_temp <- df_event %>%
  summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE))
df_event_temp
rm(df_event_temp)
What if we forgot na.rm = TRUE?
- then max_inc equals NA because we can’t perform the calculation on NA values
df_event %>%
  summarize(num_events=n(), max_inc=max(med_inc, na.rm = FALSE))
2.2.2 Takeaways, summarize()
- By default, objects created by summarize() are data frames that contain the variables created within summarize() and one observation [per “by group”]
- most “helper” functions (e.g., max(), mean()) have option na.rm to keep/remove missing obs before performing calculations
  - na.rm = FALSE (default); don’t remove NAs prior to calculation
    - if any obs missing, then result of calculation is NA
  - na.rm = TRUE; remove NAs prior to calculation
- Object created by summarize() not retained unless you assign it (output omitted)
event_temp <- df_event %>% summarize(num_events=n(),
                                     mean_inc=mean(med_inc, na.rm = TRUE))
event_temp
rm(event_temp)
2.2.3 Using [] operator to filter observations within summarize
Imagine we want to calculate avg. income, separately for in-state vs. out-of-state visits
- first, make sure we can identify in-state vs. out-of-state visits
df_event %>% filter(event_state == instst) %>% count() #in state
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 5425
df_event %>% filter(event_state != instst) %>% count() #out state
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 13255
Task: calculate mean income for: all events; in-state events; out-of-state events
df_event %>%
  summarize(
    avg_inc = mean(med_inc, na.rm = TRUE), # all events
    avg_inc_inst = mean(med_inc[event_state == instst], na.rm = TRUE), # in-state
    avg_inc_outst = mean(med_inc[event_state != instst], na.rm = TRUE) # out-state
  )
#> # A tibble: 1 × 3
#> avg_inc avg_inc_inst avg_inc_outst
#> <dbl> <dbl> <dbl>
#> 1 89089. 71589. 96162.
Understanding code: mean(med_inc[event_state == instst], na.rm = TRUE)
- mean() function takes an atomic vector as its first argument
- the variable med_inc is a numeric atomic vector
- event_state == instst is a condition that yields a logical atomic vector
- From Base R lecture, med_inc[event_state == instst] is:
  - “subset atomic vectors using []” approach 3: “Using logicals to return elements where corresponding logical is TRUE”
  - isolates observations of med_inc where event_state is the same as the state the university is located in (instst)
  - a short standalone illustration follows
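To make the subsetting step concrete, here is a minimal self-contained illustration with made-up vectors (income and state are hypothetical, not variables in df_event):
income <- c(50, 70, 90, 110)
state <- c("CA", "NJ", "CA", "NJ")
state == "CA" # logical vector
#> [1]  TRUE FALSE  TRUE FALSE
income[state == "CA"] # keep elements where the logical is TRUE
#> [1] 50 90
mean(income[state == "CA"]) # mean of the subset
#> [1] 70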
2.2.4 Using summarize() to create descriptive statistics table
Often helpful to use summarize() to calculate summary statistics that are the basis for a table of descriptive statistics
Task: create a table of descriptive statistics about variable med_inc
- want these measures: number of non-missing obs; mean; standard deviation
df_event %>% mutate(non_miss_inc = is.na(med_inc)==0) %>%
  summarize(
    n = sum(non_miss_inc, na.rm = TRUE), #SAMPLE SIZE all
    avg_inc = mean(med_inc, na.rm = TRUE), # MEAN
    std_inc = sd(med_inc, na.rm = TRUE) # STANDARD DEVIATION all events
  )
#> # A tibble: 1 × 3
#> n avg_inc std_inc
#> <int> <dbl> <dbl>
#> 1 18210 89089. 39267.
Task: same as above but separate measures for: all events; in-state; out-of-state
df_event %>% mutate(non_miss_inc = is.na(med_inc)==0) %>%
  summarize(
    n = sum(non_miss_inc, na.rm = TRUE), #SAMPLE SIZE
    n_inst = sum(non_miss_inc[event_state == instst], na.rm = TRUE),
    n_outst = sum(non_miss_inc[event_state != instst], na.rm = TRUE),
    avg_inc = mean(med_inc, na.rm = TRUE), # MEAN
    avg_inc_inst = mean(med_inc[event_state == instst], na.rm = TRUE),
    avg_inc_outst = mean(med_inc[event_state != instst], na.rm = TRUE),
    std_inc = sd(med_inc, na.rm = TRUE), # STANDARD DEVIATION
    std_inc_inst = sd(med_inc[event_state == instst], na.rm = TRUE),
    std_inc_outst = sd(med_inc[event_state != instst], na.rm = TRUE)
  )
#> # A tibble: 1 × 9
#> n n_inst n_outst avg_inc avg_inc_inst avg_inc_outst std_inc std_inc_inst
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 18210 5241 12969 89089. 71589. 96162. 39267. 32077.
#> # ℹ 1 more variable: std_inc_outst <dbl>
2.2.5 Student exercise summarize()
- What is the min value of med_inc across all events?
  - Hint: Use min()
- What is the mean value of fr_lunch across all events?
  - Hint: Use mean()
Solutions using summarize()
- What is the min value of med_inc across all events?
df_event %>%
  summarize(min_med_income = min(med_inc, na.rm = TRUE))
#> # A tibble: 1 × 1
#> min_med_income
#> <dbl>
#> 1 12894.
- What is the mean value of fr_lunch across all events?
  - Hint: Use mean()
df_event %>%
  summarize(mean_fr_lunch = mean(fr_lunch, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   mean_fr_lunch
#>           <dbl>
#> 1          475.
3 Combining group_by() and summarize()
Combining summarize() and group_by()
summarize() on ungrouped vs. grouped data:
- By itself, summarize() performs calculations across all rows of the data frame, then collapses the data frame to a single row
- When the data frame is grouped, summarize() performs calculations across rows within a group and then collapses to a single row for each group
Example: Count the number of events for each university
- remember: df_event has one observation per university, recruiting event
df_event %>% summarize(num_events=n())
#> # A tibble: 1 × 1
#> num_events
#> <int>
#> 1 18680
df_event %>% group_by(instnm) %>% summarize(num_events=n())
#> # A tibble: 16 × 2
#> instnm num_events
#> <chr> <int>
#> 1 Arkansas 994
#> 2 Bama 4258
#> 3 CU Boulder 1439
#> 4 Cinci 679
#> 5 Kansas 1014
#> 6 NC State 640
#> 7 Pitt 1225
#> 8 Rutgers 1135
#> 9 S Illinois 549
#> 10 Stony Brook 730
#> 11 UC Berkeley 879
#> 12 UC Irvine 539
#> 13 UGA 827
#> 14 UM Amherst 908
#> 15 UNL 1397
#> 16 USCC 1467
- Investigate the object created above
df_event %>% group_by(instnm) %>% summarize(num_events=n()) %>% str()
#> tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
#> $ instnm : chr [1:16] "Arkansas" "Bama" "CU Boulder" "Cinci" ...
#> $ num_events: int [1:16] 994 4258 1439 679 1014 640 1225 1135 549 730 ...
#df_event %>% group_by(instnm) %>% summarize(num_events=n()) %>% View()
- Or we could retain object for later use
event_by_univ <- df_event %>% group_by(instnm) %>% summarize(num_events=n())
str(event_by_univ)
#> tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
#> $ instnm : chr [1:16] "Arkansas" "Bama" "CU Boulder" "Cinci" ...
#> $ num_events: int [1:16] 994 4258 1439 679 1014 640 1225 1135 549 730 ...
# print
event_by_univ
#> # A tibble: 16 × 2
#> instnm num_events
#> <chr> <int>
#> 1 Arkansas 994
#> 2 Bama 4258
#> 3 CU Boulder 1439
#> 4 Cinci 679
#> 5 Kansas 1014
#> 6 NC State 640
#> 7 Pitt 1225
#> 8 Rutgers 1135
#> 9 S Illinois 549
#> 10 Stony Brook 730
#> 11 UC Berkeley 879
#> 12 UC Irvine 539
#> 13 UGA 827
#> 14 UM Amherst 908
#> 15 UNL 1397
#> 16 USCC 1467
rm(event_by_univ)
Task
Count number of recruiting events by institution and event_type
- within summarize(), we will take advantage of the helper function n()
  - n() “gives the current group size”; i.e., how many observations in current group
#?n
df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n())
#> # A tibble: 80 × 3
#> # Groups:   instnm [16]
#>    instnm   event_type  num_events
#>    <chr>    <chr>            <int>
#>  1 Arkansas 2yr college         32
#>  2 Arkansas 4yr college         14
#>  3 Arkansas other              112
#>  4 Arkansas private hs         222
#>  5 Arkansas public hs          614
#>  6 Bama     2yr college        127
#>  7 Bama     4yr college        158
#>  8 Bama     other              608
#>  9 Bama     private hs         963
#> 10 Bama     public hs         2402
#> # ℹ 70 more rows
#investigate object created
df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% glimpse()
#> Rows: 80
#> Columns: 3
#> Groups: instnm [16]
#> $ instnm     <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas", "Arkansas",…
#> $ event_type <chr> "2yr college", "4yr college", "other", "private hs", "publi…
#> $ num_events <int> 32, 14, 112, 222, 614, 127, 158, 608, 963, 2402, 112, 40, 7…
#df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% View()
Task
- Count number of recruiting events by institution and event_type
Note that the data frame object created by group_by() and summarize() can be the input to a graph
#bar chart of number of events, all universities combined
df_event %>% group_by(instnm, event_type) %>%
  summarize(num_events=n()) %>%
  ggplot(aes(x=event_type, y=num_events)) + # plot
  ylab("Number of events") + xlab("Event type") + geom_col()
#bar chart of number of events, separate chart for each university
df_event %>% group_by(instnm, event_type) %>%
  summarize(num_events=n()) %>%
  ggplot(aes(x=event_type, y=num_events)) + # plot
  ylab("Number of events") + xlab("Event type") + geom_col() +
  coord_flip() + facet_wrap(~ instnm)
Task. Count number of recruiting events by institution, event_type, and whether event is in- or out-of-state (var=event_inst)
- Note: in group_by(), the optional .drop argument controls whether empty groups are dropped. Default is .drop = TRUE
df_event %>% group_by(instnm, event_type, event_inst) %>%
  summarize(num_events=n())
#> # A tibble: 159 × 4
#> # Groups: instnm, event_type [80]
#> instnm event_type event_inst num_events
#> <chr> <chr> <chr> <int>
#> 1 Arkansas 2yr college In-State 16
#> 2 Arkansas 2yr college Out-State 16
#> 3 Arkansas 4yr college In-State 3
#> 4 Arkansas 4yr college Out-State 11
#> 5 Arkansas other In-State 27
#> 6 Arkansas other Out-State 85
#> 7 Arkansas private hs In-State 21
#> 8 Arkansas private hs Out-State 201
#> 9 Arkansas public hs In-State 156
#> 10 Arkansas public hs Out-State 458
#> # ℹ 149 more rows
df_event %>% group_by(instnm, event_type, event_inst, .drop = TRUE) %>%
  summarize(num_events=n())
#> # A tibble: 159 × 4
#> # Groups: instnm, event_type [80]
#> instnm event_type event_inst num_events
#> <chr> <chr> <chr> <int>
#> 1 Arkansas 2yr college In-State 16
#> 2 Arkansas 2yr college Out-State 16
#> 3 Arkansas 4yr college In-State 3
#> 4 Arkansas 4yr college Out-State 11
#> 5 Arkansas other In-State 27
#> 6 Arkansas other Out-State 85
#> 7 Arkansas private hs In-State 21
#> 8 Arkansas private hs Out-State 201
#> 9 Arkansas public hs In-State 156
#> 10 Arkansas public hs Out-State 458
#> # ℹ 149 more rows
df_event %>%
  group_by(as.factor(instnm), as.factor(event_type), as.factor(event_inst),
           .drop = FALSE) %>%
  summarize(num_events=n()) %>% arrange(num_events)
#> # A tibble: 160 × 4
#> # Groups: as.factor(instnm), as.factor(event_type) [80]
#> `as.factor(instnm)` `as.factor(event_type)` as.factor(event_inst…¹ num_events
#> <fct> <fct> <fct> <int>
#> 1 UGA 2yr college Out-State 0
#> 2 UC Berkeley 2yr college Out-State 1
#> 3 USCC 2yr college In-State 2
#> 4 Arkansas 4yr college In-State 3
#> 5 Kansas 4yr college In-State 3
#> 6 NC State 2yr college Out-State 3
#> 7 Rutgers 4yr college In-State 3
#> 8 Stony Brook 4yr college In-State 3
#> 9 UC Irvine 2yr college Out-State 3
#> 10 UGA 2yr college In-State 3
#> # ℹ 150 more rows
#> # ℹ abbreviated name: ¹`as.factor(event_inst)`
# .drop=FALSE affects only grouping columns that are coded as factors
# combinations that include non-factor grouping variables are still
# silently dropped even with .drop=FALSE.
Make a graph, showing in/out state as fill color of bar
df_event %>% group_by(instnm, event_type, event_inst) %>%
  summarize(num_events=n()) %>%
  ggplot(aes(x=event_type, y=num_events, fill = event_inst)) + # plot
  ylab("Number of events") + xlab("Event type") + geom_col() +
  coord_flip() + facet_wrap(~ instnm)
Task
- By university, event type, event_inst count the number of events and calculate the avg. pct white in the zip-code
df_event %>% group_by(instnm, event_type, event_inst) %>%
  summarize(num_events=n(),
            mean_pct_white=mean(pct_white_zip, na.rm = TRUE)
  )
#> # A tibble: 159 × 5
#> # Groups: instnm, event_type [80]
#> instnm event_type event_inst num_events mean_pct_white
#> <chr> <chr> <chr> <int> <dbl>
#> 1 Arkansas 2yr college In-State 16 68.3
#> 2 Arkansas 2yr college Out-State 16 71.4
#> 3 Arkansas 4yr college In-State 3 75.6
#> 4 Arkansas 4yr college Out-State 11 63.9
#> 5 Arkansas other In-State 27 67.4
#> 6 Arkansas other Out-State 85 58.6
#> 7 Arkansas private hs In-State 21 71.1
#> 8 Arkansas private hs Out-State 201 62.3
#> 9 Arkansas public hs In-State 156 77.6
#> 10 Arkansas public hs Out-State 458 68.7
#> # ℹ 149 more rows
#investigate object you created
df_event %>% group_by(instnm, event_type, event_inst) %>%
  summarize(num_events=n(),
            mean_pct_white=mean(pct_white_zip, na.rm = FALSE)
  ) %>% glimpse()
#> Rows: 159
#> Columns: 5
#> Groups: instnm, event_type [80]
#> $ instnm <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas", "Arkans…
#> $ event_type <chr> "2yr college", "2yr college", "4yr college", "4yr colle…
#> $ event_inst <chr> "In-State", "Out-State", "In-State", "Out-State", "In-S…
#> $ num_events <int> 16, 16, 3, 11, 27, 85, 21, 201, 156, 458, 66, 61, 15, 1…
#> $ mean_pct_white <dbl> 68.27347, NA, NA, NA, 67.39773, NA, NA, NA, NA, NA, NA,…
Recruiting events by UC Berkeley
df_event %>% filter(univ_id == 110635) %>%
  group_by(event_type) %>% summarize(num_events=n())
#> # A tibble: 5 × 2
#> event_type num_events
#> <chr> <int>
#> 1 2yr college 112
#> 2 4yr college 18
#> 3 other 138
#> 4 private hs 169
#> 5 public hs 442
Let’s create a dataset of recruiting events at UC Berkeley
event_berk <- df_event %>% filter(univ_id == 110635)

event_berk %>% count(event_type)
#> # A tibble: 5 × 2
#> event_type n
#> <chr> <int>
#> 1 2yr college 112
#> 2 4yr college 18
#> 3 other 138
#> 4 private hs 169
#> 5 public hs 442
3.1 summarize() and Counts
The count function n() takes no arguments and returns the size of the current group
event_berk %>% group_by(event_type, event_inst) %>%
  summarize(num_events=n())
#> # A tibble: 10 × 3
#> # Groups: event_type [5]
#> event_type event_inst num_events
#> <chr> <chr> <int>
#> 1 2yr college In-State 111
#> 2 2yr college Out-State 1
#> 3 4yr college In-State 14
#> 4 4yr college Out-State 4
#> 5 other In-State 49
#> 6 other Out-State 89
#> 7 private hs In-State 35
#> 8 private hs Out-State 134
#> 9 public hs In-State 259
#> 10 public hs Out-State 183
Because counts are so important, the dplyr package includes a separate count() function that can be called outside the summarize() function
event_berk %>% group_by(event_type, event_inst) %>% count()
#> # A tibble: 10 × 3
#> # Groups: event_type, event_inst [10]
#> event_type event_inst n
#> <chr> <chr> <int>
#> 1 2yr college In-State 111
#> 2 2yr college Out-State 1
#> 3 4yr college In-State 14
#> 4 4yr college Out-State 4
#> 5 other In-State 49
#> 6 other Out-State 89
#> 7 private hs In-State 35
#> 8 private hs Out-State 134
#> 9 public hs In-State 259
#> 10 public hs Out-State 183
3.1.1 summarize(): count with logical vectors and sum()
Logical vectors have values TRUE and FALSE.
- When used with numeric functions, TRUE is converted to 1 and FALSE to 0.
sum() is a numeric function that returns the sum of values
sum(c(5,10))
#> [1] 15
sum(c(TRUE,TRUE,FALSE,FALSE))
#> [1] 2
is.na() returns TRUE if a value is NA and otherwise returns FALSE
is.na(c(5,NA,4,NA))
#> [1] FALSE TRUE FALSE TRUE
sum(is.na(c(5,NA,4,NA,5)))
#> [1] 2
sum(!is.na(c(5,NA,4,NA,5)))
#> [1] 3
Application: How many missing/non-missing obs in variable [very important]
event_berk %>% group_by(event_type) %>%
  summarize(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc)),
    n_nonmiss_fr_lunch = sum(!is.na(fr_lunch))
  )
#> # A tibble: 5 × 5
#> event_type n_events n_miss_inc n_nonmiss_inc n_nonmiss_fr_lunch
#> <chr> <int> <int> <int> <int>
#> 1 2yr college 112 5 107 0
#> 2 4yr college 18 5 13 0
#> 3 other 138 8 130 0
#> 4 private hs 169 2 167 0
#> 5 public hs 442 3 439 437
3.1.2 summarize() and count() student exercise
Use one code chunk for this exercise. You could tackle this a step at a time and run the entire code chunk when you have answered all parts of this question. Create your own variable names.
- Using the event_berk object, filter observations where event_state is VA and group by event_type.
  - Using the summarize function, create a variable that represents the count for each event_type.
  - Create a variable that represents the sum of missing obs for med_inc.
  - Create a variable that represents the sum of non-missing obs for med_inc.
  - Bonus: Arrange the variable you created representing the count of each event_type in descending order.
Solutions using summarize()
and count()
- Using the event_berk object, filter observations where event_state is VA and group by event_type.
  - Using the summarize function, create a variable that represents the count for each event_type.
  - Now get the sum of missing obs for med_inc.
  - Now get the sum of non-missing obs for med_inc.
event_berk %>%
filter(event_state == "VA") %>%
group_by(event_type) %>%
summarize(
n_events = n(),
n_miss_inc = sum(is.na(med_inc)),
n_nonmiss_inc = sum(!is.na(med_inc))) %>%
arrange(desc(n_events))
#> # A tibble: 3 × 4
#> event_type n_events n_miss_inc n_nonmiss_inc
#> <chr> <int> <int> <int>
#> 1 public hs 20 0 20
#> 2 private hs 13 0 13
#> 3 other 3 0 3
3.2 summarize() and means
summarize(): means
The mean() function within summarize() calculates means, separately for each group
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  mean_inc=mean(med_inc, na.rm = TRUE),
  mean_pct_white=mean(pct_white_zip, na.rm = TRUE))
#> # A tibble: 10 × 5
#> # Groups: event_inst [2]
#> event_inst event_type n_events mean_inc mean_pct_white
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 In-State 2yr college 111 78486. 40.1
#> 2 In-State 4yr college 14 131691. 58.0
#> 3 In-State other 49 75040. 37.6
#> 4 In-State private hs 35 95229. 48.4
#> 5 In-State public hs 259 87097. 39.6
#> 6 Out-State 2yr college 1 153070. 89.7
#> 7 Out-State 4yr college 4 76913. 65.8
#> 8 Out-State other 89 69004. 56.5
#> 9 Out-State private hs 134 87654. 64.3
#> 10 Out-State public hs 183 103603. 62.0
3.2.1 summarize(): means and na.rm argument
Default behavior of “aggregation functions” (e.g., summarize())
- if input has any missing values (NA), then output will be missing.
Many functions have argument na.rm (means “remove NAs”)
- na.rm = FALSE [the default for mean()]
  - Do not remove missing values from input before calculating
  - Therefore, missing values in input will cause output to be missing
- na.rm = TRUE
  - Remove missing values from input before calculating
  - Therefore, missing values in input will not cause output to be missing
#na.rm = FALSE; the default setting
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  n_miss_inc = sum(is.na(med_inc)),
  mean_inc=mean(med_inc, na.rm = FALSE),
  n_miss_frlunch = sum(is.na(fr_lunch)),
  mean_fr_lunch=mean(fr_lunch, na.rm = FALSE))
#> # A tibble: 10 × 7
#> # Groups: event_inst [2]
#> event_inst event_type n_events n_miss_inc mean_inc n_miss_frlunch
#> <chr> <chr> <int> <int> <dbl> <int>
#> 1 In-State 2yr college 111 5 NA 111
#> 2 In-State 4yr college 14 4 NA 14
#> 3 In-State other 49 2 NA 49
#> 4 In-State private hs 35 0 95229. 35
#> 5 In-State public hs 259 3 NA 0
#> 6 Out-State 2yr college 1 0 153070. 1
#> 7 Out-State 4yr college 4 1 NA 4
#> 8 Out-State other 89 6 NA 89
#> 9 Out-State private hs 134 2 NA 134
#> 10 Out-State public hs 183 0 103603. 5
#> # ℹ 1 more variable: mean_fr_lunch <dbl>
# na.rm = TRUE
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  n_miss_inc = sum(is.na(med_inc)),
  mean_inc=mean(med_inc, na.rm = TRUE),
  n_miss_frlunch = sum(is.na(fr_lunch)),
  mean_fr_lunch=mean(fr_lunch, na.rm = TRUE))
#> # A tibble: 10 × 7
#> # Groups: event_inst [2]
#> event_inst event_type n_events n_miss_inc mean_inc n_miss_frlunch
#> <chr> <chr> <int> <int> <dbl> <int>
#> 1 In-State 2yr college 111 5 78486. 111
#> 2 In-State 4yr college 14 4 131691. 14
#> 3 In-State other 49 2 75040. 49
#> 4 In-State private hs 35 0 95229. 35
#> 5 In-State public hs 259 3 87097. 0
#> 6 Out-State 2yr college 1 0 153070. 1
#> 7 Out-State 4yr college 4 1 76913. 4
#> 8 Out-State other 89 6 69004. 89
#> 9 Out-State private hs 134 2 87654. 134
#> 10 Out-State public hs 183 0 103603. 5
#> # ℹ 1 more variable: mean_fr_lunch <dbl>
3.2.2 Student exercise
- Using the event_berk object, group by instnm, event_inst, & event_type.
  - Create vars for number non-missing for these racial/ethnic groups (pct_white_zip, pct_black_zip, pct_asian_zip, pct_hispanic_zip, pct_amerindian_zip, pct_nativehawaii_zip)
  - Create vars for mean percent for each racial/ethnic group
Solutions
event_berk %>% group_by(instnm, event_inst, event_type) %>%
summarize(
n_events=n(),
n_miss_white = sum(!is.na(pct_white_zip)),
mean_white = mean(pct_white_zip, na.rm = TRUE),
n_miss_black = sum(!is.na(pct_black_zip)),
mean_black = mean(pct_black_zip, na.rm = TRUE),
n_miss_asian = sum(!is.na(pct_asian_zip)),
mean_asian = mean(pct_asian_zip, na.rm = TRUE),
n_miss_lat = sum(!is.na(pct_hispanic_zip)),
mean_lat = mean(pct_hispanic_zip, na.rm = TRUE),
n_miss_na = sum(!is.na(pct_amerindian_zip)),
mean_na = mean(pct_amerindian_zip, na.rm = TRUE),
n_miss_nh = sum(!is.na(pct_nativehawaii_zip)),
mean_nh = mean(pct_nativehawaii_zip, na.rm = TRUE)) %>%
head(6)
#> # A tibble: 6 × 16
#> # Groups: instnm, event_inst [2]
#> instnm event_inst event_type n_events n_miss_white mean_white n_miss_black
#> <chr> <chr> <chr> <int> <int> <dbl> <int>
#> 1 UC Berkel… In-State 2yr colle… 111 106 40.1 106
#> 2 UC Berkel… In-State 4yr colle… 14 12 58.0 12
#> 3 UC Berkel… In-State other 49 48 37.6 48
#> 4 UC Berkel… In-State private hs 35 35 48.4 35
#> 5 UC Berkel… In-State public hs 259 258 39.6 258
#> 6 UC Berkel… Out-State 2yr colle… 1 1 89.7 1
#> # ℹ 9 more variables: mean_black <dbl>, n_miss_asian <int>, mean_asian <dbl>,
#> # n_miss_lat <int>, mean_lat <dbl>, n_miss_na <int>, mean_na <dbl>,
#> # n_miss_nh <int>, mean_nh <dbl>
3.3 summarize() and logical vectors, part II
Logical vectors (e.g., is.na()) are useful for counting obs that satisfy some condition
is.na(c(5,NA,4,NA))
#> [1] FALSE TRUE FALSE TRUE
typeof(is.na(c(5,NA,4,NA)))
#> [1] "logical"
sum(is.na(c(5,NA,4,NA)))
#> [1] 2
Task: Using object event_berk, calculate the following measures for each combination of event_type and event_inst:
- count of number of rows for each group
- count of rows non-missing for both pct_black_zip and pct_hispanic_zip
- count of number of visits to communities where the sum of Black and Latinx people comprises more than 50% of the total population
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  n_nonmiss_latbl = sum(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)),
  n_majority_latbl = sum(pct_black_zip + pct_hispanic_zip > 50, na.rm = TRUE)
)
#> # A tibble: 10 × 5
#> # Groups: event_inst [2]
#> event_inst event_type n_events n_nonmiss_latbl n_majority_latbl
#> <chr> <chr> <int> <int> <int>
#> 1 In-State 2yr college 111 106 30
#> 2 In-State 4yr college 14 12 0
#> 3 In-State other 49 48 15
#> 4 In-State private hs 35 35 6
#> 5 In-State public hs 259 258 88
#> 6 Out-State 2yr college 1 1 0
#> 7 Out-State 4yr college 4 3 0
#> 8 Out-State other 89 86 20
#> 9 Out-State private hs 134 132 19
#> 10 Out-State public hs 183 183 27
(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)) is a logical condition
- condition is TRUE (evaluates to 1) or FALSE (evaluates to 0) for each obs
3.3.1 summarize(): logical vectors to count proportions
Syntax: group_by(vars) %>% summarize(prop = mean(TRUE/FALSE condition))
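Why this works: mean() of a logical vector is the proportion of TRUE values, because TRUE/FALSE are converted to 1/0. A quick standalone illustration (x is a made-up vector):
x <- c(150000, 80000, 120000, 60000)
x > 100000
#> [1]  TRUE FALSE  TRUE FALSE
mean(x > 100000) # proportion of TRUEs
#> [1] 0.5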
Task: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?
Steps:
- Filter public HS visits
- group by in-state vs. out-of-state
- Create measure
event_berk %>% filter(event_type == "public hs") %>% # filter public hs visits
  group_by(event_inst) %>% # group by in-state vs. out-of-state
  summarize(
    n_events=n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # number with non-missing values of median inc
    p_incgt100k = mean(med_inc>100000, na.rm=TRUE)) # proportion of visits to $100K+ communities
#> # A tibble: 2 × 4
#> event_inst n_events n_nonmiss_inc p_incgt100k
#> <chr> <int> <int> <dbl>
#> 1 In-State 259 256 0.273
#> 2 Out-State 183 183 0.519
What if we forgot to put na.rm=TRUE in the above task?
Task: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?
event_berk %>% filter(event_type == "public hs") %>% # filter public hs visits
  group_by(event_inst) %>% # group by in-state vs. out-of-state
  summarize(
    n_events=n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # number with non-missing values of median inc
    p_incgt100k = mean(med_inc>100000, na.rm=FALSE)) # proportion of visits to $100K+ communities
#> # A tibble: 2 × 4
#> event_inst n_events n_nonmiss_inc p_incgt100k
#> <chr> <int> <int> <dbl>
#> 1 In-State 259 256 NA
#> 2 Out-State 183 183 0.519
3.3.2 summarize(): Other “helper” functions
Lots of other functions we can use within summarize()
Common functions to use with summarize():

Function | Description
---|---
n | count
n_distinct | count unique values
mean | mean
median | median
max | largest value
min | smallest value
sd | standard deviation
sum | sum of values
first | first value
last | last value
nth | nth value
any | condition true for at least one value
Note: These functions can also be used on their own or with mutate()
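For instance, two helpers from the table that are not demonstrated elsewhere in this lecture, n_distinct() and any(), work the same way inside summarize(). A sketch using event_berk (output omitted; the $200,000 cutoff is just an illustrative value):
event_berk %>%
  group_by(event_type) %>%
  summarize(
    n_states = n_distinct(event_state), # number of distinct states visited
    any_inc_gt200k = any(med_inc > 200000, na.rm = TRUE) # TRUE if any visit is in a $200K+ community
  )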
Maximum value in a group
max(c(10,50,8))
#> [1] 50
Task: For each combination of in-state/out-of-state and event type, what is the maximum value of med_inc?
event_berk %>% group_by(event_type, event_inst) %>%
  summarize(max_inc = max(med_inc)) # oops, we forgot to remove NAs!
event_berk %>% group_by(event_type, event_inst) %>%
  summarize(max_inc = max(med_inc, na.rm = TRUE))
Isolate first/last/nth observation in a group
x <- c(10,15,20,25,30)
first(x)
#> [1] 10
last(x)
#> [1] 30
nth(x,1)
#> [1] 10
nth(x,3)
#> [1] 20
nth(x,10)
#> [1] NA
Task: after sorting object event_berk by event_type and event_datetime_start, what is the value of event_date for:
- first event for each event type?
- the last event for each event type?
- the 50th event for each event type?
event_berk %>% arrange(event_type, event_datetime_start) %>%
group_by(event_type) %>%
summarize(
n_events = n(),
date_first= first(event_date),
date_last= last(event_date),
date_50th= nth(event_date, 50)
)
#> # A tibble: 5 × 5
#> event_type n_events date_first date_last date_50th
#> <chr> <int> <date> <date> <date>
#> 1 2yr college 112 2017-04-25 2017-11-16 2017-10-04
#> 2 4yr college 18 2017-04-30 2017-11-16 NA
#> 3 other 138 2017-04-11 2017-10-26 2017-09-09
#> 4 private hs 169 2017-04-23 2017-10-25 2017-09-26
#> 5 public hs 442 2017-04-14 2017-12-15 2017-09-16
3.3.3 Student exercise
Identify value of event_date for the nth event in each by group
Specific task:
- arrange (i.e., sort) by event_type and event_datetime_start, then group by event_type, and then identify the value of event_date for:
  - the first event in each by group (event_type)
  - the second event in each by group
  - the third event in each by group
  - the fourth event in each by group
  - the fifth event in each by group
Student exercise solutions
event_berk %>% arrange(event_type, event_datetime_start) %>%
group_by(event_type) %>%
summarize(
n_events = n(),
date_1st= first(event_date),
date_2nd= nth(event_date,2),
date_3rd= nth(event_date,3),
date_4th= nth(event_date,4),
date_5th= nth(event_date,5))
#> # A tibble: 5 × 7
#> event_type n_events date_1st date_2nd date_3rd date_4th date_5th
#> <chr> <int> <date> <date> <date> <date> <date>
#> 1 2yr college 112 2017-04-25 2017-09-05 2017-09-05 2017-09-06 2017-09-06
#> 2 4yr college 18 2017-04-30 2017-05-01 2017-05-06 2017-09-13 2017-09-14
#> 3 other 138 2017-04-11 2017-04-23 2017-04-25 2017-04-29 2017-05-14
#> 4 private hs 169 2017-04-23 2017-04-24 2017-04-29 2017-04-30 2017-09-05
#> 5 public hs 442 2017-04-14 2017-04-24 2017-04-26 2017-04-27 2017-04-27
3.4 count() in place of group_by() and summarise()
Recall from Section 3.1, the count() function returns the count of unique values of a particular variable. For example, if we count by event_state, we get the count of visits by state. This is the same as if we were to group_by() event_state and summarise() to get the count of events by state.
event_berk %>%
  group_by(event_state) %>%
  summarise(n_state = n())
#> # A tibble: 35 × 2
#> event_state n_state
#> <chr> <int>
#> 1 AL 2
#> 2 AR 3
#> 3 AZ 18
#> 4 CA 468
#> 5 CO 3
#> 6 CT 3
#> 7 DC 9
#> 8 FL 33
#> 9 GA 33
#> 10 HI 1
#> # ℹ 25 more rows
event_berk %>%
  count(event_state)
#> # A tibble: 35 × 2
#> event_state n
#> <chr> <int>
#> 1 AL 2
#> 2 AR 3
#> 3 AZ 18
#> 4 CA 468
#> 5 CO 3
#> 6 CT 3
#> 7 DC 9
#> 8 FL 33
#> 9 GA 33
#> 10 HI 1
#> # ℹ 25 more rows
If we want to calculate the percent of UC Berkeley’s visits across states, we can use the count() and mutate() functions instead of group_by() and summarize().
event_berk %>%
  group_by(event_state) %>% #group_by event state
  summarise(n_state = n()) %>% #summarize and get a count of the # of events by state
  mutate(pct_state = (n_state / sum(n_state)) * 100) #create percent variable of visits by UC Berkeley (e.g., 53% of visits in California)
#> # A tibble: 35 × 3
#> event_state n_state pct_state
#> <chr> <int> <dbl>
#> 1 AL 2 0.228
#> 2 AR 3 0.341
#> 3 AZ 18 2.05
#> 4 CA 468 53.2
#> 5 CO 3 0.341
#> 6 CT 3 0.341
#> 7 DC 9 1.02
#> 8 FL 33 3.75
#> 9 GA 33 3.75
#> 10 HI 1 0.114
#> # ℹ 25 more rows
event_berk %>%
  count(event_state) %>% #count of visits by UC Berkeley by state
  mutate(pct_state = (n / sum(n)) * 100) #percent of visits by UC Berkeley across states
#> # A tibble: 35 × 3
#> event_state n pct_state
#> <chr> <int> <dbl>
#> 1 AL 2 0.228
#> 2 AR 3 0.341
#> 3 AZ 18 2.05
#> 4 CA 468 53.2
#> 5 CO 3 0.341
#> 6 CT 3 0.341
#> 7 DC 9 1.02
#> 8 FL 33 3.75
#> 9 GA 33 3.75
#> 10 HI 1 0.114
#> # ℹ 25 more rows
4 Summarize across multiple columns
4.1 What are column-wise operations?
across() allows you to perform the same operation on multiple columns.
Description - across() applies the same transformation to multiple columns.
Syntax - across(.cols, .fns, .names)
Arguments
- .cols: Columns to transform
- .fns: Function to apply to each of the selected columns. Some values include:
  - A function, e.g. mean()
  - A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
  - A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~sum(is.na(.x)))
- .names: A glue specification that describes how to name the output columns. Use {.col} to stand for the selected column name and {.fn} to stand for the name of the function being applied, e.g. "{.col}_{.fn}"
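Before applying across() to the recruiting data, here is a minimal self-contained sketch on a made-up tibble (toy_df and its columns are hypothetical):
toy_df <- tibble(
  grp = c("a", "a", "b"),
  x = c(1, 2, NA),
  y = c(10, 20, 30)
)
toy_df %>%
  group_by(grp) %>%
  summarize(across(c(x, y), ~mean(.x, na.rm = TRUE), .names = "{.col}_mean"))
# returns one row per grp, with columns x_mean and y_mean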
4.2 across() affects every variable
Task:
- For U. Pittsburgh (univ_id = 215293) events at public and private high schools, calculate the mean value of med_inc and pct_white_zip for each combination of event_type and event_inst
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst,med_inc,pct_white_zip) %>%
group_by(event_type,event_inst) %>%
summarize(across(c(med_inc, pct_white_zip), mean))
#> # A tibble: 4 × 4
#> # Groups: event_type [2]
#> event_type event_inst med_inc pct_white_zip
#> <chr> <chr> <dbl> <dbl>
#> 1 private hs In-State NA 78.9
#> 2 private hs Out-State NA NA
#> 3 public hs In-State NA NA
#> 4 public hs Out-State NA NA
Try again, this time applying na.rm = TRUE
- this is an example of a purrr-style lambda, ~ mean(.x, na.rm = TRUE), “for the function calls in .fns”
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst,med_inc,pct_white_zip) %>%
group_by(event_type,event_inst) %>%
summarize(across(c(med_inc, pct_white_zip), ~mean(.x, na.rm = TRUE)))
#> # A tibble: 4 × 4
#> # Groups: event_type [2]
#> event_type event_inst med_inc pct_white_zip
#> <chr> <chr> <dbl> <dbl>
#> 1 private hs In-State 77115. 78.9
#> 2 private hs Out-State 103915. 63.3
#> 3 public hs In-State 78408. 83.0
#> 4 public hs Out-State 114212. 67.5
Task:
For U. Pittsburgh (univ_id = 215293) events at public and private high schools, calculate mean and standard deviation of med_inc and pct_white_zip for each combination of event_type and event_inst
You can create a named list of functions to supply to the second argument .fns
mean_sd <- list(
  mean = ~mean(.x, na.rm = TRUE),
  sd = ~sd(.x, na.rm = TRUE)
)
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst,med_inc,pct_white_zip) %>%
group_by(event_type,event_inst) %>%
summarize(across(c(med_inc, pct_white_zip), mean_sd))
#> # A tibble: 4 × 6
#> # Groups: event_type [2]
#> event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 private hs In-State 77115. 36559. 78.9
#> 2 private hs Out-State 103915. 44220. 63.3
#> 3 public hs In-State 78408. 25841. 83.0
#> 4 public hs Out-State 114212. 39745. 67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>
Use this syntax to control variable name suffixes:
.names = "{.col}_{.fn}"
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst,med_inc,pct_white_zip) %>%
group_by(event_type,event_inst) %>%
summarize(across(c(med_inc, pct_white_zip), mean_sd, .names = "{.col}_{.fn}"))
#> # A tibble: 4 × 6
#> # Groups: event_type [2]
#> event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 private hs In-State 77115. 36559. 78.9
#> 2 private hs Out-State 103915. 44220. 63.3
#> 3 public hs In-State 78408. 25841. 83.0
#> 4 public hs Out-State 114212. 39745. 67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>
.names = "{.fn}_{.col}"
Change the order of the name to include function first and then the column name
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst,med_inc,pct_white_zip) %>%
group_by(event_type,event_inst) %>%
summarize(across(c(med_inc, pct_white_zip), mean_sd, .names = "{.fn}_{.col}"))
#> # A tibble: 4 × 6
#> # Groups: event_type [2]
#> event_type event_inst mean_med_inc sd_med_inc mean_pct_white_zip
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 private hs In-State 77115. 36559. 78.9
#> 2 private hs Out-State 103915. 44220. 63.3
#> 3 public hs In-State 78408. 25841. 83.0
#> 4 public hs Out-State 114212. 39745. 67.5
#> # ℹ 1 more variable: sd_pct_white_zip <dbl>
Task:
- Same task as before, but now calculate mean, standard deviation, min, and max of med_inc and pct_white_zip for each combination of event_type and event_inst
desc_stat <- list(
  mean = ~mean(.x, na.rm = TRUE),
  sd = ~sd(.x, na.rm = TRUE),
  low = ~min(.x, na.rm = TRUE),
  high = ~max(.x, na.rm = TRUE)
)
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst,med_inc,pct_white_zip) %>%
group_by(event_type,event_inst) %>%
summarize(across(c(med_inc, pct_white_zip), desc_stat, .names = "{.col}_{.fn}"))
#> # A tibble: 4 × 10
#> # Groups: event_type [2]
#> event_type event_inst med_inc_mean med_inc_sd med_inc_low med_inc_high
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 private hs In-State 77115. 36559. 12894. 224163
#> 2 private hs Out-State 103915. 44220. 29630. 223556.
#> 3 public hs In-State 78408. 25841. 23168. 169036.
#> 4 public hs Out-State 114212. 39745. 21581 250001
#> # ℹ 4 more variables: pct_white_zip_mean <dbl>, pct_white_zip_sd <dbl>,
#> # pct_white_zip_low <dbl>, pct_white_zip_high <dbl>
4.2.1 across(), purrr-style lambdas ~ func_name(.x)
Syntax: across(.cols, .fns, .names)
- .fns: Function to apply to each of the selected columns. Some values include:
  - A function, e.g. mean(); a purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE); a named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~sum(is.na(.x)))
Task: Calculate mean, number of obs, and number of non-missing obs for variables
- Can exclude n() from the .cols argument in the across() function because you only need to calculate it once to get the number of observations for each group.
mean_obs_nmiss <- list(
  mean = ~mean(.x, na.rm = TRUE),
  n_non_miss = ~sum(!is.na(.x))
)
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst,med_inc,pop_total) %>%
group_by(event_type,event_inst) %>%
summarize(nrow = n(), across(c(med_inc, pop_total), mean_obs_nmiss,.names = "{.col}_{.fn}"))
#> # A tibble: 4 × 7
#> # Groups: event_type [2]
#> event_type event_inst nrow med_inc_mean med_inc_n_non_miss pop_total_mean
#> <chr> <chr> <int> <dbl> <int> <dbl>
#> 1 private hs In-State 51 77115. 50 25320.
#> 2 private hs Out-State 209 103915. 204 29771.
#> 3 public hs In-State 211 78408. 210 21922.
#> 4 public hs Out-State 559 114212. 557 30191.
#> # ℹ 1 more variable: pop_total_n_non_miss <int>
4.2.2 across() and where() affects selected variables if they meet a condition
Syntax: across(where(fn), .fns, .names)
- fn: A function that returns TRUE or FALSE
- .fns: Function to apply to each of the selected columns. Some values include:
  - A function, e.g. mean(); a purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE); a named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~sum(is.na(.x)))
Task: For U. Pittsburgh events at public and private high schools, calculate mean, min, and max of variables med_inc and event_date for each combination of event_type and event_inst
mean_min_max <- list(
  mean = ~mean(.x, na.rm = TRUE),
  low = ~min(.x, na.rm = TRUE),
  high = ~max(.x, na.rm = TRUE)
)
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst, med_inc, event_date) %>%
group_by(event_type,event_inst) %>%
summarize(across(where(is.double), mean_min_max))
#> # A tibble: 4 × 8
#> # Groups: event_type [2]
#> event_type event_inst med_inc_mean med_inc_low med_inc_high event_date_mean
#> <chr> <chr> <dbl> <dbl> <dbl> <date>
#> 1 private hs In-State 77115. 12894. 224163 2017-09-16
#> 2 private hs Out-State 103915. 29630. 223556. 2017-08-24
#> 3 public hs In-State 78408. 23168. 169036. 2017-09-19
#> 4 public hs Out-State 114212. 21581 250001 2017-08-20
#> # ℹ 2 more variables: event_date_low <date>, event_date_high <date>
- Although event_date is a Date class, it is type double, and therefore we can pass is.double to the where() function to calculate the mean, min, and max (a quick check below).
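A quick check of that claim (Sys.Date() stands in for any Date value; not from the original lecture):
typeof(Sys.Date()) # Dates are stored as doubles (days since 1970-01-01)
#> [1] "double"
is.double(Sys.Date()) # so where(is.double) selects Date columns
#> [1] TRUE
is.numeric(Sys.Date()) # but is.numeric() returns FALSE for Dates
#> [1] FALSE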
Task: For events by U. Pittsburgh at public and private high schools, calculate mean and standard deviation for numeric variables
#First, which vars are numeric
df_event %>%
select(event_type, event_inst,instnm,school_id,med_inc,pct_white_zip) %>%
glimpse()
#> Rows: 18,680
#> Columns: 6
#> $ event_type <chr> "public hs", "public hs", "public hs", "public hs", "pub…
#> $ event_inst <chr> "In-State", "In-State", "In-State", "In-State", "Out-Sta…
#> $ instnm <chr> "UM Amherst", "UM Amherst", "UM Amherst", "UM Amherst", …
#> $ school_id <chr> "250192000042", "250243000134", "250366000496", "2503660…
#> $ med_inc <dbl> 71713.5, 89121.5, 70136.5, 70136.5, 71023.5, 71023.5, 71…
#> $ pct_white_zip <dbl> 73.6803, 91.3555, 79.4345, 79.4345, 88.6526, 88.6526, 88…
df_event %>%
filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
select(event_type, event_inst,instnm,school_id,med_inc,pct_white_zip) %>%
group_by(event_type,event_inst) %>%
summarize(across(where(is.numeric), mean_sd))
#> # A tibble: 4 × 6
#> # Groups: event_type [2]
#> event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 private hs In-State 77115. 36559. 78.9
#> 2 private hs Out-State 103915. 44220. 63.3
#> 3 public hs In-State 78408. 25841. 83.0
#> 4 public hs Out-State 114212. 39745. 67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>
5 Attach aggregate measures to your data frame
We can attach aggregate measures to a data frame by using group_by without summarize()
What do I mean by “attaching aggregate measures to a data frame”?
- Calculate measures at the by_group level, but attach them to original object rather than creating an object with one row for each by_group
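In other words, group_by() followed by mutate() (instead of summarize()) adds a group-level column while keeping every original row. A minimal self-contained sketch on a made-up tibble (not the lecture data):
tibble(grp = c("a", "a", "b"), x = c(1, 3, 10)) %>%
  group_by(grp) %>%
  mutate(grp_mean = mean(x)) # still 3 rows; grp_mean repeats the group mean within each group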
Task: Using event_berk data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type
- resulting object should have same number of observations as event_berk
Steps:
- create measure of avg. income across all events without using group_by() or summarize() and assign as (new) object
- Using object from previous step, create measure of avg. income by event type using group_by() without summarize() and assign as new object
Task: Using event_berk data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type
- Create measure of average income across all events
event_berk_temp <- event_berk %>%
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, med_inc) %>% # select vars to be retained (optional)
  mutate(avg_inc = mean(med_inc, na.rm=TRUE)) # create avg. inc measure
dim(event_berk_temp)
#> [1] 879 4
event_berk_temp %>% head(5)
#> # A tibble: 5 × 4
#> event_date event_type med_inc avg_inc
#> <date> <chr> <dbl> <dbl>
#> 1 2017-04-11 other 62022. 88124.
#> 2 2017-04-14 public hs 125056. 88124.
#> 3 2017-04-23 private hs 78897 88124.
#> 4 2017-04-23 other 55127 88124.
#> 5 2017-04-24 private hs 122307 88124.
- Create measure of average income by event type
event_berk_temp <- event_berk_temp %>%
  group_by(event_type) %>% # grouping by event type
  mutate(avg_inc_type = mean(med_inc, na.rm=TRUE)) # create avg. inc measure
str(event_berk_temp)
#> gropd_df [879 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
#> $ event_date : Date[1:879], format: "2017-04-11" "2017-04-14" ...
#> $ event_type : chr [1:879] "other" "public hs" "private hs" "other" ...
#> $ med_inc : num [1:879] 62022 125056 78897 55127 122307 ...
#> $ avg_inc : num [1:879] 88124 88124 88124 88124 88124 ...
#> $ avg_inc_type: num [1:879] 71186 93978 89242 71186 89242 ...
#> - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ event_type: chr [1:5] "2yr college" "4yr college" "other" "private hs" ...
#> ..$ .rows : list<int> [1:5]
#> .. ..$ : int [1:112] 8 65 67 76 78 81 88 105 110 112 ...
#> .. ..$ : int [1:18] 15 16 17 129 149 158 312 373 462 508 ...
#> .. ..$ : int [1:138] 1 4 7 13 18 19 21 22 23 24 ...
#> .. ..$ : int [1:169] 3 5 12 14 60 62 68 70 71 72 ...
#> .. ..$ : int [1:442] 2 6 9 10 11 20 61 63 75 77 ...
#> .. ..@ ptype: int(0)
#> ..- attr(*, ".drop")= logi TRUE
event_berk_temp %>% head(5)
#> # A tibble: 5 × 5
#> # Groups: event_type [3]
#> event_date event_type med_inc avg_inc avg_inc_type
#> <date> <chr> <dbl> <dbl> <dbl>
#> 1 2017-04-11 other 62022. 88124. 71186.
#> 2 2017-04-14 public hs 125056. 88124. 93978.
#> 3 2017-04-23 private hs 78897 88124. 89242.
#> 4 2017-04-23 other 55127 88124. 71186.
#> 5 2017-04-24 private hs 122307 88124. 89242.
Task: Using event_berk_temp from the previous question, create a measure that identifies whether med_inc associated with the event is higher/lower than the average income for all events of that type
Steps:
- Create measure of average income for each event type [already done]
- Create a 0/1 indicator that identifies whether the median income at the event location is higher than the average median income for events of that type.
# average income at recruiting events across all universities
event_berk_tempv2 <- event_berk_temp %>%
  mutate(gt_avg_inc_type = med_inc > avg_inc_type) %>%
  select(-(avg_inc)) # drop avg_inc (optional)
# note how med_inc = NA is treated
event_berk_tempv2
#> # A tibble: 879 × 5
#> # Groups: event_type [5]
#> event_date event_type med_inc avg_inc_type gt_avg_inc_type
#> <date> <chr> <dbl> <dbl> <lgl>
#> 1 2017-04-11 other 62022. 71186. FALSE
#> 2 2017-04-14 public hs 125056. 93978. TRUE
#> 3 2017-04-23 private hs 78897 89242. FALSE
#> 4 2017-04-23 other 55127 71186. FALSE
#> 5 2017-04-24 private hs 122307 89242. TRUE
#> 6 2017-04-24 public hs 133966 93978. TRUE
#> 7 2017-04-25 other 114580 71186. TRUE
#> 8 2017-04-25 2yr college 103600. 79183. TRUE
#> 9 2017-04-26 public hs 38516. 93978. FALSE
#> 10 2017-04-27 public hs 49950. 93978. FALSE
#> # ℹ 869 more rows
Same as above, but this time create integer indicator rather than logical
event_berk_tempv2 <- event_berk_tempv2 %>%
  mutate(gt_avg_inc_type = as.integer(med_inc > avg_inc_type))
event_berk_tempv2 %>% head(4)
#> # A tibble: 4 × 5
#> # Groups: event_type [3]
#> event_date event_type med_inc avg_inc_type gt_avg_inc_type
#> <date> <chr> <dbl> <dbl> <int>
#> 1 2017-04-11 other 62022. 71186. 0
#> 2 2017-04-14 public hs 125056. 93978. 1
#> 3 2017-04-23 private hs 78897 89242. 0
#> 4 2017-04-23 other 55127 71186. 0
5.0.1 Student exercise
Task: is pct_white_zip at a particular event higher or lower than the average pct_white_zip for that event_type?
- Note: all events are attached to a particular zip_code
- pct_white_zip: pct of people in that zip_code who identify as white
Steps in task:
- Create measure of average pct white for each event_type
- Compare whether pct_white_zip is higher or lower than this average
Student exercise solution
Task: is pct_white_zip at a particular event higher or lower than the average pct_white_zip for that event_type?
event_berk_tempv3 <- event_berk %>%
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, pct_white_zip) %>% #optional
  group_by(event_type) %>% # grouping by event type
  mutate(avg_pct_white = mean(pct_white_zip, na.rm=TRUE),
         gt_avg_pctwhite_type = as.integer(pct_white_zip > avg_pct_white))
event_berk_tempv3 %>% head(4)
#> # A tibble: 4 × 5
#> # Groups: event_type [3]
#> event_date event_type pct_white_zip avg_pct_white gt_avg_pctwhite_type
#> <date> <chr> <dbl> <dbl> <int>
#> 1 2017-04-11 other 37.2 49.7 0
#> 2 2017-04-14 public hs 78.3 48.9 1
#> 3 2017-04-23 private hs 84.7 61.0 1
#> 4 2017-04-23 other 20.9 49.7 0