library(tidyverse)Across the tidyverse: processing across rows
1 Introduction
1.1 Libraries we will use today
“Load” the package we will use today (output omitted)
- you must run this code chunk
If package not yet installed, then must install before you load. Install in “console” rather than .Rmd file
- Generic syntax: install.packages("package_name")
- Install “tidyverse”: install.packages("tidyverse")
Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes:
- install.packages("tidyverse")
- library(tidyverse)
1.2 Data we will use today
Data on off-campus recruiting events by public universities
- Object df_event- One observation per university, recruiting event
 
rm(list = ls()) # remove all objects
#load dataset with one obs per recruiting event
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
#load("../../data/recruiting/recruit_event_allvars.Rdata")1.3 What we are doing today
Processing across variables vs. processing across observations
In today’s lecture, I’ll use the terms “observations” and “rows” interchangeably. Creation of analysis datasets often requires calculations across obs.
Examples:
- You have a dataset with one observation per student-term and want to create a variable of credits attempted per term
- You have a dataset with one observation per student-term and want to create a variable of GPA for the semester or cumulative GPA for all semesters
- Number of off-campus recruiting events a university makes to each state
- Average household income at visited versus non-visited high schools
Creating graphs and tables of descriptive stats usually require calculations across obs
Example: Want to create a graph that shows number of recruiting events by event “type” (e.g., public HS, private HS) for each university
- Start with df_eventdataset that has one observation per university, recruiting event
- Create a new data frame object that has one observation per university and event type and has a variable for the number of events
- this variable calculated by counting number of rows in each combination of university and event type
 
- This new data frame object is the input for creating desired graph
So far, we have focused on ``processing across variables’’
- Performing calculations across columns (i.e., vars), typically within a row (i.e., observation)
- Example: percent free-reduced lunch (above)
Visits by UC Berkeley to public high schools
#> # A tibble: 5 × 6
#>   school_id    state tot_stu_pub fr_lunch pct_fr_lunch med_inc
#>   <chr>        <chr>       <dbl>    <dbl>        <dbl>   <dbl>
#> 1 340882002126 NJ           1846       29       0.0157 178732 
#> 2 340147000250 NJ           1044       50       0.0479  62288 
#> 3 340561003796 NJ           1505      298       0.198  100684.
#> 4 340165005124 NJ           1900       43       0.0226 160476.
#> 5 341341003182 NJ           1519      130       0.0856 144346Processing across obs (focus of today’s lecture)
- Performing calculations across rows (i.e., obs), often within a column (i.e., variable)
- Example: Average household income of visited high schools, by state
Strategy for teaching processing across obs
In tidyverse the group_by() and summarize() functions are the primary means of performing calculations across observations
- Usually, processing across observations requires using group_by()andsummarize()together
- group_by()typically not very useful by itself
- summarize()[with or without- group_by()] can be helpful for creating summary statistics that are the inputs for tables or graphs you create
How we’ll teach:
- introduce group_by()andsummarize()separately- goal: you understand what each function does
 
- then we’ll combine them
2 group_by() and summarize()
2.1 group_by
Description: “group_by() takes an existing data frame and converts it into a grouped data frame where operations are performed”by group”. ungroup() removes grouping.”
- part of dplyr package within tidyverse; not part of Base R
- works best with pipes %>%andsummarize()function [described below]
Basic syntax: group_by(.data, ...)
- .dataargument refers to name of data frame
- ...argument refers to names of “group_by” variables, separated by commas- Can “group by” one or many variables
- Typically, “group_by” variables are character, factor, or integer variables
 
Possible “group by” variables in df_event data:
- university name/id; event type (e.g., public HS, private HS); state
Example: in df_event, create frequency count of event_type [output omitted]
names(df_event)
#without group_by()
df_event %>% count(event_type)
df_event %>% count(instnm)
#group_by() university
df_event %>% group_by(instnm) %>% count(event_type)By itself group_by() doesn’t do much; it just prints data
- Below, group df_eventdata by university, event type, and event state
#group_by (without pipes)
group_by(df_event, univ_id, event_type, event_state)
#group_by (with pipes)
df_event %>% group_by(univ_id, event_type, event_state) 
df_event %>% group_by(univ_id, event_type, event_state) %>% glimpse()But once an object is grouped, all subsequent functions are run separately “by group”
- recall that count()counts number of observations by group
# count number of observations in group, ungrouped data
df_event  %>% count()
#group by and then count obs
df_event %>% group_by(univ_id) %>% count()
df_event %>% group_by(univ_id) %>% count() %>% glimpse()
  #df_event %>% group_by(univ_id) %>% count() %>% View()
df_event %>% group_by(univ_id, event_type) %>% count()
df_event %>% group_by(univ_id, event_type) %>% count() %>% glimpse()
  #df_event %>% group_by(univ_id, event_type) %>% count() %>% View()2.1.1 Grouping not retained unless you assign it
Below, we’ll use class() function to show whether data frame is grouped
- will talk more about class()next week, but for now, just think of it as a function that provides information about an object
- similar to typeof(), butclass()provides different info about object
Grouping is not retained unless you assign it
class(df_event)
#> [1] "tbl_df"     "tbl"        "data.frame"df_event %>% group_by(univ_id, event_type, event_state)
df_event_grp <- df_event %>% group_by(univ_id, event_type, event_state) # using pipesclass(df_event_grp)
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"2.1.2 Un-grouping an object
Use ungroup(object) to un-group grouped data
class(df_event_grp)
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
df_event_grp <- ungroup(df_event_grp)
class(df_event_grp)
#> [1] "tbl_df"     "tbl"        "data.frame"
rm(df_event_grp)2.1.3 Student exercise group_by()
- Group by “instnm” and get a frequency count.
- How many rows and columns do you have? What do the number of rows mean?
 
 
- How many rows and columns do you have? What do the number of rows mean?
- Now group by “instnm” and “event_type” and get a frequency count.
- How many rows and columns do you have? What do the number of rows mean?
 
- Bonus: In the same code chunk, group by “instnm” and “event_type”, but this time filter for observations where “med_inc” is greater than 75000 and get a frequency count.
Solutions group_by()
- Group by “instnm” and get a frequency count. - How many rows and columns do you have? What do the number of rows mean?
 
 - df_event %>% group_by(instnm) %>% count() #> # A tibble: 16 × 2 #> # Groups: instnm [16] #> instnm n #> <chr> <int> #> 1 Arkansas 994 #> 2 Bama 4258 #> 3 CU Boulder 1439 #> 4 Cinci 679 #> 5 Kansas 1014 #> 6 NC State 640 #> 7 Pitt 1225 #> 8 Rutgers 1135 #> 9 S Illinois 549 #> 10 Stony Brook 730 #> 11 UC Berkeley 879 #> 12 UC Irvine 539 #> 13 UGA 827 #> 14 UM Amherst 908 #> 15 UNL 1397 #> 16 USCC 1467
- How many rows and columns do you have? What do the number of rows mean?
- Now group by “instnm” and “event_type” and get a frequency count. - How many rows and columns do you have? What do the number of rows mean?
 - df_event %>% group_by(instnm, event_type) %>% count() #> # A tibble: 80 × 3 #> # Groups: instnm, event_type [80] #> instnm event_type n #> <chr> <chr> <int> #> 1 Arkansas 2yr college 32 #> 2 Arkansas 4yr college 14 #> 3 Arkansas other 112 #> 4 Arkansas private hs 222 #> 5 Arkansas public hs 614 #> 6 Bama 2yr college 127 #> 7 Bama 4yr college 158 #> 8 Bama other 608 #> 9 Bama private hs 963 #> 10 Bama public hs 2402 #> # ℹ 70 more rows
- Bonus: Group by “instnm” and “event_type”, but this time filter for observations where “med_inc” is greater than 75000 and get a frequency count. 
df_event %>%
  group_by(instnm, event_type) %>%
  filter(med_inc > 75000) %>%
  count()
#> # A tibble: 80 × 3
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type      n
#>    <chr>    <chr>       <int>
#>  1 Arkansas 2yr college     7
#>  2 Arkansas 4yr college     3
#>  3 Arkansas other          30
#>  4 Arkansas private hs     99
#>  5 Arkansas public hs     303
#>  6 Bama     2yr college    21
#>  7 Bama     4yr college    42
#>  8 Bama     other         249
#>  9 Bama     private hs    477
#> 10 Bama     public hs    1478
#> # ℹ 70 more rows2.2 summarize()
summarize() function
Description: summarize() calculates across rows; then collapses into single row
- summarize()create scalar vars summarizing variables of existing data frame
- if you first group data frame using group_by(),summarize()creates summary vars separately for each group, returning object with one row per group
- if data frame not grouped, summarize()will result in one row.
Syntax: summarize(.data, ...)
- .data: a data frame; omit if using- summarize()after pipe- %>%
- ...: Name-value pairs of summary functions separated by commas- “name” will be the name of new variable you will create
- “value” should be expression that returns a single value like min(x),n()
- variable names do not need to be placed within quotes
 
Value (what summarize() returns/creates)
- Object of same class as .data.; object will have one obs per “by group”
Useful functions (i.e., “helper functions”)
- Standalone functions called within summarize(), e.g.,mean(),n()
- e.g., count function n()takes no arguments; returns number of rows in group
2.2.1 Investigate objects created by summarize()
Example: Count total number of events
- function n()fromdplyrpackage (dplyr::n) returns the size of the current group
#?n # dplyr function n() gives current group size
summarize(df_event, num_events=n()) # without pipes
#> # A tibble: 1 × 1
#>   num_events
#>        <int>
#> 1      18680
df_event %>% summarize(num_events=n()) # with pipes
#> # A tibble: 1 × 1
#>   num_events
#>        <int>
#> 1      18680
df_event %>% summarize(num_events=n()) %>% str() # use str to see what pipe returned
#> tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ num_events: int 18680
  #df_event %>% summarize(num_events=n()) %>% View()Example: What is max value of med_inc across all events
- function max()frombasepackage (base::max) returns max value
#?max # base R function max() returns max value
df_event %>% summarize(max_inc = max(med_inc, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   max_inc
#>     <dbl>
#> 1  250001
df_event %>% summarize(max_inc = max(med_inc, na.rm = TRUE)) %>% str()
#> tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ max_inc: num 250001Example: Count total number of events AND max value of median income
df_event %>% summarize(
    num_events=n(), 
    max_inc=max(med_inc, na.rm = TRUE)
  )
#> # A tibble: 1 × 2
#>   num_events max_inc
#>        <int>   <dbl>
#> 1      18680  250001
# show object returned by pipe
df_event %>% summarize(
    num_events=n(),
    max_inc=max(med_inc, na.rm = TRUE)
  ) %>% str()
#> tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ num_events: int 18680
#>  $ max_inc   : num 250001
#df_event %>% summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) %>% View()Example: Count total number of events AND max value of median income
We can use assignment to keep the object created by summarize()
df_event_temp <- df_event %>% 
  summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE))
df_event_temp
rm(df_event_temp)What if we forgot na.rm = TRUE?
- then max_incequalsNAcuz can’t perform calculation onNAvalues
df_event %>% 
  summarize(num_events=n(),max_inc=max(med_inc, na.rm = FALSE))2.2.2 Takeaways, summarize()
- By default, objects created by summarize()are data frames that contain variables created withinsummarize()and one observation [per “by group”]
- most “helper” functions (e.g., max(),mean()have optionna.rmto keep/remove missing obs before performing calculations)- na.rm = FALSE(default); don’t remove- NAsprior to calculation- if any obs missing, then result of calculation is NA
 
- if any obs missing, then result of calculation is 
- na.rm = TRUE(default); remove- NAsprior to calculation
 
- Object created by summarize()not retained unless you assign it (output omitted)
event_temp <- df_event %>% summarize(num_events=n(), 
  mean_inc=mean(med_inc, na.rm = TRUE))
event_temp
rm(event_temp)2.2.3 Using [] operator to filter observations within summarize
Imagine we want to calculate avg. income, separately for in-state vs. out-of-state visits
- first, make sure we can identify in vs. out-state (output omitted)
df_event %>% filter(event_state == instst) %>% count() #in state
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1  5425
df_event %>% filter(event_state != instst) %>% count() #out state
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1 13255Task: calculate mean income for: all events; in-state events; out-of-state events
df_event %>%
  summarize(
    avg_inc = mean(med_inc, na.rm = TRUE), # all events
    avg_inc_inst = mean(med_inc[event_state == instst], na.rm = TRUE), # in-state
    avg_inc_outst = mean(med_inc[event_state != instst], na.rm = TRUE) # out-state
  )
#> # A tibble: 1 × 3
#>   avg_inc avg_inc_inst avg_inc_outst
#>     <dbl>        <dbl>         <dbl>
#> 1  89089.       71589.        96162.Understanding code: mean(med_inc[event_state == instst], na.rm = TRUE)
- mean()function takes atomic vector as first argument
- the variable med_incis an numeric atomic vector
- event_state == inststis condition that yields a logical atomic vector
- From Base R lecture, med_inc[event_state == instst]is:- “subset atomic vectors using []” approach 3: “Using logicals to return elements where corresponding logical isTRUE”
- isolates observations of med_incwhereevent_stateis same as state the university is located in (instst)
 
- “subset atomic vectors using 
2.2.4 Using summarize() to create descriptive statistics table
Often helpful to use summarize() to calculate summary statistics that are the basis for a table of descriptive statistics
Task: create a table of descriptive statistics about variable med_inc
- want these measures: number of non-missing obs; mean; standard deviation
df_event %>% mutate(non_miss_inc = is.na(med_inc)==0) %>%
  summarize(
    n = sum(non_miss_inc, na.rm = TRUE), #SAMPLE SIZE all
    avg_inc = mean(med_inc, na.rm = TRUE), # MEAN
    std_inc = sd(med_inc, na.rm = TRUE) # STANDARD DEVIATION all events
  )
#> # A tibble: 1 × 3
#>       n avg_inc std_inc
#>   <int>   <dbl>   <dbl>
#> 1 18210  89089.  39267.Task: same as above but separate measures for: all events; in-state; out-of-state
df_event %>% mutate(non_miss_inc = is.na(med_inc)==0) %>%
  summarize(
    n = sum(non_miss_inc, na.rm = TRUE), #SAMPLE SIZE 
    n_inst = sum(non_miss_inc[event_state == instst], na.rm = TRUE), 
    n_outst = sum(non_miss_inc[event_state != instst], na.rm = TRUE), 
    avg_inc = mean(med_inc, na.rm = TRUE), # MEAN
    avg_inc_inst = mean(med_inc[event_state == instst], na.rm = TRUE), 
    avg_inc_outst = mean(med_inc[event_state != instst], na.rm = TRUE),
    std_inc = sd(med_inc, na.rm = TRUE), # STANDARD DEVIATION 
    std_inc_inst = sd(med_inc[event_state == instst], na.rm = TRUE), 
    std_inc_outst = sd(med_inc[event_state != instst], na.rm = TRUE) 
  )
#> # A tibble: 1 × 9
#>       n n_inst n_outst avg_inc avg_inc_inst avg_inc_outst std_inc std_inc_inst
#>   <int>  <int>   <int>   <dbl>        <dbl>         <dbl>   <dbl>        <dbl>
#> 1 18210   5241   12969  89089.       71589.        96162.  39267.       32077.
#> # ℹ 1 more variable: std_inc_outst <dbl>2.2.5 Student exercise summarize()
- What is the min value of med_incacross all events?- Hint: Use min()
 
- What is the mean value of fr_lunchacross all events?- Hint: Use mean()
 
Solutions using summarize()
- What is min value of med_incacross all events?
df_event %>%
  summarize(min_med_income = min(med_inc, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   min_med_income
#>            <dbl>
#> 1         12894.- What is the mean value of - fr_lunchacross all events?- Hint: Use mean()
 - df_event %>% summarize(mean_fr_lunch = mean(fr_lunch, na.rm = TRUE)) #> # A tibble: 1 × 1 #> mean_fr_lunch #> <dbl> #> 1 475.
3 Combining group_by() and summarize()
Combining summarize() and group_by
summarize() on ungrouped vs. grouped data:
- By itself, summarize()performs calculations across all rows of data frame then collapses the data frame to a single row
- When data frame is grouped, summarize()performs calculations across rows within a group and then collapses to a single row for each group
Example: Count the number of events for each university
- remember: df_eventhas one observation per university, recruiting event
df_event %>% summarize(num_events=n())
#> # A tibble: 1 × 1
#>   num_events
#>        <int>
#> 1      18680
df_event %>% group_by(instnm) %>% summarize(num_events=n())
#> # A tibble: 16 × 2
#>    instnm      num_events
#>    <chr>            <int>
#>  1 Arkansas           994
#>  2 Bama              4258
#>  3 CU Boulder        1439
#>  4 Cinci              679
#>  5 Kansas            1014
#>  6 NC State           640
#>  7 Pitt              1225
#>  8 Rutgers           1135
#>  9 S Illinois         549
#> 10 Stony Brook        730
#> 11 UC Berkeley        879
#> 12 UC Irvine          539
#> 13 UGA                827
#> 14 UM Amherst         908
#> 15 UNL               1397
#> 16 USCC              1467- Investigate the object created above
df_event %>% group_by(instnm) %>% summarize(num_events=n()) %>% str()
#> tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ instnm    : chr [1:16] "Arkansas" "Bama" "CU Boulder" "Cinci" ...
#>  $ num_events: int [1:16] 994 4258 1439 679 1014 640 1225 1135 549 730 ...
  #df_event %>% group_by(instnm) %>% summarize(num_events=n()) %>% View()- Or we could retain object for later use
event_by_univ <- df_event %>% group_by(instnm) %>% summarize(num_events=n())
str(event_by_univ)
#> tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ instnm    : chr [1:16] "Arkansas" "Bama" "CU Boulder" "Cinci" ...
#>  $ num_events: int [1:16] 994 4258 1439 679 1014 640 1225 1135 549 730 ...
event_by_univ # print
#> # A tibble: 16 × 2
#>    instnm      num_events
#>    <chr>            <int>
#>  1 Arkansas           994
#>  2 Bama              4258
#>  3 CU Boulder        1439
#>  4 Cinci              679
#>  5 Kansas            1014
#>  6 NC State           640
#>  7 Pitt              1225
#>  8 Rutgers           1135
#>  9 S Illinois         549
#> 10 Stony Brook        730
#> 11 UC Berkeley        879
#> 12 UC Irvine          539
#> 13 UGA                827
#> 14 UM Amherst         908
#> 15 UNL               1397
#> 16 USCC              1467
rm(event_by_univ)Task
- Count number of recruiting events by institution and event_type - within summarize(), we will take advantage of the helper functionn()
- n()“gives the current group size”; i.e., how many observations in current group
 - #?n df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) #> # A tibble: 80 × 3 #> # Groups: instnm [16] #> instnm event_type num_events #> <chr> <chr> <int> #> 1 Arkansas 2yr college 32 #> 2 Arkansas 4yr college 14 #> 3 Arkansas other 112 #> 4 Arkansas private hs 222 #> 5 Arkansas public hs 614 #> 6 Bama 2yr college 127 #> 7 Bama 4yr college 158 #> 8 Bama other 608 #> 9 Bama private hs 963 #> 10 Bama public hs 2402 #> # ℹ 70 more rows #investigate object created df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% glimpse() #> Rows: 80 #> Columns: 3 #> Groups: instnm [16] #> $ instnm <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas", "Arkansas",… #> $ event_type <chr> "2yr college", "4yr college", "other", "private hs", "publi… #> $ num_events <int> 32, 14, 112, 222, 614, 127, 158, 608, 963, 2402, 112, 40, 7… #df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% View()
- within 
Task
- Count number of recruiting events by institution and event_type
Note that data frame object created by group_by() and summarize() can be input to graph
#bar chart of number of events, all universities combined
df_event %>% group_by(instnm, event_type) %>% 
  summarize(num_events=n()) %>%
    ggplot(aes(x=event_type, y=num_events)) +  # plot
    ylab("Number of events") + xlab("Event type") +geom_col()
#bar chart of number of events, separete chart for each university
df_event %>% group_by(instnm, event_type) %>% 
  summarize(num_events=n()) %>%
    ggplot(aes(x=event_type, y=num_events)) +  # plot
    ylab("Number of events") + xlab("Event type") + geom_col() +
    coord_flip() + facet_wrap(~ instnm) Task. Count number of recruiting events by institution, event_type, and whether event is in- or out-of-state (var=event_inst)
- Note: in group_by(), the optionaldropargument controls whether empty groups dropped. default isdrop = TRUE
df_event %>% group_by(instnm, event_type, event_inst) %>% 
  summarize(num_events=n())
#> # A tibble: 159 × 4
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type  event_inst num_events
#>    <chr>    <chr>       <chr>           <int>
#>  1 Arkansas 2yr college In-State           16
#>  2 Arkansas 2yr college Out-State          16
#>  3 Arkansas 4yr college In-State            3
#>  4 Arkansas 4yr college Out-State          11
#>  5 Arkansas other       In-State           27
#>  6 Arkansas other       Out-State          85
#>  7 Arkansas private hs  In-State           21
#>  8 Arkansas private hs  Out-State         201
#>  9 Arkansas public hs   In-State          156
#> 10 Arkansas public hs   Out-State         458
#> # ℹ 149 more rows
df_event %>% group_by(instnm, event_type, event_inst, .drop = TRUE) %>% 
  summarize(num_events=n()) 
#> # A tibble: 159 × 4
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type  event_inst num_events
#>    <chr>    <chr>       <chr>           <int>
#>  1 Arkansas 2yr college In-State           16
#>  2 Arkansas 2yr college Out-State          16
#>  3 Arkansas 4yr college In-State            3
#>  4 Arkansas 4yr college Out-State          11
#>  5 Arkansas other       In-State           27
#>  6 Arkansas other       Out-State          85
#>  7 Arkansas private hs  In-State           21
#>  8 Arkansas private hs  Out-State         201
#>  9 Arkansas public hs   In-State          156
#> 10 Arkansas public hs   Out-State         458
#> # ℹ 149 more rows
df_event %>% 
  group_by(as.factor(instnm), as.factor(event_type), as.factor(event_inst),
           .drop = FALSE) %>% summarize(num_events=n()) %>% arrange(num_events)
#> # A tibble: 160 × 4
#> # Groups:   as.factor(instnm), as.factor(event_type) [80]
#>    `as.factor(instnm)` `as.factor(event_type)` as.factor(event_inst…¹ num_events
#>    <fct>               <fct>                   <fct>                       <int>
#>  1 UGA                 2yr college             Out-State                       0
#>  2 UC Berkeley         2yr college             Out-State                       1
#>  3 USCC                2yr college             In-State                        2
#>  4 Arkansas            4yr college             In-State                        3
#>  5 Kansas              4yr college             In-State                        3
#>  6 NC State            2yr college             Out-State                       3
#>  7 Rutgers             4yr college             In-State                        3
#>  8 Stony Brook         4yr college             In-State                        3
#>  9 UC Irvine           2yr college             Out-State                       3
#> 10 UGA                 2yr college             In-State                        3
#> # ℹ 150 more rows
#> # ℹ abbreviated name: ¹`as.factor(event_inst)`
# .drop=FALSE affects only grouping columns that are coded as factors
# combinations that include non-factor grouping variables are still 
# silently dropped even with .drop=FALSE.Make a graph, showing in/out state as fill color of bar
df_event %>% group_by(instnm, event_type, event_inst) %>% 
  summarize(num_events=n()) %>%
    ggplot(aes(x=event_type, y=num_events, fill = event_inst)) +  # plot
    ylab("Number of events") + xlab("Event type") + geom_col() +
    coord_flip() + facet_wrap(~ instnm) Task
- By university, event type, event_inst count the number of events and calculate the avg. pct white in the zip-code
df_event %>% group_by(instnm, event_type, event_inst) %>% 
  summarize(num_events=n(),
    mean_pct_white=mean(pct_white_zip, na.rm = TRUE)
  )
#> # A tibble: 159 × 5
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type  event_inst num_events mean_pct_white
#>    <chr>    <chr>       <chr>           <int>          <dbl>
#>  1 Arkansas 2yr college In-State           16           68.3
#>  2 Arkansas 2yr college Out-State          16           71.4
#>  3 Arkansas 4yr college In-State            3           75.6
#>  4 Arkansas 4yr college Out-State          11           63.9
#>  5 Arkansas other       In-State           27           67.4
#>  6 Arkansas other       Out-State          85           58.6
#>  7 Arkansas private hs  In-State           21           71.1
#>  8 Arkansas private hs  Out-State         201           62.3
#>  9 Arkansas public hs   In-State          156           77.6
#> 10 Arkansas public hs   Out-State         458           68.7
#> # ℹ 149 more rows
#investigate object you created
df_event %>% group_by(instnm, event_type, event_inst) %>% 
  summarize(num_events=n(),
    mean_pct_white=mean(pct_white_zip, na.rm = FALSE)
  ) %>% glimpse()
#> Rows: 159
#> Columns: 5
#> Groups: instnm, event_type [80]
#> $ instnm         <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas", "Arkans…
#> $ event_type     <chr> "2yr college", "2yr college", "4yr college", "4yr colle…
#> $ event_inst     <chr> "In-State", "Out-State", "In-State", "Out-State", "In-S…
#> $ num_events     <int> 16, 16, 3, 11, 27, 85, 21, 201, 156, 458, 66, 61, 15, 1…
#> $ mean_pct_white <dbl> 68.27347, NA, NA, NA, 67.39773, NA, NA, NA, NA, NA, NA,…Recruiting events by UC Berkeley
df_event %>% filter(univ_id == 110635) %>% 
  group_by(event_type) %>% summarize(num_events=n())
#> # A tibble: 5 × 2
#>   event_type  num_events
#>   <chr>            <int>
#> 1 2yr college        112
#> 2 4yr college         18
#> 3 other              138
#> 4 private hs         169
#> 5 public hs          442Let’s create a dataset of recruiting events at UC Berkeley
event_berk <- df_event %>% filter(univ_id == 110635)
event_berk %>% count(event_type)
#> # A tibble: 5 × 2
#>   event_type      n
#>   <chr>       <int>
#> 1 2yr college   112
#> 2 4yr college    18
#> 3 other         138
#> 4 private hs    169
#> 5 public hs     4423.1 summarize() and Counts
The count function n() takes no arguments and returns the size of the current group
event_berk %>% group_by(event_type, event_inst) %>% 
  summarize(num_events=n())
#> # A tibble: 10 × 3
#> # Groups:   event_type [5]
#>    event_type  event_inst num_events
#>    <chr>       <chr>           <int>
#>  1 2yr college In-State          111
#>  2 2yr college Out-State           1
#>  3 4yr college In-State           14
#>  4 4yr college Out-State           4
#>  5 other       In-State           49
#>  6 other       Out-State          89
#>  7 private hs  In-State           35
#>  8 private hs  Out-State         134
#>  9 public hs   In-State          259
#> 10 public hs   Out-State         183Because counts are so important, dplyr package includes separate count() function that can be called outside summarize() function
event_berk %>% group_by(event_type, event_inst) %>% count()
#> # A tibble: 10 × 3
#> # Groups:   event_type, event_inst [10]
#>    event_type  event_inst     n
#>    <chr>       <chr>      <int>
#>  1 2yr college In-State     111
#>  2 2yr college Out-State      1
#>  3 4yr college In-State      14
#>  4 4yr college Out-State      4
#>  5 other       In-State      49
#>  6 other       Out-State     89
#>  7 private hs  In-State      35
#>  8 private hs  Out-State    134
#>  9 public hs   In-State     259
#> 10 public hs   Out-State    1833.1.1 summarize(): count with logical vectors and sum()
Logical vectors have values TRUE and FALSE.
- When used with numeric functions, TRUEconverted to 1 andFALSEto 0.
sum() is a numeric function that returns the sum of values
sum(c(5,10))
#> [1] 15
sum(c(TRUE,TRUE,FALSE,FALSE))
#> [1] 2is.na() returns TRUE if value is NA and otherwise returns FALSE
is.na(c(5,NA,4,NA))
#> [1] FALSE  TRUE FALSE  TRUE
sum(is.na(c(5,NA,4,NA,5)))
#> [1] 2
sum(!is.na(c(5,NA,4,NA,5)))
#> [1] 3Application: How many missing/non-missing obs in variable [very important]
event_berk %>% group_by(event_type) %>% 
  summarize(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc)),
    n_nonmiss_fr_lunch = sum(!is.na(fr_lunch))
  )
#> # A tibble: 5 × 5
#>   event_type  n_events n_miss_inc n_nonmiss_inc n_nonmiss_fr_lunch
#>   <chr>          <int>      <int>         <int>              <int>
#> 1 2yr college      112          5           107                  0
#> 2 4yr college       18          5            13                  0
#> 3 other            138          8           130                  0
#> 4 private hs       169          2           167                  0
#> 5 public hs        442          3           439                4373.1.2 summarize() and count() student exercise
Use one code chunk for this exercise. You could tackle this a step at a time and run the entire code chunk when you have answered all parts of this question. Create your own variable names.
- Using the event_berkobject, filter observations whereevent_stateis VA and group byevent_type.- Using the summarize function to create a variable that represents the count for each event_type.
 
- Create a variable that represents the sum of missing obs for med_inc.
 
- Create a variable that represents the sum of non-missing obs for med_inc.
 
- Bonus: Arrange variable you created representing the count of each event_typein descending order.
 
- Using the summarize function to create a variable that represents the count for each 
Solutions using summarize() and count()
- Using the event_berkobject filter observations whereevent_stateis VA and group byevent_type.- Using the summarize function, create a variable that represents the count for each event_type.
 
- Now get the sum of missing obs for med_inc.
 
- Now get the sum of non-missing obs for med_inc.
 
- Using the summarize function, create a variable that represents the count for each 
event_berk %>%
  filter(event_state == "VA") %>%
  group_by(event_type) %>%
  summarize(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc))) %>%
  arrange(desc(n_events))
#> # A tibble: 3 × 4
#>   event_type n_events n_miss_inc n_nonmiss_inc
#>   <chr>         <int>      <int>         <int>
#> 1 public hs        20          0            20
#> 2 private hs       13          0            13
#> 3 other             3          0             33.2 summarize() and means
summarize(): means
The mean() function within summarize() calculates means, separately for each group
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  mean_inc=mean(med_inc, na.rm = TRUE),
  mean_pct_white=mean(pct_white_zip, na.rm = TRUE))
#> # A tibble: 10 × 5
#> # Groups:   event_inst [2]
#>    event_inst event_type  n_events mean_inc mean_pct_white
#>    <chr>      <chr>          <int>    <dbl>          <dbl>
#>  1 In-State   2yr college      111   78486.           40.1
#>  2 In-State   4yr college       14  131691.           58.0
#>  3 In-State   other             49   75040.           37.6
#>  4 In-State   private hs        35   95229.           48.4
#>  5 In-State   public hs        259   87097.           39.6
#>  6 Out-State  2yr college        1  153070.           89.7
#>  7 Out-State  4yr college        4   76913.           65.8
#>  8 Out-State  other             89   69004.           56.5
#>  9 Out-State  private hs       134   87654.           64.3
#> 10 Out-State  public hs        183  103603.           62.03.2.1 summarize(): means and na.rm argument
Default behavior of “aggregation functions” (e.g., summarize())
- if input has any missing values (NA), than output will be missing.
Many functions have argument na.rm (means “remove NAs”)
- na.rm = FALSE[the default for- mean()]- Do not remove missing values from input before calculating
- Therefore, missing values in input will cause output to be missing
 
- na.rm = TRUE- Remove missing values from input before calculating
- Therefore, missing values in input will not cause output to be missing
 
#na.rm = FALSE; the default setting
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  n_miss_inc = sum(is.na(med_inc)),
  mean_inc=mean(med_inc, na.rm = FALSE),
  n_miss_frlunch = sum(is.na(fr_lunch)),
  mean_fr_lunch=mean(fr_lunch, na.rm = FALSE))
#> # A tibble: 10 × 7
#> # Groups:   event_inst [2]
#>    event_inst event_type  n_events n_miss_inc mean_inc n_miss_frlunch
#>    <chr>      <chr>          <int>      <int>    <dbl>          <int>
#>  1 In-State   2yr college      111          5      NA             111
#>  2 In-State   4yr college       14          4      NA              14
#>  3 In-State   other             49          2      NA              49
#>  4 In-State   private hs        35          0   95229.             35
#>  5 In-State   public hs        259          3      NA               0
#>  6 Out-State  2yr college        1          0  153070.              1
#>  7 Out-State  4yr college        4          1      NA               4
#>  8 Out-State  other             89          6      NA              89
#>  9 Out-State  private hs       134          2      NA             134
#> 10 Out-State  public hs        183          0  103603.              5
#> # ℹ 1 more variable: mean_fr_lunch <dbl>
# na.rm = TRUE
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  n_miss_inc = sum(is.na(med_inc)),
  mean_inc=mean(med_inc, na.rm = TRUE),
  n_miss_frlunch = sum(is.na(fr_lunch)),
  mean_fr_lunch=mean(fr_lunch, na.rm = TRUE))
#> # A tibble: 10 × 7
#> # Groups:   event_inst [2]
#>    event_inst event_type  n_events n_miss_inc mean_inc n_miss_frlunch
#>    <chr>      <chr>          <int>      <int>    <dbl>          <int>
#>  1 In-State   2yr college      111          5   78486.            111
#>  2 In-State   4yr college       14          4  131691.             14
#>  3 In-State   other             49          2   75040.             49
#>  4 In-State   private hs        35          0   95229.             35
#>  5 In-State   public hs        259          3   87097.              0
#>  6 Out-State  2yr college        1          0  153070.              1
#>  7 Out-State  4yr college        4          1   76913.              4
#>  8 Out-State  other             89          6   69004.             89
#>  9 Out-State  private hs       134          2   87654.            134
#> 10 Out-State  public hs        183          0  103603.              5
#> # ℹ 1 more variable: mean_fr_lunch <dbl>3.2.2 Student exercise
- Using the event_berkobject, group byinstnm,event_inst, &event_type.- Create vars for number non_missing for these racial/ethnic groups (pct_white_zip,pct_black_zip,pct_asian_zip,pct_hispanic_zip,pct_amerindian_zip,pct_nativehawaii_zip)
 
- Create vars for mean percent for each racial/ethnic group
 
- Create vars for number non_missing for these racial/ethnic groups (
Solutions
event_berk %>% group_by(instnm, event_inst, event_type) %>%
  summarize(
  n_events=n(),
  n_miss_white = sum(!is.na(pct_white_zip)),
  mean_white = mean(pct_white_zip, na.rm = TRUE),
  n_miss_black = sum(!is.na(pct_black_zip)),
  mean_black = mean(pct_black_zip, na.rm = TRUE),
  n_miss_asian = sum(!is.na(pct_asian_zip)),
  mean_asian = mean(pct_asian_zip, na.rm = TRUE),
  n_miss_lat = sum(!is.na(pct_hispanic_zip)),
  mean_lat = mean(pct_hispanic_zip, na.rm = TRUE),
  n_miss_na = sum(!is.na(pct_amerindian_zip)),
  mean_na = mean(pct_amerindian_zip, na.rm = TRUE),
  n_miss_nh = sum(!is.na(pct_nativehawaii_zip)),
  mean_nh = mean(pct_nativehawaii_zip, na.rm = TRUE)) %>%
  head(6)
#> # A tibble: 6 × 16
#> # Groups:   instnm, event_inst [2]
#>   instnm     event_inst event_type n_events n_miss_white mean_white n_miss_black
#>   <chr>      <chr>      <chr>         <int>        <int>      <dbl>        <int>
#> 1 UC Berkel… In-State   2yr colle…      111          106       40.1          106
#> 2 UC Berkel… In-State   4yr colle…       14           12       58.0           12
#> 3 UC Berkel… In-State   other            49           48       37.6           48
#> 4 UC Berkel… In-State   private hs       35           35       48.4           35
#> 5 UC Berkel… In-State   public hs       259          258       39.6          258
#> 6 UC Berkel… Out-State  2yr colle…        1            1       89.7            1
#> # ℹ 9 more variables: mean_black <dbl>, n_miss_asian <int>, mean_asian <dbl>,
#> #   n_miss_lat <int>, mean_lat <dbl>, n_miss_na <int>, mean_na <dbl>,
#> #   n_miss_nh <int>, mean_nh <dbl>3.3 summarize() and logical vectors, part II
Logical vectors (e.g., is.na()) useful for counting obs that satisfy some condition
is.na(c(5,NA,4,NA))
#> [1] FALSE  TRUE FALSE  TRUE
typeof(is.na(c(5,NA,4,NA)))
#> [1] "logical"
sum(is.na(c(5,NA,4,NA)))
#> [1] 2Task: Using object event_berk, calculate the following measures for each combination of event_type and event_inst:
- count of number of rows for each group
- count of rows non-missing for both pct_black_zipandpct_hispanic_zip
- count of number of visits to communities where the sumof Black and Latinx people comprise more than 50% of the total population
event_berk %>% group_by (event_inst, event_type) %>% summarize(
    n_events=n(),
    n_nonmiss_latbl = sum(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)),
    n_majority_latbl= sum(pct_black_zip+ pct_hispanic_zip>50, na.rm = TRUE)
  )
#> # A tibble: 10 × 5
#> # Groups:   event_inst [2]
#>    event_inst event_type  n_events n_nonmiss_latbl n_majority_latbl
#>    <chr>      <chr>          <int>           <int>            <int>
#>  1 In-State   2yr college      111             106               30
#>  2 In-State   4yr college       14              12                0
#>  3 In-State   other             49              48               15
#>  4 In-State   private hs        35              35                6
#>  5 In-State   public hs        259             258               88
#>  6 Out-State  2yr college        1               1                0
#>  7 Out-State  4yr college        4               3                0
#>  8 Out-State  other             89              86               20
#>  9 Out-State  private hs       134             132               19
#> 10 Out-State  public hs        183             183               27(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)) is a logical condition
- condition is TRUE(evaluates to1) orFALSE(evaluates to0) for each obs
3.3.1 summarize(): logical vectors to count proportions
Synatx: group_by(vars) %>% summarize(prop = mean(TRUE/FALSE conditon))
Task: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?
Steps:
- Filter public HS visits
- group by in-state vs. out-of-state
- Create measure
event_berk %>% filter(event_type == "public hs") %>% # filter public hs visits
  group_by (event_inst) %>% # group by in-state vs. out-of-state
  summarize(
    n_events=n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # w/ nonmissings values median inc,
    p_incgt100k = mean(med_inc>100000, na.rm=TRUE)) # proportion visits to $100K+ commmunities 
#> # A tibble: 2 × 4
#>   event_inst n_events n_nonmiss_inc p_incgt100k
#>   <chr>         <int>         <int>       <dbl>
#> 1 In-State        259           256       0.273
#> 2 Out-State       183           183       0.519What if we forgot to put na.rm=TRUE in the above task?
**Task**: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?
event_berk %>% filter(event_type == "public hs") %>% # filter public hs visits
  group_by (event_inst) %>% # group by in-state vs. out-of-state
  summarize(
    n_events=n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # w/ nonmissings values median inc,
    p_incgt100k = mean(med_inc>100000, na.rm=FALSE)) # proportion visits to $100K+ commmunities 
#> # A tibble: 2 × 4
#>   event_inst n_events n_nonmiss_inc p_incgt100k
#>   <chr>         <int>         <int>       <dbl>
#> 1 In-State        259           256      NA    
#> 2 Out-State       183           183       0.5193.3.2 summarize(): Other “helper” functions
Lots of other functions we can use within summarize()
Common functions to use with summarize():
| Function | Description | 
|---|---|
| n | count | 
| n_distinct | count unique values | 
| mean | mean | 
| median | median | 
| max | largest value | 
| min | smallest value | 
| sd | standard deviation | 
| sum | sum of values | 
| first | first value | 
| last | last value | 
| nth | nth value | 
| any | condition true for at least one value | 
Note: These functions can also be used on their own or with mutate()
Maximum value in a group
max(c(10,50,8))
#> [1] 50Task: For each combination of in-state/out-of-state and event type, what is the maximum value of med_inc?
event_berk %>% group_by(event_type, event_inst) %>% 
  summarize(max_inc = max(med_inc)) # oops, we forgot to remove NAs!
event_berk %>% group_by(event_type, event_inst) %>% 
  summarize(max_inc = max(med_inc, na.rm = TRUE))Isolate first/last/nth observation in a group
x <- c(10,15,20,25,30)
first(x)
#> [1] 10
last(x)
#> [1] 30
nth(x,1)
#> [1] 10
nth(x,3)
#> [1] 20
nth(x,10)
#> [1] NATask: after sorting object event_berk by event_type and event_datetime_start, what is the value of event_date for:
- first event for each event type?
- the last event for each event type?
- the 50th event for each event type?
event_berk %>% arrange(event_type, event_datetime_start) %>%
  group_by(event_type) %>%
  summarize(
    n_events = n(),
    date_first= first(event_date),
    date_last= last(event_date),
    date_50th= nth(event_date, 50)
  )
#> # A tibble: 5 × 5
#>   event_type  n_events date_first date_last  date_50th 
#>   <chr>          <int> <date>     <date>     <date>    
#> 1 2yr college      112 2017-04-25 2017-11-16 2017-10-04
#> 2 4yr college       18 2017-04-30 2017-11-16 NA        
#> 3 other            138 2017-04-11 2017-10-26 2017-09-09
#> 4 private hs       169 2017-04-23 2017-10-25 2017-09-26
#> 5 public hs        442 2017-04-14 2017-12-15 2017-09-163.3.3 Student exercise
Identify value of event_date for the nth event in each by group
Specific task:
- arrange (i.e., sort) by event_typeandevent_datetme_start, then group byevent_type, and then identify the value ofevent_datefor:- the first event in each by group (event_type)
- the second event in each by group
- the third event in each by group
- the fourth event in each by group
- the fifth event in each by group
 
- the first event in each by group (
Student exercise solutions
event_berk %>% arrange(event_type, event_datetime_start) %>%
  group_by(event_type) %>%
  summarize(
    n_events = n(),
    date_1st= first(event_date),
    date_2nd= nth(event_date,2),
    date_3rd= nth(event_date,3),
    date_4th= nth(event_date,4),
    date_5th= nth(event_date,5))
#> # A tibble: 5 × 7
#>   event_type  n_events date_1st   date_2nd   date_3rd   date_4th   date_5th  
#>   <chr>          <int> <date>     <date>     <date>     <date>     <date>    
#> 1 2yr college      112 2017-04-25 2017-09-05 2017-09-05 2017-09-06 2017-09-06
#> 2 4yr college       18 2017-04-30 2017-05-01 2017-05-06 2017-09-13 2017-09-14
#> 3 other            138 2017-04-11 2017-04-23 2017-04-25 2017-04-29 2017-05-14
#> 4 private hs       169 2017-04-23 2017-04-24 2017-04-29 2017-04-30 2017-09-05
#> 5 public hs        442 2017-04-14 2017-04-24 2017-04-26 2017-04-27 2017-04-273.4 count() in place of group_by() and summarise()
Recall from Section 3.1, the count() function returns the count of unique values of a particular variable. For example, if we count by event_state, we get the count of visits by state. This is the same if we were to group_by() event_state and summarise() to get the count of events by state.
event_berk %>%
  group_by(event_state) %>%
  summarise(n_state = n())
#> # A tibble: 35 × 2
#>    event_state n_state
#>    <chr>         <int>
#>  1 AL                2
#>  2 AR                3
#>  3 AZ               18
#>  4 CA              468
#>  5 CO                3
#>  6 CT                3
#>  7 DC                9
#>  8 FL               33
#>  9 GA               33
#> 10 HI                1
#> # ℹ 25 more rows
event_berk %>%
  count(event_state)
#> # A tibble: 35 × 2
#>    event_state     n
#>    <chr>       <int>
#>  1 AL              2
#>  2 AR              3
#>  3 AZ             18
#>  4 CA            468
#>  5 CO              3
#>  6 CT              3
#>  7 DC              9
#>  8 FL             33
#>  9 GA             33
#> 10 HI              1
#> # ℹ 25 more rowsIf we want to calculate the percent of UC Berkeley’s visits across states, we can use the count() and mutate() functions instead of group_by() and summarize().
event_berk %>%
  group_by(event_state) %>% #group_by event state
  summarise(n_state = n()) %>% #summarize and get a count of the # of events by state
  mutate(pct_state = (n_state / sum(n_state)) * 100) #create percent variable of visits by UC Berkeley (e.g., 53% of visits in California)
#> # A tibble: 35 × 3
#>    event_state n_state pct_state
#>    <chr>         <int>     <dbl>
#>  1 AL                2     0.228
#>  2 AR                3     0.341
#>  3 AZ               18     2.05 
#>  4 CA              468    53.2  
#>  5 CO                3     0.341
#>  6 CT                3     0.341
#>  7 DC                9     1.02 
#>  8 FL               33     3.75 
#>  9 GA               33     3.75 
#> 10 HI                1     0.114
#> # ℹ 25 more rows
  
event_berk %>%
  count(event_state) %>% #count of visits by UC Berkeley by state
  mutate(pct_state = (n / sum(n)) * 100) #percent of visits by UC Berkeley across states
#> # A tibble: 35 × 3
#>    event_state     n pct_state
#>    <chr>       <int>     <dbl>
#>  1 AL              2     0.228
#>  2 AR              3     0.341
#>  3 AZ             18     2.05 
#>  4 CA            468    53.2  
#>  5 CO              3     0.341
#>  6 CT              3     0.341
#>  7 DC              9     1.02 
#>  8 FL             33     3.75 
#>  9 GA             33     3.75 
#> 10 HI              1     0.114
#> # ℹ 25 more rows4 Summarize across multiple columns
4.1 What are column-wise operations?
across() allows you to perform the same operation on multiple columns.
Description - across() apply the same transformation to multiple columns.
Syntax - across(.cols, .fns, .names)
Arguments
- .colsColumns to transform
- .fnsFunction to apply to each of the selected columns. Some values include:- A function, e.g. mean()
- A purr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
- A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~sum(is.na(.x)))
 
- A function, e.g. 
- .namesA glue specification that describes how to name the output columns. Use- {.col}to stand for the selected column name and- {.fn}to stand for the name of the function being applied, e.g. “- {.col}_{.fn}”
4.2 across() affects every variable
Task:
- For U. Pittsburgh (univ_id = 215293) events at public and private high schools, calculate the mean value ofmed_incandpct_white_zipfor each combination ofevent_typeandevent_inst
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), mean))
#> # A tibble: 4 × 4
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc pct_white_zip
#>   <chr>      <chr>        <dbl>         <dbl>
#> 1 private hs In-State        NA          78.9
#> 2 private hs Out-State       NA          NA  
#> 3 public hs  In-State        NA          NA  
#> 4 public hs  Out-State       NA          NATry again, this time applying na.rm = TRUE - this is an example of a purr style lambda ~ mean(.x, na.rm = TRUE) argument “for the function calls in .fns.”
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), ~mean(.x, na.rm = TRUE)))
#> # A tibble: 4 × 4
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc pct_white_zip
#>   <chr>      <chr>        <dbl>         <dbl>
#> 1 private hs In-State    77115.          78.9
#> 2 private hs Out-State  103915.          63.3
#> 3 public hs  In-State    78408.          83.0
#> 4 public hs  Out-State  114212.          67.5Task:
- For U. Pittsburgh ( - univ_id = 215293) events at public and private high schools, calculate mean and standard deviation of- med_incand- pct_white_zipfor each combination of- event_typeand- event_inst
- You can create a named list of functions to supply into the second argument - .fns
mean_sd <- list(
  mean = ~mean(.x, na.rm = TRUE), 
  sd = ~sd(.x, na.rm = TRUE)
)
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), mean_sd))
#> # A tibble: 4 × 6
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#>   <chr>      <chr>             <dbl>      <dbl>              <dbl>
#> 1 private hs In-State         77115.     36559.               78.9
#> 2 private hs Out-State       103915.     44220.               63.3
#> 3 public hs  In-State         78408.     25841.               83.0
#> 4 public hs  Out-State       114212.     39745.               67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>Use this syntax to control variable name suffixes:
- .names = "{.col}_{.fn}"
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), mean_sd, .names = "{.col}_{.fn}"))
#> # A tibble: 4 × 6
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#>   <chr>      <chr>             <dbl>      <dbl>              <dbl>
#> 1 private hs In-State         77115.     36559.               78.9
#> 2 private hs Out-State       103915.     44220.               63.3
#> 3 public hs  In-State         78408.     25841.               83.0
#> 4 public hs  Out-State       114212.     39745.               67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>- .names = "{.fn}_{.col}"Change the order of the name to include function first and then the column name
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), mean_sd, .names = "{.fn}_{.col}"))
#> # A tibble: 4 × 6
#> # Groups:   event_type [2]
#>   event_type event_inst mean_med_inc sd_med_inc mean_pct_white_zip
#>   <chr>      <chr>             <dbl>      <dbl>              <dbl>
#> 1 private hs In-State         77115.     36559.               78.9
#> 2 private hs Out-State       103915.     44220.               63.3
#> 3 public hs  In-State         78408.     25841.               83.0
#> 4 public hs  Out-State       114212.     39745.               67.5
#> # ℹ 1 more variable: sd_pct_white_zip <dbl>Task:
- Same task as before, but now calculate mean, standard deviation, min, and max of med_incandpct_white_zipfor each combination ofevent_typeandevent_inst
desc_stat <- list(
  mean = ~mean(.x, na.rm = TRUE), 
  sd = ~sd(.x, na.rm = TRUE),
  low = ~min(.x, na.rm = TRUE),
  high = ~max(.x, na.rm=TRUE)
)
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), desc_stat, .names = "{.col}_{.fn}"))
#> # A tibble: 4 × 10
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_sd med_inc_low med_inc_high
#>   <chr>      <chr>             <dbl>      <dbl>       <dbl>        <dbl>
#> 1 private hs In-State         77115.     36559.      12894.      224163 
#> 2 private hs Out-State       103915.     44220.      29630.      223556.
#> 3 public hs  In-State         78408.     25841.      23168.      169036.
#> 4 public hs  Out-State       114212.     39745.      21581       250001 
#> # ℹ 4 more variables: pct_white_zip_mean <dbl>, pct_white_zip_sd <dbl>,
#> #   pct_white_zip_low <dbl>, pct_white_zip_high <dbl>4.2.1 across(), quosure style lambdas ~ func_name(.x)
Syntax: across(.cols, .fns, .names)
- .fnsFunction to apply to each of the selected columns. Some values include:- A function, e.g. mean(); A purr-style lambda, e.g. ~ mean(.x, na.rm = TRUE); A named list of functions or lambdas, e.g.list(mean = mean, n_miss = ~sum(is.na(.x)))
 
- A function, e.g. 
Task: Calculate mean, number of obs, and number of non-missing obs for variables
- Can exclude n()from the.colsargument inacross()function because you only need to calculate once to get the number of observations for each group.
mean_obs_nmiss <- list(
  mean = ~mean(.x, na.rm = TRUE), 
  n_non_miss = ~sum(!is.na(.x))
)
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pop_total) %>%
  group_by(event_type,event_inst) %>%
  summarize(nrow = n(), across(c(med_inc, pop_total), mean_obs_nmiss,.names = "{.col}_{.fn}"))
#> # A tibble: 4 × 7
#> # Groups:   event_type [2]
#>   event_type event_inst  nrow med_inc_mean med_inc_n_non_miss pop_total_mean
#>   <chr>      <chr>      <int>        <dbl>              <int>          <dbl>
#> 1 private hs In-State      51       77115.                 50         25320.
#> 2 private hs Out-State    209      103915.                204         29771.
#> 3 public hs  In-State     211       78408.                210         21922.
#> 4 public hs  Out-State    559      114212.                557         30191.
#> # ℹ 1 more variable: pop_total_n_non_miss <int>4.2.2 across() and where() affects selected variables if they meet a condition
Syntax: across(where(fn)), .fns, .names)
- fnA function that returns TRUE or FALSE
- .fnsFunction to apply to each of the selected columns. Some values include:- A function, e.g. mean(); A purr-style lambda, e.g. ~ mean(.x, na.rm = TRUE); A named list of functions or lambdas, e.g.list(mean = mean, n_miss = ~sum(is.na(.x)))
 
- A function, e.g. 
Task: For U. Pittsburgh events at public and private high schools, caclulate mean, min, and max of variables med_inc and event_date for each combination of event_type and event_inst
mean_min_max <- list(
  mean = ~mean(.x, na.rm = TRUE),
  low = ~min(.x, na.rm=TRUE),
  high = ~max(.x, na.rm=TRUE)
)
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, event_date) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(where(is.double), mean_min_max))
#> # A tibble: 4 × 8
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_low med_inc_high event_date_mean
#>   <chr>      <chr>             <dbl>       <dbl>        <dbl> <date>         
#> 1 private hs In-State         77115.      12894.      224163  2017-09-16     
#> 2 private hs Out-State       103915.      29630.      223556. 2017-08-24     
#> 3 public hs  In-State         78408.      23168.      169036. 2017-09-19     
#> 4 public hs  Out-State       114212.      21581       250001  2017-08-20     
#> # ℹ 2 more variables: event_date_low <date>, event_date_high <date>- Although event_dateis a Date class, it is type double and therefore we can applyis.doubleto thewhere()function to calculate the mean, min, and max.
Task: For events by U. Pittsburgh at public and private high schools, caclulate mean and standard deviation for numeric variables
#First, which vars are numeric
df_event %>% 
  select(event_type, event_inst,instnm,school_id,med_inc,pct_white_zip) %>%
  glimpse()
#> Rows: 18,680
#> Columns: 6
#> $ event_type    <chr> "public hs", "public hs", "public hs", "public hs", "pub…
#> $ event_inst    <chr> "In-State", "In-State", "In-State", "In-State", "Out-Sta…
#> $ instnm        <chr> "UM Amherst", "UM Amherst", "UM Amherst", "UM Amherst", …
#> $ school_id     <chr> "250192000042", "250243000134", "250366000496", "2503660…
#> $ med_inc       <dbl> 71713.5, 89121.5, 70136.5, 70136.5, 71023.5, 71023.5, 71…
#> $ pct_white_zip <dbl> 73.6803, 91.3555, 79.4345, 79.4345, 88.6526, 88.6526, 88…
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,instnm,school_id,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(where(is.numeric), mean_sd))
#> # A tibble: 4 × 6
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#>   <chr>      <chr>             <dbl>      <dbl>              <dbl>
#> 1 private hs In-State         77115.     36559.               78.9
#> 2 private hs Out-State       103915.     44220.               63.3
#> 3 public hs  In-State         78408.     25841.               83.0
#> 4 public hs  Out-State       114212.     39745.               67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>5 Attach aggregate measures to your data frame
We can attach aggregate measures to a data frame by using group_by without summarize()
What do I mean by “attaching aggregate measures to a data frame”?
- Calculate measures at the by_group level, but attach them to original object rather than creating an object with one row for each by_group
Task: Using event_berk data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type
- resulting object should have same number of observations as event_berk
Steps:
- create measure of avg. income across all events without using group_by()orsummarize()and assign as (new) object
- Using object from previous step, create measure of avg. income across by event type using group_by()withoutsummarize()and assign as new object
Task: Using event_berk data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type
- Create measure of average income across all events
event_berk_temp <- event_berk %>% 
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type,med_inc) %>% # select vars to be retained (optioanl) 
  mutate(avg_inc = mean(med_inc, na.rm=TRUE)) # create avg. inc measure
dim(event_berk_temp)
#> [1] 879   4
event_berk_temp %>% head(5)
#> # A tibble: 5 × 4
#>   event_date event_type med_inc avg_inc
#>   <date>     <chr>        <dbl>   <dbl>
#> 1 2017-04-11 other       62022.  88124.
#> 2 2017-04-14 public hs  125056.  88124.
#> 3 2017-04-23 private hs  78897   88124.
#> 4 2017-04-23 other       55127   88124.
#> 5 2017-04-24 private hs 122307   88124.- Create measure of average income by event type
event_berk_temp <- event_berk_temp %>% 
  group_by(event_type) %>% # grouping by event type
  mutate(avg_inc_type = mean(med_inc, na.rm=TRUE)) # create avg. inc measure
  
str(event_berk_temp)
#> gropd_df [879 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
#>  $ event_date  : Date[1:879], format: "2017-04-11" "2017-04-14" ...
#>  $ event_type  : chr [1:879] "other" "public hs" "private hs" "other" ...
#>  $ med_inc     : num [1:879] 62022 125056 78897 55127 122307 ...
#>  $ avg_inc     : num [1:879] 88124 88124 88124 88124 88124 ...
#>  $ avg_inc_type: num [1:879] 71186 93978 89242 71186 89242 ...
#>  - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ event_type: chr [1:5] "2yr college" "4yr college" "other" "private hs" ...
#>   ..$ .rows     : list<int> [1:5] 
#>   .. ..$ : int [1:112] 8 65 67 76 78 81 88 105 110 112 ...
#>   .. ..$ : int [1:18] 15 16 17 129 149 158 312 373 462 508 ...
#>   .. ..$ : int [1:138] 1 4 7 13 18 19 21 22 23 24 ...
#>   .. ..$ : int [1:169] 3 5 12 14 60 62 68 70 71 72 ...
#>   .. ..$ : int [1:442] 2 6 9 10 11 20 61 63 75 77 ...
#>   .. ..@ ptype: int(0) 
#>   ..- attr(*, ".drop")= logi TRUE
event_berk_temp %>% head(5)
#> # A tibble: 5 × 5
#> # Groups:   event_type [3]
#>   event_date event_type med_inc avg_inc avg_inc_type
#>   <date>     <chr>        <dbl>   <dbl>        <dbl>
#> 1 2017-04-11 other       62022.  88124.       71186.
#> 2 2017-04-14 public hs  125056.  88124.       93978.
#> 3 2017-04-23 private hs  78897   88124.       89242.
#> 4 2017-04-23 other       55127   88124.       71186.
#> 5 2017-04-24 private hs 122307   88124.       89242.Task: Using event_berk_temp from previous question, create a measure that identifies whether med_inc associated with the event is higher/lower than average income for all events of that type
Steps:
- Create measure of average income for each event type [already done]
- Create a 0/1 indicator that identifies whether the median income at the event location is higher than the average median income for events of that type.
# average income at recruiting events across all universities
event_berk_tempv2 <- event_berk_temp %>% 
  mutate(gt_avg_inc_type = med_inc > avg_inc_type) %>% 
  select(-(avg_inc)) # drop avg_inc (optional)
event_berk_tempv2 # note how med_ic = NA are treated
#> # A tibble: 879 × 5
#> # Groups:   event_type [5]
#>    event_date event_type  med_inc avg_inc_type gt_avg_inc_type
#>    <date>     <chr>         <dbl>        <dbl> <lgl>          
#>  1 2017-04-11 other        62022.       71186. FALSE          
#>  2 2017-04-14 public hs   125056.       93978. TRUE           
#>  3 2017-04-23 private hs   78897        89242. FALSE          
#>  4 2017-04-23 other        55127        71186. FALSE          
#>  5 2017-04-24 private hs  122307        89242. TRUE           
#>  6 2017-04-24 public hs   133966        93978. TRUE           
#>  7 2017-04-25 other       114580        71186. TRUE           
#>  8 2017-04-25 2yr college 103600.       79183. TRUE           
#>  9 2017-04-26 public hs    38516.       93978. FALSE          
#> 10 2017-04-27 public hs    49950.       93978. FALSE          
#> # ℹ 869 more rowsSame as above, but this time create integer indicator rather than logical
event_berk_tempv2 <- event_berk_tempv2 %>% 
  mutate(gt_avg_inc_type = as.integer(med_inc > avg_inc_type)) 
event_berk_tempv2  %>% head(4)
#> # A tibble: 4 × 5
#> # Groups:   event_type [3]
#>   event_date event_type med_inc avg_inc_type gt_avg_inc_type
#>   <date>     <chr>        <dbl>        <dbl>           <int>
#> 1 2017-04-11 other       62022.       71186.               0
#> 2 2017-04-14 public hs  125056.       93978.               1
#> 3 2017-04-23 private hs  78897        89242.               0
#> 4 2017-04-23 other       55127        71186.               05.0.1 Student exercise
Task: is pct_white_zip at a particular event higher or lower than the average pct_white_zip for that event_type?
- Note: all events attached to a particular zip_code
- pct_white_zip: pct of people in that zip_code who identify as white
Steps in task:
- Create measure of average pct white for each event_type
- Compare whether pct_white_zip is higher or lower than this average
Student exercise solution
Task: is pct_white_zip at a particular event higher or lower than the average pct_white_zip for that event_type?
event_berk_tempv3 <- event_berk %>% 
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, pct_white_zip) %>% #optional
  group_by(event_type) %>% # grouping by event type
  mutate(avg_pct_white = mean(pct_white_zip, na.rm=TRUE),
         gt_avg_pctwhite_type = as.integer(pct_white_zip > avg_pct_white)) 
event_berk_tempv3 %>% head(4)
#> # A tibble: 4 × 5
#> # Groups:   event_type [3]
#>   event_date event_type pct_white_zip avg_pct_white gt_avg_pctwhite_type
#>   <date>     <chr>              <dbl>         <dbl>                <int>
#> 1 2017-04-11 other               37.2          49.7                    0
#> 2 2017-04-14 public hs           78.3          48.9                    1
#> 3 2017-04-23 private hs          84.7          61.0                    1
#> 4 2017-04-23 other               20.9          49.7                    0