Across the tidyverse: processing across rows

1 Introduction

1.1 Libraries we will use today

“Load” the package we will use today (output omitted)

  • you must run this code chunk
library(tidyverse)

If package not yet installed, then must install before you load. Install in “console” rather than .Rmd file

  • Generic syntax: install.packages("package_name")
  • Install “tidyverse”: install.packages("tidyverse")

Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes:

  • install.packages("tidyverse")
  • library(tidyverse)

1.2 Data we will use today

Data on off-campus recruiting events by public universities

  • Object df_event
    • One observation per university, recruiting event
rm(list = ls()) # remove all objects

#load dataset with one obs per recruiting event
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
#load("../../data/recruiting/recruit_event_allvars.Rdata")

1.3 What we are doing today

Processing across variables vs. processing across observations

In today’s lecture, I’ll use the terms “observations” and “rows” interchangeably. Creation of analysis datasets often requires calculations across obs.

Examples:

  • You have a dataset with one observation per student-term and want to create a variable of credits attempted per term
  • You have a dataset with one observation per student-term and want to create a variable of GPA for the semester or cumulative GPA for all semesters
  • Number of off-campus recruiting events a university makes to each state
  • Average household income at visited versus non-visited high schools

Creating graphs and tables of descriptive stats usually require calculations across obs

Example: Want to create a graph that shows number of recruiting events by event “type” (e.g., public HS, private HS) for each university

  • Start with df_event dataset that has one observation per university, recruiting event
  • Create a new data frame object that has one observation per university and event type and has a variable for the number of events
    • this variable calculated by counting number of rows in each combination of university and event type
  • This new data frame object is the input for creating desired graph

So far, we have focused on ``processing across variables’’

  • Performing calculations across columns (i.e., vars), typically within a row (i.e., observation)
  • Example: percent free-reduced lunch (above)

Visits by UC Berkeley to public high schools

#> # A tibble: 5 × 6
#>   school_id    state tot_stu_pub fr_lunch pct_fr_lunch med_inc
#>   <chr>        <chr>       <dbl>    <dbl>        <dbl>   <dbl>
#> 1 340882002126 NJ           1846       29       0.0157 178732 
#> 2 340147000250 NJ           1044       50       0.0479  62288 
#> 3 340561003796 NJ           1505      298       0.198  100684.
#> 4 340165005124 NJ           1900       43       0.0226 160476.
#> 5 341341003182 NJ           1519      130       0.0856 144346

Processing across obs (focus of today’s lecture)

  • Performing calculations across rows (i.e., obs), often within a column (i.e., variable)
  • Example: Average household income of visited high schools, by state

Strategy for teaching processing across obs

In tidyverse the group_by() and summarize() functions are the primary means of performing calculations across observations

  • Usually, processing across observations requires using group_by() and summarize() together
  • group_by() typically not very useful by itself
  • summarize() [with or without group_by()] can be helpful for creating summary statistics that are the inputs for tables or graphs you create

How we’ll teach:

  • introduce group_by() and summarize() separately
    • goal: you understand what each function does
  • then we’ll combine them

2 group_by() and summarize()

2.1 group_by

Description: “group_by() takes an existing data frame and converts it into a grouped data frame where operations are performed”by group”. ungroup() removes grouping.”

  • part of dplyr package within tidyverse; not part of Base R
  • works best with pipes %>% and summarize() function [described below]

Basic syntax: group_by(.data, ...)

  • .data argument refers to name of data frame
  • ... argument refers to names of “group_by” variables, separated by commas
    • Can “group by” one or many variables
    • Typically, “group_by” variables are character, factor, or integer variables

Possible “group by” variables in df_event data:

  • university name/id; event type (e.g., public HS, private HS); state

Example: in df_event, create frequency count of event_type [output omitted]

names(df_event)
#without group_by()
df_event %>% count(event_type)
df_event %>% count(instnm)
#group_by() university
df_event %>% group_by(instnm) %>% count(event_type)

By itself group_by() doesn’t do much; it just prints data

  • Below, group df_event data by university, event type, and event state
#group_by (without pipes)
group_by(df_event, univ_id, event_type, event_state)
#group_by (with pipes)
df_event %>% group_by(univ_id, event_type, event_state) 
df_event %>% group_by(univ_id, event_type, event_state) %>% glimpse()

But once an object is grouped, all subsequent functions are run separately “by group”

  • recall that count() counts number of observations by group
# count number of observations in group, ungrouped data
df_event  %>% count()
#group by and then count obs
df_event %>% group_by(univ_id) %>% count()
df_event %>% group_by(univ_id) %>% count() %>% glimpse()
  #df_event %>% group_by(univ_id) %>% count() %>% View()

df_event %>% group_by(univ_id, event_type) %>% count()
df_event %>% group_by(univ_id, event_type) %>% count() %>% glimpse()
  #df_event %>% group_by(univ_id, event_type) %>% count() %>% View()

2.1.1 Grouping not retained unless you assign it

Below, we’ll use class() function to show whether data frame is grouped

  • will talk more about class() next week, but for now, just think of it as a function that provides information about an object
  • similar to typeof(), but class() provides different info about object

Grouping is not retained unless you assign it

class(df_event)
#> [1] "tbl_df"     "tbl"        "data.frame"
df_event %>% group_by(univ_id, event_type, event_state)
df_event_grp <- df_event %>% group_by(univ_id, event_type, event_state) # using pipes
class(df_event_grp)
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

2.1.2 Un-grouping an object

Use ungroup(object) to un-group grouped data

class(df_event_grp)
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
df_event_grp <- ungroup(df_event_grp)
class(df_event_grp)
#> [1] "tbl_df"     "tbl"        "data.frame"
rm(df_event_grp)

2.1.3 Student exercise group_by()

  1. Group by “instnm” and get a frequency count.
    • How many rows and columns do you have? What do the number of rows mean?
  2. Now group by “instnm” and “event_type” and get a frequency count.
    • How many rows and columns do you have? What do the number of rows mean?
  3. Bonus: In the same code chunk, group by “instnm” and “event_type”, but this time filter for observations where “med_inc” is greater than 75000 and get a frequency count.
Solutions group_by()
  1. Group by “instnm” and get a frequency count.

    • How many rows and columns do you have? What do the number of rows mean?
    df_event %>%
      group_by(instnm) %>%
      count() 
    #> # A tibble: 16 × 2
    #> # Groups:   instnm [16]
    #>    instnm          n
    #>    <chr>       <int>
    #>  1 Arkansas      994
    #>  2 Bama         4258
    #>  3 CU Boulder   1439
    #>  4 Cinci         679
    #>  5 Kansas       1014
    #>  6 NC State      640
    #>  7 Pitt         1225
    #>  8 Rutgers      1135
    #>  9 S Illinois    549
    #> 10 Stony Brook   730
    #> 11 UC Berkeley   879
    #> 12 UC Irvine     539
    #> 13 UGA           827
    #> 14 UM Amherst    908
    #> 15 UNL          1397
    #> 16 USCC         1467
  2. Now group by “instnm” and “event_type” and get a frequency count.

    • How many rows and columns do you have? What do the number of rows mean?
    df_event %>%
      group_by(instnm, event_type) %>%
      count() 
    #> # A tibble: 80 × 3
    #> # Groups:   instnm, event_type [80]
    #>    instnm   event_type      n
    #>    <chr>    <chr>       <int>
    #>  1 Arkansas 2yr college    32
    #>  2 Arkansas 4yr college    14
    #>  3 Arkansas other         112
    #>  4 Arkansas private hs    222
    #>  5 Arkansas public hs     614
    #>  6 Bama     2yr college   127
    #>  7 Bama     4yr college   158
    #>  8 Bama     other         608
    #>  9 Bama     private hs    963
    #> 10 Bama     public hs    2402
    #> # ℹ 70 more rows
  3. Bonus: Group by “instnm” and “event_type”, but this time filter for observations where “med_inc” is greater than 75000 and get a frequency count.

df_event %>%
  group_by(instnm, event_type) %>%
  filter(med_inc > 75000) %>%
  count()
#> # A tibble: 80 × 3
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type      n
#>    <chr>    <chr>       <int>
#>  1 Arkansas 2yr college     7
#>  2 Arkansas 4yr college     3
#>  3 Arkansas other          30
#>  4 Arkansas private hs     99
#>  5 Arkansas public hs     303
#>  6 Bama     2yr college    21
#>  7 Bama     4yr college    42
#>  8 Bama     other         249
#>  9 Bama     private hs    477
#> 10 Bama     public hs    1478
#> # ℹ 70 more rows


2.2 summarize()

summarize() function

Description: summarize() calculates across rows; then collapses into single row

  • summarize() create scalar vars summarizing variables of existing data frame
  • if you first group data frame using group_by(), summarize() creates summary vars separately for each group, returning object with one row per group
  • if data frame not grouped, summarize() will result in one row.

Syntax: summarize(.data, ...)

  • .data: a data frame; omit if using summarize() after pipe %>%
  • ...: Name-value pairs of summary functions separated by commas
    • “name” will be the name of new variable you will create
    • “value” should be expression that returns a single value like min(x), n()
    • variable names do not need to be placed within quotes

Value (what summarize() returns/creates)

  • Object of same class as .data.; object will have one obs per “by group”

Useful functions (i.e., “helper functions”)

  • Standalone functions called within summarize(), e.g., mean(), n()
  • e.g., count function n() takes no arguments; returns number of rows in group

2.2.1 Investigate objects created by summarize()

Example: Count total number of events

  • function n() from dplyr package (dplyr::n) returns the size of the current group
#?n # dplyr function n() gives current group size
summarize(df_event, num_events=n()) # without pipes
#> # A tibble: 1 × 1
#>   num_events
#>        <int>
#> 1      18680

df_event %>% summarize(num_events=n()) # with pipes
#> # A tibble: 1 × 1
#>   num_events
#>        <int>
#> 1      18680
df_event %>% summarize(num_events=n()) %>% str() # use str to see what pipe returned
#> tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ num_events: int 18680
  #df_event %>% summarize(num_events=n()) %>% View()

Example: What is max value of med_inc across all events

  • function max() from base package (base::max) returns max value
#?max # base R function max() returns max value
df_event %>% summarize(max_inc = max(med_inc, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   max_inc
#>     <dbl>
#> 1  250001
df_event %>% summarize(max_inc = max(med_inc, na.rm = TRUE)) %>% str()
#> tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ max_inc: num 250001

Example: Count total number of events AND max value of median income

df_event %>% summarize(
    num_events=n(), 
    max_inc=max(med_inc, na.rm = TRUE)
  )
#> # A tibble: 1 × 2
#>   num_events max_inc
#>        <int>   <dbl>
#> 1      18680  250001

# show object returned by pipe
df_event %>% summarize(
    num_events=n(),
    max_inc=max(med_inc, na.rm = TRUE)
  ) %>% str()
#> tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ num_events: int 18680
#>  $ max_inc   : num 250001

#df_event %>% summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) %>% View()

Example: Count total number of events AND max value of median income

We can use assignment to keep the object created by summarize()

df_event_temp <- df_event %>% 
  summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE))

df_event_temp
rm(df_event_temp)

What if we forgot na.rm = TRUE?

  • then max_inc equals NA cuz can’t perform calculation on NA values
df_event %>% 
  summarize(num_events=n(),max_inc=max(med_inc, na.rm = FALSE))

2.2.2 Takeaways, summarize()

  • By default, objects created by summarize() are data frames that contain variables created within summarize() and one observation [per “by group”]
  • most “helper” functions (e.g., max(), mean() have option na.rm to keep/remove missing obs before performing calculations)
    • na.rm = FALSE (default); don’t remove NAs prior to calculation
      • if any obs missing, then result of calculation is NA
    • na.rm = TRUE (default); remove NAs prior to calculation
  • Object created by summarize() not retained unless you assign it (output omitted)
event_temp <- df_event %>% summarize(num_events=n(), 
  mean_inc=mean(med_inc, na.rm = TRUE))

event_temp
rm(event_temp)

2.2.3 Using [] operator to filter observations within summarize

Imagine we want to calculate avg. income, separately for in-state vs. out-of-state visits

  • first, make sure we can identify in vs. out-state (output omitted)
df_event %>% filter(event_state == instst) %>% count() #in state
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1  5425

df_event %>% filter(event_state != instst) %>% count() #out state
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1 13255

Task: calculate mean income for: all events; in-state events; out-of-state events

df_event %>%
  summarize(
    avg_inc = mean(med_inc, na.rm = TRUE), # all events
    avg_inc_inst = mean(med_inc[event_state == instst], na.rm = TRUE), # in-state
    avg_inc_outst = mean(med_inc[event_state != instst], na.rm = TRUE) # out-state
  )
#> # A tibble: 1 × 3
#>   avg_inc avg_inc_inst avg_inc_outst
#>     <dbl>        <dbl>         <dbl>
#> 1  89089.       71589.        96162.

Understanding code: mean(med_inc[event_state == instst], na.rm = TRUE)

  • mean() function takes atomic vector as first argument
  • the variable med_inc is an numeric atomic vector
  • event_state == instst is condition that yields a logical atomic vector
  • From Base R lecture, med_inc[event_state == instst] is:
    • “subset atomic vectors using []” approach 3: “Using logicals to return elements where corresponding logical is TRUE
    • isolates observations of med_inc where event_state is same as state the university is located in (instst)

2.2.4 Using summarize() to create descriptive statistics table

Often helpful to use summarize() to calculate summary statistics that are the basis for a table of descriptive statistics

Task: create a table of descriptive statistics about variable med_inc

  • want these measures: number of non-missing obs; mean; standard deviation
df_event %>% mutate(non_miss_inc = is.na(med_inc)==0) %>%
  summarize(
    n = sum(non_miss_inc, na.rm = TRUE), #SAMPLE SIZE all
    avg_inc = mean(med_inc, na.rm = TRUE), # MEAN
    std_inc = sd(med_inc, na.rm = TRUE) # STANDARD DEVIATION all events
  )
#> # A tibble: 1 × 3
#>       n avg_inc std_inc
#>   <int>   <dbl>   <dbl>
#> 1 18210  89089.  39267.

Task: same as above but separate measures for: all events; in-state; out-of-state

df_event %>% mutate(non_miss_inc = is.na(med_inc)==0) %>%
  summarize(
    n = sum(non_miss_inc, na.rm = TRUE), #SAMPLE SIZE 
    n_inst = sum(non_miss_inc[event_state == instst], na.rm = TRUE), 
    n_outst = sum(non_miss_inc[event_state != instst], na.rm = TRUE), 
    avg_inc = mean(med_inc, na.rm = TRUE), # MEAN
    avg_inc_inst = mean(med_inc[event_state == instst], na.rm = TRUE), 
    avg_inc_outst = mean(med_inc[event_state != instst], na.rm = TRUE),
    std_inc = sd(med_inc, na.rm = TRUE), # STANDARD DEVIATION 
    std_inc_inst = sd(med_inc[event_state == instst], na.rm = TRUE), 
    std_inc_outst = sd(med_inc[event_state != instst], na.rm = TRUE) 
  )
#> # A tibble: 1 × 9
#>       n n_inst n_outst avg_inc avg_inc_inst avg_inc_outst std_inc std_inc_inst
#>   <int>  <int>   <int>   <dbl>        <dbl>         <dbl>   <dbl>        <dbl>
#> 1 18210   5241   12969  89089.       71589.        96162.  39267.       32077.
#> # ℹ 1 more variable: std_inc_outst <dbl>

2.2.5 Student exercise summarize()

  1. What is the min value of med_inc across all events?
    • Hint: Use min()
  2. What is the mean value of fr_lunch across all events?
    • Hint: Use mean()
Solutions using summarize()
  1. What is min value of med_inc across all events?
df_event %>%
  summarize(min_med_income = min(med_inc, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   min_med_income
#>            <dbl>
#> 1         12894.
  1. What is the mean value of fr_lunch across all events?

    • Hint: Use mean()
    df_event %>%
      summarize(mean_fr_lunch = mean(fr_lunch, na.rm = TRUE))
    #> # A tibble: 1 × 1
    #>   mean_fr_lunch
    #>           <dbl>
    #> 1          475.


3 Combining group_by() and summarize()

Combining summarize() and group_by

summarize() on ungrouped vs. grouped data:

  • By itself, summarize() performs calculations across all rows of data frame then collapses the data frame to a single row
  • When data frame is grouped, summarize() performs calculations across rows within a group and then collapses to a single row for each group

Example: Count the number of events for each university

  • remember: df_event has one observation per university, recruiting event
df_event %>% summarize(num_events=n())
#> # A tibble: 1 × 1
#>   num_events
#>        <int>
#> 1      18680
df_event %>% group_by(instnm) %>% summarize(num_events=n())
#> # A tibble: 16 × 2
#>    instnm      num_events
#>    <chr>            <int>
#>  1 Arkansas           994
#>  2 Bama              4258
#>  3 CU Boulder        1439
#>  4 Cinci              679
#>  5 Kansas            1014
#>  6 NC State           640
#>  7 Pitt              1225
#>  8 Rutgers           1135
#>  9 S Illinois         549
#> 10 Stony Brook        730
#> 11 UC Berkeley        879
#> 12 UC Irvine          539
#> 13 UGA                827
#> 14 UM Amherst         908
#> 15 UNL               1397
#> 16 USCC              1467
  • Investigate the object created above
df_event %>% group_by(instnm) %>% summarize(num_events=n()) %>% str()
#> tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ instnm    : chr [1:16] "Arkansas" "Bama" "CU Boulder" "Cinci" ...
#>  $ num_events: int [1:16] 994 4258 1439 679 1014 640 1225 1135 549 730 ...
  #df_event %>% group_by(instnm) %>% summarize(num_events=n()) %>% View()
  • Or we could retain object for later use
event_by_univ <- df_event %>% group_by(instnm) %>% summarize(num_events=n())
str(event_by_univ)
#> tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ instnm    : chr [1:16] "Arkansas" "Bama" "CU Boulder" "Cinci" ...
#>  $ num_events: int [1:16] 994 4258 1439 679 1014 640 1225 1135 549 730 ...
event_by_univ # print
#> # A tibble: 16 × 2
#>    instnm      num_events
#>    <chr>            <int>
#>  1 Arkansas           994
#>  2 Bama              4258
#>  3 CU Boulder        1439
#>  4 Cinci              679
#>  5 Kansas            1014
#>  6 NC State           640
#>  7 Pitt              1225
#>  8 Rutgers           1135
#>  9 S Illinois         549
#> 10 Stony Brook        730
#> 11 UC Berkeley        879
#> 12 UC Irvine          539
#> 13 UGA                827
#> 14 UM Amherst         908
#> 15 UNL               1397
#> 16 USCC              1467
rm(event_by_univ)

Task

  • Count number of recruiting events by institution and event_type

    • within summarize(), we will take advantage of the helper function n()
    • n() “gives the current group size”; i.e., how many observations in current group
    #?n
    df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n())
    #> # A tibble: 80 × 3
    #> # Groups:   instnm [16]
    #>    instnm   event_type  num_events
    #>    <chr>    <chr>            <int>
    #>  1 Arkansas 2yr college         32
    #>  2 Arkansas 4yr college         14
    #>  3 Arkansas other              112
    #>  4 Arkansas private hs         222
    #>  5 Arkansas public hs          614
    #>  6 Bama     2yr college        127
    #>  7 Bama     4yr college        158
    #>  8 Bama     other              608
    #>  9 Bama     private hs         963
    #> 10 Bama     public hs         2402
    #> # ℹ 70 more rows
    
    #investigate object created
    df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% glimpse()
    #> Rows: 80
    #> Columns: 3
    #> Groups: instnm [16]
    #> $ instnm     <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas", "Arkansas",…
    #> $ event_type <chr> "2yr college", "4yr college", "other", "private hs", "publi…
    #> $ num_events <int> 32, 14, 112, 222, 614, 127, 158, 608, 963, 2402, 112, 40, 7…
      #df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% View()

Task

  • Count number of recruiting events by institution and event_type

Note that data frame object created by group_by() and summarize() can be input to graph

#bar chart of number of events, all universities combined
df_event %>% group_by(instnm, event_type) %>% 
  summarize(num_events=n()) %>%
    ggplot(aes(x=event_type, y=num_events)) +  # plot
    ylab("Number of events") + xlab("Event type") +geom_col()

#bar chart of number of events, separete chart for each university
df_event %>% group_by(instnm, event_type) %>% 
  summarize(num_events=n()) %>%
    ggplot(aes(x=event_type, y=num_events)) +  # plot
    ylab("Number of events") + xlab("Event type") + geom_col() +
    coord_flip() + facet_wrap(~ instnm) 

Task. Count number of recruiting events by institution, event_type, and whether event is in- or out-of-state (var=event_inst)

  • Note: in group_by(), the optional drop argument controls whether empty groups dropped. default is drop = TRUE

df_event %>% group_by(instnm, event_type, event_inst) %>% 
  summarize(num_events=n())
#> # A tibble: 159 × 4
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type  event_inst num_events
#>    <chr>    <chr>       <chr>           <int>
#>  1 Arkansas 2yr college In-State           16
#>  2 Arkansas 2yr college Out-State          16
#>  3 Arkansas 4yr college In-State            3
#>  4 Arkansas 4yr college Out-State          11
#>  5 Arkansas other       In-State           27
#>  6 Arkansas other       Out-State          85
#>  7 Arkansas private hs  In-State           21
#>  8 Arkansas private hs  Out-State         201
#>  9 Arkansas public hs   In-State          156
#> 10 Arkansas public hs   Out-State         458
#> # ℹ 149 more rows

df_event %>% group_by(instnm, event_type, event_inst, .drop = TRUE) %>% 
  summarize(num_events=n()) 
#> # A tibble: 159 × 4
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type  event_inst num_events
#>    <chr>    <chr>       <chr>           <int>
#>  1 Arkansas 2yr college In-State           16
#>  2 Arkansas 2yr college Out-State          16
#>  3 Arkansas 4yr college In-State            3
#>  4 Arkansas 4yr college Out-State          11
#>  5 Arkansas other       In-State           27
#>  6 Arkansas other       Out-State          85
#>  7 Arkansas private hs  In-State           21
#>  8 Arkansas private hs  Out-State         201
#>  9 Arkansas public hs   In-State          156
#> 10 Arkansas public hs   Out-State         458
#> # ℹ 149 more rows

df_event %>% 
  group_by(as.factor(instnm), as.factor(event_type), as.factor(event_inst),
           .drop = FALSE) %>% summarize(num_events=n()) %>% arrange(num_events)
#> # A tibble: 160 × 4
#> # Groups:   as.factor(instnm), as.factor(event_type) [80]
#>    `as.factor(instnm)` `as.factor(event_type)` as.factor(event_inst…¹ num_events
#>    <fct>               <fct>                   <fct>                       <int>
#>  1 UGA                 2yr college             Out-State                       0
#>  2 UC Berkeley         2yr college             Out-State                       1
#>  3 USCC                2yr college             In-State                        2
#>  4 Arkansas            4yr college             In-State                        3
#>  5 Kansas              4yr college             In-State                        3
#>  6 NC State            2yr college             Out-State                       3
#>  7 Rutgers             4yr college             In-State                        3
#>  8 Stony Brook         4yr college             In-State                        3
#>  9 UC Irvine           2yr college             Out-State                       3
#> 10 UGA                 2yr college             In-State                        3
#> # ℹ 150 more rows
#> # ℹ abbreviated name: ¹​`as.factor(event_inst)`
# .drop=FALSE affects only grouping columns that are coded as factors
# combinations that include non-factor grouping variables are still 
# silently dropped even with .drop=FALSE.

Make a graph, showing in/out state as fill color of bar

df_event %>% group_by(instnm, event_type, event_inst) %>% 
  summarize(num_events=n()) %>%
    ggplot(aes(x=event_type, y=num_events, fill = event_inst)) +  # plot
    ylab("Number of events") + xlab("Event type") + geom_col() +
    coord_flip() + facet_wrap(~ instnm) 

Task

  • By university, event type, event_inst count the number of events and calculate the avg. pct white in the zip-code
df_event %>% group_by(instnm, event_type, event_inst) %>% 
  summarize(num_events=n(),
    mean_pct_white=mean(pct_white_zip, na.rm = TRUE)
  )
#> # A tibble: 159 × 5
#> # Groups:   instnm, event_type [80]
#>    instnm   event_type  event_inst num_events mean_pct_white
#>    <chr>    <chr>       <chr>           <int>          <dbl>
#>  1 Arkansas 2yr college In-State           16           68.3
#>  2 Arkansas 2yr college Out-State          16           71.4
#>  3 Arkansas 4yr college In-State            3           75.6
#>  4 Arkansas 4yr college Out-State          11           63.9
#>  5 Arkansas other       In-State           27           67.4
#>  6 Arkansas other       Out-State          85           58.6
#>  7 Arkansas private hs  In-State           21           71.1
#>  8 Arkansas private hs  Out-State         201           62.3
#>  9 Arkansas public hs   In-State          156           77.6
#> 10 Arkansas public hs   Out-State         458           68.7
#> # ℹ 149 more rows

#investigate object you created
df_event %>% group_by(instnm, event_type, event_inst) %>% 
  summarize(num_events=n(),
    mean_pct_white=mean(pct_white_zip, na.rm = FALSE)
  ) %>% glimpse()
#> Rows: 159
#> Columns: 5
#> Groups: instnm, event_type [80]
#> $ instnm         <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas", "Arkans…
#> $ event_type     <chr> "2yr college", "2yr college", "4yr college", "4yr colle…
#> $ event_inst     <chr> "In-State", "Out-State", "In-State", "Out-State", "In-S…
#> $ num_events     <int> 16, 16, 3, 11, 27, 85, 21, 201, 156, 458, 66, 61, 15, 1…
#> $ mean_pct_white <dbl> 68.27347, NA, NA, NA, 67.39773, NA, NA, NA, NA, NA, NA,…

Recruiting events by UC Berkeley

df_event %>% filter(univ_id == 110635) %>% 
  group_by(event_type) %>% summarize(num_events=n())
#> # A tibble: 5 × 2
#>   event_type  num_events
#>   <chr>            <int>
#> 1 2yr college        112
#> 2 4yr college         18
#> 3 other              138
#> 4 private hs         169
#> 5 public hs          442

Let’s create a dataset of recruiting events at UC Berkeley

event_berk <- df_event %>% filter(univ_id == 110635)

event_berk %>% count(event_type)
#> # A tibble: 5 × 2
#>   event_type      n
#>   <chr>       <int>
#> 1 2yr college   112
#> 2 4yr college    18
#> 3 other         138
#> 4 private hs    169
#> 5 public hs     442

3.1 summarize() and Counts

The count function n() takes no arguments and returns the size of the current group

event_berk %>% group_by(event_type, event_inst) %>% 
  summarize(num_events=n())
#> # A tibble: 10 × 3
#> # Groups:   event_type [5]
#>    event_type  event_inst num_events
#>    <chr>       <chr>           <int>
#>  1 2yr college In-State          111
#>  2 2yr college Out-State           1
#>  3 4yr college In-State           14
#>  4 4yr college Out-State           4
#>  5 other       In-State           49
#>  6 other       Out-State          89
#>  7 private hs  In-State           35
#>  8 private hs  Out-State         134
#>  9 public hs   In-State          259
#> 10 public hs   Out-State         183

Because counts are so important, dplyr package includes separate count() function that can be called outside summarize() function

event_berk %>% group_by(event_type, event_inst) %>% count()
#> # A tibble: 10 × 3
#> # Groups:   event_type, event_inst [10]
#>    event_type  event_inst     n
#>    <chr>       <chr>      <int>
#>  1 2yr college In-State     111
#>  2 2yr college Out-State      1
#>  3 4yr college In-State      14
#>  4 4yr college Out-State      4
#>  5 other       In-State      49
#>  6 other       Out-State     89
#>  7 private hs  In-State      35
#>  8 private hs  Out-State    134
#>  9 public hs   In-State     259
#> 10 public hs   Out-State    183

3.1.1 summarize(): count with logical vectors and sum()

Logical vectors have values TRUE and FALSE.

  • When used with numeric functions, TRUE converted to 1 and FALSE to 0.

sum() is a numeric function that returns the sum of values

sum(c(5,10))
#> [1] 15
sum(c(TRUE,TRUE,FALSE,FALSE))
#> [1] 2

is.na() returns TRUE if value is NA and otherwise returns FALSE

is.na(c(5,NA,4,NA))
#> [1] FALSE  TRUE FALSE  TRUE
sum(is.na(c(5,NA,4,NA,5)))
#> [1] 2
sum(!is.na(c(5,NA,4,NA,5)))
#> [1] 3

Application: How many missing/non-missing obs in variable [very important]

event_berk %>% group_by(event_type) %>% 
  summarize(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc)),
    n_nonmiss_fr_lunch = sum(!is.na(fr_lunch))
  )
#> # A tibble: 5 × 5
#>   event_type  n_events n_miss_inc n_nonmiss_inc n_nonmiss_fr_lunch
#>   <chr>          <int>      <int>         <int>              <int>
#> 1 2yr college      112          5           107                  0
#> 2 4yr college       18          5            13                  0
#> 3 other            138          8           130                  0
#> 4 private hs       169          2           167                  0
#> 5 public hs        442          3           439                437

3.1.2 summarize() and count() student exercise

Use one code chunk for this exercise. You could tackle this a step at a time and run the entire code chunk when you have answered all parts of this question. Create your own variable names.

  1. Using the event_berk object, filter observations where event_state is VA and group by event_type.
    1. Using the summarize function to create a variable that represents the count for each event_type.
    2. Create a variable that represents the sum of missing obs for med_inc.
    3. Create a variable that represents the sum of non-missing obs for med_inc.
    4. Bonus: Arrange variable you created representing the count of each event_type in descending order.
Solutions using summarize() and count()
  1. Using the event_berk object filter observations where event_state is VA and group by event_type.
    1. Using the summarize function, create a variable that represents the count for each event_type.
    2. Now get the sum of missing obs for med_inc.
    3. Now get the sum of non-missing obs for med_inc.
event_berk %>%
  filter(event_state == "VA") %>%
  group_by(event_type) %>%
  summarize(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc))) %>%
  arrange(desc(n_events))
#> # A tibble: 3 × 4
#>   event_type n_events n_miss_inc n_nonmiss_inc
#>   <chr>         <int>      <int>         <int>
#> 1 public hs        20          0            20
#> 2 private hs       13          0            13
#> 3 other             3          0             3


3.2 summarize() and means

summarize(): means

The mean() function within summarize() calculates means, separately for each group

event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  mean_inc=mean(med_inc, na.rm = TRUE),
  mean_pct_white=mean(pct_white_zip, na.rm = TRUE))
#> # A tibble: 10 × 5
#> # Groups:   event_inst [2]
#>    event_inst event_type  n_events mean_inc mean_pct_white
#>    <chr>      <chr>          <int>    <dbl>          <dbl>
#>  1 In-State   2yr college      111   78486.           40.1
#>  2 In-State   4yr college       14  131691.           58.0
#>  3 In-State   other             49   75040.           37.6
#>  4 In-State   private hs        35   95229.           48.4
#>  5 In-State   public hs        259   87097.           39.6
#>  6 Out-State  2yr college        1  153070.           89.7
#>  7 Out-State  4yr college        4   76913.           65.8
#>  8 Out-State  other             89   69004.           56.5
#>  9 Out-State  private hs       134   87654.           64.3
#> 10 Out-State  public hs        183  103603.           62.0

3.2.1 summarize(): means and na.rm argument

Default behavior of “aggregation functions” (e.g., summarize())

  • if input has any missing values (NA), than output will be missing.

Many functions have argument na.rm (means “remove NAs”)

  • na.rm = FALSE [the default for mean()]
    • Do not remove missing values from input before calculating
    • Therefore, missing values in input will cause output to be missing
  • na.rm = TRUE
    • Remove missing values from input before calculating
    • Therefore, missing values in input will not cause output to be missing
#na.rm = FALSE; the default setting
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  n_miss_inc = sum(is.na(med_inc)),
  mean_inc=mean(med_inc, na.rm = FALSE),
  n_miss_frlunch = sum(is.na(fr_lunch)),
  mean_fr_lunch=mean(fr_lunch, na.rm = FALSE))
#> # A tibble: 10 × 7
#> # Groups:   event_inst [2]
#>    event_inst event_type  n_events n_miss_inc mean_inc n_miss_frlunch
#>    <chr>      <chr>          <int>      <int>    <dbl>          <int>
#>  1 In-State   2yr college      111          5      NA             111
#>  2 In-State   4yr college       14          4      NA              14
#>  3 In-State   other             49          2      NA              49
#>  4 In-State   private hs        35          0   95229.             35
#>  5 In-State   public hs        259          3      NA               0
#>  6 Out-State  2yr college        1          0  153070.              1
#>  7 Out-State  4yr college        4          1      NA               4
#>  8 Out-State  other             89          6      NA              89
#>  9 Out-State  private hs       134          2      NA             134
#> 10 Out-State  public hs        183          0  103603.              5
#> # ℹ 1 more variable: mean_fr_lunch <dbl>
# na.rm = TRUE
event_berk %>% group_by(event_inst, event_type) %>% summarize(
  n_events=n(),
  n_miss_inc = sum(is.na(med_inc)),
  mean_inc=mean(med_inc, na.rm = TRUE),
  n_miss_frlunch = sum(is.na(fr_lunch)),
  mean_fr_lunch=mean(fr_lunch, na.rm = TRUE))
#> # A tibble: 10 × 7
#> # Groups:   event_inst [2]
#>    event_inst event_type  n_events n_miss_inc mean_inc n_miss_frlunch
#>    <chr>      <chr>          <int>      <int>    <dbl>          <int>
#>  1 In-State   2yr college      111          5   78486.            111
#>  2 In-State   4yr college       14          4  131691.             14
#>  3 In-State   other             49          2   75040.             49
#>  4 In-State   private hs        35          0   95229.             35
#>  5 In-State   public hs        259          3   87097.              0
#>  6 Out-State  2yr college        1          0  153070.              1
#>  7 Out-State  4yr college        4          1   76913.              4
#>  8 Out-State  other             89          6   69004.             89
#>  9 Out-State  private hs       134          2   87654.            134
#> 10 Out-State  public hs        183          0  103603.              5
#> # ℹ 1 more variable: mean_fr_lunch <dbl>

3.2.2 Student exercise

  1. Using the event_berk object, group by instnm, event_inst, & event_type.
    1. Create vars for number non_missing for these racial/ethnic groups (pct_white_zip, pct_black_zip, pct_asian_zip, pct_hispanic_zip, pct_amerindian_zip, pct_nativehawaii_zip)
    2. Create vars for mean percent for each racial/ethnic group
Solutions
event_berk %>% group_by(instnm, event_inst, event_type) %>%
  summarize(
  n_events=n(),
  n_miss_white = sum(!is.na(pct_white_zip)),
  mean_white = mean(pct_white_zip, na.rm = TRUE),
  n_miss_black = sum(!is.na(pct_black_zip)),
  mean_black = mean(pct_black_zip, na.rm = TRUE),
  n_miss_asian = sum(!is.na(pct_asian_zip)),
  mean_asian = mean(pct_asian_zip, na.rm = TRUE),
  n_miss_lat = sum(!is.na(pct_hispanic_zip)),
  mean_lat = mean(pct_hispanic_zip, na.rm = TRUE),
  n_miss_na = sum(!is.na(pct_amerindian_zip)),
  mean_na = mean(pct_amerindian_zip, na.rm = TRUE),
  n_miss_nh = sum(!is.na(pct_nativehawaii_zip)),
  mean_nh = mean(pct_nativehawaii_zip, na.rm = TRUE)) %>%
  head(6)
#> # A tibble: 6 × 16
#> # Groups:   instnm, event_inst [2]
#>   instnm     event_inst event_type n_events n_miss_white mean_white n_miss_black
#>   <chr>      <chr>      <chr>         <int>        <int>      <dbl>        <int>
#> 1 UC Berkel… In-State   2yr colle…      111          106       40.1          106
#> 2 UC Berkel… In-State   4yr colle…       14           12       58.0           12
#> 3 UC Berkel… In-State   other            49           48       37.6           48
#> 4 UC Berkel… In-State   private hs       35           35       48.4           35
#> 5 UC Berkel… In-State   public hs       259          258       39.6          258
#> 6 UC Berkel… Out-State  2yr colle…        1            1       89.7            1
#> # ℹ 9 more variables: mean_black <dbl>, n_miss_asian <int>, mean_asian <dbl>,
#> #   n_miss_lat <int>, mean_lat <dbl>, n_miss_na <int>, mean_na <dbl>,
#> #   n_miss_nh <int>, mean_nh <dbl>


3.3 summarize() and logical vectors, part II

Logical vectors (e.g., is.na()) useful for counting obs that satisfy some condition

is.na(c(5,NA,4,NA))
#> [1] FALSE  TRUE FALSE  TRUE
typeof(is.na(c(5,NA,4,NA)))
#> [1] "logical"
sum(is.na(c(5,NA,4,NA)))
#> [1] 2

Task: Using object event_berk, calculate the following measures for each combination of event_type and event_inst:

  • count of number of rows for each group
  • count of rows non-missing for both pct_black_zip and pct_hispanic_zip
  • count of number of visits to communities where the sum of Black and Latinx people comprise more than 50% of the total population
event_berk %>% group_by (event_inst, event_type) %>% summarize(
    n_events=n(),
    n_nonmiss_latbl = sum(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)),
    n_majority_latbl= sum(pct_black_zip+ pct_hispanic_zip>50, na.rm = TRUE)
  )
#> # A tibble: 10 × 5
#> # Groups:   event_inst [2]
#>    event_inst event_type  n_events n_nonmiss_latbl n_majority_latbl
#>    <chr>      <chr>          <int>           <int>            <int>
#>  1 In-State   2yr college      111             106               30
#>  2 In-State   4yr college       14              12                0
#>  3 In-State   other             49              48               15
#>  4 In-State   private hs        35              35                6
#>  5 In-State   public hs        259             258               88
#>  6 Out-State  2yr college        1               1                0
#>  7 Out-State  4yr college        4               3                0
#>  8 Out-State  other             89              86               20
#>  9 Out-State  private hs       134             132               19
#> 10 Out-State  public hs        183             183               27

(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)) is a logical condition

  • condition is TRUE (evaluates to 1) or FALSE (evaluates to 0) for each obs

3.3.1 summarize(): logical vectors to count proportions

Synatx: group_by(vars) %>% summarize(prop = mean(TRUE/FALSE conditon))

Task: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?

Steps:

  1. Filter public HS visits
  2. group by in-state vs. out-of-state
  3. Create measure
event_berk %>% filter(event_type == "public hs") %>% # filter public hs visits
  group_by (event_inst) %>% # group by in-state vs. out-of-state
  summarize(
    n_events=n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # w/ nonmissings values median inc,
    p_incgt100k = mean(med_inc>100000, na.rm=TRUE)) # proportion visits to $100K+ commmunities 
#> # A tibble: 2 × 4
#>   event_inst n_events n_nonmiss_inc p_incgt100k
#>   <chr>         <int>         <int>       <dbl>
#> 1 In-State        259           256       0.273
#> 2 Out-State       183           183       0.519

What if we forgot to put na.rm=TRUE in the above task?

**Task**: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?

event_berk %>% filter(event_type == "public hs") %>% # filter public hs visits
  group_by (event_inst) %>% # group by in-state vs. out-of-state
  summarize(
    n_events=n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # w/ nonmissings values median inc,
    p_incgt100k = mean(med_inc>100000, na.rm=FALSE)) # proportion visits to $100K+ commmunities 
#> # A tibble: 2 × 4
#>   event_inst n_events n_nonmiss_inc p_incgt100k
#>   <chr>         <int>         <int>       <dbl>
#> 1 In-State        259           256      NA    
#> 2 Out-State       183           183       0.519

3.3.2 summarize(): Other “helper” functions

Lots of other functions we can use within summarize()

Common functions to use with summarize():

Function Description
n count
n_distinct count unique values
mean mean
median median
max largest value
min smallest value
sd standard deviation
sum sum of values
first first value
last last value
nth nth value
any condition true for at least one value

Note: These functions can also be used on their own or with mutate()

Maximum value in a group

max(c(10,50,8))
#> [1] 50

Task: For each combination of in-state/out-of-state and event type, what is the maximum value of med_inc?

event_berk %>% group_by(event_type, event_inst) %>% 
  summarize(max_inc = max(med_inc)) # oops, we forgot to remove NAs!

event_berk %>% group_by(event_type, event_inst) %>% 
  summarize(max_inc = max(med_inc, na.rm = TRUE))

Isolate first/last/nth observation in a group

x <- c(10,15,20,25,30)
first(x)
#> [1] 10
last(x)
#> [1] 30
nth(x,1)
#> [1] 10
nth(x,3)
#> [1] 20
nth(x,10)
#> [1] NA

Task: after sorting object event_berk by event_type and event_datetime_start, what is the value of event_date for:

  • first event for each event type?
  • the last event for each event type?
  • the 50th event for each event type?
event_berk %>% arrange(event_type, event_datetime_start) %>%
  group_by(event_type) %>%
  summarize(
    n_events = n(),
    date_first= first(event_date),
    date_last= last(event_date),
    date_50th= nth(event_date, 50)
  )
#> # A tibble: 5 × 5
#>   event_type  n_events date_first date_last  date_50th 
#>   <chr>          <int> <date>     <date>     <date>    
#> 1 2yr college      112 2017-04-25 2017-11-16 2017-10-04
#> 2 4yr college       18 2017-04-30 2017-11-16 NA        
#> 3 other            138 2017-04-11 2017-10-26 2017-09-09
#> 4 private hs       169 2017-04-23 2017-10-25 2017-09-26
#> 5 public hs        442 2017-04-14 2017-12-15 2017-09-16

3.3.3 Student exercise

Identify value of event_date for the nth event in each by group

Specific task:

  • arrange (i.e., sort) by event_type and event_datetme_start, then group by event_type, and then identify the value of event_date for:
    • the first event in each by group (event_type)
    • the second event in each by group
    • the third event in each by group
    • the fourth event in each by group
    • the fifth event in each by group
Student exercise solutions
event_berk %>% arrange(event_type, event_datetime_start) %>%
  group_by(event_type) %>%
  summarize(
    n_events = n(),
    date_1st= first(event_date),
    date_2nd= nth(event_date,2),
    date_3rd= nth(event_date,3),
    date_4th= nth(event_date,4),
    date_5th= nth(event_date,5))
#> # A tibble: 5 × 7
#>   event_type  n_events date_1st   date_2nd   date_3rd   date_4th   date_5th  
#>   <chr>          <int> <date>     <date>     <date>     <date>     <date>    
#> 1 2yr college      112 2017-04-25 2017-09-05 2017-09-05 2017-09-06 2017-09-06
#> 2 4yr college       18 2017-04-30 2017-05-01 2017-05-06 2017-09-13 2017-09-14
#> 3 other            138 2017-04-11 2017-04-23 2017-04-25 2017-04-29 2017-05-14
#> 4 private hs       169 2017-04-23 2017-04-24 2017-04-29 2017-04-30 2017-09-05
#> 5 public hs        442 2017-04-14 2017-04-24 2017-04-26 2017-04-27 2017-04-27


3.4 count() in place of group_by() and summarise()

Recall from Section 3.1, the count() function returns the count of unique values of a particular variable. For example, if we count by event_state, we get the count of visits by state. This is the same if we were to group_by() event_state and summarise() to get the count of events by state.

event_berk %>%
  group_by(event_state) %>%
  summarise(n_state = n())
#> # A tibble: 35 × 2
#>    event_state n_state
#>    <chr>         <int>
#>  1 AL                2
#>  2 AR                3
#>  3 AZ               18
#>  4 CA              468
#>  5 CO                3
#>  6 CT                3
#>  7 DC                9
#>  8 FL               33
#>  9 GA               33
#> 10 HI                1
#> # ℹ 25 more rows

event_berk %>%
  count(event_state)
#> # A tibble: 35 × 2
#>    event_state     n
#>    <chr>       <int>
#>  1 AL              2
#>  2 AR              3
#>  3 AZ             18
#>  4 CA            468
#>  5 CO              3
#>  6 CT              3
#>  7 DC              9
#>  8 FL             33
#>  9 GA             33
#> 10 HI              1
#> # ℹ 25 more rows

If we want to calculate the percent of UC Berkeley’s visits across states, we can use the count() and mutate() functions instead of group_by() and summarize().

event_berk %>%
  group_by(event_state) %>% #group_by event state
  summarise(n_state = n()) %>% #summarize and get a count of the # of events by state
  mutate(pct_state = (n_state / sum(n_state)) * 100) #create percent variable of visits by UC Berkeley (e.g., 53% of visits in California)
#> # A tibble: 35 × 3
#>    event_state n_state pct_state
#>    <chr>         <int>     <dbl>
#>  1 AL                2     0.228
#>  2 AR                3     0.341
#>  3 AZ               18     2.05 
#>  4 CA              468    53.2  
#>  5 CO                3     0.341
#>  6 CT                3     0.341
#>  7 DC                9     1.02 
#>  8 FL               33     3.75 
#>  9 GA               33     3.75 
#> 10 HI                1     0.114
#> # ℹ 25 more rows
  
event_berk %>%
  count(event_state) %>% #count of visits by UC Berkeley by state
  mutate(pct_state = (n / sum(n)) * 100) #percent of visits by UC Berkeley across states
#> # A tibble: 35 × 3
#>    event_state     n pct_state
#>    <chr>       <int>     <dbl>
#>  1 AL              2     0.228
#>  2 AR              3     0.341
#>  3 AZ             18     2.05 
#>  4 CA            468    53.2  
#>  5 CO              3     0.341
#>  6 CT              3     0.341
#>  7 DC              9     1.02 
#>  8 FL             33     3.75 
#>  9 GA             33     3.75 
#> 10 HI              1     0.114
#> # ℹ 25 more rows

4 Summarize across multiple columns

4.1 What are column-wise operations?

across() allows you to perform the same operation on multiple columns.

Description - across() apply the same transformation to multiple columns.

Syntax - across(.cols, .fns, .names)

Arguments

  • .cols Columns to transform
  • .fns Function to apply to each of the selected columns. Some values include:
    • A function, e.g. mean()
    • A purr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
    • A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~sum(is.na(.x)))
  • .names A glue specification that describes how to name the output columns. Use {.col} to stand for the selected column name and {.fn} to stand for the name of the function being applied, e.g. “{.col}_{.fn}

4.2 across() affects every variable

Task:

  • For U. Pittsburgh (univ_id = 215293) events at public and private high schools, calculate the mean value of med_inc and pct_white_zip for each combination of event_type and event_inst
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), mean))
#> # A tibble: 4 × 4
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc pct_white_zip
#>   <chr>      <chr>        <dbl>         <dbl>
#> 1 private hs In-State        NA          78.9
#> 2 private hs Out-State       NA          NA  
#> 3 public hs  In-State        NA          NA  
#> 4 public hs  Out-State       NA          NA

Try again, this time applying na.rm = TRUE - this is an example of a purr style lambda ~ mean(.x, na.rm = TRUE) argument “for the function calls in .fns.”

df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), ~mean(.x, na.rm = TRUE)))
#> # A tibble: 4 × 4
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc pct_white_zip
#>   <chr>      <chr>        <dbl>         <dbl>
#> 1 private hs In-State    77115.          78.9
#> 2 private hs Out-State  103915.          63.3
#> 3 public hs  In-State    78408.          83.0
#> 4 public hs  Out-State  114212.          67.5

Task:

  • For U. Pittsburgh (univ_id = 215293) events at public and private high schools, calculate mean and standard deviation of med_inc and pct_white_zip for each combination of event_type and event_inst

  • You can create a named list of functions to supply into the second argument .fns

mean_sd <- list(
  mean = ~mean(.x, na.rm = TRUE), 
  sd = ~sd(.x, na.rm = TRUE)
)

df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), mean_sd))
#> # A tibble: 4 × 6
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#>   <chr>      <chr>             <dbl>      <dbl>              <dbl>
#> 1 private hs In-State         77115.     36559.               78.9
#> 2 private hs Out-State       103915.     44220.               63.3
#> 3 public hs  In-State         78408.     25841.               83.0
#> 4 public hs  Out-State       114212.     39745.               67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>

Use this syntax to control variable name suffixes:

  • .names = "{.col}_{.fn}"
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), mean_sd, .names = "{.col}_{.fn}"))
#> # A tibble: 4 × 6
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#>   <chr>      <chr>             <dbl>      <dbl>              <dbl>
#> 1 private hs In-State         77115.     36559.               78.9
#> 2 private hs Out-State       103915.     44220.               63.3
#> 3 public hs  In-State         78408.     25841.               83.0
#> 4 public hs  Out-State       114212.     39745.               67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>
  • .names = "{.fn}_{.col}" Change the order of the name to include function first and then the column name
df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), mean_sd, .names = "{.fn}_{.col}"))
#> # A tibble: 4 × 6
#> # Groups:   event_type [2]
#>   event_type event_inst mean_med_inc sd_med_inc mean_pct_white_zip
#>   <chr>      <chr>             <dbl>      <dbl>              <dbl>
#> 1 private hs In-State         77115.     36559.               78.9
#> 2 private hs Out-State       103915.     44220.               63.3
#> 3 public hs  In-State         78408.     25841.               83.0
#> 4 public hs  Out-State       114212.     39745.               67.5
#> # ℹ 1 more variable: sd_pct_white_zip <dbl>

Task:

  • Same task as before, but now calculate mean, standard deviation, min, and max of med_inc and pct_white_zip for each combination of event_type and event_inst
desc_stat <- list(
  mean = ~mean(.x, na.rm = TRUE), 
  sd = ~sd(.x, na.rm = TRUE),
  low = ~min(.x, na.rm = TRUE),
  high = ~max(.x, na.rm=TRUE)
)

df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(c(med_inc, pct_white_zip), desc_stat, .names = "{.col}_{.fn}"))
#> # A tibble: 4 × 10
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_sd med_inc_low med_inc_high
#>   <chr>      <chr>             <dbl>      <dbl>       <dbl>        <dbl>
#> 1 private hs In-State         77115.     36559.      12894.      224163 
#> 2 private hs Out-State       103915.     44220.      29630.      223556.
#> 3 public hs  In-State         78408.     25841.      23168.      169036.
#> 4 public hs  Out-State       114212.     39745.      21581       250001 
#> # ℹ 4 more variables: pct_white_zip_mean <dbl>, pct_white_zip_sd <dbl>,
#> #   pct_white_zip_low <dbl>, pct_white_zip_high <dbl>

4.2.1 across(), quosure style lambdas ~ func_name(.x)

Syntax: across(.cols, .fns, .names)

  • .fns Function to apply to each of the selected columns. Some values include:
    • A function, e.g. mean(); A purr-style lambda, e.g. ~ mean(.x, na.rm = TRUE); A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~sum(is.na(.x)))

Task: Calculate mean, number of obs, and number of non-missing obs for variables

  • Can exclude n() from the .cols argument in across() function because you only need to calculate once to get the number of observations for each group.
mean_obs_nmiss <- list(
  mean = ~mean(.x, na.rm = TRUE), 
  n_non_miss = ~sum(!is.na(.x))
)

df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,med_inc,pop_total) %>%
  group_by(event_type,event_inst) %>%
  summarize(nrow = n(), across(c(med_inc, pop_total), mean_obs_nmiss,.names = "{.col}_{.fn}"))
#> # A tibble: 4 × 7
#> # Groups:   event_type [2]
#>   event_type event_inst  nrow med_inc_mean med_inc_n_non_miss pop_total_mean
#>   <chr>      <chr>      <int>        <dbl>              <int>          <dbl>
#> 1 private hs In-State      51       77115.                 50         25320.
#> 2 private hs Out-State    209      103915.                204         29771.
#> 3 public hs  In-State     211       78408.                210         21922.
#> 4 public hs  Out-State    559      114212.                557         30191.
#> # ℹ 1 more variable: pop_total_n_non_miss <int>

4.2.2 across() and where() affects selected variables if they meet a condition

Syntax: across(where(fn)), .fns, .names)

  • fn A function that returns TRUE or FALSE
  • .fns Function to apply to each of the selected columns. Some values include:
    • A function, e.g. mean(); A purr-style lambda, e.g. ~ mean(.x, na.rm = TRUE); A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~sum(is.na(.x)))

Task: For U. Pittsburgh events at public and private high schools, caclulate mean, min, and max of variables med_inc and event_date for each combination of event_type and event_inst

mean_min_max <- list(
  mean = ~mean(.x, na.rm = TRUE),
  low = ~min(.x, na.rm=TRUE),
  high = ~max(.x, na.rm=TRUE)
)

df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, event_date) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(where(is.double), mean_min_max))
#> # A tibble: 4 × 8
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_low med_inc_high event_date_mean
#>   <chr>      <chr>             <dbl>       <dbl>        <dbl> <date>         
#> 1 private hs In-State         77115.      12894.      224163  2017-09-16     
#> 2 private hs Out-State       103915.      29630.      223556. 2017-08-24     
#> 3 public hs  In-State         78408.      23168.      169036. 2017-09-19     
#> 4 public hs  Out-State       114212.      21581       250001  2017-08-20     
#> # ℹ 2 more variables: event_date_low <date>, event_date_high <date>
  • Although event_date is a Date class, it is type double and therefore we can apply is.double to the where() function to calculate the mean, min, and max.

Task: For events by U. Pittsburgh at public and private high schools, caclulate mean and standard deviation for numeric variables

#First, which vars are numeric
df_event %>% 
  select(event_type, event_inst,instnm,school_id,med_inc,pct_white_zip) %>%
  glimpse()
#> Rows: 18,680
#> Columns: 6
#> $ event_type    <chr> "public hs", "public hs", "public hs", "public hs", "pub…
#> $ event_inst    <chr> "In-State", "In-State", "In-State", "In-State", "Out-Sta…
#> $ instnm        <chr> "UM Amherst", "UM Amherst", "UM Amherst", "UM Amherst", …
#> $ school_id     <chr> "250192000042", "250243000134", "250366000496", "2503660…
#> $ med_inc       <dbl> 71713.5, 89121.5, 70136.5, 70136.5, 71023.5, 71023.5, 71…
#> $ pct_white_zip <dbl> 73.6803, 91.3555, 79.4345, 79.4345, 88.6526, 88.6526, 88…

df_event %>% 
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst,instnm,school_id,med_inc,pct_white_zip) %>%
  group_by(event_type,event_inst) %>%
  summarize(across(where(is.numeric), mean_sd))
#> # A tibble: 4 × 6
#> # Groups:   event_type [2]
#>   event_type event_inst med_inc_mean med_inc_sd pct_white_zip_mean
#>   <chr>      <chr>             <dbl>      <dbl>              <dbl>
#> 1 private hs In-State         77115.     36559.               78.9
#> 2 private hs Out-State       103915.     44220.               63.3
#> 3 public hs  In-State         78408.     25841.               83.0
#> 4 public hs  Out-State       114212.     39745.               67.5
#> # ℹ 1 more variable: pct_white_zip_sd <dbl>

5 Attach aggregate measures to your data frame

We can attach aggregate measures to a data frame by using group_by without summarize()

What do I mean by “attaching aggregate measures to a data frame”?

  • Calculate measures at the by_group level, but attach them to original object rather than creating an object with one row for each by_group

Task: Using event_berk data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type

  • resulting object should have same number of observations as event_berk

Steps:

  1. create measure of avg. income across all events without using group_by() or summarize() and assign as (new) object
  2. Using object from previous step, create measure of avg. income across by event type using group_by() without summarize() and assign as new object

Task: Using event_berk data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type

  1. Create measure of average income across all events
event_berk_temp <- event_berk %>% 
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type,med_inc) %>% # select vars to be retained (optioanl) 
  mutate(avg_inc = mean(med_inc, na.rm=TRUE)) # create avg. inc measure

dim(event_berk_temp)
#> [1] 879   4
event_berk_temp %>% head(5)
#> # A tibble: 5 × 4
#>   event_date event_type med_inc avg_inc
#>   <date>     <chr>        <dbl>   <dbl>
#> 1 2017-04-11 other       62022.  88124.
#> 2 2017-04-14 public hs  125056.  88124.
#> 3 2017-04-23 private hs  78897   88124.
#> 4 2017-04-23 other       55127   88124.
#> 5 2017-04-24 private hs 122307   88124.
  1. Create measure of average income by event type
event_berk_temp <- event_berk_temp %>% 
  group_by(event_type) %>% # grouping by event type
  mutate(avg_inc_type = mean(med_inc, na.rm=TRUE)) # create avg. inc measure
  
str(event_berk_temp)
#> gropd_df [879 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
#>  $ event_date  : Date[1:879], format: "2017-04-11" "2017-04-14" ...
#>  $ event_type  : chr [1:879] "other" "public hs" "private hs" "other" ...
#>  $ med_inc     : num [1:879] 62022 125056 78897 55127 122307 ...
#>  $ avg_inc     : num [1:879] 88124 88124 88124 88124 88124 ...
#>  $ avg_inc_type: num [1:879] 71186 93978 89242 71186 89242 ...
#>  - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ event_type: chr [1:5] "2yr college" "4yr college" "other" "private hs" ...
#>   ..$ .rows     : list<int> [1:5] 
#>   .. ..$ : int [1:112] 8 65 67 76 78 81 88 105 110 112 ...
#>   .. ..$ : int [1:18] 15 16 17 129 149 158 312 373 462 508 ...
#>   .. ..$ : int [1:138] 1 4 7 13 18 19 21 22 23 24 ...
#>   .. ..$ : int [1:169] 3 5 12 14 60 62 68 70 71 72 ...
#>   .. ..$ : int [1:442] 2 6 9 10 11 20 61 63 75 77 ...
#>   .. ..@ ptype: int(0) 
#>   ..- attr(*, ".drop")= logi TRUE
event_berk_temp %>% head(5)
#> # A tibble: 5 × 5
#> # Groups:   event_type [3]
#>   event_date event_type med_inc avg_inc avg_inc_type
#>   <date>     <chr>        <dbl>   <dbl>        <dbl>
#> 1 2017-04-11 other       62022.  88124.       71186.
#> 2 2017-04-14 public hs  125056.  88124.       93978.
#> 3 2017-04-23 private hs  78897   88124.       89242.
#> 4 2017-04-23 other       55127   88124.       71186.
#> 5 2017-04-24 private hs 122307   88124.       89242.

Task: Using event_berk_temp from previous question, create a measure that identifies whether med_inc associated with the event is higher/lower than average income for all events of that type

Steps:

  1. Create measure of average income for each event type [already done]
  2. Create a 0/1 indicator that identifies whether the median income at the event location is higher than the average median income for events of that type.
# average income at recruiting events across all universities
event_berk_tempv2 <- event_berk_temp %>% 
  mutate(gt_avg_inc_type = med_inc > avg_inc_type) %>% 
  select(-(avg_inc)) # drop avg_inc (optional)
event_berk_tempv2 # note how med_ic = NA are treated
#> # A tibble: 879 × 5
#> # Groups:   event_type [5]
#>    event_date event_type  med_inc avg_inc_type gt_avg_inc_type
#>    <date>     <chr>         <dbl>        <dbl> <lgl>          
#>  1 2017-04-11 other        62022.       71186. FALSE          
#>  2 2017-04-14 public hs   125056.       93978. TRUE           
#>  3 2017-04-23 private hs   78897        89242. FALSE          
#>  4 2017-04-23 other        55127        71186. FALSE          
#>  5 2017-04-24 private hs  122307        89242. TRUE           
#>  6 2017-04-24 public hs   133966        93978. TRUE           
#>  7 2017-04-25 other       114580        71186. TRUE           
#>  8 2017-04-25 2yr college 103600.       79183. TRUE           
#>  9 2017-04-26 public hs    38516.       93978. FALSE          
#> 10 2017-04-27 public hs    49950.       93978. FALSE          
#> # ℹ 869 more rows

Same as above, but this time create integer indicator rather than logical

event_berk_tempv2 <- event_berk_tempv2 %>% 
  mutate(gt_avg_inc_type = as.integer(med_inc > avg_inc_type)) 
event_berk_tempv2  %>% head(4)
#> # A tibble: 4 × 5
#> # Groups:   event_type [3]
#>   event_date event_type med_inc avg_inc_type gt_avg_inc_type
#>   <date>     <chr>        <dbl>        <dbl>           <int>
#> 1 2017-04-11 other       62022.       71186.               0
#> 2 2017-04-14 public hs  125056.       93978.               1
#> 3 2017-04-23 private hs  78897        89242.               0
#> 4 2017-04-23 other       55127        71186.               0

5.0.1 Student exercise

Task: is pct_white_zip at a particular event higher or lower than the average pct_white_zip for that event_type?

  • Note: all events attached to a particular zip_code
  • pct_white_zip: pct of people in that zip_code who identify as white

Steps in task:

  • Create measure of average pct white for each event_type
  • Compare whether pct_white_zip is higher or lower than this average
Student exercise solution

Task: is pct_white_zip at a particular event higher or lower than the average pct_white_zip for that event_type?

event_berk_tempv3 <- event_berk %>% 
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, pct_white_zip) %>% #optional
  group_by(event_type) %>% # grouping by event type
  mutate(avg_pct_white = mean(pct_white_zip, na.rm=TRUE),
         gt_avg_pctwhite_type = as.integer(pct_white_zip > avg_pct_white)) 
event_berk_tempv3 %>% head(4)
#> # A tibble: 4 × 5
#> # Groups:   event_type [3]
#>   event_date event_type pct_white_zip avg_pct_white gt_avg_pctwhite_type
#>   <date>     <chr>              <dbl>         <dbl>                <int>
#> 1 2017-04-11 other               37.2          49.7                    0
#> 2 2017-04-14 public hs           78.3          48.9                    1
#> 3 2017-04-23 private hs          84.7          61.0                    1
#> 4 2017-04-23 other               20.9          49.7                    0