## Missing values

### Missing values

Missing values have the value `NA`

- `NA` is a special keyword, not the same as the character string `"NA"`

Use the `is.na()` function to determine if a value is missing

- `is.na()` returns a logical vector

```{r}
is.na(5)
is.na(NA)
is.na("NA")
typeof(is.na("NA")) # example of a logical vector

nvector <- c(10,5,NA)
is.na(nvector)
typeof(is.na(nvector)) # example of a logical vector

svector <- c("e","f",NA,"NA")
is.na(svector)
```

### Missing values are "contagious"

What does "contagious" mean?

- most operations involving a missing value will yield a missing value

```{r}
7>5
7>NA
sum(1,2,NA)
0==NA
2*c(0,1,2,NA)
NA*c(0,1,2,NA)
```

### Functions and missing values example, `table()`

`table()` function is useful for investigating categorical variables

```{r}
str(df_event$event_type)
table(df_event$event_type)
```

By default `table()` ignores `NA` values

```{r}
#?table
str(df_event$school_type_pri)
table(df_event$school_type_pri)
```

`useNA` argument controls whether `table()` includes counts of `NA`s.
Allowed values:

- never ("no") \[DEFAULT VALUE\]
- only if count is positive ("ifany")
- even for zero counts ("always")

```{r}
nrow(df_event)
table(df_event$school_type_pri, useNA="always")
```

Broader point: Most functions that create descriptive statistics have options about how to treat missing values

- When investigating data, good practice to *always* show missing values

# Subsetting using subset operators

### Subsetting to Extract Elements

"Subsetting" refers to isolating particular elements of an object

Subsetting operators can be used to select/exclude elements (e.g., variables, observations)

- there are three subsetting operators: `[]`, `$`, `[[]]`
- these operators function differently based on vector types (e.g., atomic vectors, lists, data frames)

### Wickham refers to number of "dimensions" in R objects

An atomic vector is a 1-dimensional object that contains n elements

```{r}
x <- c(1.1, 2.2, 3.3, 4.4, 5.5)
str(x)
```

Lists are multi-dimensional objects

- Contains n elements; each element may contain a 1-dimensional atomic vector or a multi-dimensional list. Below list contains 3 dimensions

```{r}
list <- list(c(1,2), list("apple", "orange"))
str(list)
```

Data frames are 2-dimensional lists

- each element is a variable (dimension=columns)
- within each variable, each element is an observation (dimension=rows)

```{r}
ncol(df_school)
nrow(df_school)
```

## Subset atomic vectors using \[\]

**"Subsetting" a vector refers to isolating particular elements of a vector**

- I sometimes refer to this as "accessing elements of a vector"
- subsetting elements of a vector is similar to "filtering" rows of a data-frame
- `[]` is the subsetting function for vectors

Six ways to subset an atomic vector using `[]`

1. Using positive integers to return elements at specified positions
2. Using negative integers to exclude elements at specified positions
3. Using logicals to return elements where corresponding logical is `TRUE`
4. Empty `[]` returns original vector (useful for dataframes)
5. Zero vector \[0\], useful for testing data
6. If vector is "named," use character vectors to return elements with matching names

### 1. Using positive integers to return elements at specified positions (subset atomic vectors using \[\])

Create atomic vector `x`

```{r}
(x <- c(1.1, 2.2, 3.3, 4.4, 5.5))
str(x)
```

`[]` is the subsetting function for vectors

- contents inside `[]` can refer to element number (also called "position").
- e.g., `[3]` refers to contents of 3rd element (or position 3)

```{r}
x[5] #return 5th element

x[c(3, 1)] #return 3rd and 1st element

x[c(4,4,4)] #return 4th element, 4th element, and 4th element

#Return 3rd through 5th element
x[3:5]
```

### 2. Using negative integers to exclude elements at specified positions (subset atomic vectors using \[\])

Before excluding elements based on position, investigate object

```{r}
x
length(x)
str(x)
```

Use negative integers to exclude elements based on element position

```{r}
x[-1] # exclude 1st element
x[c(3,1)] # 3rd and 1st element
x[-c(3,1)] # exclude 3rd and 1st element
```

### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using \[\])

```{r}
x
```

When using `x[y]` to subset `x`, good practice to have `length(x)==length(y)`

```{r}
length(x) # length of vector x
length(c(TRUE,FALSE,TRUE,FALSE,TRUE)) # length of y
length(x) == length(c(TRUE,FALSE,TRUE,FALSE,TRUE)) # condition true
x[c(TRUE,TRUE,FALSE,FALSE,TRUE)]
```

Recycling rules:

- in `x[y]`, if the logical vector `y` is shorter than `x`, R "recycles" `y` until it matches the length of `x`

```{r}
length(c(TRUE,FALSE))
x
x[c(TRUE,FALSE)]
```

```{r}
x
```

Note that a missing value (`NA`) in the index always yields a missing value in the output:

```{r}
x[c(TRUE, FALSE, NA, TRUE, NA)]
```

Return all elements of object `x` where element is greater than 3:

```{r}
x # print object x
x>3 # for each element of x, print T/F whether element value > 3
str(x>3)
x[x>3] # prints only the values that had TRUE at that position
```

The `visits_by_100751` column shows how many visits the University of Alabama made to each school. Let's subset this to only include 2 or more visits:

```{r}
df_school$visits_by_100751[1:100]
df_school$visits_by_100751[1:100]>=2
df_school$visits_by_100751[df_school$visits_by_100751>=2]
```

### 4. Empty `[]` returns original vector (subset atomic vectors using \[\])

```{r}
x
x[]
```

This is useful for sub-setting data frames, as we will show below

### 5. Zero vector \[0\] (subset atomic vectors using \[\])

Zero vector, `x[0]`

- R returns a zero-length vector of the same type as `x`

```{r}
x[0]
```

Wickham states:

- "This is not something you usually do on purpose, but it can be helpful for generating test data."

### 6. If vector is named, use character vectors to return elements with matching names (subset atomic vectors using \[\])

Create vector `y` that has values of vector `x` but each element is named

```{r}
x
(y <- c(a=1.1, b=2.2, c=3.3, d=4.4, e=5.5))
```

Return elements of vector based on name of element

- enclose element names in single `''` or double `""` quotes

```{r}
#show element named "a"
y["a"]

#show elements "a", "b", and "d"
y[c("a", "b", "d" )]
```

## Subsetting lists/data frames using \[\]

Using `[]` operator to subset lists works the same as subsetting atomic vector

- Using `[]` with a list always returns a list

```{r}
list_a <- list(list(1,2),3,"apple")
str(list_a)

#create new list that consists of elements 3 and 1 of list_a
list_b <- list_a[c(3, 1)]
str(list_b)

#show elements 3 and 1 of object list_a
#str(list_a[c(3, 1)])
```

Recall that a data frame is just a particular kind of list

- each element = a column = a variable

Using `[]` with a list always returns a list

- Using `[]` with a data frame always returns a data frame

Two ways to use `[]` to extract elements of a data frame

1. use "single index" `df_name[

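To make the "`[]` with a data frame returns a data frame" rule concrete, here is a minimal sketch on a small made-up data frame. The object `df_toy` and its columns are hypothetical, not part of the course data. A single index selects columns; the `df[rows, ]` form is the one used in the filtering examples later in these notes.

```{r}
# hypothetical toy data frame (assumption: not from the course data)
df_toy <- data.frame(
  id = 1:3,
  state = c("CA", "NY", "MA"),
  visits = c(0, 2, 5)
)

# single index with a name: still a data frame, with one column
one_col <- df_toy["state"]
str(one_col)

# single index with positions: columns 3 and 1, in that order
two_cols <- df_toy[c(3, 1)]
names(two_cols)

# rows selected by a logical vector, column position left empty: all columns kept
rows_visited <- df_toy[df_toy$visits > 0, ]
nrow(rows_visited)
```

In each case the result is a data frame, so functions that expect a data frame (e.g., `nrow()`, `names()`) work on it directly.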
## Subsetting lists/data frames using \[\[\]\] and \$

### Subset single element from object using \[\[\]\] operator, atomic vectors

So far we have used `[]` to extract elements from an object

- Apply `[]` to atomic vector: returns atomic vector with elements you requested
- Apply `[]` to list: returns list with elements you requested

`[[]]` also extracts elements from an object

- Applying `[[]]` to atomic vector gives same result as `[]`; that is, an atomic vector with element you request

```{r}
(x <- c(1.1, 2.2, 3.3, 4.4, 5.5))
str(x[3])
str(x[[3]])
```

- Caveat: when applying `[[]]` to atomic vector, you can only subset a single element

```{r}
x[c(3,4)] # single bracket; this works
#x[[c(3,4)]] # double bracket; this won't work
```

### Subsetting lists using `[]` vs. `[[]]`, introduce "train metaphor"

Applying `[[]]` to a list

- Understanding what `[]` vs. `[[]]` does to a list is very important but requires some explanation!

*Advanced R* [chapter 4.3](https://adv-r.hadley.nz/subsetting.html#subset-single) by Wickham uses the "train metaphor" to explain a list vs. **contents** of a list and how this relates to `[]` vs. `[[]]`

Below code chunk makes a list named `list_x` that contains 3 elements

```{r}
list_x <- list(1:3, "a", 4:6) # create list object
list_x
str(list_x)
```

In our train metaphor, object `list_x` is a train that contains 3 carriages

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("three_carriage_train.png")
#[![](three_carriage_train.png)](https://adv-r.hadley.nz/subsetting.html#subset-single)
```

When we "subset a list" -- that is, extract one or more elements from the list -- we have two broad choices (image below)

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("one_carriage_train_vs_contents.png")
# [![](one_carriage_train_vs_contents.png)](https://adv-r.hadley.nz/subsetting.html#subset-single)
```

1. Extracting elements using `[]` always returns a list, usually one with fewer elements
    - you can think of this as a train with fewer carriages

```{r}
#str(list_x)
str(list_x[1]) # returns a list
```

2. Extracting element using `[[]]` returns ***contents*** of particular carriage
    - I say applying `[[]]` to a list or data frame returns a simpler object that moves up one level of hierarchy

```{r}
str(list_x[[1]]) # returns an atomic vector
```

### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[]`

Rules about applying subset operator `[]` to a list

- Applying `[]` to a list always returns a list
- Number of elements in the resulting list depends on what is typed inside `[]`

Here is a list object named `list_x`

```{r}
list_x <- list(1:3, "a", 4:6)
```

Here is an image of a few "trains" that can be created by applying `[]` to `list_x`

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("smaller_trains.png")
```

And here is code to create the "trains" shown in above image (output omitted)

```{r, results = "hide"}
list_x[1:2]
list_x[-2]
list_x[c(1,1)]
list_x[0]
list_x[] # returns the original list; not shown in above train picture
```

Rules about applying subset operator `[[]]` to a list

- Can apply `[[]]` to return the **contents** of a **single element** of a list

Create list `list_x` and show the "train" image of applying `list_x[1]` vs. `list_x[[1]]`

```{r}
list_x <- list(1:3, "a", 4:6)
```

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("one_carriage_train_vs_contents.png")
```

Object created by `list_x[1]` is a list with one element (output omitted)

```{r}
list_x[1]
str(list_x[1])
```

Object created by `list_x[[1]]` is a vector with 3 elements (output omitted)

- `list_x[[1]]` gives us "contents" of element 1
- Since element 1 contains a numeric vector, object created by `list_x[[1]]` is a numeric vector

```{r}
list_x[[1]]
str(list_x[[1]])
```

Rules about applying subset operator `[[]]` to a list

- Can apply `[[]]` to return the **contents** of a **single element** of a list

```{r}
list_x <- list(1:3, "a", 4:6) # create list
list_x
```

We cannot use `[[]]` to subset multiple elements of a list (output omitted)

- e.g., we could write `list_x[[2]]` but not `list_x[[2:3]]`

```{r, eval = FALSE}
list_x[[c(2)]] # this works, subset element 2 using [[]]
list_x[[c(2,3)]] # this doesn't work; subset element 2 and 3 using [[]]
list_x[c(2,3)] # this works; subset element 2 and 3 using []
```
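Why does `[[]]` with a vector index fail here? A minimal sketch, reusing the same `list_x`: when the index inside `[[]]` on a list has more than one value, R applies the values one level at a time (recursive indexing) rather than selecting several elements. The names `nested`, `res`, and `two` are just illustrative.

```{r}
list_x <- list(1:3, "a", 4:6)

# vector index inside [[ ]] is applied recursively:
# list_x[[c(3, 2)]] is the same as list_x[[3]][[2]]
nested <- list_x[[c(3, 2)]]
nested

# so list_x[[c(2, 3)]] means list_x[[2]][[3]]; this is an error
# because element 2 ("a") has only one element
res <- tryCatch(list_x[[c(2, 3)]], error = function(e) "error")
res

# to extract several elements, use [] (result is a list)
two <- list_x[c(2, 3)]
str(two)
```

So `[[]]` never returns "several carriages"; use `[]` for that and `[[]]` for the contents of a single carriage.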
Like `[]`, can use `[[]]` to return contents of **named** elements specified using quotes

- syntax: `obj_name[["element_name"]]`

```{r}
list_x <- list(var1=1:3, var2="a", var3=4:6) # create list with named elements
```

Subset list `list_x` using `[[]]` element names

```{r}
list_x[["var1"]] # subset by element position: list_x[[1]]

str(list_x[["var1"]])
str(list_x["var1"]) # note: suggests var name is attribute of list, not atomic vector
```

Can do same thing with data frames because data frames are lists (output omitted)

- e.g., `df_event[["zip"]]` returns contents of element named `"zip"`
- object created by `df_event[["zip"]]` is character vector of length = 18,680

```{r, results='hide'}
# df_event[["zip"]] # this works but long output
str(df_event[["zip"]]) # character vector of length 18,860
typeof(df_event[["zip"]])
length(df_event[["zip"]])

str(df_event["zip"]) # by contrast, this is a dataframe w/ one variable
```

### General rules of applying `[]` vs `[[]]` to (nested) objects

What we just learned about applying `[]` vs `[[]]` to lists applies more generally to "nested objects"

- "nested objects" are objects with a hierarchical structure such that an element of an object contains another object

General rules of applying `[]` vs. `[[]]` to nested objects

- subsetting any object `x` using `[]` will return an object with the same data structure as `x`
- subsetting any object `x` using `[[]]` will return an object that may or may not have the same data structure as `x`
    - if object `x` is not a nested object, then applying `[[]]` to a single element of `x` will return an object with the same data structure as `x`
    - if object `x` has a nested data structure, then applying `[[]]` to a single element of `x` will "move up one level of hierarchy" to extract the **contents** of an element within the object `x`

When working w/ data frames, functions that calculate things expect to be working with atomic vectors (think `[[]]`) not lists (think `[]`)

```{r}
mean(df_event[['med_inc']], na.rm = TRUE)
# mean(df_event['med_inc'], na.rm = TRUE) # by contrast, this doesn't work

str(df_event['med_inc'])
str(df_event[['med_inc']])
```

### Subset lists/data frames using \$

```{r}
list_x <- list(var1=1:3, var2="a", var3=4:6)
```

`obj_name$element_name` is shorthand operator for `obj_name[["element_name"]]`

These three lines of code all give the same result

```{r}
list_x[[1]]
list_x[["var1"]]
list_x$var1
```

`df_name$var_name`: easiest way in base R to refer to variable in a data frame

- these two lines of code are equivalent

```{r}
str(df_event[["zip"]])
str(df_event$zip)
```

# Appendix

## Subset Data frames by combining \[\] and \$

**Motivation**

- When working with data frames we often want to isolate those observations that satisfy certain conditions
- This is often referred to as "filtering"
- We filter observations based on the values of one or more variables
- Perhaps you have seen "filtering" in Microsoft Excel
    - open some spreadsheet that contains variables (columns) and observations (rows)
    - click on `Data` \>\> `Filter` and then filter observations based on values of variable(s)

Filtering example using data frame `df_school`

- Observations:
    - One observation per high school (public and private)
- Variables:
    - high school characteristics; number of off-campus recruiting visits from particular universities
    - NCES ID for UC Berkeley is `110635`
    - variable `visits_by_110635` shows number of visits a high school received from UC Berkeley

**Task**:

- Isolate obs where school received at least 1 visit from UC Berkeley

General syntax: `df_name[df_name$var_name

- **key point**: `df_name[df_name$var_name

**Example**: Count the number of schools in the Northeast that received a visit from either UC Berkeley, U of Alabama, or CU Boulder.

```{r}
# Vector containing states located in the Northeast region
northeast_states <- c('CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA')

# Filter for schools in the Northeast AND visited by any of the 3 univs
nrow(df_school[df_school$state_code %in% northeast_states
               & (df_school$visits_by_110635 >= 1
                  | df_school$visits_by_100751 >= 1
                  | df_school$visits_by_126614 >= 1), ])
```

### Subset Data Frames by combining `[]` and `$`, `NA` Observations

Filtering observations of data frame using `[]` combined with `$` is more complicated in the presence of missing values (`NA` values)

Below we will explain

- why it is more complicated
- how to filter correctly when `NA`s are present

When sub-setting via `[]` combined with `$`, result will include:

- rows where condition is `TRUE`
- **as well as** rows with `NA` (missing) values for `

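To make the `NA` behavior concrete, here is a minimal sketch on a small made-up data frame. The object `df_toy` and its columns are hypothetical, not part of the course data. It compares plain `[]` filtering with two ways of dropping the `NA` rows: wrapping the condition in `which()`, or adding `!is.na()` to the condition.

```{r}
# hypothetical toy data frame (assumption: not from the course data)
df_toy <- data.frame(
  school = c("a", "b", "c", "d"),
  med_inc = c(60000, NA, 40000, 75000)
)

cond <- df_toy$med_inc >= 50000 # TRUE NA FALSE TRUE
cond

# plain [] keeps TRUE rows AND adds an all-NA row for each NA in the condition
with_na <- df_toy[cond, ]
with_na
nrow(with_na)

# which() returns positions where condition is TRUE, so NA rows are dropped
no_na_which <- df_toy[which(cond), ]
nrow(no_na_which)

# equivalent: require the condition to be non-missing
no_na_isna <- df_toy[cond & !is.na(cond), ]
nrow(no_na_isna)
```

Note that the extra row in `with_na` is not the original row for school "b"; every variable in that row is `NA`, which is why counts based on plain `[]` filtering can be misleading.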
## Subset using subset() function

`subset()` is a base R function to "filter" observations from some object `x`

- object `x` can be a matrix, data frame, list
- `subset()` automatically excludes elements/rows with `NA` for condition
- Can also use `subset()` to select variables
- what `subset()` function returns:
    - "An object similar to x contain just the selected \ldots rows and columns (for a matrix or data frame)"
- `subset()` can be combined with:
    - assignment (`<-`) to create new objects
    - `nrow()` to count number of observations that satisfy criteria

```{r, eval=FALSE}
?subset
```

\medskip

Syntax \[when object is data frame\]: **subset(x, subset, select, drop = FALSE)**

- `x` is object to be subset
- `subset` is the logical expression(s) (evaluates to `TRUE/FALSE`) indicating elements (rows) to keep
- `select` indicates columns to select from data frame (if argument is not used default will keep all columns)
- `drop` to preserve original **dimensions** \[SKIP\]

### Subset function, examples

Recall the previous example where we count events at public HS with at least \$50k median household income.

- *Note*. `subset()` automatically excludes rows where condition is `NA`:

```{r}
#Base R, `[]` combined with `$`, without which(); includes `NA`
nrow(df_event[df_event$event_type == "public hs" & df_event$med_inc>=50000, ])

#Base R, `[]` combined with `$`, with which(); excludes `NA`
nrow(df_event[which(df_event$event_type == "public hs" & df_event$med_inc>=50000), ])

#Base R, `subset()`; excludes `NA`
nrow(subset(df_event, event_type == "public hs" & med_inc>=50000))

#Base R, `subset()`; excludes `NA`; explicitly name arguments of subset()
nrow(subset(x = df_event, subset = event_type == "public hs" & med_inc>=50000))
```

Using `df_school`, show all public high schools in California with at least 50% Latinx (var=`pct_hispanic`) student enrollment

- Using base R, `subset()` \[output omitted\]

```{r, results="hide"}
#public high schools with at least 50% Latinx student enrollment
subset(x= df_school, subset = school_type == "public" & pct_hispanic >= 50
       & state_code == "CA")
```

- Can wrap `subset()` within `nrow()` to count number of observations that satisfy criteria

```{r}
nrow(subset(df_school, school_type == "public" & pct_hispanic >= 50
            & state_code == "CA"))
```

Note that `subset()` keeps only the observations for which the condition is `TRUE`

```{r}
nrow(subset(x = df_school, subset = TRUE))
nrow(subset(x = df_school, subset = FALSE))
```

Count all CA public high schools that are at least 50% Latinx and received at least 1 visit from UC Berkeley (var=`visits_by_110635`)

```{r}
nrow(subset(df_school, school_type == "public" & pct_hispanic >= 50
            & state_code == "CA" & visits_by_110635 >= 1))
```

`subset()` can also use the `%in%` operator, which is a more concise alternative to chaining `==` comparisons with the **OR** operator `|`

- Count number of schools from MA, ME, or VT that received at least one visit from University of Alabama (var=`visits_by_100751`)

```{r}
nrow(subset(df_school, state_code %in% c("MA","ME","VT") & visits_by_100751 >= 1))
```

Use the `select` argument within `subset()` to keep selected variables

- syntax: `select = c(var_name1,var_name2,...,var_name_n)`

Subset all CA public high schools that are at least 50% Latinx **AND** only keep variables `name` and `address`

```{r}
nrow(subset(x = df_school, subset = school_type == "public" & pct_hispanic >= 50
            & state_code == "CA", select = c(name, address)))
```

Combine `subset()` with assignment (`<-`) to create a new data frame

Create a new data frame of all CA public high schools that are at least 50% Latinx **AND** only keep variables `name` and `address`

```{r}
df_school_v2 <- subset(df_school, school_type == "public" & pct_hispanic >= 50
                       & state_code == "CA", select = c(name, address))

head(df_school_v2, n=5)
nrow(df_school_v2)
```

### Student Exercises

Using `subset()` from base R:

1. Create a new dataframe by extracting the columns `instnm`, `event_date`, `event_type` from `df_event` data frame. And show what columns (variables) are in the newly created dataframe.
2. Create a new dataframe from the `df_school` data frame that includes out-of-state public high schools with 50%+ Latinx student enrollment that received at least one visit by the University of California Berkeley (var=`visits_by_110635`). And count the number of observations.
3. Count the number of public schools from CA, FL or MA that received one or two visits from UC Berkeley from the `df_school` data frame.
4. Subset all public out-of-state high schools visited by University of California Berkeley that enroll at least 50% Black students, and only keep variables `state_code`, `name` and `zip_code`.

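As a warm-up for these exercises, here is a minimal sketch of the `subset()` pattern on a small made-up data frame. The object `df_toy` and its values are hypothetical and chosen only for illustration; the column names mimic `df_school` but this is not the course data. It combines a row condition, `%in%`, and the `select` argument, and shows that a row with `NA` in the condition is dropped.

```{r}
# hypothetical toy data frame (assumption: not from the course data)
df_toy <- data.frame(
  name = c("s1", "s2", "s3", "s4"),
  state_code = c("CA", "MA", "CA", "TX"),
  pct_hispanic = c(55, 70, NA, 80),
  stringsAsFactors = FALSE
)

# keep CA/MA schools with pct_hispanic >= 50; keep only two variables
df_toy_v2 <- subset(df_toy,
                    state_code %in% c("CA", "MA") & pct_hispanic >= 50,
                    select = c(name, state_code))

names(df_toy_v2) # variables in the new data frame
nrow(df_toy_v2)  # s3 is excluded because its condition is NA; s4 because it is in TX
```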
## Creating variables

### Create new data frame based on `df_school_all`

Data frame `df_school_all` has one obs per US high school and then variables identifying number of visits by particular universities

```{r}
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData"))
names(df_school_all)
```

Create new version of data frame, called `school_v2`, which we'll use to introduce how to create new variables

```{r, results='hide'}
library(tidyverse) # below code uses tidyverse functions and pipe operator

school_v2 <- df_school_all %>%
  select(-contains("inst_")) %>% # remove vars whose names contain "inst_"
  rename( # rename selected variables
    visits_by_berkeley = visits_by_110635,
    visits_by_boulder = visits_by_126614,
    visits_by_bama = visits_by_100751,
    visits_by_stonybrook = visits_by_196097,
    visits_by_rutgers = visits_by_186380,
    visits_by_pitt = visits_by_215293,
    visits_by_cinci = visits_by_201885,
    visits_by_nebraska = visits_by_181464,
    visits_by_georgia = visits_by_139959,
    visits_by_scarolina = visits_by_218663,
    visits_by_ncstate = visits_by_199193,
    visits_by_irvine = visits_by_110653,
    visits_by_kansas = visits_by_155317,
    visits_by_arkansas = visits_by_106397,
    visits_by_sillinois = visits_by_149222,
    visits_by_umass = visits_by_166629,
    num_took_read = num_took_rla,
    num_prof_read = num_prof_rla,
    med_inc = avgmedian_inc_2564
  )

glimpse(school_v2)
```

### Base R approach to creating new variables

Create new variables using assignment operator `<-` together with subsetting operators `[]` and `$`, which are also used to set conditions on the input variables

Pseudo syntax: `df$newvar <- ...`

- where `...` argument is expression(s)/calculation(s) used to create new variables
- expressions can include subsetting operators and/or other base R functions

**Task**: Create measure of percent of students on free-reduced lunch

**base R approach**

```{r}
school_v2_temp<- school_v2 #create copy of dataset; not necessary

school_v2_temp$pct_fr_lunch <- school_v2_temp$num_fr_lunch/school_v2_temp$total_students

#investigate variable you created
str(school_v2_temp$pct_fr_lunch)
school_v2_temp$pct_fr_lunch[1:5] # print first 5 obs
```

**tidyverse approach (with pipes)**

```{r}
school_v2_temp <- school_v2 %>%
  mutate(pct_fr_lunch = num_fr_lunch/total_students)
```

If creating new variable based on the condition/values of input variables, the base R approach below is basically the equivalent of tidyverse `mutate()` **with** `if_else()` or `recode()`

- Pseudo syntax: `df$newvar[logical condition]<- new value`
- `logical condition`: a condition that evaluates to `TRUE` or `FALSE`

**Task**: Create 0/1 indicator if school has median income greater than \$100k

**tidyverse approach (using pipes)**

```{r}
school_v2_temp %>% select(med_inc) %>%
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated
```

**Base R approach**

```{r}
school_v2_temp$inc_gt_100k<-NA #initialize an empty column with NAs
                               # otherwise you'll get warning
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc>100000] <- 1
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc<=100000] <- 0
count(school_v2_temp, inc_gt_100k)
```

**Task**: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

**tidyverse approach (using pipes)**

```{r}
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))

wwlist_temp <- wwlist %>%
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )

str(wwlist_temp$state_gen)
wwlist_temp %>% count(state_gen)
```

**Task**: Using `wwlist` and input vars `state` and `firstgen`, create a 4-category var

**base R approach**

```{r}
wwlist_temp <- wwlist

wwlist_temp$state_gen <- NA
wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="Y"] <- "instate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="N"] <- "instate_nonfirstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="Y"] <- "outstate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="N"] <- "outstate_nonfirstgen"

str(wwlist_temp$state_gen)
count(wwlist_temp, state_gen)
```

## Sorting data

### Base R `sort()` for vectors

`sort()` is a base R function that sorts vectors

Syntax: `sort(x, decreasing=FALSE, ...)`

- where x is object being sorted
- By default it sorts in ascending order (low to high)
- Need to set decreasing argument to `TRUE` to sort from high to low

```{r}
#?sort()
x<- c(31, 5, 8, 2, 25)
sort(x)
sort(x, decreasing = TRUE)
```

### Base R `order()` for dataframes

`order()` is a base R function that returns the positions (row numbers) that would put its arguments in sorted order; use these positions inside `[]` to sort a data frame

- Syntax: `order(..., na.last = TRUE, decreasing = FALSE)`
- where `...` are variable(s) to sort by
- By default it sorts in ascending order (low to high)
- Need to set decreasing argument to `TRUE` to sort from high to low

A single `decreasing = TRUE` applies to every sort variable, so it only works when we sort by one variable or want all variables descending (when sorting by multiple vars)

- use `-` in front of a numeric variable to sort that variable descending while the others keep the default ascending sorting

```{r results="hide"}
df_event[order(df_event$event_date), ]
df_event[order(df_event$event_date, df_event$total_12), ]

#sort descending via argument
df_event[order(df_event$event_date, decreasing = TRUE), ]
df_event[order(df_event$event_date, df_event$total_12, decreasing = TRUE), ]

#sorting by both ascending and descending variables
df_event[order(df_event$event_date, -df_event$total_12), ]
```

### Example, sorting

- Create a new dataframe from `df_event` that is sorted by ascending `event_date`, ascending `event_state`, and descending `pop_total`.

**base R** using `order()` function:

```{r results="hide"}
df_event_br1 <- df_event[order(df_event$event_date, df_event$event_state, -df_event$pop_total), ]
```
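The `-` trick only works for numeric variables. Below is a minimal sketch of mixed ascending/descending sorts on a small made-up data frame (`df_toy` and its columns are hypothetical, not part of the course data). It shows the `-` approach, an equivalent call that passes one `decreasing` value per variable (this requires `method = "radix"`), and `xtfrm()` as one way to sort a character variable in descending order.

```{r}
# hypothetical toy data frame (assumption: not from the course data)
df_toy <- data.frame(
  state = c("CA", "NY", "CA", "NY", "CA"),
  visits = c(2, 5, 7, 1, 3),
  stringsAsFactors = FALSE
)

# ascending state, descending visits, using `-` on the numeric variable
sorted_minus <- df_toy[order(df_toy$state, -df_toy$visits), ]
sorted_minus

# same sort, with one decreasing value per variable (needs method = "radix")
sorted_radix <- df_toy[order(df_toy$state, df_toy$visits,
                             decreasing = c(FALSE, TRUE), method = "radix"), ]

# descending state (character), ascending visits
# -df_toy$state would be an error; xtfrm() converts to sortable numbers first
sorted_chr_desc <- df_toy[order(-xtfrm(df_toy$state), df_toy$visits), ]
sorted_chr_desc
```

The row names of the sorted data frames show the original row positions, which is a quick way to check that a sort did what you intended.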