1 Introduction

1.1 Libraries we will use

Load packages:

library(tidyverse)
library(stringr)  # string manipulation; already attached by library(tidyverse), loaded here for clarity
library(rtweet)
#> Warning: package 'rtweet' was built under R version 4.0.3
getwd()
#> [1] "C:/Users/ozanj/Documents/rclass2/lectures/strings_and_regex"

1.2 Dataset we will use

We used the rtweet package to pull Twitter data for the PAC-12 universities. For each university, we used its admissions Twitter handle if one existed; otherwise, we used the university's main Twitter handle:

# p12 <- c("uaadmissions", "FutureSunDevils", "caladmissions", "UCLAAdmission",
#          "futurebuffs", "uoregon", "BeaverVIP", "USCAdmission",
#          "engagestanford", "UtahAdmissions", "UW", "WSUPullman")
# p12_full_df <- search_tweets(paste0("from:", p12, collapse = " OR "), n = 500)
#
# saveRDS(p12_full_df, "p12_dataset.RDS")

# Load previously pulled Twitter data
p12_url <- "https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/p12_dataset.RDS"
p12_full_df <- readRDS(url(p12_url, "rb"))

# Use subset of data
p12_df <- p12_full_df %>% select("user_id", "created_at", "screen_name", "text", "location")
head(p12_df)
#> # A tibble: 6 x 5
#>   user_id  created_at          screen_name text                     location    
#>   <chr>    <dttm>              <chr>       <chr>                    <chr>       
#> 1 22080148 2020-04-25 22:37:18 WSUPullman  "Big Dez is headed to I~ Pullman, Wa~
#> 2 22080148 2020-04-23 21:11:49 WSUPullman  "Cougar Cheese. That's ~ Pullman, Wa~
#> 3 22080148 2020-04-21 04:00:00 WSUPullman  "Darien McLaughlin '19,~ Pullman, Wa~
#> 4 22080148 2020-04-24 03:00:00 WSUPullman  "6 houses, one pick. Co~ Pullman, Wa~
#> 5 22080148 2020-04-20 19:00:21 WSUPullman  "Why did you choose to ~ Pullman, Wa~
#> 6 22080148 2020-04-20 02:20:01 WSUPullman  "Tell us one of your Br~ Pullman, Wa~

1.3 Cheat sheets

Here are two useful cheat sheets about working with strings and regular expressions.

Print these cheat sheets. Make one of them your “go to” cheat sheet.

1.4 Lecture overview

Credit: Regex Humor (Rex Egg)

In rclass1, we introduced strings and some basic functions for working with strings.

In rclass2, this “Strings and Regular Expressions” lecture provides deeper knowledge about strings, string functions, and – most importantly – regular expressions.

  • This lecture focuses mostly on regular expressions because regular expressions are the most useful tool for working with strings and also the most difficult to learn.


What are regular expressions? (Geeks for Geeks)

  • Regular expressions are an efficient way to match patterns in strings, similar to the Ctrl+F or Cmd+F function you use to find text in a PDF or Word document.

  • For example, regex can be used to match all cases of the exact text "out-of-state". What makes regex so powerful is that it can also match variations of that text, like "Out-of-state", "out of state", etc.
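
The difference between exact-text matching and pattern matching can be sketched as follows. This is a minimal illustration, not part of the original lecture code: the vector x and the pattern "[Oo]ut[- ]of[- ]state" are made-up examples, and the strings do not come from the PAC-12 dataset. It uses str_detect() and fixed() from stringr.

```r
library(stringr)

# Hypothetical example strings (assumed for illustration; not from p12_df)
x <- c("out-of-state", "Out-of-state", "out of state", "in-state")

# Exact text match: fixed() treats the pattern as literal text,
# so only the first element matches
exact_match <- str_detect(x, fixed("out-of-state"))
exact_match
#> [1]  TRUE FALSE FALSE FALSE

# Pattern match: [Oo] accepts an upper- or lowercase "o",
# and [- ] accepts either a hyphen or a space between the words
pattern_match <- str_detect(x, "[Oo]ut[- ]of[- ]state")
pattern_match
#> [1]  TRUE  TRUE  TRUE FALSE
```

A single pattern covers all three spellings here, whereas exact matching would need a separate search for each variation.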