1 Introduction

1.1 Libraries we will use

Load packages:

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.2
#> Warning: package 'tidyr' was built under R version 4.2.2
#> Warning: package 'readr' was built under R version 4.2.2
#> Warning: package 'purrr' was built under R version 4.2.2
#> Warning: package 'dplyr' was built under R version 4.2.2
#> Warning: package 'stringr' was built under R version 4.2.2
library(stringr)  # package for manipulating strings (part of tidyverse)
library(rtweet)
getwd()
#> [1] "C:/Users/ozanj/Documents/rclass2/lectures/strings_and_regex"

1.2 Dataset we will use

We used the rtweet package to pull Twitter data from the PAC-12 universities. Specifically, we used the university’s admissions Twitter handle if there was one, or the main Twitter handle for the university if there wasn’t one:

# p12 <- c("uaadmissions", "FutureSunDevils", "caladmissions", "UCLAAdmission",
#          "futurebuffs", "uoregon", "BeaverVIP", "USCAdmission",
#          "engagestanford", "UtahAdmissions", "UW", "WSUPullman")
# p12_full_df <- search_tweets(paste0("from:", p12, collapse = " OR "), n = 500)
#
# saveRDS(p12_full_df, "p12_dataset.RDS")

# Load previously pulled Twitter data
p12_url <- "https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/p12_dataset.RDS"
p12_full_df <- readRDS(url(p12_url, "rb"))

# Use subset of data
p12_df <- p12_full_df %>% select("user_id", "created_at", "screen_name", "text", "location")
head(p12_df)
#> # A tibble: 6 × 5
#>   user_id  created_at          screen_name text                          locat…¹
#>   <chr>    <dttm>              <chr>       <chr>                         <chr>  
#> 1 22080148 2020-04-25 22:37:18 WSUPullman  "Big Dez is headed to Indy!\… Pullma…
#> 2 22080148 2020-04-23 21:11:49 WSUPullman  "Cougar Cheese. That's it. T… Pullma…
#> 3 22080148 2020-04-21 04:00:00 WSUPullman  "Darien McLaughlin '19, and … Pullma…
#> 4 22080148 2020-04-24 03:00:00 WSUPullman  "6 houses, one pick. Cougs, … Pullma…
#> 5 22080148 2020-04-20 19:00:21 WSUPullman  "Why did you choose to atten… Pullma…
#> 6 22080148 2020-04-20 02:20:01 WSUPullman  "Tell us one of your Bryan C… Pullma…
#> # … with abbreviated variable name ¹​location

1.3 Cheat sheets

Here are two useful cheat sheets about working with strings and regular expressions

Print these cheat sheets. Make one of them your “go to” cheat sheet.

1.4 Lecture overview

Credit: Regex Humor (Rex Egg)

In rclass1, we introduced strings and some basic functions for working with strings.

In rclass2, this “Strings and Regular Expressions” lecture provides deeper knowledge about strings, string functions, and – most importantly – regular expressions.

  • This lecture focuses mostly on regular expressions because regular expressions are the most useful tool for working with strings and also the most difficult to learn.


What are regular expressions? (Geeks for Geeks)

  • Regular expressions are an efficient way to match different patterns in strings, similar to the ctrl+f or cmd+f function you use to find text in a PDF or Word document

  • For example, regex can be used to match all cases of the exact text "out-of-state". But what makes it so powerful is that we could also have it match different variations or patterns, like "Out-of-state", "out of state", etc.

Credit: Crystal Han, Ozan Jaquette, & Karina Salazar (Recruiting the Out-Of-State University)
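
As a minimal sketch of this idea, the pattern below matches several spellings of "out-of-state". The example strings and the pattern are illustrative assumptions and are not taken from the lecture dataset. str_detect() is a stringr function that returns TRUE for each string that contains a match.

```r
library(stringr)

# Hypothetical example strings (not from the Twitter data)
x <- c("out-of-state", "Out-of-state", "out of state", "in-state")

# [Oo] matches either "O" or "o"; [- ] matches either a hyphen or a space
pattern <- "[Oo]ut[- ]of[- ]state"

str_detect(x, pattern)  # TRUE TRUE TRUE FALSE
```

A plain ctrl+f search for "out-of-state" would find only the first of these strings.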


In her popular STAT545 class Jenny Bryan, professor of statistics at University of British Columbia, describes regular expressions (regex) as:

“A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as ‘write only’, because regular expressions are easier to write than to read/understand. And they are not particularly easy to write.”

Yes, learning regular expressions is painful. So why are we making you do this? Because regular expressions are a fundamental building block of data science.



A thing people say is that data science is about trying to find the “signal in the noise.”

  • Noisy data “is data with a large amount of additional meaningless information in it called noise” (Wikipedia)
  • Prior to the data science revolution, quant people thought of “data” as something in columns and rows
  • The data science revolution is about creating analysis datasets from many pieces of structured, semi-structured, and unstructured data
  • But processing all this semi-structured data requires a lot of complex (and often tedious) data manipulation


Another thing people say is that “data science is 80% data cleaning and 20% analysis.”

Much handcrafted work — what data scientists call “data wrangling,” and “data munging” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.

But if the value [of data science] comes from combining different data sets, so does the headache. Data from sensors, documents, the web and conventional databases all come in different formats. Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.


So why learn regular expressions? Because regular expressions are THE preeminent tool for identifying data patterns, and cleaning/transforming “noisy” data

  • Most programmers I speak to talk about regular expressions as one of the most important tools for a programmer to learn
  • One could argue that regular expressions are a fundamental driver of the data science revolution, in that they are what made it possible to format and integrate diverse data sources into analysis datasets (I don’t know if that is true, but it seems reasonable!)
  • For example, web-scraping is fundamentally an application of regular expressions. Grabbing data from the internet is usually very easy. The hard part is processing all that html code into something that can be analyzed.


2 Prerequisites to regex

This section introduces some prerequisite functions and concepts that will help us learn regular expressions.

2.1 str_view() and str_view_all()

We introduce the str_view() & str_view_all() functions from the stringr package (part of tidyverse) to help us visualize what is being matched with our regular expressions


The str_view() & str_view_all() functions:

?str_view
?str_view_all

# SYNTAX AND DEFAULT VALUES
str_view(string, pattern, match = NA, html = FALSE)
str_view_all(string, pattern, match = NA, html = FALSE)
  • Function:
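    • str_view() shows the first match of a regex pattern (this is the behavior before stringr 1.5.0; from 1.5.0 onward, str_view() shows all matches)
    • str_view_all() shows all the matches of a regex pattern (deprecated as of stringr 1.5.0)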
  • Arguments:
    • string: Input vector. Either a character vector, or something coercible to one.
    • pattern: Pattern to look for.
      • The default interpretation is a regular expression, as described in stringi::stringi-search-regex. Control options with regex().
    • match: If TRUE, shows only strings that match the pattern. If FALSE, shows only the strings that don’t match the pattern. Otherwise (the default, NA) displays both matches and non-matches.
    • html: Use HTML output? If TRUE will create an HTML widget; if FALSE will style using ANSI escapes. The default prefers ANSI escapes if available in the current terminal; you can override by setting options(stringr.html = TRUE)


Override default settings of html=FALSE

options(stringr.html = TRUE)

Example: Using str_view() & str_view_all() to match literal text

We’ll match text from this string vector:

#p12_df$text[119]
writeLines(p12_df$text[119])
#> "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#> 
#> #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

p12_df$text[119] %>% length() # our string has length==1 (i.e., it is a one-element character vector)
#> [1] 1

Let’s use these functions to match the exact string "Co" from one of the tweets in our p12_df dataframe. str_view() will show us the first pattern match. Notice that the pattern is case-sensitive, as the "co" in "colleagues" was not matched:

str_view(string = p12_df$text[119], pattern = 'Co', html = TRUE)


We can use str_view_all() to show all matches, not just the first match:

str_view_all(string = p12_df$text[119], pattern = 'Co')
#> Warning: `str_view()` was deprecated in stringr 1.5.0.
#> ℹ Please use `str_view_all()` instead.
#> [1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to <Co>vid-19 in our labs and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #Always<Co>mpete x #GoHuskies https://t.co/4YSf4SpPe0

We can also apply str_view() and str_view_all() to vectors with more than one element:

p12_df$text[119] %>% length() # one element
#> [1] 1
p12_df$text %>% length() # many elements
#> [1] 328

When applying str_view() to a character vector with more than one element, str_view() shows us the first pattern match for each element (output omitted):

str_view(string = p12_df$text, pattern = 'Co')

When applying str_view_all() to a character vector with more than one element, str_view_all() shows all pattern matches for each element (output omitted):

str_view_all(string = p12_df$text, pattern = 'Co')

2.2 Special characters and (\) escape

The concepts special characters and escape sequence are essential for a deeper understanding of strings and for working with regular expressions. But these concepts are tricky to get your head around, in part because you cannot understand one concept without understanding the other.

Special characters

Taken literally, special characters are simply characters that are not alphanumeric, such as the backslash \, the question mark ?, and the opening parenthesis (.

But usually, when the programming world talks about special characters in relation to working with strings, special characters are defined as:

  • A character or sequence of characters that will not be interpreted literally because they have a special meaning (e.g., they represent some symbol or some function)

For example, here are two common special characters

  • \n represents a new line
  • \t represents a tab

A character preceded by a backslash \ can take on a new meaning. The n by itself is just an n. When you put a backslash in front of it, you are “escaping” it: the sequence \n is a special character that represents a newline.

x <- "Hi!\nMy name is\nWhat?\nMy name is\nWho?\nMy name is\nChika-chika\nSlim Shady"

# print(x) shows the escaped form of the string; \n is not rendered as a line break
print(x)
#> [1] "Hi!\nMy name is\nWhat?\nMy name is\nWho?\nMy name is\nChika-chika\nSlim Shady"

# writeLines(x) prints the text with special characters interpreted (\n becomes a line break)
writeLines(x)
#> Hi!
#> My name is
#> What?
#> My name is
#> Who?
#> My name is
#> Chika-chika
#> Slim Shady
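
One way to check that \n really is a single special character, and not a backslash followed by the letter n, is to count characters with the base R function nchar(). This is a small sketch; the shortened string is an illustrative assumption.

```r
# A shortened, hypothetical version of the string above
x <- "Hi!\nMy name is"

nchar("\n")  # 1: the escape sequence \n is one character (a newline)
nchar(x)     # 14: "Hi!" (3) + newline (1) + "My name is" (10)

print(x)       # shows the escaped form, with \n visible
writeLines(x)  # renders the newline as a line break
```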

Escape sequences

Definition of escape sequence

  • From DataCamp
    • “A sequence in a string that starts with a \ is called an escape sequence and allows us to include special characters in our strings.”
  • From Wikipedia
    • “An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly”


This Wikipedia quote about the C programming language is also true for R and most other programming languages:

In C, all escape sequences consist of two or more characters, the first of which is the backslash, \ (called the “Escape character”); the remaining characters determine the interpretation of the escape sequence. For example, \n is an escape sequence that denotes a newline character

Usually, we use the backslash (\) escape character for one of two broad purposes:

  1. To enable our string to include a literal character that would otherwise be interpreted by the programming language as a special character (e.g., to include a quote character in our string)
  2. To include a special character in our string (e.g., using \n in our string to insert a newline character in our string)


Using backslash escape character (\) to enable our string to include a literal character that would otherwise be interpreted by the programming language as a special character

Example: What if we wanted to include quote characters (e.g., single quote ', or double quote ") in our string

  • If we enclose our string using single quotation marks ', we cannot insert a single quotation mark within the string (code not run):
x <- 'I am trying to include a single quote ' within my string'
x
  • Similarly, if we enclose our string using double quotation marks ", we cannot insert a double quotation mark within the string (code not run):
x <- "I am trying to include a double quote " within my string"
x

Solution, without using backslash (\) escape character

  • If we want to include a single quote ' in our string, then enclose the entire string using double quotes ":
x <- "I am trying to include a single quote ' within my string"
writeLines(x)
#> I am trying to include a single quote ' within my string
  • If we want to include a double quote " in our string, then enclose the entire string using single quotes ':
x <- 'I am trying to include a double quote " within my string'
writeLines(x)
#> I am trying to include a double quote " within my string

Solution, using backslash (\) escape character

  • To include a literal single quote ' within a single-quoted string, we can use \' to “escape” the single quotation mark:
my_string <- 'Escaping a single quote \' within single quotes'
writeLines(my_string)
#> Escaping a single quote ' within single quotes
  • Similarly, to include a literal double quote " within a double-quoted string, we can use \" to “escape” the double quotation mark:
my_string <- "Escaping a double quote \" within double quotes"
writeLines(my_string)
#> Escaping a double quote " within double quotes


Similarly, to include a literal backslash \ in the string, we need to escape the backslash with another backslash:

my_string <- "The executable is located in C:\\Program Files\\Git\\bin"
my_string
#> [1] "The executable is located in C:\\Program Files\\Git\\bin"
writeLines(my_string)
#> The executable is located in C:\Program Files\Git\bin

By contrast, this won’t work:

  • R stops with an error saying that “\P” is an unrecognized escape
my_string <- "The executable is located in C:\Program Files\Git\bin"
my_string
writeLines(my_string)
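
If doubling every backslash becomes hard to read, raw strings are one alternative. The sketch below assumes R 4.0.0 or later, where the r"(...)" syntax is available; the variable names are illustrative.

```r
# Raw string: backslashes inside r"(...)" are taken literally (requires R >= 4.0.0)
path_raw <- r"(C:\Program Files\Git\bin)"

# The same string written with escaped backslashes
path_escaped <- "C:\\Program Files\\Git\\bin"

identical(path_raw, path_escaped)  # TRUE
writeLines(path_raw)
#> C:\Program Files\Git\bin
```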


Using backslash escape character (\) to include a special character in our string

We create different versions of an object named my_string that contains special character \t for tab and special character \n for newline:

my_string <- "A\tB" # contains \t tab
my_string
#> [1] "A\tB"
writeLines(my_string)
#> A    B

my_string <- "B\nC" # contains \n newline
my_string
#> [1] "B\nC"
writeLines(my_string)
#> B
#> C

my_string <- "A\tB\nC\tD" # contains \t tab and \n newline
writeLines(my_string)
#> A    B
#> C    D


Summary: We can use the backslash (\) escape character for:

  • Referring to special characters, e.g.:
    • \n: newline
    • \t: tab
  • Including a literal character in our string that would otherwise be interpreted as a special character:
    • \': include literal single quote
    • \": include literal double quote
    • \\: include literal backslash

2.3 Functions for printing strings

Let’s examine the object my_string (created below) which contains special characters \t for tab and \n for newline:

my_string <- "A\tB\nC\tD" # contains \t tab and \n newline
writeLines(my_string)
#> A    B
#> C    D


When we print my_string using the print() function, the output looks different than printing it using writeLines().

Print using the print() function

  • This shows the escaped representation of the string, i.e., the text as we would type it in R code (“underlying string”)

  • We can see the enclosing double quotes (") that delimit the string

  • Special characters like \n are printed literally (i.e., prints a literal backslash \ followed by the letter n) rather than being interpreted as a newline character and displaying a line break

    print(my_string) 
    #> [1] "A\tB\nC\tD"
    my_string # same as printing my_string object using print()
    #> [1] "A\tB\nC\tD"

Print using the writeLines() function

The writeLines() function:

?writeLines

# SYNTAX AND DEFAULT VALUES
writeLines(text, con = stdout(), sep = "\n", useBytes = FALSE)
  • Function: “writeLines() displays quotes and backslashes as they would be read, rather than as R stores them.” (From writeLines documentation)
  • Arguments:
    • text: Character vector containing the text you want to display

      writeLines(my_string)
      #> A    B
      #> C    D

Commentary on writeLines() function

  • What it does: shows the string text as it is meant to be read by the end user
  • Special characters like \n are interpreted so that the end user sees a new line inserted rather than seeing a literal \n
  • Does not show the enclosing double (") or single (') quotes around the string, so we only see the content of the string
  • Utility:
    • when we include escape sequences in the string, it is helpful to use writeLines() to see how the escaped string looks


2.4 Backslashes and regular expressions


Regular expressions use special characters to match text patterns. Many regular expression special characters start with a backslash \. For example, the regular expression \d matches any digit (e.g., 0, 5) in the text.

Because \ is an escape character in R, if we want to use the regular expression \d to match to any digit, we must write the regular expression out as \\d


For example, consider a string that is printed out like this by writeLines():

  • The executable is located in C:\Program Files\Git\bin
  • This is like the text we would see on Twitter as an end user

But print() displays the string like this, which is also how we have to type it in R code:

  • The executable is located in C:\\Program Files\\Git\\bin
  • Each backslash is shown in escaped form as \\, even though the string itself contains single backslashes

Imagine our goal is to match the \ (as it is seen on Twitter by the end user)

my_string <- "The executable is located in C:\\Program Files\\Git\\bin"

my_string # printing an object shows the escaped form of the string
#> [1] "The executable is located in C:\\Program Files\\Git\\bin"

writeLines(my_string) # Use writeLines() to see escaped string
#> The executable is located in C:\Program Files\Git\bin

To match a literal backslash, the regular expression we need is \\, because the backslash is also the escape character in regular expressions and must itself be escaped. But this doesn’t work:

# This will give an error if we try to run it
str_view_all(string = my_string, pattern = "\\")

Why is that? Let’s take a look at what is happening with the string "\\" we are providing as the pattern argument:

# Use writeLines() to see the escaped string
writeLines("\\")
#> \

As seen, once escaped, the string "\\" becomes \ - so we were providing \ as the regular expression (i.e., pattern argument) instead of the \\ that we wanted. In order to get \\, we need to use the string "\\\\", where the 1st \ escapes the 2nd and the 3rd \ escapes the 4th:

# Use writeLines() to see the escaped string
writeLines("\\\\")
#> \\

# This properly matches the `\` in the string
str_view_all(string = my_string, pattern = "\\\\")
#> [1] │ The executable is located in C:<\>Program Files<\>Git<\>bin


Summary:

  • Whenever we need to use backslash in our regular expression, we’ll need to escape the backslash (by using another backslash) in the string that we provide as the regex pattern.
  • For example, to match a newline character \n we need to specify "\\n", to match a tab character \t we need to specify "\\t", etc.
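
The two layers of escaping can be checked without str_view(). This is a minimal sketch that uses nchar() from base R together with str_count() and fixed() from stringr; fixed() tells stringr to treat the pattern as literal text rather than as a regular expression, so only R's own string escaping applies.

```r
library(stringr)

my_string <- "The executable is located in C:\\Program Files\\Git\\bin"

nchar("\\")    # 1: the R string "\\" contains a single backslash
nchar("\\\\")  # 2: the R string "\\\\" contains two backslashes, i.e. the regex \\

# Count the backslashes using a regular expression
str_count(my_string, "\\\\")       # 3

# Count the backslashes using a literal (non-regex) pattern
str_count(my_string, fixed("\\"))  # 3
```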


3 Regular expression basics

Example of using regular expression in action:

  • How can we match all occurrences of times in the following string? (i.e., 10 AM and 1 PM)
    • "Class starts at 10 AM and ends at 1 PM."
  • The regular expression \d+ [AP]M can!
my_string <- "Class starts at 10 AM and ends at 1 PM."
my_regex <- "\\d+ [AP]M"

# The escaped string "\\d" results in the regex \d
print(my_regex)
#> [1] "\\d+ [AP]M"
writeLines(my_regex)
#> \d+ [AP]M

# View matches for our regular expression
str_view_all(string = my_string, pattern = my_regex)
#> [1] │ Class starts at <10 AM> and ends at <1 PM>.
  • How the regular expression \d+ [AP]M works:
    • \d+ matches 1 or more digits in a row
      • \d means match any single digit (i.e., 0-9)
      • + means match 1 or more of the preceding item
    • The space between \d+ and [AP]M matches a literal space
    • [AP]M matches either AM or PM
      • [AP] means match either an A or P at that position
      • M means match a literal M
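
To pull the matched times out as a character vector rather than just viewing them, one option is str_extract_all() from stringr. This sketch reuses the string and pattern from the example above.

```r
library(stringr)

my_string <- "Class starts at 10 AM and ends at 1 PM."
my_regex <- "\\d+ [AP]M"

# str_extract_all() returns a list with one character vector per input string
times <- str_extract_all(my_string, my_regex)[[1]]
times  # "10 AM" and "1 PM"

# str_count() returns the number of matches
str_count(my_string, my_regex)  # 2
```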


Some common regular expression patterns include (not an exhaustive list):

  • Character classes
  • Quantifiers
  • Anchors
  • Sets and ranges
  • Groups and backreferences

Credit: DaveChild Regular Expression Cheat Sheet


3.1 Character classes

Columns: STRING (type this string to represent the regex), REGEX (to have this appear in your regex), MATCHES (to match this text)

STRING      REGEX     MATCHES
"\\d"       \d        any digit
"\\D"       \D        any non-digit
"\\s"       \s        any whitespace
"\\S"       \S        any non-whitespace
"\\w"       \w        any word character
"\\W"       \W        any non-word character

Other regex involving backslashes…

STRING      REGEX     MATCHES
"\\n"       \n        newline
"\\t"       \t        tab
"\\\\"      \\        \
"\\."       \.        .
"\\?"       \?        ?
"\\("       \(        (
"\\)"       \)        )
"\\{"       \{        {
"\\}"       \}        }

Credit: Working with strings in stringr Cheat sheet


There are certain character classes in regular expression that have special meaning. For example, \d is used to match any digit (i.e., number), \s is used to match any whitespace (i.e., space, tab, or newline character), and \w is used to match any word character (i.e., alphanumeric character or underscore).

“But wait… there’s more! Before a regex is interpreted as a regular expression, it is also interpreted by R as a string. And backslash is used to escape there as well. So, in the end, you need to prepend two backslashes…”

Credit: Escaping sequences from Stat 545

This means that in R, when we want to use the regular expressions \d, \s, \w, etc. to match strings, we must write the pattern strings as "\\d", "\\s", "\\w", etc.


Example: Using \d & \D to match digits & non-digits

Goal: write a regular expression pattern that matches to any digit in the string p12_df$text[119]

# print string
p12_df$text[119]
#> [1] "\"I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals.\"\n\n#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0"

# writeLines string
writeLines(p12_df$text[119])
#> "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#> 
#> #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

We can use \d to match all instances of a digit (i.e., number):

print("\\d")
#> [1] "\\d"

# The escaped string "\\d" results in the regex \d
writeLines("\\d")
#> \d

# Match any instances of a digit
str_view_all(string = p12_df$text[119], pattern = "\\d")
#> [1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1><9> in our labs and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4>YSf<4>SpPe<0>

What if we defined our regex pattern as "\d" instead of "\\d"?

# Error: '\d' is an unrecognized escape in character string starting ""\d"
print("\d")

writeLines("\d") # Error: '\d' is an unrecognized escape in character string starting ""\d"

# Error: '\d' is an unrecognized escape in character string starting ""\d"
str_view_all(string = p12_df$text[119], pattern = "\d")

The correct regular expression pattern to match any digits

str_view_all(string = p12_df$text[119], pattern = "\\d")
#> [1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1><9> in our labs and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4>YSf<4>SpPe<0>

KEY POINT

  • Our regular expression is the value we specify for the pattern argument above; this is our “regex object”
  • We want our regex object to include the regular expression \d, which matches to any digit
  • We specify our regex object as "\\d" rather than "\d"

EXPLAINING WHY THIS KEY POINT IS TRUE

  • our regex object is first interpreted by R as a string before it is interpreted as a regular expression
  • In R, a backslash \ is an escape character
  • we want our regex object to include a single backslash \ after it is interpreted by R,
  • So the specification of our regex object must include two backslashes in a row \\
    • the purpose of the first backslash is to escape the second backslash, because a backslash is a special character in R

TAKEAWAY

  • write your regex object the way it would be printed by the print() function
  • because before the regex object is interpreted as a regular expression, it is interpreted by R the way it would be printed by the writeLines() function
# so, write your regex object like this
print("\\d")
#> [1] "\\d"

#so it will be interpreted like this
writeLines("\\d")
#> \d


Example: use regular expression \D to match all instances of a non-digit character:

# The escaped string "\\D" results in the regex \D
print("\\D")
#> [1] "\\D"
writeLines("\\D")
#> \D

# Match any instances of a non-digit
str_view_all(string = p12_df$text[119], pattern = "\\D")
#> [1] │ <"><I>< ><s><t><a><n><d>< ><w><i><t><h>< ><m><y>< ><c><o><l><l><e><a><g><u><e><s>< ><a><t>< ><@><U><W>< ><a><n><d>< ><A><m><e><r><i><c><a><'><s>< ><l><e><a><d><i><n><g>< ><r><e><s><e><a><r><c><h>< ><u><n><i><v><e><r><s><i><t><i><e><s>< ><a><s>< ><t><h><e><y>< ><t><a><k><e>< ><f><i><g><h><t>< ><t><o>< ><C><o><v><i><d><->19< ><i><n>< ><o><u><r>< ><l><a><b><s>< ><a><n><d>< ><h><o><s><p><i><t><a><l><s><.><"><
#>     │ ><
#>     │ ><#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m>< ><x>< ><#><A><l><w><a><y><s><C><o><m><p><e><t><e>< ><x>< ><#><G><o><H><u><s><k><i><e><s>< ><h><t><t><p><s><:></></><t><.><c><o></>4<Y><S><f>4<S><p><P><e>0


Example: match to all instances of a digit followed by a non-digit character:

str_view_all(string = p12_df$text[119], pattern = "\\d\\D")
#> [1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-1<9 >in our labs and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4Y>Sf<4S>pPe0
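
The same patterns can be tried on a short string where the matches are easy to count by hand. The string below is a hypothetical example, and str_extract_all() and str_count() are stringr functions used here in place of str_view_all().

```r
library(stringr)

# Hypothetical short string: 16 characters, 6 of them digits
x <- "Covid-19 in 2020"

str_count(x, "\\d")  # 6 digits
str_count(x, "\\D")  # 10 non-digits

# A digit followed by a non-digit: only the "9" before the space qualifies,
# because "1" is followed by a digit and the final "0" is followed by nothing
pairs <- str_extract_all(x, "\\d\\D")[[1]]
pairs  # "9 "
```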

Example: Using \s & \S to match whitespace & non-whitespace

We can use \s to match all instances of a whitespace (i.e., space, tab, or newline character):

# The escaped string "\\s" results in the regex \s
writeLines("\\s")
#> \s

# Match any instances of a whitespace
str_view_all(string = p12_df$text[119], pattern = "\\s")
#> [1] │ "I< >stand< >with< >my< >colleagues< >at< >@UW< >and< >America's< >leading< >research< >universities< >as< >they< >take< >fight< >to< >Covid-19< >in< >our< >labs< >and< >hospitals."<
#>     │ ><
#>     │ >#ProudToBeOnTheirTeam< >x< >#AlwaysCompete< >x< >#GoHuskies< >https://t.co/4YSf4SpPe0


We can use \S to match all instances of a non-whitespace character:

# The escaped string "\\S" results in the regex \S
writeLines("\\S")
#> \S

# Match any instances of a non-whitespace
str_view_all(string = p12_df$text[119], pattern = "\\S")
#> [1] │ <"><I> <s><t><a><n><d> <w><i><t><h> <m><y> <c><o><l><l><e><a><g><u><e><s> <a><t> <@><U><W> <a><n><d> <A><m><e><r><i><c><a><'><s> <l><e><a><d><i><n><g> <r><e><s><e><a><r><c><h> <u><n><i><v><e><r><s><i><t><i><e><s> <a><s> <t><h><e><y> <t><a><k><e> <f><i><g><h><t> <t><o> <C><o><v><i><d><-><1><9> <i><n> <o><u><r> <l><a><b><s> <a><n><d> <h><o><s><p><i><t><a><l><s><.><">
#>     │ 
#>     │ <#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m> <x> <#><A><l><w><a><y><s><C><o><m><p><e><t><e> <x> <#><G><o><H><u><s><k><i><e><s> <h><t><t><p><s><:></></><t><.><c><o></><4><Y><S><f><4><S><p><P><e><0>


This matches all instances of the letter e followed by a whitespace character:

str_view_all(string = p12_df$text[39], pattern = "e\\s")
#> [1] │ Meet Luke! “No matter wher<e >you’r<e >from, @UCBerkeley is a plac<e >that will tak<e >you out of your comfort zon<e >and shap<e >you into your best self” #IamBerkeley 
#>     │ 
#>     │ Here’s Luk<e >on his first day at Berkeley in his dorm, posing with th<e >ax<e >after our big football gam<e >win and present day! https://t.co/2fO2hRnmPb

Example: Using \w & \W to match words & non-words

We can use \w to match all instances of a word character (i.e., alphanumeric character or underscore):

# The escaped string "\\w" results in the regex \w
writeLines("\\w")
#> \w

# Match any instances of a word character
str_view_all(string = p12_df$text[119], pattern = "\\w")
#> [1] │ "<I> <s><t><a><n><d> <w><i><t><h> <m><y> <c><o><l><l><e><a><g><u><e><s> <a><t> @<U><W> <a><n><d> <A><m><e><r><i><c><a>'<s> <l><e><a><d><i><n><g> <r><e><s><e><a><r><c><h> <u><n><i><v><e><r><s><i><t><i><e><s> <a><s> <t><h><e><y> <t><a><k><e> <f><i><g><h><t> <t><o> <C><o><v><i><d>-<1><9> <i><n> <o><u><r> <l><a><b><s> <a><n><d> <h><o><s><p><i><t><a><l><s>."
#>     │ 
#>     │ #<P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m> <x> #<A><l><w><a><y><s><C><o><m><p><e><t><e> <x> #<G><o><H><u><s><k><i><e><s> <h><t><t><p><s>://<t>.<c><o>/<4><Y><S><f><4><S><p><P><e><0>


We can use \W to match all instances of a non-word character:

# The escaped string "\\W" results in the regex \W
writeLines("\\W")
#> \W

# Match any instances of a non-word character
str_view_all(string = p12_df$text[119], pattern = "\\W")
#> [1] │ <">I< >stand< >with< >my< >colleagues< >at< ><@>UW< >and< >America<'>s< >leading< >research< >universities< >as< >they< >take< >fight< >to< >Covid<->19< >in< >our< >labs< >and< >hospitals<.><"><
#>     │ ><
#>     │ ><#>ProudToBeOnTheirTeam< >x< ><#>AlwaysCompete< >x< ><#>GoHuskies< >https<:></></>t<.>co</>4YSf4SpPe0


This matches 3-letter words that have a non-word character on each side (the match includes those surrounding characters):

str_view_all(string = p12_df$text[119], pattern = "\\W\\w\\w\\w\\W")
#> [1] │ "I stand with my colleagues at @UW< and >America's leading research universities as they take fight to Covid-19 in< our >labs< and >hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0
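
Because \W has to consume a character on each side, this pattern misses 3-letter words at the start or end of the string, and it can also miss a word that shares a space with the previous match. One possible alternative, sketched here on a hypothetical sentence, is the word boundary \b, which matches a position rather than a character.

```r
library(stringr)

# Hypothetical example sentence
x <- "The cat sat on a mat"

# Surrounding non-word characters are part of each match
with_W <- str_extract_all(x, "\\W\\w\\w\\w\\W")[[1]]
with_W  # " cat " only: "The" and "mat" touch the string edges, "sat" lost its leading space

# \b matches the boundary between a word character and a non-word character
with_b <- str_extract_all(x, "\\b\\w\\w\\w\\b")[[1]]
with_b  # "The" "cat" "sat" "mat"
```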


The second half of the table above shows other regular expressions involving backslashes. This includes special characters like \n and \t, as well as using backslash to escape characters that have special meanings in regex, like . or ? (as we will soon see). So to match a literal period or question mark, we need to use the regex \. and \?, or strings "\\." and "\\?" in R.
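
A short sketch of why the escape matters for the period: an unescaped . in a regex matches any character, so it matches more than a literal period. The example strings are hypothetical.

```r
library(stringr)

# Hypothetical example strings
x <- c("t.co", "taco", "t-co")

# Unescaped: . matches any single character
unescaped <- str_detect(x, "t.co")
unescaped  # TRUE TRUE TRUE

# Escaped: \. (written "\\." in R) matches only a literal period
escaped <- str_detect(x, "t\\.co")
escaped  # TRUE FALSE FALSE
```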


3.2 Quantifiers

Character   Description
*           0 or more
?           0 or 1
+           1 or more
{3}         Exactly 3
{3,}        3 or more
{3,5}       3, 4, or 5

We can use quantifiers to specify how many times a certain character or expression should be matched. The quantifier should directly follow the pattern you want to quantify. For example, s? matches 0 or 1 s and \d{4} matches exactly 4 digits.


Example: Using the *, ?, and + quantifiers

We can use * to match 0 or more of a pattern:

# Matches all instances of `s` followed by 0 or more non-word characters
str_view_all(string = p12_df$text[119], pattern = "s\\W*")
#> [1] │ "I <s>tand with my colleague<s >at @UW and America'<s >leading re<s>earch univer<s>itie<s >a<s >they take fight to Covid-19 in our lab<s >and ho<s>pital<s."
#>     │ 
#>     │ #>ProudToBeOnTheirTeam x #Alway<s>Compete x #GoHu<s>kie<s >http<s://>t.co/4YSf4SpPe0


We can use ? to match 0 or 1 of a pattern:

# Matches all instances of `s` followed by 0 or 1 non-word character
str_view_all(string = p12_df$text[119], pattern = "s\\W?")
#> [1] │ "I <s>tand with my colleague<s >at @UW and America'<s >leading re<s>earch univer<s>itie<s >a<s >they take fight to Covid-19 in our lab<s >and ho<s>pital<s.>"
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #Alway<s>Compete x #GoHu<s>kie<s >http<s:>//t.co/4YSf4SpPe0


We can use + to match 1 or more of a pattern:

# Matches all instances of `s` followed by 1 or more non-word characters
str_view_all(string = p12_df$text[119], pattern = "s\\W+")
#> [1] │ "I stand with my colleague<s >at @UW and America'<s >leading research universitie<s >a<s >they take fight to Covid-19 in our lab<s >and hospital<s."
#>     │ 
#>     │ #>ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskie<s >http<s://>t.co/4YSf4SpPe0

# Matches all Twitter hashtags
# A hashtag is defined as the character # followed by 1 or more word characters
str_view_all(string = p12_df$text[119], pattern = "#\\w+")
#> [1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#>     │ 
#>     │ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0

Example: Using {...} to specify how many occurrences to match

We can use {n} to specify the exact number of characters or expressions to match:

# Matches words with exactly 3 letters
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3}\\s")
#> [1] │ "I stand with my colleagues at @UW< and >America's leading research universities as they take fight to Covid-19 in< our >labs< and >hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0


We can use {n,} to specify n as the minimum amount to match:

# Matches words with 3 or more letters
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,}\\s")
#> [1] │ "I< stand >with my< colleagues >at @UW< and >America's< leading >research< universities >as< they >take< fight >to Covid-19 in< our >labs< and >hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0


We can use {n,m} to match between n and m occurrences (inclusive):

# Matches words with between 3 and 5 letters (inclusive)
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,5}\\s")
#> [1] │ "I< stand >with my colleagues at @UW< and >America's leading research universities as< they >take< fight >to Covid-19 in< our >labs< and >hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0



3.3 Anchors

String Character Description
"^" ^ Start of string, or start of line in multi-line pattern
"$" $ End of string, or end of line in multi-line pattern
"\\b" \b Word boundary
"\\B" \B Non-word boundary


We can use anchors to indicate which part of the string to match. For example, ^ matches the start of the string and $ matches the end of the string (notice that we do not need to escape these characters). \b can be used to help detect word boundaries, and \B can be used to help match characters within a word.
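
A compact way to see the four anchors side by side is to test them against a few short strings. The vector x below is a hypothetical example, and str_detect() is used only to report whether each pattern matches:

```r
library(stringr)

# Hypothetical toy strings (an assumption for illustration)
x <- c("cat", "concat", "cat food", "a cat sat")

# `^cat`: "cat" at the start of the string
starts_with_cat <- str_detect(x, "^cat")

# `cat$`: "cat" at the end of the string
ends_with_cat <- str_detect(x, "cat$")

# `\bcat\b`: "cat" as a whole word
whole_word_cat <- str_detect(x, "\\bcat\\b")

# `\Bcat`: "cat" that does not begin at a word boundary (i.e., inside a word)
inside_word_cat <- str_detect(x, "\\Bcat")
```

Only "concat" satisfies the last pattern, because its "cat" is preceded by another word character.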


Example: Using ^ & $ to match start & end of string

We can use ^ to match the start of a string:

# Matches only the quotation mark at the start of the text and not the end quote
str_view_all(string = p12_df$text[119], pattern = '^"')
#> [1] │ <">I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0


We can use $ to match the end of a string:

# Matches only the number at the end of the text and not any other numbers
str_view_all(string = p12_df$text[119], pattern = "\\d$")
#> [1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe<0>

Example: Using \b & \B to match word boundary & non-word boundary

We can use \b to help detect word boundaries:

# Matches all word boundaries
str_view_all(string = p12_df$text[119], pattern = "\\b")
#> [1] │ "<>I<> <>stand<> <>with<> <>my<> <>colleagues<> <>at<> @<>UW<> <>and<> <>America<>'<>s<> <>leading<> <>research<> <>universities<> <>as<> <>they<> <>take<> <>fight<> <>to<> <>Covid<>-<>19<> <>in<> <>our<> <>labs<> <>and<> <>hospitals<>."
#>     │ 
#>     │ #<>ProudToBeOnTheirTeam<> <>x<> #<>AlwaysCompete<> <>x<> #<>GoHuskies<> <>https<>://<>t<>.<>co<>/<>4YSf4SpPe0<>

# Matches words with 3 or more letters using \b
str_view_all(string = p12_df$text[119], pattern = "\\b\\w{3,}\\b")
#> [1] │ "I <stand> <with> my <colleagues> at @UW <and> <America>'s <leading> <research> <universities> as <they> <take> <fight> to <Covid>-19 in <our> <labs> <and> <hospitals>."
#>     │ 
#>     │ #<ProudToBeOnTheirTeam> x #<AlwaysCompete> x #<GoHuskies> <https>://t.co/<4YSf4SpPe0>

Notice how this is much more flexible than using whitespace (\s) to determine word boundaries:

# Matches words with 3 or more letters using \s
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,}\\s")
#> [1] │ "I< stand >with my< colleagues >at @UW< and >America's< leading >research< universities >as< they >take< fight >to Covid-19 in< our >labs< and >hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

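The regular expression \B matches a “non-word boundary”, meaning any position that is not a word boundary (for example, between two word characters or between two non-word characters):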

str_view_all(string = p12_df$text[119], pattern = "\\B")
#> [1] │ <>"I s<>t<>a<>n<>d w<>i<>t<>h m<>y c<>o<>l<>l<>e<>a<>g<>u<>e<>s a<>t <>@U<>W a<>n<>d A<>m<>e<>r<>i<>c<>a's l<>e<>a<>d<>i<>n<>g r<>e<>s<>e<>a<>r<>c<>h u<>n<>i<>v<>e<>r<>s<>i<>t<>i<>e<>s a<>s t<>h<>e<>y t<>a<>k<>e f<>i<>g<>h<>t t<>o C<>o<>v<>i<>d-1<>9 i<>n o<>u<>r l<>a<>b<>s a<>n<>d h<>o<>s<>p<>i<>t<>a<>l<>s.<>"<>
#>     │ <>
#>     │ <>#P<>r<>o<>u<>d<>T<>o<>B<>e<>O<>n<>T<>h<>e<>i<>r<>T<>e<>a<>m x <>#A<>l<>w<>a<>y<>s<>C<>o<>m<>p<>e<>t<>e x <>#G<>o<>H<>u<>s<>k<>i<>e<>s h<>t<>t<>p<>s:<>/<>/t.c<>o/4<>Y<>S<>f<>4<>S<>p<>P<>e<>0


We can use \B to help match characters within a word:

# Matches only the letter `s` within a word and not at the start or end
str_view_all(string = p12_df$text[119], pattern = "\\Bs\\B")
#> [1] │ "I stand with my colleagues at @UW and America's leading re<s>earch univer<s>ities as they take fight to Covid-19 in our labs and ho<s>pitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #Alway<s>Compete x #GoHu<s>kies https://t.co/4YSf4SpPe0



3.4 Sets and ranges

Character Description
. Match any character except newline (\n)
a|b Match a or b
[abc] Match either a, b, or c
[^abc] Match anything except a, b, or c
[a-z] Match range of lowercase letters from a to z
[A-Z] Match range of uppercase letters from A to Z
[0-9] Match range of numbers from 0 to 9


The table above lists more ways regular expressions give us flexibility in what we match. The period . acts as a wildcard that matches any character except a newline. The vertical bar | works like an OR operator. Square brackets [...] specify a set or range of characters to match (or, with a leading ^, not to match).
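
Before turning to the tweet text, here is a minimal sketch of these constructs on a short made-up string. The string x is an assumption for illustration, and str_extract_all() and str_count() are used only to list or count the matches:

```r
library(stringr)

# Hypothetical toy string containing one newline
x <- "Room B12, floor 3\nExit"

# `[A-Z]` matches a single uppercase letter
uppercase <- str_extract_all(x, "[A-Z]")[[1]]

# `[0-9]+` matches runs of one or more digits
numbers <- str_extract_all(x, "[0-9]+")[[1]]

# `floor|Exit` matches either alternative
either <- str_extract_all(x, "floor|Exit")[[1]]

# `.` matches every character except the newline
n_non_newline <- str_count(x, ".")
```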


Example: Using . as a wildcard

We can use . to match any character except newline (\n):

# Matches any character except newline
str_view_all(string = p12_df$text[119], pattern = ".")
#> [1] │ <"><I>< ><s><t><a><n><d>< ><w><i><t><h>< ><m><y>< ><c><o><l><l><e><a><g><u><e><s>< ><a><t>< ><@><U><W>< ><a><n><d>< ><A><m><e><r><i><c><a><'><s>< ><l><e><a><d><i><n><g>< ><r><e><s><e><a><r><c><h>< ><u><n><i><v><e><r><s><i><t><i><e><s>< ><a><s>< ><t><h><e><y>< ><t><a><k><e>< ><f><i><g><h><t>< ><t><o>< ><C><o><v><i><d><-><1><9>< ><i><n>< ><o><u><r>< ><l><a><b><s>< ><a><n><d>< ><h><o><s><p><i><t><a><l><s><.><">
#>     │ 
#>     │ <#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m>< ><x>< ><#><A><l><w><a><y><s><C><o><m><p><e><t><e>< ><x>< ><#><G><o><H><u><s><k><i><e><s>< ><h><t><t><p><s><:></></><t><.><c><o></><4><Y><S><f><4><S><p><P><e><0>


We can confirm there is a newline in the tweet above by using writeLines() or print():

writeLines(p12_df$text[119])
#> "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#> 
#> #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

print(p12_df$text[119])
#> [1] "\"I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals.\"\n\n#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0"

Example: Using | as an OR operator

We can use | to match either one of multiple patterns:

# Matches `research`, `fight`, or `labs`
str_view_all(string = p12_df$text[119], pattern = "research|fight|labs")
#> [1] │ "I stand with my colleagues at @UW and America's leading <research> universities as they take <fight> to Covid-19 in our <labs> and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

# Matches hashtags or handles
str_view_all(string = p12_df$text[119], pattern = "@\\w+|#\\w+")
#> [1] │ "I stand with my colleagues at <@UW> and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#>     │ 
#>     │ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0

Example: Using [...] to match (or not match) a set or range of characters

We can use [...] to match any set of characters:

# Matches hashtags or handles
str_view_all(string = p12_df$text[119], pattern = "[@#]\\w+")
#> [1] │ "I stand with my colleagues at <@UW> and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#>     │ 
#>     │ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0

# Matches any 2 consecutive vowels
str_view_all(string = p12_df$text[119], pattern = "[aeiouAEIOU]{2}")
#> [1] │ "I stand with my coll<ea>g<ue>s at @UW and America's l<ea>ding res<ea>rch universit<ie>s as they take fight to Covid-19 in <ou>r labs and hospitals."
#>     │ 
#>     │ #Pr<ou>dToB<eO>nTh<ei>rT<ea>m x #AlwaysCompete x #GoHusk<ie>s https://t.co/4YSf4SpPe0


We can also use [...] to match any range of alpha or numeric characters:

# Matches only lowercase x through z or uppercase A through C
str_view_all(string = p12_df$text[119], pattern = "[x-zA-C]")
#> [1] │ "I stand with m<y> colleagues at @UW and <A>merica's leading research universities as the<y> take fight to <C>ovid-19 in our labs and hospitals."
#>     │ 
#>     │ #ProudTo<B>eOnTheirTeam <x> #<A>lwa<y>s<C>ompete <x> #GoHuskies https://t.co/4YSf4SpPe0

# Matches only numbers 1 through 4 or the pound sign
str_view_all(string = p12_df$text[119], pattern = "[1-4#]")
#> [1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1>9 in our labs and hospitals."
#>     │ 
#>     │ <#>ProudToBeOnTheirTeam x <#>AlwaysCompete x <#>GoHuskies https://t.co/<4>YSf<4>SpPe0


We can use [^...] to indicate we do not want to match the provided set or range of characters:

# Matches any vowels
str_view_all(string = p12_df$text[119], pattern = "[aeiouAEIOU]")
#> [1] │ "<I> st<a>nd w<i>th my c<o>ll<e><a>g<u><e>s <a>t @<U>W <a>nd <A>m<e>r<i>c<a>'s l<e><a>d<i>ng r<e>s<e><a>rch <u>n<i>v<e>rs<i>t<i><e>s <a>s th<e>y t<a>k<e> f<i>ght t<o> C<o>v<i>d-19 <i>n <o><u>r l<a>bs <a>nd h<o>sp<i>t<a>ls."
#>     │ 
#>     │ #Pr<o><u>dT<o>B<e><O>nTh<e><i>rT<e><a>m x #<A>lw<a>ysC<o>mp<e>t<e> x #G<o>H<u>sk<i><e>s https://t.c<o>/4YSf4SpP<e>0

# Matches anything except vowels
str_view_all(string = p12_df$text[119], pattern = "[^aeiouAEIOU]")
#> [1] │ <">I< ><s><t>a<n><d>< ><w>i<t><h>< ><m><y>< ><c>o<l><l>ea<g>ue<s>< >a<t>< ><@>U<W>< >a<n><d>< >A<m>e<r>i<c>a<'><s>< ><l>ea<d>i<n><g>< ><r>e<s>ea<r><c><h>< >u<n>i<v>e<r><s>i<t>ie<s>< >a<s>< ><t><h>e<y>< ><t>a<k>e< ><f>i<g><h><t>< ><t>o< ><C>o<v>i<d><-><1><9>< >i<n>< >ou<r>< ><l>a<b><s>< >a<n><d>< ><h>o<s><p>i<t>a<l><s><.><"><
#>     │ ><
#>     │ ><#><P><r>ou<d><T>o<B>eO<n><T><h>ei<r><T>ea<m>< ><x>< ><#>A<l><w>a<y><s><C>o<m><p>e<t>e< ><x>< ><#><G>o<H>u<s><k>ie<s>< ><h><t><t><p><s><:></></><t><.><c>o</><4><Y><S><f><4><S><p><P>e<0>

# Matches anything that's not uppercase letters
str_view_all(string = p12_df$text[119], pattern = "[^A-Z]+")
#> [1] │ <">I< stand with my colleagues at @>UW< and >A<merica's leading research universities as they take fight to >C<ovid-19 in our labs and hospitals."
#>     │ 
#>     │ #>P<roud>T<o>B<e>O<n>T<heir>T<eam x #>A<lways>C<ompete x #>G<o>H<uskies https://t.co/4>YS<f4>S<p>P<e0>

Notice that [...] only matches a single character (see second to last example above). We need to use quantifiers if we want to match a stretch of characters (see last example above).
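
The same contrast is easy to see on a short made-up string. In this sketch x is hypothetical and str_extract_all() simply lists the matches:

```r
library(stringr)

# Hypothetical toy string (an assumption for illustration)
x <- "abc123def"

# Without a quantifier, each digit is a separate match
single_digits <- str_extract_all(x, "[0-9]")[[1]]

# With `+`, the whole run of digits is one match
digit_run <- str_extract_all(x, "[0-9]+")[[1]]
```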



3.5 Groups and backreferences

String Character Description
"(...)" (...) Capturing group
"(?:...)" (?:...) Non-capturing group
"\\1" \1 Part of the string matched by capturing group 1
"\\2" \2 Part of the string matched by capturing group 2


Parentheses can be used to group parts of our regular expression together. Normal parentheses (...) create what is called a numbered capturing group. “A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses”. For example, if we have (\d), we can refer back to the digit matched by this capturing group using backreferences, like \1.

Credit: Hadley Wickham (R for Data Science) Grouping and backreferences

If we only want to use parentheses for grouping purposes and do not need to reference the matched values, we can use a non-capturing group (?:...).
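
To make the difference concrete, the sketch below uses a made-up date string. It relies on str_match() and str_replace(), two stringr functions covered later in these notes; the variable names are assumptions for illustration:

```r
library(stringr)

# Hypothetical toy string (an assumption for illustration)
x <- "2020-04-25"

# Three capturing groups: str_match() returns the full match plus one
# column per capturing group
m <- str_match(x, "(\\d{4})-(\\d{2})-(\\d{2})")

# Backreferences in the replacement reorder the captured pieces
us_date <- str_replace(x, "(\\d{4})-(\\d{2})-(\\d{2})", "\\2/\\3/\\1")

# A non-capturing group still groups, but adds no column to the result
m2 <- str_match(x, "(?:\\d{4})-(\\d{2})")
```

With the non-capturing group, the month is group 1 rather than group 2, which is why m2 has only two columns.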


Example: Using capturing groups (...) and backreferences

We can use capturing groups (...) to match certain patterns, then reference what was matched:

# Matches any letter that is repeated 2 times in a row
str_view_all(string = p12_df$text[119], pattern = "([A-Za-z])\\1")
#> [1] │ "I stand with my co<ll>eagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies h<tt>ps://t.co/4YSf4SpPe0

# Matches any string of characters where the first and last letters are the same,
# and the second and second-to-last letters are the same
str_view_all(string = p12_df$text[119], pattern = "([a-z])([a-z]).*\\2\\1")
#> [1] │ "I s<tand with my colleagues at> @UW and Am<erica's leading re><search universities> as <they take fight> to Covid-19 in our <labs and hospital>s."
#>     │ 
#>     │ #ProudToBeOnTh<eirTeam x #AlwaysCompete x #GoHuskie>s https://t.co/4YSf4SpPe0

Example: Using non-capturing groups (?:...) for grouping purposes

We can use non-capturing groups (?:...) if we just want to group certain parts of the regex but don’t need to reference the matched value:

# Matches one or more repetitions of a digit followed by 3 letters
str_view_all(string = p12_df$text[119], pattern = "(?:\\d[A-Za-z]{3})+")
#> [1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#>     │ 
#>     │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4YSf4SpP>e0


Normal parentheses (capturing groups) can still work for general grouping purposes too. But if you want to group things together without capturing them, you can just use non-capturing groups:

# Here, we have 2 capturing groups but only need to reference the 2nd
str_view_all(string = "A1A1A1eeee", pattern = "([A-Z]\\d)+([a-z])\\2{2}")
#> [1] │ <A1A1A1eee>e

# So we can just turn the first group into a non-capturing group
str_view_all(string = "A1A1A1eeee", pattern = "(?:[A-Z]\\d)+([a-z])\\1{2}")
#> [1] │ <A1A1A1eee>e



4 Regex with stringr functions

This section is about how to solve problems by using regular expressions in combination with functions from the stringr package. This section closely follows section 14.4 Tools from R for Data Science by Wickham and Grolemund.


4.0.1 A Word of Caution!

The following quotes text from R for Data Science 14.4 Tools:

A word of caution before we continue: because regular expressions are so powerful, it’s easy to try and solve every problem with a single regular expression. In the words of Jamie Zawinski:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

As a cautionary tale, check out this regular expression that checks if a email address is valid:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)



The lesson here:

Don’t forget that you’re in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it’s often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.


So I recommend thinking about ways to use regular expressions to solve small, simple problems:

  • For example, below I load a data frame created from Table 208.30 of the NCES Digest of Education Statistics
  • Notice that in the resulting data frame, the variable state has lots of periods . after the name of the state
  • How could we use regular expressions to get rid of those periods?
load(url("https://github.com/anyone-can-cook/rclass1/raw/master/data/nces_digest/nces_digest_table_208_30.RData"))

table208_30 %>% select(state,tot_fall_2000,tot_fall_2010)
#> # A tibble: 51 × 3
#>    state                              tot_fall_2000      tot_fall_2010     
#>    <chr>                              <chr>              <chr>             
#>  1 Alabama .......................... 48194.400000000001 49363.240000000005
#>  2 Alaska .........................   7880.3999999999996 8170.6399999999994
#>  3 Arizona ......................     44438.400000000001 50030.619999999995
#>  4 Arkansas ........................  31947.400000000001 34272.800000000003
#>  5 California ......................  298021.40000000002 260806.29999999999
#>  6 Colorado .......................   41983.400000000001 48542.990000000005
#>  7 Connecticut .....................  41044.400000000001 42951.389999999999
#>  8 Delaware .......................   7469.3999999999996 8933              
#>  9 District of Columbia ...........   4949.3999999999996 5925.3299999999999
#> 10 Florida ........................   132030.39999999999 175609.28999999998
#> # … with 41 more rows
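
One possible answer, sketched on a hypothetical stand-in vector rather than the real table208_30 data: a single small pattern that removes the run of periods at the end of each value, together with any whitespace around it. The vector state and the exact pattern are assumptions, and the real column may need slightly different handling:

```r
library(stringr)

# Hypothetical stand-in for a few values of the `state` column
state <- c("Alabama ..........", "District of Columbia ....", "Alaska ...   ")

# Pattern: optional whitespace, one or more literal periods,
# optional whitespace, then the end of the string
state_clean <- str_replace(state, "\\s*\\.+\\s*$", "")
state_clean
```

Anchoring the pattern with $ means that only the trailing periods are removed, so a period elsewhere in a name would be left alone.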

4.1 stringr package overview

The stringr package is part of the tidyverse suite of packages.

The stringr package is built on top of the stringi package.

  • The stringi package is designed to handle every possible challenge one might encounter in working with strings and contains around 250 functions.
  • By contrast, the stringr package contains around 50 functions – a subset of stringi functions – “which have been carefully picked to handle the most common string manipulation functions” (Wickham & Grolemund, 2017, sec. 14.7)

You can perform most/all regular expression tasks using either Base R or stringi, so why use the stringr package?

Functions from the stringr package are nice to work with for a few reasons, most of which relate to consistency (as described in the stringr.tidyverse.org page):

  • All stringr functions start with str_ (e.g., str_view(), str_subset(), str_replace())
  • All stringr functions take a vector of strings as the first argument
  • Most stringr functions are designed to work with regular expressions

This section will introduce the most commonly used stringr functions – Wickham refers to these functions as “verbs” – for working with string patterns.

In each of the following functions, argument x is a vector of strings; and argument pattern is the pattern to look for within string x, which often utilizes regular expressions:

  • str_detect(x, pattern): TRUE/FALSE if there is a match to the pattern
  • str_subset(x, pattern): extracts the matching components
  • str_extract(x, pattern): extracts the text of the match
  • str_match(x, pattern): extracts parts of the match defined by parentheses
  • str_replace(x, pattern, replacement): replaces the matches with new text
  • str_split(x, pattern): splits a string into multiple pieces
  • str_count(x, pattern): counts the number of matches to the pattern
  • str_locate(x, pattern): gives the position of the match
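
As a quick preview, the sketch below calls each of these functions on the same small vector. The vector x and the patterns are made-up examples, chosen only to show the shape of what each function returns:

```r
library(stringr)

# Hypothetical toy vector (an assumption for illustration)
x <- c("apple pie", "banana split", "cherry tart")

detected  <- str_detect(x, "an")            # logical vector, one value per element
subsetted <- str_subset(x, "e")             # only the elements containing "e"
extracted <- str_extract(x, "[aeiou]")      # first vowel in each element
matched   <- str_match(x, "(\\w+) (\\w+)")  # matrix: full match + one column per group
replaced  <- str_replace(x, " ", "_")       # first space replaced by underscore
pieces    <- str_split(x, " ")              # list of character vectors
counts    <- str_count(x, "a")              # number of "a" in each element
located   <- str_locate(x, "p")             # matrix with start and end of first match
```

Note that str_match() and str_locate() return matrices and str_split() returns a list, while the other functions return vectors.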

4.1.1 regex() function

(From R for Data Science, section 14.5)

When we specify a value for the pattern argument of a stringr function – such as str_view() – it is automatically wrapped in a call to the regex() function (i.e., it is treated as a regular expression):

# This function call:
str_view(string = "Turn to page 394...", pattern = "\\d+")
#> [1] │ Turn to page <394>...

# Is shorthand for:
str_view(string = "Turn to page 394...", pattern = regex(pattern = "\\d+"))
#> [1] │ Turn to page <394>...
  • For simplicity, we can omit the call to the regex() function
  • But, there are additional arguments we can supply to regex() if we wanted

regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, ...)

  • ignore_case: If TRUE, allows characters to match either their uppercase or lowercase forms
  • multiline: If TRUE, allows ^ and $ to match the start and end of each line within an element rather than the start and end of the complete string
  • comments: If TRUE, allows you to use comments and whitespace to make complex regular expressions more understandable
    • Spaces are ignored, as is everything after #
    • To match a literal space, you’ll need to escape it: "\\ "
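
The multiline and comments arguments can be sketched as follows. The strings x and date_text and the name date_pattern are assumptions for illustration; str_count() and str_extract() are used only to report the matches:

```r
library(stringr)

# Hypothetical three-line string
x <- "line one\nline two\nlast line"

# By default, ^ matches only at the start of the whole string
n_default <- str_count(x, "^line")

# With multiline = TRUE, ^ also matches at the start of each line
n_multiline <- str_count(x, regex("^line", multiline = TRUE))

# With comments = TRUE, whitespace and everything after # are ignored,
# so a pattern can be spread over several annotated lines
date_pattern <- regex("
  (\\d{4})  # year: exactly 4 digits
  -         # literal hyphen
  (\\d{2})  # month: exactly 2 digits
  -         # literal hyphen
  (\\d{2})  # day: exactly 2 digits
", comments = TRUE)

date_text <- "Posted on 2020-04-25 at noon"
date_found <- str_extract(date_text, date_pattern)
```

The third line of x starts with "last", so only two lines match when multiline = TRUE.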

Example: Specifying ignore_case = TRUE in regex()

Let’s say we have the following string:

s <- "Yay, yay.... YAY!"
s
#> [1] "Yay, yay.... YAY!"


We can match all the yay’s using the following regex:

str_view_all(string = s, pattern = "[Yy][Aa][Yy]")
#> [1] │ <Yay>, <yay>.... <YAY>!


Equivalently, we can specify ignore_case = TRUE to avoid dealing with casing variations:

str_view_all(string = s, pattern = regex("yay", ignore_case = TRUE))
#> [1] │ <Yay>, <yay>.... <YAY>!

4.2 str_detect()


The str_detect() function:

?str_detect

# SYNTAX AND DEFAULT VALUES
str_detect(string, pattern, negate = FALSE)
  • Function: Detects the presence or absence of a pattern, separately for each element of the input character vector
    • Returns logical vector (TRUE if there is a match, FALSE if there is not)
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for
    • negate: If set to TRUE, the returned logical vector will contain TRUE if there is not a match and FALSE if there is one

Example: Using str_detect() on string
# Detects if there is a digit in the string
str_detect(string = "P. Sherman, 42 Wallaby Way, Sydney", pattern = "\\d")
#> [1] TRUE

Example: Using str_detect() on character vector
# Detects if there is a digit in each string in the vector
str_detect(string = c("One", "25th", "3000"), pattern = "\\d")
#> [1] FALSE  TRUE  TRUE

Example: Using str_detect() on dataframe column

Consider the variable created_at from the data frame p12_df:

# print a few obs
p12_df %>% select(user_id,screen_name,created_at) %>% head(n=5)
#> # A tibble: 5 × 3
#>   user_id  screen_name created_at         
#>   <chr>    <chr>       <dttm>             
#> 1 22080148 WSUPullman  2020-04-25 22:37:18
#> 2 22080148 WSUPullman  2020-04-23 21:11:49
#> 3 22080148 WSUPullman  2020-04-21 04:00:00
#> 4 22080148 WSUPullman  2020-04-24 03:00:00
#> 5 22080148 WSUPullman  2020-04-20 19:00:21

# examine variable type
p12_df$created_at %>% str()
#>  POSIXct[1:328], format: "2020-04-25 22:37:18" "2020-04-23 21:11:49" "2020-04-21 04:00:00" ...


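Let’s create new columns in p12_df called is_am and is_pm that indicate whether each tweet’s created_at time is in the AM or PM, respectively:

  • is_am: TRUE if the time contains a space followed by 0 and any digit (hours 00 through 09), or a space followed by 1 and then 0 or 1 (hours 10 and 11)
    • In the pattern " 0\\d| 1[01]", the first alternative covers hours 00 to 09 and the second covers hours 10 and 11; the leading space ensures we match the hour rather than digits in the date, minutes, or seconds
  • is_pm: uses the same pattern with negate = TRUE, so it is TRUE whenever is_am is FALSE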
p12_df %>%
  mutate(
    # Returns `TRUE` if the hour is 0#, 10, or 11, `FALSE` otherwise
    is_am = str_detect(string = created_at, pattern = " 0\\d| 1[01]"),
    # Recall we can set the `negate` argument to switch the returned `TRUE`/`FALSE`
    is_pm = str_detect(string = created_at, pattern = " 0\\d| 1[01]", negate = TRUE)
  ) %>% select(created_at, is_am, is_pm)
#> # A tibble: 328 × 3
#>    created_at          is_am is_pm
#>    <dttm>              <lgl> <lgl>
#>  1 2020-04-25 22:37:18 FALSE TRUE 
#>  2 2020-04-23 21:11:49 FALSE TRUE 
#>  3 2020-04-21 04:00:00 TRUE  FALSE
#>  4 2020-04-24 03:00:00 TRUE  FALSE
#>  5 2020-04-20 19:00:21 FALSE TRUE 
#>  6 2020-04-20 02:20:01 TRUE  FALSE
#>  7 2020-04-22 04:00:00 TRUE  FALSE
#>  8 2020-04-25 17:00:00 FALSE TRUE 
#>  9 2020-04-21 15:13:06 FALSE TRUE 
#> 10 2020-04-21 17:52:47 FALSE TRUE 
#> # … with 318 more rows
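
As a cross-check on the pattern, and in the spirit of breaking a problem into simpler pieces, the sketch below extracts the hour with a small regex and then compares it numerically. The vector ts is a hypothetical set of timestamps stored as strings (not the p12_df column), and this is one possible alternative rather than the approach used above:

```r
library(stringr)

# Hypothetical timestamps as strings (an assumption for illustration)
ts <- c("2020-04-25 22:37:18", "2020-04-21 04:00:00",
        "2020-04-20 11:59:59", "2020-04-20 12:00:00")

# Approach 1: the single regex used above
is_am_regex <- str_detect(ts, " 0\\d| 1[01]")

# Approach 2: capture the two digits between the space and the first colon,
# convert them to an integer, then compare with 12
hour <- as.integer(str_match(ts, " (\\d{2}):")[, 2])
is_am_hour <- hour < 12
```

Both approaches agree on these four values; the second trades a slightly longer script for a pattern that is easier to read.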


Because TRUE evaluates to 1 and FALSE evaluates to 0 in a numerical context, we could also sum the returned logical vector to see how many of the elements in the vector had a match:

# Number of tweets that were created in the AM
num_am_tweets <- sum(str_detect(string = p12_df$created_at, pattern = " 0\\d| 1[01]"))
num_am_tweets
#> [1] 53


Additionally, we can take the average of the logical vector to get the proportion of elements in the input vector that had a match:

# Proportion of tweets that were created in the AM
pct_am_tweets <- mean(str_detect(string = p12_df$created_at, pattern = " 0\\d| 1[01]"))
pct_am_tweets
#> [1] 0.1615854


We can also use the logical vector returned from str_detect() to filter p12_df to only include rows that had a match:

# Keep only rows whose tweet was created in the AM
p12_df %>%
  filter(str_detect(string = created_at, pattern = " 0\\d| 1[01]"))
#> # A tibble: 53 × 5
#>    user_id  created_at          screen_name   text                       locat…¹
#>    <chr>    <dttm>              <chr>         <chr>                      <chr>  
#>  1 22080148 2020-04-21 04:00:00 WSUPullman    "Darien McLaughlin '19, a… Pullma…
#>  2 22080148 2020-04-24 03:00:00 WSUPullman    "6 houses, one pick. Coug… Pullma…
#>  3 22080148 2020-04-20 02:20:01 WSUPullman    "Tell us one of your Brya… Pullma…
#>  4 22080148 2020-04-22 04:00:00 WSUPullman    "We loved seeing your top… Pullma…
#>  5 22080148 2020-04-24 01:58:04 WSUPullman    "#WSU agricultural scienc… Pullma…
#>  6 22080148 2020-04-22 02:22:03 WSUPullman    "Nice \U0001f44d https://… Pullma…
#>  7 15988549 2020-04-20 02:52:31 CalAdmissions "@PaulineARoxas Congrats!… Berkel…
#>  8 15988549 2020-04-22 03:07:00 CalAdmissions "It’s time to make this t… Berkel…
#>  9 15988549 2020-04-22 00:00:08 CalAdmissions "Are you a #BerkeleyBound… Berkel…
#> 10 15988549 2020-04-20 03:03:21 CalAdmissions "@N48260756 We suggest ta… Berkel…
#> # … with 43 more rows, and abbreviated variable name ¹​location
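
Since str_detect() already returns a logical vector, filter() can use it directly without comparing to TRUE. A minimal sketch on a hypothetical three-row data frame (`toy_df` is made up for illustration and stands in for p12_df):

```r
library(dplyr)
library(stringr)

# Hypothetical data frame standing in for p12_df
toy_df <- data.frame(
  screen_name = c("a", "b", "c"),
  created_at = c("2020-04-25 22:37:18", "2020-04-21 04:00:00", "2020-04-20 11:59:59")
)

# Keep only rows whose timestamp is in the AM
am_rows <- toy_df %>%
  filter(str_detect(string = created_at, pattern = " 0\\d| 1[01]"))

nrow(am_rows)
#> [1] 2
```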

4.3 str_subset()


The str_subset() function:

?str_subset

# SYNTAX AND DEFAULT VALUES
str_subset(string, pattern, negate = FALSE)
  • Function: Keeps strings that match a pattern
    • Returns input vector filtered to only keep elements that match the specified pattern
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for
    • negate: If set to TRUE, the returned vector will contain only elements that did not match the specified pattern

Example: Using str_subset() on character vector
# Subsets the input vector to only keep elements that contain a digit
str_subset(string = c("One", "25th", "3000"), pattern = "\\d")
#> [1] "25th" "3000"

# thus, the vector returned by str_subset() usually contains fewer elements than the input vector
c("One", "25th", "3000") %>% length()
#> [1] 3
str_subset(string = c("One", "25th", "3000"), pattern = "\\d") %>% length()
#> [1] 2
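
The negate argument and the relationship to str_detect() can be sketched on the same toy vector. For a vector with no missing values, str_subset() gives the same result as subsetting the input with the logical vector from str_detect():

```r
library(stringr)

x <- c("One", "25th", "3000")

# negate = TRUE keeps the elements that do NOT contain a digit
no_digit <- str_subset(string = x, pattern = "\\d", negate = TRUE)
no_digit
#> [1] "One"

# str_subset() behaves like subsetting with str_detect()
same_result <- identical(str_subset(x, "\\d"), x[str_detect(x, "\\d")])
same_result
#> [1] TRUE
```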

Example: Using str_subset() on dataframe column
# Subsets the `created_at` vector of `p12_df` to only keep elements that occurred in the AM
str_subset(string = p12_df$created_at, pattern = " 0\\d| 1[01]")
#>  [1] "2020-04-21 04:00:00" "2020-04-24 03:00:00" "2020-04-20 02:20:01"
#>  [4] "2020-04-22 04:00:00" "2020-04-24 01:58:04" "2020-04-22 02:22:03"
#>  [7] "2020-04-20 02:52:31" "2020-04-22 03:07:00" "2020-04-22 00:00:08"
#> [10] "2020-04-20 03:03:21" "2020-04-22 00:47:00" "2020-04-23 06:34:00"
#> [13] "2020-04-23 04:06:49" "2020-04-19 03:32:21" "2020-04-20 02:53:38"
#> [16] "2020-04-20 02:53:14" "2020-04-20 03:04:11" "2020-04-19 03:30:14"
#> [19] "2020-04-20 02:58:55" "2020-04-19 05:37:00" "2020-04-21 02:34:00"
#> [22] "2020-04-20 00:15:07" "2020-04-25 04:18:29" "2020-04-25 00:00:01"
#> [25] "2020-04-21 02:33:00" "2020-04-24 01:00:01" "2020-04-23 02:38:46"
#> [28] "2020-04-24 04:48:28" "2020-04-24 01:06:33" "2020-04-25 04:48:08"
#> [31] "2020-04-22 00:10:43" "2020-04-21 05:58:12" "2020-04-24 01:41:19"
#> [34] "2020-04-24 01:42:44" "2020-04-24 01:43:11" "2020-04-23 02:45:24"
#> [37] "2020-04-20 00:44:42" "2020-04-24 01:41:13" "2020-04-25 00:26:02"
#> [40] "2020-04-25 00:31:23" "2020-04-25 00:46:40" "2020-04-25 00:20:36"
#> [43] "2020-04-20 00:09:58" "2020-04-20 00:09:46" "2020-04-20 00:10:08"
#> [46] "2020-04-25 00:29:12" "2020-04-22 01:45:02" "2020-04-23 02:00:14"
#> [49] "2020-04-25 00:34:47" "2020-04-24 02:11:51" "2020-04-25 00:05:59"
#> [52] "2020-04-21 04:14:11" "2020-04-23 02:13:21"

p12_df$created_at %>% length()
#> [1] 328
str_subset(string = p12_df$created_at, pattern = " 0\\d| 1[01]") %>% length()
#> [1] 53

4.4 str_extract() & str_extract_all()


The str_extract() & str_extract_all() functions:

?str_extract
?str_extract_all

# SYNTAX AND DEFAULT VALUES
str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)
  • Function: Extracts matching patterns from a string
    • Returns first match (str_extract()) or all matches (str_extract_all()) for input vector
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for
    • simplify: If set to TRUE, the returned matches will be in a character matrix rather than the default list of character vectors

How str_extract() differs from str_subset()

  • str_subset() returns each matching element in full and drops elements with no match
  • str_extract() returns a character vector that contains only the part of each element that matches, and returns NA for elements with no match
# str_subset() returns each matching element in full
str_subset(string = c("One", "25th", "3000"), pattern = "\\d+")
#> [1] "25th" "3000"
str_subset(string = c("One", "25th", "3000"), pattern = "\\d+") %>% length()
#> [1] 2

# str_extract() returns just the part of the element that matches; and returns NA for elements w/ no match
str_extract(string = c("One", "25th", "3000"), pattern = "\\d+")
#> [1] NA     "25"   "3000"
str_extract(string = c("One", "25th", "3000"), pattern = "\\d+") %>% length()
#> [1] 3

Example: Using str_extract() & str_extract_all() on character vector


[str_extract()] Extract the first occurrence of a word for each string:

# Extracts first match of a word
str_extract(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"),
            pattern = "\\w+")
#> [1] "Three" "Two"   "A"

str_extract(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"),
            pattern = "\\w+") %>% str() # a character vector of length 3
#>  chr [1:3] "Three" "Two" "A"
# Extracts "A" from elements that begin with "A"; returns NA for all other elements
str_extract(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"),
            pattern = "^A")
#> [1] NA  NA  "A"

# note that the length of the vector returned by str_extract() is the same as the length of the input vector
c("Three French hens", "Two turtle doves", "A partridge in a pear tree") %>% length()
#> [1] 3

str_extract(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"),
            pattern = "^A") %>% length()
#> [1] 3


[str_extract_all()] Extract all occurrences of a word for each string:

# Extracts all matches of a word, returning a list of character vectors
str_extract_all(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"), 
                pattern = "\\w+")
#> [[1]]
#> [1] "Three"  "French" "hens"  
#> 
#> [[2]]
#> [1] "Two"    "turtle" "doves" 
#> 
#> [[3]]
#> [1] "A"         "partridge" "in"        "a"         "pear"      "tree"

# Extracts all matches of a word, setting simplify = TRUE returns a character matrix
str_extract_all(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"), 
                pattern = "\\w+", simplify = TRUE)
#>      [,1]    [,2]        [,3]    [,4] [,5]   [,6]  
#> [1,] "Three" "French"    "hens"  ""   ""     ""    
#> [2,] "Two"   "turtle"    "doves" ""   ""     ""    
#> [3,] "A"     "partridge" "in"    "a"  "pear" "tree"


Types of objects returned by str_extract() and str_extract_all()

  • str_extract() returns a character vector
  • By default str_extract_all() returns a list of character vectors
    • can use Base R subsetting to isolate particular elements of the returned list
  • str_extract_all() with argument simplify = TRUE returns a character matrix
    • can use Base R subsetting to isolate particular elements of the returned matrix
# str_extract returns a character vector
str_extract(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"), 
                pattern = "\\w+") %>% str()
#>  chr [1:3] "Three" "Two" "A"

# by default, str_extract_all returns a list
str_extract_all(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"), 
                pattern = "\\w+") %>% str()
#> List of 3
#>  $ : chr [1:3] "Three" "French" "hens"
#>  $ : chr [1:3] "Two" "turtle" "doves"
#>  $ : chr [1:6] "A" "partridge" "in" "a" ...

# str_extract_all with simplify = TRUE returns a character matrix
str_extract_all(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"), 
                pattern = "\\w+", simplify = TRUE)
#>      [,1]    [,2]        [,3]    [,4] [,5]   [,6]  
#> [1,] "Three" "French"    "hens"  ""   ""     ""    
#> [2,] "Two"   "turtle"    "doves" ""   ""     ""    
#> [3,] "A"     "partridge" "in"    "a"  "pear" "tree"

str_extract_all(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"), 
                pattern = "\\w+", simplify = TRUE) %>% str()
#>  chr [1:3, 1:6] "Three" "Two" "A" "French" "turtle" "partridge" "hens" ...
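
The Base R subsetting mentioned above can be sketched as follows, reusing the same three strings (the object names `songs`, `words_list`, and `words_mat` are chosen here for illustration):

```r
library(stringr)

songs <- c("Three French hens", "Two turtle doves", "A partridge in a pear tree")

# List version: [[i]] isolates the character vector for string i
words_list <- str_extract_all(string = songs, pattern = "\\w+")
words_list[[3]]
#> [1] "A"         "partridge" "in"        "a"         "pear"      "tree"
words_list[[3]][2]   # second word of the third string
#> [1] "partridge"
lengths(words_list)  # number of matches per string
#> [1] 3 3 6

# Matrix version: [row, column] isolates a single cell
words_mat <- str_extract_all(string = songs, pattern = "\\w+", simplify = TRUE)
words_mat[3, 2]
#> [1] "partridge"
words_mat[, 1]       # first match of every string
#> [1] "Three" "Two"   "A"
```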

Example: Using str_extract() & str_extract_all() on dataframe column

[str_extract()] Extract first hashtag:

# Extracts first match of a hashtag (if there is one)
p12_df %>% 
  mutate(
    hashtag = str_extract(string = text, pattern = "#\\S+") # pattern is a hashtag followed by one or more non-white-space characters
  ) %>% select(text, hashtag)
#> # A tibble: 328 × 2
#>    text                                                                  hashtag
#>    <chr>                                                                 <chr>  
#>  1 "Big Dez is headed to Indy!\n\n#GoCougs | #NFLDraft2020 | @dadpat7 |… #GoCou…
#>  2 "Cougar Cheese. That's it. That's the tweet. \U0001f9c0#WSU #GoCougs… #WSU   
#>  3 "Darien McLaughlin '19, and her dog, Yuki, went on a #Pullman distan… #Pullm…
#>  4 "6 houses, one pick. Cougs, which one you got? Reply ⬇️  #WSU #CougsC… #WSU   
#>  5 "Why did you choose to attend @WSUPullman?\U0001f914 #WSU #GoCougs h… #WSU   
#>  6 "Tell us one of your Bryan Clock Tower memories ⏰ \U0001f43e #WSU #… #WSU   
#>  7 "We loved seeing your top three @WSUPullman buildings, but what are … #WSU   
#>  8 "Congratulations, graduates! We’re two weeks away from the #WSU syst… #WSU   
#>  9 "Learn more about this story at https://t.co/45BzKc2rFE. #WSU #GoCou… #WSU   
#> 10 "Tomorrow, our @WSUEsports Team is facing off against \n@Esports_WA … #GoCou…
#> # … with 318 more rows

[str_extract_all()] Extract all hashtags:

# Extracts all matches of hashtags (if there are any)
p12_df %>% 
  mutate(
    hashtags_list = str_extract_all(string = text, pattern = "#\\S+"),
    # Use `as.character()` so we can see the content of the character vector of matches
    hashtags_vector = as.character(hashtags_list)
  ) %>% select(text, hashtags_list, hashtags_vector)
#> # A tibble: 328 × 3
#>    text                                                          hasht…¹ hasht…²
#>    <chr>                                                         <list>  <chr>  
#>  1 "Big Dez is headed to Indy!\n\n#GoCougs | #NFLDraft2020 | @d… <chr>   "c(\"#…
#>  2 "Cougar Cheese. That's it. That's the tweet. \U0001f9c0#WSU … <chr>   "c(\"#…
#>  3 "Darien McLaughlin '19, and her dog, Yuki, went on a #Pullma… <chr>   "c(\"#…
#>  4 "6 houses, one pick. Cougs, which one you got? Reply ⬇️  #WSU… <chr>   "c(\"#…
#>  5 "Why did you choose to attend @WSUPullman?\U0001f914 #WSU #G… <chr>   "c(\"#…
#>  6 "Tell us one of your Bryan Clock Tower memories ⏰ \U0001f43… <chr>   "c(\"#…
#>  7 "We loved seeing your top three @WSUPullman buildings, but w… <chr>   "c(\"#…
#>  8 "Congratulations, graduates! We’re two weeks away from the #… <chr>   "c(\"#…
#>  9 "Learn more about this story at https://t.co/45BzKc2rFE. #WS… <chr>   "c(\"#…
#> 10 "Tomorrow, our @WSUEsports Team is facing off against \n@Esp… <chr>   "#GoCo…
#> # … with 318 more rows, and abbreviated variable names ¹​hashtags_list,
#> #   ²​hashtags_vector
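
As an alternative to as.character(), each character vector in the list can be collapsed into a single readable string. The sketch below uses made-up tweets (not rows of p12_df) and base R's vapply() and paste(). It also shows that "#\\S+" keeps trailing punctuation, whereas "#\\w+" stops at the first non-word character; which one is preferable depends on the data:

```r
library(stringr)

# Made-up tweets (not from p12_df)
tweets <- c("Big day! #GoCougs | #NFLDraft2020", "No tags here", "We are #WSU!")

tags <- str_extract_all(string = tweets, pattern = "#\\S+")

# Number of hashtags per tweet
lengths(tags)
#> [1] 2 0 1

# Collapse each character vector into one string per tweet
tags_collapsed <- vapply(tags, paste, character(1), collapse = " ")
tags_collapsed[1]
#> [1] "#GoCougs #NFLDraft2020"

# "#\\S+" keeps the trailing "!", "#\\w+" does not
str_extract(string = "We are #WSU!", pattern = "#\\S+")
#> [1] "#WSU!"
str_extract(string = "We are #WSU!", pattern = "#\\w+")
#> [1] "#WSU"
```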

4.5 str_match() & str_match_all()


The str_match() & str_match_all() functions:

?str_match
?str_match_all

# SYNTAX
str_match(string, pattern)
str_match_all(string, pattern)
  • Function: Extracts matched capturing groups from a string
    • Returns a character matrix containing the full match in the first column, then additional columns for matches from each capturing group
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for

Example: Using str_match() & str_match_all() on character vector


[str_match()] Extract the first month, day, year for each string:

# we have a character vector of 3 elements with dates stored in MDY format, each formatted slightly differently
c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14")
#> [1] "5-1-2020"             "12/25/17"             "01.01.13 to 01.01.14"

# Use str_match() to extract the first match of month, day, year, separating month, day, and year using capturing groups
str_match(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
          pattern = "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)")
#>      [,1]       [,2] [,3] [,4]  
#> [1,] "5-1-2020" "5"  "1"  "2020"
#> [2,] "12/25/17" "12" "25" "17"  
#> [3,] "01.01.13" "01" "01" "13"
  # note: pattern is one or more digits, followed by "-", "/", or ".", then one or more digits, and so on

# without specifying capturing groups
str_match(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
          pattern = "\\d+[-/\\.]\\d+[-/\\.]\\d+")
#>      [,1]      
#> [1,] "5-1-2020"
#> [2,] "12/25/17"
#> [3,] "01.01.13"

str_match() returns a character matrix

  • Use Base R subsetting to isolate desired elements: object_name[<rows>,<columns>]
m <- str_match(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
          pattern = "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)")

m %>% str() # character matrix of three rows and four columns
#>  chr [1:3, 1:4] "5-1-2020" "12/25/17" "01.01.13" "5" "12" "01" "1" "25" ...

m # print entire character matrix
#>      [,1]       [,2] [,3] [,4]  
#> [1,] "5-1-2020" "5"  "1"  "2020"
#> [2,] "12/25/17" "12" "25" "17"  
#> [3,] "01.01.13" "01" "01" "13"

m[1,] # isolate first row
#> [1] "5-1-2020" "5"        "1"        "2020"
m[1:2,] # rows 1 and 2
#>      [,1]       [,2] [,3] [,4]  
#> [1,] "5-1-2020" "5"  "1"  "2020"
#> [2,] "12/25/17" "12" "25" "17"

m[,1] # isolate first column
#> [1] "5-1-2020" "12/25/17" "01.01.13"

m[,4] # isolate fourth column
#> [1] "2020" "17"   "13"

m[3,4] # isolate cell defined by row 3 and column 4
#> [1] "13"


How str_match() differs from str_extract()

# str_match(): first column contains full match; then separate columns for matches from each capturing group
str_match(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
          pattern = "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)")
#>      [,1]       [,2] [,3] [,4]  
#> [1,] "5-1-2020" "5"  "1"  "2020"
#> [2,] "12/25/17" "12" "25" "17"  
#> [3,] "01.01.13" "01" "01" "13"


# str_extract() returns a character vector with each element containing the full match;
  # str_extract() doesn't return separate elements for each capturing group
str_extract(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
          pattern = "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)")
#> [1] "5-1-2020" "12/25/17" "01.01.13"


[str_match_all()] Extract all month, day, year for each string:

Whereas str_match() returns a character matrix containing text from the first match, str_match_all() returns a list containing text from all matches. The list has one element per input string, and each element is a character matrix with one row per match.

# Extracts all matches of month, day, year
str_match_all(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
              pattern = "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)")
#> [[1]]
#>      [,1]       [,2] [,3] [,4]  
#> [1,] "5-1-2020" "5"  "1"  "2020"
#> 
#> [[2]]
#>      [,1]       [,2] [,3] [,4]
#> [1,] "12/25/17" "12" "25" "17"
#> 
#> [[3]]
#>      [,1]       [,2] [,3] [,4]
#> [1,] "01.01.13" "01" "01" "13"
#> [2,] "01.01.14" "01" "01" "14"

# examine structure created by str_match_all
str_match_all(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
              pattern = "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)") %>% str()
#> List of 3
#>  $ : chr [1, 1:4] "5-1-2020" "5" "1" "2020"
#>  $ : chr [1, 1:4] "12/25/17" "12" "25" "17"
#>  $ : chr [1:2, 1:4] "01.01.13" "01.01.14" "01" "01" ...

Example: Using str_match() on dataframe column


First, we show how to extract the date and time from the variable p12_df$created_at:

str_match(string = p12_df$created_at[1:10], pattern = "([\\d-]+) ([\\d:]+)")
#>       [,1]                  [,2]         [,3]      
#>  [1,] "2020-04-25 22:37:18" "2020-04-25" "22:37:18"
#>  [2,] "2020-04-23 21:11:49" "2020-04-23" "21:11:49"
#>  [3,] "2020-04-21 04:00:00" "2020-04-21" "04:00:00"
#>  [4,] "2020-04-24 03:00:00" "2020-04-24" "03:00:00"
#>  [5,] "2020-04-20 19:00:21" "2020-04-20" "19:00:21"
#>  [6,] "2020-04-20 02:20:01" "2020-04-20" "02:20:01"
#>  [7,] "2020-04-22 04:00:00" "2020-04-22" "04:00:00"
#>  [8,] "2020-04-25 17:00:00" "2020-04-25" "17:00:00"
#>  [9,] "2020-04-21 15:13:06" "2020-04-21" "15:13:06"
#> [10,] "2020-04-21 17:52:47" "2020-04-21" "17:52:47"

Below, we extract datetime from the created_at column. The first capturing group matches the date part and the second capturing group matches the time part:

datetime_regex <- "([\\d-]+) ([\\d:]+)"
p12_df %>%
  mutate(
    # The 1st capturing group will be in the 2nd column of the matrix returned from `str_match()`
    # So we use [, 2] below and save the result to the `date` column of the dataframe
    date = str_match(string = created_at, pattern = datetime_regex)[, 2],
    # The 2nd capturing group will be in the 3rd column of the matrix returned from `str_match()`
    # So we use [, 3] below and save the result to the `time` column of the dataframe
    time = str_match(string = created_at, pattern = datetime_regex)[, 3]
  ) %>% select(created_at, date, time)
#> # A tibble: 328 × 3
#>    created_at          date       time    
#>    <dttm>              <chr>      <chr>   
#>  1 2020-04-25 22:37:18 2020-04-25 22:37:18
#>  2 2020-04-23 21:11:49 2020-04-23 21:11:49
#>  3 2020-04-21 04:00:00 2020-04-21 04:00:00
#>  4 2020-04-24 03:00:00 2020-04-24 03:00:00
#>  5 2020-04-20 19:00:21 2020-04-20 19:00:21
#>  6 2020-04-20 02:20:01 2020-04-20 02:20:01
#>  7 2020-04-22 04:00:00 2020-04-22 04:00:00
#>  8 2020-04-25 17:00:00 2020-04-25 17:00:00
#>  9 2020-04-21 15:13:06 2020-04-21 15:13:06
#> 10 2020-04-21 17:52:47 2020-04-21 17:52:47
#> # … with 318 more rows
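
The code above calls str_match() twice with the same pattern. One possible variation, sketched here on a made-up vector of timestamps (`stamps` and `parts` are hypothetical names), is to store the matrix once and pull both columns from it:

```r
library(stringr)

# Made-up timestamps standing in for p12_df$created_at
stamps <- c("2020-04-25 22:37:18", "2020-04-23 21:11:49")
datetime_regex <- "([\\d-]+) ([\\d:]+)"

# Run the match once and reuse the resulting character matrix
m <- str_match(string = stamps, pattern = datetime_regex)

parts <- data.frame(
  created_at = stamps,
  date = m[, 2],  # 1st capturing group
  time = m[, 3]   # 2nd capturing group
)
parts$date
#> [1] "2020-04-25" "2020-04-23"
parts$time
#> [1] "22:37:18" "21:11:49"
```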

4.6 str_replace() & str_replace_all()


The str_replace() & str_replace_all() functions:

?str_replace
?str_replace_all

# SYNTAX
str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
  • Function: Replaces matched patterns in a string
    • Returns input vector with first match (str_replace()) or all matches (str_replace_all()) for each string replaced with specified replacement
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for
    • replacement: What the matched pattern should be replaced with
  • str_replace_all() also supports multiple replacements, where you can omit the replacement argument and just provide a named vector of replacements as the pattern

Example: Using str_replace() & str_replace_all()

[str_replace()] Replace the first occurrence of a vowel:

# Replace first vowel with empty string
str_replace(string = "Thanks for the Memories", pattern = "[aeiou]", replacement = "")
#> [1] "Thnks for the Memories"

[str_replace_all()] Replace all occurrences of a vowel:

# Replace all vowels with empty strings
str_replace_all(string = "Thanks for the Memories", pattern = "[aeiou]", replacement = "")
#> [1] "Thnks fr th Mmrs"

Example: Using backreferences with str_replace() & str_replace_all()

[str_replace()] Change first word that is matched to pig latin:

# Use \\1 and \\2 to refer to the capturing groups
str_replace(string = "pig latin", pattern = "(\\w{1})(\\w+)",
            replacement = "\\2\\1ay")
#> [1] "igpay latin"

# this works too
str_replace(string = "pig latin", pattern = "(\\w)(\\w+)",
            replacement = "\\2\\1ay")
#> [1] "igpay latin"

[str_replace_all()] Change all words to pig latin:

# Use \\1 and \\2 to refer to the capturing groups
str_replace_all(string = "pig latin", pattern = "(\\w{1})(\\w+)",
                replacement = "\\2\\1ay")
#> [1] "igpay atinlay"
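
Backreferences are also handy for reordering parts of a string. As a small sketch with made-up dates (not taken from p12_df), the three capturing groups below hold the year, month, and day, and the replacement rearranges them into month/day/year:

```r
library(stringr)

# Made-up dates in year-month-day format
iso_dates <- c("2020-04-25", "2020-04-23")

# \\1 = year, \\2 = month, \\3 = day
mdy_dates <- str_replace(string = iso_dates,
                         pattern = "(\\d{4})-(\\d{2})-(\\d{2})",
                         replacement = "\\2/\\3/\\1")
mdy_dates
#> [1] "04/25/2020" "04/23/2020"
```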

Example: Using str_replace_all() for multiple replacements
# Replace all occurrences of "at" with "@", and all digits with "#"
str_replace_all(string = "Tomorrow at 10:30AM", pattern = c("at" = "@", "\\d" = "#"))
#> [1] "Tomorrow @ ##:##AM"

Example: Using str_replace_all() on dataframe column
p12_df %>%
  mutate(
    # Replace all hashtags and handles from tweet with an empty string
    removed_hashtags_handles = str_replace_all(string = text, pattern = "[@#]\\S+", replacement = "")
  ) %>% select(text, removed_hashtags_handles)
#> # A tibble: 328 × 2
#>    text                                                                  remov…¹
#>    <chr>                                                                 <chr>  
#>  1 "Big Dez is headed to Indy!\n\n#GoCougs | #NFLDraft2020 | @dadpat7 |… "Big D…
#>  2 "Cougar Cheese. That's it. That's the tweet. \U0001f9c0#WSU #GoCougs… "Couga…
#>  3 "Darien McLaughlin '19, and her dog, Yuki, went on a #Pullman distan… "Darie…
#>  4 "6 houses, one pick. Cougs, which one you got? Reply ⬇️  #WSU #CougsC… "6 hou…
#>  5 "Why did you choose to attend @WSUPullman?\U0001f914 #WSU #GoCougs h… "Why d…
#>  6 "Tell us one of your Bryan Clock Tower memories ⏰ \U0001f43e #WSU #… "Tell …
#>  7 "We loved seeing your top three @WSUPullman buildings, but what are … "We lo…
#>  8 "Congratulations, graduates! We’re two weeks away from the #WSU syst… "Congr…
#>  9 "Learn more about this story at https://t.co/45BzKc2rFE. #WSU #GoCou… "Learn…
#> 10 "Tomorrow, our @WSUEsports Team is facing off against \n@Esports_WA … "Tomor…
#> # … with 318 more rows, and abbreviated variable name ¹​removed_hashtags_handles

4.7 str_split()


The str_split() function:

?str_split

# SYNTAX AND DEFAULT VALUES
str_split(string, pattern, n = Inf, simplify = FALSE)
  • Function: Splits a string by specified pattern
    • [by default] returns a list that contains character vectors containing the split substrings
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for and split by
    • n: Maximum number of substrings to return
    • simplify: If set to TRUE, the returned matches will be in a character matrix rather than the default list of character vectors

Example: Using str_split() on character vector
# Split by comma or the word "and"
str_split(string = c("The Lion, the Witch, and the Wardrobe", "Peanut butter and jelly"),
          pattern = ",? and |, ")
#> [[1]]
#> [1] "The Lion"     "the Witch"    "the Wardrobe"
#> 
#> [[2]]
#> [1] "Peanut butter" "jelly"


We can specify n to control the maximum number of substrings we want to return:

# Limit split to only return 2 substrings
str_split(string = c("The Lion, the Witch, and the Wardrobe", "Peanut butter and jelly"),
          pattern = ",? and |, ", n = 2)
#> [[1]]
#> [1] "The Lion"                    "the Witch, and the Wardrobe"
#> 
#> [[2]]
#> [1] "Peanut butter" "jelly"


We can specify simplify = TRUE to return a character matrix instead of a list:

# Return split substrings in a character matrix
str_split(string = c("The Lion, the Witch, and the Wardrobe", "Peanut butter and jelly"),
          pattern = ",? and |, ", simplify = TRUE)
#>      [,1]            [,2]        [,3]          
#> [1,] "The Lion"      "the Witch" "the Wardrobe"
#> [2,] "Peanut butter" "jelly"     ""

Example: Using str_split() on dataframe column

When we split the created_at field at either a hyphen or space, we can separate out the year, month, day, and time components of the string:

p12_df %>%
  mutate(
    # Use `as.character()` so we can see the content of the character vector of split strings
    year_month_day_time = as.character(str_split(string = created_at, pattern = "[- ]"))
  ) %>% select(created_at, year_month_day_time)
#> # A tibble: 328 × 2
#>    created_at          year_month_day_time                        
#>    <dttm>              <chr>                                      
#>  1 2020-04-25 22:37:18 "c(\"2020\", \"04\", \"25\", \"22:37:18\")"
#>  2 2020-04-23 21:11:49 "c(\"2020\", \"04\", \"23\", \"21:11:49\")"
#>  3 2020-04-21 04:00:00 "c(\"2020\", \"04\", \"21\", \"04:00:00\")"
#>  4 2020-04-24 03:00:00 "c(\"2020\", \"04\", \"24\", \"03:00:00\")"
#>  5 2020-04-20 19:00:21 "c(\"2020\", \"04\", \"20\", \"19:00:21\")"
#>  6 2020-04-20 02:20:01 "c(\"2020\", \"04\", \"20\", \"02:20:01\")"
#>  7 2020-04-22 04:00:00 "c(\"2020\", \"04\", \"22\", \"04:00:00\")"
#>  8 2020-04-25 17:00:00 "c(\"2020\", \"04\", \"25\", \"17:00:00\")"
#>  9 2020-04-21 15:13:06 "c(\"2020\", \"04\", \"21\", \"15:13:06\")"
#> 10 2020-04-21 17:52:47 "c(\"2020\", \"04\", \"21\", \"17:52:47\")"
#> # … with 318 more rows
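
If each component is needed as its own vector, one option is simplify = TRUE, which returns a matrix whose columns can be pulled out with Base R subsetting. A minimal sketch on made-up timestamps (`stamps` and `parts` are hypothetical names):

```r
library(stringr)

# Made-up timestamps standing in for p12_df$created_at
stamps <- c("2020-04-25 22:37:18", "2020-04-21 04:00:00")

# Each row is one timestamp; columns are year, month, day, time
parts <- str_split(string = stamps, pattern = "[- ]", simplify = TRUE)

parts[, 1]  # year
#> [1] "2020" "2020"
parts[, 3]  # day
#> [1] "25" "21"
parts[, 4]  # time
#> [1] "22:37:18" "04:00:00"
```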

4.8 str_count()


The str_count() function:

?str_count

# SYNTAX AND DEFAULT VALUES
str_count(string, pattern = "")
  • Function: Counts the number of matches in a string
    • Returns the number of matches
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for

Example: Using str_count() on character vector
# Counts the number of digits
str_count(string = c("H2O2", "Year 3000", "4th of July"), pattern = "\\d")
#> [1] 2 4 1
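
str_count() gives the same numbers as counting the matches returned by str_extract_all(). The sketch below checks this on the same vector and also shows how adding + changes what counts as one match (runs of digits instead of single digits):

```r
library(stringr)

x <- c("H2O2", "Year 3000", "4th of July")

# Counting matches directly vs. counting the extracted matches
digit_counts <- str_count(string = x, pattern = "\\d")
all(digit_counts == lengths(str_extract_all(string = x, pattern = "\\d")))
#> [1] TRUE

# "\\d+" counts stretches of consecutive digits rather than individual digits
run_counts <- str_count(string = x, pattern = "\\d+")
run_counts
#> [1] 2 1 1
```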

Example: Using str_count() on dataframe column
p12_df %>%
  mutate(
    # Counts the total number of hashtags and mentions
    num_hashtags_and_mentions = str_count(string = text, pattern = "[@#]\\S+")
  ) %>% select(text, num_hashtags_and_mentions)
#> # A tibble: 328 × 2
#>    text                                                                  num_h…¹
#>    <chr>                                                                   <int>
#>  1 "Big Dez is headed to Indy!\n\n#GoCougs | #NFLDraft2020 | @dadpat7 |…       5
#>  2 "Cougar Cheese. That's it. That's the tweet. \U0001f9c0#WSU #GoCougs…       2
#>  3 "Darien McLaughlin '19, and her dog, Yuki, went on a #Pullman distan…       4
#>  4 "6 houses, one pick. Cougs, which one you got? Reply ⬇️  #WSU #CougsC…       3
#>  5 "Why did you choose to attend @WSUPullman?\U0001f914 #WSU #GoCougs h…       3
#>  6 "Tell us one of your Bryan Clock Tower memories ⏰ \U0001f43e #WSU #…       2
#>  7 "We loved seeing your top three @WSUPullman buildings, but what are …       3
#>  8 "Congratulations, graduates! We’re two weeks away from the #WSU syst…       3
#>  9 "Learn more about this story at https://t.co/45BzKc2rFE. #WSU #GoCou…       2
#> 10 "Tomorrow, our @WSUEsports Team is facing off against \n@Esports_WA …       5
#> # … with 318 more rows, and abbreviated variable name
#> #   ¹​num_hashtags_and_mentions

4.9 str_locate() & str_locate_all()


The str_locate() & str_locate_all() functions:

?str_locate
?str_locate_all

# SYNTAX
str_locate(string, pattern)
str_locate_all(string, pattern)
  • Function: Locates the position of patterns in a string
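    • str_locate() returns an integer matrix containing the start position of the first match in the first column and the end position in the second column; str_locate_all() returns a list of such matrices, one per input string, with one row per match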
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for

Example: Using str_locate() & str_locate_all() on character vector

[str_locate()] Locate the start and end positions for first stretch of numbers:

# Locate positions for first stretch of numbers
str_locate(string = c("555.123.4567", "(555) 135-7900 and (555) 246-8000"),
           pattern = "\\d+")
#>      start end
#> [1,]     1   3
#> [2,]     2   4

[str_locate_all()] Locate the start and end positions for all stretches of numbers:

# Locate positions for all stretches of numbers
str_locate_all(string = c("555.123.4567", "(555) 135-7900 and (555) 246-8000"),
               pattern = "\\d+")
#> [[1]]
#>      start end
#> [1,]     1   3
#> [2,]     5   7
#> [3,]     9  12
#> 
#> [[2]]
#>      start end
#> [1,]     2   4
#> [2,]     7   9
#> [3,]    11  14
#> [4,]    21  23
#> [5,]    26  28
#> [6,]    30  33

# basically, str_locate_all() gives the positions of the matches that str_extract_all() returns
str_extract_all(string = c("555.123.4567", "(555) 135-7900 and (555) 246-8000"),
               pattern = "\\d+")
#> [[1]]
#> [1] "555"  "123"  "4567"
#> 
#> [[2]]
#> [1] "555"  "135"  "7900" "555"  "246"  "8000"

Example: Using str_locate() on dataframe column
p12_df %>%
  mutate(
    # Start position of first hashtag in tweet (i.e., 1st column of matrix returned from `str_locate()`)
    start_of_first_hashtag = str_locate(string = text, pattern = "#\\S+")[, 1],
    # End position of first hashtag in tweet (i.e., 2nd column of matrix returned from `str_locate()`)
    end_of_first_hashtag = str_locate(string = text, pattern = "#\\S+")[, 2],
    # Length of first hashtag in tweet (start and end positions are both inclusive, so add 1)
    length_of_first_hashtag = end_of_first_hashtag - start_of_first_hashtag + 1L
  ) %>% select(text, start_of_first_hashtag, end_of_first_hashtag, length_of_first_hashtag)
#> # A tibble: 328 × 4
#>    text                                                  start…¹ end_o…² lengt…³
#>    <chr>                                                   <int>   <int>   <int>
#>  1 "Big Dez is headed to Indy!\n\n#GoCougs | #NFLDraft2…      29      36       8
#>  2 "Cougar Cheese. That's it. That's the tweet. \U0001f…      46      49       4
#>  3 "Darien McLaughlin '19, and her dog, Yuki, went on a…      53      60       8
#>  4 "6 houses, one pick. Cougs, which one you got? Reply…      57      60       4
#>  5 "Why did you choose to attend @WSUPullman?\U0001f914…      44      47       4
#>  6 "Tell us one of your Bryan Clock Tower memories ⏰ \…      52      55       4
#>  7 "We loved seeing your top three @WSUPullman building…     144     147       4
#>  8 "Congratulations, graduates! We’re two weeks away fr…      59      62       4
#>  9 "Learn more about this story at https://t.co/45BzKc2…      57      60       4
#> 10 "Tomorrow, our @WSUEsports Team is facing off agains…     266     274       9
#> # … with 318 more rows, and abbreviated variable names ¹​start_of_first_hashtag,
#> #   ²​end_of_first_hashtag, ³​length_of_first_hashtag
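
Because str_locate() reports start and end positions that are both inclusive, the number of characters in a match is end - start + 1. A minimal sketch with a made-up tweet checks this against str_length() applied to the extracted hashtag:

```r
library(stringr)

# Made-up tweet (not from p12_df)
tweet <- "Big Dez is headed to Indy! #GoCougs"

loc <- str_locate(string = tweet, pattern = "#\\S+")

# Inclusive positions: number of characters = end - start + 1
hashtag_length <- unname(loc[, "end"] - loc[, "start"] + 1)
hashtag_length
#> [1] 8

# Same answer from measuring the extracted hashtag directly
str_length(str_extract(string = tweet, pattern = "#\\S+"))
#> [1] 8
```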

5 Appendix

5.1 RegExplain Addin

Regular expressions are tricky. RegExplain makes it easier to see what you’re doing.

Credit: Garrick Aden-Buie (RegExplain)


RegExplain is an RStudio addin that allows the user to check their regex matching functions interactively.

# Installation
devtools::install_github("gadenbuie/regexplain")
library(regexplain)

References

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media. Retrieved from https://r4ds.had.co.nz/