1 Introduction

1.1 Libraries we will use

Load packages:

library(dplyr)
library(readr)
library(readxl)
library(haven)

1.2 Lecture overview

When you begin a new project, it is a good idea to have a consistent and efficient way of organizing your folders and files. This can save you time and trouble down the road, since you will know exactly where everything is when you need it. Another common task you will encounter in a project is reading and writing data. This lecture covers these two fundamental topics to help you get started with a project.

Organizing folders and files:

  • Organizing folders: how to structure and navigate your project directory
  • Organizing files: how to structure an R script

Reading and writing data:

Common file formats to read from:

Format                        | Package | Function
Comma-separated values (.csv) | readr   | read_csv()
Text-formatted data (.txt)    | readr   | read_table()
Tab-separated values (.tsv)   | readr   | read_tsv()
Excel (.xls or .xlsx)         | readxl  | read_excel()
Stata (.dta)                  | haven   | read_dta()
SPSS (.sav)                   | haven   | read_sav()
SAS (.sas)                    | haven   | read_sas()
R (.rds)                      | base R  | readRDS()
R (.Rdata)                    | base R  | load()

Source: Professor Darin Christensen


Note:

  • .rds files can store a single R object
  • .Rdata files can store multiple R objects


Common file formats to write to:

Format                        | Package | Function
Comma-separated values (.csv) | readr   | write_csv()
Stata (.dta)                  | haven   | write_dta()
R (.rds)                      | base R  | saveRDS()
R (.Rdata)                    | base R  | save()

2 Organizing project directory

What is a directory and how to organize one?

  • Folders are referred to as directories
  • Your project directory is the folder that contains all the files and subfolders (i.e., subdirectories) you create for the purpose of your project
  • There is no one right way to organize a project directory, but the structure shown below is generally a good way to set up your project
    • You can have a data/ subdirectory to hold all data files, scripts/ subdirectory to hold your scripts, etc.
  • When using RStudio, it is often useful to turn your project directory into an RStudio project
  • In the below directory structure, the folder named my_project is the “root directory” for your research project (sometimes called the “project directory” or “project root directory”)
    • the folders data, scripts, and figures are sub-directories of your root directory
    • the file my_project.Rproj is your RStudio project file and lives in the root directory for your research project
my_project/
|
|- data/
|- scripts/
|- figures/
|- my_project.Rproj

2.1 RStudio project

How to create an RStudio project?

  • In the top right corner of RStudio, select New Project from the dropdown menu
  • If there’s a folder you want to turn into a project, select Existing Directory
  • Under Project working directory, browse for your folder and click Create Project


Why use RStudio project?

  • Creating an RStudio project helps keep everything relative to the project root directory
    • Your R Console and R scripts will run using the project root directory as the working directory
    • Your Terminal in RStudio will start in the project root directory
    • Your file browser window (bottom right panel) will also start off in the project root directory

2.2 Working directory

What is a working directory?

  • Your working directory is the directory that you are currently working in (e.g., running a script from)
    • It’ll be important to keep the working directory in mind when you need to refer to files/folders relative to the working directory
  • Note that the working directory is different when you run an R markdown file (.Rmd) vs. an R script (.R)
  • You can use the getwd() function to check your current working directory


The getwd() function:

?getwd

# SYNTAX
getwd()
  • Function: Returns an absolute filepath representing the current working directory
  • Arguments: none


When you run R code in an R markdown file, the working directory is the directory that your .Rmd file is in (the directory where the .Rmd file is saved):

getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"


When you run an R script, the working directory is the directory indicated at the top of your console in RStudio:

  • If you are not working from an RStudio project, this is typically your home directory
  • If you are working from an RStudio project, your working directory would be the project root directory


What is the home directory?

  • You can view your home directory in RStudio by clicking on Home in the bottom right file viewer panel
  • The home directory is a user’s main directory on a computer and it varies by operating system. It is usually:
    • MacOS: /Users/<username>
    • Windows: C:\Users\<username>
  • The tilde (~) refers to the user’s home directory
    • Note that it is possible to change the default ~ in R
  • You can check your home directory by running the following R code
# Check home directory
Sys.getenv('HOME')
#> [1] "/Users/cyouh95"

# Confirm that ~ denotes a user's home directory
path.expand('~')
#> [1] "/Users/cyouh95"


To summarize, your working directory will be as follows in the various scenarios:

                         | Not working from RStudio Project   | Working from RStudio Project
In R Markdown code chunk | Directory that the .Rmd file is in | Directory that the .Rmd file is in
In R script              | Home directory                     | Project root directory
In R console             | Home directory                     | Project root directory

2.3 File paths

What are file paths?

  • A file path specifies the list of directories needed to locate a file
  • The directories in a path can be separated by a forward slash (/) or backward slash (\) depending on the operating system
    • MacOS: /path/to/file
    • Windows: C:\path\to\file
  • R uses / in file paths regardless of whether you’re a Mac or PC user


/path/to/   # E.g., /Users/my_username/Desktop/
|
|- my_project/
   |
   |- data/
   |- scripts/

There are two types of file paths:

  • Absolute file path: full path that includes the complete list of directories needed to locate a file or folder
    • E.g., /Users/my_username/Desktop/my_project/ is the absolute file path to the my_project/ directory
    • E.g., /Users/my_username/Desktop/my_project/data/ is the absolute file path to the data/ directory
    • E.g., /Users/my_username/Desktop/my_project/scripts/ is the absolute file path to the scripts/ directory
  • Relative file path: path relative to your current working directory
    • E.g., Assuming your working directory is the Desktop/ folder in the example above:
      • ./my_project/ is the relative file path to the my_project/ directory
      • ./my_project/data/ is the relative file path to the data/ directory
    • E.g., Assuming your working directory is the data/ folder:
      • ./ is the relative file path to the data/ directory (i.e., folder you are currently in)
      • ../ is the relative file path to the my_project/ directory
      • ../scripts/ is the relative file path to the scripts/ directory


As seen above, relative file paths use dots to indicate the relative directory:

Key | Description
.   | this directory (i.e., current directory)
..  | up a directory (i.e., parent directory)
/   | separates directories in a file path
  • Usually, the leading ./ and trailing / in relative paths are not mandatory
    • E.g., ./my_project/ is equivalent to my_project, my_project/, and ./my_project
  • You can add the slashes if you want to make it clear/explicit that you are referring to a file path
    • Slashes might be required in other scenarios (e.g., when referring to filenames for some commands on the command line)
    • But otherwise it can just be for clarification purposes
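
To make this concrete, here is a minimal sketch (assuming your working directory is the data/ folder from the example above; outputs will differ on your machine) of how these path components can be used in R:

# Current directory (data/)
list.files(path = '.')

# Parent directory (my_project/)
list.files(path = '..')

# Sibling directory (scripts/)
list.files(path = '../scripts')

# Resolve a relative path to its absolute path
normalizePath(path = '..')  # e.g., "/Users/my_username/Desktop/my_project"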


What are the advantages of using relative file paths?

  • Notice that the absolute file path is dependent on your specific machine
    • E.g., /Users/<username>/path/to/file has your specific username in it
  • When you are collaborating with others on a shared project root directory (e.g., on a GitHub repository), you will want to use relative file paths in your scripts to refer to the shared files/folders, so that the path will be valid for everyone
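
For example, here is a minimal sketch of the difference (the file data/survey.csv is hypothetical): the relative path is identical for every collaborator, while the absolute path it resolves to depends on each user's machine:

# Relative path: the same for everyone working in the shared project root
rel_path <- file.path('data', 'survey.csv')
rel_path
#> [1] "data/survey.csv"

# Absolute path: depends on where each user keeps the project
file.path(getwd(), rel_path)  # e.g., "/Users/my_username/Desktop/my_project/data/survey.csv"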

2.5 Example project directory

Let’s create a directory for a hypothetical research project called research_project. Inside this directory, we’ll create separate subdirectories for data, scripts, output, output/tables, output/figures:

research_project/
|
|- data/
|- scripts/
|- output/
  |- tables/
  |- figures/
# delete research_project folder in case it already exists
unlink(x = 'research_project', recursive = TRUE, force = TRUE)

# Check working directory to see where your directory will be created
getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"
list.files()
#> [1] "organizing_and_io.html" "organizing_and_io.Rmd"

# Create `research_project` directory
dir.create(path = "research_project")

# Create `data` and `scripts` within `research_project`
dir.create(path = "research_project/data")
dir.create(path = "research_project/scripts")

# List the contents of `research_project` directory
list.files(path = "research_project")
#> [1] "data"    "scripts"

# Though we did not create the `output` directory yet, we can still directly create the `output/tables` subdirectory if we specify `recursive = TRUE`, which will create both `output` and `tables`
dir.create(path = "research_project/output/tables", recursive = TRUE)
dir.create(path = "research_project/output/figures")

# List the contents of `output` directory
list.files(path = "research_project/output")
#> [1] "figures" "tables"

# To delete `research_project`, remember you need `recursive = TRUE` to delete a directory
# unlink(x = "research_project", recursive = TRUE)

3 Organizing R script

What is an R script?

  • An R script contains R code (i.e., what you’d write inside R code chunks in an R markdown file)
  • Any lines of text that aren’t R code should start with a # to indicate that they are lines of comments
  • You can run code in an R script by putting your cursor on the line you want to run (or highlighting the code to run multiple lines) and clicking Run at the top of the script
    • Alternatively, you can hit ctrl + enter/cmd + enter on your keyboard to run the code
  • You will want to use an R script for the purpose of writing R code
    • On the other hand, you can only execute R code within code chunks in R markdown files, so they are not as well-suited for large amounts of code writing. .Rmd files are mostly useful for combining text and code (e.g., writing a report) and presenting it in various formats (e.g., PDF, HTML)

How to organize an R script?

  • Like folder structure, there is no absolute right or wrong way to organize your R script, but below is one way you can do it
    • The general guideline is to clearly label each section of your script and define objects (e.g., directory paths, functions) at the top of your file to be used throughout the script
  • You will be completing the remaining problem sets in this course using an R script rather than R markdown file
################################################################################
##
## [ PROJ ] < Name of the overall project >
## [ FILE ] < Name of this particular file >
## [ AUTH ] < Your name + email / Twitter / GitHub handle >
## [ INIT ] < Date you started the file >
##
################################################################################

## ---------------------------
## libraries
## ---------------------------

## ---------------------------
## directory paths
## ---------------------------

## ---------------------------
## functions
## ---------------------------

## -----------------------------------------------------------------------------
## < BODY >
## -----------------------------------------------------------------------------

## ---------------------------
## input
## ---------------------------

## ---------------------------
## process
## ---------------------------

## ---------------------------
## output
## ---------------------------

## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------

Source: R script template by Ben Skinner

3.1 Creating directory path objects

We use the file.path() command because it is smart. Some computer operating systems use forward slashes, /, for their file paths; others use backslashes, \. Rather than try to guess or assume what operating system future users will use, we can use R’s function, file.path(), to check the current operating system and build the paths correctly for us.

Source: Organizing Lecture by Ben Skinner


The file.path() function:

?file.path

# SYNTAX AND DEFAULT VALUES
file.path(..., fsep = .Platform$file.sep)
  • Function: Construct the path to a file from components in a platform-independent way
  • Arguments
    • ...: File path component(s)
    • fsep: The path separator to use (default is /)
      • Usually, we ignore this argument
  • Output: A character vector object of the arguments concatenated by the / path separator (unless an alternative path separator is specified in fsep)
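
As a minimal sketch of the fsep argument (which we otherwise ignore), you can supply a different separator; by default file.path() uses /:

# Default separator
file.path('research_project', 'data')
#> [1] "research_project/data"

# Supplying a different separator via fsep (rarely needed)
file.path('research_project', 'data', fsep = '\\')
#> [1] "research_project\\data"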


Example: Creating file path objects using file.path()

Recall our example research_project directory from earlier:

research_project/
|
|- data/
|- scripts/
|- output/
  |- tables/
  |- figures/


Let’s use file.path() to create file path objects for some of these directories:

# Pass in each section of the path as a separate argument
file.path('.', 'research_project', 'data')
#> [1] "./research_project/data"


We would usually create and save these objects at the top of our script to be used later on:

# Create file path object for `data` directory
data_dir <- file.path('.', 'research_project', 'data')
data_dir
#> [1] "./research_project/data"

# Create file path object for `output` directory
output_dir <- file.path('.', 'research_project', 'output')
output_dir
#> [1] "./research_project/output"

# Create file path object for `tables` directory
tables_dir <- file.path('.', 'research_project', 'output', 'tables')
tables_dir
#> [1] "./research_project/output/tables"


Note that the object created by file.path() is just a character vector containing the path:

# Investigate file path object
output_dir %>% str()
#>  chr "./research_project/output"


Since the file path object is just a regular character vector, we could use that as input to file.path() to help create subdirectory path objects:

# Create file path object for `figures` directory using `output_dir`
figures_dir <- file.path(output_dir, 'figures')
figures_dir
#> [1] "./research_project/output/figures"


Similarly, we can use the file path object anywhere that we would normally input a file path:

getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"
# List the contents of the `output` directory
list.files(path = output_dir)
#> [1] "figures" "tables"

Example: Adding a figure to the figures_dir

Let’s download an image from R for Data Science to the figures_dir we created:

# We will introduce the download.file() function in the next section
download.file(url = 'https://d33wubrfki0l68.cloudfront.net/8b89c5554ed6108359d59909d441dbeb010e8802/9f366/visualize_files/figure-html/unnamed-chunk-7-1.png',
              destfile = file.path(figures_dir, 'scatterplot.png'))

# confirm figure is there:
list.files(path = figures_dir)
#> [1] "scatterplot.png"


We can use the file path object figures_dir to help us refer to the saved image:

# Display image using include_graphics()
knitr::include_graphics(path = file.path(figures_dir, 'scatterplot.png'))

4 Downloading and unzipping data

There are many functions available to read in various types of data. The most common types that we will cover are:

Format                        | Package | Function
Comma-separated values (.csv) | readr   | read_csv()
Excel (.xls or .xlsx)         | readxl  | read_excel()
Stata (.dta)                  | haven   | read_dta()
R (.rds)                      | base R  | readRDS()
R (.Rdata)                    | base R  | load()

Note:

  • .rds files can store a single R object
  • .Rdata files can store multiple R objects


What can these functions read in?

  • All the functions can take a file path (absolute or relative) to your local data file
  • Some functions can take a URL to the file directly on the web
    • This method saves time and reduces the steps of downloading, saving, and reading in data
  • Some functions can also read in literal data (i.e., passed in as a string)


As we transition into learning about reading/writing data, let’s first take a look at some helpful functions to download and unzip data files, as you may need to use them when obtaining your data.

4.1 download.file() function

Although it is most convenient to read in data directly from the web, not all functions support that. In addition, we may want to download and save a copy of the data locally in some cases. This can be done using the download.file() function.


The download.file() function:

?download.file

# SYNTAX AND DEFAULT VALUES
download.file(url, destfile, method, quiet = FALSE, mode = "w",
              cacheOK = TRUE,
              extra = getOption("download.file.extra"),
              headers = NULL, ...)
  • Function: Downloads file from the Internet
  • Arguments
    • url: URL of a resource to be downloaded
    • destfile: Name where the downloaded file is saved
      • This can specify both where you want the downloaded file to be saved and what you want the file to be named


Example: Downloading data from the Internet using download.file()

We will be downloading the 2019 Institutional Characteristics data dictionary file from the IPEDS Data Center. Select the year and survey from the drop down and right-click the “Dictionary” link to obtain the URL to download:

# Recall the file path object to the `data` directory we created earlier
data_dir
#> [1] "./research_project/data"

# Download data dictionary file to `data` directory
download.file(url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2019_Dict.zip',
              destfile = file.path(data_dir, 'hd2019_dictionary.zip'))  # rename downloaded file

# Check where we downloaded the data (i.e., the `destfile` arg from above)
file.path(data_dir, 'hd2019_dictionary.zip')
#> [1] "./research_project/data/hd2019_dictionary.zip"

# Confirm that the file has been downloaded in `data` folder
list.files(path = data_dir)
#> [1] "hd2019_dictionary.zip"

4.2 unzip() function

Some files downloaded from the web, like the one in the example above, come as zip archives. The unzip() function can be used to extract the zipped contents.


The unzip() function:

?unzip

# SYNTAX AND DEFAULT VALUES
unzip(zipfile, files = NULL, list = FALSE, overwrite = TRUE,
      junkpaths = FALSE, exdir = ".", unzip = "internal",
      setTimes = FALSE)
  • Function: Extract files from or list a zip archive
  • Arguments
    • zipfile: Path to zip file (including file name)
    • exdir: The directory to extract files to


Example: Extracting zipped data file using unzip()

Continuing from the previous example, we can use unzip() to extract the contents of the downloaded file:

# Extract data dictionary file
unzip(zipfile = file.path(data_dir, 'hd2019_dictionary.zip'),
      exdir = data_dir)  # extract to `data` folder

# Check that the file has been extracted in `data` folder
list.files(path = data_dir)
#> [1] "hd2019_dictionary.zip" "hd2019.xlsx"

5 readr package

The readr package:

?readr
  • The readr package contains functions to “read rectangular text data (like ‘csv’, ‘tsv’, and ‘fwf’)” into R [doc]
  • readr is part of tidyverse, and it is automatically loaded every time you load tidyverse
  • Comma-separated values (CSV) files are delimited text files that use commas to separate values
  • The functions to read and write CSV files are read_csv() and write_csv()


readr functions for reading data:

Format                        | Function
Comma-separated values (.csv) | read_csv()
Semicolon-separated values    | read_csv2()
Tab-separated values (.tsv)   | read_tsv()
General delimited files       | read_delim()
Fixed width files             | read_fwf()
Text-formatted data (.txt)    | read_table()
Web log files                 | read_log()
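
The other readr functions work much like read_csv(). As a minimal sketch (using made-up literal data), here is read_tsv() for tab-separated values and read_delim() for an arbitrary delimiter:

# Tab-separated values: columns separated by tab characters
read_tsv(file = "a\tb\tc\n1\t2\t3")
#> # A tibble: 1 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

# General delimited data: specify the delimiter with the `delim` argument
read_delim(file = "a|b|c\n1|2|3", delim = '|')
#> # A tibble: 1 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3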

5.1 read_csv() function

The read_csv() function:

?read_csv

# SYNTAX AND DEFAULT VALUES
read_csv(file, col_names = TRUE, col_types = NULL,
         locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
         quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
         n_max = Inf, guess_max = min(1000, n_max),
         progress = show_progress(), skip_empty_rows = TRUE)
  • Function: Reads in CSV file
  • Arguments
    • file: File path or URL to a CSV file, or literal data
    • col_names: Whether to use first row of input as column names or provide own column names
    • col_types: Specifies data type for columns you read in
    • na: Vector of values to treat as missing value
    • comment: A string used to identify comments
    • skip: Number of lines to skip before reading data


As you can see from above, the read_csv() function provides you with many options for how you can read in the data. We’ll introduce you to some of the most commonly used arguments in read_csv(), but keep in mind this won’t cover them all exhaustively.

5.1.1 file argument

The file argument:

  • Provide URL to read in data directly from the web
  • Provide file path to read in a data file on your computer
  • Provide a string containing comma-separated values to read in literal data

We will be reading data from the Mobility Report Cards: The Role of Colleges in Intergenerational Mobility. This is part of the Equality of Opportunity Project, which uses two data sources – federal tax records and Department of Education records (1999-2013) – to investigate intergenerational income mobility at colleges in the US.

We will be reading in Online Data Table 1 under the Mobility Report Cards: The Role of Colleges in Intergenerational Mobility drop down from the Equality of Opportunity Project Data Page. Right-click the “Excel” link to obtain the URL to read (Note: it is actually a CSV file, not Excel):


Example: Reading data from the web using read_csv()

# Read data from URL
mrc <- read_csv(file = 'http://www.equality-of-opportunity.org/data/college/mrc_table1.csv')

# View first 4 rows and 4 columns 
mrc[1:4, 1:4]
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY   
#> 2        7273 CUNY Bernard M. Baruch College               New York NY   
#> 3        2688 City College Of New York - CUNY              New York NY   
#> 4        7022 CUNY Lehman College                          New York NY


Example: Reading data from local file using read_csv()

If we have downloaded data files on our computer, we can read them in by providing the path to the file:

# First, download Chetty data file
download.file(url = 'http://www.equality-of-opportunity.org/data/college/mrc_table1.csv',
              destfile = file.path(data_dir, 'mrc_table1.csv'))  # save to `data` folder

# Read data from local file
mrc <- read_csv(file = file.path(data_dir, 'mrc_table1.csv'))


Example: Reading in literal data from string using read_csv()

We can also provide literal comma-separated data in the form of a string to be read:

# Read literal data
mrc <- read_csv(
  file = "super_opeid,name,czname,state
          2665,Vaughn College Of Aeronautics And Technology,New York,NY
          7273,CUNY Bernard M. Baruch College,New York,NY
          2688,City College Of New York - CUNY,New York,NY
          7022,CUNY Lehman College,New York,NY"
)

mrc
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY   
#> 2        7273 CUNY Bernard M. Baruch College               New York NY   
#> 3        2688 City College Of New York - CUNY              New York NY   
#> 4        7022 CUNY Lehman College                          New York NY

Note that for a real project, you would typically be reading in data from a file, rather than providing literal data. But for the purpose of experimentation and providing examples (as in the next few sections), it is helpful to use literal data so we are able to see the data.

5.1.2 col_names argument

The col_names argument:

  • If TRUE, the first row of the input will be used as the column names (default)
  • If FALSE, column names will be generated automatically (X1, X2, X3, etc.)
  • If you provide a character vector, the values will be used as the names of the columns

Example: Setting col_names to TRUE in read_csv() (default)
read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T",
  col_names = TRUE
)
#> # A tibble: 2 x 3
#>       a     b c    
#>   <dbl> <dbl> <lgl>
#> 1     1     2 FALSE
#> 2     4     5 TRUE

Example: Setting col_names to FALSE in read_csv()
read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T",
  col_names = FALSE
)
#> # A tibble: 3 x 3
#>   X1    X2    X3   
#>   <chr> <chr> <chr>
#> 1 a     b     c    
#> 2 1     2     F    
#> 3 4     5     T

Example: Providing character vector for col_names in read_csv()
read_csv(
  file = "1, 2, F
          4, 5, T",
  col_names = c('a', 'b', 'c')
)
#> # A tibble: 2 x 3
#>       a     b c    
#>   <dbl> <dbl> <lgl>
#> 1     1     2 FALSE
#> 2     4     5 TRUE

5.1.3 col_types argument

By default, read_csv() attempts to guess each column’s data type (e.g., character, double) by looking at the first 1000 rows. But you can manually specify the data type for the columns you read in using col_types.

The col_types argument:

  • If NULL, column data types guessed from the first 1000 rows (default)
  • Use cols() to specify data type for all columns
  • Use cols_only() to specify data type and read in only a subset of the columns
    • This is particularly useful for reading in one variable at a time to check that its type looks good
  • Use compact string representation where each character represents one column

To specify column type:

Column Type | Parser function             | Character representation
Logical     | col_logical()               | l
Integers    | col_integer()               | i
Doubles     | col_double()                | d
Characters  | col_character()             | c
Numbers     | col_number()                | n
Factors     | col_factor(levels, ordered) | f
Dates       | col_date(format = "")       | D
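
As a minimal sketch of one of the less obvious parsers, here is col_date() with a format string (the literal data below is made up):

df <- read_csv(
  file = "id, enrolled
          1, 08/15/2019
          2, 01/06/2020",
  col_types = cols(
    id = col_integer(),
    enrolled = col_date(format = '%m/%d/%Y')
  )
)

df
#> # A tibble: 2 x 2
#>      id enrolled  
#>   <int> <date>    
#> 1     1 2019-08-15
#> 2     2 2020-01-06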

Example: Setting col_types to NULL in read_csv() (default)

We can see that read_csv() guessed double, double, and logical as the data type of the 3 columns:

df <- read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T",
  col_types = NULL
)

str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ a: num [1:2] 1 4
#>  $ b: num [1:2] 2 5
#>  $ c: logi [1:2] FALSE TRUE
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   a = col_double(),
#>   ..   b = col_double(),
#>   ..   c = col_logical()
#>   .. )

Example: Setting col_types using cols() in read_csv()

We can manually choose the data type of columns using cols() and the corresponding parser function:

df <- read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T", 
  col_types = cols(
       a = col_factor(c('1', '2', '3', '4')),
       b = col_character()
    )
)

str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#>  $ b: chr [1:2] "2" "5"
#>  $ c: logi [1:2] FALSE TRUE
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#>   ..   b = col_character(),
#>   ..   c = col_logical()
#>   .. )

Example: Setting col_types using cols_only() in read_csv()

We can use cols_only() to specify data type and read in only the specified columns. This is useful when you want to read in one variable at a time to check that the type looks good.

df <- read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T", 
  col_types = cols_only(
       a = col_factor(c('1', '2', '3', '4'))
    )
)

str(df)
#> spec_tbl_df [2 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#>  - attr(*, "spec")=
#>   .. cols_only(
#>   ..   a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#>   ..   b = col_skip(),
#>   ..   c = col_skip()
#>   .. )

df <- read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T", 
  col_types = cols_only(
       a = col_factor(c('1', '2', '3', '4')),
       c = col_logical()
    )
)

str(df)
#> spec_tbl_df [2 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#>  $ c: logi [1:2] FALSE TRUE
#>  - attr(*, "spec")=
#>   .. cols_only(
#>   ..   a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#>   ..   b = col_skip(),
#>   ..   c = col_logical()
#>   .. )


To summarize, the approach would be (see the sketch after this list):

  • Read in the first column of data using col_types = cols_only(...) and make sure the variable looks good
  • Add the second column of data to cols_only()
  • Add the nth column of data to cols_only()
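
Here is a minimal sketch of that incremental workflow, reusing the toy data from above:

# Step 1: read in only the first column and check that it looks good
df <- read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T",
  col_types = cols_only(
    a = col_integer()
  )
)

# Step 2: once `a` looks good, add the next column to cols_only()
df <- read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T",
  col_types = cols_only(
    a = col_integer(),
    b = col_double()
  )
)

# ...and so on, until all the columns you need are specified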

Example: Setting col_types using compact string representation in read_csv()

For example, the string representation 'icl' specifies the 3 columns to be of type integer, character, and logical, respectively:

df <- read_csv(
  file = "a, b, c
          1, 2, F
          4, 5, T",
  col_types = 'icl'
)

str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ a: int [1:2] 1 4
#>  $ b: chr [1:2] "2" "5"
#>  $ c: logi [1:2] FALSE TRUE
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   a = col_integer(),
#>   ..   b = col_character(),
#>   ..   c = col_logical()
#>   .. )

5.1.4 Other arguments

Other useful arguments for read_csv():

  • na: Vector of values to treat as missing value
    • E.g., na = c(-2, 'TBD'): Treat the values -2 and 'TBD' as missing values
  • comment: A string used to identify comments
    • E.g., comment = '#': Any lines starting with # should be treated as a comment and ignored when reading in data
  • skip: Number of lines to skip before reading data
    • E.g., skip = 5: Skip the first 5 lines when reading in the data

Example: Specifying the na argument in read_csv()

Treat the values -2 and 'TBD' as missing values (i.e., NA values):

read_csv(
  file = "column_1, column_2, column_3
          1, -2, 3
          4, 5, TBD",
  na = c(-2, 'TBD')
)
#> # A tibble: 2 x 3
#>   column_1 column_2 column_3
#>      <dbl>    <dbl>    <dbl>
#> 1        1       NA        3
#> 2        4        5       NA

Example: Specifying the comment argument in read_csv()

Skip the first line of comment that contains some meta information on the data:

read_csv(
  file = "# This file contains data on student charges for the acdemic year.
          a, b, c
          1, 2, 3
          4, 5, 6", 
  comment = '#'
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6


We can specify what character indicates the start of a comment:

read_csv(
  file = "* This file contains data on student charges for the acdemic year.
          a, b, c
          1, 2, 3
          4, 5, 6", 
  comment = '*'
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

Example: Specifying the skip argument in read_csv()

Skip the first 2 lines that contain some meta information on the data:

read_csv(
  file = 
    "This file contains data on student charges for the acdemic year.
     File name: IC2016_AY
     a, b, c
     1, 2, 3
     4, 5, 6", 
  skip = 2
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6


We can also use skip to skip the first row of header data when we want to provide our own headings using col_names:

read_csv(
  file = 
    "This file contains data on student charges for the acdemic year.
     File name: IC2016_AY
     a, b, c
     1, 2, 3
     4, 5, 6", 
  skip = 3,
  col_names = c('colA', 'colB', 'colC')
)
#> # A tibble: 2 x 3
#>    colA  colB  colC
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

5.2 write_csv() function

The write_csv() function:

?write_csv

# SYNTAX AND DEFAULT VALUES
write_csv(x, path, na = "NA", append = FALSE, col_names = !append,
          quote_escape = "double")
  • Function: Writes to CSV file
  • Arguments
    • x: A data frame to write to disk
    • path: Path or connection to write to
    • na: String used to represent missing values in the output file
    • append: Whether or not to overwrite existing file or append to it
    • col_names: Whether or not to write column names at the top of the file
    • quote_escape: The type of escaping to use for quoted values


Example: Writing to CSV file using write_csv()

Recall the Chetty data from earlier. We can write the data from the dataframe to a CSV file:

# Chetty data
mrc
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY   
#> 2        7273 CUNY Bernard M. Baruch College               New York NY   
#> 3        2688 City College Of New York - CUNY              New York NY   
#> 4        7022 CUNY Lehman College                          New York NY

# Write to CSV file
write_csv(x = mrc,
          path = file.path(data_dir, 'mrc.csv'))  # write to `data` folder we created earlier
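
As a small sketch of the append argument, we could add more rows to the existing file instead of overwriting it (assuming mrc.csv was just written as above):

# Append rows to the existing CSV file; column names are not written again
write_csv(x = mrc,
          path = file.path(data_dir, 'mrc.csv'),
          append = TRUE)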

6 readxl package

The readxl package:

?readxl
  • The readxl package is designed to import Excel files into R [doc]
  • readxl is part of tidyverse, so you’ll have the package if you have tidyverse installed. But unlike readr, it is not automatically loaded when you load tidyverse so you’ll need to explicitly load readxl if you want to use it.
  • The function to read Excel files is read_excel()

6.1 read_excel() function

The read_excel() function:

?read_excel

# SYNTAX AND DEFAULT VALUES
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
           col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
           n_max = Inf, guess_max = min(1000, n_max),
           progress = readxl_progress(), .name_repair = "unique")
  • Function: Reads in .xls and .xlsx files
  • Arguments
    • path: Path to the Excel file
    • sheet: Sheet to read – either a string (the name of a sheet) or an integer (the position of the sheet)
    • range: A cell range to read from
      • cell_rows(): Cell rows to read from
      • cell_cols(): Cell columns to read from
    • col_names: Whether to use first row of input as column names or provide own column names
    • col_types: Specifies data type for columns you read in
    • na: Character vector of strings to interpret as missing values
    • n_max: Maximum number of data rows to read


Example: Reading in an Excel spreadsheet using read_excel()

readxl has several example files that we could use as practice:

# List available sample Excel files
readxl_example()
#>  [1] "clippy.xls"    "clippy.xlsx"   "datasets.xls"  "datasets.xlsx"
#>  [5] "deaths.xls"    "deaths.xlsx"   "geometry.xls"  "geometry.xlsx"
#>  [9] "type-me.xls"   "type-me.xlsx"

# Get path to datasets.xlsx
path_to_datasets <- readxl_example('datasets.xlsx')
path_to_datasets
#> [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/readxl/extdata/datasets.xlsx"

# View sheets in datasets.xlsx
excel_sheets(path_to_datasets)
#> [1] "iris"     "mtcars"   "chickwts" "quakes"


If we read in the Excel file without specifying the sheet, it will default to the first sheet:

iris_dataset <- read_excel(path = path_to_datasets)
head(iris_dataset, n = 4)
#> # A tibble: 4 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa


We could also specify a specific sheet to read using the sheet argument:

quakes_dataset <- read_excel(path = path_to_datasets, sheet = 'quakes')
head(quakes_dataset, n = 4)
#> # A tibble: 4 x 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15
#> 3 -26    184.    42   5.4       43
#> 4 -18.0  182.   626   4.1       19

Example: Specifying the range argument in read_excel()

Cell notation in Excel uses letters to indicate columns and numbers to indicate rows (e.g., cell A4 is the cell on the 1st column and 4th row). We can specify the cell range we want to read using this notation:

# Selects range of cells from C1 at top left corner to E4 at bottom right corner
read_excel(path = path_to_datasets, sheet = 'quakes', range = 'C1:E4')
#> # A tibble: 3 x 3
#>   depth   mag stations
#>   <dbl> <dbl>    <dbl>
#> 1   562   4.8       41
#> 2   650   4.2       15
#> 3    42   5.4       43

# Selects rows of cells from row 1 to 3 using cell_rows()
read_excel(path = path_to_datasets, sheet = 'quakes', range = cell_rows(1:3))
#> # A tibble: 2 x 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15

# Selects columns of cells from column A to C using cell_cols()
head(read_excel(path = path_to_datasets, sheet = 'quakes', range = cell_cols('A:C')))
#> # A tibble: 6 x 3
#>     lat  long depth
#>   <dbl> <dbl> <dbl>
#> 1 -20.4  182.   562
#> 2 -20.6  181.   650
#> 3 -26    184.    42
#> 4 -18.0  182.   626
#> 5 -20.4  182.   649
#> 6 -19.7  184.   195


Note that we could also specify the sheet we want to select from using range, thus eliminating the need to specify sheet using the sheet argument:

# Selects range of cells from C1 to E4 in the quakes spreadsheet
read_excel(path = path_to_datasets, range = 'quakes!C1:E4')
#> # A tibble: 3 x 3
#>   depth   mag stations
#>   <dbl> <dbl>    <dbl>
#> 1   562   4.8       41
#> 2   650   4.2       15
#> 3    42   5.4       43

Example: Specifying the n_max argument in read_excel()
# Read in at most 4 rows (if available)
read_excel(path = path_to_datasets, n_max = 4)
#> # A tibble: 4 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa

Example: Reading in local Excel file using read_excel()

Recall the data dictionary we downloaded earlier for the 2019 IPEDS Institutional Characteristics survey (hd2019.xlsx). We will use read_excel() to:

  • Read in the Frequencies sheet
  • Read in only the rows 151 to 154
  • Make sure the column names are: varnumber, varname, codevalue, valuelabel, frequency, percent
  • Read in '-3' as NA values
read_excel(path = file.path(data_dir, 'hd2019.xlsx'),
           sheet = 'Frequencies',
           range = cell_rows(151:154),
           col_names = c('varnumber', 'varname', 'codevalue', 'valuelabel', 'frequency', 'percent'),
           na = '-3')
#> # A tibble: 4 x 6
#>   varnumber varname codevalue valuelabel             frequency percent
#>       <dbl> <chr>   <chr>     <chr>                      <dbl>   <dbl>
#> 1     10091 ICLEVEL NA        {Not available}               32    0.49
#> 2     10096 CONTROL 1         Public                      2056   31.4 
#> 3     10096 CONTROL 2         Private not-for-profit      1905   29.0 
#> 4     10096 CONTROL 3         Private for-profit          2566   39.1

7 haven package

The haven package:

?haven
  • The haven package allows users to import and export data from the following statistical packages: SAS, SPSS, Stata [doc]
  • haven is part of tidyverse so you’ll have the package if you have tidyverse installed. But unlike readr, it is not automatically loaded when you load tidyverse so you’ll need to explicitly load haven if you want to use it.
  • The functions to read and write to Stata .dta files are read_dta() and write_dta()


haven functions for reading data:

Format | Function
SAS    | read_sas()
SPSS   | read_sav()
Stata  | read_dta()

7.1 read_dta() function

The read_dta() function:

?read_dta

# SYNTAX AND DEFAULT VALUES
read_dta(file, encoding = NULL, col_select = NULL, skip = 0,
         n_max = Inf, .name_repair = "unique")
  • Function: Reads in Stata .dta files
  • Arguments
    • file: File path or URL to a .dta file, or literal data
    • skip: Number of lines to skip before reading data
    • n_max: Maximum number of data rows to read


Example: Reading in .dta data file using read_dta()

High school longitudinal surveys from the National Center for Education Statistics (NCES) follow U.S. students from high school through college and the labor market. We will be working with the High School Longitudinal Study of 2009 (HSLS:09):

  • Follows 9th graders from 2009
  • Data collection waves
    • Base Year (2009)
    • First Follow-up (2012)
    • 2013 Update (2013)
    • High School Transcripts (2013-2014)
    • Second Follow-up (2016)
# Read .dta file from URL
hsls <- read_dta(file = 'https://raw.githubusercontent.com/anyone-can-cook/rclass2/main/data/hsls/hsls_sch_small.dta')

# Print first few rows of the hsls dataframe
head(hsls)
#> # A tibble: 6 x 5
#>   sch_id          x1control  x1locale   x1region               a1schcontrol
#>   <chr>           <dbl+lbl> <dbl+lbl>  <dbl+lbl>                  <dbl+lbl>
#> 1 1001   1 [Public]         4 [City]  6 [South]  2 [Unit non-response/comp…
#> 2 1002   1 [Public]         5 [Subur… 6 [South]  4 [Public]                
#> 3 1003   1 [Public]         4 [City]  6 [South]  4 [Public]                
#> 4 1004   2 [Catholic or ot… 5 [Subur… 5 [Midwes… 5 [Private]               
#> 5 1005   1 [Public]         5 [Subur… 6 [South]  4 [Public]                
#> 6 1006   1 [Public]         5 [Subur… 6 [South]  4 [Public]

7.2 write_dta() function

The write_dta() function:

?write_dta

# SYNTAX AND DEFAULT VALUES
write_dta(data, path, version = 14, label = attr(data, "label"))
  • Function: Writes to Stata .dta files
  • Arguments
    • data: Dataframe to write
    • path: Path to a file where the data will be written


Example: Writing to .dta data file using write_dta()

# Write the first few rows of the hsls dataframe to a .dta file
write_dta(data = head(hsls),
          path = file.path(data_dir, 'hsls_sch_small_subset.dta'))

8 Saving and loading R objects

There are base R functions for saving R objects, such as dataframes, to R data files. This is useful when you want to preserve data structures, such as column data types of a dataframe.

We’ll be looking at saving and loading R objects with .RDS and .RData file types. A summary of the differences is below and we will provide examples in the following sections.

.RDS files                                         | .RData files
Can store only 1 R object                          | Can store 1 or more R object(s)
Use saveRDS() to save object                       | Use save() to save object(s)
Use readRDS() to load object                       | Use load() to load object(s)
Need to assign loaded object in order to retain it | Object(s) will be loaded directly into your environment

8.1 .RDS files

Single R objects (e.g., a single data frame, or a single character vector) can be saved to .RDS files using saveRDS() and loaded again using readRDS().


The saveRDS() function:

?saveRDS

# SYNTAX AND DEFAULT VALUES
saveRDS(object, file = "", ascii = FALSE, version = NULL,
        compress = TRUE, refhook = NULL)
  • Function: Saves single R object to .RDS file
  • Arguments
    • object: R object to write
    • file: Name of the file where the R object is saved to


Example: Saving single R object to .RDS file using saveRDS()

# Save the `mrc` dataframe from earlier
saveRDS(object = mrc, file = file.path(data_dir, 'mrc.RDS'))


The readRDS() function:

?readRDS

# SYNTAX AND DEFAULT VALUES
readRDS(file, refhook = NULL)
  • Function: Loads single R object from .RDS file
  • Arguments
    • file: Name of the file where the R object is loaded from
  • Notes
    • You need to assign the loaded object using <- to retain it in your environment


Example: Loading single R object from .RDS file using readRDS()

# Load `mrc` dataframe
mrc_df <- readRDS(file = file.path(data_dir, 'mrc.RDS'))

mrc_df
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY   
#> 2        7273 CUNY Bernard M. Baruch College               New York NY   
#> 3        2688 City College Of New York - CUNY              New York NY   
#> 4        7022 CUNY Lehman College                          New York NY

8.2 .RData files

One or more R objects can be saved to .RData files using save() and loaded again using load().


The save() function:

?save

# SYNTAX AND DEFAULT VALUES
save(..., list = character(),
     file = stop("'file' must be specified"),
     ascii = FALSE, version = NULL, envir = parent.frame(),
     compress = isTRUE(!ascii), compression_level,
     eval.promises = TRUE, precheck = TRUE)
  • Function: Saves R object(s) to .RData file
  • Arguments
    • ...: Name(s) of the R object(s) to be saved
    • file: Name of the file where the R object(s) are saved to


Example: Saving multiple R objects to .RData file using save()

# Save the `iris_dataset` and `quakes_dataset` dataframes from earlier
save(iris_dataset, quakes_dataset, file = file.path(data_dir, 'datasets.RData'))


The load() function:

?load

# SYNTAX AND DEFAULT VALUES
load(file, envir = parent.frame(), verbose = FALSE)
  • Function: Loads R object(s) from .RData file
  • Arguments
    • file: Name of the file where the R object(s) are loaded from
  • Notes
    • You do NOT need to assign the loaded object(s) using <- because they will all be loaded into your environment with their original names


Example: Loading multiple R objects from .RData file using load()

# Load `iris_dataset` and `quakes_dataset` dataframes
load(file = file.path(data_dir, 'datasets.RData'))
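
Because load() places the objects directly into your environment under their original names (no assignment needed), a quick check confirms they are available:

# Confirm the loaded objects exist in the environment
exists('iris_dataset')
#> [1] TRUE
exists('quakes_dataset')
#> [1] TRUE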

8.3 Loading from URLs

Besides loading local data files, we can also load them in directly from the web using url().


The url() function:

?url

# SYNTAX AND DEFAULT VALUES
url(description, open = "", blocking = TRUE,
    encoding = getOption("encoding"),
    method = getOption("url.method", "default"),
    headers = NULL)
  • Function: Open connection to URL
  • Arguments
    • description: Character string containing the URL


Example: Loading .RDS file from URL using url() and readRDS()

# Load `recruit_school_somevars.RDS` file from URL
df_school <- readRDS(file = url(description = 'https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/recruit_school_somevars.RDS'))


Example: Loading .RData file from URL using url() and load()

# Load `recruit_school_somevars.RData` file from URL
load(file = url(description = 'https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/recruit_school_somevars.RData'))