Load packages:
library(dplyr)
library(readr)
library(readxl)
library(haven)
When you begin a new project, it is a good idea to have a consistent and efficient way of organizing your folders and files. This can help save you time and trouble later down the road, as you will know exactly where everything is when you need to look for them. Another common task you may encounter in your project is reading and writing data. This lecture covers these two fundamental topics for helping you get started with a project.
Organizing folders and files:
Reading and writing data:
Common file formats to read from:
Format | Package | Function |
---|---|---|
Comma-separated values (.csv) | readr | read_csv() |
Text-formatted data (.txt) | readr | read_table() |
Tab-separated values (.tsv) | readr | read_tsv() |
Excel (.xls or .xlsx) | readxl | read_excel() |
Stata (.dta) | haven | read_dta() |
SPSS (.sav) | haven | read_sav() |
SAS (.sas7bdat) | haven | read_sas() |
R (.rds) | base R | readRDS() |
R (.Rdata) | base R | load() |
Source: Professor Darin Christensen
Note:

- .rds files can store a single R object
- .Rdata files can store multiple R objects
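The distinction matters for how you load the data back. Here is a minimal sketch (the file names and the use of a temporary directory are illustrative):

```r
# .rds stores a single object; assign the result of readRDS()
rds_path <- file.path(tempdir(), 'mtcars.rds')
saveRDS(object = mtcars, file = rds_path)
mtcars_copy <- readRDS(file = rds_path)

# .Rdata can store multiple objects; load() restores them into the
# current environment under their original names
rdata_path <- file.path(tempdir(), 'datasets.Rdata')
save(mtcars, iris, file = rdata_path)
load(file = rdata_path)  # recreates `mtcars` and `iris`
```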
Common file formats to write to:

Format | Package | Function |
---|---|---|
Comma-separated values (.csv) | readr | write_csv() |
Stata (.dta) | haven | write_dta() |
R (.rds) | base R | saveRDS() |
R (.Rdata) | base R | save() |
What is a directory and how to organize one?

- Keep project files organized in subdirectories: a data/ subdirectory to hold all data files, a scripts/ subdirectory to hold your scripts, etc.
- my_project is the “root directory” for your research project (sometimes called the “project directory” or “project root directory”)
- data, scripts, and figures are sub-directories of your root directory
- my_project.Rproj is your RStudio project file and lives in the root directory for your research project

my_project/
|
|- data/
|- scripts/
|- figures/
|- my_project.Rproj
How to create an RStudio project?

- Select New Project under the dropdown menu
- Choose Existing Directory
- Under Project working directory, browse for your folder and click Create Project

Why use an RStudio project?
What is a working directory?

- The working directory differs depending on whether you are working in an R markdown file (.Rmd) vs. an R script (.R)
- Use the getwd() function to check your current working directory
The getwd() function:
?getwd
# SYNTAX
getwd()
When you run R code in an R markdown file, the working directory is the directory that your .Rmd
file is in (the directory where the .Rmd file is saved):
getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"
When you run an R script, the working directory is the directory indicated at the top of your console in RStudio:
What is the home directory?

- Your home directory is what RStudio labels Home in the bottom right file viewer panel
- On a Mac, the home directory is typically /Users/<username>; on Windows, C:\Users\<username>
- The tilde (~) refers to the user’s home directory
- You can check the home directory and confirm what ~ expands to in R [see here]

# Check home directory
Sys.getenv('HOME')
#> [1] "/Users/cyouh95"
# Confirm that ~ denotes a user's home directory
path.expand('~')
#> [1] "/Users/cyouh95"
To summarize, your working directory will be as follows in the various scenarios:

 | Not working from RStudio Project | Working from RStudio Project |
---|---|---|
In R Markdown code chunk | Directory that the .Rmd file is in | Directory that the .Rmd file is in |
In R script | Home directory | Project root directory |
In R console | Home directory | Project root directory |
What are file paths?

- A file path specifies the location of a file or directory, with directories separated by a forward slash (/) or backward slash (\) depending on the operating system
  - Mac: /path/to/file
  - Windows: C:\path\to\file
- R accepts / in file paths regardless of whether you’re a Mac or PC user

/path/to/my_project/ # E.g., /Users/my_username/Desktop/
|
|- my_project/
   |
   |- data/
   |- scripts/
There are two types of file paths: absolute and relative.

Absolute file paths start from the root of the file system:

- /Users/my_username/Desktop/my_project/ is the absolute file path to the my_project/ directory
- /Users/my_username/Desktop/my_project/data/ is the absolute file path to the data/ directory
- /Users/my_username/Desktop/my_project/scripts/ is the absolute file path to the scripts/ directory

Relative file paths start from your current location. If you are currently in the Desktop/ folder in the example above:

- ./my_project/ is the relative file path to the my_project/ directory
- ./my_project/data/ is the relative file path to the data/ directory

If you are currently in the data/ folder:

- ./ is the relative file path to the data/ directory (i.e., folder you are currently in)
- ../ is the relative file path to the my_project/ directory
- ../scripts/ is the relative file path to the scripts/ directory
As seen, relative file paths use dots to indicate the relative directory:

Key | Description |
---|---|
. | this directory (i.e., current directory) |
.. | up a directory (i.e., parent directory) |
/ | separates directories in a file path |
Note that the leading ./ and trailing / in relative paths are not mandatory:

- ./my_project/ is equivalent to my_project, my_project/, and ./my_project
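A quick way to convince yourself of these equivalences is to check for a directory using each spelling. This sketch uses a temporary directory so it can run anywhere:

```r
# Create a folder inside a temporary directory and make it the working directory
old_wd <- getwd()
setwd(tempdir())
dir.create('my_project', showWarnings = FALSE)

# All four spellings point to the same directory
dir.exists('my_project')      # TRUE
dir.exists('my_project/')     # TRUE
dir.exists('./my_project')    # TRUE
dir.exists('./my_project/')   # TRUE

setwd(old_wd)  # restore the original working directory
```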
What are the advantages of using relative file paths?

- Relative file paths are more portable: an absolute path like /Users/<username>/path/to/file has your specific username in it, so it will not work on a collaborator's computer

Let’s create a directory for a hypothetical research project called research_project. Inside this directory, we’ll create separate subdirectories for data, scripts, output, output/tables, and output/figures:
research_project/
|
|- data/
|- scripts/
|- output/
|- tables/
|- figures/
# delete research_project folder in case it already exists
unlink(x = 'research_project', recursive = TRUE, force = TRUE)
# Check working directory to see where your directory will be created
getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"
list.files()
#> [1] "organizing_and_io.html" "organizing_and_io.Rmd"
# Create `research_project` directory
dir.create(path = "research_project")
# Create `data` and `scripts` within `research_project`
dir.create(path = "research_project/data")
dir.create(path = "research_project/scripts")
# List the contents of `research_project` directory
list.files(path = "research_project")
#> [1] "data" "scripts"
# Though we did not create the `output` directory yet, we can still directly create the `output/tables` subdirectory if we specify `recursive = TRUE`, which will create both `output` and `tables`
dir.create(path = "research_project/output/tables", recursive = TRUE)
dir.create(path = "research_project/output/figures")
# List the contents of `output` directory
list.files(path = "research_project/output")
#> [1] "figures" "tables"
# To delete `research_project`, remember you need `recursive = TRUE` to delete a directory
# unlink(x = "research_project", recursive = TRUE)
What is an R script?

- Lines that begin with # indicate that they are lines of comments
- To run the code, click Run at the top of the script, or press ctrl + enter / cmd + enter on your keyboard
- In contrast, .Rmd files are mostly useful for combining text and code (e.g., writing a report) and presenting it in various formats (e.g., PDF, HTML)

How to organize an R script?
################################################################################
##
## [ PROJ ] < Name of the overall project >
## [ FILE ] < Name of this particular file >
## [ AUTH ] < Your name + email / Twitter / GitHub handle >
## [ INIT ] < Date you started the file >
##
################################################################################
## ---------------------------
## libraries
## ---------------------------
## ---------------------------
## directory paths
## ---------------------------
## ---------------------------
## functions
## ---------------------------
## -----------------------------------------------------------------------------
## < BODY >
## -----------------------------------------------------------------------------
## ---------------------------
## input
## ---------------------------
## ---------------------------
## process
## ---------------------------
## ---------------------------
## output
## ---------------------------
## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------
Source: R script template by Ben Skinner
> We use the file.path() command because it is smart. Some computer operating systems use forward slashes, /, for their file paths; others use backslashes, \. Rather than try to guess or assume what operating system future users will use, we can use R’s function, file.path(), to check the current operating system and build the paths correctly for us.
Source: Organizing Lecture by Ben Skinner
The file.path() function:
?file.path
# SYNTAX AND DEFAULT VALUES
file.path(..., fsep = .Platform$file.sep)
- ... : File path component(s)
- fsep : The path separator to use (default is /)
- Components are joined using the / path separator (unless an alternative path separator is specified in fsep)
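For example (the fsep call below is only to illustrate the argument; you would rarely need it in practice):

```r
# Components are joined with the default `/` separator
file.path('my_project', 'data', 'file.csv')
#> [1] "my_project/data/file.csv"

# An alternative separator can be supplied via `fsep`
file.path('my_project', 'data', fsep = '\\')
#> [1] "my_project\\data"
```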
Example: Creating file path objects using file.path()
Recall our example research_project
directory from earlier:
research_project/
|
|- data/
|- scripts/
|- output/
|- tables/
|- figures/
Let’s use file.path()
to create file path objects for some of these directories:
# Pass in each section of the path as a separate argument
file.path('.', 'research_project', 'data')
#> [1] "./research_project/data"
We would usually create and save these objects at the top of our script to be used later on:
# Create file path object for `data` directory
data_dir <- file.path('.', 'research_project', 'data')
data_dir
#> [1] "./research_project/data"
# Create file path object for `output` directory
output_dir <- file.path('.', 'research_project', 'output')
output_dir
#> [1] "./research_project/output"
# Create file path object for `tables` directory
tables_dir <- file.path('.', 'research_project', 'output', 'tables')
tables_dir
#> [1] "./research_project/output/tables"
Note that the object created by file.path()
is just a character vector containing the path:
# Investigate file path object
output_dir %>% str()
#> chr "./research_project/output"
Since the file path object is just a regular character vector, we could use that as input to file.path()
to help create subdirectory path objects:
# Create file path object for `figures` directory using `output_dir`
figures_dir <- file.path(output_dir, 'figures')
figures_dir
#> [1] "./research_project/output/figures"
Similarly, we can use the file path object anywhere that we would normally input a file path:
getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"
# List the contents of the `output` directory
list.files(path = output_dir)
#> [1] "figures" "tables"
Let’s download an image from R for Data Science to the figures_dir
we created:
# We will introduce the download.file() function in the next section
download.file(url = 'https://d33wubrfki0l68.cloudfront.net/8b89c5554ed6108359d59909d441dbeb010e8802/9f366/visualize_files/figure-html/unnamed-chunk-7-1.png',
destfile = file.path(figures_dir, 'scatterplot.png'))
# confirm figure is there:
list.files(path = figures_dir)
#> [1] "scatterplot.png"
We can use the file path object figures_dir
to help us refer to the saved image:
# Display image using include_graphics()
knitr::include_graphics(path = file.path(figures_dir, 'scatterplot.png'))
There are many functions available to read in various types of data. The most common types that we will cover are:
Format | Package | Function |
---|---|---|
Comma-separated values (.csv) | readr | read_csv() |
Excel (.xls or .xlsx) | readxl | read_excel() |
Stata (.dta) | haven | read_dta() |
R (.rds) | base R | readRDS() |
R (.Rdata) | base R | load() |
Note:

- .rds files can store a single R object
- .Rdata files can store multiple R objects
What can these functions read in?
As we transition into learning about reading/writing data, let’s first take a look at some helpful functions to download and unzip data files, as you may need to use them when obtaining your data.
download.file() function

Although it is most convenient to read in data directly from the web, not all functions support that. In addition, we may want to download and save a copy of the data locally in some cases. This can be done using the download.file() function.
The download.file()
function:
?download.file
# SYNTAX AND DEFAULT VALUES
download.file(url, destfile, method, quiet = FALSE, mode = "w",
cacheOK = TRUE,
extra = getOption("download.file.extra"),
headers = NULL, ...)
- url : URL of a resource to be downloaded
- destfile : Name where the downloaded file is saved
Example: Downloading data from the Internet using download.file()
We will be downloading the 2019 Institutional Characteristics data dictionary file from the IPEDS Data Center. Select the year and survey from the drop down and right-click the “Dictionary” link to obtain the URL to download:
# Recall the file path object to the `data` directory we created earlier
data_dir
#> [1] "./research_project/data"
# Download data dictionary file to `data` directory
download.file(url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2019_Dict.zip',
destfile = file.path(data_dir, 'hd2019_dictionary.zip')) # rename downloaded file
# Check where we downloaded the data (i.e., the `destfile` arg from above)
file.path(data_dir, 'hd2019_dictionary.zip')
#> [1] "./research_project/data/hd2019_dictionary.zip"
# Confirm that the file has been downloaded in `data` folder
list.files(path = data_dir)
#> [1] "hd2019_dictionary.zip"
unzip() function

Some files downloaded from the web, like the example above, are contained in a zip folder. The unzip() function can be used to extract the zipped contents.
The unzip()
function:
?unzip
# SYNTAX AND DEFAULT VALUES
unzip(zipfile, files = NULL, list = FALSE, overwrite = TRUE,
junkpaths = FALSE, exdir = ".", unzip = "internal",
setTimes = FALSE)
- zipfile : Path to zip file (including file name)
- exdir : The directory to extract files to
Example: Extracting zipped data file using unzip()
Continuing from the previous example, we can use unzip()
to extract the contents of the downloaded file:
# Extract data dictionary file
unzip(zipfile = file.path(data_dir, 'hd2019_dictionary.zip'),
exdir = data_dir) # extract to `data` folder
# Check that the file has been extracted in `data` folder
list.files(path = data_dir)
#> [1] "hd2019_dictionary.zip" "hd2019.xlsx"
readr package

The readr package:

?readr

- The readr package contains functions to “read rectangular text data (like ‘csv’, ‘tsv’, and ‘fwf’)” into R [doc]
- readr is part of tidyverse, and it is automatically loaded every time you load tidyverse
- In this lecture, we focus on read_csv() and write_csv()

readr functions for reading data:
Format | Function |
---|---|
Comma-separated values (.csv) | read_csv() |
Semicolon-separated values | read_csv2() |
Tab-separated values (.tsv) | read_tsv() |
General delimited files | read_delim() |
Fixed width files | read_fwf() |
Text-formatted data (.txt) | read_table() |
Web log files | read_log() |
read_csv() function

The read_csv() function:
?read_csv
# SYNTAX AND DEFAULT VALUES
read_csv(file, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
n_max = Inf, guess_max = min(1000, n_max),
progress = show_progress(), skip_empty_rows = TRUE)
- file : File path or URL to a CSV file, or literal data
- col_names : Whether to use first row of input as column names or provide own column names
- col_types : Specifies data type for columns you read in
- na : Vector of values to treat as missing value
- comment : A string used to identify comments
- skip : Number of lines to skip before reading data
As you can see from above, the read_csv()
function provides you with many options for how you can read in the data. We’ll introduce you to some of the most commonly used arguments in read_csv()
, but keep in mind this won’t cover them all exhaustively.
file argument

The file argument:
We will be reading data from the Mobility Report Cards: The Role of Colleges in Intergenerational Mobility. This is part of the Equality of Opportunity Project, which uses two data sources – federal tax records and Department of Education records (1999-2013) – to investigate intergenerational income mobility at colleges in the US.
We will be reading in Online Data Table 1 under the Mobility Report Cards: The Role of Colleges in Intergenerational Mobility drop down from the Equality of Opportunity Project Data Page. Right-click the “Excel” link to obtain the URL to read (Note: it is actually a CSV file, not Excel):
Example: Reading data from the web using read_csv()
# Read data from URL
mrc <- read_csv(file = 'http://www.equality-of-opportunity.org/data/college/mrc_table1.csv')
# View first 4 rows and 4 columns
mrc[1:4, 1:4]
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY
#> 2        7273 CUNY Bernard M. Baruch College               New York NY
#> 3        2688 City College Of New York - CUNY              New York NY
#> 4        7022 CUNY Lehman College                          New York NY
Example: Reading data from local file using read_csv()
If we have downloaded data files on our computer, we can read them in by providing the path to the file:
# First, download Chetty data file
download.file(url = 'http://www.equality-of-opportunity.org/data/college/mrc_table1.csv',
destfile = file.path(data_dir, 'mrc_table1.csv')) # save to `data` folder
# Read data from local file
mrc <- read_csv(file = file.path(data_dir, 'mrc_table1.csv'))
Example: Reading in literal data from string using read_csv()
We can also provide literal comma-separated data in the form of a string to be read:
# Read literal data
mrc <- read_csv(
file = "super_opeid,name,czname,state
2665,Vaughn College Of Aeronautics And Technology,New York,NY
7273,CUNY Bernard M. Baruch College,New York,NY
2688,City College Of New York - CUNY,New York,NY
7022,CUNY Lehman College,New York,NY"
)
mrc
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY
#> 2        7273 CUNY Bernard M. Baruch College               New York NY
#> 3        2688 City College Of New York - CUNY              New York NY
#> 4        7022 CUNY Lehman College                          New York NY
Note that for a real project, you would typically be reading in data from a file, rather than providing literal data. But for the purpose of experimentation and providing examples (as in the next few sections), it is helpful to use literal data so we are able to see the data.
col_names argument

The col_names argument:
- If TRUE, the first row of the input will be used as the column names (default)
- If FALSE, column names will be generated automatically (X1, X2, X3, etc.)
- Alternatively, provide a character vector to use as the column names

Example: Setting col_names to TRUE in read_csv() (default)
read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_names = TRUE
)
#> # A tibble: 2 x 3
#>       a     b c
#>   <dbl> <dbl> <lgl>
#> 1     1     2 FALSE
#> 2     4     5 TRUE
Example: Setting col_names to FALSE in read_csv()
read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_names = FALSE
)
#> # A tibble: 3 x 3
#>   X1    X2    X3
#>   <chr> <chr> <chr>
#> 1 a     b     c
#> 2 1     2     F
#> 3 4     5     T
Example: Providing col_names in read_csv()
read_csv(
file = "1, 2, F
4, 5, T",
col_names = c('a', 'b', 'c')
)
#> # A tibble: 2 x 3
#>       a     b c
#>   <dbl> <dbl> <lgl>
#> 1     1     2 FALSE
#> 2     4     5 TRUE
col_types argument

By default, read_csv() attempts to guess each column’s data type (e.g., character, double) by looking at the first 1000 rows. But you can manually specify the data type for the columns you read in using col_types.

The col_types argument:

- If NULL, column data types guessed from the first 1000 rows (default)
- Use cols() to specify data type for all columns
- Use cols_only() to specify data type and read in only a subset of the columns
To specify column type:

Column Type | Parser function | Character representation |
---|---|---|
Logical | col_logical() | l |
Integers | col_integer() | i |
Doubles | col_double() | d |
Characters | col_character() | c |
Numbers | col_number() | n |
Factors | col_factor(levels, ordered) | f |
Dates | col_date(format = "") | D |
Example: Setting col_types to NULL in read_csv() (default)

We can see that read_csv() guessed double, double, and logical as the data type of the 3 columns:
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = NULL
)
str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: num [1:2] 1 4
#> $ b: num [1:2] 2 5
#> $ c: logi [1:2] FALSE TRUE
#> - attr(*, "spec")=
#> .. cols(
#> .. a = col_double(),
#> .. b = col_double(),
#> .. c = col_logical()
#> .. )
Example: Setting col_types using cols() in read_csv()

We can manually choose the data type of columns using cols() and the corresponding parser function:
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = cols(
a = col_factor(c('1', '2', '3', '4')),
b = col_character()
)
)
str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#> $ b: chr [1:2] "2" "5"
#> $ c: logi [1:2] FALSE TRUE
#> - attr(*, "spec")=
#> .. cols(
#> .. a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#> .. b = col_character(),
#> .. c = col_logical()
#> .. )
Example: Setting col_types using cols_only() in read_csv()

We can use cols_only() to specify data type and read in only the specified columns. This is useful when you want to read in one variable at a time to check that the type looks good.
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = cols_only(
a = col_factor(c('1', '2', '3', '4'))
)
)
str(df)
#> spec_tbl_df [2 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#> - attr(*, "spec")=
#> .. cols_only(
#> .. a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#> .. b = col_skip(),
#> .. c = col_skip()
#> .. )
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = cols_only(
a = col_factor(c('1', '2', '3', '4')),
c = col_logical()
)
)
str(df)
#> spec_tbl_df [2 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#> $ c: logi [1:2] FALSE TRUE
#> - attr(*, "spec")=
#> .. cols_only(
#> .. a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#> .. b = col_skip(),
#> .. c = col_logical()
#> .. )
To summarize, the approach would be:

- Read in one variable using col_types = cols_only(...) and make sure the variable looks good
- Add the next variable to cols_only() and check it
- Repeat until all desired variables are specified in cols_only()

Example: Setting col_types using compact string representation in read_csv()
For example, the string representation 'icl'
specifies the 3 columns to be of type integer
, character
, and logical
, respectively:
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = 'icl'
)
str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: int [1:2] 1 4
#> $ b: chr [1:2] "2" "5"
#> $ c: logi [1:2] FALSE TRUE
#> - attr(*, "spec")=
#> .. cols(
#> .. a = col_integer(),
#> .. b = col_character(),
#> .. c = col_logical()
#> .. )
Other useful arguments for read_csv():

- na : Vector of values to treat as missing value
  - na = c(-2, 'TBD') : Treat the values -2 and 'TBD' as missing values
- comment : A string used to identify comments
  - comment = '#' : Any lines starting with # should be treated as a comment and ignored when reading in data
- skip : Number of lines to skip before reading data
  - skip = 5 : Skip the first 5 lines when reading in the data

Example: Using na argument in read_csv()

Treat the values -2 and 'TBD' as missing values (i.e., NA values):
read_csv(
file = "column_1, column_2, column_3
1, -2, 3
4, 5, TBD",
na = c(-2, 'TBD')
)
#> # A tibble: 2 x 3
#>   column_1 column_2 column_3
#>      <dbl>    <dbl>    <dbl>
#> 1        1       NA        3
#> 2        4        5       NA
Example: Using comment argument in read_csv()

Skip the first line of comment that contains some meta information on the data:
read_csv(
  file = "# This file contains data on student charges for the academic year.
a, b, c
1, 2, 3
4, 5, 6",
comment = '#'
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
We can specify what character indicates the start of a comment:
read_csv(
  file = "* This file contains data on student charges for the academic year.
a, b, c
1, 2, 3
4, 5, 6",
comment = '*'
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
Example: Using skip argument in read_csv()

Skip the first 2 lines that contain some meta information on the data:
read_csv(
file =
  "This file contains data on student charges for the academic year.
File name: IC2016_AY
a, b, c
1, 2, 3
4, 5, 6",
skip = 2
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
We can also use skip
to skip the first row of header data when we want to provide our own headings using col_names
:
read_csv(
file =
"This file contains data on student charges for the acdemic year.
File name: IC2016_AY
a, b, c
1, 2, 3
4, 5, 6",
skip = 3,
col_names = c('colA', 'colB', 'colC')
)
#> # A tibble: 2 x 3
#>    colA  colB  colC
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
write_csv() function

The write_csv() function:

?write_csv

# SYNTAX AND DEFAULT VALUES
write_csv(x, path, na = "NA", append = FALSE, col_names = !append,
  quote_escape = "double")

- x : A data frame to write to disk
- path : Path or connection to write to
- na : String used to represent missing values in the output
- append : Whether or not to overwrite existing file or append to it
- col_names : Whether or not to write column names at the top of the file
- quote_escape : The type of escaping to use for quoted values
Example: Writing to CSV file using write_csv()
Recall the Chetty data from earlier. We can write the data from the dataframe to a CSV file:
# Chetty data
mrc
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY
#> 2        7273 CUNY Bernard M. Baruch College               New York NY
#> 3        2688 City College Of New York - CUNY              New York NY
#> 4        7022 CUNY Lehman College                          New York NY
# Write to CSV file
write_csv(x = mrc,
path = file.path(data_dir, 'mrc.csv')) # write to `data` folder we created earlier
readxl package

The readxl package:

?readxl

- The readxl package is designed to import Excel files into R [doc]
- readxl is part of tidyverse, so you’ll have the package if you have tidyverse installed. But unlike readr, it is not automatically loaded when you load tidyverse, so you’ll need to explicitly load readxl if you want to use it.
- In this section, we focus on read_excel()
read_excel() function

The read_excel() function:
?read_excel
# SYNTAX AND DEFAULT VALUES
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
n_max = Inf, guess_max = min(1000, n_max),
progress = readxl_progress(), .name_repair = "unique")
- Reads both .xls and .xlsx files
- path : Path to the Excel file
- sheet : Sheet to read – either a string (the name of a sheet) or an integer (the position of the sheet)
- range : A cell range to read from
  - cell_rows() : Cell rows to read from
  - cell_cols() : Cell columns to read from
- col_names : Whether to use first row of input as column names or provide own column names
- col_types : Specifies data type for columns you read in
- na : Character vector of strings to interpret as missing values
- n_max : Maximum number of data rows to read
Example: Reading in an Excel spreadsheet using read_excel()

readxl has several example files that we could use as practice:
# List available sample Excel files
readxl_example()
#> [1] "clippy.xls" "clippy.xlsx" "datasets.xls" "datasets.xlsx"
#> [5] "deaths.xls" "deaths.xlsx" "geometry.xls" "geometry.xlsx"
#> [9] "type-me.xls" "type-me.xlsx"
# Get path to datasets.xlsx
path_to_datasets <- readxl_example('datasets.xlsx')
path_to_datasets
#> [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/readxl/extdata/datasets.xlsx"
# View sheets in datasets.xlsx
excel_sheets(path_to_datasets)
#> [1] "iris" "mtcars" "chickwts" "quakes"
If we read in the Excel file without specifying the sheet, it will default to the first sheet:
iris_dataset <- read_excel(path = path_to_datasets)
head(iris_dataset, n = 4)
#> # A tibble: 4 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>
#> 1          5.1         3.5          1.4         0.2 setosa
#> 2          4.9         3            1.4         0.2 setosa
#> 3          4.7         3.2          1.3         0.2 setosa
#> 4          4.6         3.1          1.5         0.2 setosa
We could also specify a specific sheet to read using the sheet
argument:
quakes_dataset <- read_excel(path = path_to_datasets, sheet = 'quakes')
head(quakes_dataset, n = 4)
#> # A tibble: 4 x 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15
#> 3 -26    184.    42   5.4       43
#> 4 -18.0  182.   626   4.1       19
Example: Using range argument in read_excel()

Cell notation in Excel uses letters to indicate columns and numbers to indicate rows (e.g., cell A4 is the cell in the 1st column and 4th row). We can specify the cell range we want to read using this notation:
# Selects range of cells from C1 at top left corner to E4 at bottom right corner
read_excel(path = path_to_datasets, sheet = 'quakes', range = 'C1:E4')
#> # A tibble: 3 x 3
#>   depth   mag stations
#>   <dbl> <dbl>    <dbl>
#> 1   562   4.8       41
#> 2   650   4.2       15
#> 3    42   5.4       43
# Selects rows of cells from row 1 to 3 using cell_rows()
read_excel(path = path_to_datasets, sheet = 'quakes', range = cell_rows(1:3))
#> # A tibble: 2 x 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15
# Selects columns of cells from column A to C using cell_cols()
head(read_excel(path = path_to_datasets, sheet = 'quakes', range = cell_cols('A:C')))
#> # A tibble: 6 x 3
#>     lat  long depth
#>   <dbl> <dbl> <dbl>
#> 1 -20.4  182.   562
#> 2 -20.6  181.   650
#> 3 -26    184.    42
#> 4 -18.0  182.   626
#> 5 -20.4  182.   649
#> 6 -19.7  184.   195
Note that we could also specify the sheet we want to select from using range, thus eliminating the need to specify the sheet via the sheet argument:
# Selects range of cells from C1 to E4 in the quakes spreadsheet
read_excel(path = path_to_datasets, range = 'quakes!C1:E4')
#> # A tibble: 3 x 3
#>   depth   mag stations
#>   <dbl> <dbl>    <dbl>
#> 1   562   4.8       41
#> 2   650   4.2       15
#> 3    42   5.4       43
We can limit the number of rows read in by using the `n_max` argument in `read_excel()`:
# Read in at most 4 rows (if available)
read_excel(path = path_to_datasets, n_max = 4)
#> # A tibble: 4 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>
#> 1          5.1         3.5          1.4         0.2 setosa
#> 2          4.9         3            1.4         0.2 setosa
#> 3          4.7         3.2          1.3         0.2 setosa
#> 4          4.6         3.1          1.5         0.2 setosa
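The examples above assume `path_to_datasets` points to an Excel workbook containing `iris` and `quakes` sheets. If you want to follow along without downloading anything, readxl happens to bundle a sample workbook with those sheets; this sketch uses `readxl_example()` to locate it (the lecture's actual file may differ):

```r
library(readxl)

# readxl ships a sample workbook; readxl_example() returns its local path
path_to_datasets <- readxl_example("datasets.xlsx")

# List the sheet names available in the workbook (should include "iris" and "quakes")
excel_sheets(path_to_datasets)

# Read the first two rows of the quakes sheet
read_excel(path = path_to_datasets, sheet = "quakes", n_max = 2)
```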
Example: Reading in Excel data file using `read_excel()`

Recall the data dictionary we downloaded earlier for the 2019 IPEDS Institutional Characteristics survey (`hd2019.xlsx`). We will use `read_excel()` to:

- Read in rows 151 to 154 from the `Frequencies` sheet
- Name the columns `varnumber`, `varname`, `codevalue`, `valuelabel`, `frequency`, and `percent`
- Treat `'-3'` as `NA` values

read_excel(path = file.path(data_dir, 'hd2019.xlsx'),
sheet = 'Frequencies',
range = cell_rows(151:154),
col_names = c('varnumber', 'varname', 'codevalue', 'valuelabel', 'frequency', 'percent'),
na = '-3')
#> # A tibble: 4 x 6
#>   varnumber varname codevalue valuelabel             frequency percent
#>       <dbl> <chr>   <chr>     <chr>                      <dbl>   <dbl>
#> 1     10091 ICLEVEL NA        {Not available}               32    0.49
#> 2     10096 CONTROL 1         Public                      2056   31.4
#> 3     10096 CONTROL 2         Private not-for-profit      1905   29.0
#> 4     10096 CONTROL 3         Private for-profit          2566   39.1
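If you don't have `hd2019.xlsx` handy, the same `range`, `col_names`, and `na` arguments can be tried on readxl's bundled sample workbook (a hypothetical stand-in for the IPEDS file; the column names below are made up for illustration):

```r
library(readxl)

# Bundled example workbook (stand-in for hd2019.xlsx)
p <- readxl_example("datasets.xlsx")

# Read rows 2-4 of the iris sheet (skipping the header row in row 1),
# supply our own column names, and treat 'setosa' as a missing-value code
iris_rows <- read_excel(path = p, sheet = "iris", range = cell_rows(2:4),
                        col_names = c("sep_len", "sep_wid", "pet_len", "pet_wid", "species"),
                        na = "setosa")
iris_rows  # 3 rows; the species column is all NA
```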
The `haven` package

The `haven` package:

?haven

- An R package that allows users to import and export data from the following statistical packages: SAS, SPSS, Stata [doc]
- `haven` is part of `tidyverse`, so you'll have the package if you have `tidyverse` installed. But unlike `readr`, it is not automatically loaded when you load `tidyverse`, so you'll need to explicitly load `haven` if you want to use it.
- The functions for reading and writing `.dta` files are `read_dta()` and `write_dta()`
- `haven` functions for reading data:
Format | Function |
---|---|
SAS | read_sas() |
SPSS | read_sav() |
Stata | read_dta() |
The `read_dta()` function

The `read_dta()` function:

?read_dta

# SYNTAX AND DEFAULT VALUES
read_dta(file, encoding = NULL, col_select = NULL, skip = 0,
  n_max = Inf, .name_repair = "unique")

- Function to read in `.dta` files
- Arguments:
  - `file`: File path or URL to a `.dta` file, or literal data
  - `skip`: Number of lines to skip before reading data
  - `n_max`: Maximum number of data rows to read
Example: Reading in `.dta` data file using `read_dta()`
High school longitudinal surveys from the National Center for Education Statistics (NCES) follow U.S. students from high school through college and the labor market. We will be working with the High School Longitudinal Study of 2009 (HSLS:09):
# Read .dta file from URL
hsls <- read_dta(file = 'https://raw.githubusercontent.com/anyone-can-cook/rclass2/main/data/hsls/hsls_sch_small.dta')
# Print first few rows of the hsls dataframe
head(hsls)
#> # A tibble: 6 x 5
#>   sch_id x1control            x1locale  x1region   a1schcontrol
#>   <chr>  <dbl+lbl>            <dbl+lbl> <dbl+lbl>  <dbl+lbl>
#> 1 1001   1 [Public]           4 [City]  6 [South]  2 [Unit non-response/comp…
#> 2 1002   1 [Public]           5 [Subur… 6 [South]  4 [Public]
#> 3 1003   1 [Public]           4 [City]  6 [South]  4 [Public]
#> 4 1004   2 [Catholic or ot…   5 [Subur… 5 [Midwes… 5 [Private]
#> 5 1005   1 [Public]           5 [Subur… 6 [South]  4 [Public]
#> 6 1006   1 [Public]           5 [Subur… 6 [South]  4 [Public]
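The `<dbl+lbl>` columns above are haven's labelled vectors, which keep the Stata value labels alongside the numeric codes. A minimal sketch of how to work with them, using a hand-made labelled vector rather than the HSLS data:

```r
library(haven)

# Construct a labelled vector by hand (mimicking a Stata-style coded variable)
x <- labelled(c(1, 2, 1), labels = c(Public = 1, Private = 2))

# Convert the numeric codes to a factor using their value labels
f <- as_factor(x)
f  # Public Private Public
```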
The `write_dta()` function

The `write_dta()` function:

?write_dta

# SYNTAX AND DEFAULT VALUES
write_dta(data, path, version = 14, label = attr(data, "label"))

- Function to write to `.dta` files
- Arguments:
  - `data`: Dataframe to write
  - `path`: Path to a file where the data will be written
Example: Writing to `.dta` data file using `write_dta()`
# Write the first few rows of the hsls dataframe to a .dta file
write_dta(data = head(hsls),
path = file.path(data_dir, 'hsls_sch_small_subset.dta'))
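A quick round trip is a good sanity check that a `.dta` file was written correctly. This self-contained sketch uses the built-in `mtcars` data and a temporary file (`iris` would fail here because Stata variable names cannot contain periods):

```r
library(haven)

# Write the built-in mtcars dataframe to a temporary .dta file
tmp <- tempfile(fileext = ".dta")
write_dta(data = mtcars, path = tmp)

# Read it back and confirm the dimensions survived the round trip
mtcars_dta <- read_dta(file = tmp)
dim(mtcars_dta)  # 32 rows, 11 columns
```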
There are base R functions for saving R objects, such as dataframes, to R data files. This is useful when you want to preserve data structures, such as column data types of a dataframe. [x]
We'll be looking at saving and loading R objects with `.RDS` and `.RData` file types. A summary of the differences is below, and we will provide examples in the following sections.
`.RDS` files | `.RData` files |
---|---|
Can store only 1 R object | Can store 1 or more R object(s) |
Use `saveRDS()` to save object | Use `save()` to save object(s) |
Use `readRDS()` to load object | Use `load()` to load object(s) |
Need to assign loaded object in order to retain it | Object(s) will be loaded directly into your environment |
`.RDS` files

Single R objects (e.g., a single dataframe or a single character vector) can be saved to `.RDS` files using `saveRDS()` and loaded again using `readRDS()`.
The `saveRDS()` function:
?saveRDS
# SYNTAX AND DEFAULT VALUES
saveRDS(object, file = "", ascii = FALSE, version = NULL,
compress = TRUE, refhook = NULL)
- Function to save a single R object to an `.RDS` file
- Arguments:
  - `object`: R object to write
  - `file`: Name of the file where the R object is saved to
Example: Saving single R object to `.RDS` file using `saveRDS()`
# Save the `mrc` dataframe from earlier
saveRDS(object = mrc, file = file.path(data_dir, 'mrc.RDS'))
The `readRDS()` function:
?readRDS
# SYNTAX AND DEFAULT VALUES
readRDS(file, refhook = NULL)
- Function to load a single R object from an `.RDS` file
- Arguments:
  - `file`: Name of the file where the R object is loaded from
- Note that the loaded object needs to be assigned using `<-` to retain it in your environment
Example: Loading single R object from `.RDS` file using `readRDS()`
# Load `mrc` dataframe
mrc_df <- readRDS(file = file.path(data_dir, 'mrc.RDS'))
mrc_df
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY
#> 2        7273 CUNY Bernard M. Baruch College               New York NY
#> 3        2688 City College Of New York - CUNY              New York NY
#> 4        7022 CUNY Lehman College                          New York NY
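Because the `mrc.RDS` example above depends on a dataframe and directory created earlier in the lecture, here is a self-contained round trip using the built-in `quakes` data and a temporary file:

```r
# Save the first rows of the built-in quakes dataframe to a temporary .RDS file
tmp_rds <- tempfile(fileext = ".RDS")
saveRDS(object = head(quakes), file = tmp_rds)

# Load it back; the result must be assigned to retain it in the environment
quakes_subset <- readRDS(file = tmp_rds)
identical(quakes_subset, head(quakes))  # TRUE
```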
`.RData` files

One or more R objects can be saved to `.RData` files using `save()` and loaded again using `load()`.
The `save()` function:
?save
# SYNTAX AND DEFAULT VALUES
save(..., list = character(),
file = stop("'file' must be specified"),
ascii = FALSE, version = NULL, envir = parent.frame(),
compress = isTRUE(!ascii), compression_level,
eval.promises = TRUE, precheck = TRUE)
- Function to save one or more R objects to an `.RData` file
- Arguments:
  - `...`: The R object(s) to save
  - `file`: Name of the file where the R object(s) are saved to
Example: Saving multiple R objects to `.RData` file using `save()`
# Save the `iris_dataset` and `quakes_dataset` dataframes from earlier
save(iris_dataset, quakes_dataset, file = file.path(data_dir, 'datasets.RData'))
The `load()` function:
?load
# SYNTAX AND DEFAULT VALUES
load(file, envir = parent.frame(), verbose = FALSE)
- Function to load R object(s) from an `.RData` file
- Arguments:
  - `file`: Name of the file where the R object(s) are loaded from
- Note that loaded objects do not need to be assigned using `<-` because they will all be loaded into your environment with their original names
Example: Loading multiple R objects from `.RData` file using `load()`
# Load `iris_dataset` and `quakes_dataset` dataframes
load(file = file.path(data_dir, 'datasets.RData'))
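A self-contained version of the save-and-load workflow, using built-in datasets and a temporary file, shows that `load()` restores objects into your environment under their original names:

```r
# Create two objects from built-in datasets
iris_head <- head(iris)
quakes_head <- head(quakes)

# Save both to a temporary .RData file, then remove them from the environment
tmp_rdata <- tempfile(fileext = ".RData")
save(iris_head, quakes_head, file = tmp_rdata)
rm(iris_head, quakes_head)

# load() restores both objects under their original names (no assignment needed)
load(file = tmp_rdata)
exists("iris_head") && exists("quakes_head")  # TRUE
```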
Besides loading local data files, we can also load them directly from the web using `url()`.
The `url()` function:
?url
# SYNTAX AND DEFAULT VALUES
url(description, open = "", blocking = TRUE,
encoding = getOption("encoding"),
method = getOption("url.method", "default"),
headers = NULL)
- Arguments:
  - `description`: Character string containing the URL
Example: Loading `.RDS` file from URL using `url()` and `readRDS()`
# Load `recruit_school_somevars.RDS` file from URL
df_school <- readRDS(file = url(description = 'https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/recruit_school_somevars.RDS'))
Example: Loading `.RData` file from URL using `url()` and `load()`
# Load `recruit_school_somevars.RData` file from URL
load(file = url(description = 'https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/recruit_school_somevars.RData'))