Load packages:
library(dplyr)
library(readr)
library(readxl)
library(haven)
When you begin a new project, it is a good idea to have a consistent and efficient way of organizing your folders and files. This can help save you time and trouble later down the road, as you will know exactly where everything is when you need to look for them. Another common task you may encounter in your project is reading and writing data. This lecture covers these two fundamental topics for helping you get started with a project.
Organizing folders and files:
Reading and writing data:
Common file formats to read from:
Format | Package | Function |
---|---|---|
Comma-separated values (.csv) | readr | read_csv() |
Text-formatted data (.txt) | readr | read_table() |
Tab-separated values (.tsv) | readr | read_tsv() |
Excel (.xls or .xlsx) | readxl | read_excel() |
Stata (.dta) | haven | read_dta() |
SPSS (.sav) | haven | read_sav() |
SAS (.sas7bdat) | haven | read_sas() |
R (.rds) | base R | readRDS() |
R (.Rdata) | base R | load() |
Source: Professor Darin Christensen
Note:

- .rds files can store a single R object
- .Rdata files can store multiple R objects
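The distinction matters for how you load the data back. Here is a minimal sketch (the file names and the use of a temporary directory are illustrative):

```r
# .rds stores a single object; assign the result of readRDS()
rds_path <- file.path(tempdir(), 'mtcars.rds')
saveRDS(object = mtcars, file = rds_path)
mtcars_copy <- readRDS(file = rds_path)

# .Rdata can store multiple objects; load() restores them into the
# current environment under their original names
rdata_path <- file.path(tempdir(), 'datasets.Rdata')
save(mtcars, iris, file = rdata_path)
load(file = rdata_path)  # recreates `mtcars` and `iris`
```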
Common file formats to write to:

Format | Package | Function |
---|---|---|
Comma-separated values (.csv) | readr | write_csv() |
Stata (.dta) | haven | write_dta() |
R (.rds) | base R | saveRDS() |
R (.Rdata) | base R | save() |
What is a directory and how to organize one?

- Keep project files organized in subdirectories: a data/ subdirectory to hold all data files, a scripts/ subdirectory to hold your scripts, etc.
- my_project is the “root directory” for your research project (sometimes called the “project directory” or “project root directory”)
- data, scripts, and figures are sub-directories of your root directory
- my_project.Rproj is your RStudio project file and lives in the root directory for your research project

my_project/
|
|- data/
|- scripts/
|- figures/
|- my_project.Rproj
How to create an RStudio project?

- Select New Project under the dropdown menu
- Choose Existing Directory
- Under Project working directory, browse for your folder and click Create Project

Why use an RStudio project?
What is a working directory?

- The working directory differs depending on whether you are working in an R markdown file (.Rmd) vs. an R script (.R)
- Use the getwd() function to check your current working directory
The getwd() function:
?getwd
# SYNTAX
getwd()
When you run R code in an R markdown file, the working directory is the directory that your .Rmd
file is in (the directory where the .Rmd file is saved):
getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"
When you run an R script, the working directory is the directory indicated at the top of your console in RStudio:
What is the home directory?

- Your home directory is what RStudio labels Home in the bottom right file viewer panel
- On a Mac, the home directory is typically /Users/<username>; on Windows, C:\Users\<username>
- The tilde (~) refers to the user’s home directory
- You can check the home directory and confirm what ~ expands to in R [see here]

# Check home directory
Sys.getenv('HOME')
#> [1] "/Users/cyouh95"
# Confirm that ~ denotes a user's home directory
path.expand('~')
#> [1] "/Users/cyouh95"
To summarize, your working directory will be as follows in the various scenarios:

 | Not working from RStudio Project | Working from RStudio Project |
---|---|---|
In R Markdown code chunk | Directory that the .Rmd file is in | Directory that the .Rmd file is in |
In R script | Home directory | Project root directory |
In R console | Home directory | Project root directory |
What are file paths?

- A file path specifies the location of a file or directory, with directories separated by a forward slash (/) or backward slash (\) depending on the operating system
  - Mac: /path/to/file
  - Windows: C:\path\to\file
- R accepts / in file paths regardless of whether you’re a Mac or PC user

/path/to/my_project/ # E.g., /Users/my_username/Desktop/
|
|- my_project/
   |
   |- data/
   |- scripts/
There are two types of file paths: absolute and relative.

Absolute file paths start from the root of the file system:

- /Users/my_username/Desktop/my_project/ is the absolute file path to the my_project/ directory
- /Users/my_username/Desktop/my_project/data/ is the absolute file path to the data/ directory
- /Users/my_username/Desktop/my_project/scripts/ is the absolute file path to the scripts/ directory

Relative file paths start from your current location. If you are currently in the Desktop/ folder in the example above:

- ./my_project/ is the relative file path to the my_project/ directory
- ./my_project/data/ is the relative file path to the data/ directory

If you are currently in the data/ folder:

- ./ is the relative file path to the data/ directory (i.e., folder you are currently in)
- ../ is the relative file path to the my_project/ directory
- ../scripts/ is the relative file path to the scripts/ directory
As seen, relative file paths use dots to indicate the relative directory:

Key | Description |
---|---|
. | this directory (i.e., current directory) |
.. | up a directory (i.e., parent directory) |
/ | separates directories in a file path |
Note that the leading ./ and trailing / in relative paths are not mandatory:

- ./my_project/ is equivalent to my_project, my_project/, and ./my_project
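A quick way to convince yourself of these equivalences is to check for a directory using each spelling. This sketch uses a temporary directory so it can run anywhere:

```r
# Create a folder inside a temporary directory and make it the working directory
old_wd <- getwd()
setwd(tempdir())
dir.create('my_project', showWarnings = FALSE)

# All four spellings point to the same directory
dir.exists('my_project')      # TRUE
dir.exists('my_project/')     # TRUE
dir.exists('./my_project')    # TRUE
dir.exists('./my_project/')   # TRUE

setwd(old_wd)  # restore the original working directory
```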
What are the advantages of using relative file paths?

- Relative file paths are more portable: an absolute path like /Users/<username>/path/to/file has your specific username in it, so it will not work on a collaborator's computer

Let’s create a directory for a hypothetical research project called research_project. Inside this directory, we’ll create separate subdirectories for data, scripts, output, output/tables, and output/figures:
research_project/
|
|- data/
|- scripts/
|- output/
|- tables/
|- figures/
# delete research_project folder in case it already exists
unlink(x = 'research_project', recursive = TRUE, force = TRUE)
# Check working directory to see where your directory will be created
getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"
list.files()
#> [1] "organizing_and_io.html" "organizing_and_io.Rmd"
# Create `research_project` directory
dir.create(path = "research_project")
# Create `data` and `scripts` within `research_project`
dir.create(path = "research_project/data")
dir.create(path = "research_project/scripts")
# List the contents of `research_project` directory
list.files(path = "research_project")
#> [1] "data" "scripts"
# Though we did not create the `output` directory yet, we can still directly create the `output/tables` subdirectory if we specify `recursive = TRUE`, which will create both `output` and `tables`
dir.create(path = "research_project/output/tables", recursive = TRUE)
dir.create(path = "research_project/output/figures")
# List the contents of `output` directory
list.files(path = "research_project/output")
#> [1] "figures" "tables"
# To delete `research_project`, remember you need `recursive = TRUE` to delete a directory
# unlink(x = "research_project", recursive = TRUE)
What is an R script?

- Lines that begin with # indicate that they are lines of comments
- To run the code, click Run at the top of the script, or press ctrl + enter / cmd + enter on your keyboard
- In contrast, .Rmd files are mostly useful for combining text and code (e.g., writing a report) and presenting it in various formats (e.g., PDF, HTML)

How to organize an R script?
################################################################################
##
## [ PROJ ] < Name of the overall project >
## [ FILE ] < Name of this particular file >
## [ AUTH ] < Your name + email / Twitter / GitHub handle >
## [ INIT ] < Date you started the file >
##
################################################################################
## ---------------------------
## libraries
## ---------------------------
## ---------------------------
## directory paths
## ---------------------------
## ---------------------------
## functions
## ---------------------------
## -----------------------------------------------------------------------------
## < BODY >
## -----------------------------------------------------------------------------
## ---------------------------
## input
## ---------------------------
## ---------------------------
## process
## ---------------------------
## ---------------------------
## output
## ---------------------------
## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------
Source: R script template by Ben Skinner
> We use the file.path() command because it is smart. Some computer operating systems use forward slashes, /, for their file paths; others use backslashes, \. Rather than try to guess or assume what operating system future users will use, we can use R’s function, file.path(), to check the current operating system and build the paths correctly for us.
Source: Organizing Lecture by Ben Skinner
The file.path() function:
?file.path
# SYNTAX AND DEFAULT VALUES
file.path(..., fsep = .Platform$file.sep)
- ... : File path component(s)
- fsep : The path separator to use (default is /)
- Components are joined using the / path separator (unless an alternative path separator is specified in fsep)
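For example (the fsep call below is only to illustrate the argument; you would rarely need it in practice):

```r
# Components are joined with the default `/` separator
file.path('my_project', 'data', 'file.csv')
#> [1] "my_project/data/file.csv"

# An alternative separator can be supplied via `fsep`
file.path('my_project', 'data', fsep = '\\')
#> [1] "my_project\\data"
```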
Example: Creating file path objects using file.path()
Recall our example research_project
directory from earlier:
research_project/
|
|- data/
|- scripts/
|- output/
|- tables/
|- figures/
Let’s use file.path()
to create file path objects for some of these directories:
# Pass in each section of the path as a separate argument
file.path('.', 'research_project', 'data')
#> [1] "./research_project/data"
We would usually create and save these objects at the top of our script to be used later on:
# Create file path object for `data` directory
data_dir <- file.path('.', 'research_project', 'data')
data_dir
#> [1] "./research_project/data"
# Create file path object for `output` directory
output_dir <- file.path('.', 'research_project', 'output')
output_dir
#> [1] "./research_project/output"
# Create file path object for `tables` directory
tables_dir <- file.path('.', 'research_project', 'output', 'tables')
tables_dir
#> [1] "./research_project/output/tables"
Note that the object created by file.path()
is just a character vector containing the path:
# Investigate file path object
output_dir %>% str()
#> chr "./research_project/output"
Since the file path object is just a regular character vector, we could use that as input to file.path()
to help create subdirectory path objects:
# Create file path object for `figures` directory using `output_dir`
figures_dir <- file.path(output_dir, 'figures')
figures_dir
#> [1] "./research_project/output/figures"
Similarly, we can use the file path object anywhere that we would normally input a file path:
getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"
# List the contents of the `output` directory
list.files(path = output_dir)
#> [1] "figures" "tables"
Let’s download an image from R for Data Science to the figures_dir
we created:
# We will introduce the download.file() function in the next section
download.file(url = 'https://d33wubrfki0l68.cloudfront.net/8b89c5554ed6108359d59909d441dbeb010e8802/9f366/visualize_files/figure-html/unnamed-chunk-7-1.png',
destfile = file.path(figures_dir, 'scatterplot.png'))
# confirm figure is there:
list.files(path = figures_dir)
#> [1] "scatterplot.png"
We can use the file path object figures_dir
to help us refer to the saved image:
# Display image using include_graphics()
knitr::include_graphics(path = file.path(figures_dir, 'scatterplot.png'))
There are many functions available to read in various types of data. The most common types that we will cover are:
Format | Package | Function |
---|---|---|
Comma-separated values (.csv) | readr | read_csv() |
Excel (.xls or .xlsx) | readxl | read_excel() |
Stata (.dta) | haven | read_dta() |
R (.rds) | base R | readRDS() |
R (.Rdata) | base R | load() |
Note:

- .rds files can store a single R object
- .Rdata files can store multiple R objects
What can these functions read in?
As we transition into learning about reading/writing data, let’s first take a look at some helpful functions to download and unzip data files, as you may need to use them when obtaining your data.
download.file() function

Although it is most convenient to read in data directly from the web, not all functions support that. In addition, we may want to download and save a copy of the data locally in some cases. This can be done using the download.file() function.
The download.file()
function:
?download.file
# SYNTAX AND DEFAULT VALUES
download.file(url, destfile, method, quiet = FALSE, mode = "w",
cacheOK = TRUE,
extra = getOption("download.file.extra"),
headers = NULL, ...)
- url : URL of a resource to be downloaded
- destfile : Name where the downloaded file is saved
Example: Downloading data from the Internet using download.file()
We will be downloading the 2019 Institutional Characteristics data dictionary file from the IPEDS Data Center. Select the year and survey from the drop down and right-click the “Dictionary” link to obtain the URL to download:
# Recall the file path object to the `data` directory we created earlier
data_dir
#> [1] "./research_project/data"
# Download data dictionary file to `data` directory
download.file(url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2019_Dict.zip',
destfile = file.path(data_dir, 'hd2019_dictionary.zip')) # rename downloaded file
# Check where we downloaded the data (i.e., the `destfile` arg from above)
file.path(data_dir, 'hd2019_dictionary.zip')
#> [1] "./research_project/data/hd2019_dictionary.zip"
# Confirm that the file has been downloaded in `data` folder
list.files(path = data_dir)
#> [1] "hd2019_dictionary.zip"
unzip() function

Some files downloaded from the web, like the example above, are contained in a zip folder. The unzip() function can be used to extract the zipped contents.
The unzip()
function:
?unzip
# SYNTAX AND DEFAULT VALUES
unzip(zipfile, files = NULL, list = FALSE, overwrite = TRUE,
junkpaths = FALSE, exdir = ".", unzip = "internal",
setTimes = FALSE)
- zipfile : Path to zip file (including file name)
- exdir : The directory to extract files to
Example: Extracting zipped data file using unzip()
Continuing from the previous example, we can use unzip()
to extract the contents of the downloaded file:
# Extract data dictionary file
unzip(zipfile = file.path(data_dir, 'hd2019_dictionary.zip'),
exdir = data_dir) # extract to `data` folder
# Check that the file has been extracted in `data` folder
list.files(path = data_dir)
#> [1] "hd2019_dictionary.zip" "hd2019.xlsx"
readr package

The readr package:

?readr

- The readr package contains functions to “read rectangular text data (like ‘csv’, ‘tsv’, and ‘fwf’)” into R [doc]
- readr is part of tidyverse, and it is automatically loaded every time you load tidyverse
- In this lecture, we focus on read_csv() and write_csv()

readr functions for reading data:
Format | Function |
---|---|
Comma-separated values (.csv) | read_csv() |
Semicolon-separated values | read_csv2() |
Tab-separated values (.tsv) | read_tsv() |
General delimited files | read_delim() |
Fixed width files | read_fwf() |
Text-formatted data (.txt) | read_table() |
Web log files | read_log() |
read_csv() function

The read_csv() function:
?read_csv
# SYNTAX AND DEFAULT VALUES
read_csv(file, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
n_max = Inf, guess_max = min(1000, n_max),
progress = show_progress(), skip_empty_rows = TRUE)
- file : File path or URL to a CSV file, or literal data
- col_names : Whether to use first row of input as column names or provide own column names
- col_types : Specifies data type for columns you read in
- na : Vector of values to treat as missing value
- comment : A string used to identify comments
- skip : Number of lines to skip before reading data
As you can see from above, the read_csv()
function provides you with many options for how you can read in the data. We’ll introduce you to some of the most commonly used arguments in read_csv()
, but keep in mind this won’t cover them all exhaustively.
file argument

The file argument:
We will be reading data from the Mobility Report Cards: The Role of Colleges in Intergenerational Mobility. This is part of the Equality of Opportunity Project, which uses two data sources – federal tax records and Department of Education records (1999-2013) – to investigate intergenerational income mobility at colleges in the US.
We will be reading in Online Data Table 1 under the Mobility Report Cards: The Role of Colleges in Intergenerational Mobility drop down from the Equality of Opportunity Project Data Page. Right-click the “Excel” link to obtain the URL to read (Note: it is actually a CSV file, not Excel):
Example: Reading data from the web using read_csv()
# Read data from URL
mrc <- read_csv(file = 'http://www.equality-of-opportunity.org/data/college/mrc_table1.csv')
# View first 4 rows and 4 columns
mrc[1:4, 1:4]
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY
#> 2        7273 CUNY Bernard M. Baruch College               New York NY
#> 3        2688 City College Of New York - CUNY              New York NY
#> 4        7022 CUNY Lehman College                          New York NY
Example: Reading data from local file using read_csv()
If we have downloaded data files on our computer, we can read them in by providing the path to the file:
# First, download Chetty data file
download.file(url = 'http://www.equality-of-opportunity.org/data/college/mrc_table1.csv',
destfile = file.path(data_dir, 'mrc_table1.csv')) # save to `data` folder
# Read data from local file
mrc <- read_csv(file = file.path(data_dir, 'mrc_table1.csv'))
Example: Reading in literal data from string using read_csv()
We can also provide literal comma-separated data in the form of a string to be read:
# Read literal data
mrc <- read_csv(
file = "super_opeid,name,czname,state
2665,Vaughn College Of Aeronautics And Technology,New York,NY
7273,CUNY Bernard M. Baruch College,New York,NY
2688,City College Of New York - CUNY,New York,NY
7022,CUNY Lehman College,New York,NY"
)
mrc
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY
#> 2        7273 CUNY Bernard M. Baruch College               New York NY
#> 3        2688 City College Of New York - CUNY              New York NY
#> 4        7022 CUNY Lehman College                          New York NY
Note that for a real project, you would typically be reading in data from a file, rather than providing literal data. But for the purpose of experimentation and providing examples (as in the next few sections), it is helpful to use literal data so we are able to see the data.
col_names argument

The col_names argument:
- If TRUE, the first row of the input will be used as the column names (default)
- If FALSE, column names will be generated automatically (X1, X2, X3, etc.)
- Alternatively, provide a character vector to use as the column names

Example: Setting col_names to TRUE in read_csv() (default)
read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_names = TRUE
)
#> # A tibble: 2 x 3
#>       a     b c
#>   <dbl> <dbl> <lgl>
#> 1     1     2 FALSE
#> 2     4     5 TRUE
Example: Setting col_names to FALSE in read_csv()
read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_names = FALSE
)
#> # A tibble: 3 x 3
#>   X1    X2    X3
#>   <chr> <chr> <chr>
#> 1 a     b     c
#> 2 1     2     F
#> 3 4     5     T
Example: Providing col_names in read_csv()
read_csv(
file = "1, 2, F
4, 5, T",
col_names = c('a', 'b', 'c')
)
#> # A tibble: 2 x 3
#>       a     b c
#>   <dbl> <dbl> <lgl>
#> 1     1     2 FALSE
#> 2     4     5 TRUE
col_types argument

By default, read_csv() attempts to guess each column’s data type (e.g., character, double) by looking at the first 1000 rows. But you can manually specify the data type for the columns you read in using col_types.

The col_types argument:

- If NULL, column data types guessed from the first 1000 rows (default)
- Use cols() to specify data type for all columns
- Use cols_only() to specify data type and read in only a subset of the columns
To specify column type:

Column Type | Parser function | Character representation |
---|---|---|
Logical | col_logical() | l |
Integers | col_integer() | i |
Doubles | col_double() | d |
Characters | col_character() | c |
Numbers | col_number() | n |
Factors | col_factor(levels, ordered) | f |
Dates | col_date(format = "") | D |
Example: Setting col_types to NULL in read_csv() (default)

We can see that read_csv() guessed double, double, and logical as the data type of the 3 columns:
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = NULL
)
str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: num [1:2] 1 4
#> $ b: num [1:2] 2 5
#> $ c: logi [1:2] FALSE TRUE
#> - attr(*, "spec")=
#> .. cols(
#> .. a = col_double(),
#> .. b = col_double(),
#> .. c = col_logical()
#> .. )
Example: Setting col_types using cols() in read_csv()

We can manually choose the data type of columns using cols() and the corresponding parser function:
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = cols(
a = col_factor(c('1', '2', '3', '4')),
b = col_character()
)
)
str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#> $ b: chr [1:2] "2" "5"
#> $ c: logi [1:2] FALSE TRUE
#> - attr(*, "spec")=
#> .. cols(
#> .. a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#> .. b = col_character(),
#> .. c = col_logical()
#> .. )
Example: Setting col_types using cols_only() in read_csv()

We can use cols_only() to specify data type and read in only the specified columns. This is useful when you want to read in one variable at a time to check that the type looks good.
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = cols_only(
a = col_factor(c('1', '2', '3', '4'))
)
)
str(df)
#> spec_tbl_df [2 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#> - attr(*, "spec")=
#> .. cols_only(
#> .. a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#> .. b = col_skip(),
#> .. c = col_skip()
#> .. )
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = cols_only(
a = col_factor(c('1', '2', '3', '4')),
c = col_logical()
)
)
str(df)
#> spec_tbl_df [2 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: Factor w/ 4 levels "1","2","3","4": 1 4
#> $ c: logi [1:2] FALSE TRUE
#> - attr(*, "spec")=
#> .. cols_only(
#> .. a = col_factor(levels = c("1", "2", "3", "4"), ordered = FALSE, include_na = FALSE),
#> .. b = col_skip(),
#> .. c = col_logical()
#> .. )
To summarize, the approach would be:

- Read in one variable using col_types = cols_only(...) and make sure the variable looks good
- Add the next variable to cols_only() and check it
- Repeat until all desired variables are specified in cols_only()

Example: Setting col_types using compact string representation in read_csv()
For example, the string representation 'icl'
specifies the 3 columns to be of type integer
, character
, and logical
, respectively:
df <- read_csv(
file = "a, b, c
1, 2, F
4, 5, T",
col_types = 'icl'
)
str(df)
#> spec_tbl_df [2 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ a: int [1:2] 1 4
#> $ b: chr [1:2] "2" "5"
#> $ c: logi [1:2] FALSE TRUE
#> - attr(*, "spec")=
#> .. cols(
#> .. a = col_integer(),
#> .. b = col_character(),
#> .. c = col_logical()
#> .. )
Other useful arguments for read_csv():

- na : Vector of values to treat as missing value
  - na = c(-2, 'TBD') : Treat the values -2 and 'TBD' as missing values
- comment : A string used to identify comments
  - comment = '#' : Any lines starting with # should be treated as a comment and ignored when reading in data
- skip : Number of lines to skip before reading data
  - skip = 5 : Skip the first 5 lines when reading in the data

Example: Using na argument in read_csv()

Treat the values -2 and 'TBD' as missing values (i.e., NA values):
read_csv(
file = "column_1, column_2, column_3
1, -2, 3
4, 5, TBD",
na = c(-2, 'TBD')
)
#> # A tibble: 2 x 3
#>   column_1 column_2 column_3
#>      <dbl>    <dbl>    <dbl>
#> 1        1       NA        3
#> 2        4        5       NA
Example: Using comment argument in read_csv()

Skip the first line of comment that contains some meta information on the data:
read_csv(
  file = "# This file contains data on student charges for the academic year.
a, b, c
1, 2, 3
4, 5, 6",
comment = '#'
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
We can specify what character indicates the start of a comment:
read_csv(
  file = "* This file contains data on student charges for the academic year.
a, b, c
1, 2, 3
4, 5, 6",
comment = '*'
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
Example: Using skip argument in read_csv()

Skip the first 2 lines that contain some meta information on the data:
read_csv(
file =
  "This file contains data on student charges for the academic year.
File name: IC2016_AY
a, b, c
1, 2, 3
4, 5, 6",
skip = 2
)
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
We can also use skip
to skip the first row of header data when we want to provide our own headings using col_names
:
read_csv(
file =
"This file contains data on student charges for the acdemic year.
File name: IC2016_AY
a, b, c
1, 2, 3
4, 5, 6",
skip = 3,
col_names = c('colA', 'colB', 'colC')
)
#> # A tibble: 2 x 3
#>    colA  colB  colC
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
write_csv() function

The write_csv() function:

?write_csv

# SYNTAX AND DEFAULT VALUES
write_csv(x, path, na = "NA", append = FALSE, col_names = !append,
  quote_escape = "double")

- x : A data frame to write to disk
- path : Path or connection to write to
- na : String used to represent missing values in the output
- append : Whether or not to overwrite existing file or append to it
- col_names : Whether or not to write column names at the top of the file
- quote_escape : The type of escaping to use for quoted values
Example: Writing to CSV file using write_csv()
Recall the Chetty data from earlier. We can write the data from the dataframe to a CSV file:
# Chetty data
mrc
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY
#> 2        7273 CUNY Bernard M. Baruch College               New York NY
#> 3        2688 City College Of New York - CUNY              New York NY
#> 4        7022 CUNY Lehman College                          New York NY
# Write to CSV file
write_csv(x = mrc,
path = file.path(data_dir, 'mrc.csv')) # write to `data` folder we created earlier
readxl package

The readxl package:

?readxl

- The readxl package is designed to import Excel files into R [doc]
- readxl is part of tidyverse, so you’ll have the package if you have tidyverse installed. But unlike readr, it is not automatically loaded when you load tidyverse, so you’ll need to explicitly load readxl if you want to use it.
- In this section, we focus on read_excel()
read_excel() function

The read_excel() function:
?read_excel
# SYNTAX AND DEFAULT VALUES
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
n_max = Inf, guess_max = min(1000, n_max),
progress = readxl_progress(), .name_repair = "unique")
- Reads both .xls and .xlsx files
- path : Path to the Excel file
- sheet : Sheet to read – either a string (the name of a sheet) or an integer (the position of the sheet)
- range : A cell range to read from
  - cell_rows() : Cell rows to read from
  - cell_cols() : Cell columns to read from
- col_names : Whether to use first row of input as column names or provide own column names
- col_types : Specifies data type for columns you read in
- na : Character vector of strings to interpret as missing values
- n_max : Maximum number of data rows to read
Example: Reading in an Excel spreadsheet using read_excel()

readxl has several example files that we could use as practice:
# List available sample Excel files
readxl_example()
#> [1] "clippy.xls" "clippy.xlsx" "datasets.xls" "datasets.xlsx"
#> [5] "deaths.xls" "deaths.xlsx" "geometry.xls" "geometry.xlsx"
#> [9] "type-me.xls" "type-me.xlsx"
# Get path to datasets.xlsx
path_to_datasets <- readxl_example('datasets.xlsx')
path_to_datasets
#> [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/readxl/extdata/datasets.xlsx"
# View sheets in datasets.xlsx
excel_sheets(path_to_datasets)
#> [1] "iris" "mtcars" "chickwts" "quakes"
If we read in the Excel file without specifying the sheet, it will default to the first sheet:
iris_dataset <- read_excel(path = path_to_datasets)
head(iris_dataset, n = 4)
#> # A tibble: 4 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>
#> 1          5.1         3.5          1.4         0.2 setosa
#> 2          4.9         3            1.4         0.2 setosa
#> 3          4.7         3.2          1.3         0.2 setosa
#> 4          4.6         3.1          1.5         0.2 setosa
We could also specify a specific sheet to read using the sheet
argument:
quakes_dataset <- read_excel(path = path_to_datasets, sheet = 'quakes')
head(quakes_dataset, n = 4)
#> # A tibble: 4 x 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15
#> 3 -26    184.    42   5.4       43
#> 4 -18.0  182.   626   4.1       19
Example: Using range argument in read_excel()

Cell notation in Excel uses letters to indicate columns and numbers to indicate rows (e.g., cell A4 is the cell in the 1st column and 4th row). We can specify the cell range we want to read using this notation:
# Selects range of cells from C1 at top left corner to E4 at bottom right corner
read_excel(path = path_to_datasets, sheet = 'quakes', range = 'C1:E4')
#> # A tibble: 3 x 3
#>   depth   mag stations
#>   <dbl> <dbl>    <dbl>
#> 1   562   4.8       41
#> 2   650   4.2       15
#> 3    42   5.4       43
# Selects rows of cells from row 1 to 3 using cell_rows()
read_excel(path = path_to_datasets, sheet = 'quakes', range = cell_rows(1:3))
#> # A tibble: 2 x 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15
# Selects columns of cells from column A to C using cell_cols()
head(read_excel(path = path_to_datasets, sheet = 'quakes', range = cell_cols('A:C')))
#> # A tibble: 6 x 3
#>     lat  long depth
#>   <dbl> <dbl> <dbl>
#> 1 -20.4  182.   562
#> 2 -20.6  181.   650
#> 3 -26    184.    42
#> 4 -18.0  182.   626
#> 5 -20.4  182.   649
#> 6 -19.7  184.   195
Note that we could also specify the sheet we want to select from using range, thus eliminating the need to specify the sheet via the sheet argument:
# Selects range of cells from C1 to E4 in the quakes spreadsheet
read_excel(path = path_to_datasets, range = 'quakes!C1:E4')
#> # A tibble: 3 x 3
#>   depth   mag stations
#>   <dbl> <dbl>    <dbl>
#> 1   562   4.8       41
#> 2   650   4.2       15
#> 3    42   5.4       43
We can limit the number of rows read in by using the `n_max` argument in `read_excel()`:
# Read in at most 4 rows (if available)
read_excel(path = path_to_datasets, n_max = 4)
#> # A tibble: 4 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>
#> 1          5.1         3.5          1.4         0.2 setosa
#> 2          4.9         3            1.4         0.2 setosa
#> 3          4.7         3.2          1.3         0.2 setosa
#> 4          4.6         3.1          1.5         0.2 setosa
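The examples above assume `path_to_datasets` points to an Excel workbook containing `iris` and `quakes` sheets. If you want to follow along without downloading anything, readxl happens to bundle a sample workbook with those sheets; this sketch uses `readxl_example()` to locate it (the lecture's actual file may differ):

```r
library(readxl)

# readxl ships a sample workbook; readxl_example() returns its local path
path_to_datasets <- readxl_example("datasets.xlsx")

# List the sheet names available in the workbook (should include "iris" and "quakes")
excel_sheets(path_to_datasets)

# Read the first two rows of the quakes sheet
read_excel(path = path_to_datasets, sheet = "quakes", n_max = 2)
```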
Example: Reading in Excel data file using `read_excel()`

Recall the data dictionary we downloaded earlier for the 2019 IPEDS Institutional Characteristics survey (`hd2019.xlsx`). We will use `read_excel()` to:

- Read in rows 151 to 154 from the `Frequencies` sheet
- Name the columns `varnumber`, `varname`, `codevalue`, `valuelabel`, `frequency`, and `percent`
- Treat `'-3'` as `NA` values

read_excel(path = file.path(data_dir, 'hd2019.xlsx'),
sheet = 'Frequencies',
range = cell_rows(151:154),
col_names = c('varnumber', 'varname', 'codevalue', 'valuelabel', 'frequency', 'percent'),
na = '-3')
#> # A tibble: 4 x 6
#>   varnumber varname codevalue valuelabel             frequency percent
#>       <dbl> <chr>   <chr>     <chr>                      <dbl>   <dbl>
#> 1     10091 ICLEVEL NA        {Not available}               32    0.49
#> 2     10096 CONTROL 1         Public                      2056   31.4
#> 3     10096 CONTROL 2         Private not-for-profit      1905   29.0
#> 4     10096 CONTROL 3         Private for-profit          2566   39.1
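If you don't have `hd2019.xlsx` handy, the same `range`, `col_names`, and `na` arguments can be tried on readxl's bundled sample workbook (a hypothetical stand-in for the IPEDS file; the column names below are made up for illustration):

```r
library(readxl)

# Bundled example workbook (stand-in for hd2019.xlsx)
p <- readxl_example("datasets.xlsx")

# Read rows 2-4 of the iris sheet (skipping the header row in row 1),
# supply our own column names, and treat 'setosa' as a missing-value code
iris_rows <- read_excel(path = p, sheet = "iris", range = cell_rows(2:4),
                        col_names = c("sep_len", "sep_wid", "pet_len", "pet_wid", "species"),
                        na = "setosa")
iris_rows  # 3 rows; the species column is all NA
```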
The `haven` package

The `haven` package:

?haven

- An R package that allows users to import and export data from the following statistical packages: SAS, SPSS, Stata [doc]
- `haven` is part of `tidyverse`, so you'll have the package if you have `tidyverse` installed. But unlike `readr`, it is not automatically loaded when you load `tidyverse`, so you'll need to explicitly load `haven` if you want to use it.
- The functions for reading and writing `.dta` files are `read_dta()` and `write_dta()`
- `haven` functions for reading data:
Format | Function |
---|---|
SAS | read_sas() |
SPSS | read_sav() |
Stata | read_dta() |
The `read_dta()` function

The `read_dta()` function:

?read_dta

# SYNTAX AND DEFAULT VALUES
read_dta(file, encoding = NULL, col_select = NULL, skip = 0,
  n_max = Inf, .name_repair = "unique")

- Function to read in `.dta` files
- Arguments:
  - `file`: File path or URL to a `.dta` file, or literal data
  - `skip`: Number of lines to skip before reading data
  - `n_max`: Maximum number of data rows to read
Example: Reading in `.dta` data file using `read_dta()`
High school longitudinal surveys from the National Center for Education Statistics (NCES) follow U.S. students from high school through college and the labor market. We will be working with the High School Longitudinal Study of 2009 (HSLS:09):
# Read .dta file from URL
hsls <- read_dta(file = 'https://raw.githubusercontent.com/anyone-can-cook/rclass2/main/data/hsls/hsls_sch_small.dta')
# Print first few rows of the hsls dataframe
head(hsls)
#> # A tibble: 6 x 5
#>   sch_id x1control            x1locale  x1region   a1schcontrol
#>   <chr>  <dbl+lbl>            <dbl+lbl> <dbl+lbl>  <dbl+lbl>
#> 1 1001   1 [Public]           4 [City]  6 [South]  2 [Unit non-response/comp…
#> 2 1002   1 [Public]           5 [Subur… 6 [South]  4 [Public]
#> 3 1003   1 [Public]           4 [City]  6 [South]  4 [Public]
#> 4 1004   2 [Catholic or ot…   5 [Subur… 5 [Midwes… 5 [Private]
#> 5 1005   1 [Public]           5 [Subur… 6 [South]  4 [Public]
#> 6 1006   1 [Public]           5 [Subur… 6 [South]  4 [Public]
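The `<dbl+lbl>` columns above are haven's labelled vectors, which keep the Stata value labels alongside the numeric codes. A minimal sketch of how to work with them, using a hand-made labelled vector rather than the HSLS data:

```r
library(haven)

# Construct a labelled vector by hand (mimicking a Stata-style coded variable)
x <- labelled(c(1, 2, 1), labels = c(Public = 1, Private = 2))

# Convert the numeric codes to a factor using their value labels
f <- as_factor(x)
f  # Public Private Public
```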
The `write_dta()` function

The `write_dta()` function:

?write_dta

# SYNTAX AND DEFAULT VALUES
write_dta(data, path, version = 14, label = attr(data, "label"))

- Function to write to `.dta` files
- Arguments:
  - `data`: Dataframe to write
  - `path`: Path to a file where the data will be written
Example: Writing to `.dta` data file using `write_dta()`
# Write the first few rows of the hsls dataframe to a .dta file
write_dta(data = head(hsls),
path = file.path(data_dir, 'hsls_sch_small_subset.dta'))
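A quick round trip is a good sanity check that a `.dta` file was written correctly. This self-contained sketch uses the built-in `mtcars` data and a temporary file (`iris` would fail here because Stata variable names cannot contain periods):

```r
library(haven)

# Write the built-in mtcars dataframe to a temporary .dta file
tmp <- tempfile(fileext = ".dta")
write_dta(data = mtcars, path = tmp)

# Read it back and confirm the dimensions survived the round trip
mtcars_dta <- read_dta(file = tmp)
dim(mtcars_dta)  # 32 rows, 11 columns
```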
There are base R functions for saving R objects, such as dataframes, to R data files. This is useful when you want to preserve data structures, such as column data types of a dataframe. [x]
We'll be looking at saving and loading R objects with `.RDS` and `.RData` file types. A summary of the differences is below, and we will provide examples in the following sections.
`.RDS` files | `.RData` files |
---|---|
Can store only 1 R object | Can store 1 or more R object(s) |
Use `saveRDS()` to save object | Use `save()` to save object(s) |
Use `readRDS()` to load object | Use `load()` to load object(s) |
Need to assign loaded object in order to retain it | Object(s) will be loaded directly into your environment |
`.RDS` files

Single R objects (e.g., a single dataframe or a single character vector) can be saved to `.RDS` files using `saveRDS()` and loaded again using `readRDS()`.
The `saveRDS()` function:
?saveRDS
# SYNTAX AND DEFAULT VALUES
saveRDS(object, file = "", ascii = FALSE, version = NULL,
compress = TRUE, refhook = NULL)
- Function to save a single R object to an `.RDS` file
- Arguments:
  - `object`: R object to write
  - `file`: Name of the file where the R object is saved to
Example: Saving single R object to `.RDS` file using `saveRDS()`
# Save the `mrc` dataframe from earlier
saveRDS(object = mrc, file = file.path(data_dir, 'mrc.RDS'))
The `readRDS()` function:
?readRDS
# SYNTAX AND DEFAULT VALUES
readRDS(file, refhook = NULL)
- Function to load a single R object from an `.RDS` file
- Arguments:
  - `file`: Name of the file where the R object is loaded from
- Note that the loaded object needs to be assigned using `<-` to retain it in your environment
Example: Loading single R object from `.RDS` file using `readRDS()`
# Load `mrc` dataframe
mrc_df <- readRDS(file = file.path(data_dir, 'mrc.RDS'))
mrc_df
#> # A tibble: 4 x 4
#>   super_opeid name                                         czname   state
#>         <dbl> <chr>                                        <chr>    <chr>
#> 1        2665 Vaughn College Of Aeronautics And Technology New York NY
#> 2        7273 CUNY Bernard M. Baruch College               New York NY
#> 3        2688 City College Of New York - CUNY              New York NY
#> 4        7022 CUNY Lehman College                          New York NY
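Because the `mrc.RDS` example above depends on a dataframe and directory created earlier in the lecture, here is a self-contained round trip using the built-in `quakes` data and a temporary file:

```r
# Save the first rows of the built-in quakes dataframe to a temporary .RDS file
tmp_rds <- tempfile(fileext = ".RDS")
saveRDS(object = head(quakes), file = tmp_rds)

# Load it back; the result must be assigned to retain it in the environment
quakes_subset <- readRDS(file = tmp_rds)
identical(quakes_subset, head(quakes))  # TRUE
```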
`.RData` files

One or more R objects can be saved to `.RData` files using `save()` and loaded again using `load()`.
The `save()` function:
?save
# SYNTAX AND DEFAULT VALUES
save(..., list = character(),
file = stop("'file' must be specified"),
ascii = FALSE, version = NULL, envir = parent.frame(),
compress = isTRUE(!ascii), compression_level,
eval.promises = TRUE, precheck = TRUE)
- Function to save one or more R objects to an `.RData` file
- Arguments:
  - `...`: The R object(s) to save
  - `file`: Name of the file where the R object(s) are saved to
Example: Saving multiple R objects to `.RData` file using `save()`
# Save the `iris_dataset` and `quakes_dataset` dataframes from earlier
save(iris_dataset, quakes_dataset, file = file.path(data_dir, 'datasets.RData'))
The `load()` function:
?load
# SYNTAX AND DEFAULT VALUES
load(file, envir = parent.frame(), verbose = FALSE)
- Function to load R object(s) from an `.RData` file
- Arguments:
  - `file`: Name of the file where the R object(s) are loaded from
- Note that loaded objects do not need to be assigned using `<-` because they will all be loaded into your environment with their original names
Example: Loading multiple R objects from `.RData` file using `load()`
# Load `iris_dataset` and `quakes_dataset` dataframes
load(file = file.path(data_dir, 'datasets.RData'))
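A self-contained version of the save-and-load workflow, using built-in datasets and a temporary file, shows that `load()` restores objects into your environment under their original names:

```r
# Create two objects from built-in datasets
iris_head <- head(iris)
quakes_head <- head(quakes)

# Save both to a temporary .RData file, then remove them from the environment
tmp_rdata <- tempfile(fileext = ".RData")
save(iris_head, quakes_head, file = tmp_rdata)
rm(iris_head, quakes_head)

# load() restores both objects under their original names (no assignment needed)
load(file = tmp_rdata)
exists("iris_head") && exists("quakes_head")  # TRUE
```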
Besides loading local data files, we can also load them directly from the web using `url()`.
The `url()` function:
?url
# SYNTAX AND DEFAULT VALUES
url(description, open = "", blocking = TRUE,
encoding = getOption("encoding"),
method = getOption("url.method", "default"),
headers = NULL)
- Arguments:
  - `description`: Character string containing the URL
Example: Loading `.RDS` file from URL using `url()` and `readRDS()`
# Load `recruit_school_somevars.RDS` file from URL
df_school <- readRDS(file = url(description = 'https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/recruit_school_somevars.RDS'))
Example: Loading `.RData` file from URL using `url()` and `load()`
# Load `recruit_school_somevars.RData` file from URL
load(file = url(description = 'https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/recruit_school_somevars.RData'))