1 Introduction

1.1 Libraries we will use

Load packages:

library(dplyr)
library(readr)
library(readxl)
library(haven)

1.2 Lecture overview

When you begin a new project, it is a good idea to have a consistent and efficient way of organizing your folders and files. This can save you time and trouble down the road, because you will know exactly where each file is when you need it. Another common task you will encounter in a project is reading and writing data. This lecture covers these two fundamental topics to help you get started with a project.

Organizing folders and files:

  • Organizing folders: how to structure and navigate your project directory
  • Organizing files: how to structure an R script

Reading and writing data:

Common file formats to read from:

Format                          Package   Function
Comma-separated values (.csv)   readr     read_csv()
Text-formatted data (.txt)      readr     read_table()
Tab-separated values (.tsv)     readr     read_tsv()
Excel (.xls or .xlsx)           readxl    read_excel()
Stata (.dta)                    haven     read_dta()
SPSS (.sav)                     haven     read_sav()
SAS (.sas7bdat)                 haven     read_sas()
R (.rds)                        base R    readRDS()
R (.Rdata)                      base R    load()

Source: Professor Darin Christensen


Common file formats to write to:

Format                          Package   Function
Comma-separated values (.csv)   readr     write_csv()
Stata (.dta)                    haven     write_dta()
R (.rds)                        base R    saveRDS()
R (.Rdata)                      base R    save()
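
As a minimal sketch of the base R rows in the two tables above, the following round trip saves a toy data frame and reads it back. The object names (`scores`, `rds_path`, `rdata_path`) and the data are made up for illustration, and temporary files are used so nothing in your project is touched.

```r
# Toy data frame used only for illustration
scores <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# Temporary file paths (hypothetical names)
rds_path <- tempfile(fileext = ".rds")
rdata_path <- tempfile(fileext = ".RData")

# .rds stores a single object; you choose its name when reading it back
saveRDS(scores, file = rds_path)
scores_rds <- readRDS(rds_path)

# .RData stores one or more named objects; load() restores them under their original names
save(scores, file = rdata_path)
rm(scores)
load(rdata_path)

# Both routes should give back the same data frame
identical(scores, scores_rds)
#> [1] TRUE
```

The practical difference is that `readRDS()` returns the object so you can assign it to any name, while `load()` silently recreates objects under the names they were saved with.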

2 Organizing project directory

What is a directory and how to organize one?

  • Folders are referred to as directories
  • Your project directory is the folder that contains all the files and subfolders (i.e., subdirectories) you create for the purpose of your project
  • There is no right or wrong way to organize a project directory, but the structure shown below is generally a good way to set up your project
    • You can have a data/ subdirectory to hold all data files, scripts/ subdirectory to hold your scripts, etc.
  • When using RStudio, it is often useful to turn your project directory into an RStudio project
my_project/
|
|- data/
|- scripts/
|- figures/

2.1 RStudio project

How to create an RStudio project?

  • On the top right corner in RStudio, select New Project under the dropdown menu
  • If there’s a folder you want to turn into a project, select Existing Directory
  • Under Project working directory, browse for your folder and click Create Project


Why use an RStudio project?

  • Creating an RStudio project helps keep everything relative to the project directory
    • Your R Console and R scripts will run using the project directory as the working directory
    • Your Terminal in RStudio will start in the project directory
    • Your file browser window (bottom right panel) will also start off in the project directory

2.2 Working directory

What is a working directory?

  • Your working directory is the directory that you are currently working in (e.g., running a script from)
    • It’ll be important to keep the working directory in mind when you need to refer to files/folders relative to the working directory
  • Note that the working directory can differ depending on whether you run code in an R markdown file (.Rmd) or in an R script (.R)
  • You can use the getwd() function to check your current working directory


The getwd() function:

?getwd

# SYNTAX
getwd()
  • Function: Returns an absolute filepath representing the current working directory
  • Arguments: None


When you run R code in an R markdown file, the working directory is the directory that your .Rmd file is in (the directory where the .Rmd file is saved):

getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"


When you run an R script, the working directory is the directory indicated at the top of your console in RStudio:

  • If you are not working from an RStudio project, this is typically your home directory
  • If you are working from an RStudio project, your working directory would be the project directory


What is the home directory?

  • You can view your home directory in RStudio by clicking on Home in the bottom right file viewer panel
  • The home directory is a user’s main directory on a computer and it varies by operating system
    • MacOS: /Users/<username>
    • Windows: C:\Users\<username>
  • The tilde (~) refers to the user’s home directory
    • Note that it is possible to change the default ~ in R
  • You can check your home directory by running the following R code
# Check home directory
Sys.getenv('HOME')
#> [1] "/Users/cyouh95"

# Confirm that ~ denotes a user's home directory
path.expand('~')
#> [1] "/Users/cyouh95"


To summarize, your working directory will be as follows in the various scenarios:

                           Not working from RStudio project     Working from RStudio project
In R Markdown code chunk   Directory that the .Rmd file is in   Directory that the .Rmd file is in
In R script                Home directory                       Project directory
In R console               Home directory                       Project directory
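
To see why the working directory matters, here is a small sketch showing that a relative path is resolved against whatever `getwd()` currently returns. It uses the base R function `setwd()`, which this lecture has not introduced, and a hypothetical scratch folder (`scratch_dir`) created in a temporary location.

```r
# Remember where we started so we can restore it afterwards
original_wd <- getwd()

# Hypothetical scratch directory standing in for a project directory
scratch_dir <- tempfile(pattern = "wd_demo_")
dir.create(scratch_dir)

# setwd() changes the working directory
setwd(scratch_dir)

# A relative path is now resolved against scratch_dir
file.create("notes.txt")
in_scratch <- file.exists(file.path(scratch_dir, "notes.txt"))

# Restore the original working directory
setwd(original_wd)

in_scratch
#> [1] TRUE
```

In an RStudio project you rarely need `setwd()` yourself, since the project already sets the working directory to the project directory.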

2.3 File paths

What are file paths?

  • A file path specifies the list of directories needed to locate a file
  • The directories in a path are separated by a forward slash (/) or a backslash (\) depending on the operating system
    • MacOS: /path/to/file
    • Windows: C:\path\to\file
  • R accepts / in file paths regardless of whether you’re a Mac or PC user; a backslash written inside an R string must be escaped as \\
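
The sketch below illustrates the point about slashes in R strings. The path itself is a made-up example, and the conversion with `gsub()` is just one possible way to turn a pasted Windows path into the forward-slash form.

```r
# A Windows-style path: each backslash must be written as \\ inside an R string
windows_path <- "C:\\Users\\my_username\\Desktop"

# The same location written with forward slashes
portable_path <- "C:/Users/my_username/Desktop"

# Convert backslashes to forward slashes
# (fixed = TRUE treats "\\" as one literal backslash rather than a regular expression)
converted_path <- gsub("\\", "/", windows_path, fixed = TRUE)

identical(converted_path, portable_path)
#> [1] TRUE
```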


/path/to/   # E.g., /Users/my_username/Desktop/
|
|- my_project/
   |
   |- data/
   |- scripts/

There are two types of file paths:

  • Absolute file path: full path that includes the complete list of directories needed to locate a file or folder
    • E.g., /Users/my_username/Desktop/my_project/ is the absolute file path to the my_project/ directory
    • E.g., /Users/my_username/Desktop/my_project/data/ is the absolute file path to the data/ directory
    • E.g., /Users/my_username/Desktop/my_project/scripts/ is the absolute file path to the scripts/ directory
  • Relative file path: path relative to your current working directory
    • E.g., Assuming your working directory is the Desktop/ folder in the example above:
      • ./my_project/ is the relative file path to the my_project/ directory
      • ./my_project/data/ is the relative file path to the data/ directory
    • E.g., Assuming your working directory is the data/ folder:
      • ./ is the relative file path to the data/ directory
      • ../ is the relative file path to the my_project/ directory
      • ../scripts/ is the relative file path to the scripts/ directory
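
The relative paths listed above can be checked with a short sketch. It recreates the example layout inside a temporary folder (the `desktop` object is a hypothetical stand-in for Desktop/) and uses the base R functions `setwd()` and `normalizePath()`, which are not covered in this lecture, to show which directory each relative path points to.

```r
# Build the example layout inside a temporary folder (a stand-in for Desktop/)
desktop <- tempfile(pattern = "desktop_")
dir.create(file.path(desktop, "my_project", "data"), recursive = TRUE)
dir.create(file.path(desktop, "my_project", "scripts"))

# Make data/ the working directory, remembering the original one
original_wd <- getwd()
setwd(file.path(desktop, "my_project", "data"))

# normalizePath() converts a relative path into the absolute path it refers to;
# basename() keeps only the last directory name so the result is easy to read
here <- basename(normalizePath("."))
parent <- basename(normalizePath(".."))
sibling <- basename(normalizePath("../scripts"))

# Restore the original working directory
setwd(original_wd)

c(here, parent, sibling)
#> [1] "data"       "my_project" "scripts"
```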


As seen, relative file paths use dots to indicate the relative directory:

Key   Description
.     this directory (i.e., current directory)
..    up a directory (i.e., parent directory)
/     separates directories in a file path
  • Usually, the leading ./ and trailing / in relative paths are not mandatory
    • E.g., ./my_project/ is equivalent to my_project, my_project/, and ./my_project
  • You can add the slashes if you want to make it clear/explicit that you are referring to a file path
    • Slashes might be required in other scenarios (e.g., when referring to filenames for some commands on the command line)
    • But otherwise it can just be for clarification purposes
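
As a rough check of the equivalence described above, the sketch below lists the contents of the same folder using all four spellings. The folder and file names are made up, and the behavior of trailing slashes is assumed to match a Unix-like system; it has not been verified here on Windows.

```r
# Hypothetical folder containing my_project/ with one file inside
demo_root <- tempfile(pattern = "paths_")
dir.create(file.path(demo_root, "my_project"), recursive = TRUE)
file.create(file.path(demo_root, "my_project", "readme.txt"))

original_wd <- getwd()
setwd(demo_root)

# Four spellings of the same relative path
spellings <- c("./my_project/", "my_project", "my_project/", "./my_project")

# List the folder contents using each spelling
listings <- lapply(spellings, list.files)

# TRUE if every spelling produced the same listing
all_same <- length(unique(listings)) == 1

setwd(original_wd)
all_same
```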


What are the advantages of using relative file paths?

  • Notice that the absolute file path is dependent on your specific machine
    • E.g., /Users/<username>/path/to/file has your specific username in it
  • When you are collaborating with others on a shared project directory (e.g., on a GitHub repository), you will want to use relative file paths in your scripts to refer to the shared files/folders, so that the path will be valid for everyone

2.5 Example project directory

Let’s create a directory for a hypothetical research project called research_project. Inside this directory, we’ll create separate subdirectories for data, scripts, output, output/tables, output/figures:

research_project/
|
|- data/
|- scripts/
|- output/
  |- tables/
  |- figures/
# Check working directory to see where your directory will be created
getwd()
#> [1] "/Users/cyouh95/anyone-can-cook/rclass2/lectures/organizing_and_io"

# Create `research_project` directory
dir.create(path = "research_project")

# Create `data` and `scripts` within `research_project`
dir.create(path = "research_project/data")
dir.create(path = "research_project/scripts")

# List the contents of `research_project` directory
list.files(path = "research_project")
#> [1] "data"    "scripts"

# Though we did not create the `output` directory yet, we can still directly create the `output/tables` subdirectory if we specify `recursive = TRUE`, which will create both `output` and `tables`
dir.create(path = "research_project/output/tables", recursive = TRUE)
dir.create(path = "research_project/output/figures")

# List the contents of `output` directory
list.files(path = "research_project/output")
#> [1] "figures" "tables"

# To delete `research_project`, remember you need `recursive = TRUE` to delete a directory
# unlink(x = "research_project", recursive = TRUE)
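
The same structure can also be created in a loop. This is only a sketch: the project is placed in a temporary folder under the hypothetical name `research_project_demo` so that it does not interfere with the directory created above.

```r
# Hypothetical project root inside a temporary folder
project_root <- file.path(tempdir(), "research_project_demo")

# Every subdirectory we want, written relative to the project root
subdirs <- c("data", "scripts", "output/tables", "output/figures")

for (subdir in subdirs) {
  # recursive = TRUE also creates missing parents (the root and output/);
  # showWarnings = FALSE keeps reruns quiet when a directory already exists
  dir.create(file.path(project_root, subdir), recursive = TRUE, showWarnings = FALSE)
}

# Confirm that every requested directory now exists
all(dir.exists(file.path(project_root, subdirs)))
#> [1] TRUE
```

Because of `showWarnings = FALSE`, the loop can be rerun safely at the top of a script each time the script is executed.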

3 Organizing R script

What is an R script?

  • An R script contains R code (i.e., what you’d write inside R code chunks in an R markdown file)
  • Any lines of text that aren’t R code should start with a # to indicate that they are lines of comments
  • You can run code in an R script by putting your cursor on the line you want to run (or highlighting the code to run multiple lines) and clicking Run at the top of the script
    • Alternatively, you can press Ctrl + Enter (Windows) or Cmd + Enter (Mac) on your keyboard to run the code
  • You will want to use an R script for the purpose of writing R code
    • On the other hand, you can only execute R code within code chunks in R markdown files, so they are not as well-suited for large amounts of code writing. .Rmd files are mostly useful for combining text and code (e.g., writing a report) and presenting it in various formats (e.g., PDF, HTML)

How to organize an R script?

  • Like folder structure, there is no absolute right or wrong way to organize your R script, but below is one way you can do it
    • The general guideline is to clearly label each section of your script and define objects (e.g., directory paths, functions) at the top of your file to be used throughout the script
  • You will be completing the remaining problem sets in this course using an R script rather than an R markdown file
################################################################################
##
## [ PROJ ] < Name of the overall project >
## [ FILE ] < Name of this particular file >
## [ AUTH ] < Your name + email / Twitter / GitHub handle >
## [ INIT ] < Date you started the file >
##
################################################################################

## ---------------------------
## libraries
## ---------------------------

## ---------------------------
## directory paths
## ---------------------------

## ---------------------------
## functions
## ---------------------------

## -----------------------------------------------------------------------------
## < BODY >
## -----------------------------------------------------------------------------

## ---------------------------
## input
## ---------------------------

## ---------------------------
## process
## ---------------------------

## ---------------------------
## output
## ---------------------------

## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------

Source: R script template by Ben Skinner
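
To make the template more concrete, here is one way it might look when filled in. Everything specific in this sketch is hypothetical: the project and file names, the `mean_by_group()` helper, and the toy data. It uses only base R and a temporary folder in place of a real project directory.

```r
################################################################################
##
## [ PROJ ] Example research project (hypothetical)
## [ FILE ] summarize_scores.R (hypothetical)
## [ AUTH ] < Your name >
## [ INIT ] < Date you started the file >
##
################################################################################

## ---------------------------
## libraries
## ---------------------------

# None needed: this sketch uses base R only

## ---------------------------
## directory paths
## ---------------------------

# A temporary folder stands in for the project directory
project_dir <- file.path(tempdir(), "template_demo")
data_dir <- file.path(project_dir, "data")
tables_dir <- file.path(project_dir, "output", "tables")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(tables_dir, recursive = TRUE, showWarnings = FALSE)

## ---------------------------
## functions
## ---------------------------

# Mean score per group, returned as a data frame
mean_by_group <- function(df) {
  aggregate(score ~ group, data = df, FUN = mean)
}

## -----------------------------------------------------------------------------
## BODY
## -----------------------------------------------------------------------------

## ---------------------------
## input
## ---------------------------

# Write toy data first so the sketch runs on its own, then read it back
raw_path <- file.path(data_dir, "scores.csv")
toy_scores <- data.frame(group = c("a", "a", "b"), score = c(1, 3, 5))
write.csv(toy_scores, file = raw_path, row.names = FALSE)
scores <- read.csv(raw_path)

## ---------------------------
## process
## ---------------------------

summary_table <- mean_by_group(scores)

## ---------------------------
## output
## ---------------------------

write.csv(summary_table, file = file.path(tables_dir, "mean_scores.csv"), row.names = FALSE)

## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------
```

Defining the directory paths and helper functions near the top means the body of the script only refers to names, so moving the project requires changing one line.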

3.1 Creating directory path objects

We use the file.path() command because it is smart. Some computer operating systems use forward slashes, /, for their file paths; others use backslashes, \. Rather than try to guess or assume what operating system future users will use, we can use R’s function, file.path(), to check the current operating system and build the paths correctly for us.

Source: Organizing Lecture by Ben Skinner


The file.path() function:

?file.path

# SYNTAX AND DEFAULT VALUES
file.path(..., fsep = .Platform$file.sep)
  • Function: Construct the path to a file from components in a platform-independent way
  • Arguments
    • ...: File path component(s)
    • fsep: The path separator to use (default is /)
      • Usually, we ignore this argument
  • Output: A character vector object of the arguments concatenated by the / path separator (unless an alternative path separator is specified in fsep)
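
A short sketch of the arguments described above. The directory and file names are made up, and none of these paths need to exist, since `file.path()` only builds strings.

```r
# Components are joined with "/" by default
default_path <- file.path("research_project", "output", "tables")
default_path
#> [1] "research_project/output/tables"

# fsep changes the separator (a backslash must be escaped inside an R string,
# and print() shows it escaped as well)
backslash_path <- file.path("research_project", "output", fsep = "\\")
backslash_path
#> [1] "research_project\\output"

# Arguments are vectorized: one path is built per filename
data_files <- file.path("research_project", "data", c("students.csv", "schools.csv"))
data_files
#> [1] "research_project/data/students.csv" "research_project/data/schools.csv"
```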


Example: Creating file path objects using file.path()

Recall our example research_project directory from earlier:

research_project/
|
|- data/
|- scripts/
|- output/
  |- tables/
  |- figures/


Let’s use file.path() to create file path objects for some of these directories:

# Pass in each section of the path as a separate argument
file.path('.', 'research_project', 'data')
#> [1] "./research_project/data"


We would usually create and save these objects at the top of our script to be used later on:

# Create file path object for `data` directory
data_dir <- file.path('.', 'research_project', 'data')
data_dir
#> [1] "./research_project/data"

# Create file path object for `output` directory
output_dir <- file.path('.', 'research_project', 'output')
output_dir
#> [1] "./research_project/output"

# Create file path object for `tables` directory
tables_dir <- file.path('.', 'research_project', 'output', 'tables')
tables_dir
#> [1] "./research_project/output/tables"


Note that the object created by file.path() is just a character vector containing the path:

# Investigate file path object
output_dir %>% str()
#>  chr "./research_project/output"


Since the file path object is just a regular character vector, we could use that as input to file.path() to help create subdirectory path objects:

# Create file path object for `figures` directory using `output_dir`
figures_dir <- file.path(output_dir, 'figures')
figures_dir
#> [1] "./research_project/output/figures"


Similarly, we can use the file path object anywhere that we would normally input a file path:

# List the contents of the `output` directory
list.files(path = output_dir)
#> [1] "figures" "tables"
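
A file path object is also handy when writing files. In the sketch below, `tables_dir` is redefined inside a temporary folder so the example is self-contained, and the table contents and the filename `summary.csv` are invented for illustration.

```r
# Hypothetical stand-in for the `tables_dir` object, placed in a temporary folder
tables_dir <- file.path(tempdir(), "research_project", "output", "tables")
dir.create(tables_dir, recursive = TRUE, showWarnings = FALSE)

# Build the full path to a new file from the directory object plus a filename
table_path <- file.path(tables_dir, "summary.csv")

# Toy table used only for illustration
summary_table <- data.frame(group = c("a", "b"), n = c(10, 12))
write.csv(summary_table, file = table_path, row.names = FALSE)

# Confirm the file landed where we expect
file.exists(table_path)
#> [1] TRUE
```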

Example: Adding a figure to the figures_dir

Let’s download an image from R for Data Science to the figures_dir we created:

# We will introduce the download.file() function in the next section
download.file(url = 'https://d33wubrfki0l68.cloudfront.net/8b89c5554ed6108359d59909d441dbeb010e8802/9f366/visualize_files/figure-html/unnamed-chunk-7-1.png',
              destfile = file.path(figures_dir, 'scatterplot.png'))


We can use the file path object figures_dir to help us refer to the saved image:

# Display image using include_graphics()
knitr::include_graphics(path = file.path(figures_dir, 'scatterplot.png'))