1 Introduction

Load packages:

library(tidyverse)

Resources used to create this lecture:

1.1 What and why use Git and GitHub?

Video from Will Doyle, Professor at Vanderbilt University

What is version control?

  • Version control is a “system that records changes to a file or set of files over time so that you can recall specific versions later”
  • Keeps records of changes, who made changes, and when those changes were made
  • You or collaborators take “snapshots” of a document at a particular point in time. Later on, you can recover any previous snapshot of the document.

How version control works:

  • Imagine you write a simple text file document that gives a recipe for yummy chocolate chip cookies and you save it as cookies.txt
  • Later on, you make changes to cookies.txt (e.g., add alternative baking time for people who like “soft and chewy” cookies)
  • When using version control to make these changes, you don’t save an entirely new version of cookies.txt; rather, you save the changes made relative to the previous version of cookies.txt
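To make “changes relative to the previous version” concrete, here is a minimal sketch that uses the standard diff command to compare two versions of a recipe file. The file names and recipe text are hypothetical, and this only illustrates the idea of recording changes; it is not a description of how any particular version control system stores its data.

```shell
# Work in a throwaway directory so nothing else on your machine is touched
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Version 1 of the (hypothetical) recipe
printf 'Bake at 375F\nBake for 12 minutes\n' > cookies_v1.txt

# Version 2 adds an alternative baking time
cp cookies_v1.txt cookies_v2.txt
echo "Bake for 9 minutes for soft and chewy cookies" >> cookies_v2.txt

# diff prints only the lines that differ between the two versions
# (diff exits with status 1 when files differ, so `|| true` keeps that from
# being treated as an error)
changes=$(diff cookies_v1.txt cookies_v2.txt || true)
echo "$changes"
```

The output lists only the added line, not the two lines that are identical in both versions.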

Why use version control when you can just save a new version of the document?

  1. Saving an entirely new document each time a change is made is very inefficient from a memory/storage perspective
    • When you save a new version of a document, much of the contents are the same as the previous version
    • Inefficient to devote space to saving multiple copies of the same content
  2. When a document undergoes lots of changes – especially a document that multiple people are collaborating on – it’s hard to keep track of so many different documents. It is easy to end up with a situation like this:

Credit: Jorge Chan (and also, lifted this example from Benjamin Skinner’s intro to Git/GitHub lecture)


What is Git? (from git website)

“Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency”

  • Git is a particular version control software, originally created by Linus Torvalds in 2005 and now maintained by the Git project
  • Git can be used by:
    • An individual/standalone developer
    • A team working on a collaborative project, where multiple people work on the same files
  • The term “distributed” means that every user collaborating on the project has a complete copy of all files and the history of changes to all files
  • Git is the industry standard version control system used to create software
  • Increasingly, Git is the industry standard for collaborative academic research projects

What is a Git repository?

  • A Git repository is any project managed in Git
  • From Git Handbook by github.com:
    • A repository “encompasses the entire collection of files and folders associated with a project, along with each file’s revision history”
    • Because Git is a distributed version control system, “repositories are self-contained units and anyone who owns a copy of the repository can access the entire codebase and its history”
  • This course is a Git repository (Rclass2 repository)
  • Local vs. remote Git repository:
    • Local Git repository: Git repository for a project stored on your machine
    • Remote Git repository: Git repository for a project stored on the internet
  • Typically, a local Git repository is connected to a remote Git repository
    • You can make changes to local repository on your machine and then push those changes to the remote repository
    • Other collaborators can also make changes to their local repository, push them to the remote repository, and then you can pull these changes into your local repository
  • Private vs. public repositories
    • Public repositories: anyone can access the repository
      • e.g., rclass2, the Git repository we created to develop the Rclass2 course, is a public repository because we want the public to benefit from this course
    • Private repositories: only those who have been granted access by a repository “administrator” can access the repository
      • e.g., rclass2_student_issues is a private repository because we don’t want communication between students or communication between students and instructors to be public
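The push/pull relationship between local and remote repositories described above can be sketched entirely on one machine. In this sketch a “bare” repository in a temporary folder stands in for the remote (the role GitHub plays in this course), and two clones stand in for two collaborators. All directory names, file names, and user identities are hypothetical, and the sketch assumes Git is already installed.

```shell
# Throwaway workspace; every name below is made up for illustration
work=$(mktemp -d)
cd "$work"

# A bare repository stands in for the remote repository
git init -q --bare remote.git

# Collaborator A clones the (empty) remote, commits a file, and pushes it
git clone -q remote.git collab_a 2>/dev/null
cd collab_a
echo "Bake for 12 minutes" > cookies.txt
git add cookies.txt
git -c user.name="Collaborator A" -c user.email="a@example.com" commit -q -m "Add cookie recipe"
git push -q origin HEAD
cd ..

# Collaborator B clones the remote and receives A's commit
git clone -q remote.git collab_b

# A makes another change locally and pushes it to the remote
cd collab_a
echo "Bake for 9 minutes for soft and chewy cookies" >> cookies.txt
git -c user.name="Collaborator A" -c user.email="a@example.com" commit -q -am "Add soft and chewy option"
git push -q origin HEAD
cd ..

# B pulls the new commit from the remote into their local repository
cd collab_b
git pull -q --ff-only
cat cookies.txt
```

After the pull, collaborator B’s copy of cookies.txt contains both lines, even though B never edited the file.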

What is GitHub?

  • GitHub is the industry standard hosting site/service for Git repositories
    • Hosting services allow people/organizations to store files on the internet and make those files available to others
  • Microsoft acquired GitHub in 2018 for $7.5 billion
  • GitHub stores your local repositories in “the cloud”
    • E.g., if you create a local repository stored on your machine, GitHub enables you to create a “remote” version of this repository
    • Also, you can connect to a remote repository that already exists and create a local version of this repository on your machine
  • More broadly, GitHub enables you to store files, share code, and communicate with others
  • GitHub organizations
    • GitHub organizations “are shared accounts where businesses and open-source projects can collaborate across many projects [that is, collaborate across many repositories] at once”
    • e.g., we created the GitHub organization anyone-can-cook, which contains the public repositories associated with the Rclass1 and Rclass2 courses, and also repositories that enrolled students create to complete problem sets
  • In this course and in Rclass1, you have already been using GitHub to communicate with instructors and with your classmates:
    • e.g., we created a private repository called rclass2_student_issues so that students can ask questions about course content
    • e.g., within the anyone-can-cook GitHub organization, we have created teams for each problem set group so that you can communicate with your problem set group

1.2 How we will learn Git and GitHub

“Whoah, I’ve just read this quick tutorial about git and oh my god it is cool. I feel now super comfortable using it, and I’m not afraid at all to break something.”— said no one ever (de Wulf)

Understanding and learning how to use Git and GitHub can be intimidating. A lot of tutorials give you recipes for how to accomplish specific tasks (either point-and-click or issuing commands on command line), but don’t provide a conceptual understanding of how things work.

Here is how we will learn Git and GitHub over the course of the quarter:

  • Three weeks of the course will be devoted to Git and GitHub
    • Most of this time will be devoted to Git rather than GitHub because you already have some experience with GitHub and because learning Git is much harder than learning GitHub
  • During the Git/GitHub unit, we will:
    • Provide a conceptual overview of concepts and workflow
    • Show you how to accomplish specific tasks by issuing commands on the command line
    • Devote time to providing in-depth conceptual understanding of particular topics/concepts
    • Give you practice with Git/GitHub tasks during in-class exercises and in weekly problem sets
  • With the exception of using GitHub for communication (“issues”) and for creating/cloning repositories, we will perform all tasks on the command line rather than using a point-and-click graphical user interface (GUI)
    • Initially, this will feel intimidating, but after a few weeks you will see that this helps you understand Git/GitHub better and is much more efficient
  • After the Git/GitHub unit:
    • Weekly problem sets will be completed and submitted using GitHub
    • When communicating with your problem set “team,” you will use GitHub “issues”
    • When posing questions to instructors/classmates, you will use GitHub “issues”
    • Selected additional lectures/class exercises about additional Git/GitHub concepts

1.3 Command line vs. graphical user interface (GUI)

What is a shell?

  • “A shell is a terminal application used to interface with an operating system through written commands” (Git Bash tutorial)
  • “The shell is a program on your computer whose job is to run other programs. Pseudo-synonyms are ‘terminal’, ‘command line’, and ‘console.’” (Happy Git and GitHub for the useR by Jenny Bryan)
  • In this course, we will usually use the term “command line” rather than “shell”
  • In the command line, you issue commands one line at a time
  • Most programmers use the command line rather than a graphical user interface (GUI) to accomplish tasks

What is a graphical user interface (GUI)?

  • A graphical user interface is an interface for using a program that includes graphical elements such as windows, icons, and buttons that the user can click on using the mouse
  • For example, “RStudio” has GUI capabilities in that it has windows and you can perform operations using point-and-click (however, RStudio also has command line capabilities)
  • RStudio also includes a GUI for performing Git operations
  • There are many other GUI software packages for performing Git operations
    • Popular tools include “GitHub Desktop,” “GitKraken,” and “SmartGit”
    • See GUI Clients

In this course, we will perform Git operations solely using the command line. Why?

  • Learning Git from the command line will give you a deeper understanding of how Git and GitHub work
    • I have found that performing Git operations using a GUI did nothing to help me overcome my feelings of anxiety/intimidation about Git
    • As soon as I started doing stuff on the command line, I started feeling less intimidated
  • After you start feeling more comfortable with the command line, using the command line makes you much more efficient than using a GUI
  • Learning the command line takes time and does feel intimidating
    • So we will devote substantial time in-class and during problem sets to learning/practicing the command line

Background information on the Unix shell “Bash”

We will use the Unix shell called “Bash” to perform Git operations:

  • Some background on “Bash”
    • Unix is an operating system developed by AT&T Bell Labs in the late 1960s
    • The “Unix shell” is a command line program for issuing commands to “Unix-like” operating systems (Unix Shell)
      • Unix-like operating systems include macOS and Linux, but not Windows
      • The first Unix shell was the “Thompson shell” originally written by Ken Thompson at Bell Labs in 1971
    • The Bourne shell was a Unix shell programming language written by Stephen Bourne at Bell Labs in 1979
    • The “Bourne Again Shell,” commonly referred to as “Bash,” was “written by Brian Fox for the GNU Project as a free software replacement for the Bourne shell,” and first released in 1989
  • Relationship between Git and Bash
    • “At its core, Git is a set of command line utility programs that are designed to execute on a Unix style command-line environment” (Git Bash)
  • Mac users
    • “Terminal” is the application that enables you to control your Mac using a command line prompt
    • Terminal runs a Unix shell: Bash was the default shell through macOS Mojave (10.14), and zsh has been the default since macOS Catalina (10.15)
    • Therefore, Mac users use “Terminal” to perform Git operations; the commands covered in this lecture work the same way in Bash and zsh
  • Windows users
    • Windows is not a “Unix-like” operating system. Therefore, Bash is not the default command line interface
    • In order for Windows users to use Bash and to perform Git operations, you must install the Git Bash program, which is installed as part of Git for Windows
  • Because the Mac “Terminal” program and the Windows “Git Bash” program both provide a Unix-style shell, the commands for performing Git operations on the command line are the same for Mac and Windows users


Why learn the command line and “command-line bullshittery,” from Philip J. Guo

“What is wonderful about doing applied computer science research in the modern era is that there are thousands of pieces of free software and other computer-based tools that researchers can leverage to create their research software. With the right set of tools, one can be 10x or even 100x more productive than peers who don’t know how to set up those tools.”

“But this power comes at a great cost: It takes a tremendous amount of command-line bullshittery to install, set up, and configure all of this wonderful free software. What I mean by command-line bullshittery is dealing with all of the arcane, obscure, strange bullshit of the command-line paradigm that most of these free tools are built upon….So perhaps what is more important to a researcher than programming ability is adeptness at dealing with command-line bullshittery, since that enables one to become 10x or even 100x more productive than peers by finding, installing, configuring, customizing, and remixing the appropriate pieces of free software.”

Helping my students overcome command-line bullshittery by Philip J. Guo

1.4 Installation and running shell commands from RStudio

If you have a Windows computer, you will need to follow these steps to install Git for Windows, which will allow you to run Bash and Git commands. If you have a Mac, you won’t need to download a terminal program because macOS already comes with a Terminal app. However, if you have a newer version of macOS, you may need to run xcode-select --install in your Terminal before you’re able to use Git commands (see here for more info).
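Once installation is finished, a quick check along the following lines can confirm that both Git and Bash are available from your command line. This is a minimal sketch; the version numbers printed will differ from machine to machine, so no expected output is shown.

```shell
# Print the installed Git version; an error such as "command not found"
# means Git is not installed or is not on your PATH
git_version=$(git --version)
echo "$git_version"

# Print the first line of the Bash version information
bash_version=$(bash --version | head -n 1)
echo "$bash_version"
```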


In RStudio, there is a Terminal tab (next to the Console tab) where you can run Bash commands and perform Git operations:

Credit: RStudio Terminal blog post by Gary Ritchie


If you are working from an R markdown file, you can also create bash code chunks (similar to R code chunks) for running shell commands. All you need to do is indicate {bash} for the code chunk:

1.5 RStudio Console vs. Terminal

What is the difference between the RStudio Console and Terminal?

  • The Console is for running R code
    • You can run a line of R code in the Console by typing it and hitting enter. To run multiple statements from a single line, separate them with semicolons.
    • Running R code in the Console is equivalent to running R code within an R script
  • The Terminal is for running shell commands (e.g., bash commands)
    • You can run a command in the Terminal by typing it and hitting enter. To run multiple commands from a single line, separate them with semicolons.
    • Running shell commands from the Terminal is equivalent to running them in your Git Bash (Windows) or Terminal app (Mac)
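As a small illustration of using semicolons in the Terminal, the sketch below runs three shell commands from a single line. The file name is hypothetical, and the commands run in a temporary directory so that no existing files are affected.

```shell
# Work in a throwaway directory
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Three commands on one line, separated by semicolons:
# write a file, append a second line to it, then print it
echo "line 1" > semicolon_demo.txt; echo "line 2" >> semicolon_demo.txt; cat semicolon_demo.txt
## line 1
## line 2
```

The shell runs the commands from left to right, exactly as if each had been typed on its own line.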

2 Command line

In this section, we will go over some of the commonly used command line commands. You can run these commands either in your RStudio Terminal or in a bash code chunk of an R markdown file.


Generally, you can pull up the help file for a command by running:

  • command_name --help (Windows)
  • man command_name (Mac)


We’ll use the ls command as an example:

ls --help
man ls

ls: List directory contents

  • Syntax: ls [<option(s)>] [<directory_name(s)>]
    • Square brackets [] indicate that the enclosed options and arguments are optional, so you do not have to specify them
    • Options can be specified using - or -- (see help file)
      • Note: For the most part, - specifies the short-name version of an option and -- specifies the long-name version
    • We will not be listing out all the options and arguments in this lecture (only the commonly used ones), so see help file for full details
  • Options:
    • -a: Include directory entries whose names begin with a dot (.)
      • Note: Hidden files (i.e., files you don’t by default see in your Files Explorer or Finder) have names that start with a dot
    • -l: List files in long format (i.e., include additional information like permissions, owner, file size, date of last modification, etc.)
  • Arguments:
    • directory_name(s): Which directories to list the content of (default: current directory)
  • Equivalent R function: list.files()


Example: Using ls to list content in current directory (default)

ls
## git_and_github.Rmd
## git_and_github.html
## render_toc.R

Example: Using ls to list content in parent directory

ls ..
## apis_and_json
## ggplot
## git_and_github
## organizing_and_io
## programming
## strings_and_regex

Example: Using ls -a to list content in parent directory including entries whose names begin with a dot

ls -a ..
## .
## ..
## apis_and_json
## ggplot
## git_and_github
## organizing_and_io
## programming
## strings_and_regex
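The parent directory above happens to contain no hidden files, so ls -a only adds the . and .. entries. The following sketch, which uses a temporary directory and hypothetical file names, shows the difference when a hidden file is present and also shows that short options can be combined.

```shell
# Create a throwaway directory containing one regular file and one hidden file
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch notes.txt .hidden_config

# Default: entries whose names begin with a dot are omitted
ls
## notes.txt

# -a: hidden entries are included (.hidden_config, plus . and ..)
ls -a

# Short options can be combined: long format, including hidden entries
ls -la
```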

2.2 Working with files


echo: Write to standard output (i.e., print to terminal)

  • Syntax: echo <text_to_print>
    • Note: Use help echo to access the help file on Windows
  • Arguments:
    • text_to_print: Text to print to terminal
  • Notes:
    • The text printed to the terminal can instead be redirected to a file using > (this overwrites the file if it already exists)
    • The text can also be appended to an existing file using >> (i.e., without overwriting the existing content of the file)

cat: Concatenate and print files

  • Syntax: cat <file_name>
  • Arguments:
    • file_name: File to print to terminal

Example: Using echo to print text to terminal

echo "Hello, World!"
## Hello, World!

Example: Using echo and > to redirect text to file and cat to print content of file

# Redirect text to file
echo "Hello, World!" > my_script.R

# Print contents of file
cat my_script.R
## Hello, World!
# Using `>` again overwrites the existing contents of the file
echo "library(tidyverse)" > my_script.R

# Print contents of file
cat my_script.R
## library(tidyverse)

Example: Using echo and >> to append text to file and cat to print content of file

# Append line to R script by using `>>` (`>` would overwrite contents of file)
echo "mpg %>% head(5)" >> my_script.R

# Print contents of file
cat my_script.R
## library(tidyverse)
## mpg %>% head(5)



head: Print first part of file

  • Syntax: head [<option(s)>] [<file_name>]
  • Options:
    • -n <int>: Print the first <int> lines (default: 10)
  • Arguments:
    • file_name: File to print

tail: Print last part of file

  • Syntax: tail [<option(s)>] [<file_name>]
  • Options:
    • -n <int>: Print the last <int> lines (default: 10)
  • Arguments:
    • file_name: File to print

Example: Using head to print first part of file

# Preview first 10 lines by default (or up to 10 lines)
head my_script.R
## library(tidyverse)
## mpg %>% head(5)
# Preview first line
head -n 1 my_script.R
## library(tidyverse)

Example: Using tail to print last part of file

# Preview last 10 lines by default (or up to 10 lines)
tail my_script.R
## library(tidyverse)
## mpg %>% head(5)
# Preview last line
tail -n 1 my_script.R
## mpg %>% head(5)
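Because my_script.R has only two lines, the examples above cannot show the 10-line default. Here is a minimal sketch using a longer, hypothetical file (numbers.txt, created with the seq command) in a temporary directory.

```shell
# Work in a throwaway directory
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create a file containing the numbers 1 through 12, one per line
seq 1 12 > numbers.txt

# Default: prints only the first 10 lines (1 through 10)
head numbers.txt

# First 3 lines
head -n 3 numbers.txt
## 1
## 2
## 3

# Last 2 lines
tail -n 2 numbers.txt
## 11
## 12
```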



cp: Copies files or directories

  • Syntax: cp [<option(s)>] <source_file/directory> <destination_file/directory>
  • Options:
    • -r: Copies directories and their contents recursively (this flag is required to copy a directory)
  • Arguments:
    • Copies the source_file/directory to destination_file/directory

Example: Using cp to copy a file

# Print contents of my_script.R
cat my_script.R
## library(tidyverse)
## mpg %>% head(5)
# Make a copy of my_script.R called my_script_copy.R inside my_folder/ (my_folder/ must already exist)
cp my_script.R my_folder/my_script_copy.R

# Print contents of my_script_copy.R
cat my_folder/my_script_copy.R
## library(tidyverse)
## mpg %>% head(5)

Example: Using cp -r to copy a directory

# View contents of my_folder/
ls my_folder
## my_script_copy.R
## test_script.R
# Make a copy of my_folder/ (with its contents) called my_folder_copy/
cp -r my_folder my_folder_copy

# View contents of my_folder_copy/
ls my_folder_copy
## my_script_copy.R
## test_script.R



mv: Rename or move files

  • Syntax:
    • Renaming: mv <old_file/directory> <new_file/directory>
    • Moving: mv <file/directory(s)> <destination_directory>
  • Arguments:
    • To rename, provide 2 arguments - the file/directory you want to rename and the name you want to change it to
    • To move, the last argument provided should be a directory and all files/directories provided before that will be moved into that directory

Example: Using mv to rename a file or directory

# Rename file
mv my_script.R create_dataset.R
# Rename directory
mv my_folder_copy my_folder_2

Example: Using mv to move files and directories into a directory

# View contents of my_folder/
ls my_folder
## my_script_copy.R
## test_script.R
# Move file and directory into the destination directory (last arg)
mv create_dataset.R my_folder_2 my_folder

# View contents of my_folder/
ls my_folder
## create_dataset.R
## my_folder_2
## my_script_copy.R
## test_script.R


3 Overview of core concepts and workflow

This section introduces some core concepts and explains the basic Git “workflow” (i.e., how Git works).

3.1 Git stores “snapshots,” not “differences”

Version control systems that save differences:

  • Prior to Git, “centralized version control systems” were the industry standard version control systems (From Getting Started - About Version Control)
    • In these systems, a central server stored all the versions of a file and “clients” (e.g., a programmer working on a project on their local computer) could “check out” files from the central server
  • These centralized version control systems stored multiple versions of a file as “differences”
    • For example, imagine you create a simple text file called twinkle.txt
    • “Version 1” (the “base” version) of twinkle.txt has the following contents:
      • twinkle, twinkle, little star
    • You make some changes to twinkle.txt and save those changes, resulting in “Version 2,” which has the following contents:
      • twinkle, twinkle, little star, how I wonder what you are!
    • When storing “Version 2” of twinkle.txt, centralized version control systems don’t store the entire file. Rather, they store the changes relative to the previous version. In our example, “Version 2” stores:
      • , how I wonder what you are!
  • The below figure portrays version control systems that store data as changes relative to the base version of each file:


Credit: Getting Started - What is Git


Git stores data as snapshots rather than differences:

  • Git doesn’t think of data as differences relative to the base version of each file
  • Rather, Git thinks of data as “a series of snapshots of a miniature filesystem” or, said differently, a series of snapshots of all files in the repository
  • “With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.”
  • “To be efficient, if files have not changed, Git doesn’t store the file again, just a link to the previous identical file it has already stored.”
  • The below figure portrays storing data as a stream of snapshots over time:
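The snapshot idea can also be checked from the command line. The sketch below assumes Git is installed, works in a temporary directory, and uses hypothetical file names and user identities; git show and git rev-parse are standard Git commands that have not been introduced in this lecture yet. It makes two commits, changes only one file between them, and then confirms that each commit can reproduce the full file and that the unchanged file is stored only once.

```shell
# Throwaway repository; every name below is made up for illustration
work=$(mktemp -d)
cd "$work"
git init -q snapshot_demo
cd snapshot_demo

# First commit (snapshot) contains two files
echo "twinkle, twinkle, little star" > twinkle.txt
echo "Bake for 12 minutes" > cookies.txt
git add twinkle.txt cookies.txt
git -c user.name="Student" -c user.email="student@example.com" commit -q -m "Version 1"

# Second commit changes only twinkle.txt
echo "twinkle, twinkle, little star, how I wonder what you are!" > twinkle.txt
git -c user.name="Student" -c user.email="student@example.com" commit -q -am "Version 2"

# Each snapshot can reproduce the full contents of the file at that point in time
git show HEAD~1:twinkle.txt
git show HEAD:twinkle.txt

# The unchanged file has the same object ID in both snapshots, which means
# Git reuses the copy it already stored instead of storing the file again
cookies_v1=$(git rev-parse HEAD~1:cookies.txt)
cookies_v2=$(git rev-parse HEAD:cookies.txt)
echo "$cookies_v1"
echo "$cookies_v2"
```

The two IDs printed at the end are identical, while the IDs for twinkle.txt would differ between the two commits because its contents changed.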