1 Introduction

Load packages:

library(tidyverse)

Resources used to create this lecture:

1.1 What and why use Git and GitHub?

Video from Will Doyle, Professor at Vanderbilt University

What is version control?

  • Version control is a “system that records changes to a file or set of files over time so that you can recall specific versions later”
  • Keeps records of changes, who made changes, and when those changes were made
  • You or collaborators take “snapshots” of a document at a particular point in time. Later on, you can recover any previous snapshot of the document.

How version control works:

  • Imagine you write a simple text file document that gives a recipe for yummy chocolate chip cookies and you save it as cookies.txt
  • Later on, you make changes to cookies.txt (e.g., add alternative baking time for people who like “soft and chewy” cookies)
  • When using version control to make these changes, you don’t save an entirely new version of cookies.txt; rather, you save the changes made relative to the previous version of cookies.txt

Why use version control when you can just save a new version of the document?

  1. Saving an entirely new document each time a change is made is very inefficient from a memory/storage perspective
    • When you save a new version of a document, much of the contents are the same as the previous version
    • Inefficient to devote space to saving multiple copies of the same content
  2. When a document undergoes lots of changes – especially a document that multiple people are collaborating on – it’s hard to keep track of so many different documents. It is easy to end up with a situation like this:

Credit: Jorge Chan (and also, lifted this example from Benjamin Skinner’s intro to Git/GitHub lecture)


What is Git? (from git website)

“Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency”

  • Git is a particular version control software, originally created by Linus Torvalds in 2005 and now maintained by the Git project
    • Git is free and open source software, meaning that anyone can use, share, and modify the software
    • Although Microsoft owns GitHub (described below), it thankfully does not own Git!
  • Git can be used by:
    • An individual/standalone developer
    • Teams working on collaborative projects, where multiple people collaborate on each file
  • The term “distributed” means that every user collaborating on the project has access to all files and the history of changes to all files
  • Git is the industry standard version control system used to create software
  • Increasingly, Git is the industry standard for collaborative academic research projects

What is a Git repository?

  • A Git repository is any project managed in Git
  • From Git Handbook by github.com:
    • A repository “encompasses the entire collection of files and folders associated with a project, along with each file’s revision history”
    • Because git is a distributed version control system, “repositories are self-contained units and anyone who owns a copy of the repository can access the entire codebase and its history”
  • This course is a Git repository (Rclass2 repository)
  • Local vs. remote git repository:
    • Local git repository: git repository for a project stored on your machine
    • Remote git repository: git repository for a project stored on the internet
  • Typically, a local git repository is connected to a remote git repository
    • You can make changes to local repository on your machine and then push those changes to the remote repository
    • Other collaborators can also make changes to their local repository, push them to the remote repository, and then you can pull these changes into your local repository
  • Private vs. public repositories
    • Public repositories: anyone can access the repository
      • e.g., rclass2, the git repository we created to develop the Rclass2 course, is a public repository because we want the public to benefit from this course
    • Private repositories: only those who have been granted access by a repository “administrator” can access the repository
      • e.g., rclass2_student_issues is a private repository because we don’t want communication between students or communication between students and instructors to be public

What is GitHub?

  • GitHub is the industry standard hosting site/service for Git repositories
    • Hosting services allow people/organizations to store files on the internet and make those files available to others
  • Microsoft acquired GitHub in 2018 for $7.5 billion
  • GitHub is where remote git repositories live
    • E.g., if you create a local repository stored on your machine, GitHub enables you to create a “remote” version of this repository that lives “in the cloud”
    • Also, you can connect to a remote repository that already exists and create a local version of this repository on your machine
  • More broadly, GitHub enables you to store files, share code, and communicate with others
  • GitHub organizations
    • GitHub organizations “are shared accounts where businesses and open-source projects can collaborate across many projects [that is, collaborate across many repositories] at once”
    • e.g., we created the GitHub organization anyone-can-cook, which contains the public repositories associated with the Rclass1 and Rclass2 courses, and also repositories that enrolled students create to complete problem sets
  • In this course and in Rclass1 you have already been using GitHub to communicate with instructors and with your classmates:
    • e.g., we created a private repository called rclass2_student_issues so that students can ask questions about course content
    • e.g., within the anyone-can-cook GitHub organization, we have created teams for each problem set group so that you can communicate with your problem set group

1.2 How we will learn Git

Even professional programmers find learning and understanding git to be challenging

“Whoah, I’ve just read this quick tutorial about git and oh my god it is cool. I feel now super comfortable using it, and I’m not afraid at all to break something.”— said no one ever (de Wulf)

Understanding and learning how to use Git can be intimidating. A lot of tutorials give you recipes for how to accomplish specific tasks (either point-and-click or issuing commands on command line), but don’t provide a conceptual understanding of how things work.

Here is how we will learn Git and GitHub over the course of the quarter:

  • Three weeks of the course will be devoted to Git and GitHub
    • Most of this time will be devoted to Git rather than GitHub because you already have some experience with GitHub and because learning Git is much harder than learning GitHub
  • During the Git/GitHub unit, we will:
    • Provide a conceptual overview of concepts and workflow
    • Show you how to accomplish specific tasks by issuing commands on the command line
    • Devote time to providing in-depth conceptual understanding of particular topics/concepts
    • Give you practice using Git/GitHub as you work through lectures and weekly problem sets
  • With the exception of using GitHub for communication (“issues”) and for creating/cloning repositories, we will perform all tasks on the command line rather than using a point-and-click graphical user interface (GUI)
    • Initially, this will feel intimidating, but after a few weeks you will see that this helps you understand Git/GitHub better and is much more efficient
  • After the Git/GitHub unit:
    • Weekly problem sets will be completed and submitted using GitHub
    • When communicating with your problem set “team,” you will use GitHub “issues”
    • When posing questions to instructors/classmates, you will use GitHub “issues”

1.3 Command line vs. graphical user interface (GUI)

What is a shell?

  • “A shell is a terminal application used to interface with an operating system through written commands” (Git Bash tutorial)
  • “The shell is a program on your computer whose job is to run other programs. Pseudo-synonyms are ‘terminal’, ‘command line’, and ‘console.’” (Happy Git and GitHub for the useR by Jenny Bryan)
  • In this course, we will usually use the term “command line” rather than “shell”
  • In the command line, you issue commands one line at a time
  • Most programmers use the command line rather than a graphical user interface (GUI) to accomplish tasks

What is a graphical user interface (GUI)?

  • A graphical user interface is an interface for using a program that includes graphical elements such as windows, icons, and buttons that the user can click on using the mouse
  • For example, “RStudio” has GUI capabilities in that it has windows and you can perform operations using point-and-click (however, RStudio also has command line capabilities)
  • RStudio also includes a GUI for performing Git operations
  • There are many other GUI software packages for performing Git operations
    • Popular tools include “GitHub Desktop,” “GitKraken,” and “SmartGit”
    • See GUI Clients

In this course, we will perform Git operations solely using the command line. Why?

  • Learning Git from the command line will give you a deeper understanding of how Git and GitHub work
    • I have found that performing Git operations using a GUI did nothing to help me overcome my feelings of anxiety/intimidation about Git
    • As soon as I started doing stuff on the command line, I started feeling less intimidated
  • After you start feeling more comfortable with the command line, using the command line makes you much more efficient than using a GUI
  • Learning the command line takes time and does feel intimidating
    • So we will devote substantial time in-class and during problem sets to learning/practicing the command line

Background information on the Unix shell “Bash”

We will use the Unix shell called “Bash” to perform Git operations:

  • Some background on “Bash”
    • Unix is an operating system developed by AT&T Bell Labs in the late 1960s
    • The “Unix shell” is a command line program for issuing commands to “Unix-like” operating systems (Unix Shell)
      • Unix-like operating systems include macOS and Linux, but not Windows
      • The first Unix shell was the “Thompson shell” originally written by Ken Thompson at Bell Labs in 1971
    • The Bourne shell was a Unix shell programming language written by Stephen Bourne at Bell Labs in 1979
    • The “Bourne Again Shell,” commonly referred to as “Bash,” was “written by Brian Fox for the GNU Project as a free software replacement for the Bourne shell,” and first released in 1989
  • Relationship between Git and Bash
    • “At its core, Git is a set of command line utility programs that are designed to execute on a Unix style command-line environment” (Git Bash)
  • Mac users
    • “Terminal” is the application that enables you to control your Mac using a command line prompt
    • On older versions of macOS, Terminal runs the Bash shell by default; since macOS Catalina (10.15), the default shell is zsh, which handles the commands in this lecture in largely the same way
    • Therefore, Mac users use “Terminal” to perform Git operations, and the commands shown in this lecture are written for Bash
  • Windows users
    • Windows is not a “Unix-like” operating system. Therefore, Bash is not the default command line interface
    • In order for Windows users to use Bash and to perform Git operations, you must install the Git Bash program, which is installed as part of Git for Windows
  • Because the Mac “Terminal” program and the Windows “Git Bash” program both provide a Unix-style command line, performing Git operations using the command line will be the same for both Mac and Windows users!


Why learn the command line and “command-line bullshittery,” from Philip J. Guo

“What is wonderful about doing applied computer science research in the modern era is that there are thousands of pieces of free software and other computer-based tools that researchers can leverage to create their research software. With the right set of tools, one can be 10x or even 100x more productive than peers who don’t know how to set up those tools.”

“But this power comes at a great cost: It takes a tremendous amount of command-line bullshittery to install, set up, and configure all of this wonderful free software. What I mean by command-line bullshittery is dealing with all of the arcane, obscure, strange bullshit of the command-line paradigm that most of these free tools are built upon….So perhaps what is more important to a researcher than programming ability is adeptness at dealing with command-line bullshittery, since that enables one to become 10x or even 100x more productive than peers by finding, installing, configuring, customizing, and remixing the appropriate pieces of free software.”

Helping my students overcome command-line bullshittery by Philip J. Guo

1.4 Installation and running command line from RStudio

1.4.1 Installation

If you have a Windows computer, you will need to follow steps in this link to install Git for Windows, which will allow you to run Bash and Git commands.

If you have a Mac, you won’t need to download anything because it already comes with a Terminal app. However, if you have a newer version of macOS, you may need to run xcode-select --install in your Terminal before you’re able to use Git commands (see here for more info).

1.4.2 Running command line commands from RStudio


In RStudio, there is a Terminal tab (next to the Console tab) where you can run Bash commands and perform Git operations:

Credit: RStudio Terminal blog post by Gary Ritchie


If you are working from an R markdown file, you can also create bash code chunks (similar to R code chunks) for running shell commands. All you need to do is indicate {bash} for the code chunk:

Try running this code chunk on your own

  • If it works, the bash/terminal program is connected to RStudio!
echo "Hello, World!"
## Hello, World!

1.5 RStudio Console vs. Terminal

What is the difference between the RStudio Console and Terminal?

  • The Console is for running R code
    • You can run a line of R code in the Console by typing it and hitting enter. Separate statements with semicolons if you want to type several on one line before running them.
    • Running R code in the Console is equivalent to running R code within an R script
  • The Terminal is for running shell commands (e.g., bash commands)
    • You can run a command in the Terminal by typing it and hitting enter. Separate commands with semicolons if you want to type several on one line before running them.
    • Running shell commands from the Terminal is equivalent to running them in your Git Bash (Windows) or Terminal app (Mac)
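
As a small illustration of the semicolon behavior described above, here is a sketch of three commands typed on a single line in the Terminal. The variable name greeting is made up for this example.

```shell
# Three commands on one line, separated by semicolons; they run left to right
greeting="Hello"; greeting="$greeting, World!"; echo "$greeting"
```

Pressing enter once runs all three commands in order, exactly as if each had been typed on its own line.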

2 Command line

In this section, we will go over some of the commonly used bash command line commands. You can run these commands either in a standalone Git Bash/Terminal application, your RStudio Terminal, or in a bash code chunk of an R markdown file.

  • Running bash commands from Git Bash/Terminal application or RStudio Terminal
    • This is how you will do things when working with git in real projects
    • current (working) directory will be your home directory
      • Home directory for Git Bash/Terminal application may differ from home directory in RStudio Terminal
    • if you change the working directory, this change will persist for the duration of your current terminal session
  • Running bash commands from code chunk of an R Markdown file
    • this approach is not how people usually run bash commands, but it is useful for initial teaching and learning about bash commands and git operations
      • if you want, you can use bash code chunks to answer questions in problem set (optional)
    • Note: even if you try to run only a single line from a bash code chunk, the chunk will run ALL lines
    • current (working) directory will be directory where the .Rmd file is saved
    • if you change the working directory within a code chunk, after code chunk finishes running the working directory will revert back to the directory where the .Rmd file is saved
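
The difference in how the working directory behaves can be sketched directly in a terminal. In the sketch below, a subshell (commands wrapped in parentheses) loosely plays the role of a bash code chunk; that analogy, the temporary directory, and the variable names are assumptions made for illustration.

```shell
# Remember where we started (hypothetical variable names)
start_dir="$(pwd)"
scratch_dir="$(mktemp -d)"   # temporary directory, so nothing in your project is touched

# A subshell ( ... ) gets its own working directory, loosely like a bash code chunk:
# the cd inside the parentheses does not affect the session that launched it
( cd "$scratch_dir" && pwd )
after_subshell="$(pwd)"      # still the starting directory

# A plain cd changes the working directory for the rest of the session
cd "$scratch_dir"
after_cd="$(pwd)"

# Go back to where we started
cd "$start_dir"
```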


Generally, you can pull up the help file for a command by running:

  • command_name --help (Windows)
  • man command_name (Mac)


We’ll use the ls command as an example:

ls --help
man ls

ls: List directory contents

  • Syntax: ls [<option(s)>] [<directory_name(s)>]
    • Square brackets [] indicate that the enclosed options and arguments are optional; you do not have to specify them
    • Options can be specified using - or -- (see help file)
      • Note: For the most part, - is the way to specify the short name version and -- is the way to specify the long name version of an option
    • We will not be listing out all the options and arguments in this lecture (only the commonly used ones), so see help file for full details
  • Options:
    • -a: Include directory entries whose names begin with a dot (.)
      • Note: Hidden files (i.e., files you don’t by default see in your Files Explorer or Finder) have names that start with a dot
    • -l: List files in long format (i.e., include additional information like permissions, owner, file size, and date last modified)
  • Arguments:
    • directory_name(s): Which directories to list the content of (default: current directory)
      • note: provide the filepath relative to the current (working) directory (an absolute filepath also works)
      • note: unlike R, you do not need to put the filepath in quotes unless it contains spaces (but you always can)
  • Equivalent R function: list.files()


Example: Using ls to list content in current directory (default)

ls
## git_and_github.Rmd
## git_and_github.html
## render_toc.R
## windows_credential_manager_screen_clip.png

Example: Using ls to list content in parent directory

ls ..
## apis_and_json
## ggplot
## git_and_github
## organizing_and_io
## programming
## strings_and_regex
  • This works too (same output as above)
ls ../
## apis_and_json
## ggplot
## git_and_github
## organizing_and_io
## programming
## strings_and_regex
  • This works too (same output as above)
ls ".."
## apis_and_json
## ggplot
## git_and_github
## organizing_and_io
## programming
## strings_and_regex

Example: Using ls -a to list content in parent directory including entries whose names begin with a dot

ls -a ..
## .
## ..
## apis_and_json
## ggplot
## git_and_github
## organizing_and_io
## programming
## strings_and_regex
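
The -l option is not shown in the examples above, so here is a minimal sketch of it, together with combining short options. The temporary directory and the file names are made up for this example, and the exact columns printed by ls -l vary somewhat between systems.

```shell
# Work in a temporary directory (the file names here are made up)
demo_dir="$(mktemp -d)"
echo "visible" > "$demo_dir/notes.txt"
echo "hidden" > "$demo_dir/.hidden_file"

# Long format: one entry per line with permissions, owner, size, and modification time
ls -l "$demo_dir"

# Short options can be combined: -la is the same as -l -a, so the dot file is listed too
ls -la "$demo_dir"
```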

2.2 Working with files

This section shows bash code to:

  • add (write) text to files (echo command)
  • print contents of files (cat command)
  • print first part of a file (head command)
  • print last part of a file (tail command)
  • copy files or directories (cp command)
  • rename or move files (mv command)



Often, we want to insert text into a file from the command line
echo: Write to standard output (i.e., print to terminal)

  • Syntax: echo <text_to_print>
  • Help:
    • Windows: help echo (note: this is different from the usual command_name --help syntax)
    • Mac: man echo
  • Arguments:
    • text_to_print: Text to print to terminal
  • What executing the echo command does:
    • if you execute echo <text_to_print>, it will simply print that text to the terminal
    • echo <text_to_print> > file_name (overwrite file)
      • The text outputted to the terminal can be redirected to a file using >
      • if file already exists, using > will overwrite contents of existing file
    • echo <text_to_print> >> file_name (append to file)
      • The text could also be appended to an existing file using >> (i.e., not overwrite existing content of file)
  • When run with the -e option in Bash, echo interprets the following “backslash-escaped” characters (will explain more fully in the strings/regex unit)
    • \a: alert (bell)
    • \b: backspace
    • \c: suppress further output
    • \e: escape character
    • \E: escape character
    • \f: form feed
    • \n: new line
    • \r: carriage return
    • \t: horizontal tab
    • \v: vertical tab
    • \\: backslash

Example: Using echo to print text to terminal
# help echo # help Windows
# man echo # help Mac
echo "Hello, World!"
## Hello, World!



cat: Concatenate and print files

  • Syntax: cat <file_name>
  • Arguments:
    • file_name: File to print to terminal

Example: Using echo and > to redirect text to file and cat to print content of file
# Redirect text to file
echo "Hello, World!" > my_script.R

# Print contents of file
cat my_script.R
## Hello, World!
# Using `>` again overwrites the contents of the file
echo "library(tidyverse)" > my_script.R

# Print contents of file
cat my_script.R
## library(tidyverse)

Example: Using echo and >> to append text to file and cat to print content of file
# Append line to R script by using `>>` (`>` would overwrite contents of file)
echo "mpg %>% head(5)" >> my_script.R

# Print contents of file
cat my_script.R
## library(tidyverse)
## mpg %>% head(5)



head: Print first part of file

  • Syntax: head [<option(s)>] [<file_name>]
  • Options:
    • -n <int>: Print the first <int> lines (default: 10)
  • Arguments:
    • file_name: File to print

tail: Print last part of file

  • Syntax: tail [<option(s)>] [<file_name>]
  • Options:
    • -n <int>: Print the last <int> lines (default: 10)
  • Arguments:
    • file_name: File to print
Example: Using head to print first part of file
# Preview first 10 lines by default (or up to 10 lines)
head my_script.R
## library(tidyverse)
## mpg %>% head(5)
# Preview first line
head -n 1 my_script.R
## library(tidyverse)

Example: Using tail to print last part of file
# Preview last 10 lines by default (or up to 10 lines)
tail my_script.R
## library(tidyverse)
## mpg %>% head(5)
# Preview last line
tail -n 1 my_script.R
## mpg %>% head(5)



cp: Copies files or directories

  • Syntax: cp [<option(s)>] <source_directory/file> <destination_directory/file>
  • Options:
    • -r: Copies directories and their contents recursively (this flag is required to copy a directory)
  • Arguments:
    • Copies the source_directory/file to destination_directory/file
Example: Using cp to copy a file
# Print contents of my_script.R
cat my_script.R
## library(tidyverse)
## mpg %>% head(5)
# Make a copy of my_script.R called my_script_copy.R inside my_folder/
cp my_script.R my_folder/my_script_copy.R

# Print contents of my_script_copy.R
cat my_folder/my_script_copy.R
## library(tidyverse)
## mpg %>% head(5)

Example: Using cp -r to copy a directory
# View contents of my_folder/
pwd
ls my_folder
## /c/Users/ozanj/Documents/rclass2/lectures/git_and_github
## my_script_copy.R
## test_script.R
# Make a copy of my_folder/ (with its contents) called my_folder_copy/
cp -r my_folder my_folder_copy

# View contents of my_folder_copy/
ls my_folder_copy
## my_script_copy.R
## test_script.R



mv: Rename or move files

  • Syntax:
    • Renaming: mv <old_directory/file> <new_directory/file>
    • Moving: mv <directory/file(s)> <destination_directory>
  • Arguments (how renaming and moving differ)
    • To rename, provide 2 arguments - the file/directory you want to rename and the name you want to change it to
    • To move, the last argument provided should be a directory and all files/directories provided before that will be moved into that directory
  • Options
    • -n: do not overwrite an existing file
Example: Using mv to rename a file or directory
# Rename file
mv my_script.R create_dataset.R
# Rename directory
mv my_folder_copy my_folder_2

Example: Using mv to move files and directories into a directory
# View contents of my_folder/
ls my_folder
## my_script_copy.R
## test_script.R
# Move file and directory into the destination directory (last arg)
mv create_dataset.R my_folder_2 my_folder

# View contents of my_folder/
ls my_folder
## create_dataset.R
## my_folder_2
## my_script_copy.R
## test_script.R
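
The -n option listed above is not used in the examples, so here is a minimal sketch of what it protects against. The temporary directory and file names are made up for this example.

```shell
# Work in a temporary directory (the file names here are made up)
demo_dir="$(mktemp -d)"
echo "draft" > "$demo_dir/draft.txt"
echo "final" > "$demo_dir/final.txt"

# Without -n, this would replace final.txt with the contents of draft.txt.
# With -n, the move is skipped because final.txt already exists.
# (|| true: some versions of mv report a skipped move as an error)
mv -n "$demo_dir/draft.txt" "$demo_dir/final.txt" || true

cat "$demo_dir/final.txt"   # final.txt still has its original contents
ls "$demo_dir"              # both files are still there
```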


3 Overview of core concepts and workflow

This section introduces some core concepts and explains the basic Git “workflow” (i.e., how Git works)

3.1 Git stores “snapshots,” not “differences”

Version control systems that save differences:

  • Prior to Git, “centralized version control systems” were the industry standard version control systems (From Getting Started - About Version Control)
    • In these systems, a central server stored all the versions of a file and “clients” (e.g., a programmer working on a project on their local computer) could “check out” files from the central server
  • These centralized version control systems stored multiple versions of a file as “differences”
    • For example, imagine you create a simple text file called twinkle.txt
    • “Version 1” (the “base” version) of twinkle.txt has the following contents:
      • twinkle, twinkle, little star
    • You make some changes to twinkle.txt and save those changes, resulting in “Version 2,” which has the following contents:
      • twinkle, twinkle, little star, how I wonder what you are!
    • When storing “Version 2” of twinkle.txt, centralized version control systems don’t store the entire file. Rather, they store the changes relative to the previous version. In our example, “Version 2” stores:
      • , how I wonder what you are!
  • The below figure portrays version control systems that store data as changes relative to the base version of each file:


Credit: Getting Started - What is Git
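
To get a feel for what a “difference” looks like, the sketch below uses the standard diff command on the two versions of twinkle.txt from the example above. The file names twinkle_v1.txt and twinkle_v2.txt and the temporary directory are assumptions for illustration. Note that diff compares files line by line, so it reports the whole changed line rather than only the appended words.

```shell
# Work in a temporary directory (the file names follow the twinkle.txt example)
demo_dir="$(mktemp -d)"
echo "twinkle, twinkle, little star" > "$demo_dir/twinkle_v1.txt"
echo "twinkle, twinkle, little star, how I wonder what you are!" > "$demo_dir/twinkle_v2.txt"

# diff prints only the lines that differ between the two versions.
# It exits with status 1 when the files differ, hence the || true.
changes="$(diff "$demo_dir/twinkle_v1.txt" "$demo_dir/twinkle_v2.txt" || true)"
echo "$changes"
```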


Git stores data as snapshots rather than differences:

  • Git doesn’t think of data as differences relative to the base version of each file
  • Rather, Git thinks of data as “a series of snapshots of a miniature filesystem” or, said differently, a series of snapshots of all files in the repository
  • “With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.”
  • “To be efficient, if files have not changed, Git doesn’t store the file again, just a link to the previous identical file it has already stored.”
  • For files that have changed:
    • the “commit” stores a new copy of the changed file’s full contents (a snapshot of the file, not a list of changed lines)
    • to save space, Git later compresses similar versions of a file together behind the scenes, so keeping full snapshots remains efficient
  • The below figure portrays storing data as a stream of snapshots over time:


Credit: Getting Started - What is Git


What is a commit?

  • A commit is a snapshot of all files in the repository at a particular time
  • Example: Imagine you are working on a project (repository) that contains a dozen files
    • You change two files and make a commit
    • Git takes a snapshot of the full repository (all files)
    • Files that remain unchanged relative to the previous commit are not stored again; the new commit simply links to the identical files that were already stored
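
The snapshot idea can be checked with a small experiment. The sketch below is run in a throwaway repository; the file names, commit messages, and the placeholder identity are all made up for illustration. It makes two commits and then compares the internal object IDs Git assigns to each file.

```shell
# Work in a temporary repository; file names and the identity below are placeholders
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First commit: two files
echo "flour, sugar, chocolate chips" > cookies.txt
echo "project notes" > notes.txt
git add cookies.txt notes.txt
git commit -q -m "Add cookies and notes"

# Second commit: only cookies.txt changes
echo "bake 9 minutes for soft and chewy cookies" >> cookies.txt
git add cookies.txt
git commit -q -m "Add alternative baking time"

# Each commit lists every file in the repository, not just the files that changed
git ls-tree --name-only HEAD

# The object ID of the unchanged file is identical in both commits (stored once, linked twice);
# the object ID of the changed file differs
notes_before="$(git rev-parse HEAD~1:notes.txt)"
notes_after="$(git rev-parse HEAD:notes.txt)"
cookies_before="$(git rev-parse HEAD~1:cookies.txt)"
cookies_after="$(git rev-parse HEAD:cookies.txt)"
echo "notes.txt:   $notes_before -> $notes_after"
echo "cookies.txt: $cookies_before -> $cookies_after"
```

Note that the sketch leaves your terminal inside the temporary directory; use cd to return to your project afterwards.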

3.2 Three components of a Git project

Credit: Lucas Maurer, medium.com

  • Local working directory (also called “working tree”)
    • This is the area where all your work happens! You are writing Rmd files, debugging R scripts, adding and deleting files
    • These changes are made on your local machine!
  • Git index/staging area (git add <filename(s)> command)
    • The staging area is the area between your local working directory and the repository, where you list changes you have made in the local working directory that you would like to commit to the repository
    • Hypothetical work flow (imagine you are working on document cookies.txt):
      • Make changes to cookies.txt in a text editor. These are changes made in your local working directory.
      • Imagine you are happy with some changes you made to cookies.txt and you want to commit those changes to your repository
      • Before you commit changes to repository, you must add them to the staging area as an intermediary step
  • Repository (git commit command)
    • This is the actual repository where Git permanently stores the changes you’ve made in the local working directory and added to the staging area
    • Hypothetical work flow to cookies.txt:
      • Add changes from local working directory to staging area
      • Commit changes from staging area to repository
    • Each commit to the repository is a snapshot of the repository at a particular time, so each commit that changes cookies.txt records a different version of that file
    • Commits are made to branches in the repo
      • By default, a git repository comes with one main branch (typically called main)
      • But we can also create other branches (discussed more later)
    • Local vs. remote repository
      • When you add a change to the staging area and then commit the change to your repository, this changes your local repository (i.e., on your computer) rather than your remote repository (i.e., on GitHub)
      • If you want to change the remote repository (typically named origin), you must push the change from your local repository to your remote repository
      • As seen below, each circle represents a commit. After you make commits on a branch in your local repository (e.g., main), you need to push them in order for the corresponding branch on the remote repository (e.g., origin/main) to be up-to-date with your changes.

Credit: Modified from Atlassian, Git push
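
The movement of a change from the working directory to the staging area to the repository can be watched with git status --short. The sketch below uses a throwaway repository; the file names and the placeholder identity are assumptions for illustration. Two files are created, but only one is staged and committed.

```shell
# Work in a temporary repository; file names and the identity below are placeholders
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Working directory: create two files; Git sees them but does not track them yet
echo "flour, sugar, chocolate chips" > cookies.txt
echo "ideas for later" > scratch.txt
git status --short            # both files are listed as untracked (??)

# Staging area: choose only cookies.txt for the next commit
git add cookies.txt
status_staged="$(git status --short)"
echo "$status_staged"         # cookies.txt is staged (A); scratch.txt is still untracked

# Repository: the commit records what was staged, and nothing else
git commit -q -m "Add cookie recipe"
status_committed="$(git status --short)"
echo "$status_committed"      # only scratch.txt remains, still untracked
committed_files="$(git ls-tree --name-only HEAD)"
```

The staging area is what lets you commit cookies.txt while leaving scratch.txt out of the snapshot.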

3.3 Git/GitHub workflow


Credit: Simon Maple, JRebel, https://www.jrebel.com/blog/git-cheat-sheet


Git commands:

  • add: Add file from working directory to staging area
  • commit: Commit file from staging area to local repository
  • push: Send files from local repository (your machine) to remote repository
    • Synchronizes local repository and remote repository
    • Think of push as “uploading”
  • fetch: Get files from remote repository and put them in local repository
  • pull: Get files from remote repository and put them in the working directory
    • Think of pull as “downloading”
    • pull is effectively fetch followed by merge (discussed later)
  • reset: After you add files from working directory to staging area, reset unstages those files
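
The difference between fetch and pull is easier to see with two collaborators. The sketch below is a simplified setup, not how you will work with GitHub: a “bare” repository on your own disk stands in for the remote, and the directory names clone_a and clone_b, the file contents, and the placeholder identity are all made up for illustration.

```shell
# Placeholder identity so commits work even if Git is not configured on this machine
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

demo_dir="$(mktemp -d)"
cd "$demo_dir"

# A bare repository on disk stands in for the remote (e.g., GitHub)
git init -q --bare remote.git

# Collaborator A clones the (empty) remote, commits, and pushes
# (Git warns that the cloned repository is empty; that is expected here)
git clone -q remote.git clone_a
cd clone_a
echo "flour, sugar, chocolate chips" > cookies.txt
git add cookies.txt
git commit -q -m "Add cookie recipe"
git push -q -u origin HEAD
cd ..

# Collaborator B clones the remote and gets the first commit
git clone -q remote.git clone_b

# A pushes a second commit
cd clone_a
echo "bake 9 minutes for soft and chewy cookies" >> cookies.txt
git commit -q -a -m "Add alternative baking time"   # -a stages changes to tracked files, then commits
git push -q
cd ../clone_b

# fetch: B's local repository learns about the new commit, but B's files do not change
git fetch -q
lines_after_fetch="$(wc -l < cookies.txt | tr -d ' ')"

# pull: fetch plus merge, so B's working directory now has the new line
git pull -q
lines_after_pull="$(wc -l < cookies.txt | tr -d ' ')"
echo "lines after fetch: $lines_after_fetch; lines after pull: $lines_after_pull"
```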

3.4 Basic git commands

Git command cheatsheets:

When performing git operations on command line, all commands begin with git, for example:

  • git init
  • git clone url_of_remote_repository
  • git status

For an overview of git command syntax and a list of common git commands, type this in command line:

git --help

To see the help file for a particular git command (e.g., add, commit, clone), type git command_name --help. For example:

git add --help

# or this:
# git help add

Basic/essential git commands:

  • Create a repository
    • git init
      • “Initializes a brand new Git repository and begins tracking an existing directory. It adds a hidden subfolder [named .git/] within the existing directory that houses the internal data structure required for version control” (Git Handbook)
    • git clone url_of_remote_repository
      • “Creates a local copy of a project that already exists remotely. The clone includes all the project’s files, history, and branches” (Git Handbook)
  • Make a change
    • git add file_name(s)
      • Add file(s) from local working directory to staging area/index
      • Note: You must “stage” changes to a file before you commit them to your local repository
    • git commit -m "commit message"
      • All changes to files that have been staged [previous step] are committed to the local repository
      • Each commit is a snapshot of all files in your repository
      • Note: -m is an option to the git commit command, which specifies that you will add a brief description about changes you are committing. You can reference an issue in the commit message by using a hashtag followed by the issue number: #<issue_number>. These commits will appear on the issue page.
  • Observe your repository
    • git status
      • “Shows the status of changes as untracked, modified, or staged” (Git Handbook)
  • Synchronize with remote repository
    • git push
      • “Updates the remote repository with any commits made locally” (Git Handbook)
    • git pull
      • Updates the local repository with any commits from the remote repository
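
Putting the local commands above together, a first session might look like the sketch below. It runs in a throwaway directory, and the file name, commit message, and placeholder identity are assumptions for illustration. git log and git rev-list are not covered above; they are used here only to inspect the commit history.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Create a repository: git init adds the hidden .git/ subfolder
git init -q
ls -a                         # .git appears alongside . and ..
git config user.name "Example User"     # placeholder identity for this repository only
git config user.email "user@example.com"

# Make a change: create a file, stage it, commit it
echo "library(tidyverse)" > create_dataset.R
git status                    # create_dataset.R is reported as untracked
git add create_dataset.R
git status                    # create_dataset.R is reported as a change to be committed
git commit -m "Add script that loads tidyverse"

# Observe the repository: nothing left to commit, one commit in the history
git status
commit_count="$(git rev-list --count HEAD)"
git log --oneline
```

git push and git pull are left out of this sketch because they need a remote repository to talk to.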

4 Getting started: Git repository

4.1 Local and remote repositories

What are local and remote repositories?

  • Local vs. remote git repository:
    • Local git repository: Git repository for a project stored on your machine
    • Remote git repository (often called “origin”): Git repository for a project stored on the internet (e.g., GitHub)
  • Typically, a local git repository is connected to a remote git repository for collaboration
    • Everyone working on the project will have a local copy of the shared remote repository
    • You’ll all be making changes to your local repository and pushing them to the remote, as well as pulling other people’s changes from the remote to your local copy
    • That way, everyone’s project repository will be in sync and up-to-date with everybody else’s
  • A remote repository is identified by its URL, which can be used to connect your local repository
  • There are 2 types of URL: HTTPS and SSH
    • HTTPS and SSH are two different protocols for connecting to the remote, and each verifies your identity in its own way (a token for HTTPS, a key pair for SSH)
    • HTTPS
      • “Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext Transfer Protocol (HTTP). It uses cryptography for secure communication over a computer network, and is widely used on the Internet” (link)
    • SSH
      • “An SSH key is a secure access credential used in the Secure Shell (SSH) protocol” (link)
    • If you haven’t set up SSH (you probably haven’t), then choose HTTPS
  • HTTPS tokens
    • If you are using HTTPS, you must first create a personal access token on GitHub following these instructions
      • “Once you have a token, you can enter it instead of your password when performing Git operations over HTTPS.” (GitHub Docs)
      • Example: On the command line, when prompted for the password, enter your token rather than your GitHub password:
        $ git clone https://github.com/username/repo.git
        Username: your_github_username
        Password: your_token


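There are 2 basic ways to get your local repository set up with a remote:

  • Clone a remote repository from GitHub to your local machine (section 4.2)
  • Create a new git repository on your local machine and add it to GitHub (section 4.3)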


git remote: Show list of connected remote repositories

  • Help: git remote --help
  • Syntax: git remote [<option(s)>]
  • Options:
    • -v: Show more detailed info about the remotes, including their URLs


Understanding how local and remote repositories are connected:

  • We can use git remote to check which remote repository is connected (i.e., which remote(s) you can push to and pull from)
    • By convention, the remote repository is named origin, but you could call it anything
    • When you clone a repository, it will by default be given the name origin, but you could change it afterwards if you wanted to
    • When you add a remote, you could name it anything
  • Each local branch can be set to track a remote branch (e.g., your local main branch tracks the remote main branch)
    • The remote branch that you are tracking is known as the upstream branch
    • Once the upstream branch is set for a local branch, git will know where to push to and pull from
    • When you clone a repository, your local branch will automatically be set to track the corresponding remote branch
    • When you push a new local branch to the remote, you will need to set the upstream branch the very first time you push
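
The naming and tracking rules above can be checked offline. The following is a rough sketch, assuming Git is installed; a local "bare" repository (remote.git) stands in for the remote on GitHub, and the directory names, the remote name my_remote, and the placeholder identity are hypothetical.

```shell
work_dir=$(mktemp -d)
cd "$work_dir"

# A bare repository stands in for the remote on GitHub (assumption: no network needed)
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/main

# Local repository with one commit on main
git init -q local
cd local
git config user.name "Example User"       # placeholder identity (assumption)
git config user.email "user@example.com"
echo "hello" > README.txt
git add README.txt
git commit -q -m "initial commit"
git branch -M main

# The remote can be given any name; here it is called my_remote
git remote add my_remote "$work_dir/remote.git"

# First push of a new branch: set the upstream branch
git push -q --set-upstream my_remote main
upstream=$(git rev-parse --abbrev-ref --symbolic-full-name "@{u}")
echo "upstream of main: $upstream"

# Renaming the remote afterwards keeps the tracking relationship
git remote rename my_remote origin
upstream_after=$(git rev-parse --abbrev-ref --symbolic-full-name "@{u}")
echo "after rename:     $upstream_after"
```

Here @{u} is Git shorthand for "the upstream of the current branch", so the two echo lines show which remote branch main is tracking before and after the rename.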

4.2 Clone a remote repository from GitHub to your local machine

This is usually the easiest way to get a local repository set up with a remote repository

Step 1: Obtain the URL of the remote repository on GitHub:

  • In your browser, navigate to the repository on GitHub
    • This can be an already existing repository (e.g., rclass2 repo) or a new repository you create
    • To create a new repository, navigate to GitHub and make sure to check one or more of the Initialize this repository with options before creating
      • If you need to change the default branch that your new repo will have, see here
  • Click on the green Code button
  • Copy the repository URL (either HTTPS or SSH) to your clipboard (use the HTTPS URL if you don’t have SSH set up)

Step 2: Clone the repository to your local machine:

  • In your Terminal/Git Bash, change directory into where you want to clone the repository
    • Note that you do not need to create a new folder for this repository. Cloning it will create a folder for you that contains the contents of the repository.
  • Use the git clone command to clone the repository to your local machine
  • Now that you have a local copy of the repo, you can start making changes locally then push to the remote
    • For example, create/change one or more file(s)
    • git add changes to file(s) from the local working directory to the staging area
    • git commit -m "commit message" all staged changes to the local repository
    • git push to push changes from your local repository to the remote repository

Credit: W3 docs, Git clone


git clone: Clone a repository into a new directory

  • Help: git clone --help
  • Syntax: git clone <repo_url>
    • The repo_url can be the HTTPS or SSH URL
  • Result:
    • A new directory will be created that contains the cloned repository
    • The remote repository will be given the default name of origin
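    • A local branch is created for the remote’s default branch (e.g., main) and set to track it; the remote’s other branches are available as remote-tracking branches (e.g., origin/<branch_name>)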
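
The results listed above can be verified without network access. This is a minimal sketch, assuming Git is installed; a small local bare repository (remote.git) is built first as a stand-in for a GitHub URL, and the names seed, remote.git, my_clone, and the placeholder identity are hypothetical. The optional second argument to git clone chooses the name of the new directory.

```shell
work_dir=$(mktemp -d)
cd "$work_dir"

# Build a small stand-in for a remote repository (assumption: replaces a GitHub URL)
git init -q seed
git -C seed config user.name "Example User"        # placeholder identity
git -C seed config user.email "user@example.com"
echo "hello" > seed/README.txt
git -C seed add README.txt
git -C seed commit -q -m "initial commit"
git -C seed branch -M main
git clone -q --bare seed remote.git

# Clone it: the target directory is created for you
git clone -q "$work_dir/remote.git" my_clone
cd my_clone
git config user.name "Example User"
git config user.email "user@example.com"

remote_name=$(git remote)                                           # name given by clone
upstream=$(git rev-parse --abbrev-ref --symbolic-full-name "@{u}")  # branch tracked by main
echo "remote: $remote_name, upstream: $upstream"

# Because the upstream is already set, a plain `git push` is enough
echo "more text" >> README.txt
git add README.txt
git commit -q -m "update README"
git push -q
local_head=$(git rev-parse HEAD)
remote_head=$(git -C "$work_dir/remote.git" rev-parse main)
echo "local:  $local_head"
echo "remote: $remote_head"
```

If the push worked, the two hashes printed at the end are identical, meaning the remote's main branch now points at the commit made locally.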

Example: Using git clone to clone a repository
  • The repository downloadipeds, created by Ben Skinner, contains a script to “batch download” files from the Integrated Postsecondary Data System (IPEDS), which contains data on U.S. colleges and universities
  • Copy the repository URL and use it to clone the repository to your local machine
    • HTTPS URL will be: https://github.com/btskinner/downloadipeds.git
    • SSH URL will be: git@github.com:btskinner/downloadipeds.git
cd ~  # change to home directory
rm -rf downloadipeds  # force remove `downloadipeds` (if it exists)

# Change directory to where you want to clone the repository
cd ~

# This will be the directory where the `downloadipeds` repository will be cloned
# Note that you do not need to create a `downloadipeds` sub-directory yourself
pwd
## /c/Users/ozanj/Documents

# Clone the remote repository
git clone https://github.com/btskinner/downloadipeds.git  # HTTPS URL
# git clone git@github.com:btskinner/downloadipeds.git  # SSH URL
## Cloning into 'downloadipeds'...
# Change directory to the newly cloned `downloadipeds`
cd downloadipeds
pwd
## /c/Users/ozanj/Documents/downloadipeds

# List out contents of repository
ls -la
## total 93
## drwxr-xr-x 1 ozanj None     0 Feb  9 15:45 .
## drwxr-xr-x 1 ozanj None     0 Feb  9 15:45 ..
## drwxr-xr-x 1 ozanj None     0 Feb  9 15:45 .git
## -rw-r--r-- 1 ozanj None    22 Feb  9 15:45 .gitignore
## -rw-r--r-- 1 ozanj None  1094 Feb  9 15:45 LICENSE
## -rw-r--r-- 1 ozanj None  4682 Feb  9 15:45 README.md
## -rw-r--r-- 1 ozanj None  6028 Feb  9 15:45 downloadipeds.R
## -rw-r--r-- 1 ozanj None 13876 Feb  9 15:45 ipeds_file_list.txt
# List out the connected remote, which is named `origin` by default
git remote
## origin
# Display more details about the remote, including the repository URL
git remote -v
## origin   https://github.com/btskinner/downloadipeds.git (fetch)
## origin   https://github.com/btskinner/downloadipeds.git (push)


4.3 Create new git repository on your local machine and add to GitHub

Alternatively, you can create a new git repository on your local machine, and then connect it to the remote on GitHub, in three steps.

Step 1: Create a local git repository:

  • In your Terminal/Git Bash, create a new directory or change into an existing directory that you want to turn into a git repository
  • Turn this directory into a git repository using git init
  • In this directory, you can use any git commands and start tracking files
    • For example, create/change one or more file(s)
    • git add changes to file(s) from the local working directory to the staging area
    • git commit -m "commit message" all staged changes to the local repository

Step 2: Create a remote repository on GitHub:

  • Create a new repository on GitHub
    • I usually give this repository the same name as local repository (above)
    • If you need to change the default branch that your new repo will have, see here
  • Do not check any of the Initialize this repository with options
  • After creation, you will be able to see the HTTPS/SSH URL of your new repository. Save this URL for later.

Step 3: Connect your local repository to the remote:

  • In your Terminal/Git Bash, use git remote add to add a new remote for your local repository
    • This will allow you to start pushing to and pulling from the remote repository
  • The very first time you push to the remote, you’ll need to use the --set-upstream option with the git push command
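    • New repositories created on GitHub use main as the default branch; a repository created locally with git init may instead start on master, depending on your Git version and configuration, which is why the examples below rename the branch with git branch -M main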
    • If you are pushing a new local branch to the remote for the first time, you need to set the upstream branch so Git knows which remote branch to track
    • For example, we’ll want to set our local main branch to track the remote repository’s main branch

Credit: Java T Point, Git Push


New git commands we will use in examples below
git remote: Add or modify a remote repository

  • Help: git remote --help
  • Syntax:
    • git remote add <remote_name> <remote_url>: Add a new remote
      • remote_name: Name we choose to call our remote repository, conventionally origin
      • remote_url: HTTPS/SSH URL of remote repository
    • git remote set-url <remote_name> <remote_url>: Update the URL for the specified remote
      • remote_name: Name of the remote we want to update URL for
      • remote_url: HTTPS/SSH URL we want to update to

git push: Set and push to upstream branch

  • Help: git push --help
  • Syntax: git push --set-upstream <remote_name> <branch_name>
    • remote_name: Name of the remote repository to push to
    • branch_name: Name of the remote branch you want your current branch to track
  • Result:
    • Your current branch will be set to track the specified remote repository’s branch
    • This will only need to be run the first time you push a new local branch to the remote. All subsequent pushes can just be git push.

Example: Full sample workflow
# CREATING AND CHANGING DIRECTORIES

  cd ~ # change directories to home directory
  
  #cd documents # change to "documents" [if necessary]
  
  ls # list files in directory
  
  # make new directory that will be our git repository
    # rm -rf gitr_practice # remove if it exists
  mkdir gitr_practice
  
  cd gitr_practice # move to new directory
  
  ls -a # show all files in directory

# INITIALIZING GIT REPOSITORY

  # turn the current, empty directory into a fresh Git repository
  git init
  
  ls -a # show all files in directory
  
# CHANGING FILES IN WORKING DIRECTORY
  
  # create a new README file with some sample text
  echo "Hello. I thought we would be learning R this quarter" >> README.txt
  
  # view the file README.txt
  cat README.txt
  
  # create a simple R script
  echo "library(tidyverse)" >> simple_script.r
  echo "mpg %>% head(5)" >> simple_script.r # add another line to simple_script.r
  
  cat simple_script.r # show contents of file simple_script.r

# STAGE AND COMMIT FILES TO LOCAL REPOSITORY

  # check status of git repository
  git status 
  
  # add README.txt from working directory to staging area (will now become a file that is "tracked" by git)
  git add README.txt
  
  # add simple_script.r from working directory to staging area (will now become a file that is "tracked" by git)
  git add simple_script.r
  
  # check status
  git status
  
  # commit changes to local repository
  git commit -m "Initial commit, README.txt simple_script.r"
  
  git status
  
# CONNECT AND PUSH TO REMOTE REPOSITORY
  # rename the current branch to main
  git branch -M main
  
  # provide the path for the repository you created on GitHub in the first step
  #git remote add origin https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git
  git remote add origin https://github.com/ozanj/gitr_practice.git

  # push changes to GitHub
  git push --set-upstream origin main

Example: Using git remote to add a remote
cd ~  # change to home directory
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`

# Initialize a new git repository in `my_git_repo` directory
cd my_git_repo
git init

# next create a repo on github.com and name it my_git_repo
  # don't have to give it this name, but I find it less confusing
  
# Add remote and name it `origin`
  # paste the url you obtain from github
git remote add origin https://github.com/ozanj/my_git_repo.git

# Check remote
git remote -v
## Initialized empty Git repository in C:/Users/ozanj/Documents/my_git_repo/.git/
## origin   https://github.com/ozanj/my_git_repo.git (fetch)
## origin   https://github.com/ozanj/my_git_repo.git (push)


Note that we could’ve named the remote repository anything - it doesn’t have to be origin:

# Add remote (https://github.com/anyone-can-cook/my_git_repo) and name it `my_remote`
git remote add my_remote https://github.com/anyone-can-cook/my_git_repo.git

# Check remote
git remote -v
## my_remote    https://github.com/anyone-can-cook/my_git_repo.git (fetch)
## my_remote    https://github.com/anyone-can-cook/my_git_repo.git (push)

Example: Using git remote to update URL for a remote
# Check remote
git remote -v
## my_remote    https://github.com/anyone-can-cook/my_git_repo.git (fetch)
## my_remote    https://github.com/anyone-can-cook/my_git_repo.git (push)
# Change the URL for the remote named `my_remote`
git remote set-url my_remote https://github.com/anyone-can-cook/my_git_repo_2.git
# Check remote
git remote -v
## my_remote    https://github.com/anyone-can-cook/my_git_repo_2.git (fetch)
## my_remote    https://github.com/anyone-can-cook/my_git_repo_2.git (push)

Example: Using git push to push a new branch
cd ~/my_git_repo

# Create new R script
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(5)" >> create_dataset.R

# Add R script and make a commit
git add create_dataset.R
git commit -m "initial commit"
git branch -M main
# Because this is a new local branch, we get an error if we just use `git push` on the initial push
git push
## fatal: The current branch main has no upstream branch.
## To push the current branch and set the remote as upstream, use
## 
##     git push --set-upstream my_remote main


As hinted in the error message, we need to use the --set-upstream option to set upstream branch on the initial push for a new local branch:

# Recall that we are connected to a remote repository we named `my_remote`
git remote -v
## my_remote    https://github.com/anyone-can-cook/my_git_repo_2.git (fetch)
## my_remote    https://github.com/anyone-can-cook/my_git_repo_2.git (push)
# We can check status to see that we are currently on the `main` branch
# (Note that because we have yet to set an upstream branch,
# it does not say our main branch is ahead of remote by 1 commit)
git status
## On branch main
## nothing to commit, working tree clean
# Use the `--set-upstream` option with the remote and branch names to push new local branch
git push --set-upstream my_remote main
## To https://github.com/anyone-can-cook/my_git_repo_2.git
##  * [new branch]      main -> main
## Branch main set up to track remote branch main from my_remote.
# Check status
# (Now that we have set the upstream branch, 
# it says our main branch is up-to-date with the remote's main branch)
git status
## On branch main
## Your branch is up-to-date with 'my_remote/main'.
## 
## nothing to commit, working tree clean

5 Git commands: Observing your repository

Once a directory is initialized as a git repository, you can choose to track the changes to any file in the directory:

  • All files start off as untracked until they are added (i.e., using git add)
  • Once a file is being tracked, you’ll be able to monitor the changes being made to those files as well as the history of changes
  • As described in more detail in the next section, git status can be used to check which files are tracked and which are not. Untracked files, except those listed in your .gitignore file, will be listed under Untracked files.


What is a .gitignore file? (see below for more details)

  • It is a special file that tells Git what files in the repository to ignore, or not track
  • These files will no longer be listed under Untracked files when you check git status
  • You can either create a .gitignore file yourself or click Add .gitignore when you are creating a new repository on GitHub and select the R template from the dropdown menu:

Credit: How to Make Git Forget Tracked Files Now In gitignore
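
To see the effect of a .gitignore file, here is a small sketch, assuming Git is installed. The file names analysis.R and scratch.log and the pattern *.log are made up for illustration. git status --short is a compact form of git status in which ?? marks an untracked file, and git check-ignore reports whether a path matches an ignore rule.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init -q

# Two new files: one we want to track, one we do not
echo "library(tidyverse)" > analysis.R
echo "temporary output" > scratch.log

before=$(git status --short)      # both files are listed as untracked ("??")

# Tell Git to ignore every file ending in .log
echo "*.log" > .gitignore

after=$(git status --short)       # scratch.log no longer appears
echo "before:"; echo "$before"
echo "after:";  echo "$after"

# git check-ignore prints the path when it matches an ignore rule
ignored=$(git check-ignore scratch.log)
echo "ignored: $ignored"
```

Note that the .gitignore file itself shows up as untracked in the second listing; it is usually added and committed like any other file so that collaborators share the same rules.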


Below are some common git commands you might use to observe your repository:

5.1 git status

git status: Shows the working tree status

  • Help: git status --help
  • Syntax: git status [<option(s)>]
    • Commonly used without any options, but see help file for possible options
  • Output:
    • Information about the branch (e.g., which branch you are on, its status relative to the remote branch)
    • Changes to be committed
      • List of files that have been added to the staging area using git add
      • These can be committed using git commit
      • The filenames will be in green
    • Changes not staged for commit
      • List of tracked files (i.e., files that have been added using git add before) that have since been changed (e.g., modified, deleted) in the working directory
      • These can be added to the staging area using git add
      • The filenames will be in red
    • Untracked files
      • List of untracked files (i.e., new files that have never been added using git add before)
      • These can be added to the staging area using git add
      • The filenames will be in red

Below is a sample output of git status:

On branch main
Your branch is up-to-date with 'origin/main'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    new file:   clean_dataset.R

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   create_dataset.R

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    analyze_dataset.R
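
The three sections of the sample output above can be reproduced in a scratch repository. This is a sketch under stated assumptions: Git is installed, the repository lives in a temporary directory, and the identity is a placeholder. It uses the --short option, one of the options described in the help file, which prints one line per file with a two-character code instead of the three headed sections.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init -q
git config user.name "Example User"          # placeholder identity (assumption)
git config user.email "user@example.com"

# One committed file, so that it counts as tracked
echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "add create_dataset.R"

echo "mpg %>% head(5)" >> create_dataset.R    # modify a tracked file (not staged)
echo "# cleaning steps" > clean_dataset.R
git add clean_dataset.R                       # stage a brand new file
echo "# analysis steps" > analyze_dataset.R   # leave this one untracked

short_status=$(git status --short)
echo "$short_status"
```

In the short format, a letter in the first column (A for clean_dataset.R) corresponds to Changes to be committed, a letter in the second column (M for create_dataset.R) corresponds to Changes not staged for commit, and ?? (analyze_dataset.R) corresponds to Untracked files.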

Example: Checking git status after creating a new file
  • Imagine you have created a new file called create_dataset.R in your git repository
  • You will initially see the file listed under Untracked files
# Create new R script
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(5)" >> create_dataset.R

git status
On branch main
Your branch is up-to-date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    create_dataset.R

nothing added to commit but untracked files present (use "git add" to track)

Example: Checking git status after adding a file
  • After adding create_dataset.R, you will see it listed under Changes to be committed
# Add R script
git add create_dataset.R

git status
On branch main
Your branch is up-to-date with 'origin/main'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    new file:   create_dataset.R

Example: Checking git status after making a commit
  • After making a commit, you will notice that the committed file(s) are no longer listed
  • If your local repository is connected with a remote, you’ll also see that it says your branch is ahead of the remote by 1 commit
# Make a commit
git commit -m "add create_dataset.R"

git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean

Example: Checking git status after modifying a tracked file
  • If you make further modifications to a file that’s being tracked (i.e., a file that’s been added before), you will see it listed under Changes not staged for commit (as compared to under Untracked files when it’s never been tracked before)
# Modify create_dataset.R
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R

git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   create_dataset.R

no changes added to commit (use "git add" and/or "git commit -a")


5.2 git log

git log: Show commit logs

  • Help: git log --help
  • Syntax: git log [<option(s)>]
    • -n <int>: Show the latest <int> commits
  • Output: List of commits in reverse chronological order (i.e., newest first)
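    • commit <commit_hash>: Each commit can be uniquely identified by its hash ID (SHA-1)
      • Note: A short prefix of the hash is usually enough to identify a commit, as long as no other object in the repository starts with the same characters (the examples in this lecture use the first 7 characters)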
    • Author: <username> <email>: Username and email of the author of the commit
    • Date: <commit_date>: Date of the commit
    • <commit_message>: Commit message
  • Note: If the list of commits is long, you will be able to use your up and down arrow keys to scroll through the log. After you are done viewing, you can hit q to exit this read mode.

Below is a sample output of git log:

commit 2e525e4b1c40f6cffb78438285a00cd7eed54ae0 (HEAD -> main)
Author: username <email@example.com>
Date:   Thu Apr 2 23:53:30 2020 -0700

    second commit

commit 8c20a14b99d7a490580045176287b979c93d9cb5
Author: username <email@example.com>
Date:   Wed Apr 1 22:49:52 2020 -0700

    initial commit
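
A log like the one above can be generated in a scratch repository. The sketch below assumes Git is installed and uses a temporary directory and a placeholder identity; besides -n, it uses two other git log options, --oneline (one line per commit) and --format=%s (print only the first line of the commit message).

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init -q
git config user.name "Example User"          # placeholder identity (assumption)
git config user.email "user@example.com"

# Make two commits
echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "initial commit"
echo "mpg %>% head(5)" >> create_dataset.R
git add create_dataset.R
git commit -q -m "second commit"

git log -n 1          # full entry for the most recent commit only
git log --oneline     # one line per commit: short hash + message, newest first

latest_message=$(git log -n 1 --format=%s)   # message of the newest commit
commit_count=$(git log --oneline | wc -l)    # number of commits in the log
echo "latest: $latest_message, total: $commit_count"
```

The hashes and dates you see will differ from the sample output, since they depend on the author, the time, and the contents of each commit.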


5.3 git diff

git diff: Show changes between files, commits, etc.

  • Help: git diff --help
  • Syntax:
    • git diff [<file_name(s)>]: Show changes made to unstaged files in working directory compared to the “index”
      • In other words, these are the changes that would get added to the staging area if you git add them
      • This only applies to tracked files (i.e., files listed under Changes not staged for commit when you check git status), since untracked files have no history in the “index” to compare against
      • If no file_name(s) specified, git diff shows changes made to all tracked, unstaged files
    • git diff --cached [<file_name(s)>]: Show changes made to added files in staging area compared to the last commit
      • In other words, these are the changes that would be committed if you run git commit command
      • If no file_name(s) specified, git diff --cached shows changes made to all staged files (i.e., files listed under Changes to be committed when you check git status)
      • If this is the initial commit, then all staged changes are shown
    • git diff <commit_hash> <commit_hash> [<file_name(s)>]: Show changes between the two specified commits
      • If no file_name(s) specified, git diff <commit_hash> <commit_hash> shows changes between all files
  • Output: Comparison results for each file being checked by git diff
    • Each output starts with diff --git a/<file_name> b/<file_name>, which indicates that two versions of file_name are being compared
    • This is followed by some information about whether the versions are previously tracked by Git (indicated by index) or if a new file is involved (as in the case of git diff --cached for an untracked, staged file – see second example below)
    • The line-by-line comparison of the file begins after the part in the output that starts with @@
      • A - in front of a line indicates that the line has been removed in b/<file_name> as compared to a/<file_name>
      • A + in front of a line indicates that the line has been added in b/<file_name> as compared to a/<file_name>

Below is a sample output of git diff:

diff --git a/create_dataset.R b/create_dataset.R
index c1cff38..5ea84e9 100644
--- a/create_dataset.R
+++ b/create_dataset.R
@@ -1,2 +1,2 @@
 library(tidyverse)
-mpg %>% head(5)
+mpg %>% filter(year == 2008)

Example: Checking git diff for an untracked file
  • Imagine you have created a new file called create_dataset.R in your git repository
  • Because this file has never been added to staging area/“index” before, you will not see any output from git diff
# Create new R script
echo "library(tidyverse)" > create_dataset.R

git diff  # No output

Example: Checking git diff for a staged file
  • After staging create_dataset.R, it will be added to the “index”
  • git diff --cached can be used to view all staged changes
# Add R script
git add create_dataset.R

git diff --cached
diff --git a/create_dataset.R b/create_dataset.R
new file mode 100644
index 0000000..8b151a2
--- /dev/null
+++ b/create_dataset.R
@@ -0,0 +1 @@
+library(tidyverse)

Example: Checking git diff for a modified, tracked file
  • If you make further modifications to a file that’s being tracked (i.e., a file that’s been added before), you can use git diff to see changes between the versions in the working directory and the staging area
# Modify create_dataset.R
echo "mpg %>% head(5)" >> create_dataset.R

git diff
diff --git a/create_dataset.R b/create_dataset.R
index 8b151a2..c1cff38 100644
--- a/create_dataset.R
+++ b/create_dataset.R
@@ -1 +1,2 @@
 library(tidyverse)
+mpg %>% head(5)

Example: Checking git diff after committing changes
  • Suppose you commit the staged changes (i.e., the line library(tidyverse) in create_dataset.R)
  • Note that the output of git diff (i.e., comparing changes between the working directory and “index”) is the same as the previous example, when the changes were just staged and not yet committed
# Make a commit
git commit -m "add 1st line to create_dataset.R"

git diff
diff --git a/create_dataset.R b/create_dataset.R
index 8b151a2..c1cff38 100644
--- a/create_dataset.R
+++ b/create_dataset.R
@@ -1 +1,2 @@
 library(tidyverse)
+mpg %>% head(5)

Example: Checking git diff between commits
  • Now suppose we add the new changes made to create_dataset.R in the working directory (i.e., the line mpg %>% head(5)) and make a second commit
# Add create_dataset.R and make a commit
git add create_dataset.R
git commit -m "add 2nd line to create_dataset.R"

git log
commit aa89efba9adddf8547b3743ba81a421dd2a28881 (HEAD -> main)
Author: cyouh95 <25449416+cyouh95@users.noreply.github.com>
Date:   Sat Apr 4 03:20:15 2020 -0700

    add 2nd line to create_dataset.R

commit d5c6e0958fb173af04f7e2c5d5fd81457e8ffd0c
Author: cyouh95 <25449416+cyouh95@users.noreply.github.com>
Date:   Sat Apr 4 03:11:38 2020 -0700

    add 1st line to create_dataset.R
  • We can use git diff to check the differences between the two commits by specifying their hash IDs
  • As seen below, the line mpg %>% head(5) has been added between the two commits
git diff d5c6e09 aa89efb
diff --git a/create_dataset.R b/create_dataset.R
index 8b151a2..c1cff38 100644
--- a/create_dataset.R
+++ b/create_dataset.R
@@ -1 +1,2 @@
 library(tidyverse)
+mpg %>% head(5)
  • Note that the order in which we specify the commit hash IDs matters
  • As seen below, if we specify the ID of second commit and then the first commit, the displayed differences show that the line mpg %>% head(5) has been removed between the two commits
git diff aa89efb d5c6e09
diff --git a/create_dataset.R b/create_dataset.R
index c1cff38..8b151a2 100644
--- a/create_dataset.R
+++ b/create_dataset.R
@@ -1,2 +1 @@
 library(tidyverse)
-mpg %>% head(5)


6 Git: Under the hood

Remember when we said that git stores data as snapshots (or check-ins) over time? That is, each “commit” we make is a snapshot of a miniature file system. In the picture below, each “version” (version 1, version 2, …) represents a new “commit,” in which some files in the repository have changed and some files have not changed.

Credit: Getting Started - What is Git

Well, in this section we’re going deep “under the hood” of git to explain how this process works. It’s called the “Git Object Model.”

For this section, we’ll be working with a git repository on your local machine that is not connected to a remote repository.


In your everyday work with git, you usually won’t be going “under the hood” of your git repository. So, why are we teaching you this stuff?

  • you will gain a deeper conceptual understanding of how git works
    • this conceptual understanding becomes practically useful when challenges arise
  • understanding how git works will make it easier to learn subsequent practical topics like “branching” and “merging”

6.1 .git/ directory


Every git repository that is created using git init contains a .git/ directory that “contains all the informations needed for git to work” (From Git series 1/3: Understanding git for real by exploring the .git directory):

cd ~  # change to home directory
pwd
## /c/Users/ozanj/Documents
  • Initialize a new git repository in my_git_repo directory
cd ~  # change to home directory
pwd
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`

# Initialize a new git repository in `my_git_repo` directory
cd my_git_repo
git init

ls -al # list files: -a shows hidden files; -l uses long listing format
## /c/Users/ozanj/Documents
## Initialized empty Git repository in C:/Users/ozanj/Documents/my_git_repo/.git/
## total 52
## drwxr-xr-x 1 ozanj None 0 Feb  9 15:45 .
## drwxr-xr-x 1 ozanj None 0 Feb  9 15:45 ..
## drwxr-xr-x 1 ozanj None 0 Feb  9 15:45 .git


What’s inside the .git/ directory?

cd ~/my_git_repo

# List out the contents of the .git/ directory (in tree form)
find .git -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g' # the quoted text is regular expressions; don't worry about understanding this!
## .git
## |____config
## |____description
## |____HEAD
## |____hooks
## | |____applypatch-msg.sample
## | |____commit-msg.sample
## | |____fsmonitor-watchman.sample
## | |____post-update.sample
## | |____pre-applypatch.sample
## | |____pre-commit.sample
## | |____pre-merge-commit.sample
## | |____pre-push.sample
## | |____pre-rebase.sample
## | |____pre-receive.sample
## | |____prepare-commit-msg.sample
## | |____push-to-checkout.sample
## | |____update.sample
## |____info
## | |____exclude
## |____objects
## | |____info
## | |____pack
## |____refs
## | |____heads
## | |____tags

We will be focusing on:

  • objects/: Directory containing all git objects
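  • HEAD: Reference to the current branch (and, through that branch, to its latest commit)
  • refs/: Directory containing references such as branches (refs/heads/) and tags (refs/tags/); each reference stores the hash ID of the commit it points to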

We’ll get into git objects starting in the next section, and see an example of HEAD and refs/ in a later section.
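
As a quick preview of how HEAD and refs/ fit together, here is a minimal sketch, assuming Git is installed and uses its default on-disk layout; the temporary directory, README.txt, and the placeholder identity are made up for illustration.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "hello" > README.txt
git add README.txt
git commit -q -m "initial commit"
git branch -M main

head_content=$(cat .git/HEAD)              # HEAD names the current branch
branch_hash=$(cat .git/refs/heads/main)    # the branch file stores a commit hash
head_hash=$(git rev-parse HEAD)            # the commit that HEAD resolves to

echo "HEAD file:   $head_content"
echo "branch file: $branch_hash"
echo "rev-parse:   $head_hash"
```

The HEAD file contains a pointer to refs/heads/main rather than a hash, and the hash stored in that branch file is the same one git rev-parse HEAD prints.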

6.2 Git objects

What is a git object?

  • “A git repository is actually just a collection of objects, each identified with their own hash.” (From Deep dive into git: git Objects)
    • A “hash” can be thought of as a unique ID that points to the git object
    • “Git is a simple key-value data store. You put a value into the repository and get a key by which this value can be accessed.” (From Becoming a Git pro. Part 1: internal Git architecture)
      • Key = Hash (think of this as the name of a git object)
      • Value = Git object (think of this as the underlying contents of the git object)
  • Git objects are stored inside the .git/objects directory
    • The first 2 characters of its hash will be the name of the sub-directory within .git/objects that it is located in
    • The rest of the hash will be the git object filename
  • Use the git cat-file command to view information about a git object whose hash you specify
  • Use the git hash-object command to compute (show) the hash for a git “blob” object; the hash is calculated from the contents of the file whose name you give
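
The rule for where an object is stored (first 2 characters of the hash for the sub-directory, the rest for the filename) can be checked directly. This is a sketch, assuming Git is installed and the repository uses the default SHA-1 hashes (40 characters); the temporary directory and the variable names are made up for illustration.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init -q

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R                       # staging the file creates its blob object

hash=$(git hash-object create_dataset.R)       # full hash of the blob
sub_dir=$(printf '%s' "$hash" | cut -c1-2)     # first 2 characters -> sub-directory
file_name=$(printf '%s' "$hash" | cut -c3-)    # remaining characters -> filename
object_path=".git/objects/$sub_dir/$file_name"

echo "hash: $hash"
echo "path: $object_path"
ls -l "$object_path"                           # the object file exists at that path

object_type=$(git cat-file -t "$hash")         # ask Git what kind of object it is
echo "type: $object_type"
```

The file at that path is compressed, so opening it in a text editor shows unreadable bytes; git cat-file is the way to inspect it.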


git cat-file: Provide content or type and size information for repository objects

  • Help: git cat-file --help
  • Syntax: git cat-file [<option(s)>] <object>
  • Options:
    • -p: Pretty-print the contents of <object> based on its type
    • -t: Instead of the content, show the object type identified by <object>
    • -s: Instead of the content, show the object size identified by <object>


There are 4 types of git objects: blob, tree, commit, and tag (From The Git Object Model)

6.2.1 Blob object

A blob is generally a file which stores data, like text

  • For example, this could be an R script
  • The file must be added to the staging area (i.e., “index”) in order for the blob object to be created
  • The hash of the blob object can be seen in the .git/objects directory
    • The first 2 characters of the hash are the name of the sub-directory within .git/objects
    • The rest of the hash comes from the git object filename
    • In a small repository like this one, the first 7 characters of the hash are usually enough to uniquely identify the object
  • This hash can also be computed with the git hash-object command, which takes the name of the file for which the blob is to be created and computes the hash from the file’s contents
cd ~/my_git_repo

# Create new R script (working directory)
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(5)" >> create_dataset.R

# Add R script to staging area
git add create_dataset.R

# View .git/objects directory
find .git/objects -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## |____objects
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____info
## | |____pack


git hash-object: Compute the hash of the blob object for a given file (you pass the file name; the hash is computed from the file’s contents)

  • Help: git hash-object --help
  • Syntax: git hash-object <file_name>

We can use git hash-object to verify the hash for create_dataset.R:

# Generate blob object hash for R script
git hash-object create_dataset.R
## c1cff389562e8bc123e6691a60352fdf839df113

Example: Using git cat-file to view blob object content
# View content of create_dataset.R
git cat-file -p c1cff38
## library(tidyverse)
## mpg %>% head(5)

Example: Using git cat-file to view blob object type
# View object type for create_dataset.R
git cat-file -t c1cff38
## blob

Example: Using git cat-file to view blob object size
# View object size for create_dataset.R
git cat-file -s c1cff38
## 35
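Although git hash-object takes a file name as its argument, the hash depends only on the file’s contents. With git’s default SHA-1 object format, the blob hash is the SHA-1 of a short header (the word blob, a space, the size in bytes, and a null byte) followed by the contents. The sketch below reproduces this without calling git. It assumes the sha1sum utility (GNU coreutils) is available, and blob_hash is a hypothetical helper name, not a git command.

```shell
# blob_hash is a hypothetical helper: it reproduces git's blob hash without calling git
# (assumes the default SHA-1 object format and the sha1sum utility)
blob_hash() {
  size=$(wc -c < "$1" | tr -d ' ')     # size of the contents in bytes
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

tmp=$(mktemp -d)

# Same two lines as create_dataset.R (35 bytes with LF line endings)
printf 'library(tidyverse)\nmpg %%>%% head(5)\n' > "$tmp/create_dataset.R"

# Identical contents under a different file name
cp "$tmp/create_dataset.R" "$tmp/same_content_other_name.R"

blob_hash "$tmp/create_dataset.R"
blob_hash "$tmp/same_content_other_name.R"   # same contents, so the same hash
```

If the file has LF line endings, the hash printed here should match the one reported by git hash-object create_dataset.R. Because the file name is not part of the calculation, two files with identical contents share a single blob object.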


6.2.2 What the heck are all these objects!?

After a blob the next type of git object to discuss is a tree

A tree is a directory that contains references to blobs (files) or other trees (sub-directories)

  • Any sub-directory created inside the git repository is a tree object
    • It contains references to any blobs (files) or additional trees (sub-directories) within it
  • The root directory of the git repository is also a tree itself, and contains references to all its content at the point of commit (like a “snapshot”)
  • A commit must be made in order for the tree object(s) to be created

But before saying more about trees, we’re gonna take a detour to give you some skills to diagnose all these different objects we are going to encounter as we start adding folders and making commits

  • add a sub-directory named my_git_repo/notes (type of git object = tree) and a couple text files to the sub-directory (type of git object = blob)
cd ~/my_git_repo
rm -rf notes # force remove notes directory if it exists

# Create a sub-directory 
mkdir notes

# Add files to the sub-directory (since git doesn't track empty directories)
echo "This is my first set of notes." > notes/note_1.txt
echo "This is my second set of notes." > notes/note_2.txt

# Add new files
git add .

# View .git/objects directory
find .git/objects -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## warning: in the working copy of 'notes/note_1.txt', LF will be replaced by CRLF the next time Git touches it
## warning: in the working copy of 'notes/note_2.txt', LF will be replaced by CRLF the next time Git touches it
## |____objects
## | |____47
## | | |____6fb98775843929ca6c55b16b04752d973b3d2a
## | |____61
## | | |____08458417308ddc15d7390a2f8db50cf65ec399
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____info
## | |____pack

We can use git hash-object to verify the hashes for the files notes/note_1.txt and notes/note_2.txt:

# Generate blob object hash for the file notes/note_1.txt
echo 'hash for notes/note_1.txt:'
git hash-object notes/note_1.txt

# Generate blob object hash for the file notes/note_2.txt
echo 'hash for notes/note_2.txt:'
git hash-object notes/note_2.txt
## hash for notes/note_1.txt:
## 6108458417308ddc15d7390a2f8db50cf65ec399
## hash for notes/note_2.txt:
## 476fb98775843929ca6c55b16b04752d973b3d2a


Now that we know the hash associated with note_1.txt and note_2.txt we can use git cat-file to print the contents of these files.

# View content of note_1.txt and note_2.txt
git cat-file -p 6108458 # note_1.txt
git cat-file -p 476fb98 # note_2.txt
## This is my first set of notes.
## This is my second set of notes.


The command git cat-file -p <hash> is different from the command cat <filename>:

  • git cat-file -p <hash> prints the contents of files stored in the (hidden) .git directory
  • cat <filename> prints the contents of files stored in a “regular” directory
# View content of note_1.txt and note_2.txt, the versions not in .git directory
cat notes/note_1.txt
cat notes/note_2.txt
## This is my first set of notes.
## This is my second set of notes.
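In the example above the two commands print the same text because the working copy has not changed since it was staged. The difference shows up once the working copy is edited again. A minimal sketch, assuming git is installed; the scratch directory and the file note.txt are made up for illustration:

```shell
# Scratch repository in a temporary directory (hypothetical location)
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "first draft" > note.txt
git add note.txt                         # blob for "first draft" is written to .git/objects
staged_hash=$(git rev-parse :note.txt)   # ":<path>" refers to the staged version of a file

echo "second draft" > note.txt           # edit the working copy without staging the change

echo "working directory copy (cat):"
cat note.txt                             # second draft
echo "stored blob (git cat-file -p):"
git cat-file -p "$staged_hash"           # first draft
```

The blob stored in .git/objects keeps the staged contents until we run git add again, at which point a new blob (with a new hash) is created.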


We also created the directory notes, but tree objects (i.e., directories) are not created until a commit has been made

  • this is why we don’t see an object for the root directory my_git_repo


After the files have been committed, tree objects will be created for any sub-directories as well as for the root directory of the repository:

cd ~/my_git_repo

# Make a commit
git commit -m "initial commit"
## [main (root-commit) 651cb58] initial commit
##  3 files changed, 4 insertions(+)
##  create mode 100644 create_dataset.R
##  create mode 100644 notes/note_1.txt
##  create mode 100644 notes/note_2.txt

Now that we have made our first commit, let’s print contents of .git/objects directory

  • now .git/objects has 6 objects, each associated with a different hash. Ugh. this is getting confusing! what the hell are all these things!?
# View .git/objects directory
find .git/objects -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## |____objects
## | |____47
## | | |____6fb98775843929ca6c55b16b04752d973b3d2a
## | |____61
## | | |____08458417308ddc15d7390a2f8db50cf65ec399
## | |____65
## | | |____1cb5811251ecfffa1768dbd2b895de45030837
## | |____6c
## | | |____f7bbf49af4f9fd5103cf9f0a3fa25226b12336
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____f5
## | | |____9085df29aed7826a89b23af3f67fc3ab96f643
## | |____info
## | |____pack
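One quick way to answer "what are all these things?" is to ask git for the type of every object in one go. The sketch below rebuilds a similar repository in a scratch directory and uses git cat-file with the --batch-check and --batch-all-objects options, which print one line per object in the form <hash> <type> <size>. It assumes git is installed; the author name and email are placeholders so the commit works on a machine where git has not been configured.

```shell
# Placeholder identity so that git commit works without any prior configuration
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Scratch repository (hypothetical location)
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "library(tidyverse)" > create_dataset.R
mkdir notes
echo "This is my first set of notes." > notes/note_1.txt
echo "This is my second set of notes." > notes/note_2.txt
git add .
git commit -q -m "initial commit"

# One line per object: <hash> <type> <size>
git cat-file --batch-check --batch-all-objects

# Number of objects of each type
git cat-file --batch-check --batch-all-objects | cut -d' ' -f2 | sort | uniq -c
```

For this repository we expect three blobs (one per file), two trees (the root directory and notes/), and one commit, which accounts for all six objects.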

Commands to diagnose git objects

  • find hashes associated with all blob (files) and tree (directory) objects (except the root directory)
    • git ls-tree -rt HEAD
  • find hash for a particular folder associated with the most recent commit
    • git rev-parse HEAD:<path/to/directory>
      • e.g., git rev-parse HEAD:notes
  • find hash associated with a commit
    • git log
  • find hash associated with root directory (e.g., my_git_repo is root directory for our repository)
    • can find this by printing the contents of the git object associated with most recent commit


Example of using git log to show hash associated with commits

cd ~/my_git_repo
git log
## commit 651cb5811251ecfffa1768dbd2b895de45030837
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:35 2023 -0800
## 
##     initial commit


git ls-tree: List the contents of a tree object

  • Help: git ls-tree --help
  • Syntax: git ls-tree [<option(s)>] <tree-ish> [<path>]
  • Some useful options:
    • -r: recurse into subtrees (i.e., show contents of sub-folders)
    • -t: show tree entries when recursing (i.e., when showing contents of sub-folders, also list the sub-folders themselves)
    • useful to use -r and -t together, like: -rt
  • how to specify <tree-ish> (a commit, a tree hash, or anything else that resolves to a tree)
    • HEAD is basically a shortcut to the most recent commit on the current branch. So if we specify HEAD, this will show the directory structure of the repository associated with the most recent commit. For example,
      • git ls-tree -rt HEAD
    • HEAD:path/to/directory
      • use this syntax to show contents of a tree object that is a sub-directory within the root directory. For example,
      • git ls-tree -rt HEAD:notes


cd ~/my_git_repo

# View .git/objects directory
find .git/objects -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## |____objects
## | |____47
## | | |____6fb98775843929ca6c55b16b04752d973b3d2a
## | |____61
## | | |____08458417308ddc15d7390a2f8db50cf65ec399
## | |____65
## | | |____1cb5811251ecfffa1768dbd2b895de45030837
## | |____6c
## | | |____f7bbf49af4f9fd5103cf9f0a3fa25226b12336
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____f5
## | | |____9085df29aed7826a89b23af3f67fc3ab96f643
## | |____info
## | |____pack

Examples of using git ls-tree <tree-ish> to identify git objects

  • show git hash associated with all blob (file) and tree (folder) objects within the root directory
cd ~/my_git_repo

# List all blob and tree objects in the latest commit
git ls-tree -rt HEAD
## 100644 blob c1cff389562e8bc123e6691a60352fdf839df113 create_dataset.R
## 040000 tree 6cf7bbf49af4f9fd5103cf9f0a3fa25226b12336 notes
## 100644 blob 6108458417308ddc15d7390a2f8db50cf65ec399 notes/note_1.txt
## 100644 blob 476fb98775843929ca6c55b16b04752d973b3d2a notes/note_2.txt
  • Show contents of a particular sub-directory
cd ~/my_git_repo
git ls-tree -rt HEAD:notes
## 100644 blob 6108458417308ddc15d7390a2f8db50cf65ec399 note_1.txt
## 100644 blob 476fb98775843929ca6c55b16b04752d973b3d2a note_2.txt

Examples of using git ls-tree <hash id> to identify git objects

  • Once we know the hash associated with a particular object, we can specify this hash instead of a name like HEAD or HEAD:notes
  • show contents of “notes” folder
cd ~/my_git_repo
git ls-tree -rt 6cf7bbf4
## 100644 blob 6108458417308ddc15d7390a2f8db50cf65ec399 note_1.txt
## 100644 blob 476fb98775843929ca6c55b16b04752d973b3d2a note_2.txt
  • Once we know the hash associated with root directory (based on process of elimination), we can show contents associated with that hash
cd ~/my_git_repo
git ls-tree -rt f59085d
## 100644 blob c1cff389562e8bc123e6691a60352fdf839df113 create_dataset.R
## 040000 tree 6cf7bbf49af4f9fd5103cf9f0a3fa25226b12336 notes
## 100644 blob 6108458417308ddc15d7390a2f8db50cf65ec399 notes/note_1.txt
## 100644 blob 476fb98775843929ca6c55b16b04752d973b3d2a notes/note_2.txt
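Rather than finding the root directory’s hash by process of elimination, we can ask git for it. The revision syntax HEAD^{tree} means "the tree object that the commit HEAD points to". A minimal sketch, assuming git is installed; the scratch repository and placeholder identity are made up for illustration:

```shell
# Placeholder identity so that git commit works without any prior configuration
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Scratch repository (hypothetical location)
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir notes
echo "library(tidyverse)" > create_dataset.R
echo "This is my first set of notes." > notes/note_1.txt
git add .
git commit -q -m "initial commit"

# Hash of the root directory tree for the latest commit
root_tree=$(git rev-parse 'HEAD^{tree}')
echo "root tree hash: $root_tree"

git cat-file -t "$root_tree"        # tree
git ls-tree -rt "$root_tree"        # same listing as git ls-tree -rt HEAD

# The same hash appears on the first line of the commit object
git cat-file -p HEAD | head -n 1
```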


git rev-parse command, find hash for a particular folder associated with the most recent commit

  • git rev-parse HEAD: retrieve hash associated with latest commit
    • git rev-parse --short HEAD: retrieve an abbreviated hash (typically the first 7 characters) associated with latest commit
  • git rev-parse HEAD:path/to/directory: retrieve hash for a particular folder in most recent commit
    • e.g., git rev-parse HEAD:notes

Examples of using git rev-parse

  • hash for latest commit
cd ~/my_git_repo

echo 'retrieve hash associated with latest commit:'
git rev-parse HEAD
## retrieve hash associated with latest commit:
## 651cb5811251ecfffa1768dbd2b895de45030837
  • retrieve hash for sub-directory notes in latest commit
cd ~/my_git_repo

echo 'retrieve hash associated with folder notes in latest commit:'
git rev-parse HEAD:notes
## retrieve hash associated with folder notes in latest commit:
## 6cf7bbf49af4f9fd5103cf9f0a3fa25226b12336

6.2.3 Tree object

A tree is a directory that contains references to blobs (files) or other trees (sub-directories)

  • Any sub-directory created inside the git repository is a tree object
    • It contains references to any blobs (files) or additional trees (sub-directories) within it
  • The root directory of the git repository is also a tree itself, and contains references to all its content at the point of commit (like a “snapshot”)
  • A commit must be made in order for the tree object(s) to be created

Show git hash associated with all blob (file) and tree (folder) objects within the root directory

cd ~/my_git_repo

# List all blob and tree objects in the latest commit
git ls-tree -rt HEAD
## 100644 blob c1cff389562e8bc123e6691a60352fdf839df113 create_dataset.R
## 040000 tree 6cf7bbf49af4f9fd5103cf9f0a3fa25226b12336 notes
## 100644 blob 6108458417308ddc15d7390a2f8db50cf65ec399 notes/note_1.txt
## 100644 blob 476fb98775843929ca6c55b16b04752d973b3d2a notes/note_2.txt


As we now see, the tree objects for the my_git_repo/ root directory and notes/ sub-directory exist, and another object has been created for the commit (more info on that in next section):

  • show object type using git cat-file -t
# View object type for my_git_repo/ and notes/ trees
git cat-file -t f59085d # this is hash for the root directory
git cat-file -t 6cf7bbf # this is hash for the "notes" sub-directory

# View object type for the commit
git cat-file -t $(git rev-parse --short HEAD)  # git rev-parse retrieves latest commit hash
## tree
## tree
## commit


The content of a tree object is a list of all blobs (files) and other trees (sub-directories) in the directory. Each list entry follows the format:

<permission_code> <object_type> <object_hash> <object_name>
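  • <permission_code>: File mode code indicating the kind of entry and, for files, whether the file is executable
    • This is typically 100644 for blobs (regular files), 100755 for blobs that are executable files, and 040000 for trees (directories)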
  • <object_type>: Type of the object (i.e., blobs or trees)
  • <object_hash>: Reference to the object (i.e., the hash)
  • <object_name>: Name of the file or directory
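To see the different codes in the first field, the sketch below commits a regular file, an executable file, and a sub-directory, then lists the root tree. It assumes git is installed; the file names and placeholder identity are made up for illustration. It uses git update-index --chmod=+x to mark a file as executable in the index, which works even on file systems that do not track the executable bit.

```shell
# Placeholder identity so that git commit works without any prior configuration
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Scratch repository (hypothetical location)
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "plain text" > readme.txt
echo "echo hi" > run.sh
mkdir notes
echo "a note" > notes/note_1.txt

git add .
git update-index --chmod=+x run.sh   # record run.sh as an executable file
git commit -q -m "files with different modes"

# Each entry: <mode> <type> <hash> <name>
git ls-tree HEAD

# Pull out just the mode for each entry
mode_txt=$(git ls-tree HEAD readme.txt | cut -d' ' -f1)   # 100644
mode_sh=$(git ls-tree HEAD run.sh | cut -d' ' -f1)        # 100755
mode_dir=$(git ls-tree HEAD notes | cut -d' ' -f1)        # 040000
echo "$mode_txt $mode_sh $mode_dir"
```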

Example: Using git cat-file to view tree object content for my_git_repo/ root directory

First, show files in directory using ls command with options -al:

# Show files in directory
ls -al
## total 53
## drwxr-xr-x 1 ozanj None  0 Feb  9 15:45 .
## drwxr-xr-x 1 ozanj None  0 Feb  9 15:45 ..
## drwxr-xr-x 1 ozanj None  0 Feb  9 15:45 .git
## -rw-r--r-- 2 ozanj None 35 Feb  9 15:45 create_dataset.R
## drwxr-xr-x 1 ozanj None  0 Feb  9 15:45 notes

Second, show contents of tree (root directory) using git cat-file:

# View type and content of my_git_repo/ tree object
git cat-file -t f59085d  # type
git cat-file -p f59085d  # content
## tree
## 100644 blob c1cff389562e8bc123e6691a60352fdf839df113 create_dataset.R
## 040000 tree 6cf7bbf49af4f9fd5103cf9f0a3fa25226b12336 notes

Example: Using git cat-file to view tree object content for notes/ sub-directory
# View type and content of notes/ tree object
git cat-file -t 6cf7bbf  # type
git cat-file -p 6cf7bbf  # content
## tree
## 100644 blob 6108458417308ddc15d7390a2f8db50cf65ec399 note_1.txt
## 100644 blob 476fb98775843929ca6c55b16b04752d973b3d2a note_2.txt


6.2.4 Commit object

When a commit is made, a commit object is created that contains information about the commit:

tree <tree_hash>
parent <commit_hash>
author <username> <email> <time>
committer <username> <email> <time>

<commit_message>
  • tree: Reference to the root directory tree object (i.e., “snapshot” of repository at the point of commit)
  • parent: Reference to the parent commit (if not the first commit)
  • Other information about the commit (e.g., author, committer, commit_message)


Every commit except the initial commit contains a reference to its parent commit (a merge commit has more than one parent).

Investigate contents associated with our first commit using git log

cd ~/my_git_repo
git log
## commit 651cb5811251ecfffa1768dbd2b895de45030837
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:35 2023 -0800
## 
##     initial commit

Example: Using git cat-file to view commit object content for first commit


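  • Note: we use git rev-list HEAD | tail -n 1 to obtain the hash associated with the first commit rather than typing it out, because the commit hash depends on the time the commit was made (so it changes each time this example is re-run)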
# Retrieve commit hash for first commit
echo 'retrieve hash associated with first commit:'
git rev-list HEAD | tail -n 1

echo ''
echo 'show object type associated with the git object:'
git cat-file -t $(git rev-list HEAD | tail -n 1)

# View content of the commit object
echo ''
echo 'print contents of the commit:'
git cat-file -p $(git rev-list HEAD | tail -n 1)
## retrieve hash associated with first commit:
## 651cb5811251ecfffa1768dbd2b895de45030837
## 
## show object type associated with the git object:
## commit
## 
## print contents of the commit:
## tree f59085df29aed7826a89b23af3f67fc3ab96f643
## author Ozan Jaquette <ozanj@ucla.edu> 1675986335 -0800
## committer Ozan Jaquette <ozanj@ucla.edu> 1675986335 -0800
## 
## initial commit
  • Because printing commit object gave us the hash for the root directory (tree), we can use this to print contents of root directory
cd ~/my_git_repo
echo 'print contents root directory:'
git cat-file -p f59085df
## print contents root directory:
## 100644 blob c1cff389562e8bc123e6691a60352fdf839df113 create_dataset.R
## 040000 tree 6cf7bbf49af4f9fd5103cf9f0a3fa25226b12336 notes


Let’s create a second commit:

# Modify R script
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R

# Add R script
git add create_dataset.R

# Make another commit
git commit -m "second commit"

# View .git/objects directory
find .git/objects -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [main dbe6fbb] second commit
##  1 file changed, 1 insertion(+)
## |____objects
## | |____47
## | | |____6fb98775843929ca6c55b16b04752d973b3d2a
## | |____49
## | | |____0ec1c138021b8d5c196c26a2a7b3de69afc2d1
## | |____52
## | | |____4db779f0a3e3b3b353b522285c7da4830e21f1
## | |____61
## | | |____08458417308ddc15d7390a2f8db50cf65ec399
## | |____65
## | | |____1cb5811251ecfffa1768dbd2b895de45030837
## | |____6c
## | | |____f7bbf49af4f9fd5103cf9f0a3fa25226b12336
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____db
## | | |____e6fbbdb626285fd1c8943eb814c2402008213d
## | |____f5
## | | |____9085df29aed7826a89b23af3f67fc3ab96f643
## | |____info
## | |____pack

Example: Using git cat-file to view commit object content for second commit


  • Note: The commit hash will be different each time we run this because it is dependent on the time
  • Note that the “parent” of the most recent commit is the hash for the previous commit!
# Retrieve commit hash for latest commit
echo 'Retrieve commit hash for latest commit:'
git rev-parse HEAD

# View content of the commit object
echo ''
echo 'print contents of most recent commit object:'
git cat-file -p $(git rev-parse HEAD)
## Retrieve commit hash for latest commit:
## dbe6fbbdb626285fd1c8943eb814c2402008213d
## 
## print contents of most recent commit object:
## tree 524db779f0a3e3b3b353b522285c7da4830e21f1
## parent 651cb5811251ecfffa1768dbd2b895de45030837
## author Ozan Jaquette <ozanj@ucla.edu> 1675986341 -0800
## committer Ozan Jaquette <ozanj@ucla.edu> 1675986341 -0800
## 
## second commit
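The parent line is what links commits into a history. The sketch below makes two commits in a scratch repository and checks that the parent recorded in the second commit object is the hash of the first commit. It assumes git is installed; the file name and placeholder identity are made up for illustration. HEAD^ is git shorthand for "the first parent of HEAD".

```shell
# Placeholder identity so that git commit works without any prior configuration
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Scratch repository (hypothetical location)
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "initial commit"
first=$(git rev-parse HEAD)              # hash of the first commit

echo "mpg %>% head(5)" >> create_dataset.R
git add create_dataset.R
git commit -q -m "second commit"

# The "parent" line of the second commit object
parent_line=$(git cat-file -p HEAD | grep '^parent ')
echo "$parent_line"
echo "first commit: $first"

# Shorthand for the same thing
git rev-parse 'HEAD^'
```

The first commit object has no parent line at all, which is how git knows where the history begins.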


6.2.5 Tag object

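A tag object is created when an annotated tag is generated (e.g., with git tag -a); a lightweight tag, created without -a, is just a reference and does not create a new object: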

  • Tags are typically references to a particular commit
  • Tags are used to make a reference to a particular commit that is viewed as a milestone
    • e.g., the commit associated with a new version of the software (e.g., Minecraft version 1.19.4)
    • e.g., the commit associated with a particular version of a journal manuscript (like, the commit associated with submitting the manuscript to a journal)
object <object_hash>
type <object_type>
tag <tag_name>
tagger <username> <email> <time>

<tag_message>
  • object: Reference to the tagged object
  • type: Object type of the tagged object (usually a commit)
  • Other information about the tag (e.g., name of tag, tagger, tag_message)

Let’s create a tag for the current commit:

# Create a tag
git tag -a v1 -m "version 1.0"

# View .git/objects directory
find .git/objects -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## |____objects
## | |____47
## | | |____6fb98775843929ca6c55b16b04752d973b3d2a
## | |____49
## | | |____0ec1c138021b8d5c196c26a2a7b3de69afc2d1
## | |____52
## | | |____4db779f0a3e3b3b353b522285c7da4830e21f1
## | |____61
## | | |____08458417308ddc15d7390a2f8db50cf65ec399
## | |____65
## | | |____1cb5811251ecfffa1768dbd2b895de45030837
## | |____6c
## | | |____f7bbf49af4f9fd5103cf9f0a3fa25226b12336
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____d3
## | | |____d7fd4e4df4a9d983c1987465e00b2a5b3207aa
## | |____db
## | | |____e6fbbdb626285fd1c8943eb814c2402008213d
## | |____f5
## | | |____9085df29aed7826a89b23af3f67fc3ab96f643
## | |____info
## | |____pack

Example: Using git cat-file to view tag object
echo 'print hash associated with v1 tag:'
git show-ref -s v1
echo ''
echo 'print contents of the v1 tag:'
git cat-file -p $(git show-ref -s v1)  # retrieves hash for v1 tag
## print hash associated with v1 tag:
## d3d7fd4e4df4a9d983c1987465e00b2a5b3207aa
## 
## print contents of the v1 tag:
## object dbe6fbbdb626285fd1c8943eb814c2402008213d
## type commit
## tag v1
## tagger Ozan Jaquette <ozanj@ucla.edu> 1675986342 -0800
## 
## version 1.0
# The tagged object was the second commit
git log
## commit dbe6fbbdb626285fd1c8943eb814c2402008213d
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:41 2023 -0800
## 
##     second commit
## 
## commit 651cb5811251ecfffa1768dbd2b895de45030837
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:35 2023 -0800
## 
##     initial commit
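Only annotated tags (created with -a) produce a tag object. A lightweight tag is simply a reference that stores a commit hash. The sketch below creates one of each and compares them. It assumes git is installed; the tag name v1-light and the placeholder identity are made up for illustration.

```shell
# Placeholder identity so that git commit and git tag -a work without prior configuration
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Scratch repository (hypothetical location)
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "initial commit"

git tag -a v1 -m "version 1.0"   # annotated tag: creates a tag object
git tag v1-light                 # lightweight tag: only a reference, no new object

git cat-file -t v1               # tag
git cat-file -t v1-light         # commit

# Both tags lead to the same commit; ^{commit} follows the tag object to its commit
git rev-parse 'v1^{commit}'
git rev-parse v1-light
git rev-parse HEAD
```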



6.3 HEAD and refs/

The HEAD file is a pointer to your current (active) branch

  • Specifically, the HEAD file contains the name of the branch you are working on (e.g., refs/heads/main); the hash ID of the latest commit on that branch is stored in the corresponding file in the refs/ directory.


Especially when we get to working with multiple branches, the HEAD becomes important as it keeps track of which branch you are currently on.

Note, when we print the full directory structure of .git directory, we can see the file HEAD and the directory refs/heads

cd ~/my_git_repo

find .git -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## .git
## |____COMMIT_EDITMSG
## |____config
## |____description
## |____HEAD
## |____hooks
## | |____applypatch-msg.sample
## | |____commit-msg.sample
## | |____fsmonitor-watchman.sample
## | |____post-update.sample
## | |____pre-applypatch.sample
## | |____pre-commit.sample
## | |____pre-merge-commit.sample
## | |____pre-push.sample
## | |____pre-rebase.sample
## | |____pre-receive.sample
## | |____prepare-commit-msg.sample
## | |____push-to-checkout.sample
## | |____update.sample
## |____index
## |____info
## | |____exclude
## |____logs
## | |____HEAD
## | |____refs
## | | |____heads
## | | | |____main
## |____objects
## | |____47
## | | |____6fb98775843929ca6c55b16b04752d973b3d2a
## | |____49
## | | |____0ec1c138021b8d5c196c26a2a7b3de69afc2d1
## | |____52
## | | |____4db779f0a3e3b3b353b522285c7da4830e21f1
## | |____61
## | | |____08458417308ddc15d7390a2f8db50cf65ec399
## | |____65
## | | |____1cb5811251ecfffa1768dbd2b895de45030837
## | |____6c
## | | |____f7bbf49af4f9fd5103cf9f0a3fa25226b12336
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____d3
## | | |____d7fd4e4df4a9d983c1987465e00b2a5b3207aa
## | |____db
## | | |____e6fbbdb626285fd1c8943eb814c2402008213d
## | |____f5
## | | |____9085df29aed7826a89b23af3f67fc3ab96f643
## | |____info
## | |____pack
## |____refs
## | |____heads
## | | |____main
## | |____tags
## | | |____v1

The directory refs/heads has a file named main

cd ~/my_git_repo

find .git/refs -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## |____refs
## | |____heads
## | | |____main
## | |____tags
## | | |____v1


Below we are using the cat command (rather than git cat-file) to print contents of a file

If we output the contents of the file .git/HEAD, we see it contains a reference to the main branch:

# View content of HEAD
cat .git/HEAD
## ref: refs/heads/main

Following that reference, we can find the hash ID of the latest commit stored inside refs/heads/main:

# View content of refs/heads/main
cat .git/refs/heads/main
## dbe6fbbdb626285fd1c8943eb814c2402008213d

We can use git log to verify that this is the hash ID of the latest commit:

# View commit log
git log
## commit dbe6fbbdb626285fd1c8943eb814c2402008213d
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:41 2023 -0800
## 
##     second commit
## 
## commit 651cb5811251ecfffa1768dbd2b895de45030837
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:35 2023 -0800
## 
##     initial commit
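The same two-step lookup (HEAD, then the branch file) can be done with git commands instead of cat. The sketch below, which assumes git is installed and uses a scratch repository with a placeholder identity, uses git symbolic-ref HEAD to print the reference stored in HEAD and then compares the branch file with git rev-parse HEAD. It reads the branch name from git rather than assuming main, since the default branch name depends on git settings.

```shell
# Placeholder identity so that git commit works without any prior configuration
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Scratch repository (hypothetical location)
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "initial commit"

# Step 1: HEAD names a reference (a branch)
cat .git/HEAD
ref=$(git symbolic-ref HEAD)     # e.g., refs/heads/main
echo "HEAD points to: $ref"

# Step 2: the reference file stores the hash of the latest commit on that branch
cat ".git/$ref"
git rev-parse HEAD               # same hash
```

In a repository with many references, git may store them together in a single file (.git/packed-refs) rather than as individual files, so git rev-parse is the more reliable way to look up a hash.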


More generally, the refs/ directory stores references to all branches. In particular, refs/heads/ stores all your local branches:

  • “branches” discussed in more detail later
# View local branches
ls .git/refs/heads
## main

On the other hand, refs/remotes/ contains the remote HEAD and your remote-tracking branches. In other words, it is your local record of where the branches of the remote repository pointed the last time you communicated with the remote.

Inside refs/remotes/, there will be a folder for each of your remotes. For example, to view all references for the remote repository named origin, you can look under refs/remotes/origin:

  • we don’t run below code, because we have not yet created remotes for this repository
# View remote HEAD and remote-tracking branches for origin
ls .git/refs/remotes/origin
## HEAD
## main

When you run git fetch, it will update the references in refs/remotes/ (i.e., your local copy of the remote repository), but it will not change anything in refs/heads/ (i.e., your local repository). Thus, git fetch is useful if you want a local copy of the most up-to-date changes in the remote repository (e.g., to preview changes), but don’t actually want to merge these changes into your local repository yet.

On the other hand, git pull is effectively a git fetch followed by a git merge (discussed more later). It will not only update refs/remotes/ but refs/heads as well to bring your local repository up-to-date with the remote.
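The difference between git fetch and git pull can be seen without GitHub by using a bare repository on disk as a stand-in for the remote. This is only a sketch under assumptions: git is installed, and the directories remote.git, alice, and bob, as well as the placeholder identity, are made up for illustration.

```shell
# Placeholder identity so that git commit works without any prior configuration
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

work=$(mktemp -d)                        # hypothetical scratch directory
git init -q --bare "$work/remote.git"    # stands in for a remote such as GitHub

# "alice" makes the first commit and pushes it to the remote
git init -q "$work/alice"
cd "$work/alice"
echo "one" > f.txt
git add f.txt
git commit -q -m "first commit"
branch=$(git symbolic-ref --short HEAD)  # main or master, depending on git settings
git remote add origin "$work/remote.git"
git push -q origin "$branch"

# "bob" clones the remote; afterwards alice pushes a second commit
git clone -q "$work/remote.git" "$work/bob"
echo "two" >> f.txt
git commit -q -am "second commit"
git push -q origin "$branch"

# bob fetches: refs/remotes/origin/<branch> moves, refs/heads/<branch> does not
cd "$work/bob"
before=$(git rev-parse "refs/heads/$branch")
git fetch -q origin
after_fetch_local=$(git rev-parse "refs/heads/$branch")
after_fetch_remote=$(git rev-parse "refs/remotes/origin/$branch")
echo "local branch before fetch: $before"
echo "local branch after fetch:  $after_fetch_local"
echo "remote-tracking branch:    $after_fetch_remote"

# Merging the fetched commits is the second half of what git pull does
git merge -q --ff-only "origin/$branch"
after_merge_local=$(git rev-parse "refs/heads/$branch")
echo "local branch after merge:  $after_merge_local"
```

After the fetch, bob’s local branch still points to the first commit while the remote-tracking branch points to the second. Only after the merge do the two agree.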

6.4 Full example

cd ~  # change to home directory
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`

# Initialize a new git repository in `my_git_repo` directory
cd my_git_repo
git init
## Initialized empty Git repository in C:/Users/ozanj/Documents/my_git_repo/.git/
# Create new R script
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(5)" >> create_dataset.R

# R script initially starts off under `Untracked Files`
git status
## On branch main
## 
## No commits yet
## 
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
## 
##  create_dataset.R
## 
## nothing added to commit but untracked files present (use "git add" to track)
# Add R script
git add create_dataset.R

# R script moves to `Changes to be committed`
git status
## On branch main
## 
## No commits yet
## 
## Changes to be committed:
##   (use "git rm --cached <file>..." to unstage)
## 
##  new file:   create_dataset.R
# Once R script has been added, a blob object is created for it in the .git/objects directory
find .git/objects -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## |____objects
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____info
## | |____pack
# We can use `git hash-object` to verify the hash of the blob object
git hash-object create_dataset.R
## c1cff389562e8bc123e6691a60352fdf839df113
# With this hash, we can view the content of create_dataset.R
git cat-file -p c1cff38
## library(tidyverse)
## mpg %>% head(5)
# Make a commit
git commit -m "add create_dataset.R"
# The R script is now no longer listed
git status
## On branch main
## nothing to commit, working tree clean
# Check the commit history
git log
## commit 572522a3f76f84867a47bc79acbe1115cfcf4312
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:46 2023 -0800
## 
##     add create_dataset.R
# Verify that `HEAD` is indeed pointing to the last commit made, which is our initial commit
cat .git/HEAD
cat .git/refs/heads/main
## ref: refs/heads/main
## 572522a3f76f84867a47bc79acbe1115cfcf4312
# Further modify R script, which is now a tracked file
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R

# R script is now under `Changes not staged for commit`
git status
## On branch main
## Changes not staged for commit:
##   (use "git add <file>..." to update what will be committed)
##   (use "git restore <file>..." to discard changes in working directory)
##  modified:   create_dataset.R
## 
## no changes added to commit (use "git add" and/or "git commit -a")
# View what new changes were made to R script
  # below git diff command shows differences between last commit and changes made in working directory which are not yet staged
git diff
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## diff --git a/create_dataset.R b/create_dataset.R
## index c1cff38..490ec1c 100644
## --- a/create_dataset.R
## +++ b/create_dataset.R
## @@ -1,2 +1,3 @@
##  library(tidyverse)
##  mpg %>% head(5)
## +df <- mpg %>% filter(year == 2008)
# Add new changes made to R script
git add create_dataset.R

# .git/objects directory now contains blob objects for both versions of R script
# It also contains objects for the commit and root directory tree
find .git/objects -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## |____objects
## | |____49
## | | |____0ec1c138021b8d5c196c26a2a7b3de69afc2d1
## | |____57
## | | |____2522a3f76f84867a47bc79acbe1115cfcf4312
## | |____96
## | | |____6cc780d5994bc8a4ed535484cd7f8268e8e874
## | |____c1
## | | |____cff389562e8bc123e6691a60352fdf839df113
## | |____info
## | |____pack
# We can use `git hash-object` to verify the hash for the new blob object
git hash-object create_dataset.R
## 490ec1c138021b8d5c196c26a2a7b3de69afc2d1
# With this hash, we can view the content of the modified create_dataset.R
git cat-file -p 490ec1c
## library(tidyverse)
## mpg %>% head(5)
## df <- mpg %>% filter(year == 2008)

Note that the hash and contents for the above blob (i.e., file) are different from the hash and contents of the blob associated with the first commit (below)

# With the hash from the first commit, we can view the content of the original create_dataset.R
git cat-file -p c1cff38
## library(tidyverse)
## mpg %>% head(5)

Commit changes to script file

# Make a commit
git commit -m "modify create_dataset.R"
## [main 21f7f13] modify create_dataset.R
##  1 file changed, 1 insertion(+)
# Check the commit history
git log
## commit 21f7f135ebcd86f7df2da9c0f16b3396e2940158
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:48 2023 -0800
## 
##     modify create_dataset.R
## 
## commit 572522a3f76f84867a47bc79acbe1115cfcf4312
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:46 2023 -0800
## 
##     add create_dataset.R
# Verify that `HEAD` is pointing to the last commit made, which is now our second commit
cat .git/HEAD
cat .git/refs/heads/main
## ref: refs/heads/main
## 21f7f135ebcd86f7df2da9c0f16b3396e2940158
# View content of commit object for second commit
git cat-file -p $(git rev-parse HEAD)
## tree 6de1187f46bbf4d76cafca7c0e5d3d61db6b5a53
## parent 572522a3f76f84867a47bc79acbe1115cfcf4312
## author Ozan Jaquette <ozanj@ucla.edu> 1675986348 -0800
## committer Ozan Jaquette <ozanj@ucla.edu> 1675986348 -0800
## 
## modify create_dataset.R

7 Git commands: Undoing changes

Sometimes we want to undo changes we have made to files in our git repository.
When thinking about undoing changes to files, it is helpful to keep in mind this visualization of the three components and workflow of a git repository

  • note: here we focus on a local repository on your computer

Credit: Lucas Maurer, medium.com


Overview of four different undo changes operations

  1. Discard (unstaged) changes to a tracked file that were made in your working directory
    • situation
      • the file is part of your repository (it is being tracked); you made changes to file while working in your working directory and you want to get rid of those changes
    • command: git restore <file(s)>
    • result:
      • changes you made to <file> in your working directory are discarded, gone forever
  2. Unstage staged changes to a file(s)
    • situation:
      • in your working directory, you made changes to a file, then you staged those changes, and now you want to unstage those changes
    • command: git restore --staged <file(s)>
      • equivalent: git reset HEAD <file_name(s)>
    • result:
      • staged changes to <file> are unstaged; these unstaged changes are retained in your working directory
  3. Remove commit(s) made after a specific commit (removed commits are discarded from history)
    • situation: you want to undo the previous n commits and you do not want a record of these commits
    • command: git reset <commit_hash>
    • result:
      • Undoes all commits made after the specified <commit_hash> (the specified commit itself is kept)
  4. Revert a specific commit (previous commits retained)
    • situation: you want to undo the changes made by a commit but you want to keep a record of that commit
    • command: git revert --no-edit <commit_hash>
    • result:
      • Undoes the changes introduced by the specified <commit_hash> (only that commit, not the commits made after it)
      • git revert does not remove any previous commits; rather, it creates a new commit (e.g., commit 3) that changes things back to the way they were prior to some commit (e.g., prior to commit 2)


Image of git revert vs. git reset

Credit: NUKE Designs, Git revert

7.1 git restore: discard unstaged changes

git restore: Discard/undo changes made in working directory to tracked file(s)

  • Help: git restore --help
  • Syntax: git restore [<file_name(s)>]
  • Result: Undo changes made to specified file_name(s) in the working directory
    • This only applies to tracked files with unstaged changes (i.e., files listed under Changes not staged for commit when you check git status)
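
To see why git restore only touches unstaged changes, the sketch below stages one change and leaves a second change unstaged. By default git restore <file> copies the version in the staging area back into the working directory, so only the unstaged line disappears. The temporary directory and the placeholder user name/email are assumptions made so the demo is self-contained; they are not part of the lecture's setup.

```shell
# Scratch repository in a temporary directory (illustrative setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

# Commit a one-line script
echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "add create_dataset.R"

# Change 1: added to the staging area
echo "mpg %>% head(5)" >> create_dataset.R
git add create_dataset.R

# Change 2: left unstaged in the working directory
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R

# Discard the unstaged change; the staged line stays in the file
git restore create_dataset.R
cat create_dataset.R
git status --short
```

After running this, create_dataset.R should contain the first two lines only, and git status should still list the file as having a staged modification.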

Example: Using git restore to discard changes to a tracked, unstaged file


In this git restore example we will:

  • Initialize an (empty) folder as a git repo
  • (in working directory) Create a file create_dataset.R that contains two lines of code
  • “add” changes to create_dataset.R to “staging area” and “commit” those changes to the local repository
  • (in working directory) insert a new line of code to create_dataset.R
  • Imagine that we decide we don’t like this new line of code we created in the working directory
  • so we use git restore <file_name> to undo changes made in the “working directory” to the file create_dataset.R
    • The result is that file create_dataset.R goes back to the way it was after the initial commit and the result of git status is “nothing to commit, working tree clean”
cd ~  # change to home directory
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`
cd my_git_repo
git init

# First, create new R script
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(5)" >> create_dataset.R

# Add/commit R script so it is now tracked
git add create_dataset.R
git commit -m "add create_dataset.R"
# View how create_dataset.R looks when it was committed
cat create_dataset.R
## library(tidyverse)
## mpg %>% head(5)
# Modify R script
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R

# View how create_dataset.R looks now
cat create_dataset.R
## library(tidyverse)
## mpg %>% head(5)
## df <- mpg %>% filter(year == 2008)
echo 'output from git status:'
git status

# See exact changes that have been made to file since last commit
echo ''
echo 'output from git diff:'
git diff
## output from git status:
## On branch main
## Changes not staged for commit:
##   (use "git add <file>..." to update what will be committed)
##   (use "git restore <file>..." to discard changes in working directory)
##  modified:   create_dataset.R
## 
## no changes added to commit (use "git add" and/or "git commit -a")
## 
## output from git diff:
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## diff --git a/create_dataset.R b/create_dataset.R
## index c1cff38..490ec1c 100644
## --- a/create_dataset.R
## +++ b/create_dataset.R
## @@ -1,2 +1,3 @@
##  library(tidyverse)
##  mpg %>% head(5)
## +df <- mpg %>% filter(year == 2008)
# Undo those changes using git restore
git restore create_dataset.R

# View file after discarding changes
echo 'view contents of file create_dataset.R after undoing changes using git restore:'
cat create_dataset.R

echo ''
echo 'output from git status:'
git status
## view contents of file create_dataset.R after undoing changes using git restore:
## library(tidyverse)
## mpg %>% head(5)
## 
## output from git status:
## On branch main
## nothing to commit, working tree clean


7.2 git restore: unstage staged changes

Using git restore --staged to unstage staged changes to file(s)

  • Do this by adding the --staged option to git restore
  • Syntax: git restore --staged [<file(s)>]
  • Result:
    • staged changes to <file(s)> are unstaged
    • these unstaged changes are retained in your working directory
      • files listed as Changes not staged for commit when you check git status
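
Because git restore --staged takes file names, you can unstage some files while leaving others staged. A minimal sketch, assuming a scratch repository with two hypothetical scripts (analysis.R is an invented name) and a placeholder identity:

```shell
# Scratch repository in a temporary directory (illustrative setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

# Commit two scripts
echo "library(tidyverse)" > create_dataset.R
echo "# analysis script" > analysis.R
git add create_dataset.R analysis.R
git commit -q -m "add two scripts"

# Modify and stage both files
echo "mpg %>% head(5)" >> create_dataset.R
echo "summary(mpg)" >> analysis.R
git add create_dataset.R analysis.R

# Unstage only analysis.R; create_dataset.R stays staged
git restore --staged analysis.R

staged=$(git diff --cached --name-only)   # files with staged changes
unstaged=$(git diff --name-only)          # files with unstaged changes
echo "staged:   $staged"
echo "unstaged: $unstaged"
```

The edit to analysis.R is still in the file; it has only moved from "Changes to be committed" back to "Changes not staged for commit".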

Example: Using git restore --staged to unstage a file

In this example we will:

  • Initialize an (empty) folder as a git repo
  • (in working directory) Create a file create_dataset.R that contains two lines of code
  • “add” changes to create_dataset.R to “staging area” and “commit” those changes to the local repository
  • (in working directory) insert a new line of code to create_dataset.R
  • “add” this change to create_dataset.R to the staging area
  • We use git restore --staged <file_name> to “unstage” changes we had “added” to the staging area.
  • Result: changes we made to create_dataset.R are now unstaged changes in the working directory rather than staged changes ready to be committed
    • git restore --staged does not delete the line of code we added after the initial commit
cd ~  # change to home directory
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`
cd my_git_repo
git init

# First, create new R script
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(5)" >> create_dataset.R

# Add/commit R script so it is now tracked
git add create_dataset.R
git commit -m "add create_dataset.R"
# Modify R script
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R

# Add new changes to the staging area
git add create_dataset.R

# Check status to verify it has been staged (listed under `Changes to be committed`)
git status
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## On branch main
## Changes to be committed:
##   (use "git restore --staged <file>..." to unstage)
##  modified:   create_dataset.R
# Use git restore --staged to unstage file
echo 'use git restore --staged to unstage changes added to the staging area'
git restore --staged create_dataset.R

# Check status to verify it has been unstaged (listed under `Changes not staged for commit`)
echo ''
echo 'output from git status (after using git restore --staged to unstage changes):'
git status

echo ''
echo 'print the file create_dataset.R (after using git restore --staged to unstage changes):'
cat create_dataset.R
## use git restore --staged to unstage changes added to the staging area
## 
## output from git status (after using git restore --staged to unstage changes):
## On branch main
## Changes not staged for commit:
##   (use "git add <file>..." to update what will be committed)
##   (use "git restore <file>..." to discard changes in working directory)
##  modified:   create_dataset.R
## 
## no changes added to commit (use "git add" and/or "git commit -a")
## 
## print the file create_dataset.R (after using git restore --staged to unstage changes):
## library(tidyverse)
## mpg %>% head(5)
## df <- mpg %>% filter(year == 2008)

7.3 git reset: discard commits

git reset: Remove commit(s) made after a specific commit (removed commits are discarded from history)

  • Help: git reset --help
  • Syntax: git reset <commit_hash>
  • Result:
    • Undo all commits made after the specified commit_hash (the specified commit itself is kept)
    • The HEAD pointer (and the current branch) will be set to the specified commit
    • Changes to files associated with undone commits are retained in the working directory as unstaged changes
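
Instead of a full hash, the commit can also be named relative to HEAD. The sketch below makes three commits and then undoes the last n = 2 with git reset HEAD~2 (HEAD~2 means "two commits before HEAD"). The scratch directory and placeholder identity are assumptions for the demo.

```shell
# Scratch repository in a temporary directory (illustrative setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

# Three commits, each adding one line to the script
echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "add 1st line to create_dataset.R"

echo "mpg %>% head(5)" >> create_dataset.R
git add create_dataset.R
git commit -q -m "add 2nd line to create_dataset.R"

echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R
git add create_dataset.R
git commit -q -m "add 3rd line to create_dataset.R"

before=$(git rev-list --count HEAD)   # number of commits reachable from HEAD

# Undo the last two commits; file contents are left alone
git reset -q HEAD~2

after=$(git rev-list --count HEAD)
echo "commits before reset: $before, after reset: $after"
cat create_dataset.R
git status --short
```

The commit count drops from 3 to 1, but create_dataset.R still has all three lines; the second and third lines now show up as an unstaged modification.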

Example: Using git reset to undo a commit
cd ~  # change to home directory
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`
cd my_git_repo
git init

# First, create new R script
echo "library(tidyverse)" > create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "add 1st line to create_dataset.R"
# Modify R script
echo "mpg %>% head(5)" >> create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "add 2nd line to create_dataset.R"
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [main 12fa051] add 2nd line to create_dataset.R
##  1 file changed, 1 insertion(+)
# View commit log
git log

# this code retrieves the first commit hash
git rev-list HEAD | tail -n 1
## commit 12fa051d8fd7452519701667e93381bdceeeb41d
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:54 2023 -0800
## 
##     add 2nd line to create_dataset.R
## 
## commit 996fd32ce120d748fb9360ec06f42cf2b31ac392
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:53 2023 -0800
## 
##     add 1st line to create_dataset.R
## 996fd32ce120d748fb9360ec06f42cf2b31ac392
# Specify the hash ID of the commit to undo up to
git reset $(git rev-list HEAD | tail -n 1)  # this retrieves the first commit hash

# View commit log - the 2nd commit has been removed
git log
## Unstaged changes after reset:
## M    create_dataset.R
## commit 996fd32ce120d748fb9360ec06f42cf2b31ac392
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:53 2023 -0800
## 
##     add 1st line to create_dataset.R
# changes to files associated with undone commit(s) are retained in working directory
cat create_dataset.R

git status
## library(tidyverse)
## mpg %>% head(5)
## On branch main
## Changes not staged for commit:
##   (use "git add <file>..." to update what will be committed)
##   (use "git restore <file>..." to discard changes in working directory)
##  modified:   create_dataset.R
## 
## no changes added to commit (use "git add" and/or "git commit -a")


7.4 git revert: undo a commit with a new commit

git revert: Undo the changes made by a specific commit by creating a new commit (previous commits retained)

  • Help: git revert --help
  • Syntax: git revert --no-edit <commit_hash>
    • The --no-edit option means that you will use the default message for the revert commit
      • If you run command without --no-edit, you’ll be taken to a screen where you have a chance to edit the commit message of the new commit. Just enter :q to use the default message.
  • Result:
    • Reverts the specified <commit_hash> (only that commit); does this by creating a new commit that undoes the changes introduced by <commit_hash>
    • Previous commits retained in case you want to access them
    • Changes made by the reverted commit are removed from the files in the working directory
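
The sketch below illustrates that git revert undoes one commit rather than everything after it. It makes three commits that each add a separate file (the file names are invented for the demo), then reverts the second commit. The scratch directory and placeholder identity are assumptions.

```shell
# Scratch repository in a temporary directory (illustrative setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

# Three commits, each adding its own file
echo "library(tidyverse)" > first.R
git add first.R
git commit -q -m "add first.R"

echo "mpg %>% head(5)" > second.R
git add second.R
git commit -q -m "add second.R"
second_hash=$(git rev-parse HEAD)   # remember the hash of the 2nd commit

echo "summary(mpg)" > third.R
git add third.R
git commit -q -m "add third.R"

# Revert only the 2nd commit; this creates a 4th commit
git revert --no-edit "$second_hash"

git log --oneline
ls
```

After the revert, second.R is gone, third.R is untouched, and the log shows four commits: the three originals plus the new revert commit.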

The difference between git revert and git reset (see figure below):

  • git reset removes a previous commit, so there will be no record of this commit in git log, it’ll be like it never happened
  • git revert does not remove any previous commits; rather, it creates a new commit (e.g., commit 3) that changes things back to the way they were prior to some commit (e.g., prior to commit 2)
  • When working in a collaborative project where multiple users are contributing to a remote repository, you may want to use git revert so that it does not permanently erase history
  • When you are working locally and want to undo commits that you have not yet pushed to a remote, then git reset may also be an option

Credit: NUKE Designs, Git revert


Example: Using git revert to revert a commit
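cd ~  # change to home directory
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`
cd my_git_repo
git init

# First, create new R script
echo "library(tidyverse)" > create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "add 1st line to create_dataset.R"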
# Modify R script
echo "mpg %>% head(5)" >> create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "add 2nd line to create_dataset.R"
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [main da3a943] add 2nd line to create_dataset.R
##  1 file changed, 1 insertion(+)
# View commit log
git log

# this code retrieves the hash ID associated with the most recent commit
git rev-parse HEAD
## commit da3a94323d5341dccd1aac22ca89afe28a7ff53c
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:56 2023 -0800
## 
##     add 2nd line to create_dataset.R
## 
## commit 7e83925fc96e8c87e4b4e1e3ea96d7001d89607c
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:56 2023 -0800
## 
##     add 1st line to create_dataset.R
## da3a94323d5341dccd1aac22ca89afe28a7ff53c
# Specify the hash ID of the unwanted commit
git revert --no-edit $(git rev-parse HEAD)  # git rev-parse retrieves latest commit hash

# View commit log; note, now there are three commits
git log
## [main 4affe54] Revert "add 2nd line to create_dataset.R"
##  Date: Thu Feb 9 15:45:57 2023 -0800
##  1 file changed, 1 deletion(-)
## commit 4affe54b218c8e821023695ee58b6146011ed9b2
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:57 2023 -0800
## 
##     Revert "add 2nd line to create_dataset.R"
##     
##     This reverts commit da3a94323d5341dccd1aac22ca89afe28a7ff53c.
## 
## commit da3a94323d5341dccd1aac22ca89afe28a7ff53c
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:56 2023 -0800
## 
##     add 2nd line to create_dataset.R
## 
## commit 7e83925fc96e8c87e4b4e1e3ea96d7001d89607c
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:45:56 2023 -0800
## 
##     add 1st line to create_dataset.R
# The file now only contains the 1st line
cat create_dataset.R
## library(tidyverse)

8 Branching

What is a branch?

  • A branch is an “independent line of development” that “isolates your work from that of other team members” (Using branches tutorial)
  • By default, a git repository “has one branch named [main] which is considered to be the definitive branch.”
    • Default branches used to be named [master]
    • Starting in 2020, GitHub transitioned to making [main] the default branch name (see https://github.com/github/renaming)
    • For some of you the default branch may still be named [master]
  • We use [other] branches to “experiment and make edits before committing them to [main]” (Hello World tutorial)
  • In the figure below, each circle represents a commit. There are three branches - the main branch and 2 other branches (little feature and big feature) that “branched” off main.
  • “When you create a branch off the [main] branch, you’re making a copy, or snapshot, of [main] as it was at that point in time”
    • Therefore, a branch can be thought of as a “pointer to a single commit”

Credit: Modified from W3 docs, Git branch


Defining branches in terms of commits:

  • People often define a branch as “a pointer to a single commit”
    • In programming, a “pointer” is a variable/object that stores the address of other variables or objects in memory
  • Recall that a commit is a snapshot of a repository at a particular point in time
    • “references”
      • Each commit also stores connections (referred to as “references”) between the current commit and previous commits
    • “ancestors”
      • for a given commit, “ancestors” are all previous commits
    • “parent”
      • for a given commit, the “parent” is the most-recent previous commit
      • for a given commit, the hash of the “parent” commit is the “reference” between the current commit and the most recent previous commit
  • The figure below shows the relationship between commits, references, and branches
  • Below, commits 1, 2, 3, and 4 are made to the main branch, prior to the creation of branch 1
    • When we make commit 2, we create reference 1, which is a pointer from commit 2 to commit 1
  • commit 4 is the last commit made to the main branch prior to the creation of branch 1
  • We can think of branch 1 as a pointer to commit 4
  • When we make additional commits to branch 1 (e.g., commit 5, commit 7) we also create references to the previous commit
    • For example, commit 5 creates reference 4, which is a pointer from commit 5 to commit 4


Credit: Modified from Mastering git branches by Henrique Mota
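
The "branch as a pointer to a commit" idea can be checked directly with git rev-parse, which prints the commit hash a branch name points to. This is a minimal sketch in a scratch repository; the temporary directory and placeholder identity are assumptions for the demo.

```shell
# Scratch repository in a temporary directory (illustrative setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "first commit"
git branch -M main                  # make sure the branch is named main

# A new branch starts out pointing at the same commit as main
git branch dev
main_hash=$(git rev-parse main)
dev_hash=$(git rev-parse dev)
echo "main -> $main_hash"
echo "dev  -> $dev_hash"

# A commit on dev moves the dev pointer forward; main stays put
git checkout -q dev
echo "mpg %>% head(5)" >> create_dataset.R
git commit -q -am "second commit (on dev)"
dev_hash_after=$(git rev-parse dev)
dev_parent=$(git rev-parse dev~1)   # parent of the newest commit on dev
echo "dev now  -> $dev_hash_after"
echo "its parent: $dev_parent"
```

Right after git branch dev, both names print the same hash. After the commit on dev, dev points to a new hash whose parent is the commit main still points to.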


Commit objects show relationship between commits and “references”

  • In the below example [output hard-coded], we print the contents of a commit object (the second commit made to a repo)
  • The second commit has the hash c281829157334317e93172c8acca20ffaaa59cff, let’s call it commit2
  • In printing the contents of the commit object for commit2:
    • the “parent” commit has hash b764bdbcfbe2009f01e89283cbbf35c95b9e2ad6, which is the hash for commit1
  • Using the language of branching, we could say that making commit2 creates the “reference” to commit1
    • The reference is the “parent” commit hash that is located inside the commit object for commit2
cd ~/my_git_repo

echo 'Print commit hash for latest commit:'
git rev-parse HEAD

echo ''
echo 'Print contents of most recent commit object:'
echo ''
git cat-file -p $(git rev-parse HEAD)
## Print commit hash for latest commit:
## c281829157334317e93172c8acca20ffaaa59cff
## 
## Print contents of most recent commit object:
## 
## tree 524db779f0a3e3b3b353b522285c7da4830e21f1
## parent b764bdbcfbe2009f01e89283cbbf35c95b9e2ad6
## author Ozan Jaquette <ozanj@ucla.edu> 1643410447 -0800
## committer Ozan Jaquette <ozanj@ucla.edu> 1643410447 -0800
## 
## second commit
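
As a quick check of the "parent" idea, the sketch below pulls the parent hash out of the commit object and compares it with HEAD~1, which is git's shorthand for "the parent of HEAD". It also shows that the very first commit has no parent line. The scratch directory and placeholder identity are assumptions for the demo.

```shell
# Scratch repository in a temporary directory (illustrative setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "first commit"

echo "mpg %>% head(5)" >> create_dataset.R
git add create_dataset.R
git commit -q -m "second commit"

# Parent hash as stored inside the commit object of the 2nd commit
parent_in_object=$(git cat-file -p HEAD | grep '^parent ' | cut -d ' ' -f 2)

# Parent hash as resolved by git from the name HEAD~1
parent_by_name=$(git rev-parse HEAD~1)

# The first (root) commit has no parent line at all
root_parents=$(git cat-file -p HEAD~1 | grep -c '^parent ')

echo "parent stored in commit2: $parent_in_object"
echo "hash of HEAD~1:           $parent_by_name"
echo "parent lines in commit1:  $root_parents"
```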


Why use branches?

  • Branching is a means of working on different versions of files in a repository at one time
    • Branching can be useful for “solo developer” projects (e.g., a PhD dissertation that does not build on an existing project) but is essential for collaborative projects
    • In collaborative projects, it is common for several programmers to share and work on the same programming scripts
  • Example in programming/software development world:
    • Imagine your bank is creating a new mobile banking app
    • Some programmers are fixing a bug in how the app imports data from your account
    • Other programmers are developing a new feature (e.g., allowing users to use Venmo to transfer funds)
    • “With so much going on, there needs to be a system in place for managing different versions of the same code base.”
  • Example from social science research world:
    • In the Unrollment Project we are exploring potential bias in alternative algorithms to predict student success
    • We have a file predict_grad.Rmd that reads in secondary data, creates analysis variables, runs alternative models for predicting the probability of obtaining a BA
    • Several collaborators are working on different parts of predict_grad.Rmd. For example, one person writing functions to clean data and create analysis variables and another person writing functions to run models and store model results.
    • Need a way for multiple people to work on predict_grad.Rmd at the same time
  • Typically, there is one main branch (usually main) that contains approved changes. All other development and testing is usually done on separate branches.
  • Standard “good practice” in the programming world
    • the “main” branch is sacred! don’t do your work on the main branch
    • do your day-to-day work on separate “development” branches
      • once you finish a task, or get the code the way you want it, then “merge” these changes from your development branch to the main branch
  • It is good practice to work on branches, then “merge” back to main at key points

8.1 git branch

git branch: List, create, or delete branches

  • Help: git branch --help
  • List existing branches (default: only local branches)
    • syntax: git branch [<option(s)>]
      • There will be a * next to your current branch
      • Common options:
        • -a: List all branches, both local and remote (remote branches will start with remotes/)
        • -r: List only remote branches
        • -v: Display details about latest commits next to each branch
  • Create new local branch
    • syntax: git branch <branch_name>
  • Delete local branch
    • syntax: git branch -d <branch_name>
    • You must be on a branch different than the one you want to delete
  • Rename/move branch
    • Options:
      • -m or --move: Move/rename a branch
      • -f or --force: force; in combination with -m (move), “allow renaming the branch even if the new branch name already exists”
      • -M: shortcut for -m -f (move and force)
    • sample syntax: git branch -M <new_branch_name>
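
The sketch below tries out two of the rules listed above: git refuses to delete the branch you are currently on, and -m can rename a branch other than the current one when given both the old and the new name. The branch name feature, the scratch directory, and the placeholder identity are assumptions for the demo.

```shell
# Scratch repository in a temporary directory (illustrative setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "import tidyverse in create_dataset.R"
git branch -M main                  # make sure the branch is named main

git branch dev                      # create a second branch

# Deleting the branch we are on is refused (non-zero exit status)
if git branch -d main 2>/dev/null; then
  delete_current="succeeded"
else
  delete_current="refused"
fi
echo "deleting current branch: $delete_current"

# Rename dev to feature, then delete it while still on main
git branch -m dev feature
git branch -d feature

remaining=$(git branch --format='%(refname:short)')
echo "remaining branches: $remaining"
```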

Example: Using git branch to list branches

Let’s create a new git repository in the example below. Note that we will not be able to list branches until we’ve made at least 1 commit:

# Initialize a new git repository in `my_git_repo` directory
cd my_git_repo
git init

# Note that you won't be able to list branches until you've made at least 1 commit
git branch
## Initialized empty Git repository in C:/Users/ozanj/Documents/my_git_repo/.git/
# Create new R script
echo "library(tidyverse)" > create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "import tidyverse in create_dataset.R"
git branch
## * main


We can use the -v option to list branches with more details about the latest commit on each branch:

# See detailed branch listing
git branch -v
## * main 488ccc8 import tidyverse in create_dataset.R


The -a option will list both local and remote branches.

  • Remote branches will start with remotes/ in the output.
  • They will also include both the remote repository name and the branch name (e.g., origin/main in the example below).
  • In addition to remote branches, we’ll also see the remote HEAD listed and where it’s pointing to (e.g., remote HEAD is pointing to remote main branch in the example below):
  • In the example below (code not run and the output is hard-coded), this is what the output would look like after we connected a local repo to a remote repo:
# List local and remote branches
git branch -a
## * main
##   remotes/origin/HEAD -> origin/main
##   remotes/origin/main


To list only information on remote branches, we can use the -r option.

  • Notice that the names do not have remotes/ prepended, as that only appears when listing all branches using -a to be able to distinguish between local and remote branches:
  • code not run; output hard-coded:
# List only remote branches
git branch -r
##   origin/HEAD -> origin/main
##   origin/main

Example: Using git branch to create new branch
# See branch listing
git branch
## * main
# Create new branch
git branch dev
# See branch listing
git branch
##   dev
## * main

Example: Using git branch to delete branch


A common practice around creating/deleting branches (but don’t have to do things this way)

  • A programmer creates a new branch – let’s name it dev for “development” – to make some improvement to a script.
  • Once the programmer is satisfied with this improvement, they “merge” the changes from the dev branch to the main branch (merging covered below).
  • Then, they delete the dev branch.
# See branch listing
git branch
##   dev
## * main
# Delete branch
git branch -d dev
## Deleted branch dev (was 488ccc8).
# See branch listing
git branch
## * main


Example: Using git branch -M to force rename a branch


Recall that the option -M is a shortcut for -m (move) and -f (force)

  • You may have seen the following code in this lecture or elsewhere: git branch -M main
    • This code is saying: rename the current branch to “main” even if a branch named “main” already exists
    • We inserted this code because some computers initialize git repos with the default branch name “master” rather than “main”

In the below example, we initialize a repo, make a commit, then force rename the branch name

cd ~  # change to home directory
rm -rf my_git_repo2  # force remove `my_git_repo2` (if it exists)
mkdir my_git_repo2  # make directory `my_git_repo2`

# Initialize a new git repository in `my_git_repo2` directory
cd my_git_repo2
git init

# Create new R script
echo "library(tidyverse)" > create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "import tidyverse in create_dataset.R"
## Initialized empty Git repository in C:/Users/ozanj/Documents/my_git_repo2/.git/
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [main (root-commit) 0d4de44] import tidyverse in create_dataset.R
##  1 file changed, 1 insertion(+)
##  create mode 100644 create_dataset.R

List existing branches; note default branch was named “main”

cd ~/my_git_repo2

# See branch listing
git branch
## * main

Use git branch -M to force rename a branch

cd ~/my_git_repo2

# Force rename current branch from current name to branch1
git branch -M branch1

List existing branches

# See branch listing
git branch
## * branch1

8.2 git checkout

git checkout: Switch branches

  • Help: git checkout --help
  • Syntax:
    • git checkout <branch_name>: Switches to an existing branch named branch_name
    • git checkout -b <branch_name>: Creates a new branch named branch_name and switches to it

Credit: Modified from Pham Quy, Git tutorial
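
Switching branches also changes which files you see in the working directory. The sketch below commits a new file on dev and then switches back to main, where that file does not exist. The file name analysis.R, the scratch directory, and the placeholder identity are assumptions for the demo.

```shell
# Scratch repository in a temporary directory (illustrative setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "import tidyverse in create_dataset.R"
git branch -M main                  # make sure the branch is named main

# Create dev and switch to it in one step
git checkout -q -b dev
echo "summary(mpg)" > analysis.R
git add analysis.R
git commit -q -m "add analysis.R on dev"
files_on_dev=$(ls)

# Switch back to main; analysis.R disappears from the working directory
git checkout -q main
files_on_main=$(ls)
current=$(git rev-parse --abbrev-ref HEAD)   # name of the current branch

echo "current branch: $current"
echo "files on dev:   $files_on_dev"
echo "files on main:  $files_on_main"
```

Nothing is lost here: analysis.R is stored in the commit on dev and reappears as soon as you run git checkout dev again.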


Example: Using git checkout to create a new branch and switch to it
# Force rename current branch to "main"
git branch -M main

# See branch listing
git branch
## * main
# Create new branch and switch to it
git checkout -b dev
## Switched to a new branch 'dev'
# See branch listing
git branch
## * dev
##   main

Example: Using git checkout to switch to an existing branch
# See branch listing
git branch
## * dev
##   main
# Switch to an existing branch
git checkout main
## Switched to branch 'main'
# See branch listing
git branch
##   dev
## * main

8.3 Pushing local branch

What is an upstream branch?

  • An upstream branch is the remote branch that your local branch is tracking (i.e., where it will push to and pull from)
  • When you push a new local branch to the remote for the first time, you will need to set the upstream branch using git push --set-upstream <remote_name> <branch_name> (or equivalently, git push -u <remote_name> <branch_name>)
    • We saw this earlier in the lecture, when we initialized a new git repository on our local machine and wanted to push our main branch to GitHub for the first time
  • The figure below summarizes the connection between a local and remote branch. We see that the local dev branch is tracking the remote dev branch (i.e., the upstream branch). Recall that under the hood, we also have a local copy of the remote repository, so origin/dev here is this local, remote-tracking branch.

Credit: devconnected, How To Set Upstream Branch on Git
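
You can try out upstream branches without a GitHub account by using a bare repository on your own machine as the "remote". This is only a stand-in for GitHub; the directory names remote.git and local, the scratch directory, and the placeholder identity are assumptions for the demo.

```shell
# Scratch area holding a fake "remote" and a local repository
work=$(mktemp -d)
git init -q --bare "$work/remote.git"   # bare repo standing in for GitHub
git init -q "$work/local"
cd "$work/local"
git config user.name "Example User"       # placeholder identity for this demo
git config user.email "user@example.com"  # placeholder identity for this demo

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "import tidyverse in create_dataset.R"
git branch -M main                  # make sure the branch is named main

# Connect the local repo to the stand-in remote and push main
git remote add origin "$work/remote.git"
git push -q -u origin main

# Create dev, then push it and set its upstream in one step
git checkout -q -b dev
git push -q -u origin dev

# Ask git which remote-tracking branch dev is tracking
upstream=$(git rev-parse --abbrev-ref "dev@{upstream}")
echo "upstream of dev: $upstream"
git branch -vv
```

git branch -vv lists each local branch with its upstream in square brackets, which is a convenient way to check what a plain git push will push to.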


Example: Pushing local branch to the remote

When you create a new local branch, you may choose to push it to the remote if you want a copy of it on GitHub, or if you want others to be able to contribute to it. When you push a local branch for the first time, you are required to set the upstream branch, otherwise it won’t let you push. Then, all subsequent pushes after this first one can just be git push.

In the below code chunks (code not run), we do the following:

  • Initialize a new local repo
  • create a file and make a commit
  • create a new branch named dev
  • use git remote add <remote_name> <remote_url> to connect to a remote repo (assume you created this remote repo on github.com)
    • note: for this step, it doesn’t matter which branch you are currently on; the remote_url will be the same for all branches
  • use git push -u origin main to push local branch named “main” to the remote for the first time
  • switch to dev branch and use git push -u origin dev to push local branch named “dev” to the remote for the first time


Initialize new repo, make a commit, create a new branch named dev

cd ~  # change to home directory
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`

# Initialize a new git repository in `my_git_repo` directory
cd my_git_repo
git init

cd ~/my_git_repo

# Create new R script
echo "library(tidyverse)" > create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "import tidyverse in create_dataset.R"

git branch

# Create and switch to new branch
git checkout -b dev
## Initialized empty Git repository in C:/Users/ozanj/Documents/my_git_repo/.git/
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [main (root-commit) 7c1e136] import tidyverse in create_dataset.R
##  1 file changed, 1 insertion(+)
##  create mode 100644 create_dataset.R
## * main
## Switched to a new branch 'dev'

After creating repo on GitHub, use git remote add to add a new remote

cd ~/my_git_repo

git checkout main # switch to main branch; fine if we don't though

#git branch
git remote add origin https://github.com/ozanj/my_git_repo.git
git remote -v
Switch to main branch and push changes [if any] to remote repository
cd ~/my_git_repo

# switch to main branch
git checkout main

git push -u origin main # we add -u origin main because this is the first time we are connecting local main to remote main
## Enumerating objects: 3, done.
## Counting objects: 100% (3/3), done.
## Writing objects: 100% (3/3), 259 bytes | 259.00 KiB/s, done.
## Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
## To https://github.com/ozanj/my_git_repo.git
##  * [new branch]      main -> main
## Branch 'main' set up to track remote branch 'main' from 'origin'.

Show local and remote branches

cd ~/my_git_repo
git branch -a
##   dev
## * main
##   remotes/origin/main

Switch to dev branch and push changes to remote repository (code not run)

cd ~/my_git_repo

# switch to dev branch
git checkout dev

# print local and remote branches
git branch -a

# push branch to remote for the first time
git push -u origin dev
## Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
## remote:
## remote: Create a pull request for 'dev' on GitHub by visiting:
## remote:      https://github.com/ozanj/my_git_repo/pull/new/dev
## remote:
## To https://github.com/ozanj/my_git_repo.git
##  * [new branch]      dev -> dev
## Branch 'dev' set up to track remote branch 'dev' from 'origin'.

Show local and remote branches

cd ~/my_git_repo
git branch -a
## * dev
##   main
##   remotes/origin/dev
##   remotes/origin/main


9 Merging

What is a merge?

  • The goal of a merge is to “integrate changes from multiple branches into one [branch]” (An introduction to Git merge and rebase)
  • Changes from a “target branch” can be merged into your “current branch”
  • It is good practice to make changes on separate branches (e.g., develop branch), then once they look good, merge them back into the main branch

Credit: Modified from Eduard Lebedyuk


Merge terminology:

  • “Current branch”
    • Branch you are currently working with
    • The branch will be updated/modified by the merge with the “target branch”
    • In the figure above, the main branch is the “current branch”
  • “Target branch”
    • Branch that will be merged into the “current branch”
    • Target branch will be unaffected by the merge
    • Often, programmers delete the target branch after merging with the current branch
    • In the figure above, “develop” is the target branch


How programmers use branches and merges in day-to-day work:

  • Typically, programmers do all work on branches rather than the main branch
  • Branches are created for specific tasks (e.g., fixing a bug, adding a new feature)
    • “Short-lived topic branches, in particular, keep teams focused” (Git Handbook)
    • Once the specific task is completed, the topic branch is merged into the main branch and then the topic branch is often deleted


Types of merges:

  • Fast-forward merge
    • If after branching, changes are only made to the “target branch,” then merging changes from the “target branch” back to the “current branch” will be a fast-forward merge
    • In other words, the “current branch” will gain all the new changes from the “target branch” after the merge: its branch pointer is simply moved forward (“fast-forwarded”) to the most recent commit on the “target branch,” and no new commit is created
  • 3-way merge
    • If after branching, changes are made to both the “target branch” and “current branch” (i.e., the branches have diverged), then Git will attempt to combine all changes in a 3-way merge
    • The 3-way merge looks at the latest commits on both branches and their common ancestor, then attempts to create a new merge commit that combines all changes
    • Git is able to combine changes made to the same file if the changes are not made to the same line
    • Otherwise, a merge conflict occurs and would have to be resolved manually

Credit: Modified from Atlassian, Git merge
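
Before merging, you can ask Git which kind of merge to expect. The sketch below is one possible check, not the only one: it uses git merge-base --is-ancestor, which succeeds when the first commit is an ancestor of the second. The repository path, user name/email, and the file notes.txt are hypothetical.

```shell
# Throwaway repository (hypothetical path created by mktemp)
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "initial commit"
git branch -M main                 # make sure the first branch is called main

# Commit only on the target branch (dev)
git checkout -q -b dev
echo "mpg %>% head(5)" >> create_dataset.R
git commit -q -am "change on dev"
git checkout -q main

# If main is an ancestor of dev, merging dev into main can fast-forward
if git merge-base --is-ancestor main dev; then
  merge_type="fast-forward"
else
  merge_type="3-way"
fi
echo "Before diverging: $merge_type"

# Now commit on main as well, so that the two branches diverge
echo "# notes" > notes.txt
git add notes.txt
git commit -q -m "change on main"

if git merge-base --is-ancestor main dev; then
  merge_type_after="fast-forward"
else
  merge_type_after="3-way"
fi
echo "After diverging: $merge_type_after"
```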

9.1 git merge

git merge: Merge branches

  • Help: git merge --help
  • Syntax:
    • git merge <branch_name>: All changes from branch_name will be merged into the current branch
    • git merge --abort: If a conflict arises during the merge, this can be run to cancel the merge and restore the current branch and working directory to their pre-merge state

Example: Using git merge for fast-forward merge

Initialize new repo, make a commit, create a new branch named dev

cd ~  # change to home directory
rm -rf my_git_repo  # force remove `my_git_repo` (if it exists)
mkdir my_git_repo  # make directory `my_git_repo`

# Initialize a new git repository in `my_git_repo` directory
cd my_git_repo
git init

cd ~/my_git_repo

# Create new R script
echo "library(tidyverse)" > create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "import tidyverse in create_dataset.R"

git branch

# Create and switch to new branch
git checkout -b dev
## Initialized empty Git repository in C:/Users/ozanj/Documents/my_git_repo/.git/
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [main (root-commit) e6e8fc4] import tidyverse in create_dataset.R
##  1 file changed, 1 insertion(+)
##  create mode 100644 create_dataset.R
## * main
## Switched to a new branch 'dev'

Continuing from previous examples, we have the main and dev branches, which are even with the same initial commit:

git checkout main
# View commit log for `main` branch
git log
## Switched to branch 'main'
## commit e6e8fc432c85f0307e10f59b31379aea430e7adb
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:04 2023 -0800
## 
##     import tidyverse in create_dataset.R
# Switch to `dev` branch
git checkout dev

# View commit log for `dev` branch
git log
## Switched to branch 'dev'
## commit e6e8fc432c85f0307e10f59b31379aea430e7adb
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:04 2023 -0800
## 
##     import tidyverse in create_dataset.R
# View content of R script, which is the same on both `main` and `dev` branches
cat create_dataset.R
## library(tidyverse)


Now, let’s make a second commit on the dev branch:

git branch

# Modify R script
echo "mpg %>% head(5)" >> create_dataset.R
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "manipulate mpg dataset"
## * dev
##   main
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [dev 745d052] manipulate mpg dataset
##  1 file changed, 2 insertions(+)
# View commit log for `dev` branch
git log
## commit 745d0529d6a9c8f93ddcba1d6b5f6576bf10b2a4
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:06 2023 -0800
## 
##     manipulate mpg dataset
## 
## commit e6e8fc432c85f0307e10f59b31379aea430e7adb
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:04 2023 -0800
## 
##     import tidyverse in create_dataset.R


Let’s switch back to the main branch and merge in dev. Since the dev branch is ahead of main by 1 commit, the changes can be combined using a fast-forward merge:

# Switch to `main` branch
git checkout main

# Merge `dev` branch into `main`
git merge dev
## Switched to branch 'main'
## Updating e6e8fc4..745d052
## Fast-forward
##  create_dataset.R | 2 ++
##  1 file changed, 2 insertions(+)
# The commit log for `main` now matches the `dev` branch
git log
## commit 745d0529d6a9c8f93ddcba1d6b5f6576bf10b2a4
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:06 2023 -0800
## 
##     manipulate mpg dataset
## 
## commit e6e8fc432c85f0307e10f59b31379aea430e7adb
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:04 2023 -0800
## 
##     import tidyverse in create_dataset.R
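
If you want Git to refuse anything other than a fast-forward, the --ff-only flag can be added to git merge. The sketch below repeats a shortened version of the example in a throwaway repository (hypothetical path and user name/email) and then confirms that main and dev point to the same commit.

```shell
# Throwaway repository (hypothetical path created by mktemp)
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "import tidyverse in create_dataset.R"
git branch -M main

# Second commit is made on dev only
git checkout -q -b dev
echo "mpg %>% head(5)" >> create_dataset.R
git commit -q -am "manipulate mpg dataset"

# --ff-only makes the merge fail instead of creating a merge commit
git checkout -q main
git merge -q --ff-only dev

# After a fast-forward, both branch names point to the same commit
main_tip="$(git rev-parse main)"
dev_tip="$(git rev-parse dev)"
n_commits="$(git rev-list --count main)"
echo "main: $main_tip"
echo "dev:  $dev_tip"
echo "commits on main: $n_commits"
```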


Let’s examine the git object associated with the commit:

# Commit object hash
git rev-parse HEAD # git rev-parse retrieves latest commit hash

git cat-file -t $(git rev-parse HEAD) # type = commit
git cat-file -p $(git rev-parse HEAD)
## 745d0529d6a9c8f93ddcba1d6b5f6576bf10b2a4
## commit
## tree 6de1187f46bbf4d76cafca7c0e5d3d61db6b5a53
## parent e6e8fc432c85f0307e10f59b31379aea430e7adb
## author Ozan Jaquette <ozanj@ucla.edu> 1675986366 -0800
## committer Ozan Jaquette <ozanj@ucla.edu> 1675986366 -0800
## 
## manipulate mpg dataset

Examine the “tree” object associated with the commit:

git cat-file -t 6de1187f46bbf4d76cafca7c0e5d3d61db6b5a53 # type = tree
git cat-file -p 6de1187f46bbf4d76cafca7c0e5d3d61db6b5a53
## tree
## 100644 blob 490ec1c138021b8d5c196c26a2a7b3de69afc2d1 create_dataset.R

Examine the “blob” object (file) associated with the commit:

git cat-file -t 490ec1c138021b8d5c196c26a2a7b3de69afc2d1 # type = blob
git cat-file -p 490ec1c138021b8d5c196c26a2a7b3de69afc2d1
## blob
## library(tidyverse)
## mpg %>% head(5)
## df <- mpg %>% filter(year == 2008)

Examine the “parent” object associated with this commit:

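# Parent commit hash
# (the last entry of `git rev-list HEAD` is the oldest commit, which is the parent here because the history has only two commits)
git rev-list HEAD | tail -n 1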

git cat-file -t $(git rev-list HEAD | tail -n 1) # type = commit
git cat-file -p $(git rev-list HEAD | tail -n 1)
## e6e8fc432c85f0307e10f59b31379aea430e7adb
## commit
## tree cb70185218351236255cdea1297210ceeaf6e3b5
## author Ozan Jaquette <ozanj@ucla.edu> 1675986364 -0800
## committer Ozan Jaquette <ozanj@ucla.edu> 1675986364 -0800
## 
## import tidyverse in create_dataset.R
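
Git's revision syntax offers shortcuts for reaching the same objects without copying hashes by hand. The following is a minimal sketch in a throwaway repository (hypothetical path and user name/email): HEAD^ names the first parent of HEAD, HEAD^{tree} names its tree, and HEAD:<path> names the blob for a file in that commit.

```shell
# Throwaway repository with two commits (hypothetical path from mktemp)
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "import tidyverse in create_dataset.R"
echo "mpg %>% head(5)" >> create_dataset.R
git commit -q -am "manipulate mpg dataset"

# First parent of the latest commit
parent="$(git rev-parse 'HEAD^')"

# Root commit (the commit with no parents); same as the parent in this tiny history
root="$(git rev-list --max-parents=0 HEAD)"

# Object types reached through revision syntax
tree_type="$(git cat-file -t 'HEAD^{tree}')"
blob_type="$(git cat-file -t HEAD:create_dataset.R)"
echo "parent: $parent"
echo "types:  $tree_type, $blob_type"

# Print the file content stored in the latest commit
git cat-file -p HEAD:create_dataset.R
```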

Example: Using git merge for 3-way merge

Continuing from previous examples, we have the main and dev branches, which are even with the same two commits:

# View commit log for `main` branch
git log
## commit 745d0529d6a9c8f93ddcba1d6b5f6576bf10b2a4
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:06 2023 -0800
## 
##     manipulate mpg dataset
## 
## commit e6e8fc432c85f0307e10f59b31379aea430e7adb
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:04 2023 -0800
## 
##     import tidyverse in create_dataset.R
git branch

# View content of R script on the `main` branch
cat create_dataset.R
##   dev
## * main
## 
## library(tidyverse)
## mpg %>% head(5)
## df <- mpg %>% filter(year == 2008)


Now, let’s suppose the two branches diverge, both making changes to the R script:

# Modify R script
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(10)" >> create_dataset.R  # this line is modified
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R

# Add and commit changes
git add create_dataset.R
git commit -m "update head() on line 2" 
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [main 118f803] update head() on line 2
##  1 file changed, 1 insertion(+), 1 deletion(-)

View updated content of R script on the main branch, which now shows head(10) instead of head(5):

git branch

# View content of R script
cat create_dataset.R
##   dev
## * main
## 
## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)


Switch to the dev branch and make a change to the file create_dataset.R:

# Switch to `dev` branch
git checkout dev

# Modify R script
echo "df <- df %>% filter(manufacturer == 'audi')" >> create_dataset.R  # add new line

# Add and commit changes
git add create_dataset.R
git commit -m "add additional filter() on line 4" 
## Switched to branch 'dev'
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [dev a6b4653] add additional filter() on line 4
##  1 file changed, 1 insertion(+)

View updated content of R script on the dev branch, which now has an additional filter() line at the end:

git branch

# View content of R script 
cat create_dataset.R
## * dev
##   main
## 
## library(tidyverse)
## mpg %>% head(5)
## df <- mpg %>% filter(year == 2008)
## df <- df %>% filter(manufacturer == 'audi')


Before we attempt to merge main and dev branches, we can use git diff to compare the two branches:

  • Syntax: git diff <branch1_name> <branch2_name>
# View diff between `main` and `dev` branches
git diff main dev
## diff --git a/create_dataset.R b/create_dataset.R
## index da2f5c5..6665541 100644
## --- a/create_dataset.R
## +++ b/create_dataset.R
## @@ -1,3 +1,4 @@
##  library(tidyverse)
## -mpg %>% head(10)
## +mpg %>% head(5)
##  df <- mpg %>% filter(year == 2008)
## +df <- df %>% filter(manufacturer == 'audi')
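
The two-branch diff above mixes changes from both sides. To see only what dev has added since the branches split, one option is the three-dot form git diff main...dev, which compares dev against the common ancestor reported by git merge-base. The sketch below rebuilds the diverged history in a throwaway repository (hypothetical path and user name/email).

```shell
# Throwaway repository (hypothetical path created by mktemp)
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf '%s\n' "library(tidyverse)" "mpg %>% head(5)" \
  "df <- mpg %>% filter(year == 2008)" > create_dataset.R
git add create_dataset.R
git commit -q -m "manipulate mpg dataset"
git branch -M main
base_expected="$(git rev-parse HEAD)"
git branch dev                      # dev starts from the same commit

# main changes line 2
printf '%s\n' "library(tidyverse)" "mpg %>% head(10)" \
  "df <- mpg %>% filter(year == 2008)" > create_dataset.R
git commit -q -am "update head() on line 2"

# dev appends a line
git checkout -q dev
echo "df <- df %>% filter(manufacturer == 'audi')" >> create_dataset.R
git commit -q -am "add additional filter() on line 4"

# Common ancestor of the two branches
base="$(git merge-base main dev)"

# Two dots (or a space): tip of main versus tip of dev
two_dot_hits="$(git diff main dev | grep -F -c 'head(10)')"

# Three dots: common ancestor versus tip of dev (changes made on dev only)
three_dot_hits="$(git diff main...dev | grep -F -c 'head(10)')"
git diff main...dev
```

In this sketch the three-dot diff does not mention head(10), because that change was made on main rather than on dev.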


Let’s switch back to the main branch and merge in dev. Since both branches made changes to the R script on different lines, the changes can be combined without any conflicts via a 3-way merge:

# Switch to `main` branch
git checkout main

# Merge changes from `dev` into `main`
git merge dev
## Switched to branch 'main'
## Auto-merging create_dataset.R
## Merge made by the 'ort' strategy.
##  create_dataset.R | 1 +
##  1 file changed, 1 insertion(+)
# View commit log - note that a new merge commit was created during the 3-way merge
git log
## commit d1264b137e1dee8040cbf249b29fb626c054db55
## Merge: 118f803 a6b4653
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:11 2023 -0800
## 
##     Merge branch 'dev'
## 
## commit a6b4653b12ed97c4e608f4d01cc297a4ab0bf4a1
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:10 2023 -0800
## 
##     add additional filter() on line 4
## 
## commit 118f803e5d1291e6ccf82659b4105b420fc80985
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:09 2023 -0800
## 
##     update head() on line 2
## 
## commit 745d0529d6a9c8f93ddcba1d6b5f6576bf10b2a4
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:06 2023 -0800
## 
##     manipulate mpg dataset
## 
## commit e6e8fc432c85f0307e10f59b31379aea430e7adb
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:04 2023 -0800
## 
##     import tidyverse in create_dataset.R
# View merged content of R script
cat create_dataset.R
## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)
## df <- df %>% filter(manufacturer == 'audi')
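
A merge commit differs from an ordinary commit in that it has two parents. The sketch below repeats the 3-way merge in a throwaway repository (hypothetical path and user name/email), draws the history with git log --graph, and counts the parents of the merge commit.

```shell
# Throwaway repository (hypothetical path created by mktemp)
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf '%s\n' "library(tidyverse)" "mpg %>% head(5)" \
  "df <- mpg %>% filter(year == 2008)" > create_dataset.R
git add create_dataset.R
git commit -q -m "manipulate mpg dataset"
git branch -M main
git branch dev

# Diverge: main edits line 2, dev appends a new line
printf '%s\n' "library(tidyverse)" "mpg %>% head(10)" \
  "df <- mpg %>% filter(year == 2008)" > create_dataset.R
git commit -q -am "update head() on line 2"
git checkout -q dev
echo "df <- df %>% filter(manufacturer == 'audi')" >> create_dataset.R
git commit -q -am "add additional filter() on line 4"

# 3-way merge of dev into main (--no-edit accepts the default message)
git checkout -q main
git merge -q --no-edit dev

# Draw the branch structure
git log --oneline --graph

# rev-list --parents prints the commit followed by its parents
n_parents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
echo "parents of merge commit: $n_parents"
```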

9.2 git pull

git pull: Incorporate remote changes into your current branch

  • Help: git pull --help
  • Syntax:
    • git pull: This is equivalent to a git fetch followed by a git merge to incorporate remote changes to your current branch
  • Notes:
    • git fetch is useful if you want a local copy of the most up-to-date changes in the remote repository (e.g., to preview changes), but don’t actually want to merge these changes into your local repository yet. On the other hand, running git pull will directly incorporate the changes.
    • More specifically, git fetch will incorporate changes into your remote-tracking branch (e.g., origin/main, your local copy of the remote main branch) but not your local branch (e.g., your local main branch). Then, git merge can merge the change from your remote-tracking branch into your local branch.

Credit: Modified from Medium, Git Fetch vs Git Pull
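
The fetch-then-merge sequence can be tried without GitHub. The sketch below is an illustration under stated assumptions: a local bare repository stands in for the remote, a second clone plays the role of a collaborator, and all paths, names, and the README.md file are hypothetical.

```shell
# Throwaway workspace (hypothetical paths created by mktemp)
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"        # stand-in for GitHub

# Our local repository, connected to the remote
git init -q "$work/local" && cd "$work/local"
git config user.name "Example User"
git config user.email "user@example.com"
echo "library(tidyverse)" > my_script.R
git add my_script.R
git commit -q -m "initial commit"
git branch -M main
git remote add origin "$work/remote.git"
git push -q -u origin main

# A collaborator clones, commits, and pushes to the remote main branch
git clone -q -b main "$work/remote.git" "$work/collab"
git -C "$work/collab" config user.name "Collaborator"
git -C "$work/collab" config user.email "collab@example.com"
echo "# project notes" > "$work/collab/README.md"
git -C "$work/collab" add README.md
git -C "$work/collab" commit -q -m "second commit"
git -C "$work/collab" push -q origin main

# Step 1: fetch updates origin/main but leaves local main untouched
git fetch -q origin
incoming="$(git rev-list --count main..origin/main)"
echo "commits on origin/main not yet in main: $incoming"
git log --oneline main..origin/main          # preview the incoming commits

# Step 2: merge the remote-tracking branch into local main
git merge -q --ff-only origin/main
```

Running git pull in place of the last two steps would fetch and merge in one go.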


Example: Using git pull to incorporate remote changes

Let’s say your remote branch is ahead of your local branch by some commits. You can run git pull to incorporate those changes:

  • below code not run; will work if you previously connected local repo to remote
# Incorporate remote changes to current branch
git pull


After you run the command, you may see some output indicating the progress as remote changes are being fetched:

remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.


Then, the output will look something like the below:

  • The 1st line indicates which remote repository it is fetching from
  • The 2nd line displays the hash IDs of your latest commit before and after the fetch, as well as which branch you are updating with the fetch
    • In the example below, the fetch has updated our origin/main branch, which is our local copy of the remote main branch
  • The 3rd line again displays the latest commit hash before and after the update
  • The 4th line indicates what kind of merge was performed (since git pull is just git fetch followed by git merge)
  • The next lines list out which files were changed and what changes were made (i.e., how many lines were added, how many lines were deleted)
  • The final line shows a summary of the total number of files changed, the total number of lines added (i.e., insertions), and the total number of lines deleted (i.e., deletions)
From github.com:anyone-can-cook/student_lastname_firstname
   1eeaff7..6c3e46f  main     -> origin/main
Updating 1eeaff7..6c3e46f
Fast-forward
 README.md | 2 ++
 my_script.R | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

As we’ll see in the next example, the first 2 lines of the output come from git fetch being run and the remaining lines come from git merge.


Example: Using git fetch and git merge to incorporate remote changes

Running git pull essentially performs a git fetch followed by git merge. If we only want to fetch the remote changes to our local repository but not incorporate them into our current branch, we can use git fetch:

# Fetch remote changes
git fetch
From github.com:anyone-can-cook/student_lastname_firstname
   1eeaff7..6c3e46f  main     -> origin/main


We can verify that the fetch only updated our remote-tracking branch origin/main (i.e., our local copy of the remote main branch) and not our local main branch by checking the commit history of the branches.

Assuming we are currently on our local main branch, we can run git log to view the commit history. In the output, we see HEAD -> main next to the most recent commit, indicating that HEAD is pointing to this commit on the main branch:

# Check commit history of local `main` branch
git log
commit e329908682dfefba0417bd7337cc660d0d5f133d (HEAD -> main)
Author: username <email@example.com>
Date:   Fri Jan 22 11:15:50 2021 -0800

    initial commit

Next, we can check the commit log of the remote-tracking branch origin/main. In the output below, we can see that the changes have indeed been fetched to this branch, as indicated by the presence of the second commit. In parentheses next to the commits, we can again see that our local main branch still only contains the first commit while origin/main and origin/HEAD have been updated with the second. HEAD always points to the latest commit on your current (active) branch, so it also appears next to the second commit:

# Check commit history of `origin/main`
git log origin/main
commit 1eeaff75a681213890e5ce4850d17a1672a4ada6 (HEAD, origin/main, origin/HEAD)
Author: username <email@example.com>
Date:   Fri Jan 22 11:27:40 2021 -0800

    second commit

commit e329908682dfefba0417bd7337cc660d0d5f133d (main)
Author: username <email@example.com>
Date:   Fri Jan 22 11:15:50 2021 -0800

    initial commit


After we are satisfied with the fetched changes, we can manually merge them into our local main branch:

# Merge changes from `origin/main` into local `main`
git merge origin/main
Updating 1eeaff7..6c3e46f
Fast-forward
 README.md | 2 ++
 my_script.R | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

Alternatively, we could have just run git pull instead of git merge origin/main and it would’ve also merged in the changes (after performing git fetch again).


If we check the commit history on our local main branch again, we can see it has now been updated:

# Check commit history of local `main` branch
git log
commit 1eeaff75a681213890e5ce4850d17a1672a4ada6 (HEAD -> main, origin/main, origin/HEAD)
Author: username <email@example.com>
Date:   Fri Jan 22 11:27:40 2021 -0800

    second commit

commit e329908682dfefba0417bd7337cc660d0d5f133d
Author: username <email@example.com>
Date:   Fri Jan 22 11:15:50 2021 -0800

    initial commit

9.3 Merge conflicts

When attempting a git merge, two types of merge conflict can arise for two different reasons: (1) when starting a merge; and (2) during a merge (From Git merge conflicts)


First, conflicts that arise when starting a merge

  • These conflicts arise because your current branch has changes to tracked files that have not yet been committed, so Git will not let you merge another branch in until those changes have been committed (or stashed)
  • When starting a merge, Git first checks whether the working directory or staging area contains uncommitted changes that the merge would overwrite. If so, Git will abort the merge completely and display an error message that looks like this:
error: Your local changes to the following files would be overwritten by merge:
    <file_name>
Please commit your changes or stash them before you merge.
Aborting
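
As the error message suggests, stashing is an alternative to committing. The sketch below is one possible sequence, using a throwaway repository and a hypothetical helper function write_script: the uncommitted edit is set aside with git stash, the merge is run, and the edit is restored with git stash pop. The uncommitted edit here touches a different line from the merged change, so restoring it does not conflict.

```shell
# Throwaway repository (hypothetical path created by mktemp)
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: $1 = first line of the script, $2 = manufacturer
write_script() {
  printf '%s\n' "$1" "mpg %>% head(10)" \
    "df <- mpg %>% filter(year == 2008)" \
    "df <- df %>% filter(manufacturer == '$2')" > create_dataset.R
}

write_script "library(tidyverse)" audi
git add create_dataset.R
git commit -q -m "initial script"
git branch -M main

# The revision branch changes the last line
git checkout -q -b revision
write_script "library(tidyverse)" lincoln
git commit -q -am "filter for lincoln instead of audi"

# Back on main: an uncommitted edit to the first line
git checkout -q main
write_script "library(tidyverse)  # load packages" audi

# The merge is refused because it would overwrite the uncommitted edit
if git merge revision 2>/dev/null; then first_try="merged"; else first_try="refused"; fi
echo "first attempt: $first_try"

git stash -q                     # set the uncommitted edit aside
git merge -q --ff-only revision  # the merge now goes through
git stash pop -q                 # re-apply the edit on top of the merged file
cat create_dataset.R
```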


Second, conflicts that arise during the merge.

  • During a 3-way merge when both branches made changes to the same line(s) of the same file(s), a conflict will occur. The error message would look like this:
Auto-merging <file_name>
CONFLICT (content): Merge conflict in <file_name>
Automatic merge failed; fix conflicts and then commit the result.
  • If you open the conflicted file, you will see that Git has marked the line(s) that were conflicting:
<normal_line_of_code>
<normal_line_of_code>

<<<<<<< HEAD
<conflicted_line_of_code__current_branch_version>
=======
<conflicted_line_of_code__target_branch_version>
>>>>>>> <branch_name>

<normal_line_of_code>
<normal_line_of_code>

These conflicts will need to be resolved manually (described in next section), or the merge can be aborted using git merge --abort.
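
When many files are involved, it can help to list only the conflicted ones. The sketch below, run in a throwaway repository with a hypothetical helper function write_script, triggers a conflict, lists unmerged files with git diff --name-only --diff-filter=U, and then backs out with git merge --abort.

```shell
# Throwaway repository (hypothetical path created by mktemp)
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: $1 = manufacturer used on the last line
write_script() {
  printf '%s\n' "library(tidyverse)" "mpg %>% head(10)" \
    "df <- mpg %>% filter(year == 2008)" \
    "df <- df %>% filter(manufacturer == '$1')" > create_dataset.R
}

write_script audi
git add create_dataset.R
git commit -q -m "initial script"
git branch -M main

# Both branches change the same line in different ways
git checkout -q -b revision
write_script lincoln
git commit -q -am "filter for lincoln instead of audi"
git checkout -q main
write_script chevrolet
git commit -q -am "filter for chevrolet instead of audi"

# The merge stops with a conflict (non-zero exit status)
git merge revision || true

# List files that are still unmerged
conflicted="$(git diff --name-only --diff-filter=U)"
echo "conflicted files: $conflicted"

# Count the opening conflict markers in the file
markers="$(grep -c '^<<<<<<<' create_dataset.R)"

# Give up on the merge and return to the pre-merge state
git merge --abort
after_abort="$(git status --porcelain)"
```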


Example: Merge conflict when starting a merge

Continuing from previous examples, our main branch currently looks like this:

# View commit log for `main` branch
git log
## commit d1264b137e1dee8040cbf249b29fb626c054db55
## Merge: 118f803 a6b4653
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:11 2023 -0800
## 
##     Merge branch 'dev'
## 
## commit a6b4653b12ed97c4e608f4d01cc297a4ab0bf4a1
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:10 2023 -0800
## 
##     add additional filter() on line 4
## 
## commit 118f803e5d1291e6ccf82659b4105b420fc80985
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:09 2023 -0800
## 
##     update head() on line 2
## 
## commit 745d0529d6a9c8f93ddcba1d6b5f6576bf10b2a4
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:06 2023 -0800
## 
##     manipulate mpg dataset
## 
## commit e6e8fc432c85f0307e10f59b31379aea430e7adb
## Author: Ozan Jaquette <ozanj@ucla.edu>
## Date:   Thu Feb 9 15:46:04 2023 -0800
## 
##     import tidyverse in create_dataset.R
git branch

# View content of R script
cat create_dataset.R
##   dev
## * main
## 
## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)
## df <- df %>% filter(manufacturer == 'audi')


Let’s create a new branch called revision that branches off main, then make a new commit on this branch:

# Create and switch to new branch
git checkout -b revision

# Modify R script
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(10)" >> create_dataset.R
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R
echo "df <- df %>% filter(manufacturer == 'lincoln')" >> create_dataset.R  # this line is modified

# Add and commit change
git add create_dataset.R
git commit -m "filter for lincoln instead of audi"
## Switched to a new branch 'revision'
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [revision f15f183] filter for lincoln instead of audi
##  1 file changed, 1 insertion(+), 1 deletion(-)

View updated content of R script on the revision branch, which now filters for lincoln instead of audi on the last line:

git branch

# View content of R script
cat create_dataset.R
##   dev
##   main
## * revision
## 
## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)
## df <- df %>% filter(manufacturer == 'lincoln')


Back on the main branch, let’s modify the same line in the R script:

# Switch back to `main` branch
git checkout main

# Modify R script
echo "library(tidyverse)" > create_dataset.R
echo "mpg %>% head(10)" >> create_dataset.R
echo "df <- mpg %>% filter(year == 2008)" >> create_dataset.R
echo "df <- df %>% filter(manufacturer == 'chevrolet')" >> create_dataset.R  # this line is modified
## Switched to branch 'main'


Notice that we have uncommitted changes in the working directory:

# Check status
git status
## On branch main
## Changes not staged for commit:
##   (use "git add <file>..." to update what will be committed)
##   (use "git restore <file>..." to discard changes in working directory)
##  modified:   create_dataset.R
## 
## no changes added to commit (use "git add" and/or "git commit -a")


If we try to merge changes from revision into main now, there will be a merge conflict because we have uncommitted changes. The merge will be aborted:

  • note: the merge conflict exists because we have uncommitted changes in the branch we are currently on; the merge conflict here is not because the same line differs on two different branches
# Merge changes from `revision` into `main`
git merge revision
## error: Your local changes to the following files would be overwritten by merge:
##  create_dataset.R
## Please commit your changes or stash them before you merge.
## Aborting

Example: Merge conflict during a merge

Continuing from the previous example, let’s say we committed our change to create_dataset.R on the main branch:

# Add and commit change
git add create_dataset.R
git commit -m "filter for chevrolet instead of audi"
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [main d0a78e8] filter for chevrolet instead of audi
##  1 file changed, 1 insertion(+), 1 deletion(-)

View updated content of R script on the main branch, which now filters for chevrolet instead of audi on the last line:

git branch

# View content of R script
cat create_dataset.R
##   dev
## * main
##   revision
## 
## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)
## df <- df %>% filter(manufacturer == 'chevrolet')


Recall that create_dataset.R on the revision branch looks like this:

## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)
## df <- df %>% filter(manufacturer == 'lincoln')


If we try to merge changes from revision into main now, there will be a merge conflict because both branches modified the same line of the same file:

# Merge changes from `revision` into `main`
git merge revision
## Auto-merging create_dataset.R
## CONFLICT (content): Merge conflict in create_dataset.R
## Automatic merge failed; fix conflicts and then commit the result.


You can also tell which file(s) failed to merge by checking git status:

## On branch main
## You have unmerged paths.
##   (fix conflicts and run "git commit")
##   (use "git merge --abort" to abort the merge)
## 
## Unmerged paths:
##   (use "git add <file>..." to mark resolution)
## 
## both modified:   create_dataset.R
## 
## no changes added to commit (use "git add" and/or "git commit -a")


The file(s) that failed to merge will contain markers added by Git that indicate which line(s) are conflicted:

# View content of R script
cat create_dataset.R
## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)
## <<<<<<< HEAD
## df <- df %>% filter(manufacturer == 'chevrolet')
## =======
## df <- df %>% filter(manufacturer == 'lincoln')
## >>>>>>> revision

9.4 Resolving merge conflicts

What to do when you encounter a merge conflict?

  • As introduced earlier, you can use git merge --abort to abort the merge and restore the current branch to its pre-merge state
  • Alternatively, you can manually edit the file(s) to resolve the conflicts (usually in a text editor)
    • Make sure to remove the markers that Git has added (i.e., <<<<<<< HEAD, =======, >>>>>>> <branch_name>) and choose which version of the conflicted line to keep
    • git add the file(s) after you are done resolving the conflicts
    • Commit your changes using git commit -m "<commit_message>" to complete the merge

Example: Resolving a merge conflict
# View content of R script
cat create_dataset.R
## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)
## <<<<<<< HEAD
## df <- df %>% filter(manufacturer == 'chevrolet')
## =======
## df <- df %>% filter(manufacturer == 'lincoln')
## >>>>>>> revision


We can manually edit the file to resolve the conflicts, usually by opening the file in a text editor.

  • Let’s say we choose to filter for 'volkswagen' instead:
## library(tidyverse)
## mpg %>% head(10)
## df <- mpg %>% filter(year == 2008)
## df <- df %>% filter(manufacturer == 'volkswagen')


Finally, we can add and commit the file to complete the merge:

# Add/commit R script
git add create_dataset.R
git commit -m "merge revision branch"
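
If you simply want one side's version of a conflicted file rather than a hand-edited mix, git checkout --ours (current branch) or git checkout --theirs (branch being merged) can be used instead of editing. The sketch below is an alternative to the manual edit shown above, not a replacement for it; it runs in a throwaway repository with a hypothetical helper function write_script and keeps the revision branch's version.

```shell
# Throwaway repository (hypothetical path created by mktemp)
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: $1 = manufacturer used on the last line
write_script() {
  printf '%s\n' "library(tidyverse)" "mpg %>% head(10)" \
    "df <- mpg %>% filter(year == 2008)" \
    "df <- df %>% filter(manufacturer == '$1')" > create_dataset.R
}

write_script audi
git add create_dataset.R
git commit -q -m "initial script"
git branch -M main

git checkout -q -b revision
write_script lincoln
git commit -q -am "filter for lincoln instead of audi"
git checkout -q main
write_script chevrolet
git commit -q -am "filter for chevrolet instead of audi"

# Merge stops with a conflict
git merge revision >/dev/null || true

# Keep the version from the branch being merged (revision)
git checkout --theirs create_dataset.R

# Check that no conflict markers are left before staging
leftover="$(grep -c -E '^(<<<<<<<|=======|>>>>>>>)' create_dataset.R)"

# Stage the resolved file and complete the merge with the default message
git add create_dataset.R
git commit -q --no-edit
n_parents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
cat create_dataset.R
```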

10 Pull Requests

What is a pull request?

“Pull requests let you tell others about changes you’ve pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.” – GitHub Help

  • As mentioned in the branching section, there is typically one base branch (usually main) that contains all working or approved changes
  • Any development or testing is usually done on separate branches, then merged back into main once changes are finalized
  • Pull requests are essentially requests to have one branch (e.g., development branch) merged into another (e.g., main branch)
  • Pull requests are opened on GitHub
    • This creates a pull request page (similar to an issues page), where collaborators can comment and discuss the changes that are to be merged
  • The alternative to a pull request is to directly merge in the branch yourself (example below), but this bypasses the review and approval process that a pull request offers


Why make a pull request?

  • In a collaborative setting, pull requests give other people a chance to review and approve your changes before they are merged to the base branch
    • This allows for better quality control
    • It also lets all collaborators be in agreement with what gets merged to the base branch
  • Pull requests can also be a way to keep a history of the major revisions and decisions made to the project

Example: Alternative to pull request: Merging changes directly into main

Let’s say we create a new R script and add/commit that to the main branch:

# Create new R script
echo "library(tidyverse)" > create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "import tidyverse library"

Then, we create a new branch and make further changes to the R script on the branch:

# Create and switch to new branch
git checkout -b dev

# Modify R script
echo "mpg %>% head(5)" >> create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "preview mpg dataset"
## Switched to a new branch 'dev'
## 
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [dev 78ce323] preview mpg dataset
##  1 file changed, 1 insertion(+)

At this point, we can push this new branch to the remote if we wanted to open a pull request. But the alternative is to directly merge the changes to main:

# Switch back to main
git checkout main

# Merge in changes from the branch
git merge dev
## Switched to branch 'main'
## Updating 87138bc..78ce323
## Fast-forward
##  create_dataset.R | 1 +
##  1 file changed, 1 insertion(+)

Then, we can push the changes to the remote’s main branch, which would also be the ultimate goal of a pull request:

# Push to remote's main
git push


10.1 Creating a pull request

All image credits: GitHub Help


Creating a topic branch:

  • Create a new local branch and make your changes to it
  • After you are done, it is good practice to merge in any changes from main that your branch doesn’t have
    • This makes it easier later down the road when you are merging your branch back into main after the pull request is complete
  • Push your branch to the remote repository
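
The three steps above can be sketched on the command line as follows. This is an illustration under stated assumptions: a local bare repository stands in for GitHub, a second clone plays a collaborator who has updated main in the meantime, and the branch name fix-typo, the paths, and the file names are all hypothetical.

```shell
# Throwaway workspace (hypothetical paths created by mktemp)
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"        # stand-in for GitHub

git init -q "$work/local" && cd "$work/local"
git config user.name "Example User"
git config user.email "user@example.com"
echo "library(tidyverse)" > create_dataset.R
git add create_dataset.R
git commit -q -m "import tidyverse library"
git branch -M main
git remote add origin "$work/remote.git"
git push -q -u origin main

# Meanwhile, a collaborator adds a commit to the remote main branch
git clone -q -b main "$work/remote.git" "$work/collab"
git -C "$work/collab" config user.name "Collaborator"
git -C "$work/collab" config user.email "collab@example.com"
echo "# project notes" > "$work/collab/README.md"
git -C "$work/collab" add README.md
git -C "$work/collab" commit -q -m "add README"
git -C "$work/collab" push -q origin main

# 1. Create a topic branch and commit changes to it
git checkout -q -b fix-typo
echo "mpg %>% head(5)" >> create_dataset.R
git commit -q -am "preview mpg dataset"

# 2. Bring in any changes from main that the branch does not have yet
git fetch -q origin
git merge -q --no-edit origin/main

# 3. Push the branch to the remote, setting its upstream
git push -q -u origin fix-typo
git log --oneline --graph
```

After the push, the branch is available on the remote and a pull request can be opened from it.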


Making the pull request:

  • On GitHub, select your branch and click New pull request:

  • Add a title and (optionally) a description for your pull request. You can also @ users/teams if you want:

  • Click Create Pull Request:

  • Your pull request will appear under the tab Pull requests:


Assigning reviewers:

  • On the right-hand side of the pull request, you are also able to assign Reviewers or Assignees, similar to an issue:

  • Reviewers should be someone who you want to review the changes you made, while Assignees could be anyone else more generally involved in the pull request

    • Reviewers will get a notification that their review is requested
    • Whether or not the reviewer actually completes a review does not affect the ability to merge the pull request
    • If someone who is not assigned as reviewer reviews the changes (i.e., does one of three actions described in the next section), they will be added to the reviewers list
  • The users listed under Reviewers (unlike Assignees) will also have a status icon:

    • Pending review from reviewer
    • Reviewer has left comments
    • Reviewer has approved changes
    • Reviewer has requested additional changes
    • For any of the last three statuses, you can click to re-request a review from the reviewer

Example: Creating a pull request

Similar to the previous example, let’s say we create a new R script and add/commit it to the main branch:

# Create new R script
echo "library(tidyverse)" > create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "import tidyverse library"

Then, we create a new branch and make further changes to the R script on the branch:

# Create and switch to new branch
git checkout -b dev

# Modify R script
echo "mpg %>% head(5)" >> create_dataset.R

# Add/commit R script
git add create_dataset.R
git commit -m "preview mpg dataset"
## Switched to a new branch 'dev'
## 
## warning: in the working copy of 'create_dataset.R', LF will be replaced by CRLF the next time Git touches it
## [dev 2c0fbc7] preview mpg dataset
##  1 file changed, 1 insertion(+)

At this point, we can push this new branch to the remote repository. Remember to set the upstream branch if this is the first time you are pushing the branch to remote:

# Push branch to remote (say our remote is called `origin` here)
git push --set-upstream origin dev

All the subsequent steps to open the pull request will be performed on GitHub.


10.2 Responding to a pull request

A pull request ultimately ends in one of two ways:

  • Merging the pull request

  • Closing the pull request

But before coming to one of these decisions, you will likely want to review the changes in more detail.

10.2.1 Reviewing changes

Under the Files changed tab, you can view all changes that would be merged if the pull request is completed:

There, you will also see a button called Review changes that contains three options for leaving a review:


Comment:

  • Select this option to leave general feedback on the changes
    • You must write something in the comment box in order to click Submit review
  • The reviewer status will be changed to “Reviewer has left comments”
  • Note that simply leaving a comment on the main pull request page will not trigger this status change


Approve:

  • Select this option to approve merging the changes
    • You do not need to write anything in the comment box in order to click Submit review
  • The reviewer status will be changed to “Reviewer has approved changes”


Request changes:

  • Select this option to request further changes before merging

    • You must write something in the comment box in order to click Submit review
  • The reviewer status will be changed to “Reviewer has requested additional changes”

  • You will see that the merge box on the main pull request page is outlined in orange, along with a list of reviewers who requested changes:

  • To respond to the change request from each reviewer, there are three options:

    • Approve changes: The reviewer can select this to resolve the change request
      • This will turn the merge box outline from orange back to green
      • The reviewer status will be changed to “Reviewer has approved changes”
      • Anyone other than the reviewer will see the option See review instead
    • Dismiss review: The review can be dismissed by anyone with write or administrator access to the repository
      • You will be asked to enter a reason why you want to dismiss the review, which will appear as a comment on the pull request page
      • This will turn the merge box outline from orange back to green
      • The reviewer status will be changed to “Reviewer has left comments”
    • Re-request review: Another review from the reviewer can be requested
      • The merge box outline will remain orange
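      • The reviewer status will be changed to “Pending review from reviewer”
  • Note that, by default, the merge box outline color and reviewer status do not affect the ability to merge the pull request (repository settings such as required reviews on protected branches can change this)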

10.2.2 Line-by-line comments

Under the Files changed tab, you can also comment on specific lines of a file:

11 Appendix

11.1 .gitignore file

What is a .gitignore file? (gitignore documentation)

  • It is a special file that tells Git what files in the repository to ignore, or not track
    • More specifically, each line in the .gitignore file specifies a pattern to ignore (more below)
    • It does not support regular expressions, but it does support Unix fnmatch-style (glob) patterns
  • Ignored files will no longer be listed under Untracked files when you check git status
    • Note that .gitignore does not affect files already being tracked
    • To stop tracking a file that is already being tracked, use git rm --cached <file>
  • The .gitignore file is usually in your project root directory
    • However, you can also have multiple .gitignore files in various subdirectories if you need to ignore different files in different locations
  • You can either create a .gitignore file yourself or click Add .gitignore when you are creating a new repository on GitHub and select the R template from the dropdown menu

Credit: How to Make Git Forget Tracked Files Now In gitignore
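
The point about already-tracked files can be tried out in a scratch repository. This is a minimal sketch, assuming a hypothetical file `results.log` that was committed before the `*.log` pattern was added; the temporary directory and user identity are also made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Commit a file first, then decide it should be ignored
echo "run 1" > results.log
git add results.log
git commit -q -m "add results.log"

echo "*.log" > .gitignore
git add .gitignore
git commit -q -m "ignore log files"
tracked_before=$(git ls-files)   # results.log is still tracked

# Remove the file from the index only; it stays on disk
git rm -q --cached results.log
git commit -q -m "stop tracking results.log"
tracked_after=$(git ls-files)

# Further edits to the file no longer show up in git status
echo "run 2" >> results.log
status_after=$(git status --porcelain)

echo "before: $tracked_before"
echo "after: $tracked_after"
```

Because `--cached` only touches the index, `results.log` remains in the working directory; it is simply no longer part of future commits.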


Pattern formats in .gitignore file:

  • Lines starting with # are treated as comments
  • Lines starting with ! negate the pattern, i.e., files matching it are not ignored even if an earlier pattern matched them
  • Use \ to escape literal #, !, or trailing spaces
  • * matches anything except /
  • ? matches any one character except /
  • Range notation (e.g., [a-z], [0-9]) can be used to match one of the characters in a range
  • Patterns with / at the end will only match directories and not files
  • Patterns with / at the beginning or in the middle are matched relative to the directory containing the .gitignore file, so they do not match at deeper levels in subdirectories
    • To match in subdirectories as well, add **/ to the start of the pattern
    • /**/ in the middle of the path matches zero or more directories
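
One way to check how these rules apply to a given path is `git check-ignore`, which reports whether a path is ignored and, with `-v`, which pattern is responsible. The following is a minimal sketch in a scratch repository; all file and directory names are hypothetical, and `is_ignored` is a small helper defined only for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
mkdir -p cache src/cache docs/a/b
touch debug.log keep.log cache/x src/cache/x docs/a/b/draft.md

cat > .gitignore <<'EOF'
# log files, except keep.log
*.log
!keep.log
# only the cache/ directory next to this .gitignore
/cache/
# draft.md at any depth below docs/
docs/**/draft.md
EOF

# -v reports the source file, line number, and pattern that matched
git check-ignore -v debug.log cache/x docs/a/b/draft.md

# Without -v, the exit status is 0 if the path is ignored and 1 if not
is_ignored() { if git check-ignore -q "$1"; then echo yes; else echo no; fi; }
keep_log=$(is_ignored keep.log)          # re-included by !keep.log
nested_cache=$(is_ignored src/cache/x)   # /cache/ is anchored to the root
root_cache=$(is_ignored cache/x)
echo "keep.log: $keep_log, src/cache/x: $nested_cache, cache/x: $root_cache"
```

This is often quicker than editing `.gitignore` and rereading `git status` when a pattern does not behave as expected.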

Example: Ignoring files by name patterns

Let’s say we have a git repository with the following files and directory structure:

## .
## |____A1.csv
## |____A1.png
## |____A1.tsv
## |____ABC
## | |____README.md
## |____B2.csv
## |____blank.txt
## |____de.csv

When we check git status, all the files are untracked:

# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  A1.csv
##  A1.png
##  A1.tsv
##  ABC/README.md
##  B2.csv
##  de.csv
## 
## nothing added to commit but untracked files present (use "git add" to track)


Let’s create a .gitignore file in the root directory. In .gitignore, we can specify which files to ignore:

# Ignores `A1.csv`, `A1.png`, and `A1.tsv`
echo "A1.csv" > .gitignore
echo "A1.png" >> .gitignore
echo "A1.tsv" >> .gitignore

cat .gitignore
## A1.csv
## A1.png
## A1.tsv
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  ABC/README.md
##  B2.csv
##  de.csv
## 
## nothing added to commit but untracked files present (use "git add" to track)


We can use the wildcard * to match any sequence of characters other than /. For example, A* matches all files and directories that start with A:

# Ignores `A1.csv`, `A1.png`, `A1.tsv`, and `ABC/` directory using `*`
echo "A*" > .gitignore

cat .gitignore
## A*
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  B2.csv
##  de.csv
## 
## nothing added to commit but untracked files present (use "git add" to track)


To specify a file or pattern not to match (i.e., not ignore), put ! at the start of the line:

# Ignores all files and directories starting with `A` except `A1.png`
echo "A*" > .gitignore
echo "!A1.png" >> .gitignore

cat .gitignore
## A*
## !A1.png
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  A1.png
##  B2.csv
##  de.csv
## 
## nothing added to commit but untracked files present (use "git add" to track)


To only match directories, add a trailing / to your pattern:

# Ignores `ABC/` directory only and not files starting with `A`
echo "A*/" > .gitignore

cat .gitignore
## A*/
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  A1.csv
##  A1.png
##  A1.tsv
##  B2.csv
##  de.csv
## 
## nothing added to commit but untracked files present (use "git add" to track)


The ? can be used to match any one character other than /:

# Ignores `A1.csv` and `A1.tsv` using `?`
echo "A1.?sv" > .gitignore

cat .gitignore
## A1.?sv
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  A1.png
##  ABC/README.md
##  B2.csv
##  de.csv
## 
## nothing added to commit but untracked files present (use "git add" to track)


Square brackets [] can be used to list the specific characters to match:

# Ignores `A1.csv` and `A1.tsv` using `[]`
echo "A1.[ct]sv" > .gitignore

cat .gitignore
## A1.[ct]sv
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  A1.png
##  ABC/README.md
##  B2.csv
##  de.csv
## 
## nothing added to commit but untracked files present (use "git add" to track)


Ranges can also be specified using square brackets [] to match a range of characters (e.g., alphabetic or numeric):

# Ignores `A1.csv` and `B2.csv` using ranges
echo "[a-z][0-9].csv" > .gitignore

cat .gitignore
## [a-z][0-9].csv
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  A1.png
##  A1.tsv
##  ABC/README.md
##  de.csv
## 
## nothing added to commit but untracked files present (use "git add" to track)


Ranges can also be alphanumeric:

# Ignores `A1.csv`, `B2.csv`, and `de.csv` using ranges
echo "[a-z][a-z0-9].csv" > .gitignore

cat .gitignore
## [a-z][a-z0-9].csv
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  A1.png
##  A1.tsv
##  ABC/README.md
## 
## nothing added to commit but untracked files present (use "git add" to track)
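
In the range examples above, the lowercase range `[a-z]` also matched the uppercase names `A1.csv` and `B2.csv`. That outcome depends on Git's `core.ignorecase` setting, which is typically enabled on case-insensitive file systems (the usual situation on Windows and macOS). The sketch below, a hedged illustration in a scratch repository, uses `git -c` to try both settings without changing any configuration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
touch A1.csv B2.csv de.csv

echo "[a-z][0-9].csv" > .gitignore

# Case-insensitive matching: the lowercase range also covers A and B
icase=$(git -c core.ignorecase=true check-ignore A1.csv B2.csv de.csv)
echo "$icase"

# Case-sensitive matching: A is outside a-z, so A1.csv is not ignored
if git -c core.ignorecase=false check-ignore -q A1.csv; then
  lower_range="ignored"
else
  lower_range="not ignored"
fi

# Spelling out both cases makes the pattern independent of the setting
echo "[A-Za-z][0-9].csv" > .gitignore
if git -c core.ignorecase=false check-ignore -q A1.csv; then
  both_cases="ignored"
else
  both_cases="not ignored"
fi
echo "lowercase range: $lower_range; both cases: $both_cases"
```

If collaborators work on different operating systems, writing ranges such as `[A-Za-z]` explicitly avoids platform-dependent results.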

Example: Ignoring files and nested files

Let’s say we have a git repository with the following files and directory structure:

## .
## |____blank.txt
## |____doc
## | |____README.md
## |____intput
## | |____doc
## | | |____README.md
## |____output
## | |____doc
## | | |____README.md
## | |____plots
## | | |____doc
## | | | |____README.md
## |____README.md

When we check git status, all the README.md files are untracked:

# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  README.md
##  doc/README.md
##  intput/doc/README.md
##  output/doc/README.md
##  output/plots/doc/README.md
## 
## nothing added to commit but untracked files present (use "git add" to track)


Let’s create a .gitignore file in the root directory. If we add README.md to .gitignore, all the README.md files will be ignored:

# Ignores all `README.md`
echo "README.md" > .gitignore

cat .gitignore
## README.md
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
## 
## nothing added to commit but untracked files present (use "git add" to track)


If we add doc/README.md to the .gitignore file, only the doc/README.md in the project root directory (i.e., where the .gitignore file is located) will be ignored because there’s a / in the middle of the pattern:

# Ignores `doc/README.md` in the root directory where `.gitignore` is located
echo "doc/README.md" > .gitignore

cat .gitignore
## doc/README.md
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  README.md
##  intput/doc/README.md
##  output/doc/README.md
##  output/plots/doc/README.md
## 
## nothing added to commit but untracked files present (use "git add" to track)


Similarly, if we start a pattern with / like /doc, it will only match things in the directory where the .gitignore file is located (i.e., not the doc folders nested within the subdirectories):

# Ignores `doc/` in the root directory where `.gitignore` is located
echo "/doc" > .gitignore

cat .gitignore
## /doc
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  README.md
##  intput/doc/README.md
##  output/doc/README.md
##  output/plots/doc/README.md
## 
## nothing added to commit but untracked files present (use "git add" to track)


In order to match things in subdirectories, we need to add **/ to the start of the pattern. So **/doc will match doc folders both in the directory where .gitignore is located and in its subdirectories:

# Ignores all `doc/` in both the root directory and within subdirectories
echo "**/doc" > .gitignore

cat .gitignore
## **/doc
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  README.md
## 
## nothing added to commit but untracked files present (use "git add" to track)


Having /**/ in the middle of the path will match zero or more directories:

# Ignores `output/doc` and `output/plots/doc`
echo "output/**/doc" > .gitignore

cat .gitignore
## output/**/doc
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  README.md
##  doc/README.md
##  intput/doc/README.md
## 
## nothing added to commit but untracked files present (use "git add" to track)


Having just * in the path will match exactly one directory level:

# Ignores `output/plots/doc`
echo "output/*/doc" > .gitignore

cat .gitignore
## output/*/doc
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  README.md
##  doc/README.md
##  intput/doc/README.md
##  output/doc/README.md
## 
## nothing added to commit but untracked files present (use "git add" to track)


The pattern */doc matches every doc/ folder that sits inside any single folder (indicated by *) located in the root directory (i.e., the directory where .gitignore is located):

# Ignores `intput/doc` and `output/doc`
echo "*/doc" > .gitignore

cat .gitignore
## */doc
# Check status
git status -u
## On branch main
## Untracked files:
##   (use "git add <file>..." to include in what will be committed)
##  .gitignore
##  README.md
##  doc/README.md
##  output/plots/doc/README.md
## 
## nothing added to commit but untracked files present (use "git add" to track)
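
All the patterns above live in the root `.gitignore`. As a complement, here is a minimal sketch of a `.gitignore` placed in a subdirectory, assuming a directory layout similar to the example. A pattern such as `/doc` is then anchored to the subdirectory that contains the file, not to the project root.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
mkdir -p doc output/doc output/plots/doc
touch doc/README.md output/doc/README.md output/plots/doc/README.md

# .gitignore inside output/: "/doc" now means output/doc only
echo "/doc" > output/.gitignore

# Untracked files that are still visible to Git
git status --porcelain -u

# Confirm path by path (exit status 0 means ignored)
is_ignored() { if git check-ignore -q "$1"; then echo yes; else echo no; fi; }
root_doc=$(is_ignored doc/README.md)
output_doc=$(is_ignored output/doc/README.md)
plots_doc=$(is_ignored output/plots/doc/README.md)
echo "doc: $root_doc, output/doc: $output_doc, output/plots/doc: $plots_doc"
```

Here only `output/doc` is ignored; the root `doc` folder and `output/plots/doc` are unaffected.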

11.2 Models for collaborative development

Two primary ways people collaborate on GitHub:

  1. Shared repository
  2. Fork and pull

11.2.1 Shared repository


Credit: Matuesz Lubanski


Overview of shared repository workflow:

  • All work on project happens in a single repository
  • Everyone working on the project clones the repository to their local computer
  • Designate level of access for each team member
    • Read access
    • Write access
    • Administrator access
  • As an individual team member, you work on specific tasks (e.g., fix a bug, add a new feature, write a lecture on a topic)
    • Work on tasks in your local working directory on your local machine
      • Often, work on tasks in a branch other than main
    • Once you complete a task, commit changes to your local repository
    • push changes from local repository on your machine to remote repository shared with collaborators
  • Other team members will also work on specific tasks that they commit to their local repository and then push to the remote repository
    • After your team members push a change to the remote repository, you may pull those changes to your local repository and local working directory
  • Issuing a pull request
    • For most collaborative projects, users do not simply push their changes to the main branch of the shared remote repository
    • Why? Before pushing final changes to shared repository, those changes should be reviewed by other team members. This can be done by issuing a pull request
    • A pull request is an announcement to team members that you have made changes and you want those changes to be reviewed before they become final (e.g., merged to the main branch)
    • Once you issue a pull request, “the person or team reviewing your changes may have questions or comments. Perhaps the coding style doesn’t match project guidelines, the change is missing unit tests, or maybe everything looks great and props are in order” (Understanding the GitHub flow)
      • Process:
        • Create a new local branch off main and make your changes on it
        • Push the branch to the remote repository
        • Open a pull request on GitHub to have the branch merged to main
      • Alternative to pull request:
        • Merge your changes on the local branch directly into local main, then push to remote
        • This bypasses the review and approval process that a pull request offers
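
The shared repository workflow can be sketched end to end on one machine. This is an illustration under stated assumptions: a local bare repository stands in for the shared GitHub repository, the collaborators `alice` and `bob` and the branch name `fix-typo` are hypothetical, and the pull request itself (which happens on GitHub) is only marked by a comment.

```shell
workdir=$(mktemp -d)
git init -q --bare "$workdir/shared.git"   # stand-in for the shared repository
# Make main the default branch regardless of local Git defaults
git --git-dir="$workdir/shared.git" symbolic-ref HEAD refs/heads/main

# Collaborator A clones, commits to main, and pushes
git clone -q "$workdir/shared.git" "$workdir/alice"
cd "$workdir/alice"
git config user.name "Alice"
git config user.email "alice@example.com"
echo "# Project" > README.md
git add README.md
git commit -q -m "add README"
git branch -M main
git push -q -u origin main

# A works on a task in a separate branch and pushes that branch
git checkout -q -b fix-typo
echo "More text" >> README.md
git add README.md
git commit -q -m "expand README"
git push -q -u origin fix-typo
# A pull request from fix-typo into main would be opened on GitHub here

# Alternative without review: merge locally, then push main
git checkout -q main
git merge -q fix-typo
git push -q origin main

# Collaborator B clones the shared repository and sees A's work
git clone -q "$workdir/shared.git" "$workdir/bob"
cd "$workdir/bob"
remote_branches=$(git branch -r)
echo "$remote_branches"
```

The last three commands in A's part are the shortcut described above; with a pull request, the merge into main would instead happen on GitHub after review.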

11.2.2 Fork and pull

What is a fork?

  • A fork is a copy of a repository that is associated with an individual’s personal account
  • The individual has full control of their fork (read, write, administrator)

Why use forks?

  • For projects with many contributors, it can become overwhelming to manage the project and to manage individual permissions using the shared repository model
  • People who don’t have write permission to a repository can still contribute to it using pull requests


Credit: Shaumik Daityari


Overview of fork and pull workflow:

  • Create a fork of the central_repo repository (a copy of the project repository associated with your personal account)
    • Let’s call the forked repository your_fork
    • Initially, your_fork repository only exists on GitHub
  • clone the your_fork repository to your local machine
    • On the local working directory, make changes to files
    • add changes to index/staging area
    • commit changes to local your_fork repository
    • push changes to remote your_fork repository
  • Issue a pull request asking that the changes you have made to the remote your_fork repository be incorporated into the main central_repo repository
    • “If you send a pull request to another repository, you ask their maintainers to pull your changes into theirs (you more or less ask them to use a git pull from your repository)” (Stack Overflow)
  • Alternative to pull request:
    • Open an issue instead to request certain changes
    • But this means someone still has to implement the change
    • If the requester is able to make the change themselves, doing so and creating a pull request is a faster way to get the change incorporated
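
The fork and pull workflow can be approximated locally as well. In this sketch, bare repositories named `central_repo.git` and `your_fork.git` stand in for the two GitHub repositories, forking is imitated with `git clone --bare`, and the maintainer's acceptance of the pull request is imitated with a `git pull` from the fork, following the Stack Overflow description quoted above. The user identities and directory names are hypothetical.

```shell
workdir=$(mktemp -d)

# Seed a stand-in for central_repo with one commit on main
git init -q "$workdir/seed"
cd "$workdir/seed"
git config user.name "Maintainer"
git config user.email "maintainer@example.com"
echo "# central_repo" > README.md
git add README.md
git commit -q -m "initial commit"
git branch -M main
git clone -q --bare "$workdir/seed" "$workdir/central_repo.git"

# Forking on GitHub is approximated here by a bare copy
git clone -q --bare "$workdir/central_repo.git" "$workdir/your_fork.git"

# Contributor: clone the fork, change a file, add, commit, push
git clone -q "$workdir/your_fork.git" "$workdir/contributor"
cd "$workdir/contributor"
git config user.name "Contributor"
git config user.email "contributor@example.com"
echo "Fix from contributor" >> README.md
git add README.md
git commit -q -m "improve README"
git push -q origin main

# Maintainer: accepting the pull request amounts to pulling from the fork
git clone -q "$workdir/central_repo.git" "$workdir/maintainer"
cd "$workdir/maintainer"
git pull -q --ff-only "$workdir/your_fork.git" main
git push -q origin main

central_head=$(git --git-dir="$workdir/central_repo.git" rev-parse main)
echo "$central_head"
```

On GitHub the maintainer would normally click Merge pull request rather than pull by hand, but the effect on the central repository's history is comparable.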