Chapter 5 Git and GitHub for reproducible research
Git and GitHub are among the most important tools for reproducible research, collaborative workflows, and modern computational project management.
Version control systems such as Git help researchers and analysts:
- track changes over time,
- collaborate across teams,
- maintain project history,
- recover previous versions of files,
- document analytical workflows,
- and improve transparency and reproducibility.
GitHub extends Git by providing a cloud-based platform for sharing repositories, collaborating on projects, managing issues, documenting workflows, and distributing open-source code and research products.
In reproducible research, Git and GitHub help ensure that code, documentation, analyses, and reports evolve in a structured, transparent, and traceable manner.
5.1 Why use Git and GitHub?
Some major benefits include:
- preserving the full history of a project,
- enabling collaboration without overwriting files,
- improving organization and workflow management,
- supporting open science and transparency,
- facilitating reproducible analyses,
- allowing rollback to previous versions,
- integrating with RStudio and other analytical tools,
- and supporting automated workflows and continuous integration.
Git is especially valuable in projects involving multiple collaborators or long-term analytical development.
5.3 Git terminology
Understanding common Git terminology is important before working with repositories and version control workflows.
- origin: a connection pointing to the remote repository.
- master: traditionally the name of the default branch. A branch in Git is a lightweight movable pointer to a commit. (Note: many modern repositories now use main instead of master as the default branch name.)
- working directory: the local repository and files currently being edited on your computer.
- .git directory: Git stores all repository data inside the .git directory. This folder is created when a local repository is initialized using the git init command.
- .gitignore: Git uses this file to determine which files and directories should be ignored before making commits. Common examples include temporary files, datasets, and system-generated files.
- hash: the commit command creates a unique identifier called a hash that uniquely identifies a commit.
- HEAD: a pointer to the latest commit on the current branch. If you are on the master branch, then HEAD and master point to the same commit. To refer to the previous commit, use HEAD~1.
- repository (repo): a project folder managed by Git containing files, history, and version information.
- commit: a snapshot of changes saved into the repository history.
- staging area: an intermediate area where files are prepared before committing.
- branch: an independent line of development that allows experimentation and collaboration without affecting the main workflow.
- merge: combining changes from different branches into a single branch.
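Several of these terms can be seen in a throwaway repository. The following is a minimal sketch, assuming Git is installed and that a temporary directory is acceptable as a scratch location; the file names and the author identity are hypothetical placeholders.

```shell
set -eu
work="$(mktemp -d)"                         # scratch location for the demo
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example Author"       # placeholder identity, local to this repo
git config user.email "author@example.org"

# .gitignore: ask Git to skip temporary files
printf '*.tmp\n' > .gitignore
echo "scratch" > notes.tmp
echo "x <- 1" > analysis.R

git add .gitignore analysis.R               # staging area
git commit -q -m "Add analysis script and ignore rules"
echo "y <- 2" >> analysis.R
git commit -q -am "Extend analysis script"

branch="$(git symbolic-ref --short HEAD)"   # master or main, depending on configuration
head_hash="$(git rev-parse HEAD)"           # hash of the latest commit
branch_hash="$(git rev-parse "$branch")"    # the branch points to the same commit
parent_hash="$(git rev-parse HEAD~1)"       # the commit before it

echo "HEAD    : $head_hash"
echo "branch  : $branch_hash ($branch)"
echo "HEAD~1  : $parent_hash"
git status --short --ignored                # ignored files are marked with '!!'
```

The two commits produce two different hashes, while HEAD and the current branch resolve to the same one.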
5.4 Git commands
Git commands help manage repositories, track changes, collaborate with others, and maintain reproducible workflows.
5.4.1 Initialize
git init <local repository name>
Initializes a new local repository.
This creates the .git directory and begins version control tracking.
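As a minimal sketch, assuming Git is installed, the command can be tried in a temporary directory. The repository name my-analysis is a hypothetical example.

```shell
set -eu
work="$(mktemp -d)"        # scratch location for the demo
cd "$work"

git init -q my-analysis    # creates the folder and the .git directory inside it
ls -a my-analysis          # the listing includes .git
cd my-analysis
git status                 # reports a repository with no commits yet
```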
5.4.2 Remotes
Remotes connect local repositories to repositories hosted on platforms such as GitHub.
git remote add <remote name> <url>
Creates a new connection to a remote repository. The remote name is typically set to origin.
git remote show <remote name>
Shows information about the remote repository and branch tracking.
git remote rename <remote name> <new remote name>
Renames a remote connection.
git remote remove <remote name>
Removes a remote connection.
git remote -v
Lists all configured remotes.
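The remote commands can be tried without network access. This sketch assumes a local bare repository as a stand-in for a GitHub URL; the paths and the name upstream are hypothetical.

```shell
set -eu
work="$(mktemp -d)"
git init -q --bare "$work/shared.git"      # plays the role of the hosted repository
git init -q "$work/local"
cd "$work/local"

git remote add origin "$work/shared.git"   # connect the local repo to the remote
git remote -v                              # lists origin for fetch and push

git remote rename origin upstream
names_after_rename="$(git remote)"
echo "after rename: $names_after_rename"

git remote remove upstream
names_after_remove="$(git remote)"
echo "after remove: '$names_after_remove'"
```

With a real project, the path to shared.git would be replaced by the HTTPS or SSH URL that GitHub shows for the repository.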
5.4.3 Branches
Branches allow parallel development and experimentation.
git branch <new branch name>
Creates a new branch.
git branch
Lists all branches in the repository.
git checkout <branch-name>
Switches to another branch.
git checkout -b <branch-name>
Creates a branch and switches to it.
git checkout <branch name> -- <filename>
Restores the version of a file stored on another branch.
git merge <source branch>
Merges changes from the named branch into the branch that is currently checked out. To merge into a different branch, switch to that branch first.
Branches are extremely useful for collaborative development, testing new features, and preventing accidental modification of stable workflows.
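A typical branch cycle might look like the sketch below, assuming a scratch repository. The branch name experiment, the file model.R, and the author identity are hypothetical.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"

echo "x <- 1" > model.R
git add model.R
git commit -q -m "Add baseline model"
base="$(git symbolic-ref --short HEAD)"    # master or main

git checkout -q -b experiment              # create the branch and switch to it
echo "x <- 2" > model.R
git commit -q -am "Try alternative parameter"

git checkout -q "$base"                    # back on the default branch
cat model.R                                # still the baseline version
git merge -q experiment                    # bring the experiment into the current branch
cat model.R                                # now the merged version
git branch                                 # lists both branches
```

Because the default branch did not change while the experiment was under way, Git can complete this merge by simply moving the branch pointer forward.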
5.4.4 Status
These commands help inspect repository status and changes.
git status
Shows modified, staged, and untracked files.
git diff
Shows changes in the working directory that have not been staged yet.
git diff --staged
Shows differences between staged files and the last commit.
git diff <directory>
Shows changes within a directory.
git diff -r HEAD
Compares the current working directory against the latest commit.
git log
Displays the repository history.
git show <hash>
Displays details of a specific commit.
git annotate <filename>
Shows who last modified each line of a file and when.
These tools are especially useful for auditing changes and understanding project history.
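The difference between git diff and git diff --staged is easiest to see side by side. This is a sketch under the assumption of a scratch repository with hypothetical file names.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"

echo "first note" > data-notes.txt
git add data-notes.txt
git commit -q -m "Add data notes"

echo "second note" >> data-notes.txt       # modified, not yet staged
echo "draft" > draft.txt                   # untracked

git status --short                         # one modified file, one untracked file
unstaged="$(git diff --name-only)"         # files with unstaged changes
git add data-notes.txt
staged="$(git diff --staged --name-only)"  # files staged for the next commit

echo "unstaged before add: $unstaged"
echo "staged after add:    $staged"
git log --oneline                          # one line per commit
git show --stat HEAD                       # details of the latest commit
```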
5.4.5 Clone
git clone <remote repository url>
Downloads a copy of a remote repository to a local folder.
This automatically creates a remote called origin that points to the cloned URL.
Cloning is commonly used when starting collaboration on an existing project.
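Cloning can be rehearsed locally. In this sketch a small repository on disk stands in for the remote URL; the directory names source and copy are hypothetical.

```shell
set -eu
work="$(mktemp -d)"

# Build a small repository to clone from
git init -q "$work/source"
cd "$work/source"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"
echo "# Project" > README.md
git add README.md
git commit -q -m "Add README"

# Clone it into a new folder
cd "$work"
git clone -q "$work/source" copy
cd copy
remote_name="$(git remote)"                # the remote that clone set up
git remote -v
git log --oneline                          # the history arrives with the files
```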
5.4.6 Add
The staging area allows users to select which changes will be included in the next commit.
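git add
Adds files from the working directory to the staging area. The command needs at least one path or option, as in the forms below.
git add <filename>
Stages a specific file.
git add -A
Stages all new, modified, and deleted files.
git add <foldername>/*
Adds a folder and its contents to the staging area. Hidden files directly inside the folder are skipped by the shell wildcard; git add <foldername> includes them.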
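The sketch below stages changes in three steps so that the effect of each form is visible. It assumes a scratch repository; all file and folder names are hypothetical.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"
echo "# Project" > README.md
git add README.md
git commit -q -m "Add README"

echo "clean" > clean.R
echo "explore" > explore.R
mkdir figures
echo "placeholder" > figures/notes.txt

git add clean.R                            # stage one file
staged_first="$(git diff --staged --name-only)"
echo "after first add: $staged_first"

git add figures/*                          # stage the contents of a folder
git add -A                                 # stage everything that remains
git diff --staged --name-only              # lists all three new files
```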
5.4.7 Remove
git clean -n
Performs a dry run: lists the untracked files that would be deleted, without deleting anything.
git clean -f
Deletes untracked files.
These commands should be used carefully because deleted files may not be recoverable.
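A cautious sequence is to preview first and delete second. This sketch assumes a scratch repository; the file names are hypothetical.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"
echo "keep me" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked file"

echo "temporary output" > scratch.txt      # untracked file

preview="$(git clean -n)"                  # dry run, nothing is deleted
echo "$preview"
ls                                         # scratch.txt is still present

git clean -f                               # now the untracked file is removed
ls                                         # only tracked.txt remains
```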
5.4.8 Undo
Git provides several ways to undo staged or modified changes.
git reset
Unstages all staged changes. The edits themselves remain in the working directory.
git reset HEAD <filename>
Unstages a specific file.
Version control systems are valuable because they allow experimentation while preserving recoverability.
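The following sketch shows that unstaging does not discard edits. It assumes a scratch repository with two hypothetical files, a.txt and b.txt.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"
echo "a" > a.txt
echo "b" > b.txt
git add a.txt b.txt
git commit -q -m "Add two files"

echo "more a" >> a.txt
echo "more b" >> b.txt
git add a.txt b.txt                        # both edits are staged

git reset -q HEAD a.txt                    # unstage only a.txt
staged_after_one="$(git diff --staged --name-only)"

git reset -q                               # unstage everything else
staged_after_all="$(git diff --staged --name-only)"
modified="$(git diff --name-only | tr '\n' ' ')"

echo "staged after first reset : $staged_after_one"
echo "staged after second reset: '$staged_after_all'"
echo "still modified on disk   : $modified"
```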
5.4.9 Commit
git commit -m "<message>"
Saves a snapshot of staged changes into the local repository history.
Commits should ideally:
- be small and focused,
- contain meaningful messages,
- and document logical analytical or project changes.
Good commit messages improve project traceability and collaboration.
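One way to keep commits small and focused is to stage and commit one logical change at a time. This is a sketch with hypothetical file names and messages, run in a scratch repository.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"

echo "# cleaning steps" > clean_data.R
echo "# model code" > fit_model.R

git add clean_data.R                       # first logical change
git commit -q -m "Add data cleaning script"

git add fit_model.R                        # second logical change
git commit -q -m "Add model fitting script"

count="$(git rev-list --count HEAD)"       # number of commits on this branch
last_subject="$(git log -1 --format=%s)"   # message of the latest commit
git log --oneline
```

Each commit can then be inspected or reverted on its own, which is harder when unrelated changes share one snapshot.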
5.4.10 Fetch
git fetch
Retrieves new work from a remote repository without merging it into the current branch.
Fetching allows users to inspect incoming changes before merging.
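The inspect-then-merge pattern can be sketched as follows. Two local repositories stand in for a collaborator's remote and your own clone; the names source and mine are hypothetical.

```shell
set -eu
work="$(mktemp -d)"

# A repository that plays the role of the remote
git init -q "$work/source"
cd "$work/source"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"
echo "v1" > report.txt
git add report.txt
git commit -q -m "Add report"
branch="$(git symbolic-ref --short HEAD)"  # master or main

git clone -q "$work/source" "$work/mine"

# New work appears on the remote side
echo "v2" >> report.txt
git commit -q -am "Update report"

# Fetch in the clone and look before merging
cd "$work/mine"
before="$(git rev-parse HEAD)"
git fetch -q
after="$(git rev-parse HEAD)"              # unchanged: fetch does not touch the local branch
incoming="$(git log --oneline "HEAD..origin/$branch")"
echo "incoming commits:"
echo "$incoming"

git merge -q "origin/$branch"              # integrate once the changes look right
```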
5.5 Note on adding files to the remote repository
When pushing to a remote repository for the first time, you must establish a connection between the local and remote branches:
git push --set-upstream origin master
or the shorter version:
git push -u origin master
After this initial setup, future pushes can usually be completed using:
git push
Modern repositories may use main instead of master.
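The first push and a later plain push can be rehearsed against a local bare repository. This sketch detects the default branch name instead of assuming master or main; the paths and identity are hypothetical.

```shell
set -eu
work="$(mktemp -d)"
git init -q --bare "$work/shared.git"      # stand-in for the GitHub repository
git init -q "$work/local"
cd "$work/local"
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.org"
echo "v1" > report.txt
git add report.txt
git commit -q -m "Add report"

branch="$(git symbolic-ref --short HEAD)"  # master or main
git remote add origin "$work/shared.git"
git push -q -u origin "$branch"            # first push: also records the upstream branch

upstream="$(git rev-parse --abbrev-ref --symbolic-full-name "@{u}")"
echo "upstream: $upstream"

echo "v2" >> report.txt
git commit -q -am "Update report"
git push -q                                # later pushes need no extra arguments
```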
5.6 GitHub workflows and collaboration
GitHub provides additional collaborative tools including:
- pull requests,
- issue tracking,
- repository discussions,
- project boards,
- GitHub Actions for automation,
- documentation hosting,
- and integration with analytical tools and IDEs.
These features support collaborative and reproducible analytical workflows across teams and organizations.
5.7 Git and reproducibility
Git supports reproducibility by:
- preserving the complete analytical history,
- documenting workflow evolution,
- enabling rollback to previous versions,
- facilitating peer review,
- supporting collaborative development,
- and improving transparency and accountability.
In reproducible research, version control is not simply a software development tool; it is also an important organizational and documentation practice.
5.8 References
Happy Git and GitHub for the useR by Jennifer Bryan adapted under the Creative Commons Attribution-NonCommercial 4.0 International License
Pro Git book, written by Scott Chacon and Ben Straub adapted under the Creative Commons Attribution Non Commercial Share Alike 3.0 license
Version Control with Git by Software Carpentry adapted under the Attribution 4.0 International (CC BY 4.0) license