Chapter 5 Git and GitHub for reproducible research

Git and GitHub are among the most important tools for reproducible research, collaborative workflows, and modern computational project management.

Version control systems such as Git help researchers and analysts:

  • track changes over time,
  • collaborate across teams,
  • maintain project history,
  • recover previous versions of files,
  • document analytical workflows,
  • and improve transparency and reproducibility.

GitHub extends Git by providing a cloud-based platform for sharing repositories, collaborating on projects, managing issues, documenting workflows, and distributing open-source code and research products.

In reproducible research, Git and GitHub help ensure that code, documentation, analyses, and reports evolve in a structured, transparent, and traceable manner.

5.1 Why use Git and GitHub?

Some major benefits include:

  • preserving the full history of a project,
  • enabling collaboration without overwriting files,
  • improving organization and workflow management,
  • supporting open science and transparency,
  • facilitating reproducible analyses,
  • allowing rollback to previous versions,
  • integrating with RStudio and other analytical tools,
  • and supporting automated workflows and continuous integration.

Git is especially valuable in projects involving multiple collaborators or long-term analytical development.

5.3 Git terminology

Understanding common Git terminology is important before working with repositories and version control workflows.

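  • origin : the conventional name for the connection to the remote repository. Git assigns this name automatically to the repository a project was cloned from.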

  • master : traditionally the name of the default branch. A branch in Git is a lightweight movable pointer to a commit.
    (Note: many modern repositories now use main instead of master as the default branch name.)

  • working directory : the project files checked out on your computer that you are currently editing. Changes made here are not recorded in the history until they are staged and committed.

  • .git directory : Git stores all repository data inside the .git directory. This folder is created when a local repository is initialized using the git init command.

  • .gitignore : Git uses this file to determine which untracked files and directories to leave out of version control. Common examples include temporary files, large datasets, and system-generated files.

  • hash : each commit is identified by a unique hash, by default a 40-character hexadecimal SHA-1 checksum that Git computes when the commit is created.

  • HEAD : a pointer to the latest commit on the current branch. If you are on the master branch, then HEAD and master point to the same commit. To refer to the previous commit, use HEAD~1.

  • repository (repo) : a project folder managed by Git containing files, history, and version information.

  • commit : a snapshot of changes saved into the repository history.

  • staging area : an intermediate area where files are prepared before committing.

  • branch : an independent line of development that allows experimentation and collaboration without affecting the main workflow.

  • merge : combining changes from different branches into a single branch.
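The terms above can be seen together in a short session. The following is a minimal sketch, assuming Git is installed; the temporary directory, the file names, and the author identity are placeholders chosen for illustration.

```shell
set -eu
# Throwaway repository in a temporary directory (all names are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q                                   # creates the .git directory
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.org"

# .gitignore: keep temporary files out of version control.
printf '*.tmp\n' > .gitignore
echo "x <- 1" > analysis.R
echo "scratch" > notes.tmp

git add .gitignore analysis.R                 # working directory -> staging area
git commit -q -m "Add analysis script"        # staging area -> commit

echo "y <- 2" >> analysis.R
git commit -q -am "Add second variable"

git rev-parse HEAD                            # hash of the latest commit
git rev-parse HEAD~1                          # hash of the commit before it
git status --short --ignored                  # notes.tmp is listed with '!!' (ignored)
```

The two hashes printed by git rev-parse differ because each commit has its own identifier.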

5.4 Git commands

Git commands help manage repositories, track changes, collaborate with others, and maintain reproducible workflows.

5.4.1 Initialize

  • git init <local repository name>
    Initializes a new local repository.

This creates the .git directory and begins version control tracking.
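As a minimal sketch of initialization, assuming Git is installed (the directory name my-analysis is hypothetical):

```shell
set -eu
parent=$(mktemp -d)        # temporary location for the demonstration
cd "$parent"
git init -q my-analysis    # creates my-analysis/ and my-analysis/.git
ls -a my-analysis          # the listing includes the hidden .git directory
```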

5.4.2 Remotes

Remotes connect local repositories to repositories hosted on platforms such as GitHub.

  • git remote add <remote name> <url>
    Creates a new connection to a remote repository. The remote name is typically set to origin.

  • git remote show <remote name>
    Shows information about the remote repository and branch tracking.

  • git remote rename <remote name> <new remote name>
    Renames a remote connection.

  • git remote remove <remote name>
    Removes a remote connection.

  • git remote -v
    Lists all configured remotes.
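The remote commands can be tried without a GitHub account. In this sketch, a local bare repository stands in for a GitHub URL; that substitution and all paths are assumptions made for illustration.

```shell
set -eu
base=$(mktemp -d)
git init -q --bare "$base/shared.git"    # stand-in for a repository hosted on GitHub
git init -q "$base/project"
cd "$base/project"

git remote add origin "$base/shared.git" # connect the local repository to the remote
git remote -v                            # lists origin with its fetch and push URLs
git remote get-url origin                # prints the address stored for origin

git remote rename origin upstream        # same connection, new name
git remote -v

git remote remove upstream               # drop the connection again
git remote                               # prints nothing: no remotes are configured
```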

5.4.3 Branches

Branches allow parallel development and experimentation.

  • git branch <new branch name>
    Creates a new branch.

  • git branch
    Lists all branches in the repository.

  • git checkout <branch-name>
    Switches to another branch.

  • git checkout -b <branch-name>
    Creates a branch and switches to it.

  • git checkout <branch name> <filename>
    Restores a specific version of a file from another branch.

  • git merge <branch-name>
    Merges changes from the named branch into the branch that is currently checked out. Switch to the destination branch first.

Branches are extremely useful for collaborative development, testing new features, and preventing accidental modification of stable workflows.
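A typical branch workflow might look like the sketch below. The branch name experiment and the file name are hypothetical, and the default branch name is read from the repository because it may be master or main. Note that git merge is run from the destination branch and names only the branch to bring in.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.org"

echo "step 1" > pipeline.txt
git add pipeline.txt
git commit -q -m "Add pipeline outline"
default_branch=$(git rev-parse --abbrev-ref HEAD)   # master or main

git checkout -q -b experiment                 # create the branch and switch to it
echo "step 2 (experimental)" >> pipeline.txt
git commit -q -am "Try an extra step"

git checkout -q "$default_branch"             # return to the destination branch
git merge -q experiment                       # bring the experiment into it
git branch                                    # lists both branches
```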

5.4.4 Status

These commands help inspect repository status and changes.

  • git status
    Shows modified, staged, and untracked files.

  • git diff
    Shows unstaged changes: the differences between the working directory and the staging area.

  • git diff --staged
    Shows differences between staged files and the last commit.

  • git diff <directory>
    Shows changes within a directory.

  • git diff -r HEAD
    Compares the current working directory against the latest commit.

  • git log
    Displays the repository history.

  • git show <hash>
    Displays details of a specific commit.

  • git annotate <filename>
    Shows who last modified each line of a file and when.

These tools are especially useful for auditing changes and understanding project history.
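The inspection commands are easiest to understand on a repository with one modified file and one untracked file. The following sketch sets up that situation; file names and the author identity are illustrative.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.org"

echo "alpha" > data_notes.txt
git add data_notes.txt
git commit -q -m "Add data notes"

echo "beta" >> data_notes.txt                 # modified, not yet staged
echo "draft" > new_file.txt                   # untracked

git status --short                            # one modified file, one untracked file
git diff                                      # shows the unstaged "beta" line
git add data_notes.txt
git diff --staged                             # the same change, now staged
git log --oneline                             # one line per commit
git show --stat HEAD                          # details of the latest commit
```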

5.4.5 Clone

  • git clone <remote repository url>
    Downloads a copy of a remote repository to a local folder.

Cloning automatically creates a remote called origin that points to the repository that was cloned.

Cloning is commonly used when starting collaboration on an existing project.
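Cloning can be sketched locally by using a path in place of a GitHub URL. The source repository below is created only for the demonstration, and its contents are placeholders.

```shell
set -eu
base=$(mktemp -d)

# A small repository to clone from (stand-in for a remote URL).
git init -q "$base/source"
cd "$base/source"
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.org"
echo "# Project" > README.md
git add README.md
git commit -q -m "Add README"

# Clone it into a new folder named copy.
cd "$base"
git clone -q "$base/source" copy
cd copy
git remote -v                                 # origin points back to the source
git log --oneline                             # the history came along with the files
```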

5.4.6 Add

The staging area allows users to select which changes will be included in the next commit.

  • git add .
    Stages changes in the current directory and its subdirectories. Running git add without a path stages nothing.

  • git add <filename>
    Stages a specific file.

  • git add -A
    Stages all new, modified, and deleted files.

  • git add <foldername>/
    Adds a folder and its contents to the staging area. Unlike the shell pattern <foldername>/*, this form also picks up hidden files.
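Selective staging can be illustrated as follows. This is a sketch with hypothetical file and folder names; it stages one file first and then a whole folder.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir scripts
echo "a" > scripts/clean.R
echo "b" > scripts/model.R
echo "c" > report.Rmd

git add report.Rmd         # stage a single file
git status --short         # report.Rmd is staged, scripts/ is still untracked

git add scripts/           # stage the folder and everything in it
git status --short         # all three files are now staged
```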

5.4.7 Remove

  • git clean -n
    Performs a dry run: lists the untracked files that would be deleted, without deleting anything.

  • git clean -f
    Deletes untracked files.

These commands should be used carefully: untracked files were never committed, so Git cannot restore them once they are deleted.
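A cautious way to clean a repository is to preview first and delete second, as in this sketch (file names are illustrative):

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.org"

echo "keep" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked file"
echo "temp" > scratch.log                     # untracked file

git clean -n               # preview only: reports "Would remove scratch.log"
git clean -f               # actually deletes scratch.log
ls                         # tracked.txt remains
```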

5.4.8 Undo

Git provides several ways to undo staged or modified changes.

  • git reset
    Unstages all staged changes. The modifications themselves remain in the working directory.

  • git reset HEAD <filename>
    Unstages a specific file.

Version control systems are valuable because they allow experimentation while preserving recoverability.
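The sketch below shows that unstaging does not discard edits. It stages two files, unstages one, and then unstages everything; file names and the author identity are placeholders.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.org"

echo "v1" > model.R
git add model.R
git commit -q -m "Add model script"

echo "v2" >> model.R
echo "todo" > notes.txt
git add model.R notes.txt                     # both changes are staged

git reset -q HEAD notes.txt                   # unstage only notes.txt
git status --short                            # model.R staged, notes.txt untracked

git reset -q                                  # unstage everything
git status --short                            # nothing staged; edits are still on disk
```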

5.4.9 Commit

  • git commit -m "<message>"
    Saves a snapshot of staged changes into the local repository history.

Commits should ideally:

  • be small and focused,
  • contain meaningful messages,
  • and document logical analytical or project changes.

Good commit messages improve project traceability and collaboration.

5.4.10 Fetch

  • git fetch
    Retrieves new work from a remote repository without merging it into the current branch.

Fetching allows users to inspect incoming changes before merging.

5.4.11 Pull

  • git pull
    Automatically fetches and merges changes from the remote branch into the current branch.

Pulling helps synchronize local and remote repositories.
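The difference between fetching and pulling can be sketched with two local clones of one shared repository. The bare repository stands in for GitHub, and the collaborator folder names alice and bob are hypothetical.

```shell
set -eu
base=$(mktemp -d)
git init -q --bare "$base/shared.git"         # stand-in for a GitHub repository

# First collaborator publishes an initial commit.
git clone -q "$base/shared.git" "$base/alice" 2>/dev/null
cd "$base/alice"
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.org"
echo "v1" > results.txt
git add results.txt
git commit -q -m "Add first results"
git push -q -u origin HEAD

# Second collaborator clones; the first one then pushes more work.
git clone -q "$base/shared.git" "$base/bob"
echo "v2" >> results.txt
git commit -q -am "Update results"
git push -q

cd "$base/bob"
branch=$(git rev-parse --abbrev-ref HEAD)     # master or main
git fetch -q                                  # updates origin/$branch only
git log --oneline "HEAD..origin/$branch"      # incoming commit, not yet merged
cat results.txt                               # still the old content

git pull -q                                   # fetch and merge into the current branch
cat results.txt                               # now includes the update
```

After git fetch the local files are unchanged, which is what makes it safe to inspect incoming work before running git pull.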

5.4.12 Push

  • git push
    Uploads local commits to the remote repository.

Pushing allows collaborators to access updates and maintains synchronization between local and remote versions.

5.5 Note on adding files to the remote repository

When pushing a branch to a remote repository for the first time, you must set its upstream (tracking) branch so that Git knows which remote branch the local branch corresponds to:

  • git push --set-upstream origin master

or the shorter version:

  • git push -u origin master

After this initial setup, future pushes can usually be completed using:

  • git push

Many modern repositories use main instead of master; in that case, replace master with main in the commands above.
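The first push and the later short form can be sketched as follows. A local bare repository stands in for GitHub, and the branch name is read from the repository so the example works with either master or main; all paths and file names are illustrative.

```shell
set -eu
base=$(mktemp -d)
git init -q --bare "$base/shared.git"         # stand-in for a GitHub repository
git init -q "$base/project"
cd "$base/project"
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.org"

echo "# Project" > README.md
git add README.md
git commit -q -m "Add README"
git remote add origin "$base/shared.git"

branch=$(git rev-parse --abbrev-ref HEAD)     # master or main
git push -q -u origin "$branch"               # first push: sets the upstream branch
git rev-parse --abbrev-ref "@{upstream}"      # prints origin/<branch>

echo "More detail" >> README.md
git commit -q -am "Expand README"
git push -q                                   # later pushes need no arguments
```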

5.6 GitHub workflows and collaboration

GitHub provides additional collaborative tools including:

  • pull requests,
  • issue tracking,
  • repository discussions,
  • project boards,
  • GitHub Actions for automation,
  • documentation hosting,
  • and integration with analytical tools and IDEs.

These features support collaborative and reproducible analytical workflows across teams and organizations.

5.7 Git and reproducibility

Git supports reproducibility by:

  • preserving the complete analytical history,
  • documenting workflow evolution,
  • enabling rollback to previous versions,
  • facilitating peer review,
  • supporting collaborative development,
  • and improving transparency and accountability.

In reproducible research, version control is not simply a software development tool; it is also an important organizational and documentation practice.