Chapter 1 Project Setup and Reproducible Workflows

A well-organized project structure is one of the foundations of reproducible data analysis. In real-world projects, data files, scripts, figures, notes, reports, and outputs can quickly become difficult to manage if they are not organized consistently. Poor organization often leads to broken file paths, duplicated work, missing outputs, and confusion about which version of a file should be used.

This chapter introduces a simple workflow for organizing R projects using RStudio Projects, relative file paths, R Markdown, and bookdown. The goal is not only to keep files tidy, but also to make analyses easier to reproduce, easier to review, and easier to share with others.

By the end of this chapter, you should be able to create a structured RStudio project, organize files into logical folders, work with relative file paths, and understand the role of Git and GitHub in reproducible workflows.

1.1 Organizing a reproducible project

A typical data analysis project includes several different types of files. These often include raw datasets, cleaned datasets, R scripts, figures, tables, notes, and reports. Without a clear structure, projects can quickly become difficult to navigate and maintain.

A good project structure keeps related files together and reduces the likelihood of errors. It also makes collaboration easier because other users can understand the workflow more quickly. Reproducible workflows are an important component of modern data analysis (Xie 2015), and bookdown provides a framework for creating reproducible books and technical documents directly from R Markdown files (Xie 2016).

For this course, all work should be placed inside a single main project folder. Within that folder, create separate folders for raw data, generated outputs, the rendered book, and archived material.

A recommended structure is shown below.

Data-Management-with-R/
│
├── data_management.Rproj
├── index.Rmd
├── 01_project_setup.Rmd
├── 02_tidyverse_basics.Rmd
├── 03_joining_data.Rmd
├── 04_data_cleaning.Rmd
├── 05_strings_regex.Rmd
├── 06_visualization.Rmd
├── 07_exploratory_analysis.Rmd
├── 08_storyboard_reporting.Rmd
│
├── Raw Data/
├── Outputs/
├── docs/
├── archive/
│
├── _bookdown.yml
├── _output.yml
├── style.css
├── README.md
├── LICENSE
└── .gitignore

The .Rproj file allows RStudio to recognize the project and automatically set the correct working directory. The .Rmd files contain the book chapters, while the Raw Data/ folder stores original datasets that should remain unchanged throughout the analysis process.

The Outputs/ folder should contain generated files such as cleaned datasets, tables, and figures. The docs/ folder stores the rendered bookdown website, while the archive/ folder can be used for old drafts or unused material that should not be deleted permanently.

The _bookdown.yml file controls the order of chapters and output settings, and _output.yml controls formatting options for the final rendered book.
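The folders described above could also be created from the R console rather than by hand. The following is a minimal sketch using base R only; the helper name create_project_folders() is hypothetical, and the demonstration runs in a temporary directory so that no real project is touched.

```r
# Hypothetical helper: create the course folders inside a project root
create_project_folders <- function(root = ".") {
  folders <- c("Raw Data", "Outputs", "docs", "archive")
  paths <- file.path(root, folders)
  for (p in paths) {
    # showWarnings = FALSE keeps the call quiet when a folder already exists
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Demonstration in a temporary directory (a stand-in for the real project folder)
demo_root <- file.path(tempdir(), "Data-Management-with-R")
created <- create_project_folders(demo_root)

# List the top-level folders that now exist under the demo root
list.dirs(demo_root, full.names = FALSE, recursive = FALSE)
```

Inside a real project, calling create_project_folders() with no arguments would create the same folders in the current working directory.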

One of the most common beginner mistakes in R is working with files scattered across different folders. Keeping all project files inside a single RStudio Project helps avoid many file path and working directory problems.

1.2 Working with RStudio Projects and file paths

An RStudio Project provides a dedicated working environment for a single analysis or course. Instead of opening individual files manually from different folders, the entire project can be opened through the .Rproj file.

To create a new project in RStudio, select File > New Project, then choose New Directory to start from an empty folder or Existing Directory to use a folder you already have. Once the project is created, opening the .Rproj file automatically sets the project folder as the working directory.

Using projects becomes especially important when working with file paths. Many beginners use absolute file paths that only work on their own computer. For example:

library(readr)
read_csv("C:/Users/YourName/Desktop/Data/Population_Estimates.csv")

Although this works locally, the path will fail on another computer because the directory structure is different.

Instead, use relative paths that begin from the project folder:

read_csv("./Raw Data/Population_Estimates.csv")

Relative paths make projects portable and reproducible because the code can run correctly on different computers as long as the project structure remains the same.
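To see why the working directory matters, the sketch below checks the same relative path from two different working directories. It uses base R only, and the folder demo_project is a throwaway placeholder created in a temporary directory. The call to setwd() appears here only to imitate what opening the .Rproj file does; it is not a recommendation to use setwd() in analysis scripts.

```r
# Throwaway folder standing in for a project (all names here are placeholders)
project_root <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_root, "Raw Data"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(project_root, "Raw Data", "Population_Estimates.csv"))

relative_path <- "./Raw Data/Population_Estimates.csv"

# Opening the .Rproj file sets the working directory for you; setwd() is used
# here only to imitate that step inside a self-contained demonstration
old_wd <- setwd(project_root)
found_in_project <- file.exists(relative_path)

# From any other working directory the same relative path no longer resolves
setwd(dirname(project_root))
found_elsewhere <- file.exists(relative_path)

setwd(old_wd)  # restore the original working directory
c(in_project = found_in_project, elsewhere = found_elsewhere)
```

The relative path is found only while the working directory is the project folder, which is the situation an RStudio Project sets up automatically.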

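Another useful approach is the here package, which builds paths starting from the project root (located through the .Rproj file) regardless of the current working directory:

library(here)
library(readr)

read_csv(
  here("Raw Data", "Population_Estimates.csv")
)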

Both approaches are acceptable, but the important idea is to avoid hard-coded computer-specific paths whenever possible.

1.3 Naming files and managing outputs

Consistent file names make projects easier to understand and maintain. File names should be short, descriptive, and predictable. Spaces are best avoided because a name that contains them has to be quoted or escaped in command-line tools and URLs, so underscores are preferred instead.

Examples of clear file names include:

01_project_setup.Rmd
02_tidyverse_basics.Rmd
cancer_rates.csv
air_quality_summary.csv

Avoid vague names such as:

final_version2_revised_NEW.Rmd

As projects grow, unclear naming conventions can make it difficult to determine which files are current and which files are outdated.
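Part of a naming convention can be checked automatically. The sketch below is one possible rule, not a standard: the hypothetical helper is_tidy_name() accepts names made up only of letters, digits, underscores, hyphens, and dots, and therefore flags names that contain spaces. The third file name is an invented example.

```r
# Hypothetical rule: a tidy name uses only letters, digits, underscores,
# hyphens, and dots (so spaces and other special characters are rejected)
is_tidy_name <- function(file_names) {
  grepl("^[A-Za-z0-9_.-]+$", file_names)
}

candidate_names <- c(
  "01_project_setup.Rmd",
  "cancer_rates.csv",
  "air quality summary.csv"   # contains spaces, so it fails the check
)

name_check <- is_tidy_name(candidate_names)

# Names that should be renamed before they are added to the project
candidate_names[!name_check]
```

A check like this only catches problematic characters. A vague name such as final_version2_revised_NEW.Rmd would still pass, so choosing descriptive names remains a matter of judgment.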

It is also important to separate raw data from generated outputs. Raw datasets should remain unchanged so that the original source data is always preserved. Cleaned datasets, transformed files, tables, and figures should instead be written to the Outputs/ folder.

For example:

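library(dplyr)
library(readr)

# `rates` is assumed to have been read from the Raw Data/ folder already
rates_clean <- rates %>%
  filter(letter == "C")

# Write the cleaned copy to Outputs/ so the raw file stays untouched
write_csv(
  rates_clean,
  "./Outputs/cancer_rates_clean.csv"
)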

This workflow creates a clear separation between original data and processed outputs, which improves reproducibility and transparency.

Avoid manually editing files inside the Raw Data/ folder. If data cleaning is required, create a cleaned version in the Outputs/ folder instead.

1.4 R Markdown and bookdown workflows

R Markdown combines text, code, and results within a single document. This makes it possible to explain the purpose of an analysis while also showing the code and outputs used to produce the results.

In this course, each major topic is stored in a separate .Rmd file, and bookdown combines these files into a single website or book.

The order of chapters is controlled by _bookdown.yml. A typical configuration looks like this:

book_filename: "data-management-with-r"
output_dir: "docs"

rmd_files:
  - "index.Rmd"
  - "01_project_setup.Rmd"
  - "02_tidyverse_basics.Rmd"
  - "03_joining_data.Rmd"
  - "04_data_cleaning.Rmd"
  - "05_strings_regex.Rmd"
  - "06_visualization.Rmd"
  - "07_exploratory_analysis.Rmd"
  - "08_storyboard_reporting.Rmd"

Each chapter file should begin with a single top-level heading, which bookdown uses as the chapter title:

# Chapter Title

Subsections should use lower heading levels:

## Section
### Subsection

Bookdown treats every # heading as the start of a new chapter. Additional # headings inside the same file therefore create extra chapters, which leads to a disorganized sidebar and navigation structure.
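Once the configuration and chapter files are in place, the book could be built from the R console. The sketch below is one way to do this and assumes the bookdown package is installed; the wrapper build_book() is hypothetical and simply skips rendering when index.Rmd is not found. No output format is passed, so render_book() falls back to the first format defined for the project.

```r
# Hypothetical helper: build the book only when the project files are present
build_book <- function(project_root = ".") {
  index_file <- file.path(project_root, "index.Rmd")
  if (!file.exists(index_file) || !requireNamespace("bookdown", quietly = TRUE)) {
    message("Nothing rendered: index.Rmd or the bookdown package was not found.")
    return(invisible(FALSE))
  }
  # Render from the project root so that _bookdown.yml and _output.yml are found
  old_wd <- setwd(project_root)
  on.exit(setwd(old_wd), add = TRUE)
  bookdown::render_book("index.Rmd")
  invisible(TRUE)
}

# Called on a folder that does not contain a book, the helper renders nothing
book_built <- build_book(tempfile("not_a_book_"))
```

In an open RStudio Project, build_book() with no arguments would render the chapters listed in _bookdown.yml into the configured output folder.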

A typical R Markdown chapter follows a consistent sequence:

  1. Introduce the purpose of the analysis.
  2. Load the required packages.
  3. Read the data.
  4. Clean or transform the data.
  5. Create summaries, tables, or figures.
  6. Explain the results.
  7. Save outputs if needed.

This structure helps readers understand not only what the code does, but also why each step is being performed.
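The sequence above can be sketched as a short script. This is an illustration only: it uses base R and the built-in airquality dataset as a stand-in for a file in Raw Data/, and it writes to a temporary Outputs folder rather than the real one. The object and file names are placeholders.

```r
# 1. Purpose: summarise the mean temperature for each month (illustrative only)

# 2. Load the required packages: only base R is needed for this sketch

# 3. Read the data: a built-in dataset stands in for a file in Raw Data/
air <- datasets::airquality

# 4. Clean the data: drop rows where the ozone measurement is missing
air_clean <- air[!is.na(air$Ozone), ]

# 5. Create a summary table: mean temperature by month
monthly_temp <- aggregate(Temp ~ Month, data = air_clean, FUN = mean)

# 6. Explain the results: in an .Rmd file this would be written as text
#    directly below the code chunk

# 7. Save the output: a temporary folder stands in for Outputs/
out_dir <- file.path(tempdir(), "Outputs")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "monthly_temp.csv")
write.csv(monthly_temp, out_file, row.names = FALSE)
```

In a chapter file, each numbered step would usually become its own code chunk, with the explanation written as ordinary text between the chunks.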

1.5 Git, GitHub, and reproducible analysis

As projects become larger and more collaborative, version control becomes increasingly important. Git is a version control system that tracks changes made to files over time, while GitHub provides an online platform for storing and sharing Git repositories.

Using Git and GitHub allows you to maintain a history of your work, recover earlier versions of files, collaborate with others safely, and create a backup of your project outside your local computer.

RStudio includes built-in Git support, allowing Git operations to be performed directly within the RStudio interface.

The recommended workflow is:

  1. Create a GitHub repository.
  2. Clone the repository into an RStudio Project.
  3. Edit files locally.
  4. Commit changes regularly.
  5. Push updates back to GitHub.
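The Git pane in RStudio runs ordinary Git commands behind the scenes. The sketch below walks through the same steps from R so that they can be tried safely offline. It rests on several assumptions: Git is installed and on the PATH, a local bare repository stands in for the GitHub repository, and the helper run_git(), the folder names, and the demo identity are all hypothetical.

```r
# Hypothetical wrapper around the git command-line tool
run_git <- function(..., dir = NULL) {
  args <- c(...)
  if (!is.null(dir)) args <- c("-C", dir, args)  # run the command inside `dir`
  system2("git", shQuote(args), stdout = TRUE, stderr = TRUE)
}

if (nzchar(Sys.which("git"))) {
  sandbox <- tempfile("git_demo_")
  dir.create(sandbox)
  remote <- file.path(sandbox, "remote.git")  # stand-in for the GitHub repository
  work <- file.path(sandbox, "work")          # stand-in for the RStudio Project folder

  run_git("init", "--bare", remote)           # 1. create the repository
  run_git("clone", remote, work)              # 2. clone it to a local folder
  writeLines("Data Management with R", file.path(work, "README.md"))  # 3. edit files
  run_git("add", "README.md", dir = work)     # 4. stage and commit the change
  run_git("-c", "user.name=Demo User", "-c", "user.email=demo@example.com",
          "commit", "-m", "Add README", dir = work)
  run_git("push", "origin", "HEAD", dir = work)  # 5. push the commit to the remote

  # The remote repository now contains the pushed commit
  pushed <- run_git("log", "--oneline", dir = remote)
}
```

With a real GitHub repository, step 1 happens on the GitHub website, step 2 uses the repository URL, and the identity is normally configured once rather than passed with each commit.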

For detailed guidance, the online resource Happy Git with R provides an excellent introduction:

https://happygitwithr.com/

Git and GitHub are strongly recommended for reproducible workflows, but they are not mandatory for completing this course. You can still complete all workshop activities locally using RStudio Projects and bookdown.

1.6 Chapter summary

In this chapter, you learned how to organize a reproducible R project using RStudio Projects, relative paths, R Markdown, and bookdown. You also learned how to separate raw data from generated outputs, organize chapter files consistently, and structure projects in a way that improves reproducibility and collaboration.

A clean project structure makes analyses easier to understand, easier to maintain, and easier to share with others. As projects grow larger and more complex, good organizational practices become increasingly important.

References

Xie, Yihui. 2015. Dynamic Documents with R and knitr. Chapman and Hall/CRC.
Xie, Yihui. 2016. bookdown: Authoring Books and Technical Documents with R Markdown. Chapman and Hall/CRC. https://bookdown.org/yihui/bookdown/.