Data Management with R

Recommended Chapter Flow

The book is organized into the following learning sequence:

  1. Project Setup
    Set up an RStudio Project, organize folders, and understand reproducible workflows.

  2. Tidyverse Basics
    Learn core tidyverse functions for reading, selecting, filtering, mutating, and summarizing data.

  3. Joining Data
    Practice combining datasets using keys and joins, with examples from the Titanic dataset.

  4. Data Cleaning
    Work with raw files, column names, missing values, reshaping, and structured cleaning steps.

  5. Strings and Regular Expressions
    Use stringr functions such as str_extract(), tidyr's separate(), and regex patterns to clean and parse text data.

  6. Data Visualization
    Create clear visual summaries using ggplot2, including line plots, bar charts, and grouped visualizations.

  7. Exploratory Data Analysis
    Build an organized EDA workflow using summaries, plots, missing-value checks, and data validation.

  8. Communicating Results
    Use storyboards, dashboards, and a clear reporting structure to communicate data insights.