Data Management with R
Chapter 31
1. Load packages
library(tidyverse)
library(readxl)
library(lubridate)
library(stringr)
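If any of these packages is not installed, `library()` stops with an error. The following is a minimal sketch of one way to check for missing packages before loading them. The variable names `required_packages` and `missing_packages` are illustrative assumptions and do not come from the book.

```r
# Packages loaded in this chapter (same four as the library() calls above).
required_packages <- c("tidyverse", "readxl", "lubridate", "stringr")

# requireNamespace() returns TRUE when a package is installed and can be
# loaded, and FALSE otherwise; it does not attach the package.
is_installed <- vapply(
  required_packages,
  requireNamespace,
  logical(1),
  quietly = TRUE
)

# Keep only the names of packages that are not available.
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  message("Missing packages: ", paste(missing_packages, collapse = ", "))
  # Uncomment to install them (needs an internet connection):
  # install.packages(missing_packages)
} else {
  message("All required packages are installed.")
}
```

Running this check once at the start of a session is usually enough; the `library()` calls are still needed afterwards to attach the packages.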