Chapter 4 Data Cleaning and Data Management

Data cleaning is one of the most important stages of any analytical workflow. In real-world projects, raw data are rarely ready for analysis immediately after import. Datasets often contain inconsistent variable names, missing values, formatting issues, duplicate records, encoding problems, or variables stored in unsuitable formats.

This chapter introduces several common data management tasks using the tidyverse. You will learn how to import multiple file types, inspect datasets, identify common data quality issues, reshape data into tidy formats, and recode variables into more useful forms.

Although the datasets used in this course are simplified for teaching purposes, the workflow closely reflects many real-world data cleaning tasks encountered in epidemiology, environmental health, and population data analysis.

4.1 Preparing the R environment

Before beginning any analysis, the required libraries should be loaded. For this chapter, we will primarily use packages from the tidyverse ecosystem, along with ggplot2 and readxl.

If these packages are not already installed on your computer, they can be installed using:

install.packages(
  c("tidyverse", "ggplot2", "readxl")
)

Once installed, the libraries can be loaded into the R session.

library(tidyverse)
library(ggplot2)
library(readxl)

4.2 Exploring project files

Before importing data, it is often useful to inspect the contents of the project folders. The list.files() function allows you to display all files stored within a directory.

list.files("./Raw Data/")

## [1] "Corr_2016 copy.csv"           
## [2] "Corr_2016.csv"                
## [3] "Data Dictionary - Blank.xlsx" 
## [4] "Data Dictionary - Filled.xlsx"
## [5] "deaths_2016.xlsx"             
## [6] "Population_Estimates.csv"     
## [7] "titanic.csv"                  
## [8] "Titantic_DataDictionary.xlsx" 
## [9] "Weather_data.csv"

You can also search for specific file types using the pattern argument.

list.files(
  "./Raw Data/",
  pattern = ".csv"
)

## [1] "Corr_2016 copy.csv"       "Corr_2016.csv"           
## [3] "Population_Estimates.csv" "titanic.csv"             
## [5] "Weather_data.csv"

list.files(
  "./Raw Data/",
  pattern = ".xlsx"
)

## [1] "Data Dictionary - Blank.xlsx" 
## [2] "Data Dictionary - Filled.xlsx"
## [3] "deaths_2016.xlsx"             
## [4] "Titantic_DataDictionary.xlsx"

Functions such as list.files() are frequently overlooked by beginners, but they become extremely useful when working with large projects that contain many datasets or outputs.

Some datasets may also be stored in compressed .zip files. These files should typically be extracted before analysis.

unzip(
  "./Raw Data/Corr_2016.zip",
  exdir = "./Raw Data"
)

The exdir argument specifies the folder where the extracted files should be saved. Since files only need to be unzipped once, this code is often commented out after the extraction is complete.

When working in R Markdown documents, avoid repeatedly running file extraction code unless necessary. Re-extracting files every time the document is knitted can slow down workflows and create duplicate files.

4.3 Description of the datasets

This course uses four datasets that simulate a simplified population health analysis workflow.

The mortality dataset contains records of deaths, including demographic information and ICD-10 cause-of-death codes. Although the dataset is based on realistic structures and counts, the data themselves are fictitious for teaching purposes.

The population dataset contains estimates of population counts by age group, sex, and Health Service Delivery Area in British Columbia.

The correspondence dataset links lower geographic units to larger health regions and service delivery areas. Correspondence files are commonly used in population health and environmental research because they allow different geographic datasets to be connected.

The environmental dataset contains weather-related variables by postal code, including measures related to temperature and cold exposure.

The overall analytical goal is to combine these datasets in order to calculate mortality rates and explore potential environmental relationships.

4.4 Importing data into R

Different file formats require different import functions. The readr package is commonly used for .csv files, while Excel files require functions from the readxl package.

The following code imports the datasets used in this chapter.

mort <- read_excel(
  "./Raw Data/deaths_2016.xlsx"
)

pop <- read_csv(
  "./Raw Data/Population_Estimates.csv"
)

corr <- read_csv(
  "./Raw Data/Corr_2016.csv",
  locale = readr::locale(
    encoding = "latin1"
  )
)

env <- read_csv(
  "./Raw Data/Weather_data.csv"
)

Notice that the correspondence file uses a specific text encoding. Encoding problems are common when datasets contain accented characters or multilingual text.

4.5 Inspecting datasets

After importing data, it is important to examine the structure and contents of each dataset before beginning analysis.

Several functions provide useful high-level summaries.

str(mort)

## tibble [60,143 × 13] (S3: tbl_df/tbl/data.frame)
##  $ ID               : num [1:60143] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
##  $ Sex              : chr [1:60143] "M" "F" "M" "M" ...
##  $ Cause_of_death   : chr [1:60143] "D47" "C34" "C26" "C16" ...
##  $ dbuid2016        : num [1:60143] 1.31e+10 4.81e+10 3.52e+10 2.47e+10 2.46e+10 ...
##  $ Location_of_death: num [1:60143] 5 3 6 1 6 5 4 1 3 2 ...
##  $ Marital_status   : num [1:60143] 1 1 2 1 1 1 1 1 1 2 ...
##  $ Postalcode       : chr [1:60143] "E1A1E9" "T2E0T2" "M4J1L1" "H9H1B4" ...
##  $ B_year           : num [1:60143] 1943 1943 1953 1923 1931 ...
##  $ B_month          : num [1:60143] 7 5 8 6 8 11 5 8 7 2 ...
##  $ B_day            : num [1:60143] 8 22 9 14 8 4 20 18 23 7 ...
##  $ D_year           : num [1:60143] 2016 2016 2016 2016 2016 ...
##  $ D_month          : num [1:60143] 9 12 4 11 4 2 2 6 5 1 ...
##  $ D_day            : num [1:60143] 22 14 21 21 4 6 9 13 25 16 ...

summary(mort)

##        ID              Sex            Cause_of_death    
##  Min.   :1000001   Length:60143       Length:60143      
##  1st Qu.:1016187   Class :character   Class :character  
##  Median :1034842   Mode  :character   Mode  :character  
##  Mean   :1036699                                        
##  3rd Qu.:1056221                                        
##  Max.   :1080708                                        
##                                                         
##    dbuid2016         Location_of_death Marital_status
##  Min.   :1.001e+10   Min.   :1.000     Min.   :1.0   
##  1st Qu.:2.466e+10   1st Qu.:2.000     1st Qu.:1.0   
##  Median :3.521e+10   Median :4.000     Median :1.0   
##  Mean   :3.642e+10   Mean   :3.498     Mean   :1.5   
##  3rd Qu.:4.715e+10   3rd Qu.:5.000     3rd Qu.:2.0   
##  Max.   :6.208e+10   Max.   :6.000     Max.   :2.0   
##  NA's   :4                                           
##   Postalcode            B_year        B_month      
##  Length:60143       Min.   :1827   Min.   : 1.000  
##  Class :character   1st Qu.:1933   1st Qu.: 4.000  
##  Mode  :character   Median :1941   Median : 7.000  
##                     Mean   :1942   Mean   : 6.513  
##                     3rd Qu.:1951   3rd Qu.:10.000  
##                     Max.   :2016   Max.   :28.000  
##                     NA's   :24     NA's   :26      
##      B_day           D_year        D_month      
##  Min.   : 1.00   Min.   :2016   Min.   : 1.000  
##  1st Qu.: 8.00   1st Qu.:2016   1st Qu.: 4.000  
##  Median :16.00   Median :2016   Median : 7.000  
##  Mean   :15.74   Mean   :2016   Mean   : 6.521  
##  3rd Qu.:23.00   3rd Qu.:2016   3rd Qu.: 9.000  
##  Max.   :31.00   Max.   :2016   Max.   :21.000  
##  NA's   :26                                     
##      D_day      
##  Min.   : 1.00  
##  1st Qu.: 8.00  
##  Median :16.00  
##  Mean   :15.78  
##  3rd Qu.:23.00  
##  Max.   :31.00  
##

dplyr::glimpse(pop, width = 60)

## Rows: 51
## Columns: 25
## $ ...1                           <dbl> 0, 0, 0, 11, 11, 11…
## $ `Health Service Delivery Area` <chr> "British Columbia",…
## $ Year                           <dbl> 2016, 2016, 2016, 2…
## $ Gender                         <chr> "M", "F", "T", "M",…
## $ `<1`                           <dbl> 22997, 21760, 44757…
## $ `04-Jan`                       <dbl> 93211, 88300, 18151…
## $ `09-May`                       <dbl> 121341, 112777, 234…
## $ `14-Oct`                       <dbl> 119586, 112424, 232…
## $ `15-19`                        <dbl> 140451, 132226, 272…
## $ `20-24`                        <dbl> 170968, 156161, 327…
## $ `25-29`                        <dbl> 159609, 159389, 318…
## $ `30-34`                        <dbl> 162416, 166826, 329…
## $ `35-39`                        <dbl> 155936, 157628, 313…
## $ `40-44`                        <dbl> 151311, 154187, 305…
## $ `45-49`                        <dbl> 161823, 166603, 328…
## $ `50-54`                        <dbl> 172367, 179295, 351…
## $ `55-59`                        <dbl> 172667, 178878, 351…
## $ `60-64`                        <dbl> 156654, 159977, 316…
## $ `65-69`                        <dbl> 139068, 142469, 281…
## $ `70-74`                        <dbl> 99527, 103484, 2030…
## $ `75-79`                        <dbl> 69574, 77160, 14673…
## $ `80-84`                        <dbl> 49268, 57639, 10690…
## $ `85-89`                        <dbl> 28898, 40789, 69687…
## $ `90+`                          <dbl> 13366, 28648, 42014…
## $ Total                          <dbl> 2361038, 2396620, 4…

summary(pop)

##       ...1       Health Service Delivery Area
##  Min.   : 0.00   Length:51                   
##  1st Qu.:14.00   Class :character            
##  Median :31.00   Mode  :character            
##  Mean   :29.06                               
##  3rd Qu.:42.00                               
##  Max.   :53.00                               
##       Year         Gender                <1       
##  Min.   :2016   Length:51          Min.   :  276  
##  1st Qu.:2016   Class :character   1st Qu.:  714  
##  Median :2016   Mode  :character   Median : 1425  
##  Mean   :2016                      Mean   : 3510  
##  3rd Qu.:2016                      3rd Qu.: 2802  
##  Max.   :2016                      Max.   :44757  
##      04-Jan           09-May           14-Oct      
##  Min.   :  1356   Min.   :  1755   Min.   :  1811  
##  1st Qu.:  3118   1st Qu.:  4194   1st Qu.:  3910  
##  Median :  5663   Median :  8002   Median :  7808  
##  Mean   : 14236   Mean   : 18362   Mean   : 18197  
##  3rd Qu.: 11025   3rd Qu.: 15384   3rd Qu.: 15350  
##  Max.   :181511   Max.   :234118   Max.   :232010  
##      15-19            20-24            25-29       
##  Min.   :  1919   Min.   :  1577   Min.   :  1640  
##  1st Qu.:  4240   1st Qu.:  4214   1st Qu.:  4169  
##  Median :  8836   Median :  9193   Median :  8646  
##  Mean   : 21386   Mean   : 25657   Mean   : 25019  
##  3rd Qu.: 17866   3rd Qu.: 22940   3rd Qu.: 21291  
##  Max.   :272677   Max.   :327129   Max.   :318998  
##      30-34            35-39            40-44       
##  Min.   :  2069   Min.   :  2092   Min.   :  2008  
##  1st Qu.:  4415   1st Qu.:  4572   1st Qu.:  4483  
##  Median :  8830   Median :  9490   Median :  9551  
##  Mean   : 25823   Mean   : 24593   Mean   : 23961  
##  3rd Qu.: 22644   3rd Qu.: 21896   3rd Qu.: 21345  
##  Max.   :329242   Max.   :313564   Max.   :305498  
##      45-49            50-54            55-59       
##  Min.   :  2040   Min.   :  2137   Min.   :  2038  
##  1st Qu.:  4859   1st Qu.:  5431   1st Qu.:  5684  
##  Median :  9930   Median : 10666   Median : 10972  
##  Mean   : 25759   Mean   : 27581   Mean   : 27572  
##  3rd Qu.: 23211   3rd Qu.: 24455   3rd Qu.: 22498  
##  Max.   :328426   Max.   :351662   Max.   :351545  
##      60-64            65-69            70-74       
##  Min.   :  1550   Min.   :  1119   Min.   :   784  
##  1st Qu.:  5349   1st Qu.:  5006   1st Qu.:  3628  
##  Median : 10698   Median : 10011   Median :  7256  
##  Mean   : 24834   Mean   : 22081   Mean   : 15922  
##  3rd Qu.: 19486   3rd Qu.: 16891   3rd Qu.: 12128  
##  Max.   :316631   Max.   :281537   Max.   :203011  
##      75-79            80-84            85-89      
##  Min.   :   538   Min.   :   351   Min.   :  196  
##  1st Qu.:  2332   1st Qu.:  1612   1st Qu.:  973  
##  Median :  5026   Median :  3867   Median : 2636  
##  Mean   : 11509   Mean   :  8385   Mean   : 5466  
##  3rd Qu.:  9268   3rd Qu.:  6582   3rd Qu.: 4543  
##  Max.   :146734   Max.   :106907   Max.   :69687  
##       90+              Total        
##  Min.   :  114.0   Min.   :  32171  
##  1st Qu.:  536.5   1st Qu.:  70681  
##  Median : 1394.0   Median : 145321  
##  Mean   : 3295.2   Mean   : 373150  
##  3rd Qu.: 2955.0   3rd Qu.: 329569  
##  Max.   :42014.0   Max.   :4757658

dplyr::glimpse(corr, width = 60)

## Rows: 420,963
## Columns: 6
## $ dbuid2016      <dbl> 10010165001, 10010165002, 100101650…
## $ csduid2016     <dbl> 1001519, 1001519, 1001519, 1001519,…
## $ hruid2017      <dbl> 1011, 1011, 1011, 1011, 1011, 1011,…
## $ hrname_english <chr> "Eastern Regional Health Authority"…
## $ hrname_french  <chr> "Eastern Regional Health Authority"…
## $ dbpop2016      <dbl> 160, 25, 268, 53, 71, 217, 39, 120,…

summary(corr)

##    dbuid2016           csduid2016        hruid2017   
##  Min.   :1.001e+10   Min.   :1001101   Min.   :1011  
##  1st Qu.:2.466e+10   1st Qu.:2466023   1st Qu.:2414  
##  Median :3.558e+10   Median :3558090   Median :3595  
##  Mean   :3.856e+10   Mean   :3855572   Mean   :3856  
##  3rd Qu.:4.806e+10   3rd Qu.:4806016   3rd Qu.:4832  
##  Max.   :6.208e+10   Max.   :6208098   Max.   :6201  
##                                                      
##  hrname_english     hrname_french        dbpop2016      
##  Length:420963      Length:420963      Min.   :   0.00  
##  Class :character   Class :character   1st Qu.:   5.00  
##  Mode  :character   Mode  :character   Median :  26.00  
##                                        Mean   :  69.44  
##                                        3rd Qu.:  77.00  
##                                        Max.   :7607.00  
##                                        NA's   :35

str(env)

## spc_tbl_ [116,011 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ POSTALCODE12: chr [1:116011] "V0C1E0" "V0C1W0" "V0C2X0" "V0C2Z0" ...
##  $ WTHNRC12_01 : num [1:116011] 27.1 30.3 28.2 28.7 23.1 ...
##  $ WTHNRC12_02 : num [1:116011] -46.7 -46 -42.3 -44.9 -35 ...
##  $ WTHNRC12_03 : num [1:116011] -2.605 -1.868 -1.963 -1.286 0.338 ...
##  $ WTHNRC12_04 : num [1:116011] 2.91 3.92 4.01 4.32 4.33 ...
##  $ WTHNRC12_05 : num [1:116011] -8.12 -7.65 -7.93 -6.89 -3.65 ...
##  $ WTHNRC12_06 : num [1:116011] 11.04 11.57 11.94 11.2 7.98 ...
##  $ WTHNRC12_07 : num [1:116011] 171 198 248 175 310 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   POSTALCODE12 = col_character(),
##   ..   WTHNRC12_01 = col_double(),
##   ..   WTHNRC12_02 = col_double(),
##   ..   WTHNRC12_03 = col_double(),
##   ..   WTHNRC12_04 = col_double(),
##   ..   WTHNRC12_05 = col_double(),
##   ..   WTHNRC12_06 = col_double(),
##   ..   WTHNRC12_07 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(env)

##  POSTALCODE12        WTHNRC12_01     WTHNRC12_02    
##  Length:116011      Min.   :19.28   Min.   :-46.74  
##  Class :character   1st Qu.:31.07   1st Qu.:-11.44  
##  Mode  :character   Median :31.28   Median : -8.40  
##                     Mean   :31.62   Mean   :-11.67  
##                     3rd Qu.:31.76   3rd Qu.: -7.49  
##                     Max.   :39.07   Max.   : -4.48  
##   WTHNRC12_03      WTHNRC12_04      WTHNRC12_05    
##  Min.   :-2.605   Min.   : 2.913   Min.   :-8.122  
##  1st Qu.: 9.693   1st Qu.:13.454   1st Qu.: 5.750  
##  Median :10.140   Median :13.887   Median : 6.612  
##  Mean   : 9.573   Mean   :13.521   Mean   : 5.625  
##  3rd Qu.:10.449   3rd Qu.:14.006   3rd Qu.: 6.893  
##  Max.   :10.551   Max.   :16.227   Max.   : 7.132  
##   WTHNRC12_06      WTHNRC12_07    
##  Min.   : 4.336   Min.   : 104.6  
##  1st Qu.: 6.936   1st Qu.:1072.1  
##  Median : 7.151   Median :1316.4  
##  Mean   : 7.896   Mean   :1110.7  
##  3rd Qu.: 7.935   3rd Qu.:1399.0  
##  Max.   :13.498   Max.   :3076.5

The str() function displays the structure of the dataset, including variable types and dimensions. The summary() function provides summary statistics and quick descriptive information.

Another useful function is glimpse() from dplyr.

glimpse(mort)

## Rows: 60,143
## Columns: 13
## $ ID                <dbl> 1000001, 1000002, 1000003, 10000…
## $ Sex               <chr> "M", "F", "M", "M", "M", "F", "F…
## $ Cause_of_death    <chr> "D47", "C34", "C26", "C16", "C22…
## $ dbuid2016         <dbl> 13070175019, 48060924001, 352045…
## $ Location_of_death <dbl> 5, 3, 6, 1, 6, 5, 4, 1, 3, 2, 5,…
## $ Marital_status    <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2,…
## $ Postalcode        <chr> "E1A1E9", "T2E0T2", "M4J1L1", "H…
## $ B_year            <dbl> 1943, 1943, 1953, 1923, 1931, 19…
## $ B_month           <dbl> 7, 5, 8, 6, 8, 11, 5, 8, 7, 2, 1…
## $ B_day             <dbl> 8, 22, 9, 14, 8, 4, 20, 18, 23, …
## $ D_year            <dbl> 2016, 2016, 2016, 2016, 2016, 20…
## $ D_month           <dbl> 9, 12, 4, 11, 4, 2, 2, 6, 5, 1, …
## $ D_day             <dbl> 22, 14, 21, 21, 4, 6, 9, 13, 25,…

Compared with str(), glimpse() often provides a cleaner and easier-to-read overview of the data frame.

Initial data inspection is one of the most important parts of data cleaning because it helps identify:

missing values;
unusual variable names;
incorrect data types;
inconsistent formatting; and
unexpected values.

4.6 Exploring and validating the data

A useful next step is performing simple exploratory checks to understand the datasets more fully.

For example, we can identify all health regions in the correspondence dataset.

corr %>%
  distinct(hrname_english) %>%
  mutate(
    hrname_english = iconv(
      hrname_english,
      from = "latin1",
      to = "UTF-8",
      sub = ""
    )
  ) %>%
  arrange(hrname_english)

## # A tibble: 96 × 1
##    hrname_english                                       
##    <chr>                                                
##  1 Athabasca Health Authority                           
##  2 Brant County Health Unit                             
##  3 Calgary Zone                                         
##  4 Central Regional Health Authority                    
##  5 Central Vancouver Island Health Service Delivery Area
##  6 Central Zone                                         
##  7 Chatham-Kent Health Unit                             
##  8 City of Hamilton Health Unit                         
##  9 City of Toronto Health Unit                          
## 10 Cypress Regional Health Authority                    
## # ℹ 86 more rows

When working with multilingual datasets, encoding issues may cause characters to display incorrectly. Re-importing the file using the appropriate encoding often resolves these problems.

corr <- read_csv(
  "./Raw Data/Corr_2016.csv",
  locale = readr::locale(
    encoding = "latin1"
  )
)

corr %>%
  distinct(hrname_english) %>%
  arrange(hrname_english)

## # A tibble: 96 × 1
##    hrname_english                                       
##    <chr>                                                
##  1 Athabasca Health Authority                           
##  2 Brant County Health Unit                             
##  3 Calgary Zone                                         
##  4 Central Regional Health Authority                    
##  5 Central Vancouver Island Health Service Delivery Area
##  6 Central Zone                                         
##  7 Chatham-Kent Health Unit                             
##  8 City of Hamilton Health Unit                         
##  9 City of Toronto Health Unit                          
## 10 Cypress Regional Health Authority                    
## # ℹ 86 more rows

We can also calculate the number of health regions in the dataset.

no_regions <- corr %>%
  distinct(hrname_english) %>%
  nrow()

There are 96 health regions in the correspondence dataset.

Additional exploratory summaries can help validate the imported data.

mort %>%
  count(Sex)

## # A tibble: 3 × 2
##   Sex       n
##   <chr> <int>
## 1 F     28630
## 2 M     31500
## 3 <NA>     13

pop %>%
  summarize(
    Total_below_1 = sum(`<1`)
  )

## # A tibble: 1 × 1
##   Total_below_1
##           <dbl>
## 1        179028

These simple checks help verify that the datasets were imported correctly and that the variables contain plausible values.

4.7 Cleaning variable names and reshaping data

One common issue in real-world datasets is poorly formatted variable names. Variables that begin with numbers or contain symbols can make coding more difficult.

For example:

names(pop)

##  [1] "...1"                        
##  [2] "Health Service Delivery Area"
##  [3] "Year"                        
##  [4] "Gender"                      
##  [5] "<1"                          
##  [6] "04-Jan"                      
##  [7] "09-May"                      
##  [8] "14-Oct"                      
##  [9] "15-19"                       
## [10] "20-24"                       
## [11] "25-29"                       
## [12] "30-34"                       
## [13] "35-39"                       
## [14] "40-44"                       
## [15] "45-49"                       
## [16] "50-54"                       
## [17] "55-59"                       
## [18] "60-64"                       
## [19] "65-69"                       
## [20] "70-74"                       
## [21] "75-79"                       
## [22] "80-84"                       
## [23] "85-89"                       
## [24] "90+"                         
## [25] "Total"

colnames(pop)

##  [1] "...1"                        
##  [2] "Health Service Delivery Area"
##  [3] "Year"                        
##  [4] "Gender"                      
##  [5] "<1"                          
##  [6] "04-Jan"                      
##  [7] "09-May"                      
##  [8] "14-Oct"                      
##  [9] "15-19"                       
## [10] "20-24"                       
## [11] "25-29"                       
## [12] "30-34"                       
## [13] "35-39"                       
## [14] "40-44"                       
## [15] "45-49"                       
## [16] "50-54"                       
## [17] "55-59"                       
## [18] "60-64"                       
## [19] "65-69"                       
## [20] "70-74"                       
## [21] "75-79"                       
## [22] "80-84"                       
## [23] "85-89"                       
## [24] "90+"                         
## [25] "Total"

Some age-group variables contain names such as "04-Jan" or "09-May", which are not ideal variable names for analysis.The janitor package is also useful for cleaning messy column names and inspecting datasets Firke (2026).

One approach is to rename problematic variables directly.

pop2 <- pop %>%
  rename(
    "01-04" = "04-Jan",
    "05-09" = "09-May",
    "10-14" = "14-Oct"
  )

Another important data-cleaning task is reshaping data into a tidy format. In tidy data, each variable has its own column, each observation has its own row, and each value occupies a single cell.

The pivot_longer() function is commonly used to convert wide datasets into long format.

pop_long <- pop %>%
  
  select(
    -any_of(c("X1", "Total"))
  ) %>%
  
  pivot_longer(
    cols = -any_of(
      c(
        "Year",
        "Gender",
        "Health Service Delivery Area"
      )
    ),
    names_to = "Age",
    values_to = "Value"
  )

pop_long

## # A tibble: 1,071 × 5
##    `Health Service Delivery Area`  Year Gender Age     Value
##    <chr>                          <dbl> <chr>  <chr>   <dbl>
##  1 British Columbia                2016 M      ...1        0
##  2 British Columbia                2016 M      <1      22997
##  3 British Columbia                2016 M      04-Jan  93211
##  4 British Columbia                2016 M      09-May 121341
##  5 British Columbia                2016 M      14-Oct 119586
##  6 British Columbia                2016 M      15-19  140451
##  7 British Columbia                2016 M      20-24  170968
##  8 British Columbia                2016 M      25-29  159609
##  9 British Columbia                2016 M      30-34  162416
## 10 British Columbia                2016 M      35-39  155936
## # ℹ 1,061 more rows

The resulting dataset stores age groups within a single variable rather than across many separate columns.

After reshaping the data, some values may still require cleaning or recoding. The case_when() function is often used for this purpose.

pop_long <- pop_long %>%
  
  mutate(
    Age = case_when(
      Age == "04-Jan" ~ "01-04",
      Age == "09-May" ~ "05-09",
      Age == "14-Oct" ~ "10-14",
      TRUE ~ Age
    )
  )

The final line, TRUE ~ Age, is important because it preserves all values that do not match the earlier conditions.

If this line is removed, unmatched rows become missing values.

pop %>%
  
  select(
    -any_of(c("X1", "Total"))
  ) %>%
  
  pivot_longer(
    cols = -any_of(
      c(
        "Year",
        "Gender",
        "Health Service Delivery Area"
      )
    ),
    names_to = "Age",
    values_to = "Value"
  ) %>%
  
  mutate(
    Age = case_when(
      Age == "04-Jan" ~ "01-04",
      Age == "09-May" ~ "05-09",
      Age == "14-Oct" ~ "10-14"
    )
  )

## # A tibble: 1,071 × 5
##    `Health Service Delivery Area`  Year Gender Age    Value
##    <chr>                          <dbl> <chr>  <chr>  <dbl>
##  1 British Columbia                2016 M      <NA>       0
##  2 British Columbia                2016 M      <NA>   22997
##  3 British Columbia                2016 M      01-04  93211
##  4 British Columbia                2016 M      05-09 121341
##  5 British Columbia                2016 M      10-14 119586
##  6 British Columbia                2016 M      <NA>  140451
##  7 British Columbia                2016 M      <NA>  170968
##  8 British Columbia                2016 M      <NA>  159609
##  9 British Columbia                2016 M      <NA>  162416
## 10 British Columbia                2016 M      <NA>  155936
## # ℹ 1,061 more rows

Always verify the output after recoding variables. Small coding mistakes can unintentionally convert valid values into missing values.

4.8 Practical considerations for R Markdown workflows

When working with R Markdown documents, remember that every code chunk is executed again during knitting. Code that modifies files, unzips folders, or downloads data should therefore be used carefully.

As projects grow larger, keeping code organized and avoiding repeated operations becomes increasingly important for reproducible workflows.

You should also begin thinking about how analyses are documented. A good R Markdown document not only produces correct outputs, but also explains the reasoning behind each analytical step.

4.9 Chapter summary

In this chapter, you learned several important data cleaning and management techniques using the tidyverse. You imported multiple datasets, explored project files, inspected dataset structures, handled encoding issues, cleaned variable names, reshaped data into tidy formats, and recoded variables using case_when().

These skills form the foundation of reproducible data analysis workflows and will be used extensively throughout the remainder of the course.

References

Firke, Sam. 2026. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://cran.r-project.org/package=janitor.