Chapter 4 Data Cleaning and Data Management

Data cleaning is one of the most important stages of any analytical workflow. In real-world projects, raw data are rarely ready for analysis immediately after import. Datasets often contain inconsistent variable names, missing values, formatting issues, duplicate records, encoding problems, or variables stored in unsuitable formats.

This chapter introduces several common data management tasks using the tidyverse. You will learn how to import multiple file types, inspect datasets, identify common data quality issues, reshape data into tidy formats, and recode variables into more useful forms.

Although the datasets used in this course are simplified for teaching purposes, the workflow closely reflects many real-world data cleaning tasks encountered in epidemiology, environmental health, and population data analysis.

4.1 Preparing the R environment

Before beginning any analysis, the required libraries should be loaded. For this chapter, we will primarily use packages from the tidyverse ecosystem, along with ggplot2 and readxl.

If these packages are not already installed on your computer, they can be installed using:

install.packages(
  c("tidyverse", "ggplot2", "readxl")
)

Once installed, the libraries can be loaded into the R session.

library(tidyverse)
library(ggplot2)
library(readxl)

4.2 Exploring project files

Before importing data, it is often useful to inspect the contents of the project folders. The list.files() function allows you to display all files stored within a directory.

list.files("./Raw Data/")
## [1] "Corr_2016 copy.csv"            "Corr_2016.csv"                
## [3] "Data Dictionary - Blank.xlsx"  "Data Dictionary - Filled.xlsx"
## [5] "deaths_2016.xlsx"              "Population_Estimates.csv"     
## [7] "titanic.csv"                   "Titantic_DataDictionary.xlsx" 
## [9] "Weather_data.csv"

You can also search for specific file types using the pattern argument.

list.files(
  "./Raw Data/",
  pattern = ".csv"
)
## [1] "Corr_2016 copy.csv"       "Corr_2016.csv"            "Population_Estimates.csv"
## [4] "titanic.csv"              "Weather_data.csv"
list.files(
  "./Raw Data/",
  pattern = ".xlsx"
)
## [1] "Data Dictionary - Blank.xlsx"  "Data Dictionary - Filled.xlsx"
## [3] "deaths_2016.xlsx"              "Titantic_DataDictionary.xlsx"

Functions such as list.files() are frequently overlooked by beginners, but they become extremely useful when working with large projects that contain many datasets or outputs.

Some datasets may also be stored in compressed .zip files. These files should typically be extracted before analysis.

unzip(
  "./Raw Data/Corr_2016.zip",
  exdir = "./Raw Data"
)

The exdir argument specifies the folder where the extracted files should be saved. Since files only need to be unzipped once, this code is often commented out after the extraction is complete.

When working in R Markdown documents, avoid repeatedly running file extraction code unless necessary. Re-extracting files every time the document is knitted can slow down workflows and create duplicate files.

4.3 Description of the datasets

This course uses four datasets that simulate a simplified population health analysis workflow.

The mortality dataset contains records of deaths, including demographic information and ICD-10 cause-of-death codes. Although the dataset is based on realistic structures and counts, the data themselves are fictitious for teaching purposes.

The population dataset contains estimates of population counts by age group, sex, and Health Service Delivery Area in British Columbia.

The correspondence dataset links lower geographic units to larger health regions and service delivery areas. Correspondence files are commonly used in population health and environmental research because they allow different geographic datasets to be connected.

The environmental dataset contains weather-related variables by postal code, including measures related to temperature and cold exposure.

The overall analytical goal is to combine these datasets in order to calculate mortality rates and explore potential environmental relationships.

4.4 Importing data into R

Different file formats require different import functions. The readr package is commonly used for .csv files, while Excel files require functions from the readxl package.

The following code imports the datasets used in this chapter.

mort <- read_excel(
  "./Raw Data/deaths_2016.xlsx"
)

pop <- read_csv(
  "./Raw Data/Population_Estimates.csv"
)
## New names:
## Rows: 51 Columns: 25
## -- Column specification
## --------------------------------------------------------------------- Delimiter: "," chr
## (2): Health Service Delivery Area, Gender dbl (23): ...1, Year, <1, 04-Jan, 09-May, 14-Oct,
## 15-19, 20-24, 25-29, 30-34, 35-39, 40-4...
## i Use `spec()` to retrieve the full column specification for this data. i Specify the column
## types or set `show_col_types = FALSE` to quiet this message.
## * `` -> `...1`
corr <- read_csv(
  "./Raw Data/Corr_2016.csv",
  locale = readr::locale(
    encoding = "latin1"
  )
)
## Rows: 420963 Columns: 6
## -- Column specification ---------------------------------------------------------------------
## Delimiter: ","
## chr (2): hrname_english, hrname_french
## dbl (4): dbuid2016, csduid2016, hruid2017, dbpop2016
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
env <- read_csv(
  "./Raw Data/Weather_data.csv"
)
## Rows: 116011 Columns: 8
## -- Column specification ---------------------------------------------------------------------
## Delimiter: ","
## chr (1): POSTALCODE12
## dbl (7): WTHNRC12_01, WTHNRC12_02, WTHNRC12_03, WTHNRC12_04, WTHNRC12_05, WTHNRC12_06, WT...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice that the correspondence file uses a specific text encoding. Encoding problems are common when datasets contain accented characters or multilingual text.

4.5 Inspecting datasets

After importing data, it is important to examine the structure and contents of each dataset before beginning analysis.

Several functions provide useful high-level summaries.

str(mort)
## tibble [60,143 x 13] (S3: tbl_df/tbl/data.frame)
##  $ ID               : num [1:60143] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
##  $ Sex              : chr [1:60143] "M" "F" "M" "M" ...
##  $ Cause_of_death   : chr [1:60143] "D47" "C34" "C26" "C16" ...
##  $ dbuid2016        : num [1:60143] 1.31e+10 4.81e+10 3.52e+10 2.47e+10 2.46e+10 ...
##  $ Location_of_death: num [1:60143] 5 3 6 1 6 5 4 1 3 2 ...
##  $ Marital_status   : num [1:60143] 1 1 2 1 1 1 1 1 1 2 ...
##  $ Postalcode       : chr [1:60143] "E1A1E9" "T2E0T2" "M4J1L1" "H9H1B4" ...
##  $ B_year           : num [1:60143] 1943 1943 1953 1923 1931 ...
##  $ B_month          : num [1:60143] 7 5 8 6 8 11 5 8 7 2 ...
##  $ B_day            : num [1:60143] 8 22 9 14 8 4 20 18 23 7 ...
##  $ D_year           : num [1:60143] 2016 2016 2016 2016 2016 ...
##  $ D_month          : num [1:60143] 9 12 4 11 4 2 2 6 5 1 ...
##  $ D_day            : num [1:60143] 22 14 21 21 4 6 9 13 25 16 ...
summary(mort)
##        ID              Sex            Cause_of_death       dbuid2016        
##  Min.   :1000001   Length:60143       Length:60143       Min.   :1.001e+10  
##  1st Qu.:1016187   Class :character   Class :character   1st Qu.:2.466e+10  
##  Median :1034842   Mode  :character   Mode  :character   Median :3.521e+10  
##  Mean   :1036699                                         Mean   :3.642e+10  
##  3rd Qu.:1056221                                         3rd Qu.:4.715e+10  
##  Max.   :1080708                                         Max.   :6.208e+10  
##                                                          NA's   :4          
##  Location_of_death Marital_status  Postalcode            B_year        B_month      
##  Min.   :1.000     Min.   :1.0    Length:60143       Min.   :1827   Min.   : 1.000  
##  1st Qu.:2.000     1st Qu.:1.0    Class :character   1st Qu.:1933   1st Qu.: 4.000  
##  Median :4.000     Median :1.0    Mode  :character   Median :1941   Median : 7.000  
##  Mean   :3.498     Mean   :1.5                       Mean   :1942   Mean   : 6.513  
##  3rd Qu.:5.000     3rd Qu.:2.0                       3rd Qu.:1951   3rd Qu.:10.000  
##  Max.   :6.000     Max.   :2.0                       Max.   :2016   Max.   :28.000  
##                                                      NA's   :24     NA's   :26      
##      B_day           D_year        D_month           D_day      
##  Min.   : 1.00   Min.   :2016   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 8.00   1st Qu.:2016   1st Qu.: 4.000   1st Qu.: 8.00  
##  Median :16.00   Median :2016   Median : 7.000   Median :16.00  
##  Mean   :15.74   Mean   :2016   Mean   : 6.521   Mean   :15.78  
##  3rd Qu.:23.00   3rd Qu.:2016   3rd Qu.: 9.000   3rd Qu.:23.00  
##  Max.   :31.00   Max.   :2016   Max.   :21.000   Max.   :31.00  
##  NA's   :26
str(pop)
## spc_tbl_ [51 x 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1                        : num [1:51] 0 0 0 11 11 11 12 12 12 13 ...
##  $ Health Service Delivery Area: chr [1:51] "British Columbia" "British Columbia" "British Columbia" "East Kootenay" ...
##  $ Year                        : num [1:51] 2016 2016 2016 2016 2016 ...
##  $ Gender                      : chr [1:51] "M" "F" "T" "M" ...
##  $ <1                          : num [1:51] 22997 21760 44757 358 353 ...
##  $ 04-Jan                      : num [1:51] 93211 88300 181511 1507 1439 ...
##  $ 09-May                      : num [1:51] 121341 112777 234118 2058 2016 ...
##  $ 14-Oct                      : num [1:51] 119586 112424 232010 1838 1811 ...
##  $ 15-19                       : num [1:51] 140451 132226 272677 2179 2011 ...
##  $ 20-24                       : num [1:51] 170968 156161 327129 1935 1767 ...
##  $ 25-29                       : num [1:51] 159609 159389 318998 2055 2014 ...
##  $ 30-34                       : num [1:51] 162416 166826 329242 2332 2290 ...
##  $ 35-39                       : num [1:51] 155936 157628 313564 2527 2335 ...
##  $ 40-44                       : num [1:51] 151311 154187 305498 2399 2260 ...
##  $ 45-49                       : num [1:51] 161823 166603 328426 2465 2416 ...
##  $ 50-54                       : num [1:51] 172367 179295 351662 2882 2927 ...
##  $ 55-59                       : num [1:51] 172667 178878 351545 3337 3280 ...
##  $ 60-64                       : num [1:51] 156654 159977 316631 3166 3032 ...
##  $ 65-69                       : num [1:51] 139068 142469 281537 2961 2720 ...
##  $ 70-74                       : num [1:51] 99527 103484 203011 2035 1974 ...
##  $ 75-79                       : num [1:51] 69574 77160 146734 1428 1360 ...
##  $ 80-84                       : num [1:51] 49268 57639 106907 931 996 ...
##  $ 85-89                       : num [1:51] 28898 40789 69687 522 673 ...
##  $ 90+                         : num [1:51] 13366 28648 42014 242 437 ...
##  $ Total                       : num [1:51] 2361038 2396620 4757658 39157 38111 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   `Health Service Delivery Area` = col_character(),
##   ..   Year = col_double(),
##   ..   Gender = col_character(),
##   ..   `<1` = col_double(),
##   ..   `04-Jan` = col_double(),
##   ..   `09-May` = col_double(),
##   ..   `14-Oct` = col_double(),
##   ..   `15-19` = col_double(),
##   ..   `20-24` = col_double(),
##   ..   `25-29` = col_double(),
##   ..   `30-34` = col_double(),
##   ..   `35-39` = col_double(),
##   ..   `40-44` = col_double(),
##   ..   `45-49` = col_double(),
##   ..   `50-54` = col_double(),
##   ..   `55-59` = col_double(),
##   ..   `60-64` = col_double(),
##   ..   `65-69` = col_double(),
##   ..   `70-74` = col_double(),
##   ..   `75-79` = col_double(),
##   ..   `80-84` = col_double(),
##   ..   `85-89` = col_double(),
##   ..   `90+` = col_double(),
##   ..   Total = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(pop)
##       ...1       Health Service Delivery Area      Year         Gender         
##  Min.   : 0.00   Length:51                    Min.   :2016   Length:51         
##  1st Qu.:14.00   Class :character             1st Qu.:2016   Class :character  
##  Median :31.00   Mode  :character             Median :2016   Mode  :character  
##  Mean   :29.06                                Mean   :2016                     
##  3rd Qu.:42.00                                3rd Qu.:2016                     
##  Max.   :53.00                                Max.   :2016                     
##        <1            04-Jan           09-May           14-Oct           15-19       
##  Min.   :  276   Min.   :  1356   Min.   :  1755   Min.   :  1811   Min.   :  1919  
##  1st Qu.:  714   1st Qu.:  3118   1st Qu.:  4194   1st Qu.:  3910   1st Qu.:  4240  
##  Median : 1425   Median :  5663   Median :  8002   Median :  7808   Median :  8836  
##  Mean   : 3510   Mean   : 14236   Mean   : 18362   Mean   : 18197   Mean   : 21386  
##  3rd Qu.: 2802   3rd Qu.: 11025   3rd Qu.: 15384   3rd Qu.: 15350   3rd Qu.: 17866  
##  Max.   :44757   Max.   :181511   Max.   :234118   Max.   :232010   Max.   :272677  
##      20-24            25-29            30-34            35-39            40-44       
##  Min.   :  1577   Min.   :  1640   Min.   :  2069   Min.   :  2092   Min.   :  2008  
##  1st Qu.:  4214   1st Qu.:  4169   1st Qu.:  4415   1st Qu.:  4572   1st Qu.:  4483  
##  Median :  9193   Median :  8646   Median :  8830   Median :  9490   Median :  9551  
##  Mean   : 25657   Mean   : 25019   Mean   : 25823   Mean   : 24593   Mean   : 23961  
##  3rd Qu.: 22940   3rd Qu.: 21291   3rd Qu.: 22644   3rd Qu.: 21896   3rd Qu.: 21345  
##  Max.   :327129   Max.   :318998   Max.   :329242   Max.   :313564   Max.   :305498  
##      45-49            50-54            55-59            60-64            65-69       
##  Min.   :  2040   Min.   :  2137   Min.   :  2038   Min.   :  1550   Min.   :  1119  
##  1st Qu.:  4859   1st Qu.:  5431   1st Qu.:  5684   1st Qu.:  5349   1st Qu.:  5006  
##  Median :  9930   Median : 10666   Median : 10972   Median : 10698   Median : 10011  
##  Mean   : 25759   Mean   : 27581   Mean   : 27572   Mean   : 24834   Mean   : 22081  
##  3rd Qu.: 23211   3rd Qu.: 24455   3rd Qu.: 22498   3rd Qu.: 19486   3rd Qu.: 16891  
##  Max.   :328426   Max.   :351662   Max.   :351545   Max.   :316631   Max.   :281537  
##      70-74            75-79            80-84            85-89            90+         
##  Min.   :   784   Min.   :   538   Min.   :   351   Min.   :  196   Min.   :  114.0  
##  1st Qu.:  3628   1st Qu.:  2332   1st Qu.:  1612   1st Qu.:  973   1st Qu.:  536.5  
##  Median :  7256   Median :  5026   Median :  3867   Median : 2636   Median : 1394.0  
##  Mean   : 15922   Mean   : 11509   Mean   :  8385   Mean   : 5466   Mean   : 3295.2  
##  3rd Qu.: 12128   3rd Qu.:  9268   3rd Qu.:  6582   3rd Qu.: 4543   3rd Qu.: 2955.0  
##  Max.   :203011   Max.   :146734   Max.   :106907   Max.   :69687   Max.   :42014.0  
##      Total        
##  Min.   :  32171  
##  1st Qu.:  70681  
##  Median : 145321  
##  Mean   : 373150  
##  3rd Qu.: 329569  
##  Max.   :4757658
str(corr)
## spc_tbl_ [420,963 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ dbuid2016     : num [1:420963] 1e+10 1e+10 1e+10 1e+10 1e+10 ...
##  $ csduid2016    : num [1:420963] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
##  $ hruid2017     : num [1:420963] 1011 1011 1011 1011 1011 ...
##  $ hrname_english: chr [1:420963] "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" ...
##  $ hrname_french : chr [1:420963] "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" ...
##  $ dbpop2016     : num [1:420963] 160 25 268 53 71 217 39 120 154 111 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   dbuid2016 = col_double(),
##   ..   csduid2016 = col_double(),
##   ..   hruid2017 = col_double(),
##   ..   hrname_english = col_character(),
##   ..   hrname_french = col_character(),
##   ..   dbpop2016 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(corr)
##    dbuid2016           csduid2016        hruid2017    hrname_english     hrname_french     
##  Min.   :1.001e+10   Min.   :1001101   Min.   :1011   Length:420963      Length:420963     
##  1st Qu.:2.466e+10   1st Qu.:2466023   1st Qu.:2414   Class :character   Class :character  
##  Median :3.558e+10   Median :3558090   Median :3595   Mode  :character   Mode  :character  
##  Mean   :3.856e+10   Mean   :3855572   Mean   :3856                                        
##  3rd Qu.:4.806e+10   3rd Qu.:4806016   3rd Qu.:4832                                        
##  Max.   :6.208e+10   Max.   :6208098   Max.   :6201                                        
##                                                                                            
##    dbpop2016      
##  Min.   :   0.00  
##  1st Qu.:   5.00  
##  Median :  26.00  
##  Mean   :  69.44  
##  3rd Qu.:  77.00  
##  Max.   :7607.00  
##  NA's   :35
str(env)
## spc_tbl_ [116,011 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ POSTALCODE12: chr [1:116011] "V0C1E0" "V0C1W0" "V0C2X0" "V0C2Z0" ...
##  $ WTHNRC12_01 : num [1:116011] 27.1 30.3 28.2 28.7 23.1 ...
##  $ WTHNRC12_02 : num [1:116011] -46.7 -46 -42.3 -44.9 -35 ...
##  $ WTHNRC12_03 : num [1:116011] -2.605 -1.868 -1.963 -1.286 0.338 ...
##  $ WTHNRC12_04 : num [1:116011] 2.91 3.92 4.01 4.32 4.33 ...
##  $ WTHNRC12_05 : num [1:116011] -8.12 -7.65 -7.93 -6.89 -3.65 ...
##  $ WTHNRC12_06 : num [1:116011] 11.04 11.57 11.94 11.2 7.98 ...
##  $ WTHNRC12_07 : num [1:116011] 171 198 248 175 310 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   POSTALCODE12 = col_character(),
##   ..   WTHNRC12_01 = col_double(),
##   ..   WTHNRC12_02 = col_double(),
##   ..   WTHNRC12_03 = col_double(),
##   ..   WTHNRC12_04 = col_double(),
##   ..   WTHNRC12_05 = col_double(),
##   ..   WTHNRC12_06 = col_double(),
##   ..   WTHNRC12_07 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(env)
##  POSTALCODE12        WTHNRC12_01     WTHNRC12_02      WTHNRC12_03      WTHNRC12_04    
##  Length:116011      Min.   :19.28   Min.   :-46.74   Min.   :-2.605   Min.   : 2.913  
##  Class :character   1st Qu.:31.07   1st Qu.:-11.44   1st Qu.: 9.693   1st Qu.:13.454  
##  Mode  :character   Median :31.28   Median : -8.40   Median :10.140   Median :13.887  
##                     Mean   :31.62   Mean   :-11.67   Mean   : 9.573   Mean   :13.521  
##                     3rd Qu.:31.76   3rd Qu.: -7.49   3rd Qu.:10.449   3rd Qu.:14.006  
##                     Max.   :39.07   Max.   : -4.48   Max.   :10.551   Max.   :16.227  
##   WTHNRC12_05      WTHNRC12_06      WTHNRC12_07    
##  Min.   :-8.122   Min.   : 4.336   Min.   : 104.6  
##  1st Qu.: 5.750   1st Qu.: 6.936   1st Qu.:1072.1  
##  Median : 6.612   Median : 7.151   Median :1316.4  
##  Mean   : 5.625   Mean   : 7.896   Mean   :1110.7  
##  3rd Qu.: 6.893   3rd Qu.: 7.935   3rd Qu.:1399.0  
##  Max.   : 7.132   Max.   :13.498   Max.   :3076.5

The str() function displays the structure of the dataset, including variable types and dimensions. The summary() function provides summary statistics and quick descriptive information.

Another useful function is glimpse() from dplyr.

glimpse(mort)
## Rows: 60,143
## Columns: 13
## $ ID                <dbl> 1000001, 1000002, 1000003, 1000004, 1000005, 1000006, 1000007, 10~
## $ Sex               <chr> "M", "F", "M", "M", "M", "F", "F", "F", "F", "M", "F", "F", "F", ~
## $ Cause_of_death    <chr> "D47", "C34", "C26", "C16", "C22", "C50", "C85", "C56", "C64", "C~
## $ dbuid2016         <dbl> 13070175019, 48060924001, 35204502002, 24663048013, 24560249013, ~
## $ Location_of_death <dbl> 5, 3, 6, 1, 6, 5, 4, 1, 3, 2, 5, 1, 3, 4, 1, 5, 2, 6, 1, 1, 2, 5,~
## $ Marital_status    <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2,~
## $ Postalcode        <chr> "E1A1E9", "T2E0T2", "M4J1L1", "H9H1B4", "J2W2B6", "L8H2K1", "N9Y3~
## $ B_year            <dbl> 1943, 1943, 1953, 1923, 1931, 1933, 1918, 1972, 1922, 1953, 1978,~
## $ B_month           <dbl> 7, 5, 8, 6, 8, 11, 5, 8, 7, 2, 11, 3, 5, 9, 8, 12, 5, 11, 9, 4, 5~
## $ B_day             <dbl> 8, 22, 9, 14, 8, 4, 20, 18, 23, 7, 8, 6, 1, 8, 14, 31, 22, 4, 11,~
## $ D_year            <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,~
## $ D_month           <dbl> 9, 12, 4, 11, 4, 2, 2, 6, 5, 1, 2, 10, 6, 2, 7, 8, 6, 2, 5, 9, 10~
## $ D_day             <dbl> 22, 14, 21, 21, 4, 6, 9, 13, 25, 16, 22, 17, 15, 7, 9, 7, 20, 7, ~

Compared with str(), glimpse() often provides a cleaner and easier-to-read overview of the data frame.

Initial data inspection is one of the most important parts of data cleaning because it helps identify:

  • missing values;
  • unusual variable names;
  • incorrect data types;
  • inconsistent formatting; and
  • unexpected values.

4.6 Exploring and validating the data

A useful next step is performing simple exploratory checks to understand the datasets more fully.

For example, we can identify all health regions in the correspondence dataset.

corr %>%
  distinct(hrname_english) %>%
  mutate(
    hrname_english = iconv(
      hrname_english,
      from = "latin1",
      to = "UTF-8",
      sub = ""
    )
  ) %>%
  arrange(hrname_english)
## # A tibble: 96 x 1
##    hrname_english                                       
##    <chr>                                                
##  1 Athabasca Health Authority                           
##  2 Brant County Health Unit                             
##  3 Calgary Zone                                         
##  4 Central Regional Health Authority                    
##  5 Central Vancouver Island Health Service Delivery Area
##  6 Central Zone                                         
##  7 Chatham-Kent Health Unit                             
##  8 City of Hamilton Health Unit                         
##  9 City of Toronto Health Unit                          
## 10 Cypress Regional Health Authority                    
## # i 86 more rows

When working with multilingual datasets, encoding issues may cause characters to display incorrectly. Re-importing the file using the appropriate encoding often resolves these problems.

corr <- read_csv(
  "./Raw Data/Corr_2016.csv",
  locale = readr::locale(
    encoding = "latin1"
  )
)
## Rows: 420963 Columns: 6
## -- Column specification ---------------------------------------------------------------------
## Delimiter: ","
## chr (2): hrname_english, hrname_french
## dbl (4): dbuid2016, csduid2016, hruid2017, dbpop2016
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
corr %>%
  distinct(hrname_english) %>%
  arrange(hrname_english)
## # A tibble: 96 x 1
##    hrname_english                                       
##    <chr>                                                
##  1 Athabasca Health Authority                           
##  2 Brant County Health Unit                             
##  3 Calgary Zone                                         
##  4 Central Regional Health Authority                    
##  5 Central Vancouver Island Health Service Delivery Area
##  6 Central Zone                                         
##  7 Chatham-Kent Health Unit                             
##  8 City of Hamilton Health Unit                         
##  9 City of Toronto Health Unit                          
## 10 Cypress Regional Health Authority                    
## # i 86 more rows

We can also calculate the number of health regions in the dataset.

no_regions <- corr %>%
  distinct(hrname_english) %>%
  nrow()

There are 96 health regions in the correspondence dataset.

Additional exploratory summaries can help validate the imported data.

mort %>%
  count(Sex)
## # A tibble: 3 x 2
##   Sex       n
##   <chr> <int>
## 1 F     28630
## 2 M     31500
## 3 <NA>     13
pop %>%
  summarize(
    Total_below_1 = sum(`<1`)
  )
## # A tibble: 1 x 1
##   Total_below_1
##           <dbl>
## 1        179028

These simple checks help verify that the datasets were imported correctly and that the variables contain plausible values.

4.7 Cleaning variable names and reshaping data

One common issue in real-world datasets is poorly formatted variable names. Variables that begin with numbers or contain symbols can make coding more difficult.

For example:

names(pop)
##  [1] "...1"                         "Health Service Delivery Area"
##  [3] "Year"                         "Gender"                      
##  [5] "<1"                           "04-Jan"                      
##  [7] "09-May"                       "14-Oct"                      
##  [9] "15-19"                        "20-24"                       
## [11] "25-29"                        "30-34"                       
## [13] "35-39"                        "40-44"                       
## [15] "45-49"                        "50-54"                       
## [17] "55-59"                        "60-64"                       
## [19] "65-69"                        "70-74"                       
## [21] "75-79"                        "80-84"                       
## [23] "85-89"                        "90+"                         
## [25] "Total"
colnames(pop)
##  [1] "...1"                         "Health Service Delivery Area"
##  [3] "Year"                         "Gender"                      
##  [5] "<1"                           "04-Jan"                      
##  [7] "09-May"                       "14-Oct"                      
##  [9] "15-19"                        "20-24"                       
## [11] "25-29"                        "30-34"                       
## [13] "35-39"                        "40-44"                       
## [15] "45-49"                        "50-54"                       
## [17] "55-59"                        "60-64"                       
## [19] "65-69"                        "70-74"                       
## [21] "75-79"                        "80-84"                       
## [23] "85-89"                        "90+"                         
## [25] "Total"

Some age-group variables contain names such as "04-Jan" or "09-May", which are not ideal variable names for analysis.The janitor package is also useful for cleaning messy column names and inspecting datasets Firke (2026).

One approach is to rename problematic variables directly.

pop2 <- pop %>%
  rename(
    "01-04" = "04-Jan",
    "05-09" = "09-May",
    "10-14" = "14-Oct"
  )

Another important data-cleaning task is reshaping data into a tidy format. In tidy data, each variable has its own column, each observation has its own row, and each value occupies a single cell.

The pivot_longer() function is commonly used to convert wide datasets into long format.

pop_long <- pop %>%
  
  select(
    -any_of(c("X1", "Total"))
  ) %>%
  
  pivot_longer(
    cols = -any_of(
      c(
        "Year",
        "Gender",
        "Health Service Delivery Area"
      )
    ),
    names_to = "Age",
    values_to = "Value"
  )

pop_long
## # A tibble: 1,071 x 5
##    `Health Service Delivery Area`  Year Gender Age     Value
##    <chr>                          <dbl> <chr>  <chr>   <dbl>
##  1 British Columbia                2016 M      ...1        0
##  2 British Columbia                2016 M      <1      22997
##  3 British Columbia                2016 M      04-Jan  93211
##  4 British Columbia                2016 M      09-May 121341
##  5 British Columbia                2016 M      14-Oct 119586
##  6 British Columbia                2016 M      15-19  140451
##  7 British Columbia                2016 M      20-24  170968
##  8 British Columbia                2016 M      25-29  159609
##  9 British Columbia                2016 M      30-34  162416
## 10 British Columbia                2016 M      35-39  155936
## # i 1,061 more rows

The resulting dataset stores age groups within a single variable rather than across many separate columns.

After reshaping the data, some values may still require cleaning or recoding. The case_when() function is often used for this purpose.

pop_long <- pop_long %>%
  
  mutate(
    Age = case_when(
      Age == "04-Jan" ~ "01-04",
      Age == "09-May" ~ "05-09",
      Age == "14-Oct" ~ "10-14",
      TRUE ~ Age
    )
  )

The final line, TRUE ~ Age, is important because it preserves all values that do not match the earlier conditions.

If this line is removed, unmatched rows become missing values.

pop %>%
  
  select(
    -any_of(c("X1", "Total"))
  ) %>%
  
  pivot_longer(
    cols = -any_of(
      c(
        "Year",
        "Gender",
        "Health Service Delivery Area"
      )
    ),
    names_to = "Age",
    values_to = "Value"
  ) %>%
  
  mutate(
    Age = case_when(
      Age == "04-Jan" ~ "01-04",
      Age == "09-May" ~ "05-09",
      Age == "14-Oct" ~ "10-14"
    )
  )
## # A tibble: 1,071 x 5
##    `Health Service Delivery Area`  Year Gender Age    Value
##    <chr>                          <dbl> <chr>  <chr>  <dbl>
##  1 British Columbia                2016 M      <NA>       0
##  2 British Columbia                2016 M      <NA>   22997
##  3 British Columbia                2016 M      01-04  93211
##  4 British Columbia                2016 M      05-09 121341
##  5 British Columbia                2016 M      10-14 119586
##  6 British Columbia                2016 M      <NA>  140451
##  7 British Columbia                2016 M      <NA>  170968
##  8 British Columbia                2016 M      <NA>  159609
##  9 British Columbia                2016 M      <NA>  162416
## 10 British Columbia                2016 M      <NA>  155936
## # i 1,061 more rows

Always verify the output after recoding variables. Small coding mistakes can unintentionally convert valid values into missing values.

4.8 Practical considerations for R Markdown workflows

When working with R Markdown documents, remember that every code chunk is executed again during knitting. Code that modifies files, unzips folders, or downloads data should therefore be used carefully.

As projects grow larger, keeping code organized and avoiding repeated operations becomes increasingly important for reproducible workflows.

You should also begin thinking about how analyses are documented. A good R Markdown document not only produces correct outputs, but also explains the reasoning behind each analytical step.

4.9 Chapter summary

In this chapter, you learned several important data cleaning and management techniques using the tidyverse. You imported multiple datasets, explored project files, inspected dataset structures, handled encoding issues, cleaned variable names, reshaped data into tidy formats, and recoded variables using case_when().

These skills form the foundation of reproducible data analysis workflows and will be used extensively throughout the remainder of the course.

References

Firke, Sam. 2026. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://cran.r-project.org/package=janitor.