Chapter 4 Data Cleaning and Data Management
Data cleaning is one of the most important stages of any analytical workflow. In real-world projects, raw data are rarely ready for analysis immediately after import. Datasets often contain inconsistent variable names, missing values, formatting issues, duplicate records, encoding problems, or variables stored in unsuitable formats.
This chapter introduces several common data management tasks using the tidyverse. You will learn how to import multiple file types, inspect datasets, identify common data quality issues, reshape data into tidy formats, and recode variables into more useful forms.
Although the datasets used in this course are simplified for teaching purposes, the workflow closely reflects many real-world data cleaning tasks encountered in epidemiology, environmental health, and population data analysis.
4.1 Preparing the R environment
Before beginning any analysis, the required libraries should be loaded. For this chapter, we will primarily use packages from the tidyverse ecosystem, along with ggplot2 and readxl.
If these packages are not already installed on your computer, they can be installed using:
Once installed, the libraries can be loaded into the R session.
4.2 Exploring project files
Before importing data, it is often useful to inspect the contents of the project folders. The list.files() function allows you to display all files stored within a directory.
## [1] "Corr_2016 copy.csv" "Corr_2016.csv"
## [3] "Data Dictionary - Blank.xlsx" "Data Dictionary - Filled.xlsx"
## [5] "deaths_2016.xlsx" "Population_Estimates.csv"
## [7] "titanic.csv" "Titantic_DataDictionary.xlsx"
## [9] "Weather_data.csv"
You can also search for specific file types using the pattern argument.
## [1] "Corr_2016 copy.csv" "Corr_2016.csv" "Population_Estimates.csv"
## [4] "titanic.csv" "Weather_data.csv"
## [1] "Data Dictionary - Blank.xlsx" "Data Dictionary - Filled.xlsx"
## [3] "deaths_2016.xlsx" "Titantic_DataDictionary.xlsx"
Functions such as list.files() are frequently overlooked by beginners, but they become extremely useful when working with large projects that contain many datasets or outputs.
Some datasets may also be stored in compressed .zip files. These files should typically be extracted before analysis.
The exdir argument specifies the folder where the extracted files should be saved. Since files only need to be unzipped once, this code is often commented out after the extraction is complete.
When working in R Markdown documents, avoid repeatedly running file extraction code unless necessary. Re-extracting files every time the document is knitted can slow down workflows and create duplicate files.
4.3 Description of the datasets
This course uses four datasets that simulate a simplified population health analysis workflow.
The mortality dataset contains records of deaths, including demographic information and ICD-10 cause-of-death codes. Although the dataset is based on realistic structures and counts, the data themselves are fictitious for teaching purposes.
The population dataset contains estimates of population counts by age group, sex, and Health Service Delivery Area in British Columbia.
The correspondence dataset links lower geographic units to larger health regions and service delivery areas. Correspondence files are commonly used in population health and environmental research because they allow different geographic datasets to be connected.
The environmental dataset contains weather-related variables by postal code, including measures related to temperature and cold exposure.
The overall analytical goal is to combine these datasets in order to calculate mortality rates and explore potential environmental relationships.
4.4 Importing data into R
Different file formats require different import functions. The readr package is commonly used for .csv files, while Excel files require functions from the readxl package.
The following code imports the datasets used in this chapter.
mort <- read_excel(
"./Raw Data/deaths_2016.xlsx"
)
pop <- read_csv(
"./Raw Data/Population_Estimates.csv"
)## New names:
## Rows: 51 Columns: 25
## -- Column specification
## --------------------------------------------------------------------- Delimiter: "," chr
## (2): Health Service Delivery Area, Gender dbl (23): ...1, Year, <1, 04-Jan, 09-May, 14-Oct,
## 15-19, 20-24, 25-29, 30-34, 35-39, 40-4...
## i Use `spec()` to retrieve the full column specification for this data. i Specify the column
## types or set `show_col_types = FALSE` to quiet this message.
## * `` -> `...1`
## Rows: 420963 Columns: 6
## -- Column specification ---------------------------------------------------------------------
## Delimiter: ","
## chr (2): hrname_english, hrname_french
## dbl (4): dbuid2016, csduid2016, hruid2017, dbpop2016
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 116011 Columns: 8
## -- Column specification ---------------------------------------------------------------------
## Delimiter: ","
## chr (1): POSTALCODE12
## dbl (7): WTHNRC12_01, WTHNRC12_02, WTHNRC12_03, WTHNRC12_04, WTHNRC12_05, WTHNRC12_06, WT...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice that the correspondence file uses a specific text encoding. Encoding problems are common when datasets contain accented characters or multilingual text.
4.5 Inspecting datasets
After importing data, it is important to examine the structure and contents of each dataset before beginning analysis.
Several functions provide useful high-level summaries.
## tibble [60,143 x 13] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:60143] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
## $ Sex : chr [1:60143] "M" "F" "M" "M" ...
## $ Cause_of_death : chr [1:60143] "D47" "C34" "C26" "C16" ...
## $ dbuid2016 : num [1:60143] 1.31e+10 4.81e+10 3.52e+10 2.47e+10 2.46e+10 ...
## $ Location_of_death: num [1:60143] 5 3 6 1 6 5 4 1 3 2 ...
## $ Marital_status : num [1:60143] 1 1 2 1 1 1 1 1 1 2 ...
## $ Postalcode : chr [1:60143] "E1A1E9" "T2E0T2" "M4J1L1" "H9H1B4" ...
## $ B_year : num [1:60143] 1943 1943 1953 1923 1931 ...
## $ B_month : num [1:60143] 7 5 8 6 8 11 5 8 7 2 ...
## $ B_day : num [1:60143] 8 22 9 14 8 4 20 18 23 7 ...
## $ D_year : num [1:60143] 2016 2016 2016 2016 2016 ...
## $ D_month : num [1:60143] 9 12 4 11 4 2 2 6 5 1 ...
## $ D_day : num [1:60143] 22 14 21 21 4 6 9 13 25 16 ...
## ID Sex Cause_of_death dbuid2016
## Min. :1000001 Length:60143 Length:60143 Min. :1.001e+10
## 1st Qu.:1016187 Class :character Class :character 1st Qu.:2.466e+10
## Median :1034842 Mode :character Mode :character Median :3.521e+10
## Mean :1036699 Mean :3.642e+10
## 3rd Qu.:1056221 3rd Qu.:4.715e+10
## Max. :1080708 Max. :6.208e+10
## NA's :4
## Location_of_death Marital_status Postalcode B_year B_month
## Min. :1.000 Min. :1.0 Length:60143 Min. :1827 Min. : 1.000
## 1st Qu.:2.000 1st Qu.:1.0 Class :character 1st Qu.:1933 1st Qu.: 4.000
## Median :4.000 Median :1.0 Mode :character Median :1941 Median : 7.000
## Mean :3.498 Mean :1.5 Mean :1942 Mean : 6.513
## 3rd Qu.:5.000 3rd Qu.:2.0 3rd Qu.:1951 3rd Qu.:10.000
## Max. :6.000 Max. :2.0 Max. :2016 Max. :28.000
## NA's :24 NA's :26
## B_day D_year D_month D_day
## Min. : 1.00 Min. :2016 Min. : 1.000 Min. : 1.00
## 1st Qu.: 8.00 1st Qu.:2016 1st Qu.: 4.000 1st Qu.: 8.00
## Median :16.00 Median :2016 Median : 7.000 Median :16.00
## Mean :15.74 Mean :2016 Mean : 6.521 Mean :15.78
## 3rd Qu.:23.00 3rd Qu.:2016 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :31.00 Max. :2016 Max. :21.000 Max. :31.00
## NA's :26
## spc_tbl_ [51 x 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:51] 0 0 0 11 11 11 12 12 12 13 ...
## $ Health Service Delivery Area: chr [1:51] "British Columbia" "British Columbia" "British Columbia" "East Kootenay" ...
## $ Year : num [1:51] 2016 2016 2016 2016 2016 ...
## $ Gender : chr [1:51] "M" "F" "T" "M" ...
## $ <1 : num [1:51] 22997 21760 44757 358 353 ...
## $ 04-Jan : num [1:51] 93211 88300 181511 1507 1439 ...
## $ 09-May : num [1:51] 121341 112777 234118 2058 2016 ...
## $ 14-Oct : num [1:51] 119586 112424 232010 1838 1811 ...
## $ 15-19 : num [1:51] 140451 132226 272677 2179 2011 ...
## $ 20-24 : num [1:51] 170968 156161 327129 1935 1767 ...
## $ 25-29 : num [1:51] 159609 159389 318998 2055 2014 ...
## $ 30-34 : num [1:51] 162416 166826 329242 2332 2290 ...
## $ 35-39 : num [1:51] 155936 157628 313564 2527 2335 ...
## $ 40-44 : num [1:51] 151311 154187 305498 2399 2260 ...
## $ 45-49 : num [1:51] 161823 166603 328426 2465 2416 ...
## $ 50-54 : num [1:51] 172367 179295 351662 2882 2927 ...
## $ 55-59 : num [1:51] 172667 178878 351545 3337 3280 ...
## $ 60-64 : num [1:51] 156654 159977 316631 3166 3032 ...
## $ 65-69 : num [1:51] 139068 142469 281537 2961 2720 ...
## $ 70-74 : num [1:51] 99527 103484 203011 2035 1974 ...
## $ 75-79 : num [1:51] 69574 77160 146734 1428 1360 ...
## $ 80-84 : num [1:51] 49268 57639 106907 931 996 ...
## $ 85-89 : num [1:51] 28898 40789 69687 522 673 ...
## $ 90+ : num [1:51] 13366 28648 42014 242 437 ...
## $ Total : num [1:51] 2361038 2396620 4757658 39157 38111 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. `Health Service Delivery Area` = col_character(),
## .. Year = col_double(),
## .. Gender = col_character(),
## .. `<1` = col_double(),
## .. `04-Jan` = col_double(),
## .. `09-May` = col_double(),
## .. `14-Oct` = col_double(),
## .. `15-19` = col_double(),
## .. `20-24` = col_double(),
## .. `25-29` = col_double(),
## .. `30-34` = col_double(),
## .. `35-39` = col_double(),
## .. `40-44` = col_double(),
## .. `45-49` = col_double(),
## .. `50-54` = col_double(),
## .. `55-59` = col_double(),
## .. `60-64` = col_double(),
## .. `65-69` = col_double(),
## .. `70-74` = col_double(),
## .. `75-79` = col_double(),
## .. `80-84` = col_double(),
## .. `85-89` = col_double(),
## .. `90+` = col_double(),
## .. Total = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## ...1 Health Service Delivery Area Year Gender
## Min. : 0.00 Length:51 Min. :2016 Length:51
## 1st Qu.:14.00 Class :character 1st Qu.:2016 Class :character
## Median :31.00 Mode :character Median :2016 Mode :character
## Mean :29.06 Mean :2016
## 3rd Qu.:42.00 3rd Qu.:2016
## Max. :53.00 Max. :2016
## <1 04-Jan 09-May 14-Oct 15-19
## Min. : 276 Min. : 1356 Min. : 1755 Min. : 1811 Min. : 1919
## 1st Qu.: 714 1st Qu.: 3118 1st Qu.: 4194 1st Qu.: 3910 1st Qu.: 4240
## Median : 1425 Median : 5663 Median : 8002 Median : 7808 Median : 8836
## Mean : 3510 Mean : 14236 Mean : 18362 Mean : 18197 Mean : 21386
## 3rd Qu.: 2802 3rd Qu.: 11025 3rd Qu.: 15384 3rd Qu.: 15350 3rd Qu.: 17866
## Max. :44757 Max. :181511 Max. :234118 Max. :232010 Max. :272677
## 20-24 25-29 30-34 35-39 40-44
## Min. : 1577 Min. : 1640 Min. : 2069 Min. : 2092 Min. : 2008
## 1st Qu.: 4214 1st Qu.: 4169 1st Qu.: 4415 1st Qu.: 4572 1st Qu.: 4483
## Median : 9193 Median : 8646 Median : 8830 Median : 9490 Median : 9551
## Mean : 25657 Mean : 25019 Mean : 25823 Mean : 24593 Mean : 23961
## 3rd Qu.: 22940 3rd Qu.: 21291 3rd Qu.: 22644 3rd Qu.: 21896 3rd Qu.: 21345
## Max. :327129 Max. :318998 Max. :329242 Max. :313564 Max. :305498
## 45-49 50-54 55-59 60-64 65-69
## Min. : 2040 Min. : 2137 Min. : 2038 Min. : 1550 Min. : 1119
## 1st Qu.: 4859 1st Qu.: 5431 1st Qu.: 5684 1st Qu.: 5349 1st Qu.: 5006
## Median : 9930 Median : 10666 Median : 10972 Median : 10698 Median : 10011
## Mean : 25759 Mean : 27581 Mean : 27572 Mean : 24834 Mean : 22081
## 3rd Qu.: 23211 3rd Qu.: 24455 3rd Qu.: 22498 3rd Qu.: 19486 3rd Qu.: 16891
## Max. :328426 Max. :351662 Max. :351545 Max. :316631 Max. :281537
## 70-74 75-79 80-84 85-89 90+
## Min. : 784 Min. : 538 Min. : 351 Min. : 196 Min. : 114.0
## 1st Qu.: 3628 1st Qu.: 2332 1st Qu.: 1612 1st Qu.: 973 1st Qu.: 536.5
## Median : 7256 Median : 5026 Median : 3867 Median : 2636 Median : 1394.0
## Mean : 15922 Mean : 11509 Mean : 8385 Mean : 5466 Mean : 3295.2
## 3rd Qu.: 12128 3rd Qu.: 9268 3rd Qu.: 6582 3rd Qu.: 4543 3rd Qu.: 2955.0
## Max. :203011 Max. :146734 Max. :106907 Max. :69687 Max. :42014.0
## Total
## Min. : 32171
## 1st Qu.: 70681
## Median : 145321
## Mean : 373150
## 3rd Qu.: 329569
## Max. :4757658
## spc_tbl_ [420,963 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ dbuid2016 : num [1:420963] 1e+10 1e+10 1e+10 1e+10 1e+10 ...
## $ csduid2016 : num [1:420963] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
## $ hruid2017 : num [1:420963] 1011 1011 1011 1011 1011 ...
## $ hrname_english: chr [1:420963] "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" ...
## $ hrname_french : chr [1:420963] "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" ...
## $ dbpop2016 : num [1:420963] 160 25 268 53 71 217 39 120 154 111 ...
## - attr(*, "spec")=
## .. cols(
## .. dbuid2016 = col_double(),
## .. csduid2016 = col_double(),
## .. hruid2017 = col_double(),
## .. hrname_english = col_character(),
## .. hrname_french = col_character(),
## .. dbpop2016 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## dbuid2016 csduid2016 hruid2017 hrname_english hrname_french
## Min. :1.001e+10 Min. :1001101 Min. :1011 Length:420963 Length:420963
## 1st Qu.:2.466e+10 1st Qu.:2466023 1st Qu.:2414 Class :character Class :character
## Median :3.558e+10 Median :3558090 Median :3595 Mode :character Mode :character
## Mean :3.856e+10 Mean :3855572 Mean :3856
## 3rd Qu.:4.806e+10 3rd Qu.:4806016 3rd Qu.:4832
## Max. :6.208e+10 Max. :6208098 Max. :6201
##
## dbpop2016
## Min. : 0.00
## 1st Qu.: 5.00
## Median : 26.00
## Mean : 69.44
## 3rd Qu.: 77.00
## Max. :7607.00
## NA's :35
## spc_tbl_ [116,011 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ POSTALCODE12: chr [1:116011] "V0C1E0" "V0C1W0" "V0C2X0" "V0C2Z0" ...
## $ WTHNRC12_01 : num [1:116011] 27.1 30.3 28.2 28.7 23.1 ...
## $ WTHNRC12_02 : num [1:116011] -46.7 -46 -42.3 -44.9 -35 ...
## $ WTHNRC12_03 : num [1:116011] -2.605 -1.868 -1.963 -1.286 0.338 ...
## $ WTHNRC12_04 : num [1:116011] 2.91 3.92 4.01 4.32 4.33 ...
## $ WTHNRC12_05 : num [1:116011] -8.12 -7.65 -7.93 -6.89 -3.65 ...
## $ WTHNRC12_06 : num [1:116011] 11.04 11.57 11.94 11.2 7.98 ...
## $ WTHNRC12_07 : num [1:116011] 171 198 248 175 310 ...
## - attr(*, "spec")=
## .. cols(
## .. POSTALCODE12 = col_character(),
## .. WTHNRC12_01 = col_double(),
## .. WTHNRC12_02 = col_double(),
## .. WTHNRC12_03 = col_double(),
## .. WTHNRC12_04 = col_double(),
## .. WTHNRC12_05 = col_double(),
## .. WTHNRC12_06 = col_double(),
## .. WTHNRC12_07 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## POSTALCODE12 WTHNRC12_01 WTHNRC12_02 WTHNRC12_03 WTHNRC12_04
## Length:116011 Min. :19.28 Min. :-46.74 Min. :-2.605 Min. : 2.913
## Class :character 1st Qu.:31.07 1st Qu.:-11.44 1st Qu.: 9.693 1st Qu.:13.454
## Mode :character Median :31.28 Median : -8.40 Median :10.140 Median :13.887
## Mean :31.62 Mean :-11.67 Mean : 9.573 Mean :13.521
## 3rd Qu.:31.76 3rd Qu.: -7.49 3rd Qu.:10.449 3rd Qu.:14.006
## Max. :39.07 Max. : -4.48 Max. :10.551 Max. :16.227
## WTHNRC12_05 WTHNRC12_06 WTHNRC12_07
## Min. :-8.122 Min. : 4.336 Min. : 104.6
## 1st Qu.: 5.750 1st Qu.: 6.936 1st Qu.:1072.1
## Median : 6.612 Median : 7.151 Median :1316.4
## Mean : 5.625 Mean : 7.896 Mean :1110.7
## 3rd Qu.: 6.893 3rd Qu.: 7.935 3rd Qu.:1399.0
## Max. : 7.132 Max. :13.498 Max. :3076.5
The str() function displays the structure of the dataset, including variable types and dimensions. The summary() function provides summary statistics and quick descriptive information.
Another useful function is glimpse() from dplyr.
## Rows: 60,143
## Columns: 13
## $ ID <dbl> 1000001, 1000002, 1000003, 1000004, 1000005, 1000006, 1000007, 10~
## $ Sex <chr> "M", "F", "M", "M", "M", "F", "F", "F", "F", "M", "F", "F", "F", ~
## $ Cause_of_death <chr> "D47", "C34", "C26", "C16", "C22", "C50", "C85", "C56", "C64", "C~
## $ dbuid2016 <dbl> 13070175019, 48060924001, 35204502002, 24663048013, 24560249013, ~
## $ Location_of_death <dbl> 5, 3, 6, 1, 6, 5, 4, 1, 3, 2, 5, 1, 3, 4, 1, 5, 2, 6, 1, 1, 2, 5,~
## $ Marital_status <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2,~
## $ Postalcode <chr> "E1A1E9", "T2E0T2", "M4J1L1", "H9H1B4", "J2W2B6", "L8H2K1", "N9Y3~
## $ B_year <dbl> 1943, 1943, 1953, 1923, 1931, 1933, 1918, 1972, 1922, 1953, 1978,~
## $ B_month <dbl> 7, 5, 8, 6, 8, 11, 5, 8, 7, 2, 11, 3, 5, 9, 8, 12, 5, 11, 9, 4, 5~
## $ B_day <dbl> 8, 22, 9, 14, 8, 4, 20, 18, 23, 7, 8, 6, 1, 8, 14, 31, 22, 4, 11,~
## $ D_year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,~
## $ D_month <dbl> 9, 12, 4, 11, 4, 2, 2, 6, 5, 1, 2, 10, 6, 2, 7, 8, 6, 2, 5, 9, 10~
## $ D_day <dbl> 22, 14, 21, 21, 4, 6, 9, 13, 25, 16, 22, 17, 15, 7, 9, 7, 20, 7, ~
Compared with str(), glimpse() often provides a cleaner and easier-to-read overview of the data frame.
Initial data inspection is one of the most important parts of data cleaning because it helps identify:
- missing values;
- unusual variable names;
- incorrect data types;
- inconsistent formatting; and
- unexpected values.
4.6 Exploring and validating the data
A useful next step is performing simple exploratory checks to understand the datasets more fully.
For example, we can identify all health regions in the correspondence dataset.
corr %>%
distinct(hrname_english) %>%
mutate(
hrname_english = iconv(
hrname_english,
from = "latin1",
to = "UTF-8",
sub = ""
)
) %>%
arrange(hrname_english)## # A tibble: 96 x 1
## hrname_english
## <chr>
## 1 Athabasca Health Authority
## 2 Brant County Health Unit
## 3 Calgary Zone
## 4 Central Regional Health Authority
## 5 Central Vancouver Island Health Service Delivery Area
## 6 Central Zone
## 7 Chatham-Kent Health Unit
## 8 City of Hamilton Health Unit
## 9 City of Toronto Health Unit
## 10 Cypress Regional Health Authority
## # i 86 more rows
When working with multilingual datasets, encoding issues may cause characters to display incorrectly. Re-importing the file using the appropriate encoding often resolves these problems.
## Rows: 420963 Columns: 6
## -- Column specification ---------------------------------------------------------------------
## Delimiter: ","
## chr (2): hrname_english, hrname_french
## dbl (4): dbuid2016, csduid2016, hruid2017, dbpop2016
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 96 x 1
## hrname_english
## <chr>
## 1 Athabasca Health Authority
## 2 Brant County Health Unit
## 3 Calgary Zone
## 4 Central Regional Health Authority
## 5 Central Vancouver Island Health Service Delivery Area
## 6 Central Zone
## 7 Chatham-Kent Health Unit
## 8 City of Hamilton Health Unit
## 9 City of Toronto Health Unit
## 10 Cypress Regional Health Authority
## # i 86 more rows
We can also calculate the number of health regions in the dataset.
There are 96 health regions in the correspondence dataset.
Additional exploratory summaries can help validate the imported data.
## # A tibble: 3 x 2
## Sex n
## <chr> <int>
## 1 F 28630
## 2 M 31500
## 3 <NA> 13
## # A tibble: 1 x 1
## Total_below_1
## <dbl>
## 1 179028
These simple checks help verify that the datasets were imported correctly and that the variables contain plausible values.
4.7 Cleaning variable names and reshaping data
One common issue in real-world datasets is poorly formatted variable names. Variables that begin with numbers or contain symbols can make coding more difficult.
For example:
## [1] "...1" "Health Service Delivery Area"
## [3] "Year" "Gender"
## [5] "<1" "04-Jan"
## [7] "09-May" "14-Oct"
## [9] "15-19" "20-24"
## [11] "25-29" "30-34"
## [13] "35-39" "40-44"
## [15] "45-49" "50-54"
## [17] "55-59" "60-64"
## [19] "65-69" "70-74"
## [21] "75-79" "80-84"
## [23] "85-89" "90+"
## [25] "Total"
## [1] "...1" "Health Service Delivery Area"
## [3] "Year" "Gender"
## [5] "<1" "04-Jan"
## [7] "09-May" "14-Oct"
## [9] "15-19" "20-24"
## [11] "25-29" "30-34"
## [13] "35-39" "40-44"
## [15] "45-49" "50-54"
## [17] "55-59" "60-64"
## [19] "65-69" "70-74"
## [21] "75-79" "80-84"
## [23] "85-89" "90+"
## [25] "Total"
Some age-group variables contain names such as "04-Jan" or "09-May", which are not ideal variable names for analysis.The janitor package is also useful for cleaning messy column names and inspecting datasets Firke (2026).
One approach is to rename problematic variables directly.
Another important data-cleaning task is reshaping data into a tidy format. In tidy data, each variable has its own column, each observation has its own row, and each value occupies a single cell.
The pivot_longer() function is commonly used to convert wide datasets into long format.
pop_long <- pop %>%
select(
-any_of(c("X1", "Total"))
) %>%
pivot_longer(
cols = -any_of(
c(
"Year",
"Gender",
"Health Service Delivery Area"
)
),
names_to = "Age",
values_to = "Value"
)
pop_long## # A tibble: 1,071 x 5
## `Health Service Delivery Area` Year Gender Age Value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 British Columbia 2016 M ...1 0
## 2 British Columbia 2016 M <1 22997
## 3 British Columbia 2016 M 04-Jan 93211
## 4 British Columbia 2016 M 09-May 121341
## 5 British Columbia 2016 M 14-Oct 119586
## 6 British Columbia 2016 M 15-19 140451
## 7 British Columbia 2016 M 20-24 170968
## 8 British Columbia 2016 M 25-29 159609
## 9 British Columbia 2016 M 30-34 162416
## 10 British Columbia 2016 M 35-39 155936
## # i 1,061 more rows
The resulting dataset stores age groups within a single variable rather than across many separate columns.
After reshaping the data, some values may still require cleaning or recoding. The case_when() function is often used for this purpose.
pop_long <- pop_long %>%
mutate(
Age = case_when(
Age == "04-Jan" ~ "01-04",
Age == "09-May" ~ "05-09",
Age == "14-Oct" ~ "10-14",
TRUE ~ Age
)
)The final line, TRUE ~ Age, is important because it preserves all values that do not match the earlier conditions.
If this line is removed, unmatched rows become missing values.
pop %>%
select(
-any_of(c("X1", "Total"))
) %>%
pivot_longer(
cols = -any_of(
c(
"Year",
"Gender",
"Health Service Delivery Area"
)
),
names_to = "Age",
values_to = "Value"
) %>%
mutate(
Age = case_when(
Age == "04-Jan" ~ "01-04",
Age == "09-May" ~ "05-09",
Age == "14-Oct" ~ "10-14"
)
)## # A tibble: 1,071 x 5
## `Health Service Delivery Area` Year Gender Age Value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 British Columbia 2016 M <NA> 0
## 2 British Columbia 2016 M <NA> 22997
## 3 British Columbia 2016 M 01-04 93211
## 4 British Columbia 2016 M 05-09 121341
## 5 British Columbia 2016 M 10-14 119586
## 6 British Columbia 2016 M <NA> 140451
## 7 British Columbia 2016 M <NA> 170968
## 8 British Columbia 2016 M <NA> 159609
## 9 British Columbia 2016 M <NA> 162416
## 10 British Columbia 2016 M <NA> 155936
## # i 1,061 more rows
Always verify the output after recoding variables. Small coding mistakes can unintentionally convert valid values into missing values.
4.8 Practical considerations for R Markdown workflows
When working with R Markdown documents, remember that every code chunk is executed again during knitting. Code that modifies files, unzips folders, or downloads data should therefore be used carefully.
As projects grow larger, keeping code organized and avoiding repeated operations becomes increasingly important for reproducible workflows.
You should also begin thinking about how analyses are documented. A good R Markdown document not only produces correct outputs, but also explains the reasoning behind each analytical step.
4.9 Chapter summary
In this chapter, you learned several important data cleaning and management techniques using the tidyverse. You imported multiple datasets, explored project files, inspected dataset structures, handled encoding issues, cleaned variable names, reshaped data into tidy formats, and recoded variables using case_when().
These skills form the foundation of reproducible data analysis workflows and will be used extensively throughout the remainder of the course.