Chapter 6 Data Description

Population Data – estimates of people in BC, by sex, age, and Health Service Delivery Area. File name: Population_Estimates.csv

Mortality Data – records of deaths are collected by each province and territory, then combined into a national-level data set at Statistics Canada. The data contains personal identifiers (age, sex, residence, marital status, etc.) and details about the nature of the death (ICD-10 codes). File name: Deaths_2016.xlsx

Geographical Correspondence File – contains two associated levels of geography, in this case, DB (dissemination block) and Health Service Delivery Area. File name: Corr_2016.csv

Environmental Data File – contains weather data by postal code. For this analysis, the variables for average daily high and average daily low temperature will be used to determine whether each postal code is located in a cold climate. File name: Weather_data.csv

Unlike the population data that is used in this course, the mortality data is fictitious.

The mortality data has been created based on real mortality counts, causes of death, and sex, but with imputed birth and death dates, and geographical values.

This course uses fictitious 2016 mortality data, which contains sex, geographical variables, cause of death (underlying cause of death, ICD-10 codes), marital status, location of death (in hospital, at home, etc.), and the day, month and year of birth and death.

The mortality data in Canada lists cause of death using ICD-10 codes. ICD stands for the International Classification of Diseases, and the ‘10’ indicates the 10th revision of the classification. Prior to 2000, Canada used ICD-9 codes, which causes a bit of a headache when comparing causes of death across revisions.

For this course, weather data from CANUE will be used as the exposure data. The exposure variable is the presence or absence of a cold climate. A cold climate is defined as a daily average high temperature less than 10 degrees Celsius, with a daily average low temperature less than -3 degrees Celsius.
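As a sketch of how this definition could be expressed in code (the column names avg_high and avg_low are placeholders, not the actual CANUE variable names, which need to be checked against the data dictionary):

```r
# Flag a postal code as "cold climate" when the average daily high is
# below 10 degrees Celsius AND the average daily low is below -3 degrees.
# avg_high / avg_low are placeholder names for the CANUE temperature variables.
climate <- data.frame(
  postalcode = c("V0C1E0", "V6B1A1"),
  avg_high   = c(8.5, 14.2),
  avg_low    = c(-5.1, 2.3)
)
climate$cold_climate <- climate$avg_high < 10 & climate$avg_low < -3
climate$cold_climate
# [1]  TRUE FALSE
```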

This data is real, although it is being joined to mortality data that is fictitious, so the results of the analysis will not have any practical research meaning.

The weather data set includes average temperatures, days of extreme heat and cold, days of frost, etc., all organized by postal code.

Correspondence files can take a number of different forms, but primarily, they allow one piece of information to be “matched” to another. These can sometimes take the form of data keys.
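As a toy illustration of the idea (made-up values; base R merge stands in here for the dplyr joins used later in the course):

```r
# A tiny correspondence file: maps a low-level geography (db) to a region.
corr_toy   <- data.frame(db = c(101, 102, 103),
                         region = c("North", "North", "South"))
# Three fictitious death records, each tagged with a db code.
deaths_toy <- data.frame(id = 1:3, db = c(103, 101, 101))

# "Matching" attaches the region to each death record via the shared db key.
matched <- merge(deaths_toy, corr_toy, by = "db")
matched
```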

For this course, we will be using a geographic correspondence file.

In our mortality data, we have a low level of geography (similar to a postal code). We need a correspondence file to figure out which province, health authority and health service delivery area that geographic code belongs to, in order to group our deaths by geography.

The particular correspondence file that we will be using is available here: https://www150.statcan.gc.ca/n1/pub/82-402-x/2017001/corr-eng.htm

Wait, readr doesn’t have any functions for files ending in .xlsx. If we check the documentation and search for “xlsx” or “excel”, there is no mention of it. Thankfully, readxl is installed alongside the tidyverse and provides read_excel. Note that readxl is not attached by library(tidyverse), so it needs its own library() call.

library(readxl)
mort <- read_excel("./Raw Data/deaths_2016.xlsx")
pop <- read_csv("./Raw Data/Population_Estimates.csv")
corr <- read_csv("./Raw Data/Corr_2016.csv")
env <- read_csv("./Raw Data/Weather_data.csv")

# Explore the data

Type ?str and ?summary into the R console. Both of these functions give high-level insights into our data.

str(mort)
## tibble [60,143 x 13] (S3: tbl_df/tbl/data.frame)
##  $ ID               : num [1:60143] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
##  $ Sex              : chr [1:60143] "M" "F" "M" "M" ...
##  $ Cause_of_death   : chr [1:60143] "D47" "C34" "C26" "C16" ...
##  $ dbuid2016        : num [1:60143] 1.31e+10 4.81e+10 3.52e+10 2.47e+10 2.46e+10 ...
##  $ Location_of_death: num [1:60143] 5 3 6 1 6 5 4 1 3 2 ...
##  $ Marital_status   : num [1:60143] 1 1 2 1 1 1 1 1 1 2 ...
##  $ Postalcode       : chr [1:60143] "E1A1E9" "T2E0T2" "M4J1L1" "H9H1B4" ...
##  $ B_year           : num [1:60143] 1943 1943 1953 1923 1931 ...
##  $ B_month          : num [1:60143] 7 5 8 6 8 11 5 8 7 2 ...
##  $ B_day            : num [1:60143] 8 22 9 14 8 4 20 18 23 7 ...
##  $ D_year           : num [1:60143] 2016 2016 2016 2016 2016 ...
##  $ D_month          : num [1:60143] 9 12 4 11 4 2 2 6 5 1 ...
##  $ D_day            : num [1:60143] 22 14 21 21 4 6 9 13 25 16 ...
summary(mort)
##        ID              Sex            Cause_of_death       dbuid2016         Location_of_death
##  Min.   :1000001   Length:60143       Length:60143       Min.   :1.001e+10   Min.   :1.000    
##  1st Qu.:1016187   Class :character   Class :character   1st Qu.:2.466e+10   1st Qu.:2.000    
##  Median :1034842   Mode  :character   Mode  :character   Median :3.521e+10   Median :4.000    
##  Mean   :1036699                                         Mean   :3.642e+10   Mean   :3.498    
##  3rd Qu.:1056221                                         3rd Qu.:4.715e+10   3rd Qu.:5.000    
##  Max.   :1080708                                         Max.   :6.208e+10   Max.   :6.000    
##                                                          NA's   :4                            
##  Marital_status  Postalcode            B_year        B_month           B_day      
##  Min.   :1.0    Length:60143       Min.   :1827   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:1.0    Class :character   1st Qu.:1933   1st Qu.: 4.000   1st Qu.: 8.00  
##  Median :1.0    Mode  :character   Median :1941   Median : 7.000   Median :16.00  
##  Mean   :1.5                       Mean   :1942   Mean   : 6.513   Mean   :15.74  
##  3rd Qu.:2.0                       3rd Qu.:1951   3rd Qu.:10.000   3rd Qu.:23.00  
##  Max.   :2.0                       Max.   :2016   Max.   :28.000   Max.   :31.00  
##                                    NA's   :24     NA's   :26       NA's   :26     
##      D_year        D_month           D_day      
##  Min.   :2016   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:2016   1st Qu.: 4.000   1st Qu.: 8.00  
##  Median :2016   Median : 7.000   Median :16.00  
##  Mean   :2016   Mean   : 6.521   Mean   :15.78  
##  3rd Qu.:2016   3rd Qu.: 9.000   3rd Qu.:23.00  
##  Max.   :2016   Max.   :21.000   Max.   :31.00  
## 
str(pop)
## spc_tbl_ [51 x 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1                        : num [1:51] 0 0 0 11 11 11 12 12 12 13 ...
##  $ Health Service Delivery Area: chr [1:51] "British Columbia" "British Columbia" "British Columbia" "East Kootenay" ...
##  $ Year                        : num [1:51] 2016 2016 2016 2016 2016 ...
##  $ Gender                      : chr [1:51] "M" "F" "T" "M" ...
##  $ <1                          : num [1:51] 22997 21760 44757 358 353 ...
##  $ 04-Jan                      : num [1:51] 93211 88300 181511 1507 1439 ...
##  $ 09-May                      : num [1:51] 121341 112777 234118 2058 2016 ...
##  $ 14-Oct                      : num [1:51] 119586 112424 232010 1838 1811 ...
##  $ 15-19                       : num [1:51] 140451 132226 272677 2179 2011 ...
##  $ 20-24                       : num [1:51] 170968 156161 327129 1935 1767 ...
##  $ 25-29                       : num [1:51] 159609 159389 318998 2055 2014 ...
##  $ 30-34                       : num [1:51] 162416 166826 329242 2332 2290 ...
##  $ 35-39                       : num [1:51] 155936 157628 313564 2527 2335 ...
##  $ 40-44                       : num [1:51] 151311 154187 305498 2399 2260 ...
##  $ 45-49                       : num [1:51] 161823 166603 328426 2465 2416 ...
##  $ 50-54                       : num [1:51] 172367 179295 351662 2882 2927 ...
##  $ 55-59                       : num [1:51] 172667 178878 351545 3337 3280 ...
##  $ 60-64                       : num [1:51] 156654 159977 316631 3166 3032 ...
##  $ 65-69                       : num [1:51] 139068 142469 281537 2961 2720 ...
##  $ 70-74                       : num [1:51] 99527 103484 203011 2035 1974 ...
##  $ 75-79                       : num [1:51] 69574 77160 146734 1428 1360 ...
##  $ 80-84                       : num [1:51] 49268 57639 106907 931 996 ...
##  $ 85-89                       : num [1:51] 28898 40789 69687 522 673 ...
##  $ 90+                         : num [1:51] 13366 28648 42014 242 437 ...
##  $ Total                       : num [1:51] 2361038 2396620 4757658 39157 38111 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   `Health Service Delivery Area` = col_character(),
##   ..   Year = col_double(),
##   ..   Gender = col_character(),
##   ..   `<1` = col_double(),
##   ..   `04-Jan` = col_double(),
##   ..   `09-May` = col_double(),
##   ..   `14-Oct` = col_double(),
##   ..   `15-19` = col_double(),
##   ..   `20-24` = col_double(),
##   ..   `25-29` = col_double(),
##   ..   `30-34` = col_double(),
##   ..   `35-39` = col_double(),
##   ..   `40-44` = col_double(),
##   ..   `45-49` = col_double(),
##   ..   `50-54` = col_double(),
##   ..   `55-59` = col_double(),
##   ..   `60-64` = col_double(),
##   ..   `65-69` = col_double(),
##   ..   `70-74` = col_double(),
##   ..   `75-79` = col_double(),
##   ..   `80-84` = col_double(),
##   ..   `85-89` = col_double(),
##   ..   `90+` = col_double(),
##   ..   Total = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(pop)
##       ...1       Health Service Delivery Area      Year         Gender                <1       
##  Min.   : 0.00   Length:51                    Min.   :2016   Length:51          Min.   :  276  
##  1st Qu.:14.00   Class :character             1st Qu.:2016   Class :character   1st Qu.:  714  
##  Median :31.00   Mode  :character             Median :2016   Mode  :character   Median : 1425  
##  Mean   :29.06                                Mean   :2016                      Mean   : 3510  
##  3rd Qu.:42.00                                3rd Qu.:2016                      3rd Qu.: 2802  
##  Max.   :53.00                                Max.   :2016                      Max.   :44757  
##      04-Jan           09-May           14-Oct           15-19            20-24       
##  Min.   :  1356   Min.   :  1755   Min.   :  1811   Min.   :  1919   Min.   :  1577  
##  1st Qu.:  3118   1st Qu.:  4194   1st Qu.:  3910   1st Qu.:  4240   1st Qu.:  4214  
##  Median :  5663   Median :  8002   Median :  7808   Median :  8836   Median :  9193  
##  Mean   : 14236   Mean   : 18362   Mean   : 18197   Mean   : 21386   Mean   : 25657  
##  3rd Qu.: 11025   3rd Qu.: 15384   3rd Qu.: 15350   3rd Qu.: 17866   3rd Qu.: 22940  
##  Max.   :181511   Max.   :234118   Max.   :232010   Max.   :272677   Max.   :327129  
##      25-29            30-34            35-39            40-44            45-49       
##  Min.   :  1640   Min.   :  2069   Min.   :  2092   Min.   :  2008   Min.   :  2040  
##  1st Qu.:  4169   1st Qu.:  4415   1st Qu.:  4572   1st Qu.:  4483   1st Qu.:  4859  
##  Median :  8646   Median :  8830   Median :  9490   Median :  9551   Median :  9930  
##  Mean   : 25019   Mean   : 25823   Mean   : 24593   Mean   : 23961   Mean   : 25759  
##  3rd Qu.: 21291   3rd Qu.: 22644   3rd Qu.: 21896   3rd Qu.: 21345   3rd Qu.: 23211  
##  Max.   :318998   Max.   :329242   Max.   :313564   Max.   :305498   Max.   :328426  
##      50-54            55-59            60-64            65-69            70-74       
##  Min.   :  2137   Min.   :  2038   Min.   :  1550   Min.   :  1119   Min.   :   784  
##  1st Qu.:  5431   1st Qu.:  5684   1st Qu.:  5349   1st Qu.:  5006   1st Qu.:  3628  
##  Median : 10666   Median : 10972   Median : 10698   Median : 10011   Median :  7256  
##  Mean   : 27581   Mean   : 27572   Mean   : 24834   Mean   : 22081   Mean   : 15922  
##  3rd Qu.: 24455   3rd Qu.: 22498   3rd Qu.: 19486   3rd Qu.: 16891   3rd Qu.: 12128  
##  Max.   :351662   Max.   :351545   Max.   :316631   Max.   :281537   Max.   :203011  
##      75-79            80-84            85-89            90+              Total        
##  Min.   :   538   Min.   :   351   Min.   :  196   Min.   :  114.0   Min.   :  32171  
##  1st Qu.:  2332   1st Qu.:  1612   1st Qu.:  973   1st Qu.:  536.5   1st Qu.:  70681  
##  Median :  5026   Median :  3867   Median : 2636   Median : 1394.0   Median : 145321  
##  Mean   : 11509   Mean   :  8385   Mean   : 5466   Mean   : 3295.2   Mean   : 373150  
##  3rd Qu.:  9268   3rd Qu.:  6582   3rd Qu.: 4543   3rd Qu.: 2955.0   3rd Qu.: 329569  
##  Max.   :146734   Max.   :106907   Max.   :69687   Max.   :42014.0   Max.   :4757658
str(corr)
## spc_tbl_ [420,963 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ dbuid2016     : num [1:420963] 1e+10 1e+10 1e+10 1e+10 1e+10 ...
##  $ csduid2016    : num [1:420963] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
##  $ hruid2017     : num [1:420963] 1011 1011 1011 1011 1011 ...
##  $ hrname_english: chr [1:420963] "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" ...
##  $ hrname_french : chr [1:420963] "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" ...
##  $ dbpop2016     : num [1:420963] 160 25 268 53 71 217 39 120 154 111 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   dbuid2016 = col_double(),
##   ..   csduid2016 = col_double(),
##   ..   hruid2017 = col_double(),
##   ..   hrname_english = col_character(),
##   ..   hrname_french = col_character(),
##   ..   dbpop2016 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(corr)
##    dbuid2016           csduid2016        hruid2017    hrname_english     hrname_french     
##  Min.   :1.001e+10   Min.   :1001101   Min.   :1011   Length:420963      Length:420963     
##  1st Qu.:2.466e+10   1st Qu.:2466023   1st Qu.:2414   Class :character   Class :character  
##  Median :3.558e+10   Median :3558090   Median :3595   Mode  :character   Mode  :character  
##  Mean   :3.856e+10   Mean   :3855572   Mean   :3856                                        
##  3rd Qu.:4.806e+10   3rd Qu.:4806016   3rd Qu.:4832                                        
##  Max.   :6.208e+10   Max.   :6208098   Max.   :6201                                        
##                                                                                            
##    dbpop2016      
##  Min.   :   0.00  
##  1st Qu.:   5.00  
##  Median :  26.00  
##  Mean   :  69.44  
##  3rd Qu.:  77.00  
##  Max.   :7607.00  
##  NA's   :35
str(env)
## spc_tbl_ [116,011 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ POSTALCODE12: chr [1:116011] "V0C1E0" "V0C1W0" "V0C2X0" "V0C2Z0" ...
##  $ WTHNRC12_01 : num [1:116011] 27.1 30.3 28.2 28.7 23.1 ...
##  $ WTHNRC12_02 : num [1:116011] -46.7 -46 -42.3 -44.9 -35 ...
##  $ WTHNRC12_03 : num [1:116011] -2.605 -1.868 -1.963 -1.286 0.338 ...
##  $ WTHNRC12_04 : num [1:116011] 2.91 3.92 4.01 4.32 4.33 ...
##  $ WTHNRC12_05 : num [1:116011] -8.12 -7.65 -7.93 -6.89 -3.65 ...
##  $ WTHNRC12_06 : num [1:116011] 11.04 11.57 11.94 11.2 7.98 ...
##  $ WTHNRC12_07 : num [1:116011] 171 198 248 175 310 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   POSTALCODE12 = col_character(),
##   ..   WTHNRC12_01 = col_double(),
##   ..   WTHNRC12_02 = col_double(),
##   ..   WTHNRC12_03 = col_double(),
##   ..   WTHNRC12_04 = col_double(),
##   ..   WTHNRC12_05 = col_double(),
##   ..   WTHNRC12_06 = col_double(),
##   ..   WTHNRC12_07 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(env)
##  POSTALCODE12        WTHNRC12_01     WTHNRC12_02      WTHNRC12_03      WTHNRC12_04    
##  Length:116011      Min.   :19.28   Min.   :-46.74   Min.   :-2.605   Min.   : 2.913  
##  Class :character   1st Qu.:31.07   1st Qu.:-11.44   1st Qu.: 9.693   1st Qu.:13.454  
##  Mode  :character   Median :31.28   Median : -8.40   Median :10.140   Median :13.887  
##                     Mean   :31.62   Mean   :-11.67   Mean   : 9.573   Mean   :13.521  
##                     3rd Qu.:31.76   3rd Qu.: -7.49   3rd Qu.:10.449   3rd Qu.:14.006  
##                     Max.   :39.07   Max.   : -4.48   Max.   :10.551   Max.   :16.227  
##   WTHNRC12_05      WTHNRC12_06      WTHNRC12_07    
##  Min.   :-8.122   Min.   : 4.336   Min.   : 104.6  
##  1st Qu.: 5.750   1st Qu.: 6.936   1st Qu.:1072.1  
##  Median : 6.612   Median : 7.151   Median :1316.4  
##  Mean   : 5.625   Mean   : 7.896   Mean   :1110.7  
##  3rd Qu.: 6.893   3rd Qu.: 7.935   3rd Qu.:1399.0  
##  Max.   : 7.132   Max.   :13.498   Max.   :3076.5

We just broke one of our rules: do not repeat code! We will deal with this later…
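One way to avoid that repetition is a sketch like the following, using base R lapply over a named list (purrr::walk is a tidyverse equivalent); toy data frames stand in for the four imported tables:

```r
# Put the data frames in a named list, then call str() on each in a loop
# instead of writing str(mort), str(pop), str(corr), str(env) by hand.
dfs <- list(mort = data.frame(x = 1:3),
            pop  = data.frame(y = 4:6))

invisible(lapply(names(dfs), function(nm) {
  cat("====", nm, "====\n")
  str(dfs[[nm]])
}))
```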

We can also use the dplyr function glimpse:

glimpse(mort)
## Rows: 60,143
## Columns: 13
## $ ID                <dbl> 1000001, 1000002, 1000003, 1000004, 1000005, 1000006, 1000007, 100000~
## $ Sex               <chr> "M", "F", "M", "M", "M", "F", "F", "F", "F", "M", "F", "F", "F", "M",~
## $ Cause_of_death    <chr> "D47", "C34", "C26", "C16", "C22", "C50", "C85", "C56", "C64", "C22",~
## $ dbuid2016         <dbl> 13070175019, 48060924001, 35204502002, 24663048013, 24560249013, 3525~
## $ Location_of_death <dbl> 5, 3, 6, 1, 6, 5, 4, 1, 3, 2, 5, 1, 3, 4, 1, 5, 2, 6, 1, 1, 2, 5, 6, ~
## $ Marital_status    <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, ~
## $ Postalcode        <chr> "E1A1E9", "T2E0T2", "M4J1L1", "H9H1B4", "J2W2B6", "L8H2K1", "N9Y3X5",~
## $ B_year            <dbl> 1943, 1943, 1953, 1923, 1931, 1933, 1918, 1972, 1922, 1953, 1978, 193~
## $ B_month           <dbl> 7, 5, 8, 6, 8, 11, 5, 8, 7, 2, 11, 3, 5, 9, 8, 12, 5, 11, 9, 4, 5, 9,~
## $ B_day             <dbl> 8, 22, 9, 14, 8, 4, 20, 18, 23, 7, 8, 6, 1, 8, 14, 31, 22, 4, 11, 8, ~
## $ D_year            <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 201~
## $ D_month           <dbl> 9, 12, 4, 11, 4, 2, 2, 6, 5, 1, 2, 10, 6, 2, 7, 8, 6, 2, 5, 9, 10, 7,~
## $ D_day             <dbl> 22, 14, 21, 21, 4, 6, 9, 13, 25, 16, 22, 17, 15, 7, 9, 7, 20, 7, 7, 7~

Investigate:

  1. How many health regions are in the data?
  2. How many men died of cancer? How many women?
  3. How many babies are in BC (under one year)?

corr %>%
  distinct(hrname_english) %>%
  arrange(hrname_english)
## # A tibble: 96 x 1
##    hrname_english                                       
##    <chr>                                                
##  1 Athabasca Health Authority                           
##  2 Brant County Health Unit                             
##  3 Calgary Zone                                         
##  4 Central Regional Health Authority                    
##  5 Central Vancouver Island Health Service Delivery Area
##  6 Central Zone                                         
##  7 Chatham-Kent Health Unit                             
##  8 City of Hamilton Health Unit                         
##  9 City of Toronto Health Unit                          
## 10 Cypress Regional Health Authority                    
## # i 86 more rows

If we look through these Health Regions, we can see that there is something strange going on with a few entries from Quebec. When working with English/French data, the text encoding can be displayed incorrectly on import. We can fix this by re-importing the Corr data and specifying the locale. Sometimes this resolves the issue; other times it is more of a manual exercise that requires recoding variables and working with stringr. We will look at stringr in Session 3.

corr <- read_csv(
  "./Raw Data/Corr_2016.csv",
  locale = readr::locale(encoding = "latin1")
)

corr %>%
  distinct(hrname_english) %>%
  arrange(hrname_english)
## # A tibble: 96 x 1
##    hrname_english                                       
##    <chr>                                                
##  1 Athabasca Health Authority                           
##  2 Brant County Health Unit                             
##  3 Calgary Zone                                         
##  4 Central Regional Health Authority                    
##  5 Central Vancouver Island Health Service Delivery Area
##  6 Central Zone                                         
##  7 Chatham-Kent Health Unit                             
##  8 City of Hamilton Health Unit                         
##  9 City of Toronto Health Unit                          
## 10 Cypress Regional Health Authority                    
## # i 86 more rows

If we simply want the number of health regions, we can use:

no_regions <- corr %>% distinct(hrname_english) %>% nrow()

There are 96 Health Regions in the corr data set.
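dplyr also provides n_distinct(), and base R’s length(unique()) does the same job; a toy example:

```r
# Counting distinct values without a distinct() %>% nrow() pipeline.
regions <- c("North", "North", "South", "East")
length(unique(regions))
# [1] 3
```

In a pipeline, corr %>% summarize(n = n_distinct(hrname_english)) would give the same count.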

##----------------------------------------------------
## How many men died of cancer? How many women?
##----------------------------------------------------
mort %>% count(Sex)
## # A tibble: 3 x 2
##   Sex       n
##   <chr> <int>
## 1 F     28630
## 2 M     31500
## 3 <NA>     13
# or
mort %>% group_by(Sex) %>% count()
## # A tibble: 3 x 2
## # Groups:   Sex [3]
##   Sex       n
##   <chr> <int>
## 1 F     28630
## 2 M     31500
## 3 <NA>     13
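The counts above cover all deaths by sex. If the mortality file also contains non-cancer causes, answering the cancer question needs a filter on the ICD-10 code first. A sketch in base R, assuming “cancer” means the malignant neoplasm codes C00–C97 (whether to also include selected D codes, such as in-situ neoplasms, is a definitional choice to confirm for the course):

```r
# Toy stand-in for the mortality table; the real data has the same columns.
mort_toy <- data.frame(
  Sex            = c("M", "F", "M", "F"),
  Cause_of_death = c("C34", "C50", "D47", "C16")
)

# Keep only codes starting with "C" (malignant neoplasms), then count by sex.
cancer <- mort_toy[grepl("^C", mort_toy$Cause_of_death), ]
table(cancer$Sex)
```

With dplyr, the same filter is mort %>% filter(grepl("^C", Cause_of_death)) %>% count(Sex).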
##----------------------------------------------------
## How many babies are in BC (under one year)?
##----------------------------------------------------
sum(pop$`<1`)
## [1] 179028
pop %>%
  summarize(Total_below_1 = sum(`<1`))
## # A tibble: 1 x 1
##   Total_below_1
##           <dbl>
## 1        179028

When column names do not start with a letter, the syntax changes. For example, pop$<1 will not return any values. When columns are named badly (with spaces, or starting with a symbol or number), we can refer to them using the backticks shown above. In practice, we should convert all columns to a better format; it makes for quicker coding and ensures that future work or analysis is not hampered by column names.
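A small illustration of both points: backticks make a non-syntactic name usable, and base R’s make.names() converts names into valid (if ugly) identifiers; janitor::clean_names() is a popular tidier alternative:

```r
# check.names = FALSE preserves the awkward "<1" column name on creation.
df <- data.frame(`<1` = c(10, 20), check.names = FALSE)

# df$<1 is a syntax error; with backticks the column is accessible.
sum(df$`<1`)
# [1] 30

# make.names() turns the name into a valid identifier.
make.names("<1")
# [1] "X.1"
```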

If we look closer at the population data, some of the column names do not make much sense. Both names and colnames list the column names of a data frame; in R, there is always more than one way to do things.

names(pop)
##  [1] "...1"                         "Health Service Delivery Area" "Year"                        
##  [4] "Gender"                       "<1"                           "04-Jan"                      
##  [7] "09-May"                       "14-Oct"                       "15-19"                       
## [10] "20-24"                        "25-29"                        "30-34"                       
## [13] "35-39"                        "40-44"                        "45-49"                       
## [16] "50-54"                        "55-59"                        "60-64"                       
## [19] "65-69"                        "70-74"                        "75-79"                       
## [22] "80-84"                        "85-89"                        "90+"                         
## [25] "Total"
colnames(pop)
##  [1] "...1"                         "Health Service Delivery Area" "Year"                        
##  [4] "Gender"                       "<1"                           "04-Jan"                      
##  [7] "09-May"                       "14-Oct"                       "15-19"                       
## [10] "20-24"                        "25-29"                        "30-34"                       
## [13] "35-39"                        "40-44"                        "45-49"                       
## [16] "50-54"                        "55-59"                        "60-64"                       
## [19] "65-69"                        "70-74"                        "75-79"                       
## [22] "80-84"                        "85-89"                        "90+"                         
## [25] "Total"

The 6th column, “04-Jan”, should be the next age group after <1 year old. The interval between groups is five years, so “04-Jan” should probably be “01-04” or some variation of that; a spreadsheet program has likely reinterpreted the age range “01-04” as the date January 4th (and likewise for “09-May” and “14-Oct”). For tidy data we also need to make sure that our column names start with a letter and not a number. This is a problem, since many of the age columns start with numbers and need to be cleaned up!

We are going to use rename to rename the three columns. Note: There are many different approaches to this problem.

pop2 <- pop %>%
  rename("01-04" = "04-Jan",
         "05-09" = "09-May",
         "10-14" = "14-Oct")

This is one approach; here is another using pivot_longer. First, let’s use select to drop the Total column, along with an X1 index column if it exists. In pivot_longer(), the cols argument defines which columns should be combined under Age. Instead of relying on column positions, we exclude the known non-age columns; this is more reliable because some versions of the population file may not contain an X1 column. (Note that any_of() silently skips names that are not present. In this file the index column is actually named “...1”, so “X1” does not match it and it survives into the Age column, as the first row of the output shows.)

pop_long <- pop %>%
  select(-any_of(c("X1", "Total"))) %>%
  pivot_longer(
    cols = -any_of(c("Year", "Gender", "Health Service Delivery Area")),
    names_to = "Age",
    values_to = "Value"
  )
  
pop_long
## # A tibble: 1,071 x 5
##    `Health Service Delivery Area`  Year Gender Age     Value
##    <chr>                          <dbl> <chr>  <chr>   <dbl>
##  1 British Columbia                2016 M      ...1        0
##  2 British Columbia                2016 M      <1      22997
##  3 British Columbia                2016 M      04-Jan  93211
##  4 British Columbia                2016 M      09-May 121341
##  5 British Columbia                2016 M      14-Oct 119586
##  6 British Columbia                2016 M      15-19  140451
##  7 British Columbia                2016 M      20-24  170968
##  8 British Columbia                2016 M      25-29  159609
##  9 British Columbia                2016 M      30-34  162416
## 10 British Columbia                2016 M      35-39  155936
## # i 1,061 more rows

Examine how the data differs from pop to pop_long. Type both into your console to compare.

We can now recode the values in the long format by using case_when. See ?case_when for details and https://dplyr.tidyverse.org/reference/case_when.html

First we are going to mutate the existing Age variable. Then we are saying:

  1. when Age is equal to “04-Jan”, assign (~) these rows the value “01-04”
  2. when Age is equal to “09-May”, assign these rows the value “05-09”
  3. when Age is equal to “14-Oct”, assign these rows the value “10-14”
  4. when none of the conditions above are matched, keep the value the rows already had

pop_long <- pop_long %>%
  mutate(Age = case_when(
    Age == "04-Jan" ~ "01-04",
    Age == "09-May" ~ "05-09",
    Age == "14-Oct" ~ "10-14",
    TRUE ~ Age
  ))

Below we have altered the code above by removing the last line of the case_when. Why is this wrong? What happens to the other values?

pop %>%
  select(-any_of(c("X1", "Total"))) %>%
  pivot_longer(
    cols = -any_of(c("Year", "Gender", "Health Service Delivery Area")),
    names_to = "Age",
    values_to = "Value"
  ) %>%
  mutate(Age = case_when(
    Age == "04-Jan" ~ "01-04",
    Age == "09-May" ~ "05-09",
    Age == "14-Oct" ~ "10-14"
    # TRUE ~ Age
  ))
## # A tibble: 1,071 x 5
##    `Health Service Delivery Area`  Year Gender Age    Value
##    <chr>                          <dbl> <chr>  <chr>  <dbl>
##  1 British Columbia                2016 M      <NA>       0
##  2 British Columbia                2016 M      <NA>   22997
##  3 British Columbia                2016 M      01-04  93211
##  4 British Columbia                2016 M      05-09 121341
##  5 British Columbia                2016 M      10-14 119586
##  6 British Columbia                2016 M      <NA>  140451
##  7 British Columbia                2016 M      <NA>  170968
##  8 British Columbia                2016 M      <NA>  159609
##  9 British Columbia                2016 M      <NA>  162416
## 10 British Columbia                2016 M      <NA>  155936
## # i 1,061 more rows

It is important to double-check your results when you recode values!
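One quick safeguard is to cross-tabulate the old values against the new ones: every old value should map to exactly one new value, and nothing should fall through to NA. A base R sketch, with a named lookup vector standing in for the case_when:

```r
# A lookup table for the recode; values not in the table are kept as-is.
map <- c("04-Jan" = "01-04", "09-May" = "05-09", "14-Oct" = "10-14")
old <- c("04-Jan", "09-May", "14-Oct", "15-19", "20-24")
new <- ifelse(old %in% names(map), map[old], old)

# Each row (old value) should have exactly one non-zero cell (new value).
table(old, new, useNA = "ifany")
```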

If you “knit” this document, make sure you have commented out the “unzip” chunk. Note that all of the output will be “knitted” into the Word document, some of which is not intuitive or pretty when displayed in Word. Refer to the R Markdown documentation to control how chunks and their output appear in Word.

See: https://rmarkdown.rstudio.com/lesson-3.html

Next week:

We will continue cleaning and reshaping the data. We will look into dates, regex, advanced functions and working with lists.

You will structure your own analysis Project in R to create a polished R Markdown document.