Chapter 6 Data Description
Population Data – estimates of people in BC, by sex, age, and Health Service Delivery Area. File name: Population_Estimates.csv
Mortality Data – records of deaths are collected by each province and territory, then joined together to national level data at Statistics Canada. The data contains personal identifiers (age, sex, residence, marital status, etc.) and details about the nature of the death (ICD-10 codes). File name: Deaths_2016.xlsx
Geographical Correspondence File – contains two associated levels of geography, in this case, DB (dissemination block) and Health Service Delivery Area. File name: Corr_2016.csv
Environmental Data File – contains weather data by postal code. For this analysis, the variables for average daily high and average daily low temperature will be used to determine if the postal code is located in a cold climate. File name: Weather_data.csv
Unlike the population data used in this course, the mortality data is fictitious. It has been created from real mortality counts, causes of death, and sex, but with imputed birth dates, death dates, and geographical values.
The fictitious 2016 mortality data contains sex, geographical variables, cause of death (underlying cause of death, ICD-10 codes), marital status, location of death (in hospital, at home, etc.), and the day, month and year of birth and death.
The mortality data in Canada lists cause of death using ICD-10 codes. ICD stands for the International Classification of Diseases, while the ‘10’ indicates the 10th revision of the classification. Prior to 2000, Canada used ICD-9 codes, which causes a bit of a headache when comparing causes of death across revisions.
For this course, weather data from CANUE will be used as the exposure data. The exposure variable is the presence or absence of a cold climate. A cold climate is defined as an average daily high temperature below 10 degrees Celsius combined with an average daily low temperature below -3 degrees Celsius.
This data is real, although it is being joined to mortality data that is fictitious, so the results of the analysis will not have any practical research meaning.
The weather data set includes average temperatures, days of extreme heat and cold, days of frost, etc., all organized by postal code.
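The cold-climate flag described above can be sketched with a single mutate. Note that the column names avg_daily_high and avg_daily_low below are placeholders: the actual WTHNRC12_* variable that holds each measure must be confirmed against the CANUE codebook before running anything like this.

```r
library(dplyr)

# Flag postal codes with a cold climate: average daily high below 10 C
# AND average daily low below -3 C. avg_daily_high / avg_daily_low are
# placeholder names; map them to the correct WTHNRC12_* columns first.
env <- env %>%
  mutate(cold_climate = avg_daily_high < 10 & avg_daily_low < -3)
```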
Correspondence files can take a number of different forms, but primarily, they allow one piece of information to be “matched” to another. These can sometimes take the form of data keys.
For this course, we will be using a geographic correspondence file.
In our mortality data, we have a low level of geography (similar to a postal code). We need a correspondence file to figure out which province, health authority, and health service delivery area that geographic code belongs to, in order to group our deaths by geography.
The particular correspondence file that we will be using is available here: https://www150.statcan.gc.ca/n1/pub/82-402-x/2017001/corr-eng.htm
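With the correspondence file loaded, the “matching” is an ordinary join: every death record carries a dbuid2016 value, and the correspondence file maps each DB to its higher-level geographies. A minimal sketch of the idea:

```r
library(dplyr)

# Attach health region information to each death record via its DB identifier
mort_geo <- mort %>%
  left_join(corr, by = "dbuid2016")
```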
Wait, readr doesn’t have any functions for files ending in .xlsx. If we check the documentation and search for “xlsx” or “excel”, we find no mention of it.
Thankfully, the tidyverse installs the readxl package, which provides read_excel(). Note that readxl is not loaded automatically by library(tidyverse), so we load it ourselves:
library(readxl)
mort <- read_excel("./Raw Data/deaths_2016.xlsx")
pop <- read_csv("./Raw Data/Population_Estimates.csv")
corr <- read_csv("./Raw Data/Corr_2016.csv")
env <- read_csv("./Raw Data/Weather_data.csv")

Explore the data
Type ?str and ?summary into the R console. Both of these functions give high-level insights into our data.
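The outputs below were produced by calling these functions on each imported data set, for example:

```r
# High-level structure and summary statistics of the mortality data;
# the same pattern applies to pop, corr and env.
str(mort)
summary(mort)
```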
## tibble [60,143 x 13] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:60143] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
## $ Sex : chr [1:60143] "M" "F" "M" "M" ...
## $ Cause_of_death : chr [1:60143] "D47" "C34" "C26" "C16" ...
## $ dbuid2016 : num [1:60143] 1.31e+10 4.81e+10 3.52e+10 2.47e+10 2.46e+10 ...
## $ Location_of_death: num [1:60143] 5 3 6 1 6 5 4 1 3 2 ...
## $ Marital_status : num [1:60143] 1 1 2 1 1 1 1 1 1 2 ...
## $ Postalcode : chr [1:60143] "E1A1E9" "T2E0T2" "M4J1L1" "H9H1B4" ...
## $ B_year : num [1:60143] 1943 1943 1953 1923 1931 ...
## $ B_month : num [1:60143] 7 5 8 6 8 11 5 8 7 2 ...
## $ B_day : num [1:60143] 8 22 9 14 8 4 20 18 23 7 ...
## $ D_year : num [1:60143] 2016 2016 2016 2016 2016 ...
## $ D_month : num [1:60143] 9 12 4 11 4 2 2 6 5 1 ...
## $ D_day : num [1:60143] 22 14 21 21 4 6 9 13 25 16 ...
## ID Sex Cause_of_death dbuid2016 Location_of_death
## Min. :1000001 Length:60143 Length:60143 Min. :1.001e+10 Min. :1.000
## 1st Qu.:1016187 Class :character Class :character 1st Qu.:2.466e+10 1st Qu.:2.000
## Median :1034842 Mode :character Mode :character Median :3.521e+10 Median :4.000
## Mean :1036699 Mean :3.642e+10 Mean :3.498
## 3rd Qu.:1056221 3rd Qu.:4.715e+10 3rd Qu.:5.000
## Max. :1080708 Max. :6.208e+10 Max. :6.000
## NA's :4
## Marital_status Postalcode B_year B_month B_day
## Min. :1.0 Length:60143 Min. :1827 Min. : 1.000 Min. : 1.00
## 1st Qu.:1.0 Class :character 1st Qu.:1933 1st Qu.: 4.000 1st Qu.: 8.00
## Median :1.0 Mode :character Median :1941 Median : 7.000 Median :16.00
## Mean :1.5 Mean :1942 Mean : 6.513 Mean :15.74
## 3rd Qu.:2.0 3rd Qu.:1951 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :2.0 Max. :2016 Max. :28.000 Max. :31.00
## NA's :24 NA's :26 NA's :26
## D_year D_month D_day
## Min. :2016 Min. : 1.000 Min. : 1.00
## 1st Qu.:2016 1st Qu.: 4.000 1st Qu.: 8.00
## Median :2016 Median : 7.000 Median :16.00
## Mean :2016 Mean : 6.521 Mean :15.78
## 3rd Qu.:2016 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :2016 Max. :21.000 Max. :31.00
##
## spc_tbl_ [51 x 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:51] 0 0 0 11 11 11 12 12 12 13 ...
## $ Health Service Delivery Area: chr [1:51] "British Columbia" "British Columbia" "British Columbia" "East Kootenay" ...
## $ Year : num [1:51] 2016 2016 2016 2016 2016 ...
## $ Gender : chr [1:51] "M" "F" "T" "M" ...
## $ <1 : num [1:51] 22997 21760 44757 358 353 ...
## $ 04-Jan : num [1:51] 93211 88300 181511 1507 1439 ...
## $ 09-May : num [1:51] 121341 112777 234118 2058 2016 ...
## $ 14-Oct : num [1:51] 119586 112424 232010 1838 1811 ...
## $ 15-19 : num [1:51] 140451 132226 272677 2179 2011 ...
## $ 20-24 : num [1:51] 170968 156161 327129 1935 1767 ...
## $ 25-29 : num [1:51] 159609 159389 318998 2055 2014 ...
## $ 30-34 : num [1:51] 162416 166826 329242 2332 2290 ...
## $ 35-39 : num [1:51] 155936 157628 313564 2527 2335 ...
## $ 40-44 : num [1:51] 151311 154187 305498 2399 2260 ...
## $ 45-49 : num [1:51] 161823 166603 328426 2465 2416 ...
## $ 50-54 : num [1:51] 172367 179295 351662 2882 2927 ...
## $ 55-59 : num [1:51] 172667 178878 351545 3337 3280 ...
## $ 60-64 : num [1:51] 156654 159977 316631 3166 3032 ...
## $ 65-69 : num [1:51] 139068 142469 281537 2961 2720 ...
## $ 70-74 : num [1:51] 99527 103484 203011 2035 1974 ...
## $ 75-79 : num [1:51] 69574 77160 146734 1428 1360 ...
## $ 80-84 : num [1:51] 49268 57639 106907 931 996 ...
## $ 85-89 : num [1:51] 28898 40789 69687 522 673 ...
## $ 90+ : num [1:51] 13366 28648 42014 242 437 ...
## $ Total : num [1:51] 2361038 2396620 4757658 39157 38111 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. `Health Service Delivery Area` = col_character(),
## .. Year = col_double(),
## .. Gender = col_character(),
## .. `<1` = col_double(),
## .. `04-Jan` = col_double(),
## .. `09-May` = col_double(),
## .. `14-Oct` = col_double(),
## .. `15-19` = col_double(),
## .. `20-24` = col_double(),
## .. `25-29` = col_double(),
## .. `30-34` = col_double(),
## .. `35-39` = col_double(),
## .. `40-44` = col_double(),
## .. `45-49` = col_double(),
## .. `50-54` = col_double(),
## .. `55-59` = col_double(),
## .. `60-64` = col_double(),
## .. `65-69` = col_double(),
## .. `70-74` = col_double(),
## .. `75-79` = col_double(),
## .. `80-84` = col_double(),
## .. `85-89` = col_double(),
## .. `90+` = col_double(),
## .. Total = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## ...1 Health Service Delivery Area Year Gender <1
## Min. : 0.00 Length:51 Min. :2016 Length:51 Min. : 276
## 1st Qu.:14.00 Class :character 1st Qu.:2016 Class :character 1st Qu.: 714
## Median :31.00 Mode :character Median :2016 Mode :character Median : 1425
## Mean :29.06 Mean :2016 Mean : 3510
## 3rd Qu.:42.00 3rd Qu.:2016 3rd Qu.: 2802
## Max. :53.00 Max. :2016 Max. :44757
## 04-Jan 09-May 14-Oct 15-19 20-24
## Min. : 1356 Min. : 1755 Min. : 1811 Min. : 1919 Min. : 1577
## 1st Qu.: 3118 1st Qu.: 4194 1st Qu.: 3910 1st Qu.: 4240 1st Qu.: 4214
## Median : 5663 Median : 8002 Median : 7808 Median : 8836 Median : 9193
## Mean : 14236 Mean : 18362 Mean : 18197 Mean : 21386 Mean : 25657
## 3rd Qu.: 11025 3rd Qu.: 15384 3rd Qu.: 15350 3rd Qu.: 17866 3rd Qu.: 22940
## Max. :181511 Max. :234118 Max. :232010 Max. :272677 Max. :327129
## 25-29 30-34 35-39 40-44 45-49
## Min. : 1640 Min. : 2069 Min. : 2092 Min. : 2008 Min. : 2040
## 1st Qu.: 4169 1st Qu.: 4415 1st Qu.: 4572 1st Qu.: 4483 1st Qu.: 4859
## Median : 8646 Median : 8830 Median : 9490 Median : 9551 Median : 9930
## Mean : 25019 Mean : 25823 Mean : 24593 Mean : 23961 Mean : 25759
## 3rd Qu.: 21291 3rd Qu.: 22644 3rd Qu.: 21896 3rd Qu.: 21345 3rd Qu.: 23211
## Max. :318998 Max. :329242 Max. :313564 Max. :305498 Max. :328426
## 50-54 55-59 60-64 65-69 70-74
## Min. : 2137 Min. : 2038 Min. : 1550 Min. : 1119 Min. : 784
## 1st Qu.: 5431 1st Qu.: 5684 1st Qu.: 5349 1st Qu.: 5006 1st Qu.: 3628
## Median : 10666 Median : 10972 Median : 10698 Median : 10011 Median : 7256
## Mean : 27581 Mean : 27572 Mean : 24834 Mean : 22081 Mean : 15922
## 3rd Qu.: 24455 3rd Qu.: 22498 3rd Qu.: 19486 3rd Qu.: 16891 3rd Qu.: 12128
## Max. :351662 Max. :351545 Max. :316631 Max. :281537 Max. :203011
## 75-79 80-84 85-89 90+ Total
## Min. : 538 Min. : 351 Min. : 196 Min. : 114.0 Min. : 32171
## 1st Qu.: 2332 1st Qu.: 1612 1st Qu.: 973 1st Qu.: 536.5 1st Qu.: 70681
## Median : 5026 Median : 3867 Median : 2636 Median : 1394.0 Median : 145321
## Mean : 11509 Mean : 8385 Mean : 5466 Mean : 3295.2 Mean : 373150
## 3rd Qu.: 9268 3rd Qu.: 6582 3rd Qu.: 4543 3rd Qu.: 2955.0 3rd Qu.: 329569
## Max. :146734 Max. :106907 Max. :69687 Max. :42014.0 Max. :4757658
## spc_tbl_ [420,963 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ dbuid2016 : num [1:420963] 1e+10 1e+10 1e+10 1e+10 1e+10 ...
## $ csduid2016 : num [1:420963] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
## $ hruid2017 : num [1:420963] 1011 1011 1011 1011 1011 ...
## $ hrname_english: chr [1:420963] "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" ...
## $ hrname_french : chr [1:420963] "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" "Eastern Regional Health Authority" ...
## $ dbpop2016 : num [1:420963] 160 25 268 53 71 217 39 120 154 111 ...
## - attr(*, "spec")=
## .. cols(
## .. dbuid2016 = col_double(),
## .. csduid2016 = col_double(),
## .. hruid2017 = col_double(),
## .. hrname_english = col_character(),
## .. hrname_french = col_character(),
## .. dbpop2016 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## dbuid2016 csduid2016 hruid2017 hrname_english hrname_french
## Min. :1.001e+10 Min. :1001101 Min. :1011 Length:420963 Length:420963
## 1st Qu.:2.466e+10 1st Qu.:2466023 1st Qu.:2414 Class :character Class :character
## Median :3.558e+10 Median :3558090 Median :3595 Mode :character Mode :character
## Mean :3.856e+10 Mean :3855572 Mean :3856
## 3rd Qu.:4.806e+10 3rd Qu.:4806016 3rd Qu.:4832
## Max. :6.208e+10 Max. :6208098 Max. :6201
##
## dbpop2016
## Min. : 0.00
## 1st Qu.: 5.00
## Median : 26.00
## Mean : 69.44
## 3rd Qu.: 77.00
## Max. :7607.00
## NA's :35
## spc_tbl_ [116,011 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ POSTALCODE12: chr [1:116011] "V0C1E0" "V0C1W0" "V0C2X0" "V0C2Z0" ...
## $ WTHNRC12_01 : num [1:116011] 27.1 30.3 28.2 28.7 23.1 ...
## $ WTHNRC12_02 : num [1:116011] -46.7 -46 -42.3 -44.9 -35 ...
## $ WTHNRC12_03 : num [1:116011] -2.605 -1.868 -1.963 -1.286 0.338 ...
## $ WTHNRC12_04 : num [1:116011] 2.91 3.92 4.01 4.32 4.33 ...
## $ WTHNRC12_05 : num [1:116011] -8.12 -7.65 -7.93 -6.89 -3.65 ...
## $ WTHNRC12_06 : num [1:116011] 11.04 11.57 11.94 11.2 7.98 ...
## $ WTHNRC12_07 : num [1:116011] 171 198 248 175 310 ...
## - attr(*, "spec")=
## .. cols(
## .. POSTALCODE12 = col_character(),
## .. WTHNRC12_01 = col_double(),
## .. WTHNRC12_02 = col_double(),
## .. WTHNRC12_03 = col_double(),
## .. WTHNRC12_04 = col_double(),
## .. WTHNRC12_05 = col_double(),
## .. WTHNRC12_06 = col_double(),
## .. WTHNRC12_07 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## POSTALCODE12 WTHNRC12_01 WTHNRC12_02 WTHNRC12_03 WTHNRC12_04
## Length:116011 Min. :19.28 Min. :-46.74 Min. :-2.605 Min. : 2.913
## Class :character 1st Qu.:31.07 1st Qu.:-11.44 1st Qu.: 9.693 1st Qu.:13.454
## Mode :character Median :31.28 Median : -8.40 Median :10.140 Median :13.887
## Mean :31.62 Mean :-11.67 Mean : 9.573 Mean :13.521
## 3rd Qu.:31.76 3rd Qu.: -7.49 3rd Qu.:10.449 3rd Qu.:14.006
## Max. :39.07 Max. : -4.48 Max. :10.551 Max. :16.227
## WTHNRC12_05 WTHNRC12_06 WTHNRC12_07
## Min. :-8.122 Min. : 4.336 Min. : 104.6
## 1st Qu.: 5.750 1st Qu.: 6.936 1st Qu.:1072.1
## Median : 6.612 Median : 7.151 Median :1316.4
## Mean : 5.625 Mean : 7.896 Mean :1110.7
## 3rd Qu.: 6.893 3rd Qu.: 7.935 3rd Qu.:1399.0
## Max. : 7.132 Max. :13.498 Max. :3076.5
We just broke one of our rules not to repeat our code! We will deal with this later…
We can also use the dplyr function glimpse:
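The output below comes from calling glimpse() on the mortality data:

```r
glimpse(mort)
```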
## Rows: 60,143
## Columns: 13
## $ ID <dbl> 1000001, 1000002, 1000003, 1000004, 1000005, 1000006, 1000007, 100000~
## $ Sex <chr> "M", "F", "M", "M", "M", "F", "F", "F", "F", "M", "F", "F", "F", "M",~
## $ Cause_of_death <chr> "D47", "C34", "C26", "C16", "C22", "C50", "C85", "C56", "C64", "C22",~
## $ dbuid2016 <dbl> 13070175019, 48060924001, 35204502002, 24663048013, 24560249013, 3525~
## $ Location_of_death <dbl> 5, 3, 6, 1, 6, 5, 4, 1, 3, 2, 5, 1, 3, 4, 1, 5, 2, 6, 1, 1, 2, 5, 6, ~
## $ Marital_status <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, ~
## $ Postalcode <chr> "E1A1E9", "T2E0T2", "M4J1L1", "H9H1B4", "J2W2B6", "L8H2K1", "N9Y3X5",~
## $ B_year <dbl> 1943, 1943, 1953, 1923, 1931, 1933, 1918, 1972, 1922, 1953, 1978, 193~
## $ B_month <dbl> 7, 5, 8, 6, 8, 11, 5, 8, 7, 2, 11, 3, 5, 9, 8, 12, 5, 11, 9, 4, 5, 9,~
## $ B_day <dbl> 8, 22, 9, 14, 8, 4, 20, 18, 23, 7, 8, 6, 1, 8, 14, 31, 22, 4, 11, 8, ~
## $ D_year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 201~
## $ D_month <dbl> 9, 12, 4, 11, 4, 2, 2, 6, 5, 1, 2, 10, 6, 2, 7, 8, 6, 2, 5, 9, 10, 7,~
## $ D_day <dbl> 22, 14, 21, 21, 4, 6, 9, 13, 25, 16, 22, 17, 15, 7, 9, 7, 20, 7, 7, 7~
Investigate:
- How many health regions are in the data?
- How many men died of cancer? How many women?
- How many babies are in BC (under one year)?
corr %>%
  distinct(hrname_english) %>%
  arrange(hrname_english)

## # A tibble: 96 x 1
## hrname_english
## <chr>
## 1 Athabasca Health Authority
## 2 Brant County Health Unit
## 3 Calgary Zone
## 4 Central Regional Health Authority
## 5 Central Vancouver Island Health Service Delivery Area
## 6 Central Zone
## 7 Chatham-Kent Health Unit
## 8 City of Hamilton Health Unit
## 9 City of Toronto Health Unit
## 10 Cypress Regional Health Authority
## # i 86 more rows
If we look through these Health Regions, we can see that there is something strange going on with a few entries from Quebec.
When working with English/French data, the text encoding can be misinterpreted during import, which garbles accented characters. We can fix this by re-importing the Corr data and specifying the locale. Sometimes this resolves the issue; other times it is more of a manual exercise that requires recoding variables and working with stringr. We will be looking at stringr in Session 3.
corr <- read_csv(
"./Raw Data/Corr_2016.csv",
locale = readr::locale(encoding = "latin1")
)
corr %>%
distinct(hrname_english) %>%
  arrange(hrname_english)

## # A tibble: 96 x 1
## hrname_english
## <chr>
## 1 Athabasca Health Authority
## 2 Brant County Health Unit
## 3 Calgary Zone
## 4 Central Regional Health Authority
## 5 Central Vancouver Island Health Service Delivery Area
## 6 Central Zone
## 7 Chatham-Kent Health Unit
## 8 City of Hamilton Health Unit
## 9 City of Toronto Health Unit
## 10 Cypress Regional Health Authority
## # i 86 more rows
If we simply wanted the number of health regions we can use:
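One option is dplyr’s n_distinct(), which counts unique values:

```r
# Number of unique health region names in the correspondence file
n_distinct(corr$hrname_english)
```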
There are 96 Health Regions in the corr data set.
##----------------------------------------------------
## How many men died of cancer? How many women?
##----------------------------------------------------
mort %>% count(Sex)

## # A tibble: 3 x 2
## Sex n
## <chr> <int>
## 1 F 28630
## 2 M 31500
## 3 <NA> 13
## # A tibble: 3 x 2
## # Groups: Sex [3]
## Sex n
## <chr> <int>
## 1 F 28630
## 2 M 31500
## 3 <NA> 13
##----------------------------------------------------
## How many babies are in BC (under one year)?
##----------------------------------------------------
sum(pop$`<1`)

## [1] 179028
## # A tibble: 1 x 1
## Total_below_1
## <dbl>
## 1 179028
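A caution about this particular sum: the population file contains rows for “British Columbia” as a whole (which already total the Health Service Delivery Areas) and a “T” (total) gender row for each geography, so summing the entire `<1` column counts each baby four times. One way to pull the provincial total directly is to filter before extracting the column:

```r
# The "T" gender row for British Columbia already holds the provincial total
pop %>%
  filter(`Health Service Delivery Area` == "British Columbia",
         Gender == "T") %>%
  pull(`<1`)
## [1] 44757
```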
When column names do not start with a letter, the syntax changes. For example, pop$<1 will not return any values. When columns are named badly (with spaces, or starting with a symbol or number), we can refer to them using the backticks shown above. In practice, we should convert all columns to a better format: it makes for quicker coding and ensures that future work or analysis will not be hampered by column names.
If we look closer at the population data, some of the column names do not make much sense.
Both names and colnames do the same thing here and list the column names. In R, there is always more than one way to do things.
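The two listings below come from:

```r
names(pop)
colnames(pop)
```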
## [1] "...1" "Health Service Delivery Area" "Year"
## [4] "Gender" "<1" "04-Jan"
## [7] "09-May" "14-Oct" "15-19"
## [10] "20-24" "25-29" "30-34"
## [13] "35-39" "40-44" "45-49"
## [16] "50-54" "55-59" "60-64"
## [19] "65-69" "70-74" "75-79"
## [22] "80-84" "85-89" "90+"
## [25] "Total"
## [1] "...1" "Health Service Delivery Area" "Year"
## [4] "Gender" "<1" "04-Jan"
## [7] "09-May" "14-Oct" "15-19"
## [10] "20-24" "25-29" "30-34"
## [13] "35-39" "40-44" "45-49"
## [16] "50-54" "55-59" "60-64"
## [19] "65-69" "70-74" "75-79"
## [22] "80-84" "85-89" "90+"
## [25] "Total"
The 6th column, “04-Jan”, should be the next age group after <1 year old. The age groups are five-year intervals (after the first group, 1-4), so “04-Jan” should really be “01-04”: a spreadsheet program has interpreted the age range as a date. For tidy data we need to make sure that our column names start with a character and not a numeric value. This is a problem since so many age columns start with numbers and need to be cleaned up!
We are going to use rename to rename the three columns. Note: there are many different approaches to this problem.
That is one approach; here is another using pivot_longer(). First, we use select() to drop the columns that are not needed for the long-format population table: the index column (“...1” here, or “X1” in older versions of readr) if it exists, and Total.
In pivot_longer(), the cols argument defines which columns should be combined under Age. Instead of relying only on column positions, we exclude the known non-age columns by name. This is more reliable because different versions of the population file may name the index column differently, or omit it entirely.
pop_long <- pop %>%
  select(-any_of(c("...1", "X1", "Total"))) %>%
  pivot_longer(
    cols = -any_of(c("Year", "Gender", "Health Service Delivery Area")),
    names_to = "Age",
    values_to = "Value"
  )

pop_long

## # A tibble: 1,020 x 5
## `Health Service Delivery Area` Year Gender Age Value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 British Columbia 2016 M <1 22997
## 2 British Columbia 2016 M 04-Jan 93211
## 3 British Columbia 2016 M 09-May 121341
## 4 British Columbia 2016 M 14-Oct 119586
## 5 British Columbia 2016 M 15-19 140451
## 6 British Columbia 2016 M 20-24 170968
## 7 British Columbia 2016 M 25-29 159609
## 8 British Columbia 2016 M 30-34 162416
## 9 British Columbia 2016 M 35-39 155936
## 10 British Columbia 2016 M 40-44 151311
## # i 1,010 more rows
Examine how the data is different from pop to pop_long. Type both in your console to compare.
We can now recode the values in the long format by using case_when. See ?case_when for details and https://dplyr.tidyverse.org/reference/case_when.html
First we are going to mutate the existing Age variable. Then we are saying:
- when Age is equal to “04-Jan”, assign (~) these rows the value of “01-04”,
- when Age is equal to “09-May”, assign these rows the value of “05-09”,
- when Age is equal to “14-Oct”, assign these rows the value of “10-14”,
- when none of the conditions above match, keep the value the row already had.
pop_long <- pop_long %>%
  mutate(Age = case_when(
    Age == "04-Jan" ~ "01-04",
    Age == "09-May" ~ "05-09",
    Age == "14-Oct" ~ "10-14",
    TRUE ~ Age
  ))

Below we have altered the code above by commenting out the last line (TRUE ~ Age). Why is this wrong? What happens to the other values?
pop %>%
  select(-any_of(c("...1", "X1", "Total"))) %>%
  pivot_longer(
    cols = -any_of(c("Year", "Gender", "Health Service Delivery Area")),
    names_to = "Age",
    values_to = "Value"
  ) %>%
  mutate(Age = case_when(
    Age == "04-Jan" ~ "01-04",
    Age == "09-May" ~ "05-09",
    Age == "14-Oct" ~ "10-14"
    # TRUE ~ Age
  ))

## # A tibble: 1,020 x 5
## `Health Service Delivery Area` Year Gender Age Value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 British Columbia 2016 M <NA> 22997
## 2 British Columbia 2016 M 01-04 93211
## 3 British Columbia 2016 M 05-09 121341
## 4 British Columbia 2016 M 10-14 119586
## 5 British Columbia 2016 M <NA> 140451
## 6 British Columbia 2016 M <NA> 170968
## 7 British Columbia 2016 M <NA> 159609
## 8 British Columbia 2016 M <NA> 162416
## 9 British Columbia 2016 M <NA> 155936
## 10 British Columbia 2016 M <NA> 151311
## # i 1,010 more rows
Without the final TRUE ~ Age line, case_when() returns NA for every row that matches none of the conditions, so all of the age groups that were already correct are wiped out. It is important to double-check your data when you recode values!
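A quick sanity check after recoding is to tabulate the Age values; every expected age group should appear, and no “04-Jan”-style labels should remain:

```r
# Tabulate the recoded age groups; spreadsheet-mangled labels should be gone
pop_long %>% count(Age)
```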
If you “knit” this document, make sure you have commented out the “unzip” chunk. Note that all of the outputs will be “knitted” into the Word output, some of which is not intuitive or pretty for display in a Word document. Refer to the R Markdown documentation to alter the chunks and their display in Word.
See: https://rmarkdown.rstudio.com/lesson-3.html
Next week:
We will continue cleaning and reshaping the data. We will look into dates, regex, advanced functions and working with lists.
You will structure your own analysis Project in R to create a polished R Markdown document.