Chapter 34 4. Clean and reshape population data

The population data is currently in a wide format, with age groups stored as separate columns. For plotting and analysis, it is easier to work with tidy long-format data.

34.1 4.1 Convert population data to long format

pop_long <- pop %>%
  select(-any_of(c("X1", "Total"))) %>%
  pivot_longer(
    cols = -any_of(c("Year", "Gender", "Health Service Delivery Area")),
    names_to = "Age",
    values_to = "Population"
  ) %>%
  mutate(
    Age = case_when(
      Age == "04-Jan" ~ "01-04",
      Age == "09-May" ~ "05-09",
      Age == "14-Oct" ~ "10-14",
      TRUE ~ Age
    )
  )

pop_long
## # A tibble: 1,071 x 5
##    `Health Service Delivery Area`  Year Gender Age   Population
##    <chr>                          <dbl> <chr>  <chr>      <dbl>
##  1 British Columbia                2016 M      ...1           0
##  2 British Columbia                2016 M      <1         22997
##  3 British Columbia                2016 M      01-04      93211
##  4 British Columbia                2016 M      05-09     121341
##  5 British Columbia                2016 M      10-14     119586
##  6 British Columbia                2016 M      15-19     140451
##  7 British Columbia                2016 M      20-24     170968
##  8 British Columbia                2016 M      25-29     159609
##  9 British Columbia                2016 M      30-34     162416
## 10 British Columbia                2016 M      35-39     155936
## # i 1,061 more rows

34.2 4.2 Rename columns and remove unnecessary fields

The column name Health Service Delivery Area is long and contains spaces. We will rename it to HSDA. We will also remove the Year column because all records are from 2016.

pop_long <- pop_long %>%
  rename(HSDA = `Health Service Delivery Area`) %>%
  select(-Year)

pop_long
## # A tibble: 1,071 x 4
##    HSDA             Gender Age   Population
##    <chr>            <chr>  <chr>      <dbl>
##  1 British Columbia M      ...1           0
##  2 British Columbia M      <1         22997
##  3 British Columbia M      01-04      93211
##  4 British Columbia M      05-09     121341
##  5 British Columbia M      10-14     119586
##  6 British Columbia M      15-19     140451
##  7 British Columbia M      20-24     170968
##  8 British Columbia M      25-29     159609
##  9 British Columbia M      30-34     162416
## 10 British Columbia M      35-39     155936
## # i 1,061 more rows

34.3 4.3 Keep total population only

The Gender column includes T, which represents total population. Because we are interested in total population by HSDA and age group for this example, we will keep only these records.

pop_total <- pop_long %>%
  filter(Gender == "T")

pop_total
## # A tibble: 357 x 4
##    HSDA             Gender Age   Population
##    <chr>            <chr>  <chr>      <dbl>
##  1 British Columbia T      ...1           0
##  2 British Columbia T      <1         44757
##  3 British Columbia T      01-04     181511
##  4 British Columbia T      05-09     234118
##  5 British Columbia T      10-14     232010
##  6 British Columbia T      15-19     272677
##  7 British Columbia T      20-24     327129
##  8 British Columbia T      25-29     318998
##  9 British Columbia T      30-34     329242
## 10 British Columbia T      35-39     313564
## # i 347 more rows