Chapter 23 15. Regex Practice Question

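The following code reshapes the who dataset (WHO tuberculosis case counts, shipped with tidyr) from wide to long format, using a regular expression to split each column name into three variables.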

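library(tidyr)  # provides who, pivot_longer(), and the %>% pipe

who_longer <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,               # the case-count columns to pivot
    names_to = c("diagnosis", "gender", "age"),  # one new column per capture group
    names_pattern = "new_?(.*)_(.)(.*)",         # regex that splits each column name
    values_to = "count",                         # cell values go into this column
    values_drop_na = TRUE                        # drop rows whose count is NA
  )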

who_longer
## # A tibble: 76,046 x 8
##    country     iso2  iso3   year diagnosis gender age   count
##    <chr>       <chr> <chr> <dbl> <chr>     <chr>  <chr> <dbl>
##  1 Afghanistan AF    AFG    1997 sp        m      014       0
##  2 Afghanistan AF    AFG    1997 sp        m      1524     10
##  3 Afghanistan AF    AFG    1997 sp        m      2534      6
##  4 Afghanistan AF    AFG    1997 sp        m      3544      3
##  5 Afghanistan AF    AFG    1997 sp        m      4554      5
##  6 Afghanistan AF    AFG    1997 sp        m      5564      2
##  7 Afghanistan AF    AFG    1997 sp        m      65        0
##  8 Afghanistan AF    AFG    1997 sp        f      014       5
##  9 Afghanistan AF    AFG    1997 sp        f      1524     38
## 10 Afghanistan AF    AFG    1997 sp        f      2534     36
## # i 76,036 more rows

23.1 15.1 Explanation of the Regex

The pattern new_?(.*)_(.)(.*) means:

  • new matches the literal text new, which begins every column name being pivoted.
  • _? matches an optional underscore after new, so both new_sp_m014 and newrel_f65 fit the pattern.
  • (.*) captures the diagnosis type, such as sp, ep, sn, or rel.
  • _ matches the underscore that separates the diagnosis from the gender and age.
  • (.) captures one character for gender, such as m or f.
  • (.*) captures the remaining characters as the age group, such as 014 or 65.
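
To see why the underscore after new has to be optional, the short sketch below compares the pattern with and without the ?. It uses base R's grepl() with perl = TRUE rather than tidyr itself, and the two column names are just a hand-picked sample from who.

```r
# Two column names from who: one with an underscore after "new", one without
cols <- c("new_sp_m014", "newrel_f65")

# Without the ?, the underscore after "new" is required
strict <- grepl("new_(.*)_(.)(.*)", cols, perl = TRUE)

# With the ?, the underscore after "new" is optional
lenient <- grepl("new_?(.*)_(.)(.*)", cols, perl = TRUE)

strict   # TRUE FALSE
lenient  # TRUE TRUE
```

The stricter pattern fails on newrel_f65 because that name has no underscore directly after new.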

The parentheses create capture groups. The captured parts are placed, in order, into the new columns listed in names_to: the first group becomes diagnosis, the second gender, and the third age.
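
As a rough check of what each capture group picks up, the same pattern can be applied to a few column names outside of pivot_longer(). This is a minimal sketch using base R's regexec() and regmatches() with perl = TRUE; it illustrates the regex only and is not how tidyr implements names_pattern internally. The sample names and the parts object are chosen here for illustration.

```r
# A hand-picked sample of column names from who
cols <- c("new_sp_m014", "new_ep_f2534", "newrel_f65")
pattern <- "new_?(.*)_(.)(.*)"

# For each name, regmatches() returns the full match followed by
# one element per capture group
m <- regmatches(cols, regexec(pattern, cols, perl = TRUE))

# Drop the full match and stack the three groups into a matrix
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("diagnosis", "gender", "age")
rownames(parts) <- cols
parts
```

Each row of parts holds the three pieces for one column name. For example, newrel_f65 splits into rel, f, and 65, the same values pivot_longer() writes into diagnosis, gender, and age.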