Chapter 23 15. Regex Practice Question
The following code reshapes the who dataset from wide to long format, using a regular expression to split each column name into a diagnosis, a gender, and an age group.
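library(tidyr)  # provides the who dataset and pivot_longer(), and re-exports %>%

who_longer <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,               # all of the case-count columns
    names_to = c("diagnosis", "gender", "age"),  # one new column per capture group
    names_pattern = "new_?(.*)_(.)(.*)",         # regex applied to each column name
    values_to = "count",
    values_drop_na = TRUE                        # drop rows where the count is missing
  )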
who_longer
## # A tibble: 76,046 x 8
##    country     iso2  iso3   year diagnosis gender age   count
##    <chr>       <chr> <chr> <dbl> <chr>     <chr>  <chr> <dbl>
##  1 Afghanistan AF    AFG    1997 sp        m      014       0
##  2 Afghanistan AF    AFG    1997 sp        m      1524     10
##  3 Afghanistan AF    AFG    1997 sp        m      2534      6
##  4 Afghanistan AF    AFG    1997 sp        m      3544      3
##  5 Afghanistan AF    AFG    1997 sp        m      4554      5
##  6 Afghanistan AF    AFG    1997 sp        m      5564      2
##  7 Afghanistan AF    AFG    1997 sp        m      65        0
##  8 Afghanistan AF    AFG    1997 sp        f      014       5
##  9 Afghanistan AF    AFG    1997 sp        f      1524     38
## 10 Afghanistan AF    AFG    1997 sp        f      2534     36
## # i 76,036 more rows
23.1 15.1 Explanation of the Regex
The pattern new_?(.*)_(.)(.*) means:
- new matches the text new at the beginning.
- _? means the underscore after new may or may not be present.
- (.*) captures the diagnosis type, such as sp, ep, sn, or rel.
- _ matches the underscore before sex and age.
- (.) captures one character for gender, such as m or f.
- (.*) captures the remaining characters as the age group.
The parentheses create capture groups. The text matched by each group is placed, in order, into the new columns listed in names_to: the first group becomes diagnosis, the second gender, and the third age.
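To see what each capture group picks up, the pattern can be tried on a few column names outside of pivot_longer(). The sketch below is only an illustration: it uses base R's regexec() and regmatches() with perl = TRUE rather than whatever tidyr does internally, and the names col_names, matches, and parts are made up for this example. The three column names are ones that exist in who.

```r
# Three column names from who: two with "new_" and one with "newrel".
col_names <- c("new_sp_m014", "new_ep_f1524", "newrel_f65")
pattern <- "new_?(.*)_(.)(.*)"

# regexec() finds the full match plus one entry per capture group;
# regmatches() extracts the matched text for each column name.
matches <- regmatches(col_names, regexec(pattern, col_names, perl = TRUE))

# Drop the full match (first element) and stack the three groups into a matrix.
parts <- do.call(rbind, lapply(matches, function(m) m[-1]))
colnames(parts) <- c("diagnosis", "gender", "age")
parts
##      diagnosis gender age
## [1,] "sp"      "m"    "014"
## [2,] "ep"      "f"    "1524"
## [3,] "rel"     "f"    "65"
```

The third row shows why the underscore after new is optional: newrel_f65 has no underscore there, yet the pattern still captures rel as the diagnosis.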