Chapter 38 8. Work with missing and unexpected values
Missing values can be encoded in different ways, such as NA, blank strings, NULL, or placeholder values like 999. Always check how missing or unusual values are represented before making cleaning decisions.
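As a minimal sketch of this check, the example below counts placeholder values in a small made-up table and then converts them to NA. The table `toy`, its column `sex`, and the choice of 999 and the empty string as placeholders are assumptions for illustration only; they are not taken from the `rates` data.

```r
library(dplyr)

# Hypothetical toy data: 999 and "" stand in for missing values here.
toy <- tibble(
  id     = 1:4,
  B_year = c(1950, 999, NA, 1962),
  sex    = c("F", "", "M", NA)
)

# Count each kind of placeholder before changing anything.
placeholder_counts <- toy %>%
  summarise(
    year_na   = sum(is.na(B_year)),
    year_999  = sum(B_year == 999, na.rm = TRUE),
    sex_blank = sum(sex == "", na.rm = TRUE)
  )

# Convert the placeholders to NA so later summaries treat them as missing.
toy_clean <- toy %>%
  mutate(
    B_year = na_if(B_year, 999),
    sex    = na_if(sex, "")
  )
```

Counting first keeps a record of how many values each rule will change, which makes the cleaning decision easier to review.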
38.1 8.1 Check birth year values
rates %>%
  ggplot(aes(x = B_year)) +
  geom_histogram(binwidth = 5) +
  labs(
    title = "Distribution of Birth Year",
    x = "Birth year",
    y = "Count"
  ) +
  theme_minimal()
Birth years before 1905 are unlikely in this dataset. For this exercise, we assume these values are data-entry errors where 18xx should be 19xx.
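One possible way to apply the 18xx to 19xx assumption is sketched below on made-up data. The table `toy_rates` and the new column `B_year_fixed` are hypothetical names. The sketch shifts only years from 1800 to 1899, which is a narrower rule than "before 1905", so years such as 1900 to 1904 stay unchanged and can be reviewed separately.

```r
library(dplyr)

# Hypothetical toy data standing in for `rates`.
toy_rates <- tibble(B_year = c(1898, 1904, 1905, 1950, NA))

toy_fixed <- toy_rates %>%
  mutate(
    # Assumption: a year in 1800-1899 is a typo for the same year in the 1900s.
    # Other values, including NA, are left as they are.
    B_year_fixed = if_else(between(B_year, 1800, 1899), B_year + 100, B_year)
  )
```

Writing the result to a new column keeps the original values available for comparison.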
38.2 8.2 Check death month and death day
## # A tibble: 17 x 2
##    D_month     n
##      <dbl> <int>
##  1       1  5088
##  2       2  4677
##  3       3  5056
##  4       4  4880
##  5       5  5108
##  6       6  5015
##  7       7  5147
##  8       8  5195
##  9       9  4927
## 10      10  5038
## 11      11  4913
## 12      12  5070
## 13      16     1
## 14      17     1
## 15      18     2
## 16      19     1
## 17      21     1
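The table above contains five month values outside 1 to 12 (16, 17, 18, 19, and 21), covering six records in total. A minimal sketch of how such rows might be isolated is shown below. The toy table is hypothetical and only mimics the `D_month` column.

```r
library(dplyr)

# Hypothetical toy data standing in for `rates`.
toy_rates <- tibble(D_month = c(1, 6, 12, 16, 18, 18, 21))

# Keep only months outside 1-12, then count each offending value.
# Note: NA is not in 1:12, so rows with a missing month would also be kept.
bad_months <- toy_rates %>%
  filter(!D_month %in% 1:12) %>%
  count(D_month)
```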
## # A tibble: 12 x 2
##    D_month     n
##      <dbl> <int>
##  1       1  5088
##  2       2  4677
##  3       3  5056
##  4       4  4880
##  5       5  5108
##  6       6  5021
##  7       7  5147
##  8       8  5195
##  9       9  4927
## 10      10  5038
## 11      11  4913
## 12      12  5070
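Comparing the two D_month tables, the count for month 6 rises from 5015 to 5021, which equals the six out-of-range records, so those rows appear to have been reassigned to month 6. The code for that step is not shown here. The sketch below, on hypothetical toy data, is one way to produce the same effect. The replacement value 6 is an assumption inferred from the counts; setting the invalid months to NA would be an alternative.

```r
library(dplyr)

# Hypothetical toy data standing in for `rates`.
toy_rates <- tibble(D_month = c(1, 6, 12, 16, 18, 18, 21))

# Assumption: months above 12 are replaced with 6, as the counts suggest.
toy_recoded <- toy_rates %>%
  mutate(D_month = if_else(D_month > 12, 6, D_month))

# Recount to confirm that only values 1-12 remain.
month_counts <- toy_recoded %>%
  count(D_month)
```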
## # A tibble: 31 x 2
##    D_day     n
##    <dbl> <int>
##  1     1  1954
##  2     2  1996
##  3     3  1972
##  4     4  2022
##  5     5  1929
##  6     6  1960
##  7     7  1986
##  8     8  1966
##  9     9  1956
## 10    10  1903
## # i 21 more rows
## # A tibble: 31 x 2
##    B_day     n
##    <dbl> <int>
##  1     1  1979
##  2     2  2029
##  3     3  1949
##  4     4  1937
##  5     5  1916
##  6     6  1877
##  7     7  2001
##  8     8  1960
##  9     9  1942
## 10    10  1995
## # i 21 more rows
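Both day tables print only their first 10 of 31 rows, so an out-of-range day could be hidden in the truncated part. A range summary is a compact way to check this. The sketch below uses hypothetical toy data with the same column names, and the 1 to 31 bounds are a simple assumption that ignores month length.

```r
library(dplyr)

# Hypothetical toy data standing in for `rates`; B_day contains one bad value.
toy_rates <- tibble(
  D_day = c(1, 15, 31, 10),
  B_day = c(2, 28, 30, 35)
)

# For each day column, report the minimum, the maximum,
# and how many values fall outside 1-31.
day_ranges <- toy_rates %>%
  summarise(
    across(
      c(D_day, B_day),
      list(
        min   = ~ min(.x, na.rm = TRUE),
        max   = ~ max(.x, na.rm = TRUE),
        n_bad = ~ sum(!between(.x, 1, 31), na.rm = TRUE)
      )
    )
  )
```

The result has one column per combination, named like `D_day_min` and `B_day_n_bad`, so both variables can be checked in a single row.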