Chapter 7 Exploratory Analysis Project

In previous chapters, we focused on importing, cleaning, reshaping, joining, and visualizing datasets using the tidyverse. This chapter brings those individual skills together into a more complete exploratory analysis workflow.

The purpose of exploratory analysis is not to produce final analytical results, but rather to understand the structure and quality of the data before formal analysis begins. A well-prepared exploratory analysis document should allow a colleague, manager, or future researcher to understand the datasets being used, the cleaning decisions that were made, and the reasoning behind those decisions.

In practice, exploratory analysis is one of the most important stages of a data project because many analytical problems can be identified early through careful inspection of the data.

By the end of this chapter, you should be able to organize multiple datasets into a reproducible workflow, summarize dataset structures, evaluate missing values, create reusable functions, and communicate cleaning decisions clearly using R Markdown.

7.1 Loading libraries and importing datasets

We begin by loading the libraries used throughout the exploratory analysis workflow.

library(tidyverse)
library(readxl)
library(lubridate)

The datasets used in this project include mortality records, population estimates, geographic correspondence files, and environmental exposure data.

mort <- read_excel(
  "./Raw Data/deaths_2016.xlsx"
)

pop <- read_csv(
  "./Raw Data/Population_Estimates.csv"
)

corr <- read_csv(
  "./Raw Data/Corr_2016.csv",
  locale = readr::locale(
    encoding = "latin1"
  )
)

env <- read_csv(
  "./Raw Data/Weather_data.csv"
)

When working with multiple datasets, it is often helpful to store them in a single named list. This makes it easier to apply the same function to every dataset using tools from the purrr package, and the names are carried through to the output, so each result stays labelled with the dataset it came from.

data_list <- list(
  mortality = mort,
  population = pop,
  correspondence = corr,
  environment = env
)

7.2 Reviewing dataset structure

A good exploratory analysis begins with understanding the basic structure of each dataset. Important questions include:

How many rows and columns are present?
What variables are available?
Are there obvious missing values?
Are the variable types correct?
Are there unusual or unexpected values?

The map() function allows us to apply the same operation across all datasets efficiently. The skimr package can also generate compact summaries for exploratory data analysis (Waring et al. 2026).

For example, we can review dataset dimensions.

map(data_list, dim)

map(data_list, nrow)

map(data_list, ncol)
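The separate calls above can also be combined into a single summary table, which is often easier to read and to include in a report. The following is a minimal sketch, not the only way to do this. It uses small made-up tibbles (toy_list is a hypothetical stand-in for data_list) so that it runs without the course files.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy stand-ins for the imported datasets (hypothetical values)
toy_list <- list(
  mortality = tibble(ID = 1:5, Sex = c("F", "M", "F", "M", "F")),
  population = tibble(area = c("A", "B", "C"), pop = c(100, 250, 80), year = 2016)
)

# imap() passes each dataset as .x and its list name as .y,
# so every dataset contributes one labelled row
dim_table <- imap(
  toy_list,
  ~ tibble(dataset = .y, n_rows = nrow(.x), n_cols = ncol(.x))
) %>%
  bind_rows()

dim_table
```

Replacing toy_list with data_list should give one row per course dataset, provided the files were imported as shown earlier.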

We can also inspect variable names.

map(data_list, names)

Functions such as glimpse() and summary() are also useful during exploratory review.

# glimpse() prints as a side effect, so walk() avoids printing every dataset a second time
walk(data_list, glimpse)

map(data_list, summary)

These exploratory checks often reveal problems early in the workflow, including inconsistent variable naming, missing values, incorrect data types, or unusual coding structures.
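One way to check for incorrect data types is to list the class of every column and compare it with what you expect. The sketch below assumes a hypothetical toy_mort table in which birth year was imported as text. The values and the decision to convert to integer are illustrative only.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy data: B_year was read in as character by mistake (hypothetical values)
toy_mort <- tibble(
  ID = c("001", "002", "003"),
  B_year = c("1950", "1948", "1972"),
  Sex = c("F", "M", "F")
)

# First class of every column, returned as a named character vector
col_types <- map_chr(toy_mort, ~ class(.x)[1])
col_types

# Convert the year to an integer once the problem has been confirmed
fixed <- toy_mort %>%
  mutate(B_year = as.integer(B_year))
```

Note that as.integer() turns any text it cannot parse into NA with a warning, so it is worth re-counting missing values after a conversion like this.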

One of the most common mistakes in data analysis is beginning formal analysis before fully understanding the structure and quality of the data.

7.3 Working with missing values

Missing data are extremely common in real-world datasets. Before deciding how to handle missing values, it is important to understand how frequently they occur and whether they appear systematically within specific variables.

Instead of repeatedly writing the same code for each dataset, we can create a reusable function.

count_missing <- function(data) {
  # Count NA values in every column of `data`
  data %>%
    summarize(
      across(
        everything(),
        ~ sum(is.na(.x))
      )
    )
}

This function calculates the number of missing values for every variable in a dataset and returns them as a one-row table with one column per variable.

We can then apply the function across all datasets.

map(data_list, count_missing)

Reusable functions improve efficiency, reduce duplicated code, and make workflows easier to maintain.
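A one-row result with many columns can be hard to read for wide datasets. As one possible extension, the sketch below reshapes the counts into a long table and adds the percentage missing. The function name missing_summary and the toy_env data are hypothetical and are not part of the course materials.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy environmental data with some gaps (hypothetical values)
toy_env <- tibble(
  date = as.Date("2016-01-01") + 0:3,
  temp = c(21.5, NA, 19.0, NA),
  rain = c(0, 2.4, NA, 1.1)
)

missing_summary <- function(data) {
  data %>%
    # Count NA values per column (one wide row)
    summarize(across(everything(), ~ sum(is.na(.x)))) %>%
    # Reshape to one row per variable
    pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
    # Express the count as a percentage of all rows
    mutate(pct_missing = 100 * n_missing / nrow(data)) %>%
    # Put the variables with the most missing values first
    arrange(desc(n_missing))
}

env_missing <- missing_summary(toy_env)
env_missing
```

Because the function takes a single data frame, it can be applied to every dataset in the same way as count_missing, for example with map(data_list, missing_summary).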

It is also important to remember that missing values are not always random. In some cases, missingness may indicate systematic collection problems, unavailable information, or meaningful absence of data.

Never remove missing values automatically without first understanding why they are missing and whether the missingness may influence the analysis.
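A simple first check for systematic missingness is to compare the proportion of missing values across groups. The sketch below uses made-up data. The column names Sex and B_year are borrowed from the plotting examples later in this chapter, and the values are purely illustrative.

```r
library(dplyr)
library(tibble)

# Toy data in which birth year is missing only for one group (hypothetical values)
toy_mort <- tibble(
  Sex = c("F", "F", "F", "M", "M", "M"),
  B_year = c(1950, 1948, 1931, NA, NA, 1960)
)

# Count and proportion of missing birth years within each group
missing_by_sex <- toy_mort %>%
  group_by(Sex) %>%
  summarize(
    n = n(),
    n_missing = sum(is.na(B_year)),
    prop_missing = mean(is.na(B_year))
  )

missing_by_sex
```

A large difference between groups does not prove that the missingness is systematic, but it suggests that simply dropping incomplete rows could change the composition of the data.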

7.4 Duplicate records and cleaning decisions

In addition to missing values, duplicate records should also be investigated during exploratory analysis.

For example:

mort %>%
  count(ID) %>%
  filter(n > 1)

Duplicate records can arise from data-entry errors, repeated imports, or incorrect joins. Identifying these issues early helps prevent inflated counts and inaccurate results later in the workflow.
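Counting repeated IDs shows that duplicates exist, but not whether the repeated rows are exact copies or conflicting records. The sketch below, using a made-up table, pulls out every row that shares an ID and then removes only the rows that are identical in all columns. Whether exact copies should actually be dropped remains a project-specific decision.

```r
library(dplyr)
library(tibble)

# Toy data: ID 2 is an exact copy, ID 3 has conflicting values (hypothetical)
toy_mort <- tibble(
  ID = c(1, 2, 2, 3, 3),
  Sex = c("F", "M", "M", "F", "M"),
  B_year = c(1950, 1948, 1948, 1931, 1931)
)

# Keep every row whose ID appears more than once, for manual review
dup_rows <- toy_mort %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  ungroup()

# Drop rows that are identical across all columns
deduped <- toy_mort %>%
  distinct()

dup_rows
deduped
```

In this toy example, distinct() removes the exact copy of ID 2 but keeps both rows for ID 3, because they disagree on Sex. Conflicts of that kind usually need to be resolved by checking the source records rather than by code alone.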

As cleaning decisions are made, they should be documented clearly in plain language. Good documentation is one of the defining features of reproducible analysis.

For example, instead of writing only code comments such as:

# fixed dates

it is better to explain the reasoning more clearly:

# Birth years before 1905 were assumed to be data-entry errors
# and adjusted forward by 100 years.

Clear explanations help future users understand not only what changes were made, but also why those changes were considered appropriate.
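To illustrate how such a documented decision might sit alongside the code that implements it, here is a minimal sketch. It assumes the birth year is stored in a numeric column called B_year (the name used in the plotting example later in this chapter). The helper columns B_year_raw and B_year_adjusted are hypothetical additions that keep an audit trail of what was changed.

```r
library(dplyr)
library(tibble)

# Toy data with two implausibly early birth years (hypothetical values)
toy_mort <- tibble(
  ID = 1:4,
  B_year = c(1850, 1948, 1899, 1972)
)

cleaned <- toy_mort %>%
  mutate(
    # Keep the original value so the change can be audited later
    B_year_raw = B_year,
    # Birth years before 1905 were assumed to be data-entry errors
    # and adjusted forward by 100 years.
    B_year = if_else(B_year < 1905, B_year + 100, B_year),
    # Flag which records were changed by the rule above
    B_year_adjusted = B_year_raw < 1905
  )

cleaned
```

Keeping the raw value and a flag makes it easy to report how many records the rule affected and to reverse the decision if it later turns out to be wrong.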

7.5 Exploratory visualization

Visualization is another important component of exploratory analysis. Simple plots can quickly reveal outliers, skewed distributions, missing categories, or unusual relationships between variables.

For example, histograms are useful for examining continuous variables.

mort %>%
  ggplot(
    aes(x = B_year)
  ) +
  geom_histogram(binwidth = 5) +
  labs(
    title = "Distribution of Birth Year",
    x = "Birth year",
    y = "Count"
  ) +
  theme_minimal()

Bar plots are often useful for categorical variables.

mort %>%
  count(Sex) %>%
  ggplot(
    aes(x = Sex, y = n)
  ) +
  geom_col() +
  labs(
    title = "Distribution of Sex",
    x = "Sex",
    y = "Count"
  ) +
  theme_minimal()

Exploratory visualizations do not need to be highly polished. Their primary purpose is to help you understand the data and identify potential problems.
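As a further illustration of how a quick plot can expose outliers, the sketch below draws a boxplot of a continuous variable within each category. The data are made up, with one deliberately implausible birth year, and the column names Sex and B_year are reused from the examples above.

```r
library(ggplot2)
library(tibble)

# Toy data with one implausible birth year among the "M" records (hypothetical)
toy_mort <- tibble(
  Sex = c("F", "F", "F", "F", "M", "M", "M", "M", "M", "M"),
  B_year = c(1950, 1948, 1931, 1955, 1940, 1962, 1850, 1958, 1945, 1951)
)

# One box per category; points far from the box are drawn individually
p <- ggplot(toy_mort, aes(x = Sex, y = B_year)) +
  geom_boxplot() +
  labs(
    title = "Birth Year by Sex",
    x = "Sex",
    y = "Birth year"
  ) +
  theme_minimal()

p
```

In this toy example the value 1850 should appear as an isolated point well below the box for "M", which is the kind of record worth checking against the source data before any cleaning rule is applied.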

7.6 Organizing an exploratory analysis report

A well-structured exploratory analysis document should guide the reader logically through the datasets and cleaning workflow.

Although the exact structure may vary between projects, most exploratory analysis reports include the following components:

A short description of the project purpose and research questions.
A description of the datasets being used.
Summary information about dataset dimensions and structure.
Explanations of important variables.
Missing-value summaries and duplicate checks.
Cleaning and transformation decisions.
Initial visualizations and descriptive summaries.
A brief discussion of key findings and remaining data issues.
Proposed next steps for future analysis.

The goal is not to create a perfect final report, but rather to create a clear and reproducible record of the exploratory workflow.

7.7 Practice activity

Create a new R Markdown document called:

exploratory_analysis.Rmd

Use the structure introduced in this chapter to build a short exploratory analysis report for the datasets used throughout the course.

Your report should include:

at least one table summarizing dataset dimensions;
one missing-value summary;
at least one visualization;
a short explanation of cleaning decisions; and
clear comments describing major analytical steps.

As you write the document, focus on clarity and reproducibility. Imagine that another analyst will need to understand and continue your work in the future.

7.8 Chapter summary

In this chapter, you combined many of the skills introduced earlier in the course into a more complete exploratory analysis workflow. You organized datasets into reusable structures, summarized dimensions and variable information, evaluated missing values, created reusable functions, checked for duplicate records, and produced simple exploratory visualizations.

Most importantly, you practiced documenting cleaning decisions and analytical reasoning in a way that supports reproducibility and collaboration. Exploratory analysis is not simply a preliminary step before “real analysis”; it is a critical stage that shapes the quality and reliability of all later results.

References

Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2026. Skimr: Compact and Flexible Summaries of Data. https://cran.r-project.org/package=skimr.