Chapter 3 Joining Data
In many real-world projects, data are stored across multiple datasets rather than in a single table. For example, one file may contain patient demographic information, while another contains laboratory results or medication records. To perform meaningful analysis, these datasets often need to be combined into a single data frame.
The process of combining datasets based on one or more shared variables is called a join. In the tidyverse, joins are commonly performed using functions from the dplyr package. Understanding how joins work is an essential skill in data management because incorrect joins can easily introduce duplicate rows, missing information, or inaccurate results.
In this chapter, you will learn how to identify join keys, understand common join relationships, and use functions such as left_join() and inner_join() to combine datasets safely and effectively. The dplyr package provides powerful tools for joining relational datasets Wickham et al. (2026).
3.1 Understanding join keys
A join works by matching rows between two datasets using one or more shared variables called keys. A key is typically an identifier such as a patient ID, employee number, postal code, or year.
The example below creates two small datasets that share a common variable called key.
library(tidyverse)
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)In this example, the variable key is used to connect the two tables. Dataset x contains multiple rows for some keys, while dataset y contains only one row for each key.
Before performing joins, it is important to understand the relationship between the datasets. Common relationships include:
- one-to-one relationships, where each key appears only once in both tables;
- one-to-many relationships, where one table contains repeated keys; and
- many-to-many relationships, where both tables contain duplicate keys.
Understanding these relationships is important because joins involving duplicate keys can unexpectedly increase the number of rows in the final dataset.
3.2 Performing joins with dplyr
One of the most commonly used joins is the left_join(). A left join keeps all rows from the first dataset and adds matching information from the second dataset whenever a match is found.
## # A tibble: 4 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 2 x3 y2
## 4 1 x4 y1
In this example, every row from dataset x is preserved. The values from y are added based on matching values of key.
Another commonly used join is the inner_join(), which keeps only rows that have matches in both datasets.
Different join functions are useful for different analytical situations:
left_join()keeps all rows from the left table;inner_join()keeps only matched rows;right_join()keeps all rows from the right table;full_join()keeps all rows from both tables; andanti_join()identifies rows that do not have matching values in another dataset.
In practice, left_join() is often the safest starting point because it preserves the structure of the main dataset while adding additional variables from another table.
When performing joins, always think carefully about which dataset should be considered the “main” table. The order of tables in a join can affect the final result.
3.3 Checking for duplicate keys
Before joining datasets, it is good practice to verify whether the join key is unique. Duplicate keys are one of the most common causes of unexpected results during data merging.
The following code counts the number of times each key appears in both datasets.
## # A tibble: 2 x 2
## key n
## <dbl> <int>
## 1 1 2
## 2 2 2
## # A tibble: 0 x 2
## # i 2 variables: key <dbl>, n <int>
In dataset x, some keys appear more than once, while dataset y contains unique keys. This means the join represents a one-to-many relationship.
If both datasets contain duplicate keys, the join may create many-to-many relationships, which can dramatically increase the number of rows in the output dataset.
Always inspect duplicate keys before performing joins on large datasets. Unexpected duplication is one of the most common data management errors in real-world projects.
Another useful function is anti_join(), which identifies rows in one dataset that do not have matching values in another dataset.
For example:
This can help identify missing IDs, unmatched records, or inconsistencies between datasets.
3.4 Practical workflow for joining data
In real-world projects, joining datasets often follows a consistent workflow:
- Identify the key variable used to connect the datasets.
- Inspect both datasets for duplicate keys or missing values.
- Decide which join type is appropriate.
- Perform the join.
- Verify the number of rows before and after joining.
- Inspect the final dataset for unexpected duplication or missing values.
Developing careful habits during joins is extremely important because join-related errors can propagate through the remainder of an analysis and affect final results.
3.5 Practice exercise
To reinforce the concepts introduced in this chapter, create two small datasets with a shared key variable and experiment with different join functions.
Try the following:
- Perform a
left_join()and compare the output with aninner_join(). - Add duplicate keys to one or both datasets and observe how the number of rows changes.
- Use
anti_join()to identify unmatched records. - Compare the dimensions of the datasets before and after joining.
As you work through these exercises, focus not only on obtaining the correct output, but also on understanding why the number of rows changes under different join relationships.
3.6 Chapter summary
In this chapter, you learned how datasets can be combined using joins in dplyr. You were introduced to join keys, common join relationships, and several important join functions including left_join(), inner_join(), and anti_join().
You also learned why checking for duplicate keys is an essential part of reproducible data management workflows. Careful inspection of joins helps prevent data duplication, missing records, and inaccurate analytical results.