Welcome back to the final installment of our series, “Joins Are No Mystery Anymore: Hands-On Tutorial.” In the previous weeks, we covered the foundational and advanced join techniques, including Inner Join, Left Join, Right Join, Full Join, Semi Join, Anti Join, Cross Join, Natural Join, Self Join, and Equi Join. We’ve seen how these joins can be applied to real-life scenarios to solve various data problems.
Today, we’ll dive even deeper into the world of data joins by exploring Non-Equi Joins, Rolling Joins, Overlap Joins, and Fuzzy Joins. These specialized joins will help you handle more complex data scenarios, such as matching based on non-equality conditions, finding the nearest matches, and dealing with approximate or fuzzy data.
Let’s get started with our first join of the day: Non-Equi Join.
A Non-Equi Join is used to join tables based on non-equality conditions, such as greater than (>) or less than (<). This type of join is particularly useful when dealing with ranges or thresholds.
In this scenario, we have sales and targets. We want to find sales that exceeded the targets. This helps in identifying successful sales that met or surpassed the set goals.
We will use two datasets:
Loading the datasets:
# Load the necessary libraries library(dplyr) library(data.table) # Load the datasets load("non_equi_join_data.RData") # Display the datasets print(sales) # A tibble: 20 × 3 sale_id amount date <int> <int> <date> 1 1 178 2024-01-01 2 2 219 2024-01-02 3 3 111 2024-01-03 4 4 266 2024-01-04 5 5 208 2024-01-05 6 6 231 2024-01-06 7 7 296 2024-01-07 8 8 242 2024-01-08 9 9 149 2024-01-09 10 10 245 2024-01-10 11 11 137 2024-01-11 12 12 175 2024-01-12 13 13 209 2024-01-13 14 14 133 2024-01-14 15 15 256 2024-01-15 16 16 128 2024-01-16 17 17 107 2024-01-17 18 18 220 2024-01-18 19 19 295 2024-01-19 20 20 235 2024-01-20 print(targets) # A tibble: 10 × 3 target_id target_amount target_date <int> <dbl> <date> 1 1 100 2024-01-01 2 2 120 2024-01-06 3 3 140 2024-01-11 4 4 160 2024-01-16 5 5 180 2024-01-21 6 6 200 2024-01-26 7 7 220 2024-01-31 8 8 240 2024-02-05 9 9 260 2024-02-10 10 10 280 2024-02-15
Performing the Non-Equi Join
# Convert to data.table sales_dt <- as.data.table(sales) targets_dt <- as.data.table(targets) # Perform the non-equi join successful_sales <- sales_dt[targets_dt, on = .(amount > target_amount), nomatch = 0] # Display the result print(successful_sales) sale_id amount date target_id target_date <int> <int> <Date> <int> <Date> 1: 1 100 2024-01-01 1 2024-01-01 2: 2 100 2024-01-02 1 2024-01-01 3: 3 100 2024-01-03 1 2024-01-01 4: 4 100 2024-01-04 1 2024-01-01 5: 5 100 2024-01-05 1 2024-01-01 --- 106: 4 260 2024-01-04 9 2024-02-10 107: 7 260 2024-01-07 9 2024-02-10 108: 19 260 2024-01-19 9 2024-02-10 109: 7 280 2024-01-07 10 2024-02-15 110: 19 280 2024-01-19 10 2024-02-15
Explanation of the Code:
The resulting dataset successful_sales contains only the rows from the sales dataset where the amount exceeds the target_amount from the targets dataset.
In the same non_equi_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:
Your task is to perform a non-equi join on these datasets to match students with scholarship thresholds they exceed. Use the grade and min_grade columns for the non-equality condition.
A Rolling Join is used to join two tables based on a key column, with the ability to match the nearest value when an exact match is not found. This is particularly useful for time series data or any scenario where you need to find the closest preceding or following value.
In this scenario, we have stock prices and company events. We want to join these tables to match stock prices with the nearest company events. This helps in understanding how company events might have influenced stock prices.
We will use two datasets:
Loading the datasets:
# Load the necessary libraries library(dplyr) library(data.table) # Load the datasets load("rolling_join_data.RData") # Display the datasets print(stock_prices) # A tibble: 20 × 3 date stock_id price <date> <dbl> <int> 1 2024-01-01 1 106 2 2024-01-02 1 106 3 2024-01-03 1 108 4 2024-01-04 1 103 5 2024-01-05 1 103 6 2024-01-06 1 109 7 2024-01-07 1 101 8 2024-01-08 1 104 9 2024-01-09 1 107 10 2024-01-10 1 101 11 2024-01-11 1 108 12 2024-01-12 1 102 13 2024-01-13 1 107 14 2024-01-14 1 108 15 2024-01-15 1 101 16 2024-01-16 1 100 17 2024-01-17 1 104 18 2024-01-18 1 104 19 2024-01-19 1 104 20 2024-01-20 1 101 print(events) # A tibble: 3 × 4 event_id stock_id event_date description <int> <dbl> <date> <chr> 1 1 1 2024-01-05 Quarterly Meeting 2 2 1 2024-01-15 Product Launch 3 3 1 2024-01-25 Earnings Call
Performing the Rolling Join
# Convert to data.table stock_prices_dt <- as.data.table(stock_prices) events_dt <- as.data.table(events) # Set keys for rolling join setkey(stock_prices_dt, stock_id, date) setkey(events_dt, stock_id, event_date) # Perform the rolling join stock_events <- events_dt[stock_prices_dt, roll = "nearest", on = .(stock_id, event_date = date)] # Display the result print(stock_events) Key: <stock_id, event_date> event_id stock_id event_date description price <int> <num> <Date> <char> <int> 1: 1 1 2024-01-01 Quarterly Meeting 106 2: 1 1 2024-01-02 Quarterly Meeting 106 3: 1 1 2024-01-03 Quarterly Meeting 108 4: 1 1 2024-01-04 Quarterly Meeting 103 5: 1 1 2024-01-05 Quarterly Meeting 103 6: 1 1 2024-01-06 Quarterly Meeting 109 7: 1 1 2024-01-07 Quarterly Meeting 101 8: 1 1 2024-01-08 Quarterly Meeting 104 9: 1 1 2024-01-09 Quarterly Meeting 107 10: 1 1 2024-01-10 Quarterly Meeting 101 11: 2 1 2024-01-11 Product Launch 108 12: 2 1 2024-01-12 Product Launch 102 13: 2 1 2024-01-13 Product Launch 107 14: 2 1 2024-01-14 Product Launch 108 15: 2 1 2024-01-15 Product Launch 101 16: 2 1 2024-01-16 Product Launch 100 17: 2 1 2024-01-17 Product Launch 104 18: 2 1 2024-01-18 Product Launch 104 19: 2 1 2024-01-19 Product Launch 104 20: 2 1 2024-01-20 Product Launch 101 event_id stock_id event_date description price
Explanation of the Code:
The resulting dataset stock_events contains the rows from the stock_prices dataset matched with the nearest preceding or following event from the events dataset based on the date and event_date columns.
In the same rolling_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:
Your task is to perform a rolling join on these datasets to match weather records with the nearest significant weather events. Use the date and event_date columns for the rolling join.
An Overlap Join is used to join tables based on overlapping ranges of values. This type of join is particularly useful for scenarios where you need to find overlapping time periods or other ranges.
In this scenario, we have hotel bookings and we want to find overlapping bookings. This helps in identifying potential overbookings and managing reservations effectively.
We will use one dataset:
Loading the dataset:
# Load the necessary libraries library(dplyr) library(fuzzyjoin) # Load the dataset load("overlap_join_data.RData") # Display the dataset print(bookings) # A tibble: 20 × 4 booking_id room_id start_date end_date <int> <int> <date> <date> 1 1 109 2024-01-01 2024-01-05 2 2 111 2024-01-03 2024-01-07 3 3 118 2024-01-05 2024-01-09 4 4 104 2024-01-07 2024-01-11 5 5 103 2024-01-09 2024-01-13 6 6 103 2024-01-11 2024-01-15 7 7 101 2024-01-13 2024-01-17 8 8 101 2024-01-15 2024-01-19 9 9 101 2024-01-17 2024-01-21 10 10 103 2024-01-19 2024-01-23 11 11 101 2024-01-21 2024-01-25 12 12 116 2024-01-23 2024-01-27 13 13 111 2024-01-25 2024-01-29 14 14 116 2024-01-27 2024-01-31 15 15 104 2024-01-29 2024-02-02 16 16 103 2024-01-31 2024-02-04 17 17 110 2024-02-02 2024-02-06 18 18 114 2024-02-04 2024-02-08 19 19 111 2024-02-06 2024-02-10 20 20 102 2024-02-08 2024-02-12
Performing the Overlap Join
# Perform the overlap join overlapping_bookings <- fuzzy_left_join( bookings, bookings, by = c("room_id" = "room_id", "start_date" = "end_date", "end_date" = "start_date"), match_fun = list(`==`, `<=`, `>=`) ) %>% filter(booking_id.x != booking_id.y) # Display the result print(overlapping_bookings) # A tibble: 12 × 8 booking_id.x room_id.x start_date.x end_date.x booking_id.y room_id.y start_date.y end_date.y <int> <int> <date> <date> <int> <int> <date> <date> 1 5 103 2024-01-09 2024-01-13 6 103 2024-01-11 2024-01-15 2 6 103 2024-01-11 2024-01-15 5 103 2024-01-09 2024-01-13 3 7 101 2024-01-13 2024-01-17 8 101 2024-01-15 2024-01-19 4 7 101 2024-01-13 2024-01-17 9 101 2024-01-17 2024-01-21 5 8 101 2024-01-15 2024-01-19 7 101 2024-01-13 2024-01-17 6 8 101 2024-01-15 2024-01-19 9 101 2024-01-17 2024-01-21 7 9 101 2024-01-17 2024-01-21 7 101 2024-01-13 2024-01-17 8 9 101 2024-01-17 2024-01-21 8 101 2024-01-15 2024-01-19 9 9 101 2024-01-17 2024-01-21 11 101 2024-01-21 2024-01-25 10 11 101 2024-01-21 2024-01-25 9 101 2024-01-17 2024-01-21 11 12 116 2024-01-23 2024-01-27 14 116 2024-01-27 2024-01-31 12 14 116 2024-01-27 2024-01-31 12 116 2024-01-23 2024-01-27
Explanation of the Code:
The resulting dataset overlapping_bookings contains pairs of rows from the bookings dataset where the bookings overlap based on the room_id, start_date, and end_date columns.
In the same overlap_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:
Your task is to perform an overlap join on these datasets to find overlapping project timelines. Use the start_date and end_date columns for the overlap join.
A Fuzzy Join is used to join tables based on approximate or “fuzzy” matching of key columns. Unlike traditional joins, which require exact matches between columns, fuzzy joins allow for matches based on similarity, proximity, or other non-exact criteria. This is particularly useful when dealing with data that has inconsistencies, such as typos, different naming conventions, or slight variations in values.
Fuzzy joins can be used in various scenarios, such as:
Fuzzy joins leverage different methods of similarity measurement, such as string distance (e.g., Levenshtein distance), numeric proximity, or custom matching functions, to find the best possible matches between rows.
In this scenario, we have customer records from two different sources. We want to join these tables to combine records that refer to the same customers, even if there are slight differences in the names or addresses. This helps in consolidating customer data from multiple sources into a single, unified view.
We will use two datasets:
Loading the datasets:
# Load the necessary libraries library(dplyr) library(fuzzyjoin) # Load the datasets load("fuzzy_join_data.RData") # Display the datasets print(customer_records_A) Name Address ID 1 Terry Welsh 5028 Paddock Way 1 2 Robin Lewis 3415 Photinia Ave 2 3 Tyrone Carter 5697 Smokey Ln 3 4 Noah Fuller 4028 Northaven Rd 4 5 Heather Barnes 5530 First Street 5 6 Alex Peck 7132 Mockingbird Hill 6 7 Alfredo Martinez 8920 Smokey Ln 7 8 Adrian Morales 6567 Nowlin Rd 8 9 Melvin Paine 8310 Fincher Rd 9 10 Harry Edwards 8848 Valley View Ln 10 11 Esther Williamson 2644 Daisy Dr 11 12 Stella Campbell 9021 E Sandy Lake Rd 12 13 Lawrence Grant 3221 First Street 13 14 Vivan Perez 996 Wycliff Ave 14 15 Eli Brewer 7344 Wheeler Ridge Dr 15 16 Edward Wagner 7174 W Sherman Dr 16 17 Dwayne Day 2115 Groveland Terrace 17 18 Erika Flores 3566 Nowlin Rd 18 19 Nicholas Nelson 7867 Dane St 19 20 Wade Willis 8608 Pecan Acres Ln 20 print(customer_records_B) Name Address ID 1 Terry Welch 5028 Paddock Way 1 2 Courtney Elliott 6526 Cackson St 2 3 Tyron Carter 5697 Smokey Ln 3 4 Noah Fueller 4028 Northaven Rd 4 5 Jesus Herrera 722 Hillcrest Rd 5 6 Isaac Neal 308 W Campbell Ave 6 7 Annette Carr 2087 Photinia Ave 7 8 Rebecca Boyd 7584 Homestead Rd 8 9 Melvin Payne 8310 Fincher Rd 9 10 Irma Bowman 9065 Valwood Pkwy 10 11 Heather Wallace 95 Railroad St 11 12 Janice West 1545 W Pecan St 12 13 Dianne Chavez 326 Robinson Rd 13 14 Vivian Perez 996 Wycliff Ave 14 15 Dustin Wood 2677 Groveland Terrace 15 16 Calvin Jones 5859 Samaritan Dr 16 17 Tara Carroll 2215 Rolling Green Rd 17 18 Francis Gardner 1242 Sunset St 18 19 Bryan Henderson 5781 Ranchview Dr 19 20 Michelle Bell 9072 Westheimer Rd 20
Performing the Fuzzy Join
# Perform the fuzzy join customer_matches <- stringdist_left_join( customer_records_A, customer_records_B, by = "Name", max_dist = 2, distance_col = "dist" ) %>% filter(dist < 2) # Display the result print(customer_matches) Name.x Address.x ID.x Name.y Address.y ID.y dist 1 Terry Welsh 5028 Paddock Way 1 Terry Welch 5028 Paddock Way 1 1 2 Tyrone Carter 5697 Smokey Ln 3 Tyron Carter 5697 Smokey Ln 3 1 3 Noah Fuller 4028 Northaven Rd 4 Noah Fueller 4028 Northaven Rd 4 1 4 Melvin Paine 8310 Fincher Rd 9 Melvin Payne 8310 Fincher Rd 9 1 5 Vivan Perez 996 Wycliff Ave 14 Vivian Perez 996 Wycliff Ave 14 1
Explanation of the Code:
The resulting dataset customer_matches contains rows from customer_records_A matched with the closest approximate rows from customer_records_B based on the name column. The dist column shows the computed distance between the matched names.
In this final installment of our series, “Joins Are No Mystery Anymore: Hands-On Tutorial,” we’ve taken a deep dive into specialized join techniques that are essential for handling more complex data scenarios. We’ve covered:
Through practical examples and detailed code walkthroughs, we demonstrated how these advanced joins can solve real-world data problems. We’ve also provided homework tasks to reinforce your learning and give you hands-on experience with these techniques.
In the bonus section, “Anatomy of a Basic Joining Function,” we’ll explore the different arguments in joining functions, explaining what each one does and how they change the output. This deep dive will enhance your understanding and give you even greater control over your data analysis.
Stay tuned as we uncover the intricacies of joining functions and provide you with the tools to master data joins in R. Thank you for joining us on this journey, and happy coding!
In this section, we’ll delve into the anatomy of basic joining functions in the dplyr package. We'll explain the different arguments you can use, how they affect the output, and provide examples to illustrate their usage.
Specifies the columns to join by. If not provided, dplyr will join by columns with the same name in both tables.
Determines the suffixes added to duplicate column names from the left and right tables. By default, it is set to c(".x", ".y").
A logical argument that allows joining of data frames located in different databases. By default, it is set to FALSE.
An argument in full_join that keeps the join columns in the output.
Example
# Example of different arguments in a join function joined_data <- left_join( x = df1, y = df2, by = "id", suffix = c("_left", "_right"), copy = TRUE, keep = TRUE )
Joins Are No Mystery Anymore: Hands-On Tutorial — Part 3 was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.