IT博客汇 | Joins Are No Mystery Anymore: Hands-On Tutorial

Joins Are No Mystery Anymore: Hands-On Tutorial — Part 2

Numbers around us发表于 2024-06-06 11:37:00

[This article was first published on Numbers around us - Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Joins Are No Mystery Anymore: Hands-On Tutorial — Part 2

Welcome back to the second part of our series, “Joins Are No Mystery Anymore: Hands-On Tutorial.” In the first part, we explored the foundational types of joins, including Inner Join, Left Join, Right Join, Full Join, and Semi Join. Through practical, real-life scenarios and step-by-step code examples, we learned how to effectively combine datasets and uncover valuable insights.

In this second part, we’ll delve into more advanced join techniques. We’ll start with Anti Joins, which help identify unmatched rows between datasets. Following that, we’ll explore Cross Joins, Natural Joins, Self Joins, and Equi Joins. Each join type will be demonstrated with real-life scenarios to enhance your understanding and practical application. Get ready to take your data analysis skills to the next level!

Anti Join

An Anti Join returns all rows from the left table where there are no matching values in the right table. It is useful for identifying rows in the left table that do not have corresponding rows in the right table.

Explanation of the Scenario

In this scenario, we have subscription information and payment records. We want to find subscriptions that have not been paid for. This helps in identifying outstanding payments and managing accounts receivable.

Data file:
https://github.com/kgryczan/medium_publishing/blob/main/anti_join_data.RData

Description of the Datasets

We will use two datasets:

subscriptions: Contains information about subscriptions.
Columns: subscription_id, customer_id, start_date
payments: Contains information about payments made for subscriptions.
Columns: payment_id, subscription_id, amount, payment_date

Step-by-Step Code Examples

Loading the datasets:

# Load the necessary libraries
library(dplyr)

# Load the datasets
load("anti_join_data.RData")

# Display the datasets
print(subscriptions)

# A tibble: 30 × 3
   subscription_id customer_id start_date
             <int>       <int> <date>    
 1               1         108 2024-01-01
 2               2         107 2024-01-02
 3               3         110 2024-01-03
 4               4         113 2024-01-04
 5               5         110 2024-01-05
 6               6         111 2024-01-06
 7               7         108 2024-01-07
 8               8         107 2024-01-08
 9               9         107 2024-01-09
10              10         112 2024-01-10
# ℹ 20 more rows

print(payments)

   payment_id subscription_id amount payment_date
        <int>           <int>  <dbl> <date>      
 1        201               4  154.  2024-01-05  
 2        202              29  134.  2024-01-06  
 3        203              25  158.  2024-01-07  
 4        204              20  169.  2024-01-08  
 5        205              18  170.  2024-01-09  
 6        206              27   91.4 2024-01-10  
 7        207              23   52.4 2024-01-11  
 8        208              21  151.  2024-01-12  
 9        209              12   76.4 2024-01-13  
10        210              25  133.  2024-01-14  
# ℹ 20 more rows

Performing the Anti Join

# Perform the anti join
unpaid_subscriptions <- anti_join(subscriptions, payments, by = "subscription_id")

# Display the result
print(unpaid_subscriptions)

# A tibble: 13 × 3
   subscription_id customer_id start_date
             <int>       <int> <date>    
 1               1         108 2024-01-01
 2               2         107 2024-01-02
 3               5         110 2024-01-05
 4               6         111 2024-01-06
 5               8         107 2024-01-08
 6              10         112 2024-01-10
 7              15         113 2024-01-15
 8              16         115 2024-01-16
 9              19         115 2024-01-19
10              22         111 2024-01-22
11              24         108 2024-01-24
12              28         115 2024-01-28
13              30         102 2024-01-30

Explanation of the Code:

We first load the datasets using the load function.
We then use the anti_join function from the dplyr package to filter the subscriptions dataset to include only those subscriptions that do not have matching entries in the payments dataset, based on the subscription_id column.
Finally, we display the result to see which subscriptions have not been paid for.

Interpretation of Results

The resulting dataset unpaid_subscriptions contains only the rows from the subscriptions dataset where there is no matching row in the payments dataset. This means that only unpaid subscriptions are included.

Homework for Readers

In the same anti_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:

courses: Contains information about courses.
Columns: course_id, course_name, instructor
enrollments: Contains information about student enrollments.
Columns: enrollment_id, course_id, student_id, enrollment_date

Your task is to perform an anti join on these datasets to identify courses that have no enrollments. Use the course_id column for joining.

Cross Join

A Cross Join returns the Cartesian product of two tables, combining all rows from the left table with all rows from the right table. This join type is useful when you want to create all possible combinations of the rows in two tables.

Explanation of the Scenario

In this scenario, we have menu items and days of the week. We want to create a schedule of menu items for each day of the week. This helps in planning and organizing the weekly menu offerings.

Data file:
https://github.com/kgryczan/medium_publishing/blob/main/cross_join_data.RData

Description of the Datasets

We will use two datasets:

menu_items: Contains information about the menu items.
Columns: item_id, item_name, category
days_of_week: Contains information about the days of the week.
Columns: day_id, day_name

Step-by-Step Code Examples

Loading the datasets

# Load the necessary libraries
library(dplyr)

# Load the datasets
load("cross_join_data.RData")

# Display the datasets
print(menu_items)

# A tibble: 10 × 3
   item_id item_name category 
     <int> <chr>     <chr>    
 1       1 Pancakes  Breakfast
 2       2 Sandwich  Lunch    
 3       3 Salad     Lunch    
 4       4 Burger    Lunch    
 5       5 Soup      Dinner   
 6       6 Pizza     Dinner   
 7       7 Spaghetti Dinner   
 8       8 Tacos     Dinner   
 9       9 Sushi     Dinner   
10      10 Steak     Dinner   

print(days_of_week)

# A tibble: 7 × 2
  day_id day_name 
   <int> <chr>    
1      1 Monday   
2      2 Tuesday  
3      3 Wednesday
4      4 Thursday 
5      5 Friday   
6      6 Saturday 
7      7 Sunday

Performing the Cross Join

# Perform the cross join
menu_schedule <- tidyr::crossing(menu_items, days_of_week)

# Display the result
print(menu_schedule)

# A tibble: 70 × 5
   item_id item_name category  day_id day_name 
     <int> <chr>     <chr>      <int> <chr>    
 1       1 Pancakes  Breakfast      1 Monday   
 2       1 Pancakes  Breakfast      2 Tuesday  
 3       1 Pancakes  Breakfast      3 Wednesday
 4       1 Pancakes  Breakfast      4 Thursday 
 5       1 Pancakes  Breakfast      5 Friday   
 6       1 Pancakes  Breakfast      6 Saturday 
 7       1 Pancakes  Breakfast      7 Sunday   
 8       2 Sandwich  Lunch          1 Monday   
 9       2 Sandwich  Lunch          2 Tuesday  
10       2 Sandwich  Lunch          3 Wednesday
# ℹ 60 more rows

Explanation of the Code:

We first load the datasets using the load function.
We then use the crossing function from the tidyr package to perform the cross join between the menu_items and days_of_week datasets. This function creates all possible combinations of the rows in the two datasets.
Finally, we display the result to see the complete schedule of menu items for each day of the week.

Interpretation of Results

The resulting dataset menu_schedule contains all possible combinations of the rows from the menu_items and days_of_week datasets. Each row represents a menu item scheduled for a particular day of the week.

Homework for Readers

In the same cross_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:

shirts: Contains information about shirts.
Columns: shirt_id, color, size
pants: Contains information about pants.
Columns: pants_id, color, size

Your task is to perform a cross join on these datasets to generate all possible outfits by combining shirts and pants. Use the shirt_id and pants_id columns for joining.

Natural Join

A Natural Join joins two tables based on columns with the same name and type in both tables. It automatically matches rows with equal values in the common columns, removing the need to specify the joining column.

Explanation of the Scenario

In this scenario, we have authors and books. We want to find authors and their corresponding books based on a common column. This helps in linking authors with the books they have written.

Data file:
https://github.com/kgryczan/medium_publishing/blob/main/natural_join_data.RData

Description of the Datasets

We will use two datasets:

authors: Contains information about authors.
Columns: author_id, name, nationality
books: Contains information about books.
Columns: book_id, author_id, title, genre

Step-by-Step Code Examples

Loading the datasets:

# Load the necessary libraries
library(dplyr)

# Load the datasets
load("natural_join_data.RData")

# Display the datasets
print(authors)

# A tibble: 20 × 3
   author_id name     nationality
       <int> <chr>    <chr>      
 1         1 Author A Australia  
 2         2 Author B Australia  
 3         3 Author C Canada     
 4         4 Author D USA        
 5         5 Author E UK         
 6         6 Author F USA        
 7         7 Author G Australia  
 8         8 Author H Australia  
 9         9 Author I UK         
10        10 Author J Australia  
11        11 Author K Australia  
12        12 Author L USA        
13        13 Author M USA        
14        14 Author N Canada     
15        15 Author O Canada     
16        16 Author P Canada     
17        17 Author Q USA        
18        18 Author R Australia  
19        19 Author S USA        
20        20 Author T USA 

print(books)

# A tibble: 20 × 4
   book_id author_id title  genre  
     <int>     <int> <chr>  <chr>  
 1     101         1 Book A Fantasy
 2     102         2 Book B Fiction
 3     103         3 Book C Sci-Fi 
 4     104         4 Book D Fiction
 5     105         5 Book E Sci-Fi 
 6     106         6 Book F Fantasy
 7     107         7 Book G Fantasy
 8     108         8 Book H Mystery
 9     109         9 Book I Mystery
10     110        10 Book J Mystery
11     111        11 Book K Fantasy
12     112        12 Book L Fiction
13     113        13 Book M Fiction
14     114        14 Book N Mystery
15     115        15 Book O Sci-Fi 
16     116        16 Book P Sci-Fi 
17     117        17 Book Q Fantasy
18     118        18 Book R Sci-Fi 
19     119        19 Book S Mystery
20     120        20 Book T Mystery

Performing the Natural Join:

# Perform the natural join
authors_books <- authors %>%
  inner_join(books, by = "author_id")

# Display the result
print(authors_books)

# A tibble: 20 × 6
   author_id name     nationality book_id title  genre  
       <int> <chr>    <chr>         <int> <chr>  <chr>  
 1         1 Author A Australia       101 Book A Fantasy
 2         2 Author B Australia       102 Book B Fiction
 3         3 Author C Canada          103 Book C Sci-Fi 
 4         4 Author D USA             104 Book D Fiction
 5         5 Author E UK              105 Book E Sci-Fi 
 6         6 Author F USA             106 Book F Fantasy
 7         7 Author G Australia       107 Book G Fantasy
 8         8 Author H Australia       108 Book H Mystery
 9         9 Author I UK              109 Book I Mystery
10        10 Author J Australia       110 Book J Mystery
11        11 Author K Australia       111 Book K Fantasy
12        12 Author L USA             112 Book L Fiction
13        13 Author M USA             113 Book M Fiction
14        14 Author N Canada          114 Book N Mystery
15        15 Author O Canada          115 Book O Sci-Fi 
16        16 Author P Canada          116 Book P Sci-Fi 
17        17 Author Q USA             117 Book Q Fantasy
18        18 Author R Australia       118 Book R Sci-Fi 
19        19 Author S USA             119 Book S Mystery
20        20 Author T USA             120 Book T Mystery

Explanation of the Code:

We first load the datasets using the load function.
We then use the inner_join function from the dplyr package to perform the natural join between the authors and books datasets on the author_id column.
Finally, we display the result to see which authors have written which books.

Interpretation of Results

The resulting dataset authors_books contains all rows from both the authors and books datasets where there is a matching value in the author_id column. This means that only authors who have written books are included, along with the details of those books. (Yes, it is a kind of inner join.)

Homework for Readers

In the same natural_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:

staff: Contains information about staff members.
Columns: staff_id, name, role
assignments: Contains information about project assignments.
Columns: assignment_id, staff_id, project_name

Your task is to perform a natural join on these datasets to combine staff details with their project assignments. Use the staff_id column for joining.

Self Join

A Self Join is a join of a table to itself. It is used to compare rows within the same table. This can be particularly useful for hierarchical data, such as organizational structures, where you need to find relationships between rows within the same table.

Explanation of the Scenario

In this scenario, we have employee information, and each employee has a manager, who is also an employee. We want to find the relationship between employees and their managers using the same table. This helps in understanding the organizational hierarchy.

Data file:
https://github.com/kgryczan/medium_publishing/blob/main/self_join_data.RData

Description of the Datasets

We will use one dataset:

employees: Contains information about employees and their managers.
Columns: employee_id, name, manager_id

Step-by-Step Code Examples

Loading the dataset

# Load the necessary libraries
library(dplyr)

# Load the dataset
load("self_join_data.RData")

# Display the dataset
print(employees)

# A tibble: 20 × 3
   employee_id name    manager_id
         <int> <chr>        <int>
 1           1 Alice E         12
 2           2 Bob C            6
 3           3 Carol L          8
 4           4 Zoe K           15
 5           5 Alice G         10
 6           6 Bob D            7
 7           7 Carol S         19
 8           8 Zoe T            1
 9           9 Alice O         13
10          10 Bob O            2
11          11 Carol G         16
12          12 Zoe P            8
13          13 Alice H          4
14          14 Bob K           14
15          15 Carol B         18
16          16 Zoe E            4
17          17 Alice O          7
18          18 Bob T           14
19          19 Carol R          8
20          20 Zoe G           16

Performing the Self Join

# Perform the self join
employees_managers <- employees %>%
  inner_join(employees, by = c("manager_id" = "employee_id"), suffix = c("_employee", "_manager"))

# Display the result
print(employees_managers)

# A tibble: 20 × 5
   employee_id name_employee manager_id name_manager manager_id_manager
         <int> <chr>              <int> <chr>                     <int>
 1           1 Alice E               12 Zoe P                         8
 2           2 Bob C                  6 Bob D                         7
 3           3 Carol L                8 Zoe T                         1
 4           4 Zoe K                 15 Carol B                      18
 5           5 Alice G               10 Bob O                         2
 6           6 Bob D                  7 Carol S                      19
 7           7 Carol S               19 Carol R                       8
 8           8 Zoe T                  1 Alice E                      12
 9           9 Alice O               13 Alice H                       4
10          10 Bob O                  2 Bob C                         6
11          11 Carol G               16 Zoe E                         4
12          12 Zoe P                  8 Zoe T                         1
13          13 Alice H                4 Zoe K                        15
14          14 Bob K                 14 Bob K                        14
15          15 Carol B               18 Bob T                        14
16          16 Zoe E                  4 Zoe K                        15
17          17 Alice O                7 Carol S                      19
18          18 Bob T                 14 Bob K                        14
19          19 Carol R                8 Zoe T                         1
20          20 Zoe G                 16 Zoe E                         4

Explanation of the Code:

We first load the dataset using the load function.
We then use the inner_join function from the dplyr package to join the employees table to itself. The join condition matches the manager_id of one row with the employee_id of another row. The suffix argument is used to distinguish between the employee and manager columns.
Finally, we display the result to see the relationship between employees and their managers.

Interpretation of Results

The resulting dataset employees_managers contains pairs of employees and their managers. Each row shows an employee along with their corresponding manager, including details such as names and IDs.

Homework for Readers

In the same self_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:

friends: Contains information about friendships within a social network.
Columns: person_id, friend_id

Your task is to perform a self join on this dataset to analyze the friendships and find mutual friends. Use the person_id and friend_id columns for joining.

Equi Join

An Equi Join is a type of join that combines rows from two tables based on equality conditions between specified columns. It is one of the most common types of joins used in SQL and relational database management. Take into a consideration that all types we already talked about are equi joins. It is just wider definition, beacuse we are looking for equality of keys.

Explanation of the Scenario

In this scenario, we have orders and order details. We want to join these tables to get a comprehensive view of each order along with its details. This helps in understanding the full scope of each transaction.

Data file:
https://github.com/kgryczan/medium_publishing/blob/main/equi_join_data.RData

Description of the Datasets

We will use two datasets:

orders: Contains information about customer orders.
Columns: order_id, customer_id, order_date
order_details: Contains detailed information about each order.
Columns: order_detail_id, order_id, product_id, quantity

Step-by-Step Code Examples

Loading the datasets

# Load the necessary libraries
library(dplyr)

# Load the datasets
load("equi_join_data.RData")

# Display the datasets
print(orders)

# A tibble: 20 × 3
   order_id customer_id order_date
      <int>       <int> <date>    
 1        1         113 2024-01-01
 2        2         113 2024-01-02
 3        3         112 2024-01-03
 4        4         105 2024-01-04
 5        5         117 2024-01-05
 6        6         115 2024-01-06
 7        7         112 2024-01-07
 8        8         107 2024-01-08
 9        9         117 2024-01-09
10       10         106 2024-01-10
11       11         105 2024-01-11
12       12         109 2024-01-12
13       13         117 2024-01-13
14       14         113 2024-01-14
15       15         101 2024-01-15
16       16         107 2024-01-16
17       17         115 2024-01-17
18       18         103 2024-01-18
19       19         108 2024-01-19
20       20         115 2024-01-20

print(order_details)

# A tibble: 40 × 4
   order_detail_id order_id product_id quantity
             <int>    <int>      <int>    <int>
 1               1       17        206       10
 2               2       18        215        4
 3               3       13        218        8
 4               4       16        217        3
 5               5        2        217        5
 6               6        2        203        4
 7               7       20        219        9
 8               8       18        214        4
 9               9       12        218        4
10              10       14        214        4
# ℹ 30 more rows

Performing the Equi Join

# Perform the equi join
orders_with_details <- inner_join(orders, order_details, by = "order_id")

# Display the result
print(orders_with_details)

# A tibble: 40 × 6
   order_id customer_id order_date order_detail_id product_id quantity
      <int>       <int> <date>               <int>      <int>    <int>
 1        1         113 2024-01-01              39        210        5
 2        2         113 2024-01-02               5        217        5
 3        2         113 2024-01-02               6        203        4
 4        2         113 2024-01-02              22        219        1
 5        2         113 2024-01-02              25        220        8
 6        2         113 2024-01-02              36        213        3
 7        2         113 2024-01-02              38        202        9
 8        4         105 2024-01-04              12        215        4
 9        4         105 2024-01-04              24        201        5
10        4         105 2024-01-04              30        219        3
# ℹ 30 more rows

Explanation of the Code:

We first load the datasets using the load function.
We then use the inner_join function from the dplyr package to perform the equi join between the orders and order_details datasets on the order_id column.
Finally, we display the result to see the full details of each order.

Interpretation of Results

The resulting dataset orders_with_details contains all rows from the orders dataset with the matching rows from the order_details dataset based on the order_id column. This means that each order is enriched with its detailed information.

Homework for Readers

In the same equi_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:

athletes: Contains information about athletes.
Columns: athlete_id, name, sport
performance_records: Contains information about performance records of athletes.
Columns: record_id, athlete_id, event, score

Your task is to perform an equi join on these datasets to match athletes with their performance records. Use the athlete_id column for joining.

In this second part of our series, “Joins Are No Mystery Anymore: Hands-On Tutorial,” we delved into more advanced join techniques that are essential for comprehensive data analysis. We explored:

Anti Join: Identifying rows in one table that do not have corresponding rows in another, helping to spot outstanding payments.
Cross Join: Creating all possible combinations of rows from two tables, useful for planning and organizing.
Natural Join: Automatically joining tables based on columns with the same names and types, simplifying the linking process.
Self Join: Comparing rows within the same table to uncover relationships, such as employee-manager hierarchies.
Equi Join: Combining rows from two tables based on equality conditions, providing detailed insights into orders and their specifics.

Through practical scenarios and step-by-step code examples, we enhanced our understanding of these joins and their applications. Each join type was demonstrated with real-life datasets, allowing you to see how these techniques can be applied to solve everyday data problems.

Next week, we will conclude our series with even more specialized join techniques. We’ll cover Non-Equi Joins, Rolling Joins, Overlap Joins, and Fuzzy Joins. These advanced joins will help you handle more complex data scenarios, such as matching based on non-equality conditions, finding the nearest matches, and dealing with approximate or fuzzy data. Stay tuned as we unlock the full potential of joins in R and take your data analysis skills to the ultimate level.

Thank you for sticking with us through the second part of our “Joins Are No Mystery Anymore: Hands-On Tutorial” series! Your dedication to mastering data joins in R is commendable.

As a special treat for our patient readers, next week we’ll dive into “Anatomy of a Basic Joining Function.” This extra content will break down the different arguments in joining functions, explaining what each one does and how it changes the output. It’s a deep dive into the mechanics of joins that will enhance your understanding and give you even greater control over your data analysis.

Stay tuned and happy coding!

Joins Are No Mystery Anymore: Hands-On Tutorial — Part 2 was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Numbers around us - Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Continue reading: Joins Are No Mystery Anymore: Hands-On Tutorial — Part 2