rdplyrleft-joinright-join

Why do results differ for dplyr left_join() and right_join() using these two dataframes


I am learning how to use the R dplyr 'join' functions by doing the exercises from this course: https://github.com/uclouvain-cbio/WSBIM1207 and got stuck on the problem described below.

First, download the example dataframes used for this question:

BiocManager::install("UCLouvain-CBIO/rWSBIM1207")

Load the package:

library(rWSBIM1207)

Then in R/RStudio load the dataframe files, 'clinical2' and 'expression' by typing:

data(clinical2)
data(expression)

The task is, firstly: 'Join the expression and clinical2 tables by the patient reference, using the left_join and the right_join functions.' I did that in this way:

left_join(expression, clinical2, 
          by = c("patient" = "patientID"))
right_join(expression, clinical2,
                     by = c("patient" = "patientID"))

The second task is to explain why the results are different. I found that there are 3 more rows in the right_join output versus the left_join output. This seems odd to me given that 'clinical2' has 516 rows, whereas 'expression' has 570 rows. The 3 extra rows present in the r_join output have in common that they contain multiple NA values, which presumably represent patients found in 'clinical2' and not in 'expression'. I don't really understand what is going on here, and would be grateful for any help.


Solution

  • Update Maybe not clearly explained. Here an explanation with the original data:

    create left join

    left <- left_join(expression, clinical2, by = c("patient" = "patientID"))
    
    dim(expression)
    [1] 570   8
    dim(left)
    [1] 570  10
    

    create right join

    right <- right_join(expression, clinical2,
               by = c("patient" = "patientID"))
    
    dim(expression)
    [1] 570   8
    dim(right)
    [1] 573  10
    

    You want to know why dim(right) is 573!

    Explanation step by step:

    1. Definition of right_join(): includes all rows in y (y is here clinical2).
    2. By doing so: there are 3 rows in clinical2 (y) which are not in expression (x)

    See here: which patient occur in clinical2 but not in expression

    anti_join(clinical2, expression, by=c("patientID"="patient"))
    
    patientID    gender years_at_diagnosis
    <chr>        <chr>               <dbl>
    1 TCGA-55-7284 male                 74.2
    2 TCGA-55-7913 female               61.2
    3 TCGA-67-4679 male                 69.0
    

    Again:

    right_join(expression, clinical2, by = c("patient" = "patientID"))
    

    We start with expression(x) (dim = 570 8) and join with clinical2 (y) (dim = 516 3)

    So now what happens:

      anti_join(expression, clinical2, by=c("patient"="patientID"))
    
    We get:
      # ... with 8 variables: sampleID <fct>, patient <chr>, type <chr>, A1BG <dbl>, A1CF <dbl>,
      #   A2BP1 <dbl>, A2LD1 <dbl>, A2ML1 <dbl>````
    

    That means all rows are included in expression. So no additional rows will be added here:

    First answer: In joining two things are important:

    1. from which side you start to join, e.g. which table is in first position
    2. Given position of tables e.g. df1, df2 which join method you apply

    See this example:

    library(dplyr)
    library(tibble)
    
    # add ID
    iris1<- iris %>% 
      tibble::rowid_to_column("ID")
    
    # add ID
    mtcars1 <- mtcars %>% 
      tibble::rowid_to_column("ID")
    
    dim(iris1)
    # [1] 150   6
    
    dim(mtcars1)
    # [1] 32 12
    
    # 1. iris1 is first and we start from left e.g. iris1
    
    a <- left_join(iris1, mtcars1, by="ID")
    dim(a)
    # [1] 150  17
    
    # 2. iris1 is still first, but we join from right e.g. mtcars1
    b <- right_join(iris1, mtcars1, by="ID")
    dim(b)
    # [1] 32 17
    
    # 3. mtcars1 is first and we join from left e.g mtcars1
    a1 <- left_join(mtcars1, iris1, by="ID")
    dim(a1)
    # [1] 32 17
    
    -> b = a1 e.g. right_join(iris1, mtcars1, by="ID") = left_join(mtcars1, iris1, by="ID")
    

    https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join