r dplyr

What is the difference between inner_join and semi_join?


I don't understand the difference between inner_join and semi_join. Could you provide some examples?

According to R


Solution

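  • semi_join() and inner_join() keep the same rows of x: those that have a match in y. The difference is that inner_join() also adds the columns that are present in y but not in x, whereas semi_join() adds no columns from y. In addition, inner_join() repeats a row of x once for each matching row of y, while semi_join() never duplicates rows of x.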

    library(dplyr)
    
    x = data.frame(a = 1:3)
    y = data.frame(a = 2:4, b = 10:12)
    
    ## with an inner join, the `b` column is part of the result
    inner_join(x, y)
    # Joining, by = "a"
    #   a  b
    # 1 2 10
    # 2 3 11
    
    ## with a semi join, the `b` column is not part of the result
    ## because it is not part of `x`
    semi_join(x, y)
    # Joining, by = "a"
    #   a
    # 1 2
    # 2 3
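
    One case the example above does not cover is a key that appears more than once in y. Here is a minimal sketch, using made-up data in which a = 2 occurs twice in y (the names `inner` and `semi` are only illustrative). inner_join() returns one row per matching pair, while semi_join() returns each row of x at most once.

    ```r
    library(dplyr)

    # Made-up data for illustration: the key a = 2 appears twice in y
    x <- data.frame(a = 1:3)
    y <- data.frame(a = c(2L, 2L, 3L), b = c(10, 20, 30))

    ## inner join: the x row with a = 2 is repeated, once per match in y
    inner <- inner_join(x, y, by = "a")
    inner
    #   a  b
    # 1 2 10
    # 2 2 20
    # 3 3 30

    ## semi join: each matching row of x appears once, with no columns from y
    semi <- semi_join(x, y, by = "a")
    semi
    #   a
    # 1 2
    # 2 3
    ```

    So the two joins agree on which rows of x have a match, but only semi_join() guarantees that the result has no more rows than x.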
    

    inner_join() is documented together with the other "mutating joins", which are described at ?inner_join as

    mutating joins add columns from y to x, matching rows based on the key

    Compare this with the "filtering joins", which are documented together at ?semi_join

    Filtering joins filter rows from x based on the presence or absence of matches in y

    Filtering joins only filter x; they do not add columns from y. The other filtering join is anti_join(), which does the opposite of semi_join(): it returns only the rows of x that have no match in y.
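
    To illustrate how the two filtering joins complement each other, here is a short sketch that reuses the x and y from the first example (the names `kept` and `dropped` are only illustrative). Between them, semi_join() and anti_join() split the rows of x into those with a match in y and those without.

    ```r
    library(dplyr)

    x <- data.frame(a = 1:3)
    y <- data.frame(a = 2:4, b = 10:12)

    ## rows of x that have a match in y
    kept <- semi_join(x, y, by = "a")
    kept
    #   a
    # 1 2
    # 2 3

    ## rows of x that have no match in y
    dropped <- anti_join(x, y, by = "a")
    dropped
    #   a
    # 1 1
    ```

    Neither result contains the `b` column, and together the two results account for every row of x.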