scalaapache-sparkapache-spark-sqlapache-spark-2.0

What are the various join types in Spark?


I looked at the docs and it says the following join types are supported:

Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

I looked at the StackOverflow answer on SQL joins and top couple of answers do not mention some of the joins from above e.g. left_semi and left_anti. What do they mean in Spark?


Solution

  • Here is a simple illustrative experiment:

    import org.apache.spark.sql._
    
    object SparkSandbox extends App {
      implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._
      spark.sparkContext.setLogLevel("ERROR")
    
      val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
      val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")
    
      println("LEFT")
      left.orderBy("id").show()
    
      println("RIGHT")
      right.orderBy("id").show()
    
      val joinTypes = Seq("inner", "outer", "full", "full_outer", "left", "left_outer", "right", "right_outer", "left_semi", "left_anti")
    
      joinTypes foreach { joinType =>
        println(s"${joinType.toUpperCase()} JOIN")
        left.join(right = right, usingColumns = Seq("id"), joinType = joinType).orderBy("id").show()
      }
    }
    

    Output

    LEFT
    +---+-----+
    | id|value|
    +---+-----+
    |  1|   A1|
    |  2|   A2|
    |  3|   A3|
    |  4|   A4|
    +---+-----+
    
    RIGHT
    +---+-----+
    | id|value|
    +---+-----+
    |  3|   A3|
    |  4|   A4|
    |  4| A4_1|
    |  5|   A5|
    |  6|   A6|
    +---+-----+
    
    INNER JOIN
    +---+-----+-----+
    | id|value|value|
    +---+-----+-----+
    |  3|   A3|   A3|
    |  4|   A4| A4_1|
    |  4|   A4|   A4|
    +---+-----+-----+
    
    OUTER JOIN
    +---+-----+-----+
    | id|value|value|
    +---+-----+-----+
    |  1|   A1| null|
    |  2|   A2| null|
    |  3|   A3|   A3|
    |  4|   A4|   A4|
    |  4|   A4| A4_1|
    |  5| null|   A5|
    |  6| null|   A6|
    +---+-----+-----+
    
    FULL JOIN
    +---+-----+-----+
    | id|value|value|
    +---+-----+-----+
    |  1|   A1| null|
    |  2|   A2| null|
    |  3|   A3|   A3|
    |  4|   A4|   A4|
    |  4|   A4| A4_1|
    |  5| null|   A5|
    |  6| null|   A6|
    +---+-----+-----+
    
    FULL_OUTER JOIN
    +---+-----+-----+
    | id|value|value|
    +---+-----+-----+
    |  1|   A1| null|
    |  2|   A2| null|
    |  3|   A3|   A3|
    |  4|   A4|   A4|
    |  4|   A4| A4_1|
    |  5| null|   A5|
    |  6| null|   A6|
    +---+-----+-----+
    
    LEFT JOIN
    +---+-----+-----+
    | id|value|value|
    +---+-----+-----+
    |  1|   A1| null|
    |  2|   A2| null|
    |  3|   A3|   A3|
    |  4|   A4| A4_1|
    |  4|   A4|   A4|
    +---+-----+-----+
    
    LEFT_OUTER JOIN
    +---+-----+-----+
    | id|value|value|
    +---+-----+-----+
    |  1|   A1| null|
    |  2|   A2| null|
    |  3|   A3|   A3|
    |  4|   A4| A4_1|
    |  4|   A4|   A4|
    +---+-----+-----+
    
    RIGHT JOIN
    +---+-----+-----+
    | id|value|value|
    +---+-----+-----+
    |  3|   A3|   A3|
    |  4|   A4| A4_1|
    |  4|   A4|   A4|
    |  5| null|   A5|
    |  6| null|   A6|
    +---+-----+-----+
    
    RIGHT_OUTER JOIN
    +---+-----+-----+
    | id|value|value|
    +---+-----+-----+
    |  3|   A3|   A3|
    |  4|   A4|   A4|
    |  4|   A4| A4_1|
    |  5| null|   A5|
    |  6| null|   A6|
    +---+-----+-----+
    
    LEFT_SEMI JOIN
    +---+-----+
    | id|value|
    +---+-----+
    |  3|   A3|
    |  4|   A4|
    +---+-----+
    
    LEFT_ANTI JOIN
    +---+-----+
    | id|value|
    +---+-----+
    |  1|   A1|
    |  2|   A2|
    +---+-----+