apache-spark, apache-spark-sql, apache-spark-dataset

Differences between Spark's Row and InternalRow types


Spark currently ships two row types:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

Why do we need both of them? Do they represent the same encoded entities, with one used internally (internal APIs) and the other exposed through the external APIs?


Solution

  • TLDR: Row (org.apache.spark.sql.Row) is the stable, public representation of a row that users work with in the DataFrame API, collect(), and UDFs. InternalRow (org.apache.spark.sql.catalyst.InternalRow) is Catalyst's optimized internal representation used by the query engine itself and is not part of the public API.

    In detail

    Yes, you are correct that both Row and InternalRow represent a row of data, but they are designed for different use cases and environments within Spark.

    Why Does Spark Have Both Row and InternalRow?

    1. Separation of Concerns: Row lives in the public package org.apache.spark.sql and is a stable contract for end users, so its behavior cannot change without breaking applications. InternalRow lives in org.apache.spark.sql.catalyst, an internal package that the optimizer and execution engine are free to evolve between releases without affecting user code.

    2. Different Use Cases: Row holds ordinary JVM objects (java.lang.String, java.sql.Timestamp, Scala collections), which is convenient for user code but relatively expensive. InternalRow holds Catalyst's internal types, for example UTF8String for strings, Long microseconds for timestamps, and Int days for dates, and exposes specialized getters so the engine can avoid boxing; Tungsten's UnsafeRow, a subclass of InternalRow, even stores the row in a compact binary format. The sketch below shows the difference.
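    To make the contrast concrete, here is a minimal sketch (assuming a local SparkSession; the DataFrame and column names are made up for illustration):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.unsafe.types.UTF8String

    val spark = SparkSession.builder().master("local[*]").appName("row-vs-internalrow").getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

    // Public API: Row hands back ordinary JVM objects.
    val first: Row   = df.collect().head
    val name: String = first.getAs[String]("name")   // java.lang.String
    val age: Int     = first.getAs[Int]("age")

    // Internal API: InternalRow holds Catalyst's internal types.
    // (org.apache.spark.sql.catalyst is not a stable, public package.)
    val internal: InternalRow    = InternalRow(UTF8String.fromString("Alice"), 30)
    val internalName: UTF8String = internal.getUTF8String(0)   // UTF8String, not String
    val internalAge: Int         = internal.getInt(1)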

    Do They Represent the Same Encoded Entities?

    Logically yes: both describe the same rows of your dataset. Physically no: Row carries external JVM objects, while InternalRow carries Catalyst's internal encoding. Encoders (ExpressionEncoder, and RowEncoder for Row itself) translate between the two representations at the boundary of the public API, for example when you call collect() or define a typed Dataset.
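    A quick way to see that the two types describe the same rows, just encoded differently, is to look at the same DataFrame through both APIs. This sketch continues the df from above; queryExecution.toRdd is a developer-facing, internal hook, so treat it as illustrative rather than a stable contract:

    // Same logical data, two physical representations.
    val externalRows = df.collect()                       // Array[Row]
    val internalRows = df.queryExecution.toRdd.collect()  // Array[InternalRow] (typically UnsafeRow)

    externalRows.head.getString(0)       // java.lang.String "Alice"
    internalRows.head.getUTF8String(0)   // UTF8String backed by the internal binary encoding

    In other words, the entities are the same; only the encoding and the intended audience differ.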