I see that Spark 2.0.0 introduced the property spark.sql.files.maxPartitionBytes, and its subsequent release (2.1.0) introduced spark.files.maxPartitionBytes.
The Spark configuration page says for the former:
The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
Whereas for the latter, it says:
The maximum number of bytes to pack into a single partition when reading files.
Both descriptions point to the same thing: each property applies when reading files.
But only the description of spark.sql.files.maxPartitionBytes goes on to mention file-based sources such as Parquet and JSON.
Does that mean spark.files.maxPartitionBytes is used when reading files through lower-level APIs such as RDDs?
Yes, you got it.
The spark.sql prefix generally implies the DataFrame/Dataset scope, so spark.sql.files.maxPartitionBytes governs file-based DataFrame/Dataset reads. spark.files.maxPartitionBytes is for RDDs.
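For illustration, here is a minimal Scala sketch that sets both properties on one session and reads the same data through both APIs. The 128 MB values, the file paths, and the use of sc.binaryFiles as the RDD-side example are assumptions for demonstration, not something stated in the question.

```scala
import org.apache.spark.sql.SparkSession

object MaxPartitionBytesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("maxPartitionBytes-demo")
      .master("local[*]")
      // Partition sizing for DataFrame/Dataset file sources (Parquet, JSON, ORC, ...).
      .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
      // Partition sizing for RDD-style file reads.
      .config("spark.files.maxPartitionBytes", 128L * 1024 * 1024)
      .getOrCreate()

    // DataFrame read: partition count is influenced by spark.sql.files.maxPartitionBytes.
    val df = spark.read.parquet("/path/to/parquet")   // hypothetical path
    println(s"DataFrame partitions: ${df.rdd.getNumPartitions}")

    // RDD-style read: APIs such as binaryFiles consult spark.files.maxPartitionBytes
    // (example chosen for illustration).
    val rdd = spark.sparkContext.binaryFiles("/path/to/binary/files")   // hypothetical path
    println(s"RDD partitions: ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```

In practice you would tune each setting independently, since one only affects the Spark SQL (DataFrame/Dataset) read path and the other only the RDD read path.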