apache-spark

Difference between spark.sql.files.maxPartitionBytes and spark.files.maxPartitionBytes


I see that Spark 2.0.0 introduced the property spark.sql.files.maxPartitionBytes, and the subsequent release (2.1.0) introduced spark.files.maxPartitionBytes.

The Spark configuration documentation says, for the former:

The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.

Whereas for the latter, it only says:

The maximum number of bytes to pack into a single partition when reading files.

Both of the above explanations have one thing in common: both properties apply when reading files. But the second sentence of the former's description restricts spark.sql.files.maxPartitionBytes to file-based sources such as Parquet, JSON and ORC.

Does that mean spark.files.maxPartitionBytes is used when reading files with the low-level APIs, i.e. RDDs?


Solution

  • Yes, that's correct.

    The spark.sql. prefix generally implies the DataFrame/Dataset (Spark SQL) scope, so spark.sql.files.maxPartitionBytes governs partition sizing for file-based DataFrame sources, while spark.files.maxPartitionBytes applies to RDD-level file reads.
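    A minimal sketch (Scala) of how the two settings might be configured, assuming a local SparkSession; the 128 MB values and the input paths are placeholders for illustration, not recommendations:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("maxPartitionBytes-example")
          .master("local[*]")
          // Governs partition sizing for DataFrame/Dataset file sources (Parquet, JSON, ORC, ...)
          .config("spark.sql.files.maxPartitionBytes", 134217728L)   // 128 MB
          // Governs partition sizing for RDD-level whole-file reads
          .config("spark.files.maxPartitionBytes", 134217728L)       // 128 MB
          .getOrCreate()

        // DataFrame read: partition count influenced by spark.sql.files.maxPartitionBytes
        val df = spark.read.parquet("/path/to/parquet")              // hypothetical path
        println(df.rdd.getNumPartitions)

        // RDD-level read (e.g. SparkContext.binaryFiles): influenced by spark.files.maxPartitionBytes
        val rdd = spark.sparkContext.binaryFiles("/path/to/files")   // hypothetical path
        println(rdd.getNumPartitions)

        spark.stop()

    Note that the actual number of partitions also depends on the size and number of the input files, so the settings are an upper bound on bytes per partition rather than an exact partition size.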