Tags: palantir-foundry, foundry-code-repositories, foundry-python-transform

How do I ensure consistent file sizes in datasets built in Foundry Python Transforms?


My Foundry transform produces a different amount of data on each run, but I want each output file to contain a similar number of rows. I can call DataFrame.count() and then coalesce/repartition, but that requires computing the full dataset and then either caching it or recomputing it a second time. Is there a way to have Spark take care of this?
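
For reference, this is roughly the count-then-repartition approach I am trying to avoid (the helper name and the rows-per-file constant below are just for illustration):

    # Sketch of the count-then-repartition approach described above.
    # write_with_fixed_rows and TARGET_ROWS_PER_FILE are illustrative names.
    TARGET_ROWS_PER_FILE = 1_000_000

    def write_with_fixed_rows(output, df):
        df = df.cache()          # keep the data around so it is not recomputed
        total_rows = df.count()  # full pass over the dataset just to get the count
        num_files = max(1, -(-total_rows // TARGET_ROWS_PER_FILE))  # ceiling division
        output.write_dataframe(df.repartition(num_files))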


Solution

  • You can use the spark.sql.files.maxRecordsPerFile configuration option by setting it per output of a @transform (a fuller sketch follows after the snippet):

    # Each output file will contain at most 1,000,000 records.
    output.write_dataframe(
        output_df,
        options={"maxRecordsPerFile": "1000000"},
    )
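
  • For context, here is a minimal end-to-end sketch of a transform using this option; the dataset paths and the compute/source/output names are placeholders, not from the original question:

    from transforms.api import transform, Input, Output


    @transform(
        output=Output("/path/to/output_dataset"),  # placeholder output path
        source=Input("/path/to/input_dataset"),    # placeholder input path
    )
    def compute(output, source):
        output_df = source.dataframe()
        # Spark splits any partition with more than 1,000,000 records across
        # multiple files, so no upfront count() or repartition is needed.
        output.write_dataframe(
            output_df,
            options={"maxRecordsPerFile": "1000000"},
        )

    Note that this caps the number of records per file but does not merge small partitions, so very small partitions can still produce small files; pairing it with a coarse repartition by a column can help if that matters for your dataset.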