My Foundry transform produces a different amount of data on each run, but I want each output file to contain a similar number of rows. I could call DataFrame.count() and then coalesce/repartition, but that requires computing the full dataset and then either caching it or recomputing it for the write. Is there a way to let Spark take care of this?
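For reference, the workaround I'd like to avoid looks roughly like this (just a sketch; target_rows is an arbitrary value and output_df/output are my transform's DataFrame and Output):

import math

# Count-then-repartition: requires a full pass over the data before the
# write, and output_df is either cached or computed twice.
target_rows = 1_000_000
num_partitions = max(1, math.ceil(output_df.count() / target_rows))
output.write_dataframe(output_df.repartition(num_partitions))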
You can use Spark's maxRecordsPerFile write option (the per-write counterpart of the spark.sql.files.maxRecordsPerFile configuration) by setting it per output of your @transform:
output.write_dataframe(
    output_df,
    options={"maxRecordsPerFile": "1000000"},
)
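For context, here is roughly how that fits into a full transform; the dataset paths and function name below are placeholders, assuming the standard transforms.api decorator:

from transforms.api import transform, Input, Output


@transform(
    output=Output("/examples/output_dataset"),  # hypothetical path
    source=Input("/examples/input_dataset"),    # hypothetical path
)
def compute(output, source):
    output_df = source.dataframe()
    # Spark starts a new file within each task once a file reaches this
    # many records, so file size is capped without ever calling count().
    output.write_dataframe(
        output_df,
        options={"maxRecordsPerFile": "1000000"},
    )

Note that maxRecordsPerFile only splits files that would exceed the limit; it never merges small partitions, so files can still come out well below the target if the upstream partitions are small.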