Suppose I have this DataFrame:
| id | value |
|----|-------|
| A  | 1     |
| A  | 2     |
| A  | 3     |
| B  | 1     |
| B  | 2     |
| C  | 1     |
| D  | 1     |
| D  | 2     |
and so on. Basically, even with a records-per-file limit, I want to make sure that any given id appears in only a single file (assume the number of entries for that id is less than the limit).

Say I am writing the output as CSV with a records limit:

```scala
df.write.option("maxRecordsPerFile", 4).csv(path)
```

What happens is that id B may end up in two different CSV files, which I want to avoid. Is there a way to ensure this? Thanks.
You could ensure that all records with the same id end up in the same file by combining `repartition` with `partitionBy`. In that case, you will have one file per id, which respects your constraint.
```scala
df.repartition($"id").write.partitionBy("id").csv(path)
```
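For context, here is a minimal end-to-end sketch of this approach, assuming a local SparkSession, the sample data from the question, and a hypothetical output path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("one-file-per-id")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("A", 1), ("A", 2), ("A", 3),
  ("B", 1), ("B", 2),
  ("C", 1),
  ("D", 1), ("D", 2)
).toDF("id", "value")

// repartition($"id") hashes all rows with the same id into one partition;
// partitionBy("id") then writes one directory per id: path/id=A/, path/id=B/, ...
df.repartition($"id")
  .write
  .partitionBy("id")
  .csv("/tmp/by_id")  // hypothetical path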
If you want to reduce the number of files, you can use `repartition` without `partitionBy`. In that case, records with the same id will still necessarily end up in the same file, but multiple ids may share a file (hash collisions). Note that with this approach you cannot really control the maximum size of a file, only the average size. Say we have `n` records and want an average of `s` records per file; we could do the following:

```scala
df.repartition(n / s, $"id").write.csv(path)
```
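One caveat: if `n` and `s` are integers, `n / s` truncates, and can even be zero when `n < s`, which `repartition` rejects. A safer sketch (with assumed values for `n` and `s`) rounds up instead:

```scala
// n: total record count, s: desired average records per file (assumed values)
val n = df.count()
val s = 4L
val numPartitions = math.ceil(n.toDouble / s).toInt  // round up so it is never 0

// rows with the same id still hash to the same partition,
// but several ids may share one output file
df.repartition(numPartitions, $"id")
  .write
  .csv("/tmp/by_size")  // hypothetical path
```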