I have a DynamoDB table that needs to be exported to an S3 bucket every 24 hours using AWS Data Pipeline. The export will in turn be queried by a Spark job.
The problem is that whenever I set up a Data Pipeline to do this, the output in S3 is a set of multiple partitioned files.
Is there a way to ensure that the entire table is exported as a single file in S3? If not, is there a way in Spark to read the partitioned files using a manifest and combine them into one to query the data?
You have two options here (the function should be called on the DataFrame just before writing; a short sketch follows at the end of this answer):
repartition(1)
coalesce(1)
But as the docs emphasize, the better option in your case is repartition(1):
However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
Docs:
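Below is a minimal PySpark sketch of both options on the write side. The bucket paths, the json read/write format and the app name are my own placeholders, so adjust them to whatever your pipeline actually produces:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dynamodb-export-single-file").getOrCreate()

    # df stands in for whatever DataFrame you build from the exported data;
    # the path and format here are placeholders.
    df = spark.read.json("s3://your-bucket/dynamodb-export/")

    # Option 1: repartition(1) adds a shuffle step, but the upstream stages
    # still run in parallel before everything collapses into one partition.
    df.repartition(1).write.mode("overwrite").json("s3://your-bucket/dynamodb-export-single/")

    # Option 2: coalesce(1) avoids the shuffle, but the whole computation may
    # end up on a single executor, as the quoted docs warn.
    # df.coalesce(1).write.mode("overwrite").json("s3://your-bucket/dynamodb-export-single/")

Note that even with a single partition Spark still writes a directory containing one part-* file (plus a _SUCCESS marker), so if you need exactly one named S3 object you will have to rename or copy that file afterwards.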