apache-spark amazon-s3 amazon-data-pipeline

How to export DynamoDB to S3 as a single file?


I have a DynamoDB table that needs to be exported to an S3 bucket every 24 hours using Data Pipeline. The export will in turn be queried by a Spark job.

The problem is that whenever I set up a Data Pipeline to do this, the output in S3 is a set of partitioned files rather than a single file.

Is there a way to ensure that the entire table is exported as a single file to S3? If not, is there a way in Spark to read the partitioned files via the manifest and combine them into one dataset to query?


Solution

  • You have two options here (the chosen function should be applied to the DataFrame just before writing it out):

    1. repartition(1)
    2. coalesce(1)

    But as the docs emphasize, the better choice in your case is repartition (see the sketch after the doc links below):

    However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

    Docs:

    repartition

    coalesce
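
    For concreteness, here is a minimal sketch of what the write side might look like. The bucket, prefixes, and the assumption that the export is readable as JSON are placeholders; substitute whatever your pipeline actually produces.

        import org.apache.spark.sql.{SaveMode, SparkSession}

        object SingleFileExport {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("single-file-export")
              .getOrCreate()

            // Hypothetical locations and format -- replace with your own bucket/prefix.
            val df = spark.read.json("s3a://my-bucket/dynamodb-export/")

            // repartition(1) shuffles everything into a single partition, so the write
            // emits exactly one part file. coalesce(1) would skip the shuffle but, as the
            // quoted docs warn, it funnels all upstream work through a single node.
            df.repartition(1)
              .write
              .mode(SaveMode.Overwrite)
              .json("s3a://my-bucket/dynamodb-export-single/")

            spark.stop()
          }
        }

    Even with a single partition, Spark still writes a directory containing one part-* file plus a _SUCCESS marker, so point the downstream job at the directory (or rename the part file afterwards if you need a literal single object). Also note that a single file is not strictly required for querying: given the export's directory prefix, Spark reads every part file under it into one DataFrame, so the partitioned output can still be queried as a single dataset.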