I have a DynamoDB table with 1.5 million records (about 2 GB). How do I export it to S3?
The AWS Data Pipeline method worked for a small table, but I am running into issues exporting the 1.5 million record table to my S3 bucket.
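For reference, the export activity in my pipeline definition looks roughly like this (a trimmed sketch based on the console's "Export DynamoDB table to S3" template; the object IDs and the emr-ddb connector JAR version are illustrative and may differ):

```json
{
  "objects": [
    {
      "id": "TableBackupActivity",
      "type": "EmrActivity",
      "input": { "ref": "DDBSourceTable" },
      "output": { "ref": "S3BackupLocation" },
      "runsOn": { "ref": "EmrClusterForBackup" },
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
    }
  ]
}
```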
On my first attempt, the pipeline job ran for 1 hour and then failed with:

```
java.lang.OutOfMemoryError: GC overhead limit exceeded
```
I increased the NameNode heap size by supplying a hadoop-env configuration object to the instances in the EMR cluster, following this link.
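Concretely, this meant adding EmrConfiguration and Property objects to the pipeline definition and referencing them from the EmrCluster object. A minimal sketch (the IDs and the 4096 MB heap value are illustrative, not a recommendation):

```json
{
  "objects": [
    {
      "id": "EmrConfigurationHadoopEnv",
      "type": "EmrConfiguration",
      "classification": "hadoop-env",
      "configuration": { "ref": "EmrConfigurationExport" }
    },
    {
      "id": "EmrConfigurationExport",
      "type": "EmrConfiguration",
      "classification": "export",
      "property": { "ref": "NamenodeHeapsizeProperty" }
    },
    {
      "id": "NamenodeHeapsizeProperty",
      "type": "Property",
      "key": "HADOOP_NAMENODE_HEAPSIZE",
      "value": "4096"
    },
    {
      "id": "EmrClusterForBackup",
      "type": "EmrCluster",
      "configuration": { "ref": "EmrConfigurationHadoopEnv" },
      "masterInstanceType": "m3.2xlarge",
      "coreInstanceType": "m3.2xlarge"
    }
  ]
}
```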
After increasing the heap size, my next run failed after 1 hour with a different error (see the attached screenshot). I am not sure how to fix this completely.
Also, while checking the AWS CloudWatch graphs for the instances in the EMR cluster, I noticed the core node was continuously at 100% CPU usage.
The EMR cluster instance types (master and core node) were m3.2xlarge.
The issue was that the map tasks were not running efficiently: the core node was hitting 100% CPU usage. I upgraded the cluster instance types to one of the available compute-optimized C series and the export completed with no issues.
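In the pipeline definition, that only meant changing the instance type fields on the EmrCluster object; a sketch of the change (c4.2xlarge is just an example of a C-series type, use whichever is available in your region):

```json
{
  "id": "EmrClusterForBackup",
  "type": "EmrCluster",
  "masterInstanceType": "c4.2xlarge",
  "coreInstanceType": "c4.2xlarge",
  "coreInstanceCount": "1"
}
```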