amazon-web-servicesamazon-dynamodbamazon-redshiftamazon-data-pipeline

AWS Data Pipeline Dynamo to Redshift


I have an issue: I need to migrate data from DynamoDB to Redshift. The problem is that I receive such exception:

ERROR: Unsupported Data Type: Current Version only supports Strings and Numbers Detail: ----------------------------------------------- error: Unsupported Data Type: Current Version only supports Strings and Numbers code: 9005 context: Table Name = user_session query: 446027 location: copy_dynamodb_scanner.cpp:199 process: query0_124_446027 [pid=25424] -----------------------------------------------

In my Dynamo item I have boolean field. How can I modify field from Boolean to INT(for example)? I tried to use as a VARCHAR(5), but didn't help(so it one ticket in Github without response)

Will be appreciate for any suggestions.


Solution

  • As a solution, I migrated data from DynamoDB to S3 first and then to Redshift.

    1. I used Exports to S3 build in feature in DynamoDB. It saves all data as *.json files into S3 realy fast(but not sorted).
    2. After that I used ETL script, using Glue Job and custom script with pyspark to process and save data into Redshift.

    Also can be done with Glue crawler to define schema, but still need to validate its result, as sometimes it was not correct.

    Using crawlers to parse DynamoDB directly is overkill of your tables if you are not using ONDEMAND read/write. So the better way is to do that with data from S3.