amazon-web-services, amazon-s3, amazon-redshift, aws-glue, aws-data-pipeline

Scheduling data extraction from AWS Redshift to S3


I am trying to build a job that extracts data from Redshift and writes the same data to S3 buckets. So far I have explored AWS Glue, but Glue is not capable of running custom SQL on Redshift. I know we can run UNLOAD commands and store the results in S3 directly, but I am looking for a solution that can be parameterised and scheduled in AWS.


Solution

  • Consider using AWS Data Pipeline for this.

    AWS Data Pipeline is an AWS service that allows you to define and schedule regular jobs. These jobs are referred to as pipelines. A pipeline contains the business logic of the work required, for example extracting data from Redshift to S3. You can schedule a pipeline to run as often as you require, e.g. daily.

    You define the pipeline yourself, and you can even version control it. You can prepare a pipeline definition in the browser using the Data Pipeline Architect, or compose it as a JSON file locally on your computer. A pipeline definition is composed of components, such as a Redshift database, an S3 data node and a SQL activity, as well as parameters, for example to specify the S3 path to use for the extracted data; a minimal skeleton of such a definition is sketched below.
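
    To give an idea of the shape, here is a minimal sketch of such a JSON definition, containing only a Default object, a daily Schedule and one parameter. The ids (DailySchedule, myLogUri), the bucket path, the start time and the two default IAM role names are placeholder assumptions:

    ```json
    {
      "objects": [
        {
          "id": "Default",
          "name": "Default",
          "scheduleType": "cron",
          "schedule": { "ref": "DailySchedule" },
          "pipelineLogUri": "#{myLogUri}",
          "role": "DataPipelineDefaultRole",
          "resourceRole": "DataPipelineDefaultResourceRole"
        },
        {
          "id": "DailySchedule",
          "name": "DailySchedule",
          "type": "Schedule",
          "period": "1 day",
          "startDateTime": "2018-01-01T03:00:00"
        }
      ],
      "parameters": [
        {
          "id": "myLogUri",
          "type": "AWS::S3::ObjectKey",
          "description": "S3 prefix for pipeline logs"
        }
      ],
      "values": {
        "myLogUri": "s3://my-extract-bucket/datapipeline-logs/"
      }
    }
    ```

    The "objects" section holds the pipeline components, "parameters" declares the parameters, and "values" supplies their values for a particular pipeline.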

    The AWS Data Pipeline service handles scheduling, dependencies between the components in your pipeline, monitoring and error handling.

    For your specific use case, I would consider the following options:

    Option 1

    Define a pipeline with the following components: a SqlDataNode and an S3DataNode. The SqlDataNode references your Redshift database and the SELECT query used to extract your data. The S3DataNode points to the S3 path where your data should be stored. You add a CopyActivity to copy data from the SqlDataNode to the S3DataNode. When such a pipeline runs, it retrieves data from Redshift using the SqlDataNode and copies that data to the S3DataNode using the CopyActivity. The S3 path in the S3DataNode can be parameterised so that it is different every time you run the pipeline, as shown in the sketch below.
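
    A sketch of what the Option 1 components could look like in JSON is below. Together with the Default and Schedule objects from the skeleton above it forms a complete definition. The cluster id, database, table, query and bucket names are placeholder assumptions, and the output path uses the #{format(@scheduledStartTime, 'YYYY-MM-dd')} expression so that each run writes to a different S3 prefix:

    ```json
    {
      "objects": [
        {
          "id": "RedshiftDb",
          "name": "RedshiftDb",
          "type": "RedshiftDatabase",
          "clusterId": "my-redshift-cluster",
          "databaseName": "mydb",
          "username": "pipeline_user",
          "*password": "#{*myRedshiftPassword}"
        },
        {
          "id": "SourceData",
          "name": "SourceData",
          "type": "SqlDataNode",
          "database": { "ref": "RedshiftDb" },
          "table": "public.sales",
          "selectQuery": "SELECT * FROM public.sales WHERE sale_date = CURRENT_DATE - 1"
        },
        {
          "id": "OutputData",
          "name": "OutputData",
          "type": "S3DataNode",
          "directoryPath": "#{myOutputS3Path}/#{format(@scheduledStartTime, 'YYYY-MM-dd')}/"
        },
        {
          "id": "CopyRedshiftToS3",
          "name": "CopyRedshiftToS3",
          "type": "CopyActivity",
          "input": { "ref": "SourceData" },
          "output": { "ref": "OutputData" },
          "runsOn": { "ref": "ExtractionEc2Instance" }
        },
        {
          "id": "ExtractionEc2Instance",
          "name": "ExtractionEc2Instance",
          "type": "Ec2Resource",
          "instanceType": "t1.micro",
          "terminateAfter": "1 Hour"
        }
      ],
      "parameters": [
        { "id": "myOutputS3Path", "type": "AWS::S3::ObjectKey" },
        { "id": "*myRedshiftPassword", "type": "String" }
      ],
      "values": {
        "myOutputS3Path": "s3://my-extract-bucket/sales"
      }
    }
    ```

    The leading * on the password field and parameter marks them as secret values that Data Pipeline masks; you can supply the actual password as a parameter value when you activate the pipeline.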

    Option 2

    First, define the SQL query with the UNLOAD statement to be used to unload your data to S3. Optionally, you can save it in a file and upload that file to S3. Use a SqlActivity component to specify the SQL query to execute against the Redshift database. The query in the SqlActivity can be a reference to the S3 path where you stored your query, or the query text itself; a sketch follows below. Whenever the pipeline runs, it connects to Redshift and executes the query, which stores the data in S3.

    Constraint of option 2: the S3 path in the UNLOAD statement is static. If you plan to store every data extract under a separate S3 path, you will have to modify the UNLOAD statement to use a different S3 path every time you run it, which is not an out-of-the-box function.
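
    As a sketch of Option 2, reusing the RedshiftDb and ExtractionEc2Instance objects from the previous example, a SqlActivity can carry the UNLOAD statement inline in its script field, or reference a query file you uploaded to S3 through its scriptUri field instead. The table, bucket path and IAM role ARN below are placeholder assumptions; the role must be attached to the Redshift cluster and allowed to write to the bucket:

    ```json
    {
      "objects": [
        {
          "id": "UnloadSalesToS3",
          "name": "UnloadSalesToS3",
          "type": "SqlActivity",
          "database": { "ref": "RedshiftDb" },
          "runsOn": { "ref": "ExtractionEc2Instance" },
          "script": "UNLOAD ('SELECT * FROM public.sales WHERE sale_date = CURRENT_DATE - 1') TO 's3://my-extract-bucket/sales/latest/part_' IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole' DELIMITER ',' GZIP ALLOWOVERWRITE;"
        }
      ]
    }
    ```

    Note that the TO path inside the UNLOAD string is fixed, which is exactly the static-path constraint described above.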

    Where do these pipelines run?

    On an EC2 instance running Task Runner, a tool provided by AWS to run data pipelines. You can have that instance started automatically at the time the pipeline runs, or you can reference an already running instance with Task Runner installed on it; both variants are sketched below. You have to make sure that the EC2 instance is allowed to connect to your Redshift database.
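
    Below is a sketch of both variants, with placeholder subnet, security group and worker group names: an Ec2Resource that Data Pipeline launches on demand (place it in a subnet and security group that can reach the Redshift cluster), and an activity that specifies workerGroup instead of runsOn so that it is picked up by a Task Runner you have started yourself on an existing instance:

    ```json
    {
      "objects": [
        {
          "id": "ExtractionEc2Instance",
          "name": "ExtractionEc2Instance",
          "type": "Ec2Resource",
          "instanceType": "t1.micro",
          "terminateAfter": "1 Hour",
          "subnetId": "subnet-0123abcd",
          "securityGroupIds": ["sg-0123abcd"]
        },
        {
          "id": "CopyRedshiftToS3",
          "name": "CopyRedshiftToS3",
          "type": "CopyActivity",
          "input": { "ref": "SourceData" },
          "output": { "ref": "OutputData" },
          "workerGroup": "redshift-extract-workers"
        }
      ]
    }
    ```

    For the second variant you would start Task Runner on your instance with --workerGroup=redshift-extract-workers; in both cases the Redshift cluster's security group must allow inbound connections from the instance.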

    Relevant documentation:

    http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html

    http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftdatabase.html

    http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqldatanode.html

    http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqlactivity.html

    http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html