Tags: amazon-web-services, amazon-s3, aws-lambda, aws-sdk, amazon-kinesis-firehose

ETL Possible Between S3 and Redshift with Kinesis Firehose?


My team is attempting to use Redshift to consolidate information from several different databases. In our first attempt at this solution, we used Kinesis Firehose to write records of POSTs to our APIs to S3, then issued a COPY command to load that data into the correct tables in Redshift. However, this only allowed us to insert new data; it did not let us transform data, update rows when they were altered, or delete rows.

What is the best way to maintain an up-to-date data warehouse in Redshift without using batch transformation? Ideally, updates would occur "automatically" (< 5 min) whenever data is altered in our local databases.


Solution

    1. Neither Firehose nor Redshift has triggers; however, you could potentially use Lambda with Firehose to pre-process the data before it is inserted, as described here: https://blogs.aws.amazon.com/bigdata/post/Tx2MUQB5PRWU36K/Persist-Streaming-Data-to-Amazon-S3-using-Amazon-Kinesis-Firehose-and-AWS-Lambda

      In your case, you could extend that by triggering a Lambda function on the S3 bucket as Firehose creates new files; that function would then execute the COPY and any follow-up SQL updates (see the first sketch after this list).

    2. Another alternative is to write your own KCL consumer that implements what Firehose does and then executes the required updates after the COPY of each micro-batch (500-1,000 rows).

      I've done such an implementation (we needed to update old records based on new ones), and it works fine from a consistency point of view, though I'd advise against this architecture in general because Redshift performs poorly on updates. In my experience, the key rule is to treat Redshift data as append-only: it is often faster to filter out superseded rows at query time (with optional periodic pruning, e.g. daily) than to delete or update them in real time (see the second sketch after this list).

    3. Yet another alternative is to have Firehose dump data into staging table(s), and then have scheduled jobs take whatever is in the staging table, process it, move the data into the target tables, and rotate the tables (see the third sketch after this list).
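
    Below is a minimal sketch of what the S3-triggered Lambda from option 1 could look like, assuming the function is subscribed to ObjectCreated events on the Firehose delivery bucket. The staging table, environment variables, and IAM role are invented names for illustration, not anything from the blog post above.

```python
# Hypothetical Lambda handler for option 1: it fires on s3:ObjectCreated
# events from the Firehose delivery bucket and COPYs the new file into a
# Redshift table, after which any follow-up SQL could run. The table name,
# environment variables, and IAM role are placeholders; psycopg2 must be
# bundled with the Lambda deployment package.
import os
from urllib.parse import unquote_plus

import psycopg2


def lambda_handler(event, context):
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'])  # keys arrive URL-encoded

    conn = psycopg2.connect(
        host=os.environ['REDSHIFT_HOST'],
        port=5439,
        dbname=os.environ['REDSHIFT_DB'],
        user=os.environ['REDSHIFT_USER'],
        password=os.environ['REDSHIFT_PASSWORD'],
    )
    try:
        with conn.cursor() as cur:
            # Load just the file that triggered this invocation, then run
            # whatever UPDATE/DELETE statements your tables need.
            cur.execute(
                "COPY staging_events FROM %s IAM_ROLE %s FORMAT AS JSON 'auto';",
                ('s3://{}/{}'.format(bucket, key),
                 os.environ['REDSHIFT_COPY_ROLE']),
            )
        conn.commit()
    finally:
        conn.close()
```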
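
    For the append-only advice in option 2, one way to avoid in-place updates is to keep every version of a row and expose only the latest through a view. This is only an illustration of the idea; the events_raw table, its columns, and the soft-delete flag are invented.

```python
# Sketch of the append-only pattern from option 2: every change, including a
# soft delete, is appended as a new row to events_raw, and readers query a
# view that exposes only the newest version of each row. Superseded rows can
# then be pruned by an occasional scheduled DELETE instead of being updated
# in real time. All table, column, and connection names here are made up.
import psycopg2

CREATE_LATEST_VIEW = """
CREATE OR REPLACE VIEW events_latest AS
SELECT event_id, payload, updated_at
FROM (
    SELECT event_id, payload, updated_at, is_deleted,
           ROW_NUMBER() OVER (PARTITION BY event_id
                              ORDER BY updated_at DESC) AS rn
    FROM events_raw
) versioned
WHERE rn = 1              -- newest version of each row wins
  AND is_deleted = FALSE  -- deletes are just appended tombstone rows
"""


def create_latest_view(dsn):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(CREATE_LATEST_VIEW)
        conn.commit()
```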
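
    And a sketch of the scheduled merge job from option 3, again with invented table names and connection details:

```python
# Sketch of option 3: Firehose delivers into staging_events and a scheduled
# job (cron, a Lambda on a timer, etc.) merges the batch into the events
# table. The staging table is emptied with DELETE rather than TRUNCATE
# because TRUNCATE commits immediately in Redshift and would break the single
# transaction; rotating between two staging tables, as suggested above,
# avoids even that contention.
import psycopg2

MERGE_STATEMENTS = [
    # classic Redshift merge: drop rows that are about to be replaced ...
    """DELETE FROM events
       USING staging_events
       WHERE events.event_id = staging_events.event_id""",
    # ... then append the whole staged batch
    """INSERT INTO events
       SELECT event_id, payload, updated_at
       FROM staging_events""",
    # finally empty the staging table inside the same transaction
    """DELETE FROM staging_events""",
]


def run_scheduled_merge(dsn):
    """Run the whole merge as one transaction (psycopg2 opens it implicitly)."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for statement in MERGE_STATEMENTS:
                cur.execute(statement)
        conn.commit()
    finally:
        conn.close()
```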

    As a general reference architecture for real-time inserts into Redshift, take a look at this: https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift

    This approach has been implemented multiple times and works well.