amazon-s3amazon-rdsaws-gluedata-lakeaws-datasync

AWS Glue sync data from RDS (need to sync 4 table from all schema) to S3 (apache parque format)


We are using a Postgres RDS instance (db.t3.2xlarge with around 2TB data). We have a multi-tenancy application so for all organizations who sign up in our product, we are creating a separate schema which replicates our data model. Now a couple of our schemas (around 5 to 10 schemas) contain a couple of big tables (around 5 to 7 big tables where each contains 10 to 200 million rows). For UI we need to show some statics as well as graphs and to calculate that statics as well as graph data we need to perform joins on big tables and it slows down our whole database server. Sometimes we need to do this type of query in night time so that users don't face any performance issues. So ss a solution we are planning to create a data lake in S3 so that all analytical load we can shift out of RDBMS and to an OLAP solution.

As a first step we need to transfer our data from RDS to S3 and also keep syncing both data sources. Can you please suggest which tool is a better choice for us considering the below requirements:

  1. We need to update the last 3 days data on an hourly basis. We want to keep updating recent data because over the 3 day time window, it may change. After 3 days we can consider the data “at rest” and it can rest in the data lake without any future modification.
  2. We are using a multi tenancy system currently and we are having ~350 schemas, But it will be increasing as more organizations sign up in our product.
  3. We are planning to do ETL so in transform we are planning to join all tables and create one denormalized table and store the data in apache parque format in S3. So that we can perform analytical queries on that table using Redshift Spectrum, EMR, or some other tool.

Solution

  • I just found out about AWS Data Lake recently, and also based on my research (which will hopefully, assist you in the best solution possible)..

    AWS Athena can partition data, and you may want to partition your data based on tenant id (customer id).

    AWS Glue has crawlers:

    Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes.