We are using a Postgres RDS instance (db.t3.2xlarge with around 2TB data). We have a multi-tenancy application so for all organizations who sign up in our product, we are creating a separate schema which replicates our data model. Now a couple of our schemas (around 5 to 10 schemas) contain a couple of big tables (around 5 to 7 big tables where each contains 10 to 200 million rows). For UI we need to show some statics as well as graphs and to calculate that statics as well as graph data we need to perform joins on big tables and it slows down our whole database server. Sometimes we need to do this type of query in night time so that users don't face any performance issues. So ss a solution we are planning to create a data lake in S3 so that all analytical load we can shift out of RDBMS and to an OLAP solution.
As a first step we need to transfer our data from RDS to S3 and also keep syncing both data sources. Can you please suggest which tool is a better choice for us considering the below requirements:
I just found out about AWS Data Lake recently, and also based on my research (which will hopefully, assist you in the best solution possible)..
AWS Athena can partition data, and you may want to partition your data based on tenant id (customer id).
AWS Glue has crawlers:
Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes.