database, database-design, amazon-rds, data-storage, data-lake

Database Management: Where to store old data outside the database


I'm wondering about the most efficient way to store, outside of the database, older data that I don't need to access.

Context: I'm managing billions of rows of data in a single relational database on Amazon RDS for PostgreSQL. More than 95% of the data is timestamp-and-value data points. The plan is rolling data storage: we keep, for example, 2 years of data easily accessible in the database, while whatever is older does not need to be accessible. Storage in the database is quite expensive and would make the database grow ever bigger (and more expensive) over the years for nothing.

My big-picture idea of the solution is that every month, some routine would back up data points older than 2 years and transfer them somewhere cheaper. BUT WHERE, that is the question.

I've seen and read about:

Thing is, I honestly don't know what fits my needs best. Everything I read focuses on things like giving third parties easy access to monitor the data and perform analytics on it for ML or AI, which is not useful for me (at least not for now). Some diagrams show architectures with 4+ different services to do XYZ. I don't want to go overkill; I want to keep this as simple as possible. I just need to store old data somewhere super cheap, and for now I don't really need to access it after it's backed up, in order to keep my database no bigger than it needs to be. Maybe in the future there will be a need to fetch it in bulk for ML or AI training, but there is no plan for that in the near future.

Any ideas of what would be optimal?


Solution

  • The simple answer is Amazon S3. It has multiple Storage Classes that offer discounts for long-term storage, depending on how quickly you need to access the data (for example, S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive for data you rarely, if ever, touch).

    There are many databases and services that can access and query data stored in Amazon S3 buckets. For example, Amazon Athena allows you to query data stored in S3 using SQL, much the same way you would query data in a database. You pay the standard S3 storage costs, plus a charge for Athena based on the amount of data the query scans. This is an excellent way to keep archived information almost as accessible as information in a database, just a little slower to query.
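
    As a concrete illustration, here is a minimal sketch of querying archived data with Athena. The bucket name (my-archive-bucket), table name (archive_data_points), and column names (recorded_at, sensor_id, value) are hypothetical placeholders, not anything from the question:

    ```sql
    -- Run in the Athena console. Table and column names are hypothetical;
    -- adjust them to match the layout of the exported files.
    CREATE EXTERNAL TABLE archive_data_points (
        recorded_at timestamp,
        sensor_id   bigint,
        value       double
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION 's3://my-archive-bucket/data-points/';

    -- Athena then queries the S3 objects directly with standard SQL.
    SELECT avg(value)
    FROM archive_data_points
    WHERE recorded_at >= timestamp '2019-01-01 00:00:00'
      AND recorded_at <  timestamp '2020-01-01 00:00:00';
    ```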

    This raises the question of what format to store it in.

    You can easily export data using the aws_s3 extension; see "Exporting data from an RDS for PostgreSQL DB instance to Amazon S3" in the Amazon Relational Database Service documentation. It has limited formatting options, so you will likely choose CSV as the format. This is perfectly acceptable, but other formats have some benefits:

      • Apache Parquet: a compressed, columnar format. Athena reads only the columns a query references, so queries scan less data, which makes them both cheaper and faster.
      • ORC: another compressed, columnar format with similar benefits for Athena.

    Both of these options would require you to process the data after exporting it from Postgres, to change the file format. The extra effort can be worthwhile for the cost savings and the faster queries.
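
    To make the monthly routine from the question concrete, here is a hedged sketch of the export-then-delete step using aws_s3.query_export_to_s3. The table data_points, column recorded_at, bucket my-archive-bucket, and region are illustrative assumptions, and the instance needs an IAM role that allows writing to the bucket:

    ```sql
    -- One-time setup: install the export extension (and its
    -- aws_commons dependency) on the RDS instance.
    CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

    -- Export all data points older than 2 years to a dated CSV object.
    -- Table, column, bucket, and region are hypothetical placeholders.
    SELECT *
    FROM aws_s3.query_export_to_s3(
        'SELECT * FROM data_points
         WHERE recorded_at < now() - interval ''2 years''',
        aws_commons.create_s3_uri(
            'my-archive-bucket',                                   -- bucket
            'data-points/' || to_char(now(), 'YYYY-MM') || '.csv', -- key
            'us-east-1'                                            -- region
        ),
        options := 'format csv'
    );

    -- After checking rows_uploaded in the result, remove the archived
    -- rows (batch the delete in practice to limit locking and bloat).
    DELETE FROM data_points
    WHERE recorded_at < now() - interval '2 years';
    ```

    Once the objects are in S3, a lifecycle rule can transition them to a cheaper storage class such as Glacier Deep Archive if you truly never expect to query them.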

    Here is an excellent article comparing storage options for Amazon Athena. The article is a bit old, but the concepts remain true: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
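
    Finally, to run the export on a schedule without adding any extra services, one option is the pg_cron extension, which Amazon RDS for PostgreSQL supports. A minimal sketch, assuming pg_cron is enabled in the DB parameter group and that archive_old_data_points() is a hypothetical function wrapping the export and delete statements above:

    ```sql
    -- Requires pg_cron in shared_preload_libraries (set via the
    -- DB parameter group) before the extension can be created.
    CREATE EXTENSION IF NOT EXISTS pg_cron;

    -- Run the archive routine at 03:00 UTC on the 1st of each month.
    -- archive_old_data_points() is a hypothetical wrapper function.
    SELECT cron.schedule(
        'monthly-archive',            -- job name
        '0 3 1 * *',                  -- cron schedule: monthly
        $$SELECT archive_old_data_points()$$
    );
    ```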