sqlamazon-s3formatschemaamazon-athena

How to efficiently store and query data with key-value pair in AWS S3 and Athena?


I have a dataset stored in AWS S3 and query using AWS Athena. The data is currently structured in two columns in S3 as follows:

Dataset

Customer_ID | Products_csv
C1          | P1, P2, P3
C2          | P2, P4

With additional requirement, one more field needs to be added, which is quantity of product and this is how I planned on storing it :

Customer_ID | Product:Quantity(csv)
C1          | P1:3, P2:1, P3:5
C2          | P2:2, P4:7 

If the above format is followed for table in athena, it would not be simpler to perform the required operations on data. For example, if only products with quantity(>=3) has to be retrieved per customer (which is P1,P3 for customer1 and P4 for customer2), the query would be complex with above structure or might not scale for huge amount of data.

Data size

  1. Number of customer-ids(as rows) would be few millions
  2. Products can be in a range of 0-100 per customer-id

Any ideas on how to store the product & quantity column data in a alternative format that is optimum for querying would be greatly appreciated.


Solution

  • A more-traditional format would be separate rows for each relationship.

    For example:

    Customer_ID, Product, Quantity
    C1, P1, 3
    C1, P2, 1
    C1, P3, 5
    C2, P2, 2
    C2, P4, 7
    

    Millions of rows are no problem. The simpler format would be easier for Athena to process.

    You would get much better performance if the data is stored in Parquet format. It is possible to use Athena to convert the data into Parquet, but start by using CSV to prove that everything works.