My source is JSON with nested arrays & structures. (examples at end of post)
Large volume of new data streaming in real time (20m/day)
I have to decide how to store this data, considering:
-- End users want to use 'traditional' SQL
-- Performance (ingestion & query)
-- Load on Cluster
As far as I can see, my options are to make use of the SUPER data type, or just convert everything to traditional relational tables & types.
(Even if I store the full JSON as a SUPER, I still have to serialize critical attributes into regular columns for the purposes of distribution/sort keys.)
Regardless, I've been trying to weigh up the pros & cons of SUPER vs. 'traditional'.
(1) Store full JSON as SUPER type
-- Very easy to ingest data with low load on cluster
-- Possibly additional load on the cluster & a performance impact when executing end-user queries?
-- End users would have to learn PartiQL and deal with unnesting & serialization etc.
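For illustration, here's roughly what I picture for (1); the table name, S3 location, IAM role and attribute names are all made up, so treat it as a sketch only:

    -- Whole JSON document lands in a single SUPER column
    CREATE TABLE events_raw (
        payload SUPER
    );

    -- 'noshred' keeps each document intact instead of shredding it into columns
    COPY events_raw
    FROM 's3://my-bucket/events/'                              -- placeholder path
    IAM_ROLE 'arn:aws:iam::111111111111:role/redshift-load'    -- placeholder role
    FORMAT JSON 'noshred';

    -- The kind of PartiQL unnesting end users would then have to write
    -- (event_id / channel_id / name are assumed attribute names)
    SELECT e.event_id, c.channel_id, t.name
    FROM events_raw r, r.payload.events e, e.channels c, c.tags t
    WHERE c.channel_id = 42;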
(2) 'Traditional' Relational Tables & Types
(a) Load as super, but then use PartiQL to unnest, serialize and store in relational tables
-- Additional continuous load on cluster
-- Easy to implement (insert into)
-- Would result in some massive tables for the 'tag' nodes
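As a rough sketch of what the continuous (2a) step might look like (again, the target table and attribute names are made up):

    -- Flatten the staged SUPER payloads into a relational 'tag' table
    INSERT INTO channel_tags (event_id, channel_id, tag_name, tag_value)
    SELECT e.event_id::VARCHAR,
           c.channel_id::INT,
           t.name::VARCHAR,
           t.value::VARCHAR
    FROM events_raw r, r.payload.events e, e.channels c, c.tags t;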
(b) Use a Lambda to pre-unnest & serialize the JSON, and insert/copy directly into relational tables
-- Lambda would be invoked continuously
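If the Lambda wrote its flattened output to S3 as, say, gzipped CSV, the load side would reduce to a plain COPY per target table (paths and role are placeholders):

    -- Load the pre-flattened Lambda output straight into the relational table
    COPY channel_tags (event_id, channel_id, tag_name, tag_value)
    FROM 's3://my-bucket/flattened/channel_tags/'
    IAM_ROLE 'arn:aws:iam::111111111111:role/redshift-load'
    FORMAT AS CSV
    GZIP;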
(3) Redshift Spectrum
If I am converting to a relational structure (2b), I could simply store the data in S3 and utilize Redshift Spectrum to query it
-- No load on cluster to ingest/process incoming data
-- Cost/maintenance of lambda/other process to transform JSON
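For (3) I'm imagining something like the following, assuming the Lambda writes Parquet files to S3 and a Glue database 'events_db' already exists (all names are placeholders):

    -- Expose the S3 data through an external schema backed by the Glue catalog
    CREATE EXTERNAL SCHEMA ext_events
    FROM DATA CATALOG
    DATABASE 'events_db'
    IAM_ROLE 'arn:aws:iam::111111111111:role/redshift-spectrum';

    -- External table over the flattened Parquet files produced by the Lambda
    CREATE EXTERNAL TABLE ext_events.channel_tags (
        event_id   VARCHAR(64),
        channel_id INT,
        tag_name   VARCHAR(128),
        tag_value  VARCHAR(256)
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/flattened/channel_tags/';

    -- End users would then query it like any other table
    SELECT tag_name, COUNT(*)
    FROM ext_events.channel_tags
    GROUP BY tag_name;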
Questions:
Are the above understandings correct?
Any other considerations not listed?
Is there a 'standard' for this scenario?
Any other guidance/wisdom welcome!
Background info
Schema:
events: array[
  struct{
    channels: array[
      struct{
        tags: array[struct{}]
      }
      tags: array[struct{}]
    ]
  }
]
You will not want to leave things as one monolithic JSON. Any data that will be repeatedly queried in analytics should be its own column. The database work to expand the JSON at ingestion will be dwarfed by the work of repeatedly expanding it for every query.
Any data that will commonly be used in a WHERE clause, GROUP BY, PARTITION BY, join condition, etc. will likely need to be its own column. I'd expect any attribute that is present in ~90% of the JSON elements to warrant its own column. JSON elements that are rare, unique, or of little analytic interest can be kept in SUPER columns that hold just those leftover parts of the JSON.
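To make that concrete, one hypothetical shape for the main fact table, with the commonly used attributes as real columns and only the rare leftovers of each element kept in a small SUPER column (all names invented):

    CREATE TABLE channel_tag_facts (
        event_key   BIGINT,          -- key back to the raw document (see below)
        event_ts    TIMESTAMP,
        channel_id  INT,
        tag_name    VARCHAR(128),
        tag_value   VARCHAR(256),
        tag_extra   SUPER            -- only the rare / left-over attributes of the tag element
    )
    DISTKEY (channel_id)
    SORTKEY (event_ts);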
The data size increase will be less than you think; Redshift is good at compressing columns. The ingestion load is unlikely to be a major concern, but if it is, the Lambda approach is a reasonable way to add compute resources to address it. I really don't expect this to be needed, but if it is, it can be folded easily into the existing ETL processes.
A hazard you will face is that users will only reference the JSON and not the extracted columns, and re-extracting the same data repeatedly costs. I'd consider NOT keeping the entire JSON in the main fact tables, only the JSON pieces that represent data not otherwise in columns. Keeping the original JSONs in a separate table keyed with an identity column will allow joining if some need arises, but the goal will be to not need to do this.
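i.e. something along these lines (names invented), so the bulky payload stays out of the fact tables but remains joinable by key if it ever has to be:

    -- Original documents, keyed so the fact rows can point back to them
    CREATE TABLE events_json (
        event_key BIGINT IDENTITY(1,1),
        payload   SUPER
    );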
Spectrum does not look like a good fit for this use case. Spectrum does well when the compute layer in S3 can apply the first-level WHERE clauses and simple aggregations. I just don't see Spectrum being able to do that on this data, so it will just send the entire JSON payload to Redshift repeatedly, which will make things slow and tie up a ton of network bandwidth. However, storing the full original JSON (with the identity key) in S3, and keeping all the expanded columns plus the left-over JSON elements in native Redshift tables, does make sense. That way, if some need to reference the full JSON arises, it is just an external table reference away.
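The S3 flavour of that archive is just an external table over the raw documents, for example (reusing the hypothetical ext_events schema sketched in the question, and keeping the payload as plain JSON text):

    -- Raw JSON archived in S3, referenced only for the rare full-payload lookup
    CREATE EXTERNAL TABLE ext_events.raw_json (
        event_key BIGINT,
        payload   VARCHAR(65535)     -- raw JSON text; JSON_PARSE() it on the rare occasion it's needed
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/raw-json/';

    SELECT JSON_PARSE(payload)
    FROM ext_events.raw_json
    WHERE event_key = 123456;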