I'm optimizing a Spark SQL ETL that frequently selects 0.1% of the data from a 100-billion-row, 100 TB Parquet-formatted table 'event_100B' on S3.
Table event_100B has a unique key column EventId (a 32-character hex UUID). The select query to be optimized joins a predicate_set providing 100 million EventId keys as the predicate; these keys map to randomly distributed rows in the event table, so no clustering pattern can be exploited.
select t1.* from event_100B t1
inner join predicate_set p1 on t1.EventId = p1.EventId
Because the predicate consists of a bag of high-cardinality keys spanning a large min-max range, neither file-level nor row-group-level pruning takes place. The ETL spends a significant % of its time downloading files to run a full table scan.
I'm looking for advice on which DBs or file formats can support effective data skipping for this type of bulk random-access query, and, if necessary, on the kind of AWS/Azure storage hardware required.
A tentative idea: assign all EventIds to 100 million buckets, BucketId = EventId % 100,000,000, such that each bucket contains ~1,000 EventId values on average. The event_100B table rows are then clustered/sorted by BucketId.
The BucketId column requires much less IO & network bandwidth to download and join with the predicate_set:
select t1.* from event_100B t1
inner join predicate_set p1 on t1.BucketId = (p1.EventId % 100,000,000)
My expectation: as the bucket count increases, the probability that any given bucket is hit by a predicate EventId decreases, so a higher % of buckets (and their associated rows) can be skipped.
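A minimal Spark SQL sketch of that idea, assuming a one-time rewrite of the table (the table name event_100B_bucketed, the 100,000,000 bucket count, and the pmod(xxhash64(...)) bucket expression are my illustrative choices, not an established design; since EventId is a hex string, some deterministic hash-to-integer step is needed, and both sides of the join must use the identical expression):

-- One-time rewrite: derive BucketId, then cluster and sort the rows by it so that
-- each Parquet row group covers a narrow BucketId range and its min/max stats
-- on BucketId become meaningful for skipping.
create table event_100B_bucketed
using parquet
as
select *
from (
    select t1.*,
           pmod(xxhash64(t1.EventId), 100000000) as BucketId
    from event_100B t1
) s
distribute by BucketId
sort by BucketId;

-- Query time: compute the same BucketId on the predicate side, join on it, and
-- re-check the exact EventId so the result is identical to the original join.
-- How much actually gets skipped depends on the bucket predicate reaching the
-- scan, e.g. as an IN list of bucket ids collected from predicate_set.
select t1.*
from event_100B_bucketed t1
inner join (
    select EventId,
           pmod(xxhash64(EventId), 100000000) as BucketId
    from predicate_set
) p1
  on t1.BucketId = p1.BucketId
 and t1.EventId  = p1.EventId;

As a rough sanity check on the expectation: with K predicate keys spread uniformly over B buckets, the expected fraction of buckets containing at least one key is about 1 - (1 - 1/B)^K ≈ 1 - e^(-K/B). With K = B = 100 million that is ~63%, so only ~37% of buckets (and their rows) can be skipped; with B = 10 × K it drops to ~10% of buckets hit. The bucket count has to be well above the predicate size before skipping really pays off.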
100 TB of Parquet files easily represents millions of files. As a result, the query engine will spend a lot of time fetching footers and, worse in your case, scanning the data of most files.
One approach to limit the number of Parquet files to read is to maintain a side index that tells you, for each Parquet file, which EventIds it contains. You can keep Parquet for this index too, since it is a matter of a few million rows in the worst case. Knowing which files contain your predicate EventIds should then be fairly fast, and you would hopefully end up reducing the set of Parquet files to be read.
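For what it's worth, a rough Spark SQL sketch of such a side index; the table name event_file_index is made up, and I flatten it to one (file_path, EventId) row per key rather than one list per file, since input_file_name() (the Spark built-in that reports which file a row was read from) makes that trivial to build:

-- Build the side index once (and incrementally for newly added files); sorting
-- by EventId keeps the index itself prunable when probed with the predicate keys.
create table event_file_index
using parquet
as
select input_file_name() as file_path,
       EventId
from event_100B
distribute by EventId
sort by EventId;

-- At query time, resolve the predicate keys to the set of files that actually
-- contain them, then restrict the main scan to those files.
select distinct idx.file_path
from event_file_index idx
inner join predicate_set p1 on idx.EventId = p1.EventId;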
You could get a simpler design by trying out other technologies such as StarRocks or Apache Doris. They are great candidates for your use case because they provide a native format compatible with object storage/HDFS, with statistics, indexing, a cost-based optimizer, and a highly vectorized C++ engine. BTW, I wouldn't consider Trino (on Parquet), ClickHouse (poor join support), or Elasticsearch (bad for retrieving lots of records).
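To make that concrete, a hedged Apache Doris-style DDL sketch (StarRocks is similar); the column list, key model and bucket count are illustrative only, and the exact syntax depends on the version you deploy:

-- Key model + hash distribution on EventId: data is kept sorted by the key with a
-- prefix index per tablet, so a lookup by EventId only touches a few tablets.
-- For 100 TB you would combine this with partitioning and a much larger bucket count.
create table event_100B (
    EventId  char(32),
    event_ts datetime,
    payload  string
)
unique key (EventId)
distributed by hash(EventId) buckets 1024;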
Lastly, you could also consider table formats based on Parquet. With Apache Hudi you could, for example, build a bloom filter index on EventId and store it in the metadata table. While designed to speed up upserts, the index can also be used for read queries through the RDD API. Apache Iceberg also has ongoing work to provide built-in indexing capabilities, but so far it's not mature enough AFAIK.
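If you explore the Hudi route, the setup looks roughly like the following sketch; the config keys are the Hudi 0.11+ metadata-index ones as I recall them and the table name is illustrative, so verify both against your Hudi version:

-- Declare EventId as the record key; with the metadata table and its bloom filter
-- index enabled, Hudi maintains per-file bloom filters on EventId.
create table event_100B_hudi
using hudi
tblproperties (
    primaryKey = 'EventId',
    type = 'cow',
    'hoodie.metadata.enable' = 'true',
    'hoodie.metadata.index.bloom.filter.enable' = 'true'
)
as
select * from event_100B;

-- Read side: data skipping against the metadata indexes is opt-in. Depending on
-- the Hudi version these may need to be passed as read options instead of SET.
set hoodie.metadata.enable = true;
set hoodie.enable.data.skipping = true;
select t1.*
from event_100B_hudi t1
inner join predicate_set p1 on t1.EventId = p1.EventId;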
Let us know your final conclusions!