amazon-redshiftamazon-redshift-spectrumapache-hudi

Can partitioning data in Apache Hudi optimize AWS Spectrum query?


I'm using AWS Redshift Spectrum to query a Hudi table. As we know, filtering data by partition column when querying data in Spectrum could reduce the size of the data scanned by Spectrum and speed up the query.

My question is, if I use Spectrum to query a Hudi table like :select a, b from my_table where a = 3, does this query perform differently if I set hoodie.datasource.write.partitionpath.field = a or hoodie.datasource.write.partitionpath.field = b? Can Spectrum use Hudi's partition path to reduce the scanning data size?

---- Update: I've try with this and found Spectrum couldn't use partition path in Hudi table to faster the query, it still scan the whole table even I filter by the partitioned column. My DDL was like:

create external table spectrum.test_hudi_users_activity
    (
...
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
    INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    location 's3://xxx';

Wonder if my DDL has any problem that make Spectrum couldn't recognize my partition column


Solution

  • Absolutely, partitioning is useful for all the Hudi clients, when you partition your table with column a, and then you use the column a as a filter on your query, Spectrum will only scan the files in the partition a=3.

    But partitioning the table with column b, and using the column a as filter may slightly affect the performance of your query where Spectrum still need to scan all the files, and since the table is partitioned, it will contain more files, which means more IO stream opened by Spectrum to read the files. So try to use a column which you use frequently in your queries as a partition key.