apache-spark-sqlapache-hudi

Can I use incremental, time travel, and snapshot queries with hudi only using spark-sql?


I'm trying to do incremental, snapshot, and time travel queries using spark-sql with hudi, but the only way that I can find to do this is creating a DataFrame with spark.read and then creating a temp view. Is there any way to accomplish this with only using spark-sql? This is in an EMR cluster, version 6.6.0.


Solution

  • Yes you can.

    For time travel, you need spark 3.2+:

    select * from <table name> timestamp as of '20221113183401100' where id = 1;
    select * from <table name> timestamp as of '2022-11-13 18:34:01.100' where id = 1;
    select * from <table name> timestamp as of '2022-11-13' where id = 1;
    

    For incremental queries, you need to sync your MoR table to hive, this will create two view for your table, <table name>_rt for real-time view, <table name>_ro for read-optimized view.

    set hoodie.<table name>.consume.mode=INCREMENTAL;
    set hoodie.<table name>.consume.max.commits=3;
    set hoodie.<table name>.consume.start.timestamp=20221113183401;
    select * from <table name>_rt where `_hoodie_commit_time`>'20221113183401';
    

    And for snapshot queries, it is still under development (issue), but you can use the incremental queries syntax to achieve it, by using the timestamp you want to query as start timestamp and setting max commits to 1:

    set hoodie.<table name>.consume.mode=INCREMENTAL;
    set hoodie.<table name>.consume.max.commits=1;
    set hoodie.<table name>.consume.start.timestamp=20221113183401;
    select * from <table name>_rt where `_hoodie_commit_time`='20221113183401';