hadoophivehdfshive-partitionshiveddl

Hive load multiple partitioned HDFS file to table


I have some twice-partitioned files in HDFS with the following structure:

/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=1.0/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=1.0/data.parquet

and would like to load these into a hive table as elegantly as possible. I know the typical solution for something like this is to load all the data into a non-partitioned table first, then transfer all the data to final table using dynamic partitioning as mentioned here

However, my files don't have the datekey and coeff values in the actual data, it's only in the filename since that's how it's partitioned. So how would I keep track of these values when I load them into the intermediate table?

One workaround would be to do a separate load data inpath query for each coeff value and datekey. This would not need the intermediate table, but would be cumbersome and probably not optimal.

Are there any better ways for how to do this?


Solution

  • Typical solution is to build external partitioned table on top of hdfs directory:

    create external table table_name (
    column1 datatype, 
    column2 datatype,
    ...
    columnN datatype 
    )
    partitioned by (datekey int,
                    coeff float)
    STORED AS PARQUET
    LOCATION '/user/hive/warehouse/datascience.db/simulations'
    

    After that, recover all partitions, this command will scan table location and create partitions in Hive metadata:

    MSCK REPAIR TABLE table_name;
    

    Now you can query table columns along with partiion columns and do whatever you want with it: use as is, or load into another table using insert .. select .. , etc:

    select 
        column1, 
        column2,
        ...
        columnN,
        --partition columns
        datekey,
        coeff
    from table_name
    where datekey = 20210506
    ;