
How to Benchmark Hive (Azure Interactive Query HDI 4.0)


Does anyone have a working, tested (as of 2020) TPC-DS or TPC-H benchmark for Azure Interactive Query HDI 4.0 clusters, which use Hadoop 3.x?

I was using https://github.com/hortonworks/hive-testbench, but I ran into an error while trying to generate data for both TPC-H and TPC-DS on Interactive Query HDI 4.0 (Hadoop 3.1.1). The step that fails is the one that runs the data-generator jar. What could this error be?

Output where it fails:

Generating data at scale factor 100.
Exception in thread "main" java.lang.IllegalAccessError: 
class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
...
ls: `/tmp/tpch-generate/100/lineitem': No such file or directory
Data generation failed, exiting.
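The IllegalAccessError on HftpFileSystem is typically a Hadoop version clash: HftpFileSystem was removed in Hadoop 3, so hitting it at runtime means the generator jar (or a stale hadoop-hdfs 2.x jar on the classpath) was built against Hadoop 2 while the cluster runs 3.1.1. A sketch of one common workaround, assuming the stock hive-testbench layout, is to pin the generator's Maven build to the cluster's Hadoop line and rebuild; the exact dependency block in your checkout of tpch-gen/pom.xml may differ, so treat this fragment as illustrative:

```xml
<!-- Hypothetical override in tpch-gen/pom.xml: compile the data generator
     against the cluster's Hadoop 3 client libraries instead of Hadoop 2,
     so no Hadoop-2-only classes (like HftpFileSystem) are referenced. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.1.1</version>
</dependency>
```

After changing the pom, rerun the testbench build script so the jar is recompiled; also check that no leftover Hadoop 2.x jars shadow the cluster's libraries on the job's classpath.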

My second question: for TPC-DS, whenever I run larger scale factors, the run fails during the optimization (data-load) step, usually on table 17 or 18. Any idea what this could be?

INFO  : Loading data to table tpcds_bin_partitioned_orc_100.store_sales partition (ss_sold_date_sk=null) from wasb://asdasd-2020-04-16t02-32-03-034z@asdasd.blob.core.windows.net/hive/warehouse/managed/tpcds_bin_partitioned_orc_100.db/store_sales/.hive-staging_hive_2020-04-16_06-47-19_242_1371829803314907581-47/-ext-10000
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Exception updating metastore for acid table tpcds_bin_partitioned_orc_100.store_sales with partitions [store_sales
...
INFO  : Completed executing command(queryId=hive_20200416064719_4aa11ffb-31c0-411f-a7ca-954c9741891d); Time taken: 1280.036 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Exception updating metastore for acid table tpcds_bin_partitioned_orc_100.store_sales with partitions

Solution

  • The MoveTask error is caused by a limit in the SQL database backing the Hive metastore. Azure SQL Database accepts at most 2,100 parameters per request, and at large scale factors the benchmark creates more partitions than that, so the metastore's direct-SQL update fails. Lowering Hive's direct-SQL batch sizes keeps each statement under the limit:

    hive.direct.sql.max.elements.values.clause=200
    hive.direct.sql.max.elements.in.clause=200
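These are metastore-side settings, so they belong in hive-site rather than in a per-query set command. A minimal sketch of the fragment, assuming you manage the cluster through Ambari's custom hive-site (restart the Hive services afterwards so the metastore picks them up):

```xml
<!-- hive-site.xml fragment: cap how many values the metastore's direct-SQL
     path packs into a single IN (...) or VALUES (...) clause, keeping each
     request under Azure SQL Database's 2,100-parameter limit. -->
<property>
  <name>hive.direct.sql.max.elements.values.clause</name>
  <value>200</value>
</property>
<property>
  <name>hive.direct.sql.max.elements.in.clause</name>
  <value>200</value>
</property>
```

Smaller batches mean more round trips to the metastore database, so the load step runs somewhat slower, but the partition updates no longer exceed the parameter cap.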