apache-spark hive apache-spark-sql hortonworks-data-platform apache-tez

Tez vs. Spark - huge performance differences


I'm using HDP 2.6.4 and am seeing huge differences between Spark SQL and Hive on Tez. Here's a simple query on a table of ~95M rows:

SELECT DT, SUM(1) FROM mydata GROUP BY DT

DT is the partition column, a string that marks the date.

In the Spark shell, with 15 executors, 10G of memory for the driver and 15G per executor, the query runs for 10-15 seconds.

When running on Hive (from beeline), the query runs (actually, it is still running) for 500+ seconds (!!!). To make things worse, the Hive application takes significantly more resources than the Spark shell session I ran the job in.

UPDATE: It finished: 1 row selected (672.152 seconds)
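
To put the Hive side next to the Spark settings above, the engine and container sizing of the beeline session can be printed. This is only a sketch: in Hive, SET followed by a property name and no value prints the current setting. Which of these properties actually matter on a given HDP 2.6.4 cluster is an assumption, not something established in this post.

```sql
-- A SET without "=value" only prints the current value; it changes nothing.

-- Execution engine; expected to be tez for this comparison.
SET hive.execution.engine;

-- Memory per Tez container, in MB.
SET hive.tez.container.size;

-- Split grouping bounds, which influence how many map tasks Tez launches.
SET tez.grouping.min-size;
SET tez.grouping.max-size;
```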

More information about the environment:

More Updates:

While reading about vectorization on this link, I noticed that Vectorized execution: true does not appear anywhere in the EXPLAIN output. Another thing that caught my attention is the following: table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}

However, the table definition itself shows: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' and OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
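
One way to follow up on these two observations is to check the vectorization setting and to compare the storage format recorded for the table with what an individual partition reports. A minimal sketch, assuming a partition value of '2018-01-01' (hypothetical; substitute a real DT value). Note that in a Hive plan the table: entry under a File Output Operator typically describes the query's result sink rather than the scanned table, so a text format there would not by itself contradict the Parquet definition. Whether the quoted fragment came from that operator is an assumption.

```sql
-- Print whether vectorized execution is enabled for this session.
SET hive.vectorized.execution.enabled;

-- Enable it for the session, then look at the plan again.
SET hive.vectorized.execution.enabled=true;
EXPLAIN SELECT DT, SUM(1) FROM mydata GROUP BY DT;

-- Storage format (InputFormat, OutputFormat, SerDe) recorded at the table level.
DESCRIBE FORMATTED mydata;

-- Storage format recorded for a single partition; it can differ from the table's.
-- '2018-01-01' is a placeholder value, not taken from the original post.
DESCRIBE FORMATTED mydata PARTITION (DT='2018-01-01');
```

If the partition-level output lists a different InputFormat or SerDe than the table-level output, that mismatch would be worth investigating before tuning anything else.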

Comparisons between Spark and Tez usually show roughly similar performance, but I'm seeing dramatic differences.

What should be the first thing to check?

Thanks


Solution

  • In the end, we gave up and installed LLAP. I'm going to accept this as the answer, as I have sort of an OCD and this unanswered question has been poking me in the eye for long enough.