I am using hive-1.1.0. Submitting queries to HiveServer2 via Beeline which are read-only and contain no predicates will cause HiveServer2 to try to read the data from HDFS itself without spawning a MapReduce job:
SELECT * FROM my_table LIMIT 100;
For very large datasets this can cause HiveServer2 to hold onto a lot of memory leading to long garbage collection pauses. Adding a "fake" predicate will cause HiveServer2 to run the MapReduce job as desired; e.g.
SELECT * FROM my_table WHERE (my_id > 0 OR my_id <= 0) LIMIT 100;
By "fake", I mean a predicate that does not matter; the above example predicate will always be true. Is there a setting to force HiveServer2 to always run the MapReduce job without having to add bogus predicates?
I am not talking about when HiveServer2 determines it can run a MapReduce job locally; I have this disabled entirely:
> SET hive.exec.mode.local.auto;
+----------------------------------+--+
| set |
+----------------------------------+--+
| hive.exec.mode.local.auto=false |
+----------------------------------+--+
but queries without predicates are still read entirely by HiveServer2 causing issues.
Any guidance much appreciated. Thanks!
Some select queries can be converted to a single FETCH task, without map-reduce at all.
This behavior is controlled by hive.fetch.task.conversion
configuration parameter.
Possible values are: none
, minimal
and more
.
If you want to disable fetch task conversion, set it to none:
set hive.fetch.task.conversion=none;
minimal
will trigger FETCH-only task for
SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only.
more
will trigger FETCH-only task for
SELECT any kind of expressions including UDFs, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
Read also about hive.fetch.task.conversion.threshold
parameter and more details here: Hive Configuration Properties