hadoophivemapreducehiveqlbeeline

Force HiveServer2 to run MapReduce job


I am using hive-1.1.0. Submitting queries to HiveServer2 via Beeline which are read-only and contain no predicates will cause HiveServer2 to try to read the data from HDFS itself without spawning a MapReduce job:

SELECT * FROM my_table LIMIT 100;

For very large datasets this can cause HiveServer2 to hold onto a lot of memory leading to long garbage collection pauses. Adding a "fake" predicate will cause HiveServer2 to run the MapReduce job as desired; e.g.

SELECT * FROM my_table WHERE (my_id > 0 OR my_id <= 0) LIMIT 100;

By "fake", I mean a predicate that does not matter; the above example predicate will always be true. Is there a setting to force HiveServer2 to always run the MapReduce job without having to add bogus predicates?

I am not talking about when HiveServer2 determines it can run a MapReduce job locally; I have this disabled entirely:

> SET hive.exec.mode.local.auto;
+----------------------------------+--+
|               set                |
+----------------------------------+--+
| hive.exec.mode.local.auto=false  |
+----------------------------------+--+

but queries without predicates are still read entirely by HiveServer2 causing issues.

Any guidance much appreciated. Thanks!


Solution

  • Some select queries can be converted to a single FETCH task, without map-reduce at all.

    This behavior is controlled by hive.fetch.task.conversion configuration parameter. Possible values are: none, minimal and more.

    If you want to disable fetch task conversion, set it to none:

    set hive.fetch.task.conversion=none;
    

    minimal will trigger FETCH-only task for

    SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only.

    more will trigger FETCH-only task for

    SELECT any kind of expressions including UDFs, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)

    Read also about hive.fetch.task.conversion.threshold parameter and more details here: Hive Configuration Properties