[SOLVED] how to reduce the number of containers in the query

how to reduce the number of containers in the query

I have a query using to much containers and to much memory. (97% of the memory used). Is there a way to set the number of containers used in the query and limit the max memory? The query is running on Tez.

Thanks in advance

Solution

Controlling the number of Mappers:

The number of mappers depends on various factors such as how the data is distributed among nodes, input format, execution engine and configuration params. See also How initial task parallelism works

MR uses CombineInputFormat, while Tez uses grouped splits.

Tez:

set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split

Increase these figures to reduce the number of mappers running.

Also Mappers are running on data nodes where the data is located, that is why manually controlling the number of mappers is not an easy task, not always possible to combine input.

Controlling the number of Reducers:

The number of reducers determined according to

mapreduce.job.reduces

The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.

hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.

Also hive.exec.reducers.max - Maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.

Simply set hive.exec.reducers.max=<number> to limit the number of reducers running.

If you want to increase reducers parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.

Memory settings

set tez.am.resource.memory.mb=8192;
set tez.am.java.opts=-Xmx6144m;
set tez.reduce.memory.mb=6144;
set hive.tez.container.size=9216;
set hive.tez.java.opts=-Xmx6144m;

The default settings mean that the actual Tez task will use the mapper's memory setting:

hive.tez.container.size = mapreduce.map.memory.mb
hive.tez.java.opts = mapreduce.map.java.opts

Read this for more details: Demystify Apache Tez Memory Tuning - Step by Step

I would suggest to optimize query first. Use map-joins if possible, use vectorising execution, add distribute by partitin key if you are writing partitioned table to reduce memory consumption on reducers and write good sql of course.