I have a query using to much containers and to much memory. (97% of the memory used). Is there a way to set the number of containers used in the query and limit the max memory? The query is running on Tez.
Thanks in advance
Controlling the number of Mappers:
The number of mappers depends on various factors such as how the data is distributed among nodes, input format, execution engine and configuration params. See also How initial task parallelism works
MR uses CombineInputFormat, while Tez uses grouped splits.
Tez:
set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split
Increase these figures to reduce the number of mappers running.
Also Mappers are running on data nodes where the data is located, that is why manually controlling the number of mappers is not an easy task, not always possible to combine input.
Controlling the number of Reducers:
The number of reducers determined according to
mapreduce.job.reduces
mapred.job.tracker
is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.hive.exec.reducers.bytes.per.reducer
- The default in Hive 0.14.0 and earlier is 1 GB.
Also hive.exec.reducers.max
- Maximum number of reducers that will be used. If mapreduce.job.reduces
is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
Simply set hive.exec.reducers.max=<number>
to limit the number of reducers running.
If you want to increase reducers parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
set tez.am.resource.memory.mb=8192;
set tez.am.java.opts=-Xmx6144m;
set tez.reduce.memory.mb=6144;
set hive.tez.container.size=9216;
set hive.tez.java.opts=-Xmx6144m;
The defaultĀ settings mean that the actual Tez
task will use the mapper's memory setting:
hive.tez.container.size = mapreduce.map.memory.mb
hive.tez.java.opts = mapreduce.map.java.opts
Read this for more details: Demystify Apache Tez Memory Tuning - Step by Step
I would suggest to optimize query first. Use map-joins if possible, use vectorising execution, add distribute by partitin key
if you are writing partitioned table to reduce memory consumption on reducers and write good sql of course.