[SOLVED] Container gets killed on Worker, when doing flush, after completing superstep, and the entire application hangs

I'm running a Giraph application on EMR.

I'm using a cluster of 1 master and 10 slaves, all m3.2xlarge machines.

The application consist, basically, on a BFS through the spanish version of Wikipedia (i adapted the Wikipedia information for fitting on Giraph).

I execute the applications in the following way:

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.algoritmos.masivos.BusquedaDeCaminosNavegacionalesWikiquotesMasivo /tmp/vertices.txt 4 -@- 1 ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 10 -yh 11500 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.isStaticGraph=true,giraph.numInputThreads=4,giraph.numOutputThreads=4

I can sucessfully run a application with 3 supersteps, but if i want to do 4 supersteps, the application fails, a container gets killed, and the rest die along too.

Searching in the Giraph Application Manager, says this:

16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-147.sa-east-1.compute.internal:9103
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1471231949464_0001_01_000005
16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-145.sa-east-1.compute.internal:9103
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000009
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000011
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000004
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000010
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000006
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000007
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000008
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000005
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000002
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000012
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000003
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=1
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000008, state=COMPLETE, exitStatus=143, diagnostics=Container [pid=4455,containerID=container_1471231949464_0001_01_000008] is running beyond physical memory limits. Current usage: 11.4 GB of 11.3 GB physical memory used; 12.6 GB of 56.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1471231949464_0001_01_000008 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 4459 4455 4455 4455 (java) 13568 5567 13419675648 2982187 java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 
        |- 4455 2706 4455 4455 (bash) 0 0 115875840 807 /bin/bash -c java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 1>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stdout.log 2>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stderr.log  

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: After completion of one conatiner. current status is: completedCount :1 containersToLaunch :11 successfulCount :0 failedCount :1
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=7
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000002, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000012, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000006, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000007, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)

So, appears to be a problem with memory on container 8, but here are the last logs lines of container 8 (remember, it's the container that gets killed):

16/08/15 03:46:52 INFO graph.ComputeCallable: call: Computation took 23.90834 secs for 10 partitions on superstep 3.  Flushing started
16/08/15 03:46:52 INFO worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 3 Memory (free/total/max) = 4516.47M / 10619.50M / 10619.50M
16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 1307 open requests, waiting for it to be <= 0, MBytes/sec received = 0.0029, MBytesReceived = 0.0678, ave received req MBytes = 0, secs waited = 23.332
MBytes/sec sent = 143.2912, MBytesSent = 3343.4141, ave sent req MBytes = 0.4999, secs waited = 23.332
16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: 548 requests for taskId=10, 504 requests for taskId=0, 251 requests for taskId=5, 1 requests for taskId=4, 1 requests for taskId=7, 1 requests for taskId=8,

So, if i understand this right, the container have 4516.47M available before doing flush, and when doing it, consumes all available of those 4516.47M, and when wants more, gets killed by Giraph AM?

I don't understand why needs so much memory doing flush, its basically save the results on disk for the next superstep right? so theoretically shouldn’t need memory at all.

Container gets killed on Worker, when doing flush, after completing superstep, and the entire application hangs - Giraph