amazon-web-services, amazon-ec2, netty, apache-zookeeper, giraph

Giraph application gets stuck on superstep 4, all workers active but making no progress


I'm running a BFS search over the Spanish edition of Wikipedia. I converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a file that Giraph can read.

The BFS searches for paths, and everything works fine until it gets stuck at some point during superstep 4.

I'm using a cluster of 5 nodes (4 core slave nodes, 1 master) on AWS. Each node is an r3.8xlarge EC2 instance. The command for executing the BFS is this one:

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 4 -yh 120000 -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug

Each container has almost 120 GB. I set the out-of-core message limit to 1000M messages because I believed that was the problem, but apparently it is not.

These are the master logs (it seems to be waiting for the workers to finish, but they just don't, and it stays like this forever):

16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
...the last two lines repeat about thirty times...
...
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4

And on all workers there is no information about what is happening (I'm testing with giraph.logLevel=Debug because with Giraph's default log level I was lost); the workers print this over and over again:

16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82

Before superstep 4 started, each worker logged the following:

16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID
=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M

I don't know what exactly is failing.

As you may be suspecting, I have too many hypotheses; that's why I'm asking for help, so I can head in the right direction.

EDIT 1: Using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true didn't solve the problem.
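
For anyone retrying this, a minimal sketch of setting those two options programmatically (they can also be appended to the -ca list in the command above); the request limit value here is illustrative, not the one actually used in this run:

import org.apache.giraph.conf.GiraphConfiguration;

public class RequestLimitConfigSketch {
  // Sketch only: the option names are the ones from EDIT 1;
  // the numeric limit (1000) is just an example value.
  static void applyRequestLimits(GiraphConfiguration conf) {
    conf.setInt("giraph.maxNumberOfOpenRequests", 1000);
    conf.setBoolean("giraph.waitForRequestsConfirmation", true);
  }
}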

EDIT 2: I doubled the Netty threads and doubled the size of the Netty buffers; no change.
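
To make "doubled" concrete, here is a sketch assuming giraph.nettyClientThreads, giraph.nettyServerThreads, giraph.clientSendBufferSize and giraph.serverReceiveBufferSize are the relevant option names; both the names and the baseline values below are assumptions for illustration, not the exact settings from this run:

import org.apache.giraph.conf.GiraphConfiguration;

public class NettyTuningSketch {
  // Assumed option names; baseline values are illustrative only.
  static void doubleNettyResources(GiraphConfiguration conf) {
    conf.setInt("giraph.nettyClientThreads", 2 * 4);            // double the client threads
    conf.setInt("giraph.nettyServerThreads", 2 * 16);           // double the server threads
    conf.setInt("giraph.clientSendBufferSize", 2 * 524288);     // double the send buffer (bytes)
    conf.setInt("giraph.serverReceiveBufferSize", 2 * 524288);  // double the receive buffer (bytes)
  }
}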

EDIT 3: I condensed the messages, combining 1000 into 1, and got far fewer messages, but still the same final result.
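
To illustrate the idea of condensing messages: a minimal sketch of a Giraph MessageCombiner that merges all messages bound for the same vertex into one. The class name, the Text id/message types, and the '|' separator are assumptions for the example, not the actual code from the job. If I remember correctly, such a class can be registered through GiraphConfiguration.setMessageCombinerClass() or the giraph.messageCombinerClass option, but this is only a sketch of the concept:

import org.apache.giraph.combiner.MessageCombiner;
import org.apache.hadoop.io.Text;

// Sketch only: assumes Text vertex ids and Text messages; the real job may
// use different types and a different merging rule.
public class ConcatTextMessageCombiner implements MessageCombiner<Text, Text> {

  @Override
  public void combine(Text vertexIndex, Text originalMessage, Text messageToCombine) {
    // Fold the new message into the one accumulated so far.
    if (originalMessage.getLength() == 0) {
      originalMessage.set(messageToCombine);
    } else {
      originalMessage.set(originalMessage.toString() + "|" + messageToCombine.toString());
    }
  }

  @Override
  public Text createInitialMessage() {
    return new Text("");
  }
}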


Solution

  • I isolated the problem by running with 15 compute threads and 240 partitions (see the configuration sketch below).

    I could observe that one partition and one thread were taking too long.

    So I checked the code, looking for something that could take that long, and I found it: I was concatenating strings with + instead of StringBuilder (I believed the compiler would replace the "+" with StringBuilder.append(), as someone suggested here), but that wasn't the case. After the change the time decreased dramatically and the application finally finishes.
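
The configuration sketch mentioned above: a minimal example of requesting 15 compute threads and 240 partitions, assuming giraph.numComputeThreads and giraph.userPartitionCount are the right option names (the same key/value pairs can also be passed through -ca on the launch command):

import org.apache.giraph.conf.GiraphConfiguration;

public class IsolationConfigSketch {
  // Assumed option names; 15 threads / 240 partitions are the values used
  // while isolating the slow partition.
  static void configureForIsolation(GiraphConfiguration conf) {
    conf.setInt("giraph.numComputeThreads", 15);
    conf.setInt("giraph.userPartitionCount", 240);
  }
}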
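
And this is the kind of change that fixed it, as a simplified sketch rather than the actual computation code ("fragments" stands in for whatever collection of path strings was being concatenated). Note that javac only rewrites a single a + b + c expression into a StringBuilder; when the concatenation accumulates across loop iterations, each iteration copies the whole string built so far, so the loop stays quadratic:

import java.util.List;

public class PathConcatSketch {

  // Slow version: every '+' copies the accumulated string, so building the
  // result from n fragments costs O(n^2) characters copied in total.
  static String concatWithPlus(List<String> fragments) {
    String paths = "";
    for (String fragment : fragments) {
      paths = paths + "|" + fragment;
    }
    return paths;
  }

  // Fast version: StringBuilder appends in place, roughly O(n) overall.
  static String concatWithBuilder(List<String> fragments) {
    StringBuilder sb = new StringBuilder();
    for (String fragment : fragments) {
      sb.append('|').append(fragment);
    }
    return sb.toString();
  }
}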