Tags: hadoop, hadoop-yarn, apache-zookeeper, monitor, ganglia

ganglia: the graph in ganglia remains unchanged after I stop the hadoop datanode


I use ganglia to monitor hadoop. I chose the metric "dfs.datanode.HeartbeatsAvgTime" to judge whether the datanode (I mean the datanode service, not the host) is down or not.

When the datanode is working fine, "dfs.datanode.HeartbeatsAvgTime" keeps changing. That is to say, the value in the graph varies.

It looks like this: [image: graph with a varying value]

But after I stop the datanode service, the value in the graph stops changing.

It looks like this: [image: graph whose value stays flat]

The value in the second graph remains unchanged, but it is not 0 or infinity. So I cannot judge whether the datanode service is up or down.

The same happens with other metrics.

I've checked the rrd files, which ganglia uses to store the metric data, with "rrdtool fetch". The value of the metric is stored in a *.rrd file. When I check the file, I find that after I stop the datanode, the value of the metric is still being updated, but it no longer varies.

I read the documentation on rrd's official website. It says that if rrd does not receive an update within the configured interval, it writes UNKNOWN to the *.rrd file.

I think there may be two possible causes of the problem.

  1. When gmetad does not receive a metric, it updates the rrd with the old value, so the graph stays at the old value.
  2. When gmond cannot collect a metric, it reports the old value to gmetad.

But I haven't found any evidence for either in the ganglia source code on github.

So do you know how to solve the problem of the value in the graph remaining unchanged? Or do you know other details about how to monitor a hadoop cluster with ganglia?

@DaveStephens @Lorin Hochstein


Solution

  • After struggling with the problem, I found that if we set dmax for the metric in hadoop-metrics2.properties, then when hadoop goes down, ganglia stops receiving data and returns UNKNOWN. The graph on the ganglia website disappears. When ganglia is used with nagios, nagios also returns the UNKNOWN status. That is enough to judge whether hadoop is up or down.

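    dmax is the maximum age of a metric value, in seconds: if no new value arrives within dmax seconds, ganglia discards the metric instead of continuing to report the last value.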
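
    As a rough sketch of the setting described above, the relevant lines in hadoop-metrics2.properties might look like the following. The key names follow the commented-out ganglia examples in the template file shipped with hadoop. The host name, the 10-second period, and the 60-second dmax are placeholder assumptions, and the metric name has to match the name that ganglia shows for the metric.

```properties
# Send metrics to ganglia 3.1 or later (GangliaSink30 is the class for ganglia 3.0)
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
# Emit metrics every 10 seconds (assumed value)
*.sink.ganglia.period=10
# Let the metric expire if no update arrives for 60 seconds (assumed value).
# Format: comma-separated metricName=seconds pairs.
*.sink.ganglia.dmax=dfs.datanode.HeartbeatsAvgTime=60
# gmond endpoint; gmond-host.example.com is a hypothetical host name
datanode.sink.ganglia.servers=gmond-host.example.com:8649
```

    Choosing a dmax that is several times the period should keep a single delayed report from expiring the metric. The datanode most likely has to be restarted before the new setting takes effect.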