I use Ganglia to monitor Hadoop. I chose the metric "dfs.datanode.HeartbeatsAvgTime" to judge whether the datanode (I mean the datanode service, not the host) is down or not.
When the datanode is working fine, "dfs.datanode.HeartbeatsAvgTime" keeps changing; that is, the value in the graph varies.
But after I stop the datanode service, the value in the graph stops changing.
The value in the second graph remains unchanged, but it is not 0 or infinity, so I cannot tell from it whether the datanode service is up or down.
It is the same with other metrics.
I checked the RRD files that Ganglia uses to store the metric data with "rrdtool fetch". The values for the metric are stored in a *.rrd file. When I check that file, I find that the metric is still being updated after I stop the datanode, but the value does not vary.
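For reference, this is roughly how I inspected the file; the cluster name, host name, and rrd directory are placeholders for my setup (the directory is gmetad's usual default, yours may differ):

```
# dump the last hour of stored values for the metric's rrd file
rrdtool fetch /var/lib/ganglia/rrds/my_cluster/datanode1/dfs.datanode.HeartbeatsAvgTime.rrd \
    AVERAGE --start -1h --end now
```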
I read the references on the RRDtool official website. They say that if the RRD does not receive an update within the configured heartbeat interval, RRDtool writes UNKNOWN into the *.rrd file.
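To see which heartbeat actually applies to the metric's data source, something like this should work (same placeholder path as above):

```
# show the minimal_heartbeat configured for the data source in this rrd file
rrdtool info /var/lib/ganglia/rrds/my_cluster/datanode1/dfs.datanode.HeartbeatsAvgTime.rrd | grep heartbeat
```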
I think there may be two causes of this problem, but I haven't found any evidence for either in Ganglia's source code on GitHub.
So, do you know how to solve the problem of the value in the graph remaining unchanged? Or do you know any other details about how to monitor a Hadoop cluster with Ganglia?
@DaveStephens @Lorin Hochstein
After struggling with this problem, I found that if we set dmax for the metric in hadoop-metrics2.properties, then when Hadoop goes down Ganglia no longer receives any data and returns UNKNOWN. The graph on the Ganglia website will disappear. With Ganglia + Nagios, Nagios will also return an UNKNOWN status. That is enough to judge whether Hadoop is up or down.
dmax means that if no new value is received within dmax seconds, Ganglia treats the metric as expired and removes it.
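As a rough sketch, the relevant part of hadoop-metrics2.properties could look like the following. The dmax value of 70 seconds, the gmond address, and the exact metric key used in the dmax mapping are assumptions for illustration, not values taken from my cluster:

```
# send DataNode metrics to Ganglia via the metrics2 Ganglia sink
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
# per-metric dmax: expire this metric if no update arrives within 70 seconds
*.sink.ganglia.dmax=dfs.datanode.HeartbeatsAvgTime=70
datanode.sink.ganglia.servers=gmond-host:8649
```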