pyspark monitoring grafana graphite statsd

Can Graphite or Grafana be used to monitor PySpark metrics?


In a PySpark project we call foreachPartition(func) on a PySpark DataFrame, and inside func we make aiohttp calls to transfer data. What monitoring tools can be used to track metrics such as data rate, throughput, and elapsed time? Can we use StatsD with Graphite or Grafana in this case (they're preferred if possible)? Thanks.


Solution

  • Here is my solution. I used PySpark accumulators to collect the metrics (number of HTTP calls, payload sent per call, etc.) in each partition. On the driver node, I assigned the accumulators' values to StatsD gauges and sent them to a Graphite server, then visualized them in a Grafana dashboard. It has worked well so far.
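
A rough sketch of that approach might look like the following. It is not the exact code from the project: the names run_job, send_partition, format_gauge, send_gauges, and the pyspark_job metric prefix are hypothetical, the StatsD address (127.0.0.1:8125) is an assumed default, and the aiohttp call is left as a placeholder comment. The gauges are written directly in the plain-text StatsD line format (name:value|g) over UDP, so no StatsD client library is needed.

```python
import socket
import time


def format_gauge(name, value, prefix="pyspark_job"):
    """Build one StatsD gauge line: '<prefix>.<name>:<value>|g'."""
    return f"{prefix}.{name}:{value}|g"


def send_gauges(metrics, host="127.0.0.1", port=8125, prefix="pyspark_job"):
    """Send each metric as a StatsD gauge over UDP (fire-and-forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for name, value in metrics.items():
            line = format_gauge(name, value, prefix)
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


def run_job(spark, df, statsd_host="127.0.0.1", statsd_port=8125):
    """Hypothetical driver-side wiring; needs a live SparkSession and DataFrame."""
    sc = spark.sparkContext
    # Accumulators are updated on the executors and read on the driver.
    http_calls = sc.accumulator(0)
    bytes_sent = sc.accumulator(0)

    def send_partition(rows):
        for row in rows:
            # Placeholder payload; the real aiohttp request would go here.
            payload = str(row.asDict()).encode("utf-8")
            http_calls.add(1)
            bytes_sent.add(len(payload))

    start = time.time()
    df.foreachPartition(send_partition)
    elapsed = time.time() - start

    # Back on the driver: read accumulator values and publish them as gauges.
    send_gauges(
        {
            "http_calls": http_calls.value,
            "bytes_sent": bytes_sent.value,
            "elapsed_seconds": round(elapsed, 3),
            "bytes_per_second": round(bytes_sent.value / elapsed, 3) if elapsed > 0 else 0,
        },
        host=statsd_host,
        port=statsd_port,
    )
```

With StatsD configured to forward to Graphite, these gauges should then show up under the pyspark_job prefix and can be charted in Grafana. Note that the values are only published once the foreachPartition action finishes, so this gives per-job totals rather than live progress.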