I have a Flink job written in Scala, and I am creating a custom metric to count the number of events in a stream. The job is deployed on Kubernetes, and I can see the system metrics of the JobManager and TaskManagers in Prometheus. However, the custom metric does not show up in Prometheus, although it is visible in the Flink UI. Below is the custom metric code:
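import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Counter
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector
import spray.json._ // assumption: parseJson comes from spray-json

val sampleProcessFunction = new ProcessFunction[String, String] {
  // Created in open() on the TaskManager; @transient keeps it out of serialization
  @transient private var counter: Counter = _

  // Registers "streamcounter" in a user subgroup "abc" of the operator's metric group
  override def open(parameters: Configuration): Unit =
    counter = getRuntimeContext.getMetricGroup.addGroup("abc").counter("streamcounter")

  override def processElement(
      value: String,
      ctx: ProcessFunction[String, String]#Context,
      out: Collector[String]): Unit = {
    val result = value.parseJson.toString // parse and re-serialize the JSON event
    counter.inc() // count every event seen by this subtask
    out.collect(result)
  }
}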
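When searching the endpoint output for this counter, it helps to know roughly what name to expect. The sketch approximates how Flink's Prometheus reporter builds a metric name: the prefix flink, then the logical scope of the metric group, then the metric name, joined with underscores, with characters outside [a-zA-Z0-9:_] replaced by underscores. The logical scope taskmanager.job.task.operator for an operator metric group is an assumption based on Flink's default scope naming, and PromName is a hypothetical helper, not part of Flink's API.

```scala
// Hypothetical helper approximating the Prometheus reporter's naming scheme:
// "flink" + logical scope components + metric name, joined with '_',
// with characters outside [a-zA-Z0-9:_] replaced by '_'.
object PromName {
  private val invalid = "[^a-zA-Z0-9:_]".r

  private def sanitize(s: String): String = invalid.replaceAllIn(s, "_")

  def of(logicalScope: Seq[String], metricName: String): String =
    ("flink" +: logicalScope :+ metricName).map(sanitize).mkString("_")

  def main(args: Array[String]): Unit = {
    // Assumed logical scope of an operator metric group, plus the user group "abc"
    val scope = Seq("taskmanager", "job", "task", "operator", "abc")
    println(of(scope, "streamcounter"))
  }
}
```

Under these assumptions the counter would appear as flink_taskmanager_job_task_operator_abc_streamcounter, i.e. under the taskmanager.job.* scope, which is exactly the family of metrics missing from the output in this question.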
flink-conf.yaml has these Prometheus-related entries:
taskmanager.network.detailed-metrics: true
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 8080
It is not only the custom metric: none of the TaskManager metrics under the taskmanager.job.* scope are exposed on the metrics endpoint. I exec into a TaskManager pod and curl the metrics endpoint like this:
kubectl exec -it flink-taskmanager-app-7448cdb787-9c48j -- /bin/bash
curl http://localhost:8080/metrics
I only get the TaskManager status metrics:
# HELP flink_taskmanager_Status_JVM_Memory_Mapped_MemoryUsed MemoryUsed (scope: taskmanager_Status_JVM_Memory_Mapped)
# TYPE flink_taskmanager_Status_JVM_Memory_Mapped_MemoryUsed gauge
flink_taskmanager_Status_JVM_Memory_Mapped_MemoryUsed{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_Flink_Memory_Managed_Used Used (scope: taskmanager_Status_Flink_Memory_Managed)
# TYPE flink_taskmanager_Status_Flink_Memory_Managed_Used gauge
flink_taskmanager_Status_Flink_Memory_Managed_Used{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_Shuffle_Netty_UsedMemorySegments UsedMemorySegments (scope: taskmanager_Status_Shuffle_Netty)
# TYPE flink_taskmanager_Status_Shuffle_Netty_UsedMemorySegments gauge
flink_taskmanager_Status_Shuffle_Netty_UsedMemorySegments{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_Network_TotalMemorySegments TotalMemorySegments (scope: taskmanager_Status_Network)
# TYPE flink_taskmanager_Status_Network_TotalMemorySegments gauge
flink_taskmanager_Status_Network_TotalMemorySegments{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 30037.0
# HELP flink_taskmanager_Status_Shuffle_Netty_AvailableMemory AvailableMemory (scope: taskmanager_Status_Shuffle_Netty)
# TYPE flink_taskmanager_Status_Shuffle_Netty_AvailableMemory gauge
flink_taskmanager_Status_Shuffle_Netty_AvailableMemory{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 9.84252416E8
# HELP flink_taskmanager_Status_JVM_ClassLoader_ClassesLoaded ClassesLoaded (scope: taskmanager_Status_JVM_ClassLoader)
# TYPE flink_taskmanager_Status_JVM_ClassLoader_ClassesLoaded gauge
flink_taskmanager_Status_JVM_ClassLoader_ClassesLoaded{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 11075.0
# HELP flink_taskmanager_Status_JVM_Memory_Metaspace_Max Max (scope: taskmanager_Status_JVM_Memory_Metaspace)
# TYPE flink_taskmanager_Status_JVM_Memory_Metaspace_Max gauge
flink_taskmanager_Status_JVM_Memory_Metaspace_Max{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 2.68435456E8
# HELP flink_taskmanager_Status_Shuffle_Netty_RequestedMemoryUsage RequestedMemoryUsage (scope: taskmanager_Status_Shuffle_Netty)
# TYPE flink_taskmanager_Status_Shuffle_Netty_RequestedMemoryUsage gauge
flink_taskmanager_Status_Shuffle_Netty_RequestedMemoryUsage{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_Shuffle_Netty_AvailableMemorySegments AvailableMemorySegments (scope: taskmanager_Status_Shuffle_Netty)
# TYPE flink_taskmanager_Status_Shuffle_Netty_AvailableMemorySegments gauge
flink_taskmanager_Status_Shuffle_Netty_AvailableMemorySegments{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 30037.0
# HELP flink_taskmanager_Status_JVM_Memory_Metaspace_Used Used (scope: taskmanager_Status_JVM_Memory_Metaspace)
# TYPE flink_taskmanager_Status_JVM_Memory_Metaspace_Used gauge
flink_taskmanager_Status_JVM_Memory_Metaspace_Used{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 6.5252976E7
# HELP flink_taskmanager_Status_JVM_Memory_NonHeap_Max Max (scope: taskmanager_Status_JVM_Memory_NonHeap)
# TYPE flink_taskmanager_Status_JVM_Memory_NonHeap_Max gauge
flink_taskmanager_Status_JVM_Memory_NonHeap_Max{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 7.80140544E8
# HELP flink_taskmanager_Status_JVM_Memory_Direct_Count Count (scope: taskmanager_Status_JVM_Memory_Direct)
# TYPE flink_taskmanager_Status_JVM_Memory_Direct_Count gauge
flink_taskmanager_Status_JVM_Memory_Direct_Count{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 30065.0
# HELP flink_taskmanager_Status_JVM_Memory_Direct_TotalCapacity TotalCapacity (scope: taskmanager_Status_JVM_Memory_Direct)
# TYPE flink_taskmanager_Status_JVM_Memory_Direct_TotalCapacity gauge
flink_taskmanager_Status_JVM_Memory_Direct_TotalCapacity{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 9.85225216E8
# HELP flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Time Time (scope: taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation)
# TYPE flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Time gauge
flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Time{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_JVM_Threads_Count Count (scope: taskmanager_Status_JVM_Threads)
# TYPE flink_taskmanager_Status_JVM_Threads_Count gauge
flink_taskmanager_Status_JVM_Threads_Count{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 51.0
# HELP flink_taskmanager_Status_Shuffle_Netty_TotalMemory TotalMemory (scope: taskmanager_Status_Shuffle_Netty)
# TYPE flink_taskmanager_Status_Shuffle_Netty_TotalMemory gauge
flink_taskmanager_Status_Shuffle_Netty_TotalMemory{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 9.84252416E8
# HELP flink_taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Time Time (scope: taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation)
# TYPE flink_taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Time gauge
flink_taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Time{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 55.0
# HELP flink_taskmanager_Status_JVM_ClassLoader_ClassesUnloaded ClassesUnloaded (scope: taskmanager_Status_JVM_ClassLoader)
# TYPE flink_taskmanager_Status_JVM_ClassLoader_ClassesUnloaded gauge
flink_taskmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_JVM_Memory_Heap_Used Used (scope: taskmanager_Status_JVM_Memory_Heap)
# TYPE flink_taskmanager_Status_JVM_Memory_Heap_Used gauge
flink_taskmanager_Status_JVM_Memory_Heap_Used{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 1.56297264E8
# HELP flink_taskmanager_Status_JVM_CPU_Time Time (scope: taskmanager_Status_JVM_CPU)
# TYPE flink_taskmanager_Status_JVM_CPU_Time gauge
flink_taskmanager_Status_JVM_CPU_Time{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 4.001E10
# HELP flink_taskmanager_Status_JVM_Memory_Direct_MemoryUsed MemoryUsed (scope: taskmanager_Status_JVM_Memory_Direct)
# TYPE flink_taskmanager_Status_JVM_Memory_Direct_MemoryUsed gauge
flink_taskmanager_Status_JVM_Memory_Direct_MemoryUsed{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 9.85225217E8
# HELP flink_taskmanager_Status_Shuffle_Netty_UsedMemory UsedMemory (scope: taskmanager_Status_Shuffle_Netty)
# TYPE flink_taskmanager_Status_Shuffle_Netty_UsedMemory gauge
flink_taskmanager_Status_Shuffle_Netty_UsedMemory{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Count Count (scope: taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation)
# TYPE flink_taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Count gauge
flink_taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Count{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 3.0
# HELP flink_taskmanager_Status_JVM_Memory_Metaspace_Committed Committed (scope: taskmanager_Status_JVM_Memory_Metaspace)
# TYPE flink_taskmanager_Status_JVM_Memory_Metaspace_Committed gauge
flink_taskmanager_Status_JVM_Memory_Metaspace_Committed{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 6.7375104E7
# HELP flink_taskmanager_Status_JVM_Memory_Heap_Max Max (scope: taskmanager_Status_JVM_Memory_Heap)
# TYPE flink_taskmanager_Status_JVM_Memory_Heap_Max gauge
flink_taskmanager_Status_JVM_Memory_Heap_Max{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 4.429185024E9
# HELP flink_taskmanager_Status_JVM_Memory_NonHeap_Committed Committed (scope: taskmanager_Status_JVM_Memory_NonHeap)
# TYPE flink_taskmanager_Status_JVM_Memory_NonHeap_Committed gauge
flink_taskmanager_Status_JVM_Memory_NonHeap_Committed{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 9.8787328E7
# HELP flink_taskmanager_Status_JVM_Memory_NonHeap_Used Used (scope: taskmanager_Status_JVM_Memory_NonHeap)
# TYPE flink_taskmanager_Status_JVM_Memory_NonHeap_Used gauge
flink_taskmanager_Status_JVM_Memory_NonHeap_Used{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 9.4597576E7
# HELP flink_taskmanager_Status_Shuffle_Netty_TotalMemorySegments TotalMemorySegments (scope: taskmanager_Status_Shuffle_Netty)
# TYPE flink_taskmanager_Status_Shuffle_Netty_TotalMemorySegments gauge
flink_taskmanager_Status_Shuffle_Netty_TotalMemorySegments{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 30037.0
# HELP flink_taskmanager_Status_Flink_Memory_Managed_Total Total (scope: taskmanager_Status_Flink_Memory_Managed)
# TYPE flink_taskmanager_Status_Flink_Memory_Managed_Total gauge
flink_taskmanager_Status_Flink_Memory_Managed_Total{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 4.294967296E9
# HELP flink_taskmanager_Status_JVM_CPU_Load Load (scope: taskmanager_Status_JVM_CPU)
# TYPE flink_taskmanager_Status_JVM_CPU_Load gauge
flink_taskmanager_Status_JVM_CPU_Load{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.002796347271376764
# HELP flink_taskmanager_Status_JVM_Memory_Mapped_Count Count (scope: taskmanager_Status_JVM_Memory_Mapped)
# TYPE flink_taskmanager_Status_JVM_Memory_Mapped_Count gauge
flink_taskmanager_Status_JVM_Memory_Mapped_Count{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_JVM_Memory_Heap_Committed Committed (scope: taskmanager_Status_JVM_Memory_Heap)
# TYPE flink_taskmanager_Status_JVM_Memory_Heap_Committed gauge
flink_taskmanager_Status_JVM_Memory_Heap_Committed{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 4.429185024E9
# HELP flink_taskmanager_Status_Network_AvailableMemorySegments AvailableMemorySegments (scope: taskmanager_Status_Network)
# TYPE flink_taskmanager_Status_Network_AvailableMemorySegments gauge
flink_taskmanager_Status_Network_AvailableMemorySegments{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 30037.0
# HELP flink_taskmanager_Status_JVM_Memory_Mapped_TotalCapacity TotalCapacity (scope: taskmanager_Status_JVM_Memory_Mapped)
# TYPE flink_taskmanager_Status_JVM_Memory_Mapped_TotalCapacity gauge
flink_taskmanager_Status_JVM_Memory_Mapped_TotalCapacity{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
# HELP flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Count Count (scope: taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation)
# TYPE flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Count gauge
flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Count{host="172_26_129_190",tm_id="172_26_129_190:6122_3b582c",} 0.0
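To repeat the curl check above and see at a glance whether any job-scoped metrics are exposed, a small scraper can list the metric names that start with flink_taskmanager_job_. This is a minimal sketch: it assumes the reporter port 8080 from the config above and the plain Prometheus text format (comment lines start with #, the name ends at { or a space); MetricsCheck is a hypothetical name.

```scala
import scala.io.Source

// Minimal sketch: list job-scoped metric names from a Prometheus text endpoint.
object MetricsCheck {
  // Skip blank lines and "# HELP" / "# TYPE" comments; the metric name is
  // everything before the label block '{' or the first space.
  def metricNames(body: String): Set[String] =
    body.split('\n').iterator
      .map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#"))
      .map(l => l.takeWhile(c => c != '{' && c != ' '))
      .toSet

  // Job-scoped TaskManager metrics, including user metrics on operator groups,
  // share the flink_taskmanager_job_ prefix.
  def jobMetrics(names: Set[String]): Set[String] =
    names.filter(_.startsWith("flink_taskmanager_job_"))

  def main(args: Array[String]): Unit = {
    // Assumed default: reporter port 8080 as configured in flink-conf.yaml
    val url = args.headOption.getOrElse("http://localhost:8080/metrics")
    val src = Source.fromURL(url)
    val body = try src.mkString finally src.close()
    val jobs = jobMetrics(metricNames(body))
    if (jobs.isEmpty) println("no taskmanager.job.* metrics exposed")
    else jobs.toSeq.sorted.foreach(println)
  }
}
```

Run against the output shown above, this would report no job-scoped metrics, since every name there starts with flink_taskmanager_Status_.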
Note: no explicit metric filter or exclusion is configured in the config file.
How can we get the taskmanager.job.* metrics, including the custom metric, exported to Prometheus?
The issue was the Flink version. I was using 1.15.0, which has a reported metrics bug: https://lists.apache.org/thread/6bd9vmcroh7576d7h1kdcd8czf0b4l73
In short, once a job is running, the TaskManager metrics under taskmanager.job.* disappear from the reporter. After upgrading Flink to 1.15.2, everything works: all metrics, including the custom one, are exported correctly.
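To confirm which Flink version a cluster is actually running before and after such an upgrade, one option is the JobManager REST API: GET /config returns a JSON document that includes a flink-version field. The sketch assumes the REST endpoint is reachable at localhost:8081 (Flink's default rest.port, for example through kubectl port-forward) and uses a regex instead of a JSON library to stay dependency-free; FlinkVersionCheck is a hypothetical name.

```scala
import scala.io.Source

// Minimal sketch: read "flink-version" from the JobManager's GET /config response.
object FlinkVersionCheck {
  // Matches "flink-version": "<value>" and captures <value>.
  private val VersionField = "\"flink-version\"\\s*:\\s*\"([^\"]+)\"".r

  // Regex extraction keeps the sketch dependency-free; a JSON parser is sturdier.
  def flinkVersion(json: String): Option[String] =
    VersionField.findFirstMatchIn(json).map(_.group(1))

  def main(args: Array[String]): Unit = {
    // Assumed default REST address; adjust for your service or port-forward
    val url = args.headOption.getOrElse("http://localhost:8081/config")
    val src = Source.fromURL(url)
    val body = try src.mkString finally src.close()
    flinkVersion(body) match {
      case Some(v) => println(s"Flink version: $v")
      case None    => println("flink-version field not found")
    }
  }
}
```

A result of 1.15.0 would point at the version reported above as dropping the taskmanager.job.* metrics.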