In our organization we have a number of systems running on Flink 1.16. We use the PrometheusReporterFactory to expose our metrics for Prometheus to scrape.
Because of the dynamic label definitions of Flink's system metrics, we experience a cardinality explosion in Prometheus due to the huge number of time series created.
With many operators spread across many TaskManagers and task slots, the number of metrics becomes gigantic because of dynamic metric labels such as task_attempt_id, task_id, tm_id and more, most of which are never even used or queried by our SRE team.
Is there any way to reduce the cardinality? Maybe some way to exclude specific labels from being exported by Flink?
Thanks.
We tried to reduce the cardinality by disabling the latency metrics, as presented in this issue, but saw no significant decrease in cardinality.
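For reference, this is roughly the change we made; latency tracking is controlled by the metrics.latency.interval option (milliseconds, 0 disables the latency markers):

metrics.latency.interval: 0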
Please look at the Metrics System Scope documentation. It describes how to customize which information is attached to reported metrics; accordingly, we can exclude unnecessary variables to reduce the cardinality.
Edit:
For example (flink-conf.yaml):
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: "8080"
metrics.reporter.prom.interval: 60 SECONDS
metrics.reporter.prom.scope.variables.excludes: host;tm_id;task_attempt_id;task_attempt_num;subtask_index;task_id;job_id;operator_id
metrics.scope.jm: jobmanager
metrics.scope.jm-job: jobmanager.<job_name>
metrics.scope.jm-operator: jobmanager.<job_name>.<operator_name>
metrics.scope.tm: taskmanager
metrics.scope.tm-job: taskmanager.<job_name>
metrics.scope.task: taskmanager.<job_name>.<task_name>
metrics.scope.operator: taskmanager.<job_name>.<operator_name>
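With the scope.variables.excludes line above, the Prometheus reporter simply stops attaching the excluded variables as labels. As an illustration (metric name and values are made up), a series that previously looked like

flink_taskmanager_job_task_operator_numRecordsIn{host="...",tm_id="...",job_id="...",job_name="myjob",task_id="...",task_name="MyMap",task_attempt_id="...",task_attempt_num="0",subtask_index="3",operator_id="...",operator_name="MyMap"} 42

would be exposed as

flink_taskmanager_job_task_operator_numRecordsIn{job_name="myjob",task_name="MyMap",operator_name="MyMap"} 42

One caveat: if you exclude a variable that distinguishes parallel instances (such as subtask_index), series from different subtasks collapse into the same label set, so choose the excludes according to which dimensions your SRE team actually queries.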