apache-flink flink-streaming

Apache Flink - high Prometheus metrics cardinality


In our organization, we have a number of systems running on Flink 1.16.

We use the PrometheusReporterFactory to expose our metrics for Prometheus to scrape.

Because Flink's system metrics carry dynamically generated labels, we experience a cardinality explosion in Prometheus due to the huge number of time series created.

With many operators spread across many TaskManagers and task slots, the number of metrics becomes gigantic because of dynamic labels such as task_attempt_id, task_id, tm_id, and more, even though most of them are never used or queried by the SRE team.

Is there any possible way to reduce the cardinality? For example, some way to exclude specific labels from being exported by Flink.
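To show the scale of the problem, this is roughly how we measure which metric names contribute the most series (a PromQL sketch, assuming the Flink metrics are exposed with the default `flink_` prefix):

```promql
# Top 10 Flink metric names by number of time series
topk(10, count by (__name__) ({__name__=~"flink_.+"}))
```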

Thanks.

We tried to reduce the cardinality by disabling the latency metrics, as presented in this issue, but saw no significant decrease.
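For reference, this is the setting we used to disable latency tracking (latency histograms are controlled by `metrics.latency.interval`; `0` turns them off):

```yaml
# flink-conf.yaml: disable latency marker histograms
metrics.latency.interval: 0
```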


Solution

  • Take a look at the Metrics System Scope documentation. It lets you customize which variables appear in metric identifiers, so unnecessary variables can be excluded to reduce the cardinality.

    Edit:
    For example (flink-conf.yaml):

    metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
    metrics.reporter.prom.port: "8080"
    metrics.reporter.prom.interval: 60 SECONDS
    metrics.reporter.prom.scope.variables.excludes: host;tm_id;task_attempt_id;task_attempt_num;subtask_index;task_id;job_id;operator_id
    metrics.scope.jm: jobmanager
    metrics.scope.jm-job: jobmanager.<job_name>
    metrics.scope.jm-operator: jobmanager.<job_name>.<operator_name>
    metrics.scope.tm: taskmanager
    metrics.scope.tm-job: taskmanager.<job_name>
    metrics.scope.task: taskmanager.<job_name>.<task_name>
    metrics.scope.operator: taskmanager.<job_name>.<operator_name>
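For illustration, with the excludes above in place a scraped series should lose the high-cardinality labels (label values below are made up; actual names depend on your job). Note that excluding a variable that distinguishes parallel instances, such as subtask_index, means those series are no longer distinguishable from each other:

```text
# Before: every dynamic scope variable becomes a Prometheus label
flink_taskmanager_job_task_operator_numRecordsIn{host="tm1",tm_id="abc123",job_id="de45",job_name="myJob",task_id="f678",task_name="map",task_attempt_id="90ab",subtask_index="0",operator_id="cd12",operator_name="MyMapper"} 42

# After: excluded variables are dropped from the exported labels
flink_taskmanager_job_task_operator_numRecordsIn{job_name="myJob",task_name="map",operator_name="MyMapper"} 42
```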