[SOLVED] Prometheus Monitoring on Docker Swarm with Multiple Replicas

I'm trying to monitor my Node.js API (built with Express) using Prometheus, but I'm having trouble exporting the metrics because the API runs on a server under Docker Swarm with about 6 replicas. I tried configuring dns_sd_configs, but each instance produces its own counter series. I want to aggregate them to build charts in Grafana, such as 2XX requests, 5XX requests, etc.

The name of my service is backend-server, and I want to scrape the data from port 9464 and the endpoint /api/metrics. I configured my prometheus.yaml as follows:

- job_name: 'dockerswarm'
  dockerswarm_sd_configs:
    - host: unix:///var/run/docker.sock
      role: tasks
  relabel_configs:
    # Only keep containers that should be running.
    - source_labels: [__meta_dockerswarm_task_desired_state]
      regex: running
      action: keep
    # Only keep containers with the specific service name.
    - source_labels: [__meta_dockerswarm_service_name]
      regex: backend-server
      action: keep
    - source_labels: [__meta_dockerswarm_node_address]
      target_label: __address__
      replacement: $1:9464/api/metrics

It doesn't throw any errors, but my application doesn't appear in the Prometheus targets...

root@srv:~# docker service ls | grep backend
z5bnz2t5riw8   backend-server            replicated   6/6        xx/xx/backend-server:x           *:3000->3000/tcp

root@srv:~# docker service ls | grep promethe
8zlh5kwfx8ks   prometheus                replicated   1/1        prom/prometheus:v2.52.0          *:9090->9090/tcp

For now, I am using the following configuration, which does get scraped:

scrape_configs:
  - job_name: cadvisor
    scrape_interval: 1m
    static_configs:
      - targets:
          - cadvisor:8080
  - job_name: node
    scrape_interval: 1m
    static_configs:
      - targets: ['host.docker.internal:9100', 'victoria.consorcio.local:9100']
  - job_name: backend
    scrape_interval: 15s
    metrics_path: /victoria/api/metrics
    dns_sd_configs:
      - names:
          - 'tasks.backend-server'
        type: 'A'
        port: 9464

But this produces a separate counter series for each instance:

error_counter_total{instance="10.0.1.15:9464", job="backend", method="POST", status="401"}
error_counter_total{instance="10.0.1.16:9464", job="backend", method="POST", status="401"}

Solution

The Prometheus container is successfully scraping all 6 replicas. So, you can simply use PromQL to aggregate and group results in your Grafana dashboard.

For example, you can create a panel in Grafana with the following query to sum the count of 500 responses across all replicas:

sum(error_counter_total{job="backend", method="POST", status="500"})
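
To get the grouped charts mentioned in the question (5XX requests, per-status breakdowns, and so on), the same idea can be extended with a regex matcher and a `by` clause. This is only a sketch: it assumes the `status` label always holds the numeric HTTP status code, as in the sample series above, and the 5m window is an arbitrary choice. Each query goes into its own Grafana query field.

```promql
# Per-second rate of 5XX responses over the last 5 minutes,
# summed across all replicas (the instance label is dropped by sum).
sum(rate(error_counter_total{job="backend", status=~"5.."}[5m]))

# One series per status code instead of one per replica.
sum by (status) (rate(error_counter_total{job="backend"}[5m]))
```

Wrapping the counter in rate() before summing is usually preferable to summing raw counter values, because a counter restarts from zero whenever a replica restarts, and rate() accounts for those resets.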

The other aggregation operators (avg, min, max, count, and so on) are described in the Prometheus documentation on query operators.
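
As for the dockerswarm_sd_configs attempt in the question, one likely reason no targets appeared is that `__address__` must be a plain host:port, so a value ending in `/api/metrics` is rejected; the path belongs in `metrics_path`. The node address is also only reachable on ports the service publishes, and the service above publishes only 3000. A minimal sketch of how that job could look, assuming the Prometheus container has the Docker socket mounted and shares an overlay network with backend-server (neither is confirmed in the question):

```yaml
- job_name: 'dockerswarm'
  # The scrape path goes here, not into __address__.
  metrics_path: /api/metrics
  dockerswarm_sd_configs:
    - host: unix:///var/run/docker.sock
      role: tasks
  relabel_configs:
    # Only keep containers that should be running.
    - source_labels: [__meta_dockerswarm_task_desired_state]
      regex: running
      action: keep
    # Only keep containers with the specific service name.
    - source_labels: [__meta_dockerswarm_service_name]
      regex: backend-server
      action: keep
    # Keep the discovered task IP, but replace the port with the metrics port.
    - source_labels: [__address__]
      regex: '([^:]+)(?::\d+)?'
      target_label: __address__
      replacement: '$1:9464'
```

If a task is attached to more than one network, this may yield several targets per replica; filtering on `__meta_dockerswarm_network_name` is one way to narrow that down. Either way, the per-instance series still need to be aggregated with PromQL as described above.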