python, nginx, prometheus, gunicorn, docker-swarm

Prometheus Metrics with Docker Swarm


Metrics Collection with Docker Deploy Replicas

I am a developer, but in my new job, the company doesn't have a DevOps team. So, we don't have any type of metrics collection or proper CI/CD flows. Because of that, I am trying to implement a few things around here, but I am no expert.

The first thing I am trying to do is implement metrics collection and visualization with Prometheus and Grafana to monitor some Python and Node.js on-premises apps. I am using a Flask app for testing and Docker to install Prometheus and Grafana locally before setting them up on a proper server. I made it work easily using prometheus-flask-exporter, but I started noticing some issues and have some questions about what is best for my app stack.

App Stack:

  • Flask (Python 3.11) served by Gunicorn (2 gthread workers)
  • Nginx as a reverse proxy
  • Docker Compose with deploy.replicas: 2 for load balancing
  • Prometheus + Grafana, with prometheus-flask-exporter on the app side

Issues and Questions:

  1. Docker Deploy Replicas: I immediately realized that my app runs as two Docker replicas for load balancing. So, when Prometheus scrapes the /metrics path, Docker routes the request to one of the replicas. I believe each replica should have its own metrics in Grafana so I can see whether the load balancing is working properly. What I did was create a different path for each replica, like /metrics_1 and /metrics_2, in Nginx, plus two different jobs in Prometheus. It worked, but I don't think that is the proper way to do it.

  2. Metrics Accuracy: I want basic metrics like percentile latency, requests per second on each path, and counts of 2xx, 3xx, 4xx, and 5xx responses. However, the way I implemented it, I can't trust the metrics: when I compare them with a K6 load test, I get completely different numbers, especially for percentile latency and requests per second (see the query sketch after this list).
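
For reference, assuming prometheus-flask-exporter's default metric names (flask_http_request_duration_seconds and flask_http_request_total - worth verifying against your actual /metrics output), these numbers are usually derived with queries like the ones below. Note that rate() smooths over its window, which is one common reason Prometheus graphs won't match a K6 summary exactly:

# 95th percentile latency per path over a 5-minute window
histogram_quantile(0.95, sum by (le, path) (rate(flask_http_request_duration_seconds_bucket[5m])))

# requests per second on each path (from the histogram's count series)
sum by (path) (rate(flask_http_request_duration_seconds_count[5m]))

# requests per second for one status class, e.g. 2xx
sum(rate(flask_http_request_total{status=~"2.."}[5m]))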

After these issues, I got mad and rolled back everything I did. Now I want to start from the ground up. My questions are mostly about good monitoring practices. Given my stack, what should I focus on monitoring? Do I need to collect metrics from Nginx too? How can I handle Docker replicas? Is it better to monitor Gunicorn with something like statsd-exporter instead of monitoring Flask with prometheus-flask-exporter? Do I need multiprocess mode? (There's a sketch of that below.)
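
For context on that last question: with workers = 2 in Gunicorn, each worker process keeps its own counters, so a plain in-app exporter only reports whichever worker happens to answer the scrape. prometheus-flask-exporter documents a Gunicorn integration for this; a minimal sketch (assuming the PROMETHEUS_MULTIPROC_DIR environment variable points at a writable, empty directory, and the port choice is arbitrary):

# additions to gunicorn_config.py
from prometheus_flask_exporter.multiprocess import GunicornPrometheusMetrics

def when_ready(server):
    # serve the aggregated metrics on a separate port, outside the Flask app
    GunicornPrometheusMetrics.start_http_server_when_ready(8080)

def child_exit(server, worker):
    # drop the dead worker's metric files so stale series disappear
    GunicornPrometheusMetrics.mark_process_dead_on_child_exit(worker.pid)

The app itself then uses GunicornPrometheusMetrics(app) in place of PrometheusMetrics(app).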

My config files:

compose.yml:


services:
  api:
    image: api-auth-ad
    build: .
    expose:
      - "8000"
    environment:
      - SECRET_KEY=${SECRET_KEY}
      - LDAP_DOMAIN=${LDAP_DOMAIN}
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: "0.75"
          memory: "1gb"
    restart: always

  nginx:
    container_name: api-auth-ad-nginx
    image: nginx:1.27.0
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - api
    restart: always

Dockerfile:


FROM python:3.11

WORKDIR /app

COPY pyproject.toml ./

RUN pip install poetry

RUN poetry lock

RUN poetry install --only main

COPY . .

CMD ["poetry", "run", "gunicorn", "--config", "gunicorn_config.py", "src.app:create_app()"]

gunicorn_config.py:

workers = 2
threads = 2
bind = "0.0.0.0:8000"
loglevel = "info"
accesslog = "-"
errorlog = "-"
worker_class = "gthread"

nginx.conf:

worker_processes  auto;
worker_rlimit_nofile 500000;

events {
    use epoll;
    worker_connections 512;
}

http {
    access_log off;
    error_log /dev/null emerg;

    upstream api_auth {
        server api:8000;
        keepalive 400;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://api_auth;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_intercept_errors off;
        }
        
    }

}

Solution

  • There's a lot going on here, so I will focus on dealing with replicas:

    One way to deal with replicas is to rewrite "instance" to be the replica name in task-level metrics.

    You can use "dns_sd_configs" to set up an A-record scrape against "tasks.{service_name}". Swarm DNS returns the IP of each individual task of the service, which lands in the __address__ variable that is used to scrape metrics - but rewriting "instance" to anything meaningful is harder this way.
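
    A sketch of that approach, with the service name and port assumed from the compose file above:

      - job_name: 'api'
        dns_sd_configs:
          - names: ['tasks.api']
            type: A
            port: 8000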

    There is a better way: by pinning your Prometheus instance to a manager node, you can use the dockerswarm_sd_configs to pull metrics:

      - job_name: 'dockerswarm'
        dockerswarm_sd_configs:
          - host: unix:///var/run/docker.sock
            role: tasks
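
    For this to work, the Prometheus service itself needs the Docker socket mounted and a placement constraint pinning it to a manager node. A minimal compose-side sketch (image tag and file paths are assumptions):

        prometheus:
          image: prom/prometheus
          volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
            - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
          deploy:
            placement:
              constraints:
                - node.role == manager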
    

    You can add relabel configs. I rewrite "instance" to be the node the service is running on:

        relabel_configs:
          - source_labels: [__meta_dockerswarm_node_hostname]
            target_label: instance
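
    If you want the replica name instead, as mentioned above, __meta_dockerswarm_service_name and __meta_dockerswarm_task_slot are also among the meta labels Swarm discovery exposes; joining them gives something like "api.1" (a sketch):

        relabel_configs:
          - source_labels:
              - __meta_dockerswarm_service_name
              - __meta_dockerswarm_task_slot
            separator: '.'
            target_label: instance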
    

    The dockerswarm_sd_config doesn't know which port your service serves metrics on. I define a deploy label "prometheus.metrics.port" to carry that and assign it to the scrape address:

        relabel_configs:
          - source_labels:
              - __address__
              - __meta_dockerswarm_service_label_prometheus_metrics_port
            regex: '(.*):(\d+);(\d+)'
            target_label: __address__
            replacement: '$1:$3'
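
    On the service side, that label has to live under deploy.labels (service-level labels, which is what the _service_label_ meta labels read) rather than the container-level labels. Roughly:

        services:
          api:
            deploy:
              labels:
                prometheus.metrics.port: "8000"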
    

    To be scraped, services also need to be attached to a monitoring network. dockerswarm_sd_configs generates a service discovery entry for each attached network and each published port, so we can filter on a shared Docker network. Here we only keep containers that are attached to a network named "monitoring":

        relabel_configs:
          - source_labels: [__meta_dockerswarm_network_name]
            regex: monitoring
            action: keep
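
    The compose side of that filter is just attaching Prometheus and your apps to a shared overlay network. With an external network named "monitoring" (the name is an assumption, matching the regex above), that looks roughly like:

        networks:
          monitoring:
            external: true   # created once with: docker network create -d overlay monitoring

        services:
          api:
            networks:
              - monitoring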