Metrics collection with Docker Deploy Replica
I am a developer, but in my new job, the company doesn't have a DevOps team. So, we don't have any type of metrics collection or proper CI/CD flows. Because of that, I am trying to implement a few things around here, but I am no expert.
The first thing I am trying to do is implement metrics collection and visualization with Prometheus and Grafana to monitor some Python and Node.js on-premises apps. I am using a Flask app for testing and Docker to install Prometheus and Grafana locally before setting them up on a proper server. I made it work easily using prometheus-flask-exporter, but I started noticing some issues and have some questions about what is best for my app stack.
App Stack:
Python
Flask
Gunicorn with 2 workers and 2 threads
Docker with a replica
Nginx
Issues and Questions:
Docker Deploy Replica: I immediately realized that my app runs as two Docker replicas for load balancing. So, when Prometheus scrapes the /metrics path, Docker sends the request to one of the replicas. I believe each replica should have separate metrics on Grafana so I can see whether the load balancing is working properly. What I did was create a different path for each replica, like /metrics_1 and /metrics_2, on Nginx, plus two different jobs on Prometheus. It worked, but I don't think that is the proper way to do it.
Metrics Accuracy: I want basic metrics like percentile latency, requests per second on each path, 2xx requests, 3xx requests, 4xx requests, and 5xx requests. However, the way I implemented it, I can't trust the metrics because when I compare them with the K6 load test, I have completely different metrics, especially on percentile latency and requests per second.
After these issues, I got mad and rolled back everything I did. Now, I want to start from the ground up. My questions are mostly about good monitoring practices. Given my stack, what should I focus on monitoring? Do I need to collect metrics from Nginx too? How can I handle Docker replicas? Is it better to monitor Gunicorn using something like statsd-exporter instead of Flask using prometheus-flask-exporter? Do I need Multiprocess Mode?
My config files:
compose.yml:
```yaml
services:
  api:
    image: api-auth-ad
    build: .
    expose:
      - "8000"
    environment:
      - SECRET_KEY=${SECRET_KEY}
      - LDAP_DOMAIN=${LDAP_DOMAIN}
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: "0.75"
          memory: "1gb"
    restart: always

  nginx:
    container_name: api-auth-ad-nginx
    image: nginx:1.27.0
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - api
    restart: always
```
Dockerfile:
```dockerfile
FROM python:3.11
WORKDIR /app
COPY pyproject.toml ./
RUN pip install poetry
RUN poetry lock
RUN poetry install --only main
COPY . .
CMD ["poetry", "run", "gunicorn", "--config", "gunicorn_config.py", "src.app:create_app()"]
```
gunicorn_config.py:
```python
workers = 2
threads = 2
bind = "0.0.0.0:8000"
loglevel = "info"
accesslog = "-"
errorlog = "-"
worker_class = "gthread"
```
nginx.conf:
```nginx
worker_processes auto;
worker_rlimit_nofile 500000;

events {
    use epoll;
    worker_connections 512;
}

http {
    access_log off;
    error_log /dev/null emerg;

    upstream api_auth {
        server api:8000;
        keepalive 400;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://api_auth;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_intercept_errors off;
        }
    }
}
```
There's a lot going on here, so I'll focus on dealing with replicas:
One way to deal with replicas is to rewrite "instance" to be the replica name in task-level metrics.
You can use "dns_sd_configs" to set up an A-record scrape against "tasks.{service_name}". Docker's internal DNS returns each task's individual IP, which lands in the `__address__` label that Prometheus scrapes. Rewriting `instance` to anything meaningful is harder this way, though.
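As a rough sketch of that DNS-based approach, assuming the question's `api` service with metrics on port 8000 (both names taken from the compose file above, not from any official example):

```yaml
# Hypothetical scrape job using swarm's built-in DNS.
# "tasks.api" resolves to one A record per running replica,
# so every task becomes its own scrape target.
- job_name: 'api'
  dns_sd_configs:
    - names: ['tasks.api']
      type: A
      port: 8000
```

The downside, as noted above, is that `instance` ends up being a bare container IP, which changes every time a task is rescheduled.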
There is a better way: by pinning your Prometheus instances to manager nodes, you can use dockerswarm_sd_configs to pull metrics:
```yaml
- job_name: 'dockerswarm'
  dockerswarm_sd_configs:
    - host: unix:///var/run/docker.sock
      role: tasks
```
You can then add relabel configs. I rewrite the instance to be the node the service is running on:
```yaml
relabel_configs:
  - source_labels: [__meta_dockerswarm_node_hostname]
    target_label: instance
```
The dockerswarm_sd_config doesn't know which port your service is listening on for metrics. I define a deploy label "prometheus.metrics.port" to carry that, and assign it to the target address:
```yaml
relabel_configs:
  - source_labels:
      - __address__
      - __meta_dockerswarm_service_label_prometheus_metrics_port
    regex: '(.*):(\d+);(\d+)'
    target_label: __address__
    replacement: '$1:$3'
```
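On the service side, that label goes under deploy.labels in the compose file; something like this for the question's api service (the label name must match what the relabel rule reads, since Prometheus turns the dots into underscores in `__meta_dockerswarm_service_label_prometheus_metrics_port`):

```yaml
services:
  api:
    image: api-auth-ad
    deploy:
      replicas: 2
      labels:
        # Consumed by the relabel rule via
        # __meta_dockerswarm_service_label_prometheus_metrics_port
        prometheus.metrics.port: "8000"
```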
To be scraped, services also need to be attached to a monitoring network. dockerswarm_sd_configs generates a service-discovery entry for each attached network (and each published port), so we can filter on a shared Docker network. Here we only keep targets that are attached to a network named "monitoring":
```yaml
relabel_configs:
  - source_labels: [__meta_dockerswarm_network_name]
    regex: monitoring
    action: keep
```
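Putting those pieces together, one complete job could look roughly like this (same socket, label, and network names as above; order matters, since the `keep` rule should run before the address rewrite so dropped targets are discarded early):

```yaml
- job_name: 'dockerswarm'
  dockerswarm_sd_configs:
    - host: unix:///var/run/docker.sock
      role: tasks
  relabel_configs:
    # Only keep tasks attached to the "monitoring" network
    - source_labels: [__meta_dockerswarm_network_name]
      regex: monitoring
      action: keep
    # Point __address__ at the port declared in the service label
    - source_labels:
        - __address__
        - __meta_dockerswarm_service_label_prometheus_metrics_port
      regex: '(.*):(\d+);(\d+)'
      target_label: __address__
      replacement: '$1:$3'
    # Use the swarm node's hostname as the instance label
    - source_labels: [__meta_dockerswarm_node_hostname]
      target_label: instance
```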