
Prometheus WAL Keeps on Growing Indefinitely


I am running Prometheus v2.20.0, and its WAL keeps growing indefinitely and consuming disk space.

Disk space itself is not the immediate problem. The problem is that the WAL folder never gets cleaned up, so whenever Prometheus is restarted, it tries to load the entire WAL into memory.

For example, the WAL is now 60GB and memory is 32GB, so Prometheus keeps restarting: it gets killed by the OOM killer once it consumes the whole server memory of 24 GB.

Here are the flags I currently start it with. Please note that I run it using Docker Compose.

   - '--web.enable-admin-api'
   - '--config.file=/etc/prometheus/prometheus.yml'
   - '--web.external-url=https://prometheus.example.com'
   - '--storage.tsdb.path=/var/lib/prometheus'
   - '--storage.tsdb.retention=150d'
   - '--web.console.libraries=/usr/share/prometheus/console_libraries'
   - '--web.console.templates=/usr/share/prometheus/consoles'

So my question is: how can I configure it to checkpoint and clean up the WAL properly, so that it won't keep growing indefinitely?


Solution

  • This seems to be a known bug in Prometheus v2.20.0, and upgrading to v2.21.0 fixed it. https://github.com/prometheus/prometheus/issues/7955
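
To check whether the WAL is actually being truncated after the upgrade, it can help to look at what is inside the WAL directory. Below is a minimal sketch in Python. It assumes the WAL lives in a `wal` subdirectory of the path given by `--storage.tsdb.path`, that segment files have purely numeric names, and that checkpoints are directories whose names start with `checkpoint.`. The function name `summarize_wal` is hypothetical, not part of any Prometheus tooling.

```python
import os
import re

# Assumption: WAL segment files have purely numeric names, e.g. "00000042".
SEGMENT_RE = re.compile(r"^\d+$")


def summarize_wal(wal_dir):
    """Return segment count, segment range, checkpoints and total size of a WAL dir."""
    segments = []
    checkpoints = []
    total_bytes = 0

    for entry in os.scandir(wal_dir):
        if entry.is_file() and SEGMENT_RE.match(entry.name):
            # A regular WAL segment.
            segments.append(int(entry.name))
            total_bytes += entry.stat().st_size
        elif entry.is_dir() and entry.name.startswith("checkpoint."):
            # Assumption: checkpoints are directories named "checkpoint.<number>".
            checkpoints.append(entry.name)
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    total_bytes += os.path.getsize(os.path.join(root, name))

    return {
        "segment_count": len(segments),
        "first_segment": min(segments) if segments else None,
        "last_segment": max(segments) if segments else None,
        "checkpoints": sorted(checkpoints),
        "total_bytes": total_bytes,
    }
```

With the flags from the question, this would be called as `summarize_wal("/var/lib/prometheus/wal")` inside the container, or with the matching volume path on the host. If `first_segment` never moves forward and `segment_count` keeps climbing over many hours, old segments are probably not being removed. If `first_segment` advances and a recent checkpoint directory is present, truncation appears to be working.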