prometheus victoriametrics thanos

Combining VictoriaMetrics and Thanos - VM for hot data, Thanos for lookback and federation?


I'm currently running a Prometheus + Thanos stack for metrics, and it's ... ok ... but it's outrageously memory-hungry. It also tends to fall on its face and OOM if more expensive queries are submitted, instead of simply taking longer to process them.

VictoriaMetrics is a lot more memory-efficient than Prometheus so I'd like to look at swapping it in. But I need federated query across several stores, and offloading of lookback metrics to an object store. Both of these are presently handled by Thanos Receive collecting data from Prometheus sidecars, feeding it into Thanos Store and accessing it with Thanos Query.

It's unclear whether this model will translate well to VictoriaMetrics, in particular whether the VM TSDB is compatible with Thanos Sidecar. There are PromQL dialect issues to consider too.

But VM doesn't appear to support fan-out federated query or using object storage, so I can't just drop it in to replace Thanos too.

Is there a sensible, maintainable way to integrate VictoriaMetrics and Thanos, so that Thanos handles backlog retention and fan-out, but VictoriaMetrics is used instead of Prometheus for scraping, hot-data querying, recording rule processing etc?

If not, any advice for how to get Prometheus to execute more expensive queries without RAM usage that approaches the infinite? I've come from PostgreSQL where more expensive queries are just slower, as the RDBMS execution engine will use tempfiles and tapesorts and various other execution strategies to process datasets vastly in excess of available RAM. Prometheus ... just seems to want more RAM.

Edit: There are also data residency considerations, where EU user data may need to be physically stored in the EU, New Zealand user data physically stored in New Zealand, etc etc.


Solution

  • But VM doesn't appear to support fan-out federated query or using object storage, so I can't just drop it in to replace Thanos too.

    In the VM ecosystem, fan-out queries usually aren't needed. Typically Prometheus (or stateless scrape agents) does the scraping and delivers metrics to a central VM cluster. The data is usually about 30-60s fresh and can be queried right away from the central cluster, which gives you a global query view.

    Yes, VictoriaMetrics doesn't support object storage for historical data. But it compresses data very efficiently, so storing everything on disk would probably cost about the same and give better query performance.
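To make the "global query view" concrete, here is a minimal sketch of querying one central VictoriaMetrics endpoint through its Prometheus-compatible HTTP API instead of fanning out to several stores. The hostnames, the tenant ID `0`, and the `region` label are assumptions for illustration only; the helper names `build_query_url` and `instant_query` are hypothetical, not part of any VictoriaMetrics client.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_query_url(base_url: str, promql: str, tenant: Optional[str] = None) -> str:
    """Build an instant-query URL for the Prometheus-compatible API.

    tenant=None assumes a single-node instance (/api/v1/query directly).
    Otherwise the cluster vmselect layout is assumed:
    /select/<tenant>/prometheus/api/v1/query
    """
    base = base_url.rstrip("/")
    prefix = f"/select/{tenant}/prometheus" if tenant is not None else ""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base}{prefix}/api/v1/query?{params}"


def instant_query(url: str, timeout: float = 10.0) -> list:
    """Run the query and return the raw result list from the JSON response."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    # Hypothetical vmselect address and label; adjust to your own setup.
    url = build_query_url("http://vmselect.example:8481", "sum by (region) (up)", tenant="0")
    print(url)
    # instant_query(url) would hit the network, so it is left commented out here.
```

Since every scraper writes into the same cluster, a single call like this sees all sources at once. Note that this layout puts all data in one place, so it does not by itself address the per-country data residency requirement from the question.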