#prometheus (2023-08)
Archive: https://archive.sweetops.com/prometheus/
2023-08-29

How are y’all scaling prometheus-server
when 1 instance can’t handle the huge amount of metrics (so starts OOMing)?
We are adopting Prometheus-Operator
and debating between:
• a) “one prometheus per namespace”: we have >100 namespaces and a few dozen teams, so the benefit for us would be per-namespace reporting and isolating the impact of any one namespace.
• b) “functional sharding”: Prometheus shard X scrapes all pods of Services A, B and C, while shard Y scrapes pods of Services D, E and F.
• c) “automatic sharding”: targets are assigned to Prometheus shards based on their addresses (see the sketch after this list).
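
For reference, a minimal sketch of what option (c) looks like with the Prometheus Operator, assuming a CR named `sharded` in a `monitoring` namespace (names and label values here are made up, not from this thread). The generated scrape configs hashmod each target’s `__address__`, so every shard only keeps its slice of the targets:

```yaml
# Hypothetical Prometheus CR illustrating automatic sharding (option c).
# The operator creates one StatefulSet per shard; targets are split across
# shards via hashmod relabeling on __address__.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: sharded
  namespace: monitoring
spec:
  replicas: 2                           # HA replicas per shard
  shards: 4                             # each shard scrapes roughly 1/4 of the targets
  serviceMonitorSelector: {}            # pick up all ServiceMonitors...
  serviceMonitorNamespaceSelector: {}   # ...across all namespaces
  externalLabels:
    cluster: prod                       # helps when querying across shards later
```

Each shard then only holds a slice of the data, so you need something like Thanos Query (or federation) for a global view, which is where the replies below go.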

This is where Thanos comes in.

(or Victoria Metrics)


Thanos is more for the storage and query side, not the scraping. It’s typically paired with prometheus-operator to do some form of sharding.
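
For anyone following along, a rough sketch of that pairing; the Secret name, retention, and bucket details are assumptions, not from this thread. The operator injects a Thanos sidecar that uploads TSDB blocks to object storage, and a separate Thanos Query deployment fans reads out across all shards/replicas:

```yaml
# Hypothetical Prometheus CR with the Thanos sidecar enabled.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: sharded
  namespace: monitoring
spec:
  retention: 2d                  # keep only a few days locally; the bucket holds long-term data
  thanos:
    objectStorageConfig:         # Secret holding the Thanos object-store config
      name: thanos-objstore      # hypothetical Secret name
      key: objstore.yml
---
# Hypothetical contents of the objstore.yml key in that Secret (S3 flavour):
#   type: S3
#   config:
#     bucket: my-metrics-bucket
#     endpoint: s3.us-east-1.amazonaws.com
```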

Yes, but in our past experience it eliminated the memory issues

because the scraping gets sharded and Thanos aggregates the results across shards

1 instance can’t handle the huge amount of metrics (starts OOMing)

this was our problem.

We kept vertically scaling the pods until they were using 36 GB of RAM each, and that got expensive. With Thanos, we didn’t need to vertically scale.

This was ~3 years ago, so my memory is foggy, but OOM was exactly our problem.

Yeah. The pattern is to pair prometheus-operator with Thanos.
Our Prometheus isn’t used for long-term storage (a couple of days of retention), so it was simply the huge amount of scraping (several hundred thousand metrics) that killed it.
We will do a trial between options (a) and (b) to see what works best at huge scale (rough sketch of both below).
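
A rough sketch of how (a) and (b) tend to look with the operator’s selectors; the namespace names and the shard label are made up for illustration:

```yaml
# (a) one Prometheus per namespace: scope the instance to a single namespace.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: team-a
  namespace: team-a
spec:
  serviceMonitorSelector: {}               # any ServiceMonitor...
  serviceMonitorNamespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: team-a  # ...but only from this namespace
---
# (b) functional sharding: each Prometheus picks up a labelled group of services.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: shard-x
  namespace: monitoring
spec:
  serviceMonitorNamespaceSelector: {}      # any namespace...
  serviceMonitorSelector:
    matchLabels:
      prometheus-shard: x                  # ...but only ServiceMonitors labelled for shard X
```

Either way, each instance only scrapes its own slice, so memory stays bounded per instance.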