#prometheus (2023-08)
Archive: https://archive.sweetops.com/prometheus/
2023-08-29
![Sean avatar](https://secure.gravatar.com/avatar/b124653b19ee9dd438710a38954ed4a3.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0004-72.png)
How are y’all scaling prometheus-server when one instance can’t handle the huge amount of metrics (so it starts OOMing)?
We are adopting Prometheus-Operator and debating between:
• a) “one prometheus per namespace”: the benefit for us is that with >100 namespaces and a few dozen teams, we get per-namespace reporting and can isolate the impact of any one instance to its namespace.
• b) “functional sharding”: the Prometheus shard X scrapes all pods of Service A, B and C while shard Y scrapes pods from Service D, E and F.
• c) “automatic sharding”: targets are assigned to Prometheus shards based on their addresses (see the sketch below).
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
This is where Thanos comes in.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
(or Victoria Metrics)
![Sean avatar](https://secure.gravatar.com/avatar/b124653b19ee9dd438710a38954ed4a3.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0004-72.png)
Thanos is more for the storage side, not the scraping. It’s typically paired with prometheus-operator to do some form of sharding.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
Yes, but in our past experience it eliminates the memory issues
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
because it shards and aggregates
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
> 1 instance can’t handle the huge amount of metrics (starts OOMing)
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
this was our problem.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
We kept vertically scaling the pods until they were using 36 GB of RAM, and that got expensive. With Thanos, we didn’t need to vertically scale.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
This was ~3 years ago, so my memory is foggy, but OOM was exactly our problem.
![Sean avatar](https://secure.gravatar.com/avatar/b124653b19ee9dd438710a38954ed4a3.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0004-72.png)
Yeah. The pattern is to pair prometheus-operator with Thanos.
Our Prometheus isn’t used for long-term storage (a couple of days of retention), so it was simply the sheer amount of metric scraping (several hundred thousand metrics) that killed it.
We will trial options (a) and (b) to see which works best at this scale.