#prometheus (2023-08)


Archive: https://archive.sweetops.com/prometheus/

2023-08-29

Sean

How are y’all scaling prometheus-server when one instance can’t handle the huge volume of metrics and starts OOMing?

We are adopting Prometheus-Operator and debating between:

• a) “one Prometheus per namespace”: the benefit for us is that, with >100 namespaces and a few dozen teams, we can report per namespace and isolate impact to each one.

• b) “functional sharding”: Prometheus shard X scrapes all pods of Services A, B, and C, while shard Y scrapes pods of Services D, E, and F.

• c) “automatic sharding”: targets are assigned to Prometheus shards based on their addresses.
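
To make the difference between (b) and (c) concrete, here is a minimal sketch rather than anything prometheus-operator actually runs. The service names, addresses, and the `service_to_shard` mapping are made up for illustration. The hash-then-modulo step is only similar in spirit to Prometheus's `hashmod` relabel action and is not guaranteed to produce the same shard numbers.

```python
import hashlib
from collections import defaultdict


def hash_shard(address: str, num_shards: int) -> int:
    """Map a target address to a shard index via an MD5 hash modulo the shard count."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned integer (an assumption
    # made for this sketch; any stable hash would work for the illustration).
    return int.from_bytes(digest[8:], "big") % num_shards


def automatic_sharding(addresses, num_shards):
    """Option (c): each shard keeps only the targets whose address hashes to it."""
    shards = defaultdict(list)
    for address in addresses:
        shards[hash_shard(address, num_shards)].append(address)
    return dict(shards)


def functional_sharding(targets, service_to_shard):
    """Option (b): a hand-maintained mapping decides which shard scrapes which service."""
    shards = defaultdict(list)
    for service, address in targets:
        shards[service_to_shard[service]].append(address)
    return dict(shards)


# Hypothetical targets: (service name, pod address).
targets = [
    ("service-a", "10.0.0.1:9100"),
    ("service-a", "10.0.0.2:9100"),
    ("service-b", "10.0.1.1:9100"),
    ("service-d", "10.0.2.1:9100"),
    ("service-e", "10.0.3.1:9100"),
]

# Hypothetical mapping: shard "x" owns services A and B, shard "y" owns D and E.
service_to_shard = {
    "service-a": "x",
    "service-b": "x",
    "service-d": "y",
    "service-e": "y",
}

functional = functional_sharding(targets, service_to_shard)
automatic = automatic_sharding([address for _, address in targets], num_shards=2)

print("functional:", functional)
print("automatic:", automatic)
```

With functional sharding someone has to maintain the mapping and rebalance it as services grow. With automatic sharding the split needs no maintenance, but a single service's pods end up spread across shards, so queries need something that can read from all of them.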

Erik Osterman (Cloud Posse)

This is where Thanos comes in.

Erik Osterman (Cloud Posse)

(or VictoriaMetrics)

Erik Osterman (Cloud Posse)

Thanos - Highly available Prometheus setup with long term storage capabilities

Sean

Thanos is more for the storage side than the scraping. It is typically paired with prometheus-operator to do some sort of sharding.

Erik Osterman (Cloud Posse)

Yes, but in our past experience it eliminated the memory issues

Erik Osterman (Cloud Posse)

because it shards and aggregates

Erik Osterman (Cloud Posse)


1 instance can’t handle the huge amount of metrics (starts OOMing)

Erik Osterman (Cloud Posse)

this was our problem.

Erik Osterman (Cloud Posse)

We kept vertically scaling the pods until they were using 36 GB of RAM, and that got expensive. With Thanos, we didn’t need to vertically scale.

Erik Osterman (Cloud Posse)

This was ~3 years ago, so my memory is foggy, but OOM was exactly our problem.

Sean

Yeah. The pattern is to pair prometheus-operator with Thanos.

Our Prometheus isn’t used for long-term storage (a couple of days of retention), so it was simply the huge amount of metric scraping (several 100k) that killed it.

We will do a trial between options (a) and (b) to see what works best at huge scale.


2023-08-30

2023-08-31
