#prometheus (2023-08)
Archive: https://archive.sweetops.com/prometheus/
2023-08-29
![Sean avatar](https://secure.gravatar.com/avatar/b124653b19ee9dd438710a38954ed4a3.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0004-72.png)
How are y’all scaling prometheus-server when one instance can’t handle the huge amount of metrics (so it starts OOMing)?
We are adopting Prometheus-Operator and debating between:
• a) “one prometheus per namespace”: the benefit for us is that with >100 namespaces and a few dozen teams, we get per-namespace reporting and can isolate the impact of any one instance to its namespace.
• b) “functional sharding”: the Prometheus shard X scrapes all pods of Service A, B and C while shard Y scrapes pods from Service D, E and F.
• c) “automatic sharding”: targets are assigned to Prometheus shards based on their addresses (see the sketch below).
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
This is where Thanos comes in.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
(or Victoria Metrics)
![Sean avatar](https://secure.gravatar.com/avatar/b124653b19ee9dd438710a38954ed4a3.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0004-72.png)
Thanos is more for the storage side, not the scraping. It’s typically paired with prometheus-operator to do some form of sharding.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
Yes, but in our past experience it eliminates the memory issues
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
because it shards and aggregates
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
> 1 instance can’t handle the huge amount of metrics (starts OOMing)
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
this was our problem.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
We kept vertically scaling the pods until they were using 36 GB of RAM, and that got expensive. With Thanos, we didn’t need to vertically scale.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
This was ~3 years ago, so my memory is foggy, but OOM was exactly our problem.
![Sean avatar](https://secure.gravatar.com/avatar/b124653b19ee9dd438710a38954ed4a3.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0004-72.png)
Yeah. The pattern is to pair prometheus-operator with Thanos.
Our Prometheus isn’t used for long-term storage (a couple of days of retention), so it was simply the sheer amount of metric scraping (several hundred thousand metrics) that killed it.
We will trial options (a) and (b) to see which works best at this scale.