#sre (2019-12)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2019-12-10

Hey @Erik Osterman (Cloud Posse), here is a comparison of VictoriaMetrics with Thanos. It is slightly biased, since it comes from the core developer of VictoriaMetrics.
“Comparing Thanos to VictoriaMetrics cluster” by Aliaksandr Valialkin https://link.medium.com/ltvYvhvrj2

Thanos and VictoriaMetrics provide long-term storage and global query view for Prometheus. The article compares these solutions.

This is great! Thanks for sharing @Steve Boardwell!


IMO this is the downside. Why be in the business of managing EBS disks?

I like the simpler architecture of VictoriaMetrics, but the big appeal for me regarding Thanos was the S3 backend. Basically, “set it and forget it”

yeh S3 for storage is a much lower ops burden and much more straightforward to lifecycle - the last thing i want my ops tooling to do is unduly increase the burden of operations

i’d be interested to see cortex in that comparison

Great points - I hadn’t considered the overhead involved in managing the EBS disks. Cortex looks interesting as well, thanks Chris.
2019-12-23

Anyone aware of a tool that can take kubernetes action based on prometheus alerts/metrics?

I’d like to kill a pod when it stops processing, which is tracked by a prometheus metric

Hi Nelson, we’re a startup focused on automating actions (diagnostics or repair) within k8s. Would you be available for a 30-minute demo? Would love to get your thoughts.

Depending on what you want to achieve this can do that: https://github.com/weaveworks/flagger
Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments) - weaveworks/flagger

Or use something like https://github.com/stefanprodan/k8s-prom-hpa to scale replicas to zero
Kubernetes Horizontal Pod Autoscaler with Prometheus custom metrics - stefanprodan/k8s-prom-hpa

Thank you Erik. I don’t think the HPA will help, because I don’t want to scale to zero, I just want the pods replaced/restarted. I’ll take a look at flagger, but I’m not sure if I can use it to kill/restart the pod.

@Nelson Jeppesen it sounds like it may be better to describe the problem you’re trying to solve rather than the technical implementation of a solution

Ok, that’s a very good point. I have some logstash pods. After somewhere between 2 and 20 days, a pod will stop processing logs (writing to elasticsearch). We alert on this and kill the pods manually. As of yet, we’ve not been able to figure out why this happens, as we see no errors

Ah cool

Yes I can see why this would be useful

Let me see what I can find

Thank you very much

So one idea… and you will see where I am going with this

So I presume you have AlertManager hooked up with Prometheus operator

Alert manager can fire webhooks

Now, if your CI/CD system (like Jenkins or Codefresh) supports webhooks, you can then trigger a job to nuke the pod

Oh interesting; yeah, we’ve got alertmanager hooked up. That’s an interesting proposal. We do have a Jenkins setup that could work
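
For illustration, the webhook idea above could also be handled by a small receiver instead of a CI job. This is only a sketch under stated assumptions: the alert rule is assumed to keep `namespace` and `pod` labels on the alert, the receiver is assumed to have `kubectl` and suitable RBAC permissions, and the names `pods_to_restart`, `delete_pod`, `AlertHandler`, `serve`, and port 9095 are hypothetical. It relies on the Alertmanager webhook payload carrying an `alerts` list whose entries have `status` and `labels`.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def pods_to_restart(payload):
    """Return sorted, de-duplicated (namespace, pod) pairs from firing alerts."""
    targets = set()
    for alert in payload.get("alerts", []):
        # Resolved alerts are ignored; only firing ones trigger an action.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        # Assumption: the alerting rule preserves `namespace` and `pod` labels.
        if "namespace" in labels and "pod" in labels:
            targets.add((labels["namespace"], labels["pod"]))
    return sorted(targets)


def delete_pod(namespace, pod):
    """Delete the pod; its controller (e.g. a Deployment) creates a replacement."""
    subprocess.run(
        ["kubectl", "delete", "pod", pod, "--namespace", namespace],
        check=True,
    )


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for namespace, pod in pods_to_restart(payload):
            delete_pod(namespace, pod)
        self.send_response(200)
        self.end_headers()


def serve(port=9095):
    """Blocking call; point an Alertmanager webhook receiver at this port."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. Routing only the "stopped processing" alert to this receiver keeps unrelated alerts from deleting pods.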

Kubernetes keeps applications running while you’re asleep: This is mostly thanks to the “Readiness and Liveness Probes”. If you don’t know about them, read this cool article. This article is about some health check patterns I have seen in…

You can also modify the deployment to include a sidecar healthcheck endpoint

That endpoint would then query Prometheus
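
A rough sketch of such a sidecar endpoint, assuming Prometheus is reachable at a hypothetical `http://prometheus:9090` and that a hypothetical counter `logstash_events_out_total` tracks processed events. It uses the Prometheus instant-query endpoint `/api/v1/query`; the function and class names are made up for this example. The main container's liveness probe could then point at the sidecar's port, since containers in a pod share a network namespace.

```python
import json
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler

# Hypothetical values: adjust the URL, metric name, and labels to your setup.
PROMETHEUS_URL = "http://prometheus:9090"
QUERY = 'sum(rate(logstash_events_out_total{pod="logstash-0"}[5m]))'


def is_processing(response, min_rate=0.0):
    """True if the instant-query response has a sample above min_rate."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    # Each vector sample looks like {"metric": {...}, "value": [ts, "1.5"]}.
    return any(float(sample["value"][1]) > min_rate for sample in results)


def query_prometheus(base_url, query):
    """Run an instant query against the Prometheus HTTP API."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            healthy = is_processing(query_prometheus(PROMETHEUS_URL, QUERY))
        except (OSError, ValueError):
            # Fail open: a Prometheus outage should not restart every pod.
            healthy = True
        self.send_response(200 if healthy else 503)
        self.end_headers()
```

Serving `HealthHandler` with `http.server.HTTPServer` exposes the check. Failing open on query errors is a design choice here, not a requirement.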

hmmm; I wonder if with a sidecar I could bypass prometheus altogether. A simple check for rate over 5min or something
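
One way the "rate over 5min" check could look without Prometheus, as a minimal sketch. It assumes the sidecar can periodically read some monotonically increasing "events processed" counter from the application; how that counter is obtained is left open, and the class name `RateWindow` is hypothetical.

```python
from collections import deque


class RateWindow:
    """Tracks counter samples and reports progress over a sliding window."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, counter_value) pairs, oldest first

    def observe(self, timestamp, counter_value):
        self.samples.append((timestamp, counter_value))
        # Keep exactly one sample at or before the window start, so the
        # oldest retained sample still spans the full window.
        cutoff = timestamp - self.window_seconds
        while len(self.samples) >= 2 and self.samples[1][0] <= cutoff:
            self.samples.popleft()

    def rate(self):
        """Average events per second across retained samples, or None."""
        if len(self.samples) < 2:
            return None
        (t0, v0), (t1, v1) = self.samples[0], self.samples[-1]
        if t1 <= t0:
            return None
        return (v1 - v0) / (t1 - t0)

    def is_stalled(self):
        """True only after a full window has passed with no counter progress."""
        if len(self.samples) < 2:
            return False
        (t0, v0), (t1, v1) = self.samples[0], self.samples[-1]
        return (t1 - t0) >= self.window_seconds and v1 == v0
```

A health endpoint could return a failing status while `is_stalled()` is true. Requiring a full window before declaring a stall avoids restarting a pod that has only just started.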

Ya possibly…

Is there any disadvantage to just killing the pod every couple of hours / days? Like the bookkeeping of which logs have already been sent? We had a similar issue and we just added `timeout 3600 npm run start` as the cmd, which would restart the pod every hour

For example, “we have a problem with our application where the services stop processing customer orders. The health checks are all okay, but if we look at prometheus we see it is no longer doing anything. If we restart the service, everything is fine. So I want to implement something that restarts our service based on metrics in Prometheus.”