Anyone aware of a tool that can take kubernetes action based on prometheus alerts/metrics?
I’d like to kill a pod when it stops processing, which is tracked by a prometheus metric
Thank you Erik. I don’t think HPA autoscaler will help, because I don’t want to scale to zero, I just want the pods replaced/restarted. I’ll take a look at flagger, but I’m not sure if I can use it to kill/restart the pod
@Nelson Jeppesen it sounds like it may be better to describe the problem you’re trying to solve rather than the technical implementation of a solution
Ok, that’s a very good point. I have some logstash pods. After somewhere between 2 and 20 days, a pod will stop processing logs (writing to elasticsearch). We alert on this and kill the pods manually. As of yet, we’ve not been able to figure out why this happens, as we see no errors
Yes I can see why this would be useful
Let me see what I can find
Thank you very much
So one idea… and you will see where I am going with this
So I presume you have AlertManager hooked up with Prometheus operator
Alert manager can fire webhooks
Now, if your CI/CD system like Jenkins or Codefresh supports webhooks, you can then trigger a job to nuke the pod
Oh interesting; yeah, we’ve got alertmanager hooked up. That’s an interesting proposal. We do have Jenkins set up, so that could work
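A rough sketch of the webhook idea above, cutting out the CI/CD hop: a tiny receiver that accepts Alertmanager's webhook POST and deletes the pods named in firing alerts. Assumptions, none of which come from the thread: the alert carries `namespace` and `pod` labels, `kubectl` is on the PATH with permission to delete pods, and the names `pods_to_restart`, `AlertHandler`, `serve` and port 9095 are made up for illustration.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def pods_to_restart(payload):
    """Return (namespace, pod) pairs for every firing alert in the payload."""
    targets = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Assumption: the alerting rule preserves the `namespace` and `pod` labels.
        if alert.get("status") == "firing" and "pod" in labels:
            targets.append((labels.get("namespace", "default"), labels["pod"]))
    return targets


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for namespace, pod in pods_to_restart(payload):
            # Deleting the pod lets its controller (e.g. a Deployment) replace it.
            subprocess.run(
                ["kubectl", "delete", "pod", pod, "-n", namespace],
                check=False,
            )
        self.send_response(204)
        self.end_headers()


def serve(port=9095):
    """Run the receiver; the port is arbitrary (hypothetical choice)."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener; an Alertmanager webhook receiver would then be pointed at that address. Resolved alerts and alerts without a `pod` label are ignored, so only the stalled pods get deleted.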
Kubernetes keeps applications running while you’re asleep: This is mostly thanks to the “Readiness and Liveness Probes”. If you don’t know about them, read this cool article. This article is about some health check patterns I have seen in…
You can also modify the deployment to include a sidecar healthcheck endpoint
That endpoint would then query Prometheus
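As a minimal sketch of what such an endpoint might do behind the scenes: run an instant query against the Prometheus HTTP API (`/api/v1/query`) and call the pod healthy only if the returned rate is above a threshold. The function names, the threshold and any PromQL expression passed in are assumptions for illustration, not something from the thread.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def is_processing(response, min_rate=0.0):
    """True if any series in an instant-query response has a value above min_rate."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return any(float(sample["value"][1]) > min_rate for sample in results)


def check(base_url, promql, timeout=5):
    """Fetch the query result; any fetch or parse error counts as unhealthy."""
    try:
        with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as resp:
            return is_processing(json.load(resp))
    except (OSError, ValueError, KeyError):
        return False
```

One design caveat: `check` fails closed, so a Prometheus outage would make every pod look unhealthy. Returning `True` on fetch errors may be the safer choice if restarts are expensive.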
hmmm; I wonder if with a sidecar I could bypass prometheus altogether. A simple check for rate over 5min or something
is there any disadvantage to just killing the pod every couple of hours / days? Like the bookkeeping of which logs have already been sent? We had a similar issue and we just added
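The "rate over 5min" check without Prometheus could look roughly like this: the sidecar periodically reads an events-processed counter from the application and reports unhealthy once the counter has not moved for a full window. How the counter is read is left out on purpose; `StallDetector` and its parameters are hypothetical names for this sketch.

```python
import time


class StallDetector:
    """Reports unhealthy when an events-processed counter stops changing."""

    def __init__(self, window_seconds=300, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_value = None
        self.last_change = clock()

    def observe(self, value):
        """Record the latest counter reading; any change resets the stall timer."""
        if value != self.last_value:
            self.last_value = value
            self.last_change = self.clock()

    def healthy(self):
        """True while the counter has changed within the window."""
        return self.clock() - self.last_change < self.window
```

A sidecar would call `observe()` on a timer and return 200 or 500 from a small HTTP handler based on `healthy()`. Since containers in a pod share a network namespace, the main container's liveness probe can point at the sidecar's port, so the kubelet restarts the stalled container itself.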
timeout 3600 npm run start as the cmd which would restart the pod every hour
For example, “we have a problem with our application where the services stop processing customer orders. The health checks are all okay, but if we look at prometheus we see it is no longer doing anything. If we restart the service, everything is fine. So I want to implement something that restarts our service based on metrics in Prometheus”
Hey @Erik Osterman (Cloud Posse) here is a comparison with Thanos. It is slightly biased, since it comes from the core developer of VictoriaMetrics.
“Comparing Thanos to VictoriaMetrics cluster” by Aliaksandr Valialkin https://link.medium.com/ltvYvhvrj2
Thanos and VictoriaMetrics provide long-term storage and global query view for Prometheus. The article compares these solutions
This is great! Thanks for sharing @Steve Boardwell!
IMO this is the downside. Why be in the business of managing EBS disks.
I like the simpler architecture of VictoriaMetrics, but the big appeal for me regarding Thanos was the S3 backend. Basically, “set it and forget it”
yeh S3 for storage is a much lower ops burden and much more straightforward to lifecycle - the last thing i want my ops tooling to do is unduly increase the burden of operations
i’d be interested to see cortex in that comparison
Great points - I hadn’t considered the overhead involved in managing the EBS disks. Cortex looks interesting as well, thanks Chris.