#sre (2019-12)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2019-12-10
data:image/s3,"s3://crabby-images/99f15/99f1562dda5fb206568ace3c8ae64868e805308a" alt="Steve Boardwell avatar"
Hey @Erik Osterman (Cloud Posse), here is a comparison with Thanos. It is slightly biased, since it comes from the core developer of VictoriaMetrics.
“Comparing Thanos to VictoriaMetrics cluster” by Aliaksandr Valialkin https://link.medium.com/ltvYvhvrj2
Thanos and VictoriaMetrics provide long-term storage and global query view for Prometheus. The article compares these solutions
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
This is great! Thanks for sharing @Steve Boardwell!
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
IMO this is the downside. Why be in the business of managing EBS disks?
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
I like the simpler architecture of VictoriaMetrics, but the big appeal for me regarding Thanos was the S3 backend. Basically, “set it and forget it”
data:image/s3,"s3://crabby-images/9f7d3/9f7d37e6df4fb280d718c728e563fdba7ce5b9ba" alt="Chris Fowles avatar"
yeh S3 for storage is a much lower ops burden and much more straightforward to lifecycle - the last thing I want my ops tooling to do is unduly increase the burden of operations
data:image/s3,"s3://crabby-images/9f7d3/9f7d37e6df4fb280d718c728e563fdba7ce5b9ba" alt="Chris Fowles avatar"
I’d be interested to see Cortex in that comparison
data:image/s3,"s3://crabby-images/99f15/99f1562dda5fb206568ace3c8ae64868e805308a" alt="Steve Boardwell avatar"
Great points - I hadn’t considered the overhead involved in managing the EBS disks. Cortex looks interesting as well, thanks Chris.
2019-12-23
data:image/s3,"s3://crabby-images/77573/775736e2a1eb9753b309e3adf8e46283f2484067" alt="Nelson Jeppesen avatar"
Is anyone aware of a tool that can take Kubernetes actions based on Prometheus alerts/metrics?
data:image/s3,"s3://crabby-images/77573/775736e2a1eb9753b309e3adf8e46283f2484067" alt="Nelson Jeppesen avatar"
I’d like to kill a pod when it stops processing, which is tracked by a Prometheus metric
data:image/s3,"s3://crabby-images/73042/73042259a5ebd28330c42d198f2225fd309bd24f" alt="Arjun Iyer avatar"
Hi Nelson, we’re a startup focused on automating actions (diagnostics or repair) within k8s. Would you be available for a 30-minute demo? Would love to get your thoughts.
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Depending on what you want to achieve this can do that: https://github.com/weaveworks/flagger
Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments) - weaveworks/flagger
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Or use something like https://github.com/stefanprodan/k8s-prom-hpa to scale replicas to zero
Kubernetes Horizontal Pod Autoscaler with Prometheus custom metrics - stefanprodan/k8s-prom-hpa
data:image/s3,"s3://crabby-images/77573/775736e2a1eb9753b309e3adf8e46283f2484067" alt="Nelson Jeppesen avatar"
Thank you Erik. I don’t think the HPA will help, because I don’t want to scale to zero, I just want the pods replaced/restarted. I’ll take a look at flagger, but I’m not sure if I can use it to kill/restart the pod
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
@Nelson Jeppesen it sounds like it may be better to describe the problem you’re trying to solve rather than the technical implementation of a solution
data:image/s3,"s3://crabby-images/77573/775736e2a1eb9753b309e3adf8e46283f2484067" alt="Nelson Jeppesen avatar"
Ok, that’s a very good point. I have some Logstash pods. After somewhere between 2 and 20 days, a pod will stop processing logs (writing to Elasticsearch). We alert on this and kill the pods manually. As of yet, we’ve not been able to figure out why this happens, as we see no errors
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Ah cool
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Yes I can see why this would be useful
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Let me see what I can find
data:image/s3,"s3://crabby-images/77573/775736e2a1eb9753b309e3adf8e46283f2484067" alt="Nelson Jeppesen avatar"
Thank you very much
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
So one idea… and you will see where I am going with this
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
So I presume you have AlertManager hooked up with Prometheus operator
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Alert manager can fire webhooks
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Now, if your CI/CD system, like Jenkins or Codefresh, supports webhooks, you can then trigger a job to nuke the pod
data:image/s3,"s3://crabby-images/77573/775736e2a1eb9753b309e3adf8e46283f2484067" alt="Nelson Jeppesen avatar"
Oh interesting; yeah, we’ve got Alertmanager hooked up. That’s an interesting proposal. We do have a Jenkins setup that could work
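A minimal sketch of the receiving end of that idea, assuming a small standalone webhook receiver rather than Jenkins itself. It relies on the Alertmanager webhook payload carrying an `alerts` list whose entries have a `status` and a `labels` map. The alert name `LogstashStalled`, the presence of `namespace` and `pod` labels, the port, and the `restart_pod` placeholder are all assumptions for illustration, not something from this thread.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ALERT_NAME = "LogstashStalled"  # hypothetical alert name


def pods_to_restart(payload, alert_name=ALERT_NAME):
    """Return sorted, de-duplicated (namespace, pod) pairs for firing alerts."""
    if not isinstance(payload, dict):
        return []
    targets = set()
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue  # ignore resolved alerts
        if labels.get("alertname") != alert_name:
            continue  # ignore unrelated alerts
        pod = labels.get("pod")  # assumes the alert rule keeps a "pod" label
        if pod:
            targets.add((labels.get("namespace", "default"), pod))
    return sorted(targets)


def restart_pod(namespace, pod):
    # Placeholder action: a real setup would trigger a CI job or call the
    # Kubernetes API here. This sketch only logs the intent.
    print(f"would delete pod {namespace}/{pod}")


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        for namespace, pod in pods_to_restart(payload):
            restart_pod(namespace, pod)
        self.send_response(204)
        self.end_headers()


def serve(port=8080):
    """Run the receiver until interrupted (port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the receiver; Alertmanager would then be pointed at it through a webhook receiver in its configuration. Keeping the label filtering in `pods_to_restart` separate from the HTTP handler makes it easy to check against a sample payload before wiring in a destructive action.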
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Kubernetes keeps applications running while you’re asleep: This is mostly thanks to the “Readiness and Liveness Probes”. If you don’t know about them, read this cool article. This article is about some health check patterns I have seen in…
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
You can also modify the deployment to include a sidecar healthcheck endpoint
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
That endpoint would then query Prometheus
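A rough sketch of what such a sidecar check might do, assuming it asks the Prometheus HTTP API (`/api/v1/query`) for a 5-minute rate and treats a zero or missing result as unhealthy. The Prometheus address, the metric name `logstash_events_out_total`, and the pod label value are placeholders, not names taken from this conversation.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus:9090"  # assumed in-cluster address
# Hypothetical metric and label; substitute whatever your exporter exposes.
QUERY = 'sum(rate(logstash_events_out_total{pod="logstash-0"}[5m]))'


def is_processing(response, min_rate=0.0):
    """True if the instant-query response has any sample above min_rate."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    # Each instant-vector sample is {"metric": {...}, "value": [ts, "number"]}.
    return any(float(sample["value"][1]) > min_rate for sample in results)


def query_prometheus(expr, base_url=PROMETHEUS_URL, timeout=5):
    """Run an instant query and return the decoded JSON response."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

The health endpoint would return success when `is_processing(query_prometheus(QUERY))` is true and a failure status otherwise. One design caveat: if the probe also fails whenever Prometheus is unreachable, a Prometheus outage would restart healthy pods, so it may be safer to treat query errors as healthy.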
data:image/s3,"s3://crabby-images/77573/775736e2a1eb9753b309e3adf8e46283f2484067" alt="Nelson Jeppesen avatar"
hmmm; I wonder if, with a sidecar, I could bypass Prometheus altogether. A simple check for rate over 5min or something
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
Ya possibly…
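The "rate over 5min" check without Prometheus could look roughly like this: the sidecar samples a monotonically increasing counter and reports a stall when the counter has not moved across a full window. Where the counter comes from is left open here; reading the events-out count from Logstash's own monitoring API is one possibility, but that is an assumption, and `StallDetector` is a hypothetical helper name.

```python
import time
from collections import deque


class StallDetector:
    """Flags a stall when a counter has not increased over a full window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, counter) pairs, oldest first

    def observe(self, counter, now=None):
        now = time.monotonic() if now is None else now
        if self.samples and counter < self.samples[-1][1]:
            # Counter went backwards: the process restarted, so start over.
            self.samples.clear()
        self.samples.append((now, counter))
        # Drop samples that are older than needed to span one window.
        while len(self.samples) >= 2 and now - self.samples[1][0] >= self.window_seconds:
            self.samples.popleft()

    def is_stalled(self):
        if len(self.samples) < 2:
            return False
        oldest_time, oldest_count = self.samples[0]
        newest_time, newest_count = self.samples[-1]
        if newest_time - oldest_time < self.window_seconds:
            return False  # not enough history to judge yet
        return newest_count == oldest_count
```

A sidecar would call `observe()` on a timer and expose `is_stalled()` through an HTTP endpoint used as the liveness probe. Note that a legitimately idle pipeline (no incoming logs) looks the same as a hung one to this check, so it only fits workloads that always have traffic.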
data:image/s3,"s3://crabby-images/662c3/662c3185b944a7d273fbaa7d61c4a971edb10194" alt="Pierre Humberdroz avatar"
Is there any disadvantage to just killing the pod every couple of hours/days? For example, the bookkeeping of which logs have already been sent? We had a similar issue and we just used timeout 3600 npm run start as the cmd, which restarts the pod every hour
data:image/s3,"s3://crabby-images/9a0f8/9a0f8d41476ffe9065fbe0b98227d0cdcaa0cd11" alt="Erik Osterman (Cloud Posse) avatar"
For example, “we have a problem with our application where the services stop processing customer orders. The health checks are all okay, but if we look at Prometheus we see it is no longer doing anything. If we restart the service, everything is fine. So I want to implement something that restarts our service based on metrics in Prometheus”