SweetOps #sre for December, 2019

Archive: https://archive.sweetops.com/monitoring/

2019-12-10

Steve Boardwell

Hey @Erik Osterman (Cloud Posse), here is a slightly biased comparison with Thanos, written by the core developer of VictoriaMetrics.

“Comparing Thanos to VictoriaMetrics cluster” by Aliaksandr Valialkin https://link.medium.com/ltvYvhvrj2

Thanos and VictoriaMetrics provide long-term storage and global query view for Prometheus. The article compares these solutions

Erik Osterman (Cloud Posse)

05:04:53 AM

This is great! Thanks for sharing @Steve Boardwell!

Erik Osterman (Cloud Posse)

05:09:41 AM

Erik Osterman (Cloud Posse)

05:09:54 AM

IMO this is the downside. Why be in the business of managing EBS disks?

Erik Osterman (Cloud Posse)

05:12:11 AM

I like the simpler architecture of VictoriaMetrics, but the big appeal for me regarding Thanos was the S3 backend. Basically, “set it and forget it”

Chris Fowles

05:31:41 AM

yeh S3 for storage is a much lower ops burden and much more straightforward to lifecycle - the last thing i want my ops tooling to do is unduly increase the burden of operations

Chris Fowles

05:32:19 AM

i’d be interested to see cortex in that comparison

Steve Boardwell

07:07:53 AM

Great points - I hadn’t considered the overhead involved in managing the EBS disks. Cortex looks interesting as well, thanks Chris.

2019-12-23

Nelson Jeppesen

07:35:19 PM

Anyone aware of a tool that can take kubernetes action based on prometheus alerts/metrics?

Nelson Jeppesen

07:35:47 PM

I’d like to kill a pod when it stops processing, which is tracked by a prometheus metric

Arjun Iyer

11:51:38 PM

Hi Nelson, we’re a startup focused on automating actions (diagnostics or repair) within k8s. Would you be available for a 30-minute demo? Would love to get your thoughts.

Erik Osterman (Cloud Posse)

07:42:20 PM

Depending on what you want to achieve this can do that: https://github.com/weaveworks/flagger

weaveworks/flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments) - weaveworks/flagger

Erik Osterman (Cloud Posse)

07:43:37 PM

Or use something like https://github.com/stefanprodan/k8s-prom-hpa to scale replicas to zero

stefanprodan/k8s-prom-hpa

Kubernetes Horizontal Pod Autoscaler with Prometheus custom metrics - stefanprodan/k8s-prom-hpa

Nelson Jeppesen

12:26:23 AM

Thank you Erik. I don’t think HPA autoscaler will help, because I don’t want to scale to zero, I just want the pods replaced/restarted. I’ll take a look at flagger, but I’m not sure if I can use it to kill/restart the pod

Erik Osterman (Cloud Posse)

01:32:50 AM

@Nelson Jeppesen it sounds like it may be better to describe the problem you’re trying to solve than the technical implementation of a solution

Nelson Jeppesen

01:34:16 AM

Ok, that’s a very good point. I have some logstash pods. After somewhere between 2 and 20 days, a pod will stop processing logs (writing to elasticsearch). We alert on this and kill the pods manually. As of yet, we’ve not been able to figure out why this happens, as we see no errors

Erik Osterman (Cloud Posse)

01:36:13 AM

Ah cool

Erik Osterman (Cloud Posse)

01:36:21 AM

Yes I can see why this would be useful

Erik Osterman (Cloud Posse)

01:36:26 AM

Let me see what I can find

Nelson Jeppesen

01:43:32 AM

Thank you very much

Erik Osterman (Cloud Posse)

01:48:28 AM

So one idea… and you will see where I am going with this

Erik Osterman (Cloud Posse)

01:48:45 AM

So I presume you have AlertManager hooked up with Prometheus operator

Erik Osterman (Cloud Posse)

01:48:55 AM

Alert manager can fire webhooks

Erik Osterman (Cloud Posse)

01:50:02 AM

Now, if your CI/CD system, like Jenkins or Codefresh, supports webhooks, you can then trigger a job to nuke the pod

Nelson Jeppesen

01:53:14 AM

Oh interesting; yeah, we’ve got alertmanager hooked up. That’s an interesting proposal. We do have a Jenkins setup that could work
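The webhook idea above can be sketched roughly as follows. This is only an illustration, not what anyone in the thread actually ran: the alert name `LogstashStalled` is hypothetical, and it assumes the alert carries `namespace` and `pod` labels. The functions take an already-parsed Alertmanager webhook body and work out which pods a triggered job would delete; actually running the commands is left to the job.

```python
def pods_to_restart(payload, alert_name="LogstashStalled"):
    """Return (namespace, pod) pairs for firing alerts named alert_name.

    `payload` is the parsed JSON body of an Alertmanager webhook call.
    The alert name and the `namespace`/`pod` labels are assumptions.
    """
    targets = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue  # ignore resolved alerts
        if labels.get("alertname") != alert_name:
            continue  # ignore unrelated alerts sent to the same receiver
        namespace, pod = labels.get("namespace"), labels.get("pod")
        if namespace and pod and (namespace, pod) not in targets:
            targets.append((namespace, pod))
    return targets


def delete_commands(targets):
    """Build one `kubectl delete pod` argument list per target (not executed here)."""
    return [
        ["kubectl", "delete", "pod", pod, "--namespace", namespace]
        for namespace, pod in targets
    ]


if __name__ == "__main__":
    # Hand-written example payload, trimmed to the fields used above.
    example = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {
                    "alertname": "LogstashStalled",
                    "namespace": "logging",
                    "pod": "logstash-0",
                },
            }
        ],
    }
    for command in delete_commands(pods_to_restart(example)):
        print(" ".join(command))
```

Deleting the pod is enough when it is managed by a Deployment or StatefulSet, since the controller creates a replacement.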

Erik Osterman (Cloud Posse)

01:54:07 AM

https://ahmet.im/blog/advanced-kubernetes-health-checks/

Advanced Health Check Patterns in Kubernetes

Kubernetes keeps applications running while you’re asleep: This is mostly thanks to the “Readiness and Liveness Probes”. If you don’t know about them, read this cool article. This article is about some health check patterns I have seen in…

Erik Osterman (Cloud Posse)

01:54:39 AM

You can also modify the deployment to include a sidecar healthcheck endpoint

Erik Osterman (Cloud Posse)

01:54:51 AM

That endpoint would then query Prometheus
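A minimal sketch of such a sidecar endpoint, under several assumptions: the Prometheus address, the metric name `logstash_events_out_total`, the `pod` label, and port 8080 are all hypothetical. It uses Prometheus's instant-query HTTP API (`/api/v1/query`) and answers 200 or 503 so that a liveness probe pointed at it would restart the pod when its rate drops to zero.

```python
import json
import os
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: in-cluster address


def build_query(pod):
    # Hypothetical metric and label names; adjust to whatever is exported.
    return 'rate(logstash_events_out_total{pod="%s"}[5m])' % pod


def is_processing(response, min_rate=0.0):
    """True if every series in an instant-query response is above min_rate."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    if not results:
        return False  # no series for this pod yet, or the metric vanished
    # Instant-vector samples look like [timestamp, "value-as-string"].
    return all(float(sample["value"][1]) > min_rate for sample in results)


def query_prometheus(query, base_url=PROMETHEUS_URL, timeout=5):
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        # In Kubernetes, HOSTNAME defaults to the pod name.
        pod = os.environ.get("HOSTNAME", "")
        try:
            healthy = is_processing(query_prometheus(build_query(pod)))
        except (OSError, ValueError):
            # Fail open: a Prometheus outage should not restart every pod.
            healthy = True
        self.send_response(200 if healthy else 503)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8080), Health).serve_forever()
```

Because `rate()` needs at least two scrapes, a freshly started pod returns no series at first, so the probe's `initialDelaySeconds` would need to cover that warm-up period.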

Nelson Jeppesen

01:56:40 AM

hmmm; I wonder if with a sidecar I could bypass prometheus altogether. A simple check for rate over 5min or something

Erik Osterman (Cloud Posse)

01:57:13 AM

Ya possibly…
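The "rate over 5min" check without Prometheus could look something like the sketch below. It is an assumption-heavy illustration: where the processed-event count comes from is left open, and the class name and 300-second window are made up for the example. The sidecar would sample the counter periodically and report a stall only once a full window has passed with no increase.

```python
from collections import deque


class RateWindow:
    """Track a monotonically increasing counter over a sliding time window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, count) pairs, oldest first

    def observe(self, timestamp, count):
        # A lower count than before means the process restarted: start over.
        if self.samples and count < self.samples[-1][1]:
            self.samples.clear()
        self.samples.append((timestamp, count))
        # Drop old samples, keeping one at or before the window start so the
        # remaining samples still span the whole window.
        cutoff = timestamp - self.window_seconds
        while len(self.samples) >= 2 and self.samples[1][0] <= cutoff:
            self.samples.popleft()

    def rate(self):
        """Events per second between oldest and newest sample, or None."""
        if len(self.samples) < 2:
            return None
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        if t1 <= t0:
            return None
        return (c1 - c0) / (t1 - t0)

    def stalled(self):
        """True only when a full window has elapsed with no new events."""
        if len(self.samples) < 2:
            return False
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        return (t1 - t0) >= self.window_seconds and c1 == c0


if __name__ == "__main__":
    window = RateWindow(window_seconds=300)
    for timestamp, count in [(0, 100), (150, 250), (300, 250), (450, 250)]:
        window.observe(timestamp, count)
        print(timestamp, window.rate(), window.stalled())
```

Requiring a full window before reporting a stall avoids restarting a pod that has only just started or is briefly idle.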

Pierre Humberdroz

03:21:49 AM

is there any disadvantage to just killing the pod every couple of hours / days? Like the bookkeeping of which logs have already been sent? We had a similar issue and we just added timeout 3600 npm run start as the cmd, which would restart the pod every hour

Erik Osterman (Cloud Posse)

01:35:07 AM

For example, “we have a problem with our application where the services stop processing customer orders. The health checks are all okay, but if we look at prometheus we see it is no longer doing anything. If we restart the service, everything is fine. So I want to implement something that restarts our service based on metrics in Prometheus”
