#sre (2020-02)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2020-02-01


Whether your team is practicing DevOps or traditional IT operations, here are some tips and tricks for writing effective runbook documentation when you aren’t a technical writer
2020-02-03

Those of you using the efs-provisioner helmfile for Prometheus on k8s, did you need to install nfs-utils on your worker nodes? @roth.andy @Erik Osterman (Cloud Posse)

Nope - here’s our kops manifest: https://github.com/cloudposse/reference-architectures/blob/master/templates/kops/kops-private-topology.yaml.gotmpl
Get up and running quickly with one of our reference architectures using our fully automated cold-start process. - cloudposse/reference-architectures

(we’re using kops debian)
2020-02-17

Interesting site idea: https://awesome-prometheus-alerts.grep.to/
Collection of alerting rules


So if I do happen to get apiserver high-utilization alerts, how would I track down what might be causing them?

“apiserver high utilization alerts” is pretty vague…

Yeah, you're telling me. I got Prometheus alerts for it and found I couldn't complete kubectl commands, but couldn't really find a cause of the issue. Even looking through local host logs for the kubelet was rather futile (which feels like it defeats the purpose of having managed Kubernetes, to me, btw)

Finally, going on a hunch, I rebooted one of the nodes and the API services stabilized. Solving things by gut instinct always feels kinda wrong, though.

Well, given I have 0 context on your setup/app/what you are talking about

First step would be to go check what metrics that alert is looking at…

Should be pretty obvious at least where the smell is coming from

Maybe not the root cause

alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m])
  / rate(apiserver_request_count[5m]) * 100 > 5
for: 10m
labels:
  severity: critical
annotations:
  description: API server returns errors for {{ $value }}% of requests

My question is how can I tell what (if anything) within a cluster might be causing a high request rate to the Kubernetes API server?

I have a feeling that a controller or deployment was going haywire, but there must be a quicker way to isolate such deployments, I'd think
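
One way to narrow that down (a sketch, assuming the label names on your Kubernetes version - apiserver_request_count carries client/verb/resource labels on older releases, and is renamed apiserver_request_total on newer ones): group the same metric the alert uses by client and see who is generating the traffic, e.g.

topk(10, sum by (client, verb, resource) (rate(apiserver_request_count[5m])))

The client label is usually the caller's user agent, so a controller or operator stuck in a hot retry loop tends to stand out at the top of that list.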

Here is another Prometheus rule that was triggering:

alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
  > 1
for: 10m
labels:
  severity: warning
annotations:
  description: the API server has a 99th percentile latency of {{ $value }} seconds
    for {{$labels.verb}} {{$labels.resource}}
  summary: API server high latency
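
Since that recorded rule already carries verb and resource labels, you can query it directly to see which calls are slow (assuming the recording rule exists under this name in your setup), e.g.

topk(10, apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"})

That at least tells you whether the latency is concentrated on a single verb/resource (often a LIST over a large object set) or spread across everything, which would point at the API server itself.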

Is this the k8s API, or your application API?

k8s api

I assume you have done the usual thing of Googling k8s API latency/issues?
2020-02-18

Folks, anybody have experience with distributed tracing tools like Zipkin or Jaeger? We are building an automated solution that would greatly reduce the time spent in searching for the right traces which point to the root cause. We would love to talk to you for 15 mins. Feel free to DM me if you are interested.
2020-02-21

@scorebot help keep tabs!

@scorebot has joined the channel

Thanks for adding me! Emojis used in this channel are now worth points.

Wondering what I can do? Try @scorebot help
2020-02-23

You can ask me things like:
@scorebot my score - Shows your points
@scorebot winning - Shows Leaderboard
@scorebot medals - Shows all Slack reactions with values
@scorebot = 40pts - Sets value of reaction
2020-02-27

Hi guys, anyone using Datadog? We are currently setting it up, and I am a bit confused about whether it needs to be paired with out-of-the-box Kubernetes monitoring tools like kube-state-metrics

We are running kops-based Kubernetes clusters, with Datadog agents deployed as DaemonSets. My question is, is there a benefit to pairing kube-state-metrics with the Datadog agents? The current dashboard is showing duplicate values for instance types (probably because the metrics are being gathered from two sources)
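
In general kube-state-metrics complements the agent rather than duplicating it: the agent's kubelet/node checks cover resource usage, while kube-state-metrics exposes object-level state (deployment replicas, pod phases, etc.) that the agent picks up via its kubernetes_state check. A minimal sketch of that check's config, assuming the kube-state-metrics service name and namespace shown here (adjust for your cluster; the file path can also vary by agent version):

# conf.d/kubernetes_state.d/conf.yaml
init_config:

instances:
  # point the agent's kubernetes_state check at your kube-state-metrics service
  - kube_state_url: http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics

If the dashboard shows duplicated values, it is worth checking whether the widget is mixing series from the agent's own checks with series coming from kube-state-metrics, and pinning it to one source.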