#sre (2020-02)

Prometheus, Prometheus Operator, Grafana, Kubernetes

Archive: https://archive.sweetops.com/monitoring/

2020-02-01

Erik Osterman (Cloud Posse)
Writing Runbook Documentation When You’re An SRE

Whether your team is practicing DevOps or traditional IT operations, here are some tips and tricks for writing effective runbook documentation when you aren’t a technical writer

2020-02-03

btai

those of you using efs-provisioner helmfile for prometheus on k8s, did you need to install nfs-utils on your worker nodes? @roth.andy @Erik Osterman (Cloud Posse)

Erik Osterman (Cloud Posse)
cloudposse/reference-architectures

Get up and running quickly with one of our reference architectures using our fully automated cold-start process.

Erik Osterman (Cloud Posse)

(we’re using kops debian)

2020-02-17

Zachary Loeber
Awesome Prometheus alerts

Collection of alerting rules

wattiez.morgan

cool! thanks

Zachary Loeber

So if I do happen to get apiserver high utilization alerts, how would I track down what might be causing it?

joshmyers

“apiserver high utilization alerts” is pretty vague…

Zachary Loeber

Yeah, you are telling me. I got prometheus alerts for it and found I couldn’t complete kubectl commands but couldn’t really find a cause of the issues. Even looking through local host logs for kubelet was rather futile (which feels like defeating the purpose of having managed kubernetes to me btw)

Zachary Loeber

finally going on a hunch I rebooted one of the nodes and stabilized the api services. Solving things by gut instinct always feels kinda wrong though

joshmyers

Well, given I have 0 context on your setup/app/what you are talking about

joshmyers

First step would be to go check what metrics that alert is looking at…

joshmyers

Should be pretty obvious at least where the smell is coming from

joshmyers

Maybe not the root cause

Zachary Loeber
alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m])
  / rate(apiserver_request_count[5m]) * 100 > 5
for: 10m
labels:
  severity: critical
annotations:
  description: API server returns errors for {{ $value }}% of requests
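
As a rough illustration of the calculation this rule appears to be aiming for (5xx request rate as a percentage of all requests, compared against a threshold of 5), here is a minimal Python sketch. The function name and the sample rates are hypothetical, and the rates are assumed to already be summed per status code; this is not how Prometheus itself evaluates the expression.

```python
def error_percentage(rates_by_code):
    """Return the share of 5xx responses as a percentage.

    rates_by_code maps an HTTP status code string to a request rate
    (requests per second), already summed across all other labels.
    """
    total = sum(rates_by_code.values())
    if total == 0:
        # No traffic at all: report 0 rather than dividing by zero.
        return 0.0
    errors = sum(
        rate
        for code, rate in rates_by_code.items()
        if len(code) == 3 and code.startswith("5")
    )
    return errors * 100 / total


# Hypothetical per-code rates, for illustration only.
sample_rates = {"200": 90.0, "404": 4.0, "500": 4.0, "503": 2.0}
pct = error_percentage(sample_rates)
would_fire = pct > 5  # same threshold as the rule above
print(f"{pct:.1f}% errors, above threshold: {would_fire}")
```
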
Zachary Loeber

My question is how can I tell what (if anything) within a cluster might be causing a high request rate to the kubernetes api services?

Zachary Loeber

I have a feeling that a controller or deployment was going haywire but there must be a quicker way to isolate such deployments I’d think
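
One possible way to narrow this down is to break the API request rate out by caller and rank the callers. The Python sketch below shows the idea on two hypothetical counter snapshots. It assumes the scraped request counter carries a label identifying the client, which may not hold on every Kubernetes version; the client names, label layout, and numbers are made up for illustration.

```python
from collections import defaultdict


def top_clients(before, after, interval_seconds, limit=3):
    """Rank clients by request rate between two counter snapshots.

    before/after map a (client, verb, resource) tuple to a cumulative
    request count. Returns (client, requests_per_second) pairs, busiest first.
    """
    per_client = defaultdict(float)
    for labels, end_value in after.items():
        start_value = before.get(labels, 0.0)
        delta = end_value - start_value
        if delta < 0:
            # Counter went backwards (e.g. a restart): count from zero.
            delta = end_value
        client = labels[0]
        per_client[client] += delta / interval_seconds
    ranked = sorted(per_client.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical snapshots taken 300 seconds apart.
snapshot_before = {
    ("kubelet", "GET", "nodes"): 1000.0,
    ("runaway-controller", "LIST", "pods"): 500.0,
    ("kube-scheduler", "WATCH", "pods"): 200.0,
}
snapshot_after = {
    ("kubelet", "GET", "nodes"): 1300.0,
    ("runaway-controller", "LIST", "pods"): 30500.0,
    ("kube-scheduler", "WATCH", "pods"): 230.0,
}

for client, rate in top_clients(snapshot_before, snapshot_after, 300):
    print(f"{client}: {rate:.1f} req/s")
```

A client whose rate dwarfs the others is a reasonable first suspect, though it only shows where the load comes from, not why.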

Zachary Loeber

here is another prometheus rule that was triggering:

Zachary Loeber
alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
  > 1
for: 10m
labels:
  severity: warning
annotations:
  description: the API server has a 99th percentile latency of {{ $value }} seconds
    for {{$labels.verb}} {{$labels.resource}}
  summary: API server high latency
joshmyers

Is this the k8s API, or your application API?

Zachary Loeber

k8s api

joshmyers

I assume you have done the usual thing of Googling k8s API latency/issues?

2020-02-18

Arjun Iyer

Folks, anybody have experience with distributed tracing tools like Zipkin or Jaeger? We are building an automated solution that would greatly reduce the time spent in searching for the right traces which point to the root cause. We would love to talk to you for 15 mins. Feel free to DM me if you are interested.

2020-02-21

Erik Osterman (Cloud Posse)

@scorebot help keep tabs!

scorebot
05:49:28 PM

@scorebot has joined the channel

scorebot
05:49:29 PM

Thanks for adding me! Emojis used in this channel are now worth points.

scorebot
05:49:30 PM

Wondering what I can do? try @scorebot help

2020-02-23

scorebot
12:22:13 PM

You can ask me things like:
@scorebot my score - Shows your points
@scorebot winning - Shows Leaderboard
@scorebot medals - Shows all Slack reactions with values
@scorebot = 40pts - Sets value of reaction

2020-02-27

grv

Hi guys, anyone using Datadog? We are currently setting it up, and I am a bit confused about whether it needs to be paired with out-of-the-box Kubernetes monitoring tools like kube-state-metrics

grv

We are running kops-based Kubernetes clusters, with Datadog agents deployed as DaemonSets. My question is: is there a benefit to pairing kube-state-metrics with the Datadog agents? The current dashboard is showing duplicate values for instance types (probably because metrics are being gathered from two sources)
