#sre (2020-02)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2020-02-01


Whether your team is practicing DevOps or traditional IT operations, here are some tips and tricks for writing effective runbook documentation when you aren’t a technical writer
2020-02-03

Those of you using the efs-provisioner helmfile for Prometheus on k8s, did you need to install nfs-utils on your worker nodes? @roth.andy @Erik Osterman (Cloud Posse)

Nope - here’s our kops manifest: https://github.com/cloudposse/reference-architectures/blob/master/templates/kops/kops-private-topology.yaml.gotmpl
Get up and running quickly with one of our reference architectures using our fully automated cold-start process. - cloudposse/reference-architectures

(we’re using kops debian)
2020-02-17

Interesting site idea: https://awesome-prometheus-alerts.grep.to/
Collection of alerting rules


So if I do happen to get apiserver high-utilization alerts, how would I track down what might be causing them?

“apiserver high utilization alerts” is pretty vague…

Yeah, you're telling me. I got Prometheus alerts for it and found I couldn't complete kubectl commands, but couldn't really find a cause of the issue. Even looking through local host logs for the kubelet was rather futile (which feels like it defeats the purpose of having managed Kubernetes, to me, btw)

Finally, going on a hunch, I rebooted one of the nodes and the API services stabilized. Solving things by gut instinct always feels kinda wrong, though.

Well, given I have 0 context on your setup/app/what you are talking about

First step would be to go check what metrics that alert is looking at…

Should be pretty obvious at least where the smell is coming from

Maybe not the root cause

alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m])
  / rate(apiserver_request_count[5m]) * 100 > 5
for: 10m
labels:
  severity: critical
annotations:
  description: API server returns errors for {{ $value }}% of requests

My question is how can I tell what (if anything) within a cluster might be causing a high request rate to the Kubernetes API server?

I have a feeling that a controller or deployment was going haywire, but there must be a quicker way to isolate such deployments, I'd think
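
One way to narrow that down (a sketch, assuming the label names on your Kubernetes version - apiserver_request_count carries client/verb/resource labels on older releases, and is renamed apiserver_request_total on newer ones): group the same metric the alert uses by client and see who is generating the traffic, e.g.

topk(10, sum by (client, verb, resource) (rate(apiserver_request_count[5m])))

The client label is usually the caller's user agent, so a controller or operator stuck in a hot retry loop tends to stand out at the top of that list.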

Here is another Prometheus rule that was triggering:

alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
  > 1
for: 10m
labels:
  severity: warning
annotations:
  description: the API server has a 99th percentile latency of {{ $value }} seconds
    for {{$labels.verb}} {{$labels.resource}}
  summary: API server high latency
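
Since that recorded rule already carries verb and resource labels, you can query it directly to see which calls are slow (assuming the recording rule exists under this name in your setup), e.g.

topk(10, apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"})

That at least tells you whether the latency is concentrated on a single verb/resource (often a LIST over a large object set) or spread across everything, which would point at the API server itself.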

Is this the k8s API, or your application API?

k8s api

I assume you have done the usual thing of Googling k8s API latency/issues?
2020-02-18

Folks, anybody have experience with distributed tracing tools like Zipkin or Jaeger? We are building an automated solution that would greatly reduce the time spent in searching for the right traces which point to the root cause. We would love to talk to you for 15 mins. Feel free to DM me if you are interested.
2020-02-21

@scorebot help keep tabs!

@scorebot has joined the channel

Thanks for adding me! Emojis used in this channel are now worth points.

Wondering what I can do? Try @scorebot help
2020-02-23

You can ask me things like:
@scorebot my score - Shows your points
@scorebot winning - Shows Leaderboard
@scorebot medals - Shows all Slack reactions with values
@scorebot = 40pts - Sets value of reaction
2020-02-27

Hi guys, anyone using Datadog? We are currently setting it up, and I am a bit confused about whether it needs to be paired with out-of-the-box Kubernetes monitoring tools like kube-state-metrics

We are running kops-based Kubernetes clusters, with Datadog agents deployed as DaemonSets. My question is, is there a benefit to pairing kube-state-metrics with the Datadog agents? The current dashboard is showing duplicate values for instance types (probably because the metrics are being gathered from two sources)
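
In general kube-state-metrics complements the agent rather than duplicating it: the agent's kubelet/node checks cover resource usage, while kube-state-metrics exposes object-level state (deployment replicas, pod phases, etc.) that the agent picks up via its kubernetes_state check. A minimal sketch of that check's config, assuming the kube-state-metrics service name and namespace shown here (adjust for your cluster; the file path can also vary by agent version):

# conf.d/kubernetes_state.d/conf.yaml
init_config:

instances:
  # point the agent's kubernetes_state check at your kube-state-metrics service
  - kube_state_url: http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics

If the dashboard shows duplicated values, it is worth checking whether the widget is mixing series from the agent's own checks with series coming from kube-state-metrics, and pinning it to one source.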