#sre (2020-02)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2020-02-01


Whether your team is practicing DevOps or traditional IT operations, here are some tips and tricks for writing effective runbook documentation when you aren’t a technical writer
2020-02-03

Those of you using the efs-provisioner helmfile for Prometheus on k8s, did you need to install nfs-utils on your worker nodes? @roth.andy @Erik Osterman (Cloud Posse)

Nope - here’s our kops manifest: https://github.com/cloudposse/reference-architectures/blob/master/templates/kops/kops-private-topology.yaml.gotmpl
Get up and running quickly with one of our reference architectures using our fully automated cold-start process. - cloudposse/reference-architectures

(we’re using kops debian)
2020-02-17

Interesting site idea: https://awesome-prometheus-alerts.grep.to/
Collection of alerting rules


So if I do happen to get apiserver high-utilization alerts, how would I track down what might be causing them?

“apiserver high utilization alerts” is pretty vague…

Yeah, you're telling me. I got Prometheus alerts for it and found I couldn't complete kubectl commands, but couldn't really find a cause of the issue. Even looking through local host logs for the kubelet was rather futile (which feels like it defeats the purpose of having managed Kubernetes, to me, btw)

Finally, going on a hunch, I rebooted one of the nodes and the API services stabilized. Solving things by gut instinct always feels kinda wrong, though.

Well, given I have 0 context on your setup/app/what you are talking about

First step would be to go check what metrics that alert is looking at…

Should be pretty obvious at least where the smell is coming from

Maybe not the root cause

alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m])
  / rate(apiserver_request_count[5m]) * 100 > 5
for: 10m
labels:
  severity: critical
annotations:
  description: API server returns errors for {{ $value }}% of requests

My question is how can I tell what (if anything) within a cluster might be causing a high request rate to the Kubernetes API server?

I have a feeling that a controller or deployment was going haywire, but there must be a quicker way to isolate such deployments, I'd think
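
One way to narrow that down (a sketch, assuming the label names on your Kubernetes version - apiserver_request_count carries client/verb/resource labels on older releases, and is renamed apiserver_request_total on newer ones): group the same metric the alert uses by client and see who is generating the traffic, e.g.

topk(10, sum by (client, verb, resource) (rate(apiserver_request_count[5m])))

The client label is usually the caller's user agent, so a controller or operator stuck in a hot retry loop tends to stand out at the top of that list.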

Here is another Prometheus rule that was triggering:

alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
  > 1
for: 10m
labels:
  severity: warning
annotations:
  description: the API server has a 99th percentile latency of {{ $value }} seconds
    for {{$labels.verb}} {{$labels.resource}}
  summary: API server high latency
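
Since that recorded rule already carries verb and resource labels, you can query it directly to see which calls are slow (assuming the recording rule exists under this name in your setup), e.g.

topk(10, apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"})

That at least tells you whether the latency is concentrated on a single verb/resource (often a LIST over a large object set) or spread across everything, which would point at the API server itself.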

Is this the k8s API, or your application API?

k8s api

I assume you have done the usual thing of Googling k8s API latency/issues?
2020-02-18

Folks, anybody have experience with distributed tracing tools like Zipkin or Jaeger? We are building an automated solution that would greatly reduce the time spent in searching for the right traces which point to the root cause. We would love to talk to you for 15 mins. Feel free to DM me if you are interested.
2020-02-21

@scorebot help keep tabs!

@scorebot has joined the channel

Thanks for adding me! Emojis used in this channel are now worth points.

Wondering what I can do? Try @scorebot help
2020-02-23

You can ask me things like:
@scorebot my score - Shows your points
@scorebot winning - Shows Leaderboard
@scorebot medals - Shows all Slack reactions with values
@scorebot = 40pts - Sets value of reaction
2020-02-27

Hi guys, anyone using Datadog? We are currently setting it up, and I am a bit confused about whether it needs to be paired with out-of-the-box Kubernetes monitoring tools like kube-state-metrics

We are running kops-based Kubernetes clusters, with Datadog agents deployed as DaemonSets. My question is, is there a benefit to pairing kube-state-metrics with the Datadog agents? The current dashboard is showing duplicate values for instance types (probably because the metrics are being gathered from two sources)
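
In general kube-state-metrics complements the agent rather than duplicating it: the agent's kubelet/node checks cover resource usage, while kube-state-metrics exposes object-level state (deployment replicas, pod phases, etc.) that the agent picks up via its kubernetes_state check. A minimal sketch of that check's config, assuming the kube-state-metrics service name and namespace shown here (adjust for your cluster; the file path can also vary by agent version):

# conf.d/kubernetes_state.d/conf.yaml
init_config:

instances:
  # point the agent's kubernetes_state check at your kube-state-metrics service
  - kube_state_url: http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics

If the dashboard shows duplicated values, it is worth checking whether the widget is mixing series from the agent's own checks with series coming from kube-state-metrics, and pinning it to one source.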