SweetOps #sre for March, 2020

Archive: https://archive.sweetops.com/monitoring/

2020-03-17

Erik Osterman (Cloud Posse)

https://news.google.com/articles/CBMiYWh0dHBzOi8vd3d3LnpkbmV0LmNvbS9hcnRpY2xlL2dvb2dsZS10aGlzLWlzLXdoYXQtY2F1c2VkLWNwdS10aHJvdHRsaW5nLWF0LW91ci1jbG91ZC1kYXRhLWNlbnRlci_SAWxodHRwczovL3d3dy56ZG5ldC5jb20vZ29vZ2xlLWFtcC9hcnRpY2xlL2dvb2dsZS10aGlzLWlzLXdoYXQtY2F1c2VkLWNwdS10aHJvdHRsaW5nLWF0LW91ci1jbG91ZC1kYXRhLWNlbnRlci8?hl=en-US&gl=US&ceid=US%3Aen

Google: This is what caused CPU throttling at our cloud data center | ZDNet

Google says crushed rack wheels busted a cooling system, causing CPU performance to be throttled.

2020-03-25

btai

06:41:55 AM

prometheus-operator users: how much memory have you seen your prometheus operator consume?

Erik Osterman (Cloud Posse)

06:45:27 AM

A lot! I think we have allocated 14-16G

Vincent Fiset

03:16:12 PM

On my side it's 3Gi on a small cluster… I guess it depends on the cluster size and the number of metrics generated

btai

10:04:12 PM

cool thanks guys. I think I may end up having it on its own k8s worker node

btai

10:04:36 PM

still a ton cheaper than the ~$4k a month we spend on sysdig

2020-03-26

Vincent Fiset

03:17:45 PM

Hi folks, what’s the right way to handle the KubeletDown alert that comes with prometheus operator on a public cloud where nodes get replaced at times?

    - alert: KubeletDown
      annotations:
        message: Kubelet has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown
      expr: |
        absent(up{job="kubelet", metrics_path="/metrics"} == 1)
      for: 15m
      labels:
        severity: critical
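
For readers following the rule above: `absent(...)` in PromQL returns a result only when the inner vector is empty, so the expression is true only when no kubelet target at all reports `up == 1`. The sketch below models that behaviour in Python. It is a simplified illustration, not Prometheus code: the function names, the one-minute evaluation interval, and the instance names are assumptions made up for the example, and the `for: 15m` handling only approximates how Prometheus tracks pending alerts.

```python
from typing import Dict, List


def absent_up(samples: Dict[str, float]) -> bool:
    """Simplified model of absent(up{job="kubelet"} == 1).

    `samples` maps a (hypothetical) instance name to its `up` value.
    The result is True only when no instance reports up == 1.
    """
    return not any(value == 1 for value in samples.values())


def alert_fires(history: List[Dict[str, float]],
                for_seconds: int = 900,
                interval_seconds: int = 60) -> bool:
    """Approximate the `for: 15m` clause.

    `history` holds one snapshot per rule evaluation, oldest first.
    The alert fires when the expression has been true for every
    evaluation across a window of at least `for_seconds`.
    """
    consecutive = 0
    for snapshot in reversed(history):
        if not absent_up(snapshot):
            break
        consecutive += 1
    if consecutive == 0:
        return False
    return (consecutive - 1) * interval_seconds >= for_seconds


# One node replaced while another kubelet stays up: expression is false.
one_node_gone = {"node-a": 1}
# Every kubelet target missing or down: expression is true.
all_gone: Dict[str, float] = {}

print(absent_up(one_node_gone))        # False
print(absent_up(all_gone))             # True
print(alert_fires([all_gone] * 16))    # True
print(alert_fires([all_gone] * 15))    # False
```

Under this model, replacing a single node does not make the expression true as long as at least one other kubelet is still scraped successfully; it takes every kubelet target being absent or down for the full window.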

2020-03-27

Erik Osterman (Cloud Posse)

10:05:04 PM

Adding @discourse_forum bot

discourse_forum

10:05:07 PM

@discourse_forum has joined the channel

2020-03-28

sheldonh

03:34:47 PM

What’s your preferred APM platform (no AppDynamics)? I need container support, .NET, Java, and more. I want to simplify telemetry and monitoring metrics into a central service and give the business a self-service telemetry metrics source so it’s all centralized.

Ideally I want a system that automatically pulls in AWS tags on instances too, so I can stop writing complicated Chocolatey packages for configuring the app.

Right now my gut feeling is that SignalFx (which can be managed with Terraform too) and Datadog are the promising solutions.


Prometheus, Prometheus Operator, Grafana, Kubernetes
