SweetOps #sre for May, 2021

Archive: https://archive.sweetops.com/monitoring/

2021-05-12

Erik Osterman (Cloud Posse)

https://sweetops.slack.com/archives/CHDR1EWNA/p1620241466170000

A nice article about the philosophy of alerting by Rob Ewaschuk, based on his observations while he was a Site Reliability Engineer at Google: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#

2021-05-28

btai

08:54:10 PM

what other vendor-provided infra (kubernetes) monitoring solutions are people using, not named datadog?

Issif

09:04:12 PM

take a look at Sysdig Monitor

Issif

09:04:26 PM

(I don’t like Datadog either)

Chris Fowles

11:21:56 PM

what don’t you like about datadog?

Chris Fowles

11:22:10 PM

we’re loving it - so i’d love to know if there’s something down the track that’s going to ouch

Issif

07:27:48 AM

• discovering new AWS resources can take up to 30 min

• you have to use a personal token (with all rights) for automation with Terraform

• graphing capabilities are far behind Grafana’s

• you have to mute the whole monitor for maintenance, not only the subsets that match labels (maybe it’s not like that anymore)

• when you combine 2 metrics (e.g. A/B), the time window for evaluation of A and B is not the same

Michael Warkentin

11:35:35 AM

For #1 you can decrease your polling interval or use the new CloudWatch Metric Streams for near real-time

btai

06:14:26 PM

we used sysdig monitor for a while, 2+ years back. it caused outages because of kernel panics, and their agents were pretty resource intensive (high mem usage, but that’s prob the case for all vendors). Regardless of the improvements they’ve prob made over the last two years, our eng leadership (and me as well) are probably still too sour about those kernel panics to go with them again.

2021-05-30

2021-05-31

Partha

09:08:38 AM

Hi all, please help with this Elasticsearch problem: report.CRITICAL: {“error”[{“type”“Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on
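
The error text itself names the usual fix: aggregate or sort on a keyword field rather than on the analysed text field. Below is a minimal sketch of a terms aggregation request body. It assumes the text field also has a `.keyword` sub-field, which Elasticsearch's default dynamic mapping adds for strings. The field name `status` and the helper `terms_agg_body` are hypothetical and do not come from the message above.

```python
import json


def terms_agg_body(field: str, size: int = 10) -> dict:
    """Build a terms aggregation body that targets a keyword field.

    Assumption: `field` is a text field with a `.keyword` sub-field,
    as created by the default dynamic mapping for strings.
    """
    # Aggregations need per-document values, which keyword fields provide.
    if field.endswith(".keyword"):
        keyword_field = field
    else:
        keyword_field = field + ".keyword"
    return {
        "size": 0,  # return only the aggregation buckets, no search hits
        "aggs": {
            "by_value": {
                "terms": {"field": keyword_field, "size": size},
            },
        },
    }


if __name__ == "__main__":
    # `status` is an illustrative field name, not one from the original report.
    print(json.dumps(terms_agg_body("status"), indent=2))
```

The resulting body would be sent to the index's `_search` endpoint. If the mapping has no keyword sub-field, the error message's alternative is to enable `fielddata` on the text field, which can use significant memory.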
