#sre (2021-05)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2021-05-12
A nice article about the philosophy of alerting by Rob Ewaschuk, based on his observations while he was a Site Reliability Engineer at Google: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#
2021-05-28
what other vendor-provided infra (Kubernetes) monitoring solutions are people using, other than Datadog?
take a look at Sysdig Monitor
(I don’t like Datadog either)
what don’t you like about datadog?
we’re loving it, so I’d love to know if there’s something down the track that’s going to hurt
• discovering new AWS resources can take up to 30 min
• you have to use a personal token (with all rights) for automation with Terraform
• the graphing capabilities are far behind Grafana’s
• you have to mute the whole monitor for maintenance, not only the subsets that match labels (maybe that’s no longer the case)
• when you combine 2 metrics (e.g. A/B), the evaluation time window for A and B is not the same
For #1, you can decrease your polling interval or use the new CloudWatch Metric Streams for near-real-time metrics
we used Sysdig Monitor for a while, 2+ years back. It caused outages because of kernel panics, and their agents were pretty resource intensive (high mem usage, but this is prob the case for all vendors). Regardless of the improvements they’ve prob made over the last two years, our eng leadership (and me as well) are probably still too sour about those kernel panics to go with them again.
2021-05-30
2021-05-31
Hi all, please help with this Elasticsearch problem: report.CRITICAL: {"error":[{"type":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on -
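That error usually means an aggregation or sort is targeting a `text` field. A minimal sketch of the "use a keyword field instead" route the message suggests, assuming a hypothetical field called `status` (the real field name is cut off in the error above) and the common pattern of a `keyword` sub-field. It only builds the request bodies and does not talk to a cluster:

```python
import json

# Hypothetical field name: the error message above is truncated
# before it names the actual field.
FIELD = "status"


def keyword_mapping(field: str) -> dict:
    """Index body that maps `field` as text with a `keyword` sub-field."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    # The sub-field is addressed as "<field>.keyword".
                    "fields": {
                        "keyword": {"type": "keyword", "ignore_above": 256}
                    },
                }
            }
        }
    }


def terms_aggregation(field: str, size: int = 10) -> dict:
    """Search body that aggregates on the keyword sub-field, not the text field."""
    return {
        "size": 0,  # return only aggregation buckets, no hits
        "aggs": {
            f"by_{field}": {
                "terms": {"field": f"{field}.keyword", "size": size}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(keyword_mapping(FIELD), indent=2))
    print(json.dumps(terms_aggregation(FIELD), indent=2))
```

If the index already has a `<field>.keyword` sub-field (dynamic mapping often creates one for strings), pointing the aggregation or sort at it may be enough. Setting `fielddata=true` on the text field is the other option the message mentions, but it loads field data onto the heap and can use a lot of memory.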