#monitoring (2021-05)

Prometheus, Prometheus Operator, Grafana, Kubernetes

Archive: https://archive.sweetops.com/monitoring/

2021-05-31

Partha

Hi all, please help with this Elasticsearch problem: report.CRITICAL: {“error”[{“type”“Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on
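The error text above names two possible fixes: aggregate or sort on a keyword field, or enable fielddata on the text field. A minimal sketch of both request bodies follows, assuming the index uses Elasticsearch's default dynamic mapping (which adds a `.keyword` sub-field to string fields). The field name `status` and both helper functions are hypothetical, since the original message does not say which field is affected.

```python
import json


def keyword_terms_agg(field, size=10):
    """Build a search body that aggregates on the `.keyword` sub-field.

    Assumes the default dynamic mapping created `<field>.keyword`;
    if the index has a custom mapping, that sub-field may not exist.
    """
    return {
        "size": 0,  # return only aggregation buckets, no hits
        "aggs": {
            f"by_{field}": {
                "terms": {"field": f"{field}.keyword", "size": size}
            }
        },
    }


def enable_fielddata_mapping(field):
    """Build a mapping update body that turns on fielddata for a text field.

    This is the alternative the error message mentions. Fielddata is held
    in heap memory, so the keyword approach is usually the safer choice.
    """
    return {"properties": {field: {"type": "text", "fielddata": True}}}


if __name__ == "__main__":
    # "status" is a placeholder field name, not taken from the original error.
    print(json.dumps(keyword_terms_agg("status"), indent=2))
    print(json.dumps(enable_fielddata_mapping("status"), indent=2))
```

The first body would be sent to the index's `_search` endpoint and the second to its `_mapping` endpoint. Both are only dictionaries here, so nothing is sent to a cluster.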


2021-05-30

2021-05-28

btai

what other vendored infra (Kubernetes) monitoring solutions are people using, other than Datadog?

Issif

take a look at Sysdig Monitor

Issif

(I don’t like Datadog either)

Chris Fowles

what don’t you like about datadog?

Chris Fowles

we’re loving it - so I’d love to know if there’s something down the track that’s going to hurt

Issif

• discovering new AWS resources can take up to 30 min

• you have to use a personal token (with all rights) for automation with Terraform

• graphing capabilities are far behind Grafana’s

• you have to mute the whole monitor for maintenance, not only the subsets that match certain labels (maybe that’s no longer the case)

• when you combine 2 metrics (e.g. A/B), the evaluation time window for A and B is not the same

Michael Warkentin

For #1 you can decrease your polling interval or use the new CloudWatch Metric Streams for near-real-time metrics

btai

we used Sysdig Monitor for a while 2+ years back. It caused outages because of kernel panics, and their agents were pretty resource intensive (high memory usage, but that’s probably the case for all vendors). Regardless of the improvements they’ve probably made over the last two years, our eng leadership (and I as well) are probably still too sour about the kernel panics to go with them again.

2021-05-12

Erik Osterman (Cloud Posse)

A nice article about the philosophy of alerting by Rob Ewaschuk, based on his observations while he was a Site Reliability Engineer at Google: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#
