#sre (2021-03)

Prometheus, Prometheus Operator, Grafana, Kubernetes

Archive: https://archive.sweetops.com/monitoring/

2021-03-02

2021-03-04

Patrick Jahns

Are you guys aware of any JSON logging format standard other than the Elastic Common Schema (https://www.elastic.co/what-is/ecs)? I've been searching a bit but haven't found anything more vendor-neutral so far. Also, the OpenTelemetry spec is, from my point of view, quite open on this aspect: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#log-and-event-record-definition

open-telemetry/opentelemetry-specification

Specifications for OpenTelemetry. Contribute to open-telemetry/opentelemetry-specification development by creating an account on GitHub.
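
For readers unfamiliar with what an ECS-style JSON log line looks like, here is a minimal sketch using only the Python standard library. It emits a small subset of ECS field names (`@timestamp`, `log.level`, `log.logger`, `message`); the class name `EcsStyleFormatter` and the logger name `checkout` are made up for illustration, and this is not a complete or validated ECS implementation.

```python
import json
import logging
from datetime import datetime, timezone


class EcsStyleFormatter(logging.Formatter):
    """Render each log record as one JSON object using ECS-style field names.

    Hypothetical helper for illustration; it covers only a handful of fields.
    """

    def format(self, record: logging.LogRecord) -> str:
        doc = {
            # ECS uses "@timestamp" for the event time (ISO 8601, UTC here).
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "log.level": record.levelname.lower(),
            "log.logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(doc)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(EcsStyleFormatter())
    logger = logging.getLogger("checkout")  # assumed logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("order %s placed", "42")
```

Keeping the field names in one formatter means a later switch to another schema (for example the OpenTelemetry log data model) would only touch this class, not every call site.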

Meb

When I checked, the SDKs didn't seem mature: https://opentelemetry.io/docs/js/ That's the main issue; they need time to become production-ready. Some vendors are moving in too.


2021-03-05

Eric Berg

Regarding custom metrics (we're an AWS/k8s/Datadog shop): I'm trying to get ahead of my developers on how to represent ratios of successful or failed requests/events. For example, we have a routine for which we want to track success/failure as well as latency.

One approach is to have a single metric for all of these events and add a tag for result, where the values are success and fail.

Another approach is to have discrete metrics for the success and failure counts…and maybe another one for the total number of requests.

I’d rather have separate metrics for success, failure, and one for a total number of requests.

Thanks for any input you have on this.

kskewes


It's recommended to have a failure metric and a total metric:
https://www.robustperception.io/existential-issues-with-metrics
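
To make the two layouts from this thread concrete, here is a minimal sketch in plain Python. It uses in-memory counters rather than any real Datadog or Prometheus client, and the metric names (`routine.requests`, `routine.requests.total`, `routine.requests.failed`) are hypothetical. It shows that both layouts yield the same failure ratio; the difference is in how the ratio is assembled.

```python
from collections import Counter

# Layout A: one metric name, with a "result" tag distinguishing outcomes.
tagged = Counter()

# Layout B: a total counter plus a failure counter (names are hypothetical).
separate = Counter()


def record(ok: bool) -> None:
    """Record one request outcome in both layouts for comparison."""
    tagged[("routine.requests", "success" if ok else "fail")] += 1
    separate["routine.requests.total"] += 1
    if not ok:
        separate["routine.requests.failed"] += 1


def failure_ratio_tagged() -> float:
    # The total has to be rebuilt by summing over every tag value.
    total = sum(count for (name, _), count in tagged.items()
                if name == "routine.requests")
    return tagged[("routine.requests", "fail")] / total if total else 0.0


def failure_ratio_separate() -> float:
    # The total is always present, so the ratio is a single division.
    total = separate["routine.requests.total"]
    return separate["routine.requests.failed"] / total if total else 0.0


for outcome in (True, True, False, True):
    record(outcome)

print(failure_ratio_tagged(), failure_ratio_separate())
```

In a real backend, a counter that has never been incremented may not exist as a series yet, so a ratio built from a total that is always written tends to be easier to query than one that depends on both a success series and a failure series being present.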

bradym

We're currently testing out an ELK stack deployed via AWS Elasticsearch, and I'm having a heck of a time understanding what permissions I'd need to give engineers so they can do things like create saved searches, visualizations, and notebooks. Anyone know a good reference for this? Maybe I'm just missing it somehow, but I've not been able to find anything like this in the documentation. Not sure if this is the best place to ask; if there's somewhere better, please let me know.

2021-03-30

Andrew Nazarov

Has anybody tried this service: https://www.netdata.cloud/ ? I don't see what the catch is; I couldn't find any pricing.

Netdata - Monitor everything in real time for free with Netdata

Open-source, distributed, real-time, performance and health monitoring for systems and applications. Instantly diagnose slowdowns and anomalies in your infrastructure with thousands of metrics, interactive visualizations, and insightful health alarms.

Lee Skillen

Haven’t personally used it, but on the sign-in page (https://app.netdata.cloud/) it says:
Netdata Cloud is offered completely free of charge with no limits on the number of nodes, metrics or team members.

In the future, we’ll be offering complementary paid services for advanced user control and auditing, increased metadata retention, and enterprise plugins. The best is yet to come.

Lee Skillen

So it looks like it is currently free, but may be monetised at some point later (if uptake proves successful, I suppose). I've heard of Netdata though, and it looks Quite Nice.

Rashid Boyko

I will take a

Rashid Boyko

I wonder where this netdata-claim.sh script is?

Rashid Boyko
The step-by-step Netdata guide | Learn Netdata

Welcome to Netdata! We’re glad you’re interested in our health monitoring and performance troubleshooting system.

andrea.pavan

I used it in the past to monitor VerneMQ inside a k8s cluster and also on some VMs. It's a very interesting tool, and its per-second resolution is a really cool feature. It's easy to install on single machines, but setting up what's needed for long-term persistent storage is more difficult. Sadly, I never tried their managed cloud, but it should make some admin tasks easier compared to a self-managed on-prem instance.
