SweetOps #sre for March, 2021

Archive: https://archive.sweetops.com/monitoring/

2021-03-02

2021-03-04

Patrick Jahns

Are you guys aware of any JSON logging format standard other than the Elastic Common Schema (https://www.elastic.co/what-is/ecs)? I’ve been searching a bit but haven’t found anything more vendor-neutral so far. The OpenTelemetry spec is, from my point of view, also quite open on this aspect: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#log-and-event-record-definition

open-telemetry/opentelemetry-specification

Specifications for OpenTelemetry. Contribute to open-telemetry/opentelemetry-specification development by creating an account on GitHub.
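For reference, a minimal sketch of what an ECS-style JSON log line could look like when emitted from Python's standard logging module. The field names (@timestamp, log.level, message, log.logger, service.name) are meant to follow ECS naming and should be checked against the schema reference; the formatter class, the logger name, and the service name are hypothetical and exist only for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class EcsStyleFormatter(logging.Formatter):
    """Render each log record as one JSON object using a few ECS-style field names."""

    def __init__(self, service_name: str) -> None:
        super().__init__()
        self.service_name = service_name  # hypothetical service identifier

    def format(self, record: logging.LogRecord) -> str:
        # record.created is a Unix timestamp; emit it as UTC ISO 8601.
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        document = {
            "@timestamp": timestamp.isoformat(),
            "log.level": record.levelname.lower(),
            "message": record.getMessage(),
            "log.logger": record.name,
            "service.name": self.service_name,
        }
        return json.dumps(document)


def format_example() -> dict:
    """Format one hand-built record and parse the resulting JSON line back into a dict."""
    record = logging.LogRecord(
        name="checkout",
        level=logging.WARNING,
        pathname="example.py",
        lineno=1,
        msg="payment declined for %s",
        args=("order-42",),
        exc_info=None,
    )
    line = EcsStyleFormatter(service_name="shop-api").format(record)
    return json.loads(line)


if __name__ == "__main__":
    print(format_example())
```

The same record could be mapped to the OpenTelemetry log data model instead by renaming the keys, which is part of why a shared field vocabulary matters more than the JSON encoding itself.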

Meb

08:04:19 PM

When I checked the SDKs, they didn’t seem mature: https://opentelemetry.io/docs/js/ That is the main issue; they need time to become production ready. Some vendors are moving in too.

2021-03-05

Eric Berg

03:35:16 PM

Regarding custom metrics (we’re an AWS/k8s/Datadog shop), I’m trying to get ahead of my developers on how to model cases where I want to show ratios of successful or failed requests/events. For example, we have a routine for which we want to track success/failure as well as latency.

One approach is to have a single metric for all of these events and add a result tag whose values are success and fail.

Another approach is to have discrete metrics for the success and failure counts…and maybe another one for the total number of requests.

I’d rather have separate metrics for success and failure, plus one for the total number of requests.

Thanks for any input you have on this.

kskewes

10:30:09 PM

It’s recommended to have a failure metric and a total metric: https://www.robustperception.io/existential-issues-with-metrics
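A minimal sketch of the failure-plus-total approach, in plain Python rather than any particular metrics client. The class and field names are hypothetical. The point it illustrates is that both counters exist at zero from the start, so the failure ratio is well defined before the first failure ever happens, whereas a series keyed by a result tag typically does not exist until that result first occurs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoutineMetrics:
    """Hypothetical in-process counters for one routine."""

    requests_total: int = 0           # every request, successful or not
    requests_failed_total: int = 0    # only the failed requests
    latencies_seconds: List[float] = field(default_factory=list)

    def observe(self, succeeded: bool, latency_seconds: float) -> None:
        """Record one request outcome together with its latency."""
        self.requests_total += 1
        if not succeeded:
            self.requests_failed_total += 1
        self.latencies_seconds.append(latency_seconds)

    def failure_ratio(self) -> float:
        """Failures divided by total; 0.0 when nothing has been observed yet."""
        if self.requests_total == 0:
            return 0.0
        return self.requests_failed_total / self.requests_total


if __name__ == "__main__":
    metrics = RoutineMetrics()
    for ok, latency in [(True, 0.12), (True, 0.08), (False, 0.95), (True, 0.10)]:
        metrics.observe(ok, latency)
    print(metrics.failure_ratio())
```

The success count is not stored separately because it can be derived as total minus failed, which keeps the two exported counters from drifting apart.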

bradym

06:09:10 PM

We’re currently testing out an ELK stack deployed via AWS Elasticsearch, and I’m having a heck of a time understanding what permissions I’d need to give engineers so they can do things like create saved searches, visualizations, and notebooks. Anyone know a good reference for this? Maybe I’m just missing it somehow, but I’ve not been able to find anything like this in the documentation. I’m not sure this is the best place to ask; if there’s somewhere better, please let me know.

2021-03-30

Andrew Nazarov

07:06:21 PM

Has anybody tried this service: https://www.netdata.cloud/ ? I don’t get what the catch is; I couldn’t find any pricing.

Netdata - Monitor everything in real time for free with Netdata

Open-source, distributed, real-time, performance and health monitoring for systems and applications. Instantly diagnose slowdowns and anomalies in your infrastructure with thousands of metrics, interactive visualizations, and insightful health alarms.

Lee Skillen

12:04:01 AM

Haven’t personally used it, but on the sign-in page (https://app.netdata.cloud/) it says:
Netdata Cloud is offered completely free of charge with no limits on the number of nodes, metrics or team members.

In the future, we’ll be offering complementary paid services for advanced user control and auditing, increased metadata retention, and enterprise plugins. The best is yet to come.

Lee Skillen

12:04:39 AM

So it looks like it is currently free, but may be monetised at some point later (if uptake proves successful, I suppose). I’ve heard of Netdata though, and it looks Quite Nice.

Rashid Boyko

07:56:40 AM

I will take a

Rashid Boyko

08:06:35 AM

I wonder where this netdata-claim.sh script is?

Rashid Boyko

08:08:36 AM

I found the instructions: https://learn.netdata.cloud/guides/step-by-step/step-00

The step-by-step Netdata guide | Learn Netdata

Welcome to Netdata! We’re glad you’re interested in our health monitoring and performance troubleshooting system.

andrea.pavan

05:53:05 PM

I used it in the past for VerneMQ monitoring inside a k8s cluster and also on some VMs. It is a very interesting tool, and its per-second resolution is a really cool feature. It is easy to install on single machines, but harder to set up what is needed for long-term persistent storage. Sadly I never tried their managed cloud, but it should make some admin tasks easier compared to an on-prem, self-managed instance.
