SweetOps #sre for May, 2020

Archive: https://archive.sweetops.com/monitoring/

2020-05-06

Jeremy G (Cloud Posse)

[thread] @kskewes @webb @asmito (and anyone else who wants to join in) We talked about Thanos 8 months ago as promising but not quite ready. They have of course progressed a lot since then, notably with an S3 backend that @kskewes said he was waiting for. I see support/interest from Banzai and Bitnami. There is experimental support for Thanos in the Prometheus Operator charts, and Prometheus itself has just (May 5) upgraded its Integrations listing of Thanos from “read” to “read and write”. We are looking to do some kind of federated Prometheus setup to monitor multiple clusters, and Thanos still looks good to me. Do you have any pros/cons to share?

Thanos supports remote read now. (#1616) · prometheus/docs@d952391
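For readers weighing Thanos for multi-cluster Prometheus: the usual setup gives each Prometheus unique external labels (for example a cluster label, plus a replica label for HA pairs), and the Thanos Querier is started with `--query.replica-label` so it can collapse HA duplicates at query time. The Python below is a simplified, hypothetical sketch of that idea only. The label names and data are made up, and real Thanos deduplication merges the replicas' sample streams rather than simply keeping the first replica seen.

```python
"""Simplified sketch of how a Thanos-style querier collapses HA replicas.

Assumptions (illustrative only): each Prometheus carries unique external
labels such as {"cluster": ..., "replica": ...}, and the querier treats
"replica" as the replica label. Real Thanos also merges the replicas'
samples; this sketch just keeps the first replica seen per series.
"""
from typing import Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
Series = Tuple[Labels, List[Tuple[int, float]]]  # (labels, [(ts_ms, value), ...])


def dedup(series: List[Series], replica_label: str = "replica") -> List[Series]:
    """Collapse series whose label sets differ only in the replica label."""
    seen: Dict[FrozenSet[Tuple[str, str]], Series] = {}
    for labels, samples in series:
        # Dropping the replica label yields the identity of the logical series.
        stripped = {k: v for k, v in labels.items() if k != replica_label}
        key = frozenset(stripped.items())
        if key not in seen:
            seen[key] = (stripped, samples)
    return list(seen.values())


if __name__ == "__main__":
    raw = [
        ({"__name__": "up", "cluster": "us-east-1", "replica": "a"}, [(0, 1.0)]),
        ({"__name__": "up", "cluster": "us-east-1", "replica": "b"}, [(0, 1.0)]),
        ({"__name__": "up", "cluster": "eu-west-1", "replica": "a"}, [(0, 1.0)]),
    ]
    for labels, _ in dedup(raw):
        print(labels)
```

The cluster label is what keeps series from different clusters apart after the replica label is dropped, which is why each Prometheus needs a distinct set of external labels.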

Erik Osterman (Cloud Posse)

09:03:53 PM

@Steve Boardwell

kskewes

10:14:46 PM

We’re now on AWS and Thanos on roadmap. All looks good.

joshmyers

08:36:12 AM

Thanos++

Steve Boardwell

10:30:21 AM

Nice!

webb

04:28:49 PM

We have 50+ teams using our product with Thanos… overall it’s been a great experience. There were a number of memory issues fixed in recent versions, though memory management can still require configuration. We’ve been watching the Thanos Receiver closely, because the sidecar has a two-hour compaction window / delay by default. Happy to share lots more if you want to discuss!

Erik Osterman (Cloud Posse)

06:03:33 PM

@Steve Boardwell @Jeremy G (Cloud Posse)

Erik Osterman (Cloud Posse)

06:03:48 PM

Thanks @webb

Jeremy G (Cloud Posse)

07:09:28 AM

How are you deploying Thanos? Looks like the Bitnami chart is the most popular and well-maintained, but it also seems tied to their ecosystem.

joshmyers

08:40:50 AM

Terraform module for Prometheus + Thanos

Jeremy G (Cloud Posse)

07:49:36 PM

@joshmyers Terraform? I don’t see how you can do that. Is that code something you can share?

joshmyers

08:03:44 PM

Not to k8s…

joshmyers

08:05:05 PM

I’m talking Prometheus, Alertmanager, Thanos, and Pushgateway deployed to an ASG

joshmyers

08:06:32 PM

Every X mins the Prometheus service scoops up a load of config files from S3 and restarts. Every service pushes its Prometheus/Alertmanager config to the bucket. Works well for us, but probably not the CP way.
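A minimal sketch of the pull-and-restart pattern described above, assuming (hypothetically) that each service uploads a JSON object containing a list of `scrape_configs` under a shared S3 prefix. Fetching is left to the caller (for example boto3's `list_objects_v2` and `get_object`), so the merge and change-detection logic runs offline. None of this is the poster's actual code; the object layout and function names are illustrative.

```python
"""Sketch: merge per-service Prometheus config fragments pulled from S3.

Assumptions (hypothetical): each service uploads a JSON object with a
"scrape_configs" list; the caller has already downloaded the objects and
passes them in as a {s3_key: bytes} mapping.
"""
import hashlib
import json
from typing import Dict, Optional, Tuple


def build_prometheus_config(fragments: Dict[str, bytes],
                            global_cfg: Optional[dict] = None) -> dict:
    """Merge every service's scrape_configs into one Prometheus config."""
    scrape_configs = []
    # Sort by object key so the output is stable regardless of listing order.
    for key in sorted(fragments):
        scrape_configs.extend(json.loads(fragments[key])["scrape_configs"])
    job_names = [c["job_name"] for c in scrape_configs]
    if len(job_names) != len(set(job_names)):
        # Prometheus refuses to load a config with duplicate job names.
        raise ValueError(f"duplicate job_name in fragments: {job_names}")
    return {"global": global_cfg or {"scrape_interval": "30s"},
            "scrape_configs": scrape_configs}


def render_if_changed(config: dict,
                      previous_digest: Optional[str]) -> Tuple[str, str, bool]:
    """Serialize the config and report whether it differs from the last run."""
    # JSON output; YAML parsers such as the one Prometheus uses accept it.
    text = json.dumps(config, indent=2, sort_keys=True)
    digest = hashlib.sha256(text.encode()).hexdigest()
    return text, digest, digest != previous_digest


if __name__ == "__main__":
    fragments = {
        "prom-config/billing.json": json.dumps({"scrape_configs": [
            {"job_name": "billing",
             "static_configs": [{"targets": ["billing:9100"]}]}]}).encode(),
        "prom-config/search.json": json.dumps({"scrape_configs": [
            {"job_name": "search",
             "static_configs": [{"targets": ["search:9100"]}]}]}).encode(),
    }
    text, digest, changed = render_if_changed(
        build_prometheus_config(fragments), None)
    print(changed)  # True on the first run
```

Hashing the rendered config means the periodic job only restarts Prometheus when something actually changed. Prometheus can also pick up a new config without a full restart, via SIGHUP or a `POST` to `/-/reload` when it is started with `--web.enable-lifecycle`.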

joshmyers

08:07:40 PM

Terratest tests the Packer AMI and the modules extensively. None of this “can you apply cleanly on the happy path” stuff.

Jeremy G (Cloud Posse)

12:55:10 AM

Right. Thanks!

2020-05-07

2020-05-08

2020-05-13

Jeremy G (Cloud Posse)

07:11:55 AM

Does anyone have any experience with Epsagon they would like to share?

End-to-End Observability for Microservice Environments | Epsagon

Monitor, troubleshoot and fix problems in seconds with payload visibility. Get started for free and experience Epsagon’s automated tracing solutions today!

Mike F.

03:20:59 PM

Disclaimer: I worked on and helped build a competitor that got absorbed into New Relic. Going into PoCs it was always us vs. Epsagon, rarely any of the other competitors. After the shutdown of that product, my personal second choice is absolutely Epsagon. I really like the interface: it’s clean, and it makes it easy to bubble up interesting data points in the system. The alerting features are pretty good too. On the serverless side of things, the tracing is very helpful to me when troubleshooting any complex Lambda issue, and the interdependence maps it shows are cool. From a deployment perspective it’s relatively straightforward. I do not like the CloudFormation stack AWS integration; I preferred the AWS account dependency from my old product. Setting up the Epsagon libraries is easy enough to get automated off-box call tracing. Compared to some of the others (Thundra/Dashbird, even the big names) I think Epsagon wins.

2020-05-14

Erik Osterman (Cloud Posse)

07:57:56 PM

Thanks @Mike F. for the details! Very helpful…

Mike F.

08:07:42 PM

@Erik Osterman (Cloud Posse) No problem! Love talking about Lambda/microservice observability. Hopefully one day soon we can implement some of this with the DevOps stuff I am doing with @Erick

2020-05-22

Jeremy G (Cloud Posse)

03:41:44 AM

Has anyone gotten Grafana’s new OAuth role mapping to work? And by work, I mean be useful? I have the role mapping working (users are assigned roles based on their OAuth) but they are each put into their own organization, so the role mapping is not very helpful. I want users to be added to Main Org. with the mapped role, but have not been able to get that to work. Either they are added with the auto_assign_org_role or they are not added at all.

https://grafana.com/docs/grafana/latest/guides/whats-new-in-v6-5/#generic-oauth-role-mapping

What's new in Grafana v6.5

Feature and improvement highlights for Grafana v6.5
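For reference, the Grafana settings involved here live in two sections of `grafana.ini`: `[users]` (`auto_assign_org`, `auto_assign_org_id`, `auto_assign_org_role`) and `[auth.generic_oauth]` (`role_attribute_path`, the JMESPath expression introduced for role mapping in v6.5). The Python sketch below only renders those keys with the standard library's `configparser`. The `roles` claim and the expression are assumptions modeled on the style of the docs' example, org id 1 is assumed to be Main Org., and this is not a confirmed fix, since the follow-up message reports bugs in the role-mapping parser.

```python
"""Sketch: render the grafana.ini keys used for generic OAuth role mapping.

Assumptions (illustrative): the IdP returns a "roles" claim, and Main Org.
has org id 1. Client id/secret and OAuth URLs are omitted for brevity.
"""
import configparser
import io


def grafana_oauth_ini(role_attribute_path: str, fallback_role: str = "Viewer") -> str:
    # interpolation=None so characters like '%' in values are written verbatim.
    cfg = configparser.ConfigParser(interpolation=None)
    # Add new users to an existing org instead of creating one org per user.
    cfg["users"] = {
        "auto_assign_org": "true",
        "auto_assign_org_id": "1",
        "auto_assign_org_role": fallback_role,
    }
    # JMESPath expression evaluated against the OAuth user info to pick a role.
    cfg["auth.generic_oauth"] = {
        "enabled": "true",
        "role_attribute_path": role_attribute_path,
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(grafana_oauth_ini(
        "contains(info.roles[*], 'admin') && 'Admin' || "
        "contains(info.roles[*], 'editor') && 'Editor' || 'Viewer'"
    ))
```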

Jeremy G (Cloud Posse)

06:56:57 AM

Turns out I only thought I had role mapping working. There are some bugs in the parser.
