SweetOps #sre for August, 2020

Archive: https://archive.sweetops.com/monitoring/

2020-08-03

Chris Fowles

11:11:23 PM

that’s awesome

2020-08-06

2020-08-07

Marcin Brański

05:23:15 PM

At a recent #office-hours we talked about Opsgenie automation. Today we released :tada: the first version of our module to manage it with Terraform; we currently use it to manage most of our Opsgenie setup. https://github.com/cloudposse/terraform-opsgenie-incident-management

Erik Osterman (Cloud Posse)

06:39:27 PM

See recording: https://cloudposse.wistia.com/medias/9d4ase4qjy

Public "Office Hours" 2020-08-05

Zach

02:24:25 PM

links in the readme for the examples lead to 404s

2020-08-08

2020-08-16

mado

12:07:37 PM

Just installed OpenShift Dedicated on AWS. Any advice on installing the Instana APM monitoring tool on it? I want to monitor my application with Instana.

2020-08-23

msharma24

11:33:36 PM

Hello all, how can I monitor EMR job failures by job name? For example, I would like to receive an alert only when a job whose name starts with “Prod-XXX” fails on the cluster.

Zach

01:47:56 AM

I would look at EventBridge; you can get all the EMR cluster events off of it. Ship them to a Lambda or something, process the event, and send another event to your alerting API.

Zach

01:48:25 AM

https://docs.aws.amazon.com/eventbridge/latest/userguide/event-types.html#emr-event-type

EventBridge Event Examples from Supported AWS Services - Amazon EventBridge

Lists the AWS services and event types supported by Amazon EventBridge.
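
A rough sketch of the Lambda side of that suggestion, assuming an EventBridge rule forwards EMR events to the function. The field names (`detail-type`, `detail.name`, `detail.state`) follow the EMR step status change example as I recall it from the page linked above, so check them against a real event before relying on this. The `Prod-` prefix comes from the question; `send_alert` is a hypothetical placeholder for whatever alerting API you use.

```python
import json

# Assumption: "Prod-" stands in for the "Prod-XXX" naming from the question.
JOB_NAME_PREFIX = "Prod-"


def is_failed_prod_step(event, prefix=JOB_NAME_PREFIX):
    """Return True for an EMR step-failure event whose step name starts with prefix."""
    if event.get("source") != "aws.emr":
        return False
    if event.get("detail-type") != "EMR Step Status Change":
        return False
    detail = event.get("detail", {})
    name = str(detail.get("name", ""))
    return detail.get("state") == "FAILED" and name.startswith(prefix)


def send_alert(payload):
    """Hypothetical placeholder: replace with a call to your alerting API."""
    print(json.dumps(payload))


def handler(event, context):
    """Lambda entry point: alert only on failed steps that match the prefix."""
    if not is_failed_prod_step(event):
        return {"alerted": False}
    detail = event["detail"]
    send_alert(
        {
            "cluster_id": detail.get("clusterId"),
            "step_id": detail.get("stepId"),
            "step_name": detail.get("name"),
            "message": detail.get("message", ""),
        }
    )
    return {"alerted": True}
```

The state and source checks could also live in the EventBridge rule's event pattern, which would keep the Lambda from being invoked for events it would only discard.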

msharma24

01:49:25 AM

Thanks @Zachary Loeber

Eric Berg

07:38:06 PM

At my last gig, we actually wrote something that queried the AWS API for the status of all of our EMR jobs and posted events to Datadog based on the results. It was written 3 or 4 years ago, but it did give much better info about the status of our EMR jobs.
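
A minimal sketch of the polling approach described above. This is not the tool from that gig, just one way it could look. It assumes a boto3 EMR client is passed in and uses its `list_steps` call, following the `Marker` field for pagination. The `Prod-` prefix is taken from the original question, and the step that posts results to Datadog is deliberately left out.

```python
def failed_steps(emr_client, cluster_id, prefix="Prod-"):
    """Return failed steps on a cluster whose names start with prefix.

    emr_client is expected to behave like boto3.client("emr").
    """
    failed = []
    kwargs = {"ClusterId": cluster_id, "StepStates": ["FAILED"]}
    while True:
        page = emr_client.list_steps(**kwargs)
        for step in page.get("Steps", []):
            if step["Name"].startswith(prefix):
                failed.append({"id": step["Id"], "name": step["Name"]})
        marker = page.get("Marker")
        if not marker:
            # No further pages: return everything collected so far.
            return failed
        kwargs["Marker"] = marker
```

Run on a schedule, something like `failed_steps(boto3.client("emr"), cluster_id)` would give a list that can be turned into events for whatever monitoring backend is in use.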
