#sre (2022-05)

Prometheus, Prometheus Operator, Grafana, Kubernetes

Archive: https://archive.sweetops.com/monitoring/

2022-05-26

sheldonh

I wish datadog had put more effort into the core features of an incident tool. Their incident tool is refreshingly simple, but missing some key things for it to be viable.

Was hoping to get folks on it as a “simple version of OpsGenie” but it’s missing:

• Dedicated app/alerting (at least for incident stuff)

• Easy slack integration. If I open in a channel, I can’t have all updates piped through. I have to have it create a dedicated channel and at my place that’s not possible.

• Links to other things in datadog don’t automatically prettify.

• No escalation policy/team schedule for handling.

So many things missing. Seems like it would be a really nice experience being in a single place if it wasn’t just a barebones way to organize a chat.

Erik Osterman (Cloud Posse)

Not to mention it’s missing terraform support :-)

Erik Osterman (Cloud Posse)

Is there an API to manage it?

sheldonh

Wow. Good point. I’m using pulumi for datadog monitors and didn’t think of that. Looking now

sheldonh

Nope.

sheldonh

Really a fail.

sheldonh

I’m about to switch over to OpsGenie since my current $work doesn’t use PagerDuty. PagerDuty was a freaking nightmare with terraform, so I’m hoping OpsGenie makes this a better experience.

Erik Osterman (Cloud Posse)

Overall, what I find the hardest is coming up with an opinion on how to manage it. There are a hundred ways to do things in OpsGenie

Erik Osterman (Cloud Posse)

No canonical way

Erik Osterman (Cloud Posse)

Now we have that for our purposes with OpsGenie so I am happy

sheldonh

Would welcome anything/article on tips.

I’m just trying to get a team who’s never done organized incident handling into something.

Right now my best option is just a slack workflow with an incident thread. I want to expose the workflow issues in a much more clear way though and make it as simple as I can. OpsGenie seems like the best option right now to make things very trackable

Mohammed Yahya

Limited Terraform support. I’m adding accounts automatically to Datadog with integration resources.

Something I hate is that you need a lot of stuff to ship logs out to Datadog. I wish there was something easier, although you can use Terraform or CloudFormation to ship logs.

Mohammed Yahya

But to be fair, their SIEM and CSPM are both awesome compared to New Relic. I will take a look at OpsGenie, though.

2022-05-27

2022-05-30
