#sre (2020-07)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2020-07-02
Hi, I’m curious to find out how you’re handling non-urgent alerts coming in from your infrastructure. We’ve set up PagerDuty for all critical alerts that come in (e.g. service down, degraded performance), but we don’t really have a proper process for handling non-urgent things, like “disk usage is over 80%”. The latter would be something we’d need to act on, but not right now. Any suggestions, or examples of how you’re handling this that you’re happy with? We use PagerDuty/Datadog/Slack mainly for our monitoring and alerting.
Use PagerDuty with a different escalation. It can send to Slack and/or email without the paging part. That will give you the most control. But you could also just have Datadog send to Slack.
You could also look at integrating with your ticketing system and just have it create tickets. But be careful if you do; it can get out of hand.
I usually have 3 Slack channels named something like: infra-info, infra-warn, infra-alert (a routing sketch follows the list):
• infra-info - All alerts and events of interest go here including RSS status page subscriptions, build info etc. This is like a log of events that you can refer back to in a postmortem.
• infra-warn - Actionable, but not urgent events go here. Disk at 80% sort of alerts.
• infra-alert - Alerts here also get sent to PagerDuty, VictorOps or whatever else you need to send to.
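For anyone wiring this up with Prometheus Alertmanager, here’s a minimal routing sketch of the three-channel idea. The channel names and the `severity` label values are assumptions, not from the thread, and credentials/global settings are omitted:

```yaml
# Sketch only: route by a "severity" label into the three channels.
route:
  receiver: slack-infra-info          # default: everything else lands in the "log" channel
  routes:
    - match:
        severity: warning             # actionable but not urgent, e.g. disk at 80%
      receiver: slack-infra-warn
    - match:
        severity: critical            # urgent: Slack plus a page
      receiver: page-and-infra-alert

receivers:
  - name: slack-infra-info
    slack_configs:
      - channel: '#infra-info'
  - name: slack-infra-warn
    slack_configs:
      - channel: '#infra-warn'
  - name: page-and-infra-alert
    slack_configs:
      - channel: '#infra-alert'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
```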
Avoid email at all costs. I worked somewhere where everything came through as email; it was over 3,000 emails a day. That’s a lot to filter straight to your recycling bin.
Something that would be cool is a JIRA (or other ticketing system) integration in Slack with a “Click to Create Issue” button, or at least a link that achieves the same thing.
I really like the three slack channel approach, I may steal that.
We set up PagerDuty services and use them based on the urgency of the alert. One will page in the middle of the night, the other will create the incident and page at 9am. It’s OK, but could use some tweaking.
I like the multiple Slack channel idea. We recently added some of our (non-technical) support team members to our alerts Slack channel, and it’ll help them if non-critical alerts don’t show up there.
At work, we have two services, with a different schedule for each. Production uses a service with a schedule that notifies the person on duty 24/7, while the other environments are plugged into a service whose schedule covers business hours only. We’re planning to use priorities too, with specific escalation policies according to criticality.
I’m dealing with this as well. Actually, we’re coming off of the Slack channel alerts which we’ve been doing for years. The problem is there’s no way to acknowledge, and it doesn’t scale well as the volume of non-vital alerts increases.
The direction we’re headed is aggregating the firehose of alerts in OpsGenie (unlimited alerts are included in every plan). Use proper aliases for alerts so they group well. Don’t auto-resolve alerts that you need to get to at some point. This way alerts are naturally grouped, which reduces the noise. Then open a Jira for each alert, ack the alert once the Jira is created, and close the alert once the Jira is resolved.
This is what it looks like - noise noise noise - not actionable.
Ah yeah, that seems to have gotten out of hand a bit…
Thanks for the suggestions all, this is useful and I’ll cycle it back with my team to implement.
I like the idea of having an “ack” on every non-urgent or urgent issue. We currently have that with the PagerDuty Slack integration for urgent issues. As it seems useful, I’ll look into setting that up for non-urgent ones as well.
Be careful with having the team “ack” every alert. If the alerts are not actionable, or the quantity is too high for the people available, it can quickly lead to burnout.
Interesting. We send them into slack and the 4hr repetition is usually enough for someone to deal with (one of 2 of us)
@Steven you’re spot on… it must be actionable, and when it’s not, the action is to create a Jira to either silence it or fix it, then close. The other key concept, I think, is IBZ (inbox zero) applied to alerts.
where the “inbox” is not your email, it’s the alert console - e.g. OpsGenie
2020-07-03
2020-07-05
2020-07-06
2020-07-10
2020-07-17
I’m using an exporter that monitors a service. It doesn’t have a boolean up/down metric to track the service’s healthy state; instead it has a failed-scrape counter, and every failed scrape increments it.
What would an alert look like that uses that metric to tell me if the service is UP or DOWN?
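One way to express it, assuming the counter is called something like `myservice_scrape_failures_total` (a hypothetical name, substitute the exporter’s real one): alert when the counter keeps increasing over a window.

```yaml
# Sketch of a Prometheus alerting rule; the metric name is hypothetical.
groups:
  - name: myservice-health
    rules:
      - alert: MyServiceDown
        # any growth of the failure counter over the last 5 minutes
        expr: increase(myservice_scrape_failures_total[5m]) > 0
        for: 5m                      # sustained failures only, ignore one-off blips
        labels:
          severity: critical
        annotations:
          summary: "Exporter has been failing to reach the service for 5+ minutes"
```

The `for:` clause keeps a single failed scrape from paging; tune the window to the exporter’s scrape interval.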
Hi all, wondering if anyone has any experience with external metrics from CloudWatch via awslabs/k8s-cloudwatch-adapter. I’m experiencing what seem to be weird RBAC issues: the adapter sets up seemingly fine, and I was able to deploy my custom metric, but I’m seeing a bunch of permissions issues in the logs.
I0717 03:51:59.474073 1 request.go:947] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"externalmetrics.metrics.aws is forbidden: User \"system:serviceaccount:custom-metrics:k8s-cloudwatch-adapter\" cannot list resource \"externalmetrics\" in API group \"metrics.aws\" at the cluster scope","reason":"Forbidden","details":{"group":"metrics.aws","kind":"externalmetrics"},"code":403}
E0717 03:51:59.474831 1 reflector.go:125] github.com/awslabs/k8s-cloudwatch-adapter/pkg/client/informers/externalversions/factory.go:114: Failed to list *v1alpha1.ExternalMetric: externalmetrics.metrics.aws is forbidden: User "system:serviceaccount:custom-metrics:k8s-cloudwatch-adapter" cannot list resource "externalmetrics" in API group "metrics.aws" at the cluster scope
I0717 03:52:00.234254 1 authorization.go:73] Forbidden: "/", Reason: ""
Just replying to this: we were trying to use a custom namespace. When I ran the deployment into the namespace the adapter expects, everything was fine, hence the permissions issues.
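For anyone hitting the same thing: the adapter’s bundled RBAC binds its ClusterRole to the service account in a fixed namespace, so deploying into a different namespace leaves your service account without the permissions behind the 403 above. Roughly this shape (illustrative only; names other than the service account are guesses, the real manifests live in the awslabs/k8s-cloudwatch-adapter repo):

```yaml
# Illustrative ClusterRoleBinding: the subject is pinned to the
# custom-metrics namespace, which is why a custom namespace got 403s.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-cloudwatch-adapter:external-metrics-reader   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-metrics-reader    # hypothetical: grants list/watch on externalmetrics.metrics.aws
subjects:
  - kind: ServiceAccount
    name: k8s-cloudwatch-adapter
    namespace: custom-metrics      # the namespace the adapter expects
```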
I have been getting random 1-minute connection timeout alerts from my uptime service (uptimerobot) for a few of my sites. My ELB logs show no record of those uptime checks, nor are there connection errors in the ELB metrics. Assuming the problem isn’t on the uptime service’s side, is there anything else I’m missing that I should check? (Note I still continue to get uptime requests for the other sites hosted behind the ELB during that time.)
Frequency is one or two sites (out of hundreds), once a day, every couple of days. I don’t get a connection error when attempting to hit the health check myself.
Maybe first add another check like statuscake or similar to remove that assumption of uptimerobot and see what happens.
If uptimerobot shows the resolved IP, you can maybe see whether it’s any ELB IP or just a single-AZ one and analyse from there.
If there are no logs on the ELB side, then the problem is before the connection reaches AWS. Could be intermittent connection errors or a failure of the monitoring software. Because you can reach the endpoint while the monitoring can’t, I’d investigate what exactly is happening on the monitoring side: routing, connectivity, health failures, etc.
@maarten we have hundreds of sites and only 1 or 2 random sites get these connection errors once every couple of days. I would love to, but unfortunately it’s not very cheap to just set up another uptime checker.
heh just got approval to run statuscake alongside uptimerobot, we’ll see how that goes :P
Update: got statuscake up and running yesterday and got another connection timeout error from uptimerobot (once again), but statuscake was able to hit the health check successfully for the same site.
Good luck with the migration
2020-07-18
2020-07-20
2020-07-21
2020-07-22
2020-07-23
Can anyone recommend a SaaS that can combine CloudWatch metrics with other sources? Datadog seems to be one.
NewRelic does that as well. Won’t integrating CloudWatch with external SaaS providers be expensive?
@Marcin Brański thanks. I’m not sure yet, but they only need to measure a couple around response time. Looking at custom metrics in Cloudwatch.
But client is used to Heroku graphs :)
If you use Telegraf you can scrape metrics into a common store like Prometheus/InfluxDB and display graphs from various sources. InfluxDB is pretty cool for that. I don’t use it anymore after v2, but it’s maybe worth checking out.
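If you go the Prometheus route, a common pattern is to have Telegraf expose what it collects via its prometheus_client output plugin and scrape that. A minimal sketch; the hostname, job name and port are assumptions (9273 is, as far as I recall, the plugin’s default):

```yaml
# Sketch of a Prometheus scrape config pointing at a Telegraf instance
# running the prometheus_client output plugin. Target address is assumed.
scrape_configs:
  - job_name: telegraf
    static_configs:
      - targets: ['telegraf.example.internal:9273']
```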
2020-07-24
2020-07-25
2020-07-28
Datadog Agent Kubernetes Operator: https://github.com/DataDog/datadog-operator
A little bit underwhelming. It doesn’t handle monitoring configuration, only agent configuration.
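For context, this is roughly what the operator managed at the time: a DatadogAgent custom resource describing how to deploy the agent, not which monitors to create. Field names are from the v1alpha1 CRD as I recall them, so treat this as a sketch and check the repo for the current schema:

```yaml
# Rough sketch of a v1alpha1 DatadogAgent resource: agent deployment
# settings only, no monitor definitions. Field names may have drifted.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  agent:
    image:
      name: "datadog/agent:7"
```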
2020-07-29
We’re at the point where we need to set up something like PagerDuty. I’ve heard OpsGenie mentioned here and we are an Atlassian Cloud shop, but I’ve used PD in the past. We’re a small shop at this point (< 20 devs/ops people) and we’ll start with just one or two rotas.
Sorry if this has been discussed before, but any input or suggestions to help make the choice would be appreciated.
For that size I think OpsGenie would probably be a good fit for you
We’re using it as well; we used to be on PagerDuty. PD is the ‘industry standard’ workhorse for alerting, but IMO it’s showing its age. And it isn’t priced very well for smaller companies.
Is it too much to ask for a beautiful UI to work with, like what Blameless or Squadcast have? I just dislike the annoying utilitarian nature of PagerDuty, as I can’t update anything with nice clean formatting, markdown or the like. I know it’s a first-world problem, but I want to enjoy the apps I’m paying for. I get enough utilitarian apps in AWS that something more is nice to experience.
OpsGenie and VictorOps seem to hit that
PagerDuty is definitely, like, the Windows 95 of alerting.
If I were picking something, my gut would be Squadcast immediately, and then the others you mentioned.
I just finished a Terraform deployment of PagerDuty, and I know it’s flexible, but dang, it’s not intuitive to configure all those schedules/services and link them together correctly. Feels a bit complicated, or easy to miss something.
oh I hadn’t heard of them.. didn’t find them when I did my evaluation of several alerting services
Not as big of a name. I was looking for a tool designed first and foremost with SRE principles in mind, so SLO/SLI stuff and all. That’s what brought me to it.
We use VictorOps here; pricing is roughly similar to OpsGenie IIRC. If you use Jira, then since OpsGenie is an Atlassian product I imagine the integration to automatically create actionable Jira tickets would be pretty good.
2020-07-31
Astro by Fairwinds is an open source Kubernetes operator that watches objects in your cluster for defined patterns, and manages Datadog monitors based on this state.
Aha! Now that’s what I was looking for…