#sre (2020-07)

Prometheus, Prometheus Operator, Grafana, Kubernetes

Archive: https://archive.sweetops.com/monitoring/

2020-07-02

Tom de Vries avatar
Tom de Vries

Hi, I’m curious to find out how you’re handling non-urgent alerts coming in from your infrastructure. We’ve set up PagerDuty for all critical alerts that come in (e.g. service down, degraded performance), but we don’t really have a proper process for handling non-urgent things, like “disk usage is over 80%”. The latter would be something we’d need to act on, but not right away. Any suggestions, or examples of how you’re handling this that you’re happy with? We use PagerDuty/Datadog/Slack mainly for our monitoring and alerting.

Steven avatar

Use PagerDuty with a different escalation. It can send to Slack and/or email without the paging part. That will give you the most control. But you could just have DataDog send to Slack.

Steven avatar

You could also look at integrating to your ticketing system and just have it create tickets. But be careful if you do, it can get out of hand

Tim Birkett avatar
Tim Birkett

I usually have 3 Slack channels named something like: infra-info, infra-warn, infra-alert

infra-info - All alerts and events of interest go here, including RSS status page subscriptions, build info, etc. This is like a log of events that you can refer back to in a postmortem.

infra-warn - Actionable, but not urgent, events go here. “Disk at 80%” sort of alerts.

infra-alert - Alerts here also get sent to PagerDuty, VictorOps, or whatever else you need to send to.
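
To make the routing concrete in Prometheus/Alertmanager terms, a rough sketch only: the severity label convention, grouping labels, channel names, and webhook URLs below are assumptions, not something Tim specified beyond the three channels.

```yaml
# alertmanager.yml (fragment) - route alerts to the three channels above
# based on an assumed "severity" label; webhook URLs are placeholders.
route:
  receiver: infra-info              # default: everything of interest at least lands here
  group_by: ["alertname", "service"]
  routes:
    - match:
        severity: warning
      receiver: infra-warn          # actionable but not urgent, e.g. disk at 80%
      repeat_interval: 4h
    - match:
        severity: critical
      receiver: infra-alert         # also forwarded to the pager from this receiver

receivers:
  - name: infra-info
    slack_configs:
      - channel: "#infra-info"
        api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder
  - name: infra-warn
    slack_configs:
      - channel: "#infra-warn"
        api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder
  - name: infra-alert
    slack_configs:
      - channel: "#infra-alert"
        api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder
    pagerduty_configs:
      - routing_key: "<pagerduty-events-v2-key>"                  # placeholder
```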

Tim Birkett avatar
Tim Birkett

Avoid email at all costs. I worked somewhere where everything came through as email: over 3,000 emails a day. That’s a lot to filter straight to your recycling bin.

2
Tim Birkett avatar
Tim Birkett

Something that would be cool is a JIRA (or other ticketing system) integration in Slack that lets you “Click to Create Issue”, or at least gives you a link to achieve that.

bradym avatar

I really like the three Slack channel approach, I may steal that.

We set up PagerDuty services and use them based on the urgency of the alert. One will page in the middle of the night, the other will create the incident and page at 9am. It’s OK, but could use some tweaking.

btai avatar

I like the multiple Slack channel idea. We recently added some of our (non-technical) support team members to our alerts Slack channel, and it’ll help them if non-critical alerts don’t show up there.

Issif avatar

At work, we have two services, with a different schedule for each. Production uses a service whose schedule notifies the person on duty 24/7, but the other environments are plugged into a service whose schedule covers business hours only. We’re planning to use priorities too, with specific escalation policies according to criticality.

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

I’m dealing with this as well. Actually, we’re coming off of the Slack channel alerts, which we’ve been doing for years. The problem is there’s no way to acknowledge them, and it doesn’t scale well as the volume of non-vital alerts increases.

The direction we’re headed is aggregating the firehose of alerts in Opsgenie (unlimited alerts are included in every plan). Use proper aliases for alerts so they group well. Don’t auto-resolve alerts that you need to get to at some point. This way alerts are naturally grouped, which reduces the noise. Then open a Jira for each alert. Ack each alert once the Jira is created. Close the alert once the Jira is resolved.
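
If the pipeline feeding Opsgenie is Alertmanager (an assumption; it could just as well be Datadog or something else), the “don’t auto-resolve” part maps to turning off resolved notifications on the receiver. A minimal sketch, with the API key as a placeholder:

```yaml
# alertmanager.yml (fragment) - Opsgenie receiver that does not auto-close
# alerts when they resolve in Prometheus; a human closes them once the Jira is done.
receivers:
  - name: opsgenie
    opsgenie_configs:
      - api_key: "<opsgenie-api-key>"     # placeholder
        send_resolved: false              # keep the alert open until someone closes it
        message: '{{ .CommonLabels.alertname }}'
        priority: P3
```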

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

This is what it looks like - noise noise noise - not actionable.

2
Tom de Vries avatar
Tom de Vries

Ah yeah, that seems to have gotten out of hand a bit…

Thanks for the suggestions all, this is useful and I’ll cycle it back with my team to implement.

I like the idea of having an “ack” on every non-urgent or urgent issue. We currently have that with the PagerDuty Slack integration for urgent issues. As it seems useful, I’ll look into setting that up for non-urgent issues as well.

Steven avatar

Be careful with having the team “ack” every alert. If the alerts are not actionable, or if the quantity is too high for the people available, it can quickly lead to burnout.

1
kskewes avatar
kskewes

Interesting. We send them into Slack, and the 4hr repetition is usually enough for someone (one of the 2 of us) to deal with them.

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

@Steven you’re spot on… it must be actionable, and when it’s not, the action is to create a Jira to either silence it or fix it, then close it. The other key concept, I think, is IBZ (inbox zero) applied to alerts.

1
Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

where the “inbox” is not your email, it’s the alert console - e.g. Opsgenie

2020-07-17

Abel Luck avatar
Abel Luck

I’m using an exporter that monitors a service. It doesn’t have a boolean up/down metric to track the service’s health state; rather, it has a failed-scrape counter. Every failed scrape increments the counter.

What would an alert look like that uses that metric to tell me whether the service is UP or DOWN?
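
A hedged sketch of one way to do it: assuming the counter is called something like myservice_scrape_failures_total (a made-up name; substitute the exporter’s real metric), alert when the counter has been increasing for a sustained period:

```yaml
# Prometheus rule file sketch; metric name, window, and thresholds are assumptions.
groups:
  - name: myservice-availability
    rules:
      - alert: MyServiceScrapesFailing
        # The counter grew at some point in the last 5 minutes, and that has
        # stayed true for 10 minutes straight - i.e. the service is effectively DOWN.
        expr: increase(myservice_scrape_failures_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Scrapes of myservice have been failing for 10+ minutes"
```

Note that the built-in up metric only covers Prometheus failing to scrape the exporter itself, not the exporter failing to reach the service, which is what this counter captures.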

shamil.kashmeri avatar
shamil.kashmeri

Hi all, wondering if anyone has any experience with external metrics from CloudWatch via awslabs/k8s-cloudwatch-adapter. I’m experiencing what seem to be weird RBAC issues: the adapter sets up seemingly fine, and I was able to deploy my custom metric, but I’m seeing a bunch of permission errors in the logs.

I0717 03:51:59.474073       1 request.go:947] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"externalmetrics.metrics.aws is forbidden: User \"system:serviceaccount:custom-metrics:k8s-cloudwatch-adapter\" cannot list resource \"externalmetrics\" in API group \"metrics.aws\" at the cluster scope","reason":"Forbidden","details":{"group":"metrics.aws","kind":"externalmetrics"},"code":403}
E0717 03:51:59.474831       1 reflector.go:125] github.com/awslabs/k8s-cloudwatch-adapter/pkg/client/informers/externalversions/factory.go:114: Failed to list *v1alpha1.ExternalMetric: externalmetrics.metrics.aws is forbidden: User "system:serviceaccount:custom-metrics:k8s-cloudwatch-adapter" cannot list resource "externalmetrics" in API group "metrics.aws" at the cluster scope
I0717 03:52:00.234254       1 authorization.go:73] Forbidden: "/", Reason: ""
shamil.kashmeri avatar
shamil.kashmeri

Just replying to this: we were trying to use a custom namespace. When I ran the deployment into the namespace it wants, everything was fine, hence the permission issues.
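
For anyone who hits the same thing, the general rule (an assumption on my part, not verified against the adapter’s shipped manifests) is that the ClusterRoleBinding’s subject namespace has to line up with wherever the ServiceAccount actually runs; if the two drift apart when moving the adapter to a custom namespace, you get exactly this kind of Forbidden error. A sketch of the binding with the detail that matters:

```yaml
# Illustrative only - names are assumed; the real manifests ship with the adapter.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-cloudwatch-adapter:external-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-metrics-reader    # assumed name; must allow list/watch on externalmetrics.metrics.aws
subjects:
  - kind: ServiceAccount
    name: k8s-cloudwatch-adapter
    namespace: custom-metrics      # must match the namespace the adapter is actually deployed to
```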

btai avatar

I have been getting random 1-minute connection timeout alerts from my uptime service (uptimerobot) for a few of my sites. My ELB logs show no record of those uptime checks, nor are there connection errors in the ELB metrics. Assuming the problem isn’t on the uptime service’s side, is there anything else I’m missing that I should check? (Note: I still continue to get uptime requests for the other sites hosted behind the ELB during that time.)

btai avatar

The frequency is one or two sites (out of hundreds), once a day or once every couple of days. I don’t get a connection error when attempting to hit the health check myself.

maarten avatar
maarten

Maybe first add another check like statuscake or similar to remove that assumption about uptimerobot and see what happens.

  • If uptimerobot shows the resolved IP, you can maybe see whether it’s any ELB IP or just a single-AZ one and analyse from there.
1
Marcin Brański avatar
Marcin Brański

If there are no logs on the ELB side, then the problem is before the connection reaches AWS. It could be intermittent connection errors or a failure of the monitoring software. Because you can reach the endpoint while the monitoring can’t, I’d investigate what exactly is happening on the monitoring side: routing, connectivity, health failures, etc.

btai avatar

@maarten we have hundreds of sites, and only 1 or 2 random sites will get these connection errors once every couple of days. I would love to, but unfortunately it’s not very cheap to just set up another uptime checker.

btai avatar

heh just got approval to run statuscake alongside uptimerobot, we’ll see how that goes :P

btai avatar

Update: got statuscake up and running yesterday and got another connection timeout error from uptimerobot (once again), but statuscake was able to hit the health check successfully for the same site.

maarten avatar
maarten

Good luck with the migration

2020-07-23

Joe Niland avatar
Joe Niland

Can anyone recommend a SaaS that can combine CloudWatch metrics with other sources? Datadog seems to be one.

Marcin Brański avatar
Marcin Brański

NewRelic does that as well. Won’t integrating CloudWatch with external SaaS providers be expensive?

Joe Niland avatar
Joe Niland

@Marcin Brański thanks. I’m not sure yet, but they only need to measure a couple of metrics around response time. Looking at custom metrics in CloudWatch.

But the client is used to Heroku graphs :)

sheldonh avatar
sheldonh

If you use Telegraf, you can scrape metrics into a common store like Prometheus/InfluxDB and display graphs from various sources. InfluxDB is pretty cool for that. I don’t use it anymore after the v2 release, but it may be worth checking out.

1
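
As a rough sketch of that pattern, assuming Telegraf is exposing whatever it collects (CloudWatch, system, app metrics) through its prometheus_client output plugin on the default port 9273, the Prometheus side is just one extra scrape job; the hostname below is a placeholder:

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: telegraf
    scrape_interval: 60s
    static_configs:
      - targets: ["telegraf.example.internal:9273"]   # placeholder host
```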

2020-07-28

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)
DataDog/datadog-operator

Datadog Agent Kubernetes Operator.

1
Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

A little bit underwhelming. It doesn’t handle monitoring configuration, only agent configuration.

2020-07-29

Eric Berg avatar
Eric Berg

We’re at the point where we need to set up something like PagerDuty. I’ve heard OpsGenie mentioned here, and we are an Atlassian Cloud shop, but I’ve used PD in the past. We’re a small shop at this point (< 20 devs/ops people) and we’ll start with just one or two rotas.

Sorry if this has been discussed before, but any input or suggestions to help make the choice would be appreciated.

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

I’ll talk more about this next week in #office-hours

1
Zach avatar

For that size I think OpsGenie would probably be a good fit for you

Zach avatar

We’re using it as well; we used to be on PagerDuty. PD is the ‘industry standard’ workhorse for alerting, but IMO it’s showing its age. And it isn’t priced very well for smaller companies.

1
1
sheldonh avatar
sheldonh

Is it too much to ask for a beautiful UI to work with, like what Blameless or Squadcast have? I just dislike the annoying utilitarian nature of PagerDuty, as I can’t update anything with nice clean formatting, markdown or the like. I know it’s a first-world problem, but I want to enjoy the apps I’m paying for. I get enough utilitarian apps in AWS that something nicer is a welcome experience.

Zach avatar

OpsGenie and VictorOps seem to hit that

Zach avatar

PagerDuty is definitely, like, the Windows 95 of alerting.

sheldonh avatar
sheldonh

If I were picking something, my gut would say Squadcast immediately, and then the others you mentioned.

I just finished a Terraform deployment of PagerDuty, and I know it’s flexible, but dang, it’s not intuitive to configure all those schedules/services and link them together correctly. It feels a bit complicated and easy to miss something.

Zach avatar

Oh, I hadn’t heard of them… I didn’t find them when I did my evaluation of several alerting services.

sheldonh avatar
sheldonh

Not as big of a name. I was looking for a tool designed first and foremost with SRE principles in mind, so SLO/SLI stuff and all. That’s what brought me to it.

btai avatar

We use VictorOps here; pricing is roughly similar to OpsGenie IIRC. If you use Jira, then since OpsGenie is an Atlassian product, I imagine the integration to automatically create actionable Jira tickets would be pretty good.

2020-07-31

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)
Introducing Astro: Manage Datadog Monitors in Kubernetes Deployments for Better Productivity and Cluster Performance

Astro by Fairwinds is an open source Kubernetes operator that watches objects in your cluster for defined patterns, and manages Datadog monitors based on this state.

cool-doge1
Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

Aha! Now that’s what I was looking for…
