#sre (2020-07)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2020-07-02
Hi, I’m curious to find out how you’re handling non-urgent alerts coming in from your infrastructure. We’ve set up PagerDuty for all critical alerts that come in (e.g. service down, degraded performance), but we don’t really have a proper process for handling non-urgent things, like “disk usage is over 80%”. The latter would be something we’d need to act on, but not right now. Any suggestions, or examples of how you’re handling this that you’re happy with? We use PagerDuty/Datadog/Slack mainly for our monitoring and alerting.
Use PagerDuty with a different escalation. It can send to Slack and/or email without the paging part. That will give you the most control. But you could also just have Datadog send to Slack.
You could also look at integrating with your ticketing system and just have it create tickets. But be careful if you do; it can get out of hand.
I usually have 3 Slack channels named something like: infra-info, infra-warn, infra-alert (a routing sketch follows the list):
• infra-info - All alerts and events of interest go here including RSS status page subscriptions, build info etc. This is like a log of events that you can refer back to in a postmortem.
• infra-warn - Actionable, but not urgent events go here. Disk at 80% sort of alerts.
• infra-alert - Alerts here also get sent to PagerDuty, VictorOps or whatever else you need to send to.
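For anyone wiring this up with Prometheus Alertmanager, here’s a minimal routing sketch of the three-channel idea. The channel names and the `severity` label values are assumptions, not from the thread, and credentials/global settings are omitted:

```yaml
# Sketch only: route by a "severity" label into the three channels.
route:
  receiver: slack-infra-info          # default: everything else lands in the "log" channel
  routes:
    - match:
        severity: warning             # actionable but not urgent, e.g. disk at 80%
      receiver: slack-infra-warn
    - match:
        severity: critical            # urgent: Slack plus a page
      receiver: page-and-infra-alert

receivers:
  - name: slack-infra-info
    slack_configs:
      - channel: '#infra-info'
  - name: slack-infra-warn
    slack_configs:
      - channel: '#infra-warn'
  - name: page-and-infra-alert
    slack_configs:
      - channel: '#infra-alert'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
```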
Avoid email at all costs. I worked somewhere where everything came through as email; it was over 3,000 emails a day. That’s a lot to filter straight to your recycling bin.
Something that would be cool is a JIRA (or other ticketing system) integration in Slack with a “Click to Create Issue” button, or at least a link that achieves the same thing.
I really like the three slack channel approach, I may steal that.
We set up PagerDuty services and use them based on the urgency of the alert. One will page in the middle of the night, the other will create the incident and page at 9am. It’s OK, but could use some tweaking.
I like the multiple Slack channel idea. We recently added some of our (non-technical) support team members to our alerts Slack channel, and it’ll help them if non-critical alerts don’t show up there.
At work, we have two services, with a different schedule for each. Production uses a service with a schedule that notifies the person on duty 24/7, while the other environments are plugged into a service whose schedule covers business hours only. We’re planning to use priorities too, with specific escalation policies according to criticality.
I’m dealing with this as well. Actually, we’re coming off of the Slack channel alerts which we’ve been doing for years. The problem is there’s no way to acknowledge, and it doesn’t scale well as the volume of non-vital alerts increases.
The direction we’re headed is aggregating the firehose of alerts in OpsGenie (unlimited alerts are included in every plan). Use proper aliases for alerts so they group well. Don’t auto-resolve alerts that you need to get to at some point. This way alerts are naturally grouped, which reduces the noise. Then open a Jira for each alert, ack the alert once the Jira is created, and close the alert once the Jira is resolved.
This is what it looks like - noise noise noise - not actionable.
Ah yeah, that seems to have gotten out of hand a bit…
Thanks for the suggestions all, this is useful and I’ll cycle it back with my team to implement.
I like the idea of having an “ack” on every non-urgent or urgent issue. We currently have that with the PagerDuty Slack integration for urgent issues. As it seems useful, I’ll look into setting that up for non-urgent ones as well.
Be careful with having the team “ack” every alert. If the alerts are not actionable, or the quantity is too high for the people available, it can quickly lead to burnout.
Interesting. We send them into slack and the 4hr repetition is usually enough for someone to deal with (one of 2 of us)
@Steven you’re spot on… it must be actionable, and when it’s not, the action is to create a Jira to either silence it or fix it, then close. The other key concept, I think, is IBZ (inbox zero) applied to alerts.
where the “inbox” is not your email, it’s the alert console - e.g. OpsGenie
2020-07-03
2020-07-05
2020-07-06
2020-07-10
2020-07-17
I’m using an exporter that monitors a service. It doesn’t have a boolean up/down metric to track the service’s healthy state; instead it has a failed-scrape counter, and every failed scrape increments it.
What would an alert look like that uses that metric to tell me if the service is UP or DOWN?
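One way to express it, assuming the counter is called something like `myservice_scrape_failures_total` (a hypothetical name, substitute the exporter’s real one): alert when the counter keeps increasing over a window.

```yaml
# Sketch of a Prometheus alerting rule; the metric name is hypothetical.
groups:
  - name: myservice-health
    rules:
      - alert: MyServiceDown
        # any growth of the failure counter over the last 5 minutes
        expr: increase(myservice_scrape_failures_total[5m]) > 0
        for: 5m                      # sustained failures only, ignore one-off blips
        labels:
          severity: critical
        annotations:
          summary: "Exporter has been failing to reach the service for 5+ minutes"
```

The `for:` clause keeps a single failed scrape from paging; tune the window to the exporter’s scrape interval.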
Hi all, wondering if anyone has any experience with external metrics from CloudWatch via awslabs/k8s-cloudwatch-adapter. I’m experiencing what seem to be weird RBAC issues: the adapter sets up seemingly fine, and I was able to deploy my custom metric, but I’m seeing a bunch of permissions issues in the logs.
I0717 03:51:59.474073 1 request.go:947] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"externalmetrics.metrics.aws is forbidden: User \"system:serviceaccount:custom-metrics:k8s-cloudwatch-adapter\" cannot list resource \"externalmetrics\" in API group \"metrics.aws\" at the cluster scope","reason":"Forbidden","details":{"group":"metrics.aws","kind":"externalmetrics"},"code":403}
E0717 03:51:59.474831 1 reflector.go:125] github.com/awslabs/k8s-cloudwatch-adapter/pkg/client/informers/externalversions/factory.go:114: Failed to list *v1alpha1.ExternalMetric: externalmetrics.metrics.aws is forbidden: User "system:serviceaccount:custom-metrics:k8s-cloudwatch-adapter" cannot list resource "externalmetrics" in API group "metrics.aws" at the cluster scope
I0717 03:52:00.234254 1 authorization.go:73] Forbidden: "/", Reason: ""
Just replying to this: we were trying to use a custom namespace. When I ran the deployment into the namespace the adapter expects, everything was fine, hence the permissions issues.
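For anyone hitting the same thing: the adapter’s bundled RBAC binds its ClusterRole to the service account in a fixed namespace, so deploying into a different namespace leaves your service account without the permissions behind the 403 above. Roughly this shape (illustrative only; names other than the service account are guesses, the real manifests live in the awslabs/k8s-cloudwatch-adapter repo):

```yaml
# Illustrative ClusterRoleBinding: the subject is pinned to the
# custom-metrics namespace, which is why a custom namespace got 403s.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-cloudwatch-adapter:external-metrics-reader   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-metrics-reader    # hypothetical: grants list/watch on externalmetrics.metrics.aws
subjects:
  - kind: ServiceAccount
    name: k8s-cloudwatch-adapter
    namespace: custom-metrics      # the namespace the adapter expects
```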
I have been getting random 1-minute connection timeout alerts from my uptime service (uptimerobot) for a few of my sites. My ELB logs show no record of those uptime checks, nor are there connection errors in the ELB metrics. Assuming the problem isn’t on the uptime service’s side, is there anything else I’m missing that I should check? (Note I still continue to get uptime requests for the other sites hosted behind the ELB during that time.)
Frequency is one or two sites (out of hundreds), once a day, every couple of days. I don’t get a connection error when attempting to hit the health check myself.
Maybe first add another check like statuscake or similar to remove that assumption of uptimerobot and see what happens.
If uptimerobot shows the resolved IP, you can maybe see whether it’s any ELB IP or just a single-AZ one and analyse from there.
If there are no logs on the ELB side, then the problem is before the connection reaches AWS. Could be intermittent connection errors or a failure of the monitoring software. Because you can reach the endpoint while the monitoring can’t, I’d investigate what exactly is happening on the monitoring side: routing, connectivity, health failures, etc.
@maarten we have hundreds of sites and only 1 or 2 random sites get these connection errors once every couple of days. I would love to, but unfortunately it’s not very cheap to just set up another uptime checker.
heh just got approval to run statuscake alongside uptimerobot, we’ll see how that goes :P
Update: got statuscake up and running yesterday and got another connection timeout error from uptimerobot (once again), but statuscake was able to hit the health check successfully for the same site.
Good luck with the migration
2020-07-18
2020-07-20
2020-07-21
2020-07-22
2020-07-23
Can anyone recommend a SaaS that can combine CloudWatch metrics with other sources? Datadog seems to be one.
NewRelic does that as well. Won’t integrating CloudWatch with external SaaS providers be expensive?
@Marcin Brański thanks. I’m not sure yet, but they only need to measure a couple around response time. Looking at custom metrics in Cloudwatch.
But client is used to Heroku graphs :)
If you use Telegraf you can scrape metrics into a common store like Prometheus/InfluxDB and display graphs from various sources. InfluxDB is pretty cool for that. I don’t use it anymore after v2, but it’s maybe worth checking out.
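If you go the Prometheus route, a common pattern is to have Telegraf expose what it collects via its prometheus_client output plugin and scrape that. A minimal sketch; the hostname, job name and port are assumptions (9273 is, as far as I recall, the plugin’s default):

```yaml
# Sketch of a Prometheus scrape config pointing at a Telegraf instance
# running the prometheus_client output plugin. Target address is assumed.
scrape_configs:
  - job_name: telegraf
    static_configs:
      - targets: ['telegraf.example.internal:9273']
```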
2020-07-24
2020-07-25
2020-07-28
Datadog Agent Kubernetes Operator: https://github.com/DataDog/datadog-operator
A little bit underwhelming. It doesn’t handle monitoring configuration, only agent configuration.
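For context, this is roughly what the operator managed at the time: a DatadogAgent custom resource describing how to deploy the agent, not which monitors to create. Field names are from the v1alpha1 CRD as I recall them, so treat this as a sketch and check the repo for the current schema:

```yaml
# Rough sketch of a v1alpha1 DatadogAgent resource: agent deployment
# settings only, no monitor definitions. Field names may have drifted.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  agent:
    image:
      name: "datadog/agent:7"
```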
2020-07-29
We’re at the point where we need to set up something like PagerDuty. I’ve heard OpsGenie mentioned here and we are an Atlassian Cloud shop, but I’ve used PD in the past. We’re a small shop at this point (< 20 devs/ops people) and we’ll start with just one or two rotas.
Sorry if this has been discussed before, but any input or suggestions to help make the choice would be appreciated.
For that size I think OpsGenie would probably be a good fit for you
We’re using it as well; we used to be on PagerDuty. PD is the ‘industry standard’ workhorse for alerting, but IMO it’s showing its age. And it isn’t priced very well for smaller companies.
Is it too much to ask for a beautiful UI to work with, like what Blameless or Squadcast have? I just dislike the annoying utilitarian nature of PagerDuty, as I can’t update anything with nice clean formatting, markdown or the like. I know it’s a first-world problem, but I want to enjoy the apps I’m paying for. I get enough utilitarian apps in AWS that something more is nice to experience.
OpsGenie and VictorOps seem to hit that
PagerDuty is definitely, like, the Windows 95 of alerting.
If I were picking something, my gut would be Squadcast immediately, and then the others you mentioned.
I just finished a Terraform deployment of PagerDuty, and I know it’s flexible, but dang, it’s not intuitive to configure all those schedules/services and link them together correctly. Feels a bit complicated, or easy to miss something.
oh I hadn’t heard of them.. didn’t find them when I did my evaluation of several alerting services
Not as big of a name. I was looking for a tool designed first and foremost with SRE principles in mind, so SLO/SLI stuff and all. That’s what brought me to it.
We use VictorOps here; pricing is roughly similar to OpsGenie IIRC. If you use Jira, then since OpsGenie is an Atlassian product I imagine the integration to automatically create actionable Jira tickets would be pretty good.
2020-07-31
Astro by Fairwinds is an open source Kubernetes operator that watches objects in your cluster for defined patterns, and manages Datadog monitors based on this state.
Aha! Now that’s what I was looking for…