SweetOps #sre for June, 2020

We have some async tasks, maybe 20+ or so. Some of them run at odd hours in the middle of the night, and some can take up to 20 min to run. I want to get super alerted if something doesn’t run or fails to run. Looking for advice on dashboarding versus alerting. Currently, the team has been trained to keep a close eye on Sentry alerts that come in through Slack. We had email alerts from AWS in the past but the team tuned them out.

maarten

05:23:50 PM

Can they run during the day? Nightly jobs came from a time when CPU power was limited. I personally like the daily jobs after lunch time now :)

Zach

05:39:28 PM

I had my data team integrate with prometheus pushgateway for this, and then I have alarms set if we don’t get a heartbeat update (a gauge where the value is the unix time of the last successful run) within X hours of when they’re supposed to run
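A minimal sketch of the heartbeat gauge described above, using only the Python standard library. The metric name `job_last_success_unixtime`, the job name, and the Pushgateway address are hypothetical, and `is_stale` only mirrors the kind of comparison an alert rule would make; it is not the setup Zach actually runs.

```python
import time
import urllib.request

# Hypothetical Pushgateway address; adjust for your environment.
PUSHGATEWAY = "http://pushgateway.example.internal:9091"


def heartbeat_request(job, now=None):
    """Build (but do not send) a push recording the last successful run."""
    now = time.time() if now is None else now
    # Text exposition format: a TYPE line, then one newline-terminated sample.
    body = (
        "# TYPE job_last_success_unixtime gauge\n"
        f"job_last_success_unixtime {int(now)}\n"
    )
    return urllib.request.Request(
        f"{PUSHGATEWAY}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
    )


def is_stale(last_success, now, max_age_hours):
    """True when the last successful run is older than the allowed window."""
    return (now - last_success) > max_age_hours * 3600
```

A job would call `urllib.request.urlopen(heartbeat_request("nightly_report"))` as its final step, so the gauge only moves forward on success. On the Prometheus side, the matching alert expression would look something like `time() - job_last_success_unixtime > X * 3600`.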

chonan tsai

06:51:16 PM

@maarten i think some of the jobs can be run during the day. a few jobs run at night because of certain timing requirements for end of day/month reporting (we are a fintech) that depend on 3rd parties. another reason is that the database typically takes a hit on DB cpu / ram when the job runs. We could point those db connections at a read replica or scale upwards but haven’t gotten around to it.

chonan tsai

06:51:26 PM

@Zach nice approach.

Zach

07:04:37 PM

PushGateway is a bit of a pain in the ass. Setup is easy but it took a few iterations to get an understanding of how it interacts with prometheus and your queries

chonan tsai

11:21:35 PM

we don’t have prometheus. will have to find another solution to hook into. will investigate a bit.

Erik Osterman (Cloud Posse)

12:53:14 AM

@chonan tsai do you have opsgenie?

Erik Osterman (Cloud Posse)

12:53:28 AM

it supports heartbeats

Erik Osterman (Cloud Posse)

12:53:42 AM

super simple

Erik Osterman (Cloud Posse)

12:53:53 AM

just hit a REST API endpoint (e.g. with curl or in your app)
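As a rough illustration of "just hit a REST API endpoint", here is a standard-library Python sketch that builds a heartbeat ping. The URL path and the `GenieKey` authorization scheme are written from memory of Opsgenie's Heartbeat API and should be checked against the current documentation; the heartbeat name and API key are placeholders.

```python
import urllib.request

OPSGENIE_API = "https://api.opsgenie.com"


def ping_request(heartbeat_name, api_key):
    """Build (but do not send) a ping for a named heartbeat."""
    return urllib.request.Request(
        f"{OPSGENIE_API}/v2/heartbeats/{heartbeat_name}/ping",
        # Assumed auth scheme: an API key passed as "GenieKey <key>".
        headers={"Authorization": f"GenieKey {api_key}"},
        method="GET",
    )
```

The job would send it with `urllib.request.urlopen(ping_request("nightly-report", key))` after finishing successfully; if the ping stops arriving within the configured interval, the heartbeat expires and an alert is raised.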

chonan tsai

02:53:28 AM

no opsgenie. anything in aws we can use?

chonan tsai

02:53:35 AM

@Erik Osterman (Cloud Posse)

Erik Osterman (Cloud Posse)

02:57:17 AM

Not that I know of off the top of my head. What do you use for alert escalations on on-call rotations?

chonan tsai

08:34:49 PM

nothing yet. do you recommend anything? @Erik Osterman (Cloud Posse)

Erik Osterman (Cloud Posse)

08:56:26 PM

Yea, Opsgenie or PagerDuty. Without something like that you’re tempting fate

chonan tsai

05:48:28 PM

yeah i know. learning it the hard way now.

chonan tsai

05:50:14 PM

@Erik Osterman (Cloud Posse) We just confirmed that our async tasks are not running at the time we specified. And looking at our datadog dashboard, it says load average is around 4 to 5 consistently. Would this cause this issue to happen?

chonan tsai

05:51:23 PM

The delay period swings drastically. Sometimes it is a few seconds off but at times it could be a few hours off. Is my CPU overloaded? I have a micro instance for this application
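One quick way to read those numbers is load average per CPU. The sketch below is a rule of thumb only: load average counts tasks waiting to run (and, on Linux, tasks blocked on I/O), so a high value hints at saturation rather than proving it. The vCPU counts and the threshold of 1.0 are assumptions for illustration.

```python
def load_per_cpu(load_avg, cpu_count):
    """Load average normalised by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def looks_saturated(load_avg, cpu_count, threshold=1.0):
    """Rule of thumb: sustained load above ~1.0 per CPU means work is queuing."""
    return load_per_cpu(load_avg, cpu_count) > threshold
```

With a load average of 4.5, `looks_saturated(4.5, 1)` and `looks_saturated(4.5, 2)` are both true, so an instance with one or two vCPUs would have several tasks queued at any moment. On the host itself, `os.getloadavg()` and `os.cpu_count()` supply the inputs.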

Erik Osterman (Cloud Posse)

09:00:03 PM

what’s managing the scheduling of the jobs? e.g. a naive crontab solution would easily produce inconsistent results in an EC2 autoscale group where nodes are coming online and offline.

chonan tsai

10:00:03 PM

containerized celery worker and AWS ElastiCache. i don’t think we have enough workers. need to double check

btai

04:29:49 AM

anyone have an example of dropping prometheus labels (i.e. pod name, ip) from some of my custom prometheus metrics with a specific prefix? I can’t tell from the relabeling config whether I can drop it from a subset of metrics

here’s the part of my helmfile where i’m attempting to do this:

        serviceMonitor:
          metricRelabelings:
          - targetLabel: pod
            replacement: ''
          - targetLabel: instance
            replacement: ''
          - sourceLabels: [__name__]
            regex: '(container_tasks_state|container_memory_failures_total)'
            action: drop
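To see how such rules behave, here is a small Python simulation of a single `replace` relabel rule. It is a simplified model, not Prometheus code: it handles literal replacements only (no `$1` references), and the `myapp_` prefix is a hypothetical stand-in for the custom metric prefix. It relies on two documented behaviours: relabel regexes are anchored at both ends, and a label set to an empty value is removed.

```python
import re


def apply_replace(labels, source_labels, regex, target_label, replacement,
                  separator=";"):
    """Simulate one `replace` relabel rule on a dict of labels."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are fully anchored, hence fullmatch.
    if re.fullmatch(regex, value) is None:
        return dict(labels)  # no match: labels pass through unchanged
    result = dict(labels)
    if replacement == "":
        result.pop(target_label, None)  # empty value removes the label
    else:
        result[target_label] = replacement
    return result


custom = {"__name__": "myapp_requests_total", "pod": "web-1"}
other = {"__name__": "container_cpu_usage_seconds_total", "pod": "web-1"}

# Scoped rule: only metrics whose name starts with the prefix lose `pod`.
scoped_custom = apply_replace(custom, ["__name__"], "myapp_.*", "pod", "")
scoped_other = apply_replace(other, ["__name__"], "myapp_.*", "pod", "")

# Unscoped rule (no sourceLabels, default regex): every metric loses `pod`.
unscoped_other = apply_replace(other, [], "(.*)", "pod", "")
```

In the helmfile, the scoped version would correspond to adding `sourceLabels: [__name__]` and a prefix `regex` to each label-clearing rule. One thing to watch for: once `pod` and `instance` are gone, series from different pods can collapse into identical label sets.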

Zach

12:48:45 PM

You definitely can drop labels from a subset of metrics, just like that. However I think that might also drop all the labels

Zach

12:49:48 PM

Brian Brazil’s blog mentions this here https://www.robustperception.io/dropping-metrics-at-scrape-time-with-prometheus

2020-06-10

2020-06-11

2020-06-15

Zach

12:54:42 PM

I currently have prometheus running on an EC2 instance for my scrapes, as the rest of our applications are all on EC2 still. It has a detachable EBS volume so that I can destroy the instance and spin up a new one for upgrades, but this leaves me with a few-minutes gap in the data when the old instance is down and waiting for the new one to finish attaching the EBS and reading back the WAL. Is there some way (and I’m open to moving this to ECS if that would solve it) of doing this with (near) zero downtime in my metrics?

Erik Osterman (Cloud Posse)

09:33:48 AM

Thanos solves this by letting you run multiple prometheus scrapers. Then it has a separate querier service that helps with deduping.
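A toy Python illustration of why two scrapers close the gap. This is not Thanos's deduplication algorithm; it assumes both replicas scrape at identical timestamps, which real replicas do not, and simply fills holes in one replica's samples from the other.

```python
def merge_replicas(primary, secondary):
    """Merge two {timestamp: value} series, preferring the primary replica."""
    merged = dict(secondary)
    merged.update(primary)  # primary wins where both have a sample
    return sorted(merged.items())


# Replica A was replaced during an upgrade and missed two scrapes.
replica_a = {0: 1.0, 15: 1.1, 60: 1.4}
replica_b = {0: 1.0, 15: 1.1, 30: 1.2, 45: 1.3, 60: 1.4}
series = merge_replicas(replica_a, replica_b)
```

The merged series has all five samples, so a query spanning the upgrade window sees no hole as long as the two replicas are never down at the same time.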

Erik Osterman (Cloud Posse)

09:34:20 AM

I’d look into thanos before attempting to build something

Zach

12:57:17 PM

Gotcha. I had gotten the impression that Thanos was for scaling out when I had reached the limits of prometheus being able to store/scrape all my targets. I was mostly looking for a way to just not lose minutes of data if an instance went down. I’ll take a closer look.

Zach

12:26:44 PM

Alright so I’ve been digging into Thanos and I’m stumbling at understanding the AlertManager clustering. This seems to require pre-generated IP addresses in the config files … how does that work in a dynamic environment like an EC2 scaling group or ECS/EKS? Is there some discovery mechanism that I’m not seeing or grokking?

Erik Osterman (Cloud Posse)

03:29:17 PM

We’re using the helm chart by Bitnami. No hardcoded IPs required by end-user.

Zach

03:32:13 PM

Ah, I’m not doing anything with Kubernetes as yet. We’re still running on EC2 only

Zach

03:46:34 PM

Huh, turns out they aren’t doing anything fancy though! That helm chart provisions a load balancer on the AlertManagers, and then uses the first IP address as the cluster IP for discovery
