#sre (2020-06)

Prometheus, Prometheus Operator, Grafana, Kubernetes

Archive: https://archive.sweetops.com/monitoring/

2020-06-02

Chris Fowles avatar
Chris Fowles

nice - it’d be good if they could make it available as a git repo à la PagerDuty

Chris Fowles avatar
Chris Fowles
PagerDuty/incident-response-docs

PagerDuty’s Incident Response Documentation.

2020-06-05

btai avatar

anyone doing long term storage for prometheus?

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

We’ve just rolled out Thanos

btai avatar

how is it? @Erik Osterman (Cloud Posse) I’m investigating TimescaleDB (Postgres)

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

i like Thanos more than TimescaleDB, since TimescaleDB won’t run on RDS while Thanos works with an S3 backend

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

the frontend is stateless so that’s easier

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

but to quote @Jeremy G (Cloud Posse) there will be dragons

2020-06-09

chonan tsai avatar
chonan tsai

we have some async tasks, maybe around 20+ or so. Some of them run at odd hours in the middle of the night and some of them can take up to 20 min to run. I want to get super alerted if something doesn’t run or fails. Looking for advice on dashboarding versus alerting. Currently the team has been trained to keep a close eye on the Sentry alerts that come in through Slack. We had email alerts from AWS in the past but the team tuned them out.

maarten avatar
maarten

Can they run during the day? Nightly jobs came from a time when CPU power was limited. I personally like running daily jobs after lunchtime now :)

Zach avatar

I had my data team integrate with the Prometheus Pushgateway for this, and then I have alarms set if we don’t get a heartbeat update (a gauge where the value is the unix time of the last successful run) within X hours of when they’re supposed to run
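
A minimal sketch of an alerting rule for that heartbeat pattern (the gauge name `job_last_success_unixtime` and the 6-hour threshold are hypothetical; they would need to match whatever the jobs actually push to the Pushgateway):

    groups:
      - name: batch-job-heartbeats
        rules:
          - alert: BatchJobMissedHeartbeat
            # job_last_success_unixtime is the gauge each job pushes to the
            # Pushgateway with the unix time of its last successful run
            expr: time() - job_last_success_unixtime > 6 * 3600
            for: 15m
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.job }} has not reported a successful run in over 6 hours"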

chonan tsai avatar
chonan tsai

@maarten i think some of the jobs can be run during the day. A few jobs run at night because of timing requirements for end-of-day/month reporting (we are a fintech) that depend on 3rd parties, and another reason is that the database typically takes a hit on CPU/RAM when the jobs run. We could point those DB connections at a read replica or scale up, but haven’t gotten around to it.

chonan tsai avatar
chonan tsai

@Zach nice approach.

Zach avatar

PushGateway is a bit of a pain in the ass. Setup is easy but it took a few iterations to get an understanding of how it interacts with prometheus and your queries

chonan tsai avatar
chonan tsai

we don’t have Prometheus. will have to find another solution to hook into it. will investigate a bit.

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

@chonan tsai do you have opsgenie?

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

it supports heartbeats

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

super simple

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

just hit a REST API endpoint (e.g. with curl or in your app)

chonan tsai avatar
chonan tsai

no opsgenie. anything in aws we can use?

chonan tsai avatar
chonan tsai

@Erik Osterman (Cloud Posse)

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

Not that I know of off the top of my head. What do you use for alert escalations and on-call rotations?

chonan tsai avatar
chonan tsai

nothing yet. do you recommend anything? @Erik Osterman (Cloud Posse)

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

Yea, Opsgenie or PagerDuty. Without something like that you’re tempting fate

chonan tsai avatar
chonan tsai

yeah i know. learning it the hard way now.

chonan tsai avatar
chonan tsai

@Erik Osterman (Cloud Posse) We just confirmed that our async tasks are not running at the times we specified. And looking at our Datadog dashboard, load average is consistently around 4 to 5. Could that be causing this?

chonan tsai avatar
chonan tsai

The delay period swings drastically. Sometimes it is a few seconds off but at times it could be a few hours off. Is my CPU overloaded? I have a micro instance for this application

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

what’s managing the scheduling of the jobs? e.g. a naive crontab solution would easily produce inconsistent results in an EC2 autoscale group where nodes are coming online and offline.

chonan tsai avatar
chonan tsai

containerized Celery workers and AWS ElastiCache. i don’t think we have enough workers. need to double-check

btai avatar

anyone have an example of dropping prometheus labels (e.g. pod name, ip) from some of my custom prometheus metrics with a specific prefix? I can’t tell from the relabeling config whether I can drop them from just a subset of metrics

here’s the part of my helmfile where i’m attempting to do this:

        serviceMonitor:
          metricRelabelings:
          - targetLabel: pod
            replacement: ''
          - targetLabel: instance
            replacement: ''
          - sourceLabels: [__name__]
            regex: '(container_tasks_state|container_memory_failures_total)'
            action: drop
Zach avatar

You definitely can drop labels from a subset of metrics, just like that. However I think that as written it might blank those labels on all the metrics, not just the prefixed ones
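
One way to scope it, assuming the custom metrics share a prefix (`myapp_` here is a hypothetical placeholder): with action: replace, the label is only rewritten on samples whose __name__ matches the regex, so everything else keeps its pod and instance labels.

        serviceMonitor:
          metricRelabelings:
          # blank pod/instance only on metrics matching the myapp_ prefix
          - sourceLabels: [__name__]
            regex: 'myapp_.*'
            targetLabel: pod
            replacement: ''
            action: replace
          - sourceLabels: [__name__]
            regex: 'myapp_.*'
            targetLabel: instance
            replacement: ''
            action: replace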

2020-06-15

Zach avatar

I currently have prometheus running on an EC2 instance for my scrapes, as the rest of our applications are all still on EC2. It has a detachable EBS volume so that I can destroy the instance and spin up a new one for upgrades, but this leaves me with a few minutes’ gap in the data while the old instance is down and the new one finishes attaching the EBS and reading back the WAL. Is there some way (and I’m open to moving this to ECS if that would solve it) of doing this with (near) zero downtime in my metrics?

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

Thanos solves this by letting you run multiple prometheus scrapers. Then it has a separate querier service that helps with deduping.
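
A rough sketch of that setup (the label names are hypothetical): each Prometheus replica runs the same scrape config but carries a unique replica external label, and the Thanos Querier is told to treat that label as the replica dimension so overlapping series are deduplicated at query time.

    # prometheus.yml on each of the identical scrapers
    global:
      external_labels:
        cluster: prod      # hypothetical cluster label
        replica: "0"       # "1" on the second replica, and so on

    # the Thanos Querier is then started with:
    #   --query.replica-label=replica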

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

I’d look into thanos before attempting to build something

Zach avatar

Gotcha. I had gotten the impression that Thanos was for scaling out when I had reached the limits of prometheus being able to store/scrape all my targets. I was mostly looking for a way to just not lose minutes of data if an instance went down. I’ll take a closer look.

Zach avatar

Alright so I’ve been digging into Thanos and I’m stumbling at understanding the Alertmanager clustering. This seems to require pre-generated IP addresses in the config files … how does that work in a dynamic environment like an EC2 scaling group or ECS/EKS? Is there some discovery mechanism that I’m not seeing or grokking?

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

We’re using the helm chart by Bitnami. No hardcoded IPs required by end-user.

Zach avatar

Ah I’m not doing anything Kubernetes as yet. We’re still running on EC2 only

Zach avatar

Huh, turns out they aren’t doing anything fancy though! That helm chart provisions a load balancer on the AlertManagers, and then uses the first IP address as the cluster IP for discovery
