Prometheus, Prometheus Operator, Grafana, Kubernetes
nice - it’d be good if they could make it available as a git repo à la PagerDuty
PagerDuty’s Incident Response Documentation. Contribute to PagerDuty/incident-response-docs development by creating an account on GitHub.
anyone doing long term storage for prometheus?
We’ve just rolled out Thanos
how is it? @Erik Osterman (Cloud Posse) Investigating timescaledb (postgres)
i like thanos more than timescaledb since timescaledb won’t run on RDS and thanos works with an S3 backend
the frontend is stateless so that’s easier
but to quote @Jeremy G (Cloud Posse) there will be dragons
we have some async tasks, maybe 20+ or so. Some of them run at odd hours in the middle of the night and some can take up to 20 min to run. I want to get super alerted if something doesn’t run or fails to run. Looking for advice on dashboarding versus alerting. Currently, the team has been trained to keep a close eye on Sentry alerts that come in through Slack. We had email alerts from AWS in the past but the team tuned them out.
Can they run during the day ? Nightly jobs came from a time when cpu power was limited. I personally like the daily jobs after lunch time now :)
I had my data team integrate with prometheus pushgateway for this, and then I have alarms set if we don’t get a heartbeat update (a gauge where the value is the unix time of the last successful run) within X hours of when they’re supposed to run
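For anyone wanting to try the heartbeat approach above, here is a minimal sketch of the job side using only the Python standard library. The gateway address, job name, and metric name are made up for illustration. It relies on the Pushgateway accepting a PUT of text-format metrics at `/metrics/job/<job>`. The alert side would then be a rule along the lines of `time() - batch_job_last_success_timestamp_seconds > X * 3600`, with X chosen per job.

```python
import time
import urllib.parse
import urllib.request

# Hypothetical values: none of these names come from the thread.
PUSHGATEWAY = "http://pushgateway.example.internal:9091"
METRIC = "batch_job_last_success_timestamp_seconds"


def build_heartbeat(job, now=None):
    """Return (url, body) for pushing a last-success gauge for `job`."""
    ts = time.time() if now is None else now
    # The job name is part of the URL path, so escape it.
    url = f"{PUSHGATEWAY}/metrics/job/{urllib.parse.quote(job, safe='')}"
    # Prometheus text exposition format; the body must end with a newline.
    body = (
        f"# TYPE {METRIC} gauge\n"
        f"{METRIC} {int(ts)}\n"
    ).encode("utf-8")
    return url, body


def push_heartbeat(job):
    """Call as the very last step of a job, only after it has succeeded."""
    url, body = build_heartbeat(job)
    req = urllib.request.Request(url, data=body, method="PUT")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Because the gauge only moves when a run succeeds, a job that never starts and a job that crashes halfway both look the same to the alert: the timestamp goes stale.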
@maarten i think some of the jobs can be run during the day. a few jobs run at night because of certain timing requirements for end of day/month reporting (we are a fintech) that depend on 3rd parties. another reason is that the database typically takes a hit on CPU / RAM when the job runs. We could change those db connections to point at a read replica or scale up but haven’t gotten around to it.
@Zach nice approach.
PushGateway is a bit of a pain in the ass. Setup is easy but it took a few iterations to get an understanding of how it interacts with prometheus and your queries
we dont have prometheus. will have to find another solution to hook into it. will investigate a bit.
@chonan tsai do you have opsgenie?
it supports heartbeats
just hit a REST API endpoint (E.g. with curl or in your app)
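A rough sketch of what that ping could look like from inside a job, again standard library only. The heartbeat name and API key are placeholders, and the endpoint shape (`/v2/heartbeats/<name>/ping` with a `GenieKey` authorization header) is my understanding of the Opsgenie Heartbeat API, so check it against their docs before relying on it.

```python
import urllib.parse
import urllib.request

OPSGENIE_API = "https://api.opsgenie.com"


def build_ping(heartbeat_name, api_key):
    """Build the request that marks a heartbeat as alive.

    heartbeat_name and api_key are placeholders supplied by the caller.
    """
    name = urllib.parse.quote(heartbeat_name, safe="")
    return urllib.request.Request(
        f"{OPSGENIE_API}/v2/heartbeats/{name}/ping",
        headers={"Authorization": f"GenieKey {api_key}"},
        method="GET",
    )


def ping(heartbeat_name, api_key):
    """Send the ping; call this only after the job has finished successfully."""
    with urllib.request.urlopen(build_ping(heartbeat_name, api_key), timeout=10) as resp:
        return resp.status
```

The heartbeat itself, including how long it may stay silent before someone gets paged, is configured on the Opsgenie side; the job only has to call `ping`.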
no opsgenie. anything in aws we can use?
@Erik Osterman (Cloud Posse)
Not that I know of off the top of my head. What do you use for alert escalations on on-call rotations?
nothing yet. do you recommend anything? @Erik Osterman (Cloud Posse)
Yea, Opsgenie or PagerDuty. Without something like that you’re tempting fate
yeah i know. learning it the hard way now.
@Erik Osterman (Cloud Posse) We just confirmed that our async tasks are not running at the time we specified. And looking at our datadog dashboard, it says load average is around 4 to 5 consistently. Would this cause this issue to happen?
The delay period swings drastically. Sometimes it is a few seconds off but at times it could be a few hours off. Is my CPU overloaded? I have a micro instance for this application
what’s managing the scheduling of the jobs? e.g. a naive crontab solution would easily produce inconsistent results in an EC2 autoscale group where nodes are coming online and offline.
containerized celery worker and AWS ElastiCache. i don’t think we have enough workers. need to double check
anyone have an example of dropping prometheus labels (i.e. pod name, ip) from some of my custom prometheus metrics with a specific prefix? I can’t tell from the relabeling config whether I can drop it from a subset of metrics
here’s the part of my helmfile where i’m attempting to do this:
serviceMonitor:
  metricRelabelings:
    - targetLabel: pod
      replacement: ''
    - targetLabel: instance
      replacement: ''
    - sourceLabels: [__name__]
      regex: '(container_tasks_state|container_memory_failures_total)'
      action: drop
You definitely can drop labels from a subset of metrics, just like that. However I think that might also drop all the labels
Brian Brazil’s blog mentions this here https://www.robustperception.io/dropping-metrics-at-scrape-time-with-prometheus
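To make the "subset of metrics" part concrete: a `replace` rule with `sourceLabels: [__name__]` and a `regex` only fires for metrics whose name matches, and setting the target label to an empty string removes it. Below is a small Python model of that behaviour, not Prometheus code. The `myapp_` prefix is a made-up example, and the model only handles literal replacements (no `$1` capture groups).

```python
import re


def apply_replace(labels, source_labels, regex, target_label, replacement, separator=";"):
    """Model one relabel rule with action: replace (literal replacement only)."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    if re.fullmatch(regex, value) is None:
        return dict(labels)  # no match: the rule leaves this series alone
    out = dict(labels)
    if replacement == "":
        out.pop(target_label, None)  # an empty label value removes the label
    else:
        out[target_label] = replacement
    return out


# Hypothetical rules: strip pod and instance only from metrics named myapp_*.
RULES = [
    dict(source_labels=["__name__"], regex="myapp_.*", target_label="pod", replacement=""),
    dict(source_labels=["__name__"], regex="myapp_.*", target_label="instance", replacement=""),
]


def relabel(labels):
    """Apply every rule in order, as Prometheus does."""
    for rule in RULES:
        labels = apply_replace(labels, **rule)
    return labels
```

In the ServiceMonitor that would correspond to adding `sourceLabels: [__name__]` and `regex: 'myapp_.*'` to each of the `targetLabel` entries. One caveat: once `pod` and `instance` are gone, series from different targets can end up with identical label sets, so this only makes sense for metrics that are the same on every pod.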
I currently have prometheus running on an EC2 instance for my scrapes, as the rest of our applications are all on EC2 still. It has a detachable EBS volume so that I can destroy the instance and spin up a new one for upgrades, but this leaves me with a few-minutes gap in the data when the old instance is down and waiting for the new one to finish attaching the EBS and reading back the WAL. Is there some way (and I’m open to moving this to ECS if that would solve it) of doing this with (near) zero downtime in my metrics?
Thanos solves this by letting you run multiple prometheus scrapers. Then it has a separate querier service that helps with deduping.
I’d look into thanos before attempting to build something
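A toy illustration of why two scrapers plus deduping closes the gap: if one replica is down for a few minutes, the other still has samples for that window, and the query layer only needs to return one sample per scrape slot. This is a deliberately simplified sketch, not Thanos’s actual deduplication algorithm, and the 15 second interval is an assumed value.

```python
def dedup(replicas, interval=15):
    """Merge samples of the same series scraped by several replicas.

    replicas: list of sample lists, each sample a (timestamp_seconds, value)
    tuple, given in priority order. Samples are grouped into scrape-interval
    slots; the first replica that has a sample in a slot wins.
    """
    merged = {}
    for samples in replicas:
        for ts, value in samples:
            slot = ts // interval
            merged.setdefault(slot, (ts, value))  # keep the higher-priority sample
    return [merged[slot] for slot in sorted(merged)]


# Replica A was replaced between t=15 and t=60, so it missed two scrapes.
replica_a = [(0, 1.0), (15, 2.0), (60, 5.0)]
replica_b = [(1, 1.0), (16, 2.0), (31, 3.0), (46, 4.0), (61, 5.0)]
filled = dedup([replica_a, replica_b])
```

Here `filled` takes the two missing slots from replica B and everything else from replica A, which is the effect you are after when rolling one instance at a time.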
Gotcha. I had gotten the impression that Thanos was for scaling out when I had reached the limits of prometheus being able to store/scrape all my targets. I was mostly looking for a way to just not lose minutes of data if an instance went down. I’ll take a closer look.
Alright so I’ve been digging into Thanos and I’m stumbling at understanding the AlertManager clustering. This seems to require pre-generated IP addresses in the config files … how does that work in a dynamic environment like an EC2 scaling group or ECS/EKS? Is there some discovery mechanism that I’m not seeing or grokking?
We’re using the helm chart by Bitnami. No hardcoded IPs required by end-user.
Ah I’m not doing anything Kubernetes as yet. We’re still running on EC2 only
Huh, turns out they aren’t doing anything fancy though! That helm chart provisions a load balancer on the AlertManagers, and then uses the first IP address as the cluster IP for discovery