#prometheus (2019-03)
Archive: https://archive.sweetops.com/prometheus/
2019-03-04
![Glenn J. Mason avatar](https://avatars.slack-edge.com/2019-03-05/566284293184_61b5e712471d433544a1_72.png)
Hey folks, I’m deploying Prometheus (with helm and the prometheus-operator
chart from stable) into several clusters, but wondering how best to set up the alerting rules? I’ve added a Slack receiver to Alertmanager, and I’m getting alerts there, but … any clues on the process I would use to ensure good alerting on my clusters?
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
Hrmm @Glenn J. Mason can you add some more context around the process piece?
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
Process for updating alerts?
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
Where to define them?
![Glenn J. Mason avatar](https://avatars.slack-edge.com/2019-03-05/566284293184_61b5e712471d433544a1_72.png)
More like … how do I go about discovering what metrics mean “bad, or going bad”? Like … do I just monitor in e.g. grafana and/or make up “sensible” defaults that I think might work? The human process for tuning I mean.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
Have you started by looking over the many community provided dashboards?
![Glenn J. Mason avatar](https://avatars.slack-edge.com/2019-03-05/566284293184_61b5e712471d433544a1_72.png)
Yep, looking, but I was really wondering about the Prometheus alerting rules rather than Grafana dashboards specifically. I think the default rules are pretty good, but (e.g.) we’ve got the micrometer.io actuator getting JVM metrics in, and it’s hard to really be able to say “yep, that’s healthy” — except of course by monitoring and historical trends, etc. Which is certainly useful, and the “normal” way to do these things.
![Glenn J. Mason avatar](https://avatars.slack-edge.com/2019-03-05/566284293184_61b5e712471d433544a1_72.png)
I think it’s the only way to go, perhaps: monitor, decide on “normal” and take it from there.
![Erik Osterman (Cloud Posse) avatar](https://secure.gravatar.com/avatar/88c480d4f73b813904e00a5695a454cb.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0023-72.png)
Yea, you’ll need to establish a baseline before you’ll really know what to alert on
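For the JVM case discussed above, a hypothetical starting point (not a tuned rule) could be a heap-pressure alert. The sketch below assumes Micrometer's default `jvm_memory_used_bytes`/`jvm_memory_max_bytes` metric names and a 90%-for-15m threshold picked arbitrarily as a baseline to iterate on:

```yaml
# Hypothetical baseline rule, to be tuned against observed history.
groups:
  - name: jvm.rules
    rules:
      - alert: JvmHeapNearLimit
        expr: |
          sum(jvm_memory_used_bytes{area="heap"}) by (instance)
            / sum(jvm_memory_max_bytes{area="heap"}) by (instance) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap above 90% on {{ $labels.instance }}"
```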
2019-03-07
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Anyone running Pushgateway in HA?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
We are evaluating multiple options, but could not find a lot of info about it. As far as I can see we could:
- Replicate traffic to multiple PGs from another proxy
- Send reports to multiple PGs
2019-03-11
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
@pecigonzalo what are your goals with running pushgateway with an HA setup?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
to avoid downtime during deployments/etc so it always gets the metrics
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
E.g.: I roll out a new update, the config is bad, I broke the PG, and now I’m losing all PG metrics coming in (backups, etc.)
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
if this was HA, a rolling/blue-green would be really simple
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
have you considered using either DNS + a load balancer – or service discovery in clients – which should help?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I’ve never run PG in production, but if I did, I would avoid running it as a singleton. Is there anything non-trivial involved with running more than one PG (replicas with load-balancing, healthchecks)?
2019-03-12
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
We do have a LB and DNS
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but the problem is that you need the data in both places
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
It’s like having 2 Postgres databases without replication and trying to read from the 2nd one if the first one goes down: you don’t have the same data.
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
to avoid downtime during deployments/etc so it always gets the metrics
E.g.: I roll out a new update, the config is bad, I broke the PG, and now I’m losing all PG metrics coming in (backups, etc.)
if this was HA, a rolling/blue-green would be really simple
I think I’m having trouble understanding what “always gets the metrics” means for your situation.
It’s like having 2 Postgres databases without replication and trying to read from the 2nd one if the first one goes down: you don’t have the same data.
Are you sure you’re using pushgateway to solve a problem for which it is designed & recommended ?
If not, please describe what you’re trying to solve for.
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
if hte first one goes down, you dont have the same data
If the first one goes down, you shouldn’t care, assuming both are equally scraped by prometheus. Prometheus reads the metrics from the first (down) and second (up) and merges metrics+{labelsets} into the same metrics timeseries history.
2019-03-13
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I think I’m having trouble understanding what “always gets the metrics” means for your situation.
You have only 1 PG, you do a failed deployment or it goes down for any reason, and now any job pushing to PG can’t push to it, because it’s down until you fix it/roll back/etc.
Are you sure you’re using pushgateway to solve a problem for which it is designed & recommended ?
Yes, I’m 100% sure, e.g. scheduled jobs like backups or other cron’d or event-triggered things.
If the first one goes down, you shouldn’t care, assuming both are equally scraped by prometheus.
It does matter, because if you have 2 PGs then, unless you push the same data to both, by either a proxy/LB that replicates the data to both or by pushing to both from the clients, you have different data
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
E.g. a backup job runs; after it ran it pushes a metric to PG through an LB.
- PG1 gets the connection and stores the info
- PG2 does not, as the LB only sent to PG1
PG1 goes down, for whatever reason, and Prom starts reading from PG2, which doesn’t contain the metric for the backup. Now your metrics show odd stuff, and might trigger an alert about a missing backup in our case
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
Let me respond to each of those separately:
You have only 1 PG,
Is something forcing you to create a single point of failure?
By deploying 1 PG, you create a SPOF.
Have you considered deploying N+1 ?
you do a failed deployment or it goes down for any reason, now any job pushing to PG cant push to it, because its down until you fix it/rollback/etc
If, as you say, you only have 1 PG and you accidentally deploy a “failed deployment” (read: “a blackhole for metrics”), yes, you’re going to lose data. Load balancers alone won’t help, but healthchecks do. If your infrastructure is always testing that your PG instances are able to accept POST requests, then you have some assurance that active PGs are able to accept metrics.
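As a concrete sketch of that healthchecking idea: Pushgateway exposes `/-/healthy` and `/-/ready` endpoints in recent releases (verify for your version), so a Kubernetes deployment of it might wire probes like this (names and structure illustrative):

```yaml
# Sketch: keep a broken Pushgateway out of rotation via probes.
containers:
  - name: pushgateway
    image: prom/pushgateway
    ports:
      - containerPort: 9091
    readinessProbe:
      httpGet:
        path: /-/ready
        port: 9091
    livenessProbe:
      httpGet:
        path: /-/healthy
        port: 9091
```

An ALB target group healthcheck against the same paths serves the equivalent role outside Kubernetes.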
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Is something forcing you to create a single point of failure?
Are you even reading my messages?! That is exactly what I’m doing, and asking how others are doing that….
do you even run PG? Because tbh it seems you have no clue what you are talking about
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
do you understand PG only keeps the data at the receiving end?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
yes
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
and if you have 2, unless you send the data to the second node as well, you are not going to have redundant data
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
To be honest, it seems like you are trolling at this point
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
are you looking to use pushgateway as a continuous scoreboard? because that’s what it sounds like – but that’s not what it’s designed for.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Again, do you even read?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
absolutely I read and have been trying to understand how you’re using it and why exactly you think it’s a problem to have different data on different PGs
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Why don’t you explain how YOU run your N+1 deployment instead?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
sure
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I was in the middle of composing this when you replied, so I’ll send it now:
Yes, I’m 100% sure, e.g. scheduled jobs like backups or other cron’d or event-triggered things.
If your backups and crons run in the same place (host/instance) then I’d suggest using the Textfile collector [1]
I’ve definitely used it to export the status of a backup script… the script would even update metrics during execution, e.g. it updated a `start_time` metric before triggering the backup, and additional detailed metrics at exit: {`exitcode`, `completion_time`, `last_backup_size_bytes`}.
If using textfile metrics is not possible, that’s fine – just hoping to make you aware it is an option, if you weren’t already.
[1] https://github.com/prometheus/node_exporter#textfile-collector
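A minimal sketch of such a backup script’s metrics writer, in Python rather than shell; the metric names and the target directory are made up for illustration, and node_exporter would need `--collector.textfile.directory` pointing at the same path:

```python
# Sketch: write backup-status metrics for node_exporter's textfile collector.
# Metric names and paths are illustrative, not from the thread.
import os
import tempfile
import time

def write_textfile_metrics(directory, exit_code, size_bytes, now=None):
    """Atomically write backup.prom so node_exporter never scrapes a partial file."""
    now = now if now is not None else time.time()
    lines = [
        "# TYPE backup_last_completion_time_seconds gauge",
        f"backup_last_completion_time_seconds {now:.0f}",
        "# TYPE backup_exit_code gauge",
        f"backup_exit_code {exit_code}",
        "# TYPE backup_size_bytes gauge",
        f"backup_size_bytes {size_bytes}",
    ]
    # Write to a temp file in the same directory, then rename (atomic on POSIX).
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.rename(tmp, os.path.join(directory, "backup.prom"))
```

The atomic rename matters: the collector may scrape at any moment, so the file must never be visible half-written.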
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Not an option, and the textfile option is just a hack tbh: you are basically building up metric state on your node
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
now you sound trollish
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but it’s not an option, this is Lambda
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
ok great - thanks for sharing the environment in which you’re trying to operate
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I’m waiting for your N+1 deployment, instead of how to save metrics to disk
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
pushing lambda/cron/job style metrics is EXACTLY what PG is there for
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
sure.
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
An ALB with N+1 PGs in the target group can receive metrics from your ephemeral jobs.
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
Your prometheus server collects metrics from each PG in the TG individually – it does not scrape them via the ALB
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
rewinding a bit from the scenario above – are you doing anything currently to manage the metrics lifecycle on your current PGs?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
do you delete metrics at all?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
On PG, no I do not delete them, but that has not been an issue nor the topic of my question
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
now, in your scenario
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
You have a Lambda that runs and pushes, e.g., a timestamp of the last successful run (like many metrics do) to PG through the LB. It only lands in 1 of the PGs, let’s say PG1. Now PG1 is down, so your new metrics would go to PG2. Now you have an alert, or want to check something like `((time() - pushgateway_thismetric_last_success) / 3600) > 24` to check the last daily backup ran.
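The threshold logic of that PromQL expression, restated as a tiny Python helper just to make the arithmetic explicit (the function name is made up):

```python
# Mirrors ((time() - pushgateway_thismetric_last_success) / 3600) > 24:
# true when the last successful run is more than max_age_hours old.
def backup_overdue(now_seconds, last_success_seconds, max_age_hours=24):
    return (now_seconds - last_success_seconds) / 3600 > max_age_hours
```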
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
wouldn’t the 2 PG metrics be 2 different metrics (due to the node they come from) and make it so you get a trigger?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
because PG2 might not have data for `pushgateway_thismetric_last_success` yet
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
https://github.com/prometheus/pushgateway
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
so the only situation where what you’ve described creates the situation you sound concerned about would be: your lambda pushes via the LB to `pg1`, after which `pg1` dies without being scraped by your prometheus server.
in all other situations where the lambda pushes via the LB to `pg1` and prometheus scrapes it, you’re safe.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
because of “merges metrics+{labelsets} into the same metrics timeseries history”?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
because your query for `pushgateway_thismetric_last_success` should return the latest metric, whether it arrived via `pg1` or `pg2`
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, that makes sense, cool
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I was under the impression it would not do that, due to `instance` being different basically
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
you have to make sure the `instance` label is being set to `""` on scrape
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yep, they act as basically the same metric
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
that’s the behavior you want
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Indeed
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
if you do have “can’t lose” metrics – you’ll want a more durable path to getting the metrics to prometheus
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, tbh in the rare case of the metric getting pushed to PG1 and prom not scraping in time
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
i can tolerate the bad alert
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Or I guess then persisting to disk or some other exporter would make more sense
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
your questions here got me thinking of a setup: ( api gateway -> lambda sns writer -> sns readers -> dynamodb ) + ( `/metrics` scrape endpoint -> api gw -> lambda dynamo reader )
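To make that idea concrete, here is a toy sketch of the pipeline’s storage/render core, with a plain dict standing in for DynamoDB and none of the API Gateway/Lambda/SNS wiring; every name here is hypothetical:

```python
# Toy sketch: push endpoint stores metrics, /metrics endpoint renders them.
# A dict stands in for DynamoDB; no real AWS wiring.
import time

class MetricStore:
    def __init__(self):
        self._metrics = {}     # job -> {metric_name: value}
        self._push_time = {}   # job -> seconds of last push

    def push(self, job, body, now=None):
        """Accept exposition-format lines like 'some_metric 3.14'."""
        now = now if now is not None else time.time()
        group = self._metrics.setdefault(job, {})
        for line in body.strip().splitlines():
            if not line or line.startswith("#"):
                continue  # skip blanks and HELP/TYPE comments
            name, value = line.rsplit(None, 1)
            group[name] = float(value)  # newest push wins
        self._push_time[job] = now

    def render(self):
        """Render a /metrics response, including a push_time_seconds per job."""
        out = []
        for job in sorted(self._metrics):
            for name, value in sorted(self._metrics[job].items()):
                out.append(f'{name}{{job="{job}"}} {value}')
            out.append(f'push_time_seconds{{job="{job}"}} {self._push_time[job]}')
        return "\n".join(out) + "\n"
```

Because all writers hit the same table, the "two PGs with different data" problem discussed above disappears: the latest push wins regardless of which frontend received it.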
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
that would be better than PG tbh
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
right ? glad you agree
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
you could even ALB -> lambda -> dynamo
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I searched around and didn’t find anything even discussing that setup
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
without SNS and API Gateway
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I was thinking of a fanout for writes to go to multiple readers, in case you wanted cross-region HA
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I guess you could make it have 2 endpoints
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
one for direct push, 1 for fanout or similar
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
or make it extensible for fanout
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
maybe there’s a market for an HA pushgateway
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
as for many people ALB->lambda->dynamo might be enough
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I was also thinking that having TTLs on the dynamo entries would kind-of automate the metrics deletion aspects that you’ve waved your hands at
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
if the entire ALB and lambda of region goes down, i have probably bigger issues
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, tbh we dont have that issue right now, as most of our metrics are not “stale”
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
auto-metrics-deletion gives a lot of rope to folks to create high cardinality issues for themselves.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but if you run most of your stuff in serverless, I guess you do have that problem
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
running your stuff in serverless isn’t a fixed concern – it’s always how you’ve designed your labelsets. you have to be cognizant of their cardinality.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, but pushed on button, as we only run satellite jobs and some simple (but important) metrics from them (backups, cleanups, checks, auto handle of failed events, etc) we dont have too many labels or cardinality
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
you might benefit from having `instance` on a pushed metric in the pushgateway – so you can quickly find+debug the job that pushed it.
but when you scrape it, that `instance` label is no longer useful, so it should be dropped/nulled.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, that basically was my entire confusion with this PG stuff at the end
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
yay!
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
glad to help
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
that is why I wanted to check how others were doing it, because just discarding random stuff in the prometheus config seemed odd, but we needed a way to handle more than 1 PG, so: HA
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
If you don’t yet, I’d suggest you run a blackbox prober scheduled lambda that pushes metrics on an interval, both individually to each PG (avoiding the ALB), as well as through the ALB. This gives you an alertable metric that your PG is not functioning.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
wouldn’t this be covered by `up()`?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but you are also correct, once Prom got the metric from one PG, and as long as your query “accepts” or “discards” labels that make the metrics from any of the PGs different, it should work
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
yes, how you compose your alert rules is key
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I’ll do a local test with Compose, thanks for the help
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
glad we could understand each other at the end
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
same here.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I have one more question @tamsky, if you have a minute
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
If you have PG1 last_success = 123 and PG2 last_success = 124 (because the LB sends one to each), don’t they get overwritten on the Prom side after each scrape and make for odd metrics?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
timestamps should distinguish the two
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
when you push, you push with a timestamp?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
https://github.com/prometheus/pushgateway
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
reading that, maybe I’m confusing it with the timestamps in the textfile exporter
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
rereads that link
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
do you have a `push_time_seconds` for your `last_success` metrics group?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
push_time_seconds gets generated automatically
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
afaik
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
there is a timestamp field tho, which in my test I did not push
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
push_time_seconds
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
```shell
/tmp/tmp.Aj9RooPWfR on master [!] on 🐳 v18.09.3 took 2s
❯ echo "some_metric 3.14" | curl -u "admin:admin" --data-binary @- http://localhost:9091/metrics/job/some_job
/tmp/tmp.Aj9RooPWfR on master [!] on 🐳 v18.09.3
❯ echo "some_metric 3.15" | curl -u "admin:admin" --data-binary @- http://localhost:9092/metrics/job/some_job
```
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
the sample metric looks like
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I see what’s going on. thinking
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
that is the problem I have with my backup metrics as well, that is why I was trying to basically “replicate” traffic to both PGs
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I might be missing something in the config tho idk
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
```yaml
- job_name: "pushgateway"
  scrape_interval: 10s
  honor_labels: true
  static_configs:
    - targets: ["pushgateway1:9091", "pushgateway2:9091"]
```
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
just in case
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
what does a graph of `push_time_seconds` look like?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
the same
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
should be two flat lines
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
no, it’s just one
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
because they don’t have any labels that distinguish them
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
and you pushed at nearly the same time to both 9091 & 9092 ?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
some seconds after
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
no, not at the same time
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
If you have an LB, one push can come now and the other in 10 minutes and hit the other node
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
that is the problem I have with my backup metrics as well, that is why I was trying to basically “replicate” traffic to both PGs
I understand that you want to try to replicate PUT/POST to the PG… but I’m now thinking I’m wrong about HA being a valid design pattern for this.
I was thinking that the PG timestamps operated similarly to the file exporter / text format
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
rewinding to the decision of “I need HA for PG” – what’s driving that decision?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Don’t you run PG in HA? And even you said
By deploying 1 PG, you create a SPOF. Have you considered deploying N+1 ?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
don’t you have the same issue?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I’ve run PG in multiple regions, but haven’t needed HA, single node ASG has been fine.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
So basically no
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
it’s possible HA can still work: you’d add a non-`job`/`instance` label on scrape of each PG with a unique value, and then can query for the most recent group of metrics across all scrapes
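That per-replica label idea might look like this in the scrape config, with `pushgateway_replica` as a made-up label name (with `honor_labels: true`, labels pushed with the metrics still win on conflict, so a non-conflicting name is needed):

```yaml
# Sketch: one target label per Pushgateway replica so series stay distinguishable.
- job_name: "pushgateway"
  honor_labels: true
  static_configs:
    - targets: ["pushgateway1:9091"]
      labels:
        pushgateway_replica: pg1
    - targets: ["pushgateway2:9091"]
      labels:
        pushgateway_replica: pg2
```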
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
and your metrics query looks like a join
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Now to the why, as explained in the question before
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
when you asked that
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
what if you deploy a bad version?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
etc
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
yes.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
will help
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
if you deploy a metrics blackhole, you’re going to lose pushes
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I’m surprised I’m the only one having this “issue” with PG not supporting HA
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
The thing you seem concerned with (“bad config/deploy”) is not unique to PG
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
What if you deployed a bad version of prometheus? Or your lambda functions?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
But you might be able to get there with `max_over_time(push_time_seconds{job="myjob"}[5m])`
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
that is why you run prom in HA
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
with 2+ nodes and even federation
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
and DNS/server/client-based balancing for e.g. Grafana
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
yes, but nobody gets concerned if you lose one of 2 nodes, and that a new node doesn’t have exactly the same set of data that the older node does
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but quite similar
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
it’s clearly not the same; Prometheus has a pseudo-HA config
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I don’t agree with the statement: Prometheus has a pseudo-HA config
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
where you scrape the same targets, and the only difference is due to timing, which you have to live with or not use Prometheus
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
that I agree with.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
that is their way of HA
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
it’s redundancy
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
when running multiple Prometheus servers nobody expects them to store identical timeseries history to disk
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
And I never said that
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
but that’s HA
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
no
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Where does it say that is HA?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
that is maybe your personal take on HA
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
that seemed to be the desired outcome for the PG – that an HA version of PG would replicate writes to all PGs
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
that sort of replicated write does seem like a possible way to configure PG
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
max_over_time(push_time_seconds[1m]) would work for ever-growing metrics, but not for a gauge
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
if you push something like items_processed
you can do that
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
and you get a flapping metric again
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
lets get an example going for max_over_time
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
push_time_seconds{job="myjob",pg="pg1"} 111
push_time_seconds{job="myjob",pg="pg2"} 122
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
where is pg= coming from?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
are you adding instance labels?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
because the client pushing is not aware of which PG it is pushing to
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I mentioned it earlier:
it’s possible HA can still work: you’d add a non-job/instance label on scrape of each PG with a unique value, and then you can query for the most recent group of metrics across all scrapes
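A sketch of that scrape-side labeling; the `pg` label name, ports, and targets are assumptions:

```yaml
scrape_configs:
  - job_name: pushgateway
    honor_labels: true   # keep the job/instance labels the clients pushed
    static_configs:
      - targets: ["pg1:9091"]
        labels:
          pg: pg1        # unique per-gateway label, distinct from job/instance
      - targets: ["pg2:9091"]
        labels:
          pg: pg2
```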
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
OK
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, Prom side
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
yup
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
yeah, doing a join does work, I meant a direct max_over_time does not
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
quite… hacky imho
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I’d think you’d use max_over_time to select the series and pg label that has the most recent timestamp (you’re running ntp, right?)
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
yeah, that could work as a join, but it’s quite a hacky way, as said before
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I agree hacky
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
instead of engineering the PG to be HA have you considered making the PG clients more robust? what fallback mechanisms do you have if PG is down?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yes, I did, as explained when I started this, I could have 2 targets per job
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
that’s just 2x fire-and-forget
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I’m thinking a separate backup path if the primary is down. Backup might not be HTTP.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I think at that point i would rather add a proxy that replicates the data to multiple PGs
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
as it’s a solution that applies to all my lambdas/jobs instead of an implementation per job
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
how do you ameliorate the situation where the proxy misses a write to one of the PGs and the datasets diverge?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
you’ll wind up with a fluctuating graph, like you shared
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
unless you run one prom server per PG ?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but then it’s the same: one alerts, one doesn’t
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
multiple prom will not fix it
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but yeah you are right, if the node returns, we might be back to square one
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I think “be careful” will still apply
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
unless we have non-persistent metrics
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I wasn’t thinking the PG node goes up/down/up
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
in which case the new PG does not have the metrics that the other PG does
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I was simply thinking the proxy fails to communicate a PUT/POST for any number of network reasons – and it doesn’t retry, so that PUT/POST gets dropped on the floor.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I would ask, how often is that bound to happen on an internal network?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
a non-zero number of times
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I didn’t ask how many times, I asked how often
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
divide by choice of duration
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
which is, if this affects only 1% of cases, I’m not really that worried about this scenario
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
how do you ameliorate the situation where the proxy misses a write to one of the PGs and the datasets diverge?
In a number of ways, for the number of problems
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
-.-
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I’m thinking from all this that you have different classes of metrics that you’re pushing – some are frequent and business critical and are categorized “can’t lose”, and others are batch processes, like backups where the metric is useful but losing it isn’t intrinsically harmful.
Is that correct?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Well that is the thing, a metric for backups going missing is quite harmful depending on the period
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
as it could trigger a number of alerts
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
OFC, if the solution to HA PG is more complex than just a couple of silences here and there
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I’ll just silence them when the event presents itself
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but I’m surprised this is not a bigger problem for other use cases, e.g. heavy serverless users
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
triggering alerts definitely has a cost, but maybe there’s a different metric, (for example, via a blackbox prober) that reports the actual status of the latest backup
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
you mean <https://github.com/prometheus/blackbox_exporter>
?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
yes
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, I mean that is basically saying “don’t use PG if you need PG HA”, which is OK
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
maybe PG is not the tool for this
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but I’m asking in a more generic way
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
a blackbox example might be: a script that reads the state of all backups and reports the most recent timestamp
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
how do you run PG in HA or some way of redundancy
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
and there does not seem to be a way to do that
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Just to be clear, I know I can solve this in a number of different ways
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but in this case this is not my question, it is rather, how would you HA PG?
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
then I agree, there is no out of box clustered PG that offers a singular HA datastore
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
or replicated, even if delayed
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
like gossip based for example
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, out of the box nothing, but maybe there are some clever ways to do this
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I’d use the existing singleton PG pattern, use ASG, and put it behind an ALB.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
E.g. a replication proxy (although your comment makes a fair point against it)
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, we do that already basically with a container
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I’d implement a backup channel in the clients to write metrics to SNS and have a persistent reader dequeue and PUT SNS data->PG
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
clients would use the backup channel if the primary PG failed or timed out
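The client-side logic is small; here is a hedged sketch with the actual transports injected as callables — the function and channel names are invented for illustration, not a real library API:

```python
def push_with_fallback(payload, push_primary, enqueue_backup):
    """Try the primary Pushgateway channel; on any failure, hand the
    payload to the backup queue so a persistent reader can replay it
    into the PG later. Returns which channel accepted the payload."""
    try:
        push_primary(payload)
        return "primary"
    except Exception:
        # Primary PG down or timed out: fall back to the durable queue.
        enqueue_backup(payload)
        return "backup"
```

In practice `push_primary` might wrap an HTTP PUT to the gateway and `enqueue_backup` an SQS `send_message` call; the persistent reader then dequeues and PUTs the payloads into the PG once it is reachable again.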
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
The thing is the problem is not the channel tbh, if that was the case, you could argue the same point about any HTTP communication
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but it could be a good solution to have some sort of queue
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
it depends on the criticality of the data – if you have a requirement that you can never lose writes, you need more backup channels
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
so client -> SNS/SQS/whatever -> deque -> PG
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
so if PG goes down, it will just put the metrics once its back up
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I’d throw in a S3 writer if I thought SNS wasn’t reliable enough
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, I agree, but again, I guess you could say the same about any communication
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
maybe an out-of-region backup PG ALB as an alternate
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
and I don’t see this being a pattern for any REST service for example
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
you have this problem even in kafka writers
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
kafka guarantees writes, but only after ACK. what do you do as a writer with critical data that doesn’t receive an ACK.
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
- keep trying
- queue to a different system
- all of the above
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
- fall over and ask for help
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
yeah, but you don’t communicate over a different system
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
normally, you retry or buffer and retry
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
or fall over
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
you don’t have as many options in lambda
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Exactly, but I think it’s not a great architecture pattern to say
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
“all lambdas HTTP to PG or send over SNS if that fails”
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
because the same could be said about any HTTP connection in the same way
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
what about a lambda calling another service?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
should it SNS as well?
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I would rather just make SNS or a queue the comms channel
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
and if that fails, then fall over and cry
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I can agree with that decision… just use SNS for all metrics writes
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
(given you can async)
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
HTTP over SNS
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, queuing the PG metrics might be a sane idea; that mitigates 90% of the issues for not that much extra work
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
actually I’m thinking SQS is the thing we’re talking about, not SNS
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, I said SNS as you said SNS before, but yeah im talking about a queue
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
yeah - we’re talking same
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
the question isn’t how to handle REST failures in lambda, it’s how to deliver unique unreplicated data that will be otherwise lost unless written somewhere
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yep, I think that is accurate
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
my mistake using SNS
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
and hopefully you can engineer it so that some amount of loss is acceptable
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
as opposed to “no amount of loss is acceptable”
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, I guess that depends on the use case; as said, for backups/simple stuff it might be that silence-work < architectural-complexity
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
but I was curious about how this is tackled across places, as I can think of many ways/things where this becomes more critical
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
hear me out for another minute
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
lets say your cron job that performs backups fails to report its status to PG
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
and you have an alarm NoRecentBackupsHaveCompleted
which fires after N+1 hours
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
I think it’s also wise to have another job that reads the most recent backup and exports what it read.
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
lets say that other “backup-reader” job has a metric alarm NoRecentValidBackup
– which would be in state OK
if the backup worked ok
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
you can combine those two alerts to inhibit. you don’t need silences.
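In Alertmanager terms that could look like the following inhibit rule. Note one assumption: Alertmanager inhibits based on a *firing* source alert, not an OK state, so this models the backup-reader as an informational alert that fires while backups verify fine; the alert and label names are illustrative:

```yaml
inhibit_rules:
  - source_matchers:
      - alertname = RecentValidBackupExists   # informational, fires while backups verify OK
    target_matchers:
      - alertname = NoRecentBackupsHaveCompleted
    equal: ["job"]   # only inhibit when both alerts refer to the same backup job
```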
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
lmk if you need more detail.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I mean, yes this will fix it for backups, but while a second metric seems like a good idea for this case anyway, I think it’s rather inelegant for the PG issue itself
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
how this is tackled across places, as i can think of many ways/things where this becomes more critical
I haven’t worked with enough shops that were totally serverless to know the pitfalls.
I do know some folks are very happy with couchbase for event metrics.
I’ve also seen mondemand
used as an event bus (in both multicast and anycast modes).
At some point, durable logging becomes your primary concern, and metrics ingestion becomes secondary.
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
Yeah, that is another option I considered, doing just logging for it
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
And you could even derive metrics from that, like with mtail
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
That reminds me of one other excellent resource that I’ve used, and that’s nsq
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
metrics/events can be written to nsq and tailed, like you mentioned, into different systems
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
I’ll check it out
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)
nsq
is pretty awesome
2019-03-14
![Abel Luck avatar](https://secure.gravatar.com/avatar/0f605397e0ead93a68e1be26dc26481a.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0001-72.png)
anyone using prometheus to monitor an aws stack with the cloudwatch exporter? is it worth using?
![Abel Luck avatar](https://secure.gravatar.com/avatar/0f605397e0ead93a68e1be26dc26481a.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0001-72.png)
I’m also curious how costs compare to using cloudwatch alone vs the api request costs of the exporter
![pecigonzalo avatar](https://avatars.slack-edge.com/2020-02-24/954674862595_11f6ff71106151c32655_72.png)
We use CW-Exporter, but only for things we can’t get native metrics for. I don’t know if that helps you much
2019-03-25
![Abel Luck avatar](https://secure.gravatar.com/avatar/0f605397e0ead93a68e1be26dc26481a.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0001-72.png)
by native metrics, you mean the host metrics from node_exporter?
![Abel Luck avatar](https://secure.gravatar.com/avatar/0f605397e0ead93a68e1be26dc26481a.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0001-72.png)
what would those metrics be that you use CW exporter for?
2019-03-28
![tamsky avatar](https://avatars.slack-edge.com/2019-10-31/817094217669_6e765cea39b456597957_72.jpg)