#prometheus

Archive: https://archive.sweetops.com/prometheus/

2019-10-01

Mark Howard

Question: anyone using Prometheus to monitor Azure PaaS resources?

2019-07-11

tamsky

@Tamlyn Rhodes Avoiding Cloudwatch might be helpful as well. What’s behind the personal or business requirement to use Cloudwatch?

2019-07-02

joshmyers

@Tamlyn Rhodes perhaps influx/telegraf may be helpful; it is like a pipe on steroids and can do prom > cloudwatch with filters, processors, etc.

2019-06-28

Tamlyn Rhodes
cloudposse/prometheus-to-cloudwatch

Utility for scraping Prometheus metrics from a Prometheus client endpoint and publishing them to CloudWatch - cloudposse/prometheus-to-cloudwatch

Tamlyn Rhodes

How do you deal with Prometheus and CloudWatch’s different models of metrics gathering? Prometheus assumes reported metrics are cumulative, whereas CloudWatch assumes each data point reflects a current value.

Tamlyn Rhodes

That leads to funny looking graphs like this in Cloudwatch when a container is restarted.

Erik Osterman

@aknysh can probably help. But it probably comes down to using something like counters vs gauges. Prometheus supports multiple types of metrics whereas I am not sure if CloudWatch does (if so they don’t call it gauge). From working with other monitoring systems it is common to support both. I’d be surprised if there isn’t a way to achieve it.

Tamlyn Rhodes

Thanks. I think gauge type metrics (e.g. current memory usage) work OK but not all metrics can be tracked that way. For instance “total number of requests” needs a counter because it tracks events rather than a value. The problem arises because the Prometheus client in my container reports “123 requests have occurred since the container was restarted” but when this gets forwarded to Cloudwatch it is interpreted as “123 requests occurred right now” and in the next update 30 seconds later it thinks there have been another 123 requests whereas there have been none.
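
For context on the mismatch: on the Prometheus side you would never graph the raw counter; you would wrap it in rate(), which produces the per-interval shape CloudWatch expects. A recording rule like the sketch below (hypothetical counter name) yields that shape, though note the forwarder described here scrapes the client endpoint directly, so a server-side rule is only an illustration of the counter/gauge semantics, not a drop-in fix.

groups:
  - name: request-rate
    rules:
      # turn the cumulative counter into "requests per second over the last 5m"
      - record: job:http_requests:rate5m
        expr: rate(http_requests_total[5m])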

aknysh

@Tamlyn Rhodes I’m not sure how to change the metric types between prometheus and CloudWatch. https://github.com/cloudposse/prometheus-to-cloudwatch is just a proxy that scrapes prometheus URLs, converts the format, and sends the metrics to CloudWatch. It does not assume anything. It might be possible to change the module to do some logic.


aknysh

also take a look at these releases, it might help


Tamlyn Rhodes

OK, thanks for your help. I’ll investigate other approaches.

Tamlyn Rhodes

Have a good weekend

aknysh

if you have any improvements, PRs are welcome


2019-06-25

rohit

@Erik Osterman Hi. Are you planning to talk about prometheus anytime soon during office-hours?

Erik Osterman

#office-hours topics are really driven by whoever attends

Erik Osterman

is there something specific you’re interested in?

rohit

nothing specific, we are thinking about using prometheus

Erik Osterman

I’d be happy to give a demo

Erik Osterman

Our next office hours is tomorrow

rohit

time please

Erik Osterman

Every Wednesday at 11:30 am PST

rohit

bad timing for me (I am in CST), will try to join

rohit

thanks

2019-06-20

tamsky

Has anyone here used https://github.com/weaveworks/prom-aggregation-gateway for aggregating metrics from Lambda functions? Curious if anyone has field notes to share.


Igor Rodionov

I just deployed push gateway on kubernetes


Erik Osterman

@Igor Rodionov deployed something like that. not specifically for lambdas though.

Igor Rodionov
06:42:06 AM

@Igor Rodionov has joined the channel

2019-05-30

Abel Luck

any advice on the simplest prometheus service discovery options? we’ve got a small deployment of dynamic services and I’m looking for something lightweight

Abel Luck

DNS discovery seems like it might be the best, but the docs are lacking: how do you provide a custom /metrics path and port?

tamsky

@Abel Luck have you checked out the example configs for DNS service discovery?

That’s a full example that touches most of the things you mentioned

  • endpoint lookup by DNS name (Lines 96-99)
  • custom /metrics path (Line 90)
  • and to get a custom port, I’d replace Line 91 (scheme: https) with port: <custom#>
Configuration | Prometheus

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

prometheus/prometheus

The Prometheus monitoring system and time series database. - prometheus/prometheus
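
For reference, a minimal dns_sd_configs sketch along the lines of the example being discussed (the SRV name is hypothetical); the custom /metrics path is set on the scrape job, while the host and port come from the SRV record itself:

scrape_configs:
  - job_name: "dynamic-services"
    metrics_path: /metrics            # custom path goes here, not in DNS
    dns_sd_configs:
      - names: ["_metrics._tcp.example.internal"]   # hypothetical SRV record
        type: SRV                     # SRV supplies target host and port
        refresh_interval: 30s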

Abel Luck

@tamsky thanks for the link to the example configs, I didn’t know about that.

However, by custom path/port I meant a config such that the path and port are discovered from the DNS SRV entry

tamsky

Port should be automatic if you’re using SRV records

tamsky

You don’t typically need to adjust the path, as it’s always /metrics on standard exporters. What’s your use case where the /metrics endpoint also needs to be discoverable?

tamsky

afaik, there’s no custom SRV record type that provides more than service port and weight.

2019-04-12

joshmyers

Ja, it’s basically a pipe on steroids, so many inputs/outputs/filters/processors etc

2019-04-11

2019-04-04

2019-04-01

sarkis

I used the CW Exporter at my last job. It was mainly to get ELB metrics; the downside is CW is usually delayed by 15-20 mins, so something to think about if you want to then alert on those metrics :(

Tim Malone

what part of CW was delayed - the metrics ingest? or the alarming? I haven’t noticed any delays with native metrics so I’m curious, because I was thinking of going down the same path for our k8s stuff…

sarkis

wow, I completely missed this - but CW metrics ingest was delayed. I think the issue is due to the specific metrics I was after having a high resolution - best explanation I found was from datadog: https://docs.datadoghq.com/integrations/faq/cloud-metric-delay/


Tim Malone

ohh right, ok that makes sense - in the context of getting the metrics out to another service. although - that delay is still a lot. but, thanks for the context!

joshmyers

https://github.com/influxdata/telegraf is a pretty cool swiss army knife for this stuff


tamsky

Thanks Josh, I didn’t know telegraf could have a Prometheus-compatible /metrics endpoint.

https://github.com/influxdata/telegraf/tree/master/plugins/outputs/prometheus_client


2019-03-28

tamsky

random guess where CW-Exporter might be useful: ALB metrics


2019-03-25

Abel Luck

by native metrics, you mean the host metrics from node_exporter?

Abel Luck

what would those metrics be that you use CW exporter for?

2019-03-14

Abel Luck

anyone using prometheus to monitor an aws stack with the cloudwatch exporter? is it worth using?

Abel Luck

I’m also curious how costs compare to using cloudwatch alone vs the api request costs of the exporter

We use CW-Exporter, but only for things we can’t get native metrics for. I don’t know if that helps you much
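
For reference, the CloudWatch exporter is driven by a small YAML config listing which namespaces and metrics to pull; a hedged sketch for ELB/ALB-style metrics (namespace, metric, and statistic choices are illustrative, and each listed metric costs CloudWatch API requests, which is the cost trade-off mentioned above):

region: eu-west-1
metrics:
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: RequestCount
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Sum]
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: TargetResponseTime
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Average]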

2019-03-13


I think I’m having trouble understanding what “always gets the metrics” means for your situation.
You have only 1 PG, you do a failed deployment or it goes down for any reason, and now any job pushing to PG can’t push to it, because it’s down until you fix it/roll back/etc.
Are you sure you’re using pushgateway to solve a problem for which it is designed & recommended ?
Yes, I’m 100% sure, e.g. scheduled jobs like backups or other cron-based or event-triggered things.
If the first one goes down, you shouldn’t care, assuming both are equally scraped by prometheus.
It does matter, because if you have 2 PGs, unless you push the same data to both (by either a proxy/LB that replicates the data to both, or by pushing to both from the clients), you have different data.

e.g. a backup job runs; after it runs it pushes a metric to PG through an LB.

  • PG1 gets the connection and stores the info
  • PG2 does not, as the LB only sent to PG1

PG1 goes down for whatever reason, and Prom starts reading from PG2, which doesn’t contain the metric for the backup. Now your metrics show odd stuff, and might trigger an alert about a missing backup in our case.

tamsky

Let me respond to each of those separately:
You have only 1 PG,

Is something forcing you to create a single point of failure? By deploying 1 PG, you create a SPOF. Have you considered deploying N+1 ?
you do a failed deployment or it goes down for any reason, now any job pushing to PG cant push to it, because its down until you fix it/rollback/etc

If, as you say, you only have 1 PG and you accidentally deploy a “failed deployment” (read: “a blackhole for metrics”), yes, you’re going to lose data. Load balancers alone won’t help, but healthchecks do. If your infrastructure is always testing that your PG instances are able to accept POST requests, then you have some assurances that active PG are able to accept metrics.


Is something forcing you to create a single point of failure?
Are you even reading my messages?! That is exactly what I’m doing, and I’m asking how others are doing that…

Do you even run PG? Because tbh it seems you have no clue what you are talking about.

Do you understand PG only keeps the data at the receiving end?

tamsky

yes

and if you have 2, unless you send the data to the second node as well, you are not going to have redundant data

To be honest, it seems like you are trolling at this point

tamsky

are you looking to use pushgateway as a continuous scoreboard? because that’s what it sounds like – but that’s not what it’s designed for.

Again, do you even read?

tamsky

absolutely I read and have been trying to understand how you’re using it and why exactly you think it’s a problem to have different data on different PGs

Why don’t you explain how YOU run your N+1 deployment instead?

tamsky

sure

tamsky

I was in the middle of composing this when you replied, so I’ll send it now:
Yes im 100% sure, eg scheduled jobs like backups or other croned or event triggered things.

If your backups and crons run in the same place (host/instance) then I’d suggest using the Textfile collector [1]. I’ve definitely used it to export the status of a backup script… the script would even update metrics during execution, e.g. it updated a start_time metric before triggering the backup and additional detailed metrics at exit: {exitcode, completion_time, last_backup_size_bytes}.

If using textfile metrics is not possible, that’s fine – just hoping to make you aware it is an option, if you weren’t already.

[1] https://github.com/prometheus/node_exporter#textfile-collector


Not an option, and the textfile option is just a hack tbh, you are basically building metric state in your node

tamsky

now you sound trollish

but not an option, this is lambda

tamsky

ok great - thanks for sharing the environment in which you’re trying to operate

I’m waiting for your N+1 deployment, instead of how to save metrics to disk

pushing lambda/cron/job style metrics is EXACTLY what PG is there for

tamsky

sure.

tamsky

An ALB with N+1 PG’s in the targetgroup can receive metrics from your ephemeral jobs.

tamsky

Your prometheus server collects metrics from each PG in the TG individually – it does not scrape them via the ALB

tamsky

rewinding a bit from the scenario above – are you doing anything currently to manage the metrics lifecycle on your current PG’s ?

tamsky

do you delete metrics at all?

On PG, no I do not delete them, but that has not been an issue nor the topic of my question

now, in your scenario

you have a lambda that runs and pushes, e.g., a timestamp of the last successful run (like many metrics do) to PG through the LB. It only lands in one of the PGs, let’s say PG1. Now PG1 is down, so your new metrics would go to PG2. Now you have an alert, or want to check something like `((time() - pushgateway_thismetric_last_success) / 3600) > 24`, to check that the last daily backup did run

wouldn’t the 2 PG metrics be 2 different metrics (due to the node they come from) and make it so you get a trigger?

because PG2 might not have data for pushgateway_thismetric_last_success yet
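
For reference, the check described above written as a Prometheus alerting rule (metric name taken from the query above; threshold and severity are illustrative). With two independently fed pushgateways, the PG that never received the push simply has no such series, which is exactly the divergence being discussed:

groups:
  - name: backup-freshness
    rules:
      - alert: DailyBackupMissing
        # fires when the last recorded success is more than 24h old
        expr: (time() - pushgateway_thismetric_last_success) / 3600 > 24
        for: 15m
        labels:
          severity: warning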

tamsky
prometheus/pushgateway

Push acceptor for ephemeral and batch jobs. Contribute to prometheus/pushgateway development by creating an account on GitHub.

tamsky

so the only situation where what you’ve described creates the situation you sound concerned about would be: your lambda pushes via the LB to pg1, after which pg1 dies without being scraped by your prometheus server. in all other situations where lambda pushes via the LB to pg1 and prometheus scrapes it, you’re safe.

because it merges metrics+{labelsets} into the same metrics timeseries history?

tamsky

because your query for pushgateway_thismetric_last_success should return the latest metric, whether it arrived via pg1 or pg2

Yeah, that makes sense, cool

I was under the impression it would not do that due to instance being different basically

tamsky

you have to make sure the instance label is being set to ("") on scrape

Yep, they act as basically the same metric

tamsky

that’s the behavior you want

Indeed
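
A minimal scrape-config sketch of what tamsky describes here, the same shape as the config pasted further down the thread (target names are illustrative): honor_labels keeps the labels exactly as they were pushed, and the relabel rule blanks the per-target instance label so samples scraped from either pushgateway land in the same series.

scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true            # keep job/instance labels as pushed
    static_configs:
      - targets: ["pushgateway1:9091", "pushgateway2:9091"]
    metric_relabel_configs:
      # force instance to "" so the two PG targets don't split the series
      - action: replace
        target_label: instance
        replacement: ""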

tamsky

if you do have “can’t lose” metrics – you’ll want a more durable path to getting the metrics to prometheus

Yeah, tbh in the rare case of the metric getting pushed to PG1 and prom not scraping in time

I can tolerate the bad alert

Or I guess then persisting to disk or some other exporter would make more sense

tamsky

your questions here got me thinking of a setup: (api gateway -> lambda sns writer -> sns readers -> dynamodb) + (/metrics scrape endpoint -> api gw -> lambda dynamo reader)


that would be better than PG tbh

tamsky

right ? glad you agree

you could even ALB -> lambda -> dynamo

tamsky

I searched around and didn’t find anything even discussing that setup

without SNS and API Gateway

tamsky

I was thinking of a fanout for writes to go to multiple readers, in case you wanted cross-region HA

I guess you could make it have 2 endpoints

one for direct push, 1 for fanout or similar

or make it extensible for fanout

tamsky

maybe there’s a market for an HA pushgateway

as for many people ALB->lambda->dynamo might be enough

tamsky

I was also thinking that having TTLs on the dynamo entries would kind-of automate the metrics deletion aspects that you’ve waved your hands at


if the entire ALB and lambda of a region go down, I probably have bigger issues

Yeah, tbh we don’t have that issue right now, as most of our metrics are not “stale”

tamsky

auto-metrics-deletion gives a lot of rope to folks to create high cardinality issues for themselves.

but if you run most of your stuff in serverless, I guess you do have that problem

tamsky

running your stuff in serverless isn’t a fixed concern – it’s always how you’ve designed your labelsets. you have to be cognizant of their cardinality.

Yeah, but to that point: as we only run satellite jobs and push some simple (but important) metrics from them (backups, cleanups, checks, auto-handling of failed events, etc.), we don’t have too many labels or much cardinality

tamsky

you might benefit from having instance on a pushed metric in the pushgateway – so you can quickly find+debug the job that pushed it. but when you scrape it, that instance label is no longer useful, so it should be dropped/nulled.

Yeah, that basically was my entire confusion with this PG stuff at the end

tamsky

yay!

tamsky

glad to help

that is why I wanted to check how others were doing it, because just discarding random stuff in the prometheus config seemed odd, but we needed a way to handle more than 1 PG so we’d have HA

tamsky

If you don’t yet, I’d suggest you run a blackbox prober scheduled lambda that pushes metrics on an interval, both individually to each PG (avoiding the ALB), as well as through the ALB. This gives you an alertable metric that your PG is not functioning.

wouldn’t this be covered by up()?

but you are also correct, once Prom got the metric from one PG, and as long as your query “accepts” or “discards” labels that make the metrics from any of the PGs different, it should work

tamsky

yes, how you compose your alert rules is key

I’ll do a local test with compose, thanks for the help

glad we could understand each other in the end

tamsky

same here.

I have one more question @tamsky, if you have a minute

if you have PG1 last_success = 123 and PG2 last_success = 124 (because the LB sends one to each), don’t they get overwritten on the Prom side after each scrape and make for odd metrics?

tamsky

timestamps should distinguish the two

tamsky

when you push, you push with a timestamp?

tamsky
prometheus/pushgateway

Push acceptor for ephemeral and batch jobs. Contribute to prometheus/pushgateway development by creating an account on GitHub.

tamsky

reading that, maybe I’m confusing it with the timestamps in the textfile exporter

tamsky
05:50:10 PM

rereads that link

tamsky

do you have a push_time_seconds for your last_success metrics group?

push_time_seconds gets generated automatically

afaik

there is a timestamp field tho, which in my test I did not push

06:02:38 PM

push_time_seconds

/tmp/tmp.Aj9RooPWfR on master [!] on 🐳 v18.09.3 took 2s
❯ echo "some_metric 3.14" | curl -u "admin:admin" --data-binary @- http://localhost:9091/metrics/job/some_job

/tmp/tmp.Aj9RooPWfR on master [!] on 🐳 v18.09.3
❯ echo "some_metric 3.15" | curl -u "admin:admin" --data-binary @- http://localhost:9092/metrics/job/some_job
06:03:36 PM

the sample metric looks like

tamsky

I see what’s going on. thinking

that is the problem I have with my backups metrics as well, that is why I was trying to basically “replicate” traffic to both PGs

I might be missing something in the config tho idk

  - job_name: "pushgateway"
    scrape_interval: 10s
    honor_labels: true
    static_configs:
      - targets: ["pushgateway1:9091", "pushgateway2:9091"]

just in case

tamsky

what does a graph of push_time_seconds look like?

the same

tamsky

should be two flat lines

no, it’s just one

because they don’t have any labels that distinguish them

tamsky

and you pushed at nearly the same time to both 9091 & 9092 ?

some seconds after

no, not at the same time

If you have an LB, one push can come now and the other in 10 minutes and hit the other node

tamsky


that is the problem i have with my backups metrics as well, that is why i was trying to basically “replicate” traffic to both PGs

I understand that you want to try to replicate PUT/POST to the PG… but I’m now thinking I’m wrong about HA being a valid design pattern for this.

I was thinking that the PG timestamps operated similarly to the file exporter / text format

Exposition formats | Prometheus

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

tamsky

rewinding to the decision of “I need HA for PG” – what’s driving that decision?

Don’t you run PG in HA? And even you said:
By deploying 1 PG, you create a SPOF. Have you considered deploying N+1 ?

don’t you have the same issue?

tamsky

I’ve run PG in multiple regions, but haven’t needed HA, single node ASG has been fine.

So basically no

tamsky

it’s possible HA can still work: you’d add a non-job/instance label on scrape of each PG with a unique value, and then you can query for the most recent group of metrics across all scrapes

tamsky

and your metrics query looks like a join

Now to the why, as explained in the question before

when you asked that

what if you deploy a bad version?

tamsky

yes.

will help

tamsky

if you deploy a metrics blackhole, you’re going to lose pushes

I’m surprised I’m the only one having this “issue” with PG not supporting HA

tamsky

The thing you seem concerned with (“bad config/deploy”) is not unique to PG

tamsky

What if you deployed a bad version of prometheus? Or your lambda functions?

tamsky

But you might be able to get there with max_over_time(push_time_seconds{job="myjob"}[5m])
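
A sketch of that idea as an alerting rule, combined with a max() across the pushgateway targets so a PG that missed the write (or a freshly replaced PG) doesn’t produce a false alarm; the job name, window, and threshold are illustrative:

groups:
  - name: pushgateway-freshness
    rules:
      - alert: NoRecentPushSeen
        # freshest push seen on ANY pushgateway target in the last hour
        expr: time() - max(max_over_time(push_time_seconds{job="myjob"}[1h])) > 86400
        for: 15m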

that is why you run prom in HA

with 2+ nodes and even federation

and DNS/server/client-based balancing for e.g. Grafana

tamsky

yes, but nobody gets concerned if you lose one of 2 nodes, and that a new node doesn’t have exactly the same set of data that the older node does

but quite similar

it’s clearly not the same; prometheus has a pseudo-HA config

tamsky

I don’t agree with the statement: prometheus has a pseudoHA config

where you scrape the same targets, and the only difference is due to timing, which you have to live with or not use prometheus

tamsky

that I agree with.

that is their way of HA

tamsky

it’s redundancy

tamsky

when running multiple promservers nobody expects them to store identical timeseries history to disk

And I never said that

tamsky

but that’s HA

Where does it say that is HA?

that is maybe your personal take on HA

tamsky

that seemed to be the desired outcome for the PG – that an HA version of PG would replicate writes to all PGs

tamsky

that sort of replicated write does seem like a possible way to configure PG

max_over_time(push_time_seconds[1m]) would work for ever-growing metrics, but not for a gauge

if you push something like items_processed you can do that

and you get a flapping metric again

tamsky

lets get an example going for max_over_time

tamsky

push_time_seconds{job="myjob",pg="pg1"} 111
push_time_seconds{job="myjob",pg="pg2"} 122

where is pg= coming from?

are you adding instance labels?

because the client pushing is not aware of which PG it is pushing to

tamsky

I mentioned it earlier:
it’s possible HA can still work: you’d add a non-job/instance label on scrape of each PG with a unique value, and then you can query for the most recent group of metrics across all scrapes

Yeah, Prom side

tamsky

yup

yeah, doing a join does work, I meant a direct max_over_time does not

quite… hacky imho

tamsky

I’d think you’d use max_over_time to select the series and pg label that has the most recent timestamp (you’re running NTP, right?)

yeah, that could work as a join, but it’s quite a hacky way, as said before

tamsky

I agree hacky

tamsky

instead of engineering the PG to be HA have you considered making the PG clients more robust? what fallback mechanisms do you have if PG is down?

Yes, I did, as explained when I started this, I could have 2 targets per job

tamsky

that’s just 2x fire-and-forget

tamsky

I’m thinking a separate backup path if the primary is down. Backup might not be HTTP.

I think at that point I would rather add a proxy that replicates the data to multiple PGs

as it’s a solution that applies to all my lambdas/jobs instead of an implementation per job

tamsky

how do you ameliorate the situation where the proxy misses a write to one of the PGs and the datasets diverge?

tamsky

you’ll wind up with a fluctuating graph, like you shared

tamsky

unless you run one prom server per PG ?

but then it’s the same: one alerts, one doesn’t

multiple Proms will not fix it

but yeah, you are right, if the node comes back we might be back to square one

tamsky

I think “be careful” will still apply

unless we have non-persistent metrics

tamsky

I wasn’t thinking the PG node goes up/down/up

in which case the new PG does not have the metrics that the other PG does

tamsky

I was simply thinking the proxy fails to communicate a PUT/POST for any number of network reasons – and it doesn’t retry, so that PUT/POST gets dropped on the floor.

I would ask, how often is that bound to happen on an internal network?

tamsky

a non-zero number of times

tamsky

I didn’t ask how many times, I asked how often

tamsky

divide by choice of duration

which is, if this affects only 1% of the cases, I’m not really that worried about this scenario


how do you ameliorate the situation where the proxy misses a write to one of the PGs and the datasets diverge?
In a number of ways, for the number of problems

tamsky

I’m thinking from all this that you have different classes of metrics that you’re pushing – some are frequent and business critical and are categorized “can’t lose”, and others are batch processes, like backups where the metric is useful but losing it isn’t intrinsically harmful.

is that correct?

Well that is the thing, a metric for backups going missing is quite harmful depending on the period

as it could trigger a number of alerts

Of course, if the solution to HA PG is more complex than just a couple of silences here and there

I’ll just silence them when the event presents itself

but I’m surprised this is not a bigger problem for other use cases, e.g. heavy serverless users

tamsky

triggering alerts definitely has a cost, but maybe there’s a different metric, (for example, via a blackbox prober) that reports the actual status of the latest backup

you mean https://github.com/prometheus/blackbox_exporter?

tamsky

yes

Yeah, I mean that is basically saying “don’t use PG if you need PG HA”, which is OK

maybe PG is not the tool for this

but I’m asking in a more generic way

tamsky

a blackbox example might be: a script that reads the state of all backups and reports the most recent timestamp

how do you run PG in HA, or with some kind of redundancy?

and there does not seem to be a way to do that

Just to be clear, I know I can solve this in a number of different ways

but in this case that is not my question; it is rather, how would you HA PG?

tamsky

then I agree, there is no out-of-the-box clustered PG that offers a singular HA datastore

or replicated, even if delayed

like gossip-based, for example

Yeah, out of the box nothing, but maybe there are some clever ways to do this

tamsky

I’d use the existing singleton PG pattern, use ASG, and put it behind an ALB.

E.g. a replication proxy (although your comment makes a fair point against it)

Yeah, we do that already basically with a container

tamsky

I’d implement a backup channel in the clients to write metrics to SNS and have a persistent reader dequeue and PUT SNS data->PG

tamsky

clients would use the backup channel if the primary PG failed or timed out

The thing is the problem is not the channel tbh, if that was the case, you could argue the same point about any HTTP communication

but it could be a good solution to have some sort of queue

tamsky

it depends on the criticality of the data – if you have a requirement that you can never lose writes, you need more backup channels

so client -> SNS/SQS/whatever -> dequeue -> PG

so if PG goes down, it will just put the metrics once it’s back up

tamsky

I’d throw in a S3 writer if I thought SNS wasn’t reliable enough

Yeah, I agree, but again, I guess you could say the same about any communication

tamsky

maybe an out-of-region backup PG ALB as an alternate

and I don’t see this being a pattern for any REST service, for example

tamsky

you have this problem even in kafka writers

tamsky

kafka guarantees writes, but only after ACK. what do you do as a writer with critical data that doesn’t receive an ACK.

tamsky
  1. keep trying
  2. queue to a different system
  3. all of the above
tamsky
  4. fall over and ask for help

yeah, but you don’t communicate over a different system

normally, you retry, or buffer and retry

or fall over

tamsky

you don’t have as many options in lambda

Exactly, but I think it’s not a great architecture pattern to say

“all lambdas HTTP to PG, or send over SNS if that fails”

because the same could be said about any HTTP connection in the same way

what about a lambda calling another service?

should it use SNS as well?

I would rather just make SNS or a queue the comms channel

and if that fails, then fall over and cry

tamsky

I can agree with that decision… just use SNS for all metrics writes

(given you can async)

tamsky

HTTP over SNS

Yeah, queuing the PG metrics might be a sane idea; that mitigates 90% of the issues for not that much extra work

tamsky

actually I’m thinking SQS is the thing we’re talking about, not SNS

Yeah, I said SNS as you said SNS before, but yeah, I’m talking about a queue

tamsky

yeah - we’re talking about the same thing


so client -> SNS/SQS/whatever -> dequeue -> PG

as said before

tamsky

the question isn’t how to handle REST failures in lambda, it’s how to deliver unique unreplicated data that will be otherwise lost unless written somewhere

Yep, i think that is accurate

tamsky

my mistake using SNS

tamsky

and hopefully you can engineer it so that some amount of loss is acceptable

tamsky

as opposed to “no amount of loss is acceptable”

Yeah, I guess that depends on the use case; as said, for backups/simple stuff it might be that the silence work < the architectural complexity

but I was curious about how this is tackled across places, as I can think of many ways/things where this becomes more critical

tamsky

hear me out for another minute

tamsky

lets say your cron job that performs backups fails to report its status to PG

tamsky

and you have an alarm NoRecentBackupsHaveCompleted which fires after N+1 hours

tamsky

I think it’s also wise to have another job that reads the most recent backup and exports what it read.

tamsky

lets say that other “backup-reader” job has a metric alarm NoRecentValidBackup – which would be in state OK if the backup worked ok

tamsky

you can combine those two alerts to inhibit. you don’t need silences.

tamsky

lmk if you need more detail.
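
A hedged sketch of that inhibition in alertmanager.yml, assuming the backup-reader’s NoRecentValidBackup is the authoritative signal: while it is firing, the metric-delivery alert NoRecentBackupsHaveCompleted is suppressed so only one notification goes out (the equal labels are illustrative):

inhibit_rules:
  - source_match:
      alertname: NoRecentValidBackup
    target_match:
      alertname: NoRecentBackupsHaveCompleted
    equal: ['job']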

I mean, yes, this will fix it for backups, but while a second metric seems like a good idea for this case anyway, I think it’s rather inelegant for the PG issue itself

tamsky


how this is tackled across places, as i can think of many ways/things where this becomes more critical

I haven’t worked with enough shops that were totally serverless to know the pitfalls. I do know some folks are very happy with couchbase for event metrics. I’ve also seen mondemand used as an event bus (in both multicast and anycast modes). At some point, durable logging becomes your primary concern, and metrics ingestion becomes secondary.

Yeah, that is another option I considered, doing just logging for it

And you could even derive metrics from that, like with mtail

tamsky

That reminds me of one other excellent resource that I’ve used, and that’s nsq

tamsky

metrics/events can be written to nsq and tailed like you mentioned into different systems

I’ll check it out

tamsky

nsq is pretty awesome

2019-03-12

We do have an LB and DNS

but the problem is that you need the data in both places

it’s like having 2 postgres databases without replication and trying to read from the 2nd one; if the first one goes down, you don’t have the same data

tamsky


to avoid downtime during deployments/etc so it always gets the metrics
EG: I roll a new update, the config is bad, I broke the PG, and now I’m losing all PG metrics coming in (backups, etc)
if this were HA, a rolling/blue-green deploy would be really simple

I think I’m having trouble understanding what “always gets the metrics” means for your situation.
it’s like having 2 postgres databases without replication and trying to read from the 2nd one; if the first one goes down, you don’t have the same data

Are you sure you’re using pushgateway to solve a problem for which it is designed & recommended ?

If not, please describe what you’re trying to solve for.

When to use the Pushgateway | Prometheus

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

tamsky


if the first one goes down, you don’t have the same data

If the first one goes down, you shouldn’t care, assuming both are equally scraped by prometheus. Prometheus reads the metrics from the first (down) and second (up) and merges metrics+{labelsets} into the same metrics timeseries history.

2019-03-11

tamsky

@ what are your goals with running pushgateway with an HA setup?

to avoid downtime during deployments/etc so it always gets the metrics

EG: I roll a new update, the config is bad, I broke the PG, and now I’m losing all PG metrics coming in (backups, etc)

if this were HA, a rolling/blue-green deploy would be really simple

tamsky

have you considered using either DNS + a load balancer – or service discovery in clients – which should help?

tamsky

I’ve never run PG in production, but if I did, I would avoid running it as a singleton. Is there anything non-trivial involved with running more than one PG (replicas with load-balancing, healthchecks)?

2019-03-07

Anyone running Pushgateway in ~HA~ “HA”?

We are evaluating multiple options but could not find a lot of info about it. As far as I can see, we could:

  • Replicate traffic to multiple PGs from another proxy
  • Send reports to multiple PGs

2019-03-04

Glenn J. Mason

Hey folks, I’m deploying Prometheus (with helm and the prometheus-operator chart from stable) into several clusters, but wondering how best to set up the alerting rules? I’ve added a Slack receiver to Alertmanager, and I’m getting alerts there, but … any clues on the process I would use to ensure good alerting on my clusters?
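
For reference, with the prometheus-operator chart, alerting rules are typically added as PrometheusRule resources; a minimal sketch (rule name, threshold, and the release label that your chart’s ruleSelector matches are all illustrative, and the JVM metric names assume Micrometer’s usual naming):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: jvm-alerts
  labels:
    release: prometheus-operator      # must match the chart's ruleSelector
spec:
  groups:
    - name: jvm
      rules:
        - alert: JvmHeapNearLimit
          # heap usage ratio per pod; 0.9 is a starting point, tune from a baseline
          expr: |
            sum by (pod) (jvm_memory_used_bytes{area="heap"})
              / sum by (pod) (jvm_memory_max_bytes{area="heap"}) > 0.9
          for: 10m
          labels:
            severity: warning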

Erik Osterman

Hrmm @Glenn J. Mason can you add some more context around the process piece?

Erik Osterman

Process for updating alerts?

Erik Osterman

Where to define them?

Glenn J. Mason

More like … how do I go about discovering what metrics mean “bad, or going bad”? Like … do I just monitor in e.g. grafana and/or make up “sensible” defaults that I think might work? The human process for tuning I mean.

Erik Osterman

Have you started by looking over the many community provided dashboards?

Glenn J. Mason

Yep, looking, but I was really wondering about the Prometheus alerting rules rather than Grafana dashboards specifically. I think the default rules are pretty good, but (e.g.) we’ve got the http://micrometer.io actuator getting JVM metrics in, and it’s hard to really be able to say “yep, that’s healthy” — except of course by monitoring and historical trends, etc. Which is certainly useful, and the “normal” way to do these things.

Glenn J. Mason

I think it’s the only way to go, perhaps: monitor, decide on “normal” and take it from there.

Erik Osterman

Yea, you’ll need to establish a baseline before you’ll really know what to alert on

2019-02-28

mk
08:57:59 AM

@mk has joined the channel

2019-02-20

thirstydeveloper
06:18:22 PM

@thirstydeveloper has joined the channel

2019-02-14

Erik Osterman
05:21:48 AM

@Erik Osterman set the channel purpose: Archive: https://archive.sweetops.com/prometheus/

2019-02-12

iautom8things
10:11:42 PM

@iautom8things has joined the channel

2019-01-24

Jan
01:02:54 PM

@Jan has joined the channel


2019-01-14


2018-12-12

HoneyBadger
10:56:02 PM

@HoneyBadger has joined the channel

2018-12-08

richwine
01:56:28 PM

@richwine has joined the channel

2018-12-05

mallen
09:22:17 PM

@mallen has joined the channel

2018-11-30

Erik Osterman
11:27:38 PM

@Erik Osterman set the channel topic:


2018-11-29

tamsky
10:04:48 PM

@tamsky has joined the channel

Erik Osterman
10:32:34 PM

@Erik Osterman has joined the channel

2018-11-28

rohit
03:52:55 AM

@rohit has joined the channel

2018-11-22

Bogdan
10:55:12 PM

@Bogdan has joined the channel

2018-11-18


2018-11-14

Yoann
07:15:05 AM

@Yoann has joined the channel

2018-11-13


2018-11-12

sarkis
04:11:27 PM

@sarkis has joined the channel

aknysh
04:19:59 PM

@aknysh has joined the channel

Nikola Velkovski
04:28:20 PM

@Nikola Velkovski has joined the channel

joshmyers
03:55:19 AM

@joshmyers has joined the channel
