#sre (2020-09)

Prometheus, Prometheus Operator, Grafana, Kubernetes

Archive: https://archive.sweetops.com/monitoring/


Erik Osterman (Cloud Posse)

For those of you who missed our #office-hours , here’s an explanation of how we’re managing opsgenie with terraform: https://www.youtube.com/watch?v=fXNajuC4L1o

Erik Osterman (Cloud Posse)

GitHub repository: cloudposse/terraform-opsgenie-incident-management


Jayesh Patel

Does anyone know how to set this Prometheus alert for a specific time duration using an inhibition rule?

  - alert: TestEventBacklogCritical
    expr: sum(test_depth{topic="events",paused="false"}) >= 150000
    for: 15m
    labels:
      severity: page
    annotations:
      description: Event queue has reached 150K for at least 15m
      summary: Event queue has reached 150K

Ian Bartholomew

Has anyone worked with either DataDog synthetics or AWS CloudWatch synthetics? We are evaluating them, and are curious if anyone has prior experience one way or the other with them. Thanks!

Chris Fowles

are you already using datadog?

Ian Bartholomew

yah, we are already using both them and AWS

Zach

cloudwatch synthetics seemed ridiculously expensive when it was released

Chris Fowles

yeh - that’s kind of where i was going to lead. i think our cost estimates had datadog at 1/3 to 1/2 the price of cloudwatch

Ian Bartholomew

interesting, i think that our estimates had it the other way around. DD is $5 per 1k runs / mo which for one that ran every minute would be $220 / mo, while CW was $0.0012 per run, which would be $51.84 for one that ran every minute. Maybe my math is wrong?
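
A quick sketch of the arithmetic in the message above, assuming a 30-day month and one check running once per minute. The prices are the ones quoted in the message; the function names are made up for illustration. The CloudWatch figure covers only the per-run canary charge.

```python
# Rough monthly cost of one synthetic check that runs every minute.
# Prices come from the message above; the 30-day month is an assumption.
MINUTES_PER_DAY = 24 * 60


def monthly_runs(days: int = 30, runs_per_minute: int = 1) -> int:
    """Number of check runs in a month of `days` days."""
    return days * MINUTES_PER_DAY * runs_per_minute


def datadog_cost(runs: int, price_per_1k: float = 5.0) -> float:
    """Datadog price as quoted: dollars per 1,000 runs."""
    return runs / 1000 * price_per_1k


def cloudwatch_cost(runs: int, price_per_run: float = 0.0012) -> float:
    """CloudWatch canary price as quoted: dollars per run (run charge only)."""
    return runs * price_per_run


runs = monthly_runs()
print(runs, round(datadog_cost(runs), 2), round(cloudwatch_cost(runs), 2))
# prints: 43200 216.0 51.84
```

With a 31-day month the Datadog figure comes to about $223, so the "$220 / mo" estimate is in the right range.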

Zach

and datadog is like 2-3x as expensive as other synthetic providers. I guess it makes sense if you already use them

Chris Fowles

i’d probably lean into datadog over cloudwatch

Ian Bartholomew

awesome, thanks

Chris Fowles

so the catch is, with cloudwatch you also need to pay for: alarms, logs, lambda, and s3

Ian Bartholomew

great call out, thank you @Chris Fowles i really appreciate it

Ian Bartholomew

this is super helpful

Chris Fowles
If you create 5 canaries that run once every 5 minutes, add alarms on 5 of the metrics generated by the canaries, and store the data for 1 month, your monthly bill will be calculated as follows:

5 canaries * 12 runs per hour * 24 hours per day * 31 days per month = 44,640 canary runs

Monthly CloudWatch charges

Canary run charges = 44,640 canary runs * $0.0012 per canary run = $53.57 per month
5 alarms per month = 5 * $0.10 = $0.50 per month
Total monthly CloudWatch charges = $53.57 + $0.50 = $54.07

Monthly additional charges

Each canary run also runs an AWS Lambda function and writes logs and results to CloudWatch Logs and the designated Amazon S3 bucket. For details on AWS service pricing such as AWS Lambda, Amazon S3, and CloudWatch Logs, see the pricing section of the relevant AWS service detail pages.

Lambda charges = requests charges + duration charges
= requests from 44,640 runs * $0.2 per 1,000,000 + duration of 20 seconds * 44,640 canary runs * 1 GB memory size * $0.000016667 per GB per sec
= $0.01 + $14.88 = $14.89 per month

CloudWatch Logs charges = collection charges + storage charges
= collection of 0.00015 GB per run * 44,640 canary runs * $0.5 per GB + storage of 0.00015 GB per run * 44,640 canary runs * $0.03 per GB per month
= $3.35 + $0.20 = $3.55 per month

S3 charges = put request charges + storage charges
= put requests of 44,640 runs * $0.005 per 1,000 requests + storage of 0.001 GB per run * 44,640 canary runs * 1 month * $0.023 per GB per month
= $0.22 + $1.03 = $1.25 per month

Additional monthly charges = $14.89 + $3.55 + $1.25 = $19.69

Total monthly charges = $54.07 + $19.69 = $73.76

Pricing values displayed here are based on US East Regions. Please refer to pricing tabs for most current pricing information for your respective region(s).
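
The quoted example can be re-run as a small script to try other canary counts or frequencies. This is only a sketch: every unit price is copied from the quote above (US East, possibly out of date), and `canary_bill` is a made-up helper name.

```python
# Reproduces the CloudWatch Synthetics pricing example quoted above.
# All unit prices are taken from that quote and may no longer be current.

def canary_bill(canaries=5, runs_per_hour=12, days=31, alarms=5):
    runs = canaries * runs_per_hour * 24 * days

    # CloudWatch: per-run canary charge plus alarms
    cloudwatch = runs * 0.0012 + alarms * 0.10

    # Lambda: requests + duration (20 s per run at 1 GB memory)
    lambda_cost = runs * 0.2 / 1_000_000 + 20 * runs * 1 * 0.000016667

    # CloudWatch Logs: collection + storage (0.00015 GB per run)
    logs_gb = 0.00015 * runs
    logs_cost = logs_gb * 0.5 + logs_gb * 0.03

    # S3: PUT requests + storage (0.001 GB per run, kept for 1 month)
    s3_cost = runs * 0.005 / 1000 + 0.001 * runs * 0.023

    additional = lambda_cost + logs_cost + s3_cost
    return {
        "runs": runs,
        "cloudwatch": round(cloudwatch, 2),
        "additional": round(additional, 2),
        "total": round(cloudwatch + additional, 2),
    }


bill = canary_bill()
print(bill)
# {'runs': 44640, 'cloudwatch': 54.07, 'additional': 19.69, 'total': 73.76}
```
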
Zach

yah a single synthetic canary running once per minute on Cloudwatch was going to cost about as much as an entire month on other services

Zach

I honestly don’t understand who is paying/using that

Chris Fowles

people who have approval to spend money on AWS but not enter new contracts with third parties

Zach

Ah, I have a little bit of that going on. VPs don’t raise an eyebrow at the aws bill but if I ask for a new purchase I have to prove that I can’t build it for cheaper

Chris Fowles

this is what AWS marketplace is for

Zach

AWS Marketplace mostly feels like “AMIs that I could build with a simple userdata script that fetches a binary from github”

Chris Fowles

yeh - but also i can put things in my aws bill that finance won’t raise an eyebrow at

Zach

or “a cloudformation stack that I almost certainly will regret deploying later”


btai

is this tutorial still relevant for deploying prometheus operator w/ thanos? https://medium.com/@kakashiliu/deploy-prometheus-operator-with-thanos-60210eff172b

btai

i believe it’s still relevant. I didn’t follow the parts about the peers, but I was able to get thanos read/write from s3 working with prometheus operator


sheldonh

Deep into datadog trial. Had another division propose site24x7. Any general impression on it for a comprehensive platform tool? Anything will help.

Looking to centralize logs, APM, monitors, etc. for AWS.

Datadog is so expensive due to per host costs so it’s causing some to want more competitor evaluation.

kskewes

We run Prometheus in house and cloudwatch but might move to Loki soon. We’d look at Grafana Cloud or Honeycomb or Lightstep if going external.

sheldonh

Yeah I’d love that with more interest, but they want a boxed solution with minimal development requirements. I love me some grafana :-) would love to try honeycomb

Zach

Also running prometheus and I’m rolling out Loki right now, in Fargate. Have it up and running in my dev environment and everything feeding into it

Zach

“Minimal development” fits it pretty well

Issif

Datadog logs is really good and useful, the metrics part is a shame

Issif

interpolation of metrics is hard to understand

Issif

for AWS, metrics are gathered with a latency of 15min at least

Issif

for silences, you have to mute a whole monitor, no filter works (maybe it has changed)

Issif

filter doesn’t accept regexp (like we can do in prometheus)

kskewes

Interesting about metrics. One of our major issues with cloudwatch is log ingestion latency. Others are losing queries, query language in general, etc

Issif

for metrics, a simple example: if you need to combine two different metrics, to create a ratio for example (nginx: busy_workers over max_workers, e.g.), you can’t be sure the window used for the two is the same. They can be shifted in time, so the ratio counts for nothing
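
A toy illustration of the concern described above, using made-up samples. It only shows why a ratio of two series is sensitive to window alignment; it says nothing about how Datadog actually aligns or interpolates metrics. The `ratio` helper and its `shift` parameter are hypothetical.

```python
# Illustration only: invented samples for two gauges, keyed by minute.
busy_workers = {0: 10, 1: 20, 2: 40, 3: 80}
max_workers = {0: 100, 1: 100, 2: 200, 3: 200}


def ratio(numerator, denominator, shift=0):
    """Divide numerator[t] by denominator[t - shift] wherever both samples exist."""
    out = {}
    for t, value in numerator.items():
        if (t - shift) in denominator:
            out[t] = value / denominator[t - shift]
    return out


aligned = ratio(busy_workers, max_workers)           # same minute for both series
shifted = ratio(busy_workers, max_workers, shift=1)  # denominator lags by one minute

print(aligned[2], shifted[2])
# prints: 0.2 0.4
```

At minute 2 the aligned ratio is 20% while the shifted one reports 40%, because the denominator is still the older value.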

joshmyers

Used site24x7 ~ 6 years ago. Horrible horrible.

Eric Berg

It can take a few minutes for events and metrics to show up in Datadog, but i have never experienced a 15-minute wait. That seems like something you should look into. Also, the ability to mute monitors is tag-based, so you have a very flexible way to mute some or all monitors.

Also, your comment about metrics being time-shifted to the point at which they are not properly correlated does not in any way match my experience using it over the past 5 years.

I’m a big DD fan and have been impressed by their continued attention to the user experience and expanding their functionality. Additionally, i’ve had opportunity to work directly with DD to address issues both simple and complex, and I’ve found their techs to be knowledgeable and engaged, when working through issues.

It’s expensive and you have to really watch how you configure Datadog, or you may see unexpected charges, but it’s a great, solid, flexible, well-supported platform and I’m thrilled that we’d chosen it, before i got here.

Issif

their agent is better now, but I remember that for a long time it panicked all the time. I read the source code, it was really ugly

Issif

we got hundreds of restarts a day

DJ

I also really like DD. We picked them over the other APM providers mostly because they were comparable in price and their approach was the most “hackable”. Many other providers obscure away the implementation details but DD has just enough manual configuration that you could understand what the agent was doing and NOT doing.

You can configure everything any number of ways…config files, env values, docker labels, and on and on.

Their admin UI is complex and a bit cumbersome, but it does a pretty good job, given the amount of data it’s trying to expose for you.

sheldonh

Haven’t looked. Doing the trial, I wonder what our cloudwatch bill has gone to. Read recently of someone having a cloudwatch bill equal to their datadog bill just based on cloudwatch calls






muhaha

Guys? I have a question … What are you using for logging in k8s (forwarders)? I am using Loki & Opendistro (Elasticsearch). The problem is that I want to use the Fluentbit + FluentD combo (tls forward, exposed via a separate loadbalancer), but there is no complete and mature solution for it, which is weird ..

• fluentbit -> there is no support for hot reloading, no API endpoints, and no signaling option in the application

• fluentd -> there is no good helm chart with elasticsearch & loki output and a sidecar for reloading after a config change (not only config, but mainly secret (tls) change)

• logging-operator by banzaicloud -> systemd and host logging is behind a paywall via logging-operator-extensions, which is a no-go for me

• kubesphere/fluentbit-operator -> seems unfinished (no helm chart), but promising

• vmware/kube-fluentd-operator -> helm chart available, it’s promising

Any other alternatives? I can probably use beats & logstash, but the whole community is using the fluentbit/fluentd combo…, and this ecosystem is not mature yet… Ideas? Thanks

kskewes

Fluent bit keeps a small index file on disk (on the worker node in the logs dir) so when you update or roll its pods it can pick up from where it left off. So we just roll the daemon set. Just in case that might suit instead of hot reload. Otherwise dunno.
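
The "pick up where it left off" idea can be sketched in a few lines. This is not Fluent Bit's implementation, just a minimal illustration of persisting a read offset next to the logs so a restarted process resumes instead of re-reading. The file names and the `read_new_lines` helper are hypothetical, and the sketch ignores log rotation and partially written lines.

```python
import json
import os
import tempfile


def read_new_lines(log_path, checkpoint_path):
    """Return lines appended since the last call, persisting the byte offset on disk."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()

    # Save the position so a restarted process resumes from here.
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": new_offset}, f)

    return data.decode().splitlines()


# Demo: the second call stands in for a freshly restarted pod.
workdir = tempfile.mkdtemp()
log_file = os.path.join(workdir, "app.log")
checkpoint = os.path.join(workdir, "app.log.offset")

with open(log_file, "w") as f:
    f.write("a\nb\n")
first = read_new_lines(log_file, checkpoint)

with open(log_file, "a") as f:
    f.write("c\n")
second = read_new_lines(log_file, checkpoint)

print(first, second)
# prints: ['a', 'b'] ['c']
```

Because the offset lives on the node's disk rather than in the process, rolling the daemon set loses no position, which is why a restart can stand in for hot reload.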

Erik Osterman (Cloud Posse)

@muhaha sorry we didn’t get to this in #office-hours

Erik Osterman (Cloud Posse)

Did you end up running with something? We use our own fluentd chart. We log to kinesis and from kinesis to elasticsearch & s3

Erik Osterman (Cloud Posse)

Our chart and image are public
