#sre (2020-09)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2020-09-01
For those of you who missed our #office-hours , here’s an explanation of how we’re managing opsgenie with terraform: https://www.youtube.com/watch?v=fXNajuC4L1o
GitHub: cloudposse/terraform-opsgenie-incident-management
2020-09-08
Does anyone know how to restrict this Prometheus alert to a specific time window using an inhibition_rule?
- alert: TestEventBacklogCritical
  expr: sum(test_depth{topic="events",paused="false"}) >= 150000
  for: 15m
  labels:
    severity: page
  annotations:
    description: Event queue has reached 150K for at least 15m
    summary: Event queue has reached 150K
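One common way to handle the time-window part (a hedged sketch, not something confirmed in the thread): pair the alert with a second alert that fires only during the window you want to suppress, then add an Alertmanager inhibit_rule that mutes the real alert while the window alert is active. The QuietHours name and the 00:00-06:00 UTC window below are made up for illustration.
# Prometheus rule: fires only between 00:00 and 06:00 UTC (hypothetical window)
- alert: QuietHours
  expr: hour() < 6
  labels:
    severity: none

# Alertmanager: suppress the backlog page while QuietHours is firing
# (route QuietHours to a receiver that does not notify anyone)
inhibit_rules:
  - source_match:
      alertname: QuietHours
    target_match:
      alertname: TestEventBacklogCritical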
Has anyone worked with either DataDog Synthetics or AWS CloudWatch Synthetics? We are evaluating them, and are curious if anyone has prior experience one way or the other with them. Thanks!
are you already using datadog?
yah, we are already using both them and AWS
cloudwatch synthetics seemed ridiculously expensive when it was released
yeh - that’s kind of where i was going to lead. i think our cost estimates had datadog at 1/3 to 1/2 the price of cloudwatch
interesting, i think that our estimates had it the other way around. DD is $5 per 1k runs / mo which for one that ran every minute would be $220 / mo, while CW was $0.0012 per run, which would be $51.84 for one that ran every minute. Maybe my math is wrong?
and datadog is like 2-3x as expensive as other synthetic providers. I guess it makes sense if you already use them
i’d probably lean into datadog over cloudwatch
awesome, thanks
so the catch is, with cloudwatch you also need to pay for alarms, logs, lambda, and s3
great call out, thank you @Chris Fowles i really appreciate it
this is super helpful
welcome
If you create 5 canaries that run once every 5 minutes, add alarms on 5 of the metrics generated by the canaries, and store the data for 1 month, your monthly bill will be calculated as follows:
5 canaries * 12 runs per hour * 24 hours per day * 31 days per month = 44,640 canary runs
Monthly CloudWatch charges
Canary run charges = 44,640 canary runs * $0.0012 per canary run = $53.57 per month
5 alarms per month = 5 * $0.10 = $0.50 per month
Total monthly CloudWatch charges = $53.57 + $0.50 = $54.07
Monthly additional charges
Each canary run also runs an AWS Lambda function and writes logs and results to CloudWatch Logs and the designated Amazon S3 bucket. For details on AWS service pricing such as AWS Lambda, Amazon S3, and CloudWatch Logs, see the pricing section of the relevant AWS service detail pages.
Lambda charges = requests charges + duration charges
= requests from 44,640 runs * $0.2 per 1,000,000 + duration of 20 seconds * 44,640 canary runs * 1 GB memory size * $0.000016667 per GB per sec
= $0.01 + $14.88 = $14.89 per month
CloudWatch Logs charges = collection charges + storage charges
= collection of 0.00015 GB per run * 44,640 runs * $0.5 per GB + storage of 0.00015 GB per run * 44,640 canary runs * $0.03 per GB per month
= $3.35 + $0.20 = $3.55 per month
S3 charges = put request charges + storage charges
= put requests of 44,640 runs * $0.005 per 1,000 requests + storage of 0.001 GB per run * 44,640 canary runs * 1 month * $0.023 per GB per month
= $0.22 + $1.03 = $1.25 per month
Additional monthly charges = $14.89 + $3.55 + $1.25 = $19.69
Total monthly charges = $54.07 + $19.69 = $73.76
Pricing values displayed here are based on US East Regions. Please refer to pricing tabs for most current pricing information for your respective region(s).
yah a single synthetic canary running every minute on Cloudwatch was going to cost about as much as an entire month on other services
I honestly don’t understand who is paying/using that
people who have approval to spend money on AWS but not enter new contracts with third parties
Ah, I have a little bit of that going on. VPs don’t raise an eyebrow at the aws bill but if I ask for a new purchase I have to prove that I can’t build it for cheaper
this is what AWS marketplace is for
or maybe you want to roll out sumologic? https://aws.amazon.com/marketplace/pp/B06XXVNPN2?qid=1599609541958&sr=0-1&ref_=srh_res_product_title
AWS Marketplace mostly feels like “AMIs that I could build with a simple userdata script that fetches a binary from github”
yeh - but also i can put things in my aws bill that finance won’t raise an eyebrow at
or “a cloudformation stack that I almost certainly will regret deploying later”
2020-09-15
is this tutorial still relevant for deploying prometheus operator w/ thanos? https://medium.com/@kakashiliu/deploy-prometheus-operator-with-thanos-60210eff172b
i believe it’s still relevant. I didn’t follow the parts about the peers, but I was able to get thanos read/write from s3 working with prometheus operator
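For anyone retracing that setup: with the prometheus-operator, the Thanos sidecar is enabled on the Prometheus custom resource, pointing at a secret that holds the Thanos object-store config. A minimal sketch; the secret name, key, and bucket details are placeholders.
# Prometheus CR (prometheus-operator): enable the Thanos sidecar
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  thanos:
    objectStorageConfig:     # secret holding the Thanos object-store config
      name: thanos-objstore  # placeholder secret name
      key: objstore.yml
---
# objstore.yml stored in that secret (S3 example, placeholder bucket/endpoint)
type: S3
config:
  bucket: my-metrics-bucket
  endpoint: s3.us-east-1.amazonaws.com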
2020-09-19
Deep into a datadog trial. Had another division propose site24x7. Any general impressions of it as a comprehensive platform tool? Anything will help.
Looking to centralize logs, apm, monitors etc for AWS.
Datadog is so expensive due to per-host costs, so it’s causing some to want more competitor evaluation.
We run Prometheus in house and cloudwatch but might move to Loki soon. We’d look at Grafana Cloud, Honeycomb, or Lightstep if going external.
Yeah I’d love that with more interest, but they want a boxed solution with minimal development requirements. I love me some grafana :-) would love to try honeycomb
Also running prometheus and I’m rolling out Loki right now, in Fargate. Have it up and running in my dev environment and everything feeding into it
“Minimal development” fits it pretty well
Datadog logs is really good and useful, the metrics part is a shame
the interpolation of metrics is not understandable
for silences, you have to mute a whole monitor, no filters work (maybe it has changed)
filters don’t accept regexps (like we can do in prometheus)
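For context on the “like we can do in prometheus” part, an Alertmanager route can match label values by regexp; a tiny sketch with made-up label values and receivers:
route:
  receiver: default
  routes:
    - match_re:
        service: (nginx|haproxy).*
      receiver: web-team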
Interesting about metrics. One of our major issues with cloudwatch is log ingestion latency. Others are losing queries, query language in general, etc
for metrics, a simple example: if you need to combine two different metrics to create a ratio (e.g. nginx busy_workers over max_workers), you’re not sure the time window used for the two is the same; they can be shifted in time, so the ratio counts for nothing
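For contrast, in Prometheus the same ratio is a single expression evaluated at one instant, so both series share the same window; a sketch as a recording rule (the nginx metric names are illustrative, not from the thread):
groups:
  - name: nginx-workers
    rules:
      - record: nginx:worker_utilization:ratio
        expr: nginx_busy_workers / nginx_max_workers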
It can take a few minutes for events and metrics to show up in Datadog, but i have never experienced a 15-minute wait. That seems like something you should look into. Also, the ability to mute monitors is tag-based, so you have a very flexible way to mute some or all monitors.
Also, WRT your comments about metrics being time-shifted to the point where they are not properly correlated: that does not in any way match my experience using it over the past 5 years.
I’m a big DD fan and have been impressed by their continued attention to the user experience and to expanding their functionality. Additionally, I’ve had the opportunity to work directly with DD to address issues both simple and complex, and I’ve found their techs to be knowledgeable and engaged when working through issues.
It’s expensive and you have to really watch how you configure Datadog or you may see unexpected charges, but it’s a great, solid, flexible, well-supported platform and I’m thrilled that we chose it before I got here.
their agent is better now, but I remember that for a long time it panicked all the time. I read the source code; it was really ugly
we got hundreds of restarts a day
I also really like DD. We picked them over the other APM providers mostly because they were comparable in price and their approach was the most “hackable”. Many other providers obscure away the implementation details but DD has just enough manual configuration that you could understand what the agent was doing and NOT doing.
You can configure everything any number of ways…config files, env values, docker labels, and on and on.
Their admin UI is complex and a bit cumbersome, but it does a pretty good job given the amount of data it’s trying to expose for you.
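A concrete example of the “docker labels” route: Datadog Autodiscovery reads check configuration from container labels or pod annotations. A rough Kubernetes-flavoured sketch; the container name nginx and the status URL are placeholders:
metadata:
  annotations:
    ad.datadoghq.com/nginx.check_names: '["nginx"]'
    ad.datadoghq.com/nginx.init_configs: '[{}]'
    ad.datadoghq.com/nginx.instances: '[{"nginx_status_url": "http://%%host%%:8080/nginx_status"}]'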
Haven’t looked. Doing the trial, I wonder what our cloudwatch bill has gone to. Read recently of someone having a cloudwatch bill equal to their datadog bill just based on cloudwatch calls
2020-09-25
Guys? I have a question… What are you using for logging in k8s (forwarders)? I am using Loki & Opendistro (Elasticsearch). The problem is that I want to use a Fluent Bit + Fluentd combo (TLS forward, exposed via a separate load balancer; see the sketch after this list), and there is no complete & mature solution for it, which is weird…
• fluentbit -> there isn’t support for hot reloading, nor API endpoints or a signaling option in the application
• fluentd -> there is no good helm chart with elasticsearch & loki outputs and a sidecar for reloading after a config change (not only config, but mainly secret (TLS) changes)
• logging-operator by banzaicloud -> systemd and host logging is behind a paywall via logging-operator-extensions, which is a no-go for me
• kubesphere/fluentbit-operator -> seems unfinished (no helm chart), but promising
• vmware/kube-fluentd-operator -> helm chart available, it’s promising
Any other alternatives? I could probably use Beats & Logstash, but the whole community is using the fluentbit/fluentd combo… and that ecosystem is not mature yet… Ideas? Thanks
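Not a full answer to the above, but for the TLS-forward piece specifically, the Fluent Bit side is just a forward output with TLS enabled. A minimal sketch as it might look in the fluent/fluent-bit Helm chart values; the chart key layout, hostname, and cert path are assumptions to adapt:
config:
  outputs: |
    [OUTPUT]
        Name         forward
        Match        kube.*
        # placeholder hostname of the Fluentd load balancer
        Host         fluentd.example.com
        Port         24224
        tls          On
        tls.verify   On
        # CA cert mounted from a secret (placeholder path)
        tls.ca_file  /fluent-bit/tls/ca.crt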
Fluent bit keeps a small index file on disk (on the worker node in the logs dir), so when you update or roll its pods it can pick up from where it left off. So we just roll the daemon set. Just in case that might suit instead of hot reload. Otherwise dunno.
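The “small index file” mentioned there is the tail input’s position database (the DB option); roughly, using the same chart values layout as above (paths are the conventional ones, adjust as needed):
config:
  inputs: |
    [INPUT]
        Name    tail
        Tag     kube.*
        Path    /var/log/containers/*.log
        # position database; a rolled pod resumes from the recorded offsets
        DB      /var/log/flb_kube.db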
@muhaha sorry we didn’t get to this in #office-hours
Did you end up running with something? We use our own fluentd chart. We log to kinesis and from kinesis to elasticsearch & s3
Our chart and image are public