#sre (2020-09)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2020-09-01
For those of you who missed our #office-hours , here’s an explanation of how we’re managing opsgenie with terraform: https://www.youtube.com/watch?v=fXNajuC4L1o
GitHub: cloudposse/terraform-opsgenie-incident-management
2020-09-08
Does anyone know how to restrict this Prometheus alert to a specific time window using an inhibition_rule?
- alert: TestEventBacklogCritical
  expr: sum(test_depth{topic="events",paused="false"}) >= 150000
  for: 15m
  labels:
    severity: page
  annotations:
    description: Event queue has reached 150K for at least 15m
    summary: Event queue has reached 150K
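One common way to handle the time-window part (a hedged sketch, not something confirmed in the thread): pair the alert with a second alert that fires only during the window you want to suppress, then add an Alertmanager inhibit_rule that mutes the real alert while the window alert is active. The QuietHours name and the 00:00-06:00 UTC window below are made up for illustration.
# Prometheus rule: fires only between 00:00 and 06:00 UTC (hypothetical window)
- alert: QuietHours
  expr: hour() < 6
  labels:
    severity: none

# Alertmanager: suppress the backlog page while QuietHours is firing
# (route QuietHours to a receiver that does not notify anyone)
inhibit_rules:
  - source_match:
      alertname: QuietHours
    target_match:
      alertname: TestEventBacklogCritical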
Has anyone worked with either DataDog Synthetics or AWS CloudWatch Synthetics? We are evaluating them, and are curious if anyone has prior experience one way or the other with them. Thanks!
are you already using datadog?
yah, we are already using both them and AWS
cloudwatch synthetics seemed ridiculously expensive when it was released
yeh - that’s kind of where i was going to lead. i think our cost estimates had datadog at 1/3 to 1/2 the price of cloudwatch
interesting, i think that our estimates had it the other way around. DD is $5 per 1k runs / mo which for one that ran every minute would be $220 / mo, while CW was $0.0012 per run, which would be $51.84 for one that ran every minute. Maybe my math is wrong?
and datadog is like 2-3x as expensive as other synthetic providers. I guess it makes sense if you already use them
i’d probably lean into datadog over cloudwatch
awesome, thanks
so the catch is, with cloudwatch you also need to pay for alarms, logs, lambda, and s3
great call out, thank you @Chris Fowles i really appreciate it
this is super helpful
welcome
If you create 5 canaries that run once every 5 minutes, add alarms on 5 of the metrics generated by the canaries, and store the data for 1 month, your monthly bill will be calculated as follows:
5 canaries * 12 runs per hour * 24 hours per day * 31 days per month = 44,640 canary runs
Monthly CloudWatch charges
Canary run charges = 44,640 canary runs * $0.0012 per canary run = $53.57 per month
5 alarms per month = 5 * $0.10 = $0.50 per month
Total monthly CloudWatch charges = $53.57 + $0.50 = $54.07
Monthly additional charges
Each canary run also runs an AWS Lambda function and writes logs and results to CloudWatch Logs and the designated Amazon S3 bucket. For details on AWS service pricing such as AWS Lambda, Amazon S3, and CloudWatch Logs, see the pricing section of the relevant AWS service detail pages.
Lambda charges = requests charges + duration charges
= requests from 44,640 runs * $0.2 per 1,000,000 + duration of 20 seconds * 44,640 canary runs * 1 GB memory size * $0.000016667 per GB per sec
= $0.01 + $14.88 = $14.89 per month
CloudWatch Logs charges = collection charges + storage charges
= collection of 0.00015 GB per run * 44,640 runs * $0.5 per GB + storage of 0.00015 GB per run * 44,640 canary runs * $0.03 per GB per month
= $3.35 + $0.20 = $3.55 per month
S3 charges = put request charges + storage charges
= put requests of 44,640 runs * $0.005 per 1,000 requests + storage of 0.001 GB per run * 44,640 canary runs * 1 month * $0.023 per GB per month
= $0.22 + $1.03 = $1.25 per month
Additional monthly charges = $14.89 + $3.55 + $1.25 = $19.69
Total monthly charges = $54.07 + $19.69 = $73.76
Pricing values displayed here are based on US East Regions. Please refer to pricing tabs for most current pricing information for your respective region(s).
yah a single synthetic canary running every minute on Cloudwatch was going to cost about as much as an entire month on other services
I honestly don’t understand who is paying/using that
people who have approval to spend money on AWS but not enter new contracts with third parties
Ah, I have a little bit of that going on. VPs don’t raise an eyebrow at the aws bill but if I ask for a new purchase I have to prove that I can’t build it for cheaper
this is what AWS marketplace is for
or maybe you want to roll out sumologic? https://aws.amazon.com/marketplace/pp/B06XXVNPN2?qid=1599609541958&sr=0-1&ref_=srh_res_product_title
AWS Marketplace mostly feels like “AMIs that I could build with a simple userdata script that fetches a binary from github”
yeh - but also i can put things in my aws bill that finance won’t raise an eyebrow at
or “a cloudformation stack that I almost certainly will regret deploying later”
2020-09-15
is this tutorial still relevant for deploying prometheus operator w/ thanos? https://medium.com/@kakashiliu/deploy-prometheus-operator-with-thanos-60210eff172b
i believe it’s still relevant. I didn’t follow the parts about the peers, but I was able to get thanos read/write from s3 working with prometheus operator
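For anyone retracing that setup: with the prometheus-operator, the Thanos sidecar is enabled on the Prometheus custom resource, pointing at a secret that holds the Thanos object-store config. A minimal sketch; the secret name, key, and bucket details are placeholders.
# Prometheus CR (prometheus-operator): enable the Thanos sidecar
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  thanos:
    objectStorageConfig:     # secret holding the Thanos object-store config
      name: thanos-objstore  # placeholder secret name
      key: objstore.yml
---
# objstore.yml stored in that secret (S3 example, placeholder bucket/endpoint)
type: S3
config:
  bucket: my-metrics-bucket
  endpoint: s3.us-east-1.amazonaws.com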
2020-09-19
Deep into a datadog trial. Had another division propose site24x7. Any general impressions of it as a comprehensive platform tool? Anything will help.
Looking to centralize logs, apm, monitors etc for AWS.
Datadog is so expensive due to per-host costs, so it’s causing some to want more competitor evaluation.
We run Prometheus in house and cloudwatch but might move to Loki soon. We’d look at Grafana Cloud, Honeycomb, or Lightstep if going external.
Yeah I’d love that with more interest, but they want a boxed solution with minimal development requirements. I love me some grafana :-) would love to try honeycomb
Also running prometheus and I’m rolling out Loki right now, in Fargate. Have it up and running in my dev environment and everything feeding into it
“Minimal development” fits it pretty well
Datadog logs is really good and useful, the metrics part is a shame
the interpolation of metrics is not understandable
for silences, you have to mute a whole monitor, no filters work (maybe it has changed)
filters don’t accept regexps (like we can do in prometheus)
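For context on the “like we can do in prometheus” part, an Alertmanager route can match label values by regexp; a tiny sketch with made-up label values and receivers:
route:
  receiver: default
  routes:
    - match_re:
        service: (nginx|haproxy).*
      receiver: web-team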
Interesting about metrics. One of our major issues with cloudwatch is log ingestion latency. Others are losing queries, query language in general, etc
for metrics, a simple example: if you need to combine two different metrics to create a ratio (e.g. nginx busy_workers over max_workers), you’re not sure the time window used for the two is the same; they can be shifted in time, so the ratio counts for nothing
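For contrast, in Prometheus the same ratio is a single expression evaluated at one instant, so both series share the same window; a sketch as a recording rule (the nginx metric names are illustrative, not from the thread):
groups:
  - name: nginx-workers
    rules:
      - record: nginx:worker_utilization:ratio
        expr: nginx_busy_workers / nginx_max_workers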
It can take a few minutes for events and metrics to show up in Datadog, but i have never experienced a 15-minute wait. That seems like something you should look into. Also, the ability to mute monitors is tag-based, so you have a very flexible way to mute some or all monitors.
Also, WRT your comments about metrics being time-shifted to the point where they are not properly correlated: that does not in any way match my experience using it over the past 5 years.
I’m a big DD fan and have been impressed by their continued attention to the user experience and to expanding their functionality. Additionally, I’ve had the opportunity to work directly with DD to address issues both simple and complex, and I’ve found their techs to be knowledgeable and engaged when working through issues.
It’s expensive and you have to really watch how you configure Datadog or you may see unexpected charges, but it’s a great, solid, flexible, well-supported platform and I’m thrilled that we chose it before I got here.
their agent is better now, but I remember that for a long time it panicked all the time. I read the source code; it was really ugly
we got hundreds of restarts a day
I also really like DD. We picked them over the other APM providers mostly because they were comparable in price and their approach was the most “hackable”. Many other providers obscure away the implementation details but DD has just enough manual configuration that you could understand what the agent was doing and NOT doing.
You can configure everything any number of ways…config files, env values, docker labels, and on and on.
Their admin UI is complex and a bit cumbersome, but it does a pretty good job given the amount of data it’s trying to expose for you.
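A concrete example of the “docker labels” route: Datadog Autodiscovery reads check configuration from container labels or pod annotations. A rough Kubernetes-flavoured sketch; the container name nginx and the status URL are placeholders:
metadata:
  annotations:
    ad.datadoghq.com/nginx.check_names: '["nginx"]'
    ad.datadoghq.com/nginx.init_configs: '[{}]'
    ad.datadoghq.com/nginx.instances: '[{"nginx_status_url": "http://%%host%%:8080/nginx_status"}]'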
Haven’t looked. Doing the trial, I wonder what our cloudwatch bill has gone to. Read recently of someone having a cloudwatch bill equal to their datadog bill just based on cloudwatch calls
2020-09-25
Guys? I have a question… What are you using for logging in k8s (forwarders)? I am using Loki & Opendistro (Elasticsearch). The problem is that I want to use a Fluent Bit + Fluentd combo (TLS forward, exposed via a separate load balancer; see the sketch after this list), and there is no complete & mature solution for it, which is weird…
• fluentbit -> there isn’t support for hot reloading, nor API endpoints or a signaling option in the application
• fluentd -> there is no good helm chart with elasticsearch & loki outputs and a sidecar for reloading after a config change (not only config, but mainly secret (TLS) changes)
• logging-operator by banzaicloud -> systemd and host logging is behind a paywall via logging-operator-extensions, which is a no-go for me
• kubesphere/fluentbit-operator -> seems unfinished (no helm chart), but promising
• vmware/kube-fluentd-operator -> helm chart available, it’s promising
Any other alternatives? I could probably use Beats & Logstash, but the whole community is using the fluentbit/fluentd combo… and that ecosystem is not mature yet… Ideas? Thanks
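Not a full answer to the above, but for the TLS-forward piece specifically, the Fluent Bit side is just a forward output with TLS enabled. A minimal sketch as it might look in the fluent/fluent-bit Helm chart values; the chart key layout, hostname, and cert path are assumptions to adapt:
config:
  outputs: |
    [OUTPUT]
        Name         forward
        Match        kube.*
        # placeholder hostname of the Fluentd load balancer
        Host         fluentd.example.com
        Port         24224
        tls          On
        tls.verify   On
        # CA cert mounted from a secret (placeholder path)
        tls.ca_file  /fluent-bit/tls/ca.crt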
Fluent bit keeps a small index file on disk (on the worker node in the logs dir), so when you update or roll its pods it can pick up from where it left off. So we just roll the daemon set. Just in case that might suit instead of hot reload. Otherwise dunno.
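The “small index file” mentioned there is the tail input’s position database (the DB option); roughly, using the same chart values layout as above (paths are the conventional ones, adjust as needed):
config:
  inputs: |
    [INPUT]
        Name    tail
        Tag     kube.*
        Path    /var/log/containers/*.log
        # position database; a rolled pod resumes from the recorded offsets
        DB      /var/log/flb_kube.db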
@muhaha sorry we didn’t get to this in #office-hours
Did you end up running with something? We use our own fluentd chart. We log to kinesis and from kinesis to elasticsearch & s3
Our chart and image are public