SweetOps #sre for April, 2020

Archive: https://archive.sweetops.com/monitoring/

2020-04-05

btai

Prometheus + EFS users: NFS isn’t considered supported storage for Prometheus, I guess? Have you had any problems with data corruption or data loss?

Erik Osterman (Cloud Posse)

05:10:39 AM

Dirty shutdowns will leave WAL (write-ahead log) files around

Erik Osterman (Cloud Posse)

05:12:03 AM

Also, I imagine that if you had two Prometheus operators writing to the exact same file system, you would have corruption

btai

05:23:47 AM

Oof. #2 could happen when I do cluster cutovers

btai

05:24:29 AM

When I spin up a new cluster, there’s a short period of time when the new cluster and the old cluster both have prom-operator talking to the same EFS

Erik Osterman (Cloud Posse)

05:58:53 AM

Yup, that would be my guess. That’s going to lead to corruption.

2020-04-15

Erik Osterman (Cloud Posse)

08:52:23 PM

https://github.com/tricksterproxy/trickster

tricksterproxy/trickster

Open Source HTTP Reverse Proxy Cache and Time Series Dashboard Accelerator - tricksterproxy/trickster

Erik Osterman (Cloud Posse)

08:52:30 PM

Learned about this today in the kubernetes office hours

Erik Osterman (Cloud Posse)

08:54:11 PM

https://github.com/rchakode/kube-opex-analytics

rchakode/kube-opex-analytics

Kubernetes Cost Allocation and Capacity Planning Analytics Tool. Built-in hourly, daily, monthly reports - Prometheus exporter - Grafana dashboard. - rchakode/kube-opex-analytics

2020-04-16

Abel Luck

08:15:32 AM

Anyone know of any projects out there that would support longer-term prometheus metrics storage for small deployments?

Thanos is much too complex for us. TimescaleDB seems promising, but cannot be used with RDS.

We don’t need HA.

Historical rollups like datadog would be a huge plus.

Erik Osterman (Cloud Posse)

10:03:52 PM

EFS works like a charm

Joe Presley

12:02:31 AM

What’s EFS? Google only turns up ways to monitor AWS’s EFS.

Zach

12:23:39 AM

https://aws.amazon.com/efs/

Amazon Elastic File System (EFS) | Cloud File Storage

Amazon Elastic File System (Amazon EFS) provides simple, scalable, elastic file storage for use with AWS Cloud services and on-premises resources. It scales elastically on demand without disrupting applications, growing and shrinking automatically as you add and remove files. Amazon EFS file systems are distributed across an unconstrained number of storage servers, enabling file systems to grow to petabyte-scale providing simultaneous access to your data from Amazon EC2 instances and on-premises servers.

Joe Presley

12:25:12 AM

I understand what AWS’s EFS is, but @Erik Osterman (Cloud Posse) seems to be referring to a monitoring application.

Joe Presley

12:26:04 AM

It’s possible I misunderstood and that he meant that EFS works well for storing metrics.

Erik Osterman (Cloud Posse)

12:27:28 AM

We use prometheus-operator on EKS with EFS (Amazon’s managed NFS offering).

Abel Luck

09:33:23 AM

EFS has worked fine? I remember the prometheus team recommending avoiding NFS/EFS due to certain POSIX non-compliance issues

Erik Osterman (Cloud Posse)

02:38:40 PM

But EFS can’t be put in the same bucket as general NFS.

Erik Osterman (Cloud Posse)

02:38:55 PM

It’s actually POSIX compliant

Erik Osterman (Cloud Posse)

02:39:27 PM

Plus, you can easily scale IOPS.

2020-04-17

Abel Luck

10:02:24 AM

We’re ready to move from a homegrown alert system to a more “proper” service like pagerduty/victorops/opsgenie. Does this group have any strong feelings one way or another about one of these (or another) services? Team of 4-8 engineers geo-distributed. We use prometheus/alertmanager and cloudwatch.

I find it kind of silly that in order to do end-to-end tests of the alerting system you have to add another SaaS like Dead Man’s Snitch.

joshmyers

10:13:18 AM

I think it comes down to price offerings

joshmyers

10:14:00 AM

PD/VictorOps/Opsgenie are all pretty similar in terms of offerings, with PD probably being the fullest featured

joshmyers

10:14:16 AM

Do they all have decent APIs and client tooling for automation?

sheldonh

11:05:22 PM

I will say the UI, notes, and all in PagerDuty were pretty disappointing. I kinda wanted basic formatting, even markdown, for my notes to make a log of the steps, and my first experiment with it wasn’t very impressive.

These are minor quibbles; I’m just saying I was hoping for a little more polish in logging and notes on issues.

kskewes

06:42:00 AM

Yes and only dead man’s snitch. Surprised alert systems like pager duty don’t offer this.
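For readers unfamiliar with the pattern discussed in this thread, a dead man's switch inverts normal alerting: the monitored pipeline sends a regular heartbeat, and an alarm is raised when the heartbeat stops arriving. The sketch below is only an illustration of that idea. The class name, method names, and timeout are hypothetical and are not the API of Dead Man's Snitch or any paging service. In a Prometheus/Alertmanager setup, the heartbeat would typically come from an always-firing alert routed to the external checker.

```python
import time


class DeadMansSwitch:
    """Hypothetical heartbeat checker: trips when no heartbeat arrives in time."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        # clock is injectable so the behaviour can be tested without sleeping
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_heartbeat = clock()

    def heartbeat(self):
        """Record that the alerting pipeline delivered its periodic signal."""
        self._last_heartbeat = self._clock()

    def is_tripped(self):
        """Return True when the last heartbeat is older than the timeout."""
        return self._clock() - self._last_heartbeat > self.timeout_seconds
```

The checker has to run somewhere outside the system it watches, which is why this ends up being a separate service in practice: if it shared infrastructure with Alertmanager, an outage could silence both.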

joshmyers

11:00:17 AM

What are people using to monitor the apps on Fargate? APM solution like NewRelic/DataDog?

Erik Osterman (Cloud Posse)

07:24:32 PM

What did you end up doing?

joshmyers

01:47:56 PM

Not much yet, looking like DD. Biggest requirement is JVM metrics (Scala) and application profiling, which AWS doesn’t offer AFAIK.

2020-04-21

Erik Osterman (Cloud Posse)

07:24:23 PM

Pro tip: subscribe to the RSS feeds of the status pages you depend on, and send those updates to a Slack channel (https://slack.com/help/articles/218688467-Add-RSS-feeds-to-Slack), e.g.:

• https://www.githubstatus.com/history.rss

• https://status.aws.amazon.com/rss/ec2-us-east-1.rss

• https://status.pagerduty.com/history.rss

• https://status.cloud.google.com/feed.atom

Add RSS feeds to Slack

Have a favorite blog or news site? You can use Slack to subscribe to both RSS and Atom feeds and get updates in the Slack channel of your choice. Note: If you get an error when trying to add a fee…
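Slack's built-in RSS integration covers the tip above without any code. If you wanted to do the same thing by hand (for example, to filter or reformat incidents first), a minimal sketch might look like the following. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the `source` label are hypothetical, and the webhook URL is something you would have to supply yourself.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of (title, link) tuples from an RSS 2.0 or Atom document."""
    root = ET.fromstring(xml_text)
    entries = []
    # RSS 2.0 layout: <rss><channel><item><title/><link/></item></channel></rss>
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        entries.append((title, link))
    # Atom layout: <feed><entry><title/><link href="..."/></entry></feed>
    for entry in root.iter(ATOM_NS + "entry"):
        title = (entry.findtext(ATOM_NS + "title") or "").strip()
        link_el = entry.find(ATOM_NS + "link")
        link = link_el.get("href", "") if link_el is not None else ""
        entries.append((title, link))
    return entries


def new_entries(entries, seen_links):
    """Drop entries whose link was already announced; records the fresh ones."""
    fresh = [entry for entry in entries if entry[1] not in seen_links]
    seen_links.update(link for _, link in fresh)
    return fresh


def slack_payload(source, title, link):
    """Build the JSON request body for a Slack incoming webhook."""
    return json.dumps({"text": f"[{source}] {title} {link}"}).encode("utf-8")


def post_to_slack(webhook_url, payload):
    """POST the payload to the (user-supplied) webhook URL; returns HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A polling loop would fetch each feed URL listed above, pass the body to `parse_feed`, keep a persistent `seen_links` set, and call `post_to_slack` for whatever `new_entries` returns.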