#sre (2019-09)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2019-09-12
Is anyone collecting golden signal metrics from AWS ALB/ELB monitoring?
@Daniel Minella haven’t heard about that before. What are “Golden Signal Metrics”?
Probably these ones: https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ Requests, latency, errors, saturation (or different words for the same things)
Exactly @kskewes
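Not something confirmed in the thread, but a common way to get those signals out of an ALB is to scrape its CloudWatch metrics with prometheus/cloudwatch_exporter and map them onto the four signals. A minimal sketch; the region, statistics, and metric selection are illustrative assumptions:

```yaml
# cloudwatch_exporter config: pull ALB metrics that roughly map to the golden signals.
region: us-east-1
metrics:
  # Traffic / requests
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: RequestCount
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Sum]
  # Latency
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: TargetResponseTime
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Average]
  # Errors
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: HTTPCode_Target_5XX_Count
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Sum]
  # Saturation (rough proxy; an ALB doesn't expose saturation directly)
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: ActiveConnectionCount
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Sum]
```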
2019-09-13
hey guys, has anyone tried https://thanos.io/ before?
Thanos - Highly available Prometheus setup with long term storage capabilities
One of our team used it at a previous job, and we plan to roll it out to aggregate metrics across regions. Sounds solid.
Curated list of Banzai Cloud Helm charts used by the Pipeline Platform - banzaicloud/banzai-charts
Chart looks pretty straightforward to deploy
Cheers. We’re using kube-prometheus (jsonnet) and that project has it as a first-class extension, so it should be fine. Just waiting for S3. Then if we can move our logs from Elasticsearch to Loki we’re laughing: use object storage instead of managing redundancy at the block layer.
I notice that the CoreOS Prometheus Operator lists Thanos as a write-only backend.
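For reference, “write-only” here means the Thanos sidecar only ships TSDB blocks up to object storage; reads go through Thanos Query/Store instead. A rough sketch of wiring the sidecar up via the Operator’s Prometheus CRD; the image tag, secret name, and bucket details are placeholder assumptions, not from the thread:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  replicas: 2
  thanos:
    baseImage: quay.io/thanos/thanos   # image and version are illustrative
    version: v0.7.0
    objectStorageConfig:               # references a secret you create yourself
      name: thanos-objstore-config
      key: thanos.yaml

# The thanos.yaml key in that secret holds a Thanos objstore config,
# roughly (bucket and endpoint are placeholders):
#   type: S3
#   config:
#     bucket: thanos-metrics
#     endpoint: s3.us-east-1.amazonaws.com
```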
@Jeremy G (Cloud Posse)
@Erik Osterman (Cloud Posse) I have to wonder how good the performance is and how expensive (in real money) it is to use against an S3 backend, but otherwise it looks good on paper. Maybe get @webb to try it to solve the Kubecost history storage problem.
@Jeremy G (Cloud Posse) @asmito we did a deep dive ~2 months ago. Our view was: a very promising project, but we felt that some of the scaling issues were going to be hard for us to get past. We’re ingesting 100k+ metrics per minute. We plan to revisit it soon. Happy to share more detail if it would be helpful.
Yes, please do share some details. Is the bottleneck the performance of S3 or something else? Did you find a threshold ingest rate where performance went from acceptable to unacceptable?
I’m sorry, I just tried to reference our notes from this experiment and I may actually have been mistaken… while we don’t have exact results on hand today, our notes show that we needed a more expressive query language for the range/scale of data we were querying. We had a general question mark around scale given that Thanos is a sandbox project, but there are no specific notes about hitting bottlenecks. My apologies. I expect we’ll revisit this soon, but for now we’re using the Postgres adapter.
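The thread doesn’t name the adapter, but a typical prometheus.yml wiring for a Postgres/TimescaleDB remote-storage adapter looks roughly like this; the service name and port are assumptions:

```yaml
remote_write:
  - url: "http://prometheus-postgresql-adapter:9201/write"
remote_read:
  - url: "http://prometheus-postgresql-adapter:9201/read"
    read_recent: false   # answer recent-range queries from the local TSDB, older ones from Postgres
```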
2019-09-14
Manage application’s SLI and SLO’s easily with the application lifecycle inside a Kubernetes cluster - spotahome/service-level-operator
Great share! Looks very interesting. Neat to have multi-burn-rate alerting defined too. There’s a semi-recent SoundCloud blog post about how they do it with vanilla Prometheus using recording rules, etc.
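For anyone curious, the vanilla-Prometheus version of that pattern is a set of recording rules for the error ratio at a couple of windows plus a multi-window, multi-burn-rate alert. A sketch, assuming a generic http_requests_total metric and a 99.9% SLO (both assumptions, not from the thread):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Error ratio over a short and a long window.
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
            /
          sum by (job) (rate(http_requests_total[1h]))
      # Page when both windows burn the 99.9% error budget at ~14.4x,
      # i.e. roughly 2% of a 30-day budget consumed in an hour.
      - alert: ErrorBudgetBurn
        expr: |
          job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
            and
          job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: page
```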
I’m eager to try this one out. Love how apps can easily define their own SLI/SLO by defining a CRD.
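The ServiceLevel CRD from spotahome/service-level-operator looks roughly like the following; this is paraphrased from the project’s examples, and the service name, queries, and 99.99% objective are illustrative, so check the repo for the exact schema:

```yaml
apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    - name: "9999_http_requests_not_5xx"
      description: 99.99% of requests must not return a 5xx
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://prometheus:9090
          totalQuery: sum(increase(http_request_total{service="awesome-service"}[2m]))
          errorQuery: sum(increase(http_request_total{service="awesome-service",code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: a-team
```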
2019-09-15
@Daren
2019-09-16
Oh that’s interesting, thanks for sharing!