#sre (2019-09)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2019-09-12
Is anyone collecting golden signal metrics from AWS ALB/ELB monitoring?
@Daniel Minella haven’t heard about that before. What are “Golden Signal Metrics”?
Probably these ones: https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ Requests, latency, errors, saturation (or different words for the same things)
Exactly @kskewes
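Not something confirmed in the thread, but a common way to get those signals out of an ALB is to scrape its CloudWatch metrics with prometheus/cloudwatch_exporter and map them onto the four signals. A minimal sketch; the region, statistics, and metric selection are illustrative assumptions:

```yaml
# cloudwatch_exporter config: pull ALB metrics that roughly map to the golden signals.
region: us-east-1
metrics:
  # Traffic / requests
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: RequestCount
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Sum]
  # Latency
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: TargetResponseTime
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Average]
  # Errors
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: HTTPCode_Target_5XX_Count
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Sum]
  # Saturation (rough proxy; an ALB doesn't expose saturation directly)
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: ActiveConnectionCount
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Sum]
```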
2019-09-13
hey guys, has anyone tried https://thanos.io/ before?
Thanos - Highly available Prometheus setup with long term storage capabilities
One of our team used it at a previous job, and we plan to roll it out to aggregate metrics across regions. Sounds solid.
Curated list of Banzai Cloud Helm charts used by the Pipeline Platform - banzaicloud/banzai-charts
Chart looks pretty straightforward to deploy
Cheers. We’re using kube-prometheus (jsonnet) and that project has it as a first-class extension, so it should be fine. Just waiting for S3. Then if we can move our logs from Elasticsearch to Loki we’re laughing: use object storage instead of managing redundancy at the block layer.
I notice that the CoreOS Prometheus Operator lists Thanos as a write-only backend.
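For reference, “write-only” here means the Thanos sidecar only ships TSDB blocks up to object storage; reads go through Thanos Query/Store instead. A rough sketch of wiring the sidecar up via the Operator’s Prometheus CRD; the image tag, secret name, and bucket details are placeholder assumptions, not from the thread:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  replicas: 2
  thanos:
    baseImage: quay.io/thanos/thanos   # image and version are illustrative
    version: v0.7.0
    objectStorageConfig:               # references a secret you create yourself
      name: thanos-objstore-config
      key: thanos.yaml

# The thanos.yaml key in that secret holds a Thanos objstore config,
# roughly (bucket and endpoint are placeholders):
#   type: S3
#   config:
#     bucket: thanos-metrics
#     endpoint: s3.us-east-1.amazonaws.com
```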
@Jeremy G (Cloud Posse)
@Erik Osterman (Cloud Posse) I have to wonder how good the performance is and how expensive (in real money) it is to use against an S3 backend, but otherwise it looks good on paper. Maybe get @webb to try it to solve the Kubecost history storage problem.
@Jeremy G (Cloud Posse) @asmito we did a deep dive ~2 months ago. Our view was: a very promising project, but we felt that some of the scaling issues were going to be hard for us to get past. We’re ingesting 100k+ metrics per minute. We plan to revisit it soon. Happy to share more detail if it would be helpful.
Yes, please do share some details. Is the bottleneck the performance of S3 or something else? Did you find a threshold ingest rate where performance went from acceptable to unacceptable?
I’m sorry, I just tried to reference our notes from this experiment and I may actually have been mistaken… while we don’t have exact results on hand today, our notes show that we needed a more expressive query language for the range/scale of data we were querying. We had a general question mark around scale given that Thanos is a sandbox project, but there are no specific notes about hitting bottlenecks. My apologies. I expect we’ll revisit this soon, but for now we’re using the Postgres adapter.
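The thread doesn’t name the adapter, but a typical prometheus.yml wiring for a Postgres/TimescaleDB remote-storage adapter looks roughly like this; the service name and port are assumptions:

```yaml
remote_write:
  - url: "http://prometheus-postgresql-adapter:9201/write"
remote_read:
  - url: "http://prometheus-postgresql-adapter:9201/read"
    read_recent: false   # answer recent-range queries from the local TSDB, older ones from Postgres
```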
2019-09-14
Manage application’s SLI and SLO’s easily with the application lifecycle inside a Kubernetes cluster - spotahome/service-level-operator
Great share! Looks very interesting. Neat to have multi-burn-rate alerting defined too. There’s a semi-recent SoundCloud blog post about how they do it with vanilla Prometheus using recording rules, etc.
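For anyone curious, the vanilla-Prometheus version of that pattern is a set of recording rules for the error ratio at a couple of windows plus a multi-window, multi-burn-rate alert. A sketch, assuming a generic http_requests_total metric and a 99.9% SLO (both assumptions, not from the thread):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Error ratio over a short and a long window.
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
            /
          sum by (job) (rate(http_requests_total[1h]))
      # Page when both windows burn the 99.9% error budget at ~14.4x,
      # i.e. roughly 2% of a 30-day budget consumed in an hour.
      - alert: ErrorBudgetBurn
        expr: |
          job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
            and
          job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: page
```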
I’m eager to try this one out. Love how apps can easily define their own SLI/SLO by defining a CRD.
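The ServiceLevel CRD from spotahome/service-level-operator looks roughly like the following; this is paraphrased from the project’s examples, and the service name, queries, and 99.99% objective are illustrative, so check the repo for the exact schema:

```yaml
apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    - name: "9999_http_requests_not_5xx"
      description: 99.99% of requests must not return a 5xx
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://prometheus:9090
          totalQuery: sum(increase(http_request_total{service="awesome-service"}[2m]))
          errorQuery: sum(increase(http_request_total{service="awesome-service",code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: a-team
```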
2019-09-15
@Daren
2019-09-16
Oh that’s interesting, thanks for sharing!