#monitoring

Archive: https://archive.sweetops.com/monitoring/

2019-09-16

Daren

Oh, that's interesting, thanks for sharing!

2019-09-15

Erik Osterman

@Daren

2019-09-14

Erik Osterman
spotahome/service-level-operator

Manage application’s SLI and SLO’s easily with the application lifecycle inside a Kubernetes cluster - spotahome/service-level-operator

kskewes

Great share! Looks very interesting. Neat to have multi burn rate defined too. There’s a semi recent SoundCloud blog talking about how they do it with vanilla Prometheus using recording rules etc.
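
For readers unfamiliar with the "multi burn rate" idea mentioned here, a minimal sketch in Python follows. It is not taken from service-level-operator or from the SoundCloud post; the function names are hypothetical, and the 14.4 threshold used in the usage note is the commonly cited 1h/5m example from the Google SRE Workbook, treated here as an assumption.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than allowed the error budget is burning."""
    budget = 1.0 - slo_target  # e.g. a 99.9% target leaves a 0.1% budget
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_alert(long_window_ratio, short_window_ratio, slo_target, threshold):
    """Fire only when both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets soon after recovery.
    """
    return (
        burn_rate(long_window_ratio, slo_target) > threshold
        and burn_rate(short_window_ratio, slo_target) > threshold
    )
```

With a 99.9% target, a 2% error ratio is a burn rate of roughly 20, so `should_alert(0.02, 0.02, 0.999, 14.4)` fires, while the same long-window ratio paired with a recovered short window does not.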

Erik Osterman

I’m eager to try this one out. Love how apps can easily define their own SLI/SLO by defining a CRD.

2019-09-13

asmito

Hey guys, has anyone tried https://thanos.io/ before?

Thanos

Thanos - Highly available Prometheus setup with long term storage capabilities

kskewes

One of our team has used it in a previous job, and we plan to roll it out to aggregate across regions. Sounds solid.

Erik Osterman
banzaicloud/banzai-charts

Curated list of Banzai Cloud Helm charts used by the Pipeline Platform - banzaicloud/banzai-charts

Erik Osterman

Chart looks pretty straightforward to deploy

kskewes

Cheers. We’re using kube-prometheus (jsonnet) and that project has it as a first class extension so should be fine. Just waiting for s3. Then if we can move our logs from elastic to Loki we’re laughing. Use object storage instead of managing redundancy at block layer.

Jeremy Grodberg

I notice that the CoreOS Prometheus lists Thanos as a write-only backend.

Erik Osterman

@Jeremy Grodberg

Jeremy Grodberg

@Erik Osterman I have to wonder how good the performance is and how expensive (real money) it is to use against an S3 back end, but otherwise it looks good on paper. Maybe get @ to try it to solve the Kubecost history storage problem

@Jeremy Grodberg @asmito we did a deep dive ~2 months ago. Our view was… very promising project, but we felt that some of the scaling issues were going to be hard for us to get past. We’re ingesting 100k+ metrics per min. We plan to revisit it soon. Happy to share more detail if it would be helpful.

Jeremy Grodberg

Yes, please do share some details. Is the bottleneck the performance of S3 or something else? Did you find a threshold rate of metrics that went from acceptable performance to not?

I’m sorry – I just tried to reference our notes from this experiment and I may have been mistaken actually… while we don’t have exact results on hand today, it looks like our notes show that we needed a more expressive query language for the range/scale of data we were querying. We had a general question mark around scale given that Thanos is a sandbox project, but it looks like there are no specific notes around hitting bottlenecks. My apologies. I expect we’ll revisit this soon, but for now we’re using the Postgres adapter.

2019-09-12

Daniel Minella

Is anyone collecting golden signals metrics from AWS ALB/ELB monitoring?

Erik Osterman

@Daniel Minella haven’t heard about that before. What are “Golden Signal Metrics”?

kskewes

Probably these ones. https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ Requests, latency, errors, saturation (Or different words for same things)

Daniel Minella

Exactly @kskewes

2019-08-29

Jean-Michael Cyr

Anyone using the Loki / Promtail / Grafana stack ?

Jean-Michael Cyr

Damn, this channel is really sleeping. How is monitoring not discussed much here among ops people? hehe

paying for sysdig currently but want to move to using loki/prom/grafana

I haven’t implemented a good sso proxy yet on k8s for grafana access

Erik Osterman

We use Prometheus Operator with Grafana and Keycloak

Erik Osterman
cloudposse/helmfiles

Comprehensive Distribution of Helmfiles. Works with helmfile.d - cloudposse/helmfiles

Erik Osterman

We use AWS managed Elasticsearch with the built-in Kibana dashboard

2019-07-15

Jean-Michael Cyr

Has anyone found a way to configure alerting for Kibana with the latest version, without using the pricey Watchers functionality? It seems ElastAlert and SentiNL are not working very well with ES 7.x.

2019-07-11

Garrett (PlanoCloudDude)

Hi, anyone have any tips and tricks for a fast way to go through a hundred config.json files in a repo pulling out the threshold values of metrics for monitoring config???

Erik Osterman

Something more than jq?

Garrett (PlanoCloudDude)
07:38:15 PM

looking for changes and pulling out threshold data points (alarm evaluation period, name, period, statistic) to compare, maybe extracting to a spreadsheet for comparison?

Garrett (PlanoCloudDude)

will give jq a shot, thank you

Erik Osterman

Yep, can totally use jq for that

Erik Osterman

The syntax is a little bit funky

Erik Osterman

But once you get it, very powerful

Erik Osterman

They have a lot of examples

Erik Osterman

If you get stuck let me know

Garrett (PlanoCloudDude)

Thank you!

Garrett (PlanoCloudDude)

Hi Erik, do you have an example, or could you point me to one? I'm not strong at running code queries or parsing. I want to see if there’s a way to run jq against my repo or a local folder to spot any differences in the objects in these monitoring config.json files… and I don’t know if I’m asking it correctly; I kind of know what I want, but it's fuzzy at the same time

Erik Osterman
jq -r '.metrics[] | [.name, .evaluationPeriod, .comparisonOperator] | @csv' < example_snstopic_aws_json.js
Erik Osterman

produces

Erik Osterman
"CPUUtilization","1","GreaterThanOrEqualToThreshold"
"StatusCheckFailed","1","GreaterThanOrEqualToThreshold"

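For the "hundred config.json files" case discussed above, the same extraction can be sketched in Python so that every file in a repo lands in one CSV. This is only a sketch under assumptions: it presumes each file has a top-level `metrics` list whose entries carry `name`, `evaluationPeriod`, and `comparisonOperator` (inferred from the jq filter above), and the function names are hypothetical.

```python
import csv
import io
import json
from pathlib import Path

# Field names inferred from the jq filter in the thread (an assumption).
FIELDS = ["name", "evaluationPeriod", "comparisonOperator"]


def collect_metric_rows(root):
    """Yield one row per metric found in any config.json under root."""
    for path in sorted(Path(root).rglob("config.json")):
        with path.open() as handle:
            config = json.load(handle)
        for metric in config.get("metrics", []):
            # Missing keys become empty cells rather than raising.
            yield [str(path)] + [metric.get(field, "") for field in FIELDS]


def rows_to_csv(rows):
    """Render rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file"] + FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `rows_to_csv(collect_metric_rows("."))` from the repo root returns text that can be saved and opened in a spreadsheet; the extra `file` column makes it easy to sort and compare thresholds across services.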
2019-06-28

I'll try and dig it out, but I was actually reading some articles about this the other day

and basically they had clients that published metrics from the queue

and tracing

opentracing-contrib/go-amqp

AMQP instrumentation in Go. Contribute to opentracing-contrib/go-amqp development by creating an account on GitHub.

Distributed Tracing with Apache Kafka and Jaeger | Object Partners

If you are using Apache Kafka, you are almost certainly working within a distributed system and because Kafka decouples consumers and producers it can be a challenge to illustrate exactly how data flows through that system. I plan to demonstrate how Jaeger is up to that challenge while navigating the pitfalls of an example project.

2019-06-26

Marcio Rodrigues

Hello, does anyone have experience with observability/tracing of event-based applications in Kubernetes (RabbitMQ)? I know that people usually use Istio to get a clear view of the request path of their applications, but in our case we are not using HTTP, we are using AMQP. Any suggestions?

2019-06-13

2019-06-12

Aayush Anand

does anybody have experience with grafana cloud metric ton calculations?

what’s your issue?

2019-04-26

Abel Luck

is there any open source self-hosted APM solution? for an org already running prom, elk, or sentry that would be useful

Steven

ELK has APM

Abel Luck

oh interesting

2019-04-24

Sysdig monitor is amazing for container monitoring but it’s not an APM, I agree

If cost troubles you, don’t look at AppDynamics. The product is very good, but very expensive (and really focused on big on-premise infra with a lot of Java stuff)

if you’re an adventurer, you can give https://www.elastic.co/solutions/apm a try

2019-04-23

mrwacky

Lightstep is promising.

2019-04-15

joshmyers

What do you want exactly? I’d like to try out honeycomb

2019-04-12

dalekurt

We looked at DataDog and decided this week to move forward with it as our solution.

NR is pretty pricey. Any recommendations for cheaper, lesser-known (up and coming as opposed to crappy) ones?

2019-04-11

anyone have APM suggestions here?

Tim Malone

NewRelic is good… but pricey. We’re interested in Elastic’s offering, but we have a PHP app and they don’t have an (official) PHP agent just yet…

tamsky


NewRelic is good… but pricey

@Tim Malone NR typically prices their APM product based on instance size. – I often suggest clients can reduce their costs by only deploying the NR agent on one instance in each service layer, rather than every instance.

Tim Malone

Yeah we’ve done the same thing for now - but it means one instance has different config to the others, which is an anomaly. Also about to move to containers…. haven’t looked into what it’s gonna cost us then

2019-02-14

Erik Osterman
05:20:54 AM

@Erik Osterman set the channel purpose: Archive: https://archive.sweetops.com/monitoring/

2019-01-28

tamsky


Will let you know if I make any progress on that front

@ i like your ideas and would like to subscribe to your newsletter

2019-01-25

Will let you know if I make any progress on that front

2019-01-24

Hey, we have prometheus/alertmanager behind Oauth2, anyone found a way to make amtool or promtool work with that?

Erik Osterman

(we are not, but I am interested what you come up with!)

2018-12-20

Igor Rodionov

@Erik Osterman this can fit our approach. You wanted to split monitoring from the cluster

Igor Rodionov
Erik Osterman

yes, that’s a good point

Erik Osterman

just hope they don’t charge $0.25 per metric, like cloudwatch

2018-12-19

mrwacky
Cortex: a multi-tenant, horizontally scalable Prometheus-as-a-Service - Cloud Native Computing Foundation

Prometheus is one of the standard-bearing open-source solutions for monitoring and observability. From its humble origins at SoundCloud in 2012, Prometheus quickly garnered widespread adoption and later became one of…

2018-12-04

Erik Osterman
hypnoglow/chronologist

Continuously annotate Helm releases in Grafana. - hypnoglow/chronologist

Erik Osterman

pretty sweet

Erik Osterman
06:19:49 AM
Erik Osterman

@ not sure if you’re using much grafana over there yet

ooo nice, I think we use some

06:23:58 AM

@ has joined the channel

2018-12-02

Tamsky, that is the thing as mentioned, my goal was not for you to necessarily fix my issue but rather learn how others are doing it.

Thanks for the offer anyway, we do use discover but the config is sort of fixed on the deployment and I wanted to be able to change it more dynamically, like it does when it discovers with consul.

2018-11-30

Tamsky, I know the options, as I mentioned we are already using this, and we had our own lambda for SD

I was trying to ping/pong how others were doing it

tamsky


(we had a DNS SD based on lambda and ECS events)
but I was looking for a “simpler” solution

I guess I was trying to help out re: “simpler” solutions.

tamsky


I have a similar situation for uploading new prometheus configs, without doing a docker deployment, since albeit incorrectly that was the “easy” start for us but it sort of sucks

how do you handle persistent storage for prometheus in your docker setup – that answer might guide us toward an easy process that can update your prometheus configs.

2018-11-29

mrwacky

I’m sure there’s other options..

2018-11-28

What are you guys using with ECS and prometheus for SD?

We had our own SD in Python, but I'm always afraid of hitting the API limits as we scale, as we did before

(we had a DNS SD based on lambda and ECS events)

joshmyers

At a previous client, who were so big they ended up having to pay for AWS API requests, a tool was written to do a single lookup a la ec2_sd and write it to a file. The file gets mounted inside the k8s prom, which was multi-team; if each team did their own lookups, they would bust the limit
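
The "single lookup, write it to a file" pattern described here maps onto Prometheus's file-based service discovery, which reads JSON files containing a list of target groups, each with `targets` and optional `labels`. A minimal Python sketch of the writing side follows; the cloud API lookup itself is left out, and the input shape (`address` and `team` keys), the function names, and the default port 9100 are all assumptions made for illustration.

```python
import json
import os
import tempfile


def build_target_groups(instances, port=9100):
    """Group instances into Prometheus file_sd target groups by team label."""
    groups = {}
    for instance in instances:
        team = instance.get("team", "unknown")
        groups.setdefault(team, []).append(f"{instance['address']}:{port}")
    return [
        {"targets": sorted(targets), "labels": {"team": team}}
        for team, targets in sorted(groups.items())
    ]


def write_file_sd(path, target_groups):
    """Write the target groups atomically so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(target_groups, handle, indent=2)
        # mkstemp creates the file as 0600; widen it so another user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

One process would run the lookup on a schedule and call `write_file_sd`, and every Prometheus instance that mounts the file would point a `file_sd_configs` entry at it, so the cloud API is queried once rather than once per team.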

Yeah, we built a tool for that in Lambda, tbh it is not that complicated

but I was looking for a “simpler” solution

I have a similar situation for uploading new prometheus configs, without doing a docker deployment, since albeit incorrectly that was the “easy” start for us but it sort of sucks

mrwacky
gliderlabs/registrator

Service registry bridge for Docker with pluggable adapters - gliderlabs/registrator

mrwacky

@

tamsky

what’s wrong with Consul for SD ?

joshmyers

You need consul?

joshmyers

Maybe not what you want if that is all you are going to use Consul for

tamsky

@joshmyers so what are your reasons for not using Consul if used strictly for SD ?

mrwacky

ease of use, setup, deficiencies in AWS SD options, yeah, Consul is great

joshmyers

I don’t have any. I’m just saying folks may not want to run a 3 node etcd cluster when they have been using AWS API as cheap service discovery

mrwacky

Good news, Consul is not etcd

joshmyers

hah, oops, same thing. It is a thing you need to manage?

tamsky

of all the services I’ve operated/managed since 2014, consul is the least needy service I’ve met

joshmyers

Nice

tamsky

self-bootstrapping EC2 ASG cluster FTW

joshmyers

Have used with Nomad before and not had any issues with it, but it isn’t a managed type service, was my only point

tamsky

managed type services are good for getting started – one should have a plan for when your org’s needs or skills outgrow a managed service offering from anyone

Erik Osterman

…such as multi-cloud

joshmyers

Aye, multi cloud is hard though

Erik Osterman

yea, the all elusive multi-cloud strategy

@mrwacky yeah we know about Consul and registrator, but as explained by @joshmyers, that is ofc the option if you have Consul; we don't

and while it is a good, easy discovery once you have that, the question would be what you do if you don't have Consul

at the moment our services don't use a mesh, as we don't need/want that yet

so Consul would be there ONLY to support Prometheus discovery, and that seemed overkill to me, but maybe it's the only option

tamsky

there are a lot of options. all of them that end in *_sd_config are candidates. let us know what you pick and why:

Configuration | Prometheus

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

2018-10-07

Erik Osterman

Has anyone used https://marbot.io/?

marbot - Send CloudWatch alarms to Slack

Easy-going incident management for AWS. Cloud-native alerting with CloudWatch and Slack.

davidvasandani

We do the same thing with Lambda.

Erik Osterman

that’s cool!

2018-10-03

Erik Osterman
WIP [prometheus-operator] by gianrubio · Pull Request #6765 · helm/charts

What this PR does / why we need it: This a FR from coreos/prometheus-operator project. Moving the chart to helm upstream will make easier to run e2e tests and to accept/merge PR. @richerve @pierreo…

Igor Rodionov
10:12:48 PM

@Igor Rodionov has joined the channel

Erik Osterman

via @

2018-09-28

Erik Osterman
hunterlong/statup

Status Page for monitoring your websites and applications with beautiful graphs, analytics, and plugins. Run on any type of environment. - hunterlong/statup

2018-09-17

Erik Osterman

@Max Moon you might dig this

Max Moon

nice! will check it out

Erik Osterman

@Jeremy Grodberg this will improve your alert formatting

Jeremy Grodberg
11:17:50 PM

@Jeremy Grodberg has joined the channel

Erik Osterman
11:19:55 PM

Hi all, has anyone tried https://github.com/improbable-eng/thanos? Is anyone running it in production? Are there any Helm charts to try? Thanks

improbable-eng/thanos

Highly available Prometheus setup with long term storage capabilities. - improbable-eng/thanos

Erik Osterman

We have not..

2018-09-13

You could probably use the collectd exporter with collectd

or something like that.

2018-09-12

mrwacky

Is there some Prometheus exporter that knows how to get LVM statistics?

2018-08-23

03:03:20 PM

@ has joined the channel

2018-08-22

11:56:03 PM

@ has joined the channel

mrwacky
02:49:06 AM

@mrwacky has joined the channel

2018-08-09

dat.le
07:04:21 AM

@dat.le has joined the channel

2018-08-05

jylee
04:28:47 PM

@jylee has joined the channel

2018-08-01

Phil
08:14:18 AM

@Phil has joined the channel

my-janala
02:24:04 PM

@my-janala has joined the channel

2018-07-31

i5okie
04:50:52 PM

@i5okie has joined the channel

2018-07-30

johntellsall
04:53:30 PM

@johntellsall has joined the channel

fernando
01:57:02 AM

@fernando has joined the channel

2018-07-25

Arkadiy
02:50:12 PM

@Arkadiy has joined the channel

2018-07-24

alebabai
02:45:00 PM

@alebabai has joined the channel

alebabai


if we could deploy the official grafana against the kube-prometheus and prometheus-operator by coreos, I think that would be the best path forward
i’ll try to look into this

Erik Osterman

Cool

Erik Osterman

I think we can even import their dashboard JSON files using GitHub raw URL

tamsky
08:02:31 PM

@tamsky has joined the channel

2018-07-23

Erik Osterman
11:29:22 PM

@Erik Osterman has joined the channel

Erik Osterman

@ @alebabai

Erik Osterman

@alebabai is going to be looking into the grafana updates for dashboards as it relates to kube-prometheus

Erik Osterman

@ brought it to our attention that it is out of date

11:30:17 PM

@ has joined the channel

Erik Osterman

@ mentioned, kube-prometheus and prometheus-operator repos have merged, but they are still very much separate charts as far as i can tell

Erik Osterman

also, installing dashboards from a URL appears to be a “chart feature” and not a feature of grafana

Erik Osterman

so I am concerned how deep down this path we should go in terms of maintaining the chart functionality

Erik Osterman
[stable/grafana] Utilize 5.x datasource and dashboard import tooling and general refactor by rtluckie · Pull Request #4713 · helm/charts

utilize new features of grafana 5.x to configure datasources and dashboards removed all jobs update ingress manifest to align with helm create remove opinionated affinity config consolidate conf…

Erik Osterman
coreos/prometheus-operator

prometheus-operator - Prometheus Operator creates/configures/manages Prometheus clusters atop Kubernetes

Erik Osterman

if we could deploy the official grafana against the kube-prometheus and prometheus-operator by coreos, I think that would be the best path forward

Daren
11:35:34 PM

@Daren has joined the channel

11:35:34 PM

@ has joined the channel

Max Moon
11:35:35 PM

@Max Moon has joined the channel

11:35:35 PM

@ has joined the channel

Erik Osterman
11:36:07 PM

@Erik Osterman set the channel topic: Prometheus, Prometheus Operator, Grafana, Kubernetes

chris
11:36:56 PM

@chris has joined the channel
