#sre (2021-02)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2021-02-04
Has anyone had success trying out Datadog Real User Monitoring (RUM)? Considering it and just curious about anybody’s experiences. Also open to alternatives for tracking user events and behavior, more so for troubleshooting client-side interactions rather than analytics.
2021-02-09
Hi guys! In k8s, I want to use the OpenTelemetry Collector to gather logs. The cluster runs multiple apps. Is it possible to avoid needing a sidecar with the OpenTelemetry agent in each app pod, and just run the DaemonSet? I don't want the extra overhead of adding a sidecar to every app's pods.
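For what it's worth, a node-level DaemonSet is generally enough for logs: the collector on each node can tail the container log files from the host, so no per-pod sidecar is required. Below is a rough sketch of the relevant part of a collector config using the filelog receiver from the contrib distribution; the log path, the logging exporter, and the field names are assumptions to illustrate the shape, so verify them against the OpenTelemetry Collector docs for your version.

    # Sketch of a node-level (DaemonSet) log pipeline for the OpenTelemetry Collector.
    # Paths and exporter choice are placeholders; adjust for your backend and version.
    receivers:
      filelog:
        include:
          - /var/log/pods/*/*/*.log   # container logs on the node, mounted from the host
        start_at: beginning
    exporters:
      logging:                        # stand-in exporter; replace with your real log backend
        loglevel: info
    service:
      pipelines:
        logs:
          receivers: [filelog]
          exporters: [logging]

The DaemonSet pod would need /var/log (and usually /var/lib/docker/containers) mounted read-only from the host for this to work.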
2021-02-11
Does anyone know of a Helm chart for a good, up-to-date query exporter? https://github.com/albertodonato/query-exporter https://github.com/free/sql_exporter https://github.com/justwatchcom/sql_exporter
I just read back through the channel and noticed you hadn't had a response: I have used free/sql_exporter. Also, windows_exporter has an option for MSSQL with tons of metrics, but I couldn't find Grafana dashboard examples for it.
https://github.com/prometheus-community/windows_exporter/blob/master/docs/collector.mssql.md
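For anyone finding this later, a free/sql_exporter setup is roughly two files: the exporter config pointing at the target DSN, plus collector files defining queries and the metrics they produce. The sketch below follows the project's documented examples but is not a tested config; the DSN, metric, and query are placeholders, so check the field names against the repo.

    # sql_exporter.yml (sketch)
    target:
      data_source_name: 'sqlserver://prom_user:prom_password@dbhost:1433'   # placeholder DSN
      collectors: [mssql_connections]
    collector_files:
      - '*.collector.yml'

    # mssql.collector.yml (sketch)
    collector_name: mssql_connections
    metrics:
      - metric_name: mssql_connections
        type: gauge
        help: 'Active connections per database.'
        key_labels: [db]
        values: [count]
        query: |
          SELECT DB_NAME(sp.dbid) AS db, COUNT(sp.spid) AS count
          FROM sys.sysprocesses sp
          GROUP BY DB_NAME(sp.dbid)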
2021-02-17
HELP
I have the following Prometheus query (to reduce the results while testing, I limited it to a specific reader_id):
irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
and here is a cleaned-up result (I have manually removed the ns, container, svc labels, etc.):
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 108
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 163
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 77
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 121
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 86
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 131
Before k8s they had one "pod", so if they did sum by (branch_id) they got the right results. But because the pods are dynamic, the results now come out doubled, and since each pod queries its database at slightly different times, the two copies are also off by a bit.
Is there an elegant way to first average across the pods, and then sum by branch_id?
Wouldn’t this be what you want? irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m]) / count(nexite_reader_all_packets_per_channel_total{reader_id="10000"})
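A more direct way to express "average across the pods first, then sum per branch" is to aggregate in two steps; something along these lines (untested, label names taken from the output above):

    sum by (branch_id) (
      avg without (pod, instance) (
        irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
      )
    )

The inner avg collapses the duplicate series that differ only by pod/instance, and the outer sum then adds up the remaining channels per branch_id.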
Good afternoon all. Can anybody make a recommendation for filtering wanted or unwanted lines from logs on an EC2 instance within AWS using the unified CloudWatch agent? As far as I'm aware there isn't an ability to filter before ingestion into CloudWatch.
I believe AWS's recommendation is to filter the log into another log and then consume the filtered log. So, before I have to write something for CentOS and Windows, I wonder if anybody can recommend an app that could be used to transform/filter the logs?
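As an illustration only of that "filter into another log, then ship the filtered log" pattern: one option is a lightweight log processor such as Fluent Bit (which runs on both CentOS and Windows), using its grep filter to drop unwanted lines and writing the result to a file that the CloudWatch agent then tails. The snippet below is a rough, untested sketch; the paths and the DEBUG pattern are placeholders.

    [INPUT]
        Name      tail
        Path      /var/log/myapp/app.log        # placeholder source log

    [FILTER]
        Name      grep
        Match     *
        Exclude   log DEBUG                     # drop lines whose "log" field matches DEBUG

    [OUTPUT]
        Name      file
        Path      /var/log/filtered             # point the CloudWatch agent at files in this directory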
2021-02-24
I've always struggled with this, but my use case is kind of special, so I'm curious whether anyone has run into it. We have a ton of Kubernetes deployments in our production cluster (maybe 15-20k). We deploy nightly, which means thousands of rollouts of new pods. When this happens I get a ton of alerts for replica pods going down and unavailable deployment replicas detected. I believe this is fairly normal behaviour as the pods get rotated. I wish I didn't have to resolve all the alerts, but at the same time I don't want to disable alerting during deployment time either. Anyone have a good workaround for this? (I'm testing out Datadog currently.)
Set the alarms to only trigger if you've been in an error condition for longer?
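In Prometheus terms that usually means a longer for: duration on the rule, so replicas that are briefly unavailable during a rollout never fire; Datadog monitors have an equivalent knob in the evaluation window. A sketch using the standard kube-state-metrics deployment metrics; treat the duration and severity as placeholders to tune for your nightly rollout length.

    groups:
      - name: deployment-availability
        rules:
          - alert: KubeDeploymentReplicasMismatch
            # Only fire if the deployment has been short on replicas for a sustained
            # period, long enough to ride out a normal nightly rollout.
            expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: 'Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has had unavailable replicas for 30m'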