#monitoring (2021-02)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2021-02-24

I’ve always struggled with this, but my use case is kind of special, so I’m curious whether anyone has run into it. We have a ton of Kubernetes deployments in our prod cluster (maybe 15-20k). We run deployments nightly, where we’ll have thousands of new pods rolling out. When this happens I get a ton of alerts for replica pods going down and unavailable deployment replicas detected. I believe this is fairly normal behavior as the pods get rotated. I wish I didn’t have to resolve all the alerts, but at the same time I don’t want to disable alerting during deployment windows either. Anyone have a good workaround for this? (I’m testing out Datadog currently.)

Set the alarms to only trigger if you’re in an error condition for longer?
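In Prometheus terms (Datadog has analogous evaluation-window settings), that usually means adding a `for:` duration to the alert rule so it only fires when the condition outlasts a normal rollout. A minimal sketch, assuming kube-state-metrics is installed; the expression and threshold are illustrative:

```yaml
groups:
  - name: deployment-churn
    rules:
      - alert: DeploymentReplicasUnavailable
        # kube-state-metrics metric; adjust to your own unavailable-replica query
        expr: kube_deployment_status_replicas_unavailable > 0
        # only fire if the condition persists well past a typical rollout
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.deployment }} has had unavailable replicas for 15m"
```

Tune the `for:` window to be longer than your worst-case rollout time so routine pod rotation never pages, while a genuinely stuck deployment still does.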
2021-02-17

HELP
I have the following Prometheus query (to reduce the results while testing, I limited it to a specific reader_id):
irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
and here is a cleaned-up result (I have manually removed labels like namespace, container, and service):
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 108
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 163
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 77
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 121
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 86
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 131
Before k8s they had one “pod”, so if they did sum by (branch_id) they got the right results. But because the pods are dynamic, they now get the results doubled, and since each pod queries its database at slightly different times, the values are off by a bit.
Is there an elegant way to first run avg across both pods, and then run sum by branch_id?
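One way to do this (a sketch, assuming the label set shown above) is to nest the aggregations: average away the pod/instance dimension first, keeping the labels you still need, then sum per branch:

```promql
sum by (branch_id) (
  avg by (branch_id, channel) (
    irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
  )
)
```

The inner `avg by (branch_id, channel)` collapses the duplicate per-pod series for each channel into one value, so the outer `sum by (branch_id)` adds each channel exactly once instead of once per pod.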

Good afternoon all. Can anybody make a recommendation for filtering unwanted (or wanted) lines from logs on EC2 instances within AWS using the unified CloudWatch agent? As far as I’m aware, there isn’t a way to filter before ingestion into CloudWatch.
I believe AWS’s recommendation is to filter the log into another log and then consume the filtered log. So, before I have to write something for CentOS and Windows, I wonder if anybody can recommend an app that could be used to transform / filter the logs?
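If an intermediate agent is acceptable, Fluent Bit runs on both Linux and Windows and can filter before shipping to CloudWatch. A minimal sketch, where the path, tag, regex, and log group names are all placeholders to adapt:

```ini
[INPUT]
    Name   tail
    Path   /var/log/myapp/*.log
    Tag    myapp

[FILTER]
    Name   grep
    Match  myapp
    # keep only lines whose content matches ERROR or WARN (pattern is illustrative)
    Regex  log (ERROR|WARN)

[OUTPUT]
    Name              cloudwatch_logs
    Match             myapp
    region            us-east-1
    log_group_name    filtered-myapp
    log_stream_prefix ec2-
```

The grep filter also supports `Exclude` for dropping matching lines instead of keeping them, which covers the “filter unwanted lines” direction of the question.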
2021-02-11

Does anyone know of a Helm chart for a good, up-to-date query exporter? https://github.com/albertodonato/query-exporter https://github.com/free/sql_exporter https://github.com/justwatchcom/sql_exporter
2021-02-09

Hi guys! In k8s, I want to use the OpenTelemetry Collector to gather logs. In the cluster I have multiple apps. Is it possible to avoid needing a sidecar with the OpenTelemetry agent in each app’s pod, and just run the DaemonSet? I don’t want the extra overhead of putting a sidecar in every app’s pods.
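A DaemonSet-only setup is possible if the collector tails container logs from the node instead of receiving them from sidecars. A minimal sketch of the collector config, assuming the filelog receiver from opentelemetry-collector-contrib (component names may vary by collector version, so treat this as illustrative):

```yaml
receivers:
  filelog:
    # container runtimes write pod logs here on each node
    include:
      - /var/log/containers/*.log
exporters:
  # stdout exporter for demonstration; swap in your real backend
  logging:
    loglevel: info
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [logging]
```

The DaemonSet pod also needs a hostPath volume mounting the node’s /var/log so the receiver can read the files; with that in place, every app’s logs are collected without any per-pod sidecar.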
2021-02-04

Has anyone had success trying out Datadog Real User Monitoring (RUM)? Considering it and just curious about anybody’s experiences. Also open to alternatives for tracking user events and behavior, more so for troubleshooting client-side interactions rather than analytics.