#monitoring (2021-02)

Prometheus, Prometheus Operator, Grafana, Kubernetes

Archive: https://archive.sweetops.com/monitoring/

2021-02-24

btai avatar

I’ve always struggled with this, but my use case is kind of special, so I’m curious whether anyone has run into it. We have a ton of Kubernetes deployments in our production cluster (maybe 15-20k). We run deployments nightly, rolling out thousands of new pods. When this happens I get a flood of alerts for replica pods going down and unavailable deployment replicas detected. I believe this is somewhat normal as the pods get rotated. I’d rather not have to resolve all the alerts, but at the same time I don’t want to disable alerting during deployment windows either. Anyone have a good workaround for this? (I’m testing out Datadog currently.)

Alex Jurkiewicz avatar
Alex Jurkiewicz

Set the alarms to only trigger if you’ve been in an error condition for longer?
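In Prometheus terms that means adding a `for:` clause so the alert only fires once the condition has persisted. A minimal sketch of such a rule, assuming kube-state-metrics is installed (the alert name, threshold, and 15m window are illustrative, not from the thread):

```yaml
groups:
  - name: deployment-health
    rules:
      - alert: DeploymentReplicasUnavailable
        # kube_deployment_status_replicas_unavailable is exposed by kube-state-metrics
        expr: kube_deployment_status_replicas_unavailable > 0
        # Only fire if replicas stay unavailable for 15 minutes: long enough to
        # ride out a normal rolling deploy, short enough to catch real outages.
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.deployment }} has had unavailable replicas for 15m"
```

The same idea applies in Datadog: set the monitor's evaluation window longer than a typical rollout takes.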

2021-02-17

shtrull avatar
shtrull

HELP. I have the following Prometheus query (to reduce the results while testing, I limited it to a specific reader_id):

irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])

and here is a cleaned-up result (I have manually removed labels like ns, container, and svc):

{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.236:8188",  pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 108
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.40:8188",  pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 163
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.236:8188",  pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 77
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.40:8188",  pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 121
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.236:8188",  pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 86
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.40:8188",  pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 131

Before k8s there was one “pod”, so a sum by (branch_id) gave the right results. But because the pods are dynamic, the results now come back doubled, and since each pod queries the database at a slightly different time, the two values are off by a bit.

Is there an elegant way to first average across both pods, and then sum by branch_id?
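PromQL lets you nest aggregations, so one way to get this (assuming the metric and labels shown above) is to average across the pods per channel first, then sum the channels per branch:

```
sum by (branch_id) (
  avg by (branch_id, channel) (
    irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
  )
)
```

The inner `avg by (branch_id, channel)` collapses the two pod series for each channel into one averaged series, and the outer `sum by (branch_id)` then adds up the channels, so pods are never double-counted.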

Gareth avatar
Gareth

Good afternoon all. Can anybody recommend a way of filtering wanted or unwanted lines from logs on an EC2 instance within AWS using the unified CloudWatch agent? As far as I’m aware there isn’t an ability to filter before ingestion into CloudWatch.

I believe the AWS recommendation is to filter the log into another log and then consume the filtered log. So, before I have to write something for CentOS and Windows, can anybody recommend an app that could be used to transform/filter the logs?
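For what it’s worth, the unified agent’s file-collection config accepts per-file `filters` with regex include/exclude rules, which drop lines before they are sent (check that your agent version supports this; it was a relatively recent addition). A sketch, with the path and group names purely illustrative:

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/app.log",
            "log_group_name": "myapp",
            "log_stream_name": "{instance_id}",
            "filters": [
              {
                "type": "exclude",
                "expression": "DEBUG"
              }
            ]
          }
        ]
      }
    }
  }
}
```

A line is published only if it passes every filter in the list, so an `exclude` on `DEBUG` keeps those lines out of CloudWatch entirely, with no second log group needed.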

2021-02-09

Joan Porta avatar
Joan Porta

Hi guys! In k8s, I want to use the OpenTelemetry Collector to gather logs. The cluster runs multiple apps. Is it possible to avoid needing an OpenTelemetry agent sidecar in each app pod and run only the DaemonSet? I don’t want the extra overhead of adding a sidecar to every app’s pods.

2021-02-04

kareem.shahin avatar
kareem.shahin

Has anyone had success trying out Datadog Real User Monitoring (RUM)? Considering it and just curious about anybody’s experiences. Also open to alternatives for tracking user events and behavior, more so for troubleshooting client-side interactions rather than analytics.
