#sre (2020-01)
Prometheus, Prometheus Operator, Grafana, Kubernetes
Archive: https://archive.sweetops.com/monitoring/
2020-01-02
2020-01-15

Sorry, newbie question… If I wanted to diagnose where a memory usage is going in a PHP application, is that something a tool like New Relic APM can do for me?

New Relic APM does not really show this kind of thing, but it gives you time spans showing how long certain things take to load.

Ahh, okay.. Thanks, that probably won’t help me then. I found another tool, blackfire.io, which looks like it might help instead.

Google’s Stackdriver is the only tool that I know of that does on-the-fly profiling of your applications, and even then it only works if you have a single request in a timespan to check.

I see. I might have to come at this a different way then .. I have no idea where the problem is coming from, so cannot currently isolate it to a single request.

maybe you can explain what is happening and I can get an idea of how you could tackle this issue

Essentially, we have a WordPress website running on a VPS… And, every day, the log files are full of “exhausted memory” errors. It only happens in production (probably because of the number of requests, or work load). There is very little to indicate what has caused it; the only mention is a file called wp-db.php, which is a PHP class to interact with MySQL.
Increasing the memory limit for the process does not solve it, it just eats up all of that too.

I’m mostly a front end developer, so I’m quite far out of my depth here, so sorry about that haha

did you verify that the memory limit is higher ? with phpinfo or the likes?

Yep

It was previously 512M, upped it to 1G

so wp-db.php is the ORM for WP. Wouldn’t surprise me if some query is trying to load all records into memory.
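
To illustrate the suspicion above, here is a minimal Python sketch of the difference between loading every row at once and walking the result set in batches. It uses the standard-library `sqlite3` module as a stand-in for MySQL; the `orders` table and the helper names are made up for the example and are not taken from WordPress or WooCommerce.

```python
import sqlite3


def make_demo_db(n_orders):
    """Create an in-memory database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany(
        "INSERT INTO orders (total) VALUES (?)",
        ((float(i),) for i in range(n_orders)),
    )
    return conn


def load_all(conn):
    """Pull every row into one list: memory use grows with the table size."""
    return conn.execute("SELECT id, total FROM orders").fetchall()


def iter_batches(conn, batch_size):
    """Yield rows in fixed-size batches so only one batch is held at a time."""
    cursor = conn.execute("SELECT id, total FROM orders ORDER BY id")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch
```

With 180,000 orders, a report that follows the `load_all` pattern holds the whole table in memory at once, while the batched version only ever holds `batch_size` rows.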

How large is the blog? Are we truly certain that 1gb is enough for a poorly written query?

You mention that it only happens in production. Does staging have an equal dataset?

The amount of content between staging & production is quite similar. It’s an e-commerce store using WooCommerce. There’s about 500 orders per day. With a total of about 180,000 orders on the site.
It’s possible that 1G is not big enough. I could up this to 4G and see if this helps. I can’t imagine it needs to be more than 4G. I’ll see if that helps the situation shortly. I’ll also look to see if I can find any rogue queries loading everything in.
Thanks everyone btw

I ran into a similar issue: there was a cron job trying to generate some kind of report across all orders, but this was 5 years ago

Alright, 4G wasn’t enough either!
I’ll see if I can find some information around the queries. Perhaps I can find a pattern

are you on nginx or apache2

Litespeed

does it put out logs compatible with nginx or apache ?

Honestly, I have no idea. If I paste you one of the lines from the log file, would that help?

2020-01-16T10:39:19+00:00 CRITICAL Out of memory (allocated 1556828160) (tried to allocate 4096 bytes) in /home/website/public_html/releases/1579104310/web/wp/wp-includes/wp-db.php on line 2007
2020-01-16T10:41:19+00:00 CRITICAL Out of memory (allocated 1561346048) (tried to allocate 58720264 bytes) in /home/website/public_html/releases/1579104310/web/wp/wp-includes/wp-db.php on line 2007
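
Lines in this shape are easy to pull apart with a script, which helps when looking for a pattern across a whole day of errors. A minimal Python sketch, assuming every entry follows the format of the two lines pasted above; the function names are illustrative.

```python
import re

# Pattern for the "Out of memory" lines in the format shown above.
OOM_RE = re.compile(
    r"^(?P<ts>\S+) CRITICAL Out of memory "
    r"\(allocated (?P<allocated>\d+)\) "
    r"\(tried to allocate (?P<requested>\d+) bytes\) "
    r"in (?P<file>.+) on line (?P<line>\d+)$"
)


def parse_oom(line):
    """Return the fields of one out-of-memory log line, or None if it does not match."""
    match = OOM_RE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": match.group("ts"),
        "allocated": int(match.group("allocated")),
        "requested": int(match.group("requested")),
        "file": match.group("file"),
        "line": int(match.group("line")),
    }


def to_gib(n_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3
```

Feeding each log line through `parse_oom` and grouping by `file`, `line`, or hour of `timestamp` gives a rough picture of when the process runs out and how much it had allocated at the time.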

looks like apache logs

did you ever look at this with something like kibana and analyse the logs that way?

That’s getting way beyond my knowledge of this, and what’s set up.
I inherited this project, mostly do front-end development work, and only get to spend 2 days a month maximum on it.

I mostly hoped there was an easy solution if I am honest.
I appreciate all your help by the way, I’m learning a lot of new stuff based on your comments.

By the looks of it, raising the memory limit did not help either because it didn’t apply (there are many ways to set it in PHP), because it was unset or changed by something else, or because the server simply doesn’t have enough RAM.

I say that because it crashed at about 1.5 GB total allocated, which is lower than the limit you set
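
One way to sanity-check this is to convert the configured limits to bytes and compare them with the `allocated` figure from the log. A minimal Python sketch, assuming PHP's shorthand suffixes K, M and G are powers of 1024; the helper name `php_size_to_bytes` is made up for illustration. As far as I know, PHP reports hitting `memory_limit` with a different message ("Allowed memory size of N bytes exhausted"), so an "Out of memory" line may point at the host running short rather than at the limit itself.

```python
def php_size_to_bytes(value):
    """Convert a PHP shorthand size such as '512M' or '4G' to bytes.

    Assumes K, M and G are powers of 1024. Does not handle the special
    value -1 (no limit).
    """
    value = value.strip()
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    suffix = value[-1].upper()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)


# Figure taken from the log line quoted earlier in the thread.
ALLOCATED_AT_CRASH = 1556828160

for limit in ("512M", "1G", "4G"):
    limit_bytes = php_size_to_bytes(limit)
    relation = "above" if ALLOCATED_AT_CRASH > limit_bytes else "below"
    print(f"{limit}: {limit_bytes} bytes, crash allocation is {relation} this limit")
```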

New Relic has a PHP agent which is actually very useful. It will also show you slow queries etc: https://docs.newrelic.com/docs/agents/php-agent/getting-started/introduction-new-relic-php

Switched technologies and I’m still missing this level of tracing in Node.js

@Roderik van der Veer Elastic APM has that level of tracing for Node.js

It has the same limitations as the NR one. In PHP, it can show you, for each request, how long each call takes, whether the call was written by you, comes from a dependency or (and this is important) is a PHP built-in function. The Node.js APM solutions can show you the route or a database call, but not, for example, a call to Node.js crypto that takes forever.

but will give it a go, because it does look nice TBH

you can define custom spans in that case.
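
For readers unfamiliar with the idea, a custom span is just a named timer wrapped around a block of code that the agent would not instrument on its own. The sketch below is a conceptual, standard-library-only illustration in Python; it is not the Elastic APM or New Relic API, and `span`, `SPANS` and `slow_crypto` are hypothetical names. Real agents expose their own span calls and ship the timings to the APM backend.

```python
import hashlib
import time
from contextlib import contextmanager

# Recorded (name, duration_in_seconds) pairs; a real agent would ship these to a backend.
SPANS = []


@contextmanager
def span(name):
    """Time the wrapped block and record it under the given name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append((name, time.perf_counter() - start))


def slow_crypto():
    """Stand-in for a built-in call that automatic instrumentation would not trace."""
    return hashlib.pbkdf2_hmac("sha256", b"secret", b"salt", 10_000)
```

Wrapping the call as `with span("pbkdf2"): slow_crypto()` records how long that block took, which is the kind of detail the automatic instrumentation discussed above would miss.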

@Erik Osterman (Cloud Posse) oh, sorry, I just posted an old line from the logs as an example of what they look like. It definitely did increase to 4GB, as I have some entries in the log that exhausted all of that. Sorry for the confusion.
2020-01-16
2020-01-17
2020-01-27

what’s the difference between kube-prometheus and prometheus-operator? I’m assuming prom operator is completely bare bones while kube-prometheus has a default baseline of dashboards and monitors?

Prometheus Operator creates/configures/manages Prometheus clusters atop Kubernetes - coreos/prometheus-operator

The stable/prometheus-operator helm chart provides a similar feature set to kube-prometheus.

this tells me kube-prometheus is probably not super useful anymore?

To be honest, I’ve been working with Prometheus Operator Helm Chart

Which is Amazing !

It installs a full fledged Prometheus+Grafana+Alert Manager stack , ready to monitor a K8S cluster

I’d recommend you going that way

@Santiago Campuzano might be a stupid question, does it come w/ default dashboards?

It’s not a stupid question

It comes… and they are amazing

i don’t see any within the grafana ui

There may be something wrong with the Prom Operator config

kube-prometheus pre-dated prometheus-operator. it was developed originally by TicketMaster, then given to CoreOS, then given to the community.

prometheus-operator is the way to go today

thanks @Santiago Campuzano @Erik Osterman (Cloud Posse) that’s what I was trying to figure out

i noticed everything kube-prometheus was not as frequently maintained anymore

but this from the README confused me a little:
kube-prometheus combines the Prometheus Operator with a collection of manifests to help getting started with monitoring Kubernetes itself and applications running on top of it.

Yep… actually.. there’s an open issue for that

I am a noob who try to setup some monitoring for my cluster & apps. I lost 2 days of work trying to use kube-prometheus because of these lines: https://github.com/coreos/prometheus-operator/blo…

You’re not the only one

almost makes it sound like kube-prometheus builds on top of prometheus-operator, which made me think it was providing possibly default dashboards specific to kube

haha thanks @Santiago Campuzano

YW !

Um. Kube-prometheus is a jsonnet based project that bundles Prometheus operator and a ton of dashboards and alerts, the whole stack. It’s very much alive and maintainers are also maintainers of Prometheus etc.

Use Prometheus to monitor Kubernetes and applications running on Kubernetes - coreos/kube-prometheus

It was moved recently so that it could have its own releases, though running master is suggested (apps are versioned).

I use both (maybe incorrectly?). I install the operator first, then install kube-prometheus, as it includes a bunch of bundled exporters, some decent starter Prometheus config and a good set of default alerts.

kube-prometheus helm chart is like never updated. The expectation is that you will get into the heady realm of jsonnet and build your own custom deployment or something.

Personally I’ve had to clone and make minor edits to both projects to get a stable deployment for AKS clusters, yuck.

@Zachary Loeber I run clusters in aws and azure (AKS). can you elaborate why you needed to make yucky changes to prom-operator specifically for AKS?

The Helm chart is not maintained by the project, so I imagine it’s like all the other charts…
Jsonnet is a big step. But I like that you can change stuff. There are no limits, and eventually weird differences between environments require ad hoc changes. Having a shared base, then minimal patches and secrets files per environment, seems to work. There’s some great work done in mixins that are bundled in and also available for including yourself, so it’s pretty complete. If some vendor ever manages to offer something as complete at a decent price, that would be a good thing!
2020-01-29

So I’m sending alertmanager alerts to a webhook that triggers an MS Teams notification (though it could be Slack or any other route) and I want to also send along an autogenerated link to kibana logs for the namespace. I know how to generate the link, but I don’t know the best way to get the cluster-specific external DNS zone passed through the alerts to construct the link with.

(so something like ‘kibana.<cluster.custom.internal.domain>’)

Am I forced to do label rewrites and appending to make this happen or does alertmanager have any kind of situational awareness lookups it can do that I can tap into for such things?

AFAIK - rewrites

Thanks for confirming what I kinda suspected was the case @joshmyers
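
In practice, the rewrite route means the cluster domain has to arrive as a label on each alert, and the webhook receiver then builds the link from it. A minimal Python sketch of the receiving side, assuming the alert carries a label named `cluster_domain` (a hypothetical name, added for example through external labels or relabeling in Prometheus) and that the Kibana link takes a `namespace` query parameter (also an assumption; substitute the real link format).

```python
from urllib.parse import urlencode


def kibana_link(alert, domain_label="cluster_domain"):
    """Build a Kibana URL from one alert in an Alertmanager webhook payload.

    Returns None when the alert has no cluster domain label.
    The label name and the query parameter are illustrative assumptions.
    """
    labels = alert.get("labels", {})
    domain = labels.get(domain_label)
    if not domain:
        return None
    query = urlencode({"namespace": labels.get("namespace", "")})
    return f"https://kibana.{domain}/?{query}"


def links_for_payload(payload):
    """Return one link (or None) per alert in the webhook payload's 'alerts' list."""
    return [kibana_link(alert) for alert in payload.get("alerts", [])]
```

Keeping the domain in a label means the receiver stays stateless: the same webhook code can serve every cluster, and each cluster only has to set its own label value.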