SweetOps #sre for January, 2020

Archive: https://archive.sweetops.com/monitoring/

2020-01-02

2020-01-15

Christopher

Sorry, newbie question… If I wanted to diagnose where memory usage is going in a PHP application, is that something a tool like New Relic APM can do for me?

Pierre Humberdroz

01:46:05 PM

New Relic APM does not really show this kind of thing, but it gives you time spans for how long certain things take to load.

Christopher

01:47:35 PM

Ahh, okay.. Thanks, that probably won’t help me then. I found another tool, blackfire.io, which looks like it might help instead

Pierre Humberdroz

01:54:50 PM

Google’s Stackdriver is the only tool I know of that profiles your applications on the fly, and even then it only works if you have a single request in a given timespan to check.

Christopher

01:58:38 PM

I see. I might have to come at this a different way then .. I have no idea where the problem is coming from, so cannot currently isolate it to a single request.

Pierre Humberdroz

02:00:56 PM

maybe you can explain what is happening and I can get an idea of how you could tackle this issue

Christopher

02:08:33 PM

Essentially, we have a WordPress website running on a VPS… And, every day, the log files are full of “exhausted memory” errors. It only happens in production (probably because of the number of requests, or workload). There is very little to indicate what has caused it; the only mention is a file called wp-db.php, which is a PHP class to interact with MySQL.

Increasing the memory limit for the process does not solve it, it just eats up all of that too.

Christopher

02:09:11 PM

I’m mostly a front end developer, so I’m quite far out of my depth here, so sorry about that haha

Pierre Humberdroz

03:34:20 PM

did you verify that the memory limit is actually higher, with phpinfo or the like?

Christopher

03:34:33 PM

Yep

Christopher

03:34:50 PM

It was previously 512M, upped it to 1G

Erik Osterman (Cloud Posse)

05:52:09 PM

so wp-db.php is the ORM for WP. Wouldn’t surprise me if some query is trying to load all records into memory.

Erik Osterman (Cloud Posse)

05:52:55 PM

How large is the blog? Are we truly certain that 1gb is enough for a poorly written query?

Erik Osterman (Cloud Posse)

05:53:12 PM

You mention that it only happens in production. Does staging have an equal dataset?

Christopher

10:10:13 AM

The amount of content between staging & production is quite similar. It’s an e-commerce store using WooCommerce. There’s about 500 orders per day. With a total of about 180,000 orders on the site.

It’s possible that 1G is not big enough. I could up this to 4G and see if this helps. I can’t imagine it needs to be more than 4G. I’ll also look to see if I can find any rogue queries loading everything in.

Thanks everyone btw

Pierre Humberdroz

10:40:31 AM

I ran into a similar issue: there was a cron job trying to generate some kind of report across all orders, but this was 5 years ago

Christopher

11:17:51 AM

Alright, 4G wasn’t enough either!

I’ll see if I can find some information around the queries. Perhaps I can find a pattern

Pierre Humberdroz

03:25:39 PM

are you on nginx or apache2

Christopher

03:26:05 PM

Litespeed

Pierre Humberdroz

03:30:50 PM

does it put out logs compatible with nginx or apache?

Christopher

03:31:49 PM

Honestly, I have no idea. If I paste you one of the lines from the log file, would that help?

Christopher

03:33:22 PM

2020-01-16T10:39:19+00:00 CRITICAL Out of memory (allocated 1556828160) (tried to allocate 4096 bytes) in /home/website/public_html/releases/1579104310/web/wp/wp-includes/wp-db.php on line 2007

2020-01-16T10:41:19+00:00 CRITICAL Out of memory (allocated 1561346048) (tried to allocate 58720264 bytes) in /home/website/public_html/releases/1579104310/web/wp/wp-includes/wp-db.php on line 2007

Pierre Humberdroz

03:34:26 PM

looks like apache logs

Pierre Humberdroz

03:36:22 PM

did you ever look at this with something like kibana and analyse the logs that way?

Christopher

03:37:28 PM

That’s getting way beyond my knowledge of this, and what’s set up.

I inherited this project, mostly do front-end development work, and only get to spend 2 days a month maximum on it.

Christopher

03:39:13 PM

I mostly hoped there was an easy solution if I am honest.

I appreciate all your help by the way, I’m learning a lot of new stuff based on your comments.

Erik Osterman (Cloud Posse)

05:37:01 PM

By the looks of it, raising the memory limit did not help either because it didn’t apply (there are many ways to set it in PHP), because it was unset or changed by something else, or because the server simply doesn’t have enough RAM.

Erik Osterman (Cloud Posse)

05:37:27 PM

Because it crashed at 1.5 GB total requested, which is lower than the limit you set
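
For reference, the byte counts in the two log lines pasted above can be converted to more readable units. This is a minimal Python sketch, not something anyone in the thread ran. The regular expression and the `parse_oom_line` helper are assumptions inferred only from those two sample lines, so other log formats may not match.

```python
import re

# Pattern inferred from the two sample lines in this thread (an assumption);
# other PHP or LiteSpeed log formats may differ.
OOM_RE = re.compile(
    r"^(?P<ts>\S+) CRITICAL Out of memory \(allocated (?P<allocated>\d+)\) "
    r"\(tried to allocate (?P<requested>\d+) bytes\) "
    r"in (?P<file>\S+) on line (?P<line>\d+)$"
)

GIB = 1024 ** 3
MIB = 1024 ** 2


def parse_oom_line(line):
    """Return a dict describing one out-of-memory log line, or None if it does not match."""
    match = OOM_RE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": match.group("ts"),
        "allocated_bytes": int(match.group("allocated")),
        "requested_bytes": int(match.group("requested")),
        "file": match.group("file"),
        "line": int(match.group("line")),
    }


# The two log lines exactly as pasted in the thread.
SAMPLE_LINES = [
    "2020-01-16T10:39:19+00:00 CRITICAL Out of memory (allocated 1556828160) "
    "(tried to allocate 4096 bytes) in /home/website/public_html/releases/"
    "1579104310/web/wp/wp-includes/wp-db.php on line 2007",
    "2020-01-16T10:41:19+00:00 CRITICAL Out of memory (allocated 1561346048) "
    "(tried to allocate 58720264 bytes) in /home/website/public_html/releases/"
    "1579104310/web/wp/wp-includes/wp-db.php on line 2007",
]

for raw in SAMPLE_LINES:
    entry = parse_oom_line(raw)
    print(
        entry["timestamp"],
        f"allocated={entry['allocated_bytes'] / GIB:.2f} GiB",
        f"requested={entry['requested_bytes'] / MIB:.2f} MiB",
    )
```

The first line works out to roughly 1.45 GiB allocated (about 1.56 GB in decimal units), and the second line's failed request is about 56 MiB in a single allocation.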

Roderik van der Veer

06:57:45 PM

New Relic has a PHP agent which is actually very useful. It will also show you slow queries etc: https://docs.newrelic.com/docs/agents/php-agent/getting-started/introduction-new-relic-php

Introduction to New Relic for PHP | New Relic Documentation

For an overview of New Relic’s PHP agent (compatibility, requirements, installation, configuration, troubleshooting, known issues), start here.

Roderik van der Veer

06:58:21 PM

Switched technologies and I’m still missing this level of tracing in nodejs

Pierre Humberdroz

07:01:03 PM

@Roderik van der Veer Elastic APM has that level of tracing for nodejs

Roderik van der Veer

07:08:45 PM

It has the same limitations as the NR one. In PHP, it can show you, for each request, what call (written by you, a dependency or, and this is important, a PHP built-in function) takes how long. The nodejs APM solutions can show you the route or a database call, but not, for example, a call to nodejs crypto which takes forever.

Roderik van der Veer

07:12:21 PM

but will give it a go, because it does look nice TBH

Pierre Humberdroz

07:25:11 PM

you can define custom spans in that case.

Christopher

08:39:31 AM

@Erik Osterman (Cloud Posse) oh, sorry, I just posted an old line from the logs as an example of what they look like. It definitely did increase to 4GB, as I have some entries in the log that exhausted all of that. Sorry for the confusion.

2020-01-16

2020-01-17

2020-01-27

btai

09:55:16 PM

what’s the difference between kube-prometheus and prometheus-operator? I’m assuming prom operator is completely bare bones while kube-prometheus has a default baseline of dashboards and monitors?

Santiago Campuzano

09:59:41 PM

https://github.com/coreos/prometheus-operator#prometheus-operator-vs-kube-prometheus-vs-community-helm-chart

coreos/prometheus-operator

Prometheus Operator creates/configures/manages Prometheus clusters atop Kubernetes - coreos/prometheus-operator

btai

10:00:25 PM

The stable/prometheus-operator helm chart provides a similar feature set to kube-prometheus. 

btai

10:00:41 PM

this tells me kube-prometheus is probably not super useful anymore?

Santiago Campuzano

10:02:47 PM

To be honest, I’ve been working with Prometheus Operator Helm Chart

Santiago Campuzano

10:02:50 PM

Which is Amazing !

Santiago Campuzano

10:03:20 PM

It installs a full-fledged Prometheus+Grafana+Alertmanager stack, ready to monitor a K8S cluster

Santiago Campuzano

10:03:41 PM

I’d recommend you going that way

btai

10:03:55 PM

@Santiago Campuzano might be a stupid question, does it come w/ default dashboards?

Santiago Campuzano

10:04:09 PM

It’s not a stupid question

Santiago Campuzano

10:04:17 PM

It comes… and they are amazing

btai

10:04:27 PM

i dont see any within the grafana ui

Santiago Campuzano

10:04:56 PM

There may be something with Prom Operator config

Erik Osterman (Cloud Posse)

10:04:58 PM

kube-prometheus pre-dated prometheus-operator. it was developed originally by TicketMaster, then given to CoreOS, then given to the community.

Erik Osterman (Cloud Posse)

10:05:33 PM

prometheus-operator is the way to go today

btai

10:05:40 PM

thanks @Santiago Campuzano @Erik Osterman (Cloud Posse) that’s what I was trying to figure out

btai

10:05:56 PM

i noticed everything kube-prometheus was not as frequently maintained anymore

btai

10:06:33 PM

but this from the README confused me a little:

kube-prometheus combines the Prometheus Operator with a collection of manifests to help getting started with monitoring Kubernetes itself and applications running on top of it.

Santiago Campuzano

10:07:00 PM

Yep… actually.. there’s an open issue for that

Santiago Campuzano

10:07:07 PM

https://github.com/coreos/prometheus-operator/issues/2619

Confusing doc prometheus-operator vs kube-prometheus · Issue #2619 · coreos/prometheus-operator

I am a noob who try to setup some monitoring for my cluster & apps. I lost 2 days of work trying to use kube-prometheus because of these lines: https://github.com/coreos/prometheus-operator/blo…

Santiago Campuzano

10:07:22 PM

You’re not the only one

btai

10:07:24 PM

almost makes it sound like kube-prometheus builds on top of prometheus-operator, which made me think it was providing possibly default dashboards specific to kube

btai

10:07:31 PM

haha thanks @Santiago Campuzano

Santiago Campuzano

10:10:27 PM

YW !

kskewes

05:53:58 AM

Um. Kube-prometheus is a jsonnet based project that bundles Prometheus operator and a ton of dashboards and alerts, the whole stack. It’s very much alive and maintainers are also maintainers of Prometheus etc.

kskewes

05:54:48 AM

https://github.com/coreos/kube-prometheus

coreos/kube-prometheus

Use Prometheus to monitor Kubernetes and applications running on Kubernetes - coreos/kube-prometheus

kskewes

05:55:35 AM

It was moved recently so that it could have its own releases, though running master is suggested (apps are versioned).

Zachary Loeber

01:57:37 PM

I use both (maybe incorrectly?). I install the operator first, then install kube-prometheus, as it includes a bunch of bundled exporters, some decent starter Prometheus config and a good set of default alerts.

Zachary Loeber

01:59:11 PM

kube-prometheus helm chart is like never updated. The expectation is that you will get into the heady realm of jsonnet and build your own custom deployment or something.

Zachary Loeber

02:00:39 PM

Personally I’ve had to clone and make minor edits to both projects to get a stable deployment for AKS clusters, yuck.

btai

07:11:11 PM

@Zachary Loeber I run clusters in aws and azure (AKS). can you elaborate why you needed to make yucky changes to prom-operator specifically for AKS?

kskewes

07:03:50 AM

The Helm chart is not maintained by the project, so I imagine it’s like all the other charts…

Jsonnet is a big step. But I like that you can change stuff. There are no limits, and eventually weird differences between environments require ad hoc changes. Having a shared base, then minimal patches and secrets files per environment, seems to work. There’s some great work done in mixins that are bundled in and also available for including yourself, so it’s pretty complete. If some vendor ever manages to offer something as complete at a decent price, that would be a good thing!
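
The "shared base plus minimal per-environment patches" pattern described here can be illustrated outside jsonnet. The sketch below uses Python purely as an analogy; kube-prometheus itself is customised in jsonnet, and every key, value and environment name here is hypothetical.

```python
import copy


def deep_merge(base, patch):
    """Return a new dict with `patch` applied recursively on top of `base`."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them field by field.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or new keys: the patch value wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical shared base configuration (illustrative values only).
BASE = {
    "prometheus": {"replicas": 2, "retention": "15d"},
    "alertmanager": {"replicas": 3},
}

# Hypothetical per-environment patches; each lists only what differs.
PATCHES = {
    "staging": {"prometheus": {"replicas": 1}},
    "production": {"prometheus": {"retention": "30d"}},
}


def render(environment):
    """Build the full configuration for one environment."""
    return deep_merge(BASE, PATCHES[environment])


print(render("staging"))
print(render("production"))
```

Keeping each patch small means the differences between environments stay visible in one place, which is the property being praised above.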

2020-01-29

Zachary Loeber

02:05:19 PM

So I’m sending alertmanager alerts to a webhook that triggers an MS Teams notification (though it could be slack or any other route) and I want to also send along an autogenerated link to kibana logs for the namespace. I know how to generate the link but I don’t know the best way to get the cluster specific external dns zone passed through the alerts to construct the link with.

Zachary Loeber

02:06:04 PM

(so something like “kibana.<cluster.custom.internal.domain>”)

Zachary Loeber

02:06:55 PM

Am I forced to do label rewrites and appending to make this happen or does alertmanager have any kind of situational awareness lookups it can do that I can tap into for such things?

joshmyers

02:42:16 PM

AFAIK - rewrites

Zachary Loeber

03:14:56 PM

Thanks for confirming what I kinda suspected was the case @joshmyers
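
One way the label-based approach could look on the receiving side, as a hedged Python sketch. It assumes each alert already carries a label holding the cluster's DNS zone, for example added through Prometheus `external_labels` or a relabelling rule. The label name `cluster_domain`, the sample domain and the `kibana_base_urls` helper are all hypothetical; only the `alerts` and `labels` fields follow the Alertmanager webhook payload.

```python
# Hypothetical label name; it must be attached to alerts before they
# reach Alertmanager (for example via Prometheus external_labels).
DOMAIN_LABEL = "cluster_domain"


def kibana_base_urls(payload):
    """Map each alert in an Alertmanager webhook payload to a Kibana base URL.

    Alerts without the domain label get None so the caller can fall back
    to a notification without a link.
    """
    results = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        domain = labels.get(DOMAIN_LABEL)
        results.append(
            {
                "alertname": labels.get("alertname"),
                "namespace": labels.get("namespace"),
                "kibana": f"https://kibana.{domain}" if domain else None,
            }
        )
    return results


# Trimmed example payload; the alert names and label values are made up.
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "PodCrashLooping",
                "namespace": "shop",
                "cluster_domain": "cluster-a.example.internal",
            },
        },
        {
            "status": "firing",
            "labels": {"alertname": "NodeDown"},
        },
    ],
}

for item in kibana_base_urls(SAMPLE_PAYLOAD):
    print(item)
```

The namespace-specific part of the link would then be appended to the base URL by whatever link generator is already in use.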
