#kubernetes (2024-07)

kubernetes

Archive: https://archive.sweetops.com/kubernetes/

2024-07-05

managedkaos avatar
managedkaos
Exploring the Kubernetes instance calculator

In Kubernetes, should you use fewer large nodes or several smaller ones?

When using an 8 GB / 2 vCPU instance, is all of that memory and CPU available to pods?

In this webinar, you will explore how Kubernetes reserves resources in a worker node.

You will learn how different cloud providers have different reservations and how those affect deploying workloads and their availability.

You’ll then examine how limits, requests and reservations can be combined to estimate the right instance size for your Kubernetes workloads using the Kubernetes instance calculator.

By the end of the session, you will:

  • Understand how the Kubernetes scheduler uses requests to allocate workloads.
  • Identify how the kubelet reserves CPU, memory, storage, etc., in a Kubernetes node.
  • Master how to choose the right size for your cluster nodes to optimize resource utilization for your workload.
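
A minimal sketch of the requests/limits side of that sizing exercise (the image and values below are illustrative, not recommendations):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sizing-example
spec:
  containers:
    - name: app
      image: nginx:1.27            # illustrative image
      resources:
        requests:                  # the scheduler places the pod based on these
          cpu: 250m
          memory: 256Mi
        limits:                    # enforced by the kubelet / container runtime
          cpu: 500m
          memory: 512Mi
```

On an 8 GB / 2 vCPU node, the schedulable amount is the node's allocatable value (capacity minus kube-reserved, system-reserved, and the eviction threshold), which kubectl describe node shows alongside capacity.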

2024-07-07

Hao Wang avatar
Hao Wang
The Kubernetes API Traffic Analyzer - Kubeshark

Real-time K8s network visibility and forensics, capturing and monitoring all traffic and payloads going in, out, and across containers, pods, nodes, and clusters.

2024-07-10

Adnan avatar

Hello everybody,

I had an issue a few days ago during which an nginx deployment had a spike in 504 timeouts while trying to proxy requests to the upstream php-fpm.

The issue lasted for about 1h 20min and resolved by itself. I was unable to find any of those requests reaching the upstream php-fpm pods.

I suspect that a node went away but the endpoints were not cleaned up. Unfortunately, I don’t really have much evidence for it.

Anyone ever had similar issues where you had a large number of 504s between two services but you could not find any logs that would indicate those requests actually reached the other side?

Niek Neuij avatar
Niek Neuij

do you have a livenessProbe on the php-fpm containers?

Adnan avatar

Yes

Adnan avatar

Liveness and readiness
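
For reference, a minimal sketch of that probe pair on a php-fpm container, assuming FPM listens on its default port 9000 (all values are illustrative):

```yaml
# Illustrative probes for a php-fpm container; adjust the port if yours differs
livenessProbe:
  tcpSocket:
    port: 9000
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  tcpSocket:
    port: 9000
  periodSeconds: 5
  failureThreshold: 2
```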

Niek Neuij avatar
Niek Neuij

if the php-fpm containers are connected to a PVC, in some rare cases they refuse to terminate in a timely manner when a node has a taint, since they hold a lock on the PVC

Adnan avatar

Interesting. There are no PVCs configured for these deployments, i.e. they are not using any. But they are using an ephemeral volume.

Adnan avatar

But other than the volumes, are you saying that if pods are taking a longer time to terminate, endpoints can stay and receive requests?

Niek Neuij avatar
Niek Neuij

yeah

Niek Neuij avatar
Niek Neuij

the endpoint is deleted when the pod is deleted
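
Following that reasoning, one knob that bounds how long a pod can sit in Terminating is the grace period; a minimal sketch, with an illustrative value and image:

```yaml
# Illustrative: cap how long a terminating pod may linger before the kubelet force-kills it
spec:
  terminationGracePeriodSeconds: 60   # default is 30 seconds
  containers:
    - name: php-fpm
      image: php:8.3-fpm              # illustrative image
```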

IK avatar

Hey all. Thoughts on OpenShift for cluster management? Vs something like Rancher. Are many folks running k8s on OpenShift? I can see the value-add, unsure of the dollars at this stage, but given the majority of our footprint is in the public cloud, using OpenShift instead of, say, AKS or EKS seems a little counter-intuitive.

Brett L avatar
Brett L

I generally think that if you're just looking for observability, LDAP, and a web UI as the value add, it's generally not worth it. It gets very cost-prohibitive very quickly, especially when you're talking about using it in the cloud. The observability is not much more than what you'd get if you installed p8s (Prometheus) and Elastic. There are other alternative web UIs that are free.


2024-07-12

akhan4u avatar
akhan4u

hey everyone, I wanted to know how you approach k8s upgrades. We are self-hosting our K8s clusters (kubespray-managed) in different DCs. I want to know how you make sure the controllers/operators in k8s do not have any breaking changes between, say, 1.x and 2.x. Do you surf through the changelog and look for keywords like breaking/dropped/removed? I want to know if there is some automated way or a tool to compare version changelogs and summarise the breaking changes. We already check for deprecated APIs using kubedd, Polaris, and others; however, this controller version-change review is manual and error-prone.

Michael avatar
Michael

Are your manifests in source control? We are using Renovate to notify us of API deprecations as well as some custom Microsoft Power Automate logic that scrapes changelogs and gives us a summary

venkata.mutyala avatar
venkata.mutyala

We use:

• Renovate bot (as mentioned above)

• https://github.com/doitintl/kube-no-trouble

• When upgrading K8s versions, we go through every dependency/helm chart (e.g. cert-manager) and make sure it's compatible with the version we are upgrading to. This is manual and does require digging around; depending on the project, it's either really clear or obscure. Regardless, we do test in a separate environment and make sure things work as expected.

akhan4u avatar
akhan4u

Yup, we have a GitLab flow to notify us about deprecated APIs with kubent. However, as @venkata.mutyala mentioned, we have the same manual flow to check the release notes/changelogs of various controllers/operators/helm charts and look for any version incompatibility for new upgrades.

akhan4u avatar
akhan4u

I was wondering if there was some automated way to diff and find breaking changes in changelogs

venkata.mutyala avatar
venkata.mutyala

Hey @akhan4u, would love to know more about what you folks look for in your manual process. I work at a company that is building a DevOps platform based entirely on OSS, so basically I do this stuff all the time. I'm wondering if I could create some type of monthly/quarterly newsletter that could make it easier for folks like yourself to manage your own clusters. If anyone here has any thoughts, please let me know via DM. I'd love to get on a call and see what I can offer back to the community.


2024-07-16

Zing avatar

hi there, how are folks handling argocd deployments nowadays? I'm thinking about revamping our setup. Right now we're:

• hub spoke model using argocd app of apps pattern

• bootstrap the argocd namespace and helm deployment in terraform on the hub cluster

• point to our “app of apps” repository which uses helmfile

• let argocd manage itself

• janky CI/CD workflow script to add new child clusters under management

I think we want to continue letting app of apps manage our “addons” using helmfile, but I'm wondering if there have been any improvements in the initial bootstrap of argocd itself (for the hub) and the argocd cluster add portion for child clusters (perhaps via the argocd terraform provider?)
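
For comparison, the root of a hub-and-spoke app-of-apps bootstrap usually boils down to a single Argo CD Application pointing at the app-of-apps repo; a minimal sketch, with placeholder repo URL and path:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps                 # root app; its children are the managed "addon" Applications
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/app-of-apps.git   # placeholder repository
    targetRevision: main
    path: apps                                         # placeholder path to child app manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```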

Gabriela Campana (Cloud Posse) avatar
Gabriela Campana (Cloud Posse)

@Jeremy White (Cloud Posse)

2024-07-22

Adnan avatar

I would like to set a lifetime for pods. What are you using to achieve this?
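
Kubernetes has no generic pod TTL, but two common approaches are the descheduler (its PodLifeTime policy evicts pods older than a threshold) and, for bare pods or Job pods, activeDeadlineSeconds; a minimal sketch of the latter, with illustrative values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: short-lived-pod
spec:
  activeDeadlineSeconds: 3600       # kubelet marks the pod Failed once it has run this long
  containers:
    - name: app
      image: busybox:1.36           # illustrative image
      command: ["sleep", "7200"]
```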

Gabriela Campana (Cloud Posse) avatar
Gabriela Campana (Cloud Posse)

@Jeremy G (Cloud Posse) @Andriy Knysh (Cloud Posse)
