#kubernetes (2023-07)
Archive: https://archive.sweetops.com/kubernetes/
2023-07-03
When I try to deploy a StatefulSet application on GKE (created in Autopilot mode) using Cloud Shell, it’s throwing the below error, can someone help me fix this?
Error from server (GKE Warden constraints violations): error when creating "envs/local-env/": admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-disallow-privilege]":["container increase-the-vm-max-map-count is privileged; not allowed in Autopilot"]}
Requested by user: 'dhamodharan', groups: 'system:authenticated'.
You need to disable Autopilot mode if privileged containers are required
Is it possible to disable Autopilot mode? Or do I need to delete this cluster and recreate it in Standard mode?
You want to customize the node system configuration, such as by setting Linux sysctls.
When to use Standard instead of Autopilot
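For reference, this is roughly the kind of privileged init container (the classic vm.max_map_count bump for Elasticsearch-style workloads) that Autopilot’s autogke-disallow-privilege constraint rejects; a Standard cluster would admit it. Everything below except the init container name (taken from the violation above) is illustrative, not from the actual manifest:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-app            # illustrative name
spec:
  serviceName: example-app
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      initContainers:
        - name: increase-the-vm-max-map-count   # the container named in the violation
          image: busybox:1.36                   # illustrative image
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true                    # this is what autogke-disallow-privilege blocks
      containers:
        - name: app
          image: nginx:1.25                     # placeholder for the real workload image
```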
I have a Standard GKE cluster and want to migrate all my running services to a new Autopilot cluster. I researched the official documentation and didn’t find anything about how I can perform this migration
didn’t find the exact words saying there’s no conversion, though
2023-07-04
Hi, a cluster-autoscaler question. We appear to have a discrepancy between:
• Resource metrics reported in AWS console (see example with CPU=~50%)
• Resource metrics reported by cluster-autoscaler (they often appear to be higher than the ones in the AWS console), e.g. CPU~80%. I’ve attached screenshots of each for the same EC2 instance
So it looks as though the cluster-autoscaler is getting incorrect information. Has anyone seen this issue before?
If I hit the metrics server directly:
k get --raw /apis/metrics.k8s.io/v1beta1/nodes/ip-10-48-37-79.eu-west-1.compute.internal | jq '.usage'
{
"cpu": "1854732210n",
"memory": "2290572Ki"
}
This is an amd64 c5.xlarge SPOT instance with 4 vCPUs
usage_nano=1854732210
capacity_milli=4000
usage_percentage=$(awk "BEGIN { print (($usage_nano / ($capacity_milli * 1000000)) * 100) }")
echo $usage_percentage
46.3683
they should be using the same metric; what is the value at 15:10, for instance?
Ah someone in the Kubernetes slack had this explanation:
cluster-autoscaler is looking at the amount of cpu requests vs node allocatable
This isn’t the amount of cpu time used, it’s the amount of cpu reserved by Pods
e.g. if you have a pod that has a cpu request of 3, and then it does nothing but sleep, cluster-autoscaler will see the node as 75% used, aws will see it as 0%
You can’t scale it down, because that Pod still needs to go somewhere that can satisfy the reservation
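To make that concrete, here’s a minimal sketch (names are hypothetical): a pod like this reserves 3 CPUs and does nothing, so on a 4-vCPU node cluster-autoscaler reports ~75% utilization while CloudWatch CPU stays near 0%:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sleepy-reserver          # hypothetical name
spec:
  containers:
    - name: sleep
      image: busybox:1.36
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "3"               # reserved, but never actually used
```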
got it, so it’s the difference between what’s requested and what’s actually used
Thanks for sharing the details
Is there a GPU being used in the instances?
If there is no GPU, then looking at the code, DaemonSet and mirror pod utilization is counted by the autoscaler
How did you deploy the autoscaler? Both DaemonSet and mirror pod utilization can be excluded from the calculation with --ignore-daemonsets-utilization and --ignore-mirror-pods-utilization
Then it should show metrics similar to the instance’s
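If the autoscaler is running as a plain Deployment, those flags would go on the container args, roughly like this (a sketch; the image tag and the other flags are examples, and Helm chart users would set these via extraArgs instead):
```yaml
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.2   # example tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --ignore-daemonsets-utilization=true
      - --ignore-mirror-pods-utilization=true
```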
2023-07-05
2023-07-09
hello all, what backup solution do you use for k8s? Currently we use Velero. Anybody using Stash, Trilio, or Kasten? How expensive are they? Are they better than Velero?
2023-07-10
https://medium.com/@xpf6677/gitops-with-kcl-programming-language-cb910230e310 Hi folks! I wrote a blog about GitOps + KCL integration and would love to hear your feedback!
GitOps with KCL Programming Language
Thanks for the blog, inspiring
This comparison is interesting, https://kcl-lang.io/docs/user_docs/getting-started/intro/
What is KCL?
oh, more interesting is why AntGroup invests in DevOps tooling lol
The quality of KCL’s code and documentation looks good, and I’ve now got some idea of how it is used
We are facing some cloud-native challenges internally, so we have opened up our internal IaC and GitOps practices, Kusion and KCL, hoping to explore and solve common problems with the community. Nice to communicate with you.
How does KCL compare to something like Cuelang?
Hi @kallan.gerard Here are some blogs that explain it: https://kcl-lang.io/blog/2022-declarative-config-overview https://kcl-lang.io/docs/user_docs/getting-started/intro
The blog clarifies the landscape of declarative configuration, KCL’s core concepts and features, and how it compares with other configuration languages.
What is KCL?
2023-07-11
2023-07-16
Hi all. KCL v0.5.0 is out! See here for the release notes and more information. https://kcl-lang.io/
- Use the KCL language and IDE, with more complete features and fewer errors, to improve the code-writing experience and efficiency.
- Use KPM, KCL OpenAPI, the OCI registry, and other tools to directly use and share your cloud-native domain models, reducing learning and hands-on costs.
- Use community tools such as the GitHub Action, Argo CD, and kubectl KCL plugins for integration and extension, improving automation efficiency.
KCL is an open-source constraint-based record & functional language mainly used in configuration and policy scenarios.
2023-07-17
has anyone here tried the template-controller with FluxCD? https://medium.com/kluctl/introducing-the-template-controller-and-building-gitops-preview-environments-2cce4041406a
I’m super interested in trying to make ephemeral preview environments and this approach looks pretty slick
2023-07-18
Hi all. I was wondering, how are you autoscaling pods (workers) that consume SQS messages? How stable is the autoscaling? Are you happy with it?
It depends
I’d avoid thinking of this as strictly an SQS and Kubernetes problem. Queues with n workers are quite a common pattern and they’re part of a system as a whole
What’s producing the messages, what’s the queue size, what’s the ordering, what is consuming the messages and what is that thing doing with those messages, what is an acceptable amount of time for a message to remain in the queue, are the messages many aggregates with small lifecycles or a few aggregates with long lifecycles, etc. etc.
I’d probably just start with one and see what happens
Thanks for the input. Are you scaling workers up and down? If so, based on what? Do you have an example (or two)?
In one example I can think of we use a custom metric with HPA which is based off the age of the oldest unacknowledged message in the queue
It’s a rough metric, but it means that if the ingesting workers start being overloaded, there will be more messages in the queue than they can process, which means the message at the back of the queue will get older, which will cause another pod to be scheduled, which will help process messages faster.
But you’ll have to do the engineering for your particular requirements. Like I said these aren’t really kubernetes problems, these are application architecture problems. Don’t mix the two.
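As a rough sketch of that oldest-message-age approach (not our actual config): assuming some metrics adapter already exposes the queue’s oldest-message age as an external metric, the HPA side could look something like this, where the metric and deployment names are hypothetical:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sqs-worker                   # hypothetical worker deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sqs-worker
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: sqs_oldest_message_age_seconds   # hypothetical metric exposed by the adapter
        target:
          type: Value
          value: "300"                           # add pods once the oldest message is ~5 minutes old
```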
Well, I wouldn’t say I’m mixing something here. Kubernetes is the tool doing the actual scaling, based on metrics we provide. There is also pretty extensive documentation for the HPA, like the algorithms, scaling behaviour, etc., so I’d say it’s definitely part of it. Of course you should also consider application requirements, if there are any other than “please process my messages the fastest way possible”. Interesting metric for scaling. Thanks in any case.
Yes, the HPA would be what’s being used, but Kubernetes is an implementation detail. This is a problem at the application architecture level. Yes, there needs to be feedback about what’s actually feasible and sensible with Kubernetes, but it’s not what drives the decision.
It’s like deciding how much cereal you need to stow on a ship based on the size of the packets the cereal company sells.
There’s about a dozen questions that need to be decided first that have nothing to do with cereal boxes
2023-07-19
Looking to see if anyone has come across a “config” or other type of setup that will mark an EC2 instance as unhealthy. The reason being: when using self-managed node groups, as part of our patching policy we need to build new AMIs and perform an instance refresh in AWS. After hearing what happened to Datadog, I’m hoping there’s a more sane approach to having the ASG back out of an instance refresh if the nodes aren’t able to register with the cluster and start taking pods.
So ideally I’m looking for a health check that the ASG will understand, and then ideally issuing an instance refresh with rollback so that if something goes wrong it backs out to the safe launch template (and the older AMI).
2023-07-20
2023-07-22
Hey folks - seeing some weird behavior within an EKS cluster that started happening recently. It involves not being able to properly attach the cluster role / service account / cluster role binding stuff to our Ambassador pods so they can call the cluster API and start up properly. (This is in an older v1.21 cluster.) This has worked fine for over a year, and now any new pods brought in fail the same way and go into backoff. I wrote up the issue here; wondering if anyone else has seen similar behavior or has any idea of a workaround or way to remedy it? If we lose the current pods in service, or they get rescheduled, I am worried all of our Ambassador ingresses will become unavailable. (We are in the middle of an upgrade to 1.24, but in the process of building that cluster in the background.)
Thanks!
I’ve been able to resolve this by adding fsGroup: 1337 to the securityContext, as described in this article under “Check the pod and user group”. Even though we aren’t using IRSA on this cluster, and even though we are definitely past 1.19 (on 1.21), it seems to fix it if I edit the deployment and put that value in there, in case anyone else is in the same boat!
My pods can’t use the AWS Identity and Access Management (IAM) role permissions with the Amazon Elastic Kubernetes Service (Amazon EKS) account token.
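For anyone searching later, the change is roughly this (a sketch; names and image are placeholders, the only real change is the pod-level fsGroup):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ambassador
spec:
  selector:
    matchLabels:
      app: ambassador
  template:
    metadata:
      labels:
        app: ambassador
    spec:
      securityContext:
        fsGroup: 1337        # the fix: mounted volumes (incl. the SA token) become group-readable
      containers:
        - name: ambassador
          image: docker.io/datawire/ambassador:1.14.2   # placeholder image/tag
```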
2023-07-25
Hi all. Is there a way to find out which feature gates are enabled on EKS clusters?
As far as I was able to see, feature gates are not configurable with EKS. But I was wondering how to find out which of them are enabled. I found this in the docs:
“The feature gates that control new features for both new and existing API operations are enabled by default.”
Does that mean we can assume that all feature gates controlling new features are enabled? I know I just repeated what it says, but maybe someone can confirm or correct it.