#kubernetes (2024-06)
Archive: https://archive.sweetops.com/kubernetes/
2024-06-03
Hi all.
What are you using to display the current cluster details in the terminal? As in something that might help you tell if you’re in prod or some other environment?
Kubernetes prompt info for bash and zsh
For zsh in particular, I like https://github.com/romkatv/powerlevel10k It shows k8s contexts (active cluster name and namespace) as well as many many other useful bits of context info (like cloud provider account names, etc…)
I have a script that runs on every new terminal window setup in my zsh config to tell me what context I’m in, and I also run all my kubectl and Terraform commands through an alias that requires me to answer a Y/N prompt to confirm I really want to touch the Staging/Prod cluster (with Yellow/Red colouring respectively to really hit the point home)
I’ll grab it tomorrow when I login and stick it here
From my .zshrc file
Terraform/Terragrunt
For Terraform (we use Terragrunt, so mine is geared towards that), where we have a staging and a production directory to separate the IaC which isn’t shared:
alias tg='AWS_PROFILE=<MYPROFILE> terragrunt'
tg_production() {
  if [[ "$PWD" == *production* ]]; then
    # Prompt for confirmation in red color
    echo -e "\033[31mYou are about to run the 'tg' command in a 'production' directory. Are you sure? (y/n)\033[0m"
    read -r response
    if [[ "$response" =~ ^[Yy]$ ]]; then
      tg "$@"
    else
      echo "Command cancelled."
    fi
  else
    echo "The current directory does not contain 'production' in its path."
  fi
}

tg_staging() {
  if [[ "$PWD" == *staging* ]]; then
    # Prompt for confirmation in yellow color
    echo -e "\033[33mYou are about to run the 'tg' command in a 'staging' directory. Are you sure? (y/n)\033[0m"
    read -r response
    if [[ "$response" =~ ^[Yy]$ ]]; then
      tg "$@"
    else
      echo "Command cancelled."
    fi
  else
    echo "The current directory does not contain 'staging' in its path."
  fi
}
This forces me to ensure I am running a command intended for production from within the production directory, and asks for extra confirmation. Same for staging.
Kubernetes
context="$(kubectl config current-context)"
case "$context" in
*production*) echo -e "\033[31mWARNING!!! YOU ARE IN PRODUCTION KUBE CONTEXT\033[0m" ;;
*staging*) echo -e "\033[33mWARNING!!! YOU ARE IN STAGING KUBE CONTEXT\033[0m" ;;
*) echo "Safe!" ;;
esac
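For anyone who wants the same guard rail on kubectl itself, here’s a minimal sketch of the confirmation pattern described above applied to a kubectl wrapper (the function name and the assumption that context names contain “production”/“staging” are mine, not from the original script):
k() {
  # Confirm before running kubectl against sensitive contexts (sketch, adjust to taste)
  local context response
  context="$(kubectl config current-context)"
  case "$context" in
    *production*) echo -e "\033[31mContext '$context' looks like PRODUCTION. Continue? (y/n)\033[0m" ;;
    *staging*)    echo -e "\033[33mContext '$context' looks like STAGING. Continue? (y/n)\033[0m" ;;
    *)            kubectl "$@"; return ;;
  esac
  read -r response
  if [[ "$response" =~ ^[Yy]$ ]]; then
    kubectl "$@"
  else
    echo "Command cancelled."
  fi
}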
I also use Warp terminal which sets the prompt to the current k8s context whenever you type kubectl
2024-06-05
Guys, what’s the best practice for resource allocation for Deployments/StatefulSets? Should I not set any allocations at all and just rely on horizontal scaling? I feel like adding memory and CPU resources defeats the purpose of autoscaling? Please correct me if I’m wrong.
Actually, setting resource limits is crucial. Without them, Kubernetes doesn’t know how much CPU/memory your app needs, making it hard to scale effectively. It’s like setting widths/heights in CSS for predictable layouts. Limits guide the scheduler, ensuring stability and efficient resource use.
Setting allocations is important. For CPU, it’s preferable to set requests instead of limits to prevent throttling. For memory, requests and limits should be set.
This post has a good explanation of the details: https://home.robusta.dev/blog/stop-using-cpu-limits
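To make that advice concrete, here’s a minimal sketch of the split in a Deployment (names, image, and numbers are placeholders, not recommendations): requests for both CPU and memory so the scheduler can place the pod, a memory limit, and no CPU limit to avoid throttling.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels: { app: example-api }
  template:
    metadata:
      labels: { app: example-api }
    spec:
      containers:
        - name: api
          image: example/api:1.0      # placeholder image
          resources:
            requests:
              cpu: 250m               # what the scheduler reserves for the pod
              memory: 256Mi
            limits:
              memory: 512Mi           # memory limit set; CPU limit deliberately omitted
EOF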
@Erik Osterman (Cloud Posse) but I am getting worried that the pod would crash when the code runs out of memory/cpu during high usage
@Nate McCurdy Ooooh I see! I’ll read the docu, thank you!
I have a (I think pretty nooby) question, but for some reason can’t find the right phrase to google to work it out, or I’m skimming over the answer when reading docs somehow…
If I have a Deployment with replicas set to 4, and a HorizontalPodAutoscaler with minReplicas set to 2, which value takes precedence? I’m assuming the HPA will attempt to scale down to 2 when there isn’t enough utilisation to justify it, but I just can’t tell for sure!
Sorry if this is super obvious, but my brain rot from endless meetings this week won’t let me see the answer, lol
The deployment will start with 4 replicas, and the HPA will continuously evaluate the resource utilization. If the utilization is low, the HPA will scale down to the minimum of 2 replicas. The HPA dynamically adjusts the replica count to ensure optimal resource usage.
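For reference, a minimal HPA sketch for that situation (names and the 50% CPU target are illustrative). The Deployment’s replicas: 4 is just the starting point; once the HPA exists it owns the count between minReplicas and maxReplicas, so in GitOps/Terraform setups it’s common to leave replicas unset so the two don’t fight each other.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api          # the Deployment that started with replicas: 4
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
EOF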
Ok, perfect, I’m glad my assumption was right. You’re a lifesaver, thanks for weighing in quickly!
I swear when I first implemented the HPA the Kubernetes docs had a dedicated section in their ‘Concepts’ section for HPAs but all their scaling stuff seems to have been moved into real-world-example-type docs now which isn’t as helpful to me as a reference
Yea, when things get so complicated as in Kubernetes, that helps a lot
Also, checkout Keda if you end up needing to do more specialized autoscaling
Forgot to reply yesterday but just wanted to say thanks for the heads up on Keda! - I’ve been testing new scaling configs today with it and I’m really impressed with the flexibility I now have without having to do much work on integration. Some of the metrics I can now easily scale by make me wonder: “Jeez why isn’t this supported by default?” Though of course the answer is because K8s development has lots of other priorities.
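For anyone following along who hasn’t used KEDA, a rough sketch of a ScaledObject (the SQS trigger, queue URL, and names are placeholders; KEDA ships many other scalers, and the AWS authentication setup — IRSA or a TriggerAuthentication — is omitted here):
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-worker-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: example-worker             # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/example-queue
        queueLength: "5"             # target messages per replica
        awsRegion: us-east-1
EOF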
As a generic follow up question: Are there any other custom resources you’ve come across that have enhanced K8s in a way that shifted the way you work significantly like Keda looks like it will for me?
2024-06-06
Does anyone have any experience or ideas about running a k8s control plane in our account, and having k8s worker nodes in different cloud environments (keeping multi-tenancy in mind).
Our idea is having a product control plane (k8s), and each customer would have worker node groups in their respective cloud accounts. The control plane would ensure our services, license management, sharing configurations from us to the customer all happens there.
Are multiple worker node groups spanning multiple accounts possible with a single control plane?
Don’t recommend. Kubernetes doesn’t natively support multi-tenancy and isolation. One bad actor can bring down the cluster for everyone. Network latency will eat your SLO for breakfast.
Not to mention Kubernetes version upgrades between the nodes and the control plane.
RBAC, Network Policy, authentication, QoS, and DNS setup will make you lose your sanity in this kind of setup.
Hmm, so generally speaking, we’re talking about a single control plane per customer?
The only reason we want this is so data never leaves their account/perimeter. We just want our controllers, admission controllers, and security services to ensure customers cannot do anything except have exposure to the service/product we’re providing them.
I don’t develop such things, so my suggestions might not be what you’re looking for.
If you want your product to be installable on Kubernetes, just let customers install it via manifests or a Helm chart on their own cluster. It’s up to them to manage the cluster.
There is no good reason to take over the control plane just to install something on a cluster.
We have that offering also, but it’s difficult to enforce things we need from an enterprise standpoint. Sharing the nodes with the customer, while retaining control over the control plane allows us to have a single point of entry for authz/authn, deploying product updates, updating customer configurations for their setup in near-realtime.
Everything you mentioned can be implemented with native Kubernetes resources and application logic. I still don’t see your point ¯\_(ツ)_/¯
Yeah, enforcing license requirements in an isolated environment is not possible in our scenario. If the customer “cuts the cord” we can’t do anything about it. We won’t have any monitoring/metrics, etc. So this hybrid model came into being; we’re still fleshing it all out.
2024-06-10
Issue: Application performance.
Explanation: We have deployed all our microservices on AWS EKS. Some are backend services that communicate internally (around 50 services), and our main API service, “loco,” handles logging and other functions. The main API service is accessed through the following flow: AWS API Gateway -> Nginx Ingress Controller -> Service. In the ingress, we use path-based routing, and we have added six services to the ingress, each with a corresponding resource in a single API gateway. Our Angular static application is deployed on S3 and accessed through CloudFront. The complete flow is as follows: CloudFront -> Static S3 (frontend) -> AWS API Gateway -> VPC Link -> Ingress (Nginx Ingress Controller with path-based routing) -> Services -> Container.
Problem: Occasionally, the login process takes around 6-10 seconds, while at other times it only takes 1 second. The resource usage of my API services is within the limit. Below are the screenshots from Datadog traces of my API service:
• Screenshot of the API service when it took only 1 second
• Screenshot of the API service when it took 6-10 seconds
Request for help: How should I troubleshoot this issue to identify where the slowness is occurring?
Any clues from looking at the live-locobuzzing-api span that took ~6s? Is the span broken down further?
I take it you have reviewed the flame graph of the login service and done a profile of it to rule out the login service itself being the bottleneck?
I only ask because on your second image there are 4 times as many spans being indexed, so it is making me wonder whether between the two screenshots something has invalidated a cache your app relies on and it is having to rebuild that? Maybe a new pod of that service has been spun up from a scaling event and your containers don’t come with the cache prewarmed?
2024-06-11
We are updating our terraform-aws-eks-node-group module to support EKS on AL2023. With that, it will support AL2, AL2023, Bottlerocket, and Windows Server. Each has different ways to configure kubelet, and kubelet itself is moving from command-line args to KubeletConfiguration. I am seeking community input on how the module should take configuration parameters and deploy them properly to each OS.
Goals:
• We want to limit maintenance, so we only want to support the most used configuration options. I’m thinking just kube-reserved, system-reserved, eviction-hard, and eviction-soft (see the sketch after this list), although we also need to support taints applied before the node joins the cluster.
• We want it to interoperate well with EKS defaults. In particular, I’m concerned with using --config <file> with bootstrap.sh. Has anyone tried that?
• We will always allow you to provide the complete userdata for the Node, so advanced use cases can be handled via that escape route.
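For discussion, here’s a rough sketch of the kind of KubeletConfiguration fragment those options would translate into (the file path and every value are arbitrary placeholders, not recommendations, and how it gets merged into the node’s config differs per OS):
cat > /tmp/kubelet-extra-config.json <<'EOF'
{
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "kind": "KubeletConfiguration",
  "kubeReserved": { "cpu": "100m", "memory": "256Mi", "ephemeral-storage": "1Gi" },
  "systemReserved": { "cpu": "100m", "memory": "128Mi" },
  "evictionHard": { "memory.available": "100Mi", "nodefs.available": "10%" },
  "evictionSoft": { "memory.available": "300Mi" },
  "evictionSoftGracePeriod": { "memory.available": "2m" },
  "registerWithTaints": [
    { "key": "dedicated", "value": "example", "effect": "NoSchedule" }
  ]
}
EOF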
One other question: AMI IDs in the Launch Template. Previously, if you just wanted the latest AMI for your nodes, we left the AMI ID out of the launch template and let EKS handle updates when new versions were released. I thought we might change that to always specify the AMI ID, providing a consistent experience, getting the AMI ID from the Public SSM Parameter for the latest (“recommended”) AMI. However, it does seem we lose some features that way, like visibility of the availability of updates via the AWS Web Console. Thoughts?
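For context, the SSM lookup in question looks roughly like this (the AL2 path is the one I’m sure of; other OSes use different path layouts, and I believe the “recommended” segment can be swapped for a specific release):
# Latest ("recommended") EKS-optimized AL2 AMI for Kubernetes 1.29 in the current region
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2/recommended/image_id \
  --query 'Parameter.Value' --output text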
Synopsis The kubelet is the primary “node agent” that runs on each node. It can register the node with the apiserver using one of: the hostname; a flag to override the hostname; or specific logic for a cloud provider. The kubelet works in terms of a PodSpec. A PodSpec is a YAML or JSON object that describes a pod. The kubelet takes a set of PodSpecs that are provided through various mechanisms (primarily through the apiserver) and ensures that the containers described in those PodSpecs are running and healthy.
Kubelet Configuration (v1beta1) — the API reference for kubelet configuration resource types (CredentialProviderConfig, KubeletConfiguration, SerializedNodeConfigSource) and their logging format options.
New Features, Breaking Changes
tl;dr Upgrading to this version will likely cause your node group to be replaced, but otherwise should not have much impact for most users.
The major new feature in this release is support for Amazon Linux 2023 (AL2023). EKS support for AL2023 is still evolving, and this module will evolve along with that. Some detailed configuration options (e.g. KubeletConfiguration JSON) are not yet supported, but the basic features are there.
The other big improvements are in immediately applying changes and in selecting AMIs, as explained below.
Along with that, we have dropped some outdated support and changed the eks_node_group_resources output, resulting in minor breaking changes that we expect do not affect many users.
Create Before Destroy is Now the Default
Previously, when changes forced the creation of a new node group, the default behavior for this module was to delete the existing node group and then create a replacement. This is the default for Terraform, motivated in part by the fact that the node group’s name must be unique, so you cannot create the new node group with the same name as the old one while the old one still exists.
With version 2 of this module, we recommended setting create_before_destroy to true to enable this module to create a new node group (with a partially randomized name) before deleting the old one, allowing the new one to take over for the old one. For backward compatibility, and because changing this setting always results in creating a new node group, the default setting was false.
With this release, the default setting of create_before_destroy is now true, meaning that if left unset, any changes requiring a new node group will cause a new node group to be created first, and then the existing node group to be deleted. If you have large node groups or small quotas, this can fail due to having the 2 node groups running at the same time.
Random name length now configurable
In order to support “create before destroy” behavior, this module uses the random_pet resource to generate a unique pet name for the node group, since the node group name must be unique, meaning the new node group must have a different name than not only the old one, but also all other node groups you have. Previously, the “random” pet name was 1 of 452 possible names, which may not be enough to avoid collisions when using a large number of node groups.
To address this, this release introduces a new variable, random_pet_length, that controls the number of pet names concatenated to form the random part of the name. The default remains 1, but now you can increase it if needed. Note that changing this value will always cause the node group name to change and therefore the node group to be replaced.
Immediately Apply Launch Template Changes
This module always uses a launch template for the node group. If one is not supplied, it will be created.
In many cases, changes to the launch template are not immediately applied by EKS. Instead, they only apply to Nodes launched after the template is changed. Depending on other factors, this may mean weeks or months pass before the changes are actually applied.
This release introduces a new variable, immediately_apply_lt_changes, to address this. When set to true, any changes to the launch template will cause the node group to be replaced, ensuring that all the changes are made immediately. (Note: you may want to adjust the node_group_terraform_timeouts if you have big node groups.)
The default value for immediately_apply_lt_changes is whatever the value of create_before_destroy is.
Changes in AMI selection
Previously, unless you specified a specific AMI ID, this module picked the “newest” AMI that met the selection criteria, which in turn was based on the AMI Name. The problem with that was that the “newest” might not be the latest Kubernetes version. It might be an older version that was patched more recently, or simply finished building a little later than the latest version.
Now that AWS explicitly publishes the AMI ID corresponding to the latest (or, more accurately, “recommended”) version of their AMIs via SSM Public Parameters, the module uses that instead. This is more reliable and should eliminate the version regression issues that occasionally happened before.
• The ami_release_version input is now obsolete, because it was based on the AMI name, which is different than the SSM Public Parameter Path.
• The new ami_specifier takes the place of ami_release_version, and is specifically whatever path element in the SSM Public Parameter Path replaces “recommended” or “latest” in order to find the AMI ID. Unfortunately, the format of this value varies by OS, and we have not found documentation for it. You can generally figure it out from the AMI name or description, and validate it by trying to retrieve the SSM Public Parameter for the AMI ID.
Examples of AMI specifier based on OS:
• AL2: amazon-eks-node-1.29-v20240117
• AL2023: amazon-eks-node-al2023-x86_64-standard-1.29-v20240605
• Bottlerocket: 1.20.1-7c3e9198  # Note: 1.20.1 is the Bottlerocket, not Kubernetes, version
• Windows:
The main utility of including a specifier rather than an AMI ID is that it allows you to have a consistent release configured across multiple regions without having to have region-specific configuration.
Customization via userdata
Unsupported userdata now throws an error
Node configuration via userdata is different for each OS. This module has 4 inputs related to Node configuration that end up using userdata:
before_cluster_joining_userdata
kubelet_additional_options
bootstrap_additional_options
after_cluster_joining_userdata
but they do not all work for all OSes, and none work for Bottlerocket. Previously, they were silently ignored in some cases. Now they throw an error when set for an unsupported OS.
Note that for all OSes, you can bypass all these inputs and supply your own fully-formed, base64-encoded userdata via userdata_override_base64, and this module will pass it along unmodified.
Multiple lines supported in userdata scripts
All the userdata inputs take lists, because they are optional inputs. Previously, lists were limited to single elements. Now the list can be any length, and the elements will be combined.
Kubernetes Version No Longer Inferred from AMI
Previously, if you specified an AMI ID, the Kubernetes version would be deduced from the AMI ID name. That is not sustainable as new OSes are launched, so the module no longer tries to do that. If you do not supply the Kubernetes version, the EKS cluster’s Kubernetes version will be used.
Output eks_node_group_resources changed
The aws_eks_node_group.resources attribute is a “list of objects containing information about underlying resources.” Previously, this was output via eks_node_group_resources as a list of lists, due to a quirk of Terraform. It is now output as a list of resources, in order to align with the other outputs.
Special Support for Kubernetes Cluster Autoscaler removed
This module used to take some steps (mostly labeling) to try to help the Kubernetes Cluster Autoscaler. As the Cluster Autoscaler and EKS native support for it evolved, the steps taken became either redundant or ineffective, so they have been dropped.
• cluster_autoscaler_enabled has been deprecated. If you set it, you will get a warning in the output, but otherwise it has no effect.
AWS Provider v5.8 or later now required
Previously, this module worked with AWS Provider v4, but no longer. Now v5.8 or later is required.
Special Thanks
This PR builds on the work of <https://githu…
2024-06-15
Hi, we have an ingress manifest which contains Cognito authentication as a default that we don’t want, so after deployment we removed it manually from the console. However, since we are managing resources with Argo CD, the rule keeps getting added back. I have disabled automatic sync, but it doesn’t help. Need suggestions!
any url or logs?
I checked the ALB controller and Argo CD logs; I don’t find any errors there, and I can see the sync policy is ‘none’ in the app details.
is there an annotation related to this auth?
Yes, we are using some annotations in the ingress manifest.
can you show the annotations?
guessing one of them may cause the issue
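A quick way to compare what’s live versus what Argo CD wants to apply (namespace, ingress, and app names are placeholders; with the AWS Load Balancer Controller, Cognito auth is typically driven by the alb.ingress.kubernetes.io/auth-type and auth-idp-cognito annotations, so those are the ones to look for):
# Live annotations on the Ingress
kubectl -n example-namespace get ingress example-ingress -o json | jq '.metadata.annotations'

# What Argo CD considers the desired state (if you have the argocd CLI handy)
argocd app manifests example-app    # then look for the Ingress and its annotations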
2024-06-16
2024-06-17
2024-06-20
Hi, I’m using Bitnami’s Postgres-HA chart, and after using it for a while we decided that we don’t really need HA; a single-pod DB is enough, without pg_pool and the issues it comes with. Right now we’re planning to migrate to Bitnami’s normal Postgres chart. I would like to know how to keep the database’s data persisted even if we switch to the new chart?
Set the reclaim policy to Retain for the persistent volume used by the postgresql-ha statefulset (either manually or via Helm chart values), remove the postgres-ha Helm release, and reuse the existing volume in the postgresql chart.
Alternatively, set up a new Postgres server with the postgresql chart next to the existing one and use pg_dump and pg_restore to move the data.
Test either scenario in a dev environment before running this in production.
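Rough commands for either path (PV name, release names, hosts, credentials, and database names are all placeholders):
# Option 1: keep the volume when the postgresql-ha release goes away
kubectl patch pv example-pv-name \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
helm uninstall postgres-ha -n example-namespace
# ...then install the plain postgresql chart pointed at the existing PVC/volume

# Option 2: dump from the old server and restore into a fresh postgresql release
pg_dump -h old-postgresql-ha-pgpool -U app_user -Fc app_db > app_db.dump
pg_restore -h new-postgresql -U app_user -d app_db --no-owner app_db.dump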
Thank you @Piotr Pawlowski, I did disable postgres-ha and instructed the Postgres chart to use the existing PVC, but when the DB pod starts I get:
postgresql 15:22:40.22 INFO ==> ** Starting PostgreSQL setup **
postgresql 15:22:40.24 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql 15:22:40.25 INFO ==> Loading custom pre-init scripts...
postgresql 15:22:40.26 INFO ==> Initializing PostgreSQL database...
postgresql 15:22:40.28 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql 15:22:40.28 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql 15:22:40.30 INFO ==> Deploying PostgreSQL with persisted data...
postgresql 15:22:40.31 INFO ==> Loading custom scripts...
postgresql 15:22:40.32 INFO ==> ** PostgreSQL setup finished! **

postgresql 15:22:40.34 INFO ==> ** Starting PostgreSQL **
2024-06-25 15:22:40.400 GMT [1] FATAL: could not access file "repmgr": No such file or directory
2024-06-25 15:22:40.400 GMT [1] LOG: database system is shut down
Just an update: the only thing that worked was backup and restore. Running the non-HA pod on an HA pod’s volume does not work.
2024-06-21
2024-06-25
2024-06-28
I got another noob question about K8S… I was testing a cluster update, that happened to cycle the nodes in the node group. I’m using EKS on AWS, with just a single node, but have three AZs available to the node group. There’s a StatefulSet (deployed with a helm chart) using a persistent volume claim, which is backed by an EBS volume. The EBS volume is of course tied to a single AZ. So, when the node group updated, it seems it didn’t entirely account for the zonal attributes, and cycled through 3 different nodes before it finally created one in the correct AZ that could meet the zonal restrictions of the persistent volume claim and get all the pods back online. Due to the zonal issues, the update took about an hour. The error when getting the other nodes going was "1 node(s) had volume node affinity conflict."
So basically, any pointers on how to handle this kind of constraint? Is there an adjustment to the helm chart, or some config option I can pass in, to adjust the setup somehow to be more zonal-friendly? Or is there a Kubernetes design pattern around this? I tried just googling, but didn’t seem to get much in the way of good suggestions… I don’t really want to have a node-per-AZ always running…
Yup, k8s scheduler is not AZ aware when it comes to that sort of thing.
It also enables the scheduler to scale up specific AZs based on traffic.
Alternatively, you can use EFS, which is not zone-specific, but also not suitable for all kinds of workloads.
is “node pool” the same as “node group”?
i do have efs already setup, but it’s not the default storage class. maybe i’ll try that
In our refarch, here’s our component https://github.com/cloudposse/terraform-aws-components/tree/23f29ccd8727cc1fbe8f19ab6b3b71dd37316a00/modules/eks/storage-class
Wouldn’t Karpenter be a good solution to working around this? You would just need a tiny node that is running anywhere to host the karpenter controller pod, then you can instruct Karpenter to only provision in a preferred AZ?
I’ve never used Karpenter, so I don’t really know?
@Jeremy G (Cloud Posse)
A few things here:
“Node Group” is an EKS term. As I understand it, it is defined as a collection of nodes sharing a single autoscaling group (ASG) and launch template (LT).
“Node Pool” is a Karpenter term, effectively the same kind of thing, but using Karpenter to launch nodes, rather than ASGs and LTs.
If you create an ASG that spans multiple Availability Zones (AZs), the K8s cluster autoscaler (or any autoscaler) does not have control over which AZ a new EC2 instance will be added to when the ASG scales up. At some point, the EC2 Autoscaler may see that there is a big imbalance and try to fix it by bringing up instances in one AZ and shutting them down in another, but again there is no way to proactively direct this.
Our eks/cluster component handles availability zone balance by creating one managed node group (MNG) per AZ. The K8s cluster autoscaler understands this configuration, and will keep the MNGs in balance across the AZs.
Furthermore, Kubernetes, the K8s cluster autoscaler, and Karpenter all understand that EBS-backed PVs are tied to a “topology zone” (generic term for what in AWS is an AZ), and will only schedule a Pod into the same zone as its PVC targets.
If there is not enough capacity in the zone with the PVC, K8s needs to add a new Node in that AZ. Karpenter can always do this. The best the K8s cluster autoscaler can do is increase the desired capacity of an ASG. If the ASG is only in the target AZ, then success is guaranteed, but if the ASG spans AZs, then scaling up the ASG may or may not increase capacity in the target zone, in which case I think the K8s cluster autoscaler just gives up.
EFS-backed volumes solve the AZ problem, but at the cost of roughly 3x the storage price and severely limited IO bandwidth and IOPS. It’s appropriate for sharing configuration data and small amounts of logging. It gets overwhelmed by storing Prometheus’ scraped metrics unless you pay (dearly) for extra bandwidth.
The short answer, @loren, is that if you only intend to run one Node and you want a Persistent Volume, restrict your ASG to one AZ and use a gp3 EBS volume for your PV.
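For reference, a typical gp3 StorageClass for the EBS CSI driver looks like the sketch below (not necessarily what the linked refarch component creates). WaitForFirstConsumer delays volume creation until the pod is scheduled, so new volumes land in an AZ that actually has a node; it doesn’t help an existing volume whose AZ no longer has capacity, which is where the one-ASG-per-AZ (or Karpenter) approach comes in.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3                          # name is up to you
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
EOF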
Configure Karpenter with NodePools
Thanks for that Jeremy! I’ll need to refactor things a fair bit to scale out the node group per az. Still working out some EFS issues on destroy, where the namespace gets locked up because the associated PVC has a claim but (I think) the access points got destroyed before the namespace, so K8S fails to remove the finalizers. Good times. Love K8S so so much.
i don’t think i ever would have thought about a node group per az, but what a difference! replacing updates on node groups now down to ~8min vs >40min! thanks again!