#kubernetes (2024-06)
Archive: https://archive.sweetops.com/kubernetes/
2024-06-03
Hi all.
What are you using to display the current cluster details in the terminal? As in something that might help you tell if you’re in prod or some other environment?
Kubernetes prompt info for bash and zsh
For zsh in particular, I like https://github.com/romkatv/powerlevel10k It shows k8s contexts (active cluster name and namespace) as well as many many other useful bits of context info (like cloud provider account names, etc…)
I have a script that runs on every new terminal window setup in my zsh config to tell me what context I’m in, and I also run all my kubectl and Terraform commands through an alias that requires me to answer a Y/N prompt to confirm I really want to touch the Staging/Prod cluster (with Yellow/Red colouring respectively to really hit the point home)
I’ll grab it tomorrow when I login and stick it here
From my .zshrc file
Terraform/Terragrunt
For Terraform (we use Terragrunt, so mine is geared towards that), where we have a staging and a production directory to separate the IaC which isn’t shared:
alias tg='AWS_PROFILE=<MYPROFILE> terragrunt'
tg_production() {
  if [[ "$PWD" == *production* ]]; then
    # Prompt for confirmation in red color
    echo -e "\033[31mYou are about to run the 'tg' command in a 'production' directory. Are you sure? (y/n)\033[0m"
    read -r response
    if [[ "$response" =~ ^[Yy]$ ]]; then
      tg "$@"
    else
      echo "Command cancelled."
    fi
  else
    echo "The current directory does not contain 'production' in its path."
  fi
}

tg_staging() {
  if [[ "$PWD" == *staging* ]]; then
    # Prompt for confirmation in yellow color
    echo -e "\033[33mYou are about to run the 'tg' command in a 'staging' directory. Are you sure? (y/n)\033[0m"
    read -r response
    if [[ "$response" =~ ^[Yy]$ ]]; then
      tg "$@"
    else
      echo "Command cancelled."
    fi
  else
    echo "The current directory does not contain 'staging' in its path."
  fi
}
This forces me to ensure I am running a command intended for production from within the production directory, and asks for extra confirmation. Same for staging.
Kubernetes
context="$(kubectl config current-context)"
case "$context" in
*production*) echo -e "\033[31mWARNING!!! YOU ARE IN PRODUCTION KUBE CONTEXT\033[0m" ;;
*staging*) echo -e "\033[33mWARNING!!! YOU ARE IN STAGING KUBE CONTEXT\033[0m" ;;
*) echo "Safe!" ;;
esac
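For anyone who wants the same guard rail on kubectl itself, here’s a minimal sketch of the confirmation pattern described above applied to a kubectl wrapper (the function name and the assumption that context names contain “production”/“staging” are mine, not from the original script):
k() {
  # Confirm before running kubectl against sensitive contexts (sketch, adjust to taste)
  local context response
  context="$(kubectl config current-context)"
  case "$context" in
    *production*) echo -e "\033[31mContext '$context' looks like PRODUCTION. Continue? (y/n)\033[0m" ;;
    *staging*)    echo -e "\033[33mContext '$context' looks like STAGING. Continue? (y/n)\033[0m" ;;
    *)            kubectl "$@"; return ;;
  esac
  read -r response
  if [[ "$response" =~ ^[Yy]$ ]]; then
    kubectl "$@"
  else
    echo "Command cancelled."
  fi
}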
I also use Warp terminal which sets the prompt to the current k8s context whenever you type kubectl
2024-06-05
Guys, what’s the best practice for resource allocation for Deployments/StatefulSets? Should I not set any allocations at all and just rely on horizontal scaling? I feel like adding memory and CPU resources defeats the purpose of autoscaling? Please correct me if I’m wrong.
Actually, setting resource limits is crucial. Without them, Kubernetes doesn’t know how much CPU/memory your app needs, making it hard to scale effectively. It’s like setting widths/heights in CSS for predictable layouts. Limits guide the scheduler, ensuring stability and efficient resource use.
Setting allocations is important. For CPU, it’s preferable to set requests instead of limits to prevent throttling. For memory, requests and limits should be set.
This post has a good explanation of the details: https://home.robusta.dev/blog/stop-using-cpu-limits
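To make that advice concrete, here’s a minimal sketch of the split in a Deployment (names, image, and numbers are placeholders, not recommendations): requests for both CPU and memory so the scheduler can place the pod, a memory limit, and no CPU limit to avoid throttling.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels: { app: example-api }
  template:
    metadata:
      labels: { app: example-api }
    spec:
      containers:
        - name: api
          image: example/api:1.0      # placeholder image
          resources:
            requests:
              cpu: 250m               # what the scheduler reserves for the pod
              memory: 256Mi
            limits:
              memory: 512Mi           # memory limit set; CPU limit deliberately omitted
EOF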
@Erik Osterman (Cloud Posse) but I am getting worried that the pod would crash when the code runs out of memory/cpu during high usage
@Nate McCurdy Ooooh I see! I’ll read the docu, thank you!
I have a (I think pretty nooby) question, but for some reason can’t find the right phrase to google to work it out, or I’m skimming over the answer when reading docs somehow…
If I have a Deployment with replicas set to 4, and a HorizontalPodAutoscaler with minReplicas set to 2, which value takes precedence? I’m assuming the HPA will attempt to scale down to 2 when there isn’t enough utilisation to justify it, but I just can’t tell for sure!
Sorry if this is super obvious, but my brain rot from endless meetings this week won’t let me see the answer, lol
The deployment will start with 4 replicas, and the HPA will continuously evaluate the resource utilization. If the utilization is low, the HPA will scale down to the minimum of 2 replicas. The HPA dynamically adjusts the replica count to ensure optimal resource usage.
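For reference, a minimal HPA sketch for that situation (names and the 50% CPU target are illustrative). The Deployment’s replicas: 4 is just the starting point; once the HPA exists it owns the count between minReplicas and maxReplicas, so in GitOps/Terraform setups it’s common to leave replicas unset so the two don’t fight each other.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api          # the Deployment that started with replicas: 4
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
EOF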
Ok, perfect, I’m glad my assumption was right. You’re a lifesaver, thanks for weighing in quickly!
I swear when I first implemented the HPA the Kubernetes docs had a dedicated section in their ‘Concepts’ section for HPAs but all their scaling stuff seems to have been moved into real-world-example-type docs now which isn’t as helpful to me as a reference
Yea, when things get so complicated as in Kubernetes, that helps a lot
Also, checkout Keda if you end up needing to do more specialized autoscaling
Forgot to reply yesterday but just wanted to say thanks for the heads up on Keda! - I’ve been testing new scaling configs today with it and I’m really impressed with the flexibility I now have without having to do much work on integration. Some of the metrics I can now easily scale by make me wonder: “Jeez why isn’t this supported by default?” Though of course the answer is because K8s development has lots of other priorities.
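For anyone following along who hasn’t used KEDA, a rough sketch of a ScaledObject (the SQS trigger, queue URL, and names are placeholders; KEDA ships many other scalers, and the AWS authentication setup — IRSA or a TriggerAuthentication — is omitted here):
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-worker-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: example-worker             # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/example-queue
        queueLength: "5"             # target messages per replica
        awsRegion: us-east-1
EOF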
As a generic follow up question: Are there any other custom resources you’ve come across that have enhanced K8s in a way that shifted the way you work significantly like Keda looks like it will for me?
2024-06-06
Does anyone have any experience or ideas about running a k8s control plane in our account, and having k8s worker nodes in different cloud environments (keeping multi-tenancy in mind).
Our idea is having a product control plane (k8s), and each customer would have worker node groups in their respective cloud accounts. The control plane would ensure our services, license management, sharing configurations from us to the customer all happens there.
Are multiple worker node groups spanning multiple accounts possible with a single control plane?
Don’t recommend. Kubernetes doesn’t natively support multi-tenancy and isolation. One bad actor can bring down the cluster for everyone. Network latency will eat your SLO for breakfast.
Not to mention Kubernetes version upgrades between the nodes and the control plane.
RBAC, Network Policy, authentication, QoS, and DNS setup will make you lose your sanity in this kind of setup.
Hmm, so generally speaking, we’re talking about a single control plane per customer?
The only reason we want this is so data never leaves their account/perimeter. We just want our controllers, admission controllers, and security services to ensure customers cannot do anything except have exposure to the service/product we’re providing them.
I don’t develop such things, so my suggestions might not be what you’re looking for.
If you want your product to be installable on Kubernetes, just let customers install it via manifests or a Helm chart on their own cluster. It’s up to them to manage the cluster.
There is no good reason to take over the control plane just to install something on a cluster.
We have that offering also, but it’s difficult to enforce things we need from an enterprise standpoint. Sharing the nodes with the customer, while retaining control over the control plane allows us to have a single point of entry for authz/authn, deploying product updates, updating customer configurations for their setup in near-realtime.
Everything you mentioned can be implemented with native Kubernetes resources and application logic. I still don’t see your point ¯\_(ツ)_/¯
Yeah, enforcing license requirements in an isolated environment is not possible in our scenario. If the customer “cuts the cord” we can’t do anything about it. We won’t have any monitoring/metrics, etc. So this hybrid model came into being; we’re still fleshing it all out.
2024-06-10
Issue: Application performance.
Explanation: We have deployed all our microservices on AWS EKS. Some are backend services that communicate internally (around 50 services), and our main API service, “loco,” handles logging and other functions. The main API service is accessed through the following flow: AWS API Gateway -> Nginx Ingress Controller -> Service. In the ingress, we use path-based routing, and we have added six services to the ingress, each with a corresponding resource in a single API gateway. Our Angular static application is deployed on S3 and accessed through CloudFront. The complete flow is as follows: CloudFront -> Static S3 (frontend) -> AWS API Gateway -> VPC Link -> Ingress (Nginx Ingress Controller with path-based routing) -> Services -> Container.
Problem: Occasionally, the login process takes around 6-10 seconds, while at other times it only takes 1 second. The resource usage of my API services is within the limit. Below are the screenshots from Datadog traces of my API service:
• Screenshot of the API service when it took only 1 second
• Screenshot of the API service when it took 6-10 seconds
Request for help: How should I troubleshoot this issue to identify where the slowness is occurring?
Any clues from looking at the live-locobuzzing-api span that took ~6s? Is the span broken down further?
I take it you have reviewed the flame graph of the login service and done a profile of it to rule out the login service itself being the bottleneck?
I only ask because on your second image there are 4 times as many spans being indexed, so it is making me wonder whether between the two screenshots something has invalidated a cache your app relies on and it is having to rebuild that? Maybe a new pod of that service has been spun up from a scaling event and your containers don’t come with the cache prewarmed?
2024-06-11
We are updating our terraform-aws-eks-node-group module to support EKS on AL2023. With that, it will support AL2, AL2023, Bottlerocket, and Windows Server. Each has different ways to configure kubelet, and kubelet itself is moving from command-line args to KubeletConfiguration. I am seeking community input on how the module should take configuration parameters and deploy them properly to each OS.
Goals:
• We want to limit maintenance, so we only want to support the most used configuration options. I’m thinking just kube-reserved, system-reserved, eviction-hard, and eviction-soft (see the sketch after this list), although we also need to support taints applied before the node joins the cluster.
• We want it to interoperate well with EKS defaults. In particular, I’m concerned with using --config <file> with bootstrap.sh. Has anyone tried that?
• We will always allow you to provide the complete userdata for the Node, so advanced use cases can be handled via that escape route.
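For discussion, here’s a rough sketch of the kind of KubeletConfiguration fragment those options would translate into (the file path and every value are arbitrary placeholders, not recommendations, and how it gets merged into the node’s config differs per OS):
cat > /tmp/kubelet-extra-config.json <<'EOF'
{
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "kind": "KubeletConfiguration",
  "kubeReserved": { "cpu": "100m", "memory": "256Mi", "ephemeral-storage": "1Gi" },
  "systemReserved": { "cpu": "100m", "memory": "128Mi" },
  "evictionHard": { "memory.available": "100Mi", "nodefs.available": "10%" },
  "evictionSoft": { "memory.available": "300Mi" },
  "evictionSoftGracePeriod": { "memory.available": "2m" },
  "registerWithTaints": [
    { "key": "dedicated", "value": "example", "effect": "NoSchedule" }
  ]
}
EOF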
One other question: AMI IDs in the Launch Template. Previously, if you just wanted the latest AMI for your nodes, we left the AMI ID out of the launch template and let EKS handle updates when new versions were released. I thought we might change that to always specify the AMI ID, providing a consistent experience, getting the AMI ID from the Public SSM Parameter for the latest (“recommended”) AMI. However, it does seem we lose some features that way, like visibility of the availability of updates via the AWS Web Console. Thoughts?
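For context, the SSM lookup in question looks roughly like this (the AL2 path is the one I’m sure of; other OSes use different path layouts, and I believe the “recommended” segment can be swapped for a specific release):
# Latest ("recommended") EKS-optimized AL2 AMI for Kubernetes 1.29 in the current region
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2/recommended/image_id \
  --query 'Parameter.Value' --output text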
Synopsis The kubelet is the primary “node agent” that runs on each node. It can register the node with the apiserver using one of: the hostname; a flag to override the hostname; or specific logic for a cloud provider. The kubelet works in terms of a PodSpec. A PodSpec is a YAML or JSON object that describes a pod. The kubelet takes a set of PodSpecs that are provided through various mechanisms (primarily through the apiserver) and ensures that the containers described in those PodSpecs are running and healthy.
Kubelet Configuration (v1beta1) — the API reference for kubelet configuration resource types (CredentialProviderConfig, KubeletConfiguration, SerializedNodeConfigSource) and their logging format options.
New Features, Breaking Changes
tl;dr Upgrading to this version will likely cause your node group to be replaced, but otherwise should not have much impact for most users.
The major new feature in this release is support for Amazon Linux 2023 (AL2023). EKS support for AL2023 is still evolving, and this module will evolve along with that. Some detailed configuration options (e.g. KubeletConfiguration JSON) are not yet supported, but the basic features are there.
The other big improvements are in immediately applying changes and in selecting AMIs, as explained below.
Along with that, we have dropped some outdated support and changed the eks_node_group_resources output, resulting in minor breaking changes that we expect do not affect many users.
Create Before Destroy is Now the Default
Previously, when changes forced the creation of a new node group, the default behavior for this module was to delete the existing node group and then create a replacement. This is the default for Terraform, motivated in part by the fact that the node group’s name must be unique, so you cannot create the new node group with the same name as the old one while the old one still exists.
With version 2 of this module, we recommended setting create_before_destroy to true to enable this module to create a new node group (with a partially randomized name) before deleting the old one, allowing the new one to take over for the old one. For backward compatibility, and because changing this setting always results in creating a new node group, the default setting was false.
With this release, the default setting of create_before_destroy is now true, meaning that if left unset, any changes requiring a new node group will cause a new node group to be created first, and then the existing node group to be deleted. If you have large node groups or small quotas, this can fail due to having the 2 node groups running at the same time.
Random name length now configurable
In order to support “create before destroy” behavior, this module uses the random_pet resource to generate a unique pet name for the node group, since the node group name must be unique, meaning the new node group must have a different name than not only the old one, but also all other node groups you have. Previously, the “random” pet name was 1 of 452 possible names, which may not be enough to avoid collisions when using a large number of node groups.
To address this, this release introduces a new variable, random_pet_length, that controls the number of pet names concatenated to form the random part of the name. The default remains 1, but now you can increase it if needed. Note that changing this value will always cause the node group name to change and therefore the node group to be replaced.
Immediately Apply Launch Template Changes
This module always uses a launch template for the node group. If one is not supplied, it will be created.
In many cases, changes to the launch template are not immediately applied by EKS. Instead, they only apply to Nodes launched after the template is changed. Depending on other factors, this may mean weeks or months pass before the changes are actually applied.
This release introduces a new variable, immediately_apply_lt_changes, to address this. When set to true, any changes to the launch template will cause the node group to be replaced, ensuring that all the changes are made immediately. (Note: you may want to adjust the node_group_terraform_timeouts if you have big node groups.)
The default value for immediately_apply_lt_changes is whatever the value of create_before_destroy is.
Changes in AMI selection
Previously, unless you specified a specific AMI ID, this module picked the “newest” AMI that met the selection criteria, which in turn was based on the AMI Name. The problem with that was that the “newest” might not be the latest Kubernetes version. It might be an older version that was patched more recently, or simply finished building a little later than the latest version.
Now that AWS explicitly publishes the AMI ID corresponding to the latest (or, more accurately, “recommended”) version of their AMIs via SSM Public Parameters, the module uses that instead. This is more reliable and should eliminate the version regression issues that occasionally happened before.
• The ami_release_version input is now obsolete, because it was based on the AMI name, which is different than the SSM Public Parameter Path.
• The new ami_specifier takes the place of ami_release_version, and is specifically whatever path element in the SSM Public Parameter Path replaces “recommended” or “latest” in order to find the AMI ID. Unfortunately, the format of this value varies by OS, and we have not found documentation for it. You can generally figure it out from the AMI name or description, and validate it by trying to retrieve the SSM Public Parameter for the AMI ID.
Examples of AMI specifier based on OS:
• AL2: amazon-eks-node-1.29-v20240117
• AL2023: amazon-eks-node-al2023-x86_64-standard-1.29-v20240605
• Bottlerocket: 1.20.1-7c3e9198  # Note: 1.20.1 is the Bottlerocket, not Kubernetes, version
• Windows:
The main utility of including a specifier rather than an AMI ID is that it allows you to have a consistent release configured across multiple regions without having to have region-specific configuration.
Customization via userdata
Unsupported userdata now throws an error
Node configuration via userdata is different for each OS. This module has 4 inputs related to Node configuration that end up using userdata:
before_cluster_joining_userdata
kubelet_additional_options
bootstrap_additional_options
after_cluster_joining_userdata
but they do not all work for all OSes, and none work for Bottlerocket. Previously, they were silently ignored in some cases. Now they throw an error when set for an unsupported OS.
Note that for all OSes, you can bypass all these inputs and supply your own fully-formed, base64-encoded userdata via userdata_override_base64, and this module will pass it along unmodified.
Multiple lines supported in userdata scripts
All the userdata inputs take lists, because they are optional inputs. Previously, lists were limited to single elements. Now the list can be any length, and the elements will be combined.
Kubernetes Version No Longer Inferred from AMI
Previously, if you specified an AMI ID, the Kubernetes version would be deduced from the AMI ID name. That is not sustainable as new OSes are launched, so the module no longer tries to do that. If you do not supply the Kubernetes version, the EKS cluster’s Kubernetes version will be used.
Output eks_node_group_resources changed
The aws_eks_node_group.resources attribute is a “list of objects containing information about underlying resources.” Previously, this was output via eks_node_group_resources as a list of lists, due to a quirk of Terraform. It is now output as a list of resources, in order to align with the other outputs.
Special Support for Kubernetes Cluster Autoscaler removed
This module used to take some steps (mostly labeling) to try to help the Kubernetes Cluster Autoscaler. As the Cluster Autoscaler and EKS native support for it evolved, the steps taken became either redundant or ineffective, so they have been dropped.
• cluster_autoscaler_enabled has been deprecated. If you set it, you will get a warning in the output, but otherwise it has no effect.
AWS Provider v5.8 or later now required
Previously, this module worked with AWS Provider v4, but no longer. Now v5.8 or later is required.
Special Thanks
This PR builds on the work of <https://githu…
2024-06-15
Hi, we have an ingress manifest which contains Cognito authentication as a default that we don’t want, so after deployment we removed it manually from the console. However, since we are managing resources with Argo CD, the rule keeps getting added back. I have disabled automatic sync, but it doesn’t help. Need suggestions!
any url or logs?
I checked the ALB controller and Argo CD logs; I don’t find any errors there, and I can see the sync policy is ‘none’ in the app details.
is there an annotation related to this auth?
Yes, we are using some annotations in the ingress manifest.
can you show the annotations?
guessing one of them may cause the issue
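A quick way to compare what’s live versus what Argo CD wants to apply (namespace, ingress, and app names are placeholders; with the AWS Load Balancer Controller, Cognito auth is typically driven by the alb.ingress.kubernetes.io/auth-type and auth-idp-cognito annotations, so those are the ones to look for):
# Live annotations on the Ingress
kubectl -n example-namespace get ingress example-ingress -o json | jq '.metadata.annotations'

# What Argo CD considers the desired state (if you have the argocd CLI handy)
argocd app manifests example-app    # then look for the Ingress and its annotations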
2024-06-16
2024-06-17
2024-06-20
Hi, I’m using Bitnami’s Postgres-HA chart, and after using it for a while we decided that we don’t really need HA; a single-pod DB is enough, without pg_pool and the issues it comes with. Right now we’re planning to migrate to Bitnami’s normal Postgres chart. I would like to know how to keep the database’s data persisted even if we switch to the new chart?
Set the reclaim policy to Retain for the persistent volume used by the postgresql-ha statefulset (either manually or via Helm chart values), remove the postgres-ha Helm release, and reuse the existing volume in the postgresql chart.
Alternatively, set up a new Postgres server with the postgresql chart next to the existing one and use pg_dump and pg_restore to move the data.
Test either scenario in a dev environment before running this in production.
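Rough commands for either path (PV name, release names, hosts, credentials, and database names are all placeholders):
# Option 1: keep the volume when the postgresql-ha release goes away
kubectl patch pv example-pv-name \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
helm uninstall postgres-ha -n example-namespace
# ...then install the plain postgresql chart pointed at the existing PVC/volume

# Option 2: dump from the old server and restore into a fresh postgresql release
pg_dump -h old-postgresql-ha-pgpool -U app_user -Fc app_db > app_db.dump
pg_restore -h new-postgresql -U app_user -d app_db --no-owner app_db.dump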
Thank you @Piotr Pawlowski, I did disable postgres-ha and instructed the Postgres chart to use the existing PVC, but when the DB pod starts I get:
postgresql 15:22:40.22 INFO ==> ** Starting PostgreSQL setup **
postgresql 15:22:40.24 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql 15:22:40.25 INFO ==> Loading custom pre-init scripts...
postgresql 15:22:40.26 INFO ==> Initializing PostgreSQL database...
postgresql 15:22:40.28 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql 15:22:40.28 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql 15:22:40.30 INFO ==> Deploying PostgreSQL with persisted data...
postgresql 15:22:40.31 INFO ==> Loading custom scripts...
postgresql 15:22:40.32 INFO ==> ** PostgreSQL setup finished! **

postgresql 15:22:40.34 INFO ==> ** Starting PostgreSQL **
2024-06-25 15:22:40.400 GMT [1] FATAL: could not access file "repmgr": No such file or directory
2024-06-25 15:22:40.400 GMT [1] LOG: database system is shut down
Just an update: the only thing that worked was backup and restore. Running the non-HA pod on an HA pod’s volume does not work.
2024-06-21
2024-06-25
2024-06-28
I got another noob question about K8S… I was testing a cluster update, that happened to cycle the nodes in the node group. I’m using EKS on AWS, with just a single node, but have three AZs available to the node group. There’s a StatefulSet (deployed with a helm chart) using a persistent volume claim, which is backed by an EBS volume. The EBS volume is of course tied to a single AZ. So, when the node group updated, it seems it didn’t entirely account for the zonal attributes, and cycled through 3 different nodes before it finally created one in the correct AZ that could meet the zonal restrictions of the persistent volume claim and get all the pods back online. Due to the zonal issues, the update took about an hour. The error when getting the other nodes going was "1 node(s) had volume node affinity conflict."
So basically, any pointers on how to handle this kind of constraint? Is there an adjustment to the helm chart, or some config option I can pass in, to adjust the setup somehow to be more zonal-friendly? Or is there a Kubernetes design pattern around this? I tried just googling, but didn’t seem to get much in the way of good suggestions… I don’t really want to have a node-per-AZ always running…
Yup, k8s scheduler is not AZ aware when it comes to that sort of thing.
It also enables the scheduler to scale up specific AZs based on traffic.
Alternatively, you can use EFS, which is not zone-specific, but also not suitable for all kinds of workloads.
is “node pool” the same as “node group”?
i do have efs already setup, but it’s not the default storage class. maybe i’ll try that
In our refarch, here’s our component https://github.com/cloudposse/terraform-aws-components/tree/23f29ccd8727cc1fbe8f19ab6b3b71dd37316a00/modules/eks/storage-class
Wouldn’t Karpenter be a good solution to working around this? You would just need a tiny node that is running anywhere to host the karpenter controller pod, then you can instruct Karpenter to only provision in a preferred AZ?
I’ve never used Karpenter, so I don’t really know?
@Jeremy G (Cloud Posse)
A few things here:
“Node Group” is an EKS term. As I understand it, it is defined as a collection of nodes sharing a single autoscaling group (ASG) and launch template (LT).
“Node Pool” is a Karpenter term, effectively the same kind of thing, but using Karpenter to launch nodes, rather than ASGs and LTs.
If you create an ASG that spans multiple Availability Zones (AZs), the K8s cluster autoscaler (or any autoscaler) does not have control over which AZ a new EC2 instance will be added to when the ASG scales up. At some point, the EC2 Autoscaler may see that there is a big imbalance and try to fix it by bringing up instances in one AZ and shutting them down in another, but again there is no way to proactively direct this.
Our eks/cluster component handles availability zone balance by creating one managed node group (MNG) per AZ. The K8s cluster autoscaler understands this configuration, and will keep the MNGs in balance across the AZs.
Furthermore, Kubernetes, the K8s cluster autoscaler, and Karpenter all understand that EBS-backed PVs are tied to a “topology zone” (generic term for what in AWS is an AZ), and will only schedule a Pod into the same zone as its PVC targets.
If there is not enough capacity in the zone with the PVC, K8s needs to add a new Node in that AZ. Karpenter can always do this. The best the K8s cluster autoscaler can do is increase the desired capacity of an ASG. If the ASG is only in the target AZ, then success is guaranteed, but if the ASG spans AZs, then scaling up the ASG may or may not increase capacity in the target zone, in which case I think the K8s cluster autoscaler just gives up.
EFS-backed volumes solve the AZ problem, but at the cost of roughly 3x the storage price and severely limited IO bandwidth and IOPS. It’s appropriate for sharing configuration data and small amounts of logging. It gets overwhelmed by storing Prometheus’ scraped metrics unless you pay (dearly) for extra bandwidth.
The short answer, @loren, is that if you only intend to run one Node and you want a Persistent Volume, restrict your ASG to one AZ and use a gp3 EBS volume for your PV.
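For reference, a typical gp3 StorageClass for the EBS CSI driver looks like the sketch below (not necessarily what the linked refarch component creates). WaitForFirstConsumer delays volume creation until the pod is scheduled, so new volumes land in an AZ that actually has a node; it doesn’t help an existing volume whose AZ no longer has capacity, which is where the one-ASG-per-AZ (or Karpenter) approach comes in.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3                          # name is up to you
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
EOF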
Configure Karpenter with NodePools
Thanks for that Jeremy! I’ll need to refactor things a fair bit to scale out the node group per az. Still working out some EFS issues on destroy, where the namespace gets locked up because the associated PVC has a claim but (I think) the access points got destroyed before the namespace, so K8S fails to remove the finalizers. Good times. Love K8S so so much.
i don’t think i ever would have thought about a node group per az, but what a difference! replacing updates on node groups now down to ~8min vs >40min! thanks again!