SweetOps #kubernetes for April, 2024

Archive: https://archive.sweetops.com/kubernetes/

2024-04-05

Ben

Has anyone tried to use cloudflared for kubernetes ingress? My main concern is how to use it with a load balancer because right now it’s all integrated in nginx — can anyone advise on whether it’s possible to use nginx just for load balancing?

Erik Osterman (Cloud Posse)

09:06:55 PM

Yes, it should be possible to deploy nginx ingress on a private load balancer. Then deploy cloudflared and point it at the internal load balancer.

Erik Osterman (Cloud Posse)

09:07:08 PM

That way you get nginx capabilities, and the benefits of cloudflared.

Erik Osterman (Cloud Posse)

09:11:15 PM

Ultimately, you just point cloudflared at an endpoint. If it can reach it, it can forward the traffic.

Erik Osterman (Cloud Posse)

09:12:50 PM

tunnel: 6ff42ae2-765d-4adf-8112-31c55c1551ef
credentials-file: /root/.cloudflared/xxx.json

ingress:
  - hostname: foobar.acme.corp
    service: http://<nginx-private-balancer>:80
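A slightly fuller version of the config above can be rendered from a small script. This is only a sketch: `NGINX_INTERNAL_LB` and its default value are hypothetical placeholders for the DNS name of the private load balancer in front of nginx, and the tunnel ID and credentials path are copied unchanged from the message. It also adds a final catch-all rule, which cloudflared expects as the last ingress entry.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder: DNS name of the internal load balancer in front of nginx.
NGINX_INTERNAL_LB="${NGINX_INTERNAL_LB:-nginx-internal.example.internal}"

# Print a cloudflared config that forwards one hostname to the internal balancer.
render_cloudflared_config() {
  cat <<EOF
tunnel: 6ff42ae2-765d-4adf-8112-31c55c1551ef
credentials-file: /root/.cloudflared/xxx.json

ingress:
  # Public hostname forwarded to nginx behind the private load balancer.
  - hostname: foobar.acme.corp
    service: http://${NGINX_INTERNAL_LB}:80
  # Catch-all: the final rule has no hostname and matches everything else.
  - service: http_status:404
EOF
}

render_cloudflared_config
```

Redirect the output to a file and, if cloudflared is installed, check it with `cloudflared tunnel ingress validate` before starting the tunnel.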

Erik Osterman (Cloud Posse)

09:12:59 PM

https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/configure-tunnels/local-management/configuration-file/

Configuration file · Cloudflare Zero Trust docs

Quick tunnels do not need a configuration file.

Meb

08:42:56 AM

Yes, I can confirm cloudflared tunnels are great for avoiding exposing your cluster, and I love their setup. It works fine, but beware that the docs are outdated and need a full refresh.

2024-04-10

2024-04-11

rohit

04:02:24 PM

Hi folks, a question I haven’t found a solution to:

A customer has Vault running in their environment. We will give the customer a helm chart to install on their k8s cluster. Part of the helm install creates a service-account.yaml for our application.

But the issue we’re running into is authenticating this service account to the customer’s Vault. Until that is done, our apps will not be able to communicate with Vault.

How can this be handled gracefully? Do we ask the customer to create the namespace + service-account for our helm chart before they install it? They would create ns + service account, run the auth commands to give this svca access to vault, and then do a helm install ?

venkata.mutyala

04:34:02 AM

That’s probably a reasonable approach/ask, i.e. it’s a prerequisite to installing your helm chart.

venkata.mutyala

04:34:45 AM

I’m not sure what your product does, but if it can avoid having a dependency on Vault then it might be easier for folks to roll out, as your helm chart would be entirely self-contained at that point.
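The prerequisite flow discussed in this thread (customer creates the namespace and service account, grants it access to Vault, then installs the chart) might look roughly like the sketch below. Every name here is a hypothetical placeholder, as are the `DRY_RUN` switch and `run` helper, which only print the commands unless `DRY_RUN=0` is set. It assumes Vault's Kubernetes auth method is already enabled and configured at `auth/kubernetes`, that the named policy exists, and that the chart exposes `serviceAccount.create` and `serviceAccount.name` values.

```shell
#!/usr/bin/env bash
set -euo pipefail

# All names below are hypothetical placeholders, not taken from the thread.
NAMESPACE="${NAMESPACE:-vendor-app}"
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-vendor-app}"
VAULT_ROLE="${VAULT_ROLE:-vendor-app}"
VAULT_POLICY="${VAULT_POLICY:-vendor-app-read}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

prepare_vault_access() {
  # 1. The customer creates the namespace and service account ahead of time.
  run kubectl create namespace "${NAMESPACE}"
  run kubectl create serviceaccount "${SERVICE_ACCOUNT}" --namespace "${NAMESPACE}"

  # 2. The customer binds that service account to a Vault role.
  run vault write "auth/kubernetes/role/${VAULT_ROLE}" \
    "bound_service_account_names=${SERVICE_ACCOUNT}" \
    "bound_service_account_namespaces=${NAMESPACE}" \
    "policies=${VAULT_POLICY}" \
    ttl=1h

  # 3. Only then install the chart, reusing the existing service account.
  #    The chart path and the serviceAccount.* values are assumptions.
  run helm install vendor-app ./vendor-app-chart \
    --namespace "${NAMESPACE}" \
    --set serviceAccount.create=false \
    --set "serviceAccount.name=${SERVICE_ACCOUNT}"
}

prepare_vault_access
```

Keeping the service account out of the chart (or making its creation optional) means the Vault role can be bound before any pod starts.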

loren

09:30:30 PM

Hey folks, I have something of a dumb question… I’ve never really used Kubernetes before, and a lot of the terminology still makes no sense to me. I inherited a terraform module managing AWS EKS, and using the helm provider to provision a “helm_release”. I think there is a race condition here that I’m trying to confirm. So, the dumb question is, what exactly is a “helm release” doing? And, does it require a working/functional node group to succeed?

loren

09:31:20 PM

Also, where do I go in the EKS console to see what the helm release did?

bradym

09:39:48 PM

helm (https://helm.sh) calls itself a package manager for k8s. You can use it to deploy third-party apps using publicly published charts, or you can use custom charts to deploy your own apps.

Which k8s resources get deployed depends on the individual chart. Typically a chart will create several k8s objects that enable an app to run properly. The most common are a deployment/statefulset/daemonset to create the pods where the containers run, a service to allow connections into the pods, and an ingress object to enable connections from outside the cluster. You may also have other things like secrets or configmaps.

In the EKS console you can see the k8s objects in a cluster by clicking the “Resources” tab once you click on a cluster.

Helm

Helm - The Kubernetes Package Manager.

loren

09:40:44 PM

Lots of wonderful docs here about how to use it, but nothing that actually says, “This is what helm is, this is what it does, this is how it works, this is why you want to use it.”

https://helm.sh/docs/

Docs Home

Everything you need to know about how the documentation is organized.

loren

09:42:18 PM

Ok… Do I need a functional node group to create “k8s resources” using the helm_release resource?

bradym

09:43:53 PM

Yes. The node group defines the worker nodes and their config that will run your k8s workloads.

loren

09:48:13 PM

Got it, ok. Definitely a race condition then. The helm_release is starting before the node group is ready… I think I know how to fix that though… It waits for the cluster and its access entries, but it’s not waiting on the node group config.

bradym

09:48:36 PM

that would definitely do it

loren

09:49:37 PM

it was weird, because sometimes it would work fine. but then sometimes it would all blow up
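The ordering problem described here is usually fixed in Terraform by making the helm_release depend on the node group resource (for example with `depends_on`). The same ordering can be checked by hand, as sketched below. The cluster and node group names are hypothetical, and `DRY_RUN`/`run` are helpers introduced for this sketch that only print the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names; substitute your own cluster and node group.
CLUSTER_NAME="${CLUSTER_NAME:-example-cluster}"
NODEGROUP_NAME="${NODEGROUP_NAME:-example-nodes}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

wait_for_nodes() {
  # Block until EKS reports the managed node group as active.
  run aws eks wait nodegroup-active \
    --cluster-name "${CLUSTER_NAME}" \
    --nodegroup-name "${NODEGROUP_NAME}"

  # Then make sure the nodes have registered with the cluster and are Ready.
  run kubectl wait --for=condition=Ready nodes --all --timeout=300s
}

wait_for_nodes
```

A helm release started before these checks pass can succeed or fail depending on timing, which matches the intermittent behavior described above.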

bradym

09:51:50 PM

In terms of getting familiar with what a helm release does, helm dashboard is nice. It’ll run a local webserver and let you browse the releases that are running in your cluster and see the resources that make up the release and the yaml manifests that created them.

bradym

09:53:14 PM

Oh, it’s a plugin for helm: https://github.com/komodorio/helm-dashboard

komodorio/helm-dashboard

The missing UI for Helm - visualize your releases
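For a terminal-only view of what a release created, the plain helm CLI can show similar information. This is a minimal sketch: the release name and namespace are hypothetical, and `DRY_RUN`/`run` are helpers introduced here that only print the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical release name and namespace.
RELEASE="${RELEASE:-example-release}"
NAMESPACE="${NAMESPACE:-default}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

inspect_release() {
  # Every release in the cluster, across namespaces.
  run helm list --all-namespaces
  # Deployment status and notes for one release.
  run helm status "${RELEASE}" --namespace "${NAMESPACE}"
  # The rendered Kubernetes manifests that the release applied.
  run helm get manifest "${RELEASE}" --namespace "${NAMESPACE}"
  # The user-supplied values the release was installed with.
  run helm get values "${RELEASE}" --namespace "${NAMESPACE}"
}

inspect_release
```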

loren

09:53:32 PM

lol yeah, i’m green enough here that i don’t have any basic tooling installed for connecting to or managing k8s in any way

loren

09:53:53 PM

if it’s not in terraform, i aint got it

bradym

09:55:08 PM

That won’t last long

2024-04-12

Aditya PR

11:05:25 AM

Hi guys, looking for a suggestion from the community here: we are trying to shift to gitops and need a CD tool. Which one is better to use, flux or argo CD? I looked online but there’s a mixed response. Any thoughts?

12:52:25 PM

I’ve never used flux so cannot say much about it. I do recall AWS pushing Flux a lot when chatting with them.

I’ve used argocd for a few years and like it a lot. It’s easy, scalable, and has a nice interface and community. The hardest part about argocd (and probably flux) is setting up the yaml so it’s flexible enough to modify and add new apps to.

12:53:31 PM

If there is a flux specific addon (like the opentofu controller) that you want to use, you can use it in argocd using the flux subsystem

https://github.com/flux-subsystem-argo/flamingo

Henrique Cavarsan

03:16:48 PM

IMHO argocd is more user-friendly and has a smoother experience (nice UI), but from my experience, managing lots of projects can be tough. flux is simpler, more reliable, and has better management than argo (CRDs), especially when you use helm + kustomize + OCI. So it depends on your project needs.

venkata.mutyala

05:37:51 PM

RE: FluxCD. It’s probably worth reading up on what happened with Weaveworks as well as who is taking over the project now just to make sure it’s still got a good future.

RE: ArgoCD

OCI is definitely a bit weird with argocd. It doesn’t auto populate values/etc like a standard helm chart. At least that’s been my experience.

Aside from that I’ve been using argocd with customers for over 2 years now and it’s been great. We use an Appset to manage all of our argocd applications that developers deploy themselves.

ArgoCD also has a UI which I believe flux is still missing or requires an add-on project.

The only other thing I don’t like about ArgoCD is it uses helm template rather than the full helm support which I believe Flux offers. For example if you want to do a “hook” you have to find the respective hook to use within ArgoCD. And i don’t believe it’s a 1:1 translation from helm to argocd either.

Note: I work at a company that sells a PaaS built on top of ArgoCD + a bunch of other FOSS. So I’m probably a bit biased here.

Aditya PR

05:47:15 PM

Thank you all for the info. I’ll take notes from this discussion and discuss with our team.

2024-04-20

Zing

12:16:46 PM

hey all, i have some questions around best practices when using bottlerocket + karpenter.

context:
• i have moved all of the karpenter nodepools to bottlerocket
• i currently have ASG/MNG (managed node groups) created via terraform
• these “base” nodes are not on bottlerocket, since i want to decide which route to go
• karpenter is currently scheduled on those base nodes

questions:
• would it be better to use fargate or managed node groups for these base nodes?
  ◦ these base nodes are currently on AL2 and i would like to move them to bottlerocket
• some of our “cluster-critical” components are not being scheduled on these base nodes right now.
  ◦ i think they should be (aws lb, aws ebs, aws efs, coredns, etc.) but some people also say to just schedule ONLY karpenter on these base nodes, and let karpenter-provisioned nodes handle the kube-system components

thoughts?

Erik Osterman (Cloud Posse)

06:20:04 PM

@Jeremy G (Cloud Posse) are our recommendations documented publicly?

Jeremy G (Cloud Posse)

06:36:07 PM

@Erik Osterman (Cloud Posse) We have recommendations in this post in a thread in this Slack channel, but somehow publishing it to our docs website fell off my to-do list.

@Zing I believe all of your “cluster-critical” components should be deployed to your MNG using on-demand (as opposed to spot) instances. Then you can deploy all or most of your other workloads to Spot instances managed by Karpenter. This is because you should be deploying your MNG such that you have a node in each Availability Zone, thus ensuring your cluster-critical components can be spread across AZs and that your cluster can stay relatively healthy even if Karpenter goes out of service for a while.

I am generally opposed to using Fargate when you have all of the Terraform tools we have available. Fargate is great for managing scale if you have a minimal EKS configuration you are managing with command-line tools. It is overpriced and underperforming if you have a full cluster with an autoscaler like Karpenter.

Much more detail and reasoning in the above linked post. LMK if you have more questions.

I believe the idea of running Karpenter on Fargate was put forward by the Karpenter team because:
• They were proud of their accomplishment and wanted to show it off
• If you don’t run Karpenter on Fargate, then you have to deploy some nodes some other way, which makes using Karpenter a lot more complicated and the benefits less obvious
• In particular, for people setting up an EKS cluster without the help of Terraform and modules like Cloud Posse supplies, creating a managed node pool is a lot of work that can be avoided if you use Karpenter on Fargate
• Karpenter’s ability to interact with managed node pools was limited to non-existent when it first launched

Karpenter works well with managed node pools now, and the complications of setting up a node pool are greatly mitigated by using Cloud Posse’s Terraform modules and components. So the above motivations are not that powerful.

Our evolution of the decision against running Karpenter on Fargate went like this:

• If you do not run at least one node pool outside of Karpenter, then you cannot create, via Terraform, an EKS cluster with certain EKS-managed add-ons, such as Core DNS, because they require nodes to be running before they will report they are active. Terraform will wait for the add-ons to be active before declaring the cluster completely created, so the whole cluster creation fails.
• To support high availability, Cloud Posse and AWS recommend running nodes in 3 separate availability zones. Implied in this recommendation is that any cluster-critical services should be deployed to 3 separate nodes in the 3 separate availability zones. This includes EKS add-ons like Core DNS and EBS CSI driver (controller).
• Last time I checked (and this may have been fixed by now), if you specified 3 replicas of an add-on, Karpenter was not sufficiently motivated by the anti-affinity of the add-ons to create 3 nodes, one in each AZ. It just creates 1 node big enough to handle 3 replicas. What is worse, anti-affinity is only considered during scheduling, so once you have all 3 replicas on 1 node, they stay there, even as your cluster grows to dozens of nodes. Your HA and even your basic scalability (due to network bandwidth constraints on the node and cross-AZ traffic) are undermined because Karpenter put all your replicas on 1 node.

So to get around all of the above, we recommend deploying EKS with a normal managed node pool with 3 node groups (one per AZ). This allows the add-ons to deploy and initialize during cluster creation (satisfying Terraform that the cluster was properly created), and also ensures that the add-ons are deployed to different nodes in different AZs. (While you are at it, you can set up these nodes to provide some floor on available compute resources that ensure all your HA replicas have 3 nodes to run on at all times.) You do not need to use auto-scaling on this node pool, just one node in each AZ, refreshed occasionally.
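Whether the replicas of an add-on really ended up on separate nodes and AZs can be checked with a few kubectl queries. The sketch below does this for CoreDNS. It assumes the add-on's pods carry the `k8s-app=kube-dns` label and that nodes have the standard `topology.kubernetes.io/zone` label. The function names and the `--run` switch are made up for this example, and nothing contacts a cluster unless `--run` is passed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count distinct non-empty lines on stdin (node names or zone names).
count_distinct() {
  awk 'NF { seen[$0] = 1 } END { n = 0; for (k in seen) n++; print n }'
}

# Print the node each CoreDNS pod runs on, one per line.
coredns_nodes() {
  kubectl --namespace kube-system get pods --selector k8s-app=kube-dns \
    --output jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}'
}

# Print the availability zone label of the given node.
node_zone() {
  kubectl get node "$1" \
    --output jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
}

report_spread() {
  local nodes zones
  nodes="$(coredns_nodes)"
  zones="$(while read -r n; do
    if [ -n "${n}" ]; then node_zone "${n}"; fi
  done <<<"${nodes}")"
  echo "CoreDNS replicas run on $(count_distinct <<<"${nodes}") node(s)"
  echo "spread over $(count_distinct <<<"${zones}") zone(s)"
}

# Only talk to a cluster when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
  report_spread
fi
```

With 3 replicas, a result of 1 node or 1 zone would indicate the co-location problem described above.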

There is another advantage: you can now have Karpenter provision only Spot instances, and run the managed node pool with On Demand or Reserved instances. This gives you a stable platform for your HA resources and the price savings of Spot instances elsewhere in a relatively simple configuration.

Now that you have a basic node pool to support HA, you can run Karpenter on that node pool, without the risk that Karpenter will kill the node it is running on. Karpenter now (it didn’t for a long time) properly includes the capacity available in the managed node pool when calculating cluster capacity and scaling the nodes it manages, and can scale to zero if the managed node pool is sufficient.

(The caveat here is that we are focusing on clusters that are in constant use and where paying a small premium for extra reliability is worth it. For a cluster where you don’t care if it crashes or hangs, the argument for having 3 static nodes is less compelling.)

Regarding costs, Fargate has a premium in both cost per vCPU and GiB of RAM, and in the quantization of requests. If you are concerned about the cost of running the static node pool, especially for non-production clusters, you can run t3 instances, and/or run 2 instead of 3 nodes.

Pricing comparison in us-west-2: A c6a.large, which is what we recommend as a default for the static node pool in a production cluster, has 2 vCPUs and 4 GiB of memory, and costs $0.0765 per hour. A Fargate pod requesting 2 vCPUs and 4 GiB would cost $0.09874 per hour. A minimal pod (which is sufficient to run Karpenter) costs $0.0123425 per hour (1/8 the cost of the larger Fargate pod, about 1/6 the cost of the c6a.large, a little more than the cost of a t3.micro with 2 vCPUs and 1 GiB memory). If you have workloads that can run on Graviton (all the Kubernetes infrastructure does) then you can use the relatively cheaper c6g or c7g mediums at around $0.035 per hour, or $25/month.
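The arithmetic behind these figures can be reproduced with a short script. The per-vCPU and per-GiB Fargate rates below are assumptions chosen to be consistent with the totals quoted above (check current AWS pricing before relying on them), the minimal pod is assumed to be 0.25 vCPU with 0.5 GiB, and the 730 hours per month is an averaging assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed Fargate rates in USD per hour, consistent with the quoted totals.
FARGATE_VCPU_HOUR="0.04048"
FARGATE_GB_HOUR="0.004445"
# c6a.large hourly price as quoted in the message above.
C6A_LARGE_HOUR="0.0765"
# Assumption: average number of hours in a month.
HOURS_PER_MONTH=730

# usage: fargate_hourly <vcpus> <gib>
fargate_hourly() {
  awk -v v="$1" -v m="$2" -v pv="${FARGATE_VCPU_HOUR}" -v pm="${FARGATE_GB_HOUR}" \
    'BEGIN { printf "%.7f\n", v * pv + m * pm }'
}

# usage: monthly <hourly price>
monthly() {
  awk -v h="$1" -v n="${HOURS_PER_MONTH}" 'BEGIN { printf "%.2f\n", h * n }'
}

echo "Fargate 2 vCPU / 4 GiB:      $(fargate_hourly 2 4) USD/hour"
echo "Fargate 0.25 vCPU / 0.5 GiB: $(fargate_hourly 0.25 0.5) USD/hour"
echo "c6a.large:                   ${C6A_LARGE_HOUR} USD/hour"
echo "c6a.large per month:         $(monthly "${C6A_LARGE_HOUR}") USD"
```

The two Fargate lines correspond to the $0.09874 and $0.0123425 figures in the message.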

So our recommendation is to run a minimal managed node pool and run Karpenter on it. The exception might be for a tiny unimportant cluster where a baseline cost of $120/month is unacceptably high.

Zing

06:55:20 PM

thanks for that. that was a good read. the reason i was considering fargate is that we’re currently looking at modernizing our fleet (half of it is using “old” modules, and the other half is using “even older” modules) with the addition of bottlerocket. we’re using managed node groups for our “old” stuff, but we’re actually using the aws_eks_workers cloudposse module (from 3 years ago!) for our “even older” fleet. that module doesn’t really play nice with bottlerocket in its current state, so i was debating just moving our base nodes to fargate instead of thinking about managed groups… but it sounds like that wouldn’t be the right call

Zing

06:56:49 PM

we’re looking at revamping our architecture across the board quite soon, so i don’t want to make any drastic changes to our aws eks cluster modules, but it looks like it wouldn’t be a huge lift to just replace the aws_eks_workers module im currently using with the newest version of aws-eks-node-group (which seems to support bottlerocket?)

Jeremy G (Cloud Posse)

06:58:44 PM

One advantage eks-workers have over MNG is that you can set max_instance_lifetime, ensuring the instances are refreshed every so often. If there is an issue with them and Bottlerocket, open a ticket on GitHub and post a link here and we will look into it (although no promises on how quickly).

Zing

06:59:45 PM

it might just be that i’m using the version from 3 years ago

Zing

07:01:29 PM

based on the thread you posted, it seems like you recommend that deployments that should have HA (non-spot instances) should be scheduled on MNGs? i was going to reserve that ONLY for kube-system components

Zing

07:01:46 PM

curious why you wouldn’t just let the various end-user deployments live on karpenter-provisioned on-demand nodes

Jeremy G (Cloud Posse)

07:09:36 PM

You absolutely can let the various end-user deployments live on karpenter-provisioned on-demand nodes. It’s just that spot nodes are cheaper, and your deployments should, in general, be designed to be interruptible and movable, so they should not need the extra expense that comes with the on-demand service.

Zing

07:19:24 PM

gotcha, appreciate the insight

Zing

07:19:47 PM

next problem to solve: using instance store instead of EBS

Zing

07:19:58 PM

time to do some investigative work

Jeremy G (Cloud Posse)

07:20:26 PM

instance store is ephemeral, not a replacement for EBS for persistent data.

Zing

07:22:17 PM

yeah, historically we’ve used EBS and haven’t looked into instance store since it was new when we last looked, but we actually don’t need persistent data at all

Zing

07:22:23 PM

when we do, we just use persistent volumes

Zing

07:22:42 PM

so we’ve just been wasting money / limiting our performance

Zing

07:24:01 PM

sadly, it’s a whole janky workaround to get it to work on bottlerocket

https://github.com/bottlerocket-os/bottlerocket/discussions/1991#discussioncomment-3265188

Comment on #1991 Use ephemeral disks to store container images

The recommended approach is changing with Bottlerocket 1.9.0 because of the issue with CSI plugins reported in #2218 (thanks, @diranged) and the fix in #2240. The example from @arnaldo2792 above shows using a symlink, but for best results I’d recommend using a bind mount instead in newer versions.

The right approach depends on whether the bootstrap container needs to support older and newer versions of Bottlerocket. If older versions aren’t a concern, the logic can be simplified by quite a bit.

Here’s the Dockerfile for the container:

FROM amazonlinux:2
RUN yum -y install e2fsprogs bash mdadm util-linux
ADD setup-runtime-storage ./
RUN chmod +x ./setup-runtime-storage
ENTRYPOINT ["sh", "setup-runtime-storage"]

And here’s the setup-runtime-storage script with the logic for Bottlerocket before and after 1.9.0, showing how to use one or more ephemerals in a RAID-0 array:

#!/usr/bin/env bash
set -ex

ROOT_PATH="/.bottlerocket/rootfs"

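# Symlinks to ephemeral disks are created here by udev. Enable nullglob so an
# unmatched glob expands to an empty array instead of the literal pattern;
# otherwise the "no ephemeral disks" check below can never trigger.
shopt -s nullglob
declare -a EPHEMERAL_DISKS
EPHEMERAL_DISKS=("${ROOT_PATH}"/dev/disk/ephemeral/*)
shopt -u nullglob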

# Exit early if there aren't ephemeral disks
if [ "${#EPHEMERAL_DISKS[@]}" -eq 0 ]; then
  echo "no ephemeral disks found"
  exit 1
fi

MD_NAME="scratch"
MD_DEVICE="/dev/md/${MD_NAME}"
MD_CONFIG="/.bottlerocket/bootstrap-containers/current/mdadm.conf"

# Create or assemble the array.
if [ ! -s "${MD_CONFIG}" ] ; then
  mdadm --create --force --verbose \
    "${MD_DEVICE}" \
      --level=0 \
      --name="${MD_NAME}" \
      --raid-devices="${#EPHEMERAL_DISKS[@]}" \
      "${EPHEMERAL_DISKS[@]}"
  mdadm --detail --scan > "${MD_CONFIG}"
else
  mdadm --assemble --config="${MD_CONFIG}" "${MD_DEVICE}"
fi

# Format the array if not already formatted.
if ! blkid --match-token TYPE=ext4 "${MD_DEVICE}" ; then
  mkfs.ext4 "${MD_DEVICE}"
fi

MOUNT_POINT="${ROOT_PATH}/mnt/${MD_NAME}"

# Mount the array in the host's /mnt.
mkdir -p "${MOUNT_POINT}"
mount "${MD_DEVICE}" "${MOUNT_POINT}"

# Keep track of whether we can unmount the array later. This depends on the
# version of Bottlerocket.
should_umount="no"

# Bind state directories to the array, if they exist.
for state_dir in containerd docker kubelet ; do
  # The correct next step depends on the version of Bottlerocket, which can be
  # inferred by inspecting the mounts available to the bootstrap container.
  if findmnt "${ROOT_PATH}/var/lib/${state_dir}" ; then
    # For Bottlerocket >= 1.9.0, the state directory can be bind-mounted over
    # the host directory and the mount will propagate back to the host.
    mkdir -p "${MOUNT_POINT}/${state_dir}"
    mount --rbind "${MOUNT_POINT}/${state_dir}" "${ROOT_PATH}/var/lib/${state_dir}"
    mount --make-rshared "${ROOT_PATH}/var/lib/${state_dir}"
    should_umount="yes"
  elif [ ! -L "${ROOT_PATH}/var/lib/${state_dir}" ] ; then
    # For Bottlerocket < 1.9.0, the host directory needs to be replaced with a
    # symlink to the state directory on the array. This works but can lead to
    # unexpected behavior or incompatibilities, for example with CSI drivers.
    if [ -d  "${ROOT_PATH}/var/lib/${state_dir}" ] ; then
      # The host directory exists but is not a symlink, and might need to be
      # relocated to the storage array. This depends on whether the host has
      # been downgraded from a newer version of Bottlerocket, or whether it's
      # the first boot of an older version.
      if [ -d "${MOUNT_POINT}/${state_dir}" ] ; then
        # If downgrading from a version of Bottlerocket that supported bind
        # mounts, the directory will exist but should be empty, except for
        # subdirectories that may have been created by tmpfiles.d before an
        # upgrade to that version. Keep a copy of the directory just in case.
        rm -rf "${ROOT_PATH}/var/lib/${state_dir}.bak"
        mv "${ROOT_PATH}/var/lib/${state_dir}"{,.bak}
      else
        # Otherwise, treat it as the first boot of an older version, and move
        # the directory to the array.
        mv "${ROOT_PATH}/var/lib/${state_dir}" "${MOUNT_POINT}/${state_dir}"
      fi
    else
      # The host directory does not exist, so the target directory likely needs
      # to be created.
      mkdir -p "${MOUNT_POINT}/${state_dir}"
    fi
    # Any host directory has been dealt with and the symlink can be created.
    ln -snfT "/mnt/${MD_NAME}/${state_dir}" "${ROOT_PATH}/var/lib/${state_dir}"
  fi
done

# When using bind mounts, the parent directory where the array is mounted can
# be unmounted. This avoids a second, redundant mount entry under `/mnt` for
# every new mount in one of the state directories.
if [ "${should_umount}" == "yes" ] ; then
  umount "${MOUNT_POINT}"
fi

2024-04-21

Ritika Kumar

11:53:36 AM

Hi all, I have been trying to run minikube. I have added all the required permissions too, but it’s still showing me this. What can be done?

Erik Osterman (Cloud Posse)

12:38:27 PM

Are you able to run Docker without problems?

Ritika Kumar

01:07:30 PM

yes sir, Docker is operating fine

Erik Osterman (Cloud Posse)

04:34:44 PM

Are you trying to use the built-in support that ships with Docker for Desktop?

Ritika Kumar

02:38:58 PM

Yes sir, I used Docker for Desktop but it gave me an error related to the default file, so I tried using a virtual machine as the driver and it worked. But I am still figuring out why it didn’t work with docker as a driver.

Erik Osterman (Cloud Posse)

04:19:52 PM

Hrmm… I’m not sure, but at least you found a workaround.

Erik Osterman (Cloud Posse)

04:20:07 PM

Also, Podman released support for kubernetes on the desktop now too

Ritika Kumar

02:17:07 PM

oh thanks sir. I will surely look and explore it

Shivam s

05:38:20 PM

The issue might be with the Minikube version. Please check that you are using the image for the correct architecture.

Ritika Kumar

10:28:01 PM

Hi Shivam I downloaded latest minikube version only. It worked only when i ran the commands to download minikube on my powershell with virtual machine as device driver instead of downloading the .exe file and adding path to environment variable.

2024-04-22

2024-04-23

loren

08:08:06 PM

Am I just missing it, or is there no data source to return the available EKS Access Policies? I opened a ticket to request a new data source; please upvote it if it would help you too! https://github.com/hashicorp/terraform-provider-aws/issues/37065

2024-04-24

Balazs Varga

02:21:44 PM

Hello all, we are using spot instances in our clusters. Sometimes we see a lot of nodes come up and go down because of rebalance recommendations. We have that enabled in the termination handler. How could I reduce that, so we don’t get a new node that will itself receive a rebalance signal soon?

Hao Wang

03:58:37 AM

do you use Karpenter?

Hao Wang

03:58:42 AM

Karpenter publishes Kubernetes events to the node for all events listed above in addition to Spot Rebalance Recommendations. Karpenter does not currently support taint, drain, and terminate logic for Spot Rebalance Recommendations.

If you require handling for Spot Rebalance Recommendations, you can use the AWS Node Termination Handler (NTH) alongside Karpenter; however, note that the AWS Node Termination Handler cordons and drains nodes on rebalance recommendations, potentially causing more node churn in the cluster than with interruptions alone. Further information can be found in the Troubleshooting Guide.

Hao Wang

03:58:47 AM

https://karpenter.sh/docs/concepts/disruption/

Disruption
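One way to reduce the churn mentioned in the quoted docs is to stop the Node Termination Handler from acting on rebalance recommendations, while still draining on the actual spot interruption notice. The sketch below is an assumption-heavy example: the chart location and the value names (`enableSpotInterruptionDraining`, `enableRebalanceMonitoring`, `enableRebalanceDraining`) are taken from the upstream aws-node-termination-handler Helm chart as I understand it and should be verified against the chart version in use. `DRY_RUN`/`run` are helpers introduced here that only print the command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = execute it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

configure_nth() {
  # Assumed chart location and value names; verify them for your chart version.
  # Keep draining on real interruption notices, ignore rebalance recommendations.
  run helm upgrade --install aws-node-termination-handler \
    oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler \
    --namespace kube-system \
    --set enableSpotInterruptionDraining=true \
    --set enableRebalanceMonitoring=false \
    --set enableRebalanceDraining=false
}

configure_nth
```

The trade-off is less advance warning: nodes are only drained once the interruption notice arrives, not when a rebalance recommendation is issued.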