#refarch (2023-04)

Cloud Posse Reference Architecture

2023-04-04

johncblandii avatar
johncblandii

#elasticache-redis

│ Error: engine_version: Redis versions must match <major>.<minor> when using version 6 or higher, or <major>.<minor>.<patch>
      vars:
        name: csm
        family: redis7.cluster.on
        cloudwatch_metric_alarms_enabled: false
        redis_clusters:
          redis-csm:
            engine_version: 7.0
            instance_type: cache.t3.small

what the heck are the right combos for engine_version and family? I can’t get any combo to work, from redis7/7.0 to 7.x to 7.0.7, including the families default.redis7 and redis7.cluster.on.

johncblandii avatar
johncblandii

the CLI:

        {
            "Engine": "redis",
            "EngineVersion": "7.0",
            "CacheParameterGroupFamily": "redis7",
            "CacheEngineDescription": "Redis",
            "CacheEngineVersionDescription": "redis version 7.0.7"
        }
johncblandii avatar
johncblandii

“7.0” vs 7.0. #yamlfailure

leaving this all here for anyone else who hits the same problem.

fix: use quotes on values that can be misinterpreted as numbers.
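For example, a minimal sketch of the corrected vars from this thread, identical to the failing config except that the version is quoted so YAML keeps it as the string "7.0" rather than parsing it as a number:

      vars:
        name: csm
        family: redis7.cluster.on
        cloudwatch_metric_alarms_enabled: false
        redis_clusters:
          redis-csm:
            engine_version: "7.0"      # quoted string, not a bare YAML number
            instance_type: cache.t3.small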

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

Glad you figured that out! Definitely one of those gotchas with YAML

johncblandii avatar
johncblandii

Yup and I’m always removing quotes from strings. This one backfired.

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

lol, I’m always adding them!


2023-04-05

johncblandii avatar
johncblandii

is there an approach to subscribing an sqs-queue to an sns-topic from the queue’s config? I know we can remote-state it, but the subscribers part of sns-topic doesn’t natively support that, so I’m curious what patterns exist
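One possible pattern (a hedged, untested sketch, not a documented feature of the component): declare the subscription on the topic side and pass the queue ARN in directly via the sns-topic component’s subscribers input. The exact shape of subscribers below and the queue ARN are assumptions, so check the component’s variables.tf for the real schema:

components:
  terraform:
    sns-topic:
      vars:
        name: orders
        subscribers:
          orders-queue:
            protocol: sqs
            endpoint: arn:aws:sqs:us-east-1:111111111111:orders-queue   # hypothetical SQS queue ARN
            raw_message_delivery: true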

2023-04-07

Michael Dizon avatar
Michael Dizon

has anyone run into this error when trying to deploy an eks cluster with eks/cluster

│ Error: Post "https://xxxx.us-gov-west-1.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps": getting credentials: decoding stdout: no kind "ExecCredential" is registered for version "client.authentication.k8s.io/v1alpha1" in scheme "pkg/runtime/scheme.go:100"
Michael Dizon avatar
Michael Dizon

turns out I hadn’t updated my aws cli in a while
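For anyone else hitting this: the error generally means the exec credential plugin (aws eks get-token) is still returning the removed client.authentication.k8s.io/v1alpha1 ExecCredential, which newer Kubernetes clients reject. Updating the AWS CLI, and for kubectl regenerating the kubeconfig with aws eks update-kubeconfig, yields an exec stanza on the newer API version, roughly like this (the cluster name is hypothetical):

users:
  - name: eks-cluster                 # hypothetical cluster entry
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1   # newer CLIs emit v1beta1 instead of the removed v1alpha1
        command: aws
        args:
          - eks
          - get-token
          - --cluster-name
          - eks-cluster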

Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

Not using geodesic?

Michael Dizon avatar
Michael Dizon

no, not on this one

2023-04-17

Michael Dizon avatar
Michael Dizon

has anyone else tried deploying eks with kubernetes 1.26? running into this issue with nodes not joining

https://github.com/awslabs/amazon-eks-ami/issues/1263

#1263 1.26 nodes fail to join the cluster with custom VPC domain-name

What happened:
1.26 AMI nodes fail to join 1.26 clusters, both when upgrading from 1.25 to 1.26 and when starting fresh with a new 1.26 cluster.

What you expected to happen:
The nodes to join the cluster

Anything else we need to know?:

• Using managed node groups.
• The exact same Terraform deployment configuration works on 1.25. The only thing changed is the version for cluster/ami, which triggers the failure on both upgrades and new clusters.
• VPC DHCP domain name is in the format: ec2.internal acmedev.com

Environment:

• AWS Region: us-east-1
• Instance Type(s): m6a
• EKS Platform version: "eks.1"
• Kubernetes version: "1.26"
• AMI Version: amazon-eks-node-1.26-v20230406
• Kernel (e.g. uname -a): Linux ip-10-100-13-0.ec2.internalacmedev.com 5.10.173-154.642.amzn2.x86_64 #1 SMP Wed Mar 15 00:26:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
• Release information (run cat /etc/eks/release on a node):

BASE_AMI_ID="ami-099e00fe4091e48af"
BUILD_TIME="Thu Apr  6 01:36:39 UTC 2023"
BUILD_KERNEL="5.10.173-154.642.amzn2.x86_64"
ARCH="x86_64"

I believe the change in cloud-provider from aws to external has created an issue where our hostname for the kubelet is different between 1.25 and 1.26. This causes the aws-iam-authenticator node bootstrap logic to fail to register with the cluster because the hostnames in the requests are not the same.

hostnamed logs are the exact same on 1.25 and 1.26 nodes, including the “warning”

Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed pretty host name to 'ip-10-100-13-0.ec2.internal acmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed static host name to 'ip-10-100-13-0.ec2.internalacmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed host name to 'ip-10-100-13-0.ec2.internalacmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 cloud-init: Apr 13 13:55:48 cloud-init[2209]: util.py[WARNING]: Failed to non-persistently adjust the system hostname to ip-10-100-13-0.ec2.internal acmedev.com

We are not changing any of the kubelet arguments from their AMI defaults. The only thing we are doing is adding some labels/taints to the nodes via the managed node group terraform resources. No hostname overrides.

Apr 13 13:55:53 ip-10-100-13-0 kubelet: I0413 13:55:53.946396    2944 flags.go:64] FLAG: --cloud-provider="external"
Apr 13 13:55:53 ip-10-100-13-0 kubelet: I0413 13:55:53.946638    2944 flags.go:64] FLAG: --hostname-override=""

Pertinent messages that indicate node join failures.

Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.192348    2944 kubelet_node_status.go:669] "Recording event message for node" node="ip-10-100-13-0.ec2.internalacmedev.com" event="NodeHasNoDiskPressure"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.192745    2944 kubelet_node_status.go:669] "Recording event message for node" node="ip-10-100-13-0.ec2.internalacmedev.com" event="NodeHasSufficientPID"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.193204    2944 kubelet_node_status.go:70] "Attempting to register node" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.765164    2944 controller.go:146] failed to ensure lease exists, will retry in 200ms, error: leases.coordination.k8s.io "ip-10-100-13-0.ec2.internalacmedev.com" is forbidden: User "system:node:ip-10-100-13-0.ec2.internal" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": can only access node lease with the same name as the requesting node
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.765885    2944 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-100-13-0.ec2.internalacmedev.com" is forbidden: User "system:node:ip-10-100-13-0.ec2.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.766850    2944 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes \"ip-10-100-13-0.ec2.internalacmedev.com\" is forbidden: node \"ip-10-100-13-0.ec2.internal\" is not allowed to modify node \"ip-10-100-13-0.ec2.internalacmedev.com\"" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.969984    2944 kubelet_node_status.go:70] "Attempting to register node" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.972246    2944 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes \"ip-10-100-13-0.ec2.internalacmedev.com\" is forbidden: node \"ip-10-100-13-0.ec2.internal\" is not allowed to modify node \"ip-10-100-13-0.ec2.internalacmedev.com\"" node="ip-10-100-13-0.ec2.internalacmedev.com"

On the 1.25 nodes using cloud-provider=aws we can see the logs like:

Apr 12 15:14:31 ip-10-100-12-210 kubelet: I0412 15:14:31.176819    2906 server.go:993] "Cloud provider determined current node" nodeName="ip-10-100-12-210.ec2.internal"

https://github.com/kubernetes/kubernetes/blob/v1.26.2/cmd/kubelet/app/server.go#L989 which does not contain the acmedev.com appended to it.

The nodename returned in 1.25 aligns with the templated private DNS name returned from https://github.com/kubernetes-sigs/aws-iam-authenticator/tree/master that allows bootstrapping nodes. Since we are not using the aws cloud provider in 1.26, we might be getting back a different value for nodename which does not align.

Since the change to cloud-provider=external, I believe we are returning the hostname that we would get from hostname or uname -n, e.g. ip-10-100-13-0.ec2.internalacmedev.com, which does not align with what is returned from the EC2 API when getting the private DNS name for auth. Our node config in the aws-auth cm is standard:

  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      "rolearn": "arn:aws:iam::1234567890:role/role-name"
      "username": "system:node:{{EC2PrivateDNSName}}"
Erik Osterman (Cloud Posse) avatar
Erik Osterman (Cloud Posse)

@Jeremy G (Cloud Posse) could this be related to the other issue you were helping with? …relating to CNI addon

Jeremy G (Cloud Posse) avatar
Jeremy G (Cloud Posse)

No, this is something else

Jeremy G (Cloud Posse) avatar
Jeremy G (Cloud Posse)

@Michael Dizon Did you follow all the upgrade instructions/prerequisites at https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-1.26 ?

Amazon EKS Kubernetes versions - Amazon EKS

The Kubernetes project is continually integrating new features, design updates, and bug fixes. The community releases new Kubernetes minor versions, such as 1.26. New version updates are available on average every three months. Each minor version is supported for approximately twelve months after it’s first released.

Michael Dizon avatar
Michael Dizon

deployed from scratch in a new environment. should note that this is in govcloud. not sure if that makes a difference

Michael Dizon avatar
Michael Dizon

it deploys fine with 1.25.

Michael Dizon avatar
Michael Dizon

here’s a bit of the log output

"message": "csinodes.storage.k8s.io \"ip-10-xxx.xxx.com\" is forbidden: User \"system:node:ip-10-xxx.us-gov-west-1.compute.internal\" cannot get resource \"csinodes\" in API group \"storage.k8s.io\" at the cluster scope: can only access CSINode with the same name as the requesting node",
"reason": "Forbidden",
"details": {
  "name": "ip-xxx.xxx.com",
  "group": "storage.k8s.io",
  "kind": "csinodes"
},
"code": 403

Jeremy G (Cloud Posse) avatar
Jeremy G (Cloud Posse)


• Deprecated beta APIs scheduled for removal in v1.26 are no longer served. See https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-26 for more information. (#111973, @liggitt)
• The in-tree cloud provider for OpenStack (and the cinder volume provider) has been removed. Please use the external cloud provider and csi driver from cloud-provider-openstack instead. (#67782, @dims)

Deprecated API Migration Guide

As the Kubernetes API evolves, APIs are periodically reorganized or upgraded. When APIs evolve, the old API is deprecated and eventually removed. This page contains information you need to know when migrating from deprecated API versions to newer and more stable API versions. Removed APIs by release v1.29 The v1.29 release will stop serving the following deprecated API versions: Flow control resources The flowcontrol.apiserver.k8s.io/v1beta2 API version of FlowSchema and PriorityLevelConfiguration will no longer be served in v1.

Michael Dizon avatar
Michael Dizon

wondering if this PR will fix it. https://github.com/awslabs/amazon-eks-ami/pull/1264

#1264 Override hostname to match EC2's PrivateDnsName

Issue #, if available:

Fixes #1263 .

Description of changes:

Details available in #1263 .

This PR ensures that the name of the Node object matches the PrivateDnsName returned by ec2.DescribeInstances.

This ec2.DescribeInstances call was already being done by the in-tree cloud provider.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

Reproduce the issue by:

  1. Create a 1.26 cluster: eksctl create cluster --name 126 --version 1.26 --without-nodegroup
  2. Modify the created VPC’s DHCP options set to use a custom domain-name:
domain-name: foo
domain-name-servers: AmazonProvidedDNS
  3. Create a nodegroup: eksctl create nodegroup --cluster 126.
  4. Nodegroup creation will fail.

Test the fix on the latest AMI release by:

  1. config.yaml:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: "126"
  region: us-west-2
  version: "1.26"
managedNodeGroups:
  - name: nodes
    ami: ami-022441ec63297a0c9
    amiFamily: AmazonLinux2
    minSize: 1
    maxSize: 1
    desiredCapacity: 1
    overrideBootstrapCommand: |
      #!/bin/bash
      INSTANCE_ID=$(imds /latest/meta-data/instance-id)
      PRIVATE_DNS_NAME=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --query 'Reservations[].Instances[].PrivateDnsName' --output text)
      /etc/eks/bootstrap.sh 126 --kubelet-extra-args "--node-labels=eks.amazonaws.com/nodegroup=nodes,eks.amazonaws.com/nodegroup-image=ami-022441ec63297a0c9 --hostname-override=$PRIVATE_DNS_NAME"
  2. eksctl create cluster --config-file config.yaml
  3. Nodes join the cluster as expected.

2023-04-19

Austin Blythe avatar
Austin Blythe

I’m looking at using the account component from https://github.com/cloudposse/terraform-aws-components/blob/master/modules/account/README.md. Can anyone confirm if this supports organizational units more than one level deep in AWS Organizations?

Austin Blythe avatar
Austin Blythe

Thanks, @Erik Osterman (Cloud Posse). Sounds like I’m on the right track. Do you all ever use nested OUs? Unless I’m missing something, that account factory only supports one level deep. I’m curious whether that is an intentional best practice.
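For context, here is a hypothetical sketch of how a flat OU layout is usually expressed in that component’s catalog config (the variable names and schema are assumptions, so verify against the component’s variables.tf); each OU holds accounts directly, with no OU nested under another OU:

components:
  terraform:
    account:
      vars:
        organization_config:
          organizational_units:
            - name: core                  # hypothetical OU
              accounts:
                - name: core-audit
                - name: core-security
            - name: platform              # hypothetical OU, a sibling of core rather than a child
              accounts:
                - name: platform-dev
                - name: platform-prod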
