#refarch (2023-04)
Cloud Posse Reference Architecture
2023-04-04
#elasticache-redis
│ Error: engine_version: Redis versions must match <major>.<minor> when using version 6 or higher, or <major>.<minor>.<patch>
vars:
  name: csm
  family: redis7.cluster.on
  cloudwatch_metric_alarms_enabled: false
  redis_clusters:
    redis-csm:
      engine_version: 7.0
      instance_type: cache.t3.small
What the heck are the right combos for engine_version and family? I can’t get any combo to work, from redis7/7.0 or 7.x or 7.0.7, including the families default.redis7 and redis7.cluster.on.
From the CLI:
{
  "Engine": "redis",
  "EngineVersion": "7.0",
  "CacheParameterGroupFamily": "redis7",
  "CacheEngineDescription": "Redis",
  "CacheEngineVersionDescription": "redis version 7.0.7"
}
“7.0” vs 7.0. #yamlfailure
Leaving this all here for anyone else who hits the same problem.
Fix: quote values that can be misinterpreted as numbers.
Glad you figured that out! Definitely one of those gotchas with YAML.
Yup, and I’m always removing quotes from strings. This one backfired.
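For reference, a minimal sketch of a working combination, assuming the parameter group family reported by the CLI output above (redis7); adjust if your component expects a cluster-mode parameter group. The key fix is quoting engine_version so YAML keeps the string "7.0" instead of parsing a float that renders back as 7 and trips the <major>.<minor> validation:
vars:
  name: csm
  family: redis7  # assumption: family taken from the describe output above for engine 7.0
  cloudwatch_metric_alarms_enabled: false
  redis_clusters:
    redis-csm:
      engine_version: "7.0"  # quoted so YAML treats it as a string, not a number
      instance_type: cache.t3.small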
2023-04-05
Is there an approach to subscribing to an sns-topic from an sqs-queue config? I know we can remote-state it, but the subscribers part of sns-topic doesn’t natively support that, so I’m curious about patterns.
2023-04-07
Has anyone run into this error when trying to deploy an EKS cluster with eks/cluster?
│ Error: Post "https:/xxxx.us-gov-west-1.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps": getting credentials: decoding stdout: no kind "ExecCredential" is registered for version "client.authentication.k8s.io/v1alpha1" in scheme "pkg/runtime/scheme.go:100"
Turns out I hadn’t updated my AWS CLI in a while.
Not using geodesic?
no, not on this one
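For anyone else hitting the ExecCredential error above: it means the Kubernetes client no longer registers the client.authentication.k8s.io/v1alpha1 ExecCredential kind, while an older aws eks get-token still emits it; updating the AWS CLI switches the output to v1beta1. A minimal sketch of the kubeconfig user stanza involved, with placeholder user and cluster names:
users:
  - name: eks-admin  # placeholder user name
    user:
      exec:
        # newer AWS CLI releases emit v1beta1; older ones emit v1alpha1, which triggers the error above
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws
        args:
          - eks
          - get-token
          - --cluster-name
          - my-cluster  # placeholder cluster name
          - --region
          - us-gov-west-1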
2023-04-17
Has anyone else tried deploying EKS with Kubernetes 1.26? Running into this issue with nodes not joining:
What happened:
1.26 AMI Nodes fail to join 1.26 clusters. In both scenarios of upgrading from 1.25 to 1.26 and new clusters starting fresh with 1.26
What you expected to happen:
The nodes to join the cluster
Anything else we need to know?:
• Using managed node groups.
• The exact same Terraform deployment configuration works on 1.25. The only thing changed is the version for cluster/AMI, which triggers the failure on both upgrades and new clusters.
• VPC DHCP domain name is in the format: ec2.internal acmedev.com
Environment:
• AWS Region: us-east-1
• Instance Type(s): m6a
• EKS Platform version: "eks.1"
• Kubernetes version: "1.26"
• AMI Version: amazon-eks-node-1.26-v20230406
• Kernel (e.g. uname -a): Linux ip-10-100-13-0.ec2.internalacmedev.com 5.10.173-154.642.amzn2.x86_64 #1 SMP Wed Mar 15 00:26:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
• Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-099e00fe4091e48af"
BUILD_TIME="Thu Apr 6 01:36:39 UTC 2023"
BUILD_KERNEL="5.10.173-154.642.amzn2.x86_64"
ARCH="x86_64"
I believe the change in cloud-provider from aws to external has created an issue where our hostname for the kubelet is different between 1.25 and 1.26. This causes the aws-iam-authenticator node bootstrap logic to fail to register with the cluster because the hostname in the requests is not the same.
hostnamed logs are the exact same on 1.25 and 1.26 nodes, including the “warning”
Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed pretty host name to 'ip-10-100-13-0.ec2.internal acmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed static host name to 'ip-10-100-13-0.ec2.internalacmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed host name to 'ip-10-100-13-0.ec2.internalacmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 cloud-init: Apr 13 13:55:48 cloud-init[2209]: util.py[WARNING]: Failed to non-persistently adjust the system hostname to ip-10-100-13-0.ec2.internal acmedev.com
We are not changing any of the kubelet arguments from their AMI defaults. The only thing we are doing is adding some labels/taints to the nodes via the managed node group terraform resources. No hostname overrides.
Apr 13 13:55:53 ip-10-100-13-0 kubelet: I0413 13:55:53.946396 2944 flags.go:64] FLAG: --cloud-provider="external"
Apr 13 13:55:53 ip-10-100-13-0 kubelet: I0413 13:55:53.946638 2944 flags.go:64] FLAG: --hostname-override=""
Pertinent messages that indicate node join failures.
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.192348 2944 kubelet_node_status.go:669] "Recording event message for node" node="ip-10-100-13-0.ec2.internalacmedev.com" event="NodeHasNoDiskPressure"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.192745 2944 kubelet_node_status.go:669] "Recording event message for node" node="ip-10-100-13-0.ec2.internalacmedev.com" event="NodeHasSufficientPID"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.193204 2944 kubelet_node_status.go:70] "Attempting to register node" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.765164 2944 controller.go:146] failed to ensure lease exists, will retry in 200ms, error: leases.coordination.k8s.io "ip-10-100-13-0.ec2.internalacmedev.com" is forbidden: User "system:node:ip-10-100-13-0.ec2.internal" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": can only access node lease with the same name as the requesting node
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.765885 2944 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-100-13-0.ec2.internalacmedev.com" is forbidden: User "system:node:ip-10-100-13-0.ec2.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.766850 2944 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes \"ip-10-100-13-0.ec2.internalacmedev.com\" is forbidden: node \"ip-10-100-13-0.ec2.internal\" is not allowed to modify node \"ip-10-100-13-0.ec2.internalacmedev.com\"" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.969984 2944 kubelet_node_status.go:70] "Attempting to register node" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.972246 2944 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes \"ip-10-100-13-0.ec2.internalacmedev.com\" is forbidden: node \"ip-10-100-13-0.ec2.internal\" is not allowed to modify node \"ip-10-100-13-0.ec2.internalacmedev.com\"" node="ip-10-100-13-0.ec2.internalacmedev.com"
On the 1.25 nodes using cloud-provider=aws we can see logs like:
Apr 12 15:14:31 ip-10-100-12-210 kubelet: I0412 15:14:31.176819 2906 server.go:993] "Cloud provider determined current node" nodeName="ip-10-100-12-210.ec2.internal"
https://github.com/kubernetes/kubernetes/blob/v1.26.2/cmd/kubelet/app/server.go#L989, which does not contain the acmedev.com appended to it.
The nodename returned in 1.25 aligns with the templated private DNS name returned from https://github.com/kubernetes-sigs/aws-iam-authenticator/tree/master that allows bootstrapping nodes. Since we are not using the aws cloud provider in 1.26, we might be getting back a different value for nodename which does not align.
Since the change to cloud-provider=external, I believe we are returning the hostname we would get from hostname or uname -n, e.g. ip-10-100-13-0.ec2.internalacmedev.com, which does not align with what is returned from the EC2 API when getting the private DNS name for auth. Our node config in the aws-auth ConfigMap is standard:
mapRoles: |
  - "groups":
    - "system:bootstrappers"
    - "system:nodes"
    "rolearn": "arn:aws:iam::1234567890:role/role-name"
    "username": "system:node:{{EC2PrivateDNSName}}"
@Jeremy G (Cloud Posse) could this be related to the other issue you were helping with? …relating to CNI addon
No, this is something else
@Michael Dizon Did you follow all the upgrade instructions/prerequisites at https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-1.26 ?
Deployed from scratch in a new environment. Should note that this is in GovCloud; not sure if that makes a difference.
It deploys fine with 1.25.
Here’s a bit of the log output:
"message": "csinodes.storage.k8s.io "ip-10-xxx.xxx.com" is forbidden: User "system:node:ip-10-xxx.us-gov-west-1.compute.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node",
"reason": "Forbidden",
"details": { "name": "ip-xxx.xxx.com", "group": "storage.k8s.io", "kind": "csinodes" },
"code": 403
• Deprecated beta APIs scheduled for removal in v1.26 are no longer served. See https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-26 for more information. (#111973, @liggitt)
• The in-tree cloud provider for OpenStack (and the cinder volume provider) has been removed. Please use the external cloud provider and csi driver from cloud-provider-openstack instead. (#67782, @dims)
wondering if this PR will fix it. https://github.com/awslabs/amazon-eks-ami/pull/1264
Issue #, if available: Fixes #1263.
Description of changes: Details available in #1263.
This PR ensures that the name of the Node object matches the PrivateDnsName returned by ec2.DescribeInstances. This ec2.DescribeInstances call was already being done by the in-tree cloud provider.
Testing Done
Reproduce the issue by:
- Create a 1.26 cluster: eksctl create cluster --name 126 --version 1.26 --without-nodegroup
- Modify the created VPC’s DHCP options set to use a custom domain-name:
  domain-name: foo
  domain-name-servers: AmazonProvidedDNS
- Create a nodegroup: eksctl create nodegroup --cluster 126
- Nodegroup creation will fail.
Test the fix on the latest AMI release with this config.yaml:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: "126"
  region: us-west-2
  version: "1.26"
managedNodeGroups:
  - name: nodes
    ami: ami-022441ec63297a0c9
    amiFamily: AmazonLinux2
    minSize: 1
    maxSize: 1
    desiredCapacity: 1
    overrideBootstrapCommand: |
      #!/bin/bash
      INSTANCE_ID=$(imds /latest/meta-data/instance-id)
      PRIVATE_DNS_NAME=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --query 'Reservations[].Instances[].PrivateDnsName' --output text)
      /etc/eks/bootstrap.sh 126 --kubelet-extra-args "--node-labels=eks.amazonaws.com/nodegroup=nodes,eks.amazonaws.com/nodegroup-image=ami-022441ec63297a0c9 --hostname-override=$PRIVATE_DNS_NAME"
Then: eksctl create cluster --config-file config.yaml
- Nodes join the cluster as expected.
2023-04-18
2023-04-19
I’m looking at using the account component from https://github.com/cloudposse/terraform-aws-components/blob/master/modules/account/README.md. Can anyone confirm if this supports organizational units more than one level deep in AWS Organizations?
Answered here: https://sweetops.slack.com/archives/C031919U8A0/p1681937660900889?thread_ts=1681932106.563619&cid=C031919U8A0
Thanks, @Erik Osterman (Cloud Posse). Sounds like I’m on the right track. Do you all ever use nested OUs? Unless I’m missing something, that account factory only supports one level deep. I’m curious if that is an intentional best practice.