#aws (2023-05)
Discussion related to Amazon Web Services (AWS)
Archive: https://archive.sweetops.com/aws/
2023-05-01
Hi guys, I just learned about CloudWatch. I found that it uses metrics like CPU usage, disk usage, etc., but I also noticed that EC2 displays many metrics in its Monitoring section.
Question: Do we use CloudWatch with EC2 even though EC2 already provides useful analytics/monitoring? If yes, please share the use cases.
the ec2 metrics are really cloudwatch metrics under the covers
Very true. Beyond what the EC2 console shows, CloudWatch makes more sense when you want to build your own dashboards across all your instances, accessible from one place, or use extra features such as CloudWatch alarms/events for actionable purposes.
2023-05-02
Hello all. In Aurora Serverless, I see my CPUCreditBalance dropped to 0 after a recovery triggered by AWS. Does credit accounting work the same way as for EC2 T instances? https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/unlimited-mode-examples.html
The following examples explain credit use for instances that are configured as unlimited .
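For intuition, the credit model in the linked EC2 doc works like this: credits accrue at a fixed per-hour rate up to a cap, and are spent at 1 credit per vCPU-minute of full utilization; sustained spend above the earn rate pins the balance at 0. A rough simulation sketch (the numbers are illustrative EC2 rates; whether Aurora Serverless accounts identically is exactly the open question here):

```python
def credit_balance(earn_per_hr, max_credits, spend_per_hr, hours, start=0.0):
    """Simulate a T-instance-style CPU credit balance, hour by hour.

    earn_per_hr:  credits earned per hour (e.g. 12 for a t3.micro)
    max_credits:  balance cap (typically 24 hours of accrual)
    spend_per_hr: credits spent per hour (1 credit = 1 vCPU-minute at 100%)
    """
    balance = start
    for _ in range(hours):
        balance = min(balance + earn_per_hr, max_credits)  # accrue, capped
        balance = max(balance - spend_per_hr, 0.0)         # spend, floored at 0
    return balance

# An idle t3.micro-like instance fills its bucket in 24h...
print(credit_balance(12, 288, 0, 24))   # 288.0
# ...while sustained usage above the earn rate pins the balance at 0.
print(credit_balance(12, 288, 20, 24))  # 0.0
```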
Hi! I'm using this module https://registry.terraform.io/modules/cloudposse/ecs-alb-service-task/aws/0.68.0 (currently on version 0.66.2) and trying to update to the latest version, because every time I change an environment variable I have to delete the service and recreate it; it doesn't use the latest task definition (generated by CodePipeline) to create the new one. I tried using redeploy_on_apply but couldn't find any configuration that correctly picks up the latest. My configuration looks like this:
module "ecs_alb_service_task" {
  source  = "cloudposse/ecs-alb-service-task/aws"
  version = "0.66.2"

  namespace  = var.cluster_name
  stage      = module.global_settings.environment
  name       = local.project_name
  attributes = []

  container_definition_json = module.container_definition.sensitive_json_map_encoded_list

  # Load Balancer
  alb_security_group = var.security_group_id
  ecs_load_balancers = local.ecs_load_balancer_internal_config

  # Capacity Provider Strategy
  capacity_provider_strategies = var.capacity_provider_strategies

  desired_count                = 1
  ignore_changes_desired_count = true
  launch_type                  = module.global_settings.default_ecs_launch_type

  # VPC
  vpc_id           = var.vpc_id
  subnet_ids       = var.subnet_ids
  assign_public_ip = module.global_settings.default_assign_public_ip
  network_mode     = "awsvpc"

  ecs_cluster_arn    = var.cluster_arn
  security_group_ids = [var.security_group_id]

  ignore_changes_task_definition = true
  force_new_deployment           = true

  health_check_grace_period_seconds  = 200
  deployment_minimum_healthy_percent = module.global_settings.default_deployment_minimum_healthy_percent
  deployment_maximum_percent         = module.global_settings.default_deployment_maximum_percent
  deployment_controller_type         = module.global_settings.default_deployment_controller_type

  task_memory = local.task_memory
  task_cpu    = local.task_cpu

  ordered_placement_strategy = local.ordered_placement_strategy

  label_order    = local.label_order
  labels_as_tags = local.labels_as_tags
  propagate_tags = local.propagate_tags
  tags           = merge(var.tags, local.tags)

  # ECS Service
  task_exec_role_arn = [module.task_excecution_role.task_excecution_role_arn]
  task_role_arn      = [module.task_excecution_role.task_excecution_role_arn]

  depends_on = [
    module.alb_ingress
  ]
}
any suggestions?
You are basically saying ignore_changes_task_definition = true,
meaning: don't respect future updates. It should be false.
When I set that option to false, it tries to delete the service and recreate it with an older version of the task definition.
Then your problem is not the task definition, since it's supposed to use the latest version. It's somewhere else.
I don't see any usage of redeploy_on_apply, which is what fulfills that purpose.
I'm currently modifying my original module (on version 0.66.2) as I showed before. I guessed that redeploy_on_apply in the latest version of this module would pick up the latest version of the task definition and update it in the state file. I'm not sure what's wrong.
One thing is different from the other: redeploy_on_apply does not update the module version itself, it deploys a new task revision if one is detected in the cluster. Two different things.
No no, I know. But even if I upgrade the version and enable that option, the task definition is not the latest. The picture I sent before is from version 0.66.2 with ignore_changes_task_definition = false. The next pictures are from v0.68 with redeploy_on_apply = true and ignore_changes_task_definition = false.
I tried different combinations of ignore_changes_task_definition, force_new_deployment, and redeploy_on_apply, and none of them work.
If you change ‘ignore_changes_task_definition’ from true to false you should expect the service to be destroyed the first time due to the way the service is coded in the module. A second run should not require a destroy.
The reason it picks up an older version of your task definition is that that's all Terraform knows about. You have updated the task definition outside of Terraform, in CodePipeline.
If you are going to manage revisions in CodePipeline, you could pass the correct task definition family and revision in via the variable var.task_definition.
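To sketch that suggestion (hedged: the task_definition input and the data-source lookup are assumptions about your setup; check that the module version you're on actually exposes it):

```hcl
# Look up the latest revision of the family that CodePipeline updates.
data "aws_ecs_task_definition" "latest" {
  task_definition = local.project_name # assumed to be the task family name
}

module "ecs_alb_service_task" {
  # ...existing inputs as above...
  ignore_changes_task_definition = false

  # Hand the module the revision CodePipeline produced, so Terraform
  # converges on it instead of on its own stale revision.
  task_definition = "${data.aws_ecs_task_definition.latest.family}:${data.aws_ecs_task_definition.latest.revision}"
}
```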
2023-05-03
I have a customer with a huge Oracle database: 120 TB. The limit on RDS is 64 TB. Any suggestions?
Sharding or self hosting on ec2
Also talk to an AWS rep. You might get some free credits for moving such a large dataset. They would also give you some advice on your current issue
AWS will give your customer free credits and help cover the costs of converting off Oracle… talk to the TAM.
Otherwise ec2 self host or sharding
Is Aurora Serverless v1 HA-compatible?
Hi there, I wrote a blog post that y’all may be interested in. It discusses how to manage cross-account AWS IAM permissions for different teams with an open-source Python tool called IAMbic. Would love feedback!
https://www.noq.dev/blog/aws-permission-bouncers-letting-loose-in-dev-keeping-it-tight-in-prod
Ever had a slight configuration change take down production services? Wish you could give teams more AWS permissions in dev/test accounts, but less in production? Right sizing IAM policies for each team and account can be a tedious task, especially as your environment grows. In this post, we’ll explore how IAMbic brings order to multi-account AWS IAM chaos.
2023-05-04
For AWS Identity center, is there a way to see which accounts a group has access to via the cli? There’s no way in the console afaict.
Not as far as I know It’s such a missing feature
OK. Was just making sure I didn’t just miss it somehow.
2023-05-05
Does anyone know of any tools that will scan a set of AWS accounts for best practices? Any that are recommended? My company has a list of 40+ best practices that we’ve identified and I’m looking for solutions to quickly check these best practices against a set of accounts or AWS organization.
I haven’t used it myself yet, but I think https://github.com/cloud-custodian/cloud-custodian sounds like what you’re looking for.
Rules engine for cloud security, cost optimization, and governance, DSL in yaml for policies to query, filter, and take actions on resources
this is a nice project maintaining a list of various tools… https://github.com/toniblyx/my-arsenal-of-aws-security-tools
List of open source tools for AWS security: defensive, offensive, auditing, DFIR, etc.
if i were to start with just one tool for checking against “best practices”, it would probably be prowler https://github.com/prowler-cloud/prowler
Prowler is an Open Source Security tool for AWS, Azure and GCP to perform Cloud Security best practices assessments, audits, incident response, compliance, continuous monitoring, hardening and forensics readiness. It contains hundreds of controls covering CIS, PCI-DSS, ISO27001, GDPR, HIPAA, FFIEC, SOC2, AWS FTR, ENS and custom security frameworks.
ElectricEye is another great one… https://github.com/jonrau1/ElectricEye
ElectricEye is a multi-cloud, multi-SaaS Python CLI tool for Cloud Asset Management (CAM), Cloud Security Posture Management (CSPM), SaaS Security Posture Management (SSPM), and External Attack Surface Management (EASM) supporting 100s of services and evaluations to harden your public cloud & SaaS environments.
yeah custodian is a good one
the others are also interesting projects, thanks
Good stuff – Thank you folks.
If you are looking for a SaaS solution, I would go with Aqua Security… they bought a company called CloudSploit a few years ago, and they have a good level of reporting/remediation steps for issues that are detected.
2023-05-08
Just an FYI - if you plan to upgrade to the latest EBS CSI addon for EKS (v1.18.0-eksbuild.1) you may want to wait: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1591
We use kube-prometheus-stack and have a CPUThrottling alarm going off from our upgrade.
/kind bug
What happened?
Upgraded EBS Addon in EKS and CPU usage of the node daemonsets spiked
What you expected to happen?
literally no change to happen
How to reproduce it (as minimally and precisely as possible)?
eksctl update addon --name aws-ebs-csi-driver --version latest \
--cluster ${CLUSTER_NAME} \
--service-account-role-arn arn:aws:iam::${IHSM_ARN}:role/AmazonEKS_EBS_CSI_DriverRole_${CLUSTER_NAME} \
--force
To roll back to a non spiking version:
eksctl update addon --name aws-ebs-csi-driver --version v1.17.0-eksbuild.1 \
--cluster ${CLUSTER_NAME} \
--service-account-role-arn arn:aws:iam::${IHSM_ARN}:role/AmazonEKS_EBS_CSI_DriverRole_${CLUSTER_NAME} \
--force
2023-05-10
We have clusters using spot instances, and we use cluster autoscaler. Sometimes we see 504s. I found a few issues on the autoscaler GitHub page. How can I avoid 504s when the autoscaler scales down instances?
Try karpenter? It drains first before removing nodes.
Yea, might help with Karpenter, but unless the services are deliberately removed from the ALB, you’ll still get 504s.
I am not sure if karpenter does that.
We solved it with annotations. Once a node is cordoned, it is removed from the LB; and if the cordon reason is rebalancing, then after being deleted from the cluster it waits another 90 seconds before termination. After this we don't see 504 errors in the LB logs.
@Balazs Varga just to clarify, you cordoned the nodes with an annotation similar to this? or some other annotation?
kubectl annotate node <node-name> eks.amazonaws.com/cordon=true
The AWS node termination handler cordoned the node automatically. We enabled the rebalance watch option in the past.
I am with my laptop, but later I will check the exact annotation
That would be great, as I think I am not aware of this option.
So on the svc we use the following; with this we can close open connections before termination.
and on node termination handler we use the following options:
• enableRebalanceDraining: true
• enableSpotInterruptionDraining: true
• enableScheduledEventDraining: true
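For reference, those flags correspond to values on the aws-node-termination-handler Helm chart; a minimal values sketch (chart defaults assumed for everything else):

```yaml
# values.yaml for the aws-node-termination-handler chart (sketch)
enableRebalanceDraining: true        # cordon + drain on EC2 rebalance recommendations
enableSpotInterruptionDraining: true # drain on the 2-minute spot interruption notice
enableScheduledEventDraining: true   # drain ahead of scheduled maintenance events
```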
Aha. Though please note the node termination handler is redundant with functionality in Karpenter now, and is no longer recommended to be deployed alongside it. That said, I don't know if those features are available in Karpenter. Thanks for sharing!
https://aws.github.io/aws-eks-best-practices/karpenter/#enable-interruption-handling-when-using-spot
Review Karpenter Frequently Asked Questions
@Dan Miller (Cloud Posse) heads up
I think we cannot use karpenter because of limitations with kops. we create our clusters using kops
Aha!
Yes, not compatible
Probably
maybe later .
@Jonathan Eunice
2023-05-11
Hi all! Do MemoryDB for Redis patch updates cause any downtime?
minimal but yes
2023-05-12
Hey everyone. Curious to know if anyone is using Turbonomic as your cloud financial/cost management tool, and how your experience compares to CloudHealth or Cloudability?
Turbonomic says it has automated execution actions for rightsizing instances. But how does it manage (or sync) the state files if the instances are managed through Terraform?
I would be extremely skeptical of any tool that does right sizing the “right way” in an IAC environment
There’s no one way to do it, and companies have all kinds of strategies.
Hi all
Is there a way to add files (2 files) to ECS Fargate containers? (We are using GitHub as the source; we are not able to add them to GitHub due to security reasons.)
I’d recommend the service to pull what it needs from either Secrets Manager or S3 depending on the size
How do we pull them via the task definition? Should we update the service via the task definition?
you mean, you want to add two files to a container without making a change to the image you’re pulling?
yup
might be able to do that with volume mounts then
Those 2 files are auth files (one is a private key, one a public key).
Just for adding 2 files, I'm not sure EFS mount points are required.
you’re fairly constrained as to what options you have if you’re unwilling to update the image
If I upload these 2 files to S3, how do I tell the task definition that these 2 files should end up on the xx/xxx path?
you would update the image that you’re using to have an entrypoint script that pulls the s3 files and then starts the application as it has before
not just update the task def
that works if you can use an environment variable rather than a file
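A sketch of that entrypoint approach (the bucket name and paths are placeholders, not from the thread; assumes the aws CLI is in the image and the task role allows s3:GetObject on the bucket):

```
# Dockerfile additions (sketch)
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["/app/server"]
```

```
#!/bin/sh
# entrypoint.sh -- fetch the two auth files, then start the app unchanged.
set -e
aws s3 cp s3://my-config-bucket/auth/private.pem /app/auth/private.pem
aws s3 cp s3://my-config-bucket/auth/public.pem  /app/auth/public.pem
exec "$@"
```

The task definition itself doesn't change beyond pointing at the new image; the file paths live in the script.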
Does anyone have strong opinions on how to do AWS Lambda while also managing the infrastructure via Terraform? There are a bunch of options out there, but I’ve never personally seen an implementation that I liked. My team and I are working on how to do this better and are evaluating Serverless framework (CloudFormation ), AWS SAM (has TF support, but doesn’t look great), and classic “build our own”.
Would love to hear someone who has implemented a solution that doesn’t feel disjointed and has strong opinions from real experience!
What are the pain points you’ve had? I’ve enjoyed using Anton’s terraform-aws-lambda module
The reason behind it was the easy integration and deployment with GitHub Actions, and easier-to-understand template options.
This was only done for VERY basic Lambda deploys, which connect to infra created by TF (SQS, SNS, RDS, etc.). If someone was, for example, going to deploy a Lambda that required a ton of infra, then that was a good candidate to move to TF-only deploys, hybrid with GitHub Actions.
that is what we ended with and it works so far
Keep in mind that EVERYTHING else was created with TF (VPC, subnets, transit gateways, RDS, SQS, SNS, etc.).
Yeah, I go either for vanilla SAM or for Anton’s Serverless.tf modules (with or without SAM).
Vanilla CloudFormation is a bit too raw, but works if you’re a large company that wants to build something custom. Serverless Framework I would avoid because of their chaotic history and dubious ecosystem.
Everybody else is too new, too risky, or too limited.
Neither option is perfect, so expect some mild annoyances!
Usually:
• a few Lambdas => serverless.tf
• a bunch of Lambdas and complex serverless architectures => SAM
Personally, I prefer building out my stack in CDK, which bootstraps CloudFormation.
+1 to SAM. Stick with the first party tooling here
last team i was on that was using IaC and Lambda did a mix of Terraform and SAM.
That is, all the underlying plumbing was deployed using TF.
The devs would use the TF resources as inputs to their SAM deployments.
it worked pretty well.
AWESOME stuff – thanks folks! Really appreciating this community right now as I’ve talked or worked with a bunch of you already, so I trust your opinions. We’ll likely try out the TF + SAM route
I’ve used Terraform+Github Actions+ECR, I’ll be the first to say that the local dev experience isn’t great (aka slow), but it’s a simple setup where a dev pushes to their feature branch, an artifact is built by GHA and pushed to ECR, then a simple aws cli call to update the function code to the dev’s container tag. I’m looking to explore SAM+TF. when I first checked it out, I believe SAM could only use resources created by SAM and the available resource list was a bit small
2023-05-13
2023-05-14
2023-05-15
Hi All, I was wondering if anyone has experience with setting up Control Tower on an existing AWS account that's part of AWS Organizations. I want to separate our current environments into their own accounts and implement Terraform moving forward. I want to make sure I don't affect our current environment during this process. Any advice would be awesome. Thanks!
Wouldn’t recommend using Control Tower, CT is like a black box, hard for troubleshooting
Maybe it is just myself
@Hao Wang what solution do you recommend for multi account organizations? I am using control tower to provision new account under AWS organization, it really is not convenient but I don’t know other solutions.
Thanks
@aj_baller23 if your envs are under the root account, enrolling into control tower will not affect it. Then you can create sub accounts to host separate envs.
Cloud Posse doesn't use Control Tower either; they provision new accounts with their AWS components.
2023-05-16
The Control Tower question got me thinking about another thing I’ve been wondering about for some time:
For SMBs, how do you manage the root account credentials for a multi-account organization?
That is, given a single AWS account that will be used to spawn off sub accounts, how do you govern access to the root email address and the 2FA keys associated with the account?
I’m specifically looking at this from the perspective of a small business or sole proprietorship that needs to keep things secure but also ensure business continuity.
this becomes even trickier for remote teams. but here are the two practices that I’ve seen:
most secure practice - hardware tokens issued to the people who need access and delivered via signature confirmation snail mail. core member provisions the hardware token and dispatches them, token is removed from the root account if they leave the org.
a more manageable (IMO) practice - software 2FA (e.g. Vault or 1Password) in a shared vault, password is rotated if a core member leaves the org
This was also an interesting read….
https://docs.aws.amazon.com/organizations/latest/userguide/orgs_best-practices_mgmt-acct.html
Describes best practices related to the management account in AWS Organizations.
Email group specifically for the root user. Password in 1Password or similar. I set up yubikeys and mail them; only share the password on confirmation of receipt. (What Darren said.)
When someone leaves, deactivation is as simple as removing the yubikey MFA from AWS (now that they support multiple!). Avoids the need to rotate passwords. Keeps MFA as a piece of hardware which I like:
• I don’t have to think about some case where your password vault somehow gets compromised. (Hard, but not impossible.)
• Having to go find your physical root MFA key should hopefully make folks think a lot harder about what they need it for, vs. 1P or similar mindlessly autofilling a TOTP.
the multiple hardware keys is essential to this. that was the biggest blocker and why we used the 1P option for so long.
Second everyone else. We recommend a google group. We used 1P and shared TOTP, but now multiple hardware keys are available. And with 1P we use geofencing.
Also, we recommend resetting the master credentials on all member accounts and enabling MFA on all root credentials for member accounts.
can you share more information about the 1P geofencing? I wasn’t aware of that being an option and would like to explore that a bit more
Learn how to use Advanced Protection to restrict where your team can access 1Password.
thanks
2023-05-17
Anybody know how to purge-with-fire unattended-upgrades in cloud-init? Because that satan of a package is blocking cloud-init itself from installing packages. I tried throwing a systemctl stop/mask into a cloud-boothook section, but that just threw errors.
unattended-upgrades will be set up beforehand, and cloud-init should just run once? Not sure if I understand.
So: deploying a new Ubuntu box, specifying packages to install with cloud-init. It gets started, but then Ubuntu's unattended-upgrades script kicks off, which blocks cloud-init because it grabs the apt lock.
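One approach that has worked for me (hedged; I haven't verified it on every Ubuntu release): use bootcmd in cloud-config rather than a cloud-boothook. bootcmd runs before cloud-init's apt/package modules, so nothing has grabbed the dpkg lock yet, and you can neutralize both the service and the apt timers:

```
#cloud-config
bootcmd:
  # runs early on every boot, before the package modules touch apt
  - [systemctl, stop, unattended-upgrades.service]
  - [systemctl, mask, unattended-upgrades.service]
  - [systemctl, stop, apt-daily.timer, apt-daily-upgrade.timer]
  - [systemctl, mask, apt-daily.timer, apt-daily-upgrade.timer]
packages:
  - your-packages-here   # placeholder
```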
using cookup.ai to generate IAM policies….
Input: give permission to read from the ecr repo named container123456, and permission to start and stop ec2 instances in the us-east-1 region.
Result: Here is the AWS IAM Policy that corresponds to the plaintext description you provided:
{
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:GetRepositoryPolicy",
"ecr:DescribeRepositories",
"ecr:ListImages",
"ecr:DescribeImages",
"ecr:BatchGetImage"
],
"Resource":[
"arn:aws:ecr:us-east-1:123456789012:repository/container123456"
]
},
{
"Effect":"Allow",
"Action":[
"ec2:StartInstances",
"ec2:StopInstances"
],
"Resource":"arn:aws:ec2:us-east-1:123456789012:instance/*"
}
]
}
good response but I had to format the code myself…
I’m using https://flosell.github.io/iam-policy-json-to-terraform/ to format the policy to tf hcl
This tool converts standard IAM policies in JSON format (like what you’d find in the AWS docs) into more terraform native aws_iam_policy_document data source code
I prefer keeping my IAM JSON as JSON and using jsonencode in TF. Makes it easier to bring existing policies in… or to update my policies in TF when people tweak them outside of IaC.
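For illustration, the jsonencode route on the generated policy above might look like this (resource and policy names are made up):

```hcl
resource "aws_iam_policy" "generated" {
  name = "ecr-read-ec2-start-stop"

  # jsonencode keeps the policy structurally close to its raw JSON form,
  # so out-of-band console edits are easy to diff and paste back in.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecr:ListImages", "ecr:DescribeImages", "ecr:BatchGetImage"]
        Resource = "arn:aws:ecr:us-east-1:123456789012:repository/container123456"
      }
    ]
  })
}
```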
But I guess using these two tools, you can describe a policy in plain text and end up with native TF.
sounds like the process needs a wrapper…
make sense
if I was being strict, GetRepositoryPolicy is probably not required
Actually, there are some other errors: GetAuthorizationToken requires permissions against the * resource, right?
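Right — ecr:GetAuthorizationToken does not support resource-level permissions, so it needs to be split out of the repository-scoped statement into its own:

```json
{
  "Effect": "Allow",
  "Action": "ecr:GetAuthorizationToken",
  "Resource": "*"
}
```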
Hi All, I wanted to start a discussion about our Kubernetes platform and how we handle services running on 100% spot instances. Specifically: a service runs on Kubernetes with two pods in production, each on a different spot node. If one node gets a spot interruption and its pod is rescheduled to another node, and that second node is interrupted at the same time so its pod is also rescheduled and still initializing, we hit an outage because both pods end up in an "initializing" state. Is anyone aware of how to handle this while running 100% on spot?
you can’t completely mitigate this risk if you run on 100% spot.
If the risk is unacceptable, you can:
- Run more pods on more instance types across more AZs to minimise the risk, or
- run some portion of your workload on on-demand/reserved instances
Or you could try spot ocean which will help manage the node lifecycle. You can treat spot nodes as if they are normal ones.
the vmware product? It can’t prevent spot interruptions. Sometimes many instances will be interrupted at once
It is a node pool scheduler, backed by spot, ondemand and reserved instances, but for k8s, it exposes as a scaling group like resource.
It will mix different kinds of instances to prevent the worst situation.
2023-05-18
2023-05-19
2023-05-22
Hi all. I'm just starting to use the Cloud Posse module for EKS clusters. Really liking it so far. Currently I have a Bitbucket pipeline that uses OIDC to assume a role in AWS to run the Terraform. That role has the administrator policy. I've enabled the aws-auth ConfigMap and put that role inside the ConfigMap, attached to the "cluster-admin" group, which I assume has full powers to update anything cluster-wide. So my Terraform looks like this:
map_additional_iam_roles = [
  {
    rolearn  = "arn:aws:iam::***:role/workers"
    username = "system:node:{{EC2PrivateDNSName}}"
    groups   = ["system:bootstrappers", "system:nodes"]
  },
  {
    rolearn  = data.aws_iam_role.AdminRole.arn
    username = "admin-user"
    groups   = ["cluster-admin"]
  },
  {
    rolearn  = aws_iam_role.infrastructure-management.arn
    username = "pipeline"
    groups   = ["cluster-admin"]
  }
]
This works well when I use data.aws_iam_role.AdminRole.arn to log in from my command line; these are temporary creds generated through AWS SSO. However, when I use aws_iam_role.infrastructure-management.arn it fails:
Planning failed. Terraform encountered an error while generating this plan.
╷
│ Error: configmaps "aws-auth" is forbidden: User "pipeline" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
│
│ with module.workload_1.kubernetes_config_map.aws_auth[0],
│ on .terraform/modules/workload_1/auth.tf line 138, in resource "kubernetes_config_map" "aws_auth":
│ 138: resource "kubernetes_config_map" "aws_auth" {
│
╵
Ahh, I think I worked it out. I need to use "system:masters" as the group, because cluster-admin is a ClusterRoleBinding, not a group. The issue is I don't know how to work out what groups there actually are.
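For reference, the working entry then looks like the sketch below. (On the second point: Kubernetes groups aren't listable API objects; they only exist as subjects referenced by RoleBindings/ClusterRoleBindings, which is why finding "what groups exist" is awkward. system:masters is the built-in superuser group.)

```hcl
{
  rolearn  = aws_iam_role.infrastructure-management.arn
  username = "pipeline"
  groups   = ["system:masters"]
}
```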
@Dan Miller (Cloud Posse)
2023-05-23
Hi! I am having issues enrolling an AWS Account under control tower. I am receiving an error saying
AWS Control Tower cannot enroll the account. There's an error in the provisioned product in AWS Service Catalog: ProvisionedProduct with Name: null and Id: *********** doesn't exist
Is there any way to reconcile control tower and service catalog to create a new product when trying to enroll the account?
CT is hard to use; it's like a black box. My feeling is you may not have enough permissions, or the product doesn't exist.
this sounds like a great question for aws support
You’re in for a fun time
To be honest I don’t recommend it. If you’re not too deep already I’d back out.
The trouble you're experiencing now is not an isolated incident. This class of problems will be a continuous occurrence with CT.
@kallan.gerard Do you have any recommendations for an alternative to CT?
Hi @vicentemanzano6 I would probably just use terraform with the aws provider resources and the aws organization primitives
Then expand from there.
I typically use an admin/workload tf pattern for this sort of thing. I’m not sure what to call it.
One terraform config would be for the organization root account, and would contain things like the aws_organizations_account resources for your member accounts, and the initial seed config
Then each member account would have its own Terraform governance config directory, which you configure with a governance accountadmin role inside each account, and you import a common stack module (or modules) inside each config for any resources you want each account to have.
So like:
/admin
    # the config directory that runs in the org root account
    # provisions org member accounts, trust access for those accounts, whatever else you need, etc.
    main.tf
    …
/business-units
    …
    /sales
    /engineering
        /aws-account-analytics-dev
            main.tf   # calls the stack module
        /aws-account-analytics-dev
            main.tf
/modules
    /stack
        main.tf   # module you want in every account
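As a hedged sketch of what the org-root /admin config might contain (all names and the email are illustrative):

```hcl
resource "aws_organizations_account" "analytics_dev" {
  name  = "analytics-dev"
  email = "aws+analytics-dev@example.com"

  # Role that Terraform in the root account can assume into the new account
  role_name = "OrganizationAccountAccessRole"

  lifecycle {
    # Closing member accounts is slow and destructive; guard against it.
    prevent_destroy = true
  }
}
```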