#refarch (2024-10)
Cloud Posse Reference Architecture
2024-10-03
Do you recommend using Helmfile via ArgoCD for app deployments to EKS? Are there advantages to this approach rather than using Kustomize? (Is it that it’s easier to spin up preview environments via Helmfile?)
@Dan Miller (Cloud Posse) @Yonatan Koren
We use ArgoCD or helm deploy directly, both with helmfile. Although @Igor Rodionov would have to weigh in on the advantages of that decision
I would pick what makes the most sense to your engineering team. Kustomize is nice, but we’ve been using helmfile for so long that we tend to stick with it.
I think we should revisit this next year
Helmfile seems like it’s worth trying out to compare against. I might have a use case for some environments where Argo is not available (and doesn’t require frequent CD) and using Atmos to deploy to those environments might be a nice alternative.
I like the idea of a monochart that you guys developed, but it feels like it’s not being maintained for things like Istio https://github.com/cloudposse/charts/tree/master/incubator/monochart. Is there an open-source alternative that you’re aware of? Or is there a desire to maintain this one?
I’ve changed tack on monochart. We still stand by the pattern, but instead of us managing it, we instruct customers how to create their own monochart for their organization
When we managed it, what ended up happening is we were reinventing the k8s spec after adding support for everything. Because in the end, different companies use different parts. That led to an unwieldy chart to manage, and was antithetical to what we were trying to achieve.
The idea is that each org should define an “interface” for their apps on k8s. That interface is defined via helm charts that can be reused by teams in the org to deploy services in an idiomatic way. Don’t expose every feature of k8s; instead, expose only the ones you will need.
So in your case, with istio, we have a customer doing exactly this. They have a custom “monochart” (e.g. “acme-service”), which implements all the custom resources for istio the way they need it. Then any developer can deploy their service using that chart. Just bring-your-own-dockerfile.
Thanks Erik. I’ll give this a try and see how it stacks up against plain kustomize
since kustomize can deploy helm charts, I think this might be the best of both?
(that’s coming from the POV though of a helm advocate, and I’m sure many users of kustomize would disagree)
2024-10-04
2024-10-10
I am currently going through the quick start and I have just deployed the accounts successfully via `atmos workflow deploy/accounts -f accounts`, and I have run `atmos terraform apply account-map -s core-gbl-root` to build the account maps.
But when I run `atmos workflow deploy/account-settings -f accounts` I get the following error:
`Planning failed. Terraform encountered an error while generating this plan.
╷
│ Error: Invalid index
│
│ on ../account-map/modules/iam-roles/main.tf line 46, in locals:
│ 46: is_root_user = local.current_identity_account == local.account_map.full_account_map[local.root_account_name]
│ ├────────────────
│ │ local.account_map.full_account_map is object with 12 attributes
│ │ local.root_account_name is “core-root”
│
│ The given key does not identify an element in this collection value.`
Checking the account map I see the line: root_account_account_name = "core-root"
so it doesn’t look like an object with attributes. What are some suggested steps to troubleshoot this?
2024-10-11
@Andriy Knysh (Cloud Posse)
I have used the reference architecture to set up a transit gateway to connect 2 VPCs in different accounts. However, I am having trouble understanding how the transit gateway offers a connection to the internet. Is that not a part of the reference architecture, and is it something we should set up separately on our own? If so, how can this be achieved?
There is a line in the reference architecture VPC section that states this
# Use PrivateLink in private-only VPCs at least until we have
# a connection to the internet via Transit Gateway.
So ideally, transit gateway setup should provide connection to the internet right?
there are many different considerations here (e.g. VPC private vs public subnets, etc.). I’m in meetings now, will try to explain later (ping me in a few hours)
@Andriy Knysh (Cloud Posse) Ok, sure
@Andriy Knysh (Cloud Posse) Pinging you again as a reminder
@Shirisha Sudhakar Rao it all depends on your network architecture and the security and monitoring requirements. You can have many different variations of network architecture; I’ll give you a few examples here:
• If the VPCs in the two accounts have public subnets, they already have egress to the Internet via the IGW
• If the VPCs only have private subnets, then you need to provide Internet access using any of the following:
– You can connect a Site-to-Site VPN to the TGW on one side, and then add VPC attachments for the VPCs on the other side of the TGW. This way, the VPN connection will provide access to the Internet. In the VPC route tables, you will have a route 0.0.0.0/0 pointing to the TGW. In the TGW route table, you will have a route 0.0.0.0/0 pointing to the VPN connection
– You can have a separate VPC (in the same or a separate account; we’ll call it `network`). The `network` VPC has public subnets with an Internet Gateway (providing a connection to the Internet). The VPCs in both accounts can have only private subnets. They will be connected to the Internet via the TGW and the `network` VPC. In this case, you will have a subnet route in each VPC 0.0.0.0/0 pointing to the TGW. In the TGW route table, you will have a route 0.0.0.0/0 pointing to the VPC attachment of the `network` VPC. In the `network` VPC, you will have a route 0.0.0.0/0 pointing to the IGW (Internet Gateway)
I suppose your question was about the last part - a `network` VPC with Internet access via IGW, and the other VPCs connected to the TGW and then to the `network` VPC (a rough Terraform sketch of those routes follows below)
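To make that last option concrete, here is a minimal Terraform sketch of the routes described above. All IDs here are hypothetical placeholders (in a real setup they come from the vpc/tgw component outputs, not hard-coded strings), and in practice traffic from the private VPCs typically also passes through a NAT Gateway in the `network` VPC before reaching the IGW, with return routes for the spoke VPC CIDRs pointing back to the TGW.

# 1. Spoke (private) VPC: send all non-local traffic to the Transit Gateway
resource "aws_route" "spoke_to_tgw" {
  route_table_id         = "rtb-spoke-private"         # placeholder
  destination_cidr_block = "0.0.0.0/0"
  transit_gateway_id     = "tgw-0123456789abcdef0"     # placeholder
}

# 2. TGW route table: send all traffic to the network VPC attachment
resource "aws_ec2_transit_gateway_route" "default_to_network_vpc" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_route_table_id = "tgw-rtb-0123456789abcdef0"  # placeholder
  transit_gateway_attachment_id  = "tgw-attach-network-vpc"     # placeholder
}

# 3. network VPC, private subnets (where the TGW attachment lands):
#    send traffic to a NAT Gateway so private source IPs can be translated
resource "aws_route" "network_private_to_nat" {
  route_table_id         = "rtb-network-private"       # placeholder
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "nat-0123456789abcdef0"     # placeholder
}

# 4. network VPC, public subnets: send traffic to the Internet Gateway
resource "aws_route" "network_public_to_igw" {
  route_table_id         = "rtb-network-public"        # placeholder
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = "igw-0123456789abcdef0"     # placeholder
}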
regarding this
# Use PrivateLink in private-only VPCs at least until we have
# a connection to the internet via Transit Gateway.
if both VPCs are private (no public subnets), then you can use a PrivateLink to another VPC (e.g. `network` in our example), which has a connection to the Internet via IGW
in all of these cases, if both VPCs are private, then you have to provide a connection to something that has Internet access
that something can be another VPC (`network`), and you connect the 2 VPCs to the `network` VPC using a PrivateLink, or a Transit Gateway
the TGW itself neither offers nor provides connections to the Internet, it’s just a proxy. You need to connect it to a VPC which has a connection to the Internet, or to an on-site VPN (this is more complicated since it involves using on-prem equipment like Palo Alto or Cisco)
there are many different variations of the network architecture. Let me know if the above description answers your question, or you need more help
the Cloud Posse `tgw` components https://github.com/cloudposse/terraform-aws-components/tree/main/modules/tgw describe the architecture where you have private VPCs in multiple accounts and also a `network` VPC (e.g. in a `network` account) which has public subnets and a connection to the Internet via IGW
it’s impossible to configure all possible variations of network architectures with one terraform module. If your use case differs from the above, you will have to make adjustments in your own code
@Andriy Knysh (Cloud Posse) Thank you for the information. I will try this out today.
2024-10-15
I am using datadog-lambda forwarders in the ue1 and uw2 regions, and since I deployed the Datadog API keys and configuration in our global stack (defaulting to uw2) and the key to auto/uw2, I have been experiencing drift and SSM access errors.
For example, I have a Datadog Lambda forwarder for vpc-flow-logs in ue1, and while the Datadog client is trying to access the SSM parameter that it thinks exists in ue1, the policy and the actual key are in uw2, and I don’t see any way to change that aside from modifying the TF itself.
I was wondering what the motivation for moving to the global stack is? And how should I migrate from the previous regional paradigm?
2024-10-21
Hi folks,
The company I work with is using the Quick start.
I’m just starting out with the very early steps for setup, and when I run the `atmos workflow init/tfstate -f baseline` command, I am getting errors.
First, I get an eks error that you can see here when I run a validate:
√ : [superadmin] (HOST) spryops-infrastructure ⨠ atmos validate stacks
no matches found for the import 'catalog/eks/clusters/default' in the file 'catalog/eks/clusters/auto.yaml'
Error: failed to find a match for the import '/localhost/code/spryops-infrastructure/stacks/catalog/eks/clusters/default.yaml' ('/localhost/code/spryops-infrastructure/stacks/catalog/eks/clusters' + 'default.yaml')
I commented out the reference to default in the auto.yaml and then I get another error:
√ : [superadmin] (HOST) spryops-infrastructure ⨠ atmos validate stacks
no matches found for the import 'catalog/iam-service-linked-roles' in the file 'orgs/spt/core/auto/global-region/github.yaml'
Error: failed to find a match for the import '/localhost/code/spryops-infrastructure/stacks/catalog/iam-service-linked-roles.yaml' ('/localhost/code/spryops-infrastructure/stacks/catalog' + 'iam-service-linked-roles.yaml')
Can anyone help me out here? Should the `atmos workflow init/tfstate -f baseline` just work, or am I expected to create that default file and the iam-service-linked-roles.yaml? If so, can someone point me to some docs on what needs to be in these files?
Thanks!
2024-10-22
hi @Dan Miller (Cloud Posse) - I’d like to use the `alb` component to set up a 301 redirect. ex: test.example.com should redirect to app.example.com
I can do this in the console, but I can’t figure out the syntax using the ALB component (https://github.com/cloudposse/terraform-aws-components/tree/main/modules/alb)
Is this possible to do with the current reference architecture?
rather than using `alb`, could you redirect your route with Route53? For example we do this with the `dns-*` components (https://github.com/cloudposse/terraform-aws-components/tree/main/modules/dns-primary). See var.record_config
I tried that, but app.example.com is an ecs-service behind an ALB, so if I make the DNS record, it will point to the right domain, but then it will hit the ALB and fail because there’s no rule to route the traffic to the correct target group
and if I use the `additional_target` option in the ECS service, then test.example.com acts as an alternate URL for app.example.com instead of just redirecting to the main domain
@johncblandii if I recall correctly, didn’t you do something similar? Do you have any input?
I’ll have to check when I get back to Houston, but I believe we used https://github.com/cloudposse/terraform-aws-alb-ingress/blob/main/main.tf#L58-L88 to define our ingress changes.
I don’t recall if we had any changes we didn’t upstream, though. The listener rule is what you want, though @Taimur Gibson.
resource "aws_lb_listener_rule" "unauthenticated_paths" {
count = module.this.enabled && length(var.unauthenticated_paths) > 0 && length(var.unauthenticated_hosts) == 0 ? length(var.unauthenticated_listener_arns) : 0
listener_arn = var.unauthenticated_listener_arns[count.index]
priority = var.unauthenticated_priority > 0 ? var.unauthenticated_priority + count.index : null
action {
type = "forward"
target_group_arn = local.target_group_arn
}
condition {
path_pattern {
values = var.unauthenticated_paths
}
}
dynamic "condition" {
for_each = length(var.listener_http_header_conditions) > 0 ? [""] : []
content {
dynamic "http_header" {
for_each = var.listener_http_header_conditions
content {
http_header_name = http_header.value["name"]
values = http_header.value["value"]
}
}
}
}
}
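For the 301 itself (test.example.com redirecting to app.example.com), the same listener-rule resource also supports a redirect action combined with a host_header condition. A minimal sketch, assuming a hypothetical listener ARN variable rather than anything exposed by the alb or ecs-service components today:

# Sketch: host-based 301 redirect on an existing ALB listener.
# var.https_listener_arn and the hostnames are placeholders for illustration.
resource "aws_lb_listener_rule" "redirect_test_to_app" {
  listener_arn = var.https_listener_arn
  priority     = 5

  action {
    type = "redirect"

    redirect {
      host        = "app.example.com"
      port        = "443"
      protocol    = "HTTPS"
      path        = "/#{path}"
      query       = "#{query}"
      status_code = "HTTP_301"
    }
  }

  condition {
    host_header {
      values = ["test.example.com"]
    }
  }
}

The #{path} and #{query} placeholders preserve the original path and query string in the redirect.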
2024-10-23
So this was caused by an issue with the generation process not fully cleaning up unused files.
I’ve sent over another zip that should have a couple of improvements.
For clarity however:
The fix is, as you’ve done, to remove the stack configs for EKS if you are an ECS engagement.
`orgs/spt/core/auto/global-region/github.yaml` should look similar to
import:
  - orgs/acme/core/auto/_defaults
  - mixins/region/us-east-1
  - catalog/philips-labs-github-runners
for ECS, as `catalog/iam-service-linked-roles` is an EKS component. Similarly, there shouldn’t be any EKS stacks, so we can remove that catalog folder and any references to it (search the repo for `catalog/eks` and remove those imports).
2024-10-24
2024-10-28
Hi Folks,
Is there a way to have an account be used for multiple stages? I’m early on in working through the reference architecture. I am now at the stage “Prepare Account Deployment” and I am taking heed of the warning, “If you aren’t confident about the email configuration, account names, or anything else, now is the time to make changes or ask for help.” :)
The existing account structure in the accounts.yaml will not work for us. What we want to do is create an OU for each of our customers and then have two accounts for each customer/OU. One would be for “prod” and the other account would be for all their lower environments which can include some, or all of the following: dev, test, staging.
In the reference architecture example, there is an OU for “plat” and then a separate account (with an associated stage) for each stage.
Questions:
- Is it possible for an account to be referenced for multiple stages?
- How would that be represented in the yaml?
- Would this create other issues or require changes that we may need to consider in other places in the Quickstart?
Would something like this work to replace the plat entries in the accounts.yaml?
organizational_units:
  - name: cust1
    accounts:
      - name: cust1-non-prod
        tenant: cust1
        stage:
          - dev
          - staging
          - test
          - demo
        tags:
          eks: false
      - name: cust1-prod
        tenant: cust1
        stage: prod
        tags:
          eks: false
  - name: cust2
    accounts:
      - name: cust2-non-prod
        tenant: cust2
        stage:
          - staging
        tags:
          eks: false
      - name: cust2-prod
        tenant: cust2
        stage: prod
        tags:
          eks: false
@GervaisdeM-SpryPoint here’s how you can accomplish what you want to do:
Solution
- Rename `catalog/account.yaml` to `catalog/account.yaml.tmpl`
- Find where account is imported, and replace it with something like this:
import:
  - path: "catalog/account.yaml.tmpl"
    context:
      tenants:
        - name: acme
        - name: foo
        - name: bar
    skip_templates_processing: false
    ignore_missing_template_values: true
  - path: "catalog/account.yaml.tmpl"
- Update your `account.yaml.tmpl` like this:
components:
  terraform:
    account:
      vars:
        # … deleted everything else to focus on the solution
        # you should keep what is there.
        organizational_units:
          # … other OUs defined here
          {{ range .tenants }}
          - name: {{ .name }}
            accounts:
              - name: {{ .name }}-dev
                tenant: {{ .name }}
                stage: dev
                tags:
                  eks: true
              - name: {{ .name }}-sandbox
                tenant: {{ .name }}
                stage: sandbox
                tags:
                  eks: true
              - name: {{ .name }}-staging
                tenant: {{ .name }}
                stage: staging
                tags:
                  eks: true
              - name: {{ .name }}-prod
                tenant: {{ .name }}
                stage: prod
                tags:
                  eks: true
            service_control_policies:
              - DenyLeavingOrganization
          {{ end }}
        service_control_policies_config_paths: []
Outcome
I tested this locally, and got this:
test:
  components:
    terraform:
      account:
        atmos_component: account
        atmos_manifest: deploy/test
        atmos_stack: test
        atmos_stack_file: deploy/test
        backend: {}
        backend_type: ""
        command: terraform
        component: account
        env: {}
        inheritance: []
        metadata: {}
        overrides: {}
        providers: {}
        remote_state_backend: {}
        remote_state_backend_type: ""
        settings: {}
        stack: test
        vars:
          account_email_format: aws+cplive-%[email protected]
          account_iam_user_access_to_billing: DENY
          aws_service_access_principals:
            - cloudtrail.amazonaws.com
            - guardduty.amazonaws.com
            - ipam.amazonaws.com
            - ram.amazonaws.com
            - securityhub.amazonaws.com
            - servicequotas.amazonaws.com
            - sso.amazonaws.com
            - auditmanager.amazonaws.com
            - config.amazonaws.com
            - config-multiaccountsetup.amazonaws.com
            - malware-protection.guardduty.amazonaws.com
          enabled: true
          enabled_policy_types:
            - SERVICE_CONTROL_POLICY
            - TAG_POLICY
          organization_config:
            accounts: []
            organization:
              service_control_policies:
                - DenyEC2InstancesWithoutEncryptionInTransit
            organizational_units:
              - accounts:
                  - name: core-analytics
                    stage: analytics
                    tags:
                      eks: false
                    tenant: core
                  - name: core-artifacts
                    stage: artifacts
                    tags:
                      eks: false
                    tenant: core
                  - name: core-audit
                    stage: audit
                    tags:
                      eks: false
                    tenant: core
                  - name: core-auto
                    stage: auto
                    tags:
                      eks: true
                    tenant: core
                  - name: core-corp
                    stage: corp
                    tags:
                      eks: true
                    tenant: core
                  - name: core-dns
                    stage: dns
                    tags:
                      eks: false
                    tenant: core
                  - name: core-identity
                    stage: identity
                    tags:
                      eks: false
                    tenant: core
                  - name: core-marketplace
                    stage: marketplace
                    tags:
                      eks: false
                    tenant: core
                  - name: core-network
                    stage: network
                    tags:
                      eks: false
                    tenant: core
                  - name: core-public
                    stage: public
                    tags:
                      eks: false
                    tenant: core
                  - name: core-security
                    stage: security
                    tags:
                      eks: false
                    tenant: core
                name: core
                service_control_policies:
                  - DenyLeavingOrganization
              - accounts:
                  - name: acme-dev
                    stage: dev
                    tags:
                      eks: true
                    tenant: acme
                  - name: acme-sandbox
                    stage: sandbox
                    tags:
                      eks: true
                    tenant: acme
                  - name: acme-staging
                    stage: staging
                    tags:
                      eks: true
                    tenant: acme
                  - name: acme-prod
                    stage: prod
                    tags:
                      eks: true
                    tenant: acme
                name: acme
                service_control_policies:
                  - DenyLeavingOrganization
              - accounts: …