#refarch (2024-10)
Cloud Posse Reference Architecture
2024-10-03
Do you recommend using Helmfile via ArgoCD for app deployments to EKS? Are there advantages to this approach rather than using Kustomize? (Is it that it’s easier to spin up preview environments via Helmfile?)
@Dan Miller (Cloud Posse) @Yonatan Koren
We use ArgoCD or helm deploy directly, both with helmfile. Although @Igor Rodionov would have to weigh in on the advantages of that decision
I would pick what makes the most sense to your engineering team. Kustomize is nice, but we’ve been using helmfile for so long that we tend to stick with it.
I think we should revisit this next year
Helmfile seems like it’s worth trying out to compare against. I might have a use case for some environments where Argo is not available (and doesn’t require frequent CD) and using Atmos to deploy to those environments might be a nice alternative.
I like the idea of a monochart that you guys developed, but it feels like it’s not being maintained for things like Istio https://github.com/cloudposse/charts/tree/master/incubator/monochart. Is there an open-source alternative that you’re aware of? Or is there a desire to maintain this one?
I’ve changed tack on monochart. We still stand by the pattern, but instead of us managing it, we instruct customers how to create their own monochart for their organization
When we managed it, what ended up happening is we were reinventing the k8s spec after adding support for everything. Because in the end, different companies use different parts. That led to an unwieldy chart to manage, and was antithetical to what we were trying to achieve.
The idea is that each org should define an “interface” for their apps on k8s. That interface is defined via helm charts that can be reused by teams in the org to deploy services in an idiomatic way. Don’t expose every feature of k8s; instead, expose only the ones you will need.
So in your case, with istio, we have a customer doing exactly this. They have a custom “monochart” (e.g. “acme-service”), which implements all the custom resources for istio the way they need it. Then any developer can deploy their service using that chart. Just bring-your-own-dockerfile.
Thanks Erik. I’ll give this a try and see how it stacks up against plain kustomize
since kustomize can deploy helm charts, I think this might be the best of both?
(that’s coming from the POV though of a helm advocate, and I’m sure many users of kustomize would disagree)
2024-10-04
2024-10-10
I am currently going through the quick start and I have just deployed the accounts successfully via `atmos workflow deploy/accounts -f accounts`, and I have run `atmos terraform apply account-map -s core-gbl-root` to build the account maps.
But when I run `atmos workflow deploy/account-settings -f accounts` I get the following error:
`Planning failed. Terraform encountered an error while generating this plan.
╷
│ Error: Invalid index
│
│ on ../account-map/modules/iam-roles/main.tf line 46, in locals:
│ 46: is_root_user = local.current_identity_account == local.account_map.full_account_map[local.root_account_name]
│ ├────────────────
│ │ local.account_map.full_account_map is object with 12 attributes
│ │ local.root_account_name is “core-root”
│
│ The given key does not identify an element in this collection value.`
Checking the account map I see the line: root_account_account_name = "core-root"
so it doesn’t look like an object with attributes. What are some suggested steps to troubleshoot this?
2024-10-11
@Andriy Knysh (Cloud Posse)
I have used the reference architecture to set up a transit gateway to connect 2 VPCs in different accounts. However, I am having trouble understanding how the transit gateway offers a connection to the internet. Is that not a part of the reference architecture, and is it something we should set up separately on our own? If so, how can this be achieved?
There is a line in the reference architecture VPC section that states this
# Use PrivateLink in private-only VPCs at least until we have
# a connection to the internet via Transit Gateway.
So ideally, transit gateway setup should provide connection to the internet right?
there are many different considerations here (e.g. VPC private vs public subnets, etc.). I’m in meetings now, will try to explain later (ping me in a few hours)
@Andriy Knysh (Cloud Posse) Ok, sure
@Andriy Knysh (Cloud Posse) Pinging you again as a reminder
@Shirisha Sudhakar Rao it all depends on your network architecture and the security and monitoring requirements. You can have many different variations of network architecture; I’ll give you a few examples here:
• If the VPCs in the two accounts have public subnets, they already have egress to the Internet via the IGW
• If the VPCs only have private subnets, then you need to provide Internet access using any of the following:
– You can connect a Site-to-Site VPN to the TGW on one side, and then add VPC attachments for the VPCs on the other side of the TGW. This way, the VPN connection will provide access to the Internet. In the VPC route tables, you will have a route 0.0.0.0/0 pointing to the TGW. In the TGW route table, you will have a route 0.0.0.0/0 pointing to the VPN connection
– You can have a separate VPC (in the same or a separate account; we’ll call it `network`). The `network` VPC has public subnets with an Internet Gateway (providing a connection to the Internet). The VPCs in both accounts can have only private subnets. They will be connected to the Internet via the TGW and the `network` VPC. In this case, you will have a subnet route in each VPC 0.0.0.0/0 pointing to the TGW. In the TGW route table, you will have a route 0.0.0.0/0 pointing to the VPC attachment of the `network` VPC. In the `network` VPC, you will have a route 0.0.0.0/0 pointing to the IGW (Internet Gateway)
I suppose your question was about the last part - a `network` VPC with Internet access via IGW, and the other VPCs connected to the TGW and then to the `network` VPC (a rough Terraform sketch of those routes follows below)
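To make that last option concrete, here is a minimal Terraform sketch of the routes described above. All IDs here are hypothetical placeholders (in a real setup they come from the vpc/tgw component outputs, not hard-coded strings), and in practice traffic from the private VPCs typically also passes through a NAT Gateway in the `network` VPC before reaching the IGW, with return routes for the spoke VPC CIDRs pointing back to the TGW.

# 1. Spoke (private) VPC: send all non-local traffic to the Transit Gateway
resource "aws_route" "spoke_to_tgw" {
  route_table_id         = "rtb-spoke-private"         # placeholder
  destination_cidr_block = "0.0.0.0/0"
  transit_gateway_id     = "tgw-0123456789abcdef0"     # placeholder
}

# 2. TGW route table: send all traffic to the network VPC attachment
resource "aws_ec2_transit_gateway_route" "default_to_network_vpc" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_route_table_id = "tgw-rtb-0123456789abcdef0"  # placeholder
  transit_gateway_attachment_id  = "tgw-attach-network-vpc"     # placeholder
}

# 3. network VPC, private subnets (where the TGW attachment lands):
#    send traffic to a NAT Gateway so private source IPs can be translated
resource "aws_route" "network_private_to_nat" {
  route_table_id         = "rtb-network-private"       # placeholder
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "nat-0123456789abcdef0"     # placeholder
}

# 4. network VPC, public subnets: send traffic to the Internet Gateway
resource "aws_route" "network_public_to_igw" {
  route_table_id         = "rtb-network-public"        # placeholder
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = "igw-0123456789abcdef0"     # placeholder
}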
regarding this
# Use PrivateLink in private-only VPCs at least until we have
# a connection to the internet via Transit Gateway.
if both VPCs are private (no public subnets), then you can use a PrivateLink to another VPC (e.g. `network` in our example), which has a connection to the Internet via IGW
in all of these cases, if both VPCs are private, then you have to provide a connection to something that has Internet access
that something can be another VPC (`network`), and you connect the 2 VPCs to the `network` VPC using a PrivateLink, or a Transit Gateway
the TGW itself neither offers nor provides connections to the Internet, it’s just a proxy. You need to connect it to a VPC which has a connection to the Internet, or to an on-site VPN (this is more complicated since it involves using on-prem equipment like Palo Alto or Cisco)
there are many different variations of the network architecture. Let me know if the above description answers your question, or you need more help
the Cloud Posse `tgw` components https://github.com/cloudposse/terraform-aws-components/tree/main/modules/tgw describe the architecture where you have private VPCs in multiple accounts and also a `network` VPC (e.g. in a `network` account) which has public subnets and a connection to the Internet via IGW
it’s impossible to configure all possible variations of network architectures with one terraform module. If your use case differs from the above, you will have to make adjustments in your own code
@Andriy Knysh (Cloud Posse) Thank you for the information. I will try this out today.
2024-10-15
I am using datadog-lambda forwarders in the ue1 and uw2 regions, and since I deployed the Datadog API keys and configuration in our global stack (defaulting to uw2) and the key to auto/uw2, I have been experiencing drift and SSM access errors.
For example, I have a Datadog Lambda forwarder for vpc-flow-logs in ue1, and while the Datadog client is trying to access the SSM parameter that it thinks exists in ue1, the policy and the actual key are in uw2, and I don’t see any way to change that aside from modifying the TF itself.
I was wondering what the motivation for moving to the global stack is? And how should I migrate from the previous regional paradigm?
2024-10-21
Hi folks,
The company I work with is using the Quick start.
I’m just starting out with the very early steps for setup, and when I run the `atmos workflow init/tfstate -f baseline` command, I am getting errors.
First, I get an eks error that you can see here when I run a validate:
√ : [superadmin] (HOST) spryops-infrastructure ⨠ atmos validate stacks
no matches found for the import 'catalog/eks/clusters/default' in the file 'catalog/eks/clusters/auto.yaml'
Error: failed to find a match for the import '/localhost/code/spryops-infrastructure/stacks/catalog/eks/clusters/default.yaml' ('/localhost/code/spryops-infrastructure/stacks/catalog/eks/clusters' + 'default.yaml')
I commented out the reference to default in the auto.yaml and then I get another error:
√ : [superadmin] (HOST) spryops-infrastructure ⨠ atmos validate stacks
no matches found for the import 'catalog/iam-service-linked-roles' in the file 'orgs/spt/core/auto/global-region/github.yaml'
Error: failed to find a match for the import '/localhost/code/spryops-infrastructure/stacks/catalog/iam-service-linked-roles.yaml' ('/localhost/code/spryops-infrastructure/stacks/catalog' + 'iam-service-linked-roles.yaml')
Can anyone help me out here? Should the `atmos workflow init/tfstate -f baseline` just work, or am I expected to create that default file and the iam-service-linked-roles.yaml? If so, can someone point me to some docs on what needs to be in these files?
Thanks!
2024-10-22
hi @Dan Miller (Cloud Posse) - I’d like to use the `alb` component to set up a 301 redirect. ex: test.example.com should redirect to app.example.com
I can do this in the console, but I can’t figure out the syntax using the ALB component (https://github.com/cloudposse/terraform-aws-components/tree/main/modules/alb)
Is this possible to do with the current reference architecture?
rather than using `alb`, could you redirect your route with Route53? For example we do this with the `dns-*` components (https://github.com/cloudposse/terraform-aws-components/tree/main/modules/dns-primary). See var.record_config
I tried that, but app.example.com is an ecs-service behind an ALB, so if I make the DNS record, it will point to the right domain, but then it will hit the ALB and fail because there’s no rule to route the traffic to the correct target group
and if I use the `additional_target` option in the ECS service, then test.example.com acts as an alternate URL for app.example.com instead of just redirecting to the main domain
@johncblandii if I recall correctly, didn’t you do something similar? Do you have any input?
I’ll have to check when I get back to Houston, but I believe we used https://github.com/cloudposse/terraform-aws-alb-ingress/blob/main/main.tf#L58-L88 to define our ingress changes.
I don’t recall if we had any changes we didn’t upstream, though. The listener rule is what you want, though @Taimur Gibson.
resource "aws_lb_listener_rule" "unauthenticated_paths" {
count = module.this.enabled && length(var.unauthenticated_paths) > 0 && length(var.unauthenticated_hosts) == 0 ? length(var.unauthenticated_listener_arns) : 0
listener_arn = var.unauthenticated_listener_arns[count.index]
priority = var.unauthenticated_priority > 0 ? var.unauthenticated_priority + count.index : null
action {
type = "forward"
target_group_arn = local.target_group_arn
}
condition {
path_pattern {
values = var.unauthenticated_paths
}
}
dynamic "condition" {
for_each = length(var.listener_http_header_conditions) > 0 ? [""] : []
content {
dynamic "http_header" {
for_each = var.listener_http_header_conditions
content {
http_header_name = http_header.value["name"]
values = http_header.value["value"]
}
}
}
}
}
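For the 301 itself (test.example.com redirecting to app.example.com), the same listener-rule resource also supports a redirect action combined with a host_header condition. A minimal sketch, assuming a hypothetical listener ARN variable rather than anything exposed by the alb or ecs-service components today:

# Sketch: host-based 301 redirect on an existing ALB listener.
# var.https_listener_arn and the hostnames are placeholders for illustration.
resource "aws_lb_listener_rule" "redirect_test_to_app" {
  listener_arn = var.https_listener_arn
  priority     = 5

  action {
    type = "redirect"

    redirect {
      host        = "app.example.com"
      port        = "443"
      protocol    = "HTTPS"
      path        = "/#{path}"
      query       = "#{query}"
      status_code = "HTTP_301"
    }
  }

  condition {
    host_header {
      values = ["test.example.com"]
    }
  }
}

The #{path} and #{query} placeholders preserve the original path and query string in the redirect.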
2024-10-23
So this was caused by an issue with the generation process not fully cleaning up unused files.
I’ve sent over another zip that should have a couple of improvements.
For clarity however:
The fix is, as you’ve done, to remove the stack configs for EKS if you are an ECS engagement.
`orgs/spt/core/auto/global-region/github.yaml` should look similar to
import:
  - orgs/acme/core/auto/_defaults
  - mixins/region/us-east-1
  - catalog/philips-labs-github-runners
for ECS, as `catalog/iam-service-linked-roles` is an EKS component. Similarly, there shouldn’t be any EKS stacks, so we can remove that catalog folder and any references to it (search the repo for `catalog/eks` and remove those imports).
2024-10-24
2024-10-28
Hi Folks,
Is there a way to have an account be used for multiple stages? I’m early on in working through the reference architecture. I am now at the stage “Prepare Account Deployment” and I am taking heed of the warning, “If you aren’t confident about the email configuration, account names, or anything else, now is the time to make changes or ask for help.” :)
The existing account structure in the accounts.yaml will not work for us. What we want to do is create an OU for each of our customers and then have two accounts for each customer/OU. One would be for “prod” and the other account would be for all their lower environments which can include some, or all of the following: dev, test, staging.
In the reference architecture example, there is an OU for “plat” and then a separate account (with an associated stage) for each stage.
Questions:
- Is it possible for an account to be referenced for multiple stages?
- How would that be represented in the yaml?
- Would this create other issues or require changes that we may need to consider in other places in the Quickstart?
Would something like this work to replace the plat entries in the accounts.yaml?
organizational_units:
  - name: cust1
    accounts:
      - name: cust1-non-prod
        tenant: cust1
        stage:
          - dev
          - staging
          - test
          - demo
        tags:
          eks: false
      - name: cust1-prod
        tenant: cust1
        stage: prod
        tags:
          eks: false
  - name: cust2
    accounts:
      - name: cust2-non-prod
        tenant: cust2
        stage:
          - staging
        tags:
          eks: false
      - name: cust2-prod
        tenant: cust2
        stage: prod
        tags:
          eks: false
@GervaisdeM-SpryPoint here’s how you can accomplish what you want to do:
Solution
- Rename `catalog/account.yaml` to `catalog/account.yaml.tmpl`
- Find where account is imported, and replace it with something like this:
import:
  - path: "catalog/account.yaml.tmpl"
    context:
      tenants:
        - name: acme
        - name: foo
        - name: bar
    skip_templates_processing: false
    ignore_missing_template_values: true
  - path: "catalog/account.yaml.tmpl"
- Update your `account.yaml.tmpl` like this:
components:
  terraform:
    account:
      vars:
        # … deleted everything else to focus on the solution
        # you should keep what is there.
        organizational_units:
          # … other OUs defined here
          {{ range .tenants }}
          - name: {{ .name }}
            accounts:
              - name: {{ .name }}-dev
                tenant: {{ .name }}
                stage: dev
                tags:
                  eks: true
              - name: {{ .name }}-sandbox
                tenant: {{ .name }}
                stage: sandbox
                tags:
                  eks: true
              - name: {{ .name }}-staging
                tenant: {{ .name }}
                stage: staging
                tags:
                  eks: true
              - name: {{ .name }}-prod
                tenant: {{ .name }}
                stage: prod
                tags:
                  eks: true
            service_control_policies:
              - DenyLeavingOrganization
          {{ end }}
        service_control_policies_config_paths: []
Outcome
I tested this locally, and got this:
test:
  components:
    terraform:
      account:
        atmos_component: account
        atmos_manifest: deploy/test
        atmos_stack: test
        atmos_stack_file: deploy/test
        backend: {}
        backend_type: ""
        command: terraform
        component: account
        env: {}
        inheritance: []
        metadata: {}
        overrides: {}
        providers: {}
        remote_state_backend: {}
        remote_state_backend_type: ""
        settings: {}
        stack: test
        vars:
          account_email_format: aws+cplive-%[email protected]
          account_iam_user_access_to_billing: DENY
          aws_service_access_principals:
            - cloudtrail.amazonaws.com
            - guardduty.amazonaws.com
            - ipam.amazonaws.com
            - ram.amazonaws.com
            - securityhub.amazonaws.com
            - servicequotas.amazonaws.com
            - sso.amazonaws.com
            - auditmanager.amazonaws.com
            - config.amazonaws.com
            - config-multiaccountsetup.amazonaws.com
            - malware-protection.guardduty.amazonaws.com
          enabled: true
          enabled_policy_types:
            - SERVICE_CONTROL_POLICY
            - TAG_POLICY
          organization_config:
            accounts: []
            organization:
              service_control_policies:
                - DenyEC2InstancesWithoutEncryptionInTransit
            organizational_units:
              - accounts:
                  - name: core-analytics
                    stage: analytics
                    tags:
                      eks: false
                    tenant: core
                  - name: core-artifacts
                    stage: artifacts
                    tags:
                      eks: false
                    tenant: core
                  - name: core-audit
                    stage: audit
                    tags:
                      eks: false
                    tenant: core
                  - name: core-auto
                    stage: auto
                    tags:
                      eks: true
                    tenant: core
                  - name: core-corp
                    stage: corp
                    tags:
                      eks: true
                    tenant: core
                  - name: core-dns
                    stage: dns
                    tags:
                      eks: false
                    tenant: core
                  - name: core-identity
                    stage: identity
                    tags:
                      eks: false
                    tenant: core
                  - name: core-marketplace
                    stage: marketplace
                    tags:
                      eks: false
                    tenant: core
                  - name: core-network
                    stage: network
                    tags:
                      eks: false
                    tenant: core
                  - name: core-public
                    stage: public
                    tags:
                      eks: false
                    tenant: core
                  - name: core-security
                    stage: security
                    tags:
                      eks: false
                    tenant: core
                name: core
                service_control_policies:
                  - DenyLeavingOrganization
              - accounts:
                  - name: acme-dev
                    stage: dev
                    tags:
                      eks: true
                    tenant: acme
                  - name: acme-sandbox
                    stage: sandbox
                    tags:
                      eks: true
                    tenant: acme
                  - name: acme-staging
                    stage: staging
                    tags:
                      eks: true
                    tenant: acme
                  - name: acme-prod
                    stage: prod
                    tags:
                      eks: true
                    tenant: acme
                name: acme
                service_control_policies:
                  - DenyLeavingOrganization
              - accounts: …