#spacelift (2024-07)
2024-07-02

I’ve heard a rumor that there’s an alleged way of manipulating online workers to allow more room for “burst” runs without paying billing overages.
I can’t figure out how that math might work. Does anyone have insight on how you manage to turn workers on/off to benefit Spacelift billing?
Our basic use case is that there’s light traffic except for release periods 1-2 times a week for 2-3 hours. So we need maximum throughput for releases, but can run with minimal numbers, or none, the vast majority of the time.

Our contract is for P95 so there’s not a lot of burst room in my math.

i use this module, and enable the lambda autoscaler, https://github.com/spacelift-io/terraform-aws-spacelift-workerpool-on-ec2/

can scale to 0 to minimize ec2 cost

we set min to 1 to minimize wait time for most prs

But you don’t do that because of any Spacelift billing benefits, correct? I’m researching an alleged method of managing workers to gain a Spacelift billing benefit.

max is like 70 workers, and we still stay under P95

so, i’d say there are billing benefits

5 workers * 24 hrs/day * 30 days/month = 3600 worker-hours. P95 is 3420 worker-hours

Our contract isn’t for worker hours. It’s for workers. Meaning 5 workers in a 30 day month gives us 720 billable hours and a P95 of 684 - or a 36 hour buffer where we’re allowed to run over 5 workers.
Further they do metric captures per minute. So any hour in which I run more than 5 workers for 3+ minutes means that’s an overage hour - one of my 36 available.
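Roughly how I’m modeling it, as a Python sketch - this is my reading of the counting (per-minute samples, 3+ minutes over contract taints the hour), not their published formula, and the names are mine:

# My model of the overage counting, not Spacelift's documented formula.
CONTRACTED_WORKERS = 5
BILLABLE_HOURS = 24 * 30                     # 720 hours in a 30-day month
P95_BUFFER = round(BILLABLE_HOURS * 0.05)    # 36 hours allowed over contract

def overage_hours(per_minute_counts):
    """per_minute_counts: one concurrent-worker sample per minute for the month."""
    over = 0
    for start in range(0, len(per_minute_counts), 60):
        hour = per_minute_counts[start:start + 60]
        minutes_over = sum(1 for n in hour if n > CONTRACTED_WORKERS)
        if minutes_over >= 3:        # 3+ minutes over contract makes it an overage hour
            over += 1
    return over

# Anything past the buffer is what actually shows up as an overage:
# billed_overage = max(0, overage_hours(samples) - P95_BUFFER)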

Does your contract allow worker hours and we need to negotiate better?

no, it’s the same as yours

Is my math wrong then or do your workloads just finish bursting inside of 36 hours regularly?

they aren’t clear on their formula, in my opinion, so i can’t say for certain

but the way they calculate it, i think we’re under “2 workers” in their P95 calculation, even though we burst from 1 to 30+ several times a week

we don’t have 5 workers running constantly, so i think we get to recoup a lot of that time, however they do their calculation

We moved to the Kubernetes worker management, so our big issue is that we manually scale up/down, and if you forget, you eat up the 36 hours in…well, 36 hours or less.
But also our workloads mean 100-200 stacks at a time for 2-3 hours, 1-2 times a week. On the high side that’s 24 hours just on releases (3 hrs × 2 releases/week × 4 weeks) when we scale up to +10 over our default 4.
Whereas you hit 70. If we scaled up obscenely then our throughput would be faster and we could be done within, let’s say, 1 hour each time. Meaning 8 hours of burst instead of 24, because we’re not abusing their leeway enough.

Even if your P95 is at 2 you’re still paying for your 5 workers though. So there’s no benefit to running less than 5 from a Spacelift perspective (there is from the AWS perspective…).

there is, because it keeps our avg down to the point where we can burst to several dozen

if we ran 5 all the time, we’d have no burst

and 5 is the minimum they contract for. can’t reduce it and pay any less anyway

i haven’t checked to see if the lambda autoscaler works with kubernetes. i suppose it could

If we write the numbers 1-20 down in a row, say into a spreadsheet, and want to calculate the P95 value, then you’re going to pull the 19th value in the list. Sorted lowest to highest, that 19th value is 19.
If you replace 1-18 with a 0, sort the list the same way, and pull the P95 value you’re still going to get 19.
Am I calculating P95 incorrectly?

according to your definition of what you are taking the P95 of, or theirs?

i don’t think they define theirs very well, so i’d be rather surprised if any of us could perform a calculation that came out correct

I’ve taken the usage data they allow you to download and have calculated it in the past successfully. Will see if I can show an example with a little Excel time.
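In the meantime, here’s roughly the same calculation outside of Excel (nearest-rank P95). Assumptions on my part: that the usage export is a CSV, and the filename and “workers” column name are placeholders for whatever the download actually uses.

import csv
import math

def p95(values):
    """Nearest-rank P95: sort ascending and take the value at rank ceil(0.95 * n)."""
    ranked = sorted(values)
    rank = math.ceil(0.95 * len(ranked))
    return ranked[rank - 1]

# Sanity check against the 1-20 example above:
assert p95(range(1, 21)) == 19                 # 19th of 20 values
assert p95([0] * 18 + [19, 20]) == 19          # zeroing out 1-18 changes nothing

with open("spacelift-usage.csv") as f:         # placeholder filename
    counts = [float(row["workers"]) for row in csv.DictReader(f)]
print("P95 workers:", p95(counts))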
2024-07-09

following up on the discussion on triggering Spacelift admin stacks with GitHub Actions

Here’s our documentation for triggering spacelift runs from GHA using GitHub Comments: https://docs.cloudposse.com/reference-architecture/fundamentals/spacelift/#triggering-spacelift-runs

Also, this discussion above is similar and may help to read through: https://sweetops.slack.com/archives/C0313BVSVQD/p1718707848601429
The refarch config for the Spacelift admin stacks in each tenant includes the following config (e.g. for plat)
context_filters:
  tenants: ["plat"]
  administrative: false # We don't want this stack to also find itself
We have a few cases where we might want some child stacks for a tenant’s admin stack to be administrative:
• to create Spacelift terraform resources (e.g. policies or integrations)
• (not yet tried) to create a new admin stack for a child OU of a parent OU (keyed off ‘tenant’)
Is there a context filter pattern for a tenant’s admin stack that allows for administrative child stacks, whilst still not allowing the stack to find itself?

And finally, regarding how to include admin stacks in the atmos describe affected GHA: the action you’re using already supports including them (https://github.com/cloudposse/github-action-atmos-affected-trigger-spacelift/blob/main/action.yml#L90) via
atmos-include-spacelift-admin-stacks: "true"
However, you will need to add the Spacelift GIT_PUSH policy for triggering on PR comments to the stacks in Spacelift, and remove the GIT_PUSH policy that triggers on every commit.

cc @Andriy Knysh (Cloud Posse) @michaelyork06

@michaelyork06 here


@Elena Strabykina (SavvyMoney) ^

@Elena Strabykina (SavvyMoney) @michaelyork06 please let me know if you have any questions or need further assistance

Hi @Gabriela Campana (Cloud Posse). We would like to follow up after our call with the Cloud Posse team last week.
• Enable/use the existing atmos feature to detect affected administrative stacks and run terraform plan only for those. There is a flaw in this atmos feature: it marks an administrative stack as affected if there is a change in its child stack, and it also sounded like there is an issue with detection of deleted child stacks - to tell the truth, the actual logic is not clear to me. To mitigate the flaws, we have two options:
◦ Cloud Posse seemed to be interested in modifying atmos to detect added/removed child stacks and only then mark their parent as affected. If they implement this, the issue with admin stacks should be solved - we would like to follow up and confirm if/when you guys plan to do this.

@Andriy Knysh (Cloud Posse) @Erik Osterman (Cloud Posse)

Hi @Elena Strabykina (SavvyMoney) We are discussing this internally and will get back to you asap
