#spacelift (2024-07)
2024-07-02

I’ve heard a rumor that there’s an alleged way of manipulating online workers to allow more room for “burst” runs without paying billing overages.
I can’t figure out how that math might work. Does anyone have insight on how you manage to turn workers on/off to benefit Spacelift billing?
Our basic use case is that there’s light traffic except for release periods 1-2 times a week for 2-3 hours. So we need maximum throughput for releases, but can run with minimal numbers, or none, the vast majority of the time.

Our contract is for P95 so there’s not a lot of burst room in my math.

i use this module, and enable the lambda autoscaler, https://github.com/spacelift-io/terraform-aws-spacelift-workerpool-on-ec2/

can scale to 0 to minimize ec2 cost

we set min to 1 to minimize wait time for most prs

But you don’t do that because of any Spacelift billing benefits, correct? I’m researching an alleged method of managing workers to gain a Spacelift billing benefit.

max is like 70 workers, and we still stay under P95

so, i’d say there are billing benefits

5 workers * 24 hrs/day * 30 days/month = 3600 worker-hours. P95 is 3420 worker-hours

Our contract isn’t for worker hours. It’s for workers. Meaning 5 workers in a 30 day month gives us 720 billable hours and a P95 of 684 - or a 36 hour buffer where we’re allowed to run over 5 workers.
Further they do metric captures per minute. So any hour in which I run more than 5 workers for 3+ minutes means that’s an overage hour - one of my 36 available.
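Roughly how I’m modeling it, as a Python sketch - this is my reading of the counting (per-minute samples, 3+ minutes over contract taints the hour), not their published formula, and the names are mine:

# My model of the overage counting, not Spacelift's documented formula.
CONTRACTED_WORKERS = 5
BILLABLE_HOURS = 24 * 30                     # 720 hours in a 30-day month
P95_BUFFER = round(BILLABLE_HOURS * 0.05)    # 36 hours allowed over contract

def overage_hours(per_minute_counts):
    """per_minute_counts: one concurrent-worker sample per minute for the month."""
    over = 0
    for start in range(0, len(per_minute_counts), 60):
        hour = per_minute_counts[start:start + 60]
        minutes_over = sum(1 for n in hour if n > CONTRACTED_WORKERS)
        if minutes_over >= 3:        # 3+ minutes over contract makes it an overage hour
            over += 1
    return over

# Anything past the buffer is what actually shows up as an overage:
# billed_overage = max(0, overage_hours(samples) - P95_BUFFER)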

Does your contract allow worker hours and we need to negotiate better?

no, it’s the same as yours

Is my math wrong then or do your workloads just finish bursting inside of 36 hours regularly?

they aren’t clear on their formula, in my opinion, so i can’t say for certain

but the way they calculate it, i think we’re under “2 workers” in their P95 calculation, even though we burst from 1 to 30+ several times a week

we don’t have 5 workers running constantly, so i think we get to recoup a lot of that time, however they do their calculation

We moved to the Kubernetes worker management, so our big issue is that we manually scale up/down, and if you forget, you eat up the 36 hours in…well, 36 hours or less.
But also our workloads mean 100-200 stacks at a time for 2-3 hours, 1-2 times a week. On the high side that’s 24 hours just on releases (3 hrs × 2 releases/week × 4 weeks) when we scale up to +10 over our default 4.
Whereas you hit 70. If we scaled up obscenely then our throughput would be faster and we could be done within, let’s say, 1 hour each time. Meaning 8 hours of burst instead of 24, because we’re not abusing their leeway enough.

Even if your P95 is at 2 you’re still paying for your 5 workers though. So there’s no benefit to running less than 5 from a Spacelift perspective (there is from the AWS perspective…).

there is, because it keeps our avg down to the point where we can burst to several dozen

if we ran 5 all the time, we’d have no burst

and 5 is the minimum they contract for. can’t reduce it and pay any less anyway

i haven’t checked to see if the lambda autoscaler works with kubernetes. i suppose it could

If we write the numbers 1-20 down in a row, say into a spreadsheet, and want to calculate the P95 value, then you’re going to pull the 19th value in the list. Sorted lowest to highest, that 19th value is 19.
If you replace 1-18 with a 0, sort the list the same way, and pull the P95 value you’re still going to get 19.
Am I calculating P95 incorrectly?

according to your definition of what you are taking the P95 of, or theirs?

i don’t think they define theirs very well, so i’d be rather surprised if any of us could perform a calculation that came out correct

I’ve taken the usage data they allow you to download and have calculated it in the past successfully. Will see if I can show an example with a little Excel time.
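In the meantime, here’s roughly the same calculation outside of Excel (nearest-rank P95). Assumptions on my part: that the usage export is a CSV, and the filename and “workers” column name are placeholders for whatever the download actually uses.

import csv
import math

def p95(values):
    """Nearest-rank P95: sort ascending and take the value at rank ceil(0.95 * n)."""
    ranked = sorted(values)
    rank = math.ceil(0.95 * len(ranked))
    return ranked[rank - 1]

# Sanity check against the 1-20 example above:
assert p95(range(1, 21)) == 19                 # 19th of 20 values
assert p95([0] * 18 + [19, 20]) == 19          # zeroing out 1-18 changes nothing

with open("spacelift-usage.csv") as f:         # placeholder filename
    counts = [float(row["workers"]) for row in csv.DictReader(f)]
print("P95 workers:", p95(counts))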
2024-07-09

following up on the discussion on triggering Spacelift admin stacks with GitHub Actions

Here’s our documentation for triggering spacelift runs from GHA using GitHub Comments: https://docs.cloudposse.com/reference-architecture/fundamentals/spacelift/#triggering-spacelift-runs

Also, this discussion above is similar and may help to read through: https://sweetops.slack.com/archives/C0313BVSVQD/p1718707848601429
The refarch config for the Spacelift admin stacks in each tenant includes the following config (e.g. for plat)
context_filters:
  tenants: ["plat"]
  administrative: false # We don't want this stack to also find itself
We have a few cases where we might want some child stacks for a tenant’s admin stack to be administrative:
• to create Spacelift terraform resources (e.g. policies or integrations)
• (not yet tried) to create a new admin stack for a child OU of a parent OU (keyed off ‘tenant’)
Is there a context filter pattern for a tenant’s admin stack that allows for administrative child stacks, whilst still not allowing the stack to find itself?

And finally, regarding how to include admin stacks in the atmos describe affected GHA: the action you’re using already supports including them (https://github.com/cloudposse/github-action-atmos-affected-trigger-spacelift/blob/main/action.yml#L90) via
atmos-include-spacelift-admin-stacks: "true"
However, you will need to add the Spacelift GIT_PUSH policy for triggering on PR comments to the stacks in Spacelift, and remove the GIT_PUSH policy that triggers on every commit.

cc @Andriy Knysh (Cloud Posse) @michaelyork06

@michaelyork06 here


@Elena Strabykina (SavvyMoney) ^

@Elena Strabykina (SavvyMoney) @michaelyork06 please let me know if you have any questions or need further assistance

Hi @Gabriela Campana (Cloud Posse). We would like to follow up after our call with the Cloud Posse team last week.
• Enable/use the existing atmos feature to detect affected administrative stacks and run terraform plan only for those. There is a flaw in this atmos feature: it marks an administrative stack as affected if there is a change in its child stack, and it also sounded like there is an issue with detection of deleted child stacks - to tell the truth, the actual logic is not clear to me. To mitigate the flaws, we have two options:
◦ Cloud Posse seemed to be interested in modifying atmos to detect added/removed child stacks and only then mark their parent as affected. If they implement this, the issue with admin stacks should be solved - we would like to follow up and confirm if/when you guys plan to do this.

@Andriy Knysh (Cloud Posse) @Erik Osterman (Cloud Posse)

Hi @Elena Strabykina (SavvyMoney) We are discussing this internally and will get back to you asap
