We at Dgraph Labs use GitHub as our VCS, and we recently migrated our CI/CD setup to GitHub Actions. This was a huge win for us internally, especially in a startup setting like ours. Our wins fall broadly into three areas: compute costs, maintenance effort, and configuration time.
With this new setup, we designed and developed Dynamic AutoScaling of GitHub Runners in house. We are thinking of open-sourcing this project; if there is any interest here, please do reach out. This setup let us cut our compute costs by roughly 87%.
In this article we explain our transition to GitHub Actions for our CI/CD needs at Dgraph Labs Inc. As part of this effort, we built and deployed an in-house architecture, “Dynamic AutoScaling of GitHub Runners,” to power the new setup. Previously, our CI/CD ran on a self-hosted, on-prem TeamCity installation, which proved difficult to operate and manage in a startup setting like ours. Moving to GitHub Actions with our in-house autoscaling of GitHub runners has reduced our compute costs, maintenance effort, and configuration time across our repositories, while also improving security.
Have you seen https://github.com/philips-labs/terraform-aws-github-runner? How does your solution differ?
Terraform module for scalable GitHub action runners on AWS
Hey @Alex Jurkiewicz, we have actually looked at this, and found it a little expensive and difficult to manage. Our setup is quite straightforward, with minimal components.
I have mentioned this in the blog post
Happy to discuss further if there is interest.
Gotcha. I’m not very clear on what you mean by expensive, though: doesn’t the Philips solution scale to zero as well?
The most difficult part of the Philips solution is the number of AWS components it depends on: a combination of Lambdas, SQS, API Gateway, and compute resources.
The solution proposed above uses only SSM plus compute resources.
The more components there are, the harder it gets to track down which one failed. That makes the Philips solution somewhat expensive in maintenance costs, and to some extent in dollar costs as well.
What we aimed for & achieved was a simpler solution.
It’s always great to see new solutions. But I’m with Alex on this one. The Philips module has been battle tested for a long time and scales to zero. We’ve been using it for nearly 2 years now and haven’t had to debug it once.
Hey @Soren Jensen, thanks for the honest feedback. I understand the hesitation when other products are more battle tested and have had better traction.
We will be open-sourcing this setup sometime soon, and I will be happy to share the GitHub link. We would like your feedback, positive or negative; both are welcome. :)
We have battle tested this for ~6 months (not as long as Philips, obviously). Given that we are a database company, our workloads are quite diverse, and the setup covers most edge cases really well. It handles different kinds of machine-type requirements, dynamic scheduling, scaling to zero, and smart re-use of a machine that is on the cusp of finishing its current job (predicted from historic run times). It has some extra smarts that are still in beta mode, with more room to optimize.
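For a rough sense of what that “smart re-use” decision could look like, here is a minimal sketch: it predicts the remaining time of a runner’s current job from the median of historic run times and re-uses the runner only if waiting beats booting a fresh machine. The function name, the median predictor, and the boot-time threshold are all my own illustrative assumptions, not Dgraph’s actual implementation.

```python
# Hypothetical sketch of a "smart re-use" heuristic; names and thresholds
# are illustrative assumptions, not the actual Dgraph scheduler code.
from statistics import median

def should_reuse_runner(elapsed_s: float,
                        historic_durations_s: list[float],
                        boot_time_s: float = 90.0) -> bool:
    """Decide whether a queued job should wait for a busy runner.

    If the runner's current job is predicted (median of historic run
    times) to finish sooner than a fresh machine could boot, re-use it
    instead of scaling up a new instance.
    """
    if not historic_durations_s:
        return False  # no history: play it safe and boot a new machine
    expected_total = median(historic_durations_s)
    remaining = max(expected_total - elapsed_s, 0.0)
    return remaining <= boot_time_s

# Job has run 10 min of a typically ~11 min build: waiting beats a 90 s boot.
print(should_reuse_runner(600.0, [640.0, 660.0, 700.0]))  # True
# Job just started: booting a fresh runner is faster.
print(should_reuse_runner(30.0, [640.0, 660.0, 700.0]))   # False
```

In practice the predictor would be keyed per workflow and machine type, but the trade-off stays the same: expected wait versus instance boot time.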