Service Level Deploys or: How We Learned to Stop Worrying and Love Artifacts
2020 was a fantastic year of growth for Clear Street: our engineering team roughly doubled in size. With this growth came increased demand to get code deployed into production faster.
Because we hold custody of customer assets and are responsible for their financial well-being, we place an extreme emphasis on the stability of our software engineering and release processes. To that end, Clear Street works in a monorepo, where all code lives in a single Git repository, so that everyone works on the same version of the code. We also did something we called a “mono-deploy”: a once-a-week push to production after automation completed all testing in our various QA environments. By the end of 2020, this system was breaking down and needed a rethink.
As teams grew, their product delivery deadlines started to diverge. Different engineering teams wanted to run different versions of the code in the repository, APIs were making and breaking compatibility, and some teams needed long QA cycles while others didn’t. The teams that maintained the deploys were left with extra work.
Through all of this, we came up with a solution that transformed how we work at Clear Street, empowering our engineers to take control of their releases while maintaining our high standard of safety; we call it SLeD.
Laying the Foundation
Our ultimate goal was to achieve service-level deploys at Clear Street. For us, this meant empowering teams and service owners to deploy their services when they saw fit, and to do so safely. Before we could start building our service-level deploy tool (SLeD), we had to change the way we developed our services.
From day one, we’ve run a microservices architecture on multiple Kubernetes clusters. However, because of our monorepo build system and mono-deploy, we weren’t operating in a proper microservices environment. Our internal services didn’t expose backward-compatible interfaces, and all services were assumed to be running the latest code. We also had tight dependencies between services in our code, which made it difficult to reason about what build changes a single code diff would produce (e.g., a change to service A’s code may affect service B because B imports from A). Consequently, we built artifacts for all of our services on any code change and deployed them all at once.
We decided to enforce backward compatibility (i.e., full transitive compatibility) for our avro, protobuf, and swagger specs to ensure stable contracts between services, using tools like Schema Registry, Buf, and swagger_spec_compatibility. In addition, since wire-level backward compatibility does not guarantee behavioral backward compatibility, our developers needed to start thinking about whether a change to a service preserves backward-compatible behavior. It’s important to note that we allowed for alpha and beta protocols with no backward compatibility guarantees, so rapid iteration on new features doesn’t hinder developer velocity. With these changes, we’ve embraced “expand, migrate, contract” as a way of safely introducing breaking changes to our service interfaces.
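To illustrate the pattern, here is a minimal sketch of “expand, migrate, contract” applied to a renamed request field; the field and function names are hypothetical and not Clear Street code.

from typing import Any

def process(quantity: int) -> None:
    print(f"processing an order of {quantity}")

def handle_order(payload: dict[str, Any]) -> None:
    # Expand: accept both the old field ("qty") and the new one ("quantity")
    # so existing callers keep working while they migrate.
    quantity = payload.get("quantity", payload.get("qty"))
    if quantity is None:
        raise ValueError("order is missing a quantity")
    # Migrate: callers move to "quantity" at their own pace.
    # Contract: once no caller sends "qty", delete the fallback above.
    process(quantity)

Each step is itself a backward-compatible change, so the service can be deployed at any point in the sequence without breaking its consumers.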
To decouple our services, we built a system for publishing and consuming internal libraries in various languages. Instead of service A depending on some useful utility in service B’s code, both service A and B now depend on remote import M at a pinned version. We found that creating these “internal standard libraries” raised the libraries’ code quality, which was a nice side effect.
First we walked, now we SLeD
With our services now decoupled in our monorepo, we began developing a deployment tool that could transition us to service-level deploys. We needed a reliable, fast, and customizable tool with a simple interface. Deployments should be deterministic and idempotent, meaning that a re-played deployment should make no change to the system state. Additionally, the tool would need to support mono-deploys while we transitioned to service-level deploys.
We designed SLeD to meet all of these requirements. Containerized deployments were a clear choice, meaning that all deployments would occur in a well-known, sandboxed environment. A mono-deploy, however, meant running hundreds of containers simultaneously, which would likely strain any single server they ran on. We decided to use Kubernetes to launch and run these containers on our cloud servers. As long as our Kubernetes cluster is up and running, these jobs will run to completion regardless of whether SLeD is available. As a bonus, if there are not enough servers available to run the deployments, Kubernetes can provision as many as needed from our cloud provider.
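To make the mechanics concrete, here is a minimal sketch of launching a deployment container as a Kubernetes Job with the official Python client; the image, namespace, and environment variable names are placeholders for illustration, not SLeD’s actual internals.

from kubernetes import client, config

def launch_deploy_job(service: str, version: str) -> None:
    # Assumes the launcher runs inside the cluster; use load_kube_config() otherwise.
    config.load_incluster_config()

    container = client.V1Container(
        name="deploy",
        image="registry.example.com/deploy-runner:latest",  # hypothetical runner image
        command=["python", "./deploy.py"],
        env=[client.V1EnvVar(name="ARTIFACT_VERSION", value=version)],
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"deploy-{service}-{version}"),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
            ),
        ),
    )
    # Once created, the Job runs to completion even if the caller goes away.
    client.BatchV1Api().create_namespaced_job(namespace="deploys", body=job)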
We also introduced the concept of “deploy groups,” representing one or more services to deploy, to help us transition to service-level deploys more easily. To begin with, we had a single deploy group, which represented our mono-deploy. Over time, teams began breaking their services off into their own deploy groups and removing them from the mono-deploy. They could then work through the edge cases of, say, deploying a single service in the middle of the day.
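A deploy group is easiest to picture as a named set of services; the sketch below is purely illustrative and does not reflect SLeD’s actual schema or service names.

from dataclasses import dataclass

@dataclass
class DeployGroup:
    name: str
    services: list[str]

# Initially there was a single group covering everything (the mono-deploy)...
mono_deploy = DeployGroup(name="mono-deploy", services=["svc-a", "svc-b", "svc-c"])

# ...and teams gradually carved their services out into their own groups.
s3exporter = DeployGroup(name="s3exporter", services=["s3exporter"])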
Deployments, while typically Docker containers and Kubernetes resources, can come in many forms. For example, an execution engine would deploy as a binary executable onto a physical server, perhaps co-located with an exchange. Our solution was to make deployments scriptable, without assuming, for example, that a deployment would result in a running Docker container. Every service contains a deploy.py, which is the customizable part of the deployment and describes how SLeD deploys that service.
After every code change to a service at Clear Street, we build an “artifact,” usually a Docker container, which can be deployed immediately via SLeD. We store the artifact in our JFrog Artifactory instance, along with the service’s deploy.py. When SLeD runs a deployment, it downloads the deploy.py and sets environment variables to point to the corresponding artifacts. From there, it’s as simple as python ./deploy.py.
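Put together, the deployment step amounts to roughly the following; the Artifactory URL and environment variable names below are placeholders, not the real configuration.

import os
import subprocess

import requests

ARTIFACTORY = "https://artifactory.example.com/artifactory/deploys"  # placeholder URL

def run_deploy(service: str, version: str, environment: str) -> None:
    # Fetch the deploy.py that was stored alongside the service's artifact.
    resp = requests.get(f"{ARTIFACTORY}/{service}/{version}/deploy.py")
    resp.raise_for_status()
    with open("deploy.py", "wb") as f:
        f.write(resp.content)

    # Point the script at the corresponding artifact via environment variables.
    env = dict(
        os.environ,
        ARTIFACT_IMAGE=f"registry.example.com/{service}:{version}",
        DEPLOY_ENVIRONMENT=environment,
    )

    # From here, it's as simple as python ./deploy.py.
    subprocess.run(["python", "./deploy.py"], env=env, check=True)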
Writing a deploy.py
SLeD deploys a service by simply executing the service’s deploy.py in a Kubernetes Job. To make writing a service’s deployment easy and to provide sane defaults, we developed a Python library we call libdeploy. While libdeploy currently only supports deploying Kubernetes resources, we’ve written it in a modular way to support deploying infrastructure using Terraform and deploying to bare-metal servers in the future. Here is an example deploy.py:
import clearstreet.libdeploy as d

d.deploy_all(
    d.kubernetes.helpers.create_stateful_service(
        "s3exporter",
        http_port=8080,
        metrics_port=8080,
        cpu=100,
        memory=150,
        uses_aws=True,
    )
)
The libdeploy library also checks for command line arguments, so a user can pass in flags such as --dry-run to render out what the deploy will look like and --validate to validate the resources the user defines.
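libdeploy’s real flag handling isn’t shown in this post, but a minimal argparse sketch of the two flags mentioned above might look like this:

import argparse

def parse_deploy_flags() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="flags handled by libdeploy")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="render the resources that would be deployed without applying them",
    )
    parser.add_argument(
        "--validate",
        action="store_true",
        help="validate the user-defined resources and exit",
    )
    return parser.parse_args()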
We opted to use Python rather than YAML or JSON because of type checking (we use Mypy for type hinting) and the ability to write custom components and deploy logic. Moreover, we didn’t want our developers to have to learn Kubernetes YAML configurations or HashiCorp Configuration Language for Terraform. Instead, by providing a common Python library for all deployable objects (e.g., Docker containers and infrastructure), we could empower our developers to take ownership of their service’s deployment lifecycle.
SLeD’s User Interface
Developers interact with SLeD through a web UI written in Flutter. From a service’s page, a developer can easily find the currently deployed artifacts, deploy an artifact to an environment, and view past deploys. SLeD displays helpful information about an artifact, such as which merge request it was built from, when that merge request was merged, and its commit SHA. Clicking on a merge request takes you to the merge request itself for more context.
Because SLeD can deploy any service to any of our environments, security is a critical factor in its design. SLeD has built-in authentication via Google OAuth and a role-based access control (RBAC) authorization engine that limits who can deploy what to which environments.
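Conceptually, the authorization check reduces to mapping a user’s role to the environments that role may deploy to; the sketch below is a simplification with made-up role and environment names, not SLeD’s actual RBAC engine.

# Hypothetical role-to-environment mapping for illustration only.
ROLE_ENVIRONMENTS = {
    "viewer": set(),
    "developer": {"dev", "qa"},
    "release-manager": {"dev", "qa", "prod"},
}

def can_deploy(role: str, environment: str) -> bool:
    return environment in ROLE_ENVIRONMENTS.get(role, set())

assert can_deploy("developer", "qa")
assert not can_deploy("developer", "prod")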
Conclusion
Our transition to full service-level deploys is still ongoing, but the end is in sight. At least one of our teams deploys new code to production daily using SLeD, and team members are empowered to deploy their services as they see fit. We’ve also switched the remnants of our mono-deploy to use SLeD.
Importantly, as teams and services decouple themselves from the mono-deploy, the infrastructure team is no longer in the critical path for service deploys. This has enabled faster development cycles for service owners, which we believe improves overall service quality. To date, SLeD has completed over 100,000 deployments to our various environments in the couple of months it’s been operational. Ninety-five percent of these deployments have succeeded in under 15 seconds, giving us the power to deploy and roll back changes nearly instantly.