Why We Moved to Local Kubernetes Development at Clear Street
At Clear Street, we run hundreds of services to provide prime brokerage products to our clients. Dozens of scheduled batch jobs process millions of trades, journals, and transactions every day to facilitate trade clearing, calculate risk, and generate financial reports.
Our Kubernetes clusters manage all of these critical microservices and batch jobs; we rely on Kubernetes for networking, deployments, pod scaling, load balancing, node auto-scaling, cron jobs — the list goes on. So it might come as a surprise that developers at Clear Street weren't using Kubernetes during their local development loop until this year. Locally, they couldn't leverage Kubernetes features or use Kubernetes-native infrastructure like Argo.
Today, nearly every Clear Street developer’s inner loop involves Kubernetes with what we’ve dubbed “localkube”: local development with Kubernetes.
Local development before localkube
At the end of last year, we began to see recurring issues crop up with local development. Services' local configuration differed significantly from their cluster configuration: networking, service discovery, replication, pod readiness, and more all behaved differently.
Eventually, local configuration issues started to eat into developer time. The problem was twofold:
- We had to maintain separate local and cluster configurations (and code!).
- Cluster configurations remained untested until deployed in our development cluster.
While using Docker Compose and Docker networks would solve some of these problems, we saw that as a half-baked solution that would still miss out on some essential Kubernetes features.
Why use Kubernetes locally?
We decided to move to local Kubernetes development for the following benefits. It lets us:
- Bring local configuration as close as possible to our clusters.
- Test Kubernetes configuration (resource limits, exposed ports/services, environment variables, liveness/readiness probes) earlier in our testing pipeline, even before code reaches our development cluster.
- Quickly test replicated services, including pod startup and shutdown behavior.
- Develop and experiment with Kubernetes-native tools, like Argo and service meshes.
- Easily test upgrading Kubernetes.
- Remove our bespoke service startup scripts and reduce technical debt.
Our local Kubernetes tooling choices
This article won't compare tools in depth; there are plenty of comparisons available already. We will, however, explain why we chose the tools we did.
kind
First, we need to create a local Kubernetes cluster. We chose kind because of its fast startup and shutdown speed, simplicity, good documentation, and community support.
After wrapping some startup configuration in a script, creating a local cluster is as simple as:
make create-cluster
This does several things:
- Creates a Kubernetes cluster pinned to our production Kubernetes version.
- Adds Ingress support to the cluster.
- Creates a local Docker registry and connects the cluster to it.
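As a rough illustration, the wrapper boils down to a few commands from kind's documented local-registry recipe. The sketch below is not our actual script; the cluster name, node image version, config file, and registry settings are placeholders.
# create-cluster sketch (placeholders throughout; see kind's docs for the full recipe)
import subprocess

CLUSTER = "localkube"
NODE_IMAGE = "kindest/node:v1.21.1"  # pin to the production Kubernetes version (this tag is a placeholder)
REGISTRY = "kind-registry"

def sh(*cmd):
    subprocess.run(cmd, check=True)

# 1. Create the cluster. kind.yaml would carry extraPortMappings for Ingress
#    traffic plus the containerd registry-mirror configuration.
sh("kind", "create", "cluster", "--name", CLUSTER,
   "--image", NODE_IMAGE, "--config", "kind.yaml")

# 2. Add Ingress support (ingress-nginx.yaml stands in for the upstream manifest).
sh("kubectl", "apply", "-f", "ingress-nginx.yaml")

# 3. Run a local Docker registry and attach it to kind's Docker network so the
#    cluster's nodes can pull images from it.
sh("docker", "run", "-d", "--restart=always",
   "-p", "127.0.0.1:5000:5000", "--name", REGISTRY, "registry:2")
sh("docker", "network", "connect", "kind", REGISTRY)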
Deleting or recreating the local cluster is simple as well:
# delete
make delete-cluster
# recreate
make delete-cluster create-cluster
Tilt
Next, we found a solution for the “inner development loop”: code, build, run, repeat.
The tooling landscape for this task is extensive, with notable contenders being skaffold, garden, and Tilt.
We chose Tilt for managing the inner development loop for the following reasons:
- Its simple web UI for viewing logs and high-level status is essential for developers new to Kubernetes.
- It has a large feature set, including container-build improvements like "live update."
- Its Starlark configuration language (a dialect of Python) is powerful and straightforward to learn.
Out of the box, Tilt was easy to incorporate into our codebase. In about a week, we could launch most of our services in a local Kubernetes environment.
After describing how to build our images in Starlark, we tell Tilt the services we want to run, like bank and appliances:
tilt up appliances bank
Here, appliances is an alias for our shared core infrastructure: Postgres, Kafka, Schema Registry, etc.
After running this command, Tilt serves a pleasant web UI with service status, build and runtime logs, service groupings, and Kubernetes information like Pod ID:
Once Tilt detects that the code has changed, it will automatically update the correct service in your cluster!
One of Tilt’s many valuable features is describing services’ dependencies. Above, we’ve only specified the bank service, but our Starlark configuration reads other YAML configuration files and figures out that bank needs several other services and launches those.
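To give a flavor of what this looks like, here is a minimal Tiltfile sketch that declares one such dependency via resource_deps. The image name, manifest paths, and the exact shape of our appliances group are illustrative, not our real configuration.
# Tiltfile sketch (illustrative names and paths only)

# Shared infrastructure that application services depend on.
k8s_yaml("k8s/appliances.yaml")
k8s_resource("postgres", labels=["appliances"])
k8s_resource("kafka", labels=["appliances"])

# The bank service: build its image and deploy its manifests.
docker_build("registry.example.com/bank", "services/bank")
k8s_yaml("k8s/bank.yaml")

# bank waits for its dependencies before starting.
k8s_resource("bank", resource_deps=["postgres", "kafka"])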
As a performance improvement, Tilt uses the local registry we set up earlier to send Docker images into the cluster quickly.
Control loop
At its core, Tilt:
- Watches your filesystem for file changes
- Builds and tags Docker images of your services when files change
- Pushes images to the local registry
- Tells Kubernetes to pull the new images and restart specific Pods
Of course, it does much, much more than this, and we rely on its extensive feature set for an improved developer experience.
Tilt performance improvements
After the initial honeymoon period with Tilt, we improved the local developer experience even more by speeding up Tilt’s inner loop.
Build caching
What if we want to launch 50 microservices to test some complicated workflows for a code change? At Clear Street, the time to build 50 Go, Node.js, and Python Docker images is unacceptable.
Instead, we use a performance trick: pull the latest-tagged Docker image for each service unless specified otherwise.
For example:
tilt up appliances price -- --local bank
We use this syntax to say:
- Launch appliances, bank, price, and their dependencies.
- For everything except bank, pull the latest image from our Docker registry.
- Read the local filesystem for bank, build the Docker image, and inject it into the local cluster.
Why does this caching mechanism work well? Any code that merges to our main branch automatically generates a new latest-tagged image, so as long as the local branch is reasonably up to date with main, the pulled image is effectively the same as one built locally.
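Under the hood, the flag is ordinary Tiltfile configuration. The sketch below shows one way such a flag could be wired up with Tilt's config API; the registry URL, file paths, and the register helper are illustrative assumptions rather than our actual Tiltfile.
config.define_string_list("to-run", args=True)  # positional args: which resources to run
config.define_string_list("local")              # --local: services to build from source
cfg = config.parse()

config.set_enabled_resources(cfg.get("to-run", []))
local_services = cfg.get("local", [])

def register(name):
    # The deployment YAML references registry.example.com/<name>:latest, so a
    # service without a docker_build simply runs the image CI last published.
    k8s_yaml("k8s/%s.yaml" % name)
    if name in local_services:
        # Build from the working tree; Tilt swaps the freshly built image into
        # the Kubernetes objects that reference it.
        docker_build("registry.example.com/" + name, "services/" + name)

register("bank")
register("price")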
For our services, the time saved from caching is enormous. Additionally, restarts are fast because the local cluster caches the images.
Dockerfile best practices
Because we've followed Dockerfile best practices since Clear Street's beginning, we benefit from fast rebuilds of our Docker images: carefully ordering Docker layers goes a long way toward speeding up builds.
For example, Go Dockerfiles have:
# Rarely changing code above
COPY go.mod go.mod
COPY go.sum go.sum
RUN go mod download
# Frequently changing below
With a warm cache, Docker will skip the go mod download step entirely as long as go.mod and go.sum haven't changed.
live_update
Finally, some of our services benefited immensely from a Tilt feature called live_update. Rather than rebuild a Docker image from scratch every time, live_update lets us sync files into the running container instantly and run arbitrary commands.
For our Node.js services, we wanted to use nodemon because:
- It’s how our engineers are used to working.
- It can very quickly rebuild Node apps (under five seconds).
With just a few lines of Starlark, we could automatically sync our Node.js services' code into the container and run nodemon as usual. This brought the inner-loop speed for those services back to parity with what developers had before.
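As an illustration, the live_update block for a Node.js service might look roughly like this; the image name, paths, and trigger file are assumptions, and the nodemon setup itself lives in the service's Dockerfile.
docker_build(
    "registry.example.com/web-frontend",
    "services/web-frontend",
    live_update=[
        # Copy changed source straight into the running container; nodemon
        # inside the container notices the change and restarts the app.
        sync("services/web-frontend/src", "/app/src"),
        # Re-run npm install only when the package manifest changes.
        run("npm install", trigger=["services/web-frontend/package.json"]),
    ],
)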
Interacting with the cluster
While it’s great that our services run in a sandboxed Kubernetes cluster locally, a new challenge was breaking into the sandbox and interacting with services in the cluster.
A typical workflow at Clear Street is to use REST or gRPC endpoints to communicate with services, so we found a couple of different ways to support this workflow.
kubefwd
kubefwd is an open-source project that lets the host machine communicate as if it’s inside the cluster:
curl myservice.mynamespace:10000/metrics
It does this by temporarily adding domain entries to your /etc/hosts file with the service names it forwards.
However, kubefwd is not a magic bullet:
- It does not handle pod restarts (and there are no plans to support them), so constantly restarting kubefwd becomes tedious.
- If not shut down appropriately, kubefwd can leave /etc/hosts in a bad state.
Kubernetes Ingresses
Another way to access services in a cluster is through Kubernetes Ingresses. With kind's Ingress NGINX configuration in place, a service can define an Ingress like this:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sledfrontend
  # ...
spec:
  rules:
    - host: sledfrontend.localkube.co.clearstreet.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sledfrontend
                port:
                  number: 5000
When running our SLeD service locally, users can visit sledfrontend.localkube.co.clearstreet.io and have their requests routed to sledfrontend:5000 inside the cluster.
Our Route53 zone resolves *.localkube.co.clearstreet.io to the loopback address, 127.0.0.1, which the cluster's Ingress controller listens on. Developers can use this to test subdomain routing or access their services through Ingresses locally.
Outcomes
After rolling out localkube, we saw developers take off with the new capabilities. We:
- Began using Argo Workflows and Argo Events.
- Created sandboxed environments in the cloud for a specific branch (more on that in a future blog post!).
- Tested deployments before merging to our main branch.
- Tried out more Kubernetes-native tools like service meshes.
- Gained higher confidence and a lower failure rate in production.
The Tilt UI has been instrumental in improving the developer experience and has helped us teach Kubernetes and Docker to more Clear Street developers. Additionally, switching branches and work contexts has never been easier.
The Tilt team has been great to work with; they’ve helped us onboard more developer workflows and have promptly released features that we have requested.
Why not use Kubernetes locally?
Despite all the benefits listed in this article, running Kubernetes locally might not be the right choice for every company, for reasons such as:
- Potentially slower inner development loop, especially if not enough resources are devoted to improving caching.
- With Docker-based local development, setting up debugging can be tricky and may require juggling several different Dockerfiles.
- Onboarding developers to Kubernetes basics is necessary to run and debug Kubernetes locally. DevOps practitioners would say this is a great thing, but not every company thinks so.
Looking forward
Our developers are excited by this new workflow and are looking forward to more improvements down the road. For example, we don’t support debugging services running in a cluster yet. Luckily, Tilt has a guide for this, and we’re planning on enabling it soon.
One long-desired feature at Clear Street was branch deploys: ephemeral, sandboxed, cloud environments built from a single source branch. Because of Tilt and localkube, we’ve been able to create and delete these ephemeral environments quickly. For the tech behind that, stay tuned for a future blog!
Finally, things keep getting better in the local Kubernetes development landscape. Kind has gotten more stable, and the Tilt team consistently improves the tool with the features we request. There is no looking back for Clear Street: local Kubernetes development with these tools has improved developer experience and reduced production risk.