Clear Street — Modernizing the brokerage ecosystem
Engineering · 13 min read
Oct 26, 2021

Why We Moved to Local Kubernetes Development at Clear Street

Clear Street Engineering

At Clear Street, we run hundreds of services to provide prime brokerage products to our clients. Dozens of scheduled batch jobs process millions of trades, journals, and transactions every day to facilitate trade clearing, calculate risk, and generate financial reports.

Our Kubernetes clusters manage all of these critical microservices and batch jobs. We rely on Kubernetes for networking, deployments, pod scaling, load balancing, node auto-scaling, cron jobs, and more. So it might come as a surprise that developers at Clear Street weren’t using Kubernetes during their local development loop until this year. They couldn’t take advantage of Kubernetes features or use Kubernetes-native infrastructure like Argo.

Today, nearly every Clear Street developer’s inner loop involves Kubernetes with what we’ve dubbed “localkube”: local development with Kubernetes.

Local development before localkube

At the end of last year, we began to see repeating issues crop up with local development. Services’ local configuration was quite different from their cluster configuration: networking, service discovery, replication, pod readiness, and more were all different.

Eventually, local configuration issues started to eat into developer time. The problem was twofold:

  • We had to maintain separate local and cluster configurations (and code!).
  • Cluster configurations remained untested until deployed in our development cluster.

While using Docker Compose and Docker networks would solve some of these problems, we saw that as a half-baked solution that would still miss out on some essential Kubernetes features.

Why use Kubernetes locally?

We decided to move to local Kubernetes development for the following benefits. It lets us:

  • Bring local configuration as close as possible to our clusters.
  • Test Kubernetes configuration (resource limits, exposed ports/services, environment variables, liveness/readiness probes) earlier in our testing pipeline, even before code reaches our development cluster.
  • Quickly test replicated services, including pod startup and shutdown behavior.
  • Develop and experiment with Kubernetes-native tools, like Argo and service meshes.
  • Easily test upgrading Kubernetes.
  • Remove our bespoke service startup scripts and reduce technical debt.

Our local Kubernetes tooling choices

This article won’t go into comparing tools — there are plenty of articles available already. However, we include some reasons as to why we chose the tools we did.

kind

First, we need to create a local Kubernetes cluster. We chose kind because of its fast startup and shutdown speed, simplicity, good documentation, and community support.

After wrapping some startup configuration in a script, creating a local cluster is as simple as: make create-cluster

This does several things:

  • Creates a Kubernetes cluster pinned to our production Kubernetes version.
  • Adds Ingress support to the cluster.
  • Creates a local Docker registry and connects the cluster to it.
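As a rough illustration of what such a wrapper might generate, the sketch below renders a kind cluster configuration and the matching kind command line. It is not our actual script: the node image tag, the registry name kind-registry, and port 5000 are placeholder assumptions, and the registry patch follows the pattern from kind's local-registry guide, which may differ between kind versions.

```python
import textwrap


def render_kind_config(node_image, registry_name, registry_port):
    """Render a kind cluster config (YAML text) for a single-node cluster.

    node_image pins the Kubernetes version, the node label and port mappings
    prepare the node for an ingress controller, and the containerd patch
    points the cluster at a local registry. All values are assumptions.
    """
    return textwrap.dedent(f"""\
        kind: Cluster
        apiVersion: kind.x-k8s.io/v1alpha4
        containerdConfigPatches:
        - |-
          [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:{registry_port}"]
            endpoint = ["http://{registry_name}:{registry_port}"]
        nodes:
        - role: control-plane
          image: {node_image}
          kubeadmConfigPatches:
          - |
            kind: InitConfiguration
            nodeRegistration:
              kubeletExtraArgs:
                node-labels: "ingress-ready=true"
          extraPortMappings:
          - containerPort: 80
            hostPort: 80
            protocol: TCP
          - containerPort: 443
            hostPort: 443
            protocol: TCP
        """)


def create_cluster_command(cluster_name, config_path):
    """Return the kind invocation that creates a cluster from a config file."""
    return ["kind", "create", "cluster", "--name", cluster_name, "--config", config_path]


if __name__ == "__main__":
    # Placeholder version tag; a real setup would pin the production version.
    print(render_kind_config("kindest/node:v1.21.1", "kind-registry", 5000))
    print(" ".join(create_cluster_command("localkube", "kind-config.yaml")))
```

Keeping the configuration in one generated file means the pinned Kubernetes version lives in a single place, which makes testing an upgrade a one-line change.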

Deleting or recreating the local cluster is simple as well:

# delete
make delete-cluster

# recreate
make delete-cluster create-cluster

Tilt

Next, we found a solution for the “inner development loop”: code, build, run, repeat.

The tooling landscape for this task is extensive, with notable contenders being Skaffold, Garden, and Tilt.

We chose Tilt for managing the inner development loop for the following reasons:

  • Its simple web UI for viewing logs and high-level information is essential for developers new to Kubernetes.
  • Its feature set is large and includes container-building improvements like “live update.”
  • Its configuration language, Starlark (a dialect of Python), is powerful and straightforward to learn.

Out of the box, Tilt was easy to incorporate into our codebase. Within about a week, we could launch most of our services in a local Kubernetes environment.

After describing how to build our images in Starlark, we tell Tilt the services we want to run, like bank and appliances:

tilt up appliances bank

Here, appliances is an alias to our shared core infrastructure: Postgres, Kafka, Schema Registry, etc.

After running this command, Tilt serves a pleasant web UI with service status, build and runtime logs, service groupings, and Kubernetes information like Pod ID.

Once Tilt detects that the code has changed, it will automatically update the correct service in your cluster!

One of Tilt’s many valuable features is describing services’ dependencies. Above, the only application service we specified is bank, but our Starlark configuration reads other YAML configuration files, figures out that bank needs several other services, and launches those as well.
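The dependency expansion can be pictured as a transitive closure over a service graph. The plain-Python sketch below is only an illustration of that idea, not our Starlark configuration: the dependency map and the service names ledger and auth are hypothetical.

```python
def resolve_services(requested, dependencies):
    """Return the requested services plus all transitive dependencies.

    dependencies maps a service name to the names it needs directly.
    Each service appears once, in the order it is first reached, and
    cycles are tolerated because visited names are skipped.
    """
    resolved = []
    seen = set()
    stack = list(reversed(requested))
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        resolved.append(name)
        # Push direct dependencies so they are visited next, in listed order.
        stack.extend(reversed(dependencies.get(name, [])))
    return resolved


if __name__ == "__main__":
    # Hypothetical graph: bank needs ledger and auth, ledger needs auth.
    deps = {"bank": ["ledger", "auth"], "ledger": ["auth"]}
    print(resolve_services(["bank"], deps))
```

With a helper like this, a developer names only the service under development, and everything it needs is launched alongside it.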

As a performance improvement, Tilt uses the local registry we set up earlier to send Docker images into the cluster quickly.

Control loop

At its core, Tilt:

  • Watches your filesystem for file changes
  • Builds and tags Docker images of your services when files change
  • Pushes images to the local registry
  • Tells Kubernetes to pull the new images and restart specific Pods
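To make the loop concrete, here is a simplified model of those four steps in plain Python. It does not mirror Tilt's internals: it compares content hashes between two snapshots instead of subscribing to filesystem events, it assumes one directory per service, and it assumes each Deployment and its container share the service's name. It only builds the command lines rather than running them.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map every file under root (POSIX-style relative path) to a content hash."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    }


def changed_services(before, after, service_dirs):
    """Return services whose directory has an added, removed, or modified file."""
    changed_paths = {
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    }
    return sorted(
        service
        for service, directory in service_dirs.items()
        if any(path.startswith(directory.rstrip("/") + "/") for path in changed_paths)
    )


def rebuild_plan(service, registry, tag):
    """Return the build, push, and rollout commands for one changed service."""
    image = f"{registry}/{service}:{tag}"
    return [
        ["docker", "build", "-t", image, service],  # build context = service dir (assumption)
        ["docker", "push", image],
        # Assumes the Deployment and its container are both named after the service.
        ["kubectl", "set", "image", f"deployment/{service}", f"{service}={image}"],
    ]
```

Using a fresh tag for every build, as in this sketch, is one simple way to make sure Kubernetes notices that the image changed.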

Of course, it does much, much more than this, and we rely on its extensive feature set for an improved developer experience.

Tilt performance improvements

After the initial honeymoon period with Tilt, we improved the local developer experience even more by speeding up Tilt’s inner loop.

Build caching

What if we want to launch 50 microservices to test some complicated workflows for a code change? At Clear Street, the time to build 50 Go, Node.js, and Python Docker images is unacceptable.

Instead, we use a performance trick: pull the latest-tagged Docker image for each service unless specified otherwise.

For example:

tilt up appliances price -- --local bank

We use this syntax to say:

  • Launch appliances, bank, price, and their dependencies.
  • For everything except bank, pull the latest image from our Docker registry.
  • Read the local filesystem for bank, build the Docker image, and inject it into the local cluster.
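The decision behind that syntax can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than our Tiltfile: the argument handling is a stand-in for Tilt's own, the registry host registry.example.com is hypothetical, and dependency expansion is left out for brevity.

```python
import argparse


def parse_args(argv):
    """Split 'appliances price -- --local bank' into (requested, local) lists."""
    if "--" in argv:
        split = argv.index("--")
        requested, extra = argv[:split], argv[split + 1:]
    else:
        requested, extra = list(argv), []
    parser = argparse.ArgumentParser()
    parser.add_argument("--local", nargs="*", default=[])
    options = parser.parse_args(extra)
    return requested, options.local


def image_sources(requested, local, registry):
    """Decide, per service, whether to build locally or pull the latest image."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    services = list(dict.fromkeys([*requested, *local]))
    return {
        service: "build" if service in local else f"pull {registry}/{service}:latest"
        for service in services
    }


if __name__ == "__main__":
    requested, local = parse_args(["appliances", "price", "--", "--local", "bank"])
    for service, source in image_sources(requested, local, "registry.example.com").items():
        print(f"{service}: {source}")
```

Only the services named after --local pay the cost of a local build; everything else starts from an image that already exists.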

Why does this caching mechanism work well? Any code that merges to our main branch automatically generates a new latest-tagged image. As long as the local branch is reasonably up to date with main, the pulled image is equivalent to one built locally.

For our services, the time saved from caching is enormous. Additionally, restarts are fast because the local cluster caches the images.

Dockerfile best practices

Because we have followed Dockerfile best practices since Clear Street’s early days, we’ve benefited from fast rebuilds of Docker images: carefully ordering Docker layers goes a long way towards speeding up Docker builds.

For example, Go Dockerfiles have:

# Rarely changing code above
COPY go.mod go.mod
COPY go.sum go.sum
RUN go mod download
# Frequently changing below

With a warm cache, Docker skips the go mod download step entirely as long as go.mod and go.sum haven’t changed.

live_update

Finally, some of our services benefited immensely from a Tilt feature called live_update. Rather than rebuilding an image from its Dockerfile every time, live_update lets us sync files into the running container almost instantly and run arbitrary commands.

For our Node.js services, we wanted to use nodemon because:

  • It’s how our engineers are used to working.
  • It can very quickly rebuild Node apps (under five seconds).

With just a few lines of Starlark, we could automatically sync our Node.js services’ code into the container and run nodemon as usual. This brought the inner-loop speed for those services back to parity with our previous workflow.
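The idea behind syncing can be modeled in plain Python: map each changed local file to its location in the container, and fall back to a full rebuild only when a file that affects the image itself changes. This is a conceptual sketch, not Tilt's API (in a Tiltfile the same intent is expressed with live_update steps such as sync and run), and the paths and the package.json trigger are hypothetical.

```python
from pathlib import PurePosixPath


def container_path(changed_file, local_dir, container_dir):
    """Return where a changed file lands in the container, or None if it is outside local_dir."""
    try:
        relative = PurePosixPath(changed_file).relative_to(local_dir)
    except ValueError:
        return None
    return str(PurePosixPath(container_dir) / relative)


def plan_update(changed_files, local_dir, container_dir, rebuild_on):
    """Choose between copying files into the container and rebuilding the image."""
    if any(changed in rebuild_on for changed in changed_files):
        return ("rebuild", [])
    copies = []
    for changed in changed_files:
        target = container_path(changed, local_dir, container_dir)
        if target is not None:
            copies.append((changed, target))
    return ("sync", copies)


if __name__ == "__main__":
    # Hypothetical layout: source lives in web/src and is served from /app/src.
    print(plan_update(["web/src/index.js"], "web/src", "/app/src", {"web/package.json"}))
    print(plan_update(["web/package.json"], "web/src", "/app/src", {"web/package.json"}))
```

Once the files are in place, a watcher such as nodemon inside the container picks up the change and restarts the app, so no image build is needed.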

Interacting with the cluster

While it’s great that our services run in a sandboxed Kubernetes cluster locally, a new challenge was breaking into the sandbox and interacting with services in the cluster.

A typical workflow at Clear Street is to use REST or gRPC endpoints to communicate with services, so we found a couple of different ways to support this workflow.

kubefwd

kubefwd is an open-source project that lets the host machine communicate as if it’s inside the cluster:

curl myservice.mynamespace:10000/metrics

It does this by temporarily adding domain entries to your /etc/hosts file with the service names it forwards.
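For intuition, the sketch below generates hosts-file lines of the kind such a tool might add, giving each service its own loopback address. The starting address and the exact set of aliases are assumptions for illustration; what kubefwd actually writes may differ.

```python
import ipaddress


def hosts_entries(services, namespace, first_ip="127.1.27.1"):
    """Return one hosts-file line per service, each on its own loopback address."""
    base = ipaddress.IPv4Address(first_ip)
    lines = []
    for offset, service in enumerate(services):
        names = [
            f"{service}.{namespace}",
            f"{service}.{namespace}.svc.cluster.local",
        ]
        lines.append(f"{base + offset} {' '.join(names)}")
    return lines


if __name__ == "__main__":
    for line in hosts_entries(["myservice"], "mynamespace"):
        print(line)
```

Because every service resolves to a distinct local address, several services can be reached on the same port number at once, which is why the curl command above can use the in-cluster port directly.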

However, kubefwd is not a magic bullet:

  • It does not support pod restarts and has no plan to do so. Constantly restarting kubefwd becomes tedious.
  • If not shut down appropriately, kubefwd can leave /etc/hosts in a bad state.

Kubernetes Ingresses

Another way to access services in a cluster is to use Kubernetes Ingresses. By using the Ingress NGINX kind configuration, services can create something like this:

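apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sledfrontend
  # ...
spec:
  rules:
    - host: sledfrontend.localkube.co.clearstreet.io
      http:
        paths:
          - path: /
            # pathType is required in networking.k8s.io/v1
            pathType: Prefix
            backend:
              # v1 replaces serviceName/servicePort with a nested service block
              service:
                name: sledfrontend
                port:
                  number: 5000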

When running our SLeD service locally, users can open sledfrontend.localkube.co.clearstreet.io and have their requests routed to sledfrontend:5000 in the cluster.

A wildcard DNS record in Amazon Route 53 resolves *.localkube.co.clearstreet.io to the loopback address, 127.0.0.1, where the cluster’s ingress controller is listening. Developers can use this to test subdomain routing or to reach their services through Ingresses locally.
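The two hops involved, DNS resolution and then host-based routing, can be modeled in a few lines of plain Python. This is a toy model under assumptions: fnmatch only approximates DNS wildcard matching, and the routing table is a simplified stand-in for what an ingress controller derives from Ingress rules.

```python
from fnmatch import fnmatch


def resolve(hostname, dns_records):
    """Return the address of the first record whose pattern matches hostname, else None."""
    for pattern, address in dns_records.items():
        if fnmatch(hostname, pattern):
            return address
    return None


def route(hostname, ingress_rules):
    """Return the in-cluster backend for the request's Host header, else None."""
    return ingress_rules.get(hostname)


if __name__ == "__main__":
    records = {"*.localkube.co.clearstreet.io": "127.0.0.1"}
    rules = {"sledfrontend.localkube.co.clearstreet.io": "sledfrontend:5000"}
    host = "sledfrontend.localkube.co.clearstreet.io"
    print(resolve(host, records))  # where the browser connects
    print(route(host, rules))      # where the request is forwarded
```

The wildcard record means no per-service DNS setup is needed: adding a new Ingress host under the same domain is enough for it to be reachable from the developer's machine.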

Outcomes

After rolling out localkube, we saw developers take off with the new capabilities. We:

  • Began using Argo Workflows and Argo Events.
  • Created sandboxed environments in the cloud for a specific branch (more on that in a future blog post!).
  • Tested deployments before merging to our main branch.
  • Tried out more Kubernetes-native tools like service meshes.
  • Gained higher confidence and a lower failure rate in production.

The Tilt UI has been instrumental in developer experience and has helped us teach Kubernetes and Docker to more Clear Street developers. Additionally, switching branches and work contexts has never been easier for us.

The Tilt team has been great to work with; they’ve helped us onboard more developer workflows and have promptly released features that we have requested.

Why not use Kubernetes locally?

Despite all the benefits listed in this article, running Kubernetes might not be the best idea for every company for such reasons as:

  • Potentially slower inner development loop, especially if not enough resources are devoted to improving caching.
  • With Docker-based local development, it can be tricky to set up debugging and may require juggling several different Dockerfiles.
  • Onboarding developers to Kubernetes basics is necessary to run and debug Kubernetes locally. DevOps practitioners would say this is a great thing, but not every company thinks so.

Looking forward

Our developers are excited by this new workflow and are looking forward to more improvements down the road. For example, we don’t support debugging services running in a cluster yet. Luckily, Tilt has a guide for this, and we’re planning on enabling it soon.

One long-desired feature at Clear Street was branch deploys: ephemeral, sandboxed, cloud environments built from a single source branch. Because of Tilt and localkube, we’ve been able to create and delete these ephemeral environments quickly. For the tech behind that, stay tuned for a future blog!

Finally, things keep getting better in the local Kubernetes development landscape. kind has become more stable, and the Tilt team consistently improves the tool with the features we request. There is no looking back for Clear Street: local Kubernetes development with these tools has improved developer experience and reduced production risk.
