Feb 16, 2023

Building a Platform for Risk Management

Clear Street Engineering

Clear Street is a fintech prime broker working to solve one of the industry’s most neglected problems: legacy technology. We’re building a proprietary, cloud-native clearing and custody system to replace the legacy infrastructure used across capital markets, improving speed, access, and service for our clients. As a participant in the global capital markets, Clear Street is subject to many forces that impact risk on a daily basis.

Clear Street must navigate a fluid regulatory environment, maintain sound capital management practices, and communicate clearly with its clients. To tackle these challenges efficiently and effectively, we have been building a risk platform that is capable of consolidating data from multiple sources, calculating risk metrics in a timely manner, and quickly disseminating that information to multiple consumers.

Traditional risk technology solutions have focused primarily on batch-driven processes that perform calculations once the trading day is complete. This is not surprising given the increased regulatory demands placed on capital markets participants since 2008. Stress tests, resolution planning, SREP, IFRS 9, ICAAP, Basel III, MiFID II: the list of compliance and regulatory requirements is long and still growing. Under these pressures, many risk technology solutions tend to be bespoke collections of services that don’t share data or processes and are often run by separate teams.

At Clear Street, our vision is to build a core data processing and calculation engine that can be adapted to service multiple risk and regulatory requirements. The goals of this platform are:

  • To constantly process data from multiple sources;
  • To provide elastic compute that can adapt to changing demand; and
  • To provide timely access to results.

Unification of Batch and Stream

Consider the following scenario. A risk manager is interested in monitoring up-to-date P&L information for a client account during the day. Accurately calculating intraday P&L requires:

  • loading every trade that occurred during the day;
  • maintaining accumulated positions and cost-basis from the trades; and
  • maintaining current prices.

One possible approach is to load trades and prices on demand as discrete batches of data, calculate state from these batches, and produce P&L values from the accumulated state. This approach produces accurate results; however, it does not scale to thousands of accounts with millions of trades.
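
As a rough illustration, a naive batch implementation might look like the following Python sketch (the trade shape and the cost-basis treatment are simplified assumptions for illustration, not our production model):

```python
from collections import defaultdict

def intraday_pnl(trades, current_prices):
    """Naive batch approach: replay the full day of trades, then mark to market.

    trades:         iterable of (account, symbol, quantity, price) tuples
    current_prices: dict mapping symbol -> latest observed price
    """
    positions = defaultdict(float)   # (account, symbol) -> net quantity
    cost_basis = defaultdict(float)  # (account, symbol) -> net cash paid

    for account, symbol, quantity, price in trades:
        positions[(account, symbol)] += quantity
        cost_basis[(account, symbol)] += quantity * price

    return {
        key: quantity * current_prices[key[1]] - cost_basis[key]
        for key, quantity in positions.items()
    }
```

Re-running a function like this over every trade for thousands of accounts each time a refreshed number is needed is what makes the pure batch approach so expensive intraday.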

An alternative approach is to run two systems in parallel: a streaming solution that accumulates trades during the day and an end-of-day batch system that follows the approach described above. This type of architecture, known as the Lambda architecture, arose from a 2011 blog post (i) by Nathan Marz. However, it too has shortcomings. It requires maintaining two systems and two code bases, each with its own copies of data and state, producing results at different cadences that must be merged at some point. As a result, systems based on the Lambda architecture tend to be highly complex.

More recent research on this topic has led to a philosophy known as the “unification of batch and stream.” Akidau et al., in their blog posts Streaming 101 (ii) and Streaming 102 (iii), and their book Streaming Systems (iv), argue that batch processing is a strict subset of stream processing. Consider the data streams involved in the scenario described above:

[Figure: prices and trades as unbounded event streams between times t and t+1]

If we treat prices and trades as coming from an infinite, unbounded dataset, then the batch use case of calculating P&L for the current trading date becomes a strictly bounded subset of data from the respective streams. In the diagram above, consider the events that occur between t and t+1. If a system is able to continuously process prices and trades, maintain state and trigger outputs at different times, it can answer questions about current P&L at any point in time between t and t+1.
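
In Beam’s vocabulary, this amounts to windowing the unbounded streams by trading day and letting triggers emit results before the window closes. The following is a minimal sketch, assuming an unbounded PCollection of timestamped trade events; the 60-second early-firing interval is an illustrative choice:

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                             AfterWatermark)

def window_by_trading_day(trades):
    """Window an unbounded PCollection of timestamped trade events by trading day.

    Early firings (every 60 seconds of processing time) answer the intraday
    question; the on-time firing at the watermark is the end-of-day answer.
    """
    return trades | "TradingDayWindow" >> beam.WindowInto(
        FixedWindows(24 * 60 * 60),
        trigger=AfterWatermark(early=AfterProcessingTime(60)),
        accumulation_mode=AccumulationMode.ACCUMULATING)
```

The early firings can answer the intraday question at any point between t and t+1, while the on-time firing at the watermark yields the same result a nightly batch job would produce.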

This is the design principle that governs all of the processes being developed for Clear Street’s Risk Platform. Every workflow, from P&L calculations to stress tests and margin calculations, requires the same steps:

  • Identify the streams of data needed for calculations;
  • Continuously process those streams by implementing the calculation requirements in a graph; and
  • Make the results available to a variety of consumers.

Two prominent open-source frameworks that provide implementations of this “streaming first, with batch as a special case” philosophy form the backbone of our risk platform:

  • Apache Beam: provides a unified programming model for processing bounded and unbounded datasets. It also decouples the definition of a data processing workflow from the low-level details of how and where it is run. SDKs are available in several languages, with the Java and Python versions being the most complete at the time of writing.
  • Apache Flink: provides the low-level implementation of one of the runtime backends on which a Beam data processing workflow can be executed. Specifically, it manages the distributed compute environment required to execute the workflow and provides the robust tools necessary for state and time management in complex stream processing applications. A brief sketch of how the two fit together follows this list.
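
To make this concrete, here is a minimal sketch of a Beam pipeline that aggregates trade notionals per account and is submitted to a Flink cluster. The file paths, the flink_master address, and the JSON trade shape are illustrative assumptions, not our production code:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_account_notional(line):
    """Parse one JSON trade record into (account, signed notional)."""
    trade = json.loads(line)
    return trade["account"], trade["quantity"] * trade["price"]

# The same pipeline definition runs locally on the DirectRunner or on a Flink
# cluster by changing only the runner options; no application code changes.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=flink-jobmanager:8081",  # hypothetical cluster address
])

with beam.Pipeline(options=options) as p:
    (p
     | "ReadTrades" >> beam.io.ReadFromText("/data/trades.jsonl")  # placeholder source
     | "ToNotional" >> beam.Map(to_account_notional)
     | "SumByAccount" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda account, total: f"{account},{total}")
     | "Write" >> beam.io.WriteToText("/data/account_notional"))
```

Swapping the bounded text source for an unbounded one such as Kafka turns the same graph into a streaming job; that portability is the “batch as a special case” idea in practice.
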
Data consistency and correctness

One of the primary challenges when designing a risk system is ensuring that input data is consistent throughout the system. For example, consider a client who holds positions in the same stock across multiple trading accounts and wants to see P&L aggregated across those accounts. The prices used to calculate the current market value in each account must be consistent. This requirement is complicated further in a stream processing context, where data is ingested at different cadences and workloads are spread across distributed nodes. It is therefore vital for any stream processing engine to provide guarantees of strongly consistent data and application state, even in the face of application-level and hardware failures.

Flink applications are deployed as a graph in which nodes are operators performing a calculation and edges carry data from one operator to the next. Flink provides strong data consistency guarantees via pluggable state backends (in-memory or RocksDB) that track variable state in each operator, and via exactly-once state consistency achieved through its checkpointing and recovery algorithms. To illustrate how checkpointing works in Flink, consider the following simplified graph of the P&L application:

[Figure: simplified operator graph of the P&L application]

Flink uses a global checkpointing mechanism that sends a periodic signal, known as a checkpoint barrier, to every operator in an application. The following diagram shows how this occurs periodically in the “Calculate P&L” operator:

[Figure: periodic checkpoint barriers arriving at the “Calculate P&L” operator]

The checkpoints at T0 and T1 save the operator’s buffered inputs, variables, and buffered outputs to a persistent store, such as RocksDB. Once all operators in the graph have persisted their state, the checkpoint is finalized and operators continue with their normal functions. Suppose we are currently at T3 and a hardware failure occurs. The operator has only processed prices and trades up to T2 into current P&L, with the events between T2 and T3 remaining in buffers. When Flink detects the failure, it automatically restarts the operator on another available compute resource, restores the last known checkpointed state from T1, and reprocesses events that have not been checkpointed yet. Furthermore, if the sink to which P&L is written provides transactional semantics, Flink can deliver end-to-end exactly-once processing semantics. It is this fault-tolerance mechanism that provides the data consistency and correctness required to run a risk system at scale.
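
Much of this behaviour is configuration rather than application code. As a hedged sketch, a Beam pipeline submitted to the Flink runner might request periodic checkpointing like this (the 60-second interval is illustrative):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hedged sketch: ask Beam's Flink runner to inject a checkpoint barrier every
# 60 seconds. The state backend (e.g. RocksDB) and the exactly-once mode are
# part of the Flink cluster configuration rather than the pipeline code.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--checkpointing_interval=60000",  # milliseconds between checkpoint barriers
])
```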

Processing at scale

Not all calculations in a risk system are the same. For example, running Monte Carlo simulations for VaR calculations or calculating margin on option positions demand more compute resources than P&L calculations. Therefore, it is imperative that the platform can be scaled to meet a variety of workload requirements.

Beam and Flink both provide mechanisms to adjust how many workers can be used to perform tasks in an application. To illustrate how this is achieved, consider how applications are run in a Flink cluster:

[Figure: a Flink cluster with a JobManager and multiple TaskManager worker processes]

A Flink deployment consists of multiple JVM processes. The JobManager is the main controller of the application. TaskManagers are worker processes, each containing multiple slots (threads) that can execute a task.

Once a JobManager receives a submission to run a Flink application, it converts the application into an execution graph where each node represents different operators needed to complete the work. The JobManager then assigns tasks for each of these operators to available task slots on a TaskManager. One key role performed by the JobManager is to determine how to divide tasks for a given operator so that they can be executed in parallel. The level to which this occurs can be configured via a parallelism parameter. In a Beam application that runs on Flink, this can only be set to one value for all operators in the application. For example, setting this parameter to 32 means that each operator used in the application will be executed in 32 different task slots in parallel. However, while setting this parameter will create 32 task slots for each operator in the execution graph, those slots will not be utilized efficiently unless the sources and sinks of the application can also support that level of parallelism.
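
As a hedged sketch, that single parallelism value might be requested through the pipeline options passed to the Flink runner (the value 32 matches the Kafka partition count described below):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# One parallelism value applies to every operator once the Beam pipeline is
# translated into a Flink execution graph: each operator is split into 32
# parallel tasks that the JobManager schedules onto available task slots.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--parallelism=32",
])
```

With, for example, four slots per TaskManager, at least eight TaskManagers are needed to host one operator’s 32 parallel tasks, and those tasks only stay busy if the sources and sinks expose a matching number of partitions.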

To that end, we make heavy use of Apache Kafka to achieve our desired degree of parallelism. Specifically, the majority of our applications use the following high-level architecture:

[Figure: input Kafka topic (32 partitions) → Beam/Flink application → output Kafka topic]

Each application sources its input data from a Kafka topic that has 32 partitions, processes those partitions in parallel, and writes data back to another Kafka topic. We currently utilize a series of 26 such applications chained together to ultimately produce P&L and risk metrics.
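
A hedged sketch of one such stage, assuming Beam’s cross-language Kafka transforms (which rely on a Java expansion service) and illustrative broker, topic, and function names:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.options.pipeline_options import PipelineOptions

def recalculate(record):
    """Hypothetical per-record step; Kafka records arrive as (key, value) byte pairs."""
    key, value = record
    return key, value  # real logic would decode the trade, update P&L, re-encode

options = PipelineOptions(["--runner=FlinkRunner", "--parallelism=32"])

with beam.Pipeline(options=options) as p:
    (p
     # Each of the input topic's 32 partitions can be consumed by its own task slot.
     | "ReadInput" >> ReadFromKafka(
         consumer_config={"bootstrap.servers": "kafka:9092"},  # illustrative broker
         topics=["trades"])                                    # illustrative topic
     | "Recalculate" >> beam.Map(recalculate)
     | "WriteOutput" >> WriteToKafka(
         producer_config={"bootstrap.servers": "kafka:9092"},
         topic="pnl"))
```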

We have only just begun establishing the building blocks, and we have a long way to go. However, we believe the foundations laid down by this platform will serve us well, not only in adapting to the ever-changing landscape of risk management in capital markets, but also in solving many of the other data processing challenges faced by a prime broker.

If you would like to learn more about what we’re building at Clear Street, please contact us. If you are interested in helping us build out a modern tech stack to tackle some really complex problems facing capital markets, please check out our careers page!

Cheers from the Risk Engineering Team,

Zainal Arifin, Dan Goodman, Biao Li, Sang Min Park, Jeeno Pillai, Akshat Raika, Lukas Rees, Ethan Tam, Tim Trautman, and Arvind Vanthavasi

References

i. Marz, N. (2011, October 13). How to beat the CAP theorem. http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

ii. Akidau, T. (2015, August 5). Streaming 101: The world beyond batch. O’Reilly Radar. https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/

iii. Akidau, T. (2016, January 20). Streaming 102: The world beyond batch. O’Reilly Radar. https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/

iv. Akidau, T., Chernyak, S., & Lax, R. (2018). Streaming Systems. O’Reilly. http://streamingsystems.net/
