From COBOL to the Cloud: Q&A With Product SME Mohammed Arshad
This isn’t the first time we’ve built a clearing service. In fact, some of our most experienced team members, such as Mohammed Arshad, a highly accomplished Clear Street engineer, have been building clearing systems from the very beginning, bringing nearly half a century of experience to our endeavor. Mo codes the posting rules for our clearing engine. He also guides the design and development of the core clearing system.
In this Clear Street Q&A, we talk with Mo about how he moved clearing from COBOL to the cloud — and from independent services to monolithic ones and back again. Mo lays out how the industry has benefited (and learned) from each stage of innovation, and how Clear Street’s combination of microservices and messaging will revolutionize fintech.
What was the earliest clearing system you helped build?
The original version we had was written in COBOL. This was a traditional clearing system in the late 80s, very similar to what the big banks use. The companies that have been on the street the longest built their systems at around the same time, a little earlier than when I built my first version. These were all based on mainframes and were built with a closed mindset: a siloed set of functions, each very narrow in scope.
For example, we had one system that did trade capture, and that’s all it did. It didn’t understand anything about what that trade meant downstream, or what had happened previously in the process. This one system got trades, it stored them, and it made them available downstream. That’s all it did. The settlement engine was a totally independent piece that would then use the data that the trading book guy had built, eventually via databases, but at first just using flat files.
The next part of the clearing process was what we call the matching piece of clearing: matching the trades against what was known on the street. Traditionally this was called purchase and sales. The P&S team would have their own system, usually written by a different team than the one that wrote trade capture, meaning the two systems didn’t always see eye to eye. You ended up needing what was called a reconciliation. The trade capture system would say, “Hey, I got a hundred trades today in our systems alone.” The P&S system would reply, “I got 99.” There was always additional work to make sure that all the pieces in the process were talking to each other and not dropping anything along the way.
And the complications didn’t end there. Then there would be the cage system, which talked to DTCC and did the actual settlements with the real world. That, too, would be a separate system, requiring yet another reconciliation with the P&S system. So we’re talking about at least three systems. Beyond that there was all the reporting, the regulatory systems, and everything else. Each was built as a separate self-contained system. In some cases they even had their own separate security masters, which was highly inefficient. There were so many reconciliations. That was the first iteration of automated clearing systems. They were very fragmented, with a lot of gaps in between them, and they were very unwieldy.
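To make that reconciliation work concrete, here is a minimal, hypothetical sketch of the kind of comparison an operations team would run between a trade capture extract and a P&S extract. The trade IDs and record layout are invented for illustration; real reconciliations ran over flat-file dumps from each siloed system.

```python
# Hypothetical end-of-day reconciliation between two siloed systems.
# Trade IDs are made up for illustration.
trade_capture_ids = {"T001", "T002", "T003", "T100"}  # trades the capture system booked
p_and_s_ids = {"T001", "T002", "T003"}                # trades the P&S system matched

booked_but_unmatched = trade_capture_ids - p_and_s_ids
matched_but_unbooked = p_and_s_ids - trade_capture_ids

print(f"Trade capture: {len(trade_capture_ids)} trades, P&S: {len(p_and_s_ids)} trades")
for trade_id in sorted(booked_but_unmatched):
    print(f"Break: {trade_id} was booked but never matched")
for trade_id in sorted(matched_but_unbooked):
    print(f"Break: {trade_id} was matched but never booked")
```

Every pair of adjacent systems needed its own version of this comparison, which is where much of the operational overhead came from.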
When it was time to build the next generation of clearing systems, what changed?
The next iteration began with the understanding that the prior fragmentation made zero sense. What value was there to all those different data sources? Why would anyone want to have all these different little systems that all have to talk to each other and communicate through files and databases? The second iteration was supposed to be the silver bullet: Let’s have a central database that is the single source of truth for all the data. In theory, it’s a great idea. Trade capture writes to the same database that the P&S matching system works off of, and that the cage system interacts with, et cetera, et cetera.
This was the end of reconciliation, and of fragmentation. That was a huge leap, and it got us a long way. Systems from that era still work to this day. The last iteration that I did is still running, and it’s going strong. That system does 10 million trades a day without missing a heartbeat. It’s good in terms of its functionality. The big drawback is you’ve got this huge monolith of a system. Where you had small fragmented systems that were all talking separately, now you have this big monolith, which is a monster. It’s all living on one big server, and when you need it to manage more trades, you have to put it on a bigger and bigger and bigger machine. It’s not very scalable.
So, the silver bullet of a single unified system in this second generation of clearing wasn’t what it seemed?
Precisely, because big monolithic systems tend to fall down. They fail because all these operations are in one place, on one machine, contending with each other. There’s a big maintenance problem with that. You have this one big application, and if you touch one part of it, you’re very likely to break something else, something you have no intention of breaking. It becomes unwieldy from a developer’s point of view. From a business perspective, it stops you from growing. Why? Because if I have a new business that I want to add on — for instance, if I’m doing equities and I want to start to trade options — then I’ve got to figure out how to cut a hole in this big monolith and somehow insert some new code in there. I’m bolting things onto it. If I keep bolting things onto it, eventually it just becomes a Frankenstein.
It sounds like the pendulum went too far, from the fragmented version to the monolith. What did you build next?
This is what we’re doing at Clear Street. The new architecture that we’re building with microservices allows you to make small functional units that do something very specific, but they all talk in a common language and share a common repository in terms of the data that they update. Microservices is the first part of this new approach, and the common language, what we call messaging, is the second part.
There can be a little service that’s going to do trade capture, and this service’s job in life is to be able to talk to all different types of external sources to pull in trade data, capture it in a way that normalizes it, and then pass it down on a message stream. It screams that down the line saying, “Hey, here’s a trade.” There’s a normalized version of what a trade looks like, and any service downstream of that can do anything it wants with that data. The service that’s going to do the matching with NSCC can listen to that same stream, and its job in life is just to match that trade against what NSCC knows. It listens to the same trade stream that the trade capture service is pouring out, and once it’s matched them it’s going to feed that back on another stream saying, “Hey, these are all matched.” Anybody interested in which trades are matched can listen to that stream and get them from there.
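As a rough sketch of that flow (not Clear Street’s actual code; the message bus, topic names, and trade fields here are assumptions made for illustration), a trade-capture service normalizes whatever it pulls in and publishes it to a trade stream, while a matching service subscribes to that same stream and publishes its results onto another one:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Trade:
    # Hypothetical normalized trade; a real schema would carry far more fields.
    trade_id: str
    symbol: str
    quantity: int
    amount: float

class MessageBus:
    """Stand-in for a real streaming platform (e.g. a Kafka-style log)."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message) -> None:
        for handler in self._subscribers[topic]:
            handler(message)

def trade_capture_service(bus: MessageBus, raw_trade: dict) -> None:
    # Normalize an external-format trade and push it onto the "trades" stream.
    trade = Trade(raw_trade["id"], raw_trade["sym"],
                  int(raw_trade["qty"]), float(raw_trade["amt"]))
    bus.publish("trades", trade)

def matching_service(bus: MessageBus, nscc_view: set[str]) -> None:
    # Listens to the same trade stream and emits matched trades on its own stream.
    def on_trade(trade: Trade) -> None:
        if trade.trade_id in nscc_view:
            bus.publish("matched-trades", trade)
    bus.subscribe("trades", on_trade)

bus = MessageBus()
matching_service(bus, nscc_view={"T001"})
bus.subscribe("matched-trades", lambda t: print(f"Matched: {t}"))
trade_capture_service(bus, {"id": "T001", "sym": "ABC", "qty": 100, "amt": 4200.0})
```

The point of the sketch is the shape of the communication: each service only knows the streams it reads and writes, not the internals of its neighbors.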
What this gives us is the ability to have structured communication between all our services, and it allows us to make our services nice and small. The logic within a given service can be upgraded, and as long as the interface coming into it and the interface going out of it remain constant, everything is fine.
Upgrades are now easy. Say the NSCC service previously did trade matching on quantity and amount, but it’s due to be upgraded to match on price, as well. As a developer, I can go into that little service and I can change a piece of code to accomplish that, without touching anything else in the overarching system or in any of the other functions. I can add on options, with very minimal ripple effect throughout the system.
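A hedged sketch of what that kind of isolated upgrade can look like (the field names and price tolerance are hypothetical, not the actual matching rules): the matching function’s interface, two trades in and a yes/no out, stays the same while the criteria gain a price check, so nothing else in the system has to change.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trade:
    trade_id: str
    quantity: int
    amount: float
    price: float

# Version 1: match on quantity and amount only.
def matches_v1(ours: Trade, nscc: Trade) -> bool:
    return ours.quantity == nscc.quantity and ours.amount == nscc.amount

# Version 2: the upgraded rule also compares price. The interface
# (two trades in, a boolean out) is unchanged, so nothing upstream or
# downstream of the matching service has to know the logic changed.
def matches_v2(ours: Trade, nscc: Trade, price_tolerance: float = 0.01) -> bool:
    return (
        ours.quantity == nscc.quantity
        and ours.amount == nscc.amount
        and abs(ours.price - nscc.price) <= price_tolerance
    )
```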
That covers functionality. What happens, though, when more capacity, or less capacity, is desired?
I’m describing the architecture that we’re using, which solves a couple of the problems that we had in our first two iterations. The other main problem that both our initial iterations had was they were all housed on internal servers, on internal hardware. You had to own the hardware, you had to own the data centers, and you had to maintain it all. As you grew your business, you needed to get bigger and bigger and bigger machines, right? More real estate, more power, more cost, right? And then once you had that, you were stuck with it. If you start doing less traffic or you want to downsize in some way, it’s not like you can just get rid of those servers. You bought them, so they’re yours.
Also, every two or three years, the processors change, and you have to go through this whole elaborate conversion effort where you’re bringing in new hardware, you test it, you make sure it’s working, you run in parallel for a year, and then you switch over. So the whole cycle over to new hardware becomes its own big project, and a big resource hog.
The answer to this, the third part of the architecture we’re building, is not just the microservices with messaging; we’re also hosting it all on a cloud service. With cloud hosting, we can quickly spin up more resources as we need them, and spin them back down when we don’t. We can scale up and down quickly.
The sort of scaling that matters is when you suddenly have a spike in volume, which we’ve had in this last year or so. In a traditional non-cloud implementation, you’re going to be scrambling, rushing in new hardware. There’s a risk to doing that, because new hardware doesn’t always work right off the bat. Housing microservices entirely in the cloud means we can expand our footprint very quickly to handle spikes, and then scale back down so we save money. That’s the third ingredient in what the new iteration of a clearing system looks like.
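The elasticity Mo describes often comes down to a simple scaling rule. The sketch below is a hypothetical one (the throughput figures and instance limits are made up): decide how many instances of a consuming service to run from how far behind its stream it has fallen, and let the same rule scale back down once the spike passes.

```python
def desired_instances(backlog: int, per_instance_throughput: int,
                      min_instances: int = 1, max_instances: int = 50) -> int:
    """Hypothetical autoscaling rule: run enough instances to clear the
    backlog, within a floor and ceiling, and shrink again when volume drops."""
    needed = -(-backlog // per_instance_throughput)  # ceiling division
    return max(min_instances, min(max_instances, needed))

# A quiet day needs one instance; a volume spike scales out, then back in.
print(desired_instances(backlog=5_000, per_instance_throughput=10_000))    # 1
print(desired_instances(backlog=400_000, per_instance_throughput=10_000))  # 40
```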
Can you give a number to how many independent microservices we’re talking about?
So we’re at the mid-level right now. I think we’ve got 30 to 50 microservices. Once we’ve got all the functionality, that’s going to go to somewhere around the 100 mark, maybe more. We’re not talking thousands. To clarify, I’m talking about a hundred or so different specific microservices, each doing its function, its task. What the cloud allows you to do is spin up multiple instances of that service. You might have tens or hundreds of instances of specific services if you need that volume of throughput.
You used a similar word to describe both the early, fragmented approach and the later, monolithic approach: “unwieldy.” Can you look ahead and foresee what issues the cloud microservices approach might face?
One thing we’ve come across is microservice bloat. That’s where you start building a microservice, and you realize it’s got to do too many things. At some point you have to recognize it’s becoming unwieldy, and break it up. If you don’t keep an eye on it, it can sneak up on you. You have to recognize when it’s time to break a microservice into two or more distinct microservices.
For context, can you briefly go back further in time and tell the story of what the COBOL era supplanted?
Paper. That’s what COBOL fixed, or at least helped address. People had paper sheets and they would exchange papers between each other. Then they would compare those, and there were adding machines, and they would write down the results. And paper persists. It continued through the COBOL era, and even today physical stocks still exist, though not as widely as they used to. You still have these people on the street called runners who actually have pouches containing the stocks. They run down to the DTC window, and actually hand over the stocks in person. Somebody sits there and counts the stocks and ticks them off. Before we had the computerized version of clearing systems, that’s how all stocks were handled. There was a big ledger where they kept track of everything.
Back then we had a two-week settlement cycle due to all the literal paperwork. And back then, the only place this was happening was actually on Wall Street. People were physically in the same vicinity, so they could easily exchange paperwork. You can’t scale that, right? It doesn’t matter how many clerks you have; the process just won’t go any faster.
Even on the clearing side, today, there is still physical stock, which has to be put in a vault. One of the regulatory bodies has a rule that, to this day, you have to do a quarterly vault count. Back at my old firm, we had a really big vault, packed with physicals. We were one of the few firms that could manage the work. We called it “all hands on deck.” Everybody would sit at their desk, the contents of the vault would be taken out, and then they’d dump a bunch of stocks on your desk and you would write it all down. Then somebody would go around, collect everybody’s counts, tally them, and compare the total to the electronic stock record. They would have to make sure that what was physically in the vault and what you had recorded aligned. The chief compliance officer would then sign off on that vault count and submit it to the SEC, saying, “Yep, we counted them physically and they all match.”
We’ve come a long way.