Hot Swapping a Bank Part 3: Deployment and Life Lessons
BK is Clear Street’s proprietary transaction processing system, used to process more than $3 billion in daily trading volume. BK is the newer, faster version of our legacy Book Keeping system, Bank. Where Bank suffered from low throughput, rigid data modeling, and messy points of extension, BK is more agile and scalable, built to support our growing business.
In part 1 of this series, I outlined how we identified areas for improvement within Bank and used a data validation system to ensure we were solving those issues when we built BK. In part 2, I covered how we approached compatibility layers and data migration. After multiple rounds of validating, integrating, and migrating, we were finally ready to deploy our new core transaction processing system. In this third and final part, I’ll walk through the final release and the lessons we learned along the way.
It’s Go Time: Deployment Day
Before deployment day, we fully scripted what needed to happen upon BK’s release and thoroughly documented each step in a single Release Script wiki page. Clear Street’s engineering and operations teams reviewed this document end-to-end and held multiple dry runs in our pre-production environment.
The Release Script included commands that must be run, services that needed to be scaled down, links to code changes that had to be merged at certain points, SQL scripts to verify things like replication lag and resetting of database sequences, and a laundry list of other items to make sure deployment would go off without a hitch.
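A check like "verify replication lag" can be made mechanical rather than eyeballed. Here is a minimal sketch of the decision logic, assuming a PostgreSQL-style replica (the query, threshold, and function names are illustrative; the post doesn't specify BK's actual database or tooling):

```python
# Hypothetical pre-deployment check: is the replica close enough to the
# primary to safely proceed? The SQL and threshold are assumptions for
# illustration, not Clear Street's actual release tooling.

REPLICATION_LAG_QUERY = """
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
"""

def replication_lag_ok(lag_seconds: float, max_lag_seconds: float = 5.0) -> bool:
    """Return True if replication lag is within tolerance for the release."""
    return lag_seconds <= max_lag_seconds

# In a real release script, lag_seconds would come from running the query
# against the replica; here we just exercise the go/no-go decision.
assert replication_lag_ok(0.8)       # healthy replica: safe to proceed
assert not replication_lag_ok(42.0)  # replica far behind: halt the release
```

Encoding each verification as a boolean check means a step can never be silently "mostly fine" — it either passes or it stops the release.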
The script also included a point of no return, when BK would be released and we had to commit to it being live. We triple-checked everything to avoid any last-minute changes. It was critical that by the time we reached the point of no return, we were comfortable that the system was healthy and in its correct state. Specifically, we had to verify that ledgers were matching and that open trades and settlements were sitting in the right places.
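The "ledgers were matching" verification boils down to a reconciliation: every account must show the same balance in Bank and in BK, with no account present in only one system. A minimal sketch of that idea, with invented account names and data (the real check ran against production systems):

```python
# Hypothetical point-of-no-return check: diff account balances between
# the old system (Bank) and the new one (BK). Names and data are
# illustrative only.

from decimal import Decimal

def ledger_mismatches(bank: dict, bk: dict) -> dict:
    """Return {account: (bank_balance, bk_balance)} for every disagreement,
    including accounts that exist in only one of the two systems."""
    accounts = bank.keys() | bk.keys()
    return {
        acct: (bank.get(acct), bk.get(acct))
        for acct in accounts
        if bank.get(acct) != bk.get(acct)
    }

bank_ledger = {"ACC-1": Decimal("100.25"), "ACC-2": Decimal("-40.00")}
bk_ledger   = {"ACC-1": Decimal("100.25"), "ACC-2": Decimal("-40.00")}

# An empty mismatch map is the signal that it's safe to commit to BK.
assert ledger_mismatches(bank_ledger, bk_ledger) == {}
```

Using `Decimal` rather than floats matters in this domain: financial balances must compare exactly, not approximately.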
I wish I had some screenshots or recordings of the deployment. We had ~20 people on a Google Meet call, sharing their screens and confirming the state before and after executing each step of the script. It felt like we were landing on the Moon! It was a sight to behold.
Releasing software this big, complicated, and intertwined with the whole business is… a big deal, so how did Clear Street get it done? We executed multiple rounds of testing with the tools described earlier in this series, we worked hand-in-hand with our wonderful Operations team, and at the end of the day, we had a LOT of support across the business to work through the Release Script.
So, What Did We Learn?
Major lessons from this journey were:
- Align your organization’s resources to make sure you have everything you need to mitigate risk and hit your dates, and in turn, be successful. Migrations are hard, like super hard. Every time. In Clear Street’s case, we had full support across the company. Our CEO, Chris Pento, understood the long-term timeline; the Operations team was available to resolve subject-matter questions and operational concerns, and even to validate the system; and all of the downstream engineering teams provided their feedback and expertise.
- Account for the “unknown unknowns.” In a complex system, it is unlikely that any one person will have a deep and comprehensive understanding of how everything works. It is even more unlikely that the knowledge can be effectively transferred without a good amount of effort. Another thing to consider is that when you work at a smaller scale, all problems seem easier, but once you move to production-sized loads, problems grow in complexity. All of these factors lead to “unknown unknowns,” problems you’ll have to solve that you cannot anticipate.
- Have a complete understanding of how the system you are replacing fits into your business processes and its downstream impact. We found success when we had a holistic view of all the upstream and downstream dependencies of the system we were replacing. We also found a lot of pain when we didn’t fully understand downstream use cases for the data produced by Bank.
- Validate and then validate again. When you replace system A with system B and want their outputs to match, the first question to answer when building system B is: how do I know if this thing works? The complexity of answering that question varies by domain, and so does the risk profile. In our case, both complexity and risk were high, so we invested heavily in our validation stack.
- Have a script. In our migration execution plan, we wrote a comprehensive script that documented, step by step, what every stakeholder needed to do for a successful launch. Action items included things like “scale down service A” and “verify replication lag by running this query on DB A.” This should be a living document, iterated on constantly. We did several dry runs of the migration in one of our lower environments and found many areas that needed improvement. By the end, the script had eliminated a lot of human error. You’re unlikely to forget to “scale down service A” if it’s spelled out in the script.
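The "have a script" lesson can be pushed one step further: encode the steps as an ordered list of verifiable checks that halts at the first failure. A minimal sketch, with invented step names (our actual Release Script was a wiki page executed by people on a call, not a single program):

```python
# Hypothetical release-script runner: each step carries a description and
# a verification that must pass before moving on. Step names are invented
# for illustration.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    description: str
    check: Callable[[], bool]  # returns True once the step is verified

def run_release(steps: List[Step]) -> List[str]:
    """Execute checks in order; raise at the first failure so nothing
    downstream runs against an unverified state."""
    completed = []
    for step in steps:
        if not step.check():
            raise RuntimeError(f"Halt release: failed at '{step.description}'")
        completed.append(step.description)
    return completed

steps = [
    Step("scale down service A", lambda: True),
    Step("verify replication lag on DB A", lambda: True),
    Step("reset database sequences", lambda: True),
]
assert run_release(steps) == [
    "scale down service A",
    "verify replication lag on DB A",
    "reset database sequences",
]
```

Even when the steps stay manual, writing them in this shape (description plus an explicit pass/fail check) is what made our dry runs effective: every ambiguous step surfaced as a check nobody could define.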
Success in large-scale migrations depends on a well-thought-out plan and incredible thoroughness. Hot swapping Clear Street’s core system with a completely different one was a long and challenging journey, but it was absolutely worth it. We are now in a much better position to keep rapidly scaling our business, on a much more solid foundation.
If solving complex, highly rewarding problems interests you, check out our careers page!