At VeriSIM Life, we pride ourselves on solving complex problems. After all, we’re working to simulate the entire human body! At the center of our efforts is the BIOiSIM platform: our core simulation technology that accurately predicts how a given drug will behave in a body. What would normally take a lengthy drug trial to determine, we can simulate in minutes. This gives us an unprecedented opportunity to truly scale drug development. Testing more drugs in the current development pipeline would require setting up many costly trials, which are harmful to animals and are sinks for time and money. For us, testing more drugs is simply a matter of computing power. The more simulations we’re able to run, the more insight we can provide to the drug development sector, and the more time and money we save the industry. Scale is the gateway to transforming the drug pipeline, and we are at the forefront of this effort.
As a software engineer, I’ve naturally bumped up against the problem of scale before. During my time at Epic, I collaborated closely with the scalability team to help create systems for hospitals that would support hundreds of thousands of patients or more. One thing I’ve learned over the years is that if you want to significantly scale up a system, things are going to get both complex and tricky. Here at VeriSIM, in the summer of 2019, our demand for simulations reached the point where we needed to make a drastic change.
When I joined VeriSIM at the beginning of 2019, the company had been running on a 4-core physical server and had a backed-up queue of simulation reports that would take a month to run. Back then, our approach was to migrate to the cloud and run our reports on a machine with 96 cores. This cleared the backlog and satisfied our needs… for a time. It shouldn’t come as a surprise that once resources become available, people move quickly to take advantage of them. We generated greater insights from more and more simulations, until our scientists were requesting individual reports that required hundreds of thousands of simulations. At that point, our largest reports could take over a week to run, even on 96 parallel processes! That was the tipping point; we finally had to confront the issue of scale head-on. Our overall goal: build a system scalable far beyond our current needs, so that we could keep stretching our insights past what a single increase in throughput could give us. We set out to do exactly that, but first we needed a target. Ambitious as we are, we set a high bar: improve simulation throughput by 10x within a single quarter.
To some software engineers, scaling past the limits I described may seem as straightforward as setting up a cluster for distributed computing and parallelizing the process further than it already had been. In fact, we did just that. But, as I mentioned earlier, we’re dealing with reports of hundreds of thousands of simulations spread across thousands of drugs, each with a unique set of biologically specific parameters. Every one of these simulations is a full representation of a body over time. Tracking this many human simulations, calculating and retrieving the relevant drug and bodily parameters, and gathering truly actionable insights has never been attempted at this scale. While the computing tools we work with may be familiar, dealing with the layer of complexity underneath is an exciting challenge. Detailing how all that works is understandably sensitive information, but suffice it to say that preserving this level of complexity while scaling up is no small task. I would, however, like to share some of the engineering challenges I encountered along the way, and how I met our goal and then exceeded it.
The easy part of moving to distributed computing was defining and storing our core simulator in a registry. Things got considerably harder when I tried to host simulator copies in a cluster. I initially wanted a serverless solution, which would have let us run simulations without ever worrying about the machines underneath. Initial tests went well, but I soon ran into a hard service-defined limit: we had set our target at 500 simultaneous simulations, and my solution was cut off at 50. I quickly pivoted to a cluster of virtual machines, which meant more setup and maintenance but would allow us to scale. This was an important lesson: always double-check service limits and load test early. A hidden limit can make a solution completely unworkable once it’s expanded.
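The real cap we hit was on the provider’s side, but the lesson is easy to demonstrate. Here is a minimal, purely illustrative load test in Python: `CappedBackend` and its 50-slot cap are stand-ins for a throttling service (not our actual stack), and the test simply fires far more simultaneous requests than the target and counts how many ever run.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class CappedBackend:
    """Toy stand-in for a serverless service with a hidden concurrency cap.
    (The real limit was provider-defined; this class only mimics the behavior.)"""
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self._active = 0
        self._lock = threading.Lock()

    def run_simulation(self, release: threading.Event) -> str:
        with self._lock:
            if self._active >= self.max_concurrent:
                return "throttled"          # the hidden limit rejects the request
            self._active += 1
        try:
            release.wait()                  # hold the slot so requests overlap
            return "ok"
        finally:
            with self._lock:
                self._active -= 1

def load_test(backend, target):
    """Fire `target` simultaneous requests and count how many actually run."""
    release = threading.Event()
    with ThreadPoolExecutor(max_workers=target) as pool:
        futures = [pool.submit(backend.run_simulation, release)
                   for _ in range(target)]
        # Each request either claims a slot or returns "throttled" immediately,
        # so wait for all the rejected ones to finish, then release the rest.
        while sum(f.done() for f in futures) < target - backend.max_concurrent:
            time.sleep(0.01)
        release.set()
        results = [f.result() for f in futures]
    return results.count("ok"), results.count("throttled")

ok, throttled = load_test(CappedBackend(max_concurrent=50), target=500)
print(ok, throttled)  # 50 450 -- only 50 of the 500 requests ever ran
```

A test like this, pointed at the real service, surfaces the hard limit in minutes rather than after weeks of building on the wrong foundation.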
I also needed to scale the simulator count up and down based on what each report required. In this case, I performed a load test early, and I’m very glad I did. At first I wanted to autoscale: to let the cloud decide how many simulators would run based on the number of simulation requests at any given time. Unfortunately, two issues got in the way. First, the simulation request count was reported with a long delay, which made the scaling response just as sluggish. And even with faster reporting, the scaling speed the cloud offered was still far too slow to adjust mid-report. So I decided to scale manually, before and after each report. This let us directly control how much simulation power we had available and adjust it based on need. Sometimes, the simplest approach really is best.
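The scale-up-then-down pattern is simple enough to sketch. The snippet below is illustrative, not our production code: `SimCluster` is a toy stand-in whose `set_node_count` would, in a real system, call the cloud provider’s cluster-resize API, and the context manager guarantees we scale back down even if a report fails midway.

```python
from contextlib import contextmanager

class SimCluster:
    """Toy stand-in for a VM cluster; in production, set_node_count would
    call the cloud provider's resize API (hypothetical placeholder here)."""
    def __init__(self):
        self.node_count = 0

    def set_node_count(self, n):
        self.node_count = n

@contextmanager
def scaled_for_report(cluster, nodes_needed, idle_nodes=0):
    """Scale up just before a report and always scale back down after,
    even if the report raises -- idle simulators are pure cost."""
    cluster.set_node_count(nodes_needed)
    try:
        yield cluster
    finally:
        cluster.set_node_count(idle_nodes)

cluster = SimCluster()
with scaled_for_report(cluster, nodes_needed=500):
    peak = cluster.node_count   # 500 simulators available during the report
after = cluster.node_count      # back to 0 once the report is done
```

The `try`/`finally` is the whole point of the design: manual scaling is only safe if scaling down is unconditional.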
With the infrastructure in place, I then had to overhaul the entire code base of our reports for the new distributed system. This meant sending all simulation requests as batches, returning logs and errors from the distributed machines, and making sure every request would be handled and retrieved. Using multiple batch requests was crucial; it let me balance requests across all available servers in the cluster, greatly reducing wasted computing power during report execution. The most important task, however, was making the system as resilient as possible. Because there would be so many simulations running at once, I chose the cost-effective option of machines that could be removed from the cluster at any time; no individual server was guaranteed to stay online. Even though most machines stick around, every simulation result must come back for every report. To solve this, I put a system of retries in place, ensuring that network issues and machine failures would never mean a simulation was lost for good. We want all of our systems to be reliable and robust, and this resilience logic went a long way toward making that happen in our scaled system.
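Our actual retry logic is internal, but the shape of the idea can be sketched. In this illustrative version, `run_simulation` is a placeholder for the call that dispatches one request to the cluster, the exception types and backoff constants are assumptions, and the `flaky_sim` backend simulates machines dropping out mid-report:

```python
import random
import time

def run_with_retries(requests, run_simulation, max_attempts=5):
    """Retry failed simulation requests with jittered exponential backoff,
    so a preempted machine or dropped connection never loses a result.
    `run_simulation` is a hypothetical dispatch call, not a real API."""
    results = {}
    pending = list(requests)
    for attempt in range(max_attempts):
        failed = []
        for req in pending:
            try:
                results[req] = run_simulation(req)
            except (ConnectionError, TimeoutError):
                failed.append(req)  # transient: machine gone or unreachable
        if not failed:
            return results
        time.sleep(min(2 ** attempt, 30) * random.random())  # back off, with jitter
        pending = failed
    raise RuntimeError(f"{len(pending)} simulations never completed")

# Simulated backend: a third of first attempts hit a "removed" machine.
attempts = {}
def flaky_sim(req):
    attempts[req] = attempts.get(req, 0) + 1
    if attempts[req] == 1 and req % 3 == 0:
        raise ConnectionError("machine preempted mid-simulation")
    return req * 2

results = run_with_retries(range(9), flaky_sim)
print(results == {req: req * 2 for req in range(9)})  # True: nothing was lost
```

Only the failed requests are retried, so a handful of lost machines costs one extra round trip rather than a rerun of the whole batch.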
To put this all together: in a single quarter, I used distributed computing and robust parallelized logic to scale even further than our original goal. We wanted a 10x improvement in simulation throughput, and our final tests showed that we hit 14x! The difference was night and day: reports that would have taken over a week on the old server could now run overnight. But success wasn’t enough; I wanted to blow expectations out of the water. In the following months, I tuned the implementation further and improved throughput by an additional 3x. All told, the improvement was a whopping 42x! (How often does anyone get to post that kind of statistic?)
These numbers are fantastic, but the most important thing is that the new system is truly scalable. Our current needs max out at 500 simultaneous simulations, but we can scale up to thousands with a few keystrokes. I have identified efficiency improvements that would push our baseline throughput even higher without adding machines to the cluster. And I designed the system to be easily replicable, so that we can run multiple scaled reports simultaneously. This is the beginning of a sea change for drug development. As this system becomes more available to the industry, the cost of preclinical trials will become harder and harder to justify. We envision a future where the performance of millions of candidate drugs can be evaluated at the push of a button, and I am proud of my contributions toward making that happen.