Model-Based Runtime Monitoring of Distributed Systems with TLA+

Project

Model-Based Runtime Monitoring of Distributed Systems with TLA+

Principal Investigator

Inria

Oracle Principal Investigators

Calvin Loncaric
Ron Pressler

Summary

One of the costliest elements of software development — second only to finding out what to build — is ensuring the system does what we want it to do. That is why developers spend much of their time on quality assurance through code reviews, testing (functional, performance, security), and a great deal of bug-fixing. Some bugs are costlier than others because:

They are difficult to reproduce during the development phase.

They are design flaws that require significant rework to fix.

They cause significant damage, such as downtime, data loss, or security vulnerabilities.

Because they are non-deterministic and complex, concurrent and distributed systems tend to suffer from bugs that are costly for all of those reasons.

TLA+ is a specification language and a set of accompanying tools that offer a cost-effective way of finding and fixing design flaws early in development, when those flaws are cheaper to fix. Users write a precise and concise TLA+ description of the system and of their expectations of its behavior, and the tools then check whether those expectations hold and, if not, show how they can be violated.
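To make the idea concrete, here is a purely illustrative sketch of what such a description can look like: a toy specification of a bounded counter together with the expectation that it never exceeds its bound. The module name Counter, the constant Limit, and the invariant CountWithinLimit are hypothetical examples, not artifacts of this project.

    --------------------------- MODULE Counter ---------------------------
    EXTENDS Naturals

    CONSTANT Limit        \* bound on the counter, e.g. 3 when model checking
    VARIABLE count

    Init == count = 0

    \* The only action: increment the counter while it is below the bound.
    Incr ==
        /\ count < Limit
        /\ count' = count + 1

    Next == Incr \/ UNCHANGED count    \* the system may also idle

    Spec == Init /\ [][Next]_count

    \* The expectation: the counter never exceeds its bound.
    CountWithinLimit == count <= Limit
    =======================================================================

Given a concrete value for Limit, the TLC model checker explores every reachable state of Spec and either confirms that CountWithinLimit holds in all of them or reports a behavior that violates it.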

TLA+ is particularly attractive for industry use compared to other formal methods. Its unique strengths in specifying concurrent and distributed systems make it applicable precisely where mainstream software needs help the most. It can specify arbitrary properties of systems of arbitrary complexity, yet "ordinary" developers can learn it in a few weeks. Automated verification with a model checker makes the cost of verification low enough to compare favorably with traditional assurance methods.

For those reasons, TLA+ is among the most popular formal methods in industry, employed successfully by Amazon, Oracle, Microsoft, Elastic, CrowdStrike, MongoDB, and others. A technical report from Amazon relates how engineers learned and applied TLA+ in just a few weeks and then found and fixed potentially catastrophic, hard-to-reproduce problems in several AWS services.

At Oracle, TLA+ has found dozens of potential critical data-loss bugs in OCI systems before they reached production (https://confluence.oci.oraclecorp.com/display/VT/Verification%3A+Projects+and+Results). However, while TLA+ assists in the design and development stages, the implementation may diverge from the design captured in the specification: it is left to the implementors to go back to the verification engineer and have the specification adjusted whenever they believe they have made an important change to the code, and implementation bugs introduced in the translation from specification to code can slip through unnoticed.

The purpose of this research is to tie a running production system to its TLA+ specification by monitoring its execution at runtime and warning when the execution diverges from the specification. Such monitoring can detect bugs before they manifest, or help pinpoint their root cause and so reduce the time needed to fix them. Either way, it reduces the cost of maintaining cloud infrastructure software and improves its correctness, by preventing catastrophic bugs or by reducing their cost.
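As a minimal sketch of the idea, assume the running implementation of the hypothetical Counter example above logs the observed counter value after every observable step. Conformance of that log to the specification can itself be stated in TLA+; the module CounterTraceCheck, the constant Trace, and the operators AllowedStep and TraceConformance below are illustrative assumptions, not a design produced by this project.

    ---------------------- MODULE CounterTraceCheck ----------------------
    EXTENDS Naturals, Sequences

    \* Trace is a hypothetical runtime log of observed counter values,
    \* e.g. <<0, 1, 2, 3>>, collected by instrumenting the implementation.
    CONSTANTS Limit, Trace

    \* The specification's transition relation, restated over explicit
    \* before/after values so it can be applied to consecutive log entries.
    AllowedStep(c, d) ==
        \/ c < Limit /\ d = c + 1    \* an Incr step
        \/ d = c                     \* no observable change

    \* TRUE iff the entire log is a behavior the specification allows.
    TraceConformance ==
        /\ Trace[1] = 0
        /\ \A k \in 1 .. Len(Trace) - 1 : AllowedStep(Trace[k], Trace[k + 1])
    =======================================================================

A monitor would evaluate a predicate like TraceConformance over the log as it grows and raise a warning at the first logged step that the specification does not allow.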