Constraint solving plays a major role in the effectiveness and efficiency of software verification tools. However, the state-of-the-art constraint solvers available today cannot directly scale to handle the constraints generated by program analysis techniques, such as symbolic execution. While program analysis generates a large number of relatively simple queries (but which are potentially correlated), today's constraint solvers are optimized for dealing with single large queries. Thus, naively plugging the available constraint solvers into program analysis tools results in duplicate work and eventually hitting a lower-bound in processing time, dictated by the overhead of manipulating the queries.
We argue that we can leverage systems techniques (caching, parallelization, layering, etc.) to build constraint solving systems that are better suited for program analysis queries. The basic building blocks of such systems would be the individual solvers that we have today. The system would then be capable to transparently handle in an efficient manner large batches of constraints at a time, and exploit the similarities between them.
We propose to offer this system as a service to its users (i.e. program analysis tools), to reduce coupling and promote reusability. Moreover, the system would be easily deployed in a distributed system, where multiple nodes may invoke the services of the constraint solver in parallel.
We describe the proposal in more details below.
1) Constraint-Solver-as-a-Service (CSaaS)
We need to offer the constraint solver as an external service (e.g. web service), which can be transparently upgraded to the benefits of all of its users. This is also in the spirit of the way Google offers most of its services from within the browser, running on their infrastructure.
Why a service? Constraint solving/theorem proving is a hard problem, but fortunately there is intense research work going on, resulting in faster solvers appearing each year. In this regard, we believe that the current approach of bundling a solver with the application requiring it does not scale well. This is because:
(a) It is inconvenient to replace one solver implementation with a more recent version, and even harder when switching from one solver to a totally different one. Moreover, one would need to ensure the new combination does not introduce any bugs, caused by ambiguities in the interface and mutable state.
(b) Some projects (e.g. Klee / Cloud9) work around solver limitations by building a constraint solving stack that first performs some domain-specific optimizations on the query, and then sends the query to the solver, if still necessary. This fragments the innovation in constraint solving, by keeping the constraint solving stacks separate from each other.
By using CSaaS, an application wouldn't need to maintain its own solver stack anymore. It would just send queries to theservice, and wait for a reply. Since the latency of this system may be significantly larger than running a solver locally, it also makes sense to bundle (batch) more constraints in a single request. In the case of symbolic execution, this would work well, because there are usually many execution states that can be explored independently at a time. There are also some hybrid possibilities, where a client library does first a minimum amount of processing, in order to filter the trivial constraints and further reduce the communication overhead.
Finally, a service integrates better with a parallelized application (such as Cloud9). It is easier to have all constraint solving in an internal service, because it may reduce the need for communication among the application
servers. They just use CSaaS which then, magically, takes results from other processing queries into account.
2) Constraint Metadata
Switching from a domain-specific constraint solving stack to a general CSaaS would likely incur slowdowns for certain queries, unless the service finds out the domain-specific information about each query. In this regard, each CSaaS query should be allowed to contain metadata information that would help the service to use the best heuristic for that query, to make efficient use of caching, etc.
For instance, in the domain of symbolic execution, constraints often build one on top of another. Path constraints are solved for a certain branch point, then new a branch constraint are added, the new expression is solved for one more time, and so on. The service should be able to return query identifiers to the application; they can be later mentioned in the metadata of a new constraint, saying that the new formula is an extension of the previous one.
A number of unanswered design questions (we should discuss):
1. This is a
pretty fundamental design question: how to handle mutable state (how to store past queries and their results). If CSaaS provides central storage, then there are questions of coordinating, and consumption of shared resources. Coordinating internally (to parallelize something) and providing an external interface (like query ids) to this coordination are different, the latter is usually delicate since one is then building some sort of database API.
2. What should the lifetime of such results be? Are there "sessions" that can last hours, or days and are shared by different users of the application. Or is it enough for a
session to be equivalent to one invocation of the application.
3. It may be good to use an existing database API (I'm using the word database API very liberally here, using HTTP and callback URLs would be fine) and then design an interface relative to that, instead of committing to CSaaS providing storage as
well. In this approach CSaaS would be a
purely functional interface and if it uses any internal storage, that would be just for optimization.