Verifying Properties of Well-Founded Linked Lists Shuvendu K. Lahiri Microsoft Research shuvendu@microsoft.com Shaz Qadeer Microsoft Research qadeer@microsoft.com Abstract We describe a novel method for verifying programs that manipulate linked lists, based on two new predicates that characterize reachability of heap cells. These predicates allow reasoning about both acyclic and cyclic lists uniformly with equal ease. The crucial insight behind our approach is that a circular list invariably contains a distinguished head cell that provides a handle on the list. This observation suggests a programming methodology that requires the heap of the program at each step to be well-founded, i.e., for any field f in the program, every sequence u.f, u.f.f, . . . contains at least one head cell. We believe that our methodology captures the most common idiom of programming with linked data structures. We enforce our methodology by automatically instrumenting the program with updates to two auxiliary variables representing these predicates and adding assertions in terms of these auxiliary vari- ables. To prove program properties and the instrumented assertions, we provide a first-order axiomatization of our two predicates. We also introduce a novel induction principle made possible by the well-foundedness of the heap. We use our induction principle to derive from two basic axioms a small set of additional first-order axioms that are useful for proving the correctness of several pro- grams. We have implemented our method in a tool and used it to verify the correctness of a variety of nontrivial programs manipulating both acyclic and cyclic singly-linked lists and doubly-linked lists. We also demonstrate the use of indexed predicate abstraction to automatically synthesize loop invariants for these examples. Categories and Subject Descriptors: D.2.4: Software/program verification General Terms: Reliability, security, languages, verification Keywords: First-order axiomatization, well-founded linked lists, decision procedure, automated theorem proving, heap abstraction 1. Introduction Program verification has made enormous progress in the last few decades. A significant contribution to this progress has been made by reasoning techniques based on first-order logic. For a number of programs, the verification condition whose validity establishes the correctness of the program is expressible in a combination of first order theories such as arithmetic, uninterpreted functions, and propositional logic. The pioneering work of Nelson and Oppen [30] Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. POPL'06 January 11­13, 2006, Charleston, South Carolina, USA. Copyright c 2006 ACM 1-59593-027-2/06/0001. . . $5.00. described a method for combining decision procedures for individual theories to yield a decision procedure for the combined theory. Based on their results, a number of powerful automated theorem provers [9, 4] have been developed for program verification. Many successful tools [12, 3, 17] have used these provers for proving program properties. However, first-order reasoning typically breaks down when we want to prove properties of programs that manipulate heapallocated linked data structures. The main problem with reasoning about a data structure, such as a linked list, is that it is impossible to express an invariant about all members of a list in first-order logic. To achieve such a specification in general requires the use of the reachability predicate which cannot be expressed in first-order logic. Consequently, researchers have investigated richer logics such as combination of first order logic with transitive closure [18] and monadic second-order logic [28]. These approaches are typically unable to harness the advances made in automated theorem proving based on first-order logic. In this paper, we develop a method for verifying linked data structures based entirely on first-order logic. The idea behind our approach is to provide a first-order approximation of the reachability predicate. Nelson [32] and Lev-Ami et al. [24] have also proposed first-order axiomatizations for linked lists. Our work improves upon these previous approaches in several significant ways. Most notably, our work makes the analysis of cyclic linked lists as uniform as acyclic lists. While theoretically incomplete, we believe that our approach is complete enough for most realistic programs. We have validated this belief by building a tool to mechanize our method. We have used our tool to verify a number of programs manipulating singly-linked and doubly-linked lists. In the next section, we illustrate the benefits of our analysis for reasoning uniformly about acyclic and cyclic lists. 1.1 Motivation The novelty of our technique is best illustrated by an example. Consider the following function acyclic simple that iterates over an acyclic list pointed to by the variable head and sets the data field of each list element to 0. This example is representative of a variety of programs that use linked lists as sets and iterate over them. //@ requires head != null void acyclic simple() { iter = head; while (iter != null) { iter.data = 0; iter = iter.next; } } To write the loop invariant for this program, we require the reachability function Rnext, which maps each cell u to the set containing u and all cells reachable from u by following the next field 115 and excluding null. The cornerstone of all previous approaches to first-order reasoning about linked lists is the following axiom about Rnext: v Rnext(u) (v = u (u.next = null v Rnext(u.next))) This axiom and the rest of this section assumes, for simplicity but without loss of generality, that null.next = null. We have also added the precondition head != null to make this example similar in as many ways as the example of a cyclic list traversal described later in the section. Using Rnext, the loop invariant for the program can be expressed as follows: v Rnext(head) : (iter = nullv Rnext(iter))v.data = 0 This loop invariant proves the following desired postcondition of the function. v Rnext(head) : v.data = 0 The verification condition for this program is easily constructed using the weakest-precondition transformer [10], and the verification condition can be proved from the axiom using purely first-order reasoning based on uninterpreted functions and quantifier instantiation. We encourage the reader to perform this simple but interesting calculation. Unfortunately, this simple approach breaks down with cyclic lists which are quite common, and perhaps the most common linked data structure used in operating system kernels. To illustrate the problem, consider the following variant cyclic simple of the first function that iterates over a cyclic list pointed to by head and sets the data fields of the list elements to 0. The complication, in this case, is that the next field of the last element of the list points to head rather than null. //@ requires head != null void cyclic simple() { head.data = 0; iter = head.next; while(iter != head) { iter.data = 0; iter = iter.next; } } The circularity of the list results in a breakdown of the reasoning described earlier. The value of Rnext(head) and Rnext(iter) remains constant and equal to the set of all elements in the list during the entire execution of the loop. As a result, we are unable to write a simple loop invariant using Rnext. Even proving the absence of null dereferences in cylic simple is nontrivial compared to acyclic simple where it is trivial because of the loop entry condition. In the remainder of this section, we show how our approach makes verification of programs manipulating cyclic lists as simple as those manipulating acyclic lists. The crucial insight behind our approach is that a circular list invariably contains a distinguished head cell. This head cell provides a handle on the list, usually marks the beginning of the list, and the last element of the list points to it. In the program above, this cell is pointed to by the program variable head. Moreover, it is the presence of this cell that ensures the termination of iterations over circular linked lists. Let Hnext denote the set of head cells of a program. Then, the above program has the precondition {head, null} Hnext. We think of null as a special head and consequently require that Hnext always contain null. The usefulness of constraining Hnext in this fashion will become clear later in this section. We introduce a new axiomatization of linked lists using functions Rnext and Bnext. In contrast to its definition earlier, the new definition of function Rnext maps each cell u to the set of cells containing u and all cells reachable from u by following the next field until a cell v Hnext is reached. The cell v is not included in Rnext(u) unless u = v. Finally, we define Bnext(u) to be v. The cell v acts as a block to the traversal of the next field from u and its identity is captured as Bnext(u). We call Bnext(u) the blocking cell for u. The axioms for Rnext and Bnext are as follows: v Rnext(u) (u = v (u.next Hnext v Rnext(u.next))) Bnext(u) = (u.next Hnext) ? u.next : Bnext(u.next) The first axiom is similar in spirit to the axiom for Rnext described earlier. The second axiom says that Bnext(u) is u.next if u.next Hnext and Bnext(u.next) otherwise. Using the definition of Bnext, we can specify that head points to a circular list by adding the following precondition to the function cyclic simple. //@ requires Bnext(head) = head From the definition of Rnext and Bnext, it is clear that if Bnext(u) = u, then u Hnext and Rnext(u) is the set of cells in a cycle in which no cell other than u is in Hnext. Therefore, the set of cells in the circular list pointed to by head is given by Rnext(head). Using the definition of Rnext and Bnext, we can now write the loop invariant for the loop in cyclic simple as follows. v Rnext(head) : (iter = head v Rnext(iter)) v.data = 0 Bnext(iter) = head Observe that the first conjunct of the loop invariant is similar in structure to the loop invariant for the function acyclic simple. The two new axioms allow us to prove the correctness of the loop invariant, the absence of null dereferences, and the following desired postcondition for cyclic simple: v Rnext(head) : v.data = 0 As mentioned earlier, proving the absence of null dereferences in cyclic simple is nontrivial and requires the use of the axiom about Bnext. Our new axiomatization is a generalization of the first axiomatization that worked only for acyclic lists. If Hnext = {null}, the new axiom about Rnext reduces to the first axiom about Rnext. Indeed, we can verify the first program using the new axiomatization by adding the precondition Hnext = {null} to the function acyclic simple. Thus, our new axiom system can verify programs manipulating acyclic as well as cyclic lists with equal ease. 1.2 Contributions The main technical contribution of this paper is a novel method for verifying linked lists based on two new predicates that characterize reachability of heap cells. These predicates allow reasoning about both acyclic and cyclic lists uniformly with equal ease. The crucial insight behind our approach is that a circular list invariably contains a distinguished head cell that provides a handle on the list. This observation suggests a programming methodology that requires the heap of the program at each step to be well-founded, i.e., for any field f in the program, every sequence u.f, u.f.f, . . . contains at least one head cell. The set of head cells is identified by a new variable added to the program by the programmer. We believe that our methodology captures the most common idiom of programming with linked data structures. We enforce our methodology by automatically instrumenting the program with updates to two auxiliary variables representing these predicates and adding assertions in terms of these auxiliary variables. Our instrumentation captures well-foundedness precisely -- the instrumented program fails one of these assertions if and only if the original program reaches a state containing a heap that is not well-founded. 116 //@ requires a != null && b != null //@ requires {null, a, b} Hnext //@ requires Bnext(a) == a && Bnext(b) == b //@ requires Rnext(a) Rnext(b) == {} //@ requires u Rnext(b) : u.r == b.r //@ ensures Bnext(a) == b && Bnext(b) == a //@ ensures u (Rnext(a) Rnext(b)) : u.r == b.r //@ ensures (Rnext(a) Rnext(b)) == (old(Rnext)(a) old(Rnext)(b)) //@ ensures u old(Rnext)(a) : u.r == old(u.r) void union(Cell a, Cell b) { a.r = b.r; Cell curr = a.next; while (curr != a) { curr.r = b.r; curr = curr.next; } Cell tmp = a.next; a.next = b.next; b.next = tmp; } Figure 1. Performing the union of two circular lists To prove program properties and the instrumented assertions, we provide a first-order axiomatization of our two predicates. Our axiomatization consists of the two basic first-order axioms introduced in Section 1.1. We also introduce a novel induction principle made possible by the well-foundedness of the heap. We use our induction principle to derive from the two basic axioms a small set of additional first-order axioms that are useful for proving the correctness of several programs. All of these derived axioms are intuitive and natural and we state them precisely in Section 5. We also explore an alternative verification approach based on a decision procedure rather than a first-order axiomatization. In Section 6, we show that the quantifier-free fragment of the theory of Rnext and Bnext is NP-complete and propose a decision procedure for this fragment based on a small-model encoding. We have implemented our method in a tool and used it to verify the correctness of a variety of nontrivial programs manipulating both acyclic and cyclic singly-linked lists and doubly-linked lists. We also demonstrate the use of indexed predicate abstraction [13, 21] for automatically synthesizing the loop invariants for these examples. We describe our preliminary experience with this tool in Section 7. For lack of space, we have omitted the proofs of the various theorems and lemmas in the paper; they can be found in the full version of the paper [23]. 2. Examples In Section 1.1, we illustrated our method by verifying the functions acyclic simple and cyclic simple. These functions were interesting and illustrative but comparatively simple. In this section, we illustrate our method on three more complex examples. The first example in Section 2.1 modifies the link structure of the heap; the second example in Section 2.2 uses arithmetic in addition to heap manipulation; the third example in Section 2.3 manipulates a doubly-linked list. We note that the examples in Section 1.1 and all examples of this section have been verified by the tool whose details we provide in Section 7. 2.1 Set-union Our first example is the function union shown in Figure 1. This example is taken from a paper by Nelson [32]. The function union takes two circular linked lists, a and b, as arguments. Each list represents a set; the field r in a cell contains the identifier of the unique set to which it belongs. The function union has a number of preconditions also stated in Figure 1. The second precondition says that both a and b are pointing to head cells. The third precondition says that a and b are pointing to circular lists. The fourth precondition says that the lists pointed to by a and b are disjoint. The objective of union is to merge the list pointed to by a into the list pointed to by b. The function union first sets the r field of each cell in a to b.r. Finally, the contents of a.next and b.next are swapped to merge the two lists. Our tool automatically instruments union with updates to the functions Bnext and Rnext. This instrumentation is introduced only for the last two statements, which are the only statements that update the next field. For each of these statements, the instrumentation also checks, by inserting an assertion, that the update of next leaves the heap well-founded. For example, the assertion introduced just before the last statement of the function is as follows: assert(b Rnext(tmp) => tmp Hnext) This assertion fails precisely if there is a chain of cells connected by next from tmp to b in which no cell is in Hnext. Such a chain causes an ill-founded cycle of cells upon execution of the statement b.next = tmp. Furthermore, this statement modifies Bnext(b) and Rnext(b) as follows: Bnext(b) = (tmp Hnext) ? tmp : Bnext(tmp) Rnext(b) = (tmp Hnext) ? {b} : {b} Rnext(tmp) A precise and complete description of both Bnext and Rnext is given in Section 4. For union, we prove the absence of null dereferences, the assertions described above to check that the heap remains well-founded, and four postconditions. The first postcondition is particularly interesting. It states that the blocking cell for a is b and the blocking cell for b is a. These two facts together mean that cells a and b are the only two head cells in a cycle, which indicates indirectly that indeed the union of the two lists has been created. In the third postcondition, old(Rnext) refers to the value of Rnext at the beginning of the function. The final postcondition says that the r field remains unchanged for all those cells which do not belong to Rnext(a) ini- tially. So far, the set of head cells has been a constant. The first postcondition also motivates the need to modify the set of head cells. It would be intuitively more satisfactory if the programmer can remove the cell a from Hnext once the list a has been merged into b. Therefore, we allow the programmer to remove a cell from Hnext by using the Remove operation. For example, we could add the statement Hnext.Remove(a); at the very end of union. With this modification, our tool proves the following more pleasing postconditions. //@ ensures Bnext(b) == b && Bnext(a) == b //@ ensures u Rnext(b) : u.r == b.r //@ ensures Rnext(b) == (old(Rnext)(a) old(Rnext)(b)) Since the values of the variables Bnext and Rnext depend on Hnext, the instrumentation for this statement is nontrivial. Since Hnext becomes smaller, the heap might not remain well-founded. The instrumentation not only checks via an assertion that the heap remains well-founded, but also updates the values of Bnext and Rnext appropriately. We give a precise description of the instrumentation 117 //@ requires {null} Hnext //@ requires Bnext(l) == null //@ requires l != null && p != null //@ requires p Hnext //@ requires p.data > l.data //@ requires sorted(l) //@ requires p Rnext(l) //@ ensures sorted(l) //@ ensures p Rnext(l) //@ ensures u : u Rnext(l) <=> (u old(Rnext)(l) || u == p) void insert(Cell l, Cell p) { Cell curr = l; Cell succ = l.next; while (succ != null) { if (p.data > succ.data) { curr = succ; succ = curr.next; } else break; } p.next = succ; curr.next = p; } Figure 2. Inserting an element into a sorted acyclic list for this statement as well as for Hnext.Add(x), the converse of this statement, in Section 4. 2.2 Insert Our second example is the function insert shown in Figure 2. The function insert takes an acyclic list l and a cell p as arguments. The cells in the list l are in sorted order based on the values of the field data of the cells. The predicate sorted(l) is defined as follows: sorted(l) = u Rnext(l) : u.next == null|| u.data <= u.next.data The objective of insert is to insert the cell p in the appropriate place in l so that l remains sorted. The fourth precondition requires that the cell p is not in Hnext. If p Hnext, then on return, Rnext(l) contains all the cells from l up to but not including p, and thus violates the the second postcondition p Rnext(l). The fifth precondition of insert is also worth explaining. To simplify the coding of insert, it is expected that the first element of l is a dummy whose data field is guaranteed to be less than any value that might be inserted in the list. This example illustrates an important advantage of axiomatizing linked lists in first-order logic. The specification of insert uses a combination of facts about reachability via the next field, uninterpreted functions, arithmetic, and propositional logic. The axiomatization of linked lists in first-order logic immediately allows us to use any one of a number of theorem provers that deal with a combination of first-order theories. For our implementation, we are using the UCLID theorem prover [7]. Although this example does not use any arrays, adding them is a simple matter because they can be modeled easily with uninterpreted functions and the well-known select-update axioms. //@ requires wf dlist head(hd) //@ requires p != null //@ requires p Rnext(hd) //@ requires p != hd //@ requires Bnext(p) == hd //@ requires p Hnext && p Hprev //@ ensures p Rnext(hd) //@ ensures wf dlist head(hd) //@ ensures u : (u Rnext(hd) || u == p) <=> u old(Rnext)(hd) void dlist remove(Cell hd, Cell p) { Cell tp = p.prev; Cell tn = p.next; tp.next = tn; tn.prev = tp; } Figure 3. Removing an element from a doubly-linked list 2.3 Remove Our third example is the function dlist remove given in Figure 3. This function removes a cell p from a cyclic doubly-linked list with head hd. This example illustrates that our technique handles doubly-linked lists just as cleanly as singly-linked lists. Our instrumentation adds variables Rnext and Bnext for the linking field next, and Rprev and Bprev for the linking field prev. The instrumentation happens just as before for each linking field as if they are independent. However since the linking field pair (next, prev) forms a doubly linked list, we need to define additional preconditions to relate the two fields and their auxiliary vari- ables. We define a predicate wf dlist head(hd) to denote that hd points to a well formed doubly linked list. This predicate is a conjunction of the following predicates: 1. hd != null 2. hd Hnext && hd Hprev 3. u Rnext(hd) : u.next.prev == u && u.prev.next == u 4. Rnext(hd) == Rprev(hd) 5. Bnext(hd) == hd && Bprev(hd) == hd The last conjunct indicates that hd is the unique head cell in Hnext and Hprev present in the cyclic list. The predicate wf dlist head(hd) is particularly useful when the routine dlist remove is invoked in a loop to, for example, remove all the cells from the list that satisfy some property. Then this predicate will be a conjunct in the loop invariant. The first precondition of the routine requires that the list is well formed. The next two preconditions state that p points to a cell in the list. The next precondition states that p is different from hd and is required otherwise the list becomes ill-founded after the remove operation. The postconditions assert that p is removed from the list and the list remains well-formed. 3. Programs The set Var is a set of program variables. A variable in Var may have one of three types: Boolean, Integer, or Cell. The set Cell contains the addresses of heap objects, each of which may have fields from the set Field. A field may also have one of the three aforementioned types. A field of type Boolean is a map from Cell 118 Boolean = {false, true} Integer = {. . . , -2, -1, 0, 1, 2, . . .} u, v, w Cell = {null, . . .} x, y, z Var f, g, h Field c Const = Boolean Integer {null} Op = { + , - , == , != , < , && , || } s Stmt ::= x = c | x = y | x = y Op z | assume(x) | assert(x) | x = new | x = y.f | x.f = y Figure 4. Program syntax to Boolean, a field of type Integer is a map from Cell to Integer, and a field of type Cell is a map from Cell to Cell. We refer to the last category of fields as reference fields. Finally, Const is the set of constants that may appear in the program. The constant null is a special constant of type Cell. A program is a control flow graph (PC, E, pci , pcf , L) with five components. PC is a finite set of program locations, E PC × PC is the set of control-flow edges, pci PC is the initial location where program execution begins, pcf PC is the final location where program execution terminates, and L is a function that maps each edge in E to a statement in Stmt. The restricted syntax of the statements in Stmt does not result in any loss of generality, because more complex statements can reduced to simple statements by the introduction of temporary variables. The statement assume(x) together with nondeterminism in the control flow graph can be used to encode if-then-else and while statements. Moreover, other operations can be encoded with the operations in Op. For example, boolean negation of x can be encoded as x == false. We also assume that the program is free of typing errors. 3.1 Programming with heads We allow programmers to specify a subset of the reference fields as linking fields. These fields are expected to provide the links in data structures such as singly-linked and doubly-linked lists. For each linking field f, the programmer is allowed access to a variable Hf of type Set(Cell). The set Hf contains a subset of the set of allocated cells that together make all lists linked by f well-founded. The set Hf is required to always contain the cell null. We provide two operations to update Hf. 1. Hf.Add(x): The variable x is required to be of type Cell. This statement adds the cell obtained by evaluating x to the set Hf. This statement provides the fundamental mechanism for making circular lists linked by the field f well-founded. A programmer creates such a list by first creating the head cell and then adding it to the set Hf. 2. Hf.Remove(e): Again, the variable x is required to be of type Cell. This operation requires that x be different from null. This statement removes the cell obtained by evaluating x from the set Hf. The precondition ensures that we never remove null from Hf. We have already illustrated the utility of this operation in Section 2.1. This operation is used to remove the head cells of those circular linked lists that are merged into other lists. In addition to the usual statements, we also allow the edges of the control flow graph to be labelled with the above operations on Hf for each linking field f. In Section 4, we describe how we ensure well-foundedness of the heap by automatically instrumenting the program with assertions at every occurrence of an update to the linking fields and every occurrence of the two operations described here. 3.2 Semantics The state of the program contains a program counter in PC and a valuation of the variables in Var, the fields in Field, and the head variables Hf for each linking field f. Variables of type Boolean are initialized to false, variables of type Integer are initialized to 0, and variables of type Cell are initialized to null. The state also contains a special variable Alloc of type Set(Cell) to model memory allocation via new. The variable Alloc is initialized to {}. The statement x = new removes a nondeterministically chosen cell that is not equal to null from Cell \Alloc, assigns it to x, and adds it to Alloc. Thus, the statement x = new is desugared to the following statements: x = choose(Cell \ {null}\Alloc); Alloc = Alloc {x}; When the program executes, its state changes in accordance with the standard operational semantics of its control flow graph. To model the program misbehaving by performing a dereference of null or by failing an assertion, we add a special state wrong with no transition out of it. For example, if the value of the program counter in is l and the edge (l, l ) E is labelled with x.f = y, then one of the following may happen. 1. (x) is null and the program moves to the state wrong. 2. (x) is not null and the program makes a transition to a state in which the program counter is l , the map for f is modified only at cell (x) to (y), and otherwise the state remains unchanged. The operational semantics of the other statements can be defined similarly. 4. Program instrumentation In this section, we show how to automatically instrument a program to ensure that the linked lists in the heap always remain wellfounded. The instrumentation is performed with respect to a subset of reference fields called linking fields that act as the links in a linked structure. We automatically instrument the program with two auxiliary variables for every linking field f. These variables, called Rf and Bf, record information about the shape of the heap graph and are essential to our verification method. The variable Rf is a map from Cell to Set(Cell). The variable Bf is a map from Cell to Cell. Intuitively, for any u Cell, Rf(u) is the set of heap cells containing u and every cell obtained by one or more applications of f to u until a cell v Hf is reached and Bf(u) = v. Note that the final cell v is not a member of Rf(u). We will often write Rf(u, v) to denote v Rf(u). Although the variables Rf and Bf can be defined as a mathematical function of the program state, this definition uses transitive closure and therefore cannot be expressed in first-order logic. An important insight of our work is that even though these variables cannot be defined without using transitive closure, the updates to them as the program state changes can be defined entirely in first-order logic. We automatically instrument the program to record these updates and are thus able to state the assertions which ensure that the heap remains well-founded. Let P be a program with control flow graph (PC, E, pci , pcf , L). The instrumented program P# is obtained by instrumenting each individual statement L(e) for each edge e E. We now define the instrumentation for a given statement s. In many cases shown below, the instrumented statement contains control flow. Such a complex statement is used only for clarity of presentation, and can easily be translated into a simple statement without control flow. 119 1. If the statement s is of the form x = c, x = y, x = y Op z, assume(x), assert(x), x = new, or x = y.f, then the instrumented statement is identical to s. The definition of Rf and Bf depends only on the value of map f and the set Hf. Since neither of those are changed by the statement, we do not need to update the instrumentation variables. 2. If the statement is of the form x.f = y and f is a linking field, then the instrumented statement is assert(x != null); assert(x Rf(y) => y Hf); if (y Hf) { Bf = i. y if x Rf(i) Bf(i) otherwise Rf = i. (Rf(i) \ Rf(x)) {x} if x Rf(i) Rf(i) otherwise } else { Bf = i. Bf(y) if x Rf(i) Bf(i) otherwise Rf = i. (Rf(i) \ Rf(x)) {x} Rf(y) if x Rf(i) Rf(i) otherwise } x.f = y; This statement is the only one that updates the link structure of the heap. First, the instrumentation checks via the assertion assert(x Rf(y) => y Hf) that the heap remains wellfounded, that is, no cycle of cells unbroken by a member of Hf is created as a result of updating the f field of x to y. If such a cycle is created, then there must be a path from from y to x in which no cell, including y and x, is in Hf, which results in a violation of this assertion. Second, the two instrumentation variables Bf and Rf are updated. The values of these functions is updated at a particular cell i only if x Rf(i), that is, there is a path from i to x not broken by any member of Hf. If x Rf(i), the update is split into two cases. If y Hf, then Bf(i) becomes y and we remove from Rf(i) everything that is reachable from x without hitting a member of Hf but then add x itself. If y Hf, then Bf(i) becomes Bf(y) and we remove from Rf(i) everything that is reachable from x but then add x itself and everything that is reachable from y. 3. If the statement is of the form Hf.Add(x), then the instrumented statement is if (x Hf) { Bf = i. x if x Rf(i) && x!= i Bf(i) otherwise Rf = i. Rf(i) \ Rf(x) if x Rf(i) && x!= i Rf(i) otherwise Hf = Hf {x}; } This statement does not update the linking structure of the heap; it only changes the value of Hf. Since the values of the instrumentation variables Bf and Rf depend on Hf, these variables must be updated if Hf changes in case x Hf. The values of Bf and Rf are updated at a particular cell i only if x Rf(i) x = i, that is, x becomes new blocking cell for i. In this case, Bf(i) becomes x and we remove from Rf(i) everything that is reachable from x without hitting a member of Hf. 4. If the statement is of the form Hf.Remove(x), then the instrumented statement is if (x Hf) { assert(Bf(x) != x); Bf = i. Bf(x) if Bf(i)== x Bf(i) otherwise Rf = i. Rf(i) Rf(x) if Bf(i)== x Rf(i) otherwise Hf = Hf \ {x}; } This statement, just like Hf.Add(x), does not update the linking structure of the heap; it only changes the value of Hf. The values of the instrumentation variables Bf and Rf must be updated if Hf changes in case x Hf. First, the instrumentation checks via an assertion that the removal of x from the set of head cells does not result in a heap that is not well-founded. A bad heap can result if x is the only head cell in a cycle, a condition captured by Bf(x) = x and the assertion checks for precisely the negation of this condition. Note that since Bf(null) = null by definition, this assertion also checks the precondition of this operation that x is nonnull. Second, the two instrumentation variables Bf and Rf are updated. The values of Bf and Rf are updated at a particular cell i only if Bf(i) = x, that is, x is the blocking cell for i. In this case, Bf(i) becomes Bf(x) and we add to Rf(i) everything that is reachable from x without hitting a member of Hf. 4.1 Correctness In this section, we formalize the correctness of our instrumentation. The instrumented state of the program is a valuation of the variables in Rf and Bf for each linking field f. While the original program makes transitions of the form , the instrumented program makes transitions of the form (, ) ( , ). To state our correctness theorem, we first need to define a well-formed state. DEFINITION 1 (Well-founded function). A function f : Cell Cell is well-founded with respect to a set of cells H, if for any cell u Cell, there exists n > 0, such that fn (u) H. If f is well-founded with respect to h, we can define the functions Bf H : Cell Cell and Rf H : Cell Set(Cell) as follows. Rf H (u) = {v Cell | v = u n : 0 < n : (v = fn (u) m : 0 < m n : fm (u) H)} Bf H (u) = fn (u), where n = min{m | 0 < m fm (u) H} DEFINITION 2 (Well-founded state). A state is well-founded if for every linking field f, the function (f) is well-founded with respect to (Hf). DEFINITION 3 (Well-formed state). A state is well-formed if the following conditions are satisfied: 1. is well-founded. 2. null Alloc. 120 3. For every variable x of type Cell, (x) (Alloc) or (x) = null. 4. For every cell u (Alloc) and reference field f Field, (f)(u) (Alloc) or (f)(u) = null. 5. For every cell u Cell \ (Alloc) and reference field f Field, (f)(u) = null. 6. For every linking field f, null (Hf). A state is ill-formed if it is not well-formed. We define an instrumentation function I on well-formed states. For every program state , the application I() yields an instrumented state such that (Bf) = Bf H and (Rf) = Rf H , where f = (f) and H = (Hf). Now, we can state a theorem that characterizes the precision of our instrumentation. THEOREM 1. Suppose and are well-formed states of a program P and P# is the instrumented version of P. Then the following statements are true. 1. P can make a transition from to iff P # can make a transition from (, I()) to ( , I( )). 2. P makes a transition from to either an ill-formed state or wrong iff P # makes a transition from (, I()) to wrong. The proof of this theorem is performed by a case analysis over the various types of statements and is given in the full version of the paper [23]. 5. First-order axiomatization of well-foundedness Our checker verifies a program by checking the existence of a wellformed state satisfying a verification condition. The definition of a well-formed state (from Section 4.1) has six constraints. It is easy to write simple first-order axioms to model constraint 2­6. But the first constraint, which requires the state to be well-founded, involves a nontrivial relationship between the variables f, Rf, Bf and Hf for any linking field f. In fact, well-foundedness cannot be precisely expressed in first-order logic. In this section, we first provide two first-order axioms (called base axioms) to capture the relationships between Rf, Bf and f. We then provide an induction principle (IND-WF) that enables us to derive additional first-order axioms from the base axioms. For convenience, we will use the notation f(u) to denote u.f in this section. 5.1 Base axioms The following two fundamental axioms characterizes the relationship between the predicates Rf, Bf, Hf and f in any state of the program. In all these axioms, the cells u Cell, v Cell are implicitly universally quantified out. 1. The first axiom specifies that Rf(u, v) is true if and only if either (i) there is a zero-length path from u to v (when u = v), or (ii) there is a path of length one or more from u to v without encountering any cells from Hf. Rf(u, v) (u = v (f(u) Hf Rf(f(u), v))) (AX1) 2. The second axiom relates Bf(u) with Bf(f(u)) when f(u) Hf. If f(u) H, then Bf(u) = f(u). Bf(u) = (f(u) Hf) ? f(u) : Bf(f(u)) (AX2) The above axioms follow from the definition of the two predicates from Section 4.1. We define a state to be finite if the set Alloc is finite in . Also, recall (from Section 4.1) that for any function f : Cell Cell and a set H Cell, Rf H (u) defines the set of all cells u, f(u), . . . until the first cell from H is encountered, and Bf H(u) denotes the identify of the first cell in H encountered. The base axioms intuitively capture the definition of the fields Rf and Bf. The following theorem, which is a natural generalization of the theorem for acyclic case [24], serves to make this intuition precise. THEOREM 2. For any finite and well-formed state satisfying axioms AX1 and AX2, (Rf) = Rf H, and (Bf) = Bf H, where f = (f) and H = (Hf). 5.2 Induction principle Even though the axioms AX1 and AX2 capture fundamental properties of Rf H and Bf H, they are not complete. As noted by Lev-Ami et al. [24], at least one of the limitation of the above axioms (and first-order logic in general) is that there is no complete axiomatization of "finiteness". A consequence of this limitation is that above axioms might imply other derived axioms in any finite state, but these additional axioms can't be derived from the two axioms above solely by first order reasoning. These derived axioms are required to further constrain the set of states to give a meaningful assignments to Rf, Bf and f. In this section, we present an induction principle (IND-WF) that can be used to derive other theorems from the base axioms AX1 and AX2. DEFINITION 4 (IND-WF). Consider a well-formed state where f = (f) and H = (Hf), for the linking field f. To show that a property P(u) holds for all u Cell, it is sufficient to establish two cases: 1. Base Case: Establishes the property for all cells u such that f(u) H: f(u) H P(u) (1) 2. Induction Step: If P holds for a cell v and v H, then establish that P holds for all the cells u such that f(u) = v: (P(v) f(u) = v v H) P(u) (2) We show that if the above induction principle establishes that a predicate P is true for all the cell u Cell in a state , then satisfies u : P(u). THEOREM 3. The induction principle IND-WF is sound. 5.3 Derived axioms In this section, we present a small set of useful first-order axioms that have been required in the various proof efforts we have undertaken. These axioms can be derived from the base axioms using the induction principle described above. It is a challenge to identify a "core" subset of axioms that is not only useful in practice, but also fairly intuitive to understand. We have identified a small set of such axioms that seem to be sufficient for the set of examples handled in this paper. Of course, we can't make any claims whether this set will suffice for other programs too, since our experience is limited to the set of programs we have handled. But we believe that this set captures the interesting cases in dealing with most linked list programs. 1. Transitivity: The relation Rf enjoys the transitivity property: (Rf(u, v) Rf(v, w)) Rf(u, w) ( TR) 2. Antisymmetry: This property is key to breaking the symmetry in a cycle. (Rf(u, v) Rf(v, u)) u = v ( AS) 121 3. Bounded distinctness: For any heap cell u, if none of the cells f(u), f2 (u), . . . , fk (u) intersect with the set Hf, then all the cells in this sequence are distinct from each other. Let Fk . = {f(u), . . . , fk (u)}, and DISTINCT(S) denotes that all the members in a set S are pairwise distinct. We can derive a parameterized system of theorems DTk for different finite values of k: Fk H = {} DISTINCT({u} Fk) ( DTk) We have found that the case for k = 1 to be useful in proving properties of many linked list programs (e.g. reverse of an acyclic linked list): (f(u) H) u = f(u) ( DT1) These axioms can be proved easily from the base axioms AX1 and AX2. and the induction principle IND-WF. We illustrate this by working through the proof of DT1: Proof: We will substitute the expression (f(u) H) u = f(u), in place of P(u) in Equation 1 and Equation 2. Then we derive contradiction from the negation of the formula. * Base case: Since f(u) H, the formula for the base case is unsatisfiable. * Induction step: Substituting the definition of P and rewriting v with f(u) in Equation 2, we get: (f(f(u)) H f(u) = f(f(u))) f(f(u)) H f(u) H u = f(u) Rewriting the formula after replacing f(u) with u (since u = f(u)), and removing f(u) = f(u), we get (f(u) H f(u) = f(u)) f(u) H which is a contradiction. 5.4 Optimization In addition to the instrumentation discussed in Section 4, we also introduce an assume statement immediately after each statement of the form x = y.f: assume(x Hf => Rf(x) = Rf(y) \ {y}); The constraint described by the assume statement is an instance of the following theorem (that can again be derived from AX1 and the induction principle): f(u) Hf Rf(f(u)) = Rf(u) \ {u} (T1) The addition of the constraint eagerly introduces an appropriate instance of Theorem T1 during the instrumentation phase. Since this axiom is not added explicitly to the set of derived axioms used in the proofs, the theorem prover is later prevented from searching for instances of this axiom. 6. Decision procedure for well-foundedness Many automated theorem provers including Simplify [9] and UCLID [7] use a combination of heuristic quantifier instantiation and decision procedures for ground theories to reason about quantified formulas over certain theories. To take advantage of this approach, we explore decision procedures for quantifier-free formulas containing Rf and Bf. A decision procedure provides an alternative to the methodology presented in Section 5, where we provide a set of (incomplete) first-order axioms about transitive closure to reason about formulas containing Rf and Bf. We show that the complexity of the decision problem for the quantifier-free theory with Rf and l Literal ::= x = y | f(x) = y | x = null | Rf(x, y) | Rf(x, y) | Bf(x) = y | x Hf | x Hf | x Alloc | x Alloc Formula ::= l | 1 2 | 1 2 Figure 5. Quantifier-free logic with Rf and Bf. Bf is NP-complete. We also provide a na¨ive decision procedure for the fragment based on a small-model encoding. Consider a quantifier-free formula Formula in the logic described in Figure 5. It is easy to see that one can reduce a formula containing nested applications of f or Bf (such as f(f(x))) to a formula in this logic by introducing additional skolem constants. For simplicity of presentation, let us take the liberty to include these additional constants in the set of program variables Var. Accordingly, a program state includes an evaluation for these additional symbols. For any state and a formula Formula, we can define the meaning of under ( |= ) inductively over the structure of in the usual way. Suppose L Literal is a finite set of literals. We write |= L if |= Î l for all l L. The set L is satisfiable if there exists a well-formed state such that |= L. The decision problem WFSAT is the following: Is a given finite set L of literals satisfiable? LEMMA 1. The problem WFSAT is NP-hard. The proof of this lemma is by reduction from the 3-COLOR problem. In fact, the decision problem WFSAT remains NP-hard even when we restrict ourselves to literals of the form x = y, f(x) = y, x = null and Rf(x, y). A state is finite if (Alloc) is a finite subset of Cell. If is finite, we define || = |(Alloc)|. We show that the logic described in Figure 5 above enjoys a small-model property. Given a finite set L of literals, let NVar(L) be the number of distinct variables that occur in the literals in L. Then we can prove the following: LEMMA 2. If a finite set of literals L is satisfiable, then there is a finite well-formed state such that |= L and || NVar(L) + 1. The following theorem follows easily from Lemmas 1 and 2. THEOREM 4. The problem WFSAT is NP-complete. The small-model property of Lemma 2 suggests a straightforward decision procedure for the logic. Given a formula in this logic, we can replace each occurrence of Rf and Bf with their definitions. For example, if k . = NVar(L) + 1, we can replace Bf(x) with the formula: f(x) Hf ? f(x) : f2 (x) Hf ? f2 (x) : . . . k times . . . The resultant formula can be solved using a (combined) decision procedure for equality with uninterpreted functions (EUF) and linear arithmetic. In Section 7, we briefly describe our initial experience with the decision procedure. 7. Experiments We have implemented a prototype tool to mechanically verify properties of programs containing linked lists. In this section, we briefly describe the components of the prototype and some preliminary results on a set of programs manipulating singly or doubly linked lists. The source program is written in the Zing [1] programming language, an imperative Java-like language but without inheritance. 122 In addition to the programming language discussed in the paper, it can also support arrays of Boolean, Integer and Cell. The user specifies a subset of fields in the program as linking fields. For each field f, the tool automatically instruments the Zing program with the auxiliary variables Rf, Bf, and adds the updates to these variables. The instrumented program also contains the asserts for null-dereference and the asserts to ensure that the fields are well-founded. The instrumented program is then translated to a guarded transition system and is fed to the UCLID verification system [7]. Let us briefly describe the UCLID system and the features that are used in this paper. 7.1 UCLID The input language of UCLID supports variables of type Booleans, integers, and functions (and predicates) over integers. The functions can have any finite arity. Each variable of type Cell in the source program is mapped to an integer variable. Each field f is mapped to a function variable from integers to integers or Booleans. Set-valued variables such as Rf (respectively Hf) are mapped to a predicate variable of arity two (respectively one). Similarly, Bf is mapped to a function variable of arity one. The updates to function and predicate variables are modeled using a notation used in Section 4 of this paper. The notation generalizes the theory of arrays, and allows us to modify an arbitrary number of entries in an array in a single step. This is convenient for expressing the updates of Rf and Bf variables. The user can add the necessary preconditions, loop invariants and the postconditions to check in the UCLID file. Recall that some of the asserts are automatically generated by our translator and are part of the list of properties to check in the UCLID file. The file also includes the set of base axioms and derived axioms described in Section 5. UCLID supports specification of first-order axioms and properties of the form X : (X), where is a quantifier-free expression over the state variables in the program. This is sufficient to express the axioms, preconditions, loop invariants, and the postconditions for all the examples that we have encountered in this work. Set operations are modeled using quantified expressions (e.g. S = S1 S2 gets translated to x : S(x) (S1(x) S2(x)). The tool can be used in two different ways: * Proving verification conditions (VC): Given the preconditions, loop invariants and the postconditions, the tool generates a VC that is checked using the theorem prover in UCLID. The theorem prover uses quantifier instantiation to eliminate quantifiers in the formula and then uses a Boolean Satisfiability (SAT) based decision procedure for the theories of uninterpreted functions and linear arithmetic. Failed VCs result in a concrete counterexample that is immensely useful for strengthening the loop invariants, adding more preconditions or (in rare cases) adding new axioms. * Synthesizing loop invariants using indexed predicate abstraction: The tool can automatically construct universally quantified invariants using simple indexed predicates from a set P. Each predicate p P is a Boolean expression over the state variables and a set of index symbols X of type integers (recall Cell are identified as integers in UCLID). The tool then constructs the strongest loop invariant of the form X : (X ), where (X ) is a Boolean combination of the predicates in P [21, 13]. For instance, given the set of predicates P = {u = next(v), v = prev(u), Rnext(hd, u), Rnext(hd, v)}, with X = {u, v}, the tool can compose the predicates to construct an invariant (say) u, v : (Rnext(hd, u) Rnext(hd, v)) (v = prev(u) u = next(v)) Example VC Size # Instants Time Taken in UCLID (sec) cyclic simple 991 (45) 0 2.1 reverse a 1896 (72) 0 9.5 setunion S1 1134 (49) 0 3.9 setunion 2960 (104) 1 11.4 insert 5950(110) 2 37.5 dlist remove 3362 (115) 1 30.1 Figure 6. Results of verifying linked list programs. "VC Size" denotes the number of nodes in the VC formula after instantiation; the numbers in the parenthesis denote the number of integer-valued terms in the formula. The column "# Instants" denotes the number of manual instantiations that had to be done. The ability to generate complex quantified invariants from simple predicates often relieves the user from writing down wellformed loop invariants. Various heuristics are also provided to often infer most predicates [22]. In the next two subsections, we describe our initial experience with using the two features of UCLID to verify a set of programs manipulating linked lists. 7.2 Benchmarks We have currently handled the following set of examples using the two options of UCLID mentioned above: * cyclic simple: This the example from Section 1 where all the elements of a cyclic list are initialized to 0. * reverse a: This example performs an in-place destructive update of an acyclic linked list pointed to by a variable l. * setunion, setunion S1: This is the example described in Figure 1. The example setunion S1 denotes the portion of the example before the destructive updates to combine the two lists into one. We have included both setunion and setunion S1 to demonstrate the increase in complexity in the presence of destructive updates. * insert: This is the example of insertion into a sorted list described in Figure 2 in Section 2. * dlist remove: This example of a doubly linked list is also described in Section 2. 7.3 Proving verification conditions In this section, we describe the set of examples for which we provided the loop invariant manually and used the theorem prover to prove the verification conditions. Figure 6 describe the results of verifying these examples on a 2 GHz machine running Linux and 1GB memory. ZChaff [29] was used as the SAT solver inside UCLID. We have measured the complexity of the VC generated (VC Size) and the total time taken by the theorem prover to prove the formula. We have also included the number of cases where we had to manually instantiate an axiom. Observe the difference between the verification of setunion S1 and setunion. In the latter case, adding the destructive updates requires one manual instantiation as well as increased CPU time. Also, the complexity of verifying dlist remove comes from two sources even though the program is just four lines long: first, the precondition of the method is complex, and secondly, the set of axioms double and number terms to instantiate increases because of the presence of two fields (next, prev) in the program. 123 Example # Predicates # Iterations Time Taken (sec) cyclic simple 15(1) 4 11.4 reverse a 16(1) 6 85.45 set union 24 (1) 5 79.79 insert 21(2) 9 1404.07 dlist remove - - 30.1 Figure 7. Results of verifying linked list programs using indexed predicate abstraction. "# Predicates" denotes the number of predicates requires; the number in the parenthesis denotes the number of index variables in the predicates. "# Iterations" is the number of iterations of abstract reachability computation to compute the invariant. "Time taken" is the time taken to construct the invariant and prove the postconditions. The example dlist remove did not require any loop invariant. The need for manual instantiation of quantifiers in the axioms have come mainly from two cases so far: * The quantifier instantiation heuristics fail to infer relevant terms to instantiate. It only happened once for the example insert, where we had to instantiate AX2 with a concrete term. * Since the axiom TR relates three bound variables u, v and w, and the number of combinations to instantiate grows exponentially with the number bound variables resulting in a very large VC, we always use the axiom after instantiating the first argument u with one of the program variables. This was the source of one instantiation for each of setunion, insert and dlist remove examples. Both the above problems can be mitigated by better quantifier instantiation strategies, and we are working towards improving the heuristics in UCLID. We do not use Simplify to discharge the VCs because the concrete counterexample facility in UCLID offsets the value of more advanced instantiation heuristics present in Simplify. We also implemented a simple decision procedure for the quantifier-free fragment (described in Section 6), that is based on the small-model property (Lemma 2) of the logic. However, this simple scheme did not yield an efficient decision procedure for this fragment. For the set of examples in the paper (where NVar() ranges between 10 and 30), expanding Rf and Bf resulted in a huge blowup. Apart from the simplest example cyclic simple, the decision procedure did not terminate on the other examples within a 1000 seconds time limit. 7.4 Invariant synthesis We have also leveraged the predicate abstraction engine in UCLID to infer loop invariants for some of the examples. Our initial experience has been encouraging, and we report some preliminary results in this section. At present, we have managed to construct the loop invariants for the cyclic simple, reverse a, set union and insert examples, given a set of indexed predicates. Figure 7 illustrates the result of synthesizing loop invariants using indexed predicate abstraction for a subset of the examples. Currently the tool suffers from two bottlenecks that results in significant time to construct the invariants. This also explains the time taken to prove the verification conditions in the last section. * UCLID does not maintain the control flow graph explicitly and encodes the program counter as a variable -- the entire program is encoded as a single transition relation. This is because the tool was primarily built for analyzing distributed protocols and systems. This results in a large blowup in the formulas that the theorem prover or the predicate abstraction engine gets, and //@ requires null Hnext //@ requires Bnext(l) == null //@ ensures Rnext(res) == old(Rnext)(l) Cell reverse (Cell l) { Cell curr = l; Cell res = null; while (curr != null) { Cell tmp = curr.next; curr.next = res; res = curr; curr = tmp; } return res; } Set of Predicates X = {u} P = {u = null, u = curr, u = res, u = old(l), Rnext(curr, u), Rnext(res, u), Hnext(u), old(Rnext)(old(l), u), l = old(l), Rnext(old(l), u), curr = null, Bnext(u) = null} Figure 8. Reversing an acyclic list. The source program along with the set of predicates. We use old(x) to denote the value of x at the method entry. consequently slows the analysis. We are currently working on incorporating explicit control flow into the tool. * The quantifier instantiation engine generates a large number of (often redundant) terms to instantiate. Moreover, the number of combinations to instantiate grows exponentially with the number of index variables in X . For instance, with two index variables, the number of combinations to instantiate went up to 81 for the insert example. In some cases, SAT often saw formulas with more than 300K clauses in them. To mitigate these problems, we are exploring alternate quantifier instantiation and predicate abstraction strategies. Besides, one can often trade off the precision of the predicate abstraction to construct less precise loop invariants more efficiently. We plan to investigate if such loop invariants suffice to prove the properties of interest for these candidate programs. Figure 8 describe the reverse a example along with the set of predicates required to prove the postcondition. The example required a single index symbol (denoting a heap cell) u to construct the loop invariant. The tool requires 85.45 seconds to construct the loop invariant and prove the property. The set of predicates for this example was supplied manually. We are currently working on automating the process of predicate discovery. 8. Related work There is a rich body of work in reasoning about programs that perform destructive updates of heap allocated data structures. Work in this area can be divided into the following often overlapping categories. First-order axiomatization of reachability. The work in this category is closest to our work. Nelson [32] proposed the ternary reachability predicate u f - x v to define that u can reach v through applications of f without encountering x, and provided a set of 124 first-order axioms to capture this predicate. Our axiomatization is based on two binary predicates. We believe that our predicates are more intuitive to a programmer and our axiomatization yields simpler correctness proofs. For example, Nelson's proof of the setunion example from Section 2.1 required eight axioms whereas our proof requires only the transitivity axiom in addition to the two base axioms. Lev-Ami et al. [24] proposed another set of axioms for characterizing the reachability predicate. Their approach works only for acyclic lists and will not allow them to express loop invariants for programs manipulating cyclic lists. McPeak [27] proposes a methodology for writing specifications of data structures in firstorder logic. Since he does not attempt to axiomatize reachability, his method, by itself, is not sufficient to express reachability or disjointness properties of linked lists. To express such facts, the programmer has to manually introduce and update ghost variables in a program specific way. His work, however, provides new heuristics for quantifier instantiation targeted towards reasoning for linked data structures. This work should complement our work which depends crucially on first-order theorem provers. Finally, our instrumentation for updating Rf and Bf is similar to the work by Dong and Su [11] on incrementally updating the transitive closure relation of a graph due to unit changes in it. Shape analysis. Work in this category attempts to reason about the shape properties of the heap, such as acyclicity and sharing, in the presence of destructive updates to the heap. Ghiya and Hendren [14] propose the use of two Boolean matrices to record if two pointers are reachable from each other and if they share any heap cell. They provide conservative updates to these matrices for various statements in the program. Sagiv et al. [38, 39] use a 3-valued logic to represent abstractions of the heap graph. For improved precision, their approach requires instrumentation predicates, which usually use a reachability predicate as a building block. The updates for these predicates are either supplied by the user or constructed conservatively in some cases using finite-differencing [36]. The idea is implemented in the TVLA [25] system and has been used to infer loop invariants for various linked data structures. In contrast to this work, our approach depends on purely first-order reasoning. Recently, Hackett and Rugina [16] have proposed a method that uses points-to information to construct an abstraction of the heap. Their approach cannot encode relationships (e.g. equality) between program expressions and cannot describe doubly-linked lists. One advantage of our approach compared to all of the above is the ease with which we can combine reasoning about linked lists with reasoning about arithmetic and arrays. Decidable logics. Various decidable logic fragments have been also proposed to express properties of linked data structures. Nelson and Yao [31] provide a polynomial decision procedure for the quantifier-free fragment with Rf(x, y) and Rf(x, y). Ranise and Zarba [35] provide a decidable logic with NP-complete decision problem for reasoning about linked lists. Our contribution in this paper has been to show the minimal extension to the logic in [31] to obtain NP-hardness. PALE [28] uses monadic second-order logic to express properties of lists, trees, and graphs. The work on graph types by Klarlund et al. [20] allows common shapes, such doublylinked lists and threaded trees, to be concisely expressed. They also provide a decidable monadic second-order logic to check the wellformedness of graph type specifications. The logic Lr [5] was proposed to reason about reachability. However, the decidability results quickly break down in the presence of complex shapes and scalar values in the program. Predicate abstraction for discovering invariants. Recently, predicate abstraction [15] has been extended to construct invariants for linked-list programs. Dams and Namjoshi [8] proposed the use of reachability predicate to construct an abstraction of the concrete system. They provide a heuristic for predicate discovery based on constructing the weakest precondition of the reachability predicate. Manevich et al. [26] observe that the number of shared nodes in the heap consisting of singly-linked lists is statically bounded, and propose a family of predicates to exploit this observation. The idea has been implemented in TVLA. However, the method targets shape properties of lists and can't verify properties regarding the contents of a list as we do for the cyclic simple and insert examples. Balaban et al. [2] provide a decision procedure for restricted formulae involving the reachability predicate and provide a method to compute the abstraction. The restrictions on the reachability predicate does not allow them, for example, to express the loop invariant for the cyclic simple program from Section 1.1. In comparison to all of the above, our ability to harness invariant inference methods for first-order logic (e.g. indexed predicate abstraction) provides us with appreciable automation for programs that manipulate linked structures, arrays, and other scalar values. Separation logic. Separation logic [19, 37] is a promising idea for local reasoning of heap-manipulating programs. Separation logic naturally extends traditional Hoare-style reasoning to bear upon such programs. However, most of the work on separation logic focused on proving programs manually [33, 34]. Recently, Berdine et al. [6] have developed decision procedures for fragments of separation logic. However, we are not aware of any automated tools for program verification based on these decision procedures. Proofs of linked-list programs using separation logic use the reachability predicate; our axiomatization of reachability could potentially help with mechanizing such proofs. Alternatively, our work can benefit from separation logic specifications in future as we extend our work to the inter-procedural setting. 9. Conclusions Programs such as kernels of operating systems and device drivers manipulate a variety of data structures, such as arrays, singly and doubly linked lists, and hashtables. Current verification tools focus on control-dominated properties and are consequently unable to handle such programs in which the control flow interacts with the data in subtle ways. This paper is our first step towards the goal of building a scalable checker for verifying low-level data-rich systems software. In this paper, we presented a novel method for verifying linked data structures based on a first-order axiomatization of reachability with respect to a set of head cells. This axiomatization is based on the idea of a well-founded heap and has the advantage that acyclic and cyclic lists are handled uniformly with equal ease. We have implemented our method in a tool and used it to verify the correctness of a variety of nontrivial programs manipulating both acyclic and cyclic singly-linked lists and doubly-linked lists. There are several immediate directions for future work. We would like to verify more examples that use doubly-linked lists and arrays to evaluate the overhead of using our methodology and the adequacy of our list of derived first-order axioms. We have access to several programs from the Windows kernel that use rich data structures; we intend to evaluate our method on these programs. We would like to automate the inference of invariants more by developing heuristics for predicate discovery targeted towards linked lists. Finally, we would like to extend our work to deal with procedure calls and develop techniques to summarize procedures. Acknowledgements We are grateful to Venkatesan Guruswami for his help in proving Lemma 1 and to Thomas Ball for his insightful comments on this paper. 125 References [1] T. Andrews, S. Qadeer, S. K. Rajamani, J. Rehof, and Y. Xie. Zing: Exploiting program structure for model checking concurrent software. In CONCUR 04: 15th International Conference on Concurrency Theory, volume 3170 of LNCS, pages 1­15. Springer-Verlag, 2004. [2] I. Balaban, A. Pnueli, and L. D. Zuck. Shape analysis by predicate abstraction. In VMCAI 05: Verification, Model checking, and Abstract Interpretation, volume 3385 of LNCS, pages 164­180. SpringerVerlag, 2005. [3] T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani. Automatic predicate abstraction of C programs. In PLDI 01: Programming Language Design and Implementation, pages 203­213, 2001. [4] C. Barrett and S. Berezin. CVC Lite: A new implementation of the cooperating validity checker. In Computer Aided Verification (CAV '04), volume 3114 of LNCS, pages 515­518. Springer-Verlag, 2004. [5] M. Benedikt, Thomas W. Reps, and S. Sagiv. A decidable logic for describing linked data structures. In ESOP 99: European Symposium on Programming, volume 1576 of LNCS, pages 2­19. SpringerVerlag, 1999. [6] J. Berdine, C. Calcagno, and P. W. O'Hearn. A decidable fragment of separation logic. In FSTTCS 04: Foundations of Software Technology and Theoretical Computer Science, volume 3328 of LNCS, pages 97­109. Springer-Verlag, 2004. [7] R. E. Bryant, S. K. Lahiri, and S. A. Seshia. Modeling and verifying systems using a logic of counter arithmetic with lambda expressions and uninterpreted functions. In Computer-Aided Verification (CAV '02), volume 2404 of LNCS, pages 78­92, July 2002. [8] D. Dams and K. S. Namjoshi. Shape analysis through predicate abstraction and model checking. In VMCAI 03: Verification, Model checking, and Abstract Interpretation, volume 2575 of LNCS, pages 310­324. Springer-Verlag, 2003. [9] D. L. Detlefs, G. Nelson, and J. B. Saxe. Simplify: A theorem prover for program checking. Technical report, HPL-2003-148, 2003. [10] E.W. Dijkstra. A Discipline of Programming. Prentice-Hall, 1976. [11] G. Dong and J. Su. Incremental and decremental evaluation of transitive closure by first-order queries. Information and Computation, 120(1):101­106, 1995. [12] C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. Stata. Extended static checking for Java. In PLDI 02: Programming Language Design and Implementation, pages 234­ 245. ACM Press, 2002. [13] C. Flanagan and S. Qadeer. Predicate abstraction for software verification. In POPL 02: Principles of Programming Languages, pages 191­202. ACM Press, 2002. [14] R. Ghiya and L. J. Hendren. Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C. In POPL 96: Principles of Programming Languages, pages 1­15. ACM Press, 1996. [15] S. Graf and H. Sa¨idi. Construction of abstract state graphs with PVS. In Computer-Aided Verification (CAV '97), volume 1254 of LNCS, pages 72­83. Springer-Verlag, June 1997. [16] B. Hackett and R. Rugina. Region-based shape analysis with tracked locations. In POPL 05: Principles of Programming Languages, pages 310­323. ACM Press, 2005. [17] T. A. Henzinger, R. Jhala, R. Majumdar, and G. Sutre. Lazy abstraction. In POPL 02: Principles of Programming Languages, pages 58­70. ACM Press, 2002. [18] N. Immerman, A. M. Rabinovich, T. W. Reps, S. Sagiv, and G. Yorsh. The boundary between decidability and undecidability for transitiveclosure logics. In CSL 04: Computer Science Logic, volume 3210 of LNCS, pages 160­174. Springer-Verlag, 2004. [19] S. S. Ishtiaq and P. W. O'Hearn. BI as an assertion language for mutable data structures. In POPL 01: Principles of Programming Languages, pages 14­26. ACM Press, 2001. [20] N. Klarlund and M. I. Schwartzbach. Graph types. In POPL 93: Principles of Programming Languages, pages 196­205. ACM Press, 1993. [21] S. K. Lahiri and R. E. Bryant. Constructing quantified invariants via predicate abstraction. In VMCAI 04: Verification, Model Checking and Abstract Interpretation, volume 2937 of LNCS, pages 267­281. Springer-Verlag, 2004. [22] S. K. Lahiri and R. E. Bryant. Indexed predicate discovery for unbounded system verification. In Computer Aided Verification (CAV '04), volume 3114 of LNCS, pages 135­147. Springer-Verlag, 2004. [23] S. K. Lahiri and S. Qadeer. Verifying properties of well-founded linked lists. Technical Report MSR-TR-2005-97, Microsoft Research, 2005. [24] T. Lev-Ami, N. Immerman, T. W. Reps, S. Sagiv, S. Srivastava, and G. Yorsh. Simulating reachability using first-order logic with applications to verification of linked data structures. In CADE 05: Conference on Automated Deduction, volume 3632 of LNCS, pages 99­115. Springer-Verlag, 2005. [25] T. Lev-Ami and S. Sagiv. TVLA: A system for implementing static analyses. In SAS 00: Static Analysis Symposium, volume 1824 of LNCS, pages 280­301. Springer-Verlag, 2000. [26] R. Manevich, E. Yahav, G. Ramalingam, and M. Sagiv. Predicate abstraction and canonical abstraction for singly-linked lists. In VMCAI 05: Verification, Model Checking and Abstract Interpretation, volume 3148 of LNCS, pages 181­198. Springer-Verlag, 2005. [27] S. McPeak and G. C. Necula. Data structure specifications via local equality axioms. In Computer-Aided Verification (CAV '05), volume 3576 of LNCS, pages 476­490. Springer-Verlag, 2005. [28] A. Mller and M. I. Schwartzbach. The pointer assertion logic engine. In PLDI 01: Programming Language Design and Implementation, pages 221­231, 2001. [29] M. Moskewicz, C. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff: Engineering an efficient SAT solver. In 38th Design Automation Conference (DAC '01), 2001. [30] G. Nelson and D. C. Oppen. Simplification by cooperating decision procedures. ACM Transactions on Programming Languages and Systems (TOPLAS), 2(1):245­257, 1979. [31] G. Nelson and F. F. Yao. Solving reachability constraints for linear lists, 1982. Unpublished manuscript. [32] G. Nelson. Verifying reachability invariants of linked structures. In POPL 83: Principles of Programming Languages, pages 38­47. ACM Press, 1983. [33] P. W. O'Hearn, J. C. Reynolds, and H. Yang. Local reasoning about programs that alter data structures. In CSL 01: 15th International Workshop on Computer Science Logic, volume 2142 of LNCS, pages 1­19. Springer-Verlag, 2001. [34] P. W. O'Hearn, H. Yang, and J. C. Reynolds. Separation and information hiding. In POPL 04: Principles of Programming Languages, pages 268­280. ACM Press, 2004. [35] S. Ranise and C. Zarba. A decidable logic for pointer programs manipulating linked lists, 2004. Unpublished manuscript. [36] T. Reps, M. Sagiv, and A. Loginov. Finite differencing of logical formulas for static analysis. In ESOP 03: European Symposium on Programming, volume 2618 of LNCS, pages 380­398. SpringerVerlag, 2003. [37] J. C. Reynolds. Separation logic: A logic for shared mutable data structures. In LICS 02: Logic in Computer Science, pages 55­74. IEEE Computer Society Press, 2002. [38] R. Wilhelm S. Sagiv, T. W. Reps. Solving shape-analysis problems in languages with destructive updating. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(1):1­50, 1998. [39] M. Sagiv, T. Reps, and R. Wilhelm. Parametric shape analysis via 3-valued logic. In POPL 99: Principles of Programming Languages, pages 105­118. ACM Press, 1999. 126