TECHNIQUES FOR DEVELOPING CORRECT, FAST, AND ROBUST IMPLEMENTATIONS OF DISTRIBUTED PROTOCOLS

BY
AAMOD ARVIND SANE

THESIS
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1998

Urbana, Illinois

© Copyright by Aamod Arvind Sane 1998

Aamod Arvind Sane, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1998
Roy H. Campbell, Advisor

A distributed system must satisfy three requirements: it should correctly implement process interactions to realize desired behavior, it should exhibit satisfactory performance, and it should have a robust software architecture that accommodates changing requirements. This thesis presents research that addresses each of these concerns.

The thesis presents new techniques for designing protocols that coordinate process interactions. The specification technique allows designers to design protocols by top-down refinement. Refinement steps divide the original protocol into sub-protocols that have smaller state spaces than the original protocol. Therefore, the divided protocols can be automatically verified without encountering state-space explosion. The complete protocol is synthesized by composing the divided protocols.

The thesis also shows how protocols can be tailored for improved performance. A new technique for designing high-performance distributed shared memory consistency protocols is presented. The technique optimizes consistency protocols by using information about previous memory accesses to anticipate future communication. Such anticipation allows communication to overlap with computation, resulting in improved application performance.

Finally, the thesis presents a software architecture for implementing systems with interacting distributed objects.
The architecture allows systems to be incrementally extended with new objects and new operations, including operations over objects on remote systems. This is achieved using design patterns, and a novel scheme for incremental construction of state machines. The architecture was used to build a virtual memory system that is smoothly extended to support distributed shared memory.

TABLE OF CONTENTS

Chapter

1 Introduction
  1.1 Contributions
  1.2 Thesis Outline

2 A Protocol Design Technique
  2.1 Goal
    2.1.1 The Problem
    2.1.2 Our Solution
    2.1.3 Summary
  2.2 Background and Related Work
    2.2.1 Verification Systems
    2.2.2 High-Level Service Specification
    2.2.3 Synthesis Methods
    2.2.4 Our Approach
  2.3 The Synthesis Method
    2.3.1 Synthesis
    2.3.2 Process and System
    2.3.3 Automata
    2.3.4 Automata and Processes
    2.3.5 Protocols
    2.3.6 Protocol Synthesis
  2.4 Specifying Coordination
    2.4.1 Constraint-Rule Specifications
    2.4.2 Action-Rule Specifications
    2.4.3 Observation-Rule Specifications
    2.4.4 Proving Implementation
  2.5 Implementing Constraints, Actions, and Observations
    2.5.1 Synthesizing Constraint Rules
    2.5.2 Synthesizing Action Rules
    2.5.3 Observations via Memory and Messages
  2.6 Summary

3 Distributed Shared Memory Consistency Protocols
  3.1 Goal
    3.1.1 The Problem
    3.1.2 Our Solution
  3.2 Background and Related Work
    3.2.1 Sequential Consistency
    3.2.2 Beyond Sequential Consistency
    3.2.3 Synchronization in Distributed Shared Memory
    3.2.4 Our Approach
  3.3 Coordinated Memory
    3.3.1 Adaptive Barriers
    3.3.2 Other Adaptive Constructs
  3.4 Designing Consistency Protocols
    3.4.1 Consistency Specification
    3.4.2 Adaptive Barrier
    3.4.3 Summary
  3.5 Implementation and Performance
    3.5.1 Experimental Platform
    3.5.2 Applications
  3.6 Summary

4 A Software Architecture
  4.1 Goal
    4.1.1 The Problem
    4.1.2 Our Solution
  4.2 Background and Related Work
    4.2.1 Basic Objects
    4.2.2 Interactions
    4.2.3 Operations
  4.3 Why the New Architecture
    4.3.1 Examples
    4.3.2 Why Change is not Easy
  4.4 What Needs to be Redesigned
    4.4.1 Data Structures and Synchronization
    4.4.2 Interactions
    4.4.3 A Solution
  4.5 Architecture of the Virtual Memory System
    4.5.1 Exporting Functionality
    4.5.2 Organizing the Internals
    4.5.3 Concurrency Control
    4.5.4 Operations Using Object-Oriented State Machines
    4.5.5 Implementing Remote Interactions
    4.5.6 Dynamic Page Distribution
  4.6 Summary

5 Conclusion
  5.1 Summary
  5.2 Future Research

Bibliography
Chapter 1
Introduction

This thesis presents techniques for the design and implementation of protocols that coordinate the actions of concurrent processes in a distributed system. The design of novel memory consistency protocols for a distributed shared memory system illustrates the application of these techniques. The system is implemented using a new software architecture for designing object-oriented systems with concurrent and distributed operations.

Protocols are difficult to design because systems of interacting concurrent processes exhibit a large number of behaviors. Therefore, computer-aided methods are used for protocol design. Currently, such methods can be classified as either verification methods or synthesis methods. Verification methods let users model the protocols in a suitable language, and check that the model obeys desired properties by exhaustive search of the system state space. But the detailed, low-level models often result in very large state spaces. The search is made tractable by exploiting patterns in the state space to reduce the number of states actually examined. Even so, many practical protocols remain beyond the reach of exhaustive search. Synthesis methods avoid building complex low-level models. Instead, they translate high-level specifications to low-level implementations. But these methods often require manual proofs, or are useful only in restricted cases such as peer-to-peer communication protocols. Ideally, we would like a design method that combines the clarity of the high-level specifications of synthesis methods with the automated checking characteristic of verification methods.

In this thesis, we develop such a design method. We introduce an approach for dividing the task of protocol design into several steps. The division produces protocols that have small state spaces, either because they are abstract or because they implement parts of the original protocol. Therefore, their correctness can be easily established using verification tools.
We then show how to implement the divided protocols so that the complete protocol can be synthesized by combining the divided protocols.

We have applied the synthesis method to guide the implementation of new distributed shared memory consistency protocols. A distributed shared memory (DSM) system simulates shared memory over networked computers. DSM systems allow programs designed for shared memory multiprocessors to be used over networked computers. DSM systems use the local memories of the networked computers as caches for the simulated shared memory. Just as in shared memory multiprocessors, caches in DSM systems replicate the shared data for efficiency, but then require protocols to ensure that the replicas remain consistent.

In this thesis, we develop consistency protocols that allow DSM systems to operate efficiently over wide-area networks characterized by high-latency, high-bandwidth interconnections. A protocol that performs well over a wide-area network must be able to utilize the bandwidth to overcome latency. Our protocols gain their efficiency by using information about process synchronization and past memory access patterns to predict future requests from other processes. This technique reduces the time processes spend waiting for data to arrive. When computations are regular, this anticipatory communication overlaps communication and computation, giving good speedups for distributed shared memory programs over wide-area networks.

Protocol implementations derived by our method are state machines that define protocol behavior. However, the programmer is still left to manage myriad details of the implementation environment. In our case, the protocol implementation has to be a part of a virtual memory system that supports distributed shared memory.

In this thesis, we present a software architecture for building object-oriented systems that have many concurrent operations on groups of objects.
The architecture allows the system to be incrementally extended with new objects and new operations. It smoothly implements interactions between objects on remote systems. In the course of designing the architecture, we have discovered several design patterns, and a new technique for constructing state machines incrementally using an object-oriented approach. The architecture is used to build a virtual memory system. The resulting system is flexible: beginning with simple virtual memory facilities, we extended it with facilities like distributed shared memory in an orderly manner.

1.1 Contributions

This thesis makes the following contributions:

• A method for designing process coordination protocols based on
  – A family of notations to express protocols at different levels of abstraction.
  – A set of transformations to refine protocols from one level to the next.
  – Application of the method to design memory consistency protocols.

• Distributed shared memory consistency protocols that
  – Improve over the performance of existing protocols.
  – Perform well over both wide-area and local-area networks.

• A software architecture for object systems with concurrent operations on groups of objects. The architecture is based on:
  – Object-oriented state machines that facilitate construction of state machines by inheritance, composition, and other object-oriented techniques.
  – Design patterns that simplify concurrency control, remote interactions, and resource management.

1.2 Thesis Outline

In Chapter 2, we present our method for synthesizing distributed shared memory protocols. We begin with a review of background and related work and identify our contribution. Next, we present the basic theory and discuss the notations we use at different levels of abstraction. After that, we present a set of transformations for synthesizing the protocol implementation from a specification. We then show how to interpret the implementation as a shared memory or message passing program.
In Chapter 3, we develop our consistency protocols. We present the evolution of consistency protocols, and highlight our approach. Then we explain and formally specify our protocols and comment on the implementation. We conclude this part with performance results.

In Chapter 4, we present our new software architecture. We use the design of a virtual memory system as the primary example. First we explain the usual architecture of virtual memory systems, motivating the basic objects and operations. Then we critique it by considering the impact of changes, and motivate the new architecture. The architecture is discussed subsequently.

In Chapter 5, we review the contributions and identify problems for future research.

Chapter 2
A Protocol Design Technique

In this chapter, we present a new technique for designing finite-state process coordination protocols. We begin by presenting the problem and our solution in brief. Then we examine the background research in detail, and contrast our solution with it. The rest of the chapter presents the formal details.

2.1 Goal

2.1.1 The Problem

Protocols that describe the behavior of systems with concurrent interacting components are difficult to design, because such systems exhibit a large variety of behaviors. A human designer may overlook undesirable interactions in the system, leading to errors such as deadlock. So an automated method for synthesizing such systems is highly desirable.

There are two types of approaches to computer-aided protocol design: verification methods and synthesis methods. Verification methods help debug previously designed protocols by exhaustive search of state spaces, while synthesis methods start with protocol specifications and translate them to low-level protocol implementations.

We used these methods in our research for designing distributed shared memory consistency protocols. These protocols describe systems that have a very large number of states.
Therefore, we could only verify simplified versions of the protocols. We also attempted to use synthesis methods. But the methods that had tool support were designed for synthesizing peer-to-peer communication protocols or OSI protocol stacks. Synthesis by hand, based on specifications in algebraic or logical languages that could describe multi-party protocols, requires manual proofs of the specification. Such proofs were practical only for simple versions of the protocol.

Thus, while verification methods are applicable to a wide class of protocols, they are limited by the need to describe protocols in detail, as well as by the limitations of exhaustive search. On the other hand, synthesis methods provide abstract protocol description languages, but the methods require manual correctness proofs. Also, the abstract descriptions can be more difficult to produce than low-level descriptions based on communicating automata. This experience suggested the need for a design method that could combine the desirable attributes of both verification and synthesis methods.

2.1.2 Our Solution

We introduce an approach for dividing the task of protocol design into several steps. At each step, the protocol is simple enough that exhaustive search is tractable, so that verification tools can be used to establish correctness. The simplification is achieved using notations that support abstraction and decomposition. We introduce a family of notations that are all communicating automata, except that the communication is expressed at different degrees of abstraction. We chose notations based on communicating automata because communicating automata are a familiar model used in popular verification tools.

In the first step, a designer uses specifications that express process coordination as abstract predicates that suppress details of communication and control. Whole-system verification is done at this step.
In the second step, the designer produces implementations of the predicates in a notation that expresses control but not the details of communication. Here we verify each predicate implementation separately. The design method requires the implementations to have certain properties that permit composition, so that the composite system does not have to be verified again. In the third step, the predicate implementations are translated to protocols that express details of communication media. Again, these translations obey conditions that allow safe composition. Composing the translations completes the protocol synthesis. We expand on these ideas in the following.

Step 1. In our method, the initial design is specified by automata that describe only the desired coordination between automata, without saying how it is implemented. Thus, the initial models are extensional, abstract, and relatively simple. The notation we use formalizes a common way to describe process coordination. For example, mutual exclusion between two processes is often described as follows: "when one process is in its critical region, the other should not be in its critical region." Here, we divide the execution of a process into regions, and express coordination as a predicate on the regions of interacting processes. Our first notation describes processes as automata, and coordination as predicates on their states. The notation suppresses details of how processes control each other to implement the predicates, as well as details of communication, and leads to models with small state spaces. We use verification tools to verify deadlock freedom, liveness, and other system properties.

Step 2. The next step in a design is to show how to implement the predicates. Most distributed systems support communication mechanisms like message passing that allow one process to control another process unidirectionally. Bidirectional control is achieved by some form of request-response interaction.
Our next model describes how a process controls another by unidirectional actions. At this level, we do not model elements like message queues. For example, consider a predicate over two processes that says: "when one process enters region x, the other should enter y". This might be used to describe a process opening a TCP connection. One implementation might be: "First, the target waits for the connector's request. Then the connector sends its request and waits for the reply". We formalize such a description with a notation where transitions in one process may enable or disable transitions in another process. We use verification tools to ensure that such an implementation correctly implements a predicate.

The original, abstract protocol may have several predicates. We establish conditions to ensure that implementations of the predicates can be composed without loss of correctness. Thus, the abstract protocol can be translated to the lower-level protocol by translating each predicate separately.

Step 3. In the final step, we use a notation that models communication media. The idea is that one process may "observe" the current state of another process and use the information to choose its transitions. An observation is easily implemented in a message passing system: a process can request the current state of another process, and wait for the response. Similarly, in a shared (or distributed) memory system, one process may observe the state of another process by reading shared (distributed) variables. Observation protocols are used to implement process control: a transition of process P that requires process Q to be in a certain state is disabled when Q is in a different state. We use verification tools to ensure that an observation protocol correctly implements process control, and hence the second-level protocols. Again, we establish conditions to ensure that implementations can be composed safely. These conditions carry over from the second-level protocols.
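The observation idea of Step 3 can be illustrated with a small sketch. The Python below is not the notation developed in this thesis; it is a toy model under our own assumptions. The state names (idle, trying, critical), the guard table, and the helper functions are hypothetical, and observing the other process and taking a transition are assumed to form one atomic step, which a real shared memory or message passing implementation would have to ensure. Each transition carries a guard naming the states in which the other process must be observed, and an exhaustive search of the joint state space checks that the mutual exclusion predicate holds and that no deadlock is reachable.

```python
from collections import deque

# Hypothetical process automaton. Each transition carries a guard: the set of
# states in which the other process must be observed for the transition to fire.
ANY = {"idle", "trying", "critical"}
GUARDED = {
    "idle": [("trying", ANY)],
    "trying": [("critical", {"idle", "trying"})],  # disabled while peer is critical
    "critical": [("idle", ANY)],
}


def successors(joint):
    """Steps of one process at a time; observe-and-step is assumed atomic."""
    result = []
    for i in (0, 1):
        observed = joint[1 - i]  # current state of the other process
        for nxt, guard in GUARDED[joint[i]]:
            if observed in guard:
                result.append((nxt, joint[1]) if i == 0 else (joint[0], nxt))
    return result


def explore(initial):
    """Exhaustive breadth-first search of the joint state space."""
    seen, deadlocks, queue = {initial}, [], deque([initial])
    while queue:
        joint = queue.popleft()
        nexts = successors(joint)
        if not nexts:
            deadlocks.append(joint)  # no process can move from this state
        for n in nexts:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return seen, deadlocks


reachable, deadlocks = explore(("idle", "idle"))
# Joint states that violate the mutual exclusion predicate.
violations = [s for s in reachable if s == ("critical", "critical")]
```

In this toy model the search reaches eight of the nine joint states and finds neither a violation nor a deadlock. Removing the guard on the transition into the critical state makes the ninth state reachable, which is the kind of error a verification tool would report.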
Thus, the abstract protocol can be translated to observation-based protocols, and hence to shared memory and message passing programs.

2.1.3 Summary

Our approach has several advantages. The notations based on communicating automata are familiar, and allow us to use popular protocol verification tools based on communicating automata. The design approach allows us to simplify protocols, first by abstraction, and then by decomposition, so that the state space presented to a verifier is smaller. The abstract predicates on processes can be implemented in various ways, and different implementations can coexist in a synthesized protocol. Also, we can use known algorithms to implement the predicates, as long as we ensure that the composition conditions hold.

In this thesis, we develop the formal basis for the approach. We have used it to design the consistency protocols for distributed shared memory, described in Chapter 3.

2.2 Background and Related Work

We describe some of the previous research on protocol design. We then relate our approach to this work.

2.2.1 Verification Systems

Verification systems (also called model checkers) such as SPIN [Hol91], SMV [McM92], and Murφ [Dil96] are designed to verify the correctness of finite-state distributed protocols. Each verification system provides a language for a precise and understandable mathematical model of the system. For instance, SPIN uses communicating finite state machines [BZ83]. Another formal language allows the user to specify correctness predicates; the systems mentioned above use variants of temporal logic [MP91]. Temporal logic allows the user to express notions such as "if process P takes action a, Q will eventually respond with action b". The systems include algorithms that examine the complete state space of a system and verify that the state graph satisfies the correctness criteria [Hol91].

These systems have two drawbacks.
First, the modeling languages are at a fairly low level (messages in SPIN, shared memory in SMV), so that constructing the models is tedious and error prone. Second, exhaustive exploration of the system state space can be intractable: this is the state explosion problem. Recent research has concentrated on developing techniques for checking correctness without exhaustive analysis, as well as methods for managing large state spaces in limited memory.

Partial-order methods [WG93] attempt to eliminate states that arise from modeling concurrency as interleaving. If the state transitions of two processes are independent, then in an execution of the system, the transitions may be permuted without affecting correctness of the execution. Therefore, instead of examining all permutations, we can examine the state space for an arbitrary permutation of independent transitions and still check correctness. Moreover, dependencies among the state transitions of a finite-state system can be approximated by examining the source code. Using such dependency information, partial-order methods guide the search over a limited part of the system state space.

Symbolic model checking [McM92] uses binary decision diagrams (BDDs) to represent the state space. The symbolic representation allows compact storage of a large state space. Algorithms that search the state space to verify correctness can be changed to operate directly on the BDD representation. BDDs work best for digital circuitry that has many replicated components. Traditional state exploration may outperform BDDs for distributed protocols [Hu95].

Fair reachability [GH85, GY84, LM95] methods force state transitions of processes in a distributed system and explore the resulting state space. This space is smaller than the state space generated when some processes do not take steps. The smaller space is sufficient to check some properties like deadlocks.
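The saving that partial-order methods aim for can be seen in a toy case. The sketch below is not a partial-order reduction algorithm, and it is not taken from any of the tools cited above. It only assumes n processes that each have a single transition (from 0 to 1) that is independent of all the others, and it compares the states visited by a full interleaving search with the states visited along one fixed ordering of the same transitions. The function names are ours.

```python
def full_interleaving(n):
    """Visit every state reachable when any process may step next."""
    initial = (0,) * n
    seen = {initial}
    stack = [initial]
    while stack:
        state = stack.pop()
        for i in range(n):
            if state[i] == 0:  # process i has not yet taken its one transition
                nxt = state[:i] + (1,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


def one_representative_order(n):
    """Visit only the states along the ordering process 0, 1, ..., n-1."""
    state = (0,) * n
    visited = [state]
    for i in range(n):
        state = state[:i] + (1,) + state[i + 1:]
        visited.append(state)
    return visited


n = 10
all_states = full_interleaving(n)
path = one_representative_order(n)
print(len(all_states), len(path))
```

For n = 10, the full search visits 2^10 = 1024 states, while the single ordering visits 11 and ends in the same final state. Real protocols mix dependent and independent transitions, so a practical method must first establish which transitions are independent before it can prune interleavings in this way.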
Abstraction [Lon93, PD97], Abstract Interpretation [Lon93], and Composition [Lon93] based approaches have been developed to present model checkers with simpler systems to verify. These approaches use user-defined equivalence relations [Pon95], induction over replicated components [McM92], symmetry [Ip96], language containment [Kur94], and similar techniques to eliminate irrelevant states in a system. In methods based on abstraction, it is enough to check the abstract model to ensure that a property that holds in the abstracted system is really true of the actual system. Methods based on composition and simulation use theorems that show how to decompose system properties of interest when verifying components or simulations. All these methods require human intervention.

Methods for managing a large number of states in limited memory include techniques such as supertrace [Hol91] and hash compaction [WL93]. These methods use hash tables to remember whether a state has been reached in the exhaustive search. The hash table only stores an approximate description of a state, so that there is a small probability that one state is mistaken for another. Thus, the exhaustive search omits some states, and some system errors will not be detected. On the other hand, many more states can be stored in the same amount of memory, so that approximate search is applicable to larger systems. State space caching methods [GHP92] use memory as a cache, trading verification time for memory.

The verification systems have a significant drawback: the use of low-level models first introduces irrelevant system states, and then techniques like partial-order methods attempt to extract an abstract system description.

2.2.2 High-Level Service Specification

Other research, such as path expressions [Cam74] and the logic of knowledge [HM90], has concentrated on high-level notations for describing protocols.
Designers often informally describe relationships between the processes in distributed systems in terms of what one process "knows" about another process. For instance, in the description of TCP [Pos81] we find: "An established connection is said to be half-open if one of the TCPs has closed or aborted the connection at its end without the knowledge of the other, . . . ". The logic of knowledge formalizes this notion of knowledge so that programs may include knowledge statements directly without referring to the method for gaining and losing knowledge. Such knowledge protocols are abstract and easy to specify [HZ87]. Some results [CM86] hint at ways to implement gain and loss of knowledge. But so far, reducing knowledge specifications to actual programs has proven difficult [FHMV95].

Path expressions [Cam74] are a well-known and easy-to-use notation for specifying process coordination. Path expressions are regular expressions that describe the sequences of process activities in a distributed system. Campbell [Cam76] investigates several variants of path expressions. For some restrictive types of path expressions, it can be proven a priori that problems like deadlock do not exist, and there are known algorithms to translate such path expressions to low-level P and V operations. But more expressive notations may be difficult to understand and implement [Hol91].

While verification systems use temporal logic to specify correctness predicates, there have been attempts to use it to specify systems. Temporal logic has been used as a programming language [Gab87]. However, descriptions based purely on temporal logic have proven difficult to understand in practice [Lam94].

2.2.3 Synthesis Methods

Synthesis methods translate a high-level specification to a low-level language like communicating finite state machines or CSP.

Tableau-based methods [MW84] translate specifications in temporal logic to languages like CSP or Büchi automata.
The synthesis method produces a model for the formula as an existence proof. Tableaus were developed as proof techniques for mathematical logic. A tableau is a systematic way of decomposing a logical formula into subformulae until we reach elementary formulae. The truth of elementary formulae can be easily verified, and the tableau structure ensures that we verify enough elementary formulae to guarantee the truth of the original formula. When applied to temporal logic, the tableau can be interpreted as an automaton [MW84]. The automaton is then regarded as a centralized synchronizer for all processes that interleaves their actions so that the temporal formulae hold on the resulting sequence of actions. But such a centralized solution is undesirable in practice. Also, as noted above, descriptions based purely on temporal logic have proven unwieldy.

Finite-state methods are used in synthesizing communication protocols. They begin with a description of all desirable interactions in the system to be designed, and decompose them into communicating finite state machines. But these methods are often limited in various ways and appear to be too inflexible for use in practice [PS91]. The approach of specifying desirable interactions seems applicable only to small systems [PS91], and decomposition is a difficult problem [PS91].

In a related method [BZ83], the user starts with a dummy initial state for each process in the system to be synthesized. The user then specifies message transmissions for each process, and the synthesis software deduces the corresponding message receptions. The software traces all possible states where a reception may occur and updates the receiver state machine. After each update, the system warns the user if there are states without any messages in transit and none to be transmitted. Such states correspond to deadlock situations.
Conformance to the service specification is not guaranteed by the method, although verification methods can be used after the synthesis is complete.

Translation methods [KHvB92] translate specifications in notations like LOTOS [BvdLV95] to message exchanges. The specifications define an ordering of operations, and the translation methods produce state machines that generate the sequences. These are suitable where service specification can be done as sequences of operations. But specifications are often done in other styles [VSvSB91].

2.2.4 Our Approach

Our work was inspired by research on the logic of knowledge. This research showed that notions like "a process knows" were enough to express many interesting protocols succinctly. The treatment by Chandy and Misra [CM86] reduced the logical operators to an algebraic form. Path expressions [Cam74] and LOTOS [BvdLV95] were earlier examples of the use of conjunction and disjunction predicates. We combined these ideas with the observation that the operators could be regarded as an abstract form of communication between communicating automata. This combination leads to succinct specifications that can be checked by verification tools developed for communicating automata.

The next question was how to describe the implementations of constraints without modeling the peculiarities of communication media. This would allow us to model control flow without the extraneous system states introduced by communication media. The model we use here is similar to the LOTOS and path expression operators that permit specifying orders of execution. The novelty is in showing that the implementations can be composed in a way that they do not interfere with one another.

The final model is similar to the usual communicating finite state machines with single-element queues. The difference is that we communicate the current state rather than unstructured values. This makes it easy to translate the protocols to either shared memory or message passing with optimizations.
Our approach gives a design technique that allows designers to simplify protocols by decomposition and abstraction. Since our development, we have found that the LOTOSPHERE [BvdLV95] project has informally described the idea of design styles that mirror our own. They observe that experience shows that early specifications are best described in a Constraint-oriented style, while later designs are best described in a State-oriented style. Our design method can be seen as a formalization of this observation. This unexpected similarity between our development and LOTOS research has strengthened our belief in the utility of the method.

2.3 The Synthesis Method

In this section, we present the formal details of the synthesis method. We introduce our models of process and distributed system, and show how we use automata to denote processes. Then we discuss our three notations for describing protocols. The first notation represents communication using abstract operations. The next two notations refine these operators so that they can be implemented using shared memory and message passing programs. For each notation, we show how to prove that one protocol implements another. Then we describe some implementations for the abstract communication operators, and show how the implementations can be expressed using shared memory and message passing.

2.3.1 Synthesis

Let $Beh$ be a set of desired behaviors, such as a set of sequences of events. Let $L_s$ be a specification language and $L_i$ an implementation language that specify desired subsets of $Beh$. Let the meaning of a specification be given by the function $[\![\cdot]\!]_s : L_s \to 2^{Beh}$, and the meaning of an implementation by $[\![\cdot]\!]_i : L_i \to 2^{Beh}$. Then we define the problem of synthesis as follows.

Definition 1 A synthesis method is a total function $S : L_s \to L_i$ such that given a specification $\phi$, $S(\phi)$ denotes the same behaviors as $\phi$, $[\![S(\phi)]\!]_i = [\![\phi]\!]_s$.
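Definition 1 can be made concrete with a deliberately tiny instance. The sketch below is illustrative only and is not part of the method developed in this thesis: it assumes that a specification is a finite set of words, that the implementation language is a trie-shaped deterministic automaton, and that both meanings are sets of words. The names `synthesize`, `meaning_spec`, and `meaning_impl` are hypothetical, introduced for this example.

```python
from typing import Dict, FrozenSet, Set, Tuple

Spec = FrozenSet[str]                                  # a finite set of words
Impl = Tuple[Dict[Tuple[int, str], int], Set[int]]     # (transitions, finals); state 0 is initial


def meaning_spec(spec: Spec) -> Set[str]:
    """[[.]]_s: a specification denotes exactly the words it lists."""
    return set(spec)


def synthesize(spec: Spec) -> Impl:
    """S: build a trie-shaped automaton that accepts the listed words."""
    delta: Dict[Tuple[int, str], int] = {}
    finals: Set[int] = set()
    next_state = 1
    for word in sorted(spec):
        state = 0
        for letter in word:
            if (state, letter) not in delta:
                delta[(state, letter)] = next_state
                next_state += 1
            state = delta[(state, letter)]
        finals.add(state)
    return delta, finals


def meaning_impl(impl: Impl) -> Set[str]:
    """[[.]]_i: enumerate the words accepted by the (acyclic) automaton."""
    delta, finals = impl
    accepted: Set[str] = set()
    stack = [(0, "")]
    while stack:
        state, prefix = stack.pop()
        if state in finals:
            accepted.add(prefix)
        for (src, letter), dst in delta.items():
            if src == state:
                stack.append((dst, prefix + letter))
    return accepted


spec = frozenset({"ack", "nak", "acks"})
# The equation of Definition 1, checked for one specification.
assert meaning_impl(synthesize(spec)) == meaning_spec(spec)
```

The final assertion is the equation $[\![S(\phi)]\!]_i = [\![\phi]\!]_s$ checked for a single specification; a synthesis method must satisfy it for every specification in $L_s$.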
A classic example of synthesis is the construction of a finite state machine that recognizes a set of ASCII words $\omega$, given a regular expression that specifies $\omega$. Here, $L_s$ is the language of regular expressions, $L_i$ the description of automata, and $Beh$ is the set of all ASCII words. The synthesis method inductively translates the regular expression into a finite state machine.

A protocol is a set of rules that describes how processes in a distributed system interact. For example, a file transfer protocol is a set of rules followed by processes on two machines in order to transfer files from one machine to another. A protocol synthesis method takes a description of the externally visible behavior $b$ for a set of processes, and produces programs that the processes must execute in order to implement behavior $b$.

We define processes and distributed systems as sequences of abstract events. The events represent activities such as memory accesses or message transmission. Protocols are specified using automata. The behavior of each process is specified with an automaton, and the joint behavior of a distributed system by the product of these automata. The state transitions in an automaton that represents a process may depend on the transitions of automata that represent other processes. The rules that govern this dependence constitute a model of process communication. Our synthesis method begins with a high-level specification with an abstract form of communication rules. Through a series of intermediate steps, the high-level specification is translated to programs that use shared memory or message passing for communication. In the following, we make these ideas more precise.

2.3.2 Process and System

Definition 2 A process $P$ is a pair $(E_p, R_p)$ where $E_p$ is a finite set of events and $R_p$ is a set of runs, a set of infinite sequences over $E_p$. The events of two processes are disjoint: for every pair of distinct processes $P$ and $Q$, $E_p \cap E_q = \emptyset$.
Definition 3 A distributed system $\mathcal{P}$ is a pair $(E, R)$ where $E = \bigcup_{P \in \mathcal{P}} E_P$, and $R$ is a set of system runs, a set of sequences over $E$ such that for every run $\rho \in R$ and every process $P$, the projection $\rho|_P$ is a run of $P$, $\rho|_P \in R_P$.

The runs of processes are specified using automata, and the runs of a distributed system as the product of the automata of the constituent processes.

2.3.3 Automata

Definition 4 A finite state automaton is a tuple of the form $(\Sigma, S, \delta, S_0, S_F)$, where $\Sigma$ is a nonempty finite alphabet, $S$ a nonempty finite set of states, $\delta$ a transition relation, $\delta \subseteq S \times \Sigma \times S$, $S_0$ a nonempty set of starting states and $S_F$ a nonempty set of final states. Also, all states of $S$ are reachable, i.e., for all $s \in S$, there is a sequence $s_0, \ldots, s_n$ where $s_n = s$, $s_0 \in S_0$, and for $0 \le i < n$, $(s_i, a, s_{i+1}) \in \delta$ for some $a \in \Sigma$.

Let $\Gamma$ be the set of transitions $\{(s, s')\}$ such that for some letter $a \in \Sigma$, $(s, a, s') \in \delta$. A trace $t$ of automaton $A$ on a word $w = a_0 \ldots a_{n-1}$ in $\Sigma^*$ is a sequence of states $s_0, \ldots, s_n$ where $s_0 \in S_0$ and for every $s_i, s_{i+1}$ in $t$, $(s_i, a_i, s_{i+1}) \in \delta$. Note that a trace can also be thought of as a sequence of transitions, $(s_0, s_1), (s_1, s_2), \ldots$. States and transitions that constitute a trace are said to occur in that trace. The automaton $A$ accepts a word $w$ if the last state $s_n$ is a final state, $s_n \in S_F$. The set of all words accepted by an automaton is the language $L$ of the automaton.

We are interested in automata that accept infinite words. Let $w = a_0 a_1 \ldots$ be an infinite word and $t$ be a trace $s_0, s_1, \ldots$ over $w$. Let the limit of a trace $t$ be the set of states that appear infinitely often in $t$, $\lim(t) = \{s \mid s = s_i \text{ infinitely often}\}$. Then $A$ accepts $w$ if there is a state $s \in S_F$ that appears infinitely often in $t$, $\lim(t) \cap S_F \neq \emptyset$. This condition, Büchi acceptance, defines Büchi automata. The language of the automaton is the set of infinite words $L_\omega$ that is accepted by the automaton.
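The Büchi condition can be checked mechanically for ultimately periodic words, that is, words of the form prefix followed by a cycle repeated forever. The following is a minimal sketch under two assumptions that are not made in the text: the automaton is deterministic, and the infinite word is given as a finite prefix plus a non-empty cycle. The function names `limit_set` and `buchi_accepts` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Delta = Dict[Tuple[str, str], str]   # deterministic: (state, letter) -> next state


def limit_set(delta: Delta, start: str, prefix: str, cycle: str) -> Optional[Set[str]]:
    """Return lim(t) for the trace on prefix . cycle^omega, or None if the run gets stuck."""
    if not cycle:
        raise ValueError("cycle must be non-empty")
    state = start
    for letter in prefix:
        if (state, letter) not in delta:
            return None
        state = delta[(state, letter)]
    seen_at_boundary: Dict[str, int] = {}   # state at the start of a pass -> pass index
    rounds: List[List[str]] = []            # states visited during each pass over the cycle
    while state not in seen_at_boundary:
        seen_at_boundary[state] = len(rounds)
        visited = []
        for letter in cycle:
            if (state, letter) not in delta:
                return None
            state = delta[(state, letter)]
            visited.append(state)
        rounds.append(visited)
    # From the first pass that started in the repeated state, the run loops forever.
    first = seen_at_boundary[state]
    return {s for visited in rounds[first:] for s in visited}


def buchi_accepts(delta: Delta, start: str, finals: Set[str], prefix: str, cycle: str) -> bool:
    """Buchi acceptance: some final state occurs in lim(t)."""
    lim = limit_set(delta, start, prefix, cycle)
    return lim is not None and bool(lim & finals)


# Example automaton: q1 is visited infinitely often exactly when 'a' occurs infinitely often.
delta = {("q0", "a"): "q1", ("q0", "b"): "q0", ("q1", "a"): "q1", ("q1", "b"): "q0"}
assert buchi_accepts(delta, "q0", {"q1"}, "b", "ab")
assert not buchi_accepts(delta, "q0", {"q1"}, "a", "b")
```

Because the automaton is deterministic and finite, the state at the start of each pass over the cycle must eventually repeat, so the loop terminates and the states visited from that point on are exactly $\lim(t)$.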
Since we specify distributed systems as products of automata, we use a slightly different presentation of the acceptance condition called generalized acceptance [GW94]. Let $\mathcal{F} = \{F_1, \ldots, F_k\}$, $F_i \subseteq S$, $k \ge 0$ be a set of sets of accepting states. Then the automaton $A$ accepts $w$ if for every $F_i$, $\lim(t) \cap F_i \neq \emptyset$. If there is only one $F_i$, then the condition is the same as Büchi acceptance. Here, $A$ is intended to be the product of automata $A_1, \ldots, A_k$. The condition ensures that accepted sequences are those where every automaton goes through its accept state infinitely often. Thus we enforce fairness by requiring that every automaton must make progress.

Definition 5 A generalized Büchi automaton is a finite state automaton that accepts infinite words under the generalized acceptance condition.

Henceforth, we will assume that every automaton is equipped with a generalized acceptance condition.

2.3.4 Automata and Processes

We use automata to denote processes. Intuitively, we want either automata states or automata transitions to represent finite sequences of process events. For example, when describing mutual exclusion, we refer to sequences of critical and non-critical states. But when describing serial communication, we might use send and receive transitions. Technically, we achieve this by defining the alphabet $\Sigma$ to be a set isomorphic to either the set of states $S$ or the set of transitions $\Gamma$. In the first case, accepted words define acceptable sequences of states. In the second case, accepted words define acceptable sequences of transitions.

Correspondence between runs and words is established through a semantic mapping from letters to process events. Each letter corresponds to a set of finite sequences of process events. This mapping is inductively extended to map words (sequences over the alphabet) to runs (sequences over events).

Let $A$ be an automaton and $P$ a process. Let $[\![\cdot]\!]$ be a function, $[\![\cdot]\!] : \Sigma_A \to 2^{E_p^*}$, such that for every pair of distinct letters $a, b \in \Sigma_A$, $[\![a]\!] \cap [\![b]\!] = \emptyset$. The semantics is defined as follows:

Definition 6 Automaton $A = (\Sigma, S, \delta, S_0, S_F)$ is said to denote process $P = (E, R)$ if there exists a (nondeterministic) function $[\![\cdot]\!] : \Sigma_A \to 2^{E_p^*}$ extended to the words accepted by $A$ as follows: $[\![\epsilon]\!] = \epsilon$, and $[\![aw]\!] = [\![a]\!][\![w]\!]$ for letter $a$ and word $w$.

If $\Sigma$ is isomorphic to $S$, events of $P$ are represented by the states of $A$. If $\Sigma$ is isomorphic to $\Gamma$, events of $P$ are represented by transitions of $A$.

2.3.5 Protocols

Protocols describe the behavior of distributed systems. A protocol is defined by automata products with restrictions. In a protocol, individual automata denote process behavior, the product denotes system behavior, and the restrictions on the product model communication. We first define the notion of automata products without restriction.

Definition 7 Given $A = (\Sigma_A, S_A, \delta_A, S_{0A}, S_{FA})$ with acceptance condition $\mathcal{F}_A$, and $B = (\Sigma_B, S_B, \delta_B, S_{0B}, S_{FB})$ with acceptance condition $\mathcal{F}_B$, and disjoint alphabets and states, the free product $A \times B = (\Sigma, S, \delta, S_0, S_F)$ is defined as follows:

$\Sigma = (\Sigma_A \times \Sigma_B) \cup \Sigma_A \cup \Sigma_B$
$S = S_A \times S_B$
$S_0 = S_{0A} \times S_{0B}$
$S_F = S_{FA} \times S_{FB}$

Given $s = (s_A, s_B)$, $t = (t_A, t_B)$, $s, t \in S$, and $a = (a_A, a_B)$, $a \in \Sigma$, $(s, a, t) \in \delta$ if $(s_A, a_A, t_A) \in \delta_A$ and $(s_B, a_B, t_B) \in \delta_B$.

Given $s = (s_A, s_B)$, $t = (t_A, s_B)$, $s, t \in S$, and $a = a_A$, $a \in \Sigma$, $(s, a, t) \in \delta$ if $(s_A, a_A, t_A) \in \delta_A$. Symmetrically for $B$.

$\mathcal{F} = \bigcup_{F_i \in \mathcal{F}_A} \{F_i \times S_B\} \cup \bigcup_{F_i \in \mathcal{F}_B} \{S_A \times F_i\}$

With this definition, $A$ or $B$ may have individual or joint transitions in $A \times B$. A state $s_A$ is said to occur in a trace of the product $A \times B$ if it occurs in some product state $(s_A, s_B)$. Similarly, a transition $(s_A, t_A)$ occurs in a product trace if it occurs either individually or jointly in a product transition. The acceptance condition for infinite words requires that a word be accepted by $A \times B$ if both $A$ and $B$ pass infinitely often through their accepting states. In the free product, $A$ and $B$ execute their state transitions independently.
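The construction in Definition 7 can be sketched directly. The code below is an illustration, not an implementation from the thesis: it assumes that transition relations are given as sets of (source, letter, target) triples and acceptance conditions as lists of sets of states. The function names `free_product` and `product_acceptance` are hypothetical.

```python
from itertools import product
from typing import FrozenSet, Iterable, List, Set, Tuple

Trans = Set[Tuple[str, str, str]]   # (source state, letter, target state)


def free_product(states_a: Iterable[str], delta_a: Trans,
                 states_b: Iterable[str], delta_b: Trans) -> set:
    """Transition relation of the free product: joint moves plus individual moves."""
    states_a, states_b = list(states_a), list(states_b)
    delta = set()
    # Joint transitions: both components move on the paired letter (a_A, a_B).
    for (sa, la, ta), (sb, lb, tb) in product(delta_a, delta_b):
        delta.add(((sa, sb), (la, lb), (ta, tb)))
    # Individual transitions of A: B keeps its state.
    for (sa, la, ta) in delta_a:
        for sb in states_b:
            delta.add(((sa, sb), la, (ta, sb)))
    # Individual transitions of B: A keeps its state.
    for (sb, lb, tb) in delta_b:
        for sa in states_a:
            delta.add(((sa, sb), lb, (sa, tb)))
    return delta


def product_acceptance(states_a: Iterable[str], acc_a: List[Set[str]],
                       states_b: Iterable[str], acc_b: List[Set[str]]) -> List[FrozenSet]:
    """Generalized acceptance sets of the product: F_i x S_B and S_A x F_i."""
    states_a, states_b = list(states_a), list(states_b)
    sets = [frozenset(product(f, states_b)) for f in acc_a]
    sets += [frozenset(product(states_a, f)) for f in acc_b]
    return sets


# Two one-transition automata: A moves a0 -x-> a1, B moves b0 -y-> b1.
delta = free_product({"a0", "a1"}, {("a0", "x", "a1")},
                     {"b0", "b1"}, {("b0", "y", "b1")})
assert len(delta) == 5   # one joint move, two moves of A alone, two moves of B alone
```

In this small example the product has one joint transition and four individual ones; a protocol, as defined next, removes some of these transitions to model communication.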
But if $A$ and $B$ communicate, the product cannot have all possible transitions. This observation motivates the following definition of a protocol.

Definition 8 A protocol $p$ with automata $A$ and $B$ is the free product $A \times B$ and a set of restrictions (a subset of all transitions) $C$ that describes an automaton such that $\Sigma_p = \Sigma$, and all the other sets are defined by the states reachable from $S_0$ via the transitions in $\delta - C$.

Note that although a protocol is an automaton, we prefer to specify it as a free product with restrictions. This allows us to reason separately about the structure of component processes (represented by the free product) and communication (represented by the restriction). Different ways of specifying the automata and $C$ are used at different stages of the synthesis method.

2.3.6 Protocol Synthesis

Having defined protocols, we can now define protocol synthesis. First we define the notion of protocols that serve as specification and implementation. This is based on the standard notion of language substitution [HU79]. A substitution is a mapping from an alphabet $\Sigma$ to subsets of $\Sigma'^*$. The mapping is used to transform words in a language $L(\Sigma)$ over $\Sigma$ to words in a language $L(\Sigma')$ over $\Sigma'$. We use substitution to transform a specification into an implementation.

Let $l \in 2^{\Sigma^*}$ be a set of finite words over some alphabet $\Sigma$. Let $\Sigma(l)$ be the letters of $\Sigma$ used in the words of $l$. The languages $l, m$ over $\Sigma$ are distinct if they use different letters, $\Sigma(l) \cap \Sigma(m) = \emptyset$.

Definition 9 A protocol $q$ implements a protocol $p$ (conversely, $p$ specifies $q$) if the following hold.

Every automaton of $p$ corresponds to exactly one automaton of $q$.

There is a nondeterministic refinement function $\rho : \Sigma_p \to 2^{\Sigma_q^*}$ such that for every pair of distinct letters $a, b \in \Sigma_p$, the languages $\rho(a)$ and $\rho(b)$ are distinct, and the words accepted by $q$ are just the refined words accepted by $p$ (extending $\rho$ to words by induction).

Let $P$ be an automaton of $p$ and $Q$ be the automaton of $q$ that corresponds to $P$. Then there is a nondeterministic function $\rho_a : \Sigma_P \to 2^{\Sigma_Q^*}$ that maps letters of $P$ to distinct languages, and the words accepted by $Q$ are just the refined accepted words of $P$.

This definition captures the ideas that a protocol implementation is composed from automata with terminating executions, adds detail to the specification, and refines every step of the specification in a distinct way. Note also that the relation between specification and implementation is defined in terms of alphabets. Since words can describe either sequences of states or transitions, an implementation may refine either states or transitions. Finally, notice that the relationship of implementation to specification is defined in terms of relationships between the components, the automata and their states. Thus, we can reason about the (infinite) words via straightforward induction.

We may now define protocol synthesis simply as follows:

Definition 10 A protocol synthesis method is a function that given a specification protocol produces its implementation protocol.

By comparing Definition 6 and Definition 9, it is clear that we can always choose denotations such that implementations that conform to Definition 9 preserve behavior. Indeed, in practice we refine a protocol several times until the events of the process of interest have a one-to-one relationship to transitions of the protocol; the final step is an ordinary shared memory or message passing program.

2.4 Specifying Coordination

Automata in a protocol affect each other's state transitions. The rules that describe the effects model interprocess communication. We use a variety of rules to specify protocols. The most extensional, abstract protocol specifications are given by Constraint rules. Constraint-rule specifications are implemented by Action-rule specifications. Action rules include more details of communication. In turn, Action-rule specifications are implemented by Observation-rule specifications.
Observation-rule specifications are intensional; they are sufficiently detailed so that they can be easily translated to shared memory or message passing programs.

System properties like absence of deadlock and reachability are verified once and for all at the most abstract level for Constraint-rule specifications. The subsequent syntheses preserve these properties. A property is defined as a set of words over the alphabet of interest [Alp86]. A word $w$ has a property $\pi$ if $w \in \pi$. A language $L_i$ preserves properties of language $L_s$ if for every property $\pi_s$ of $L_s$, there is a unique property $\pi_i$ of $L_i$, and if a word $w_s$ has property $\pi_s$, then the corresponding word $w_i$ does not have any property disjoint from $\pi_i$.

Lemma 1 An implementation preserves properties of its specification.

Proof. From Definition 9, by induction, every word accepted by an implementation protocol corresponds to exactly one word accepted by the specification protocol. Therefore, a property $\pi_s$ of the specification maps to a unique property $\pi_i$. Furthermore, if $\pi'_i$ is a property disjoint from $\pi_i$, and $w_s \in \pi_s$ is a word accepted by the specification with a corresponding word $w_i$ of the implementation, then $w_i \notin \pi'_i$ (and $w_i \in \pi_i$). Thus, the implementation preserves properties.

Action-rule specifications are derived from Constraint-rule specifications by translating constraint rules to action rules, and Observation-rule specifications are derived from Action-rule specifications by translating action rules to observation rules. We show that each translation leads to automata that are implementations of the corresponding specification.

2.4.1 Constraint-Rule Specifications

Constraint-rule specifications express essential coordination among processes. A specification describes desirable sequences of process behavior as succinctly as possible. Constraint-rule specifications are extensional: they describe the effects of coordination, but not the details of how processes implement coordination or properties of communication media.
Constraint-rule specifications are designed so that techniques for specifying and verifying protocol properties are easily applicable. For the purposes of this thesis, two types of constraints suffice. One constraint requires that processes synchronize their behavior, and the other specifies that behaviors be disjoint.

In the following, we define Constraint-rule specifications, and show a simple example, the dining philosophers. We discuss the advantages and disadvantages of this style of specification. Then we explain how protocol properties can be checked at this level using verification algorithms.

2.4.1.1 Definitions

Let $P = (E_p, R_p)$ be a process, and $\mathcal{E}_p$ be a partition of the set of events $E_p$ into regions. This definition formalizes intuitive notions like the "critical region" used to describe parts of the run of a process. Define a bijection $[\![\cdot]\!] : \Sigma \to \mathcal{E}_p$. Let $A = (\Sigma, S, \delta, S_0, S_F)$, where $\Sigma = S$ and a word corresponds to the sequence of states of the trace on that word. $A$ is said to be a region automaton if it denotes $P$ using $[\![\cdot]\!]$ according to Definition 6. Every state of a region automaton denotes a distinct region of that process. Constraint-rule specifications are protocols that use region automata.

Let $A_1, \ldots, A_n$ be automata of some protocol $p$. A Constraint is defined to be a tuple $(s_i, \ldots, s_l) \in S_{A_i} \times \cdots \times S_{A_l}$ for some set of automata $\{A_i, A_j, \ldots, A_l\}$ of $p$. An automaton $A_i$ appears in a constraint if some state $s_i$ appears in the constraint.

A constraint $c$ is conjunctive (denoted $(\wedge : s_i, \ldots, s_l)$) if for every state $(h_1, \ldots, h_n)$ of $p$, if $h_j = s_j$ for some automaton $A_j$, then $h_k = s_k$ for all automata $A_k$ that appear in $c$.

A constraint $c$ is disjunctive (exclusive-or) (denoted $(\vee : s_i, \ldots, s_l)$) if for every state $(h_1, \ldots, h_n)$ of $p$, if $h_j = s_j$ for some automaton $A_j$, then for all other $h_k$, $h_k \neq s_k$.
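The two kinds of constraint can be read as predicates on global states. The sketch below assumes, for illustration only, that a global state is a mapping from automaton names to local states and that a constraint is a mapping from automaton names to their constrained states; the names `satisfies_conjunctive` and `satisfies_disjunctive` are hypothetical.

```python
from typing import Dict

GlobalState = Dict[str, str]   # automaton name -> its current local state h_k
Constraint = Dict[str, str]    # automaton name -> its constrained state s_k


def satisfies_conjunctive(g: GlobalState, c: Constraint) -> bool:
    """If any automaton is in its constrained state, every automaton in c must be."""
    hits = [g[name] == s for name, s in c.items()]
    return all(hits) or not any(hits)


def satisfies_disjunctive(g: GlobalState, c: Constraint) -> bool:
    """If one automaton is in its constrained state, no other automaton in c may be."""
    return sum(g[name] == s for name, s in c.items()) <= 1


conj = {"A1": "s1", "A2": "s2"}
assert satisfies_conjunctive({"A1": "s1", "A2": "s2"}, conj)
assert not satisfies_conjunctive({"A1": "s1", "A2": "n2"}, conj)

disj = {"P1": "f1", "P3": "f3"}
assert satisfies_disjunctive({"P1": "f1", "P2": "t2", "P3": "t3"}, disj)
assert not satisfies_disjunctive({"P1": "f1", "P2": "t2", "P3": "f3"}, disj)
```

A protocol that satisfies a set of constraints keeps only those global states for which every predicate holds.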
Conjunctive constraints force all states of a constraint to appear simultaneously in a global state, while disjunctive constraints allow exactly one state from a constraint to appear in any global state.

Definition 11 A Constraint-rule specification is a set of automata $A_1, \ldots, A_n$ with a set of constraints $C$ that defines a protocol composed of the automata and the smallest restriction that satisfies $C$. Each constrained state in an automaton must be preceded and followed by internal states that do not appear in any constraint.

The internal states act as placeholders that simplify the refinement from Constraint-rule specifications to Action-rule specifications.

Protocol 1 shows how to specify a three-process dining-philosophers protocol. We denote automata by giving the states and transitions. The internal states are not shown in the protocols for simplicity. For example, before $f$ there is the state $n$; after $f$ there is an unseen internal state.

Protocol 1
Automata:
  $P_1 : t_1 \to f_1 \to g_1 \to e_1 \to t_1$
  $P_2 : t_2 \to g_2 \to h_2 \to e_2 \to t_2$
  $P_3 : t_3 \to h_3 \to f_3 \to e_3 \to t_3$
Rules:
  $(\vee : f_1, f_3)$
  $(\vee : g_1, g_2)$
  $(\vee : h_2, h_3)$

In the protocol, $P_i$ are the philosophers, $t_i$ is the region where a philosopher is thinking, $e_i$ where a philosopher is eating, and $f_i, g_i, h_i$ are regions where philosophers pick up one fork on either side. The constraints require that at any time, only one philosopher may pick up a fork.

2.4.1.2 Advantages and Disadvantages

Constraint-rule specifications are abstract. We specify only the desired effects of communication, abstracting from details such as memory variables and queues. In addition, we can use the generalized acceptance condition of Section 2.3.3 to require that the implementation of disjunctive constraints will be fair. As a result, Constraint-rule specifications result in small models; in Protocol 1, the total state space is at most $4^3$ states.
Therefore, it is easy to see that the protocol is incorrect: the three philosophers may each pick up forks $f$, $g$, and $h$, and block permanently waiting for the other fork.

But this model is not easily implemented in practice. By comparison, in systems such as SPIN [Hol91], the problem would be modeled by processes that communicate over message queues. Both forks and philosophers are modeled by processes. Suppose that the fork processes have three states, fork-here, fork-left, fork-right. In addition, they remember the last philosopher who had the fork to implement fairness. This means each fork process has six states, and the system has $4^3 \times 6^3 = 13824$ states for such a tiny system. The advantage of the detailed model is that if forks indeed represent resources that are accessed by messages, deriving an implementation is easy.

In our method, we resolve the dilemma between abstraction and implementation by devising ways to generate implementations from the abstract specifications. Therefore, it becomes feasible to use abstract specifications.

2.4.1.3 Verifying Correctness

Two main approaches are used to verify the properties of protocols.

Reachability analysis [Wes78] searches the global protocol state space to find states or sequences of states that violate correctness properties. For example, a deadlock state is a system state where no process can take a step.

Model checking [CES86] specifies the desired properties of a protocol by means of a property automaton for the undesirable behavior. The transitions of the property automaton are described by predicates that express properties of the state space of the original system. Thus, the property automaton describes "bad" runs of the system. The protocol violates the property if there is any word that is accepted by both the protocol automaton and the property automaton. Thus, we have to detect cycles in the product of the property automaton and the protocol automaton.
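The reachability analysis described above can be sketched for Protocol 1. The sketch makes simplifying assumptions that are not part of the formal model: only one philosopher moves per step (no joint transitions), the internal states are omitted exactly as in the protocol listing, and a disjunctive constraint is read as "at most one of its states appears in a global state". All names in the code are hypothetical.

```python
from collections import deque

# Region cycles of the three philosophers, as listed in Protocol 1.
CYCLES = {
    "P1": ["t1", "f1", "g1", "e1"],
    "P2": ["t2", "g2", "h2", "e2"],
    "P3": ["t3", "h3", "f3", "e3"],
}
ORDER = ["P1", "P2", "P3"]
# Disjunctive constraints: at most one state of each set in any global state.
CONSTRAINTS = [{"f1", "f3"}, {"g1", "g2"}, {"h2", "h3"}]


def allowed(state):
    """A global state is allowed if it satisfies every disjunctive constraint."""
    return all(len(c & set(state)) <= 1 for c in CONSTRAINTS)


def successors(state):
    """Interleaved steps: exactly one philosopher advances to its next region."""
    result = []
    for i, name in enumerate(ORDER):
        cycle = CYCLES[name]
        nxt = cycle[(cycle.index(state[i]) + 1) % len(cycle)]
        candidate = state[:i] + (nxt,) + state[i + 1:]
        if allowed(candidate):
            result.append(candidate)
    return result


def explore(initial=("t1", "t2", "t3")):
    """Breadth-first search; returns (reachable states, deadlocked states)."""
    seen, queue, deadlocks = {initial}, deque([initial]), set()
    while queue:
        state = queue.popleft()
        nxt = successors(state)
        if not nxt:
            deadlocks.add(state)   # no process can take a step
        for s in nxt:
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return seen, deadlocks


reachable, deadlocks = explore()
assert len(reachable) <= 4 ** 3
assert ("f1", "g2", "h3") in deadlocks
```

Under these assumptions the search visits at most $4^3$ global states and reports the state in which each philosopher holds one fork and waits for the next, which is the deadlock described in Section 2.4.1.2.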
Both approaches are applicable to Constraint-rule specifications. For example, an exhaustive search of the state space of Protocol 1 quickly reveals states where no process can take a step. The main problem with these techniques is that exhaustive search can be intractable. By virtue of their level of abstraction, Constraint-rule specifications have small state spaces for many protocols of interest, so verification is not difficult.

Constraint-rule specifications can also exploit the research in minimizing state exploration [WG93]. There has been a great deal of interest in methods for detecting and exploiting regular patterns in the state space. These methods analyze the models to determine regularities. When models use variables with assignments, dependency detection can be tricky. In contrast, dependencies among processes are explicitly given by Constraint-rule specifications.

2.4.2 Action-Rule Specifications

Action-rule specifications implement Constraint-rule specifications. They are more intensional, in that they show how processes coordinate their actions to implement constraints. Rules for coordinating actions are binary, that is, any action in one process may control at most one action of one other process. This reflects the usual peer-to-peer communication available in practical systems. For our purposes, we need rules where an action can either disable or enable other actions, or one out of two possible actions may be chosen.

In the following, we define Action-rule specifications and present a simple example. Then we explain how to translate a Constraint-rule specification into an Action-rule specification. Constraints in a Constraint-rule specification can be implemented in many ways in an Action-rule implementation. So an abstract Constraint-rule specification can specify several Action-rule implementations.

An action is some behavior in one process that affects the behavior of another process.
For example, a process $P$ may set the value of a shared variable read by process $Q$, affecting its behavior. Formally, an action is just a transition. We use the word action to distinguish the transitions of an Action-rule specification from the transitions of Constraint-rule specifications or Observation-rule specifications.

Definitions

Let $P = (E_p, R_p)$ be a process, and $\mathcal{E}_p$ be a partition of the set of events $E_p$ into regions. Define a bijection $[\![\cdot]\!] : \Sigma \to \mathcal{E}_p$. Let $A = (\Sigma, S, \delta, S_0, S_F)$, where $\Sigma = \Gamma$, and a word corresponds to the sequence of transitions of the trace on that word. $A$ is said to be a branch automaton if it denotes $P$ using $[\![\cdot]\!]$ according to Definition 6. Every transition of a branch automaton denotes a distinct region of that process. Action-rule specifications are protocols that use branch automata. A transition of a branch automaton is called an action.

Let $A_1, \ldots, A_n$ be branch automata of some protocol $p$. An action rule is a pair of actions $((s_i, t_i), (s_j, t_j))$ from two distinct automata $A_i$ and $A_j$. Actions refer to the transitions of constituent automata such as $A_i$. Let $G$ be a state of $p$ that appears in some trace. Then the next state $H$ depends on the actions executed from $G$. The action that leads from $G$ to $H$ must be enabled in $G$. An action is disabled or enabled relative to a product state in a trace. Therefore, an action disabled (enabled) in one product state of a trace may be enabled (disabled) in another product state that occurs later in the trace. An Action-rule specification is required to be consistent, so that every action is either enabled or disabled in a product state. The following rules describe how actions are selected in an Action-rule specification.

An action rule is an enabling rule, $((s_i, t_i) \Rightarrow (s_j, t_j))$, if $(s_i, t_i)$ is the only action taking $s_i$ to $t_i$, and for every product state $G = (g_1, \ldots, g_n)$ where $g_i = t_i$ and $g_j = s_j$, there is a transition in the product to a state $H = (h_1, \ldots, h_n)$ with $h_j = t_j$.
At every state in a trace, all actions that are enabled are executed.

An action rule is a disabling rule, $((s_i, t_i) \not\Rightarrow (s_j, t_j))$, if $(s_i, t_i)$ is the only action taking $s_i$ to $t_i$, and for every product state $G = (g_1, \ldots, g_n)$ where $g_i = t_i$ and $g_j = s_j$, there is no transition in the product to a state $H = (h_1, \ldots, h_n)$ with $h_j = t_j$. A disabled transition remains disabled in a trace unless enabled by a subsequent action in that trace.

An action rule is a choice rule, $((s_i, t_i) \vee (s_j, t_j))$, if for every product state $G = (g_1, \ldots, g_n)$ where $g_i = s_i$ and $g_j = s_j$, there is no transition in the product to a state $H = (h_1, \ldots, h_n)$ with both $h_i = t_i$ and $h_j = t_j$. Choice is assumed to be fair over a trace.

An action rule is a condition rule, $((s_i, t_i) \Rightarrow ? \, (s_j, t_j))$, if $(s_i, t_i)$ is the only action taking $s_i$ to $t_i$, $(s_j, t_j)$ is disabled by some action $(u_i, v_i)$, and for every product state $G = (g_1, \ldots, g_n)$ where $g_i = t_i$ and $g_j = s_j$, there is a transition in the product to a state $H = (h_1, \ldots, h_n)$ with $h_j = t_j$. Intuitively, the transition $(s_j, t_j)$ is enabled by $(s_i, t_i)$ provided $(u_i, v_i)$ has disabled it previously. Otherwise, the transition $(s_j, t_j)$ does not need enabling.

Definition 12 An Action-rule specification is a set of automata $A_1, \ldots, A_n$ with a set of action rules $R$ that defines a protocol composed of the automata and the smallest restriction that satisfies $R$. The specification must be consistent so that at every product state, an action is either disabled or enabled.

Protocol 2 below presents a simple protocol for one-shot mutual exclusion that works when there are no cycles in the process. Starting with the state $(s_1, s_2)$, either one of the transitions $(s_1, t_1)$ or $(s_2, t_2)$ is chosen. The selected transition bars the progress of the other process. That process waits until it is enabled by another transition of the executing process. The states $u_1, u_2$ are the critical regions.
If $(s_1, t_1)$ is chosen, it disables $(t_2, u_2)$. $(t_2, u_2)$ is enabled after $P_1$ executes $(u_1, v_1)$.

Protocol 2
Automata:
  $P_1 : s_1 \to t_1 \to u_1 \to v_1$
  $P_2 : s_2 \to t_2 \to u_2 \to v_2$
Rules:
  $((s_1, t_1) \vee (s_2, t_2))$
  $((s_1, t_1) \not\Rightarrow (t_2, u_2))$, $((s_2, t_2) \not\Rightarrow (t_1, u_1))$
  $((u_1, v_1) \Rightarrow (t_2, u_2))$, $((u_2, v_2) \Rightarrow (t_1, u_1))$

In the following section, we show how to apply Definition 9 to show how Action-rule specifications can implement Constraint-rule specifications.

2.4.2.1 Proving Implementation

Definition 9 relates an implementation to a specification via substitution. Recall that substitution maps the letters of one language to words over $\Sigma'$, transforming the words of a language $L(\Sigma)$ to those of $L(\Sigma')$. This transformation over languages can also be defined as a transformation on automata.

Let us first consider the transformation informally. Let $(s, a, t)$ be the sole transition of an automaton $A$, where $s$ is the sole initial state, $t$ the final state and $a$ a letter. Let $\rho$ be a refinement function that maps the letter $a$ to some finite language. Then the automaton $A$ can be transformed to an automaton $A'$ that recognizes $\rho(a)$ simply by replacing the sole transition by the automaton that recognizes $\rho(a)$.

Given an automaton with exactly two transitions $(s, a, t)$ and $(t, b, u)$, substitutions over the letters $a$ and $b$ may be implemented by separately replacing the transitions. If the replacement automata have exactly one initial and one final state, then we may compose the replacements by identifying the final state of the first replacement with the initial state of the second replacement. If they have multiple initial and final states, we can connect the final states of the first replacement automaton to the initial states of the second replacement by epsilon transitions. The epsilon transitions can be removed by the usual determinization algorithms. Thus, the translation involves composing the replacement automata.
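At the level of languages, the substitution recalled above can be sketched in a few lines. The code is illustrative and rests on assumptions not made in the text: letters are single characters, languages are finite sets of strings, and the function names `refine_word`, `refine_language`, and `images_distinct` are hypothetical.

```python
from itertools import product
from typing import Dict, Set

Substitution = Dict[str, Set[str]]   # letter -> finite language over the new alphabet


def refine_word(word: str, rho: Substitution) -> Set[str]:
    """Extend rho from letters to a word: rho(aw) = rho(a) rho(w)."""
    choices = [sorted(rho[letter]) for letter in word]
    return {"".join(parts) for parts in product(*choices)}


def refine_language(language: Set[str], rho: Substitution) -> Set[str]:
    """Apply the substitution to every word of a finite language."""
    refined: Set[str] = set()
    for word in language:
        refined |= refine_word(word, rho)
    return refined


def images_distinct(rho: Substitution) -> bool:
    """Definition 9 asks that the images of different letters use disjoint letters."""
    used = [set("".join(words)) for words in rho.values()]
    return all(used[i].isdisjoint(used[j])
               for i in range(len(used)) for j in range(i + 1, len(used)))


rho = {"a": {"x", "xy"}, "b": {"z"}}
assert refine_word("ab", rho) == {"xz", "xyz"}
assert images_distinct(rho)
```

Replacing each transition of an automaton by an automaton for the image of its letter, as described above, yields an automaton for the refined language computed here.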
Similarly, an implementation protocol defined using Action rules is derived from a specification protocol defined using Constraint rules by replacing each constraint of a Constraint-rule specification with Action-rule specifications and composing them. We require that when considered in isolation, each replacement implements a constraint according to Definition 9. The action rules that define one replacement do not affect the rules that define the other replacements. Therefore, one replacement does not interfere with another replacement when composed. As a result, the overall Constraint-rule specification is implemented by the Action-rule specification produced by composing the replacements.

These ideas are formalized as follows. First we formally describe the form of replacements that we will use. We use the term replacement when talking about the implementations of individual constraints in a Constraint-rule specification, while using implementation to refer to the entire Action-rule specification that results upon substituting every constraint with its replacement.

Definition 13 An Action-rule replacement is a protocol that implements Constraint-rule specifications of the form

Automata:
  $P_1 : n_1 \to s_1 \to o_1$
  $P_2 : n_2 \to s_2 \to o_2$
  $\vdots$
Rules:
  Either $(\wedge : s_1, s_2, \ldots)$ or $(\vee : s_1, s_2, \ldots)$

and the states $n$ and $o$ do not appear in any constraint rule; the replacement must have exactly one initial and one final state, and must remain an implementation if $n$ and $o$ are identified.

We require that the replacement be an implementation even when $n$ and $o$ are identified so that cycles in the Constraint-rule specification do not invalidate correctness. Note that $n$ and $o$ are intended to be internal states that represent what happens before and after a constraint. Given replacements in this form, we can safely compose the replacements. The composition is defined as follows.
Definition 14 An Action-rule specification is said to be synthesized from a Constraint-rule specification given the following construction: Given a set of replacements for every constraint in a Constraint-rule specification, identify the first and last states of the replacement for each transition in the Constraint-rule specification for every automaton. Replace unconstrained states by a single new transition.

With this definition for composing replacements, it is not difficult to argue that the composition produces an implementation.

Theorem 1 An Action-rule specification $A$ synthesized from a Constraint-rule specification $C$ is an implementation of $C$.

Proof. To prove that $A$ implements $C$ (Definition 9), we have to prove for each automaton, and for the product, that we can define a function that maps each letter in the alphabet of the specification to a distinct finite language of finite words. In this case, the alphabet is just the possible product states. So we will show that for each product state the function may be defined.

First, we note that traces of a replacement remain unchanged even when it is used in the synthesis. Consider a product state in the synthesized Action-rule implementation such that in this state, some replacement automaton is in its initial state. Then, since the action rules affect only the actions defined within the replacement, the transitions of the replacement are unaffected by transitions implementing constraints on other states. Therefore, the replacement transitions can always be executed in every reachable state.

Initial states in a Constraint-rule specification do not take part in any constraint. They are replaced by a single transition each. Therefore, the initial product state has an image. Each new state is reached by a constraint rule. Each transition therefore executes transitions of the replacement that implement the constraint. Define a map such that the replacement transitions corresponding to the $n$ and $o$ states map to internal states.
The transitions that map to the $s$ states define a suitable image for the product state. Thus, we have the result.

If we have a set of replacements, we can produce an implementation. But deriving a replacement for conjunctive constraints requires us to address the issue of simultaneity.

Simultaneity A Constraint-rule specification allows joint transitions in a conjunctive constraint: all the processes "simultaneously" satisfy the conjunctive constraint. Action-rule specifications have no synchronous transitions, but they can define order relations such as before and after. Therefore, we interpret simultaneity by requiring that "simultaneous" transitions appear after their predecessors and before their successors. Formally, we have

Definition 15 Let a function $f$ be the refinement function for a branch automaton that maps the states of a conjunctive constraint $(\wedge : s_1, s_2, \ldots)$ over processes $P_i$, where states $n_i$ and $o_i$ are the predecessor and successor states. Let $t_i, u_i, v_i$ be transitions in an implementation of the constraint such that $f(n_i) \mapsto t_i$, $f(s_i) \mapsto u_i$, $f(o_i) \mapsto v_i$. If, in every trace, $u_i$ follows all occurrences of $t_i$ and precedes all occurrences of $v_i$ for all processes, then the transitions $u_i$, for all $i$, are said to be simultaneous.

For example, consider the Constraint-rule protocol

  Automata:
    $A_1 : n_1 \to s_1 \to o_1$
    $A_2 : n_2 \to s_2 \to o_2$
  Rules:
    $(\wedge : s_1, s_2)$

One possible replacement is the protocol:

  Automata:
    $A_1 : x_1 \to y_1 \to z_1 \to a_1$
    $A_2 : x_2 \to y_2 \to z_2 \to a_2$
  Rules:
    $((x_1, y_1) \Rightarrow (y_2, z_2))$, $((x_2, y_2) \Rightarrow (y_1, z_1))$
    $((y_1, z_1) \Rightarrow (z_2, a_2))$, $((y_2, z_2) \Rightarrow (z_1, a_1))$

We define the map from implementation to specification so that the transitions $(y, z)$ map to the $s$ states, while the $(x, y)$ and $(z, a)$ transitions map to the $n$ and $o$ states. By Definition 15, the $(y, z)$ transitions of the processes are simultaneous.

In the following section, we consider Observation-rule specifications and show how they can replace Action-rule specifications.
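The simultaneity claim for the example replacement above can be checked by brute force. The following is a minimal sketch, not part of the original method: it assumes that a rule $(a \Rightarrow b)$ means "transition $b$ may fire only after transition $a$ has fired", and the names `all_traces` and `simultaneous` are hypothetical helpers introduced only for illustration. It enumerates every complete interleaving of the two automata and tests the ordering condition of Definition 15.

```python
# Each automaton is a linear chain of states; a transition is a (source, target) pair.
AUTOMATA = {
    "A1": ["x1", "y1", "z1", "a1"],
    "A2": ["x2", "y2", "z2", "a2"],
}

# ASSUMPTION: an action rule (a => b) is read as "b may fire only after a has fired".
ENABLES = [
    (("x1", "y1"), ("y2", "z2")),
    (("x2", "y2"), ("y1", "z1")),
    (("y1", "z1"), ("z2", "a2")),
    (("y2", "z2"), ("z1", "a1")),
]


def transitions(chain):
    """Turn a chain of states into its list of consecutive transitions."""
    return list(zip(chain, chain[1:]))


def all_traces():
    """Enumerate every complete interleaving that respects the rules."""
    chains = {name: transitions(states) for name, states in AUTOMATA.items()}
    traces = []

    def extend(trace, position):
        if all(position[n] == len(chains[n]) for n in chains):
            traces.append(list(trace))
            return
        for name, chain in chains.items():
            if position[name] == len(chain):
                continue
            candidate = chain[position[name]]
            required = [a for a, b in ENABLES if b == candidate]
            if all(a in trace for a in required):
                trace.append(candidate)
                position[name] += 1
                extend(trace, position)
                position[name] -= 1
                trace.pop()

    extend([], {name: 0 for name in chains})
    return traces


def simultaneous(trace, before, middle, after):
    """Every `middle` transition follows all `before` transitions and
    precedes all `after` transitions in the given trace."""
    index = {t: i for i, t in enumerate(trace)}
    last_before = max(index[t] for t in before)
    first_after = min(index[t] for t in after)
    return all(last_before < index[t] < first_after for t in middle)


BEFORE = [("x1", "y1"), ("x2", "y2")]   # transitions mapped to the n states
MIDDLE = [("y1", "z1"), ("y2", "z2")]   # transitions mapped to the s states
AFTER = [("z1", "a1"), ("z2", "a2")]    # transitions mapped to the o states

TRACES = all_traces()
ALL_SIMULTANEOUS = all(simultaneous(t, BEFORE, MIDDLE, AFTER) for t in TRACES)
```

Under the assumed reading of the rules, the four rules force three strictly ordered layers of transitions, which is why every trace passes the check. A different semantics for $\Rightarrow$ would need a different enabling test in `extend`.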
2.4.3 Observation-Rule Specifications

Observation-rule specifications implement Action-rule specifications. They model process communication at a detailed level. One process can "observe" the state of another process, and choose one of several state transitions as its next transition. Action coordination is implemented via observations. In the following, we define Observation-rule specifications and present a simple example. Then we explain how to translate an Action-rule specification into an Observation-rule specification.

Definitions Like Action-rule specifications, Observation-rule specifications use branch automata. Let $A_1, \ldots, A_n$ be (branch) automata of some protocol $p$.

An observation is a tuple $((s_i, t_i), X_j)$ of a transition from one automaton ($A_i$) and a set of states $X_j$ from another automaton ($A_j$). A transition $(s_i, t_i)$ is said to be based on an observation if there is an observation with that transition. Every transition may be based on at most one observation.

Given two observations $((s_i, t_i), X_j)$ and $((s_i, u_i), Y_j)$, $X_j \cap Y_j = \emptyset$. That is, transitions based on observations are deterministic.

A state $s_j$ is observable if there is an observation $((s_i, t_i), O_j)$ where $s_j \in O_j$.

The semantics of an observation $((s_i, t_i), X_j)$ is as follows. Intuitively, automaton $A_i$ in state $s_i$ observes $A_j$ in one of the states $X_j$, "remembers it", and later asynchronously changes state to $t_i$. The state where $A_i$ remembers the observed state cannot itself be detected by any other automaton. Formally, we ensure this asynchrony by splitting every transition into a pair of transitions. Transitions based on an observation implicitly specify a pair of ordinary transitions, $((s_i, t'_i), (t'_i, t_i))$, with a hidden state $t'_i$. Hidden means that $t'_i$ is not observable. Observations determine global transitions as follows.
Given an observation $((s_i, t_i), X_j)$, in every product state $H = (h_1, \ldots, h_n)$ where $h_i = s_i$ and $h_j \in X_j$, all successor states $G = (g_1, \ldots, g_n)$ have $t'_i$ as their $i$th component, $g_i = t'_i$. Further, for every state $F = (f_1, \ldots, f_n)$ where $f_i = t'_i$, for every state $E = (e_1, \ldots, e_n)$ where $(F, E)$ is a transition, either $e_i = t'_i$ [...]

If $P_1$ sees $P_2$ in $n_2$ and goes to $w_1$, and $P_2$ then observes $P_1$ and goes to $w'_2$, then $P_1$ observed $P_2$ before $P_2$ observed $P_1$. Therefore $P_1$ "wins", and can execute $(w_1, c_1)$. Now suppose that this protocol is executed cyclically; then the next time around $P_1$ will see $P_2$ in $w'_2$ and go to $w'_1$, "releasing" $P_2$. Thus this protocol implements fair choice: only one process can execute its $(w, c)$ transitions at a time, but the next time round the other process will be given a chance.

Protocol 3
  Automata:
    $P_1 : n_1 \to w_1 \to c_1$
        $: n'_1 \to w'_1 \to c'_1$
        $: n_1 \to w'_1 \to w'_1$
        $: n'_1 \to w_1 \to w_1$
    $P_2 : n_2 \to w_2 \to c_2$
        $: n'_2 \to w'_2 \to c'_2$
        $: n_2 \to w'_2 \to w'_2$
        $: n'_2 \to w_2 \to w_2$
  Rules:
    $((n_1, w_1), \{n_2, w_2, c_2\})$, $((n'_1, w_1), \{n_2, w_2, c_2\})$
    $((n'_1, w'_1), \{n'_2, w'_2, c'_2\})$, $((n_1, w'_1), \{n'_2, w'_2, c'_2\})$
    $((w_1, w_1), \{n_2, w_2, c_2\})$, $((w'_1, w'_1), \{n'_2, w'_2, c'_2\})$
    $((n_2, w_2), \{n'_1, w'_1, c'_1\})$, $((n'_2, w_2), \{n'_1, w'_1, c'_1\})$
    $((n'_2, w'_2), \{n_1, w_1, c_1\})$, $((n_2, w'_2), \{n_1, w_1, c_1\})$
    $((w_2, w_2), \{n'_1, w'_1, c'_1\})$, $((w'_2, w'_2), \{n_1, w_1, c_1\})$

2.4.4 Proving Implementation

In proving that an Observation-rule specification implements an Action-rule specification, we use the same definition (Definition 9) and the same argument as in Theorem 1. For each Action-rule replacement, we produce an Observation-rule replacement. We show how to compose Observation-rule replacements. Every replacement uses observations limited to its own states; therefore, a replacement is unaffected by other replacements of the Action-rule specification. Each Observation-rule replacement is assumed to be an implementation of an Action-rule replacement.
Thus, we can define a function as required by Definition 9, showing that the resulting Observation-rule specification implements the Action-rule specification, and hence the original Constraint-rule specification. First we define the form of an Observation-rule replacement.

Definition 17 An Observation-rule replacement is a protocol that implements an Action-rule replacement. It must have the same number of initial and final states. Transitions from initial states and transitions to final states must not be based on observations. If an initial (final) state is substituted by an automaton connected to the replacement only by the initial (final) transition, and the automaton is assumed to eventually execute the initial transition, we may substitute the initial (final) state in an observation with the states of the automaton without affecting correctness.

Just as Action-rule replacements include $n$ and $o$ states that represent possible successors and predecessors, initial and final transitions are intended to represent the context of a replacement. Since transitions from initial states and to final states do not involve any interaction with the other process, all interesting behavior begins after the initial transitions and ends before the final transition. Therefore, a replacement is not affected by other replacements; all of their states can be thought of as an undistinguished initial state or final state. The condition ensures that only transitions internal to the replacement affect progress.

When composing one Observation-rule replacement with another, as a notational shorthand, we let the final states of a replacement non-deterministically select the initial state of a successor replacement. A determinization algorithm ensures fair behavior through the subset construction [HU79].
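Before replacements are composed, the two well-formedness conditions on observations from Section 2.4.3 (a transition is based on at most one observation, and observations on transitions leaving the same state name disjoint sets) can be checked mechanically. The sketch below does this for the observations of $P_1$ in Protocol 3. It assumes a two-process protocol, so all observations of one process refer to the same observed automaton; the function name `violations` and the encoding of a primed state such as $w'_1$ as `"w1p"` are hypothetical choices made for illustration.

```python
from itertools import combinations

# Observations of P1 in Protocol 3, written ((source, target), observed_states).
# A trailing "p" stands for a prime, e.g. "w1p" is w1'.
P1_OBSERVATIONS = [
    (("n1", "w1"), {"n2", "w2", "c2"}),
    (("n1p", "w1"), {"n2", "w2", "c2"}),
    (("n1p", "w1p"), {"n2p", "w2p", "c2p"}),
    (("n1", "w1p"), {"n2p", "w2p", "c2p"}),
    (("w1", "w1"), {"n2", "w2", "c2"}),
    (("w1p", "w1p"), {"n2p", "w2p", "c2p"}),
]


def violations(observations):
    """Return a list of problems; an empty list means both conditions hold.

    ASSUMPTION: every observation refers to the same observed automaton,
    as in a two-process protocol.
    """
    problems = []
    seen = set()
    # Condition 1: a transition is based on at most one observation.
    for transition, _ in observations:
        if transition in seen:
            problems.append(("duplicate", transition))
        seen.add(transition)
    # Condition 2: observations on transitions out of the same state are disjoint.
    for (t1, x1), (t2, x2) in combinations(observations, 2):
        if t1 != t2 and t1[0] == t2[0] and x1 & x2:
            problems.append(("overlap", t1, t2))
    return problems
```

For example, `violations(P1_OBSERVATIONS)` reports no problems, whereas adding a hypothetical observation `(("n1", "c1"), {"n2"})` would be flagged, because $n_2$ would then select two different successors of $n_1$.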
Definition 18 An Observation-rule specification is said to be synthesized from a Constraint-rule specification given the following construction: Given a set of replacements for every constraint in a Constraint-rule specification, let every final state of a replacement be joined to the initial states of the successor replacement. Replace unconstrained states by a single new transition. In all observations of a replacement, change initial and final states to include all states of all other replacements of the observed process.

Theorem 2 An Observation-rule specification $O$ synthesized from a Constraint-rule specification $C$ is an implementation of $C$.

Proof. The proof is similar to that of Theorem 1. We define the map with the help of the Action-rule maps. First, we note that traces of a replacement remain unchanged even when it is used in the synthesis. Consider a product state in the synthesized Observation-rule implementation such that for some replacement the initial transition is executed. Then, since the observations affect only the transitions defined within a replacement, execution of replacements of other constraints has no effect. Therefore, the replacement transitions can always be executed in every reachable state. Initial states in a Constraint-rule specification do not take part in any constraint. They are replaced by a single transition each. Therefore, the initial product has an image. Every replacement is an implementation of an Action-rule replacement. So we compose the maps from Observation-rule to Action-rule replacements with the map from Action-rule to Constraint-rule. Thus, we have the result.

2.5 Implementing Constraints, Actions, and Observations

In this section, we describe a few ways to implement various rules. We present proofs that they are implementations. In practice, such proofs are best done with verification tools.

2.5.1 Synthesizing Constraint Rules

We first show how constraint rules are implemented using action rules.
2.5.1.1 Binary Conjunctive Constraints

A binary conjunctive constraint is of the form:

Protocol 4
  Automata:
    $P_1 : n_1 \to s_1 \to o_1$
    $P_2 : n_2 \to s_2 \to o_2$
  Rules:
    $(\wedge : s_1, s_2)$

We want a replacement in the sense of Definition 13. In the following, we describe two replacements. In Protocol 5, the conjunctive constraint is interpreted as a two-process rendezvous. In Protocol 6, it is interpreted as a request-response protocol.

Protocol 5
  Automata:
    $P_1 : n_1 \to w_1 \to b_1 \to b'_1 \to w'_1 \to n'_1$
    $P_2 : n_2 \to w_2 \to b_2 \to b'_2 \to w'_2 \to n'_2$
  Rules:
    $((n_1, w_1) \Rightarrow (w_2, b_2))$, $((n_2, w_2) \Rightarrow (w_1, b_1))$
    $((b'_1, w'_1) \Rightarrow (w'_2, n'_2))$, $((b'_2, w'_2) \Rightarrow (w'_1, n'_1))$

In the construction, the $(w, b)$ transition of each process depends on the $(n, w)$ transition of the other. Similarly, the $(b', w')$ transitions enable the $(w', n')$ transitions. Therefore, the $(b, b')$ transitions in every trace are "simultaneous" in the sense of Definition 15. So we have:

Lemma 2 Protocol 5 is a replacement of Protocol 4.

Proof. Define a map $f$ from the transitions of the replacement to the states of the specification as follows: $f((b, b')) \mapsto s$, $f((n, w)) \mapsto n$, $f((w', n')) \mapsto o$. In every trace, the $(b, b')$ transitions occur between the $(n, w)$ and $(w', n')$ transitions, even if $n$ and $o$ are identified. Therefore, Protocol 5 is an implementation that meets the criteria of Definition 13.

Another interpretation for a binary conjunctive constraint is request-response communication.

Protocol 6
  Automata:
    $P_1 : n_1 \to b_1 \to b'_1 \to d_1 \to d'_1 \to o_1$
    $P_2 : n_2 \to b_2 \to b'_2 \to d_2 \to d'_2 \to o_2$
  Rules:
    $((b_1, b'_1) \Rightarrow (b_2, b'_2))$
    $((d_2, d'_2) \Rightarrow (d_1, d'_1))$

Here, $P_2$ waits for $P_1$ to enable $(b_2, b'_2)$. This models the request; the next dependence, between $(d_2, d'_2)$ and $(d_1, d'_1)$, models the response.

Lemma 3 Protocol 6 is a replacement of Protocol 4.

Proof. Define a refinement map $f$ as follows: $f((n, b)) \mapsto n$, $f((b, b')) \mapsto n$, $f((d, d')) \mapsto o$, $f((d', o)) \mapsto o$, $f((b', d)) \mapsto s$. In every trace, the $(b', d)$ transitions are simultaneous. Hence the result.
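To illustrate the rendezvous reading of Protocol 5, the following sketch runs the two processes as Python threads and records the order in which their transitions fire. It is an illustration under stated assumptions rather than the thesis's translation: each rule $(a \Rightarrow b)$ is realized by a `threading.Event` that is set after $a$ is logged and awaited before $b$, and the names `run_rendezvous` and `is_simultaneous` are hypothetical.

```python
import threading


def run_rendezvous():
    """Run both processes of Protocol 5 once and return the global event log."""
    log = []
    log_lock = threading.Lock()
    arrived = {1: threading.Event(), 2: threading.Event()}   # (n, w) has fired
    leaving = {1: threading.Event(), 2: threading.Event()}   # (b', w') has fired

    def record(step, pid):
        with log_lock:
            log.append((step, pid))

    def process(pid, other):
        record("n->w", pid)
        arrived[pid].set()
        arrived[other].wait()      # rule: (n, w) of the other enables my (w, b)
        record("w->b", pid)
        record("b->b'", pid)       # the transition mapped to the constrained state s
        record("b'->w'", pid)
        leaving[pid].set()
        leaving[other].wait()      # rule: (b', w') of the other enables my (w', n')
        record("w'->n'", pid)

    threads = [
        threading.Thread(target=process, args=(1, 2)),
        threading.Thread(target=process, args=(2, 1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def is_simultaneous(log):
    """Both (b, b') transitions lie after both (n, w) and before both (w', n')."""
    pos = {entry: i for i, entry in enumerate(log)}
    last_arrival = max(pos[("n->w", p)] for p in (1, 2))
    first_departure = min(pos[("w'->n'", p)] for p in (1, 2))
    return all(last_arrival < pos[("b->b'", p)] < first_departure for p in (1, 2))
```

Each process logs a transition before signalling it, so the recorded order respects the rules regardless of how the scheduler interleaves the threads. Neither process can deadlock, because each sets its own event before waiting on the other's.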
2.5.1.2 Binary Disjunctive Constraints

A binary disjunctive constraint is of the form:

Protocol 7
  Automata:
    $P_1 : n_1 \to s_1 \to o_1$
    $P_2 : n_2 \to s_2 \to o_2$
  Rules:
    $(\vee : s_1, s_2)$

We want a replacement in the sense of Definition 13. In the following, we describe a replacement. In Protocol 8, the disjunctive constraint is interpreted as symmetric mutual exclusion. One might also implement it using a token-based protocol.

Protocol 8
  Automata:
    $P_1 : n_1 \to b_1 \to w_1 \to c_1 \to e_1 \to o_1$
        $: b_1 \to w'_1 \to c_1$
    $P_2 : n_2 \to b_2 \to w_2 \to c_2 \to e_2 \to o_2$
        $: b_2 \to w'_2 \to c_2$
  Rules:
    choice rules:
      $((b_1, w_1) \vee (b_2, w_2))$
      $((b_1, w'_1) \vee (b_2, w'_2))$
    enable/disable rules for $P_1$:
      $((b_1, w_1) \not\Rightarrow (b_2, w'_2))$, $((b_1, w_1) \not\Rightarrow (w_2, c_2))$
      $((c_1, e_1) \Rightarrow^{\star} (b_2, w'_2))$, $((c_1, e_1) \Rightarrow^{\star} (w_2, c_2))$
      $((b_1, w'_1) \not\Rightarrow (b_2, w_2))$, $((b_1, w'_1) \not\Rightarrow (w'_2, c_2))$
      $((c_1, e_1) \Rightarrow^{\star} (b_2, w_2))$, $((c_1, e_1) \Rightarrow^{\star} (w'_2, c_2))$
    symmetric enable/disable rules for $P_2$:
      $((b_2, w_2) \not\Rightarrow (b_1, w'_1))$, $((b_2, w_2) \not\Rightarrow (w_1, c_1))$
      $((c_2, e_2) \Rightarrow^{\star} (b_1, w'_1))$, $((c_2, e_2) \Rightarrow^{\star} (w_1, c_1))$
      $((b_2, w'_2) \not\Rightarrow (b_1, w_1))$, $((b_2, w'_2) \not\Rightarrow (w'_1, c_1))$
      $((c_2, e_2) \Rightarrow^{\star} (b_1, w_1))$, $((c_2, e_2) \Rightarrow^{\star} (w'_1, c_1))$

We show that:

Lemma 4 Protocol 8 is a replacement of Protocol 7.

Proof. Define the map such that the $(n, b)$ transitions map to the $n$ states of Protocol 7, $(e, o)$ to the $o$ states, $(w, c)$ to the $s$ states, and the rest to null. In the construction, the $(b, w)$ transitions disable the $(b, w')$ transitions of the opposite process, and vice versa. The $(b, w)$ and $(b, w')$ transitions choose between one another. Thus, consider the state $(b_1, b_2)$. By the choice rule, only one of $(b_1, w_1), (b_1, w'_1)$ or $(b_2, w_2), (b_2, w'_2)$ may execute. Suppose $(b_1, w_1)$ executes. Then it disables $(b_2, w'_2)$ and $(w_2, c_2)$. Therefore $P_1$ can safely execute $(w_1, c_1)$. The disabled transitions are enabled by $(c_1, e_1)$, so that $P_2$ can continue. In all four cases, the situation is symmetric. Therefore, in all traces, the transitions $(w, c)$ will never occur concurrently. This does not change if there is a cycle connecting the $n$ and $o$ states.
Therefore, Protocol 8 is a replacement.

2.5.1.3 N-process Constraints

One general way to build an N-process constraint replacement from a 2-process constraint replacement is to use a hierarchical tournament structure. For example, consider the implementation of a disjunctive constraint between two processes, and suppose we want to implement a three-way constraint. Then, we let the "winner" of the two-process protocol compete with the third process by repeating the two-process protocol. Clearly, the winner of the second round will be the only process that executes the transition corresponding to the constrained state. Conjunctive constraint implementations can be composed in a similar way.

2.5.2 Synthesizing Action Rules

Action-rule specifications tend to be tedious. We will discuss only one example of an action rule protocol, the implementation of binary disjunction.

Protocol 9
  Automata:
    $P_1 : n_1 \to t_1 \to w_1 \to c_1 \to o_1$
        $: n'_1 \to t'_1 \to w'_1 \to c'_1 \to o'_1$
        $: t_1 \to w'_1 \to w'_1$
        $: t'_1 \to w_1 \to w_1$
    $P_2 : n_2 \to t_2 \to w_2 \to c_2 \to o_2$
        $: n'_2 \to t'_2 \to w'_2 \to c'_2 \to o'_2$
        $: t_2 \to w'_2 \to w'_2$
        $: t'_2 \to w_2 \to w_2$
  Rules:
    observations for $P_1$:
      $((t_1, w_1), \{n_2, w_2, t_2, c_2, o_2\})$, $((w_1, c_1), \{n_2, w'_2, o_2\})$, $((w_1, w_1), \{t_2, w_2, c_2\})$
      $((t'_1, w'_1), \{n'_2, w'_2, t'_2, c'_2, o'_2\})$, $((w'_1, c'_1), \{n'_2, w_2, o'_2\})$, $((w'_1, w'_1), \{t'_2, w'_2, c'_2\})$
    antisymmetric observations for $P_2$:
      $((t_2, w_2), \{n'_1, w'_1, t'_1, c'_1, o'_1\})$, $((w_2, c_2), \{n'_1, w_1, o'_1\})$, $((w_2, w_2), \{t'_1, w'_1, c'_1\})$
      $((t'_2, w'_2), \{n_1, w_1, t_1, c_1, o_1\})$, $((w'_2, c'_2), \{n_1, w'_1, o_1\})$, $((w'_2, w'_2), \{t_1, w_1, c_1\})$

For convenience, Protocol 9 is depicted graphically in Figure 2.1. In the figure, transitions that show sets of states over the arrow are observation-based; they observe the states of the other process.

Lemma 5 Protocol 9 implements Protocol 8.

Proof. In Protocol 9 and Figure 2.1, the transitions corresponding to $(n, b)$ and $(e, o)$ of Protocol 8 have not been depicted to avoid clutter.
The other sequences of Protocol 9 transitions map to the transitions of Protocol 8 as follows: $f((n, t), (t, w)) \mapsto (b, w)$, $f((w, c)) \mapsto (w, c)$, $f((w', c')) \mapsto (w, c)$, $f((c, o)) \mapsto (c, e)$, $f((c', o')) \mapsto (c, e)$. To show that the traces of Protocol 9 constitute an implementation, consider the following argument. Suppose $P_1$ is in $n_1$, and $P_2$ in $n_2$. Both make progress; suppose $P_1$ ends up in $w_1$. Then it has seen $P_2$ in either $n_2$ or $t_2$. Then $P_1$ observes $P_2$ again. If $P_2$ is still in $n_2$, either it will stay in $n_2$, or it will make progress and end up in $w'_2$, where it will wait for $P_1$. This argument applies symmetrically to all possible traces. Thus, when one process executes a $(w, c)$ transition, the other cannot. Hence we have the result.

[Figure 2.1: Disjunctive Constraint Using Observation Rules. State diagrams of Process P1 and Process P2.]

2.5.3 Observations via Memory and Messages

We briefly explain how to convert Observation-rule protocols to shared memory and message programs. Note that the following translations are easily expressed in formal models of messages [FLP85] or memory [LAA87].

Messages: States are implemented as local memory bits. A state transition clears the bit representing the current state, and sets the bit for the next state. Observations may be implemented in two ways over reliable, ordered channels.
  - Pull model: We can use a request-response protocol to implement observations. A process sends a request for the current state of another process, and the reply is the observation.
  - Push model: Every process sends its current state in a message to all processes that observe that process.

Memory: We code each state as a memory bit in a distributed memory.
State transitions are implemented by setting the bit for the next state, and clearing it for the current state. Observations are implemented as if statements that decide the assignment to the next state. Many optimizations are possible: for example, since observations by one process are often based on sets of states of a different process, we only need to indicate when state changes in a process can change observations. Therefore, not every state need be explicitly represented, and we need only enough bits for distinct observations.

2.6 Summary

We have presented a technique for structuring the design of protocols. The first step uses automata with abstract coordination operators to produce simple, extensional descriptions of protocols. The descriptions have small state spaces, as they assume fair implementation of coordination and hide the details of communication and control. As a result, whole-system properties can be effectively analyzed at this level by verification tools.

In the second step, we refine the coordination operators. The notation at this level expresses how processes control one another, but hides details of communication media. We show how to ensure that these refinements can be composed safely to refine the original abstract protocol.

In the third step, we translate the refined operator implementations to a notation that reflects the behavior of shared memory or message passing systems. Again, the resulting protocols are small and can be verified by exhaustive search tools. The translations are themselves safely composable, completing the implementation of the original protocol.

We have presented the essential aspects of the method. It was developed simultaneously with the target application, distributed shared memory consistency protocols, described in the next chapter. Much remains to be done as regards expressiveness and tool development before it can be deployed in practice.
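As an illustration of the memory translation described in Section 2.5.3, the sketch below encodes the states of a process as one bit each, implements a transition by clearing the current bit and setting the next one, and implements an observation as an if statement over the other process's bits. It is a simplified, single-threaded sketch under stated assumptions: the class `BitProcess` and its methods are hypothetical names, the hidden intermediate state of an observation is omitted, and the state names reuse Protocol 3 with a trailing `p` standing for a prime.

```python
class BitProcess:
    """One process whose protocol states are encoded as one bit per state."""

    def __init__(self, states, initial):
        self.bits = {s: False for s in states}
        self.bits[initial] = True

    def current(self):
        """Return the state whose bit is set."""
        return next(s for s, on in self.bits.items() if on)

    def step(self, source, target):
        """Plain transition: clear the current bit, then set the next one.

        Clearing first keeps self-loops such as (w1, w1) correct.
        """
        if not self.bits[source]:
            return False
        self.bits[source] = False
        self.bits[target] = True
        return True

    def observe_and_step(self, other, source, choices):
        """Observation-based transition out of `source`.

        `choices` maps each candidate target to the set of states of `other`
        that selects it. ASSUMPTION: the hidden intermediate state is omitted.
        """
        if not self.bits[source]:
            return None
        seen = other.current()                 # the read of the shared bits
        for target, observed in choices.items():
            if seen in observed:               # the "if statement" of the text
                self.step(source, target)
                return target
        return None


P1_STATES = ["n1", "w1", "c1", "n1p", "w1p", "c1p"]
P2_STATES = ["n2", "w2", "c2", "n2p", "w2p", "c2p"]
p1 = BitProcess(P1_STATES, "n1")
p2 = BitProcess(P2_STATES, "n2")

# P1 observes P2 first, using the two observations on transitions leaving n1.
first = p1.observe_and_step(
    p2, "n1", {"w1": {"n2", "w2", "c2"}, "w1p": {"n2p", "w2p", "c2p"}}
)
# P2 then observes P1, using the two observations on transitions leaving n2.
second = p2.observe_and_step(
    p1, "n2", {"w2": {"n1p", "w1p", "c1p"}, "w2p": {"n1", "w1", "c1"}}
)
```

With this order of observations, `first` is `"w1"` and `second` is `"w2p"`, matching the fair-choice scenario in which $P_1$ observes $P_2$ before $P_2$ observes $P_1$. A real shared-memory version would also need the reads and writes of the bits to be atomic with respect to the other process.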
Chapter 3 Distributed Shared Memory Consistency Protocols

In this chapter, we present a new set of consistency protocols for distributed shared memory (DSM). We first explain the problem we address and present our solution. The next section discusses related research. We then present our new consistency protocols, and explain them intuitively. Next we describe how we constructed them using our design method. We conclude with a presentation of performance results.

3.1 Goal

The goal of this work is to devise consistency protocols for distributed shared memory that operate efficiently over either local area or wide area networks. Communication over local area networks is characterized by low latency and low bandwidth, whereas wide area networks have high latency, but high bandwidth. A protocol that performs well over a wide area interconnect must be able to utilize high bandwidth to overcome high latency. This is possible by overlapping computation and communication. Therefore, we must minimize situations where computation waits on communication.

Research in distributed shared memory systems has concentrated on minimizing data communication. But in our case, reducing data communication alone is not enough; since computation should not wait for communication whenever that can be avoided, we also have to reduce communication needed for synchronization. In the following, we briefly review the basic motivation and design of distributed shared memory consistency protocols. We then explain our approach.

3.1.1 The Problem

A distributed shared memory system simulates shared memory over networked computers. Distributed shared memory simplifies the task of writing parallel programs for several reasons: It eliminates the distinction between accesses to local memory and accesses to remote memory. It relieves the burden of managing data since data in the memory is persistent and easy to access.
Since memory can be overwritten, the programmer does not have to explicitly distinguish between stale and newer data. Old values are simply overwritten and the latest values are easily accessible. The shared memory abstraction naturally supports arrays and other familiar data structures. The use of shared memory for interprocess communication is well understood from programming multitasking uniprocessors as well as multiprocessors.

DSM implementations strive to be transparent to the user, so that programs written for shared memory can be used on distributed shared memory with as few changes as possible. Shared memory programs have processes that communicate through shared variables. When these processes execute over machines with distributed shared memory, the variable accesses are transparently converted to distributed accesses using virtual memory hardware. When a variable on a memory page is accessed, if the contents of a shared page exist in the local memory, memory access is allowed. Otherwise, the access is intercepted, and pages on remote machines are fetched over the network. Thus, the local memory is treated as a cache of the hypothetical shared memory.

Consistency and Communication Fetching pages over a network takes time. Therefore, when possible, copies of pages must be kept in local memory. But if a process writes to its local copy, the copy may become inconsistent with other copies. DSM implementations avoid such data inconsistency by coordinating memory accesses according to a memory consistency protocol. A fairly simple protocol for avoiding inconsistencies is to allow only one process to write to a page at a time. If a process P writes to a page, and another process Q either reads or writes to that page, the two memory accesses conflict. This conflict is resolved by granting write permission first to one process and then the other.
When a process acquires permission to write, it may acquire a fresh copy of the page and invalidate the copy held by the previous writer, or update the copy maintained by the previous writer. In either case, data must be communicated across the network. The challenge in implementing DSM is to design consistency protocols that minimize communication.

There are many ways to reduce communication. For example, in the single-writer protocol above, virtual memory hardware forces implementations to detect modifications to pages rather than bytes or words. Thus, writes by processes executing on different processors to different offsets within a page appear to conflict. This is called false sharing. False sharing may be reduced using code rearrangement, choosing small page sizes, proper data placement, or detecting the actual changes to a page and transmitting the difference. Other ways to reduce communication include improvements to the communication infrastructure, using process scheduling to reduce conflicts, and allowing multiple read-only copies of a page. But the communication is dictated primarily by the single-writer protocol: writes to a page from processes on different processors will always require paging across the network. The key to reducing communication is to permit multiple writers to a page.

Weak Consistency Weak consistency protocols allow multiple writers to a page at the same time. Much of the communication in single-writer protocols happens because writes to a page are interpreted as writes to a common variable. But in practice, true conflicts, that is, writes to the same variable, never occur, because programmers explicitly resolve such conflicts using synchronization constructs such as locks or barriers. Therefore, if a page is accessed by two processes without first getting a lock (or a barrier), we can assume that there is no conflict. Copies of a page can then be separately updated.
On the other hand, if a page is accessed from within a synchronization construct, then a conflict is possible. Therefore, writes to a single page must be serialized. Moreover, the page must be updated prior to synchronization. The performance of DSM implementations with weak consistency protocols thus depends on how page updates or invalidations are dictated by the synchronization requirements.

3.1.2 Our Solution

Just as weak consistency models use the presence of synchronization to weaken the consistency requirements, we suggest that it is possible to take into account program memory access patterns to reduce synchronization requirements. For example, a barrier synchronizes processes so that all processes must gather at the barrier before the barrier opens. When barriers are used for synchronizing phases of a computation, they ensure that all processes complete computations in one phase before beginning the next phase. But in many computations, such strict dependence is not necessary. If we keep track of patterns of memory access during a computation, we can optimistically begin the next phase and communicate the results of the previous phase to those processes that are likely to need them. We detect errors, and back up if necessary. If the computation is relatively regular, we will succeed more often than we fail, and computation and communication of different processes will overlap.

We use this idea of anticipatory computation and communication to minimize time wasted waiting for communication. This results in a data-driven model for updating the results of distributed shared memory. We call this model coordinated memory. In the following, we explain the ideas in detail, and provide a formal specification. Performance results show that the model achieves the goals: distributed memory computations over wide-area high-bandwidth networks achieve good speedup.
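To make the contrast between the single-writer protocol and weak consistency from Section 3.1.1 concrete, the following sketch counts communication for two processes that write to different offsets of the same page. It is a toy model under stated assumptions, not the coordinated memory protocol: `single_writer_transfers` and `multiple_writer_merge` are hypothetical names, a page is a Python list, and the multiple-writer case assumes the processes never write the same offset between synchronization points.

```python
def single_writer_transfers(writes, initial_owner):
    """Count page transfers when only the current owner may write.

    `writes` is a sequence of (process, offset) pairs, all to the same page.
    The offset is ignored on purpose: page-level protection cannot see it,
    which is exactly what produces false sharing.
    """
    owner = initial_owner
    transfers = 0
    for process, _offset in writes:
        if process != owner:        # write fault: fetch the page, take ownership
            transfers += 1
            owner = process
    return transfers


def multiple_writer_merge(page, writes_by_process):
    """Let each process write to a private copy and merge diffs at synchronization.

    ASSUMPTION: no two processes write the same offset (no true conflicts).
    Returns the merged page and the number of diffs that had to be sent.
    """
    merged = list(page)
    diffs_sent = 0
    for _process, writes in writes_by_process.items():
        copy = list(page)                       # local copy made before writing
        for offset, value in writes:
            copy[offset] = value
        diff = {i: v for i, v in enumerate(copy) if v != page[i]}
        if diff:
            diffs_sent += 1                     # one message per modified copy
            for i, v in diff.items():
                merged[i] = v
    return merged, diffs_sent
```

For instance, with eight alternating writes `[("P", 0), ("Q", 1)] * 4` and `P` as the initial owner, the single-writer count is seven transfers even though the two processes never touch the same offset. With the multiple-writer merge, the same access pattern costs one diff per process at the synchronization point.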
3.2 Background and Related Work

In shared memory multiprocessors, every processor has to access data from the common shared memory. However, since fetching data from the common memory takes time (i.e., has high latency), the data is cached locally with each processor. As a result, the data is replicated and copies reside in caches that are accessible at high speed (i.e., low latency). However, whenever a processor modifies cached replicated data, all the copies must somehow be updated; otherwise some copies end up with stale data. Furthermore, if two processors write to their caches at the same time, the update procedure must choose between the values written by each processor. These two issues, (1) how to update replicas and (2) how to choose between simultaneous values, are collectively referred to as the cache consistency problem. Particular strategies used to address these issues are called consistency models, consistency protocols, or simply memory models.

3.2.1 Sequential Consistency

A particularly intuitive memory model is called sequential consistency. Under sequential consistency, the shared memory behavior of a multiprocessor must lead to results that are similar to those for parallel processes that are interleaved on a uniprocessor. More precisely [Lam79],

  The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each processor appear in this sequence in the order specified by its program.

We deduce that for sequential consistency (1) a write must update the cache of every other processor and all simultaneous reads must occur either before or after that write, and (2) simultaneous writes by different processors may be interleaved in any order. These guarantees are met if we ensure that for any location, only one process may write to it at any given time.
Thus, we can have two types of protocols for keeping any memory location consistent: either allow only a single-writer process and single-reader process at a time (SWSR) or allow single-writer but multiple-reader processes (SWMR). Note that sequential consistency imposes no restrictions on simultaneous reads.

3.2.1.1 Sequential Consistency Implementation and Performance

To implement sequential consistency on networked machines (for simplicity, assume that each machine is a uniprocessor so that every process executes on a separate machine), we must ensure that only one process may write to a given memory location at a time. Other processes are prevented from writing to that location at the same time. If a different process writes to that location immediately afterward, that process becomes the writer, and the permission to access the location is revoked from the first process. On the other hand, if a process has the permission to write to a location, and another process wants to read from that location, we have two alternatives: (1) either withdraw the write permission from the first process and grant it to the second, or (2) allow both processes to only read that location, and revoke all access if any process later writes to that location, and transfer it to the new writer. Thus, we have to control the read and write permissions to locations, and transport data and permissions between the machines as required.

In a seminal paper, Li and Hudak [LH89] showed how to employ virtual memory hardware and message passing over the interconnection network to implement sequentially consistent distributed shared memory. The virtual memory mechanisms of the operating system controlling the networked machines are extended to control the read and write permissions, and to revoke access rights at the granularity of virtual memory pages. Thus, suppose that a process attempts to write to a memory location within a page that it cannot access.
This attempt results in a page-fault, and the fault handler contacts other machines across the network, locates the page and gets write access as well as the page data. Similarly, if a process attempts to write to a page that it may only read, there is a protection-fault, and the fault handler must gain the write permission and revoke access for all other processes that may have copies. The implementation must support some directory scheme that allows a process to locate other writers and readers, and some arbitration scheme to select a writer in case of multiple requesters. Typically, the directory scheme itself must be distributed for efficient access and update of directory data.

Now it is easy to see that the sequential consistency model leads to severe performance problems, and the particular implementation techniques exacerbate them. Since sequential consistency requires that only one process may write to a memory location at any time, if two (or more) processes write to a location in succession, the memory location "bounces" across the network, as the write permission and page data are exchanged. Thus, a significant number of write operations require data transmission across the network and add communication overhead to the computation. Note that this problem is not an artifact of the implementation, since the single-writer requirement is imposed by the consistency model, and for every write operation by a distinct process, obtaining write permission requires network transmission. Thus, sequentially consistent memory inherently requires significant network communication; as a result, sequentially consistent distributed shared memory must be slow and unscalable.

The page-level granularity of the implementation further exacerbates the "bouncing". The virtual memory system can detect memory accesses to pages, but cannot differentiate between accesses to different offsets within a page.
Therefore, writes by different processors to different offsets within a page seem to conflict, since they are writes to the same page. This is called false sharing, because data that is in fact not shared (i.e., data at different offsets within a page) appears to be shared.

3.2.2 Beyond Sequential Consistency

To improve performance, we can reduce communication by code rearrangement [SMC90], data placement [SMC90], or by choosing small page sizes [FLR+94]. Similarly, we may improve the communication infrastructure [SMC90] and develop special network communication protocols. Other techniques include scheduling improvements, so that after acquiring a page, requests from other processes are ignored for some time period in order to let local computation continue. In our earlier research, we [SMC90] (and others [FP89, DCM+90, MF90]) investigated such issues.

However, such changes can only lead to minor improvements, since the communication requirement is dictated by the sequential consistency memory model. Since sequential consistency requires that only one process may write to a page at a time, writes from processes executing on different processors will always require paging across the network ("bouncing"). Further scalability problems arise since potentially every computer may have to be visited to locate a page. Thus, a major improvement is possible only if we can allow multiple writers, and manage to bound the number of machines that have to be interrogated to locate a page. At the same time, the consistency model must be sufficiently intuitive that programmers can easily reason about their programs and implementors of the model have simple, understandable implementations.

In the following, we describe weak consistency models that allow multiple writers, and also ameliorate problems with false sharing.
Then we argue that the existing models are complex, and that there is room for further optimizations; this motivates coordinated memory as a simpler model that also suggests new optimizations.

3.2.2.1 Weakly Consistent Memory Models

Weakly consistent memory models were invented by computer architects [DSB86a, GLL+90] in order to minimize memory-to-CPU traffic by allowing optimizations such as out-of-order memory operations, overlapping memory operations, and lock-free caches. For example, if a processor sequentially issues two load instructions i and j to distinct memory locations $m_i$ and $m_j$, then j may be issued before i completes, thereby overlapping memory reads and reducing overall time. However, in a multiprocessor, this may mean that some other processor sees the effect of j before i, resulting in incorrect operation. On the other hand, if no other processor actually reads $m_i$, then the computation would still be correct, and at the same time the overlapping loads would allow the program to execute faster. But the sequential consistency model forbids such instruction shuffling even though it may improve performance. To address this problem, weakly consistent models have been defined.

The definitions of weakly consistent memories depend on two (interrelated) observations: (1) Since programmers define their own high-level consistency requirements using locks or barriers, can we design memory models that use this information? (2) Sequential consistency requires that every process must be able to observe the effects of writes in the same order. Is it possible to relax this requirement and still reason about correct programs?

The first question leads to models like release consistency [GLL+90, CBZ95], entry consistency [BZS93], and other hybrid [AF92] consistency models. The hybrid models distinguish between synchronization operations and ordinary operations, and the expected order of program memory observations is analyzed as a combination of the two.
Provided programs distinguish between synchronization and ordinary operations, these models lead to memory behaviors that are indistinguishable from sequential consistency.

The second question leads to models like Pipelined RAM [LS88] and causal consistency [AHN+93, JA94]. These models do not require that all processes be affected by write and read operations, and allow different processes to "see" the effects of operations from different processes in different orders. Using these models, programmers may have to rewrite programs that were intended for sequentially consistent memory [AHJ91]. Further, it can be expensive to implement these models, since the implementation must maintain information about the order of memory accesses for every memory access. Fortunately, it is possible to deduce the order information from programming constructs such as locks and barriers, thus resulting in hybrid models that are very similar to entry or release consistency.

In the following, we will briefly discuss the details of the four models and point out the drawbacks, thus motivating coordinated memory.

Release Consistency

Release consistency [GLL+90, CBZ95] is motivated by the observation that memory locations that are protected by locks are accessed by only one process at a time. Further, an accessing process must first acquire the lock and will be blocked until it is released. Thus, the lock holder knows that until it releases the lock no other process will attempt to access the protected data, as any "interleaving" is forbidden by the lock. Therefore, changes made by a process p to variables protected by a lock need not become visible to any other process q until after p releases the lock. In other words, memory should become consistent only upon lock release. Thus, a release consistent memory implementation can optimize communication in several ways:

1. It can collect modifications made to several variables u, v, w within a critical section and broadcast them in a single message if reducing messages is important [CBZ95].

2. It may pipeline the writes to u, v, w if hiding write latency is more important than reducing messages [GLL+90].

3. It may transmit the changed values only to the process that acquires the lock next [DKCZ93] instead of broadcasting them to all processes.

4. It may piggyback the write values to the acquiring processor on the lock grant message [DKCZ93].

5. It may delay transmitting the changes, and instead merely inform the acquiring processor about the changed values. The acquiring processor can request the changes when necessary [DKCZ93].

6. It can make use of access information: (1) if accesses will be read-only, data can be freely replicated; (2) if processes modify data one at a time, then the data can be migrated from the lock holder to the next acquiring process. All such optimizations can be brought into play to minimize communication [CBZ95].

7. Since programmers use synchronization to avoid data races (i.e., simultaneous writes to one memory location), virtual memory based distributed shared memory systems can assume that pages modified by multiple processes between an acquire and the corresponding release are in fact modified at different offsets within a page. Thus, after the release, the modifications can be safely merged. Therefore, pages that are simultaneously modified do not "bounce" as in sequentially consistent distributed shared memory.

Selecting the appropriate optimization depends on the tradeoffs. In true shared memory multiprocessors [GLL+90], data changes will be pipelined with multiple messages since hiding write latency is more important than reducing messages. In distributed shared memory systems [CBZ95], data changes are buffered to reduce messages. Other optimizations such as migrating data can be applied in both cases, provided the required information is available.
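The merging described in item 7 can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not prescribe a mechanism, so the twin-and-diff approach used here, the tiny page size, and the names make_twin, diff, and merge are all hypothetical. Each writer keeps a pristine copy (a twin) of the page, records only the offsets it changed, and the per-writer diffs are combined after the release.

```python
# Sketch: merging concurrent writes to one page under release consistency.
# Assumption: the program is free of data races, so writers touch disjoint offsets.
# All names and the page size are hypothetical, chosen for illustration.

PAGE_SIZE = 8  # unrealistically small page, to keep the example readable


def make_twin(page):
    """Copy the page before the first local write so changes can be found later."""
    return bytearray(page)


def diff(twin, page):
    """Return {offset: new_byte} for every offset whose value differs from the twin."""
    return {i: page[i] for i in range(len(page)) if page[i] != twin[i]}


def merge(base, diffs):
    """Apply each writer's diff to a copy of the base page; reject overlapping writes."""
    merged = bytearray(base)
    written = set()
    for d in diffs:
        overlap = written.intersection(d)
        if overlap:
            raise ValueError(f"data race at offsets {sorted(overlap)}")
        for offset, value in d.items():
            merged[offset] = value
        written.update(d)
    return merged


base = bytearray(PAGE_SIZE)

# Two processes modify different offsets of the same page between an
# acquire and the corresponding release.
copy_p, copy_q = bytearray(base), bytearray(base)
twin_p, twin_q = make_twin(copy_p), make_twin(copy_q)
copy_p[0] = 7
copy_q[5] = 9

merged = merge(base, [diff(twin_p, copy_p), diff(twin_q, copy_q)])
```

Because only the changed offsets travel between machines, the page never has to move as a whole, which is why simultaneously modified pages need not bounce. A diff computed by comparing values cannot see a write that stores the value already present; that is harmless for merging but is a simplification of this sketch.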
Entry Consistency

A release consistent memory model updates changes to all variables upon a release operation. With entry consistency, the programmer creates an explicit association between a lock and the memory locations it protects, and only the associated memory locations are made consistent upon the next lock acquire. Thus, entry consistency takes a step beyond release consistency in making effective use of the information that is implicit in a program. Clearly, entry consistency and release consistency are similar enough that the aggressive optimizations detailed above can be applied to entry consistency to boost performance. One slight disadvantage of entry consistency is that the programmer must declare the association between the synchronization variables and the data. However, related research indicates that such modifications are not very onerous [JKW95].

Causal Consistency

Sequential consistency requires that all processes agree on some global order of memory accesses. Then we say that a read of variable v must return the value written to v by the "most recent" write. However, as we have seen, enforcing such a strict order precludes many optimizations that do not affect correctness. So we seek a weaker consistency model that does not require all write operations to be totally ordered. Instead, only writes that can affect the behavior of a process should be ordered; this ordering must allow the writes to be overwritten when they no longer affect process behavior.

A process p affects another process q when q reads a value $\nu$ written to some variable v by p. Suppose that as a result of reading v, q writes a value $f(\nu)$ to another variable w. Now if a third process r reads the value $f(\nu)$ from w, followed immediately by a read of v, then r must read the value $\nu$ and not any prior value of v; this restriction defines how memory locations (variables) are updated.
Without some similar restriction, if r were able to read any previous value of v, then in the degenerate case the architecture need not update memory at all and could always return the initial values; in that case, computation would be impossible since processes could not communicate with one another.

We say that a write operation W is causally related to a read operation R either if R reads the value written by W, or if R reads from some process o that read the value written by W (but has not itself updated it). A memory model in which a read operation may return the value written by one of its causally related writes is said to be causally consistent [DSB86b, AHN+93].

To see that multiple return values are possible, consider an extension of the example above such that some process s writes to v simultaneously with p, and r first reads w followed by some variable x written by s. The reads of x and w establish causal relationships between r and s as well as r and p. Furthermore, since p and s wrote simultaneously to v, the writes to v cannot be ordered. Thus, from the viewpoint of r, v has two possible causally recent values, one written by p and one written by s (stated another way, under causal memory simultaneous writes to a variable spawn multiple copies of that variable). Upon reading v, r must choose one of them as the "actual" value. As a consequence, a single causally consistent variable allows multiple writes. Further, since a write must affect only causally related processes, writes need not be propagated to all processes. Thus, implementing causal consistency potentially requires less communication than sequential consistency. Interesting example programs for causal memory are presented in [AHJ91].

Notice that there is a problem with the above description: we have said that writes must be propagated only to causally related processes; however, until the first read establishes causality, a write need not affect any processes at all.
This paradox may be resolved in several ways: (1) Broadcast the values of writes to all processes with timestamps that allow readers to select causally recent values. But this implementation requires too much communication. (2) Initialize memory locations to invalid states such that readers of these locations must contact possible writers to get initial values. Later, every read operation (logically) updates all locations with causally recent values acquired during the read. Both of these implementations have untenable overhead for determining and transmitting causally recent values.

However, we can decrease the need for communication considerably by realizing that the programmer implicitly defines high-level causality relationships by specifying synchronization using locks or barriers [DKCZ93, JA94]. For example, it is obvious that all write operations immediately before a barrier are causally related to read operations immediately afterward, since a barrier release operation is causally related to all further computation. Thus, memory operations for synchronization alone can be used to accurately deduce causality, thereby reducing computational overhead for determining causally recent values and transmitting updates. Such a "hybrid" causal consistency model turns out to be very similar to release and entry consistency.

3.2.2.2 Communication in Weak Models

In weak memory models, we assume that conflicting write operations occur guarded by synchronization constructs. Therefore, consistency has to be guaranteed at the beginning (or the end) of synchronization. Consequently, communication depends on how often synchronization constructs are accessed. When a process must synchronize its computation with other processes, it waits for its pages to be updated, and for the synchronization operations to complete. During this time, computation is suspended waiting for communication.
Since we are interested in using distributed shared memory over wide area networks, we want to overlap computation and communication as much as possible. Therefore, we study the implementation of synchronization constructs.

3.2.3 Synchronization in Distributed Shared Memory

Early distributed shared memory systems [SMC90, LH89, DCM+90] attempted transparent emulation of shared memory multiprocessors. Therefore, synchronization was implemented by supporting test-and-set-like operations within the system. As a result [SMC90], synchronization operations such as spinlocks led to poor performance, since the page bouncing (Section 3.2) caused by competing spinlocks requires communication across the network.

Weakly consistent systems [CBZ95] moved away from the strict transparency requirement, supporting weak consistency models as well as multiple implementations of release consistency. The programmer is expected to annotate shared memory programs to indicate the optimizations desired (e.g., whether some data should be migrated, while other data require multiple writers). In a similar vein, the synchronization operations are implemented by message passing libraries that avoid problems like page bouncing. However, even these implementations require too much communication and significantly impact application performance.

In the following, as an example, we examine the implementations of locks, barriers, and queues and identify the bottlenecks. Later, we see how these bottlenecks can be eliminated, and how other synchronization constructs may be suggested.

3.2.3.1 Implementing Locks

Locks are used to ensure exclusive access to sets of memory locations. Simple spinlocks that spin on a location in shared memory exhibit contention. Contention can be reduced by using queue locks [And90], where processes waiting for the lock enqueue themselves rather than spin. When the lock holder releases the lock, the next holder in the lock queue acquires it.
In distributed shared memory systems, the queue is distributed [CBZ95] over all participants. To enqueue itself, a process puts itself on a local queue, and if the lock is held by a remote machine, the request is forwarded to that machine. If another machine has grabbed the lock, the request is appropriately forwarded. When the request reaches a current holder, the request is queued there. When this request reaches the head of the queue, the requesting node acquires the lock.

In the best case, the lock is held locally. Otherwise, at least one network access is needed, and the acquisition grant message can return the data. The new holder must send an acknowledgment to the previous holder to ensure that the lock has been definitely acquired. In the worst case, the acquisition request may have to "visit" all nodes before it succeeds in entering the queue.

When network access is needed, lock requesters must wait at least twice the network latency time, plus data transfer time. Implicitly, the sender must also wait for twice the network time. If such lock accesses could be optimized, in the ideal case the requester should have the lock and data when needed, and only the sender needs to wait. With coordinated memory, we show how the programmer can help the system to approximate this case.

3.2.3.2 Implementing Barriers

A barrier synchronizes all processes that participate in the barrier. When the barrier is active, no process can pass the barrier until all processes (participants) have arrived at the barrier. Barriers are commonly used in programs to separate phases of a computation. For example, in methods for the iterative solution of linear equations, successive iterations are separated by barriers. The barrier ensures that all processes complete their updates before any process can start the new iteration. Thus, every iteration is guaranteed to use fresh data.
Barriers are typically implemented [CBZ95, BZS93] using a barrier master process which collects barrier entry requests from barrier participants. In case the requests also contain updates from processes, the master may merge the updates and rebroadcast the merged results. Since barriers are usually visited by all processes, it would seem that there is little to be gained by using distributed barriers [HFM88].

However, a barrier entails considerable overhead in both communication and waiting for all processes to arrive. Now consider that the purpose of a barrier is often to ensure that only fresh data from earlier phases is used, not that the processes should synchronize; synchronization is only a means to this end. Thus, if the programmer has access to more specific synchronization constructs that can express this requirement, the needless synchronization overhead is eliminated. The coordinated memory model suggests how to enrich the programmer's repertoire of synchronization constructs so that such information is available to the implementation and may be utilized.

3.2.3.3 Implementing Task Queues

A task queue is used in programs where processes do not have static allocations of work. Instead, each process starts off by processing some part of the problem data, and enqueues the result after it is done. Other processes can then pick the data for further processing.

Task queues have been implemented as migratory data [CBZ95] in distributed shared memory systems. When a process needs to access the queue, the queue is migrated over to that process, where it can add or remove tasks. Since the information in the queue must be available to all processes so that they can pick up a task as soon as they become idle, the migratory data transfer is natural. Experiments [CBZ95] show that contention for the task queue has a significant impact on application speedup.

Normally, a task queue need not guarantee that the holder of a task has exclusive access to that data.
However, if we observe that in some problems (e.g., parallel quicksort) we can also regard queue access as a request for data and access, then we can distribute the queue implementation without suffering contention.

With coordinated memory, we can give a precise specification of the effect of using synchronization constructs on the memory. Thus, it becomes possible to optimize data transfer by achieving synchronization as a side effect of data transfer (similar to message passing).

3.2.4 Our Approach

Early sequentially consistent distributed shared memory systems [LH89, DCM+90] attempted transparent emulation of shared memory multiprocessors. On the argument that these systems inherently [LS88] require more communication than programmers need, weakly consistent models were introduced to reduce communication requirements. These weak models utilized information about synchronization given by the programmer to identify and eliminate unnecessary communication. Modern systems [JKW95, BZS93, DKCZ93] extend this trend to gain efficiency by requiring programmers or compilers to specify more information about the program to the runtime system. Even so, for many programs these systems appear easier to program, since the programmer has less data management overhead.

Just as it is argued that sequential consistency is overly consistent, we argue that traditional synchronization constructs such as locks, barriers, etc., are overly strong. They specify more constraints on process interaction than are necessary. Thus, they require processes to wait for each other more than strictly necessary. By evaluating the relationship between synchronization constructs and memory consistency, and by aggressively utilizing known patterns of process interaction as a part of synchronization, we show that performance of distributed shared memory systems can be further improved.
The coordinated memory model shows how to specify the relationship between synchronization, consistency, and known process interaction patterns. In many cases, this requires little or no change to existing shared memory programs, and has considerable performance benefits.

3.3 Coordinated Memory

In Coordinated Memory, we coordinate memory accesses using information about previous access patterns. We also achieve synchronization as a side effect of data flow. To motivate the solution, we again consider the barrier example and show how to optimize it further. Analyzing the optimization leads to the formal definition of the coordinated memory model.

3.3.1 Adaptive Barriers

When a barrier is used to separate the phases in an iterative computation, changes (or simply change lists instead of the actual contents; the changes might then be retrieved lazily) made to the memory by each process are forwarded to the barrier master. The barrier master consolidates the changes and sends them to the participants who need the data. But whatever the optimizations used to transfer data, to go through a barrier, processes have to synchronize with the barrier master.

Now suppose that the data distribution is known beforehand, and also we know which processes update which data. As a concrete example, consider the implementation of finite differencing using an iterative method involving three processes. Assuming the data is partitioned as in Figure 3.1, after each iteration, process pairs $\langle 1, 2 \rangle$ and $\langle 2, 3 \rangle$ must exchange the pages on the boundaries that contain modified data. Normally, a barrier would be used to insert a boundary to tell the distributed shared memory implementation that the data should be updated. However, it is possible to ensure that stale data is not used without explicit barrier synchronization.

[Figure 3.1: Data Distribution for an Iterative Linear Equation Solver. Regions, in order: data for process 1; boundary values needed by 1 and 2; data for process 2; boundary values needed by 2 and 3; data for process 3.]

Faking a Barrier

To the same effect as a barrier, we change the processes so that after completing an iteration, each process transmits the boundary data areas (indicated in Figure 3.1) to the process that will use it in the next iteration. For example, consider processes 1 and 2. When 1 sends the data to 2 and receives an acknowledgment, 1 knows that 2 has received the new data. Similarly, 2 sends the data to 1 and learns that 1 has the data. Both 1 and 2 know that neither will proceed until it gets the data; therefore, when both have sent the data and received the acknowledgments, they know that neither will use stale data. Thus, the explicit barrier is unnecessary; the data transmission also guarantees synchronization. However, waiting for an acknowledgment is expensive on a high-latency network. Thus, further speedup is possible if we can eliminate the wait.

Interestingly, it is possible to eliminate explicit acknowledgments, since the processes in each pair must transmit data to one another. Thus, the data transmission can also be used as acknowledgment. Hence, in our example, on a reliable network, process 1 can transmit the data to 2 and vice versa. Both processes know that the other will not start the next phase until it receives the updated data, so once again the barrier is not necessary. But most networks are not reliable. Reliable transmission is usually achieved using acknowledgments, so in a "reliable" network, again latency will affect the computation.

It turns out that we can eliminate acknowledgments most of the time even without a reliable network. The idea is that each process transmits its data to the other, and also keeps a local version of the data. For example, suppose that process 1 makes a local copy of the boundary data and transmits it to 2, and begins its computation for the next
If that message does not make it, 2 will explicitly ask 1 for the data; 1 guarantees that until it knows that 2 no longer needs the data it will keep its local version. Now if the message gets to 2, then 2 can continue its computation and when it transmits the results before beginning the next phase, 1 will learn that 2 indeed received the data. Thus, by keeping at most two local versions of the data, the need for the barrier has been eliminated. By assuming that networks are mostly reliable, this optimistic strategy will be able to avoid explicit synchronization for most iterations. The transformations described above can be implemented by altering the program either manually or using compiler techniques. However, we can show that explicit data manipulation within the program is unnecessary; it is possible to implement an adaptive barrier that optimizes itself at run time. Adaptive Barriers In an adaptive barrier implementation, we assume that the data distribution and the shared boundary data does not change from iteration to iteration. We begin with a conventional implementation, where processes contact a barrier master in order to enter a barrier. The barrier master collects the data requests and after combining them takes further action. With the static distribution assumption, suppose the barrier is hit after the rst iteration and every process communicates required boundary data locations to the barrier master. Now, the barrier master collates the data and informs all processes about the future requirement of all other processes. The process acknowledge the information, and then the barrier master allows the processes to proceed. After that, the processes need never again communicate with the barrier master, since they all know one another's data requirement. The implicit or data-driven synchronization can be employed for all future alterations. Thus, we get the ecient barrier without any change to the programs. 
For situations where we cannot guarantee that processes share the same memory locations from iteration to iteration, the adaptive approach can be used, but the tradeoffs must be carefully evaluated. For example, if we know that each process will require data [...] receiver must be able to store the data and hand it over to the next acquirer. Thus, depending on the accuracy of the statically expected communication pattern, we can avoid communication and make the data available to the receiver when needed. If the static pattern is violated, the implementation will still be correct at the cost of extra data management and message exchange.

Similarly, if we use task queues where the next task request can be anticipated, then the enqueueing process can send the task directly to the desired processor. This technique allows us to leverage known patterns to enhance performance. With increasing research in identifying such patterns, we can apply them to distributed shared memory programs without modifying user programs.

3.4 Designing Consistency Protocols

In this section, we show how to use the design method of Chapter 2 to guide the development of protocols for distributed shared memory. In practice, we developed both the protocols and the design method concurrently. Experience gained in one area guided the research in the other area.

The method can be used in two ways. In one, we start top down and refine the desired specification. We can use the Constraint-rule specifications at the high level for very simple models, or also model details. Just as we can show that one Action-rule specification implements another, we can also show that one Constraint-rule specification implements another. In the other way, we can model a new implementation at a low level, for example, using action rules, and attempt to verify that it implements some desired specification.
In the following, we give example Constraint-rule specifications for several distributed shared memory protocols, and suggest possible implementations. We also show an example of Action-rule specifications used to justify the adaptive barrier.

3.4.1 Consistency Specification

First consider very high-level specifications for ordinary, sequentially consistent distributed shared memory. Protocol 10 specifies a single-writer protocol for two processes.

Protocol 10
  Automata:
    $P_1 : n_1 \to w_1 \to n_1$
    $P_2 : n_2 \to w_2 \to n_2$
  Rules:
    $(\vee : w_1, w_2)$

In this protocol, the $n$ states represent the regions where the process has no page, while the $w$ states represent the regions where the process has a page and is a writer. The constraint tells us that only one process may be a writer at a time. We might interpret this as a mutual exclusion protocol, and implement it using token rings or any other suitable distributed mutual exclusion algorithm.

Next consider the specification of a multiple-reader protocol.

Protocol 11
  Automata:
    $P_1 : n_1 \leftrightarrow w_1 \leftrightarrow r_1 \leftrightarrow n_1$
    $P_2 : n_2 \leftrightarrow w_2 \leftrightarrow r_2 \leftrightarrow n_2$
  Rules:
    $(\vee : w_1, w_2)$
    $(\wedge : r_1, r_2)$

Here, we have added the state $r$ to represent readers. In this specification, we would actually like to say that any process can be in either $n$ or $r$, but at most one process can be in $w$. But the specification does not permit multi-state constraints as yet. So we get a somewhat clumsy model, but it helps in developing the more general model.

In this specification, we can interpret the conjunctive constraint as a group membership protocol, rather than as a barrier. This interpretation would allow us to naturally model the change from reader to writer as a group leader election.

These two models are very high level. They tell us that the initial specification makes sense, and that the introduction of reader copies is harmless. We can also use the abstract specifications to consider a few more details. One way commonly used in distributed shared memory implementations uses a home site for a page.
All data about page copies (group members) is maintained there. This is a statically determined membership protocol. In another approach, the current writer serves as the coordinator that grants requests for read copies and write copies. One may rely on randomizing factors like network delays and the interference of other computations to ensure fairness. We can model these possibilities in the specification. Consider a model with the static home page.

Protocol 12
  Automata:
    $N : n_N \to a_N \to g_N \to w_N$
    $H : n_H \to a_H \to f_H \to g_H \to n_H$
    $\phantom{H} : n_H \to a'_H \to f_H \to w_H$
    $W : w_W \to f_W \to n_W$
  Rules:
    $(\vee : a_N, a_H, a'_H)$
    $(\wedge : g_N, g_H)$
    $(\wedge : f_H, f_W)$

In this case, we have shown parts of the state machines of three processes. N is a process that requests write access; it is originally in state $n$ with no page. H is the home process from which N requests the page. W currently has write access. In state $a$, N asks for write access; H fetches the page from W (state $f$) and grants it to N in state $g$. The disjunctive constraint between the ask states $a$ ensures that only one ask succeeds; here the disjunction is interpreted as arbitration by the home site. The fetch and grant ($f$ and $g$) constraints are simple request-response exchanges. The path for H through $a'$ illustrates the flow when H itself requests write access.

This specification illuminates the basic exchanges between the automata to implement the single-writer protocol. Just as we showed that an Action-rule specification can be an implementation of a Constraint-rule specification, we can also show that one Constraint-rule specification is an implementation of another. The methods are the same: establish maps that relate the higher-level specification to the lower-level specification, and show correspondences between the traces. Here, we can do such an analysis for the entire system, because we have still managed to abstract communication. It is easy to show that Protocol 12 is an implementation of Protocol 10.
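The kind of exhaustive check that such specifications invite can be sketched for the single-writer constraint of Protocol 10. The sketch below is an assumption-laden illustration, not the verification method of Chapter 2: it reads the disjunctive constraint as "at most one process is in its w state", it picks token passing as one possible implementation (the text mentions token rings only as an option), and the function names and state encoding are hypothetical. It enumerates every reachable global state and looks for one in which both processes are writers.

```python
from collections import deque

# A global state is (locations, token): `locations` holds "n" (no page) or
# "w" (writer) for each of the two processes, and `token` is the index of the
# process currently holding the token. This encoding is an assumption.
INITIAL = (("n", "n"), 0)


def successors(state):
    """Yield the next global states of a two-process token-passing scheme."""
    locations, token = state
    for i in (0, 1):
        if locations[i] == "n" and token == i:
            # The token holder may become the writer ...
            yield (locations[:i] + ("w",) + locations[i + 1:], token)
            # ... or pass the token on without writing.
            yield (locations, 1 - i)
        elif locations[i] == "w":
            # Leaving the write region hands the token to the other process.
            yield (locations[:i] + ("n",) + locations[i + 1:], 1 - i)


def reachable(initial):
    """Breadth-first exploration of the whole state space."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


states = reachable(INITIAL)
# The constraint is violated by any state in which both processes are writers.
violations = [s for s in states if s[0].count("w") > 1]
# Each process should also be able to become the writer at some point.
can_write = [any(s[0][i] == "w" for s in states) for i in (0, 1)]
```

For this two-process model the search visits only a handful of states, and `violations` comes out empty. The same brute-force style stops scaling as processes and states are added, which is the state-space explosion that motivates dividing a protocol into smaller sub-protocols.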
3.4.2 Adaptive Barrier

So far, we have seen examples of using Constraint-rule specifications. For analyzing the correctness of the adaptive barrier, we use a notation that directly expresses dependencies by the enabling and disabling transitions. Protocol 13 expresses the dependencies shown in Figure 3.1.

Protocol 13
Automata:
    $P_1 : n_1 \to w_1 \to b_1 \to b'_1 \to w'_1 \to n'_1$
    $P_2 : n_2 \to w_2 \to b_2 \to b'_2 \to w'_2 \to n'_2$
    $P_3 : n_3 \to w_3 \to b_3 \to b'_3 \to w'_3 \to n'_3$
Rules:
    $((n_1, w_1) \Rightarrow (w_2, b_2))$, $((n_3, w_3) \Rightarrow (w_2, b_2))$,
    $((n_2, w_2) \Rightarrow (w_1, b_1))$, $((n_2, w_2) \Rightarrow (w_3, b_3))$,
    $((b'_1, w'_1) \Rightarrow (w'_2, n'_2))$, $((b'_3, w'_3) \Rightarrow (w'_2, n'_2))$,
    $((b'_2, w'_2) \Rightarrow (w'_1, n'_1))$, $((b'_2, w'_2) \Rightarrow (w'_3, n'_3))$

In this protocol, process $P_2$ can enter the barrier only when processes $P_1$ and $P_3$ enter the barrier, while $P_1$ and $P_3$ depend only on process $P_2$. In a regular barrier, all three processes would depend on one another. Analyzing the traces of the protocol shows that the barrier transitions $(b, b')$ are not simultaneous in the sense that the transitions all occur after their predecessors and successors. Rather, the reduced dependency allows $P_1$ (or $P_3$) to proceed to the next iteration of the barrier without waiting for $P_3$ ($P_1$) from the previous transition. But the dependence on $P_2$ ensures that only the second iteration can be started this way, not the third. Thus, what we have implemented is a barrier in which computations in two phases can be merged. We satisfy specifications where $P_2$ proceeds when $P_1$ and $P_3$ are done, while the dependency between $P_1$ and $P_3$ is derived implicitly through their dependence on $P_2$.

More formally, this is expressed by specifications like Protocol 14, where $P_2$ synchronizes separately with $P_1$ and $P_3$ through the constraints between the regions $g$ and $h$. In a usual barrier, all three processes would synchronize through a conjunctive constraint on a common region.
Protocol 14
Automata:
    $P_1 : h_1 \to h'_1$
    $P_2 : g_2 \to h_2 \to g'_2 \to h'_2$
    $P_3 : g_3 \to g'_3$
Rules:
    $(\wedge : h_1, h_2)$, $(\wedge : h'_1, h'_2)$
    $(\wedge : g_3, g_2)$, $(\wedge : g'_3, g'_2)$

In this case, the reinterpretation is harmless, so we can accept the implementation.

3.4.3 Summary

In this section, we saw how to model the consistency protocols. But the models also show that the method has many limitations. We have developed only the basis; the method lacks support for constraints involving multiple states in the same process, and ways to specify constraints for a variable number of processes. Thus, when induction is necessary, we need a manual check. At present, this is a common failing of specification methods that use exhaustive search for verification. Future research should reveal solutions.

3.5 Implementation and Performance

This section presents experimental performance results. The experiments are designed to explore the effects of synchronization-related communication, and to see whether programs execute more quickly with adaptive coordinators.

Coordinated memory was developed as a part of the XUNET/BLANCA project, which explores research issues in wide-area ATM gigabit networks. In this setting, the main problem for implementing distributed shared memory over a wide area network is latency. However, ample bandwidth is available, so we sought a distributed memory model that would allow bulk-data transfer and optimistic communication to compensate for the latency bottleneck. Release consistency and entry consistency are two models that allow bulk-data transfer, and can be extended for optimistic communication. However, we soon discovered that the synchronization requirements became the bottleneck. Adaptive coordinators were developed to relieve this bottleneck. Thus, the genesis of coordinated memory is partly a result of the unique experimental platform.
3.5.1 Experimental Platform

Our testbed was a four-node cluster of SGI Onyx workstations, one at the University of Wisconsin, another at NCSA at Illinois, and two in the Computer Science Department at Illinois, connected through ATM switches developed at AT&T. We normalized the latency to 74 ms roundtrip on all four nodes, using the communication latency between the NCSA and Wisconsin workstations to calibrate the behavior. The ATM interconnection was available through locally developed HXA HiPPI to Xunet adaptors that connected TCP sockets (see footnote 2) to the underlying network. The available bandwidth for 4 KB buffers with a 1280 KB TCP window averaged around 103 Mb/s, with a peak of about 140.8 Mb/s. With 64 KB buffers, a peak of 189 Mb/s was observed. While the underlying network is capable of a raw bandwidth of 600 Mb/s, the interconnect with the HiPPI adaptor and TCP overhead considerably affect performance. For comparison, the experiments are also executed on a local-area Ethernet.

Coordinated memory is implemented as an application program using standard Unix facilities. The implementation has two major components: the library of adaptive coordinators that are implemented using message passing, and the virtual memory manipulation system that traps non-coordinator accesses to ensure consistency.

3.5.2 Applications

We selected three applications to evaluate coordinated memory. The first application, matrix-multiply, models the trivial case of no coordination. The program multiplies two 400 × 400 integer matrices and puts the result in a third matrix. We split the computation equally between the four processes. Each process independently computes the result. This application serves as a canonical example of an embarrassingly parallel application, where coordinated memory allows unrestricted replication of data. The sequential program runs for 32 seconds. Figure 3.2 shows the speedup for four cases: with Xunet and Ethernet, with and without pre-replication of the matrices.
The Xunet without replication is slower than the others, but the application is compute bound, and the differences are not apparent with only four processors.

[Figure 3.2: Matrix Multiplication. Speedup versus number of processes (1 to 4) for Xunet, Xunet (adaptive), Ethernet, and Ethernet (adaptive).]

The second application, SOR (successive over-relaxation), is an iterative method of solving partial differential equations (PDEs). The program models the discretized area over which the PDE is solved as a matrix. During each iteration, each matrix element is updated by averaging the values of its four perpendicular neighbors. The program divides the matrix into rows, and each process computes the averages on a row. Only the boundary elements of each row are shared with other processes. We computed 50 iterations for a 1024 × 3072 matrix. After each iteration, the processes synchronize on a barrier. In our case, we experimented with normal and adaptive barriers to explore the performance impact of adaptive barriers.

Footnote 2: UDP communication turned out to be very unreliable.

[Figure 3.3: Successive Over Relaxation. Speedup versus number of processes (1 to 4) for Xunet, Xunet (adaptive), Ethernet, and Ethernet (adaptive).]

[Figure 3.4: Quicksort. Speedup versus number of processes (1 to 4) for Xunet, Xunet (adaptive), Ethernet, and Ethernet (adaptive).]

The sequential program completes in 119 seconds. Figure 3.3 shows that with Ethernet, a low-latency local-area network, the kind of barrier used does not have much effect. Some impact is apparent with four processes. However, it can be clearly seen that for a high-latency network, the difference between the versions with a normal barrier and with an adaptive barrier is remarkable. The version with the adaptive barrier exhibits nearly the same speedup as with a local area network, because the processes do not wait for one another.
They optimistically send the changes to the intended recipient and continue their computation. This overlaps communication and computation, and the network latency has little effect; it is also compensated by the bandwidth. With an adaptive barrier, the speedup is limited only by load imbalance.

The final application, quicksort, was chosen as an example of an application that exhibits low speedups with release consistency [CBZ95] due to contention over the queue. The quicksort uses an explicit distributed queue to coordinate data transfers, which can even be anticipated in some cases (Section 3.3.1). The program partitions an unsorted list of 512K integers into sublists. Small sublists are locally sorted with bubblesort, while larger ones are enqueued into a workqueue. Whenever possible, the enqueueing process explicitly delegates the task; otherwise, a process that has completed its task must dequeue a task from the workqueue. Coordinated memory allows this optimization without affecting user programs, as discussed earlier (Section 3.3.1).

Figure 3.4 shows the speedup for quicksort (for a sequential time of 74 seconds). With the adaptive queue, the quicksort program exhibits behavior similar to SOR; its speedup is limited only by load imbalance. With the explicit distributed queue, there is little communication once the tasks are farmed out, and processes rarely wait for work. However, without the adaptive queue, speedup is severely limited, especially for the high-latency case.

3.6 Summary

The preceding sections presented a technique for overlapping computation and communication by minimizing the contention and waiting period for synchronization. The adaptive synchronization structures combine communication with synchronization. In addition, they allow optimistic communication, so that processes avoid blocking for one another. Thus, they boost application performance even over hostile environments such as wide area networks.
Notice that in our applications, with adaptive coordination, performance becomes limited by load imbalance. Further, the virtual memory driven communication implies that data scattered in local memory must of necessity be communicated sequentially. On the other hand, we have already observed that known patterns of communication can be used to guess future data requirements. Such patterns could conceivably be used for guessing load requirements as well as anticipating future communication.

Chapter 4

A Software Architecture

In this chapter, we present a new software architecture used to build the Choices virtual memory and distributed shared memory system. The architecture is useful in any application that involves concurrent operations over groups of objects, including objects on remote machines. The architecture permits incremental extensions that add new objects and new operations. It uses object-oriented state machines to program the operations, permitting incremental extension of state-machine driven logic. The architecture also includes techniques for resource management.

4.1 Goal

We motivate the architecture with virtual memory as the primary example. We show that current software architectures can lead to change-resistant systems, and suggest ways to factor the objects so that incremental changes become easy.

4.1.1 The Problem

Let us briefly recapitulate the basic concepts of traditional virtual memory systems. Computer architectures provide hardware that permits memory addresses issued by the processor to be late-bound to addresses actually used to access physical memory. The hardware maintains a per-process table (or a cache) that maps virtual addresses to physical addresses to facilitate the late binding. This allows operating systems to implement per-process virtual address spaces that are far larger than available physical memory.
The physical memory is used to cache the contents of the virtual memory that actually reside on secondary disk storage. For various reasons [Tan92], the caching system manipulates chunks of data called pages rather than the smallest addressable unit of physical memory.

Normally, a virtual memory address transparently maps to a physical address and memory is accessed. But if a virtual memory access requires a page not available in physical memory, then a page fault is said to occur. The fault is handled by virtual memory management software that accesses secondary storage to locate the desired memory contents and moves them to physical memory. Periodically, relatively unused pages (selected according to a paging policy) are paged out from physical memory to disk by a process called the pageout daemon. The pageout daemon must also remove the virtual to physical address binding when a page is removed from physical memory. The virtual memory subsystem of an operating system thus interacts with user-level processes, the file system, pageout daemons, the process system, and the hardware.

Modern virtual memory systems further complicate this state of affairs. They support shared memory between processes, copy-on-write sharing, user-level page management, distributed shared memory and other facilities. If processes share virtual memory, then the page-fault system has to contend with simultaneous page faults from multiple processes. The pageout daemon has to manipulate the virtual to physical address maps for several processes. If a page is shared copy-on-write, then the page-fault system must copy the page for a write access, but otherwise repair faults as usual. If a page is a part of distributed shared memory, then page-fault handling may require interaction with other machines across the network. User-level page management requires the paging system to make upcalls from kernel level to user level.
Thus, page-fault handling becomes complex, and therefore programming a fault handler is tedious and error-prone. We argue that current approaches to virtual memory design result in systems that are hard to understand and change.

4.1.2 Our Solution

In the following, we present a software architecture that simplifies the construction of such modern virtual memory systems. Our architecture factors out common structure and behavior for virtual memory management systems. The factorization makes it possible to add data structures and logic for new facilities incrementally, starting from a simple traditional virtual memory system. The design disentangles concurrent interactions between the virtual memory, process, file and networking systems. The resulting system can be either embedded inside the operating system, or used as an external paging facility.

4.2 Background and Related Work

We first explain the architecture used for current virtual memory systems [KN93, RJY+87, Rus91]. We begin by analyzing the requirements of virtual memory systems, and suggest the basic objects for the system. Then we discuss how other parts of an operating system interact with the virtual memory system. The basic objects together with the interactions reveal the overall structure of the system. Next we study how the structure changes when new capabilities like copy-on-write and distributed shared memory are added to the system. We argue that the usual framework structure [Rus91, Lim95] results in a virtual memory system that is hard to change. The arguments motivate an architecture that refactors the framework and reduces spurious dependencies, making it easier to change.

4.2.1 Basic Objects

The basic requirements of the virtual memory system are derived by looking at the primary clients, the user-level processes.
A user process issues virtual addresses in its address space, and the virtual memory system translates them to physical addresses, retrieving memory contents from disk when necessary. A user process may also manipulate regions (ranges of addresses) of the address space. For example, a process may map files (or parts thereof) to regions with read, write, or execute permissions. These requirements suggest the following basic objects: Domains that represent address spaces, MemoryObjects that represent datasets like files, MemoryObjectViews that represent parts of memory objects, MemoryObjectCaches that maintain the physical pages that hold the contents of MemoryObjects, and AddressTranslation that manages virtual memory hardware.

The other clients are daemon processes such as networking drivers and pageout daemons. These processes operate on physical pages, mapping or unmapping them to virtual addresses in different address spaces. They may also use physical pages that belong to the kernel. This suggests the need for a Store object that regulates the distribution of physical pages between various memory object caches, the kernel, device drivers, and pages that are unused.

4.2.2 Interactions

The file system, the process system, and the networking system also interact with the virtual memory system. The file system supports operations on MemoryObjects that convey or retrieve data from the disk. The virtual memory system uses the file system to repair page faults. The file system uses the virtual memory system to access physical pages used in file caches. The process system interacts with the virtual memory system to manipulate AddressTranslation when scheduling and descheduling processes. The networking system interacts with the virtual memory system to make its pages accessible to the device driver.

4.2.3 Operations

Given the basic objects, let us consider the structure of typical virtual memory operations.
The code for the operations is distributed throughout the objects, creating a framework [JR91].

A page fault is detected at the user level. Given the process and the virtual address, the handler invokes a pageFault method on the Domain associated with that process. The pageFault method determines the memory object and the associated physical page (allocating one from the Store if necessary), and issues a read request on the memory object.

A pageout request is generated when the Store runs low on available pages. The pageout daemon visits several MemoryObjectCaches, invoking their pageOut method to release a few physical pages. The data in the physical pages is written out to the disk using the write method of the MemoryObject cached by a MemoryObjectCache. The virtual to physical address maps referring to that page are also altered. These maps, implemented as AddressTranslation objects, are associated with address spaces (Domain). So the pageout daemon must either visit all Domain objects and operate on all AddressTranslations, or maintain a reverse map.

4.3 Why the New Architecture

The preceding discussion portrays the structure of conventional virtual memory implementations with basic facilities. We now consider how the virtual memory system changes when new capabilities are added. We analyze the difficulties, identifying parts that must be refactored.

4.3.1 Examples

Consider adding support for shared memory to the above system. Shared memory is implemented by allowing multiple Domains to map a single MemoryObject, perhaps with multiple MemoryObjectViews. The addition of shared memory means that multiple virtual addresses from different domains may map to the same physical page. The pageout daemon now requires a physical address to virtual address map that references multiple Domains or AddressTranslations. Moreover, a pageout operation may conflict with multiple pagein requests, and multiple pagein requests may conflict with one another.
The policy that selects physical pages during pageout may also be altered to favor non-shared pages. This requires alteration to the synchronization code.

Next consider adding support for interprocess communication via virtual memory manipulation. For example, to transmit data between processes P and Q with different address spaces, the physical page mapped to P's address space can be mapped to the address space of Q. The transmission may have different semantics: for example, the mapping for P might be removed after transmission, or the physical page may be mapped with a copy-on-write permission. Should P or Q write to the data, the physical page is copied, so that the process that has not modified its copy has the original data. The communication is implemented by determining the Domain associated with the source process and the physical page associated with the source address. The Domain of the destination process is modified to map the destination address, and the AddressTranslation is modified to associate it with the physical page.

Like the addition of shared memory, adding copy-on-write (especially if copies of copies are allowed) requires changes to the maps. It also requires changes to the synchronization code, since the implementation may simultaneously operate on multiple Domains and AddressTranslations. Unlike the addition of shared memory, adding copy-on-write requires changes to the pagefault repair code. A virtual page may now have an extra state, copy-on-write, in addition to the usual write, read, and execute.

Addition of interprocess communication creates additional changes. For example, both the file system and the networking system can use virtual memory manipulations to convey data between user-level processes and the device drivers. In many cases, this may be faster than copying data between physical pages.

Finally, let us briefly consider adding support for distributed shared memory.
Virtual pages in distributed shared memory have additional states such as has-distributed-copies or exists-on-remote-machines. Pagefault repair may require the retrieval of page contents across the network. The system must also implement complex page consistency protocols. This requires extensive changes to the pagefault routines.

The changes discussed so far may be classified as follows:

- Changes to objects that associate information: for example, shared memory requires changes to physical-to-virtual address maps maintained in MemoryObjectCaches.
- New objects that add new associations: for example, data structures that remember pages that are copy-on-write copies.
- Changes to synchronization code. The usual implementation strategy is to add semaphores to the methods of objects. For example, MemoryObjectCache uses a semaphore to resolve conflicts between pagefaults and pageouts. Deadlock is avoided by ordering the semaphores for various objects in a hierarchy, and ensuring that different virtual memory operations visit various objects in ascending order [Tan92]. When new operations are added, we have to devise suitable semaphore hierarchies.
- Changes to pagefault and pageout procedures, as virtual pages acquire new states, and handling pagefaults and pageouts becomes more involved.
- New interactions between the virtual memory system and the rest of the operating system. These arise when new capabilities of the virtual memory system are exploited in the rest of the operating system.

4.3.2 Why Change is not Easy

Applying these changes is tedious in the usual framework structure, where objects not only hide implementation details but also distribute control flow. For example, pagefault processing consists of a set of method calls beginning with a call by the user process on a Domain.
Domain locates the appropriate MemoryObject and invokes its pageFault method, which in turn locates the MemoryObjectCache; the MemoryObjectCache fetches a physical page from Store and fills the page by invoking MemoryObject::read. In this structure, pagefault processing may logically be understood as simply the method Domain::pageFault. While this is logically elegant and attractive, it makes changes more difficult. Adding new objects or changing existing associations becomes difficult, as assumptions about the organization permeate the method calls and the selection of method call parameters, and the call chain gets harder to understand. Using inheritance to redefine objects exacerbates the difficulty; the call processing now gets distributed over the inheritance hierarchy in addition to the objects. Embedding synchronization further complicates matters: the hierarchy of semaphores used to resolve deadlocks becomes implicit in the structure of the call chain.

Furthermore, many such call chains originating at different objects appear in the system. For instance, the pageout daemon visits MemoryObjectCaches to remove pages. But removing pages requires operations on AddressTranslations to alter the map and on MemoryObject to move data to disk. This creates a call chain that visits objects in a different order.

Interactions between the virtual memory system and the file system multiply the difficulties. For example, when a page is removed from a MemoryObjectCache during pageout, the method MemoryObjectCache::write is invoked. That method invokes the disk driver to move the data to disk. In turn, the disk driver uses interprocess communication via virtual memory manipulations that are implemented by the MemoryObjectCache.

Such intermingling of data structures, processing, synchronization and inheritance makes the virtual memory system very fragile, and changes can be hazardous. We need a framework structure that separates these aspects and is therefore easier to understand and change.
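To make the distributed control flow concrete, the conventional pagefault call chain described above can be sketched as follows. This is a minimal Python illustration and not the Choices code: the class names follow the text, but the method signatures, the trace argument, the use of integers as physical pages, and the one-page-per-region simplification are all assumptions made for the example.

```python
class Store:
    """Hands out free physical pages (modeled here as plain integers)."""

    def __init__(self, frames):
        self.free = list(frames)

    def allocate(self, trace):
        trace.append("Store.allocate")
        return self.free.pop()


class MemoryObjectCache:
    """Holds the physical pages that cache one MemoryObject."""

    def __init__(self, memory_object, store):
        self.memory_object = memory_object
        self.store = store
        self.pages = {}  # offset -> (frame, contents)

    def page_fault(self, offset, trace):
        trace.append("MemoryObjectCache.page_fault")
        if offset not in self.pages:
            frame = self.store.allocate(trace)
            contents = self.memory_object.read(offset, trace)
            self.pages[offset] = (frame, contents)
        return self.pages[offset][0]


class MemoryObject:
    """A dataset such as a file, together with its cache."""

    def __init__(self, data, store):
        self.data = data
        self.cache = MemoryObjectCache(self, store)

    def page_fault(self, offset, trace):
        trace.append("MemoryObject.page_fault")
        return self.cache.page_fault(offset, trace)

    def read(self, offset, trace):
        trace.append("MemoryObject.read")
        return self.data[offset]


class Domain:
    """An address space: maps a region's base address to a MemoryObject."""

    def __init__(self):
        self.regions = {}  # base virtual address -> MemoryObject

    def page_fault(self, vaddr, trace):
        trace.append("Domain.page_fault")
        # Simplification: every region is a single page at offset 0.
        return self.regions[vaddr].page_fault(0, trace)
```

Calling Domain.page_fault on a mapped address fills the trace with five entries spread over four classes, even though the caller sees a single method call. No one class contains the whole sequence, which is the property the text identifies as the obstacle to change.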
4.4 What Needs to be Redesigned

We have identified the intermingling of various aspects of the virtual memory system as the culprit that makes the software brittle. In the following, we show how to separate these aspects. Then we discuss how the design can be reflected in the program by explicitly representing design features as objects. The refactoring and reification make it easy to understand the ramifications of adding new virtual memory features.

4.4.1 Data Structures and Synchronization

First consider data structures and synchronization.

In every operation, given some parameter such as the pair of virtual address and process, the operation decides upon the objects to be visited, collects related information such as the physical page associated with the virtual address and the file in which the page contents are stored, and operates on this information.

The objects in the virtual memory system are primarily tables that implement the associations. For instance, a Domain maps virtual addresses to MemoryObjects, an AddressTranslation maps virtual addresses to physical addresses, and so on. Changes to virtual memory either add new maps or change existing maps.

The operations may be invoked concurrently, and may conflict with one another. The conflict is apparent only when all of the information necessary for an operation is collected. For example, a pagefault operation may conflict with a pageout operation. That a pagefault conflicts with a pageout is known only when invocations of MemoryObjectCache::pageFault and MemoryObjectCache::pageOut identify the same physical page. Adding new capabilities does not change this nature of conflict detection.

Conflicts between operations are resolved by allowing one operation to proceed while others wait. An operation that proceeds gains exclusive rights to alter the data structures. Again, this aspect does not vary when new capabilities are added.
The changes to the data are few and predictable: the classic example is that after pagefault, a virtual page has an associated physical page, and after pageout, there is no physical page.

The result of an operation depends on the current state. For example, a pagefault operation may allocate and fill a physical page if necessary, but if the physical page is present, it need only add a virtual address to physical address mapping to the AddressTranslation. Detailed analysis shows that the logic of the operations can be easily programmed as a state machine. When new operations are added, the state machine must be extended.

4.4.2 Interactions

Next consider interactions with other subsystems:

- Consider interactions where the virtual memory system makes file or networking requests, as in pageout operations. During such a request, the other system may invoke virtual memory operations on the same physical page that is part of the pageout, leading to an apparent conflict. Such conflicts due to call cycles must be avoided.
- Consider interactions initiated by other systems, such as virtual memory manipulations during interprocess communication. These interactions can be treated as normal virtual memory operations.
- Other interactions are implicit. For example, physical pages are dynamically distributed among many entities in the operating system: file system, process system, user allocated memory, device drivers and so on. When new pages are needed elsewhere, pages allocated to one entity must be deallocated. The most appropriate page to deallocate, and the disposal of its contents, depend on the entity, so we need to distribute the responsibility for deallocation.
- Interactions across machines in distributed shared memory. Most interactions between virtual memory and other subsystems are simply implemented by designing the appropriate interfaces, because the language compiler takes care of the rest.
Distributed shared memory is different, because here virtual memory systems interact across different machines.

4.4.3 A Solution

We make the aspects discussed above explicit. First consider the call chain and synchronization.

- The basic objects of the virtual memory system, like Domain and MemoryObjectCache, implement methods to query and alter table entries.
- The call chain for every operation is reified into Operation objects that invoke the various table methods to implement the operation. For example, pagefault is implemented with an OpPageFault class.
- All the information required to implement an operation is gathered into Parameter objects. For example, the execution of pagefault begins with the virtual address and gathers the relevant MemoryObject, MemoryObjectCache, PhysicallyAddressableUnit (an object representing the physical page) and so on. These parameters are explicitly gathered in ParamPageFault objects.
- Every invocation of an operation generates an instance of Parameter objects. These objects are enqueued and used to detect conflicts between operations with explicit Conflict classes.
- Since only one of several conflicting operations proceeds, the instance of Parameter for that operation also serves as a token that grants permission to change various tables as required by the operation.

The next aspect to be made explicit is the logic for the operations.

- The states of virtual pages are explicitly represented as state objects. Different types of state objects encode different states, and the methods of a state object correspond to different operations. For example, a virtual page may be in two states, PhysicalPage and NoPage. The methods PhysicalPage::pageFault and NoPage::pageFault implement pagefault handling. If there is no page, NoPage::pageFault will allocate a physical page, change AddressTranslation and update the hardware, whereas PhysicalPage::pageFault will simply update the hardware.
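The reified operation, its parameter object, and the two state objects named above can be sketched together as follows. This is a hedged Python illustration rather than the thesis implementation: the names OpPageFault, ParamPageFault, NoPage, and PhysicalPage come from the text, but the fields of ParamPageFault, the Hardware stand-in, the page_out methods, and the convention that each method returns the next state are assumptions introduced for the example. Conflict detection between enqueued parameter objects is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Hardware:
    """Hypothetical stand-in for the MMU: the set of mapped virtual addresses."""
    mapped: set = field(default_factory=set)


@dataclass
class ParamPageFault:
    """Everything one pagefault needs, gathered in one object (assumed fields)."""
    vaddr: int
    translation: dict    # plays the role of AddressTranslation: vaddr -> frame
    free_frames: list    # plays the role of the Store
    hardware: Hardware


class NoPage:
    """State of a virtual page that has no physical page."""

    def page_fault(self, p):
        frame = p.free_frames.pop()         # allocate a physical page
        p.translation[p.vaddr] = frame      # change the address translation
        p.hardware.mapped.add(p.vaddr)      # update the hardware
        return PhysicalPage()               # next state

    def page_out(self, p):
        return self                         # nothing to evict


class PhysicalPage:
    """State of a virtual page that is resident in physical memory."""

    def page_fault(self, p):
        p.hardware.mapped.add(p.vaddr)      # only the hardware is updated
        return self

    def page_out(self, p):
        p.hardware.mapped.discard(p.vaddr)
        p.free_frames.append(p.translation.pop(p.vaddr))
        return NoPage()


class OpPageFault:
    """Reified operation: the call chain lives here, not in the tables."""

    def __init__(self, page_states):
        self.page_states = page_states      # vaddr -> current state object

    def run(self, p):
        state = self.page_states.get(p.vaddr, NoPage())
        self.page_states[p.vaddr] = state.page_fault(p)
```

In this sketch, adding a new page state (for example a copy-on-write state) would mean adding one class with its own page_fault and page_out methods, leaving OpPageFault and the tables untouched. That is the kind of incremental extension the text argues for, although the real system would also need the conflict handling omitted here.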
The interactions are made explicit as follows:

- Interactions between the systems are made explicit by defining Interaction classes whose methods define the interactions. These are similar in spirit to Operation classes.

The advantages of this structure may be summarized as follows:

- Since call chains are explicit in the Operation objects, changes to old operations are made by defining new classes rather than by changing methods of individual basic objects as in the traditional design.
- Explicit Parameter objects help in precisely defining methods that implement conflict detection and resolution, rather than implicitly encoding them in semaphore hierarchies. Changes that add new conflicts or change the resolution of old conflicts can be explicitly programmed. The use of parameter classes as permission tokens greatly simplifies concurrency control.
- New states and changes to the logic of operations can be explicitly described via Object-Oriented State Machines as described below.

The preceding discussion gives an overview of the unique aspects of our software architecture. In the following, we describe the architecture in greater detail, highlighting design decisions as design patterns.

4.5 Architecture of the Virtual Memory System

We present the architecture as a series of patterns that are used to solve design problems. We start from the point of view of users of the virtual memory system. We show how virtual memory functionality may be exported to users and to other parts of the operating system. Next we show how to organize the internals of the system by reifying operations as objects. The design of concurrency control code follows. This completes the basic aspects of the design.

Three other aspects are taken up afterward. The first is the implementation of virtual memory operations using object-oriented state machines. This allows us to smoothly add complex logic for operations with features like copy-on-write and distributed shared memory.
Then we present architectural features for adding interactions with remote virtual memory systems, and discuss some design issues for resource management.

4.5.1 Exporting Functionality

The first design question is how to export virtual memory functionality to user-level processes and other subsystems like the file system. The design can be tricky, because the virtual memory and other subsystems may use one another's services recursively. We describe the design in two steps. First consider exporting virtual memory services without the recursive aspect.

Context: The virtual memory system provides services to many entities, like user-level processes, the file system, the process system, external pagers, and so on. The virtual memory system services are implemented by different objects within the system.

Problem: Although the virtual memory services are implemented by different objects within the system, it is vital that other entities do not depend on the internal structure. If other entities encode knowledge about the internal virtual memory structure, changing the structure can be difficult. Also, different entities may use different services provided by the system, and may need to know the internals of the system to different degrees.

Solution: For each interacting entity, define an Interactor class that describes the services provided by the virtual memory system. For example, VMInterface is an Interactor that provides methods like VMInterface::pageFault.

Consequences: The Interactor classes define entry points to the virtual memory system, and make the dependencies between virtual memory and other systems explicit. This allows us to change the internal structure of the system without impacting other subsystems. New services can be provided by extending the Interactors by inheritance. The degree of exposure of the details of the virtual memory system can be controlled by designing the proper interface.
It is also conceptually elegant, in that the Interactors define a notion of a single virtual memory subsystem. A drawback of the design is that there are many interfaces. The programmer must ensure that service definitions are identical in different interfaces, and that there are not unmanageably many variations.

Notes: By itself, this is the Facade [GHJV93] pattern. But as we see below, we need a variation.

Next, we look at the impact of recursive relationships between virtual memory and other subsystems.

Context: We have to implement virtual memory services that use services from other subsystems. In turn, the requested services may recursively use virtual memory services. For example, the virtual memory system may request file system services during pageout, and in turn the file system requests virtual memory manipulations for the disk driver.

Problem: Although the division of the operating system into a set of interacting subsystems is convenient, it partitions the code for operating system functions among the entities. Recursive relationships can arise between the facilities provided by different subsystems. This can make it difficult to understand, change, and optimize the overall system. For example, any changes to the virtual memory system must guarantee that the file system can safely use the virtual memory system even when the use is reentrant.

Solution: For each interacting subsystem, define an Interactor class that describes the services provided by the virtual memory system, and the services requested by the system. If the two types of services share results or parameters during some operation, implement the appropriate checks to validate the relationship. For example, MemoryObjectCache serves as an Interactor between the virtual memory system and the file system. When the virtual memory system invokes pageOut, it uses MemoryObject::write provided by the file system.
MemoryObject::write recursively invokes MemoryObjectCache::pageFault to map physical pages for disk output. The recursive call is detected by the MemoryObjectCache, so that it does not interfere with the changes made to MemoryObjectCache as part of the pageOut processing that precedes the MemoryObject::write call.

Consequences: The Interactor reduces coupling between the interacting systems. Explicitly validating that the virtual memory and file system services may invoke one another recursively prevents changes to either the file system or the virtual memory system from violating assumptions. It localizes the assumptions that would otherwise be implicit in the code. An Interactor may also serve as a convenient point to cache the results of virtual memory services. A drawback is that the explicit validation may be difficult to implement, especially if the interacting entities and interactions proliferate. On the other hand, the proliferation may be an indication that redesign is necessary.

Notes: An Interactor class combines aspects of the Mediator [GHJV93] and Facade [GHJV93] patterns. If caching is implemented, it may have elements of Memento [GHJV93].

4.5.2 Organizing the Internals

The next design issue is the internal structure of the virtual memory system. We argued previously that distributing the behavior for virtual memory operations among the basic objects creates difficulties. The solution is to create Operation and Parameter objects that use basic objects. The decisions involved in their design are explained below. Lastly, we show how Interactors use Operation and Parameter objects to actually implement the services.

4.5.2.1 Designing Operations

First consider the design issues for Operation objects.

Context: Virtual memory operations such as pagefault involve interactions between many objects.

Problem: The behavior of virtual memory operations is distributed among many objects, so the associations between objects influence the behavior code.
Adding new facilities to the virtual memory system can change the existing associations, and add new behavior. Changing the associations can change existing behavior as a side effect. When adding new behavior, it can be difficult to decide how to distribute it among the objects. Moreover, it is tedious and error prone to change many objects in a system for every new operation.

Solution: Collect the behavior in an Operation object that coordinates the basic objects. The basic objects need only implement object associations, state inquiry, and state alteration functions. For instance, in the original virtual memory design, pagefault processing was distributed among methods of Domain, MemoryObject, MemoryObjectCache, and so on. In our architecture, there is an OpPageFault class with a pageFault method that manipulates Domain and other objects during pagefault processing.

Consequences: Explicit Operation objects centralize operation implementations, making them easier to understand. If associations between objects are changed, local changes can be made in the Operation objects. New operations are easily added without changing the basic objects. On the other hand, the centralized, monolithic behavior can become complex. Then we need other ways to reduce the complexity.

Notes: In the virtual memory system, Operations indeed become complex. Object-oriented state machines [SC95] were invented to simplify the operations. Thus, we can successfully use mediators.

4.5.2.2 Data Management

Next, consider the issues for Parameter objects.

Context: Operations such as pagefault have many parameters, such as the virtual address range, physical pages, process, address translation, memory access permissions, memory objects, and so on. The parameters have to be communicated to other subsystems like the file system, and are useful in detecting conflicts among operations. Different operations require different parameters.
Problem: When Operation classes invoke methods of basic objects, different methods need different parameters. Similarly, services provided by other subsystems and procedures that detect conflicts among concurrent operations all need different parameters. When the operations change, the parameters change. Therefore, managing the parameters as parameters of method calls is tedious.

Solution: Package all interesting parameters into a Parameter object, and define update and inquiry methods as well as methods that implement conflict detection. In our design, there is a VMParameter object that gathers all parameters for operations like pagefaults. When we added distributed shared memory, additional parameters were added by deriving a DSMParameters class.

Consequences: Parameter objects reduce the large collection of parameters into a single object that is easier to manage. This makes the definition of services more uniform and makes it easier to add new parameters for new virtual memory facilities. It also centralizes operations like conflict detection and makes them explicit in the code. Parameter objects can become complex if there are too many parameters. If a parameter object is used as the sole input to a method, it simplifies the interface but hides details like parameter types and distinctions between read-only and writable parameters.

Notes: Parameter objects also help in solving synchronization problems and streamlining the interaction of virtual memory and other subsystems.

Finally, we show how requests from the clients of the virtual memory system use the Operation and Parameter objects.

Context: Interactors like VMInterface define methods like VMInterface::pageFault that are called by user-level processes. The functions are actually implemented by Operation classes like OpPageFault that use Parameter objects like VMParameters. Interactors and Operations carry no state, and may have single instances for the whole system.
Problem: Interactors should be able to invoke different types of Operation classes and create Parameter objects. Hardcoding the details directly in the methods of Interactors means that we would have to replicate the method for all Interactors. Also, it becomes difficult to change details like how to allocate memory for the parameter objects. We also need to create instances of Interactor classes like VMInterface without resorting to global variables to store the instance.

Solution: The construction process for invoking an operation is similar for all methods defined by Interactors: locate the corresponding Operation class, instantiate the appropriate Parameter object with the parameters provided by the interacting entity, and pass the Parameter to the Operation. Define an abstract Factory class that encodes this procedure as the makeProduct method, and define a concrete class for each variation. When a class has a single global instance, let the class manage that instance, providing methods like makeInstance, getInstance, and destroyInstance. This applies to Interactors, Operations, and Factories.

Consequences: Factory classes organize details like memory management involved in creating objects. But if the details change for some products, it may become tedious to extend the factory and its subclasses.

Notes: These are the standard patterns Abstract Factory [GHJV93] and Singleton [GHJV93] applied to virtual memory.

The design presented so far shows how to organize the virtual memory interactions and the implementation of virtual memory operations.

4.5.3 Concurrency Control

Our next design goal is to clarify the design of concurrency control.

Context: Virtual memory operations query and modify the state of many objects. When two operations need to modify the state of the same object, we have to serialize the modifications.
The usual way is to associate semaphores with the basic objects and ensure that the operations interact with the objects in a hierarchical fashion so that there are no cycles leading to deadlock.

Problem: Some operations may not visit the objects in the same order. For example, pagein begins by visiting Domains, while pageout begins by visiting MemoryObjectCache. Some operations may recursively visit objects more than once, creating cycles. For example, pageout invokes MemoryObjectCache::pageOut, which invokes MemoryObject::write, and in turn the disk driver invokes MemoryObjectCache::pageIn (for the kernel). Other operations, like interprocess communication, may visit multiple Domains and AddressTranslations.

Solution: The use of Operation objects makes the order of object invocation explicit. Examine all Operation classes and divide them into categories of operations, such that operations from one category may lead to cycles with operations from another category. Identify operations that may lead to recursive cyclic visits to objects. Use semaphores to serialize operations across categories. For example, interprocess communication operations are serialized among one another before they are allowed to conflict with other operations. Use Parameter objects as tokens to detect recursion. For example, MemoryObject::write passes along a VMParameter object that represents the pageout operation, and also stores it with the MemoryObjectCache that originates the write call. When the disk driver invokes MemoryObjectCache::pageIn, it passes the same parameter object as a token. The MemoryObjectCache can thus identify the recursion and avoid deadlock. The remaining operations operate on objects in a hierarchical fashion. They can simply use semaphores associated with the objects.

Consequences: The use of Operation and Parameter objects makes the concurrency control explicit. When new features are added, analyzing the effect of the new features becomes easier.
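The use of a Parameter object as a recursion token can be sketched as below. This is an illustrative reduction under stated assumptions: beginPageOut and endPageOut are hypothetical names, the token is simply the address of the parameter object, and pageIn only reports whether the caller may proceed, where a real kernel would block other callers on a semaphore.

```cpp
#include <cassert>

// Hypothetical parameter object; its address acts as the token.
struct VMParameter {
    int pageNumber;
};

class MemoryObjectCache {
public:
    // Called by the pageout operation before it issues the write:
    // remember which operation currently owns this cache.
    void beginPageOut(const VMParameter* token) {
        assert(owner_ == nullptr);  // sketch: one pageout at a time
        owner_ = token;
    }
    void endPageOut() { owner_ = nullptr; }

    // Returns true if the caller may proceed now. A call that carries the
    // token of the pageout in progress is a recursive visit and is let
    // through; any other caller would have to wait.
    bool pageIn(const VMParameter* token) const {
        return owner_ == nullptr || owner_ == token;
    }

private:
    const VMParameter* owner_ = nullptr;
};
```

Because the recursive call presents the same object that the cache stored, the cache can distinguish it from an unrelated, genuinely conflicting pagein without consulting any lock hierarchy.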
So far, we have discussed the design of interfaces for interactions between virtual memory and other subsystems, the internal design of virtual memory using Operations and Parameters, and the design of concurrency control. The remaining aspects of the design are:

How to use state machines to simplify the design of Operation objects.

Interactions between virtual memory systems on different machines for implementing distributed shared memory.

Programming the dynamic distribution of pages.

4.5.4 Operations Using Object-Oriented State Machines

In this section, we show how to program the code for virtual memory operations in our architecture. We begin with a basic design pattern for programming state machines. We then present implementation techniques that make it possible to extend state machines by inheritance. Subclassing and composition techniques for state machines are described. These methods are used to extend a basic virtual memory system to add copy-on-write and distributed virtual memory.

4.5.4.1 Basic State Machines

Let us begin with the state machines.

Context: The behavior of virtual memory operations can be defined as a change in the state of the virtual memory data structures. For example, pagefault changes the state of a virtual page from Unmapped (no mapped physical page) to Mapped (physical page is mapped to virtual page), while pageout changes it from Mapped to Unmapped.

Problem: Usually, the state of an object is maintained as values of its instance variables. If the behavior during a method call depends on the current state, then the method is programmed using if or case statements. The state is implicit in the variables, and the transitions are implicit in the variable assignments. Such a monolithic organization is difficult to understand. If a new state is added, several cases and methods must be updated together, complicating code maintenance. Alternatively, an explicit state table can be used.
The uniform format makes transitions explicit, but the logic for selecting transitions and actions is still implicitly programmed as tests and assignments on state variables.

Solution: Represent the state directly using state objects, one object for each state. The behavior of the actual object is implemented as methods of the state objects. The object maintains a pointer to the current state object and delegates methods to that object. Methods of the state objects return the next state. For example, in the original design, MemoryObjectCache maintains tables of pages and their current state, and implements pageFault and pageOut using conditional statements. In our design, these methods are delegated to Mapped and Unmapped objects, and only pointers to these objects are maintained in the MemoryObjectCache.

Consequences: The states and transitions are explicit, and the appropriate transition is selected by examining a single variable. The changes to variables that define the state are grouped within the methods of state objects. The organization simplifies the task of adding states or making other changes. In addition, such changes do not affect the delegating object. For instance, adding new states VMPageReadOnly and VMPageWritable instead of VMHasPage will change the methods of VMNoPage, but not affect MemoryObjectCache.

Representing state using objects (the State [GHJV93] pattern) simplifies Operation classes. When new states are added, or existing states change, we implement the changes by creating new state classes and suitably altering the methods for existing state objects. For example, consider the state machine in Figure 4.1 for simple virtual memory, and the state machine in Figure 4.2 that implements copy-on-write. In Figure 4.1, a virtual memory page may be mapped into physical memory so that it is accessible. The page may be unmapped to store it on backing store and release physical memory for other use (see footnote 1).
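A minimal sketch of the State pattern for the two-state machine of Figure 4.1 follows. The state names Mapped and Unmapped and the operations pageFault and pageOut are taken from the text; the Page fields, the PageState base class, and the VirtualPage holder are assumptions introduced for illustration (in the thesis design the state pointers live in MemoryObjectCache).

```cpp
#include <cassert>

// Hypothetical stand-in for per-page hardware and disk status.
struct Page { bool inMMU = false; bool onDisk = false; };

class PageState {
public:
    virtual ~PageState() = default;
    // Each operation returns the next state.
    virtual const PageState* pageFault(Page& p) const = 0;
    virtual const PageState* pageOut(Page& p) const = 0;
};

class Mapped final : public PageState {
public:
    static const PageState* Instance();
    const PageState* pageFault(Page& p) const override;
    const PageState* pageOut(Page& p) const override;
};

class Unmapped final : public PageState {
public:
    static const PageState* Instance();
    const PageState* pageFault(Page& p) const override;
    const PageState* pageOut(Page& p) const override;
};

const PageState* Mapped::Instance()   { static const Mapped s{};   return &s; }
const PageState* Unmapped::Instance() { static const Unmapped s{}; return &s; }

// Already mapped: only refresh the hardware; the state does not change.
const PageState* Mapped::pageFault(Page& p) const { p.inMMU = true; return this; }
const PageState* Mapped::pageOut(Page& p) const {
    p.inMMU = false;
    p.onDisk = true;
    return Unmapped::Instance();
}
// Not mapped: bring the page in, then move to Mapped.
const PageState* Unmapped::pageFault(Page& p) const {
    p.onDisk = false;
    p.inMMU = true;
    return Mapped::Instance();
}
const PageState* Unmapped::pageOut(Page&) const { return this; }

// The delegating object keeps only a pointer to the current state.
class VirtualPage {
public:
    void pageFault() { assert(state_); state_ = state_->pageFault(page_); }
    void pageOut()   { assert(state_); state_ = state_->pageOut(page_); }
    const PageState* state() const { return state_; }
private:
    Page page_;
    const PageState* state_ = Unmapped::Instance();
};
```

No if or case statement on state variables appears in VirtualPage; the transition is selected by the virtual call on the single state pointer.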
Footnote 1: For simplicity, the state machine diagrams do not show loops (transitions that do not change state).

[Figure 4.1: Page States in a Simple Virtual Memory System. States: Mapped, Unmapped. Transition labels: pageOut, pageAccess.]

[Figure 4.2: Page States for Copy-On-Write. States: RMapped, WMapped, RUnmapped, WUnmapped. Transition labels: pageRead, pageWrite, pageRead/pageWrite, pageOut, makeCopy.]

The state machine in Figure 4.2 supports copy-on-write (COW). COW allows data created by one process to be shared with a different process without requiring the data to be copied. Instead, the physical pages on which the data resides are shared between processes until the processes modify them. Data is "copied" by mapping the associated physical page into the virtual address space of the target process with read-only access. However, upon a write access, the "copied" data is duplicated by copying the page to a new physical page and changing the read-only access to write access.

The two figures show several similarities. For example, states RMapped and WMapped are similar to Mapped, and methods pageRead and pageWrite are similar to pageFault. The two figures also have differences corresponding to the additional behavior. For example, makeCopy is added, and pageWrite causes transitions from RMapped to WMapped.

4.5.4.2 Derived State Machines

One way to implement the copy-on-write state machine is to copy the code of the original state machine and alter it as necessary. However, this makes code maintenance difficult. A better alternative is to express the relationship directly, by considering the copy-on-write machine to be a subclass of the original machine. For example, can both instances of pageOut in Figure 4.2 be defined by inheriting the pageOut method? If so, we could program the copy-on-write machine as follows:

Derive pageRead and pageWrite from pageFault, and

Program new methods like makeCopy.

One solution is as follows:

Context: We have implemented the virtual memory state machines using the State pattern.
We want to derive both WMapped::pageOut and RMapped::pageOut by inheriting from the method Mapped::pageOut.

Problem: The pageOut method is programmed as follows:

    Mapped::pageOut(Page * p) {
        p->flushMMU();
        p->writeToDisk();
        return Unmapped::Instance();
    }

We might derive a class WMapped from the class Mapped, hoping to reuse Mapped::pageOut. But now there is a problem: where Mapped::pageOut returns Unmapped::Instance, WMapped::pageOut must return WUnmapped::Instance. Although behavior in the two states is similar, the state transitions differ for the COW machine. If we attempt to reuse Mapped::pageOut by redefining Unmapped::Instance to return WUnmapped::Instance, we find that RMapped::pageOut cannot reuse Mapped::pageOut, as it must return RUnmapped::Instance.

Solution: We use indirection to resolve the problem. Return the next state indirectly through a table of states called StateMap. That is, pageOut is programmed as follows:

    Mapped::pageOut(Page * p) {
        p->flushMMU();
        p->writeToDisk();
        return map->Unmapped();
    }

Now, both WMapped and RMapped are derived from class Mapped, but the map variable is initialized differently in the two classes. The map in WMapped returns WUnmapped for the invocation map->Unmapped(). In RMapped, it returns RUnmapped instead.

The general principle is that the StateMap, together with the implicit virtual function table (VTable [Str91]) for each state object, expresses the relationships between state transitions of the base and derived machines. Class WMapped is derived from Mapped, while its map is initialized to return WMapped and WUnmapped. Thus, state transitions from Mapped to Unmapped in the base machine map to transitions from WMapped to WUnmapped.

Consequences: New state machines can be derived from old state machines in a systematic way. Actions from base state machines can be reused in the derived machine. But initializing the StateMaps is tedious.
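The StateMap indirection can be sketched as a complete, compilable fragment. This is a hedged illustration: the names StateMap, Mapped, WMapped, RMapped, and map->Unmapped() follow the text, but the State base class, the counters standing in for flushMMU and writeToDisk, and the CowMachine struct that wires the maps together by hand are assumptions made for the example.

```cpp
#include <cassert>

// Counters stand in for the real flushMMU() and writeToDisk() actions.
struct Page { int mmuFlushes = 0; int diskWrites = 0; };

class State;

// Table of next states. Each derived machine initializes it differently.
struct StateMap {
    const State* unmapped = nullptr;
    const State* Unmapped() const { return unmapped; }
};

class State {
public:
    explicit State(const StateMap* map = nullptr) : map_(map) {}
    virtual ~State() = default;
    // Default: a loop transition that leaves the state unchanged.
    virtual const State* pageOut(Page&) const { return this; }
protected:
    const StateMap* map_;
};

class Mapped : public State {
public:
    explicit Mapped(const StateMap* map) : State(map) {}
    const State* pageOut(Page& p) const override {
        assert(map_ != nullptr);
        ++p.mmuFlushes;           // p->flushMMU()
        ++p.diskWrites;           // p->writeToDisk()
        return map_->Unmapped();  // next state chosen by the map
    }
};

// Both copy-on-write states inherit Mapped::pageOut unchanged.
class WMapped : public Mapped { public: using Mapped::Mapped; };
class RMapped : public Mapped { public: using Mapped::Mapped; };

// Hand-written wiring of the derived machine (the tedious part).
struct CowMachine {
    StateMap wMap, rMap;
    State wUnmapped, rUnmapped;
    WMapped wMapped{&wMap};
    RMapped rMapped{&rMap};

    CowMachine() {
        wMap.unmapped = &wUnmapped;
        rMap.unmapped = &rUnmapped;
    }
    CowMachine(const CowMachine&) = delete;
    CowMachine& operator=(const CowMachine&) = delete;
};
```

Here the action code exists once, in Mapped::pageOut, and only the two StateMap entries differ between the write-mapped and read-mapped variants.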
In our system, we avoid the tedium of initializing StateMaps by hand by defining a small language to express the relationship between base and derived machines.

The technique of using StateMaps can be easily extended to implement composition and delegation between state machines. In the virtual memory system, we use composition extensively to implement distributed shared memory consistency protocols. We briefly present an example, and comment on the implementation. Other features of object-oriented state machines are presented in [SC95].

[Figure 4.3: DSM-VM State Machine]

[Figure 4.4: DSM-NET State Machine]

[Figure 4.5: DSM Composite State Machine]

[Labels appearing in Figures 4.3 to 4.5. States: DMapped, DUnmapped, Remote, Null, Quiescent, Send, Fetch, MappedQ, UnmappedQ, RemoteQ, FetchN, SendN. Transitions: pageOut, pageAccess, remAccess, getPage, herePage, ackPage.]

4.5.4.3 Composing State Machines

State machines are composed to combine behaviors defined in component machines. We demonstrate composition by constructing a distributed shared memory (DSM) protocol machine (Figure 4.5) out of a virtual memory machine (Figure 4.3) and a networking machine (Figure 4.4).

DSM [SMC90] provides the illusion of a global shared address space over networks of workstations, whose local memories are used as "caches" of the global address space. The caches have to be kept consistent: a simple approach allows only one machine to access a shared page at a time. If another machine attempts to access that page, its virtual memory hardware intercepts the access, and its fault handler fetches the page from the current page owner. Thus, behavior for DSM has VM and networking aspects. We define VM and networking behavior using separate state machines, and compose them to get a DSM machine.

DSM-VM Machine: In a DSM system, pages may be either DMapped, DUnmapped, or Remote (Figure 4.3). The DMapped and DUnmapped states are inherited from the simple VM machine (Figure 4.1).
State Remote represents a page on some remote machine, and remAccess defines actions for pages accessed by a remote machine. The transitions are defined as though no networking were necessary. (The special state Null ignores all VM actions; it is used in composition. We always create Null and Error states for every state machine.)

DSM-NET Machine: The networking machine (Figure 4.4) implements a trivial protocol that sends a page to a remote machine, or gets one from a remote machine. It handles details like fragmentation and sequence numbering.

DSM Machine: In the composite DSM Machine (Figure 4.5), the suffix Q indicates that in the composite state, the DSM-NET state is Quiescent, and the suffix N indicates that the VM state is Null. We implement transitions to and from RemoteQ using the networking machine. MappedQ and UnmappedQ inherit behavior from the DSM-VM machine.

Methods of the composite machine reuse behavior defined in methods of component machines by initializing StateMaps so that the states returned from component methods are actually composite states. For example, in the DSM-NET machine, Quiescent::herePage returns the NET state Send, but when invoked from the composite state MappedQ, it returns the composite state SendN. In turn, ackPage, when invoked from SendN, returns RemoteQ instead of Quiescent. RemoteQ later gets used as a VM state.

4.5.5 Implementing Remote Interactions

The logic of consistency protocols for distributed shared memory is implemented using state machines. But in addition to the logic, the virtual memory system has to interact with different machines.

Consider the simple consistency protocol depicted in Figure 4.5. When there is a pagefault, the pagefault operation visits objects like Domain, MemoryObject, MemoryObjectCache, and eventually invokes pageFault on some state object. If the page resides on a remote machine, Remote::pageFault is invoked.
It must determine the remote machine that has the page (and is in state DMapped), contact it with the identifiers for the virtual page (i.e., memory object and offset within the object), retrieve the data, and update the local virtual memory data structures. It must suspend processing while waiting for the data, and proceed after the page is retrieved across the network.

By virtue of the internal structure of our system, all the information necessary to locate the remote page, retrieve data, and operate on the local data structures is contained in the Parameter object associated with the operation. Therefore, we can implement the remote interaction by transmitting the parameter object to the remote machine. We ensure that the reply also contains the parameter object, together with the contents of the page, so that we can simply continue processing when the reply arrives. Thus, remote interactions fit in smoothly with our basic structure. The key design decisions are examined below in greater detail.

4.5.5.1 Continuations

First we show how to use continuations for efficient remote interactions.

Context: Consider the usual way of implementing the execution of pagefaults in a distributed shared memory system. When the user process faults, a thread in the kernel begins pagefault processing by executing the methods of the OpPageFault. Eventually, the operation requires page contents that are on a remote system. The thread initiates a remote request and blocks waiting for the reply. On the remote side, some server thread picks up the request. That thread must retrieve the page contents, either from memory or from disk if the page has been paged out. There may be other operations demanded by the memory consistency protocol.

Problem: An operating system that supports networking, distributed file systems, and distributed shared memory performs considerable concurrent processing and may require many threads. But threads are operating system resources managed by the kernel.
They contain the execution stack and data for schedulers, and tie up slots in kernel data structures. Therefore threads are too expensive to be left waiting for activities to complete. Furthermore, suppose there are concurrent pagefaults such that pagefault processing can be "batched" together (for instance, faults on adjacent pages from different processes). If a thread is dedicated to each operation, the threads for the two operations will block separately, and batching could only be implemented in some ad hoc fashion at some network layer. The issue is that the system has knowledge about memory operations (as opposed to generic threads), so that operations can be intelligently scheduled in ways different from generic thread scheduling. To exploit this knowledge, we should dissociate threads from operations.

Solution: Instead of threads, use Parameter objects to implement continuations. A continuation [AS96] at any point of execution describes how to continue the processing. It is an object that contains all the necessary data and a pointer to the code that continues the operation. In our architecture, Parameter objects contain all the data necessary for a virtual memory operation. We add one more parameter to use it as a continuation.

Consider an operation like pagefault that starts on one machine, waits until a remote reply is received, and continues processing. We divide the operation into two methods, one for use prior to page fetching, another for use after the page is received. The Parameter object for the operation has a variable that points to the second method. When the first method is completed, processing can be continued given just the Parameter object.

When a thread T that executes the operation completes the first method, it adds the Parameter object to a queue, schedules a remote request, and may then pick up any other task. When enough replies are received, the networking driver (or a thread dedicated to reply processing) will schedule a thread U to pick up where T left off.
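The split of an operation into two methods, with the Parameter object carrying the pointer to the second, can be sketched as follows. The function names startPageFault, finishPageFault, and onReply, the field names, and the first-in-first-out matching of replies to requests are all assumptions for illustration; the text only requires that the Parameter object hold the data and a pointer to the continuing code.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct VMParameter;
using Continuation = void (*)(VMParameter&);

// Parameter object doubling as a continuation: data plus "what to run next".
struct VMParameter {
    int vpage = 0;
    std::vector<char> contents;   // filled in when the reply arrives
    bool mapped = false;
    Continuation next = nullptr;  // second half of the operation
};

// Second method: runs after the page has arrived.
void finishPageFault(VMParameter& p) {
    assert(!p.contents.empty());
    p.mapped = true;  // stands in for updating local VM data structures
}

// First method: record how to continue, park the parameter object, and
// return, so the calling thread is free to pick up other work.
void startPageFault(VMParameter& p, std::deque<VMParameter*>& pending) {
    p.next = &finishPageFault;
    pending.push_back(&p);
    // A real system would send the remote request here.
}

// Run by whichever thread handles the reply; it needs only the
// parameter object to resume the operation.
void onReply(std::deque<VMParameter*>& pending, const std::vector<char>& data) {
    if (pending.empty()) return;
    VMParameter* p = pending.front();
    pending.pop_front();
    p->contents = data;
    p->next(*p);
}
```

Since the pending queue holds plain objects rather than blocked threads, a scheduler could inspect it to batch or drop operations, which is the motivation given above.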
A similar scheme can be used on the remote machine to process incoming requests.

Consequences: Continuations do not consume slots in kernel structures, so they are cheap to create and destroy. As fewer threads are used, scheduling and context switching overhead is reduced. Based on the data in the Parameter object, operations can be batched or redundant operations eliminated. A drawback is that all operations have to be divided into artificial methods, unless the implementation language supports the creation and use of continuations.

Next we consider message demultiplexing.

4.5.5.2 Active Messages

Efficient message demultiplexing is achieved by using Parameter objects as active messages [vECGS92].

Context: A message arrives through network drivers as a chunk of uninterpreted data. The receiver has to interpret the data, and take any action requested. In the case of distributed shared memory, the messages are typed and include the address space and memory object identifiers. The types indicate the desired action, and the object identifiers indicate the data structures that are affected. When the remote request is complete, the originator gets a reply. The reply also contains identifiers that allow it to be matched with the request.

Problem: The usual way of interpreting a message uses some form of table lookup to get at the action requested by the message and the data structures to be affected. Table lookups can be expensive, especially because the tables are often protected by semaphores that serialize concurrent accesses.

Solution: We can avoid interpretation by directly embedding the pointers to methods and objects in the message. For example, instead of having to interpret message types and decide the action, the program counter for the action code can reside in the message.
If the networked computers are of the same type, and the action is described by kernel code that always resides at the same address, then the receiving thread can immediately jump to that address. If the action and data structure addresses differ, they can be determined at some prior time (e.g., as part of setup). Thus, Parameter objects that contain the continuation information can also serve as messages.

A message received from the network is usually in a data format resulting from the serialization of data into a sequence of bytes. If the machine formats are different from the network format, the data has to be interpreted. We can reduce the interpretation overhead by wrapping the uninterpreted data in an object that has the same interface as a Parameter, but interprets the raw data on demand, when an Operation queries or sets parameters.

Consequences: When many messages arrive at the network device driver, it has to determine the recipient. Active messages hasten this demultiplexing.

4.5.6 Dynamic Page Distribution

In the conventional architecture, operating system pages are distributed among various subsystems and MemoryObjectCaches. Some pages are permanently allocated: for instance, pages for kernel data structures. Other subsystems, such as the process system or the file system, may request and return pages dynamically. But most pages are dynamically allocated by MemoryObjectCaches from a Store that manages physical pages. A pageout daemon watches the Store to detect excess allocation and periodically visits MemoryObjectCaches to preempt pages. MemoryObjectCaches define policies for selecting the least desirable pages, which are given up when requested by the pageout daemon. But page preemption can be expensive, because it may conflict with pagein. Similarly, dynamic page allocation can be expensive in systems like network drivers, where it is undesirable for the driver to pause instead of delivering traffic.
Resource management overhead can be reduced by spreading it over regular computations. The Resource Exchanger [SC96] is used in our architecture to make resource management more efficient.

Context : There are dynamically allocated resources, like pages, used by different allocators in an operating system. Which particular resource an allocator holds is not important; only the quantity matters. The resources can be preempted from an allocator if necessary. We want to avoid unfair distribution of resources among allocators; at the same time, some allocators may have a greater claim than others. The resources should be distributed according to need.

Problem : A common approach to managing preemptable dynamic resources is to run a daemon process that reclaims them from the allocators. However, this means that an allocator that needs a resource may have to wait while preemption is occurring. Also, preemption may cause allocators to suspend operations until preemption is completed. Such pauses are often unacceptable.

Solution : We interleave allocation and deallocation of resources, so that the allocator is not drained of resources unless absolutely necessary. For example, consider a network driver that uses pages to receive messages. After a message is received, the driver must hand the page over to some server for processing: for instance, assembling fragments. If the driver's page pool drains while such pages are in use, the driver may need to allocate new pages, wasting time needed for communication. We can avoid this if, instead of giving up the page to the server, the driver exchanges a page with the server. Thus, allocation and deallocation are interleaved. This means that the server needs to preallocate at least one page. Multiple servers may interact with the driver in this manner. Servers maintain their own pools of pages ready to exchange with the driver. If a server expects bursty traffic, it preallocates pages.
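The page exchange between a driver and a server can be sketched in C++ as follows. This is an illustrative sketch under stated assumptions, not the Resource Exchanger implementation from [SC96]: the classes `Page`, `Server`, and `Driver`, their member names, and the choice to model credit as a simple cap on pages in use are all hypothetical. The cap anticipates the credit rule that the text goes on to describe.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Page { int id; };

// Hypothetical server with a preallocated pool of empty pages and a credit
// limit on how many filled pages it may hold at once.
class Server {
public:
    Server(std::size_t credit, std::vector<Page> spare)
        : credit_(credit), spare_(std::move(spare)) {}

    // Take the filled page and put an empty page in its place.
    // Fails if the server has no spare page or has exhausted its credit.
    bool exchange(Page& page) {
        if (spare_.empty() || inUse_.size() >= credit_) return false;
        inUse_.push_back(page);   // server keeps the filled page
        page = spare_.back();     // driver receives an empty page in return
        spare_.pop_back();
        return true;
    }

    std::size_t pagesInUse() const { return inUse_.size(); }

private:
    std::size_t credit_;
    std::vector<Page> spare_;
    std::vector<Page> inUse_;
};

// Hypothetical driver whose receive pool never shrinks: every page handed
// to a server is replaced in the same step.
class Driver {
public:
    explicit Driver(std::vector<Page> pool) : pool_(std::move(pool)) {}

    // Hand the most recently filled page to the server.
    // Returns false when the packet is dropped; the page stays in the pool.
    bool deliver(Server& server) {
        assert(!pool_.empty());
        return server.exchange(pool_.back());
    }

    std::size_t poolSize() const { return pool_.size(); }

private:
    std::vector<Page> pool_;
};
```

Because `exchange` either swaps a page or refuses, the driver's pool size is invariant, so the driver never has to stop and allocate. A refused exchange corresponds to dropping the packet for a server that has run out of credit or buffers.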
The number of pages given to a server depends on its credit with the memory system. The credit may be preassigned (for example, a video server would have greater credit than an audio server), or it may vary. If a server runs out of credit and buffers, then the driver drops that server's packets, throttling resource hogs.

MemoryObjectCaches also use a similar scheme. Every cache has a credit for a number of pages. During page-fault processing, a cache that is approaching its credit limit picks pages that can be returned. If the contents of a page need not be saved to backing store, the cache may reuse the page internally; otherwise, the page is returned to the Store. Any deficit in pages is made up by allocating from the Store.

Consequences : Resource exchange reduces the need for resource preemption. It also reduces the time an allocator may have to wait for a resource. But resource exchange requires enough resources that every allocator holds at least one resource to exchange. Otherwise, we must fall back on a standard preemption scheme.

4.6 Summary

This chapter described a new architecture for building virtual memory systems. The architecture has the following primary attributes.

It separates data structures from operations over data structures. Basic objects of a virtual memory system implement various types of tables. Typical operations manipulate the objects in groups. Expressing these manipulations in a centralized manner makes them easier to understand. By the same token, it becomes easier to evaluate the effects of new operations and data structures.

It reifies operations and operation parameters into objects. This makes the operations and their effects explicit. Also, the reified objects can be used as continuations to program remote interactions without overuse of threads. Parameter objects also serve as active messages for fast remote interactions.
The operation objects make the order of basic object invocation explicit, so that it is easier to verify the correctness of concurrency control. Moreover, parameter objects can be used as concurrency control tokens.

It uses object-oriented state machines to program the logic of the operations. Therefore, code for new operations can be incrementally added by inheritance and composition, dramatically improving code reuse. Structuring operations as state machines also makes the logic easier to understand.

It improves resource management by spreading resource allocation and deallocation over regular computations. As a result, the need for preemption is reduced.

Experience with the Choices system has proven the worth of the architecture. We began with a virtual memory system without copy-on-write support and with only rudimentary distributed shared memory. The system built with the new architecture, which adds copy-on-write support and different consistency protocols, was 30% smaller without loss of performance.

Chapter 5

Conclusion

This thesis is concerned with the development of a theoretical basis and a practical architecture for building distributed systems. Our example has been the development of distributed shared memory protocols. We have developed a new protocol design method, novel distributed shared memory protocols, and a flexible architecture for object-oriented systems that support concurrent operations on groups of objects and interact with remote systems. The architecture has been used to implement a virtual memory system that supports distributed shared memory.

5.1 Summary

In Chapter 2, we presented a method for synthesizing process coordination protocols. Our method shows how to structure the design trajectory for protocols. We begin with high-level protocols that use abstract communication operators. These protocols are easy to analyze, so system-wide verification is conducted at this level. The next step is to develop implementations of the abstract communication operators.
These implementations are developed in a notation that hides the details of communication media, but allows the designer to express how a process can control the execution of another process. We presented conditions that these operator implementations must obey so that they can be composed to implement the full protocol. Because of these conditions, the protocol implementation is guaranteed to replicate the behaviors specified in the original specification. The last step is to translate the second-level implementations into a formalism that can be easily implemented using shared memory or message passing programs. The form of the third-level implementations ensures that their composition also implements the original specifications correctly.

At each level, the protocols and subprotocols we encounter have small state spaces. The original specification is succinct due to its abstractness, while the successive steps look only at parts of the original protocol. As a result, verification tools that use exhaustive search are effective in validating the protocols. The implementations of the abstract communication operators constitute a standard library that can be used in future protocol designs. Thus, we have developed the basis for an effective method for synthesizing protocols.

In Chapter 3, we developed consistency protocols that implement an efficient distributed shared memory for computers connected with wide-area interconnects. We showed that communication related to synchronization makes it difficult to use distributed shared memory when the communication latency is high. This is because processes contend with one another for access to synchronization data structures. We can reduce this contention if we can anticipate requests by processes for the data computed within a synchronization construct. The performance results showed that this approach results in good performance over wide-area networks.
We also showed how our design method can help guide the development of the protocols by analyzing protocols at various levels of detail.

In Chapter 4, we presented a software architecture used to develop a virtual memory system that supports our distributed shared memory protocols. But the architecture is considerably more general, in that it can be applied wherever object-oriented systems involve concurrent operations over groups of objects. We showed how such systems can be designed so that the objects, operations, and synchronization aspects can be separated. This separation means that adding new objects and new operations is easier, because the relationships between objects and the concurrency control code are explicit. We demonstrated how the operations can be extended smoothly using continuations so that they can affect objects on remote machines. Another feature of the architecture is the use of object-oriented state machines that allow complex, state-based logic to be structured to increase reuse and permit systematic extensions.

In summary, this research has resulted in improvements in protocol design and implementation techniques. We have also shown that distributed shared memory can be useful over wide-area networks.

5.2 Future Research

In this thesis, we have barely begun the development of the protocol synthesis method. Future research is needed to extend the power of our specification language, and experience is needed to determine the evolution of our notations. Verification tools have to be adapted so that our models can be analyzed. A library of standard replacements also needs to be developed. Another direction is to adapt our approach to notations like LOTOS.

Distributed shared memory consistency protocols might prove to be useful for maintaining consistency over Web documents and other distributed data. A standard library for distributed shared memory can be developed, much like the MPI and PVM message passing libraries.
We believe that the software architecture we have proposed is useful for applications such as workflow. The principles developed for the architecture, like object-oriented state machines, the systematic use of first-class representations for operations, and the use of continuations, can be applied to improve operating system design.

Bibliography

[AF92] Hagit Attiya and Roy Friedman. A correctness condition for high-performance multiprocessors. In Proceedings of the 24th ACM Symposium on the Theory of Computing, pages 679-690, 1992.
[AHJ91] Mustaque Ahamad, Phillip W. Hutto, and Ranjit John. Implementing and programming causal distributed shared memory. In Proceedings of the 11th International Conference on Distributed Computing Systems, pages 274-281, May 1991.
[AHN+93] Mustaque Ahamad, Phillip W. Hutto, Gil Neiger, James E. Burns, and Prince Kohli. Causal memory: Definitions, implementation and programming. Technical Report GIT-CC-93/55, Georgia Institute of Technology, 1993.
[Alp86] Bowen Alpern. Proving Temporal Properties of Concurrent Programs: A Non-Temporal Approach. PhD thesis, Cornell University, February 1986.
[And90] Thomas E. Anderson. The performance of spin-lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, January 1990.
[AS96] Harold Abelson and Gerald Jay Sussman. Structure and Interpretation of Computer Programs. M.I.T. Press, Cambridge, Mass, 1996.
[BvdLV95] Tommaso Bolognesi, Jeroen van de Lagemaat, and Chris Vissers. LOTOSphere: software development with LOTOS. Kluwer Academic Publishers, 1995.
[BZ83] Daniel Brand and Pitro Zafiropulo. On communicating finite-state machines. Journal of the ACM, 30(2):323-342, April 1983.
[BZS93] Brian N. Bershad, Matthew Zekauskas, and Wayne A. Sawdon. The Midway distributed shared memory system. In IEEE Computer Society International Conference, pages 528-537, 1993.
[Cam74] R. H. Campbell. The specification of process synchronization by path expressions.
In Lecture Notes in Computer Science, pages 89-102, 1974.
[Cam76] Roy Harold Campbell. Path Expressions: A technique for specifying process synchronization. PhD thesis, University of Newcastle Upon Tyne, August 1976.
[CBZ95] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Techniques for reducing consistency-related communication in distributed shared memory systems. ACM Transactions on Computer Systems, 1995. To appear.
[CES86] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems, 8(2):244-263, April 1986.
[CM86] K. M. Chandy and J. Misra. How processes learn. Distributed Computing, 1:40-52, 1986.
[DCM+90] Partha Dasgupta, R. C. Chen, S. Menon, M. Pearson, R. Ananthnarayanan, M. Ahamad, R. Leblanc, W. Applebe, J. M. Bernabeu-Auban, P. W. Hutto, M. Y. A. Khalidi, and C. J. Wilenkloh. The design and implementation of the Clouds distributed operating system. Computing Systems Journal, Winter 1990.
[Dil96] David L. Dill. The Murφ verification system. In 8th International Conference on Computer Aided Verification, pages 390-393, July/August 1996.
[DKCZ93] Sandhya Dwarkadas, Pete Keleher, Alan L. Cox, and Willy Zwaenepoel. Evaluation of release consistent software distributed shared memory on emerging network technology. In Proceedings of the 20th International Symposium on Computer Architecture, 1993.
[DSB86a] Michael Dubois, Christoph Scheurich, and Faye Briggs. Memory access dependencies in shared-memory multiprocessors. In International Symposium on Computer Architecture, pages 434-442, May 1986.
[DSB86b] Michael Dubois, Christoph Scheurich, and Faye Briggs. Memory access dependencies in shared-memory multiprocessors. In International Symposium on Computer Architecture, pages 434-442, May 1986.
[FHMV95] Ronald Fagin, Joseph Halpern, Yoram Moses, and Moshe Vardi. Knowledge-based programs.
In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, pages 129-143. Association for Computing Machinery, ACM Press, 1995.
[FLP85] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty processor. Journal of the ACM, 32(2):374-382, April 1985.
[FLR+94] Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, Iannis Schoinas, Mark D. Hill, James R. Larus, Anne Rogers, and David A. Wood. Application-specific protocols for user-level shared memory. In Supercomputing '94, 1994.
[FP89] Brett Fleisch and Gerald Popek. Mirage: A coherent distributed shared memory design. In ACM Symposium on Operating System Principles, pages 211-223, 1989.
[Gab87] Dov Gabbay. Modal and temporal logic programming. In Antony Galton, editor, Temporal Logics and Their Applications, chapter 6, pages 197-237. Academic Press, New York, 1987.
[GH85] M. G. Gouda and J. Y. Han. Protocol validation by fair progress state exploration. Computer Networks and ISDN Systems, 9:353-361, 1985.
[GHJV93] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design patterns: Abstraction and reuse of object-oriented design. In Proceedings of the European Conference on Object-Oriented Programming, number 707 in Lecture Notes in Computer Science, pages 406-431. Springer-Verlag, New York, 1993.
[GHP92] P. Godefroid, G. J. Holzmann, and D. Pirottin. State space caching revisited. In Proc. 4th Computer Aided Verification Workshop, Montreal, Canada, June 1992. Also in: Formal Methods in System Design, Kluwer, Nov. 1995, 1-15.
[GLL+90] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared memory multiprocessors. In Proceedings of the 17th International Symposium on Computer Architecture, 1990.
[GW94] Patrice Godefroid and Pierre Wolper. A partial approach to model checking. Information and Computation, 110(2):305-326, May 1994.
[GY84] M. G. Gouda and Y. T. Yu.
Protocol validation by maximal progress exploration. IEEE Transactions on Communications, COM-32(1):94-97, 1984.
[HFM88] D. Hensgen, R. Finkel, and Udi Manber. Two algorithms for barrier synchronization. International Journal of Parallel Programming, January 1988.
[HM90] Joseph Y. Halpern and Yoram Moses. Knowledge and common knowledge in a distributed environment. Journal of the ACM, 37(3):549-587, July 1990. Also in Proceedings of the 4th ACM Symposium on Principles of Distributed Computing (1984).
[Hol91] Gerard J. Holzmann. Design and validation of computer protocols. Prentice Hall, Englewood Cliffs, New Jersey, 1991.
[HU79] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, Massachusetts, 1979.
[Hu95] Alan John Hu. Techniques for Efficient Formal Verification Using Binary Decision Diagrams. PhD thesis, Stanford University, December 1995.
[HZ87] Joseph Y. Halpern and Lenore D. Zuck. A little knowledge goes a long way: Simple knowledge-based derivations and correctness proofs for a family of protocols. In ACM Symposium on Principles of Distributed Computing, pages 269-280. ACM, 1987.
[Ip96] Chung-Wah Norris Ip. State Reduction Methods for Automatic Formal Verification. PhD thesis, Stanford University, December 1996.
[JA94] Ranjit John and Mustaque Ahamad. Evaluation of causal distributed shared memory for data-race-free programs. Technical Report GIT-CC-94/34, Georgia Institute of Technology, 1994.
[JKW95] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL: High-performance all-software distributed shared memory. In ACM Symposium on Operating System Principles, 1995.
[JR91] Ralph E. Johnson and Vincent F. Russo. Reusing object-oriented designs. Technical Report UIUCDCS-91-1696, University of Illinois at Urbana-Champaign, May 1991.
[KHvB92] Christian Kant, Teruo Higashino, and Gregor von Bochmann.
Deriving protocol specifications from service specification written in LOTOS. Technical Report 805, Universite de Montreal, January 1992.
[KN93] Yousef A. Khalidi and Michael N. Nelson. The Spring virtual memory system. Technical Report TR-93-9, Sun Microsystems, February 1993.
[Kur94] Robert P. Kurshan. Computer-aided verification of coordinating processes: the automata-theoretic approach. Princeton University Press, 1994.
[LAA87] M. C. Loui and H. H. Abu-Amara. Memory requirements for agreement among unreliable asynchronous processes. Advances in Computing Research, 4:163-183, 1987.
[Lam79] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.
[Lam94] Leslie Lamport. The temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3):872-923, May 1994.
[LH89] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.
[Lim95] Swee Boon Lim. Adaptive Caching in a Distributed File System. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
[LM95] Hong Liu and Raymond E. Miller. Generalized fair reachability analysis for cyclic protocols with nondeterminism and internal transitions. Technical Report UMCP-CSD:CS-TR-3422, University of Maryland, College Park, February 1995.
[Lon93] David E. Long. Model Checking, Abstraction and Compositional Verification. PhD thesis, Carnegie Mellon University, July 1993.
[LS88] Richard J. Lipton and Jonathan S. Sandberg. PRAM: A scalable shared memory. Technical Report CS-TR-180-88, Princeton University, 1988.
[McM92] Ken McMillan. Symbolic Model Checking: An Approach to the State Explosion Problem. PhD thesis, Carnegie Mellon University, 1992.
[MF90] Ronald G. Minnich and David J. Farber. Reducing host load, network load and latency in a distributed shared memory.
In International Conference on Distributed Computing Systems, 1990.
[MP91] Zohar Manna and Amir Pnueli. The Temporal Logic of Reactive and Concurrent Systems, volume 1: Specification. Springer-Verlag, New York, 1991.
[MW84] Zohar Manna and Pierre Wolper. Synthesis of communicating processes from temporal logic specifications. ACM Transactions on Programming Languages and Systems, 6(1):68-93, January 1984.
[PD97] Fong Pong and Michel Dubois. Verification techniques for cache coherence protocols. ACM Computing Surveys, 29(1):82-126, March 1997.
[Pon95] Fong Pong. Symbolic State Model: A New Approach for the Verification of Cache Coherence Protocols. PhD thesis, University of Southern California, 1995.
[Pos81] Jon Postel. Transmission control protocol. Internet RFC 793, Sep 1981.
[PS91] Robert L. Probert and Kassim Saleh. Synthesis of communication protocols: Survey and assessment. IEEE Transactions on Computers, 40(4):468-476, April 1991.
[RJY+87] Richard Rashid, Avadis Tevanian Jr., Michael Young, David Golub, Robert Baron, David Black, William Bolosky, and Jonathan Chew. Machine-independent virtual memory management for paged uniprocessors and multiprocessor architectures. In Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems, pages 31-39, 1987.
[Rus91] Vincent Frank Russo. An Object-Oriented Operating System. PhD thesis, University of Illinois at Urbana-Champaign, 1991.
[SC95] Aamod Sane and Roy H. Campbell. Object-oriented state machines: Subclassing, composition, delegation and genericity. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA'95), pages 17-32, October 1995.
[SC96] Aamod Sane and Roy Campbell. Resource exchanger: A behavioral pattern for low overhead concurrent resource management. In Pattern Languages of Program Design. Addison-Wesley Publishing Company, Reading, Massachusetts, 1996. (To appear).
[SMC90] Aamod Sane, Ken MacGregor, and Roy Campbell. Distributed virtual memory consistency protocols: Design and performance. In Second IEEE Workshop on Experimental Distributed Systems, 1990.
[Str91] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley Publishing Company, Reading, Massachusetts, 2nd edition, 1991.
[Tan92] Andrew S. Tanenbaum. Modern Operating Systems. Prentice Hall, Englewood Cliffs, New Jersey, 1992.
[vECGS92] Thorsten von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation. In Proceedings of the 19th International Symposium on Computer Architecture, May 1992.
[VSvSB91] Chris A. Vissers, Giuseppe Scollo, Marten van Sinderen, and Ed Brinksma. Specification styles in distributed systems design and verification. Theoretical Computer Science, 89(1):179-206, October 1991.
[Wes78] Colin H. West. General technique for communications protocol validation. IBM Journal of Research and Development, 22(3):393-404, 1978.
[WG93] P. Wolper and P. Godefroid. Partial-order methods for temporal verification. In Proc. CONCUR '93, volume 715 of Lecture Notes in Computer Science, pages 233-246, Hildesheim, August 1993. Springer-Verlag.
[WL93] Pierre Wolper and Denis Leroy. Reliable hashing without collision detection. In 5th International Conference on Computer Aided Verification, number 697 in Lecture Notes in Computer Science, June 1993.