Effective Specification of Fault Tolerant Distributed

Effective Specification of Fault Tolerant Distributed
Software ∗
Andi Bejleri
Tzu-Chun Chen
Mohammad Qudeisat
TU Darmstadt
TU Darmstadt
Purdue University
bejleri@cs.tutzu-chun.chen@dsp.tu- mqudeisat@cs.purdue.edu
darmstadt.de
darmstadt.de
Lukasz Ziarek
Patrick Eugster
SUNY Buffalo
Purdue University,
TU Darmstadt
lziarek@buffalo.edu
p@cs.purdue.edu
ABSTRACT
1.
Distributed systems are plagued by partial failures, meaning
that certain components or interactions may fail, while others remain unaffected. If such failures are improperly handled in message-passing distributed systems, software components may get stuck waiting for messages that will never
arrive, or enter an inconsistent state after receiving inappropriate messages. Handling of partial failures is an intrinsic
part of many protocols, yet very involving as it depends on
many factors including their timing and nature, the affected
component(s), and of course application semantics.
In this paper, we present a novel specification technique
and verification framework for fault tolerant distributed software based on the theory of session types. More precisely we
contribute with a specification technique that models structured interactions and fine grained failure handling, a software framework that can be used to implement fault tolerant distributed systems in Java, and a static analysis that
checks conformance of corresponding software to their specification. Finally, a detailed empirical study, including a code
quality analysis comparing with straightforward approaches
attempting to achieve the same based on predating techniques without explicit support for failures, demonstrates
the usefulness, simplicity, and efficiency of our technique
and tools.
In message-passing distributed systems, software components communicate with each other through messages to accomplish a common task. The interaction structure (conversation) between components is not only defined by the
values of the messages exchanged, but also by their ordering, branching, and looping that model sequencing, different,
and repetitive behaviour respectively. This structure is commonly referred to as a protocol.
Distributed systems are prone to partial failures, which
means that some components or interactions may fail, while
other components must still respect certain invariants. Since
not all failures can be masked [11], most protocols in a typical “middleware stack” must explicitly deal with some failures. How to handle a partial failure depends on a variety of parameters including the affected component(s), the
timing and nature of the failure, the application semantics,
etc. This makes the design of correct and efficient protocols complex and error-prone. For example, notifying all
parties of a failure and conservatively aborting them tends
to hamper performance of distributed software by incurring
spurious communication, repeating abandoned interactions
(which could have continued despite certain failures), and
over-synchronising. Inversely, omitting the notification of a
single relevant component to a failure can hamper progress
(e.g., a component waits on a message that will never arrive), or lead to inconsistencies (e.g., subsets of interacting
parties engage in different protocol branches that are structurally compliant, but where messages containing different
values are expected).
To ease the burden of reasoning about distributed software, session types [25, 16] have been introduced. Session
types are a well-established type-theory to ensure deadlockfreedom and communication-safety in the context of process
calculi. They were proposed (1) to capture the interaction of
protocol participants in the presence of ordering, branching
and looping, and (2) to verify implementations of these participants. Bi-party session types [25] are the fore-runners
of multiparty session types [16]. The former capture the
communication structure between two participants, ensuring reciprocity of actions (for a “send” on one side, there is
a “receive” on the other). The latter captures the structure
of many participants from a global point of view, ensuring conformance of processes to a global type by projection.
Categories and Subject Descriptors
D.2.1 [Software Engineering]: Requirements/Specifications; D.2.4 [Software Engineering]: Software/Program
Verification; D.2.2 [Software Engineering]: Design Tools
and Techniques; D.3.1 [Theory of Computation]: Specifying and Verifying and Reasoning about Programs
∗Financially supported by ERC Consolidator Award
“Lightweight Verification of Software”.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.
INTRODUCTION
Subsequent works include theoretical studies of session types
in object-oriented core calculi [9, 13].
SJ [18] is the most prominent system that has integrated
the theory of session types for distributed, object-oriented
software. It consists of a mature 1 specification technique,
a static analysis, and a software system for client-server distributed systems. Subsequent work [3, 17] has demonstrated
the usefulness of integrating session types for client-server
distributed software in an object-oriented programming language like Java for parallel software systems. Despite this
strong work, SJ is centered on bi-party interactions which
makes real-life multi-party protocols hard to reason about
and in many cases impossible to express at all; this is particularly valid for reasoning about handling of (partial) failures,
which SJ does not support natively.
In this paper we introduce protocol specification and verification for distributed software in the presence of partial
failures. Our technique, based on session types, consists of
an effective specification technique, static analysis and software system. The specification language describes the protocol between components in a distributed middleware or end
application software, hence coining the term of our specification technique protocol types. The static analysis checks
the conformance of components with respect to the specification, and the software system allows for the implementation of efficient distributed components capable of handling partial failure. We build on SJ’s approach, extending
it with bi-directional asynchronous communication channels
per participant pair as advocated in the theory of Bettini et
al. [5] and used in many real-life protocols building on communication abstractions like TCP, and intuitive abstractions
for branching, looping, and failure handling among multiple
parties.
Specifically, the contributions of this paper are:
• A specification technique that reflects simplicity and
efficiency for modelling structured interactions and finegrained handling of both environment-induced failures
and application-level failures (exceptions).
• A system for implementing fault-tolerant distributed
software featuring asynchronous message passing, primitives for structured interactions and partial failure,
and a static analysis for checking the conformance of
that software to protocol specification.
• A specification and software quality analysis showing
the benefits of our specification design and software
system compared to a variant of SJ. The results show
that our approach greatly overcomes the complexity of
specifications in that variant in metrics such as lines
of code, levels of nesting and number of messages.
• A detailed empirical study that demonstrates the usefulness of our specification technique, showing that
a fault-tolerant software conforming a protocol type
specification retains asynchrony; i.e., it imposes less
unnecessary coordination among protocol components
and spurious communication than attempts to encode
failures with existing session type models.
In the remainder of the paper, Section 2 motivates our protocol type specification technique through a real-world fault1
The SJ framework has been maintained for more than five
years.
Service
Provider
Discovery
Service
[2] [6]
[5] [3]
[4]
[1]
Identity
Provider
[7] [9]
Client
[8]
Figure 1: Shibboleth discovery service protocol.
tolerant distributed program. The design of our protocol
type technique is introduced in Section 3, including a transformation tool of specifications. Section 4 and 5 present
respectively the software system and the static analysis. Report on empirical studies and analysis of code quality is given
in Section 6. Section 7 surveys related work and Section 8
concludes and outlines future work.
2.
PROTOCOL SPECIFICATION
This section provides a motivating example and gives an
informal introduction to our protocol specification technique.
2.1
Shibboleth
For illustration we choose the discovery service protocol
(DSP) of Shibboleth [1]— a derivative of the Kerberos [21]
protocol. Shibboleth is among the world’s most widely deployed federated identity solutions, connecting users to applications both within and between organisations. Shibboleth includes components acting in the roles of Identity
Provider (IdP), Service Provider (SP), and Discovery Service (DS) in addition to clients. IdPs provide user information, SPs access to secure content based on that information,
and DS a mechanism to seamlessly add or remove services
and SPs. The discovery service protocol (DSP) between a
client and DS involves communication between all components of the Shibboleth system, as described in Figure 1. We
focus on one step of the protocol (arrow [1]) to demonstrate
the benefits of our technique.
2.2
Shibboleth resource request
The resource request interactions consist of the client authenticating with the single sign-on service and redirected
to an SP to access a resource. This is modelled as the client
issuing a request to the SP in the form of an HttpRequest
and waiting for a corresponding HttpResponse. Consider the
specification of this part of the protocol in SJ
protocol ShibbolethResourceRequest {
p a r t i c i p a n t s : C l i e n t , SP
C l i e n t = !< H t t p R e q u e s t > . ? ( H t t p R e s p o n s e )
SP = ? ( H t t p R e q u e s t ) . ! < H t t p R e s p o n s e >
}
where !<T> specifies the sending of a value of type T and
?(T) the receiving of a value of type T. An overview of SJ’s
specification constructs is given in Table 1.
Table 1: Constructs of SJ’s specification language
Construct
Purpose
!<T>
Send a message of type T
Receive a message of type T
![...]*
Loop and send termination condition
?[...]*
Loop and receive termination condition
!{l1:...,...,ln:...}
Select and send one of the labels l1 - ln
?{l1 :...,..., ln :...} Branch and receive one of the labels
l1 - ln
?(T)
Notice that this specification uses implicit channels, i.e.,
the sending construct does not specify the receiver or the
receiving the sender. Specifications of this technique are
safe. This means that the interactive structure of each component’s specification is reciprocal. In addition, the specification of a component expresses sequencing between its
actions, sufficient in simple client-server protocols. Unfortunately, they cannot express sequencing of communications,
branching, and looping between several components (see [16]
for a formal argumentation) as required in many real-life
distributed systems. It is important to note that loops and
branches are always controlled by one component (evaluating the corresponding condition); so those involving three (or
more) participants can not be modelled by three (or more)
pair-wise loops and branches respectively. Not all loops or
branches will be controlled by the same party.
Pair-wise channels such as TCP are the abstractions most
commonly found real-world distributed software. Thus, our
protocol type specification supports pair-wise channels, borrowing ideas from multiparty session types, as A−>B: <T>
specifying A sends to B a message of type T, where A and B
denote identities of components. Below, we give the specification of the above protocol in our protocol type language.
protocol ShibbolethResourceRequest {
p a r t i c i p a n t s : C l i e n t , SP
// 1 . c l i e n t r e q u e s t s r e s o u r c e
C l i e n t −> SP : <H t t p R e q u e s t >
// 2 . r e s p o n s e i f a u t h o r i z e d
SP −> C l i e n t : <H t t p R e s p o n s e >
}
This specification expresses explicitly the order of communication between the two components and provides a clear and
readable protocol. Furthermore, as we will see later in this
section, our protocol type specifications support multi-party
looping: A [...]∗ , where ... constitutes the loop body which
is controlled by participant A, and a multi-party branching
construct A:{l1 :...,..., ln :...} where A controls which path
li of the conversation to follow.
2.3
Failure handling
In the simple Shibboleth resource request protocol above,
two inherent failure scenarios can occur:
1. Although the client is authenticated, it might not be
authorised to access the requested resource.
2. The client’s authentication ticket may have expired,
and thus the client must request a new ticket in order
to gain access to resources at the SP.
In the following we show the difficulties of modelling such
scenarios even in a variant of the SJ specification language
including the multi-party looping and branching constructs
mentioned above – referred to as SJ? in the following – calling for specific failure-handling abstractions. The first scenario can be modelled as a special value in the HttpResponse,
however hiding the important distinction between success
and failure. Another alternative is to have a branch guarded
by SP (SP: {...} ), specifying the three different scenarios
explicitly: (a) the regular scenario (NoFailure branch) sends
the response; (b) the AuthorisationFailure aborts the attempt
(see 1 above); (c) the ExpiredFailure specifies the second failure scenario (see 2). In the third branch, the client communicates with IdP to obtain a new ticket. Finally, the
client indicates to SP whether it wants to retry the request
by controlling the loop ( Client : [...]∗ ) around the protocol.
Retrying also applies to (b), where the client might want
to request a different resource instead. Below, we give the
specification in SJ? .
1 protocol RobustShibbolethResourceRequest {
2
p a r t i c i p a n t s : C l i e n t , SP , IdP
3
// l o o p f o r r e t r y i n g , i f f a i l e d
4
Client : [
5
// c l i e n t r e q u e s t s r e s o u r c e
6
C l i e n t −> SP : <H t t p R e q u e s t >
7
// b r a n c h g u a r d e d by SP
8
SP : {
9
// d e s i r e d
10
N o F a i l u r e : SP −> C l i e n t : <H t t p R e s p o n s e >,
11
// u n a u t h o r i z e d
12
AuthorizationFailure : ,
13
// t i c k e t e x p i r e d
14
ExpiredFailure :
15
// r e n e w r e q u e s t
16
C l i e n t −> IdP : <H t t p R e q u e s t >
17
// new t i c k e t
18
IdP −> C l i e n t : <H t t p R e s p o n s e >
19
}
20
]∗
21 }
This simple example illustrates several shortcomings of
specifying failure handling with conventional branching and
looping constructs, a technique henceforth referred to as
FSJ? (f ailure handling with SJ ?). First, adding failure
branches adds substantial complexity and can lead to invalid paths. For instance, retrying a failed communication
is specified as a loop around the entire session, allowing the
failure-free path (NoFailure) to be repeated despite success.
This breaks the original protocol, which ends upon success.
An implementation that has such a spurious path will unfortunately pass verification. Second, efficiency has to be
considered for a specification, since the resulting software
contains corresponding communication actions. For example, in the failure-free run, which can be assumed to be the
common case, there exists duplicated and redundant communication. Once the session is initiated, the client controlling the loop sends a message to all components involved
(Line 4), notifying the termination condition. In addition,
there are two messages — the branch label NoFailure (Line 9)
as well as the actual response (Line 9) — which have to be
transmitted between the SP and the client. This can be
addressed by having the SP send a HttpResponse, expressing
also the failure notification as a special value, and the client
guarding the loop. We give this specification below:
C l i e n t −> SP : <H t t p R e q u e s t >
SP −> C l i e n t : <H t t p R e s p o n s e >
Client : {
// b r a n c h g u a r d e d by c l i e n t
NoFailure :
// no more m e s s a g e s n e e d e d
AuthorizationFailure :
...
}
This replace Lines 5-19 in specification in Listing 2.3. Besides losing clarity, efficiency is similarly hampered because
the SP waits for an additional branch label NoFailure in the
failure-free run to be sent by the client to proceed.
Our protocol type specification addresses failures that can
arise during message exchanges via explicit failure handlers.
Branches and loops do not have associated failures as those
are captured by communication within them. We use a
notation that is similar to exception handling features in
mainstream programming languages, like Java and C++.
try ... handle specifies a scope of failure handling, and several
handle clauses can be attached to the same try to describe
different types of failures.2 Below, we give the protocol in
our language:
protocol RobustShibbolethResourceRequest {
p a r t i c i p a n t s : C l i e n t , SP , IdP
try {
// c l i e n t r e q u e s t s r e s o u r c e
C l i e n t −> SP : <H t t p R e q u e s t >
SP −> C l i e n t : <H t t p R e s p o n s e >
| AuthorizationFailure | ExpiredFailure
} handle ( A u t h o r i z a t i o n F a i l u r e ) {
// m i g h t r e q u e s t d i f f e r e n t r e s o u r c e
Client : retry
} handle ( E x p i r e d F a i l u r e ) {
// t i c k e t e x p i r e d
C l i e n t −> IdP : <H t t p R e q u e s t >
// r e n e w r e q u e s t
IdP −> C l i e n t : <H t t p R e s p o n s e >
// new t i c k e t
Client : retry
}
}
The SP can now immediately indicate a failure of either type
AuthorisationFailure or ExpiredFailure — corresponding to the
two failure scenarios — as alternatives ( | ) to sending a response. As illustrated, the same message can yield different
types of failures. In the case of a failure the execution moves
forward to the corresponding handler (handle (...) ). In contrast to the two FSJ? models, the intent is clear and there is
no further need for communication between the client and
the SP in the failure-free path. Similarly, the block terminates as it should and is not artificially constrained to be
within a loop with the hidden invariant that the loop does
not repeat upon success.
In many distributed protocols, retrying the protocol, or
relevant sub-protocols, is a necessary recovery mechanism.
In protocol types, retrying is a first-class construct. The notation A: retry denotes the repetition of the next enclosing
try {...} body, where A makes the decision. Making repetition a choice rather than forcing it allows protocol type
specifications to avoid endless repetition. Assigning the duty
of choosing the “retry path” to a particular participant is
important for the static analysis to check software’s conformance to protocol specification.
2
We avoid the keyword catch to denote handlers to avoid
confusion with the stronger (synchronous) semantics for exceptions in languages like C++ or Java.
Table 2: Constructs or protocol type specification
language
A−>B: <T> | F1 | ... | Fn Message exchange or failure
A: [...]∗
Looping
A: {l1 :..., ..., ln :...}
Branching
try {...} handle(F1) {...} Failure handling
...
handle(Fn) {...}
A: retry
3.
Retry
PROTOCOL TYPE DESIGN
This section describes in more details our specification
language, framework, and semantics.
3.1
Challenges
Our approach provides a specification language to express
distributed protocols in the presence of failure: “environmentinduced” failures (e.g., host or communication failures) as
well as application-defined failures (“exceptions”). The key
challenges in the design of our technique are:
Simplicity. We are interested in few intuitive features
that capture a broad range of scenarios, and simplify
global reasoning for the programmer in the presence of
partial failures.
Efficiency. Our technique should not over-constrain and
hamper performance of software. In particular, the software must retain the potential for asynchronous execution
rather than coupling components by coordination behind
the scenes.
The biggest challenge comes from the seeming conflict between these two requirements. For instance, a technique
which allows to reason in terms of atomic protocol blocks
and performs automatic rollbacks upon failures in a transactional style would probably be easier to use for a programmer. However, such features are hard to implement efficiently at large scale and thus can hamper the asynchrony
underlying many distributed systems. We break our discussion on the protocol type design into (a) the presentation of
the constructs, (b) the design of a transformation tool that
deals with two important patterns: failure notification, also
in the presence of nesting, and (c) the semantics of retrying
in case of failure.
3.2
Constructs
Table 2 provides an overview of the constructs for our protocol type specification language. A, B range over identities
of components, T over types of messages exchanged and F
(as well as with subscript) over failure types. The construct
A−>B: <T> | F1 | ... | Fn specifies a message exchange between A and B of type T or that A may throw an exception of
type F1, ..., Fn. As mentioned, looping A: [...]∗ specifies a
repetitive behaviour, where the termination condition is controlled by A, and A: {l1 :..., ..., ln :...} describes branching
of a conversation where A selects a label li and conversation
follows that path. The construct try {...} handle(F1) {...} ...
handle(Fn) {...} specifies an interaction within a try block
that can raise a failure F1, ..., Fn, handled respectively in
one of the handle blocks. The retry construct describes a
repetition of a surrounding try block.
3.3
Transformation
To not hamper asynchrony and efficiency of software, the
user-defined protocol type specifications are transformed into
specifications having failure notifications following the flow
of communication.
3.3.1
Failure notification
One way to handle failure of a communication within a
try {...} is to abort all subsequent ones, meaning that all such
communications would have to be guarded by the presence
or not of failure. This would imply strong synchronisation at
runtime between recipients of failure-prone communications,
senders, and receivers of any follow-up communication. This
hampers asynchrony and efficiency in the software.
The transformation tool injects failure notifications messages only to components that are causally depending on the
failed communication in a given try ... handle block. Causal
dependence follows the usual definition of causal ordering of
events [19]: (a) a sending action from a participant causally
precedes any subsequent send by that participant, (b) a
sending action causally precedes its receiving action, and
(c) if we have send or receive actions e1 , e2 , e3 such that e1
precedes e2 and e2 precedes e3 , then e1 precedes e3 .
From a component P’s perspective, this means that if P is
on the receiving end of a causal chain of messages (m1 , ..., mn
s.t. ∀i ∈ [1..n]mi =Pi −>Pi+1 : ... , ∀i < j ∈ [1..n] mi occurs
before mj , and Pn+1 =P) within a try ... handle block that
can yield a failure (i.e., m1 =P1 −>P2 : ... | F) then P is
subject to being rolled forward to the corresponding handle
(F) handler. From the user perspective, this does not add
communication, as the failure notification or its absence, inherently, is propagated along the usual communication path
by the tool; i.e., in our approach, the failure notification is
sent directly to all targets.
Other components present in the given try ... handle block
can proceed with their remaining communication inside the
try {...} . Indeed, since their communications are not causally
depending on failure-prone interactions, neither their success nor failure depends on those. So, they can proceed up
to the point of synchronisation at the end of the try {...}
body. If no such synchronisation is desired, then, given the
independence of the said communication from the failureprone parts, the programmer can also move them outside of
the block. For illustration consider the following protocol
1 try {
2
A −> B : <T1> | F
3
C −> D : <T2>
4 } handle (F) {
5
... E ...
6 }
where the second message exchange from C to D at Line 3
does not depend on the first at Line 2. In the case of a failure
F occurring at Line 2, the second exchange can thus proceed.
Also, the programmer can move it out of this try ... handle
block either before or after.
There is a difference between environment induced and
application defined failures. The sender of a corresponding communication in the former is not explicitly raising a
failure and may not be aware of it. As such, it is not considered for determining the set of causally depending participants and communications; only the receiver is considered,
respecting the asynchronous nature of communication.
There is one additional important feature to this basic
semantics to mention. From the example above, any participant E which does not appear in a try {...} but appears
in a given handle(F), upon a F failure, will be informed like
the other dependent participants in the body; E has to be
informed in any case to participate in recovery actions.
3.3.2
Nesting
Our protocol type also supports nesting of failures in the
sense that any try {...} block or handle ...{...} clause can
contain another try ... handle. The rules for nested handling
and propagation of failures is analogous to those for exception handling within method bodies in languages like Java
or C++ (and not like rules for propagation through nested
method calls). That is, any failure, which does not have a
corresponding handle clause attached to its immediate surrounding try , is propagated one scope outwards etc. If a
given failure is not addressed by a corresponding handler
within any enclosing scope, the protocol type is ill-formed
and will be rejected by the static analysis.
Nesting is orthogonal to our synchronisation semantics, or
put differently, composes straightforwardly with it. Components that must be notified for a failure are found in the corresponding try {...} body, includes any component in nested
try {...} blocks. Inversely, for any failure F not handled by
any handle {...} clause of such a nested try , we consider the
components in the causal extension of the raising of F for
determining the set of components depending on the F after
that nested try . Consider the following example:
1
2
3
4
5
6
7
8
9
10
11
12
13
try {
A −> B : <T1> | F1
try {
B −> C : <T2> | F2
B −> D : <T3> | F3
} h a n d l e ( F2 ) {
...
}
} h a n d l e ( F1 ) {
...
} h a n d l e ( F3 ) {
...
}
Upon a failure of type F1 at Line 2 the remainder of
the protocol trivially gets aborted as all the other sends
causally depend on it. Upon a failure F2 at Line 4, Line 5 is
skipped, and the failure F2 is handled “locally”, which presumes that all invariants required for subsequent actions —
as is common with exception handling — can be restored
there. Lastly, if a failure of type F3 is raised at Line 5, then
it will be handled one level of nesting outwards.
3.4
Retrying
One possibility to continue the protocol above after a failure of type F2 at Line 4 is to retry the nested (sub)protocol
delineated by the inner try {...} . This can be achieved by
specifying the handle(F2) clause with B: retry to trigger a
retry guarded by B, meaning that the entire inner try {...}
block is re-attempted. The following replaces Lines 3-8 in
the example:
try {
B −> C : <T2> | F2
B −> D : <T3> | F3
} h a n d l e ( F2 ) {
B: retry
}
A retry can fail again and/or the participant guarding it
can choose to not retry. The programmer is still responsible
for establishing any invariants necessary for any continuation of the protocol. There is no implicit propagation of the
handled failure in the case of a declined retry. Thus retrying
is not a panacea, it’s a feature that relieves the programmer
from the burden of explicitly describing loops presented in
the FSJ? model of the Shibboleth example in Section 2, and
ensures that certain internal paths reflecting success do not
lead to retrying the loop.
4.
PROTOCOL TYPE SOFTWARE SYSTEM
In this section we introduce our protocol type system for
implementing fault-tolerant distributed software, featuring
asynchronous message-passing, branching, looping, failure
notification and failure handling. We extend Java with syntax for structured interactions operations, borrowing ideas
from SJ. Table 3 gives an overview of the constructs in our
protocol type software system. The most interesting ones
are looping, branching, failure handling and retry of a try
block. Looping and branching are constructs defined for two
or more components. The ! indicates the party that controls
looping or branching, while the other parties denoted by ?
follow this decision. The other constructs read the same as
in the specification language. At the beginning of the protocol execution, the software system automatically connects
all distributed components to each other by deciding which
components expect connection requests from which components in a way that avoids deadlocks.
After all components are successfully connected, the protocol sockets belonging to each component are used to perform the operations according to the protocol specification.
Whenever a failure occurs, the software system automatically sends synchronisation (failure) messages to the participants that need to be notified of the failure. These synchronisation messages carry the type of failure. On the receiving
end, whenever the software system receives a failure message, it automatically raises an exception to move execution
to the appropriate failure handler. Moreover, if execution
flow proceeds normally without a failure at a synchronisation point, the software system automatically sends a “no
failure” control message to any component that must be notified of a failure if one occurs. If a component receives a
“no failure” message, it simply continues its execution, representing the failure-free path of the protocol.
To illustrate, we provide the implementation of a simplified version of the SP component for the Shibboleth protocol
in our approach. The first step is to create a protocol socket
with the name of the protocol and identity of component.
These values are used by the static analysis to check the
conformance of the implementation to the protocol specification. Following, on the socket are performed protocol operation such as receive a HttpRequest from the client, sending back a HttpResponse, enclosing operations that may fail
invariants, and so throwing exceptions, within a try-handle
as described in Section 2.
PInstance p = Protocol . i n i t (
R o b u s t S h i b b o l e t h R e s o u r c e R e q u e s t , SP ) ;
i n t validMs = 864000000;
p . try {
H t t p R e q u e s t r q = p . r e c v ( rq , C l i e n t ) ;
i f ( ! r e c o g n i z e d ( r q . g e t ( ”t o k e n ”) ) )
{ p . throw ( new A u t h o r i z a t i o n F a i l u r e ( . . . ) ) ; }
Table 3: Constructs for implementing software components.
Construct
Action
PInstance p = Protocol. init (P, A) Instantiate protocol P
p.send(m, A)
Send message m to A
m = p.recv(A)
Receive message m from A
p.outwhile(cond ){...}
Looping: A sends
termination condition
Looping: A receive
termination condition
p.outbranch(Li ){...}
Branching: A selects and
sends a label
p.inbranch{L1 :...;... Ln :...;}
Branching: A receives a
label
p. try {...}{ F1 f1 :...;... Fn fn:...;} Try block
p.throw(f)
Raising failure f
p. retry (cond)
Retry controlled by A with
p. inwhile {...}
cond
p. retry
Retry controlled by A
i f ( c u r r e n t T i m e M i l l i s ( ) − r q . g e t ( ”t o k e n ”)
> validMs )
{ p . throw ( new E x p i r e d F a i l u r e ( . . . ) ) ; }
e l s e p . send ( new H t t p R e s p o n s e ( . . . ) ) ;
}{ A u t h o r i z a t i o n F a i l u r e a : p . r e t r y ;
ExpiredFailure e :p. retry ; }
else
where RobustShibbolethResourceRequest and SP are the arguments passed to protocol socket; validMs represents the
validity duration of tokens (1 day).
5.
A STATIC ANALYSER
This section presents the static analysis that checks the
conformance of components to the protocol type specification. The analysis is implemented on top of the SJ 3 static
analyser. The SJ’s analyser is written using the Polyglot 4
framework. It takes as input a session type specification as
well as source files, and checks the input program against the
session type specification. In addition, the analyser instruments the code with calls to the SJ software system. The
result is passed to the standard Java compiler.
To implement our static analyser, we first extended the SJ
analyser to support pairwise channels, branching, looping
and the failure handler features as described in Section 3.
The input to our extended version of SJ, called PJ, is a nonstandard Java file that contains a protocol type specification
and a component implementation in our software system.
Just like SJ, PJ’s output is a standard Java file that can be
compiled by a standard Java compiler. Figure 2 gives the
stages of the static analyser.
In the parsing stage, PJ processes the input file and protocol type specification, constructing the AST for the input
file. PJ synthesises component specs from the (global) protocol type by projection. The checking phase of the analyser verifies that the implementation of each component conforms to its spec. This includes checking that messages of
the correct types are communicated with the correct component at all protocol points. Moreover, the checker uses
3
4
http://code.google.com/p/sessionj/.
http://www.cs.cornell.edu/projects/polyglot/.
Input File
Parser
Checker
PJ
Software
System
Executable
Java
Compiler
Transformation
&
Translation
Output File
Figure 2: Work flow of PJ’s static analyser
type information to verify that all failures are handled at
the points specified by the global protocol type and their
handlers implement a recovery protocol as specified.
The analyser uses the specification returned from the transformation tool to instrument failure notification messages in
the software components. This ensures that components are
correctly notified of failures and execute appropriate handlers. Failure notification messages are inserted with one
message send for each component that needs to be notified
of the failure or its absence. After this, the analyser translates the AST into a standard Java file.
6.
EVALUATION
We illustrate the benefits of our technique empirically
both in terms of specifications and software quality, and performance to gauge simplicity and efficiency (see Section 3.1).
We consider six benchmarks programs: Shibboleth introduced earlier, 1PC [27], 2PC [27], 3PC [24], and, as well
as Currency Broker and Buyer-Seller-Shipper examples inspired from previous work on session types with exceptions [7,
8].
In the following we first assess the benefits of protocol type
in terms of specification and software quality by comparing
them to FSJ? versions. Then, we also show that software
based on protocol type system provide better performance
than those of FSJ? in that their execution is faster. For both
specification/software quality and performance comparisons
we also include software versions obtained with SJ ? protocols that are agnostic to failures (SJA? ), i.e., that do not
deal with failures. This gives us a utopian baseline reflecting a world in which we need not worry about failures. The
versions obtained with our protocol type approach are in
the following referred to as PT for brevity. Before we give
the empirical studies along code quality analysis, we briefly
provide an informal description of the 2PC protocol.
6.1
Two Phase Commit
The popular Two Phase Commit (2PC) protocol [27] is
used to decide on the outcome of distributed transactions
executing across several servers. A TwoPC instance kicks off
by having the coordinator send the identifier of a transaction
(long) whose outcome (abort or commit) is to be voted upon
by all participants. The coordinator waits for votes from
each participant (true for commit, false for abort), based on
which the coordinator sends the final decision to all participants. (Given the independence of the two sends from and
to the coordinator respectively these can proceed in parallel.) Any other voting outcome than a unanimous commit
(commit votes from all participants) must lead to aborting
the transaction. If the coordinator times out on any of the
responses then the protocol proceeds with the corresponding handle clause, leading to abort. In contrast to the previous examples, TimeoutFailure is raised by the “environment”,
which means that the runtime raises it. This is no different
than a RemoteException in Java’s remote method invocations
which needs to be added to every remotely invocable method
to convey errors like ConnectionExceptions, or SOAPExceptions
in Web Services. With protocol type, this is supported by
declaring the corresponding failure as a subtype of a built-in
InfrastructFailure . Figure 6 outlines a simplified version of
such a protocol in our language. For simplicity we focus on
the case of two participants — which may fail — as that is
enough to illustrate what we need, as argued by Skeen and
Stonebraker [24].
Table 4: Code metrics
Protocol Approach Spec Soft Nesting Msgs States Inv. Dupl.
lines lines levels max.
paths lines
SJA?
PT
FSJ?
SJA?
2PC
PT
FSJ?
SJA?
3PC
PT
FSJ?
Currency SJA?
Broker
PT
FSJ?
Buyer Seller SJA?
Shipper
PT
FSJ?
SJA?
Shibboleth PT
FSJ?
1PC
6.2
8
14
16
10
14
19
14
45
49
13
24
22
14
19
21
26
43
42
19
36
59
17
30
59
21
139
184
34
67
68
38
55
80
138
245
272
0
1
4
0
1
4
0
3
9
1
3
4
1
2
3
2
3
3
4
7
12
6
6
9
10
12
18
7
11
14
8
8
12
20
21
43
1
4
6
1
4
5
1
16
26
3
8
8
3
8
10
15
16
20
0
1
0
1
0
1
0
0
0
0
0
2
0
8
0
14
0
23
0
0
0
0
0
0
Specification and software quality
To demonstrate the simplicity of devising fault-tolerant
protocols in protocol type, we compare the quality of specifications and their corresponding software to those of FSJ?
and SJA? . We gauge programmer effort by considering a
number of statically determined code characteristics (1) lines
of code (LoC) for protocol descriptions, (2) LoC for corresponding software, (3) nesting levels in protocols, (4) “maximum number” of messages (i.e., number of messages in the
longest failure-free communication path), (5) number of distinct states (state here refers to a group of protocol operations that form a basic block in the protocol’s control-flow
graph), (6) invalid paths between states in the protocol descriptions (i.e., paths permitted but not valid according to
the protocol), and finally, (7) duplicated LoC due to explicit
failure encoding (values for protocol type specifications are
always 0). The last two are moot for SJA? .
Table 4 summaries the outcome of our static code quality
evaluation. While the number of lines (1) is close between
PT and FSJ? , where the latter, however, clearly increases
the numbers of LoC (2), and nesting levels (3). The number
of messages in the longest failure-free path is also clearly
increased in FSJ? (4); this is a hint to performance overhead which will be validated shortly. Another symptom is
the protocol complexity, which is demonstrated through an
increasing number of different states (5). Figures 3a and 3b
graphically illustrate this difference via protocol state transition diagrams for the discovery protocol of Shibboleth with
PT and FSJ? . Notice that NoIdPFailure does not increase the
Start
noDiscovery
1
Discovery
Passive
2a,
2b
DisplayPageF
notPassive
H
3
4
5a
5b,
6a
6b,
7,8
H
9
AuthenticationF
10
6.3
H
H
11
AuthorizationF
ExpiredF
End
(a) With PT
Start
1
noDiscovery
Discovery
2a,
2b
DisplayPageF
H
Passive
notPassive
E1
NoFailure
3
NoIdPF
H
E2
NoFailure
4
5a
5b,
6a
E3
H
6b,
7,8
AuthenticationF
NoFailure
9,10
ExpiredF
AuthorizationF
H
number of states with PT as it is handled locally at the
discovery service end. Moreover, we see that states 9 and
10 in Figure 3a are separate, but they appear in a single
state in Figure 3b. The reason is that with protocol type,
the AuthenticationFailure is tied with the HttpRedirect message, which means that this message can potentially change
the protocol execution path. In Figure 3b, however, the
HttpRedirect message can be sent only after the execution
path has been decided by the Authentication - Failure branch
path, and therefore message 9 is placed in the failure-free
path along with message 10.
The most substantial increase in states with FSJ? occurs
in 3PC. This is largely due to disjoint paths through the
protocol and corresponds directly to the large increase in
nesting level and code duplication. Shibboleth in FSJ? does
not suffer from duplication but instead from invalid paths
like 1PC, 2PC and 3PC (6). All XPC protocols exhibit
significant code duplication (7) with FSJ? . The Currency
Broker shows the least benefits for PT. We believe that this
is largely due to its simplicity and focus on (few) applicationlevel failures.
E4
H
NoFailure
11
End
(b) With FSJ?
Figure 3: State diagrams of Shibboleth discovery
protocol
Performance characteristics
As mentioned, simplicity of modelling can be easily achieved
by proposing features which impose strong synchronisation.
We show that the less simple FSJ? inversely adds much more
overhead than PT. For our performance evaluation we ran
successive rounds of the various protocols. All components
were executing on distinct machines as well as in distinct networks within campus with 1 - 3 hubs connecting each pair of
networks. We varied the percent probability that any given
communication that could result in a failure would actually
raise a failure notification. Thus as the percent probability of failures being raised increases, so does the number of
times that portions of the protocol must be re-executed to
achieve full completion. The exponential trend of the graphs
of Figure 4 is due to the fact that these protocols have more
than one possible failure point. Thus as the failure probability increases for each of the failure points, the probability of
a successful protocol execution exponentially decreases. All
executions were repeated 10000 times and averaged. As the
figure shows, in all reasonable ranges of failure probability
PT clearly outperforms FSJ? . For instance with Shibboleth, when the percent probability of a failure being raised
at a communication point is 20% or below, PT takes only
about 60-70% of the time of FSJ? . These benefits are due to
the absence of redundant messages which occur with manual
implementation of failures. We note that this measurement
approximates an upper bound on our performance gains as
the protocols primarily perform communication.
Figure 5a normalises the improvements of PT over FSJ? .
For instance with 2PC PT shaves off 9-22% of the time used
by FSJ? . For 2PC, Currency Broker, and Buyer-SellerShipper performance improvements of PT are consistent,
saving between 9% and 42% over FSJ? . In failure-free runs,
PT only takes around 20% and 35% of the time FSJ? takes
for Shibboleth and 1PC respectively. These performance
improvements come from a number of factors depending on
how protocols are implemented. For example, in the case
of the FSJ? software of 2PC, all protocol paths incur extra
branch decision control messages that are used to notify the
components of whether or not a failure has occurred at each
communication that can fail in addition to the actual sent
FSJ*
FSJ*
(a) 1PC
(b) 2PC
(d) Currency Broker
(c) 3PC
FSJ*
FSJ*
FSJ*
(e) Buyer-Seller-Shipper
FSJ*
(f) Shibboleth
Figure 4: Average time to complete an iteration of a protocol vs failure probability
(a) PT Normalized vs FSJ?
FSJ*
(b) Overhead of PT and FSJ? vs SJA?
Figure 5: Overhead comparison
message.
A second source of spurious messages that appears in the
FSJ? software is the way retries are implemented. Retries
are implemented in FSJ? using session loops. These add
2 × (n − 1) extra messages to the failure free path where
n is the number of components in the protocol. The first
set of n − 1 messages are the control messages sent by the
loop guard to all components. These messages signal components to enter the first iteration of the loop, in order to
proceed with the execution for the first time. Then, at the
end of executing the subprotocol within the loop, an extra
n − 1 control messages are sent by the loop guard in order to signal whether to re-execute the loop body or not.
This type of overhead occurs in FSJ? for Shibboleth, BuyerSeller-Shipper, 1PC, 2PC and Currency Broker examples.
With PT, the components proceed to execute the subprotocol within the try block without any exchange of messages.
Moreover, retry control signals that are sent only when a failure occurs and the handler offers the possibility of retrying
the failure-prone subprotocol.
The third source of spurious messages is nesting, as best
demonstrated by the FSJ? version of 3PC. Due to the deep
nesting implemented using branching, the performance of
the failure-free path is hampered by n − 1 extra control messages for each nesting level: a branch guard has to send a
control message to each of the n − 1 components which wait
on it to assess whether a failure occurred.
The 1PC and 3PC protocols show interesting performance
trends. Because of the simplicity of the 1PC protocol we see
little performance improvement with PT over FSJ? . This
is because the number of spurious messages that are sent
are very few. Moreover, as the failure probability increases
the two softwares begin to converge and show similar performance results. For 3PC, on the other hand, PT shows consistent but humble improvements when the exception probability is below 60%. However, when the exception probability increases beyond the 60% point we see that PT performance improvements significantly increase. This is because
the number of spurious messages exponentially increases as
the exception probability increases. Last but not least, Figure 5b focuses on failure-free runs, showing the overheads
of PT and FSJ? over SJA? . PT shows no overhead except
for Shibboleth (∼55%) and 3PC (∼22%). In contrast, FSJ?
invariably incurs between ∼22% (2PC) and ∼152% (Shibboleth) overhead.
6.4
Threats to Validity and Discussion
We have shown that protocol type technique and tools
have advantages over SJ both in terms of (1) simplicity and
(2) efficiency on a range of different protocols. Although the
examples are relatively small, they span a number of important protocol examples and families, and involve different
failure models. We have implemented all of the examples
in Java; however, we observe that most languages will contain primitives for loops and branches, while the protocol
type specification language itself is language-independent,
and complexity improvements for PT over FSJ? can be generalised: with FSJ? , a sequence of n failure-prone sends m1 ,
..., mn can translate to n nested branches; mi+1 , ..., mn
are duplicated at branch i at a degree proportionate to the
number of different failures possibly occurring upon mi ; and
every branch requires the sending of a label which is redundant with the subsequent message or failure. Furthermore,
every loop introduced for retrying requires an additional
multi-sending upon first execution. All of this translates
to increased latency. Although the benchmarks show good
percent improvement in the runtime of the software, we expect large, real-world systems to not always exhibit such
performance improvements due to a lower ratio of communication to computation performed. However, we note that
the benefits afforded by protocol type techniques and tools
will be exhibited by reduced latency– a metric often of equal
importance to raw throughput.
Last but not least, even if protocol type yield high LoC
savings, we believe their main benefit lies in the reduction of
invalid paths compared to implementing retrying via loops –
these must be ruled out manually by the programmer which
defies the purpose of the static analysis approach.
7.
RELATED WORK
Session types. Following the pioneering work of Takeuchi,
Honda, and Kubo [25], several type theories to guarantee
deadlock-freedom and communication-safety for process calculi — so-called session types [12, 26, 6] — have been proposed. Honda et al. [16] and Bonelli et al. [6] extended
the original bi-party session types to multi-party interaction
(MPSTs). Bettini et al. [5] introduced implicit participantpairwise channels. Multi-channels have been used in a variant of SJ [23] for parallel software, describing protocols of
an arbitrary number of components, but without addressing failure handling. Bejleri and Yoshida’s work [4] extends
that of Honda et al. for synchronous communication among
multiple interacting peers.
Operational semantics for asynchronous session types were
first studied by Neubauer et al. [20]. Session types have been
applied to functional [12] and object-oriented settings [10],
as well as others. Gay et al. [13] focus on modular typetheory of sessions via objects.
A systems-level object-oriented system [10] integrate the
theory of bi-party session-types and ownership-types into a
variant of C#. It is used to describe the interfaces and verify
the components of the Singularity OS. Components communicate via message-passing, designed over shared memory and implemented as pointer rewriting. The system is
suitable for implementing concurrent, system-level software.
The technique is similar to bi-party session types and is
adaptable only for concurrency software but not for distributed. Scribble [15, 29] is an ongoing project on a session
type based language and tool chain for large scale distributed
applications. Pabble [22] is a recent dialect of Scribble to describe protocols of an arbitrary number of components and
check their conformance to software written in MPI. The
current version of it offers only a limited form of failure handling, namely interruption; thus, given the growing interest
in it, we believe it would be worth investigating our failure
handling technique into it.
Exception handling. Carbone et al. [7, 8] propose structured interactional exceptions for session types based asynchronous communication. When a process throws an exception, execution is interrupted at all participants involved
in the conversation and they move to another dialogue. Exceptions can be nested through nested try blocks but raising
an exception is not permitted to occur within an exception
handler. The model supports only one kind of exception.
Hanazumi and Vieira de Melo [14] and Alexandar et
al. [28] describe a method for exception handling based on
Coordinated Atomic Actions (CAAs). Coordinated exception handling is achieved by satisfying a number of CAA
properties on transactions, namely rollback and exception
handling properties. When an exception is raised within a
CAA or signaled to it, the participants handle the exception
by executing the exception handling code for that CAA. If
the exception is not handled within the CAA, it is propagated to other parts of the system. This method requires
substantial programming effort. Also, transactional guarantees are not always needed or possible.
8.
CONCLUSIONS AND OUTLOOK
To help the programmer combat the challenges of design
and development of distributed software in the presence of
partial failures, we have proposed and presented protocol
type specification technique, including static analysis, and a
software system. We are exploring several extensions to our
work, e.g., a failure propagation semantics with different
root failure and different kind of communication channels
instead of the hardwired “−>”. We plan to investigate our
failure handling technique into emerging, promising specification approaches such as Scribble into a complete specification technique, that describes the structure of data, for
distributed programming.
9.
REFERENCES
[1] Shibboleth. http://www.shibboleth.net.
[2] A. Bejleri, T.-C. Chen, M. Qudeisat, L. Ziarek, and
P. Eugster. Effective specification of fault tolerant
distributed software. Available at:
www.andibejleri.net/papers/protocol.pdf, 2015.
[3] A. Bejleri, R. Hu, and N. Yoshida. Session-based
programming for parallel algorithms: Expressiveness
and performance. In Proceedings Second International
Workshop on Programming Language Approaches to
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
Concurrency and Communication-cEntric Software,
PLACES 2009., pages 17–29, 2009.
A. Bejleri and N. Yoshida. Synchronous Multiparty
Session Types. Electronic Notes in Theoretical
Computer Science (ENTCS), 241:pp. 3–33, 2009.
L. Bettini, M. D. Luca, M. Dezani-ciancaglini, and
N. Yoshida. Global progress in dynamically
interleaved multiparty sessions. In CONCUR 2008 Concurrency Theory, 19th International Conference,
Proceedings, pages 418–433, 2008.
E. Bonelli and A. Compagnoni. Multipoint session
types for a distributed calculus. In Trustworthy Global
Computing, Third Symposium, TGC 2007, Revised
Selected Papers, pages 240–256, 2007.
M. Carbone, K. Honda, and N. Yoshida. Structured
interactional exceptions in session types. In CONCUR
2008 - Concurrency Theory, 19th International
Conference, Proceedings, pages 402–417. 2008.
M. Carbone, N. Yoshida, and K. Honda.
Asynchronous session types: Exceptions and
multiparty interactions. In Formal Methods for Web
Services, pages 187–212. 2009.
M. Dezani-Ciancaglini, D. Mostrous, N. Yoshida, and
S. Drossopoulou. Session types for object-oriented
languages. In ECOOP 2006 - Object-Oriented
Programming, 20th European Conference, Proceedings,
pages 328–352, 2006.
M. F¨
ahndrich, M. Aiken, C. Hawblitzel, O. Hodson,
G. Hunt, J. R. Larus, and S. Levi. Language support
for fast and reliable message-based communication in
singularity os. In Proceedings of the 1st ACM
SIGOPS/EuroSys European Conference on Computer
Systems 2006, EuroSys ’06, pages 177–190, 2006.
F. C. G¨
artner. Fundamentals of fault-tolerant
distributed computing in asynchronous environments.
ACM Comput. Surv., 31(1):1–26, 1999.
S. Gay, V. Vasconcelos, and A. Ravara. Session types
for inter-process communication. Technical report,
2003.
S. J. Gay, V. T. Vasconcelos, A. Ravara, N. Gesbert,
and A. Z. Caldeira. Modular Session Types for
Distributed Object-Oriented Programming. In
Proceedings of the 37th ACM Symposium on
Principles of Programming Languages (POPL ’10),
pages 299–312, 2010.
S. Hanazumi and A. C. V. de Melo. Coordinating
exceptions of java systems: Implementation and
formal verification. In Proceedings of the 2012 Eighth
International Conference on the Quality of
Information and Communications Technology,
QUATIC ’12, pages 108–113. IEEE Computer Society,
2012.
K. Honda, A. Mukhamedov, G. Brown, T.-C. Chen,
and N. Yoshida. Scribbling interactions with a formal
foundation. In Proceedings of the 7th International
Conference on Distributed Computing and Internet
Technologies (ICDCIT 2011), pages 55–75, 2011.
K. Honda, N. Yoshida, and M. Carbone. Multiparty
Asynchronous Session Types. In Proceedings of the
35th ACM Symposium on Principles of Programming
Languages (POPL ’08), pages 273–284, 2008.
R. Hu, D. Kouzapas, O. Pernet, N. Yoshida, and
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
K. Honda. Type-safe eventful sessions in java. In
ECOOP 2010 - Object-Oriented Programming, 24th
European Conference, Proceedings, pages 329–353,
2010.
R. Hu, N. Yoshida, and K. Honda. Session-based
distributed programming in java. In ECOOP 2008 Object-Oriented Programming, 22nd European
Conference, Proceedings, pages 516–541, 2008.
L. Lamport. Time, Clocks, and the Ordering of Events
in a Distributed System. Communications of the
ACM, 21(7):558–565, July 1978.
M. Neubauer and P. Thiemann. An Implementation of
Session Types. In Proceedings of the 6th ACM
International Symposium on Practical Aspects of
Declarative Languages (PADL ’04), pages 56–70, 2004.
B. C. Neuman and T. Ts’o. Kerberos: An
Authentication Service for Computer Networks. IEEE
Communications, (9):33–38, Sept. 1994.
N. Ng and N. Yoshida. Pabble: Parameterised scribble
for parallel programming. In 22nd Euromicro
International Conference on Parallel, Distributed, and
Network-Based Processing, pages 707–714. IEEE
Computer Society, 2014.
N. Ng, N. Yoshida, O. Pernet, R. Hu, and Y. Kryftis.
Safe parallel programming with session java. In
Proceedings of the 13th International Conference on
Coordination Models and Languages,
COORDINATION’11, pages 110–126, 2011.
D. Skeen and M. Stonebraker. A Formal Model of
Crash Recovery in a Distributed System. IEEE
Transactions on Software Engineering, 9(3):219–228,
May 1983.
K. Takeuchi, K. Honda, and M. Kubo. An
Interaction-based Language and its Typing System. In
Proceedings of the 6th International PARLE
Conference on Parallel Architectures and Languages
Europe (PARLE ’94), pages 398–413, 1994.
A. Vallecillo, V. T. Vasconcelos, and A. Ravara.
Typing the Behavior of Software Components using
Session Types. Fundamenta Informaticae, 73(4):pp.
583–598, 2006.
G. Weikum and G. Vossen. Transactional Information
Systems: Theory, Algorithms, and the Practice of
Concurrency Control and Recovery. Morgan
Kaufmann, 2002.
J. Xu, A. B. Romanovsky, and B. Randell. Concurrent
exception handling and resolution in distributed
object systems. IEEE Trans. Parallel Distrib. Syst.,
11(10):1019–1032, 2000.
N. Yoshida, R. Hu, R. Neykova, and N. Ng. The
scribble protocol language. In Trustworthy Global
Computing - 8th International Symposium, TGC
2013, Revised Selected Papers, pages 22–41, 2013.
APPENDIX
A. TWO PHASE COMMIT
Figure 6 outlines a simplified version of the Two Phase
Commit (2PC) protocol in our protocol-type. For simplicity
we focus on the case of two participants — which may fail
— as that is enough to illustrate what we need, as argued
by Skeen and Stonebraker [24].
1 p r o t o c o l TwoPC {
2
p a r t i c i p a n t s : Coord , P a r t 1 , P a r t 2
3
Coord −> P a r t 1 : <l o n g >
4
Coord −> P a r t 2 : <l o n g >
5
try {
6
P a r t 1 −> Coord : <b o o l > | T i m e o u t F a i l u r e
7
P a r t 2 −> Coord : <b o o l > | T i m e o u t F a i l u r e
8
Coord −> P a r t 1 : <b o o l >
9
Coord −> P a r t 2 : <b o o l >
10
} h a n d l e ( T i m e o u t F a i l u r e ) { // a b o r t t o a l l
11
Coord −> P a r t 1 : <b o o l >
12
Coord −> P a r t 2 : <b o o l >
13
}
14 }
Figure 6: 2PC protocol with failure handling in
protocol-type
In an asynchronous distributed system a TimeoutFailure
does not necessarily imply a participant crash, and so we
asynchronously notify both participants of the abort regardless of failures. The 2PC example points to the importance
of choosing the semantics for failure handling. The coordinator always performs the same two sends of decisions to both
participants, regardless of failures. Thus from the perspective of these participants there is no difference to replacing
the entire try ... handle block on Lines 5-13 simply with
Part1
Part2
Coord
Coord
−>
−>
−>
−>
Coord :
Coord :
Part1 :
Part2 :
<b o o l >
<b o o l >
<b o o l >
<b o o l >
This would also avoid sending a failure message and an
abort decision to participants in case of failure. The net difference, however, is that the coordinator can get stuck waiting for a vote from a participant which indeed failed. Based
on the semantics expressed inherently with the example of
Figure 6, a timeout on either participant constrains the coordinator to proceed to the handle clause. It also implies
that any non-faulty participant knows to not expect two
messages. In other terms they too proceed to the handler.
Figure 7 illustrates the 2PC in FSJ? .
1 protocol TwoPCExplicit {
2
p a r t i c i p a n t s : Coord , P a r t 1 , P a r t 2
3
Coord −> P a r t 1 : <l o n g >
4
Coord −> P a r t 2 : <l o n g >
5
Part1 : {
6
T i m e o u t F a i l u r e : // n e x t 2 l i n e s d u p l i c a t e d
7
Coord −> P a r t 1 : <B o o l e a n >
8
Coord −> P a r t 2 : <B o o l e a n >
9
NoFailure :
10
P a r t 1 −> Coord : <B o o l e a n >
11
Part2 : {
12
T i m e o u t F a i l u r e : // n e x t l i n e s d u p l . 7−8
13
Coord −> P a r t 1 : <B o o l e a n >
14
Coord −> P a r t 2 : <B o o l e a n >
15
NoFailure :
16
P a r t 2 −> Coord : <B o o l e a n >
17
}
18
}
19 }
Figure 7: 2PC with FSJ?