Some Weak Idioms

Doug Lea
SUNY Oswego
dl@cs.oswego.edu
Intro
Want good performance for core libraries and runtime systems
Internally use some common non-SC-looking idioms
Most can be seen as manual “optimizations” that have no
impact on user-level consistency
But leaks can show up as API usage rules
Example: cannot fork a task more than once
Example: Publication and transfer (most of this talk)
Generalities
Some java.util.concurrent code
Challenges
Cataloging cases; establishing semantics
First-class language support
Publication and Transfers
class X { int field; X(int f) { field = f; } }
For shared var v (other vars thread-local):
P: p.field = e; v = p;
C: c = v; f = c.field;
Use weakest protocol that ensures that C:f is usable, considering:
“Usable” can be algorithm- and API-dependent
Is write to v final? including:
Write Once (null → x), Consume Once (x → null)
Is write to x.field final?
Is there a unique uninitialized value for field?
Are reads validated?
Consistency with reads/writes of other shared vars
Weaker protocols avoid more cache invalidation
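The P/C pattern above can be sketched in source-level Java, assuming Java 9 or later, where VarHandle offers release and acquire access modes. The names Publication, publish and consume are hypothetical, and the -1 sentinel is an assumption made only so the example has something to return before publication. This shows one protocol among those listed, not necessarily the weakest usable one for a given algorithm.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical illustration of the P/C publication pattern.
class X {
    int field;
    X(int f) { field = f; }
}

class Publication {
    X v; // the shared variable; accessed only through the VarHandle V

    static final VarHandle V;
    static {
        try {
            V = MethodHandles.lookup()
                    .findVarHandle(Publication.class, "v", X.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // P: initialize the object, then publish it with a release-mode write,
    // so the write to p.field is ordered before the write to v.
    void publish(int e) {
        X p = new X(e);
        V.setRelease(this, p);
    }

    // C: an acquire-mode read pairs with the release-mode write. If c is
    // non-null, c.field holds the value written before publication.
    // Returns -1 (an assumed sentinel) when nothing is published yet.
    int consume() {
        X c = (X) V.getAcquire(this);
        return (c == null) ? -1 : c.field;
    }
}
```

Whether release/acquire is enough depends on the questions in the list above, for example whether other shared variables must stay consistent with v.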
Avoiding Invalidation on Writes
Avoiding the most expensive per-access cache invalidation:
storeFence; v = x; storeLoadFence
Static single or final Write
The single thread issuing final write is structurally determined
Example: storeFence; v = x;
Dynamic single or final Write
Ensuring one writer requires distinguished value
Example: [storeFence] CAS(&v, null, x)
Validated (including “double-checked”)
Don't fence write if reads validate with CAS
Example: if (v == null) { … if (CAS(&v, null, x)) … }
Dependent
Don't fence var if accesses nested under another
Example: lock; v = x; unlock;
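The dynamic single-write and validated cases can be sketched with AtomicReference, where null plays the role of the distinguished value. WriteOnce, trySet and getOrInit are hypothetical names. Note that AtomicReference reads and CAS have volatile strength, which is stronger than the weakest forms listed above; this sketch shows the shape of the idiom rather than its cheapest encoding.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical write-once cell: null is the distinguished "unset" value.
class WriteOnce<T> {
    private final AtomicReference<T> v = new AtomicReference<>();

    // Dynamic single write: only the first CAS from null succeeds.
    boolean trySet(T x) {
        return v.compareAndSet(null, x);
    }

    // Validated ("double-checked") form: read first, and attempt the CAS
    // only if the value still looks unset. A loser adopts the winner's value.
    T getOrInit(Supplier<T> factory) {
        T cur = v.get();
        if (cur == null) {
            T x = factory.get();
            cur = v.compareAndSet(null, x) ? x : v.get();
        }
        return cur;
    }

    T get() {
        return v.get();
    }
}
```

A losing thread in getOrInit may have built an object that is then discarded, which is acceptable only when construction has no side effects that matter.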
ForkJoinTasks
class SortTask extends RecursiveAction {
    final long[] array;
    final int lo; final int hi;
    SortTask(long[] array, int lo, int hi) {
        this.array = array;
        this.lo = lo; this.hi = hi;
    }
    protected void compute() {
        if (hi - lo < THRESHOLD)
            sequentiallySort(array, lo, hi);
        else {
            int m = (lo + hi) >>> 1;
            SortTask r = new SortTask(array, m, hi);
            r.fork();
            new SortTask(array, lo, m).compute();
            r.join();
            merge(array, lo, hi);
        }
    }
    // …
}
[Figure: deque with Base and Top ends; pushing and popping at Top, stealing at Base]
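A self-contained variant of the task above, for readers who want something to run. The THRESHOLD value, the use of Arrays.sort as the sequential sort, the merge routine and the class name MergeSortTask are all assumptions filled in for illustration; the slide leaves them elided.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class MergeSortTask extends RecursiveAction {
    static final int THRESHOLD = 1 << 10; // assumed cutoff, not from the slides
    final long[] array;
    final int lo, hi; // sorts the half-open range [lo, hi)

    MergeSortTask(long[] array, int lo, int hi) {
        this.array = array;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo < THRESHOLD) {
            Arrays.sort(array, lo, hi); // stand-in for sequentiallySort
        } else {
            int m = (lo + hi) >>> 1;
            MergeSortTask r = new MergeSortTask(array, m, hi);
            r.fork();                                  // make right half available
            new MergeSortTask(array, lo, m).compute(); // run left half directly
            r.join();                                  // wait for the right half
            merge(m);
        }
    }

    // Assumed merge: copy the left run, then merge both runs back in place.
    private void merge(int m) {
        long[] left = Arrays.copyOfRange(array, lo, m);
        int i = 0, j = m, k = lo;
        while (i < left.length && j < hi)
            array[k++] = (left[i] <= array[j]) ? left[i++] : array[j++];
        while (i < left.length)
            array[k++] = left[i++];
        // Remaining right-run elements are already in their final positions.
    }

    static void sort(long[] a) {
        ForkJoinPool.commonPool().invoke(new MergeSortTask(a, 0, a.length));
    }
}
```

Each task is forked at most once, which matches the usage rule mentioned in the introduction.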
Transferring Tasks
Queues perform a form of ownership transfer
Push: make task available for stealing or popping
needs lightweight store-fence
Pop, steal: make task unavailable to others, then run
Needs CAS with at least acquire-mode fence
Java doesn't provide source-level map to efficient forms
So implementation uses JVM intrinsics
T1: push(w) (publish)
    w.state = 17;
    slot = w;
T2: steal() (consume)
    w = slot;
    if (CAS(slot, w, null))
        s = w.state; ...
Task w: int state;
Require: s == 17
[Figure: queue slot holding a reference to Task w, written by T1 and claimed by T2]
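The publish/consume exchange can be approximated with AtomicReference: lazySet gives an ordered store for the push, and compareAndSet claims the task. Task, Slot and the -1 sentinel are hypothetical names and values used only for this sketch; the actual implementation uses JVM intrinsics on array slots, as noted above.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical task type carrying the state field from the slide.
class Task {
    int state;
}

class Slot {
    private final AtomicReference<Task> slot = new AtomicReference<>();

    // T1: initialize the task, then publish it with an ordered store so the
    // write to w.state is not reordered after the slot write.
    void push(Task w) {
        w.state = 17;
        slot.lazySet(w);
    }

    // T2: claim the task by CASing the slot from w to null. Only one
    // claimant can succeed, and the winner then owns w exclusively.
    // Returns the observed state, or -1 (assumed sentinel) if nothing
    // was claimed.
    int steal() {
        Task w = slot.get();
        if (w != null && slot.compareAndSet(w, null))
            return w.state; // the requirement is that this is 17
        return -1;
    }
}
```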
Task Deque Algorithms
Deque operations (esp push, pop) must be very fast/simple
Competitive with procedure call stack push/pop
Current algorithm requires one atomic op per push+{pop/steal}
This is minimal unless duplicate executions or arbitrary
postponement are allowed (see Maged Michael et al., PPoPP 09)
Less than 5X cost for empty fork+join vs empty method calls
Uses (resizable, circular) array with base and sp indices
Essentially (omitting emptiness, bounds checks, masking etc):
Push(t): s = sp++; storeFence; array[s] = t;
Pop(t): if (CAS(array[sp-1], t, null)) --sp;
Steal(t): if (CAS(array[base], t, null)) ++base;
NOT strictly non-blocking but probabilistically so
A stalled ++base precludes other steals
But if so, stealers try elsewhere (use randomized selection)
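A rough sketch of the three operations, assuming a fixed-capacity, non-circular array so that masking and resizing can be left out. SketchDeque is a hypothetical name. AtomicReferenceArray stands in for the intrinsics: lazySet for the ordered slot write, compareAndSet for claiming. Because base only grows and slots are never recycled, this version eventually fills up; it is meant to show the claiming protocol, not to be a usable queue.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical fixed-capacity sketch; the real queue is circular and resizable.
class SketchDeque<T> {
    private final AtomicReferenceArray<T> array;
    private int sp;            // next free slot; touched only by the owner
    private volatile int base; // oldest unclaimed slot; advanced by stealers

    SketchDeque(int capacity) {
        array = new AtomicReferenceArray<>(capacity);
    }

    // Owner only. The index is incremented before the ordered slot write.
    void push(T t) {
        int s = sp;
        if (s == array.length()) throw new IllegalStateException("full");
        sp = s + 1;
        array.lazySet(s, t);
    }

    // Owner only. Claims the newest task by CASing its slot to null.
    T pop() {
        int s = sp - 1;
        if (s < 0) return null;
        T t = array.get(s);
        if (t != null && array.compareAndSet(s, t, null)) {
            sp = s;
            return t;
        }
        return null; // empty, or a stealer claimed the slot first
    }

    // Any thread. Claims the oldest task. A failed attempt returns null so
    // the caller can try another queue instead of waiting.
    T steal() {
        int b = base;
        if (b >= array.length()) return null;
        T t = array.get(b);
        if (t != null && array.compareAndSet(b, t, null)) {
            base = b + 1;
            return t;
        }
        return null;
    }
}
```

Between a stealer's successful CAS and its write to base, other steal attempts see a null slot and fail, which is the stall described above.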
Sample code
Non-public method of ForkJoinWorkerThread
A variant of classic array push: q[sp++] = t (and not much slower)
void pushTask(ForkJoinTask<?> t) {
    ForkJoinTask<?>[] q = queue;   // Per-thread array-based queue with power of 2 length
    int mask = q.length - 1;
    int s = sp++;                  // inc before slot write OK
    orderedPut(q, s & mask, t);    // Publish via JVM intrinsic ensuring previous writes
                                   // commit before slot write (inlined in the actual code)
    if ((s -= base) == 0)
        pool.signalWork();         // If queue was empty, wake up others using scalable event queue
    else if (s == mask)
        growQueue();               // Resize if full
}
Stealers use compareAndSet of this slot from non-null to null to privatize.
Improving Language Support
Poor Java language support for special-mode accesses
Requires intrinsics operating on addresses (not values)
These intrinsics have no formal specs
Alternatively, source-level control over fences
Not very usable in Java, but still, I use them a lot
Ideally language constructs should express intent
Programmers already live with non-consistency every day
IO, Web, mobile, clusters
A historical oddity that languages do not incorporate
Consistency Issues are Inescapable
Occur in remote multicast and message passing
Memory model mapping to distributed platforms expensive
Many groups don't need strong consistency
But encounter anomalies
Example (“IRIW”): x,y multicast
Node A: send x;                  // set x = 1
Node B: send y;                  // set y = 1
Node C: receive x; receive y;    // see x=1, y=0
Node D: receive y; receive x;    // see y=1, x=0
Full avoidance as expensive as full MM mapping –
atomic multicast, distributed transactions
More so when remote failures must be tolerated
Occur in local messaging: Processes, Isolates, ...
Usually rely on implicit OS-level consistency model
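The IRIW example can be checked with a small deterministic model, assuming each receiving node keeps a local replica and applies messages in its own delivery order. Iriw, Replica and possible are hypothetical names. The model shows that the anomaly appears when C and D deliver in different orders and cannot appear when both follow one agreed order, which is what atomic multicast would enforce.

```java
import java.util.List;

class Iriw {
    // Local replica held by a receiving node (C or D).
    static final class Replica {
        int x, y;
        void deliver(char var) {
            if (var == 'x') x = 1; else y = 1;
        }
    }

    // Apply the first `count` messages of `order` to a fresh replica.
    static Replica after(List<Character> order, int count) {
        Replica r = new Replica();
        for (int i = 0; i < count; i++)
            r.deliver(order.get(i));
        return r;
    }

    // The anomaly: C sees x but not y, while D sees y but not x.
    static boolean anomaly(Replica c, Replica d) {
        return c.x == 1 && c.y == 0 && d.y == 1 && d.x == 0;
    }

    // Can the anomaly occur when C delivers in orderC and D in orderD,
    // each having applied some prefix of its own order when it reads?
    static boolean possible(List<Character> orderC, List<Character> orderD) {
        for (int nc = 0; nc <= orderC.size(); nc++)
            for (int nd = 0; nd <= orderD.size(); nd++)
                if (anomaly(after(orderC, nc), after(orderD, nd)))
                    return true;
        return false;
    }
}
```

With orders (x, y) for C and (y, x) for D, possible returns true; with the same order at both nodes it returns false.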
Contention in Shared Data Structures
Mostly-Write
Most producer-consumer exchanges
Especially queues
Apply combinations of a small set of ideas
Use non-blocking sync via compareAndSet (CAS)
Reduce point-wise contention
Arrange that threads help each other make progress
Mostly-Read
Most Maps & Sets
Empirically, 85% Java Map calls read-only
Structure to maximize concurrent readability
Without locking, readers see legal (ideally, linearizable) values
Often, using immutable copy-on-write internals
Apply write-contention techniques from there
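The mostly-read approach can be illustrated with a minimal copy-on-write map, assuming writes are rare enough that copying the whole map on each update is acceptable. CowMap is a hypothetical name and is not how java.util.concurrent maps are implemented. Readers take one snapshot without locking, and writers install a new immutable snapshot with CAS, retrying on contention.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical copy-on-write map for read-mostly workloads.
class CowMap<K, V> {
    private final AtomicReference<Map<K, V>> ref =
            new AtomicReference<>(Collections.<K, V>emptyMap());

    // Readers never lock: they read from one immutable snapshot.
    V get(K key) {
        return ref.get().get(key);
    }

    int size() {
        return ref.get().size();
    }

    // Writers copy the current snapshot, modify the private copy, and
    // publish it with CAS. If another writer got in first, retry.
    V put(K key, V value) {
        for (;;) {
            Map<K, V> cur = ref.get();
            Map<K, V> next = new HashMap<>(cur);
            V old = next.put(key, value);
            if (ref.compareAndSet(cur, Collections.unmodifiableMap(next)))
                return old;
        }
    }
}
```

Each successful put is a single CAS on the snapshot reference, so every reader sees either the map before the update or the map after it.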