Local Parallel Iteration in X10

Josh Milthorpe
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
jjmiltho@us.ibm.com
Abstract
X10 programs have achieved high efficiency on petascale clusters by making significant use of parallelism between places; however, there has been less focus on exploiting local parallelism within a place.
This paper introduces a standard mechanism, foreach, for efficient local parallel iteration in X10, including support for worker-local data. Library code transforms parallel iteration into an efficient pattern of activities for execution by X10’s work-stealing runtime. Parallel reductions and worker-local data help to avoid unnecessary synchronization between worker threads.
The foreach mechanism is compared with leading programming
technologies for shared-memory parallelism using kernel codes
from high performance scientific applications. Experiments on a
typical Intel multicore architecture show that X10 with foreach
achieves parallel speedup comparable with OpenMP and TBB for
several important patterns of iteration. foreach is composable with
X10’s asynchronous partitioned global address space model, and
therefore represents a step towards a parallel programming model
that can express the full range of parallelism in modern high performance computing systems.
Categories and Subject Descriptors D.1.3 [Concurrent Programming]: parallel programming
Keywords X10, parallel iteration, loop transformations, work
stealing
1. Introduction
Data parallelism is the key to scalable parallel programs [6]. Although X10 programs have demonstrated high efficiency at petascale, these impressive results have made little use of parallelism
within a place, focusing instead on parallelism between places [8].
As most scientific codes make heavy use of iteration using for
loops, parallel iteration (sometimes called ‘parallel for loop’) is the
most obvious approach for exploiting shared-memory parallelism.
The foreach statement was a feature of early versions of the
X10 language. The statement was defined as follows:
    The foreach statement is similar to the enhanced for statement. An activity executes a foreach statement in a similar fashion except that separate async activities are launched in parallel in the local place of each object returned by the iteration. The statement terminates locally when all the activities have been spawned.
X10’15, June 14, 2015, Portland, OR, USA. Copyright © 2015 ACM 978-1-4503-3586-7/15/06. http://dx.doi.org/10.1145/2771774.2771781
The requirement that each iteration of the loop be executed as
a separate async made the original definition of foreach unsuitable for typical data-parallel patterns for two reasons. Firstly, it allowed ordering dependencies between iterations, which prevented
arbitrary reordering or coalescing multiple iterations into a single
activity. For example, the loop in Figure 1 was previously a valid
use of foreach, in which the (i+1)th iteration had to be completed
before the ith iteration could begin.
    // One extra element acts as a sentinel so that the last iteration can start.
    val complete = new Rail[Boolean](ITERS+1);
    complete(ITERS) = true;
    foreach (i in 0..(ITERS-1)) {
        when (complete(i+1));
        compute();
        atomic complete(i) = true;
    }
Figure 1: Loop with ordering dependence between iterations
Secondly, any intermediate data structures had to be duplicated
for each iteration of the loop to avoid data races between iterations.
For example, the loop in Figure 2 contains a data race due to sharing
of the array temp between threads.
    val input:Rail[Double];
    val output:Rail[Double];
    val temp = new Rail[Double](N);
    foreach (i in 0..(ITERS-1)) {
        for (j in 0..(N-1)) {
            temp(j) = computeTemp(i, input(j));
        }
        output(i) = computeOutput(i, temp);
    }
Figure 2: Loop with a data race on the shared intermediate array temp
For correctness, temp had to be made private to the body of the loop, requiring that the array be duplicated once per iteration (ITERS times in total).
For these reasons, the time to compute a parallel loop using
foreach was often orders of magnitude greater than an equivalent sequential loop. Furthermore, foreach (p in region) S was
trivially expressible in terms of simpler X10 constructs as finish
for (p in region) async S. The foreach construct was therefore
removed in X10 version 2.1. Despite its removal, there remains a
strong need for an efficient mechanism for parallel iteration in the
X10 language.
The main contributions of this paper are:
• a standard mechanism for local parallel iteration in the X10
language;
• support for worker-local data in the X10 language;
• experimental evaluation of these mechanisms on a multicore
architecture typical of HPC compute nodes;
• comparison with leading programming technologies for shared-memory parallelism, specifically OpenMP and TBB.
2. Related Work
The standard for shared-memory parallelism is OpenMP [1]. All
leading C/C++ and Fortran compilers implement the OpenMP API,
which provides efficient implementations of common parallel patterns. OpenMP parallel for loops support different scheduling
choices including static scheduling for regular workloads, and
dynamic and guided scheduling for irregular workloads. In addition, OpenMP supports the creation of explicit tasks, which allow
expression of a broader range of parallel patterns. However, the interaction between explicit tasks and the implicit tasks is not fully
defined, which makes them difficult to compose [9].
Intel Threading Building Blocks is a C++ template library for
task parallelism [2]. In addition to efficient concurrency primitives, memory allocators, and a work-stealing task scheduler, TBB
provides implementations of a range of common parallel algorithms and data structures. Parallel iteration and reduction are implemented as parallel_for and parallel_reduce, respectively.
The TBB scheduler inspects the dynamic behavior of tasks to perform optimizations for cache locality and task size [7].
Unlike X10, neither OpenMP nor TBB provides support for distributed-memory parallelism.
3. Parallel Iteration With foreach
A new construct for parallel iteration may be defined as follows:
    foreach (Index in IterationSpace) Stmt
The body Stmt is executed for each value of Index, making use of available parallelism. The iteration must be both serializable and parallelizable; in other words, it is correct to execute Stmt for each index in sequence, and it is also correct to execute Stmt in parallel for any subset of indices.
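The serializable-and-parallelizable requirement can be checked empirically on a small case. The following is a minimal sketch in Python rather than X10, chosen only so that it can be run without an X10 toolchain; the helper names run_serial and run_parallel_subsets are hypothetical. It executes a DAXPY-style body once in index order and once over shuffled, disjoint subsets of the indices on a thread pool, and both executions should produce the same vector.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def daxpy_body(i, alpha, x, y):
    # Each index touches only x[i] and y[i], so iterations are independent.
    x[i] = alpha * x[i] + y[i]


def run_serial(alpha, x, y):
    # Serializable: execute the body for each index in sequence.
    for i in range(len(x)):
        daxpy_body(i, alpha, x, y)
    return x


def run_parallel_subsets(alpha, x, y, nworkers=4, seed=0):
    # Parallelizable: shuffle the indices, deal them into disjoint subsets,
    # and execute each subset on its own pool thread.
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    subsets = [indices[w::nworkers] for w in range(nworkers)]

    def run_subset(subset):
        for i in subset:
            daxpy_body(i, alpha, x, y)

    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        list(pool.map(run_subset, subsets))  # wait for all subsets
    return x


y_vals = [float(i) for i in range(16)]
serial_result = run_serial(2.0, [1.0] * 16, y_vals)
parallel_result = run_parallel_subsets(2.0, [1.0] * 16, y_vals)
```

A body that read x[i+1] while another iteration wrote it would fail this kind of check, which is the class of loop the definition is intended to exclude.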
The compiler applies one of a set of transformations (see Section 4) to generate a set of parallel activities that implements the
foreach statement. The transformations available to the compiler
may depend on the type of the index expression, and the choice of
transformation may be controlled by annotations.
The index expression is evaluated on entry to the foreach
statement to yield a set of indices, which must be invariant
throughout the iteration. It is desirable that the index set should
support recursive bisection. Dense, rectangular index sets (Range
and DenseIterationSpace) are trivially bisectable; for other
region types, we envisage the introduction of a new interface
SplittableRegion defining a split operation, to allow bisection
of other region types, similar to TBB’s splitting constructor.
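To make the idea of a split operation concrete, the following Python sketch models a dense two-dimensional index set that can bisect itself. The class DenseRegion2D, its split method and the grain parameter are hypothetical names introduced for illustration; they are not the proposed SplittableRegion interface, whose details the text leaves open. The sketch splits along the dimension with the larger extent, which is one reasonable policy under the assumption that roughly square pieces are preferred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenseRegion2D:
    # Hypothetical dense rectangular region with inclusive bounds.
    lo0: int
    hi0: int
    lo1: int
    hi1: int

    def size(self):
        return (self.hi0 - self.lo0 + 1) * (self.hi1 - self.lo1 + 1)

    def split(self):
        # Bisect along the dimension with the larger extent.
        extent0 = self.hi0 - self.lo0 + 1
        extent1 = self.hi1 - self.lo1 + 1
        if extent0 >= extent1:
            mid = (self.lo0 + self.hi0) // 2
            return (DenseRegion2D(self.lo0, mid, self.lo1, self.hi1),
                    DenseRegion2D(mid + 1, self.hi0, self.lo1, self.hi1))
        mid = (self.lo1 + self.hi1) // 2
        return (DenseRegion2D(self.lo0, self.hi0, self.lo1, mid),
                DenseRegion2D(self.lo0, self.hi0, mid + 1, self.hi1))


def bisect(region, grain):
    # Recursively split until each piece holds at most `grain` indices.
    # Assumes grain >= 1, so any region being split has at least 2 indices.
    if region.size() <= grain:
        return [region]
    left, right = region.split()
    return bisect(left, grain) + bisect(right, grain)


pieces = bisect(DenseRegion2D(0, 7, 0, 3), grain=8)
```

For the 8 x 4 region in the example, two levels of bisection yield four pieces of eight indices each, all obtained by splitting the longer first dimension.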
The body of the foreach statement must be expressible as a
closure. In addition to the usual restrictions on closures – for example, var variables may not be captured – there are further restrictions specific to foreach. A conditional atomic statement (when)
may not be included as it could introduce ordering dependencies
between iterations. Unconditional atomic may be included as it
cannot create an ordering dependency. These restrictions may be
checked dynamically in the same manner as the X10 runtime currently enforces restrictions on atomic statements. Apart from these
restrictions, foreach is composable with other X10 constructs including finish, async and at.
Correct execution of foreach assumes no preemption of X10
activities; each activity created by foreach runs to completion on
the worker thread which started it. There is an implied finish; all
activities created by foreach must terminate before progressing to
the next statement following the construct.
3.1 Reduction Expression
Along with parallel iteration, reduction is a key parallel pattern and
a feature of many scientific codes. The foreach statement may be
enhanced to provide a parallel reduction expression as follows:
    result:U = reduce[T,U]
        (reducer:(a:T, b:U) => U, identity:U)
        foreach (Index in IterationSpace) {
            Stmt
            offer Exp:T;
        };
Each value of type T offered by the body is combined into a reduction variable of type U using the provided reducer function reducer:(a:T, b:U)=> U, starting from an identity value identity:U. In the common case where T and U are the same type, the identity satisfies reducer(identity, x)== x.
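One way such a reduction can avoid synchronization inside the loop is to let each worker fold its own offers into a private accumulator and combine the partial results at the end. The Python sketch below illustrates this under two stated assumptions: T and U are the same type, and the reducer is associative and commutative, so partial results may be combined in any grouping. The function reduce_foreach is a hypothetical name, not part of X10.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_foreach(indices, offer, reducer, identity, nworkers=4):
    # Deal the indices into one chunk per worker.
    indices = list(indices)
    chunks = [indices[w::nworkers] for w in range(nworkers)]

    def partial_reduce(chunk):
        # Each worker starts from the identity and folds in its own offers,
        # so no synchronization is needed inside the loop.
        acc = identity
        for i in chunk:
            acc = reducer(offer(i), acc)
        return acc

    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        partials = list(pool.map(partial_reduce, chunks))

    # Combine the per-worker partial results sequentially.
    result = identity
    for p in partials:
        result = reducer(p, result)
    return result


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 0.5, 0.5, 0.5]
dot_prod = reduce_foreach(range(len(x)), lambda i: x[i] * y[i],
                          lambda a, b: a + b, 0.0)
```

An empty index set simply returns the identity, which is why the identity value is part of the construct.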
For example, the following code computes a vector dot product of
arrays x and y:
    val x:Rail[Double];
    val y:Rail[Double];
    val dotProd = reduce[Double](
        (a:Double, b:Double) => a+b, 0.0)
        foreach (i in 0..(x.size-1)) {
            offer (x(i) * y(i));
        };
3.2 Worker Local Data
A common feature of parallel iteration is the use of intermediate
data structures. For example, in the loop in Figure 2, iterations of
the loop that execute in parallel must operate on a separate copy of
the intermediate array temp to avoid data races. Note that it is not
necessary that each iteration of the loop has a private copy of the
array, only that no two iterations that execute in parallel are allowed
to share a copy.
It is possible to simply allocate a separate copy of the intermediate data for each iteration, for example:
    val input:Rail[Double];
    val output:Rail[Double];
    foreach (i in 0..(ITERS-1)) {
        val temp = new Rail[Double](N);
        for (j in 0..(N-1)) {
            temp(j) = computeTemp(i, input(j));
        }
        output(i) = computeOutput(i, temp);
    }
However, for data structures of any significant size, repeated
allocation is unlikely to be efficient due to increased load on the
garbage collector.
An alternative option in Native X10 (using the C++ backend) is
stack allocation, as follows:
    val input:Rail[Double];
    val output:Rail[Double];
    foreach (i in 0..(ITERS-1)) {
        @StackAllocate val temp =
            @StackAllocateUninitialized new Rail[Double](N);
        for (j in 0..(N-1)) {
            temp(j) = computeTemp(i, input(j));
        }
        output(i) = computeOutput(i, temp);
    }
The annotation @StackAllocate indicates that a variable should be allocated on the stack, rather than the heap. The second annotation, @StackAllocateUninitialized, indicates that the constructor call should be elided, leaving the storage uninitialized. This avoids the cost of zeroing memory, but should be used with care to ensure values are not read before they are initialized. Stack allocation is a good choice for many applications; however, it is limited to variables that will fit on the stack (no large arrays), and is not supported in Managed X10 (using the Java backend).
As an alternative to either duplication or stack allocation, we
propose a new class, x10.compiler.WorkerLocal, which provides
a lazy-initialized worker-local store. A worker-local store is created
with an initializer function; the first time a given worker thread
accesses the store, the initializer is called to create its local copy
of the data. The definition of foreach can be extended to support
worker-local data as follows:
    foreach (Index in IterationSpace)
    local (
        val l1 = Initializer1;
        val l2 = Initializer2;
    ) {
        Stmt
    };
The value initializers in the local block may capture the environment of the parallel iteration, but may not reference any symbol defined inside the body. The body of the iteration may refer to any of the variables defined within the local block. Because the body may not include blocking statements, each execution of the body must run to completion on the worker thread on which it began; it therefore has exclusive access to its worker-local data for the entire duration.
The x10.compiler.WorkerLocal class is very similar in design to TBB’s enumerable_thread_specific type, which is also a lazy-initialized thread-local store.
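The lazy-initialization behavior can be sketched in Python using thread-local storage. This is a hypothetical analogue written for illustration, not the X10 class itself: the Python WorkerLocal class, its get method and the init_count field are assumptions, and the arithmetic in body merely stands in for computeTemp and computeOutput. Each pool thread runs one task at a time to completion, which mirrors the no-preemption assumption that gives the body exclusive access to its scratch array.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WorkerLocal:
    """Lazily initialized per-thread store (hypothetical Python analogue)."""

    def __init__(self, initializer):
        self._initializer = initializer
        self._local = threading.local()
        self._lock = threading.Lock()
        self.init_count = 0  # number of worker-local copies created so far

    def get(self):
        # The first access from a given thread runs the initializer;
        # later accesses from that thread reuse the same object.
        if not hasattr(self._local, "value"):
            self._local.value = self._initializer()
            with self._lock:
                self.init_count += 1
        return self._local.value


N, ITERS, NWORKERS = 8, 100, 4
inp = [float(j) for j in range(N)]
output = [0.0] * ITERS
temp_store = WorkerLocal(lambda: [0.0] * N)


def body(i):
    temp = temp_store.get()   # this worker's private scratch array
    for j in range(N):
        temp[j] = i * inp[j]  # stand-in for computeTemp(i, input(j))
    output[i] = sum(temp)     # stand-in for computeOutput(i, temp)


with ThreadPoolExecutor(max_workers=NWORKERS) as pool:
    list(pool.map(body, range(ITERS)))
```

However many iterations run, at most one scratch array is created per worker thread, in contrast to the per-iteration allocation shown earlier in this section.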
4. Implementation
The foreach, reduce and local keywords can be supported in X10 by extending the language syntax; however, we have not implemented these changes in the compiler. Instead, we have created two new classes, x10.compiler.Foreach and x10.compiler.WorkerLocal, which are intended as targets for future versions of the language; in the interim, these classes can be used directly from user code.
Given the definition of the foreach statement in Section 3, a
variety of code transformations are possible. The X10 compiler
should provide an efficient default transformation (for example,
recursive bisection), combined with annotations to allow the user
to choose different transformations for particular applications.
To illustrate some possible transformations, we consider a simple DAXPY implemented using a foreach statement over a LongRange:
    foreach (i in lo..hi) {
        x(i) = alpha * x(i) + y(i);
    }
As a first step, the body of the foreach is extracted into a closure that executes sequentially over a range of indices, whose bounds are passed as parameters:
    val body = (min_i:Long, max_i:Long) => {
        for (i in min_i..max_i) {
            x(i) = alpha * x(i) + y(i);
        }
    };
The body closure is then used to construct a parallel iteration
using one of the code transformations in the following subsections.
4.1 Basic
The basic transformation can be applied to any iterable index set, to create a separate activity for each index:
    finish for (i in lo..hi) async body(i, i);
This is equivalent to the original definition of foreach.
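The combination of closure extraction and the basic transformation can be mimicked in Python with a thread pool. In this sketch, make_body and foreach_basic are hypothetical names; submitting a task stands in for async, and waiting on all tasks stands in for the enclosing finish. The point of the example is the task count: one task is created per index, regardless of how many workers exist.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def make_body(alpha, x, y):
    # The extracted closure: runs sequentially over the inclusive range.
    def body(min_i, max_i):
        for i in range(min_i, max_i + 1):
            x[i] = alpha * x[i] + y[i]
    return body


def foreach_basic(lo, hi, body, nworkers=4):
    # One task per index, mirroring `finish for (i in lo..hi) async body(i, i)`.
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        tasks = [pool.submit(body, i, i) for i in range(lo, hi + 1)]
        wait(tasks)  # plays the role of the enclosing finish
    return len(tasks)


x = [1.0] * 10
y = [float(i) for i in range(10)]
num_tasks = foreach_basic(0, 9, make_body(3.0, x, y))
```

With a body this small, the cost of creating and scheduling each task is likely to exceed the cost of the arithmetic it performs.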
4.2 Block Decomposition
A block decomposition can be applied to any countable index set,
and divides the indices into contiguous blocks of approximately
equal size. By default, Runtime.NTHREADS blocks are created, one
for each worker thread. Each block is executed as a separate async,
except for the first block which is executed synchronously by the
worker thread that started the loop.
    val numElem = hi - lo + 1;
    val blockSize = numElem / Runtime.NTHREADS;
    val leftOver = numElem % Runtime.NTHREADS;
    finish {
        for (var t:Long = Runtime.NTHREADS-1; t > 0; t--) {
            // the first leftOver blocks each hold one extra index
            val tLo = lo + ((t < leftOver) ?
                t*(blockSize+1) : t*blockSize + leftOver);
            // body takes inclusive bounds, so tHi is the last index of the block
            val tHi = tLo + ((t < leftOver) ?
                (blockSize+1) : blockSize) - 1;
            async body(tLo, tHi);
        }
        // block 0 runs synchronously on the worker that started the loop
        body(lo, lo + blockSize + ((leftOver > 0) ? 1 : 0) - 1);
    }
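The index arithmetic of the block decomposition is easy to get wrong by one, so it helps to compute the bounds for a small case. The Python sketch below uses the hypothetical helpers block_bounds, foreach_block and daxpy_range; the number of blocks is passed explicitly instead of being read from the runtime. It returns inclusive bounds for every block and then runs the first block on the calling thread while the others run as pool tasks.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def block_bounds(lo, hi, nblocks):
    # Inclusive (first, last) bounds of each block; the first `left_over`
    # blocks receive one extra index.
    num_elem = hi - lo + 1
    block_size, left_over = divmod(num_elem, nblocks)
    bounds = []
    for t in range(nblocks):
        t_lo = lo + t * block_size + min(t, left_over)
        t_hi = t_lo + block_size + (1 if t < left_over else 0) - 1
        bounds.append((t_lo, t_hi))
    return bounds


def foreach_block(lo, hi, body, nthreads=4):
    bounds = block_bounds(lo, hi, nthreads)
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # Blocks 1..nthreads-1 become separate tasks ...
        tasks = [pool.submit(body, b_lo, b_hi) for b_lo, b_hi in bounds[1:]]
        # ... while the first block runs on the calling thread.
        body(*bounds[0])
        wait(tasks)


def daxpy_range(alpha, x, y):
    def body(min_i, max_i):
        for i in range(min_i, max_i + 1):
            x[i] = alpha * x[i] + y[i]
    return body


x = [1.0] * 10
y = [float(i) for i in range(10)]
foreach_block(0, 9, daxpy_range(2.0, x, y))
```

For ten indices and four blocks, block_bounds(0, 9, 4) gives (0, 2), (3, 5), (6, 7) and (8, 9). When there are fewer indices than blocks, the trailing blocks have a last index below their first index and therefore execute no iterations.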
4.3 Recursive Bisection
A recursive bisection transformation can be applied to any splittable index set. In this approach, the index set is divided into two
approximately equal pieces, with each piece constituting an activity. Bisection recurs until a certain minimum grain size is reached.
For multidimensional index sets, bisection applies preferentially to
the largest dimension.
    static def doBisect1D(lo:Long, hi:Long,
            grainSize:Long,
            body:(min:Long, max:Long)=>void) {
        if ((hi-lo) > grainSize) {
            async doBisect1D((lo+hi)/2L, hi, grainSize, body);
            doBisect1D(lo, (lo+hi)/2L, grainSize, body);
        } else {
            body(lo, hi-1);
        }
    }
    finish doBisect1D(lo, hi+1, grainSz, body);
With the recursive bisection transformation, if a worker thread’s
deque contains any activities, then the activity at the bottom of
the deque will represent at least half of the index set held by that
worker. Thus idle workers tend to steal large contiguous chunks of
the index set, preserving locality.
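The shape of the task tree produced by one-dimensional bisection can be observed with a small Python sketch. A fixed thread pool is used here in place of a work-stealing scheduler, so the sketch reproduces only the decomposition, not the stealing behavior; foreach_bisect is a hypothetical name, and the grain size is assumed to be at least 1. Tasks never wait on one another, so the pool cannot deadlock; the finish is emulated by draining a queue of spawned tasks.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def foreach_bisect(lo, hi, grain_size, body, nworkers=4):
    pending = queue.SimpleQueue()  # tasks spawned so far and not yet awaited

    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        def do_bisect(b_lo, b_hi):
            # b_hi is exclusive, as in doBisect1D.
            if b_hi - b_lo > grain_size:
                mid = (b_lo + b_hi) // 2
                # Spawn the upper half as a new task ...
                pending.put(pool.submit(do_bisect, mid, b_hi))
                # ... and keep working on the lower half in this task.
                do_bisect(b_lo, mid)
            else:
                body(b_lo, b_hi - 1)

        do_bisect(lo, hi + 1)
        # Emulate finish: a task enqueues its children before it completes,
        # so draining the queue waits for every transitively spawned task.
        while not pending.empty():
            pending.get().result()


leaves = []  # list.append is thread-safe in CPython, so no lock is used here
foreach_bisect(0, 99, 16, lambda a, b: leaves.append((a, b)))
leaves.sort()
```

For 100 indices and a grain size of 16, three levels of bisection produce eight leaf ranges of 12 or 13 indices each, which together cover 0..99 exactly once.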
5. Evaluation
We identified a number of application kernels representing common patterns in high-performance scientific applications. The use of kernels instead of full applications allows the effects of data-parallel transformations to be studied in a simplified context free from scheduling effects due to other parts of the applications. Using these kernels, we compare the different compiler transformations for foreach and the different storage options for intermediate data structures. Finally, we compare the performance of the X10 versions of these kernels with versions written in C++ with OpenMP and/or TBB.
5.1 Application Kernels
5.1.1 DAXPY
The DAXPY kernel updates each element of a vector as $x_i = \alpha x_i + y_i$.
    foreach (i in 0..(N-1)) {
        x(i) = alpha * x(i) + y(i);
    }
5.1.2 Dense Matrix Multiply
The MatMul kernel is an inner-product formulation of dense matrix multiplication which updates each element $c_{i,j} \leftarrow \sum_{k=1}^{K} a_{i,k} b_{k,j}$. Parallel iteration is over a two-dimensional index set.
    foreach ([j, i] in 0..(N-1) * 0..(M-1)) {
        var temp:Double = 0.0;
        for (k in 0..(K-1)) {
            temp += a(i + k*M) * b(k + j*K);
        }
        c(i + j*M) = temp;
    }
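The flattened indices in this kernel correspond to column-major storage: element (i, k) of the M x K matrix a is stored at a(i + k*M), and likewise for b and c. The following Python sketch reproduces the same indexing on a 2 x 3 by 3 x 2 example, with each (j, i) pair executed as an independent task on a thread pool. The function name matmul_colmajor and the sample values are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def matmul_colmajor(a, b, M, K, N, nworkers=4):
    # a is M x K, b is K x N, c is M x N, all stored column-major in flat lists.
    c = [0.0] * (M * N)

    def body(ji):
        j, i = ji
        temp = 0.0
        for k in range(K):
            temp += a[i + k * M] * b[k + j * K]
        c[i + j * M] = temp  # each (j, i) pair writes a distinct element

    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        list(pool.map(body, product(range(N), range(M))))
    return c


# A = [[1, 2, 3], [4, 5, 6]] and B = [[7, 8], [9, 10], [11, 12]],
# flattened column by column.
a = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
b = [7.0, 9.0, 11.0, 8.0, 10.0, 12.0]
c = matmul_colmajor(a, b, M=2, K=3, N=2)
```

The product A*B is [[58, 64], [139, 154]], which appears in c column by column as 58, 139, 64, 154.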
Code for all versions of the DAXPY and Matrix Multiplication kernels can be found in ANUChem¹.
5.1.3 Sparse Matrix-Vector Multiply
The SpMV kernel is taken from the X10 Global Matrix Library (GML) [3], available for download at http://x10-lang.org. It performs sparse matrix-vector multiplication and forms the basis of many GML algorithms.
    foreach (col in 0..(A.N-1)) {
        val colA = A.getCol(col);
        val v2 = B.d(offsetB+col);
        for (ridx in 0..(colA.size()-1)) {
            val r = colA.getIndex(ridx);
            val v1 = colA.getValue(ridx);
            C.d(r+offsetC) += v1 * v2;
        }
    }
5.1.4 Jacobi Iteration
The Jacobi kernel combines a stencil update of interior elements of a two-dimensional region with a reduction of an error residual. The Jacobi benchmark is available in the X10 applications repository².
    error = reduce[Double](
        (a:Double, b:Double) => { return a + b; }, 0.0)
        foreach (i in 1..(n-2)) {
            var my_error:Double = 0.0;
            for (j in 1..(m-2)) {
                val resid = (ax * (uold(i-1, j) + uold(i+1, j)) +
                    ay * (uold(i, j-1) + uold(i, j+1)) +
                    b * uold(i, j) - f(i, j)) / b;
                u(i, j) = uold(i, j) - omega * resid;
                my_error += resid * resid;
            }
            offer my_error;
        };
5.1.5 LULESH Hourglass Force
LULESH2.0 [4, 5] is a mini-app for hydrodynamics on an unstructured mesh. It models an expanding shock wave in a single material originating from a point blast. The simulation iterates over a series of time steps up to a chosen end time. At each time step, node-centered kinematic variables and element-centered thermodynamic variables are advanced to a new state. The new values for each node/element depend on the values for neighboring nodes and elements at the previous time step. A model implementation is provided using C++, OpenMP and MPI; we ported this implementation to X10.
The LULESH application contains a number of important computational kernels which update different node and element variables. The kernel which computes the Flanagan-Belytschko anti-hourglass force for a single grid element accounts for the largest portion (around 20%) of the application runtime. It requires a number of intermediate data structures which are all small 1D or 2D arrays. The LULESH Hourglass Force kernel is available in the X10 applications repository³.
    foreach (i in 0..(numElem-1))
    local (
        val hourgam = new Array_2[Double](hourgamStore, 8, 4);
        val xd1 = new Rail[Double](8);
        ...
    )
    {
        val i3 = 8*i2;
        val volinv = 1.0 / determ(i2);
        for (i1 in 0..3) {
            ...
            val setHourgam = (idx:Long) => {
                hourgam(idx, i1) = gamma(i1, idx)
                    - volinv * (dvdx(i3+idx) * hourmodx
                    + dvdy(i3+idx) * hourmody
                    + dvdz(i3+idx) * hourmodz);
            };
            setHourgam(0);
            setHourgam(1);
            ...
            setHourgam(7);
        }
        ...
        calcElemFBHourglassForce(xd1, yd1, zd1, hourgam,
            coefficient, hgfx, hgfy, hgfz);
        ...
    }
5.2 Experimental Setup
The kernels described in §5.1 were executed on an Intel Xeon E5-4657L v2 @ 2.4 GHz. The machine has four sockets, each with 12 cores supporting 2-way SMT for a total of 96 logical cores. X10 version 2.5.2 was modified to implement the x10.compiler.Foreach and x10.compiler.WorkerLocal classes as described in Section 4. GCC version 4.8.2 was used for post-compilation of the Native X10 programs, as well as for the C++ versions of the kernels. Intel TBB version 4.3 update 4 was used for the TBB versions of the kernels. Each kernel was run for a large number of iterations (100-5000, enough to generate a minimum total runtime of several seconds), recording the mean time over a total of 30 test runs.
5.3 Comparison of Compiler Transformations
We first compare the efficiency of parallel iteration using the compiler transformations described in Section 4. Each kernel was compiled using the basic, block and recursive bisection transformations.
Figure 3 shows the scaling with number of threads for each kernel using the different transformations. Parallel speedup (single-threaded time / multi-threaded time) is reported with respect to a baseline of the best mean time per iteration for a single thread.
[Figure 3 panels: (a) DAXPY, (b) MatMul, (c) Jacobi, (d) SpMV, (e) LULESH Hourglass Force. Each panel plots parallel speedup against number of threads (0 to 96) for the block, bisect and basic transformations.]
Figure 3: Scaling with number of threads using different X10 compiler transformations.
¹ https://sourceforge.net/projects/anuchem/
² http://svn.code.sourceforge.net/p/x10/code/applications/jacobi
³ http://svn.code.sourceforge.net/p/x10/code/applications/lulesh2
The results for the basic transformation illustrate why the original definition of foreach in X10 was infeasible: for all kernels,
the basic transformation fails to achieve parallel speedup for any
number of threads.
The block and bisect transformations are more promising: all
codes show some speedup up to at least 32 threads. The results fail
to completely separate the two transformations; for each kernel,
each transformation exhibits a greater parallel speedup for some
portion of the tested thread range (2–96). The block transformation achieves the greatest maximum speedup for DAXPY, Jacobi
and LULESH, whereas the 1D and 2D bisection transformations
achieve the greatest speedup for SpMV and MatMul respectively.
The fact that neither transformation is obviously superior indicates
the importance of allowing the programmer to choose between
them on a per-application or even per-loop basis.
We next compare the three different approaches to storage of local data that were discussed in Section 3.2. Figure 4 shows the scaling with number of threads for the LULESH hourglass force kernel using per-iteration heap allocation, stack allocation, and x10.compiler.WorkerLocal.
The greatest total speedup for LULESH (×22) is achieved with 56 threads using stack allocation; however, there is not a significant performance difference between the three approaches
over the entire range. The intermediate data structures are not large
(no array is larger than 32 8-byte elements), so it may be that the
cost of allocating multiple copies is insignificant compared to other
factors. Other application examples are needed to more thoroughly
evaluate approaches to storing local data.
[Figure 4 plots parallel speedup against number of threads (0 to 96) for heap, stack and local storage. Best single-thread time per iteration: 18.19 ms.]
Figure 4: LULESH Hourglass Force kernel scaling with number of threads using different approaches for storage of local data.
5.4 Comparison of Programming Models
We implemented the kernels listed in §5.1 using OpenMP and
TBB. OpenMP codes used static scheduling (schedule(static), which assigns contiguous blocks of iterations to threads) for parallel for loops,
and TBB codes used the default auto_partitioner.
Figure 5 shows the scaling with number of threads for each
kernel using the different programming models. Parallel speedup
is normalized to the best single-thread time for any of the three
models. Each programming model achieves the greatest maximum
speedup for one of the kernels. For the DAXPY kernel, OpenMP
significantly outperforms both X10 and TBB. TBB was not tested
for the Jacobi or LULESH kernels.
None of the kernel codes presented here achieve anything near perfect parallel speedup across the full range of threads tested. The maximum speedup achievable for a code depends on many factors in addition to the programming model, including: the level of parallelism available in the algorithm; the balance between floating point, memory and other operations; and cache locality. We hope to further explore these issues with regard to particular kernels, to determine whether enhancements to the X10 scheduler (for example, support for affinity-based scheduling [7]) are necessary to achieve greater parallel performance.
[Figure 5 panels: (a) DAXPY, (b) MatMul, (c) Jacobi, (d) LULESH Hourglass Force. Each panel plots parallel speedup against number of threads (0 to 96) for X10 and C++/OMP; the DAXPY and MatMul panels also show TBB.]
Figure 5: Scaling with number of threads using X10, OpenMP and TBB.
6. Conclusion
This paper presented the foreach construct, a new standard mechanism for local parallel iteration in the X10 language. It was shown
that this mechanism achieves parallel speedup comparable with
OpenMP and TBB for a range of kernels typical of high performance scientific codes. None of the compiler transformations are
novel, nor is the provision of a mechanism for worker-local data.
However, the mechanisms presented in this paper are composable
with the X10 APGAS model, which exposes data locality in the
form of places and supports asynchronous remote activities. The
mechanisms presented here therefore represent a further step towards a programming model that can express the full range of parallelism in modern high performance computing systems.
Acknowledgments
Thanks to Olivier Tardieu and David Grove for their advice on
many important details of the X10 runtime and compiler. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research
under Award Number DE-SC0008923.
References
[1] OpenMP application program interface version 4.0. Technical report, OpenMP Architecture Review Board, Jul 2013. URL http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.
[2] Intel Threading Building Blocks reference manual version 4.2. Technical report, Intel Corporation, 2014.
[3] S. S. Hamouda, J. Milthorpe, P. E. Strazdins, and V. Saraswat. A resilient framework for iterative linear algebra applications in X10. In
16th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2015), May 2015.
[4] I. Karlin, J. Keasler, and R. Neely. LULESH 2.0 updates and changes.
Technical Report LLNL-TR-641973, August 2013.
[5] LULESH. Hydrodynamics Challenge Problem, Lawrence Livermore
National Laboratory. Technical Report LLNL-TR-490254.
[6] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier, July 2012. ISBN
9780123914439.
[7] A. Robison, M. Voss, and A. Kukanov. Optimization via reflection on
work stealing in TBB. In Proceedings of the 22nd IEEE International
Parallel and Distributed Processing Symposium (IPDPS 2008), pages
1–8, 2008.
[8] O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur,
V. Saraswat, A. Shinnar, M. Takeuchi, and M. Vaziri. X10 and APGAS
at petascale. In Proceedings of the 19th ACM SIGPLAN symposium on
Principles and Practice of Parallel Programming (PPoPP ’14), pages
53–66, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2656-8.
[9] X. Teruel, M. Klemm, K. Li, X. Martorell, S. Olivier, and C. Terboven.
A proposal for task-generating loops in OpenMP. In Proceedings of the
9th International Workshop on OpenMP (IWOMP 2013), pages 1–14.
2013.