Cilk – How To Parallelize Your Application with Three Simple Keywords

Brandon Hewitt
Technical Consulting Engineer
Intel Compiler and Languages
Software & Services Group, Developer Products Division
Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
7/1/2010
Agenda
• A quick summary of what Cilk is (and is not)
• Why Cilk
• Cilk keywords
– cilk_spawn and cilk_sync
– What a cilk_spawn does
– Load balancing and spawn overhead
– cilk_for
– What a cilk_for does
– Comparing a cilk_for to a for loop with a cilk_spawn
– Implicit cilk_syncs
Agenda continued
• Serialization and Serial Semantics
• Composability
• Setting the Worker Count
• An intro to reducers
• Upcoming Cilk topics
• References and Contact Information
• Q&A
What is Cilk?
• An extension to C and C++ for expressing fine-grained task parallelism
– Shared memory multiprocessing, not distributed memory
• Three keywords, in an API that evolved over 15 years
• Useful for programs with serial semantics
– Meaning of the program does not depend on concurrency
– Contrast: producer-consumer, data-flow languages, client-server
• Mechanism to resolve data races through reducers
– Easy, efficient and readable
• Available now in beta versions of Intel compiler products
– Cilk++ SDK on http://whatif.intel.com
Why Cilk?
• Established and award-winning parallel technology
– HPC Challenge Class 2 award 2006
– 1995 Computer Chess World Championship award
• Simple
– Keywords: _Cilk_spawn, _Cilk_for, _Cilk_sync
• Yet powerful
– Efficient work-stealing runtime scheduler
– Reducers resolve data races without locks
• Composable
• Serial Semantics
• Supports C and C++
Cilk keywords
• Cilk adds three keywords to C and C++:
_Cilk_spawn
_Cilk_sync
_Cilk_for
• If you #include <cilk/cilk.h>, you can write the
keywords as cilk_spawn, cilk_sync, and
cilk_for.
cilk_spawn and cilk_sync
• cilk_spawn (or _Cilk_spawn) gives the runtime
permission to run a child function asynchronously.
– No 2nd thread is created or required!
– If there are no available workers, then the child will
execute as a serial function call.
– The scheduler may steal the parent and run it in parallel
with the child function.
– The parent is not guaranteed to run in parallel with the
child.
• cilk_sync (or _Cilk_sync) waits for all children to
complete.
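The bullets above can be sketched as a small example. This is only an illustration: sum_range and sum_two_halves are hypothetical names, and the fallback macros are an assumption so that the code also builds, as its serialization, with a compiler that has no Cilk support.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a Cilk-enabled compiler defines __cilk. Otherwise the keywords
// are defined away, which yields the serial version of the same program.
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

// Sums the elements v[lo] .. v[hi-1] serially.
static long sum_range(const std::vector<int>& v, std::size_t lo, std::size_t hi)
{
    long s = 0;
    for (std::size_t i = lo; i < hi; ++i) s += v[i];
    return s;
}

// Hypothetical helper: the left half is spawned, the right half is the continuation.
long sum_two_halves(const std::vector<int>& v)
{
    const std::size_t mid = v.size() / 2;
    long left = 0;
    long right = 0;
    left = cilk_spawn sum_range(v, 0, mid);   // child: may run asynchronously
    right = sum_range(v, mid, v.size());      // parent continues; may be stolen
    cilk_sync;                                // wait for the child before reading left
    return left + right;
}
```

If no second worker is available, the spawned call simply runs as an ordinary function call and the result is the same.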
Anatomy of a spawn
void f()                  // spawning function (parent)
{
    cilk_spawn g();       // spawn
    work                  // continuation
    work
    work
    cilk_sync;            // sync
    work
}

void g()                  // spawned function (child)
{
    work
    work
    work
}
A simple example
• Recursive computation of a Fibonacci number:
int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    x = cilk_spawn fib(n-1);   // Execution can continue while fib(n-1) is running.
    y = fib(n-2);
    cilk_sync;                 // Asynchronous call must complete before using x.
    return x+y;
}
Serial Execution
void f()
{
    g();
    work
    work
    work
    work
}

void g()
{
    work
    work
    work
}
Work Stealing
when no other worker is available
void f()
{
    cilk_spawn g();
    work
    work
    work
    cilk_sync;
    work
}

void g()
{
    work
    work
    work
}
Worker A executes all of f() and g(): same behavior as serial execution!
Work Stealing
when another worker is available
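void f()
{
    cilk_spawn g();
    work                  // continuation: steal!
    work
    work
    cilk_sync;
    work
}

void g()
{
    work
    work
    work
}
Worker A runs the child g(). Worker B steals the continuation of f(). The work after cilk_sync may run on either worker (Worker ?).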
Load Balancing
• The work-stealing scheduler automatically load-balances:
– An idle worker will find work to do.
– If the program has enough parallelism, then all workers will stay busy.
Work-stealing Overheads
• Spawning is cheap (3-5 times the cost of a function
call)
– Spawn early and often.
– Optimal scheduling requires that parallelism be about an
order of magnitude greater than the actual number of
cores.
• Stealing is much more expensive (requires locks
and memory barriers)
• Most spawns do not result in steals.
• The more balanced the work load, the less stealing
there is and hence the less overhead.
cilk_for loop
• Looks like a normal for loop.
cilk_for (int x = 0; x < 1000000; ++x) { … }
• Any or all iterations may execute in parallel with
one another.
• All iterations complete before program continues.
• Constraints:
– Limited to a single control variable.
– Must be able to jump to the start of any iteration at
random.
– Iterations should be independent of one another.
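A loop that meets these constraints might look like the following sketch. The function scale_and_add is a hypothetical example, and the fallback macro is an assumption so that the code also builds as an ordinary for loop without a Cilk compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a Cilk-enabled compiler defines __cilk. Otherwise cilk_for
// falls back to a plain for loop (the serialization).
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_for for
#endif

// Hypothetical example: out[i] = a * x[i] + y[i].
// Single control variable, the trip count is known up front, and each
// iteration touches only element i, so iterations are independent.
void scale_and_add(float a,
                   const std::vector<float>& x,
                   const std::vector<float>& y,
                   std::vector<float>& out)
{
    const std::size_t n = x.size();
    cilk_for (std::size_t i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];
    }
}
```

The caller is expected to size out to match x and y before the call; all iterations have completed when the function returns.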
Implementation of cilk_for
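cilk_for (int i = 0; i < 8; ++i)
    f(i);
[Diagram: the range 0-7 is split recursively in half. At each split one half is spawned and the other half is the continuation: 0-7 splits into 0-3 and 4-7, 0-3 into 0-1 and 2-3, 4-7 into 4-5 and 6-7, and the leaves are the single iterations 0 through 7.]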
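The recursive halving can be written out by hand with cilk_spawn. The sketch below is a conceptual model only, not the runtime's actual implementation: loop_divide_and_conquer and count_visits are hypothetical names, a grain size of one iteration is assumed, and the fallback macros are an assumption for compilers without Cilk support.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a Cilk-enabled compiler defines __cilk. Otherwise the keywords
// are defined away and the recursion runs serially.
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

// Runs body(i) for every i in [lo, hi) by splitting the range in half,
// spawning one half and continuing with the other.
template <typename Body>
void loop_divide_and_conquer(int lo, int hi, const Body& body)
{
    if (hi - lo <= 0) return;                            // empty range
    if (hi - lo == 1) { body(lo); return; }              // leaf: one iteration
    const int mid = lo + (hi - lo) / 2;
    cilk_spawn loop_divide_and_conquer(lo, mid, body);   // spawned half
    loop_divide_and_conquer(mid, hi, body);              // continuation
    cilk_sync;                                           // both halves done
}

// Hypothetical helper: records how often each iteration index was visited.
std::vector<int> count_visits(int n)
{
    std::vector<int> visits(static_cast<std::size_t>(n), 0);
    loop_divide_and_conquer(0, n, [&visits](int i) {
        visits[static_cast<std::size_t>(i)] += 1;        // each i touches its own slot
    });
    return visits;
}
```

For n = 8 the recursion visits the ranges 0-7, 0-3, 4-7, 0-1, 2-3, 4-5 and 6-7 before reaching the eight leaves.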
cilk_for vs. serial for with spawn
• Compare the following loops:
for (int x = 0; x < n; ++x) { cilk_spawn f(x); }
cilk_for (int x = 0; x < n; ++x) { f(x); }
• The above two loops have similar semantics, but…
• they have very different performance
characteristics.
Serial for with spawn: unbalanced
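[Diagram: Workers A and B execute the loop over the range 0-7. Each iteration spawns f(x) and leaves the rest of the range (1-7, 2-7, 3-7, 4-7, 5-7, 6-7, 7-7) as a continuation, and nearly every continuation is stolen (steal!).]
If work per iteration is small then steal overhead can be significant.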
cilk_for: Divide and Conquer
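[Diagram: the range 0-7 is split into 0-3 and 4-7. Worker B steals one half from Worker A once (steal!). Each worker then splits its own half (0-1, 2-3 and 4-5, 6-7) down to the single iterations 0 through 7 and returns, with no further steals.]
Divide and conquer results in few steals and less overhead.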
cilk_for examples
• cilk_for (int x = 0; x < 1000000; x += 2) { … }
• cilk_for (vector<int>::iterator x = y.begin();
x != y.end(); ++x) { … }
• Not allowed: cilk_for (list<int>::iterator x = y.begin();
x != y.end(); ++x) { … }
– Loop count cannot be computed in constant time for a list.
(y.end() - y.begin() is not defined.)
– Do not have random access to the elements of the list.
(y.begin() + n is not defined.)
Implicit syncs
void f() {
    cilk_spawn g();
    cilk_for (int x = 0; x < lots; ++x) {
        ...
    }                     // At end of a cilk_for body (does not sync g())
    try {                 // Before entering a try block containing a sync
        cilk_spawn h();
    }                     // At end of a try block containing a spawn
    catch (...) {
        ...
    }
}                         // At end of a spawning function
Serialization
• Every Cilk program has an equivalent serial
program called the serialization.
• The serialization is obtained by removing
cilk_spawn and cilk_sync keywords and
replacing cilk_for with for.
– The compiler will produce the serialization for you if you
compile with /Qcilk-serialize (Windows) or -cilk-serialize (Linux).
• Running with only one worker is equivalent to
running the serialization.
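The same transformation can be approximated by hand with three macro definitions. This is a sketch, assuming <cilk/cilk.h> is not included in the translation unit. The fib function follows the earlier slide, and fill_squares is a hypothetical helper added to show cilk_for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written serialization: cilk_spawn and cilk_sync disappear,
// cilk_for becomes an ordinary for.
#define cilk_spawn
#define cilk_sync
#define cilk_for for

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    x = cilk_spawn fib(n - 1);   // now a plain call
    y = fib(n - 2);
    cilk_sync;                   // now an empty statement
    return x + y;
}

// Hypothetical helper: squares[i] = i * i for i in [0, n).
std::vector<int> fill_squares(int n)
{
    std::vector<int> squares(static_cast<std::size_t>(n), 0);
    cilk_for (int i = 0; i < n; ++i) {
        squares[static_cast<std::size_t>(i)] = i * i;
    }
    return squares;
}
```

Because the program has serial semantics, this version computes the same results as the parallel one and can be debugged like any serial program.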
Serial Semantics
• A deterministic Cilk program will have the same
semantics as its serialization.
– Easier regression testing
– Easier to debug:
– Run with one core
– Run serialized
– Composable
– Strong analysis tools (Cilk-specific versions will be posted
on WhatIf)
– race detector
– parallelism analyzer
Composability
• Implicit syncs are important for making each
function call a “black box.”
• Caller does not know or care if the called function
spawns.
– Serial abstraction is maintained.
– Caller does not need to worry about data races with the
called function.
• cilk_sync synchronizes only children that were
spawned within the same function as the sync: no
action at a distance.
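A small sketch of the black-box property: fill_both spawns internally and has no explicit cilk_sync, yet its caller can treat it like any serial function because of the implicit sync at the end of a spawning function. The names set_all and fill_both are hypothetical, and the fallback macros are an assumption for compilers without Cilk support.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a Cilk-enabled compiler defines __cilk. Otherwise the keywords
// are defined away and the code runs as its serialization.
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

// Sets every element of v to value.
static void set_all(std::vector<int>& v, int value)
{
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = value;
}

// Hypothetical library function: spawns internally, no explicit cilk_sync.
void fill_both(std::vector<int>& a, std::vector<int>& b, int value)
{
    cilk_spawn set_all(a, value);   // may run in parallel with the next line
    set_all(b, value);
}                                   // implicit sync: both vectors are filled here

// The caller neither knows nor cares that fill_both spawns.
int first_elements_sum(std::vector<int>& a, std::vector<int>& b)
{
    fill_both(a, b, 7);
    return a[0] + b[0];             // safe: no child of fill_both is still running
}
```

The caller passes non-empty vectors here; nothing in first_elements_sum depends on how fill_both is implemented.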
Setting the Worker Count
• By default, Cilk will use one worker per processor
core.
• The default can be overridden by setting the
environment variable CILK_NWORKERS to a
positive integer value.
• The default can also be changed under program
control by calling:
__cilkrts_set_param("nworkers","5");
– The above must be called before the first spawning
function.
– The worker count must be a string, not an int.
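Since the count is passed as a string, a small wrapper can do the conversion. This is a sketch under stated assumptions: worker_count_arg and request_workers are hypothetical helpers, USE_CILK_RUNTIME_API is a hypothetical macro you would define only when building against a Cilk runtime that provides <cilk/cilk_api.h>, and the call is assumed to return 0 on success.

```cpp
#include <cassert>
#include <string>

#ifdef USE_CILK_RUNTIME_API        // hypothetical build switch, see lead-in
#include <cilk/cilk_api.h>
#endif

// Converts the worker count to the string form the runtime parameter expects.
std::string worker_count_arg(int workers)
{
    return std::to_string(workers);
}

// Asks the runtime for the given number of workers. Must be called before the
// first spawning function runs. Returns true only if the request was accepted.
bool request_workers(int workers)
{
    if (workers <= 0) return false;                    // must be a positive integer
    const std::string arg = worker_count_arg(workers);
#ifdef USE_CILK_RUNTIME_API
    // Assumption: a return value of 0 means success.
    return __cilkrts_set_param("nworkers", arg.c_str()) == 0;
#else
    (void)arg;
    return false;                                      // no Cilk runtime in this build
#endif
}
```

Setting CILK_NWORKERS in the environment achieves the same effect without recompiling.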
Coping with Race Bugs
• Although locking can “solve” race bugs, lock
contention can destroy all parallelism.
• Making local copies of the nonlocal variables can
remove contention, but at the cost of restructuring
program logic.
• Cilk provides reducers to mitigate data races on
nonlocal variables without the need for locks or
code restructuring.
IDEA: Different parallel branches may
see different views of the reducer.
What is a reducer?
• A reducer is a subclass of Cilk Hyperobject
– Construct that provides each thread a different view of the
hyperobject so it can be updated in a coordinated fashion
• A reducer provides a private copy of a variable to
each thread, and these copies are then merged at
the cilk_sync
• This provides access to non-local data in Cilk
regions without risking data races
• Avoids locks and the performance problems lock
contention causes
• Retains serial semantics after merge
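The idea of private views that are merged later can be modeled in plain C++. The sketch below is a conceptual model only and does not use the Cilk reducer library: sum_with_views is a hypothetical function, the branches run one after another here, and the fixed chunking stands in for whatever split the scheduler would choose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Conceptual model of a summing reducer: every branch updates its own view,
// and the views are merged in serial order at the end.
int sum_with_views(const std::vector<int>& data, std::size_t branches)
{
    if (branches == 0) branches = 1;
    std::vector<int> views(branches, 0);               // one view per branch, identity 0
    const std::size_t chunk = (data.size() + branches - 1) / branches;

    for (std::size_t b = 0; b < branches; ++b) {       // branches could run in parallel
        const std::size_t lo = std::min(b * chunk, data.size());
        const std::size_t hi = std::min(lo + chunk, data.size());
        for (std::size_t i = lo; i < hi; ++i) {
            views[b] += data[i];                       // no other branch touches views[b]
        }
    }

    int result = 0;
    for (std::size_t b = 0; b < branches; ++b) {
        result += views[b];                            // merge step, in serial order
    }
    return result;
}
```

Because integer addition is associative, the merged result is the same for any number of branches, which is what lets a reducer keep serial semantics.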
Summing Example
int compute(const X& v);
int main()
{
    const std::size_t n = 1000000;
    extern X myArray[n];
    // ...
    int result = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        result += compute(myArray[i]);
    }
    std::cout << "The result is: "
              << result
              << std::endl;
    return 0;
}
Summing Example in Cilk
int compute(const X& v);
int main()
{
    const std::size_t n = 1000000;
    extern X myArray[n];
    // ...
    int result = 0;
    cilk_for (std::size_t i = 0; i < n; ++i)
    {
        result += compute(myArray[i]);   // Race!
    }
    std::cout << "The result is: "
              << result
              << std::endl;
    return 0;
}
Locking Solution
int compute(const X& v);
int main()
{
    const std::size_t n = 1000000;
    extern X myArray[n];
    // ...
    mutex L;                             // Problems: lock overhead & lock contention.
    int result = 0;
    cilk_for (std::size_t i = 0; i < n; ++i)
    {
        int temp = compute(myArray[i]);
        L.lock();
        result += temp;
        L.unlock();
    }
    std::cout << "The result is: "
              << result
              << std::endl;
    return 0;
}
Reducer Solution
int compute(const X& v);
int main()
{
    const std::size_t ARRAY_SIZE = 1000000;
    extern X myArray[ARRAY_SIZE];
    // ...
    cilk::reducer_opadd<int> result;     // Declare result to be a summing reducer over int.
    cilk_for (std::size_t i = 0; i < ARRAY_SIZE; ++i)
    {
        result += compute(myArray[i]);   // Updates are resolved automatically without races or contention.
    }
    std::cout << "The result is: "
              << result.get_value()      // At the end, the underlying int value can be extracted.
              << std::endl;
    return 0;
}
Reducer Library
Cilk’s hyperobject library contains many commonly used reducers:
• reducer_list_append
• reducer_list_prepend
• reducer_max
• reducer_max_index
• reducer_min
• reducer_min_index
• reducer_opadd
• reducer_ostream
• reducer_basic_string
• …
You can also write your own using cilk::monoid_base and cilk::reducer.
Reducer Limitations
• Operations on a reducer must be associative to
behave deterministically
– Refer to the operators supported by a particular reducer
class for safe operations to use
• Floating point types may get different results
– Results may vary from run to run
• If using custom data types for reducers, refer to
the header for the specific reducer for requirements
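The floating-point caveat comes from the fact that floating-point addition is not associative, so the grouping chosen by the scheduler can change the rounded result. A minimal illustration in plain C++ (the two function names are hypothetical, and default compiler floating-point settings are assumed):

```cpp
#include <cassert>

// Adds the three values grouped from the left: (a + b) + c.
double sum_left_grouped(double a, double b, double c)
{
    return (a + b) + c;
}

// Adds the same three values grouped from the right: a + (b + c).
double sum_right_grouped(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, the left grouping gives 1.0, while the right grouping gives 0.0 because the 1.0 is lost when it is added to -1e20 first. A summing reducer over a floating-point type can regroup additions in the same way from run to run.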
Upcoming Topics
• Cilk In-Depth
– The full story on reducers
– How to create your own reducer
– Hyperobjects
– Cilk and Exception Handling
– Debugging and Other Tools Support
– Cilk Performance
– Other topics you’d like to see covered?
• Webinar on C++ Extended Array Notation
– J.D. Patel, June 30, 9am PDT
References and Contact Information
• Cilk documentation
– <Program Files>\Intel\Parallel Studio 2011\Composer\Documentation\en_US\compiler_c\cilk.pdf
– <Program Files>\Intel\CompilerPro12.0\Documentation\en_US\compiler_c\cilk.pdf
– /opt/intel/compilerpro12.0.x.xxx/Documentation/en_US/compiler_c/cilk.pdf
• Intel® Premier Support
– http://premier.intel.com
• User Forums
– For Intel® Parallel Composer beta - http://software.intel.com/en-us/forums/intel-parallel-composer-beta/
– For Intel® Parallel Studio beta - http://software.intel.com/en-us/forums/intel-parallel-studio-beta
– For Intel® Cilk++ SDK - http://software.intel.com/en-us/forums/intel-cilk-software-development-kit/
• My email – brandon.l.hewitt@intel.com
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED,
BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components
and reflect the approximate performance of Intel products as measured by those tests. Any
difference in system hardware or software design or configuration may affect actual performance.
Buyers should consult other sources of information to evaluate the performance of systems or
components they are considering purchasing. For more information on performance tests and on
the performance of Intel products, reference www.intel.com/software/products.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2010. Intel Corporation.
http://intel.com/software/products
Q&A
Backup
Cilk Compared with OpenMP*
• Cilk is more completely integrated into C and C++
• OpenMP* also supports Fortran
• Cilk philosophy is very different from OpenMP’s fine-grained control over scheduling, affinity, etc.
– A programmer can rarely determine how balanced the
workload of a real-world program will be.
– A scalable program should auto-balance for different core
counts and different load balances.
– System load from other applications fluctuates, so tuning
for the runtime core-count will not help in general.
• OpenMP deviates from serial semantics and is not
nearly as composable
• OpenMP is supported in Microsoft and gcc compilers
Cilk compared with Intel® Threading
Building Blocks (Intel® TBB)
• Intel Cilk and Intel® TBB were both inspired by
MIT Cilk and use similar work-stealing schedulers.
• Intel Cilk is integrated into C and C++. Intel® TBB
is a template library and is available in C++ only.
• Cilk requires compiler support. Intel® TBB is a
portable library, even available as open source.
• Both have serial semantics, but Intel® TBB’s
syntax obscures them. Debugging a Cilk program
on a single core is as easy as debugging a serial
program.