Cilk – How To Parallelize Your Application with Three Simple Keywords
Brandon Hewitt
Technical Consulting Engineer
Intel Compiler and Languages

Software & Services Group, Developer Products Division
Copyright© 2010, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
7/1/2010

Agenda
• A quick summary of what Cilk is (and is not)
• Why Cilk
• Cilk keywords
  – cilk_spawn and cilk_sync
  – What a cilk_spawn does
  – Load balancing and spawn overhead
  – cilk_for
  – What a cilk_for does
  – Comparing a cilk_for to a for loop with a cilk_spawn
  – Implicit cilk_syncs

Agenda continued
• Serialization and Serial Semantics
• Composability
• Setting the Worker Count
• An intro to reducers
• Upcoming Cilk topics
• References and Contact Information
• Q&A

What is Cilk?
• An extension to C and C++ for expressing fine-grained task parallelism
  – Shared memory multiprocessing, not distributed memory
• Three keywords in API evolved over 15 years
• Useful for programs with serial semantics
  – Meaning of the program does not depend on concurrency
  – Contrast: producer-consumer, data-flow languages, client-server
• Mechanism to resolve data races through reducers
  – Easy, efficient and readable
• Available now in beta versions of Intel compiler products
  – Cilk++ SDK on http://whatif.intel.com

Why Cilk?
• Established and award-winning parallel technology
  – HPC Challenge Class 2 award 2006
  – 1995 Computer Chess World Championship award
• Simple
  – Keywords: _Cilk_spawn, _Cilk_for, _Cilk_sync
• Yet powerful
  – Efficient work-stealing runtime scheduler
  – Reducers resolve data races without locks
• Composable
• Serial Semantics
• Supports C and C++

Cilk keywords
• Cilk adds three keywords to C and C++:
    _Cilk_spawn
    _Cilk_sync
    _Cilk_for
• If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.

cilk_spawn and cilk_sync
• cilk_spawn (or _Cilk_spawn) gives the runtime permission to run a child function asynchronously.
  – No 2nd thread is created or required!
  – If there are no available workers, then the child will execute as a serial function call.
  – The scheduler may steal the parent and run it in parallel with the child function.
  – The parent is not guaranteed to run in parallel with the child.
• cilk_sync (or _Cilk_sync) waits for all children to complete.

Anatomy of a spawn
    void f()                  // spawning function (parent)
    {
        cilk_spawn g();       // spawn
        work                  // continuation
        work
        work
        cilk_sync;            // sync
        work
    }

    void g()                  // spawned function (child)
    {
        work
        work
        work
    }
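As a minimal sketch of the spawn, continuation, and sync pattern in the anatomy above, the following C++ sums the two halves of a vector. The names sum_range and sum_two_halves are hypothetical and chosen only for illustration. The check on the __cilk macro is an assumption about how a Cilk-enabled compiler identifies itself; with any other compiler the two keywords are defined away, which gives the serial form of the same program.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Assumption: a Cilk-enabled compiler predefines __cilk. Otherwise the
// keywords are defined away, so the code builds as an ordinary serial program.
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

// Hypothetical helper: stores the sum of v[lo, hi) in out.
void sum_range(const std::vector<int>& v, std::size_t lo, std::size_t hi, long& out)
{
    assert(lo <= hi && hi <= v.size());
    const auto first = v.begin() + static_cast<std::ptrdiff_t>(lo);
    const auto last = v.begin() + static_cast<std::ptrdiff_t>(hi);
    out = std::accumulate(first, last, 0L);
}

// Parent: spawns a child for the left half, runs the right half itself
// as the continuation, and syncs before reading the child's result.
long sum_two_halves(const std::vector<int>& v)
{
    const std::size_t mid = v.size() / 2;
    long left = 0;
    long right = 0;
    cilk_spawn sum_range(v, 0, mid, left);   // child: may run asynchronously
    sum_range(v, mid, v.size(), right);      // continuation: may be stolen
    cilk_sync;                               // left is safe to read after this
    return left + right;
}
```

Calling sum_two_halves on a vector holding 1 through 5 returns 15 whether or not a second worker ever steals the continuation, which is the point of the spawn being a permission rather than a command.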
A simple example
• Recursive computation of a Fibonacci number:
    int fib(int n)
    {
        int x, y;
        if (n < 2) return n;
        x = cilk_spawn fib(n-1);   // Execution can continue while fib(n-1) is running.
        y = fib(n-2);
        cilk_sync;                 // Asynchronous call must complete before using x.
        return x+y;
    }

Serial Execution
    void f()
    {
        g();
        work
        work
        work
        work
    }

    void g()
    {
        work
        work
        work
    }

Work Stealing when no other worker is available
    void f()
    {
        cilk_spawn g();
        work
        work
        work
        cilk_sync;
        work
    }

    void g()
    {
        work
        work
        work
    }
[Figure: Worker A executes both f() and g().]
• Same behavior as serial execution!

Work Stealing when another worker is available
[Figure: the same f() and g() as on the previous slide, annotated with a "steal!" marker and the labels Worker A, Worker B, and Worker ?.]

Load Balancing
• The work-stealing scheduler automatically load-balances:
  – An idle worker will find work to do.
  – If the program has enough parallelism, then all workers will stay busy.

Work-stealing Overheads
• Spawning is cheap (3-5 times the cost of a function call)
  – Spawn early and often.
  – Optimal scheduling requires that parallelism be about an order of magnitude greater than the actual number of cores.
• Stealing is much more expensive (requires locks and memory barriers)
• Most spawns do not result in steals.
• The more balanced the work load, the less stealing there is and hence the less overhead.

cilk_for loop
• Looks like a normal for loop.
    cilk_for (int x = 0; x < 1000000; ++x) { … }
• Any or all iterations may execute in parallel with one another.
• All iterations complete before program continues.
• Constraints:
  – Limited to a single control variable.
  – Must be able to jump to the start of any iteration at random.
  – Iterations should be independent of one another.

Implementation of cilk_for
    cilk_for (int i=0; i< 8; ++i)
        f(i);
[Figure: spawn tree. The range 0-7 is split by a spawn and its continuation into 0-3 and 4-7, then into 0-1, 2-3, 4-5, and 6-7, and finally into the single iterations 0 through 7.]

cilk_for vs. serial for with spawn
• Compare the following loops:
    for (int x = 0; x < n; ++x) { cilk_spawn f(x); }
    cilk_for (int x = 0; x < n; ++x) { f(x); }
• The above two loops have similar semantics, but…
• they have very different performance characteristics.

Serial for with spawn: unbalanced
[Figure: spawn tree for the serial loop. Each spawn peels off one iteration (0, 1, 2, ... 7) and leaves the rest of the loop (1-7, 2-7, ... 7-7) as the continuation, with a "steal!" marker at nearly every step between Worker A and Worker B.]
• If the work per iteration is small, then the steal overhead can be significant.

cilk_for: Divide and Conquer
[Figure: spawn tree. The range 0-7 splits into 0-3 and 4-7, then into 0-1, 2-3, 4-5, and 6-7, then into the iterations 0 through 7, with one "steal!" marker between Worker A and Worker B.]
• Divide and conquer results in few steals and less overhead.

cilk_for examples
• cilk_for (int x = 0; x < 1000000; x += 2) { … }
• cilk_for (vector<int>::iterator x = y.begin(); x != y.end(); ++x) { … }
• cilk_for (list<int>::iterator x = y.begin(); x != y.end(); ++x) { … }   // not valid:
  – Loop count cannot be computed in constant time for a list. (y.end() - y.begin() is not defined.)
  – Do not have random access to the elements of the list. (y.begin() + n is not defined.)

Implicit syncs
    void f()
    {
        cilk_spawn g();
        cilk_for (int x = 0; x < lots; ++x) {
            ...
        }                      // At end of a cilk_for body (does not sync g())
        try {                  // Before entering a try block containing a sync
            cilk_spawn h();
        }                      // At end of a try block containing a spawn
        catch (...) {
            ...
        }
    }                          // At end of a spawning function
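The divide-and-conquer expansion of cilk_for described above can be sketched by hand with cilk_spawn and cilk_sync. This is an illustration only, not the runtime's actual implementation: divide_and_conquer_for and squares are hypothetical names, the grain size of one iteration is a simplification chosen to match the 0-7 diagram, and the __cilk check is an assumption about how a Cilk-enabled compiler identifies itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a Cilk-enabled compiler predefines __cilk. Otherwise the
// keywords are defined away and the recursion simply runs serially.
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

// Hypothetical helper, not a runtime API: calls body(i) for each i in [lo, hi).
template <typename Body>
void divide_and_conquer_for(int lo, int hi, const Body& body)
{
    if (hi - lo <= 1) {                  // grain size of one iteration
        if (lo < hi) body(lo);
        return;
    }
    const int mid = lo + (hi - lo) / 2;
    cilk_spawn divide_and_conquer_for(lo, mid, body);  // e.g. 0-3
    divide_and_conquer_for(mid, hi, body);             // e.g. 4-7
    cilk_sync;                           // all iterations finish before returning
}

// Every iteration writes a different element, so iterations are independent.
std::vector<int> squares(int n)
{
    assert(n >= 0);
    std::vector<int> out(static_cast<std::size_t>(n));
    divide_and_conquer_for(0, n, [&out](int i) {
        out[static_cast<std::size_t>(i)] = i * i;
    });
    return out;
}
```

Because each level halves the range, a single steal hands another worker half of the remaining iterations, in contrast to the serial loop with a spawn, where each steal hands over the rest of the loop one iteration at a time.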
Serialization
• Every Cilk program has an equivalent serial program called the serialization.
• The serialization is obtained by removing cilk_spawn and cilk_sync keywords and replacing cilk_for with for.
  – The compiler will produce the serialization for you if you compile with /Qcilk-serialize (Windows) or -cilk-serialize (Linux).
• Running with only one worker is equivalent to running the serialization.

Serial Semantics
• A deterministic Cilk program will have the same semantics as its serialization.
  – Easier regression testing
  – Easier to debug:
    – Run with one core
    – Run serialized
  – Composable
  – Strong analysis tools (Cilk-specific versions will be posted on WhatIf)
    – race detector
    – parallelism analyzer

Composability
• Implicit syncs are important for making each function call a “black box.”
• Caller does not know or care if the called function spawns.
  – Serial abstraction is maintained.
  – Caller does not need to worry about data races with the called function.
• cilk_sync synchronizes only children that were spawned within the same function as the sync: no action at a distance.

Setting the Worker Count
• By default, Cilk will use one worker per processor core.
• The default can be overridden by setting the environment variable CILK_NWORKERS to a positive integer value.
• The default can also be changed under program control by calling:
    __cilkrts_set_param("nworkers","5");
  – The above must be called before the first spawning function.
  – The worker count must be a string, not an int.

Coping with Race Bugs
• Although locking can “solve” race bugs, lock contention can destroy all parallelism.
• Making local copies of the nonlocal variables can remove contention, but at the cost of restructuring program logic.
• Cilk provides reducers to mitigate data races on nonlocal variables without the need for locks or code restructuring.
IDEA: Different parallel branches may see different views of the reducer.

What is a reducer?
• A reducer is a subclass of Cilk Hyperobject
  – Construct that provides each thread a different view of the hyperobject so it can be updated in a coordinated fashion
• A reducer provides a private copy of a variable to each thread, and these copies are then merged at the cilk_sync
• This provides access to non-local data in Cilk regions without risking data races
• Avoids locks and the performance problems lock contention causes
• Retains serial semantics after merge

Summing Example
    int compute(const X& v);
    int main()
    {
        const std::size_t n = 1000000;
        extern X myArray[n];
        // ...
        int result = 0;
        for (std::size_t i = 0; i < n; ++i)
        {
            result += compute(myArray[i]);
        }
        std::cout << "The result is: " << result << std::endl;
        return 0;
    }

Summing Example in Cilk
    int compute(const X& v);
    int main()
    {
        const std::size_t n = 1000000;
        extern X myArray[n];
        // ...
        int result = 0;
        cilk_for (std::size_t i = 0; i < n; ++i)
        {
            result += compute(myArray[i]);   // Race!
        }
        std::cout << "The result is: " << result << std::endl;
        return 0;
    }

Locking Solution
    int compute(const X& v);
    int main()
    {
        const std::size_t n = 1000000;
        extern X myArray[n];
        // ...
        mutex L;
        int result = 0;
        cilk_for (std::size_t i = 0; i < n; ++i)
        {
            int temp = compute(myArray[i]);
            L.lock();
            result += temp;
            L.unlock();
        }
        std::cout << "The result is: " << result << std::endl;
        return 0;
    }
• Problems: lock overhead and lock contention.

Reducer Solution
    int compute(const X& v);
    int main()
    {
        const std::size_t ARRAY_SIZE = 1000000;
        extern X myArray[ARRAY_SIZE];
        // ...
        cilk::reducer_opadd<int> result;     // Declare result to be a summing reducer over int.
        cilk_for (std::size_t i = 0; i < ARRAY_SIZE; ++i)
        {
            result += compute(myArray[i]);   // Updates are resolved automatically without races or contention.
        }
        std::cout << "The result is: "
                  << result.get_value()      // At the end, the underlying int value can be extracted.
                  << std::endl;
        return 0;
    }
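To make the idea of per-branch views concrete, here is a sketch in plain standard C++ of what a reducer does conceptually. It is not the Cilk runtime's implementation and uses no Cilk API: the names append_range and letters_in_order are hypothetical, and the two recursive calls only stand in for branches that could run in parallel. Each branch appends to its own private view, and the views are merged left before right, so the result matches the serial loop even for string append, where order matters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration of reducer "views": each branch fills its own
// private string, so no branch ever touches another branch's data.
std::string append_range(const std::vector<char>& items, std::size_t lo, std::size_t hi)
{
    assert(lo <= hi && hi <= items.size());
    std::string view;                         // identity value: the empty string
    if (hi - lo <= 2) {                       // small range: update the view directly
        for (std::size_t i = lo; i < hi; ++i) view += items[i];
        return view;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    // These two calls stand in for parallel branches; each gets its own view.
    const std::string left = append_range(items, lo, mid);
    const std::string right = append_range(items, mid, hi);
    // Merge step: left view before right view, which preserves serial order.
    return left + right;
}

// Concatenates all items in their original order.
std::string letters_in_order(const std::vector<char>& items)
{
    return append_range(items, 0, items.size());
}
```

The merge relies only on the operation being associative, not commutative: regrouping the appends leaves the final string unchanged, while reordering them would not.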
Reducer Library
Cilk’s hyperobject library contains many commonly used reducers:
• reducer_list_append
• reducer_list_prepend
• reducer_max
• reducer_max_index
• reducer_min
• reducer_min_index
• reducer_opadd
• reducer_ostream
• reducer_basic_string
• …
You can also write your own using cilk::monoid_base and cilk::reducer.

Reducer Limitations
• Operations on a reducer must be associative to behave deterministically
  – Refer to the operators supported by a particular reducer class for safe operations to use
• Floating point types may get different results
  – Results may vary from run to run
• If using custom data types for reducers, refer to the header for the specific reducer for requirements

Upcoming Topics
• Cilk In-Depth
  – The full story on reducers
  – How to create your own reducer
  – Hyperobjects
  – Cilk and Exception Handling
  – Debugging and Other Tools Support
  – Cilk Performance
  – Other topics you’d like to see covered?
• Webinar on C++ Extended Array Notation
  – J.D. Patel, June 30, 9am PDT
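A note on the floating-point caveat under Reducer Limitations: floating-point addition is not exactly associative, so the grouping chosen when views are merged can change the last bits of a sum. The sketch below uses plain C++ and hypothetical function names to show the two groupings side by side; it assumes ordinary IEEE double arithmetic without aggressive floating-point optimization flags.

```cpp
#include <cassert>

// Groups the sum as (a + b) + c, the order a serial loop would use.
double sum_left_grouped(double a, double b, double c)
{
    return (a + b) + c;
}

// Groups the sum as a + (b + c), as a merge of two partial sums might.
double sum_right_grouped(double a, double b, double c)
{
    return a + (b + c);
}

// True when the two groupings disagree for the given inputs.
bool grouping_matters(double a, double b, double c)
{
    return sum_left_grouped(a, b, c) != sum_right_grouped(a, b, c);
}
```

For example, grouping_matters(0.1, 0.2, 0.3) is true with IEEE doubles, while grouping_matters(1.0, 2.0, 3.0) is false because every intermediate value is exactly representable.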
References and Contact Information
• Cilk documentation
  – <Program Files>\Intel\Parallel Studio 2011\Composer\Documentation\en_US\compiler_c\cilk.pdf
  – <Program Files>\Intel\CompilerPro12.0\Documentation\en_US\compiler_c\cilk.pdf
  – /opt/intel/compilerpro12.0.x.xxx/Documentation/en_US/compiler_c/cilk.pdf
• Intel® Premier Support
  – http://premier.intel.com
• User Forums
  – For Intel® Parallel Composer beta - http://software.intel.com/en-us/forums/intel-parallel-composer-beta/
  – For Intel® Parallel Studio beta - http://software.intel.com/en-us/forums/intel-parallel-studio-beta
  – For Intel® Cilk++ SDK - http://software.intel.com/en-us/forums/intel-cilk-software-development-kit/
• My email
  – brandon.l.hewitt@intel.com

Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2010. Intel Corporation.
http://intel.com/software/products

Q&A

Backup

Cilk Compared with OpenMP*
• Cilk is more completely integrated into C and C++
• OpenMP* has support in Fortran
• Cilk philosophy is very different from OpenMP’s fine-grained control over scheduling, affinity, etc.
  – A programmer can rarely determine how balanced the work-load of a real-world program will be.
  – A scalable program should auto-balance for different core counts and different load balances.
  – System load from other applications fluctuates, so tuning for the runtime core-count will not help in general.
• OpenMP deviates from serial semantics and is not nearly as composable
• OpenMP is supported in Microsoft and gcc compilers

Cilk compared with Intel® Threading Building Blocks (Intel® TBB)
• Intel Cilk and Intel® TBB were both inspired by MIT Cilk and use similar work-stealing schedulers.
• Intel Cilk is integrated into C and C++. Intel® TBB is a template library and is available in C++ only.
• Cilk requires compiler support.
  Intel® TBB is a portable library, even available as open source.
• Both have serial semantics, but Intel® TBB’s syntax obscures them. Debugging a Cilk program on a single core is as easy as debugging a serial program.