Advanced Topics in Numerical Analysis: High Performance Computing
MATH-GA 2012.003 and CSCI-GA 2945.003
Georg Stadler, Courant Institute, NYU, stadler@cims.nyu.edu
Spring 2015, Monday, 5:10–7:00PM, WWH #202
April 6, 2015

Outline

- Organization issues
- Shared memory parallelism – OpenMP

Organization and today's class

Organization:
- Homework #1 solution posted. Stampede? Anybody?
- Homework #3 will be posted tonight (sorry for the delay).

Today:
- Shared memory programming (OpenMP).
- Visualization with Paraview.

Plan for the rest of this course

Your input is welcome.
- Programming: OpenCL (for GPGPUs and MICs); performance with accelerators
- Tools: introduction to shell scripting; Paraview visualization; (partitioning with Metis)
- Algorithms: multigrid and FMM

Shared memory programming model

- A program is a collection of control threads that are created dynamically.
- Each thread has private and shared variables.
- Threads can exchange data by reading/writing shared variables.
- Danger: if more than one processor core reads/writes the same memory location, the result is a race condition.

The main difference to MPI is that only one process is running, which can fork into shared memory threads.

Threads versus processes

A process is an independent execution unit that contains its own state information (pointers to instruction and stack). One process can contain several threads. Threads within a process share the same address space and communicate directly using shared variables. Each thread has a separate stack, but threads share heap memory. Using several threads can also be useful on a single processor ("multithreading"), depending on the memory latency.
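The fork-join model and the private-stack/shared-heap distinction above can be sketched in a few lines of C. This is an illustrative sketch, not part of the lecture code: the function name count_threads is made up here, the OpenMP calls are guarded so the file also compiles without OpenMP support, and it previews the atomic construct (covered later in these slides) to avoid the race condition just mentioned.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* serial fallbacks so the sketch also builds without -fopenmp */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Forks a team of threads; `count` is shared, `tid` is private.
   Returns the number of threads that joined the team. */
int count_threads(void)
{
    int count = 0;                       /* shared among all threads     */
#pragma omp parallel
    {
        int tid = omp_get_thread_num();  /* private: one copy per thread */
        printf("Hello from thread %d of %d\n",
               tid, omp_get_num_threads());
#pragma omp atomic                       /* protect the shared update    */
        count += 1;
    }                                    /* implicit join/barrier here   */
    return count;
}
```

Compiled without -fopenmp the pragmas are ignored and the function returns 1; with OpenMP enabled it returns the size of the thread team.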
Shared memory programming—Literature

OpenMP standard online: www.openmp.org
Very useful online course: www.youtube.com/user/OpenMPARB/
Recommended reading: Chapter 6 in ..., Chapter 7 in ...

Shared memory programming with OpenMP

- Split the program into serial and parallel regions.
- Race condition: more than one thread reads/writes the same shared memory location.
- Loops without data dependencies are easy to parallelize by adding #pragma directives.
- Pragmas are compiler directives external to the programming language; they are ignored by compilers that do not understand them.

Shared memory programming with OpenMP

- Provides synchronization functions analogous to the collective calls in MPI.
- Tell the compiler which parts are parallel.
- The compiler generates threaded code; usually it defaults to one thread per core (or 2/4 with hyper-threading).
- You can set the number of threads through the environment variable OMP_NUM_THREADS or inside the code using the library call omp_set_num_threads(4).
- Dependencies in parallel parts require synchronization between threads.

Shared memory programming with OpenMP

  #pragma omp parallel num_threads(3)
  {
    // ...
  }

Defines the number of threads explicitly. This can also be done through an environment variable:
  export OMP_NUM_THREADS=8   (bash)
  setenv OMP_NUM_THREADS 8   (csh)

Shared memory programming with OpenMP

  #pragma omp parallel for default(none) private(...) shared(...)
  for (i = 0; i < N; ++i) {
    // ...
  }

By default, all variables are shared in the parallel part. You can explicitly specify which variables are private and which are shared:
- private(): uninitialized private variable
- firstprivate(): private variable initialized with the outside value
- lastprivate(): the main thread returns with the value from the parallel construct

Shared memory programming with OpenMP

  #pragma omp parallel
  {
    #pragma omp for nowait
    for (i = 0; i < N; ++i) {
      // do something ...
    }
    #pragma omp for
    for (i = 0; i < N; ++i) {
      // do something else ...
    }
  }

With nowait, threads will not synchronize at the end of the first loop.

Shared memory programming with OpenMP

Race condition

Shared memory programming with OpenMP

  result = 0;
  #pragma omp parallel for reduction(+:result)
  for (i = 1; i < N; ++i) {
    result += a[i];
  }

Reduction is analogous to MPI_Reduce. Other reduction operations: -, *, max, min, ...

OpenMP example

  int main(int argc, char **argv)
  {
    /* initialize variables, allocate and fill vectors */
    {
      printf("Hello hello. I'm thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
    }
    for (p = 0; p < passes; ++p) {
      for (i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
      }
      prod = 0.0;
      for (i = 0; i < n; ++i) {
        prod += c[i];
      }
    }
    /* free memory and return */
  }

OpenMP example

  int main(int argc, char **argv)
  {
    /* initialize variables, allocate and fill vectors */
    #pragma omp parallel
    {
      printf("Hello hello. I'm thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
    }
    for (p = 0; p < passes; ++p) {
      #pragma omp parallel for default(none) shared(n, a, b, c)
      for (i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
      }
      prod = 0.0;
      #pragma omp parallel for reduction(+:prod)
      for (i = 0; i < n; ++i) {
        prod += c[i];
      }
    }
    /* free memory and return */
  }

Shared memory programming with OpenMP

  #pragma omp sections [private() reduction() nowait ...]
  {
    #pragma omp section
      structured block
    #pragma omp section
      another structured block
  }

- There is an implied barrier at the end of the SECTIONS directive.
- Sections are split among the threads; some threads may get no section, others more than one.
- No branching into or out of sections.

Shared memory programming with OpenMP

Example:

  #pragma omp parallel shared(n, a, b, x, y) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i = 0; i < n; i++)
        b[i] += a[i];
      #pragma omp section
      for (i = 0; i < n; i++)
        x[i] = sin(y[i]);
    }
  }

Shared memory programming with OpenMP

  #pragma omp single [private() nowait ...]
  {
    structured block
  }

- Only one thread in the team executes the enclosed block.
- Useful for code in a parallel environment that isn't thread-safe (such as input/output code).
- The rest of the threads wait at the end of the enclosed block (unless the NOWAIT clause is used).

Shared memory programming with OpenMP

More clauses:
- There is an implicit barrier at the end of a parallel region. A barrier can be called explicitly using #pragma omp barrier.
- Critical sections: only one thread at a time in a critical region with the same name: #pragma omp critical (name).
- Atomic operations: protect updates to an individual memory location; only a single expression is allowed: #pragma omp atomic.
- Locks: low-level library routines.

Shared memory programming with OpenMP

Critical example:

  #pragma omp parallel sections
  {
    #pragma omp section
    {
      task = produce_task();
      #pragma omp critical (task_queue)
      {
        insert_into_queue(task);
      }
    }
    #pragma omp section
    {
      #pragma omp critical (task_queue)
      {
        task = delete_from_queue(task);
      }
    }
  }

Shared memory programming with OpenMP

Atomic examples:

  #pragma omp parallel shared(n, ic) private(i)
  {
    for (i = 0; i < n; ++i) {
      // do some stuff
      #pragma omp atomic
      ic = ic + 1;
    }
  }

  #pragma omp parallel shared(n, ic) private(i)
  {
    for (i = 0; i < n; ++i) {
      // do some stuff
      #pragma omp atomic
      ic = ic + F();
    }
  }

The computation of F is not atomic, only the increment. Allowable atomic operations: x += expr, x++, x--, ...

Shared memory programming with OpenMP

Locks control access to shared variables.
- Lock variables must be accessed only through the locking routines: omp_init_lock, omp_set_lock, omp_unset_lock, omp_test_lock, omp_destroy_lock.
- A lock has type omp_lock_t.
- omp_set_lock(omp_lock_t *lock) forces the calling thread to wait until the specified lock is available.

Lock example:

  omp_lock_t my_lock;

  int main()
  {
    omp_init_lock(&my_lock);
    #pragma omp parallel
    {
      int tid = omp_get_thread_num();
      int i, j;
      for (i = 0; i < 10; ++i) {
        omp_set_lock(&my_lock);
        printf("Thread %d -- starting lock region\n", tid);
        ...
        printf("Thread %d -- finishing lock region\n", tid);
        omp_unset_lock(&my_lock);
      }
    }
    omp_destroy_lock(&my_lock);
  }

Shared memory programming with OpenMP

Demos: https://github.com/NYU-HPC15/lecture8

Visualization with Paraview

Download Paraview: http://www.paraview.org/
Paraview tutorial: http://www.paraview.org/Wiki/The_ParaView_Tutorial
Example files are checked in under /vtkdata in https://github.com/NYU-HPC15/lecture8
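As a closing illustration of the worksharing constructs from these slides (parallel for, reduction, and a named critical section), here is a small self-contained sketch. It is not one of the lecture demos; the function names dot and max_entry are illustrative, and the OpenMP header is guarded so the code also compiles serially, where the pragmas are simply ignored.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Dot product via a parallel for with a sum reduction. */
double dot(const double *a, const double *b, int n)
{
    double result = 0.0;
    int i;
#pragma omp parallel for reduction(+:result)
    for (i = 0; i < n; ++i)
        result += a[i] * b[i];
    return result;
}

/* Maximum entry, with the shared update guarded by a named critical
   section (a max reduction would also work; critical is shown here
   purely for illustration). */
double max_entry(const double *a, int n)
{
    double best = a[0];
    int i;
#pragma omp parallel for default(none) shared(a, n, best) private(i)
    for (i = 1; i < n; ++i) {
#pragma omp critical (best_update)
        if (a[i] > best)
            best = a[i];
    }
    return best;
}
```

For example, dot of {1,2,3} and {4,5,6} gives 32 with any number of threads, since the reduction combines the threads' partial sums deterministically for these small integer values.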