Slides used in class

Advanced Topics in Numerical Analysis:
High Performance Computing
MATH-GA 2012.003 and CSCI-GA 2945.003
Georg Stadler
Courant Institute, NYU
stadler@cims.nyu.edu
Spring 2015, Monday, 5:10–7:00PM, WWH #202
April 6, 2015
Outline
Organization issues
Shared memory parallelism–OpenMP
Organization and today's class
Organization:
▶ Homework #1 solution posted. Stampede? Anybody?
▶ Homework #3 will be posted tonight (sorry for the delay).
Today:
▶ Shared memory programming (OpenMP).
▶ Visualization with Paraview.
Plan for the rest of this course
Your input is welcome
▶ Programming: OpenCL (for GPGPUs and MICs); performance with accelerators
▶ Tools: intro to shell scripting; Paraview visualization; (partitioning with Metis)
▶ Algorithms: Multigrid and FMM
Outline
Organization issues
Shared memory parallelism–OpenMP
Shared memory programming model
▶ Program is a collection of control threads that are created dynamically
▶ Each thread has private and shared variables
▶ Threads can exchange data by reading/writing shared variables
▶ Danger: more than one processor core reads/writes to a memory location: a race condition
The main difference to MPI is that only one process is running, which can fork into shared memory threads.
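A minimal fork/join sketch of private vs. shared variables (names are illustrative; compile with gcc -fopenmp):

#include <omp.h>
#include <stdio.h>

int main() {
  int shared_counter = 42;            /* declared outside: shared by all threads */
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();   /* declared inside: private to each thread */
    printf("thread %d sees shared_counter = %d\n", tid, shared_counter);
  }
  return 0;
}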
Threads versus processes
A process is an independent execution unit, which contains its own state information (pointers to instructions and stack). One process can contain several threads.
Threads within a process share the same address space and communicate directly through shared variables. Each thread has a separate stack, but the heap memory is shared.
Using several threads can also be useful on a single processor ("multithreading"), e.g., to hide memory latency.
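A small sketch of shared heap vs. private stack (illustrative only; slots for missing threads stay zero because calloc zero-initializes):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  int *heap_data = calloc(4, sizeof(int)); /* heap allocation: shared by all threads */
  #pragma omp parallel num_threads(4)
  {
    int tid = omp_get_thread_num();        /* stack variable: private per thread */
    heap_data[tid] = tid * tid;            /* each thread writes its own slot */
  }
  for (int i = 0; i < 4; ++i)
    printf("heap_data[%d] = %d\n", i, heap_data[i]);
  free(heap_data);
  return 0;
}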
Shared memory programming—Literature
OpenMP standard online:
www.openmp.org
Very useful online course:
www.youtube.com/user/OpenMPARB/
Recommended reading:
Chapter 6 and Chapter 7 in the recommended textbooks (book covers shown on the slide)
Shared memory programming with OpenMP
▶ Split the program into serial and parallel regions
▶ Race condition: more than one thread reads/writes to the same shared memory location
▶ Loops without data dependencies are easy to parallelize by adding #pragma commands (see the sketch below)
▶ pragmas are compiler directives external to the programming language; they are ignored by a compiler that does not understand them
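A minimal sketch of such a loop parallelization (function and variable names are illustrative):

#include <stdio.h>

void vector_add(int n, const double *a, const double *b, double *c) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];   /* iterations are independent, so one pragma suffices */
}

int main() {
  double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, c[4];
  vector_add(4, a, b, c);
  printf("c[0] = %g\n", c[0]);   /* prints 5 */
  return 0;
}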
Shared memory programming with OpenMP
▶ Provides synchronization functions similar to the collective calls in MPI
▶ Tell the compiler which parts are parallel
▶ The compiler generates threaded code; usually it defaults to one thread per core (or 2/4 per core with hyper-threading)
▶ You can set the number of threads through the environment variable OMP_NUM_THREADS or inside the code using the library call omp_set_num_threads(4) (see the sketch below)
▶ Dependencies in parallel parts require synchronization between threads
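A short sketch of setting and querying the thread count (illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_num_threads(4);            /* overrides OMP_NUM_THREADS for this program */
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)   /* print once, from thread 0 */
      printf("running with %d threads\n", omp_get_num_threads());
  }
  return 0;
}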
Shared memory programming with OpenMP
#pragma omp parallel num_threads(3)
{
  // ...
}

Defines the number of threads explicitly.
It can also be set through an environment variable:
export OMP_NUM_THREADS=8 (bash)
setenv OMP_NUM_THREADS 8 (csh)
Shared memory programming with OpenMP
#pragma omp parallel for default(none) \
        private(...) shared(...)
for (i = 0; i < N; ++i) {
  // ...
}

By default, all variables in a parallel part are shared.
These clauses explicitly specify which variables are private and which are shared.
private(): uninitialized private variable
firstprivate(): private variable initialized with the outside value
lastprivate(): the main thread returns with the value from the last iteration of the parallel construct
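A minimal sketch of firstprivate/lastprivate (variable names are illustrative):

#include <stdio.h>

int main() {
  int t = 10, last = -1;
  #pragma omp parallel for firstprivate(t) lastprivate(last)
  for (int i = 0; i < 100; ++i)
    last = i + t;     /* every thread starts with t == 10 (firstprivate) */
  /* last carries the value from the final iteration i == 99, i.e. 109 */
  printf("last = %d\n", last);
  return 0;
}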
Shared memory programming with OpenMP
#pragma omp parallel
{
  #pragma omp for nowait
  for (i = 0; i < N; ++i) {
    // do something ...
  }
  #pragma omp for
  for (i = 0; i < N; ++i) {
    // do something else ...
  }
}

With nowait, threads do not synchronize at the end of the first loop (the implied barrier is removed).
Shared memory programming with OpenMP
Race condition
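A small sketch of a race condition on a shared counter (the exact output varies from run to run):

#include <stdio.h>

int main() {
  int count = 0;
  #pragma omp parallel for
  for (int i = 0; i < 100000; ++i)
    count++;   /* unsynchronized read-modify-write of shared memory: a race */
  /* typically prints less than 100000, and differs between runs */
  printf("count = %d\n", count);
  return 0;
}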
Shared memory programming with OpenMP
result = 0;
#pragma omp parallel for reduction(+: result)
for (i = 0; i < N; ++i) {
  result += a[i];
}

Reduction is analogous to MPI_Reduce.
Other reduction operations: -, *, max, min, ...
OpenMP example
int main(int argc, char **argv) {
  /* initialize variables, allocate and fill vectors */
  {
    printf("Hello hello. I'm thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  for (p = 0; p < passes; ++p) {
    for (i = 0; i < n; ++i) {
      c[i] = a[i] * b[i];
    }
    prod = 0.0;
    for (i = 0; i < n; ++i) {
      prod += c[i];
    }
  }
  /* free memory and return */
}
OpenMP example
int main(int argc, char **argv) {
  /* initialize variables, allocate and fill vectors */
  #pragma omp parallel
  {
    printf("Hello hello. I'm thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  for (p = 0; p < passes; ++p) {
    #pragma omp parallel for default(none) shared(n, a, b, c)
    for (i = 0; i < n; ++i) {
      c[i] = a[i] * b[i];
    }
    prod = 0.0;
    #pragma omp parallel for reduction(+: prod)
    for (i = 0; i < n; ++i) {
      prod += c[i];
    }
  }
  /* free memory and return */
}
Shared memory programming with OpenMP
#pragma omp sections [private() reduction() nowait ...]
{
  #pragma omp section
    structured block
  #pragma omp section
    another structured block
}

implied barrier at the end of the sections directive (unless nowait is used)
sections are split among the threads; some threads get no section, some more than one
no branching into or out of sections
Shared memory programming with OpenMP
Example
#pragma omp parallel shared(n, a, b, x, y) private(i)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    for (i = 0; i < n; i++)
      b[i] += a[i];
    #pragma omp section
    for (i = 0; i < n; i++)
      x[i] = sin(y[i]);
  }
}
Shared memory programming with OpenMP
#pragma omp single [private() firstprivate() nowait ...]
{
  structured block
}

only one thread in the team executes the enclosed block
useful for code in a parallel region that is not thread-safe (such as input/output code)
the remaining threads wait at the end of the enclosed block (unless the nowait clause is used)
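A minimal sketch of single (illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  #pragma omp parallel
  {
    /* ... work done by all threads ... */
    #pragma omp single
    printf("executed by exactly one thread\n");
    /* implied barrier: the other threads wait here */
  }
  return 0;
}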
Shared memory programming with OpenMP
More clauses
▶ Implicit barrier at the end of a parallel region. A barrier can also be inserted explicitly using #pragma omp barrier (see the sketch below).
▶ critical sections: only one thread at a time executes a critical region with the same name: #pragma omp critical [name]
▶ atomic operation: protects updates to an individual memory location. Only a single expression is allowed: #pragma omp atomic
▶ locks: low-level library routines
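A short sketch of an explicit barrier separating two phases (assumes the four requested threads are granted; global arrays are zero-initialized, so fewer threads just leave zeros):

#include <omp.h>
#include <stdio.h>

int data[64], result[64];

int main() {
  #pragma omp parallel num_threads(4)
  {
    int tid = omp_get_thread_num();
    for (int i = tid * 16; i < (tid + 1) * 16; ++i)
      data[i] = i;                  /* phase 1: each thread fills its chunk */
    #pragma omp barrier             /* wait until every chunk is filled */
    for (int i = tid * 16; i < (tid + 1) * 16; ++i)
      result[i] = data[63 - i];     /* phase 2: reads other threads' chunks */
  }
  printf("result[0] = %d\n", result[0]);  /* prints 63 */
  return 0;
}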
Shared memory programming with OpenMP
Critical Example
#pragma omp parallel sections
{
  #pragma omp section
  {
    task = produce_task();
    #pragma omp critical (task_queue)
    {
      insert_into_queue(task);
    }
  }
  #pragma omp section
  {
    #pragma omp critical (task_queue)
    {
      task = delete_from_queue(task);
    }
  }
}
Shared memory programming with OpenMP
Atomic Examples
#pragma omp parallel shared(n, ic) private(i)
{
  for (i = 0; i < n; ++i) {
    // do some stuff
    #pragma omp atomic
    ic = ic + 1;
  }
}

#pragma omp parallel shared(n, ic) private(i)
{
  for (i = 0; i < n; ++i) {
    // do some stuff
    #pragma omp atomic
    ic = ic + F();
  }
}

The computation of F() is not atomic; only the increment is.
Allowable atomic operations:
x += expr
x++
x--
Shared memory programming with OpenMP
Locks
Locks control access to shared variables.
▶ Lock variables must be accessed only through locking routines:
omp_init_lock
omp_set_lock
omp_unset_lock
omp_test_lock
omp_destroy_lock
▶ A lock has type omp_lock_t
▶ omp_set_lock(omp_lock_t *lock) forces the calling thread to wait until the specified lock is available
Shared memory programming with OpenMP
Lock example
omp_lock_t my_lock;

int main() {
  omp_init_lock(&my_lock);
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int i;
    for (i = 0; i < 10; ++i) {
      omp_set_lock(&my_lock);
      printf("Thread %d -- starting lock region\n", tid);
      ...
      printf("Thread %d -- finishing lock region\n", tid);
      omp_unset_lock(&my_lock);
    }
  }
  omp_destroy_lock(&my_lock);
}
Shared memory programming with OpenMP
Demos:
https://github.com/NYU-HPC15/lecture8
Visualization with Paraview
Download Paraview:
http://www.paraview.org/
Paraview Tutorial:
http://www.paraview.org/Wiki/The_ParaView_Tutorial
Example files are checked in under vtkdata/ in
https://github.com/NYU-HPC15/lecture8