Enabling Efficient Use of UPC and OpenSHMEM PGAS

Enabling Efficient Use of UPC and OpenSHMEM
PGAS Models on GPU Clusters
Presented at GTC ’15
Presented by
Dhabaleswar K. (DK) Panda The Ohio State University E-­‐mail: panda@cse.ohio-­‐state.edu hCp://www.cse.ohio-­‐state.edu/~panda Accelerator Era GTC ’15
•  Accelerators are becoming common in high-­‐end system architectures Top 100 – Nov 2014 (28% use Accelerators) 57% use NVIDIA GPUs 57%
28%
•  Increasing number of workloads are being ported to take advantage of NVIDIA GPUs •  As they scale to large GPU clusters with high compute density – higher the synchronizaPon and communicaPon overheads – higher the penalty •  CriPcal to minimize these overheads to achieve maximum performance 3
Parallel Programming Models Overview P1 P2 P3 P1 GTC ’15
P2 P1 P3 P2 P3 Logical shared memory Shared Memory Memory Memory Memory Memory Memory Memory Shared Memory Model Distributed Memory Model ParPPoned Global Address Space (PGAS) DSM MPI (Message Passing Interface) Global Arrays, UPC, Chapel, X10, CAF, … •  Programming models provide abstract machine models •  Models can be mapped on different types of systems - e.g. Distributed Shared Memory (DSM), MPI within a node, etc. •  Each model has strengths and drawbacks -­‐ suite different problems or applicaPons 4
Outline GTC ’15
•  Overview of PGAS models (UPC and OpenSHMEM) •  LimitaPons in PGAS models for GPU compuPng •  Proposed Designs and AlternaPves •  Performance EvaluaPon •  ExploiPng GPUDirect RDMA 5
ParPPoned Global Address Space (PGAS) Models GTC ’15
•  PGAS models, an aCracPve alternaPve to tradiPonal message passing -  Simple shared memory abstracPons -  Lightweight one-­‐sided communicaPon -  Flexible synchronizaPon •  Different approaches to PGAS -  Libraries •  OpenSHMEM •  Global Arrays •  Chapel -  Languages •  Unified Parallel C (UPC) •  Co-­‐Array Fortran (CAF) •  X10 6
OpenSHMEM GTC ’15
•  SHMEM implementaPons – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM •  Subtle differences in API, across versions – example: SGI SHMEM Quadrics SHMEM Cray SHMEM Ini8aliza8on start_pes(0) shmem_init
start_pes Process ID _my_pe my_pe shmem_my_pe •  Made applicaPons codes non-­‐portable •  OpenSHMEM is an effort to address this: “A new, open specifica3on to consolidate the various extant SHMEM versions into a widely accepted standard.” – OpenSHMEM Specifica3on v1.0 by University of Houston and Oak Ridge NaPonal Lab SGI SHMEM is the baseline 7
OpenSHMEM Memory Model GTC ’15
•  Defines symmetric data objects that are globally addressable - Allocated using a collecPve shmalloc rouPne - Same type, size and offset address at all processes/
processing elements (PEs) - Address of a remote object can be calculated based on info of local object int main (int c, char *v[]) { int *b; start_pes(); b = (int *) shmalloc (sizeof(int)); b
Symmetric Object int main (int c, char *v[]) { int *b; start_pes(); b = (int *) shmalloc (sizeof(int)); shmem_int_get (b, b, 1 , 1); } (dst, src, count, pe) b
} PE 0 PE 1 8
Compiler-­‐based: Unified Parallel C GTC ’15
•  UPC: a parallel extension to the C standard •  UPC SpecificaPons and Standards: -  IntroducPon to UPC and Language SpecificaPon, 1999 -  UPC Language SpecificaPons, v1.0, Feb 2001 -  UPC Language SpecificaPons, v1.1.1, Sep 2004 -  UPC Language SpecificaPons, v1.2, 2005 -  UPC Language SpecificaPons, v1.3, In Progress -­‐ Dram Available •  UPC ConsorPum -  Academic InsPtuPons: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland… -  Government InsPtuPons: ARSC, IDA, LBNL, SNL, US DOE… -  Commercial InsPtuPons: HP, Cray, Intrepid Technology, IBM, … •  Supported by several UPC compilers -  Vendor-­‐based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC -  Open-­‐source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC •  Aims for: high performance, coding efficiency, irregular applicaPons, … 9
UPC Memory Model GTC ’15
Thread 0 Global Shared Space Private Thread 2 Thread 1 A1[0] A1[1] y y A1[2] y Thread 3 A1[3] y Space •  Global Shared Space: can be accessed by all the threads •  Private Space: holds all the normal variables; can only be accessed by the local thread shared int A1[THREADS]; //shared variable •  Example: int main() { int y; //private variable A1[0] = 0; //local access A1[1] = 1; //remote access } 10
MPI+PGAS for Exascale Architectures and ApplicaPons GTC ’15
•  Gaining aCenPon in efforts towards Exascale compuPng •  Hierarchical architectures with mulPple address spaces •  (MPI + PGAS) Model - MPI across address spaces - PGAS within an address space •  MPI is good at moving data between address spaces •  Within an address space, MPI can interoperate with other shared memory programming models •  Re-­‐wriPng complete applicaPons can be a huge effort •  Port criPcal kernels to the desired model instead 11
Hybrid (MPI+PGAS) Programming GTC ’15
•  ApplicaPon sub-­‐kernels can be re-­‐wriCen in MPI/PGAS based on communicaPon characterisPcs •  Benefits: - Best of Distributed CompuPng Model - Best of Shared Memory CompuPng Model •  Exascale Roadmap*: - “Hybrid Programming is a pracPcal way to program exascale systems” HPC Applica8on Kernel 1 MPI Kernel 2 PGAS MPI Kernel 3 MPI Kernel N PGAS MPI 12
MVAPICH2-­‐X for Hybrid MPI + PGAS ApplicaPons GTC ’15
MPI, OpenSHMEM, UPC, CAF and
Hybrid (MPI + PGAS) Applications
OpenSHMEM Calls
UPC Calls
CAF Calls
MPI Calls
Unified MVAPICH2-X Runtime
InfiniBand, RoCE, iWARP
•  Unified communicaPon runPme for MPI, UPC, OpenSHMEM,CAF available from MVAPICH2-­‐X 1.9 : (09/07/2012) hCp://mvapich.cse.ohio-­‐state.edu •  Feature Highlights -  Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC, MPI(+OpenMP) + CAF -  MPI-­‐3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant, CAF 2015 standard compliant -  Scalable Inter-­‐node and intra-­‐node communicaPon – point-­‐to-­‐point and collecPves •  Effort underway for support on NVIDIA GPU clusters 13
Outline GTC ’15
•  Overview of PGAS models (UPC and OpenSHMEM) •  LimitaPons in PGAS models for GPU compuPng •  Proposed Designs and AlternaPves •  Performance EvaluaPon •  ExploiPng GPUDirect RDMA 14
LimitaPons of PGAS models for GPU CompuPng GTC ’15
•  PGAS memory models does not support disjoint memory address spaces -­‐ case with GPU clusters ExisPng OpenSHMEM Model with CUDA •  OpenSHMEM case PE 0 •  Copies severely limit the performance PE 0 GPU-­‐to-­‐GPU Data Movement PE 1 host_buf = shmalloc (…) cudaMemcpy (host_buf, dev_buf, . . . ) shmem_putmem (host_buf, host_buf, size, pe) shmem_barrier (…) PE 1 host_buf = shmalloc (…) shmem_barrier ( . . . ) cudaMemcpy (dev_buf, host_buf, size, . . . ) •  SynchronizaPon negates the benefits of one-­‐sided communicaPon •  Similar limitaPons in UPC 15
Outline GTC ’15
•  Overview of PGAS models (UPC and OpenSHMEM) •  LimitaPons in PGAS models for GPU compuPng •  Proposed Designs and AlternaPves •  Performance EvaluaPon •  ExploiPng GPUDirect RDMA 16
Global Address Space with Host and Device Memory heap_on_device(); /*allocated on device*/ dev_buf = shmalloc (sizeof(int)); heap_on_host(); /*allocated on host*/ host_buf = shmalloc (sizeof(int)); GTC ’15
Host Memory Host Memory Private Private Shared N shared space on host memory Shared N Device Memory Device Memory Private Private Shared N shared space on device memory Shared N Extended APIs: heap_on_device/heap_on_host a way to indicate locaPon on heap Can be similar for dynamically allocated memory in UPC 17
CUDA-­‐aware OpenSHMEM and UPC runPmes GTC ’15
•  Amer device memory becomes part of the global shared space: - Accessible through standard UPC/OpenSHMEM communicaPon APIs - Data movement transparently handled by the runPme - Preserves one-­‐sided semanPcs at the applicaPon level •  Efficient designs to handle communicaPon - Inter-­‐node transfers use host-­‐staged transfers with pipelining - Intra-­‐node transfers use CUDA IPC - Possibility to take advantage of GPUDirect RDMA (GDR) •  Goal: Enabling High performance one-­‐sided communicaPons semanPcs with GPU devices 18
Outline GTC ’15
•  Overview of PGAS models (UPC and OpenSHMEM) •  LimitaPons in PGAS models for GPU compuPng •  Proposed Designs and AlternaPves •  Performance EvaluaPon •  ExploiPng GPUDirect RDMA 19
Shmem_putmem Inter-­‐node CommunicaPon GTC ’15
Large Messages 35
30
25
20
15
10
5
0
Latency (usec)
Latency (usec)
Small Messages 22%
1
4
16
64 256 1K
4K
3000
2500
2000
1500
1000
500
0
28%
16K
Message Size (Bytes)
64K
256K
1M
Message Size (Bytes)
4M
•  Small messages benefit from selecPve CUDA registraPon – 22% for 4Byte messages •  Large messages benefit from pipelined overlap – 28% for 4MByte messages S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, Extending OpenSHMEM for GPU CompuPng, Int'l Parallel 20
and Distributed Processing Symposium (IPDPS '13) Shmem_putmem Intra-­‐node CommunicaPon GTC ’15
Large Messages 30
25
20
15
10
5
0
Latency (usec)
Latency (usec)
Small Messages 3.4X
1
4
16 64 256 1K
Message Size (Bytes)
3000
2500
2000
1500
1000
500
0
4K
•  Using IPC for intra-­‐node communicaPon •  Small messages – 3.4X improvement for 4Byte messages •  Large messages – 5X for 4MByte messages 5X
16K
64K
256K
1M
Message Size (Bytes)
4M
Based on MVAPICH2X-2.0b + Extensions
Intel WestmereEP node with 8 cores
2 NVIDIA Tesla K20c GPUs, Mellanox QDR HCA
CUDA 6.0RC1
21
30000
25000
20000
15000
10000
5000
0
512x512 1Kx1K
2Kx2K
4Kx4K
Problem Size/GPU
(192 GPUs)
8Kx8K
GTC ’15
Total Execution Time (msec)
Total Execution Time (msec)
ApplicaPon Kernel EvaluaPon: Stencil2D 14
12
10
8
6
4
2
0
65%
48
96
192
Number of GPUs
(4Kx4K problem/GPU)
•  Modified SHOC Stencil2D kernelto use OpenSHMEM for cluster level parallelism •  The enhanced version shows 65% improvement on 192 GPUs with 4Kx4K problem size/GPU •  Using OpenSHMEM for GPU-­‐GPU communicaPon allows runPme to opPmize non-­‐conPguous transfers 22
Total Execution Time (msec)
ApplicaPon Kernel EvaluaPon: BFS GTC ’15
1600
1400
1200
1000
800
600
400
200
0
12%
24
48
96
Number of GPUs
(1 million vertices/GPU with degree 32)
•  Extended SHOC BFS kernel to run on a GPU cluster using a level-­‐synchronized algorithm and OpenSHMEM •  The enhanced version shows upto 12% improvement on 96 GPUs, a consistent improvement in performance as we scale from 24 to 96 GPUs. 23
Outline GTC ’15
•  Overview of PGAS models (UPC and OpenSHMEM) •  LimitaPons in PGAS models for GPU compuPng •  Proposed Designs and AlternaPves •  Performance EvaluaPon •  ExploiPng GPUDirect RDMA 24
OpenSHMEM: Inter-­‐node EvaluaPon GTC ’15
25
20
15
10
5
0
800
6.6X
1
Host-Pipeline
GDR
4
16
64 256 1K
Message Size (bytes)
4K
Small Message shmem_get D-D
Latency (us)
Large Message shmem_put D-D
40
30
20
9X
10
Host-Pipeline
GDR
0
1
4
16
64 256 1K
Message Size (bytes)
4K
Latency (us)
Latency (us)
Small Message shmem_put D-D
Host-Pipeline
600
GDR
400
200
0
8K
32K 128K 512K 2M
Message Size (bytes)
•  GDR for small/medium message sizes •  Host-­‐staging for large message to avoid PCIe boClenecks •  Hybrid design brings best of both •  3.13 us Put latency for 4B (6.6X improvement ) and 4.7 us latency for 4KB •  9X improvement for Get latency of 4B 25
OpenSHMEM: Intra-­‐node EvaluaPon Small Message shmem_put D-H
IPC
8
6
GDR
Latency (us)
Latency (us)
10
2.6X
4
2
0
1
• 
• 
• 
• 
• 
• 
GTC ’15
4
16
64 256 1K
Message Size (bytes)
4K
14
12
10
8
6
4
2
0
Small Message shmem_get D-H
IPC
GDR
3.6X
1
4
16
64 256 1K
Message Size (bytes)
GDR for small and medium message sizes IPC for large message to avoid PCIe boClenecks Hybrid design brings best of both 2.42 us Put D-­‐H latency for 4 Bytes (2.6X improvement) and 3.92 us latency for 4 KBytes 3.6X improvement for Get operaPon Similar results with other configuraPons (D-­‐D, H-­‐D and D-­‐H) 4K
26
OpenSHMEM: Stencil3D ApplicaPon Kernel EvaluaPon Execution time (sec)
Host-Pipeline
GTC ’15
GDR
0.1
19%
0.08
0.06
0.04
0.02
0
8
16
32
64
Number of GPU Nodes
- Input size 2K x 2K x 2K - Plazorm: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-­‐IB) - New designs achieve 20% and 19% improvements on 32 and 64 GPU nodes K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C. Ching and D. K. Panda, ExploiPng GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters. (under review) 27
Conclusions GTC ’15
•  PGAS models offer lightweight synchronizaPon and one-­‐sided communicaPon semanPcs •  Low-­‐overhead synchronizaPon is suited for GPU architectures •  Extensions to the PGAS memory model to efficiently support CUDA-­‐
Aware PGAS models. •  High efficient GDR-­‐based designs for OpenSHMEM •  Plan on exploiPng the GDR-­‐based designs for UPC •  Enhanced designs are planned to be incorporated into MVAPICH2-­‐X 28
Two More Talks GTC ’15
-  Learn more on how to combine and take advantage of MulPcast and GPUDirect RDMA simultaneously •  S5507 – High Performance Broadcast with GPUDirect RDMA and InfiniBand Hardware MulPcast for Streaming ApplicaPon •  Thursday, 03/19 (Today) •  Time: 14:00 -­‐-­‐ 14:25 •  Room 210 D -  Learn about recent advances and upcoming features in CUDA-­‐
aware MVAPICH2-­‐GPU library • 
• 
• 
• 
S5461 -­‐ Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Thursday, 03/19 (Today) Time: 17:00 -­‐-­‐ 17:50 Room 212 B 29
Thanks! QuesPons? GTC ’15
Contact
panda@cse.ohio-state.edu
http://mvapich.cse.ohio-state.edu
http://nowlab.cse.ohio-state.edu
30