GPU optimizations at RRZE
GPU-Computing Workshop

J. Habich(a), Prof. Dr. G. Wellein(a,b), C. Feichtinger(b)
(a) HPC Services – Regional Computing Center Erlangen
(b) Department of Computer Science, University Erlangen-Nürnberg
April 6th, 2011
Johannes.Habich@rrze.uni-erlangen.de

Why GPUs?
Peak Performance of CPU vs. GPU
[Figure: peak floating-point performance of consumer and professional HPC GPU cards vs. a single CPU chip; the GPUs lead by roughly 6x.]

Peak Memory Bandwidth of CPU vs. GPU
[Figure: peak memory bandwidth of consumer and professional HPC GPU cards vs. a single CPU chip; the GPUs lead by roughly 4x.]
Outline
GPGPU CUDA Hardware
Memory performance by massive parallelism
Iterative Solvers
Lattice Boltzmann on GPUs
GPGPU CUDA Hardware
Specifications of the NVIDIA Fermi GPU
14 Multiprocessors (MP); each with:
  32 SP (single-precision) processors driven by:
    Single Instruction Multiple Data (SIMD)
    Single Instruction Multiple Thread (SIMT)
  Explicit in-order architecture
  32K registers
  48 KB of local on-chip memory (shared memory)
  Clock rate of 1.15 GHz
  1030 GFLOP/s (single precision)
  515 GFLOP/s (double precision)
1st and 2nd level cache hierarchy
Up to 6 GB of global memory (DRAM)
  1500 MHz DDR, 384-bit bus
  144 GB/s bandwidth
  Global gather/scatter possible
16 GB/s PCIe 2.0 x16 interface to the CPU

Specifications of the NVIDIA Fermi GPU

Card              Clock (MHz)  Peak (GFLOPs)  Memory (GB)  Memory Clock (MHz)  Memory Interface (bit)  Memory Bandwidth (GB/s)
Tesla C2070              1150          1030          6              1500                 384                  144
GeForce GTX 280          1400          1000          1              1160                 512                  148.6
GeForce 8800 GTX         1350           345          0.768           900                 384                   86
Host (Westmere)          2660           255         24              1333                 3*64                  63
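For reference, a minimal sketch (not part of the slides) that queries these figures at runtime via cudaGetDeviceProperties; only long-established cudaDeviceProp fields are used.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                       // device 0
    printf("%s\n", prop.name);
    printf("  multiprocessors       : %d\n", prop.multiProcessorCount);
    printf("  clock rate            : %.2f GHz\n", prop.clockRate / 1.0e6);   // clockRate is reported in kHz
    printf("  global memory         : %.1f GB\n", prop.totalGlobalMem / 1.0e9);
    printf("  shared memory / block : %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("  registers / block     : %d\n", prop.regsPerBlock);
    return 0;
}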
Features and paradigm of the CUDA toolkit
Remember: OpenMP
Typical coarse-grained scheduling using contiguous chunks
Divide the domain into huge chunks
[Figure: elements 0..5 distributed in contiguous chunks, e.g. Thread 0 -> 0, 1; Thread 1 -> 2, 3; Thread 2 -> 4, 5]
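As a point of comparison, a minimal C/OpenMP sketch (not from the slides) of the contiguous-chunk distribution sketched above; with schedule(static), OpenMP hands each thread one large contiguous chunk of the iteration space.

void vector_copy_omp(float *c, const float *a, int n)
{
    // e.g. 3 threads and n = 6: thread 0 -> {0,1}, thread 1 -> {2,3}, thread 2 -> {4,5}
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        c[i] = a[i];
}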
Features and paradigm of the CUDA toolkit
Now: CUDA
Cyclic mapping between workload and threads is often helpful
Parallelize the outermost loop to minimize overhead
Divide the domain/loop into small chunks
Equally distribute them to the threads
[Figure: cyclic mapping of elements 0..5 to threads across Block 1 and Block 2, e.g. Thread 0 -> 0, 3; Thread 1 -> 1, 4; Thread 2 -> 2, 5]

Features and paradigm of the CUDA toolkit
Example: Streamcopy benchmark (GPU)
32 blocks, 32 threads each
Fine-grained workload scheduling!
No caches, so no consecutive data access within a thread is required
With 32 blocks of 32 threads each, the 1st element is handled by Block 0 / Thread 0, the 33rd element by Block 1 / Thread 0, and the 1025th element again by Block 0 / Thread 0.

__global__ void vectorCopyGPUOuterIter(float *d_C, float *d_A, int DATA_N)
{
    for (int pos = (blockIdx.x * blockDim.x) + threadIdx.x;  // ID of current block * threads per block + ID of current thread
         pos < DATA_N;
         pos += blockDim.x * gridDim.x)                      // jump across all elements
    {
        d_C[pos] = d_A[pos];
    }
    __syncthreads();
}
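A hedged host-side sketch of how such a kernel might be launched with the 32-block / 32-thread configuration quoted above; buffer handling and the missing error checks are simplifications.

#include <cuda_runtime.h>

void run_streamcopy(const float *h_A, float *h_C, int DATA_N)
{
    float *d_A, *d_C;
    size_t bytes = sizeof(float) * DATA_N;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);

    // 32 blocks of 32 threads; each thread strides cyclically through the array
    vectorCopyGPUOuterIter<<<32, 32>>>(d_C, d_A, DATA_N);
    cudaDeviceSynchronize();

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_C);
}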
Memory Bandwidth on GPUs
Memory bandwidth case study
How much parallelism is needed? Example: Streamcopy C = A
[Figure: achieved memory bandwidth vs. number of threads/blocks for a 2010 Tesla, a 2008 Tesla, a 2007 consumer card and an AMD Magny Cours host.
 Not enough blocks -> GPUs not used perfectly; not enough work per block -> threads run empty.]

CUDA Scheduling impacted by Resource limits
Resources per Streaming Multiprocessor (MP):
  32,000 (32-bit) registers
  16 KB to 48 KB of shared memory
Parallelism is limited by the resource usage per block/thread
1536 threads can be executed/scheduled in parallel per MP:
  20 registers per thread
  10 to 35 bytes of shared memory per thread (GT300 allows switching)
Using more registers will decrease the number of parallel threads per MP:

  Threads per MP          1536  1024  512  384  256  128   64
  Registers per thread      20    30   62   83  125  250  500
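A small C sketch (not from the slides) reproducing the table above by integer division; the slide rounds one entry slightly differently.

#include <stdio.h>

int main(void)
{
    const int regs_per_mp = 32 * 1000;     // ~32,000 32-bit registers per MP

    int threads[] = { 1536, 1024, 512, 384, 256, 128, 64 };
    for (int i = 0; i < 7; ++i)
        printf("%4d resident threads per MP -> at most %3d registers per thread\n",
               threads[i], regs_per_mp / threads[i]);
    return 0;
}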
PCI Express bandwidth measurements
PCIe Gen 2.0: theoretically 8 GB/s per direction
Pinned host memory is mandatory!
Blocked data copy only improves unpinned memory transfer
Setup: CUDA version 2.3, NVIDIA GTX 280

float *h_A;
size_t MAXMEMORY = 1024 * 1024 * 512;
cudaHostAlloc((void**)&h_A, sizeof(float) * MAXMEMORY, 0);
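A hedged sketch (not the benchmark behind the measurements) of how such a host-to-device transfer rate could be measured with pinned memory and CUDA events; buffer size and iteration count are arbitrary assumptions.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256UL * 1024 * 1024;    // 256 MB per transfer
    float *h_buf, *d_buf;
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);  // pinned host memory
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 20;
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.2f GB/s\n",
           (double)bytes * iters / (ms * 1.0e-3) / 1.0e9);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}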
Iterative Jacobi Solver on GPUs
Seminar work: Moritz Kreutzer
Case study: Jacobi solver on a 3-dimensional domain (Ω = [0; 1]³) using CUDA
Underlying PDE: Poisson's equation with right-hand side 0, Δu = f ≡ 0
Sinusoidal boundary conditions with a maximum at each face
6-point stencil:
  g_dst(x,y,z) = a * ( g_src(x+1,y,z) + g_src(x-1,y,z)
                     + g_src(x,y+1,z) + g_src(x,y-1,z)
                     + g_src(x,y,z+1) + g_src(x,y,z-1) )
6 flops, 6 reads, 1 write per lattice site update (LUP)

Evaluation of the real SP to DP ratio
Jacobi iterative solver (6-point stencil)
Loads 6 elements / stores 1 element
Does 6 flops (implementation dependent)
Algorithmic balance between 4.6 byte/flop (SP) and 9.3 byte/flop (DP)
A kernel running at 1 GFlop/s needs 9.3 GByte/s of data
Tesla top speed: 9/18 GFlop/s (DP/SP)
Far away from any floating-point barrier
Watch all occurring flops and integer calculations
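A hedged sketch of a straightforward CUDA kernel for this stencil (not the seminar code): one thread per inner cell along x, one block per (y,z) line, no shared-memory blocking or padding; names and the thread mapping are assumptions.

__global__ void jacobi_sweep(const float *g_src, float *g_dst,
                             int nx, int ny, float a)
{
    int x = threadIdx.x + 1;   // one thread per inner cell along x
    int y = blockIdx.x + 1;    // one block per (y,z) line of the domain
    int z = blockIdx.y + 1;

    int idx = (z * ny + y) * nx + x;                    // x is the fastest-running index
    g_dst[idx] = a * ( g_src[idx + 1]       + g_src[idx - 1]         // x +- 1
                     + g_src[idx + nx]      + g_src[idx - nx]        // y +- 1
                     + g_src[idx + nx * ny] + g_src[idx - nx * ny]   // z +- 1
                     );                                 // 6 loads, 1 store, 6 flops per LUP
}

A launch of the form jacobi_sweep<<<dim3(ny - 2, nz - 2), nx - 2>>>(d_src, d_dst, nx, ny, 1.0f / 6.0f); would cover the interior, assuming nx - 2 does not exceed the maximum block size.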
Jacobi on GPUs
One GPU block treats a 3-dimensional block of the domain
One thread per node of the GPU block
Use of shared memory, e.g. for simple blocking

Jacobi on GPUs
Shared memory access
Threads within a block can share data through a 16 KB shared memory
Threads load the central point for their neighbors (central stripe):
  Element g_src(x,y,z) is stored into shared memory for all threads (x = 1, ..., dimX - 1)
  Access to elements g_src(x+1,y,z) and g_src(x-1,y,z) goes via shared memory
Extension to multiple stripes: load the complete stencil data to shared memory in advance
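A hedged sketch of the single-stripe idea (not the seminar code): each thread stages its own g_src(x,y,z) value in shared memory so that the x+1 and x-1 accesses of its neighbors are served on-chip; names, the launch mapping and the simplified halo handling are assumptions.

__global__ void jacobi_shared(const float *g_src, float *g_dst,
                              int nx, int ny, float a)
{
    extern __shared__ float stripe[];          // blockDim.x + 2 values per block

    int x = threadIdx.x + 1;
    int y = blockIdx.x + 1;
    int z = blockIdx.y + 1;
    int idx = (z * ny + y) * nx + x;

    stripe[threadIdx.x + 1] = g_src[idx];      // central point, reused by the x-neighbors
    if (threadIdx.x == 0)                      // two halo values at the stripe ends
        stripe[0] = g_src[idx - 1];
    if (threadIdx.x == blockDim.x - 1)
        stripe[blockDim.x + 1] = g_src[idx + 1];
    __syncthreads();

    g_dst[idx] = a * ( stripe[threadIdx.x + 2] + stripe[threadIdx.x]     // x +- 1 from shared memory
                     + g_src[idx + nx]      + g_src[idx - nx]            // y +- 1 from global memory
                     + g_src[idx + nx * ny] + g_src[idx - nx * ny] );    // z +- 1 from global memory
}

Launched e.g. as jacobi_shared<<<dim3(ny - 2, nz - 2), nx - 2, nx * sizeof(float)>>>(...), where the third launch parameter provides the dynamic shared-memory stripe.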
Jacobi on GPUs
Texture Cache
GPUs do not comprise ordinary caches known from CPUs
However, it is possible to exploit the texture cache:
  The source grid is mapped to a texture
  Repetitive read operations can be satisfied by this fast cache with much higher performance
  Swapping the grids requires re-mapping the source grid

Jacobi on GPUs
Padding
Needed to obtain coalesced memory access
Sensible only in x-direction (adjacent elements in x-direction are adjacent in memory)
Consider the boundary/ghost layer when calculating the optimal padding
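A hedged sketch of one possible padding rule (not the seminar code): round the allocated x-dimension up so that every x-row, including its ghost cells, occupies an integer number of 128-byte memory segments; the granularity and helper name are assumptions, and an additional front offset could be used to align the first inner cell (x = 1) itself.

#include <stdio.h>

static int padded_x_dimension(int nx_inner, int elem_size, int alignment)
{
    int nx_total = nx_inner + 2;                    /* + one ghost cell per side */
    int elems_per_segment = alignment / elem_size;  /* e.g. 128 B / 4 B = 32 floats */
    /* round up to the next multiple of the segment granularity */
    return ((nx_total + elems_per_segment - 1) / elems_per_segment)
           * elems_per_segment;
}

int main(void)
{
    /* example: 100 inner cells in x, single precision, 128-byte segments */
    printf("padded x-dimension: %d elements\n",
           padded_x_dimension(100, (int)sizeof(float), 128));
    return 0;
}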
Jacobi on GPUs: Results
Speedup of 2 for cached algorithms
Double precision leads to a 50% performance reduction, as known from CPUs
Memory-bound execution
Lattice Boltzmann on GPUs
The lattice Boltzmann method
Incompressible flow solver
Explicit, fully discrete Boltzmann equation with BGK collision operator
1st order accurate in time
2nd order accurate in space
Physical discretization: D3Q19
Halfway bounce-back for obstacle treatment
[Figure: D3Q19 lattice cell with its discrete velocity directions along the X-, Y- and Z-dimension]

The lattice Boltzmann method
Distribution array: f(0:xMax+1, 0:yMax+1, 0:zMax+1, 0:18, 0:1)
Pull/collision optimized layout

x = threadIdx.x;    // set i index of current cell
y = blockIdx.x+1;   // set j index of current cell
z = blockIdx.y+1;   // set k index of current cell

if ( fluidcell(x,y,z) ) then
    LOAD f( x,   y,   z,    0, t)
    LOAD f( x+1, y+1, z,    1, t)
    LOAD f( x,   y+1, z,    2, t)
    LOAD f( x-1, y+1, z,    3, t)
    ...
    LOAD f( x,   y-1, z-1, 18, t)
    Relaxation (complex computations)
    SAVE f( x, y, z, 0:18, t+1)
endif
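A hedged CUDA sketch (not the production kernel) of how the distribution array above could be linearised and addressed in a pull-scheme kernel; the structure-of-arrays ordering, the names and the thread mapping are assumptions.

__device__ __host__ inline size_t fidx(int x, int y, int z, int dir, int t,
                                       int xMax, int yMax, int zMax)
{
    const size_t nx = xMax + 2, ny = yMax + 2, nz = zMax + 2;   // +2 for the ghost layer
    // x runs fastest so that the threads of one block touch consecutive addresses
    return (((size_t)t * 19 + dir) * nz + z) * (ny * nx) + (size_t)y * nx + x;
}

__global__ void lbm_pull_skeleton(float *f, int xMax, int yMax, int zMax, int t)
{
    int x = threadIdx.x + 1;   // i index of current cell
    int y = blockIdx.x + 1;    // j index of current cell
    int z = blockIdx.y + 1;    // k index of current cell

    // pull the resting population (direction 0) from the same cell ...
    float f0 = f[fidx(x, y, z, 0, t, xMax, yMax, zMax)];
    // ... pull the 18 moving populations from the neighbouring cells,
    //     apply the BGK relaxation, then write all 19 values at time level t+1:
    f[fidx(x, y, z, 0, 1 - t, xMax, yMax, zMax)] = f0;
}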
Pure kernel (SP), no PCIe/IB transfer
Maximum performance starting at 50x50x50
Fluctuations due to different thread numbers and the influence of alignment
Blocks influence the kernel for domains < 200x200x200
Comparison: Xeon node ~100 MLUPS
(LUPS: lattice updates per second)

Kernel with boundary transfer (SP), no IB
Maximum performance starting at 200x200x200 (64 times more cells than the pure kernel at 50x50x50!)
Blocks influence the kernel at any domain size
28% is lost for 64 blocks. Why?
Time measurements of kernel with 1 and 64 blocks
Kernel execution time is constant no matter how many blocks are used
For domains > 250^3, about 50% of the execution time is spent in non-kernel parts
For domains < 150^3, the non-kernel part becomes dominant
The difference between SP and DP vanishes for 64 blocks
The algorithm becomes communication bound

Performance on GPU
C2050: speedup of 2 compared to a full 12-core Intel Westmere node in DP
75% of the attainable memory bandwidth
[Plot legend: SP, DP, ECC=1]
WaLBerla
Heterogeneous GPU-CPU computing

WaLBerla
Patch- and block-based domain decomposition
A block contains simulation data and metadata, e.g. for parallelization and advanced models
Blocks can be algorithm- or architecture-specific
All blocks are equal in spatial dimensions
Processes can own one or multiple blocks

Heterogeneous LBM
Copy to buffers on CPU and GPU
After each iteration, boundary data is copied to the communication buffers

Buffer swap on GPU
Local communication buffers are only swapped; no copy is done!

Transfer of buffers to the host
Data of the GPU processes is transferred to the host
Transfer of buffers to the host
Buffers are transferred to / received from other hosts

Disparity of Scales: The tale of the load balance
In theory:
  2 MPI processes for the CPU domain
  1 MPI process for the GPU domain
Time per iteration for a sample block of 75x75x75:
  GPU: 3 ms (@ 130 MLUPS)
  CPU: 9 ms (@ 50 MLUPS)
Take 8 blocks: 6 for the GPU and 2 for the CPUs
What about 4-socket machines?
Disparity of Scales: The tale of the load balance 2
Theory estimate: (130 + 50 + 50) MLUPS = 230 MLUPS
Reality: 167 MLUPS
Tuning of parameters leads to a balance of 7 to 2 (GPU to CPU) blocks
(Note that a CPU block is computed by three CPU cores via OpenMP)
Results of the tuned setup: 189 MLUPS
Still behind the lightspeed estimate

Disparity of Scales: The tale of the load balance 3
Problems involved:
GPU as well as CPU get better with large domains and only one block
However, a factor of 7 to 2 restricts the largest block size and forces multiple blocks
Both architectures are not operated under their best conditions
Overheads (PCIe transfers, thread spawning, on-GPU copies) are not covered by the simple performance model
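A minimal C sketch (not from the slides) of the arithmetic behind the per-block times and the 230 MLUPS theory estimate quoted above.

#include <stdio.h>

int main(void)
{
    const double cells = 75.0 * 75.0 * 75.0;     // sample block of 75x75x75
    const double gpu_mlups = 130.0;              // GPU rate per block
    const double cpu_mlups = 50.0;               // rate of one CPU process

    printf("GPU time per block: %.1f ms\n", cells / (gpu_mlups * 1e6) * 1e3);
    printf("CPU time per block: %.1f ms\n", cells / (cpu_mlups * 1e6) * 1e3);

    // one GPU process plus two CPU processes working concurrently
    printf("Theory estimate: %.0f MLUPS\n", gpu_mlups + 2.0 * cpu_mlups);
    return 0;
}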
Weak scaling GPU per Node performance
Weak scaling works as expected
Initial performance drop from one to two cards per node
Up to 16 GFLUPS max. performance (GT200)
Up to 7 GFLUPS in SP: up to 137 Intel Xeon nodes or 1275 BlueGene/P nodes necessary!
About 3 GFLUPS in DP: up to 70 Intel Xeon nodes or 750 BlueGene/P nodes necessary!

Strong Scaling GPU per Node performance
Loss of 64% in SP on 30 nodes (60 GT200 GPUs)
Loss of 75% in DP on 30 nodes (60 GT200 GPUs)
Huge performance drop due to domain size
[Plot annotations: 66%, 46%, 67%, 35%]
Summary
High performance on GPUs is attainable through sophisticated optimization
Sweet spots of the different architectures are quite the same (i.e. large domains)
Data locality plays an important role
PCI Express is still a major bottleneck

Thank you very much for your attention
All trademarks are the property of their respective owners.