GPGPU Agenda – Intro to Nvidia GPGPU Lesson #3: Nvidia HW and CUDA - Introduction Nvidia GPGPU programming model – Nvidia GPGPU HW – Dr. Avi Mendelson - Intel Dr. Evgeny Bolotin - Intel Technion, GPGPU course, Winter 2008 What is 3 2 ? - a reminder ☺ Introduction to CUDA Will start with G80 (towards G200 in next lesson) The material by courtesy of : D.Kirk (Nvidia), Prof. W.M Hwu (UIUC), M. Behar (Intel), Nvision 08 and Hotchips 08 - Cuda Tutorials Lecture #3: Nvidia HW and CUDA - Introduction GPU vs CPU: Peak GFLOPS General Purpose computing on Graphics Processing Units GPU well-suited to data-parallel problems Common Apps: physics, image filtering, video and audio encoding, matrix algebra, sorting, scientific workloads…. Lecture #3: Nvidia HW and CUDA - Introduction 4 Lecture #3: Nvidia HW and CUDA - Introduction 1 What is Behind such an Evolution? Bandwidth The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about) – So, more transistors can be devoted to data processing rather than data caching and flow control ALU ALU ALU ALU Control CPU GPU Cache DRAM DRAM Lecture #3: Nvidia HW and CUDA - Introduction 5 GPGPU Challenge in the Old Days – Mapping the algorithm to the API Graphic terminology (shader, pixel, triangle…) Limited Shader capabilities – – Instruction Set Communication between threads – – 8 Extended C No knowledge of graphics is required Unlimited, random and efficient access to memory – Limited instruction set, no integer, no bit operation Lecture #3: Nvidia HW and CUDA - Introduction “Compute Unified Device Architecture” No graphics API overhead Low learning curve – Number of outputs was limited No scatter operation – 7 CUDA (and G80), since Nov. 2006 Graphic API (OpenGL or DX) – Lecture #3: Nvidia HW and CUDA - Introduction 6 Thread can read/write as many locations as needed Gather/scatter capabilities Software-managed cache (Shared Memory) Significantly extends the GPU beyond graphics Lecture #3: Nvidia HW and CUDA - Introduction 2 GFx Pipeline - Unified Design 9 Lecture #3: Nvidia HW and CUDA - Introduction 10 Why Unify? 11 Lecture #3: Nvidia HW and CUDA - Introduction Lecture #3: Nvidia HW and CUDA - Introduction Why Unify? 12 Lecture #3: Nvidia HW and CUDA - Introduction 3 G80 Thread Computing Pipeline The future of GPUs is programmable processing Host Host Input Input Assembler Assembler Setup / Rstr / ZCull Vtx Thread Issue SP SP SP SP SP SP Geom Thread Issue SP SP SP SP Pixel Thread Issue SP SP SP SP SP SP Build the architecture around the processor Parallel Data TF Cache Parallel Data TF Cache Parallel Data TF Cache Parallel Data TF Cache Parallel Data TF Cache Parallel Data TF Cache Parallel Data TF Cache Parallel Data TF Cache Texture Texture L1 Texture L1 Texture L1 Texture L1 Texture L1 Texture L1 Texture L1 Texture L1 Load/store L2 FB Lecture #3: Nvidia HW and CUDA - Introduction 13 14 Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms … and not mechanics of a parallel programming language. Lecture #3: Nvidia HW and CUDA - Introduction Load/store L2 FB FB Global Memory Lecture #3: Nvidia HW and CUDA - Introduction FB Load/store L2 Load/store L2 FB FB SPMD (Single Program, Multiple Data) – – – 15 Load/store L2 CUDA = SPMD Model Cuda Design Goals Load/storeL2 Thread Processor Thread Execution Manager 16 All PE’s execute the same program in parallel, but have their own data Each PE uses a unique ID to access its portion of data Different PE can follow different paths through the same code Lecture #3: Nvidia HW and CUDA - Introduction 4 Streaming architecture lexicon wars: SPMD vs. Scalar/Vector/SIMD – WARPs? Traditionally multi-processors provided set of vector instructions – – – – – NVIDIA approach: vector operations out of scalar operations – – – – – – 17 MMX – early days of Intel’s vectors SSE – Intel’s second generation: vectors of 128bits each AVX – Intel’s next generation: vectors of 256bits each LRBNI – Intel’s future graphics with a vector operations of 512bits vectors Altivec – IBM definition of vector operations The user defines a scalar code (sequence of operations running on a single element) At instantiation time user tells the run-time the problem size and partitioning hints HW breaks the problem into thousands of threads (one thread per element) organized in several hierarchies and “vectors” – HW SIMDification? Each thread runs the same code on different data element Threads get scheduled in groups of 32 (“WARP”) – just an implementation WARP – special case of SIMD? Lecture #3: Nvidia HW and CUDA - Introduction 18 Lecture #3: Nvidia HW and CUDA - Introduction Host Data Assembler Setup / Rstr / ZCull Vtx ThreadGeom IssueThread Pixel Issue Thread Issue CUDA-Scalable Computing HW Overview (G80) SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP TF TF TF TF TF TF TF TF L1 L1 L2 FB L1 L1 L2 FB L2 FB L1 L1 L2 FB L1 L2 FB L1 Thread Processor L2 FB Stream Processor Array - SPA TPC NV Mobile NV Tegra TPC TPC TPC TPC TPC Texture Processor Cluster SM Data L1 Instruction Fetch/Dispatch Shared Memory Server-Desktop Tesla D870 TEX SP SM Server Tesla S870 Lecture #3: Nvidia HW and CUDA - Introduction TPC Streaming Multiprocessor Instruction L1 19 TPC Desktop GeForce 8800 SP SP SP SFU 20 SFU SP SP SP SP Lecture #3: Nvidia HW and CUDA - Introduction 5 Cuda System - Physical Reality CUDA Processor Terminology TPC SPA TPC (2 SM + TEX) SM – – – – – – GPU w/ local DRAM (CUDA Device) Shared Memory TEX SP SM Texture Processor Cluster SP SP SP SFU SP SP SP SFU SP Streaming Multiprocessor (8 SP) Multi-threaded processor core Fundamental processing unit for CUDA thread block Streaming Processor (FMAD operations) Scalar ALU for a single CUDA thread SFU – 21 Instruction L1 Data L1 Instruction Fetch/Dispatch SP – Streaming Processor Array CPU (Host) Streaming Multiprocessor SM Super Function Units (Transcendentals) Lecture #3: Nvidia HW and CUDA - Introduction Lecture #3: Nvidia HW and CUDA - Introduction 22 Cuda Programming Model A Highly Multithreaded Coprocessor The GPU is viewed as a compute device that: – CUDA Programming model – – Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads Differences between GPU and CPU threads – – – – – 23 Lecture #3: Nvidia HW and CUDA - Introduction 24 Is a coprocessor to the CPU or host Has its own DRAM (device memory) Runs many threads in parallel GPU threads are extremely lightweight Very little creation overhead GPU needs 1000s of threads for full efficiency Up to 12,268 threads are alive simultaneously Multi-core CPU needs only a few Lecture #3: Nvidia HW and CUDA - Introduction 6 Host & Device Serial code runs on Host Device as a ultra-parallel coprocessor for Host CUDA Device Memory Space (Device) Grid Block (0, 0) Parralel CUDA threads run on Device Host memory and Device memory Block (1, 0) Shared Memory Registers Host Registers Shared Memory Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Local Memory Local Memory Local Memory Local Memory Global Memory Constant Memory Texture Memory Lecture #3: Nvidia HW and CUDA - Introduction 25 26 Memory Space Memory-Summary Registers 32 KB per SM (8K registers) Memory Location Cached Access Global and local (NOT cached!) Registers On-chip N/A Read/write One thread Local Off-chip No Read/write One thread Shared On-chip N/A resident Read/write All threads in a block Global Off-chip No Read/write All threads + host Constant Off-chip Yes Read All threads + host Texture Off-chip Yes Read All threads + host – – DRAM – large latency access Constant memory (cached) Texture memory (cached) – – – Fast if all threads read the same address (Broadcast) Cached The texture cache is optimized for 2D spatial locality Shared memory – – – – – 27 Lecture #3: Nvidia HW and CUDA - Introduction SW managed Very low latency 16KB on chip-memory per SM, organized in 16 banks All threads of the block can share/exchange data using the shared memory Allows reuse of data reducing bandwidth substantial performance improvement Lecture #3: Nvidia HW and CUDA - Introduction 28 Who Lecture #3: Nvidia HW and CUDA - Introduction 7 Thread Hierarchy Hierarchy: Kernel launches Grid – – – – Grid of blocks Blocks of threads Threads organized and scheduled in warps (microarchitecture) Threads and blocks have unique IDs – – Sharing through a low latency shared memory HW Synchronizing primitives – Threads from different blocks cannot cooperate Blocks are required to execute independently Simplifies memory addressing when processing multidimensional data – – Lecture #3: Nvidia HW and CUDA - Introduction 29 Image processing Matrix manipulation Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Block (1, 1) Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) Lecture #3: Nvidia HW and CUDA - Introduction 30 Thread Hierarchy example This way thread can decide what data to process Block ID: 1D or 2D Thread ID: 1D, 2D, or 3D Device A thread block is a batch of threads that cooperate : – Block and Thread IDs CUDA Common Programming style A kernel is defined __global__ specifier The number of CUDA threads for each call is specified using a new <<<…>>> syntax Thread ID is accessible within the kernel through the built-in threadIdx variable Each thread will work (compute) on 1 element – – Define what to do for 1 element Launch Kernel on the whole data set Store blocks of data in fast shared memory (Tiling) Use synchronization primitives among threads of same block Grid = 1 block Block = NxN threads 31 Lecture #3: Nvidia HW and CUDA - Introduction 32 Lecture #3: Nvidia HW and CUDA - Introduction 8 A typical programming pattern using the shared memory Barrier - Synchronization Function void __syncthreads(); Synchronizes all threads in a block Once all threads have reached this point, execution resumes normally Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory – 33 Lecture #3: Nvidia HW and CUDA - Introduction phase } 35 The host runtime component can only be used by host functions It provides functions to handle: – // compute sum for the threadblock for(int dist = blockDim.x/2; dist>0; dist/=2) { if(threadIdx.x<dist) s_data[threadIdx.x] += s_data[threadIdx.x+dist]; __syncthreads(); } // write the block's sum to global memory if(threadIdx.x==0) g_output[blockIdx]= s_data[0]; Lecture #3: Nvidia HW and CUDA - Introduction Process the data in shared memory Synchronize again if necessary to make sure that shared memory has been updated with the results Write the results back to device memory CUDA Host Runtime Component __global__ void sum_kernel(int *g_input, int *g_output) { extern __shared__ int s_data[]; // allocated during kernel launch // read input into shared memory from global memory Read unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x; phase s_data[threadIdx.x] = g_input[idx]; __syncthreads(); Write read shared memory locations that were written by different threads Lecture #3: Nvidia HW and CUDA - Introduction 34 Example: Cuda kernel Compute phase Load data from device memory to shared memory Synchronize with all the other threads of the block so that each thread can safely: – – – – – 36 Device management Memory management Code module management Execution control Texture reference management Interoperability with OpenGL and Direct3D. Lecture #3: Nvidia HW and CUDA - Introduction 9 Example for Host Runtime code 37 Lecture #3: Nvidia HW and CUDA - Introduction Example for Host Runtime cont. 38 Language Extensions: Variable Type Qualifiers CUDA = C Language Extensions Declaration: global, __global__ void convolve (float *image) device , local , shared Keywords: dim3 threadIdx, dim3 blockIdx dim3 blockDim; int i = blockIdx.x * blockDim.x + threadIdx.x; Intrinsics __syncthreads(); __sin();__sqrt() Runtime API cudaMalloc(), cudaMemcpy() Function launch // 100 blocks, 10 threads per block convolve<<<100, 10>>> (myimage); 39 Lecture #3: Nvidia HW and CUDA - Introduction Lecture #3: Nvidia HW and CUDA - Introduction Memory Scope Lifetime local thread thread int SharedVar; shared block block int GlobalVar; global grid application constant grid application __device__ __local__ int LocalVar; __device__ __shared__ __device__ __device__ __constant__ int ConstantVar; • Automatic variables without any qualifier reside in a register • Pointers can only point to memory allocated or declared in global memory •Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr) 40 Lecture #3: Nvidia HW and CUDA - Introduction 10 Compiling CUDA, PTX-intermediate CUDA SDK C/C++ Application ASM-level Library Programmer Parallel Thread eXecution (PTX) – NVCC Compiler – – PTX Code ISA – Instruction Set Architecture – PTX Code – – Translator – 41 Lecture #3: Nvidia HW and CUDA - Introduction 42 G80 … Translates PTX to Target code Program install time Driver implements VM runtime – C Variable declarations Instructions and operands Translator is an optimizing compiler PTX to Target Virtual Machine and ISA Programming model Execution resources and state Coupled with Translator GPU Target code Lecture #3: Nvidia HW and CUDA - Introduction CUDA Device Memory Space (Device) Grid Block (0, 0) CUDA HW Architecture Block (1, 0) Shared Memory Registers We will describe G80 – will talk later what changed in G200 Host Registers Shared Memory Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Local Memory Local Memory Local Memory Local Memory Global Memory Constant Memory Texture Memory 43 Lecture #3: Nvidia HW and CUDA - Introduction 44 Lecture #3: Nvidia HW and CUDA - Introduction 11 Access Times HW Overview (G80) Register – dedicated HW - single cycle Shared Memory – dedicated HW - single cycle Local Memory – DRAM, no cache - *slow* Global Memory – DRAM, no cache - *slow* Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality Instruction Memory (invisible) – DRAM, cached Lecture #3: Nvidia HW and CUDA - Introduction 45 Stream Processor Array - SPA TPC – – – 47 TPC Data L1 Instruction Fetch/Dispatch Shared Memory TEX SP SP SP SM SP SFU SFU SP SP SP SP Lecture #3: Nvidia HW and CUDA - Introduction L1 Fill Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Work 1 to 768 threads active Cover latency of texture/memory loads Hot clock 1.35 GHz (20+ GFLOPS) Register file (32KB) Shared memory (16KB) ALL 8 SPs execute the same instruction from the same warp (4 cycles) Each instruction manipulates different dataLecture #3: Nvidia HW and CUDA - Introduction TPC Thread / Instruction Dispatch Multi-threaded instruction dispatch – TPC Streaming Multiprocessor SM Kernel launching on Host Instruction Fetch Instruction L 1 Cache TPC Thread Scheduling Streaming Multiprocessor (SM) 8 Streaming Processors (SP) 2 Super Function Units (SFU) TPC Instruction L1 46 Streaming Multiprocessor (SM) TPC Texture Processor Cluster Streaming Multiprocessor (SM) TPC Shared Memory S F U Thread Execution Manager Control SP0 RF0 RF4 SP4 SP1 RF1 RF5 SP5 SP2 RF2 RF6 SP6 SP3 RF3 RF7 Results S F U SP7 Load Texture Constant L1 Cache Load from Memory L1 Fill Store to Store to Memory 48 Lecture #3: Nvidia HW and CUDA - Introduction 12 Thread Scheduling Thread Scheduling Grid 1 Kernel launching on Host Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Grid 1 Kernel launching on Host Thread Execution Manager Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Thread Execution Manager Block (1, 1) Up to 24 Warps Per SM … Thread Thread (14, 0) (15, 0) Thread Thread (0, 1) (1, 1) … Thread Thread (14, 1) (15, 1) Thread Thread (0, 2) (1, 2) … Thread Thread (14, 2) (15, 2) Block(1,1) W1 Block(1,1) W2 warp Thread Thread (0, 0) (1, 0) Block(2,1) W7 Block(2,1) W8 Lecture #3: Nvidia HW and CUDA - Introduction 49 Block (0, 0) Lecture #3: Nvidia HW and CUDA - Introduction 50 Thread Life Cycle in HW Host SM Executes Blocks Device t0 t1 t2 … tm Grid is launched on SPA Thread Blocks are serially distributed to all the SM’s – – Kernel 1 Potentially >1 Thread Block per SM Up to 8 bocks per SM Every 32 consecutive threads are grouped into a Warp Each SM launches Warps of Threads – Grid 1 2 levels of parallelism SM schedules and executes Warps that are ready to run As Warps and Thread Blocks complete, resources are freed – SPA can distribute more Thread Blocks Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Threads are assigned to SMs in Block granularity – – Up to 8 Blocks to each SM as resource allows SM in G80 can take up to 768 threads Grid 2 Kernel 2 MT IU SP – – Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) SM assigns/maintains thread id #s SM manages/schedules thread execution 51 Lecture #3: Nvidia HW and CUDA - Introduction SP Blocks Shared Memory Shared Memory TF Texture L1 L2 Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) t0 t1 t2 … tm MT IU Blocks Could be 256 (threads/block) * 3 blocks Or 128 (threads/block) * 6 blocks, etc. Threads run concurrently Block (1, 1) SM 0 SM 1 52 Lecture #3: Nvidia HW and CUDA - Introduction Memory 13 Thread Scheduling/Execution Each Thread Block is divided in 32-thread Warps – – – – Warps whose next instruction has its operands ready for consumption are eligible for execution – Eligible Warps are selected for execution on a prioritized scheduling policy warp 8 instruction 11 All threads in a Warp execute the same instruction when selected warp 3 instruction 95 .. . warp 8 instruction 12 – warp 3 instruction 96 SM Instruction Buffer – Warp Scheduling 4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80 – – 56 from instruction L1 cache into any instruction buffer slot Issue one “ready-to-go” warp instruction/cycle – I$ L1 Fetch one warp instruction/cycle – Study case: – If one global memory access is needed for every 4 instructions – A minimal of 13 Warps are needed to fully tolerate 200-cycle memory latency (16 execution cycles * 13 warps = 208) Lecture #3: Nvidia HW and CUDA - Introduction warp 1 instruction 42 Lecture #3: Nvidia HW and CUDA - Introduction 54 55 SM multithreaded Warp scheduler At any point in time, only one of the 24 Warps will be selected for instruction fetch and execution. Using Warps to hide Memory Latency Zero-overhead Warp scheduling time Each Block is divided into 256/32 = 8 Warps There are 8 * 3 = 24 Warps 24 warps * 32 threads = 768 threads Lecture #3: Nvidia HW and CUDA - Introduction 53 This is an implementation decision, not part of the CUDA programming model!!! Warps are THE scheduling units of the SM If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM? – SM Warp Scheduling from any warp - instruction buffer slot operand scoreboarding used to prevent hazards Issue selection based on roundrobin/age of warp SM broadcasts the same instruction to 32 Threads of a Warp Multithreaded Instruction Buffer R F C$ L1 Shared Mem Operand Select MAD SFU Lecture #3: Nvidia HW and CUDA - Introduction 14 HW Summary G80 is a Scoreboarding Machine All register operands of all instructions in the Instruction Buffer are scoreboarded – – – – Warps are the basic SM scheduling unit SM can execute multiple blocks concurrently Decoupled Memory/Processor pipelines – The device processes only one grid at a time Blocks are distributed to SMs by the Thread Execution manger Each thread-block of a grid is split into warps Status becomes ready after the needed values are deposited Prevents hazards Cleared instructions are eligible for issue Any thread can continue to issue instructions until scoreboarding prevents issue Allows Memory/Processor ops to proceed in shadow of Memory/Processor ops – – Lecture #3: Nvidia HW and CUDA - Introduction 57 Lecture #3: Nvidia HW and CUDA - Introduction 58 HW Summary: Threads, Warps, Blocks 59 Only <32 when there are fewer than 32 total threads – – – SPMD is a natural way to express data level parallelism – – Warps and scoreboarding used to hide memory latency – 60 CUDA – non-graphics programming interface Shared Memory, Barriers Unified, scalar Scalable (SP per SM, SM per TPC, TPC count) The GPU is used as an accelerator for massively data parallel tasks Cuda extends C If resources (registers, thread space, shared memory) allow, more than 1 Block can occupy each SM Lecture #3: Nvidia HW and CUDA - Introduction G80 is a revolutionary GPU architecture – There are (up to) 16 Warps in a Block Each Block (and thus, each Warp) executes on a single SM G80 has 16 SMs At least 16 Blocks required to “fill” the device More is better – Exec. Summary SW/HW There are (up to) 32 threads in a Warp – Shared memory and registers are partitioned among the threads of all concurrent blocks So, decreasing shared memory usage (per block) and register usage (per thread) increases number of blocks that can run concurrently Easy learning…? Scalable programming model No need to think in vectors Lecture #3: Nvidia HW and CUDA - Introduction 15
© Copyright 2025