GPUvm: Why Not Virtualizing GPUs at the Hypervisor?

Yusuke Suzuki*
in collaboration with Shinpei Kato**, Hiroshi Yamada***, Kenji Kono*
* Keio University
** Nagoya University
*** Tokyo University of Agriculture and Technology

Graphics Processing Unit (GPU)
• GPUs are used for data-parallel computations
  – Composed of thousands of cores
  – Peak double-precision performance exceeds 1 TFLOPS
  – Performance per watt of GPUs exceeds that of CPUs
• GPGPU is widely accepted for various uses
  – Network systems [Jang et al. ’11], file systems [Silberstein et al. ’13] [Sun et al. ’12], DBMS [He et al. ’08], etc.
[Figure: an NVIDIA GPU (cores with L1 caches, a shared L2 cache, video memory) next to a CPU with main memory]

Motivation
• GPUs are not first-class citizens in cloud computing environments
  – Cannot multiplex a GPGPU among virtual machines (VMs)
  – Cannot consolidate VMs that run GPGPU applications
• GPU virtualization is necessary
  – Virtualization is the norm in the cloud
[Figure: several VMs on one physical machine share a single GPU through the hypervisor]

Virtualization Approaches
• Categorized into three approaches
  1. I/O pass-through
  2. API remoting
  3. Para-virtualization

I/O pass-through
• Amazon EC2 GPU instance, Intel VT-d
  – Assign physical GPUs to VMs directly
  – Multiplexing is impossible
[Figure: each VM is assigned its own physical GPU directly]

API remoting
• GViM [Gupta et al. ’09], rCUDA [Duato et al. ’10], VMGL [Lagar-Cavilla et al. ’07], etc.
  – Forward API calls from VMs to the host’s GPUs
  – API and version compatibility problems
  – Enlarges the trusted computing base (TCB)
[Figure: wrapper libraries (v4, v5) in the VMs forward API calls to the host’s library (v4) and driver]

Para-virtualization
• VMware SVGA2 [Dowty et al. ’09], LoGV [Gottschalk et al. ’10]
  – Expose an ideal GPU device model to VMs
  – Guest device driver must be modified or rewritten
[Figure: PV drivers in the VMs issue hypercalls that reach the host driver]

Goals
• Fully virtualize GPUs
  – Allow multiple VMs to share a single GPU
  – Without any driver modification
    • Vanilla driver can be used “as is” in VMs
    • GPU runtime can be used “as is” in VMs
• Identify performance bottlenecks of full VM virtualization
  – GPU details are not open…
[Figure: unmodified library and driver in each VM run on top of a virtual GPU backed by one physical GPU]

Outline
• Motivation & Goals
• GPU Internals
• Proposal: GPUvm
• Experiments
• Related Work
• Conclusion

GPU Internals
• PCIe-connected discrete GPUs (NVIDIA, AMD)
• The driver accesses the GPU with MMIO through PCIe BARs
• Three major components
  – GPU computing cores, GPU channels, and GPU memory
[Figure: driver and apps on the CPU reach GPU channels, computing cores, and GPU memory via MMIO through PCIe BARs]

GPU Channel & Computing Cores
• A GPU channel is a hardware unit for submitting commands to GPU computing cores
• The number of GPU channels is fixed
• Multiple channels can be active at a time
[Figure: apps submit GPU commands through GPU channels; the commands are executed on the computing cores]

GPU Memory
• Memory accesses from computing cores are confined by GPU page tables
[Figure: each GPU channel holds a pointer to a GPU page table, which translates GPU virtual addresses to GPU physical addresses]

Unified Address Space
• GPU and CPU memory spaces are unified
  – A GPU virtual address (GVA) is translated to a CPU physical address or to a GPU physical address (GPA)
[Figure: the GPU page table maps GVAs to GPAs in GPU memory and to CPU physical addresses in CPU memory]

DMA handling in GPU
• DMAs from computing cores are issued with GVAs
  – Confined by GPU page tables
• DMAs must be isolated between VMs
[Figure: for Memcpy(GVA1, GVA2), GVA1 is translated to a GPA and GVA2 is translated to a CPU physical address]

GPUvm overview
• Isolate GPU channels, computing cores & memory
[Figure: GPU channels and GPU memory regions are assigned to VM1 and VM2; the GPU computing cores are time-shared]

GPUvm Architecture
• Expose a virtual GPU to each VM, and intercept and aggregate MMIO accesses to it
• Maintain virtual GPU views and arbitrate accesses to the physical GPU
[Figure: GPUvm runs in the host and intercepts MMIOs from the unmodified libraries and drivers in the VMs]

GPUvm components
1. GPU shadow page table
  – Isolate GPU memory
2. GPU shadow channel
  – Isolate GPU channels
3. GPU fair-share scheduler
  – Isolate GPU time on GPU computing cores

GPU Shadow Page Table
• Create GPU shadow page tables
  – Memory accesses from GPU computing cores are confined by GPU shadow page tables
[Figure: GPUvm keeps the GPU shadow page table consistent with the guest’s GPU page table; accesses to the VM’s own GPU memory are allowed, accesses to another VM’s memory are not]

GPU Shadow Page Table & DMA
• DMA is also confined by GPU shadow page tables
  – Since DMA is issued with a GVA
• Other DMAs can be intercepted by MMIO handling
[Figure: DMA with GVAs goes through the GPU shadow page table; DMA into the VM’s own CPU memory is allowed, DMA into another VM’s memory is not]

GPU Shadow Channel
• Channels are logically partitioned among VMs
• Maintain mappings between virtual & shadow channels
[Figure: each VM sees virtual channels 0, 1, 2, …; GPU shadow channels 0, 1, 2, … are assigned to VM1 and 64, 65, 66, … to VM2]

GPU Fair-Share Scheduler
• Schedules non-preemptive command executions
• Employs the BAND scheduling algorithm [Kato et al. ’12]
• GPUvm can employ existing algorithms
  – VGRIS [Yu et al. ’13], Pegasus [Gupta et al. ’12], TimeGraph [Kato et al. ’11], Disengaged Scheduling [Menychtas et al. ’14]
[Figure: the GPU fair-share scheduler keeps one queue per VM and dispatches from the virtual channels to the GPU shadow channels so that VM1 and VM2 alternate in GPU time]

Optimization Techniques
• Introduce several optimization techniques to reduce the overhead caused by GPUvm
  1. BAR Remap
  2. Lazy Shadowing
  3. Para-virtualization

BAR Remap
• MMIO through PCIe BARs is intercepted by GPUvm
• Allow direct BAR accesses to the non-virtualization-sensitive areas
[Figure: Naive: every guest driver MMIO to the virtual BAR is intercepted, aggregated by GPUvm, and issued to the physical BAR. With BAR Remap: only sensitive areas are intercepted; non-sensitive areas are accessed directly]

Lazy Shadowing
• Page-fault-driven shadowing cannot be applied
  – When a fault occurs, the computation cannot be resumed
• Scanning entire page tables incurs high overhead
• Delay the reflection to the shadow page tables until the channel is used
[Figure: Naive scans on each of three TLB flushes (3 scans); Lazy Shadowing ignores the TLB flushes and scans only once, when the channel becomes active]

Para-virtualization
• Shadowing is still a major source of overhead
• Provide a para-virtualized driver
  – Manipulate page table entries through hypercalls (similar to Xen direct paging)
  – Provide a multicall interface that can batch several hypercalls into one (borrowed from Xen)
• Eliminate the cost of scanning entire page tables

Evaluation Setup
• Implementation
  – Xen 4.2.0, Linux 3.6.5
  – Nouveau [http://nouveau.freedesktop.org/]
    • Open-source device driver for NVIDIA GPUs
  – Gdev [Kato et al. ’12]
    • Open-source CUDA runtime
• Xeon E5-24700, NVIDIA Quadro 6000 GPU
• Schemes
  – Native: non-virtualized
  – FV Naive: full virtualization w/o optimizations
  – FV Optimized: FV w/ optimizations
  – PV: para-virtualization

Overhead
• Significant overhead over Native
  – Can be mitigated by optimization techniques
  – PV is faster than FV since it eliminates shadowing
  – PV is still 2-3x slower than Native
    • Hypercalls, MMIO interception
[Figure: relative time (log scale) for madd, a short-term workload: FV Naive 275.39, FV Optimized 40.10 (shadowing & MMIO handling reduced), PV 2.08 (shadowing eliminated), Native 1.00]

Performance at Scale
• FV incurs large overhead in the 4- and 8-VM cases
  – Since page shadowing locks GPU resources
[Figure: time (seconds), split into kernel execution time and virtualization overhead, for Native, PV, and FV Optimized with 1, 2, 4, and 8 GPU contexts]

Performance Isolation
• In FIFO and CREDIT, a long-running task occupies the GPU
• BAND achieves fair share
[Figure: GPU utilization (%) over time (seconds) of a short and a long task under FIFO, CREDIT, and BAND]

Related Work
• I/O pass-through
  – Amazon EC2
• API remoting
  – GViM [Gupta et al. ’09], vCUDA [Shi et al. ’12], rCUDA [Duato et al. ’10], VMGL [Lagar-Cavilla et al. ’07], gVirtuS [Giunta et al. ’10]
• Para-virtualization
  – VMware SVGA2 [Dowty et al. ’09], LoGV [Gottschalk et al. ’10]
• Full virtualization
  – XenGT [Tian et al. ’14]
  – GPU architecture is different (integrated Intel GPU)

Conclusion
• GPUvm shows the design of full GPU virtualization
  – GPU shadow page table
  – GPU shadow channel
  – GPU fair-share scheduler
• Full virtualization exhibits non-trivial overhead
  – MMIO handling
    • Intercept TLB flush and scan page table
  – Optimizations and para-virtualization reduce this overhead
  – However, still 2-3 times slower than Native
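
Appendix sketch: shadow page table with lazy shadowing
The shadow page table and lazy shadowing ideas from the talk can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not GPUvm's implementation: the class `VirtualGPU`, its fields, the flat dictionary page tables, and the rule "drop any mapping outside the VM's assigned GPU memory range" are all hypothetical simplifications. It only shows why deferring the scan until the channel becomes active reduces the number of full page table scans.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualGPU:
    """Hypothetical per-VM state; names and layout are illustrative only."""
    mem_base: int   # first GPU physical address assigned to this VM
    mem_size: int   # size of the VM's GPU memory region
    guest_pt: dict = field(default_factory=dict)   # GVA -> GPA, written by the guest
    shadow_pt: dict = field(default_factory=dict)  # GVA -> GPA, used by the GPU
    dirty: bool = False  # guest table changed since the last scan
    scans: int = 0       # number of full page table scans performed

    def scan(self):
        """Rebuild the shadow table, keeping only mappings inside the VM's region."""
        self.scans += 1
        limit = self.mem_base + self.mem_size
        self.shadow_pt = {
            gva: gpa
            for gva, gpa in self.guest_pt.items()
            if self.mem_base <= gpa < limit
        }
        self.dirty = False

    def on_tlb_flush(self, lazy):
        """Called when an intercepted MMIO access requests a TLB flush."""
        if lazy:
            self.dirty = True  # lazy shadowing: only remember that a scan is due
        else:
            self.scan()        # naive: scan on every TLB flush

    def on_channel_active(self):
        """Called just before the channel is used; catch up if a scan is due."""
        if self.dirty:
            self.scan()


def run(lazy):
    """Three guest page table updates, each followed by a TLB flush."""
    vgpu = VirtualGPU(mem_base=0x1000, mem_size=0x1000)
    updates = [(0x10, 0x1100), (0x20, 0x1200), (0x30, 0x9000)]  # last one is out of range
    for gva, gpa in updates:
        vgpu.guest_pt[gva] = gpa
        vgpu.on_tlb_flush(lazy)
    vgpu.on_channel_active()
    return vgpu


if __name__ == "__main__":
    for lazy in (False, True):
        vgpu = run(lazy)
        print("lazy" if lazy else "naive", "scans:", vgpu.scans,
              "shadow entries:", len(vgpu.shadow_pt))
```

In this toy run both variants end with the same shadow table, and the out-of-range mapping never reaches it; the naive variant scans once per TLB flush, while the lazy variant scans once when the channel becomes active.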