Measuring the Whole System
Holistic Profiling of CPU and GPU for Optimal Vision Applications on ARM Platforms
Tim Hartley
COPYRIGHT © 2015 ARM

The Evolution of Mobile GPU Compute
(Figure: timeline of APIs and Mali GPUs; the pairing of GPUs with years is not recoverable from the extracted text.)
- API stages: OpenGL® ES 1.1 (fixed pipeline); OpenGL ES 2.0 (programmable pipeline); OpenCL™ Full Profile / RenderScript (portable heterogeneous parallel computation); OpenGL ES 3.1 Compute Shaders (GPU compute within the OpenGL ES API)
- GPUs shown: ARM® Mali™-55, Mali-200, Mali-300, Mali-400 MP, Mali-450 MP, Mali-T600 Series, Mali-T700 & T800 Series
- Years marked: 2007, 2009, 2010, 2012, 2013, 2014, 2015

Measuring the Whole System
- Computer vision will, for some time, succeed in using every drop of processing power we give it
- Techniques in computer vision are still evolving rapidly
- New, complex, sustained low-power use cases
- Building computer vision applications is an ever more complex process
- The availability of more processors and processor types makes this even more so
- Capturing and analyzing accurate and effective measurements from platforms plays a vital role in achieving optimal performance

Modern Computer Vision Applications
(Diagram: a vision application running across four CPU cores, each with NEON, plus a GPU and a DSP.)

Inside an ARM Mali Midgard Core
- SIMD: several components per operation
- 128-bit registers
- VLIW: several operations per instruction word
- Some operations are “free”
- Built-in function library, accelerated in hardware
- T = max(A0, A1, LS, Tex)

Hardware Counters
- Counters per core
  - Active cycles
  - Pipe activity
  - L1 cache
- Counters for the GPU
  - Active cycles
  - L2 caches
  - MMU
- Accessed through DS-5 Streamline
  - Timeline of all hardware counters, and more
  - Explore the execution of the full application
  - Zoom in on details

DS-5 Streamline
- Identify hotspots and system bottlenecks at a glance
- Select from CPU/GPU counters, OS-level and custom data sources
- Select one or more tasks to isolate their contribution
- Accumulate counters, measure time and find instant hotspots
- Combined task switching trace and sample-based profile

Example: Complex Computer Vision Application

Lane and Car Detection

Streamline

Streamline: OpenCL Timeline

Optimisation Overview
(Flowchart; the exact branch structure is only partly recoverable from the extracted text.)
- First question: is the kernel limited by kernel execution time or by memory management?
- Memory management: ensure you are not copying memory unnecessarily
- Kernel execution time: is it limited by arithmetic ops or memory ops?
- Checks that appear in the chart:
  - High number of instruction re-issues?
  - Limited to 64 threads?
  - Large number of register bank conflicts?
  - Large number of instruction cache misses?
- Remedies that appear in the chart:
  - Reduce register pressure. Simplify or shorten kernels.
  - Improve memory access pattern to improve cache efficiency.
  - Vectorise the LS operations if possible. Decrease memory accesses if possible.
  - Vectorise the kernel if possible. Decrease the arithmetic work if possible.
- Final check: limited by the same factors? The two outcomes are "done optimising" and "reiterate".

Deriving Meaning from Hardware Counters
- Counters on their own usually don’t mean a huge amount
- Combining counters is more useful
  - Comparing values to determine limiting pipes
  - Calculating more meaningful values from multiple values
- New graph traces can be added from these counters
  - ...and become an integral part of the timeline

Custom Charts: Bringing Counters Together
    100 * $MaliCoreCyclesTripipeCycles / $MaliJobManagerCyclesGPUCycles
    100 * $MaliArithmeticPipeAInstructions / $MaliCoreCyclesTripipeCycles
    100 * $MaliLoadStorePipeLSInstructionIssues / $MaliCoreCyclesTripipeCycles
    100 * $MaliLoadStorePipeLSInstructions / $MaliLoadStorePipeLSInstructionIssues

ALU Bound kernel
- One load
- One store
- “n” ALU operations

    __kernel void kernel_alu_bound(global float* arr, uint n)
    {
        // One load per work-item
        float value = arr[get_global_id(0)];
        // n dependent ALU operations
        for (uint i = 0; i < n; i++) {
            value += sin(value);
        }
        // One store per work-item
        arr[get_global_id(0)] = value;
    }

L/S Bound kernel
- One load
- One store
- No ALU operation

    __kernel void kernel_memcpy(global float *a, global float *b)
    {
        // One float4 load and one float4 store per work-item, no ALU work
        size_t i = get_global_id(0);
        float4 v = vload4(i, a);
        vstore4(v, i, b);
    }

Cache misses
- One byte read every 64 bytes
- One byte written every 64 bytes
- Really bad cache utilisation!

    __kernel void kernel_cache_misses(global uchar *a, global uchar *b)
    {
        // Each work-item touches a single byte at a 64-byte stride
        b[64 * get_global_id(0)] = a[64 * get_global_id(0)];
    }

What does good whole-system optimisation look like?

Conclusions
- Computer vision applications need careful optimisation
  - Understanding your system as a whole is a vital first step
  - Understanding each individual processor core type is the next
- Use tools to measure hardware counters across the entire platform
  - Whole-system views of the relative performance of heterogeneous architectures are invaluable
  - They let you decide where there is capacity to move workloads
  - They show how to target optimisations by exposing the limiting component within individual cores
  - Ideally, use these tools throughout the development process, not just at the end
- The Mali Ecosystem is making GPU Compute a reality today
  - ARM enables developers with platforms, drivers, tools and support
  - Industry leaders take advantage of ARM Mali GPU capabilities to innovate and deliver
  - Be one of them!
- Tomorrow at the EVA Summit, 4pm: “Understanding the Role of Integrated GPUs in Vision Applications”, Roberto Mijat

Ecosystem Resources
- www.malideveloper.com: Download guides, papers, tools (including DS-5 Streamline), etc.
- http://community.arm.com/welcome: Community forums, blogs and more
- malidevelopers@arm.com: Graphics and GPU Compute developer support
- http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/: A range of video and written tutorials for GPU Compute
- http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/: ARM Mali-T600 series GPU user-space binary drivers available for download
- Linaro BSP now available with Mali-T600 series GPU support
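
Worked example for the "Custom Charts" slide. The four chart expressions are plain ratios of raw counters, so they can be reproduced offline from exported counter totals. The following is a minimal sketch, assuming the totals for one sampling window are already available as integers. The `Counters` class, the metric names and the sample numbers are hypothetical and are not part of Streamline; only the counter names in the comments come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw counter totals for one sampling window (hypothetical container)."""
    gpu_cycles: int       # $MaliJobManagerCyclesGPUCycles
    tripipe_cycles: int   # $MaliCoreCyclesTripipeCycles
    a_instructions: int   # $MaliArithmeticPipeAInstructions
    ls_instructions: int  # $MaliLoadStorePipeLSInstructions
    ls_issues: int        # $MaliLoadStorePipeLSInstructionIssues


def percent(numerator: int, denominator: int) -> float:
    """Return 100 * numerator / denominator, or 0.0 when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def derived_metrics(c: Counters) -> dict:
    """Evaluate the four custom chart expressions, in slide order."""
    return {
        "tripipe_vs_gpu": percent(c.tripipe_cycles, c.gpu_cycles),
        "arith_vs_tripipe": percent(c.a_instructions, c.tripipe_cycles),
        "ls_issues_vs_tripipe": percent(c.ls_issues, c.tripipe_cycles),
        "ls_instr_vs_issues": percent(c.ls_instructions, c.ls_issues),
    }


def busier_pipe(metrics: dict) -> str:
    """Compare the two pipe percentages to suggest which pipe is the busier one."""
    if metrics["arith_vs_tripipe"] >= metrics["ls_issues_vs_tripipe"]:
        return "arithmetic"
    return "load/store"


# Invented numbers for illustration only; these are not measurements.
sample = Counters(gpu_cycles=1000, tripipe_cycles=800,
                  a_instructions=600, ls_instructions=100, ls_issues=200)
metrics = derived_metrics(sample)
print(metrics)
print(busier_pipe(metrics))
```

With these invented totals the arithmetic percentage exceeds the load/store percentage, so the sketch reports the arithmetic pipe as the busier one. A real comparison would use counters captured from the target device, and the simple two-way comparison here ignores the texture pipe.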