Measuring the Whole System
Holistic Profiling of CPU and GPU for Optimal Vision
Applications on ARM Platforms
Tim Hartley
COPYRIGHT © 2015 ARM
The Evolution of Mobile GPU Compute
[Timeline figure; axis marks: 2007, 2009, 2010, 2012, 2013-2015]
OpenGL® ES 1.1: fixed pipeline
ARM® Mali™-55 GPU
OpenGL ES 2.0: programmable pipeline
Mali-200, Mali-300, Mali-400 MP, Mali-450 MP
OpenCL™ Full Profile / RenderScript: portable heterogeneous parallel computation
Mali-T600 Series
OpenGL ES 3.1 Compute Shaders: GPU compute within the OpenGL ES API
Mali-T700 & T800 Series
Measuring the Whole System
Computer vision will, for some time, succeed in using every drop of processing power we give it
Techniques in computer vision are still evolving rapidly
New, complex, sustained low-power use cases
Building computer vision applications is an ever more complex process
The availability of more processors and processor types makes this even more so
Capturing and analyzing accurate and effective measurements from platforms plays a vital role in achieving optimal performance
Modern Computer Vision Applications
[Diagram: a vision application running across four CPU cores (each with NEON), a GPU and a DSP]
Inside an ARM Mali Midgard Core
SIMD: several components per operation
128-bit registers
VLIW: several operations per instruction word
Some operations are “free”
Built-in function library
Accelerated in hardware
T = max(A0, A1, LS, Tex)
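The max() relation above can be turned into a small helper that names the limiting pipe. This is a minimal sketch in C, assuming T is the cycle count for a piece of work and A0, A1, LS and Tex are the cycles each pipe needs for it; the slide does not define them. The struct, enum and function names are hypothetical and not part of any ARM API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: cycles each pipe of one shader core needs. */
typedef struct {
    uint64_t a0;   /* arithmetic pipe 0 */
    uint64_t a1;   /* arithmetic pipe 1 */
    uint64_t ls;   /* load/store pipe   */
    uint64_t tex;  /* texture pipe      */
} pipe_cycles;

typedef enum { PIPE_A0, PIPE_A1, PIPE_LS, PIPE_TEX } pipe_id;

/* T = max(A0, A1, LS, Tex): the pipes run in parallel, so the busiest
 * one sets the overall time. Optionally reports which pipe that is. */
uint64_t estimated_cycles(const pipe_cycles *c, pipe_id *limiting)
{
    assert(c != NULL);
    const uint64_t v[4] = { c->a0, c->a1, c->ls, c->tex };
    size_t best = 0;
    for (size_t i = 1; i < 4; i++) {
        if (v[i] > v[best]) {
            best = i;
        }
    }
    if (limiting != NULL) {
        *limiting = (pipe_id)best;
    }
    return v[best];
}
```

If the load/store pipe dominates, shrinking the arithmetic work will not shorten T, which is why identifying the limiting pipe comes before any optimisation.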
Hardware Counters
Counters per core
Active cycles
Pipe activity
L1 cache
Counters for the GPU
Active cycles
L2 caches
MMU
Accessed through DS-5 Streamline
Timeline of all hardware counters, and more
Explore the execution of the full application
Zoom in on details
DS-5 Streamline
Identify hotspots and system bottlenecks at a glance
Select from CPU/GPU counters
OS level and custom data sources
Select one or more tasks to isolate their contribution
Accumulate counters, measure time and find instant hotspots
Combined task switching trace and sample-based profile
Example: Complex Computer Vision Application
Lane and Car Detection
Streamline
Streamline: OpenCL Timeline
Optimisation Overview
[Flowchart, reconstructed from the figure text; branch layout approximate]
Limited by kernel execution time or memory management?
Memory management: ensure you are not copying memory unnecessarily
Kernel execution: limited by arithmetic ops or memory ops?
Memory ops:
High number of instruction re-issues? Yes: improve memory access pattern to improve cache efficiency
Limited to 64 threads? Large number of instruction cache misses? Yes: reduce register pressure; simplify or shorten kernels
No: vectorise the LS operations if possible; decrease memory accesses if possible
Arithmetic:
Limited to 64 threads? Large number of register bank conflicts? Large number of instruction cache misses? Yes: reduce register pressure; simplify or shorten kernels
No: vectorise the kernel if possible; decrease the arithmetic work if possible
After each change: limited by the same factors? Reiterate until done optimising
Deriving Meaning from Hardware Counters
Counters on their own usually don’t mean a huge amount
Combining counters is more useful
Comparing values to determine limiting pipes
Calculating more meaningful values from multiple values
New graph traces can be added from these counters
…and become an integral part of the timeline
Custom Charts: Bringing Counters Together
100 * $MaliCoreCyclesTripipeCycles / $MaliJobManagerCyclesGPUCycles
100 * $MaliArithmeticPipeAInstructions / $MaliCoreCyclesTripipeCycles
100 * $MaliLoadStorePipeLSInstructionIssues / $MaliCoreCyclesTripipeCycles
100 * $MaliLoadStorePipeLSInstructions / $MaliLoadStorePipeLSInstructionIssues
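The same four expressions can be computed offline from exported counter samples. This is a minimal sketch in C, assuming the raw counter values are already available as integers. The struct, field and function names are hypothetical; only the counter names in the comments come from the expressions above. The interpretation given to each ratio (core utilisation, arithmetic load, load/store load, load/store issue efficiency) is one reading of the expressions, not an official definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the five counters used by the chart expressions. */
typedef struct {
    uint64_t gpu_cycles;      /* $MaliJobManagerCyclesGPUCycles          */
    uint64_t tripipe_cycles;  /* $MaliCoreCyclesTripipeCycles            */
    uint64_t arith_instrs;    /* $MaliArithmeticPipeAInstructions        */
    uint64_t ls_issues;       /* $MaliLoadStorePipeLSInstructionIssues   */
    uint64_t ls_instrs;       /* $MaliLoadStorePipeLSInstructions        */
} mali_sample;

/* 100 * num / den, returning 0 for an idle sample (den == 0). */
double percent(uint64_t num, uint64_t den)
{
    return den == 0 ? 0.0 : 100.0 * (double)num / (double)den;
}

/* Share of GPU cycles in which the shader core's tripipe was active. */
double core_utilisation(const mali_sample *s)
{
    assert(s != NULL);
    return percent(s->tripipe_cycles, s->gpu_cycles);
}

/* Arithmetic instructions per tripipe cycle, as a percentage. */
double arithmetic_load(const mali_sample *s)
{
    assert(s != NULL);
    return percent(s->arith_instrs, s->tripipe_cycles);
}

/* Load/store instruction issues per tripipe cycle, as a percentage. */
double load_store_load(const mali_sample *s)
{
    assert(s != NULL);
    return percent(s->ls_instrs == 0 && s->ls_issues == 0 ? 0 : s->ls_issues,
                   s->tripipe_cycles);
}

/* Load/store instructions per issue: lower values suggest more re-issues. */
double load_store_efficiency(const mali_sample *s)
{
    assert(s != NULL);
    return percent(s->ls_instrs, s->ls_issues);
}
```

Comparing `arithmetic_load` with `load_store_load` for the same sample gives a first hint of which pipe limits a kernel.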
ALU Bound kernel
One load
One store
“n” ALU operations
__kernel void kernel_alu_bound(global float *arr, uint n)
{
    // One load per work-item.
    float value = arr[get_global_id(0)];
    // "n" dependent ALU operations.
    for (uint i = 0; i < n; i++)
    {
        value += sin(value);
    }
    // One store per work-item.
    arr[get_global_id(0)] = value;
}
L/S Bound kernel
One load
One store
No ALU operation
__kernel void kernel_memcpy(global const float *a, global float *b)
{
    // Each work-item copies one float4 (16 bytes) at its own index.
    const size_t i = get_global_id(0);
    float4 v = vload4(i, a);
    vstore4(v, i, b);
}
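For a copy kernel like this, a useful derived figure is effective bandwidth, since each work-item moves one float4 in and one float4 out. This is a minimal sketch in C, assuming the kernel's elapsed time in nanoseconds has been measured separately (for example with OpenCL event profiling). The function and constant names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One float4 load plus one float4 store; OpenCL float is 32 bits wide. */
enum { BYTES_PER_WORK_ITEM = 2 * 4 * 4 };

/* Effective bandwidth in GB/s (10^9 bytes per second).
 * Bytes per nanosecond is numerically the same as GB/s. */
double effective_bandwidth_gbps(uint64_t work_items, uint64_t elapsed_ns)
{
    assert(elapsed_ns > 0);
    const uint64_t bytes = work_items * BYTES_PER_WORK_ITEM;
    return (double)bytes / (double)elapsed_ns;
}
```

For instance, 1000 work-items finishing in 32000 ns move 32000 bytes, which is 1 GB/s. Comparing the result with the platform's memory bandwidth shows how close a load/store-bound kernel is to the hardware limit.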
Cache misses
One byte read every 64 bytes
One byte written every 64 bytes
Really bad cache utilisation!
__kernel void kernel_cache_misses(global uchar *a, global uchar *b)
{
    // Read and write only one byte out of every 64.
    b[64 * get_global_id(0)] = a[64 * get_global_id(0)];
}
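The cost of this access pattern can be put into numbers as the fraction of each fetched cache line that is actually used. This is a minimal sketch in C, assuming a 64-byte cache line and accesses aligned to the stride; the slide gives only the 64-byte stride, so the line size is an assumption. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction (0..1) of each fetched cache line that is actually used when
 * every access touches access_bytes and consecutive accesses are
 * stride_bytes apart. Assumes aligned, non-overlapping accesses. */
double cache_line_utilisation(size_t access_bytes, size_t stride_bytes,
                              size_t line_bytes)
{
    assert(access_bytes > 0 && line_bytes > 0);
    assert(stride_bytes >= access_bytes);
    /* With a stride of a full line or more, each line serves one access. */
    const size_t accesses_per_line =
        stride_bytes >= line_bytes ? 1 : line_bytes / stride_bytes;
    const double used = (double)(accesses_per_line * access_bytes);
    const double fraction = used / (double)line_bytes;
    return fraction > 1.0 ? 1.0 : fraction;
}
```

Under these assumptions the kernel above uses 1/64 of every line it fetches (about 1.6%), whereas a contiguous byte-by-byte copy would use all of it.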
What does good whole-system optimisation look like?
Conclusions
Computer vision applications need careful optimisation
Understanding your system as a whole is a vital first step
Understanding each individual processor core type is the next
Use tools to measure hardware counters across the entire platform
Whole-system views of the relative performance of heterogeneous architectures are invaluable
Allows you to decide where there is capacity to move workloads
And how to target optimisations by exposing the limiting component within individual cores
Ideally, use these tools throughout the development process, not just at the end
The Mali Ecosystem is making GPU Compute a reality today
ARM enables developers with platforms, drivers, tools and support
Industry leaders take advantage of ARM Mali GPU capabilities to innovate and deliver
Be one of them!
Tomorrow at the EVA Summit, 4pm:
“Understanding the Role of Integrated GPUs in Vision Applications”, Roberto Mijat
Ecosystem Resources
www.malideveloper.com
Download guides, papers, tools (including DS-5 Streamline), etc.
http://community.arm.com/welcome
Community forums, blogs and more
malidevelopers@arm.com
Graphics and GPU Compute developer support
http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/
A range of video and written tutorials for GPU Compute
http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/
ARM Mali-T600 series GPU user-space binary drivers available for download
Linaro BSP now available with Mali-T600 series GPU support