A Novel Technique to Improve Parallel Program
Performance Co-executing with Dynamic Workloads
Murali Krishna Emani
School of Informatics
University of Edinburgh, UK
m.k.emani@sms.ed.ac.uk
Michael O’Boyle
School of Informatics
University of Edinburgh, UK
mb@sms.ed.ac.uk
Abstract—In current multi-core and many-core computing systems, multiple
parallel programs co-execute by sharing resources in a highly dynamic
environment. This dynamicity includes variations in hardware, software, input
data and external workloads. It leads to contention in the system and seriously
degrades the performance of some or all co-executing programs. Many existing
solutions assume that a program executes in isolation and devise techniques
based on this incorrect assumption. Here we propose a machine learning based
technique to improve the performance of a program when it executes alongside
external workloads that are dynamic in nature. We use static program features
and dynamic runtime features, obtained during the compilation and execution
phases respectively. We show that our approach achieves a speedup of 1.5x over
the best existing scheme on a 12-core machine.
Keywords: Parallelism mapping, Workloads, Compile and runtime optimization,
Machine Learning
I. INTRODUCTION
Multicore-based parallel systems now dominate the computing landscape, from
data centers to mobile devices. Efficient techniques for mapping programs onto
the underlying multi-core and many-core hardware are essential for improving
program performance in a dynamic environment. Designing such solutions for
parallel programs is particularly challenging in these scenarios owing to the
complex underlying implementation of parallel programming models. General
research solutions to the broad problem of parallelism mapping tend to ignore
the basic reality of a shared and interactive execution environment. In any
realistic scenario, the computing environment is dynamic in nature. The
parameters that vary include program input data, hardware, software and the
load caused by external programs.
Programs that cause external load share computing resources with a wide variety
of emerging workloads that range from light to heavy, leading to significant
resource contention [1], [2]. Hardware is becoming increasingly heterogeneous,
with processors of asymmetric computing capabilities. Any hardware failure
changes the amount of available computing resources. If this occurs during
program execution, the program needs to adapt immediately to the resources that
remain. Input data-driven applications, where the input data size varies during
program execution, are emerging in day-to-day computing. This has a profound
effect on the memory and I/O systems. The recent Big Data trend adds more
complexity when parallel programs need to process huge amounts of data [3].
Software upgrades are also frequent, and an upgraded version may provide a
different programming environment with a different set of features to improve
application performance.
Thus the existing thread-to-core parallel mapping solutions may not be
appropriate in these scenarios. Mapping solutions need to be revised to take
the dynamic environment into consideration. Given this highly dynamic
environment, applications need to adapt to varying parameters and autotune in
order to execute efficiently with minimal intervention from the application
programmer.
A widespread assumption in the parallel computing research community is that
the program under consideration is the only execution unit in the system and
that the resources remain the same throughout its execution. This may be true
for certain applications, but for the majority of applications this assumption
no longer holds.
a) Hardware Adaptability: Modern NUMA machines are made up of many
heterogeneous cores. These cores vary in operating frequency and in type
(multicore CPU versus GPU). Applications executing on these systems need to
leverage the maximum potential of these processors. As hardware is prone to
different types of failure, which are not unusual in any computing environment,
special mechanisms are employed to ensure that there is minimal disruption to
running applications. Planned outages are a widely employed method, in which
computing units are either switched off or migrated elsewhere during hardware
repair or maintenance. The major problem occurs when there is a sudden hardware
failure that leaves minimal time to provide alternate computing resources for
executing applications. One prominent hardware failure is the loss or
malfunction of processors, as shown in figure 1. This reduces the number of
available computing resources, which can adversely affect the applications. For
latency-sensitive applications, performance can degrade drastically due to the
delay caused by the shortage of computing units.
Several techniques exist today to ensure that applications keep running
smoothly when a hardware failure occurs. However, these techniques do not
reduce the load in proportion to the available computing resources. Cloud
computing fits this scenario: applications executed in a cloud are resilient to
hardware failures owing to the elastic nature of the cloud. However, cloud
computing deals with this problem at a macro level and is still out of reach
for many computing applications which can be migrated directly to a cloud.
Fig. 1. Thread mapping strategies for two programs P1, P2 (a) default with
fully-functional processors (b) default with faulty processors (c) ideal with
faulty processors
b) Co-execution and contention: The complexity of managing the smooth execution
of an application during a hardware failure is further increased by the
contention caused by external programs co-executing with the current
application. Minimizing the contention caused by co-executing programs
competing for shared resources is a widely studied area. One widely used
assumption in solutions proposed for mapping parallel programs is that a
computing machine is fully dedicated to an application and that all resources
are static throughout the program lifetime. This assumption is not necessarily
true on the majority of computing platforms.
In most purely static compiler approaches, program structure and machine
characteristics are analysed to determine the best mapping of a program. These
approaches have no knowledge of program behaviour at runtime and they typically
make simplifying assumptions about resource availability and external
workloads. On the other hand, pure runtime system approaches are generic in
adapting to environment change. However, they lack sufficient program
knowledge, which is a great source of potential performance improvement.
Solutions employing machine learning based approaches [4], [2], [5], [6], [7]
are proving to be reliable, as they show promising results in improving program
performance during parallelism mapping. These approaches are generally trained
offline using a training data set. Features are collected during the training
runs and the model is learnt using different methods. During deployment, these
features are extracted from the system and input to the learnt model, which
predicts the optimal mapping policy. Existing techniques rely on either program
features or runtime features only, or may degrade the performance of external
workloads while trying to optimize the current target program and reduce
contention [8]. In this work we aim to improve a parallel program's performance
under resource contention when it is executing with varying external programs.
We propose an approach in which the machine learning model uses both static and
dynamic features to deliver better execution efficiency in unseen dynamic
environments.
Contributions
Our contributions include:
• We propose a novel technique using a machine learning model to enable a
parallel program to adapt to changing workloads.
• We show the effectiveness of our approach by achieving a better speedup than
the OpenMP default, and 1.5x over the best existing scheme [9].
• Our technique has no impact on external workloads and does not degrade their
performance.
Throughout this paper, Target denotes the program we are trying to optimize and
Workload denotes any other program co-executing with the target program that
generates load in the system. We use Core and Processor interchangeably to
denote a processing unit.
II. MOTIVATION
To illustrate that a program's performance degrades significantly when it
co-executes with another program, we ran a target program, cg from the NAS
parallel benchmark suite, along with another program chosen from the same
suite. We measured the speedup over the OpenMP default with different numbers
of target program threads. To see the variation in the nature of workloads, we
repeated the experiments with an increasing number of workload threads. Figure
2 shows the resulting speedup of the target program. We observe that the
default behaviour of the target is severely affected in the presence of an
external workload, and the amount of degradation increases with the number of
workload threads. This shows that if a program is run as it is with the OpenMP
default policy, it always executes with the same number of threads, equal to
the maximum number of available processors, and slows down greatly. This is due
to the increased resource contention among multiple programs executing at the
same time.
Fig. 2. Graph showing the performance degradation of a program co-executing
with different external workloads
In figure 3 we show a microscopic view of the thread configurations assigned by
different policies when a target program is co-executed with a workload [1]. We
observe that the OpenMP default assigns the same number of threads irrespective
of any external programs. A state-of-the-art technique uses a hill climbing
optimization policy in which it adjusts thread numbers in unit steps. The best
possible scheme, the oracle thread configuration, is also shown. All existing
techniques vary greatly in the threads assigned to the parallel loops of the
program.
Fig. 3. A microscopic view of thread number assignment by different approaches
(Plot not reproduced: number of threads of the target program against time,
with the co-executing workloads shown on the same time axis.)
Speedup obtained by the various approaches in this scenario is shown in figure
4. The OpenMP default scheme performs barely the same as sequential execution,
while the best static and hill climbing methods improve over the default.
However, we observe that the best possible oracle scheme can further improve
program performance, and current existing techniques are nowhere near the
oracle. Our approach aims to fill this gap to achieve the best possible
speedup.
Fig. 4. Comparison of speedup of various techniques. We observe that there is
great scope for performance improvement. Our solution aims to fill this gap
TABLE I. LIST OF FEATURES
Type      Feature
Static    Total number of Load/Store instructions
Static    Total number of Branches
Static    Total number of Instructions
Dynamic   Number of available Processors
Runtime   # Workload threads
Runtime   Run queue length
Runtime   cpu load ldavg-1
Runtime   cpu load ldavg-5
III. APPROACH
Our goal is to develop a model that determines the optimal number of threads
for each parallel section of the target program, based on static program
features and system runtime features that represent the external workload.
Instead of building a model that is bound to a particular setting, we use
supervised learning to generate this heuristic automatically. This portable
approach ensures that we can use our technique on any platform.
Figure 5 describes how our approach works. For every new target program, during
its compilation phase the compiler extracts significant characteristic
information about the program in the form of static code features. It then
links the compiled program with a light-weight runtime library that contains
the automatically learnt heuristic. For each parallel section, the compiler
inserts a call to the runtime, passing the static program features of that
parallel section as a parameter. At execution time, the runtime combines these
program features with dynamic external workload information and uses them as
inputs to our predictive model, which returns the optimal number of threads for
this parallel section. The program then executes the parallel loop with the
newly determined thread number.
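The runtime decision described above can be sketched as follows. This is a
minimal illustration and not the authors' runtime library: the function name
`choose_threads`, the `predict` callback standing in for the trained model, and
the clamping of the prediction to the range 1..max_threads are all assumptions
made for the example. The fallback to all available threads when no workload is
present follows the deployment behaviour described in the paper.

```python
from typing import Callable, Sequence


def choose_threads(
    static_features: Sequence[float],
    runtime_features: Sequence[float],
    workload_present: bool,
    max_threads: int,
    predict: Callable[[Sequence[float]], float],
) -> int:
    """Pick a thread count for one parallel section (hypothetical helper)."""
    if not workload_present:
        # No external workload: keep the default of one thread per core.
        return max_threads
    # Combine compile-time and run-time features into one model input.
    feature_vector = list(static_features) + list(runtime_features)
    predicted = round(predict(feature_vector))
    # Assumption: clamp the prediction to a usable range.
    return max(1, min(max_threads, predicted))


def toy_model(feature_vector: Sequence[float]) -> float:
    """Stand-in for the trained model: leave room for the workload threads."""
    workload_threads = feature_vector[-1]
    return 12 - workload_threads


# Three made-up static features, then [processors, workload threads].
print(choose_threads([0.3, 0.1, 1.0], [12, 8], True, 12, toy_model))  # prints 4
```

In a compiled OpenMP program the returned value would be applied to the next
parallel region, for example through the standard `num_threads` clause.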
We build our machine learning model using a generic three-step process for
supervised learning: (1) generate training data, (2) train a model and (3) use
the heuristic. We generate training data by exhaustively running each training
program together with a workload program. During training, we vary the number
of threads used for the target and the workload programs and record their
execution time. While generating the training data, we collect a set of
features that characterize the target program and the external workload. This
training data is used to build the model offline. Once the model is deployed,
no further learning takes place. The feature set is a collection of feature
vectors. Each vector consists of the numerical values of the chosen features of
the program and the dynamic workload.
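Step (1) can be illustrated with a short sketch. It is a simplification under
stated assumptions: the callbacks `run_and_time` and `runtime_features` are
hypothetical placeholders for launching a real benchmark and sampling the
system, one thread count is chosen per program rather than per parallel loop,
and `fake_time` is a made-up cost model used only so that the example runs
without any benchmark.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def build_training_set(
    programs: Dict[str, Sequence[float]],
    workload_thread_counts: Sequence[int],
    max_threads: int,
    run_and_time: Callable[[str, int, int], float],
    runtime_features: Callable[[int], Sequence[float]],
) -> List[Tuple[List[float], int]]:
    """Return (feature vector, best thread count) pairs."""
    samples = []
    for name, static_features in programs.items():
        for w_threads in workload_thread_counts:
            # Exhaustive search: time the target with every thread count.
            timings = {
                t: run_and_time(name, t, w_threads)
                for t in range(1, max_threads + 1)
            }
            best_threads = min(timings, key=timings.get)
            fv = list(static_features) + list(runtime_features(w_threads))
            samples.append((fv, best_threads))
    return samples


def fake_time(name: str, t: int, w: int, cores: int = 4) -> float:
    """Made-up cost: ideal scaling plus a penalty per oversubscribed thread."""
    return 100.0 / t + 20.0 * max(0, t + w - cores)


samples = build_training_set(
    programs={"toy": [0.5]},           # one made-up static feature
    workload_thread_counts=[0, 2, 4],
    max_threads=4,
    run_and_time=fake_time,
    runtime_features=lambda w: [w],    # here: just the workload thread count
)
for fv, best in samples:
    print(fv, "->", best)
```

With this toy cost model the best thread count drops from 4 to 2 once the
workload starts competing for the four cores, which is the kind of label the
model is then trained to predict.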
Feature vector: The features used are arranged as a numerical vector. Static
features of a program include the total number of instructions and memory and
branch summary information, where the corresponding values are normalized to
the total number of instructions. The workload is characterized by the load it
generates on the CPU. We collect dynamic workload features from the proc file
system of the Linux kernel. The sar tool on Linux collects cumulative activity
counters of the operating system and can be used to obtain system
characteristics at every time unit. To characterize the runtime environment, we
use three features from the /proc filesystem: run queue length, ldavg-1 and
ldavg-5. The run queue length represents the number of processes waiting for
scheduling in the Linux kernel, which gives an indication of how many tasks are
running on the system. ldavg-n is the system load average, calculated as the
average number of runnable or running tasks plus the number of tasks in
uninterruptible sleep over an interval of n (n = 1, 5) minutes. These runtime
features reflect the load created by the external workloads. The number of
workload threads and the number of cores form the rest of the feature vector.
The 8 features listed in table I constitute the feature vector that is fed as
input to our machine learning model.
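As a rough sketch of how such a vector could be assembled, the snippet below
parses the text of /proc/loadavg and combines it with static counts. Several
choices here are assumptions rather than details from the paper: the function
names are hypothetical, the count of currently runnable scheduling entities in
the fourth field of /proc/loadavg is used as a stand-in for the run queue
length, the total instruction count is kept unnormalized, and the order of the
entries simply follows table I.

```python
from typing import Dict, List


def parse_loadavg(text: str) -> Dict[str, float]:
    """Parse the single line of /proc/loadavg."""
    fields = text.split()
    running, _total = fields[3].split("/")
    return {
        "ldavg_1": float(fields[0]),
        "ldavg_5": float(fields[1]),
        # Currently runnable scheduling entities, used here as a
        # stand-in for the run queue length (an assumption).
        "run_queue": float(running),
    }


def build_feature_vector(
    load_store: int,
    branches: int,
    instructions: int,
    processors: int,
    workload_threads: int,
    loadavg_text: str,
) -> List[float]:
    """Arrange the 8 features of table I as one numeric vector."""
    load = parse_loadavg(loadavg_text)
    return [
        load_store / instructions,  # static, normalized to instruction count
        branches / instructions,    # static, normalized to instruction count
        float(instructions),        # static, kept raw in this sketch
        float(processors),
        float(workload_threads),
        load["run_queue"],
        load["ldavg_1"],
        load["ldavg_5"],
    ]


# On Linux the text would come from: open("/proc/loadavg").read()
sample = "3.25 2.10 1.05 5/312 4242"
fv = build_feature_vector(2000, 500, 10000, 12, 8, sample)
print(fv)
```

Parsing a string rather than reading the file directly keeps the example
runnable on any platform; the sample line is made up.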
A. Training Data
We train our heuristic using training data collected from a synthetic workload
setting and apply the trained heuristic to various unseen dynamic runtime
environments. This differs from previous approaches [10], where the model is
trained for each target program. Training data are generated from experiments
in which each target program is executed with one workload program, and we vary
the number of threads used by the workload program. To find the best possible
scheme for each such experiment, we exhaustively assign different thread
numbers to each parallel loop. We then record the best performing scheme and
its thread setting. We extract runtime features during the training run. The
runtime features, the static program features and the best-performing thread
number are put together to form the training data set. Although producing
training data takes time, it is a one-off cost incurred by our heuristic.
Generating and collecting data is a completely automatic process and is
performed offline. The model is trained only once, offline, and then frozen; no
further learning takes place during program execution. Figure 6 depicts the
training phase of our machine learning model.
Fig. 5. An overview of our approach. During the compilation phase, program
features are extracted. These are then combined with runtime features if a
workload exists, and the combined features are fed to the predictive model,
which determines the optimal thread number. Otherwise the OpenMP default number
of threads is returned.
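To make the offline training step more concrete, here is a small pure-Python
sketch of a perceptron with one hidden layer trained by back propagation, the
kind of model the paper describes in its machine learning model section. It is
not the authors' implementation. The number of hidden units, the learning rate,
the number of epochs, the tanh activation, the scaling of inputs and targets to
the range 0..1, and the toy data set are all assumptions chosen for
illustration.

```python
import math
import random


def train_mlp(samples, hidden=4, epochs=500, lr=0.05, seed=0):
    """Train a 1-hidden-layer perceptron with stochastic back propagation.

    samples: list of (inputs, target) pairs with numeric values.
    Returns (predict, history), where history holds the mean squared
    error measured before training and after every epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # w1[j] holds the input weights of hidden unit j, plus a bias term.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
          for _ in range(hidden)]
    # w2 holds the output weights, plus a bias term.
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1])
             for row in w1]
        y = sum(w * v for w, v in zip(w2, h)) + w2[-1]
        return h, y

    def mse():
        return sum((forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    history = [mse()]
    for _ in range(epochs):
        for x, t in samples:
            h, y = forward(x)
            err = y - t  # derivative of 0.5 * (y - t)^2 with respect to y
            # Hidden deltas use the output weights before they are updated.
            deltas = [err * w2[j] * (1.0 - h[j] ** 2) for j in range(hidden)]
            for j in range(hidden):
                w2[j] -= lr * err * h[j]
            w2[-1] -= lr * err
            for j in range(hidden):
                for i in range(n_in):
                    w1[j][i] -= lr * deltas[j] * x[i]
                w1[j][-1] -= lr * deltas[j]
        history.append(mse())

    return (lambda x: forward(x)[1]), history


# Toy data (made up): more workload threads -> fewer target threads,
# both scaled by the 12 hardware threads of the evaluation machine.
toy = [([w / 12.0], (12 - w) / 12.0) for w in range(0, 13, 2)]
predict, history = train_mlp(toy)
print("error before:", history[0], "after:", history[-1])
print("suggested threads with 6 workload threads:", round(12 * predict([0.5])))
```

The printed error values depend on the random initialization, so only the
downward trend should be read from them.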
B. Machine learning model
Our machine learning model is based on an artificial neural network [11]. We
employ a standard multilayer perceptron with one hidden layer that learns by
the back propagation algorithm, which is a generalized form of the least mean
squares algorithm. This heuristic is constructed automatically from the
training data. Figure 6 describes how the heuristic is trained. We supply the
training algorithm with training data collected offline. Each data item
includes the static program features of the training program, the runtime
features and the best mapping. The training algorithm tries to find a function
$\gamma$ which takes in a feature vector $\vec{fv}$ and gives a prediction
$th = \gamma(\vec{fv})$ that closely matches the actual best mapping
$th_{optimal}$ in the training data set.
C. Deployment
Once we have gathered training data and built the heuristic, we can use it to
select the mapping for any new, unseen program. At execution time, the library
is called and checks whether there is a workload program running on the system.
If a workload program is detected, runtime features are collected from /proc
and act as inputs to the neural network, which outputs the optimal number of
threads for the target program. The runtime uses this number of threads to
execute the corresponding parallel region. If there is no workload, the target
program runs with the default configuration using all available physical
threads.
IV. EXPERIMENTAL SETUP
A. Hardware and Software Configurations
We carried out experiments to evaluate our approach on an Intel Xeon platform
with two 2.4 GHz six-core processors (12 threads in total) and 16GB RAM, with
the Red Hat 4.1.2-50 operating system running Linux kernel 2.6.18. All programs
were compiled using gcc 4.6 with parameters “-O3 -fopenmp”.
B. Benchmarks
We used all C programs from the NAS parallel benchmark suite [12], the
SPECOMP-2006 suite [13] and the Parsec benchmark suite [14]. These suites
provide a pool of representative parallel programs and emerging workloads.
C. Varying workloads
To introduce dynamicity in the workloads, we invoke workload programs at high
frequency and low frequency, where the inter-arrival time between two programs
is 2 and 5 seconds respectively. To show variation in the nature of workloads,
we define three categories of workload, minimal, normal and heavy, as shown in
table II.
TABLE II. WORKLOAD SETTINGS
Workload   Number of programs   Number of threads
Minimal    <2                   <6
Normal     [2-5]                [6-12]
Heavy      >5                   >12
V. RESULTS
In this section we compare the performance improvement gained by our approach
with that of the existing state-of-the-art technique. We first summarize the
performance of our approach against alternative approaches across all workload
settings. Due to limited space, we omit detailed results and performance graphs
for each workload setting and each workload frequency. Then we evaluate our
approach, as a case study, on a workload scenario derived from a large-scale
warehouse system. We show the performance improvement for the target program,
averaged over all benchmark programs, for six experimental settings. To show
the effectiveness of our technique in another dimension, we show the impact of
our approach on external workloads for each benchmark program, averaged across
the different experimental scenarios.
Figure 7 shows the performance results on six different workload scenarios,
averaged across all benchmark programs. These scenarios are formed by combining
two levels of frequency with each workload category: minimal, normal and heavy.
In a given workload setting, the speedup improvement varies for different
programs. Hence, the min-max bars in this graph show the range of speedups
achieved across the various target programs.
Fig. 6. Training phase of the machine learning model used in our approach. We
use Artificial Neural Networks to build our model
(Diagram not reproduced; its labels are Training Programs, Program Extraction,
Static Features, Training Runs, Runtime Features, Best Mappings, Learning
Algorithm and Neural Network.)
Our approach not only gives better performance than the OpenMP default but also
significantly outperforms the best existing technique, which uses hill climbing
optimization, across all workload scenarios. For minimal workloads, the OpenMP
default scheme performs reasonably well as the amount of resource contention is
at its lowest. In this setting, our automatic approach gives its smallest
improvement, with a speedup of 1.5x. This still translates to a 1.15x
improvement over the best existing scheme. For normal and heavy workload
settings, our approach has a clear advantage, with speedups above 2.4x (up to
3.2x) over the OpenMP default scheme. This translates to a speedup of over
1.36x (up to 2.3x) when compared to the hill climbing approach. The min-max
bars show that our technique delivers stable performance for all workload
scenarios without slowing down any program. Overall, the automatic approach
achieves a geometric mean speedup of 2.3x. This translates to a 1.5x
improvement over the best existing scheme.
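The way these summary figures relate to each other can be checked with a short
calculation. The sketch below computes a geometric mean of speedups and shows
that, when two schemes are both measured against the same default, the
geometric mean of their per-setting ratios equals the ratio of their geometric
means. The helper name and the per-setting values are hypothetical and are not
results from the paper; only the 2.3x and 1.5x figures are taken from the text,
and the implied value for hill climbing assumes that both are geometric means
over the same settings.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of positive speedups, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("speedups must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-setting speedups over the OpenMP default.
ours = [1.5, 2.4, 3.2]
hill_climbing = [1.3, 1.7, 1.4]

ratios = [o / h for o, h in zip(ours, hill_climbing)]
mean_of_ratios = geometric_mean(ratios)
ratio_of_means = geometric_mean(ours) / geometric_mean(hill_climbing)
print(abs(mean_of_ratios - ratio_of_means) < 1e-9)  # prints True

# From the reported figures: 2.3x over default and 1.5x over hill climbing
# would put hill climbing at roughly 2.3 / 1.5 over the default.
print(round(2.3 / 1.5, 2))  # prints 1.53
```

The geometric mean is the usual choice for averaging speedup ratios because it
gives the same comparison regardless of which scheme is used as the baseline.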
A. Effect on workload
As seen in figure 8, where we compare the speedups of the external workload
under various approaches, the default and the best existing scheme both affect
the workload and degrade its performance. This is undesirable, as the motive of
any optimization technique should be to improve a program's performance without
depleting resources and degrading other programs in a greedy fashion. Our
approach does not penalize the workload in any experimental setting, as we
reduce the system contention to a great extent, which indirectly benefits the
workload as well. Hence we observe a mild improvement in workload performance.
B. Case study
To validate our approach in a real-world setting, we selected a workload
environment based on a sample from an in-house high performance computing
cluster. A large number of different jobs are submitted to this cluster by many
departments that require extensive computational resources. The distribution of
job arrivals in this cluster and the number of requested processors over a
period of 30 hours were obtained from the built-in system log. We extracted
jobs from a 15-minute snapshot of this real-world workload, taken from the log
that recorded system activity over this period. This snapshot was selected to
highlight variation in the workload pattern. For this workload scenario, figure
9 shows the speedup of one target program, lu, under the different schemes. Our
predictive model fares better than the OpenMP default and the state-of-the-art
technique, with performance improvements of 1.37x and 1.22x respectively. This
indicates that our model adapts well to dynamic external workload programs in a
realistic computing environment. Even in this experiment, the impact of our
approach on the workload is minimal, creating a win-win situation for both
target and workload programs.
Fig. 9. Performance with live system workload
VI. WORK IN PROGRESS: LEARN-ON-THE-FLY
Machine learning models show significant performance improvement when the
evaluation settings are similar to those they were trained for. Exhaustively
exploring all possible states to find the best scheme during offline training
is not always possible. If some parameter of the execution environment for
which the model was not trained changes during program execution, it is highly
likely that the predictions will not be optimal for the changed environment. In
all existing machine learning based mapping techniques, there is no mechanism
to verify whether the prediction made was indeed the best possible one.
Moreover, without realizing that predictions are faulty, the models continue to
apply the same logic in the new environment. We are currently working on the
problem of how to determine whether the model's predictions are invalid in a
changed execution environment. In such cases, if the model can be enabled to
learn on the fly, it can avoid the pitfalls of mispredictions. We use concepts
from reinforcement learning to get feedback from the computing environment to
verify the quality of the predictions and, if necessary, learn and update the
model to improve the prediction quality on the fly. Figure 10 shows an overview
of a generic reinforcement learning framework.
VII. CONCLUSION
This paper has introduced a novel technique based on predictive modeling to
devise an optimal mapping policy for a parallel program co-executing with
dynamic external workloads. This approach employs static and dynamic
parameters, in the form of program features and system runtime features, to
optimize an application. Our method improves program performance significantly
(1.5x) over the best existing technique in spite of severe resource contention,
with minimal impact on external workloads. To strengthen our proposal, we
evaluated this method in a real-world case study. Further, we envision
improving this technique to enable any parallel program to adapt to a dynamic
environment using online learning as its key strength, and to exploit
heterogeneous cores with a mix of OpenMP and OpenCL programs.
Fig. 7. Comparison of our approach over OpenMP default and state-of-art scheme.
We improve program performance by 1.5x over best existing scheme. Ranges over
bars denote extent of speedup improvement for wide variety of benchmark
programs.
(Plot not reproduced: speedup over default for Hill Climbing and Our Approach
across light, medium and heavy workloads at low and high frequency, plus the
mean.)
Fig. 8. Comparison of effect of various techniques on external workload. Our
technique doesn’t penalize the workload in any case creating a win-win
situation for target and workloads.
Fig. 10. Reinforcement Learning framework where the agent improves its control
logic based on the feedback obtained from its interaction with the environment
REFERENCES
[1] M. K. Emani, Z. Wang, and M. F. P. O’Boyle, “Smart, adaptive mapping of
parallelism in the presence of external workload,” Proceedings of the 2013
IEEE/ACM International Symposium on Code Generation and Optimization (CGO),
vol. 0, pp. 1–10, 2013.
[2] D. Grewe, Z. Wang, and M. F. P. O’Boyle, “A workload-aware mapping approach
for data-parallel programs,” in HiPEAC ’11, pp. 117–126.
[3] D. Vengerov, L. Mastroleon, D. Murphy, and N. Bambos, “Adaptive data-aware
utility-based scheduling in resource-constrained systems,” J. Parallel Distrib.
Comput., vol. 70, no. 9, pp. 871–879, 2010.
[4] J. Martinez and E. Ipek, “Dynamic multicore resource management: A machine
learning approach,” in Micro ’09, pp. 8–17.
[5] P. Radojković, V. Čakarević, M. Moretó, J. Verdú, A. Pajuelo, F. J.
Cazorla, M. Nemirovsky, and M. Valero, “Optimal task assignment in
multithreaded processors: a statistical approach,” in ASPLOS ’12, pp. 235–248.
[6] Z. Wang and M. F. O’Boyle, “Mapping parallelism to multi-cores: a machine
learning based approach,” in PPoPP ’09, pp. 75–84.
[7] R. Bitirgen, E. Ipek, and J. F. Martinez, “Coordinated management of
multiple interacting resources in chip multiprocessors: A machine learning
approach,” in Proceedings of the 41st annual IEEE/ACM International Symposium
on Microarchitecture, MICRO 41, pp. 318–329, 2008.
[8] J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa, “Contention aware
execution: online contention detection and response,” in CGO ’10, pp. 257–265.
[9] A. Raman, A. Zaks, J. W. Lee, and D. I. August, “Parcae: a system for
flexible parallel execution,” in PLDI ’12, pp. 133–144.
[10] R. W. Moore and B. R. Childers, “Using utility prediction models to
dynamically choose program thread counts,” in ISPASS ’12, pp. 135–144.
[11] C. M. Bishop, Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer-Verlag New York, Inc., 2006.
[12] “NAS parallel benchmarks 2.3, OpenMP C version.”
http://phase.hpcc.jp/Omni/benchmarks/NPB/index.html.
[13] “SPECOMP Benchmark suite.” http://www.spec.org/omp/.
[14] “Parsec benchmark suite.” http://parsec.cs.princeton.edu/.