 
        Faculty of Science and Technology
MASTER’S THESIS
Study Program / Specialization: Computer Science
Spring Semester, 2014
Open / Restricted Access
Writer: Piyush Duggal
…………………………………………
(Writer’s signature)
Faculty supervisor: Chunming Rong (UiS)
External supervisor: Yann Chagourin (Accenture)
Thesis title: SAP PA: What is the Inside Chemistry? Predicting future of Predictive Analysis
Credits (ECTS): 30
Key words: SAP, Predictive analysis, Association, Apriori, Regression, PAL, Cluster Analysis,
K-Means, SOM, ABC, Scaling
Pages:
Enclosure: CD
Stavanger 15/06/2014
Front page for master thesis
Faculty of Science and Technology
Decision made by the Dean October 30th 2014
How Worth is Predictive Analysis?
Predicting the Future & Exploring Inside Chemistry of SAP PA
Uncovering value in the data
Piyush Duggal
Department of Electrical and Computer Engineering
University of Stavanger
E-mail: p.duggal@stud.uis.no
Thesis submitted in partial fulfillment of the
Requirements for MASTER DEGREE
In Computer Science
June 14, 2014
Acronyms
BI      Business Intelligence
BIT     Business Intelligence Tools
DB      Database
DM      Data Mining
DS      Data Sources
DSMS    Data Stream Management System
DW      Data Warehouse
EAI     Enterprise Application Integration
ETL     Extract, Transform and Load
HANA    High Performance Analytic Appliance
KPI     Key Performance Indicator
OLAP    Online Analytical Processing
PA      Predictive Analysis
PAL     Predictive Analysis Library
PMML    Predictive Modelling Markup Language
RDBMS   Relational Database Management System
RT      Real Time
SOM     Self-Organizing Maps
SVID    SAP Visual Intelligence Document
SPAR    SAP Predictive Analysis Archive file
Abstract
The enormous growth in analytical data, together with a growing awareness of the advantage of
anticipating the future, has brought Predictive Analysis into the picture. It has the potential to be
one of the efficient and competitive technologies that give an edge to business operations. The
ability to predict future market conditions and to know customers’ needs and behavior in advance
is of interest to every organization. Other areas of interest include maintenance prediction, where
we try to predict when and where a piece of equipment or a component will break, and fraud
detection for insurance and banking companies. The SAP Predictive Analysis tool is the sum of the
efforts and investments SAP has made, through support for the open source statistics language R
and many built-in predictive algorithms. The tool supports the definition, visualization, processing
and deployment of predictive analysis processes more effectively than was done or imagined
before. Other tools, such as SAS and InfiniteInsight, have been on the market for quite some time,
but SAP has now strategically come up with an impressive investment: HANA (an in-memory
database) combined with PA gives SAP an edge over competitors, as it has a powerful selling case
that allows business users to do predictive analysis on huge amounts of data with a user-friendly
tool that still, via R support, lets expert users develop their own algorithms.
Decision support systems based on predictive models are increasing in popularity as
organizations collect more data than decision makers can handle manually. These predictive
models can be applied to find potentially valuable patterns in the data, or to predict the outcome
of some event. This report discusses PA as a concept: why it is necessary, the value it adds to
business, how analytical users can try to predict the future of their business operations, the
algorithms involved, hot topics and trends, challenges and criteria for success, and more.
Predictive analytics thus gives insight into future outcomes and trends, based on information
extracted from existing data sets with an associated probability of outcome, with the help of
models, pattern recognition and statistical algorithms. Software for the PA process can be deployed
on-premises for enterprise users or accessed via the cloud, and various solutions, both proprietary
and based on open source technologies, are available in the market, for instance Angoss, IBM
Predictive Analytics, KXEN, Oracle Data Mining, SAP PA and Statistica.
Acknowledgements
I would like to thank Prof. Chunming Rong and Yann Chagourin, my supervisors, for their
valuable advice, support and contributions in every phase of the thesis work. My deepest gratitude
goes to Chunming for his relentless positive suggestions and insightful comments throughout. He
was always available whenever I needed help. I would like to extend my sincere thanks to Yann
Chagourin, Analytics manager with Accenture, who has vast experience with data mining and
analytics, for his support. The thesis would never have been possible without his help. I would like
to thank my friends Bikash Agarwal, Tormod Lea, Raji Khushi and Marina Samohvalova Wifi for
their inspiration and help. I would also like to thank all my wonderful colleagues at Accenture
for their encouraging words.
Last but not least, I would like to thank all my family and friends in Norway and India
for helping me complete my thesis successfully. I also greatly thank God for giving me the courage
and energy to finish my Master program despite my other responsibilities. Without his blessings,
I would never have been able to reach the achievements I have until now.
Piyush Duggal
University of Stavanger
Preface
This thesis is submitted in partial fulfilment of the requirements to complete the Master
of Science (M.Sc.) degree at the Department of Electrical & Computer Engineering at the
University of Stavanger (UiS), Stavanger, Norway.
The work can be seen as a case study that tries to understand and explore the inner
mechanism of a new tool from SAP called Predictive Analysis. Thanks go to my supervisors and to
the books I referred to in order to achieve the goals set at the start of the thesis. The report
contains work done from February 2014 to June 2014.
This might be helpful for those thinking of switching their IT skills towards data mining
and analytics. It may provide solid ground for students and statisticians who are curious to
dive into this wonderful contribution from SAP to the computing world. Data visualization,
prediction, exploration and analysis techniques are covered, with a focus on the Apriori and
K-means algorithms implemented in PA. Self-Organizing Maps may also be an area of interest for
those who are new to analytics.
Piyush Duggal
University of Stavanger
Table of Contents
Chapter 1: Overview of Predictive Analysis
1.1 Definition
1.2 Potential and what value PA can bring
1.3 Five Kinds of Analysis for 5 questions
1.3.1 Time Series Analysis
1.3.2 Classification Analysis
1.3.3 Cluster Analysis
1.3.4 Association Analysis
1.3.5 Outlier Analysis
1.4 Predictive Analysis as a process
1.5 User’s Classification
1.6 Challenges & Criteria for Success
Chapter 2: PA as a product from SAP
2.1 Intro to SAP HANA (Based on 3rd Semester Project work)
2.2 SAP HANA Predictive Analysis Library
2.3 R Integration
2.4 Interface walkthrough of SAP Predictive Analysis as a tool
2.4.1 Step 1: Accessing and viewing the Data Source
2.4.2 Preparing Data for Analysis
2.4.3 Step 3: Applying Algorithms for data analysis
2.4.4 Step 4: Running the model and viewing the Results
2.4.5 Step 5: Deploying Model in Business Application
Chapter 3: Predictive Analysis Applied
3.1 Initial Data Exploration
3.1.1 Sampling
3.1.2 Scaling
3.1.3 Binning
3.1.4 Outliers
3.2 Which Algorithm When
3.3 Challenges & Resolutions
Chapter 4: Cluster & Association Analysis Explored
4.1 Association Analysis
4.1.1 Applications of Association Analysis
4.1.2 Apriori Association Analysis
4.1.3 Apriori Association Analysis in PAL
4.1.4 Strength & Weakness with Apriori Lite
4.2 Cluster Analysis
4.2.1 Introduction & Applications of Cluster Analysis
4.2.2 ABC Analysis in PAL
4.2.3 K-Means Cluster Analysis in PAL
4.2.4 Silhouette
4.2.5 Self-Organizing Maps
Chapter 5: Conclusion
5.1 Problem Set: Burn that Churn
5.2 Results & Analysis
5.2.1 Clustering
5.2.2 Decision Tree
5.2.3 Apriori
5.2.4 Neural Network
5.3 Discussion & Issues
5.3.1 SAP PA compared to Hadoop
5.3.2 Sharing your own R component
5.3.3 Configuring HANA PAL to use with SAP PA
5.4 Future Work
5.5 Conclusion
References
Table of Figures
Figure 1: PA utilizing approaches from many disciplines
Figure 2: Competitive Advantage goes well with Analysis
Figure 3: Five main questions of PA
Figure 4: Historic points as base for future points plot
Figure 5: Classification Analysis
Figure 6: Cluster Analysis
Figure 7: Association Analysis
Figure 8: Two Dimensional showing Outlier
Figure 9: Steps of PA Process
Figure 10: SAP HANA Internal Architecture
Figure 11: R Integration of PA
Figure 12: Welcome Screen for PA
Figure 13: Select Input Source for PA
Figure 14: Window to search for database
Figure 15: Merge option in Step 1
Figure 17: Preparing Data for Analysis
Figure 18: Possibility to apply and configure algorithms
Figure 19: Configuring the attributes for algorithms
Figure 20: An Advanced Analysis in PA
Figure 21: Dialogue to create a new R Component for PA
Figure 22: Predict Results Grid View
Figure 23: Cluster Parallel Coordinate Chart
Figure 24: Scoring the saved model in PA
Figure 35: Share View in PA for outputs
Figure 36: Table versus Charts
Figure 37: Input & Output Systematic Sampling
Figure 38: The Sample Component in PA
Figure 39: Scaling types and their results compared
Figure 40: Normalization Component in PA
Figure 41: Input Output tables for Binning table in PAL
Figure 42: Algorithm Categories with tasks and examples
Figure 43: Four Data Sets in Anscombe’s Quartet
Figure 44: Process of Overfitting the models
Figure 45: Examples of Multicollinearity
Figure 46: Apriori Principle
Figure 47: Parameter Table Definition for Apriori
Figure 48: An Example of ABC Analysis
Figure 49: ABC Analysis Input & Output tables
Figure 50: Parameter Table Definition for K-Means
Figure 51: Decision Tree Analysis of Clusters
Figure 52: Data Set Records to the Map
Figure 53: Four Clusters in the 4 * 4 Map
Chapter 1: Overview of Predictive Analysis
1.1 Definition
SAP defines its predictive analysis tool as ‘SAP Predictive Analysis is a statistical analysis and
data mining solution that enables you to build predictive models to discover hidden insights and
relationships in your data, from which you can make predictions about future events by allowing
you to perform various analyses on the data, including time series forecasting, outlier detection,
trend analysis, classification analysis, segmentation analysis, and affinity analysis’. In the simplest
terms, it is quantitative analysis in support of predictions and the steps involved in making them.
It is a trending term in computer science but not a new topic: over the past few decades there have
been many attempts to predict product sales, costs, headcount, customer churn, advertising
campaign response, possible fraud and so on. One can argue whether it involves data mining as
opposed to knowledge discovery; whatever the outcome of that debate, it can prove to be a
business-changing methodology if used to the best of its potential. It is essentially a process of
finding meaningful correlations, patterns and trends by interpreting and analyzing large amounts
of data stored in data repositories, using statistical and mathematical techniques or pattern
recognition concepts. Inferential statistics and statistical sampling not only address the
requirement of very large data sets for prediction but also make it possible to analyze smaller data
sets through efficient sampling of correlations among datasets.
Wikipedia defines it as an area of statistical analysis in which you extract information
from data to predict patterns and trends. This can then be used to predict an unknown, be it past,
present or future; for example, identifying fraud that has been committed or is actually
occurring, through to forecasting future sales. The heart of predictive analytics is finding the
relationship between known variables and a predicted variable, using past occurrences. This
relationship is then used to predict an unknown outcome. Naturally, in such an analysis the
quality of the data analysis and the assumptions made will greatly affect the accuracy and
usability of the predictions. Predictive analysis is a blend of several quantitative analytics
disciplines, and the Venn diagram below describes the contribution of these disciplines to PA.
Predictive analytics gives decision makers and analysts the potential to make accurate predictions
about future events, based on complex statistical algorithms applied to the data under investigation.
In other words, PA is a synergy of interdisciplinary methodologies and perspectives,
combining useful approaches to problem solving from different professions. Statisticians regard
analysis methods such as inferential statistics, regression and other multivariate methods as key
concepts, while operational researchers prefer simulation and optimization methods, in contrast
to the artificial intelligence and information extraction approach followed by data miners.
Whatever approach one chooses, it is always an analytics process that starts with data selection,
acquisition and exploration using visualization or sampling, continues with checking the validity
of results and possibly reiterating over the whole result set, and ends with dissemination in order
to implement improved business processes. Predictive Analytics can thus be seen as a broader
term describing a variety of statistical and analytical algorithms and techniques used to develop
models that can predict future behaviors or events.
Figure 1.1: PA utilizing approaches from many disciplines
1.2 Potential and what value PA can bring
The white paper ‘The Business Value of Predictive Analytics’ by IDC Research reported that an
asset management firm increased its marketing offer acceptance rate by 300%; an insurance
company identified fraudulent claims 30 days faster than before; a bank was able to identify 50%
of fraud cases within the first hour; and a communications company increased customer
satisfaction by 53%. During the 2009 pandemic of the H1N1 influenza virus (swine flu), Google was
able to use search term activity to predict the spread of the disease two weeks ahead of the
government’s reports. This knowledge enabled state and local healthcare providers to ensure
the availability of medical resources and treatment for patients. What could better describe the
advantage of knowing what may happen in the future, subject to the efficiency of the model?
Management becomes easier when you have insight into the future, provided the predictions
are accurate. The better and more accurate the analysis of future events, the better the control
over them. The figure below describes the competitive advantage gained in progressing from
simply reporting the past to predicting the future; the advantage clearly increases considerably.
SAP Predictive Analysis was launched in late 2012 as a supplement to SAP Lumira (formerly Visual
Intelligence), a tool that allows users to run R, HANA PAL and HANA-R algorithms through a
user-friendly interface. It is interesting to note here the results of a survey by Ventana
Research titled ‘Predictive Analytics: Improving performance by making future more visible’.
The results were as follows:
55% use predictive analytics to create new revenue opportunities.
68% who use predictive analytics claimed a competitive edge.
86% asserted that predictive analytics will have a major positive impact.
The benefit of PA is not easy to measure: theoretically it is the difference between what
happened with PA and what would have happened without it, and the latter value is unknown.
Still, the fact that the market for predictive analysis software is estimated at over 2 billion
dollars gives an idea of its potential, worth and relevance to business today.
Figure 1.2: Competitive Advantage goes well with Analysis
Users of PA can be data scientists, data analysts or business users. Data scientists make up less
than 1% of an organization’s head count and generally create complex predictive models, validate
predictive requirements and publish results to management. Data analysts account for around
3% of head count and assist data scientists in transforming and enriching data sources, creating
simple models and visualizing results for publication to BI tools. The remaining 97% are direct or
indirect consumers of this analysis information and collaborate with each other on further
business actions. The term data scientist generally encompasses traditional roles such as data
miner, statistician or data researcher; data scientists have the deep knowledge and expertise
needed to build predictive models for analysis, data collection, validation, exploration, selection
and finally prediction. Business users do not have technical knowledge and only need the output
of the analysis.
1.3 Five Kinds of Analysis for 5 questions
Whatever the business, or the reason for deploying Predictive Analysis in it, PA technically
tries to answer the following five questions, as shown in the figure below.
Figure 1.3: Five main questions of PA
Trends in historical data can be used to project future data by applying time series
analysis, using historical data points to see how they might continue. This can be applied to
predict demand or to forecast sales. Keeping track of the key influencers of an event or outcome
can prove valuable for churn analysis, as we can try to follow the purchasing trends of customers.
There can also be significant segments or groups in the data that are of more interest to us, and
finding them can be key to further analysis: are there any clear groupings of data or some main
influencers? PA also tries to find associations or links between products by analyzing market
baskets to feed recommendation engines. Lastly, it asks which anomalies exist in the data and
why: are they errors, or actual variations to be analyzed further? Of course, PA is deployed
for a large set of applications across industries, but the key questions to be investigated remain
the same and are used throughout. They are so basic to PA as a methodology that we can even
use them to group classes of applications. Each of the five questions discussed above corresponds
to one of five classes of Predictive Analysis, which help to describe the structure of the data for
analysis. We classify predictive analysis applications into one of the following five classes.
1.3.1 Time Series Analysis
Time series analysis accounts for the fact that data points taken over time may have an internal
structure, such as autocorrelation, trend or seasonal variation, that should be accounted for. A
time series is an ordered sequence of values of a variable at equally spaced time intervals. The
analysis helps to obtain an understanding of the underlying forces and structure that produced the
observed data, and to fit a model and proceed to forecasting, monitoring or even feedback and
feed-forward control. The intent is to discern whether there is some pattern in the values
collected to date, with the intention of short-term forecasting. Past data points are used as the
basis for predicting future ones. This is also a major weakness, because it relies on the assumption
that past behavior will be repeated, which may not always be true, so the technique should be
used with caution. There is no real argument that decision trees are a better algorithm than
neural networks for classifying data, but decision trees are still very common. The choice depends
on the fit to the data set at hand and on the demands of the client: decision trees are easy to
understand and can help with understanding data patterns, whereas neural networks are black boxes.
Figure 1.4: Historic points as base for future points plot
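As a minimal sketch of using past data points as the basis for predicting future ones, the following Python fragment fits a straight-line trend to equally spaced observations by ordinary least squares and extends it a few steps ahead. The sales figures and the function name `linear_trend_forecast` are illustrative assumptions and not part of SAP PA or PAL; a pure trend line also ignores seasonality and autocorrelation.

```python
def linear_trend_forecast(values, steps):
    """Fit y = a + b*t by least squares on t = 0..n-1 and extrapolate `steps` points."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    # Slope b = covariance(t, y) / variance(t)
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    b = cov / var
    a = y_mean - b * t_mean
    # Future time indices continue after the last observed one
    return [a + b * (n + k) for k in range(steps)]


# Hypothetical monthly sales (assumed data, equally spaced in time)
history = [100, 110, 120, 130, 140]
forecast = linear_trend_forecast(history, 3)
print(forecast)
```

The forecast is only as good as the assumption that the past trend continues, which is the weakness noted in the text.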
1.3.2 Classification Analysis
This is the largest group of applications. It predicts a variable using data from other variables
that are believed to affect the value of the variable we want to predict. The predicted variable is
also called the output variable or target variable, as it depends on a few independent variables,
or input variables. Churn analysis and target marketing are the most common uses. Along with
clustering analysis, it is one of the most common data mining techniques for finding hidden
patterns in data. Classification also divides customer records into distinct segments, called
classes, but unlike clustering, classification analysis requires that the end user or analyst know
ahead of time how the classes are defined. A common approach to classification is to use decision
trees for segmenting and partitioning records: a record is classified by traversing the tree from
the root via branches and nodes to a leaf, which is a class instance. The path taken through a
decision tree is a rule, as in "Income<$30,000 and age<25 and debt=High". Because a decision
tree splits the records sequentially, it can be overly sensitive to its initial splits. It is thus
advisable to find the error rate of each leaf node. A tree is easy to express because paths can be
shown as rules, which makes it possible to use measures for evaluating the usefulness of rules,
such as Support, Confidence and Lift, to evaluate the usefulness of the tree as well. In practice we
do not use these values to measure the quality of a decision tree model; they go better with
Apriori algorithms. For decision tree models you can simply check the accuracy of the model on
known past data.
Figure 1.5: Classification Analysis
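To make the idea of a tree path as a rule concrete, the sketch below hard-codes the example rule from the text as a single leaf and checks it against known past records, reporting the overall accuracy and the error rate of that leaf. The records, the field names and the class labels `churn` and `stay` are invented for illustration; a real tree would be learned from data rather than written by hand.

```python
def predict(record):
    """One hand-written path: Income<$30,000 and age<25 and debt=High."""
    if record["income"] < 30000 and record["age"] < 25 and record["debt"] == "High":
        return "churn"  # leaf reached by the rule
    return "stay"       # every other record falls into a default leaf


# Hypothetical past records with known outcomes (assumed data)
history = [
    {"income": 20000, "age": 22, "debt": "High", "label": "churn"},
    {"income": 25000, "age": 24, "debt": "High", "label": "stay"},
    {"income": 28000, "age": 21, "debt": "High", "label": "churn"},
    {"income": 50000, "age": 40, "debt": "Low", "label": "stay"},
    {"income": 35000, "age": 23, "debt": "High", "label": "stay"},
    {"income": 29000, "age": 30, "debt": "High", "label": "churn"},
]

predictions = [predict(r) for r in history]

# Accuracy on known past data: share of records classified correctly
correct = sum(p == r["label"] for p, r in zip(predictions, history))
accuracy = correct / len(history)

# Error rate of the "churn" leaf: share of its records that carry another label
leaf = [r for r, p in zip(history, predictions) if p == "churn"]
leaf_error = sum(r["label"] != "churn" for r in leaf) / len(leaf)

print(f"accuracy={accuracy:.2f}, churn-leaf error rate={leaf_error:.2f}")
```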
1.3.3 Cluster Analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense or another) to each other
than to those in other groups. Cluster analysis itself is not one specific algorithm, but the general
task to be solved. The greater the similarity (or homogeneity) within a group, and the greater the
difference between groups, the “better” or more distinct the clustering. Cluster analysis is a
classification of objects from the data, where by classification we mean a labeling of objects with
class (group) labels. Cluster analysis is distinct from pattern recognition and from the areas of
statistics known as discriminant analysis and decision analysis, which seek to find rules for
classifying objects given a set of pre-classified objects, and hence it can be considered an
alternative to Factor Analysis. As the groups are not known in advance, interpretation can be
difficult if the results do not make sense in the context of the research being conducted.
Hierarchical Clustering is efficient and groups data over a variety of scales by creating a cluster
tree, which is not a single set of clusters but rather a multilevel hierarchy, where clusters at one
level are joined as clusters at the next level. K-Means Clustering is a partitioning method: it
partitions data into k mutually exclusive clusters and returns the index of the cluster to which it
has assigned each observation. Unlike hierarchical clustering, k-means clustering operates on
actual observations rather than the larger set of dissimilarity measures, and creates a single
level of clusters.
Gaussian Mixture Models form clusters by representing the probability density function of
observed variables as a mixture of multivariate normal densities. Mixture models of the Gaussian
mixture distribution class use an expectation maximization (EM) algorithm to fit data, which
assigns posterior probabilities to each component density with respect to each observation.
Clusters are assigned by selecting the component that maximizes the posterior probability, and
this is often considered a soft clustering method. Clustering helps to understand the attributes of
smaller subsets more effectively. Patterns in data or any further relationships are easier to find
when we focus on these clusters, and it is also possible to cluster data in a way that allows us to
focus on a specific group within the dataset. Cluster Analysis is effectively pattern recognition
without a priori knowledge of the data set. When we have groups of similar customers, based on
some attributes, they can be used to improve business processes. For example, if the algorithms
find a cluster of high-value customers, it may be worth targeting those with specific campaigns.
In contrast to classification analysis, where all observations are known to belong to one of a
number of groups and the objective is to predict the group to which a new observation belongs,
cluster analysis tries to find the number and composition of the groups.
Figure 1.6: Cluster Analysis
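A minimal sketch of k-means partitioning, written in plain Python rather than with PAL. It alternates between assigning each observation to its nearest centroid and moving each centroid to the mean of its cluster, and returns the cluster index of every observation. The function name `k_means`, the choice of the first k points as starting centroids and the sample points are assumptions for illustration; production implementations use more careful initialization.

```python
def k_means(points, k, max_iter=100):
    """Partition points (tuples of floats) into k mutually exclusive clusters."""
    centroids = [tuple(p) for p in points[:k]]  # naive start: first k points (assumption)
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid by squared Euclidean distance
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move either
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical two-dimensional observations forming two visible groups
data = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

Note that k must be chosen in advance here, whereas the text observes that cluster analysis in general also has to find the number of groups.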
1.3.4 Association Analysis
Given a set of transactions, we try to find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction. The purpose of association analysis
is to find patterns, in particular in business processes, and to formulate suitable rules of the
sort "If a customer buys product A, that customer also buys products B and C". Thus association
is a data mining function that discovers the probability of the co-occurrence of items in a
collection. The relationships between co-occurring items are expressed as association rules. In
transactional data, a collection of items is associated with each case. The collection could
theoretically include all possible members of the collection. For example, all products could
theoretically be purchased in a single market-basket transaction. However, in actuality, only a
tiny subset of all possible items is present in a given transaction; the items in the market
basket represent only a small fraction of the items available for sale in the store. The
associations do not necessarily have to be between products in shopping baskets; they can equally
be between people in a social network, telephone calling patterns and so on.
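To make the idea of "probability of co-occurrence" concrete, the two measures most commonly attached to an association rule, support and confidence, can be computed directly from a list of transactions. The sketch below is illustrative only; the baskets are invented and the function names are not taken from any SAP library.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

print("support(bread, butter)      =", support(baskets, {"bread", "butter"}))
print("confidence(bread -> butter) =", confidence(baskets, {"bread"}, {"butter"}))
```

An algorithm such as Apriori essentially searches for all item sets whose support exceeds a minimum value and then keeps the rules whose confidence is high enough.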
Figure 1.3: Association Analysis
1.3.5 Outlier Analysis
An outlier is a data point which is significantly different from the remaining data, i.e. an
observation which deviates so much from the other observations as to arouse suspicion that it
was generated by a different mechanism. Outliers are also referred to as abnormalities, deviants
or anomalies. An outlier often contains useful information about abnormal characteristics of the
systems and entities which impact the data generation process. Most outlier detection algorithms
output a score for the level of “outlierness” of a data point. This score can be used to
rank the data points in terms of their outlier tendency. This is a very general
form of output, which retains all the information provided by a particular algorithm, but does not
provide a concise summary of the small number of data points which should be considered
outliers. A second kind of output is a binary label indicating whether a data point is an outlier or
not. While some algorithms may directly return binary labels, the outlier scores can also be
converted into binary labels. This is typically done by imposing thresholds on outlier scores, based
on their statistical distribution. A binary labeling contains less information than a scoring
mechanism, but it is the final result which is often needed for decision making in practical
applications. A predictive model should be capable of differentiating between outliers caused by
errors in the data and genuine variations in the data. Outlier analysis is mostly applied in fraud
detection, clinical trials, detection of voting irregularities etc.
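The two kinds of output described above, a score per data point and a binary label obtained by thresholding that score, can be sketched with an inter-quartile-range rule. This is only an illustration under stated assumptions: the score definition (distance outside the quartile box, measured in IQR units), the 1.5 threshold and the sample amounts are choices made for this example and are not taken from PAL.

```python
import statistics

def iqr_scores(values):
    """Outlier score: how far a value lies outside [Q1, Q3], in units of the IQR.

    Assumes the data is not constant, so that the IQR is non-zero.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    scores = []
    for v in values:
        if v < q1:
            scores.append((q1 - v) / iqr)
        elif v > q3:
            scores.append((v - q3) / iqr)
        else:
            scores.append(0.0)
    return scores

def label_outliers(scores, threshold=1.5):
    """Binary labels obtained by imposing a threshold on the scores."""
    return [s > threshold for s in scores]

# Hypothetical transaction amounts with one suspicious value.
amounts = [52, 48, 50, 47, 55, 51, 49, 53, 50, 400]

scores = iqr_scores(amounts)
labels = label_outliers(scores)
ranking = sorted(range(len(amounts)), key=lambda i: scores[i], reverse=True)

print("most suspicious value:", amounts[ranking[0]])
print("labels:", labels)
```

The ranking keeps all the information of the scores, while the labels give the short list that a decision maker usually needs.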
Data sets with multiple outliers or clusters of outliers are subject to masking and swamping
effects. Masking occurs when a cluster of outlying observations skews the mean and the
covariance estimates toward it, so that the resulting distance of the outlying point from the
mean is small and the outlier goes undetected. Swamping occurs when a group of outlying instances
skews the mean and the covariance estimates toward it and away from other non-outlying instances,
so that the resulting distance from these instances to the mean is large, making them look like
outliers. This points to the second main use of outlier analysis, namely "improving" the quality
of a data set before running other algorithms on it: algorithms that "suffer" from the presence
of outliers, such as regression algorithms, should not be applied until the outliers have been
dealt with.
Two Dimensional with an outlier point
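The masking effect can be reproduced with a few lines of arithmetic. The sketch below uses a simple z-score (distance from the mean in standard deviations) as the outlier score; the data values and the cut-off of 3 are assumptions chosen for illustration. With a single outlying point the score is well above the cut-off, but once a cluster of similar points is present they pull the mean and inflate the standard deviation, and the same point is no longer flagged.

```python
import statistics

def z_score(value, data):
    """Distance of value from the mean of data, in population standard deviations."""
    return abs(value - statistics.mean(data)) / statistics.pstdev(data)

inliers = [48, 49, 50, 51, 52] * 4   # 20 ordinary observations around 50
single = inliers + [100]             # one outlying point
cluster = inliers + [100] * 5        # a cluster of five outlying points

THRESHOLD = 3.0  # assumed cut-off for calling a point an outlier

flag_single = z_score(100, single) > THRESHOLD
flag_cluster = z_score(100, cluster) > THRESHOLD

print("z with one outlier :", round(z_score(100, single), 2), flag_single)
print("z with five outliers:", round(z_score(100, cluster), 2), flag_cluster)
```

Robust estimates (for example the median and quartiles instead of the mean and standard deviation) are one common way to reduce this effect.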
1.4 Predictive Analysis as a process
Like every other process, predictive analysis is a series of logical steps and can be described
well with the following steps.
1. Requirement Analysis – What is the reason behind the prediction and what motivates this
attempt to predict? What outcomes are expected, who are the prospective participants, and what
are the timelines and resources? This is an important step and requires on average around 20%
of the total time to come up with a good analysis of the requirements. A flaw here will always
affect the final prediction output.
2. Data Identification – What are the data requirements and what sort of data will best support
our prediction model? What sources are available and which one is the most reliable? Validating
the data after acquiring it from various sources is also a good practice. Initial
data exploration is conducted and some data transformations like sampling, binning or
rescaling the data may be performed in preparation for model building. It is interesting
to note here that data selection, acquisition and preparation is the most time-consuming step
in the whole process. Questions such as what is needed, in what terms it should be measured,
where it can be obtained from, how good the resulting data set is etc. determine the data to be
captured and transformed. On average this step accounts for 36% of the total time spent on a
PA process.
3. Model Building – Here algorithms come into the picture and are applied to the identified data
along with chosen parameters to find the best analysis. The model is generally trained on one
set of data and then applied to another, unseen part of the data to evaluate the algorithm fit
(how good the algorithm is at solving the problem at hand) for the particular case. This also
gives an idea of model performance in terms of robustness, usability and goodness. On average
20% of the time is consumed in this step.
4. Deployment – Here comes the opportunity to apply the selected models in various business
applications; this is sometimes also referred to as model scoring. The models may sometimes need
to be integrated with business rules and fundamentals to get a better perspective on the
business context. Here we also monitor model performance over time.
5. Reiterate – Predictive analysis requires iterating back to any stage of the process, so it is
wrong to define it as a single pass through sequential, well-defined steps. This is because data
often needs to be transformed further in order to alter the analysis for better results, and
iterating provides an option to look at a set of data from multiple perspectives.
Figure 1.9: Steps of PA Process
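The evaluation idea in the Model Building step (fit on one part of the data, check the fit on an unseen part) can be sketched as a simple hold-out test. Everything here is illustrative: the data is synthetic, the 30% test fraction and the seed are arbitrary choices, and a one-variable least-squares line stands in for whatever algorithm would actually be used.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold back a fraction of them as unseen test data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Ordinary least squares for y = a + b * x on a list of (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_abs_error(model, rows):
    """Average absolute difference between predicted and actual y."""
    intercept, slope = model
    return sum(abs(y - (intercept + slope * x)) for x, y in rows) / len(rows)

# Synthetic data: y is roughly 3 + 2x with a small deterministic disturbance.
data = [(x, 3 + 2 * x + ((x * 7) % 5 - 2) * 0.1) for x in range(40)]

train, test = train_test_split(data)
model = fit_line(train)
print("error on training data:", round(mean_abs_error(model, train), 3))
print("error on unseen data  :", round(mean_abs_error(model, test), 3))
```

A large gap between the two errors would suggest that the model fits the training data well but does not generalize, which is exactly what this step is meant to reveal.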
1.5 User’s Classification
PA users can be classified into diverse categories with different skill sets and domain
experience, ranging from data scientists to consumers of business applications. Data scientists
form the minority group, as they make up under 0.01% of total users, and are responsible for
creating predictive models, validating predictive business requirements and publishing results
to management. Models built by data scientists are used by data analysts through an interface
such as a wizard to explore and analyze data related to a particular application, say a
marketing campaign. Data analysts generally have functional and business domain knowledge but do
not want to get engaged in the technical process of applying algorithms and creating models.
They generally need guidance for understanding these models, are basically interested in the
output of the models, and are most of the time market researchers, marketing campaign managers
or analysts. People who just want the benefits of predictive analysis simply embedded in their
business processes are classified as business users. They are interested only in analyzing the
output of algorithms for decisions. With Service Pack 6 for HANA, SAP has introduced the
Application Function Modeler tool, a graphical interface for running advanced algorithms and
accessing results from the Application Function Libraries, thus helping these business users be
more effective.
1.6 Challenges & Criteria for Success
Before we look at Predictive Analysis as a product from SAP and its potential, let us talk about
some challenges of PA and the myths involved. It has to be understood that PA does not guarantee
successful and consistent prediction of the future. In classical business organizations,
enormous amounts of data are collected without knowing where and when they will be used, with
the approach of saving everything because you never know when you will need something. But for
analyzing data, the quality of the data is more important than its quantity. It may be
interesting and efficient to store data together with metadata defining its purpose and the
decision making it may support. Correctly identifying the variable that has the biggest impact
on the prediction or output is most of the time very difficult. The objectives of the analysis
should be very clear at the start if we want a predictive analysis project to be implemented
successfully. There are many myths about PA in the market, propagated by innocent or biased
parties. Below are the five most common misconceptions held about PA as per the SAP published
book on Predictive Analysis.
1. PA is all about algorithms.
2. PA is all about accuracy.
3. PA requires a data warehouse.
4. PA is all about vast quantities of data.
5. PA is done by predictive experts.
The first myth, that PA is all about algorithms, is not totally wrong, as algorithms form the
heart of this process. But good and efficient algorithms are only a part of the story. In the
previous section, we saw that only 20% of the PA process is devoted to generating models. Other
important things that make up the core of PA besides algorithms are defining project goals,
acquiring, understanding and manipulating data, analyzing, evaluating, modeling and presenting
the results. Believing this myth is something like driving a car that has only an engine but no
steering wheel, fuel or brakes.
Everyone wants the best model and thus there are various measures of model quality, but that
does not mean PA is always accurate. Spending continuous energy and time refining a model in
order to get the very last bit of precision always brings extra cost. It is a business decision
how much that extra effort is worth for a more and more accurate analysis. Apart from predictive
accuracy, an analysis may discover an interesting pattern in the data which proves to be crucial
to a business decision, and this usefulness does not depend on the accuracy of the model. The
usefulness of PA algorithms depends on how understandable and deployable they are in the
business model and not fully on their accuracy. The third myth is based on the assertion that we
need to have a fully functional data warehouse to start with the predictive analysis process.
This is not true, but for sure the PA process will be more efficient and easier to implement if
organizational data is relatively clean and easy to access. While planning a data warehouse, the
data required for any analysis should be considered. Management reporting, and not data
analysis, is the basic purpose of a data warehouse. That is why it is considered a myth, and
having no data warehouse cannot stop anyone from starting predictive analysis. This is the era
of Big Data, and computer memories, database sizes, performance factors etc. are all many times
bigger these days; but it is definitely a myth that we cannot predict or analyze without having
a vast dataset. We cannot "predict" if there are no patterns in the data, whatever the amount of
data. The idea with "big data" is that the more data you have, the more likely you are to see
the patterns, but that presupposes the existence of patterns. PA is equally relevant and
successful for very small volumes of data. As far as statistical inference is concerned, data
volumes can be small and yet of very high importance for analysis. In most cases, like churn
analysis, credit risk tests, loan defaulting etc., even though we have thousands of records in
the dataset, PA still depends on just a few key variables.
It is not wrong to say that analysis of bigger datasets is increasing in popularity, but
analysis of small data volumes may be equally beneficial and relevant to the next business
decision. The last myth is true but is not the only case: PA can be done by predictive experts
as well as by newcomers to this area. With tools like PA, the idea is that you do not need to be
an expert in R, or in the inner workings of the provided algorithms, in order to build
predictive models. You still need some knowledge of how to use the algorithms and of their
strengths and weaknesses, but the most important part might be the business knowledge needed to
interpret the results. It actually depends on what has to be done and on the complexity of the
task. When a prospective borrower is browsing a bank website to analyze the bank's rules for
granting a
loan, he is doing his own predictive analysis but certainly need not be an expert. PA is best
performed by someone who has relevant business domain knowledge.
Soon we will be exploring how to use predictive analysis, so it is wise to go through the
common pitfalls that every user should avoid in order to stay clear of less obvious sources of
trouble. Firstly, predictive analysis makes no sense if data is simply thrown in without any
thought. Rationally it is illogical to just dump in all accessible data, yet some data
scientists prefer to do so and rely on some intelligent and reliable algorithm to find the
important variables and ignore all irrelevant ones. This is related to myth 4 mentioned above,
that PA is all about vast quantities of data. Dumping in all data can be an acceptable practice
if you trust an algorithm that can separate signal from noise and reject variables it judges
irrelevant to the analysis. However, experienced business users find this approach dangerous and
counterintuitive.
Predictive analysis will of course be of no use if the user does not have basic to intermediate
knowledge of the related business domain. Without business knowledge of the application area it
is practically impossible to guide the predictive analysis process towards useful results and to
make a decision based on those results once we have them. Thus it involves teams with diverse
skill sets, from business knowledge to analysis knowledge, working together; otherwise neither
the results nor the variables and dependencies will be understandable to someone with no
business knowledge when making a prediction.
Lack of data knowledge is another common pitfall to be avoided, and thus the approach
should be to find detailed answers about the data, data types, authenticity, source, provider,
measurements, interpretation in terms of business rules etc. in the first place. Whether the
data has come from a sample or a survey, and whether it was unbiased, are questions of deep
significance. Irrelevant data or lack of data knowledge can be as bad as having no data. Without
data knowledge we can even be misled and tend to make erroneous, invalid assumptions.
Some assumptions need to be verified twice, for instance whether a customer can hold multiple
accounts or whether class attendance is mandatory. In the case of legacy and outsourced data
this is a difficult task, as even data experts need to be sure about these assumptions. In
short, sources should always be questioned and drilled into before we finalize any assumption
during data verification.
After these pitfalls and myths, this section can be summarized as the key factors for success
of any predictive analysis process. Expectations should not be set very high, as data mining
does not guarantee finding gold. It depends on your expectations whether finding 10 when you
promised 12 is a success or a failure. It is always advisable to steer a predictive analysis
project by first agreeing on the objectives, business case and desired outcomes, and not to go
the other way of starting the process and waiting for something to be found if we are lucky.
Working in a team is also crucial, with business domain experts available at every step and
data analysts supporting them and understanding their requirements.
Sensitivity analysis questions the impact of the assumptions made and verifies the effect of
these assumptions on the analysis output. It is actually a strong influencer of success. A
solution is considered unstable, or a model unhealthy, if small changes in assumptions bring
large alterations in results.
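A simple way to carry out such a check is to change one assumption by a small percentage and compare the relative change in the output with the relative change in the input. The sketch below is a minimal illustration; the revenue model, its parameter names and the 5% step are hypothetical and only serve to show the mechanics.

```python
def sensitivity(model, assumptions, name, relative_step=0.05):
    """Relative change in the model output per relative change in one assumption."""
    base = model(**assumptions)
    bumped = dict(assumptions)
    bumped[name] = assumptions[name] * (1 + relative_step)
    changed = model(**bumped)
    return ((changed - base) / base) / relative_step

def projected_revenue(customers, churn_rate, revenue_per_customer):
    """Hypothetical business model used only for this illustration."""
    return customers * (1 - churn_rate) * revenue_per_customer

base_case = {"customers": 1000, "churn_rate": 0.2, "revenue_per_customer": 50.0}

for name in base_case:
    print(name, round(sensitivity(projected_revenue, base_case, name), 3))
```

A value close to 1 (or -1) means the output moves in proportion to the assumption; values much larger in magnitude would indicate the kind of instability described above.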
Chapter 2: PA as a product from SAP
2.1 Intro to SAP HANA (Based on 3rd Semester Project work)
Before we talk about SAP HANA and the Predictive Analysis Library, here is a quick introduction
to SAP HANA and in-memory computing basics from the report of the 3rd semester work. An
in-memory database system, also known as a main memory database system, contrasts with
traditional systems which rely on a disk storage mechanism, and is claimed to be faster because
it relies on faster internal optimization algorithms and eliminates seek time. SAP HANA is a
powerful platform providing libraries for predictive, planning, text processing, spatial and
business analytics, combining data processing, application platform and database capabilities in
memory. It is an innovative in-memory data platform that is deployed on-premise as an appliance,
in the cloud or as a hybrid of the two. The key lies in its ability to converge database and
application logic within the in-memory engine to perform advanced, real-time analytics. In the
column store, HANA stores a table as a sequence of columns in consecutive memory locations,
maximizing the spatial locality of table columns. CPU execution speeds are high, without the
need for internal waits on memory address operations. Data is compressed two-fold, making it a
less costly database and allowing speedy searches and calculations. The HANA database, also
called the SAP in-memory database, follows a hybrid approach and consists of two relational
database engines. The column-based store arranges data in columns and is optimized to hold huge
amounts of data, which can be aggregated in real time. The row-based store holds data in rows
and is better optimized for inserts and updates.
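The trade-off between the two layouts can be pictured with plain Python containers. This is a conceptual sketch only: Python lists do not give the contiguous, typed memory of a real column store, and the table contents are invented. It shows why an aggregate touches a single sequence in the column layout, while inserting a record is a single append in the row layout.

```python
# Row layout: one tuple per record (id, city, amount).
rows = [
    (1, "Oslo", 120.0),
    (2, "Stavanger", 80.0),
    (3, "Oslo", 200.0),
]

# Column layout: one sequence per attribute.
COLUMN_NAMES = ("id", "city", "amount")
columns = {name: [r[i] for r in rows] for i, name in enumerate(COLUMN_NAMES)}

def total_row_store(table):
    """Aggregation has to walk every record and pick out one field."""
    return sum(record[2] for record in table)

def total_column_store(table):
    """Aggregation reads only the one column it needs."""
    return sum(table["amount"])

def insert_row_store(table, record):
    """A new record is a single append."""
    table.append(record)

def insert_column_store(table, record):
    """A new record has to be split and appended to every column."""
    for name, value in zip(COLUMN_NAMES, record):
        table[name].append(value)

insert_row_store(rows, (4, "Bergen", 50.0))
insert_column_store(columns, (4, "Bergen", 50.0))
print(total_row_store(rows), total_column_store(columns))
```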
To achieve the desired performance, in-memory computing follows these basic concepts:
- Keep data in main memory to speed up data access.
- Minimize data movement by leveraging the columnar storage concept, compression, and
performing calculations at the database level.
- Divide and conquer: leverage the multi-core architecture of modern processors and
multiprocessor servers, or even scale out into a distributed landscape, to be able to grow
beyond what can be supplied by a single server.
All standard features expected from any relational database, like views, triggers, indexes
etc., are supported by the HANA database engines. At the time of table creation, the
administrator can select either of the two storage options, and it is always possible to convert
a table from one form to the other later. Both engines share a common persistency layer which is
responsible for page management and logging.
The logger saves every transaction committed on the HANA database in a log entry written to
persistent storage. Log volumes use low-latency flash technology for storage. Modeling
capabilities to define in-memory transformations of relational tables into analytical views are
also provided. Analytical views always provide real-time results, as the views are never
materialized. In-memory computing allows the processing of massive quantities of real-time data
in main memory to provide immediate results from analysis and transactions. In order to support
developers in creating applications and services directly within the new SAP HANA Extended
Application Services, SAP has enhanced the SAP HANA Studio to include all the necessary tools.
SAP HANA Studio was already based upon Eclipse; SAP was therefore able to extend the Studio via
an Eclipse Team Provider plug-in which sees the SAP HANA Repository as a remote source code
repository similar to Git or Perforce. This way all the development resources (everything from
HANA Views, SQLScript Procedures, Roles, Server Side Logic, HTML and JavaScript content, etc.)
can have their entire lifecycle managed with the SAP HANA Database. These lifecycle management
capabilities include versioning, language translation export/import, and software
delivery/transport.
SAP HANA Internal Architecture
2.2 SAP HANA Predictive Analysis Library
More and more people are becoming aware of SAP's huge efforts and contributions in the area of
predictive analysis, ranging from SAP HANA (an in-memory computing database) to a modern user
interface for visualizing, defining and executing the whole process efficiently. SAP is also
recognized as a leader in big data predictive analysis by Forrester in their report ‘The
Forrester Wave: Big Data, Predictive Analysis Solutions’ because of its innovative solution and
research contributions, which provide business users with powerful predictive assets such as
data preparation, predictive algorithms, developer tools and a workbench to execute, visualize
and share analyses, accelerating business applications. SAP allows its predictive tool to
support many data sources, such as HANA or data from SAP BO, along with non-SAP sources like
Hadoop (via SAP Data Services), CSV or even ordinary Excel files.
The foundation behind these PA assets will always be the powerful predictive analysis
algorithms and the Predictive Analysis Library (PAL) in HANA, a built-in C++ library for
performing in-database data mining and in-database statistical calculations. An enterprise-class
solution for data integration, quality management, text analytics, data profiling and metadata
management is delivered by SAP Data Services. Unstructured data sources are also supported
through a combination of data services. PAL contains many predefined predictive analysis
algorithms that execute in-database to process large datasets. The point here is that data is
not extracted out of SAP HANA to an analysis engine located somewhere else, which reduces data
movement time and allows calculations to be performed within the HANA server and database. These
algorithms are called from within HANA SQLScript procedures and are generally grouped into the
following classes of applications. Listed below are all the algorithms provided by PAL.
a. Association Analysis
- Apriori
- Apriori Lite
b. Cluster Analysis
- ABC Classification
- DBSCAN
- K-Means
- Kohonen Self Organized Maps
c. Classification Analysis
- C4.5 Decision Tree Analysis
- CHAID Decision Tree Analysis
- K Nearest Neighbor/(KNN)
- Multiple Linear Regression
- Polynomial Regression
- Exponential Regression
- Bi-Variate Geometric Regression
- Bi-Variate Logarithmic Regression
- Logistic Regression
- Naïve Bayes
d. Time Series Analysis
- Single Exponential Smoothing
- Double Exponential Smoothing
- Triple Exponential Smoothing
e. Outlier Detection
- Inter-Quartile Range Test (Tukey’s Test)
- Variance Test
- Anomaly Detection
f. Link Prediction
- Common Neighbors
- Jaccard’s Coefficient
- Adamic/Adar
- Katz
g. Data Preparation
- Sampling
- Binning
- Scaling
- Convert Categorical to Binary
Link prediction is an emerging group of algorithms for analyzing social networks by finding
links between entities on those networks. PAL can fairly be considered table based, because
three tables are maintained for each algorithm it supports: an input table which contains the
data for analysis, a parameter or control table containing the parameter settings for the
particular algorithm, and an output table for the output of the analysis. The SQLScript which
calls PAL contains code which first generates the specific procedure, followed by definitions of
the tables for input data, parameter settings and results, and finally calls the procedure.
All these procedures are defined in the AFL schema, which stands for Application Function
Library schema.
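The three-table calling pattern can be mimicked outside HANA to show the idea. The following Python sketch is not PAL code and does not call any SAP API: the function name, the parameter names NEW_MIN and NEW_MAX and the four-column layout of the control table (name, integer value, double value, string value) are assumptions made for illustration. It applies a min-max scaling, one of the data preparation functions listed above, to an input table under the control of a parameter table and returns an output table.

```python
def scale_range(input_table, control_table):
    """Min-max scale the value column of input_table as directed by control_table.

    input_table   : list of (row_id, value)
    control_table : list of (name, int_value, double_value, string_value)
    returns       : output table as a list of (row_id, scaled_value)
    Assumes the input values are not all identical.
    """
    params = {name: (i, d, s) for (name, i, d, s) in control_table}
    new_min = params["NEW_MIN"][1]
    new_max = params["NEW_MAX"][1]

    values = [value for (_, value) in input_table]
    low, high = min(values), max(values)

    output_table = []
    for row_id, value in input_table:
        scaled = new_min + (value - low) * (new_max - new_min) / (high - low)
        output_table.append((row_id, scaled))
    return output_table

# Input table, control table and the resulting output table.
input_table = [(1, 10.0), (2, 20.0), (3, 40.0)]
control_table = [
    ("NEW_MIN", None, 0.0, None),
    ("NEW_MAX", None, 1.0, None),
]
output_table = scale_range(input_table, control_table)
print(output_table)
```

Keeping the parameters in a table rather than in the call signature is what lets every algorithm be invoked in the same way: only the contents of the three tables change.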
2.3 R Integration
R, an open source statistics language with more than 3500 packages/algorithms, is one of the
most used predictive analysis tools, used by approximately 60% of data miners. Allowing the use
of R from within HANA offers a breadth of algorithms for business calculations in addition to
the specific algorithms defined in PAL. A high-level architecture for the SAP predictive assets
and their association with R is shown in the figure below. The SAP HANA platform, being the
core, provides along with PAL the flexibility to involve R. HANA Studio provides the development
environment, while the client tool SAP Predictive Analysis is used by business analysts and data
scientists. R and SAP HANA
reside on separate servers side by side, and the R server takes in data stored in HANA tables,
which is transformed into R vectors or data frames, the default data formats used by R. R script
code is embedded within SQLScript and passed over to the R server for processing, and the
results are transferred back. The results given by the R server are again in R's data format and
thus need to be converted back to HANA tables. All these transformations and transfers are
performed by the HANA platform. The SQLScript containing the R script first calls code to create
the specific procedure and defines the parameters and the input and output tables before calling
the procedure. Without doubt, predictive analysis gains huge flexibility and comprehensiveness
from this R support and integration with HANA. To use PAL algorithms you need to know SQLScript,
just as you need knowledge of R to use open source R algorithms and packages. SAP PA is a simple
tool with a clean user interface, allowing business users to get the best benefit of predictive
analysis without knowledge of R or SQLScript. The capability of SAP PA increases to a great
extent with this feature for adding R algorithms. With the use of R in SAP PA, data mining
capabilities can be extended with many new algorithms, and chart and visualization capabilities
are enhanced further. The prerequisite is of course that the R software is installed on the host
machine with the necessary libraries and R algorithms.
R Integration for PA
2.4 Interface walkthrough of SAP Predictive Analysis as a tool
SAP Predictive Analysis (PA) is, in the simplest terms, a tool or solution from SAP that serves
as a user interface for defining and executing all predictive analysis processes. These
processes can run on in-database PAL in HANA, on predictive algorithms in R, or on data from
traditional data sources such as SAP BO, XLS or CSV. PA has the further advantage of being fully
integrated with Lumira, which eases sharing of the results after data acquisition, visualization
and manipulation are done. PA fully supports all the analysis processes for prediction mentioned
earlier in this report. The section below gives a detailed description of SAP PA as a product
with
screenshots and the available functions, explaining data preparation, applying algorithms and
deployment of models. All stages, from accessing and viewing input data, through performing the
required data preparation, to finally applying algorithms to analyze this data, are covered in
the following section of the report. The first screen you get when you invoke PA is called the
Welcome Screen. A simple five-step Getting Started guide comes up immediately and displays the
five possible steps. It is as simple in design and interface as it sounds in text. The Welcome
Screen also contains a collection of SAMPLES to help new users learn and understand the product
and get used to the tool.
1. Connecting to a data source. Select a data source.
2. Prepare your data. Explore the input data.
3. Analyze your data. Click on the Predict View.
4. Visualize the analysis results. Click on Results.
5. Save the analysis: an optional step.
WELCOME SCREEN FOR PA
2.4.1 Step 1: Accessing and viewing the Data Source
When we select the NEW DOCUMENT button on the PA Welcome Screen, a new dialogue box
opens up giving us the possibility to SELECT A SOURCE. PA supports seven different kinds of data
source, and this list is populated under the NEW DATA SOURCE column, while on the right the
RECENT DATA SOURCES column lists recently acquired data sources for convenience and fast access.
A unique entry in the ‘New Data Source’ list is SAP HANA ONLINE, which acts as a data source
for acquiring data from SAP HANA tables, views and analysis views to perform in-database
predictive analysis functions using PAL algorithms and the R integration of SAP HANA. All the
other data sources in the list are of course not in-database and exclude PAL algorithms and R
integration for HANA, as the analysis part runs outside HANA.
SAP HANA Source Input
The following seven data sources appear under New Data Source:
1. CSV file: This option gives the possibility to acquire data from a comma-separated value
data file and perform in-process analysis using native PA algorithms and R integration for
PA.
2. Free hand SQL: This option helps to create the user’s own data provider by allowing manual
entry of SQL against a target data source, in order to perform in-process analysis with the
help of native PA algorithms and R integration for PA.
3. SAP HANA Offline: This option lets you acquire data from SAP HANA tables, views and
analysis views and perform in-process analysis of the data using native PA algorithms and R
integration for PA.
4. SAP HANA Online: This option acquires data from HANA tables, views and analysis
views to perform in-database analysis using SAP HANA PAL algorithms and R Integration
for SAP HANA. It is the only option that allows the predictive models to be run on the HANA
database; all the other options have the model run locally.
5. MS Excel: A Microsoft Excel spreadsheet can be used as a data source, and after acquiring
data from the spreadsheet we can perform in-process analysis using native PA
algorithms and R integration for PA.
6. Universe 3.x: This option allows you to acquire data from SAP BusinessObjects Universes
which are available on the XI 3.x platform and perform in-process analysis using native PA
algorithms and R integration for PA.
7. Universe 4.x: This option allows you to acquire data from SAP BusinessObjects Universes
which are available on the BI 4.x platform and perform in-process analysis using native PA
algorithms and R integration for PA.
Selecting SAP HANA Online brings up a dialogue box asking for the SAP HANA connection
information and for the desired table to fetch data from, as shown in the figure above. Once the
user has selected a HANA table and established a successful connection to HANA, PA directs to
the PREPARE view, from where you can move to the PREDICT view and use all the in-database PAL
algorithms listed, as shown in the figure below, together with the other data source components
in the analysis editor. Data writer and preparation components are always run in-database in
HANA when you opt for SAP HANA Online as your data source, and it is important to note that
various PAL algorithms like Apriori, K-Means, CNR Tree or Multiple Linear Regression are
supported for R integration in such cases. For the 6 data sources other than HANA Online, for
instance a CSV file, the prompt for data source selection looks like the figure below, where you
can browse for a data file located on the hard disk of your machine. Once data is acquired for
analysis, PA takes you, as in the HANA Online case, to the PREPARE view, from where you can move
to the PREDICT view and use all the native PA algorithms and PA-supported R algorithms from the
algorithms section. All supported algorithms are listed under the algorithms tab and can be
selected based on your requirement.
SAP PA also allows you to combine two data sets from within the PREPARE view, using either a
merge or a union. MERGE creates a combined table by matching a key column in the two data sets,
while UNION appends the selected columns of the source data set to the target data set based on
an identifier, provided that the matching columns in the two data sets have the same data type.
The figure below illustrates how to MERGE two data sets in the Prepare view of PA. Once you have
acquired input data from any of the seven possible data sources, you are ready to explore and
prepare the data before applying any algorithm.
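The difference between the two ways of combining data sets can be illustrated with a minimal Python sketch. It assumes that MERGE behaves like a join on the key column and that UNION appends rows; the function names, column names and sample rows are hypothetical and are not taken from PA.

```python
def merge(left, right, key):
    """Join two lists of row dicts on a shared key column (inner join)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], []):
            combined = dict(row)
            combined.update(match)
            merged.append(combined)
    return merged


def union(target, source, columns):
    """Append the selected columns of the source rows to the target rows."""
    for column in columns:
        # A union only makes sense when matching columns share a data type.
        target_types = {type(row[column]) for row in target}
        source_types = {type(row[column]) for row in source}
        if target_types != source_types:
            raise TypeError(f"column {column!r} has different types")
    return target + [{c: row[c] for c in columns} for row in source]


# Hypothetical sample data
stores = [{"store": "S1", "size": 120}, {"store": "S2", "size": 300}]
sales = [{"store": "S1", "turnover": 5.5}, {"store": "S2", "turnover": 9.1}]
new_stores = [{"store": "S3", "size": 210}]

merged = merge(stores, sales, "store")                  # wider table
stacked = union(stores, new_stores, ["store", "size"])  # longer table
```

A merge therefore adds columns to matching records, whereas a union adds records with the same columns.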
Chapter 2: PA as a product from SAP
Window to search for input file database
Merge Data in Step 1
2.4.2 Preparing Data for Analysis
Once data input is finished, PA moves to the Prepare view, where we can review the data in grid
format or apply rich column features such as sorting, filtering, renaming, merging, creating a
geographical hierarchy, creating a time hierarchy, reformatting or converting to a different data
type. Data in the Prepare stage can be viewed in either the grid or the facets display, as shown
in the figures below. In the facets view, the data is shown by distinct value, equivalent to a
horizontal bar chart by value. It is useful when there are few distinct values. The data
manipulators available are the same in both views.
Preparing Data for Analysis
A unique and useful function of the Prepare view is the Visualize view with its extensive chart
library. By accessing the data and then exploring it with one or more of the many chart options,
we gain a better understanding of the data and can carry out further data preparation so that the
PA algorithms are applied effectively.
2.4.3 Step 3: Applying Algorithms for data analysis
After data preparation is over, all components that can be added to an analysis are shown in the
Predict view of PA, grouped under the tabs Algorithms, Data Preparation and Data Writers. The
components available vary depending on whether you are building an in-database analysis using
HANA algorithms or an in-process analysis. The actual construction of an analysis is the same in
both cases, even though the available components differ. Building an analysis is quite
straightforward: select a component and drag it to the analysis editor workspace, where it is
automatically connected to the component in focus. Alternatively, double-click the desired next
component, which also connects it to the component in focus automatically. Each component has
input and output anchors, also called connection points, through which it is connected to other
components. A data source component has only a single output connection point. Data always
flows from a predecessor component to its successor; in other words, the output of the
predecessor in a connection is the input of the successor.
The structure of a component is shown in the figure below; it offers options to rename, run,
delete or configure the component. The figure also shows the different states a component can
be in. ‘Not Configured’ is the state of a component that has just been dragged onto the analysis
editor workspace and still needs to be configured before the analysis can be run. ‘Configured’
means that all mandatory properties of the component are set and the analysis can be run.
‘Success’ is displayed after successful execution of the analysis, and ‘Failure’ indicates that
the component caused the execution of the analysis to fail.
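The chaining of components and their four states can be pictured with a small Python sketch. This is only a conceptual model under our own assumptions: the classes State and Component and the function run_analysis are hypothetical names and do not correspond to any PA interface.

```python
from enum import Enum


class State(Enum):
    NOT_CONFIGURED = "Not Configured"
    CONFIGURED = "Configured"
    SUCCESS = "Success"
    FAILURE = "Failure"


class Component:
    """Hypothetical stand-in for one node in the analysis editor."""

    def __init__(self, name, func=None):
        self.name = name
        self.func = func  # the mandatory property; None means not set yet
        self.state = State.CONFIGURED if func else State.NOT_CONFIGURED

    def configure(self, func):
        self.func = func
        self.state = State.CONFIGURED

    def run(self, data):
        if self.state is State.NOT_CONFIGURED:
            raise RuntimeError(f"{self.name} must be configured before running")
        try:
            result = self.func(data)
        except Exception:
            self.state = State.FAILURE
            raise
        self.state = State.SUCCESS
        return result


def run_analysis(components, data=None):
    """Pass the output of each predecessor to its successor."""
    for component in components:
        data = component.run(data)
    return data


source = Component("Stores.csv", lambda _: [3, 1, 2])  # data source: output only
sorter = Component("Sort")       # just dragged in, still 'Not Configured'
sorter.configure(sorted)         # now 'Configured'
result = run_analysis([source, sorter])
```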
Possibility to apply and configure Algorithms
We now discuss two cases with the help of screenshots: an in-database analysis, with data
sourced from HANA and algorithms based on PAL, and an in-process analysis, with data sourced
from a CSV file and algorithms based on the integration of R into PA.
Case 1: In-Database analysis using HANA tables and PAL
We start by selecting the data in SAP HANA and then choose the Predict view in PA to build the
analysis. The key difference in this case is that, when the data source is SAP HANA Online, the
data does not leave SAP HANA, i.e. the whole analysis runs in-database. As an example, we
segment (cluster) retail store data into similar groups based on sales turnover, profit margin,
staff numbers and store size, using the HANA K-Means algorithm. Choosing the component works as
explained in the previous step: simply drag it to the analysis editor workspace, where it is
connected as shown in the figure below.
PAL on HANA Data source
Next we configure the properties of the HANA K-Means component, as shown in the figure below,
where all primary properties can be changed. We choose the variables to use in the analysis,
e.g. turnover, size and margin, and the number of clusters, i.e. the value of K, which in this
example is 5. Clicking the Advanced Properties tab in the dialogue box displays the remaining
control parameters of the algorithm, as shown in the figure below; they are presented in terms a
business analyst can understand, in contrast to writing SQLScript. Fields marked with an
asterisk are mandatory inputs. The advanced properties are pre-filled with default values that
can be changed; for instance, the maximum number of iterations should be lowered from the
default of 100 if the data volume is very large and processing time is a crucial factor. Once
the primary and advanced properties are set, the analysis is ready to run.
Configuring the attributes for algorithms
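To make the role of the two main parameters concrete, the following is a minimal pure-Python sketch of the K-Means idea with a number of clusters k and an iteration limit. It is not the PAL implementation; the function names, the seeding strategy and the toy data (two groups instead of the five clusters used above) are our own assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    """Plain k-means: returns (centers, cluster index of each point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers picked from the data
    assignment = [None] * len(points)
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged before reaching the iteration limit
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment


# Hypothetical stores described by (turnover, staff); two obvious groups.
stores = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1), (8.0, 9.0), (8.2, 9.1), (7.9, 8.8)]
centers, labels = k_means(stores, k=2, max_iterations=20)
```

Lowering max_iterations bounds the running time at the cost of possibly stopping before the assignments have stabilized, which is the trade-off described above for large data volumes.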
Case 2: In-Process analysis using CSV file and R Integration
We start by selecting the data from a CSV file for this in-process analysis and then go to the
Predict view. We use the same example of clustering retail stores into similar groups based on
sales, turnover, store size etc. As in the previous case, we select the R K-Means algorithm and
drag it to the analysis editor workspace to connect it. Next we configure the properties of the
R K-Means component, which are very similar to the editable properties in the previous case. We
select the variables for the analysis and the value of K, the number of clusters to be created.
The Advanced Properties tab of the dialogue box displays the remaining control parameters of the
R K-Means algorithm, which are presented in terms a business analyst can understand, as opposed
to writing R script. The values shown under advanced properties are defaults and can be left
untouched. The analysis is now ready to run, as we have configured the right data source with
the required algorithm and the desired parameters.
Before we move on to the next step and describe how to run the analyses, let us look at a more
advanced and realistic analysis than the two-component analyses built above. In the figure below,
which represents such a scenario, Stores.csv is the data source. An inter-quartile range test is
run on each variable to filter out outliers before the cluster analysis is run on the data. The
cluster analysis writes the source data with the assigned cluster numbers to a database table,
while the specific results for cluster one are processed separately. The clustered data is then
analyzed with a decision tree, where the target (dependent) variable is the previously derived
cluster number and the independent variables are still store turnover, margin, staff and size;
this gives insight into the specific rules and patterns that explain why such clusters exist.
Finally, the results are exported as a filtered subset. It is quite beneficial that the decision
tree model can be saved and reapplied to new data to predict a new store’s cluster assignment.
A saved model can also be exported to another application using the Predictive Modeling Markup
Language (PMML) standard, which is explained later in this report.
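The outlier filtering step of such an analysis can be sketched in Python as follows. This is an illustration only: it assumes the conventional fences of 1.5 times the inter-quartile range, which may differ from the rule PA applies, and the function names and store values are hypothetical.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Return (low, high) fences derived from the inter-quartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - factor * spread, q3 + factor * spread


def filter_outliers(rows, columns):
    """Keep only rows whose value lies inside the fences for every column."""
    bounds = {c: iqr_bounds([r[c] for r in rows]) for c in columns}
    return [
        r for r in rows
        if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in columns)
    ]


# Hypothetical store records; the last one has an extreme turnover.
stores = [
    {"turnover": t, "staff": s}
    for t, s in [(10, 5), (12, 6), (11, 5), (13, 7),
                 (12, 6), (11, 5), (12, 6), (95, 6)]
]
clean = filter_outliers(stores, ["turnover", "staff"])
```

Only the records that pass the test on every variable would then be handed to the cluster analysis.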
A new feature added to PA is the ability to define and run R algorithms from the PA analysis
editor. The tool provides a GUI for adding an R script as a new component in PA, based on either
R integration for PA or R integration for SAP HANA, to run such scripts. It is even possible to
add your own custom algorithm written in C++ or Java. The figure below displays part of the
wizard for writing custom R script components, which can later be included in any analysis. This
integration opens SAP PA to thousands of algorithms from R libraries. An expert R user can write
new components or algorithms, which a business user can then easily embed in an analysis.
An Advanced Analysis in PA
Dialogue to create a new R Component for PA
2.4.4 Step 4: Running the model and viewing the Results
Running an analysis works exactly the same way for the in-database and in-process methods. As an
example, we run the in-process analysis built on the CSV file in the previous step. An analysis
can be run in two ways: with the ‘Run till here’ option on the R K-Means component, or with the
RUN ANALYSIS icon on the analysis editor toolbar. Once the run has completed successfully, we
can switch to the Predict results view, which offers tabular (grid) output, algorithm-specific
charts and a default ad hoc chart viewer for user-defined visualizations. The figure below shows
the results view for the first few records of our cluster analysis: a new column holding the
assigned cluster number of each record has been added, with the input data listed to its left.
Predict Results Grid View
If you click the CHARTS option, just to the right of the Grid button, the K-Means algorithm
offers a cluster chart with four different visualizations of the results, which help the user
explore the data further and get a better view of the analysis. A vertical bar chart compares
the clusters by size; it can be changed to a horizontal bar chart or a pie chart. A cluster
density and distance chart is also generated, in which a color scale from dark to light
indicates dense to sparse clusters and a thicker connecting line means that two clusters are
closer together. Small clusters that lie close to each other are candidates for combining, while
small clusters that are distant from all others may be considered outliers. The two charts at
the bottom let the user compare clusters on any specific variable to differentiate the
properties of each cluster. For every algorithm used in an analysis, an algorithm summary is
generated; for this example it is shown in the figure below, which represents the output from R
for the R K-Means algorithm. The summary includes the cluster center coordinates, the
within-cluster sum of squares, i.e. the sum over all records in a cluster of the squared
distance between the record and the cluster center, and finally the size of each cluster.
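The three quantities reported in the summary can be reproduced with a few lines of Python. The sketch below is only meant to show how they are defined; the function name cluster_summary and the four sample points are hypothetical, and the cluster labels are assumed to be already known.

```python
def cluster_summary(points, labels):
    """Return {cluster: (center, within_ss, size)} for labelled points."""
    summary = {}
    for cluster in sorted(set(labels)):
        members = [p for p, label in zip(points, labels) if label == cluster]
        # Cluster center: the mean of each coordinate over the members.
        center = tuple(sum(v) / len(members) for v in zip(*members))
        # Within-cluster sum of squares: squared distances to the center.
        within_ss = sum(
            sum((x - c) ** 2 for x, c in zip(p, center)) for p in members
        )
        summary[cluster] = (center, within_ss, len(members))
    return summary


points = [(1.0, 1.0), (3.0, 1.0), (10.0, 10.0), (10.0, 14.0)]
labels = [1, 1, 2, 2]
summary = cluster_summary(points, labels)
# summary[1] is ((2.0, 1.0), 2.0, 2) and summary[2] is ((10.0, 12.0), 8.0, 2)
```

A small within-cluster sum of squares relative to the cluster size indicates a compact, dense cluster.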
Cluster Chart in PA
Under the CHARTS option a third chart is available, the CLUSTER PARALLEL COORDINATES CHART, in
which each record is plotted as a line connecting its values on the vertical axes of the
displayed variables and is color coded by cluster number. As with many visualizations in PA, we
can drill down into specific data to examine it in more detail. Each algorithm has a default
visualization, which is discussed in the algorithm sections of this report. We can also use the
Visualize option to create ad hoc, user-defined charts. In the figure below, a trellis chart
shows each variable by cluster group for comparison. Going further, we can use the model to
predict new data, either within PA or in an external application.
Cluster Parallel Coordinate Chart
2.4.5 Step 5: Deploying Model in Business Application
We have several options available in SAP PA to deploy models:
Scoring models in PA and exporting the results.
Exporting the model as PMML.
Sharing the analysis in the Share View in PA.
Exporting and importing analyses between PA users.
Exporting an SAP HANA PAL model from PA as a stored procedure.
The most commonly used of these options is to use PA to predict new data, i.e. to score the
model. PA lets you save a model once it is built and reuse it later to make predictions on a new
set of data: the saved model is run again on the new data, and the target (dependent) variable
is predicted. The term scoring comes from phrases such as scoring a customer’s creditworthiness
or probability to churn. You can extend the cluster analysis by adding the decision tree
algorithm R-CNR Tree to derive rules that describe why records have been assigned to specific
clusters. The independent (input) variables are retail store turnover, margin, staff number and
shop size, and the target (dependent) variable is the cluster number. The figure below shows the
option to save the model with a name and a description. Once a model is saved, saved models are
shown in a new tab next to the Algorithms, Data Preparation and Data Writers tabs. This makes it
easy to build a new analysis that uses a saved model to predict a new data set: simply add the
saved model to an analysis on the new data. An alternative approach is to export the saved model
from the current analysis, create a new analysis with the new data to be scored, and import the
saved model there. The two screenshots below show the saved models tab, with model name and
description, and how a saved model is scored.
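The idea of building a model once and scoring new data later can be sketched in Python. The sketch is a simplification under our own assumptions: the model is reduced to a list of cluster centers, JSON stands in for whatever format PA uses internally, and all names and numbers are hypothetical.

```python
import json


def save_model(centers):
    """Serialise the trained model (here: just cluster centers) to text."""
    return json.dumps({"type": "k-means", "centers": centers})


def score(model_text, records):
    """Assign each new record to the cluster with the nearest saved center."""
    centers = json.loads(model_text)["centers"]
    scored = []
    for record in records:
        distances = [
            sum((x - c) ** 2 for x, c in zip(record, center))
            for center in centers
        ]
        scored.append(distances.index(min(distances)) + 1)  # clusters from 1
    return scored


saved = save_model([[1.0, 2.0], [8.0, 9.0]])  # built once on the training data
new_stores = [[0.8, 2.5], [7.5, 9.4], [9.0, 8.0]]
predicted = score(saved, new_stores)
```

The new records never influence the saved model; they are only evaluated against it, which is what distinguishes scoring from training.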
The previous section explained how to reuse a model for predictions; this section covers
exporting the model as PMML. PA lets you export a saved model in the Predictive Modeling Markup
Language (PMML), which is now regarded as an industry standard for sharing models between
applications. The option to export a model as PMML appears when we right-click the saved model
component in the PA analysis editor, where we can specify the output file in which to store the
XML. The figure below is a screenshot of the PMML XML generated for a saved model. Another
application can then read the PMML, which describes the model, and use it to produce
predictions. Next we will see how to share the analysis with other users; sharing an analysis
is, of course, as important a step as developing it.
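Because PMML is plain XML, a consuming application can inspect an exported model with any XML parser. The Python sketch below reads a hand-written and heavily shortened PMML-like fragment; the fragment, the field names and the model name are hypothetical, and a real export from PA carries a namespace, a header and the full model definition.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily shortened PMML-like fragment for illustration only.
PMML_TEXT = """
<PMML version="4.1">
  <DataDictionary numberOfFields="3">
    <DataField name="Turnover" optype="continuous" dataType="double"/>
    <DataField name="Staff" optype="continuous" dataType="integer"/>
    <DataField name="Cluster" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="StoreClusterRules" functionName="classification"/>
</PMML>
"""


def describe(pmml_text):
    """List the data fields and tree model names found in a PMML document."""
    root = ET.fromstring(pmml_text)
    fields = {f.get("name"): f.get("dataType") for f in root.iter("DataField")}
    models = [m.get("modelName") for m in root.iter("TreeModel")]
    return fields, models


fields, models = describe(PMML_TEXT)
```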
Scoring the saved model in PA
The Share view in PA offers the following functions for data, charts or both together. We can
share charts, export the generated data set to a file, publish the generated data set as an
analysis view to SAP HANA, publish the generated data set and charts to SAP StreamWork, publish
the generated data set to SAP Lumira Cloud, or publish the data set to a BusinessObjects
information space to be accessed in SAP BusinessObjects Explorer. The diagram below gives an
idea of how we can share the Customers.xlsx data along with the associated visualizations. The
following section covers exporting and importing analyses between PA users.
PMML Output for decision tree
To export a model for use by another PA user, a .spar file can be generated from one SVID
document and imported into another. This takes a few simple steps. In the Predict view of PA,
choose ‘Export Model’, provide the name of the .spar file when prompted and save it. The saved
model can then be reused in another SVID document by importing the .spar file. Importing is as
straightforward as exporting: choose ‘Import Model’ on the PA toolbar in the Predict view, select
the path and file name of the desired .spar file when asked, and click Open. After the import
completes, the model is displayed in the saved models tab. SVID stands for SAP Visual
Intelligence Document; Visual Intelligence is the earlier name of Lumira, and the document
stores a data set together with the visualizations generated by users. Saving a model in PA
automatically stores a copy in the SVID as well, so that it can be shared with other users or on
Lumira Cloud, from where it can be accessed directly in PA or Lumira. SPAR is an acronym for SAP
Predictive Analysis Archive file, the proprietary format for exporting models created in PA. At
present it is used only for transporting models, but the plan is eventually to cover analyses
and custom components in this format as well. This is very helpful in business operations where
one user creates a model and others use, share or modify it to their own needs.
A saved PA model can also be exported as an SAP HANA PAL model for use in SAP HANA. After
creating an in-database SAP HANA model and saving it in the correct format, we can export it
with the wizard shown in the figure below. The exported procedure, along with its associated
tables, types and procedures, appears under the selected schema in SAP HANA.
Share View in PA for output
Chapter 3: Predictive Analysis Applied
It all starts with initial data exploration: historic and current data sets form the basis of all
predictions. The following section discusses the importance and methods of initial data
exploration and of data preparation for predictive analysis. The quality of any analysis depends
directly on the quality of the input data, in line with the age-old concept of garbage in,
garbage out. Pushing algorithms onto whatever data has been collected will not by itself give
useful predictions, which makes data exploration a difficult and crucial step. Prediction
depends on first identifying what data might be useful, then finding out where it is available,
analyzing it, reviewing it to understand and validate it, and proposing the key elements that
affect the outcome.
3.1 Initial Data Exploration
There are two types of data: qualitative and quantitative. Qualitative data, also called
categorical data in statistics, is mostly expressed in natural language rather than in numbers.
Examples include ‘the text color is red’, ‘the tallest in the class is Anna’ and ‘male
elephant’. The categories are generally associated with some structure. Nominal categories have
no natural ordering, like race, gender or religion, while ordinal categories can be ordered in
some way, like small, medium and large. Numerical measurements, expressed in numbers rather than
natural language, are categorized as quantitative data. Some numbers, however, are neither
continuous nor measurable, like post codes and tax codes; they can be recognized by the fact
that adding or subtracting them is meaningless. Quantitative variables are either discrete, such
as the number of students in a class, which behaves like an integer, or continuous, such as
weight, height or salary. There are further categories, such as binary variables (0 or 1, on or
off) and dates, which merely look like numeric digits. In short, categorical, string or text
variables are qualitative, while numeric variables that can be used in algebraic functions are
quantitative.
The data type is crucial for any analysis: non-numeric values cannot be predicted by linear
regression, for example, while some decision trees need input data in a standard format. Data
types should be chosen before testing, as most statistical tests and analyses are sensitive to
them. Data types in PAL generally reflect the database types. In PA there are string or varchar
data types for qualitative variables and only integer and double for quantitative variables. PA
also supports the ‘Date’ data type and offers facilities to convert data types and to reformat
values. PAL provides functions that convert data types; for instance, CONV2BINARYVECTOR converts
categorical input data to numeric data for algorithms that accept only numeric data, such as
K-Means.
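The idea behind converting categories into binary vectors can be shown with a short Python sketch. It illustrates the general technique (often called one-hot encoding) and does not reproduce the exact output layout of the PAL function; the function name and sample values are hypothetical.

```python
def to_binary_vectors(values):
    """Map each category to a 0/1 vector containing a single 1."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[position[value]] = 1
        vectors.append(vector)
    return categories, vectors


categories, vectors = to_binary_vectors(["small", "large", "medium", "small"])
# categories is ['large', 'medium', 'small']
# vectors is [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each category becomes its own numeric column, so that a distance-based algorithm does not read an artificial ordering into the categories.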
A second way to convert categorical variables to numeric ones is to use the Formula component
with clauses such as If (‘Name’ == ‘ANNA’) Then (1). We may also sometimes need to construct new
variables from existing ones. For example, the ratio of bank transactions made at the weekend to
those made on weekdays may be more helpful than the two values used independently. Missing
values are another important consideration in predictive analysis. There can be many reasons for
missing values in a data set, such as mistakes, reluctance to provide confidential information,
or values that are simply unavailable. Missing values can be handled in different ways: they can
be ignored, substituted with values based on similar records, or interpolated from the set of
known data values.
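These three preparation steps (a formula-style flag, a derived ratio and the substitution of missing values) can be sketched in Python as follows. The sketch assumes that missing values are replaced by the mean of the known values, which is only one of the options mentioned above; all names and numbers are hypothetical.

```python
def weekend_ratio(weekend_count, weekday_count):
    """Derived variable: weekend transactions relative to weekday ones."""
    if weekday_count == 0:
        return None  # ratio undefined; treated as a missing value
    return weekend_count / weekday_count


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Formula-style conversion of a categorical value to a numeric flag
is_anna = [1 if name == "ANNA" else 0 for name in ["ANNA", "BOB", "ANNA"]]

# Hypothetical (weekend, weekday) transaction counts for three customers
ratios = [weekend_ratio(w, d) for w, d in [(4, 8), (3, 0), (6, 12)]]
filled = impute_mean(ratios)
```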
Data is generally understood better when it is represented visually rather than through lists or
tables. The following figure shows how visualization makes data easier to understand: the most
frequent call pairings and groups of callers can be seen clearly in the chart on the right,
compared with the same call traffic data in the table. Credit for inventing line charts, bar
charts, pie charts etc. goes to William Playfair, who published the first versions of such
charts in 1786. Once the data has been explored and visualized, we come to the next stage of
data preparation for predictive analysis: sampling, scaling and binning, covered in the
following sections.
Table versus Chart
3.1.1 Sampling
Sampling is the process of creating subsets of the data in order to draw inferences about all of
the data. Statisticians refer to all the data as the population, so this process is also called
sampling from the population. It is used where the effort to access and interpret all the data
is too high, and it also helps when some of the data is missing for any reason. Generally
speaking, sampling is used mainly to explore the data initially from different angles before
choosing one to focus on in detail, as it gives an idea of which data is actually unnecessary
for the specific analysis. The guiding principle is that the sample is representative of all the
data. When sampling is used to validate a predictive model, the technique is called
cross-validation. We may also sample because the sheer amount of data makes running or training
models on the complete data set too time consuming.
In its simplest form, this involves using a subset of the data to build a model and the
remaining data to test it. The test data must therefore be excluded from model building so that
it can be used later to check how good the analysis and predictions are. Often a model performs
worse on new data than on the data it was built from. This is known as overfitting: the model is
excellent on the training data but poor on the test data, because it has been fitted too closely
to the data set it was trained upon, which can render it useless on other samples of the data.
For example, you have a data set of 100,000 rows, take a sample of 10,000 to train a model, and
get a predicted accuracy of 97%. That looks good, but when you run the model on another
(control) sample, you get an accuracy below 50%: your model was overfitted. Sampling for
validation ranges from a simple split of the data into a training set and a test set, called
holdout, to k-fold cross-validation, where the data set is divided into k subsets and the
holdout method is repeated k times. Each time, one of the k subsets is used as the test set and
the other k-1 subsets are combined to form the training set; the results are then averaged
across all k trials. A different approach is to randomly divide the data into a test set and a
training set k times. Sampling is thus easy to apply, but whichever method you choose, the
underlying issue remains that you are not looking at the whole data set: what you gain in
processing time, you lose in data coverage. PAL provides eight sampling methods, invoked from
its sampling function:
1. First N
2. Middle N
3. Last N
4. Every Nth
5. Simple random with replacement
6. Simple random without replacement
7. Systematic sampling
8. Stratified sampling
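The holdout and k-fold schemes described above can be sketched in Python as follows. This is a minimal illustration with hypothetical function names; it only produces the training and test sets and leaves out the model that would be trained and evaluated on them.

```python
import random


def holdout(records, test_fraction=0.3, seed=0):
    """Single train/test split: the simplest form of validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(records, k, seed=0):
    """Yield (training, test) pairs; each record is in a test set once."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, test


records = list(range(10))
train_set, test_set = holdout(records)
splits = list(k_fold_splits(records, k=5))
```

In k-fold cross-validation a model would be trained and evaluated once per split, and the k accuracy values averaged to estimate how well it generalizes.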
Simple random sampling with replacement draws N records, or N% of all records, at random with
replacement, i.e. a selected record is returned to the pool and may be selected again. Simple
random sampling without replacement works in the same way, except that a selected record is not
returned to the pool for further selection. Systematic sampling, sometimes called interval
sampling, adds a little structure to random sampling. We first determine the sampling interval k
by dividing the number of units in the population by the desired sample size. For instance, to
select a sample of 50 from a population of 600, we need a sampling interval of 600 ÷ 50 = 12.
With k set to 12, one record is chosen from every 12 records, giving a total of 50 records in
the sample. The first record included in the sample is chosen as a random number between 1 and
k, referred to as the random start.
The 25 records that comprise the population used in this example are shown in the figure below.
The code below shows the main elements of the PAL script for systematic sampling; the full code
is available in the file SAP_HANA_PAL_SAMPLING_SYSTEMATIC_Example_SQLScript on the SAP PRESS
website. The control parameter SAMPLING_METHOD is set to 6, which selects systematic sampling
from the eight methods listed above (the method values are counted from 0, so 6 denotes the
seventh entry of the list). With a sample size of 5, the output shown in the figure below
contains 5 records at a sampling interval of 5 (25 ÷ 5), starting from the randomly chosen
value 2.
-- The procedure generator
CALL SYSTEM.afl_wrapper_generator ('SAMPLING_TEST','AFLPAL','SAMPLING',PDATA);
-- The control table parameters: method 6 (systematic sampling), sample size 5
INSERT INTO #CONTROL_TAB VALUES ('SAMPLING_METHOD',6,null,null);
INSERT INTO #CONTROL_TAB VALUES ('SAMPLING_SIZE',5,null,null);
-- Assume the data as in the figure below and call the procedure
CALL SAMPLING_TEST(DATA_TAB, "#CONTROL_TAB", RESULT_TAB) WITH OVERVIEW;
SELECT * FROM RESULT_TAB;
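The selection logic of systematic sampling can be checked independently of PAL with a short Python sketch. It reproduces the example of 25 records with a sample size of 5 and a random start of 2; the function name is hypothetical, and the sketch assumes that the population size is a multiple of the sample size.

```python
import random


def systematic_sample(records, sample_size, start=None, seed=None):
    """Pick every k-th record, where k = population size // sample size."""
    interval = len(records) // sample_size
    if start is None:
        # Random start between 1 and k (1-based position of the first pick)
        start = random.Random(seed).randint(1, interval)
    return [records[start - 1 + i * interval] for i in range(sample_size)]


population = list(range(1, 26))  # 25 records numbered 1..25
picked = systematic_sample(population, sample_size=5, start=2)
# picked is [2, 7, 12, 17, 22]

interval_600 = 600 // 50         # interval for a sample of 50 out of 600
```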
One more sampling method available in PAL is stratified sampling, which aims to find attributes
that divide a data set or population into subpopulations, or strata, in such a way that the
sample selected from the population can still be considered representative of it. Stratified
sampling takes samples from each stratum of the population. The requirement is that the
proportion of each stratum in the sample be the same as in the population. This technique is
applied when the population is heterogeneous but can be divided into homogeneous subpopulations.
When the data is homogeneous, on the other hand, simple random sampling is more appropriate.
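The proportionality requirement of stratified sampling can be illustrated with the following Python sketch. It draws the same fraction from every stratum without replacement; the function name, the stratum attribute and the store population are hypothetical and unrelated to the PAL implementation.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        count = round(len(members) * fraction)
        sample.extend(rng.sample(members, count))
    return sample


# Hypothetical population: 80 small stores and 20 large ones
stores = [("small", i) for i in range(80)] + [("large", i) for i in range(20)]
sample = stratified_sample(stores, stratum_of=lambda s: s[0], fraction=0.1)
```

The sample of 10 stores contains 8 small and 2 large ones, i.e. the same 80:20 proportion as the population.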
Input & Output Systematic Sampling
PA allows us to sample data both in-database in SAP HANA and on non-SAP HANA data sources,
using the Sample component available in the Data Preparation tab of the Predict view.
The Sample Component in PA
3.1.2 Scaling
Scaling is done before running any predictive algorithm to ensure that every variable gets equal
weight and emphasis as an input to the model, which requires a common data scale for all
variables. For instance, we can scale all input data to lie within a selected range, such as
-2.5 to 2.5 or 0.4 to 2.9. Another approach, known as normalization or standardization, uses the
z-score: a variable is rescaled to have a mean of zero and a standard deviation of one. For each
record, the mean of the variable is subtracted from the value, which gives the standardized
variable a mean of zero, and the result is divided by the standard deviation, which gives it a
standard deviation of one. In simple words, a value of 2 indicates that the record lies two
standard deviations above the mean, while a value of -3 indicates that it lies three standard
deviations below the mean. Scaling or normalization plays a vital role for algorithms that
involve neural networks or distance measurements, such as nearest neighbor classification and
clustering. If the variables are on very different scales, for instance some in millions and
others in tens or hundreds, the large-valued variables can dominate the analysis, hence the need
for a common scale among the numeric variables.
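The z-score calculation described above can be written in a few lines of Python. The sketch assumes the population standard deviation; whether a given tool divides by n or by n - 1 is an implementation detail not covered here, and the turnover values are hypothetical.

```python
import statistics


def z_scores(values):
    """Rescale values to a mean of 0 and a standard deviation of 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / std for v in values]


turnover = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
scaled = z_scores(turnover)
```

Here the value 9.0 becomes 2, meaning it lies two standard deviations above the mean, while 2.0 becomes -1.5.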
PAL supports three scaling methods, which can be called using the SCALINGRANGE function:
• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling
Let us take an example of scaling, assuming we have the data table DATA_TAB in HANA as shown in
the figure below. To scale a variable X into the range A to B, each value Xi is transformed with
the formula (B - A) * (Xi - Min X) / (Max X - Min X) + A, which for A = 0 and B = 1 simplifies
to (Xi - Min X) / (Max X - Min X). The scaling procedure can be called with the following PAL
SQLScript.
-- Calling the procedure
CALL SCALINGRANGE_TEST (DATA_TAB, "#CONTROL_TAB", RESULT_TAB) WITH OVERVIEW;
SELECT * FROM RESULT_TAB;
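The min-max formula can be verified with a small Python sketch. The parameter names new_min and new_max mirror the role of A and B in the formula; the function name and the sample values are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so that min -> new_min and max -> new_max."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; the range is zero")
    span = new_max - new_min
    return [span * (v - low) / (high - low) + new_min for v in values]


data = [10.0, 20.0, 25.0, 50.0]
unit = min_max_scale(data)                             # A = 0, B = 1
wide = min_max_scale(data, new_min=-2.5, new_max=2.5)  # A = -2.5, B = 2.5
# unit is [0.0, 0.25, 0.375, 1.0]
# wide is [-2.5, -1.25, -0.625, 2.5]
```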
Scaling types and their results compared
The figure above shows the result of scaling that data to the range 0 to 1 by setting the parameter
NEW_MAX to 1 and NEW_MIN to 0, i.e. a maximum value of 1 and a minimum value of 0.
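The min-max formula (B - A) * (Xi - Min Xi) / (Max Xi - Min Xi) + A can be checked with a short
Python sketch. It is not the PAL implementation: the function min_max_scale and the sample values
are hypothetical, and the arguments new_min and new_max merely mirror the NEW_MIN and NEW_MAX
control parameters mentioned above.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [(new_max - new_min) * (v - lo) / (hi - lo) + new_min for v in values]

# Made-up values, for illustration only.
data = [10, 20, 30, 50]
unit_range = min_max_scale(data)                   # range 0 to 1
wide_range = min_max_scale(data, -2.5, 2.5)        # range -2.5 to 2.5
print(unit_range, wide_range)
```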
Another method of scaling in PAL is z-score normalization. It is selected by setting the scaling
method control parameter to 1 and then choosing one of 3 possibilities: ‘Mean and standard
deviation’, ‘Mean and mean absolute deviation’ or ‘Median and median absolute deviation’. With
Z_Score_Method set to zero and the same input table as in the previous example, the results are
as shown in the figure above, where the data is scaled to a mean of zero and a standard deviation
of 1. We can also scale with the Normalization component in PA, using an interface available in
the Data Preparation tab under the Predict view. Scaling can be done on both SAP HANA and non-SAP
HANA data sources. The figure below gives an overview of the Normalization component in PA, which
uses an inbuilt function for normalization.
Normalization Component in PA
3.1.3 Binning
Binning is used to summarize or group data for better visualization when the volume of data to be
analyzed is large. Constructing a histogram, for instance, is not easy without data binning.
Visualizing a huge number of data points requires binning first, which can then be followed by
interactive drill-down. Data binning is generally done before running any predictive algorithm, in
an attempt to reduce the complexity of the model. Complex models are no one’s cup of tea, as they
are difficult to understand, and thus the aim is parsimony, i.e. the simplest model with very few
variables. Binning of numeric data is also called discretization of continuous data, and it is
important to do it effectively, otherwise it will lead to complex models. Imagine trying to
construct a decision tree on variables with a huge set of distinct numbers, where each branch of
the tree considers every number: the result is a complex decision tree that is difficult to
implement. Before discussing the binning functions available in PAL, it is good to know that PAL
also allows three methods of smoothing.
Noise is a random error or variance in a measured variable. Given a numerical attribute
such as, say, age, how can we “smooth” out the data to remove the noise? Data smoothing uses
an algorithm to remove noise from a data set, allowing important patterns in the data set to
stand out. Random, random walk, moving average, simple exponential, linear exponential and
seasonal exponential smoothing are some common smoothing methods in data mining. The three
methods in PAL work on bins: firstly, smoothing by bin means, where the mean value of the bin
replaces every value in the bin; secondly, smoothing by bin medians, where the bin median replaces
every value in the bin; and lastly, smoothing by bin boundaries, where each value in the bin is
replaced by the closest boundary value, the boundaries being the minimum and maximum values in the
given bin. Binning also helps to find thresholds on continuous variables. For example, a person's
income might have an effect on where the person goes on vacation, and maybe the threshold is 500K
a year: those with less stay in Norway, those with more go abroad. If you use a continuous
variable in a decision tree the threshold will be difficult to spot, but if you bin the data into
"less than 500K" and "more than 500K" it will be easier. The problem is that we often do not know
beforehand which binning strategy should be applied, i.e. how many groups, and with what rules?
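The three bin-based smoothing methods described above can be sketched as follows. This is an
illustrative Python version, not the PAL implementation; the function names are hypothetical, and
sending ties to the lower boundary in smooth_by_boundaries is an assumption.

```python
from statistics import mean, median

def smooth_by_mean(bin_values):
    """Replace every value in the bin by the bin mean."""
    m = mean(bin_values)
    return [m for _ in bin_values]

def smooth_by_median(bin_values):
    """Replace every value in the bin by the bin median."""
    m = median(bin_values)
    return [m for _ in bin_values]

def smooth_by_boundaries(bin_values):
    """Replace every value by the closer of the bin minimum and maximum."""
    lo, hi = min(bin_values), max(bin_values)
    # Ties go to the lower boundary (an arbitrary choice for this sketch).
    return [lo if v - lo <= hi - v else hi for v in bin_values]

one_bin = [6, 10, 12, 13]  # made-up contents of a single bin
by_mean = smooth_by_mean(one_bin)
by_median = smooth_by_median(one_bin)
by_boundaries = smooth_by_boundaries(one_bin)
print(by_mean, by_median, by_boundaries)
```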
PAL supports the following four methods to achieve binning:
Equal widths based on the number of bins
Equal widths based on the bin width
Equal number of records per bin
Mean/Standard Deviation bin boundaries
Let’s implement an example of this binning process on the table DATA_TAB shown in the figure
below. The main elements of the SQLScript are shown in the code below, while the full code is
available in the file SAP_HANA_PAL_BINNING_Example_SQLScript on the SAP PRESS web site.
Input Output tables for Binning table in PAL
-- The procedure generator
CALL SYSTEM.afl_wrapper_generator('BINNING_TEST','AFLPAL','BINNING',PDATA);
-- The control table parameters
-- BINNING_METHOD 0: equal widths based on the number of bins
INSERT INTO #CONTROL_TAB VALUES ('BINNING_METHOD',0,null,null);
-- SMOOTH_METHOD 0: smoothing by bin means
INSERT INTO #CONTROL_TAB VALUES ('SMOOTH_METHOD',0,null,null);
-- BIN_NUMBER: number of bins
INSERT INTO #CONTROL_TAB VALUES ('BIN_NUMBER',4,null,null);
-- Assume the data in table DATA_TAB is as shown in the input table in the figure
-- Calling the procedure
CALL BINNING_TEST(DATA_TAB, "#CONTROL_TAB", RESULT_TAB) with overview;
SELECT * FROM RESULT_TAB;
In this example the binning method is 0 (method numbering starts from zero), i.e. equal widths
based on the number of bins, here set to 4, and smoothing is done by bin means. The bin width is
calculated as (max - min) / k, which in this case is (38 - 6) / 4 = 8, so the bin ranges are
>=6 to <14; >=14 to <22; >=22 to <30; >=30 to <=38. The first bin therefore gets the values
6, 12, 13 and 10, which have a mean of 10.25. In the same way, the second bin contains only the
value 15, while the third bin holds the values 23, 24 and 25, with a mean of 24. The fourth and
last bin contains the values 30, 32 and 38, and thus has a mean of 33.33. The results are shown in
the figure above. We can choose the binning method that appeals most to us in a particular
scenario, which is most often found by trial and error. The best approach can be to try all the
binning and smoothing methods and see the individual impact of each on the model. The choice of
binning method is not important when the model is robust to changes in it. If the model solution
does vary significantly, it is always practical to analyze the reason behind the variations by
looking at the data in detail.
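The arithmetic of this worked example can be reproduced with a short Python sketch. It is not PAL
code: the function equal_width_bins is hypothetical, and the input list simply contains the eleven
values quoted in the text (their order in DATA_TAB is not known here and does not matter for the
bin means).

```python
from statistics import mean

def equal_width_bins(values, k):
    """Split values into k bins of equal width (max - min) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value belongs to the last bin (its upper bound is inclusive).
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return width, bins

values = [6, 12, 13, 10, 15, 23, 24, 25, 30, 32, 38]
width, bins = equal_width_bins(values, 4)
bin_means = [round(mean(b), 2) for b in bins]  # smoothing by bin means
print(width, bins, bin_means)
```

The bin means agree with the values 10.25, 15, 24 and 33.33 derived in the text.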
3.1.4 Outliers
Outliers in data can significantly influence an algorithm’s performance, a model’s parameters
and the confidence in its predictions. A crucial practice in initial data exploration is therefore
always to check for outliers or unusual values in the given data set, to understand their cause
and to decide the next actions concerning them. Some algorithms are more sensitive to outliers
than others. An outlier is an observation which deviates so much from the other observations as to
arouse suspicions that it was generated by a different mechanism. Outliers can easily be
visualized in scatter plots, although these are difficult to scale to large data volumes.
The box plot, as shown in the figure, is the most popular visualization approach for outlier
detection. Box plots are an excellent tool for conveying location and variation information
in data sets, particularly for detecting and illustrating location and variation changes between
different groups of data. The box is drawn between the lower and upper quartiles on the y-axis
scale, and the white line inside the box represents the median value. The fences above and below
the box are placed at a factor times the interquartile range. A single box plot can be drawn for
one batch of data with no distinct groups. Alternatively, multiple box plots can be drawn together
to compare multiple data sets or to compare groups in a single data set. For a single box plot,
the width of the box is arbitrary. For multiple box plots, the width of the box can be set
proportional to the number of points in the given group or sample (some software
implementations of the box plot simply set all the boxes to the same width). All the dots plotted
outside the fences represent the outliers. Data volumes and dimensions affect outlier
detection and visualization.
PAL offers specific algorithms for outlier detection, namely the Variance Test, the Inter-Quartile Range Test, the K Nearest Neighbor Outlier Test, and Anomaly Detection using Cluster
Analysis. The Inter-Quartile Range test is a simple and popular test for outlier detection and is
the basis of the very useful box plot. It is also a robust test in that the outliers do not themselves
affect the statistics of the test, as opposed to the Variance test, where outliers clearly affect the
limits given that they are measured in terms of standard deviations. That is the weakness of the
Variance test, but, again, its simplicity makes it popular. K Nearest Neighbors looks for local
outliers, as opposed to global outliers, which is very useful as these are often harder to find
because they are not so obvious. The weakness of the test is that the value of K may affect the
solution, but this can be minimized by exploring the solution using several values of K. The other
weakness is that by specifying the number of outliers, you ensure that you get that number, and
some may not really be outliers.
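The difference in robustness between the two global tests can be illustrated with a small Python
sketch. It does not reproduce the PAL procedures: the function names, the fence factor of 1.5, the
3-sigma limit and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

def variance_outliers(values, n_sigma=3.0):
    """Flag values more than n_sigma standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) > n_sigma * sigma]

# Made-up data with one obvious outlier.
data = [10, 12, 11, 13, 12, 11, 10, 12, 100]
print(iqr_outliers(data))        # quartiles ignore the extreme value
print(variance_outliers(data))   # the extreme value inflates the standard deviation
```

In this sample the value 100 inflates the standard deviation so much that it stays inside the
3-sigma limits, while the quartile-based fences are unaffected by it. This is the weakness of the
Variance test described above.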
3.2 Which Algorithm When
In Chapter 2 of this report we mentioned the vast number of algorithms provided by PAL, but an
interesting question is: out of the many algorithms available, which one should be used when? It
is a big task for new users to decide which algorithm to use to get the results and analysis they
want. The task becomes worse if we consider the 3500-plus packages or algorithms contained in R.
The section below discusses the criteria and main factors to consider before selecting the right
algorithm. We will also talk about the accuracy of an algorithm and try to summarize a general set
of rules for selecting an efficient and appropriate algorithm.
The basic questions that drive the selection of an algorithm are:
What is the purpose of the algorithm, and what do you want to see as the analysis result? For
example: group the data, look for associations in the data, or predict a series of data values.
What data do you have, and what are the attributes of that data? For example: numeric,
categorical, Boolean, etc.
The answers to these questions help you find the best algorithm to apply, as you have a
better idea of which algorithm fits your purpose. The table below lists some common tasks with
the corresponding algorithm category and example algorithms, as given in the SAP Predictive
Analysis book, which makes it easier to pick an algorithm: for example, if we want to look for
unusual values or outliers, we can use the variance test and the inter-quartile range test. To
build a predictive model of one variable using the data of a second variable, we can use decision
trees, neural networks or regression models. The second main factor, the kind of data and its
attributes, is also crucial, as some algorithms work only on numeric data, others only on
categorical data, and others can be modified to support both data types. Of course the table can
help, but we will still try to classify the algorithms into the five main classes of application
in PA.
Algorithms in PA can broadly be classified into the following 5 groups, and this classification
gives a better understanding of the purpose and use of the algorithms.
1. Association analysis, trying to find associations or affinities in the data.
2. Segmentation or cluster analysis, trying to segment or group the data into similar clusters.
3. Classification analysis, trying to classify or predict new data based on a model built by an
algorithm. It is the largest group of algorithms in PA and predicts a variable using the data
of other variables that are believed to affect the values of the variable that we are trying
to predict.
4. Time-series analysis, trying to use data with an inherent periodicity to predict values for
future time periods.
5. Outlier analysis, trying to find unusual values in the data.
Task | Algorithm Category | Example Algorithms
Summary statistics | Descriptive statistics | Mean, median, variance…
Outlier detection | Statistical tests | Variance test, IQR test, anomaly detection…
Preparation of the data for analysis | Data preparation | Sampling, scaling, binning…
Statistical inference | Sampling theory | T tests, F tests, ANOVA…
Relationships, cause and effect | Correlation and regression | Multiple linear regression, non-linear regression…
Clustering or grouping data | Cluster analysis | ABC Analysis, K-Means, Kohonen SOMs…
Time series forecasting | Time series analysis | Exponential smoothing, regression…
Association or affinity analysis | Association analysis | Apriori
Prediction, model building | Classification analysis | Decision trees, neural networks, regression…
Social network analysis | Network analysis | Jaccard’s coefficient, common neighbors…
Optimization | Optimization | Linear and non-linear programming
Risk analysis, modelling | Simulation | Monte Carlo analysis
Algorithm Categories with tasks and examples
In association analysis, the most common and powerful algorithm is Apriori, which is
discussed in detail later in this report. In the second group, segmentation, the most popular
algorithm is K-Means, which is known for its simplicity and positive correlation. Classification
analysis has the largest group of algorithms, indicating its importance in PA, and is further
sub-classified into 3 groups: regression algorithms, decision tree algorithms and neural network
algorithms. Regression is essentially the fitting of a model, either linear or non-linear, of the
form Y is a function of X1, X2…XN, where Y is the dependent variable and the Xi are the
independent variables, which minimizes the difference between the fitted data and the actual
data. Bivariate linear and non-linear, multiple linear, polynomial and logistic regression are the
main regression algorithms. Decision trees recursively partition the data, starting with the most
divisive split of the input variable values with respect to the target variable, and keep doing
the same until one of several stopping criteria is met. The result then defines the relationships
between the input and target variables.
Class of Problem and Algorithm Group | Input or Independent Variables | Output or Target or Dependent Variable | Algorithms
Association | Categorical | Categorical : Association rules with support, confidence and lift | Apriori, Apriori Lite
Cluster | Numeric | NA : Cluster groupings, cluster quality | K-Means, ABC Analysis, Kohonen SOMs
Classify - Regression | Numeric | Numeric : Best fit regression equation | Multiple Linear & Non-Linear Regression
Classify - Regression | Numeric/Categorical | Numeric/Categorical : Best fit logistic curve, probabilities of outcomes | Logistic Regression
Classify - Decision Trees | Numeric/Categorical | Numeric/Categorical : Decision tree and rules with confidence level | C4.5, CHAID
Classify - Neural Networks | Numeric/Categorical | Numeric/Categorical : Black box model for prediction | Neural Network
Classify - Other | Numeric | Numeric/Categorical : Classification of new data | K Nearest Neighbor
Time Series Analysis | Numeric | Numeric : Best fit and projected values | Exponential Smoothing, Regression
Outlier Detection | Numeric | NA : Detected outliers | IQR, Variance Test, Anomaly Detection
Neural network algorithms are closer to the way the human brain processes information. Two
neural network algorithms sourced from R, namely the Monmlp package and the Nnet package, are
supported by PAL. Functionally, they work by simulating a large number of interconnected simple
processing units arranged in layers (input, hidden and output), joined by connections of
varying strengths or weights. The network learns by examining individual records, generating a
prediction for each record, and adjusting the weights whenever it makes an incorrect
prediction. K-nearest neighbor is the final sub-category under classification algorithms; it
predicts or classifies objects based on their similarity or closeness to other objects, with the
prediction calculated as the average classification of those neighbors. Time series algorithms are
significant because business applications need the advantage of time series forecasting. Data is
generally constant, trending or seasonal, and thus smoothing goes hand in hand here. Outlier
analysis algorithms form the last group and seek unusual values. The best known algorithm here is
the Inter-Quartile Range test, which is also the basis of the box plot. The Variance test is also
commonly used, following the simple concept that unusual data is distant from the average of the
data. With this much knowledge we can start applying algorithms on a trial and error basis to find
the one best suited to our purpose, but it is still advisable simply to try all the algorithms in
the same group to see which provides the best fit for your analysis. The table above summarizes
what we just discussed.
To check which algorithm works best for our problem, an easy and logical approach is to
run all the candidate algorithms on the input data and choose the best one. But what factors
decide what is best, and how do we measure it? The answer to this question is different for each
group, as we cannot compare two algorithms from different groups. For association analysis, the
choice of algorithms is between Apriori and Apriori Lite. Apriori Lite, being a subset of
Apriori, is restricted to finding single pre and post rules. The choice thus depends on rule
requirements and performance, as Apriori Lite will be faster than the generic Apriori but is
restricted in terms of the rules extracted from the data. For cluster analysis, finding the better
algorithm is difficult: in ABC Analysis, for example, different values of A, B or C cannot be
judged better or worse. The user can only choose the values that suit the application, and thus no
model is best. The K-Means algorithm may analyze clusters less well than Kohonen Self Organizing
Maps (Kohonen SOMs), but it is easier to understand and more flexible, while Kohonen SOMs lack the
functionality to determine the number of clusters in advance. It is thus logical to try both
K-Means and Kohonen SOMs, with varying cluster numbers, and to explore the solutions in order to
decide which is the most appropriate for the application. For time series analysis, the
predictions are numeric, as in classification analysis, so the same measures of model quality are
assumed, except that the analysis is done over time periods.
For the outlier tests, the Variance test and the Inter-Quartile Range (IQR) test help to find
overall outliers in the data set. The Variance test is simple and well known, but the outliers
themselves influence the analysis. The IQR test, which uses the median and quartiles, is therefore
more popular, because its limits are not affected by the actual outliers. Local outliers in the
data set can be found by the Anomaly Detection algorithm.
For classification models, it is more logical to compare algorithms according to whether the
prediction is numeric or categorical. For numeric predictions, the most common measure is the
residual error, which is the sum over all data points of the square of actual minus fitted. It is
commonly presented as the MSE, the Mean Squared Error, or, scaled back to the units of the
original data, as the RMSE, the Root Mean Squared Error. Statistical measures of goodness of fit
such as R squared, analysis of variance or the F value are used to assess numeric predictions in
regression analysis. Categorical predictions are generally evaluated with classifier confusion
matrices, which show how often each category is predicted correctly and how often incorrectly.
Measures of model quality such as Sensitivity, the True Positive rate, and Specificity, the True
Negative rate, can then be derived from these matrices. Gain and lift charts are plotted to
compare model and algorithm performance in the case of binary classification models. Based on the
discussion above, the following rules can serve as a base for comparing algorithms, their
performance and their applications according to user requirements.
If the objective is to find associations in the data,
Use Apriori if all multiple item associations are required.
Use Apriori Lite if only single pre and post item rules are required.
Use Apriori Lite sampling if performance with Apriori is too slow.
If the objective is to find clusters or segments in the data,
Use ABC Analysis if the cluster sizes are user defined.
Use K-Means if the desired number of clusters is known.
Use Kohonen SOMs if the number of clusters is unknown.
If the objective is to find outliers or unusual values,
Use the Variance and IQR tests if you are looking for global outliers.
Use the IQR test if there are significant outliers.
Use Anomaly Detection if you are looking for local outliers.
If the objective is to classify data when the target variable is numeric and only one independent
numeric variable exists,
Use Bivariate Linear Regression if a linear relationship is considered.
Use Bivariate Exponential, Geometric or Natural Logarithmic Regression if a non-linear
relationship is considered.
Use Multiple Linear and Non-Linear Regression for linear and non-linear models
respectively if more than one independent numeric variable is considered.
If the objective is to classify data when the variables are categorical or a mixture of
categorical and numeric,
Use C4.5, CHAID or CNR and choose the best fit if decision tree rules are the desired output.
Use Logistic Regression when the preferred output is the probability of the outcomes.
Use Neural Networks and choose the best fit if model quality is of primary concern.
If the objective is to predict time series data,
Use Single Exponential Smoothing when the data is constant or stationary.
Use Double Exponential Smoothing when the data is trending.
Use Triple Exponential Smoothing when the data is seasonal.
Use K Nearest Neighbor when the objective is to classify data with a simple, easy to understand
approach, the input variables are numeric and the target variable is numeric or categorical.
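The model quality measures used in this section to compare algorithms within a group (MSE and
RMSE for numeric predictions, sensitivity and specificity for categorical predictions) can be
computed as in the following Python sketch. The function names and the sample data are
hypothetical and serve only to make the definitions concrete.

```python
from math import sqrt

def mse(actual, fitted):
    """Mean squared error: sum of squared (actual - fitted) divided by n."""
    sse = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    return sse / len(actual)

def rmse(actual, fitted):
    """Root mean squared error, on the scale of the original data."""
    return sqrt(mse(actual, fitted))

def sensitivity_specificity(actual, predicted, positive=1):
    """True positive rate and true negative rate from a binary confusion matrix."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up numeric predictions.
error = rmse([3, 5, 7, 9], [2, 5, 8, 11])
# Made-up binary predictions (1 = positive class).
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0, 1],
                                     [1, 0, 1, 0, 0, 1, 0, 1])
print(error, sens, spec)
```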
3.3. Challenges & Resolutions
This section discusses the following four common difficulties faced in any predictive
analysis process.
1. Cause & effect
2. Lies, damned lies & statistics
3. Model overfitting
4. Correlation between independent variables
We cannot always conclude that a relationship is one of cause and effect just because we find
a good mathematical relationship between variables, as not every mathematical relationship can be
interpreted that way. If we plot the number of jobs in the market against the number of cars newly
bought in a city, we may see a mathematical relationship, but we cannot conclude that every car
sold creates a new job. The second challenge, lies and statistics, can be understood through an
example known as Anscombe’s quartet, which shows the danger of looking only at statistical
measures. Anscombe used 4 datasets with nearly identical simple statistical properties to
demonstrate the importance of plotting data before analyzing it, and the impact of outliers, as
the datasets all look very different from one another when plotted. The first dataset appears
well behaved, with a clean and well-fitting linear model; it can be fitted by y = 3 + 0.5x, with a
mean of X of 9 and a mean of Y of about 7.5. The second dataset does not have a linear
relationship, yet strangely it has the same regression line y = 3 + 0.5x with an R squared value
of 0.67. The third dataset does have a linear relationship, but the linear regression is thrown
off by an outlier; had the outlier been spotted and removed beforehand, it would have been easy
to fit a correct linear model. The last dataset does not fit any kind of linear model, but its
single outlier keeps the alarm from going off. This implies that it is wise to understand the data
before applying any algorithm. The graphs and the 4 data sets used are shown in the figure below.
       I               II              III              IV
   x      y        x      y        x      y        x      y
 10.0   8.04     10.0   9.14     10.0   7.46      8.0   6.58
  8.0   6.95      8.0   8.14      8.0   6.77      8.0   5.76
 13.0   7.58     13.0   8.74     13.0  12.74      8.0   7.71
  9.0   8.81      9.0   8.77      9.0   7.11      8.0   8.84
 11.0   8.33     11.0   9.26     11.0   7.81      8.0   8.47
 14.0   9.96     14.0   8.10     14.0   8.84      8.0   7.04
  6.0   7.24      6.0   6.13      6.0   6.08      8.0   5.25
  4.0   4.26      4.0   3.10      4.0   5.39     19.0  12.50
 12.0  10.84     12.0   9.13     12.0   8.15      8.0   5.56
  7.0   4.82      7.0   7.26      7.0   6.42      8.0   7.91
  5.0   5.68      5.0   4.74      5.0   5.73      8.0   6.89
Four Data Sets in Anscombe’s Quartet
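The near-identical summary statistics of the four data sets can be verified directly from the
values in the table. The following Python sketch computes the means, the least-squares line and
R squared for each data set; the helper name summarize and the dictionary layout are choices made
for this example.

```python
from statistics import mean

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I":   (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return mean of x, mean of y, slope, intercept and R squared of the fitted line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return mx, my, slope, intercept, r_squared

summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, [round(s, 2) for s in stats])
```

All four data sets give a mean of X of 9, a mean of Y of about 7.5, a fitted line close to
y = 3 + 0.5x and an R squared of about 0.67, even though their plots look entirely different.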
Overfitting signifies a condition in which a model fits the data under analysis “too well”: it
describes your sample nearly perfectly and is too rigid to fit any other sample. This makes it
unable to serve our predictive needs, because it fits badly on new data. Overfitting specifically
needs to be watched for when you have small sample sizes or your data is limited in some way. It
can be defined as the phenomenon where the predictive model describes the relationship between
predictors and outcome well in the sample but fails to provide valid predictions on new data. It
is generally due to high expectations and a need for accuracy, which lead to an extra effort to
fit the sample data by introducing too many input variables. Most often it is the case when a
model has too many parameters compared to the number of data points. Including test data and
analyzing it from every angle is crucial when building a predictive model, to make it more
accurate and stable over time. The figure below illustrates this with two graphs based on the
same data points. The left graph does a decent job, as it captures the general nature and
characteristics of the relationship between the X and Y variables. The right-hand graph is clearly
trying too hard to capture every subtle change in the relationship between the two variables. The
model on the left will therefore outperform the model on the right when new data points are fed
in, as the right-hand model will not generalize well to data it has not seen before. To avoid
overfitting, the advice is to use a proportion of the available data to train the model and the
rest of the data, the unseen or hold-out data, to test the model. This is a key methodology in PA
and definitely an important one in classification analysis and time series analysis.
Process of Overfitting the models
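The hold-out approach can be illustrated with a small Python sketch that compares a simple
straight line with a deliberately over-flexible model, a polynomial forced through every training
point. Everything here is an assumption made for illustration: the data is made up (y = 2x plus a
fixed noise pattern), every second point is held out, and the two model functions are hypothetical
helpers, not PA components.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares straight line."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_interpolant(xs, ys):
    """Polynomial through every training point (Lagrange form): fits the sample perfectly."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Made-up data: y = 2x plus a fixed noise pattern.
xs = list(range(10))
noise = [1, -1, -1, 1, 1, -1, -1, 1, 1, -1]
ys = [2 * x + e for x, e in zip(xs, noise)]
train_x, train_y = xs[0::2], ys[0::2]   # data used to train the models
hold_x, hold_y = xs[1::2], ys[1::2]     # unseen hold-out data

line, flexible = fit_line(train_x, train_y), fit_interpolant(train_x, train_y)
print("train MSE:", mse(line, train_x, train_y), mse(flexible, train_x, train_y))
print("hold-out MSE:", mse(line, hold_x, hold_y), mse(flexible, hold_x, hold_y))
```

The flexible model has zero error on the training data but a much larger error than the straight
line on the hold-out data, which is the behaviour described for the right-hand graph.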
‘Multicollinearity’ is a problem that comes into the picture when you are trying to fit a
regression model or other linear model. It indicates that predictors are correlated with other
predictors in the model; statisticians define multicollinearity as a strong correlation between
two or more independent variables. Unfortunately, the effects of multicollinearity can feel
uncertain and intangible, which makes it unclear how to fix it even once you have decided that it
should be fixed. Because the predictors are linearly related, it is quite difficult to separate
their individual effects on the dependent variable. Parameter estimates may change significantly
in response to small changes in the model or the data. This means that multicollinearity affects
the calculations regarding individual predictors without reducing the predictive power or
reliability of the model as a whole, at least within the sample data itself. A multiple regression
model with correlated predictors can therefore show how well the whole bundle of predictors
predicts the outcome variable, but it will not always produce valid results about any individual
predictor or about how redundant the predictors are with respect to each other. Multicollinearity
to some extent is normal, but at higher levels it becomes a problem because the variance of the
coefficient estimates increases, which makes the estimates very sensitive to minor changes in the
model. The main sources of multicollinearity are the method used for data collection, constraints
on the population, the model specification, or an over-fitted, over-defined model. Removing
multicollinearity fully is not possible, but it can be reduced by several remedial measures, such
as collecting additional or new data, re-specifying the model, ridge regression, or using a data
reduction technique like principal component analysis.
Examples of Multicollinearity
The figure above shows two variables, X1 and X2, that are highly positively correlated; the
correlation coefficient between them is 0.9771, as computed from the data on the left. Finding a
model that describes the relationship between Y and the independent variables X1 and X2 is now
difficult, because the two are so close that we can barely distinguish their effects. This high
degree of multicollinearity is a problem because the variance of the coefficient estimates
increases, which makes the estimates very sensitive to minor changes in the model. As mitigation,
adding more sampled data gives an advantage but cannot solve the problem completely. Omitting one
of the correlated variables can be another interesting approach if you can decide which variable
to ignore, at the risk of ignoring the real causal variable.
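A correlation coefficient like the one quoted above, and its effect on the variance of the
coefficient estimates, can be sketched in Python as follows. The sample values for x1 and x2 are
made up, since the data behind the figure is not reproduced here. The variance inflation factor
(VIF), which for two predictors equals 1 / (1 - r^2), is a standard diagnostic that is not
discussed in the text; it is added here only as one way to quantify the increase in variance.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(r):
    """Variance inflation factor when a model has exactly two predictors with correlation r."""
    return 1.0 / (1.0 - r ** 2)

# Made-up, strongly related predictors (x2 is roughly 2 * x1).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
r = pearson_r(x1, x2)

# VIF for the correlation of 0.9771 quoted in the text.
vif = vif_two_predictors(0.9771)
print(round(r, 4), round(vif, 2))
```

Under these assumptions a correlation of 0.9771 inflates the variance of each coefficient
estimate by a factor of about 22 compared with uncorrelated predictors.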
Chapter 4: Cluster & Association Analysis Explored
All the groups of algorithms and analysis techniques supported by PA are crucial and have their
own importance. Keeping the time limit in mind, however, I chose to go through only two classes
of analysis techniques in detail instead of covering all of them briefly. It was quite interesting
to go through the code behind these algorithms and to see them implemented so neatly in the SAP
PA interface.
4.1 Association Analysis
As the name suggests, association analysis looks for associations between objects, and it is
also known as affinity analysis. The output of this analysis is generally in the form of rules
like ‘if item A is purchased by a customer, he has a very high probability of purchasing item B
and item C’, ‘75% of those who buy comics on-line also buy music on-line’, ‘60% of those who have
high blood pressure and are overweight have high levels of cholesterol’ etc. Following the
original definition by Agrawal, the problem of association rule mining is defined as follows.
Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm}
be a set of transactions called the database. Each transaction in D has a unique transaction ID
and contains a subset of the items in I. A rule is defined as an implication of the form X→Y where
X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short item sets) X and Y are called the antecedent
(left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule respectively. The
quality of a rule can be measured by the number of cases in which the rule holds divided by the
total number of sales from that store; this is referred to as rule support. The support of an
item set is defined as the percentage of the data set which contains that particular item set.
Rule confidence is a related and important statistical measure. It measures how well the rule
predicts its right hand side, item B in our example, when its left hand side, item A in this
example, is triggered: it is the number of baskets in which both A and B exist divided by the
number of baskets containing A, expressed as a percentage. The confidence of the rule divided by
the support of its result is called lift, and gives the ratio of how often B is bought along with
A to how often B is bought independently. This is a useful calculation, as it gives a better
picture of the association than rule support: B is bought more often with A than on its own when
the lift value is more than one, and not otherwise. We will elaborate on these statistical terms
and their calculations with the help of examples later in this report.
Although the calculations are simple, performance is a challenge because the data under analysis
is generally huge. Interpreting the results is also difficult without deep business domain
knowledge, as the rules can easily be misread as either trivial associations or apparently nonsensical
associations. Association analysis is most often called market basket analysis, after its most
common application of finding rules about products that are sold together in a supermarket. Using
the data gathered from baskets, i.e. lists of products sold together, we can analyze patterns or
strong relations between products in order to recommend product placement in the store, suggest
additional purchases to buyers, or identify unusual combinations for fraud management. Different
objective measures define different association patterns with different properties and
applications. For instance, the purchase of an electronic device that does not include batteries
often implies the purchase of batteries or a charger.
Apriori principle: If an itemset is frequent, then all of its subsets are frequent.
4.1.1 Applications of Association Analysis
Netflix, for example, predicts movies of interest to you from your previous movie ratings compared
with the viewing patterns of other users. Associations generally depend on finding patterns that
can be evaluated through subjective arguments. A pattern is considered uninteresting for data
analysis if it does not reveal unexpected information about the data or provide new knowledge that
can lead to profitable actions. Including subjective knowledge in pattern evaluation requires
considerable effort and knowledge from domain experts and an extensive amount of prior information
from historic data. Pattern evaluation becomes more challenging when partial associations among
items within the pattern are also present. For instance, some associations and relationships appear
and disappear when conditioned on the value of certain items.
Support(X) = (no. of transactions which contain the itemset X) / (total no. of transactions)
Confidence(X→Y) = Support(X ∪ Y) / Support(X)
Lift(X→Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
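The three measures can be illustrated with a minimal Python sketch. The five baskets and the function names are hypothetical and chosen only for illustration; they are not taken from PAL or from the example dataset used in this report.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(baskets, lhs, rhs):
    """Support(X U Y) / Support(X) for the rule lhs -> rhs."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)


def lift(baskets, lhs, rhs):
    """Confidence of lhs -> rhs divided by the support of rhs."""
    return confidence(baskets, lhs, rhs) / support(baskets, rhs)


# Hypothetical baskets, invented for this illustration only.
BASKETS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

if __name__ == "__main__":
    pair = {"bread", "milk"}
    print("support(bread, milk)      =", round(support(BASKETS, pair), 4))
    print("confidence(bread -> milk) =", round(confidence(BASKETS, {"bread"}, {"milk"}), 4))
    print("lift(bread -> milk)       =", round(lift(BASKETS, {"bread"}, {"milk"}), 4))
```

In this toy data the rule bread → milk has a confidence of 0.75 but a lift slightly below one (0.9375), because milk already appears in 80% of all baskets. The sketch also shows that confidence is directional: butter → bread has confidence 1.0, while bread → butter has only 0.5.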
4.1.2 Apriori Association Analysis
Apriori is an influential algorithm for finding associations in market basket or sales transaction
data. It outputs Boolean association rules based on the calculation of three statistical values:
support, confidence and lift. It first identifies the frequent individual items and then extends
them to larger and larger item sets for as long as those item sets appear sufficiently often in the
data. Apriori is designed to handle databases that hold transactional data, such as lists of items
bought by customers or details of visits to a website. The algorithm splits association rule
generation into two separate steps:
1. A minimum support threshold is applied to find all frequent itemsets in the database.
2. These frequent itemsets, combined with a minimum confidence constraint, are used to
output rules.
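The two steps above can be sketched in plain Python as follows. This is an illustrative sketch only, not the PAL implementation; the function names and the demo baskets are assumptions made for the example.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Step 1: return {itemset: support} for every frequent itemset."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset (the Apriori principle).
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def rules(frequent, min_confidence):
    """Step 2: return (lhs, rhs, support, confidence, lift) tuples."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    rhs = itemset - lhs
                    out.append((lhs, rhs, supp, conf, conf / frequent[rhs]))
    return out


# Hypothetical baskets, invented for this illustration only.
DEMO_BASKETS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

if __name__ == "__main__":
    found = apriori(DEMO_BASKETS, min_support=0.4)
    for lhs, rhs, supp, conf, lift_value in rules(found, min_confidence=0.7):
        print(sorted(lhs), "=>", sorted(rhs),
              round(supp, 2), round(conf, 2), round(lift_value, 2))
```

The lookup `frequent[lhs]` in the second step is safe because every subset of a frequent itemset is itself frequent and was therefore stored at an earlier level.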
Let us take the example dataset shown above to illustrate these three terms. Support is the ratio
of the number of baskets that support the rule, i.e. in which the combination exists, to the total
number of baskets, expressed as a percentage. Note that support is bidirectional: 'if 10 then 20'
has the same support as 'if 20 then 10'. Confidence is the number of baskets in which both items 1
and 2 exist divided by the number of baskets that contain item 1, expressed as a percentage. Unlike
support, confidence is not bidirectional. Both support and confidence give an idea of a rule's
validity, but cases exist in which both values are high and the resulting rule is still of no use.
This shortcoming of the two measures brings into the picture one more measure of the strength of an
association, called lift or improvement. It is the ratio of how often item 2 is bought when item 1
is bought to how often item 2 is bought independently. A value less than one indicates that item 2
is more often bought on its own, while a value greater than one indicates that item 2 is more often
bought together with item 1. The calculations are shown in the figure. In summary, support
identifies the most popular rules, confidence identifies the most useful rules, and lift identifies
the rules that are both useful and popular. Now that we understand the terms used to measure and
compare associations in a dataset, we turn to how Apriori association analysis is implemented.
4.1.3 Apriori Association Analysis in PAL
In PAL, the function name for Apriori association analysis is APRIORIRULE and the algorithm name
is Apriori. The input always consists of two variables: the first is the transaction ID
representing the basket ID, and the second is the item ID signifying the product name. The data
type of both variables can be Integer, varchar or char. The output comprises two tables. The first
contains the association rules, with the leading items (also called the pre-rule or left-hand side)
in the first column and the dependent items (also called the post-rule) in the second column,
together with the support, confidence and lift values. Some applications require the pre-rule and
post-rule columns to be combined into a single column that shows the complete rule alongside its
support, confidence and lift values. The second table holds the PMML definition of the Apriori
model, i.e. the calculated rules and their measures. The following figure shows the definition of
the parameter table for the Apriori algorithm.
Parameter Table Definition for Apriori
The following text shows the main components of the SQLScript for Apriori. The full code can be
accessed from the file SAP_HANA_PAL_Apriori_Example_SQLScript on the SAP PRESS website.
-- The procedure generator
CALL "SYSTEM".afl_wrapper_generator('PAL_APRIORI_RULE', 'AFLPAL', 'APRIORIRULE', PDATA);
-- The control table parameters
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_SUPPORT', null, 0.01, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_CONFIDENCE', null, 0.01, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('PMML_EXPORT', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
-- Assume the data has been stored in the table PAL_TRANS_TAB
-- Calling the procedure
CALL PAL_APRIORI_RULE(PAL_TRANS_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB, PAL_PMMLMODEL_TAB) WITH OVERVIEW;
SELECT * FROM PAL_RESULT_TAB;
SELECT * FROM PAL_PMMLMODEL_TAB;
-- Merging pre-rule and post-rule into a single rule column
DROP VIEW TMP_RESULT_V;
CREATE VIEW TMP_RESULT_V AS
  SELECT CONCAT(PRERULE, ' => ') AS PRERULE, POSTRULE, SUPPORT, CONFIDENCE, LIFT
  FROM PAL_RESULT_TAB;
DROP VIEW RESULT_V;
CREATE VIEW RESULT_V AS
  SELECT CONCAT(PRERULE, POSTRULE) AS RULES, SUPPORT, CONFIDENCE, LIFT
  FROM TMP_RESULT_V;
SELECT * FROM RESULT_V;
As discussed, we get two output tables: one containing the rules with their measures and one
holding the associations as a PMML document. The PMML output can be used to transfer the model
rules to a business application such as a recommendation engine.
4.1.4 Strength & Weakness with Apriori Lite
Apriori Lite can be considered an alternative to Apriori, but it is actually a specific instance
of the Apriori algorithm in that it looks only for rules with a single item in both the pre-rule
and the post-rule. This makes it faster and more efficient, but it can be applied only when we seek
one-to-one rules rather than all associations. Another advantage of this algorithm is the
possibility of sampling the data. LITEAPRIORIRULE is the function name used to call it in PAL, and
the input to this algorithm is exactly the same as for full Apriori. In the parameter table of
Apriori Lite,
MAXITEMLENGTH does not exist, but there are two extra parameters, namely OPTIMIZATION_TYPE
and IS_RECALCULATE.
Association analysis is popular because it produces clear results. The calculations are usually
so straightforward that someone in a management position without detailed technical know-how can
understand them easily, which supports faster and better decision making. One of the biggest
drawbacks of Apriori is its computational cost: the number of computations required grows
exponentially as the data grows. Apriori Lite is therefore an alternative when only one-to-one
rules have to be found. Sometimes the results are of no value or are misleading. Despite these
weaknesses, in most cases PA cannot be used to its fullest without association analysis and its
related algorithms. There may, however, be problem sets that are best solved with regression
analysis and never need Apriori.
4.2 Cluster Analysis
This section covers the concepts of cluster analysis and how it is implemented in PAL, R and SAP
PA. Cluster analysis is also referred to as segmentation analysis and is a very popular application
of SAP PA. We start with the simplest of all the algorithms, ABC classification, which groups the
records in a data set on a specific parameter into the top X %, then the next Y %, and the
remaining Z %, adding up to 100 %. We will also discuss K-Means cluster analysis, a popular and
efficient statistical algorithm, and compare it with machine-learning clustering such as
self-organizing maps. Grouping and classification date back to early humans, who needed to
distinguish edible from poisonous food and tame from wild animals, and we still do it in daily
life, for example when grouping university students by country of origin, previous grades, age or
gender. We understand data better, even when the data set is large, if we break it down into
groups. The attribute values that describe the objects are used to assess the dissimilarities
among clusters.
4.2.1 Introduction & Applications of Cluster Analysis
Cluster analysis organizes and segments data into groups with similar characteristics and
features, so that data within a group closely matches other data in the same group and differs in
some respect from data in other groups. In other words, objects within one cluster are close
together and compact, while the distance between objects in different clusters is larger because
they are dissimilar. Cluster analysis, or clustering, can thus be defined as the task of grouping a
set of objects in such a way that objects in the same group (called a cluster) are more similar (in
some sense or another) to each other than to those in other groups (clusters). We may need
clustering for many reasons, for example identifying
people with similar shopping patterns to find better marketing strategies, or grouping movies into
similar categories based on viewer ratings. This makes cluster analysis intellectually satisfying,
profitable, and sometimes both. Cluster analysis is a concept rather than a particular statistical
method or model such as factor analysis or regression. There are many ways to cluster data into
groups, and the choice depends on various factors and requirements. Cluster analysis encompasses a
variety of algorithms and methods for classifying objects of a similar kind into categories. As its
output, cluster analysis finds structures in the data without explaining why they exist.
Cluster analysis can be seen as the most widely used class of predictive analysis methods, with
diverse applications including criminal pattern analysis, medical research, social services,
psychiatry, education, archaeology, astronomy and taxonomy, which makes it ubiquitous and
significant for data analysis. Market segmentation is one of the most discussed applications; it
supports better decisions by allowing different business plans and promotional offers for different
groups of buyers. A good example of the importance of clustering is clothing sizes: the number of
different sizes that stores need to stock has fallen sharply because, after analyzing many body
measurements, a generalized system was devised in which individuals are allocated to specific
sizes, i.e. clusters.
4.2.2 ABC Analysis in PAL
ABC Analysis, K-Means and self-organizing maps are the three cluster analysis algorithms
supported by PAL. ABC Analysis produces three user-defined clusters, while K-Means creates K
clusters based on data memberships. Self-organizing maps use a map, usually an M×N matrix, and map
each data item to coordinates on the map; clusters form when multiple records are mapped to the
same coordinates.
ABC analysis clusters data according to what each data item contributes to the total, for example
to find the top X % of items based on characteristic A or the top Y % of items based on
characteristic B. The abbreviation is shared with the Artificial Bee Colony algorithm, first
proposed by Karaboga, which is a swarm-based optimization method and is described below. ABC
classification lets an organization segregate units into three groups: A, the most important; B,
important; and C, the least important. The intention behind classifying items into groups is to
gain better control and understanding of each item based on its group. Its ease of use and
simplicity make it popular. The data is generally first sorted in descending numeric order and then
grouped into the first A %, the next B % and the remaining C %, adding up to one hundred percent.
One weakness of this algorithm is that it does not support more than three groups.
The Artificial Bee Colony (ABC) algorithm treats the search space representing the data set as if
it were a foraging environment. Each point in this search space corresponds to a food source, which
represents a candidate solution to be exploited by the artificial bees. This can be useful, but in
practice the ABC analysis basically just sorts the data on a continuous variable, answering, for
example, whether a customer is among the top 20 %, 50 % or 80 % of spenders. The fitness of a
solution is represented by the nectar amount of its food source. According to this algorithm, there
are three kinds of bees: employed bees, onlooker bees and scout bees. Employed bees first exploit
specific food sources and then pass the quality information about those food sources on to the
onlooker bees. The onlooker bees receive this information and choose a food source to exploit based
on its nectar quality. The more nectar a food source contains, the larger the probability that the
onlooker bees will choose it [23] [24]. “Limit” is the control parameter that determines when the
food source of an employed bee should be abandoned. Scout bees are responsible for finding new food
sources by searching the whole environment. The ABC algorithm can be described by the following
steps:
1. Initialization phase: after the control parameters have been set, each food source Xi,j in the
environment is initialized by the scout bees. The number of food sources equals half of the
colony size, and the dimension D is the number of parameters to be optimized.
2. Employed bees phase: employed bees search the neighborhood of the food sources in their
memory for food sources with more nectar, i.e. a higher fitness value. Once a neighboring
food source is found, the employed bees calculate its
fitness. Greedy selection is applied between the new food source and the original one, and
the better of the two is kept in memory. If the food source is improved, its trials counter
is reset to zero; otherwise the counter is incremented by one.
3. Onlooker bees phase: the onlooker bees, which have been waiting in the hive, receive the
food source information from the employed bees and choose their food sources
probabilistically based on the fitness values. Because an onlooker bee chooses a food source
according to its probability value, several onlooker bees may choose the same food source if
it has a high fitness. Once the food sources have been selected, each onlooker bee searches
for a new food source in the neighborhood and computes its fitness in the same way as the
employed bees did in their phase. This means that more onlooker bees are used to explore
richer food sources.
4. Scout bees phase: in this last phase the trials counter of each food source is checked. If
the value exceeds the limit parameter, the food source is abandoned and its bee becomes a
scout bee: a new food source is produced randomly in the search space for this scout bee and
its trials counter is reset to zero. The employed, onlooker and scout bees phases are
repeated until a termination criterion is met, and the best food source found is taken as
the final solution.
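The four phases above can be sketched in Python as follows. This is a minimal illustration of the bee colony search on a toy objective, not code from PAL; the function names, the default parameter values and the fitness formula 1 / (1 + cost) are assumptions made for this example.

```python
import random


def sphere(x):
    """Toy objective to minimize; its optimum is 0 at the origin."""
    return sum(v * v for v in x)


def bee_colony_minimize(cost, dim, lower, upper,
                        n_sources=10, limit=20, cycles=200, seed=0):
    rng = random.Random(seed)

    def random_source():
        return [rng.uniform(lower, upper) for _ in range(dim)]

    # Initialization phase: scouts place the food sources at random.
    sources = [random_source() for _ in range(n_sources)]
    costs = [cost(s) for s in sources]
    trials = [0] * n_sources
    best_cost = min(costs)
    best = list(sources[costs.index(best_cost)])

    def explore(i):
        """Try a neighbour of source i and keep it only if it is better."""
        j = rng.randrange(dim)
        k = rng.choice([s for s in range(n_sources) if s != i])
        candidate = list(sources[i])
        candidate[j] += rng.uniform(-1.0, 1.0) * (sources[i][j] - sources[k][j])
        candidate[j] = min(max(candidate[j], lower), upper)
        candidate_cost = cost(candidate)
        if candidate_cost < costs[i]:          # greedy selection
            sources[i], costs[i], trials[i] = candidate, candidate_cost, 0
        else:
            trials[i] += 1

    for _ in range(cycles):
        for i in range(n_sources):             # employed bees phase
            explore(i)
        # Onlooker bees phase: richer sources are chosen more often.
        fitness = [1.0 / (1.0 + c) for c in costs]   # assumes cost >= 0
        for i in rng.choices(range(n_sources), weights=fitness, k=n_sources):
            explore(i)
        # Remember the best source seen so far.
        cycle_best = min(costs)
        if cycle_best < best_cost:
            best_cost = cycle_best
            best = list(sources[costs.index(cycle_best)])
        # Scout bees phase: abandon sources that stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = random_source()
                costs[i] = cost(sources[i])
                trials[i] = 0
    return best, best_cost


if __name__ == "__main__":
    solution, value = bee_colony_minimize(sphere, dim=2, lower=-5.0, upper=5.0)
    print("best point:", solution, "cost:", value)
```

The colony here consists of `n_sources` employed bees and the same number of onlooker bees, so the number of food sources is half the colony size, as stated in the initialization step.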
An example is shown in the figure below, where the values of A, B and C are 20 %, 30 % and 50 %
respectively. Segment A, 20 % of the total, is accounted for by 5 items out of 70, or 7.1 % of all
items. Segment B, 30 % of the total, is accounted for by 9 items, or 12.9 % of all items, whereas
the last segment, 50 % of the total, is accounted for by 56 items, or 80 % of the item
population.
An Example of ABC Analysis
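The grouping in the example above can be sketched in Python as follows. This is an illustrative sketch and not the PAL implementation: the function name, the sample values and the rule for items that fall exactly on a class boundary are assumptions.

```python
def abc_classify(values, percent_a, percent_b, percent_c):
    """Label items A, B or C by their cumulative share of the total.

    `values` maps item names to non-negative numbers; the three
    percentages are fractions that must add up to 1.0.
    """
    if abs(percent_a + percent_b + percent_c - 1.0) > 1e-9:
        raise ValueError("PERCENT_A + PERCENT_B + PERCENT_C must equal 1.0")
    total = sum(values.values())
    if total <= 0:
        raise ValueError("the total of all values must be positive")

    # Sort in descending order so the largest contributors come first.
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    labels = {}
    running = 0.0
    for name, value in ranked:
        running += value
        share = running / total
        # Assumption: an item that reaches the boundary exactly still
        # belongs to the earlier class (small tolerance for rounding).
        if share <= percent_a + 1e-12:
            labels[name] = "A"
        elif share <= percent_a + percent_b + 1e-12:
            labels[name] = "B"
        else:
            labels[name] = "C"
    return labels


if __name__ == "__main__":
    # Hypothetical sales figures, invented for this illustration only.
    sales = {"P1": 50, "P2": 20, "P3": 15, "P4": 10, "P5": 5}
    print(abc_classify(sales, 0.50, 0.35, 0.15))
```

With these sample figures, a single item already accounts for half of the total and forms class A on its own, which mirrors the observation above that a small share of the items can account for a large share of the value.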
In PAL, the algorithm name is ABC Analysis and the corresponding function name is ABC. The input
table has two columns: the first contains the item or record names, and the second contains the
numeric values to be used for the analysis. The item name has data type char or varchar, while the
corresponding value is always of type Double. The parameter table contains four parameters:
PERCENT_A (Double), the interval for class A; PERCENT_B (Double), the interval for class B;
PERCENT_C (Double), the interval for class C; and THREAD_NUMBER (Integer), the number of threads.
The values of A, B and C must always add up to 100 %, which PAL checks before running the
algorithm. The output table again has two columns, one holding the assigned class (A, B or C) and
the other holding the item name, as shown in the figures below.
ABC Analysis Input & Output tables
The main elements of the SQLScript are as follows, with the control parameters set as A=35%,
B=20% and C=45%. The full code is available in the file SAP_HANA_PAL_ABC_Example_SQLScript
on the SAP PRESS website.
-- The procedure generator
CALL SYSTEM.afl_wrapper_generator('PAL_ABC', 'AFLPAL', 'ABC', PDATA);
-- The control table parameters
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_A', null, 0.35, null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_B', null, 0.20, null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_C', null, 0.45, null);
INSERT INTO #CONTROL_TBL VALUES ('THREAD_NUMBER', 1, null, null);
-- Assume the data has been stored in table TESTABCTAB
-- Calling the procedure
CALL PAL_ABC(TESTABCTAB, "#CONTROL_TBL", RESULT_TBL) WITH OVERVIEW;
SELECT * FROM RESULT_TBL;
4.2.3 K-Means Cluster Analysis in PAL
The K-Means algorithm is one of the best-known predictive analysis algorithms and is widely used
for cluster analysis, as it efficiently clusters the records or observations into K clusters such
that each record belongs to the cluster with the nearest mean. The algorithm works on continuous
data and can be applied in many kinds of domains. Because K-Means needs initial partitions to start
from, the best results can be expected when the initial partitions are close to the final solution.
The algorithm tries to identify relatively homogeneous groups of values in the given dataset based
on the chosen parameters and the specified number of clusters, represented by K. K-Means
initializes itself with random cluster centroids as a starting point and, as it progresses, keeps
reassigning the data objects to cluster centroids depending on the closeness between the centroids
and the data objects. The reassignment procedure terminates as soon as any convergence criterion is
met, such as a maximum number of iterations or the cluster results remaining unchanged after a
certain number of loops. Performance depends significantly on the random selection of the initial
centroids.
The K-Means clustering process can be described by the following four steps:
1. Randomly pick K centroids to give an initial partition of the dataset. This is often the
hardest step, as it raises the questions of how to choose the value of K and how to
calculate the centers and distances.
2. Assign each value in the dataset under analysis to the closest cluster centroid. What
counts as nearest can vary, as there are several inter-object distance measures, and the
choice can affect the assignment.
3. Recalculate the centroid of each of the K clusters to obtain the new mean.
4. Repeat the above two steps until the exit criterion is met.
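The four steps can be sketched in Python as follows. This is a minimal illustration that uses Euclidean distance and random seeding; it is not the PAL implementation, and the function name and sample points are assumptions.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centers) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick K records at random as the initial centroids.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 4 (exit criterion): stop when the assignment is unchanged.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


if __name__ == "__main__":
    # Hypothetical points forming two well-separated groups.
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    assignment, centroids = kmeans(data, k=2)
    print(assignment)
    print(centroids)
```

Running the sketch with different values of `seed` is a simple way to see how the choice of the initial centroids can influence the result on less clearly separated data.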
In PAL, the associated function name is KMEANS and the algorithm name is K-Means. The input table
is not fixed in structure, but there is always a first column for the record ID, followed by
columns containing the variables for analysis, which are always numeric. This is because clusters
are calculated from inter-object distance measures, which do not make sense for non-numeric data
types.
The table below gives the definition of the parameter table for the K-Means algorithm in PAL. The
value of K is simply an integer giving the number of clusters or segments we wish to derive.
Manhattan distance, also called city block distance, measures the distance between two points
along horizontal and vertical moves on a grid. Euclidean distance is the unique shortest path and
is used
as the most common approach to calculating the distance between any two points for clustering.
Minkowski distance generalizes both the Euclidean and the Manhattan distance. The maximum number of
iterations can be used as an exit criterion, which gives some control over the processing time and
complexity of the algorithm and prevents the process from running for very long. For initialization
or seeding, the first K records method takes the first K records as the initial cluster centers,
which gives poor results if the data is sorted. The random with replacement method randomly selects
K records from the data set as the initial centers. SAP's patented method for finding the K initial
centers is based on a max-min approach, in which the first center is chosen very close to the
minimum point and the subsequent centers are then chosen. The threshold value indicates when the
iterative process should end; its default value is 0.00001.
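The relationship between the three distance measures can be shown with a short Python sketch. The function names are chosen for illustration only; the exponent `r` is the Minkowski order, with r = 1 giving the Manhattan distance and r = 2 the Euclidean distance.

```python
def minkowski(p, q, r):
    """Minkowski distance of order r between two equal-length points."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


def manhattan(p, q):
    """City block distance: the Minkowski distance with r = 1."""
    return minkowski(p, q, 1)


def euclidean(p, q):
    """Straight-line distance: the Minkowski distance with r = 2."""
    return minkowski(p, q, 2)


if __name__ == "__main__":
    a, b = (0, 0), (3, 4)
    print("Manhattan:", manhattan(a, b))
    print("Euclidean:", euclidean(a, b))
```

For the points (0, 0) and (3, 4) the Manhattan distance is 7 (three steps across plus four steps up), while the Euclidean distance is 5, the length of the straight line between them.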
Name | Data Type | Description
GROUP_NUMBER | Integer | The value of K, the number of clusters.
DISTANCE_LEVEL | Integer | Computes the distance between the item and cluster center; can be Manhattan distance, Euclidean distance or Minkowski distance.
MAX_ITERATION | Integer | The maximum number of iterations.
INIT_TYPE | Integer | Center initialization method, with four options: First K, Random with replacement, Random without replacement, or SAP's patented method for selecting the initial centers.
NORMALIZATION | Integer | Normalization method, with three options: No, Yes for each point, or Yes for each column.
EXIT_THRESHOLD | Double | The threshold (actual value) for exiting the iterations.
THREAD_NUMBER | Integer | The number of threads.
Parameter Table Definition for K-Means
To prevent very large numbers from dominating very small ones, it is good practice to normalize
or standardize the data, which ensures that all variables carry equal weight in subsequent
calculations. The output of K-Means takes the form of two tables. The first table holds the results
of the analysis, mapping each record or item in the data set to an assigned cluster
number, along with the distance from that item to the center of its cluster. The coordinates of
each cluster center are listed in the second table as the Center Points rows. The distances between
the items and their cluster centers give a way to measure the compactness of the clusters and to
identify unusual values or outliers.
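The effect of normalizing before clustering can be illustrated with a short Python sketch. The exact formulas behind the NORMALIZATION options of PAL are not given here, so the column-wise min-max scaling below is an assumption used only to show the idea; the function name and sample rows are hypothetical.

```python
def min_max_by_column(rows):
    """Rescale every column of a numeric table to the range 0..1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # A constant column has span 0; use 1.0 to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        [(value - low) / span for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical records: a small-valued and a large-valued variable.
    table = [[1, 1000], [2, 3000], [3, 2000]]
    for row in min_max_by_column(table):
        print(row)
```

Without such a step, the second column in this sample would dominate every distance calculation simply because its values are a thousand times larger.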
The main elements of the SQLScript to call the PAL K-Means algorithm, with defined parameter
settings, are as follows, while the full code is available in the file
SAP_HANA_PAL_KMEANS_Example_SQLScript on the SAP PRESS website.
-- The procedure generator
CALL SYSTEM.afl_wrapper_generator('PAL_KMEANS', 'AFLPAL', 'KMEANS', PDATA);
-- The control table parameters
INSERT INTO PAL_CONTROL_TAB VALUES ('GROUP_NUMBER', 4, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('INIT_TYPE', 4, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('DISTANCE_LEVEL', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MAX_ITERATION', 100, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('EXIT_THRESHOLD', null, 0.000001, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('NORMALIZATION', 0, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
-- Assume the data has been stored in table PAL_KMEANS_DATA_TAB
-- Calling the procedure
CALL _SYS_AFL.PAL_KMEANS(PAL_KMEANS_DATA_TAB, PAL_CONTROL_TAB, PAL_KMEANS_RESASSIGN_TAB, PAL_KMEANS_CENTERS_TAB) WITH OVERVIEW;
SELECT * FROM PAL_KMEANS_CENTERS_TAB;
SELECT * FROM PAL_KMEANS_RESASSIGN_TAB;
The results of running the SQLScript on the input data are shown above. They show the assignment
of each record to a specific cluster and the coordinates of each cluster center.
4.2.4 Silhouette
The cluster viewer window in SAP PA displays four charts. The first is a horizontal bar chart
showing the size of each cluster. The second is a cluster density and distance chart with a
color-coded scale from dark to light for dense to sparse clusters; the thicker the line, the closer
the clusters. The other two charts at the bottom allow the user to compare clusters by a
user-chosen variable for more customized understanding and analysis.
How to choose the value of K is a key question, as it affects both the performance and the output
of the analysis. In some business cases we can choose a value based on the requirement and
application, such as T-shirt size groups for sale in stores, but in many cases the value of K is
difficult to set before the analysis. One commonly used approach is to take the square root of half
of N, where N is the total number of records in the dataset. This becomes problematic when the data
set is very large: we would need more than 700 clusters for a million records with this approach,
and that many clusters cannot be considered easy to manage. In that case data visualization is a
big aid, as the data can be plotted and inspected in a bubble plot. Cluster analysis belongs to
undirected data mining, as there is no target or dependent variable to be predicted, so another,
quantitative approach to determining the value of K is to measure cluster quality with the
silhouette. The silhouette is a practical way to measure the quality of a cluster analysis, whose
output would otherwise be hard to assess or compare. Good clusters can be thought of as groups
whose members are close to each other and far from the members of other clusters.
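The rule of thumb mentioned above, the square root of half the number of records, is easy to check with a few lines of Python. The function name is hypothetical, and rounding to the nearest integer is an assumption of this sketch.

```python
import math


def rule_of_thumb_k(n_records):
    """Suggest K as the square root of N / 2, rounded, and at least 1."""
    return max(1, round(math.sqrt(n_records / 2)))


if __name__ == "__main__":
    for n in (200, 10_000, 1_000_000):
        print(n, "records ->", rule_of_thumb_k(n), "clusters")
```

For one million records the suggestion is 707 clusters, which is the figure of more than 700 referred to above.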
The silhouette method calculates the average, over all the records in the dataset, of
(b − a) / max(a, b), where a is the average distance of the record to all other records within the
same cluster, i.e. cohesion, while b is the average distance of the record to all the records in
the nearest cluster that it does not belong to, i.e. separation. A value of 1 indicates that all
records lie directly on their cluster centers, while a value of -1 indicates that all records lie
on the centers of some other cluster. When records are equidistant from their own cluster center
and from the center of the nearest other cluster, the silhouette coefficient is 0. These are of
course ideal values, but they still support a general guide: a value below 0.2 means very poor
clustering, while a value of 0.5 or above is considered good and meaningful.
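The silhouette calculation can be sketched in Python as follows. This is an illustration of the formula (b − a) / max(a, b) and not the VALIDATEKMEANS implementation; the function name and sample points are assumptions. In particular, a record that is alone in its cluster is given a = 0 here, whereas other conventions score such singleton clusters as 0.

```python
import math


def silhouette(points, labels):
    """Average silhouette coefficient of a clustering."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        # Cohesion a: mean distance to the other members of the own cluster.
        # Assumption: a = 0 when the record is the only member.
        if len(own) > 1:
            a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        else:
            a = 0.0
        # Separation b: mean distance to the members of the nearest
        # cluster that the record does not belong to.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical points forming two well-separated groups.
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print("good labels :", round(silhouette(data, [0, 0, 0, 1, 1, 1]), 3))
    print("mixed labels:", round(silhouette(data, [0, 1, 0, 1, 0, 1]), 3))
```

Comparing the value for several candidate values of K, computed on the same data, is one way to use the silhouette when choosing K.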
In PAL, it is called with the function name VALIDATEKMEANS and takes two input tables. The first
table holds the data used for the cluster analysis, while the second assigns a cluster number to
each record. For table 1, the first column, which holds the record ID, can have either Integer or
String as its data type, while all other columns in table 1, which hold the attribute data, store
either Integer or Double values. For table 2, the first column again holds the record ID, here as
Integer values only, while the second column holds the assigned cluster number, also as an Integer.
The parameter table and output table for Validate K-Means in PAL are defined below. As the value of
K approaches the number of records (N), the measured silhouette value increases, and when K = N it
finally becomes equal to 1.
Name | Data Type | Description
VARIABLE_NUM | Integer | The number of variables
THREAD_NUMBER | Integer | The number of threads

Name | Data Type | Description
Result 1 | Varchar or char | Name
Result 2 | Double | The Silhouette value
Parameter Table Definition for Validate K-Means
The SQLScript to call Validate K-Means in PAL is as follows, while the full code is available in
the file SAP_HANA_PAL_VALIDATEKMEANS_Example_SQLScript on the SAP PRESS website.
-- The procedure generator
CALL SYSTEM.afl_wrapper_generator('palValidateKMeans', 'AFLPAL', 'VALIDATEKMEANS', PDATA);
-- The control table parameters
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM', 2, null, null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER', 1, null, null);
-- Calling the procedure
CALL palValidateKMeans(PAL_KMEANS_DATA_TAB, V_KMEANS_TYPE_ASSIGN, "#CONTROL_TAB", KMEANS_SVALUE_TAB) WITH OVERVIEW;
SELECT * FROM KMEANS_SVALUE_TAB;
How to choose the initial cluster centers for the K-Means algorithm is the second key question,
and it is answered by the seeding strategy applied. Generally, several different seeding strategies
are applied and the solutions are compared for robustness, to see whether the solution changes or
remains constant as the initial values change. If the solution changes frequently, the clusters
need to be examined again to find the reason for the differences. The discussion so far has assumed
that the data set to be clustered
79
Chapter 4: Cluster & Association Analysis Explored
was all numeric but imagine a scenario when we have to cluster categorical data. Does PA allow
us to handle it? For sure, categorical data cannot be clustered based on inter-object distances as
we cannot say the difference between sun and moon. An approach can be to convert each
category into a new variable in binary format. Once we have them we can rescale them by
multiplying by SQRT (0.5) to reduce influence when we have binary variable mapped with
categorical data items. It is also a good practice to merge categories when there are many
categorical variables and within each variable more sub categories exist or alternatively we can
consider other non-distance based clustering algorithms. Decision tress to associate analysis
close to clustering is a win situation.
Decision Tree Analysis of Clusters
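The binary conversion and the SQRT(0.5) rescaling described above can be sketched as follows. This is an illustrative example only: the helper `one_hot_scaled` and the category list are hypothetical and not part of PA or PAL. It shows one way to read the scaling factor: two records with different categories differ in two binary columns, so without rescaling their Euclidean distance would be SQRT(2), and with the factor it becomes 1, the same as a single 0/1 variable.

```python
import math


def one_hot_scaled(value, categories, scale=math.sqrt(0.5)):
    """Encode one categorical value as scaled binary columns."""
    return [scale if value == category else 0.0 for category in categories]


# Hypothetical category list for illustration.
categories = ["sun", "moon", "star"]
sun = one_hot_scaled("sun", categories)
moon = one_hot_scaled("moon", categories)

# Distance between two different categories after rescaling.
distance = math.dist(sun, moon)
# Distance without rescaling, for comparison.
unscaled = math.dist(one_hot_scaled("sun", categories, 1.0),
                     one_hot_scaled("moon", categories, 1.0))
```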
4.2.5 Self-Organizing Maps
Self-Organizing Maps (SOMs), also called Kohonen SOMs after their inventor, Professor Teuvo Kohonen of the Academy of Finland, are a type of neural network that can be used to cluster a dataset into distinct groups. A map of one or two dimensions, in the form of a vector or matrix, is generally used to represent multi-dimensional data in a much lower-dimensional space. Once the network is trained, records that are similar appear close together on the output map, while records that are different appear far apart. The number of records or observations captured by each cell or unit in the map shows which units are more populated; these units indicate groupings or segments of the records and give a sense of the appropriate number of clusters in the dataset. The value of ‘K’ is therefore not predetermined as it is in K-Means cluster analysis. SOMs are based on unsupervised learning, which means that no human intervention is needed during learning and that little needs to be known about the characteristics of the input data.
The figure below shows a network created from a 2D lattice of “nodes” (the map), which is fully connected to the input layer. It shows a small SOM network of 3 * 3 nodes connected to the input layer of a two-dimensional vector, i.e. a two-variable dataset. Each node is assigned a specific topological position, an x, y coordinate in the lattice or map, and also contains a vector of weights of the same dimension as the input vectors. In our example the input dataset has 2 variables, which means that each node has a corresponding weight vector W of 2 dimensions: W1, W2. The lines connecting the nodes only represent adjacency; they do not signify a connection.
3*3 SOM connected to a 2-variable Dataset
The following steps, repeated over many iterations, represent the training of the SOM:
1. Each node in the map has its weights initialized with random values; PAL sets them between -0.05 and 0.05.
2. A vector is chosen from the set of training data, generally starting with the first one, and is presented to the map.
3. Every node in the map is examined to find the one whose weights are most similar, or closest, to the input vector, using a distance measure such as the Euclidean distance. This “winning” node is called the Best Matching Unit (BMU).
4. The radius of the neighborhood of the BMU is then calculated. This value starts large, typically set to the radius of the lattice, but decreases with each iteration. All nodes found within this radius are deemed to be inside the BMU’s neighborhood.
5. The weights of each neighboring node found in step 4 are adjusted to make them more similar to the input vector. The closer a node is to the BMU, the more its weights are altered.
6. Steps 2 to 5 are repeated for the next vector in the data set, and so on for N iterations or until the weights stop changing.
The figure below shows an example of the size of a typical neighborhood near the start of training. The area of the neighborhood shrinks over time because the radius of the neighborhood shrinks according to a decay function, a unique characteristic of the Kohonen SOM.
Decreasing Neighborhood Size during SOM Iterations
Eventually the neighborhood shrinks to the size of just one node, the BMU. The goal is to discover some underlying structure in the data. SOMs thus have two phases: a learning phase, in which the map is built and the network organizes itself through a competitive process using the training set, and a prediction phase, in which new vectors are quickly given a location on the converged map, which makes it easy to classify or categorize the new data.
When a node is found to be within the neighborhood, its weight vector is adjusted as follows; otherwise it is left alone.
$$W(t+1) = W(t) + \lambda(t)\,\bigl(V(t) - W(t)\bigr)$$
where t represents the iteration and λ(t), a small variable called the learning rate, decreases with each iteration. In other words, the new adjusted weight of the node is equal to the old weight W plus a fraction λ of the difference between the input vector V and the old weight.
$$\lambda(t) = \lambda_0 \exp(-t / \tau)$$
λ0 denotes the learning rate at iteration t = 0 and τ denotes a time constant. λ0 is set to 0.5 by default in PAL.
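The decay of the learning rate and the weight adjustment can be traced with a few lines of Python. This is a minimal sketch under stated assumptions: the function names `learning_rate` and `update_weight` and the time constant of 100 iterations are chosen for this example only, while the starting rate of 0.5 is the PAL default mentioned above.

```python
import math


def learning_rate(t, rate0=0.5, time_constant=100.0):
    """Exponentially decaying learning rate at iteration t."""
    return rate0 * math.exp(-t / time_constant)


def update_weight(weight, vector, rate):
    """Move a node's weight vector a fraction `rate` towards the input vector."""
    return [w + rate * (v - w) for w, v in zip(weight, vector)]


# At t = 0 the node moves halfway towards the input vector.
first = update_weight([0.0, 0.0], [1.0, 2.0], learning_rate(0))
# Much later, the same input barely moves the node.
late_rate = learning_rate(500)
```

At iteration 0 the weight vector [0, 0] moves to [0.5, 1.0] for the input [1, 2]; by iteration 500 the rate has dropped below 0.01, so later inputs change the weights only slightly.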
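In addition to the decay of the learning rate over time, the effect of learning should also depend on the distance of a node from the BMU: the learning process should have barely any effect at the edges of the BMU’s neighborhood. The amount of learning fades with distance in the same way as a Gaussian decay, shown below:
$$\Theta(t) = \exp\left(-\frac{\mathrm{dist}^2}{2\sigma^2(t)}\right)$$
where dist is the distance of the node from the BMU and σ(t) is the radius of the neighborhood at iteration t. This factor scales the learning rate applied to each node in the weight adjustment.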
With each iteration, records from the training data set are allocated to cells on the map, and closely related records are grouped together, as shown in the figure below. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space.
Assignment of the Data Set Records to the Map, Showing the Clusters
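The training procedure described in this section can be put together in a short, pure-Python sketch. It is not the PAL implementation: the function names, the starting radius of half the map width, the shared time constant for radius and learning rate, and the cyclic presentation of records are all assumptions made to keep the example small.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Return the map position whose weight vector is closest to `vector`."""
    return min(weights, key=lambda node: math.dist(weights[node], vector))


def train_som(data, n=3, iterations=200, rate0=0.5, seed=0):
    """Train an n x n map; returns {(x, y): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: random initial weights between -0.05 and 0.05.
    weights = {(x, y): [rng.uniform(-0.05, 0.05) for _ in range(dim)]
               for x in range(n) for y in range(n)}
    radius0 = max(n / 2.0, 1.0)  # assumed starting radius
    time_constant = iterations / math.log(radius0) if radius0 > 1 else iterations
    for t in range(iterations):
        vector = data[t % len(data)]               # Step 2: next record
        bmu = best_matching_unit(weights, vector)  # Step 3: winning node
        decay = math.exp(-t / time_constant)
        radius = radius0 * decay                   # Step 4: shrinking radius
        rate = rate0 * decay                       # decaying learning rate
        for node, weight in list(weights.items()):
            dist = math.dist(node, bmu)
            if dist <= radius:                     # Step 5: adjust neighbors
                influence = math.exp(-dist * dist / (2 * radius * radius))
                weights[node] = [w + influence * rate * (v - w)
                                 for w, v in zip(weight, vector)]
    return weights


# Hypothetical two-variable records forming two loose groups.
data = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9)]
weights = train_som(data, n=3, iterations=200)
cells = [best_matching_unit(weights, record) for record in data]
```

After training, `cells` holds the map position assigned to each record, which plays the role of the cell ID in the assignment output; counting how often each position occurs gives the cluster sizes.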
In the Predictive Analysis Library, this method is called with the function name ‘SELFORGMAP’; the algorithm name is Self-Organizing Maps. The input table consists of an initial ID column, and all subsequent columns contain the variables to be used for the cluster analysis. These variables must be numeric, because SOMs cluster data objects using the inter-object distance, which cannot be computed if the input variables are non-numeric. The Parameter Table Definition for Self-Organizing Maps is shown in the table below. There are 3 methods to normalize the data, as discussed below:
1. No normalization: the data stays as it is. This is the default setting, and the parameter is set to 0.
2. For each variable X (x1, x2, ..., xn), the minimum and maximum values of X are found, and X[i] = (X[i] - min) / (max - min) is calculated to rescale the data between 0 (min value) and 1 (max value). The parameter is set to 1.
3. For each variable X (x1, x2, ..., xn), the normalized values depend on the mean and standard deviation of X. Each xi is normalized to xi’ by computing xi’ = (xi - Mean(X)) / S.D.(X). The parameter is set to 2.
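The three normalization settings can be reproduced on a single variable with the following sketch. The function `normalize` is hypothetical and only mirrors the formulas in the list above; in particular, it uses the sample standard deviation, which is an assumption, since the text does not say whether PAL uses the sample or the population form.

```python
import statistics


def normalize(values, method):
    """Normalize one variable: 0 = none, 1 = min-max, 2 = z-score."""
    if method == 0:
        return list(values)
    if method == 1:
        low, high = min(values), max(values)
        if high == low:
            raise ValueError("min-max scaling needs at least two distinct values")
        return [(v - low) / (high - low) for v in values]
    if method == 2:
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation (assumption)
        return [(v - mean) / sd for v in values]
    raise ValueError("method must be 0, 1 or 2")


raw = [2, 4, 6]
unchanged = normalize(raw, 0)   # [2, 4, 6]
min_max = normalize(raw, 1)     # [0.0, 0.5, 1.0]
z_score = normalize(raw, 2)     # [-1.0, 0.0, 1.0]
```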
Name             Data Type    Description
SIZE_OF_MAP      Integer      The self-organizing map is made up of n × n unit cells. This parameter defines the value n.
MAX_ITERATION    Integer      The maximum number of iterations.
NORMALIZATION    Integer      Normalization method: 0 = none, 1 = transform to new range (0.0, 1.0), 2 = Z-score normalization.
THREAD_NUMBER    Integer      Number of threads.
The Parameter Table Definition for Self-Organizing Maps
Self-Organizing Maps outputs two tables. The first table, called SOM Map, holds the final weights corresponding to each map cell ID, along with the number of records or tuples assigned to each map cell ID, i.e. the cluster size. The cluster size is stored in the last column of the table and is always of integer type. The weight vectors used to simulate the original tuples are always output as double in the middle columns. The second output table, called SOM Assign, maps each record to its assigned cell ID, showing the membership of the clusters. The ID for tuples can be integer or string.
Name                                 Data Type    Description
1st column                           Integer      Unit cell ID.
Other columns except the last one    Double       The weight vectors used to simulate the original tuples.
Last column                          Integer      The number of original tuples that every unit cell contains.
Output Tables Defined for Self-Organizing Maps
Below is an example of self-organizing maps in PAL, run on the same data set, shown in the figure above, on which we ran K-Means; it therefore also offers an interesting comparison of the two cluster analysis algorithms. The PAL SQLScript to call Self-Organizing Maps is as follows; the full code is available in the file SAP_HANA_PAL_SELFORGMAP_Example_SQLScript on the SAP PRESS website.
-- PAL set-up: generate the wrapper procedure
CALL SYSTEM.afl_wrapper_generator('PAL_SELF_ORG_MAP', 'AFLPAL', 'SELFORGMAP', PDATA);
-- Prepare the control table parameters for the procedure call
INSERT INTO PAL_CONTROL_TAB VALUES ('MAX_ITERATION', 200, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('SIZE_OF_MAP', 4, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('NORMALIZATION', 0, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
-- Assume the data has been stored in table PAL_SOM_DATA_TAB
-- Call the procedure and read both output tables
CALL PAL_SELF_ORG_MAP(PAL_SOM_DATA_TAB, PAL_CONTROL_TAB, PAL_SOM_MAP_TAB, PAL_SOM_RESASSIGN_TAB) WITH OVERVIEW;
SELECT * FROM PAL_SOM_MAP_TAB;
SELECT * FROM PAL_SOM_RESASSIGN_TAB;
This script outputs two tables, as shown below. The first output table shows the final weights assigned to each map cell ID, along with the number of records or tuples corresponding to that cell ID, which represents the cluster size. The second table shows the cell ID assigned to each record, i.e. the membership of the clusters. For example, Cell ID 0 has 2 records (trans_id 11 and 12), Cell ID 13 has 2 records (trans_id 18 and 19), and Cell ID 15 has 5 records (trans_id 0, 1, 2, 3 and 4).
Generally, the results are then presented in a visualization to understand and analyze the clustering of the data; for instance, cell 7 has five records, cell 8 has two records, and so on. We can run the SOM on an N * N grid. The smaller the value of N, the closer the cells are and hence the tighter the grouping that can be seen. SOM and K-Means give the same number of clusters most of the time, but in K-Means the user predetermines the value of K before running the algorithm, whereas with Kohonen SOMs we let the data suggest the value of K. If a SOM has fewer variables than cells, it produces a sparse output with many empty regions among the classified objects; in the opposite case, the cells are forced to share variables, which yields groups or clusters. The following figure shows the visualization of the results when we run SOM on the input data in figure XX and plot the table data for 4 clusters in a 4 * 4 map.
The Four Clusters in the 4 * 4 Map
Cluster analysis certainly deserves to be called one of the most popular methods of predictive analysis, but it is used in business operations mainly because of its unique ability to split huge amounts of data into smaller, manageable clusters, which gives a better understanding of the data than analyzing it for predictions. The most common application is market segmentation, based on the fact that focused marketing is more effective than a generic approach. The biggest strength of cluster analysis in general is that it is easy to understand, as the intention in this step is not to make predictions but to cluster existing data. Each cluster analysis algorithm discussed above has its own pros & cons. ABC Classification is very simple, very practical and therefore very popular, but it is limited to 3 groups, even when a user wants more groups or groups that do not necessarily add up to 100%, such as finding the top 15%, second 10% and third 40%.
K-Means is also easy to understand and to apply if we choose the value of K in advance, and its detection of groups is undirected and data driven. K-Means is clearly driven by the choice of K, which depends on the user’s experience and skill; a wrong choice makes the whole clustering a bad experience. The results of K-Means generally vary with the choice of distance measure and are sensitive to the initial choice of cluster centers. The value of K cannot be optimized directly, as choosing it does not involve maximizing or minimizing some function. Input data is assumed to be numeric by default, but non-numeric data can be mapped to numeric values to enable cluster analysis, or other approaches such as decision trees and association analysis can be used to find groups in the data.
Self-organizing maps find application in creating an ordered visualization of multi-dimensional data, which simplifies complexity and reveals meaningful relationships; they can thus be compared to a non-parametric regression technique that converts multi-dimensional data spaces into lower-dimensional abstractions. They are popular and beneficial because they provide quick and concise model creation even for voluminous data sets and offer outstanding prediction accuracy due to a patented procedure for the extraction of non-linear relations. However, they work well only if clusters exist; they cannot find clusters for the user if there are none in the data.
Chapter 5: Conclusion
5.1 Problem Set: Burn that Churn
To gain hands-on experience with SAP PA, I decided to run a predictive analysis on telecom data from the last few years and then assess the efficiency and accuracy of the predictions by applying the same model to different sample subsets of the data and comparing its correctness. Huge volumes of customer data are analyzed in order to identify business factors, such as new revenue opportunities and ways to reduce churn, that can help make informed decisions. Customers become “churners” when they discontinue their subscription and move their business to a competitor. Credit card issuers, insurance companies and telecommunication companies are always keen to predict churn, i.e. the process of customer turnover. If they identify a leaving customer at the right time, they can try to hold on to him with better offers, as it is always cheaper to retain a current customer than to gain a new one. In this work, we have considered only customer-initiated churn and ignored operator-initiated churn, because the latter is mostly due to payment problems on the customer’s side and is of no interest here.
All telecom service providers must be able to respond in a timely manner to customers’ preferences in order to survive in this competitive world. They must be able to predict and prevent subscriber churn by understanding the reasons why customers have already churned as well as the expectations of current customers. All telecom operators store enormous amounts of data about subscribers and their behavior. In this part of the thesis, I have attempted to analyze such data and find valuable insight for business decision makers. There are many ways to approach this objective, such as segmenting customers based on location, demographics, purchase history etc., or understanding which marketing campaigns and new products attracted the most customers. It is not practical to understand the personal preferences of each and every subscriber, but it is possible to build visualizations and understand patterns based on historic data in repositories in order to predict the probability of churn in each segment. The following variables were the main ones taken into consideration, as they appear to have the highest effect on churn:
- Customer demographics, i.e. age, gender, marital status, location, etc.
- Call statistics: length of calls at different times of the day, number of long-distance and local calls.
- Billing information for each customer: what the customer is paying for local and long-distance calls.
- Extra service information, i.e. what extra plan the customer is registered on, e.g. special long-distance rates.
- Complaint information: how many customer service calls are made for disputed billing, dropped calls, slow service provisioning, non-working special services, and so on.
- Credit history.
5.2 Results & Analysis
I obtained the data set for this work from the Teradata Center for Customer Relationship Management at Duke University, which used it for the NCR Teradata 2003 Tournament. The data represents a major wireless telecommunication service and holds records for more than 100,000 customers who have a minimum of 6 months of service history. I found the data quite good for churn modelling and prediction, and used it to obtain the results and analysis for this thesis report. The dataset is named ‘tourn_1_calibration.csv’ and is copied to the attached CD. Using Excel, I first removed noise by deleting all rows with missing, null or strange-looking values for the variables that seemed most likely to influence prediction and modelling. These variables from the data source are:
1. Age of Handset
2. Calls to customer care in last 3 months
3. % change in monthly outgoing calls on previous 3 months average
4. Age of Customer
5. Current Handset Price
6. Income of customer
7. Total number of months as customer
8. Credit History
The new dataset after removing noise is named tourn_1_calibration_thesis.csv and is also copied to the CD. Binning was also applied to this data set to improve the analysis results and to have the data read as groups instead of integer values. The new columns I introduced to the dataset in the binning process are:
1. range_handprice, which has 3 possible values, ‘low end’, ‘medium’ & ‘high end’, representing the cost group of handsets. Mobile phones costing below 79 are classified as low end, those costing more than 170 as high end, while the remainder fall in the medium group.
2. range_income, which again has 3 possible values, ‘low’, ‘medium’ & ‘high’, representing the income bracket of users. Customers at or below 3 on the income index are classified as the low income group, those at or above 7 as the high income group, while the remainder fall in the medium group.
3. range_agemob, which has 3 possible values, ‘less than 1 year’, ‘around 2 years’ and ‘very old mobile’, depending on how long the handset has been in use. As the name signifies, the group ‘less than 1 year’ holds all entries where the handset has been used for less than 12 months. 13 to 30 months of handset use falls in the ‘around 2 years’ group, and all handsets more than 30 months old are classified in the ‘very old mobile’ group.
4. range_churn, which simply represents the value of the churn flag in a readable text format for quick analysis. Churn value 0, which represents a customer who is still active, is shown in this new column as ‘Remained’, while value 1 is shown as ‘churned’, stating that the customer has already churned.
5. range_custcare, which groups the number of customer care calls a user has made in the last 3 months and has 3 possible values, ‘less than 20’, ‘20 to 60 calls’ & ‘more than 60’. The values in these 3 groups are easy to guess from the group names.
In the same way, all crucial columns that I thought would have a major influence on churn have been captured in new columns and replaced by the corresponding bracket value with the help of Excel. All these new columns are named range_xxx so that they can be quickly selected from the list of variables in PA.
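The binning rules above were applied in Excel; the same logic can be written down compactly in Python, as sketched below. The function names are hypothetical, and two boundary choices are assumptions, because the text leaves them open: a handset used for exactly 12 months is placed in ‘around 2 years’, and exactly 20 or 60 customer care calls fall in ‘20 to 60 calls’.

```python
def bin_handset_price(price):
    """range_handprice: below 79 is low end, above 170 is high end."""
    if price < 79:
        return "low end"
    if price > 170:
        return "high end"
    return "medium"


def bin_income(index):
    """range_income: index <= 3 is low, index >= 7 is high."""
    if index <= 3:
        return "low"
    if index >= 7:
        return "high"
    return "medium"


def bin_handset_age(months):
    """range_agemob: exactly 12 months goes to 'around 2 years' (assumption)."""
    if months < 12:
        return "less than 1 year"
    if months <= 30:
        return "around 2 years"
    return "very old mobile"


def bin_custcare(calls):
    """range_custcare: 20 and 60 fall in the middle group (assumption)."""
    if calls < 20:
        return "less than 20"
    if calls <= 60:
        return "20 to 60 calls"
    return "more than 60"


def churn_label(flag):
    """range_churn: readable text for the 0/1 churn flag."""
    return "churned" if flag == 1 else "Remained"


# One hypothetical customer record, binned column by column.
example = (bin_handset_price(59.99), bin_income(7),
           bin_handset_age(24), bin_custcare(3), churn_label(0))
```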
The process starts with importing this modified data set, with the added columns and noise removed, into PA. In the PA interface you can acquire the data source file from the File menu: select New, choose the csv file option and navigate to the dataset in question.
Once the dataset was imported, I ran the sample function to create 4 sample datasets of 7500 records each to make modelling faster. These 4 different samples also helped to verify the correctness of any model by applying the model repeatedly and comparing the results for the four samples individually. I named them ‘churnsample1.csv’, ‘churnsample2.csv’, ‘churnsample3.csv’ and ‘churnsample4.csv’. All 4 sampled datasets were taken by random selection in PA and are copied to the attached CD.
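Outside PA, a comparable set of random samples could be drawn as in the sketch below. It only illustrates the idea of random selection; the function `draw_samples`, the seed and the stand-in record IDs are assumptions, and the samples are drawn independently, so a record may appear in more than one of them. Whether the PA sample function behaves the same way is not stated here.

```python
import random


def draw_samples(records, n_samples=4, size=7500, seed=42):
    """Draw `n_samples` independent random samples without replacement."""
    rng = random.Random(seed)
    return [rng.sample(records, size) for _ in range(n_samples)]


# Stand-in for the cleaned data set: one integer ID per customer row.
records = list(range(100000))
samples = draw_samples(records)
```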
The next step is to apply algorithms to each dataset and compare the predicted value of the churn flag with its actual value. The following subsections hold the results for 4 categories of algorithms and comment on how we can get an idea of future churning behavior based on the results of these algorithms on the datasets. All the models created to obtain the results below are saved in .pmml format and are copied to the attached CD for reference.
Under the Predict tab, the PA interface allows you to drag and drop the required algorithm in front of the data set; before running it, you can configure the settings of each algorithm by left-clicking the settings icon on the algorithm itself, as shown in the figure below. Subsections 5.2.1 to 5.2.4 present the analysis results and the values calculated by PA on 4 different datasets for four different classes of algorithms. Some algorithms allow you to predict the value of a particular variable, say churn in our example, by letting us select it as the dependent variable under the configuration settings, while others do not. In the latter case we cannot directly assess model efficiency by comparing the correctly predicted values to the total number of values, but we can still gain some understanding of patterns and behavior based on visualizations and comparisons.
5.2.1 Clustering
Algorithm Used: R-K-Means (MacQueen)
Number of clusters: 5
Dependent variables: change_mou, months, no_of_cars, income, eqpdays and custcare_mean
No of Iterations: 100
Although this algorithm does not allow us to calculate the value of a specific variable based on other dependent variables, we can still find clusters and common patterns based on some independent variables. As is clear from the result sets (visualizations) below, cluster 5 is the most dense while cluster 1 is the smallest. From the cluster density & distance chart, we can see a strong connection between cluster 2 and cluster 5. The cluster center representation diagram reveals that clusters 1, 3 and 4 lean more on the handset price variable, while the biggest one, cluster 5, is based on 2 values, equipment age and income bracket. The parallel coordinate chart displays calculated_churns for all 5 clusters, while the scatter matrix chart plots the relations between the chosen independent variables against each other. The summary is as follows:
Cluster   change_mou   months     hnd_price    eqpdays    income     custcare_mean
1         2391.470     19.55148   7998999023   431.2059   5.913963   1.7418900
2         3313.584     24.77475   3026621144   639.8905   5.955474   0.8334206
3         2213.295     18.43822   5997563536   465.2371   5.928161   1.1652299
4         1932.428     19.43715   9995865138   308.0222   5.407119   1.9365962
5         1896.514     16.94159   1556729838   304.7009   5.701856   1.7578339
Cluster Analysis for first dataset
5.2.2 Decision Tree
This algorithm allows us to make calculations for a chosen target variable, and thus we can easily determine the efficiency and correctness of the model after running it.
Algorithm Used: R-CNR Tree (Regression)
Output Mode: Trend
Dependent variables: range_income, range_custcare, range_months, range_handprice, range_agemo
Target variable: churn
New output column: calculated_churn
The output chart shows the probability of churn based on the above-mentioned dependent variables. By analyzing it in detail, we can identify customers with a high probability of churning based on rules. For example, customers with more than 20 customer care calls and a decrease of more than 23% in overall revenue over the past 3 months have a 69% probability of churning, and customers with expensive handsets and more mobile data use than voice calls have a 78% probability of churning; maybe they use their handsets more for internet gaming or browsing, and thus only the latest handsets interest them. Under the result grid view, this analysis also gives a calculated predicted churn value for each customer, as shown in the figure below.
Decision Trees showing probability and classification of dependent variables.
Calculated churn value for each customer based on dependent variables
5.2.3 Apriori
Here, too, we cannot calculate the value of a particular variable based on other independent variables in the dataset, so we rely on visual analysis and the rules summary. Although Apriori is not a favorable algorithm for this problem set, even a small piece of new information can lead to better business decisions.
Sort Type: Ascending Transaction Size
Output Mode: Rules
Dependent variables: range_income, range_custcare, range_months, range_handprice, range_agemo
Support: 0.1
Confidence: 0.8
The output figure below shows the set of rules generated by Apriori, which can be very useful for decision making when examined closely. This analysis, based on 5 dependent variables, gave us a set of 37 rules. Interestingly, changing the values of support and confidence changes the number of generated rules as well as the lift values of the rules reported.
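To make the three rule measures concrete, the sketch below computes support, confidence and lift for one rule over a handful of made-up transactions and checks it against the thresholds used above (support 0.1, confidence 0.8). The function `rule_metrics` and the four transactions are hypothetical and are not taken from the thesis data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    total = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= set(t))
    count_c = sum(1 for t in transactions if consequent <= set(t))
    count_both = sum(1 for t in transactions
                     if (antecedent | consequent) <= set(t))
    support = count_both / total
    confidence = count_both / count_a
    lift = confidence / (count_c / total)
    return support, confidence, lift


# Made-up transactions built from binned column values.
transactions = [
    {"range_income=low", "range_custcare=less than 20"},
    {"range_income=low", "range_custcare=less than 20"},
    {"range_income=low", "range_custcare=more than 60"},
    {"range_income=high", "range_custcare=more than 60"},
]
support, confidence, lift = rule_metrics(
    transactions, {"range_income=low"}, {"range_custcare=less than 20"})

# Keep the rule only if it meets both thresholds from the settings above.
keep_rule = support >= 0.1 and confidence >= 0.8
```

In this toy case the rule has a support of 0.5 and a lift above 1, but its confidence of about 0.67 is below 0.8, so it would not be reported; lowering the confidence threshold would add it to the rule set, which is why the number of rules moves with the thresholds.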
Apriori rules output for first dataset
5.2.4 Neural Network
In my experience, this algorithm fits this problem set best and allows us to prepare a model that can be reused with new data to predict values of the target variable. Using this approach, we build a model on training data and keep tuning it until the drop in correctness when it is applied to another dataset is no longer large. When we run this model, it adds a new column to the table which, according to the model, holds the value of the target variable (churn) based on previous patterns and behavior. This allows us to compare the actual value of the target variable with the value predicted by PA.
Algorithm: R-MONMLP
Target Variable: Churn
Output Mode: Trend
Dependent variables: income, months, custcare_mean, handprice,
Hidden Layer 1 Neurons: 5
Predicted column name: predicted_value
The following screenshot shows the new column predicted_value that is inserted when we run this algorithm. It holds the values 0 and 1, which indicate whether or not the customer churns according to this model and the configuration settings for the chosen dependent variables.
New column ‘Predicted value’ in result set
We can easily find how many of the values PA predicted correctly by comparing the output values to the real churn values in the dataset. The total number of correct predictions divided by the total number of rows gives the percentage correctness of the model.
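The correctness calculation can be sketched in a few lines of Python. The functions `accuracy` and `confusion_counts` and the five example rows are hypothetical; in practice the two lists would come from the churn column and the predicted_value column of the result set.

```python
def accuracy(actual, predicted):
    """Share of rows where the predicted value equals the actual value."""
    if len(actual) != len(predicted):
        raise ValueError("both columns must have the same number of rows")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def confusion_counts(actual, predicted):
    """Count rows per (actual, predicted) pair for a 0/1 target."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Made-up example rows: actual churn flag and model output.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 0, 1]
share_correct = accuracy(actual, predicted)   # 4 of 5 rows match
matrix = confusion_counts(actual, predicted)
```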
In the first attempt, I got the following figures for the predicted values compared to the churn values in the dataset.
Confusion Matrix for predicted churn variable
That means that out of 3780 actual ones, our model predicted a one correctly 2082 times, i.e. approx. 55% correctness. I then tried new values and combinations of variables by trial and error to configure the algorithm until I obtained a relatively better model. In PA, any model with 80% or above correctness is considered good. No model can be 100% accurate and applicable to all problem sets. Planning and implementing a predictive analysis solution for a problem set requires an understanding of the business processes and domain knowledge.
Output model with 65% correctness
Output model with 96% correctness
This improvement in the model could also be the result of overfitting, so to check for that possibility, I applied the same model to the other 3 datasets. The correctness percentage of the model did not fluctuate by large amounts, so I considered this to be a good model. To save a model in the interface, click Save as Model under the Predict tab. Once saved, models appear in the component pane, as shown in the diagram below. This is the model that gave us 96% correct values for churnsample1. Saving a model makes it easy to reapply it to other datasets, as we do not need to configure the algorithm settings again.
Saving a model with high correctness
The following figures show the results for the same model, ‘neuralnetwork’, when applied to the other 3 datasets.
Confusion matrix for data set 2, correctness 99%
Confusion matrix for data set 3, correctness 98%
Confusion matrix for data set 4, correctness 79%
As we do not see a big fall or deviation in correctness when this model is applied to the other 3 sample sets, we can conclude that it is a well-configured model and can use it to predict the churn variable for customers who are still active. That is the potential of modelling and PA.
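The comparison across samples amounts to simple arithmetic on the correctness figures reported above, as the following sketch shows. The helper `max_accuracy_drop` is hypothetical; it only reports the largest fall relative to the first sample and leaves the judgment of what counts as a big fall to the analyst.

```python
def max_accuracy_drop(reference, others):
    """Largest fall in correctness relative to the reference sample."""
    return max(reference - value for value in others)


# Correctness figures reported for the saved model.
reference = 0.96  # churnsample1
others = {"data set 2": 0.99, "data set 3": 0.98, "data set 4": 0.79}
drop = max_accuracy_drop(reference, others.values())
worst = min(others, key=others.get)
```

Here the largest fall is 17 percentage points, observed on data set 4, while the model scores slightly higher on data sets 2 and 3 than on the first sample.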
5.3 Discussion & Issues
5.3.1 SAP PA compared to Hadoop
To date, the SAP Predictive Analysis tool is designed to predict and analyze structured data only, which means the user must have structured data in xls, csv or txt files or in HANA database tables. If the data is not in a structured format, the user cannot proceed with analysis and prediction in PA, whereas Hadoop can also be leveraged to analyze unstructured and semi-structured data. Hadoop is often used to store and process big volumes of semi-structured or binary data. It does look feasible, however, to pass results from Hadoop to PA in 2 steps. First, Hadoop performs an initial selection and aggregation and outputs a reduced dataset; take the case of text documents stored in Hadoop, for which it returns only a list of words with their frequencies. Second, SAP Predictive Analysis takes this information from Hadoop as input and applies data mining algorithms to the Hadoop result set.
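The reduction step in this example, turning raw text documents into a list of words with frequencies, can be illustrated with the sketch below. It runs in plain Python and merely stands in for what a Hadoop job would do at scale; the function `word_frequencies`, the tokenization rule and the two sample documents are assumptions for illustration.

```python
import re
from collections import Counter


def word_frequencies(documents):
    """Reduce a collection of text documents to word counts."""
    counts = Counter()
    for text in documents:
        # Lower-case the text and keep runs of letters as words.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


# Two made-up documents standing in for files stored in Hadoop.
documents = ["Churn is costly", "Reduce churn with better offers"]
frequencies = word_frequencies(documents)

# The reduced, structured result: one (word, frequency) row per word,
# which could then be saved as csv and loaded into PA.
rows = sorted(frequencies.items())
```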
Comparing them makes little sense, as they serve different purposes. Hadoop is a file system that stores a variety of data, such as Big Data, i.e. structured, unstructured and semi-structured data, while PA is an advanced analytical tool that reads historical data from such data sources and projects predicted results for a business query, where the data can be picked from any data source such as HANA, Hadoop, BW or BO. A next big step for SAP could be to offer SAP & Hadoop integration to make things easier for the customer, either by merging InfiniteInsight, which allows this Hadoop integration, into PA or through some other innovative module.
5.3.2 Sharing your own R component
It is possible to share the R components you have created with colleagues or customers, and it
involves only a few simple steps. If your component needs a new library, that library has to be
added first under C:\users\Public\R-3.0.1\library; the component can then be shared. Remember to
close the PA program while sharing components. The folder
C:\Users\piyush\SAP Predictive Components\RScript contains all the R components you have created,
and it is also the folder into which you paste components shared with you by someone else. Rename
the components from the default names given by PA to more meaningful names. After renaming, it
is crucial to update the folder name in the ‘component.xml’ file: with a simple text editor,
replace the automatically created name with your new folder name. That is all that is required
for your colleagues to reuse your R components: simply share the folder. The logic is
encapsulated in so-called “Custom R Components”, which allows even users without R skills to use
them.
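The rename step could also be scripted. The sketch below is a minimal illustration, assuming that component.xml is a UTF-8 text file and that the old folder name appears in it literally. The helper name rename_component, the folder names and the XML content in the demo are all made up, since the real layout of the file is not shown here.

```python
import tempfile
from pathlib import Path


def rename_component(components_dir, old_name, new_name):
    """Rename a component folder and update its name inside component.xml."""
    old_dir = Path(components_dir) / old_name
    new_dir = Path(components_dir) / new_name
    old_dir.rename(new_dir)

    xml_path = new_dir / "component.xml"
    text = xml_path.read_text(encoding="utf-8")
    # Plain text replacement, the same edit one would make in a text editor.
    xml_path.write_text(text.replace(old_name, new_name), encoding="utf-8")
    return new_dir


# Demo in a temporary directory with made-up names and file content.
with tempfile.TemporaryDirectory() as tmp:
    component_dir = Path(tmp) / "Component_1"
    component_dir.mkdir()
    (component_dir / "component.xml").write_text(
        "<component><folder>Component_1</folder></component>",
        encoding="utf-8",
    )
    renamed = rename_component(tmp, "Component_1", "ChurnScoring")
    updated_xml = (renamed / "component.xml").read_text(encoding="utf-8")
    renamed_name = renamed.name
```

As with the manual procedure, PA should be closed before running such a script against the real RScript folder.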
Connecting R and PA
5.3.3 Configuring HANA PAL to use with SAP PA
SAP PA, when connected to HANA in online mode, lets users leverage the HANA Predictive Analysis
Library through a user-friendly interface that pushes all processing operations to the HANA
server. The PAL is not installed on HANA by default. To connect them, first install the AFL
(Application Function Library) on the HANA server. It can be downloaded from
http://service.sap.com/swdc. The installation steps are:
1. Log in as root on the HANA server.
2. Extract the files using SAPCAR:
SAPCAR -xvf IMDB_AFL100_60_1-10012328.SAR
3. Navigate into the SAP_HANA_AFL directory which was created in step 2 and execute 'hdbinst':
~/tmp/SAP_HANA_AFL # ./hdbinst
Once installed, you can verify that the AFL installation succeeded with the queries below:
SELECT * FROM "SYS"."AFL_AREAS" WHERE SCHEMA_NAME = '_SYS_AFL' AND AREA_NAME = 'AFLPAL';
SELECT * FROM "SYS"."AFL_PACKAGES" WHERE SCHEMA_NAME = '_SYS_AFL' AND AREA_NAME = 'AFLPAL';
SELECT * FROM "SYS"."AFL_FUNCTIONS" WHERE SCHEMA_NAME = '_SYS_AFL' AND AREA_NAME = 'AFLPAL';
Add the afl_wrapper_generator and afl_wrapper_eraser procedures if they don't exist. On
the HANA server, navigate to the /hanamnt//<SID>/HDB<instance_number>/exe/plugins/afl/
directory and execute the afl_wrapper_generator.sql and afl_wrapper_eraser.sql scripts as
HANA user SYSTEM. (An easy way to do this is to open the files in a text editor on the Linux
server and copy the code back to HANA studio for execution as the SYSTEM user in a SQL
console).
You now have two procedures - AFL_WRAPPER_GENERATOR and AFL_WRAPPER_ERASER
which are owned by SYSTEM.
Grant the EXECUTE privilege on system.afl_wrapper_generator and
system.afl_wrapper_eraser to your predictive analysts.
For example, if the user name is MyHANAUser, run the commands:
GRANT EXECUTE ON system.afl_wrapper_generator TO MyHANAUser;
GRANT EXECUTE ON system.afl_wrapper_eraser TO MyHANAUser;
5.4 Future Work
SAP's traditional Predictive Analysis tool allows users to run data analysis activities, but these
are manual and thus repetitive and prone to human error. SAP InfiniteInsight (formerly KXEN)
introduces automation to PA activities and allows users to concentrate more on business decisions.
It would be interesting to dive in and explore the extra potential it adds to traditional PA. I
wanted to implement HANA server connectivity and run PA predictions on both a HANA server and a
local machine to show the difference in performance in figures, but could not because of the
expensive license fees for a HANA server. Another direction is developing mobile applications that
integrate PA results and charts into other applications for better and faster business decision
making. A predictive model, which is an equation, algorithm or set of rules needed to predict an
outcome from the input dataset, can simply be a set of business rules based on past observations,
and can be developed more accurately using statistically rigorous predictions and statistical
algorithms. Future work can also consider defining a standard for these models so that business
communities can reuse them simply by importing them.
5.5 Conclusion
Even in normal daily life, everyone takes advantage of predictive analytics, in the form of
anything from weather forecasts to insurance premiums. Predictive analytics will be used more
and more as businesses understand and appreciate the benefits that these prediction tools
bring. Predictive analytics is a subset of data mining, but with a focus on making predictions.
In 2012 SAP announced the launch of SAP Predictive Analysis 1.0, its new solution in the
predictive analytics portfolio, as a replacement for the classical offering, SAP BO Predictive
Workbench. Together with PA, HANA and its native predictive library (PAL) enable the execution of
predictive algorithms in-database: the procedures run in the DB layer and only the result set is
exported, instead of the whole dataset being exported for the algorithms to run in the
application layer. This gave SAP a leading edge and made it a big contender in the Big Data
predictive analytics space. In practice, however, SAP's portfolio was very small. HANA no doubt
brought to the game a lot of modern and groundbreaking technologies that were not available
before, but in terms of actual functionality related to analytical models it was still behind its
main competitors such as SAS, IBM and Tibco. Only a couple of dozen algorithms were available in
PAL, which was definitely not enough to compete.
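The difference between running an algorithm in the application layer and in the DB layer can be illustrated with a small sketch. It is not HANA or PAL code: SQLite from the Python standard library stands in for the database, a simple average stands in for a predictive algorithm, and the table and its values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("north", 10.0),
        ("north", 30.0),
        ("south", 20.0),
        ("south", 40.0),
        ("south", 60.0),
    ],
)

# Application-layer approach: export every row, then compute locally.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
sums, counts = {}, {}
for region, amount in rows:
    sums[region] = sums.get(region, 0.0) + amount
    counts[region] = counts.get(region, 0) + 1
app_layer = {region: sums[region] / counts[region] for region in sums}

# In-database approach: the database computes, only the result set leaves it.
result_set = conn.execute(
    "SELECT region, AVG(amount) FROM sales GROUP BY region"
).fetchall()
in_database = dict(result_set)
conn.close()
```

Both approaches give the same averages, but the first transfers every row while the second transfers one row per region, which is the saving that matters as the dataset grows.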
But the game changed the day SAP announced the R integration, which made more than 3500
algorithms part of the library set. This HANA R integration, although powerful, still had a
disadvantage: it required a very specific set of development skills in order to deliver actual
analytical models to business users. To support users with little or almost no technical
knowledge, SAP PA 1.0 (latest version 11) allows custom R functionality (i.e. algorithms that are
not built into standard PA) to be implemented without having to resort to developing HANA
SQLScript/R procedures. Users can share their existing R scripts (just adapting them to a function
model) and can now visually create their analytical models with the most complex algorithms. The
PA interface and the possibility to run analyses on in-memory databases allow very large amounts
of data to be analysed with better performance.
The basic working element of predictive analysis is the predictor, a variable whose measured
value for an individual or entity gives an idea of future behavior. For example, an insurance
company could consider age, income, credit history, insurance claims history and other
demographics as predictors when issuing an insurance policy, to determine an applicant’s risk
factor. A predictive model is a combination of multiple such predictors which, when subjected to
analysis, may help to forecast future possibilities or behaviors with an acceptable level of
reliability. The model must be re-validated and revised as additional data become available, so
that further predictions build on the collected data, the formulation of the statistical models
and the previous properties of the models. PA always goes hand in hand with business knowledge and
statistical techniques for its full exploitation, and the prediction objective must be clear
before starting the process. It is good practice to keep multiple related predictive models ready
to be run and applied to a dataset for better strategic company decisions.
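As an illustration of how several predictors combine into one model, the sketch below computes a probability-like risk score for the insurance example with a logistic function. The predictor names, weights and intercept are hypothetical and chosen only for illustration. A real model would estimate them from historical data and be re-validated as new data arrive.

```python
import math

# Hypothetical weights, not fitted to any real data.
WEIGHTS = {
    "age": -0.02,         # years
    "income_k": -0.01,    # yearly income in thousands
    "past_claims": 0.8,   # number of earlier insurance claims
    "bad_credit": 1.2,    # 1 if the credit history is poor, else 0
}
INTERCEPT = -1.0


def risk_score(applicant):
    """Combine several predictors into one score between 0 and 1."""
    z = INTERCEPT + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


# Two made-up applicants.
low = risk_score(
    {"age": 45, "income_k": 80, "past_claims": 0, "bad_credit": 0}
)
high = risk_score(
    {"age": 22, "income_k": 25, "past_claims": 3, "bad_credit": 1}
)
```

With these assumed weights, the applicant with earlier claims and a poor credit history receives the higher score, which shows how each predictor contributes to the applicant's risk factor.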