A Proposed Statistical Approach for Outlier Detection

Dr. Amr Mohamed Mohamed Kamal
Ph.D. in Computers and Information
Information Technology Department, College of Applied Sciences
Ministry of Higher Education, Ibri, Sultanate of Oman.
Email: amrmkamal.ibr@cas.edu.om

Abstract: This paper illustrates some applications, causes, techniques, and approaches of anomaly detection, as well as important issues that need to be addressed when dealing with anomalies. It also suggests a proposed statistical approach for anomaly detection.

Keywords: Anomaly detection; anomaly score; outlier detection; outlier score; deviation detection; data cleaning; discordant observation; exception mining.

1. Literature review:
Gupta provides a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data [3]. Ranjan presents a new clustering approach for anomaly intrusion detection based on the K-medoids clustering method and certain modifications of it [5]. The proposed algorithm achieves a high detection rate and overcomes the disadvantages of the K-means algorithm [5]. Shon concentrated on machine learning techniques for detecting attacks from internet anomalies [6]. The machine learning framework consists of two major components: a Genetic Algorithm (GA) for feature selection and a Support Vector Machine (SVM) for packet classification. Thiprungsri examined the use of clustering technology to automate fraud filtering during an audit [4].

2. Introduction:
An anomaly is a pattern in the data that does not conform to the expected behaviour. In anomaly detection, the goal is to find objects that are different from most other objects. Anomalous objects are often known as outliers, since on a scatter plot of the data they lie far away from the other data points. Anomaly detection is also known as deviation detection, because anomalous objects have attribute values that deviate significantly from the expected or typical attribute values, or as exception mining, because anomalies are exceptional in some sense. There are a variety of anomaly detection approaches from several areas, including statistics, machine learning, and data mining. All try to capture the idea that an anomalous data object is unusual or in some way inconsistent with other objects. Although unusual objects or events are, by definition, relatively rare, this does not mean that they do not occur frequently in absolute terms. Anomalous values may indicate either a problem or a new phenomenon to be investigated. However, when they occur, their consequences can be quite dramatic, and quite often in a negative sense. The following examples illustrate some applications for which anomalies are of considerable interest:

Fraud detection: The purchasing behavior of someone who steals a credit card is probably different from that of the original owner. Credit card companies attempt to detect theft by looking for buying patterns that characterize theft or by noticing a change from typical behavior.

Intrusion detection: Unfortunately, attacks on computer systems and computer networks are commonplace. While some of these attacks, such as those designed to disable or overwhelm computers and networks, are obvious, other attacks, such as those designed to secretly gather information, are difficult to detect. Many of these intrusions can only be detected by monitoring systems and networks for unusual behavior.
Ecosystem disturbances: In the natural world, there are atypical events that can have a significant effect on human beings. Examples include hurricanes, floods, droughts, heat waves, global warming, and fires. The goal is often to predict the likelihood of these events and to identify their causes.

Public health: If all children in a city are vaccinated for a particular disease, e.g., measles, then the occurrence of a few cases scattered across various hospitals in the city is an anomalous event that may indicate a problem with the city's vaccination programs.

Although much of the recent interest in anomaly detection has been driven by applications in which anomalies are the focus, historically, anomaly detection (and removal) has been viewed as a technique for improving the analysis of data objects. For instance, a relatively small number of outliers can distort the mean and standard deviation of a set of values or alter the set of clusters produced by a clustering algorithm. The term cluster refers to a group of data objects among which there exists a certain degree of similarity [1]. Therefore, anomaly detection (and removal) is often a part of data preprocessing.

3. Some issues of anomalies:

3.1 Data from different classes: An object may be different from other objects, i.e., anomalous, because it is of a different type or class. To illustrate, someone committing credit card fraud belongs to a different class of credit card users than those who use credit cards legitimately.

3.2 Natural variation: Many data sets can be modeled by statistical distributions, such as the normal (Gaussian) distribution, where most of the objects are near a center (the average object) and the probability of a data object decreases rapidly as the distance of the object from the center of the distribution increases.

3.3 Data measurement and collection errors: Errors in the data collection or measurement process are another source of anomalies. Measurements may be recorded incorrectly because of human error, a problem with the measuring device, or the presence of noise. The goal is to eliminate such anomalies, since they not only provide no interesting information but also reduce the quality of the data and the subsequent data analysis. Indeed, the removal of this type of anomaly is the focus of data preprocessing, specifically data cleaning. So, noise should be removed before outlier detection.

4. Techniques for anomaly detection:
I will give a high-level description of some anomaly detection techniques and their associated definitions of an anomaly.

4.1 Model-based techniques: Many anomaly detection techniques first build a model of the data. Anomalies are objects that do not fit the model very well. For example, a model of the distribution of the data can be created by using the data to estimate the parameters of a probability distribution. An object that is not very likely under that distribution does not fit the model well, i.e., it is an anomaly. If the model is a set of clusters, then an anomaly is an object that does not strongly belong to any cluster [4]; a minimal sketch of this case is given below. When a regression model is used, an anomaly is an object that is relatively far from its predicted value [4]. Because anomalous and normal objects can be viewed as defining two distinct classes, classification techniques can be used for building models of these two classes [1].
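As a concrete illustration of the cluster-based case above, the following is a minimal sketch, assuming scikit-learn and NumPy are available; the synthetic data, the number of clusters, and the percentile threshold are all illustrative choices, not part of any particular published method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)
# Two well-separated groups of normal objects plus a few scattered points.
normal = np.vstack([rng.normal(0, 1, size=(100, 2)),
                    rng.normal(8, 1, size=(100, 2))])
scattered = rng.uniform(-10, 18, size=(5, 2))
X = np.vstack([normal, scattered])

# Fit the cluster model; an anomaly is an object that does not strongly
# belong to any cluster, measured here by distance to the nearest centroid.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_nearest_centroid = model.transform(X).min(axis=1)

# Illustrative threshold: flag the most distant 2.5% of objects.
threshold = np.percentile(dist_to_nearest_centroid, 97.5)
print(np.flatnonzero(dist_to_nearest_centroid > threshold))
```

Here the distance to the nearest cluster center plays the role of an anomaly score, a notion made precise in Section 6.3.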
In some cases, it is difficult to build a model, e.g., because the statistical distribution of the data is unknown or no training data are available. In these situations, techniques that do not require a model, such as those described below, can be used.

4.2 Proximity-based techniques: It is often possible to define a proximity measure between objects, and a number of anomaly detection approaches are based on proximities. Anomalous objects are those that are distant from most of the other objects. Many of the techniques in this area are based on distances and are referred to as distance-based outlier detection techniques [1].

4.3 Density-based techniques: Objects that are in regions of low density are relatively distant from their neighbors and can be considered anomalous [5].

5. Use of class labels:
There are three basic approaches to anomaly detection: unsupervised, supervised, and semi-supervised [4]. The major distinction is the degree to which class labels (anomaly or normal) are available for at least some of the data.

5.1 Supervised anomaly detection: Labels are available for both normal data and anomalies [4].

5.2 Unsupervised anomaly detection: No labels are assumed; instead, it is assumed that anomalies are very rare compared to normal data. In such cases, the objective is to assign to each instance a score (or a label) that reflects the degree to which the instance is anomalous [4].

5.3 Semi-supervised anomaly detection: Labels are available only for normal data. In the semi-supervised setting, the objective is to find an anomaly label or score for a set of given objects by using the information from the labeled normal objects.

6. Important issues that need to be addressed when dealing with anomalies:

6.1 Number of attributes used to define an anomaly: Since an object may have many attributes, it may have anomalous values for some attributes but ordinary values for others. Furthermore, an object may be anomalous even if none of its attribute values is individually anomalous. For example, it is common to have people who are 70 cm tall (children) or people who weigh 150 kg, but uncommon to have a person who is 70 cm tall and weighs 150 kg. A general definition of an anomaly must specify how the values of multiple attributes are used to determine whether or not an object is an anomaly. This is a particularly important issue when the dimensionality of the data is high.

6.2 Global versus local perspective: An object may seem unusual with respect to all objects, but not with respect to objects in its local neighborhood. For example, a person whose height is 2.3 m is unusually tall with respect to the general population, but not with respect to professional basketball players.

6.3 Degree to which a point is an anomaly: Some approaches treat being an anomaly as a binary property: an object either is an anomaly or it is not. Frequently, this does not reflect the underlying reality that some objects are more extreme anomalies than others. Hence, it is desirable to have some assessment of the degree to which an object is anomalous. This assessment is known as the anomaly or outlier score.
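One simple way to produce such a score, in the spirit of the proximity-based techniques of Section 4.2, is to use the distance from each object to its k-th nearest neighbor. The following is a minimal sketch assuming only NumPy; the helper name knn_outlier_score and the choice k = 5 are illustrative.

```python
import numpy as np

def knn_outlier_score(X, k=5):
    """Outlier score of each object: distance to its k-th nearest neighbor."""
    # Pairwise Euclidean distances (adequate for small data sets).
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 holds the zero self-distance,
    # so column k holds the distance to the k-th nearest other object.
    return np.sort(d, axis=1)[:, k]

rng = np.random.default_rng(seed=1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[6.0, 6.0]]])
scores = knn_outlier_score(X)
print(scores.argmax())  # the isolated point (index 200) gets the largest score
```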
7. Statistical approaches:
Depending on whether we are working with a population or a sample, a numerical measure is known as either a parameter or a statistic. A parameter is a measure computed from the entire population; as long as the population does not change, the value of the parameter will not change [2]. A statistic is a measure computed from a sample that has been selected from the population; the value of the statistic depends on which sample is selected [2]. Statistical approaches are model-based approaches; i.e., a model is created for the data, and objects are evaluated with respect to how well they fit the model. Most statistical approaches to outlier detection are based on building a probability distribution model and considering how likely objects are under that model. This paper presents one such statistical approach for outlier detection.

8. Probabilistic definition of an outlier:
An outlier is an object that has a low probability with respect to a probability distribution model. If the data are assumed to have a Gaussian distribution, then the mean and standard deviation of the underlying distribution can be estimated by computing the mean and standard deviation of the data. The probability of each object under the distribution can then be estimated. A wide variety of statistical tests have been devised to detect outliers, or discordant observations. There are two basic assumptions:
1. Normal objects are located in the center of the data space.
2. Outliers are located at the border of the data space, outside the interval µ ± 3σ.
So, we will use the statistical concepts of central tendency (sample mean, median, and mode) and measures of variation (variance and standard deviation) in the proposed approach.

9. Important issues that need to be addressed when dealing with the probabilistic definition of an outlier:

9.1 Identifying the specific distribution of a data set: Probability is the way decision makers express their uncertainty about outcomes and events. Discrete distributions (such as the uniform, binomial, multinomial, hypergeometric, Poisson, negative binomial, and geometric) together with continuous distributions (such as the normal, gamma, exponential, Chi-square, and Weibull) are used frequently in business decision making. Discrete random variables are determined by counting; continuous random variables are determined by measuring. Of course, if the wrong model is chosen, then an object can be erroneously identified as an outlier.

9.2 The number of attributes used: A data set is univariate, bivariate, or multivariate depending on whether it contains information on one variable only, on two variables, or on more than two [9]. Most statistical outlier detection techniques apply to a single attribute, but some techniques have been defined for multivariate data. In this paper, I propose a framework for detecting outliers in a univariate environment.

10. Detecting outliers in a univariate normal distribution:
The Gaussian (normal) distribution is one of the most frequently used distributions in statistics, and I will use it to describe a simple approach to statistical outlier detection. For continuous probability distributions, we find the probability that a value lies within a specified range. The graph of the normal distribution, called the normal curve, is the bell-shaped curve that describes the distribution of so many sets of data which occur in nature, industry, and research. The mathematical equation for the probability distribution of the continuous variable depends on the two parameters µ and σ, its mean and standard deviation. Denoting the density function of X by n(x; µ, σ), the density of the normal random variable X, with mean µ and variance σ², is

f(x) = n(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^{2}}, \qquad -\infty < x < \infty,

where π = 3.14159... and e = 2.71828... [7]. Once µ and σ are specified, the normal curve is completely determined. The area under a probability curve must be equal to 1, and therefore the more variable the set of observations, the lower and wider the corresponding curve will be.
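As a quick sanity check of the density formula above, here is a minimal sketch assuming SciPy and NumPy are available; the function name n_pdf is illustrative.

```python
import numpy as np
from scipy.stats import norm

def n_pdf(x, mu, sigma):
    # The density n(x; mu, sigma) exactly as written above.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-4.0, 10.0, 9)
mu, sigma = 3.0, 2.0
# Agrees with the library implementation of the normal density.
assert np.allclose(n_pdf(x, mu, sigma), norm.pdf(x, loc=mu, scale=sigma))
```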
10.1 Properties of the normal curve:
1. The highest point on the normal curve is located at the mean, which is also the median and the mode of the distribution.
2. The curve is symmetric about a vertical axis through the mean µ.
3. If a random variable has a small variance or standard deviation, we would expect most of the values to be grouped around the mean. A large value of σ indicates greater variability, and therefore the area is more spread out.
4. The normal curve approaches the horizontal axis asymptotically as we proceed in either direction away from the mean.
5. The total area under the curve and above the horizontal axis is equal to 1.

I shall now show that the parameters µ and σ² are indeed the mean and the variance of the normal distribution. To evaluate the mean, I write

E(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x\, e^{-\frac{1}{2}[(x-\mu)/\sigma]^{2}}\, dx.

Setting z = (x - µ)/σ, so that x = µ + σz and dx = σ dz, we obtain

E(X) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} (\mu + \sigma z)\, e^{-z^{2}/2}\, dz = \mu \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^{2}/2}\, dz + \frac{\sigma}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z\, e^{-z^{2}/2}\, dz.

The first integral is µ times the area under a normal curve with mean zero and variance 1, and hence is equal to µ. The second integral is equal to zero, since its integrand is an odd function. The variance of the normal distribution is given by

\sigma^{2} = E[(X - \mu)^{2}] = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x - \mu)^{2}\, e^{-\frac{1}{2}[(x-\mu)/\sigma]^{2}}\, dx.

Again setting z = (x - µ)/σ, so that dx = σ dz, we get

E[(X - \mu)^{2}] = \frac{\sigma^{2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^{2}\, e^{-z^{2}/2}\, dz.

Integrating by parts with u = z and dv = z e^{-z^{2}/2} dz, so that du = dz and v = -e^{-z^{2}/2}, we find

E[(X - \mu)^{2}] = \sigma^{2}(0 + 1) = \sigma^{2}.

Changing µ shifts the distribution left or right; changing σ increases or decreases the spread, as shown in Figure 1. No matter what µ and σ are, the area between µ - σ and µ + σ is about 68%, the area between µ - 2σ and µ + 2σ is about 95%, and the area between µ - 3σ and µ + 3σ is about 99.7%. Almost all values fall within 3 standard deviations of the mean. The three-sigma interval µ ± 3σ is often called a tolerance interval that contains almost all of the measurements in a normally distributed population [8], as shown in Figure 2.

Fig. 1: Effect of changing µ and σ on the normal density f(x).
Fig. 2: The three-sigma tolerance interval.

There is a unique normal curve for every combination of µ and σ, and there is a theoretically unlimited number of such combinations. Fortunately, we are able to transform all the observations of any normal random variable X into a new set of observations of a normal random variable Z with mean zero and variance 1. This is done by means of the transformation Z = (X - µ)/σ. Whenever X assumes a value x, the corresponding value of Z is given by z = (x - µ)/σ. Therefore, if X falls between the values x = x₁ and x = x₂, the random variable Z will fall between the corresponding values z₁ = (x₁ - µ)/σ and z₂ = (x₂ - µ)/σ. So, all normal distributions can be converted into the standard normal curve by subtracting the mean and dividing by the standard deviation. Consequently, we can write

P(x_{1} < X < x_{2}) = \frac{1}{\sigma\sqrt{2\pi}} \int_{x_{1}}^{x_{2}} e^{-\frac{1}{2}[(x-\mu)/\sigma]^{2}}\, dx = \frac{1}{\sqrt{2\pi}} \int_{z_{1}}^{z_{2}} e^{-z^{2}/2}\, dz = \int_{z_{1}}^{z_{2}} n(z; 0, 1)\, dz = P(z_{1} < Z < z_{2}).

But it is very important to notice that:
1) Not all continuous random variables are normally distributed.
2) Both the mean and the standard deviation are extremely sensitive to outliers; effectively, one "bad point" can completely skew the mean.
3) It is important to evaluate how well the data are approximated by a normal distribution.

11. A proposed statistical approach for outlier detection:
1) Look at the histogram and check whether it appears bell shaped.
2) Compute descriptive summary measures (mean, median, and mode).
3) Check whether about 68% of the observations lie within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations; observations outside the three-sigma tolerance interval are candidate outliers.
4) Be cautious about sample size, because the shape of the distribution is highly influenced by sample size.
A sketch of these checks in code is given after this list.
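The following is a minimal sketch of steps 1-3, assuming only NumPy; the function name empirical_rule_check and the synthetic data are illustrative, and the mode is omitted because it is not well defined for a continuous sample.

```python
import numpy as np

def empirical_rule_check(x):
    """Steps 1-3: summarize the data and compare the fractions of observations
    within 1, 2 and 3 standard deviations of the mean against 68/95/99.7%."""
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins="auto")   # step 1: inspect for a bell shape
    mean, median, std = x.mean(), np.median(x), x.std(ddof=1)  # step 2
    z = (x - mean) / std                           # standardized scores
    within = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}  # step 3
    candidate_outliers = np.flatnonzero(np.abs(z) > 3)  # outside the 3-sigma interval
    return mean, median, within, candidate_outliers

rng = np.random.default_rng(seed=2)
x = np.append(rng.normal(50, 5, size=500), 95.0)   # one gross outlier planted
mean, median, within, outliers = empirical_rule_check(x)
print(within)    # roughly {1: 0.68, 2: 0.95, 3: 0.997} if the data are near-normal
print(outliers)  # index 500: the planted outlier
```

Step 4 is a caution rather than a computation: with small samples, the empirical fractions above can deviate substantially from 68/95/99.7% even for genuinely normal data.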
12. Conclusion:
1. Outlier detection using the univariate normal distribution is a very promising technique for detecting critical information in data, and it can be applied in various application domains.
2. The nature of the outlier detection problem depends on the application domain.
3. Different techniques are required to solve different problem formulations.

References:
[1] Hongbo Du, "Data Mining Techniques and Applications – An Introduction", Cengage Learning EMEA, Andover, Hampshire, UK, 2010.
[2] David F. Groebner, Patrick W. Shannon, Phillip C. Fry, and Kent D. Smith, "Business Statistics – A Decision Making Approach", seventh edition, Pearson International Edition, Upper Saddle River, New Jersey, USA, 2008.
[3] Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han, "Outlier Detection for Temporal Data: A Survey", IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, January 2014.
[4] Sutapat Thiprungsri and Miklos A. Vasarhelyi, "Cluster Analysis for Anomaly Detection in Accounting Data: An Audit Approach", The International Journal of Digital Accounting Research, vol. 11, pp. 69-84, ISSN 1577-8517, 2011.
[5] Ravi Ranjan and G. Sahoo, "A New Clustering Approach for Anomaly Intrusion Detection", International Journal of Data Mining & Knowledge Management Process (IJDKP), vol. 4, no. 2, March 2014.
[6] Taeshik Shon, Yongdae Kim, Cheolwon Lee, and Jongsub Moon, "A Machine Learning Framework for Network Anomaly Detection using SVM and GA", Proceedings of the 2005 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, USA, 2005.
[7] Derek L. Waller, "Statistics for Business", Elsevier, 2008.
[8] Bruce L. Bowerman, Richard T. O'Connell, J. B. Orris, and Emily S. Murphree, "Essentials of Business Statistics", McGraw-Hill/Irwin, 2010.
[9] Heinz Kohler, "Statistics for Business and Economics", Thomson Learning, Inc., 2002.