COMP 598 – Applied Machine Learning and Support Vector Machines

COMP 598 – Applied Machine Learning
Lecture 12: Ensemble Learning (cont’d)
and Support Vector Machines
Instructor: Joelle Pineau (jpineau@cs.mcgill.ca)
TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca)
Angus Leigh (angus.leigh@cs.mcgill.ca)
Class web page: www.cs.mcgill.ca/~jpineau/comp598
Outline
•  Perceptrons
–  Definition
–  Perceptron learning rule
–  Convergence
•  Margin & max margin classifiers
•  Linear support vector machines
–  Formulation as optimization problem
–  Generalized Lagrangian and dual
COMP-598: Applied Machine Learning
2
Joelle Pineau
Perceptrons
•  Given a binary classification task with data {xi, yi}i=1:n, yi ∈ {-1,+1}.
•  A perceptron (Rosenblatt, 1957) is a classifier of the form:
hw(x) = sign(wTx) = {+1 if wTx≥0; -1 otherwise}
–  As usual, w is the weight vector (including the bias term w0).
•  The decision boundary is wTx=0.
•  Perceptrons output a class, not a probability.
•  An example <xi, yi> is classified correctly if and only if: yi(wTxi)>0.
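The definition above can be sketched in a few lines of Python. This is only an illustration: the helper names `dot`, `predict`, and `is_correct`, and the example weights and input, are assumptions and not part of the slides. The input carries a leading constant 1 so that w0 plays the role of the bias.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def predict(w, x):
    """Perceptron output: +1 if w^T x >= 0, otherwise -1."""
    return 1 if dot(w, x) >= 0 else -1


def is_correct(w, x, y):
    """An example (x, y) is classified correctly iff y * w^T x > 0."""
    return y * dot(w, x) > 0


# Hypothetical weights and input; x[0] = 1 is the constant bias feature.
w = [-1.0, 2.0, 0.0]   # w0 = -1, w1 = 2, w2 = 0
x = [1.0, 1.0, 3.0]
label = predict(w, x)  # w^T x = -1 + 2 + 0 = 1, so the output is +1
```

Note that the output is a class label only; nothing in `predict` can be read as a probability.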
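Perceptrons
[Figure: perceptron unit. The inputs 1, x1, …, xm are weighted by w0, w1, …, wm and fed to a summation node ∑, which produces the output y.]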
Perceptron learning rule
•  Consider the following procedure:
Initialize wj, j=0:m randomly,
While any training examples remain incorrectly classified
–  Loop through all misclassified examples
–  For misclassified example xi, perform the update:
w ⃪ w + α yi xi
where α is the learning rate (or step size).
•  Intuition: For misclassified positive examples, increase wTx,
and reduce it for negative examples.
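The procedure above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `train_perceptron`, the `max_epochs` cap, the initialization range, and the toy AND-style dataset are all assumptions made for the example. Each input carries a leading constant 1 so that w0 acts as the bias.

```python
import random


def dot(w, x):
    """Inner product w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_perceptron(X, y, alpha=1.0, max_epochs=1000, seed=0):
    """Perceptron learning rule; returns (w, converged)."""
    rng = random.Random(seed)
    # Random initialization of w_j, j = 0..m (the range is an arbitrary choice).
    w = [rng.uniform(-0.5, 0.5) for _ in X[0]]
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:  # not classified correctly
                # Update: w <- w + alpha * y_i * x_i
                w = [wj + alpha * yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no misclassified example
            return w, True
    return w, False


# Toy linearly separable data (logical AND), with x[0] = 1 as the bias feature.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [-1, -1, -1, 1]
w, converged = train_perceptron(X, y)
```

The cap on passes is only a safeguard for data that cannot be separated; on separable data the loop exits by itself.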
Gradient-descent learning
•  The perceptron learning rule can be interpreted as a gradient
descent procedure, with optimization criterion:
Err(w) = ∑i=1:n { 0 if yiwTxi≥0; -yiwTxi otherwise }
•  For correctly classified examples, the error is zero.
•  For incorrectly classified examples, the error measures by how much
wTxi is on the wrong side of the decision boundary.
•  The error is zero when all examples are classified correctly.
•  The error is piecewise linear, so it has a gradient almost
everywhere.
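To make the link with the learning rule explicit, the gradient step for a single misclassified example can be written out. This is a short worked derivation using the slide's notation, with the criterion rewritten as a max (an equivalent form, not taken from the slides):

```latex
\[
\mathrm{Err}(w) \;=\; \sum_{i=1}^{n} \max\bigl(0,\; -y_i\, w^\top x_i\bigr)
\]
% For a misclassified example i, only the term -y_i w^T x_i depends on w:
\[
\nabla_w \bigl(-y_i\, w^\top x_i\bigr) \;=\; -y_i\, x_i
\quad\Longrightarrow\quad
w \;\leftarrow\; w - \alpha\,(-y_i\, x_i) \;=\; w + \alpha\, y_i\, x_i .
\]
```

Taking one such step per misclassified example recovers the perceptron update exactly.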
Linear separability
•  The data is linearly separable if and only if there exists a w such that:
–  For all examples, yiwTxi > 0
–  Or equivalently, the 0-1 loss is zero for some set of parameters w.
[Figure (from COMP-652, Lecture 9 - October 9, 2012): two scatter plots of + and - examples in the (x1, x2) plane: (a) Linearly separable, (b) Not linearly separable.]
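The 0-1 loss criterion can be checked directly for a candidate w. The sketch below is illustrative only: the function name `zero_one_loss`, the candidate weight vectors, and the AND/XOR toy labels are assumptions. A zero loss for one w proves the data is separable, whereas a nonzero loss for one particular w proves nothing by itself (XOR is the standard example for which no w works).

```python
def dot(w, x):
    """Inner product w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def zero_one_loss(w, X, y):
    """Number of examples that are not classified correctly (y_i * w^T x_i <= 0)."""
    return sum(1 for xi, yi in zip(X, y) if yi * dot(w, xi) <= 0)


# Four points in the plane, each with a leading 1 as the bias feature.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y_and = [-1, -1, -1, 1]   # logical AND: linearly separable
y_xor = [-1, 1, 1, -1]    # logical XOR: not linearly separable

loss_and = zero_one_loss([-1.5, 1.0, 1.0], X, y_and)  # this w separates AND
loss_xor = zero_one_loss([-0.5, 1.0, 1.0], X, y_xor)  # this w misses (1, 1)
```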
Perceptron convergence theorem
•  The basic theorem:
–  If the perceptron learning rule is applied to a linearly separable dataset, a solution will be found after some finite number of updates.
•  Additional comments:
–  The number of updates depends on the dataset, and also on the learning rate.
–  If the data is not linearly separable, there will be oscillation (which can be detected automatically).
–  Decreasing the learning rate to 0 can cause the oscillation to settle on some particular solution.
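The behaviour on non-separable data can be observed with a small experiment. The sketch below is an assumption-laden illustration: the function name `mistakes_per_epoch`, the zero initialization, the fixed number of passes, and the XOR toy data are not from the slides. Because no w classifies XOR correctly, every pass over the data must contain at least one update.

```python
def dot(w, x):
    """Inner product w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def mistakes_per_epoch(X, y, alpha=1.0, epochs=20):
    """Run the perceptron rule for a fixed number of passes and
    record how many updates were made in each pass."""
    w = [0.0] * len(X[0])
    history = []
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:
                w = [wj + alpha * yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        history.append(mistakes)
    return history


# XOR labels with a leading 1 as the bias feature: not linearly separable.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [-1, 1, 1, -1]
history = mistakes_per_epoch(X, y)
```

A count that never reaches zero within a generous budget of passes is one crude way to flag the oscillation; it is a heuristic, not a proof of non-separability.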
Perceptron learning example – separable data
[Figure: training data plotted in the (x1, x2) plane, both axes from 0 to 1; plot title: w = [0 0], w0 = 0.]
Perceptron learning example – separable data
[Figure: the same plot after perceptron updates; plot title: w = [4.111 3.8704], w0 = -4.]
Weight as a combination of input vectors
•  Recall perceptron learning rule:
w ⃪ w + α yi xi
•  If initial weights are zero, then at any step, the weights are a linear combination of feature vectors of the examples:
w = ∑i=1:n αi yi xi
where αi is the sum of step sizes used for all updates based on example i.
–  By the end of training, some examples may have never participated in an update, so will have αi=0.
•  This is called the dual representation of the classifier.
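The dual representation can be verified numerically by accumulating αi during training and rebuilding w from the sum. This is a sketch under stated assumptions: the function name `train_dual`, the name `step` for the learning rate (to avoid clashing with the per-example αi), and the AND-style toy data are illustrative choices.

```python
def dot(w, x):
    """Inner product w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_dual(X, y, step=1.0, max_epochs=1000):
    """Perceptron rule from zero weights, tracking alpha_i = total step
    size applied in updates based on example i."""
    w = [0.0] * len(X[0])
    alpha = [0.0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi * dot(w, xi) <= 0:
                w = [wj + step * yi * xj for wj, xj in zip(w, xi)]
                alpha[i] += step
                mistakes += 1
        if mistakes == 0:
            break
    return w, alpha


# Toy separable data (logical AND) with a leading 1 as the bias feature.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [-1, -1, -1, 1]
w, alpha = train_dual(X, y)

# Rebuild the weights from the dual coefficients: w = sum_i alpha_i * y_i * x_i
w_from_alpha = [
    sum(alpha[i] * y[i] * X[i][j] for i in range(len(X)))
    for j in range(len(X[0]))
]
```

Examples whose entry in `alpha` is still zero at the end never triggered an update.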
Perceptron learning example
•  Examples used (bold) and not used (faint) in updates. What do you notice?
[Figure: the (x1, x2) plot with plot title w = [4.111 3.8704], w0 = -4; examples are drawn bold or faint as described above.]
Perceptron learning example
•  Solutions are often non-unique. The solution depends on the set of instances and the order of sampling in updates.
[Figure: another perceptron solution in the (x1, x2) plane; plot title: w = [2.1395 1.9372], w0 = -2.]
Perceptron summary
•  Perceptrons can be learned to fit linearly separable data, using a gradient-descent rule.
•  Solutions are non-unique.
•  There are other fitting approaches – e.g., formulation as a linear constraint satisfaction problem / linear program.
•  For non-linearly separable data:
–  Perhaps data can be linearly separated in a different feature space?
–  Perhaps we can relax the criterion of separating all the data?
•  The logistic function offers a “smooth” version of the perceptron.
Support Vector Machines (SVMs)
•  Support vector machines (SVMs) for binary classification can be viewed as a way of training perceptrons.
•  Main new ideas:
–  An alternative optimization criterion (the “margin”), which eliminates the non-uniqueness of solutions and handles non-separable data.
–  An efficient way of operating in expanded feature spaces, which allows non-linear functions to be represented (the “kernel trick”).
•  SVMs can also be used for multiclass classification and regression.
The non-uniqueness issue
•  Consider a linearly separable dataset.
•  There is an infinite number of hyperplanes that separate the classes:
[Figure: a two-class dataset with several separating hyperplanes.]
•  Which plane is best?
•  Relatedly, for a given plane, for which points should we be most confident in the classification?
The margin and linear SVMs
•  For a given separating hyperplane, the margin is twice the (Euclidean) distance from the hyperplane to the nearest training example.
–  It is the width of the “strip” around the decision boundary containing no training examples.
[Figure: two-class data with a separating hyperplane and its margin.]
•  A linear SVM is a perceptron for which we choose w such that the margin is maximized.
Distance to the decision boundary
•  Suppose we have a decision boundary that separates the data.
[Figure: an instance xi and the nearest point xi0 on the decision boundary.]
•  Let ɣi be the distance from instance xi to the decision boundary.
•  How can we write ɣi in terms of xi, yi, w?
Distance to the decision boundary
•  The vector w is normal to the decision boundary, thus w/||w|| is the
unit normal.
•  For a positive example, the vector from xi0 to xi is ɣi w / ||w||.
•  xi0, the point on the decision boundary nearest xi, is xi-ɣi w / ||w||.
•  As xi0 is on the decision boundary,
wT( xi-ɣi w / ||w||) = 0
•  Solving for ɣi yields, for a positive example:
ɣi = wTxi / ||w||
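The solving step can be spelled out as a short worked derivation. It uses only the quantities defined on the slide; the last line, covering negative examples, follows by the same argument with the normal direction reversed and matches the form yiwTxi / ||w|| used for the margin.

```latex
\begin{align*}
w^\top\!\Bigl(x_i - \gamma_i \frac{w}{\|w\|}\Bigr) &= 0 \\
w^\top x_i - \gamma_i \frac{w^\top w}{\|w\|} &= 0 \\
w^\top x_i - \gamma_i \|w\| &= 0
  \qquad\text{(since } w^\top w = \|w\|^2\text{)} \\
\gamma_i &= \frac{w^\top x_i}{\|w\|} \qquad\text{(positive example, } y_i = +1\text{)}
\end{align*}
% For a negative example the point lies along -w/||w||, so the sign flips.
% Both cases are covered by
\[
\gamma_i \;=\; \frac{y_i\, w^\top x_i}{\|w\|} .
\]
```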
The margin
•  The margin of the hyperplane is 2M, where M = mini ɣi .
•  The most direct statement of the problem of finding a maximum
margin separating hyperplane is thus:
maxw mini ɣi = maxw mini yiwTxi / ||w||
•  Alternately:
maximize M
with respect to w, M
subject to yiwTxi / ||w|| ≥ M, ∀i
•  However this turns out to be inconvenient for optimization.
–  w appears nonlinearly in the constraints.
–  Problem is underconstrained. If (w, M) is optimal, so is (βw, M), for
any β>0.
•  Add a constraint: ||w||M = 1. Maximizing M = 1 / ||w|| is then the same as minimizing ||w||, and the problem becomes:
min ||w||
w.r.t. w
s.t. yiwTxi ≥ 1, ∀i
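The margin of a given separating hyperplane, and its invariance to rescaling w, can be checked numerically. The sketch below rests on assumptions made for illustration: the function name `margin`, the toy data, and a decision boundary through the origin (no bias feature), which keeps the geometry easy to verify by hand.

```python
import math


def dot(w, x):
    """Inner product w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def margin(w, X, y):
    """Margin 2M of the hyperplane w^T x = 0, where
    M = min_i y_i * w^T x_i / ||w||."""
    norm = math.sqrt(dot(w, w))
    M = min(yi * dot(w, xi) / norm for xi, yi in zip(X, y))
    return 2 * M


# Toy data separated by the vertical line x1 = 0.
X = [[2, 0], [3, 1], [-1, 0], [-2, -1]]
y = [1, 1, -1, -1]

m1 = margin([1.0, 0.0], X, y)   # nearest example is (-1, 0), at distance 1
m2 = margin([3.0, 0.0], X, y)   # same hyperplane, w rescaled by beta = 3
```

Both calls return the same value, which is the underconstrained behaviour that the extra constraint ||w||M = 1 removes.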
Final formulation
•  Let’s minimize ||w||^2 instead of ||w||.
(Taking the square is a monotone transformation, as ||w|| is nonnegative,
so this doesn’t change the optimal solution.)
•  This gets us to:
min ||w||^2
w.r.t. w
s.t. yiwTxi ≥ 1, ∀i
•  This can be solved! How?
–  It is a quadratic programming (QP) problem – a standard type of
optimization problem for which many efficient packages are
available. Better yet, it’s a convex (positive semidefinite) QP.
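A tiny instance that can be solved by hand shows what the QP looks like in practice. The two data points are assumptions chosen for illustration, and the boundary passes through the origin (no bias term) to keep the algebra short.

```latex
% Illustrative data (assumed): x_1 = (2, 0), y_1 = +1;  x_2 = (-1, 0), y_2 = -1
\begin{align*}
\min_{w}\quad & w_1^2 + w_2^2 \\
\text{s.t.}\quad & y_1\, w^\top x_1 = 2 w_1 \ge 1, \\
                 & y_2\, w^\top x_2 = w_1 \ge 1 .
\end{align*}
% Both constraints reduce to w_1 >= 1 and w_2 is unconstrained, so
\[
w^* = (1, 0), \qquad \|w^*\| = 1, \qquad M = \frac{1}{\|w^*\|} = 1 .
\]
```

The second constraint holds with equality at the solution: x2 is the nearest example, at distance M = 1 from the boundary, while x1 lies at distance 2.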
Example
[Figure: two plots in the (x1, x2) plane; plot titles: w = [11.7959 12.8066], w0 = -12.9174 and w = [49.6504 46.8962], w0 = -48.6936.]
We have a solution, but no support vectors yet...
Lagrange multipliers
•  Consider the following optimization problem, called the primal:
minw
f(w)
s.t.
gi(w) ≤ 0, i=1…k
•  We define the generalized Lagrangian:
L(w, α) = f(w) + ∑i=1:k αi gi(w)
where αi, i=1…k are the Lagrange multipliers.
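As an illustration, the max-margin problem from the earlier slides fits this template once its constraints are rewritten in the form gi(w) ≤ 0. The rewriting below is a sketch consistent with the formulation min ||w||^2 s.t. yiwTxi ≥ 1; it is not taken from the slides themselves.

```latex
\[
f(w) = \|w\|^2, \qquad
g_i(w) = 1 - y_i\, w^\top x_i \;\le\; 0, \quad i = 1, \dots, n
\]
\[
L(w, \alpha) \;=\; \|w\|^2 \;+\; \sum_{i=1}^{n} \alpha_i \bigl(1 - y_i\, w^\top x_i\bigr)
\]
```

Here there is one constraint, and hence one multiplier αi, per training example (k = n).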
A different optimization problem
•  Consider P(w) = maxα:αi≥0 L(w,α)
•  Observe that the following is true:
P(w) = { f(w) if all constraints are satisfied; +∞ otherwise }
•  Hence, instead of computing minw f(w) subject to the original
constraints, we can compute:
p* = minw P(w) = minw maxα:αi≥0 L(w,α)     (Primal)
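The two cases in the expression for P(w) follow from looking at each term αi gi(w) separately. The short argument below is a worked justification in the slide's notation:

```latex
\[
P(w) \;=\; \max_{\alpha:\,\alpha_i \ge 0}\; \Bigl[\, f(w) + \sum_{i=1}^{k} \alpha_i\, g_i(w) \Bigr]
\]
% Case 1: some constraint is violated, g_j(w) > 0.
%   Letting alpha_j grow without bound makes alpha_j g_j(w) arbitrarily large:
\[
g_j(w) > 0 \;\Longrightarrow\; P(w) = +\infty .
\]
% Case 2: all constraints hold, g_i(w) <= 0 for every i.
%   Each term alpha_i g_i(w) is <= 0, and the maximum 0 is reached at alpha_i = 0:
\[
g_i(w) \le 0 \;\;\forall i \;\Longrightarrow\; P(w) = f(w) .
\]
```

Minimizing P(w) over w therefore never selects an infeasible w, which is why it reproduces the original constrained problem.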
Dual optimization problem
•  Let d* = maxα:αi≥0 minw L(w,α)     (Dual)
(max and min are reversed)
•  We can show that d* ≤ p*.
–  Let p* = L(wp, αp)
–  Let d* = L(wd, αd)
–  Then d* = L(wd, αd) ≤ L(wp, αd) ≤ L(wp, αp) = p* .
•  If f and gi are convex and the gi can all be satisfied simultaneously
for some w, then we have equality: d* = p* = L(w*, α*).
•  Moreover, w*, α* solve the primal and dual if and only if they
satisfy the Karush-Kuhn-Tucker (KKT) conditions.
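The relation d* ≤ p*, and equality in the convex case, can be checked on a one-dimensional toy problem. Everything in this sketch is an assumption chosen for illustration: the problem min x^2 subject to 1 - x ≤ 0, the grid ranges, and the function names. For this problem the inner minimization has the closed form x = a/2, where a stands for the single multiplier α.

```python
def lagrangian(x, a):
    """L(x, a) = f(x) + a * g(x) with f(x) = x^2 and g(x) = 1 - x."""
    return x * x + a * (1.0 - x)


def dual_function(a):
    """min over x of L(x, a); setting dL/dx = 2x - a = 0 gives x = a / 2."""
    return lagrangian(a / 2.0, a)


# Coarse grids: multipliers a in [0, 4], feasible points x in [1, 4].
alphas = [i / 100 for i in range(401)]
feasible_xs = [1.0 + i / 100 for i in range(301)]

p_star = min(x * x for x in feasible_xs)          # primal optimum, at x = 1
d_star = max(dual_function(a) for a in alphas)    # dual optimum, at a = 2

# Weak duality: every dual value lower-bounds every feasible primal value.
weak_duality_holds = all(
    dual_function(a) <= x * x + 1e-12 for a in alphas for x in feasible_xs
)
```

Both optima equal 1 here, as expected for a convex objective with a constraint that can be satisfied.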
What you should know
From today:
•  The perceptron algorithm.
•  The margin definition for linear SVMs.
•  The use of Lagrange multipliers to transform optimization
problems.
After the next class:
•  The primal and dual optimization problems for SVMs.
•  Feature space version of SVMs.
•  The kernel trick and examples of common kernels.