Lecture 1 slides

Machine Learning (COMP-652)
Instructor: Doina Precup
Email: dprecup@cs.mcgill.ca
Teaching assistants: Neil Girdhar and Boyu Wang
Class web page: http://www.cs.mcgill.ca/~dprecup/courses/ml.html
Outline
• Administrative issues
• What is machine learning?
• Types of machine learning
• Linear models
• Error functions
Administrative issues
• Class materials:
– No required textbook, but several textbooks available
– Required or recommended readings (from books or research papers)
posted on the class web page
– Class notes: posted on the web page
• Prerequisites:
– Knowledge of a programming language
– Knowledge of probability/statistics, calculus and linear algebra; general
facility with math
– Some AI background is recommended but not required
Evaluation
• Five homework assignments (50%)
• Midterm examination (25%)
• Short quizzes (5%)
• Project (20%)
• Participation in class discussions (up to 1% extra credit)
What is learning?
• H. Simon: Any process by which a system improves its performance
• M. Minsky: Learning is making useful changes in our minds
• R. Michalski: Learning is constructing or modifying representations of
what is being experienced
• L. Valiant: Learning is the process of knowledge acquisition in the
absence of explicit programming
Why study machine learning?
Engineering reasons:
• Easier to build a learning system than to hand-code a working program!
E.g.:
– Robot that learns a map of the environment by exploring
– Programs that learn to play games by playing against themselves
• Improving on existing programs, e.g.
– Instruction scheduling and register allocation in compilers
– Combinatorial optimization problems
• Solving tasks that require a system to be adaptive, e.g.
– Speech and handwriting recognition
– “Intelligent” user interfaces
Why study machine learning?
Scientific reasons:
• Discover knowledge and patterns in highly dimensional, complex data
– Sky surveys
– High-energy physics data
– Sequence analysis in bioinformatics
– Social network analysis
– Ecosystem analysis
• Understanding animal and human learning
– How do we learn language?
– How do we recognize faces?
• Creating real AI!
“If an expert system–brilliantly designed, engineered and implemented–
cannot learn not to repeat its mistakes, it is not as intelligent as a worm
or a sea anemone or a kitten.” (Oliver Selfridge).
Very brief history
• Studied ever since computers were invented (e.g. Samuel’s checkers
player)
• Very active in 1960s (neural networks)
• Died down in the 1970s
• Revival in the early 1980s (decision trees, backpropagation, temporal-difference learning) - coined as “machine learning”
• Exploded starting in the 1990s
• Now: very active research field, several yearly conferences (e.g., ICML,
NIPS), major journals (e.g., Machine Learning, Journal of Machine
Learning Research), rapidly growing number of researchers
• The time is right to study in the field!
– Lots of recent progress in algorithms and theory
– Flood of data to be analyzed
– Computational power is available
– Growing demand for industrial applications
What are good machine learning tasks?
• There is no human expert
E.g., DNA analysis
• Humans can perform the task but cannot explain how
E.g., character recognition
• Desired function changes frequently
E.g., predicting stock prices based on recent trading data
• Each user needs a customized function
E.g., news filtering
Important application areas
• Bioinformatics:
sequence alignment, analyzing microarray data,
information integration, ...
• Computer vision: object recognition, tracking, segmentation, active
vision, ...
• Robotics: state estimation, map building, decision making
• Graphics: building realistic simulations
• Speech: recognition, speaker identification
• Financial analysis: option pricing, portfolio allocation
• E-commerce: automated trading agents, data mining, spam, ...
• Medicine: diagnosis, treatment, drug design,...
• Computer games: building adaptive opponents
• Multimedia: retrieval across diverse databases
Kinds of learning
Based on the information available:
• Supervised learning
• Reinforcement learning
• Unsupervised learning
Based on the role of the learner
• Passive learning
• Active learning
Passive and active learning
• Traditionally, learning algorithms have been passive learners, which take
a given batch of data and process it to produce a hypothesis or model
Data → Learner → Model
• Active learners are instead allowed to query the environment
– Ask questions
– Perform experiments
• Open issues: how to query the environment optimally? how to account
for the cost of queries?
Example: A data set
Cell Nuclei of Fine Needle Aspirate
• Cell samples were taken from tumors in breast cancer patients before
surgery, and imaged
• Tumors were excised
• Patients were followed to determine whether or not the cancer recurred, and how long until recurrence (or how long they remained disease-free)
Data (continued)
• Thirty real-valued variables per tumor.
• Two variables that can be predicted:
– Outcome (R=recurrence, N=non-recurrence)
– Time (time until recurrence for R; time healthy for N).
tumor size   texture   perimeter   ...   outcome   time
18.02        27.6      117.5       ...   N         31
17.99        10.38     122.8       ...   N         61
20.29        14.34     135.1       ...   R         27
...          ...       ...         ...   ...       ...
Terminology
tumor size   texture   perimeter   ...   outcome   time
18.02        27.6      117.5       ...   N         31
17.99        10.38     122.8       ...   N         61
20.29        14.34     135.1       ...   R         27
...          ...       ...         ...   ...       ...
• Columns are called input variables or features or attributes
• The outcome and time (which we are trying to predict) are called output
variables or targets
• A row in the table is called a training example or instance
• The whole table is called the (training) data set.
• The problem of predicting the recurrence is called (binary) classification
• The problem of predicting the time is called regression
More formally
tumor size   texture   perimeter   ...   outcome   time
18.02        27.6      117.5       ...   N         31
17.99        10.38     122.8       ...   N         61
20.29        14.34     135.1       ...   R         27
...          ...       ...         ...   ...       ...
• A training example i has the form: ⟨x_{i,1}, \ldots, x_{i,n}, y_i⟩, where n is the number of attributes (30 in our case).
• We will use the notation x_i to denote the column vector with elements x_{i,1}, \ldots, x_{i,n}.
• The training set D consists of m training examples
• We denote the m × n matrix of attributes by X and the size-m column vector of outputs from the data set by Y.
Supervised learning problem
• Let X denote the space of input values
• Let Y denote the space of output values
• Given a data set D ⊂ X × Y, find a function
  h : X → Y
  such that h(x) is a “good predictor” for the value of y.
• h is called a hypothesis
• Problems are categorized by the type of output domain
– If Y = R, this problem is called regression
– If Y is a categorical variable (i.e., part of a finite discrete set), the
problem is called classification
Steps to solving a supervised learning problem
1. Decide what the input-output pairs are.
2. Decide how to encode inputs and outputs.
This defines the input space X , and the output space Y.
(We will discuss this in detail later)
3. Choose a class of hypotheses/representations H .
4. ...
Example: What hypothesis class should we pick?
   x        y
  0.86     2.49
  0.09     0.83
 -0.85    -0.25
  0.87     3.10
 -0.44     0.87
 -0.43     0.02
 -1.10    -0.12
  0.40     1.81
 -0.96    -0.83
  0.17     0.43
Linear hypothesis
• Suppose y is a linear function of x:
  h_w(x) = w_0 + w_1 x_1 (+ \cdots)
• The w_i are called parameters or weights
• To simplify notation, we can add an attribute x_0 = 1 to the other n attributes (also called the bias term or intercept term):
  h_w(x) = \sum_{i=0}^{n} w_i x_i = w^T x
  where w and x are vectors of size n + 1.
How should we pick w?
Error minimization!
• Intuitively, w should make the predictions of hw close to the true values
y on the data we have
• Hence, we will define an error function or cost function to measure how
much our prediction differs from the “true” answer
• We will pick w such that the error function is minimized
How should we choose the error function?
Least mean squares (LMS)
• Main idea: try to make hw (x) close to y on the examples in the training
set
• We define a sum-of-squares error function
  J(w) = \frac{1}{2} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2
  (the 1/2 is just for convenience)
• We will choose w such as to minimize J(w)
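For concreteness, here is a tiny sketch (not part of the slides) of this cost function in NumPy; the names `X`, `y`, and `w` are assumed to hold the design matrix (with a bias column), the targets, and the parameters:

```python
# Sum-of-squares error J(w) = 1/2 * sum_i (h_w(x_i) - y_i)^2 for a linear hypothesis h_w(x) = w^T x.
import numpy as np

def J(w, X, y):
    """X has one row per training example (including the bias column); y is the target vector."""
    residuals = X @ w - y
    return 0.5 * np.sum(residuals ** 2)
```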
Steps to solving a supervised learning problem
1. Decide what the input-output pairs are.
2. Decide how to encode inputs and outputs.
This defines the input space X , and the output space Y.
3. Choose a class of hypotheses/representations H .
4. Choose an error function (cost function) to define the best hypothesis
5. Choose an algorithm for searching efficiently through the space of
hypotheses.
Notation reminder
• Consider a function f(u_1, u_2, \ldots, u_n) : R^n → R (for us, this will usually be an error function)
• The partial derivative w.r.t. u_i is denoted:
  \frac{\partial}{\partial u_i} f(u_1, u_2, \ldots, u_n) : R^n → R
  The partial derivative is the derivative along the u_i axis, keeping all other variables fixed.
• The gradient ∇f(u_1, u_2, \ldots, u_n) : R^n → R^n is a function which outputs a vector containing the partial derivatives.
  That is:
  ∇f = \left( \frac{\partial}{\partial u_1} f, \frac{\partial}{\partial u_2} f, \ldots, \frac{\partial}{\partial u_n} f \right)
A bit of algebra
\frac{\partial}{\partial w_j} J(w) = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2
= \frac{1}{2} \sum_{i=1}^{m} 2 (h_w(x_i) - y_i) \frac{\partial}{\partial w_j} (h_w(x_i) - y_i)
= \sum_{i=1}^{m} (h_w(x_i) - y_i) \frac{\partial}{\partial w_j} \left( \sum_{l=0}^{n} w_l x_{i,l} - y_i \right)
= \sum_{i=1}^{m} (h_w(x_i) - y_i) x_{i,j}
Setting all these partial derivatives to 0, we get a linear system with (n + 1)
equations and (n + 1) unknowns.
The solution
• Recalling some multivariate calculus:
∇_w J = ∇_w (Xw - Y)^T (Xw - Y)
      = ∇_w (w^T X^T X w - Y^T X w - w^T X^T Y + Y^T Y)
      = 2 X^T X w - 2 X^T Y
• Setting the gradient equal to zero:
  2 X^T X w - 2 X^T Y = 0
  ⇒ X^T X w = X^T Y
  ⇒ w = (X^T X)^{-1} X^T Y
• The inverse exists if the columns of X are linearly independent.
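As an illustration (not from the slides), here is a minimal NumPy sketch of this closed-form solution; the synthetic data, the true weight vector, and the random seed are made up for the example:

```python
# Minimal sketch of the closed-form least-squares solution w = (X^T X)^{-1} X^T Y.
# The data below is synthetic, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
X = np.hstack([rng.normal(size=(m, n)), np.ones((m, 1))])   # n features plus a column of ones (bias)
true_w = np.array([1.5, -2.0, 0.3, 0.7])
Y = X @ true_w + 0.1 * rng.normal(size=m)                    # noisy linear targets

# Solve the normal equations X^T X w = X^T Y (valid when X^T X is invertible)
w = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq is the numerically safer alternative when X^T X is ill-conditioned
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w, w_lstsq)
```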
Example: Data and best linear hypothesis

[Scatter plot of the training data with the best-fit line y = 1.60x + 1.05]
Linear regression summary
• The optimal solution (minimizing sum-squared-error) can be computed
in polynomial time in the size of the data set.
• The solution is w = (X T X)−1X T Y , where X is the data matrix
augmented with a column of ones, and Y is the column vector of target
outputs.
• A very rare case in which an analytical, exact solution is possible
Coming back to mean-squared error function...
• Good intuitive feel (small errors are ignored, large errors are penalized)
• Nice math (closed-form solution, unique global optimum)
• Geometric interpretation (Doina does this on the board)
• Any other interpretation?
A probabilistic assumption
• Assume y_i is a noisy target value, generated from a hypothesis h_w(x)
• More specifically, assume that there exists a w such that:
  y_i = h_w(x_i) + ε_i
  where ε_i is a random variable (noise) drawn independently for each x_i according to a Gaussian (normal) distribution with mean zero and standard deviation σ.
• How should we choose the parameter vector w?
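To make the assumption concrete, here is a small sketch (not from the slides) that simulates data from this model; the particular weight vector and noise level are arbitrary illustration values:

```python
# Simulate the assumed generative process y_i = h_w(x_i) + eps_i with Gaussian noise.
import numpy as np

rng = np.random.default_rng(1)
w = np.array([1.05, 1.60])                         # bias and slope (illustration only)
sigma = 0.3                                        # noise standard deviation

x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])          # add the bias attribute x_0 = 1
eps = rng.normal(0.0, sigma, size=x.shape)         # eps_i ~ N(0, sigma), drawn i.i.d.
y = X @ w + eps                                    # y_i = h_w(x_i) + eps_i
```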
Bayes theorem in learning
Let h be a hypothesis and D be the set of training data.
Using Bayes' theorem, we have:
  P(h|D) = \frac{P(D|h) P(h)}{P(D)},
where:
• P(h) is the prior probability of hypothesis h
• P(D) = \int_h P(D|h) P(h) \, dh is the probability of the training data D (normalization, independent of h)
• P(h|D) is the probability of h given D
• P(D|h) is the probability of D given h (the likelihood of the data)
Choosing hypotheses
• What is the most probable hypothesis given the training data?
• Maximum a posteriori (MAP) hypothesis h_{MAP}:
  h_{MAP} = \arg\max_{h \in H} P(h|D)
          = \arg\max_{h \in H} \frac{P(D|h) P(h)}{P(D)}   (using Bayes' theorem)
          = \arg\max_{h \in H} P(D|h) P(h)
  The last step is because P(D) is independent of h (so it is constant for the maximization)
• This is the Bayesian answer (more in a minute)
Maximum likelihood estimation
h_{MAP} = \arg\max_{h \in H} P(D|h) P(h)
• If we assume P(h_i) = P(h_j) (all hypotheses are equally likely a priori) then we can further simplify, and choose the maximum likelihood (ML) hypothesis:
  h_{ML} = \arg\max_{h \in H} P(D|h) = \arg\max_{h \in H} L(h)
• Standard assumption: the training examples are independently and identically distributed (i.i.d.)
• This allows us to simplify P(D|h):
  P(D|h) = \prod_{i=1}^{m} P(⟨x_i, y_i⟩ | h) = \prod_{i=1}^{m} P(y_i | x_i; h) P(x_i)
The log trick
• We want to maximize:
  L(h) = \prod_{i=1}^{m} P(y_i | x_i; h) P(x_i)
  This is a product, and products are hard to maximize!
• Instead, we will maximize log L(h) (the log-likelihood function):
  \log L(h) = \sum_{i=1}^{m} \log P(y_i | x_i; h) + \sum_{i=1}^{m} \log P(x_i)
• The second sum depends on D, but not on h, so it can be ignored in the
search for a good hypothesis
Maximum likelihood for regression
• Adopt the assumption that:
  y_i = h_w(x_i) + ε_i,  where ε_i ∼ N(0, σ).
• The best hypothesis maximizes the likelihood of y_i − h_w(x_i) = ε_i
• Hence,
  L(w) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2} \left( \frac{y_i - h_w(x_i)}{\sigma} \right)^2}
  because the noise variables ε_i are drawn from a Gaussian distribution
Applying the log trick
\log L(w) = \sum_{i=1}^{m} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2} \frac{(y_i - h_w(x_i))^2}{\sigma^2}} \right)
          = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2} \sum_{i=1}^{m} \frac{(y_i - h_w(x_i))^2}{\sigma^2}

Maximizing the right-hand side is the same as minimizing:
  \frac{1}{2} \sum_{i=1}^{m} \frac{(y_i - h_w(x_i))^2}{\sigma^2}

This is our old friend, the sum-squared-error function! (the constants that are independent of h can again be ignored)
Maximum likelihood hypothesis for least-squares
estimators
• Under the assumption that the training examples are i.i.d. and that we
have Gaussian target noise, the maximum likelihood parameters w are
those minimizing the sum squared error:
w^* = \arg\min_w \sum_{i=1}^{m} (y_i - h_w(x_i))^2
• This makes explicit the hypothesis behind minimizing the sum-squared
error
• If the noise is not normally distributed, maximizing the likelihood will not
be the same as minimizing the sum-squared error
• In practice, different loss functions are used depending on the noise
assumption
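A quick numerical sanity check (my own sketch, with synthetic data) that the Gaussian negative log-likelihood and the sum-squared error differ only by terms that do not depend on w, so both are minimized by the same parameters:

```python
# Compare the Gaussian negative log-likelihood with the (scaled) sum-squared error.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)
y = 0.7 + 1.6 * x + rng.normal(0, 0.2, size=50)    # synthetic data, illustration only
X = np.column_stack([np.ones_like(x), x])
sigma = 0.2

def sse(w):
    """Sum-squared error J(w) = 1/2 sum_i (y_i - h_w(x_i))^2."""
    r = y - X @ w
    return 0.5 * np.sum(r ** 2)

def neg_log_likelihood(w):
    """Negative log-likelihood under y_i = h_w(x_i) + eps_i, eps_i ~ N(0, sigma)."""
    r = y - X @ w
    return 0.5 * np.sum(r ** 2) / sigma ** 2 + 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2)

# The difference below is the same constant for any w, so argmin_w is identical for both.
for w in [np.zeros(2), np.array([0.7, 1.6]), rng.normal(size=2)]:
    print(neg_log_likelihood(w) - sse(w) / sigma ** 2)
```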
A graphical representation for the data generation
process
[Graphical model: x ∼ P(x) and ε ∼ N(0, σ) are inputs; h_w is deterministic but unknown; y is a deterministic function of h_w(x) and ε]
• Circles represent (random) variables
• Arrows represent dependencies between variables
• Some variables are observed, others need to be inferred because they are
hidden (latent)
• New assumptions can be incorporated by making the model more
complicated
Predicting recurrence time based on tumor size
[Scatter plot: tumor radius (mm?) on the x-axis (10 to 30) vs. time to recurrence (months?) on the y-axis (0 to 80)]
Is linear regression enough?
• Linear regression is too simple for most realistic problems
But it should be the first thing you try for real-valued outputs!
• Problems can also occur if X^T X is not invertible.
• Two possible solutions:
1. Transform the data
– Add cross-terms, higher-order terms
– More generally, apply a transformation of the inputs from X to some
other space X 0, then do linear regression in the transformed space
2. Use a different hypothesis class
• Today we focus on the first approach
Polynomial fits
• Suppose we want to fit a higher-degree polynomial to the data.
  (E.g., y = w_2 x^2 + w_1 x + w_0.)
• Suppose for now that there is a single input variable per training sample.
• How do we do it?
Answer: Polynomial regression
• Given data: (x1, y1), (x2, y2), . . . , (xm, ym).
• Suppose we want a degree-d polynomial fit.
• Let Y be as before and let
  X = \begin{pmatrix}
  x_1^d & \cdots & x_1^2 & x_1 & 1 \\
  x_2^d & \cdots & x_2^2 & x_2 & 1 \\
  \vdots & & \vdots & \vdots & \vdots \\
  x_m^d & \cdots & x_m^2 & x_m & 1
  \end{pmatrix}
• Solve the linear regression Xw ≈ Y .
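A short sketch (illustrative, not from the slides) of building this design matrix and solving the resulting linear regression with NumPy:

```python
# Build the degree-d polynomial design matrix with columns x^d, ..., x^2, x, 1 and fit by least squares.
import numpy as np

def poly_design_matrix(x, d):
    """Rows are [x_i^d, ..., x_i^2, x_i, 1] for a 1-D array of inputs x."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** k for k in range(d, -1, -1)])

def fit_polynomial(x, y, d):
    """Least-squares fit; returns coefficients ordered from x^d down to the constant term."""
    X = poly_design_matrix(x, d)
    w, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return w
```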
Example of quadratic regression: Data matrices

X = \begin{pmatrix}
0.75 & 0.86 & 1 \\
0.01 & 0.09 & 1 \\
0.73 & -0.85 & 1 \\
0.76 & 0.87 & 1 \\
0.19 & -0.44 & 1 \\
0.18 & -0.43 & 1 \\
1.22 & -1.10 & 1 \\
0.16 & 0.40 & 1 \\
0.93 & -0.96 & 1 \\
0.03 & 0.17 & 1
\end{pmatrix}
\qquad
Y = \begin{pmatrix}
2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43
\end{pmatrix}
X^T X

X^T X = \begin{pmatrix}
0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03 \\
0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix} X
= \begin{pmatrix}
4.11 & -1.64 & 4.95 \\
-1.64 & 4.95 & -1.39 \\
4.95 & -1.39 & 10
\end{pmatrix}
X^T Y

X^T Y = \begin{pmatrix}
0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03 \\
0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix}
2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43
\end{pmatrix}
= \begin{pmatrix} 3.60 \\ 6.49 \\ 8.34 \end{pmatrix}
Solving for w
w = (X^T X)^{-1} X^T Y = \begin{pmatrix}
4.11 & -1.64 & 4.95 \\
-1.64 & 4.95 & -1.39 \\
4.95 & -1.39 & 10
\end{pmatrix}^{-1}
\begin{pmatrix} 3.60 \\ 6.49 \\ 8.34 \end{pmatrix}
= \begin{pmatrix} 0.68 \\ 1.74 \\ 0.73 \end{pmatrix}

So the best order-2 polynomial is y = 0.68x^2 + 1.74x + 0.73.
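The numbers above can be checked with a few lines of NumPy (a sketch; small discrepancies with the printed values are due to rounding in the slides):

```python
# Reproduce the worked quadratic-regression example: X^T X, X^T Y, and w.
import numpy as np

x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

X = np.column_stack([x ** 2, x, np.ones_like(x)])   # columns: x^2, x, 1
print(X.T @ X)                                       # approximately the 3x3 matrix shown above
print(X.T @ y)                                       # approximately [3.60, 6.49, 8.34]
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                             # approximately [0.68, 1.74, 0.73]
```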
Linear function approximation in general
• Given a set of examples ⟨x_i, y_i⟩_{i=1...m}, we fit a hypothesis
  h_w(x) = \sum_{k=0}^{K-1} w_k \phi_k(x) = w^T \phi(x)
  where the \phi_k are called basis functions
• The best w is considered the one which minimizes the sum-squared error over the training data:
  \sum_{i=1}^{m} (y_i - h_w(x_i))^2
• We can find the best w in closed form:
  w = (\Phi^T \Phi)^{-1} \Phi^T y
or by other methods (see next lecture)
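As a concrete illustration (my own sketch, using the Gaussian basis functions described on later slides; the centers, width, and synthetic data are arbitrary choices):

```python
# Linear regression on fixed basis functions: w = (Phi^T Phi)^{-1} Phi^T y.
import numpy as np

def gaussian_features(x, centers, width):
    """Feature matrix Phi: a bias column plus phi_k(x) = exp(-(x - mu_k)^2 / (2 width^2))."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.column_stack([np.ones_like(x), phi])

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=50)
y = np.sin(3 * x) + 0.1 * rng.normal(size=50)        # a nonlinear target, for illustration

centers = np.linspace(-1, 1, 9)                      # mu_k: fixed, evenly spaced
Phi = gaussian_features(x, centers, width=0.25)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # least-squares solution in feature space
y_hat = Phi @ w                                      # predictions on the training inputs
```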
Linear models in general
• By linear models, we mean that the hypothesis function hw (x) is a linear
function of the parameters w
• This does not mean that h_w(x) is a linear function of the input vector x (e.g., polynomial regression)
• In general,
  h_w(x) = \sum_{k=0}^{K-1} w_k \phi_k(x) = w^T \phi(x)
  where the \phi_k are called basis functions
• Usually, we will assume that \phi_0(x) = 1, ∀x, to create a bias term
• The vector of hypothesis values on the training set can alternatively be written as:
  h_w = \Phi w
  where \Phi is a matrix with one row per instance; row j contains \phi(x_j).
• Basis functions are fixed
Example basis functions: Polynomials
[Plot of polynomial basis functions on x ∈ [−1, 1]]

  \phi_k(x) = x^k

“Global” functions: a small change in x may cause large change in the output of many basis functions
Example basis functions: Gaussians
[Plot of Gaussian basis functions on x ∈ [−1, 1]]

  \phi_k(x) = \exp\left( -\frac{(x - \mu_k)^2}{2\sigma^2} \right)

• \mu_k controls the position along the x-axis
• \sigma controls the width (activation radius)
• \mu_k, \sigma fixed for now (later we discuss adjusting them)
• Usually thought of as “local” functions: if \sigma is relatively small, a small change in x only causes a change in the output of a few basis functions (the ones with means close to x)
Example basis functions: Sigmoidal
[Plot of sigmoidal basis functions on x ∈ [−1, 1]]

  \phi_k(x) = \sigma\left( \frac{x - \mu_k}{s} \right), where \sigma(a) = \frac{1}{1 + \exp(-a)}

• \mu_k controls the position along the x-axis
• s controls the slope
• \mu_k, s fixed for now (later we discuss adjusting them)
• “Local” functions: a small change in x only causes a change in the output of a few basis functions (most others will stay close to 0 or 1)
Order-2 fit

[Plot of the order-2 polynomial fit to the data]

Is this a better fit to the data?
Order-3 fit

[Plot of the order-3 polynomial fit to the data]

Is this a better fit to the data?
Order-4 fit

[Plot of the order-4 polynomial fit to the data]

Is this a better fit to the data?
Order-5 fit

[Plot of the order-5 polynomial fit to the data]

Is this a better fit to the data?
Order-6 fit

[Plot of the order-6 polynomial fit to the data]

Is this a better fit to the data?
Order-7 fit

[Plot of the order-7 polynomial fit to the data]

Is this a better fit to the data?
Order-8 fit

[Plot of the order-8 polynomial fit to the data]

Is this a better fit to the data?
Order-9 fit

[Plot of the order-9 polynomial fit to the data]

Is this a better fit to the data?
Overfitting
• A general, HUGELY IMPORTANT problem for all machine learning
algorithms
• We can find a hypothesis that predicts the training data perfectly but does not generalize well to new data
• E.g., a lookup table!
• We are seeing an instance here: if we have a lot of parameters, the hypothesis “memorizes” the data points, but is wild everywhere else.
Another overfitting example
[Four panels showing polynomial fits of degree M = 0, 1, 3 and 9 to the same data set (axes: x vs. t)]
• The higher the degree of the polynomial M , the more degrees of freedom,
and the more capacity to “overfit” the training data
• Typical overfitting means that error on the training data is very low, but
error on new instances is high
Overfitting more formally
• Assume that the data is drawn from some fixed, unknown probability
distribution
• Every hypothesis has a “true” error J^*(h), which is the expected error when data is drawn from the distribution.
• Because we do not have all the data, we measure the error on the training set, J_D(h)
• Suppose we compare hypotheses h_1 and h_2 on the training set, and J_D(h_1) < J_D(h_2)
• If h_2 is “truly” better, i.e. J^*(h_2) < J^*(h_1), our algorithm is overfitting.
• We need theoretical and empirical methods to guard against it!
Typical overfitting plot
[Plot: root-mean-square error E_RMS vs. polynomial degree M (from 0 to 9), for the training set and the test set]
• The training error decreases with the degree of the polynomial M , i.e.
the complexity of the hypothesis
• The testing error, measured on independent data, decreases at first, then
starts increasing
• Cross-validation helps us:
– Find a good hypothesis class (M in our case), using a validation set
of data
– Report unbiased results, using a test set, untouched during either
parameter training or validation
Cross-validation
• A general procedure for estimating the true error of a predictor
• The data is split into two subsets:
– A training and validation set used only to find the right predictor
– A test set used to report the prediction error of the algorithm
• These sets must be disjoint!
• The process is repeated several times, and the results are averaged to
provide error estimates.
Example: Polynomial regression

[Four panels showing polynomial regression fits (axes: x vs. y)]
Leave-one-out cross-validation
1. For each order of polynomial, d:
(a) Repeat the following procedure m times:
i. Leave out the ith instance from the training set, to estimate the true prediction error; we will put it in a validation set
ii. Use all the other instances to find the best parameter vector, w_{d,i}
iii. Measure the error in predicting the label on the instance left out, using the w_{d,i} parameter vector; call this J_{d,i}
iv. This is an unbiased estimate of the true prediction error
(b) Compute the average of the estimated errors: J_d = \frac{1}{m} \sum_{i=1}^{m} J_{d,i}
2. Choose the d with the lowest average estimated error: d^* = \arg\min_d J_d
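A compact sketch of this procedure (illustrative code; it uses the ten (x, y) points from this running example):

```python
# Leave-one-out cross-validation for choosing the polynomial degree d.
import numpy as np

def design(x, d):
    """Degree-d polynomial design matrix with columns 1, x, ..., x^d."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** k for k in range(d + 1)])

def loo_error(x, y, d):
    """Average squared validation error, leaving out one example at a time."""
    m = len(x)
    errs = []
    for i in range(m):
        train = np.arange(m) != i
        w, *_ = np.linalg.lstsq(design(x[train], d), y[train], rcond=None)
        pred = design(x[i:i + 1], d) @ w                 # prediction on the held-out example
        errs.append((pred[0] - y[i]) ** 2)
    return np.mean(errs)

x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])
d_star = min(range(1, 10), key=lambda d: loo_error(x, y, d))
print(d_star)   # the slides report d = 2 as the optimal choice for these data
```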
Estimating true error for d = 1
D = {(0.86, 2.49), (0.09, 0.83), (−0.85, −0.25), (0.87, 3.10), (−0.44, 0.87),
(−0.43, 0.02), (−1.10, −0.12), (0.40, 1.81), (−0.96, −0.83), (0.17, 0.43)}.
Iter   D_train                   D_valid            Error_train   Error_valid (J_{1,i})
1      D − {(0.86, 2.49)}        (0.86, 2.49)       0.4928        0.0044
2      D − {(0.09, 0.83)}        (0.09, 0.83)       0.1995        0.1869
3      D − {(−0.85, −0.25)}      (−0.85, −0.25)     0.3461        0.0053
4      D − {(0.87, 3.10)}        (0.87, 3.10)       0.3887        0.8681
5      D − {(−0.44, 0.87)}       (−0.44, 0.87)      0.2128        0.3439
6      D − {(−0.43, 0.02)}       (−0.43, 0.02)      0.1996        0.1567
7      D − {(−1.10, −0.12)}      (−1.10, −0.12)     0.5707        0.7205
8      D − {(0.40, 1.81)}        (0.40, 1.81)       0.2661        0.0203
9      D − {(−0.96, −0.83)}      (−0.96, −0.83)     0.3604        0.2033
10     D − {(0.17, 0.43)}        (0.17, 0.43)       0.2138        1.0490
mean                                                0.2188        0.3558
Leave-one-out cross-validation results
d   Error_train   Error_valid (J_d)
1   0.2188        0.3558
2   0.1504        0.3095
3   0.1384        0.4764
4   0.1259        1.1770
5   0.0742        1.2828
6   0.0598        1.3896
7   0.0458        38.819
8   0.0000        6097.5
9   0.0000        6097.5
• Typical overfitting behavior: as d increases, the training error decreases, while the validation error decreases at first, then starts increasing again
• Optimal choice: d = 2. Overfitting for d > 2
Estimating both hypothesis class and true error
• Suppose we want to compare polynomial regression with some other
algorithm
• We chose the hypothesis class (i.e., the degree of the polynomial, d^*) based on the estimates J_d
• Hence J_{d^*} is not unbiased - our procedure was aimed at optimizing it
• If we want to have both a hypothesis class and an unbiased error estimate,
we need to tweak the leave-one-out procedure a bit
Cross-validation with validation and testing sets
1. For each example j:
(a) Create a test set consisting of just the jth example, D_j = {(x_j, y_j)}, and a training and validation set \bar{D}_j = D − {(x_j, y_j)}
(b) Use the leave-one-out procedure from above on \bar{D}_j to find a hypothesis h^*_j
• Note that this will split the data internally, in order to both train
and validate!
• Typically, only one such split is used, rather than all possible splits
(c) Evaluate the error of h^*_j on D_j (call it J(h^*_j))
2. Report the average of the J(h^*_j) values, as a measure of performance of the whole algorithm
whole algorithm
• Note that at this point we do not have one predictor, but several!
• Several methods can then be used to come up with just one predictor
(more on this later)
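A sketch of this nested procedure (illustrative code; the helpers are repeated from the earlier leave-one-out sketch so the snippet is self-contained):

```python
# Nested cross-validation: an outer loop holds out one test example; the inner loop
# uses leave-one-out on the remaining data to select the polynomial degree.
import numpy as np

def design(x, d):
    """Degree-d polynomial design matrix with columns 1, x, ..., x^d."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** k for k in range(d + 1)])

def loo_error(x, y, d):
    """Leave-one-out average squared validation error for a degree-d fit."""
    m = len(x)
    errs = []
    for i in range(m):
        tr = np.arange(m) != i
        w, *_ = np.linalg.lstsq(design(x[tr], d), y[tr], rcond=None)
        errs.append((design(x[i:i + 1], d) @ w - y[i])[0] ** 2)
    return np.mean(errs)

def nested_cv_error(x, y, degrees=range(1, 6)):
    """Inner LOO picks the degree on the remaining data; the selected fit is scored on the held-out example."""
    m = len(x)
    test_errs = []
    for j in range(m):
        rest = np.arange(m) != j
        d_star = min(degrees, key=lambda d: loo_error(x[rest], y[rest], d))
        w, *_ = np.linalg.lstsq(design(x[rest], d_star), y[rest], rcond=None)
        test_errs.append((design(x[j:j + 1], d_star) @ w - y[j])[0] ** 2)
    return np.mean(test_errs)
```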
Summary of leave-one-out cross-validation
• A very easy to implement algorithm
• Theoretically, we obtain an unbiased estimate of the true error of a
predictor
• It can indicate problematic examples in a data set (when using multiple
algorithms)
• Computational cost scales with the number of instances (examples), so
it can be prohibitive, especially if finding the best predictor is expensive
• Alternative: k-fold cross-validation