Machine Learning (COMP-652)
Instructor: Doina Precup
Email: dprecup@cs.mcgill.ca
Teaching assistants: Neil Girdhar and Boyu Wang
Class web page: http://www.cs.mcgill.ca/~dprecup/courses/ml.html

Outline
• Administrative issues
• What is machine learning?
• Types of machine learning
• Linear models
• Error functions

Administrative issues
• Class materials:
  – No required textbook, but several textbooks available
  – Required or recommended readings (from books or research papers) posted on the class web page
  – Class notes: posted on the web page
• Prerequisites:
  – Knowledge of a programming language
  – Knowledge of probability/statistics, calculus and linear algebra; general facility with math
  – Some AI background is recommended but not required

Evaluation
• Five homework assignments (50%)
• Midterm examination (25%)
• Short quizzes (5%)
• Project (20%)
• Participation in class discussions (up to 1% extra credit)

What is learning?
• H. Simon: Any process by which a system improves its performance
• M. Minsky: Learning is making useful changes in our minds
• R. Michalski: Learning is constructing or modifying representations of what is being experienced
• L. Valiant: Learning is the process of knowledge acquisition in the absence of explicit programming

Why study machine learning?
Engineering reasons:
• It is easier to build a learning system than to hand-code a working program! E.g.:
  – A robot that learns a map of the environment by exploring
  – Programs that learn to play games by playing against themselves
• Improving on existing programs, e.g.
  – Instruction scheduling and register allocation in compilers
  – Combinatorial optimization problems
• Solving tasks that require a system to be adaptive, e.g.
  – Speech and handwriting recognition
  – “Intelligent” user interfaces

Why study machine learning?
Scientific reasons:
• Discover knowledge and patterns in high-dimensional, complex data
  – Sky surveys
  – High-energy physics data
  – Sequence analysis in bioinformatics
  – Social network analysis
  – Ecosystem analysis
• Understanding animal and human learning
  – How do we learn language?
  – How do we recognize faces?
• Creating real AI!
  “If an expert system–brilliantly designed, engineered and implemented–cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten.” (Oliver Selfridge)

Very brief history
• Studied ever since computers were invented (e.g., Samuel’s checkers player)
• Very active in the 1960s (neural networks)
• Died down in the 1970s
• Revival in the early 1980s (decision trees, backpropagation, temporal-difference learning); the term “machine learning” was coined
• Exploded starting in the 1990s
• Now: a very active research field, several yearly conferences (e.g., ICML, NIPS), major journals (e.g., Machine Learning, Journal of Machine Learning Research), a rapidly growing number of researchers
• The time is right to study the field!
  – Lots of recent progress in algorithms and theory
  – A flood of data to be analyzed
  – Computational power is available
  – Growing demand for industrial applications

What are good machine learning tasks?
• There is no human expert (e.g., DNA analysis)
• Humans can perform the task but cannot explain how (e.g., character recognition)
• The desired function changes frequently (e.g., predicting stock prices based on recent trading data)
• Each user needs a customized function (e.g., news filtering)

Important application areas
• Bioinformatics: sequence alignment, analyzing microarray data, information integration, ...
• Computer vision: object recognition, tracking, segmentation, active vision, ...
• Robotics: state estimation, map building, decision making
• Graphics: building realistic simulations
• Speech: recognition, speaker identification
• Financial analysis: option pricing, portfolio allocation
• E-commerce: automated trading agents, data mining, spam, ...
• Medicine: diagnosis, treatment, drug design, ...
• Computer games: building adaptive opponents
• Multimedia: retrieval across diverse databases

Kinds of learning
Based on the information available:
• Supervised learning
• Reinforcement learning
• Unsupervised learning
Based on the role of the learner:
• Passive learning
• Active learning

Passive and active learning
• Traditionally, learning algorithms have been passive learners, which take a given batch of data and process it to produce a hypothesis or model:
  Data → Learner → Model
• Active learners are instead allowed to query the environment:
  – Ask questions
  – Perform experiments
• Open issues: how to query the environment optimally? how to account for the cost of queries?

Example: A data set
Cell Nuclei of Fine Needle Aspirate
• Cell samples were taken from tumors in breast cancer patients before surgery, and imaged
• Tumors were excised
• Patients were followed to determine whether or not the cancer recurred, and how long until recurrence (or how long they remained disease-free)

Data (continued)
• Thirty real-valued variables per tumor.
• Two variables that can be predicted:
  – Outcome (R = recurrence, N = non-recurrence)
  – Time (until recurrence for R; time healthy for N)

  tumor size   texture   perimeter   ...   outcome   time
  18.02        27.60     117.5       ...   N         31
  17.99        10.38     122.8       ...   N         61
  20.29        14.34     135.1       ...   R         27
  ...

Terminology
(Referring to the table above:)
• Columns are called input variables or features or attributes
• The outcome and time (which we are trying to predict) are called output variables or targets
• A row in the table is called a training example or instance
• The whole table is called the (training) data set
• The problem of predicting the recurrence is called (binary) classification
• The problem of predicting the time is called regression

More formally
• A training example i has the form ⟨x_{i,1}, ..., x_{i,n}, y_i⟩, where n is the number of attributes (30 in our case).
• We will use the notation x_i to denote the column vector with elements x_{i,1}, ..., x_{i,n}.
• The training set D consists of m training examples.
• We denote the m × n matrix of attributes by X and the size-m column vector of outputs from the data set by Y (a small code sketch of this notation appears below).
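To make the notation concrete, here is a minimal NumPy sketch (not part of the original notes) representing just the three example rows and three of the thirty features from the table above; everything beyond those values is illustrative.

```python
import numpy as np

# Three example rows, three features (tumor size, texture, perimeter).
# In the real data set there are m examples and n = 30 attributes.
X = np.array([
    [18.02, 27.60, 117.5],
    [17.99, 10.38, 122.8],
    [20.29, 14.34, 135.1],
])                                     # m x n matrix of attributes
y_class = np.array(["N", "N", "R"])    # classification target (recurrence)
y_time = np.array([31.0, 61.0, 27.0])  # regression target (time)

m, n = X.shape
x_1 = X[0, :]                          # x_1: attribute vector of the first training example
print(m, n, x_1)
```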
Supervised learning problem
• Let X denote the space of input values
• Let Y denote the space of output values
• Given a data set D ⊂ X × Y, find a function
    h : X → Y
  such that h(x) is a “good predictor” for the value of y
• h is called a hypothesis
• Problems are categorized by the type of output domain:
  – If Y = R, the problem is called regression
  – If Y is a categorical variable (i.e., part of a finite discrete set), the problem is called classification

Steps to solving a supervised learning problem
1. Decide what the input-output pairs are.
2. Decide how to encode inputs and outputs. This defines the input space X and the output space Y. (We will discuss this in detail later.)
3. Choose a class of hypotheses/representations H.
4. ...

Example: What hypothesis class should we pick?

     x       y
   0.86    2.49
   0.09    0.83
  -0.85   -0.25
   0.87    3.10
  -0.44    0.87
  -0.43    0.02
  -1.10   -0.12
   0.40    1.81
  -0.96   -0.83
   0.17    0.43

Linear hypothesis
• Suppose y were a linear function of x:
    h_w(x) = w_0 + w_1 x_1 + \cdots + w_n x_n
• The w_i are called parameters or weights
• To simplify notation, we can add an attribute x_0 = 1 to the other n attributes (also called the bias term or intercept term):
    h_w(x) = \sum_{i=0}^{n} w_i x_i = w^T x
  where w and x are vectors of size n + 1.
How should we pick w?

Error minimization!
• Intuitively, w should make the predictions of h_w close to the true values y on the data we have
• Hence, we will define an error function or cost function to measure how much our prediction differs from the “true” answer
• We will pick w such that the error function is minimized
How should we choose the error function?

Least mean squares (LMS)
• Main idea: try to make h_w(x) close to y on the examples in the training set
• We define a sum-of-squares error function
    J(w) = \frac{1}{2} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2
  (the 1/2 is just for convenience)
• We will choose w so as to minimize J(w)

Steps to solving a supervised learning problem
1. Decide what the input-output pairs are.
2. Decide how to encode inputs and outputs. This defines the input space X and the output space Y.
3. Choose a class of hypotheses/representations H.
4. Choose an error function (cost function) to define the best hypothesis.
5. Choose an algorithm for searching efficiently through the space of hypotheses.

Notation reminder
• Consider a function f(u_1, u_2, ..., u_n) : R^n → R (for us, this will usually be an error function)
• The partial derivative with respect to u_i is denoted
    \frac{\partial}{\partial u_i} f(u_1, u_2, ..., u_n) : R^n → R
  The partial derivative is the derivative along the u_i axis, keeping all other variables fixed.
• The gradient ∇f(u_1, u_2, ..., u_n) : R^n → R^n is a function which outputs a vector containing the partial derivatives:
    \nabla f = \left( \frac{\partial f}{\partial u_1}, \frac{\partial f}{\partial u_2}, \ldots, \frac{\partial f}{\partial u_n} \right)

A bit of algebra
    \frac{\partial}{\partial w_j} J(w) = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2
    = \frac{1}{2} \cdot 2 \sum_{i=1}^{m} (h_w(x_i) - y_i) \frac{\partial}{\partial w_j} (h_w(x_i) - y_i)
    = \sum_{i=1}^{m} (h_w(x_i) - y_i) \frac{\partial}{\partial w_j} \left( \sum_{l=0}^{n} w_l x_{i,l} - y_i \right)
    = \sum_{i=1}^{m} (h_w(x_i) - y_i) x_{i,j}
Setting all these partial derivatives to 0, we get a linear system with (n + 1) equations and (n + 1) unknowns. (A short code sketch of J(w) and its gradient follows below.)
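The following minimal NumPy sketch (function names are my own, not from the course) evaluates the cost J(w) and the gradient just derived; it assumes X already includes the column of ones for the bias term.

```python
import numpy as np

def lms_cost(w, X, y):
    """Sum-of-squares error J(w) = 1/2 * sum_i (h_w(x_i) - y_i)^2.
    X is the m x (n+1) input matrix (first column all ones), y the targets."""
    residuals = X @ w - y
    return 0.5 * np.dot(residuals, residuals)

def lms_gradient(w, X, y):
    """Gradient of J(w): component j is sum_i (h_w(x_i) - y_i) * x_{i,j},
    i.e. X^T (Xw - y) in matrix form."""
    return X.T @ (X @ w - y)
```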
The solution
• Recalling some multivariate calculus:
    \nabla_w J = \nabla_w (Xw - Y)^T (Xw - Y)
               = \nabla_w (w^T X^T X w - Y^T X w - w^T X^T Y + Y^T Y)
               = 2 X^T X w - 2 X^T Y
• Setting the gradient equal to zero:
    2 X^T X w - 2 X^T Y = 0 \Rightarrow X^T X w = X^T Y \Rightarrow w = (X^T X)^{-1} X^T Y
• The inverse exists if the columns of X are linearly independent.

Example: Data and best linear hypothesis
[Figure: the example data (y versus x) with the best-fit line y = 1.60x + 1.05]

Linear regression summary
• The optimal solution (minimizing the sum-squared error) can be computed in polynomial time in the size of the data set.
• The solution is w = (X^T X)^{-1} X^T Y, where X is the data matrix augmented with a column of ones, and Y is the column vector of target outputs.
• A very rare case in which an analytical, exact solution is possible
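As a concrete sketch of the closed-form solution (illustrative code, not from the course), the following fits a line to the ten (x, y) points from the earlier table; np.linalg.lstsq solves the least-squares problem without forming the inverse explicitly, which is the numerically safer route.

```python
import numpy as np

# The ten (x, y) points from the example table.
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

# Data matrix augmented with a column of ones (the bias term x_0 = 1).
X = np.column_stack([x, np.ones_like(x)])

# Solve X^T X w = X^T Y in a least-squares sense.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [1.60, 1.05], i.e. y = 1.60 x + 1.05
```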
Coming back to the mean-squared error function...
• Good intuitive feel (small errors are ignored, large errors are penalized)
• Nice math (closed-form solution, unique global optimum)
• Geometric interpretation (Doina does this on the board)
• Any other interpretation?

A probabilistic assumption
• Assume y_i is a noisy target value, generated from a hypothesis h_w(x)
• More specifically, assume that there exists a w such that
    y_i = h_w(x_i) + \epsilon_i
  where \epsilon_i is a random variable (noise) drawn independently for each x_i according to a Gaussian (normal) distribution with mean zero and variance \sigma^2
• How should we choose the parameter vector w?

Bayes theorem in learning
Let h be a hypothesis and D be the set of training data. Using Bayes theorem, we have:
    P(h|D) = \frac{P(D|h) P(h)}{P(D)},
where:
• P(h) is the prior probability of hypothesis h
• P(D) = \int_h P(D|h) P(h) \, dh is the probability of the training data D (a normalization term, independent of h)
• P(h|D) is the probability of h given D
• P(D|h) is the probability of D given h (the likelihood of the data)

Choosing hypotheses
• What is the most probable hypothesis given the training data?
• Maximum a posteriori (MAP) hypothesis h_{MAP}:
    h_{MAP} = \arg\max_{h \in H} P(h|D)
            = \arg\max_{h \in H} \frac{P(D|h) P(h)}{P(D)}   (using Bayes theorem)
            = \arg\max_{h \in H} P(D|h) P(h)
  The last step holds because P(D) is independent of h (so it is constant for the maximization)
• This is the Bayesian answer (more in a minute)

Maximum likelihood estimation
    h_{MAP} = \arg\max_{h \in H} P(D|h) P(h)
• If we assume P(h_i) = P(h_j) (all hypotheses are equally likely a priori), we can simplify further and choose the maximum likelihood (ML) hypothesis:
    h_{ML} = \arg\max_{h \in H} P(D|h) = \arg\max_{h \in H} L(h)
• Standard assumption: the training examples are independently and identically distributed (i.i.d.)
• This allows us to simplify P(D|h):
    P(D|h) = \prod_{i=1}^{m} P(\langle x_i, y_i \rangle | h) = \prod_{i=1}^{m} P(y_i | x_i; h) P(x_i)

The log trick
• We want to maximize:
    L(h) = \prod_{i=1}^{m} P(y_i | x_i; h) P(x_i)
  This is a product, and products are hard to maximize!
• Instead, we will maximize log L(h) (the log-likelihood function):
    \log L(h) = \sum_{i=1}^{m} \log P(y_i | x_i; h) + \sum_{i=1}^{m} \log P(x_i)
• The second sum depends on D, but not on h, so it can be ignored in the search for a good hypothesis

Maximum likelihood for regression
• Adopt the assumption that
    y_i = h_w(x_i) + \epsilon_i, where \epsilon_i \sim N(0, \sigma^2)
• The best hypothesis maximizes the likelihood of y_i - h_w(x_i) = \epsilon_i
• Hence,
    L(w) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2} \left( \frac{y_i - h_w(x_i)}{\sigma} \right)^2}
  because the noise variables \epsilon_i are drawn from a Gaussian distribution

Applying the log trick
    \log L(w) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2} \sum_{i=1}^{m} \frac{(y_i - h_w(x_i))^2}{\sigma^2}
Maximizing the right-hand side is the same as minimizing:
    \frac{1}{2} \sum_{i=1}^{m} \frac{(y_i - h_w(x_i))^2}{\sigma^2}
This is our old friend, the sum-squared-error function! (The constants that are independent of h can again be ignored.)

Maximum likelihood hypothesis for least-squares estimators
• Under the assumption that the training examples are i.i.d. and that we have Gaussian target noise, the maximum likelihood parameters w are those minimizing the sum-squared error:
    w^* = \arg\min_w \sum_{i=1}^{m} (y_i - h_w(x_i))^2
• This makes explicit the hypothesis behind minimizing the sum-squared error
• If the noise is not normally distributed, maximizing the likelihood is not the same as minimizing the sum-squared error
• In practice, different loss functions are used depending on the noise assumption

A graphical representation of the data generation process
[Figure: a graphical model in which x ~ P(X) is observed, noise eps ~ N(0, sigma^2) is drawn independently, and y is produced deterministically from x and eps through the deterministic but unknown function h_w.]
• Circles represent (random) variables
• Arrows represent dependencies between variables
• Some variables are observed; others need to be inferred because they are hidden (latent)
• New assumptions can be incorporated by making the model more complicated

Predicting recurrence time based on tumor size
[Figure: scatter plot of time to recurrence (months?) versus tumor radius (mm?)]

Is linear regression enough?
• Linear regression is too simple for most realistic problems
  But it should be the first thing you try for real-valued outputs!
• Problems can also occur if X^T X is not invertible.
• Two possible solutions:
  1. Transform the data
     – Add cross-terms, higher-order terms
     – More generally, apply a transformation of the inputs from X to some other space X′, then do linear regression in the transformed space
  2. Use a different hypothesis class
• Today we focus on the first approach

Polynomial fits
• Suppose we want to fit a higher-degree polynomial to the data (e.g., y = w_2 x^2 + w_1 x + w_0).
• Suppose for now that there is a single input variable per training sample.
• How do we do it?

Answer: Polynomial regression
• Given data: (x_1, y_1), (x_2, y_2), ..., (x_m, y_m).
• Suppose we want a degree-d polynomial fit.
• Let Y be as before and let
    X = \begin{pmatrix} x_1^d & \cdots & x_1^2 & x_1 & 1 \\ x_2^d & \cdots & x_2^2 & x_2 & 1 \\ \vdots & & \vdots & \vdots & \vdots \\ x_m^d & \cdots & x_m^2 & x_m & 1 \end{pmatrix}
• Solve the linear regression Xw ≈ Y (a code sketch follows below).
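A minimal sketch of this degree-d fit (my own illustrative code): numpy.vander builds exactly the matrix of decreasing powers shown above, and lstsq solves Xw ≈ Y.

```python
import numpy as np

def poly_fit(x, y, d):
    """Fit a degree-d polynomial by least squares.
    Builds the matrix with columns [x^d, ..., x^2, x, 1] and solves Xw ≈ y."""
    X = np.vander(x, d + 1)              # Vandermonde matrix, powers in decreasing order
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w                             # coefficients [w_d, ..., w_1, w_0]

# The ten example points again; the degree-2 fit should roughly recover
# the coefficients worked out on the next slides.
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])
print(poly_fit(x, y, 2))  # approximately [0.68, 1.74, 0.73]
```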
Example of quadratic regression: Data matrices
The design matrix X (columns x^2, x, 1) and target vector Y for the ten example points:

     x^2      x     1   |    y
    0.75    0.86    1   |  2.49
    0.01    0.09    1   |  0.83
    0.73   -0.85    1   | -0.25
    0.76    0.87    1   |  3.10
    0.19   -0.44    1   |  0.87
    0.18   -0.43    1   |  0.02
    1.22   -1.10    1   | -0.12
    0.16    0.40    1   |  1.81
    0.93   -0.96    1   | -0.83
    0.03    0.17    1   |  0.43

X^T X
Multiplying the 3 × 10 matrix X^T by the 10 × 3 matrix X above gives:
    X^T X = \begin{pmatrix} 4.11 & -1.64 & 4.95 \\ -1.64 & 4.95 & -1.39 \\ 4.95 & -1.39 & 10 \end{pmatrix}

X^T Y
Similarly,
    X^T Y = \begin{pmatrix} 3.60 \\ 6.49 \\ 8.34 \end{pmatrix}

Solving for w
    w = (X^T X)^{-1} X^T Y = \begin{pmatrix} 4.11 & -1.64 & 4.95 \\ -1.64 & 4.95 & -1.39 \\ 4.95 & -1.39 & 10 \end{pmatrix}^{-1} \begin{pmatrix} 3.60 \\ 6.49 \\ 8.34 \end{pmatrix} = \begin{pmatrix} 0.68 \\ 1.74 \\ 0.73 \end{pmatrix}
So the best order-2 polynomial is y = 0.68x^2 + 1.74x + 0.73.

Linear function approximation in general
• Given a set of examples \langle x_i, y_i \rangle_{i=1...m}, we fit a hypothesis
    h_w(x) = \sum_{k=0}^{K-1} w_k \phi_k(x) = w^T \phi(x)
  where the \phi_k are called basis functions
• The best w is considered to be the one which minimizes the sum-squared error over the training data:
    \sum_{i=1}^{m} (y_i - h_w(x_i))^2
• We can find the best w in closed form:
    w = (\Phi^T \Phi)^{-1} \Phi^T y
  or by other methods (see next lecture)

Linear models in general
• By linear models, we mean that the hypothesis function h_w(x) is a linear function of the parameters w
• This does not mean that h_w(x) is a linear function of the input vector x (e.g., polynomial regression)
• In general,
    h_w(x) = \sum_{k=0}^{K-1} w_k \phi_k(x) = w^T \phi(x)
  where the \phi_k are called basis functions
• Usually, we will assume that \phi_0(x) = 1, ∀x, to create a bias term
• The hypothesis can alternatively be written as
    h_w(x) = \Phi w
  where \Phi is a matrix with one row per instance; row j contains \phi(x_j).
• Basis functions are fixed

Example basis functions: Polynomials
[Figure: polynomial basis functions plotted on [-1, 1]]
    \phi_k(x) = x^k
“Global” functions: a small change in x may cause a large change in the output of many basis functions

Example basis functions: Gaussians
[Figure: Gaussian basis functions plotted on [-1, 1]]
    \phi_k(x) = \exp\left( -\frac{(x - \mu_k)^2}{2\sigma^2} \right)
• \mu_k controls the position along the x-axis
• \sigma controls the width (activation radius)
• \mu_k, \sigma are fixed for now (later we discuss adjusting them)
• Usually thought of as “local” functions: if \sigma is relatively small, a small change in x only causes a change in the output of a few basis functions (the ones with means close to x)

Example basis functions: Sigmoidal
[Figure: sigmoidal basis functions plotted on [-1, 1]]
    \phi_k(x) = \sigma\left( \frac{x - \mu_k}{s} \right), \quad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}
• \mu_k controls the position along the x-axis
• s controls the slope
• \mu_k, s are fixed for now (later we discuss adjusting them)
• “Local” functions: a small change in x only causes a change in the output of a few basis functions (most others will stay close to 0 or 1)
A sketch of fitting a linear model with such basis functions appears below.
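Here is an illustrative sketch (not course code) of fitting a linear model with Gaussian basis functions to the same ten points; the number of basis functions, the centers mu_k and the width sigma are arbitrary choices made for the example.

```python
import numpy as np

def gaussian_features(x, centers, sigma):
    """Feature map phi(x): a constant 1 (bias) plus one Gaussian bump per center."""
    bumps = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))
    return np.column_stack([np.ones_like(x), bumps])

x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

centers = np.linspace(-1.0, 1.0, 5)    # mu_k: arbitrary, evenly spaced
sigma = 0.4                            # width: arbitrary choice
Phi = gaussian_features(x, centers, sigma)       # design matrix, one row per example
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # w = (Phi^T Phi)^{-1} Phi^T y
y_hat = Phi @ w                                  # fitted values h_w(x_i)
print(np.round(w, 2), float(np.mean((y - y_hat) ** 2)))
```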
Order-2 fit through order-9 fit
[Figures: one slide per polynomial order from 2 to 9, each showing the fitted curve (y versus x) on the same data and asking: Is this a better fit to the data?]

Overfitting
• A general, HUGELY IMPORTANT problem for all machine learning algorithms
• We can find a hypothesis that predicts the training data perfectly but does not generalize well to new data
• E.g., a lookup table!
• We are seeing an instance of it here: if we have a lot of parameters, the hypothesis “memorizes” the data points, but is wild everywhere else.

Another overfitting example
[Figure: polynomial fits of degree M = 0, 1, 3 and 9 (t plotted against x) on the same small data set]
• The higher the degree of the polynomial M, the more degrees of freedom, and the more capacity to “overfit” the training data
• Typical overfitting means that the error on the training data is very low, but the error on new instances is high

Overfitting more formally
• Assume that the data is drawn from some fixed, unknown probability distribution
• Every hypothesis has a “true” error J*(h), which is the expected error when data is drawn from the distribution.
• Because we do not have all the data, we measure the error on the training set, J_D(h)
• Suppose we compare hypotheses h_1 and h_2 on the training set, and J_D(h_1) < J_D(h_2)
• If h_2 is “truly” better, i.e., J*(h_2) < J*(h_1), our algorithm is overfitting.
• We need theoretical and empirical methods to guard against it!

Typical overfitting plot
[Figure: root-mean-square error E_RMS on the training and test sets as a function of the polynomial degree M, for M from 0 to 9]
• The training error decreases with the degree of the polynomial M, i.e., with the complexity of the hypothesis
• The test error, measured on independent data, decreases at first, then starts increasing
• Cross-validation helps us:
  – Find a good hypothesis class (M in our case), using a validation set of data
  – Report unbiased results, using a test set, untouched during either parameter training or validation

Cross-validation
• A general procedure for estimating the true error of a predictor
• The data is split into two subsets:
  – A training and validation set used only to find the right predictor
  – A test set used to report the prediction error of the algorithm
• These sets must be disjoint!
• The process is repeated several times, and the results are averaged to provide error estimates.

Example: Polynomial regression
[Figure: polynomial regression fits (four panels of y versus x) on different splits of the data]
A small sketch reproducing the typical overfitting behavior appears below.
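To make the typical overfitting plot concrete, here is a small self-contained sketch (synthetic sine-plus-noise data, purely illustrative): it fits polynomials of increasing degree on a training set and evaluates them on a separate test set; training error keeps dropping while test error eventually rises.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_error(w, x, y):
    """Root-mean-square error of the polynomial with coefficients w on (x, y)."""
    return np.sqrt(np.mean((np.polyval(w, x) - y) ** 2))

# Synthetic data: a sine curve plus Gaussian noise (not the course data set).
x_train = rng.uniform(-1, 1, 10)
y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = rng.uniform(-1, 1, 100)
y_test = np.sin(np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for d in range(10):
    w, *_ = np.linalg.lstsq(np.vander(x_train, d + 1), y_train, rcond=None)
    print(d, round(rms_error(w, x_train, y_train), 3), round(rms_error(w, x_test, y_test), 3))
```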
Leave-one-out cross-validation
1. For each order of polynomial d:
   (a) Repeat the following procedure m times:
       i. Leave out the i-th instance from the training set, to estimate the true prediction error; we will put it in a validation set
       ii. Use all the other instances to find the best parameter vector, w_{d,i}
       iii. Measure the error in predicting the label of the left-out instance with the parameter vector w_{d,i}; call this J_{d,i}
       iv. This is an unbiased estimate of the true prediction error
   (b) Compute the average of the estimated errors: J_d = \frac{1}{m} \sum_{i=1}^{m} J_{d,i}
2. Choose the d with the lowest average estimated error: d^* = \arg\min_d J_d

Estimating the true error for d = 1
D = {(0.86, 2.49), (0.09, 0.83), (-0.85, -0.25), (0.87, 3.10), (-0.44, 0.87), (-0.43, 0.02), (-1.10, -0.12), (0.40, 1.81), (-0.96, -0.83), (0.17, 0.43)}
On iteration i, D_train = D minus the i-th example and D_valid contains only that example.

  Iter   D_valid            Error_train   Error_valid (J_{1,i})
  1      (0.86, 2.49)       0.4928        0.0044
  2      (0.09, 0.83)       0.1995        0.1869
  3      (-0.85, -0.25)     0.3461        0.0053
  4      (0.87, 3.10)       0.3887        0.8681
  5      (-0.44, 0.87)      0.2128        0.3439
  6      (-0.43, 0.02)      0.1996        0.1567
  7      (-1.10, -0.12)     0.5707        0.7205
  8      (0.40, 1.81)       0.2661        0.0203
  9      (-0.96, -0.83)     0.3604        0.2033
  10     (0.17, 0.43)       0.2138        1.0490
                    mean:   0.2188        0.3558

Leave-one-out cross-validation results

  d   Error_train   Error_valid (J_d)
  1   0.2188        0.3558
  2   0.1504        0.3095
  3   0.1384        0.4764
  4   0.1259        1.1770
  5   0.0742        1.2828
  6   0.0598        1.3896
  7   0.0458        38.819
  8   0.0000        6097.5
  9   0.0000        6097.5

• Typical overfitting behavior: as d increases, the training error decreases, while the validation error decreases at first, then starts increasing again
• Optimal choice: d = 2. We are overfitting for d > 2

Estimating both the hypothesis class and the true error
• Suppose we want to compare polynomial regression with some other algorithm
• We chose the hypothesis class (i.e., the degree of the polynomial, d*) based on the estimates J_d
• Hence J_{d*} is not unbiased: our procedure was aimed at optimizing it
• If we want both a hypothesis class and an unbiased error estimate, we need to tweak the leave-one-out procedure a bit

Cross-validation with validation and testing sets
1. For each example j:
   (a) Create a test set consisting of just the j-th example, D_j = {(x_j, y_j)}, and a training and validation set D̄_j = D − {(x_j, y_j)}
   (b) Use the leave-one-out procedure from above on D̄_j to find a hypothesis h*_j
       • Note that this will split the data internally, in order to both train and validate!
       • Typically, only one such split is used, rather than all possible splits
   (c) Evaluate the error of h*_j on D_j (call it J(h*_j))
2. Report the average of the J(h*_j) as a measure of performance of the whole algorithm
• Note that at this point we do not have one predictor, but several!
• Several methods can then be used to come up with just one predictor (more on this later)

Summary of leave-one-out cross-validation
• A very easy to implement algorithm (see the sketch below)
• Theoretically, we obtain an unbiased estimate of the true error of a predictor
• It can indicate problematic examples in a data set (when using multiple algorithms)
• Computational cost scales with the number of instances (examples), so it can be prohibitive, especially if finding the best predictor is expensive
• Alternative: k-fold cross-validation
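As a closing illustration (my own sketch, not course-provided code), here is leave-one-out cross-validation for choosing the polynomial degree on the ten-point data set; the exact error values will differ from the tables above depending on numerical details, but the overall pattern should be similar.

```python
import numpy as np

x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])
m = len(x)

def loocv_error(x, y, d):
    """Average squared validation error J_d of a degree-d polynomial, leave-one-out."""
    errors = []
    for i in range(m):
        train = np.arange(m) != i                    # leave the i-th example out
        w, *_ = np.linalg.lstsq(np.vander(x[train], d + 1), y[train], rcond=None)
        errors.append((np.polyval(w, x[i]) - y[i]) ** 2)   # J_{d,i}
    return float(np.mean(errors))

scores = {d: loocv_error(x, y, d) for d in range(1, 10)}
best_d = min(scores, key=scores.get)                 # d* = argmin_d J_d
print(scores, best_d)
```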