Sample Tests for the Course Computational Intelligence

Thomas Natschläger
June 16, 2004

The aim of these multiple choice tests is to let students check their understanding of the topics presented during the lectures. An interactive version of these tests [1] can be found at the homepage of the course [2], or you can download the interactive version as a zip file [3].

[1] http://www.igi.tugraz.at/lehre/CI/tests/index.html
[2] http://www.igi.tugraz.at/lehre/CI
[3] http://www.igi.tugraz.at/lehre/CI/InteraktiveTests.zip


Introduction to Machine Learning

1. A learning algorithm is a function which maps each attribute vector $a = \langle a_1, \ldots, a_d \rangle$ to a target value $b$.
    [ ] True  [ ] False

2. The empirical error on the training set is always lower than the empirical error on the test set.
    [ ] True  [ ] False

3. The true error $\mathrm{error}_P(H)$ of the hypothesis $H$ is necessarily larger than the empirical error $\mathrm{error}_{T_k}(H)$ measured on the test set $T_k$.
    [ ] True  [ ] False

4. If the training set $L$ and the test set $T$ are generated by two totally different distributions, then
    [ ] the larger the test set $T_k$, the closer the empirical error $\mathrm{error}_{T_k}(H)$ is to the true error $\mathrm{error}_P(H)$ of the hypothesis $H$.
    [ ] $\lim_{k \to \infty} \mathrm{error}_{T_k}(H) = \mathrm{error}_P(H)$, where $k$ is the size of the test set $T_k$.
    [ ] it may happen that even for very large test sets $T_k$ the empirical error does not approximate the true error very well.

5. Generalization has to do
    [ ] with the ability of a learning algorithm to find a hypothesis which has a low error on the training set.
    [ ] with the ability of a learning algorithm to find a hypothesis which has a low error on the test set.
    [ ] with the ability of a learning algorithm to find a hypothesis which performs well on examples $\langle a, b \rangle$ which were not used for training.

6. A “good” learning algorithm is a learning algorithm which
    [ ] has good generalization capabilities.
    [ ] can find for the training set $L$ a hypothesis $H$ with $\mathrm{error}_L(H) = 0$.
    [ ] finds for rather small training sets $L$ a hypothesis $H_L$ with a small true error.


Neural Networks

1.
    With a suitable choice of $w$ and $w_0$, a linear threshold gate $g(x) = \mathrm{sign}(w \cdot x + w_0)$ can compute any Boolean function $f : \{-1, 1\}^n \to \{-1, 1\}$.
    [ ] True  [ ] False

2. A 3-layer ANN with one output that consists only of linear gates can be simulated by a single linear gate.
    [ ] True  [ ] False

3. Every continuous function $f : \mathbb{R} \to (0, 1)$ can be approximated arbitrarily well by a feedforward ANN of sigmoidal gates with one hidden layer.
    [ ] True  [ ] False

4. The delta rule always finds a solution for a binary classification problem if the data are linearly separable.
    [ ] True  [ ] False

5. If the data of a binary classification problem are not linearly separable, the delta rule is not applicable.
    [ ] True  [ ] False

6. Using a linear program, the weights of a linear threshold gate can be "trained" even in the case where the data are not linearly separable.
    [ ] True  [ ] False

7. Computing the Fisher discriminant determines the direction vector along which the projected data have the largest variance.
    [ ] True  [ ] False

8. The weights of a threshold gate determined via the pseudo-inverse are optimal with respect to the number of misclassified training examples.
    [ ] True  [ ] False

9. The weight vector $\langle w_1, \ldots, w_n \rangle$ of a threshold gate determined by an SVM with a linear kernel is a linear combination of the training examples.
    [ ] True  [ ] False

10. For a linearly separable classification problem, the separating hyperplane that has the maximal distance to the training examples is called the optimal hyperplane.
    [ ] True  [ ] False

11. When training an SVM, the weights of a threshold gate are determined by gradient descent.
    [ ] True  [ ] False

12. The backprop algorithm is guaranteed to find the weights of an ANN that constitute a global minimum of the error function.
    [ ] True  [ ] False

13. Weight decay is a heuristic for avoiding overfitting when training ANNs.
    [ ] True  [ ] False

14.
    With an unfavorable parameter setting in backprop with an adaptive learning rate, the weight vector may start to oscillate.
    [ ] True  [ ] False

15. Gradient descent is a technique developed specifically for ANNs for minimizing quadratic error functions.
    [ ] True  [ ] False

16. The advantage of backprop with momentum is that learning is accelerated on "plateaus" of the error function.
    [ ] True  [ ] False

17. Quasi-Newton and conjugate gradient methods differ only in the way the Hessian matrix is computed.
    [ ] True  [ ] False

18. After whitening the data, there are no empirically measurable linear dependencies left between different attributes.
    [ ] True  [ ] False


Classification Algorithms

1. For any finite training set, C4.5 can produce a decision tree which makes no errors on the training set.
    [ ] True  [ ] False

2. The performance of a nearest neighbor algorithm depends strongly on the relative scaling of the attributes.
    [ ] True  [ ] False


Adaptive Filtering

1. Increasing the step size $\mu$ generally results in faster convergence of the LMS algorithm.
    [ ] True  [ ] False

2. The goal of system identification is to build a model of an unknown system.
    [ ] True  [ ] False

3. The RLS algorithm usually converges faster than the LMS algorithm.
    [ ] True  [ ] False

4. Why is it (usually) not desirable to achieve the global minimum of the mean squared error (of the whole time series) for an adaptive filter?
    [ ] Because the filter should adapt to temporal variation of an unknown system.
    [ ] Because the wanted signal (e.g., the signal of a local speaker for the application in echo cancellation) would be suppressed.

5. An adaptive filter trained using the RLS algorithm with a forgetting factor $\rho = 1$
    [ ] has constant coefficient values over time, $w[n] = w$ (it does not adapt).
    [ ] reaches the global minimum of the mean squared error for a time series at the end of the time series.
    [ ] indirectly considers all past signal samples for the computation of the local error and the adaptation of the coefficients.
    [ ] displays the same adaptation behavior as an adaptive filter trained using the LMS algorithm with $\mu = 0$.


Gaussian Statistics

1. Consider a 2-dimensional Gaussian process. Find the correct answers:
    [ ] If the first and second dimension are independent, the cloud of points $(x_i, y_i)_{i=1,\ldots,N}$ and the pdf contour necessarily have the shape of a circle.
    [ ] If the first and second dimension are independent, the cloud of points and the pdf contour have to be elliptic, with the principal axes of the ellipse aligned with the abscissa and ordinate axes (consider a circle as a special form of an ellipse).
    [ ] The covariance matrix $\Sigma$ is symmetric. That is, for $i, j = 1, \ldots, d$ it holds that $c_{ij} = c_{ji}$.

2. Estimation of the parameters of a 2-dimensional normal distribution. Find the correct answers.
    [ ] An accurate mean estimate requires more samples than an accurate variance estimate.
    [ ] Using more data results in a more accurate estimate of the parameters of the normal distribution.

3. Computation of the log-likelihood for classification instead of the likelihood
    [ ] gives the same classification results, since the logarithm is a monotonically increasing function.
    [ ] is computationally beneficial, since we do not have to deal with very small numbers.
    [ ] turns products for the computation of the likelihood into sums for the computation of the log-likelihood.

4. For the computation of the log-likelihood for observations $x$ with respect to Gaussian models according to
    $\log p(x \mid \Theta) = \frac{1}{2} \left[ -d \log(2\pi) - \log(\det(\Sigma)) - (x - \mu)^T \Sigma^{-1} (x - \mu) \right]$,
   we may (for all Gaussian models)
    [ ] drop the division by 2.
    [ ] drop the term $d \log(2\pi)$.
    [ ] drop the term $\log(\det(\Sigma))$.
    [ ] drop the term $(x - \mu)^T \Sigma^{-1} (x - \mu)$.
    [ ] pre-compute the term $\log(\det(\Sigma))$ for each of the Gaussian models.


Hidden Markov Models

1. The parameters of a Markov model (NOT a hidden Markov model) are:
    [ ] The set of states.
    [ ] The prior probabilities (probabilities to start in a certain state).
    [ ] The state transition probabilities.
    [ ] The emission probabilities.

2.
    Find the correct statements.
    [ ] The (first-order) Markov assumption means that the probability of an event at time $n$ only depends on the event at time $n - 1$.
    [ ] An ergodic HMM allows transitions from each state to any other state.
    [ ] For speech recognition, ergodic HMMs are usually used to model phoneme sequences in words.

3. Viterbi algorithm
    [ ] The Viterbi algorithm finds the most likely state sequence for a given observation sequence and a given HMM.
    [ ] The Viterbi algorithm finds the most likely state sequence for a given HMM.
    [ ] The Viterbi algorithm computes the likelihood of an observation sequence with respect to an HMM (considering all possible state sequences).
    [ ] In the Viterbi algorithm, at each time step and for each state only one path leading to this state (the survivor path) and its metric are stored for further processing.
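
For readers who want to check their intuition about the Viterbi question against running code, the following is a minimal Python sketch, not part of the original test. The function name `viterbi`, the dictionary-based model layout, and the two-state toy HMM with states "A" and "B" and symbols "x" and "y" are all assumptions made for illustration. The sketch works in the log domain and keeps one survivor path and one metric per state at each time step.

```python
import math


def viterbi(observations, states, log_prior, log_trans, log_emit):
    """Return (log-probability, state path) of the most likely state sequence."""
    first = observations[0]
    # metric[s]: log-probability of the survivor path that ends in state s
    metric = {s: log_prior[s] + log_emit[s][first] for s in states}
    # survivor[s]: the single best path that ends in state s
    survivor = {s: [s] for s in states}

    for obs in observations[1:]:
        new_metric, new_survivor = {}, {}
        for s in states:
            # Only the best predecessor of s is kept; all other paths are dropped.
            best_prev = max(states, key=lambda p: metric[p] + log_trans[p][s])
            new_metric[s] = (metric[best_prev] + log_trans[best_prev][s]
                             + log_emit[s][obs])
            new_survivor[s] = survivor[best_prev] + [s]
        metric, survivor = new_metric, new_survivor

    best_last = max(states, key=lambda s: metric[s])
    return metric[best_last], survivor[best_last]


# Hypothetical toy HMM (all numbers are illustrative assumptions).
states = ["A", "B"]
prior = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}


def to_log(table):
    """Convert a nested probability table to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


best_log_prob, best_path = viterbi(
    ["x", "x", "y", "y"],
    states,
    {s: math.log(p) for s, p in prior.items()},
    to_log(trans),
    to_log(emit),
)
print(best_path, math.exp(best_log_prob))
```

Working with sums of log-probabilities instead of products of probabilities avoids very small numbers on long observation sequences. Note that the returned value is the probability of the single best path, not the likelihood of the observation sequence summed over all state sequences.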