Hidden Markov Models for Information Extraction
CSE 454 (Daniel S. Weld; some slides by Okan Basegmez and Bonnie Dorr)

Course Overview
• Info Extraction
• Data Mining
• E-commerce
• P2P
• Security
• Web Services
• Semantic Web
• Case Studies: Nutch, Google, AltaVista
• Information Retrieval: Precision vs. Recall, Inverted Indices
• Crawler Architecture
• Synchronization & Monitors
• Systems Foundation: Networking & Clusters

What is Information Extraction (IE)?
• IE is the task of populating database slots with corresponding phrases from text.

What are HMMs?
• An HMM is a finite-state automaton with stochastic state transitions and symbol emissions (Rabiner 1989).

Why Use HMMs for IE?
• Strong statistical foundations
• Well suited to natural-language domains
• Handle new data robustly
• Computationally efficient to develop and evaluate
Disadvantages:
• Require an a priori notion of the model topology
• Require large amounts of training data

Defn: Markov Model
• Q: set of states
• $\pi$: initial probability distribution
• A: transition probability distribution
[Figure: transition matrix A over states $s_0, \ldots, s_6$; entry $p_{12}$ is the probability of transitioning from $s_1$ to $s_2$, and each row sums to 1.]

E.g., Predicting Web Behavior
• When will a visitor leave the site?
• Q: set of states (pages)
• $\pi$: initial probability distribution (likelihood of each site entry point)
• A: transition probability distribution (user navigation model)

Diversion: Relational Markov Models
[Figure: average negative log likelihood vs. number of training examples (10 to 1,000,000) for RMM-uniform, RMM-rank, RMM-PET, and PMM.]

Probability Distribution, A
• Forward causality: the probability of $s_t$ does not depend directly on the values of future states.
• The probability of the new state could depend on the whole history of states visited: $\Pr(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_0)$
• Markovian assumption: $\Pr(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_0) = \Pr(s_t \mid s_{t-1})$
• Stationary model assumption: $\Pr(s_t \mid s_{t-1}) = \Pr(s_k \mid s_{k-1})$ for all k

Defn: Hidden Markov Model
• Q: set of states (hidden!)
• $\pi$: initial probability distribution
• A: transition probability distribution
• O: set of possible observations
• $b_i(o_t)$: probability of $s_i$ emitting $o_t$

HMMs and Their Usage
• HMMs are very common in computational linguistics:
• Speech recognition (observed: acoustic signal; hidden: words)
• Handwriting recognition (observed: image; hidden: words)
• Part-of-speech tagging (observed: words; hidden: part-of-speech tags)
• Machine translation (observed: foreign words; hidden: words in the target language)
• Information extraction (observed: document words; hidden: the database slots they fill)

Information Extraction with HMMs
• Example: research paper headers

The Three Basic HMM Problems
• Problem 1 (Evaluation): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM $\lambda = (A, B, \pi)$, how do we compute the probability of O given the model?
• Problem 2 (Decoding): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM $\lambda = (A, B, \pi)$, how do we find the state sequence that best explains the observations?
• Problem 3 (Learning): How do we adjust the model parameters $(A, B, \pi)$ to maximize $P(O \mid \lambda)$?
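To make the triple $(A, B, \pi)$ concrete before tackling the three problems, here is a minimal sketch in Python/NumPy (chosen here only for illustration; the deck does not prescribe a language) of the rain/umbrella model used as the "Simple Example" later on. The uniform initial distribution is an assumption, since the slides do not give one.

```python
import numpy as np

# Hidden states: 0 = Rain, 1 = NoRain.  Observations: 0 = Umbrella, 1 = NoUmbrella.
# Numbers follow the "Simple Example" slides: P(R_t=true | R_{t-1}=true) = 0.7,
# P(R_t=true | R_{t-1}=false) = 0.3, P(U=true | R=true) = 0.9, P(U=true | R=false) = 0.2.

pi = np.array([0.5, 0.5])        # initial distribution (assumed uniform; not given on the slides)

A = np.array([[0.7, 0.3],        # A[i, j] = P(q_t = s_j | q_{t-1} = s_i)
              [0.3, 0.7]])

B = np.array([[0.9, 0.1],        # B[i, k] = b_i(v_k) = P(o_t = v_k | q_t = s_i)
              [0.2, 0.8]])

# Every row of A and B is a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)

# Observation sequence (Umbrella, Umbrella, NoUmbrella, Umbrella), as in the
# four-step example later in the deck.
O = np.array([0, 0, 1, 0])
```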
Information Extraction with HMMs
• Given a model M and its parameters, information extraction is performed by determining the state sequence that was most likely to have generated the entire document.
• This sequence can be recovered by dynamic programming with the Viterbi algorithm.

Information Extraction with HMMs
• Probability of a string x being emitted by an HMM M: $P(x \mid M) = \sum_{q_1 \ldots q_T} \pi_{q_1}\, b_{q_1}(x_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(x_t)$
• State sequence $V(x \mid M)$ that has the highest probability of having produced the observation sequence: $V(x \mid M) = \arg\max_{q_1 \ldots q_T} P(x, q_1 \ldots q_T \mid M)$

Simple Example
[Figure: the rain/umbrella HMM as a dynamic Bayes net, with hidden nodes Rain(t-1), Rain(t), Rain(t+1) and observed nodes Umbrella(t-1), Umbrella(t), Umbrella(t+1).]
• Transition model: $P(R_t = \text{true} \mid R_{t-1} = \text{true}) = 0.7$, $P(R_t = \text{true} \mid R_{t-1} = \text{false}) = 0.3$
• Sensor model: $P(U_t = \text{true} \mid R_t = \text{true}) = 0.9$, $P(U_t = \text{true} \mid R_t = \text{false}) = 0.2$

Simple Example
[Figure: the model unrolled over four time steps, Rain(1) through Rain(4), each either true or false, with observed umbrella sequence (true, true, false, true).]

Forward Probabilities
• What is the probability that, given an HMM $\lambda$, at time t the state is $s_i$ and the partial observation $o_1 \ldots o_t$ has been generated?
• $\alpha_t(i) = P(o_1 \ldots o_t,\ q_t = s_i \mid \lambda)$

Problem 1: Probability of an Observation Sequence
• What is $P(O \mid \lambda)$?
• The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
• Naïve computation is very expensive: given T observations and N states, there are $N^T$ possible state sequences.
• Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
• The solution to this problem and to Problem 2 is dynamic programming.

Forward Probabilities
• $\alpha_t(i) = P(o_1 \ldots o_t,\ q_t = s_i \mid \lambda)$
• Recursion: $\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)$

Forward Algorithm
• Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$
• Induction: $\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$
• Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

Forward Algorithm Complexity
• The naïve approach to Problem 1 takes on the order of $2T \cdot N^T$ computations.
• The forward algorithm takes on the order of $N^2 T$ computations.

Backward Probabilities
• Analogous to the forward probability, just in the other direction.
• What is the probability that, given an HMM $\lambda$ and given that the state at time t is $s_i$, the partial observation $o_{t+1} \ldots o_T$ is generated?
• $\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i,\ \lambda)$
• Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

Backward Algorithm
• Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$
• Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1,\ 1 \le i \le N$
• Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
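A minimal sketch of the forward and backward recursions above, written against the toy pi, A, B, O arrays defined earlier; it omits the scaling (or log-space arithmetic) that a practical implementation needs to avoid underflow on long sequences.

```python
import numpy as np

def forward(pi, A, B, O):
    """alpha[t, i] = P(o_1 .. o_t, q_t = s_i | lambda)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                       # initialization
    for t in range(1, T):                            # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha                                     # termination: P(O|lambda) = alpha[-1].sum()

def backward(pi, A, B, O):
    """beta[t, i] = P(o_{t+1} .. o_T | q_t = s_i, lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                # initialization
    for t in range(T - 2, -1, -1):                   # induction
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

# Both terminations give the same likelihood:
#   forward(pi, A, B, O)[-1].sum()  ==  (pi * B[:, O[0]] * backward(pi, A, B, O)[0]).sum()
```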
Problem 2: Decoding
• The solution to Problem 1 (Evaluation) efficiently gives us the sum over all paths through an HMM.
• For Problem 2, we want to find the single path with the highest probability.
• That is, we want the state sequence $Q = q_1 \ldots q_T$ such that $Q = \arg\max_{Q'} P(Q' \mid O, \lambda)$.

Viterbi Algorithm
• Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, take the maximum.
• Forward: $\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)$
• Viterbi recursion: $\delta_t(j) = \Big[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)$

Viterbi Algorithm
• Initialization: $\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$
• Induction: $\delta_t(j) = \Big[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)$ and $\psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}, \quad 2 \le t \le T,\ 1 \le j \le N$
• Termination: $p^* = \max_{1 \le i \le N} \delta_T(i)$ and $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
• Read out the path: $q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1$
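A minimal sketch of the Viterbi recursion above, again written against the toy pi, A, B, O arrays from the earlier sketch; delta and psi play the roles of $\delta$ and $\psi$ on the slides.

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Return the most likely state sequence q* and its probability p*."""
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))             # delta[t, j]: best path probability ending in state j at time t
    psi = np.zeros((T, N), dtype=int)    # psi[t, j]: best predecessor of state j at time t
    delta[0] = pi * B[:, O[0]]           # initialization
    for t in range(1, T):                # induction: max (and argmax) instead of sum
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    q = np.zeros(T, dtype=int)           # termination and path read-out
    q[T - 1] = delta[T - 1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta[T - 1].max()

# e.g. viterbi(pi, A, B, O) on the umbrella observations (true, true, false, true)
# returns the most likely rain/no-rain sequence for the four days.
```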
Information Extraction
• We want specific information from text documents.
• For example, from colloquium announcement emails we want the speaker name, the location, and the start time.

Simple HMM for Job Titles

HMMs for Info Extraction
For sparse extraction tasks:
• Use a separate HMM for each type of target.
• Each HMM should model the entire document, consist of target and non-target states, and need not be fully connected.
• Given the HMM, how do we extract the info?

How to Learn an HMM?
• Two questions: structure & parameters.

Simplest Case
• Fix the structure.
• Learn the transition & emission probabilities.
• Training data? Label each word as target or non-target.
• Challenges: training data is sparse, and unseen words would otherwise get zero probability, so the estimates must be smoothed.

Problem 3: Learning
• So far we have assumed that we know the underlying model $\lambda = (A, B, \pi)$.
• Often these parameters are estimated from annotated training data, which has two drawbacks: annotation is difficult and/or expensive, and the training data may differ from the current data.
• We want to maximize the parameters with respect to the current data, i.e., we are looking for a model $\lambda'$ such that $\lambda' = \arg\max_{\lambda} P(O \mid \lambda)$.

Problem 3: Learning
• Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\lambda' = \arg\max_{\lambda} P(O \mid \lambda)$.
• But it is possible to find a local maximum!
• Given an initial model $\lambda$, we can always find a model $\lambda'$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$.

Parameter Re-estimation
• Use the forward-backward (Baum-Welch) algorithm, which is a hill-climbing algorithm.
• Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters, improving the probability that the given observations are generated by the new parameters.
• Three sets of parameters need to be re-estimated: the initial state distribution $\pi_i$, the transition probabilities $a_{ij}$, and the emission probabilities $b_i(o_t)$.

Re-estimating Transition Probabilities
• What is the probability of being in state $s_i$ at time t and going to state $s_j$, given the current model and parameters?
• $\xi_t(i, j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)$
• $\xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$

Re-estimating Transition Probabilities
• The intuition behind the re-estimation equation for transition probabilities: $\hat{a}_{ij} = \dfrac{\text{expected number of transitions from } s_i \text{ to } s_j}{\text{expected number of transitions from } s_i}$
• Formally: $\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}$
• Defining $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$ as the probability of being in state $s_i$ given the complete observation O, we can write $\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

Review of Probabilities
• Forward probability $\alpha_t(i)$: the probability of being in state $s_i$, given the partial observation $o_1, \ldots, o_t$.
• Backward probability $\beta_t(i)$: the probability of being in state $s_i$, given the partial observation $o_{t+1}, \ldots, o_T$.
• Transition probability $\xi_t(i, j)$: the probability of going from state $s_i$ to state $s_j$, given the complete observation $o_1, \ldots, o_T$.
• State probability $\gamma_t(i)$: the probability of being in state $s_i$, given the complete observation $o_1, \ldots, o_T$.

Re-estimating Initial State Probabilities
• Initial state distribution: $\pi_i$ is the probability that $s_i$ is the start state.
• Re-estimation is easy: $\hat{\pi}_i$ = expected number of times in state $s_i$ at time 1.
• Formally: $\hat{\pi}_i = \gamma_1(i)$

Re-estimating Emission Probabilities
• Emission probabilities are re-estimated as $\hat{b}_i(k) = \dfrac{\text{expected number of times in state } s_i \text{ observing symbol } v_k}{\text{expected number of times in state } s_i}$
• Formally: $\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$, where $\delta(o_t, v_k) = 1$ if $o_t = v_k$ and 0 otherwise.
• Note that this $\delta$ is the Kronecker delta and is not related to the $\delta$ in the discussion of the Viterbi algorithm!

The Updated Model
• Starting from $\lambda = (A, B, \pi)$ we get to $\lambda' = (\hat{A}, \hat{B}, \hat{\pi})$ by the following update rules:
• $\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$, $\quad \hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$, $\quad \hat{\pi}_i = \gamma_1(i)$

Expectation Maximization
• The forward-backward algorithm is an instance of the more general EM algorithm.
• The E step: compute the forward and backward probabilities for the given model.
• The M step: re-estimate the model parameters.
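A minimal sketch of a single Baum-Welch (EM) iteration implementing the update rules above; it reuses the forward and backward functions from the earlier sketch and assumes a single observation sequence. A practical implementation would add scaling, support for multiple sequences, and a convergence check.

```python
import numpy as np

def baum_welch_step(pi, A, B, O):
    """One EM iteration: returns (pi_hat, A_hat, B_hat) with P(O | new) >= P(O | old).

    O must be a NumPy array of symbol indices; forward and backward are the
    functions sketched earlier.
    """
    N, M, T = A.shape[0], B.shape[1], len(O)

    # E step: forward and backward probabilities under the current model.
    alpha = forward(pi, A, B, O)
    beta = backward(pi, A, B, O)
    likelihood = alpha[-1].sum()                     # P(O | lambda)

    # xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, O[t + 1]][None, :] * beta[t + 1][None, :]
        xi[t] /= xi[t].sum()
    # gamma[t, i] = P(q_t = s_i | O, lambda)
    gamma = alpha * beta / likelihood

    # M step: re-estimate pi, A, B from the expected counts.
    pi_hat = gamma[0]
    A_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_hat = np.zeros_like(B)
    for k in range(M):
        B_hat[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_hat, A_hat, B_hat
```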
Importance of HMM Topology
• Certain structures better capture the observed phenomena in the prefix, target, and suffix sequences.
• Building structures by hand does not scale to large corpora.
• Human intuitions don't always correspond to the structures that make the best use of the HMM's potential.

How to Learn Structure?

Conclusion
• IE is performed by recovering the most likely state sequence (Viterbi).
• Transition and emission parameters can be learned from training data (Baum-Welch).
• Shrinkage improves parameter estimation.
• Task-specific state-transition structure can be discovered automatically.

References
• Dayne Freitag and Andrew McCallum. "Information Extraction with HMM Structures Learned by Stochastic Optimization."
• Dayne Freitag and Andrew McCallum. "Information Extraction with HMMs and Shrinkage."
• Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. "Learning Hidden Markov Model Structure for Information Extraction."
• Andreas Stolcke and Stephen Omohundro. "Inducing Probabilistic Grammars by Bayesian Model Merging."
• T. R. Leek. "Information Extraction using Hidden Markov Models." Master's thesis, UCSD.