Topic modeling: an update

Khoat Than
Hanoi University of Science and Technology
FPT, March 31, 2015

Contents
- About me
- Introduction to topic modeling
- Some challenges
- Our recent research

About me

A short biography
- B.S.: Applied Mathematics and Informatics, University of Science, VNU, Hanoi (2004)
- M.S.: Information Technology, Hanoi University of Science and Technology (2009)
- Ph.D.: Knowledge Science, Japan Advanced Institute of Science and Technology (2013)

Academic activity
- Program committee member:
  - PAKDD (2015)
  - ACML (2015, 2014)
  - KSE (2015, 2014)
- PC co-chair: PhD colloquium at DASFAA-2015
- Director of the Laboratory of Knowledge and Data Engineering at HUST

Some projects
- NAFOSTED (2015-2017, VN): director
  - Title: Inference methods for analyzing the hidden semantics in big data
  - Area: Machine learning, Big data
- AFOSR (2015-2017, USA): director
  - Title: Inferring the hidden structures in big heterogeneous data
  - Area: Machine learning, Big data
- AFOSR (2013-2014, USA): member
  - Title: Methods of sparse modeling and dimensionality reduction to deal with big data
  - Area: Machine learning, Big data

Research interests
- Topic modeling
- Probabilistic graphical models
- Sparse modeling, sparse coding
- Stochastic inference, SGD, online learning
- Manifold learning
- Dimensionality reduction

Introduction to Topic Modeling

Topic modeling
- One of the main ways to automatically understand the meanings of texts.
- Efficient tools to organize, understand, and uncover useful knowledge from a huge amount of data.
- Efficient tools to discover the hidden semantics/structures in data.
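To make "discovering hidden structures in data" concrete, the following is a minimal sketch of fitting latent Dirichlet allocation (LDA), a standard topic model, to a toy corpus with collapsed Gibbs sampling. It illustrates one well-known inference method and is not the algorithm proposed in this talk. The function name `lda_gibbs`, the toy vocabulary and documents, and the hyperparameter values `alpha` and `eta` are assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is in topic k
    topic_total = [0] * num_topics                               # tokens assigned to topic k
    assign = []                                                  # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][j] + alpha) * (topic_word[j][w] + eta)
                    / (topic_total[j] + vocab_size * eta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates: topic mixtures per document and word distributions per topic.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    beta = [[(c + eta) / (topic_total[k] + vocab_size * eta) for c in topic_word[k]]
            for k in range(num_topics)]
    return theta, beta

# Hypothetical toy corpus: word ids index into `vocab`.
vocab = ["rocket", "kabul", "attack", "stock", "market", "shares"]
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3, 4], [4, 5, 3, 5], [0, 2, 3, 4]]
theta, beta = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))

for k, row in enumerate(beta):
    top = sorted(range(len(vocab)), key=lambda w: -row[w])[:3]
    print("topic", k, [vocab[w] for w in top])
for d, row in enumerate(theta):
    print("doc", d, [round(p, 2) for p in row])
```

Only the documents are observed here; the rows of `beta` (topics) and `theta` (topic mixtures) are the hidden structures that the sampler estimates.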
Hidden semantics (1)

Hidden semantics (2)
- Hidden evolutions

Hidden semantics (3)
- Meanings of pictures

Hidden semantics (4)
- Objects in pictures

Hidden semantics (5)
- Activities

Hidden semantics (6)
- Contents of medical images

Hidden semantics (7)
- Interactions of hidden entities

Hidden semantics (8)
- Communities in social networks

Recent applications (1)
- Boosting performance of search engines over the baseline [Wang et al., ACM TIST 2014]
[Figure: MAP versus number of topics.]

Recent applications (2)
- Boosting performance of online advertisement over the baseline [Wang et al., ACM TIST 2014]
[Figure: AUC improvement (%) versus number of topics. Caption from Wang et al.: "Fig. 11. Topic features improve the pCTR performance in online advertising systems."]

Topic models: some concepts (1)
- Topic: a set of semantically related words
- Document: a mixture of a few topics [Blei et al., JMLR 2003]
- Topic mixture: shows the proportions of topics in a document

Topic models: some concepts (2)
- In reality, we only observe the documents.
- The other structures (topics, mixtures, ...) are hidden.
- Those structures compose a topic model.

Topic models: learning
- The main aim is to infer the hidden variables, e.g., topics, relations, interactions, ...

Topic models: posterior inference

Rockets strike Kabul -- AP, August 8, 1990. More than a dozen rockets slammed into Afghanistan's capital of Kabul today, killing 14 people and injuring 10, Afghan state radio reported. No one immediately claimed responsibility for the attack. But the Radio Kabul broadcast, monitored in Islamabad, blamed "extremists," presumably referring to U.S.-backed guerrillas headquartered in Pakistan. Moslem insurgents have been fighting for more than a decade to topple Afghanistan's Communist-style government.
In the past year, hundreds of people have died and thousands more injured in rocket assaults on the Afghan capital.

How much do topics contribute to the news?
[Figure: contributions of topics to the article; topic labels shown: 50, 2, 38, 35, 31, 28, 10, 18, 17.]

Some topics previously learned from a collection of news:
  Topic 2: police, students, palestinians, curfew, sikh, gaza, rangoon, moslem, israeli, militants
  Topic 10: sinhalese, tamil, iranian, dam, khomeini, sri, cemetery, accord, wppss, guerrillas
  Topic 18: contra, sandinistas, chamorro, ortega, rebels, sandinista, aid, nicaragua, managua, ceasefire
  Topic 31: fire, winds, firefighters, mph, blaze, brush, homes, acres, water, weather
  Topic 38: shuttle, nasa, space, launch, magellan, mars, spacecraft, telescope, venus, astronauts
  Topic 40: gorbachev, soviet, republics, politburo, yeltsin, moscow, tass, party, treaty, grigoryants
  Topic 42: index, stock, yen, points, market, shares, trading, dow, unchanged, volume
  Topic 50: beirut, hezbollah, lebanon, aoun, syrian, militia, lebanese, amal, troops, wounded

- Infer the hidden variables for a given document, e.g.,
  - What topics/objects appear in it?
  - What are their contributions?

Recent trends in topic modeling
- Large-scale learning: learn models from huge corpora (e.g., 100 million documents).
- Sparse modeling: respect the sparse nature of texts.
- Nonparametric models: automatically grow the model size.
- Theoretical foundation: provide guarantees for learning and posterior inference.
- Incorporating meta-data: encode meta-data into a model.

Some challenges and lessons learnt

Challenges: first
- Can we develop a fast inference method that has provable theoretical guarantees on quality?
- Inference on each data instance:
  - What topics appear in a document?
  - What are they talking about?
  - What animals appear in a picture?
- Vital role in many probabilistic models:
  - Enables us to design fast algorithms for massive/stream data.
  - Ensures high confidence and reliability when using topic models in practice.
- But: inference is often intractable (NP-hard).

Challenges: second
- How can we learn a big topic model from big data?
- Big model:
  - Billions of variables/parameters
  - Which might not fit in the memory of a supercomputer
- Many applications lead to this problem:
  - Exploration of a century of literature
  - Exploration of online forums/networks
  - Analyzing political opinions
  - Tracking objects in videos
- But largely unexplored in the literature.

Challenges: third
- Can we develop methods with provable guarantees on quality for handling streaming/dynamic text collections?
- Many practical applications:
  - Analyzing political opinions in online forums
  - Analyzing behaviors and interests of online users
  - Identifying entities and temporal structures from news
- But: existing methods often lack a theoretical guarantee on inference quality.

Lessons: learnability
- In theory:
  - A model can be recovered exactly if the number of documents is sufficiently large [Anandkumar et al., NIPS 2012; Arora et al., FOCS 2012; Tang et al., ICML 2014].
  - It is impossible to guarantee learnability of a model when there are few documents.
- In practice [Tang et al., ICML 2014]:
  - Once there are sufficiently many documents, further increasing the number may not significantly improve the performance.
  - Documents should be long, but need not be too long.
  - A model performs well when the topics are well separated.

Lessons: practical effectiveness
- Collapsed Gibbs sampling (CGS):
  - Efficient
  - Better than VB and BP in large-scale applications [Wang et al., TIST 2014]
- Variational Bayes (VB) [Jiang et al., PAKDD 2015]:
  - Often slow
  - And inaccurate
- Belief propagation (BP):
  - Memory-intensive

Lessons: posterior inference
- Inference for individual texts:
  - Variational method (VB) [Blei et al., JMLR 2003]
  - Collapsed VB (CVB) [Teh et al., NIPS 2007]
  - CVB0 [Asuncion et al., UAI 2009]
  - Gibbs sampling [Griffiths & Steyvers, PNAS 2004]
  - Online Frank-Wolfe [Than & Doan, ACML 2014]
- It is often intractable in theory [Sontag & Roy, NIPS 2011].
- But it might be tractable in practice [Than & Doan, ACML 2014].
- Online Frank-Wolfe is an efficient algorithm that has provable guarantees on quality.

Our recent research (Than & Doan, ACML 2014)

Latent Dirichlet Allocation
- Latent Dirichlet Allocation (LDA) [Blei et al., JMLR 2003] is a widely used class of Bayesian networks.
  - Provides an efficient tool to analyze hidden themes in data
  - Helps us recover hidden structures/evolutions in big text collections and streaming data [Blei, Comm. 2012; Mimno, JCCH 2012]
- LDA is the core of a large family of probabilistic models.

Posterior inference in LDA
- Learning (Bayesian inference) from a corpus C:
  - Estimate the posterior distribution
  - β are the hidden topics.
  - Θ are the topic mixtures in documents.
- Posterior inference for a document d:
  - Estimate the joint distribution
- Those problems are intractable [Sontag & Roy, NIPS 2011].

Posterior inference: approaches
- Posterior inference for a document d:
  - Variational method (VB) [Blei et al., JMLR 2003]: $p(z_d, \theta_d, d \mid \beta, \alpha)$
  - Collapsed VB (CVB) [Teh et al., NIPS 2007]: $p(z_d, d \mid \beta, \alpha)$
  - CVB0 [Asuncion et al., UAI 2009]: $p(z_d, d \mid \beta, \alpha)$
  - Gibbs sampling [Griffiths & Steyvers, PNAS 2004]: $p(z_d, d \mid \beta, \alpha)$
- Our work: approximate $p(\theta_d, d \mid \beta, \alpha)$

Posterior inference: tractability
- Theoretical results for MAP inference: $\theta^* = \arg\max_{\theta_d} \Pr(\theta_d, d \mid \beta, \alpha)$
  - Intractable (NP-hard) in the worst case [Sontag & Roy, NIPS 2011]
  - Non-concave in general
- Our work: tractable (concave) under some conditions
  - High dimensionality (fits well with text modeling) (fits well with stream/online environments)
  - Long documents (similar to [Tang et al., ICML 2014])

Posterior inference: tractability
(Excerpt from "Dual online inference for latent Dirichlet allocation")

Corollary 7 (Concavity for long documents). Using the assumptions in Theorem 4:
  + If $n_d \ge (\sqrt{V} - \sqrt{K-1})^{-4} C^4 (1-\alpha)^2$, then problem (3) is concave with probability at least $1 - n_d^{-\frac{1}{4}(V-K+1)} - e^{-cV}$.
  + As $n_d \to +\infty$, problem (3) is concave with probability at least $1 - e^{-cV}$.
Proof. The first statement can be derived from Theorem 6 by choosing $\epsilon = C^{-1} n_d^{-1/4}$. The second statement thus follows.

Corollary 8 (Concavity for high dimensionality). Using the notations and assumptions in Theorem 4, let $K$ and $n_d$ be fixed. Then the MAP problem (3) is concave over $\Delta_K$ with probability 1 as $V \to +\infty$.

Implications in a broader context
- LDA is the core of a large family of probabilistic models
  - MAP inference is very likely tractable in practice
  - Hence might be solved easily

Posterior inference: algorithms
- Posterior inference for a document d:
  - Variational method (VB), Collapsed VB (CVB), CVB0, Gibbs sampling
  - But quality and convergence rate are unknown
- Our work: consider $\theta^* = \arg\max_{\theta_d} \Pr(\theta_d, d \mid \beta, \alpha)$
  - Online Frank-Wolfe algorithm (using stochastic zeroth- and first-order information)
  - Has a theoretical guarantee on quality and convergence rate

Posterior inference: OFW

Implications in a broader context
- A large family of probabilistic models:
  - Posterior inference can be done efficiently by OFW
  - And with a theoretical guarantee on quality (which VB, CVB, CVB0, and CGS do not have)

Large-scale learning for LDA
- Learning LDA from a massive dataset C:
  - The hidden topics β are often of practical interest.
- Approaches:
  - Parallel/distributed algorithms [Smola et al., VLDB 2010; Asuncion et al., Stat. Med. 2011]
  - Online learning [Hoffman et al., JMLR 2013; Mimno et al., ICML 2012; Foulds et al., KDD 2013; Patterson et al., NIPS 2013]
  - Streaming learning [Broderick et al., NIPS 2013]

Online learning: schemes
- Existing schemes for online learning:
  - The global variables β are learnt online (stochastically).
  - The local variables (z or θ) are approximated by
    - Variational method, or
    - Gibbs sampling [Mimno et al., ICML 2012]
- Our algorithm (DOLDA):
  - Both global and local variables are learnt online (stochastically)
  - There is a provable guarantee on quality when inferring local variables

Large-scale experiments
- Algorithms in comparison:
  - Stochastic variational inference (SVI) [Hoffman et al., JMLR 2013]
  - Streaming variational Bayes (SSU) [Broderick et al., NIPS 2013]
  - Dual online algorithm (DOLDA) [our work]
- Data:
  - Pubmed with 8 million documents
  - New York Times with 200K news articles
- Measures:
  - Coherence: semantic quality of a model
  - Predictive probability: predictiveness and generalization on new data

Experimental results
- DOLDA performed better than SVI and SSU in both generalization and semantic quality.
[Figure: log predictive probability and coherence versus documents seen (in millions), and learning hours versus log predictive probability, on Pubmed and New York Times, for DOLDA, SVI, and SSU.]

References
- Anima Anandkumar et al. A spectral algorithm for latent Dirichlet allocation. In NIPS, 2012.
- Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models--going beyond SVD. In FOCS, 2012.
- A. Asuncion, P. Smyth, and Max Welling. Asynchronous distributed estimation of topic models for document analysis. Statistical Methodology, 8(1):3–17, 2011.
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. JMLR, 3(3):993–1022, 2003.
- Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael Jordan. Streaming variational Bayes. In NIPS, pages 1727–1735, 2013.
- J. Foulds, L. Boyles, C. DuBois, P. Smyth, and Max Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In KDD, pages 446–454. ACM, 2013.
- T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228, 2004.
- Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- David Mimno. Computational historiography: Data mining in a century of classics journals. Journal on Computing and Cultural Heritage, 5(1):3, 2012.
- Alexander Smola and Shravan Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2):703–710, 2010.
- David Sontag and Daniel M. Roy. Complexity of inference in latent Dirichlet allocation. In NIPS, 2011.
- Jian Tang, Zhaoshi Meng, Xuanlong Nguyen, Qiaozhu Mei, and Ming Zhang. Understanding the limiting factors of topic modeling via posterior contraction analysis. In ICML, pages 190–198, 2014.
- Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, volume 19, page 1353, 2007.
- Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, ..., and J. Zeng. Peacock: Learning long-tail topic features for industrial applications. ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 4, Article 39, 2014.
- Zeng et al. A comparative study on parallel LDA algorithms in MapReduce framework. In PAKDD, 2015.

Thank you