CS224d: Deep Learning for Natural Language Processing

CS224d: Deep NLP
Lecture 15: Applications of DL to NLP
Richard Socher
richard@metamind.io

Overview
•  Model overview: how to properly compare to other models and choose your own model
•  Word representation
•  Phrase composition
•  Objective function
•  Optimization
•  Character RNNs on text and code
•  Morphology
•  Logic
•  Question Answering
•  Image – Sentence mapping
Model overview: Word Vectors
•  Random
•  Word2Vec
•  GloVe
•  Dimension – often defines the number of model parameters
•  Or work directly on characters or morphemes

Model overview: Phrase Vector Composition
•  Composition function governs how exactly word and phrase vectors interact to compose meaning
[Figure: child vectors c1 and c2 are combined through a matrix W into a parent vector p, from which a score s is computed.]
•  Averaging: p = a + b
•  Lots of simple alternatives
•  Recursive neural networks
•  Convolutional neural networks
•  Recurrent neural networks
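To make the simplest options above concrete, here is a minimal sketch (plain NumPy, with a hypothetical dimensionality) of averaging versus a single recursive-neural-network composition step p = f(W[a; b] + bias); the recursive version keeps the output the same size as its inputs, so it can be applied repeatedly up a tree.

```python
import numpy as np

d = 50  # hypothetical word-vector dimension
rng = np.random.default_rng(0)

# Two child vectors (word or phrase representations)
a, b = rng.standard_normal(d), rng.standard_normal(d)

def compose_average(a, b):
    """Averaging composition: p = (a + b) / 2 (the slide's p = a + b, rescaled)."""
    return (a + b) / 2

# Recursive NN composition: p = tanh(W [a; b] + bias)
W = rng.standard_normal((d, 2 * d)) * 0.01
bias = np.zeros(d)

def compose_rnn(a, b):
    return np.tanh(W @ np.concatenate([a, b]) + bias)

p_avg = compose_average(a, b)
p_rnn = compose_rnn(a, b)  # same dimensionality as the children, so it can be composed again
```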
Composition: Bigram and Recursive functions
•  Many related models are special cases of the MV-RNN
•  Mitchell and Lapata, 2010; Zanzotto et al., 2010
•  Baroni and Zamparelli (2010): A is an adjective matrix and b is a noun vector
•  The RNNs of Socher et al. 2011 (ICML, EMNLP, NIPS) are also special cases
•  Recursive neural tensor networks bring quadratic and multiplicative interactions between vectors

Additional choice for recursive neural nets
•  Dependency trees focus more on semantic structure
1.  Constituency tree
2.  Dependency tree
3.  Balanced tree

Composition: CNNs
•  Several variants also:
•  No pooling layers
•  Pooling layers: simple max-pooling or dynamic pooling
•  Pooling across different dimensions
•  Somewhat less explored in NLP than RNNs
•  Not linguistically nor cognitively plausible

Composition: Recurrent Neural Nets
•  Vanilla
•  GRU
•  LSTM
•  Many variants of LSTMs: "LSTM: A Search Space Odyssey" by Greff et al. 2015

Model overview: Objective function
•  Max-margin
•  Cross-entropy
•  Supervised: predict a class
•  Unsupervised: predict surrounding words
•  Auto-encoder
•  My opinion: unclear benefits for NLP
•  Unless encoding another modality

Optimization
•  Initialization (word vector and composition parameters)!!
•  Optimization algorithm
•  SGD
•  SGD + momentum
•  L-BFGS
•  AdaGrad
•  Adadelta
•  Optimization tricks
•  Regularization (some define it as part of the model)
•  Dropout

Character RNNs on text and code
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Character RNNs on text and code
•  Haven't yet produced useful results on real datasets
•  Shows that RNNs can memorize sequences and keep memory (mostly LSTMs)
•  Most interesting results simply train on a dataset and sample from it afterwards (first shown by Sutskever et al. 2011: Generating Text with Recurrent Neural Networks)
•  Results from an LSTM (karpathy.github.io) →

Shakespeare

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

Wikipedia

Naturalism and decision for the majority of Arab countries' capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham's sovereignty. His generals were the powerful ruler of the Portugal in the [[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth's Dajoard]], known in western [[Scotland]], near Italy to the conquest of India with the conflict.

Latex (had to be fixed manually)

Code! (Linux source code)
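For readers who want to see the mechanics behind such samples, here is a minimal character-level RNN sketch in plain NumPy (a vanilla cell with hypothetical sizes and untrained weights, not the LSTM behind the examples above): one recurrent step plus sampling from a softmax over characters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = sorted(set("hello world"))          # toy character vocabulary
char_to_ix = {c: i for i, c in enumerate(vocab)}
V, H = len(vocab), 32                       # vocab size, hidden size (hypothetical)

# Vanilla RNN parameters (an LSTM or GRU would replace the single tanh cell)
Wxh = rng.standard_normal((H, V)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((V, H)) * 0.01
bh, by = np.zeros(H), np.zeros(V)

def step(h, x_ix):
    """One RNN step: update the hidden state from the previous state and current char."""
    x = np.zeros(V); x[x_ix] = 1.0
    h = np.tanh(Wxh @ x + Whh @ h + bh)
    logits = Why @ h + by
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    return h, probs

def sample(seed_char, n):
    """Sample n characters after seed_char (gibberish before any training)."""
    h, ix = np.zeros(H), char_to_ix[seed_char]
    out = [seed_char]
    for _ in range(n):
        h, p = step(h, ix)
        ix = rng.choice(V, p=p)
        out.append(vocab[ix])
    return "".join(out)

print(sample("h", 20))
```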
Morphology
•  Better Word Representations with Recursive Neural Networks for Morphology – Luong et al. (slides from Luong)
•  Problem with word vectors: they poorly estimate rare and complex words.
[Table: nearest neighbors under Collobert & Weston (2010) and Huang et al. (2012) embeddings for words such as "distinct", "distinctness", "affect", and "unaffected" – rare and morphologically complex words get unrelated neighbors (e.g., "pesawat", "clefts", "roskam", "hitoshi").]

Limitations of existing work
•  Treat related words as independent entities.
•  Represent unknown words with a few vectors.
[Figure: word frequencies in Wikipedia docs (986m tokens) – a long-tailed histogram, with bucket counts ranging from 35,323 down to 108.]

Luong's approach – network structure
•  Neural language model score:  s = v^T f(W [x1; x2; ...; xn] + b)
•  Morphology model composition:  p = f(Wm [x_stem; x_affix] + bm)
[Figure: the neural language model (parameters W, b) scores the word sequence "unfortunately the bank was closed"; the morphology model (parameters Wm, bm) recursively builds "unfortunately" from un + fortunate + ly, using the same Wm, bm at every node.]
•  Neural Language Model: simple feed-forward network (Huang et al., 2012) with a ranking-type cost (Collobert et al., 2011).
•  Morphology Model: recursive neural network (Socher et al., 2011).
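A minimal sketch of the morphology model's recursive composition (plain NumPy, hypothetical morpheme vectors and sizes): a word vector is built bottom-up from its stem and affixes, reusing the same Wm and bm at every step.

```python
import numpy as np

d = 50                                   # hypothetical morpheme-vector dimension
rng = np.random.default_rng(0)

# Hypothetical morpheme embeddings (learned jointly in the real model)
morphemes = {m: rng.standard_normal(d) * 0.1 for m in ["un", "fortunate", "ly"]}

Wm = rng.standard_normal((d, 2 * d)) * 0.01
bm = np.zeros(d)

def compose(stem_vec, affix_vec):
    """One morphology-model step: p = f(Wm [x_stem; x_affix] + bm)."""
    return np.tanh(Wm @ np.concatenate([stem_vec, affix_vec]) + bm)

# "unfortunately" = (un + fortunate) + ly, built recursively
unfortunate = compose(morphemes["fortunate"], morphemes["un"])
unfortunately = compose(unfortunate, morphemes["ly"])
print(unfortunately.shape)               # (50,) – same size as a word vector
```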
Analysis
•  Blends word structure and syntactic-semantic information.
[Table: nearest neighbors under Collobert et al. (2011) vs. this work for "commenting", "unaffected", "distinct", "distinctness", "heartlessness", and "saudi-owned" – e.g., "commenting" moves from "insisting, insisted, focusing" to "commented, comments, criticizing", and "saudi-owned" from "avatar, mohajir, kripalani" to "saudi-based, syrian-controlled".]

Solutions to the problem of polysemous words
•  Improving Word Representations Via Global Context And Multiple Word Prototypes by Huang et al. 2012

Natural language inference
Claim: simple task to define, but it engages the full complexity of compositional semantics:
•  Lexical entailment
•  Quantification
•  Coreference
•  Lexical/scope ambiguity
•  Commonsense knowledge
•  Propositional attitudes
•  Modality
•  Factivity and implicativity

First training data
Training data:
•  dance entails move
•  waltz neutral tango
•  tango entails dance
•  sleep contradicts dance
•  waltz entails dance
Memorization (training set):
•  dance ??? move
•  waltz ??? tango
Generalization (test set):
•  sleep ??? waltz
•  tango ??? move
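To make the memorization/generalization split concrete, here is a tiny sketch (Python, relation names taken from the slide) of the kind of data such a model is trained and tested on; the test answers follow from composing the training facts (e.g., tango entails dance and dance entails move, so tango entails move).

```python
# Toy lexical-relation dataset in the spirit of the slide above.
train = [
    ("dance", "entails", "move"),
    ("waltz", "neutral", "tango"),
    ("tango", "entails", "dance"),
    ("sleep", "contradicts", "dance"),
    ("waltz", "entails", "dance"),
]

# Memorization: pairs seen in training (the model should recover the stored label).
memorization_queries = [("dance", "move"), ("waltz", "tango")]

# Generalization: unseen pairs whose labels follow from composing training facts,
# e.g. tango entails dance and dance entails move  =>  tango entails move.
generalization_queries = [("sleep", "waltz"), ("tango", "move")]
```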
Natural language inference: definitions!

A minimal NN for lexical relations
•  Words are learned embedding vectors.
•  One plain RNN or RNTN layer.
•  Softmax emits relation labels.
•  Learn everything with SGD (a minimal sketch follows at the end of this section).

Recursion in propositional logic
Experimental approach: train on relational statements generated from some formal system, test on other such relational statements. The model needs to:
•  Learn the relations between individual words (lexical relations).
•  Learn how lexical relations impact phrasal relations.
•  This needs to be recursively applicable!
•  a ≡ a,  a ^ (not a),  a ≡ (not (not a)), ...

Natural language inference with RNNs
•  Two trees + learned comparison layer, then a classifier

Question Answering: Quiz Bowl Competition
•  QUESTION: He left unfinished a novel whose title character forges his father's signature to get out of school and avoids the draft by feigning desire to join. A more famous work by this author tells of the rise and fall of the composer Adrian Leverkühn. Another of his novels features the Jesuit Naphta and his opponent Settembrini, while his most famous work depicts the aging writer Gustav von Aschenbach. Name this German author of The Magic Mountain and Death in Venice.
•  ANSWER: Thomas Mann
•  Iyyer et al. 2014: A Neural Network for Factoid Question Answering over Paragraphs

Recursive Neural Networks
•  Follow dependency structure

Pushing Facts into Entity Vectors

Model Can Defeat Human Players
[Figure 4, Iyyer et al. 2014: comparison of qanta+ir-wiki to human quiz bowl players; each bar represents an individual human. Qanta substantially improves an IR model with access to Wikipedia, despite being trained on much less data, restricted to just the question text.]

Literature Questions are Hard!
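Here is the minimal lexical-relations network sketched, as promised above: two learned word embeddings feed one plain neural-network layer (an RNTN would add a bilinear tensor term), and a softmax emits the relation label. Sizes, initialization, and the exact layer are illustrative assumptions in plain NumPy, not the published model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 25, 80                        # embedding and hidden sizes (hypothetical)
relations = ["entails", "contradicts", "neutral"]
vocab = ["dance", "move", "waltz", "tango", "sleep"]

# Learned parameters (randomly initialized here; trained with SGD in practice)
E = {w: rng.standard_normal(d) * 0.1 for w in vocab}   # word embeddings
W = rng.standard_normal((h, 2 * d)) * 0.01             # plain NN layer
b = np.zeros(h)
U = rng.standard_normal((len(relations), h)) * 0.01    # softmax weights
c = np.zeros(len(relations))

def predict_relation(w1, w2):
    """Forward pass: embed both words, one tanh layer, softmax over relation labels."""
    x = np.concatenate([E[w1], E[w2]])
    hidden = np.tanh(W @ x + b)      # an RNTN layer would add bilinear (tensor) terms in x
    logits = U @ hidden + c
    p = np.exp(logits - logits.max()); p /= p.sum()
    return relations[int(np.argmax(p))], p

label, probs = predict_relation("tango", "move")
print(label, probs)
```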
Visual Grounding
•  Idea: map sentences and images into a joint space
•  Socher et al. 2013: Grounded Compositional Semantics for Finding and Describing Images with Sentences

Discussion: Compositional Structure
•  Recursive neural networks have so far used constituency trees, which results in more syntactically influenced representations
•  Instead: use dependency trees, which capture more semantic structure

Convolutional Neural Network for Images
•  CNN trained on ImageNet (Le et al. 2013)
•  RNN trained to give large inner products between sentence and image vectors:

Results
[Example image–sentence retrieval results, with correct matches (✓) and incorrect matches (✗); images omitted.]
Results

Model                                        Describing Images    Image Search
                                             (mean rank)          (mean rank)
Random                                       92.1                 52.1
Bag of Words                                 21.1                 14.6
CT-RNN                                       23.9                 16.1
Recurrent Neural Network                     27.1                 19.2
Kernelized Canonical Correlation Analysis    18.0                 15.9
DT-RNN                                       16.9                 12.5
(Mean rank: lower is better.)

Image – Sentence Generation (!)
•  Several models came out simultaneously in 2015 that follow up on this work
•  They replace the recursive neural network with an LSTM and, instead of only finding vectors, generate the description
•  Mostly memorized training sequences (the outputs become similar again)
•  Donahue et al. 2015: Long-term Recurrent Convolutional Networks for Visual Recognition and Description
•  Karpathy and Fei-Fei 2015: Deep Visual-Semantic Alignments for Generating Image Descriptions

Next Lecture
•  The future (?) of deep learning for NLP
•  No video