CS224d Deep NLP
Lecture 6: Neural Tips and Tricks + Recurrent Neural Networks
Richard Socher
richard@metamind.io

Overview
Today:
•  Useful NNet techniques / tips and tricks:
    •  Multi-task learning
    •  Nonlinearities
    •  Finite difference gradient check
    •  Momentum, AdaGrad
•  Language Models
•  Recurrent Neural Networks

Lecture 1, Slide 2 Richard Socher 4/15/15

Deep Learning General Strategy and Tricks

Multi-task learning / Weight sharing
•  Similar to the neural network from last class, but replaces the single scalar score with a softmax classifier
•  Training is again done via backpropagation, which gives an error signal similar to the one from the score in the score-based model
•  NLP (almost) from scratch, Collobert et al. 2011
[Figure: network with inputs x1, x2, x3 and bias +1, hidden units a1, a2, and softmax outputs c1, c2, c3]

The Model - Training
•  We already know the softmax classifier and how to optimize it
•  The interesting twist in deep learning is that the input features x are also learned, similar to learning with a score
[Figure: left, scoring network with score s, weights U2 and W23, hidden units a1, a2, inputs x1, x2, x3 and bias +1; right, the same network with softmax weights S and outputs c1, c2, c3]

The Model - Training
•  Main additional idea: We can share both the word vectors AND the hidden layer weights. Only the softmax weights are different.
•  Cost function is just the sum of two cross entropy errors
[Figure: two networks over the same inputs x1, x2, x3, +1 and hidden units a1, a2; one with softmax weights S1 and outputs c1, c2, c3, the other with softmax weights S2 and outputs c4, c5, c6]

The secret sauce is the unsupervised word vector pre-training on a large text collection

                                                        POS WSJ (acc.)   NER CoNLL (F1)
State-of-the-art*                                       97.24            89.31
Supervised NN                                           96.37            81.47
Word vector pre-training followed by supervised NN**    97.20            88.87
+ hand-crafted features***                              97.29            89.59

* Representative systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with 11 word window, 100 unit hidden layer, then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER

Supervised refinement of the unsupervised word representation helps

                            POS WSJ (acc.)   NER CoNLL (F1)
Supervised NN               96.37            81.47
NN with Brown clusters      96.92            87.15
Fixed embeddings*           97.10            88.87
C&W 2011**                  97.29            89.59

* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is the unsupervised pre-train + supervised NN + features model of the last slide

General Strategy for Successful NNets
1.  Select network structure appropriate for problem
    1.  Structure: Single words, fixed windows, bag of words, recursive vs. recurrent, CNN, sentence based vs. document
    2.  Nonlinearity
2.  Check for implementation bugs with gradient checks
3.  Parameter initialization
4.  Optimization tricks
5.  Check if the model is powerful enough to overfit
    1.  If not, change model structure or make model "larger"
    2.  If you can overfit: Regularize

Non-linearities: What's used
•  logistic ("sigmoid")
•  tanh
[Figure: plots of the logistic and tanh functions]
tanh is just a rescaled and shifted sigmoid: tanh(z) = 2 logistic(2z) − 1
tanh often performs well for deep nets
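The identity tanh(z) = 2 logistic(2z) − 1 stated above can be checked numerically. The following is a small sketch using only the Python standard library; the function names `logistic` and `tanh_via_logistic` are illustrative and not taken from the lecture.

```python
import math


def logistic(z: float) -> float:
    """Logistic ("sigmoid") function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_via_logistic(z: float) -> float:
    """tanh written as a rescaled and shifted logistic: 2 * logistic(2z) - 1."""
    return 2.0 * logistic(2.0 * z) - 1.0


if __name__ == "__main__":
    # Compare the library tanh with the rescaled logistic at a few points.
    for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(f"z={z:+.1f}  tanh={math.tanh(z):+.6f}  "
              f"2*logistic(2z)-1={tanh_via_logistic(z):+.6f}")
```

The two columns should agree up to floating-point rounding, which is what "rescaled and shifted" means here: the output range (0, 1) of the logistic is stretched to (−1, 1) and the input is scaled by 2.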
For many models, tanh is the best!
•  In comparison to sigmoid:
    •  At initialization: values close to 0
    •  Faster convergence in practice
•  Like sigmoid: nice derivative: f'(z) = 1 − tanh²(z)

Non-linearities: There are various other choices
•  hard tanh
•  soft sign
•  rectified linear (ReLu)
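As a rough illustration of the nonlinearities named above, the sketch below uses their commonly cited definitions (hard tanh clips to [−1, 1], soft sign is z / (1 + |z|), rectified linear is max(0, z)); these definitions are assumed here rather than quoted from the slides. It also compares the tanh derivative 1 − tanh²(z) against a central finite difference. All function names are illustrative.

```python
import math


def hard_tanh(z: float) -> float:
    """Piecewise-linear tanh approximation: clip z to the interval [-1, 1]."""
    return max(-1.0, min(1.0, z))


def softsign(z: float) -> float:
    """Soft sign: z / (1 + |z|), which saturates more slowly than tanh."""
    return z / (1.0 + abs(z))


def relu(z: float) -> float:
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def tanh_grad(z: float) -> float:
    """Analytic derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


def central_difference(f, z: float, eps: float = 1e-5) -> float:
    """Two-sided finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


if __name__ == "__main__":
    for z in (-2.0, -0.3, 0.0, 0.3, 2.0):
        print(z, hard_tanh(z), softsign(z), relu(z),
              tanh_grad(z), central_difference(math.tanh, z))
```

The last two printed columns should agree closely, which is the same comparison a finite difference gradient check performs on a full model.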
If your gradient check fails and you don't know why, what now?
Simplify your model until you have no bug!
Example: Start from the simplest model, then work up to what you want:
•  Only softmax on fixed input
•  Backprop into word vectors and softmax
•  Add single unit single hidden layer
•  Add multi unit single layer
•  Add bias
•  Add second layer single unit
•  Add two softmax units

General Strategy
1.  Select appropriate Network Structure
    1.  Structure: Single words, fixed windows vs. recursive, sentence based vs. bag of words
    2.  Nonlinearity
2.  Check for implementation bugs with gradient check
3.  Parameter initialization
4.  Optimization tricks
5.  Check if the model is powerful enough to overfit
    1.  If not, change model structure or make model "larger"
    2.  If you can overfit: Regularize

Parameter Initialization
•  Initialize hidden layer biases to 0 and output (or reconstruction) biases to the optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).
•  Initialize weights ∼ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size): r = sqrt(6 / (fan-in + fan-out)) for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]

Stochastic Gradient Descent (SGD)
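A minimal sketch of one SGD step, assuming the standard update θ ← θ − α ∇J(θ) and assuming that the gradient is clipped by rescaling it whenever its L2 norm exceeds a threshold (elementwise clipping is a common alternative). The names `clip_by_norm`, `sgd_step`, `lr`, and `max_norm` are illustrative and not taken from the lecture.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (assumed clipping rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)


def sgd_step(theta, grad, lr=0.1, max_norm=5.0):
    """One SGD update: theta <- theta - lr * clipped gradient."""
    clipped = clip_by_norm(grad, max_norm)
    return [t - lr * g for t, g in zip(theta, clipped)]


if __name__ == "__main__":
    theta = [1.0, 1.0]
    grad = [30.0, 40.0]  # norm 50, well above max_norm
    print(sgd_step(theta, grad, lr=0.1, max_norm=5.0))
```

Clipping by norm keeps the update direction and only limits its length, so a single exploding gradient cannot throw the parameters far away.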
to a maximum value. Makes a big difference in RNNs.

General Strategy
1.  Select appropriate Network Structure
    1.  Structure: Single words, fixed windows vs. recursive, sentence based vs. bag of words
    2.  Nonlinearity
2.  Check for implementation bugs with gradient check
3.  Parameter initialization
4.  Optimization tricks
5.  Check if the model is powerful enough to overfit
    1.  If not, change model structure or make model "larger"
    2.  If you can overfit: Regularize

Assuming you have found the right network structure, implemented it correctly, optimized it properly, and can make your model overfit on your training data: now it's time to regularize.

Prevent Overfitting: Model Size and Regularization
•  Simple first step: Reduce model size by lowering the number of units and layers and other parameters
•  Standard L1 or L2 regularization on weights
•  Early Stopping: Use the parameters that gave the best validation error
•  Sparsity constraints on hidden activations, e.g., by adding a penalty term to the cost

Prevent Feature Co-adaptation
Dropout (Hinton et al. 2012)
•  Training time: at each instance of evaluation (in online SGD-training), randomly set 50% of the inputs to each neuron to 0
•  Test time: halve the model weights (twice as many inputs are now active)
•  This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
•  A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all others)
•  Can be thought of as a form of model bagging
•  It also acts as a strong regularizer

Deep Learning Tricks of the Trade
•  Y. Bengio (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures"
•  Unsupervised pre-training
•  Stochastic gradient descent and setting learning rates
•  Main hyper-parameters
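The dropout procedure described above can be sketched for a single linear neuron as follows. This is a hedged illustration in plain Python, not the implementation from Hinton et al. 2012; the names `dropout_mask`, `neuron_train`, `neuron_test`, and `p_drop` are assumptions made for the example.

```python
import random


def dropout_mask(n_inputs, rng, p_drop=0.5):
    """Sample a 0/1 mask; each input is dropped independently with prob. p_drop."""
    return [0.0 if rng.random() < p_drop else 1.0 for _ in range(n_inputs)]


def neuron_train(inputs, weights, rng, p_drop=0.5):
    """Training-time pre-activation: a fresh random mask zeroes some inputs."""
    mask = dropout_mask(len(inputs), rng, p_drop)
    return sum(w * m * x for w, m, x in zip(weights, mask, inputs))


def neuron_test(inputs, weights, p_drop=0.5):
    """Test-time pre-activation: keep every input, scale weights by 1 - p_drop."""
    keep = 1.0 - p_drop
    return sum(keep * w * x for w, x in zip(weights, inputs))


if __name__ == "__main__":
    rng = random.Random(0)
    x, w = [1.0, 2.0], [0.5, 0.5]
    n = 20000
    avg_train = sum(neuron_train(x, w, rng) for _ in range(n)) / n
    print("average training-time output:", avg_train)
    print("test-time output with halved weights:", neuron_test(x, w))
```

With p_drop = 0.5, scaling the weights by 1 − p_drop is exactly the "halve the model weights" rule: the test-time output then matches the average training-time output over many random masks.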
