Selection Bias, Label Bias, and Bias in Ground Truth
Part II: Sample Selection Bias
Anders Søgaard, Barbara Plank and Dirk Hovy

Sample Selection Bias
The CROSS-DOMAIN GULF
“domain adaptation” or “transfer learning”
domain, genre, time, … → differences in P(x)

Off-the-shelf POS tagger
http://cogcomp.cs.illinois.edu/demo/pos/
The/DT share/NN rose/VBD to/TO 10/CD $/$ a/DT unit/NN ./.
May/NNP I/PRP brrow/VBP 10bucks/UH

First, a few words on terminology… what do we call it?

General ML trichotomy
1. supervised ML: labeled DATA
2. semi-supervised ML: labeled DATA + unlabeled DATA
3. unsupervised ML: unlabeled DATA

Domain Adaptation: 4
1. supervised DA (e.g. Daumé, 2007): labeled SOURCE + labeled TARGET
2. semi-supervised DA (e.g. Daumé, 2010; Chang, Connor & Roth, 2010): labeled SOURCE + labeled TARGET + unlabeled TARGET
3. unsupervised DA (e.g. Blitzer et al., 2007; McClosky et al., 2008): labeled SOURCE + unlabeled TARGET
4. blind/unknown DA (e.g. Søgaard & Johannsen, 2012; Plank & Moschitti, 2013; Elming et al., 2014): labeled SOURCE; TARGET is UNKNOWN at test time
Timeline: before 2010 (earlier settings), 2012 onwards (blind/unknown DA)

Roadmap
Settings: labeled SOURCE + unlabeled TARGET; labeled SOURCE + UNKNOWN target
Topics: semi-supervised learning, importance weighting, adversarial learning, distant supervision

Semi-supervised machine learning to address the biased selection of sentences (x)

Semi-supervised learning (SSL)
How can it help us to bridge the cross-domain gulf?
labeled SOURCE + unlabeled TARGET: implicitly adapting by adding newly labeled data from TARGET
✓ if gulf is not too wide

Self-training
1. train ML on labeled SOURCE
2. test: use ML to label unlabeled TARGET
3. add the newly labeled TARGET data
4. re-train, iterate
Parameters & Variants:
- pool size, number of iterations
- select (only most confident)
- add with weight
- (in)delible

Delible self-training
L0 instead of L (Abney, 2007)

Self-training
Pros
✓ Simple wrapper method
✓ Can correct bias to some extent (if expected error on target is low / gulf not too wide)
Cons
‣ many parameters
‣ might introduce more bias (both selection and label bias)

Co-training
• similar to self-training but with two views
• two classifiers (ML1, ML2) labeling data for each other
[Diagram: ML1 and ML2 train on labeled SOURCE, test on unlabeled TARGET and label for each other, yielding labeled TARGET]
Pros
✓ simple wrapper method
✓ often less sensitive to mistakes than self-training
Cons
‣ computationally more expensive (ensemble)
‣ many parameters
‣ two views not always available

Tri-training
[Diagram: ML1, ML2, ML3; where two classifiers agree, the example is added for the third]
Pros
✓ same advantages as co-training
✓ fewer parameters
Cons
‣ again, ensemble method
‣ many parameters
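
The self-training loop above (train, label, select the most confident, add data, re-train, iterate) can be sketched in a few lines. This is only an illustrative sketch, not the setup used in any of the cited papers: `SuffixTagger` is a hypothetical toy learner standing in for "ML", the names `self_train`, `threshold` and `delible` are assumptions made for this example, and the `delible` branch reflects one reading of "L0 instead of L" (each round restarts from the original labeled set).

```python
from collections import Counter, defaultdict


class SuffixTagger:
    """Toy stand-in for "ML": most frequent tag per word, backing off to the last two letters."""

    def fit(self, examples):
        self.by_word = defaultdict(Counter)
        self.by_suffix = defaultdict(Counter)
        self.overall = Counter()
        for word, tag in examples:
            w = word.lower()
            self.by_word[w][tag] += 1
            self.by_suffix[w[-2:]][tag] += 1
            self.overall[tag] += 1
        return self

    def predict(self, word):
        """Return (tag, confidence); confidence is a relative frequency, 0.0 if nothing matches."""
        w = word.lower()
        for table, key in ((self.by_word, w), (self.by_suffix, w[-2:])):
            counts = table.get(key)
            if counts:
                tag, n = counts.most_common(1)[0]
                return tag, n / sum(counts.values())
        return self.overall.most_common(1)[0][0], 0.0


def self_train(source, target_pool, iterations=5, threshold=0.9, delible=False):
    """source: list of (word, tag); target_pool: list of unlabeled words."""
    labeled = list(source)
    pool = list(target_pool)
    model = SuffixTagger().fit(labeled)                # train
    for _ in range(iterations):                        # iterate
        predictions = [(word, *model.predict(word)) for word in pool]   # label
        confident = [(w, t) for w, t, conf in predictions if conf >= threshold]  # select
        if not confident:
            break
        if delible:
            # delible variant: start again from the original labeled set each round
            labeled = list(source) + confident
        else:
            # indelible variant: automatic labels are kept once added
            labeled.extend(confident)                  # add data
            added = {w for w, _ in confident}
            pool = [w for w in pool if w not in added]
        model = SuffixTagger().fit(labeled)            # re-train
    return model, labeled


source = [("walking", "VERB"), ("talking", "VERB"), ("table", "NOUN"), ("cable", "NOUN")]
model, labeled = self_train(source, ["running", "fable", "lol"])
print(labeled[len(source):])
```

In this toy run only the target words the model is confident about are added; a word with no matching evidence stays in the pool, which mirrors the "only most confident" selection parameter.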

Implicit use of unlabeled data
labeled SOURCE → train ML
unlabeled TARGET → unsupervised learning → add features
- Brown clusters (e.g., Koo et al., 2008; Turian et al., 2010)
- count/predict (distr. sim./embeddings) (e.g., Mikolov et al., 2013; Baroni et al., 2014; Johannsen et al., 2014)

WHAT ELSE CAN WE DO TO CORRECT SAMPLE BIAS?
Drop!
- data points
- features

Importance weighting (IW)
[Diagram: train on SOURCE, test on TARGET; unlabeled TARGET is available]
assign instance-dependent weights (Shimodaira, 2000)
approximation, e.g.: domain classifier to discriminate between SOURCE & TARGET (Zadrozny, 2004; Bickel and Scheffer, 2007; Søgaard and Haulrich, 2011)
Pros
✓ simple idea
✓ works well if we know how our sample differs
✓ also useful to combat label bias (more on this later)
Cons
‣ challenge is to find a good weight function
‣ finite sample: can overcome bias only to a certain extent

Importance weighting in NLP
Only 4 NLP studies¹, of which 2 on unsupervised DA, with mixed results
Does importance weighting work for unsupervised DA of POS taggers?
¹ (Jiang & Zhai, 2007; Foster et al., 2010; Søgaard & Haulrich, 2011; Plank & Moschitti, 2013)

(Plank, Johannsen, Søgaard, 2014) EMNLP
Domain classifier
- representation (Søgaard & Haulrich, 2011)
- n-gram size
Random weighting
• Setup: Google Web Treebank, Universal POS, weighted structured perceptron
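
To make the domain-classifier approximation concrete: under covariate shift the instance weight is usually taken to be the density ratio w(x) = P_target(x) / P_source(x), and a classifier that discriminates SOURCE from TARGET gives an estimate of that ratio. The sketch below is an assumption-laden illustration, not the classifier used in the cited experiments: it uses a unigram naive Bayes model with add-alpha smoothing and equal domain priors, and it averages the log ratio per token (a choice made here so that long sentences do not receive extreme weights). The function names are hypothetical.

```python
import math
from collections import Counter


def unigram_logprob(sentences, vocab, alpha=1.0):
    """Smoothed unigram log-probabilities over a shared vocabulary."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values()) + alpha * len(vocab)
    return {tok: math.log((counts[tok] + alpha) / total) for tok in vocab}


def importance_weights(source, target, alpha=1.0):
    """Return one weight per SOURCE sentence, approximating P_target(x) / P_source(x)."""
    vocab = {tok for sent in source + target for tok in sent}
    log_ps = unigram_logprob(source, vocab, alpha)
    log_pt = unigram_logprob(target, vocab, alpha)
    weights = []
    for sent in source:
        # per-token average of the log density ratio (geometric mean of token ratios)
        log_ratio = sum(log_pt[t] - log_ps[t] for t in sent) / max(len(sent), 1)
        weights.append(math.exp(log_ratio))
    return weights


source = [["the", "share", "rose"], ["lol", "u", "there"]]   # labeled SOURCE (labels omitted)
target = [["lol", "omg", "u"], ["omg", "lol"]]               # unlabeled TARGET
weights = importance_weights(source, target)
print(weights)  # the second source sentence looks more target-like and gets the larger weight
```

Such weights would then scale each source sentence's contribution during training, for example by multiplying the update of a weighted perceptron.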

(Plank, Johannsen, Søgaard, 2014) EMNLP
Results: token-based domain classifier
[Bar chart: accuracy on test sets (axis ticks 92, 94, 96) for baseline, 1-gram, 2-gram, 3-gram and 4-gram weighting on answers, reviews, emails, weblogs, newsgroups]
NONE IS SIGNIFICANTLY BETTER THAN BASELINE
Results were similar for other representations (Brown, Wiktionary).

domain        avg tag ambiguity   KL-div   OOV
answers       1.09                0.05     27.7
reviews       1.07                0.04     29.5
emails        1.07                0.03     29.9
weblogs       1.05                0.01     22.1
newsgroups    1.05                0.01     23.1
Annotations: tag ambiguity low, KL-div low, high OOV!

(Plank, Johannsen, Søgaard, 2014) EMNLP
Random weighting
[Plots: uniform, stdexp and Zipfian random weights, 500 runs in each plot, with baseline and significance cutoff marked]
NONE IS SIGNIFICANTLY BETTER THAN BASELINE

What if we don’t know the target?

Swamping / Feature dropout
Motivation: (ALVINN)
Problem: feature swamping (Sutton et al. 2006)
Idea: corrupt features

Data Corruption
[Figure: original binary feature matrix next to corrupted copies of the data in which some active features (1s) are missing]

Dropout
P: vector indicating how “active” each feature is
• binomial dropout (Søgaard & Johannsen, 2012): sample P from random binomial (“hard dropout”, 0/1)
• Zipfian corruptions (Søgaard, 2013a): P is inverse Zipfian distribution (“soft dropout/feature importance weighting”)
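
The "hard" binomial variant can be illustrated as follows. This is a minimal sketch under stated assumptions, not the cited implementation: it assumes one 0/1 vector P is sampled per corrupted copy of the data set (so a dropped feature is switched off for every instance in that copy), and `binomial_dropout`, `keep_prob` and `copies` are names invented for this example.

```python
import random


def binomial_dropout(X, keep_prob=0.9, copies=3, seed=0):
    """Return `copies` corrupted versions of X (list of feature vectors), stacked row-wise."""
    rng = random.Random(seed)
    n_features = len(X[0])
    corrupted = []
    for _ in range(copies):
        # P holds one 0/1 entry per feature; 0 switches the feature off in this copy
        P = [1 if rng.random() < keep_prob else 0 for _ in range(n_features)]
        corrupted.extend([x * p for x, p in zip(row, P)] for row in X)
    return corrupted


X = [[1, 1, 0, 1],
     [0, 1, 1, 1]]
for row in binomial_dropout(X, keep_prob=0.5, copies=2, seed=1):
    print(row)
```

A learner would then be trained on the original rows together with the corrupted ones, with the labels repeated for each copy, so that it cannot rely on any single feature being present.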

(Søgaard 2013b) Antagonistic adversaries
• It’s the predictive features that swamp. Let adversaries focus where it hurts the most.
• randomly drop predictive features, i.e. features whose weight is more than one stdev away from the mean

Results: Dropout
[Bar chart: accuracy (axis ticks 92, 93.6, 95.2) for baseline, binomial, Zipfian and Adversarial on answers, reviews, emails, weblogs, newsgroups]
GWEB data, universal POS tags, drop-out: average over 5 runs
correlation POS: answers 77%, reviews 82%, emails 92%, weblogs 96%, newsgroups 96%
does not help on domains very similar to SRC

Another view on dropout (Hinton et al., 2012; Wager, Wang & Liang, 2013)
Ensemble methods (e.g., Netflix challenge)
dropout ~ model averaging ~ regularization
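
One possible reading of the antagonistic-adversary rule (drop predictive features, i.e. those whose weight lies more than one standard deviation from the mean) is sketched below. How the rule is interleaved with online learning is not specified on the slides, so this only shows the mask computation for a given weight vector; `antagonistic_mask` and `drop_prob` are hypothetical names, and the use of the population standard deviation is an assumption.

```python
import random
import statistics


def antagonistic_mask(weights, drop_prob=0.5, rng=None):
    """Return a 0/1 mask: predictive features are dropped (0) with probability drop_prob."""
    rng = rng or random.Random(0)
    mean = statistics.mean(weights)
    stdev = statistics.pstdev(weights)
    mask = []
    for w in weights:
        predictive = abs(w - mean) > stdev      # far from the mean weight
        dropped = predictive and rng.random() < drop_prob
        mask.append(0 if dropped else 1)
    return mask


weights = [0.0, 0.1, -0.1, 5.0, 0.05]   # toy model weights; only the fourth is "predictive"
print(antagonistic_mask(weights, drop_prob=1.0))
```

Compared with binomial dropout, the corruption here is targeted: features the current model does not rely on are never removed.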

Distant supervision (Snow, Jurafsky, Ng, 2005; Mintz, Bills, Snow, Jurafsky, 2009)
• Distantly supervised: use a large knowledge base (KB) to create noisily labeled instances
• Idea: if entity1 and entity2 are found in the same sentence and rel(entity1,entity2) ∈ KB ➙ positive training instance
• Exploiting some kind of “world knowledge”
• Like type-constraints in sequence tagging (Täckström et al., 2013)
  Example: The food is good at COLING
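
The labeling rule for distant supervision (two entities in the same sentence whose relation is in the KB yield a positive training instance) is simple enough to sketch directly. The data structures here are assumptions made for illustration: sentences are given with their entity mentions already identified, and the KB is a plain dictionary from ordered entity pairs to a relation name. The KB entry is a toy example.

```python
from itertools import permutations


def distant_labels(sentences, kb):
    """sentences: list of (tokens, entities); kb: dict mapping (entity1, entity2) -> relation."""
    instances = []
    for tokens, entities in sentences:
        for e1, e2 in permutations(entities, 2):
            relation = kb.get((e1, e2))
            if relation is not None:
                # noisy positive instance: the sentence may not actually express the relation
                instances.append((tokens, e1, e2, relation))
    return instances


kb = {("Nintendo", "Kyoto"): "headquartered_in"}   # toy knowledge base
sentences = [
    (["Nintendo", "is", "based", "in", "Kyoto", "."], ["Nintendo", "Kyoto"]),
    (["Nintendo", "announced", "a", "game", "."], ["Nintendo"]),
]
for instance in distant_labels(sentences, kb):
    print(instance)
```

Only the first sentence produces an instance; the second mentions a single entity, so no pair can be looked up.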

Type constraints
Can it help us bridge the cross-domain gulf?
- POS tagging (Plank, Johannsen, Søgaard, 2014) EMNLP: helped? YES
- Supersense tagging (Johannsen et al., 2014) *SEM (talk yesterday by Dirk): helped? YES

Semi-supervised learning (SSL)
How can it help us to bridge the cross-domain gulf?
labeled SOURCE + unlabeled TARGET
✓ if gulf is not too wide
✓ OR combined with distant supervision (“extra ingredient”)

Talk: Thursday 28 Aug, 14:50, room Theatre
Adapting taggers to Twitter using not-so-distant supervision
joint work with Dirk Hovy, Ryan McDonald, Anders Søgaard

Idea
Tweet: #Localization #job Supplier / Project Manager - Localisation Vendor - NY, NY, United States http://bit.ly/16KigBg #nlppeople
  initial tags: NN NN JJR : . VB NN . NNP NN . NNP NNP NNP NNP NN NNP
  20%
Linked page (URL): … The Supplier / Project Manager performs the …
  tags: NN NN . NN NN VBZ DET …
Projected onto the tweet: Supplier JJR → NN, Project VB → NN

Same for NER
Tweet: Prey Developer worked with Nintendo on project http://bit.ly/17Kbsf
  initial tags: O O O O B-PER O O O
Linked page (URL): … In a statement , Nintendo announced that …
  tags: O O O O B-ORG O O
Projected onto the tweet: Nintendo B-PER → B-ORG

Setup
1. tag
2. tag (tweets and the pages they link to)
3. project
4. add data
= augmented self-training
NB: URLs not required at testing time!
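
The projection step of the setup can be illustrated with a small sketch. It rests on an assumption that is not spelled out on the slides: for every tweet token that also occurs on the linked page, the tag most frequently assigned to that word on the page replaces the tweet's initial tag. The function name `project_tags` is hypothetical, and the toy tags are loosely modelled on the Supplier / Project Manager example.

```python
from collections import Counter, defaultdict


def project_tags(tweet, page):
    """tweet, page: lists of (token, tag). Returns the tweet with tags projected from the page."""
    page_tags = defaultdict(Counter)
    for token, tag in page:
        page_tags[token.lower()][tag] += 1
    projected = []
    for token, tag in tweet:
        seen = page_tags.get(token.lower())
        if seen:
            # overwrite the initial tag with the page's most frequent tag for this word
            tag = seen.most_common(1)[0][0]
        projected.append((token, tag))
    return projected


tweet = [("Supplier", "JJR"), ("/", ":"), ("Project", "VB"), ("Manager", "NN")]
page = [("The", "DET"), ("Supplier", "NN"), ("/", "."), ("Project", "NN"),
        ("Manager", "NN"), ("performs", "VBZ"), ("the", "DET")]
print(project_tags(tweet, page))
```

The re-tagged tweets are then added to the training data, as in self-training; since only training tweets need a linked page, no URL is required at test time.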

POS
[Figure: train and test data]

POS Results
[Bar chart: accuracy (axis ticks 75 to 93) on Foster, Lowlands, Ritter and Test-average for the WSJ+Gimpel baseline and not-so-distant supervision, with plain self-training marked for comparison; bar labels: 91.6, 92.4, 87.5, 88.4, 87.4, 88.5, 88.8, 89.8]

Projection Examples
word          initial tag   projected
Snohomish     ADJ           NOUN
Bakery        NOUN          NOUN
Salmon-Safe   NOUN          ADJ
parks         NOUN          NOUN

Limitations
If I gave you one wish that will become true. What’s your wish ?... ? i wish i’ll get 3 wishes from you :p URL
[tags shown above the words: NOUN (first sentence); NOUN, NOUN, VERB (second sentence)]

Error Analysis
• improvements due to richer linguistic context
  Man Utd:           ARK: PRT NOUN        Ours: NOUN NOUN
  Radio Edit:        ARK: NOUN VERB       Ours: NOUN NOUN
• somewhat arbitrary differences
  Nokia D5000:       ARK: NOUN NUM        Ours: NOUN NOUN
  love his version:  ARK: VERB DET NOUN   Ours: VERB PRON NOUN

References

Books
Steven Abney. Semisupervised Learning for Computational Linguistics. 2007.
Anders Søgaard. Semi-supervised learning and domain adaptation for NLP. Morgan & Claypool, 2013.

Papers
Baroni et al. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014.
Blitzer et al. Biographies, Bollywood, Boom-boxes, and Blenders: Domain Adaptation for Sentiment Classification. ACL 2007.
Blum & Mitchell. Combining Labeled and Unlabeled Data with Co-training. 1998.
Chang, Connor, Roth. The Necessity of Combining Adaptation Methods. EMNLP 2010.
Jiang & Zhai. Instance Weighting for Domain Adaptation in NLP. ACL 2007.
Daumé. Frustratingly Easy Domain Adaptation. ACL 2007.
Elming, Plank, Hovy. Robust Cross-Domain Sentiment Analysis for Low-Resource Languages. WASSA 2014.
Foster et al. Discriminative instance weighting for domain adaptation in statistical machine translation. EMNLP 2010.
Hinton et al. Improving neural networks by preventing co-adaptation of feature detectors. 2012.
Hovy, Plank, Søgaard. When POS data sets don’t add up. Combating sample bias. LREC 2014.
Johannsen et al. More or less supervised super-sense tagging of Twitter. *SEM 2014.
Koo et al. Simple semi-supervised dependency parsing. ACL 2008.
McClosky et al. When is Self-training Effective for Parsing? COLING 2008.
Mikolov et al. Efficient estimation of word representations in vector space. 2013.
Mintz et al. Distant supervision for relation extraction without labeled data. ACL 2009.
Pan & Yang. A Survey on Transfer Learning. IEEE, 2012.
Plank, Hovy, Søgaard. Learning POS taggers with inter-annotator agreement loss. EACL 2014.
Plank, Hovy, McDonald, Søgaard. Adapting POS taggers to Twitter with not-so-distant supervision. COLING 2014.
Plank, Johannsen & Søgaard. Importance Weighting for Unsupervised domain adaptation of POS taggers: A negative result. EMNLP 2014.
Plank & Moschitti. Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction. ACL 2013.
Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 2000.
Snow, Jurafsky, Ng. Learning syntactic patterns for automatic hypernym discovery. NIPS 2005.
Sutton et al. Reducing weight undertraining in structured discriminative learning. NAACL 2006.
Søgaard. Zipfian corruptions for robust POS tagging. NAACL 2013.
Søgaard. Part-of-speech tagging with antagonistic adversaries. ACL 2013.
Søgaard & Haulrich. Sentence-level instance-weighting for graph-based and transition-based dependency parsing. IWPT 2011.
Søgaard & Johannsen. Robust learning in random subspaces: equipping NLP for OOV effects. COLING 2012.
Søgaard, Østerskov & Rishøj. Semisupervised dependency parsing using generalized tri-training. ACL 2010.
Turian et al. Word representations: A simple and general method for semi-supervised learning. ACL 2010.
Wager, Wang, Liang. Dropout Training as Adaptive Regularization. NIPS 2013.
Zadrozny. Learning and evaluating classifiers under sample selection bias. ICML 2004.
Zhou and Li. Tri-Training: Exploiting Unlabeled Data Using Three Classifiers. IEEE 2005.