Text Categorization
(actually, methods apply for categorizing anything into fixed categories – tagging, WSD, PP attachment ...)

600.465 - Intro to NLP - J. Eisner

Why Text Categorization?

- Is it spam?
- Is it Spanish?
- Is it interesting to this user?
  - News filtering
  - Helpdesk routing
- Is it interesting to this NLP program?
  - e.g., should my calendar system try to interpret this email as an appointment (using info. extraction)?
- Where should it go in the directory?
  - Yahoo! / Open Directory / digital libraries
  - Which mail folder? (work, friends, junk, urgent ...)

Measuring Performance

- Classification accuracy: What % of messages were classified correctly?
- Is this what we care about?

              Overall accuracy   Accuracy on spam   Accuracy on gen
  System 1    95%                99.99%             90%
  System 2    95%                90%                99.99%

- Which system do you prefer?

Measuring Performance
Precision vs. Recall of Good (non-spam) Email

  Precision = (good messages kept) / (all messages kept)
  Recall    = (good messages kept) / (all good messages)

[Plot: precision (vertical axis, 0% to 100%) against recall (horizontal axis, 0% to 100%)]

- Trade off precision vs. recall by setting threshold
- Measure the curve on annotated dev data (or test data)
- Choose a threshold where user is comfortable

Measuring Performance
Precision vs. Recall of Good (non-spam) Email

  F-measure = 1 / (average(1/precision, 1/recall))

[Plot: precision (vertical axis, 0% to 100%) against recall (horizontal axis, 0% to 100%), annotated as follows]

- OK for search engines (maybe)
- high threshold: all we keep is good, but we don’t keep much
- point where precision=recall (sometimes reported)
- would prefer to be here!
- low threshold: keep all the good stuff, but a lot of the bad too
- OK for spam filtering and legal search

More Complicated Cases of Measuring Performance

- For multi-way classifiers:
  - Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not, etc.
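As a hedged illustration of the averaging idea for multi-way classifiers, the sketch below computes precision, recall, and F-measure for each 2-way distinction (one category vs. everything else) and then averages them over the categories. The function names `prf` and `macro_average` and the toy labels are assumptions made for this example, not part of the lecture; the F-measure follows the definition given earlier, 1 / average(1/precision, 1/recall).

```python
def prf(gold, pred, target):
    """Precision, recall, F-measure for the 2-way distinction `target` vs. not."""
    kept = sum(1 for p in pred if p == target)          # all messages kept
    relevant = sum(1 for g in gold if g == target)      # all good messages
    good_kept = sum(1 for g, p in zip(gold, pred) if g == p == target)
    precision = good_kept / kept if kept else 0.0
    recall = good_kept / relevant if relevant else 0.0
    if precision == 0 or recall == 0:
        f = 0.0
    else:
        # F-measure = 1 / average(1/precision, 1/recall)
        f = 1 / ((1 / precision + 1 / recall) / 2)
    return precision, recall, f


def macro_average(gold, pred):
    """Average the per-category (precision, recall, F) triples over all categories."""
    labels = sorted(set(gold))
    triples = [prf(gold, pred, label) for label in labels]
    return tuple(sum(t[i] for t in triples) / len(labels) for i in range(3))


# Hypothetical gold labels and system output for six articles.
gold = ["Sports", "Sports", "News", "News", "Fashion", "Fashion"]
pred = ["Sports", "News", "News", "News", "Fashion", "News"]

for label in sorted(set(gold)):
    print(label, prf(gold, pred, label))
print("macro average", macro_average(gold, pred))
```

On this toy data the system over-predicts News, so News has high recall but low precision while Sports and Fashion show the reverse; averaging over the three 2-way distinctions gives one summary number per measure.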
  - Better, estimate the cost of different kinds of errors
    - e.g., how bad is each of the following?
      - putting Sports articles in the News section
      - putting Fashion articles in the News section
      - putting News articles in the Fashion section
    - Now tune system to minimize total cost
- Which articles are most Sports-like?
- For ranking systems:
  - Which articles / webpages most relevant?
  - Correlate with human rankings?
  - Get active feedback from user?
  - Measure user’s wasted time by tracking clicks?

How to Categorize?

Subject: would you like to . . . . . . drive a new vehicle for free ? ? ? this is not hype or a hoax , there are hundreds of people driving brand new cars , suvs , minivans , trucks , or rvs . it does not matter to us what type of vehicle you choose . if you qualify for our program , it is your choice of vehicle , color , and options . we don ' t care . just by driving the vehicle , you are promoting our program . if you would like to find out more about this exciting opportunity to drive a brand new vehicle for free , please go to this site : http : / / 209 . 134 . 14 . 131 / ntr to watch a short 4 minute audio / video presentation which gives you more information about our exciting new car program . if you do n't want to see the short video , but want us to send you our information package that explains our exciting opportunity for you to drive a new vehicle for free , please go here : http : / / 209 . 134 . 14 . 131 / ntr / form . htm we would like to add you the group of happy people driving a new vehicle for free . happy motoring .

How to Categorize? (supervised)

We’ve seen lots of options in this course!

1. Build n-gram model of each category
   - Question: How to classify test message?
   - Answer: Bayes’ Theorem

How to Categorize? (supervised)

We’ve seen lots of options in this course!

2. Represent each document as a vector (must choose representation and distance measure; use SVD?)
   - Question: How to classify test message?
   - Answer 1: Category whose centroid is most similar (may not work well if category is diverse)
   - Answer 2: Cluster each category into subcategories (then use answer 1 to pick a subcategory) (return the category that the subcategory is in) (this can also be useful for n-gram models)
   - Answer 3: Just look at labels of nearby training docs (e.g., let the k nearest neighbors vote – flexible!) (maybe the closer ones get a bigger vote)

How to Categorize? (supervised)

We’ve seen lots of options in this course!

3. Treat it like word-sense disambiguation
   a) Vector model – use all the features (we just saw this)
   b) Decision list – use single most indicative feature
   c) Naive Bayes – use all the features, weighted by how well they discriminate among the categories
   d) Decision tree – use some of the features in sequence
   e) Other options from machine learning, like perceptron, Support Vector Machine (SVM), logistic regression, …

Features matter more than which machine learning method

Review: Vector Model

- These two documents are similar:
    (0, 0, 3, 1, 0, 7, ... 1, 0)
    (0, 0, 1, 0, 0, 3, ... 0, 1)
- After normalizing vector length to 1,
  - Close in Euclidean space (similar endpoint)
  - High dot product (similar direction)
- Can play lots of encoding games when creating vector:
  - Remove function words or reduce their weight
  - Use features other than unigrams

Review: Decision Lists
(slide courtesy of D. Yarowsky (modified))

- To disambiguate a token of lead:
  - Scan down the sorted list
  - The first cue that is found gets to make the decision all by itself
  - Not as subtle as combining cues, but works well for WSD
- Cue’s score is its log-likelihood ratio: log [ p(cue | sense A) [smoothed] / p(cue | sense B) ]

Review: Combining Cues via Naive Bayes
(slide courtesy of D. Yarowsky (modified))

- these stats come from term papers of known authorship (i.e., supervised training)

Review: Combining Cues via Naive Bayes
(slide courtesy of D. Yarowsky (modified))

- “Naive Bayes” model for classifying text (Note the naive independence assumptions!)
- Would this kind of sentence be more typical of a student A paper or a student B paper?

Decision Trees
(example from Manning & Schütze)

Is this Reuters article an Earnings Announcement?

  2301/7681 = 0.3 of all docs
  (split on feature that reduces our uncertainty most)
  |
  +-- contains “cents” ≥ 2 times: 1607/1704 = 0.943
  |     +-- contains “versus” ≥ 2 times: 1398/1403 = 0.996   “yes”
  |     +-- contains “versus” < 2 times: 209/301 = 0.694
  |
  +-- contains “cents” < 2 times: 694/5977 = 0.116
        +-- contains “net” ≥ 1 time: 422/541 = 0.780
        +-- contains “net” < 1 time: 272/5436 = 0.050   “no”

Features Besides Unigrams

- All these approaches (except n-gram model) can use “interesting” features, not just unigrams.
- There’s generally a heuristic feature selection problem
  - Use some very large set of features defined by a template
  - Maybe restrict to features that look useful in isolation?
  - Add features greedily, one at a time
    - Measure or guess expected improvement of each feature
    - Make sure to smooth when doing this – why?
  - At the end, remove features that hurt performance on held-out data
- What does SpamAssassin use?
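The greedy feature-selection loop from the "Features Besides Unigrams" slide can be sketched as follows. This is a minimal sketch under stated assumptions, not the lecture's own procedure: the classifier is a small add-one-smoothed Bernoulli Naive Bayes, "expected improvement" is measured as training-set accuracy, and every name (`train`, `classify`, `accuracy`, `greedy_select`, `TRAIN`, `HELDOUT`) and all the toy messages are hypothetical.

```python
import math


def train(data, feats):
    """Add-one-smoothed Bernoulli Naive Bayes restricted to the features in `feats`.

    `data` is a list of (set_of_features, label) pairs.
    """
    classes = sorted({label for _, label in data})
    model = {}
    for c in classes:
        docs_c = [doc for doc, label in data if label == c]
        prior = math.log(len(docs_c) / len(data))
        # Smoothing keeps every probability strictly between 0 and 1, so a
        # feature never seen with a class cannot force log(0).
        p = {f: (sum(f in doc for doc in docs_c) + 1) / (len(docs_c) + 2)
             for f in feats}
        model[c] = (prior, p)
    return model


def classify(model, doc):
    def log_score(c):
        prior, p = model[c]
        return prior + sum(math.log(p[f] if f in doc else 1 - p[f]) for f in p)
    return max(sorted(model), key=log_score)


def accuracy(feats, train_data, eval_data):
    model = train(train_data, feats)
    hits = sum(classify(model, doc) == label for doc, label in eval_data)
    return hits / len(eval_data)


def greedy_select(candidates, train_data, heldout_data):
    chosen = []
    best = accuracy(chosen, train_data, train_data)
    while True:
        # Measure the improvement each remaining feature would give.
        trials = [(accuracy(chosen + [f], train_data, train_data), f)
                  for f in candidates if f not in chosen]
        if not trials:
            break
        acc, f = max(trials, key=lambda t: t[0])
        if acc <= best:  # no remaining feature helps
            break
        chosen.append(f)
        best = acc
    # At the end, remove features whose removal raises held-out accuracy.
    for f in list(chosen):
        rest = [g for g in chosen if g != f]
        if accuracy(rest, train_data, heldout_data) > accuracy(chosen, train_data, heldout_data):
            chosen = rest
    return chosen


TRAIN = [({"free", "click"}, "spam"), ({"free", "viagra"}, "spam"),
         ({"free", "the"}, "spam"), ({"meeting", "the"}, "good"),
         ({"lunch", "the"}, "good"), ({"the", "click"}, "good")]
HELDOUT = [({"free", "the"}, "spam"), ({"the", "lunch"}, "good")]

selected = greedy_select(["free", "the", "click"], TRAIN, HELDOUT)
print(selected)
```

The smoothing step is one answer to the slide's "why?": with so few messages, an unsmoothed estimate would give some feature probability 0 for a class and let that single feature veto the class outright.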
SpamAssassin Features

100    From: address is in the user's black-list
4.0    Sender is on www.habeas.com Habeas Infringer List
3.994  Invalid Date: header (timezone does not exist)
3.970  Written in an undesired language
3.910  Listed in Razor2, see http://razor.sf.net/
3.801  Subject is full of 8-bit characters
3.472  Claims compliance with Senate Bill 1618
3.437  exists:X-Precedence-Ref
3.371  Reverses Aging
3.350  Claims you can be removed from the list
3.284  'Hidden' assets
3.283  Claims to honor removal requests
3.261  Contains "Stop Snoring"
3.251  Received: contains a name with a faked IP-address
3.250  Received via a relay in list.dsbl.org
3.200  Character set indicates a foreign language

SpamAssassin Features

3.198  Forged eudoramail.com 'Received:' header found
3.193  Free Investment
3.180  Received via SBLed relay, see http://www.spamhaus.org/sbl/
3.140  Character set doesn't exist
3.123  Dig up Dirt on Friends
3.090  No MX records for the From: domain
3.072  X-Mailer contains malformed Outlook Express version
3.044  Stock Disclaimer Statement
3.009  Apparently, NOT Multi Level Marketing
3.005  Bulk email software fingerprint (jpfree) found in headers
2.991  exists:Complain-To
2.975  Bulk email software fingerprint (VC_IPA) found in headers
2.968  Invalid Date: year begins with zero
2.932  Mentions Spam law "H.R. 3113"
2.900  Received forged, contains fake AOL relays
2.879  Asks for credit card details

SpamAssassin Features

2.858  To: username at front of subject
2.851  Claims you actually asked for this spam
2.842  To header contains 'recipient' marker
2.826  Compare Rates
2.800  Received: says mail bounced all around the world
2.800  Mentions Spam Law "UCE-Mail Act"
2.796  Received via buggy SMTP server (MDaemon 2.7.4SP4R)
2.795  Bulk email software fingerprint (StormPost) found in headers
2.786  Broken CGI script message
2.784  Message-Id generated by a spam tool
2.783  Urges you to call now
2.782  Tells you it's an ad
2.782  RAND found, spammer forgot to run the random-ID generator
2.748  Cable Converter
2.744  No Age Restrictions
2.737  Possible porn - Celebrity Porn

SpamAssassin Features

2.782  Tells you it's an ad
2.782  RAND found, spammer forgot to run the random-ID generator
2.748  Cable Converter
2.744  No Age Restrictions
2.737  Possible porn - Celebrity Porn
2.735  Bulk email software fingerprint (JiXing) found in headers
2.730  DNSBL: sender is Confirmed Spam Source
2.726  Bulk email software fingerprint (MMailer) found in headers
2.720  exists:X-Encoding
2.720  DNSBL: sender is Confirmed Open Relay
2.702  SEC-mandated penny-stock warning -- thanks SEC
2.695  Claims you can be removed from the list
2.693  Removes Wrinkles
2.668  Offers a stock alert
2.660  Listed in DCC, see http://rhyolite.com/anti-spam/dcc/
2.658  Common pyramid scheme phrase (1)

SpamAssassin Features

2.654  Offers a free consultation
2.645  Bulk email software fingerprint (EVAMAIL) found in headers
2.642  Possible porn - Amateur Porn
2.640  Listed in Razor1, see http://razor.sf.net/
2.639  Subject contains lots of white space
2.622  exists:X-x
2.620  Received via a relay in relays.visi.com
2.611  Bulk email software fingerprint (IMktg) found in headers
2.566  Compete for your business
2.565  Possible porn - Pay Site
2.541  Contains "CBYI"
2.516  Spam phrases score is 34 to 55 (high)
2.513  Possible porn - Lesbian Site
2.510  Contains 'free installation' with capitals
2.502  Free Grant Money
2.500  Listed in Pyzor, see http://pyzor.sf.net/

SpamAssassin Features

2.500  Treść zawiera 'odesłanie z dopiskiem NIE'
2.500  Treść zawiera 'Artykul 25 ust 2 punkt 2'
2.500  Tresc zawiera 'przepraszamy za zajęty czas'
2.500  Tresc zawiera 'Zamów teraz!!!'
2.500  Tresc zawiera 'Jeżeli (Państwo) nie życzycie(sz) sobie'
2.500  Tresc zawiera 'Aby usunąć adres e-mail...'
2.496  Spam tool pattern in MIME boundary
2.492  'Message-Id' was added by a relay
2.488  Bulk email software fingerprint (screwup 1) found in headers
2.456  University Diplomas
2.450  Character set indicates foreign language body
2.445  Claims you can be removed from the list
2.443  Headers include 3 consecutive 8-bit characters
2.425  Date: is 24 to 48 hours after Received: date
2.421  'From' juno.com does not match 'Received' headers
2.398  Meet Singles

SpamAssassin Features

2.362  Serious Enquiries Only.
2.361  Claims auto-email removal
2.357  MiME-Version header (oddly capitalized)
2.357  A "microsoft" header was found
2.351  X-Mailer contains "OutLook Express 3.14159"
2.334  Possible porn - Rape
2.331  "Collect Child Support" Scam
2.314  Claims spam helps the environment
2.292  Free Leads
2.290  Fake name used in SMTP HELO command
2.280  Received via a relay in ipwhois.rfc-ignorant.org
2.276  Possible porn - Cum Shot
2.261  Amazing Stuff
2.250  Received via a relay in orbs.dorkslayers.com
2.242  Possible porn - Mega Porn
2.240  Offers pure profit

SpamAssassin Features

2.216  Received contains a faked HELO hostname
2.210  Tells you it's an ad
2.209  Uses control sequences inside a URL's hostname
2.206  Claims spam helps the environment
2.203  Tells you to 'take action now!'
2.203  Cash Bonus
2.202  From an address @btamail.net.cn
2.180  exists:X-Library
2.176  Contains "My wife, Jody" testimonial
2.170  Possible porn - Nasty Girls
2.145  Promise you ...!
2.114  Claims to be in accordance with some Spam law
2.109  Uses a numeric IP address in URL
2.100  Possible porn - Live Porn
2.088  Discusses search engine listings
2.083  HTML comments which obfuscate text

SpamAssassin Features

2.066  Information on getting a larger penis or breasts (2)
2.066  Contains 'free preview' with capitals
2.060  A foreign language charset used in headers
2.052  Says "We strongly oppose the use of spam email"
2.044  trail of Received: headers seems to be forged
2.030  Credit Bureaus
2.022  Claims compliance with House Bill 4176
2.011  No Investment
2      Treść zawiera 'adres e-mail zostal znaleziony/pozyskany'
2      Treść zawiera 'adres (e-mail) pochodzi z ogólnodostępnych....'
2      Treść zawiera 'Ustawy o ochronie danych osobowych'
2      Tresc zawiera 'temat USUN'
2      Tresc zawiera 'na podstawie adresow e-mail publicznie...'
2      Tresc zawiera 'kliknij w poniższy link'
2      Tresc zawiera 'do nabycia u nas'
2      Tresc zawiera 'Wysłać pusty mail'

SpamAssassin Features

2      Tresc zawiera 'Wiadomość nadano na podstawie...'
2      Tresc zawiera 'Wiadomość nadano jednorazowo...'
2      Tresc zawiera 'USUN Z BAZY'
2      Tresc zawiera 'Prosimy o przesłanie pustego maila'
2      Tresc zawiera 'Jeżeli nie interesują...'
2      Tresc zawiera 'Jeżeli nie chcesz (otrzymywac)...'
2      Tresc zawiera '...prosimy o zwrotny e-mail...'
2      Tresc zawiera '...adres z bazy...'
2      Dice cumplir con la ley
2      Clama cumplir con la normativa SPAM
1.995  Serious cash
1.984  Viagra and other drugs
1.977  If only it were that easy
1.952  Nigerian scam key phrase (million dollars)
1.910  Drastically Reduced
1.904  Contains "Temple Kiff"

SpamAssassin Features

1.889  Forged 'by gw05' 'Received:' header found
1.889  Credit Card Offers
1.880  Find out Anything
1.858  Contains "Gentle Ferocity"
1.856  Spam phrases score is 21 to 34 (high)
1.844  Possible Porn - Porn membership
1.842  Potential Earnings
1.839  Bulk email software fingerprint (Group Mail) found in headers
1.836  Once in a lifetime, apparently
1.831  Offers Free (often stolen) Passwords
1.824  Contains 'Dear (something)'
1.813  Possible porn - Porn Password
1.778  Message is 90-100% HTML tags
1.772  Sent using a trial version of CommuniGate
1.754  Date: is 48 to 96 hours after Received: date
1.744  To: has no local-part before @ sign

SpamAssassin Features

1.739  Talk about a check or money order
1.721  Contains 'for only pennies a day'
1.697  Spam tool pattern in MIME boundary
1.690  Form for checking email address
1.687  Subject: contains advertising tag
1.686  Talks about bulk email
1.682  Claims you registered with some kind of partner
1.681  Long Distance Phone Offer
1.663  Additional Income
1.640  Spam phrases score is 05 to 08 (medium)
1.640  Contains 'subject to credit approval'
1.639  Talks about tracing by SSN
1.631  Possible Porn - XXX Photos
1.625  Contains 'earn (dollar) something per week'
1.598  Message-Id has characters often found in spam
1.591  'X-Mailer' line contains gibberish

SpamAssassin Features

1.591  Cures Baldness
1.578  Subject starts with "Hello"
1.552  "Refinance your home"
1.548  Doing something with my income
1.546  Date: is 96 hours or more before Received: date
1.544  To: address contains spaces
1.539  Cents on the Dollar
1.526  Uses a username in a URL
1.523  Secretly Recorded
1.518  Invalid Date: header (not RFC 2822)
1.506  From and To are same (3)
1.505  Valid-looking To "undisclosed-recipients"
1.503  exists:Date-warning
1.500  Temat zawiera 'oferta'
1.500  Treść zawiera 'Zaprosić państwo'
1.500  Treść zawiera 'Szanowni Państwo'

SpamAssassin Features

1.500  Tresc zawiera 'publicznie dostępny (email)'
1.500  Tresc zawiera 'Upowaznienie do wystawiania faktur VAT...'
1.500  Tresc zawiera '...mail z tematem...'
1.495  Possible registry spammer
1.490  Possible porn - Adult Web Sites
1.486  'one time mailing' doesn't mean it isn't spam
1.479  Forged hotmail.com 'Received:' header found
1.470  Talks about opting in
1.466  Possible porn - Barely Legal
1.459  Claims compliance with Senate Bill 1618
1.435  Direct Marketing
1.410  Money back guarantee.
1.404  Date: is 48 to 96 hours before Received: date
1.404  Instructions on how to increase something
1.400  NOS CHILLAN PARA DECIR QUE ES GRATIS
1.394  Plugs Viagra

SpamAssassin Features

1.385  Spam phrases score is 08 to 13 (medium)
1.382  URL uses words and phrases which indicate porn (4)
1.373  As seen on national TV!
1.370  Message text disguised using base-64 encoding
1.368  Date: is 3 to 6 hours after Received: date
1.363  Score with babes!
1.361  From and To are same (6)
1.352  'From' yahoo.com does not match 'Received' headers
1.337  Spam phrases score is 13 to 21 (high)
1.332  Not intended for residents of XYZ.
1.319  Faked To "Undisclosed-Recipients"
1.314  From and To are same (5)
1.306  Only thing addresses on CD are useful for is spam
1.302  Contains "Vjestika Aphrodisia"
1.301  Lower Monthly Payment
1.293  HTML comment has 3 consecutive 8-bit characters

SpamAssassin Features

1.285  From: does not include a real name
1.283  Uses a dotted-decimal IP address in URL
1.275  Contains link without http:// prefix
1.274  'Subject' contains G.a.p.p.y-T.e.x.t
1.273  Marketing Solutions
1.270  Spam tool pattern in MIME boundary
1.269  'Prestigious Non-Accredited Universities'
1.253  Spam tool pattern in MIME boundary
1.253  Incorporates a tracking ID number
1.247  From and To are same (2)
1.246  Contains 'free sample' with capitals
1.231  Claims compliance with spam regulations
1.226  Online Pharmacy
1.224  Received via SMTPD32 server (SMTPD32-n.n)
1.218  Includes a form which will send an email
1.201  While you Sleep

SpamAssassin Features

1.187  Uses non-standard port number for HTTP
1.175  Possible porn - in ALL CAPS
1.148  Subject contains a unique ID
1.146  Bulk email software fingerprint (hash 2) found in headers
1.138  Get Paid
1.131  Contains 'URGENT BUSINESS'
1.119  Why Pay More?
1.118  Requires Initial Investment
1.112  Javascript to open a new window
1.110  exists:X-List-Unsubscribe
1.099  Date: is 6 to 12 hours after Received: date
1.098  Subject starts with dollar amount
1.092  Increase your ejaculation!
1.084  Subject: contains Korean unsolicited email tag
1.084  Spam phrases score is 03 to 05 (medium)
1.078  Plugs "Herbal Viagra"

SpamAssassin Features

1.077  Apparently, you'll be amazed
1.057  People just leave money laying around
1.045  Bulk email software fingerprint (eGroups) found in headers
1.042  Date: is 24 to 48 hours before Received: date
1.039  Talks about direct email
1.038  Unneeded encoding of HTML tags
1.023  Javascript to move windows around
1.021  No such thing as a free lunch (3)
1.009  Save big money
1      Frequent SPAM content
1      Frequent SPAM content
1      Frequent SPAM content
1      Frequent SPAM content
1      Frequent SPAM content
1      Frequent SPAM content
1      Frequent SPAM content

SpamAssassin Features

1      Filename is just a '\#'; probably a JS trick
1      Old Murkowski disclaimer
1      Obfuscated action attribute in HTML form
1      Mentions monsterhut.com
1      Form for verifying email address
1      Contains signature of unregistered spam tool
1      Publicidad por e-mail
1      Contiene la palabra gratis en las cabeceras
1      exists:X-Fix
1      To: non-existent 'Investors' address
1      Subject contains 'Your Membership Exchange'
1      Spam tool pattern in MIME boundary
1      Reply-To: is empty
1      Received via a relay in bl.spamcop.net
1      Received via RSSed relay, see http://www.mail-abuse.org/rss/
1      Received via RBLed relay, see http://www.mail-abuse.org/rbl/

SpamAssassin Features

1      Received from first hop dialup, see http://www.mail-abuse.org/dul/
1      Received from dialup, see http://www.mail-abuse.org/dul/
1      Received contains fake 'Post.cz' hostname
1      From an address @email-publisher.com
1      Bulk email software fingerprint (xmailer tag) found in headers
1      Bulk email software fingerprint (pascual) found in headers
1      Bulk email software fingerprint (eBizmailer) found in headers
1      Bulk email software fingerprint (charset) found in headers
1      Bulk email software fingerprint (Yam) found in headers
1      Bulk email software fingerprint (V3161) found in headers
1      Bulk email software fingerprint (Uproar) found in headers
1      Bulk email software fingerprint (Seednet) found in headers
1      Bulk email software fingerprint (PowerCampaign) found in headers
1      Bulk email software fingerprint (Opt-In Lightning) found in headers
1      Bulk email software fingerprint (Matchmaker) found in headers
1      Bulk email software fingerprint (Mail Bomber) found in headers

SpamAssassin Features

1      Bulk email software fingerprint (Henry Su) found in headers
1      Bulk email software fingerprint (GRMessageQueue) found in heade
1      Bulk email software fingerprint (EPaper) found in headers
1      Bulk email software fingerprint (DiffondiCool) found in headers
1      Bulk email software fingerprint (CurrentMailer) found in headers
1      Bulk email software fingerprint (Caretop) found in headers
1      Bulk email software fingerprint (Campaign Blaster) found in header
1      Bulk email software fingerprint ("outlook") found in headers
1      'Received:' contains huge hostname
1      'From' contains more than one address
1      Treść jest od wydawnictwa Verlag Dashofer (spamerzy)
1      Tresc zawiera 'Za zaliczeniem pocztowym...'
1      /zam.wieni/i
1      /zainteresowan.{0,50}wsp..prac/
1      /www\.adresy\.org/i
1      /specjaln.{0,50}ofert/i

SpamAssassin Features

1      Presentación de un nuevo producto.
1      Porno gratis.
1      Para dejar de fumar
1      Pago contra reembolso.
1      Nos animan a contestar si estamos interesados
1      No se puede considerar spam
1      Mensaje enviado por error
1      Mas informacion.
1      Los regalos no existen, salvo de nuestros amigos.
1      Inmigración legal (?) a los Estados Unidos
1      Informacion y reserva
1      If you want to subscribe...
1      If you send an email you will be OptOut
1      IMPERATIVOS EN MAYUSCULAS.
1      Haga click aqui.
1      Ha sido ganador.

SpamAssassin Features

1      Ha sido ganador.
1      El correo como alternativa comercial
1      Conviertete en Spammer.
1      Claims you can opt-out
1      Claims you can be removed in Spanish
1      Claims not to be spam in Spanish
1      Alta en buscadores hispanos.
1      spam software: PopLaunch
1      mentions Cyber FirePower!, a spam-tool
1      Will not Belive your Eyes!
1      Well known spam senders
1      Wants you to do business online
1      Things incredible
1      They keep your money -- No Refund!
1      Terms and conditions
1      Suspect you might have received the message by mistake

SpamAssassin Features

1      Slashed Price
1      SSPL found, spammer forgot to run the random-ID generator
1      Psychics Scam
1      Prices won't Last
1      Possible porn - Galleries of Pictures
1      Plugs "Natural Viagra"
1      Outstanding Values
1      Orders shipped by priority mail
1      No Middleman
1      No Medical Exams
1      No Gimmick
1      Nigerian scam, cf http://www.snopes2.com/inboxer/scams/nigeria.htm
1      New Customers Only
1      More Internet Traffic
1      Luxury Car
1      List removal information

SpamAssassin Features

1      Get Started Now
1      Cyber FirePower! rant about losing dropboxes
1      Confidentially on all orders
1      Claims you were on a list
1      Claims to listen to some removal request list
1      Claims not to be spam
1      Claims not to be selling anything
1      Claims compliance with spam regulations
1      Claims compliance with spam regulations
1      Claims "This is not junk email"
1      Cell Phone Cancer Scam
1      Buying judgements
1      Achieve Wealth
0.982  Talks about future mailings
0.977  Excessive quoted-printable encoding in body
0.975  Multi Level Marketing mentioned

SpamAssassin Features

0.968  Possible porn - Hardcore Porn
0.959  Missing To: header
0.954  From: has no local-part before @ sign
0.952  Targeted Traffic / Email Addresses
0.948  Information on getting a larger penis or breasts
0.947  Message is 70-90% HTML tags
0.935  Free Membership
0.931  To: and Cc: contain similar domains at least 8 times
0.910  Received contains a (dollar) variable reference
0.908  Claims compliance with spam regulations
0.906  'From' ebay.com does not match 'Received' headers
0.904  Unlimited in caps
0.900  Accept Credit Cards
0.893  From: ends in numbers
0.885  'Message-Id' was added by a relay (3)
0.882  Gives information about an opportunity

SpamAssassin Features

0.874  Don't delete me! Nooooo!!!!
0.863  Fast Viagra Delivery
0.853  Frequent SPAM content
0.849  exists:X-Stormpost-To
0.849  Missing Date: header
0.849  List removal information
0.838  Consolidate Debt and Credit
0.820  Financial Freedom
0.817  Lots and lots of Cc: headers
0.810  Received via a relay in multihop.dsbl.org
0.796  Contains word 'guarantee' in all-caps
0.795  Claims you can be removed from the list
0.781  Spam phrases score is 00 to 01 (low)
0.781  HTML message is a saved web page
0.781  Claims compliance with Senate Bill 1618
0.779  exists:X-PMFLAGS

SpamAssassin Features

0.676  See for yourself
0.673  You'd better read all of this spam!
0.670  Easy Terms
0.666  Contains "Toner Cartridge"
0.665  Human Growth Hormone
0.658  Trying to sell insurance online
0.653  No experience needed!
0.646  Claims to be legitimate email
0.643  Subject: starts with advertising tag
0.630  Frequent SPAM content
0.628  illegal Nigerian transactions (2)
0.622  Subject GUARANTEED
0.620  DNSBL: sender ip address in in a dialup block
0.614  Possible porn - Must be 18
0.612  Tells you to click on a URL (in caps)
0.612  Free Quote

SpamAssassin Features

0.611  Refinance Home
0.610  Received via a relay in relays.ordb.org
0.608  Contains 'free access' with capitals
0.606  Uses a long numeric IP address in URL
0.605  Have you been turned down?
0.601  Includes a URL link to send an email with the subject 'remove'
0.601  No Credit Check
0.600  No Inventory
0.594  To: has a malformed address
0.573  Be your own boss
0.563  Information on how to work at home (2)
0.560  Contains mail-in order form
0.556  One hundred percent guaranteed
0.553  Guaranteed Stuff
0.552  Information on mortgage rates
0.549  Frequent SPAM content

SpamAssassin Features

0.544  From and To the same (1)
0.542  Bulk email software fingerprint (screwup 2) found in headers
0.542  Gives an excuse for why message was sent
0.541  Avoid Bankruptcy
0.539  Includes a link for AOL users to click
0.536  Form for changing email address
0.531  Apply online (with capital O)
0.525  List removal information
0.521  Date: is 12 to 24 hours after Received: date
0.518  Asks you for your signature on a form
0.514  Subject talks about losing pounds
0.513  Lower Interest Rates
0.511  Do it Today
0.506  Unsecured Credit/Debt
0.506  The best Rates
0.505  From: starts with nums

SpamAssassin Features

0.505  Spam phrases score 55 or higher (high)
0.505  Impotence cure
0.503  Vacation Offers
0.503  Spam is 100% natural?!
0.501  Possible porn - Free Porn
0.501  Possible porn - Best, Largest Porn Collections
0.500  Spam phrases score is 01 to 02 (low)
0.496  Can not be combined with any other offer
0.489  Message contains disclaimer
0.488  Claims to be Legal
0.483  Subject is all capitals
0.466  MS-Outlook-style To "<Undisclosed-Recipient:;>"
0.466  Date: is 96 hours or more after Received: date
0.459  Spam tool pattern in MIME boundary
0.448  Date: is 6 to 12 hours before Received: date
0.448  Says: "to be removed, reply via email" or similar

SpamAssassin Features

0.448  Possible porn - Porn Fest
0.446  Sent with 'X-Priority' set to high
0.443  Local part containing a "4u" variant
0.443  HTML font color is magenta
0.435  Join Millions of Americans
0.434  Asks for a billing address
0.431  Nigerian scam key phrase ((dollar) NNN.Nm/USDNNN.N m/US(doll
0.431  Claims "This is not spam"
0.429  Sent with 'X-Msmail-Priority' set to high
0.428  Subject contains "FREE" in CAPS
0.426  exists:X-MailingID
0.424  MIME section missing boundary
0.424  Asks you to fill out a form
0.422  HTML font color is unknown to us
0.422  Domain name containing a "4u" variant
0.421  HTML font color is yellow

SpamAssassin Features

0.419  Includes a link to send a mail with a subject
0.419  Standard investment opportunity spam
0.418  Javascript to hide URLs in browser
0.417  Offers Extra Cash
0.416  Eliminate Bad Credit
0.415  Lose Weight Spam
0.414  Subject talks about savings
0.414  Subject ends with lots of white space
0.414  Offers a full refund
0.414  Gives instructions for removal from list
0.413  Free Cell Phone
0.412  Frontpage used to create the message
0.411  Offers a limited time offer
0.410  Claims you can be removed from the list
0.408  Attempt at obfuscating the word "mortgage"
0.407  Opportunity - What a deal!

SpamAssassin Features

0.407  Nobody's perfect
0.406  Tells you about a strong buy
0.406  HTML table has thick border
0.406  Buy Direct
0.405  Instant Access button
0.405  HTML font color is green
0.405  HTML font color is cyan
0.405  Discusses money making
0.405  Asks you to click below (in caps)
0.404  Uses open redirection service
0.404  exists:X-ServerHost
0.404  Claims you can be removed from the list
0.403  List removal information
0.402  Message with extraneous Content-type:...type=header
0.402  There is no obligation.
0.402  Talks about lots of money

SpamAssassin Features

0.402  Contains 'Get it now' with capitals
0.401  Supplies are Limited
0.401  No such thing as a free lunch (2)
0.400  You won't be dissapointed.
0.400  Possible porn - Offers Instant Access
0.400  Nigerian scam key phrase ((dollar)NN,NNN,NNN.NN)
0.400  How dear can you be if you don't know my name?
0.386  No Strings Attached
0.382  HTML with embedded plugin object
0.380  Received via a relay in relays.osirusoft.com
0.369  Off Shore Scams
0.365  Information on how to work at home (1)
0.364  Possible porn - Hot, Nasty, Wild, Young
0.364  Contains word 'amazing' in all-caps
0.362  exists:X-SMTPExp-Version
0.362  There is no catch.

SpamAssassin Features

0.361  sent to you@you.com or similar
0.360  Received from first hop dialup listed in relays.osirusoft.com
0.344  HTML font color is same as background
0.336  Subject: is empty or missing
0.335  FONT Size +2 and up or 3 and up
0.334  Lowest Price
0.333  HTML font color has unusual name
0.333  Contains word 'profits' in all-caps
0.330  HTML font color is gray
0.329  What are you waiting for
0.329  One Time Rip Off
0.327  Talks about prizes
0.327  Free Website
0.326  To: and Cc: contain similar usernames at least 5 times
0.325  HTML font face is not a commonly used face
0.324  Quoted-printable line longer than 76 characters

SpamAssassin Features

0.324  From: has a malformed address
0.323  exists:X-SMTPExp-Registration
0.323  Message-Id has no @ sign
0.323  No such thing as a free lunch (1)
0.321  URL of CGI script called "unsubscribe" or "remove"
0.321  Satisfaction Guaranteed
0.321  "if you do not wish to receive any more"
0.320  Message contains a lot of ^M characters
0.320  exists:x-esmtp
0.320  Claims you are a winner
0.319  From: contains numbers mixed in with letters
0.318  Can't live without?
0.317  HTML mail with non-white background
0.315  Talks about email marketing
0.315  Save big money
0.315  HTML font color is red

SpamAssassin Features

0.315  3 WHOLE LINES OF YELLING DETECTED
0.313  Save Up To
0.313  Domain registration spam body
0.312  Tells you to click on a URL
0.312  Subject: domain registration spam subject
0.308  URL contains spamhaus signature: numbered servers
0.308  Name Brand
0.307  Asks you to click below
0.307  Act Now! Don't Hesitate!
0.306  Talks about Hidden Charges
0.305  Message is 50-70% HTML tags
0.304  While Supplies Last
0.302  Easily-executed JavaScript code
0.302  Subject starts with "Free"
0.302  HTML font color not within safe 6x6x6 palette
0.301  No Purchase Necessary

SpamAssassin Features

0.301  Auto-executing JavaScript code
0.300  DNSBL: sender is a Spamware site or vendor
0.300  Significant Savings
0.300  No Fees
0.300  Click-to-remove with PHP/ASP action found
0.299  X-Mailer header indicates a non-spam MUA (TheBat!)
0.296  'remove' URL contains an email address
0.294  Being a Member
0.281  Investment Decision
0.279  Date: is 3 to 6 hours before Received: date
0.245  Contains a Privacy Statement
0.242  Tells you how to stop further spam
0.239  Month Trial Offer
0.229  Save (dollar) (dollar) (dollar)
0.224  Sign up Free Today
0.222  To: repeats address as real name

SpamAssassin Features

0.218  Congratulations - you've been scammed?
0.217  2 WHOLE LINES OF YELLING DETECTED
0.216  Weekend Getaway
0.214  Trying to offer you something
0.212  Member Stuff
0.212  HTML font color is missing hash (
0.212  Doesn't ask any questions
0.212  Contains 'Special Promotion'
0.212  A WHOLE LINE OF YELLING DETECTED
0.211  To: is empty
0.211  Winning in Caps
0.211  Stuff on Sale
0.211  Only (dollar) (dollar) (dollar)
0.211  Encourages you to waste no time in ordering
0.210  Who really wins?
0.210  HTML font face has excess capital characters

SpamAssassin Features

0.209  Free DVD
0.207  Date: is 12 to 24 hours before Received: date
0.207  JavaScript code
0.206  Header with all capitals found
0.205  HTML font color is blue
0.204  Winner in Caps
0.204  HTML font face is not a word
0.204  Fantastic Deal
0.203  Includes a 'remove' email address
0.203  Includes a URL link to send an email
0.203  Possible porn - Large Number of movies, pics
0.203  Free Offer
0.202  Contains a tollfree number
0.201  illegal Nigerian transactions (1)
0.201  Image tag with an ID code to identify you
0.201  Frame wanted to load outside URL

SpamAssassin Features

0.201  Contains 'for only' some amount of cash
0.181  X-Mailer header indicates a non-spam MUA (Outlook Express)
0.150  Spam tool pattern in MIME boundary
0.146  Cancel at any time!
0.144  Talks about social security numbers
0.137  Click to perform an action on an account
0.134  Gives an excuse about why you were sent this spam
0.127  Nigerian scam key phrase ((dollar) NNN.Nm/USDNNN.N m/US(doll
0.123  Contains a comment with nothing but unique ID
0.117  No Claim Forms
0.114  'Message-Id' was added by a relay (2)
0.114  Free Trial
0.111  They're just giving it away!
0.108  Message-Id has characters indicating spam
0.107  Dear you@you.com?
0.106  Free Hosting

SpamAssassin Features

0.105  Contains an ASCII-formatted form
0.104  I wonder how many emails they sent in error...
0.103  URL of page called "unsubscribe"
0.102  Subject has exclamation mark and question mark
0.101  Offer Expires
0.101  Contains 'Dear Somebody'
0.100  Javascript protocol in a URI
0.100  Message includes Microsoft executable program
0.100  MIME filename does not match content
0.100  Spam tool pattern in MIME boundary
0.038  'Received:' has 'may be forged' warning
0.032  Message-Id is not valid, according to RFC 2822
0.031  Offers Coupon
0.028  Please read this!
0.014  Please oh please oh please!
0.009
Shopping Spree Contains a line >=199 characters long 600.465 - Intro to NLP - J. Eisner 60 SpamAssassin Features 0.009 0.009 0.008 0.008 0.005 0.004 0.003 -0.006 -0.019 -0.026 -0.069 -0.075 -0.102 -0.118 -0.123 -0.133 Spam tool pattern in MIME boundary Risk free. Suuurreeee.... Reserves the right Expect to earn Contains 'G.a.p.p.y-T.e.x.t' Gift Certificate Big Bucks X-Mailer header indicates a non-spam MUA(Outlook) From Majordomo Missing From: header Free money! Forwarded email (Outlook style) Email came from some known mailing list software Mailer daemon failure notice (1) Message text is over 40K in size Came via Internet Mail Service plugin 600.465 - Intro to NLP - J. Eisner 61 SpamAssassin Features -0.137 -0.143 -0.196 -0.200 -0.207 -0.211 -0.215 -0.217 -0.231 -0.233 -0.240 -0.298 -0.300 -0.301 -0.302 -0.304 Correct for MIME 'null block' X-Mailer header indicates a non-spam MUA(Netscape) Mailing list headers are suspicious exists:Resent-To exists:X-Authentication-Warning Where are you working at? exists:X-Accept-Language Subject contains newsletter header (list) 'Message-Id' was added by yahoo.com, that's OK exists:X-Loop X-Mailer header indicates a non-spam MUA (AOL) To: repeats local-part as real name User-Agent header indicates a non-spam MUA(Entourage) Short signature present (no empty lines) exists:X-Mailing-List Long signature present (empty lines) 600.465 - Intro to NLP - J. Eisner 62 SpamAssassin Features -0.484 -0.484 -0.489 -0.506 -0.506 -0.506 -0.518 -0.522 -0.558 -0.601 -0.605 -0.616 -0.641 -0.695 -0.708 -0.708 Subject contains a month name - probable newsletter(2) Subject contains a month name - probable newsletter Common footer for Hotmail Contains a PGP-signed message Appears to be from yahoo groups Yahoo! 
Groups message exists:User-Agent Has a valid-looking References header Forwarded email User-Agent header indicates a non-spam MUA(Mozilla) User-Agent header indicates a non-spam MUA(Outlook Express) Subject contains newsletter header (news) Message-Id indicates a non-spam MUA (Pine) Contains what looks like an 'E-Mail Disclaimer' Contains a PGP-signed message (signature attached) Message text is over 20K in size 600.465 - Intro to NLP - J. Eisner 63 SpamAssassin Features -0.725 -0.754 -0.832 -0.847 -0.864 -0.897 -0.949 -0.986 -1 -1 -1 -1 -1 -1 -1 -1 Subject contains a frequency - probable newsletter X-Mailer header indicates a non-spam MUA(T-Offline) Contains what looks like a quoted email text exists:In-Reply-To Has an Approved-By moderated list header User-Agent header indicates a non-spam MUA(IMP) Contains what looks like a patch from diff -u Mailer daemon failure notice (2) X-Mailer header indicates a non-spam MUA (Gnus) User-Agent header indicates a non-spam MUA(Gnus) Subject contains newsletter header (in review) From: looks like US Telephone Number recommended page from MailBits.com Talks about tracking numbers Common footer for MSN A MailMan confirm-your-address message 600.465 - Intro to NLP - J. Eisner 64 SpamAssassin Features -1.118 -1.128 -1.152 -1.176 -1.301 -1.334 -1.433 -1.451 -1.596 -1.628 -1.696 -1.780 -1.801 -1.898 -2.092 -2.170 Common footer for MSN Contains a password retrieval system Something about registration User-Agent header indicates a non-spam MUA(Mutt) Came from MSN Communities exists:X-Cron-Env Subject looks like order info From the Mailer-Daemon Subject contains a date Contains what looks like an email attribution Common footer for Hotmail X-Mailer header indicates a non-spam MUA (AppleMail) Common footer for Hotmail Sent through Microsoft's ListBuilder service Short signature present (empty lines) Common footer for Hotmail 600.465 - Intro to NLP - J. 
Eisner 65 SpamAssassin Features -2.174 -2.442 -2.473 -2.475 -2.550 -2.699 -2.863 -3.052 -3.127 -4.0 -6 -10 -10 -20 -100 -100 Message from eBay Contains what looks like a patch from diff -c Looks like a Debian BTS bug Common footer for Hotmail Subject is an eBay question Looks like a Bugzilla bug User-Agent header indicates a non-spam MUA(KMail) non-spam Yahoo! Groups banner found Long signature present (no empty lines) Uses the Habeas warrant mark(http://www.habeas.com/) User is listed in 'whitelist_to' Not Matt's Scripts formmail.pl Bonded sender, seehttp://www.bondedsender.org/referred.html User is listed in 'more_spam_to' User is listed in 'all_spam_to' From: address is in the user's white-list 600.465 - Intro to NLP - J. Eisner 66 How to Categorize? (unsupervised) What if we don’t have supervised training data? Might try an iterative approach as usual: 1. Cluster the messages 2. Train n-gram, Naive Bayes, or decision list model to discriminate among the clusters 3. Use the model to reassign messages to clusters (most will stay put but some will move) 4. Return to step 2 until convergence 600.465 - Intro to NLP - J. Eisner 67 How to Categorize? (semisupervised) What if we have only a little supervised data? Could try bootstrapping like Yarowsky’s WSD: 1. Start with very small, rather accurate classes 2. Train n-gram, Naive Bayes, or decision list model to discriminate among the classes 3. Augment each class with new messages that the model confidently classifies there (maybe also move or remove some existing messages) 4. Return to step 2 until convergence 600.465 - Intro to NLP - J. Eisner 68 How to Categorize? (adaptive) What if we gradually get more new data over time? User feedback (active or passive) on our classifications News / email systems that categorize, or judge relevance Add new articles / messages to training data If they’re unlabeled (no supervision), label them automatically Add them only if we’re confident? Add them fractionally, like EM? 
So the model adjusts over time:
  E.g., change the cluster centroids or n-gram parameters
May want to weight the more recent data more heavily, since the future is more like the present than the past
  E.g., a message from k days ago has weight 0.9^k (k = 0, 1, 2, ...)
  So today’s model = today’s data + 0.9 * yesterday’s model

How to Categorize? (hierarchical)
What if we are putting a document in a Yahoo! category?
There are thousands of categories (at least): too hard!
  Choose one of the 14 top-level categories, e.g., Science
  Then use a Science-specific classifier to choose one of the 54 second-level categories within Science (14 are symlinks)
  Continue working your way down the tree ...
When you can’t classify with high confidence, ask a human (then use the human’s answer as more training data)
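
The recency weighting from the adaptive slide can be checked numerically. The sketch below is only an illustration under stated assumptions, not the course's implementation: it treats a "model" as a bag of feature counts, and the names `DECAY`, `decayed_counts_direct`, `decayed_counts_recursive`, and the sample word counts are all hypothetical. It shows that weighting a message from k days ago by 0.9^k gives the same counts as the recurrence today's model = today's data + 0.9 * yesterday's model.

```python
from collections import Counter

DECAY = 0.9  # per-day decay factor taken from the slide


def decayed_counts_direct(daily_counts):
    """Weight the counts from k days ago by DECAY**k (k = 0 is today).

    daily_counts is ordered oldest first, so the last entry is today.
    """
    total = Counter()
    for k, day in enumerate(reversed(daily_counts)):
        for feature, count in day.items():
            total[feature] += (DECAY ** k) * count
    return total


def decayed_counts_recursive(daily_counts):
    """Apply today's model = today's data + DECAY * yesterday's model."""
    model = Counter()
    for day in daily_counts:  # oldest first
        # Shrink everything learned so far, then add the new day's counts.
        updated = Counter({f: DECAY * c for f, c in model.items()})
        for feature, count in day.items():
            updated[feature] += count
        model = updated
    return model


# Hypothetical word counts for three days of spam, oldest first.
days = [Counter(free=4, vehicle=1), Counter(free=2), Counter(free=1, refund=3)]
direct = decayed_counts_direct(days)
recursive = decayed_counts_recursive(days)
```

Both functions give the same decayed counts up to floating-point rounding. The recursive form needs only yesterday's model and today's data, so old messages do not have to be stored or rescanned.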