EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS

Karine Megerdoomian
University of Maryland, College Park
karinem@umiacs.umd.edu

Second Workshop on Persian Language and Computers, University of Tehran
(دومین کارگاه پژوهشی زبان فارسی و رایانه، دانشگاه تهران)

Talk Outline
- Persian Weblogs
    - Persian is the 4th largest blog language in the world (~75,000 sites)
- Description of a finite-state morphological analyzer for Persian
    - System description
    - Language issues and implementation
- Computational issues in weblogs

Language of Blogs
- Contain both formal and informal morphology
- Morphology
    - Informal text is very different from formal
        مرا گرفته است (formal) vs. گرفته تم (informal)
    - Features that don't exist in formal
        فروشندهه ؛ رفتش
    - Shortened verbal stems and inflection
        می گویند (formal) vs. میگن (informal)

Language of Blogs
- Morphology
    - Colloquial pronunciation
        غلطای املایی ؛ این سایتو ؛ دوستامونم ؛ دردناکه ؛ مثل منن ؛ ازشون ؛ خودتون ؛ نگاه های شان ؛ همسایه اشون
    - Spelling errors and non-standard punctuation and spacing
    - Emoticons and hyperlinks

Language of Blogs
- Lexicon
    - Wordforms follow pronunciation
        اوضاش ؛ برام ؛ نگامی کنم ؛ خونه ؛ تمبل ؛ همدیگه ؛ بش گفتم
    - Colloquial forms
        تو دانشگاه ؛ واسه استادام
    - New words
        لینکدونی ؛ دوستان کامنت گذار

Language of Blogs
- Lexicon
    - Loan words
        چت روم ؛ آن لاین ؛ دان لود کنین
    - Interjections
        آاااخ! ؛ والا ؛ وای ؛ اوووه
    - More idiomatic expressions
        دمش گرم آقا

Language of Blogs
- Huge amount of variation!
- Need for flexible rules
- Phonological rules to represent colloquial speech
- Need to disambiguate (statistical component?)
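
The call for flexible phonological rules that capture colloquial speech can be illustrated with a small Python sketch. It is not the XFST implementation: the two rules, the letter class, and the function name `colloquial_variants` are assumptions made for illustration, and the romanized input `xanh` (for خانه, colloquial خونه) is a hypothetical transliteration in the style of the analyzer's `ktab` / `ha` lexicon entries. Each rule is optional, so the formal spelling is always kept next to the colloquial one.

```python
import re

# Illustrative optional rewrite rules (assumptions, not the talk's rule set).
# Romanization: a = alef, v = vav, y = ye, h = he.
RULES = [
    # Plural suffix "ha" reduces to "a" after a consonant letter: ktabha -> ktaba
    ("plural ha -> a", re.compile(r"(?<=[^avyh])ha$"), "a"),
    # Colloquial "an" -> "vn", as in xanh -> xvnh (lexically restricted in reality)
    ("an -> vn", re.compile(r"an"), "vn"),
]


def colloquial_variants(formal):
    """Return the formal form plus every variant produced by optional rules."""
    variants = {formal}
    for _name, pattern, replacement in RULES:
        # Applying a rule is optional, so keep old forms and add rewritten ones.
        variants |= {pattern.sub(replacement, v) for v in variants}
    return variants


if __name__ == "__main__":
    print(sorted(colloquial_variants("ktabha")))
    print(sorted(colloquial_variants("xanh")))
```

Because every rule is optional and applies wherever its pattern matches, a rule set like this over-generates, which is one reason a disambiguation step is raised as an open question.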

Language of Blogs
- Formal blog text is also different from traditional formal text

    خوابگرد (Khabgard) | BBC
    موافق اند | موافقند
    بیننده گان | بینندگان
    کتاب اش | کتابش
    کم تر | کمتر
    کافی ست | کافیست
    حتا | حتی

Finite-State Transducers (FST)
- Two-level network or transducer
    - Input = lower side of arc
    - Output = upper side of arc

[Figure: a transducer path whose upper side reads b i r d +Noun +Pl and whose lower side reads b i r d s]

MA: System Description
- Developed on Xerox Finite State Technology (XFST) [Karttunen & Beesley 1992]
- Components:
    - Lexicon and morphology rules (lexc)
    - Phonological rules (regular expressions)
- Compiled into an FST (finite-state transducer)
- An FST is created separately for each part of speech, then composed into the final FST for morphological analysis

MA: System Description

[Diagram: the Noun FST, Verb FST, and Adverb FST are combined with the phonology rules by COMPOSITION into the final FST for morphology, which maps an input string to an output string]

MA: System Description
- Coverage: formal Persian language
    - Full verbal conjugation
    - Nonverbal inflection
        مسافرین ؛ فقرا
    - Productive derivational morphology
        سرسام آور
    - ~20 phonological rules
    - Proper nouns of people, places, organizations

Inflectional Morphology

    LEXICON Root
    ktab     Noun ;

    LEXICON Noun
    +Pl:ha   # ;   ! کتابها
    +Pl:_ha  # ;   ! کتاب ها
    +Sg:0    # ;   ! کتاب
    +Pl:a    # ;   ! کتابا

Complex Tokens
- Two different POS categories
    دردفتر ؛ وگفت ؛ بعقیده شما ؛ اینکار ؛ بهترست

    بعقیده:     bh+Prep<eqydh+Noun+Sg
    دردفتر:     dr+Prep<dftr+Noun+Sg
    کتابهایمان:  ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl
    برادرشه:    bradr+Noun+Sg>av+Pron+Pers+Poss+1P+Pl>bvdn+Verb+Ind+Pres+3P+Sg

Verbal Morphology
- Two different stems

    Infinitive | Present Stem | Past Stem
    توانستن | توان | توانست
    رفتن | رو | رفت

Verbal Morphology

    LEXICON PastStem
    tvanst        Infl1 ;
    rft           Infl1 ;
    xndyd         Infl1 ;

    LEXICON PstStemBlog
    tvnst         InflBlog1 ;

    LEXICON PresentStem
    tvanst:tvan   Infl2 ;
    rft:rv        Infl2 ;
    xndyd:xnd     Infl2 ;

    LEXICON PrStemBlog
    tvanst:tvn    Infl2 ;
    rft:r         Infl2 ;

Long Distance Dependencies
- Some tenses of the verb can only be determined by taking into account the co-occurrence of the prefix and the person inflection or auxiliary
- This is a problem for linear approaches

    Formal | Blog | Prefix | Stem | Inflection | Auxiliary
    می گذارد | میذاره | می (Imperf.) | گذار (Present) | د (Pres.3sg) |
    می گذاشت | میذاشت | می (Imperf.) | گذاشت (Past) | ‘’ (Past.3sg) |
    می گذاشته است | میذاشته | می (Imperf.) | گذاشت (Past) | ه (Past.3sg) | است (Pres.Aux.3sg)

Long Distance Dependencies
- Leads to very complex paths and continuation classes in lexc
- Using filters greatly increases the size of the FST
- Use flag diacritics for unification (@U.Feature.Value@)
    - Keeps the FST small
    - Can apply constraints between non-adjacent morphemes

Phonology Rules
- Form of affixes may change based on the ending character of the stem
    Formal: صدایش ؛ همسایه اش/کتابش ؛ چشم هایش
    Informal: صداش ؛ همسایش/کتابش ؛ چشماش

    define clitic1 [^NB -> 0 || Cons __ ] ;
    define clitic2 [^NB -> y || Vowel __ ] ;
    define clitic3 [^NB -> "\u200c" a || e __ ] ;

- Optional in informal blog text
- Example inputs: ktab^NBš ؛ Sda^NBš ؛ hmsaye^NBš

Evaluation
- FST: 178,452 states; 928,982 arcs before optimization
- Speed: 20.84 seconds of CPU time for a 10 MB file on a Sun SPARCstation
- Coverage = 97.5%; Accuracy = 95%
- Unanalyzed tokens: proper nouns + missing lexicon words
- No weblog language rules included yet!

Conclusion
- Challenges in morphological analysis of Persian formal text
- Solutions in the XFST system
- New issues and variance due to blog language
- Need a robust system:
    - Lexicon updated with colloquial forms
    - Flexible morphological rules + derivational morphology rules
    - Transliteration component for loan words
    - Statistical approach to disambiguate and to deal with unknowns
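
The flag-diacritic idea from the Long Distance Dependencies slides can be sketched in Python. This is a toy model under stated assumptions, not the XFST system: the morpheme tables, the feature names `Asp` and `Stem`, the analysis tags, and the romanized prefix `my` (for می) are all hypothetical. Only the `rft:rv` stem pair and the `@U.Feature.Value@` notation come from the slides. A path is accepted only if every `@U...@` flag for the same feature carries the same value, which lets the prefix constrain the ending across the stem.

```python
import re
from itertools import product

FLAG = re.compile(r"@U\.(\w+)\.(\w+)@")

# Each entry: (surface string, analysis tag, unification flags).
PREFIXES = [
    ("my", "", "@U.Asp.Imperf@"),  # imperfective prefix
    ("", "", "@U.Asp.None@"),      # no prefix
]
STEMS = [
    ("rv", "rft+Verb", "@U.Stem.Pres@"),   # present stem
    ("rft", "rft+Verb", "@U.Stem.Past@"),  # past stem
]
ENDINGS = [
    ("d", "+Pres+3P+Sg", "@U.Asp.Imperf@@U.Stem.Pres@"),
    ("", "+PastImperf+3P+Sg", "@U.Asp.Imperf@@U.Stem.Past@"),
    ("", "+Past+3P+Sg", "@U.Asp.None@@U.Stem.Past@"),
]


def unifies(flags):
    """True if no feature is given two different values along the path."""
    seen = {}
    for feature, value in FLAG.findall(flags):
        if seen.setdefault(feature, value) != value:
            return False
    return True


def analyze(surface):
    """Return all analyses whose morphemes spell `surface` and whose flags unify."""
    analyses = []
    for prefix, stem, ending in product(PREFIXES, STEMS, ENDINGS):
        spelled = prefix[0] + stem[0] + ending[0]
        if spelled == surface and unifies(prefix[2] + stem[2] + ending[2]):
            analyses.append(stem[1] + prefix[1] + ending[1])
    return analyses


if __name__ == "__main__":
    for word in ["myrvd", "myrft", "rft", "myrftd"]:
        print(word, analyze(word))
```

All stems share one ending table here, and the two zero endings are told apart only by the prefix flag. Without flags, the same distinction would need a separate continuation class for each prefix and stem combination.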