A Multilayer System of Lexical Resources for Language Technology Infrastructure

Paweł Kędzia, Michał Marcińczuk, Marek Maziarz, Maciej Piasecki, Adam Radziszewski, Ewa Rudnicka
G4.19 Research Group, Wrocław University of Technology
nlp.pwr.wroc.pl
plwordnet.pwr.wroc.pl

System of Lexico-semantic Resources
• Lexicon of lexico-syntactic structures of multi-word expressions
• plWordNet 3.0 (Słowosieć 3.0)
• plWordNet 3.0 to WordNet 3.1 mapping
• Semantic lexicon of proper names
• Mapping to an ontology
• A valency lexicon linked to plWordNet

System of Lexico-semantic Resources
[Diagram labels: Valence lexicon; MWE lexicon; plWordNet 3.0; describes; WordNet 3.1 + extension; Proper Names; Ontology: SUMO + intermediate level]

Wordnet (example from plWordNet 2.2)
• {samochodzik 2 'small car'} –diminutiveness→ {samochód 1, pojazd samochodowy 1, auto 1, wóz 1 'car, automobile'}
• {bagażnik 1 'boot'} –meronymy→ {samochód 1, pojazd samochodowy 1, auto 1, wóz 1}
• {pogotowie 3, karetka 1, sanitarka 1, karetka pogotowia 1 'ambulance'} –hyponymy→ {samochód 1, pojazd samochodowy 1, auto 1, wóz 1}

Synset and constitutive relations
• Synset as a notational convention for a group of lexical units sharing certain relations
  – represents synonyms
  – {afekt 1 'passion', uczucie 2 'feeling'} is the hypernym of {miłość 1 'love', umiłowanie 1 'affection', kochanie 1 'loving'}
• This is based on constitutive relations
• Additional distinctions: stylistic register and aspect
• Minimal commitment principle: make as few assumptions as possible

plWordNet model: non-relational aspects
• Constitutive features
  – stylistic registers, verb aspect and semantic verb classes
  – referred to in the relation definitions, e.g. relations limited to verbs of the same aspect and semantic class
• Glosses help wordnet editors
• Usage examples: direct links to the corpus

plWordNet: Constitutive Relations
• Traditional wordnet relations
  – e.g. tiger 'Panthera tigris' –hyponymy→ tiger
  – meronymy, cause, instance of (only Proper Names)
• Additional constitutive relations
  – e.g. swallow –verb meronymy→ eat
  – preceding, presupposition, gradation (only Adjectives)
• Number: 10, with about 40 subtypes

plWordNet: Relations of Lexical Units
• Traditional wordnet relations
  – e.g. antonymy, fuzzynymy
• Additional relations
  – converse
  – plWordNet: derivationally based lexico-semantic relations
• Number: 8, with about 40 subtypes
• For instance:
  – góral 'highlander' –inhabitant– góry 'highlands'
  – zapalić się (perfective) 'light, start burning' –inchoativity– palić się (imperfective) 'burn, produce light'
  – chamieć (imperfective) 'to become a boor' –process– cham 'boor'

State: plWordNet on 22nd Nov. 2014

                    plWN      PWN 3.1   enWN      plWN 3.0
Number of lemmas    156,360   155,593   157,691   ~185,000
Lexical units       220,848   206,978   209,329   >260,000
Synsets             164,233   117,659   119,441   ~195,000

• plWN – plWordNet (the version: 20th Nov. 2014)
• PWN 3.1 – Princeton WordNet 3.1
• enWN 0.1 – PWN 3.1 expanded in CLARIN-PL (20th Nov. 2014)
• plWN 3.0 – the target size of plWN

Lexicon of multi-word expressions
• Non-trivial morphology of Polish MWEs
  – more than 100 nominal structural patterns
• Description of the lexico-syntactic structures of MWEs
• Multi-word LUs as semantic atoms
  – no internal semantic relations
• Dynamic lexicon
  – a tool for automatic MWE extraction
• 60 000 described in the lexicon and plWordNet

Multi-word lexical units
• Dictionary of MWLUs
  – goal: 60k entries
  – semantic description by mapping to plWordNet
  – syntactic description by WCCL constraints
• Criteria of distinguishing MWLUs: collocations that are
  – terms,
  – non-compositional expressions,
  – syntactically fixed expressions.
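As a rough illustration of what a "syntactic description by constraints" checks, the Python sketch below tests adjacent token pairs for a plural noun followed by an adjective that agrees with it in number, gender and case. It only mimics the shape of such a constraint and is not WCCL. The token dictionaries, the function names and the hand-written tags are assumptions made for this example.

```python
# Hypothetical token format: one dict of morphosyntactic attributes per token.
# Attribute names (base, class, nmb, gnd, cas) follow the abbreviations in this deck.
NOUN_CLASSES = {"subst", "ger", "depr"}
ADJ_CLASSES = {"adj", "ppas", "pact"}
AGREEMENT_ATTRS = ("nmb", "gnd", "cas")


def matches_plural_noun_adj(first, second, noun_base, adj_base):
    """True if (first, second) is the given plural noun + agreeing adjective."""
    return (
        first.get("base") == noun_base
        and second.get("base") == adj_base
        and first.get("class") in NOUN_CLASSES
        and second.get("class") in ADJ_CLASSES
        and first.get("nmb") == "pl"
        and all(first.get(a) == second.get(a) for a in AGREEMENT_ATTRS)
    )


def find_mwe(tokens, noun_base, adj_base):
    """Yield start positions of adjacent token pairs that satisfy the pattern."""
    for i in range(len(tokens) - 1):
        if matches_plural_noun_adj(tokens[i], tokens[i + 1], noun_base, adj_base):
            yield i


# Illustrative, hand-tagged sentence (tags are assumptions, not tagger output).
sentence = [
    {"orth": "Grał", "base": "grać", "class": "praet"},
    {"orth": "na", "base": "na", "class": "prep"},
    {"orth": "gęślikach", "base": "gęśliki", "class": "subst",
     "nmb": "pl", "gnd": "m3", "cas": "loc"},
    {"orth": "podhalańskich", "base": "podhalański", "class": "adj",
     "nmb": "pl", "gnd": "m3", "cas": "loc"},
]

print(list(find_mwe(sentence, "gęśliki", "podhalański")))
```

Matching on base forms plus agreement, instead of on surface strings, is what lets a single lexicon entry cover all inflected variants of the expression.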
Example: gęśliki podhalańskie '~fiddle'

<mwegroup type='fix' name='SubstAdjPlFix' class='subst'>
  <condition>
    and(
      inter(base[0],$s:S),
      inter(nmb[0], {pl}),
      inter(base[1],$s:A),
      inter(class[1],{adj,ppas,pact}),
      inter(class[0],{subst,ger,depr}),
      agrpp(0,1,{nmb,gnd,cas}),
      setvar($Pos1, 0),
      setvar($Pos2, 1)
    )
  </condition>
  <instances>
    <MWE base='gęśliki podhalańskie'>
      <head>in(class[0],{subst,ger,depr})</head>
      <var name='S'>gęśliki</var>
      <var name='A'>podhalański</var>
    </MWE>
  </instances>
</mwegroup>

plWordNet to WordNet 3.1 mapping
• plWordNet: built independently to obtain a faithful description
• Manual mapping
  – bottom-up order
  – comparison of the relation structures
  – a cascading list of inter-lingual relations
• plWordNet verification as an important side effect
• Present state: 113,265 N and Adj synsets mapped
• Target: complete plWordNet 3.0 mapped

Hierarchy of inter-lingual relations
• Inter-lingual synonymy (only one per synset)
• Inter-lingual inter-register synonymy
• I-partial synonymy
• I-hyponymy
• I-hypernymy
• I-meronymy for parts, elements or materials of bigger wholes
• I-holonymy for a whole made of smaller parts, elements or materials

WordnetLoom: editing the mapping

NELexicon 2.0
• NELexicon 2.0 is a dictionary of proper names containing 2.3 million entries.
• The hierarchy of proper name categories is based on Sekine's Extended Named Entity Hierarchy [http://nlp.cs.nyu.edu/ene/].
  – 7 top-level categories: event, facility, living, location, organization, product, other,
  – 3-level hierarchy,
  – 107 fine-grained categories.
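The category labels in this deck encode their position in the hierarchy as an underscore-separated prefix (for example nam_loc_gpe_city under nam_loc_gpe under nam_loc). A minimal sketch of recovering parent links from a list of labels follows. Splitting on every underscore would be wrong for names such as nam_loc_country_region, so the sketch looks for the longest prefix that is itself a known label. The function names are hypothetical and the label list is a small subset taken from the hierarchy fragment.

```python
def build_parents(labels):
    """Map each label to its parent: the longest '_'-delimited prefix that is also a label."""
    known = set(labels)
    parents = {}
    for label in labels:
        parts = label.split("_")
        parent = None
        # Try progressively shorter prefixes until one is a known label.
        for cut in range(len(parts) - 1, 0, -1):
            candidate = "_".join(parts[:cut])
            if candidate in known:
                parent = candidate
                break
        parents[label] = parent
    return parents


def depth(label, parents):
    """Number of ancestors of a label within the given label set."""
    steps = 0
    while parents[label] is not None:
        label = parents[label]
        steps += 1
    return steps


# A few labels copied from the hierarchy fragment in this deck.
labels = [
    "nam_loc",
    "nam_loc_gpe",
    "nam_loc_gpe_city",
    "nam_loc_country_region",
    "nam_loc_hydronym",
    "nam_loc_hydronym_river",
]
parents = build_parents(labels)

for label in labels:
    print("  " * depth(label, parents) + label)
```

With this subset, depth 0 corresponds to a top-level category and depth 2 to the third level of the hierarchy.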
NELexicon 2.0: hierarchy (fragment)

nam_loc (location)
  nam_loc_astronomical
  nam_loc_country_region
  nam_loc_gpe
  nam_loc_gpe_admin
  nam_loc_gpe_city
  nam_loc_gpe_conurbation
  nam_loc_gpe_country
  nam_loc_gpe_district
  nam_loc_gpe_subdivision
  nam_loc_historical_region
  nam_loc_hydronym
  nam_loc_hydronym_bay
  nam_loc_hydronym_lagoon
  nam_loc_hydronym_lake
  nam_loc_hydronym_ocean
  nam_loc_hydronym_river
  nam_loc_hydronym_sea
  nam_loc_land_cape
  nam_loc_land_continent
  nam_loc_land_desert
  nam_loc_land_island
  nam_loc_land_mountain
  nam_loc_land_peak
  nam_loc_land_peninsula
  …

nam_fac (facility)
  nam_fac_bridge
  nam_fac_cossroad
  nam_fac_goe
  nam_fac_goe_market
  nam_fac_goe_stop
  nam_fac_park
  nam_fac_road
  nam_fac_square
  nam_fac_system

NELexicon 2.0 – statistics

[Pie chart: coarse-grained categories breakdown. Legend: Event, Organization, Facility, Other, Living, Product, Location. Slice labels: 4%, 1%, 6%, 0%, 30%, 43%, 16%.]

Top 10 fine-grained categories

Count      Category
450 351    nam_org_company
418 786    nam_org_organization
371 390    nam_liv_person_last
281 013    nam_liv_person
197 197    nam_loc_gpe_city
 72 537    nam_loc_gpe_admin3
 44 184    nam_fac_road
 34 156    nam_loc_astronomical
 28 629    nam_fac_other
 23 153    nam_org_institution

Mapping to ontology
• Ontology: unambiguous concepts defined formally
• Lexical meanings
  – imprecisely delimited
  – constrained by usage, stylistic register and sentiment
• Mapping to ontology
  – precise, formal description for meanings
  – association: concepts and their lexical embodiment
• SUMO selected
  – Princeton WordNet mapping
• Semi-automated mapping of plWordNet

SUMO Ontology
• SUMO: Suggested Upper Merged Ontology,
  – Available under the General Public Licence,
  – Contains ~25 000 concepts and ~80 000 axioms,
  – Concepts are connected by one of the relations: subclass, subrelation, instance, subAttribute.
  – Each concept has a formal definition written in the SUO-KIF language:

    (<=>
      (exists (?BUILD)
        (and
          (instance ?BUILD Constructing)
          (result ?BUILD ?ARTIFACT)))
      (instance ?ARTIFACT StationaryArtifact))

PlWordNet mapping to SUMO

Applications
A free WordNet-type licence facilitates applications. Examples:
• Semantic annotation in a corpus of referential gestures (Lis, 2012)
• Lexicon of semantic valency frames (Hajnicz, 2011; Hajnicz, 2012)
• Features for text mining from Web pages (Maciołek and Dobrowolski, 2013)
• Mapping between a lexicon and an ontology (Wróblewska et al., 2013)
• Word-to-word similarity in ontologies (Lula and Paliwoda-Pękosz, 2009)
• Text similarity for Information Retrieval (Siemiński, 2012)
• Text classification (Maciołek, 2010)
• Terminology extraction and clustering (Mykowiecka and Marciniak, 2012)
• Automated extraction of Opinion Attribute Lexicons (Wawer and Gołuchowski, 2012)
• Named Entity Recognition
• Word Sense Disambiguation (Gołuchowski and Przepiórkowski, 2012)
• Anaphora resolution
About 600 registered users, ~70 declared commercial applications

Conclusions
• plWordNet 2.2: a national wordnet, not translated from Princeton WordNet
• plWordNet 2.2.1 is larger than WordNet 3.1 in size, as well as in lexical coverage, hypernymy depth and relation density
• Synset membership depends only on constitutive relations between lexical units.
• A unique mapping strategy and a unique opportunity to compare the two lexical systems
• plWordNet 3.0 (2015):
  – a comprehensive wordnet of Polish
  – 185k lemmas and 260k LUs, mapped to enWN

Thank you!
www.plwordnet.pwr.wroc.pl

NELexicon 2.0 – sources
• Sources:
  – NELexicon 1.0 (1.4 million),
  – Wikipedia infoboxes (manually created mapping for 970 infobox attributes),
  – Wikipedia internal links (base forms for inflected forms),
  – Names recognized by Liner2 in Wikipedia (statistical model for NER for Polish),
  – Inflected forms from Wiktionary.
Features for the mapping rules
• Interlingual relation between plWordNet and WordNet: i-synonymy, i-hyponymy, i-part-of-meronymy, ...
• Mapping relation between WordNet and SUMO: equivalent, instance of and subsumed.
• Domains of plWordNet and WordNet synsets: body, grp, food, loc, ...
• Capital letter in the first lemma of a plWordNet synset.
• SUMO concept: Currency, GroupOfPeople, FieldOfStudy, Human, ...

Constitutive relations
• Synset = a group of lexical units which share all constitutive relations
• Constitutive relation = a lexico-semantic relation which
  – is frequent enough
  – and frequently shared by groups
  Also
  – is established in linguistics
  – and accepted in the wordnet tradition
• Examples: hypernymy, meronymy, cause

Applications
• Strong universal basis
  – a comprehensive wordnet
  – >200 000 lemmas resulting in ~285 000 LUs and ~210 000 synsets
  – one of the largest ever Polish dictionaries
• Modularly constructed toolkit
  – a layered architecture of large software systems
  – separate but linked layers
  – each layer based on a limited set of notions and principles, and exchangeable
• The core of the CLARIN-PL language technology infrastructure
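The definition "synset = a group of lexical units which share all constitutive relations" can be pictured with a small Python sketch: collect, for every lexical unit, the set of its constitutive relation instances, and put units with identical sets into one group. This is a toy simplification and not the plWordNet editing procedure. The triples, the target "pojazd 1", the fuzzynymy link and the function name are invented for illustration, and the relation names in CONSTITUTIVE are only the examples named on the slide.

```python
from collections import defaultdict

# Relations treated as constitutive in this toy example (examples named in the deck).
CONSTITUTIVE = {"hyponymy", "meronymy", "cause"}

# Invented toy data: (lexical unit, relation, target) triples.
triples = [
    ("samochód 1", "hyponymy", "pojazd 1"),
    ("auto 1", "hyponymy", "pojazd 1"),
    ("wóz 1", "hyponymy", "pojazd 1"),
    ("wóz 1", "fuzzynymy", "jazda 1"),  # not constitutive, so it is ignored
    ("karetka 1", "hyponymy", "samochód 1"),
    ("sanitarka 1", "hyponymy", "samochód 1"),
]


def group_into_synsets(triples):
    """Group lexical units whose sets of constitutive relation instances are identical."""
    signature = defaultdict(set)
    for unit, relation, target in triples:
        if relation in CONSTITUTIVE:
            signature[unit].add((relation, target))
    groups = defaultdict(list)
    for unit, sig in signature.items():
        groups[frozenset(sig)].append(unit)
    return sorted(sorted(units) for units in groups.values())


synsets = group_into_synsets(triples)
for synset in synsets:
    print("{" + ", ".join(synset) + "}")
```

The non-constitutive link on "wóz 1" does not split it from "samochód 1" and "auto 1", which is the point of restricting synset membership to constitutive relations.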