Literature Mining and Ontology
BMI/IBGP 730, Autumn 2010
Yang Xiang, Ph.D. in Computer Science
yxiang@bmi.osu.edu
Department of Biomedical Informatics
The Ohio State University

Outline
• What is Literature Mining?
  – Popular Tools for Literature Mining
  – Basic Techniques
  – Indexing: Expediting searching
  – Linguistic Processing
  – Other Processing
• What is Ontology?
  – Simple ontology examples
  – Gene Ontology
  – Unified Medical Language System
  – Use and index ontology
• Applications of Literature Mining and Ontology

What is Literature (Text) Mining?
• The purposes of Literature Mining
  – Find relevant documents
  – Discover knowledge (what is knowledge?)
• The advantages of computer-based Literature Mining
  – Simply put, computers can search many more documents!
  – Computers can ‘think’ and discover knowledge.
• We will focus on biomedical literature mining in what follows.

Why Is Literature Mining Very Popular in Biomedical Science?
• Biomedical science studies natural subjects.
  – Species
  – Genes
  – Phenotypes
  – Diseases
  – …
Popular Tools for Biomedical Literature Mining – Document search
• Google
  – Google Scholar: http://scholar.google.com
• ISI Web of Knowledge
  – www.isiknowledge.com
• PubMed
  – www.ncbi.nlm.nih.gov/pubmed

Tools for Biomedical Literature Mining – Knowledge discovery
• The Gene Ontology
  – http://www.geneontology.org/
• Gene answer
  – www.geneanswers.com

Techniques Behind Literature Mining
• Interdisciplinary
  – Computer Science
    • Information retrieval
    • Data mining
    • Natural Language Processing
    • Machine learning
  – Library Science
  – Biomedical Science
  – Linguistics
    • Computational linguistics
  – Statistics
  – And more!
• Two main research areas (with some overlap)
  – Information Retrieval
  – Natural Language Processing

Basic Text Search Algorithm
  Text: … Hello, world …
  String to match: world
• Assume the text size is n.
• Assume the search string size is m.
• How do we design an efficient algorithm to find all matches in the text?
  – Brute-force algorithm, O(mn).
  – Boyer-Moore heuristics, O(mn) in the worst case, but fast in most cases for English text.
  – KMP (Knuth-Morris-Pratt) algorithm, O(m+n).
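The KMP bullet above can be made concrete with a short sketch. This is a minimal illustration in Python, not code from the lecture; the function names build_failure_table and kmp_find_all are made up for this example. It precomputes, for each prefix of the search string, how far the match can fall back without rescanning the text, which gives the O(m+n) bound.

```python
def build_failure_table(pattern: str) -> list[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter prefixes until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every (possibly overlapping) match."""
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_find_all("Hello, world", "world"))
```

The text index i never moves backwards, so each text character is examined a bounded number of times on average; the brute-force algorithm, by contrast, may restart the comparison at every position.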
Information Retrieval (Indexing)
• Archiving (preprocessing) documents for fast search
  – Preprocessing time
  – Query time
  – Index size
  – Accuracy vs relevancy
    • Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
    • Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs}|
    • Fall-out = |{nonrelevant docs} ∩ {retrieved docs}| / |{nonrelevant docs}|

Programming language processing (C++, Java, etc.)
• Lexical analysis of the statement y=x+10;
    Lexeme   Token type
    y        identifier
    =        assignment operator
    x        identifier
    +        addition operator
    10       number
    ;        end of statement
• Syntax analysis
    assignment operator (=)
      identifier: y
      expression
        expression
          identifier: x
        +
        expression
          number: 10

Natural Language Processing
• Lexical level
  – Stemming (including lemmatizing): find the root of a word
      swimming, swam, swim, swimmer → swim
  – Stemming rules may vary (a balance between overstemming and understemming)
  – Typical algorithm: the Porter stemming algorithm
  – Alias, synonym
• Grammatical level
  – Parsing: “…We find Gene1 interacts with Gene2…”
      Sentence
        Noun phrase: Gene1
        Verb phrase
          Verb: interact
          Noun phrase: Gene2
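The precision, recall, and fall-out definitions from the indexing slide translate directly into set operations. The sketch below is only an illustration: the function name retrieval_metrics and the document IDs are made up, and returning 0.0 for an empty denominator is an assumption of this example rather than part of the definitions.

```python
def retrieval_metrics(relevant: set, retrieved: set, collection: set) -> dict:
    """Compute precision, recall and fall-out for one query.

    `collection` is the set of all document IDs; everything in it that is
    not in `relevant` counts as nonrelevant.
    """
    nonrelevant = collection - relevant
    hits = relevant & retrieved            # relevant docs that were retrieved
    false_alarms = nonrelevant & retrieved  # nonrelevant docs that were retrieved
    return {
        # Assumption: report 0.0 when a denominator is empty.
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        "fall_out": len(false_alarms) / len(nonrelevant) if nonrelevant else 0.0,
    }


# Hypothetical query over a 10-document collection.
collection = set(range(1, 11))
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5, 6, 7}
print(retrieval_metrics(relevant, retrieved, collection))
```

In this made-up query, 2 of the 5 retrieved documents are relevant (precision 0.4), 2 of the 4 relevant documents are retrieved (recall 0.5), and 3 of the 6 nonrelevant documents are retrieved (fall-out 0.5).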
Statistical and Data Mining Processing
• Statistical
  – Count the word frequency
  – Count the expression frequency
• Data Mining
  – Mining the set of frequent words
  – Association rule

Document Classification (Machine Learning)
• E.g., classify all documents related to coffee and health
    Coffee and health related documents
      Documents show benefits: Cardioprotective, …, Laxative
      Documents show risk: Cholesterol, …, Anxiety
• Various machine learning algorithms can be applied here.

Ontology
• In philosophy, ontology is a systematic account of existence.
• In information science, an ontology is a representation of concepts and their relationships, often as a directed graph.

Ontology Example (Informal)
[Figure: an informal fish ontology. Node labels: fish; salt water, fresh water; North American, Asian, Europe, ……; native, invasive; Crappie, Common Carp, mirror Carp]

Ontology Example: Scientific classification
    Rank         Taxa shown
    Kingdom      Animalia
    Phylum       Chordata, Hemichordata, …
    Class        Actinopterygii, Sarcopterygii, …
    Subclass     Neopterygii, Chondrostei, …
    Infraclass   Teleostei, …
    Order        Cypriniformes, …
    Family       Cyprinidae, …
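The scientific classification example above is a tree-shaped ontology, and one simple way to store such a tree is a child-to-parent map. The sketch below is only an illustration under that assumption: the names PARENT, lineage, and is_a are made up for this example, and only the taxa named on the slide are included.

```python
# Child -> parent edges for the taxa named on the slide.
PARENT = {
    "Chordata": "Animalia",
    "Hemichordata": "Animalia",
    "Actinopterygii": "Chordata",
    "Sarcopterygii": "Chordata",
    "Neopterygii": "Actinopterygii",
    "Chondrostei": "Actinopterygii",
    "Teleostei": "Neopterygii",
    "Cypriniformes": "Teleostei",
    "Cyprinidae": "Cypriniformes",
}


def lineage(taxon: str) -> list[str]:
    """Return the path from the root of the tree down to `taxon`."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def is_a(taxon: str, ancestor: str) -> bool:
    """True if `ancestor` lies on the path from the root to `taxon`.

    Note: in this sketch a taxon counts as its own ancestor.
    """
    return ancestor in lineage(taxon)


print(" > ".join(lineage("Cyprinidae")))
print(is_a("Cyprinidae", "Chordata"), is_a("Chondrostei", "Teleostei"))
```

Because every node in a tree has a single parent, walking upward answers "is X a kind of Y?" directly. Ontologies that are general directed acyclic graphs allow several parents per node, so this simple walk no longer suffices for them.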
Gene Ontology (GO) Consortium
[Figure: excerpt of the Gene Ontology graph. Node labels: DNA metabolis…; cell; Molecular function, …, Nucleic acid binding, DNA binding; enzyme, helicase, DNA helicase, ATP-dependent DNA helicase, …]
Reference: Gene Ontology: tool for the unification of biology. Nature Genetics, 2000. http://dx.doi.org/10.1038/75556

Unified Medical Language System (UMLS)
• A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains:
  – Metathesaurus
  – Semantic Network
  – SPECIALIST Lexicon
• Maintained by the US National Library of Medicine
• Website: http://www.nlm.nih.gov/research/umls/

UMLS - Metathesaurus
• Number of biomedical concepts >= 1 million
• Number of concept names >= 5 million
• Stems from over 100 incorporated controlled source vocabularies:
  – ICD (International Statistical Classification of Diseases and Related Health Problems)
  – MeSH (Medical Subject Headings)
  – SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms)
  – LOINC (Logical Observation Identifiers Names and Codes)
  – Gene Ontology
  – OMIM (Online Mendelian Inheritance in Man)
  – …
http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

UMLS - Semantic Network
• 135 semantic types (categories)
  – Entity
    • Physical Object
      – Organism
      – …
    • …
  – Event
    • Activity
      – Behavior
      – …
    • …
• 54 semantic relationships (between members of the various semantic types)
  – isa
  – associated_with
    • physically_related_to
      – part_of
      – …
    • spatially_related_to
      – location_of
      – …
    • …
http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
http://www.clres.com/semrels/umls_relation_list.html

Use of ontology systems
• Statistical
  – Gene ontology enrichment test
• Indexing
  – Reachability
  – Distance
  – Path

Represent Ontology by Graphs
• Directed Graph
• Directed Acyclic Graph (DAG): A good number of ontologies fall into this type, but not all!
• Directed Tree

Reachability
The problem: Given two vertices u and v in a directed graph G, is there a path from u to v?
[Figure: example directed graph on vertices 1 to 15]
  ?Query(1, 11): Yes
  ?Query(3, 9): No

Distance
The problem: Given two vertices u and v in a (directed) graph G, what is the distance from u to v?
[Figure: the same example graph on vertices 1 to 15]
  ?Query dG(1, 11) = 3

Path
The problem: Given two vertices u and v in a (directed) graph G, what is a path (or what are the paths) connecting u to v?
[Figure: the same example graph on vertices 1 to 15]
  Find a path from 1 to 11

The estimated difficulty of building very efficient indexing schemes (based on current research)
                             Reachability   Distance   Path
    Directed Tree            easy           easy       easy
    Directed Acyclic Graph   medium         hard       hard
    Directed Graph           medium         hard       hard

References:
R. Jin, Y. Xiang, N. Ruan, H. Wang, "Efficiently Answering Reachability Queries on Very Large Directed Graphs", Proc. of ACM SIGMOD Conference, Vancouver, June 9-12, 2008, pp. 595-608.
R. Jin, Y. Xiang, N. Ruan, D. Fuhry, "3-HOP: A High-Compression Indexing Scheme for Reachability Query", Proc. of ACM SIGMOD Conference, Providence, Rhode Island, June 29-July 2, 2009, pp. 813-826.
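Without any index, the reachability, distance, and path queries defined above can all be answered by one breadth-first search per query. The sketch below shows that baseline, not the indexing schemes from the cited papers. The edges of the slide's example graph did not survive, so the edge list GRAPH here is made up; it was chosen only so that the answers agree with the slide's queries. The function names are likewise hypothetical.

```python
from collections import deque

# Hypothetical edge list (adjacency lists); NOT the graph from the slide.
GRAPH = {1: [6], 6: [10], 10: [11], 3: [7], 7: [13], 2: [5], 5: [9]}


def shortest_path(graph: dict, source, target):
    """Breadth-first search; returns a shortest path as a list of
    vertices, or None if target is unreachable from source."""
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:  # walk parent links back to the source
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def reachable(graph: dict, source, target) -> bool:
    return shortest_path(graph, source, target) is not None


def distance(graph: dict, source, target):
    """Number of edges on a shortest path, or None if unreachable."""
    path = shortest_path(graph, source, target)
    return None if path is None else len(path) - 1


print(reachable(GRAPH, 1, 11), distance(GRAPH, 1, 11), shortest_path(GRAPH, 1, 11))
print(reachable(GRAPH, 3, 9))
```

Each query here may visit every vertex and edge, which is too slow for many queries on a very large graph. That cost is the motivation for precomputing an index.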
Applications of Literature Mining and Ontology - I
• Build confirmed gene-phenotype relations
  – Human Phenotype Ontology (HPO)
  – Built from the Online Mendelian Inheritance in Man (OMIM) database.
  – http://human-phenotype-ontology.org/
Reference: Robinson PN, Mundlos S. The Human Phenotype Ontology. Clinical Genetics 77(6) 2010: 525–534. http://dx.doi.org/10.1111/j.1399-0004.2010.01436.x

Applications of Literature Mining and Ontology - II
• Predicting unknown gene-phenotype relations
  – Use text mining to build similarities among phenotypes
  – Gene relationships are built from protein-protein interaction (PPI) databases
  – Known gene-phenotype relationships can be built by text mining. The previous slide gives an example.
• Various methods (statistical, graph-theoretic, etc.) have been proposed to do the prediction.
  – Reference: X. Wu, R. Jiang, M.Q. Zhang, and S. Li, Network-based global inference of human disease genes. Molecular Systems Biology, 4(1), 2008
[Figure: PPI network ≈ G2G network; phenotype similarity graph]

Thanks! Questions?