Literature Mining and Ontology BMI/IBGP 730 Autumn, 2010

Yang Xiang, Ph.D. in Computer Science
yxiang@bmi.osu.edu
Department of Biomedical Informatics
The Ohio State University
Outline
• What is Literature Mining?
  – Popular Tools for Literature Mining
  – Basic Techniques
  – Indexing: Expediting searching
  – Linguistic Processing
  – Other Processing
• What is Ontology?
  – Simple ontology examples
  – Gene ontology
  – Unified Medical Language System
  – Use and index ontology
• Applications of Literature Mining and Ontology
What is Literature (Text) Mining?
• The purposes of Literature Mining
– Find relevant documents
– Discover knowledge (what is knowledge?)
• The advantages of computer-based Literature Mining
– Simply put, computers can search many more documents than a human reader can.
– Computers can ‘think’ and discover knowledge.
• We will focus on biomedical literature mining in what follows.
Why Is Literature Mining So Popular in Biomedical Science?
• Biomedical science studies natural subjects.
– Species
– Genes
– Phenotypes
– Diseases
….
Popular Tools for Biomedical Literature
Mining – Document search
• Google
– Google Scholar: http://scholar.google.com
• ISI Web of Knowledge
– www.isiknowledge.com
• PubMed
– www.ncbi.nlm.nih.gov/pubmed
Tools for Biomedical Literature Mining
– Knowledge discovery
• The Gene Ontology
– http://www.geneontology.org/
• Gene answer
– www.geneanswers.com
Techniques Behind Literature Mining
• Interdisciplinary
– Computer Science
  • Information retrieval
  • Data mining
  • Natural Language Processing
  • Machine learning
– Library Science
– Biomedical Science
– Linguistics
  • Computational linguistics
– Statistics
– And more!
• Two main research areas (some overlaps)
– Information Retrieval
– Natural Language Processing
Basic Text Search Algorithm
[Figure: text "… Hello, world …"; string to match "world"]
• Assume text size is n.
• Assume search string size is m.
• How can we design an efficient algorithm to find all matches in the text?
– Brute-force algorithm, O(mn).
– Boyer-Moore heuristics, O(mn) in the worst case, but fast in most cases for English text.
– KMP (Knuth-Morris-Pratt) algorithm, O(m+n).
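The KMP bullet above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function names `build_failure` and `kmp_search` are chosen here for the example. The failure table is built in O(m) and the scan of the text takes O(n), which gives the O(m+n) bound.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # reuse what is already known; never re-read text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("Hello, world", "world"))  # [7]
```

The text pointer never moves backwards, which is what separates KMP from the brute-force scan.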
Information Retrieval (Indexing)
• Archiving (preprocessing) documents for fast search
– Preprocessing time
– Query time
– Index size
– Accuracy vs relevancy
• Precision =
|{relevant docs}∩{retrieved docs}| / |{retrieved docs}|
• Recall =
|{relevant docs}∩{retrieved docs}| / |{relevant docs}|
• Fall-out =
|{nonrelevant docs}∩{retrieved docs}| / |{nonrelevant docs}|
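The three measures above translate directly into set operations. The sketch below assumes documents are identified by integer IDs and that the whole collection is known, so that the nonrelevant set can be computed. The function name `retrieval_metrics` and the example IDs are made up for illustration.

```python
def retrieval_metrics(relevant, retrieved, collection):
    """Return (precision, recall, fall_out) for one query."""
    relevant, retrieved, collection = set(relevant), set(retrieved), set(collection)
    nonrelevant = collection - relevant
    hits = relevant & retrieved             # relevant docs that were retrieved
    false_alarms = nonrelevant & retrieved  # nonrelevant docs that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fall_out = len(false_alarms) / len(nonrelevant) if nonrelevant else 0.0
    return precision, recall, fall_out


# Hypothetical collection of 10 documents, 4 of them relevant, 3 retrieved.
precision, recall, fall_out = retrieval_metrics(
    relevant={1, 2, 3, 4}, retrieved={3, 4, 5}, collection=range(1, 11)
)
print(precision, recall, fall_out)  # 2/3, 1/2, 1/6
```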
Programming language processing (C++, Java, etc.)
• Lexical analysis of the statement y=x+10;

lexeme    Token type
y         identifier
=         assignment operator
x         identifier
+         addition operator
10        number
;         end of statement

• Syntax analysis

assignment operator (=)
  identifier: y
  expression
    expression
      identifier: x
    +
    expression
      number: 10
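The lexical-analysis table for y=x+10; can be reproduced with a small regular-expression tokenizer. This is only a sketch covering the token types on the slide; `TOKEN_SPEC` and `tokenize` are names chosen for this example, and a real C++ or Java lexer handles far more cases.

```python
import re

# (token type, pattern) pairs, tried in order at each position.
TOKEN_SPEC = [
    ("number", re.compile(r"\d+")),
    ("identifier", re.compile(r"[A-Za-z_]\w*")),
    ("assignment operator", re.compile(r"=")),
    ("addition operator", re.compile(r"\+")),
    ("end of statement", re.compile(r";")),
    ("whitespace", re.compile(r"\s+")),
]


def tokenize(source):
    """Split source into (lexeme, token type) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        for token_type, pattern in TOKEN_SPEC:
            match = pattern.match(source, pos)
            if match:
                if token_type != "whitespace":
                    tokens.append((match.group(), token_type))
                pos = match.end()
                break
        else:  # no pattern matched at this position
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
    return tokens


for lexeme, token_type in tokenize("y=x+10;"):
    print(f"{lexeme:<4}{token_type}")
```

Syntax analysis would then consume this token list to build the tree shown above.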
Natural Language Processing
• Lexical level
– Stemming (including lemmatizing): find the root of a word
swimming, swam, swim, swimmer → swim
– Stemming rule may vary (balance between overstemming and
understemming)
– Typical algorithm (Porter Stemming algorithm)
– Alias, Synonym
• Grammatical level
– Parsing
“… We find Gene1 interacts with Gene2 …”
Sentence
  Noun phrase: Gene1
  Verb phrase
    Verb: interact
    Noun phrase: Gene2
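For the lexical level, the swim example can be illustrated with a toy suffix stripper. This is deliberately not the Porter algorithm, which applies several ordered rule phases. The suffix list, the `IRREGULAR` lookup table, and the function name `toy_stem` are all assumptions made for this sketch.

```python
# Hypothetical rule set for illustration only; real stemmers use many more rules.
SUFFIXES = ("ing", "er", "s")
IRREGULAR = {"swam": "swim"}  # lemmatizing irregular forms needs a lookup table


def toy_stem(word):
    """Reduce a word to a crude root by lookup or suffix stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Require at least 3 remaining letters to limit overstemming.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "swimm" -> "swim".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


print([toy_stem(w) for w in ["swimming", "swam", "swim", "swimmer"]])
# ['swim', 'swim', 'swim', 'swim']
```

The same rules turn "butter" into "but", a small instance of the overstemming trade-off mentioned above.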
Statistical and Data Mining Processing
• Statistical
– Count the word frequency
– Count the expression frequency
• Data Mining
– Mining the set of frequent words
– Association rule
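A rough sketch of the counting steps listed above, using only the Python standard library. The three sentences are made-up toy input, and counting word pairs that co-occur in at least `min_support` documents stands in for frequent-set mining in general; algorithms such as Apriori extend the same idea to larger sets.

```python
import re
from collections import Counter
from itertools import combinations


def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_frequency(docs):
    """Count how often each word occurs across all documents."""
    return Counter(w for doc in docs for w in words(doc))


def frequent_word_pairs(docs, min_support):
    """Return word pairs that co-occur in at least min_support documents."""
    pair_counts = Counter()
    for doc in docs:
        vocab = sorted(set(words(doc)))  # each pair counted once per document
        pair_counts.update(combinations(vocab, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Toy documents, invented for this example.
docs = [
    "Gene1 interacts with Gene2",
    "Gene1 binds Gene3",
    "Gene2 interacts with Gene1",
]
print(word_frequency(docs).most_common(2))
print(frequent_word_pairs(docs, min_support=2))
```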
Document Classification (Machine Learning)
• E.g., classify all documents related to coffee and health
[Figure: coffee-and-health-related documents are split into "documents show benefits" and "documents show risk"; example terms shown: Cardioprotective, …, Laxative; Cholesterol, …, Anxiety]
• Various machine learning algorithms can be
applied here.
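As one example of such an algorithm, here is a minimal multinomial Naive Bayes classifier with add-one smoothing. The lecture does not prescribe a method, so this choice is an assumption. The four training sentences and their labels are invented toy data built from the terms in the figure, not real literature.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-label word counts and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
model = train([
    ("coffee shows cardioprotective effect", "benefit"),
    ("coffee has laxative effect", "benefit"),
    ("coffee raises cholesterol", "risk"),
    ("coffee linked to anxiety", "risk"),
])
print(classify("cardioprotective coffee", model))  # benefit
print(classify("anxiety and cholesterol", model))  # risk
```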
Ontology
• According to philosophy, ontology is a
systematic account of Existence
• In information science, ontology is a
representation of concepts and their
relationships, often by directed graphs
Ontology Example (Informal)
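[Figure: a tree rooted at "fish" that branches by habitat (salt water, fresh water), by region (North American, Asian, Europe, ……), and by status (native, invasive), with the example species Crappie, Common Carp, and Mirror Carp]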
Ontology Example: Scientific classification

Rank         Taxon (siblings shown in the figure)
Kingdom      Animalia
Phylum       Chordata (sibling: Hemichordata, …)
Class        Actinopterygii (sibling: Sarcopterygii, …)
Subclass     Neopterygii (sibling: Chondrostei)
Infraclass   Teleostei
…
Order        Cypriniformes
Family       Cyprinidae
Gene Ontology (GO) Consortium
[Figure: fragment of the GO graph with the nodes DNA metabolis…, Molecular function, Nucleic acid binding, DNA binding, cell, enzyme, helicase, DNA helicase, ATP-dependent DNA helicase]
Reference: "Gene Ontology: tool for the unification of biology", Nature Genetics, 2000.
http://dx.doi.org/10.1038/75556
Unified Medical Language System
(UMLS)
• A compendium of controlled vocabularies in the
biomedical sciences (since 1986). It contains:
– Metathesaurus
– Semantic Network
– SPECIALIST Lexicon
• Maintained by the US National Library of Medicine
• Website:
http://www.nlm.nih.gov/research/umls/
UMLS - Metathesaurus
• Number of biomedical concepts >= 1 million
• Number of concept names >= 5 million
• Drawn from over 100 incorporated controlled source vocabularies:
– ICD (International Statistical Classification of Diseases and Related Health Problems)
– MeSH (Medical Subject Headings)
– SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms)
– LOINC (Logical Observation Identifiers Names and Codes)
– Gene Ontology
– OMIM (Online Mendelian Inheritance in Man)
…
http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html
UMLS - Semantic Network
• 135 semantic types (categories)
  – Entity
    • Physical Object
      – Organism
      …
    …
  – Event
    • Activity
      – Behavior
      …
    …
• 54 semantic relationships (between members of the various semantic types)
  – isa
  – associated_with
    • physically_related_to
      – part_of
      …
    • spatially_related_to
      – location_of
      …
    …
http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
http://www.clres.com/semrels/umls_relation_list.html
Use of ontology systems
• Statistical
– Gene ontology enrichment test
• Indexing
– Reachability
– Distance
– Path
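The slide names the enrichment test without giving a formula. One common formulation, assumed here, is a one-sided hypergeometric test: given a population of genes, some of which are annotated with a GO term, how likely is it that a random sample of the same size contains at least as many annotated genes as observed? The function name and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """P(X >= annotated_in_sample) when `sample` genes are drawn without
    replacement from `population` genes, `annotated` of which carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 genes, 5 annotated; a list of 4 genes contains 3 of them.
print(enrichment_p_value(population=20, annotated=5, sample=4, annotated_in_sample=3))
# 155 / 4845, about 0.032
```

In practice the test is repeated for many GO terms, so the resulting p-values usually need a multiple-testing correction.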
Represent Ontology by Graphs
• Directed Graph
• Directed Acyclic Graph (DAG): A good number
of ontologies fall into this type, but not all!
• Directed Tree
Reachability
The problem: Given two vertices u and v in
a directed graph G, is there a path from u to v ?
[Figure: example directed graph with vertices 1 to 15]
Query(1, 11)? Yes
Query(3, 9)? No
Distance
The problem: Given two vertices u and v in
a (directed) graph G, what is the distance from u to v?
[Figure: example directed graph with vertices 1 to 15]
Query: dG(1, 11) = 3
Path
The problem: Given two vertices u and v in
a (directed) graph G, what is a path (or what are the paths) connecting
u to v?
[Figure: example directed graph with vertices 1 to 15]
Query: find a path from 1 to 11
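Without any index, all three queries (reachability, distance, path) can be answered by a breadth-first search from u, at a cost of O(|V|+|E|) per query. The sketch below shows that baseline. The small graph is hypothetical, since the edges of the 15-vertex example are not recoverable from the slides, and the function names are chosen for this illustration.

```python
from collections import deque


def bfs_parents(graph, source):
    """Map every vertex reachable from source to its BFS parent."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in parents:
                parents[v] = u
                queue.append(v)
    return parents


def reachable(graph, u, v):
    """Is there a directed path from u to v?"""
    return v in bfs_parents(graph, u)


def shortest_path(graph, u, v):
    """Return one shortest path from u to v as a vertex list, or None."""
    parents = bfs_parents(graph, u)
    if v not in parents:
        return None
    path = []
    node = v
    while node is not None:  # walk parent links back to u
        path.append(node)
        node = parents[node]
    return path[::-1]


def distance(graph, u, v):
    """Number of edges on a shortest path from u to v, or None."""
    path = shortest_path(graph, u, v)
    return None if path is None else len(path) - 1


# Hypothetical DAG given as adjacency lists (not the graph in the figure).
graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 6: [5]}
print(reachable(graph, 1, 5), reachable(graph, 5, 1))  # True False
print(distance(graph, 1, 5))                           # 3
print(shortest_path(graph, 1, 5))                      # [1, 2, 4, 5]
```

Indexing schemes aim to answer the same queries much faster than a fresh traversal, at the price of preprocessing time and index size.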
The estimated difficulty of building a very efficient
indexing scheme (based on current research)

                         Reachability   Distance   Path
Directed Tree            easy           easy       easy
Directed Acyclic Graph   medium         hard       hard
Directed Graph           medium         hard       hard
Reference:
R. Jin, Y. Xiang, N. Ruan, H. Wang, "Efficiently Answering Reachability Queries on Very Large Directed Graphs",
Proc. of ACM SIGMOD Conference, Vancouver, June 9-12, 2008, pp. 595-608.
R. Jin, Y. Xiang, N. Ruan, D. Fuhry, "3-HOP: A High-Compression Indexing Scheme for Reachability Query",
Proc. of ACM SIGMOD Conference, Providence, Rhode Island, June 29-July 2, 2009, pp. 813-826.
Applications of Literature Mining and
Ontology - I
• Build confirmed gene-phenotype relations
– Human Phenotype Ontology (HPO)
– Built from Online Mendelian Inheritance in Man
(OMIM) database.
– http://human-phenotype-ontology.org/
Reference: Robinson PN, Mundlos S. The Human Phenotype Ontology.
Clinical Genetics 77(6) 2010: 525–534.
http://dx.doi.org/10.1111/j.1399-0004.2010.01436.x
Applications of Literature Mining and
Ontology - II
• Predicting unknown gene-phenotype relations
– Use text mining to build similarities among phenotypes
– Gene relationships are built from protein-protein interaction databases
– Known gene-phenotype relationships can be built by text mining; the
previous slide gives an example.
• Various methods (statistical, graph-theoretic, etc.) have been
proposed for the prediction.
– Reference: X. Wu, R. Jiang, M.Q. Zhang, and S. Li, Network-based
global inference of human disease genes. Molecular Systems Biology,
4(1), 2008
[Figure: PPI network ≈ G2G network; phenotype similarity graph]
Thanks!
Questions?