TextpressoCentral: A universal portal to search and curate biological literature

Yuling Li1, Hans-Michael Müller1, Paul Sternberg1,2
1 Biology, California Institute of Technology, Pasadena, CA, USA
2 Howard Hughes Medical Institute, Pasadena, CA, USA
Biocuration 2015 April 26th, 2015
Textpresso is an information extraction and
processing package for biological literature.
http://www.textpresso.org


Full text literature searches of model
organism research and subject-specific
articles at individual sites
Corpus:
C. elegans, nematode, mouse, D. melanogaster,
Arabidopsis, neuroscience, cancer


Full paper text search

Keyword search

Category word search (genes, cell, biological
processes, disease, etc.)



Successor to the current Textpresso system,
built from scratch with an emphasis on a
“one-stop” search, view and curation
experience for curators.
Bigger, faster, more functionality, easier to
use.

The site currently contains approximately
880,000 full-text articles from the PMC Open
Archive (NXML or PDF format).
26 sub-corpora (experimental division):
Agriculture, Clinical, Health, Nutrition, Protein,
Animal, Crystallography, Immunology, Oncology,
Psychology, Biology, Disease, Medicine, Pediatrics,
Review, Cardiology, Genetics, Methodology,
Pharmacology, Unclassified, Chemistry, Genomics,
Neuroscience, Physiology, Virology, and C. elegans


Full text paper search and text-mining:
◦ Document level and sentence level
◦ Search with pre-loaded category terms
◦ Text mining as a natural part of curation: curation results become
part of the training sets for text mining.

Full text paper viewer:
view papers in full text with highlighted keywords/terms

Curate papers directly in the paper viewer:
Curate directly on papers (from search results)

Save to curation DB (post to external DB):
Save curations to the Textpresso database or post to external DBs
DOCUMENT LEVEL vs SENTENCE LEVEL search
• You want to find out whether smf-1 is expressed in dopaminergic neurons. You can run a
document level search
OR a sentence level search.
DOCUMENT LEVEL
No direct association; you need to browse or read many articles.
The words smf-1 and dopaminergic neuron are not tightly associated.
SENTENCE LEVEL
The first hit already gives you the result you are looking for.
Sentence level search may return more accurate results.
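The difference between the two search scopes can be sketched in a few lines of Python. This is only an illustration, not how Textpresso is implemented: the function names, the naive sentence splitter and the two-paper toy corpus (invented text) are all assumptions made for the example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def contains_all(text, terms):
    """True if every term occurs (case-insensitively) somewhere in text."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

def document_level_search(corpus, terms):
    """Ids of documents in which every term occurs anywhere in the document."""
    return [doc_id for doc_id, text in corpus.items() if contains_all(text, terms)]

def sentence_level_search(corpus, terms):
    """(doc_id, sentence) pairs in which every term occurs in the same sentence."""
    hits = []
    for doc_id, text in corpus.items():
        for sentence in split_sentences(text):
            if contains_all(sentence, terms):
                hits.append((doc_id, sentence))
    return hits

# Invented toy corpus, for illustration only.
corpus = {
    "paper_A": "We studied smf-1 mutants. Separately, dopaminergic neurons "
               "were imaged in wild type.",
    "paper_B": "The reporter for smf-1 was detected in dopaminergic neurons "
               "of adults.",
}
terms = ["smf-1", "dopaminergic neuron"]

doc_hits = document_level_search(corpus, terms)    # both papers match
sent_hits = sentence_level_search(corpus, terms)   # only paper_B matches
```

In this toy case the document level search returns both papers, although paper_A never relates the two terms, while the sentence level search keeps only the paper that mentions them together.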
Search scope (document or sentence level)
Filters (author, journal, year, etc.)
Keyword to search
Categories to search
Refined lists of terms from new sources, such as:
◦ Sequence Ontology
◦ Chemical Entities of Biological Interest (ChEBI)
◦ Phenotypic Quality Ontology (PATO)
◦ etc.
Search using category only
Use category search to find papers. Examples:
organic substance biosynthetic process
(GO:1901576)
=> find papers about collagen biosynthesis

catalytic activity (GO:0003824) AND gene
(SO:0000704)
=> find papers about catalytic activities and related
genes

ciliary part (GO:0044441) AND deletion (SO:0000159)
=> find papers about deletion mutations and cilia
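A category-only query with AND can be pictured as a set intersection over the categories detected in each paper. The sketch below is a simplified illustration under stated assumptions: the term lists attached to each ontology ID are small hand-picked samples rather than the real ontology contents, matching is plain substring search, and the corpus sentences are invented.

```python
# Hypothetical mini-lexicon: ontology ID -> a few surface terms (illustrative only).
CATEGORY_TERMS = {
    "GO:0044441": {"ciliary", "cilium", "axoneme"},           # ciliary part
    "SO:0000159": {"deletion"},                               # deletion
    "GO:0003824": {"catalytic activity", "kinase activity"},  # catalytic activity
    "SO:0000704": {"gene"},                                   # gene
}

def categories_in(text, lexicon):
    """Set of category IDs for which at least one term occurs in the text."""
    lowered = text.lower()
    return {cat for cat, terms in lexicon.items()
            if any(term in lowered for term in terms)}

def category_search(corpus, lexicon, required):
    """Ids of documents that contain ALL of the required categories (AND query)."""
    required = set(required)
    return sorted(doc_id for doc_id, text in corpus.items()
                  if required <= categories_in(text, lexicon))

# Invented toy corpus, for illustration only.
corpus = {
    "doc_1": "A deletion allele disrupts the cilium in sensory neurons.",
    "doc_2": "The gene encodes an enzyme with kinase activity.",
    "doc_3": "Cilium length was measured in wild-type animals.",
}

cilia_deletions = category_search(corpus, CATEGORY_TERMS, ["GO:0044441", "SO:0000159"])
catalytic_genes = category_search(corpus, CATEGORY_TERMS, ["GO:0003824", "SO:0000704"])
```

Note that the query names only categories: a paper is returned if any term belonging to each requested category appears in it, so the searcher does not need to list individual keywords.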

Check and view papers in curation
Paper loaded from search results
Pick a category of terms to highlight
Type in a word to highlight
Sentence selected to curate
Select sentence by clicking on first word and last word
Save curation to external databases
Save curation to local database
Edit curation entries in Database
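The select, save and edit steps listed above could look roughly like the following sketch. Everything here is hypothetical: the `curation` table, its columns, the function names and the paper identifier are invented for illustration and do not describe the actual TextpressoCentral database. Python's standard `sqlite3` module with an in-memory database stands in for the local curation store.

```python
import sqlite3

def select_span(words, first_index, last_index):
    """Join the words between two clicked positions (inclusive), in either order."""
    start, end = sorted((first_index, last_index))
    return " ".join(words[start:end + 1])

def save_curation(conn, paper_id, sentence, curator):
    """Insert one curated sentence into a hypothetical local table; return its id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS curation ("
        "id INTEGER PRIMARY KEY, paper_id TEXT, sentence TEXT, curator TEXT)"
    )
    cursor = conn.execute(
        "INSERT INTO curation (paper_id, sentence, curator) VALUES (?, ?, ?)",
        (paper_id, sentence, curator),
    )
    conn.commit()
    return cursor.lastrowid

def edit_curation(conn, entry_id, new_sentence):
    """Overwrite the stored sentence of an existing curation entry."""
    conn.execute("UPDATE curation SET sentence = ? WHERE id = ?",
                 (new_sentence, entry_id))
    conn.commit()

# Words of a displayed paper (invented text); the curator clicks word 3 and word 8.
words = ("Background text . smf-1 is expressed in dopaminergic neurons . "
         "More text").split()
sentence = select_span(words, 3, 8)

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
entry_id = save_curation(conn, "hypothetical_paper_id", sentence, "curator_1")
edit_curation(conn, entry_id, sentence + ".")
stored = conn.execute("SELECT sentence FROM curation WHERE id = ?",
                      (entry_id,)).fetchone()[0]
```

Posting to an external database would replace `save_curation` with a call to that database's own interface, which is not specified here.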



Natural language processing: incorporate machine
learning pipelines (such as SVM, a popular classifier
for paper triage) into TextpressoCentral.
Users can upload and manage their own corpora
and their own categories of words, making them
searchable.
Workflow: automated curation pipelines can be
managed from a control panel within
TextpressoCentral.

Old Textpresso: www.textpresso.org

New TextpressoCentral under construction:
send email to textpresso@caltech.edu for
announcements
estimated release: Mid-fall 2015