Introduction to Pattern Recognition
Prediction in Bioinformatics
• What do we want to predict?
– Features from sequence
– Data mining
• How can we predict?
– Homology / Alignment
– Pattern Recognition / Statistical Methods / Machine Learning
• What is prediction?
– Generalization / Overfitting
– Preventing overfitting: Homology reduction
• How do we measure prediction?
– Performance measures
– Threshold selection
Henrik Nielsen
Center for Biological Sequence Analysis
Technical University of Denmark
Sequence → structure → function
Prediction from DNA sequence
• Protein-coding genes
– transcription factor binding sites
– transcription start/stop
– translation start/stop
– splicing: donor/acceptor sites
• Non-coding RNA
– tRNAs
– rRNAs
– miRNAs
• General features
– Structure (curvature/bending)
– Binding (histones etc.)
Prediction from amino acid sequence
• Folding / structure
• Post-Translational Modifications
– Attachment: phosphorylation, glycosylation, lipid attachment
– Cleavage: signal peptides, propeptides, transit peptides
– Sorting: secretion, import into various organelles, insertion into
membranes
• Interactions
• Function
– Enzyme activity
– Transport
– Receptors
– Structural components
– etc...
Protein sorting in eukaryotes
• Proteins belong in different organelles of the cell – and some even
have their function outside the cell
• Günter Blobel was awarded the 1999 Nobel Prize in Physiology
or Medicine for the discovery that "proteins have intrinsic signals that
govern their transport and localization in the cell"
Data: UniProt annotation of protein sorting
Annotations relevant for protein sorting are found in:
– the CC (comments) lines
– cross-references (DR lines) to GO (Gene Ontology)
– the FT (feature table) lines
ID   INS_HUMAN   Reviewed;   110 AA.
AC   P01308;
...
DE   Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN   Name=INS;
...
CC   -!- SUBCELLULAR LOCATION: Secreted.
...
DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
...
FT   SIGNAL   1   24
3 types of non-experimental qualifiers in the CC and FT lines:
– Potential: Predicted by sequence analysis methods
– Probable: Inconclusive experimental evidence
– By similarity: Predicted by alignment to proteins with known
location
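A minimal sketch of how these annotations might be pulled out of a flat-file entry. The function name `subcellular_locations` is hypothetical, and the sketch assumes that a qualifier appears in parentheses directly after the location it applies to; real entries can deviate from this.

```python
import re

# The three non-experimental qualifiers. Their placement in parentheses
# after the location is an assumption about the flat-file layout.
QUALIFIER = re.compile(r"\s*\((Potential|Probable|By similarity)\)$")


def subcellular_locations(entry_text):
    """Return (location, qualifier) pairs from the CC lines of one entry."""
    # Collect the CC lines and strip the two-letter line code.
    cc_text = " ".join(
        line[2:].strip()
        for line in entry_text.splitlines()
        if line.startswith("CC")
    )
    match = re.search(r"-!- SUBCELLULAR LOCATION:\s*(.*?)(?=-!-|$)", cc_text)
    if match is None:
        return []
    # Ignore the free-text note; it cannot be parsed reliably.
    locations = match.group(1).split("Note=")[0]
    pairs = []
    for item in re.split(r"[;.]", locations):
        item = item.strip()
        if not item:
            continue
        qualifier = QUALIFIER.search(item)
        if qualifier:
            pairs.append((item[:qualifier.start()], qualifier.group(1)))
        else:
            pairs.append((item, None))
    return pairs
```

A qualifier of `None` here only means that no non-experimental qualifier was found in the text; it is not proof of experimental evidence.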
Problems in database parsing
Extreme example:
A4_HUMAN, Alzheimer disease amyloid protein
CC   -!- SUBCELLULAR LOCATION: Membrane; Single-pass type I membrane
CC       protein. Note=Cell surface protein that rapidly becomes
CC       internalized via clathrin-coated pits. During maturation, the
CC       immature APP (N-glycosylated in the endoplasmic reticulum) moves
CC       to the Golgi complex where complete maturation occurs (O-
CC       glycosylated and sulfated). After alpha-secretase cleavage,
CC       soluble APP is released into the extracellular space and the C-
CC       terminal is internalized to endosomes and lysosomes. Some APP
CC       accumulates in secretory transport vesicles leaving the late Golgi
CC       compartment and returns to the cell surface. Gamma-CTF(59) peptide
CC       is located to both the cytoplasm and nuclei of neurons. It can be
CC       translocated to the nucleus through association with Fe65. Beta-
CC       APP42 associates with FRPL1 at the cell surface and the complex is
CC       then rapidly internalized. APP sorts to the basolateral surface in
CC       epithelial cells. During neuronal differentiation, the Thr-743
CC       phosphorylated form is located mainly in growth cones, moderately
CC       in neurites and sparingly in the cell body. Casein kinase
CC       phosphorylation can occur either at the cell surface or within a
CC       post-Golgi compartment.
...
DR   GO; GO:0009986; C:cell surface; IDA:UniProtKB.
DR   GO; GO:0005576; C:extracellular region; TAS:ProtInc.
DR   GO; GO:0005887; C:integral to plasma membrane; TAS:ProtInc.
Prediction methods
• Homology / Alignment
• Simple pattern recognition
– Example:
PROSITE entry PS00014, ER_TARGET:
Endoplasmic reticulum targeting sequence.
Pattern: [KRHQSA]-[DENQ]-E-L>
• Statistical methods
– Weight matrices: calculate amino acid probabilities
– Other examples: Regression, variance analysis, clustering
• Machine learning
– Like statistical methods, but parameters are estimated by
iterative training rather than direct calculation
– Examples: Neural Networks (NN), Hidden Markov Models
(HMM), Support Vector Machines (SVM)
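The PROSITE pattern above can be turned into an ordinary regular expression. The sketch below handles only a small subset of the PROSITE syntax (residue sets, `x`, `{}` exclusions, repeat counts and the `<` / `>` anchors); the function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                        # element separator
    regex = regex.replace("x", ".")                       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")     # sequence start / end
    return regex


# PS00014, ER_TARGET: the trailing '>' anchors the match to the C-terminus.
ER_TARGET = prosite_to_regex("[KRHQSA]-[DENQ]-E-L>")


def has_er_target(sequence):
    """True if the sequence ends with the ER targeting motif."""
    return re.search(ER_TARGET, sequence) is not None
```

A pattern match of this kind is all-or-nothing: a sequence ending in KDEL matches, while the same four residues anywhere else in the sequence do not.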
Prediction of subcellular localisation from sequence
• Homology: threshold 30%-70% identity
• Sorting signals ("zip codes")
– N-terminal: secretory (ER) signal peptides, mitochondrial
& chloroplast transit peptides.
– C-terminal: peroxisomal targeting signal 1, ER-retention
signal.
– internal: Nuclear localisation signals, nuclear export
signals.
• Global properties
– amino acid composition, aa pair composition
– composition in limited regions
– predicted structure
– physico-chemical parameters
• Combined approaches
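As an illustration of the global properties listed above, amino acid composition and pair composition can be computed as simple frequency vectors. This is a sketch only; the function names are hypothetical, and residues outside the 20 standard amino acids are ignored in the single-residue composition.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Fraction of each of the 20 standard amino acids, in AMINO_ACIDS order."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_composition(sequence):
    """Fraction of each adjacent residue pair that occurs in the sequence."""
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}
```

The resulting fixed-length vector can be fed to any of the statistical or machine learning methods mentioned earlier, regardless of sequence length.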
Signal-based prediction
• Signal peptides
– von Heijne 1983, 1986 [WM]
– SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN,
HMM]
• Mitochondrial & chloroplast transit peptides
– Mitoprot (Claros & Vincens 1996) [linear discriminant using
physico-chemical parameters]
– ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN]
– iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical
parameters]
– Protein Prowler* (Hawkins & Bodén 2006) [NN]
* = also includes signal peptides
• Nuclear localisation signals
– PredictNLS (Cokol et al. 2000) [regex]
– NucPred (Heddad et al. 2004) [regex, GA]
Composition-based prediction
• Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics]
• ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance]
• Chou and Elrod 1998 [12 categories; covariant discriminant]
• NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN]
• SubLoc (Hua and Sun 2001) [4 categories; SVM]
• PLOC (Park and Kanehisa 2003) [12 categories; SVM]
• LOCtree (Nair & Rost 2005) [6 categories; SVM incl. regions,
structure and profiles]
• BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and
profiles]
Pro:
• does not require knowledge of signals
• works even if N-terminus is wrong
Con:
• cannot identify isoform differences
A simple statistical method: Linear regression
Observations (training data): a set of x values (input) and y values
(output).
Model: y = ax + b (2 parameters, which are estimated from the
training data)
Prediction: Use the model to calculate a y value for a new x value
Note: the model does not fit the observations exactly. Can we do
better than this?
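The two parameters of y = ax + b can be estimated directly by ordinary least squares. The sketch below assumes that least squares is the estimation criterion (the slide does not say which is used); the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(a, b, x):
    """Use the fitted model to calculate y for a new x."""
    return a * x + b
```

Usage: `a, b = fit_line(train_x, train_y)` estimates the parameters from the training data, and `predict(a, b, new_x)` applies the model to a new input.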
Overfitting
2-parameter model: y = ax + b
Good description, poor fit
7-parameter model: y = ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g
Poor description, good fit
Note: It is not interesting that a model can fit its observations (training
data) exactly.
To function as a prediction method, a model must be able to generalize,
i.e. produce sensible output on new data.
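The difference between fitting and generalizing can be seen on a toy example. In the sketch below, all numbers are made up for illustration: seven noisy training points lie around the line y = 2x + 1, and the 7-parameter model is obtained by Lagrange interpolation, which is one way of constructing the degree-6 polynomial that passes through all seven points exactly.

```python
def fit_line(xs, ys):
    """2-parameter model: least-squares line, returned as a function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_exact_polynomial(xs, ys):
    """Polynomial of degree len(xs) - 1 through every training point."""
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return model


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: y = 2x + 1 plus noise for training, noise-free for testing.
train_x = [0, 1, 2, 3, 4, 5, 6]
train_y = [1.5, 2.6, 5.3, 6.4, 9.4, 10.7, 13.5]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
test_y = [2 * x + 1 for x in test_x]

line = fit_line(train_x, train_y)
wiggly = fit_exact_polynomial(train_x, train_y)

for name, model in [("2 parameters", line), ("7 parameters", wiggly)]:
    print(name,
          "training error:", mean_squared_error(model, train_x, train_y),
          "test error:", mean_squared_error(model, test_x, test_y))
```

The 7-parameter model has zero error on its own training data, but a larger error than the line on the new points.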
A classification problem
How complex a model should we choose? This depends on:
• The real complexity of the problem
• The size of the training data set
• The amount of noise in the data set
How to estimate parameters for prediction?
Model selection
Linear Regression
Quadratic Regression
Join-the-dots
The test set method
Cross Validation
Which kind of Cross Validation?
Note: Leave-one-out is also known as jack-knife
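A minimal sketch of k-fold cross validation: the data set is split into k folds, and each fold is used once as the test set while the model is trained on the rest. Leave-one-out is the special case where k equals the number of examples. The function names, the interleaved fold assignment and the trivial mean-value model are all illustrative assumptions.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(start, n_examples, k)) for start in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_examples) if i not in held_out]
        yield train_idx, test_idx


def cross_validate(xs, ys, k, fit, error):
    """Average test error over k folds; fit and error are caller-supplied."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(
            error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(fold_errors) / len(fold_errors)


def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training outputs."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Usage: `cross_validate(xs, ys, 5, fit_mean, mean_squared_error)` gives a 5-fold estimate; passing `len(xs)` as `k` gives leave-one-out.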
Problem: sequences are related
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• If the sequences in the test set are closely related to those in the
training set, we cannot measure true generalization performance
Solution: Homology reduction
• Calculate all pairwise similarities in
the data set
• Define a threshold for being
"neighbours" (too closely related)
• Count the neighbours of
each example, and remove the
example with the most neighbours
• Repeat until there are no examples
with neighbours left
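The greedy procedure listed above can be sketched as follows. The similarity measure here (fraction of identical positions between ungapped, equal-length peptides) and the threshold value used in the example are assumptions made for illustration; a real application would use alignment scores.

```python
def identity(a, b):
    """Fraction of identical positions (ungapped; an illustrative measure)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def homology_reduce(sequences, threshold):
    """Remove examples until no two remaining examples are neighbours."""
    # Calculate all pairwise similarities and record the neighbours.
    neighbours = {i: set() for i in range(len(sequences))}
    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            if identity(sequences[i], sequences[j]) > threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Repeatedly remove the example with the most neighbours.
    while neighbours:
        worst = max(neighbours, key=lambda i: len(neighbours[i]))
        if not neighbours[worst]:
            break  # no examples with neighbours left
        for j in neighbours.pop(worst):
            neighbours[j].discard(worst)
    return [sequences[i] for i in sorted(neighbours)]


peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]
reduced = homology_reduce(peptides, threshold=0.8)
```

With this threshold, the five ALAKAAAA peptides are all neighbours of each other, so only one of them survives together with the five unrelated peptides.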
Alternative: Homology partitioning
• Keep all examples, but cluster them
so that all neighbours end up in the
same fold
• Should be combined with weighting
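A sketch of homology partitioning in which every cluster of neighbours is kept within a single fold, so that no test example has a neighbour in the training set. Clusters are formed by single linkage (connected components of the neighbour graph) and assigned to folds greedily; both choices, like the ungapped identity measure, are assumptions made for illustration. Weighting is not shown.

```python
def identity(a, b):
    """Fraction of identical positions (ungapped; an illustrative measure)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def homology_partition(sequences, threshold, n_folds):
    """Return n_folds lists of indices; neighbours always share a fold."""
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair of neighbours.
    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            if identity(sequences[i], sequences[j]) > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(sequences)):
        clusters.setdefault(find(i), []).append(i)

    # Largest clusters first, each into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds
```

Because whole clusters are moved together, the folds can end up with unequal sizes when one cluster is much larger than the others.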
The Hobohm algorithm
Defining a threshold for homology reduction
First approach: two sequences are too closely related if the
prediction problem can be solved by alignment
The Sander/Schneider curve:
For protein structure prediction, 70% identical classification of
secondary structure means prediction by alignment is possible.
This corresponds to 25% identical amino acids in a local alignment
longer than 80 positions.
Defining a threshold for homology reduction
Second approach: two sequences are too closely related if their
homology is statistically significant
The Pedersen / Nielsen / Wernersson curve:
Use the extreme value distribution to define
the BLAST score at which the similarity is
stronger than random