Introduction to Pattern Recognition

Prediction in Bioinformatics
• What do we want to predict?
  – Features from sequence
  – Data mining
• How can we predict?
  – Homology / Alignment
  – Pattern Recognition / Statistical Methods / Machine Learning
• What is prediction?
  – Generalization / Overfitting
  – Preventing overfitting: Homology reduction
• How do we measure prediction?
  – Performance measures
  – Threshold selection

Henrik Nielsen
Center for Biological Sequence Analysis
Technical University of Denmark

Sequence → structure → function

Prediction from DNA sequence
• Protein-coding genes
  – transcription factor binding sites
  – transcription start/stop
  – translation start/stop
  – splicing: donor/acceptor sites
• Non-coding RNA
  – tRNAs
  – rRNAs
  – miRNAs
• General features
  – Structure (curvature/bending)
  – Binding (histones etc.)

Prediction from amino acid sequence
• Folding / structure
• Post-Translational Modifications
  – Attachment: phosphorylation, glycosylation, lipid attachment
  – Cleavage: signal peptides, propeptides, transit peptides
  – Sorting: secretion, import into various organelles, insertion into membranes
• Interactions
• Function
  – Enzyme activity
  – Transport
  – Receptors
  – Structural components
  – etc.

Protein sorting in eukaryotes
• Proteins belong in different organelles of the cell, and some even have their function outside the cell
• Günter Blobel was awarded the 1999 Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

Data: UniProt annotation of protein sorting
Annotations relevant for protein sorting are found in:
  – the CC (comments) lines
  – cross-references (DR lines) to GO (Gene Ontology)
  – the FT (feature table) lines

ID   INS_HUMAN      Reviewed;      110 AA.
AC   P01308;
...
DE   Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN   Name=INS;
...
CC   -!- SUBCELLULAR LOCATION: Secreted.
...
DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
...
FT   SIGNAL        1     24

3 types of non-experimental qualifiers in the CC and FT lines:
  – Potential: Predicted by sequence analysis methods
  – Probable: Inconclusive experimental evidence
  – By similarity: Predicted by alignment to proteins with known location

Problems in database parsing
Extreme example: A4_HUMAN, Alzheimer disease amyloid protein

CC   -!- SUBCELLULAR LOCATION: Membrane; Single-pass type I membrane
CC       protein. Note=Cell surface protein that rapidly becomes
CC       internalized via clathrin-coated pits. During maturation, the
CC       immature APP (N-glycosylated in the endoplasmic reticulum) moves
CC       to the Golgi complex where complete maturation occurs
CC       (O-glycosylated and sulfated). After alpha-secretase cleavage,
CC       soluble APP is released into the extracellular space and the
CC       C-terminal is internalized to endosomes and lysosomes. Some APP
CC       accumulates in secretory transport vesicles leaving the late
CC       Golgi compartment and returns to the cell surface. Gamma-CTF(59)
CC       peptide is located to both the cytoplasm and nuclei of neurons.
CC       It can be translocated to the nucleus through association with
CC       Fe65. BetaAPP42 associates with FRPL1 at the cell surface and the
CC       complex is then rapidly internalized. APP sorts to the
CC       basolateral surface in epithelial cells. During neuronal
CC       differentiation, the Thr-743 phosphorylated form is located
CC       mainly in growth cones, moderately in neurites and sparingly in
CC       the cell body. Casein kinase phosphorylation can occur either at
CC       the cell surface or within a post-Golgi compartment.
...
DR   GO; GO:0009986; C:cell surface; IDA:UniProtKB.
DR   GO; GO:0005576; C:extracellular region; TAS:ProtInc.
DR   GO; GO:0005887; C:integral to plasma membrane; TAS:ProtInc.

Prediction methods
• Homology / Alignment
• Simple pattern recognition
  – Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence.
    Pattern: [KRHQSA]-[DENQ]-E-L>
• Statistical methods
  – Weight matrices: calculate amino acid probabilities
  – Other examples: Regression, variance analysis, clustering
• Machine learning
  – Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
  – Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)

Prediction of subcellular localisation from sequence
• Homology: threshold 30%-70% identity
• Sorting signals ("zip codes")
  – N-terminal: secretory (ER) signal peptides, mitochondrial & chloroplast transit peptides
  – C-terminal: peroxisomal targeting signal 1, ER-retention signal
  – Internal: nuclear localisation signals, nuclear export signals
• Global properties
  – amino acid composition, aa pair composition
  – composition in limited regions
  – predicted structure
  – physico-chemical parameters
• Combined approaches

Signal-based prediction
• Signal peptides
  – von Heijne 1983, 1986 [WM]
  – SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN, HMM]
• Mitochondrial & chloroplast transit peptides
  – Mitoprot (Claros & Vincens 1996) [linear discriminant using physico-chemical parameters]
  – ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN]
  – iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical parameters]
  – Protein Prowler* (Hawkins & Bodén 2006) [NN]
  * = also includes signal peptides
• Nuclear localisation signals
  – PredictNLS (Cokol et al. 2000) [regex]
  – NucPred (Heddad et al. 2004) [regex, GA]

Composition-based prediction
• Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics]
• ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance]
• Chou and Elrod 1998 [12 categories; covariant discriminant]
• NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN]
• SubLoc (Hua and Sun 2001) [4 categories; SVM]
• PLOC (Park and Kanehisa 2003) [12 categories; SVM]
• LOCtree (Nair & Rost 2005) [6 categories; SVM incl.
  regions, structure and profiles]
• BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and profiles]
Pro:
• does not require knowledge of signals
• works even if N-terminus is wrong
Con:
• cannot identify isoform differences

A simple statistical method: Linear regression
Observations (training data): a set of x values (input) and y values (output).
Model: y = ax + b (2 parameters, which are estimated from the training data)
Prediction: Use the model to calculate a y value for a new x value
Note: the model does not fit the observations exactly. Can we do better than this?

Overfitting
• y = ax + b: 2-parameter model. Good description, poor fit
• y = ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g: 7-parameter model. Poor description, good fit
Note: It is not interesting that a model can fit its observations (training data) exactly. To function as a prediction method, a model must be able to generalize, i.e. produce sensible output on new data.

A classification problem

How complex a model should we choose?
This depends on:
• The real complexity of the problem
• The size of the training data set
• The amount of noise in the data set

How to estimate parameters for prediction?

Model selection
[Figure panels: Linear Regression, Quadratic Regression, Join-the-dots]

The test set method

Cross Validation

Which kind of Cross Validation?
Note: Leave-one-out is also known as jack-knife

Problem: sequences are related
ALAKAAAAM    GMNERPILT
ALAKAAAAN    GILGFVFTM
ALAKAAAAR    TLNAWVKVV
ALAKAAAAT    KLNEPVLLL
ALAKAAAAV    AVVPFIVSV
• If the sequences in the test set are closely related to those in the training set, we cannot measure true generalization performance

Solution: Homology reduction
• Calculate all pairwise similarities in the data set
• Define a threshold for being "neighbours" (too closely related)
• Calculate the number of neighbours for each example, and remove the example with the most neighbours
• Repeat until there are no examples with neighbours left

Alternative: Homology partitioning
• Keep all examples, but cluster them so that neighbours always end up in the same fold (no pair of neighbours is split between training and test data)
• Should be combined with weighting

The Hobohm algorithm

Defining a threshold for homology reduction
First approach: two sequences are too closely related if the prediction problem can be solved by alignment
The Sander/Schneider curve:
• For protein structure prediction, 70% identical classification of secondary structure means prediction by alignment is possible
• This corresponds to 25% identical amino acids in a local alignment of more than 80 positions

Defining a threshold for homology reduction
Second approach: two sequences are too closely related if their homology is statistically significant
The Pedersen / Nielsen / Wernersson curve:
• Use the extreme value distribution to define the BLAST score at which the similarity is stronger than expected at random
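
The homology reduction steps listed in this section (count neighbours, remove the example with the most neighbours, repeat) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function names `identity` and `homology_reduce`, the similarity measure (fraction of identical positions in ungapped, equal-length peptides) and the 0.5 threshold are all chosen for the example only. A real data set would use alignment-based scores and a threshold defined by one of the two approaches described above.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions in two equal-length, ungapped peptides.

    Assumption for this sketch only; real data would need an alignment score.
    """
    if len(a) != len(b):
        raise ValueError("this toy similarity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def homology_reduce(seqs, threshold=0.5, similarity=identity):
    """Repeatedly remove the example with the most neighbours.

    Two examples are neighbours if similarity >= threshold (hypothetical value).
    Returns the surviving sequences in their original order.
    """
    # Step 1: all pairwise similarities, stored as a neighbour graph.
    neighbours = {i: set() for i in range(len(seqs))}
    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    # Steps 2-3: remove the most connected example until no neighbours remain.
    while neighbours:
        worst = max(neighbours, key=lambda i: len(neighbours[i]))
        if not neighbours[worst]:
            break  # nobody has neighbours left
        for j in neighbours.pop(worst):
            neighbours[j].discard(worst)

    return [seqs[i] for i in sorted(neighbours)]


# The ten peptides from the "sequences are related" slide.
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]
reduced = homology_reduce(peptides)
print(len(reduced), reduced)
```

With these settings the five ALAKAAAA* variants form one cluster of mutual neighbours, so four of them are removed and six peptides remain. Which member of the cluster survives depends only on how ties are broken, which this sketch does by list order.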