Origami with strings: protein folding by computer

Origami with strings:
protein folding by computer
Kevin Karplus
karplus@soe.ucsc.edu
Biomolecular Engineering Department
Undergraduate and Graduate Director, Bioinformatics
University of California, Santa Cruz
origami with strings – p.1/49
Outline of Talk
What is Biomolecular Engineering? Bioinformatics?
What is a protein?
The folding problem and variants on it:
Local structure prediction
Fold recognition
Comparative modeling
“Ab initio” methods
Contact prediction
origami with strings – p.2/49
What is Biomolecular Engineering?
Engineering with, of, or for biomolecules. For example,
with: using proteins as sensors or for self-assembly.
of: protein engineering—designing or artificially evolving
proteins to have particular functions
for: designing high-throughput experimental methods to
find out what molecules are present, how they are
structured, and how they interact.
origami with strings – p.3/49
What is Bioinformatics?
Bioinformatics: using computers and statistics to make
sense out of the mountains of data produced by
high-throughput experiments.
Genomics: finding important sequences in the genome
and annotating them.
Phylogenetics: “tree of life”.
Systems biology: piecing together various control
networks.
DNA microarrays: what genes are turned on under
what conditions.
Proteomics: what proteins are present in a mixture.
Protein structure prediction.
origami with strings – p.4/49
What is a protein?
There are many abstractions of a protein: a band on a
gel, a string of letters, a mass spectrum, a set of 3D
coordinates of atoms, a point in an interaction
graph, . . . .
For us, a protein is a long skinny molecule (like a string
of letter beads) that folds up consistently into a
particular intricate shape.
The individual “beads” are amino acids, which have 6
atoms the same in each “bead” (the backbone atoms: N,
H, CA, HA, C, O).
The final shape is different for different proteins and is
essential to the function.
The protein shapes are important, but are expensive to
determine experimentally.
origami with strings – p.5/49
Folding Problem
The Folding Problem:
If we are given a sequence of amino acids (the letters on a
string of beads), can we predict how it folds up in 3-space?
MTMSRRNTDA ITIHSILDWI EDNLESPLSL EKVSERSGYS KWHLQRMFKK
ETGHSLGQYI RSRKMTEIAQ KLKESNEPIL YLAERYGFES QQTLTRTFKN
YFDVPPHKYR MTNMQGESRF LHPLNHYNS
↓
Too hard!
origami with strings – p.6/49
Fold-recognition problem
The Fold-recognition Problem:
Given a sequence of amino acids A (the target sequence)
and a library of proteins with known 3-D structures (the
template library),
figure out which templates A match best, and align the
target to the templates.
The backbone for the target sequence is predicted to be
very similar to the backbone of the chosen template.
Progress has been made on this problem, but we can
usefully simplify further.
origami with strings – p.7/49
Remote-homology Problem
The Homology Problem:
Given a target sequence of amino acids
and a library of protein sequences,
figure out which sequences A is similar to and align them to
A.
No structure information is used, just sequence
information. This makes the problem easier, but the
results aren’t as good.
This problem is fairly easy for recently diverged, very
similar sequences, but difficult for more remote
relationships.
origami with strings – p.8/49
New-fold prediction
What if there is no template we can use?
We can try to generate many conformations of the
protein backbone and try to recognize the most
protein-like of them.
Search space is huge, so we need a good conformation
generator and a cheap cost function to evaluate
conformations.
origami with strings – p.9/49
Hidden Markov Models
Hidden Markov Models (HMMs) are a very successful way to
capture the variability possible in a family of proteins.
An HMM is a stochastic model—that is, it assigns a
probability to every possible sequence.
An HMM is a finite-state machine with a probability for
emitting each letter in each state, and with probabilities
for making each transition between states.
Probabilities of letters sum to one for each state.
Probabilities of transitions out of each state sum to one
for that state.
We also include null states that emit no letters, but have
transition probabilities on their out-edges.
origami with strings – p.10/49
Profile Hidden Markov Model
Start
End
a-
a1
b4
a2
A3
B1
a1 a2 A3
. . B1
B2
B2
A4
A5
B3
B5
A4 . A5
B3 b4 B5
Circles are null states.
Squares are match states, each of which is paired with a
null delete state. We call the match-delete pair a fat state.
Each fat state is visited exactly once on every path from
Start to End.
Diamonds are insert states, and are used to represent
possible extra amino acids that are not found in most of
the sequences in the family being modeled.
origami with strings – p.11/49
What is single-track HMM looking for?
nostruct-align/3chy.t2k w0.5
3
2
1
VI VVIDD
L
E
X
XX
L
I
A
LL
V
L
I
E
ND
S
GE
N
S
L
V
F
M
XX
X
G
V
D
N
A
D
S
NG
EAL
D
QL
XXX X
X XX XXXXXX
X
ADK EL KFL V VDDFSTMRRIVR NLLKELGF NNVEE AE DG V DA L N KL Q A G G Y
XX
T
X
R
X
X
V
XX
L
P
I
V
XXX
L
A
X
1
T
L
I
S
E
E
V
H
L
P
E
X
L
K
L
V
E
Y
X
X
X
E
X
L
E
X
X
X
A
XX
E
E
E
L
X
K
E
50
A
10
L
40
XA
X
R
X
30
S
L
20
X
G
X
X
H
3
2
L
D
P G
TS
M
I
VI
D
DL
VL
L
N
VI
L
E
LV
L
IV
E
I
XXXXXXX XXXXXXXXXX
XXX
XXXXXXXX XX
XX
XXXX
GFV IS DWN M PNMDGLELLKTI RADGAMSA LPVLM VT AE A KK E N II A A A Q A
A
IAFLSL
I
GV
M
S
G
D
I
VEG
A
SS
NN
DLL
L
F
VVAA
L
PLA
G
MII
E
VR
V
L
M
R
L
T
X
A
P
X
X
V
V
D
H
T
X
E
E
L
X
80
X
70
60
51
R
E
E
AL
D
A
D
XX
X
X
E
T
A
R
X
100
S
G
N
EA
90
1
H
3
2
L
G
D
S
E
N
A
XX
K
I
D
YV
GF
KP
L
RD
S
G
Q
EE
A
XXXXXXXX X
M
FS
E
L
I
V
F
M
L
X
R
I
LR
V
L
XX
R
XXXXX
E
D
X
A
K
X
LR
XXX
L
A
X
G
X
E
T
120
110
GAS GY VVK P FTAATLEEKLNK IFEKLGM
101
1
H
origami with strings – p.12/49
Secondary structure Prediction
Instead of predicting the entire structure, we can predict
local properties of the structure.
What local properties do we choose?
We want properties that are well-conserved through
evolution, easily predicted, and useful for finding and
aligning templates.
One popular choice is a 3-valued helix/strand/other
alphabet—we have investigated many others. Typically,
predictors get about 80% accuracy on 3-state
prediction.
Many machine-learning methods have been applied to
this problem, but the most successful is neural
networks.
origami with strings – p.13/49
CASP Competition Experiment
Everything published in literature “works”
CASP set up as true blind test of prediction methods.
Sequences of proteins about to be solved released to
prediction community.
Predictions registered with organizers.
Experimental structures compared with solution by
assessors.
“Winners” get papers in Proteins: Structure, Function, and
Bioinformatics.
origami with strings – p.14/49
Predicting Local Structure
Want to predict some local property at each residue.
Local property can be emergent property of chain (such
as being buried or being in a beta sheet).
Property should be conserved through evolution (at
least as well as amino acid identity).
Property should be somewhat predictable (we gain
information by predicting it).
Predicted property should aid in fold-recognition and
alignment.
For ease of prediction and comparison, we look only at
discrete properties (alphabets of properties).
origami with strings – p.15/49
Using Neural Net
We use neural nets to predict local properties.
Input is profile with probabilities of amino acids at each
position of target chain, plus insertion and deletion
probabilities.
Output is probability vector for local structure alphabet
at each position.
Each layer takes as input windows of the chain in the
previous layer and provides a probability vector in each
position for its output.
We
P train neural net to maximize
log(P (correct output)).
origami with strings – p.16/49
Neural Net
Typical net has 4 layers and 6471 weight parameters:
o
o
o
input/pos
window
output/pos
weights
22
5
15
1665
15
7
15
1590
15
9
15
2040
15
13
6
1176
Inputs
Input layer
o
o
o
22 values/position
5
Hidden Layer 1
Hidden Layer 1 15 units/position
o
o
o
P(E)
P(B)
o
o
o
o
o
Hidden Layer 2
Hidden Layer 3
Output Layer
o
P(L)
7
Hidden Layer 2
15 units/position
Hidden Layer 3
25 units/position
Output Layer
3−12 units/position
9
13
origami with strings – p.17/49
DSSP
DSSP is a popular program to define secondary
structure.
7-letter alphabet: EBGHSTL
E = β strand
B = β bridge
G = 310 helix
H = α helix
I = π helix (very rare, so we lump in with H)
S = bend
T = turn
L = everything else (DSSP uses space for L)
origami with strings – p.18/49
STR: Extension to DSSP
Yael Mandel-Gutfreund noticed that parallel and
anti-parallel strands had different hydrophobicity
patterns, implying that parallel/antiparallel can be
predicted from sequence.
We created a new alphabet, splitting DSSP’s E into 6
letters:
P
Q
A
Z
M
E
origami with strings – p.19/49
HMMSTR φ-ψ alphabet
For HMMSTER, Bystroff did k-means classification of
φ-ψ angle pairs into 10 classes (plus one class for cis
peptides).
We used just the 10 classes, ignoring the ω angle.
origami with strings – p.20/49
ALPHA11: α angle
Backbone geometry can be mostly summarized with
one angle per residue:
CA(i)
CA(i+2)
CA(i+1)
CA(i−1)
We discretize into 11 classes:
0.014
G H
I
S
T
A
B
C
D
E
F
0.012
0.01
0.008
0.006
0.004
0.002
0
8 31 58 85
140 165 190 224 257 292
343
origami with strings – p.21/49
de Brevern’s Protein Blocks
Clustered on 5-residue window of φ-ψ angles:
origami with strings – p.22/49
Burial alphabets
Our second set of investigations was for a sampling of the
many burial alphabets, which are discretizations of various
accessibility or burial measures:
solvent accessible surface area
relative solvent accessible surface area
neighborhood-count burial measures
origami with strings – p.23/49
Solvent Accessibility
Absolute SA: area in square Ångstroms accessible to a
water molecule, computed by DSSP.
Relative SA: Absolute SA/ max SA for residue type
(using Rost’s table for max SA).
Frequency of occurrence
0.1 AB C D E
F
G
0.01
0.001
0.0001
1e-05
17 24 46 71
106
solvent accessibility
origami with strings – p.24/49
Burial
Define a sphere for each residue.
Count the number of atoms or of residues within that
sphere.
Example: center= Cβ , radius=14Å, count= Cβ , quantize
in 7 equi-probable bins.
Frequency of occurrence
0.1
A
B C D E
F
G
0.01
0.001
0.0001
1e-05
27 34 40 47 55
burial
66
origami with strings – p.25/49
Mutual Information
Mutual information between two random variables
(letters of alphabet):
M I(X, Y ) =
X
i,j
P (i, j)
,
P (i, j) log
P (i)P (j)
We look at mutual information between different
alphabets at same position in protein. (redundancy)
We look at mutual information with one alphabet
between corresponding positions on alignments of
sequences.
origami with strings – p.26/49
Information Gain
Information gain is how much more we know about a
variable after making a prediction.
P̂i (Xi )
I(X) = average log
P0 (Xi )
P̂i is predicted probability vector for position i
Xi is actual observation at position i
P0 is background probability vector
origami with strings – p.27/49
Conservation and Predictability
conservation
alphabet
MI
predictability
info gain
Name
size
entropy
with AA
mutual info
per residue
Q|A|
str
13
2.842
0.103
1.107
1.009
0.561
protein blocks
16
3.233
0.162
0.980
1.259
0.579
stride
6
2.182
0.088
0.904
0.863
0.663
DSSP
7
2.397
0.092
0.893
0.913
0.633
stride-EHL
3
1.546
0.075
0.861
0.736
0.769
DSSP-EHL
3
1.545
0.079
0.831
0.717
0.763
CB-16
7
2.783
0.089
0.682
0.502
CB-14
7
2.786
0.106
0.667
0.525
CB-12
7
2.769
0.124
0.640
0.519
rel SA
7
2.806
0.183
0.402
0.461
abs SA
7
2.804
0.250
0.382
0.447
origami with strings – p.28/49
Multi-track H MMs
We can also use alignments to build a two- or three-track
target HMM:
Amino-acid track (created from the multiple alignment).
Local-structure track(s) with probabilities from neural
net.
Can align template (AA+local) to target model.
start
stop
AA
AA
AA
2ry
2ry
2ry
AA
AA
2ry
2ry
origami with strings – p.29/49
What is second track looking for?
nostruct-align/3chy.t2k EBGHTL
3
2
1
C
T
EEEE HHHHHHHHHHHHH C EEEE HHHHHHHHH
E
TTC
E
CTC
CC
T
XXXE
C
TX
X
X
X
E
T
CTC
TT
XC
X
X
X
X
X
X
X
X
X
X
X
CE
H
CT
C
TC
G
HT
CTEX
X
X
X
X
X
T
C
ET
CT
T
X
X
X
H
C
C
TT
C
T
TC
T
ETCXCC
XCEXC
X
X
X
X
X
X
T
H
XG
X
X
X
X
X
X
X
X
X
X
X
X
T
E
H
E
50
40
30
20
1
10
ADK EL KFLV V D DF STMRR I VRN LLK EL G FNN V EE A ED GV D AL NK LQ A G G Y
H
3
2
1
EEEEE
E
C
T
HHHHHHHHH
TTT H
T
C
CCCC C T
CET
T
C
H
X
X
X
X
X
TTT
T
EEEEE HHHHHHHHHH
E
TC
H
GC
E
TCTHC
GXCTC
CXX
CTXX
XX
EG
EX
XTTX
X
XXXCX
X
X
X
X
X
X
X
X
X
C
X
T C
C
CTCT
CTX
H
XTEXXX
X
X
X
X
X
X
XX
C
X
X
X
X
X
X
X
E
T
H
T
E
T
100
90
80
70
60
51
GFV IS DWNM P N MD GLELL K TIR ADG AM S ALP V LM V TA EA K KE NI IA A A Q A
H
3
2
CC
HHHHHHHHHHHHH C
CH
C
H
C
EEEC
TTCT
CCH
GAS GY VVKP F T AA TLEEK L NKI FEK LG M
CCC
TXX
GG
T
XTEEE
X
X
T
X
X
X
X
X
110
E
X
E
T
XTTTX
X
X
X
X
X
X
X
X
X
X
X
X
120
T
TC
TXX
X
101
1
H
origami with strings – p.30/49
Target-model Fold Recognition
Find probable homologs of target sequence and make
multiple alignment.
Make secondary structure probability predictions based
on multiple alignment.
Build an HMM based on the multiple alignment and
predicted 2ry structure (or just on multiple alignment).
Score sequences and secondary structure sequences
for proteins that have known structure (all sequences for
AA-only, 8,000-11,000 representatives for multi-track).
Select the best-scoring sequence(s) to use as
templates.
origami with strings – p.31/49
Template-library Fold Recognition
Build an HMM for each protein in the template library,
based on the template sequence (and any homologs
you can find).
The HMM library has over 12,000 templates from PDB.
For the fold-recognition problem, structure information
can be used in building these models (though we
currently don’t).
Score target sequence with all models in the library.
Select the best-scoring model(s) to use as templates.
origami with strings – p.32/49
Combined SAM-Txx method
target sequence
template sequences
target alignment
template alignments
target HMM
template HMMs
local structure
prediction
target model scores
template model scores
combined scores
Combine the costs from the template library search and
the target library searches using different local structure
alphabets.
Choose one of the many alignments of the target and
template (whatever method gets best results in testing).
http://www.soe.ucsc.edu/research/compbio/SAM_T06/T06-query.html
origami with strings – p.33/49
Fold recognition results
Fold Recognition for 1415 SAM-T05 HMMs with w(amino-acid)=1
0.4
True positives / possible fold matches
0.35
0.3
0.25
0.2
0.15
average w(near-backbone-11)=0.6 w(n_sep)=0.1
average w(near-backbone-11)=0.6 w(str2)=0.25
w(near-backbone-11)=0.4
w(n_sep)=0.1
w(str2)=0.1
aa-only
T2K w(str2)=0.2
T2K aa-only
0.1
0.05
0
0.001
0.01
0.1
1
10
False positives per query
origami with strings – p.34/49
Comparative modeling: T0232
RMSD= 5.158Å all-atom, 4.463Å Cα
origami with strings – p.35/49
T0298 domain 2 (130–315)
RMSD= 2.468Å all-atom, 1.7567Å Cα , GDT=82.5%
best model 1 submitted to CASP7 (red=real)
origami with strings – p.36/49
Comparative modeling: T0348
RMSD= 11.8 Å Cα , GDT=58.2% (cartoon=real)
best model 1 by CASP7 GDT, Robetta1 slightly better.
origami with strings – p.37/49
Undertaker
Undertaker is UCSC’s attempt at a fragment-packing
program.
Named because it optimizes burial.
Representation is 3D coordinates of all heavy atoms
(not hydrogens).
Can replace backbone fragments (a la Rosetta) or full
alignments—chain need not remain contiguous.
Conformations can borrow heavily from fold-recognition
alignments, without having to lock in a particular
alignment.
Use genetic algorithm with many conformation-change
operators to do stochastic search.
origami with strings – p.38/49
Fragfinder
Fragments are provided to undertaker from 3 sources:
Generic fragments (2-4 residues, exact sequence
match) are obtained by reading in 500–1000 PDB files,
and indexing all fragments.
Long specific fragments (and full alignments) are
obtained from the various target and template
alignments generated during fold recognition.
Medium-length fragments (9–12 residues long) for
every position are generated from the HMMs with
fragfinder, a new tool in the SAM suite.
origami with strings – p.39/49
Cost function
Cost function is modularly designed—easy to add or
remove terms.
Cost function can include predictions of local properties
by neural nets.
Clashes and hydrogen bonds are important
components.
There are over 40 cost function components available:
burial functions, disulfides, contact order, rotamer
preference, radius of gyration, constraints, ...
origami with strings – p.40/49
Target T0201 (NF)
We tried forcing various sheet topologies and selected
4 by hand.
Model 1 has right topology (5.912Å all-atom, 5.219Å
Cα ).
Unconstrained cost function not good at choosing
topology (two strands curled into helices).
Helices were too short.
origami with strings – p.41/49
Target T0201 (NF)
origami with strings – p.42/49
Contact prediction
Use mutual information between columns.
Thin alignments aggressively (30%, 35%, 40%, 50%,
62%).
Compute e-value for mutual info (correcting for
small-sample effects).
Compute rank of log(e-value) within protein.
Feed log(e-values), log rank, contact potential, joint
entropy, and separation along chain for pair, and
amino-acid profile, predicted burial, and predicted
secondary structure for each residue of pair into a
neural net.
origami with strings – p.43/49
Evaluating contact prediction
Two measures of contact prediction:
Accuracy:
P
χ(i, j)
P
1
(favors short-range predictions, where contact
probability is higher)
Weighted accuracy:
P
χ(i,j)
“
”
Prob contact|separation=|i−j|
P
1
(1 if predictions no better than chance based on
separation).
origami with strings – p.44/49
Contact prediction results
true positives/predicted
0.5
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
0.01
Weighted-accuracy of contact prediction, by protein
avg is_contact/prob(contact|sep)
Accuracy of contact prediction, by protein
Neural net
thin62, e-value
thin62, raw mi
0.1
predictions/residue
1
30
25
Neural net
thin62, e-value
thin62, raw mi
20
15
10
5
0
0.01
0.1
predictions/residue
1
origami with strings – p.45/49
Target T0230 (FR/A)
Good except for C-terminal loop and helix flopped
wrong way.
We have secondary structure right, including phase of
beta strands.
Contact prediction helped, but we put too much weight
on it—decoys fit predictions better than real structure
does.
origami with strings – p.46/49
Target T0230 (FR/A)
origami with strings – p.47/49
Target T0230 (FR/A)
Real structure with contact predictions:
origami with strings – p.48/49
Web sites
These slides:
http://www.soe.ucsc.edu/˜karplus/papers/origami-with-strings-jan-2007.pdf
SAM-T06 prediction server:
http://www.soe.ucsc.edu/research/compbio/SAM_T06/T06-query.html
CASP6 all our results and working notes:
http://www.soe.ucsc.edu/˜karplus/casp6/
Predictions for all yeast proteins:
http://www.soe.ucsc.edu/˜karplus/yeast/
UCSC bioinformatics (research and degree programs) info:
http://www.soe.ucsc.edu/research/compbio/
SAM tool suite info:
http://www.soe.ucsc.edu/research/compbio/sam.html
origami with strings – p.49/49