in vivo
in vitro
in silico
The Frontier of
Computational Biology
and Functional
Genomics
11th NTT Science Forum
6 April 2000
“NIH Urged to Train
Biologists on Computers”
Headline in The Washington Post, Monday June 7 1999
Recommendation of Federal Advisory
Panel to NIH Director Varmus:
Establish 20 new U.S. centers to teach
computer-based biomedical research at
a cost of US$8M per center per year.
Dr. Harold Varmus
Why?
“It’s sink or swim as a
tidal wave of data
approaches”
Nature 399:517 10 June 1999
Scientific literature continues to accumulate
at a rapid rate
Now over 10 million articles in MEDLINE®
400,000 new articles added each year
[Chart: number of articles (0 to 12,000,000) by year (1965 to 1995). Source: National Library of Medicine]
Molecular Genetics articles are accumulating
even faster than general scientific literature
Over 1 million Molecular Biology and Genetics articles in MEDLINE®
[Chart: number of articles (0 to 1,200,000) by year (1965 to 1995). Source: National Library of Medicine]
The rate at which DNA sequences are
accumulating is exponential
Over 6 million sequence entries in GenBank
Over 5 billion bases from 50,000 species
[Chart: number of sequence entries (0 to 6,000,000) by year (1965 to 2000), with annotations "Rapid DNA sequencing invented" and "Human Genome Project begun". Source: National Library of Medicine]
How do we bridge the gap between
sequence and function?
[Chart: Publications and DNA sequences (0 to 6,000,000) by year (1975 to 2000), with annotations "DNA Sequencing Invented", "Human Genome Project Begun" and "The Gap". Sources: Science (Genome Issue) 15 Oct. 1999; National Library of Medicine]
Gene Mapping Milestones
1996: 15,000 genes
1998: 30,000 genes
1999: 35,000 genes
Most mapped genes are
anonymous -- their locations
are known but their functions
await discovery.
The Gene Map has helped to identify genes
responsible for inherited diseases
Science 276, 2045-2047 (1997)
Mutation in the alpha-synuclein
gene identified in families with
Parkinson's disease.
Polymeropoulos MH, et al.
Parkinson’s
disease gene
The Accelerating Human Genome Project
Nature
Science
(September, 1998)
(October, 1998)
Waterston
Collins
Nature
(March, 1999)
Gibbs
Science
(March, 1999)
Lander
The Accelerating Human Genome Project
Sequenced Regions of Human
Chromosomes
To date, more than 800 million bases of DNA
sequence have been produced
Sequenced Regions of Human Chromosomes
State of the Genome: May 1999 and April 2000
Chr. 7: 49% finished
Chr. X: 42% finished
Chr. 21: 74% finished
Done! 33 Mb, 2 Dec 1999
The Human Genome Project is an
International Effort
USA: 50%
UK: 26%
France: 12%
Japan: 6% (JFCR, JST Corp., Keio University, RIKEN GSC, Tokai University)
Germany: 4%
Other: 2%
Human Genome Project data available now
on the Internet …
… for use by
researchers prior to genome completion
Details of
Chromosome 22
Sequence,
Biology and
Medicine
Computational Biology: Performing biological
experiments in silico
Gene
> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA
Biological structure &
function
> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI
DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK
KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
PDEAEQDCIEFGKKIANI
The power of computing
on the data!
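The DNA-to-protein step on this slide can be sketched in a few lines of Python. This is only a minimal illustration, assuming the standard genetic code and a reading frame that begins at the first ATG. The helper name `translate_first_orf` is hypothetical and does not come from the talk. Only the first 40 bases of the slide's DNA sequence are used, so the output is just the start of the protein.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_first_orf(dna: str) -> str:
    """Translate from the first ATG up to (not including) the first stop codon.

    Hypothetical helper for illustration: if no stop codon is found,
    translation simply ends at the last complete codon.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# First line of the DNA sequence shown on the slide.
fragment = "AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC"
print(translate_first_orf(fragment))  # MKIVYWSGTGN
```

The printed residues agree with the first eleven residues of the protein sequence on the slide (MKIVYWSGTGN...).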
Ataxia-telangiectasia gene: 18 years and 5 minutes
New England Journal of Medicine 333:645-7; 1995
Comparative Analysis of Genes
Cell, Vol. 75, 1027-1038, December 3, 1993, Copyright © 1993 by Cell Press
The Human Mutator Gene Homolog MSH2
and its Association with Hereditary
Nonpolyposis Colon Cancer
Richard Fishel,* Mary Kay Lescoe,* M. R. S. Rao,§ Neal G. Copeland,† Nancy Jenkins,† Judy Garber,‡ Michael Kane,§ and Richard Kolodner§
*Department of Microbiology and Molecular Genetics
Markey Center for Molecular Genetics
University of Vermont Medical School
Similarity to
bacterial and
yeast genes sheds new light
on human disease process
Human 638 RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC 697
Yeast 657 RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC 716
E.coli 584 RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA 642
portion of DNA mismatch repair protein sequence
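How similar the three aligned segments are can be put into numbers with a short Python sketch. It assumes the rows are already aligned, exactly as printed on the slide, and counts identical residues over the columns where neither row has a gap. The function name `percent_identity` is an illustrative choice, not part of the original analysis.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of aligned (non-gap) columns where both rows have the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = matches = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


# Aligned rows copied from the slide.
human = "RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC"
yeast = "RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC"
ecoli = "RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA"

for name, row in (("Yeast", yeast), ("E. coli", ecoli)):
    print(f"Human vs {name}: {percent_identity(human, row):.0f}% identical")
```

Identity over a short, hand-picked segment overstates similarity across the whole protein; the conserved core (TGPNMGGKSTY) is what drives the score here.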
Comparative Analysis of Genomes
“What is true for E. coli
is also true for elephant.”
Jacques Monod, c. 1961
“What is true for yeast is
also true for human.”
David Botstein, 1988
The importance of “model organisms”
Mouse Genes are closely related to Human Genes
Human
86%
Rat
85%
93%
Mouse
DNA sequence identity was computed for more than 1000
pairs of orthologous human, mouse, and rat genes
“Homology...
... is the central concept for all of biology. Whenever we say that a
mammalian hormone is the ‘same’ hormone as a fish hormone, that
a human gene sequence is the ‘same’ as a sequence in a chimp or
a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a
frog, and a human -- even when we argue that discoveries about a
worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the
human condition -- we have made a bold and direct statement
about homology. The aggressive confidence of modern biomedical
science implies that we know what we are talking about.”
David B. Wake
NCBI
“Dinosaur
DNA”
BLAST Search with “Dinosaur DNA”
BLASTN 1.1.7MP [23-Nov-90]
Query: "Dinosaur DNA" from Crichton's JURASSIC PARK, p. 103, nt 1-1200
Database: GenBank Release 65.0 (complete), October 1990
39,533 sequences; 49,179,285 total residues.
Sequences producing high-scoring segment pairs:
>Plasmid pBR322, complete genome.
length = 4361
A common piece of DNA used in every molecular biology laboratory!
Score = 328, Matches = 95% (68/71), Query strand = Plus
Expect = 9.7e-18, Poisson P = 9.7e-18
Query:  721 CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA 780
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 2581 CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA 2640
Query:  781 GCGCTCTCCTG 791
            || | ||| ||
Sbjct: 2641 GCTCCCTCGTG 2651
Score = 320, Matches = 93% (68/73), Query strand = Plus
Expect = 4.5e-17, Poisson P = 4.5e-38
Query:  530 GCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGG 589
            || || ||  ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1026 GCATCGGGATGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGG 1085
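The "Expect" and "Poisson P" figures in this report are linked by a standard textbook relation that the slide itself does not spell out. If chance hits at or above a given score follow a Poisson distribution with mean E, then the probability of at least one such hit is P = 1 - exp(-E). The Python sketch below is only an illustration of that relation, with a hypothetical function name. It covers a single hit and does not try to reproduce the second hit's Poisson P, which differs from its Expect value.

```python
import math


def poisson_p_from_expect(expect: float) -> float:
    """P(at least one chance hit) when chance hits are Poisson with mean `expect`."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-expect)


print(poisson_p_from_expect(9.7e-18))  # nearly identical to E when E is tiny
print(poisson_p_from_expect(1.0))      # P is well below E once E approaches 1
```

For very small E the two numbers are practically equal, consistent with the first hit above, where both are reported as 9.7e-18.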
Dot matrix analysis of Dinosaur DNA
Window Size = 35
Min. % Score = 100
Scoring Matrix: DNA Matrix
[Dot plot: Dinosaur DNA (vertical axis, positions 200 to 1200) against pBR322, complete genome (horizontal axis, positions 500 to 4000); matching regions labeled A, C1, C2 and B]
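A dot matrix comparison with a window size and a minimum percent score can be sketched as follows. This is a naive illustration in Python, assuming simple match/mismatch scoring as a stand-in for the slide's "DNA Matrix". The function name `dot_matrix` and the toy sequences are hypothetical.

```python
def dot_matrix(seq_a: str, seq_b: str, window: int, min_percent: float):
    """Return (i, j) start positions where a window of seq_a matches seq_b.

    A dot is recorded when at least `min_percent` of the `window` positions
    starting at seq_a[i] and seq_b[j] are identical (simple match/mismatch
    scoring, an assumption standing in for the slide's "DNA Matrix").
    """
    needed = window * min_percent / 100.0
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            same = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(window)
            )
            if same >= needed:
                dots.append((i, j))
    return dots


# Toy example: "GATTACA" occurs in the second sequence at offset 3.
print(dot_matrix("GATTACA", "CCCGATTACACCC", window=7, min_percent=100))
# [(0, 3)]
```

With the slide's settings (window of 35, minimum score of 100%), only runs of 35 consecutive identical bases would produce a dot, so diagonal lines in the plot mark long exact matches. The triple loop is slow for long sequences; it is written for clarity, not speed.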
Rejected by Science
Rejected by Nature
Rejected by Cell
Published in
BioTechniques
12(5):668-9; 1992
Dr. Crichton’s reply:
Another Dinosaur
gene. Or is it?
Crichton, The Lost World
Database Search with “Lost World” DNA
Similar to chicken and
frog DNA!
Query: Sequence from THE LOST WORLD, page 135 (1435 bases)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
316,522 sequences; 481,803,458 total letters
Searching... done
                                                                      High  E
Sequences producing significant alignments:                          Score  Value
gb|M26209|CHKRERYF1 Chicken erythroid-specific transcription fa…       783  0.0
gb|M76564|XELGATAC X.laevis GATA-binding protein (XGATA-2) gene…       670  0.0
gb|M76563|XELGATAB X.laevis GATA-binding protein (XGATA-1B ) ge…       248  1e-63
dbj|D13518|RATGATA1 Rat mRNA for transcription factor GATA-1, c…      71.9  2e-10
emb|X95701|HSGATA6PR H.sapiens mRNA for GATA-6 DNA-binding protein…   65.9  1e-08
gb|U66075|HSU66075 Human transcription factor hGATA-6 mRNA, com…      65.9  1e-08
gb|U91328|HSU91328 Human hereditary haemochromatosis region, hi…      60.0  6e-07
emb|X00257|SCCDC28 Yeast CDC28 (cell division control) gene…          60.0  6e-07
emb|X99254|PFPRIMSSU P.falciparum gene encoding primase, small…       60.0  6e-07
emb|Z36028|SCYBR159W S.cerevisiae chromosome II reading frame O…      60.0  6e-07
A Secret Message in the Dinosaur DNA
Score = 607 bits (1637), Expect = e-174
Identities = 304/318 (95%), Positives = 304/318 (95%)
Gaps = 14/318 (4%)
QUERY    1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
           MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
P17678   1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
QUERY   61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 120
           TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV    NCGAT
P17678  61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116
QUERY  121 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 180
           ATPLWRRDGTGHYLCN   ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS    NCQT
P17678 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169
QUERY  181 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 240
           STTTLWRRSPMGDPVCN   ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
P17678 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226
Residues inserted in the query relative to P17678: MARK, WAS, HERE, NIH
Mark Boguski
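The hidden message can be pulled out of the alignment mechanically by collecting the query residues that sit opposite gaps in the database sequence. The Python sketch below is a minimal illustration of that idea, not a record of how the message was originally found. The function name `inserted_segments` is hypothetical, and the rows are the gapped alignment rows from the slide (query positions 61 to 240).

```python
def inserted_segments(query: str, subject: str) -> list[str]:
    """Collect query residues that sit opposite gap characters in the subject."""
    segments, current = [], []
    for q, s in zip(query, subject):
        if s == "-":
            current.append(q)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


# Gapped alignment rows copied from the slide.
query = (
    "TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT"
    "ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT"
    "STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG"
)
subject = (
    "TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT"
    "ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT"
    "STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG"
)
print(" ".join(inserted_segments(query, subject)))  # MARK WAS HERE NIH
```

The four segments total 14 residues, matching the "Gaps = 14/318" line of the report.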
Cloning of the mouse “Obesity Gene”
The gene encodes a
protein called
“leptin.”
Sequence homology
searching revealed
nothing about the
possible function of
this protein.
Zhang et al. (1994). Nature 372, 425-432.
Computed prediction of leptin’s 3D structure
IL-2 structure / IL-2 sequence
IL-2 structure / leptin sequence
The protein sequence of leptin is compatible with the
protein structure of interleukin-2 (IL-2), suggesting that
the two may have similar mechanisms of action
Madej et al. (1995). FEBS Lett. 373, 13-18
The structure of leptin is now known
As predicted, it is a
member of the long-chain helical cytokine
family, like IL-2
Zheng et al. (1997). Nature 387, 206-209
How do we bridge the gap between
sequence and function?
[Chart: Publications and DNA sequences (0 to 6,000,000) by year (1975 to 2000), with annotations "DNA Sequencing Invented", "Human Genome Project Begun" and "The Gap". Sources: Science (Genome Issue) 15 Oct. 1999; National Library of Medicine]
“Functional Genomics”
...refers to the development and application of global
(genome-wide or system-wide) experimental approaches
to assess gene function by making use of the information
and reagents provided by genome projects. It is
characterized by high throughput or large scale
experimental methodologies combined with statistical
and computational analysis of the results.
The fundamental strategy in a functional genomics
approach is to expand the scope of biological investigation from studying single genes or proteins to
studying all genes or proteins at once in a systematic
fashion.
Hieter & Boguski (1997) Science 278: 601-602
National Library of Medicine
Tens of thousands of genes can be studied
in a single microarray experiment
V. R. Iyer et al. (1999). Science 283, 83-87.
Gene Expression Profiling
using DNA Microarrays
Plasminogen activator
inhibitor-2
HMG CoA
reductase
Each spot corresponds to a single gene
Signal color and intensity reveal changes in gene activity
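One common way to turn spot color and intensity into a number is a log ratio of two fluorescence channels. The Python sketch below assumes a two-color array in which each spot yields a red and a green intensity; that setup, the two-fold cutoff, the function names and the intensity values are all illustrative assumptions and are not taken from the slide.

```python
import math


def log2_ratio(red: float, green: float) -> float:
    """log2 of the red/green intensity ratio for one spot (both must be > 0)."""
    return math.log2(red / green)


def call_change(red: float, green: float, threshold: float = 1.0) -> str:
    """Label a spot as induced, repressed, or unchanged.

    The two-fold cutoff (|log2 ratio| >= 1) is an illustrative assumption.
    """
    ratio = log2_ratio(red, green)
    if ratio >= threshold:
        return "induced"
    if ratio <= -threshold:
        return "repressed"
    return "unchanged"


# Made-up intensities for three hypothetical spots.
spots = {
    "gene_a": (800.0, 200.0),
    "gene_b": (150.0, 600.0),
    "gene_c": (310.0, 300.0),
}
for gene, (red, green) in spots.items():
    print(gene, round(log2_ratio(red, green), 2), call_change(red, green))
```

Working in log2 units makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric, which is why the ratio is logged rather than used directly.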
“Anticipated advances in
computer speed will be unable to
keep up with the growing [DNA]
sequence databases and the
demand for homology searches of
the data.”
Charles DeLisi, 1988
U.S. Department of Energy
Luckily, DeLisi’s dire prediction has
not (yet) come true
Moore’s Law vs.
Growth of GenBank
[Chart: Transistors/chip and DNA Sequences on a logarithmic scale (1.00 to 100,000,000.00) by year (1970 to 2000)]
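A rough doubling time for GenBank can be worked out from two figures quoted elsewhere in this talk: 39,533 sequences in GenBank Release 65.0 (October 1990) and "over 6 million" entries at the time of the talk (April 2000). The Python sketch below assumes steady exponential growth between those two points, treats 6,000,000 as exact, and takes the interval as 9.5 years. These are simplifying assumptions, so the result is only indicative.

```python
import math


def doubling_time(count_start: float, count_end: float, years: float) -> float:
    """Years per doubling, assuming steady exponential growth between two counts."""
    return years / math.log2(count_end / count_start)


# 39,533 sequences (October 1990) to roughly 6,000,000 entries (April 2000).
genbank = doubling_time(39_533, 6_000_000, 9.5)
print(f"GenBank entries doubled roughly every {genbank:.1f} years")
```

Under these assumptions the doubling time comes out at about 1.3 years, somewhat faster than the commonly quoted 18 to 24 months for transistors per chip.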
Computational model of heart failure
Model based on aberrant
behaviour of cardiac ion
transporter genes
Computation
requires days of time
on a large, multiprocessor computer
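$$
\frac{\partial v(x,t)}{\partial t} = \frac{1}{C_m}\left[ -I_{ion}(v(x,t)) - I_{app}(x,t) + \frac{1}{\beta}\left(\frac{\kappa}{\kappa+1}\right)\nabla \cdot \left(M_i(x)\,\nabla v(x,t)\right) \right], \quad \forall x \in H
$$
Labeled terms: Total Membrane Current, Coupling Current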
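To give a feel for why such a model is expensive, here is a heavily simplified Python sketch of one explicit time step of the equation on this slide, dv/dt = (1/C_m) [ -I_ion(v) - I_app + (1/β)(κ/(κ+1)) ∇·(M_i ∇v) ]. Everything beyond that form is an assumption made for illustration: a one-dimensional fiber instead of the heart volume H, a constant scalar M_i, no-flux boundaries, forward Euler time stepping, dimensionless toy parameters, and a placeholder cubic ionic current in place of the ion-transporter model from the talk.

```python
def euler_step(v, dt, dx, c_m, beta, kappa, m_i, i_ion, i_app):
    """Advance membrane potential v (list of floats, len >= 2) by one time step.

    Discretizes dv/dt = (1/C_m) * (-I_ion(v) - I_app + D * d2v/dx2) with
    D = (1/beta) * (kappa / (kappa + 1)) * M_i, assuming a 1D fiber with
    constant conductivity M_i and no-flux (mirrored) boundaries.
    """
    diffusion = (1.0 / beta) * (kappa / (kappa + 1.0)) * m_i
    n = len(v)
    new_v = []
    for k in range(n):
        left = v[k - 1] if k > 0 else v[k + 1]
        right = v[k + 1] if k < n - 1 else v[k - 1]
        coupling = diffusion * (left - 2.0 * v[k] + right) / dx ** 2
        dv_dt = (-i_ion(v[k]) - i_app[k] + coupling) / c_m
        new_v.append(v[k] + dt * dv_dt)
    return new_v


def placeholder_i_ion(v):
    """Hypothetical cubic current; NOT the ion-transporter model from the talk."""
    return v * (v - 0.1) * (v - 1.0)


# Dimensionless toy parameters, chosen only so the explicit step is stable.
v = [0.0] * 20
v[0] = 1.0  # initial depolarization at one end
for _ in range(100):
    v = euler_step(v, dt=0.01, dx=1.0, c_m=1.0, beta=1.0, kappa=1.0,
                   m_i=1.0, i_ion=placeholder_i_ion, i_app=[0.0] * 20)
print([round(x, 3) for x in v[:5]])
```

A realistic simulation replaces the placeholder current with many coupled equations per cell and solves them on a three-dimensional mesh with very small time steps, which is where the days of multiprocessor time go.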
ELSI
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health