Introduction to Genetics and Applications of Statistics in Genetics
Tomasz Burzykowski
Hasselt University & International Drug Development Institute (IDDI), Belgium
tomasz.burzykowski@uhasselt.be

Outline
♦ Elementary genetics
♦ “Omics”
♦ A cautionary case study
♦ Statistics for “omics” technologies

DNA (deoxyribonucleic acid)
♦ The hereditary material in a cell is coded in the sequence of the nucleotides of DNA.
  • Human cells normally contain 46 chromosomes (23 pairs), each consisting of one double-stranded DNA molecule.
  • The complete set is called the genome.
♦ Prior to cell division, the DNA must be duplicated so that, after division, each new cell contains the full amount of DNA. This process is usually called replication.
  • Replication is semiconservative: each new DNA molecule contains one original strand and one newly synthesized strand.

DNA and RNA

DNA Replication
♦ The DNA double helix is caused to unwind. Each DNA strand serves as a template that guides the synthesis of its complementary strand.
♦ Template #2 guides the formation of a new complementary #1 strand: A → T, C → G, T → A, etc. The mirror-image reaction occurs on template #1.
♦ The new sequences are checked by two different polymerase enzymes. Mismatched nucleotides are hydrolyzed and cut out, and correct ones are inserted.

Genes (1)
CATCGGCTTATCTAGCTAATCGAGCTCTCTGAAGAGAAATATCATCTACGACTACTACGACACACATCGACGAGGCATC
♦ A gene is a contiguous section of a chromosome that encodes information to build a protein or an RNA (ribonucleic acid) molecule.
♦ In humans, a gene is composed of about 10,000 bp.

Genes
♦ You can think of a cell as a protein factory.
♦ Proteins are the basic building blocks of life.
  • Some proteins are the fundamental, structural components of tissue; others (enzymes) are catalysts for chemical reactions.
♦ Each gene is a blueprint for a protein, which is manufactured in the cell and then does some job in the same cell or elsewhere in the body.
♦ A chromosome contains genes and contiguous sections that are not part of any gene.
♦ A gene specifies how to make a specific protein, using the materials typically found inside the cell (amino acids, AAs).

Proteins, Peptides, Amino Acids
♦ Proteins are large molecules composed of one or more AA chains (polypeptides), arranged in a biologically functional way.
♦ Peptides (Greek: "digested") are short chains of AAs. They are distinguished (arbitrarily) from proteins by size (typically, peptide < 50 AAs).
  • dipeptides (two AAs), tripeptides, tetrapeptides, etc.
♦ A polypeptide is a long, continuous, and unbranched peptide.

Protein Structure

Amino Acids

The Genetic Code
♦ AAs are coded by triplets of nucleotides.
♦ Redundancy: there are 20 basic AAs and 4³ = 64 triplets.

Transcription: DNA → mRNA
♦ The genetic code is “read” from a type of RNA called messenger RNA (mRNA).
  • DNA needs to be transcribed into mRNA.

Translation: mRNA → protein
♦ After mRNA has been produced, it leaves the nucleus to allow protein synthesis.
♦ In the cytoplasm, ribosomal RNA (rRNA) and protein combine to form a ribosome. It serves as the site of protein synthesis and carries the necessary enzymes.
♦ Transfer RNA (tRNA) contains about 75 nucleotides, three of which form the anticodon, and carries one AA. The tRNA reads the mRNA codon with its anticodon and delivers the AA to be incorporated into the protein. There are at least 20 different tRNAs, at least one for each AA.

Translation: gene → protein
RNA transcript: CUAGCUCGAUGCUCUGAGUACGUCUAG
Protein:        L A R C S E Y V [stop]
The ribosome “translates” each 3-letter codon into a specific AA.

Human Genome Project
♦ A 13-year project coordinated by the U.S. Department of Energy and the National Institutes of Health
♦ Project goals were to:
  • identify all the approximately 20,000-25,000 genes in human DNA,
  • determine the sequences of the 3 billion chemical base pairs that make up human DNA,
  • store this information in databases,
  • improve tools for data analysis,
  • transfer related technologies to the private sector, and
  • address the ethical, legal, and social issues (ELSI) that may arise from the project.
♦ It was completed in 2003: http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml

We Know the Genome, but...
♦ The knowledge of the genome is only a start. We want to be able to answer questions like:
  • What proteins do the genes code for?
  • What do the proteins do?
  • In what processes are the genes/proteins involved?
  • Can we modify the code contained in a gene? If so, how?
  • ...
♦ This information cannot be obtained simply from the DNA sequence. Intensive biological experimentation is needed, using sophisticated technologies. This creates the need for suitable data-processing and analysis methods.

Genomics
♦ Genome: the set of genes.
♦ Genomics: “Any attempt to analyze or compare the entire genetic complement of a species or species (plural).”
  • It is of course possible to compare genomes by comparing more-or-less representative subsets of genes within genomes.
♦ Genome (humans): 2.3 × 10⁴

Transcriptomics
♦ Transcriptome: the set of expressed mRNA molecules.
♦ Transcriptomics: the study of the transcriptome.

Proteomics
♦ Proteome: the set of proteins encoded by the genome.
♦ Proteomics: the study of the proteome; it covers not only all the proteins in any given cell, but also the set of all protein isoforms and modifications, the interactions between them, and the structural description of proteins and their higher-order complexes.
♦ Transcriptome (humans): ~10⁶
♦ Proteome (humans): ~10⁸

Metabolomics
♦ The study of the metabolome, the collection of all metabolites in a biological cell, tissue, organ or organism, which are the end products of cellular processes.
  • Glycomics: study of glycomes (the entire complement of sugars)
  • Lipidomics: study of lipids

Many other “omics”
♦ Genomics:
  • Cognitive Genomics, Comparative Genomics, Functional Genomics, Metagenomics, ...
  • Nutrigenomics
  • Pharmacogenomics
  • Toxicogenomics
♦ Proteomics:
  • Immunoproteomics, Nutriproteomics, Proteogenomics
♦ ... Tragicomics (Tragicomix)

Aims of “Omics” Experiments
♦ Class discovery (“unsupervised learning”)
  • E.g., gene- or protein-signatures to find new disease sub-types
♦ Class comparison (“differential expression analysis”)
  • E.g., comparison of protein abundance between biological conditions
♦ Class prediction (“supervised learning”)
  • E.g., gene- or protein-signatures to be used for diagnostic purposes

High-throughput Technologies
♦ Genomics: genome-sequencing
♦ Gene-expression: microarrays, SAGE, RNA-seq
♦ Proteomics: mass-spectrometry, protein chips
♦ Metabolomics: mass-spectrometry, NMR
♦ …

“Omics” Technologies are Sophisticated and Impressive... ...
Which Makes Them Vulnerable

Sophisticated and impressive:
♦ Based on advanced scientific principles
♦ Use complex instrumentation
♦ Produce massive amounts of data (“high-throughput”)

Vulnerable:
♦ Highly sensitive; systematic effects due to time, place, reagents, personnel, … can be visible
♦ Reproducibility can easily be compromised
♦ Variability can be considerable
♦ Naïve data analysis can lead to erroneous conclusions

Mass Spectrometry: A Case Study

Classification Using Mass Spectra (1)
♦ Use of mass spectra to discriminate between ovarian cancer and normal samples:
  • Petricoin et al., Lancet 2002; 359: 572-577
  • Conrads et al., Endocr Relat Cancer 2004; 11: 163-178
  • Baggerly et al., Bioinformatics 2004; 20: 777-785
  • Sorace and Zhan, BMC Bioinformatics 2003; 4:24

Classification Using Mass Spectra (2)
Lancet 2002; 359: 572-577
♦ 100 ovarian cancer pts.; 100 normal controls; 16 pts. with “benign disease” (216 in total)
♦ Method: 50 cancer and 50 normal spectra used to train a classifier; the algorithm tested on the remaining samples.
♦ Results:
  • Correctly classified 50/50 of the “test” ovarian cancer cases (100% sensitivity).
  • Correctly classified 63/66 of the “test” non-malignant cases (95% specificity).

Classification Using Mass Spectra (3)
♦ July 2004: samples processed with the original SELDI technology and with a higher resolution instrument (QqTOF)
♦ Attention paid to QA/QC
♦ The results indicate 100% sensitivity and 100% specificity for identifying cancer from normal

Classification Using Mass Spectra (4)
♦ Re-analysis of three datasets:
  (1) described in Petricoin et al., 2002 (216 spectra)
  (2) the same 216 samples run on the Ciphergen WCX2 ProteinChip array
  (3) a new set of 253 spectra (91 normal and 162 cancer samples), run on the WCX2 array.

Classification Using Mass Spectra (5)
One can find a separation in dataset 3…
.. but not using 5 features (peaks) from dataset 2.
Something’s gone wrong. What?
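The test-set figures quoted from the Lancet 2002 study can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are ours, not the paper's, and the counts are the ones listed on the slide (216 samples, 100 used for training, 50/50 cancer and 63/66 non-malignant test cases).

```python
def proportion(correct: int, total: int) -> float:
    """Fraction of correctly classified samples within one group."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total

# Counts quoted on the slide for Petricoin et al. (Lancet 2002).
n_all, n_train = 216, 100
n_test_cancer, n_test_nonmalignant = 50, 66

# Consistency check: the test set is whatever was not used for training.
assert n_all - n_train == n_test_cancer + n_test_nonmalignant

sensitivity = proportion(50, n_test_cancer)        # cancers called cancer
specificity = proportion(63, n_test_nonmalignant)  # non-malignant called non-malignant

print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
```

Note that 63/66 is closer to 95.5% than to 95%; the slide rounds it down. As the rest of the case study shows, such figures say nothing about whether the classifier separates disease status or something else.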
Classification Using Mass Spectra (6)
“… 32 spectra that were lesser quality (…) were all generated at the end of experimental run, suggesting that a deviation in the process had occurred.”
♦ Focus on their Figures 6 and 7
[Figure panels: Day 1, Day 2, Day 3]

What Happened?
[Figure labels: cancer samples, control samples]
Cancers were processed mainly on day 1...
... controls on days 2-3…
… but there were quality problems occurring on day 3...
… so what do we discriminate between?

          Day 1   Day 2   Day 3
Cancer    100%    0%      0%
Control   0%      ~50%    ~50%

Bias due to Confounding
[Diagram: nodes “Sample status (cancer/control)”, “Day of measurement”, “MS intensity measurements”; edge labels “association created by the study design”, “association created by chance”, “observed association (induced)”]

How Could the Problem Have Been Prevented?
♦ By randomizing the order of processing the samples
  • Measurement days (“interventions”) assigned to each of the samples with equal probability
♦ It would balance the distribution of days within the groups
  • The association between the day and group would be eliminated

Discrimination Using Mass Spectra (7)
♦ Re-analysis of the third dataset (253 spectra, WCX2 array)
♦ Found perfect classification rules, using only two m/z features

Discrimination Using Mass Spectra (8)

Common Features of Technology and Data
♦ Sophisticated instrumentation
♦ Experimental techniques highly sensitive; systematic effects due to time, place, reagents, personnel, … can be visible
♦ Reproducibility can easily be compromised
♦ Large amount of data, many measurements/sample (10³-10⁶)
♦ Highly structured/complex data (correlation, variability, etc.)

Common Features of Analyses
♦ Sophisticated instrumentation
  • needs understanding
♦ Experimental techniques highly sensitive
  • pre-processing (removal of artifacts, normalization, ...)
♦ Many measurements per sample (10³-10⁶)
  • Multiple testing adjustment
  • Automated analyses preferred
  • Computational time must be considered
♦ Data structure/complexity
  • Modelling preferred...
  • ... but taking into account assumptions and computational time
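Multiple testing adjustment is listed above as a consequence of making many measurements per sample. As one possible illustration, the sketch below implements the Benjamini-Hochberg step-up adjustment, which controls the false discovery rate. The slides do not say which adjustment is intended, so the choice of method, the function name, and the toy p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for five features (e.g., five m/z peaks).
p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(p)

# Features declared significant at a 5% false discovery rate.
significant = [i for i, q in enumerate(adj) if q <= 0.05]
print(adj)
print(significant)
```

In this toy example the third and fourth p-values are below 0.05 before adjustment but not after it, which is the kind of correction that matters when 10³-10⁶ features are tested at once.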