MAPPING PUTATIVE REGULATORY REGIONS USING HISTONE H3 LYSINE 4 MONOMETHYLATION MARKS IN BREAST CANCER CELL LINES

by

Denil Wickrama
B.Sc., McMaster University, 2005

a Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Department of Molecular Biology and Biochemistry

© Denil Wickrama 2011
SIMON FRASER UNIVERSITY
Summer 2011

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

Declaration of Partial Copyright Licence

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.

The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.

The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.
It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.

Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.

While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.

The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.

Simon Fraser University Library
Burnaby, BC, Canada
Last revision: Spring 09

Abstract

Breast cancer is the most frequently diagnosed cancer in women. In cancer, tumour cells accumulate changes over time that allow them to replicate indefinitely. These changes can be mutations to DNA and also epigenetic modifications. This study looks at a histone modification, H3K4me1, in multiple breast cancer cell lines. It has been found that the regions between flanking H3K4me1 peaks, referred to as valleys, are enriched for bound transcription factors. Multiple cell lines were used to form functional groups (luminal vs. basal cell lines and tumourigenic vs. a non-tumourigenic match control) in which to look for concordance of valleys. In addition, overexpressed genes in a functional group, as determined by RNA-seq, were correlated with associated uniquely marked valleys. A motif analysis was done on the valley sequences using MEME and STAMP to yield putative transcription factor binding sites.
This analysis yielded some known and putative tumour suppressors and oncogenic factors.

This thesis is dedicated to my parents for their love, endless support, and encouragement.

Acknowledgments

I am very grateful to my supervisor Dr. Steven Jones for the opportunity to do this research and for the support, suggestions, and encouragement given throughout my thesis work. Thanks also to current and former members of Dr. Steven Jones' lab for help with research, thesis corrections, or presentation feedback, notably Anthony Fejes, Mikhail Bilenky, Gordon Robertson, Timothée Cezard, Elizabeth Chun, and Shing Zhan. Thanks as well to the other members of my committee, Dr. Frederic Pio and Dr. Fiona Brinkman, and also my SFU examiner, Dr. Jack Chen, who provided valuable suggestions to improve this thesis. Thanks to the CIHR/MSFHR Bioinformatics Training Program and the supervisors and members of labs that hosted me for a rotation as part of this program. It has been an amazing learning experience.

Contents

Approval
Abstract
Dedication
Acknowledgments
Contents
List of Tables
List of Figures
Nomenclature

1 Introduction
  1.1 Breast cancer
    1.1.1 Cancer development
      1.1.1.1 Oncogenes
      1.1.1.2 Tumour suppressors
    1.1.2 Breast Cancer Subtypes
      1.1.2.1 Luminal
      1.1.2.2 Basal-like
      1.1.2.3 HER2+
      1.1.2.4 Normal breast-like
  1.2 Cell lines
    1.2.1 Advantages of cell lines over primary culture
    1.2.2 Fidelity of cell lines to primary breast tumours
      1.2.2.1 Large scale genomic fidelity
      1.2.2.2 Immunohistochemical Fidelity
      1.2.2.3 Therapeutic Fidelity
  1.3 Cancer genomics
    1.3.1 Watson genome
    1.3.2 Venter Genome
    1.3.3 Exomes and transcriptomes
    1.3.4 Whole cancer genome
    1.3.5 Genomic Landscape of Cancer
    1.3.6 Breast Cancer Genomics
  1.4 Sequencing
    1.4.1 First generation
    1.4.2 Second generation
    1.4.3 Illumina Genome Analyzer Next-generation sequencing
      1.4.3.1 Roche 454 Genome Sequencer
      1.4.3.2 Life Technologies SOLiD System
      1.4.3.3 Single molecule sequencing
    1.4.4 Third generation
  1.5 RNA-seq
    1.5.1 Comparison to other methods
  1.6 Alignment
    1.6.1 Hash based methods
      1.6.1.1 Software
      1.6.1.2 MAQ
    1.6.2 Burrows-Wheeler Transformation Methods
      1.6.2.1 Software
      1.6.2.2 Bowtie
  1.7 Epigenetics
    1.7.1 What is epigenetics?
    1.7.2 How important is epigenetics in normal development?
    1.7.3 What role does epigenetics play in cancer?
    1.7.4 How do epigenetic factors exert phenotypic change?
    1.7.5 How permanent are the changes?
    1.7.6 What role does the nucleosome play?
    1.7.7 What are the types of histone modifications?
      1.7.7.1 Histone acetylation
      1.7.7.2 Histone phosphorylation
      1.7.7.3 Histone ubiquitination
      1.7.7.4 Histone methylation
    1.7.8 H3K4me1
      1.7.8.1 Mono-, di- and tri-methylation
      1.7.8.2 Bimodal loci
    1.7.9 Histone methyltransferases and histone demethylases
      1.7.9.1 LSD1
      1.7.9.2 MLL1
    1.7.10 Smyd
    1.7.11 Whistle
    1.7.12 JHDM
  1.8 Transcription Regulation
    1.8.1 Popular TF binding sites programs
    1.8.2 Mismatch representation
    1.8.3 Probabilistic
    1.8.4 Expectation Maximization
      1.8.4.1 MEME
    1.8.5 TF binding databases
      1.8.5.1 ORegAnno
      1.8.5.2 JASPAR
      1.8.5.3 TRANSFAC
    1.8.6 Interpretation of motif-finder output
      1.8.6.1 STAMP
  1.9 Functional analysis
    1.9.1 DAVID
    1.9.2 g:Profiler
  1.10 Summary of research

2 Materials and Methods
  2.1 Cell lines
    2.1.1 Fragmentation methods
    2.1.2 Immunohistochemical properties
    2.1.3 Cell lines used
      2.1.3.1 MCF7
      2.1.3.2 T47D
      2.1.3.3 BT549
      2.1.3.4 MDA-MB-231
      2.1.3.5 HS578T
      2.1.3.6 HS578Bst
  2.2 Aligning sequence reads to reference genome
  2.3 Filtering reads
  2.4 Identifying enriched regions
    2.4.1 Vancouver Short Read (Find Peaks 4)
    2.4.2 Saturation
  2.5 Valley regions
  2.6 Concordance
  2.7 Expression
  2.8 Motifs
    2.8.1 Association of valley marked genes with breast cancer tumourigenesis
    2.8.2 Functional analysis

3 Results
  3.1 Note regarding contributions
  3.2 ChIP sequencing Quality
    3.2.1 Tally of Reads and Peaks
    3.2.2 Saturation curves
  3.3 Enrichment of TF binding sites in H3K4me1 marked motifs
  3.4 Correlation of Valleys with Downstream Genes
    3.4.1 Association of valley marked genes with breast cancer tumourigenesis
  3.5 Concordance of valleys between cell lines
    3.5.1 Concordance between breast cancer cell line and a matched control
    3.5.2 Concordance among various luminal and basal breast cancer cell lines
      3.5.2.1 Breast cancer subtypes
      3.5.2.2 Concordance with the same subtype
      3.5.2.3 Valleys shared by all cell lines
    3.5.3 Concordance between a set of luminal and a set of basal breast cancer cell lines
  3.6 Unique valleys in promoter regions of overexpressed genes
    3.6.1 Defining marked overexpressed categories
    3.6.2 Tally of unique valleys in promoter region of overexpressed genes
      3.6.2.1 Breast cancer subtype specific valleys
      3.6.2.2 Tumourigenic valleys
    3.6.3 Tally of uniquely marked overexpressed genes
  3.7 Functional analysis
    3.7.1 Functional analysis of basal and luminal cell lines
      3.7.1.1 Functional analysis of basal marked basal overexpressed genes
      3.7.1.2 Functional analysis of basal marked luminal overexpressed genes
      3.7.1.3 Functional analysis of luminal marked basal overexpressed genes
      3.7.1.4 Functional analysis of luminal marked luminal overexpressed genes
    3.7.2 Functional analysis of cancer and control cell lines
      3.7.2.1 Functional analysis of control marked cancer overexpressed genes
      3.7.2.2 Functional analysis of cancer marked control overexpressed genes
      3.7.2.3 Functional analysis of cancer marked cancer overexpressed genes
      3.7.2.4 Functional analysis of control marked control overexpressed genes
  3.8 Marked overexpressed genes
    3.8.1 Motifs
      3.8.1.1 ESR1
      3.8.1.2 ESR2
  3.9 Genes downstream of ESR1 motifs in Valleys

4 Discussion & Conclusions
  4.1 Valley concordance
    4.1.1 Match control
    4.1.2 Breast cancer subtype
      4.1.2.1 Core shared marks
  4.2 Association of valley marked genes with breast cancer tumourigenesis
  4.3 Marked genes with corresponding expression modulation
    4.3.1 Functions of H3K4me1 Marked genes with corresponding expression modulation
      4.3.1.1 Cell cycle checkpoints
      4.3.1.2 Metastasis
      4.3.1.3 Cellular adhesion
    4.3.2 Angiogenesis
    4.3.3 MicroRNAs
  4.4 Putative regulatory regions
    4.4.1 Relevance of marked overexpressed categories
      4.4.1.1 Putative activatory region
      4.4.1.2 Putative repressive region
  4.5 Experimentally determined functions of TFs potentially regulated by valley regions
    4.5.1 ESR1 and ESR2
    4.5.2 Egr1
    4.5.3 Che-1
    4.5.4 EWSR1/Fli-1
    4.5.5 Ixr1
    4.5.6 Tlx1_NFIC
    4.5.7 Tin
    4.5.8 Bcd, oc, and gsc
    4.5.9 IRF1
    4.5.10 MEF2A
    4.5.11 Sna
    4.5.12 Stat3
    4.5.13 REST
  4.6 Experimental validation
  4.7 Uncorroborated experimental results
    4.7.1 Post-transcriptional regulation
    4.7.2 Co-regulators
  4.8 Progressive methylation
    4.8.1 Binding strengths of effectors
    4.8.2 H3K4me3 unobserved in these studies
      4.8.2.1 Expected case
      4.8.2.2 Methylation states
      4.8.2.3 Reasons for unexpected case
  4.9 Epigenetic crosstalk
  4.10 Conclusions

Bibliography

List of Tables

3.1 Tally of Reads and Peaks
3.2 Enrichment of TF binding sites in valleys
3.3 Proportion of breast cancer genes of the set of genes marked with H3K4me1 valleys
3.4 Concordance of valleys in match controlled cell lines
3.5 Cell lines by breast cancer subtype
3.6 Overlap of valleys in promoter regions of luminal and basal cell lines
3.7 Overlap of valleys in promoter regions of luminal and basal cell lines
3.8 Valleys shared between breast cancer subtypes
3.9 Categories correlating expression with H3K4me1 mark in tumourigenic and non-tumourigenic cell lines
3.10 Categories correlating expression with H3K4me1 mark in luminal and basal cell lines
3.11 Number of valleys in the promoter region marking overexpressed genes in breast cancer by subtype
3.12 Valleys in promoters of genes correlated with overexpression in match-controlled cell lines
3.13 Uniquely marked genes correlated with overexpression by breast cancer subtype
3.14 Uniquely marked genes correlated with overexpression in match-controlled cell lines
3.15 Control marked cancer overexpressed genes
3.16 Cancer marked control overexpressed genes
3.17 Cancer marked cancer overexpressed genes
3.18 Control marked control overexpressed genes
3.19 Basal marked basal overexpressed genes
3.20 Basal marked luminal overexpressed genes
3.21 Luminal marked basal overexpressed genes
3.22 Luminal marked luminal overexpressed genes
3.23 Uniquely Marked in Control and Overexpressed in Control
3.24 Uniquely Marked in Cancer and Overexpressed in Control
3.25 Uniquely Marked in Cancer and Overexpressed in Cancer
3.26 Uniquely Marked in Control and Overexpressed in Cancer

List of Figures

3.1 Combined Saturation plots. This figure was generated using Find Peaks 2 and a modified MatLab script, saturation.m, both created by Mikhail Bilenky.
3.2 Overlap of valley regions in tumourigenic cell line vs. control
3.3 Overlap of valley regions by breast cancer subtype
3.4 ESR1 motifs found in valleys upstream of genes that were uniquely marked by H3K4me1 mono-methylation in the control cell line and overexpressed in the control cell line, cont.
4.1 Snail1 complex [44]
4.2 Various REST isoforms [76]
4.3 Low H3K4me1 could indicate higher H3K4me3

Nomenclature

Acronym

BM: Basement membrane
bp: Base pair
DAVID: Database for Annotation, Visualization and Integrated Discovery
DNA: Deoxyribonucleic acid
ECM: Extracellular Matrix
EGFR: Epidermal growth factor receptor
ER: Estrogen Receptor
GO: Gene Ontology
HAT: Histone AcetylTransferases
HDAC: Histone DeACetylases
HER2: Human Epidermal growth factor Receptor 2
HKMT: Histone Lysine MethylTransferases
KDM: Lysine DeMethylase
KEGG: Kyoto Encyclopedia of Genes and Genomes
LSD1: Lysine-Specific Demethylase 1
MEME: Multiple EM for Motif Elicitation
PCR: Polymerase Chain Reaction
PFM: Position frequency matrix
PR: Progesterone Receptor
PRMT: Protein aRginine MethylTransferases
PSSM: Position Specific Score Matrix
SHRiMP: The SHort Read Mapping Package
SMS: Single molecule sequencing
SOLiD: Sequencing by Oligonucleotide Ligation and Detection
TF: Transcription factor
TRANSFAC: The Transcription Factor Database
TSS: Transcriptional start site

Glossary

Carcinogenesis: Carcinogenesis, or oncogenesis, is literally the creation of cancer.
ChIP-Seq: Chromatin immunoprecipitation combined with massively parallel DNA sequencing, used to identify where DNA-associated proteins bind.
Epigenetics: Heritable changes in gene expression and chromatin organisation that are not encoded in the genomic DNA itself.
H3K4me1: Monomethylation of lysine 4 (K4) on histone H3.
Histones: Histones are the proteins closely associated with DNA molecules.
Nucleosomes: Nucleosomes are the basic unit of DNA packaging in eukaryotes (cells with a nucleus), consisting of a segment of DNA wound around a histone protein core.
Oncogenes: An oncogene is a gene that has the potential to cause cancer.
RNA-seq: Deep high-throughput transcriptome sequencing, also known as Whole Transcriptome Shotgun Sequencing.
Somatic mutation: Alterations in DNA that occur after conception.
Tumour suppressor gene: A tumour suppressor gene, or anti-oncogene, is a gene that protects a cell from one step on the path to cancer.
Valley: The region between flanking H3K4me1 peaks, possibly marking a transcription factor binding site.

Chapter 1

Introduction

1.1 Breast cancer

Breast cancer is heterogeneous, arising from varied genetic and epigenetic abnormalities [286]. In general, tumours progress by accumulating modifications that allow them to behave differently from normal cells. These include self-sufficiency in growth signals, insensitivity to anti-growth signals, tissue invasion, metastasis, and sustained angiogenesis [106]. Usually, these steps occur through the activation of an oncogene, such as Ras, or the inactivation of a tumour suppressor gene, such as p53 [23]. Different tumour types have different molecular characteristics, and determining the tumour type allows prediction of the prognosis and selection of the best treatment [138].

1.1.1 Cancer development

The path by which cancer progresses is also important for developing treatments. Carcinogenesis literally means the production of cancer [50]. It occurs in multiple steps, in the form of genetic or epigenetic alterations that influence key cellular pathways [71, 129]. These steps include the deregulation of multiple cellular processes, including genome stability, proliferation, apoptosis, motility, and angiogenesis [6, 106]. With the breakdown of these barriers, a normal, finite-life-span somatic epithelial cell can transform into an immortalized, metastatic cell.

1.1.1.1 Oncogenes

Oncogenes are a group of genes that are major players in carcinogenesis. An oncogene is any gene that encodes a protein able to transform cells and induce cancer [198].
Types of oncogenes include growth factors, growth factor receptors, signal-transduction proteins, transcription factors, pro- or anti-apoptotic proteins, cell cycle control proteins, and DNA repair proteins [198]. An example of a growth factor receptor oncogene that plays a role in breast cancer is the EGFR/ErbB2 pair. Epidermal growth factor receptor (EGFR) and ErbB2 are members of the ErbB family of receptor tyrosine kinases. ErbB2 interacts with EGFR in order to achieve its full oncogenic potential. ErbB2 amplification and overexpression are associated with a poor prognosis in breast cancer patients [45]. In addition, BRCA1 is a transcriptional regulator whose mutation has been linked to the development of breast and ovarian cancer [347]; it acts as a tumour suppressor rather than an oncogene. Other oncogenes that researchers have found to be related to breast cancer include the tyrosine kinase family of growth factor receptors, the c-myc oncogene, cyclin D-1, and the cyclin regulator CDK-1 [46].

Oncogenes arise by activation of a proto-oncogene. Proto-oncogenes are cellular genes with important functions in normal cell growth or differentiation [198]. They may undergo mutations that alter their regulation or function, making them capable of turning normal cells into cancer cells. Different oncogenes are mutated in different tumours, contributing to differences in histopathology, hormone receptor expression, and clinical course [67].

As mentioned earlier, activation is necessary to convert a proto-oncogene into an oncogene. Generally, this involves a gain-of-function mutation [198]. These genes are altered by changes such as amplification, deletions and insertions, increased transcription, and point mutations [198]. A mutation within a proto-oncogene can change the protein's structure, causing an increase in enzyme activity or a loss of regulation [234].

One method of proto-oncogene activation, gene amplification, increases the level of the protein encoded by a gene. This could occur in various ways.
For example, an increase in protein concentration due to misregulation would provide a gain of function. Also, increased mRNA stability prolongs the transcript's existence, allowing more translation and thus increased activity in the cell. This results in enhanced function of the gene. An example of this mode of oncogene activation is that of HER2, which is seen in about 20% of primary breast cancer cases [245].

A point mutation that enhances the function of the oncoprotein is another mode of activation. An example is point mutations in the ras oncogene, seen commonly in lung, colorectal, and pancreatic (but not breast) cancer [245].

Chromosomal translocation is a mode of oncogenic transformation in which a fusion gene is transcribed and translated into a protein with enhanced function. Chromosomal translocations can cause increased gene expression to occur in the incorrect cell type or under incorrect cellular conditions. They can also result in the expression of a constitutively active hybrid protein.

1.1.1.2 Tumour suppressors

Proto-oncogenes are typically genes that assist cell growth and differentiation, and they induce cancer when mutated [54]. Tumour suppressors, on the other hand, slow down cell division, repair DNA mistakes, and promote apoptosis. The loss of function of these genes promotes malignancy [250]. Tumour suppressor gene mutations can be haploinsufficient or dominant negative, in addition to recessive [250].

Usually, mutated tumour suppressors are recessive alleles, as they contain loss-of-function mutations [107]. These mutations can follow the two-hit hypothesis, in which both alleles of a particular gene must be affected before an effect is manifested [155]. Typically, a mutation limited to one oncogene would be suppressed by normal mitotic control and tumour suppressor genes [156].
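The two-hit requirement can be made concrete with a rough calculation. This is an illustrative sketch only: the symbol p and the numerical value used are assumptions introduced here, not quantities taken from the cited studies, and the two alleles are assumed to be inactivated independently and with equal probability.

```latex
% Assumption (hypothetical symbol): p is the probability that one allele of a
% tumour suppressor gene is inactivated in a given cell lineage.
% Assumption: hits on the two alleles are independent.
% Requires the amsmath package for the align* environment.
\begin{align*}
P_{\text{two somatic hits}} &= p \cdot p = p^{2}
  && \text{(both alleles start functional)} \\
P_{\text{one remaining hit}} &= p
  && \text{(one allele already inactive)} \\
\frac{P_{\text{one remaining hit}}}{P_{\text{two somatic hits}}} &= \frac{1}{p}
\end{align*}
% Illustrative value only, not a measured rate: p = 10^{-3} gives
% p^2 = 10^{-6} versus p = 10^{-3}, a 1000-fold difference.
```

Under these assumptions, a lineage that starts with one inactive allele needs only a single further hit, so complete loss of function is about 1/p times more likely than when both hits must occur somatically.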
An inherited loss of a tumour suppressor allele leads to accelerated tumourigenesis, because only the one remaining allele needs to be inactivated [250].

In some cases, inactivation of one allele of a tumour suppressor gene is sufficient to cause tumours. Haploinsufficiency occurs when one allele is insufficient to confer the full functionality produced by two wild-type alleles [250].

In the case of a dominant negative mutation, the wild-type allele does not need to be inactivated, because the dominant negative mutation itself serves that function [250]. This phenomenon is called the dominant negative effect. These mutations are also thought to be more frequent than null mutations such as complete gene deletions, premature nonsense mutations, or regulatory alterations abolishing allelic expression [94]. They also appear frequently in transcription factors (TFs) [317].

An example of a key tumour suppressor gene is the p53 gene [322]. Mutation of p53 is the most common genetic change found in breast cancer, and p53 mutations are found in 50% of human cancers [294]. One function of this gene is to keep cells with damaged DNA from entering the cell cycle. The p53 gene can signal a normal cell with DNA damage to stop proliferating and repair the damage [46]. p53 can also respond to damaged DNA by signalling the cell to undergo apoptosis. If the p53 gene is damaged and loses its function, cells with damaged DNA continue to reproduce when normally they would have been removed through apoptosis.

A small proportion of breast cancer cases (5%) are related to the inheritance of susceptibility genes [46]. Examples of breast cancer susceptibility genes involved in some inherited cases of breast cancer are BRCA1 and BRCA2. When inactivated, these tumour suppressor genes act indirectly in the cell by disrupting DNA repair [46]. This allows the cell to accumulate DNA damage, including mutations that can encourage cancer development.
Other tumour suppressor genes that researchers have found may be related to breast cancer include the Retinoblastoma, Brush-1, Maspin, nm23, and TSG101 genes.

1.1.2 Breast Cancer Subtypes

For many years, the conventional way to diagnose the pathology of breast tumours was microscopic subtyping and grading [268]. However, patients with the same pathologic subtype and grade can have different outcomes. Long-term follow-up of patients with breast cancer shows that a particular subtype of carcinoma, or a specific grade as determined by the Nottingham Prognostic Index (NPI), has little impact on prognosis and does not provide any insight into the best therapeutic strategy [268].

Patients with breast cancer can be stratified based on their gene expression profile and the immunohistochemical expression of cytokeratins, estrogen receptors, EGFR, and HER2 [24]. This classification has an impact on therapeutic strategies, and the different molecular subtypes respond differently to chemotherapy. Five distinct molecular subclasses have been identified: luminal A, luminal B, HER2, basal-like, and normal-like [255, 237, 137].

1.1.2.1 Luminal

Luminal-like breast carcinoma is characterized by the expression of Estrogen Receptor (ER), Progesterone Receptor (PR), Bcl-2 and CK8/18 [268]. Luminal tumours originate in the inner cells that line the mammary ducts [232]. They are characterized by high levels of ER expression and are associated with good prognosis, high survival rates and low recurrence.

Luminal A is the most prevalent cancer subtype, occurring in 42-59% of cases [37]. These tumours are characteristically ER+ and/or PR+ and tend to be Human Epidermal growth factor Receptor 2 negative (HER2-). Only about 15% of luminal A tumours have p53 mutations, a factor linked with a poorer prognosis [37].

Luminal B tumours occur in 9-16% of cases and have a more aggressive phenotype than luminal A, but still have fairly high survival rates [255].
They are more likely to have p53 mutations, poorer tumour grade, and larger tumour size. Luminal B tumours tend to be HER2+, ER+ and/or PR+, and most express EGFR-1 and cyclin E1 [308].

1.1.2.2 Basal-like

Many basal-like tumours are triple-negative (ER-, PR-, HER2-), and this category comprises about 8-20% of breast cancers [37, 221]. This subtype expresses CK5/6 and/or EGFR [269]. These tumours are often associated with aggressive histological features and BRCA mutations, and have a poorer prognosis compared to luminal subtypes [338]. Basal-like tumours are usually treated with some combination of surgery, radiation therapy and chemotherapy. These tumours cannot be treated with trastuzumab or hormone therapies because they are HER2- and hormone receptor-negative [20].

1.1.2.3 HER2+

This category of tumour typically has the molecular signature (ER-, PR-, HER2+). HER2+ breast cancers tend to be more aggressive than other types of breast cancer [309]. In the majority of these tumours p53 is not expressed. HER2+ tumours are also prone to early and frequent relapse and distant metastases. This tumour type has an occurrence of 7-12% [151]. HER2+ tumours can be treated with the drug trastuzumab.

1.1.2.4 Normal breast-like

About 6-10% of tumours fall into an unclassified/normal breast-like category [37]. These tumours do not fit the profiles of the other four subtypes. They are negative for all five markers (ER-, PR-, HER2-, CK5- and EGFR-) [268]. These tumours are most often small and tend to have a good prognosis [73].

1.2 Cell lines

When studying tumours, cell lines are often used. A cell line is a homogeneous population of cells on which experiments can be performed. These cell lines can be derived from breast cancer patients and be immortalized for study. They are useful in vitro models for cancer research.

1.2.1 Advantages of cell lines over primary culture

Cell lines do not fully represent the tumours from which they derive.
They do, however, represent tangible and tractable experimental resources, and there are advantages to their use in a genome-wide sequencing study. For example, they are readily available in large quantities. When tumour tissue is used, the quantity of tumour material is limited and therefore difficult to share. Directly sequencing patient-derived tumour tissue would provide no experimental resource to test whether the modifications are causative or merely correlative with disease. Some other advantages of using cell lines over primary culture are faster population doubling times and the lack of a finite lifespan before senescence [32]. Cell lines are also heavily relied upon for compound and RNAi screening [103].

1.2.2 Fidelity of cell lines to primary breast tumours

Interpreting the results of a cell line experiment in the context of breast cancer pathophysiology requires an understanding of the extent to which cell lines mirror aberrations that are present in primary tumours. Studies have concluded that the cell line collection mirrors most of the important genomic and resulting transcriptional abnormalities found in primary breast tumours. They show that analysis of the functions of these genes in the ensemble of cell lines will accurately reflect how they contribute to breast cancer pathophysiologies [236].

1.2.2.1 Large scale genomic fidelity

Cell lines display the same heterogeneity in copy number and expression abnormalities as the primary tumours, and they carry almost all of the recurrent genomic abnormalities associated with clinical outcome in primary tumours [236].

1.2.2.2 Immunohistochemical Fidelity

Breast cancer cell lines can also be used to study subtype-specific changes in breast cancer. This is because the breast cancer cell lines cluster into basal-like and luminal expression subsets in a similar way to their tissue counterparts.
A study on the cell lines T47D, HS578T, MCF7, and MDA-MB-231 shows that luminal cells appear more differentiated and form tight cell-cell junctions, while the Basal B cells appear less differentiated and have a more mesenchymal-like appearance [236].

1.2.2.3 Therapeutic Fidelity

Given the immunohistochemical and large scale genomic fidelity, we would expect cell lines to respond to therapeutic agents in a similar manner to the breast tumours they represent. Indeed, studies have found that the cell lines exhibit heterogeneous responses to targeted therapeutics, paralleling clinical observations [236].

1.3 Cancer genomics

1.3.1 Watson genome

Second-generation DNA sequencing technologies have transformed investigation of cancer genomes. James Watson's genome was the first personal genome to be sequenced using NGS technologies [319]. This achievement was the first proof of principle that these rapid-sequencing machines can decipher large, complex genomes [243]. Watson's genome was sequenced to 7.4× coverage on the 454 GS (Roche) platform [331], and included 3.3 million single nucleotide polymorphisms. It took just four months, a handful of scientists and less than US$1.5 million to sequence the 6 billion base pairs of DNA pioneer James Watson [319].

1.3.2 Venter Genome

The genome of J. Craig Venter was sequenced at a cost of $100 million [319]. The approach was based on whole-genome shotgun sequencing, and generated an assembled genome over half of which is represented in large diploid segments (>200 kilobases). Essentially, in this method, the sequence was broken into large parts. Then the large parts were broken into smaller parts, sequenced and put back together [261]. The difference between Venter's genome and Watson's, besides the cost, is that in Venter's genome it was possible to figure out how the smaller parts fit into the larger parts, and to reconstruct contiguous pieces.
Also, unlike Watson's data, Venter's data allow a much closer look at the differences between the two sets of chromosomes; the study reports that the maternal and paternal sets are quite different and that 44% of the genes are heterozygous [261]. Comparison with previous reference human genome sequences, which were composites comprising multiple humans, revealed that the majority of genomic alterations belong to the well-studied class of single nucleotide variants (SNPs). However, the results also reveal that the lesser-studied genomic variants, insertions and deletions, while comprising a minority (22%) of genomic variation events, actually account for almost 74% of variant nucleotides [182].

1.3.3 Exomes and transcriptomes

Most of the currently known driver mutations change the coding sequences of protein-coding genes, and because protein-coding exons account for only about 1% of the human genome, sequencing is often thriftily targeted at these [315]. Use of technologies that extract subsets of DNA sequences from the whole genome [208], in combination with second-generation sequencing, has allowed sequencing of the protein-coding exons of roughly 2000 individual cancers worldwide [304]. This strategy will find base substitutions and indels in coding exons, but will miss these types of mutation in noncoding regions and requires other analyses of the same genomes to report most rearrangements. Similarly, after extraction of RNA, the transcriptomes of many hundreds of cancers have been sequenced [304].

1.3.4 Whole cancer genome

While sequencing exomes and transcriptomes yields useful information, it does not tell us the whole story. Technology shifts allowed further insight by sequencing the whole cancer genome [291, 185, 178].
This strategy, in which genomic DNA from a cancer and DNA isolated from normal tissue of the same person are both sequenced, can reveal all classes of somatic change (base substitutions, indels, rearrangements, copy number changes, and even potentially epigenetic alterations) in all sectors of the genome (exons, introns, and intergenic regions) [304]. This allows exploration of the genome without any preconceptions of where the important mutations are.

1.3.5 Genomic Landscape of Cancer

Somatic mutations found in cancer are either drivers or passengers [286, 98]. Passenger mutations confer no selective advantage or disadvantage, whereas driver mutations are causal in the neoplastic process and positively selected for in tumourigenesis [335]. There are usually between 1000 and 10,000 somatic substitutions in the genomes of most adult cancers, including breast, ovary, colorectal, pancreas, and glioma [98]. Within a particular cancer type, individual tumours often display wide variation in the prevalence of base substitutions [304].

Cancer genome exploration has identified approximately 400 somatically mutated cancer genes, or 2% of the protein-coding genes in the human genome, that contribute to neoplastic change in one or more types of cancer [89, 304]. Most inherited cancer shows a dominant pattern of inheritance, arising from inactivation of tumour suppressor genes rather than from activating mutations in oncogenes [77].

Most of the known cancer genes were found through primary cytogenetic analyses, with the wave of ever higher resolution copy number studies bringing a further substantial yield [304]. The advent of studies systematically sequencing cancer genomes has identified cancer genes directly through an elevated prevalence of base substitutions and small indels. These include several dominant cancer genes, such as BRAF, EGFR, ERBB2, PIK3CA, IDH1, IDH2, EZH2, FOXL2, PPP2R1A, and JAK2 [304].
While fewer recessive genes are known, there are some genes which may activate oncogenes. Examples that have emerged through systematic sequencing include SETD2, KDM6A, KDM5C, PBRM1, BAP1, ARID1A, DNMT3A, GATA3, DAXX, ATRX, and MLL2 [304].

Epigenetics plays a part in carcinogenesis, and these sequencing studies find evidence of this as well. Some of the genes found in these studies are involved in chromatin modification and remodelling. For example, SETD2, EZH2, and MLL2 are histone H3 methylases, whereas KDM6A and KDM5C are histone H3 demethylases [304].

1.3.6 Breast Cancer Genomics Sequencing

Many studies have investigated the genomic landscape of cancer, and one study in particular investigated the breast cancer genome. This study used these advances in sequencing technologies to characterize all somatic coding mutations that occur during the development and progression of individual cancers. The authors achieved over 43-fold coverage using Illumina sequencing technology to study the genome of metastatic tissue from a breast cancer patient [291]. This coverage ensured that every part of the genome was sequenced and allowed them to identify somatic mutations where the tumour genome differed from the patient's normal genome.

When comparing noncancerous and metastatic tissue, they found 32 mutations present in the metastatic tumour. Overall, the number of mutations they found in the cancerous tissue was greater than expected, making it challenging to determine which mutations were drivers that enhance a cancer's ability to spread, and which were passenger mutations that have no effect [295]. Of the 32 mutations found in the metastatic tumour, five were prevalent in the primary tumour, and six were found at lower frequencies in the primary tumour [291]. This kind of analysis sheds light on questions such as whether tumours start out with the ability to spread or evolve that capacity with time.
1.4 Next-generation sequencing

The previous section discussed some of the contributions that high-throughput genome sequencing can make to cancer genomics. When we examine a genome in an unbiased way and use tumourigenic samples with non-tumourigenic controls, we can draw many useful conclusions about the role of a particular gene in tumourigenesis. Using this technology for these kinds of studies is now feasible, but only as a result of various improvements. A discussion of the progression and future of high-throughput sequencing follows.

1.4.1 First generation

This technology started with Sanger sequencing machines. Modern Sanger sequencing machines started a shift in the way we think about sequencing: they enabled high-throughput sequencing, in which a single lab could sequence millions of base pairs, rather than the thousands that could be done prior to their introduction [131]. These machines are called the first generation of sequencing technology, as they are the first of many improvements and variations on high-throughput sequencing technology. They were automated capillary sequencing machines. The method was first developed by Frederick Sanger, using Sanger chemistry [285]. The original approach consumed much time and reagents and used isotopic radioactive labelling. It required four separate chain termination reactions and slab-gel based separation on four individual lanes. Eventually, this was improved to capillary electrophoresis, with multiple sequencing runs in parallel. This generation of sequencers was used in the Human Genome Project. The method can achieve read lengths of up to 1000 bp, with raw accuracy as high as 99.999%, at a cost as little as $0.50/kilobase and throughput close to 600,000 bp/day [346]. Though this method is still in use, it is not fast enough or sufficiently economical to be used in present-day large-scale genomic analysis.

1.4.2 Second generation

A new generation of sequencing technologies was needed for massively parallel genomic studies. There are three widely used commercial second generation sequencing platforms: the Illumina Genome Analyzer, the Roche 454 Genome Sequencer and the Life Technologies SOLiD System.

1.4.3 Illumina Genome Analyzer

Illumina's workflow uses reversible fluorescently labelled terminators as each dNTP is added. This system uses a flow cell with eight lanes that allows bridge amplification [78] of fragments on its surface. In each cycle, four distinctly labelled nucleotides are added simultaneously to the flow cell channel, and DNA polymerase adds a single base, which is 3′-OH blocked. The Illumina Genome Analyzer produces sequence reads of 32-50 bp [210, 346]. Its main drawbacks are the short read length and the decay of the fluorescent signal if any of the DNA strands extend out of sync [346].

All of the second generation technologies follow a similar workflow. The workflow is described below for the Illumina Genome Analyzer. First, DNA fragments are prepared from the genomic DNA sample. The library consists of randomly sheared genomic DNA fragments tens to hundreds of bp in size, or of paired-end fragments with a controlled distance distribution. Fragmentation can be done either by sonication or by using micrococcal nuclease [253]. The advantages and disadvantages of each of these strategies are discussed further in the materials and methods section. Adapters are ligated to both ends of the fragments [346]. The fragments are then denatured and attached to a planar surface as single strands, creating a single-stranded template library immobilized on a solid surface. These fragments are then clonally amplified by bridge amplification [78], resulting in double-stranded fragments. These fragments are denatured and cycles of bridge amplification are repeated [346]. The resulting DNA clusters form an array on a slide.
The sequencing then begins with the addition of all four fluorescently labelled reversible terminators, primers, and DNA polymerase [18]. Then the 3′ end is unblocked and the cycle is repeated for the subsequent bases. Optical events generated from the cyclic chain extension process are monitored by a microscopic detection system, and images are recorded through a CCD camera. One hundred of these regions of clustered DNA, or tiles, are imaged per lane [333]. Some bioinformatics challenges involved in this step are background subtraction, image correlation to account for flow cell repositioning, and intensity extraction of the cluster [333].

Next, a post-image analysis signal correction must be done to get accurate base calling. Bioinformatics challenges here involve crosstalk correction for overlapping dye emission frequencies, phasing correction for failed incorporation of a nucleotide, and chastity filtering on mixed clusters [333]. The sequence reads are aligned to the reference genome in processes which are described later.

1.4.3.1 Roche 454 Genome Sequencer

The Illumina technology described above has fairly low error rates but short reads. 454 technology results in long reads, but with considerable homopolymer problems. In this workflow, amplicons are made by emulsion PCR using paramagnetic beads coated with DNA primers [212]. The beads, which carry no more than one ssDNA molecule, are amplified through rounds of thermocycling, transferred to picotiter plates and further enriched. Sequencing-by-synthesis is done with pyrophosphate chemistry to produce optical signals [281]. Advantages of this technique are its speed and its read length of up to 500 bp [346]. This is due to the lack of extra chemical steps such as removing a label moiety or deblocking a terminator. Costs of reagents and errors in homopolymer regions are drawbacks of this method.
1.4.3.2 Life Technologies SOLiD System

Illumina yields longer reads than SOLiD, but is more expensive to run and produces fewer reads. In addition, SOLiD is more suited to SNP calling. SOLiD also uses emulsion PCR with paramagnetic beads, and then fixes those beads in a disordered array on a flat glass substrate [346]. The sequencing-by-synthesis method used in this technology is driven by ligation [203]. Seven rounds of ligation are used with fluorescently labelled octamer probes at the 8th position. Since the first two bases correlate with a unique fluorescent colour, each base is measured twice to allow identification of miscalls. Studies have shown that SOLiD sequencing can characterize an entire genome with only 18× haploid coverage [217].

1.4.3.3 Single molecule sequencing

Single molecule sequencing (SMS) platforms address some of the major drawbacks of other second generation sequencing platforms. SMS increases read length and the number of DNA fragments that can be independently analyzed on a given surface area, and involves no costly cluster amplification step [346]. The major challenge in this technology is the optical signal detection of a single-molecule event. Some companies that have addressed or are trying to address this issue are Helicos HeliScope, VisiGen, Pacific Biosciences, and Mobious Nexus I.

1.4.4 Third generation

Third generation sequencing involves sequencing single DNA molecules without the need to halt between read steps (whether enzymatic or otherwise) [287]. This is in contrast to second generation sequencing, which works by indirectly determining the base incorporated by either DNA polymerase or DNA ligase through fluorescent or chemiluminescent optical events. Working with large numbers of optical images is complex and costly. Consumables for biochemical reactions in sequence interrogation are also a major expense. There are attempts being made to create the next generation of sequencing technology.
Non-optical microscopic imaging is one strategy, which attempts to take a high-resolution picture of a DNA strand at the atomic level [310]. Nanopore sequencing is another technology, which threads a DNA strand through a pore electrophoretically and then reads the bases as they pass through the pore opening [26]. Graphene [262] and carbon nanotubes [5] are the basis of other techniques in development that use electrophysical properties to sequence DNA.

1.5 RNA-seq

RNA-seq, or whole transcriptome shotgun sequencing, uses deep-sequencing technologies to profile the transcriptome, the complete set of transcripts in a cell [325]. First, a library of cDNA fragments is generated from a population of RNA. Adaptors are attached to one or both ends of the cDNA fragments. Each molecule, with or without amplification, is then sequenced in a high-throughput manner to obtain short sequences from one end or both ends. The reads are typically 30-400 bp, depending on the DNA-sequencing technology used [325]. The sequence reads are aligned to the reference genome and classified as junction reads, exonic reads or poly(A) end reads.

Once high-quality reads have been obtained, the first task of data analysis is to map the short reads from RNA-Seq to the reference genome, or to assemble them into contigs before aligning them to the genomic sequence to reveal transcription structure. There are several programs for mapping reads to the genome, including ELAND, SOAP, MAQ, and RMAP [325]. Exon-exon junctions can be identified by the presence of a specific sequence context and confirmed by the low expression of intronic sequences, which are removed during splicing [325]. For complex transcriptomes it is more difficult to map reads that span splice junctions, due to extensive alternative splicing and trans-splicing.
One partial solution is to compile a junction library that contains all the known and predicted junction sequences and map reads to this library [334, 226]. For large transcriptomes, alignment is also complicated because some reads match multiple locations in the genome. One solution is to assign these multi-matched reads proportionally, based on the number of reads mapped to their neighbouring unique sequences [226, 48].

1.5.1 Comparison to other methods

Various technologies have been developed to deduce and quantify the transcriptome, including hybridization-based and sequence-based approaches. In contrast to microarray methods, sequence-based approaches directly determine the cDNA sequence [325]. RNA-Seq has very low, if any, background signal because DNA sequences can be unambiguously mapped to unique regions of the genome [325]. In addition, RNA-Seq does not have an upper limit for quantification, which correlates with the number of sequences obtained. Consequently, it has a large dynamic range of expression levels over which transcripts can be detected: a greater than 9,000-fold range was estimated in one study [229], and a range spanning five orders of magnitude was estimated in another [226]. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods [325]. RNA-Seq has also been shown to be highly accurate for quantifying expression levels, as determined using quantitative PCR (qPCR) and spike-in RNA controls of known concentration [229, 226]. Finally, RNA-Seq also shows high levels of reproducibility, for both technical and biological replicates [325].

Sanger sequencing of cDNA or EST libraries is relatively low throughput, expensive and generally not quantitative [174].
Tag-based methods such as Serial Analysis of Gene Expression (SAGE), Cap Analysis of Gene Expression (CAGE) and Massively Parallel Signature Sequencing (MPSS) are high throughput and can provide precise, 'digital' gene expression levels [325]. However, they are also based on expensive Sanger sequencing technology, and a significant portion of the short tags cannot be uniquely mapped to the reference genome. Also, only a portion of the transcript is analyzed, and isoforms are generally indistinguishable from each other [325].

DNA microarrays lack sensitivity for genes expressed at either low or very high levels, and therefore have a much smaller dynamic range (one-hundredfold to a few-hundredfold) [325]. In general, RNA-Seq avoids limitations of other methods such as reliance upon existing knowledge about genome sequence, high background levels owing to cross-hybridization, and a limited dynamic range of detection owing to both background and saturation of signals [325].

1.6 Alignment

Next generation sequencing involves an alignment step. Alignment is the process of determining the most likely source within the genome sequence for an observed DNA sequencing read [82]. It is one of the first steps taken in a sequencing-based project for which a reference genome assembly already exists. As sequencing capacity grows, algorithmic speed may become a more important bottleneck. Running accurate alignment algorithms as a full search of all possible places where the sequence may map is computationally infeasible. In general, alignment programs use heuristic techniques in a first step to quickly identify a small set of places in the reference sequence where the best mapping is most likely to be found. Then, slower and more accurate alignment algorithms such as Smith-Waterman are run on the limited subset. There are two fundamental technologies used in alignment: hash table-based implementations and Burrows-Wheeler Transform-based (BWT-based) methods.

1.6.1 Hash based methods

DNA sequencing reads are extremely unlikely to contain every possible combination of nucleotides and very likely to contain duplicates. This type of dataset lends itself well to hash tables. A hash table is a common data structure that can index complex and non-sequential data in a way that facilitates rapid searching. The first wave of alignment programs specifically designed for short-read alignment from next-generation sequencing machines was based on a hash-table data structure to index and scan the sequence data.

Hash-based algorithms build their hash table either on the set of input reads or on the reference genome. There are advantages and disadvantages to each method. For example, hash tables of the reference genome have a memory requirement that is constant for a given parameter set regardless of the size of the input set of reads, but that may be large, depending on the size and complexity of the reference genome. Hash tables based on the set of input reads typically have smaller and variable memory requirements, based on the number and diversity of the input read set, but may use more processing time to scan the entire reference genome when there are relatively few reads in the input set.

1.6.1.1 Software

Examples of tools that build a hash table of the input read sequences include MAQ [189], ELAND, SHRiMP [282], and ZOOM [193]. SOAP [190] is another example, which hashes the reference genome assembly [82]. The idea of a hash table can be traced back to BLAST [8]. This method follows a seed-and-extend paradigm, with the position of each k-mer subsequence of the query kept in a hash table. An improvement to this method was the discovery that seeding with non-consecutive matches improves sensitivity [202]. A seed allowing internal mismatches is called a spaced seed. ELAND was the first to use spaced seeds, as does SOAP; both allow a two-mismatch hit. MAQ extends this to allow k mismatches.
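As a toy illustration of the seed-and-extend idea described above, the following Python sketch hashes every k-mer of a reference sequence and uses exact seed matches from a read to propose candidate positions, which are then checked by counting mismatches. It is a minimal sketch under stated assumptions and not the algorithm of any of the tools named above: the function names `build_kmer_index` and `seed_and_extend`, the example sequences, and the ungapped mismatch count used as the "extend" step are all illustrative choices (real aligners use spaced seeds, quality scores and gapped extension).

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return sorted (start, mismatches) pairs for ungapped placements of the read.

    Each exact k-mer hit (the seed) proposes one candidate start position;
    the candidate is kept if the whole read matches there with at most
    max_mismatches substitutions (a simplified, gap-free extension step).
    """
    hits = {}
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


# Hypothetical toy data: the read differs from the reference at one base.
reference = "ACGTTGCATGCCGATAGCTAGGATCCA"
index = build_kmer_index(reference, k=4)
print(seed_and_extend("GCCGATTGCT", reference, index, k=4))  # prints [(9, 1)]
```

Here the hash table is built on the reference, so its size depends only on the reference and on k; hashing the reads instead, as described above, would swap the roles of the two inputs.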
ZOOM uses manually constructed spaced seeds to enable detection of up to four mismatches in 50-bp reads [202]. A potential problem with consecutive seeds and spaced seeds is that they disallow gaps within the seed [188]. A q-gram approach [270] requires that multiple spaced seeds per read match if a region is to be considered a possible alignment. This provides a possible solution to building an index that natively allows gaps. The q-gram filter is based on the observation that, at the occurrence of a w-long query string with at most k differences (mismatches and gaps), the query and the w-long database substring share at least (w + 1) − (k + 1)q common substrings of length q [35]. The consecutive-seed and spaced-seed methods initiate seed extension from one long seed match, while the q-gram approach usually initiates extension with multiple relatively short seed matches. An example usage of this method is SHRiMP [282]. BLAT [146] and SSAHA2 [240], which are used as capillary read aligners, also use this method [157].

1.6.1.2 MAQ

The Mapping and Alignment with Qualities algorithm (MAQ) was one of the first methods to work with short read lengths [189]. MAQ is a popular aligner that is among the fastest competing open source tools for aligning millions of Illumina reads to the human genome [168]. First, MAQ considers base quality scores during sequence alignment, which helps to address the variable quality of sequence across a read [157]. Second, it assigns a mapping quality score to quantify the algorithm's confidence that a read was correctly placed. MAQ also makes use of read pairing information in paired-end libraries to improve mapping accuracy and identify aberrantly mapped pairs.

1.6.2 Burrows-Wheeler Transformation Methods

The inexact matching problem can be reduced to identifying exact matches and building inexact alignments supported by exact matches [188].
These methods typically use the Full-text Minute-space (FM) index data structure, which introduced the concept that a suffix array is much more efficient if it is created from the Burrows-Wheeler Transform (BWT) sequence, rather than from the original sequence [81]. The FM index retains the suffix array's ability for rapid subsequence search and, for mammalian genomes, is often the same size as or smaller than the input genome [101]. Creating the underlying data structure requires two steps. In the first step, the sequence order of the reference genome is modified using the BWT, a reversible process that reorders the genome such that sequences that exist multiple times appear together in the data structure. Next, the final index is created; it is then used for rapid read placement on the genome. The creation of the final index may be a memory-intensive step, although methods exist to create the index in relatively little memory at the cost of more processing time [139]. The BWT has commonly been used in aligners that first create an efficient index of the reference genome assembly in a way that facilitates rapid searching with a low memory footprint [82].

1.6.2.1 Software

There are at least three aligners, Bowtie [168], BWA [187] and SOAP2 [191], that have leveraged the BWT algorithm. This algorithm provides dramatically decreased alignment time. These aligners are capable of mapping a single lane of Illumina data (20 million reads) in a matter of hours, compared to the several days required by MAQ [331].

1.6.2.2 Bowtie

Bowtie uses a different and novel indexing strategy to create an ultrafast, memory-efficient short read aligner, geared toward mammalian re-sequencing [168]. It employs a BWT index based on the FM index, which has a memory footprint of only about 1.3 gigabytes (GB) for the human genome [168]. Bowtie can align reads as short as four bases and as long as 1,024 bases [168]. The input to a single run of Bowtie may comprise a mixture of reads with different lengths.
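The BWT-based indexing described in Section 1.6.2 can be illustrated on a toy sequence. The Python sketch below computes the transform by sorting rotations, inverts it (showing that the process is reversible), and counts exact occurrences of a pattern with the backward search used by FM-index style structures. It is a minimal sketch under stated assumptions and not Bowtie's implementation: the function names `bwt`, `inverse_bwt` and `count_occurrences` are illustrative, the `$` end marker is assumed not to occur in the text, and the occurrence counts are recomputed on the fly, whereas a real index precomputes and compresses them and builds the transform from a suffix array.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations; '$' marks the end of the text."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed):
    """Invert the BWT by repeatedly prepending the transformed column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


def count_occurrences(transformed, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    # first[c]: number of characters in the text that sort before c
    first = {}
    for i, c in enumerate(sorted(transformed)):
        first.setdefault(c, i)

    def occ(c, i):
        # Occurrences of c in transformed[:i]; a real index precomputes these.
        return transformed[:i].count(c)

    lo, hi = 0, len(transformed)  # current range of sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("ACGTACGTTACG")
print(inverse_bwt(transformed))               # prints ACGTACGTTACG
print(count_occurrences(transformed, "ACG"))  # prints 3
```

Because rotations that begin with the same substring sort next to each other, each step of the backward search only narrows a single contiguous range, which is why exact matching is fast regardless of how often the pattern occurs.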
Bowtie has been used to align 14.3× coverage worth of human Illumina reads from the 1,000 Genomes project in about 14 hours on a single desktop computer with four processor cores [168]. Bowtie aligns Illumina reads to the human genome at a rate of over 25 million reads per hour [168]. Bowtie makes a number of compromises to achieve this speed. If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment. With its highest performance settings, Bowtie may fail to align a small number of reads with valid alignments, if those reads have multiple mismatches. If stronger guarantees are desired, Bowtie supports options that increase accuracy at the cost of some performance [168].

With its default options, Bowtie's sensitivity, measured in terms of reads aligned, is equal to SOAP's and somewhat less than MAQ's. There are options to allow increased sensitivity at the cost of greater running time, and to enable Bowtie to report multiple hits for a read. Bowtie has been found to align 35-bp reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than MAQ and 300 times faster than SOAP under the same conditions [82]. Also, unlike SOAP, Bowtie's 1.3 GB memory footprint allows it to run on a typical PC with 2 GB of RAM [168].

1.7 Epigenetics

1.7.1 What is epigenetics?

Epigenetics is the study of heritable changes in genome function that occur without changing the underlying DNA sequence. Just as the key signatures, phrasing and dynamics on a score of sheet music [266] show how the notes of a melody should be played, so too do epigenetic changes add multidimensional layers to the readout of DNA.

1.7.2 How important is epigenetics in normal development?

Epigenetics plays a role in normal development [58].
It is involved when cells specialize in complex multicellular organisms that develop from a fertilized egg. Interesting studies on epigenetics include those of twins. Identical twins share the same DNA sequence and have similar phenotypes, but they do not have complete phenotypic identity. These phenotypic differences are likely imparted by epigenetic modifications that occur over a lifetime. In a study of 80 pairs of identical twins of various ages, epigenetic differences were hardly detectable in the youngest twins, but increased with age. The number of genes that differed in activity between 50-year-old twins was more than three times that in pairs of three-year-old twins [86]. Also, epigenetic changes explain how simply altering the diet of a pregnant mouse can change the coat colour of her pups [327], or even alter their response to stress [328].

1.7.3 What role does epigenetics play in cancer?

Epigenetic modification can play an important role in the steps of tumourigenesis [123]. Some epigenetic processes silence key regulatory genes. When this silencing becomes dysregulated, it can result in diseased states. Epigenetic abnormalities in cancer comprise aberrations in virtually every component of chromatin involved in packaging the human genome [129]. These epigenetic modifications are mitotically heritable and can thus play the same roles and undergo the same selective processes as genetic alterations. In fact, epigenetic events can occur at a much higher rate than mutations in somatic cells.

1.7.4 How do epigenetic factors exert phenotypic change?

One example is the methylation of CpG islands in the promoter regions of genes [279]. This condenses the DNA to heterochromatin and can hide transcription factor binding sites or influence polymerase progression, thus silencing those genes. DNA is not naked in eukaryotes; it is packaged with a complex of proteins as chromatin.
DNA is spooled around nucleosomal units consisting of eight histones (two each of H2A, H2B, H3 and H4), around which 147 base pairs of DNA are wrapped in 1.75 superhelical turns [200]. This close proximity of the histones to the DNA allows changes in the histones to affect how the DNA is accessed and/or processed. These changes include post-translational histone modifications, energy-dependent chromatin remodeling, exchange of histones with variants, and targeting by small noncoding RNAs [260].

1.7.5 How permanent are the changes?

Many modifications and chromatin changes are reversible. These transitory changes are unlikely to be passed along to the germline. These marks change the chromatin template in response to various stimuli [127]. Other epigenetic modifications can be stable through several cell divisions. These include methylated DNA regions, altered nucleosome structures, and some histone modifications.

1.7.6 What role does the nucleosome play?

The core histone proteins that make up the nucleosome are highly basic. They have a globular domain from which flexible histone tails protrude. Histone proteins, including their tails, are highly conserved from yeast to humans, which indicates they have critical functions [144].

1.7.7 What are the types of histone modifications?

Many types of histone modifications have been identified. They include histone acetylation, phosphorylation, ubiquitination, sumoylation, ADP-ribosylation, biotinylation, proline isomerization, and histone methylation [314]. In addition, variant proteins of H2A and H3 can be substituted. The arrangement of these nucleosomes on the DNA is altered either by cis-effects or trans-effects. Cis-effects occur due to changes in the physical properties of covalently modified histone tails. Trans-effects occur via recruitment of modification-binding partners to the chromatin. This allows for context-dependent reading of a particular covalent histone mark.
1.7.7.1 Histone acetylation

Histone acetylation neutralizes the positive charge on the histones and decreases the interaction of the N termini of histones with the negatively charged phosphate groups of DNA. This generates an expansion of the chromatin fiber, allowing better access for the transcriptional machinery. Histone acetyltransferases (HATs) and histone deacetylases (HDACs) serve to regulate these histone marks. There is evidence that histone H3 acetylation and H3 lysine 4 methylation are functionally linked [239].

1.7.7.2 Histone phosphorylation

The four core histones, histone variants, and H1 histones are phosphorylated on both the amino-terminal and carboxy-terminal portions of the histones [116]. In general, histone phosphorylation may disrupt chromatin structure and allow for the recruitment or occlusion of non-histone chromosomal proteins to chromatin [265].

Linker histone H1 proteins are believed to promote the higher-order packaging of DNA by shielding the negative charge of linker DNA between adjacent nucleosomes. Histone H1 phosphorylation affects chromatin condensation and function. Phosphorylation of H1 increases the protein's mobility in the nucleus and weakens its interaction with chromatin [181]. It is thought that site-specific interphase H1 phosphorylation facilitates transcription by RNA polymerases I and II [344]. There is evidence that phosphorylation of histone H3 at threonine 6 by protein kinase C beta I prevents LSD1 from demethylating H3K4 [220].

1.7.7.3 Histone ubiquitination

H2A, H2B, H3 and their variant forms are ubiquitinated [56]. Histone ubiquitination is a reversible modification. Attachment of a chain of ubiquitin monomers is a prerequisite for the selective degradation of intracellular proteins by the ubiquitin-dependent proteolytic pathway.
H2B ubiquitination may disrupt chromatin structure, exposing H3K4 to Set1 [306].

1.7.7.4 Histone methylation

Histone methylation does not alter the charge of the histone tail but instead influences the basicity, hydrophobicity, and the affinity of certain molecules, such as transcription factors, toward DNA [343]. There are two general classes of methylating enzymes: protein arginine methyltransferases (PRMTs) and histone lysine methyltransferases (HKMTs). Methylation of histones was previously thought to be a permanent mark on chromatin [161]. This was based partly on 30-year-old reports that methylated lysines seemed to have the same half-life as histones [15]. It was previously thought that swapping a histone for a variant would be the only way methylated lysines could be removed. The variant histone H3.3 could replace H3, essentially replacing the canonical histone H3 with one that had different epigenetic modifications [111]. While these marks are stable, it is now known that they are enzymatically reversible. Arginine methylation can be removed by deiminases, which convert methyl-arginine to citrulline. Methylated lysine residues appear to be more stable but are still removable. Lysine methylation can be present in mono-, di-, or tri-methylated states.

1.7.8 H3K4me1

1.7.8.1 Mono-, di- and tri-methylation

All three histone methylation states are found at elevated levels surrounding the TSSs of known genes and are correlated with gene activation [16]. On average, though, the monomethylation peaks are more dispersed. H3K4me1 peaks are found 900 kb upstream of the TSS, as opposed to 500 kb for H3K4me2, and 300 kb for H3K4me3 [16]. All three states of H3K4 methylation are also highly enriched at insulators [16]. High levels of H3K4me1 with low levels of H3K4me3 were found to be a signature predicting enhancers in HeLa [110].
Though there are many epigenetic modifications that act together to affect transcription, one study claims that H3K4me1 may be at the top of the causal relationship chain [339]. Active genes were previously found to be associated with the mono- and tri-methylation of H3K4 [324].

1.7.8.2 Bimodal loci

Robertson et al. studied the spatial distribution of H3K4me1 around TFBSs. Bimodal H3K4me1 profiles were found, with peaks of H3K4me1 enrichment on either side of the sites of interest, such as transcription factor binding sites [278]. Genes with associated bimodal loci were found to have significantly higher expression than genes with associated monomodal or low H3K4me1 loci [115].

1.7.9 Histone methyltransferases and histone demethylases

Early studies of histones and methylated lysine residues demonstrated similar half-lives, which was interpreted as evidence that histone lysine methylation is an irreversible event. Evidence for the turnover of methyl groups later arose. The putative mechanisms included demethylases, histone replacement and clipping [321]. There are multiple histone methyltransferases (HMTs) and histone demethylases (HDMs) involved in H3K4 methylation [122]. Enzymes that methylate H3K4 include Mll1-4, Set1a/b, Ash2L (H3K4me2/3 only), Set7/9, Meisetz, Smyd1/Bop1, Smyd3, and Whistle. Enzymes that demethylate H3K4 include Lsd1, Jhdm1a/b, Jarid1a/Rbp2, and Jarid1b/c/d.

1.7.9.1 LSD1

LSD1 is a gene that encodes a flavin-dependent monoamine oxidase. It catalyses demethylation of lysine 4 of histone H3 in the mono- and dimethylated states (H3K4me1/2), but cannot act on H3K4me3 because trimethylated lysine lacks the protonated nitrogen that the reaction requires [293]. As a component of co-repressor complexes, LSD1 contributes to target gene repression by removing mono- and dimethyl marks from lysine 4 of histone H3 (H3K4) [220]. LSD1 is a flavin-containing amine oxidase [312]. LSD1 associates with HDACs as well as acting as a histone lysine demethylase [7], and HDAC inhibitors diminish H3K4 demethylation by LSD1 in vitro [177].
The transcriptional activation complex that LSD1 is part of includes MLL1. This suggests that the balance between methylated and unmethylated H3K4 is important to transcriptional regulation [231]. In addition, CoREST enhances the ability of LSD1 to reverse methylation and protects LSD1 from proteasomal degradation in vivo [175]. A possible mechanism is that CoREST binds to LSD1 and tethers it to the nucleosome, bringing the amine oxidase domain close to the H3 tail [175]. One study has proposed a mechanism by which DNA binding of CoREST facilitates the histone demethylation of nucleosomes by LSD1 [337]. CoREST is necessary for LSD1 to act on intact nucleosomal particles, and CoREST-bound LSD1 exhibits a 2-fold increase in the rate of catalysis [85].

1.7.9.2 MLL1

The mixed lineage leukemia protein-1 (MLL1) is a member of the SET1 family of H3K4 methyltransferases. The MLL1 methyltransferase has been found in a transcriptional activation complex that includes LSD1. This may indicate that a functional interplay between histone methyltransferases and histone demethylases is what ultimately defines the transcriptional states of the targeted genes. MLL1 has been shown to interact with RNAPII [222].

1.7.10 Smyd

The SMYD (SET- and MYND-containing protein) family consists of five proteins, SMYD1-5. Smyd1, Smyd2, and Smyd3 have activity on H3K4 methylation [122, 43]. The SET domain in Smyd2 is required for methylation at H3K4 [2]. Also, it was found that an interaction of SMYD2 with HSP90α enhances SMYD2 histone methyltransferase activity and specificity for H3K4 in vitro [2].

1.7.11 Whistle

WHISTLE (WHSC1-like 1 isoform 9 with methyltransferase activity to lysine) methylates histone H3K4 and H3K27 residues [152]. Studies have shown that WHISTLE can induce apoptotic cell death through caspase-3 activation and that HMTase activity is important for the apoptosis induction [151].
WHISTLE interacts with HDAC1 in vitro and in vivo, and the recruitment of HDAC1 is involved in WHISTLE-mediated transcriptional repression [151].

1.7.12 JHDM

The JHDM (JmjC domain-containing histone demethylase) family [153] is conserved in various organisms, and its JmjC domain is predicted to be a metalloenzyme catalytic motif [47]. There are multiple members of this family. JHDM1 demethylates H3K36, JHDM2 demethylates H3K9, JHDM3 demethylates H3K9 and H3K36, and JARID1 demethylates H3K4 [122]. This class of enzymes catalyzes the removal of methylation by using a hydroxylation reaction and requires iron and α-ketoglutarate as cofactors. JARID1B is one of the four members of the JARID1 protein family. All four members of this family have recently been shown to possess H3K4 demethylase activity [176, 126, 43, 154]. Overexpression of JARID1B resulted in loss of tri-, di-, and monomethyl H3K4 but did not affect other histone lysine methylations [122]. JARID1B can catalyze the removal of all three methyl groups from the H3K4 lysine residue. JARID1B, also known as PLU-1, was shown to be up-regulated in breast cancer and is probably involved in breast cancer development [122, 336].

1.8 Transcription Regulation

Unravelling the mechanisms that regulate gene expression is a major challenge in biology. Eukaryotic protein-coding genes are transcribed by RNA polymerase II; however, basal transcription is tightly regulated by complex processes involving chromatin-modifying proteins, transcription factors (TFs), co-factors and RNA polymerase [326]. The rate at which TFBSs are predicted varies for each TF binding model and is influenced by model parameters, but the application of most models with standard settings will report TFBSs at a rate in the range of one per 500 bp to one per 5000 bp [326]. An important task in this challenge is to identify regulatory elements and the conserved regions of DNA called motifs.
Recent advances in genome sequence availability and in high-throughput gene expression analysis technologies have allowed for the development of computational methods for motif finding [55]. TFs have distinct preferences towards specific target sequences. Given a set of known binding sites, it is possible to construct a model of the target sequence properties that can be used to predict potential binding sites in genomic sequences [326]. These DNA motifs are of great biological significance. Normally, the pattern is fairly short (5 to 20 bp long) and is known to recur in different genes or several times within a gene [55]. Sequences can have zero, one, or multiple copies of a motif. Motifs can form patterns such as palindromic motifs or spaced dyad motifs. Spaced dyads are motifs consisting of two short conserved boxes separated by a region of fixed size and variable content.

1.8.1 Popular TF binding site programs

Defining the transcription factor binding site can help elucidate the transcriptional machinery of the cell. The goal of motif finding is to detect novel, over-represented signals in a set of sequences [272]. Existing motif finding approaches can be classified into two main categories according to how they represent the consensus DNA pattern: probabilistic or mismatch representation [70].

1.8.2 Mismatch representation

A signal can be defined as a consensus pattern, with up to a certain number of mismatches allowed in each instance of the pattern [55]. This is called mismatch representation. The goal of these algorithms is to recover the consensus pattern with the most significant number of instances, given a certain background model. These methods view the representation of the signals as discrete and rely on exhaustive enumeration [55].
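The exhaustive enumeration behind the mismatch representation can be sketched in a few lines of Python. This is a simplified illustration rather than any of the published algorithms: the function names are hypothetical, every candidate pattern of a fixed length is tried, and a pattern is scored simply by the number of input sequences that contain at least one instance within the allowed number of mismatches. A background model for judging significance, which the methods cited here rely on, is deliberately left out.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def sequences_with_instance(pattern, sequences, max_mismatches):
    """Count sequences holding at least one instance of pattern."""
    width = len(pattern)
    return sum(
        any(
            hamming(pattern, seq[i:i + width]) <= max_mismatches
            for i in range(len(seq) - width + 1)
        )
        for seq in sequences
    )


def best_consensus(sequences, width, max_mismatches):
    """Try all 4**width patterns and return (pattern, sequence_count)."""
    best_pattern, best_count = None, -1
    for letters in product("ACGT", repeat=width):
        pattern = "".join(letters)
        count = sequences_with_instance(pattern, sequences, max_mismatches)
        if count > best_count:  # ties keep the first pattern enumerated
            best_pattern, best_count = pattern, count
    return best_pattern, best_count


# Toy input: GACGTC is planted exactly once and twice with one substitution.
toy = ["AAGACGTCAA", "TTGACCTCTT", "CCGTCGTCCC"]
pattern, count = best_consensus(toy, width=6, max_mismatches=1)
print(pattern, count)
```

Because the search space grows as 4 to the power of the pattern width, this brute-force form is only practical for short motifs, which is one motivation for the projection and pattern-branching refinements.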
These algorithms guarantee that the highest-scoring pattern will be the global optimum for any scoring function; however, consensus patterns are not as expressive of the DNA signal as profile representations. Recent approaches within this framework include Projection methods [31], string-based methods [257], Pattern-Branching [263], and MULTIPROFILER [145].

1.8.3 Probabilistic

A generative probabilistic representation of the nucleotide positions can be used to discover a consensus DNA pattern that maximizes the information content score [272]. In this method, finding the best consensus pattern is done by finding the global maximum of a continuous non-convex function. Algorithms in this category perform stochastic optimization or greedy searches [70]. The main advantage of this approach is that the generated profiles are highly representative of the signals being determined [272]. The disadvantage, however, is that finding the global maximum of a continuous non-convex function is a challenging problem, and thus the motif found may not be the best one but the nearest local optimum instead [64]. Gibbs sampling [172], MEME [13], Weeder [249], the greedy CONSENSUS algorithm [113] and HMM-based methods [65] use this method.

1.8.4 Expectation Maximization

Expectation-Maximization is an iterative procedure to maximize the likelihood of a probabilistic model with regard to given data. The algorithm starts with an initial guess as to the location and size of the site of interest in each of the sequences [228]. These parts of the sequences are aligned, and this provides an estimate of the base or amino acid composition of each column in the site. The binding sites are modelled as a Position Frequency Matrix (PFM). The background genomic sequence and the embedded binding site have different statistical properties [228].
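The position frequency matrix at the centre of this model can be made concrete with a short Python sketch. It shows only how a PFM is counted from aligned sites and how it can be turned into log-odds scores against a background; it does not implement the EM iterations themselves. The function names, the uniform background of 0.25 per base, the pseudocount of 1 and the toy sites are assumptions made for the illustration, and sequences are assumed to contain only uppercase A, C, G and T.

```python
import math

BASES = "ACGT"


def position_frequency_matrix(sites):
    """Count each base at each column of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in BASES}
    for site in sites:
        for column, base in enumerate(site):
            pfm[base][column] += 1
    return pfm


def log_odds_matrix(pfm, background=None, pseudocount=1.0):
    """Convert counts to log2(site probability / background probability)."""
    background = background or {base: 0.25 for base in BASES}
    width = len(pfm["A"])
    n_sites = sum(pfm[base][0] for base in BASES)
    scores = {base: [] for base in BASES}
    for column in range(width):
        for base in BASES:
            # The pseudocount keeps unseen bases from scoring minus infinity.
            p = (pfm[base][column] + pseudocount * background[base]) / (
                n_sites + pseudocount
            )
            scores[base].append(math.log2(p / background[base]))
    return scores


def best_site(sequence, scores):
    """Return (start, score) of the highest-scoring window in sequence."""
    width = len(scores["A"])
    best_start, best_score = None, float("-inf")
    for start in range(len(sequence) - width + 1):
        score = sum(scores[sequence[start + j]][j] for j in range(width))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTAA"]  # hypothetical sites
pfm = position_frequency_matrix(sites)
start, score = best_site("CCCTGACTCAGGG", log_odds_matrix(pfm))
print(start, round(score, 2))
```

In an EM setting, the counting step would be repeated with fractional weights for every candidate window, so that the matrix and the site positions are refined together.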
Through multiple iterations involving calculating the probability of each sequence for all possible choices of the binding site, the binding site is refined [228]. Convergence is achieved when the values of the predicted binding site probabilities no longer change [164].

1.8.4.1 MEME

MEME is an example of the expectation maximization algorithm and can be used to search for novel transcription factor binding sites in sets of biological sequences. MEME searches for repeated sequence patterns that occur in the DNA [173], including sites that may include gaps [36]. MEME is widely used; however, there are newer programs that surpass MEME in certain aspects. For example, it has been suggested that MEME is too conservative and could miss motifs [230]. Also, in a study of 13 motif finding tools, Weeder outperformed the other tools [313]. This may be due to the 'cautious mode' Weeder was run in, which allows only the strongest motifs to be reported. This mode would be most useful if a search was done with the knowledge that there was at most one motif of interest in the sequence.

1.8.5 TF binding databases

There are various databases that catalogue transcription factor binding sites for use in further studies. These databases can be used to correlate regions in the genome with transcription factor binding sites.

1.8.5.1 ORegAnno

ORegAnno is an open-source, open-access database and literature curation system for community-based annotation of experimentally identified DNA regulatory regions, transcription factor binding sites and regulatory variants [99]. A regular user can add individual annotations of promoters, transcription factor binding sites and regulatory mutations to the database. These data are validated by cross-referencing against PubMed [332], Entrez Gene [206], dbSNP [292], the NCBI Taxonomy database [332] and EnsEMBL [121].
Once submitted, an XML representation is scored by validators who confirm the reliability of the annotation from the literature.

Each annotation specifies an evidence type, subtype and class describing the biological technique cited to discover the regulatory sequence. Evidence classes are broken into two categories: the `regulator' classes describe evidence for the specific protein that binds a site, and the `regulatory site' classes describe evidence for the function of a regulatory sequence itself. These two categories are further divided into three levels of regulation (transcription, transcript stability, and translation). The experimental evidence is optionally associated with a specific cell type using the eVOC cell type ontology [185]. Each transcription factor binding site or regulatory mutation must specify a target transcription factor which is either user-defined, in Entrez Gene or in EnsEMBL, or classified as `unknown'.

1.8.5.2 JASPAR

Position-specific scoring matrices are the preferred models for representation of transcription factor binding specificity. JASPAR is an open-access database of annotated, high-quality, matrix-based transcription factor binding site profiles for multicellular eukaryotes. These profiles were derived exclusively from sets of nucleotide sequences experimentally demonstrated to bind transcription factors [283].

1.8.5.3 TRANSFAC

The TRANScription FACtor database (TRANSFAC) models the interaction of eukaryotic transcription factors with their DNA-binding sites and how this affects gene expression. At its core are three tables: Factor, Site, and Gene. A link between the Factor table and the Site table indicates a binding interaction. Experimental evidence for this interaction and the cell from which the factor was derived are given in the site entry. On the basis of the method and cell, a quality value is given to describe the confidence with which a binding activity could be assigned to a specific factor [215].
When a number of binding sites have been collected for a factor, the site sequences are aligned to create nucleotide distribution matrices. These matrices are used by the tool Match to find potential binding sites in uncharacterized sequences, while Patch, another tool, uses the single sites stored in the Site table.

The Gene table connects information from TRANSFAC, TRANSCompel, HumanPSD™, S/MARtDB™, and TRANSPATH. Gene entries serve as a major linking source to a growing number of external databases. Public versions of TRANSFAC and the above-mentioned programs are freely accessible for research groups from non-profit organizations at http://www.gene-regulation.com. The professional version of TRANSFAC is available at http://www.biobase-international.com [215].

1.8.6 Interpretation of motif-finder output

Motif discovery is often one of the first steps performed during computational analysis of gene regulation. For instance, researchers often wish to discover over-represented motifs that are common to sets of genes with similar expression patterns. Interpretation of the output from motif-finders is a challenge. Many distinct motifs may be reported with little or no indication as to whether each may possess regulatory function. A tool that can assess similarity between novel, computationally identified motifs and the known motifs stored in the databases is necessary for interpretation [207].

1.8.6.1 STAMP

STAMP is a web server that is designed to support the study of DNA-binding motifs. It is used to query motifs against databases of known motifs. The software aligns input motifs against the chosen database, and lists of the highest-scoring matches are returned. Such similarity-search functionality is expected to facilitate the identification of transcription factors that potentially interact with newly discovered motifs [207]. This resource is flexible in the formats of data it accepts as input.
Motifs may be input as frequency matrices, consensus sequences, or alignments of known binding sites. STAMP also directly accepts the output files from 12 supported motif-finders, enabling quick interpretation of motif-discovery analyses [207].

STAMP automatically builds multiple alignments, familial binding profiles and similarity trees when more than one motif is input. These functions are expected to enable evolutionary studies on sets of related motifs and fixed-order regulatory modules, as well as illustrating similarities and redundancies within the input motif collection [207].

STAMP's functionality is essentially pairwise comparison of motifs. In general, two motifs can be aligned using Needleman-Wunsch [235] (global) or Smith-Waterman [298] (local) alignment methods [207]. Alignment algorithms require a distance metric. There are five supported distance metrics: (i) Pearson's correlation coefficient [258], (ii) Kullback-Leibler information content [280], (iii) sum of squared distances [284], (iv) average log-likelihood ratio (ALLR) [323] and (v) ALLR with a lower limit of 2 imposed on the score [207]. This algorithm avoids length biases when comparing motifs of different lengths, using the method of Sandelin and Wasserman for the calculation of empirical p-values based on simulated PSSM models [284].

1.9 Functional analysis

A high-throughput technique that monitors the expression of tens of thousands of genes requires an automated method to extract meaningful information from the large amount of data that results [150]. This section describes the common challenge of translating lists of differentially regulated genes into a better understanding of the underlying biological phenomenon. The output of an RNA-seq experiment is often a list of differentially expressed genes. An automatic ontological analysis approach can help with the biological interpretation of such results.
Currently, this approach is the de facto standard for the secondary analysis of high-throughput experiments, and a large number of tools have been developed for this purpose [150]. This type of analysis may have drawbacks. For instance, experimentally derived gene lists have substantially more annotation associated with them, as their genes have been researched for a longer period of time. This annotation bias, a result of patterns of research activity within the biomedical community, is a major problem for classical hypergeometric test-based ORA approaches, which cannot account for such bias [180].

The need to formalize this interpretation process has led to the development of a range of tools, of which a family of statistical methods collectively known as over-representation analysis (ORA) is becoming increasingly popular among researchers undertaking microarray analysis. The fundamental question asked by ORA is: what biological terms or functional categories are represented in the gene list more often than expected by chance [180]?

Multiple databases are useful for functional analysis. GO is the primary resource for annotating gene groups to three types of knowledge: cell components, molecular functions, and biological processes [9]. The KEGG database provides functional annotations for metabolic and information processing pathways, cellular processes, human diseases and drug development data [134]. Reactome is a mammalian-specific pathway database with thorough annotations of numerous well-studied biological processes, ranging from intermediary metabolism to signal transduction to cell cycle and apoptosis [62].

1.9.1 DAVID

There are also web-based tools that amalgamate the output of such resources. DAVID, the Database for Annotation, Visualization and Integrated Discovery, is one such tool. It provides mainly batch annotation and Gene Ontology (GO) term enrichment analysis.
Other resources provided include protein-protein interactions, protein functional domains, disease associations, bio-pathways, sequence general features, homologies, gene functional summaries, and gene tissue expressions [120]. Functional enrichment tests are used to interpret the biological meaning of a gene list. Such statistical tests are performed on the functional categories of the gene lists. A hypergeometric test is used to test the enrichment of genes belonging to a given category in the identified gene list versus the genome [72]. DAVID uses various multiple testing correction techniques, including Bonferroni, Benjamini, and FDR. In addition, DAVID gives the option of using an EASE score (Expression Analysis Systematic Explorer) to quantify overall enrichment of gene groups. The EASE score is a modified Fisher's exact test. It globally corrects enrichment p-values to control the family-wide false discovery rate [150].

1.9.2 g:Profiler

g:Profiler is a web-based toolset for functional profiling of gene lists from large-scale experiments [276]. Primary input can be a list of genes, proteins, or probe identifiers. It supports many ID types and even mixing of arbitrary ID types [276]. The purpose of g:Profiler is to find high-level knowledge common to the list of input genes, such as pathways, biological processes, molecular functions, subcellular localizations, or shared TFBSs. The data used in g:Profiler are derived from the Gene Ontology [9], KEGG [134], Reactome [62] and TRANSFAC [215] databases [276].

GO is a structured vocabulary in the form of a directed acyclic graph. The results from GO and other relevant biological databases are presented either in tree-like top-down order, grouped by domains, or ranked by statistical significance. The GO-structure-preserving visualization captures the hierarchical relationships between significantly enriched categories. Hierarchical relations hold within GO.
Vocabulary terms are related to one or several more general `parent' terms. Any term automatically involves all terms below it via all relational paths. Therefore, genes annotated to a specific term in g:Profiler are also added to all associated `parents', and the profiling is performed at all hierarchical levels simultaneously. g:Profiler strips out GO annotations that apply the `NOT' qualifier.

A visualization technique called gene-to-term mapping shows a coloured box if there is an association with the term in question. Furthermore, the colour coding corresponds to different types of evidence in heatmap style.

g:Profiler uses cumulative hypergeometric p-values to identify the most significant terms corresponding to the input set of genes. Unlike most of the common profiling tools, g:Profiler supports annotations of descendants according to the ``True Path Rule'' [53].

A crucial factor in functional profiling is the estimation of statistical significance under multiple testing against many categories, if the specific functional category was not selected a priori [150]. Multiple testing corrections can broadly be split into two groups. Family-Wise Error Rate (FWER) corrections, such as Bonferroni or Sidak, measure the chance of at least one false-positive match. Functional profiling involves testing against hundreds to thousands of terms, and such approaches become rather conservative, especially as the tests are not independent due to the hierarchical structure of GO. These tests are not suited to the heavily overlapping functional classifications from GO.

A more liberal group of corrections, false discovery rates (FDR), measure the proportion of false discoveries in a multi-test experiment and gain a test-wide threshold by ranking observed p-values and comparing their relative rank to individual test thresholds [17]. FDR approaches are more promising, since some versions also allow partial dependencies in input data [17].
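The cumulative hypergeometric p-value and the two families of correction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used by DAVID or g:Profiler: the function names and the example counts are hypothetical, the FWER correction shown is plain Bonferroni, the FDR correction shown is the Benjamini-Hochberg step-up procedure, and neither the EASE score nor g:SCS is implemented. `math.comb` requires Python 3.8 or later.

```python
from math import comb


def hypergeom_enrichment_p(hits, list_size, term_size, genome_size):
    """P(X >= hits) when list_size genes are drawn from genome_size genes,
    of which term_size are annotated to the term being tested."""
    upper = min(list_size, term_size)
    favourable = sum(
        comb(term_size, i) * comb(genome_size - term_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(genome_size, list_size)


def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR control: rank the p-values and scale each by m / rank."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 2 of 3 listed genes carry a term held by 4 of 10 genes.
print(hypergeom_enrichment_p(hits=2, list_size=3, term_size=4, genome_size=10))

raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four terms
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

On the made-up p-values, the Bonferroni-adjusted values are never smaller than the Benjamini-Hochberg ones, which reflects the more conservative behaviour of FWER control noted in the text.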
g:Profiler also has an option for g:SCS (Set Counts and Sizes), which is used by default. This is a novel method to estimate thresholds in complex and structured functional profiling data such as GO, pathways and TFBS, where statistical significance is determined from set intersections in 2 × 2 contingency tables. g:SCS has been claimed to be superior to standard multiple testing methods, since it takes into account the actual structure behind functional annotations [276].

1.10 Summary of research

The research done in this thesis aimed to elucidate the effects of an epigenetic modification, H3K4me1. This histone modification was studied in multiple breast cancer cell lines. Functional groups were formed based on a comparison of breast cancer subtypes, or of tumourigenic vs. non-tumourigenic matched controls. This was to look for the involvement of this mark in cancer gene regulation. We formed this hypothesis based on previous evidence [115] in which regions formed by flanking H3K4me1 sites were found to be enriched for TF binding sites. RNA-seq was used to determine the expression levels of genes downstream of these valley regions. The functional groups were used to correlate uniquely marked valleys with overexpression. A motif analysis was done on the valley sequences using MEME and STAMP to yield putative transcription factor binding sites. The purpose of this experiment was to look for known and putative tumour suppressors and oncogenic factors.

Chapter 2 Materials and Methods

2.1 Cell lines

2.1.1 Fragmentation methods

These cell lines were prepared either with sonication or using micrococcal nuclease to fragment the DNA [253]. Sonication is generally believed to create randomly sized DNA fragments, with no section of the genome being preferentially cleaved. The fragments created by sonication, on average 500-700 base pairs, are typically larger than those created via enzymatic cleavage [61]. Sonication tends to break DNA segments across the fault lines which define nucleosome boundaries [197].
Enzymatic cleavage, in contrast, will not produce random sections of chromatin. Micrococcal nuclease favors certain areas of genome sequence over others and will not digest DNA evenly or equally [74]. Micrococcal nuclease (MNase) is the enzyme that catalyzes the endonucleolytic cleavage of DNA. In contrast to sonication, MNase-treated chromatin preparations show highly homogeneous lengths [179]. Also, enzymatic digestion of chromatin is milder than sonication and better preserves the integrity of the chromatin and antibody epitopes, which means increased IP efficiency [87].

2.1.2 Immunohistochemical properties

These cell lines represent different breast cancer subtypes, which results in differing immunohistochemical properties. Steroid receptors are useful for predicting the outcome of breast cancer and its response to therapy. They also help predict the relevance of cell line experiments to breast tissues of different types. Immunohistochemical markers with clinical importance include amplification of HER2 [296], as well as changes in EGFR, a tyrosine kinase receptor that is expressed in normal breast [68].

2.1.3 Cell lines used

Cell lines were used as experimental resources in this study. The cell lines used were MCF-7, BT-549, T-47D, MDA-MB-231, and Hs578T. These cell lines are widely studied and retain DNA mismatch repair activity. Defects in this process would result in an approximate 20-fold increase in obfuscating background mutations.

2.1.3.1 MCF7

MCF-7 is a luminal cell line that was derived from a pleural effusion from a 69-year-old woman who underwent two mastectomies in a five-year span [302]. These cells show low motility and are not metastatic [49]. The cells express E-cadherin, epidermal growth factor receptor, estrogen receptor, and progesterone receptor. MCF-7 cells express full-length functional BRCA1 [49]. The media used for this cell line was RPMI 1640 + 10% FBS + 1% L-Gln + 1% Pen/Strep. Sonication was used to break up the DNA.
2.1.3.2 T47D

T47D is a luminal cell line obtained from the pleural effusion of a 54-year-old woman with infiltrating ductal carcinoma [148]. T47D cells carry receptors for a variety of steroids and calcitonin. They express mutant tumour suppressor protein p53. The progesterone receptor (PR) is expressed constitutively and these cells are responsive to estrogen. They are able to lose the ER during long-term estrogen deprivation in vitro [132]. As a result, these cells are sometimes used as a model for studies of drug resistance to tamoxifen in patients with mutant p53 breast tumours. The cells are also HER2 positive. There is no evidence of BRCA1 mutations in this cell line [162]. The media used for this cell line was RPMI 1640 + 10% FBS + 1% L-Gln + 1% Pen/Strep. Sonication was used to break up the DNA.

2.1.3.3 BT549

BT-549 is a basal breast cancer cell line that was derived from a papillary, invasive ductal tumour of a 72-year-old woman that had metastasized to 3 of 7 regional lymph nodes [170]. BT-549 is ER, PR, and HER2 negative [142]. There is no evidence of BRCA1 mutations in this cell line [162]. The media used for this cell line was RPMI 1640 + 10% FBS + 1% L-Gln + 1% Pen/Strep. Sonication was used to break up the DNA.

2.1.3.4 MDA-MB-231

MDA-MB-231 is a basal cell line that was obtained from a pleural effusion of a 51-year-old female [34]. MDA-MB-231 expresses very low levels of both ER and PR and is categorized as HR-negative; HER-2/neu did not produce a statistically significant change in HR levels [88]. There is no evidence of BRCA1 mutations in this cell line [162]. The media used for this cell line was RPMI 1640 + 10% FBS + 1% L-Gln + 1% Pen/Strep. Sonication was used to break up the DNA.

2.1.3.5 HS578T

Hs578T was derived from a carcinosarcoma; it is epithelial and aneuploid, and lacks estrogen receptor protein. It is a basal cell line that was taken from a 74-year-old woman with invasive ductal carcinoma [105].
The breast tissue it was derived from was excised at surgery and showed an infiltrating ductal carcinoma. Hs578T cells are ER and PR negative, lack E-cadherin, and have low HER2/neu expression. There is no evidence of BRCA1 mutations in this cell line [162]. The media used for this cell line was RPMI 1640 + 10% FBS + 1% L-Glutamine + 1% Penicillin/Streptomycin. Sonication was used to break up the DNA.

2.1.3.6 HS578Bst

Hs578Bst is diploid and possibly of myoepithelial origin. It is a basal cell line that was taken from normal tissue distal to the region from which Hs578T was removed (in the same patient, in the same breast), and no tumour cells were identified in it. This made it a good control for Hs578T [105]. These cells are ER, PR and HER2 negative. The media used for this cell line was ATCC Hybri-Care Medium (Catalog No. 46-X). This medium was supplied as a powder and was reconstituted in 1 L of cell-culture-grade water and supplemented with 1.5 g/L sodium bicarbonate. To make the complete growth medium, the following components were added: 30 ng/ml mouse Epidermal Growth Factor (EGF) and fetal bovine serum to a final concentration of 10%. Enzymatic digestion with micrococcal nuclease (MNase) was used to cleave the DNA into smaller fragments.

2.2 Aligning sequence reads to reference genome

Sequence reads of 27 bp or 32 bp derived from Illumina 1G sequencers were aligned to the NCBI reference human genome (hg18) using MAQ. MAQ was used successfully in previous large-scale experiments [18, 185] and was a good choice for alignment at the time it was used. Today, we would perhaps use an aligner such as Bowtie, which exhibits a large performance advantage over MAQ at a slight cost in accuracy [82]. Only sequence reads that aligned to unique genomic locations were retained. The alignment was done by Richard Varhol.
2.3 Filtering reads

Any reads whose sequences were similar to the sequences of gel size selection ladders or sequencing adapters were removed from the alignment output. All sets of multiple reads that corresponded to a single DNA fragment start were collapsed into a single read.

2.4 Identifying enriched regions

2.4.1 Vancouver Short Read (Find Peaks 4)

Enrichment profiles were generated with Find Peaks v.4.0.15, which is available at http://sourceforge.net/projects/vancouvershortr/. A MAQ read size of 128 was used. A triangle distribution was used to weight the contribution of bases in the reads. A mappable genome fraction of 0.7 was used, based on previous estimates done at the Genome Sciences Centre. Five iterations were used for the FindPeaks runs. A subpeak value of 0.2 and a trim value of 0.2 were used to separate subpeaks and trim the sides of peaks. To reduce the amount of noise after running Find Peaks 4, a peak height threshold was used based on an FDR value of 0.01.

2.4.2 Saturation

To assist in generating the saturation plot in Figure 3.1, Find Peaks v.2 by Mikhail Bilenky was used.

2.5 Valley regions

Flanking H3K4me1 peaks were searched for genome-wide in the promoter regions of genes, 2.5 kb upstream of the transcription start site. The two flanking peaks were separated by no more than 1000 bp, and the centre 80% of the peak-to-peak region is defined as the valley.

2.6 Concordance

To determine the enrichment of transcription factor binding sites in Table 3.2 and to extract the sequences for Tables 3.23-3.26, two packages were used: the SequenceExtractor package by Mikhail Bilenky and the BedTools package [267], available at http://code.google.com/p/bedtools/.

2.7 Expression

RNA-Sequencing (RNA-seq) data was obtained to further characterize the effects of the epigenetic changes.
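As an aside on the valley definition of Section 2.5, the rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in this thesis: each peak is represented by a single coordinate (for example its summit) on one chromosome, the 1000 bp limit is measured between those coordinates, and the function names `find_valleys` and `overlaps_promoter` are hypothetical.

```python
def find_valleys(peak_positions, max_separation=1000, centre_fraction=0.8):
    """Return (start, end) valleys between adjacent flanking peaks.

    Assumption: each peak is given as one coordinate on a single chromosome.
    """
    valleys = []
    ordered = sorted(peak_positions)
    for left, right in zip(ordered, ordered[1:]):
        span = right - left
        if 0 < span <= max_separation:
            # Keep the centre 80% of the peak-to-peak region by trimming
            # an equal margin from each side.
            margin = span * (1 - centre_fraction) / 2
            valleys.append((round(left + margin), round(right - margin)))
    return valleys


def overlaps_promoter(valley, tss, strand, upstream=2500):
    """True if the valley overlaps the region 2.5 kb upstream of the TSS."""
    start, end = valley
    if strand == "+":
        promoter_start, promoter_end = tss - upstream, tss
    else:
        promoter_start, promoter_end = tss, tss + upstream
    return start <= promoter_end and end >= promoter_start


# Illustrative coordinates only: the first two peaks are 500 bp apart and
# flank a valley, while the third peak is too far away to form one.
valleys = find_valleys([1000, 1500, 4000])
print(valleys)
print([v for v in valleys if overlaps_promoter(v, tss=2000, strand="+")])
```

Representing peaks as intervals instead of single coordinates would only change how the span is measured; the trimming step stays the same.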
RNA-Seq provides a far more precise measurement of the levels of transcripts and their isoforms than other methods [325]. RNA-Seq also shows high levels of reproducibility, for both technical and biological replicates [325]. The Genomic Alignment Analysis package of Find Peaks 4 was used to get the number of reads per gene isoform. When comparisons between cell lines or groups of cell lines were done, there may be several splice isoforms per gene. To simplify the comparison, the splice isoform with the highest expression in either group was chosen and then used for the comparison in both groups. Expression data was expressed in terms of reads per kilobase of exon model per million mapped reads (RPKM). The expression changes are at least two-fold, with genes with pairs of low expression values eliminated. This threshold for expression change should filter all but the most significant results [10].

2.8 Motifs

Motifs were searched for in the valleys where a unique valley coincided with overexpression in one of the cell lines. MEME [12] was used to search for conserved regions between 6 and 15 bp. This range was chosen based on previous research reporting that motifs are typically fairly short (5 to 20 bp long) [55] or typically about 10 bp long [313]. A site of conservation needed to occur in 5 or more promoter regions to be considered in this analysis. Twenty such sites were retrieved per category. A search was then performed to check whether any of the conserved regions matched known motifs. STAMP [207] was used to identify known transcription factors with the JASPAR v2010 motif set. Bonferroni correction was used to address the problem of multiple comparisons in these data [254]. This is a conservative test [22]. Matches with low complexity or with p > 1 × 10^-3 were discarded. Figures of valley regions in the promoter regions were obtained using the UCSC genome browser [147], http://genome.ucsc.edu/.
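The expression comparison of Section 2.7 can be sketched as follows. This is a minimal illustration, not the code used in this thesis: it assumes the standard RPKM definition (reads per kilobase of transcript per million mapped reads), and the function names and the `min_expression` cutoff used to drop pairs of low values are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def pick_isoform(isoforms_a, isoforms_b):
    """Pick the isoform with the highest RPKM in either group.

    isoforms_a, isoforms_b: dicts mapping isoform id -> RPKM.
    """
    candidates = set(isoforms_a) | set(isoforms_b)
    return max(
        candidates,
        key=lambda iso: max(isoforms_a.get(iso, 0.0), isoforms_b.get(iso, 0.0)),
    )


def overexpressed_in(rpkm_a, rpkm_b, min_fold=2.0, min_expression=1.0):
    """Return 'a' or 'b' for an at-least-two-fold change, else None.

    min_expression is an assumed cutoff: pairs where both values are low
    are eliminated before the fold change is considered.
    """
    if max(rpkm_a, rpkm_b) < min_expression:
        return None
    if rpkm_a >= min_fold * rpkm_b:
        return "a"
    if rpkm_b >= min_fold * rpkm_a:
        return "b"
    return None


# 500 reads on a 2 kb transcript in a library of 5 million mapped reads.
print(rpkm(500, 2000, 5_000_000))
print(pick_isoform({"iso1": 3.0, "iso2": 12.0}, {"iso1": 40.0, "iso2": 1.0}))
print(overexpressed_in(50.0, 10.0))
```

Comparing with multiplication instead of dividing one RPKM by the other avoids a division by zero when a gene has no reads in one group.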
2.8.1 Association of valley marked genes with breast cancer tumourigenesis

Table 3.3 used the Genes-to-Systems Breast Cancer database, available at http://www.itb.cnr.it/breastcancer/, to associate valley marked genes with breast cancer.

2.8.2 Functional analysis

g:Profiler, available at http://biit.cs.ut.ee/gprofiler/, was used to obtain these functional data. Data in this resource is derived from several sources. The Gene Ontology database [9], http://www.geneontology.org/, is used to obtain the gene ontology categories. KEGG [134], http://www.genome.jp/kegg/, describes the pathways these genes share. MiRBase [159], http://www.mirbase.org/, is a searchable database of published miRNA sequences and annotation. Bonferroni correction was used for multiple testing correction in these cell lines.

Chapter 3 Results

3.1 Note regarding contributions

In this study, Yongjun Zhao did all of the preparation of the libraries for ChIP-Sequencing. Richard Varhol did the alignment of the sequencing reads to the reference genome. Mikhail Bilenky wrote Find Peaks 2 and Anthony Fejes wrote Find Peaks 4. I did the bioinformatics analyses defining valley regions. I found a control for the cell line HS-578T and added it to enable match-controlled analysis. I found the overlap of these regions with various databases containing breast cancer genes or transcription factors. I found the concordance of valleys amongst cell lines. I calculated expression levels in RPKM from the RNA-sequencing data and chose an appropriate transcript to use for each gene. I performed all of the motif analysis. Dr. Steven Jones conceived the study.

3.2 ChIP sequencing Quality

3.2.1 Tally of Reads and Peaks

Several cell lines were analyzed with ChIP-sequencing. Reads were generated and aligned to the Human Mar. 2006 (NCBI36/hg18) genome assembly. The reads were overlapped to create islands. The Vancouver Short Read Analysis Package [79] created peaks from these islands.
Peaks that did not pass an FDR threshold of 0.01 were discarded to reduce noise. Table 3.1 shows the cell lines used, their reads, and the enriched islands of reads, or peaks, generated by the Vancouver Short Read Analysis Package.

Table 3.1: Tally of Reads and Peaks

Cell line      Reads       Peaks
MDA-MB-231     6774327     20791
BT-549         4384352     522727
HS-578T        4747582     501543
T-47D          7065557     770301
MCF-7          5972111     641704
Sum-149        2868543     670431
PC9            5308285     534586
HS-578Bst      10182518    751500

3.2.2 Saturation curves

A saturated library refers to a library with enough reads such that almost all of the peaks have been discovered. Depending on the library, the initial reads allow new areas of enrichment to be discovered. With the addition of more reads, the library nears saturation. Then, rather than new peaks being discovered, deeper sequencing of known peaks occurs [149]. Simulation is used to estimate binding saturation. By running the peak-calling algorithm on smaller random subsets of the set of sequence reads, the number of detected regions (on the y axis) can be plotted against the number of reads (on the x axis). This will often result in a curve that rises rapidly in the beginning but then starts to saturate. The curve can be extrapolated to estimate the number of sequenced reads at which saturation will start to appear [195].

In Figure 3.1 we see that MCF-7 is an example of a fully saturated library. It starts saturating at approximately 2.5 million reads. There, the number of regions per read levels off to a plateau. This indicates that we would find no new regions with deeper sequencing. A library such as HS-578Bst starts to saturate but has not yet quite reached saturation, even with a large number of reads. Thus the library is deeply sequenced but noisy.

Figure 3.1: Combined Saturation plots. This figure was generated using Find Peaks 2 and a modified MATLAB script, saturation.m, both created by Mikhail Bilenky.
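The subsampling procedure behind a saturation curve can be sketched as follows. This is a minimal illustration under loud assumptions: the figure above was produced with Find Peaks 2 and saturation.m, whereas here a toy stand-in, `count_enriched_bins`, replaces the peak caller by counting fixed-width bins that hold at least a minimum number of reads. All names, the bin size, and the read threshold are hypothetical.

```python
import random


def count_enriched_bins(read_positions, bin_size=200, min_reads=5):
    """Toy stand-in for a peak caller: count bins with at least min_reads reads."""
    counts = {}
    for position in read_positions:
        bin_index = position // bin_size
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return sum(1 for count in counts.values() if count >= min_reads)


def saturation_curve(read_positions, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Return (number of reads, number of detected regions) per random subset."""
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    curve = []
    for fraction in fractions:
        n_reads = int(len(read_positions) * fraction)
        subset = rng.sample(read_positions, n_reads)
        curve.append((n_reads, count_enriched_bins(subset)))
    return curve


# Simulated library: 50 enriched sites with 40 reads each, plus background.
rng = random.Random(1)
reads = [site * 10_000 + rng.randrange(200) for site in range(50) for _ in range(40)]
reads += [rng.randrange(500_000) for _ in range(500)]
for n_reads, n_regions in saturation_curve(reads):
    print(n_reads, n_regions)
```

Plotting the second column against the first gives the kind of curve described above; a plateau at the higher read counts would suggest that the library is saturated.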
3.3 Enrichment of TF binding sites in H3K4me1 marked motifs

The ORegAnno database (Open REGulatory ANNOtation) [223] contains known regulatory elements curated from the scientific literature. Table 3.2 correlates valley regions in the promoters of genes with regulatory regions found in ORegAnno. To control for chance overlap with valley regions, the prevalence of ORegAnno regions in the entire genome was calculated. To do this, the ORegAnno regions were shuffled randomly in the genome and the overlap with valleys was calculated again. In a thousand repetitions, the percentage overlap of true ORegAnno sites was always greater than that of regions of the same length randomly placed in the genome. Table 3.2 shows these results; valleys are significantly enriched for ORegAnno regulatory regions (p < 1 × 10^-3).

Table 3.2: Enrichment of TF binding sites in valleys

Cell line     Group      ORegAnno overlap   Valleys   Overlap (%)   Randomized overlap (%)
MDA-MB-231    Basal      6                  138       4.35          0.18
BT549         Basal      289                3825      7.56          0.35
HS578T        Basal      293                2543      11.52         0.37
MCF7          Luminal    293                3543      8.27          0.36
T47D          Luminal    245                2417      10.14         0.38
HS578T        Cancer     297                2553      11.63         0.38
HS578BST      Control    197                2256      8.73          0.33

3.4 Correlation of Valleys with Downstream Genes

A study by Hoffman et al. found that flanking H3K4 monomethylation peaks mark sites of putative transcription factor binding [115]. To look for evidence of this binding, a search was done to find correlation of valleys in promoter regions 2.5 kb upstream of the TSS with downstream genes. This search looked for functional relevance of genes downstream of marked promoters.

3.4.1 Association of valley marked genes with breast cancer tumourigenesis

Next, we examine the set of genes whose promoters contain a valley region, or more simply, valley marked genes. To look for evidence of association of valley marked genes with breast cancer tumourigenesis and progression, we compare the genes downstream of the valleys with genes found in the G2SBC (Genes to Systems Breast Cancer) [227] database.
G2SBC is an integration of many sources, such as NCBI, Breast Cancer Database, UniProt, InterPro, KEGG, BioGRID, and Gene Ontology. In Table 3.3, there is an enrichment (p = 3.0 × 10^-15) of breast cancer-related genes in the set marked by H3K4 monomethylation.

Table 3.3: Proportion of breast cancer genes of the set of genes marked with H3K4me1 valleys

                        Total     Marked
Ensembl Transcripts     63281     12466
Breast Cancer Genes     2180      1322
Percent (%)             3.4       10.6

3.5 Concordance of valleys between cell lines

Previous studies on the fidelity of cell lines to primary breast tumours showed that cell lines tend to mirror the modifications of the tumours from which they are derived [236]. These analyses used multiple different cell lines, which gave us the opportunity of grouping the cell lines and looking for differences between those groups. The two groupings done were a comparison of a cancer cell line vs. a matched control cell line, and of luminal cell lines vs. basal cell lines.

3.5.1 Concordance between breast cancer cell line and a matched control

We used two cell lines, HS578T and HS578Bst, that were derived from the same breast in the same patient [105]. HS578T was the tumourigenic cell line, and HS578Bst was a non-tumourigenic cell line taken from a distal location with no tumour cells identified in it.

Figure 3.2: Overlap of valley regions in tumourigenic cell line vs. control (Venn diagram: Cancer only 1710, shared 373, Control only 2180).

Table 3.4: Concordance of valleys in match controlled cell lines

Shared    Unique to Cancer cell line    Unique to Control cell line
373       1710                          2180

In Table 3.4, we can see the concordance of valley regions in the promoter regions 2.5 kb upstream of the TSS in a matched pair of control and cancer cell lines. The lack of shared valleys is consistent with the hypothesis that these flanking H3K4me1 peaks mark transcription factor binding sites, whose binding affects the genes downstream. Since we expect many genes to be regulated in opposite ways in the cancer vs.
the control cell line, it is not surprising that more H3K4me1 marked genes are unique to a cell line than are shared between the two cell lines.

3.5.2 Concordance among various luminal and basal breast cancer cell lines

3.5.2.1 Breast cancer subtypes

Multiple cell lines representing different breast cancer subtypes were used. The cell lines are representative of the two subtypes of breast cancer and are shown in Table 3.5. The cell lines BT549 and HS578T were chosen to represent the basal subtype, due to the abnormally low number of valleys from the cell line MDA-MB-231 in Table 3.2. The overlap in the 4 cell lines representing the basal and luminal breast cancer subtypes is shown in Table 3.6a.

Table 3.5: Cell lines by breast cancer subtype

Basal         Luminal
MDA-MB-231    MCF7
BT549         T47D
HS578T

3.5.2.2 Concordance with the same subtype

One would expect the two basal cell lines, and the two luminal cell lines, to share more monomethylation marks with each other than luminal cell lines share with basal cell lines. When we examine the pairwise overlap of valleys in Table 3.6a, we see that this does not hold true for all of these cell lines. Instead, all the cell lines have the most overlap with BT549, the cell line with the largest total number of valleys. Other than overlaps with BT549, the next highest overlap is between MCF7 and T47D, two luminal cell lines. In Table 3.6b, we see these pairwise values expressed as a fraction of the valleys of either of the two cell lines being compared. Here, the largest fractions are still found when cell lines overlap with BT549; however, the next highest value is an overlap between MCF7 and T47D.

3.5.2.3 Valleys shared by all cell lines

Table 3.7 shows the genes in which all cell lines are marked in the promoter with H3K4me1.
Of these genes, four (CTDSPL, BLCAP, CITED1, and PCDH8) are found in the Genes-to-Systems Breast Cancer Database, a database of genes that have a role in breast cancer and carry a molecular alteration such as DNA amplification, deletion, insertion, altered protein isoform, altered RNA expression or an RNA splice variant [227]. CTDSPL is the CTD (carboxy-terminal domain, RNA polymerase II, polypeptide A) small phosphatase-like protein. This gene is a tumour suppressor, and in previous studies missense and nonsense mutations in this gene were found in tumours [140].

Table 3.6: Overlap of valleys in promoter regions of luminal and basal cell lines

(a) Pairwise overlap of valleys in promoter regions of luminal and basal cell lines

                      Basal     Basal     Luminal   Luminal
                      BT549     HS578T    MCF7      T47D
Basal     BT549       3983      621       634       717
Basal     HS578T                2631      465       543
Luminal   MCF7                            3645      599
Luminal   T47D                                      2495

(b) Pairwise overlap of fraction of valleys in promoter regions of luminal and basal cell lines

                      Basal     Basal     Luminal   Luminal
                      BT549     HS578T    MCF7      T47D
BT549                 1.000     0.156     0.159     0.180
HS578T                0.236     1.000     0.177     0.206
MCF7                  0.174     0.128     1.000     0.164
T47D                  0.287     0.218     0.240     1.000

(c) Overlapping valleys in 3 luminal and basal cell lines

Overlapping cell lines       Valleys
BT549, HS578T, MCF7          166
BT549, HS578T, T47D          164
BT549, MCF7, T47D            161
HS578T, MCF7, T47D           120

(d) Overlapping valleys in all 4 luminal and basal cell lines

Overlapping cell lines       Valleys
BT549, HS578T, MCF7, T47D    48

BLCAP, bladder cancer associated protein, is a tumour suppressor gene originally identified from human bladder carcinoma. Previous studies have found editing events that alter the highly conserved amino terminus of the protein [91]. CITED1 is the Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain, 1. A study of CITED1 knockout mice identified a subset of estrogen-responsive genes displaying altered expression in the absence of CITED1 [216].
Maintenance of the ERalpha-CITED1 co-regulated signalling pathway in breast tumours can indicate good prognosis. PCDH8, protocadherin 8, is a candidate tumour suppressor of breast cancer. Loss of PCDH8 expression is associated with loss of heterozygosity, partial promoter methylation, and increased proliferation. It is thought that loss of PCDH8 promotes oncogenesis in epithelial human cancers by disrupting cell-cell communication dedicated to tissue organization and repression of mitogenic signaling [341].

3.5.3 Concordance between a set of luminal and a set of basal breast cancer cell lines

Table 3.8 indicates the overlap in genes marked in the promoter region of two basal or two luminal cell lines. Overlapping valleys in different cell lines with the same breast cancer subtype were merged and counted as one region. Valleys in Table 3.8 seem to be split between those that have effects in all breast cancer and those that have subtype-specific effects. In contrast to Table 3.4, Table 3.8 seems to indicate that about as many valleys are shared between subtypes as are unique to them. This may indicate that H3K4me1 marks have a stronger effect on tumourigenesis and tumour progression in general than on breast cancer subtype-specific functions.

Figure 3.3: Overlap of valley regions by breast cancer subtype (Venn diagram of basal and luminal valleys; region counts 1990, 2006, and 2210).

3.6 Unique valleys in promoter regions of overexpressed genes

3.6.1 Defining marked overexpressed categories

RNA-seq experiments were done on all of the breast cancer cell lines used. To further elucidate the functions of the H3K4me1 valleys, we correlate them with expression data (Tables 3.9 and 3.10). The expression changes are at least two-fold, with genes with pairs of the lowest 20% expression values eliminated. This threshold for expression change should filter all but the most significant results [10]. First, we use the tumourigenic cell line HS578T and its matched control HS578Bst, shown in Table 3.9.
By correlating expression we generate the following four gene categories:

1. Cancer marked, cancer overexpressed: Marked in the cancer cell line with an H3K4me1 valley and overexpressed in the cancer cell line.
2. Cancer marked, control overexpressed: Marked in the cancer cell line with an H3K4me1 valley and overexpressed in the control cell line.
3. Control marked, cancer overexpressed: Marked in the control cell line with an H3K4me1 valley and overexpressed in the cancer cell line.
4. Control marked, control overexpressed: Marked in the control cell line with an H3K4me1 valley and overexpressed in the control cell line.

Corresponding gene categories were also generated with the luminal and basal cell lines shown in Table 3.10:

1. Luminal marked, luminal overexpressed: Marked in the luminal cell line with an H3K4me1 valley and overexpressed in the luminal cell line.
2. Luminal marked, basal overexpressed: Marked in the luminal cell line with an H3K4me1 valley and overexpressed in the basal cell line.
3. Basal marked, luminal overexpressed: Marked in the basal cell line with an H3K4me1 valley and overexpressed in the luminal cell line.
4. Basal marked, basal overexpressed: Marked in the basal cell line with an H3K4me1 valley and overexpressed in the basal cell line.
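The four categories listed above amount to set operations on gene lists. The following is a minimal sketch, assuming that each input is a set of gene identifiers and that "marked" means uniquely marked in one group; the function name `categorize` and the gene names in the example are placeholders, not data from this thesis.

```python
def categorize(marked_a, marked_b, over_a, over_b, name_a="cancer", name_b="control"):
    """Split uniquely marked genes into the four marked/overexpressed categories.

    marked_a, marked_b: genes with a valley in the promoter in group a or b.
    over_a, over_b: genes overexpressed in group a or b.
    """
    unique_a = marked_a - marked_b  # valley only in group a
    unique_b = marked_b - marked_a  # valley only in group b
    return {
        f"{name_a} marked, {name_a} overexpressed": unique_a & over_a,
        f"{name_a} marked, {name_b} overexpressed": unique_a & over_b,
        f"{name_b} marked, {name_a} overexpressed": unique_b & over_a,
        f"{name_b} marked, {name_b} overexpressed": unique_b & over_b,
    }


# Placeholder gene identifiers for illustration only.
categories = categorize(
    marked_a={"G1", "G2", "G3"},
    marked_b={"G3", "G4", "G5"},
    over_a={"G1", "G4"},
    over_b={"G2", "G5"},
)
for label, genes in categories.items():
    print(label, sorted(genes))
```

Calling the same function with `name_a="luminal"` and `name_b="basal"` gives the corresponding subtype categories.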
Table 3.7: Overlap of valleys in promoter regions of luminal and basal cell lines

Hugo Gene    Description
TPR          translocated promoter region
CCDC30       coiled-coil domain containing 30
CCDC18       coiled-coil domain containing 18
CGREF1       cell growth regulator with EF-hand domain 1
POLQ         polymerase, theta
CTDSPL       CTD small phosphatase-like
CGGBP1       CGG triplet repeat binding protein 1
POLK         polymerase kappa
REEP2        receptor accessory protein 2
BOD1         biorientation of chromosomes in cell division 1
KIF6         kinesin family member 6
C6orf138     Patched domain-containing protein C6orf138
PRIM2        primase, DNA, polypeptide 2
OLFML2A      olfactomedin-like 2A
ZFAND5       zinc finger, AN1-type domain 5
LHX6         LIM homeobox 6
CITED1       Cbp/p300-interacting transactivator 1
CCNY         cyclin Y
CCDC6        coiled-coil domain containing 6
WDR74        WD repeat domain 74
AGAP2        ArfGAP with GTPase domain, ankyrin repeat and PH domain 2
PCDH8        protocadherin 8
NLRC5        NLR family, CARD domain containing 5
CRHR1        Corticotropin-releasing factor receptor 1 Precursor
ADAMTSL5     ADAMTS-like 5
HAUS5        HAUS augmin-like complex, subunit 5
GGN          gametogenetin
SPRED3       sprouty-related, EVH1 domain containing 3
ZNF283       zinc finger protein 230
BLCAP        bladder cancer associated protein
SEPT5        septin 5
SMC1B        structural maintenance of chromosomes 1B
EMID         N/A

Table 3.8: Valleys shared between breast cancer subtypes

           Basal    Luminal
Total      4216     3996
Unique     2210     1990
Shared     2006     2006

3.6.2 Tally of unique valleys in promoter region of overexpressed genes

3.6.2.1 Breast cancer subtype specific valleys

Table 3.11 is a tally of the four categories of subtype-specific valleys that were in the promoter regions of overexpressed genes. Since there were five cell lines to use for this analysis, promoters with valleys in at least two cell lines of the same subtype were used.

3.6.2.2 Tumourigenic valleys

A similar analysis was done on HS578T and its matched control HS578Bst (results shown in Table 3.12).
We see an average of 60 valley regions found in each category.

3.6.3 Tally of uniquely marked overexpressed genes

The number of overexpressed genes that are marked uniquely in a promoter in at least two cell lines is shown in Table 3.13. Similarly, the number of genes that are uniquely marked with at least two valley regions in the match-controlled cell lines is shown in Table 3.14.

Table 3.9: Categories correlating expression with H3K4me1 mark in tumourigenic and non-tumourigenic cell lines

H3K4me1 valley          Expression
Cancer      Control     in Cancer
×                       ↑
×                       ↓
            ×           ↑
            ×           ↓

Table 3.10: Categories correlating expression with H3K4me1 mark in luminal and basal cell lines

H3K4me1 valley          Expression
Luminal     Basal       Luminal     Basal
×                       ↑
×                                   ↑
            ×           ↑
            ×                       ↑

Table 3.11: Number of valleys in the promoter region marking overexpressed genes in breast cancer by subtype

                       Over-expression
Marked cell line       Basal     Luminal
Basal                  131       116
Luminal                100       104

Table 3.12: Valleys in promoters of genes correlated with overexpression in match-controlled cell lines

                       Over-expression
Marked cell line       Cancer    Control
Cancer                 55        81
Control                47        62

Table 3.13: Uniquely marked genes correlated with overexpression by breast cancer subtype

                       Over-expression
Marked cell line       Basal     Luminal
Basal                  53        44
Luminal                42        46

Table 3.14: Uniquely marked genes correlated with overexpression in match-controlled cell lines

                       Over-expression
Marked cell line       Cancer    Control
Cancer                 45        61
Control                42        52

3.7 Functional analysis

3.7.1 Functional analysis of basal and luminal cell lines

A functional analysis was done using g:Profiler [276]. This resource includes data from Gene Ontology, KEGG, and miRBase. This analysis was done for all four categories of marked overexpression. The p-values listed are multiple-testing corrected. The individual genes and select significantly enriched functions are listed in Tables 3.15-3.22.
3.7.1.1 Functional analysis of basal marked basal overexpressed genes

In Table 3.19 we see that there are many gene ontology categories enriched in basal marked basal overexpressed genes. One of the functions observed in this analysis is metastasis. Metastasis involves the spread of cancer from its primary site to other places in the body. Some metastatic functions seem to be revealed by this analysis. For example, five of the genes are associated with the focal adhesion (p = 1.1 × 10^-2) KEGG [135] pathway. Also, the GO terms integrin-mediated signaling pathway (p = 7.8 × 10^-4) and integrin binding (p = 9.5 × 10^-4) provide evidence that these genes may be involved in a breakdown of adhesion.

Some of the functional categories may indicate involvement in angiogenesis. As a tumour gets bigger, it is less able to sufficiently access the blood vessels. The generation of vascular stroma is thus essential for solid tumour growth [29]. Vascular stroma formation is evident in two GO categories, vasculature development (p = 2.8 × 10^-5) and blood vessel morphogenesis (p = 8.1 × 10^-5). Heparin binding was also enriched in these genes (p = 6.5 × 10^-4).

MicroRNAs are regulatory, non-coding RNAs about 22 nucleotides in length. They control gene expression by targeting mRNAs and triggering either translation repression or RNA degradation. In previous studies, miRNAs were identified whose expression was correlated with specific breast cancer biopathologic features, such as estrogen and progesterone receptor expression, tumour stage, vascular invasion, or proliferation index [124]. The microRNA miR-586 (p = 6.4 × 10^-3) is enriched in this analysis [248].

3.7.1.2 Functional analysis of basal marked luminal overexpressed genes

Table 3.20 shows enriched miRNAs. The miRNA miR-351 is reported to be specific to mice and rats [159]. It belongs to the miR-125 family, shown to perform varied roles in development, cancer and inflammation.
The miRNA miR-351 regulates genes involved in the TNF-α signaling pathway [225].

3.7.1.3 Functional analysis of luminal marked basal overexpressed genes

In Table 3.21, we see only actin binding (p = 2.2 × 10^-4) as an enriched GO category.

3.7.1.4 Functional analysis of luminal marked luminal overexpressed genes

In Table 3.22, we see that two microRNAs were found at multiple genes, miR-486 (p = 4.2 × 10^-4) and miR-542-5p (p = 4.2 × 10^-4). MiR-486 was found in a breast study to be upregulated in grade 3 vs. grade 1/2 tumours and upregulated in IBC vs. non-IBC tumours [59]. The miRNA miR-542-5p was thought to be a putative tumour suppressor discovered in neuroblastoma [290].

3.7.2 Functional analysis of cancer and control cell lines

3.7.2.1 Functional analysis of control marked cancer overexpressed genes

Table 3.15 shows an enrichment in the KEGG term cell cycle (p = 5.9 × 10^-3). Similarly, there is an enrichment of the Reactome term cell cycle, mitotic (p = 1.6 × 10^-3). In addition, enriched GO terms include cell cycle process (p = 2.6 × 10^-9), mitosis (p = 2.1 × 10^-4), regulation of cell cycle (p = 2.0 × 10^-4), and cell cycle checkpoint (p = 1.4 × 10^-3). One could describe cancer as a disease of mitosis. A breakdown in normal checkpoints results in unregulated growth.

DNA packaging (p = 8.3 × 10^-4) was another GO term that was enriched. DNA is associated with many proteins that organize and package it. The proteins and complexes can affect accessibility of DNA or modulate transcription factor binding. Genes involved in DNA packaging may thus also be involved in cancer progression.

3.7.2.2 Functional analysis of cancer marked control overexpressed genes

The miRNA miR-650 (p = 3.2 × 10^-3) is enriched in this analysis. In other studies miR-650 was found to be downregulated in colon cancer [297]. miR-491-5p was also enriched (p = 4.9 × 10^-3).
A study found miR-491-5p expression was induced by TGF-β1 through the MEK/p38 MAPK pathway [345]. This microRNA down-regulates the expression of Par-3 through a binding site in the 3′ UTR, and thus disrupts cell junction integrity.

3.7.2.3 Functional analysis of cancer marked cancer overexpressed genes

The miRNA miR-7-1 (p = 7.0 × 10^-3) is enriched in this analysis. Previous studies have shown miR-7 to be correlated with genes that had predicted chromosomal instability [84]. Also, miR-7 was linked to cell cycle deregulation in breast cancer [84]. Alternatively, miR-7 inhibited expression of p21-activated kinase 1, an invasion-promoting kinase up-regulated in multiple cancer types [273]. Transfection of miR-7 was found in previous studies to inhibit the motility, invasiveness, anchorage-independent growth, and tumourigenic potential of highly invasive breast cancer cells [273].

3.7.2.4 Functional analysis of control marked control overexpressed genes

The ability of tumour cells to invade tissue requires that the tumour cell be able to traverse the basement membrane and extracellular matrix [301]. Three GO terms, Extracellular region (p = 1.5 × 10^-6), Extracellular matrix part (p = 4.4 × 10^-5), and Basement membrane (p = 1.0 × 10^-4), possibly indicate that invasion is occurring.

The microRNA miR-130b (p = 1.0 × 10^-2) is enriched in this analysis. MiR-130b is a tumour-suppressing microRNA and there is a down-regulation of miR-130b in metastatic breast cancers. TAp63 suppresses tumourigenesis and metastasis, and coordinately regulates Dicer and miR-130b to suppress metastasis [305]. A conflicting study found miR-130b upregulated in breast cancer with metastasis and in grade 3 vs. grade 1/2 tumours [59].
BUB1 C13ORF3 IL8 RSPO1 × × × × × × × p = 2.0 × 10−4 p = 1.4 × 10−3 p = 8.3 × 10−4 Regulation of cell cycle Cell cycle checkpoint DNA packaging ZNF238 × PCDH7 × × GPATCH4 C8ORF38 PDLIM3 INO80C Continued on next page Cell cycle, mitotic Reactome p = 1.6 × 10−3 p = 2.1 × 10−4 Mitosis miRNA p = 9.7 × 10−5 p = 2.6 × 10−9 Cell cycle process GO miR-144 p = 5.9 × 10−3 Cell cycle CHAPTER 3. RESULTS 61 Table 3.15: Control marked cancer overexpressed genes PAGE2 C4orf46 HSPA1A FAM36A MT1X INSIG1 CADM1 × CHAPTER 3. RESULTS 62 Cell cycle, mitotic miR-144 × DNA packaging Mitosis × Cell cycle checkpoint Cell cycle process × Regulation of cell cycle Cell cycle Control marked cancer overexpressed genes (cont.) PAGE2 NCAPG2 × × PSPH FANCD2 × × × ANP32E SERBP1 TCF19 CDCA8 × × × × × HAT1 × PKMYT1 × DLGAP5 × × × × HNRNPR × RPP40 × ZNF706 CENPA × × SMC4 × × TTK × × KRT18 × × × × × × × × RPA3 × H2AFV RBBP8 × × × × × × × CENPM TGFB2 Continued on next page × CHAPTER 3. RESULTS 63 Cell cycle, mitotic miR-144 DNA packaging Cell cycle checkpoint × Regulation of cell cycle Cell cycle process × Mitosis Cell cycle Control marked cancer overexpressed genes (cont.) PAGE2 GTSE1 × HMMR SNAI2 Table 3.16: Cancer marked control overexpressed genes p = 4.9 × 10−3 miR-491-5p miR-650 p = 3.2 × 10−3 miRNA × FER1L4 PSG4 HOXC6 PDCD1LG2 C15ORF52 × ADAMTSL1 IRX3 SEZ6L2 Continued on next page × CHAPTER 3. RESULTS 64 miR-491-5p miR-650 Cancer marked control overexpressed genes (cont.) × FER1L4 MRGPRF × GAA TMEM129 IL7R GDNF ANGPTL4 × MAP1A CITED2 ESM1 PBXIP1 NCSTN × NMRAL1 COL8A1 × ECM1 LHX9 CRABP2 × IFITM3 SLC16A3 NAAA ITM2B ITGA7 × NT5E MYH11 × STARD13 NES Continued on next page × CHAPTER 3. RESULTS 65 miR-491-5p miR-650 Cancer marked control overexpressed genes (cont.) × FER1L4 REEP2 DLG4 SLC44A2 CD68 ISLR TUBA1 PSG2 XG CPXM2 PLAGL1 PCTK3 × H2AFJ WSB1 HPS1 ENG × ARSA TIMP3 HSD3B7 TBC1D2 × × SLC22A17 × × GPR137B P4HA2 × DGKA HAGH DCN Continued on next page miR-650 miR-491-5p CHAPTER 3. 
RESULTS 66 miR-650 miR-491-5p Cancer marked control overexpressed genes (cont.) × FER1L4 RHBDF1 TMEM98 KIAA1539 × × Table 3.17: Cancer marked cancer overexpressed genes p = 7.0 × 10−3 p = 2.1 × 10−3 miR-543 miR-7-1 p = 2.5 × 10−3 miR-606 miRNA SLC12A8 CSNK2B TIMM23B CBWD1 × × CK17 TUT1 C1ORF110 × AP1S2 MLF1 RPS7 POLR2J PLK1 Continued on next page × × CHAPTER 3. RESULTS 67 miR-543 miR-7-1 miR-606 Cancer marked cancer overexpressed genes (cont.) × × SLC12A8 KIAA0101 × PSIP1 GEM ACTG2 NRG1 EEF1A1 × × UBC CEP55 TROAP × POSTN NASP EMILIN2 × CDKN2C ZWINT TJP2 × × × × × × LPHN2 SIPA1L2 × LOXL3 ECT2 LMNB1 × RAD51AP1 FOXM1 CNTNAP3 × MEST PBEF1 TNNT1 Continued on next page × CHAPTER 3. RESULTS 68 miR-7-1 miR-543 miR-606 Cancer marked cancer overexpressed genes (cont.) SLC12A8 CDC45L MTMR2 WDR62 × TMEM48 × CENPQ IL32 POLR2J4 Table 3.18: Control marked control overexpressed genes p = 1.0 × 10−4 Basement membrane ITGBL1 × PSAP × LAMB3 × p = 1.0 × 10−2 p = 4.4 × 10−5 Extracellular matrix part × miR-130b p = 1.5 × 10−6 FAM19A5 Continued on next page miRNA Extracellular region ECM organization p = 7.3 × 10−4 GO × × × CHAPTER 3. RESULTS 69 Basement membrane × × miR-130b Extracellular matrix part Extracellular region ECM organization Control marked control overexpressed genes (cont.) × FAM19A5 S100A4 GAS6 × RNASE4 × FUCA1 P4HTM PNPLA2 × A2M COL8A2 × × NUDT6 ITFG3 × TPP1 × AP000926.2 × IGSF8 MEGF6 × C5ORF45 GPSM1 JAM2 LYPD6B × SIDT2 × HERC4 IGFBP3 Continued on next page × CHAPTER 3. RESULTS 70 FAM19A5 × SCUBE3 × PAM × miR-130b Basement membrane Extracellular matrix part Extracellular region ECM organization Control marked control overexpressed genes (cont.) × TBCK × FGF2 CYP1B1 FAM129B COL4A2 × × × × × LOXL2 PODXL HSPA2 RECK × × F3 × RGS4 NID1 PTGDS × × × × × × × PIK3IP1 SEL1L3 NID2 × KCNK2 MOXD1 TNS1 Continued on next page × CHAPTER 3. 
RESULTS 71 × FAM19A5 EVC PPAP2A LIMCH1 COL11A1 DKK3 ALDH3B1 × × × miR-130b Basement membrane Extracellular matrix part Extracellular region ECM organization Control marked control overexpressed genes (cont.) ALCAM DLC1 × DCBLD1 × Continued on next page CTHRC1 C8ORF84 p = 5.1 × 10−2 Melanoma KEGG NUDT6 p = 6.4 × 10−3 p = 1.1 × 10−2 Focal adhesion GO miR-586 p = 9.5 × 10−4 Integrin binding × p = 6.5 × 10−4 × Heparin binding × p = 1.2 × 10−6 × Extracellular matrix p = 1.2 × 10−4 p = 2.8 × 10−5 p = 9.9 × 10−8 Adherens junction BGN p = 7.8 × 10−4 MXRA7 Integrin-mediated signaling pathway SPATS2L p = 8.1 × 10−5 TGM2 Blood vessel morphogenesis C4orf46 Vasculature development Cell adhesion CHAPTER 3. RESULTS 72 Table 3.19: Basal marked basal overexpressed genes miRNA × HYI × × C6ORF145 IKBIP × × × × CHAPTER 3. RESULTS 73 miR-586 Melanoma Focal adhesion Integrin binding Heparin binding Extracellular matrix Adherens junction Integrin-mediated signaling pathway Blood vessel morphogenesis Vasculature development Cell adhesion Basal marked basal overexpressed genes (cont.) C4orf46 DTX3L × ANTXR2 SNX7 SYNC FBLIM1 × × ADAMTS1 × × × × × × × × PDE1C DST × × ITGB1 × × × × × ZEB1 COL8A1 × × × × × × × × × CDCA7 FGF2 THBS1 × LMO7 × COL5A1 × × × × × × × × × × × × × MAP7D3 × FLNC SGCE Continued on next page × × CHAPTER 3. RESULTS 74 Melanoma × × C4orf46 CALD1 LPHN2 AKT3 GPR177 DOCK7 LOXL3 NCOA7 LPXN × AC005562 ARHGEF10 SLC39A14 PLS3 MYH9 × × × × × TIMP3 × FBLN1 × × FGFR1 × × FOSL2 PLEKHC1 CAPG Continued on next page × × × miR-586 Focal adhesion Integrin binding Heparin binding Extracellular matrix Adherens junction Integrin-mediated signaling pathway Blood vessel morphogenesis Vasculature development Cell adhesion Basal marked basal overexpressed genes (cont.) 
miR-351 p = 6.7 × 10−3 MRPL38 S100A14 ZNF552 Continued on next page miRNA × SLC25A29 × BCAM SIGIRR × HDDC3 × SNAI2 Table 3.20: Basal marked luminal overexpressed genes miR-586 Melanoma Focal adhesion Integrin binding Heparin binding Extracellular matrix Adherens junction Integrin-mediated signaling pathway Blood vessel morphogenesis Vasculature development Cell adhesion CHAPTER 3. RESULTS 75 Basal marked basal overexpressed genes (cont.) C4orf46 × CHAPTER 3. RESULTS 76 miR-351 Basal marked luminal overexpressed genes (cont.) MRPL38 × EFCAB4A × SEZ6L2 PPM1D ZNF444 TMC4 CRYL1 TOR2A C21ORF33 DUSP23 HIST1H2BD NMRAL1 FAM128B DAK PGAP2 CRABP2 × SYTL1 × BCKDHA KREMEN2 KIAA0182 CBFA2T3 BCAS4 C14ORF179 TP53I3 EPB41L5 H2AFJ COQ5 Continued on next page CHAPTER 3. RESULTS 77 miR-351 Basal marked luminal overexpressed genes (cont.) MRPL38 × CPT1A TRIM37 TJP3 × TRPS1 DECR2 EEF1A2 SULT2B1 CELSR1 CA12 HAGH SH3YL1 Table 3.21: Luminal marked basal overexpressed genes Actin binding p = 2.2 × 10−4 GO NCRNA00152 TUBB3 S100A2 Continued on next page CHAPTER 3. RESULTS 78 Actin binding Luminal marked basal overexpressed genes (cont.) NCRNA00152 AFAP1 × IL1RAP FAM92A1 SEPT10 CDCA2 MCFD2 PTRF ANTXR1 × TPM4 × EDIL3 FBLN2 TGFBR2 DDR2 MEGF6 MX1 DAB2 SACS LHFPL2 NFKBIZ CEP170 COL6A2 MFGE8 DUSP6 NT5E FST Continued on next page CHAPTER 3. RESULTS 79 Actin binding Luminal marked basal overexpressed genes (cont.) NCRNA00152 PLAU F3 CCDC88A × NR3C1 FAM46A QKI GPR162 DDX58 MET FXYD5 × RAGE MYLK × LIMCH1 × ITGA3 CHAPTER 3. RESULTS 80 Table 3.22: Luminal marked luminal overexpressed genes p = 4.2 × 10−4 miR-542 p = 4.2 × 10−4 miR-486 miR-448 p = 7.4 × 10−3 miRNA NPEPL1 SNURF DDR1 MB ZP3 × C19ORF46 FAM128A × PFKFB3 × BOLA2B ABCA3 × C10ORF32 RBM47 DMKN × FGFR4 × ZDHHC12 × × × × ACSS1 PDCD4 DOC2A × SOX13 FAM63A × SUOX DDB2 Continued on next page × CHAPTER 3. RESULTS 81 miR-542 miR-486 miR-448 Luminal marked luminal overexpressed genes (cont.) 
NPEPL1 INADL EPS8L1 × × × RAB17 PEX16 × EPCAM HDHD3 BSPRY PCTK3 EFHD1 MANSC1 × FOLR1 TUBD1 × NPDC1 AGR2 ISYNA1 × GSTZ1 × × ESR1 FXYD3 × LPHN1 PRKCZ × ERBB3 PTGER3 × KIAA1370 × DBNDD1 × ×

3.8 Marked overexpressed genes

3.8.1 Motifs

Motifs were searched for in the valleys where a unique valley coincided with overexpression in one of the cell lines. MEME [13] was used to search for conserved regions between 6 and 15 bp. A site of conservation needed to occur in 5 or more promoter regions. Twenty such sites were retrieved. A search was then performed to check whether any of the conserved regions matched known motifs. STAMP [207] was used with the JASPAR v2010 motif set. Matches with low complexity or with p > 10^-3 were discarded. If more than one motif matches a site well, they are all listed. Motifs that match at different sites are separated by lines in Tables 3.23-3.26 on pages 84-85. The motifs that were found were ESR1, ESR2, REST, Egr1, sna, che-1, stat3, cup2, EWSR1-FLI1, Ixr1, Tlx1_NFIC, tinman, bcd, oc, gsc, IRF1, MEF2A, and NFκB.

3.8.1.1 ESR1

ESR1 is found in Table 3.23 and Table 3.24. The presence of ESR1 in the control marked control overexpressed category can be explained by the H3K4me1 mark aiding the activatory role of ER. ESR1, as a tumour suppressor [19], is likely activating genes fighting tumourigenesis such as apoptotic genes. The presence of ESR1 in the cancer marked control category is not initially expected and possible explanations are discussed in Section 4.7.

Estrogen Receptor 1 (ESR1) is the gene that encodes estrogen receptor alpha (ER-α). ESR1 is activated by the ligand estrogen and affects physiological processes such as growth, differentiation, and homeostasis in eukaryotic cells [93].

3.8.1.2 ESR2

ESR2 is found in Table 3.23 and Table 3.24. The presence of ESR2 in the control marked control overexpressed category can be explained by the H3K4me1 mark aiding the activatory role of ER.
ESR2, as a tumour suppressor [136], is likely activating genes fighting tumourigenesis such as apoptotic genes. The presence of ESR2 in the cancer marked control category is not initially expected and possible explanations are discussed in Section 4.7.

Estrogen Receptor 2 (ESR2) is the gene that encodes estrogen receptor beta (ER-β). Like ESR1, ESR2 is activated by the ligand estrogen and affects physiological processes such as growth, differentiation, and homeostasis in eukaryotic cells [93].

Table 3.23: Uniquely Marked in Control and Overexpressed in Control
Motif TF p-value MA0450 hkb 2.8×10 13 8 −4 MA0055 Myf −5 1.4×10 MA0402 SWI5 2.7×10 MA0193 Lag1 4.9×10 MA0247 tin 1.7×10 MA0086 sna 1.2×10 MA0112 ESR1 7.1×10 MA0258 ESR2 1.2×10 MA0149 EWSR1-FLI1 1.7×10 MA0393 STE12 8.0×10 achi −4 4.7×10 MA0207 Sites −4 −5 33 −7 8 −6 −5 10 −4 −7 18 −5 13

Table 3.24: Uniquely Marked in Cancer and Overexpressed in Control
Motif TF p-value MA0149 EWSR1-FLI1 6.5×10 12 MA0162 Egr1 −4 1.5×10 12 MA0260 che-1 7.2×10 −11 −6 MA0023 dl_2 −5 4.8×10 MA0304 GCR1 8.1×10 MA0212 bcd 3.1×10 MA0234 oc −7 3.1×10 MA0190 Gsc 6.0×10 MA0112 ESR1 6.1×10 MA0258 ESR2 −4 6.8×10 MA0105 NFKB1 2.3×10 MA0061 NF-kappaB −4 7.3×10 MA0023 dl_2 7.9×10 MA0287 CUP2 1.6×10 MA0144 Stat3 MA0430 MA0087 Sites 6 −5 −7 23 −7 −4 −5 5 7 −4 −6 5 5.1×10 −14 5 YLR278C −4 3.7×10 5 Sox5 2.6×10 −4 9
Table 3.25: Uniquely Marked in Cancer and Overexpressed in Cancer
Motif TF p-value MA0162 Egr1 7.9×10 −4 12 MA0162 Egr1 −6 6.0×10 12 MA0323 IXR1 8.2×10 MA0138 REST −5 1.8×10 MA0373 RPN4 3.9×10 −5 7 MA0260 che-1 −6 2.0×10 5 MA0393 STE12 3.2×10 MA0050 IRF1 5.8×10 MA0212 bcd 2.4×10 MA0234 oc 2.4×10 MA0190 Gsc 4.6×10 MA0052 MEF2A 1.2×10 −6 Sites 9 −6 −6 −7 5 −7 −7 −6 5

Table 3.26: Uniquely Marked in Control and Overexpressed in Cancer
Motif TF p-value Sites MA0234 oc −7 9.3×10 21 MA0212 bcd 9.8×10 MA0190 Gsc 1.4×10 MA0190 Gsc 2.5×10 MA0212 bcd 5.8×10 MA0234 oc 6.3×10 MA0218 ct 4.9×10 6 MA0373 RPN4 −5 6.2×10 10 MA0344 NHP10 2.0×10 5 MA0344 NHP10 −4 1.5×10 18 MA0016 usp 2.0×10 MA0323 IXR1 1.0×10 −7 −6 −7 7 −7 −7 −5 −5 −5 5 −8 8 MA0119 TLX1_NFIC −6 8.5×10 MA0373 RPN4 3.3×10 MA0162 Egr1 4.3×10 MA0393 STE12 3.7×10 MA0260 che-1 −5 7.9×10 MA0430 YLR278C 4.8×10 −5 −5 9 −5 13 −5 10

3.9 Genes downstream of ESR1 motifs in valleys

The ESR1 gene encodes an estrogen receptor which is important for hormone binding, DNA binding, and activation of transcription. The ESR1 gene is amplified in 21% of breast carcinomas [119]. UCSC [147] was used to plot the data showing the H3K4me1 mark in the tumourigenic and control cell lines in Figure 3.4. Below the H3K4me1 data, the plots show where valleys were identified and the location of the ESR1 motif. Many of these genes are known to have functions in tumourigenesis.
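Locating an ESR1 motif within a valley amounts to an interval containment check. The sketch below illustrates this under stated assumptions: motif hits and valleys are given as (chromosome, start, end) tuples with inclusive coordinates, and the function names and coordinates are hypothetical rather than taken from this study.

```python
def motif_in_valley(motif, valley):
    """Return True if a motif hit lies entirely inside a valley.

    Both arguments are (chrom, start, end) tuples with inclusive
    coordinates; this representation is an illustrative assumption.
    """
    m_chrom, m_start, m_end = motif
    v_chrom, v_start, v_end = valley
    return m_chrom == v_chrom and v_start <= m_start and m_end <= v_end


def motifs_in_valleys(motif_hits, valleys):
    """Keep only the motif hits contained in at least one valley."""
    return [
        hit for hit in motif_hits
        if any(motif_in_valley(hit, valley) for valley in valleys)
    ]


# Hypothetical coordinates for illustration only.
example_valleys = [("chr1", 1000, 1400), ("chr2", 5000, 5300)]
example_hits = [
    ("chr1", 1100, 1114),  # inside the chr1 valley
    ("chr1", 1395, 1409),  # runs past the end of the chr1 valley
    ("chr2", 1100, 1114),  # right position, wrong chromosome
]

contained_hits = motifs_in_valleys(example_hits, example_valleys)
print(contained_hits)
```

Requiring full containment is one possible design choice; counting any overlap between motif and valley would be a looser alternative.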
Figure 3.4: ESR1 motifs found in valleys upstream of genes that were uniquely marked by H3K4me1 mono-methylation in the control cell line and overexpressed in the control cell line. Each panel shows H3K4me1 signal tracks for HS578T and HS578-Bst, the HS578T and HS578-Bst valley tracks, the ESR1 motif track, RefSeq Genes, and Ensembl Gene Predictions.
(a) ENST00000391895, KCNK2 (chr1), HS578T: 15.2132 rpkm, HS578-Bst: 31.4393 rpkm
(b) ENST00000374476, AC092162.1 (chr1), HS578T: 3.97307 rpkm, HS578-Bst: 21.8525 rpkm
(c) ENST00000368716, S100A4 (chr1), HS578T: 221.861 rpkm, HS578-Bst: 506.054 rpkm
(d) ENST00000407341, CYP1B1 (chr2), HS578T: 9.10138 rpkm, HS578-Bst: 39.3691 rpkm
Figure 3.4 (cont.)
(e) ENST00000383729, P4HTM (chr3), HS578T: 9.88339 rpkm, HS578-Bst: 23.777 rpkm
(f) ENST00000376931, C5ORF45 (chr5), HS578T: 21.0684 rpkm, HS578-Bst: 42.5321 rpkm
(g) ENST00000381086, IGFBP3 (chr7), HS578T: 398.96 rpkm, HS578-Bst: 10702.8 rpkm
(h) ENST00000301679, P4HTM (chr16), HS578T: 50.9351 rpkm, HS578-Bst: 102.124 rpkm
(i) ENST00000402249, PIK3IP1 (chr22), HS578T: 0.092294 rpkm, HS578-Bst: 12.3945 rpkm
UCSC [147] was used to plot the data in these figures.
Chapter 4

Discussion & Conclusions

Advances in sequencing technologies have allowed for the unbiased examination of global histone modifications within a cell at tenable timeframes and cost. This study took advantage of these advances by examining genome-wide histone H3K4me1 modifications in several breast cancer cell lines.

Transcription factors (TFs) promote or block the recruitment of RNA polymerase. This influence on gene transcription can be modulated by either enhancing or inhibiting the accessibility of site-specific transcription factors to target loci. A central problem in TF biology is how binding sites are selected given the near ubiquity of short and degenerate recognition motifs and the small fraction of high-affinity sites that are actually bound [130].

In these studies, we discovered novel putative activatory and repressive regions. We saw that valleys were significantly enriched for ORegAnno regulatory regions. Thus, the bimodal H3K4me1 peaks seem to mark areas of putative transcription factor binding. These results were consistent with studies by Hoffman et al. that find bimodal loci are more highly occupied than loci with low H3K4me1 [115]. Studies by Robertson et al. of the spatial distribution of H3K4me1 around TF binding sites have found symmetric flanking pairs of enrichment [278]. We found transcription factor binding site enrichment in such flanking pairs, or valleys. This enrichment may be due to modulated accessibility of chromatin [158], or interactions with molecular effectors involved in recognition of H3K4me1 [186].

We found that genes marked with H3K4me1 were more likely to be involved in breast cancer (p = 3.0 × 10^-15). This is consistent with studies that have found that monomethylation of histone H3K4 is associated with active transcription of a promoter [324].
This is evidence that these novel putative activatory and repressive regions have an effect in the progression of the tumour.

4.1 Valley concordance

To further analyze these novel putative activatory and repressive regions, we looked for their concordance in multiple different cell lines. This gave us the opportunity of grouping the cell lines and looking for differences between those groups. The section below discusses the findings when comparing tumourigenic to non-tumourigenic cell lines and also different breast cancer subtypes.

4.1.1 Match control

When comparing a cancer cell line with a matched control cell line, these results indicate that the majority of valleys are unshared between the two. This would be expected if H3K4me1 was an epigenetic modification that directs the transcriptional program of a cancer cell.

4.1.2 Breast cancer subtype

When comparing a pair of basal cell lines with a pair of luminal cell lines, their valleys seem to be largely unshared. There does not seem to be a distinct relationship between valley concordance and breast cancer subtype. This may indicate that H3K4me1 plays less of a role in breast cancer subtype specific functions than it does in tumourigenesis in general.

4.1.2.1 Core shared marks

When DNA is conserved across many organisms, that indicates the importance of a gene's functionality. Similarly, it could be hypothesized that H3K4me1 marks shared between cell lines of different breast cancer subtypes are putative activatory or repressive regions important in tumourigenesis. There were 48 genes marked in two basal cell lines and two luminal cell lines. Four of those genes, CTDSPL, BLCAP, CITED1, and PCDH8, were listed in the Genes-to-Systems Breast Cancer Database. CCDC18 and KIF6 are listed as having mutations in breast cancer in the COSMIC database. Some of the others seem to have roles in cancer as well. Tpr was found to be a fusion partner with the MET oncogene and was involved in gastric tumourigenesis [299].
CGREF1 was found in a study predicting epigenetically regulated genes in breast cancer cell lines [199]. Overexpression of POLQ is known to be correlated with poor prognosis in early breast cancer patients [114]. OLFML2A is listed in a patent developing a signature to predict and reduce the risk of metastasis of breast cancer to lung [213]. AGAP2 is overexpressed in human cancers, including breast cancer, and prevents apoptosis by up-regulating Akt [33, 3]. There is also evidence that corticotropin-releasing hormone exerts antiproliferative activity on growth of human breast cancer cells via the activation of CRH-R1 [97]. Bladder cancer-associated protein is a novel candidate tumour suppressor gene originally identified from human bladder carcinoma [91].

4.2 Association of valley marked genes with breast cancer tumourigenesis

Using multiple different breast cancer cell lines allowed us to examine functional groups. We find that genes marked with H3K4me1 valleys in their promoters are enriched for breast cancer related genes found in the G2SBC (Genes to Systems Breast Cancer database) [227] (Table 3.3). This is further evidence that the valleys represent novel putative activatory or repressive regions that could be binding sites for cancer related TFs.

4.3 Marked genes with corresponding expression modulation

There are multiple valley regions found in promoters of genes where a two-fold expression modulation correlates with the H3K4me1 mark. The H3K4me1 mark would serve to aid the binding and function of the transcription factor. Activators would be expected to bind to valleys in the promoter regions of two categories of genes, cancer marked cancer overexpressed and control marked control overexpressed. There were 117 such valleys. Repressors would be expected to bind to valleys in the promoter regions of two categories of genes, cancer marked control overexpressed and control marked cancer overexpressed. There were 99 such valleys.
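The reasoning above, in which a valley is a putative activator site when the marked cell line is also the overexpressing one and a putative repressor site otherwise, can be sketched in Python. This is an illustration under stated assumptions, not the procedure used in this study: the function names are hypothetical, the two-fold threshold is applied directly to RPKM values, and the example uses the KCNK2 expression values reported in Figure 3.4(a), where HS578T is the tumourigenic line and HS578-Bst the matched control.

```python
def overexpressed_group(rpkm_cancer, rpkm_control, fold=2.0):
    """Return 'cancer', 'control', or None using a fold-change threshold.

    Comparing products instead of dividing avoids division by zero.
    """
    if rpkm_cancer > 0 and rpkm_cancer >= fold * rpkm_control:
        return "cancer"
    if rpkm_control > 0 and rpkm_control >= fold * rpkm_cancer:
        return "control"
    return None


def classify_valley(marked_in, rpkm_cancer, rpkm_control, fold=2.0):
    """Label a uniquely marked promoter valley (hypothetical helper).

    marked_in: 'cancer' or 'control', the cell line carrying the valley.
    Returns 'putative activatory', 'putative repressive', or None when
    expression does not differ by at least `fold`.
    """
    over = overexpressed_group(rpkm_cancer, rpkm_control, fold)
    if over is None:
        return None
    if over == marked_in:
        return "putative activatory"
    return "putative repressive"


# KCNK2 values from Figure 3.4(a): HS578T 15.2132 rpkm (cancer),
# HS578-Bst 31.4393 rpkm (control); the valley is marked in the control.
kcnk2_label = classify_valley("control", 15.2132, 31.4393)
print(kcnk2_label)
```

A gene whose expression differs by less than two-fold returns None and would fall outside all four categories.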
4.3.1 Functions of H3K4me1 marked genes with corresponding expression modulation

The analysis of the marked overexpressed genes yielded many functional annotations which may be related to cancer progression. Gene Ontology, KEGG and miRBase annotations were included.

4.3.1.1 Cell cycle checkpoints

Cell cycle machinery controls cell proliferation, and cancer is a disease of inappropriate cell proliferation. Reduction in sensitivity to signals leads to a cycle of increasing cell number due to dysregulation of signals telling a cell to adhere, differentiate, or die [52]. Cell cycle checkpoints sense flaws in DNA replication and chromosome segregation [66]. When checkpoints are activated, signals are relayed to the cell cycle-progression machinery, causing a delay in cycle progression until the danger of mutation has been averted [52]. In addition to directly repairing DNA breaks or adducts, cells can respond to DNA damage by undergoing programmed cell death. Cells with an intact DNA-damage response frequently arrest or die in response to DNA damage, thus reducing the likelihood of progression to malignancy. Mutations in mitotic-checkpoint pathways can thus permit the survival or continued growth of cells with genomic abnormalities [141].

An enrichment of the KEGG term cell cycle, the Reactome term cell cycle, mitotic, and GO terms such as cell cycle process, mitosis, cell cycle checkpoint, and regulation of cell cycle points to a breakdown in the cell cycle checkpoints, possibly contributing to unregulated growth.

4.3.1.2 Metastasis

The metastatic process involves multiple steps, including cell detachment from the primary tumour, degradation of the basement membrane and ECM, migration into surrounding connective tissue, entry into the vascular or lymphatic circulation, attachment to the endothelial cells in suitable organs, extravasation from the circulation, and colony formation in the secondary sites [205].
Cellular adhesion molecules are involved in these steps.

4.3.1.3 Cellular adhesion

Proteins involved in focal adhesion are macromolecules through which the cytoskeleton of a cell connects to the ECM and mediates its regulatory effects through ECM-receptor interaction pathways [256, 134]. Focal adhesion kinase (FAK) is a protein tyrosine kinase expressed in invasive breast cancer and effects antiapoptotic signaling [169]. FAK might have roles in the later stages of tumour progression, such as invasion and metastasis, where it promotes the adhesion of invading cells at distant sites. It is also involved in early stage functions in cancer progression that precede invasion and metastasis [192].

Integrin proteins are major cell surface receptors for extracellular matrix molecules. FAK is a key component of the signal transduction pathways triggered by integrins [102]. Alterations to integrin function within human breast cancer may be linked to metastasis [80]. The GO terms integrin-mediated signaling pathway and integrin binding provide evidence that these genes may be involved in a breakdown of adhesion.

The mammary gland consists of a ductal epithelial network. These ducts contain two major layers, a luminal layer of secretory epithelial cells and an outer, basal layer of myoepithelial cells. The basal surface of the epithelium is a basement membrane (BM) that interacts with an ECM. The BM is a layer separating basal cells from the extracellular matrix. During tumour progression, changes arise that perturb interactions of epithelium and ECM [108]. The degradation of both the myoepithelial cell layer and the basement membrane is a prerequisite for breast cancer invasion and metastasis [209]. The GO terms Extracellular region, Extracellular matrix part, and Basement membrane possibly indicate that this invasion is occurring.
Tumour cell migration and adhesion are important features during the switch to the metastatic state. The actin cytoskeleton is important in these processes and involved in many aspects of cancer and cancer progression [166]. In normal tissue, fibroblasts and epithelial cells locally migrate during wound repair, and white blood cells cross vessel walls. Myoepithelial cells are contractile and arranged in a similar manner to smooth muscle cells [271]. Their cytoplasm contains the contractile protein actin. These kinds of processes can be dysregulated to allow malignant cancer cells to move out of the primary tumour and beyond the boundaries of the tissue or organ where the tumour initially developed [211]. The GO term Actin binding may indicate such processes are occurring in this cell line.

4.3.2 Angiogenesis

As a tumour gets bigger, it is less able to sufficiently access the blood vessels. The generation of vascular stroma is thus essential for solid tumour growth [29]. Vascular stroma formation is evident in two GO categories, vasculature development and blood vessel morphogenesis.

Studies using the MDA-MB-231 breast cancer cell line concluded that heparin-binding growth-associated molecule functions as a tumour growth factor [329]. In another study, the expression of integrins in cells strongly expressing epidermal growth factor (EGF) receptors was increased by addition of the heparin-binding EGF-like growth factor [233]. Heparin-binding proteins can promote angiogenesis in endothelial cells [318].

4.3.3 MicroRNAs

MicroRNAs control gene expression by targeting mRNAs. In previous studies, miRNAs were identified whose expression was correlated with specific breast cancer biopathologic features, such as estrogen and progesterone receptor expression, tumour stage, vascular invasion, or proliferation index [124].
The effect of microRNAs is post-transcriptional and as such is not affected by H3K4me1, but some miRNAs mark genes that are cancer related. There were several genes that had enrichment of cancer-related miRNAs.

4.4 Putative regulatory regions

4.4.1 Relevance of marked overexpressed categories

The categories generated by combining comparisons of monomethylation data and expression data in different functional groups in Tables 3.9 and 3.10 allow us to define novel putative regulatory regions. The correlation of a unique valley region with a significant change in expression leads us to the hypothesis that these are activatory or repressive regions.

4.4.1.1 Putative activatory region

In a case where an H3K4me1 mark is correlated with overexpression of the downstream gene in the same cell line, we could expect an activatory transcription factor to be binding within the valley region of the gene's promoter. The mark would aid the binding and effect of the activatory transcription factor, contributing to the overexpression of the downstream gene.

4.4.1.2 Putative repressive region

On the other hand, in a case where an H3K4me1 mark is correlated with overexpression of the downstream gene in a different cell line, we could expect a repressive transcription factor to be binding within the valley region of the gene's promoter. The mark would aid the binding and effect of the repressive transcription factor, contributing to decreased expression of the downstream gene.

4.5 Experimentally determined functions of TFs potentially regulated by valley regions

To test whether these valleys have regulatory functions, they were correlated with motifs. This was done by comparing tumourigenic and non-tumourigenic cell lines in cases where there were uniquely marked genes that were overexpressed in one of the cell lines. This analysis finds putative activatory and repressive regions where the H3K4me1 appears to modulate
DISCUSSION & CONCLUSIONS 99 the eect of the transcription factor. 4.5.1 ESR1 and ESR2 There are many examples of the regulatory role of the valleys being supported by what is known about the TFs binding the motifs in the literature. For example, ESR1 and ESR2 are found in Table 3.23, the control marked control overexpressed category. This would indicate the H3K4me1 mark is aiding the activatory role of ER, which is corroborated in the ER literature. and ER is known to have three activation domains AF-1, AF-2, and AF-2a [241], in vitro studies show that TATA-binding protein-associated factor interacts with the AF-2a domain to enhance ER-mediated transcription [28]. The presence of ESR1 and ESR2 in Table 3.24, the cancer marked control overexpressed category, indicates a repressive role that has not been described in the literature. Possible interpretations of this inconsistency are discussed in Section 4.7. EStrogen Receptor 2 (ESR2) is the gene that encodes Estrogen receptor beta (ER-β) and EStrogen Receptor 1 (ESR1) encodes Estrogen receptor beta (ER-α). The ESR's are activated by the ligand estrogen and aect physiological processes such as growth, dierentiation, and homeostasis in eukaryotic cells [93]. These TFs are tumour suppressors [136], and are likely activating genes ghting tumourigenesis such as apoptotic genes. Breast cancers whose cell growth rate is not aected by the presence of estrogen are estrogen receptor-negative (ER-). The cell lines used in these studies, HS-578T and HS578Bst, are known to be ER- [105]. The ER- status may appear to conict with the result indicating ER bind to promoters of genes causing changes in downstream expression. However, these results are consistent with the presence of ERRs. ER-related receptors (ERRs) are nuclear orphan receptors with signicant homology to ERs, which do not bind estrogen. These have unknown physiological ligands can take over for estrogen or are constitutively active. 
ERRs are known to be able to bind to classic EREs, on which they exert a constitutive transcriptional activity [100, 118]. These studies were done in ER- cell lines, which may indicate that ERRs are involved in tumourigenesis in this case.

The presence of ESR1 in the results of this breast cancer study is consistent with this TF's major role in the disease, which is well documented in previous literature. Studies on breast cancer samples showed ESR1 amplification in 20.6% of breast cancers [117]. The loss of ER expression causes tumour growth that is no longer under estrogen control and cannot be stopped by endocrine therapy. This results in higher tumour aggressiveness and poor prognosis. Therefore, ER is a critical growth regulatory gene in breast cancer, and its expression in breast cancer cells is critical for tumour progression [93].

4.5.2 Egr1

Furthermore, the presence of Egr1 (Early Growth Response Protein 1) in Table 3.25, the cancer marked cancer overexpressed category, is also corroborated in other studies. Our results would indicate that the H3K4me1 mark aids the effect of this TF, which we would expect to be an activator. Indeed, Egr1 does have an activation domain: a serine/threonine/proline-rich region between amino acids 174 and 270 [38]. The presence of Egr1 in Table 3.24, the cancer marked control overexpressed category, and Table 3.26, the control marked cancer overexpressed category, indicates a repressive role for Egr1. Again the literature confirms that Egr1 has both activatory and repressive domains. The repressive domain lies between amino acids 281 and 314, 5' of its zinc fingers [92]. In addition, Swirnoff et al. found evidence that Nab1, a corepressor of Egr-1, is an active (non-quenching) repressor that appears to work via a direct mechanism. Thus, it interferes with the function of the general transcription apparatus (GTA) but not that of specific activating TFs [307].
Egr1 was also shown in other studies to have a role in cancer. It is a convergence point for many signalling cascades and is involved in proliferation, stress responses and apoptosis [57, 196]. This complex TF is known to act as both a tumour suppressor and a tumour promoter [160]. Its dual roles as tumour suppressor and tumour promoter, activator and repressor, appear consistent with our finding this TF in three different categories.

There has been previous evidence of Egr1 being specifically involved in breast cancer as well. It has been linked to apoptosis and shown to be activated by extracellular signal-regulated kinase [11]. However, EGR1 was previously shown to be needed for TBX2 to repress NDRG1 and drive cell proliferation in breast cancer [274]. In normal mammary tissue, Egr-1 expression is low, suggesting a possible relation between the low levels of Egr-1 and the development of mammary neoplasias [247]. Analyses of the expression of Egr-1 in breast carcinoma cells, such as MCF-7, demonstrated a relatively high expression of endogenous Egr-1 in these cells [247]. Other results in the literature suggest that siRNA-Egr-1 is a potent antineoplastic agent in suppressing the growth of breast tumours, despite the known role of Egr-1 as a tumour suppressor in several other types of human cancers [247].

4.5.3 Che-1

In addition, Che-1 is found in Table 3.25, the cancer marked cancer overexpressed category. This indicates that Che-1 is an activator. The literature matches this observation, and previous studies have found that Che-1 contains an activation domain [60]. The presence of Che-1 in Table 3.26, the control marked cancer overexpressed category, and Table 3.24, the cancer marked control overexpressed category, indicates a repressive role that has not been described in the literature. Possible interpretations of this inconsistency are discussed in Section 4.7.
Che-1 was previously shown to have a proproliferative role, interacting with the retinoblastoma protein (Rb) and inhibiting its ability to suppress expression of E2F [75]. Furthermore, Che-1 appears to counteract Par-4 or β-amyloid induced apoptosis [83]. In contrast, Che-1 was also shown to have antiproliferative activity by inducing expression of p21Waf1 [246].

4.5.4 EWSR1/Fli-1

Furthermore, EWSR1/Fli-1 is found in Table 3.23, the control marked control overexpressed category. This is evidence that EWSR1/Fli-1 is an activator. EWSR1/Fli-1 is a chimeric protein fusing Ewing sarcoma breakpoint region 1 and Friend Leukemia Integration 1 protein. This chimera joins the 5' part of EWS to the 3' half of FLI-1, which encodes the DNA binding domain. EWS-FLI1 can recognize in vitro the same sequences as FLI-1, but is a more potent transactivator than the wild type FLI-1 [14]. The activatory role for EWSR1/Fli-1 is consistent with our findings.

Our studies also find EWSR1/Fli-1 in Table 3.24, the cancer marked control overexpressed category. This is inconsistent with its activatory role discussed above, but there is evidence in the literature that could corroborate this effect. EWS/FLI-1 has been shown to bind the IGFBP-3 promoter in vitro and in vivo and can repress its activity [264]. This bivalent role is discussed in a study that characterized eight transcripts that are dependent on EWS/FLI for expression and two transcripts that are repressed in response to EWS/FLI [27].

4.5.5 Ixr1

Ixr1 is a homeobox gene that encodes the iroquois homeobox 1 protein. Homeobox genes encode transcription factors that play key roles in the determination and maintenance of cell fate and cell identity [40]. Ixr1 is found in Table 3.26, the control marked cancer overexpressed category, indicating a role as a repressor.
The specific role of Ixr1 does not appear to be known decisively, but one study has found that it has a possible role as a repressor. In this study, mutations in IXR1 cause de-repression of COX5B [165]. It also appears in Table 3.25, indicating an activatory role. This could be a novel unknown function of Ixr1, or there may be other factors involved. For example, homeobox genes do not generally act alone to determine cell identity. There is a combinatorial, spatially and temporally regulated pattern of homeobox genes functioning in a given cell that determines the cell's identity [183]. Acting together, the genes can be considered a "homeobox code" programming cellular outcome. The binding of this one TF alone may not determine the outcome of downstream genes.

There is some evidence that homeobox genes could be involved in breast cancer. IRX-2, for example, is expressed in discrete epithelial cell lineages, being found in ductal and lobular epithelium [184]. IRX-2 expression is maintained in human mammary neoplasias [184].

4.5.6 Tlx1_NFIC

TLX1 is the gene encoding the T-cell leukemia homeobox protein and NFIC is the Nuclear factor I/C-type protein. Tlx1_NFIC is found in Table 3.26, the control marked cancer overexpressed category. This appears to be a case where H3K4me1 marks a valley for Tlx1_NFIC to bind and repress the downstream genes. This is consistent with literature that reports that TLX1 functions as a bifunctional transcriptional regulator, being capable of activation or repression depending on cell type [277]. TLX1 is a homeoprotein that is known to interact with the CCAAT binding transcription factor NFIC [342].

There is evidence for this complex's involvement in previous cancer studies. TLX1 is essential to spleen organogenesis and oncogenic when aberrantly expressed in immature T cells [277]. NFIC is upregulated in breast cancer [201].

4.5.7 Tin

Tin is found in Table 3.23, the control marked control overexpressed category.
This would indicate a role for Tin as an activator, where the H3K4me1 mark in the control aids the binding of Tin, resulting in activation of the downstream genes. This is consistent with previous studies. The human homologs are NKX2-5 and NKX2-6 (NK2 transcription factor related), which are members of the NK homeobox family [311]. NKX2-5 has been found to act either as a specific transcriptional activator or repressor [4]. In addition, apoptosis and reduced proliferation were observed in Nkx2.5 and Nkx2.6 double-mutant mice [311].

4.5.8 Bcd, oc, and gsc

Bcd encodes bicoid, a homeodomain-containing transcription factor. Gsc encodes the goosecoid homeobox protein Goosecoid. Oc is the homeobox gene ocelliless [39]. Bcd, oc, and gsc are found in Table 3.24, the cancer marked control overexpressed category, and Table 3.26, the control marked cancer overexpressed category. This would indicate that H3K4me1 is promoting binding of repressors. In the literature, it is found that goosecoid and bicoid act as translational repressors [63, 171]. The presence of Bcd, oc, and gsc in Table 3.26, the control marked cancer overexpressed category, indicates a repressive role that has not been described in the literature. Possible interpretations of this inconsistency are discussed in Section 4.7.

These genes are also known to have a role in breast cancer, consistent with our findings. Goosecoid promotes tumour metastasis and is overexpressed in a majority of human breast tumours. Moreover, Goosecoid significantly enhanced the ability of breast cancer cells to form pulmonary metastases in mice [109].

4.5.9 IRF1

IRF1 encodes the interferon regulatory factor 1. IRF1 is found in Table 3.25, the cancer marked cancer overexpressed category. This could be interpreted as H3K4me1 marking the binding site and facilitating the role of the activatory protein IRF-1. Indeed, previous studies confirm that IRF-1 is an activator of transcription [214].
Other studies have also confirmed its involvement in breast cancer. IRF1 behaves as a tumour suppressor gene in breast cancer through caspase activation and induction of apoptosis [25]. Other studies have shown that ectopic expression of IRF1 using an adenovirus delivery system led to a decrease in survivin expression and an increase in cell death in breast cancer cell lines [259].

4.5.10 MEF2A

MEF2A encodes the Myocyte-specific enhancer factor 2A protein. MEF2A is found in Table 3.25, the cancer marked cancer overexpressed category, which indicates that H3K4me1 is aiding the activatory function of MEF2A in the cancer cell line. This is consistent with findings in the literature that MEF2 can act either as an activator or repressor of transcription under different circumstances [218].

4.5.11 Sna

Sna encodes the Snail protein [143]. Sna is found in Table 3.23, the control marked control overexpressed category. This would indicate that Snail has activatory capabilities. This is inconsistent with previous reports of Snail's repressive SNAG domain [238]. It is, however, consistent with one study in which a Snail-type TF, CES-1, which also binds to E-boxes, was found to activate transcription in vivo [275].

Figure 4.1: Snail1 complex [44]

Snail was also observed in other cancer studies. Its expression has been detected in a number of different human carcinoma and melanoma cell lines [252]. Snail is sufficient to promote mammary tumour recurrence in vivo [224]. High levels of Snail predict decreased relapse-free survival in women with breast cancer [224]. Snail has been associated with the lymph node status and/or invasiveness of ductal breast carcinomas [21]. Snail expression has been shown to confer resistance to cell death mediated by several factors and chemotherapeutic agents [133, 316]. Snail was found to be upregulated in recurrent tumours. This recurrence is accompanied by epithelial-to-mesenchymal transition (EMT).
Snail has been shown to be sufficient to induce EMT in primary tumour cells [224]. Conversely, silencing of Snail by stable RNA interference has been shown to induce a complete mesenchymal-to-epithelial transition (MET), associated with the upregulation of E-cadherin, downregulation of mesenchymal markers, and inhibition of invasion [242].

Other studies have also verified cross-talk between Snail and epigenetic factors. As shown in Figure 4.1, Snail physically interacts with, and recruits, the histone demethylase LSD1 to epithelial gene promoters. The SNAG domain of Snai1 is sufficient for interaction with the LSD1 complex [194]. LSD1 removes dimethylation of lysine 4 on histone H3 (H3K4me2 to H3K4me1/H3K4me0), and in the absence of LSD1, Snai1 fails to repress E-cadherin [194]. LSD1 associates with co-repressors including HDAC1/2 and CoREST to form a core ternary complex. This is recruited to chromatin and can efficiently bind and modify nucleosomal substrates to repress transcription [167]. Previous studies have shown that Snail induces repressive histone modifications at the E-cadherin promoter through recruitment of histone deacetylases (HDACs) and an H3K27 methyltransferase [251, 112].

4.5.12 Stat3

Stats are a family of latent transcription factors that mediate signalling from cytokines and growth factors. Signal Transducer and Activator of Transcription 3 (Stat3) regulates the transcriptional activation of VEGF (vascular endothelial growth factor) [96]. Stat3 is found in Table 3.24, the cancer marked control overexpressed category. This is inconsistent with literature that finds that STAT family members are transcription activators [300].

Stat3 was also found in other cancer studies. It is a point of convergence for numerous oncogenic signalling pathways. It is constitutively activated both in tumour cells and in immune cells in the tumour microenvironment through constitutive phosphorylation on tyrosine [340].
Stat3 plays a key role in many cellular processes such as cell proliferation, survival, invasion, and tumour angiogenesis [1]. In lung cancer, Stat3 transduces survival signals downstream of tyrosine kinases such as Src, EGF-R, and c-Met, as well as cytokines such as IL-6 [300]. Stat3 has been found to be essential for the early phase of mammary gland involution [1]. Involution is characterized by extensive apoptosis of the epithelial cells and a dramatic switch from survival to death signalling.

4.5.13 REST

The RE1-Silencing Transcription factor (REST) gene encodes a transcriptional repressor. REST was initially proposed to silence the transcription of neuronal genes in non-neuronal cells. It is now known to have essential roles in both neuronal and non-neuronal cells. REST is found in Table 3.25, the cancer marked cancer overexpressed category. This is inconsistent with previous literature, in which REST is thought to repress genes by binding to a 17-33 base pair neuron-restrictive silencer element [288, 41, 289].

This inconsistency may be due to post-transcriptional regulation of REST. Previous studies have found that this occurs during oncogenic transformation [320]. Protein levels can be significantly reduced in the absence of altered mRNA levels [330]. REST activity therefore cannot be measured directly from its mRNA levels in breast tumours, as in our studies. The level of RNA-Seq coverage we have allows us to observe changes in expression, but not to observe SNPs conclusively at the base pair level. Thus there may be an isoform arising from a mutation to a stop codon that we do not observe.

Figure 4.2: Various REST isoforms [76]

The various splice isoforms of REST are shown in Figure 4.2. Isoform 1 consists of two repression domains (RD1 and RD2) and a DNA-binding domain (DBD). Expression of a dominant negative form of REST derepresses a promoter [42].
REST4, or Isoform 3 in Figure 4.2, lacks RD2 and has a truncated DNA binding domain, but retains zinc fingers 1-5 and nuclear localization [76]. Re-expression of functional REST in REST4-expressing cells has been shown to induce apoptosis, suggesting that suppression of REST function is key to the survival of these cells [104].

Other studies confirm REST's involvement in breast cancer. Loss of REST has been found to result in a highly aggressive breast cancer disease course [320]. Also, RESTless tumours have significantly increased tumour size and lymph node involvement [320]. Furthermore, patients with RESTless breast cancer undergo significantly more early disease recurrence than those with fully functional REST, regardless of estrogen receptor or HER2 status [320].

Other studies have also confirmed cross-talk of REST with epigenetic factors. REST has been found to act as a hub for the recruitment of multiple chromatin-modifying enzymes such as histone deacetylases (HDACs) and histone methyltransferases (HMTs) [244]. CoREST has been found to enhance the ability of LSD1, a known histone H3K4 demethylase, to reverse methylation, and to protect LSD1 from proteasomal degradation in vivo [175]. Figure 4.1 shows such an example complex containing REST.

4.6 Experimental validation

These experiments give us a better understanding of key molecular targets that underlie the pathways associated with disease development. By inhibiting a gene with proproliferative roles we might slow the progression of breast cancer in an individual. We could do further research to validate such genes as potential targets.

RNA interference (RNAi) approaches are an effective means of target validation [125]. This method would allow us to model the pharmacological inhibition of a target protein. RNAi is a valuable laboratory research tool, both in cells and in whole animal models [125].
RNAi is a naturally occurring mechanism that controls gene expression at the post-transcriptional level [125]. In eukaryotes, double-stranded interfering RNAs target complementary mRNAs for degradation. RNAi can be effected in mammalian cells by the use of small-interfering RNA (siRNA) duplexes that silence gene expression without inducing an inhibitory interferon response [30, 125, 219]. siRNAs can either be directly introduced into cells by transfection or be generated within the cell by introducing plasmids that express short-hairpin RNA (shRNA) precursors of siRNAs [125]. shRNAs are processed by the DICER enzyme into siRNAs, which, in turn, enable transcript degradation by binding to a complementary mRNA in the context of the RNA-induced silencing complex (RISC) [125].

Once the inhibition of the gene has been modelled, assays can be run to test the effect. Some of the assays described in the literature include the luminescent measurement of cell viability [204], a wound-healing assay modelling cell motility [51], a fluorescence-based plasmid reporter system measuring proteasome function, and microscopic image analysis as a measure of mitotic progression [125].

4.7 Uncorroborated experimental results

There are other cases where, while our finding that a TF is involved in cancer appears to be supported by previous literature, the activatory or repressive role we predict is not corroborated in the literature. This section discusses some of the possible reasons for those cases.

4.7.1 Post-transcriptional regulation

Post-transcriptional regulation is when protein levels are significantly reduced in the absence of altered mRNA levels [320]. An example of this is the post-transcriptional regulation of REST that occurs during oncogenic transformation [320]. REST is regulated by ubiquitin-mediated proteolysis. β-TRCP is the specific E3 ubiquitin ligase responsible for REST degradation. β-TRCP overexpression causes oncogenic transformation of human mammary epithelial cells, and this pathogenic function requires REST degradation [330]. This kind of post-transcriptional regulation would not be evident in a ChIP-seq experiment.

4.7.2 Co-regulators

Transcription coregulators interact with transcription factors to either activate or repress the transcription of specific genes [95]. Nuclear co-regulators act cooperatively with transcription factors to establish patterns of gene expression and thus provide functional flexibility in specifying transcriptional regulation [130]. An example of this is NKX2-5, the human Tin homolog, whose transcriptional activity is modulated positively and negatively by its respective binding partner, Tbx-5 or Tbx-2, in a region-specific manner [4].

4.8 Progressive methylation

4.8.1 Binding strengths of effectors

It is known that both H3K4me1 and H3K4me3 can be present at sites proximal to the TSS. Furthermore, there has been work done to characterize effectors of the H3K4me1 mark, that is, proteins that recognize methylation at H3K4 and effect change. NURF is one such effector, with a BPTF PHD finger that recognizes H3K4 methylation [186]. It was found that binding was tightest to H3K4me3, but that binding to other methylation states was possible, though weaker [186]. There was a gradient of binding affinity: H3K4me3 > H3K4me2 > H3K4me1 > H3K4me0 [186]. Both H3K4me1 and H3K4me3 are activatory, but we can expect H3K4me3 to have a stronger effect if all of the effectors are like NURF.

4.8.2 H3K4me3 unobserved in these studies

4.8.2.1 Expected case

There are different ways to interpret an H3K4me1 mark. It could represent an increase in activatory marking from H3K4me0. In this case we would expect an activator to bind a valley where a unique H3K4me1 mark correlates with increased expression.
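
As an illustration only, the interpretation rule from Sections 4.4.1.1 and 4.4.1.2 can be sketched in a few lines of Python. This is a minimal sketch and not the pipeline used in this thesis: the names `Role`, `CATEGORIES`, `putative_role` and `roles_for_tf` are hypothetical, and the table-to-category mapping is taken from the category names quoted in Section 4.5. It simply labels a TF as a putative activator when the uniquely marked cell line is also the one showing overexpression, and as a putative repressor otherwise.

```python
from enum import Enum


class Role(Enum):
    ACTIVATOR = "putative activator"
    REPRESSOR = "putative repressor"


# Hypothetical lookup built from the category names used in Section 4.5:
# (cell line carrying the unique H3K4me1 mark,
#  cell line in which the downstream gene is overexpressed).
CATEGORIES = {
    "Table 3.23": ("control", "control"),
    "Table 3.24": ("cancer", "control"),
    "Table 3.25": ("cancer", "cancer"),
    "Table 3.26": ("control", "cancer"),
}


def putative_role(marked_in: str, overexpressed_in: str) -> Role:
    """Heuristic of Section 4.4.1: same cell line -> activator, else repressor."""
    if marked_in == overexpressed_in:
        return Role.ACTIVATOR
    return Role.REPRESSOR


def roles_for_tf(tables):
    """Collect every role implied by the tables in which a TF motif appears."""
    return {putative_role(*CATEGORIES[table]) for table in tables}


# Egr1 is reported in Tables 3.24, 3.25 and 3.26 (Section 4.5.2),
# so the heuristic assigns it both roles.
egr1_roles = roles_for_tf(["Table 3.24", "Table 3.25", "Table 3.26"])
print(sorted(role.value for role in egr1_roles))
# prints ['putative activator', 'putative repressor']
```

A TF that receives both labels under this rule, as Egr1 does, is one whose predicted role needs to be checked against the literature before it is interpreted.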
4.8.2.2 Methylation states

In these studies, we are only examining H3K4me1 and not other methylation states. There is evidence that distinct methylation states could reflect the stability of histone lysine methylation. Stability would gradually increase from mono- to di- and finally to trimethylation of the various histone lysine positions [163]. Furthermore, there is evidence that the efficiency of readout of the different methylation states increases from mono- to di- and finally to trimethylation [186].

[Figure 4.3: Low H3K4me1 could indicate higher H3K4me3. Diagram labels: H3K4me3, TF, H3K4me1, Promoter, TSS.]

4.8.2.3 Reasons for unexpected case

Thus, the presence of H3K4me1 could indicate a decrease in activatory marking from H3K4me3. Furthermore, H3K4me1 and H3K4me3 are competitive, and a low H3K4me1 could indicate a higher unseen H3K4me3. In this case, H3K4me1 would decrease as these sites were progressively methylated to H3K4me2 and/or H3K4me3 [186]. Thus H3K4me1 can be used to assess genes that have modulated expression in oncogenesis, but it is difficult to use this mark alone to determine whether a particular gene is activated or repressed.

4.9 Epigenetic crosstalk

In addition to the different methylation states on histone H3 lysine 4, there are other residues that can be methylated and many types of histone modifications. They include histone acetylation, phosphorylation, ubiquitination, sumoylation, ADP-ribosylation, biotinylation, proline isomerization, and histone methylation [314]. These modifications act together, creating a "histone code" [303]. According to the histone code, distinct combinations of histone modifications are related to specific chromatin-related functions and processes [128]. Multiple modifications can help tip the balance from one chromatin state to another, making the underlying DNA more or less accessible to the protein machinery.
These histone modifications generate a language that is interpreted through the ability to recruit the proteins that modulate chromatin. While our experiments are unique in that they find novel regulatory regions in a genome-wide comparison of multiple breast cancer cell lines, there have been other studies of histone modifications. Previous studies have also found them to have roles in gene transcription, DNA repair, mitosis, meiosis, development and apoptosis [90]. Many different epigenetic modifications have been described in the human genome and have previously been shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease [69]. Studying individual modifications can help us find links to the activity levels of various genetic functional elements. However, to better understand the complete effect, the combinatorial patterns of many epigenetic factors must be considered. This study suggests that H3K4me1 contributes to gene expression, but this mark alone does not explain all gene expression effects.

4.10 Conclusions

In conclusion, bimodal H3K4me1 peaks form valleys, which are putative regulatory regions. In these regions, transcription factor binding is enriched, and some H3K4me1 marked genes are involved in cancer progression or anti-cancer functions. Genes marked in multiple breast cancer cell lines may be important for tumour progression. Finally, correlating motifs found in valleys with overexpression data has yielded genes with important functions in breast cancer.

Bibliography

[1] Kathrine Abell, Antonio Bilancio, Richard W E Clarkson, Paul G Tiffen, Anton I Altaparmakov, Thomas G Burdon, Tomoichiro Asano, Bart Vanhaesebroeck, and Christine J Watson. Stat3-induced apoptosis requires a molecular switch in pi(3)k subunit composition. Nat Cell Biol, 7(4):392-398, Apr 2005.

[2] Mohamed Abu-Farha, Jean-Philippe Lambert, Ashraf S Al-Madhoun, Fred Elisma, Ilona S Skerjanc, and Daniel Figeys.
The tale of two domains: proteomics and genomics analysis of smyd2, a new histone methyltransferase. Mol Cell Proteomics, 7(3):560-572, Mar 2008.

[3] Jee-Yin Ahn, Yuanxin Hu, Todd G Kroll, Paulette Allard, and Keqiang Ye. Pike-a is amplified in human cancers and prevents apoptosis by up-regulating akt. Proc Natl Acad Sci U S A, 101(18):6993-6998, May 2004.

[4] Hiroshi Akazawa and Issei Komuro. Cardiac transcription factor csx/nkx2-5: Its role in cardiac development and diseases. Pharmacol Ther, 107(2):252-268, Aug 2005.

[5] F. Albertorio, M. E. Hughes, J. A. Golovchenko, and D. Branton. Base dependent dna-carbon nanotube interactions: Activation enthalpies and assembly-disassembly control. Nanotechnology, 20(39):395101, 2009.

[6] Donna G Albertson, Colin Collins, Frank McCormick, and Joe W Gray. Chromosome aberrations in solid tumors. Nat Genet, 34(4):369-376, Aug 2003.

[7] C. David Allis, Shelley L Berger, Jacques Cote, Sharon Dent, Thomas Jenuwien, Tony Kouzarides, Lorraine Pillus, Danny Reinberg, Yang Shi, Ramin Shiekhattar, Ali Shilatifard, Jerry Workman, and Yi Zhang. New nomenclature for chromatin-modifying enzymes. Cell, 131(4):633-636, Nov 2007.

[8] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J Mol Biol, 215(3):403-410, Oct 1990.

[9] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet, 25(1):25-29, May 2000.

[10] S. Audic and J. M. Claverie. The significance of digital gene expression profiles. Genome Res, 7(10):986-995, Oct 1997.

[11] S.J. Baek, L.C. Wilson, L.C. Hsi, and T.E. Eling.
Troglitazone, a peroxisome proliferator-activated receptor gamma (ppar gamma) ligand, selectively induces the early growth response-1 gene independently of ppar gamma. a novel mechanism for its anti-tumorigenic activity. J Biol Chem, 278:5845-53, 2003.

[12] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol, 2:28-36, 1994.

[13] Timothy L Bailey, Nadya Williams, Chris Misleh, and Wilfred W Li. Meme: discovering and analyzing dna and protein sequence motifs. Nucleic Acids Res, 34(Web Server issue):W369-W373, Jul 2006.

[14] R. A. Bailly, R. Bosselut, J. Zucman, F. Cormier, O. Delattre, M. Roussel, G. Thomas, and J. Ghysdael. Dna-binding and transcriptional activation properties of the ews-fli-1 fusion protein resulting from the t(11;22) translocation in ewing sarcoma. Mol Cell Biol, 14(5):3230-3241, May 1994.

[15] Andrew J Bannister and Tony Kouzarides. Reversing histone methylation. Nature, 436(7054):1103-1106, Aug 2005.

[16] Artem Barski, Suresh Cuddapah, Kairong Cui, Tae-Young Roh, Dustin E Schones, Zhibin Wang, Gang Wei, Iouri Chepelev, and Keji Zhao. High-resolution profiling of histone methylations in the human genome. Cell, 129(4):823-37, May 2007.

[17] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29:1165-1188, 2001.

[18] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith, John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R Bignell, Jonathan M Boutell, Jason Bryant, Richard J Carter, R.
Keira Cheetham, Anthony J Cox, Darren J Ellis, Michael R Flatbush, Niall A Gormley, Sean J Humphray, Leslie J Irving, Mirian S Karbelashvili, Scott M Kirk, Heng Li, Xiaohai Liu, Klaus S Maisinger, Lisa J Murray, Bojan Obradovic, Tobias Ost, Michael L Parkinson, Mark R Pratt, Isabelle M J Rasolonjatovo, Mark T Reed, Roberto Rigatti, Chiara Rodighiero, Mark T Ross, Andrea Sabot, Subramanian V Sankar, Aylwyn Scally, Gary P Schroth, Mark E Smith, Vincent P Smith, Anastassia Spiridou, Peta E Torrance, Svilen S Tzonev, Eric H Vermaas, Klaudia Walter, Xiaolin Wu, Lu Zhang, Mohammed D Alam, Carole Anastasi, Ify C Aniebo, David M D Bailey, Iain R Bancarz, Saibal Banerjee, Selena G Barbour, Primo A Baybayan, Vincent A Benoit, Kevin F Benson, Claire Bevis, Phillip J Black, Asha Boodhun, Joe S Brennan, John A Bridgham, Rob C Brown, Andrew A Brown, Dale H Buermann, Abass A Bundu, James C Burrows, Nigel P Carter, Nestor Castillo, Maria Chiara E Catenazzi, Simon Chang, R. Neil Cooley, Natasha R Crake, Olubunmi O Dada, Konstantinos D Diakoumakos, Belen Dominguez-Fernandez, David J Earnshaw, Ugonna C Egbujor, David W Elmore, Sergey S Etchin, Mark R Ewan, Milan Fedurco, Louise J Fraser, Karin V Fuentes Fajardo, W. Scott Furey, David George, Kimberley J Gietzen, Colin P Goddard, George S Golda, Philip A Granieri, David E Green, David L Gustafson, Nancy F Hansen, Kevin Harnish, Christian D Haudenschild, Narinder I Heyer, Matthew M Hims, Johnny T Ho, Adrian M Horgan, Katya Hoschler, Steve Hurwitz, Denis V Ivanov, Maria Q Johnson, Terena James, T. A. 
Huw Jones, Gyoung-Dong Kang, Tzvetana H Kerelska, Alan D Kersey, Irina Khrebtukova, Alex P Kindwall, Zoya Kingsbury, Paula I Kokko-Gonzales, Anil Kumar, Marc A Laurent, Cynthia T Lawley, Sarah E Lee, Xavier Lee, Arnold K Liao, Jennifer A Loch, Mitch Lok, Shujun Luo, Radhika M Mammen, John W Martin, Patrick G McCauley, Paul McNitt, Parul Mehta, Keith W Moon, Joe W Mullens, Taksina Newington, Zemin Ning, Bee Ling Ng, Sonia M Novo, Michael J O'Neill, Mark A Osborne, Andrew Osnowski, Omead Ostadan, Lambros L Paraschos, Lea Pickering, Andrew C Pike, Alger C Pike, D. Chris Pinkard, Daniel P Pliskin, Joe Podhasky, Victor J Quijano, Come Raczy, Vicki H Rae, Stephen R Rawlings, Ana Chiva Rodriguez, Phyllida M Roe, John Rogers, Maria C Rogert Bacigalupo, Nikolai Romanov, Anthony Romieu, Rithy K Roth, Natalie J Rourke, Silke T Ruediger, Eli Rusman, Raquel M Sanches-Kuiper, Martin R Schenker, Josefina M Seoane, Richard J Shaw, Mitch K Shiver, Steven W Short, Ning L Sizto, Johannes P Sluis, Melanie A Smith, Jean Ernest Sohna Sohna, Eric J Spence, Kim Stevens, Neil Sutton, Lukasz Szajkowski, Carolyn L Tregidgo, Gerardo Turcatti, Stephanie Vandevondele, Yuli Verhovsky, Selene M Virk, Suzanne Wakelin, Gregory C Walcott, Jingwen Wang, Graham J Worsley, Juying Yan, Ling Yau, Mike Zuerlein, Jane Rogers, James C Mullikin, Matthew E Hurles, Nick J McCooke, John S West, Frank L Oaks, Peter L Lundberg, David Klenerman, Richard Durbin, and Anthony J Smith. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–59, Nov 2008.
[19] Carolin Berner, Eva Aumüller, Anne Gnauck, Manuela Nestelberger, A. Just, and Alexander G Haslberger. Epigenetic control of estrogen receptor expression and tumor suppressor genes is modulated by bioactive food compounds. Ann Nutr Metab, 57(3-4):183–189, 2010.
[20] Donald A Berry, Constance Cirrincione, I.
Craig Henderson, Marc L Citron, Daniel R Budman, Lori J Goldstein, Silvana Martino, Edith A Perez, Hyman B Muss, Larry Norton, Clifford Hudis, and Eric P Winer. Estrogen-receptor status and outcomes of modern chemotherapy for patients with node-positive breast cancer. JAMA, 295(14):1658–1667, Apr 2006.
[21] Maria J Blanco, Gema Moreno-Bueno, David Sarrio, Annamaria Locascio, Amparo Cano, José Palacios, and M. Angela Nieto. Correlation of snail expression with histological grade and lymph node status in breast carcinomas. Oncogene, 21(20):3241–3246, May 2002.
[22] J. M. Bland and D. G. Altman. Multiple significance tests: the bonferroni method. BMJ, 310(6973):170, Jan 1995.
[23] K. I. Bland, M. M. Konstadoulakis, M. P. Vezeridis, and H. J. Wanebo. Oncogene protein co-expression. value of ha-ras, c-myc, c-fos, and p53 as prognostic discriminants for breast carcinoma. Ann Surg, 221(6):706–18; discussion 718–20, Jun 1995.
[24] Fiona M Blows, Kristy E Driver, Marjanka K Schmidt, Annegien Broeks, Flora E van Leeuwen, Jelle Wesseling, Maggie C Cheang, Karen Gelmon, Torsten O Nielsen, Carl Blomqvist, Päivi Heikkilä, Tuomas Heikkinen, Heli Nevanlinna, Lars A Akslen, Louis R Bégin, William D Foulkes, Fergus J Couch, Xianshu Wang, Vicky Cafourek, Janet E Olson, Laura Baglietto, Graham G Giles, Gianluca Severi, Catriona A McLean, Melissa C Southey, Emad Rakha, Andrew R Green, Ian O Ellis, Mark E Sherman, Jolanta Lissowska, William F Anderson, Angela Cox, Simon S Cross, Malcolm W R Reed, Elena Provenzano, Sarah-Jane Dawson, Alison M Dunning, Manjeet Humphreys, Douglas F Easton, Montserrat García-Closas, Carlos Caldas, Paul D Pharoah, and David Huntsman. Subtyping of breast cancer by immunohistochemistry to investigate a relationship between subtype and short and long term survival: a collaborative analysis of data for 10,159 cases from 12 studies. PLoS Med, 7(5):e1000279, 2010.
[25] Kerrie B Bouker, Todd C Skaar, Rebecca B Riggins, David S Harburger, David R Fernandez, Alan Zwart, Antai Wang, and Robert Clarke. Interferon regulatory factor-1 (irf-1) exhibits tumor suppressor activities in breast cancer associated with caspase activation and induction of apoptosis. Carcinogenesis, 26(9):1527–1535, Sep 2005.
[26] D. Branton, D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler, M. Di Ventra, S. Garaj, A. Hibbs, X. Huang, S. B. Jovanovich, P. S. Krstic, S. Lindsay, X. S. Ling, C. H. Mastrangelo, A. Meller, J. S. Oliver, Y. V. Pershin, J. M. Ramsey, R. Riehn, G. V. Soni, V. Tabard-Cossa, M. Wanunu, M. Wiggin, and J. A. Schloss. The potential and challenges of nanopore sequencing. Nat. Biotechnol., 26(10):1146–1153, 2008.
[27] B. S. Braun, R. Frieden, S. L. Lessnick, W. A. May, and C. T. Denny. Identification of target genes for the ewing's sarcoma ews/fli fusion protein by representational difference analysis. Mol Cell Biol, 15(8):4623–4630, Aug 1995.
[28] Miguel H. Bronchud, editor. Principles of Molecular Oncology. Humana Press, 2 edition, 1 2000.
[29] L. F. Brown, A. J. Guidi, S. J. Schnitt, L. Van De Water, M. L. Iruela-Arispe, T. K. Yeo, K. Tognazzi, and H. F. Dvorak. Vascular stroma formation in carcinoma in situ, invasive carcinoma, and metastatic carcinoma of the breast. Clin Cancer Res, 5(5):1041–1056, May 1999.
[30] Thijn R Brummelkamp, René Bernards, and Reuven Agami. A system for stable expression of short interfering rnas in mammalian cells. Science, 296(5567):550–553, Apr 2002.
[31] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. J Comput Biol, 9(2):225–242, 2002.
[32] Sarah E Burdall, Andrew M Hanby, Mark R J Lansdown, and Valerie Speirs. Breast cancer cell lines: friend or foe? Breast Cancer Res, 5(2):89–95, 2003.
[33] Yi Cai, Jianghua Wang, Rile Li, Gustavo Ayala, Michael Ittmann, and Mingyao Liu.
Ggap2/pike-a directly activates both the akt and nuclear factor-kappab pathways and promotes prostate cancer progression. Cancer Res, 69(3):819–827, Feb 2009.
[34] R. Cailleau, M. Olive, and Q. V. Cruciger. Long-term human breast carcinoma cell lines of metastatic origin: preliminary characterization. In Vitro, 14(11):911–915, Nov 1978.
[35] Xia Cao, Shuai Cheng Li, and Anthony K. H. Tung. Indexing dna sequences using q-grams. pages 4–16, 2005.
[36] L. R. Cardon and G. D. Stormo. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned dna fragments. J Mol Biol, 223(1):159–170, Jan 1992.
[37] Lisa A Carey, Charles M Perou, Chad A Livasy, Lynn G Dressler, David Cowan, Kathleen Conway, Gamze Karaca, Melissa A Troester, Chiu Kit Tse, Sharon Edmiston, Sandra L Deming, Joseph Geradts, Maggie C U Cheang, Torsten O Nielsen, Patricia G Moorman, H. Shelton Earp, and Robert C Millikan. Race, breast cancer subtypes, and survival in the carolina breast cancer study. JAMA, 295(21):2492–2502, Jun 2006.
[38] J. A. Carman and J. G. Monroe. The egr1 protein contains a discrete transcriptional regulatory domain whose deletion results in a truncated protein that blocks egr1-induced transcription. DNA Cell Biol, 14(7):581–589, Jul 1995.
[39] M. Carr, I. Hurley, K. Fowler, A. Pomiankowski, and H.K. Smith. Expression of defective proventriculus during head capsule development is conserved in drosophila and stalk-eyed flies (diopsidae). Dev Genes Evol, 215:402–9, 2005.
[40] Hexin Chen and Saraswati Sukumar. Role of homeobox genes in normal mammary gland development and breast tumorigenesis. J Mammary Gland Biol Neoplasia, 8(2):159–175, Apr 2003.
[41] A. Cheong, A.J. Bingham, J. Li, B. Kumar, P. Sukumar, C. Munsch, N.J. Buckley, C.B. Neylon, K.E. Porter, D.J. Beech, and I.C. Wood. Downregulated rest transcription factor is a switch enabling critical potassium channel expression and cell proliferation. Mol Cell, 20:45–52, 2005.
[42] J. A. Chong, J. Tapia-Ramírez, S. Kim, J. J. Toledo-Aral, Y. Zheng, M. C. Boutros, Y. M. Altshuller, M. A. Frohman, S. D. Kraner, and G. Mandel. Rest: a mammalian silencer protein that restricts sodium channel gene expression to neurons. Cell, 80(6):949–957, Mar 1995.
[43] Jesper Christensen, Karl Agger, Paul A C Cloos, Diego Pasini, Simon Rose, Lau Sennels, Juri Rappsilber, Klaus H Hansen, Anna Elisabetta Salcini, and Kristian Helin. Rbp2 belongs to a family of demethylases, specific for tri- and dimethylated lysine 4 on histone 3. Cell, 128(6):1063–1076, Mar 2007.
[44] Gerhard Christofori. Snail1 links transcriptional control with epigenetic regulation. EMBO J, 29(11):1787–1789, Jun 2010.
[45] Jonas Cicenas. The potential role of the egfr/erbb2 heterodimer in breast cancer. Expert Opinion on Therapeutic Patents, 17(6):607–616, 2007.
[46] Rachel Ann Clark, Roy Levine, and Suzanne Snedeker. The biology of breast cancer, fact sheet 5. Technical report, Cornell University, College of Veterinary Medicine, Vet Box 31, Ithaca, NY 14853-6401, October 1997. [Online; accessed 19-July-2010].
[47] P. M. Clissold and C. P. Ponting. Jmjc: cupin metalloenzyme-like domains in jumonji, hairless and phospholipase a2beta. Trends Biochem Sci, 26(1):7–9, Jan 2001.
[48] Nicole Cloonan, Alistair R R Forrest, Gabriel Kolle, Brooke B A Gardiner, Geoffrey J Faulkner, Mellissa K Brown, Darrin F Taylor, Anita L Steptoe, Shivangi Wani, Graeme Bethel, Alan J Robertson, Andrew C Perkins, Stephen J Bruce, Clarence C Lee, Swati S Ranade, Heather E Peckham, Jonathan M Manning, Kevin J McKernan, and Sean M Grimmond. Stem cell transcriptome profiling via massive-scale mrna sequencing. Nat Methods, 5(7):613–619, Jul 2008.
[49] Elisabeth D Coene, Catarina Gadelha, Nicholas White, Ashraf Malhas, Benjamin Thomas, Michael Shaw, and David J Vaux. A novel role for brca1 in regulating breast cancer cell spreading and motility. J Cell Biol, 192(3):497–512, Feb 2011.
[50] Collins.
Collins english dictionary: 30th anniversary edition (dictionary). 6 2010.
[51] Cynthia S Collins, Jiyong Hong, Lisa Sapinoso, Yingyao Zhou, Zheng Liu, Kenneth Micklash, Peter G Schultz, and Garret M Hampton. A small interfering rna screen for modulators of tumor cell motility identifies map4k4 as a promigratory kinase. Proc Natl Acad Sci U S A, 103(10):3775–3780, Mar 2006.
[52] Kathleen Collins, Tyler Jacks, and Nikola P. Pavletich. The cell cycle and cancer. Proceedings of the National Academy of Sciences of the United States of America, 94(7):2776–2778, 1997.
[53] Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res, 11(8):1425–1433, Aug 2001.
[54] Carlo M. Croce. Oncogenes and cancer. New England Journal of Medicine, 358(5):502–511, 2008.
[55] Modan K Das and Ho-Kwok Dai. A survey of dna motif finding algorithms. BMC Bioinformatics, 8 Suppl 7:S21, 2007.
[56] J. R. Daviea. Histone modifications. New Compr. Biochem., 39(03):205–240, 2004.
[57] I. de Belle, R. P. Huang, Y. Fan, C. Liu, D. Mercola, and E. D. Adamson. p53 and Egr-1 additively suppress transformed growth in HT1080 cells but Egr-1 counteracts p53-dependent apoptosis. Oncogene, 18:3633–3642, Jun 1999.
[58] Geneviève P Delcuve, Mojgan Rastegar, and James R Davie. Epigenetic control. J. Cell. Physiol., 219(2):243–50, May 2009.
[59] I. Van der Auwera, R. Limame, P. van Dam, P. B. Vermeulen, L. Y. Dirix, and S. J. Van Laere. Integrated mirna and mrna expression profiling of the inflammatory breast cancer subtype. Br J Cancer, 103(4):532–541, Aug 2010.
[60] Agata Desantis, Annalisa Onori, Maria Grazia Di Certo, Elisabetta Mattei, Maurizio Fanciulli, Claudio Passananti, and Nicoletta Corbi. Novel activation domain derived from che-1 cofactor coupled with the artificial protein jazz drives utrophin upregulation. Neuromuscul Disord, 19(2):158–162, Feb 2009.
[61] V. G. Deshpande and P. K. Ranjekar. Repetitive dna in three gramineae species with low dna content.
Hoppe Seylers Z Physiol Chem, 361(8):1223–1233, Aug 1980.
[62] Peter D'Eustachio. Reactome knowledgebase of human biological pathways and processes. Methods Mol Biol, 694:49–61, 2011.
[63] J. Dubnau and G. Struhl. Rna recognition and translational regulation by a homeodomain protein. Nature, 379(6567):694–699, Feb 1996.
[64] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1 edition, 5 1998.
[65] S. R. Eddy. Profile hidden markov models. Bioinformatics, 14(9):755–763, 1998.
[66] S. J. Elledge. Cell cycle checkpoints: preventing an identity crisis. Science, 274(5293):1664–1672, Dec 1996.
[67] I. T. Ernberg. Oncogenes and tumor growth factors in breast cancer. a minireview. Acta Oncol, 29(3):331–334, 1990.
[68] E. Ernst. Mistletoe for cancer? Eur J Cancer, 37(1):9–11, Jan 2001.
[69] Jason Ernst and Manolis Kellis. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol, 28(8):817–825, Aug 2010.
[70] Eleazar Eskin. From profiles to patterns and back again: a branch and bound algorithm for finding near optimal motif profiles. pages 115–124, 2004.
[71] Manel Esteller. Cancer epigenomics: Dna methylomes and histone-modification maps. Nat. Rev. Genet., 8(4):286–98, Apr 2007.
[72] S. Falcon and R. Gentleman. Using gostats to test gene lists for go term association. Bioinformatics, 23(2):257–258, Jan 2007.
[73] Cheng Fan, Daniel S Oh, Lodewyk Wessels, Britta Weigelt, Dimitry S A Nuyten, Andrew B Nobel, Laura J van't Veer, and Charles M Perou. Concordance among gene-expression-based predictors for breast cancer. N Engl J Med, 355(6):560–569, Aug 2006.
[74] Xiaochun Fan, Zarmik Moqtaderi, Yi Jin, Yong Zhang, X. Shirley Liu, and Kevin Struhl. Nucleosome depletion at yeast terminators is not intrinsic and can occur by a transcriptional mechanism linked to 3'-end formation.
Proc Natl Acad Sci U S A, 107(42):17945–17950, Oct 2010.
[75] M. Fanciulli, T. Bruno, M. Di Padova, R. De Angelis, S. Iezzi, C. Iacobini, A. Floridi, and C. Passananti. Identification of a novel partner of rna polymerase ii subunit 11, che-1, which interacts with and affects the growth suppression function of rb. FASEB J, 14(7):904–912, May 2000.
[76] M Faronato and JM Coulson. Rest (re1-silencing transcription factor). Atlas Genet Cytogenet Oncol Haematol, 2010.
[77] E. R. Fearon. Human cancer syndromes: clues to the origin and nature of cancer. Science, 278(5340):1043–1050, Nov 1997.
[78] M. Fedurco, A. Romieu, S. Williams, I. Lawrence, and G. Turcatti. Bta, a novel reagent for dna attachment on glass and efficient generation of solid-phase amplified dna colonies. Nucleic Acids Res., 34(3):e22, 2006.
[79] Anthony P. Fejes, Gordon Robertson, Mikhail Bilenky, Richard Varhol, Matthew Bainbridge, and Steven J. M. Jones. Findpeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics, 24(15):1729–1730, Aug 2008.
[80] B. Felding-Habermann, T. E. O'Toole, J. W. Smith, E. Fransvea, Z. M. Ruggeri, M. H. Ginsberg, P. E. Hughes, N. Pampori, S. J. Shattil, A. Saven, and B. M. Mueller. Integrin activation controls metastasis in human breast cancer. Proc Natl Acad Sci U S A, 98(4):1853–1858, Feb 2001.
[81] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. pages 390–398, 2000.
[82] Paul Flicek and Ewan Birney. Sense from sequence reads: methods for alignment and assembly. Nat Methods, 6(11 Suppl):S6–S12, Nov 2009.
[83] Aristide Floridi and Maurizio Fanciulli. Che-1: a new effector of checkpoints signaling. Cell Cycle, 6(7):804–806, Apr 2007.
[84] John A Foekens, Anieta M Sieuwerts, Marcel Smid, Maxime P Look, Vanja de Weerd, Antonius W M Boersma, Jan G M Klijn, Erik A C Wiemer, and John W M Martens.
Four mirnas associated with aggressiveness of lymph node-negative, estrogen receptor-positive human breast cancer. Proc Natl Acad Sci U S A, 105(35):13021–13026, Sep 2008.
[85] Federico Forneris, Claudia Binda, Antonio Adamo, Elena Battaglioli, and Andrea Mattevi. Structural basis of lsd1-corest selectivity in histone h3 recognition. J Biol Chem, 282(28):20070–20074, Jul 2007.
[86] Mario F. Fraga and Manel Esteller. Towards the human cancer epigenome: a first draft of histone modifications. Cell Cycle, 4(10):1377–1381, Oct 2005.
[87] Yutao Fu, Manisha Sinha, Craig L Peterson, and Zhiping Weng. The insulator binding protein ctcf positions 20 nucleosomes around its binding sites across the human genome. PLoS Genet, 4(7):e1000138, 2008.
[88] G. Fuh and J. A. Wells. Prolactin receptor antagonists that inhibit the growth of breast cancer cell lines. J Biol Chem, 270(22):13133–13137, Jun 1995.
[89] P. Andrew Futreal, Lachlan Coin, Mhairi Marshall, Thomas Down, Timothy Hubbard, Richard Wooster, Nazneen Rahman, and Michael R Stratton. A census of human cancer genes. Nat Rev Cancer, 4(3):177–183, Mar 2004.
[90] J. Füllgrabe, N. Hajji, and B. Joseph. Cracking the death code: apoptosis-related histone modifications. Cell Death Differ, 17(8):1238–1243, Aug 2010.
[91] Federica Galeano, Anne Leroy, Claudia Rossetti, Irina Gromova, Philippe Gautier, Liam P Keegan, Luca Massimi, Concezio Di Rocco, Mary A O'Connell, and Angela Gallo. Human blcap transcript: new editing events in normal and cancerous tissues. Int J Cancer, 127(1):127–137, Jul 2010.
[92] A. L. Gashler, S. Swaminathan, and V. P. Sukhatme. A novel repression module, an extensive activation domain, and a bipartite nuclear localization signal defined in the immediate-early transcription factor egr-1. Mol Cell Biol, 13(8):4556–4571, Aug 1993.
[93] L. Giacinti, P.P. Claudio, M. Lopez, and A. Giordano. Epigenetic information and estrogen receptor alpha expression in breast cancer. Oncologist, 11:1–8, 2006.
[94] T. J.
Gibson and J. Spring. Genetic redundancy in vertebrates: polyploidy and persistence of genes encoding multidomain proteins. Trends Genet, 14(2):46–9; discussion 49–50, Feb 1998.
[95] C. K. Glass and M. G. Rosenfeld. The coregulator exchange in transcriptional functions of nuclear receptors. Genes Dev, 14(2):121–141, Jan 2000.
[96] M.J. Gray, J. Zhang, L.M. Ellis, G.L. Semenza, D.B. Evans, S.S. Watowich, and G.E. Gallick. Hif-1alpha, stat3, cbp/p300 and ref-1/ape are components of a transcriptional complex that regulates src-dependent hypoxia-induced expression of vegf in pancreatic and prostate carcinomas. Oncogene, 24:3110–20, 2005.
[97] Grazia Graziani, Lucio Tentori, Alessia Muzi, Matteo Vergati, Giuseppe Tringali, Giacomo Pozzoli, and Pierluigi Navarra. Evidence that corticotropin-releasing hormone inhibits cell growth of human breast cancer cells via the activation of crh-r1 receptor subtype. Mol Cell Endocrinol, 264(1-2):44–49, Jan 2007.
[98] Christopher Greenman, Philip Stephens, Raffaella Smith, Gillian L Dalgliesh, Christopher Hunter, Graham Bignell, Helen Davies, Jon Teague, Adam Butler, Claire Stevens, Sarah Edkins, Sarah O'Meara, Imre Vastrik, Esther E Schmidt, Tim Avis, Syd Barthorpe, Gurpreet Bhamra, Gemma Buck, Bhudipa Choudhury, Jody Clements, Jennifer Cole, Ed Dicks, Simon Forbes, Kris Gray, Kelly Halliday, Rachel Harrison, Katy Hills, Jon Hinton, Andy Jenkinson, David Jones, Andy Menzies, Tatiana Mironenko, Janet Perry, Keiran Raine, Dave Richardson, Rebecca Shepherd, Alexandra Small, Calli Tofts, Jennifer Varian, Tony Webb, Sofie West, Sara Widaa, Andy Yates, Daniel P Cahill, David N Louis, Peter Goldstraw, Andrew G Nicholson, Francis Brasseur, Leendert Looijenga, Barbara L Weber, Yoke-Eng Chiew, Anna DeFazio, Mel F Greaves, Anthony R Green, Peter Campbell, Ewan Birney, Douglas F Easton, Georgia Chenevix-Trench, Min-Han Tan, Sok Kean Khoo, Bin Tean Teh, Siu Tsan Yuen, Suet Yi Leung, Richard Wooster, P.
Andrew Futreal, and Michael R Stratton. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132):153–158, Mar 2007.
[99] Obi L Griffith, Stephen B Montgomery, Bridget Bernier, Bryan Chu, Katayoon Kasaian, Stein Aerts, Shaun Mahony, Monica C Sleumer, Mikhail Bilenky, Maximilian Haeussler, Malachi Griffith, Steven M Gallo, Belinda Giardine, Bart Hooghe, Peter Van Loo, Enrique Blanco, Amy Ticoll, Stuart Lithwick, Elodie Portales-Casamar, Ian J Donaldson, Gordon Robertson, Claes Wadelius, Pieter De Bleser, Dominique Vlieghe, Marc S Halfon, Wyeth Wasserman, Ross Hardison, Casey M Bergman, Steven J M Jones, and Open Regulatory Annotation Consortium. Oreganno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res, 36(Database issue):D107–D113, Jan 2008.
[100] Christian J Gruber, Doris M Gruber, Isabel M L Gruber, Fritz Wieser, and Johannes C Huber. Anatomy of the estrogen response element. Trends Endocrinol Metab, 15(2):73–78, Mar 2004.
[101] Stefan Gräf, Fiona G G Nielsen, Stefan Kurtz, Martijn A Huynen, Ewan Birney, Henk Stunnenberg, and Paul Flicek. Optimized design and assessment of whole genome tiling arrays. Bioinformatics, 23(13):i195–i204, Jul 2007.
[102] J. L. Guan. Role of focal adhesion kinase in integrin signaling. Int J Biochem Cell Biol, 29(8-9):1085–1096, 1997.
[103] Kristin C Gunsalus and Fabio Piano. Rnai as a tool to study cell biology: building the genome-phenome bridge. Curr Opin Cell Biol, 17(1):3–8, Feb 2005.
[104] Carmen Gurrola-Diaz, Jeannine Lacroix, Susanne Dihlmann, Cord-Michael Becker, and Magnus von Knebel Doeberitz. Reduced expression of the neuron restrictive silencer factor permits transcription of glycine receptor alpha1 subunit in small-cell lung cancer cells. Oncogene, 22(36):5636–5645, Aug 2003.
[105] A J Hackett, H S Smith, E L Springer, R B Owens, W A Nelson-Rees, J L Riggs, and M B Gardner.
Two syngeneic cell lines from human breast tissue: the aneuploid mammary epithelial (hs578t) and the diploid myoepithelial (hs578bst) cell lines. J. Natl. Cancer Inst., 58(6):1795–806, 1977.
[106] D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100(1):57–70, Jan 2000.
[107] M. F. Hansen and W. K. Cavenee. Tumor suppressors: recessive mutations that lead to cancer. Cell, 53(2):173–174, Apr 1988.
[108] R. K. Hansen and M. J. Bissell. Tissue architecture and breast cancer: the role of extracellular matrix and steroid hormones. Endocr Relat Cancer, 7(2):95–113, Jun 2000.
[109] Kimberly A Hartwell, Beth Muir, Ferenc Reinhardt, Anne E Carpenter, Dennis C Sgroi, and Robert A Weinberg. The spemann organizer gene, goosecoid, promotes tumor metastasis. Proc Natl Acad Sci U S A, 103(50):18969–18974, Dec 2006.
[110] Nathaniel D Heintzman, Rhona K Stuart, Gary Hon, Yutao Fu, Christina W Ching, R. David Hawkins, Leah O Barrera, Sara Van Calcar, Chunxu Qu, Keith A Ching, Wei Wang, Zhiping Weng, Roland D Green, Gregory E Crawford, and Bing Ren. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet, 39(3):311–318, Mar 2007.
[111] S. Henikoff, E. McKittrick, and K. Ahmad. Epigenetics, histone h3 variants, and the inheritance of chromatin states. Cold Spring Harb Symp Quant Biol, 69:235–243, 2004.
[112] Nicolas Herranz, Diego Pasini, Victor M Diaz, Clara Francis, Arantxa Gutierrez, Natalia Dave, Maria Escriva, Inma Hernandez-Munoz, Luciano Di Croce, Kristian Helin, Antonio García de Herreros, and Sandra Peiro. Polycomb complex 2 is required for e-cadherin repression by the snail1 transcription factor. Mol Cell Biol, 28(15):4772–4781, Aug 2008.
[113] G. Z. Hertz and G. D. Stormo. Identifying dna and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7-8):563–577, 1999.
[114] Geoff S Higgins, Adrian L Harris, Remko Prevo, Thomas Helleday, W. Gillies McKenna, and Francesca M Buffa. Overexpression of polq confers a poor prognosis in early breast cancer patients. Oncotarget, 1(3):175–184, Jul 2010.
[115] B. G. Hoffman, G. Robertson, B. Zavaglia, M. Beach, R. Cullum, S. Lee, G. Soukhatcheva, L. Li, E. D. Wederell, N. Thiessen, M. Bilenky, T. Cezard, A. Tam, B. Kamoh, I. Birol, D. Dai, Y. Zhao, M. Hirst, C. B. Verchere, C. D. Helgason, M. A. Marra, S. J. Jones, and P. A. Hoodless. Locus co-occupancy, nucleosome positioning, and H3K4me1 regulate the functionality of FOXA2-, HNF4A-, and PDX1-bound loci in islets and liver. Genome Res., 20:1037–1051, Aug 2010.
[116] K.E.V. Holde. Chromatin (Springer series in molecular biology). Springer-Verlag Berlin and Heidelberg GmbH & Co. K, 12 1989.
[117] Frederik Holst, Phillip R Stahl, Christian Ruiz, Olaf Hellwinkel, Zeenath Jehan, Marc Wendland, Annette Lebeau, Luigi Terracciano, Khawla Al-Kuraya, Fritz Jänicke, Guido Sauter, and Ronald Simon. Estrogen receptor alpha (esr1) gene amplification is frequent in breast cancer. Nat Genet, 39(5):655–660, May 2007.
[118] B. Horard and J-M. Vanacker. Estrogen receptor-related receptors: orphan receptors desperately seeking a ligand. J Mol Endocrinol, 31(3):349–357, Dec 2003.
[119] Hugo M Horlings, Anna Bergamaschi, Silje H Nordgard, Young H Kim, Wonshik Han, Dong-Young Noh, Keyan Salari, Simon A Joosse, Fabien Reyal, Ole Christian Lingjaerde, Vessela N Kristensen, Anne-Lise Børresen-Dale, Jonathan Pollack, and Marc J van de Vijver. Esr1 gene amplification in breast cancer: a common phenomenon? Nat Genet, 40(7):807–8; author reply 810–2, Jul 2008.
[120] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nat Protoc, 4(1):44–57, 2009.
[121] T. Hubbard, D. Andrews, M. Caccamo, G. Cameron, Y. Chen, M. Clamp, L. Clarke, G. Coates, T. Cox, F. Cunningham, V. Curwen, T. Cutts, T. Down, R. Durbin, X. M. Fernandez-Suarez, J. Gilbert, M. Hammond, J.
Herrero, H. Hotz, K. Howe, V. Iyer, K. Jekosch, A. Kahari, A. Kasprzyk, D. Keefe, S. Keenan, F. Kokocinsci, D. London, I. Longden, G. McVicker, C. Melsopp, P. Meidl, S. Potter, G. Proctor, M. Rae, D. Rios, M. Schuster, S. Searle, J. Severin, G. Slater, D. Smedley, J. Smith, W. Spooner, A. Stabenau, J. Stalker, R. Storey, S. Trevanion, A. Ureta-Vidal, J. Vogel, S. White, C. Woodwark, and E. Birney. Ensembl 2005. Nucleic Acids Res, 33(Database issue):D447–D453, Jan 2005.
[122] Philip Hublitz, Mareike Albert, and Antoine H F M Peters. Mechanisms of transcriptional repression by histone lysine methylation. Int. J. Dev. Biol., 53(2-3):335–54, 2009.
[123] C A Iacobuzio-Donahue. Epigenetic changes in cancer. Annu Rev Pathol, 4:229–249, 2009.
[124] Marilena V Iorio, Manuela Ferracin, Chang-Gong Liu, Angelo Veronese, Riccardo Spizzo, Silvia Sabbioni, Eros Magri, Massimo Pedriali, Muller Fabbri, Manuela Campiglio, Sylvie Ménard, Juan P Palazzo, Anne Rosenberg, Piero Musiani, Stefano Volinia, Italo Nenci, George A Calin, Patrizia Querzoli, Massimo Negrini, and Carlo M Croce. Microrna gene expression deregulation in human breast cancer. Cancer Res, 65(16):7065–7070, Aug 2005.
[125] Elizabeth Iorns, Christopher J Lord, Nicholas Turner, and Alan Ashworth. Utilizing rna interference to enhance cancer drug discovery. Nat Rev Drug Discov, 6(7):556–568, Jul 2007.
[126] Shigeki Iwase, Fei Lan, Peter Bayliss, Luis de la Torre-Ubieta, Maite Huarte, Hank H Qi, Johnathan R Whetstine, Azad Bonni, Thomas M Roberts, and Yang Shi. The x-linked mental retardation gene smcx/jarid1c defines a family of histone h3 lysine 4 demethylases. Cell, 128(6):1077–1088, Mar 2007.
[127] Rudolf Jaenisch and Adrian Bird. Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet, 33 Suppl:245–254, Mar 2003.
[128] T. Jenuwein and C. D. Allis. Translating the histone code. Science, 293(5532):1074–1080, Aug 2001.
[129] Peter A Jones and Stephen B Baylin. The epigenomics of cancer. Cell, 128(4):683–692, Feb 2007.
[130] Roy Joseph, Yuriy L Orlov, Mikael Huss, Wenjie Sun, Say Li Kong, Leena Ukil, You Fu Pan, Guoliang Li, Michael Lim, Jane S Thomsen, Yijun Ruan, Neil D Clarke, Shyam Prabhakar, Edwin Cheung, and Edison T Liu. Integrative model of genomic factors for determining binding site selection by estrogen receptor-alpha. Mol Syst Biol, 6:456, Dec 2010.
[131] Luke Jostins. Basics: Sequencing dna, part 1, april 2009.
[132] S. M. Judge and R. T. Chatterton. Progesterone-specific stimulation of triglyceride biosynthesis in a breast cancer cell line (t-47d). Cancer Res, 43(9):4407–4412, Sep 1983.
[133] Masahiro Kajita, Karissa N McClinic, and Paul A Wade. Aberrant expression of the transcription factors snail and slug alters the response to genotoxic stress. Mol Cell Biol, 24(17):7559–7566, Sep 2004.
[134] Minoru Kanehisa. The kegg database. Novartis Found Symp, 247:91–101; discussion 101–3, 119–28, 244–52, 2002.
[135] Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hirakawa. Kegg for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res, 38(Database issue):D355–D360, Jan 2010.
[136] Jin Seok Kang, Na Jin Jung, Seyl Kim, Dae Joong Kim, Dong Deuk Jang, and Ki-Hwa Yang. Downregulation of estrogen receptor alpha and beta expression in carcinogen-induced mammary gland tumors of rats. Eksp Onkol, 26(1):31–35, Mar 2004.
[137] J. Kao, K. Salari, M. Bocanegra, Y. L. Choi, L. Girard, J. Gandhi, K. A. Kwei, T. Hernandez-Boussard, P. Wang, A. F. Gazdar, J. D. Minna, and J. R. Pollack. Molecular profiling of breast cancer cell lines defines relevant tumor models and provides a resource for cancer gene discovery. PLoS ONE, 4:e6146, 2009.
[138] Amy V Kapp, Stefanie S Jeffrey, Anita Langerød, Anne-Lise Børresen-Dale, Wonshik Han, Dong-Young Noh, Ida R K Bukholm, Monica Nicolau, Patrick O Brown, and Robert Tibshirani.
Discovery and validation of breast cancer subtypes. BMC Genomics, 7:231, 2006.
[139] Juha Karkkainen. Fast bwt in small space by blockwise suffix sorting. Theoretical Computer Science, 387(3):249–257, 2007. The Burrows-Wheeler Transform.
[140] Vladimir I Kashuba, Jingfeng Li, Fuli Wang, Vera N Senchenko, Alexey Protopopov, Alena Malyukova, Alexey S Kutsenko, Elena Kadyrova, Veronika I Zabarovska, Olga V Muravenko, Alexander V Zelenin, Lev L Kisselev, Igor Kuzmin, John D Minna, Gösta Winberg, Ingemar Ernberg, Eleonora Braga, Michael I Lerman, George Klein, and Eugene R Zabarovsky. Rbsp3 (hya22) is a tumor suppressor gene implicated in major epithelial malignancies. Proc Natl Acad Sci U S A, 101(14):4906–4911, Apr 2004.
[141] Michael B Kastan and Jiri Bartek. Cell-cycle checkpoints and cancer. Nature, 432(7015):316–323, Nov 2004.
[142] Y. Katayose, M. Kim, A. N. Rakkar, Z. Li, K. H. Cowan, and P. Seth. Promoting apoptosis: a novel activity associated with the cyclin-dependent kinase inhibitor p27. Cancer Res, 57(24):5441–5445, Dec 1997.
[143] M. Katoh and M. Katoh. Comparative genomics on snai1, snai2, and snai3 orthologs. Oncol Rep, 14:1083–6, 2005.
[144] L. H. Kedes. Histone genes and histone messengers. Annu Rev Biochem, 48:837–870, 1979.
[145] U. Keich and P. A. Pevzner. Finding motifs in the twilight zone. Bioinformatics, 18(10):1374–1381, Oct 2002.
[146] W. James Kent. Blat - the blast-like alignment tool. Genome Res, 12(4):656–664, Apr 2002.
[147] W. James Kent, Charles W Sugnet, Terrence S Furey, Krishna M Roskin, Tom H Pringle, Alan M Zahler, and David Haussler. The human genome browser at ucsc. Genome Res, 12(6):996–1006, Jun 2002.
[148] I. Keydar, L. Chen, S. Karby, F. R. Weiss, J. Delarea, M. Radu, S. Chaitcik, and H. J. Brenner. Establishment and characterization of a cell line of human breast carcinoma origin. Eur J Cancer, 15(5):659–670, May 1979.
[149] Peter V Kharchenko, Michael Y Tolstorukov, and Peter J Park.
Design and analysis of chip-seq experiments for dna-binding proteins. Nat Biotechnol, 26(12):1351–1359, Dec 2008.
[150] Purvesh Khatri and Sorin Draghici. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21(18):3587–3595, Sep 2005.
[151] Mi-Jung Kim, Jae Y Ro, Sei-Hyun Ahn, Hak Hee Kim, Sung-Bae Kim, and Gyungyub Gong. Clinicopathologic significance of the basal-like subtype of breast cancer: a comparison with hormone receptor and her2/neu-overexpressing phenotypes. Hum Pathol, 37(9):1217–1226, Sep 2006.
[152] Sung-Mi Kim, Hae-Jin Kee, Nakwon Choe, Ji-Young Kim, Hoon Kook, Hyun Kook, and Sang-Beom Seo. The histone methyltransferase activity of whistle is important for the induction of apoptosis and hdac1-mediated transcriptional repression. Exp Cell Res, 313(5):975–983, Mar 2007.
[153] Robert J Klose, Eric M Kallin, and Yi Zhang. Jmjc-domain-containing proteins and histone demethylation. Nat Rev Genet, 7(9):715–727, Sep 2006.
[154] Robert J Klose, Qin Yan, Zuzana Tothova, Kenichi Yamane, Hediye Erdjument-Bromage, Paul Tempst, D. Gary Gilliland, Yi Zhang, and William G Kaelin. The retinoblastoma binding protein rbp2 is an h3k4 demethylase. Cell, 128(5):889–900, Mar 2007.
[155] A. G. Knudson. Mutation and cancer: statistical study of retinoblastoma. Proc Natl Acad Sci U S A, 68(4):820–823, Apr 1971.
[156] A. G. Knudson. Two genetic hits (more or less) to cancer. Nat Rev Cancer, 1(2):157–162, Nov 2001.
[157] Daniel C Koboldt, Li Ding, Elaine R Mardis, and Richard K Wilson. Challenges of sequencing human genomes. Brief Bioinform, 11(5):484–498, Sep 2010.
[158] Tony Kouzarides. Chromatin modifications and their function. Cell, 128(4):693–705, Feb 2007.
[159] Ana Kozomara and Sam Griffiths-Jones. mirbase: integrating microrna annotation and deep-sequencing data. Nucleic Acids Res, 39(Database issue):D152–D157, Jan 2011.
[160] Anja Krones-Herzig, Shalu Mittal, Kelly Yule, Hongyan Liang, Chris English, Rafael Urcis, Tarun Soni, Eileen D Adamson, and Dan Mercola. Early growth response 1 acts as a tumor suppressor in vivo and in vitro via regulation of p53. Cancer Res, 65(12):51335143, Jun 2005. [161] Stefan Kubicek and Thomas Jenuwein. A crack in histone lysine methylation. Cell, 119(7):903906, Dec 2004. [162] J. Kuntzer, D. Eggle, H. P. Lenhof, H. Burtscher, and S. Klostermann. The roche cancer genome database (rcgdb). data available as: Hum Mutat, 31(4):407413, 2010. Specic link to BRAC1 http://rcgdb.bioinf.uni-sb.de/MutomeWeb/MutatedCellLines? query=672. [163] M. Lachner, R. Sengupta, G. Schotta, and T. Jenuwein. Trilogies of histone lysine methylation as epigenetic landmarks of the eukaryotic genome. Cold Spring Harb Symp Quant Biol, 69:209218, 2004. Computational Biology of Transcription Factor Binding (Methods in Molecular Biology). Humana Press, 1st edition. edition, 9 2010. [164] Istvan Ladunga, editor. [165] J. R. Lambert, V. W. Bilanchone, and M. G. Cumsky. The ord1 gene encodes a transcription factor involved in oxygen regulation and is identical to ixr1, a gene that confers cisplatin sensitivity to saccharomyces cerevisiae. Proc Natl Acad Sci U S A, 91(15):73457349, Jul 1994. [166] Anja Lambrechts, Marleen Van Troys, and Christophe Ampe. The actin cytoskeleton in normal and pathological cell motility. Int J Biochem Cell Biol, 36(10):18901909, Oct 2004. [167] Fei Lan, Amanda Clair Nottke, and Yang Shi. Mechanisms involved in the regulation of histone lysine demethylases. Curr Opin Cell Biol, 20(3):316325, Jun 2008. BIBLIOGRAPHY 133 [168] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-ecient alignment of short dna sequences to the human genome. Genome Biol, 10(3):R25, 2009. 
[169] Amy L Lark, Chad A Livasy, Lynn Dressler, Dominic T Moore, Robert C Millikan, Joseph Geradts, Mary Iacocca, David Cowan, Debbie Little, Rolf J Craven, and William Cance. High focal adhesion kinase expression in invasive breast carcinomas is associated with an aggressive phenotype. Mod Pathol, 18(10):12891294, Oct 2005. [170] E. Y. Lasfargues, W. G. Coutinho, and E. S. Redeld. Isolation of two human tumor epithelial cell lines from solid breast carcinomas. J. Natl. Cancer Inst., 61(4):967 978, 1978. [171] B.V. Latinkic and J.C. Smith. Goosecoid and mix.1 repress brachyury expression and are required for head formation in xenopus. Development, 126:176979, 1999. [172] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment. Science, 262(5131):208214, Oct 1993. [173] C. E. Lawrence and A. A. Reilly. An expectation maximization (em) algorithm for the identication and characterization of common sites in unaligned biopolymer sequences. Proteins, 7(1):4151, 1990. [174] Ju Youn Lee, Ji Yeon Park, and Bin Tian. Identication of mrna polyadenylation sites in genomes using cdna sequences, expressed sequence tags, and trace. Methods Mol Biol, 419:2337, 2008. [175] M. G. Lee, C. Wynder, N. Cooch, and R. Shiekhattar. An essential role for CoREST in nucleosomal histone 3 lysine 4 demethylation. Nature, 437:432435, Sep 2005. [176] Min Gyu Lee, Jessica Norman, Ali Shilatifard, and Ramin Shiekhattar. Physical and functional association of a trimethyl h3k4 demethylase and ring6a/mblr, a polycomblike protein. Cell, 128(5):877887, Mar 2007. [177] Min Gyu Lee, Christopher Wynder, Daniel A Bochar, Mohamed-Ali Hakimi, Neil Cooch, and Ramin Shiekhattar. and deacetylase enzymes. Functional interplay between histone demethylase Mol Cell Biol, 26(17):63956402, Sep 2006. 
BIBLIOGRAPHY 134 [178] William Lee, Zhaoshi Jiang, Jinfeng Liu, Peter M Haverty, Yinghui Guan, Jeremy Stinson, Peng Yue, Yan Zhang, Krishna P Pant, Deepali Bhatt, Connie Ha, Stephanie Johnson, Michael I Kennemer, Sankar Mohan, Igor Nazarenko, Colin Watanabe, Andrew B Sparks, David S Shames, Robert Gentleman, Frederic J de Sauvage, Howard Stern, Ajay Pandita, Dennis G Ballinger, Radoje Drmanac, Zora Modrusan, Somasekar Seshagiri, and Zemin Zhang. The mutation spectrum revealed by paired genome sequences from a lung cancer patient. [179] Pascal Lefevre and Constanze Nature, 465(7297):473477, May 2010. Bonifer. Analyzing crosslinked chromatin treated with micrococcal nuclease. histone modication using Methods Mol Biol, 325:315 325, 2006. [180] Hui Sun Leong and David Kipling. Text-based over-representation analysis of microarray gene lists with annotation bias. Nucleic Acids Res, 37(11):e79, Jun 2009. [181] M. A. Lever, J. P. Th'ng, X. Sun, and M. J. Hendzel. Rapid exchange of histone h1.1 on chromatin in living human cells. Nature, 408(6814):873876, Dec 2000. [182] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, Yuan Lin, Jerey R MacDonald, Andy Wing Chun Pang, Mary Shago, Timothy B Stockwell, Alexia Tsiamouri, Vineet Bafna, Vikas Bansal, Saul A Kravitz, Dana A Busam, Karen Y Beeson, Tina C McIntosh, Karin A Remington, Josep F Abril, John Gill, Jon Borman, Yu-Hui Rogers, Marvin E Frazier, Stephen W Scherer, Robert L Strausberg, and J. Craig Venter. The diploid genome sequence of an individual human. PLoS Biol, 5(10):e254, Sep 2007. [183] M. T. Lewis. Homeobox genes in mammary gland development and neoplasia. Breast Cancer Res, 2(3):158169, 2000. [184] M. T. Lewis, S. Ross, P. A. Strickland, C. J. Snyder, and C. W. Daniel. Regulated expression patterns of irx-2, an iroquois-class homeobox gene, in the human breast. Cell Tissue Res, 296(3):549554, Jun 1999. 
[185] Timothy J Ley, Elaine R Mardis, Li Ding, Bob Fulton, Michael D McLellan, Ken Chen, David Dooling, Brian H Dunford-Shore, Sean McGrath, Matthew Hickenbotham, Lisa Cook, Rachel Abbott, David E Larson, Dan C Koboldt, Craig Pohl, Scott Smith, Amy BIBLIOGRAPHY 135 Hawkins, Scott Abbott, Devin Locke, Ladeana W Hillier, Tracie Miner, Lucinda Fulton, Vincent Magrini, Todd Wylie, Jarret Glasscock, Joshua Conyers, Nathan Sander, Xiaoqi Shi, John R Osborne, Patrick Minx, David Gordon, Asif Chinwalla, Yu Zhao, Rhonda E Ries, Jacqueline E Payton, Peter Westervelt, Michael H Tomasson, Mark Watson, Jack Baty, Jennifer Ivanovich, Sharon Heath, William D Shannon, Rakesh Nagarajan, Matthew J Walter, Daniel C Link, Timothy A Graubert, John F DiPersio, and Richard K Wilson. leukaemia genome. Dna sequencing of a cytogenetically normal acute myeloid Nature, 456(7218):6672, Nov 2008. [186] Haitao Li, Serge Ilin, Wooikoon Wang, Elizabeth M Duncan, Joanna Wysocka, C. David Allis, and Dinshaw J Patel. Molecular basis for site-specic read-out of histone h3k4me3 by the bptf phd nger of nurf. Nature, 442(7098):9195, Jul 2006. [187] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrowswheeler transform. Bioinformatics, 25(14):17541760, Jul 2009. [188] Heng Li and Nils Homer. generation sequencing. A survey of sequence alignment algorithms for next- Brief Bioinform, 11(5):473483, Sep 2010. [189] Heng Li, Jue Ruan, and Richard Durbin. Mapping short dna sequencing reads and calling variants using mapping quality scores. Genome Res, 18(11):18511858, Nov 2008. [190] Ruiqiang Li, Yingrui Li, Karsten Kristiansen, and Jun Wang. cleotide alignment program. Soap: short oligonu- Bioinformatics, 24(5):713714, Mar 2008. [191] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. Soap2: an improved ultrafast tool for short read alignment. Bioinfor- matics, 25(15):19661967, Aug 2009. 
[192] Harry M Lightfoot, Amy Lark, Chad A Livasy, Dominic T Moore, David Cowan, Lynn Dressler, Rolf J Craven, and William G Cance. Upregulation of focal adhesion kinase (fak) expression in ductal carcinoma in situ (dcis) is an early event in breast tumorigenesis. Breast Cancer Res Treat, 88(2):109116, Nov 2004. [193] Hao Lin, Zefeng Zhang, Michael Q Zhang, Bin Ma, and Ming Li. Zoom! zillions of oligos mapped. Bioinformatics, 24(21):24312437, Nov 2008. BIBLIOGRAPHY 136 [194] T. Lin, A. Ponn, X. Hu, B. K. Law, and J. Lu. Requirement of the histone demethylase lsd1 in snai1-mediated transcriptional repression during epithelial-mesenchymal transition. Oncogene, 29(35):48964904, Sep 2010. [195] Edison T Liu, Sebastian Pott, and Mikael Huss. Q&a: Chip-seq technologies and the study of gene regulation. BMC Biol, 8:56, 2010. [196] Jingbo Liu, Ya-Guang Liu, Ruochun Huang, Chen Yao, Shiyong Li, Weimin Yang, Dongzi Yang, and Ruo-Pan Huang. Concurrent down-regulation of egr-1 and gelsolin in the majority of human breast cancer cells. Cancer Genomics Proteomics, 4(6):377 385, 2007. [197] George Locke, Denis Tolkunov, Zarmik Moqtaderi, Kevin Struhl, and Alexandre V Morozov. High-throughput sequencing reveals a simple model of nucleosome energetics. Proc Natl Acad Sci U S A, 107(49):2099821003, Dec 2010. [198] Harvey Lodish, Arnold Berk, Chris A. Kaiser, Monty Krieger, Matthew P. Scott, An- Molecular Cell Biology (Lodish, thony Bretscher, Hidde Ploegh, and Paul Matsudaira. Molecular Cell Biology). W. H. Freeman, 6th edition, 6 2007. [199] Leandro A Loss, Anguraj Sadanandam, Steen Durinck, Shivani Nautiyal, Diane Flaucher, Victoria E H Carlton, Martin Moorhead, Yontao Lu, Joe W Gray, Malek Faham, Paul Spellman, and Bahram Parvin. genes in breast cancer cell lines. Prediction of epigenetically regulated BMC Bioinformatics, 11:305, 2010. [200] K. Luger, A. W. Mader, R. K. Richmond, D. F. Sargent, and T. J. Richmond. 
Crystal structure of the nucleosome core particle at 2.8 a resolution. Nature, 389(6648):251 260, Sep 1997. [201] Margus Lukk, Misha Kapushesky, Janne ¿ Nikkilà , Helen Parkinson, Goncalves, Wolfgang Huber, Esko Ukkonen, and Alvis Brazma. human gene expression. Angela A global map of Nat Biotechnol, 28(4):322324, Apr 2010. [202] Bin Ma, John Tromp, and Ming Li. Patternhunter: faster and more sensitive homology search. Bioinformatics, 18(3):440445, Mar 2002. [203] SC Macevicz. Dna sequencing by parallel oligonucleotide extensions. 1997(163):45 45, 1997. Biofutur, BIBLIOGRAPHY 137 [204] Jerey P MacKeigan, Leon O Murphy, and John Blenis. Sensitized rnai screen of human kinases and phosphatases identies new regulators of apoptosis and chemoresistance. Nat Cell Biol, 7(6):591600, Jun 2005. [205] M. Maemura and R. B. Dickson. metastasis of breast cancer? Are cellular adhesion molecules involved in the Breast Cancer Res Treat, 32(3):239260, 1994. [206] Donna Maglott, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. Entrez gene: genecentered information at ncbi. Nucleic Acids Res, 33(Database issue):D54D58, Jan 2005. [207] Shaun Mahony and Panayiotis V Benos. Stamp: a web tool for exploring dna-binding motif similarities. Nucleic Acids Res, 35(Web Server issue):W253W258, Jul 2007. [208] Lira Mamanova, Alison J Coey, Carol E Scott, Iwanka Kozarewa, Emily H Turner, Akash Kumar, Eleanor Howard, Jay Shendure, and Daniel J Turner. Target-enrichment strategies for next-generation sequencing. Nat Methods, 7(2):111118, Feb 2010. [209] Yan-Gao Man and Qing-Xiang Amy Sang. The signicance of focal myoepithelial cell layer disruptions in human breast tumor invasion: a paradigm shift from the "proteasecentered" hypothesis. Exp Cell Res, 301(2):103118, Dec 2004. [210] Elaine R Mardis. The impact of next-generation sequencing technology on genetics. Trends Genet., 24(3):13341, Mar 2008. [211] Marc Mareel and Ancy Leroy. invasion. 
Clinical, cellular, and molecular aspects of cancer Physiol Rev, 83(2):337376, Apr 2003. [212] M. Margulies, M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J. Berka, M. S. Braverman, Y.-J. Chen, Z. Chen, S. B. Dewell, L. Du, J. M. Fierro, X. V. Gomes, B. C. Godwin, W. He, S. Helgesen, C. H. Ho, C. H. Ho, G. P. Irzyk, S. C. Jando, M. L. I. Alenquer, T. P. Jarvie, K. B. Jirage, J.-B. Kim, J. R. Knight, J. R. Lanza, J. H. Leamon, S. M. Lefkowitz, M. Lei, J. Li, K. L. Lohman, H. Lu, V. B. Makhijani, K. E. McDade, M. P. McKenna, E. W. Myers, E. Nickerson, J. R. Nobile, R. Plant, B. P. Puc, M. T. Ronan, G. T. Roth, G. J. Sarkis, J. F. Simons, J. W. Simpson, M. Srinivasan, K. R. Tartaro, A. Tomasz, K. A. Vogt, G. A. Volkmer, S. H. Wang, Y. Wang, M. P. Weiner, P. Yu, R. F. Begley, and J. M. Rothberg. Genome BIBLIOGRAPHY 138 sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376 380, 2005. [213] Joan Massague, Gaorav P Gupta, and Andy Minn. Method of predicting and reducing risk of metastasis of breast cancer to lung, 2008. [214] S. Matikainen, T. Ronni, M. Hurme, R. Pine, and I. Julkunen. Retinoic acid activates interferon regulatory factor-1 gene expression in myeloid cells. Blood, 88(1):114123, Jul 1996. ¶ [215] V. Matys, E. Fricke, R. Geers, E. Gà ssling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A. E. Kel, O. V. Kel-Margoulis, D-U. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. MÃ×nch, I. Reuter, S. Rotert, H. Saxel, M. Scheer, S. Thiele, and E. Wingender. Transfac: transcriptional regulation, from patterns to proles. Nucleic Acids Res, 31(1):374378, Jan 2003. [216] J. McBryan, J. Howlin, P. A. Kenny, T. Shioda, and F. Martin. Eralpha-cited1 coregulated genes expressed during pubertal mammary gland development: implications for breast cancer prognosis. Oncogene, 26(44):64066419, Sep 2007. 
[217] Kevin Judd McKernan, Heather E Peckham, Gina L Costa, Stephen F McLaughlin, Yutao Fu, Eric F Tsung, Christopher R Clouser, Cisyla Duncan, Jerey K Ichikawa, Clarence C Lee, Zheng Zhang, Swati S Ranade, Eileen T Dimalanta, Fiona C Hyland, Tanya D Sokolsky, Lei Zhang, Andrew Sheridan, Haoning Fu, Cynthia L Hendrickson, Bin Li, Lev Kotler, Jeremy R Stuart, Joel A Malek, Jonathan M Manning, Alena A Antipova, Damon S Perez, Michael P Moore, Kathleen C Hayashibara, Michael R Lyons, Robert E Beaudoin, Brittany E Coleman, Michael W Laptewicz, Adam E Sannicandro, Michael D Rhodes, Rajesh K Gottimukkala, Shan Yang, Vineet Bafna, Ali Bashir, Andrew MacBride, Can Alkan, Jerey M Kidd, Evan E Eichler, Martin G Reese, Francisco M De La Vega, and Alan P Blanchard. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res, 19(9):15271541, Sep 2009. BIBLIOGRAPHY 139 [218] T. A. McKinsey, C. L. Zhang, and E. N. Olson. Activation of the myocyte en- hancer factor-2 transcription factor by calcium/calmodulin-dependent protein kinasestimulated binding of 14-3-3 to histone deacetylase 5. Proc Natl Acad Sci U S A, 97(26):1440014405, Dec 2000. [219] Gunter Meister, Markus Landthaler, Agnieszka Patkaniowska, Yair Dorsett, Grace Teng, and Thomas Tuschl. mirnas and sirnas. Human argonaute2 mediates rna cleavage targeted by Mol Cell, 15(2):185197, Jul 2004. [220] Eric Metzger, Axel Imhof, Dharmeshkumar Patel, Philip Kahl, Katrin Homeyer, Nicolaus Friedrichs, Judith M MÃ×ller, Holger Greschik, Jutta Kirfel, Sujuan Ji, Natalia Kunowska, Christian Beisenherz-Huss, Thomas GÃ×nther, Reinhard Buettner, and Roland SchÃ×le. Phosphorylation of histone h3t6 by pkcbeta(i) controls demethylation at histone h3k4. Nature, 464(7289):792796, Apr 2010. 
[221] Robert C Millikan, Beth Newman, Chiu-Kit Tse, Patricia G Moorman, Kathleen Conway, Lynn G Dressler, Lisa V Smith, Miriam H Labbok, Joseph Geradts, Jeannette T Bensen, Susan Jackson, Sarah Nyante, Chad Livasy, Lisa Carey, H. Shelton Earp, and Charles M Perou. Epidemiology of basal-like breast cancer. Breast Cancer Res Treat, 109(1):123139, May 2008. [222] Thomas A Milne, Yali Dou, Mary Ellen Martin, Hugh W Brock, Robert G Roeder, and Jay L Hess. target genes. Mll associates specically with a subset of transcriptionally active Proc Natl Acad Sci U S A, 102(41):1476514770, Oct 2005. [223] S. B. Montgomery, O. L. Grith, M. C. Sleumer, C. M. Bergman, M. Bilenky, E. D. Pleasance, Y. Prychyna, X. Zhang, and S. J M Jones. Oreganno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics, 22(5):637640, Mar 2006. [224] Susan E Moody, Denise Perez, Tien chi Pan, Christopher J Sarkisian, Carla P Portocarrero, Christopher J Sterner, Kathleen L Notorfrancesco, Robert D Cardi, and Lewis A Chodosh. The transcriptional repressor snail promotes mammary tumor recurrence. Cancer Cell, 8(3):197209, Sep 2005. [225] Eyal Mor, Yuval Cabilly, Yona Goldshmit, Harel Zalts, Shira Modai, Liat Edry, Orna BIBLIOGRAPHY 140 Elroy-Stein, and Noam Shomron. Species-specic microrna roles elucidated following astrocyte activation. Nucleic Acids Res, Jan 2011. [226] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, 5(7):621628, Jul 2008. [227] Ettore Mosca, Roberta Aleri, Ivan Merelli, Federica Viti, Andrea Calabria, and Luciano Milanesi. A multilevel data integration resource for breast cancer study. BMC Syst Biol, 4:76, 2010. [228] David W. Mount. Bioinformatics: Sequence and genome analysis, second edition. 7 2004. 
[229] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish Raha, Mark Gerstein, and Michael Snyder. dened by rna sequencing. The transcriptional landscape of the yeast genome Science, 320(5881):13441349, Jun 2008. [230] Niranjan Nagarajan, Neil Jones, and Uri Keich. Computing the p-value of the information content from an alignment of multiple sequences. Bioinformatics, 21 Suppl 1:i311i318, Jun 2005. [231] Tatsuya Nakamura, Toshiki Mori, Shinichiro Tada, Wladyslaw Krajewski, Tanya Rozovskaia, Richard Wassell, Garrett Dubois, Alexander Mazo, Carlo M Croce, and Eli Canaani. All-1 is a histone methyltransferase that assembles a supercomplex of proteins involved in transcriptional regulation. Mol Cell, 10(5):11191128, Nov 2002. [232] S. Nandi, R. C. Guzman, and J. Yang. Hormones and mammary carcinogenesis in mice, rats, and humans: a unifying hypothesis. Proc Natl Acad Sci U S A, 92(9):36503657, Apr 1995. [233] T. Narita, N. Kawakami-Kimura, M. Sato, N. Matsuura, S. Higashiyama, N. Taniguchi, and R. Kannagi. Alteration of integrins by heparin-binding egf-like growth factor in human breast cancer cells. Oncology, 53(5):374381, 1996. [234] Martijn C. Nawijn, Andrej Alendar, and Anton Berns. For better or for worse: the role of pim oncogenes in tumorigenesis. Nat Rev Cancer, 11(1):2334, January 2011. BIBLIOGRAPHY 141 [235] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for J Mol Biol, similarities in the amino acid sequence of two proteins. 48(3):443453, Mar 1970. [236] Richard M Neve, Koei Chin, Jane Fridlyand, Jennifer Yeh, Frederick L Baehner, Tea Fevr, Laura Clark, Nora Bayani, Jean-Philippe Coppe, Frances Tong, Terry Speed, Paul T Spellman, Sandy DeVries, Anna Lapuk, Nick J Wang, Wen-Lin Kuo, Jackie L Stilwell, Daniel Pinkel, Donna G Albertson, Frederic M Waldman, Frank McCormick, Robert B Dickson, Michael D Johnson, Marc Lippman, Stephen Ethier, Adi Gazdar, and Joe W Gray. 
A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell, 10(6):51527, 2006. [237] Torsten O Nielsen, Forrest D Hsu, Kristin Jensen, Maggie Cheang, Gamze Karaca, Zhiyuan Hu, Tina Hernandez-Boussard, Chad Livasy, Dave Cowan, Lynn Dressler, Lars A Akslen, Joseph Ragaz, Allen M Gown, C. Blake Gilks, Matt van de Rijn, and Charles M Perou. Immunohistochemical and clinical characterization of the basal-like subtype of invasive breast carcinoma. Clin Cancer Res, 10(16):53675374, Aug 2004. [238] M. Angela Nieto. The snail superfamily of zinc-nger transcription factors. Nat Rev Mol Cell Biol, 3(3):155166, Mar 2002. [239] Karl P Nightingale, Susanne Gendreizig, Darren A White, Charlotte Bradbury, Florian Hollfelder, and Bryan M Turner. Cross-talk between histone modications in response to histone deacetylase inhibitors: Mll4 links histone h3 acetylation and histone h3k4 methylation. J Biol Chem, 282(7):44084416, Feb 2007. [240] Z. Ning, A. J. Cox, and J. C. Mullikin. databases. Ssaha: a fast search method for large dna Genome Res, 11(10):17251729, Oct 2001. [241] J. D. Norris, D. Fan, S. A. Kerner, and D. P. McDonnell. Identication of a third autonomous activation domain within the human estrogen receptor. Mol Endocrinol, 11(6):747754, Jun 1997. [242] D. Olmeda, M. Jordý, H. Peinado, A. Fabra, and A. Cano. Snail silencing eectively suppresses tumour growth and invasiveness. Oncogene, 26(13):18621874, Mar 2007. [243] M. V. Olson. Human genetics: Dr watson's base pairs. 2008. Nature, 452(7189):819820, BIBLIOGRAPHY 142 [244] Lezanne Ooi and Ian C Wood. Chromatin crosstalk in development and disease: lessons from rest. Nat Rev Genet, 8(7):544554, Jul 2007. [245] Cynthia Osborne, Paschal Wilson, and Debu Tripathy. Oncogenes and tumor suppressor genes in breast cancer: potential diagnostic and therapeutic applications. Oncolo- gist, 9(4):361377, 2004. 
[246] Monica Di Padova, Tiziana Bruno, Francesca De Nicola, Simona Iezzi, Carmen D'Angelo, Rita Gallo, Daniela Nicosia, Nicoletta Corbi, Annamaria Biroccio, Aristide Floridi, Claudio Passananti, and Maurizio Fanciulli. Che-1 arrests human colon carcinoma cell proliferation by displacing hdac1 from the p21waf1/cip1 promoter. J Biol Chem, 278(38):3649636504, Sep 2003. [247] Eduardo Parra and Jorge Ferreira. The eect of sirna-egr-1 and camptothecin on growth and chemosensitivity of breast cancer cell lines. Oncol Rep, 23(4):11591165, Apr 2010. [248] Chiara Pastrello, Jerry Polesel, Lara Della Puppa, Alessandra Viel, and Roberta Maestro. Association between hsa-mir-146a genotype and tumor age-of-onset in brca1/brca2-negative familial breast and ovarian cancer patients. Carcinogenesis, 31(12):21242126, Dec 2010. [249] Giulio Pavesi, Paolo Mereghetti, Giancarlo Mauri, and Graziano Pesole. Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res, 32(Web Server issue):W199W203, Jul 2004. [250] Shannon R Payne and Christopher J Kemp. Tumor suppressor genetics. Carcinogen- esis, 26(12):20312045, Dec 2005. [251] Hector Peinado, Faustino Marin, Eva Cubillo, Hans-Juergen Stark, Norbert Fusenig, M. Angela Nieto, and Amparo Cano. Snail and e47 repressors of e-cadherin induce distinct invasive and angiogenic properties in vivo. J Cell Sci, 117(Pt 13):28272839, Jun 2004. [252] Hà © ctor Peinado, Francisco Portillo, and Amparo Cano. Transcriptional regulation of cadherins during development and carcinogenesis. 2004. Int J Dev Biol, 48(5-6):365375, BIBLIOGRAPHY 143 [253] Steve Pells, editor. ular Biology). Nuclear Reprogramming: Methods and Protocols (Methods in Molec- Humana Press, 1st edition. edition, 12 2010. [254] T. V. Perneger. What's wrong with bonferroni adjustments. BMJ, 316(7139):1236 1238, Apr 1998. [255] C. M. Perou, T. Sørlie, M. B. Eisen, M. van de Rijn, S. S. Jerey, C. A. Rees, J. R. Pollack, D. T. 
Ross, H. Johnsen, L. A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S. X. Zhu, P. E. Lønning, A. L. Børresen-Dale, P. O. Brown, and D. Botstein. Molecular portraits of human breast tumours. [256] V. Petit and J. P. Thiery. Nature, 406(6797):747752, Aug 2000. Focal adhesions: structure and dynamics. Biol Cell, 92(7):477494, Oct 2000. Computational Molecular Biology: An Algorithmic Approach (Computational Molecular Biology). The MIT Press, 1 edition, 8 2000. [257] Pavel A. Pevzner. [258] S. Pietrokovski. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res, 24(19):38363845, Oct 1996. [259] Eva Pizzoferrato, Ye Liu, Andrea Gambotto, Michaele J Armstrong, Michael T Stang, William E Gooding, Sean M Alber, Stuart H Shand, Simon C Watkins, Walter J Storkus, and John H Yim. Ectopic expression of interferon regulatory factor-1 promotes human breast cancer cell death and results in reduced expression of survivin. Cancer Res, 64(22):83818388, Nov 2004. [260] Anna Portela and Manel Esteller. Epigenetic modications and human disease. Nat Biotechnol, 28(10):10571068, Oct 2010. [261] Sandra ence? Porter. Watson's genome, Scitizen, September 2007. [online] venter's genome, what's the dier- http://scitizen.com/biotechnology/ watson-s-genome-venter-s-genome-what-s-the-difference-_a-28-1038.html. [262] H. W. C. Postma. Rapid sequencing of individual dna molecules in graphene nanogaps. Nano Lett., 10(2):420 425, 2010. [263] Alkes Price, Sriram Ramabhadran, and Pavel A Pevzner. branching from sample strings. Finding subtle motifs by Bioinformatics, 19 Suppl 2:ii149ii155, Oct 2003. BIBLIOGRAPHY 144 [264] Alexandre Prieur, Franck Tirode, Pinchas Cohen, and Olivier Delattre. Ews/i-1 silencing and gene proling of ewing cells reveal downstream oncogenic pathways and a crucial role for repression of insulin-like growth factor binding protein 3. Mol Cell Biol, 24(16):72757283, Aug 2004. 
[265] Beatriz Pà © rez-CadahÃa, Bojan Drobic, Protiti Khan, Chaitra C Shivashankar, and James R Davie. Current understanding and importance of histone phosphorylation in regulating chromatin biology. Curr Opin Drug Discov Devel, 13(5):613622, Sep 2010. [266] Jane Qiu. Epigenetics: unnished symphony. Nature, 441(7090):143145, May 2006. [267] Aaron R Quinlan and Ira M Hall. Bedtools: a exible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841842, Mar 2010. [268] M. Raica, I. Jung, Anca Maria Cimpean, C. Suciu, and Anca Maria Muresan. From conventional pathologic diagnosis to the molecular classication of breast carcinoma: are we ready for the change? Rom J Morphol Embryol, 50(1):513, 2009. [269] E. A. Rakha, M. E. El-Sayed, A. R. Green, E. C. Paish, A. H. S. Lee, and I. O. Ellis. Breast carcinoma with basal dierentiation: A proposal for pathology denition based on basal cytokeratin expression. Histopathology, 50(4):434 438, 2007. [270] Kim R Rasmussen, Jens Stoye, and Eugene W Myers. nding all epsilon-matches over a given length. Ecient q-gram lters for J Comput Biol, 13(2):296308, Mar 2006. Pharmacotherapy plus endoscopic intervention is more eective than pharmacotherapy or endoscopy alone in the secondary prevention of esophageal variceal bleeding: a metaanalysis of randomized, controlled trials., volume 70. 2009. [271] M. Ravipati, S. Katragadda, P. D. Swaminathan, J. Molnar, and E. Zarling. [272] Chandan K Reddy, Yao-Chung Weng, and Hsiao-Dong Chiang. Rening motifs by improving information content scores using neighborhood prole search. Algorithms Mol Biol, 1:23, 2006. [273] Sirigiri Divijendra Natha Reddy, Kazufumi Ohshiro, Suresh K Rayala, and Rakesh Kumar. Microrna-7, a homeobox d10 target, inhibits p21-activated kinase 1 and regulates its functions. Cancer Res, 68(20):81958200, Oct 2008. BIBLIOGRAPHY 145 [274] K. L. Redmond, N. T. Crawford, H. Farmer, Z. C. D'Costa, G. J. O'Brien, N. E. Buckley, R. D. Kennedy, P. G. Johnston, D. P. 
Harkin, and P. B. Mullan. T-box 2 represses NDRG1 through an EGR1-dependent mechanism to drive the proliferation of breast cancer cells. Oncogene, 29:32523262, Jun 2010. [275] John S Reece-Hoyes, Bart Deplancke, M. Inmaculada Barrasa, Julia Hatzold, Ryan B Smit, H. Efsun Arda, Patricia A Pope, Jeb Gaudet, Barbara Conradt, and Albertha J M Walhout. The c. elegans snail homolog ces-1 can activate gene expression in vivo and share targets with bhlh transcription factors. Nucleic Acids Res, 37(11):36893698, Jun 2009. [276] JÃ×ri Reimand, Meelis Kull, Hedi Peterson, Jaanus Hansen, and Jaak Vilo. g:proler a web-based toolset for functional proling of gene lists from large-scale experiments. Nucleic Acids Res, 35(Web Server issue):W193W200, Jul 2007. [277] K.L. Rice, D.J. Izon, J. Ford, A. Boodhoo, U.R. Kees, and W.K. Greene. Overexpression of stem cell associated aldh1a1, a target of the leukemogenic transcription factor tlx1/hox11, inhibits lymphopoiesis and promotes myelopoiesis in murine hematopoietic progenitors. Leuk Res, 32:87383, 2008. [278] A. Gordon Robertson, Mikhail Bilenky, Angela Tam, Yongjun Zhao, Thomas Zeng, Nina Thiessen, Timothee Cezard, Anthony P Fejes, Elizabeth D Wederell, Rebecca Cullum, Ghia Euskirchen, Martin Krzywinski, Inanc Birol, Michael Snyder, Pamela A Hoodless, Martin Hirst, Marco A Marra, and Steven J M Jones. Genome-wide re- lationship between histone h3 lysine 4 mono- and tri-methylation and transcription factor binding. Genome Res, 18(12):19061917, Dec 2008. [279] K. D. Robertson. Dna methylation, methyltransferases, and cancer. Oncogene, 20(24):31393155, May 2001. [280] Stefan Roepcke, Steen Grossmann, Sven Rahmann, and Martin Vingron. T-reg comparator: an analysis tool for the comparison of position weight matrices. Nucleic Acids Res, 33(Web Server issue):W438W441, Jul 2005. [281] M. Ronaghi, S. Karamohamed, B. Pettersson, M. Uhlén, and P. Nyrén. Real-time dna sequencing using detection of pyrophosphate release. 1996. Anal. 
Biochem., 242(1):84 89, BIBLIOGRAPHY 146 [282] Stephen M Rumble, Phil Lacroute, Adrian V Dalca, Marc Fiume, Arend Sidow, and Michael Brudno. Shrimp: accurate mapping of short color-space reads. PLoS Comput Biol, 5(5):e1000386, May 2009. ¿ [283] Albin Sandelin, Wynand Alkema, Pà ¶ r Engstrà m, Wyeth W Wasserman, and Boris Lenhard. Jaspar: an open-access database for eukaryotic transcription factor binding proles. Nucleic Acids Res, 32(Database issue):D91D94, Jan 2004. [284] Albin Sandelin and Wyeth W Wasserman. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol, 338(2):207215, Apr 2004. [285] F. Sanger and A. R. Coulson. A rapid method for determining sequences in dna by primed synthesis with dna polymerase. J. Mol. Biol., 94(3):441 448, 1975. [286] Carla Sawan, Thomas Vaissière, Rabih Murr, and Zdenko Herceg. Epigenetic drivers and genetic passengers on the road to cancer. Mutat. Res., 642(1-2):113, Jul 2008. [287] Eric E Schadt, Steve Turner, and Andrew Kasarskis. A window into third-generation sequencing. Hum Mol Genet, 19(R2):R227R240, Oct 2010. [288] C. J. Schoenherr and D. J. Anderson. The neuron-restrictive silencer factor (nrsf ): a coordinate repressor of multiple neuron-specic genes. Science, 267(5202):13601363, Mar 1995. [289] C. J. Schoenherr, A. J. Paquette, and D. J. Anderson. target genes for the neuron-restrictive silencer factor. Identication of potential Proc. Natl. Acad. Sci. U.S.A., 93:98819886, Sep 1996. [290] Johannes H Schulte, Tobias Marschall, Marcel Martin, Philipp Rosenstiel, Pieter Mestdagh, Stefanie Schlierf, Theresa Thor, Jo Vandesompele, Angelika Eggert, Stefan Schreiber, Sven Rahmann, and Alexander Schramm. Deep sequencing reveals dierential expression of micrornas in favorable versus unfavorable neuroblastoma. Nucleic Acids Res, 38(17):59195928, Sep 2010. [291] S. P. Shah, R. D. Morin, J. Khattra, L. Prentice, T. Pugh, A. Burleigh, A. Delaney, K. 
Gelmon, R. Guliany, J. Senz, C. Steidl, R. A. Holt, S. Jones, M. Sun, G. Leung, R. Moore, T. Severson, G. A. Taylor, A. E. Teschendor, K. Tse, G. Turashvili, BIBLIOGRAPHY 147 R. Varhol, R. L. Warren, P. Watson, Y. Zhao, C. Caldas, D. Huntsman, M. Hirst, M. A. Marra, and S. Aparicio. Mutational evolution in a lobular breast tumour proled at single nucleotide resolution. Nature, 461:809813, Oct 2009. [292] S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. dbsnp: the ncbi database of genetic variation. Nucleic Acids Res, 29(1):308311, Jan 2001. [293] Yujiang Shi, Fei Lan, Caitlin Matson, Peter Mulligan, Johnathan R Whetstine, Philip A Cole, Robert A Casero, and Yang Shi. by the nuclear amine oxidase homolog lsd1. Histone demethylation mediated Cell, 119(7):941953, Dec 2004. [294] A. Sigal and V. Rotter. Oncogenic mutations of the p53 tumor suppressor: the demons of the guardian of the genome. Cancer Res, 60(24):67886793, Dec 2000. [295] Emily Singer. Sequencing tumors to target treatment. Technology review india, 2009. [296] D. J. Slamon, G. M. Clark, S. G. Wong, W. J. Levin, A. Ullrich, and W. L. McGuire. Human breast cancer: her-2/neu oncogene. correlation of relapse and survival with amplication of the Science, 235(4785):177182, Jan 1987. [297] Martha L Slattery, Erica Wol, Michael D Homan, Daniel F Pellatt, Brett Milash, and Roger K Wol. Micrornas and colon and rectal cancer: Dierential expression by tumor location and subtype. Genes Chromosomes Cancer, Dec 2010. [298] T. F. Smith and M. S. Waterman. Identication of common molecular subsequences. J Mol Biol, 147(1):195197, Mar 1981. [299] N. R. Soman, P. Correa, B. A. Ruiz, and G. N. Wogan. The tpr-met oncogenic rearrangement is present and expressed in human gastric carcinoma and precursor lesions. Proc Natl Acad Sci U S A, 88(11):48924896, Jun 1991. [300] H. Song, X. Jin, and J. Lin. Stat3 upregulates mek5 expression in human breast cancer cells. 
Oncogene, 23:83019, 2004. [301] Wiley W. Souba and Douglas W. Wilmore, editors. 1st edition, 2 2001. Surgical Research. Academic Press, BIBLIOGRAPHY 148 [302] H. D. Soule, J. Vazguez, A. Long, S. Albert, and M. Brennan. A human cell line from a pleural eusion derived from a breast carcinoma. J Natl Cancer Inst, 51(5):14091416, Nov 1973. [303] B. D. Strahl and C. D. Allis. The language of covalent histone modications. Nature, 403(6765):4145, Jan 2000. [304] Michael R. Stratton. Exploring the genomes of cancer cells: Progress and promise. Science, 331(6024):15531558, 2011. [305] Xiaohua Su, Deepavali Chakravarti, Min Soon Cho, Lingzhi Liu, Young Jin Gi, Yu-Li Lin, Marco L Leung, Adel El-Naggar, Chad J Creighton, Milind B Suraokar, Ignacio Wistuba, and Elsa R Flores. Tap63 suppresses metastasis through coordinate regulation of dicer and mirnas. Nature, 467(7318):986990, Oct 2010. [306] Zu-Wen Sun and C. David Allis. Ubiquitination of histone h2b regulates h3 methylation and gene silencing in yeast. Nature, 418(6893):104108, Jul 2002. [307] A. H. Swirno, E. D. Apel, J. Svaren, B. R. Sevetson, D. B. Zimonjic, N. C. Popescu, and J. Milbrandt. Nab1, a corepressor of ng-a (egr-1), contains an active transcriptional repression domain. Mol Cell Biol, 18(1):512524, Jan 1998. [308] T. Súrlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. van de Rijn, S. S. Jerey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown, D. Botstein, P. Eystein Lúnning, and A. L. Búrresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A, 98(19):1086910874, Sep 2001. [309] Rulla M Tamimi, Heather J Baer, Jonathan Marotti, Mark Galan, Laurie Galaburda, Yineng Fu, Anne C Deitz, James L Connolly, Stuart J Schnitt, Graham A Colditz, and Laura C Collins. Comparison of molecular phenotypes of ductal carcinoma in situ and invasive breast cancer. Breast Cancer Res, 10(4):R67, 2008. 
[310] H. Tanaka and T. Kawai. Partial sequencing of a single dna molecule with a scanning tunnelling microscope. Nat Nanotechnol, 4(8):518 522, 2009. BIBLIOGRAPHY 149 [311] M. Tanaka, M. Schinke, H. S. Liao, N. Yamasaki, and S. Izumo. Nkx2.5 and nkx2.6, homologs of drosophila tinman, are required for development of the pharynx. Mol Cell Biol, 21(13):43914398, Jul 2001. [312] Xiaoqing Tian and Jingyuan Fang. Current perspectives on histone demethylases. Acta Biochim Biophys Sin (Shanghai), 39(2):8188, Feb 2007. [313] Martin Tompa, Nan Li, Timothy L Bailey, George M Church, Bart De Moor, Eleazar Eskin, Alexander V Favorov, Martin C Frith, Yutao Fu, W. James Kent, Vsevolod J Makeev, Andrei A Mironov, William Staord Noble, Giulio Pavesi, Graziano Pesole, Mireille Rà © gnier, Nicolas Simonis, Saurabh Sinha, Gert Thijs, Jacques van Helden, Mathias Vandenbogaert, Zhiping Weng, Christopher Workman, Chun Ye, and Zhou Zhu. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol, 23(1):137144, Jan 2005. [314] Alejandro Vaquero, Alejandra Loyola, and Danny Reinberg. The constantly changing face of chromatin. Sci Aging Knowledge Environ, 2003(14):RE4, Apr 2003. [315] Ignacio Varela, Patrick Tarpey, Keiran Raine, Dachuan Huang, Choon Kiat Ong, Philip Stephens, Helen Davies, David Jones, Meng-Lay Lin, Jon Teague, Graham Bignell, Adam Butler, Juok Cho, Gillian L Dalgliesh, Danushka Galappaththige, Chris Greenman, Claire Hardy, Mingming Jia, Calli Latimer, King Wai Lau, John Marshall, Stuart McLaren, Andrew Menzies, Laura Mudie, Lucy Stebbings, David A Largaespada, L. F A Wessels, Stephane Richard, Richard J Kahnoski, John Anema, David A Tuveson, Pedro A Perez-Mancera, Ville Mustonen, Andrej Fischer, David J Adams, Alistair Rust, Waraporn Chan-on, Chutima Subimerb, Karl Dykema, Kyle Furge, Peter J Campbell, Bin Tean Teh, Michael R Stratton, and P. Andrew Futreal. 
Exome sequencing identies frequent mutation of the swi/snf complex gene pbrm1 in renal carcinoma. Nature, 469(7331):539542, Jan 2011. [316] Sonia Vega, Aixa V Morales, Oscar H Ocana, Francisco Valdes, Isabel Fabregat, and M. Angela Nieto. Snail blocks the cell cycle and confers resistance to cell death. Genes Dev, 18(10):11311143, May 2004. [317] Reiner A Veitia. Dominant negative factors in health and disease. 418, Aug 2009. J Pathol, 218(4):409 BIBLIOGRAPHY 150 [318] R. I. Viji, V. B Sameer Kumar, M. S. Kiran, and P. R. Sudhakaran. response of endothelial cells to heparin-binding domain of bronectin. Angiogenic Int J Biochem Cell Biol, 40(2):215226, 2008. [319] M. Wadman. James watson's genome sequenced at high speed. Nature, 452(7189):788, 2008. [320] M. P. Wagoner, K. T. Gunsalus, B. Schoenike, A. L. Richardson, A. Friedl, and A. Roopra. The transcription factor REST is lost in aggressive breast cancer. PLoS Genet., 6:e1000979, 2010. [321] Gang G. Wang, C. David Allis, and Ping Chi. Chromatin remodeling and cancer, part ii: Atp-dependent chromatin remodeling. Trends Mol Med, 13(9):373380, Sep 2007. [322] L. Wang, Q. Wu, P. Qiu, A. Mirza, M. McGuirk, P. Kirschmeier, J. R. Greene, Y. Wang, C. B. Pickett, and S. Liu. Analyses of p53 target genes in the human genome by bioinformatic and microarray approaches. J Biol Chem, 276(47):4360443610, Nov 2001. [323] Ting Wang and Gary D Stormo. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics, 19(18):23692380, Dec 2003. [324] Zhibin Wang, Chongzhi Zang, Jerey A Rosenfeld, Dustin E Schones, Artem Barski, Suresh Cuddapah, Kairong Cui, Tae-Young Roh, Weiqun Peng, Michael Q Zhang, and Keji Zhao. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat. Genet., 40(7):897903, 2008. [325] Zhong Wang, Mark Gerstein, and Michael Snyder. Rna-seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):5763, Jan 2009. 
[326] Wyeth W Wasserman and Albin Sandelin. Applied bioinformatics for the identication of regulatory elements. Nat Rev Genet, 5(4):276287, Apr 2004. [327] Robert A Waterland and Randy L Jirtle. Transposable elements: targets for early nutritional eects on epigenetic gene regulation. 2003. Mol Cell Biol, 23(15):52935300, Aug BIBLIOGRAPHY 151 [328] Ian C G Weaver, Frances A Champagne, Shelley E Brown, Sergiy Dymov, Shakti Sharma, Michael J Meaney, and Moshe Szyf. Reversal of maternal programming of stress responses in adult ospring through methyl supplementation: altering epigenetic marking later in life. J Neurosci, 25(47):1104511054, Nov 2005. [329] A. Wellstein, W. J. Fang, A. Khatri, Y. Lu, S. S. Swain, R. B. Dickson, J. Sasse, A. T. Riegel, and M. E. Lippman. A heparin-binding growth factor secreted from breast cancer cells homologous to a developmentally regulated cytokine. J Biol Chem, 267(4):25822587, Feb 1992. [330] Thomas F Westbrook, Guang Hu, Xiaolu L Ang, Peter Mulligan, Natalya N Pavlova, Anthony Liang, Yumei Leng, Rene Maehr, Yang Shi, J. Wade Harper, and Stephen J Elledge. Scfbeta-trcp controls oncogenic transformation and neural dierentiation through rest degradation. Nature, 452(7185):370374, Mar 2008. [331] David A. Wheeler, Maithreyan Srinivasan, Michael Egholm, Yufeng Shen, Lei Chen, Amy McGuire, Wen He, Yi-Ju Chen, Vinod Makhijani, G. Thomas Roth, Xavier Gomes, Karrie Tartaro, Faheem Niazi, Cynthia L. Turcotte, Gerard P. Irzyk, James R. Lupski, Craig Chinault, Xing-zhi Song, Yue Liu, Ye Yuan, Lynne Nazareth, Xiang Qin, Donna M. Muzny, Marcel Margulies, George M. Weinstock, Richard A. Gibbs, and Jonathan M. Rothberg. The complete genome of an individual by massively parallel dna sequencing. Nature, 452(7189):872876, Apr 2008. 
[332] David L Wheeler, Tanya Barrett, Dennis A Benson, Stephen H Bryant, Kathi Canese, Deanna M Church, Michael DiCuccio, Ron Edgar, Scott Federhen, Wolfgang Helmberg, David L Kenton, Oleg Khovayko, David J Lipman, Thomas L Madden, Donna R Maglott, James Ostell, Joan U Pontius, Kim D Pruitt, Gregory D Schuler, Lynn M Schriml, Edwin Sequeira, Steven T Sherry, Karl Sirotkin, Grigory Starchenko, Tugba O Suzek, Roman Tatusov, Tatiana A Tatusova, Lukas Wagner, and Eugene Yaschenko. Database resources of the national center for biotechnology information. Nucleic Acids Res, 33(Database issue):D39D45, Jan 2005. ¶ [333] Nava Whiteford, Tom Skelly, Christina Curtis, Matt E Ritchie, Andrea Là hr, Alexander Wait Zaranek, Irina Abnizova, and Clive Brown. Swift: primary data analysis for the illumina solexa sequencing platform. Bioinformatics, 25(17):21942199, Sep 2009. BIBLIOGRAPHY 152 [334] Brian T Wilhelm, Samuel Marguerat, Stephen Watt, Falk Schubert, Valerie Wood, Ian Goodhead, Christopher J Penkett, Jane Rogers, and JÃ×rg Bà ¿ hler. Dynamic reper- toire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453(7199):12391243, Jun 2008. ¢ ¶ [335] Laura D Wood, D. Williams Parsons, Sià n Jones, Jimmy Lin, Tobias Sjà blom, Rebecca J Leary, Dong Shen, Simina M Boca, Thomas Barber, Janine Ptak, Natalie Silliman, Steve Szabo, Zoltan Dezso, Vadim Ustyanksky, Tatiana Nikolskaya, Yuri Nikolsky, Rachel Karchin, Paul A Wilson, Joshua S Kaminker, Zemin Zhang, Randal Croshaw, Joseph Willis, Dawn Dawson, Michail Shipitsin, James K V Willson, Saraswati Sukumar, Kornelia Polyak, Ben Ho Park, Charit L Pethiyagoda, P. V Krishna Pant, Dennis G Ballinger, Andrew B Sparks, James Hartigan, Douglas R Smith, Erick Suh, Nickolas Papadopoulos, Phillip Buckhaults, Sanford D Markowitz, Giovanni Parmigiani, Kenneth W Kinzler, Victor E Velculescu, and Bert Vogelstein. The genomic landscapes of human breast and colorectal cancers. Science, 318(5853):1108 1113, Nov 2007. 
[336] Kenichi Yamane, Keisuke Tateishi, Robert J Klose, Jia Fang, Laura A Fabrizio, Hediye Erdjument-Bromage, Joyce Taylor-Papadimitriou, Paul Tempst, and Yi Zhang. Plu-1 is an h3k4 demethylase involved in transcriptional repression and breast cancer cell proliferation. Mol Cell, 25(6):801812, Mar 2007. [337] Maojun Yang, Christian B Gocke, Xuelian Luo, Dominika Borek, Diana R Tomchick, Mischa Machius, Zbyszek Otwinowski, and Hongtao Yu. Structural basis for corestdependent demethylation of nucleosomes by the human lsd1 histone demethylase. Mol Cell, 23(3):377387, Aug 2006. [338] Fruma Yehiely, Jose V Moyano, Joseph R Evans, Torsten O Nielsen, and Vincent L Cryns. Deconstructing the molecular portrait of basal-like breast cancer. Trends Mol Med, 12(11):537544, Nov 2006. [339] Hong Yu, Shanshan Zhu, Bing Zhou, Huiling Xue, and Jing-Dong J Han. Infer- ring causal relationships among dierent histone modications and gene expression. Genome Res., 18(8):131424, Aug 2008. [340] Hua Yu, Marcin Kortylewski, and Drew Pardoll. Crosstalk between cancer and immune BIBLIOGRAPHY 153 cells: role of stat3 in the tumour microenvironment. Nat Rev Immunol, 7(1):4151, Jan 2007. [341] J. S. Yu, S. Koujak, S. Nagase, C-M. Li, T. Su, X. Wang, M. Keniry, L. Memeo, A. Rojtman, M. Mansukhani, H. Hibshoosh, B. Tycko, and R. Parsons. Pcdh8, the human homolog of papc, is a candidate tumor suppressor of breast cancer. Oncogene, 27(34):46574665, Aug 2008. [342] N. Zhang, W. Shen, R. G. Hawley, and M. Lu. Hox11 interacts with ctf1 and mediates hematopoietic precursor cell immortalization. Oncogene, 18(13):22732279, Apr 1999. [343] Y. Zhang and D. Reinberg. Transcription regulation by histone methylation: interplay between dierent covalent modications of the core histone tails. Genes Dev, 15(18):23432360, Sep 2001. [344] Yupeng Zheng, Sam John, James J Pesavento, Jennifer R Schultz-Norton, R. Louis Schiltz, Sonjoon Baek, Ann M Nardulli, Gordon L Hager, Neil L Kelleher, and Craig A Mizzen. 
Histone h1 phosphorylation is associated with transcription by rna poly- merases i and ii. J Cell Biol, 189(3):407415, May 2010. [345] Qin Zhou, Jinjin Fan, Xuebing Ding, Wenxing Peng, Xueqing Yu, Yueqin Chen, and Jing Nie. Tgf-beta-induced mir-491-5p expression promotes par-3 degradation in rat proximal tubular epithelial cells. J Biol Chem, 285(51):4001940027, Dec 2010. [346] XiaoGuang Zhou, LuFeng Ren, YunTao Li, Meng Zhang, YuDe Yu, and Jun Yu. The next-generation sequencing technology: a technology review and future perspective. Sci China Life Sci, 53(1):4457, Jan 2010. [347] Surekha M. Zingde. 2001. Cancer genes. Current Science, 81(5):5085141, September 10