How To QC A KiloPipeline Per Month? Michael S. Noble The Broad Institute of MIT & Harvard TCGA Steering Committee Meeting Houston, Texas April 26, 2012 Born of the desire to systematize analyses from The Cancer Genome Atlas pilot and scale their execution to the dozens of remaining diseases to be studied, now sits atop 14 terabytes of TCGA data and reliably executes more than 1000 pipelines per month. Data Snapshot Data Snapshot +821 CopyNumber Data Snapshot +821 CopyNumber +917 Methylation Data Snapshot +821 CopyNumber New datatype column +2087 protein samples +917 Methylation GDAC Session Yesterday Recommendation: formalize AWG co-chair role for each tumor type Data Coordinator: ensure best possible data/analysis outcome. YES! GDAC Session Yesterday Recommendation: formalize AWG co-chair role for each tumor type Data Coordinator: ensure best possible data/analysis outcome. YES! Firehose approximates this for 23 tumorsets simultaneously Volume, Change & Complexity analyses: stddata: 26 x 23 tumor sets / month = 598 273 platforms over 23 tumorsets x 2/month = 546 Volume, Change & Complexity analyses: stddata: 26 x 23 tumor sets / month = 598 273 platforms over 23 tumorsets x 2/month = 546 • Not even counting RPPA: ingested & almost ready • CPTAC Proteomics Consortium interested, too • Nothing on this scale ever attempted before? • But worthless if we cannot establish scientific credibility Enormous QC Challenge Computing Infrastructure Scientific Veracity Enormous QC Challenge Computing Infrastructure Firehose Genepattern HPC system (LSF) Unix filesystems DCC Mirroring Normalization Control Scripts Website Dashboards Submission etc Scientific Veracity Data How much? Formatted ok? New platforms? Combine platforms? Algorithms What knob settings? New versions? Credible results? Wired together properly? Reports ok? Scale preclude exhaustive inspection Scale preclude exhaustive inspection How We Cope 1. Automate 2. Aggregate 3. Clarify 4. Simplify Scale preclude exhaustive inspection How We Cope 1. Automate 2. Aggregate 3. Clarify 4. Simplify Now some non-trivial examples ... Automate: Continuous Unit Testing • • • • Implemented in Bamboo framework Regression tests run automatically Immediately when changes checked in for covered tools No need for CompBios / BInfs to explicitly run Daniel DiCara Automate: Continuous Unit Testing • • • • Implemented in Bamboo framework Regression tests run automatically Immediately when changes checked in for covered tools No need for CompBios / BInfs to explicitly run “Seems to find bugs before Iʼve checked in the code!” Daniel DiCara Automate: Continuous Unit Testing * Gradual but steady cultural change Where's the Real Bottleneck in Scientific Computing? www.americanscientist.org Jan/Feb 2006 • • • • Implemented in Bamboo framework Regression tests run automatically Immediately when changes checked in for covered tools No need for CompBios / BInfs to explicitly run * “Seems to find bugs before Iʼve checked in the code!” Daniel DiCara But Automation Not Enough When N things break : manually diagnose / fix Or automation N/A : eyeball clusters/plots But Automation Not Enough When N things break : manually diagnose / fix Or automation N/A : eyeball clusters/plots Aggregate: failure dashboard for all tumors & analyses Clarify: instant synoptic view of entire run Saves 100s of clicks through Firehose GUI Aggregate & Automate: Integration Testing • Unit tests must be predictable • Which means stable inputs and outputs • Essentially mandates old data • What about realtime? More current/live data? Aggregate & Automate: Integration Testing Establish that code changes play nice with rest of system Aggregate & Automate: Integration Testing brca coad gbm kirc kirp laml luad lusc ov read ... ucec Establish that code changes play nice with rest of system Across all datasets Aggregate & Automate: Integration Testing brca coad gbm kirc kirp laml luad lusc ov read ... ucec Establish that code changes play nice with rest of system Across all datasets With O’s correctly wired to I’s Aggregate & Automate: Integration Testing brca coad gbm kirc kirp laml luad lusc ov read ... ucec Establish that code changes play nice with rest of system Across all datasets With O’s correctly wired to I’s Downstream dependents correctly read outputs Aggregate & Automate: Integration Testing brca coad gbm kirc kirp laml luad lusc ov read ... ucec Establish that code changes play nice with rest of system Across all datasets With O’s correctly wired to I’s Downstream dependents correctly read outputs And remainder of workflow runs to completion Aggregate & Automate: Integration Testing brca coad gbm kirc kirp laml luad lusc ov read ... ucec Establish that code changes play nice with rest of system Across all datasets With O’s correctly wired to I’s Downstream dependents correctly read outputs And remainder of workflow runs to completion Using same automation infrastructure as production runs. Clarify : Internal Process Flow Clarify : Internal Process Flow Previous Slide Clarify & Simplify: TCGA MAF WorkFlow Nikki Schultz Broad Institute Prachi Kothiyal Heidi Sofia & Many Others Simplify : firehose_get Short 10k script stddata dashboard analyses dashboard • Download all or parts • Of data or analyses runs • Open access : no password • Select by run type & date • Subselect by tumor type • Or analyses type / name • See what runs we did • Or what tasks in each run Simplify : firehose_get Short 10k script stddata dashboard analyses dashboard • Download all or parts • Of data or analyses runs • Open access : no password • Select by run type & date • Subselect by tumor type • Or analyses type / name • See what runs we did • Or what tasks in each run Easier more eyeballs higher quality better science Published studies have that cells with mutated or methylated of ourshown studied HGS-OvCa samples had germline or somatic mutations in BRCA1 or mutated BRCA2 have homologous BRCA1/2, that defective 11% lost BRCA1 expressionrecombination through DNA hypermethy35–38 lationtoand that epigenetic of BRCA1 wasthat mutually Gene expression . Fig. 3c shows 20%exclusive of and are highly responsive PARP inhibitorssilencing 24 Low High , Fisher’smutations exact test). Univariate surBRCA1/2 mutations 5 4.4 3 10 of our studied HGS-OvCa samples had (P germline or somatic in vival analysis of BRCA1/2 status (Fig. 3c) showed better overall survival BRCA1/2, that 11% lost BRCA1 expression through DNA hypermethyTCGA training set TCGA test set for BRCA1/2 mutated cases than BRCA1/2 wild-type cases. Notably, b (Batches 17–24; 255 tumors) (Batches 9–15; 215 tumors) lation and that epigenetic silencingsilenced of BRCA1 was mutually exclusive sion epigenetically BRCA1 cases had survival similarofto BRCA1/2 Death event (red) 24 High exact test). Univariate sur-survivals of BRCA1/2 mutations (P 5 4.4 3HGS-OvCa 10 , Fisher’s wild-type tumours (respective median overall Poor prognosis 41.5 and 41.9 (Fig. months, 5 0.69, log-rank Supplementary vival analysis of BRCA1/2 status 3c)Pshowed better test; overall survival Methods, genes (107) section and Supplementary Fig. 8.13b).cases. This suggests that BRCA1 is TCGA training set TCGA test set for BRCA1/2 mutated cases8, than BRCA1/2 wild-type Notably, OV Analysis Summary Good prognosis inactivated by mutually exclusive genomic and epigenomic mechanisms genes (85) (Batches 17–24; 255 tumors) (Batches 9–15; 215 tumors) epigenetically silenced BRCA1 cases had survival similar to BRCA1/2 and that patient survival depends on the mechanism of inactivation. red) 10 Prognostic wild-type HGS-OvCaGenomic tumours (respective median overallrecombination survivals of genes that This is the analysis overview for Firehose run "21 March 2012". 0 alterations in other homologous gene score sis ....................................................................................................... –10 41.5 41.9riskmonths,might P 5render 0.69, log-rank test; Methods,in this study Lower risk Higher risk Lower risk andHigher cells sensitive to Supplementary PARP inhibitors39 discovered The most robust consensus NMF (Supplementary Methods,This section 8, andthat Supplementary section 8, and Supplementary Fig. 8.13b). suggests BRCA1 is Fig. 8.12) clustering of 565 samples using sis include amplification or and mutation of EMSY mechanisms (also known as C11orf30) c genes was d inactivated bycluster mutually exclusive genomic epigenomic TCGA testthe set1500 most variable Ref. 25 Gene 1 (8%), focal deletion or mutation of PTEN (7%), hypermethylation of identified 1for k = 3 clusters. We Expr D I M P N = 237 N = 255 and that patient survival depends on theof mechanism of inactivation. Pulse entire tumor analyses 0.8 0.8 10 computed the clustering for k = 2 C1 55 48 15 89 RAD51C (3%), mutation of ATM or ATR (2%), and mutation of There were 547 tumor samples miRNA CN to k = 8 and 0.6 0.6 used the cophenetic 0 Genomic alterations in other homologous recombination genes that 29 51 21 40 C2 cluster used in this analysis: 29 s ign Fanconi anaemia genes (5%). Overall, homologous recombination Clusters In key representational figures correlation 0.4 0.4 coefficient to 39 C3 39 37 43 20 –10 ................................................. Lower risk Higher risk Cox P =Lower risk Higher risk discovered might render cells sensitive PARP inhibitors defects to may be present in approximately halfinofthis all study HGS-OvCa cases, 0.02 determine Coxbest P = solution. 8 × 10 0.2 0.2the Peaks Log-rank P = 0.0002 Log-rank P = 0.02 With short accompanying providing a rationale for clinical trials of PARP targeting 0 0 (Supplementary Methods, section 8, and Supplementary Fig.inhibitors 8.12) text 20 40 60 20 40 60 0 0 ARTICLE RESEARCH tumours with these homologous-recombination-related aberrations. Overall survival (months) Overall survival (months) e include TCGA amplification or mutation of EMSY (also known as C11orf30) GA test set Ref. 25 Gene cluster 1 d Comparison between the complete set of BRCA inactivation events Ref. 29 Ref. 30 N = 480 1 (8%), focal deletion or mutation of PTEN (7%), hypermethylation of 0.8 I M P 1 D and all recurrently altered copy number peaks revealed an unexpectedly N1= 237 N = 255 Executive summary from N = 169 C1 55 48 15 N = 118 0.8 0.6 C1 89 RAD51C (3%), mutation of ATM or ATR (2%), and mutation ofinactivation 0.8 0.8 low frequency of CCNE1 amplification in cases with BRCA C2 miRNA 0.4 0.6 0.6 C2 40 21 51 29 C3 50+ page Nozzle reports cluster 0.6 Fanconi anaemia genes (5%). Overall, homologous recombination Survival (8% of BRCA altered cases had CCNE1 amplification whereas 26% of 0.2 0.4 0.4 0.4 Log-rank P = 0.007 C3 39 37 43 20 0 Cox P = 0.02 –5 Cox P = 0.0002 BRCA wild-type cases did; 0.0048, adjusted for false-discovery SMGs defects may40 be60present in approximately halfQof5 all HGS-OvCa cases, = 0.02 Auto-excerpted from them 20 0 0.2 Cox P = 8 × 100.2 Log-rank P = 0.001 0.2 Log-rank P = 0.005 40 Log-rank P = 0.0002 ank P = 0.02 0 0 Overall survival , overall survival tended to be lower for previously providing a (months) rationalerate). for As clinical trialsreported of PARP inhibitors targeting 0 20 40 60 20 40 60 0 0 20 40 Overall 60 survival (months) Overall survival (months) 0 40 60 0 patients with CCNE1 amplification than foraberrations. patients in all other cases tumours with these homologous-recombination-related survival (months) Overall survival (months) e TCGA (P 5 0.072, log-rank test; Supplementary Methods, section 8, and Figure 2 | Gene and miRNA expression patterns of molecular subtype and 1 Comparison between the complete of BRCA inactivation Supplementary Fig. set 8.14a). However, no survivalevents disadvantage for Ref. 29 Ref. 30 prediction in HGS-OvCa.Path NTumours = 480 outcome a, from TCGA and ref. 25 0.8 1 and all recurrently copy number peaks an unexpectedly CCNE1-amplified cases (Prevealed 5 0.24, log-rank test; Supplementary separated into four clusters on the basis of gene expression. b, Using a training altered N = 169 N = 118 0.6 C1 0.8 Methods, section 8, and Supplementary Fig. 8.14b) was apparent when low frequency ofset. CCNE1 amplification in cases with BRCA inactivation data set, a prognostic gene signature was defined and applied to a test data C2 Ways 0.4 C3 0.6 c, Kaplan–Meier analysis of four independent expression data sets, only CCNE1 at BRCA amplification wild-type cases, whereas suggesting26% that the (8%profile of BRCA alteredlooking cases had of previously 0.2 0.4 Log-rank P = 0.007 comparing survival for predicted higher-risk patients versus lower-risk reported CCNE1 survival difference can be explained by the higher 0 = 0.0002 = 0.02 BRCA wild-type cases did; Q 5 0.0048, adjusted for false-discovery 0.2 Cox Ppatients. Univariate Cox P0value20for risk 40 index 60 included. d, Tumours separated survival of40BRCA-mutated cases. ank P = 0.001 Log-rank P = 0.005 0 (months) , overall survival tended to model be lower for As previously into 40 three clusters on theOverall basis ofsurvival miRNA expression,rate). overlapping with gene- reported 41 ) to Finally, we used a probabilistic graphical (PARADIGM 20 60 0 40 60 0 based clusters as indicated. D, differentiated; I, immunoreactive; M, survival (months) Overall survival (months) patients with CCNE1search amplification for patients in all other cases for alteredthan pathways in the US National Cancer Institute mesenchymal; P, proliferative (red bold indicates high degree of overlap). 42 , and found that8,theand FOXM1 tranInteraction Database (P 5 0.072, log-rankPathway test; Supplementary Methods, section e, Differences in patient survival among the and three miRNA-based clusters. ne and miRNA expression patterns of molecular subtype factor network 3d) is disadvantage significantly altered Supplementary Fig. scription 8.14a). However, no (Fig. survival for in 87% of iction in HGS-OvCa.section a, Tumours from TCGA and ref. 25 6). Kaplan–Meier survival analysis of this signature showed cases cases (Supplementary Methods, section 10, and Supplementary Figs CCNE1-amplified (P 5 0.24, log-rank test; Supplementary four clusters on the basis of gene expression. b, Using a training statistically significant association with survival in all validation data 10.1–10.3). FOXM1 and its proliferation-related target genes, AurB Methods, section 8, and Supplementary Fig. 8.14b) wasPLK1, apparent when gnostic gene signature sets was (Fig. defined andSupplementary applied to a test data set.section 2d and Methods, 6). (AURKB), CCNB1, BIRC5, CDC25 and were consistently overier analysis of four independent expression profile data sets, looking only at BRCA wild-type cases, suggesting that the previously Negative matrix factorization consensus clustering of miRNA expressed but not altered by DNA copy number changes, indicative of 1,000 g Survival probability Clarify & Simplify : Quicklook Summaries • • • c RB and PI3K/RAS signalling RB signalling PI3K/RAS signalling 67% of cases altered 45% of cases altered NF1 7% 12% KRAS 18% 11% Amplified Amplified, mut. <1% Amplified, mut. <1% CCND2 4% 15% Upregulated RB1 AKT1 BRAF 3% 0.5% Amplified Mutated Survival probability 3% Log-rank test P = 0.0008602 60 40 DNA damage Sensors ATM 1% Mutated FA core complex 5% Mutated 20 0 Proliferation/survival 50 Inactivated d 0 50 100 Survival (months) PLK1 HR pathway 51% of cases altered ATR <1% Mutated BRCA1 23% Mutated, hypermethyl. EMSY 8% Amplified, mutated FANCD2 <1% Mutated BRCA2 11% Mutated RAD51C 3% Hypermethyl. HR-mediated repair 150 PTEN 7% Deleted • • • 84% of cases altered FOXM1 signalling 50 Activated ATM TP53 ATR RAD51 MAML1 2% JAG1 JAG2 BRCA mutated [66] BRCA1 epigenetically silenced [33] BRCA wild type [212] 80 Percentage of cases (%) NOTCH signalling 22% of cases altered 2% Epigenetic silencing via hypermethylation 0 6% Amplified Cell cycle progression Somatic mutation 100 AKT2 10% Deleted 8%, mutated 2% Amplified Germline mutation Patient survival PIK3CA Amplified BRCA2 Deleted 7%, mut. <1% Deleted 8%, mut. 4% 20% CCND1 b BRCA1 PTEN 32% CCNE1 HR alterations BRCA altered cases, N = 103 (33%) CDKN2A Downregulated 30%, deleted 2% Survival probability –5 a Survival probability 1,000 genes Proliferative Inhibition amplified / mutated NOTCH3 11% Amplified/mutated amplified / mutated Proliferation 2% mutated Amplified CHEK2 Activation MAML3 Cell cycle progression BRCA1 FOXM1 AURKB MAML2 4% CCNB1 BRCC BRCA2 BIRC5 DNA repair CDC25B Figure 3 | Altered pathways in HGS-OvCa. a, b, RB and PI3K/RAS pathways, identified by curated analysis (a), and the NOTCH pathway, identified by HOTNET analysis (b), are commonly altered. Alterations are defined by somatic mutations, DNA copy number changes or, in some cases, by significant up- or downregulation relative to expression in diploid tumours. Alteration frequencies are expressed as a percentage of all cases; activated genes are red and inactivated genes are blue. c, Genes in the homologous recombination (HR) pathway are altered in up to 49% of cases. Survival analysis of BRCA1/2 status shows a divergent outcome for BRCA1/2 mutated cases (with higher overall survival) than BRCA1/2 wild type, and that BRCA1 epigenetically silenced cases have poorer outcomes. FA, Fanconi anaemia. d, FOXM1 transcription factor network is activated in 87% of cases. Each gene is depicted as a multi-ring circle in which its copy number (outer ring) and gene expression (inner ring) are plotted such that each ‘spoke’ in the ring represents a single patient sample, with samples sorted in increasing order of FOXM1 expression. Excitatory interactions (red arrows) and inhibitory interactions (blue lines) were taken from the US National Cancer Institute Pathway Interaction Database. Dashed lines indicate transcriptional regulation. stands in striking contrast to previous TCGA findings in glioblastoma47, where there were more recurrently mutated genes with far fewer chromosome arm-level or focal SCNAs (Fig. 1a). A high prevalence of mutations and promoter methylation in putative DNA repair genes, including homologous recombination components, may explain the high prevalence of SCNAs. The mutation spectrum marks HGS-OvCa as completely distinct from other ovarian cancer histological subtypes. For example, clear-cell ovarian cancer tumours have few TP53 mutations but have recurrent ARID1A and PIK3CA mutations48–50; endometrioid ovarian cancer tumours have frequent CTNNB1, ARID1A and PIK3CA mutations and a lower rate of TP53 (refs 49, 50); and mucinous ovarian cancer tumours have prevalent KRAS mutations51. These differences between ovarian cancer subtypes METHODS SUMMARY Human interpretation needed, but not automate-able This distills into most concise aggregate form All specimens were obtained from patients with appropriate consent from the relevant institutional review board. DNA and RNA were collected from samples using the Allprep kit (Qiagen). We used commercial technology for capture and sequencing of exomes from whole-genome-amplified tumour DNA and normal DNA. DNA sequences were aligned to NCBI Build 36 of the human genome; duplicate reads were excluded from mutation calling. Validation of mutations occurred on a separate whole-genome amplification of DNA from the same tumour. Significantly mutated genes were identified by comparing them with expectation models based on the exact measured rates of specific sequence lesions. CHASM20 and MutationAssessor (Supplementary Methods, section 4) were used to identify functional mutations. GISTIC analysis of the circular-binary-segmented Agilent 1M feature copy number data was used to identify recurrent peaks by comparison with the results from the other platforms, to determine likely plat- Related Posters Poster : Poster : Poster : Poster : Engineering Firehose RNA-Seq in Firehose GDAC Interoperability Broad SNP6 Pipeline (DiCara et al) (Zhang et al) (Cerami et al) (Saksena et al) Acknowledgements Broad Michael Noble Douglas Voet Gordon Saksena Dan DiCara Kristian Cibulskis Juok Cho Rui Jing Michael Lawrence Lee Lichtenstein Pei Lin Spring Liu Aaron McKenna Sachet Shukla Raktim Sinha Andrey Sivachenko Carrie Sougnez Petar Stojanov Lihua Zhou Hailei Zhang Robert Zupko PI: Lynda Chin, Gaddy Getz Belfler-DFCI/MDACC Yonghong Xiao Juinhua Zhang Terrence Wu Harvard Peter Park Nils Gehlenborg Semin Lee Richard Park IGV & GenePattern teams @ Broad Jill Mesirov Michael Reich Peter Carr Marc-Danie Nazaire Jim Robinson Helga Thorvaldsdottir Matthew Meyerson Todd Golub Eric Lander
© Copyright 2024