DATAFLOW AND BIOINFORMATICS IN CLINICAL CANCER RESEARCH … AND SOME R STUFF.
R FOR CLINICAL BREAST CANCER RESEARCH
R in Genomics, 2015-03-23
Daniel Klevebring, PhD
daniel.klevebring@ki.se

CLINICAL SEQUENCING OF CANCER

Most overused slide in genomics

ClinSeq overall objectives and short term goals
Overall objectives
• Perform disease subtyping at primary diagnosis with the goal to replace and improve standard diagnostics
• Ensure rapid adoption of new research findings
• Ensure rapid inclusion of patients in clinical trials
Short term goals
• Include first prospective cases in breast cancer and AML by Q1 2015
• Develop a clinical pipeline for cancer genomics with capacity to handle 5000+ cancers yearly

ClinSeq pipeline
• Patient value
• Research opportunities

ClinSeq clinical collaborations and funding
Clinical collaborations
• AML
  – Sören Lehmann, hematologist
  – Christer Nilsson, hematologist
• Breast cancer
  – Jonas Bergh, oncologist
  – Johan Hartman, pathologist
  – Kamila Czene, epidemiologist
  – Lorand Kis, pathologist
  – Jan Frisell, surgeon
  – Irma Fredriksson, surgeon
• Ovarian cancer
  – Henrik Falconer, surgeon
  – Hanna Dahlstrand, oncologist
  – Joseph Carlson, pathologist
• Colorectal cancer
  – Anna Martling, surgeon
  – Maria Gustavsson Liljefors, oncologist
  – Sam Ghazi, pathologist
• Pancreatic cancer
  – Matthias Löhr, professor
  – Caroline Verbeke, pathologist
  – Marco Del Chiaro, surgeon
Funding (logos shown on slide)

This can happen within one patient
[Figure: several samples from one patient. Blood (normal) and tissue (tumor), the latter as a frozen piece (/RNAlater/FFPE).]
Examples
• Multifocal tumor
• Contralateral BC
• Primary tumor and local met (node)
• Primary tumor and recurrence
• Primary tumor and metastasis

Referrals ClinSeq Breast
[Figure: two KI Biobank sampling referral forms ("Remiss för provtagning"), translated from Swedish: "ClinSeq Bröst - Tumörpreparat" (tumor specimen) and "ClinSeq Bröst - Blodprov" (blood sample).]

Tumor specimen referral, sampling instructions:
• Check the patient's identity and fill in the details on the referral. Do not forget to fill in the date and time of surgery and of the pathology work.
• Label 1-4 tubes with barcode labels. NB: use the right tube for the right tumor. In the "sample pieces" field, enter the number of pieces taken (1-4).
• Tumor specimen:
  1. Take a scrape from the tumor and wipe it off on the inside of the rim of an RNAlater tube.
  2. Screw on the lid and gently invert the tube about 10 times.
  3. Place in a refrigerator (+4 degrees).

Blood referral, sampling instructions for Karolinska University Hospital Solna:
  1. Check the patient's identity and fill in the details on the referral. Do not forget to fill in the date and time.
  2. Label one tube with a barcode label.
  3. Tube: one 4 mL EDTA tube, purple cap.
  4. Draw the sample as in ordinary venipuncture.
  5. Gently invert the tube about 10 times.
  6. Send the sample to KI Biobank according to standard routines.

On both forms:
• Questions: contact Carin Cavalli-Björkman, tel: 08-524 867 97
• Study information: approved by the Ethical Review Board (Etikprövningsnämnden), ref. no. 2011/1020-32 and 2013/1833-31/2
• Receiving biobank: KI Biobank (reg. no. 222), Karolinska Institutet
• Form fields: combination code (Kombikakod), participant's personal identity number, sex (female/male), side (left/right), number of sample pieces, samples 1-4, sampling date, sampling time (24-hour clock), time of sample handling at pathology, sampler's signature, sampler's phone number, comment
• Filled in by the sampler and the pathologist. Machine-read, so write clearly.
• Place the label straight on the tube.
[Figure: barcode label sheets. Each referral ID (RemissID, rid) comes with one blood label (blood1, for the 4 mL EDTA tube) and four tissue labels (tissue1 to tissue4). Use the right label for the right sample.]
KI Biobank, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Box 281, 171 77 Stockholm. Tel: 08-524 823 77, ki.se/kibiobank, biobank@ki.se
Form versions 2015-03-03: ClinSeq-Bröst_blod_Karolinska Solna, ClinSeq-Bröst_tumör

[Figure: data flow. Path, blood and surgeon referrals go to KI Biobank. A cFTP push from KI Biobank feeds a local MSSQL db (BloodRefTbl, TissueRefTbl, ResultsTbl) inside the MEB firewall. SciLifeLab (seq + bioinfo) is connected through a nightly cFTP transfer.]

Bioinformatics preprocessing
• ≈1000 patients analyzed to date
• 15 Gb raw data / tumor-normal (TN) pair
• Analysis ≈ 100 CPUh / TN pair
• Uses accepted best practices for each data type
• Open source tools
• Based on GATK Queue

Generation of reference files and tool-specific indices

GRCh37 aka 1000 Genomes aka GATK bundle ref
• Included chromosomes
  – GRCh37 1-22
  – X
  – Y
    • Masked pseudoautosomal regions (PARs)
      – PAR1 chrY:10001-2649520 <-> chrX:60001-2699520
      – PAR2 chrY:59034050-59363566 <-> chrX:154931044-155260560
    • X-PARs will look diploid for men
  – MT - NC_012920.1
  – GL000191-249 (unplaced contigs)
  – Decoy sequences (d5, incl. NC_007605 (EBV)) from
    • ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/
  – Source file that we use: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/human_g1k_v37_decoy.fasta.gz
• PAR track:
    track name='PARs' description='PARs'
    chrX 60001 2699520
    chrX 154931044 155260560
    chrY 10001 2649520
    chrY 59034050 59363566

Parsing variation resource files
• dbSNP, Cosmic, ClinVar, ExAC are all great resources
• None of them follow the VCF4.1 spec, which they all claim to do
• The problem: use these files (dbSNP VCF, ClinVar VCF, Cosmic VCF, ExAC VCF) to annotate this file (sample VCF)
"How hard could this possibly be?"

Issues with annotating variants
• Multiallelic variants
  – Multiple ALT alleles on a single VCF line ("T,A")
    • (1) Some tools check the whole ALT string for identity
    • (2) Some tools ignore multiallelic variants
      – Does "A" equal "A,T" -> no -> move on – WRONG
      – Some GATK tools
    • (3) Some tools only check chr/pos for identity, without regarding REF and ALT alleles
      – Some GATK tools here as well
• Solution: split multiallelic variants into multiple lines and avoid tools in category (2)
  – Must learn how each tool operates
• Måns Magnusson's vcf_parser correctly handles FORMAT, INFO and genotypes when splitting multiallelic variants
  – https://github.com/moonso/vcf_parser

Left alignment of indels
• Indels can have multiple correct representations
• Consensus strategy is to "left align"
• GATK LeftAlignAndTrimVariants
  – When splitMultiallelics is set to true, the INFO field is not correctly split, and genotypes are dropped (set to ./.)
  – Breaks if any indel > 200 bp is found
  – Doesn't check that REF allele matches the reference sequence (by default)
  – (tested with GATK 3.3-0; bugs reported, so this can change)
• bcftools norm
  – (Related: https://github.com/arq5x/gemini/issues/346)
  – Requires that REF matches the reference sequence
  – Cosmic has some variants that map to the Y-PARs, which have reference sequence NNNN in the build we use

External VCF prep pipeline
  1. Sort in dict order: vcfsorter.pl
  2. Filter Y-PAR variants: bedtools intersect
  3. Split multiallelics: vcf_parser
  4. Left align: bcftools norm
Adrian Tan et al, Bioinformatics (2015) doi: 10.1093/bioinformatics/btv112
vcfsorter.pl by German Leparc from https://code.google.com/p/vcfsorter/

Panel, low-pass WGS, RNAseq

Sharing is caring
– [Ongoing work]
– We want our data to be accessible to others
– Two tiers required
  • Open access for non-personal data
    – Tumor-specific alterations, gene expression levels, some phenotypes
  • Controlled access
    – Genetic data (considered personal data by Swedish law)
    – Certain sensitive phenotypes
– Let's not reinvent the wheel. Others do this well
  • ICGC
– Can we submit our data to EGA?
– Legal and/or consent issues?

R in ClinSeq
• Bioconductor
  – CNAnorm, QDNAseq for CNV analyses
  – Rmarkdown and Sweave for report generation
• RStudio
  – Rmarkdown/knitr + git to version control project-specific analyses
• Hadleyverse
  – ggplot2/devtools/dplyr/reshape/tidyr/data.table

Hadleyverse
• Use cases:
  – data.table reduces reading time of RNAseq data from 30 min to <2 min
    • 400 files, each a matrix of 2 x 50000
  – tidyr and reshape enable rapid …
  – ggplot2 makes beautiful plots with powerful syntax
  – magrittr enables piping in R

Bioconductor
• Repo for HT-biology-related R packages
  – Statistical and graphical methods
  – Genome annotation
• bioconductor.org
• Updated twice per year
• Focus on packages with vignettes
• In a pipeline structure, runnable scripts are needed
  – Wouldn't it be nice if packages shipped with a #!Rscript included?
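A minimal sketch of what such a runnable script might look like, using base R only. The script name count_lines.R and the --input option are hypothetical and chosen purely for illustration; this is not taken from any Bioconductor package.

```r
#!/usr/bin/env Rscript
# Hypothetical example script (say, count_lines.R); not part of any package.
# Uses base R only, so it runs wherever Rscript is on the PATH.

# Turn c("--input", "a.txt") into list(input = "a.txt").
parse_args <- function(argv) {
  if (length(argv) == 0) return(list())
  if (length(argv) %% 2 != 0) stop("Arguments must come in '--name value' pairs.")
  idx <- seq(1, length(argv), by = 2)
  if (!all(grepl("^--", argv[idx]))) stop("Option names must start with '--'.")
  opts <- as.list(argv[idx + 1])
  names(opts) <- sub("^--", "", argv[idx])
  opts
}

# Count the lines of the file given by --input and print the count.
main <- function(argv = commandArgs(trailingOnly = TRUE)) {
  opts <- parse_args(argv)
  if (is.null(opts$input)) {
    cat("Usage: count_lines.R --input <file>\n")
    return(invisible(NULL))
  }
  n <- length(readLines(opts$input))
  cat(n, "\n")
  invisible(n)
}

# Run only when executed as a script, not in an interactive session.
if (!interactive()) main()
```

After `chmod +x count_lines.R`, a script like this could be called from a pipeline as `./count_lines.R --input some_file.tsv`. Keeping the logic in `main()` means the same file can also be sourced and tested from an R session.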
Time to befriend getopt

    library(getopt)

    # Option specification, one row per option:
    # c(long name, short flag, argmask, datatype, description)
    # argmask: 0 = no argument, 1 = required argument, 2 = optional argument
    spec <- rbind(
      c("bam",        "b", 1, "character", "Input bam file"),
      c("output",     "o", 1, "character", "Output tsv"),
      c("background", "x", 1, "character", "Background set to use, as a RData file"))
    opts <- getopt(spec)
    # The parsed values are now available as:
    # opts$bam
    # opts$output
    # opts$background

getopt, check parameters

    # Check cli parameters; stop with a clear message if one is missing
    if (is.null(opts$bam)) {
      stop("Must specify input bam file --bam/-b.")
    }
    if (is.null(opts$output)) {
      stop("Must specify output tsv file name --output/-o.")
    }
    if (is.null(opts$background)) {
      stop("Must specify background file --background/-x.")
    }

Write to file, sense .gz suffix

    ## Write to outfile, gzip if the outfile name ends with .gz.
    ## dat is the data frame produced by the analysis (not shown here).
    cat("Writing outfile...\n")
    is_gz <- grepl("\\.gz$", opts$output)
    ofile <- opts$output
    if (is_gz) {
      # Open a gzip-compressed connection instead of passing a plain file name
      ofile <- gzfile(opts$output, "w")
    }
    write.table(dat, ofile, col.names = TRUE, dec = ".", quote = FALSE,
                sep = "\t", row.names = FALSE)
    if (is_gz) {
      close(ofile)
    }

oncoprints in R
• Pretty-print matrices of genomics data
• As seen on cbioportal.org
• https://github.com/dakl/oncoprint

oncoprints in R

    library(devtools)
    install_github("dakl/oncoprint")
    library(oncoprint)
    library(ggplot2)  # provides theme(), element_text() and coord_fixed()
    data(tcga_brca)   # load example data
    # vertical x-labels
    vert_x <- theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
    oncoprint(tcga_brca) + coord_fixed() + vert_x

Questions?
daniel.klevebring@ki.se