R FOR CLINICAL BREAST CANCER RESEARCH

DATAFLOW AND BIOINFORMATICS IN CLINICAL CANCER RESEARCH
… AND SOME R STUFF.
R in Genomics, 20150323
Daniel Klevebring, PhD
daniel.klevebring@ki.se
CLINICAL SEQUENCING OF CANCER
Most overused slide in genomics
ClinSeq overall objectives and short term goals
Overall objectives
•  Perform disease subtyping at primary diagnosis with the goal to replace and improve standard diagnostics
•  Ensure rapid adoption of new research findings
•  Ensure rapid inclusion of patients in clinical trials
Short term goals
•  Include first prospective cases in breast cancer and AML by Q1 2015
•  Develop a clinical pipeline for cancer genomics with capacity to handle 5000+ cancers yearly
ClinSeq pipeline
Patient value
Research opportunities
ClinSeq clinical collaborations and funding
Clinical collaborations
•  AML
   –  Sören Lehmann, hematologist
   –  Christer Nilsson, hematologist
•  Breast cancer
   –  Jonas Bergh, oncologist
   –  Johan Hartman, pathologist
   –  Kamila Czene, epidemiologist
   –  Lorand Kis, pathologist
   –  Jan Frisell, surgeon
   –  Irma Fredriksson, surgeon
•  Ovarian cancer
   –  Henrik Falconer, surgeon
   –  Hanna Dahlstrand, oncologist
   –  Joseph Carlson, pathologist
•  Colorectal cancer
   –  Anna Martling, surgeon
   –  Maria Gustavsson Liljefors, oncologist
   –  Sam Ghazi, pathologist
•  Pancreatic cancer
   –  Matthias Löhr, professor
   –  Caroline Verbeke, pathologist
   –  Marco Del Chiaro, surgeon
Funding
[Funder logos on the original slide]
This can happen within one patient
[Figure labels: 1 2 3 4 5; Blood; Normal Tissue; Tumor; Frozen piece (/RNALater/FFPE)]
Examples
•  Multifocal tumor
•  Contralateral BC
•  Primary tumor and local met (node)
•  Primary tumor and recurrence
•  Primary tumor and metastasis
Referrals ClinSeq Breast
[Two scanned KI Biobank referral forms, translated from Swedish]

KI Biobank: Referral for sampling
ClinSeq Breast - Tumor specimen (ClinSeq Bröst - Tumörpreparat)
Instructions for sampling:
•  Verify the patient's identity and fill in the details on the referral.
•  Remember to fill in the date and time of surgery and of the pathology work.
•  Label 1-4 tubes with barcode labels. NB: take the right tube for the right tumor.
•  Fill in the "sample piece" field with the number of pieces taken (1-4).
Tumor specimen:
1. Take a scraping from the tumor and wipe it off on the inside of the rim of an RNAlater tube.
2. Screw on the lid and gently invert the tube about 10 times.
3. Place in a refrigerator (+4 °C).
To be filled in by the sampler and the pathologist. Machine-read, so write clearly.
Labels: tissue1, tissue2, tissue3, tissue4, each with a referral ID (rid). Take the right label for the right sample.
Ver: 2015-03-03, ClinSeq-Bröst_tumör

KI Biobank: Referral for sampling
ClinSeq Breast - Blood sample (ClinSeq Bröst - Blodprov)
Instructions for blood sampling, Karolinska University Hospital Solna:
1. Verify the patient's identity and fill in the details on the referral. Remember to fill in the date and time.
2. Label one tube with a barcode label.
   1 x 4 mL EDTA tube, purple cap
4. Draw the sample as for an ordinary venipuncture.
5. Gently invert the tube about 10 times.
6. Send the sample to KI Biobank according to the usual routines.
To be filled in by the sampler. Machine-read, so write clearly.
Label: blood1 (4 mL EDTA) with a referral ID (rid). Place the label straight on the tube.
Ver. 2015-03-03, ClinSeq-Bröst_blod_Karolinska Solna

Fields on the forms: Kombika code (Kombikakod); participant's personal identity number; sex (female/male); number of sample pieces; left/right; samples 1-4; sampling date; sampling time (24-hour clock); time of sample handling at pathology; sampler's signature; sampler's telephone number; comment; ref.
Printed on both forms: 00021001710

Contact person: Carin Cavalli-Björkman, tel: 08-524 867 97
Study information: approved by the Ethical Review Board (Etikprövningsnämnden), dnr: 2011/1020-32 and 2013/1833-31/2.
Receiving biobank: KI Biobank (reg. no. 222), Karolinska Institutet.
KI Biobank, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Box 281, 171 77 Stockholm
Tel: 08-524 823 77, ki.se/kibiobank, biobank@ki.se
[Dataflow diagram. Labels: KI Biobank; Path Referral; Blood Referral; Surgeon Referral; cFTP push from KI Biobank; BloodRefTbl; TissueRefTbl; ResultsTbl; SciLifeLab (seq + bioinfo); nightly cFTP; local MSSQL db; MEB firewall]
Bioinformatics preprocessing
•  ≈1000 patients analyzed to date
•  15 Gb raw data / TN pair
•  Analysis ≈ 100 CPUh / TN pair
•  Uses accepted best practices for each data type
•  Open source tools
•  Based on GATK Queue
•  Generation of reference files and tool-specific indices
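As a rough back-of-the-envelope check, the per-pair figures above can be combined with the stated capacity goal of 5000+ cancers yearly. The sketch below assumes one tumor-normal (TN) pair per cancer and keeps the data volume in the unit written on the slide ("Gb"); both are assumptions, not figures from the talk.

```r
# Figures quoted on the slides
cancers_per_year <- 5000  # capacity goal: "5000+ cancers yearly"
raw_per_pair     <- 15    # raw data per TN pair, in the slide's unit ("Gb")
cpuh_per_pair    <- 100   # approximate analysis cost per TN pair, in CPU hours

# Assumption: exactly one TN pair per cancer
raw_per_year  <- cancers_per_year * raw_per_pair    # same unit as raw_per_pair
cpuh_per_year <- cancers_per_year * cpuh_per_pair   # CPU hours per year

# Number of cores that would have to run around the clock all year
cores_busy <- cpuh_per_year / (24 * 365)

cat("Raw data per year:", raw_per_year, "Gb\n")
cat("CPU hours per year:", cpuh_per_year, "\n")
cat("Cores kept busy continuously:", round(cores_busy, 1), "\n")
```

Under these assumptions the pipeline would need on the order of tens of continuously busy cores, before any reruns or failed jobs are counted.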
GRCh37 aka 1000kg aka GATK bundle ref
•  Included chromosomes
   –  GRCh37 1-22
   –  X
   –  Y
      •  Masked pseudoautosomal regions (PARs)
         –  PAR1 chrY:10001-2649520 ↔ chrX:60001-2699520
         –  PAR2 chrY:59034050-59363566 ↔ chrX:154931044-155260560
      •  X-PARs will look diploid for men
   –  MT - NC_012920.1
   –  GL000191-249 (unplaced contigs)
   –  Decoy sequences (d5, incl NC_007605 (EBV)) from
      •  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/
   –  Source file that we use: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/human_g1k_v37_decoy.fasta.gz

    track name='PARs' description='PARs'
    chrX 60001 2699520
    chrX 154931044 155260560
    chrY 10001 2649520
    chrY 59034050 59363566

Parsing variation resource files
•  dbSNP, Cosmic, ClinVar, ExAC are all great resources
•  None of them follow the VCF4.1 spec, which they all claim to do
•  The problem: use these files (dbSNP VCF, ClinVar VCF, Cosmic VCF, ExAC VCF) to annotate this file (Sample VCF)
"How hard could this possibly be?"

Issues with annotating variants
•  Multiallelic variants
   –  Multiple ALT alleles on a single VCF line ("T,A")
•  (1) Some tools check the whole ALT string for identity
•  (2) Some tools ignore multiallelic variants
   –  Does "A" equal "A,T" → no → move on
   –  WRONG
   –  Some GATK tools
•  (3) Some tools only check chr/pos for identity, without regarding REF and ALT alleles
   –  Some GATK tools here as well
•  Solution: Split multiallelic variants into multiple lines and avoid tools in category (2)
   –  Must learn how each tool operates
•  Måns Magnusson's vcf_parser correctly handles FORMAT, INFO and genotypes when splitting multiallelic variants
   –  https://github.com/moonso/vcf_parser

Left alignment of indels
•  Indels can have multiple correct representations
•  Consensus strategy is to "left align"
•  GATK LeftAlignAndTrimVariants
   –  When splitMultiallelics is set to true, the INFO field is not correctly split, and genotypes are dropped (set to ./.)
   –  Breaks if any indel > 200 bps is found
   –  Doesn't check that REF allele matches the reference sequence (by default)
   –  (tested with GATK 3.3-0; bugs reported, so this can change)
•  bcftools norm
   –  Requires that REF matches the reference sequence
   –  Cosmic has some variants that map to the Y-PARs, which have reference sequence NNNN in the build we use
•  (Related https://github.com/arq5x/gemini/issues/346 )

External VCF prep pipeline
   Step                        Tool
   1. Sort in dict order       vcfsorter.pl
   2. Filter Y-PAR variants    bedtools intersect
   3. Split multiallelics      vcf_parser
   4. Left align               bcftools norm
Adrian Tan et al, Bioinformatics (2015) doi: 10.1093/bioinformatics/btv112
vcfsorter.pl by German Leparc from https://code.google.com/p/vcfsorter/

Panel
Low-pass WGS
RNAseq

Sharing is caring
–  [Ongoing work]
–  We want our data to be accessible to others
–  Two tiers required
•  Open access for non-personal data
–  Tumor-specific alterations, gene expression levels, some
phenotypes
•  Controlled access
–  Genetic data (considered personal data by Swedish law)
–  Certain sensitive phenotypes
–  Let's not reinvent the wheel - others do this well
•  ICGC
–  Can we submit our data to EGA?
–  Legal and/or consent issues?
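Returning to the multiallelic issue from the annotation slides: the idea of splitting a line with ALT "T,A" into one line per ALT allele, and then matching on chr/pos/REF/ALT, can be sketched in base R. This is only an illustration on a toy data frame. It does not touch INFO, FORMAT or genotype fields, which is the hard part that vcf_parser handles. The function name `split_multiallelic`, the toy tables and the ID `rs_hypothetical` are all made up for the example.

```r
# Hypothetical helper: one row per ALT allele (ignores INFO/FORMAT/genotypes)
split_multiallelic <- function(vcf) {
  alts <- strsplit(vcf$ALT, ",", fixed = TRUE)
  n    <- lengths(alts)
  out  <- vcf[rep(seq_len(nrow(vcf)), n), , drop = FALSE]
  out$ALT <- as.character(unlist(alts))
  rownames(out) <- NULL
  out
}

# Toy sample VCF body with one multiallelic line
sample_vcf <- data.frame(CHROM = c("1", "1"), POS = c(100L, 200L),
                         REF = c("G", "C"), ALT = c("T,A", "G"),
                         stringsAsFactors = FALSE)

# Toy annotation resource with a single biallelic record
resource <- data.frame(CHROM = "1", POS = 100L, REF = "G", ALT = "A",
                       ID = "rs_hypothetical", stringsAsFactors = FALSE)

# Comparing whole ALT strings misses the hit: "A" is not "T,A"
naive_hit <- resource$ALT == sample_vcf$ALT[1]

# After splitting, match on the full chr/pos/REF/ALT key
key <- function(v) paste(v$CHROM, v$POS, v$REF, v$ALT, sep = ":")
split_vcf <- split_multiallelic(sample_vcf)
split_vcf$ID <- resource$ID[match(key(split_vcf), key(resource))]
print(split_vcf)
```

Matching on the full key also avoids the category (3) problem, where only chr/pos are compared.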
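The "filter Y-PAR variants" step of the external VCF prep pipeline is done with bedtools intersect in the talk. Purely to illustrate the logic, the sketch below does the same filtering in base R, using the Y-PAR coordinates quoted on the reference slide. Treating those coordinates as 1-based inclusive positions, and accepting both "Y" and "chrY" as chromosome names, are assumptions. The function name `in_y_par` and the toy variants are hypothetical.

```r
# Y-PAR coordinates as quoted on the reference slide
# (assumption: 1-based, inclusive on both ends)
y_pars <- data.frame(start = c(10001L, 59034050L),
                     end   = c(2649520L, 59363566L))

# Hypothetical helper: TRUE for variants that fall inside a Y-PAR
in_y_par <- function(chrom, pos) {
  on_y <- sub("^chr", "", chrom) == "Y"
  in_region <- vapply(pos, function(p) {
    any(p >= y_pars$start & p <= y_pars$end)
  }, logical(1))
  on_y & in_region
}

# Toy variants: two inside the Y-PARs, one just outside, one on X
variants <- data.frame(CHROM = c("Y", "Y", "X", "chrY"),
                       POS   = c(10001L, 2649521L, 60001L, 59034050L),
                       stringsAsFactors = FALSE)

kept <- variants[!in_y_par(variants$CHROM, variants$POS), ]
print(kept)
```

Dropping these records up front is what lets a later bcftools norm run pass its REF check, since the masked Y-PARs are all N in this reference build.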
R in ClinSeq
•  Bioconductor
   –  CNAnorm, QDNAseq for CNV analyses
   –  Rmarkdown and Sweave for report generation
•  RStudio
   –  Rmarkdown/knitr + git to version control project-specific analyses
•  Hadleyverse
   –  ggplot2/devtools/dplyr/reshape/tidyr/data.table

Hadleyverse
•  Use cases:
   –  data.table reduces reading time of RNAseq data from 30 min to <2 min
      •  400 files, each a matrix of 2 x 50000
   –  tidyr and reshape enable rapid
   –  ggplot2 makes beautiful plots with powerful syntax
   –  magrittr enables piping in R

Bioconductor
•  Repo for HT-biology-related R packages
   –  Statistical and graphical methods
   –  Genome annotation
•  bioconductor.org
•  Updated twice per year
•  Focus on packages with vignettes
•  In a pipeline structure, runnable scripts are needed
   –  Wouldn't it be nice if packages ship with a #!Rscript included?

Time to befriend getopt

    library(getopt)

    # set variables
    # format is c(long, short, argmask, datatype, desc)
    # argmask 0=no arg, 1=req, 2=optional
    args <- rbind(
      c("bam", "b", 1, "character", "Input bam file"),
      c("output", "o", 1, "character", "Output tsv"),
      c("background", "x", 1, "character", "Background set to use, as a RData file"))
    opts <- getopt(args)
    # opts$bam
    # opts$output
    # opts$background

getopt, check parameters

    # check cli parameters
    if (is.null(opts$bam)) {
      stop("Must specify input bam file --bam/-b.")
    }
    if (is.null(opts$output)) {
      stop("Must specify output tsv file name --output/-o.")
    }
    if (is.null(opts$background)) {
      stop("Must specify background file --background/-x.")
    }

Write to file, sense .gz suffix

    ## Write to outfile, gzip if outfile ends with .gz.
    cat("Writing outfile...\n")
    ofile <- opts$output
    if (grepl("\\.gz$", opts$output)) {
      ofile <- gzfile(opts$output, "w")
    }
    write.table(dat, ofile, col.names = TRUE, dec = ".", quote = FALSE,
                sep = "\t", row.names = FALSE)
    if (grepl("\\.gz$", opts$output)) {
      close(ofile)
    }

oncoprints in R
Pretty print matrices of genomics data
As seen on cbioportal.org
https://github.com/dakl/oncoprint

oncoprints in R

    library(devtools)
    install_github("dakl/oncoprint")
    library(oncoprint)
    library(ggplot2)  # for theme(), element_text() and coord_fixed()
    data(tcga_brca)   # load example data

    # vertical x-labels
    vert_x <- theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
    oncoprint(tcga_brca) + coord_fixed() + vert_x

Questions?
daniel.klevebring@ki.se
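The parameter checks and the .gz-sensing writer from the getopt slides can be folded into two small helpers, so that each pipeline script does not repeat them. This is a minimal sketch in base R, not code from the talk. The names `require_opts` and `write_tsv` are hypothetical, and `opts` is built by hand here in place of a real `getopt()` call.

```r
# Hypothetical helper: fail once, listing every missing required option
require_opts <- function(opts, required) {
  is_missing <- vapply(required, function(k) is.null(opts[[k]]), logical(1))
  if (any(is_missing)) {
    stop("Missing required option(s): ",
         paste0("--", required[is_missing], collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical helper: write a tab-separated table, gzipped if path ends in .gz
write_tsv <- function(dat, path) {
  con <- if (grepl("\\.gz$", path)) gzfile(path, "w") else file(path, "w")
  on.exit(close(con))  # the connection is closed even if write.table() fails
  write.table(dat, con, col.names = TRUE, dec = ".", quote = FALSE,
              sep = "\t", row.names = FALSE)
}

# Stand-in for the list returned by getopt(); the file names are made up
opts <- list(bam = "in.bam",
             output = file.path(tempdir(), "out.tsv.gz"),
             background = "bg.RData")
require_opts(opts, c("bam", "output", "background"))

dat <- data.frame(chr = c("1", "2"), value = c(0.5, 1.5))
write_tsv(dat, opts$output)

# read.table() reads gzip-compressed files transparently
back <- read.table(opts$output, header = TRUE, sep = "\t")
print(back)
```

Using `on.exit()` means the connection is closed on every exit path, which the two separate `if` blocks on the slide do not guarantee.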
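The data.table use case from the Hadleyverse slide (many small two-column files read into one table) might look roughly like the sketch below. The talk does not show the actual code, so the file layout, the column names `gene` and `count`, and the three toy samples are all assumptions. Only `fread()`, `fwrite()`, `rbindlist()` and `dcast()` from data.table are used.

```r
library(data.table)

# Toy stand-in for the real input: three small two-column count files
tmp_dir <- tempfile("counts_")
dir.create(tmp_dir)
for (s in c("s1", "s2", "s3")) {
  fwrite(data.table(gene = c("g1", "g2"), count = c(1L, 2L)),
         file.path(tmp_dir, paste0(s, ".tsv")), sep = "\t")
}

# Read every file with fread() and stack the results, tagging rows by sample
files <- list.files(tmp_dir, pattern = "\\.tsv$", full.names = TRUE)
names(files) <- sub("\\.tsv$", "", basename(files))
counts <- rbindlist(lapply(files, fread), idcol = "sample")

# Reshape to a genes x samples matrix-like table
wide <- dcast(counts, gene ~ sample, value.var = "count")
print(wide)
```

With a few hundred files the same pattern applies unchanged; only `tmp_dir` and the file pattern would differ.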