Open source analytics for Big Data in Big Pharma
Applications in next generation sequencing data
Miika Ahdesmaki | 23 April 2015 | Cambridge Wireless Big Data SIG | AstraZeneca

Crash course to molecular biology
Central dogma
• DNA is the ~static part
• RNA is the dynamic middle man
  - Only 1% of DNA is protein-coding (or “exonic”)
• Proteins are involved in virtually all cell functions
• We can sequence DNA and RNA using ultra high throughput sequencing (3rd gen Next Generation Sequencing)
[Figure: "Centraldogma nodetails" by Narayanese at English Wikipedia. Own work. Licensed under Public Domain via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Centraldogma_nodetails.png#/media/File:Centraldogma_nodetails.png]

Why NGS?
• Personalised medicine:
  - One drug for all patients is no longer realistic (especially in oncology)
  - Different demographic groups have different risk profiles
  - Understanding patient-specific needs will help guide their individual medication
• Cancer is a genetic disease, most often the result of spurious mutations in DNA
  - Understanding changes in cancer DNA can help defeat the disease
• Next generation high throughput sequencing offers genome DNA analyses in days and for under $10k

What is next generation sequencing?
Sequencing
• NGS: massively parallel DNA sequencing
• Oncology is the biggest consumer of NGS at AZ
• We sequence RNA and DNA e.g. from
  - Clinical samples
  - Cell lines
  - Xenografts / explants
• The DNA/RNA is pre-processed and fragmented, and the short fragments are sequenced (in random order)
Alignment
• The short fragments are aligned to a reference sequence, such as the human reference hg19
Downstream processing (variants, expression)
• The alignments are further processed to answer the following questions
  - How are the alignments different from the reference (SNPs, indels)?
  - Which genes are expressed?

Uses of NGS
[Diagram: NGS data and its uses; labels as shown on the slide]
• Sample sources: cell lines, clinical samples, explants, tumours (FFPE), tumours (fresh frozen)
• RNA-Seq: expression, fusions
• DNA-Seq: targeted, whole exome, whole genome; coding variants, coding and noncoding variants
• Uses: patient stratification; biomarkers for prognosis, drug response, safety; new target ID; mechanism of drug action; mechanism of disease; mechanisms of resistance

Data generation and volumes
• AZ: mix of outsourced sequencing and internal data generation
• Typical size of files per sample:
  - Whole genome: 60-180GB
  - Exome DNA-seq: 10-20GB
  - RNA-seq: 10-15GB
  - Single gene targeted: 100-200MB
• In oncology, individuals are often studied in pairs (tumour/normal, parental/daughter), doubling the data volumes
• Typical study sizes: 100GB - 1TB raw compressed data
• One of our most frequent Big Data problems
• Over the past 3-4 years we accumulated ~400TB of sequencing data via
  - Acquiring public data sets (TCGA, ICGC)
  - Vendor sequencing (major)
  - Internal sequencing (minor)
• Over 2015-2016 we expect
  - Internal sequencing to become the major data generation source (5 new sequencers in 2015 to accompany 2 sequencers in 2013-2014)
  - 1PB of sequencing data by mid 2016
• Long term prediction of volumes is difficult
• Three-tiered storage for processing, short term storage and long term storage
  - Amazon Glacier strongly considered for long term storage

Partnering with the leaders
• “Illumina Announces Strategic Partnerships with AstraZeneca, Janssen and Sanofi to Redefine Companion Diagnostics for Oncology”
  - http://investor.illumina.com/phoenix.zhtml?c=121127&p=irolnewsArticle&ID=1960007
  - Illumina, Inc. … announced it has formed collaborative partnerships with leading pharmaceutical companies to develop a universal … NGS-based oncology test system
  - The system will be used for clinical trials of targeted cancer therapies with a goal of developing and commercializing a multi-gene panel for therapeutic selection, resulting in a more comprehensive tool for precision medicine

Pipelines and analytics

Production: Dealing with the complexity
Number of NGS tools increases daily...
annotateBed append_sff bam12auxmerge bam12split bam12strip bam2fastx bamadapterclip bamadapterfind bamauxsort bamcat bamchecksort bamclipreinsert bamcollate bamcollate2 bamdownsamplerandom bamfilteraux bamfilterflags bamfilterheader bamfilterrg bamfixmateinformation bamindex bamleftalign bammapdist bammarkduplicates bammarkduplicates2 bammaskflags bammdnm bammerge bam_merge bamrank bamrecompress bamreset bamseqchksum bamsort bamsplit bamsplitdiv bamToBed bamtofastq bamToFastq bamtools bamtools-2.3.0 bamzztoname bcbio_nextgen.py bcftools bed12ToBed6 bedGraphToBigWig bedpeToBam bedpeToBed12 bedpeToVcf bedToBam bedToBigBed bedToIgv bed_to_juncs bedtools bgzip bigBedInfo bigBedSummary bigBedToBed bigWigInfo bigWigSummary bigWigToBedGraph bigWigToWig blast2sam.pl bowtie2 bowtie2-align bowtie2-build bowtie2-inspect bowtie2sam.pl brew bwa ccmake closestBed clusterBed cmake complementBed contig_to_chr_coords convert_trace coverageBed cpack cpanm cram_dump cram_index cramtools crc32 ctest cuffcompare cuffdiff cufflinks
cuffmerge dbilogstrip dbiprof dbiproxy expandCols export2sam.pl extract_fastq extract_qual extract_seq faCount faSize fastaFromBed fastqc fastqtobam faToTwoBit featureCounts fetchChromSizes filter_vep.pl fix_map_ordering flankBed freebayes gatk-framework GenomeAnalysisTK.jar genomeCoverageBed get_comment getOverlap gffread glia grabix groupBy gtf_juncs gtf_to_fasta gtfToGenePred gtf_to_sam hash_exp hash_extract hash_list hash_sff hash_tar index_tar interpolate_sam.pl intersectBed io_lib-config isnovoindex juncs_db kmerprob liftOver linksBed long_spanning_reads lumpy makeSCF map2gtf mapBed maq2sam-long maq2sam-short maskFastaFromBed md5fa md5sum-lite mergeBed multiBamCov multiIntersectBed muTect-1.1.6.jar normalisefasta novo2paf novo2sam.pl novoalign novoalignCS novoalignCSMPI novoalignMPI novobarcode novoindex novomethyl novope2bed.pl novorun.pl novosort novoutil nucBed pairToBed pairToPair platypus plot_roc.r plot-vcfstats prep_reads psl2sam.pl qualimap randomBed rtg s3cmd sam2vcf.pl sambamba samblaster sam_juncs samtools samtools.pl scalpel scf_dump scf_info scf_update scramble scram_flagstat scram_merge scram_pileup segment_juncs seqtk shuffleBed slopBed snpEff soap2sam.pl SomaticAnalysisTK.jar sortBed speedseq speedseq.config splitReadSamToBedpe splitterToBreakpoint sra_to_solid srf2fasta srf2fastq srf_dump_all srf_extract_hash srf_extract_linear srf_filter srf_index_hash srf_info srf_list STAR subtractBed tabix tabtk tagBam tophat tophat2 tophat-fusion-post tophat_reports trace_dump twoBitInfo twoBitToFa unionBedGraphs variant_effect_predictor.pl vcf2fasta vcf2sqlite.py vcf2tsv vcfaddinfo vcfafpath vcfallelicprimitives vcfaltcount vcfannotate vcfannotategenotypes vcfbiallelic vcfbreakmulti vcfcat vcfcheck vcfclassify vcfcleancomplex vcfclearid vcfclearinfo vcfcombine vcfcommonsamples vcfcomplex vcfcountalleles vcfcreatemulti vcfdistance vcfecho vcfentropy vcfevenregions vcffilter vcffixup vcfflatten vcfgeno2alleles vcfgeno2haplo vcfgenosamplenames 
vcfgenosummarize vcfgenotypecompare vcfgenotypes vcfglbound vcfglxgt vcfgtcompare.sh vcfhetcount vcfhethomratio vcfindelproximity vcfindels vcfindex vcfintersect vcfkeepgeno vcfkeepinfo vcfkeepsamples vcfleftalign vcflength vcfmultiallelic vcfmultiway vcfmultiwayscripts vcfnobiallelicsnps vcfnoindels vcfnosnps vcfnulldotslashdot vcfnumalt vcfoverlay vcfparsealts vcfplotaltdiscrepancy.r vcfplotaltdiscrepancy.sh vcfplotsitediscrepancy.r vcfplottstv.sh vcfprimers vcfprintaltdiscrepancy.r vcfprintaltdiscrepancy.sh vcfqual2info vcfqualfilter vcfrandom vcfrandomsample vcfregionreduce vcfregionreduce_and_cut vcfregionreduce_pipe vcfregionreduce_uncompressed vcfremap vcfremoveaberrantgenotypes vcfremovenonATGC vcfremovesamples vcfroc vcfsample2info vcfsamplediff vcfsamplenames vcfsitesummarize vcfsnps vcfsom vcfsort vcfstats vcfstreamsort vcf_strip_extra_headers vcfToBedpe vcfuniq vcfuniqalleles vcfutils.pl vcfvarstats vep_convert_cache.pl vep_install.pl vt wgsim wgsim_eval.pl wigToBigWig windowBed windowMaker xmlwf zoom2sam.pl ztr_dump
300+ (OSS) tools within our production framework
Infinite number of combinations to “get it wrong”

Production: Overcoming the complexity
Scalability, Reproducibility, Flexibility, Accessibility
• “Forced” to use open source tools and OS (Linux); no closed source alternatives exist
  - Integration is challenging
  - Variant calling and expression analysis are very much open research questions, with rapidly changing code
  - No licensing costs, but costs in internal and external consulting
• Bcbio-nextgen
  - An open source Python toolkit providing best practice pipelines for fully automated NGS analysis
  - Main developer: Brad Chapman (HSPH)
  - Unit tested, version controlled, developed on GitHub: https://github.com/chapmanb/bcbio-nextgen
  - Scalable across different clusters, schedulers and the Amazon cloud
• AZ is an active, recognised contributor and collaborator to HSPH and bcbio-nextgen
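
One way to picture how an automated pipeline narrows the room to “get it wrong” is to expand a short declarative description into a fixed, validated sequence of steps instead of assembling tool calls by hand. The sketch below is only an illustration under stated assumptions: the keys (`sample`, `aligner`, `variant_caller`), the `ALLOWED` table and the `plan_steps` helper are hypothetical and are not bcbio-nextgen's configuration schema or API. Only the tool names (bwa, bowtie2, novoalign, freebayes, platypus) are taken from the tool list above.

```python
"""Illustrative only: expand a small declarative config into an ordered plan.

The keys and helper names are hypothetical, not bcbio-nextgen's schema.
"""

# Hypothetical table of allowed choices per pipeline stage. The tool names
# appear in the slide's tool list; the stage names are assumptions.
ALLOWED = {
    "aligner": {"bwa", "bowtie2", "novoalign"},
    "variant_caller": {"freebayes", "platypus"},
}


def plan_steps(config):
    """Return the ordered (stage, tool) steps for one sample.

    Raises ValueError on an unknown tool so that a typo fails early
    instead of silently producing a different pipeline.
    """
    steps = []
    for stage in ("aligner", "variant_caller"):  # fixed order of stages
        tool = config[stage]
        if tool not in ALLOWED[stage]:
            raise ValueError(f"unknown {stage}: {tool!r}")
        steps.append((stage, tool))
    return steps


if __name__ == "__main__":
    # Made-up sample name and tool choices, for illustration only.
    config = {
        "sample": "example_sample",
        "aligner": "bwa",
        "variant_caller": "freebayes",
    }
    for stage, tool in plan_steps(config):
        print(f"{config['sample']}: {stage} -> {tool}")
```

Because the plan depends only on the configuration, two people starting from the same description get the same sequence of steps, which is the property the production framework is after.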

Bcbio-nextgen overview
• The user writes/modifies a high level configuration file specifying inputs and analysis parameters
  - Very few “tuning parameters” -> given the same data, two analysts will produce the same results

Getting it right
• Given the rapid changes in the individual analysis tools, how do we know the pipeline “gets it right”?
• Solution: reference standards
• For germline sequencing, the Genome in a Bottle Consortium established a gold standard for an individual (NA12878)
  - Samples from NA12878 can be bought off the shelf
  - Compare sequencing and analytics results to the gold standard, establish sensitivity and PPV of variant calls, compare to other people’s results
• For tumour sequencing, several standards exist
  - Horizon Diagnostics’ tumour standard
  - ICGC-TCGA DREAM Mutation Calling challenge

Processing and managing the data
• NGS HPC clusters on 4 main R&D sites
  - UK (SGE, ~200 cores, GPFS)
  - Sweden (SLURM, >500 cores, Lustre)
  - China (SGE, >100 cores, GPFS)
  - US (UGE, >200 cores, GPFS)
• Data generated or received in one place is processed locally by the NGS Production Team (each member has access to all HPC clusters)
  - Processed data is handed over to disease area bioinformaticians in a controlled manner
• Quick pipes between the sites allow data sharing when required
• Cloud computing …

NGS + Cloud
NGS suited to using “Cloud”
• Large scale storage needs
• High computational power that can continue to scale
• Inherently (embarrassingly) parallel, easily ported
• Peaks and valleys in compute needs, so burst into cloud as needed instead of large investment upfront
• Launch-able computing centre utilising Amazon EC2

StarCluster from MIT with our pipeline
[Diagram: twelve nodes labelled “32 Core” and six units labelled “320 SSD”, sharing a 40 TB GlusterFS file system mounted at /ngs]

Why not Hadoop?
• We use a large number of mostly academic open source tools, 99.9% of which are not written for Hadoop
• No existing pipeline wraps these tools in a Hadoop framework
• Disk I/O is admittedly the bottleneck in current parallel file system architectures for NGS analytics
  - GPFS locally at AZ
  - Lustre in AWS, local scratch SSD

Visualising the data
JBrowse genome browser
• The most popular genome analysis viewer is the Integrative Genomics Viewer (IGV, Broad Institute), a Java based standalone program
  - Requires a Java app
  - Requires configuration
• JBrowse, a web browser based genome viewer, is inherently easier for non-tech-savvy people: point your browser to it and it just works
  - Physical location of data is less important; only the part that is shown is transferred
• Data of interest, such as genomic variants, can be annotated by a URL to JBrowse

JBrowse BRCA2 gene screenshot
[Screenshot labels: Reference DNA sequence and amino acids; BRCA2 alternative exons; Detected gene variant (G to A mutation); Evidence in the data for the variant; Noise in the data]

Summary
• NGS data is accumulating faster and faster
• Analysing and interpreting the data is I/O intensive (+CPU and RAM)
• Easily parallelised using SMP and simple schedulers (SGE, Slurm)
• Current challenges in integrating all the processed data (in e.g. NoSQL databases)
• Long term storage (due to e.g. regulatory requirements) in e.g. Amazon Glacier
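
The comparison against a gold standard described under “Getting it right” comes down to set arithmetic on variant calls. The sketch below is a minimal illustration, assuming each variant can be represented as a (chromosome, position, ref, alt) tuple and that exact matching is sufficient. Real comparisons also need consistent variant representation and are usually restricted to regions where the standard is trusted, neither of which is shown here. The function name `sensitivity_and_ppv` and the example variants are made up for illustration and are not NA12878 data.

```python
def sensitivity_and_ppv(truth, calls):
    """Compare called variants to a gold-standard set.

    truth, calls: iterables of hashable variant keys, here assumed to be
    (chromosome, position, ref, alt) tuples.
    Returns (sensitivity, ppv); a value is NaN when its denominator is zero.
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the standard
    fn = len(truth - calls)   # present in the standard but missed
    fp = len(calls - truth)   # called but absent from the standard
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


if __name__ == "__main__":
    # Made-up variants for illustration only.
    truth = [
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "C", "T"),
        ("chr2", 50, "G", "A"),
        ("chr3", 10, "T", "C"),
    ]
    calls = truth[:3] + [("chr4", 1, "A", "C"), ("chr5", 2, "G", "T")]
    sens, ppv = sensitivity_and_ppv(truth, calls)
    print(f"sensitivity={sens:.2f} PPV={ppv:.2f}")
```

Tracking these two numbers on the same reference sample across pipeline versions gives a simple check that a tool upgrade has not degraded the variant calls.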