RNF: a method and tools to evaluate NGS read mappers
Karel Břinda, Valentina Boeva, Gregory Kucherov
karel.brinda@univ-mlv.fr, valentina.boeva@curie.fr, gregory.kucherov@univ-mlv.fr
Introduction
Aligning reads to a reference sequence is a
fundamental step in numerous bioinformatics
pipelines. The sensitivity and precision of the
mapping tool can critically affect the accuracy of produced results.
Read simulators combined with alignment
evaluation tools provide the most straightforward way to evaluate and compare mappers.
In the absence of a standard for encoding read origins,
every evaluation tool has had to be made explicitly
compatible with the simulator used to generate
the reads.
To overcome this obstacle, we have created a format,
Rnf (Read Naming Format), and an associated
software package, RnfTools.
Rnf
Description: Read Naming Format, a generic format for assigning read names with encoded information about original positions.
Specification: http://karel-brinda.github.io/rnf-spec/
Read Naming Format
[Figure: structure of an Rnf read name, annotated on the following example.]
sim__0043fd1__(3,13,F,027871,027970),(3,13,R,029171,029270)__[paired_end],C:[100=,42=1X47=]
The name consists of four parts separated by "__":
– Prefix: sim
– Read tuple ID: 0043fd1
– Segments of reads: (3,13,F,027871,027970),(3,13,R,029171,029270); each segment encodes the Genome ID, Chromosome ID, Direction, Leftmost coordinate, and Rightmost coordinate
– Suffix (with comments and extensions): [paired_end],C:[100=,42=1X47=]

Example of simulated read tuples and their corresponding Rnf names
[Figure: six read tuples drawn below the sequences they were simulated from.]
Source 1 - reference genome
chr 1: ATGTTAGATAA-GATAGCTGTGCTAGTAGGCAGTCAGCCC
chr 2: ttcttctggaa-gaccttctcctcctgcaaataaa
Source 2 - generator of random sequences

READS (arrows show the read direction):
r001: ATG-TAGATA ->
r002: TTAGATAACGA ->, <- TCAG-CGGG
r003: gaa-gacc-t ->, tgcaaataa ->
r004: ATAGCT............TCAG ->
r005: <- TCGACACG, GTAGG ->, <- agacctt
r006: ATATCACATCATTAGACACTA

Read tuple  SRN  LRN
r001        #1   sim__1__(1,1,F,01,10)__[single_end]
r002        #2   sim__2__(1,1,F,04,14),(1,1,R,31,39)__[paired_end]
r003        #3   sim__3__(1,2,F,09,17),(1,2,F,25,33)__[mate_pair]
r004        #4   sim__4__(1,1,F,15,36)__[spliced],C:[6=12N4=]
r005        #5   sim__5__(1,1,R,15,22),(1,1,F,25,29),(1,2,R,05,11)__[chimeric]
r006        #6   rnd__6__(2,0,N,00,00)__[random]
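The names above can be decoded mechanically. The following is a minimal Python sketch, not the RnfTools implementation: it assumes that the four parts are separated by a double underscore, that the prefix itself contains no double underscore, and that the `C:` extension holds extended-CIGAR strings. The function names `parse_rnf_name` and `cigar_reference_span` are hypothetical and chosen for illustration only.

```python
import re

# One segment is a parenthesised, comma-separated 5-tuple.
SEGMENT_RE = re.compile(r"\(([^()]*)\)")


def parse_rnf_name(name):
    """Split a long read name into prefix, read tuple ID, segments, suffix."""
    # Assumption: "__" separates the four parts and does not occur in the prefix.
    prefix, read_tuple_id, segments_part, suffix = name.split("__", 3)
    segments = []
    for body in SEGMENT_RE.findall(segments_part):
        genome_id, chr_id, direction, left, right = body.split(",")
        segments.append({
            "genome_id": int(genome_id),
            "chr_id": int(chr_id),
            "direction": direction,  # F, R, or N in the examples above
            "left": int(left),       # leftmost coordinate
            "right": int(right),     # rightmost coordinate
        })
    return {
        "prefix": prefix,
        "read_tuple_id": read_tuple_id,
        "segments": segments,
        "suffix": suffix,
    }


def cigar_reference_span(cigar):
    """Count reference positions covered by an extended-CIGAR string."""
    span = 0
    for length, op in re.findall(r"(\d+)([A-Z=])", cigar):
        if op in "=XNDM":  # operations that consume the reference
            span += int(length)
    return span


parsed = parse_rnf_name("sim__4__(1,1,F,15,36)__[spliced],C:[6=12N4=]")
segment = parsed["segments"][0]
# Both values are 22: the coordinates and the CIGAR string agree.
print(segment["right"] - segment["left"] + 1, cigar_reference_span("6=12N4="))
```

For the spliced read r004, the segment spans 36 - 15 + 1 = 22 positions, which matches 6 + 12 + 4 from its CIGAR comment.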
LRN Long read name.
SRN Short read name. SRNs are used only if an LRN exceeds 255 characters (the maximum read name length
allowed in Sam). In that case, an SRN-LRN correspondence file must
be created.
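The fallback from LRN to SRN can be sketched as follows. This is an illustration under stated assumptions rather than the behaviour of any particular tool: the SRN is taken to be "#" followed by the read tuple ID, as the #1 to #6 labels in the example suggest, the 255-character limit is the one quoted above, and the function name `assign_read_names` is hypothetical.

```python
MAX_NAME_LENGTH = 255  # limit quoted above for read names in Sam


def assign_read_names(lrns):
    """Return the names to write and an SRN-to-LRN correspondence table."""
    names = []
    srn_to_lrn = {}
    for lrn in lrns:
        if len(lrn) <= MAX_NAME_LENGTH:
            names.append(lrn)  # the LRN fits, so no SRN is needed
            continue
        # Assumption: SRN = "#" + read tuple ID (second "__"-separated field).
        read_tuple_id = lrn.split("__")[1]
        srn = "#" + read_tuple_id
        srn_to_lrn[srn] = lrn  # would be saved to the correspondence file
        names.append(srn)
    return names, srn_to_lrn


short_name = "sim__1__(1,1,F,01,10)__[single_end]"
# Made-up chimeric read with ten segments, long enough to exceed the limit.
long_name = "sim__7__" + ",".join(["(1,1,F,0000000001,0000000100)"] * 10) + "__[chimeric]"
names, table = assign_read_names([short_name, long_name])
print(names[1], len(table))
```

Keeping the correspondence table alongside the reads is what allows an evaluation tool to recover the original positions from a shortened name.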
Evaluation of read mappers using Rnf-compatible programs
[Figure: evaluation pipeline in two stages. Read simulation: Genome 1, Genome 2, ..., Genome n (FASTA) → Read simulator with RNF encoding → Reads (FASTQ). Mapper evaluation: Reads (FASTQ) → Mapper → Alignment (BAM) → Mapper evaluation tool with RNF decoding → Report (TXT/HTML).]

RnfTools
Description: An associated software package of Rnf-compatible programs, based on Snakemake [2]. All employed external programs are installed automatically when they are needed.
Components:
i) MIShmash
Pipeline applying one of the popular read simulation tools (DwgSim, Art, Mason, CuReSim, etc.) and transforming the generated reads into the Rnf format.
ii) LAVEnder
Tool for evaluating read mappers using Rnf reads.
Prerequisites:
– Unix-like system (Linux, OSX, etc.)
– Python 3.2+
Installation using Pip:
> pip install rnftools
Installation using Easy Install:
> easy_install rnftools
Source codes and documentation:
http://github.com/karel-brinda/rnftools
http://rnftools.rtfd.org

References
[1] K. Břinda, V. Boeva, and G. Kucherov. RNF: a general framework to evaluate NGS read mappers. arXiv:1504.00556 [q-bio.GN], 2015.
[2] J. Köster and S. Rahmann. Snakemake – a scalable bioinformatics workflow engine. Bioinformatics 28(19): 2520–2522, 2012.

RnfTools – example of usage
Steps:
1. Simulation of reads. 200,000 reads were simulated by DwgSim using MIShmash:
– 100,000 reads from a human genome (HG38),
– 100,000 reads from a mouse genome (MM10).
2. Mapping. All reads were mapped to HG38 by i) Yara, ii) Bwa-Mem, iii) Bwa-Sw, and iv) Bowtie2.
3. Evaluation. The obtained Bam files were evaluated using LAVEnder.

[Figure: Comparison of the mappers (BWA-MEM, BWA-SW, Bowtie2, YARA) with respect to correctly mapped reads. Title: Correctly mapped reads in all reads which should be mapped. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads). y-axis: #correctly mapped reads / #reads which should be mapped.]
[Figure: Detailed graph for Bwa-Mem. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads). y-axis: Part of all reads (%). Categories: Unmapped correctly; Unmapped incorrectly; Thresholded correctly; Thresholded incorrectly; Multimapped; Mapped, should be unmapped; Mapped to wrong position; Mapped correctly.]
[Figure: Detailed graph for Yara. Same axes and categories as the detailed graph for Bwa-Mem.]
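The two ratios used on the plot axes can be computed directly from read counts. The sketch below only restates the axis definitions; how LAVEnder derives the counts from its read categories is not specified here, so the functions take the four counts as explicit inputs. The function names and the numbers in the usage example are made up for illustration, and returning 0.0 for an empty denominator is an arbitrary convention.

```python
def mapping_fdr(wrongly_mapped, mapped):
    """FDR in mapping: #wrongly mapped reads / #mapped reads."""
    if mapped == 0:
        return 0.0  # assumed convention when nothing was mapped
    return wrongly_mapped / mapped


def correctly_mapped_fraction(correctly_mapped, should_be_mapped):
    """#correctly mapped reads / #reads which should be mapped."""
    if should_be_mapped == 0:
        return 0.0  # assumed convention when no read should be mapped
    return correctly_mapped / should_be_mapped


# Hypothetical counts, not results from the poster.
should_be_mapped = 100000
mapped = 95000
wrongly_mapped = 950
correctly_mapped = mapped - wrongly_mapped

fdr = mapping_fdr(wrongly_mapped, mapped)
fraction = correctly_mapped_fraction(correctly_mapped, should_be_mapped)
print("FDR: {:.4f}, correctly mapped: {:.2%}".format(fdr, fraction))
```

Evaluating both ratios for a range of mapping-quality thresholds would give one curve per mapper, which is the kind of comparison the first figure shows.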