Differential Methylation Analysis
Simon Andrews
simon.andrews@babraham.ac.uk
@simon_andrews

A basic question…

Factors to consider
• Number of observations
• Magnitude of effect
• Technical considerations
• Biological variability
• Biological common sense

The problem of power…
• Ideally want to cover every Cytosine (CpG)
• Have to correct for the number of tests
• There’s no way you’ll collect enough data to analyse each C and have p-values which survive multiple testing correction
• Stats have to find a way to work round this.

Maximising power
• Options
  – Analyse in windows
  – Pre-filter
  – Hierarchical or Adaptive filtering

Window sizes
Small windows
• Good resolution
• Specific biological effects
• High MTC burden
• Small observations
• High p-values
Large windows
• Lots of data
• High statistical power
• Low MTC burden
• Low p-values
• Effect averaging

Simple Statistical Approach
• Is the proportion of methylated calls different between two samples, given the number of observations?

Meth count A   Unmeth count A   Meth count B   Unmeth count B   % change   Significant?
2              0                0              2                100        No
200            2                198            5                1.5        No
100            50               75             60               11         Probably

Contingency tests
• Chi-square / G-test / Fisher’s exact test
  – Differ only at low observations
  – Significant changes require enough observations that any of these should give the same answer
• Operates on single replicates
• Technical measure of difference

Meth A   Unmeth A
Meth B   Unmeth B

Chi-Square results

Biological considerations
• Minimum relevant effect size?
  – Balance power vs change
  – What makes biological sense
  – (what would you follow up?)
• Minimum coverage worth testing
  – No point testing poorly covered regions

Effect of pre-filtering

Distribution of methylation
Chi square assumes a normal distribution, and methylation data isn’t normally distributed

Beta binomial distribution
More relevant statistics than chi-square. Need to fit custom model to actual data.
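
To see why a beta-binomial model behaves differently from a plain binomial one, the sketch below compares the variance of methylated-call counts under the two models. It is an illustration under stated assumptions, not a fitted model: the read depth (20) and the U-shaped Beta(0.5, 0.5) parameters are hypothetical values chosen only to mimic methylation levels that pile up near 0% and 100%, and the function names are made up for this example.

```python
import random
import statistics


def binomial_variance(depth, p):
    """Variance of the methylated-call count if every site shares one level p."""
    return depth * p * (1 - p)


def beta_binomial_variance(depth, alpha, beta):
    """Variance of the count if each site's level is drawn from Beta(alpha, beta)."""
    total = alpha + beta
    return depth * alpha * beta * (total + depth) / (total ** 2 * (total + 1))


def simulate_counts(depth, alpha, beta, sites, seed=1):
    """Draw a methylation level per site, then `depth` reads at that level."""
    rng = random.Random(seed)
    counts = []
    for _ in range(sites):
        level = rng.betavariate(alpha, beta)
        counts.append(sum(rng.random() < level for _ in range(depth)))
    return counts


# Hypothetical parameters (assumptions for illustration, not from the slides).
DEPTH, ALPHA, BETA = 20, 0.5, 0.5
mean_level = ALPHA / (ALPHA + BETA)

print("binomial variance:     ", binomial_variance(DEPTH, mean_level))
print("beta-binomial variance:", beta_binomial_variance(DEPTH, ALPHA, BETA))
print("simulated variance:    ",
      statistics.pvariance(simulate_counts(DEPTH, ALPHA, BETA, 20000)))
```

With these assumed parameters the beta-binomial variance is 52.5 against 5.0 for the binomial, so a test that expects binomial-sized scatter would treat ordinary site-to-site variation as signal.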

Implications of a beta distribution
• Many summaries assume normality
  – Mean
  – Standard Deviation
  – Boxplots
• None of these is strictly appropriate when looking at methylation data

Dealing with replicates
• Simple approach
  – Merge data from replicates together
  – Single test, High power
  – Post-hoc test for consistency
• Explicitly account for batch effects
  – Logistic regression
  – Measures batch effects and excludes them from final significance calculation
• Work with methylation values
  – Normalise percentage methylation values
  – Use conventional statistics (t-tests etc) for comparing groups

Hierarchical testing
• Test larger regions
  – Windows / Features etc.
• Take significant hits and subdivide
  – Smaller windows
  – Individual CpGs
  – Correct only for these tests
• Assemble hits together to make up DMRs

Hierarchical testing
[Diagram: CGIs along the genome, some marked X]
Statistically ‘creative’ solution to not having enough data

Methylation statistics packages
• swDMR (Perl/R-package)
  Sliding window DMR finding (choose between t_test, Kolmogorov, Fisher, ChiSquare, Wilcoxon for n = 2; ANOVA, Kruskal for n > 3)
• methylKit* (R-package by A. Akalin et al.)
  Sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using SLIM method.
• bsseq* (R/Bioconductor by K.D. Hansen)
  Implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and p-value cutoff to define DMRs. Outperforms Fisher’s exact test. Requires biological replicates for DMR detection
• BiSeq* (R/Bioconductor by K. Hebestreit et al.)
  Beta regression model, impractical for very large data other than RRBS or targeted BS-Seq
• RnBeads* (R package by F. Mueller et al.)
  Works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data
• DMAP* (C command line tool by P. Stockwell et al.)
  RRBS fragment or fixed window approach, Fisher’s exact test, Chi-squared or ANOVA
• RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith)
  Beta-binomial regression analysis to find DMCs or DMRs, local likelihood, adjust for neighbouring CpGs
• MOABS* (C++ command line tool by D. Sun et al.)
  Beta binomial hierarchical model to capture sampling and biological variation, Credible Methylation Difference (CDIF) single metric that combines biological and statistical significance
• ComMet (Y. Saito et al., 2014)
  Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates
• DSS (R/Bioconductor by Feng et al., 2014)
  Constructs genome-wide prior distribution for beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci
• more appearing every other week…
* interface well with

Tool comparison

bsseq
  Statistical test: Sample-wise smoothing, then group differences via CpG-wise t-tests (p-value cutoff to define adjacent CpG sites as DMRs)
  Suitable for: WGBS; not designed for targeted BS-Seq or RRBS
  Implementation: R package / Bioconductor
  Notes: Outperforms Fisher’s exact test; intended to compare 2 groups; replicates required

BiSeq
  Statistical test: Define CpG clusters, smooth methylation data, model and test group effect (fitting beta regression model to smoothed methylation levels and testing for group effect using the Wald test), hierarchical testing procedure on CpG clusters, then define DMR boundaries
  Suitable for: RRBS; targeted BS-Seq; for WGBS
  Implementation: R package / Bioconductor
  Notes: Very computationally intensive; not limited to 2 groups

methylKit
  Statistical test: Models CpG methylation within a logistic regression. Sliding linear model (SLIM) to correct for multiple testing
  Suitable for: (e)RRBS
  Implementation: R package

* WGBS = whole genome BS-Seq; (e)RRBS = (enhanced) reduced representation BS-Seq

bsseq – for whole genome BS-Seq
• Smooths low coverage BS-Seq data first to get reliable semi-local methylation estimates
• Not suitable for captured or restricted data
• After smoothing it uses biological replicates to estimate biological variation and identify differentially methylated regions (DMRs)
• Smoothing suitable for even a single sample
• Works for CpG context in humans, will probably not scale to 2x585M Cs in non-CG context

BSmooth algorithm
[Figure legend: black: 25x (Lister); pink: 4x (Lister)]

BiSeq - for RRBS or targeted BS-Seq
1) Define CpG cluster boundaries (requires 20 CpG sites that are frequently covered in the majority of samples, e.g. CGIs or targeted regions)
2) Smooth methylation data within CpG clusters (spatial smoothing using a weighted local likelihood). Aims to overcome variance of lowly covered sites
3) Model and test group effect within CpG clusters (fitting beta regression model to smoothed methylation levels and testing for group effect using the Wald test)
4) Apply hierarchical testing procedure by Benjamini and Heller, 2007:
  – Test CpG clusters for differential methylation and control weighted FDR on clusters
  – Trim rejected CpG clusters and control FDR on single CpG sites
5) Define DMR boundaries
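
The idea behind hierarchical testing (test clusters first, then correct only for the CpGs inside the clusters that passed) can be sketched in a few lines. This is a deliberately simplified two-stage Benjamini-Hochberg illustration, not the weighted-FDR procedure of Benjamini and Heller that BiSeq uses, and it includes no cluster trimming. The cluster names, site names, p-values and function names are all hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            largest_passing_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            rejected[idx] = True
    return rejected


def hierarchical_test(cluster_pvalues, site_pvalues, alpha=0.05):
    """Stage 1: FDR on clusters. Stage 2: FDR on sites in passing clusters only."""
    names = list(cluster_pvalues)
    passed = benjamini_hochberg([cluster_pvalues[c] for c in names], alpha)
    hit_clusters = [c for c, ok in zip(names, passed) if ok]

    # Only sites inside significant clusters are tested, so only they
    # contribute to the second multiple-testing correction.
    candidates = [(c, s, p) for c in hit_clusters
                  for s, p in site_pvalues[c].items()]
    site_ok = benjamini_hochberg([p for _, _, p in candidates], alpha)
    hit_sites = [(c, s) for (c, s, _), ok in zip(candidates, site_ok) if ok]
    return hit_clusters, hit_sites


# Hypothetical p-values, invented purely for illustration.
clusters = {"CGI_1": 0.001, "CGI_2": 0.40, "CGI_3": 0.012}
sites = {
    "CGI_1": {"cpg_a": 0.002, "cpg_b": 0.30},
    "CGI_2": {"cpg_c": 0.001},
    "CGI_3": {"cpg_d": 0.01, "cpg_e": 0.04},
}

hit_clusters, hit_sites = hierarchical_test(clusters, sites)
print(hit_clusters)
print(hit_sites)
```

In this toy data cpg_c has a small p-value but is never tested, because its cluster did not pass the first stage. That is the trade the approach makes: a lighter correction burden in exchange for ignoring sites in regions that look unremarkable as a whole.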