Differential methylation lecture

Differential Methylation Analysis
Simon Andrews
simon.andrews@babraham.ac.uk
@simon_andrews
A basic question…
Factors to consider
• Number of observations
• Magnitude of effect
• Technical considerations
• Biological variability
• Biological common sense
The problem of power…
• Ideally want to cover every Cytosine (CpG)
• Have to correct for the number of tests
• There’s no way you’ll collect enough data to
analyse each C and have p-values which
survive multiple testing correction
• Stats have to find a way to work round this.
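A rough back-of-the-envelope sketch of why per-cytosine testing cannot survive correction. The figure of 28 million CpGs and the depth of 10 calls per sample are illustrative assumptions, not values from the lecture, and Bonferroni is used only because it is the simplest correction to compute.

```python
from math import comb

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff under Bonferroni correction."""
    return alpha / n_tests

def min_fisher_p(depth: int) -> float:
    """Smallest two-sided Fisher's exact p-value when each of two samples
    has `depth` calls: the fully separated table plus its mirror image."""
    return 2 / comb(2 * depth, depth)

N_CPGS = 28_000_000   # assumed number of CpGs tested (illustrative)
DEPTH = 10            # assumed calls per CpG per sample (illustrative)

threshold = bonferroni_threshold(0.05, N_CPGS)
best_case = min_fisher_p(DEPTH)
print(f"cutoff {threshold:.2e}, best achievable p {best_case:.2e}")
print("can any single CpG pass?", best_case <= threshold)
```

Under these assumptions even a complete switch from fully methylated to fully unmethylated cannot reach the corrected cutoff, which is the motivation for testing fewer, larger units.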
Maximising power
• Options
– Analyse in windows
– Pre-filter
– Hierarchical or Adaptive filtering
Window sizes
Small windows
• Good resolution
• Specific biological effects
• High MTC burden
• Small observations
• High p-values
Large windows
• Lots of data
• High statistical power
• Low MTC burden
• Low p-values
• Effect averaging
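The trade-off can be seen in a minimal sketch that pools per-cytosine counts into fixed, non-overlapping windows. The function name, the input layout and the example calls are hypothetical; real tools may use sliding or feature-based windows instead.

```python
from collections import defaultdict

def pool_into_windows(calls, window_size):
    """Sum per-cytosine counts into fixed, non-overlapping windows.

    `calls` is an iterable of (position, methylated, unmethylated) tuples for
    one chromosome; the result maps window start -> [methylated, unmethylated].
    """
    windows = defaultdict(lambda: [0, 0])
    for position, meth, unmeth in calls:
        start = (position // window_size) * window_size
        windows[start][0] += meth
        windows[start][1] += unmeth
    return dict(windows)

# Hypothetical calls: (position, methylated reads, unmethylated reads)
calls = [(105, 3, 1), (180, 2, 2), (950, 0, 4), (1020, 1, 5), (1990, 4, 0)]

small = pool_into_windows(calls, 100)    # more windows, fewer calls in each
large = pool_into_windows(calls, 1000)   # fewer windows, more calls in each
print(len(small), "small windows;", len(large), "large windows")
```

Fewer windows means fewer tests to correct for, but the first large window here mixes a mostly methylated stretch with an unmethylated one, which is the effect averaging listed above.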
Simple Statistical Approach
• Is the proportion of methylated calls different
between two samples, given the number of
observations?
Meth count A   Unmeth count A   Meth count B   Unmeth count B   % change   Significant?
2              0                0              2                100        No
200            2                198            5                1.5        No
100            50               75             60               11         Probably
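The three example rows can be checked with a small, standard-library-only sketch. Fisher's exact test is chosen here as one of the contingency tests that could be used; the helper names are hypothetical, and the sketch prints p-values rather than deciding what counts as significant.

```python
from math import comb

def percent_methylated(meth, unmeth):
    return 100 * meth / (meth + unmeth)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):  # probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so ties with the observed table are included.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))

# (meth A, unmeth A, meth B, unmeth B) for the three example rows
rows = [(2, 0, 0, 2), (200, 2, 198, 5), (100, 50, 75, 60)]
for meth_a, unmeth_a, meth_b, unmeth_b in rows:
    change = abs(percent_methylated(meth_a, unmeth_a) - percent_methylated(meth_b, unmeth_b))
    p = fisher_exact_two_sided(meth_a, unmeth_a, meth_b, unmeth_b)
    print(f"change {change:5.1f} points, p = {p:.3g}")
```

The first row shows a 100-point change that cannot be significant with four calls in total; the second has plenty of calls but almost no change.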
Contingency tests
• Chi-square / G-test / Fisher’s exact test
– Differ only at low observations
– Significant changes require enough observations
that any of these should give the same answer
• Operates on single replicates
• Technical measure of difference
[Figure: 2x2 contingency table with cells Meth A, Unmeth A, Meth B, Unmeth B]
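A minimal sketch of two of these tests on a 2x2 table, written with the standard library only. It assumes every row and column total is non-zero and applies no continuity correction; the function names are hypothetical.

```python
from math import erfc, log, sqrt

def expected_counts(table):
    """Expected counts under independence for a 2x2 table [[a, b], [c, d]]."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return [[r * c / total for c in cols] for r in rows]

def chi_square_stat(table):
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

def g_stat(table):
    exp = expected_counts(table)
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * log(o / e)
                   for obs_row, exp_row in zip(table, exp)
                   for o, e in zip(obs_row, exp_row) if o > 0)

def p_value_1df(stat):
    """Upper tail of the chi-square distribution with one degree of freedom."""
    return erfc(sqrt(stat / 2))

for table in ([[100, 50], [75, 60]], [[2, 0], [0, 2]]):
    chi2, g = chi_square_stat(table), g_stat(table)
    print(table, f"chi-square p = {p_value_1df(chi2):.3g}, G-test p = {p_value_1df(g):.3g}")
```

With a few hundred calls the two statistics nearly coincide, while on the four-call table they diverge, which is the "differ only at low observations" point.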
Chi-Square results
Biological considerations
• Minimum relevant effect size?
– Balance power vs change
– What makes biological sense
– (what would you follow up?)
• Minimum coverage worth testing
– No point testing poorly covered regions
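One way these two filters might be applied before any statistics are run is sketched below. The thresholds (10 calls, 10 percentage points) and the function name are hypothetical placeholders; suitable values depend on the experiment.

```python
def passes_prefilter(meth_a, unmeth_a, meth_b, unmeth_b,
                     min_coverage=10, min_difference=10.0):
    """Keep a region only if both samples are well covered and the absolute
    methylation difference (in percentage points) is large enough to matter."""
    cov_a, cov_b = meth_a + unmeth_a, meth_b + unmeth_b
    if min(cov_a, cov_b) < min_coverage:
        return False
    difference = abs(100 * meth_a / cov_a - 100 * meth_b / cov_b)
    return difference >= min_difference

# Hypothetical regions: (meth A, unmeth A, meth B, unmeth B)
regions = [(2, 0, 0, 2), (200, 2, 198, 5), (100, 50, 75, 60)]
kept = [r for r in regions if passes_prefilter(*r)]
print(f"{len(kept)} of {len(regions)} regions go forward to testing")
```

Only the regions that survive need a test, so the multiple testing correction is applied to a smaller number of p-values.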
Effect of pre-filtering
Distribution of methylation
The chi-square test rests on a normal approximation, but methylation data isn’t normally distributed
Beta binomial distribution
A beta-binomial model is more appropriate for methylation counts than chi-square, but it has to be fitted to the actual data.
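To show what the beta-binomial adds, the sketch below compares its spread with a plain binomial of the same mean. The parameters (alpha = beta = 0.5, 20 calls) are illustrative and are not fitted to any data; fitting is the step a real package would perform.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(k methylated calls out of n) when the underlying methylation level
    is itself drawn from a Beta(alpha, beta) distribution."""
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

def variance(pmf_values):
    mean = sum(k * p for k, p in enumerate(pmf_values))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf_values))

n = 20
alpha, beta = 0.5, 0.5            # illustrative U-shaped beta, not fitted to data
mu = alpha / (alpha + beta)       # mean methylation level

bb = [beta_binomial_pmf(k, n, alpha, beta) for k in range(n + 1)]
binom = [comb(n, k) * mu**k * (1 - mu)**(n - k) for k in range(n + 1)]

print(f"binomial variance      {variance(binom):.1f}")
print(f"beta-binomial variance {variance(bb):.1f}")
```

The extra variance is the biological variation in the true methylation level, which a test based on sampling noise alone does not account for.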
Implications of a beta distribution
• Many summaries assume normality
– Mean
– Standard Deviation
– Boxplots
• None of these is strictly appropriate when
looking at methylation data
Dealing with replicates
• Simple approach
– Merge data from replicates together
– Single test, High power
– Post-hoc test for consistency
• Explicitly account for batch effects
– Logistic regression
– Measures batch effects and excludes them from final
significance calculation
• Work with methylation values
– Normalise percentage methylation values
– Use conventional statistics (t-tests etc) for comparing groups
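The simple approach (merge, then check consistency afterwards) might look like the sketch below. The consistency rule used here, that every replicate pair must agree in direction with the merged data, is one hypothetical choice among many, and every replicate is assumed to have non-zero coverage.

```python
def merge_replicates(replicates):
    """Sum (methylated, unmethylated) counts across the replicates of one group."""
    return (sum(m for m, _ in replicates), sum(u for _, u in replicates))

def fraction(meth, unmeth):
    return meth / (meth + unmeth)

def consistent_direction(group_a, group_b):
    """Post-hoc check: does every replicate of A sit on the same side of
    every replicate of B as the merged data does?"""
    merged_sign = fraction(*merge_replicates(group_a)) > fraction(*merge_replicates(group_b))
    return all((fraction(*a) > fraction(*b)) == merged_sign
               for a in group_a for b in group_b)

# Hypothetical (methylated, unmethylated) counts for one region
group_a = [(30, 10), (28, 12), (33, 9)]
group_b = [(15, 25), (18, 22), (12, 26)]
outlier_driven = [(90, 10), (5, 35), (4, 36)]   # one replicate dominates the merge

print(merge_replicates(group_a), consistent_direction(group_a, group_b))
print(merge_replicates(outlier_driven), consistent_direction(outlier_driven, group_b))
```

The second comparison shows why the post-hoc step matters: the merged counts suggest a gain of methylation that two of the three replicates contradict.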
Hierarchical testing
• Test larger regions
– Windows / Features etc.
• Take significant hits and subdivide
– Smaller windows
– Individual CpGs
– Correct only for these tests
• Assemble hits together to make up DMRs
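A simplified two-stage sketch of this idea follows. It uses Benjamini-Hochberg correction at both stages as an assumption for illustration and does not reproduce any specific published procedure; the region names and p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def hierarchical_hits(region_p, cpg_p, alpha=0.05):
    """Stage 1: correct across regions. Stage 2: correct only the CpGs that
    lie inside regions which passed stage 1."""
    names = list(region_p)
    region_q = dict(zip(names, benjamini_hochberg([region_p[r] for r in names])))
    passed = [r for r in names if region_q[r] <= alpha]

    candidates = [(r, pos) for r in passed for pos in cpg_p[r]]
    cpg_q = benjamini_hochberg([cpg_p[r][pos] for r, pos in candidates])
    return [c for c, q in zip(candidates, cpg_q) if q <= alpha]

# Hypothetical p-values per region and per CpG position within each region
region_p = {"CGI_1": 0.001, "CGI_2": 0.40, "CGI_3": 0.012}
cpg_p = {
    "CGI_1": {101: 0.0004, 140: 0.03, 188: 0.60},
    "CGI_2": {900: 0.001},
    "CGI_3": {2050: 0.008, 2075: 0.02},
}
hits = hierarchical_hits(region_p, cpg_p)
print(hits)
```

The CpG inside CGI_2 is never tested, however small its own p-value, because its region did not pass the first stage; that is where the saving in multiple testing burden comes from.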
Hierarchical testing
[Figure: genome tracks showing CpG islands (CGI), some marked X]
Statistically ‘creative’ solution to not having enough data
Methylation statistics packages
• swDMR (Perl/R package)
– Sliding window DMR finding (choose between t-test, Kolmogorov, Fisher, chi-square, Wilcoxon for n = 2; ANOVA, Kruskal for n > 3)
• methylKit* (R package by A. Akalin et al.)
– Sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using the SLIM method.
• bsseq* (R/Bioconductor by K.D. Hansen)
– Implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and a p-value cutoff to define DMRs. Outperforms Fisher’s exact test. Requires biological replicates for DMR detection.
• BiSeq* (R/Bioconductor by K. Hebestreit et al.)
– Beta regression model, impractical for very large data other than RRBS or targeted BS-Seq
• RnBeads* (R package by F. Mueller et al.)
– Works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data
• DMAP* (C command line tool by P. Stockwell et al.)
– RRBS fragment or fixed window approach, Fisher’s exact test, chi-squared or ANOVA
• RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith)
– Beta-binomial regression analysis to find DMCs or DMRs, local likelihood, adjusts for neighbouring CpGs
• MOABS* (C++ command line tool by D. Sun et al.)
– Beta-binomial hierarchical model to capture sampling and biological variation. Credible Methylation Difference (CDIF): a single metric that combines biological and statistical significance.
• ComMet (Y. Saito et al., 2014)
– Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates.
• DSS (R/Bioconductor by Feng et al., 2014)
– Constructs a genome-wide prior distribution for the beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci.
• more appearing every other week…
* interface well with
Tool: bsseq
  Statistical test: Sample-wise smoothing, then group differences via CpG-wise t-tests (p-value cutoff to define adjacent CpG sites as DMRs)
  Suitable for: WGBS; not designed for targeted BS-Seq or RRBS
  Implementation: R package/Bioconductor
  Notes: Outperforms Fisher’s exact test; intended to compare 2 groups; replicates required

Tool: BiSeq
  Statistical test: Define CpG clusters, smooth methylation data, model and test group effect (fitting beta regression model to smoothed methylation levels and testing for group effect using the Wald test), hierarchical testing procedure on CpG clusters, then define DMR boundaries
  Suitable for: RRBS; targeted BS-Seq; for WGBS
  Implementation: R package/Bioconductor
  Notes: Very computationally intensive; not limited to 2 groups

Tool: methylKit
  Statistical test: Models CpG methylation within a logistic regression. Sliding linear model (SLIM) to correct for multiple testing
  Suitable for: (e)RRBS
  Implementation: R package
* WGBS = whole genome BS-Seq; (e)RRBS = (enhanced) reduced representation BS-Seq
bsseq – for whole genome BS-Seq
• Smooths low-coverage BS-Seq data first to get reliable semi-local methylation estimates
• Not suitable for captured or restricted data
• After smoothing it uses biological replicates to estimate
biological variation and identify differentially methylated regions (DMRs)
• Smoothing suitable for even a single sample
• Works for CpG context in humans, will probably not scale to
2x585M Cs in non-CG context
BSmooth algorithm
[Figure legend: black = 25x (Lister); pink = 4x (Lister)]
BiSeq - for RRBS or targeted BS-Seq
1) Define CpG cluster boundaries (requires 20 CpG sites
that are frequently covered in the majority of samples, e.g.
CGIs or targeted regions)
2) Smooth methylation data within CpG clusters (spatial
smoothing using a weighted local likelihood). Aims to
reduce the variance at sites with low coverage
3) Model and test group effect within CpG clusters (fitting
beta regression model to smoothed methylation levels and
testing for group effect using the Wald test)
4) Apply hierarchical testing procedure by Benjamini and
Heller, 2007:
– Test CpG clusters for differential methylation and control
weighted FDR on clusters
– Trim rejected CpG clusters and control FDR on single CpG sites
5) Define DMR boundaries
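Step 1 can be illustrated with a minimal sketch that groups CpG positions into clusters. This is not BiSeq's own code: the function name and the 100 bp maximum gap are assumptions, the minimum of 20 sites is taken from the step above, and the input is assumed to contain only positions already covered in most samples.

```python
def define_clusters(positions, max_gap=100, min_sites=20):
    """Group CpG positions into clusters: consecutive sites no more than
    `max_gap` bp apart, keeping clusters with at least `min_sites` sites.
    Returns a list of (start, end) coordinates."""
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append((current[0], current[-1]))
            current = []                  # the gap closes the running cluster
        current.append(pos)
    if len(current) >= min_sites:         # do not forget the final cluster
        clusters.append((current[0], current[-1]))
    return clusters

# Hypothetical positions: two dense stretches and a few isolated sites
dense_1 = list(range(1000, 1250, 10))      # 25 sites, 10 bp apart
isolated = [5000, 5300, 9000]
dense_2 = list(range(20000, 21000, 50))    # 20 sites, 50 bp apart

clusters = define_clusters(dense_1 + isolated + dense_2)
print(clusters)
```

Isolated sites never form a cluster, so they are excluded from smoothing and testing in the later steps.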