Scalable MCMC Hong Ge Jes Frellsen Department of Engineering University of Cambridge Machine Learning Reading Group 07 May 2015 1 Bayesian inference and MCMC In Bayesian inference we rely on the posterior p(θ|D) ∝ p(D|θ)p(θ) In many applications the posterior is intractable and we have to rely on approximations. A standard approach is to use MCMC where we • Construct a Markov chain with stationary distribution p(θ|D) θ1 , θ2 , θ3 , . . . where p(θi |θi−1 . . . θ1 ) = τ (θt |θt−1 ) • Calculate posterior expectations using a Monte Carlo estimate (unbiased) T 1X Ep(θ|D) [f ] ≈ f (θt ) T t=1 2 Computations bottleneck in MCMC If the data points D = {x1 , . . . , xN } are conditional independent given θ we can write the posterior as p(θ|D) ∝ p(θ)p(D|θ) = p(θ) N Y i=1 p(xi |θ) If the data set is larger (N 1) evaluating p(D|θ) or ∇p(D|θ) is the computational bottleneck for most standard MCMC methods, including • Metropolis-Hastings (MH) Algorithm • Langevin Monte Carlo (LMC) • Hamiltonian Monte Carlo (HMC) 3 Outline We are going to cover some recent developments in scalable approaches to • Metropolis-Hastings (MH) algorithm • Langevin Monte Carlo (LMC) • Hamiltonian Monte Carlo (HMC) We are not going to cover distributed approaches (e.g., Ahn, Shahbaba, et al. (2014) and Xu et al. (2014)) 4 Scaling up Metropolis-Hastings 5 Metropolis-Hastings (MH) algorithm • Probably the most widely used MCMC algorithm • Algorithm for constructing a Markov chain θ1 , θ2 , θ3 , . . . with stationary distribution p(θ|D) At time t: 1. Propose a state θ0 ∼ q(·|θt−1 ) p(θ0 |D)q(θ|θ0 ) 2. Accept θ as θt with probability α(θ |θ) = min 1, p(θ|D)q(θ0 |θ) 0 0 • In the MH algorithm the transition operator is given by Z τ (θt |θt−1 ) = q(θt |θt−1 )α(θt |θt−1 ) + δθt−1 ,θt 1 − q(θ|θt−1 )α(θ|θt−1 ) dθ {z } | | {z } acceptance rejection 6 Scaling up Metropolis-Hastings by subsampling “For each proposed sample, the MH rule needs to examine the likelihood of all data-items. When the number of data-cases is large this is an awful lot of computation for one bit of information, namely whether to accept or reject a proposal.” (Korattikara et al. 2014) We are going to consider three approaches to addressing this problem by only using a subset of the data: • Firefly Monte Carlo (FlyMC) (Maclaurin and Adams 2014) • The Austerity framework (Korattikara et al. 2014) • Metropolis-Hastings with subsampled likelihoods, MHS UB L HD (Bardenet et al. 2014) 7 Firefly Monte Carlo Maclaurin and Adams (2014) • An exact MCMC procedure that only uses a subset of the data • Leaves the full-data posterior invariant • Introduces a set of binary auxiliary variables that can effectively turn on/off the data points (hence the name firefly) • Required a lower bound for each likelihood term 8 Firefly Monte Carlo Maclaurin and Adams (2014) • Recall that we assumed the posterior p(θ|D) ∝ p(θ)p(D|θ) = p(θ) N Y i=1 p(xi |θ) Li (θ) = p(xi |θ) • We introduce • A set of binary auxiliary variables zi ∈ {0, 1} • Bi (θ) which is a lower bound on Li (θ), i.e., 0 < Bi (θ) ≤ Li (θ) • Each zi is Bernoulli distributed Li (θ) − Bi (θ) p(zi |xi , θ) = Li (θ) zi Bi (θ) Li (θ) 1−zi • Augment the posterior distribution with z1:N without chaining the marginal p(θ|D) p(θ, z1:N |x1:N ) ∝ p(θ)p(x1:N , z1:N |θ) = p(θ) N Y i=1 p(xi |θ)p(zi |xi , θ) 9 Firefly Monte Carlo Maclaurin and Adams (2014) • Consider the i’th factor in the augmented the posterior Li (θ) − Bi (θ) zi Bi (θ) 1−zi p(xi |θ)P(zi |xi , θ) = Li (θ) Li (θ) Li (θ) ( Li (θ) − Bi (θ) if zi = 1 = Bi (θ) if zi = 0 • {xi |zi = 1} forms a minibatch for which we only need to evaluate the likelihood • Bi (θ) needs to have a convenient form, such that QN i=1 Bi (θ) can be computed in O(1) (e.g., exponential family distributions) N h i Y L (θ) − B (θ) Y i i p(θ|D) ∝ p(θ) Bi (θ) Bi (θ) i=1 i|zi =1 • We can use Gibbs sampling (zi ∼ P(zi |xi , θ)) to resample a random (fixed-size) subset of {zi } 10 Illustration of FlyMC on synthetic toy data Maclaurin and Adams (2014) Illustration of the FlyMC algorithm on a logistic regression model A two-class classification problem in two dimensions (and one bias dimension) 11 Results from empirical evaluations of FlyMC Maclaurin and Adams (2014) For the MAP-tuned results stochastic gradient decent optimisation was used to make the bound tight at the MAP parameters 12 The Austerity framework and MHS UB L HD Bardenet et al. (2014) and Korattikara et al. (2014) • Speed up MH by performing an approximate accept/reject step • We will be sampling from ˜ p(θ|D) and not the correct p(θ|D) PT • The Monte Carlo estimator ˆI = T −1 t=1 f (θt ) computed using samples from ˜ p(θ|D) will be biased • The MSE of ˆI can be decomposed into bias B and variance V E[(I − ˆI)2 ] = B2 + V where V = O(1/T). • Despite the bias introduced by ˜ p(θ|D) we can (hope to) obtain a lower MSE on ˆI with the approximate accept/reject step since we can collect more samples (T) on the same computational budget 13 Rewriting the Metropolis-Hastings algorithm Bardenet et al. (2014) and Korattikara et al. (2014) 1: 2: 3: 4: 5: 6: for t ∈ {0, . . . , M − 1} do θ0 ∼ q(·|θt ) u ∼ U(0, 1) 0 )P(D|θ 0 ) q(θ |θ 0 ) t α(θ0 |θt ) = p(θ p(θt )P(D|θt ) q(θ0 |θt ) if u < α(θ0 |θt ) then θt+1 ← θ0 else θt+1 ← θt By rearranging the terms in u < α(θ0 |θt ) we get the algorithm for t ∈ {0, . . . , M − 1} do θ0 ∼ q(·|θt ) 3: u ∼ U(0, 1)h i p(θt ) q(θ0 |θt ) 4: ψ ← N1 log u p(θ 0 ) q(θ |θ 0 ) h t 0 i 1 PN i |θ ) 5: µN ← N i=1 log p(x p(xi |θt ) 1: 2: 6: 7: if ψ < µN then θt+1 ← θ0 else θt+1 ← θt 14 Approximating the MH test Bardenet et al. (2014) and Korattikara et al. (2014) • We have reformulated the MH test as N 1 p(θ) q(θ0 |θ) 1X p(xi |θ0 ) log u < log N p(θ0 ) q(θ|θ0 ) N p(xi |θ) i=1 where the RHS is the average log likelihood ratio N N 1X p(xi |θ0 ) 1X µN = log = li N p(xi |θ) N i=1 i=1 • Can we approximate the MH test by using only a subsample (Monte Carlo estimate) of D = {x1 , . . . , xN }? ∗ 0 n n p(xi |θ ) 1X 1X ∗ log li µ∗n = = n p(xi∗ |θ) n i=1 where {x1∗ , . . . , xn∗ } i=1 are sampled w/o replacement from D • The Austerity paper and the MHS UB L HD paper provided two different approaches to this 15 Austerity framework Korattikara et al. (2014) • Reformulate the MH test as a statistical decision problem ψ(θ, θ0 , u) < 1 N PN i=1 li • Let {l1∗ , . . . , ln∗ } be the log likelihood ratios of the subsample of D P • Test if the difference between the subsample mean ¯l = 1n i li∗ and ψ(θ, θ0 , u) is significantly greater than the SD of ¯l at confidence level • If so, we can use ¯l to perform the MH test (with chance of error) • Else we need to draw more data • Keep drawing until we reach confidence level or t = N • If n is large enough then ¯l ∼ N (central limit theorem) and we can use a sequential one-sample t-test 16 Austerity framework Korattikara et al. (2014) 0.5 0 ε =0.00, T = 75484 ε =0.01, T = 133069 ε =0.05, T = 200672 ε =0.10, T = 257897 ε =0.20, T = 422978 −4 −0.5 −6 −1 −1.5 −8 −10 −12 0 ε = 0, T = 5992 ε = 0.01, T = 11320 ε = 0.05, T = 40973 ε = 0.1, T = 171917 ε = 0.2, T = 1894000 0 Log (Risk) Log (Risk) −2 −2 −2.5 50 100 150 200 250 Wall Clock Time (secs) 300 350 400 Logistic regression on MNIST for classifying 7 vs. 9. Risk in estimating the predictive mean. −3 −3.5 0 1000 2000 3000 4000 5000 Wall Clock Time (secs) 6000 7000 Independent Component Analysis on synthetic data with 4 sources. The risk is in mean Amari distrance. 17 MHS UB L HD Bardenet et al. (2014) • Approximate the MH test ψ(θ, θ 0 , u) < µN by use of concentration inequalities • Obtains exact confidence intervals for the Monte Carlo approximation µ∗n of µN , h ∗ 0 i P p(x |θ ) µ∗n = 1t ti=1 log p(xi∗ |θ) i µN = 1 N PN i=1 log h p(xi |θ0 ) p(xi |θ) i where {x1∗ . . . xn∗ } are sampled w/o replacement from {x1 . . . xN } • The precision of the estimate µ∗n can be quantify through concentration inequalities, i.e., for δn > 0 P(|µ∗n (θ, θ0 ) − µN (θ, θ0 )| ≤ cn ) ≥ 1 − δn 18 Hoeffding’s inequality Bardenet et al. (2014) • Hoeffding’s inequality without replacement tells us that P(|µ∗n (θ, θ0 ) − µN (θ, θ0 )| ≤ cn ) ≥ 1 − δn where s Cθ,θ0 2(1 − − log(2/δn ) n 0 = max | log p(xi |θ ) − log p(xi |θ)| cn = Cθ,θ0 n−1 N ) 1≤i≤n • Main challenge is to find Cθ,θ0 without looking at data • Gaussian likelihood: Cθ,θ0 ∝ maxi |xi | • Logistic regression likelihood: Cβ,β 0 = ||β − β 0 || maxi |xi | • In practice tighter bounds are used 19 Using the concentration inequality Bardenet et al. (2014) • The concentration inequality P(|µ∗n (θ, θ0 ) − µN (θ, θ0 )| ≤ cn ) ≥ 1 − δn tells us that • If |µ∗n − ψ| > cn we can take the correct decision for the MH test (ψ(θ, θ0 , u) < µN ) with probability at least 1 − δn • Else we need to increase t until |µ∗n − ψ| > cn • The authors have a rule for setting δn such that the total probability of making an error in each decision is less than a user specified δ > 0 20 Austerity vs. MHS UB L HD Bardenet et al. (2014) 1800 1600 1400 1200 1000 800 600 400 200 0 0.080 Approx. CI, 3% of n MHSubLhd, 54% of n 2500 2000 Approx. CI, 16% of n MHSubLhd, 54% of n 1500 1000 500 0.085 0.090 0.095 0.100 0.105 2 (a) D ∼ N (0, 0.1 ) and initial m = 2 0 0.095 0.096 0.097 0.098 0.099 0.100 0.101 0.102 0.103 (b) D ∼ N (0, 0.12 ) and initial m = 500 3.0 2.5 2.0 Approx. CI, 21% of n MHSubLhd, 99% of n 1.5 1.0 0.5 0.0 25 30 35 40 45 50 (b) D ∼ ln N (0, 2) and initial m = 500 21 Austerity vs. MHS UB L HD vs. FlyMC Bardenet et al. (2014) and Korattikara et al. (2014) • The main difference between the three methods is • Austerity uses an approximate confidence interval • MHS UB L HD uses an exact confidence interval (but we need to know Cθ,θ0 ) PT • Only FlyMC will produce unbiased estimates ˆI = T −1 t=1 f (θt ) • However • In MHS UB L HD the probability of making a wrong accept/reject is directly bounded by δ • In the Austerity there is a computational overhead to set and the batch size m based on a required tolerance δ on the average or worst case error • The main challenge in applying Firefly Monte Carlo is to find a good lower bound on the likelihood 22 Scalable Langevin Monte Carlo (LMC) 23 Langevin Monte Carlo (LMC) Kennedy (1990) and R. Neal (2011) Stochastic diffusion as adaptive proposal process • For θ ∈ RD with density p(θ), U(θ) ≡ − log p(θ), define Langevin diffusion 1 dθ(t) = ∇θ U θ(t) dt + db(t) 2 (1) where b denotes a D-dimensional Brownian motion. • First order Euler discrete integration of diffusion θ(τ + ) = θ(τ ) + 2 ∇θ U θ(τ ) + z(τ ) 2 (2) • Equivalent with proposal 2 q(θ 0 | θ) = N · | µ(θ, ), 2 I with µ(θ, ) = θ + ∇θ θ 2 • Use MH acceptance step to correct for bias 24 Stochastic Gradient Langevin Dynamics (SGLD) Welling and Teh (2011) • Idea: Langevin dynamics with stochastic gradients. θ(τ + ) = θ(τ ) + n τ NX ∇θ p(θ) + ∇θ log p(xti | θ) + z(τ ) 2 n i=1 z(τ ) ∼ N (0, τ I) • Update is just SGD plus Gaussian noise. • Noise variance is balanced with gradient step sizes. • τ decreases to 0 slowly (step-size requirement). 25 Stochastic Gradient Langevin Dynamics (SGLD) Welling and Teh (2011) n τ NX θ(τ + ) = θ(τ ) + ∇θ p(θ) + ∇θ log p(xti | θ) + z(τ ) 2 n i=1 z(τ ) ∼ N (0, τ I) • Only computationally expensive part of Langevin dynamics is the gradient computation. If gradient can be well-approximated on small minibatches the algorithm will work well. • As τ → 0 • MH acceptance probability approaches 1, so we can ignore the expensive MH accept/reject step. • τ approaches 0 slowly enough, so discretised Langevin dynamics still able to explore whole parameter space. 26 Scalable Hamiltonian Monte Carlo (HMC) 27 Hamiltonian Monte Carlo (HMC) Duane et al. (1987), R. M. Neal (1993), and R. Neal (2011) • For θ ∈ RD with density p(θ), U(θ) ≡ − log p(θ) − log Z, where U(θ) is the “potential energy” at “position” θ. • Introduce extra momentum variables, p, of the same dimension as θ, and define a “kinetic energy”, K(p). Typically, K(p) = pT p 2. • Define the “Hamitonian” to be H(θ, p) = U(θ) + K(p), and let (θ, p) ∼ 1 exp − H(θ, p) Z Note: θ and p are independent, and the marginal density for θ is p(θ). 28 Hamiltonian Monte Carlo (HMC) Duane et al. (1987), R. M. Neal (1993), and R. Neal (2011) • Repeatedly do the following: • Sample p from its density, which is N 0, I . • Find a proposal (θ 0 , p0 ) from (θ, p) by simulating Hamiltonian dynamics for some amount of fictitious time. • Accept or reject (θ 0 , p0 ) as the next state according to the Metropolis acceptance probability,i h min 1, exp − H(θ 0 , p0 ) + H(θ, p) . 29 Hamiltonian Monte Carlo (HMC) Hamiltonian Dynamics • Hamiltonian dynamics is defined by the following differential equations for how the state (θ, p) evolves with “time”, t: dθ ∂H = , dt ∂p dp ∂H =− dt ∂θ (3) • When H(θ, p) = U(θ) + K(p) and K(p) = pT p 2, we see that dθi = pi , dt dpi ∂U =− dt ∂θi (4) which is equivalent with dθi = pi dt, dpi = − ∂U dt ∂θi (5) so the “position” variables get pushed by the “momentum” variables, which are themselves controled by the gradient of the potential energy. 30 Stochastic gradient HMC SGHMC; Chen et al. (2014) Key ideas of SGHMC: • Simulating Hamiltonian dynamics requires gradients of target distribution. • Replace exact gradient with stochastic gradient • MH correction step requires calculation of likelihoods. • Ignore the expensive MH accept/reject step 31 Stochastic gradient HMC SGHMC; Chen et al. (2014) • In SGHMC, exact gradient is replaced with stochastic gradient calculated using a subset of data S ⊂ {x1 , . . . , xN }: N X ˜ U(θ) =− log p(xi | θ) − log p(θ) |S| (6) xi ∈S • Assume that: ˜ ∇θ U(θ) ≈ ∇θ U(θ) + N 0, 2B(θ) (7) • Assuming unit mass, a naive construction of SGHMC would be: dθ = pdt, ˜ dp = ∇θ U(θ)dt + N 0, 2B(θ)dt (8) 32 Stochastic gradient HMC SGHMC; Chen et al. (2014) dθ = pdt, ˜ dp = ∇θ U(θ)dt + N 0, 2B(θ)dt (9) • However, the equilibrium distribution of dynamical system (9) (without an MH step) is not the model posterior. • SGHMC overcome this by introducing a friction term B(θ)p into the Hamiltonian dynamics dθ = pdt, ˜ dp = ∇θ U(θ)dt − B(θ)pdt + N 0, 2B(θ)dt (10) • With this fraction term, the model posterior is preserved. 33 Stochastic gradient HMC SGHMC; Chen et al. (2014) • So far we have assumed the gradient noise B(θ) is known. • In practise, this needs to be estimated, leading to a troublesome issue to practitioners. • An alternative solution is treating B(θ) as part of the dynamical system. • Leads to stochastic gradient Nosé-Hoover thermostat dynamics (SGNHT). 34 Stochastic gradient Nosé-Hoover thermostats SGNHT; Ding et al. (2014) • Using Nosé-Hoover thermostats dynamics as a proposal, we have: dθ = pdt, ˜ dp = ∇θ U(θ)dt − ξpdt + N 0, 2Cdt , 1 dξ = ( pT p − 1)dt n (11) (12) (13) • The formulas without the stochastic term N(0, 2Cdt) is known as Nosé-Hoover thermostats. • This Nosé-Hoover thermostats has the model posterior as its equilibrium distribution (proved via Fokker-Planck equations). 35 Stochastic gradient Nosé-Hoover thermostats SGNHT; Ding et al. (2014) Algorithm 1: Stochastic Gradient Nos´e-Hoover Thermostat Input: Parameters h, A. Initialize ✓(0) 2 Rn , p(0) ⇠ N (0, I), and ⇠(0) = A ; for t = 1, 2, . . . do ˜ (✓(t 1) ) from (2) ; Evaluate rU p ˜ (✓(t 1) )h + 2A N (0, h); p(t) = p(t 1) ⇠(t 1) p(t 1) h rU ✓(t) = ✓(t 1) + p(t) h; ⇠(t) = ⇠(t 1) + ( n1 p> (t) p(t) 1)h; end Illustrations of a Double-well Potential To illustrate that the adaptive upda the mean kinetic energy, and more importantly, produce correct sampling w the gradient, we consider the following double-well potential, U (✓) = (✓ + 4)(✓ + 1)(✓ 1)(✓ 3)/14 + 0.5. 36 Stochastic gradient Nosé-Hoover thermostats SGNHT; Ding et al. (2014) Results on the double-well problem: U(θ) = (θ + 4)(θ + 1)(θ − 1)(θ − 3)/14 + 0.5 37 Stochastic gradient Nosé-Hoover thermostats SGNHT; Ding et al. (2014) Stochastic Gradient Thermostats More results on Bayesian NN (MNIST), PMF (Netflix, Results Movielens1m), and LDA (ICML): 38 Further reading Articles not covered in this presentation: • M. Xu et al. (2014). “Distributed Bayesian posterior sampling via moment sharing”. In: Advances in Neural Information Processing Systems, pp. 3356–3364 • S. Ahn, B. Shahbaba, et al. (2014). “Distributed stochastic gradient MCMC”. . In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1044–1052 • H. Strathmann et al. (2015). “Unbiased Bayes for Big Data: Paths of Partial Posteriors”. In: arXiv preprint arXiv:1501.03326 • S. Ahn, A. K. Balan, et al. (2012). “Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring”. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012 • W. Neiswanger et al. (2014). “Asymptotically Exact, Embarrassingly Parallel MCMC”. . In: Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI) • D. N. VanDerwerken and S. C. Schmidler (2013). “Parallel Markov Chain Monte Carlo”. In: arXiv preprint arXiv:1312.7479 • D. Wilkinson (2005). “Parallel Bayesian Computation”. In: Handbook of Parallel Computing and Statistics. Chapman and Hall/CRC, pp. 477–508 39 References I Ahn, S., A. K. Balan, and M. Welling (2012). “Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring”. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. Ahn, S., B. Shahbaba, and M. Welling (2014). “Distributed stochastic gradient MCMC”. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1044–1052. Bardenet, R., A. Doucet, C. Holmest, A. Doucet, and C. Holmes (2014). “Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach”. In: Proceedings of The 31st International Conference on Machine Learning. Beijing, China, pp. 405–413. Chen, T., E. Fox, and C. Guestrin (2014). “Stochastic Gradient Hamiltonian Monte Carlo”. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1683–1691. Ding, N., Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven (2014). “Bayesian Sampling Using Stochastic Gradient Thermostats”. In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger. Curran Associates, Inc., pp. 3203–3211. 40 References II Duane, S., A. Kennedy, B. J. Pendleton, and D. Roweth (1987). “Hybrid Monte Carlo”. In: Physics Letters B 195.2, pp. 216–222. Kennedy, A. (1990). “The Theory of Hybrid Stochastic Algorithms”. English. In: Probabilistic Methods in Quantum Field Theory and Quantum Gravity. Ed. by P. Damgaard, H. Hüffel, and A. Rosenblum. Vol. 224. NATO ASI Series. Springer, pp. 209–223. Korattikara, A., Y. Chen, and M. Welling (2014). “Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget”. In: Proceedings of The 31st International Conference on Machine Learning. Beijing, China, pp. 181–189. Maclaurin, D. and R. P. Adams (2014). “Firefly Monte Carlo: Exact MCMC with Subsets of Data”. In: Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI). Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Tech. rep. CRG-TR-92-1. Department of Computer Science, University of Toronto Toronto, CA. Neal, R. (2011). “MCMC Using Hamiltonian Dynamics”. In: Chapman and Hall/CRC. 41 References III Neiswanger, W., E. Xing, and C. Wang (2014). “Asymptotically Exact, Embarrassingly Parallel MCMC”. In: Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI). Strathmann, H., D. Sejdinovic, and M. Girolami (2015). “Unbiased Bayes for Big Data: Paths of Partial Posteriors”. In: arXiv preprint arXiv:1501.03326. VanDerwerken, D. N. and S. C. Schmidler (2013). “Parallel Markov Chain Monte Carlo”. In: arXiv preprint arXiv:1312.7479. Welling, M. and Y. W. Teh (2011). “Bayesian learning via stochastic gradient Langevin dynamics”. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688. Wilkinson, D. (2005). “Parallel Bayesian Computation”. In: Handbook of Parallel Computing and Statistics. Chapman and Hall/CRC, pp. 477–508. Xu, M., B. Lakshminarayanan, Y. W. Teh, J. Zhu, and B. Zhang (2014). “Distributed Bayesian posterior sampling via moment sharing”. In: Advances in Neural Information Processing Systems, pp. 3356–3364. 42
© Copyright 2025