CoCoA vs. CoCoA+: Adding vs. Averaging in Distributed Optimization
Martin Takáč
SAS, 10th of March, 2015

Outline
Problem Formulation
Serial and Parallel Coordinate Descent Method (CDM)
Distributed CDM
Original CoCoA Framework
CoCoA+ Framework
Computation vs. Communication Trade-off
Spark
Some Numerical Experiments

The Problem - Regularized Empirical Loss Minimization
Let $\{(x_i, y_i)\}_{i=1}^n$ be our training data, $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
\[ \min_{w \in \mathbb{R}^d} \Big[ P(w) := \frac{1}{n} \sum_{i=1}^n \ell_i(w^T x_i) + \frac{\lambda}{2} \|w\|^2 \Big] \tag{P} \]
where
$\lambda > 0$ is a regularization parameter
$\ell_i(\cdot)$ is a convex loss function which can depend on the label $y_i$
Examples:
Logistic loss: $\ell_i(\zeta) = \log(1 + \exp(-y_i \zeta))$
Hinge loss: $\ell_i(\zeta) = \max\{0, 1 - y_i \zeta\}$

The dual problem
\[ \max_{\alpha \in \mathbb{R}^n} \Big[ D(\alpha) := -\frac{\lambda}{2} \|A\alpha\|^2 - \frac{1}{n} \sum_{i=1}^n \ell_i^*(-\alpha_i) \Big] \tag{D} \]
where
$A = \frac{1}{\lambda n} X^T$ and $X^T = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$
$\ell_i^*$ is the convex conjugate of $\ell_i$
wlog $\|x_i\| \le 1$

Duality
Primal-Dual mapping
For any $\alpha \in \mathrm{dom}(D)$ we can define
\[ w(\alpha) := A\alpha \tag{1} \]
From strong duality we have that $w^* = w(\alpha^*)$ is optimal for (P) if $\alpha^*$ is an optimal solution of (D).
Gap function
\[ G(\alpha) = P(w(\alpha)) - D(\alpha) \]

The Setting & Challenges
The size of the matrix $A$ is huge (e.g. TBs of data)
We want to use many nodes of a computer cluster (or cloud) to speed up the computation
Challenges
distributed data: no single machine can load the whole instance
expensive communication: latency of RAM is about 100 nanoseconds, latency of a standard network connection is about 250,000 nanoseconds
unreliable nodes: we assume that a node can die at any point during the computation (we want a fault-tolerant solution)

The Serial/Parallel/Distributed SDCA Algorithm
assume we have $K$ nodes (computers), each with parallel processing power
we partition the coordinates $\{1, 2, \dots, n\}$ into $K$ balanced sets $P_1, \dots, P_K$: $\forall k \in \{1, \dots, K\}$ we have $|P_k| = \frac{n}{K}$

Stochastic Dual Coordinate Ascent (serial, parallel and distributed variants)
choose $\alpha^{(0)} \in \mathbb{R}^n$
repeat
  $\alpha^{(t+1)} = \alpha^{(t)}$
  Serial: pick a random coordinate $i \in \{1, \dots, n\}$
  Parallel: pick a random subset $S \subset \{1, \dots, n\}$ with $|S| = H$; for each $i \in S$ in parallel do
  Distributed: for each computer $k \in \{1, \dots, K\}$ in parallel do: pick a random subset $S_k \subset P_k$ with $|S_k| = H \le \frac{n}{K}$; for each $i \in S_k$ in parallel do
    compute the update: $h_i^t(\alpha^{(t)}) := \arg\max_h D(\alpha^{(t)} + h e_i)$
    apply the update: $\alpha^{(t+1)} = \alpha^{(t+1)} + h_i^t(\alpha^{(t)}) e_i$

Distributed CDM
Illustration: $K = 4$ and $H = 2$

Disadvantages of Distributed CDM
we cannot choose $H > |P_k|$!
the computation of a step is very easy (usually a closed form or a slightly more complicated 1D problem)
after taking $H$ steps, the objective function usually doesn't change much
it is almost impossible to balance computation and communication

Data Distribution
The vector $\alpha$ and the columns of the matrix $A$ are partitioned according to $\{P_k\}_{k=1}^K$.
Notation: For $k \in \{1, 2, \dots, K\}$, $\alpha_k \in \mathbb{R}^{|P_k|}$ is a subvector of $\alpha$. The vector $\alpha_{[k]} \in \mathbb{R}^n$ is obtained from $\alpha$ by setting all coordinates $\notin P_k$ to zero.
Example: $\alpha_1 = (*, *, *, *)^T$, $\alpha_{[1]} = (*, *, *, *, 0, 0, \dots, 0)^T$.

Local Problem
CoCoA subproblem
At iteration $t$ at node $k$
\[ (\Delta\alpha^*)^{(t)}_{[k]} = \arg\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} D(\alpha^{(t)} + \Delta\alpha_{[k]}) = \arg\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \Big( -\frac{\lambda}{2} \|A(\alpha^{(t)} + \Delta\alpha_{[k]})\|^2 - \frac{1}{n} \sum_{i=1}^n \ell_i^*(-(\alpha^{(t)} + \Delta\alpha_{[k]})_i) \Big) \]
we cannot solve the subproblem as it depends on $\alpha^{(t)}$ and $A$
if we know $w^{(t)} = A\alpha^{(t)}$ then
\[ (\Delta\alpha^*)^{(t)}_{[k]} = \arg\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \Big( -\frac{\lambda}{2} \|w^{(t)} + A\Delta\alpha_{[k]}\|^2 - \frac{1}{n} \sum_{i \in P_k} \ell_i^*(-(\alpha^{(t)} + \Delta\alpha_{[k]})_i) \Big) \]
if we know $w^{(t)}$ we can compute $(\Delta\alpha^*)^{(t)}_{[k]}$

The CoCoA Framework
Communication-Efficient Distributed Dual Coordinate Ascent
Input: $T \ge 1$
Data: $\{(x_i, y_i)\}_{i=1}^n$ distributed over $K$ machines
Initialize: $\alpha^{(0)}_{[k]} \leftarrow 0$ for all machines $k$, and $w^{(0)} \leftarrow 0$
for $t = 1, 2, \dots, T$
  for all machines $k = 1, 2, \dots, K$ in parallel (computation)
    Solve the local problem approximately to obtain $\Delta\alpha_{[k]}$
    $\alpha^{(t)}_{[k]} \leftarrow \alpha^{(t-1)}_{[k]} + \frac{1}{K} \Delta\alpha_{[k]}$
    $\Delta w_k \leftarrow \frac{1}{K} A \Delta\alpha_{[k]}$
  reduce $w^{(t)} \leftarrow w^{(t-1)} + \sum_{k=1}^K \Delta w_k$ (communication)
The performance of this method (in the worst case) can be the same as if we randomly picked one $k$, solved the corresponding subproblem, and replaced $\frac{1}{K}$ by 1
How accurately do we need to solve the local subproblem?
How can we change the local problem to avoid averaging (e.g. to just add the local solutions)? Can we prove it will be better?

Smarter Subproblem
Local Subproblem for CoCoA+
\[ \max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} G_k^{\sigma'}(\Delta\alpha_{[k]}; w^{(t)}) \tag{2} \]
where
\[ G_k^{\sigma'}(\Delta\alpha_{[k]}; w^{(t)}) = -\frac{1}{n} \sum_{i \in P_k} \ell_i^*(-(\alpha^{(t)}_{[k]} + \Delta\alpha_{[k]})_i) - \frac{1}{K} \frac{\lambda}{2} \|w^{(t)}\|^2 - \lambda (w^{(t)})^T A \Delta\alpha_{[k]} - \frac{\lambda}{2} \sigma' \|A \Delta\alpha_{[k]}\|^2. \]
Compare with:
\[ \max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \Big( -\frac{\lambda}{2} \|w^{(t)} + A\Delta\alpha_{[k]}\|^2 - \frac{1}{n} \sum_{i \in P_k} \ell_i^*(-(\alpha^{(t)} + \Delta\alpha_{[k]})_i) \Big) \tag{3} \]
If $\sigma' = 1$ then the optimal solutions of (2) and (3) coincide.

The CoCoA+ Framework
Communication-Efficient Distributed Dual Coordinate Ascent
Input: $T \ge 1$, $\gamma \in [\frac{1}{K}, 1]$, $\sigma' \in [1, \infty)$
Data: $\{(x_i, y_i)\}_{i=1}^n$ distributed over $K$ machines
Initialize: $\alpha^{(0)}_{[k]} \leftarrow 0$ for all machines $k$, and $w^{(0)} \leftarrow 0$
for $t = 1, 2, \dots, T$
  for all machines $k = 1, 2, \dots, K$ in parallel (computation)
    approximately maximize $G_k^{\sigma'}(\Delta\alpha_{[k]}; w^{(t-1)})$ to obtain $\Delta\alpha_{[k]}$
    $\alpha^{(t)}_{[k]} \leftarrow \alpha^{(t-1)}_{[k]} + \gamma \Delta\alpha_{[k]}$
    $\Delta w_k \leftarrow \gamma A \Delta\alpha_{[k]}$
  reduce $w^{(t)} \leftarrow w^{(t-1)} + \sum_{k=1}^K \Delta w_k$ (communication)
If $\gamma = \frac{1}{K}$ we obtain CoCoA
If $\gamma = \frac{1}{K}$ then $\sigma' = 1$ is a "safe" value
What about other values of $\gamma$? (we want $\gamma = 1$)

CoCoA+ Parameters - $\sigma'$ and $\gamma$
$\sigma'$ measures the difficulty of the given data partition
it must be chosen not smaller than
\[ \sigma' \ge \sigma'_{\min} := \gamma \max_{\alpha \in \mathbb{R}^n} \frac{\|A\alpha\|^2}{\sum_{k=1}^K \|A\alpha_{[k]}\|^2} \tag{4} \]
Lemma
For any $\alpha \in \mathbb{R}^n$ ($\alpha \ne 0$) we have
\[ \frac{\|A\alpha\|^2}{\sum_{k=1}^K \|A\alpha_{[k]}\|^2} \le K \]
We can take the safe value $\sigma' = K \cdot \gamma$
Again: if $\gamma = \frac{1}{K}$ then $\sigma' = K \cdot \frac{1}{K} = 1$ is a safe value
New: if $\gamma = 1$ then $\sigma' = K \cdot 1 = K$ is a safe value

How Accurately?
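Assumption: Θ-approximate solution
We assume that there exists $\Theta \in [0, 1)$ such that $\forall k \in [K]$, the local solver at any iteration $t$ produces a (possibly) randomized approximate solution $\Delta\alpha_{[k]}$, which satisfies
\[ \mathbb{E}\big[ G_k^{\sigma'}(\Delta\alpha^*_{[k]}, w) - G_k^{\sigma'}(\Delta\alpha_{[k]}, w) \big] \le \Theta \big( G_k^{\sigma'}(\Delta\alpha^*_{[k]}, w) - G_k^{\sigma'}(0, w) \big), \tag{5} \]
where
\[ \Delta\alpha^* \in \arg\max_{\Delta\alpha \in \mathbb{R}^n} \sum_{k=1}^K G_k^{\sigma'}(\Delta\alpha_{[k]}, w). \tag{6} \]
the subproblem is not really what one wants to solve; therefore, in practice, $\Theta \approx 0.9$ (depending on the cluster and problem)
what about convergence guarantees?
how do we get a Θ-approximate solution?

Iteration Complexity - Smooth Loss
Theorem
Assume the loss functions $\ell_i$ are $(1/\mu)$-smooth, for $i \in \{1, 2, \dots, n\}$. We define
\[ \sigma_k := \max_{\alpha_{[k]} \in \mathbb{R}^n} \frac{\|A\alpha_{[k]}\|^2}{\|\alpha_{[k]}\|^2} \le |P_k| \tag{7} \]
and $\sigma_{\max} = \max_{k \in [K]} \sigma_k$. Then after $T$ iterations of CoCoA+, with
\[ T \ge \frac{1}{\gamma(1-\Theta)} \frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n} \log\frac{1}{\epsilon}, \]
it holds that $\mathbb{E}[D(\alpha^*) - D(\alpha^{(T)})] \le \epsilon$. Furthermore, after $T$ iterations with
\[ T \ge \frac{1}{\gamma(1-\Theta)} \frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n} \log\Big( \frac{1}{\gamma(1-\Theta)} \frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n} \frac{1}{\epsilon} \Big), \]
we have the expected duality gap $\mathbb{E}[P(w(\alpha^{(T)})) - D(\alpha^{(T)})] \le \epsilon$.

Averaging vs. Adding
The leading term is $\frac{1}{\gamma(1-\Theta)} \frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n}$.
Let us assume that $\forall k: |P_k| = \frac{n}{K}$
Averaging: $\gamma = \frac{1}{K}$, $\sigma' = 1$, leading term $\frac{K}{1-\Theta} \frac{\lambda\mu n + \frac{n}{K}}{\lambda\mu n} = \frac{1}{1-\Theta} \frac{\lambda\mu K + 1}{\lambda\mu}$
Adding: $\gamma = 1$, $\sigma' = K$, leading term $\frac{1}{1-\Theta} \frac{\lambda\mu n + K \frac{n}{K}}{\lambda\mu n} = \frac{1}{1-\Theta} \frac{\lambda\mu + 1}{\lambda\mu}$
Note: this is in the worst case (for the worst-case example)

Iteration Complexity - General Convex Loss
Theorem
Consider CoCoA+ starting with $\alpha^0 = 0 \in \mathbb{R}^n$, let $\ell_i(\cdot)$ be $L$-Lipschitz continuous $\forall i \in \{1, 2, \dots, n\}$, and let $\epsilon > 0$ be the desired duality gap. Then after $T$ iterations, where
\[ T \ge T_0 + \max\Big\{ \Big\lceil \frac{1}{\gamma(1-\Theta)} \Big\rceil, \frac{4L^2\sigma\sigma'}{\lambda n^2 \epsilon \gamma(1-\Theta)} \Big\}, \]
\[ T_0 \ge t_0 + \Big( \frac{2}{\gamma(1-\Theta)} \Big( \frac{8L^2\sigma\sigma'}{\lambda n^2 \epsilon} - 1 \Big) \Big)_+, \]
\[ t_0 \ge \max\Big( 0, \Big\lceil \frac{1}{\gamma(1-\Theta)} \log\Big( \frac{2\lambda n^2 (D(\alpha^*) - D(\alpha^0))}{4L^2\sigma\sigma'} \Big) \Big\rceil \Big), \]
we have that the expected duality gap satisfies $\mathbb{E}[P(w(\bar\alpha)) - D(\bar\alpha)] \le \epsilon$ at the averaged iterate
\[ \bar\alpha := \frac{1}{T - T_0} \sum_{t = T_0 + 1}^{T-1} \alpha^{(t)}, \]
where $\sigma = \sum_{k=1}^K |P_k| \sigma_k$.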
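The averaging-versus-adding comparison of the smooth-loss leading term can be checked numerically. The sketch below is a minimal illustration, assuming the worst case $\sigma_{\max} = |P_k| = n/K$ from (7); the function name `leading_term` and the sample values of λ, µ, n, K and Θ are hypothetical and chosen only for illustration.

```python
def leading_term(lam, mu, n, sigma_max, sigma_prime, gamma, theta):
    """Leading term (1 / (gamma (1 - Theta))) * (lam mu n + sigma_max sigma') / (lam mu n)."""
    return (lam * mu * n + sigma_max * sigma_prime) / (gamma * (1.0 - theta) * lam * mu * n)


# Hypothetical sample values (not taken from the experiments).
lam, mu, n, K, theta = 1e-4, 1.0, 100_000, 8, 0.9
sigma_max = n / K  # assumed worst case: sigma_max = |P_k| = n / K

# Averaging: gamma = 1/K with the safe value sigma' = 1.
averaging = leading_term(lam, mu, n, sigma_max, sigma_prime=1.0, gamma=1.0 / K, theta=theta)
# Adding: gamma = 1 with the safe value sigma' = K.
adding = leading_term(lam, mu, n, sigma_max, sigma_prime=float(K), gamma=1.0, theta=theta)

print(f"averaging: {averaging:.1f}")
print(f"adding:    {adding:.1f}")
print(f"ratio:     {averaging / adding:.4f}")
```

Under these assumptions the ratio of the two terms equals $(\lambda\mu K + 1)/(\lambda\mu + 1)$, so the adding bound is never larger than the averaging bound, and the advantage approaches a factor of $K$ only as $\lambda\mu$ grows.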
SDCA as a Local Solver
SDCA
Input: $\alpha_{[k]}$, $w = w(\alpha)$
Data: Local $\{(x_i, y_i)\}_{i \in P_k}$
Initialize: $\Delta\alpha^{(0)}_{[k]} = 0 \in \mathbb{R}^n$
for $h = 0, 1, \dots, H - 1$ do
  choose $i \in P_k$ uniformly at random
  $\delta_i^* = \arg\max_{\delta_i \in \mathbb{R}} G_k^{\sigma'}(\Delta\alpha^{(h)}_{[k]} + \delta_i e_i, w)$
  $\Delta\alpha^{(h+1)}_{[k]} = \Delta\alpha^{(h)}_{[k]} + \delta_i^* e_i$
end for
Output: $\Delta\alpha^{(H)}_{[k]}$
Theorem
Assume the functions $\ell_i$ are $(1/\mu)$-smooth for $i \in \{1, 2, \dots, n\}$. If
\[ H \ge n_k \frac{\sigma' + \lambda n \mu}{\lambda n \mu} \log\frac{1}{\Theta} \tag{8} \]
then SDCA will produce a Θ-approximate solution.

Total Runtime
To get accuracy $\epsilon$ we need $O\big( \frac{1}{1-\Theta} \log\frac{1}{\epsilon} \big)$ iterations
Recall $\Theta = \big( 1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma} \frac{K}{n} \big)^H$
Let
$\tau_o$ be the duration of communication per iteration
$\tau_c$ be the duration of ONE coordinate update during the inner iteration
Total runtime
\[ O\Big( \frac{1}{1-\Theta} (\tau_o + H\tau_c) \Big) = O\Big( \frac{1}{1-\Theta} \Big( 1 + H \underbrace{\frac{\tau_c}{\tau_o}}_{r_{c/o}} \Big) \tau_o \Big) \]

$H(\tau_c/\tau_o)$, $\Theta(\tau_c/\tau_o)$
[Figure: four panels, "optimal H vs. r" and "optimal θ vs. r", each for 1e−3 and 1e−6; x-axis: computation to communication ratio (log scale)]

Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
it runs on Hadoop, Mesos, standalone, or in the cloud
it can access diverse data sources including HDFS, Cassandra, HBase, S3
it is slower than our C++ code for CoCoA+
we run it on Amazon Elastic Compute Cloud (Amazon EC2)

Numerical Experiments
Datasets
Dataset    Training (n)   Features (d)   nnz/(n · d)   λ      Workers (K)
cov        522,911        54             22.22%        1e-6   4
rcv1       677,399        47,236         0.16%         1e-6   8
imagenet   32,751         160,000        100%          1e-5   32

Dependence of Primal Suboptimality on H
[Figure: COV; log primal suboptimality vs. time (s); curves for H = 1e5, 1e4, 1e3, 100, 1]

Comparison with Different Algorithms
COCOA
minibatch-CD
Tianbao Yang. Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent. In NIPS 2013.
Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-Batch Primal and Dual Methods for SVMs. In ICML, March 2013.
local-SGD
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3–30, October 2010.
batch-SGD
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3–30, October 2010.
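The computation-versus-communication trade-off from the Total Runtime slide can be explored numerically. The following is a minimal sketch, not the authors' code: it assumes $\Theta(H) = (1 - q)^H$ with $q = \frac{\lambda n \gamma}{1 + \lambda n \gamma} \frac{K}{n}$ as recalled on that slide (with γ the symbol used in that formula), drops the $\log(1/\epsilon)$ factor and the constant $\tau_o$, and scans integer values of $H$ for the minimizer of $(1 + H r)/(1 - \Theta(H))$, where $r = \tau_c/\tau_o$. The function names, the search range `H_max` and the sample parameter values are hypothetical.

```python
def theta(H, lam, n, gamma, K):
    """Theta(H) = (1 - q)^H with q = lam*n*gamma / (1 + lam*n*gamma) * K / n."""
    q = lam * n * gamma / (1.0 + lam * n * gamma) * K / n
    return (1.0 - q) ** H


def relative_runtime(H, r, lam, n, gamma, K):
    """Runtime up to constants: (1 + H * r) / (1 - Theta(H)), with r = tau_c / tau_o."""
    return (1.0 + H * r) / (1.0 - theta(H, lam, n, gamma, K))


def best_H(r, lam, n, gamma, K, H_max=20_000):
    """Integer H in [1, H_max] minimizing the relative runtime (brute-force scan)."""
    return min(range(1, H_max + 1),
               key=lambda H: relative_runtime(H, r, lam, n, gamma, K))


# Hypothetical sample values (not taken from the experiments).
lam, n, gamma, K = 1e-3, 10_000, 1.0, 4
for r in (1e-4, 1e-2, 1e0, 1e2):
    H = best_H(r, lam, n, gamma, K)
    print(f"r = {r:g}: best H = {H}, Theta = {theta(H, lam, n, gamma, K):.3f}")
```

In this simplified model, cheaper coordinate updates relative to communication (smaller $r$) shift the minimizer toward larger $H$, and therefore toward smaller $\Theta$.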
[Figure: log primal suboptimality vs. time (s). Cov: COCOA (H=1e5), minibatch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1). RCV1: COCOA (H=1e5), minibatch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100). Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10).]

[Figure: log primal suboptimality vs. # of communicated vectors, for the same datasets, methods and values of H.]

CoCoA vs. CoCoA+
[Figure: Covertype, 1e-4 and 1e-5; duality gap vs. number of communications and vs. elapsed time (s); CoCoA and CoCoA+, each with H = 10^6, 10^5, 10^4]

CoCoA vs. CoCoA+
[Figure: RCV1, 1e-4 and 1e-5; duality gap vs. number of communications and vs. elapsed time (s); CoCoA and CoCoA+, each with H = 10^6, 10^5, 10^4]

Scaling up
[Figure: three panels, "Scaling up K, RCV1" (two) and "Scaling up K, Epsilon"; x-axis: number of machines (K); y-axes: time (s) to e-3 accurate primal, time (s) to e-2 duality gap, time (s) to e-4 duality gap; methods: CoCoA+, CoCoA, Mini-batch SGD]

Effect of σ'
[Figure: "Effect of σ' for γ = 1 (adding)"; duality gap vs. number of communications and vs. elapsed time (s); curves for σ' = 8 (K), 6, 4, 2, 1]

References
1 Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik and Martin Takáč: Adding vs. Averaging in Distributed Primal-Dual Optimization, arXiv:1502.03508, 2015.
2 Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Thomas Hofmann and Michael I. Jordan: Communication-Efficient Distributed Dual Coordinate Ascent, NIPS 2014.
3 Richtárik, P. and Takáč, M.: On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013.
4 Richtárik, P. and Takáč, M.: Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, 2012.
5 Richtárik, P. and Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Mathematical Programming, 2012.
6 Takáč, M., Bijral, A., Richtárik, P. and Srebro, N.: Mini-batch primal and dual methods for SVMs, ICML, 2013.
7 Qu, Z., Richtárik, P. and Zhang, T.: Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014.
8 Qu, Z., Richtárik, P., Takáč, M. and Fercoq, O.: SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization, arXiv:1502.02268, 2015.
9 Tappenden, R., Takáč, M. and Richtárik, P.: On the Complexity of Parallel Coordinate Descent, arXiv:1503.03033, 2015.