ISIT Tutorial: Information theory and machine learning, Part II
Martin Wainwright (UC Berkeley), Emmanuel Abbe (Princeton University)

Inverse problems on graphs

A large variety of machine learning and data-mining problems are about inferring global properties of a collection of agents by observing local noisy interactions of these agents.

Examples:
- community detection in social networks
- image segmentation
- data classification and information retrieval
- object matching, synchronization
- page sorting
- protein-to-protein interactions
- haplotype assembly
- ...

In each case: observe information on the edges of a network that has been generated from hidden attributes on the nodes, and try to infer back these attributes.

Dual to the graphical model learning problem (previous part).

What about graph-based codes?
[Figure: bits $X_1, \dots, X_n$ encoded by a code $C$ and sent through a channel $W$, giving outputs $Y_1, \dots, Y_N$]

Different: the code is a design parameter and typically uses specific non-local interactions of the bits (e.g., random, LDPC, polar codes).

(1) What are the relevant types of “codes” and “channels” behind machine learning problems?
(2) What are the fundamental limits for these?

Outline of the talk
1. Community detection and clustering
2. Stochastic block models: fundamental limits and capacity-achieving algorithms
3. Open problems
4. Graphical channels and low-rank matrix recovery

Community detection and clustering

Networks provide local interactions among agents; one often wants to infer global similarity classes.
- social networks: “friendship”
- call graphs: “calls”
- biological networks: “protein interactions”
- genome HiC networks: “DNA contacts”

The challenges of community detection

A long-studied and notoriously hard problem:
- what is a good clustering? assortative and disassortative relations
- how to get a good clustering? computationally hard
- work with models
- many heuristics...

Tutorial motto: Can one establish a clear line-of-sight, as in communications with the Shannon capacity?

[Figure: peak data efficiency versus SNR for WAN wireless technologies (HSDPA, EV-DO, 802.16), with the Shannon capacity bounding the infeasible region]

The Stochastic Block Model

The stochastic block model SBM(n, p, W):
- $p = (p_1, \dots, p_k)$: probability vector = relative sizes of the communities; $P = \mathrm{diag}(p)$
- $W = \begin{pmatrix} W_{11} & \cdots & W_{1k} \\ \vdots & \ddots & \vdots \\ W_{k1} & \cdots & W_{kk} \end{pmatrix}$: symmetric matrix with entries in [0,1] = probabilities of connecting among communities

[Figure: example with k = 4: communities of sizes $np_1, \dots, np_4$, connected with probabilities $W_{ij}$]

The DMC of clustering..?

The (exact) recovery problem

Let $X^n = [X_1, \dots, X_n]$ represent the community variables of the nodes (drawn under $p$).

Definition. An algorithm $\hat{X}^n(\cdot)$ solves (exact) recovery in the SBM if, for a random graph $G$ under the model, $\lim_{n \to \infty} P(X^n = \hat{X}^n(G)) = 1$.

We will see weaker recovery requirements later.

Starting point: progress in science often comes from understanding special cases...

SBM with 2 symmetric communities: 2-SBM

2-SBM: $p_1 = p_2 = 1/2$, $W_{11} = W_{22} = p$, $W_{12} = q$
[Figure: two communities of size n/2; edge probability p inside each community, q across]

Some history for 2-SBM

Recovery problem: $P(\hat{X}^n = X^n) \to 1$
- 1983: Holland, Laskey, Leinhardt
- Bui, Chaudhuri, Leighton, Sipser ’84: maxflow-mincut; $p = \Omega(1/n)$, $q = o(n^{-1-4/((p+q)n)})$
- Boppana ’87: spectral meth.; $(p-q)/\sqrt{p+q} = \Omega(\sqrt{\log(n)/n})$
- Dyer, Frieze ’89: min-cut via degrees; $p - q = \Omega(1)$
- Snijders, Nowicki ’97: EM algo.; $p - q = \Omega(1)$
- Jerrum, Sorkin ’98: Metropolis algo.; $p - q = \Omega(n^{-1/6+\epsilon})$
- Condon, Karp ’99: augmentation algo.; $p - q = \Omega(n^{-1/2+\epsilon})$
- Carson, Impagliazzo ’01: hill-climbing algo.; $p - q = \Omega(n^{-1/2}\log^4(n))$
- McSherry ’01: spectral meth.; $(p-q)/\sqrt{p} \ge \Omega(\sqrt{\log(n)/n})$
- Bickel, Chen ’09: N-G modularity; $(p-q)/\sqrt{p+q} = \Omega(\log(n)/\sqrt{n})$
- Rohe, Chatterjee, Yu ’11: spectral meth.; $p - q = \Omega(1)$

Algorithm-driven... Instead of ‘how’, when can we recover the clusters (IT)?

Information-theoretic view of clustering
- Coding: $X_1, \dots, X_n$ are encoded by a code $C$ and sent through the channel $W = \begin{pmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{pmatrix}$, giving $Y_1, \dots, Y_N$; rate $R = n/N$; reliable comm. iff $R < 1 - H(\epsilon)$.
- 2-SBM: an unorthodox code! $N = \binom{n}{2}$, rate $R = 2/n$, channel $W = \begin{pmatrix} 1-p & p \\ 1-q & q \end{pmatrix}$; reliable comm. iff exact recovery; the condition on the parameters: ???
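The 2-SBM described above is easy to simulate. The following is a minimal sketch, not taken from the slides: the names `sample_2sbm` and `cut_size`, the ±1 labels, and the choice of exactly balanced communities of size n/2 (rather than labels drawn i.i.d. under p) are all illustrative assumptions.

```python
import random


def sample_2sbm(n, p, q, seed=None):
    """Sample a 2-SBM with two communities of size n/2 (hypothetical helper).

    Nodes in the same community are connected with probability p,
    nodes in different communities with probability q.
    Returns (labels, edges): labels[i] is +1 or -1, edges is a set of (i, j), i < j.
    """
    if n % 2 != 0:
        raise ValueError("n must be even for two communities of size n/2")
    rng = random.Random(seed)
    labels = [1] * (n // 2) + [-1] * (n // 2)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if labels[i] == labels[j] else q
            if rng.random() < prob:  # rng.random() lies in [0, 1)
                edges.add((i, j))
    return labels, edges


def cut_size(labels, edges):
    """Number of edges whose two endpoints carry different labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])
```

For example, `sample_2sbm(1000, 0.05, 0.01, seed=0)` returns the hidden labels together with one observed graph; an algorithm solving exact recovery would have to output the labels (up to a global sign flip) from the edges alone.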
Some history for 2-SBM: recovery and detection

Recovery problem, $P(\hat{X}^n = X^n) \to 1$:
- [Abbe-Bandeira-Hall ’14]: for $p = a\log(n)/n$, $q = b\log(n)/n$, recovery iff $\frac{a+b}{2} \ge 1 + \sqrt{ab}$; efficiently achievable.
[Figure: the $(a, b)$ plane with the infeasible region]

Detection problem, $\exists \epsilon > 0 : P\left(\frac{d(\hat{X}^n, X^n)}{n} < \frac{1}{2} - \epsilon\right) \to 1$:
- 2010 to 2014: Decelle, Krzakala, Zdeborova, Moore; Coja-Oghlan; Massoulié; Mossel, Neeman, Sly
- Mossel-Neeman-Sly: for $p = a/n$, $q = b/n$, detection iff $(a-b)^2 > 2(a+b)$.

What about multiple/asymmetric communities? Conjecture: detection changes with 5 or more.

Recovery in the 2-SBM: IT limit

Converse: for $p = a\log(n)/n$, $q = b\log(n)/n$, if $\frac{a+b}{2} - \sqrt{ab} < 1$ then ML fails w.h.p.

What is ML? -> min-bisection. ML fails if two nodes can be swapped to reduce the cut.

[Figure: two communities of size n/2; node $i$ with edge counts $B_i$, $R_i$ and node $j$ with edge counts $B_j$, $R_j$, edges drawn with probabilities $p$ and $q$]

$P(\exists i : B_i \ge R_i) = ?$
$P(\exists i : B_i \ge R_i) \asymp n \, P(B_1 \ge R_1)$ (weak correlations)
$P(B_1 \ge R_1) = n^{-((a+b)/2 - \sqrt{ab}) + o(1)}$
[Abbe-Bandeira-Hall ’14]

Recovery in the 2-SBM: efficient algorithms

[Figure: the two communities labelled $+1$ and $-1$, each of size n/2, with edge probabilities $p$ inside and $q$ across]

- ML-decoding: $\max x^T A x$ s.t. $x_i = \pm 1$, $\mathbf{1}^T x = 0$. NP-hard.
- spectral: $\max x^T A x$ s.t. $\|x\| = 1$, $\mathbf{1}^T x = 0$.
- lifting: $X = xx^T$.
- SDP: $\max \mathrm{tr}(AX)$ s.t. $X_{ii} = 1$, $\mathbf{1}^T X = 0$, $X \succeq 0$, with the constraint $\mathrm{rank}(X) = 1$ dropped.
[Abbe-Bandeira-Hall ’14]

Theorem. The SDP solves recovery if $2L_{SBM} + \mathbf{1}\mathbf{1}^T + I_n \succeq 0$, where $L_{SBM} = D_{G_+} - D_{G_-} - A$.

-> Analyze the spectral norm of a random matrix:
- [Abbe-Bandeira-Hall ’14] Bernstein: slightly loose
- [Xu-Hajek-Wu ’15] Seginer bound
- [Bandeira, Bandeira-Van Handel ’15] tight bound

Note that SDP can be expensive...

The general SBM

SBM(n, p, W)

Quiz: If a node is in community $i$, how many neighbors does it have in expectation in community $j$?
1. $np_1$
2. $np_j$
3. $np_j W_{ij}$
4. $np_i W_{ij}$

[Figure: the k = 4 example; column $i$ of $nPW$, the “degree profile matrix”, collects the expected numbers $np_j W_{ij}$ of neighbors of a node in community $i$]

Back to the information-theoretic view of clustering
- Coding: reliable comm. iff $R < 1 - H(\epsilon)$; in general, reliable comm. iff $R < \max_p I(p, W)$, where $I(p, W)$ is a KL-divergence.
- Clustering: for the 2-SBM, reliable comm. iff $1 < \frac{a+b}{2} - \sqrt{ab}$; for the general SBM, reliable comm. iff $1 < J(p, W)$ ???

Main results

Theorem 1. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if
$J(p, Q) := \min_{i<j} D_+((PQ)_i, (PQ)_j) \ge 1$,
where
$D_+(\mu, \nu) := \max_{t \in [0,1]} D_t(\mu, \nu)$, $D_t(\mu, \nu) := \sum_{\ell \in [k]} \left( t\mu_\ell + (1-t)\nu_\ell - \mu_\ell^t \nu_\ell^{1-t} \right)$.

• $D_{1/2}(\mu, \nu) = \frac{1}{2}\|\sqrt{\mu} - \sqrt{\nu}\|_2^2$ is the Hellinger divergence (distance); for the 2-SBM the condition reads $\frac{1}{2}(\sqrt{a} - \sqrt{b})^2 \ge 1$ [Abbe-Bandeira-Hall ’14]
• $D_t$ is an $f$-divergence: $\sum_i \nu_i f(\mu_i/\nu_i)$ with $f(x) = 1 - t + tx - x^t$
• $\max_t -\log \sum_i \mu_i^t \nu_i^{1-t}$ is the Chernoff divergence

We call $D_+$ the CH-divergence.
[Abbe-Sandon ’15]
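To make the threshold in Theorem 1 concrete, here is a minimal sketch that evaluates $D_t$, $D_+$ and $J(p, Q)$ numerically. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the maximization over $t$ uses a plain grid search (so the result is an approximation from below), and $(PQ)_i$ is read as the $i$-th column of $\mathrm{diag}(p)Q$.

```python
def d_t(mu, nu, t):
    """D_t(mu, nu) = sum over l of t*mu_l + (1 - t)*nu_l - mu_l^t * nu_l^(1 - t)."""
    return sum(t * m + (1 - t) * v - (m ** t) * (v ** (1 - t))
               for m, v in zip(mu, nu))


def ch_divergence(mu, nu, grid=10_000):
    """Approximate D_+(mu, nu) = max over t in [0, 1] of D_t, on a uniform grid."""
    return max(d_t(mu, nu, i / grid) for i in range(grid + 1))


def j_threshold(p, Q):
    """J(p, Q) = min over i < j of D_+((PQ)_i, (PQ)_j).

    Assumption: (PQ)_i is the i-th column of diag(p) Q, i.e. entries p_l * Q[l][i].
    """
    k = len(p)
    profiles = [[p[l] * Q[l][i] for l in range(k)] for i in range(k)]
    return min(ch_divergence(profiles[i], profiles[j])
               for i in range(k) for j in range(i + 1, k))
```

For two symmetric communities, `j_threshold([0.5, 0.5], [[a, b], [b, a]])` should agree with $\frac{a+b}{2} - \sqrt{ab}$, since the maximum over $t$ is attained at $t = 1/2$; recovery is then solvable when the returned value is at least 1.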
Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if
$J(p, Q) := \min_{i<j} D_+((PQ)_i, (PQ)_j) \ge 1$,
where $D_+(\mu, \nu) := \max_{t \in [0,1]} \sum_{\ell \in [k]} \left( t\mu_\ell + (1-t)\nu_\ell - \mu_\ell^t \nu_\ell^{1-t} \right)$.

Is recovery in the general SBM solvable efficiently down to the information-theoretic threshold? YES!

Theorem 2. The degree-profiling algorithm achieves the threshold and runs in quasi-linear time.
[Abbe-Sandon ’15]

When can we extract a specific community?

Theorem. If community $i$ has a profile $(PQ)_i$ at $D_+$-distance at least 1 from all other profiles $(PQ)_j$, $j \ne i$, then it can be extracted w.h.p.
[Figure: the profile $(PQ)_i$ at distance at least 1 from the other profiles $(PQ)_j$]

What if we do not know the parameters? We can learn them on the fly: [Abbe-Sandon ’15] (second paper)

Proof techniques and algorithms

A key step

[Figure: a node $v$ with degree profile $d_v$, its numbers of neighbors in communities 1, 2, 3]
- Hypothesis 1: $d_v \sim \mathcal{P}(\log(n)(PQ)_1)$
- Hypothesis 2: $d_v \sim \mathcal{P}(\log(n)(PQ)_2)$
($\mathcal{P}$ denotes a Poisson distribution.)

Theorem. For any $\theta_1, \theta_2 \in (\mathbb{R}_+ \setminus \{0\})^k$ with $\theta_1 \ne \theta_2$ and $p_1, p_2 \in \mathbb{R}_+ \setminus \{0\}$,
$\sum_{x \in \mathbb{Z}_+^k} \min\left(\mathcal{P}_{\ln(n)\theta_1}(x)\,p_1, \mathcal{P}_{\ln(n)\theta_2}(x)\,p_2\right) = \Theta\left(n^{-D_+(\theta_1, \theta_2) - o(1)}\right)$,
where $D_+$ is the CH-divergence.
[Abbe-Sandon ’15]

How to use this?

Plan: put effort into recovering most of the nodes, and then finish greedily with local improvements.
[Abbe-Sandon ’15]

The degree-profiling algorithm

(1) Split $G$ into two graphs: $G'$ (loglog-degree) and $G''$ (log-degree).
(2) Run Sphere-comparison on $G'$ -> gets a fraction $1 - o(1)$ of the nodes (see next).
(3) Take now $G''$ with the clustering of $G'$, and test for each node $v$:
- Hypothesis 1: $d_v \sim \mathcal{P}(\log(n)(PQ)_1)$
- Hypothesis 2: $d_v \sim \mathcal{P}(\log(n)(PQ)_2)$
- ...
$P_e = n^{-D_+((PQ)_1, (PQ)_2) + o(1)}$ (capacity-achieving)
[Abbe-Sandon ’15]

How do we get most nodes correctly?

Other recovery requirements

Partial recovery: An algorithm solves partial recovery in the SBM with accuracy $c$ if it produces a clustering which is correct on a fraction $c$ of the nodes with high probability.
- Weak recovery or detection: $c = 1/k + \epsilon$ for some $\epsilon > 0$ (for the symmetric k-SBM)
- Almost exact recovery: $c = 1 - o(1)$
- Exact recovery: $c = 1$

For all the above: what are the “efficient” vs.
“information-theoretic” fundamental limits?

Partial recovery in SBM(n, p, Q/n)

What is a good notion of SNR?

[Figure: a node's numbers of neighbors in the two communities, distributed as Bin(n/2, a/n) and Bin(n/2, b/n)]

Proposed notion of SNR: $\mathrm{SNR} = |\lambda_{\min}|^2 / \lambda_{\max}$, with $\lambda_{\min}$, $\lambda_{\max}$ eigenvalues of $PQ$
- $= \frac{(a-b)^2}{2(a+b)}$ for 2 symmetric communities
- $= \frac{(a-b)^2}{k(a+(k-1)b)}$ for $k$ symmetric communities

Theorem (informal). In the sparse SBM(n, p, Q/n), the Sphere-comparison algorithm recovers a fraction of nodes which approaches 1 when the SNR diverges.

Note that the SNR scales if $Q$ scales!
[Abbe-Sandon ’15]

Sphere-comparison

A node neighborhood in SBM(n, p, Q/n): for a node $v$ in community $i$, the expected numbers of neighbors in communities $1, \dots, k$ are $p_1 Q_{i1}, \dots, p_k Q_{ik}$, i.e. $(PQ)_i$; at depth $r$, the sphere $N_r(v)$ has expected profile $((PQ)^r)_i$.
[Abbe-Sandon ’15]

Take now two nodes, $v$ in community $i$ and $v'$ in community $j$. Compare $v$ and $v'$ from $|N_r(v) \cap N_{r'}(v')|$: hard to analyze...

Decorrelate: subsample $G$ with prob. $c$ to get $E$. Compare $v$ and $v'$ from
$N_{r,r'[E]}(v \cdot v')$ = number of crossing edges (edges of $E$ between $N_{r[G \setminus E]}(v)$ and $N_{r'[G \setminus E]}(v')$)
$\approx N_{r[G \setminus E]}(v) \cdot \frac{cQ}{n} N_{r'[G \setminus E]}(v')$
$\approx ((1-c)PQ)^r e_{\sigma_v} \cdot \frac{cQ}{n} ((1-c)PQ)^{r'} e_{\sigma_{v'}}$
$= c(1-c)^{r+r'} \, e_{\sigma_v} \cdot Q(PQ)^{r+r'} e_{\sigma_{v'}} / n$

Additional steps:
1. look at several depths -> Vandermonde system
2. use “anchor nodes”
[Abbe-Sandon ’15]

A real data example

The political blogs network [Adamic and Glance ’05]
- edge = hyperlink between blogs
- 1222 blogs (left- and right-leaning)
- The CH-divergence is close to 1
- We can recover 95% of the nodes correctly

Some open problems in community detection

I. The SBM:

a. Recovery
- Growing number of communities? Sub-linear communities?
- [Abbe-Sandon ’15] should extend to $k = o(\log(n))$
- [Chen-Xu ’14]: $k, p, q$ scale with $n$ polynomially

b. Detection and broadcasting on trees

[Mossel-Neeman-Sly ’13] Converse for detection in 2-SBM, $p = a/n$, $q = b/n$:
- Galton-Watson tree with Poisson((a+b)/2) offspring
- If $(a+b)/2 \le 1$, the tree dies w.p. 1
- If $(a+b)/2 > 1$, the tree survives w.p. $> 0$

Unorthodox broadcasting problem: when can we detect the root-bit $X_r$?
[Figure: a tree with root bit $X_r$, branching $c = \frac{a+b}{2}$, each edge a BSC($\epsilon$) with $\epsilon = b/(a+b)$]
If and only if $c > \frac{1}{(1-2\epsilon)^2}$ [Evans-Kenyon-Peres-Schulman ’00] $\iff \frac{(a-b)^2}{2(a+b)} > 1$.

For $k$ clusters? $\mathrm{SNR} = \frac{(a-b)^2}{k(a+(k-1)b)}$
[Figure: the SNR axis: detection impossible below $c_k$, possible between $c_k$ and 1, efficient above 1]

Conjecture. For the symmetric k-SBM(n, a, b), there exists $c_k$ s.t.
(1) If SNR $< c_k$, then detection cannot be solved,
(2) If $c_k <$ SNR $< 1$, then detection can be solved information-theoretically but not efficiently,
(3) If SNR $> 1$, then detection can be solved efficiently.
Moreover $c_k = 1$ for $k \in \{2, 3, 4\}$ and $c_k < 1$ for $k \ge 5$.
[Decelle-Krzakala-Zdeborova-Moore ’11]

c. Partial recovery and the SNR-distortion curve

Conjecture. For the symmetric k-SBM(n, a, b) and $\alpha \in (1/k, 1)$, there exist $\gamma_k$, $\delta_k$ s.t. partial recovery of accuracy $\alpha$ is solvable if and only if SNR $> \gamma_k$, and efficiently solvable iff SNR $> \delta_k$.
[Figure: SNR-distortion curves, accuracy $\alpha$ versus SNR, for $k = 2$ and $k \ge 5$]

II. Other block models
- Censored block models
- Labelled block models

[Figure: 2-CBM(n, p, ε), two communities of size n/2] [Abbe-Bandeira-Bracher-Singer ’14]

Related work:
- [Abbe-Bandeira-Bracher-Singer, Xu-Lelarge-Massoulie, Saad-Krzakala-Lelarge-Zdeborova, Chin-Rao-Vu, Hajek-Wu-Xu]
- “correlation clustering” [Bansal, Blum, Chawla ’04]
- “LDGM codes” [Kumar, Pakzad, Salavati, Shokrollahi ’12]
- “labelled block model” [Heimlicher, Lelarge, Massoulié ’12]
- “soft CSPs” [Abbe, Montanari ’13]
- “pairwise measurements” [Chen, Suh, Goldsmith ’14 and ’15]
- “bounded-size correlation clustering” [Puleo, Milenkovic ’14]
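Returning to the detection thresholds of part I.b, the equivalence between the broadcasting condition and the SNR condition can be checked numerically. This is a small sketch with hypothetical function names; it only evaluates the formulas quoted on the slides and makes no claim about any particular algorithm.

```python
def snr_symmetric(a, b, k=2):
    """SNR = (a - b)^2 / (k * (a + (k - 1) * b)) for the symmetric k-SBM."""
    return (a - b) ** 2 / (k * (a + (k - 1) * b))


def broadcasting_statistic(a, b):
    """c * (1 - 2*eps)^2 with branching c = (a + b)/2 and flip prob. eps = b/(a + b).

    The root bit is detectable iff this exceeds 1, i.e. iff c > 1/(1 - 2*eps)^2.
    """
    c = (a + b) / 2
    eps = b / (a + b)
    return c * (1 - 2 * eps) ** 2


def detection_solvable_2sbm(a, b):
    """Detection condition for the 2-SBM with p = a/n, q = b/n: (a - b)^2 > 2(a + b)."""
    return (a - b) ** 2 > 2 * (a + b)
```

With $a = 5$, $b = 1$ both `snr_symmetric(5, 1)` and `broadcasting_statistic(5, 1)` evaluate to $4/3 > 1$, so detection is solvable; with $a = 3$, $b = 1$ the SNR is $1/2$ and it is not.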
Other block models, full list:
- Censored block models
- Labelled block models
- Degree-corrected block models [Karrer-Newman ’11]
- Mixed-membership block models [Airoldi-Blei-Fienberg-Xing ’08]
- Overlapping block models [Abbe-Sandon ’15]: OSBM(n, p, f) with $p$ on $\{0,1\}^s$ and $f : \{0,1\}^s \times \{0,1\}^s \to [0,1]$; draw $X_u \sim p$, $X_v \sim p$ and connect $(u, v)$ with prob. $f(X_u, X_v)$
- Planted community [Deshpande-Montanari ’14, Montanari ’15]
- Hypergraph models

For all the above: Is there a CH-divergence behind? A generalized notion of SNR? Detection gaps? Efficient algorithms?

III. Beyond block models:

a. Exchangeable random arrays and graphons
$w : [0,1]^2 \to [0,1]$, $P(E_{ij} = 1 \mid X_i = x_i, X_j = x_j) = w(x_i, x_j)$
[Lovasz] [Choi-Wolfe, Airoldi-Costa-Chan]
SBMs can approximate graphons.

b. Graphical channels
A general information-theoretic model?

Excerpt shown on the slide (citation numbers refer to the excerpted paper's own bibliography):

Figure 1: [Left] Given a graphon $w : [0,1]^2 \to [0,1]$, we draw i.i.d. samples $u_i$, $u_j$ from Uniform[0,1] and assign $G_t[i,j] = 1$ with probability $w(u_i, u_j)$, for $t = 1, \dots, 2T$. [Middle] Heat map of a graphon $w$. [Right] A random graph generated by the graphon shown in the middle. Rows and columns of the graph are ordered by increasing $u_i$, instead of $i$, for better visualization.

... Wolfe [9] studied the consistency properties, but did not provide algorithms to estimate the graphon. To the best of our knowledge, the only method that estimates graphons consistently, besides ours, is USVT [8]. However, our algorithm has better complexity and outperforms USVT in our simulations. More recently, other groups have begun exploring approaches related to ours [28, 24].

The proposed approximation procedure requires $w$ to be piecewise Lipschitz. The basic idea is to approximate $w$ by a two-dimensional step function $\hat{w}$ with diminishing intervals as $n$ increases. The proposed method is called the stochastic blockmodel approximation (SBA) algorithm, as the idea of using a two-dimensional step function for approximation is equivalent to using the stochastic block models [10, 22, 13, 7, 25]. The SBA algorithm is defined up to permutations of the nodes, so the estimated graphon is not canonical. However, this does not affect the consistency properties of the SBA algorithm, as the consistency is measured w.r.t. the graphon that generates the graphs.

Graphical channels [Abbe-Montanari ’13]

A family of channels motivated by inference on graph problems:
• $G = (V, E)$ a $k$-hypergraph
• $Q : \mathcal{X}^k \to \mathcal{Y}$ a channel (kernel)

For $x \in \mathcal{X}^V$, $y \in \mathcal{Y}^E$: $P(y \mid x) = \prod_{e \in E(G)} Q(y_e \mid x[e])$

[Figure: inputs $X_1, \dots, X_n$ on the nodes $V$; each edge of $E$ goes through the kernel $Q$ to give outputs $Y_1, \dots, Y_N$]

How much information do graphical channels carry?

Theorem. $\lim_{n \to \infty} \frac{1}{n} I(X^n; G)$ exists for ER graphs and some kernels.
-> why not always? what is the limit?

Connection: sparse PCA and clustering

SBM and low-rank Gaussian model

Spiked Wigner model: $Y = \sqrt{\lambda/n}\, XX^T + Z$, with $Z$ Gaussian symmetric; entrywise $Y_{ij} = c X_i X_j + Z_{ij}$.
• $X = (X_1, \dots, X_n)$ i.i.d. Bernoulli($\epsilon$) -> sparse-PCA [Amini-Wainwright ’09, Deshpande-Montanari ’14]
• $X = (X_1, \dots, X_n)$ i.i.d. Rademacher(1/2) -> SBM??? [Deshpande-Abbe-Montanari ’15]

Theorem. If $\lambda_n = \frac{n(p_n - q_n)^2}{2(p_n + q_n)} \to \lambda$ (finite SNR), with $np_n, nq_n \to \infty$ (large degrees), then
$\lim_{n \to \infty} \frac{1}{n} I(X; G(n, p_n, q_n)) = \lim_{n \to \infty} \frac{1}{n} I(X; Y) = \frac{\lambda}{4} + \frac{\gamma_*^2}{4\lambda} - \frac{\gamma_*}{2} + I(\gamma_*)$,
where $\gamma_*$ solves $\gamma = \lambda(1 - \mathrm{MMSE}(\gamma))$, and $I(\gamma)$, $\mathrm{MMSE}(\gamma)$ refer to the single-letter channel $Y_1(\gamma) = \sqrt{\gamma} X_1 + Z_1$.
I-MMSE [Guo-Shamai-Verdú]

Conclusion

Community detection couples naturally with the channel view of information theory, and more specifically with unorthodox versions of:
- graph-based codes
- f-divergences
- broadcasting problems
- I-MMSE
- ...

More generally, the problem of inferring global similarity classes in data sets from noisy local interactions is at the center of many problems in ML, and an information-theoretic view of these problems seems needed and powerful.

Questions?

Documents related to the tutorial: