
How to exploit network properties to improve
learning in relational domains
Jennifer Neville
Departments of Computer Science and Statistics
Purdue University
(joint work with Brian Gallagher, Timothy La Fond, Sebastian Moreno,
Joseph Pfeiffer and Rongjing Xiang)
Relational network classification examples
Email networks: predict organizational roles from communication patterns
Gene/protein networks: predict protein function from interaction patterns
Scientific networks: predict paper topics from properties of cited papers
World wide web: predict content changes from properties of hyperlinked pages
Social networks: predict personal preferences from characteristics of friends
Organizational networks: predict group effectiveness from communication patterns
Network data is:
heterogeneous and interdependent,
partially observed/labeled,
dynamic and/or non-stationary,
and often drawn from a single network
...thus many traditional ML methods
developed for i.i.d. data do not apply
Machine learning 101
1. Data representation (choose)
2. Knowledge representation (choose)
E.g., a linear equation; the generic form is:
    y = β₁x₁ + β₂x₂ + … + β₀
An example for the Alzheimer's data would be:
    CDR = 0.12·MM1 + 0.34·SBScore + … + β₀
Machine learning 101
The knowledge representation defines the model space.
3. Objective function (choose)
E.g., squared loss:
    L_sq(D) = (1/N) Σ_{i=1}^{|D|} (f(x_i) − y_i)²
Machine learning 101
4. Search algorithm (e.g., optimization)
Combine: learning identifies the model with the maximum objective-function value on the training data; the model is then applied for prediction on new data from the same distribution.
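To make the four components concrete, here is a minimal sketch (ours, illustrative) with a linear knowledge representation, a squared-loss objective, and gradient-descent search:

    import numpy as np

    # 1. data representation: feature matrix X and targets y
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 0.12 * X[:, 0] + 0.34 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=100)

    # 2. knowledge representation: linear model y = b1*x1 + b2*x2 + b0
    def f(X, beta):
        return X @ beta[:-1] + beta[-1]

    # 3. objective function: mean squared loss (1/N) sum (f(x_i) - y_i)^2
    def loss(beta):
        return np.mean((f(X, beta) - y) ** 2)

    # 4. search algorithm: gradient descent on the loss
    beta = np.zeros(3)
    for _ in range(2000):
        resid = f(X, beta) - y
        grad = np.append(X.T @ resid, resid.sum()) * (2 / len(y))
        beta -= 0.1 * grad
    print(beta)  # recovers approximately [0.12, 0.34, 0.5]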
Relational learning ←→ Machine learning 101
1. Data representation → relational data
   (email networks, social networks, scientific networks, gene/protein networks, world wide web, organizational networks)
2. Knowledge representation → relational models
3. Objective function
4. Search algorithm
[Figure: example relational model with Broker (Bk) and Branch (Bn) entities and attributes such as Firm, Disclosure, Size, Year, Region, Type, Area, Layoffs, Problem In Past, Has Business, Is Problem, and On Watchlist]
There has been a great deal of work on templated graphical model representations for relational data: PRMs, RBNs, RMNs, MLNs, GMNs, RDNs, IHRMs, DAPER, …
(Side note: since these graphical models are templated, we need to distinguish the data network from the rolled-out model network.)
Data network
1. Data representation
[Figure: a data network whose nodes carry attribute values for Gender (M/F), Married (Y/N), Politics (D/C), and Religion (C/¬C)]
Relational learning task:
E.g., predict political views based on a user's intrinsic attributes and the political views of friends.
Attributed network: estimate the joint distribution P(Y | {X}^n, G),
or the conditional distribution P(Y_i | X_i, X_R, Y_R).
(Note: we often have only a single network for learning.)
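As a concrete (hypothetical) baseline for the conditional formulation P(Y_i | X_i, X_R, Y_R), one can fit an ordinary classifier on each node's own attributes plus an aggregate of its neighbors' labels. A minimal sketch, assuming a NumPy adjacency matrix and scikit-learn:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # toy attributed network: n nodes, d attributes, binary labels
    rng = np.random.default_rng(1)
    n, d = 200, 3
    A = (rng.random((n, n)) < 0.05).astype(float)
    A = np.triu(A, 1); A = A + A.T                   # symmetric, no self-loops
    X = rng.normal(size=(n, d))                      # intrinsic attributes X_i
    y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

    # relational feature: fraction of neighbors with y = 1 (an aggregate of Y_R)
    deg = A.sum(axis=1).clip(min=1)
    nbr_frac = (A @ y) / deg

    feats = np.column_stack([X, nbr_frac])           # [X_i, agg(Y_R)]
    clf = LogisticRegression().fit(feats, y)         # approximates P(Y_i | X_i, Y_R)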
Define structure of graphical model
Relational templates, e.g.:
    (Politics_i, Politics_j), (Politics_i, Gender_i), (Politics_i, Married_i), (Politics_i, Religion_i)
Generic relational templates: ψ1(Y_i, Y_j), ψ2(Y_i, X_i), ψ3(Y_i, X_i)
Model template: ψ1(Y_i, Y_j) + ψ2(Y_i, X_i) + ψ3(Y_i, X_i)
Applied to the data network, the model template defines the rolled-out graphical model.
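A rolled-out model instantiates each template once per matching substructure in the data network. A minimal sketch of this rollout (illustrative names, not the talk's code):

    # enumerate template instantiations (cliques) over a toy data network
    edges = [(0, 1), (1, 2), (2, 3)]
    nodes = [0, 1, 2, 3]
    cliques = [("psi1", (i, j)) for i, j in edges]                 # psi1(Y_i, Y_j) per edge
    cliques += [(f"psi{k}", (i,)) for i in nodes for k in (2, 3)]  # psi_k(Y_i, X_i) per node
    # the rolled-out graphical model has one potential per entry in cliques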
2. Knowledge representation
[Figure: applying the model template to the data network rolls out a model network (graphical model), with a label Y_i and attributes X_i^1, X_i^2, X_i^3 for each node, and ψ1 potentials along the edges]
3. Objective function
4. Search algorithm: e.g., convex optimization
[Figure: the same rolled-out model network, over which the objective function is defined]
Learn model parameters from a fully labeled network:
    P(y_G | x_G) = (1 / Z(Θ, x_G)) ∏_{T ∈ T} ∏_{C ∈ C(T(G))} ψ_T(x_C, y_C; θ_T)
Apply the model to make predictions in another network drawn from the same distribution.
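For intuition, the unnormalized log-score of a labeling is a sum of tied (template-shared) log-potentials over all of their instantiations. A minimal sketch with made-up potentials and parameters:

    import numpy as np

    edges = [(0, 1), (1, 2), (2, 3)]
    x = np.array([0.5, -1.0, 0.3, 2.0])        # one attribute per node
    y = np.array([1, 1, -1, -1])               # labels in {-1, +1}
    theta_edge, theta_node = 0.8, 0.4          # tied template parameters

    def log_score(y):
        s = sum(theta_edge * y[i] * y[j] for i, j in edges)     # edge template
        s += sum(theta_node * yi * xi for yi, xi in zip(y, x))  # node template
        return s                               # equals log P(y|x) + log Z(Θ, x_G)

    print(log_score(y))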
[Figure: the model template applied to a test network rolls out a new graphical model over that network's nodes and edges]
Test network
Collective classification uses the full joint, rolled-out model for inference… but labeled nodes impact the final model structure.
[Figure: rolled-out test network in which labeled nodes (observed Y's) alter the effective structure used for inference]
The structure of "rolled-out" relational graphical models is determined by the structure of the underlying data network, including the location and availability of labels…
…this can impact the performance of learning and inference methods, via the representation, objective function, and search algorithm.
Networks are much, much larger in practice…
Finding 1: Representation
The implicit assumption is that nodes of the same type should be identically distributed—but many relational representations cannot ensure this holds for varying graph structures.
I.I.D. assumption revisited
• Current relational models do not impose the same marginal invariance condition that is assumed for IID models, which can impair generalization
[Figure: nodes A–F with varying local graph structure]
    p(y_A | x_A) ≠ p(y_E | x_E) due to varying graph structure
• The Markov relational network representation does not allow us to explicitly specify the form of the marginal probability distributions, so it is difficult to impose any equality constraints on the marginals
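A brute-force illustration of this effect (ours, not from the talk): in a pairwise Markov network with tied parameters and a small external field, a node's marginal depends on its degree, so nodes with different local structure are not identically distributed.

    import itertools
    import numpy as np

    def node_marginal(edges, n, i, theta=0.5, alpha=0.3):
        # p(y) ∝ exp(theta·Σ y_u·y_v + alpha·Σ y_u), enumerated exactly
        probs = np.zeros(2)
        for y in itertools.product([-1, 1], repeat=n):
            s = theta * sum(y[u] * y[v] for u, v in edges) + alpha * sum(y)
            probs[(y[i] + 1) // 2] += np.exp(s)
        return probs / probs.sum()

    star = [(0, 1), (0, 2), (0, 3), (0, 4)]
    print(node_marginal(star, 5, 0))   # hub (degree 4)
    print(node_marginal(star, 5, 1))   # leaf (degree 1): a different marginal

Both nodes share the same tied parameters; their marginals differ purely because of degree, which is exactly the invariance violation described above.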
Is there an alternative approach?
• Goal: Combine the marginal invariance advantages of IID models with the
ability to model relational dependence
• Incorporate node attributes in a general way (similar to IID classifiers)
• Idea: Apply copulas to combine marginal models with dependence structure
    t1, t2, t3, …, tn      (t jointly ~ F)
        z_i = F_i^(−1)(Φ_i(t_i))
    z1, z2, z3, …, zn      (z_i marginally ~ F_i)
Copula theory: we can construct an n-dimensional vector with arbitrary marginals while preserving the desired dependence structure.
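A minimal numeric sketch of this transform (ours), assuming a Gaussian dependence structure (so Φ_i is the standard normal CDF) and logistic target marginals F_i:

    import numpy as np
    from scipy.stats import norm, logistic

    rng = np.random.default_rng(0)
    cov = np.array([[1.0, 0.8], [0.8, 1.0]])   # desired dependence
    t = rng.multivariate_normal(np.zeros(2), cov, size=10000)

    u = norm.cdf(t)              # Phi_i(t_i): uniform [0,1] marginals
    z = logistic.ppf(u)          # F_i^(-1)(u_i): logistic marginals

    # z keeps the dependence (high correlation) while its marginals are logistic
    print(np.corrcoef(z.T)[0, 1])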
Let's start with a reformulation of IID classifiers...
• General form of probabilistic binary classification: p(y_i = 1) = F(η(x_i))
  e.g., logistic regression
• Now view F as the CDF of a distribution symmetric around 0 to obtain a latent variable formulation:
    z_i ~ p(z_i = z | x_i = x) = f(z − η(x_i))
    y_i = sign(z_i)
• z is a continuous variable, capturing random effects that are not present in x
• f is the corresponding PDF of F
• In IID models, the random effect for each instance is independent, and thus can be integrated out
• When links among instances are observed, the correlations among their class labels can be modeled through dependence among the z's
• Key question: How to model the dependence among the z's while preserving the marginals?
[Figure: latent variables Z_i and Z_j, with the form of their dependence to be determined]
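A quick sanity check of this reformulation (ours, illustrative): sampling z = η(x) + logistic noise and thresholding at zero reproduces the logistic-regression probability F(η(x)):

    import numpy as np
    from scipy.stats import logistic

    eta = 0.7                                             # eta(x) for one instance
    z = eta + logistic.rvs(size=200000, random_state=0)   # z ~ f(z - eta)
    print((z > 0).mean())                                 # empirical p(y = 1)
    print(logistic.cdf(eta))                              # F(eta): matches closely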
Copula Latent Markov Network (CLMN)
[Figure: IID classifiers (independent latent z_i) vs. CLMN (latent z_i coupled through a Markov network)]
The CLMN model
• Sample t from the desired joint dependency: (t1, t2, …, tn) ~ F
• Apply the marginal transformation to obtain the latent variable z: z_i = F_i^(−1)(Φ_i(t_i))
  – Φ_i transforms t_i to a uniform [0,1] r.v. u_i
  – the quasi-inverse of the CDF F_i is used to obtain z_i from u_i; attributes moderate the corresponding pdf f_i
• Classification: y_i = sign(z_i)
Copula Latent Markov Network (Xiang and N., WSDM'13)
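Putting the pieces together, a minimal end-to-end sketch of CLMN-style generation (a toy stand-in, not the paper's implementation): a Gaussian Markov network over the graph supplies the dependence, logistic marginals moderated by attributes supply F_i, and labels are the signs of the transformed latents.

    import numpy as np
    from scipy.stats import norm, logistic

    rng = np.random.default_rng(0)
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    n = 4
    X = rng.normal(size=n)                  # one attribute per node
    w, b = 1.5, 0.2                         # marginal (logistic) model, fit as if IID

    # dependence model: GMRF precision built from the graph (no free parameters)
    P = np.eye(n) * 0.1
    for i, j in edges:
        P[i, i] += 1; P[j, j] += 1
        P[i, j] -= 0.9; P[j, i] -= 0.9
    cov = np.linalg.inv(P)
    s = np.sqrt(np.diag(cov))

    t = rng.multivariate_normal(np.zeros(n), cov)  # step 1: t ~ GMN
    u = norm.cdf(t / s)                            # step 2a: Phi_i -> uniform u_i
    z = logistic.ppf(u, loc=w * X + b)             # step 2b: F_i^(-1), moderated by X
    y = np.sign(z)                                 # step 3: y_i = sign(z_i)
    print(y)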
CLMN implementation
Estimation:
• First, learn the marginal model (logistic regression) as if the instances were IID
• Next, learn the dependence model (Gaussian Markov network) conditioned on the marginal model... but the GMN has no parameters to learn
Inference:
• Conditional inference in copulas had not previously been considered for large-scale networks
• For efficient inference, we developed a message-passing algorithm based on expectation propagation (EP)
Experimental Results
[Figure: classification results for CLMN vs. SocDim, RMN, GMN, and LR on Facebook, IMDB, and Gene networks]
Key idea: ensuring that nodes with varying graph structure have identical marginals improves learning.
Finding 2: Search
Graph+attribute space is too large to sample thoroughly, but efficient generative graph models can be exploited to search more effectively.
How to efficiently generate attributed graph samples
from the underlying joint distribution P (X, Y, G) ?
The space is O(2^(V² + V·p)), so effective sampling from the joint is difficult.
Naive sampling approach: assume independence between graph and attributes:
    P(X, E | Θ_E, Θ_X) = P_E(E | Θ_E) P(X | Θ_X)
    (graph model × attribute model)
Problem with naive approach
[Figure: frequencies of attribute value combinations (0-0, 1-1, 0-1) on edges, original vs. sampled]
Although the attribute correlation can be captured by pairing the graph structure with attributes, naively generated samples do not capture the relational correlation.
Solution: use the graph model to propose edges, but sample conditionally on node attribute values:
    P(X, E | Θ_E, Θ_X) = P_E(E | X, Θ_E, Θ_X) P(X | Θ_E, Θ_X)
    (graph model conditioned on attributes × attribute model)
(An accept-reject process is used to sample edges conditioned on node attributes.)
Exploit an efficient generative graph model as the proposal distribution to search effectively
• What to use as acceptance probabilities? The ratio of the observed probabilities in the original data to the sampled probabilities resulting from the naive approach
[Figure: original vs. naively sampled frequencies over attribute value combinations (0-0, 1-1, 0-1)]
• This corresponds to rejection sampling
• Proposing distribution: P_E(E_ij = 1 | Θ_E)
• True distribution: P_o(E_ij = 1 | f(x_i, x_j), Θ_E, Θ_X)
Attributed graph models (Pfeiffer, La Fond, Moreno, N. & Gallagher WWW'14)
1. Learn the attribute model and the graph model
2. Generate a graph with the naive approach
3. Compute acceptance ratios
4. Sample attributes
5. Sample edges by rejection:
    while not enough edges:
        draw (vi, vj) from Q' (the graph model)
        U ~ Uniform(0, 1)
        if U < A(xi, xj):
            put (vi, vj) into the edge set
    return edges
[Figure: possible edges grouped by attribute value combination (0-0, 1-1, 0-1)]
Theorem 1: AGM samples from the joint distribution of edges and attributes:
    P(E_ij = 1 | f(x_i, x_j), Θ_E, Θ_X) P(x_i, x_j | Θ_X)
Corollary 2: The expected AGM degree equals the expected degree of the structural graph model.
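A compact runnable sketch of the whole AGM loop under toy assumptions (binary node attributes, a uniform edge proposal standing in for the graph model; all names hypothetical):

    import random
    from collections import Counter

    random.seed(0)
    n = 100
    x = [random.random() < 0.5 for _ in range(n)]         # sampled attributes

    def propose_edge():                                   # naive graph model Q'
        return random.randrange(n), random.randrange(n)

    def combo(xi, xj):                                    # attribute value combination
        return tuple(sorted((int(xi), int(xj))))

    # acceptance ratios: observed combo frequency / naive combo frequency
    observed = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.10} # from original data (toy)
    naive = Counter(combo(x[i], x[j])
                    for i, j in (propose_edge() for _ in range(20000)))
    total = sum(naive.values())
    A = {c: observed[c] / (naive[c] / total) for c in observed}
    scale = max(A.values())                               # normalize so A <= 1

    edges = set()
    while len(edges) < 200:                               # rejection-sample edges
        i, j = propose_edge()
        if i != j and random.random() < A[combo(x[i], x[j])] / scale:
            edges.add((min(i, j), max(i, j)))
    print(Counter(combo(x[i], x[j]) for i, j in edges))   # tracks `observed`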
Empirical results on Facebook data
[Figure: correlation of political views across edges for Facebook vs. generated graphs: FCL, TCL, KPGM (2x2), KPGM (3x3), and their AGM variants. AGM preserves the characteristics of the underlying graph models and captures the attribute correlation; the models without AGM do not.]
Key idea: statistical models of graphs can be exploited to improve sampling from the full joint P_E(E, X | Θ_E, Θ_X).
Relational learning
1. Data representation
2. Knowledge representation: representations affect our ability to enforce invariance assumptions
3. Objective function: conventional objective functions do not behave as expected in partially labeled networks (not in this talk)
4. Search algorithm: simpler (graph) models can be used to statistically "prune" the search space
Conclusion
• Relational models have been shown to significantly improve predictions
through the use of joint modeling and collective inference
• But since the (rolled-out) model structure depends on the structure of the underlying data network…
• …we need to understand how the data graph affects model/algorithm characteristics in order to better exploit relational information for learning/prediction
• A careful consideration of the interactions between data representation, knowledge representation, objective function, and search algorithm will improve our understanding of the mechanisms that impact performance, and this will form the foundation for improved algorithms & methodology
Thanks to:
Alumni
Hoda Eldardiry
Rongjing Xiang
Chris Mayfield
Karthik Nagaraj
Umang Sharan
Sebastian Moreno
Nesreen Ahmed
Hyokun Yun
Suvidha Kancharla
Tao Wang
Timothy La Fond
Joel Pfeiffer
Ellen Lai
Pablo Granda
Hogun Park
Questions?
neville@cs.purdue.edu
www.cs.purdue.edu/~neville