Slide 1
CS224W: Analysis of Networks
http://cs224w.stanford.edu
Slide 2
• Intuition: Map nodes to $d$-dimensional embeddings such that similar nodes in the graph are embedded close together.
[Figure: a function $f$ maps the input graph to 2D node embeddings]
How do we learn the mapping function $f$?
Slide 3
• Goal: Map nodes so that similarity in the embedding space (e.g., dot product) approximates similarity (e.g., proximity) in the network.
[Figure: input network mapped into a $d$-dimensional embedding space]
Slide 5
• Encoder: Maps a node to a low-dimensional vector:
  $\mathrm{ENC}(v) = z_v$  ($v$: node in the input graph; $z_v$: $d$-dimensional embedding)
• Similarity function: Defines how relationships in the input network map to relationships in the embedding space:
  $\mathrm{similarity}(u, v) \approx z_v^\top z_u$  (similarity of $u$ and $v$ in the network $\approx$ dot product between node embeddings)
Slide 6
• So far we have focused on "shallow" encoders, i.e., embedding lookups: each node is assigned its own column of an embedding matrix $Z$, and $\mathrm{ENC}(v)$ simply looks that column up (see the sketch below).
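As a concrete illustration, here is a minimal sketch of a shallow encoder in NumPy: the embedding matrix is the only parameter, and encoding a node is just a column lookup. The names (`Z`, `num_nodes`, `d`) and the random initialization are illustrative assumptions, not the course's code.

```python
# A minimal sketch of a "shallow" encoder: the embedding matrix Z is the only
# parameter, and encoding a node is a column lookup. Names and initialization
# are illustrative.
import numpy as np

num_nodes, d = 5, 8                      # |V| nodes, d-dimensional embeddings
Z = np.random.randn(d, num_nodes) * 0.1  # one learnable column per node

def encode(v):
    """ENC(v) = z_v: look up column v of Z."""
    return Z[:, v]

def similarity(u, v):
    """Decoder: the dot product z_v^T z_u approximates network proximity."""
    return encode(u) @ encode(v)

print(similarity(0, 3))
```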
Slide 8
• Limitations of shallow embedding methods:
  § $O(|V|)$ parameters are needed:
    § No sharing of parameters between nodes
    § Every node has its own unique embedding
  § Inherently "transductive":
    § Cannot generate embeddings for nodes not seen during training
  § Do not incorporate node features:
    § Many graphs have features that we can and should leverage
Slide 9
• Today: We will discuss deep methods based on graph neural networks, where the encoder consists of multiple layers of non-linear transformations of the graph structure.
• Note: All these deep encoders can be combined with the node similarity functions from the previous section.
Slide 10
Output: node embeddings. We can also embed larger network structures: subgraphs and entire graphs.
Slide 11
CNN on an image:
[Figure: a convolutional network applied to an image grid]
Goal: generalize convolutions beyond simple lattices, and leverage node features/attributes (e.g., text, images).
Slide 12
Convolutional neural networks (on grids): a single CNN layer with a 3x3 filter (animation by Vincent Dumoulin).
Transform the information at the neighbors and combine it (see the sketch below):
  § Transform the "messages" $h_i$ from the neighbors: $W_i h_i$
  § Add them up: $\sum_i W_i h_i$
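To make the "transform and sum" reading of a convolution concrete, here is a hedged NumPy sketch that computes a single output pixel of a 3x3 convolution explicitly as a sum of transformed neighbor messages $W_i h_i$. All names are illustrative.

```python
# A single output pixel of a 3x3 convolution, written explicitly as
# "transform each neighbor's message, then sum". Plain NumPy, illustrative names.
import numpy as np

image = np.random.randn(5, 5)   # h: one scalar feature per grid node (pixel)
W = np.random.randn(3, 3)       # one weight per relative neighbor position

def conv_at(i, j):
    """Output at pixel (i, j): sum of transformed messages from its 3x3 neighborhood."""
    out = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += W[di + 1, dj + 1] * image[i + di, j + dj]  # W_i * h_i
    return out

print(conv_at(2, 2))
```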
Slide 13
But what if your graphs look like this?
[Figures: examples of irregular graph-structured data and a GCN with hidden layers and ReLU activations, from Thomas Kipf, "End-to-end learning on graphs with GCNs"]
Slide 14
A naive approach (sketched in code below):
• Join the adjacency matrix and the node features
• Feed them into a deep neural net
Issues with this idea:
  § $O(|V|)$ parameters (a huge number of parameters)
  § Not applicable to graphs of different sizes (no inductive learning possible)
  § Not invariant to node ordering
[Figure: adjacency matrix over nodes A-E concatenated with node features]
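A minimal sketch of this naive idea, assuming NumPy and illustrative names and sizes, makes the problems visible: the weight matrix has a number of rows that grows with $|V|$, so the model cannot be applied to a graph of a different size, and permuting the node order changes the input.

```python
# The naive model: concatenate each node's adjacency row with its features and
# feed the result to a fully connected layer. Illustrative names and sizes.
import numpy as np

V, m, d = 5, 3, 4                         # |V| nodes, m features, d hidden units
A = (np.random.rand(V, V) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric, binary adjacency matrix
X = np.random.randn(V, m)                 # node features

inp = np.concatenate([A, X], axis=1)      # each row: adjacency row + features, length |V| + m
W = np.random.randn(V + m, d)             # parameter count grows with |V|
H = np.maximum(inp @ W, 0.0)              # one ReLU layer; breaks if |V| changes,
                                          # and reordering the nodes permutes the input
print(H.shape)
```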
Slide 15
1. Basics of deep learning for graphs
2. Graph Convolutional Networks
3. Graph Attention Networks (GAT)
4. Practical tips and demos
Slide 17
• Local network neighborhoods:
  § Describe aggregation strategies
  § Define computation graphs
• Stacking multiple layers:
  § Describe the model, parameters, training
  § How to fit the model?
  § Simple example of unsupervised and supervised training
Slide 18
• Assume we have a graph $G$:
  § $V$ is the vertex set
  § $A$ is the adjacency matrix (assume binary)
  § $X \in \mathbb{R}^{m \times |V|}$ is a matrix of node features
    § Biologically meaningful node features: e.g., immunological signatures, gene expression profiles, gene functional information
    § No features: indicator vectors (one-hot encoding of each node)
A minimal version of this setup is sketched below.
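The sketch below builds these inputs for a small illustrative toy graph; the edge list and sizes are assumptions for the example only.

```python
# The assumed inputs: a binary adjacency matrix A and a node feature matrix.
# With no features, we fall back to indicator (one-hot) vectors, i.e., the identity.
import numpy as np

V = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # illustrative toy graph
A = np.zeros((V, V))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0                # binary, undirected

X = np.eye(V)                              # no features: one-hot encoding of each node
print(A)
```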
Slide 19
Problem: For a given subgraph, how do we come up with a canonical node ordering?
[Figure: first page of "Learning Convolutional Neural Networks for Graphs" (Niepert et al., ICML 2016). The PATCHY-SAN approach determines a sequence of nodes for each input graph, extracts and normalizes a fixed-size neighborhood of k nodes around each of them as its receptive field, and feeds the normalized neighborhoods to standard convolutional and dense layers; unlike graph kernels, it learns task-dependent features and scales linearly with the number of graphs.]
[Scarselli et al., IEEE TNN 2005, Niepert et al., ICML 2016]
Trang 20Idea: Node’s neighborhood defines a
computation graph
Determine node computation graph
!
Propagate and transform information
!
Learn how to propagate information across the
graph to compute node features
[Kipf and Welling, ICLR 2017]
Slide 21
• Key idea: Generate node embeddings based on local network neighborhoods.
[Figure: the computation graph of target node A, built from its network neighborhood]
Slide 22
• Intuition: Nodes aggregate information from their neighbors using neural networks.
[Figure: the same computation graph, with a small neural network at each aggregation step]
Slide 23
• Intuition: The network neighborhood defines a computation graph.
Every node defines a computation graph based on its neighborhood!
Slide 24
• The model can be of arbitrary depth:
  § Nodes have embeddings at each layer
  § The layer-0 embedding of node $v$ is its input feature vector, $x_v$
[Figure: computation graph of the target node, unrolled over the layers]
Slide 25
• Neighborhood aggregation: Key distinctions among approaches are in how they aggregate information across the layers.
[Figure: computation graph with unspecified aggregation functions at each node]
Slide 26
• Basic approach: Average information from the neighbors and apply a neural network.
  1) Average messages from neighbors
  2) Apply a neural network
[Figure: computation graph with averaging followed by a neural network at each aggregation step]
Slide 27
• Basic approach: Average neighbor messages and apply a neural network.

$h_v^0 = x_v$  (initial 0-th layer embeddings are equal to the node features)

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, \ldots, K\}$
(the sum is the average of the neighbors' previous-layer embeddings; $\sigma$ is a non-linearity, e.g., ReLU)

$z_v = h_v^K$  (embedding after $K$ layers of neighborhood aggregation)
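The update rule above can be sketched in a few lines of NumPy. This is an illustration, not the course's reference implementation; the names `embed_all`, `W_list`, and `B_list` are assumptions, standing in for the trainable matrices $W_k$ and $B_k$.

```python
# Sketch of K layers of "average neighbors, then apply a neural network".
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def embed_all(A, X, W_list, B_list):
    """Run K layers of neighborhood aggregation; row v of the result is z_v = h_v^K."""
    deg = A.sum(axis=1, keepdims=True)            # |N(v)| for every node
    H = X.copy()                                  # layer 0: h_v^0 = x_v
    for W, B in zip(W_list, B_list):              # one (W_k, B_k) pair per layer
        neigh_avg = (A @ H) / np.maximum(deg, 1)  # average of the neighbors' h^{k-1}
        H = relu(neigh_avg @ W + H @ B)           # h_v^k
    return H

# toy usage: 4 nodes on a cycle, one-hot features, K = 2 layers
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.eye(4)
W_list = [np.random.randn(4, 4) * 0.1 for _ in range(2)]
B_list = [np.random.randn(4, 4) * 0.1 for _ in range(2)]
Z = embed_all(A, X, W_list, B_list)
print(Z.shape)                                    # (4, 4): one embedding per node
```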
Slide 29
We can feed these embeddings into any loss function and run stochastic gradient descent to train the weight parameters:

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, \ldots, K\}, \qquad z_v = h_v^K$

where $W_k$ and $B_k$ are the trainable weight matrices (i.e., what we learn).
Slide 30
• Train in an unsupervised manner:
  § Use only the graph structure
  § "Similar" nodes have similar embeddings
• The unsupervised loss function can be anything from the last section (one option is sketched below), e.g., a loss based on:
  § Random walks (node2vec, DeepWalk, struc2vec)
  § Graph factorization
  § Node proximity in the graph
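As one hedged example of such an unsupervised loss: a binary cross-entropy over node pairs, treating nodes that co-occur on short random walks (or are adjacent) as positives and random pairs as negatives. The function names and the sampling scheme are illustrative choices, not prescribed by the lecture.

```python
# One possible unsupervised loss on the learned embeddings.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(Z, pos_pairs, neg_pairs):
    """Z: |V| x d embedding matrix; pos_pairs/neg_pairs: lists of (u, v) node indices."""
    loss = 0.0
    for u, v in pos_pairs:                               # similar nodes: push z_u^T z_v up
        loss -= np.log(sigmoid(Z[u] @ Z[v]) + 1e-9)
    for u, v in neg_pairs:                               # negative samples: push it down
        loss -= np.log(1.0 - sigmoid(Z[u] @ Z[v]) + 1e-9)
    return loss
```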
Slide 31
Alternatively, directly train the model for a supervised task (e.g., node classification). Example: is a drug safe or toxic, in a drug-drug interaction network?
Slide 32
Directly train the model for a supervised task (e.g., node classification) by putting a logistic classifier with weights $\theta$ on top of the embeddings:

$\sum_{v \in V} y_v \log\big(\sigma(z_v^\top \theta)\big) + (1 - y_v) \log\big(1 - \sigma(z_v^\top \theta)\big)$

where $z_v$ is the encoder output (node embedding), $\theta$ are the classification weights, and $y_v$ is the node class label (e.g., safe or toxic drug). This is the log-likelihood of the labels; training maximizes it (equivalently, minimizes the binary cross-entropy).
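A minimal sketch of this supervised objective, assuming NumPy and illustrative names (`Z` for the stacked embeddings, `y` for binary labels, `theta` for the classifier weights); the sign is flipped so that it is a loss to minimize.

```python
# Sketch of the supervised objective: a logistic classifier on top of the node
# embeddings, scored with binary cross-entropy (the negative of the sum above).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def supervised_loss(Z, y, theta):
    """Z: |V| x d embeddings, y: binary labels (e.g., toxic = 1, safe = 0), theta: d weights."""
    p = sigmoid(Z @ theta)                                             # sigma(z_v^T theta)
    return -np.sum(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
```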
Slide 33
1) Define a neighborhood aggregation function
2) Define a loss function on the embeddings, $\mathcal{L}(z_v)$
Slide 34
3) Train on a set of nodes, i.e., a batch of computation graphs
Slide 35
4) Generate embeddings for nodes as needed, even for nodes we never trained on!
Slide 36
• The same aggregation parameters are shared for all nodes:
  § The number of model parameters is sublinear in $|V|$, and we can generalize to unseen nodes!
[Figure: the compute graphs for node A and node B use the same shared parameters]
Slide 37
Inductive node embedding: generalize to entirely unseen graphs. E.g., train on the protein interaction graph of model organism A and generate embeddings on newly collected data about organism B.
[Figure: train on one graph, generalize to a new graph]
Slide 38
• Many application settings constantly encounter previously unseen nodes:
  § E.g., Reddit, YouTube, Google Scholar
• Need to generate new embeddings "on the fly" (see the sketch below)
[Figure: train with a snapshot of the graph; a new node arrives; generate the embedding $z_u$ for the new node]
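A hedged sketch of how this works mechanically: the trained parameters are shared by all nodes, so when a new node arrives we only extend the adjacency and feature matrices and rerun the same aggregation (for instance, the `embed_all` sketch from earlier); no per-node parameters need to be trained. The helper below and its names are illustrative assumptions.

```python
# Growing the graph with a previously unseen node: extend A and X, then rerun
# the same aggregation with the already-trained weights.
import numpy as np

def add_node(A, X, new_edges, new_features):
    """Return enlarged copies of A and X containing one extra node."""
    V = A.shape[0]
    A2 = np.zeros((V + 1, V + 1))
    A2[:V, :V] = A
    for u in new_edges:                     # connect the new node (index V) to its neighbors
        A2[u, V] = A2[V, u] = 1.0
    X2 = np.vstack([X, new_features])       # new_features: 1 x m feature row for the new node
    return A2, X2

# illustrative usage (no retraining; the trained W_list, B_list are reused):
# A2, X2 = add_node(A, X, new_edges=[0, 2], new_features=np.random.randn(1, X.shape[1]))
# z_new = embed_all(A2, X2, W_list, B_list)[-1]   # embedding of the new node
```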
Slide 39
• Recap: Generate node embeddings by aggregating neighborhood information.
  § We saw a basic variant of this idea
  § Key distinctions are in how different approaches aggregate information across the layers
• Next: The GraphSAGE graph neural network
Slide 40
1. Basics of deep learning for graphs
2. Graph Convolutional Networks
3. Graph Attention Networks (GAT)
4. Practical tips and demos
Slide 42
So far we have aggregated the neighbor messages by taking their (weighted) average.
[Figure: computation graph with averaging at each aggregation step]
Slide 44
• Simple neighborhood aggregation:

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right)$

• Generalized aggregation (GraphSAGE): apply a general, differentiable aggregator AGG to the set of neighbor embeddings and concatenate the result with the node's own previous-layer embedding:

$h_v^k = \sigma\left( \left[ W_k \cdot \mathrm{AGG}\big(\{ h_u^{k-1}, \forall u \in N(v) \}\big),\; B_k h_v^{k-1} \right] \right)$
Slide 45
Aggregator variants (sketched in code below):
• Mean: Take a weighted average of the neighbors
• Pool: Transform the neighbor vectors and apply a symmetric vector function (e.g., element-wise mean or max)
• LSTM: Apply an LSTM to a reshuffled (randomly permuted) sequence of the neighbors
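Hedged sketches of the three aggregator choices, operating on a matrix `H_neigh` with one row per neighbor embedding. The pooling transform `Q`, the choice of max as the symmetric function, and the `lstm_step` callable are illustrative assumptions, not the GraphSAGE reference code.

```python
# Illustrative aggregators for GraphSAGE-style neighborhood aggregation.
import numpy as np

def agg_mean(H_neigh):
    """Mean: (weighted) average of the neighbor embeddings."""
    return H_neigh.mean(axis=0)

def agg_pool(H_neigh, Q):
    """Pool: transform each neighbor (Q h_u, with a ReLU), then apply a
    symmetric element-wise function (here: max)."""
    return np.max(np.maximum(H_neigh @ Q, 0.0), axis=0)

def agg_lstm(H_neigh, lstm_step, h0):
    """LSTM: run a (user-supplied) LSTM cell over a random reshuffling of the neighbors."""
    state = h0
    for row in np.random.permutation(H_neigh):   # random order: the result is not
        state = lstm_step(row, state)            # permutation-invariant by construction
    return state
```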
Slide 46
The aggregator variants, written out:
  § Mean: $\mathrm{AGG} = \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|}$
  § Pool: $\mathrm{AGG} = \gamma\big(\{ \mathbf{Q}\, h_u^{k-1}, \forall u \in N(v) \}\big)$, for a symmetric function $\gamma$ (e.g., element-wise mean or max) and a trainable matrix $\mathbf{Q}$
  § LSTM: $\mathrm{AGG} = \mathrm{LSTM}\big(\big[ h_u^{k-1}, \forall u \in \pi(N(v)) \big]\big)$, for a random permutation $\pi$
Key idea: Generate node embeddings based on local neighborhoods.
  § Nodes aggregate "messages" from their neighbors using neural networks
  § Basic variant: Average neighborhood information and stack neural networks
  § Generalized neighborhood aggregation (GraphSAGE)
Slide 47
Further topics and pointers:
• Relational inductive biases and graph networks (Battaglia et al., 2018)
• Attention-based neighborhood aggregation:
  § Graph attention networks (Hoshen, 2017; Velickovic et al., 2018; Liu et al., 2018)
• Embedding entire graphs:
  § Graph neural nets with edge embeddings (Battaglia et al., 2016; Gilmer et al., 2017)
  § Embedding entire graphs (Duvenaud et al., 2015; Dai et al., 2016; Li et al., 2018) and graph pooling (Ying et al., 2018)
  § Graph generation and relational inference (You et al., 2018; Kipf et al., 2018)
• Spectral approaches to graph neural networks:
  § Spectral graph CNN & ChebNet (Bruna et al., 2015; Defferrard et al., 2016)
  § Geometric deep learning (Bronstein et al., 2017; Monti et al., 2017)
• Hyperbolic geometry and hierarchical embeddings:
  § Poincaré embeddings and hierarchical relations (Nickel et al., 2017; Nickel et al., 2018)
  § Graph representation learning tradeoffs (De Sa et al., 2018)