Slide 1
CS224W: Analysis of Networks
http://cs224w.stanford.edu
Slide 2
• Intuition: Map nodes to $d$-dimensional embeddings such that similar nodes in the graph are embedded close together.
[Figure: a function $f$ maps the input graph to 2D node embeddings]
How do we learn the mapping function $f$?
Slide 3
• Goal: Map nodes so that similarity in the embedding space (e.g., dot product) approximates similarity (e.g., proximity) in the network.
[Figure: input network mapped into a $d$-dimensional embedding space]
Slide 5
• Encoder: Maps a node to a low-dimensional vector:
  $\mathrm{ENC}(v) = z_v$  ($v$: node in the input graph; $z_v$: $d$-dimensional embedding)
• Similarity function: Defines how relationships in the input network map to relationships in the embedding space:
  $\mathrm{similarity}(u, v) \approx z_v^\top z_u$  (similarity of $u$ and $v$ in the network $\approx$ dot product between node embeddings)
Slide 6
• So far we have focused on "shallow" encoders, i.e., embedding lookups: each node is assigned its own column of an embedding matrix $Z$, and $\mathrm{ENC}(v)$ simply looks that column up (see the sketch below).
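As a concrete illustration, here is a minimal sketch of a shallow encoder in NumPy: the embedding matrix is the only parameter, and encoding a node is just a column lookup. The names (`Z`, `num_nodes`, `d`) and the random initialization are illustrative assumptions, not the course's code.

```python
# A minimal sketch of a "shallow" encoder: the embedding matrix Z is the only
# parameter, and encoding a node is a column lookup. Names and initialization
# are illustrative.
import numpy as np

num_nodes, d = 5, 8                      # |V| nodes, d-dimensional embeddings
Z = np.random.randn(d, num_nodes) * 0.1  # one learnable column per node

def encode(v):
    """ENC(v) = z_v: look up column v of Z."""
    return Z[:, v]

def similarity(u, v):
    """Decoder: the dot product z_v^T z_u approximates network proximity."""
    return encode(u) @ encode(v)

print(similarity(0, 3))
```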
Slide 8
• Limitations of shallow embedding methods:
  § $O(|V|)$ parameters are needed:
    § No sharing of parameters between nodes
    § Every node has its own unique embedding
  § Inherently "transductive":
    § Cannot generate embeddings for nodes not seen during training
  § Do not incorporate node features:
    § Many graphs have features that we can and should leverage
Slide 9
• Today: We will discuss deep methods based on graph neural networks, where the encoder consists of multiple layers of non-linear transformations of the graph structure.
• Note: All these deep encoders can be combined with the node similarity functions from the previous section.
Slide 10
Output: node embeddings. We can also embed larger network structures: subgraphs and entire graphs.
Slide 11
CNN on an image:
[Figure: a convolutional network applied to an image grid]
Goal: generalize convolutions beyond simple lattices, and leverage node features/attributes (e.g., text, images).
Slide 12
Convolutional neural networks (on grids): a single CNN layer with a 3x3 filter (animation by Vincent Dumoulin).
Transform the information at the neighbors and combine it (see the sketch below):
  § Transform the "messages" $h_i$ from the neighbors: $W_i h_i$
  § Add them up: $\sum_i W_i h_i$
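To make the "transform and sum" reading of a convolution concrete, here is a hedged NumPy sketch that computes a single output pixel of a 3x3 convolution explicitly as a sum of transformed neighbor messages $W_i h_i$. All names are illustrative.

```python
# A single output pixel of a 3x3 convolution, written explicitly as
# "transform each neighbor's message, then sum". Plain NumPy, illustrative names.
import numpy as np

image = np.random.randn(5, 5)   # h: one scalar feature per grid node (pixel)
W = np.random.randn(3, 3)       # one weight per relative neighbor position

def conv_at(i, j):
    """Output at pixel (i, j): sum of transformed messages from its 3x3 neighborhood."""
    out = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += W[di + 1, dj + 1] * image[i + di, j + dj]  # W_i * h_i
    return out

print(conv_at(2, 2))
```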
Slide 13
But what if your graphs look like this?
[Figures: examples of irregular graph-structured data and a GCN with hidden layers and ReLU activations, from Thomas Kipf, "End-to-end learning on graphs with GCNs"]
Slide 14
A naive approach (sketched in code below):
• Join the adjacency matrix and the node features
• Feed them into a deep neural net
Issues with this idea:
  § $O(|V|)$ parameters (a huge number of parameters)
  § Not applicable to graphs of different sizes (no inductive learning possible)
  § Not invariant to node ordering
[Figure: adjacency matrix over nodes A-E concatenated with node features]
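A minimal sketch of this naive idea, assuming NumPy and illustrative names and sizes, makes the problems visible: the weight matrix has a number of rows that grows with $|V|$, so the model cannot be applied to a graph of a different size, and permuting the node order changes the input.

```python
# The naive model: concatenate each node's adjacency row with its features and
# feed the result to a fully connected layer. Illustrative names and sizes.
import numpy as np

V, m, d = 5, 3, 4                         # |V| nodes, m features, d hidden units
A = (np.random.rand(V, V) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric, binary adjacency matrix
X = np.random.randn(V, m)                 # node features

inp = np.concatenate([A, X], axis=1)      # each row: adjacency row + features, length |V| + m
W = np.random.randn(V + m, d)             # parameter count grows with |V|
H = np.maximum(inp @ W, 0.0)              # one ReLU layer; breaks if |V| changes,
                                          # and reordering the nodes permutes the input
print(H.shape)
```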
Slide 15
1. Basics of deep learning for graphs
2. Graph Convolutional Networks
3. Graph Attention Networks (GAT)
4. Practical tips and demos
Slide 17
• Local network neighborhoods:
  § Describe aggregation strategies
  § Define computation graphs
• Stacking multiple layers:
  § Describe the model, parameters, training
  § How to fit the model?
  § Simple example of unsupervised and supervised training
Slide 18
• Assume we have a graph $G$:
  § $V$ is the vertex set
  § $A$ is the adjacency matrix (assume binary)
  § $X \in \mathbb{R}^{m \times |V|}$ is a matrix of node features
    § Biologically meaningful node features: e.g., immunological signatures, gene expression profiles, gene functional information
    § No features: indicator vectors (one-hot encoding of each node)
A minimal version of this setup is sketched below.
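The sketch below builds these inputs for a small illustrative toy graph; the edge list and sizes are assumptions for the example only.

```python
# The assumed inputs: a binary adjacency matrix A and a node feature matrix.
# With no features, we fall back to indicator (one-hot) vectors, i.e., the identity.
import numpy as np

V = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # illustrative toy graph
A = np.zeros((V, V))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0                # binary, undirected

X = np.eye(V)                              # no features: one-hot encoding of each node
print(A)
```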
Slide 19
Problem: For a given subgraph, how do we come up with a canonical node ordering?
[Figure: first page of "Learning Convolutional Neural Networks for Graphs" (Niepert et al., ICML 2016). The PATCHY-SAN approach determines a sequence of nodes for each input graph, extracts and normalizes a fixed-size neighborhood of k nodes around each of them as its receptive field, and feeds the normalized neighborhoods to standard convolutional and dense layers; unlike graph kernels, it learns task-dependent features and scales linearly with the number of graphs.]
[Scarselli et al., IEEE TNN 2005, Niepert et al., ICML 2016]
Trang 20Idea: Node’s neighborhood defines a
computation graph
Determine node computation graph
!
Propagate and transform information
!
Learn how to propagate information across the
graph to compute node features
[Kipf and Welling, ICLR 2017]
Slide 21
• Key idea: Generate node embeddings based on local network neighborhoods.
[Figure: the computation graph of target node A, built from its network neighborhood]
Slide 22
• Intuition: Nodes aggregate information from their neighbors using neural networks.
[Figure: the same computation graph, with a small neural network at each aggregation step]
Slide 23
• Intuition: The network neighborhood defines a computation graph.
Every node defines a computation graph based on its neighborhood!
Slide 24
• The model can be of arbitrary depth:
  § Nodes have embeddings at each layer
  § The layer-0 embedding of node $v$ is its input feature vector, $x_v$
[Figure: computation graph of the target node, unrolled over the layers]
Slide 25
• Neighborhood aggregation: Key distinctions among approaches are in how they aggregate information across the layers.
[Figure: computation graph with unspecified aggregation functions at each node]
Slide 26
• Basic approach: Average information from the neighbors and apply a neural network.
  1) Average messages from neighbors
  2) Apply a neural network
[Figure: computation graph with averaging followed by a neural network at each aggregation step]
Slide 27
• Basic approach: Average neighbor messages and apply a neural network.

$h_v^0 = x_v$  (initial 0-th layer embeddings are equal to the node features)

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, \ldots, K\}$
(the sum is the average of the neighbors' previous-layer embeddings; $\sigma$ is a non-linearity, e.g., ReLU)

$z_v = h_v^K$  (embedding after $K$ layers of neighborhood aggregation)
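The update rule above can be sketched in a few lines of NumPy. This is an illustration, not the course's reference implementation; the names `embed_all`, `W_list`, and `B_list` are assumptions, standing in for the trainable matrices $W_k$ and $B_k$.

```python
# Sketch of K layers of "average neighbors, then apply a neural network".
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def embed_all(A, X, W_list, B_list):
    """Run K layers of neighborhood aggregation; row v of the result is z_v = h_v^K."""
    deg = A.sum(axis=1, keepdims=True)            # |N(v)| for every node
    H = X.copy()                                  # layer 0: h_v^0 = x_v
    for W, B in zip(W_list, B_list):              # one (W_k, B_k) pair per layer
        neigh_avg = (A @ H) / np.maximum(deg, 1)  # average of the neighbors' h^{k-1}
        H = relu(neigh_avg @ W + H @ B)           # h_v^k
    return H

# toy usage: 4 nodes on a cycle, one-hot features, K = 2 layers
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.eye(4)
W_list = [np.random.randn(4, 4) * 0.1 for _ in range(2)]
B_list = [np.random.randn(4, 4) * 0.1 for _ in range(2)]
Z = embed_all(A, X, W_list, B_list)
print(Z.shape)                                    # (4, 4): one embedding per node
```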
Slide 29
We can feed these embeddings into any loss function and run stochastic gradient descent to train the weight parameters:

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, \ldots, K\}, \qquad z_v = h_v^K$

where $W_k$ and $B_k$ are the trainable weight matrices (i.e., what we learn).
Slide 30
• Train in an unsupervised manner:
  § Use only the graph structure
  § "Similar" nodes have similar embeddings
• The unsupervised loss function can be anything from the last section (one option is sketched below), e.g., a loss based on:
  § Random walks (node2vec, DeepWalk, struc2vec)
  § Graph factorization
  § Node proximity in the graph
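As one hedged example of such an unsupervised loss: a binary cross-entropy over node pairs, treating nodes that co-occur on short random walks (or are adjacent) as positives and random pairs as negatives. The function names and the sampling scheme are illustrative choices, not prescribed by the lecture.

```python
# One possible unsupervised loss on the learned embeddings.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(Z, pos_pairs, neg_pairs):
    """Z: |V| x d embedding matrix; pos_pairs/neg_pairs: lists of (u, v) node indices."""
    loss = 0.0
    for u, v in pos_pairs:                               # similar nodes: push z_u^T z_v up
        loss -= np.log(sigmoid(Z[u] @ Z[v]) + 1e-9)
    for u, v in neg_pairs:                               # negative samples: push it down
        loss -= np.log(1.0 - sigmoid(Z[u] @ Z[v]) + 1e-9)
    return loss
```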
Slide 31
Alternatively, directly train the model for a supervised task (e.g., node classification). Example: is a drug safe or toxic, in a drug-drug interaction network?
Slide 32
Directly train the model for a supervised task (e.g., node classification) by putting a logistic classifier with weights $\theta$ on top of the embeddings:

$\sum_{v \in V} y_v \log\big(\sigma(z_v^\top \theta)\big) + (1 - y_v) \log\big(1 - \sigma(z_v^\top \theta)\big)$

where $z_v$ is the encoder output (node embedding), $\theta$ are the classification weights, and $y_v$ is the node class label (e.g., safe or toxic drug). This is the log-likelihood of the labels; training maximizes it (equivalently, minimizes the binary cross-entropy).
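A minimal sketch of this supervised objective, assuming NumPy and illustrative names (`Z` for the stacked embeddings, `y` for binary labels, `theta` for the classifier weights); the sign is flipped so that it is a loss to minimize.

```python
# Sketch of the supervised objective: a logistic classifier on top of the node
# embeddings, scored with binary cross-entropy (the negative of the sum above).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def supervised_loss(Z, y, theta):
    """Z: |V| x d embeddings, y: binary labels (e.g., toxic = 1, safe = 0), theta: d weights."""
    p = sigmoid(Z @ theta)                                             # sigma(z_v^T theta)
    return -np.sum(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
```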
Slide 33
1) Define a neighborhood aggregation function
2) Define a loss function on the embeddings, $\mathcal{L}(z_v)$
Slide 34
3) Train on a set of nodes, i.e., a batch of computation graphs
Slide 35
4) Generate embeddings for nodes as needed, even for nodes we never trained on!
Slide 36
• The same aggregation parameters are shared for all nodes:
  § The number of model parameters is sublinear in $|V|$, and we can generalize to unseen nodes!
[Figure: the compute graphs for node A and node B use the same shared parameters]
Slide 37
Inductive node embedding: generalize to entirely unseen graphs. E.g., train on the protein interaction graph of model organism A and generate embeddings on newly collected data about organism B.
[Figure: train on one graph, generalize to a new graph]
Slide 38
• Many application settings constantly encounter previously unseen nodes:
  § E.g., Reddit, YouTube, Google Scholar
• Need to generate new embeddings "on the fly" (see the sketch below)
[Figure: train with a snapshot of the graph; a new node arrives; generate the embedding $z_u$ for the new node]
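A hedged sketch of how this works mechanically: the trained parameters are shared by all nodes, so when a new node arrives we only extend the adjacency and feature matrices and rerun the same aggregation (for instance, the `embed_all` sketch from earlier); no per-node parameters need to be trained. The helper below and its names are illustrative assumptions.

```python
# Growing the graph with a previously unseen node: extend A and X, then rerun
# the same aggregation with the already-trained weights.
import numpy as np

def add_node(A, X, new_edges, new_features):
    """Return enlarged copies of A and X containing one extra node."""
    V = A.shape[0]
    A2 = np.zeros((V + 1, V + 1))
    A2[:V, :V] = A
    for u in new_edges:                     # connect the new node (index V) to its neighbors
        A2[u, V] = A2[V, u] = 1.0
    X2 = np.vstack([X, new_features])       # new_features: 1 x m feature row for the new node
    return A2, X2

# illustrative usage (no retraining; the trained W_list, B_list are reused):
# A2, X2 = add_node(A, X, new_edges=[0, 2], new_features=np.random.randn(1, X.shape[1]))
# z_new = embed_all(A2, X2, W_list, B_list)[-1]   # embedding of the new node
```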
Slide 39
• Recap: Generate node embeddings by aggregating neighborhood information.
  § We saw a basic variant of this idea
  § Key distinctions are in how different approaches aggregate information across the layers
• Next: The GraphSAGE graph neural network
Slide 40
1. Basics of deep learning for graphs
2. Graph Convolutional Networks
3. Graph Attention Networks (GAT)
4. Practical tips and demos
Slide 42
So far we have aggregated the neighbor messages by taking their (weighted) average.
[Figure: computation graph with averaging at each aggregation step]
Slide 44
• Simple neighborhood aggregation:

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right)$

• Generalized aggregation (GraphSAGE): apply a general, differentiable aggregator AGG to the set of neighbor embeddings and concatenate the result with the node's own previous-layer embedding:

$h_v^k = \sigma\left( \left[ W_k \cdot \mathrm{AGG}\big(\{ h_u^{k-1}, \forall u \in N(v) \}\big),\; B_k h_v^{k-1} \right] \right)$
Slide 45
Aggregator variants (sketched in code below):
• Mean: Take a weighted average of the neighbors
• Pool: Transform the neighbor vectors and apply a symmetric vector function (e.g., element-wise mean or max)
• LSTM: Apply an LSTM to a reshuffled (randomly permuted) sequence of the neighbors
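Hedged sketches of the three aggregator choices, operating on a matrix `H_neigh` with one row per neighbor embedding. The pooling transform `Q`, the choice of max as the symmetric function, and the `lstm_step` callable are illustrative assumptions, not the GraphSAGE reference code.

```python
# Illustrative aggregators for GraphSAGE-style neighborhood aggregation.
import numpy as np

def agg_mean(H_neigh):
    """Mean: (weighted) average of the neighbor embeddings."""
    return H_neigh.mean(axis=0)

def agg_pool(H_neigh, Q):
    """Pool: transform each neighbor (Q h_u, with a ReLU), then apply a
    symmetric element-wise function (here: max)."""
    return np.max(np.maximum(H_neigh @ Q, 0.0), axis=0)

def agg_lstm(H_neigh, lstm_step, h0):
    """LSTM: run a (user-supplied) LSTM cell over a random reshuffling of the neighbors."""
    state = h0
    for row in np.random.permutation(H_neigh):   # random order: the result is not
        state = lstm_step(row, state)            # permutation-invariant by construction
    return state
```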
Slide 46
The aggregator variants, written out:
  § Mean: $\mathrm{AGG} = \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|}$
  § Pool: $\mathrm{AGG} = \gamma\big(\{ \mathbf{Q}\, h_u^{k-1}, \forall u \in N(v) \}\big)$, for a symmetric function $\gamma$ (e.g., element-wise mean or max) and a trainable matrix $\mathbf{Q}$
  § LSTM: $\mathrm{AGG} = \mathrm{LSTM}\big(\big[ h_u^{k-1}, \forall u \in \pi(N(v)) \big]\big)$, for a random permutation $\pi$
Key idea: Generate node embeddings based on local neighborhoods.
  § Nodes aggregate "messages" from their neighbors using neural networks
  § Basic variant: Average neighborhood information and stack neural networks
  § Generalized neighborhood aggregation (GraphSAGE)
Slide 47
Further topics and pointers:
• Relational inductive biases and graph networks (Battaglia et al., 2018)
• Attention-based neighborhood aggregation:
  § Graph attention networks (Hoshen, 2017; Velickovic et al., 2018; Liu et al., 2018)
• Embedding entire graphs:
  § Graph neural nets with edge embeddings (Battaglia et al., 2016; Gilmer et al., 2017)
  § Embedding entire graphs (Duvenaud et al., 2015; Dai et al., 2016; Li et al., 2018) and graph pooling (Ying et al., 2018)
  § Graph generation and relational inference (You et al., 2018; Kipf et al., 2018)
• Spectral approaches to graph neural networks:
  § Spectral graph CNN & ChebNet (Bruna et al., 2015; Defferrard et al., 2016)
  § Geometric deep learning (Bronstein et al., 2017; Monti et al., 2017)
• Hyperbolic geometry and hierarchical embeddings:
  § Poincaré embeddings and hierarchical relations (Nickel et al., 2017; Nickel et al., 2018)
  § Graph representation learning tradeoffs (De Sa et al., 2018)