CS224W: Analysis of Networks
http://cs224w.stanford.edu
10/4/18 Jure Leskovec, Stanford CS224W: Analysis of Networks
Feature matrices, relationship tables, time series, document corpora, image datasets, etc.
Today: How to construct and infer networks from raw data?
Network construction and inference
Jonas Richiardi et al., Correlated gene expression supports synchronous activity in brain networks. Science 348:6240, 2015.
1) Multimode Network Transformations:
§ K-partite and bipartite graphs
§ One-mode network projections/folding
§ Graph contractions
§ Direct and indirect effects in a network
§ Inferring networks by network deconvolution
¡ Most of the time, when we create a network, all nodes are of a single type:
§ People in social nets, bus stops in route nets, genes in gene nets
¡ But some networks are K-partite: nodes come in K types,
where edges exclusively go from one type to the other:
§ 2-partite student net: Students <-> Research projects
§ 3-partite movie net: Actors <-> Movies <-> Movie Companies
[Figure: the network on the left is a social bipartite network; blue squares stand for people]
¡ Example: Bipartite student-project network:
§ Student network: Students are linked if they work together
§ Project network: Research projects are linked if one or
more students work on both projects
¡ In general: A K-partite network has K one-mode network projections
¡ Example: Projection of bipartite student-project network
§ A triangle in the student projection can be a result of:
§ Scenario #1: Each pair of students works on a different project
§ Scenario #2: Three students work on the same project
§ One-mode network projections discard some information:
§ Cannot distinguish between #1 and #2 just by looking at the projection
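This ambiguity can be demonstrated with a small Python sketch (toy data; the student and project names are made up): two different bipartite networks produce the identical triangle when projected onto the student mode.

```python
from itertools import combinations

def project_students(memberships):
    """One-mode projection: link two students if they share >= 1 project.

    memberships: dict mapping student -> set of project ids.
    Returns the set of student-student edges (as frozensets).
    """
    edges = set()
    for s1, s2 in combinations(memberships, 2):
        if memberships[s1] & memberships[s2]:
            edges.add(frozenset((s1, s2)))
    return edges

# Scenario #1: each pair of students shares a *different* project.
scenario1 = {"A": {"p1", "p3"}, "B": {"p1", "p2"}, "C": {"p2", "p3"}}
# Scenario #2: all three students work on the *same* project.
scenario2 = {"A": {"p1"}, "B": {"p1"}, "C": {"p1"}}

# Both projections are the same triangle A-B-C: the projection alone
# cannot tell the two scenarios apart.
assert project_students(scenario1) == project_students(scenario2)
```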
¡ One-mode projection onto the student mode:
§ #(projects) that students i and j work together on is
equivalent to the number of paths of length 2
connecting i and j in the bipartite network
¡ Let B be the incidence matrix of the student-project net:
¡ Idea: Use B to construct various one-mode network projections
¡ Weighted student network:
§ P_ij = w_ij, the #(projects) that students i and j collaborate on; 0 otherwise
§ P_ij = Σ_k B_ik B_jk, i.e., the number of paths of length 2 connecting students i and j in the bipartite network
§ P = B Bᵀ, and P_ii represents the #(projects) that student i works on
¡ Similarly, weighted project network:
§ Q_kl = w_kl, the #(students) that work on both projects k and l; 0 otherwise
§ Q_kl = Σ_i B_ik B_il, i.e., the number of paths of length 2 connecting projects k and l in the bipartite network
§ Q = Bᵀ B, and Q_kk represents the #(students) that work on project k
¡ Next: Use P and Q to obtain different network projections
[Figure: bipartite graph with Students in one mode and Projects in the other]
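A minimal NumPy sketch of these identities, using a toy incidence matrix (the students and projects are hypothetical):

```python
import numpy as np

# Incidence matrix B of a toy student-project bipartite network:
# rows = students (A, B, C), columns = projects (p1, p2).
B = np.array([[1, 0],
              [1, 1],
              [0, 1]])

P = B @ B.T   # weighted student network: P[i, j] = #projects i and j share
Q = B.T @ B   # weighted project network: Q[k, l] = #students on both k and l

# Diagonals recover per-node degrees in the bipartite graph:
assert np.array_equal(np.diag(P), B.sum(axis=1))  # projects per student
assert np.array_equal(np.diag(Q), B.sum(axis=0))  # students per project
print(P)  # off-diagonal entries count paths of length 2 through projects
```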
¡ Construct network projections by applying a node
similarity measure to P and Q
§ Common neighbors: #(shared neighbors of nodes)
§ Student network: i and j are linked if they work together on k or
more projects, i.e., if P_ij ≥ k
§ Project network: u and v are linked if k or more students work on
both projects, i.e., if Q_uv ≥ k
§ Jaccard index:
§ Common neighbors with a penalization for each non-shared
neighbor:
§ Ratio of shared neighbors in the complete set of neighbors for 2 nodes
§ Student network: i and j are linked if they work together on at least
a σ fraction of their projects, i.e., if P_ij / (P_ii + P_jj − P_ij) ≥ σ
§ Project network: u and v are linked if at least a σ fraction of their
students work on both projects, i.e., if Q_uv / (Q_uu + Q_vv − Q_uv) ≥ σ
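Both thresholding rules can be sketched directly on the weighted matrix P (the same functions apply to Q unchanged); the toy incidence matrix here is made up:

```python
import numpy as np

def threshold_edges(P, k):
    """Common-neighbors projection: link i, j if P[i, j] >= k."""
    n = P.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 1, n) if P[i, j] >= k}

def jaccard_edges(P, sigma):
    """Jaccard projection: link i, j if P_ij / (P_ii + P_jj - P_ij) >= sigma."""
    n = P.shape[0]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            union = P[i, i] + P[j, j] - P[i, j]  # |projects of i or j|
            if union > 0 and P[i, j] / union >= sigma:
                edges.add((i, j))
    return edges

B = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 1]])  # students x projects
P = B @ B.T
print(threshold_edges(P, 2))  # pairs sharing >= 2 projects
print(jaccard_edges(P, 0.5))  # pairs sharing >= half their combined projects
```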
Kwang-Il Goh et al., The human disease network. PNAS 104:21, 2007.
a few other disorders, whereas a few phenotypes such as colon
cancer (linked to k ≈ 50 other disorders) or breast cancer (k ≈ 30)
represent hubs that are connected to a large number of distinct
disorders. The prominence of cancer among the most connected
disorders arises in part from the many clinically distinct cancer
subtypes tightly connected with each other through common tumor
repressor genes such as TP53 and PTEN.
Although the HDN layout was generated independently of any
knowledge on disorder classes, the resulting network is naturally
and visibly clustered according to major disorder classes. Yet, there
are visible differences between different classes of disorders.
Whereas the large cancer cluster is tightly interconnected due to the
many genes associated with multiple types of cancer (TP53, KRAS,
ERBB2, NF1, etc.) and includes several diseases with strong
predisposition to cancer, such as Fanconi anemia and ataxia
telangiectasia, metabolic disorders do not appear to form a single distinct
cluster but are underrepresented in the giant component and
overrepresented in the small connected components (Fig 2a). To
quantify this difference, we measured the locus heterogeneity of
each disorder class and the fraction of disorders that are connected
to each other in the HDN (see SI Text). We find that cancer and
neurological disorders show high locus heterogeneity and also
represent the most connected disease classes, in contrast with
metabolic, skeletal, and multiple disorders that have low genetic
heterogeneity and are the least connected (SI Fig 7).
Properties of the DGN. In the DGN, two disease genes are connected
if they are associated with the same disorder, providing a
complementary, gene-centered view of the diseasome. Given that the links signify related phenotypic association between two genes, they represent a measure of their phenotypic relatedness, which could be used in future studies, in conjunction with protein–protein interactions (6, 7, 19), transcription factor–promoter interactions (20), and metabolic reactions (8), to discover novel genetic interactions.
In the DGN, 1,377 of 1,777 disease genes are connected to other
disease genes, and 903 genes belong to a giant component (Fig 2b).
Whereas the number of genes involved in multiple diseases decreases rapidly (SI Fig 6d; light gray nodes in Fig 2b), several disease genes (e.g., TP53, PAX6) are involved in as many as 10
disorders, representing major hubs in the network.
Functional Clustering of HDN and DGN. To probe how the topology
of the HDN and DGN deviates from random, we randomly shuffled the associations between disorders and genes, while keeping the number of links per each disorder and disease gene in the bipartite network unchanged. Interestingly, the average size of the giant component of 10^4 randomized disease networks is 643 ± 16,
significantly larger than 516 (P < 10^−4; for details of statistical analyses of the results reported hereafter, see SI Text), the actual size of the HDN (SI Fig 6c). Similarly, the average size of the giant component from randomized gene networks is 1,087 ± 20 genes,
significantly larger than 903 (P < 10^−4), the actual size of the DGN (SI Fig 6e). These differences suggest important pathophysiological clustering of disorders and disease genes. Indeed, in the actual networks disorders (genes) are more likely linked to disorders (genes) of the same disorder class. For example, in the HDN there
[Figure 1 panels: Human Disease Network (HDN, disease phenome) and Disease Gene Network (DGN, disease genome), together the DISEASOME; node labels (AR, ATM, BRCA1, BRCA2, CDH1, GARS, HEXB, KRAS, LMNA, MSH2, PIK3CA, TP53, etc., and disorders such as breast cancer, Fanconi anemia, ataxia-telangiectasia) omitted]
Fig 1. Construction of the diseasome bipartite network. (Center) A small subset of OMIM-based disorder–disease gene associations (18), where circles and rectangles
correspond to disorders and disease genes, respectively. A link is placed between a disorder and a disease gene if mutations in that gene lead to the specific disorder. The size of a circle is proportional to the number of genes participating in the corresponding disorder, and the color corresponds to the disorder class to which the disease
belongs. (Left) The HDN projection of the diseasome bipartite graph, in which two disorders are connected if there is a gene that is implicated in both. The width of
a link is proportional to the number of genes that are implicated in both diseases. For example, three genes are implicated in both breast cancer and prostate cancer,
resulting in a link of weight three between them. (Right) The DGN projection where two genes are connected if they are involved in the same disorder. The width of
a link is proportional to the number of diseases with which the two genes are commonly associated. A full diseasome bipartite map is provided as SI Fig 13.
Homework 1
¡ The folded (one-mode) gene network contains many cliques:
§ Example: a clique of 9 gene nodes
§ Why do cliques arise in the folded gene network?
¡ Networks with many cliques are difficult to analyze:
§ Computational complexity of many algorithms depends on the
size and number of large cliques
¡ Next: Use graph contraction to eliminate cliques
¡ Graph contraction: Technique for computing
properties of networks in parallel:
§ Divide-and-conquer principle
¡ Idea:
§ Contract the graph into a smaller graph, ideally a
constant fraction smaller
§ Recurse on the smaller graph
§ Use the result from the recursion along with the
initial graph to calculate the desired result
¡ Next: How to contract (“shrink”) a graph?
¡ Start with the input graph G:
1. Select a node-partitioning of G to guide the contraction:
§ Partitions are disjoint and they include all nodes in G
2. Contract each partition into a single node, a supernode
3. Drop edges internal to a partition
4. Reroute cross edges to corresponding supernodes
5. Set G to be the smaller graph; Repeat
¡ Example: one round of graph contraction:
[Figure: Round 1 of contraction — identify partitions, contract, delete duplicate edges; 3 partitions become supernodes a, d, e]
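One round of the procedure above can be sketched in a few lines of Python (pure-Python, toy node names; a real parallel implementation would also choose the partitioning itself, which is the interesting part):

```python
def contract(edges, partition):
    """One round of graph contraction.

    edges: set of frozenset node pairs; partition: dict node -> supernode.
    Edges internal to a partition are dropped; cross edges are rerouted
    to the supernodes, and duplicates collapse because we build a set.
    """
    contracted = set()
    for e in edges:
        u, v = tuple(e)
        su, sv = partition[u], partition[v]
        if su != sv:                             # drop internal edges
            contracted.add(frozenset((su, sv)))  # reroute to supernodes
    return contracted

# Toy graph with 6 nodes grouped into 3 supernodes: a, d, e.
edges = {frozenset(p) for p in [("a1", "a2"), ("a2", "d1"), ("d1", "d2"),
                                ("d2", "e1"), ("e1", "e2"), ("a1", "e2")]}
partition = {"a1": "a", "a2": "a", "d1": "d", "d2": "d", "e1": "e", "e2": "e"}
print(contract(edges, partition))  # triangle on the supernodes a, d, e
```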
¡ Contracting a graph down to a single node in
three rounds:
[Figure: three rounds of identify partitions → contract → delete duplicate edges, starting from nodes a, d, e]
¡ Partitions should be disjoint and include all nodes in G
¡ Three types of node-partitioning:
§ Each partition is a (maximal) clique of nodes
§ Each partition is a single node or two connected nodes
§ Each partition is a star of nodes, etc.
[Figure: example partitionings on nodes a, b, g, h, i]
1) Multimode Network Transformations:
§ K-partite and bipartite graphs
§ One-mode network projections/folding
§ Graph contractions
§ Direct and indirect effects in a network
§ Inferring networks by network deconvolution
¡ K-nearest neighbor graph (K-NNG) for a set of
objects X is a directed graph with vertex set X:
§ Edges from each x ∈ X to its K most similar
objects in X under a given similarity measure:
§ e.g., Cosine similarity for text
¡ K-NNG construction is an important operation:
§ Recommender systems: connect users with similar
product rating patterns, then make recommendations based on the user’s graph neighbors
§ Document retrieval systems: connect documents
with similar content, quickly answer input queries
§ Other problems in clustering, visualization,
information retrieval, data mining, manifold learning
Figure 1: A typical pipeline of data visualization by first constructing a K-nearest neighbor graph and then projecting the
graph into a low-dimensional space.
eration technique [26] for the t-SNE by first constructing
a K-nearest neighbor (KNN) graph of the data points and
then projecting the graph into low-dimensional spaces with
tree-based algorithms. T-SNE and its variants, which
represent a family of methods that first construct a similarity
structure from the data and then project the structure into
a 2D/3D space (see Figure 1), have been widely adopted
recently due to the ability to handle real-world data and the
good quality of visualizations.
Despite their successes, when applied to data with millions
of points and hundreds of dimensions, the t-SNE techniques
are still far from satisfaction. The reasons are three-fold:
(1) the construction of the K-nearest neighbor graph is a
computational bottleneck for dealing with large-scale and
high-dimensional data. T-SNE constructs the graph using
the technique of vantage-point trees [28], the performance
of which significantly deteriorates when the dimensionality
of the data grows high; (2) the efficiency of the graph
visualization step significantly deteriorates when the size of
the data becomes large; (3) the parameters of the t-SNE
are very sensitive on different data sets. To generate a good
visualization, one has to search for the optimal parameters
exhaustively, which is very time consuming on large data
sets. It is still a long shot of the community to create high
quality visualizations that scale to both the size and the
dimensionality of the data.
We report a significant progress on this direction through
the LargeVis, a new visualization technique that computes
the layout of large-scale and high-dimensional data. The
LargeVis employs a very efficient algorithm to construct an
approximate K-nearest neighbor graph at a high accuracy,
which builds on top of but significantly improves a
state-of-the-art approach to KNN graph construction, the random
projection trees [7]. We then propose a principled
probabilistic approach to visualizing the K-nearest neighbor graph,
which models both the observed links and the unobserved
(i.e., negative) links in the graph. The model preserves the
structures of the graph in the low-dimensional space,
keeping similar data points close and dissimilar data points far
away from each other. The corresponding objective
function can be optimized through the asynchronous stochastic
gradient descent, which scales linearly to the data size N.
Comparing to the one used by the t-SNE, the optimization
process of LargeVis is much more efficient and also more
effective. Besides, on different data sets the parameters of the
LargeVis are much more stable.
We conduct extensive experiments on real-world, large-scale,
and high-dimensional data sets, including texts (words
and documents), images, and networks. Experimental results
show that our proposed algorithm for constructing the
approximate K-nearest neighbor graph significantly outperforms
the vantage-point tree algorithm used in the t-SNE
and other state-of-the-art methods. LargeVis generates
comparable graph visualizations to the t-SNE on small
data sets and more intuitive visualizations on large data sets;
it is much more efficient when data becomes large; the parameters
are not sensitive to different data sets. On a set
of three million data points with one hundred dimensions,
LargeVis is up to thirty times faster at graph construction
and seven times faster at graph visualization. LargeVis only
takes a couple of hours to visualize millions of data points
with hundreds of dimensions on a single machine.
To summarize, we make the following contributions:
• We propose a new visualization technique which is able
to compute the layout of millions of data points with hundreds of dimensions efficiently.
• We propose a very efficient algorithm to construct an approximate K-nearest neighbor graph from large-scale, high-dimensional data.
• We propose a principled probabilistic model for graph visualization. The objective function of the model can
be effectively optimized through asynchronous stochastic gradient descent with a time complexity of O(N).
• We conduct experiments on real-world, very large data sets and compare the performance of LargeVis and t-SNE, both quantitatively and visually.
2 RELATED WORK
To the best of our knowledge, very few visualization techniques can efficiently layout millions of high-dimensional data points meaningfully on a 2D space. Instead, most visualizations of large data sets have to first layout a summary or a coarse aggregation of the data and then refine a subset of the data (a region of the visualization) if the user zooms in [5]. Admittedly, there are other design factors besides the computational capability, for example the aggregated data may be more intuitive and more robust to noises. However, with a layout of the entire data set as basis, the effectiveness
of these aggregated/approximate visualizations will only be improved. Many visualization tools are designed to layout geographical data, sensor data, and network data. These tools typically cannot handle high-dimensional data.
Figure 8: Visualizations of 20NG, WikiDoc, and LiveJournal by t-SNE and LargeVis. Panels: (a) 20NG (t-SNE), (b) 20NG (LargeVis), (c) WikiDoc (t-SNE), (d) WikiDoc (LargeVis), (e) LiveJournal (t-SNE), (f) LiveJournal (LargeVis). Different colors correspond to different categories (20NG) or clusters learned by K-means according to high-dimensional representations.
¡ Pipeline: K-NNG construction, then graph visualization:
§ Compute similarities between objects
§ Project objects into a 2D space by preserving the similarities
¡ K-NNG can substantially reduce computational costs
[Figure: WikiDoc data visualized with t-SNE]
¡ Let’s construct a K-NNG by brute-force:
§ Given n objects X and a distance metric
d: X × X → [0, ∞)
§ For each possible pair (x, y), compute d(x, y)
§ For each x, let B_K(x) be x’s K-NN, i.e., the K
objects in X (other than x) most similar to x
[Figure: for each object, compute similarity to every other object, then choose the K nearest objects]
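The brute-force construction sketched above, in NumPy (Euclidean distance as the assumed metric; the data here is random toy input):

```python
import numpy as np

def knn_graph_brute_force(X, K):
    """Brute-force K-NNG: computes all O(n^2) pairwise distances.

    X: (n, d) array of points. Returns adj, where adj[i] lists the K
    indices nearest to point i under Euclidean distance (excluding i).
    """
    # Pairwise squared Euclidean distances via broadcasting: (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)        # exclude self-edges
    return np.argsort(dist, axis=1)[:, :K]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
adj = knn_graph_brute_force(X, K=3)       # directed K-NNG as adjacency lists
assert adj.shape == (100, 3)
```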
¡ Computational cost of brute-force: O(n²)
§ Not scalable: Practical for only small datasets
§ Not general: Many custom heuristics designed to
speed up computations:
§ Many heuristics are specific to a similarity measure
§ Not efficient: Computes all neighbors for every x
§ We only need the K nearest neighbors for every x
¡ Can we do better than brute-force?
¡ NN-Descent [Dong et al., WWW 2011]:
§ Efficient algorithm to approximate K-NNG construction
with arbitrary similarity measure
¡ Other published methods (not covered today):
§ Locality Sensitive Hashing (LSH): A new hash function
needs to be designed for a new similarity measure
§ Recursive Lanczos bisection: Recursively divide the
dataset, so objects in different partitions are not compared
§ K-NN search problem: If the K-NN search problem is solved, the K-NNG
can be constructed by running a K-NN query for each x ∈ X
¡ Key principle: A neighbor of a neighbor is also likely to be a neighbor
§ Start with an approximation of the K-NNG, B
§ Improve B by exploring each point’s neighbors’
neighbors as defined by the current approximation
§ Stop when no improvement can be made
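A simplified sketch of this idea (not the full NN-Descent algorithm, which adds reverse neighbors, sampling, and early-termination heuristics; squared Euclidean distance is an assumption here):

```python
import numpy as np

def nn_descent(X, K, iters=10, seed=0):
    """Simplified NN-Descent sketch (after Dong et al., WWW 2011).

    Start from a random K-NN approximation and repeatedly try each
    point's neighbors' neighbors as candidates; stop when no entry
    improves (or after `iters` rounds).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    d = lambda i, j: float(((X[i] - X[j]) ** 2).sum())
    # B[i]: current approximate K-NN list of (dist, j), kept sorted.
    B = []
    for i in range(n):
        cand = rng.choice([j for j in range(n) if j != i], K, replace=False)
        B.append(sorted((d(i, j), j) for j in cand))
    for _ in range(iters):
        improved = False
        for i in range(n):
            # Candidates: neighbors of i's current neighbors.
            cands = {c for _, j in B[i] for _, c in B[j]}
            known = {j for _, j in B[i]}
            for c in cands - known - {i}:
                dc = d(i, c)
                if dc < B[i][-1][0]:      # better than current worst entry
                    B[i][-1] = (dc, c)
                    B[i].sort()
                    improved = True
        if not improved:                  # no entry changed: converged
            break
    return [[j for _, j in row] for row in B]
```

Each round compares every point against only its neighbors' neighbors (roughly O(K²) candidates) instead of all n objects, and it works for any similarity measure plugged into `d`.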
¡ Let:
§ X be a metric space with distance metric
d: X × X → [0, ∞); σ = −d is the similarity measure
¡ Def: Metric space X is growth-restricted if there exists a constant c, such that:
|B_2r(x)| ≤ c |B_r(x)|, ∀x ∈ X
¡ The smallest such c is the growing constant of X
§ Start with an approximation of the K-NNG, B
§ Use the growing constant of X to show that B can be
improved by comparing each object v against its
current neighbors’ neighbors B′[v]
Details
¡ Two assumptions:
§ Let c be the growing constant of X and let K be a (fixed) power of c
§ Have an approximate K-NNG B that is reasonably good:
§ For a fixed radius r, for all v, B[v] contains K neighbors that are uniformly distributed in B_r(v)
¡ Lemma: B′[v] is likely to contain K nearest
neighbors in B_{r/2}(v)
¡ So each iteration roughly halves the
distance to the set of approximate K nearest
neighbors by exploring B′[v] for every v
Details