10/3/18 Jure Leskovec, Stanford CS224W: Analysis of Networks, http://cs224w.stanford.edu
CS224W: Analysis of Networks
http://cs224w.stanford.edu
Degree distribution: P(k)
Clustering coefficient: C
Connected components: s
• Degree distribution P(k): Probability that a randomly chosen node has degree k
  ◦ N_k = # nodes with degree k
  ◦ Normalized histogram: P(k) = N_k / N
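The normalized histogram can be sketched in a few lines of Python. This is only an illustration: the function name `degree_distribution`, the edge-list input format, and the 5-node example graph are assumptions, not part of the lecture.

```python
from collections import Counter

def degree_distribution(edges, num_nodes):
    """Return {k: P(k)} for an undirected graph given as a list of (u, v) edges."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree)  # counts[k] = N_k, the number of nodes with degree k
    return {k: n_k / num_nodes for k, n_k in sorted(counts.items())}

# Hypothetical example: a triangle 0-1-2, a pendant node 3, and an isolated node 4.
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(degree_distribution(example_edges, 5))
```

Isolated nodes count toward P(0), so the values always sum to 1.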
• A path is a sequence of nodes in which each node is linked to the next one
  ◦ A path can intersect itself and pass through the same edge multiple times
• Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes
  ◦ If the two nodes are not connected, the distance is usually defined as infinite
• In directed graphs, paths need to follow the direction of the arrows
• Diameter: The maximum (shortest path) distance between any pair of nodes in a graph
• Average path length for a connected graph (component) or a strongly connected (component of a) directed graph:
  $\bar{h} = \frac{1}{2 E_{\max}} \sum_{i,\, j \neq i} h_{ij}$
  where $h_{ij}$ is the distance from node i to node j, and $E_{\max}$ is the max number of edges (total number of node pairs) = n(n-1)/2
  ◦ Many times we compute the average only over the connected pairs of nodes (that is, we ignore “infinite” length paths)
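A minimal sketch of both quantities for a connected undirected graph, using one BFS per node. The adjacency-dict format, the function names, and the 4-node path graph are assumptions made for this example.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_avg_path_length(adj):
    """Return (diameter, average path length) of a connected undirected graph."""
    n = len(adj)
    total, longest = 0, 0
    for s in adj:
        dist = bfs_distances(adj, s)
        if len(dist) < n:
            raise ValueError("graph is not connected: some distances are infinite")
        total += sum(dist.values())        # sums h_ij over all j != i
        longest = max(longest, max(dist.values()))
    e_max = n * (n - 1) // 2               # total number of node pairs
    return longest, total / (2 * e_max)

# Hypothetical example: the path graph 0 - 1 - 2 - 3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter_and_avg_path_length(path_graph))
```

The sum runs over ordered pairs, so every unordered pair is counted twice; dividing by 2·E_max compensates for that.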
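• Clustering coefficient:
  ◦ $C_i \in [0,1]$
  ◦ $C_i = \frac{2 e_i}{k_i (k_i - 1)}$, where $e_i$ is the number of edges between the neighbors of node i and $k_i$ is the degree of node i
• Average clustering coefficient: $C = \frac{1}{N} \sum_{i=1}^{N} C_i$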
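The definition translates directly into code. The sketch below assumes the graph is a dict mapping each node to a set of neighbors, and it assumes the convention that nodes of degree 0 or 1 get C_i = 0 (the slide does not specify this case).

```python
from itertools import combinations

def local_clustering(adj, i):
    """C_i = 2 e_i / (k_i (k_i - 1)) for node i."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined case is treated as 0
    # e_i: number of edges among the neighbors of i
    e = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * e / (k * (k - 1))

def average_clustering(adj):
    """C = (1/N) * sum of C_i over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Hypothetical example: triangle 0-1-2 with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(example, 2), average_clustering(example))
```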
• Size of the largest connected component
  ◦ Largest set where any two vertices can be joined by a path
• Largest component = Giant component
How to find connected components:
• Start from random node and perform Breadth First Search (BFS)
• Label the nodes BFS visited
• If all nodes are visited, the network is connected
• Otherwise find an unvisited node and repeat BFS
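The BFS procedure above can be sketched as follows. The adjacency-dict input and the node labels in the example are hypothetical, and instead of a random start node the sketch simply takes the next unvisited node in dictionary order, which yields the same components.

```python
from collections import deque

def connected_components(adj):
    """Return a list of components, each a list of nodes, via repeated BFS."""
    visited = set()
    components = []
    for start in adj:
        if start in visited:
            continue                    # already labeled by an earlier BFS
        component = []
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
        components.append(component)
    return components

# Hypothetical example with two components.
graph = {"A": ["B"], "B": ["A"], "F": ["G", "H"], "G": ["F"], "H": ["F"]}
components = connected_components(graph)
largest = max(components, key=len)
print(len(components), sorted(largest))
```

If the function returns a single component, the network is connected; the size of the largest component is `len(largest)`.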
Degree distribution: P(k)
Clustering coefficient: C
Connected components: s
MSN Messenger.
Network: 180M people, 1.3B edges
Messaging as an undirected graph
• Edge (u,v) if users u and v
exchanged at least 1 msg
• N=180 million people
• E=1.3 billion edges
Trang 1710/3/18 Jure Leskovec, Stanford CS224W: Analysis of Networks, http://cs224w.stanford.edu 17
Trang 18Note: We plotted the same data as on the previous slide, just the axes are now logarithmic.
Trang 1910/3/18 Jure Leskovec, Stanford CS224W: Analysis of Networks, http://cs224w.stanford.edu 19
Average clustering coefficient of nodes with degree k: $C_k = \frac{1}{N_k} \sum_{i : k_i = k} C_i$
Number of links between pairs of nodes in the largest connected component
Avg path length 6.6
90% of the nodes can be reached in < 8 hops
Are these values “expected”?
Are they “surprising”?
To answer this we need a null-model!
Undirected network: N=2,018 proteins as nodes, E=2,930 binding interactions as links
• Erdös-Renyi Random Graphs [Erdös-Renyi, ‘60]
• Two variants:
  ◦ G_np: undirected graph on n nodes where each edge (u,v) appears i.i.d. with probability p
  ◦ G_nm: undirected graph with n nodes and m edges picked uniformly at random
What kind of networks do such models produce?
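Both variants are easy to sample. The following is a minimal sketch; the function names `gnp` and `gnm` and the edge-list output are assumptions, and the all-pairs loop is only practical for small n.

```python
import random
from itertools import combinations

def gnp(n, p, rng):
    """G_np: each of the n(n-1)/2 possible edges appears independently with prob. p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def gnm(n, m, rng):
    """G_nm: m distinct edges chosen uniformly at random among all node pairs."""
    all_pairs = list(combinations(range(n), 2))
    return rng.sample(all_pairs, m)

# Two realizations with the same n and p generally differ.
print(gnp(10, 1 / 6, random.Random(0)))
print(gnp(10, 1 / 6, random.Random(1)))
print(gnm(10, 7, random.Random(0)))
```

In G_np the number of edges is itself random, while in G_nm it is fixed at m.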
• n and p do not uniquely determine the graph!
• We can have many different realizations given the same n and p
  ◦ Example: n = 10, p = 1/6
What are the values of these properties for G_np?
Degree distribution: P(k)
Clustering coefficient: C
• Fact: Degree distribution of G_np is binomial.
• Let P(k) denote the fraction of nodes with degree k:
  $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
  ◦ $\binom{n-1}{k}$: select k nodes out of n-1
  ◦ $p^k$: probability of having k edges
  ◦ $(1-p)^{n-1-k}$: probability of missing the rest of the n-1-k edges
• Mean, variance of a binomial distribution:
  $\bar{k} = p(n-1)$, $\sigma^2 = p(1-p)(n-1)$
  $\frac{\sigma}{\bar{k}} = \left[ \frac{1-p}{p} \cdot \frac{1}{n-1} \right]^{1/2}$
• By the law of large numbers, as the network size increases, the distribution becomes increasingly narrow: we are increasingly confident that the degree of a node is in the vicinity of $\bar{k}$.
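As a quick numeric check of the binomial formula, the sketch below evaluates P(k) and compares the mean and variance computed from it with the closed forms. The function names are assumptions; n = 10 and p = 1/6 reuse the small example from the earlier slide.

```python
from math import comb

def gnp_degree_pmf(n, p, k):
    """P(k) = C(n-1, k) * p^k * (1-p)^(n-1-k)."""
    return comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k)

def gnp_degree_mean_var(n, p):
    """Closed-form mean and variance of the degree in G_np."""
    return p * (n - 1), p * (1 - p) * (n - 1)

n, p = 10, 1 / 6
pmf = [gnp_degree_pmf(n, p, k) for k in range(n)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, var, gnp_degree_mean_var(n, p))
```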
• Edges in G_np appear i.i.d. with prob. p
• Remember: $C_i = \frac{2 e_i}{k_i (k_i - 1)}$, where $e_i$ is the number of edges between i’s neighbors
• So, in expectation: $E[e_i] = p \cdot \frac{k_i (k_i - 1)}{2}$
  ◦ p: each pair is connected with prob. p
  ◦ $\frac{k_i (k_i - 1)}{2}$: number of distinct pairs of neighbors of node i of degree $k_i$
• Then: $E[C_i] = \frac{p \cdot k_i (k_i - 1)}{k_i (k_i - 1)} = p = \frac{\bar{k}}{n-1} \approx \frac{\bar{k}}{n}$
• Clustering coefficient of a random graph is small. If we generate bigger and bigger graphs with fixed avg degree $\bar{k}$ (that is, we set $p = \bar{k} \cdot 1/n$), then C decreases with the graph size n.
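A small simulation can illustrate that the average clustering coefficient of G_np is close to p. This is a sketch under assumptions: the helper names, n = 300, and the chosen values of p are illustrative, and nodes with degree below 2 are counted as C_i = 0.

```python
import random
from itertools import combinations

def gnp_adjacency(n, p, rng):
    """Sample G_np as a dict mapping each node to its set of neighbors."""
    adj = {i: set() for i in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def average_clustering(adj):
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # assumed convention: C_i = 0 for degree 0 or 1
        e = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * e / (k * (k - 1))
    return total / len(adj)

for p in (0.05, 0.1, 0.2):
    c = average_clustering(gnp_adjacency(300, p, random.Random(0)))
    print(f"p = {p}: average C = {c:.3f}")
```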
Degree distribution: $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
• Graph G(V, E) has expansion α: if ∀ S ⊆ V: # of edges leaving S ≥ α · min(|S|, |V\S|)
• Or equivalently: $\alpha = \min_{S \subseteq V} \frac{\#\text{ edges leaving } S}{\min(|S|, |V \setminus S|)}$
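For very small graphs the expansion can be computed by brute force over all subsets. This is only an illustrative sketch (exponential time); the function name and the two example graphs are assumptions, and S ranges over non-empty subsets of at most half the nodes, which suffices because S and V\S have the same cut.

```python
from itertools import combinations

def expansion(nodes, edges):
    """min over S of (# edges leaving S) / min(|S|, |V \\ S|), by exhaustive search."""
    nodes = list(nodes)
    n = len(nodes)
    best = float("inf")
    for size in range(1, n // 2 + 1):      # here min(|S|, |V\S|) == |S| == size
        for subset in combinations(nodes, size):
            s = set(subset)
            leaving = sum(1 for u, v in edges if (u in s) != (v in s))
            best = min(best, leaving / size)
    return best

# Hypothetical examples: a 4-cycle and the complete graph on 4 nodes.
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(expansion(range(4), cycle4), expansion(range(4), k4))
```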
• Expansion is a measure of robustness: to disconnect ℓ nodes, we need to cut ≥ α · ℓ edges
• Fact: In a graph on n nodes with expansion α, for all pairs of nodes there is a path of length O((log n)/α).
• Random graph G_np: For log n > np > c, diam(G_np) = O(log n / log(np))
  ◦ Random graphs have good expansion, so it takes a logarithmic number of steps for BFS to visit all nodes
An Erdös-Renyi random graph can grow very large, but nodes will be just a few hops apart
Degree distribution: $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
• Graph structure of G_np as p changes:
  ◦ p = 1/(n-1): avg deg = 1, giant component appears
  ◦ p = c/(n-1): avg deg constant, lots of isolated nodes
  ◦ p = log(n)/(n-1): fewer isolated nodes
• Emergence of a giant component: avg degree k = 2E/n, or p = k/(n-1)
  ◦ k = 1-ε: all components are of size O(log n)
  ◦ k = 1+ε: 1 component of size Ω(n), others have size O(log n)
  ◦ Each node has at least one edge in expectation
• G_np, n=100,000, k=p(n-1) = 0.5 … 3
[Figure: fraction of nodes in the largest component as a function of p·(n-1); the giant component emerges around p·(n-1) = 1]
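The transition can be reproduced at a smaller scale. The sketch below uses n = 1,000 rather than the 100,000 nodes on the slide so that it runs in a few seconds, and it tracks components with a union-find structure; both choices, and the function name, are assumptions of this example.

```python
import random
from collections import Counter
from itertools import combinations

def largest_component_fraction(n, avg_degree, rng):
    """Sample G_np with p = avg_degree / (n-1); return |largest component| / n."""
    p = avg_degree / (n - 1)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            parent[find(u)] = find(v)      # merge the two components

    sizes = Counter(find(x) for x in range(n))
    return max(sizes.values()) / n

rng = random.Random(0)
for k in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(k, round(largest_component_fraction(1000, k, rng), 3))
```

Below average degree 1 the largest component should hold only a small fraction of the nodes; above it the fraction grows quickly.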
Paul Erdös
Property: MSN vs. G_np
Degree distribution: [plots]
Avg. path length: MSN 6.6; G_np O(log n), h ≈ 8.2
Avg. clustering coef.: MSN 0.11; G_np k/n, C ≈ 8·10^-8
Largest conn. comp.: MSN 99%; G_np GCC exists when k > 1
• Are real networks like random graphs?
  ◦ Giant connected component: ☺
  ◦ Average path length: ☺
  ◦ Clustering coefficient: ☹
  ◦ Degree distribution: ☹
• Problems with the random network model:
  ◦ Degree distribution differs from that of real networks
  ◦ Giant component in most real networks does NOT emerge through a phase transition
  ◦ No local structure: clustering coefficient is too low
• Most important: Are real networks random?
  ◦ The answer is simply: NO!
• If G_np is wrong, why did we spend time on it?
  ◦ It lets us calculate quantities that can then be compared to the real data
  ◦ It helps us understand to what degree a particular property is the result of some random process
So, while G_np is WRONG, it will turn out to be extremely USEFUL!
• Goal: Generate a random graph with a given degree sequence k_1, k_2, …, k_N
Can we have high clustering while also having short paths?
• What is the typical shortest path length between any two people?
  ◦ Experiment on the global friendship network
  ◦ Can’t measure, need to probe explicitly
  ◦ Picked 300 people in Omaha, Nebraska and Wichita, Kansas
  ◦ Ask them to get a letter to a stock-broker in Boston by passing it through friends
• How many steps did it take?
Milgram’s small world experiment [Milgram, ’67]
• 64 chains completed (i.e., 64 letters reached the target)
  ◦ … average, thus “6 degrees of separation”
  ◦ … had shorter paths to the stockbroker than random people: 5.4 vs. 6.7
  ◦ … closer paths: 4.4
• Assume each human is connected to 100 other people. Then:
  ◦ Step 1: reach 100 people
  ◦ Step 2: reach 100*100 = 10,000 people
  ◦ Step 3: reach 100*100*100 = 1,000,000 people
  ◦ Step 4: reach 100*100*100*100 = 100M people
  ◦ In 5 steps we can reach 10 billion people
• What’s wrong here? We ignore clustering!
  ◦ Not all edges point to new people
  ◦ 92% of FB friendships happen through a friend-of-a-friend
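The arithmetic above is just repeated multiplication, sketched here under the same idealized assumption that every contact is a new person (no clustering). The function name is hypothetical.

```python
def reachable_upper_bound(friends_per_person, steps):
    """People reached after `steps` hops if every edge leads to a new person."""
    return friends_per_person ** steps

for steps in range(1, 6):
    print(steps, f"{reachable_upper_bound(100, steps):,}")
```

Because many edges point back to people already reached, this is an upper bound rather than an estimate.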
• MSN network has 7 orders of magnitude larger clustering than the corresponding G_np
• Other networks:
  ◦ Actor Collaborations (IMDB): N = 225,226 nodes, avg degree k = 61
  ◦ Electrical power grid: N = 4,941 nodes, k = 2.67
  ◦ Network of neurons: N = 282 nodes, k = 14
[Table columns: Network, h_actual, h_random, C_random]
  h … average shortest path length
  C … average clustering coefficient
  “actual” … real network
  “random” … random graph with same avg degree
• Consequence of expansion:
  ◦ Short paths: O(log n)
  ◦ This is the smallest diameter we can get if we have a constant degree.
• But networks have “local” structure:
  ◦ Friend of a friend is my friend
  ◦ High clustering, but diameter is also high
[Figure: one network with low diameter and low clustering coefficient, one with high clustering coefficient and high diameter]
• Could a network with high clustering also be a small world (log n diameter)?
  ◦ How can we at the same time have high clustering and small diameter?
[Figure: high clustering with high diameter vs. low clustering with low diameter]
Small-World Model [Watts-Strogatz ‘98]
Two components to the model:
• (1) Start with a low-dimensional regular lattice
• Now introduce randomness (“shortcuts”)
• (2) Rewire:
  ◦ Create shortcuts to join remote parts of the lattice
  ◦ For each edge, with some probability p, move the other end to a random node
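The two components of the model can be sketched as follows. This is a simplified illustration, not the exact procedure from the paper: it assumes a ring as the lattice, each node linked to its k nearest neighbors, and a rewiring step that avoids self-loops and duplicate edges. All function names are hypothetical.

```python
import random
from itertools import combinations

def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbors (k/2 per side)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for offset in range(1, k // 2 + 1):
            j = (i + offset) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def rewire(adj, p, rng):
    """For each edge (u, v), with prob. p move the end v to a random node."""
    n = len(adj)
    new_adj = {i: set(nbrs) for i, nbrs in adj.items()}
    for u in range(n):
        for v in sorted(adj[u]):
            if u < v and rng.random() < p:
                candidates = [w for w in range(n) if w != u and w not in new_adj[u]]
                if candidates:
                    w = rng.choice(candidates)
                    new_adj[u].discard(v)
                    new_adj[v].discard(u)
                    new_adj[u].add(w)
                    new_adj[w].add(u)
    return new_adj

def average_clustering(adj):
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k >= 2:
            e = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            total += 2 * e / (k * (k - 1))
    return total / len(adj)

lattice = ring_lattice(200, 4)
for p in (0.0, 0.05, 1.0):
    g = rewire(lattice, p, random.Random(0))
    print(p, round(average_clustering(g), 3))
```

With p = 0 the lattice is returned unchanged; larger p moves the graph toward a random graph while keeping the number of edges fixed.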
[Figure: a regular lattice (high clustering, high diameter), a small-world network (high clustering, low diameter), and a random graph (low clustering, low diameter)]
Rewiring allows us to “interpolate” between a regular lattice and a random graph [Watts-Strogatz, ‘98]
• Could a network with high clustering be at the same time a small world?
  ◦ … and the small-world
• What mechanisms do people use to navigate networks and find the target?
The setting:
• s only knows locations of its friends and location of the target t
• s does not know links of anyone else but itself
• Geographic Navigation: s “navigates” to a node geographically closest to t
Search time O((log n)^β): Kleinberg’s model
Note: We know these graphs have diameter O(log n). So in Kleinberg’s model search time is polynomial in log n, while in Watts-Strogatz it is exponential (in log n).
• Long-range links are not random
  ◦ They follow geography!
[Figure: Saul Steinberg, “View of the World from 9th Avenue”]
• Model [Kleinberg, Nature ‘01]
  ◦ d(u,v) … grid distance between u and v
  ◦ Link probability: $P(u \to v) \sim d(u,v)^{-\alpha}$
• We know:
  ◦ α = 0 (i.e., Watts-Strogatz): search time is polynomial in n (exponential in log n)
  ◦ α = 1: We need O((log n)^2) steps
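Decentralized search in this model can be sketched on a 1-dimensional ring. The sketch makes several assumptions not stated on the slides: the lattice is a ring, each node gets exactly one long-range link drawn with probability proportional to d(u,v)^(-α), and each node forwards the message to whichever of its three contacts is closest to the target. All names are hypothetical.

```python
import random

def ring_distance(u, v, n):
    d = abs(u - v)
    return min(d, n - d)

def build_long_range_links(n, alpha, rng):
    """One long-range link per node u, to v with prob. proportional to d(u,v)^(-alpha)."""
    links = {}
    for u in range(n):
        others = [v for v in range(n) if v != u]
        weights = [ring_distance(u, v, n) ** (-alpha) for v in others]
        links[u] = rng.choices(others, weights=weights, k=1)[0]
    return links

def greedy_route(s, t, n, links):
    """Forward to the known contact closest to t; return the number of steps."""
    current, steps = s, 0
    while current != t:
        contacts = [(current - 1) % n, (current + 1) % n, links[current]]
        current = min(contacts, key=lambda v: ring_distance(v, t, n))
        steps += 1
    return steps

n = 500
pair_rng = random.Random(42)
pairs = [(pair_rng.randrange(n), pair_rng.randrange(n)) for _ in range(200)]
for alpha in (0.0, 1.0, 2.0):
    links = build_long_range_links(n, alpha, random.Random(0))
    avg = sum(greedy_route(s, t, n, links) for s, t in pairs) / len(pairs)
    print(f"alpha = {alpha}: average greedy steps = {avg:.1f}")
```

One ring neighbor is always strictly closer to the target, so the search terminates and never needs more steps than the ring distance between s and t.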
“funnels inward” through these different scales of resolution, as we see from the way the letter depicted in this figure reduces its distance to the target by approximately a factor of two with each step.
So now let’s look at how the inverse-square exponent q = 2 interacts with these scales of resolution. We can work concretely with a single scale by taking a node v in the network, and a fixed distance d, and considering the group of nodes lying at distances between d and 2d from v, as shown in Figure 20.7.
Now, what is the probability that v forms a link to some node inside this group? Since area in the plane grows like the square of the radius, the total number of nodes in this group is proportional to $d^2$. On the other hand, the probability that v links to any one node in the group varies depending on exactly how far out it is, but each individual probability is proportional to $d^{-2}$. These two terms, the number of nodes in the group and the …
Small α: too many long links. Big α: too many short links.