World Wide Web, volume 16, issue 5–6, 2013



November 2013


Guest editorial: social networks and social Web mining

Guandong Xu & Jeffrey Yu & Wookey Lee

Received: 14 August 2013 / Accepted: 3 September 2013 /

Published online: 27 September 2013

© Springer Science+Business Media New York 2013

Nowadays the emergence of web-based communities and hosted services, such as social networking sites (Facebook, LinkedIn), wikis (Wikipedia), microblogging (Twitter) and folksonomies (Delicious, Flickr), brings tremendous freedom of Web autonomy and facilitates collaboration and sharing between users. Along with the interactions between users and computers, the social Web is rapidly becoming an important part of our digital experience, ranging from digital textual information to rich multimedia formats. Social networks have played an important role in different domains for about a decade, particularly in a broad range of social activities such as user interaction, establishing friendship relationships, sharing and recommending resources, suggesting friends, creating groups and communities, and commenting on friends' activities and opinions. Recent years have witnessed rapid progress in the study of social networks for diverse applications, such as user profiling in Facebook and group recommendation via Flickr.

These aspects and characteristics form the most active and challenging parts of Web 2.0.

A large number of challenges and opportunities have arisen with the propagation and popularity of new applications and technologies. A prominent challenge lies in modeling and mining this vast volume of data to extract, represent and exploit meaningful knowledge, and to leverage the structures and dynamics of emerging social networks residing in the social Web, especially social media. Social networks and social Web mining combines data mining with social computing as a promising direction and offers unique opportunities for developing novel algorithms and tools, ranging from text and content mining to link mining and community detection.

This special issue has gained overwhelming attention and received 52 submissions from researchers and practitioners working on social network analysis and social media mining. After an initial examination of all submissions, 42 papers were selected for the regular rigorous review process, and each submission was reviewed by at least three reviewers. After 2–3 rounds of review, ten quality papers were eventually recommended for inclusion in this special issue; they are summarized below.

The paper titled “Mesoscopic Analysis of Networks with Genetic Algorithms” presents a genetic-based approach to discover communities in networks. The algorithm optimizes a simple but efficacious fitness function to identify densely connected groups of nodes with sparse connections between groups, and its variation operators consider only the actual correlations among the nodes, thus considerably reducing the search space of possible solutions. Experiments on synthetic and real-life networks show the ability of the method to successfully detect the network structure.

Complex networks have received increasing attention from the scientific community, in line with the increasing availability of real-world network data. Beyond network analysis that focuses on the characterization and measurement of local and global properties of graphs, such as diameter, degree distribution and centrality, the multidimensional nature of real-world networks has been recognized, i.e., many networks containing multiple connections between any pair of nodes have been analyzed. The paper “Multidimensional Networks: Foundations of Structural Analysis” discusses the basis for multidimensional network analysis by presenting a solid repertoire of basic concepts and analytical measures, which take into account the general structure of multidimensional networks. The framework has been tested on different real-world multidimensional networks, showing the validity and meaningfulness of the measures introduced, which are able to extract important and non-random information about complex phenomena in such networks.

In “A Time Decoupling Approach for Studying Forum Dynamics”, the authors propose an approach that decouples temporal information about users into sequences of user events and inter-event times. Online forums are rich sources of information about user communication activity over time. Finding temporal patterns in online forum communication threads can advance our understanding of the dynamics of conversations. The main challenge of temporal analysis in this context is the complexity of forum data. There can be thousands of interacting users, who can be numerically described in many different ways. Moreover, user characteristics can evolve over time. The authors develop a new feature space to represent the event sequences as paths, and model the distribution of the inter-event times. They study over 30,000 users across four Internet forums, and discover novel patterns in user communication.

The paper titled “Who Blogs What: Understanding the Publishing Behavior of Bloggers” investigates bloggers' publishing style and impact by grouping bloggers based on an analysis of topical coverage and comparing their publishing behaviors. From a blog website with more than 370,000 posts, two types of bloggers are first identified: specialists and generalists. The authors then study and compare the respective publishing behaviors in the blogosphere, finding that bloggers with different topical coverage do behave in different ways. Specialists generally make more contributions than generalists and tend to publish more on weekdays, during business hours, and on a more regular basis. Moreover, they observe that specialists themselves differ in their publishing behaviors, with only a small fraction creating a large “buzz” or producing a voluminous output.

Online discussion threads are conversational cascades in the form of posted messages that can generally be found in social systems comprising many-to-many interaction, such as blogs, news aggregators or bulletin board systems. The paper “A likelihood-based framework for the analysis of discussion threads” proposes a framework based on generative models of growing trees to analyze the structure and evolution of discussion threads. The authors consider the growth of a discussion to be determined by an interplay between popularity, novelty and a trend (or bias) to reply to the thread originator. The relevance of these features is estimated using a full likelihood approach, which makes it possible to characterize the habits and communication patterns of a given platform and/or community. To evaluate their model, they apply the proposed framework to four popular websites: Slashdot, Barrapunto (a Spanish version of Slashdot), Meneame (a Spanish Digg clone) and the article discussion pages of the English Wikipedia.

Social recommender systems largely rely on user-contributed data to infer users' preference, which might introduce unreliability to recommenders as users are allowed to insert data freely. Although detecting malicious attacks from social spammers has been studied for years, detecting Noisy but Non-Malicious Users (NNMUs), i.e., genuine users who may provide some untruthful data due to their imperfect behaviors, remains an open research question. The paper “Noisy but Non-Malicious User Detection in Social Recommender Systems” studies how to detect NNMUs in social recommender systems. Based on the assumption that the ratings provided by the same user on closely correlated items should have similar scores, the authors propose an effective method for NNMU detection by capturing and accumulating a user's “self-contradictions”, i.e., the cases in which a user provides very different rating scores on closely correlated items. They show that self-contradiction capturing can be formulated as a constrained quadratic optimization problem w.r.t. a set of slack variables, which can be further used to quantify the underlying noise in each test user.

The paper titled “SocialSearch+: Enriching Social Network with Web Evidences” addresses the problem of searching for social network accounts, e.g., Twitter accounts, with the rich information available on the Web, e.g., people names, attributes, and relationships to other people. Existing solutions building upon naive textual matching inevitably suffer from low precision due to false positives (e.g., fake impersonator accounts) and false negatives (e.g., accounts using nicknames). To overcome these limitations, the authors leverage “relational” evidences extracted from the Web corpus, namely web-scale entity relationship graphs extracted from name co-occurrences on the Web and web-scale relational repositories, such as Freebase, with complementary strength. Using both textual and relational features obtained from these resources, a ranking function is learned to aggregate these features for the accurate ordering of candidate matches. Another key contribution of this paper is to formulate confidence scoring as a separate problem from relevance ranking. The proposed system is evaluated using real-life internet-scale entity-relationship and social network graphs.

Recommender systems utilizing Collaborative Filtering (CF) as the key algorithm are vulnerable to shilling attacks, which insert malicious user profiles into the systems to push or nuke the reputations of targeted items. There are only a small number of labeled users in most practical recommender systems, while a large number of users are unlabeled because it is expensive to obtain their identities. In “Shilling Attack Detection Utilizing Semi-supervised Learning Method for Collaborative Recommender System”, a new semi-supervised learning based shilling attack detection algorithm, namely Semi-SAD, is proposed to take advantage of both types of data. It first trains a naive Bayes classifier on a small set of labeled users, and then incorporates unlabeled users with EM to improve the initial naive Bayes classifier. Experiments on MovieLens datasets indicate that Semi-SAD can better detect various kinds of shilling attacks than others, especially against obfuscated and hybrid shilling attacks.

Mobile and pervasive computing technologies enable us to obtain real-world sensing data for sociological studies, such as exploring human behaviors and relationships. In “Understanding Social Relationship Evolution by Using Real-World Sensing Data”, the authors present a study of social relationship evolution using real-life anonymized mobile phone data. Through the study the authors show that social relationships (not only reciprocal friends and non-friends, but also non-reciprocal friends) can likely be predicted using real-world sensing data. In terms of friendship evolution, they verify that the principles of reciprocality and transitivity play an important role in social relation evolution.

The paper titled “Can Predicate-Argument Structures be used for Contextual Opinion Retrieval from Blogs?” presents the use of predicate-argument structures for contextual opinion retrieval. Unlike keyword-based opinion retrieval approaches, which use the frequency of certain keywords, their solution is based on the frequency of contextually relevant and subjective sentences. They use a linear relevance model that leverages semantic similarities among predicate-argument structures of sentences. The model features a linear combination of a popular relevance model, the proposed transformed terms similarity model, and the absolute value of a sentence subjectivity scoring scheme. The predicate-argument structures are derived from the grammatical derivations of natural language query topics and the well-formed sentences from blog documents. Evaluation and experimental results demonstrate the feasibility of using predicate-argument structures for contextual opinion retrieval.

Finally, we would like to thank all authors who submitted manuscripts for consideration, and over 120 anonymous dedicated reviewers for their criticism and time in helping us make final decisions. Without their valuable and strong support, we could not have made this special issue successful. Our sincere gratitude also goes to the WWWJ EiC, Prof. Yanchun Zhang, and to Ms. Jennylyn Roseiento and Mr. Hector Nazario from the Springer Journal Editorial Office for helping us present this special issue to readers.

DOI 10.1007/s11280-012-0174-4

Mesoscopic analysis of networks with genetic algorithms

Clara Pizzuti

Received: 15 July 2011 / Revised: 17 May 2012 /

Accepted: 22 May 2012 / Published online: 8 June 2012

Abstract The detection of communities is an important problem, intensively investigated in recent years, to uncover the complex interconnections hidden in networks. In this paper a genetic-based approach to discover communities in networks is proposed. The algorithm optimizes a simple but efficacious fitness function able to identify densely connected groups of nodes with sparse connections between groups. The method is efficient because the variation operators are modified to take into consideration only the actual correlations among the nodes, thus sensibly reducing the search space of possible solutions. Experiments on synthetic and real-life networks show the ability of the method to successfully detect the network structure.

Keywords genetic algorithms · data mining · clustering · community detection · networks

1 Introduction

The suitability of networks to represent many real-world systems has given an impressive spur to the recent research area of complex networks. Collaboration networks, biological networks, communication and transport networks, the Internet, and the World Wide Web [25] are just some examples. Networks, in general, are constituted by a set of objects and by a set of interconnections among these objects. In social networks, for example, the objects are people and the connections represent social relations, such as common interests, friendship, religion, and so on. Members of networks and the relationships between them can be modeled as a graph of nodes and edges. Each participant is denoted by a distinct node, and interactions are represented by edges connecting two objects. Complex networks can be analyzed at different levels of granularity. The node level is the smallest scale to study. At this level the node degree can give valuable information on the role played by the objects participating in the network. More interestingly, the community or sub-graph level investigates the division of a network into groups (also called clusters or modules) having dense intra-connections and sparse inter-connections, thus delivering a mesoscopic description of a network where the elements are the communities and not the nodes. This partitioning is typical of many networks, thus the study of community structure can give important information and useful insights to understand how the structure of ties affects individuals and their relationships. In fact, members of a community interact with each other, they share information, and can have a remarkable influence on the behavior of the other objects of the community.

The problem of community detection has been receiving a lot of attention in the last few years, and many different approaches have been proposed [1,3,4,10,17,22,23,26,29,31–33,37,39].

C. Pizzuti (✉)
Institute for High Performance Computing and Networking (ICAR), Italian National Research Council (CNR), Via P. Bucci 41/C, 87036 Rende (CS), Italy
e-mail: pizzuti@icar.cnr.it

In this paper an algorithm, named GA-Net, to discover communities in networks by employing Genetic Algorithms (GAs) [14] is proposed. The approach introduces the concept of community score to measure the quality of a partitioning of a network into communities, and tries to optimize this quantity by running the genetic algorithm. All the dense communities present in the network structure are obtained at the end of the algorithm by selectively exploring the search space, without the need to know in advance the exact number of groups. Specialized variation operators reduce the space of possible solutions, thus improving the convergence of the algorithm. The method requires an input parameter that biases the search towards a different number of communities. The number of communities found is determined by the optimal value of the community score. Experiments on synthetic and real-life networks show the capability of the genetic approach to correctly detect communities, with results comparable to state-of-the-art approaches.

The paper is organized as follows. In the next section an overview of the main proposals of community detection algorithms is given. Section 3 provides the necessary background to formalize the problem and defines the quality metric employed to detect communities. In Section 4 a description of the method, along with the representation adopted and the variation operators used, is provided. In Section 5 the results of the method on synthetic and real-life data sets are presented. Section 6 discusses the advantages of using GA-Net. Finally, Section 7 concludes the paper.

2 Related work

Many different algorithms have been proposed to detect communities in complex networks [1,3,4,7,11,13,17,22,23,26,27,29,31–33,35,37,39]. In the following we review some of the best-known algorithms. Overviews of community identification methods in complex networks can be found in [6,8,10].

One of the most famous algorithms has been presented by Newman and Girvan in [11,29]. The method is a divisive hierarchical clustering method based on the iterative removal of edges from the network. The edge removal splits the network into communities. An agglomerative, instead of divisive, hierarchical algorithm that optimizes the concept of modularity, introduced in [29], is presented by Newman in [26]. The modularity is the fraction of edges inside communities minus the expected value of that fraction if edges fall at random without regard to the community structure. Values approaching 1 indicate strong community structure. Thus the algorithm computes the modularity of all the clusterings obtained by applying the hierarchical approach, and returns as its result the clustering having the highest value of modularity. A faster version of the method, based on the same strategy, is described in [4].

Recently, some studies [9] have indicated that the optimization of modularity has a main disadvantage: it can fail to find communities smaller than a fixed scale, even if these modules are well defined. The scale depends on the total size of the network and the interconnection degree of the modules. This resolution limit can constitute a weakness for all those methods whose optimization objective is modularity.
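The verbal definition of modularity given above (fraction of edges inside communities minus the value expected if edges fell at random) can be illustrated with a minimal Python sketch. It assumes an undirected, unweighted graph given as an edge list and uses the usual degree-based expectation; the function name `modularity` and the toy graph are hypothetical and only serve as an illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to a community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # sum of the degrees of the nodes in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Fraction of internal edges minus its expected value under random placement.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(toy_edges, toy_partition)  # 2 * (3/7 - (7/14)**2) = 5/14
```

Putting all six nodes into a single community gives a modularity of 0 under the same formula, which is why the two-triangle split is preferred.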

Wakita and Tsurumi [37] improved the method of [4] by identifying the cause of the inefficiency of this latter agglomerative method in the strategy adopted to merge communities. To this end they introduced three metrics that try to balance the size of the communities to be merged. The modularity criterion enriched with these metrics allows for an appreciable improvement of the algorithm's efficiency.

Radicchi et al. [32] proposed a divisive hierarchical algorithm to identify communities based on the concept of edge-clustering coefficient, defined in analogy with the node clustering coefficient (see footnote 1). The edge-clustering coefficient is the number of triangles an edge participates in, divided by the number of triangles it might belong to, given the degree of the adjacent nodes. Their algorithm works like that of Newman and Girvan, but it is faster. The main difference is that instead of choosing to remove the edge with the highest edge betweenness, the removed edges are those having the smallest value of edge-clustering coefficient. However, a quantitative measure for the evaluation of the dendrograms generated by the hierarchical approach is not defined. Thus the choice of one solution over another must rely on the intuitive concept of community that a user has.
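As a rough illustration of the edge-clustering coefficient just described, the following Python sketch counts the triangles on an edge and divides by the number of triangles the edge could belong to. Two points are assumptions made for this sketch and are not taken from the text: the "possible triangles" are taken to be min(k_u - 1, k_v - 1), and an edge with no possible triangle is given an infinite coefficient so that it is never removed first. The function name and toy graph are hypothetical.

```python
import math


def edge_clustering_coefficient(adj, u, v):
    """Triangles through edge (u, v) divided by the triangles it could belong to.

    adj: dict mapping each node to the set of its neighbours
         (undirected graph without self-loops).
    """
    triangles = len(adj[u] & adj[v])  # each common neighbour closes one triangle
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)  # assumed upper bound
    if possible == 0:
        return math.inf  # assumption: such an edge is never the first to be removed
    return triangles / possible


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge = edge_clustering_coefficient(toy_adj, 2, 3)  # no common neighbour: 0 / 2
inner = edge_clustering_coefficient(toy_adj, 0, 1)   # one triangle out of one: 1 / 1
```

The edge joining the two triangles has the smallest coefficient, so a divisive procedure of the kind described above would remove it first.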

Pons and Latapy [31] introduced an agglomerative hierarchical algorithm to compute the community structure of a network. The algorithm starts from a partition of the graph in which each node is a community, and then merges the two adjacent communities (i.e., having at least a common edge) that minimize the mean of the squared distances between each vertex and its community. The distances between communities are recomputed and the previous step is repeated until all the nodes belong to the same community. In order to decide the best partitioning to choose, the modularity criterion of Girvan and Newman is adopted.

Footnote 1: The clustering coefficient has been defined by [38]. Given a node $i$, let $n_i$ be the number of links connecting the $k_i$ neighbors of $i$ to each other. The clustering coefficient of $i$ is $C_i = 2 n_i / (k_i (k_i - 1))$, where $n_i$ represents the number of triangles passing through $i$, and $k_i (k_i - 1)/2$ the number of possible triangles that could pass through node $i$. The clustering coefficient of a graph is the average of the clustering coefficients of the nodes it contains.

Blondel et al. [3] presented a method that partitions large networks based on modularity optimization. The algorithm consists of two phases that are repeated iteratively until no further improvement can be obtained. At the beginning each node of the network is considered a community. Then, for each node i, all its neighbors j are considered, and the gain in modularity of removing i from its community and adding it to the community of j is computed. The node is placed in the community for which the gain is positive and maximum. If no community has positive gain, i remains in its original group. This first phase is repeated until no node move can improve the modularity. The second phase builds a network where the communities obtained are considered as the new nodes, and a link between two communities a and b exists if there is an edge between a node belonging to a and a node belonging to b. The network can be weighted; in such a case the weight of the edge between a and b is the sum of the weights of the links between nodes of the corresponding communities. At this point the method can be reiterated until no more changes can be made to improve modularity. The method is very accurate; however, it is unable to detect modules at a particular scale.
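The first phase described for the method of Blondel et al. can be sketched in Python as follows. This is a deliberately naive illustration, not the authors' implementation: it recomputes the modularity from scratch for every trial move instead of using an incremental gain formula, it handles only unweighted graphs, and it omits the second (aggregation) phase. The function names and the toy graph are hypothetical.

```python
def modularity(edges, label):
    """Modularity of the partition `label` (node -> community), unweighted graph."""
    m = len(edges)
    inside, degree = {}, {}
    for u, v in edges:
        for c in (label[u], label[v]):
            degree[c] = degree.get(c, 0) + 1
        if label[u] == label[v]:
            inside[label[u]] = inside.get(label[u], 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


def local_moving_phase(nodes, edges):
    """First phase only: move each node to the neighbouring community
    with the largest strictly positive modularity gain, until no move helps."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    label = {n: n for n in nodes}  # every node starts as its own community
    improved = True
    while improved:
        improved = False
        for i in nodes:
            current = modularity(edges, label)
            best_gain, best_label = 0.0, label[i]
            for j in sorted(neighbours[i]):
                if label[j] == label[i]:
                    continue
                trial = dict(label)
                trial[i] = label[j]  # tentatively move i into the community of j
                gain = modularity(edges, trial) - current
                if gain > best_gain + 1e-12:
                    best_gain, best_label = gain, label[j]
            if best_label != label[i]:
                label[i] = best_label
                improved = True
    return label


# Hypothetical toy graph: two triangles joined by a single edge.
toy_nodes = [0, 1, 2, 3, 4, 5]
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
singletons = {n: n for n in toy_nodes}
labels = local_moving_phase(toy_nodes, toy_edges)
```

Because a move is accepted only when it strictly increases modularity, the loop terminates, and the returned partition never has lower modularity than the initial singleton partition.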

Approaches to community detection based on Genetic Algorithms can be found in [7,13,22,35]. In [35] the authors present a genetic algorithm that uses as fitness function the network modularity proposed by Newman and Girvan. An individual is constituted by N genes, where N is the number of objects. The ith gene corresponds to the ith node, and its value is the community identifier of node i. They use a non-standard one-way crossover operation in which, given two individuals A and B, a community identifier j is chosen at random, and the identifier j of the nodes $j_1, \ldots, j_h$ of A is transferred to the same nodes of B.
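The one-way crossover described above can be sketched as follows in Python. The sketch assumes that an individual is a list whose position i holds the community identifier of node i; the function name, the optional `community_id` argument (added so that the example is deterministic) and the way the random identifier is drawn are assumptions for illustration, not details taken from [35].

```python
import random


def one_way_crossover(source, target, community_id=None, rng=random):
    """Copy one community of `source` onto a copy of `target`.

    source, target: lists of equal length; position i holds the community
    identifier of node i.
    """
    if community_id is None:
        # Assumption: draw an identifier that actually occurs in `source`.
        community_id = rng.choice(source)
    child = list(target)
    for node, label in enumerate(source):
        if label == community_id:
            child[node] = community_id  # same nodes, same identifier as in source
    return child


parent_a = [1, 1, 2, 2, 3]
parent_b = [5, 6, 6, 7, 7]
child = one_way_crossover(parent_a, parent_b, community_id=2)
# Nodes 2 and 3 carry identifier 2 in parent_a, so child == [5, 6, 2, 2, 7].
```

The operation is one-way because only the target individual is modified; the source individual is left unchanged.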

Gog et al. [13] proposed a collaborative evolutionary algorithm that also uses modularity as the fitness function to optimize. The main novelty of this approach is that each individual is endowed with knowledge about the best potential solution already obtained during the search process, and the value of its best ancestor. The sharing of this information helps the method to find significant community structure. Both of the above methods could fail to uncover community structure when the network contains modules satisfying the conditions of the resolution limit property stated in [9].

A different approach is described in [7], where a random walk distance measure between graphs is integrated in a genetic algorithm to cluster networks. The representation used is the k-medoids, where each cluster center is represented by one of the nodes of the network. The fitness function tries to minimize the sum of all the pair-wise distances between nodes. The main limitation of this approach is that the number k of clusters must be known in advance.

An agglomerative clustering method based on Genetic Algorithms has been proposed by Lipczak et al. [22]. In this approach each individual represents a single community, instead of the whole clustering solution. Two fitness functions are considered. The former considers the normalized cut, i.e., it assumes that a graph is divided into two disjoint sets A and B, and defines the score of this division as the fraction of all the connections between A and B with respect to the number of connections involving A and B separately. The other fitness function is essentially the modularity of Girvan and Newman. The authors compared their approach with UPGMA [34], a well-known hierarchical method, and showed the good performance of their approach. A main difference of this approach with respect to the other GA-based methods is the representation used. In fact Lipczak et al. proposed to represent each cluster with a chromosome, thus a solution is represented by the whole population. The motivation for this choice, as stated by the authors, was to reduce the size of an individual and the computational cost of the fitness. This kind of representation implies that the method, in order to obtain a partitioning of the network into k clusters, needs to use a population of k individuals. Thus the method must be executed for an increasing number of clusters, and thus a population of increasing size, to find the best result. Another drawback comes from the variable length of the individuals. In order to perform crossover, a mapping to the fixed-length representation of the two individuals involved in the crossover operation is needed. The mapping of a parent adds null genes in place of genes present in the other parent. This strategy partially defeats the objective of reducing the size of individuals.

Recently, the problem of community detection has been tackled by means of particle swarm optimization (PSO) [40]. In this approach a fixed number of particles are deployed onto the search space and move according to their velocity vector. Each particle has size equal to the number of nodes of the network and represents a partitioning. At each iteration, the fitness of the particles is computed, and the one having the best fitness is stored as the current best solution. The fitness function adopted is the modularity. The particles then update their position and velocity vector, and repeat the same steps until the stop condition is reached.

3 Community detection problem

A network $\mathcal{N}$ can be modeled as a graph $G = (V, E)$, where $V$ is a set of $n = |V|$ objects, called nodes or vertices, and $E$ is a set of $m = |E|$ links, called edges, that connect two elements of $V$. In the following, without loss of generality, the graph modeling a network is assumed to be undirected. A community in a network is a group of vertices (i.e., a sub-graph) having a high density of edges within them, and a lower density of edges between groups. In [8] it is observed that a formal definition of community does not exist because this definition often depends on the application domain. In this paper we assume the intuitive definition of weak community given by Radicchi et al. [32]. A weak community is interpreted as a set of nodes having a total number of intra-connections higher than the number of inter-connections among different communities. The partitioning of the graph $G$, modeling a network $\mathcal{N}$, into $k$ weak communities $\{S_1, \ldots, S_k\}$ can be transformed into that of partitioning the adjacency matrix $A$ of $G$ into $k$ sub-matrices, such that the sum of the densities of the sub-matrices is maximized.
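The weak-community condition can be checked with a few lines of Python. The sketch below assumes the degree-based reading of the definition, in which the sum of the internal degrees of the members is compared with the sum of their external degrees (so every internal edge is counted twice); the function name and the toy graph are hypothetical.

```python
def is_weak_community(adj, members):
    """True if the members have more intra-connections than inter-connections.

    adj: dict mapping each node to the set of its neighbours (undirected graph).
    members: iterable of the nodes forming the candidate community.
    """
    members = set(members)
    # Sum of internal degrees: each internal edge is seen from both endpoints.
    internal = sum(len(adj[v] & members) for v in members)
    # Sum of external degrees: edges leaving the candidate community.
    external = sum(len(adj[v] - members) for v in members)
    return internal > external


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
triangle_ok = is_weak_community(toy_adj, {0, 1, 2})  # internal 6, external 1
bridge_ok = is_weak_community(toy_adj, {2, 3})       # internal 2, external 4
```

Each triangle satisfies the condition, whereas the pair of nodes at the ends of the joining edge does not.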

A naive density measure for a sub-matrix of $n$ rows/columns is the number of ones (i.e., interactions) it contains. The higher the number of ones, the more connected the $n$ nodes. However, counting the number of interactions does not give any information about the interconnections among the nodes. A quality measure of a community $S$ that maximizes the in-degree of the nodes belonging to $S$ can be defined in terms of two quantities: $\frac{1}{|S|}\sum_{j \in S} A_{ij}$, the fraction of edges connecting node $i$ to the other nodes in $S$, and $\sum_{i,j \in S} A_{ij}$, which is twice the number of edges connecting vertices inside $S$, i.e., the number of 1 entries in the adjacency sub-matrix of $S$.

The community score CS gives a global measure of the network division into communities by summing up the local scores of each module found. The problem of community identification can then be formulated as the problem of maximizing CS.

In order to better explain the meaning of the community score, let $S$ be a group of nodes having $n_S$ nodes and $m_S$ edges, i.e., $m_S = |\{(u, v) \mid u \in S, v \in S\}|$. In these terms, the score of a community measures the density of the edges with respect to the number of nodes. This implies that, if a community $S$ has a high density of edges and is contained in another community $S'$ of lower density, the score of $S$ can be higher than that of $S'$, and the larger community could be split into many smaller communities. Figure 1 shows the scores of communities constituted by an increasing number of nodes $n_S = 8, 16, 32, 64, 128, 256$ when the number of edges increases from 2 to the maximum number of possible edges $n_S (n_S - 1)/2$. The figure points out that smaller and highly dense clusters can reach a score higher than larger, but less dense, groups of nodes. For example, consider the score of an 8-node community of maximum density equal to 1, i.e., a clique of 8 nodes. Its score, which is 0.875, is higher than the score of a community of 16 nodes having edge density less than 0.95. In the latter case, in fact, the score would be $\leq 0.8461$. Thus the 8-clique is preferred over the 16-node cluster. This behavior is emphasized when $r > 1$ and damped when $r < 1$, thus $r$ controls the size of a community $S$. In fact, since the quantity $\frac{1}{|S|}\sum_{j \in S} A_{ij} \leq 1$, the higher the value of $r$, the lower the value of $score(S)$ and, consequently, the lower the value of CS. Thus, increasing $r$ biases CS towards matrices containing a low number of zeroes but of lower volume, and communities of smaller size are found. Its value can be set on the basis of the resolution level desired. In the experimental results section we show that varying the value of $r$ allows for an analysis of the network at different hierarchical levels.

[Figure 1: community score versus edge density for communities of N = 8, 16, 32, 64, 128 and 256 nodes]
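The effect of the exponent r on the per-node quantity $\frac{1}{|S|}\sum_{j \in S} A_{ij}$ can be illustrated with a small Python sketch. The function below only raises that fraction to the power r and averages it over the nodes of S; both the averaging and the function name are assumptions made for illustration, and the function is not claimed to be the exact score(S) of the paper. For an 8-clique with r = 1 the value coincides with the 0.875 quoted above.

```python
def row_fraction_power_mean(adjacency, members, r=1.0):
    """Average over i in S of ((1/|S|) * sum_{j in S} A_ij) ** r.

    adjacency: square 0/1 matrix given as a list of lists.
    members: indices of the nodes forming the group S.
    """
    members = list(members)
    size = len(members)
    total = 0.0
    for i in members:
        # Fraction of the nodes of S that node i is connected to.
        fraction = sum(adjacency[i][j] for j in members) / size
        total += fraction ** r
    return total / size


# Adjacency matrix of a clique on 8 nodes (no self-loops).
clique8 = [[0 if i == j else 1 for j in range(8)] for i in range(8)]
value_r1 = row_fraction_power_mean(clique8, range(8), r=1.0)  # 7/8 for every row
value_r2 = row_fraction_power_mean(clique8, range(8), r=2.0)  # (7/8) ** 2
```

Since every fraction is at most 1, a larger r can only lower the value, which matches the statement that increasing r lowers the score.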

4 Genetic representation and operators

Genetic Algorithms [14] are a class of adaptive general-purpose search techniques inspired by natural evolution. They were proposed by Holland [16] in the early 1970s as computer programs that simulate the evolution process in nature. In the last few years genetic algorithms have proved to be competitive alternatives to traditional optimization and search techniques, and they have been applied to many problems in diverse research and application areas such as neural net evolution, planning and scheduling, machine learning and pattern recognition. A standard Genetic Algorithm (GA) evolves a constant-size population of elements (called chromosomes) by using the genetic operators of reproduction, crossover and mutation. Each chromosome represents a candidate solution to a given problem and is associated with a fitness value that reflects how good it is with respect to the other solutions in the population. Generally, a chromosome is encoded as a string of bits from a binary alphabet. The reproduction operator copies elements of the current population into the next generation with a probability proportionate to their fitness (this strategy is also called the roulette wheel selection scheme). The crossover operator generates two new chromosomes by crossing two elements of the population selected proportionate to their fitness. The mutation operator randomly alters the bits of the strings.
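Fitness-proportionate (roulette wheel) selection, as described above, can be sketched in Python with the standard library function `random.choices`, which draws items with probability proportional to the given weights. The wrapper name `roulette_wheel_select` and its arguments are hypothetical; non-negative fitness values with a positive sum are assumed.

```python
import random


def roulette_wheel_select(population, fitness, k, rng=random):
    """Draw k individuals (with replacement) with probability proportional
    to their fitness."""
    if len(population) != len(fitness):
        raise ValueError("population and fitness must have the same length")
    if any(f < 0 for f in fitness) or sum(fitness) <= 0:
        raise ValueError("fitness values must be non-negative with a positive sum")
    return rng.choices(population, weights=fitness, k=k)


# An individual with zero fitness is never selected.
picks = roulette_wheel_select(["a", "b", "c"], [0.0, 1.0, 0.0], k=5,
                              rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing parameter settings of a genetic algorithm.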

In the following we give a description of the algorithm GA-Net, the representation adopted for partitioning the network, and the genetic operators used.

Genetic representation. Our clustering algorithm uses the locus-based adjacency representation proposed in [30]. In this graph-based representation an individual of the population consists of N genes $g_1, \ldots, g_N$, and each gene can assume allele values j in the range {1, ..., N}. Genes and alleles represent nodes of the graph G = (V, E) modelling a network $\mathcal{N}$, and a value j assigned to the ith gene is interpreted as a link between the nodes i and j of V. This means that in the clustering solution found, i and j will be in the same cluster. Consider the network shown in Figure 2a. It consists of eleven nodes numbered from 1 to 11. The network can be partitioned into the three groups visualized by different colors and shapes of the nodes. Out of the many possible genotypes, the one shown in Figure 2b corresponds to the graph division given in Figure 2c. It is worth noting that the locus-based representation naturally fits the problem of community detection, since its decoding automatically identifies the number k of connected components, i.e. of communities. The nodes participating in the same component are assigned to one cluster. Furthermore, with respect to other approaches, such as [13, 35], that adopt a chromosome of length N storing the identifier of the community each node belongs to, the size of the search space reduces from $N^N$ for the cluster-based representation to $\prod_{i=1}^{N} k_i$, where $k_i$ is the degree of node i. Since networks are often sparse, the solution space is narrower, and thus the locus-based representation can considerably improve the efficiency of the genetic approach.
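The decoding step of the locus-based representation can be sketched as follows. This is only an illustration under stated assumptions: nodes are numbered from 0 rather than from 1, the genotype is a plain Python list, and the function name decode is hypothetical. Connected components are found with a small union-find structure; the paper does not say which method GA-Net uses for this step.

```python
def decode(genotype):
    """Map a locus-based genotype to cluster labels.

    genotype[i] == j means that nodes i and j are linked (0-based indices,
    an assumption of this sketch). Returns one label per node; nodes in the
    same connected component share a label.
    """
    n = len(genotype)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(genotype):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i  # merge the two components

    labels = {}
    # Number the components in order of first appearance.
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

The number k of communities is then simply the number of distinct labels, which is why it does not have to be fixed in advance.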


Figure 2 (a) A network modelled as a graph; (b) the locus-based representation of a genotype; (c) the graph-based structure of the genotype.

Objective function. We are interested in identifying a partitioning that optimizes the community score, because a high community score corresponds to densely intra-connected and sparsely inter-connected communities. The objective function is thus the community score CS, which GA-Net maximizes.

Initialization. The initialization process assigns to each node i one of its neighbors j. This guarantees a division of the network into connected groups of nodes.

Uniform crossover and mutation. The crossover operator adopted is uniform crossover. Given two parents, a random binary vector is created. Uniform crossover then selects the genes where the vector is 0 from the first parent and the genes where the vector is 1 from the second parent, and combines them to form the child. The main motivation for using uniform crossover is that it guarantees that the child individual maintains the effective connections of the nodes in the network. In fact, because of the biased initialization, each individual in the population is such that if a gene i contains a value j, then the edge (i, j) exists. Since the child at each position i contains a value j coming from one of the two parents, the edge (i, j) exists in the child as well. Figure 3 shows an example of crossover. Two parents, individuals A and B, and their graph-based representations are reported. Uniform crossover of A and B gives the child C. The mutation operator, analogously to the initialization process, randomly assigns to each node i one of its neighbors.
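The three operators just described can be sketched in Python as follows. The sketch assumes that the graph is given as an adjacency list neighbors, where neighbors[i] is the non-empty list of neighbors of node i (0-based), and that mutation is applied gene by gene with a probability rate; both the function names and the per-gene mutation rate are assumptions made for illustration, not details taken from the GA-Net implementation.

```python
import random


def init_individual(neighbors, rng=random):
    """Biased initialization: gene i receives one of the neighbors of node i."""
    return [rng.choice(neighbors[i]) for i in range(len(neighbors))]


def uniform_crossover(parent_a, parent_b, rng=random):
    """Take gene i from parent_a where the random mask is 0, from parent_b where it is 1."""
    mask = [rng.randint(0, 1) for _ in parent_a]
    return [b if bit else a for a, b, bit in zip(parent_a, parent_b, mask)]


def mutate(individual, neighbors, rate, rng=random):
    """With probability `rate` (assumed per gene), reassign gene i to a random neighbor of i."""
    return [
        rng.choice(neighbors[i]) if rng.random() < rate else gene
        for i, gene in enumerate(individual)
    ]
```

Because every gene of a child is copied from a parent or drawn from neighbors[i], each pair (i, child[i]) is an edge of the graph, which is the property the text relies on.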


Figure 3 Uniform crossover of two individuals A and B, their genotype, their graph-based representation, and the child generated C.

The algorithm works as follows. Given a network $\mathcal{N}$ and the graph G modeling it, GA-Net starts with a population initialized at random but such that each node is linked with one of its neighbors. Every individual generates a graph structure in which each component is a connected subgraph of G. For a fixed number of generations the genetic algorithm computes the fitness function of each individual and applies the specialized variation operators described above to produce the new population. The individual having the best community score is returned as the solution.

5 Experimental results

In this section we study the effectiveness of our approach on a synthetic data set. Then we test the results obtained by GA-Net on some real-world networks for which the partitioning in communities is known, and compare them with the methods of [4] (referred to as CNM), [3] (referred to as BGLL), and [31] (referred to as PL). Furthermore, the results obtained by Xiaodong et al. in [40] with their particle swarm optimization approach (referred to as PSO) are also reported. Finally, GA-Net and BGLL are compared on some real-life networks for which the network division is not known. In all the cases we show that our genetic algorithm successfully detects the network structure and is competitive with the other approaches. The GA-Net algorithm has been written in MATLAB 4.3 R2010a, using the Genetic Algorithms and Direct Search Toolbox 2. In order to set parameter values, a trial and error procedure was employed, and the parameter values giving good results for the benchmark data sets were selected. Thus we set the crossover rate to 0.8, the mutation rate to 0.2, elite reproduction to 10% of the population size, and used the roulette selection function. The population size was 100 and the number of generations 100. For all the data sets, the statistical significance of the results produced by GA-Net has been checked by performing a t-test at the 5% significance level. The p-values returned are, on average, below 0.05E-10; thus the significance level is very high, since the probability that a community computed by GA-Net could be obtained by chance is very low.

5.1 Evaluation metrics

The quality of the partitioning obtained can be evaluated by using validity indices. The validity indices can be internal, i.e. they rely on the connections and separation between the groups, or external, i.e. they use additional data to assess the clustering outcomes. In this paper, an external measure, the Normalized Mutual Information (NMI), has been adopted to estimate the similarity between the true partitions and the detected ones, and an internal one, the modularity introduced by Girvan and Newman, to measure the density of the links inside a community with respect to the links between communities.

The Normalized Mutual Information is a similarity measure proved to be reliable by [5]. Given two partitions A and B of a network in communities, let C be the confusion matrix whose element $C_{ij}$ is the number of nodes of community i of the partition A that are also in the community j of the partition B. The normalized mutual information NMI(A, B) is defined as:

$$NMI(A, B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} C_{ij} \log\left(\frac{C_{ij} N}{C_{i.} C_{.j}}\right)}{\sum_{i=1}^{c_A} C_{i.} \log\left(\frac{C_{i.}}{N}\right) + \sum_{j=1}^{c_B} C_{.j} \log\left(\frac{C_{.j}}{N}\right)}$$

where $c_A$ ($c_B$) is the number of groups in the partition A (B), $C_{i.}$ ($C_{.j}$) is the sum of the elements of C in row i (column j), and N is the number of nodes. If A = B, NMI(A, B) = 1. If A and B are completely different, NMI(A, B) = 0.
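As an illustration, the NMI between two partitions can be computed directly from the confusion matrix. The sketch below assumes the standard form of the measure of [5], namely $-2 \sum_{ij} C_{ij} \log(C_{ij} N / (C_{i.} C_{.j}))$ divided by $\sum_i C_{i.} \log(C_{i.}/N) + \sum_j C_{.j} \log(C_{.j}/N)$, with zero entries of C contributing nothing. Partitions are given as lists of cluster labels, one per node; the function name nmi and the value returned when both partitions consist of a single group are assumptions of this sketch.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same nodes."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # non-zero entries C_ij
    row = Counter(labels_a)                   # row sums C_i.
    col = Counter(labels_b)                   # column sums C_.j
    numerator = -2.0 * sum(
        c * log(c * n / (row[a] * col[b])) for (a, b), c in joint.items()
    )
    denominator = sum(c * log(c / n) for c in row.values()) + sum(
        c * log(c / n) for c in col.values()
    )
    if denominator == 0:
        # Both partitions put every node in one group (assumed to count as identical).
        return 1.0
    return numerator / denominator
```

Only the labels' grouping matters, not their names: relabeling the clusters of either partition leaves the value unchanged.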

The modularity of [29] is a well known quality function to evaluate the goodness of a partition. The idea underlying the modularity is that a random graph has no clustering structure; thus the edge density of a cluster should be higher than the expected density of a subgraph whose nodes are connected at random. This expected edge density depends on a chosen null model. Modularity can be written in the following way:

$$Q = \frac{1}{2m} \sum_{ij} (A_{ij} - P_{ij}) \, \delta(C_i, C_j)$$

where A is the adjacency matrix of the graph, m is the number of edges, $P_{ij}$ is the expected number of edges between nodes i and j in the null model, $C_i$ is the community of node i, and δ is the Kronecker function, which yields one if i and j are in the same community, zero otherwise. When it is assumed that the random graph has the same degree distribution as the original graph, $P_{ij} = \frac{k_i k_j}{2m}$, where $k_i$ and $k_j$ are the degrees of nodes i and j respectively. Thus the modularity expression becomes:

$$Q = \sum_{s=1}^{k} \left[ \frac{l_s}{m} - \left( \frac{d_s}{2m} \right)^2 \right]$$

where k is the number of modules found inside a network, $l_s$ is the total number of edges joining vertices inside the module s, and $d_s$ is the sum of the degrees of the nodes of s. Thus the first term of each summand is the fraction of edges inside a community, and the second one is the expected value of the fraction of edges that would fall inside it if edges were placed at random without regard to the community structure. Values approaching 1 indicate strong community structure.
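The per-module form of the modularity, $Q = \sum_s [\, l_s/m - (d_s/2m)^2 \,]$, translates into a few lines of Python. The sketch assumes an undirected simple graph without self-loops, given as a list of edges in which each edge appears once, and a list of labels assigning each node to a module; the function name is hypothetical.

```python
def modularity(edges, labels):
    """Modularity of a partition: sum over modules of l_s/m - (d_s/(2m))^2."""
    m = len(edges)
    internal = {}  # l_s: number of edges with both endpoints in module s
    degree = {}    # d_s: sum of the degrees of the nodes of module s
    for u, v in edges:
        degree[labels[u]] = degree.get(labels[u], 0) + 1
        degree[labels[v]] = degree.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(
        internal.get(s, 0) / m - (d / (2 * m)) ** 2 for s, d in degree.items()
    )
```

For two triangles joined by a single edge (m = 7), splitting the graph into the two triangles gives $l_s = 3$ and $d_s = 7$ for each module, hence $Q = 2\,(3/7 - 1/4) = 5/14$, while placing all nodes in one module gives Q = 0.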

5.2 Synthetic data set

In order to check the ability of our approach to successfully detect the community structure of a network, we use the benchmark proposed by [19], which is an extension of the classical benchmark proposed by [11]. The network consists of 128 nodes divided into four communities of 32 nodes each. Every node has an average degree of 16 and shares a fraction γ of its links with the nodes of the other communities, and 1 − γ of its links with the nodes of its own community. γ is called the mixing parameter. When γ < 0.5 a node has more neighbors inside its group than in the other three groups, thus a good algorithm should discover the groups. We generated 100 different networks for values of γ ranging from 0.1 to 0.5, and computed the Normalized Mutual Information to measure the similarity between the true partitions and the detected ones, and the modularity to evaluate the goodness of the partitioning obtained.

Figures 4 and 5 show the normalized mutual information and the modularity, averaged over the 100 runs, for different values of the exponent r when the mixing parameter γ increases from 0.1 to 0.5. The figures point out that, when the fuzziness of modules is low (γ ≤ 0.2), independently of the r value, GA-Net is able to recover almost 90% of the community structure and obtains good modularity values. However, when the mixing parameter increases, higher values of r help in the retrieval of the true community structure. Notice that for γ = 0.5 each node has half of its links inside its community and the other half with the rest of the network; thus it is very difficult to identify the hidden groups, because the communities are mixed with each other. Tables 1 and 2 report the average values, over the 100 runs, of the normalized mutual information and modularity, respectively, along with the standard deviation. The tables point out the very low values of the standard deviation. This means that the differences among the clusterings found over the 100 runs are negligible.

Figure 4 Normalized mutual information values obtained by GA-Net on the synthetic network for different values of the exponent r when the mixing parameter γ varies from 0.1 to 0.5. (Plot: x-axis, value of the exponent r; one curve for each of γ = 0.1, 0.2, 0.3, 0.4, 0.5.)

5.3 Real-life networks with known community division

We now show the application of GA-Net on four real-world networks well studied in the literature: Zachary's Karate Club network [41], Bottlenose Dolphins [24], Krebs' books on American politics [27], and the American College Football network [11], and compare our results with the algorithms of [3, 4, 31]. Furthermore, we report the modularity results obtained by the PSO approach, published in [40], on three of the four real-life networks. The number of real-life data sets is low because of the

Figure 5 Modularity values obtained by GA-Net on the synthetic network for different values of the exponent r when the mixing parameter γ varies from 0.1 to 0.5. (Plot: x-axis, value of the exponent r.)

Table 1 Normalized mutual information and corresponding standard deviation obtained by GA-Net on the synthetic data sets

For each network we ran GA-Net for values of r equal to 0.3, 0.5, 1, 1.5, and 2, and computed the average normalized mutual information and modularity, besides the best values of NMI and modularity over 100 runs. Each of the other methods produces a single result, the one optimizing the modularity value.

Table 3 shows the good performance of GA-Net with respect to the other approaches. On the Karate Club network GA-Net obtains the highest normalized mutual information of 0.826 for r = 0.3 and 0.5, and a best modularity value of 0.419 for r = 1, 1.5, 2. As regards Bottlenose Dolphins, the best NMI value of 0.888 is returned by GA-Net with r = 0.3, though GA-MOD obtains a modularity value of 0.519. On the Krebs' book network GA-Net finds best values of NMI and modularity of 0.590 and 0.525, respectively, for r = 0.3. Finally, on the American College Football data set, for r = 2, GA-Net obtains a best NMI value of 0.924 and a best modularity value of 0.6005, compared with 0.926 and 0.601 of Blondel et al. The modularity values obtained by the particle swarm optimization approach on the first three networks are instead rather poor, which suggests that the genetic approach is the better choice on these networks. It is worth noting that the optimization of modularity does not necessarily correspond to the maximization of the normalized mutual information. In fact, as pointed out by [15], the optimal partition returned by the best modularity value may not coincide with the partition that correctly identifies the intuitive community division. These observations corroborate the belief that the input parameter r is not a limitation, but rather a means to study community structure. In the next section some suggestions on the choice of this parameter are provided.

Table 2 Modularity and corresponding standard deviation obtained by GA-Net on the synthetic data sets

Table 3 Best NMI results obtained by GA-Net and the other algorithms for the real-life data sets (in each row, the first five values refer to GA-Net with r = 0.3, 0.5, 1, 1.5, 2; any further values refer to the other algorithms)

Dolphins  avg MOD   0.379  0.482  0.454  0.457  0.429
Dolphins  best NMI  0.888  0.593  0.462  0.454  0.467  0.573  0.450  0.675
Dolphins  best MOD  0.379  0.509  0.486  0.493  0.491  0.495  0.495  0.517  0.331
Krebs     avg NMI   0.564  0.489  0.434  0.423  0.406
Krebs     avg MOD   0.524  0.510  0.489  0.457  0.428
Krebs     best NMI  0.590  0.518  0.456  0.470  0.448  0.530  0.442  0.543
Krebs     best MOD  0.525  0.516  0.499  0.484  0.477  0.502  0.515  0.515  0.412
Football  avg NMI   0.167  0.820  0.851  0.820  0.904
Football  avg MOD   0.175  0.389  0.548  0.510  0.575
Football  best NMI  0.491  0.879  0.881  0.883  0.924  0.762  0.926  0.879
Football  best MOD  0.378  0.588  0.584  0.565  0.6005 0.577  0.601  0.602

5.4 Study of the r parameter

As pointed out, the r parameter allows for an analysis of the community structure at different hierarchical levels, each corresponding to a different number of clusters. The value to use can be chosen by a user on the basis of the resolution level desired. A more systematic approach is to consider the concept of stability of a partitioning of a network, as introduced in [2] and employed in [20]. A partition of a network is considered stable if it can be destroyed only by considerably changing the parameter r for which it was obtained. Since by varying r different community structures are found with different modularity values, the plot of the modularity value with respect to r can present plateaus, and the length of a plateau can give a criterion to choose the best value of r. In order to show the feasibility of this approach, GA-Net has been executed on Zachary's Karate Club network for values of the exponent r ranging from 0.1 to 2. Figure 6 shows the change in average modularity value for increasing r, while Figure 7 reports the number of clusters found with respect to the r values. Figure 6 points out a plateau for 0.5 ≤ r ≤ 0.9, which corresponds to the network division in 4 clusters depicted in Figure 8c. Actually this is the best division found with respect to the modularity value, but it does not correspond to the true division of the Karate Club in two groups, displayed in Figure 8a. When r = 0.3 or 0.4 GA-Net finds the three communities shown in Figure 8b. The smallest one, constituted by the nodes 5, 6, 7, 11, 17, is a subgroup of the community on the left. By increasing r above 0.9 the modularity value diminishes and a higher number of groups is produced. For example, the community on the right of Figure 8d is split in three sub-groups for r = 1. Thus studying the stability of a partitioning can provide an effective criterion for choosing the value of the r parameter.
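The plateau criterion can be made concrete with a small helper that scans a modularity curve sampled at increasing values of r and returns the longest run of nearly constant values. This is only a sketch: the function name, the tolerance tol and the requirement that the input lists are non-empty are assumptions, and the paper does not state how plateaus were identified.

```python
def longest_plateau(r_values, q_values, tol=1e-3):
    """Return (r_start, r_end) of the longest run where modularity stays within tol.

    Each run is compared against its own first value; tol is an assumed tolerance.
    """
    best_start, best_end = 0, 0  # half-open index range of the best run so far
    start = 0
    for i in range(1, len(q_values) + 1):
        run_ended = i == len(q_values) or abs(q_values[i] - q_values[start]) > tol
        if run_ended:
            if i - start > best_end - best_start:
                best_start, best_end = start, i
            start = i
    return r_values[best_start], r_values[best_end - 1]
```

Given modularity values sampled at r = 0.1, 0.2, ..., a curve that is flat between r = 0.5 and r = 0.9 would yield the pair (0.5, 0.9), and any value of r inside that interval could then be used.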

5.5 Real-life networks with unknown community division

The normalized mutual information and modularity employed to compare GA-Net with the other approaches, though the most popular, have some limitations. In fact, the NMI is applicable only to networks for which the network partition is known. On the other hand, the assessment of a method with a criterion that coincides with the fitness function it optimizes could bias the validation phase. Recently, Leskovec et al. [21] have compared a range of community detection methods by introducing different measures. They observe that the concept of a good cluster relies on two criteria. The first is the number of edges between the members of the cluster, the second is the number of edges between the members of the cluster and the rest of the network. Thus they group quality indices in two categories: multi-criterion scores, that combine both criteria, and single-criterion scores, that are based on only one criterion. Modularity is a single-criterion score. In the following we report some of the multi-criterion indices, defined to capture the notion of cluster quality, and generalize them to evaluate network structures with different numbers of communities. In particular we compare our approach and that of Blondel et al. [3] with respect to modularity and the multi-criterion scores. The networks considered are the adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens [28], the network of Jazz musicians [12], and the metabolic network of C. Elegans [18].

Figure 6 Change in average modularity for different values of the exponent r.

Figure 7 Change in the number of communities found for different values of the exponent r.

Figure 8 (a) True partition of the Karate Club. (b) Network partition with r = 0.3. (c) Network partition with r = 0.5. (d) Network partition with r = 1.

Let G = (V, E) be the graph modeling a network with n = |V| nodes and m = |E| edges. Let S be a cluster of nodes having $n_s$ nodes and $m_s$ edges, and let $c_s = |\{(u, v) \in E \mid u \in S, v \notin S\}|$ be the number of edges on the boundary of S. Let $\{S_1, \ldots, S_k\}$ be the partition of G in k clusters. The following metrics, taken from [21], capture the concept of quality of a community structure.

Conductance: it measures the fraction of edges pointing outside a community.

Internal Density: it measures the internal edge density of a community.
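For a single cluster S, these scores can be computed from $n_s$, $m_s$ and $c_s$ as in the sketch below. The formulas are the per-cluster definitions as we recall them from [21]: conductance $c_s / (2 m_s + c_s)$, expansion $c_s / n_s$, cut ratio $c_s / (n_s (n - n_s))$ and internal density $1 - m_s / (n_s (n_s - 1)/2)$. They are stated here as assumptions, since the generalization to a whole partition used in this paper is not reproduced. The function names, and the convention that a ratio with zero denominator is taken as 0, are hypothetical.

```python
def _ratio(numerator, denominator):
    """Division that returns 0.0 for a zero denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def cluster_scores(edges, cluster, n):
    """Per-cluster quality scores for node set `cluster` in a graph with n nodes.

    `edges` lists each undirected edge once; definitions assumed from [21].
    """
    members = set(cluster)
    n_s = len(members)
    m_s = sum(1 for u, v in edges if u in members and v in members)    # internal edges
    c_s = sum(1 for u, v in edges if (u in members) != (v in members))  # boundary edges
    return {
        "conductance": _ratio(c_s, 2 * m_s + c_s),
        "expansion": _ratio(c_s, n_s),
        "cut_ratio": _ratio(c_s, n_s * (n - n_s)),
        "internal_density": 1 - _ratio(m_s, n_s * (n_s - 1) / 2),
    }
```

Under these definitions lower values indicate better clusters for all four scores; in particular internal density is 0 for a clique. One simple way to score a partition would be to average each score over its clusters, but that choice is an assumption and not necessarily the one adopted in the paper.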

Table 4 reports the validity indices computed for GA-Net with different values of the r parameter, and for BGLL. From the table it can be observed that while BGLL obtains higher values of modularity and conductance for all the networks considered, GA-Net performs better on Internal Density and Cut Ratio for all the networks, and on Expansion for the Jazz and Adjnoun networks. These results suggest that the community score adopted by GA-Net finds smaller and highly dense groups of nodes having few edges towards the remaining network. These clusters substantially differ from those obtained by optimizing the modularity function, which, as already said, finds groups of nodes having a density higher than that expected in a random graph.

Table 4 Best scores obtained by GA-Net and BGLL algorithm for real-life data sets


6 Discussion

Community detection in complex networks has captured a lot of interest in the last few years, and the introduction by Newman and Girvan [29] of the quantitative measure of modularity to assess the quality of a partitioning in communities has stimulated and advanced the research to uncover community structure. Recently, however, it has been proved that the optimization of modularity has a resolution limit that depends on the total size of the network and the interconnections of the modules. In [9] it is shown that modularity has an intrinsic scale such that modules below this scale, even if tightly connected, cannot be found. This limit implies the important drawback that searching for a partitioning of maximum modularity may lead to solutions in which important structures at small scales are not discovered. All the methods presented in the previous section, except GA-Net, suffer from this problem. Consider the network depicted in Figure 9, composed of 4 cliques: two identical cliques of 10 nodes and two identical cliques of 5 nodes. None of BGLL, PL, and CNM is capable of distinguishing the two small cliques. They return a partitioning in which these two small cliques are merged, with a maximum modularity value of 0.5471. It is worth noticing that Blondel et al. [3] state that their approach seems to elude the resolution limit thanks to the multilevel nature of their method. However, as the above example shows, they only partially circumvent the problem. GA-Net, instead, perfectly discriminates the two small cliques, obtaining a modularity value of 0.5356, for values of r ≥ 0.8, and merges them for lower values of r. This means that the search for communities that maximize the community score does not suffer from scale problems and has the main advantage of allowing the analysis of the network at different granularity levels. A user can thus decide at which hierarchical depth to explore the structure of the network, or adopt the strategy described in the previous section to obtain the most thorough information about its modular organization. Furthermore, it is worth noting that the other scores introduced in the previous section show that our approach can outperform methods optimizing modularity when different metrics are adopted to evaluate the division of a network in communities.

Figure 9 Network showing the resolution limit of modularity.

Finally we want to point out that one of the main criticisms of genetic algorithms, compared with traditional optimization algorithms, is the high execution time required to generate a solution. The major limitation of evolutionary algorithms is, in fact, the repeated fitness function evaluation, which for complex problems could often be prohibitive. The problem is exacerbated when large populations of individuals are used and a high number of generations is executed to obtain an approximately optimal solution. In our approach fitness evaluation is rather simple and can be computed in linear time, thus the main problem comes from the network size. Figure 10 shows how the execution time (in seconds) increases when the number of nodes grows from 128 to 1024. The figure indicates that the running time increases linearly with the size of the input, thus large networks could be processed if more powerful machines are available. Moreover, Genetic Algorithms are naturally suited to be implemented on parallel architectures [36], and an implementation of GA-Net on a parallel machine can be easily realized.

Figure 10 Execution times in seconds of GA-Net when the number of nodes increases from 128 to 1024.

7 Conclusions

The paper presented a genetic algorithm for detecting communities in networks. The approach introduced the concept of community score, and searches for an optimal partitioning of the network by maximizing the community score. All the dense communities present in the network structure are obtained at the end of the algorithm by selectively exploring the search space, without the need to know in advance the exact number of groups. The concept of community score, though simple, proved very effective. More importantly, it makes it possible to disclose the hierarchical organization of a network. Experiments on synthetic and real-life networks showed the ability of the genetic approach to correctly detect communities with results comparable to state-of-the-art approaches. It is worth noting that the real-life data sets presented in the paper to evaluate the method are rather small with respect to the very large networks available nowadays. It is known that Genetic Algorithms can require high execution times when large populations of individuals are used. On the other hand, they are naturally suited to be implemented on parallel architectures. In order to deal with very large networks and make the proposed approach competitive with the state-of-the-art methods that detect communities, we are planning to realize an implementation of GA-Net on a parallel machine.

3. Blondel, V.D., Guillaume, J., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. P10008 (2008)
4. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004)
5. Danon, L., Díaz-Guilera, A., Duch, J., Arenas, A.: Comparing community structure identification. J. Stat. Mech. P09008 (2005)
6. Danon, L., Duch, J., Arenas, A., Díaz-Guilera, A.: Community structure identification. Large Scale Structure and Dynamics of Complex Networks: From Information Technology to Finance and Natural Science, pp. 93–113. World Scientific (2007)
7. Firat, A., Chatterjee, S., Yilmaz, M.: Genetic clustering of social networks using random walk. Comput. Stat. Data Anal. 51(12), 6285–6294 (2007)
8. Fortunato, S.: Community detection in graphs. Phys. Rep. 486, 75–174 (2010)
9. Fortunato, S., Barthélemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. U.S.A. 104(1), 36–41 (2007)
10. Fortunato, S., Castellano, C.: Community structure in graphs. arXiv:0712.2716v1 [physics.soc-ph] (2007)
11. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. U.S.A. 99, 7821–7826 (2002)
12. Gleiser, P.M., Danon, L.: Community structure in Jazz. Adv. Complex Systems 6(4), 565–573
15. Good, B.H., de Montjoye, Y., Clauset, A.: The performance of modularity maximization in practical contexts. Phys. Rev. E 81(4), 046106 (2010)
16. Holland, J.H.: Adaptation in Natural and Artificial Systems. Univ. of Michigan Press, Ann Arbor, Mich. (1975)
17. Hopcroft, J.E., Khan, O., Kulis, B., Selman, B.: Natural communities in large linked networks. In: Proc. International Conference on Knowledge Discovery and Data Mining (KDD'03), pp. 541–546 (2003)
18. Jeong, H., Tombor, B., Albert, R., Oltvai, Z., Barabási, A.-L.: The large-scale organization of metabolic networks. Nature 407, 651–655 (2000)
19. Lancichinetti, A., Fortunato, S., Radicchi, F.: New benchmark in community detection. arXiv:0805.4770v2 [physics.soc-ph] (2008)
20. Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical community structure of complex networks. New J. Phys. 11, 033015 (2009)
21. Leskovec, J., Lang, K., Mahoney, M.W.: Empirical comparison of algorithms for network community detection. In: Proc. Int. World Wide Web Conference (WWW 2010), pp. 631–640 (2010)
22. Lipczak, M., Milios, E.: Agglomerative genetic algorithm for clustering in social networks. In: Proc. Genetic and Evolutionary Computation Conference (GECCO'09), pp. 1243–1250 (2003)
23. Lozano, S., Duch, J., Arenas, A.: Analysis of large social datasets by community detection. Eur.
28. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices.
32. Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and identifying communities in networks. Proc. Natl. Acad. Sci. U.S.A. 101(9), 2658–2663 (2004)
33. Schuetz, P., Caflisch, A.: Multistep greedy algorithm identifies community structure in real-world and computer-generated networks. Phys. Rev. E 78, 026112 (2008)
34. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman (1973)
35. Tasgin, M., Bingol, A.: Communities detection in complex networks using genetic algorithms. In: Proc. of the European Conference on Complex Systems (ECSS'06) (2006)
36. Tomassini, M.: Parallel and distributed evolutionary algorithms: a review. In: Chichester et al. (eds.) Evolutionary Algorithms in Engineering and Computer Science. J. Wiley and Sons (1999)
37. Wakita, K., Tsurumi, T.: Finding community structure in mega-scale social networks. arXiv:cs/0702048v1 (2007)
38. Watts, D.J.: Small Worlds. Princeton University Press (1999)
39. Wei, F., Quian, W., Wang, C., Zhou, A.: Detecting overlapped communities in networks. World Wide Web J. 12, 235–261 (2009)
40. Xiaodong, D., Cunrui, W., Xiangdong, L., Yanping, L.: Web community detection model using particle swarm optimization. In: Proc. of the IEEE Congress on Evolutionary Computation (CEC 2008), pp. 1074–1079 (2009)
41. Zachary, W.W.: An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33, 452–473 (1977)

DOI 10.1007/s11280-012-0190-4

Multidimensional networks: foundations of structural analysis

Michele Berlingerio · Michele Coscia · Fosca Giannotti · Anna Monreale · Dino Pedreschi

Received: 15 July 2011 / Revised: 6 February 2012 / Accepted: 10 September 2012 / Published online: 3 October 2012

Abstract Complex networks have been receiving increasing attention from the scientific community, thanks also to the increasing availability of real-world network data. So far, network analysis has focused on the characterization and measurement of local and global properties of graphs, such as diameter, degree distribution, centrality, and so on. In recent years, the multidimensional nature of many real-world networks has been pointed out, i.e. many networks containing multiple connections between any pair of nodes have been analyzed. Although the importance of analyzing this kind of network was recognized by previous works, a complete framework for multidimensional network analysis is still missing. Such a framework would enable analysts to study different phenomena, which can be either the generalization to the multidimensional setting of what happens in monodimensional networks, or a new class of phenomena induced by the additional degree of complexity that multidimensionality provides in real networks. The aim of this paper is then to give the basis for multidimensional network analysis: we present a solid repertoire of basic concepts and analytical measures, which take into account the general structure of multidimensional networks. We tested our framework on different real-world multidimensional networks, showing the validity and the meaningfulness of the measures introduced, which are able to extract important and non-random information about complex phenomena in such networks.

This work was done when Michele Berlingerio and Michele Coscia were with KDDLab, ISTI CNR, Pisa, Italy.

KDDLab, Dept. of Computer Science, University of Pisa, largo B. Pontecorvo 3, 56100 Pisa, Italy

A. Monreale
e-mail: annam@di.unipi.it

D. Pedreschi
e-mail: pedre@di.unipi.it

Keywords complex networks · social network analysis · World Wide Web

1 Introduction

In recent years, complex networks have been receiving increasing attention by thescientific community, also due to the availability of massive network data fromdiverse domains, and the outbreak of novel analytical paradigms, which pose at thecenter of the investigation relations and links among entities Examples are socialnetworks [3,8,14,16], technology networks [2,12], the World Wide Web [21,28],biological networks [24, 25], and so on Multidisciplinary and extensive researchhas been devoted to the extraction of non trivial knowledge from such networks.Predicting future links among the actors of a network [13,31], detecting and studyingthe diffusion of information among them [23,39], mining frequent patterns of users’behaviors [7,20,38,40], are only a few examples of problems studied in ComplexNetwork Analysis, that includes, among all, physicians, mathematicians, computerscientists, sociologists, economists and biologists

Most of the networks studied so far are monodimensional: there can be onlyone link between two nodes In this context, network analytics has focused on thecharacterization and measurement of local and global properties of such graphs, such

as diameter, degree distribution, centrality, connectivity—up to more sophisticateddiscoveries based on graph mining, aimed at finding frequent subgraph patterns andanalyzing the temporal evolution of a network

However, in the real world, networks are often multidimensional, i.e. there might be multiple connections between any pair of nodes. Therefore, multidimensional analysis is needed to distinguish among different kinds of interactions, or equivalently to look at interactions from different perspectives. This is analogous to multidimensional analysis in OLAP systems and data warehouses, where data are aggregated along various dimensions. By analogy, we refer to different interactions between two entities as dimensions.

Dimensions in network data can be either explicit or implicit. In the first case the dimensions directly reflect the various interactions in reality; in the second case, the dimensions are defined by the analyst to reflect different interesting qualities of the interactions, which can be inferred from the available data. This is exactly the distinction studied in [29], where the authors deal with the problem of community discovery. In their paper, our conception of multidimensional network is referred to as multislice, networks with explicit dimensions are named multiplex, and the temporal information is used to derive dimensions for the network.


Examples of networks with explicit dimensions are social networks where interactions represent information diffusion: email exchange, instant messaging services and so on. An example of a network with implicit dimensions is an on-line social network with several features: in Flickr, while the social dimension is explicit, two users may be connected implicitly by the sets of their favorite photos.

Moreover, different dimensions may reflect different types of relationship, or different values of the same relationship. This is exactly the distinction reported in Figure 1, where on the left we have different types of links, while on the right we have different values (conferences) for one relationship (for example, co-authorship).

To the best of our knowledge, however, the literature still lacks a systematic definition of a model for multidimensional networks, together with a comprehensive set of meaningful measures that are capable of characterizing both global and local analytical properties and the hidden relationships among different dimensions. This is precisely the aim of this paper: we develop a solid repertoire of basic concepts and analytical measures, which take into account the general structure of multidimensional networks, with the aim of answering questions like: what is the degree of a node considering only a given set of dimensions? How are two or more dimensions related to each other? What is the “redundancy” among all the dimensions? To what extent are one or more dimensions more important than others for the connectivity of a node?

Our contribution can be then summarized as follows:

– we introduce a few examples of real-world multidimensional networks;

– we formally define a set of measures aimed at extracting useful knowledge on multidimensional networks;

– we empirically test the meaningfulness and scalability of our measures, by means of an extensive case study on the networks presented.

Our analysis shows that the measures we define are both simple and meaningful, and open the way for a new chapter of complex network analysis.

We extend our previous work [10] by adding more measures to the framework, increasing our set of networks to embrace a wider range of real-world scenarios, and including a study on real-world application scenarios in which we show the meaningfulness of our measures.

The rest of the paper is organized as follows: in Section 2 we present a few examples of real-world multidimensional networks; Section 3 introduces the measures we define in this paper; Section 4 reports the empirical results obtained during our case study on real-world networks; in Section 5 we review a few related works; we conclude the paper in Section 6.

Figure 1 Example of multidimensional networks. [Figure not reproduced; edge labels visible in the extraction: SIGKDD, SIGMOD, ICDM.]

2 Multidimensional networks in reality

In the world as we know it we can see a large number of interactions and connections among information sources, events, people, or items, giving birth to complex networks. Enumerating all the possible networks detectable within our world, or their properties, would be difficult due to their number and heterogeneity, and it is beyond the scope of this paper. An excellent survey on complex networks can be found in [30], where the author classifies networks into social networks (where, for example, we find on-line social networks such as Facebook), information networks (for example, citation networks), technological networks (among which we mention the power grid, the train routes, or the Internet), and biological networks (e.g., protein interaction networks).

While all the example networks presented in [30] are monodimensional, in the real world it is possible to find many multidimensional networks. A few possible examples are:

Transportation Network If we think about the complete transportation network of a country (or the world), we can easily see that we can build a multidimensional network where nodes represent the cities, and each means of transportation is a dimension. In this way, each city is connected to all the other cities reachable from it by airplanes, buses, trains, ferries, or any other available means. As one can imagine, there will possibly be pairs of cities connected by more than one means (e.g. Paris is connected to Madrid by both train and airplane), and cities connected to the rest of the network by many means, or by just one of them (think about cities on islands). It is interesting to note that we are, in turn, used to “browsing” this network in its multidimensionality each time we travel: we take a train or a bus to reach the airport, then we fly from one city to another, then we take another means of transportation to reach our final destination. It is also clear that this network is an aggregation of monodimensional networks corresponding to each single means of transportation, and that, according to our classification given in the previous section, this is a network in which dimensions reflect different types of explicit connections.

Social Network Most of us nowadays use on-line services such as Facebook,1 Flickr,2 Skype,3 Google+,4 and so on. It is very common to have an account on many of them, because they provide different features, or we find different friends on them, or for any other reason. Each of us has a different user id in each of the networks, but if we join all the ids for every user, we can easily build a multidimensional social network, where any pair of people are connected by their friendship within the different monodimensional networks. Significantly, there exist several multi-platforms to connect a single user to his/her multiple accounts at the same time (Pidgin,5 Fring,6 or Nimbuzz7 are a few examples). As above, two nodes here are connected by different types of connections, but in this case the links are not necessarily explicit. Two users, for example, may be linked in Flickr just because they use the same set of tags, or they like the same pictures, even if they are not explicitly friends.

1 http://www.facebook.com

2 http://www.flickr.com

3 http://www.skype.com

4 http://plus.google.com

Co-authorship Network The aim of every conference is to gather together researchers in one particular area or topic. If we connect two authors by the papers they write together, it is clear that each conference, taken as a dimension, provides its edges among the authors. There are, however, authors who publish at the same set of conferences with most of their collaborators, while others (mostly senior researchers) have interests that span multiple fields or topics, leading them to have a different set of neighboring collaborators in different dimensions. In this case, given that the type of connection is co-authorship, different conferences are different values of the links connecting the authors.

Utility Network Most of our houses are connected to each other, or to main nodes, via different utility networks: water pipes, electric cables, phone and TV cables in fact build a multidimensional network in which we live every day, where each utility is a technological network connecting different houses and offices. While at the node level this multidimensional network is highly redundant (almost every node is served by every utility), the network structure (i.e., the distribution of the links) might differ. In addition, this network also presents meta-nodes and hyperedges, due to the presence of pipe or cable junctions, network routers, utility headquarters, and so on.

The above is only a short, non-exhaustive list of possible real-world networks. Many other examples, such as biological networks, other kinds of technological networks, social networks, peer-to-peer networks, and so on, can be found in reality.

2.1 Collected networks

While the above examples are all interesting and representative of a wide class of real-world networks with their properties, issues, and application scenarios, collecting data about them is not trivial and sometimes impossible. We present here a few multidimensional networks built from different datasets collected from various sources. The examples are real-world multidimensional networks, highly heterogeneous and representative of the possible different kinds of networks in the real world. We use these networks in the rest of the paper, to test our measures and to give possible application scenarios.

DBLP-C We created this network from the well known bibliographic database DBLP.8 We created a co-authorship network where the publication venues are used as dimensions. In this network, we considered some of the most important conferences in Data Mining: SIGKDD, ICDM, SDM, VLDB, SIGMOD and CIKM. The authors were connected in a specific dimension if they wrote at least one paper together in the corresponding conference. A small extract of this network is represented in Figure 2a.

5 http://www.piding.im

6 http://www.fring.com

7 http://www.nimbuzz.com

8 http://www.informatik.uni-trier.de/˜ley/db

Figure 2 Small extracts of the multidimensional networks built: (a) small example of DBLP-C; (b) small example of DBLP-Y; (c) small example of QueryLog; (d) small example of Flickr.

DBLP-Y From the same DBLP source, we also built a co-authorship network of authors, using years from 1955 to 2009 as dimensions, and connecting two authors (nodes) in a specific dimension if they wrote at least one paper together in the corresponding year. A small extract of this network is represented in Figure 2b.

QueryLog This network was constructed from a query log9 of approximately 20 million web-search queries submitted by 650,000 users, as described in [32]. Each record of this dataset stores a user ID, the query terms and the rank position of the result clicked by the user for the query. We extracted a word-word network of query terms (nodes), connecting two words if they appeared together in a query. The dimensions are defined as the rank positions of the clicked results, grouped into six almost equi-populated bins: “Bin1” for rank 1, “Bin2” for ranks 2–3, “Bin3” for ranks 4–6, “Bin4” for ranks 7–10, “Bin5” for ranks 11–58, “Bin6” for ranks 59–500. Hence two words that appeared together in a query for which the user clicked on a resulting URL ranked #4 produce a link in dimension “Bin3” between the two words. A small extract of this network is represented in Figure 2c.

9 http://www.gregsadetsky.com/aol-data
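The construction of the QueryLog dimensions can be illustrated with a short sketch. It is only an illustration under stated assumptions: the names `rank_to_bin` and `query_edges`, the whitespace tokenization of queries, and the sample query are hypothetical and are not taken from the original data preparation; only the bin boundaries come from the text.

```python
from bisect import bisect_left
from itertools import combinations

# Upper bound of each bin, as listed in the text:
# Bin1: 1, Bin2: 2-3, Bin3: 4-6, Bin4: 7-10, Bin5: 11-58, Bin6: 59-500.
UPPER_BOUNDS = [1, 3, 6, 10, 58, 500]


def rank_to_bin(rank):
    """Map the rank of a clicked result to its dimension label."""
    if not 1 <= rank <= UPPER_BOUNDS[-1]:
        raise ValueError(f"rank {rank} is outside the binned range 1-500")
    # bisect_left returns the index of the first upper bound >= rank.
    return f"Bin{bisect_left(UPPER_BOUNDS, rank) + 1}"


def query_edges(query, rank):
    """Return the labeled word-word edges produced by one clicked query.

    Assumption: query terms are separated by whitespace.
    """
    words = sorted(set(query.split()))
    label = rank_to_bin(rank)
    return {(a, b, label) for a, b in combinations(words, 2)}


# Hypothetical query whose clicked result was ranked #4.
print(query_edges("cheap flights paris", 4))
```

Every pair of distinct words in the query receives one edge in the dimension of the clicked rank, which mirrors the "Bin3" example given above.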


Table 1 Basic statistics of the networks used: number of nodes, edges, dimensions, average degree, average number of neighbors.

Note that k and N are equivalent when computed on one single dimension.

Flickr.10 This dataset comes from the well known photo sharing service, and was obtained by crawling the data via the available APIs. We extracted both implicit and explicit dimensions of the social network represented in this data. For each picture, we extracted the list of all the users related to it, and from these users we completed the social network by adding edges if two users commented on, tagged or set the same picture as favorite, or if they had each other as a contact.

The resulting network is a person-person network, where each dimension is one of “Friendship”, “Tag”, “Favorites”, or “Comment”, representing whether the users are friends, tagged the same picture, marked the same picture as favorite, or commented on the same picture. A small extract of this network is represented in Figure 2d.

Note that while for QueryLog we created our own concept of dimensions, which are therefore to be considered implicit, in DBLP the authors explicitly set their collaborations, so the dimensions are explicit. In turn, Flickr has one explicit dimension (friendship) and three implicit ones (tag, favorites and comments). Moreover, in QueryLog, as well as in DBLP-Y, the dimensions reflect different quantitative values of the same type of relationship, while for DBLP-C and for Flickr the dimensions are built on different types of connections among users, and are not comparable.

Table 1 shows the basic properties of the networks, for each dimension, and for the total network. Note that k and N are equivalent when computed on a single dimension, and that DBLP-Y and DBLP-C have different aggregated values as they were built as different subsets of the entire DBLP data.

10 http://www.flickr.com

3 Multidimensional network analysis

In the literature, many analytical measures, both at the local and at the global levels, have been defined in order to describe and analyze properties of standard, monodimensional networks. Defining meaningful measures provides several advantages in the analysis of complex networks. From the simplest measure, the degree of a node, to more sophisticated ones, like the betweenness centrality or the eigenvector centrality, several important results have been obtained in analyzing complex networks on real-world case studies. These interesting network analytical measures come under a different light when seen in the multidimensional setting, since the analysis scenario gets even richer, thanks to the availability of different dimensions to take into account. As an example, the connectivity of the whole network changes if we see a single dimension as a separate network, with respect to the network formed by all the edges in the entire set of dimensions. Moreover, it would be interesting to analyze the importance of a dimension with respect to another, the importance of a dimension for a specific node, and so on. As a consequence, in this novel setting it becomes indispensable: (a) to study how most of the measures defined for classical monodimensional networks can be generalized in order to be applied to multidimensional networks; and (b) to define new measures, meaningful only in the multidimensional scenario, to capture hidden relationships among different dimensions.

Thus, in the remainder of this section, we introduce the elements composing our model as follows. First, we introduce a mathematical model for multidimensional networks. Although this is not the only possibility (alternatives include tensors, among others), we found multigraphs to be a simple and versatile model that also allows for a simple and fast implementation of the measures (see Section 4). Then, we discuss the extension of monodimensional measures to the multidimensional setting. For the sake of simplicity, we only present one measure, the degree, although it is possible to extend most of the monodimensional measures following the same strategy of adding a parameter to the domain of the functions. Lastly, we introduce our multidimensional measures, meaningful only in the multidimensional setting. To give an overview, we introduce both measures that are local to the nodes, and measures that are global to the dimensions. The set of measures introduced is not meant to be complete: other measures can be defined, for example, at the intermediate level of the ego-networks, or they can assess links instead of nodes. For the sake of simplicity, however, we introduce only a few generic measures, together with toy examples meant to help understand their meaning, and we will explore in the future the possibility of introducing new ad-hoc measures that are meant to be used in specific application-driven contexts (for example, measures for evolving multidimensional networks, measures for semantic networks, and so on).

3.1 A model for multidimensional networks

We use a multigraph to model a multidimensional network and its properties. For the sake of simplicity, in our model we only consider undirected multigraphs and, since we do not consider node labels, hereafter we use edge-labeled undirected multigraphs, denoted by a triple G = (V, E, L) where: V is a set of nodes; L is a set of labels; E is a set of labeled edges, i.e. the set of triples (u, v, d) where u, v ∈ V are nodes and d ∈ L is a label. Also, we use the term dimension to indicate label, and we say that a node belongs to or appears in a given dimension d if there is at least one edge labeled with d adjacent to it. We also say that an edge belongs to or appears in a dimension d if its label is d. We assume that given a pair of nodes u, v ∈ V and a label d ∈ L only one edge (u, v, d) may exist. Thus, each pair of nodes in G can be connected by at most |L| possible edges. Hereafter P(L) denotes the power set of L.
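As a minimal sketch of this model, the triple G = (V, E, L) can be held in three Python sets. The function name `build_multigraph`, the toy edge list, and the choice of deriving V from the edges (so isolated nodes are not represented) are assumptions made for illustration; the toy network is hypothetical and is not the one used in the paper's figures.

```python
def build_multigraph(triples):
    """Build (V, E, L) from an iterable of (u, v, d) triples."""
    V, E, L = set(), set(), set()
    for u, v, d in triples:
        # Undirected edge: store a single orientation (smaller endpoint first).
        a, b = (u, v) if u <= v else (v, u)
        E.add((a, b, d))  # a set keeps at most one edge per (u, v, d)
        V.update((a, b))
        L.add(d)
    return V, E, L


# Hypothetical toy network with two dimensions.
TOY = [
    (1, 2, "d1"), (2, 1, "d1"),  # the same undirected edge given twice
    (1, 2, "d2"), (1, 3, "d1"), (2, 3, "d2"),
    (3, 4, "d1"), (3, 6, "d1"), (4, 5, "d1"), (4, 5, "d2"),
]
V, E, L = build_multigraph(TOY)
print(len(V), len(E), len(L))
```

Storing E as a set of canonical triples enforces the assumption that at most one edge (u, v, d) exists, so any pair of nodes is joined by at most |L| edges.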

3.2 Extending monodimensional measures

How can we extend the analytical measures defined on monodimensional networks to deal with multiple dimensions? In general, in order to adapt the classical measures to the multidimensional setting we need to extend the domain of each function in order to specify the set of dimensions for which they are calculated. Intuitively, when a measure considers a specific set of dimensions, a filter is applied on the multigraph to produce a view of it considering only that specific set, and then the measure is calculated over this view. In the following, due to space constraints, we show how to redefine only the well-known degree measure by applying the above approach. Note that most of the classical measures can be extended in a similar way.

In order to cope with the multidimensional setting, we can define the degree of a node w.r.t. a single dimension or a set of them. To this end, we have to redefine the domain of the classical degree function by including also the dimensions.

Definition 1 (Degree) Let v ∈ V and D ⊆ L be a node and a set of dimensions of a network G = (V, E, L), respectively. The function Degree : V × P(L) → N defined as

Degree(v, D) = |{(u, v, d) ∈ E s.t. u ∈ V ∧ d ∈ D}|

computes the number of edges, labeled with one of the dimensions in D, between v and any other node u.
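The definition translates directly into a filter over the labeled edges. The sketch below assumes that edges are stored as (u, v, d) triples in a Python set; the function name `degree` and the toy network are hypothetical and chosen only for illustration.

```python
# Hypothetical toy network: undirected edges stored as (u, v, d) triples
# with u < v, over the two dimensions "d1" and "d2".
E = {
    (1, 2, "d1"), (1, 2, "d2"), (1, 3, "d1"), (2, 3, "d2"),
    (3, 4, "d1"), (3, 6, "d1"), (4, 5, "d1"), (4, 5, "d2"),
}
L = {"d1", "d2"}


def degree(E, v, D):
    """Degree(v, D): number of edges adjacent to v whose label is in D."""
    return sum(1 for (a, b, d) in E if d in D and v in (a, b))


print(degree(E, 1, {"d1"}))  # edges of node 1 in dimension d1 only
print(degree(E, 1, L))       # edges of node 1 in the whole network
```

Passing D = L gives the degree in the whole network, while a singleton D gives the classical degree within one dimension.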

We can consider two particular cases: when D = L we have the degree of the node v within the whole network, while when the set of dimensions D contains only one dimension d we have the degree of v in the dimension d, which is the classical degree of a node in a monodimensional network. This kind of consideration also holds for any measure that can be extended to the multidimensional case in this way.

In order to illustrate the measures we define in this paper, we use a toy example, depicted in Figure 3, and show the application of the measures on it.

Example 1 Consider the multigraph in Figure 3 that models a multidimensional network with 2 dimensions: dimension d1 represented by a solid line, and dimension d2 represented by the dashed line. In this multigraph we have Degree(3, {d1}) = 2, Degree(3, {d2}) = 0 and Degree(2, {d1, d2}) = 3.


3.3 Multidimensional measures

In this section we define new measures for the multidimensional setting that are meaningful only in this scenario.

3.3.1 Neighbors

In classical graph theory the degree of a node refers to the connections of a node in a network: it is defined, in fact, as the number of edges adjacent to a node. In a simple graph, each edge is the sole connection to an adjacent node. In multidimensional networks the degree of a node and the number of nodes adjacent to it are no longer related, since there may be more than one edge between any two nodes. For instance, in Figure 3, the node 4 has five neighbors and degree equal to 7 (taking into account all the dimensions). In order to capture this difference, we define the following:

Definition 2 (Neighbors) Let v ∈ V and D ⊆ L be a node and a set of dimensions of a network G = (V, E, L), respectively. The function Neighbors : V × P(L) → N is defined as

Neighbors(v, D) = |NeighborSet(v, D)|

where

NeighborSet(v, D) = {u ∈ V | ∃(u, v, d) ∈ E ∧ d ∈ D}.

This function computes the number of all the nodes directly reachable from node v by edges labeled with dimensions belonging to D.
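A small sketch may help to see how Neighbors differs from Degree once parallel edges exist. The helper names and the toy network are hypothetical; edges are assumed to be stored as (u, v, d) triples without self-loops.

```python
# Hypothetical toy network: undirected edges stored as (u, v, d) triples.
E = {
    (1, 2, "d1"), (1, 2, "d2"), (1, 3, "d1"), (2, 3, "d2"),
    (3, 4, "d1"), (3, 6, "d1"), (4, 5, "d1"), (4, 5, "d2"),
}
L = {"d1", "d2"}


def degree(E, v, D):
    """Degree(v, D): number of edges adjacent to v whose label is in D."""
    return sum(1 for (a, b, d) in E if d in D and v in (a, b))


def neighbor_set(E, v, D):
    """NeighborSet(v, D): nodes joined to v by at least one edge labeled in D."""
    return {b if a == v else a for (a, b, d) in E if d in D and v in (a, b)}


def neighbors(E, v, D):
    """Neighbors(v, D) = |NeighborSet(v, D)|."""
    return len(neighbor_set(E, v, D))


for v in (3, 5):
    # Ratio 1.0: no redundant edges; lower values: parallel edges exist.
    print(v, neighbors(E, v, L), degree(E, v, L),
          neighbors(E, v, L) / degree(E, v, L))
```

In this toy network node 5 reaches its single neighbor through both dimensions, while every neighbor of node 3 is reached through exactly one edge.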

Note that, in the monodimensional case, the value of this measure corresponds to the degree. It is easy to see that Neighbors(v, D) ≤ Degree(v, D), but we can also easily say something about the ratio Neighbors(v, D)/Degree(v, D). When the number of neighbors is small, but each one is connected by many edges to v, we have low values of this ratio, which means that the set of dimensions is somehow redundant w.r.t. the connectivity of that node. This is the case of node 5 in the toy example illustrated. At the opposite extreme, the two measures coincide, and this ratio is equal to 1, which means that each dimension is necessary (and not redundant) for the connectivity of that node: removing any dimension would disconnect (directly) that node from some of its neighbors. This is the case of node 2 in Figure 3.

We also define a variant of the Neighbors function, which takes into account only the adjacent nodes that are connected by edges belonging exclusively to a given set of dimensions.

Figure 3 Toy example. Solid line is dimension 1, the dashed line is dimension 2.

Definition 3 (NeighborsXOR) Let v ∈ V and D ⊆ L be a node and a set of dimensions of a network G = (V, E, L), respectively. The function NeighborsXOR : V × P(L) → N is defined as

$$\mathit{NeighborsXOR}(v, D) = |\{u \in V \mid \exists d \in D : (u, v, d) \in E \,\wedge\, \nexists d' \notin D : (u, v, d') \in E\}|$$

It computes the number of neighboring nodes connected by edges belonging only to dimensions in D.
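One way to compute NeighborsXOR is as a set difference: the neighbors reachable inside D minus those also reachable through a dimension outside D. The sketch below follows that reading; the function names and the toy network are hypothetical, and L is passed explicitly so that the complement of D can be taken.

```python
# Hypothetical toy network: undirected edges stored as (u, v, d) triples.
E = {
    (1, 2, "d1"), (1, 2, "d2"), (1, 3, "d1"), (2, 3, "d2"),
    (3, 4, "d1"), (3, 6, "d1"), (4, 5, "d1"), (4, 5, "d2"),
}
L = {"d1", "d2"}


def neighbor_set(E, v, D):
    """NeighborSet(v, D): nodes joined to v by at least one edge labeled in D."""
    return {b if a == v else a for (a, b, d) in E if d in D and v in (a, b)}


def neighbors_xor(E, L, v, D):
    """NeighborsXOR(v, D): neighbors reachable only through dimensions in D."""
    inside = neighbor_set(E, v, D)
    outside = neighbor_set(E, v, set(L) - set(D))
    return len(inside - outside)


print(neighbors_xor(E, L, 1, {"d1"}))  # node 2 is also reachable via d2
print(neighbors_xor(E, L, 5, {"d1"}))  # node 5 has an alternative for its neighbor
```

When D = L the complement is empty, so NeighborsXOR coincides with Neighbors.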

3.3.2 Dimension relevance

One key aspect of multidimensional network analysis is to understand how important a particular dimension is over the others for the connectivity of a node, i.e. what happens to the connectivity of the node if we remove that dimension. We then define the new concept of Dimension Relevance.

Definition 4 (Dimension relevance) Let v ∈ V and D ⊆ L be a node and a set of dimensions of a network G = (V, E, L), respectively. The function DR : V × P(L) → [0, 1] is defined as

$$DR(v, D) = \frac{\mathit{Neighbors}(v, D)}{\mathit{Neighbors}(v, L)}$$

and computes the ratio between the neighbors of a node v connected by edges belonging to a specific set of dimensions in D and the total number of its neighbors.

Clearly, the set D might also contain only a single dimension d, for which the analyst might want to study the specific role within the network, to assess, for example, the importance of a single conference in DBLP-C over the others. However, in a multidimensional setting, this measure may still not cover important information about the connectivity of a node. Figure 3 shows two nodes (4 and 5) with a high dimension relevance for the dimension represented by a solid line. Specifically, in both cases the dimension relevance is equal to one, but the complete set of connections they present is different: if we remove the dimension represented with a solid line, the node 4 will be completely disconnected from some of its neighbors, for example it cannot reach the nodes 2, 3 and 7 anymore; while the node 5 can still reach all its neighbors. To capture these possible different cases we introduce a variant of this measure.

Definition 5 (Dimension relevance XOR) Let v ∈ V and D ⊆ L be a node and a set of dimensions of a network G = (V, E, L), respectively. DRXOR : V × P(L) → [0, 1] is defined as

$$DR_{XOR}(v, D) = \frac{\mathit{NeighborsXOR}(v, D)}{\mathit{Neighbors}(v, L)}$$

and computes the fraction of neighbors directly reachable from node v following edges belonging only to dimensions D.

Example 2 We can easily calculate the above measure for the nodes in Figure 3. As an example, for the node 8 there is no difference with the DR (Definition 4): all its neighbors are only reachable by solid edges. The opposite situation holds for node 5: all its neighbors are reachable by solid edges, but we always have an alternative edge. So the DRXOR of the solid line dimension is equal to zero.


In the following, we want to capture the intuitive intermediate value, i.e. the number of neighbors reachable through a dimension, weighted by the number of alternative connections.

Definition 6 (Weighted dimension relevance) Let v ∈ V and D ⊆ L be a node and a set of dimensions of a network G = (V, E, L), respectively. The function DRW : V × P(L) → [0, 1], called Weighted Dimension Relevance, is defined as

$$DR_W(v, D) = \frac{\sum_{u \in \mathit{NeighborSet}(v, D)} \frac{n_{uvd}}{n_{uv}}}{\mathit{Neighbors}(v, L)}$$

where: $n_{uvd}$ is the number of dimensions which label the edges between two nodes u and v and that belong to D; $n_{uv}$ is the number of dimensions which label the edges between two nodes u and v.
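The three variants of dimension relevance can be compared side by side with a short sketch. The function names and the toy network are hypothetical; returning 0.0 for a node without neighbors is an assumption made here, since the definitions leave that case unspecified.

```python
# Hypothetical toy network: undirected edges stored as (u, v, d) triples.
E = {
    (1, 2, "d1"), (1, 2, "d2"), (1, 3, "d1"), (2, 3, "d2"),
    (3, 4, "d1"), (3, 6, "d1"), (4, 5, "d1"), (4, 5, "d2"),
}
L = {"d1", "d2"}


def neighbor_set(E, v, D):
    """NeighborSet(v, D): nodes joined to v by at least one edge labeled in D."""
    return {b if a == v else a for (a, b, d) in E if d in D and v in (a, b)}


def labels_between(E, u, v):
    """Dimensions labeling the edges between u and v."""
    return {d for (a, b, d) in E if {a, b} == {u, v}}


def dr(E, L, v, D):
    """DR(v, D) = Neighbors(v, D) / Neighbors(v, L)."""
    total = len(neighbor_set(E, v, L))
    return len(neighbor_set(E, v, D)) / total if total else 0.0


def dr_xor(E, L, v, D):
    """DRXOR(v, D) = NeighborsXOR(v, D) / Neighbors(v, L)."""
    total = len(neighbor_set(E, v, L))
    only_d = neighbor_set(E, v, D) - neighbor_set(E, v, set(L) - set(D))
    return len(only_d) / total if total else 0.0


def dr_w(E, L, v, D):
    """DRW(v, D): each neighbor in D contributes n_uvd / n_uv."""
    total = len(neighbor_set(E, v, L))
    if not total:
        return 0.0  # assumption: isolated nodes get relevance 0
    acc = 0.0
    for u in neighbor_set(E, v, D):
        labels = labels_between(E, u, v)
        acc += len(labels & set(D)) / len(labels)
    return acc / total


for v in (1, 5):
    print(v, dr(E, L, v, {"d1"}), dr_xor(E, L, v, {"d1"}), dr_w(E, L, v, {"d1"}))
```

For both nodes in this toy network DR is 1 for dimension d1, while DRXOR and DRW separate the node that has alternatives for every neighbor from the node that has alternatives for only some of them.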

Hereafter we occasionally use DRs to indicate all the three variants of this measure. Note that DRXOR = 0 does not necessarily imply that the node is not connected through a particular dimension. It represents a situation where the node has no neighbors that can be reached exclusively through that particular dimension, so it is possible to reach them in alternative ways. In Figure 3, node 5 is an example of this, when considering the dashed (or solid) line dimension.

The Weighted Dimension Relevance takes into account both the situations modeled by the previous two definitions. Low values of DRW for a set of dimensions D are typical of nodes that have a large number of alternative dimensions through which they can reach their neighbors. High values, on the other hand, mean that there are fewer alternatives. Our example shows the case of node 5 when considering the solid line dimension: its DRW is clearly the highest, although the dashed line dimension has a high value of DR.

3.3.3 Highest and lowest redundancy connections nodes

We introduce two new concepts regarding the nodes of multidimensional networks: Highest Redundancy Connections (HRC) and Lowest Redundancy Connection (LRC) nodes. They are derived from the combination of the functions Degree and Neighbors. Intuitively, these measures describe the structure around a given node in terms of edge density: if the node is LRC this structure is sparse, while if the node is HRC it is dense and redundant.

Definition 7 (LRC) A node v ∈ V is said to be at Lowest Redundancy Connection (LRC) if each of its neighbors is reachable via only one dimension, i.e.,

∀u ∈ NeighborSet(v, L) : ∃! d ∈ L : (u, v, d) ∈ E.

Note that if a node v is LRC we have

Degree(v, L) = Neighbors(v, L).

Definition 8 (HRC) A node v ∈ V is called Highest Redundancy Connections (HRC) if each of its neighbors is reachable via all the dimensions in the network, i.e.,

∀u ∈ NeighborSet(v, L) : ∀d ∈ L : (u, v, d) ∈ E.

Note that if a node v is HRC we have

Degree(v, L) = Neighbors(v, L) × |L|.
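Both conditions can be checked per neighbor by counting the dimensions that label the edges towards it. The sketch below uses hypothetical function names and a hypothetical toy network; note that, read literally, both universally quantified conditions hold vacuously for a node with no neighbors.

```python
# Hypothetical toy network: undirected edges stored as (u, v, d) triples.
E = {
    (1, 2, "d1"), (1, 2, "d2"), (1, 3, "d1"), (2, 3, "d2"),
    (3, 4, "d1"), (3, 6, "d1"), (4, 5, "d1"), (4, 5, "d2"),
}
L = {"d1", "d2"}


def neighbor_set(E, v, D):
    """NeighborSet(v, D): nodes joined to v by at least one edge labeled in D."""
    return {b if a == v else a for (a, b, d) in E if d in D and v in (a, b)}


def labels_between(E, u, v):
    """Dimensions labeling the edges between u and v."""
    return {d for (a, b, d) in E if {a, b} == {u, v}}


def is_lrc(E, L, v):
    """LRC: every neighbor of v is reachable via exactly one dimension."""
    return all(len(labels_between(E, u, v)) == 1 for u in neighbor_set(E, v, L))


def is_hrc(E, L, v):
    """HRC: every neighbor of v is reachable via all the dimensions in L."""
    return all(labels_between(E, u, v) == set(L) for u in neighbor_set(E, v, L))


for v in sorted({x for (a, b, _) in E for x in (a, b)}):
    print(v, "LRC" if is_lrc(E, L, v) else "-", "HRC" if is_hrc(E, L, v) else "-")
```

A node can be neither LRC nor HRC, as happens here for nodes that reach some neighbors through one dimension and others through both.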

Example 3 In Figure 3 we have several LRC nodes: 1, 2, 3, 7, 8 and 9. Some of them appear in both dimensions (2 and 7), while other nodes appear in only one dimension (1, 3, 8 and 9). On the other hand we have only one HRC node: node number 5 is connected via both the dimensions with each of its neighbors.

In the “utility network” introduced in Section 2, most of the nodes are HRC, as most of the houses have electricity, water pipes, gas, and so on. On the other hand, in the “transportation network”, little islands are most likely to be LRC, as most of them are connected to their neighboring cities only by ferry (excluding the ones with little airports).

3.3.4 Dimension connectivity

Another interesting quantitative property of multidimensional networks to study is the percentage of nodes or edges contained in a specific dimension or that belong only to that dimension. To this end we also introduce the Dimension Connectivity and the Exclusive Dimension Connectivity on both the sets of nodes and edges.

Definition 9 (Node dimension connectivity) Let d ∈ L be a dimension of a network G = (V, E, L). The function NDC : L → [0, 1] defined as

$$NDC(d) = \frac{|\{u \in V \mid \exists v \in V : (u, v, d) \in E\}|}{|V|}$$

computes the ratio of nodes of the network that belong to the dimension d.

Definition 10 (Edge dimension connectivity) Let d ∈ L be a dimension of a network G = (V, E, L). The function EDC : L → [0, 1] defined as

$$EDC(d) = \frac{|\{(u, v, d) \in E \mid u, v \in V\}|}{|E|}$$

computes the ratio of edges of the network labeled with the dimension d.

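Definition 11 (Node exclusive dimension connectivity) Let d ∈ L be a dimension of a network G = (V, E, L). The function NEDC : L → [0, 1] defined as

$$NEDC(d) = \frac{|\{u \in V \mid \exists v \in V : (u, v, d) \in E \,\wedge\, \forall j \in L, j \neq d : \nexists w \in V : (u, w, j) \in E\}|}{|\{u \in V \mid \exists v \in V : (u, v, d) \in E\}|}$$

computes the ratio of nodes belonging only to the dimension d.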


Definition 12 (Edge exclusive dimension connectivity) Let d ∈ L be a dimension of a network G = (V, E, L). The function EEDC : L → [0, 1] defined as

$$EEDC(d) = \frac{|\{(u, v, d) \in E \mid u, v \in V \,\wedge\, \forall j \in L, j \neq d : (u, v, j) \notin E\}|}{|\{(u, v, d) \in E \mid u, v \in V\}|}$$

computes the ratio of edges between any pair of nodes u and v labeled with the dimension d such that there are no other edges between the same two nodes belonging to other dimensions j.
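The four connectivity measures can be sketched as simple counts over the edge set. The function names and the toy network are hypothetical. The exclusive variants are normalized by the nodes or edges of dimension d, a reading assumed here from the prose definitions, and returning 0.0 for a dimension without edges is a further assumption.

```python
# Hypothetical toy network: undirected edges stored as (u, v, d) triples
# with u < v, so that a pair of nodes always has the same orientation.
V = {1, 2, 3, 4, 5, 6}
E = {
    (1, 2, "d1"), (1, 2, "d2"), (1, 3, "d1"), (2, 3, "d2"),
    (3, 4, "d1"), (3, 6, "d1"), (4, 5, "d1"), (4, 5, "d2"),
}


def nodes_in(E, d):
    """Nodes that appear in dimension d."""
    return {x for (a, b, lab) in E if lab == d for x in (a, b)}


def ndc(V, E, d):
    """NDC(d): fraction of all nodes that appear in dimension d."""
    return len(nodes_in(E, d)) / len(V)


def edc(E, d):
    """EDC(d): fraction of all edges labeled with d."""
    return sum(1 for (_, _, lab) in E if lab == d) / len(E)


def nedc(E, d):
    """NEDC(d): fraction of the nodes of d that appear in no other dimension."""
    in_d = nodes_in(E, d)
    elsewhere = {x for (a, b, lab) in E if lab != d for x in (a, b)}
    return len(in_d - elsewhere) / len(in_d) if in_d else 0.0


def eedc(E, d):
    """EEDC(d): fraction of the edges of d whose endpoints share no other edge."""
    pairs_d = [(a, b) for (a, b, lab) in E if lab == d]
    other_pairs = {(a, b) for (a, b, lab) in E if lab != d}
    exclusive = [p for p in pairs_d if p not in other_pairs]
    return len(exclusive) / len(pairs_d) if pairs_d else 0.0


for d in ("d1", "d2"):
    print(d, ndc(V, E, d), edc(E, d), nedc(E, d), eedc(E, d))
```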

Example 4 In Figure 3 the EDC of dimension d1 is 0.61 since it has eight edges out of the 13 total edges of the network. Its EEDC is equal to 5/8 = 0.625. The NDC for the same dimension d1 is 0.88 (8 nodes out of 9) and its NEDC is 0.375 (3 unique nodes out of 8).

Table 3 presents the values of these measures computed on our real-world networks.

3.3.5 D-Correlation

The last aspect of multidimensional networks that we study in this paper is the interplay among dimensions. In the following we define two measures that, intuitively, give an idea of how redundant two dimensions are for the existence of a node or an edge. These two measures are based on the classical Jaccard correlation coefficient, but they extend it in order to cope with more than two sets.

Definition 13 (Node D-Correlation) Let D ⊆ L be a set of dimensions of a network G = (V, E, L). The Node D-Correlation is the function ρnodes : P(L) → [0, 1] defined as

$$\rho_{nodes}(D) = \frac{|\bigcap_{d \in D} V_d|}{|\bigcup_{d \in D} V_d|}$$

where $V_d$ denotes the set of nodes belonging to dimension d. It computes the ratio of nodes appearing in all the dimensions in D and the total number of nodes appearing in at least one dimension in D.

Definition 14 (Pair D-Correlation) Let D ⊆ L be a set of dimensions of a network G = (V, E, L). The Pair D-Correlation is the function ρpairs : P(L) → [0, 1] defined as

$$\rho_{pairs}(D) = \frac{|\bigcap_{d \in D} P_d|}{|\bigcup_{d \in D} P_d|}$$

where $P_d$ denotes the set of pairs of nodes (u, v) connected in dimension d. It computes the ratio of pairs of nodes connected in all the dimensions in D and the total number of pairs of nodes connected in at least one dimension in D.
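Both D-Correlation measures are Jaccard-style ratios of an intersection to a union over several sets, which the following sketch makes explicit. The function names and the toy network are hypothetical; rejecting an empty D and returning 0.0 when the union is empty are assumptions, since the definitions do not cover those cases.

```python
# Hypothetical toy network: undirected edges stored as (u, v, d) triples
# with u < v, so that a pair of nodes always has the same orientation.
E = {
    (1, 2, "d1"), (1, 2, "d2"), (1, 3, "d1"), (2, 3, "d2"),
    (3, 4, "d1"), (3, 6, "d1"), (4, 5, "d1"), (4, 5, "d2"),
}


def multi_jaccard(sets):
    """|intersection| / |union| over a non-empty list of sets."""
    if not sets:
        raise ValueError("D must contain at least one dimension")
    union = set.union(*sets)
    return len(set.intersection(*sets)) / len(union) if union else 0.0


def rho_nodes(E, D):
    """Node D-Correlation: V_d is the set of nodes appearing in dimension d."""
    return multi_jaccard(
        [{x for (a, b, lab) in E if lab == d for x in (a, b)} for d in D]
    )


def rho_pairs(E, D):
    """Pair D-Correlation: P_d is the set of node pairs connected in dimension d."""
    return multi_jaccard([{(a, b) for (a, b, lab) in E if lab == d} for d in D])


print(rho_nodes(E, ["d1", "d2"]), rho_pairs(E, ["d1", "d2"]))
```

With D = L, these two ratios correspond to the fractions of nodes and of pairs that are present in every dimension of the network.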

Figures 4, 5 and 6 show the behavior of these measures on our real-life networks.

When D = L, we can compute the percentage of nodes that exist in all the dimensions of the network, which we call Omni-Connected Nodes (OCN), and, by analogy, the percentage of pairs of nodes connected in all the dimensions, which we call Omni-Connected Pairs (OCP). Table 3 reports these percentages on our networks.
