BIG DATA
NETWORKS
Big Data Series
SERIES EDITOR Sanjay Ranka
AIMS AND SCOPE
This series aims to present new research and applications in Big Data, along with the computational tools and techniques currently in development. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of social networks, sensor networks, data-centric computing, astronomy, genomics, medical data analytics, large-scale e-commerce, and other relevant topics that may be proposed by potential contributors.
PUBLISHED TITLES
BIG DATA COMPUTING: A GUIDE FOR BUSINESS AND TECHNOLOGY MANAGERS
Vivek Kale
BIG DATA IN COMPLEX AND SOCIAL NETWORKS
My T Thai, Weili Wu, and Hui Xiong
BIG DATA OF COMPLEX NETWORKS
Matthias Dehmer, Frank Emmert-Streib, Stefan Pickl, and Andreas Holzinger
BIG DATA: ALGORITHMS, ANALYTICS, AND APPLICATIONS
Kuan-Ching Li, Hai Jiang, Laurence T. Yang, and Alfredo Cuzzocrea
NETWORKING FOR BIG DATA
Shui Yu, Xiaodong Lin, Jelena Mišić, and Xuemin (Sherman) Shen
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20161014
International Standard Book Number-13: 978-1-4987-2684-9 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Section I Social Networks and Complex Networks
Chapter 1 Hyperbolic Big Data Analytics within Complex and Social Networks
Eleni Stai, Vasileios Karyotis, Georgios Katsinis, Eirini Eleni Tsiropoulou and Symeon Papavassiliou
Chapter 2 Scalable Query and Analysis for Social Networks
Tak-Lon (Stephen) Wu, Bingjing Zhang, Clayton Davis, Emilio Ferrara, Alessandro Flammini, Filippo Menczer and Judy Qiu
Section II Big Data and Web Intelligence
Chapter 3 Predicting Content Popularity in Social Networks
Yan Yan, Ruibo Zhou, Xiaofeng Gao and Guihai Chen
Chapter 4 Mining User Behaviors in Large Social Networks
Meng Jiang and Peng Cui
Section III Security and Privacy Issues of Social Networks
Chapter 5 Mining Misinformation in Social Media
Liang Wu, Fred Morstatter, Xia Hu and Huan Liu
Chapter 6 Rumor Spreading and Detection in Online Social
Huiyuan Zhang, Huiling Zhang and My T. Thai
Chapter 8 Exploring Legislative Networks in a Multiparty
Jose Manuel Magallanes
In the past decades, the world has witnessed a blossoming of online social networks, such as Facebook and Twitter. This has revolutionized the way humans interact and drastically changed the landscape of information sharing in cyberspace. Along with the explosive growth of social networks, huge volumes of data have been generated. The research of big data, referring to these large datasets, gives insight into many domains, especially in complex and social network applications.
In the research area of big data, the management and analysis of large-scale datasets are quite challenging due to the highly unstructured data collected. The large size of social networks, spatio-temporal effects and interactions between users are among the various challenges in uncovering behavioral mechanisms. Many recent research projects are involved in processing and analyzing data from social networks and attempt to better understand complex networks, which motivated us to prepare in-depth material on recent advances in the areas of big data and social networks.
This handbook presents recent developments on theoretical, algorithmic and application aspects of big data in complex social networks. The handbook consists of four parts, covering a wide range of topics. The first part focuses on data storage and data processing. The efficient storage of data can fundamentally support intensive data access and queries, which enables sophisticated analysis. Data processing and visualization help to communicate information clearly and efficiently. The second part of this handbook is devoted to the extraction of essential information and the prediction of web content. By performing big data analysis, we can better understand the interests, location and search history of users and predict users’ behaviors more accurately. The book next focuses on the protection of privacy and security in Part 3. Modern social media enables people to share and seek information effectively, but also provides effective channels for rumor and misinformation propagation. It is essential to model rumor diffusion, identify misinformation from massive data and design intervention strategies. Finally, Part 4 discusses the emergent application of big data and social networks. It focuses in particular on multilayer networks and multiparty systems.
We would like to take this opportunity to thank all authors, the anonymous referees, and Taylor & Francis Group for helping us to finalize this handbook. Our thanks also go to our students for their help during the processing of all contributions. Finally, we hope that this handbook will encourage research on the many intriguing open questions and applications in the area of big data and social networks that still remain.
My T. Thai
Weili Wu
Hui Xiong
My T. Thai is a professor and associate chair for research in the department of computer and information sciences and engineering at the University of Florida. She received her PhD degree in computer science from the University of Minnesota in 2005. Her current research interests include algorithms, cybersecurity and optimization on network science and engineering, including communication networks, smart grids, social networks and their interdependency. The results of her work have led to 5 books and 120+ articles published in various prestigious journals and conferences on networking and combinatorics.
Dr. Thai has engaged in many professional activities. She has been a TPC-chair for many IEEE conferences, has served as an associate editor for Journal of Combinatorial Optimization (JOCO), Optimization Letters, Journal of Discrete Mathematics, IEEE Transactions on Parallel and Distributed Systems, and a series editor of Springer Briefs in Optimization. Recently, she has co-founded and is co-Editor-in-Chief of Computational Social Networks journal. She has received many research awards including a UF Research Foundation Fellowship, UF Provost’s Excellence Award for Assistant Professors, a Department of Defense (DoD) Young Investigator Award, and an NSF (National Science Foundation) CAREER Award.
Weili Wu is a full professor in the department of computer science, University of Texas at Dallas. She received her PhD in 2002 and MS in 1998 from the department of computer science, University of Minnesota, Twin Cities. She received her BS in 1989 in mechanical engineering from Liaoning University of Engineering and Technology in China. From 1989 to 1991, she was a mechanical engineer at Chinese Academy of Mine Science and Technology. She was an associate researcher and associate chief engineer in Chinese Academy of Mine Science and Technology from 1991 to 1993. Her current research mainly deals with the general research area of data communication and data management. Her research focuses on the design and analysis of algorithms for optimization problems that occur in wireless networking environments and various database systems. She has published more than 200 research papers in various prestigious journals and conferences such as IEEE Transactions on Knowledge and Data Engineering (TKDE), IEEE Transactions on Mobile Computing (TMC), IEEE Transactions on Multimedia (TMM), ACM Transactions on Sensor Networks (TOSN), IEEE Transactions on Parallel and Distributed Systems (TPDS), IEEE/ACM Transactions on Networking (TON), Journal of Global Optimization (JGO), Journal of Optical Communications and Networking (JOCN), Optimization Letters (OPTL), IEEE Communications Letters (ICL), Journal of Parallel and Distributed Computing (JPDC), Journal of Computational Biology (JCB), Discrete Mathematics (DM), Social Network Analysis and Mining (SNAM), Discrete Applied Mathematics (DAM), IEEE INFOCOM (The Conference on Computer Communications), ACM SIGKDD (International Conference on Knowledge Discovery & Data Mining), International Conference on Distributed Computing Systems (ICDCS), International Conference on Database and Expert Systems Applications (DEXA), SIAM Conference on Data Mining, as well as many others. Dr. Wu is associate editor of SOP Transactions on Wireless Communications (STOWC), Computational Social Networks (Springer), and International Journal of Bioinformatics Research and Applications (IJBRA). Dr. Wu is a senior member of IEEE.
Hui Xiong is currently a full professor of management science and information systems at Rutgers Business School and the director of Rutgers Center for Information Assurance at Rutgers, the State University of New Jersey, where he received a two-year early promotion/tenure (2009), the Rutgers University Board of Trustees Research Fellowship for Scholarly Excellence (2009), and the ICDM-2011 Best Research Paper Award (2011).
Dr. Xiong is a prominent researcher in the areas of business intelligence, data mining, big data, and geographic information systems (GIS). For his outstanding contributions to these areas, he was elected an ACM Distinguished Scientist. He has a distinguished academic record that includes 200+ refereed papers and an authoritative Encyclopedia of GIS (Springer, 2008). He is serving on the editorial boards of IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions on Management Information Systems (TMIS) and IEEE Transactions on Big Data. Also, he served as a program co-chair of the Industrial and Government Track for the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), a program co-chair for the IEEE 2013 International Conference on Data Mining (ICDM-2013), and a general co-chair for the IEEE 2015 International Conference on Data Mining (ICDM-2015).
Social Networks and Complex Networks
CHAPTER 1
A Hyperbolic Big Data Analytics Framework
within Complex and
Social Networks
Eleni Stai, Vasileios Karyotis, Georgios Katsinis, Eirini Eleni Tsiropoulou and Symeon Papavassiliou
CONTENTS
1.1 Introduction
1.1.1 Scope and Objectives
1.1.2 Outline
1.2 Big Data and Network Science
1.2.1 Complex Networks, Big Data and the Big Data Chain
1.2.2 Big Data Challenges and Complex Networks
1.3 Big Data Analytics based on Hyperbolic Space
1.3.1 Fundamentals of Hyperbolic Geometric Space
1.4 Data Correlations and Dimensionality Reduction in Hyperbolic Space
1.4.1 Example
1.5 Embedding of Networked Data in Hyperbolic Space and Applications
1.5.1 Rigel Embedding in the Hyperboloid Model
1.5.2 HyperMap Embedding
1.6 Greedy Routing over Hyperbolic Coordinates and Applications within Complex and Social Networks
1.7 Optimization Techniques over Hyperbolic Space for Decision-Making in Big Data
1.7.1 The Case of Advertisement Allocation over Online Social Networks
1.7.2 The Case of File Allocation Optimization in Wireless Cellular Networks
1.8 Visualization Analytics in Hyperbolic Space
1.8.1 Adaptive Focus in Hyperbolic Space
1.8.2 Hierarchical (Tree) Graphs
1.8.3 General Graphs
1.9 Conclusions
Acknowledgment
Further Reading
Data management and analysis has stimulated paradigm shifts in decision-making in various application domains. In particular, the emergence of big data along with complex and social networks has stretched the imposed requirements to the limit, with numerous and crucial potential benefits. In this chapter, based on a novel approach for big data analytics (BDA), we focus on data processing and visualization and their relations with complex network analysis. Thus, we adopt a holistic perspective with respect to complex/social networks that generate massive data and relevant analytics techniques, which jointly impact societal operations, e.g., marketing, advertising, resource allocation, etc., closing a loop between data generation and exploitation within complex networks themselves. The recent literature shows a strong relation between hyperbolic geometry and complex networks, as the latter eventually exhibit a hidden hyperbolic structure. Inspired by this fact, the methodology adopted in this chapter leverages key properties of the hyperbolic metric space for complex and social networks, exploited in a general framework that includes processes for data correlation/clustering, missing data (e.g., links) inference, efficient computation of social network analysis metrics, optimization, resource (advertisements, files, etc.) allocation and visualization analytics. More specifically, the proposed framework consists of the above hyperbolic geometry based processes/components, arranged in a chain form. Some of those components can also be applied independently, and potentially combined with other traditional statistical learning techniques. We emphasize the efficiency of each process in the complex networks domain, while also pinpointing open and interesting research directions.
1.1 INTRODUCTION
Data processing and analysis was one of the main drivers for the proliferation of computers (processing) and communications networks (analysis and transfer). However, lately, a paradigm shift is witnessed where networks themselves, e.g., social networks and sensor networks, can create data as well, and, in fact, in massive quantities. Indeed, gigantic datasets are produced on purpose or spontaneously, and stored by traditional and new applications/services.
Characteristic examples include the envisaged Internet of Things (IoT) paradigm [1], where pervasive sensors and actuators for almost every aspect of human activity will collect, process and make decisions on massive data, e.g., for surveillance, healthcare, etc. Similarly, the Internet, mobile networks, and overlaying (social) networks, e.g., Google, Facebook and others described in [2], [3], are responsible for the explosion of produced and transferred data. Collecting, processing and analyzing these data generated at unprecedented rates has attracted significant research, technological and financial interest lately, in a broader framework popularly known as “big data analytics” (BDA) [2]. The current setting is only expected to intensify in the future, since the expanding complex and social networks are expected to generate even larger amounts of complexly inter-related information and impose harsher data storage, processing, analysis and visualization requirements.
1.1.1 Scope and Objectives
Given the aforementioned setting and the fact that significant research and technological progress has taken place regarding the lower level aspects, e.g., storage and processing, this chapter focuses more on aspects of data analytics. It aspires to provide a framework for combining traditional methodologies (e.g., statistical learning) with novel techniques (e.g., communications theory), providing holistic and efficient solutions.
More specifically, we adopt a radical perspective for performing data analytics, advocating the use of cross-discipline mathematical tools, and more specifically exploiting properties of hyperbolic space [4], [5]. We postulate that hyperbolic metric spaces can provide the substrate required in data analytics for keeping up with the pace of data volume explosion and required processing. The main goal is to briefly describe a holistic framework for data representation, analysis (e.g., correlation, clustering, prediction), visualization, and decision making in complex and social networks, based on the principles of hyperbolic geometry and its properties. Then, the chapter will touch on several key BDA aspects, i.e., data correlation, dimensionality reduction, data and networks’ embeddings, navigation, social networks analysis (SNA) metrics’ computation and optimization, and show how they are accommodated by the above framework, along with the associated benefits achieved. The chapter will also explain the salient characteristics of these approaches related to the features and properties of the complex and social networks of interest, which generate massive datasets of diverse types. Finally, throughout the chapter, we highlight the key directions that will be of great potential interest in the future.
1.1.2 Outline
The rest of this chapter is organized as follows. In Section 1.2, the relation between complex networks and big data processes, along with their emerging challenges, is presented, while in Section 1.3 the proposed hyperbolic geometry based approach is introduced and analyzed. Section 1.4 describes how to perform data correlation and dimensionality reduction over hyperbolic space. In Section 1.5, several types of data embeddings on hyperbolic space, along with their properties especially related to complex networks, are studied. In Section 1.6, we examine the navigability of complex networks embedded in hyperbolic space via greedy routing techniques. In Section 1.7, optimization methodologies over large complex and social network graphs using hyperbolic space are described, while applications on advertisement and file allocation problems are pinpointed. In Section 1.8, visualization techniques based on hyperbolic space and their properties/advantages versus Euclidean based ones are surveyed. Finally, Section 1.9 concludes the chapter.
1.2 BIG DATA AND NETWORK SCIENCE
1.2.1 Complex Networks, Big Data and the Big Data Chain
Diverse types of complex and social networks are nowadays responsible for both massive data generation and transfers. The corresponding research and technological progress has been cumulatively addressed under the Network Science/Complex Network Analysis (CNA) domain [6].
It has been observed that several types of networks demonstrate similar, or identical behaviors. For example, modern societies are nowadays characterized as connected, inter-connected and inter-dependent via various network structures. Communication and social networks have been co-evolving in the last decade into a complex hierarchical system, which asymmetrically expands in time, as shown in Figure 1.1. The interconnecting physical layer expands orders of magnitude faster than the overlaying social one. This leads to the generation of massive quantities of data from both layers, for different purposes, e.g., data transferred in the low layer, control and peer data at the higher, etc., at unprecedented rates compared to the past. This form of “social IoT” (s-IoT) [7] is tightly related to the big data setting, as storage, analysis and inference over gigantic datasets impose stringent resource requirements and are tightly inter-related with the structure and operation of the complex and social networks involved. Various forms of BDA are applied nowadays in diverse disciplines, e.g., banking, retail chains/shopping, healthcare, insurance, public utilities, SNA, etc., where diverse complex networks produce and transfer data.
FIGURE 1.1 Communication (complex) – social network co-evolution.
Computers have revolutionized the whole process chain of data analytics, allowing automation in a supervised manner. Nowadays, such a chain is part of a broader BDA pipeline that includes collection, correlation, management, search & retrieval and visualization of data and analysis results, at unprecedented scales compared to the past [2]. More specifically, the BDA pipeline consists of data generation, acquisition, storage, analysis, visualization and interpretation processes.
Data generation involves creating data from multiple, diverse and distributed sources including sensors, video, click streams, etc. Data acquisition refers to obtaining information and it is subdivided into data collection, data transmission, and data pre-processing. The first refers to retrieving raw data from real-world objects, the second refers to a transmission process from data sources to appropriate storage systems, while the third refers to all those techniques that may be needed prior to the main analysis stage, e.g., data integration, cleansing, transformation and reduction. Data integration aims at combining data residing in different sources and providing a unified data perspective. Data cleansing refers to determining inaccurate, incomplete, or unreasonable data and amending or removing (transforming) these data to improve data quality. Data reduction aims at decreasing the degree of redundancy of available data, which would otherwise increase data transmission overhead and storage costs, and lead to data inconsistency, reduced reliability and data corruption.
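As a rough illustration of the three pre-processing steps named above, the following Python sketch chains integration, cleansing and reduction over toy sensor records. The record fields (`sensor`, `time`, `value`), the function names and the plausibility range are hypothetical choices made only for this example; they are not part of the framework described in this chapter.

```python
from typing import Dict, List

Record = Dict[str, object]  # hypothetical record: {"sensor", "time", "value"}


def integrate(*sources: List[Record]) -> List[Record]:
    """Data integration: merge records from several sources into one view."""
    merged: List[Record] = []
    for source in sources:
        merged.extend(source)
    return merged


def cleanse(records: List[Record], lo: float = -50.0, hi: float = 60.0) -> List[Record]:
    """Data cleansing: drop incomplete records (missing value) and
    unreasonable readings outside the assumed plausible range [lo, hi]."""
    return [
        r for r in records
        if r.get("value") is not None and lo <= r["value"] <= hi
    ]


def reduce_redundancy(records: List[Record]) -> List[Record]:
    """Data reduction: keep the first record for each (sensor, time) pair."""
    seen = set()
    unique: List[Record] = []
    for r in records:
        key = (r["sensor"], r["time"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def preprocess(*sources: List[Record]) -> List[Record]:
    """Run the three steps in sequence."""
    return reduce_redundancy(cleanse(integrate(*sources)))
```

In practice each step would be far more elaborate (schema matching, imputation, compression), but the order of the steps mirrors the description in the text.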
Analysis is the main stage of the BDA pipeline and can take multiple forms. The goal is to extract useful values, suggest conclusions and/or support decision-making. It can be descriptive, predictive and prescriptive. It may use data visualization techniques, statistical analysis or data mining techniques in order to fulfill its goals and interpret the results. All the pre-analytics, analytics and post-analytics stages (i.e., visualization and interpretation) of BDA described above can only become more diverse and very informative within the complex and social network ecosystems considered in this chapter. Thus, even though BDA is characterized by the four V’s, namely Volume (of data), Velocity (generation speed), Veracity (quality) and Variability (heterogeneity), the above settings create a new “V” feature for BDA, namely Value, rendering them essentially a new and in fact “expensive” commodity for our information societies.
1.2.2 Big Data Challenges and Complex Networks
Several challenges emerge due to the fact that big data carry special characteristics, e.g., heterogeneity, spurious correlations, incidental endogeneities, noise accumulation, etc. [2], which become even more intense within the complex/social network environment. Challenges related to BDA can be divided into challenges related to data, and challenges related to processes of the BDA pipeline. Table 1.1 summarizes these two types of challenges.
Data-related challenges correspond to the four “V’s” of BDA with the addition of privacy, which relates more to personal data protection. The first two deal with storage and timeliness issues emerging from the explosion of data generated/collected, and the following two with the reliability and heterogeneity of data due to multiple sources and types of data.
Additional challenges emerging with respect to the big data pipeline deal with the data collection and transferring requirements imposed, the pre-processing and analysis of data with respect to the associated complexity, accurate and distributed computation, the accumulated noise, as well as other peripheral issues, such as data and results visualization, interpretation of results and issues related to cloud storage, computing and services in general.
TABLE 1.1 Big Data Challenges
Data-Related    BDA Pipeline-Related
Volume          Collection
Velocity        Transferring
Veracity        Pre-processing
Variety         Analysis
Privacy         Complexity
                Distributed operation
                Accuracy
                Noise
                Visualization
                Cloud computing
                Interpretation
1.3 BIG DATA ANALYTICS BASED ON HYPERBOLIC SPACE
The aforementioned challenges will require radical approaches for efficiently tackling the emerging problems and keeping up with the anticipated explosion of produced data. In this chapter, we describe a methodology that is capable of addressing holistically the above challenges and providing impetus for more efficient analytics in the future. The framework is conceptually shown in Figures 1.2 and 1.3 and it is mainly based on the properties of hyperbolic metric spaces (a brief summary of which is included in the forthcoming subsection 1.3.1). This approach provides a generic computational substrate for data representation, analysis (e.g., correlation and clustering), inference, visualization, search & navigation, and decision-making (via, e.g., optimization). The proposed framework builds on primitive pre-processing operations of traditional BDA techniques, e.g., statistical learning, and further complements them in terms of analytics and interpretation/visualization to allow more scalable, powerful and efficient inference and decision-making.
Figure 1.2 shows the observed evolution of data volumes until today, where more-than-big, i.e., “hyperbolic”, data now require processing. The proposed framework suggests a lean approach for tackling such scaling. Input data may take either raw or networked form, where the latter corresponds to correlated data (nodes) and their correlations/relations (links between nodes) drawn from combinations of complex/social networks. Their analysis leads to sophisticated decision-making for challenging problems over large data sets, e.g., resource allocation and optimization, thus eventually having an impact on the networks themselves, closing the loop of an evolutionary bond between networks (humans, IoT), data and machines (analytics) (Figure 1.2).
[Figure 1.2 components: Data collectors/; Big Data; Data Correlations, Clustering, Network Creation; Search & Navigation, Efficient Computations & Optimization; Data Visualization; Decision Making/Optimization]
FIGURE 1.2 Evolution of data volume (from data to “hyperbolic” data), proposed framework’s functionalities and interaction with complex and social networks.
[Figure 1.3 workflow blocks: Big Data; (Hyperbolic) Dimensionality Reduction; (Hyperbolic) Correlations Network Estimation; Hyperbolic Embedding; Inference, Clustering, Search & Navigation, SNA Metrics Computations; Hyperbolic Resource Allocation Optimization; Hyperbolic Visualization Analytics]
FIGURE 1.3 The workflow of the proposed hyperbolic geometry based approach for BDA over complex and social networks.
The role of the term “hyperbolic” in the proposed approach is twofold. On one hand, it indicates the passage from “big data” to even more, i.e., “hyperbolic data”, denoting the tendency of growth of the available data to be handled and analyzed in the future. On the other hand, it emphasizes the benefit of the use of hyperbolic geometry for BDA. The core of this approach is the fact that, as shown in the literature, networks of arbitrarily large size can be embedded in low-dimensional (even as small as two) hyperbolic spaces without sacrificing important information as far as network communication (e.g., routing) and structure (e.g., scale-free properties [50]) are concerned [8], [9], [5]. Thus, hyperbolic spaces are congruent with complex network topologies and are much more appropriate for representing and analyzing big data than Euclidean spaces.
The specific workflow of the proposed framework is shown in Figure 1.3. It starts with obtaining data and determining a suitable data representation model. Input (big) data from complex and social networks might be in raw (e.g., list) form, or in the form of a data network representing their correlations. Pre-processing of data follows, consisting of dimensionality reduction, correlations and generation of networks over data, which may be performed either following traditional techniques or using hyperbolic geometry’s properties. The data representation after pre-processing (e.g., network or raw form) will either lead to or determine the appropriate methodology for the subsequent data embedding into the hyperbolic geometric space (subject of Section 1.5). Data embedding is the assignment of coordinates to network nodes in the hyperbolic metric space. Properly visualizing the accumulated and inferred data following the analysis bears significant importance. The proposed framework will leverage flexible (systolic) hyperbolic geometry based mechanisms for data visualization, in order to allow a holistic and simultaneously focused view of the data and more informed decision-making. This makes it possible to provide visualization tools that simultaneously capture global patterns and structural information, e.g., hierarchy, node centrality/importance, etc., and local characteristics, e.g., similarities, in an efficient and systolic manner, which hides/reveals detail in a scalable manner when this is required by the decision-making. The latter approach can be very useful in applications and studies of CNA/SNA.
In this chapter, we also describe techniques for extracting useful information from the data under processing and analysis for different application domains. Following and depending on the data embedding, further data correlation/clustering and inference may be attained, in which various forms of (possibly hierarchical) data communities/clusters will be built and missing data (e.g., links) will be predicted from the input data within the accuracy and time constraints imposed. Leveraging the hyperbolic distance function and greedy routing techniques, efficient SNA metrics computations (such as centralities, the computation of which becomes hard over large data sets) will be studied and proposed. The proposed framework also allows performing efficient optimization, suitable for large data sets, for advertisement allocation and other resource allocation problems, mainly of a discrete nature (e.g., file allocation over distributed cache memories in a 5G environment).
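To make the notion of greedy routing over hyperbolic coordinates more concrete, the sketch below shows the generic greedy forwarding rule on a toy graph whose nodes carry coordinates in the Poincare disk. It is only an illustration under stated assumptions: the graph and coordinates are arbitrary, the names `greedy_route` and `poincare_distance` are hypothetical, and the distance used is the standard Poincare disk distance given in Section 1.3.1. The specific variants studied in this chapter may differ.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # coordinates inside the unit disk


def poincare_distance(a: Point, b: Point) -> float:
    """Hyperbolic distance between two points of the Poincare disk."""
    diff = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    denom = (1.0 - (a[0] ** 2 + a[1] ** 2)) * (1.0 - (b[0] ** 2 + b[1] ** 2))
    return math.acosh(1.0 + 2.0 * diff / denom)


def greedy_route(
    adj: Dict[str, List[str]],
    coords: Dict[str, Point],
    src: str,
    dst: str,
) -> Optional[List[str]]:
    """Forward hop by hop to the neighbor closest to the destination.

    Returns the path, or None if the walk reaches a node none of whose
    neighbors is strictly closer to the destination (a local minimum).
    """
    path = [src]
    current = src
    while current != dst:
        here = poincare_distance(coords[current], coords[dst])
        best = min(
            adj[current],
            key=lambda n: poincare_distance(coords[n], coords[dst]),
            default=None,
        )
        if best is None or poincare_distance(coords[best], coords[dst]) >= here:
            return None  # greedy forwarding is stuck
        path.append(best)
        current = best
    return path


# Hypothetical toy network with arbitrary coordinates.
coords = {"a": (0.0, 0.0), "b": (0.5, 0.0), "c": (0.8, 0.1), "d": (-0.5, 0.3)}
adj = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b"], "d": ["a"]}
```

Because the distance to the destination strictly decreases at every hop, the walk cannot loop; it either reaches the destination or stops at a local minimum. Each node needs only its neighbors' coordinates, which is what makes the scheme attractive for large networks.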
In the following, we first present some background on hyperbolic space and then present the proposed framework in more detail. Subsequently, we describe in more detail techniques enabled by the framework for performing and exploiting the analytics over the embedded data.
1.3.1 Fundamentals of Hyperbolic Geometric Space
Non-Euclidean geometries, e.g., hyperbolic geometry [4], emerged by questioning and modifying the fifth (parallel) postulate of Euclidean geometry. According to the latter, given a line and a point that does not lie on it, there is exactly one line going through the given point that is parallel to the given line. As far as hyperbolic geometry is concerned, the parallel postulate changes as follows: Given a line and a point that does not lie on it, there is more than one line going through the given point that is parallel to the given line.
The n-dimensional hyperbolic space, denoted as $\mathbb{H}^n$, is an n-dimensional Riemannian manifold with negative curvature $c$, which is most often considered constant and equal to $c = -1$. Several models of hyperbolic space exist, such as the Poincare disk model, the Poincare half-space model, the Hyperboloid model, the Klein model, etc. These models are isometric,¹ i.e., any two of them can be related by a transformation which preserves all the geometrical properties (e.g., distance) of the space. We will describe in detail and use in our approach the Poincare models (disk and half space), which are the ones mostly used in practical applications.
For instance, the Hyperboloid model realizes the hyperbolic space $\mathbb{H}^n$ as a hyperboloid in $\mathbb{R}^{n+1} = \{(x_0, \ldots, x_n) \mid x_i \in \mathbb{R},\ i = 0, 1, \ldots, n\}$ such that $x_0^2 - x_1^2 - \cdots - x_n^2 = 1$, $x_0 > 0$. Hyperbolic spaces have a metric function (distance) that differs from the familiar Euclidean distance and also differs among the diverse models. In the case of the Hyperboloid model, for two points $x = (x_0, \ldots, x_n)$, $y = (y_0, \ldots, y_n)$, their hyperbolic distance is given by [4]:

$$\cosh d_H(x, y) = \sqrt{(1 + \|x\|^2)(1 + \|y\|^2)} - \langle x, y \rangle, \qquad (1.1)$$

where $\|\cdot\|$ is the Euclidean norm and $\langle \cdot, \cdot \rangle$ represents the inner product, both taken over the coordinates $(x_1, \ldots, x_n)$, so that $x_0 = \sqrt{1 + \|x\|^2}$ on the hyperboloid. The Hyperboloid model can be used to construct the Poincare disk/ball model, where the latter is a perspective projection of the former viewed from $(x_0 = -1, x_1 = 0, \ldots, x_n = 0)$, projecting the upper half hyperboloid onto an $\mathbb{R}^n$ unit ball centered at $x_0 = 0$.
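As a minimal sketch of Equation (1.1) and of the projection onto the Poincare ball, the following Python code assumes that each point is given by its spatial coordinates $(x_1, \ldots, x_n)$, with $x_0$ recovered from the hyperboloid constraint. The function names are illustrative only and do not come from [4].

```python
import math


def lift_to_hyperboloid(x):
    """Return (x0, x1, ..., xn) with x0 = sqrt(1 + ||x||^2), so that the
    point satisfies x0^2 - x1^2 - ... - xn^2 = 1 and x0 > 0."""
    x0 = math.sqrt(1.0 + sum(c * c for c in x))
    return (x0, *x)


def hyperboloid_distance(x, y):
    """Hyperbolic distance of Eq. (1.1); x and y hold the spatial
    coordinates (x1, ..., xn) of two points on the hyperboloid."""
    norm_x_sq = sum(c * c for c in x)
    norm_y_sq = sum(c * c for c in y)
    inner = sum(a * b for a, b in zip(x, y))
    arg = math.sqrt((1.0 + norm_x_sq) * (1.0 + norm_y_sq)) - inner
    # Clamp to 1 to absorb rounding noise before inverting cosh.
    return math.acosh(max(arg, 1.0))


def to_poincare_ball(x):
    """Perspective projection from (-1, 0, ..., 0) onto the unit ball
    lying in the hyperplane x0 = 0: z = x / (1 + x0)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in x))
    return tuple(c / (1.0 + x0) for c in x)
```

Under these assumptions, a point at hyperbolic distance r from the apex of the hyperboloid has spatial norm sinh(r) and is projected to Euclidean radius tanh(r/2) in the ball.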
Specifically, focusing on two dimensions, the whole infinite hyperbolic plane can be represented inside the finite unit disk $\mathbb{D} = \{z : \|z\| < 1\}$ of the Euclidean space, which is the 2-dimensional Poincare disk model. The hyperbolic distance function $d_{PD}(z_i, z_j)$, for two points $z_i, z_j$, in the Poincare disk model is given by [4], [11]:

$$\cosh d_{PD}(z_i, z_j) = \frac{2\|z_i - z_j\|^2}{(1 - \|z_i\|^2)(1 - \|z_j\|^2)} + 1. \qquad (1.2)$$

The Euclidean circle $\partial\mathbb{D} = \{z : \|z\| = 1\}$ is the boundary at infinity for the Poincare disk model. In addition, in this model, the shortest hyperbolic path between two nodes is either a part of a diameter of $\mathbb{D}$, or a part of a Euclidean circle in $\mathbb{D}$ perpendicular to the boundary $\partial\mathbb{D}$, as illustrated in Figure 1.4(a). Note that these shortest-path curves differ from the chords that would be implied by the Euclidean metric.
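The Poincare disk distance of Equation (1.2) can be sketched in a few lines of Python. The sketch assumes that points of the disk are represented as Python complex numbers; the function name is illustrative.

```python
import math


def poincare_distance(zi, zj):
    """Hyperbolic distance of Eq. (1.2) between two points of the open
    unit disk, each given as a complex number."""
    if abs(zi) >= 1.0 or abs(zj) >= 1.0:
        raise ValueError("both points must lie strictly inside the unit disk")
    numerator = 2.0 * abs(zi - zj) ** 2
    denominator = (1.0 - abs(zi) ** 2) * (1.0 - abs(zj) ** 2)
    return math.acosh(1.0 + numerator / denominator)
```

For a point at Euclidean radius tanh(r/2), the distance to the center of the disk evaluates to r, which shows how points close to the boundary are hyperbolically far from the center.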
Let us now consider the following map in two dimensions, $z = \frac{w - i}{1 - iw}$, where z, with $\|z\| < 1$, is a point expressed as a complex number on the Poincare disk model and i is the imaginary unit. Then w is a point (complex number) on the Poincare half-space model. This map sends $z = -i$ to $w = 0$, $z = 1$ to $w = 1$ and $z = i$ to $w = \infty$ (note that the extension to more dimensions is trivial).
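The map between the two Poincare models can be checked numerically. The sketch below assumes the map is read as $z = (w - i)/(1 - iw)$, which is the form consistent with the three correspondences stated above, and obtains its inverse $w = (z + i)/(1 + iz)$ by solving for w; the function names are illustrative.

```python
def half_plane_to_disk(w):
    """Map a point w of the upper half-plane to the unit disk."""
    return (w - 1j) / (1 - 1j * w)


def disk_to_half_plane(z):
    """Inverse map; undefined at z = i, which corresponds to w = infinity."""
    return (z + 1j) / (1 + 1j * z)
```

With these definitions, `disk_to_half_plane(-1j)` evaluates to 0 and `disk_to_half_plane(1)` to 1, while `half_plane_to_disk(1j)` gives the center of the disk.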
According to the Poincare half-space model of $\mathbb{H}^n$, every point is represented by a pair $(w_0, w)$, where $w_0 \in \mathbb{R}^+$ and $w \in \mathbb{R}^{n-1}$. The distance between two points $(w_0^1, w^1)$, $(w_0^2, w^2)$ on the Poincare half-space model is given by the standard expression [12]:

$$\cosh d_{HS}\big((w_0^1, w^1), (w_0^2, w^2)\big) = 1 + \frac{(w_0^1 - w_0^2)^2 + \|w^1 - w^2\|^2}{2\, w_0^1 w_0^2}. \qquad (1.3)$$

1 Isometry is a map that preserves distance [10] between metric spaces.

FIGURE 1.4 Poincare disk (a) and half-space (b) models along with their shortest paths in two dimensions: part of a diameter of $\mathbb{D}$ or part of a Euclidean circle in $\mathbb{D}$ perpendicular to the boundary $\partial\mathbb{D}$ for the disk model, and vertical lines and semicircles perpendicular to $\mathbb{R}$ for the half-space model. (c) shows the Voronoi tessellation of the Poincare disk into hyperbolic triangles of equal area.
The circumference C(r) and the area A(r) of a circle of radius r in the 2-dimensional (2D) Poincare disk model are given by the following relations [46], [4], [8]:

$$C(r) = 2\pi \sinh(r), \qquad A(r) = 4\pi \sinh^2(r/2). \qquad (1.4)$$

Therefore, for small radius r, e.g., around the center of the Poincare disk, the hyperbolic space looks flat, while for larger r, both the circumference and the area grow exponentially with r. The exponential scaling with radius is illustrated in Figure 1.4(c), which shows a tessellation of the Poincare disk into hyperbolic triangles of equal area. In the Euclidean visual representation of the triangulation, the triangles appear increasingly smaller the closer they are to the circumference. In the following, we describe the different components composing the proposed framework, even though several parts can be combined and employed jointly.
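The two regimes of Equation (1.4) (flat for small r, exponential for large r) can be illustrated with a short Python sketch; the function names are illustrative.

```python
import math


def circumference(r):
    """Circumference of a hyperbolic circle of radius r, Eq. (1.4)."""
    return 2.0 * math.pi * math.sinh(r)


def area(r):
    """Area of a hyperbolic disk of radius r, Eq. (1.4)."""
    return 4.0 * math.pi * math.sinh(r / 2.0) ** 2


def flatness_ratio(r):
    """Ratio of the hyperbolic circumference to the Euclidean value 2*pi*r;
    it tends to 1 as r tends to 0."""
    return circumference(r) / (2.0 * math.pi * r)
```

For large r, sinh(r) behaves like $e^r/2$, so the circumference approaches $\pi e^r$ and the area grows at the same exponential rate.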
1.4 DATA CORRELATIONS AND DIMENSIONALITY REDUCTION IN HYPERBOLIC SPACE
In this section, we describe two basic functionalities of the proposed framework (Figures 1.2 and 1.3). The first deals with inferring correlations among data, yielding network structures representing such relations (nodes-data, correlations-edges). The second deals with a distance-preserving dimensionality reduction approach over the hyperbolic space (i.e., multidimensional scaling [12], [13]) with multiple practical applications, e.g., various efficient computations, efficient data visualization, etc. Each functionality can, of course, be applied independently.
We assume generic forms of “data items”, each of which can be unrolled into a set of features. The set of features is common to all data items, e.g., customer parameters such as payment information, demographic information, etc., when customers correspond to data items. Before analytics, one needs to apply a method for clustering/reduction of these features to a set of latent features (those considered important to fully describe each data item). Examples of such methods include spectral clustering [principal component analysis (PCA)] [14], [15], singular value decomposition (SVD) [14], [15], etc., each of which can be appropriately sped up to scale with large datasets, as in [15], [16], [17]. Subsequently, correlations may be inferred via the application of similarity/distance metrics that quantify similarities on various data aspects (e.g., between pairs of data items). A thorough survey of similarity metrics such as cosine, Pearson, etc. is performed in [18]. Another widely accepted approach for computing similarities identifies distribution functions in the parameters of interest and then exploits an appropriate distribution comparison metric, e.g., Kullback-Leibler divergence [19], [20] for probability distributions. Hyperbolic distance may also serve as a similarity measure, as described in the following. Other ways of clustering and network estimation include [14] partitional algorithms (k-means and its variations, etc.), hierarchical algorithms (agglomerative, divisive), the “lasso” algorithm and its variants that are based on convex optimization [21] and produce a graph representation of the data, etc. In the case of the proposed framework, it is beneficial to consider hierarchical clustering of data, since it allows efficient visualization using the two- or three-dimensional hyperbolic space (Section 1.8).
Data correlations in hyperbolic space can be achieved via the hyperbolic distance function over a hyperbolic space of suitable dimension (e.g., equal to the number of important features of users/products), applied on pairs of data items to reveal their hidden dependencies/correlations with respect to their features to a controllable extent. As an example, with only two latent features describing the data items, we can assign the radial and angular coordinates of the 2D Poincare disk model according to the values of the two features, respectively. Then, we link two nodes together only if their hyperbolic distance (e.g., Equation (1.2) for the Poincare disk model) is less than a predefined upper bound. By controlling this upper bound, one can
control the “neighborhood” of each node and thus the extent to which the correlations among data reach. In other words, important correlations may be considered up to a controllable extent via a threshold value over hyperbolic distance. This is a simple model of data correlation; however, its effectiveness lies in its simplicity and the fact that it can lead to simultaneous data correlation, analysis and visualization.
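The threshold-based correlation model described above can be sketched as follows. The sketch makes two assumptions that are not stated in the text: each data item already comes with two latent feature values interpreted as hyperbolic polar coordinates (r, θ), and a hyperbolic radius r is placed at Euclidean radius tanh(r/2) in the disk. The function and variable names are hypothetical.

```python
import math
from itertools import combinations


def polar_to_disk(r, theta):
    """Place a point with hyperbolic polar coordinates (r, theta) in the
    unit disk (assumption: Euclidean radius tanh(r / 2))."""
    return math.tanh(r / 2.0) * complex(math.cos(theta), math.sin(theta))


def poincare_distance(zi, zj):
    """Hyperbolic distance of Eq. (1.2) for complex points of the disk."""
    numerator = 2.0 * abs(zi - zj) ** 2
    denominator = (1.0 - abs(zi) ** 2) * (1.0 - abs(zj) ** 2)
    return math.acosh(1.0 + numerator / denominator)


def correlation_edges(items, max_distance):
    """Link two items only if their hyperbolic distance is below the bound.

    items maps an item name to its (r, theta) pair of latent features."""
    points = {name: polar_to_disk(r, th) for name, (r, th) in items.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(points), 2)
        if poincare_distance(points[a], points[b]) < max_distance
    ]
```

Raising `max_distance` enlarges the neighborhood of every node, so the same set of items yields a denser correlation network.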
After embedding the data pieces/nodes on the k-dimensional Poincare half-space model (k corresponds to the number of latent features), one can apply a distance-preserving dimension reduction technique over the hyperbolic space, such as the one proposed in [13], [12]. Importantly, if the dimension of the final metric space is chosen equal to 2 or 3, we will be able to achieve simultaneously a visualization of the data set and its analysis/navigation (Sections 1.6 and 1.8). Particularly, regarding the dimensionality reduction over hyperbolic space, we provide the following two theorems from the literature [12], [22].
Given an n-point subset S of the hyperbolic space, let T be its projection on $\mathbb{R}^{n-1}$ (i.e., the Poincare half-space model, Section 1.3.1). By the Johnson-Lindenstrauss Lemma [22], there exists an embedding of T, determined by a function f, into the $O\!\left(\frac{\log n}{\varepsilon^2}\right)$-dimensional Euclidean space such that for every pair of points $x_1, x_2 \in T$, $\|x_1 - x_2\| \le \|f(x_1) - f(x_2)\| \le (1 + \varepsilon)\|x_1 - x_2\|$, $\varepsilon > 0$.
Theorem 1.1 (Dimension reduction for Hn)
Consider the map g : Hn
for every two points (w1, w1), (w2, w2) at hyperbolic distance ∆, we have:
Theorem 1.2 (Embedding into the hyperbolic plane (for visualization purposes))
Assume that the distance between every two points in S is at least $\frac{\ln(12n)}{\varepsilon}$; then there exists an embedding of S into the hyperbolic plane $\mathbb{H}^2$ with distance distortion at most $1 + \varepsilon$.
1.4.1 Example
A similar methodology of data correlation over hyperbolic space is applied in [9], where the new nodes added to the network embedding in the hyperbolic space form connections with existing ones. The popularity of the latter and the similarity of the new nodes with the existing ones are taken into account in determining the connections of the new nodes in the embedding. More specifically, newcomers choose the existing nodes to connect to by optimizing the product of similarity and popularity with them. In [9], the procedure of simultaneous data embedding/visualization and correlation in hyperbolic space is as follows, starting with an initially empty network.
1. At time t ≥ 1, a new node t is added to the embedded network and is assigned the polar coordinates $(r_t, \theta_t)$, where the angular coordinate, $\theta_t$, is sampled uniformly at random from [0, 2π] and the radial coordinate, $r_t$, relates to the birth date of node t via the relation $r_t(t) = \ln t$. Every existing node s < t increases its radial coordinate to $r_s(t) = \beta r_s(t) + (1 - \beta) r_t(t)$, β ∈ [0, 1].
2. The new node t connects with a subset of existing nodes {s}, where s < t, ∀s. This subset consists of the m nodes with the m smallest values of the product $s \cdot \theta_{st}$, where m is a parameter controlling the average node degree (i.e., the extent of the correlations among nodes), and $\theta_{st}$ is the angular distance between nodes s and t.
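The two steps above can be sketched as a small simulation. This is a simplified illustration and not the implementation of [9]: it selects neighbors by the product $s \cdot \theta_{st}$ exactly as in step 2, records the birth radius $r_t = \ln t$, and leaves out the radial update controlled by β, since that update does not enter the selection rule of step 2. All names are hypothetical.

```python
import math
import random


def angular_distance(a, b):
    """Angular distance between two angles, a value in [0, pi]."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def grow_network(n, m, seed=0):
    """Grow a network of n nodes in which each newcomer t links to the
    m existing nodes s < t with the smallest product s * theta_st."""
    rng = random.Random(seed)
    theta = {}   # angular coordinate of every node (similarity)
    radius = {}  # radial coordinate at birth, r_t = ln t (popularity)
    edges = []
    for t in range(1, n + 1):
        theta[t] = rng.uniform(0.0, 2.0 * math.pi)
        radius[t] = math.log(t)
        older = sorted(
            range(1, t),
            key=lambda s: s * angular_distance(theta[s], theta[t]),
        )
        edges.extend((s, t) for s in older[:m])
    return theta, radius, edges
```

Because early nodes have small birth indices s, they keep winning the comparison for many newcomers, which is how the popularity effect appears in this sketch.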
Actually, by following the above steps for a network construction over data, it turns out that new nodes connect simply to their closest m nodes in hyperbolic distance. The hyperbolic distance in the Poincare disk (Equation (1.2)) between two nodes at polar coordinates $(r_t, \theta_t)$ and $(r_s, \theta_s)$ is approximately equal to $x_{st} = r_s + r_t + \ln(\theta_{st}/2) = \ln(s \cdot t \cdot \theta_{st}/2)$. Therefore, the sets of nodes {s} minimizing $x_{st}$ or $s \cdot \theta_{st}$ for each newcomer t are identical. At the second step above, in order to reduce network clustering [23], the newcomer node t, instead of connecting with its m closest nodes, may randomly select a node s < t and form a connection with s with probability $p(x_{st}) = \frac{1}{1 + e^{(x_{st} - R_t)/T}}$, where T is a temperature parameter and $R_t$ is a threshold value. This step is repeated until m nodes are selected to connect to node t.
Here, the radial coordinate abstracts the popularity of a node. The smaller the radial coordinate of a node (the closer the node is to the center of the Poincare disk), the more popular it is, and thus the more likely it is to attract new connections (we will elaborate more on this fact in Section 1.5; see also the hyperbolic distance functions in Section 1.3.1). The increase of the radial coordinate expresses any attenuation of nodes’ popularity with time, which is equal to zero when β = 1. Note that in complex networks the time presence of a node in the network is strongly related to its popularity. Specifically, the scale-free structure of complex networks is mainly due to the preferential attachment of newcomers, as the network grows, to existing nodes with high degree. Thus, nodes of high degree continue to increase their connectivity, and these nodes are with higher probability older nodes, assuming that initially all nodes have the same degree. Therefore, in the above mapping of nodes to hyperbolic coordinates, the similarity characteristic is mapped to the angular coordinate (here assigned randomly), while the popularity characteristic is mapped to the radial coordinate, and hyperbolic distance is used to predict/infer connections between pairs of nodes based on their characteristics. As a result, hyperbolic distance serves as a convenient single-metric representation of a combination of popularity (radial) and similarity (angular).
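The approximate distance and the connection probability used in the randomized variant of the second step can be sketched as below, assuming the probability is read as $p(x_{st}) = 1/(1 + e^{(x_{st} - R_t)/T})$ and that T > 0. The function names are illustrative.

```python
import math


def approx_distance(s, t, theta_st):
    """Approximate hyperbolic distance x_st = r_s + r_t + ln(theta_st / 2)
    with r_s = ln s and r_t = ln t; requires theta_st > 0."""
    return math.log(s * t * theta_st / 2.0)


def connection_probability(x_st, R_t, T):
    """Probability of linking two nodes at distance x_st, for a threshold
    R_t and a temperature T > 0."""
    return 1.0 / (1.0 + math.exp((x_st - R_t) / T))
```

The probability equals 1/2 at $x_{st} = R_t$ and, as T decreases towards zero, approaches a step function that links exactly the pairs with $x_{st} < R_t$.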
SPACE AND APPLICATIONS
In this section, in order to perform data embedding, it is assumed that data items are already available in network form. Thus, the focus shifts to obtaining different embeddings into latent hyperbolic coordinates in conjunction with several applications over complex large-scale networks, such as the computation of graph-theoretic and SNA metrics (e.g., centrality metrics), the prediction of missing links, etc. Two types of embedding in the low-dimensional hyperbolic space are presented. In the first (Subsection 1.5.1), the latent node coordinates in hyperbolic space are determined so that the hyperbolic distances between node pairs are approximately equal to their graph distances in the initial network. Towards this objective, multidimensional scaling (MDS) is applied [24]. Given the number n of network nodes (data items), MDS has a running time of $O(n^3)$ and requires $O(n^2)$ space (the distance matrix between all node pairs). Since the complexity of MDS is extremely high for large-scale networks, landmark-based MDS has been introduced [24], based on the graph and hyperbolic space distances among k chosen landmarks and the rest of the nodes. With landmark-based MDS, the running time reduces to $O(dkn + k^3)$ and the space to $O(kn)$, where d is the dimension of the hyperbolic space, with d < k << n. By considering d, k as small constants, landmark-based MDS has a linear running time complexity. The second type of embedding (Subsection 1.5.2) applies statistical learning methods to embed a complex network graph in hyperbolic space by constructing a new network graph that tries to mimic with high probability the initial graph structure [5]. Contrary to the first approach, the node pairs’ hyperbolic distances may differ significantly from their initial graph distances. The statistical learning techniques applied are based on maximum likelihood estimation for the inference of node coordinates; global (i.e., for the whole network) and local (i.e., for every node) likelihood functions are defined and maximized, where the local likelihood functions serve to approximate the global ones for complexity reduction.
1.5.1 Rigel Embedding in the Hyperboloid Model
Several complex and social network analysis problems, such as the computation of node centralities, community detection, etc., are based on node distances, which are hard to compute within large-scale graphs such as online social networks with millions of nodes. However, for marketing purposes, such computations become necessary or even critical for companies, e.g., to locate the most influential/central node for achieving efficient marketing. Therefore, several works in the literature [24], [12] have proposed algorithms for network embeddings (e.g., in Euclidean or hyperbolic space) so that the inferred coordinates can be used for approximating node distances in the initial graph. We will focus on large-scale network embedding in hyperbolic space and specifically on the Rigel embedding proposed in [24], which achieves low distance distortion error and answers queries for node distances and shortest paths in microseconds even for up to 43 million nodes, compared to the order of seconds for a traditional breadth-first-search (BFS) algorithm. Importantly, Rigel allows for parallelization of computations, which is a great advantage in the field of BDA. Experimental results in [25], [8], focused on embedding Internet distances in hyperbolic space, have shown less distortion with respect to the node distances in the initial graph, compared with other embeddings in Euclidean coordinates. This fact is also verified in [26] via empirical computation of distortion metrics for diverse coordinate systems, where it is shown that hyperbolic space achieves significantly more accurate results than Euclidean and spherical ones.
Let us assume that the network consists of N nodes. Rigel employs the Hyperboloid model of hyperbolic space with the distance function given by Equation (1.1). Rigel applies landmark-based MDS, where L << N nodes are chosen as landmarks in the network graph. Landmarks may be chosen as high-degree nodes, if the given network is scale-free; otherwise they can be chosen randomly. First, the hyperbolic coordinates of the landmarks are computed with the aid of a global optimization algorithm, so that the distances between the landmarks in the Hyperboloid are as close as possible to their matching path distances in the graph. This is the bootstrapping step of Rigel. Then, the hyperbolic coordinates of the rest of the nodes are calibrated, so that each node’s distances to all landmarks in the Hyperboloid are very close to the corresponding actual path distances in the network graph. Note that the authors of [24] studied the accuracy of Rigel with respect to the dimension of the hyperbolic space and showed that the former increases with the latter. However, the number of landmarks should be higher than the dimension of the embedding space, thus leading to a trade-off between accuracy and complexity [24].
Importantly for large-scale network graphs, a parallel version of Rigel is proposed in [24], offering great improvement in the complexity of Rigel, which increases linearly with the network size. Both steps of Rigel (bootstrapping and embedding in the Hyperboloid model) can be parallelized over a number of servers at most equal to the number of landmarks. One or more landmarks are assigned to each server and the rest of the nodes are distributed in a balanced way across servers. It is shown that parallel Rigel performs similarly to Rigel with respect to accuracy.
Concerning the effectiveness and efficiency of Rigel in computing SNA [27] and graph analysis metrics, experiments and comparisons with existing schemes are performed in [24]. Regarding the graph analysis metrics of radius, diameter and average path length, which are applied in identifying the
small-world property of a network [6], [51], Rigel resulted in values extremely close to the ground truth. Note that distances in Rigel are given by Equation (1.1). Rigel’s performance in computing node centralities, which constitute an important SNA metric for industries, is also examined in [24]. Closeness centrality [27] is considered, according to which the most central node is the one that has the lowest average distance to all other nodes in the network. Rigel achieved high accuracy in identifying the node ranking with respect to closeness centrality and outperformed existing schemes.
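The closeness-centrality ranking mentioned above can be sketched on a small connected graph as follows. The sketch uses exact BFS distances as the baseline; the `distance_from` parameter is a hypothetical hook where distances estimated from embedded coordinates (as Rigel does with Equation (1.1)) could be plugged in instead. It does not reproduce the implementation of [24].

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_ranking(adj, distance_from=bfs_distances):
    """Rank nodes by average distance to all other nodes (lowest first).

    Assumes a connected graph with at least two nodes."""
    average = {}
    for u in adj:
        dist = distance_from(adj, u)
        others = [d for v, d in dist.items() if v != u]
        average[u] = sum(others) / len(others)
    ranking = sorted(adj, key=lambda u: (average[u], u))
    return ranking, average
```

Running one BFS per node costs time proportional to the number of nodes times the number of edges, which is the cost that coordinate-based distance estimates aim to avoid on large graphs.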
1.5.2 HyperMap Embedding
This section uses statistical learning methods to embed a social graph in hyperbolic coordinates, focusing on the HyperMap embedding algorithm introduced in [5]. HyperMap leverages the emerging relation between complex network topologies and hyperbolic geometry [8]. Due to their scale-free property, complex networks exhibit a hierarchical, i.e., tree-like, structure [28], while hyperbolic geometry is the geometry of trees. More specifically, the similarity between an infinite tree graph and the hyperbolic space provides an intuition about the hidden hyperbolic structure of complex networks. The exponential scaling of the circumference of a circle and the area of a disk in hyperbolic space (explained in Section 1.3.1) coincides with the scaling of the number of nodes with respect to their distance from the root of the tree in an “e-ary” tree [8]. To make this clearer, let us examine a b-ary tree, which is a tree with branching factor equal to b. The number of nodes located at distance exactly R from the root of the tree is $(b + 1)b^{R-1} \sim b^R$ and the number of nodes at distance at most R from the root of the tree is $\frac{(b+1)b^R - 2}{b - 1} \sim b^R$. As a result, hyperbolic space can be seen as a continuous version of a tree, a fact realized as the exponential expansion property of the hyperbolic space. Scale-free complex networks are characterized by heterogeneity regarding the node degree, where the majority of nodes have low node degree (power-law degree distribution), implying a tree-like network organization that indicates the existence of a hidden hyperbolic metric space [28].
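The two node counts for a b-ary tree can be verified with a short Python sketch, reading them as $(b+1)b^{R-1}$ nodes at distance exactly R ≥ 1 and $\frac{(b+1)b^R - 2}{b-1}$ nodes at distance at most R, for b ≥ 2. The function names are illustrative.

```python
def nodes_at_distance(b, R):
    """Number of nodes exactly R hops from the root of a b-ary tree in
    which the root has b + 1 children and every other node has b."""
    return 1 if R == 0 else (b + 1) * b ** (R - 1)


def nodes_within_distance(b, R):
    """Number of nodes at most R hops from the root (requires b >= 2)."""
    return ((b + 1) * b ** R - 2) // (b - 1)
```

Summing `nodes_at_distance(b, k)` over k = 0, ..., R reproduces `nodes_within_distance(b, R)`, and both counts grow like $b^R$, mirroring the exponential growth of C(r) and A(r) in Equation (1.4).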
The example of the simultaneous embedding and creation of a growing random network provided in Subsection 1.4.1 leads to the formation of network graphs with the following two characteristics, which are two basic properties of complex networks’ structure: (i) they appear to be highly clustered [23], since the links added between nodes that are close in hyperbolic distance lead to the formation of a large number of triangles, and (ii) they have a power-law degree distribution. These statements further support the existence of an underlying hidden hyperbolic space in complex networks’ structure. On the one hand, a random network created over hyperbolic space as in Subsection 1.4.1 emerges to be scale-free, while on the other hand, a scale-free network is proven to have negative curvature [8] (similarly to hyperbolic metric spaces).
Based on these studies and observations, HyperMap aims at embedding a given complex (social) network in hyperbolic space in a way that is congruent with the embedding of an extended version of the model of Subsection 1.4.1. The extension lies basically in providing the possibility to add links between existing nodes, while in Subsection 1.4.1 new links can be added only between a newcomer and an existing node. Precisely, HyperMap finds the nodes’ angular and radial coordinates such that the probability that the given complex network is produced by this extended model of Subsection 1.4.1 is maximized.
HyperMap assigns hyperbolic coordinates to the nodes inside the Poincare disk by approximately, but efficiently, maximizing a globally defined likelihood function over the node pairs’ hyperbolic distances (which are functions of the nodes’ hyperbolic coordinates), expressed considering the given complex network’s links. Specifically, in order to mimic the network creation/hyperbolic embedding of Subsection 1.4.1, it first performs a maximum likelihood estimation of the appearance (i.e., birth) times of the given network’s nodes (let t denote their number). Then, after estimating the time sequence of the nodes’ arrivals, it replays the hyperbolic growth of the network roughly similarly to the steps of the model of Subsection 1.4.1. The difference lies in the computation of the angular coordinates, where HyperMap computes the angular coordinate $\theta_i$ of node i, i.e., the node with sequence number i, by maximizing a local likelihood function defined for node i, equivalently to maximizing the aforementioned global likelihood function with respect to $\theta_i$. Specifically, the HyperMap embedding algorithm receives as basic input the adjacency matrix of the given complex network and performs the following steps:
1. It sorts nodes in decreasing order with respect to their degree in the given complex network, where node 1 corresponds to the one with the highest node degree. Node 1 receives $r_1 = 0$ and a random angular coordinate $\theta_1 \in [0, 2\pi]$ (i.e., it is placed at the center of the Poincare disk model).
2. For i = 1 to t do
(a) Node i arrives (is born) and is assigned the radial coordinate $r_i = \frac{2}{\zeta} \ln i$, where ζ = |c| (Subsection 1.3.1) is the constant absolute curvature value of the hyperbolic space provided as input to HyperMap. Usually, ζ = 1. Every existing node s < i increases its radial coordinate to $r_s(t) = \beta r_s(t) + (1 - \beta) r_i(t)$, β ∈ [0, 1], where β is provided as input to HyperMap.
(b) The angular coordinate $\theta_i$ is computed via maximizing a local likelihood function defined for node i.
HyperMap embedding also provides the possibility to predict missing links of the given complex network, efficiently and with high accuracy. Link prediction is a very important process in the study of large-scale networks, since topology measurements for inferring their structure may miss part of the links. In HyperMap, prediction is based on the aforementioned possibility of internal link addition, i.e., between pairs of existing nodes. Specifically, two
(non-neighboring) existing nodes k, l are connected at time t (i.e., prediction
of a missing link in the initial complex network) with probability equal to
The link prediction performance of HyperMap is evaluated according to diverse indices and shown to be very satisfactory, while it outperforms several well-known classical link prediction methods such as Common-Neighbors, Katz Index, Hierarchical Random Graph Model, Degree-Product, Inverse Shortest Path, etc. [5].
1.6 GREEDY ROUTING OVER HYPERBOLIC COORDINATES AND APPLICATIONS WITHIN COMPLEX AND SOCIAL NETWORKS
This section mostly concerns the navigability of networks embedded in hyperbolic space [28]. A network embedded in a geometric space is navigable if one can perform efficient greedy routing on the network using the node coordinates in the underlying geometric space [5].
After embedding the network graph (or the correlated data) in the hyperbolic geometric space, greedy routing over hyperbolic coordinates can be used to navigate or route messages from source to destination. Specifically, each node forwards the message to its neighbor that is closest in hyperbolic distance to the destination. As a result, greedy routing uses only local information, i.e., each node’s necessary knowledge is limited to the hyperbolic coordinates of its neighbors and the destination. Due to this fact, greedy routing can be adapted and applied for performing efficient search and navigation in large data sets [24], [26], [29], while we foresee its applications in the computation of SNA metrics and in recommender systems [30]. A disadvantage of greedy routing is that it fails to deliver a message to the destination when a node does not have a neighbor closer to the destination than itself (local minimum of distance). In this case, the message gets blocked at the specific node [11] with no further forwarding via greedy routing.
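Greedy routing as described above, including the failure at a local minimum of distance, can be sketched as follows. The sketch assumes that node coordinates are complex numbers in the Poincare disk and that the graph is given as an adjacency dictionary; all names are hypothetical.

```python
import math


def poincare_distance(zi, zj):
    """Hyperbolic distance of Eq. (1.2) for complex points of the disk."""
    numerator = 2.0 * abs(zi - zj) ** 2
    denominator = (1.0 - abs(zi) ** 2) * (1.0 - abs(zj) ** 2)
    return math.acosh(1.0 + numerator / denominator)


def greedy_route(adj, coords, source, target):
    """Forward hop by hop to the neighbor closest to the target.

    Returns (path, delivered). Routing stops with delivered == False when
    no neighbor is strictly closer to the target than the current node."""
    path = [source]
    current = source
    while current != target:
        here = poincare_distance(coords[current], coords[target])
        best = min(
            adj[current],
            key=lambda v: poincare_distance(coords[v], coords[target]),
            default=None,
        )
        if best is None or poincare_distance(coords[best], coords[target]) >= here:
            return path, False  # local minimum: the message is blocked
        path.append(best)
        current = best
    return path, True
```

Each node consults only the coordinates of its own neighbors and of the target, and the distance to the target strictly decreases at every hop, so the loop always terminates.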
With respect to networks with hidden hyperbolic structure (i.e., scale-free complex networks), greedy routing based on hyperbolic coordinates/distances achieves a very high success rate (close to 100%), as shown through experimental examination in the literature. Also, in this case the paths obtained via greedy routing are very close to the global shortest paths between the corresponding node pairs. Specifically, in [8], [9], the performance of greedy routing is studied over synthetic networks constructed similarly to the example of Subsection 1.4.1 (in a way congruent to the exponential expansion of hyperbolic space) and it is shown to achieve a success rate close to 100% and a stretch with respect to the shortest paths close to 1. This is a very important property showing the small-world navigability of this particular category of networks [31]. The success of greedy routing over hyperbolic space is strongly tied to the fact that hyperbolic space has a tightly connected core, through which all paths between nodes pass. This is the reason why shortest paths in hyperbolic space can be found efficiently and with high accuracy [8]. In [5], the performance of greedy routing in the AS Internet graph is examined when using the hyperbolic coordinates inferred by HyperMap (Section 1.5). Note that the AS Internet graph exhibits a scale-free structure [6], [23]. Due to the congruency between the scale-free network topology and hyperbolic geometry, the success of greedy routing over hyperbolic coordinates is much improved compared to the case when the real coordinates are used, while the length of the paths paved by greedy routing is roughly the same as that of the shortest paths. HyperMap actually estimates the node coordinates that best fit a given network.
“Greedy embeddings” in metric spaces other than Euclidean [11], [49] have been proposed to optimize greedy routing techniques. In the case of a greedy embedding of any network (not only scale-free) in hyperbolic space, the success rate of greedy routing becomes exactly 100%. In [11], a distributed implementation of a greedy embedding in two-dimensional hyperbolic space is proposed, which can also be applied in dynamic network conditions, by assigning hyperbolic coordinates to new nodes without re-embedding the whole network. The greedy embedding is constructed by choosing a spanning tree of the graph of the initial network and then embedding the spanning tree into the hyperbolic space according to the algorithm of [11]. Following this algorithm, after hyperbolic coordinates have been assigned to the root of the tree inside a specific area of the Poincare disk model, each node computes its own coordinates using those of its parent, in such a way that the hyperbolic bisector of the embedded spanning tree edge between the node and its parent does not intersect any other embedded edge of the spanning tree. The greedy embedding of a spanning tree of a graph implies the greedy embedding of the whole graph. Importantly, it is proven that every graph has a greedy embedding in two-dimensional hyperbolic space [49]. For all these reasons, hyperbolic geometry dominates over the Euclidean one for performing greedy routing. Note that a greedy embedding basically ensures the existence of at least one greedy path between each source-destination pair, and thus 100% success of greedy routing. Greedy embeddings have been applied successfully in communications networks, e.g., [11], [32], [33]; however, in the case of large-scale networks their implementation may pose challenges due to the need for a spanning tree of the whole graph, thus opening new research directions in BDA. The average length of the paths paved by greedy routing is a crucial performance factor to evaluate. In the case of greedy hyperbolic embedding, different choices of spanning tree and of its root (e.g., shortest path spanning tree rooted at the node with highest degree, or spanning tree derived via a random walk) will lead to different routing paths and path lengths between pairs of sources and destinations [34].
Greedy routing can become node-degree aware by exploiting the node degree metric available in network graphs [26]. This enhancement may improve its performance since, apart from the fact that high-degree nodes are “more connected” to other nodes, they also tend to be embedded nearer to the core of the network (e.g., the center of the Poincare disk) than the lower-degree nodes. Other enhancements of greedy routing (e.g., Gravity-Pressure Greedy Forwarding [11]) have also been proposed to improve its performance under dynamic network conditions, e.g., random node arrivals and departures [11]. Based on all the advantages of greedy routing techniques over hyperbolic coordinates, we envision their suitability and efficiency for the computation of SNA metrics that demand knowledge of paths between node pairs, e.g., betweenness centrality [27], often used for identifying the most influential nodes for information propagation purposes.
1.7 OPTIMIZATION TECHNIQUES OVER HYPERBOLIC SPACE FOR DECISION-MAKING IN BIG DATA
1.7.1 The Case of Advertisement Allocation over Online Social Networks
Analysis of big data leads to problems of large-scale optimization. Since optimization involving large data sets is not only expensive but also suffers from slow numerical rates of convergence, new approaches are required. Throughout this subsection, we describe and study the advertisement (ad) allocation problem and how it can be significantly simplified computationally by leveraging the properties of hyperbolic space for large-scale networks, following the approach of [35]. A common advertising mechanism used by, e.g., an online social network (OSN) platform for the distribution of advertisements over its users is auction-style: the advertisers place bids on users’ impressions (e.g., clicks) based on their budget constraints, while the platform’s owner seeks to maximize its revenue. In an OSN, users’ impressions are not ad hoc, since users are influenced by their acquaintances; therefore, social influence should be taken into consideration in the optimization. This is because a user’s engagement may influence other users, depending on the influence strength of the former. According to [35], a fairness constraint should be added to the optimization problem so that “a similar users’ influence distribution becomes assigned to each advertiser”.
Initially, we review the conventional way to formulate an advertisement allocation problem over an OSN, which is the following Integer Programming (IP) problem (Equations (1.6)–(1.9)):
$$\sum_{a_j \in A} I_{i,j} \le I_i, \quad \forall u_i \in V \quad \text{(impression constraint)} \tag{1.8}$$
$$I_{i,j} \in \mathbb{N}, \quad (S, I) \in R_D \quad \text{(domain constraint)} \tag{1.9}$$
where $a_j$ is an advertisement (corresponding to an advertiser), $A$ is the set of all advertisements, $p_j$ is the bid of advertiser $j$, which is considered homogeneous over all users, and $u_i$ is a user of the OSN (node of the network) with maximum number of impressions assigned to all advertisers ($\sum_{a_j \in A} I_{i,j}$) equal to $I_i$ and social influence given by $g(u_i)$. $R_D$ is a feasible set expressing domain constraints, e.g., fairness or priority constraints among advertisers. Furthermore, $S$ and $I$ are the optimization variables, where $S = \{S_1, S_2, \dots, S_{|A|}\}$ is the allocation strategy, i.e., the set of users assigned to each advertiser, and $I = \{I_{i,j} \mid u_i \in S_j, a_j \in A\}$ is the users’ impressions allocation strategy, i.e., the number of impressions of a user assigned to each advertiser, where $I_{i,j} = 0$ if $u_i \notin S_j$. Also, $V$ stands for the set of users. Note that the total number of impressions of a user is upper bounded due to the limited time that a user spends on OSNs daily.
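To make the roles of $S$, $I$ and $I_i$ concrete, the following minimal sketch checks a candidate allocation against the impression constraint (1.8) and the rule $I_{i,j} = 0$ for $u_i \notin S_j$, and counts the decision variables of the IP. The data layout (dictionaries keyed by hypothetical user and advertiser labels) and the function names are assumptions made for illustration only; the budget and domain constraints are deliberately not modeled.

```python
def satisfies_impression_constraints(impressions, max_impressions, assigned_users):
    """Check constraint (1.8) and the rule I_{i,j} = 0 whenever u_i is not in S_j.

    impressions[(i, j)]  -> I_{i,j}, impressions of user i given to advertiser j
    max_impressions[i]   -> I_i, upper bound on the impressions of user i
    assigned_users[j]    -> S_j, set of users assigned to advertiser j
    """
    totals = {}
    for (user, advertiser), count in impressions.items():
        # I_{i,j} must be a natural number.
        if not isinstance(count, int) or count < 0:
            return False
        # A user outside S_j cannot contribute impressions to advertiser j.
        if count > 0 and user not in assigned_users[advertiser]:
            return False
        totals[user] = totals.get(user, 0) + count
    # Constraint (1.8): the per-user sum over advertisers is bounded by I_i.
    return all(total <= max_impressions[user] for user, total in totals.items())


def decision_variable_count(num_advertisers, num_users):
    """Number of entries I_{i,j} in the IP formulation, i.e. |A| * |V|."""
    return num_advertisers * num_users


# Hypothetical toy instance with two advertisers and two users.
S = {"a1": {"u1", "u2"}, "a2": {"u2"}}
I = {("u1", "a1"): 2, ("u2", "a1"): 1, ("u2", "a2"): 1}
print(satisfies_impression_constraints(I, {"u1": 2, "u2": 2}, S))
print(decision_variable_count(3, 10**9))
```

The second call illustrates the scale of the variable $I$: even a handful of advertisers yields billions of entries once $|V|$ reaches the size of a modern OSN.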
The IP problem formulation has two significant disadvantages. Firstly, the decision variable $I$ has order $|A| \cdot |V|$, implying an extreme increase in dimensionality for modern OSNs consisting of billions of users. Secondly, the domain constraints mentioned above are hard to express in such an IP formulation setting. The most common and important domain constraint ($R_D$) is the fairness one, as it constitutes a requirement and business model of most OSN platforms. Besides fairness, several other kinds of domain constraints are described and handled in [35], such as the priority model and the hybrid model that combines fairness with priority. In this chapter, we focus only on the fairness constraint, as it is very representative in indicating the computational efficiency gained when utilizing the properties of hyperbolic space in the ad allocation problem for large-scale OSNs.
In [35], an alternative formulation of the advertisement allocation problem is proposed, based on the mapping of the OSN in hyperbolic space (performed as in Sections 1.4 and 1.5). Following the new methodology, the disadvantages of the IP problem formulation are tackled to a significant degree, as (i) the discrete nature of the advertisement allocation problem (due to $I$, $S$) becomes continuous by leveraging region-wise integrals on the continuous hyperbolic geometric space, allowing for dimensionality reduction down to a final order of $O(|A|)$, and (ii) in many cases the domain constraints can be efficiently represented and visualized. For the latter, and considering the fairness domain constraints, note that two fan (or pie) shapes on the Poincare disk indicate the same distribution of user influence, due to the properties of a complex network’s (e.g., OSN) mapping in hyperbolic space (Section 1.5), which will also be pinpointed below.
For the network mapping in hyperbolic space, the HyperMap scheme (Section 1.5) is used. The mapping exhibits the important properties of OSNs, such as the power-law degree distribution (scale-free property), the community structure, and the efficient network navigability via greedy routing using local information (related to the small-world phenomenon, Section 1.6). One important aspect is that after the mapping of the network on the Poincare disk, the expected node degree, $p_d(r)$, depends on the radial coordinate and is given by $p_d(r) \propto e^{-r}$, while the node density is expressed as $p_n(r) \propto e^{r}$. This means that every circle on the Poincare disk has uniform node density, while node degree and node density vary exponentially along the radius. This exponential dependence of node degree and node density on the radius can be exploited for capturing the users’ influence factor discussed above, while the continuity of the hyperbolic space can be leveraged for approximating the sum over users of the advertisement allocation problem with integrals over the areas where users are mapped. In this case, the advertisement allocation problem seeks an optimal allocation strategy that assigns to each advertiser a region of population such that maximum revenue is achieved.
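The fairness property of fan-shaped regions can be illustrated numerically. The sketch below is not the HyperMap scheme: it simply assumes that radial coordinates follow a density proportional to $e^{r}$ on $[0, R]$ and that angles are uniform, as stated above, and then compares two sectors of equal angle. The function names, the seed, the node count and the value of $R$ are illustrative assumptions.

```python
import math
import random


def sample_radius(R, rng):
    """Inverse-CDF sample from a density proportional to e^r on [0, R]."""
    return math.log(1.0 + rng.random() * (math.exp(R) - 1.0))


def sample_nodes(n, R, seed=0):
    """Draw n nodes as (radius, angle) pairs with uniform angular coordinate."""
    rng = random.Random(seed)
    return [(sample_radius(R, rng), rng.uniform(0.0, 2.0 * math.pi)) for _ in range(n)]


def sector_stats(nodes, start, width):
    """Node count and mean radius inside the sector [start, start + width)."""
    radii = [r for r, phi in nodes if start <= phi < start + width]
    return len(radii), sum(radii) / len(radii)


nodes = sample_nodes(20000, 0.9, seed=1)
# Two disjoint fan-shaped regions with the same opening angle.
print(sector_stats(nodes, 0.0, math.pi / 3))
print(sector_stats(nodes, math.pi, math.pi / 3))
```

Under these assumptions the two sectors hold a similar number of nodes with a similar radial profile, and since the expected degree depends only on the radius, they also carry a similar distribution of user influence.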
Considering all the above, the advertisement allocation problem, after the mapping of the OSN in hyperbolic space, becomes:
$$\max_{(S, I)} \sum_{j=1}^{|A|} p_j f_j(S, I) \quad \text{(volume assignment)} \tag{1.10}$$
subject to:
$$p_j f_j(S, I) \le b_j, \quad \forall j \in \{1, \dots, |A|\} \quad \text{(budget constraint)} \tag{1.11}$$
where $\sigma_i(S_j, I)$ is the amount of the impressions of user $u_i$ that are assigned to advertisement $a_j$. According to this meta-formulation, an allocation strategy or a shape design is given for $S$ (e.g., fan-shape for the fairness model, ring-shape for the priority model [35]), which also determines the $f_j(S, I)$ function. The dependence of the $f_j$, $\sigma_i$ functions on $I$ is due to the multiple impressions that a user has and may assign to different advertisers. Therefore, the areas assigned to different advertisers over the hyperbolic space may overlap, complicating the optimization problem (Equations (1.10)–(1.14)). However, in [35] this issue is resolved via a methodology denoted as Unit Impression Decomposition, which leads to a multi-stage optimization problem with unit impressions (and nonoverlapping areas among advertisers) at each stage. For simplicity, suppose that $I_i = 1$, $\forall u_i \in V$. Thus, $f_j$, $\sigma_i$ depend only on $S$. In the following, we study the case of the fan-shape allocation strategy, which expresses fairness with respect to social influence in users’ allocation among advertisers. Then, the allocation area $S_j$ for advertiser $j$ has a fan-shape or pie-shape of angle $\theta_j$ in the Poincare disk (as shown in Figure 1.5).
FIGURE 1.5 An example of an OSN’s users’ allocation to six advertisers considering fairness with respect to the social influence (node degree). Each advertiser is assigned a pie-shaped area over the Poincare disk, on which the users’ OSN is embedded.
Then, $f_j(S_j)$ is computed as follows:
$$f_j(S_j) = f_j(\theta_j) = a \int_0^R e^{\tau} \int_0^{\theta_j} \bigl(1 + w \cdot \delta(\tau)\bigr)\, dl\, d\tau = q \cdot \theta_j, \tag{1.15}$$
where $R < 1$ is the radius of the disk inside the Poincare disk in which the embedded OSN lies, and $q$ is a constant appropriately determined after tedious computations. Also, the quantity $a(1 + w \cdot \delta(\tau))$ of the integral represents the profit that each node lying on radius $\tau$ attributes to its assigned advertiser, where $w$, $a$ are constants and $\delta(\tau)$ is the node degree, with $\delta(\tau) = g \cdot e^{\tau}$, $g$ a constant. Thus, the advertisement allocation problem (for one stage in the case of non-unit users’ impressions) with fairness domain constraints attains a linear programming (LP) form as follows:
In this problem formulation (Equations (1.16)–(1.19)), the optimization variable is $\Theta = \{\theta_1, \dots, \theta_{|A|}\} \in [0, 2\pi]^{|A|}$, which has only $|A|$ dimensions, a significant reduction from the $|A| \times |V|$ dimensions of the conventional problem formulation (note that $|V|$ is potentially in the order of billions). Each advertiser $a_j$ is assigned a sector of angle $\theta_j$ in the Poincare disk. Note that the variable $S$ is no longer needed in the problem formulation. Also, it is a convex problem that can be solved efficiently [35]. Two more observations further support the efficiency of the last problem formulation. Since the regions can be arranged tightly next to each other, all the users’ impressions will be utilized as long as the demand (budget of advertisers) is greater than or equal to the supply (users’ impressions). Also, due to this fact, all the stages of the unit impression decomposition (in the case of non-unit users’ impressions) can be performed in parallel to reduce the computation time [35], which is a very important advantage of this approach for big data analysis and computations.
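The reduction to $|A|$ variables can be made tangible with a small sketch. It rests on two assumptions that should be read as such. First, $q$ is evaluated by taking the integrand of (1.15) at face value, so that the inner integral over $l$ contributes a factor $\theta_j$ and $q = a \int_0^R e^{\tau}(1 + w \cdot \delta(\tau))\, d\tau$. Second, the LP is assumed to read: maximize $\sum_j p_j q \theta_j$ subject to $p_j q \theta_j \le b_j$, $\sum_j \theta_j \le 2\pi$ and $\theta_j \ge 0$, which is our reading of the fan-shape model rather than a transcription of Equations (1.16)–(1.19). Under that assumed form the LP is a continuous knapsack and can be solved greedily by serving the highest bids first. All names and numbers are hypothetical.

```python
import math


def sector_profit_rate(R, a, w, degree, steps=10_000):
    """Midpoint-rule estimate of q = a * int_0^R e^tau * (1 + w * degree(tau)) dtau."""
    h = R / steps
    total = 0.0
    for k in range(steps):
        tau = (k + 0.5) * h
        total += math.exp(tau) * (1.0 + w * degree(tau))
    return a * total * h


def allocate_sectors(bids, budgets, q, total_angle=2.0 * math.pi):
    """Greedy optimum of the ASSUMED LP: max sum_j p_j*q*theta_j subject to
    p_j*q*theta_j <= b_j, sum_j theta_j <= total_angle, theta_j >= 0."""
    thetas = [0.0] * len(bids)
    remaining = total_angle
    # Every unit of angle costs the same share of the disk, so the advertiser
    # with the highest bid p_j yields the most revenue per unit of angle.
    for j in sorted(range(len(bids)), key=lambda k: bids[k], reverse=True):
        cap = budgets[j] / (bids[j] * q)  # largest angle the budget b_j allows
        thetas[j] = min(cap, remaining)
        remaining -= thetas[j]
    return thetas


# Degree profile as printed in the text: delta(tau) = g * e^tau (g hypothetical).
g = 2.0
q = sector_profit_rate(R=0.9, a=1.0, w=0.1, degree=lambda tau: g * math.exp(tau))
thetas = allocate_sectors(bids=[3.0, 1.0, 2.0], budgets=[6.0, 10.0, 4.0], q=q)
print(q, thetas)
```

With $\delta(\tau) = g \cdot e^{\tau}$ the integral has the closed form $q = a\,[(e^{R} - 1) + w g (e^{2R} - 1)/2]$, which the numerical estimate should match closely. The solver only sorts $|A|$ bids, independently of the number of users $|V|$.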
1.7.2 The Case of File Allocation Optimization in Wireless Cellular Networks
In modern wireless cellular networks, the shift from the reactive to the proactive networking paradigm is a common trend [36]. The need for a smarter network that incorporates proactive mechanisms is driven by the increasing mobile data traffic [37]. One type of mechanism for proactive network operation that has already been proposed in the literature [36, 38] is file/content caching at the edge of the network, i.e., at the evolved NodeBs (eNBs), small cell base stations (Home eNBs) or the user equipment (UE) devices. Pushing content to the edge of the network relieves the network of redundant data traffic and serves users’ requests at lower transmission delays.
In this subsection, we focus on the problem of optimal file placement in different cache memories lying at various components of a mobile cellular network. This problem can be cast in a form similar to the problem of Subsection 1.7.1 for achieving efficiency, since it bears similar social and complex characteristics, as well as a similarly large-scale nature, as will be explained in the following. The size and especially the number of the available files become extremely large, and the number of connected devices is increasing [39]. In this subsection, we describe a formulation of an optimization problem for distributing files having a complex networked structure over a large number of heterogeneous caches in a fair way, aiming to reduce the system delay of file downloading. Fairness is meant in terms of the popularity of each file, e.g., a particular cache should not monopolize all popular files. For example, consider the WWW graph [23], where an edge represents a link from one webpage to another. A high (in-)degree [6] of a page therefore implies high popularity, since this webpage is pointed to by many others and is thus more likely to be visited, i.e., requested. In this context, the following file placement optimization problem is formulated (Equations (1.20)–(1.24)), aiming to determine in an optimal way the allocation of files in cache memories.
where $c_j$ is the capacity of the transmission link between the memory cache $j$ and the provider of the file $f_i$, $l_i$ is the size of the file $f_i$, $M$ is the set of the memory caches, $F$ is the set of files, $I_{i,j}$ is the indicator variable of the placement of a file $f_i$ in memory cache $j$ and $g(f_i)$ is a social influence factor associated to file $f_i$. $I_{i,j}$ is 1 if the file $f_i$ is placed in memory cache $j$ and 0 otherwise, while $I_i$ stands for the maximum number of caches in which the file $f_i$ can be stored and $s_j$ relates to the capacity of cache $j$. Finally, $S = \{S_1, S_2, \dots, S_{|M|}\}$ is the allocation strategy, i.e., the set of files assigned to each cache memory.
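As a rough illustration of these quantities, the sketch below scores a candidate placement by summing the ratio $l_i / c_j$ (file size over link capacity, i.e., the download time that a cached copy saves) and checks two constraints. The storage constraint $\sum_i l_i I_{i,j} \le s_j$ and the copy limit $\sum_j I_{i,j} \le I_i$ are assumptions inferred from the definitions of $s_j$ and $I_i$; the weighting by $g(f_i)$ is left out, and all names and numbers are hypothetical.

```python
def placement_benefit(placement, file_sizes, link_capacities):
    """Sum of l_i / c_j over all (file, cache) pairs with I_{i,j} = 1."""
    return sum(file_sizes[f] / link_capacities[m] for f, m in placement)


def placement_is_feasible(placement, file_sizes, cache_capacities, max_copies):
    """Check the ASSUMED constraints: storage per cache and copies per file."""
    used = {}    # storage consumed in each cache
    copies = {}  # number of caches holding each file
    for f, m in placement:
        used[m] = used.get(m, 0.0) + file_sizes[f]
        copies[f] = copies.get(f, 0) + 1
    storage_ok = all(used[m] <= cache_capacities[m] for m in used)
    copies_ok = all(copies[f] <= max_copies[f] for f in copies)
    return storage_ok and copies_ok


# Hypothetical toy instance: two files, two caches.
file_sizes = {"f1": 4.0, "f2": 2.0}          # l_i
link_capacities = {"m1": 2.0, "m2": 4.0}     # c_j
cache_capacities = {"m1": 4.0, "m2": 2.0}    # s_j
max_copies = {"f1": 1, "f2": 1}              # I_i
placement = {("f1", "m1"), ("f2", "m2")}     # pairs (i, j) with I_{i,j} = 1

print(placement_benefit(placement, file_sizes, link_capacities))
print(placement_is_feasible(placement, file_sizes, cache_capacities, max_copies))
```

Representing the placement as a set of pairs stores only the nonzero entries of $I_{i,j}$, which matters when $|F|$ is large.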
The placement of a file $f_i$ at the memory cache $j$ has a certain benefit for the network in terms of the average system delay improvement. Each placement of a file in a cache memory saves the network the time needed to download the file from the file/content provider. This benefit can be quantified on average by the term $l_i / c_j$. Thus, the above file allocation problem maximizes the total benefit in terms of the system delay improvement from the placement of certain files in the available cache memories. This problem is an integer program, a class that is NP-hard in general, and it also has large-scale characteristics, as mentioned before. Thus, alternative approaches need to be considered in order to tackle efficiently the large-scale and discrete nature of this problem. It can be observed from the following mapping table (Table 1.2) that the file placement problem in memory caches is of a similar nature to the problem of advertisement allocation presented in the previous subsection (Subsection 1.7.1).
TABLE 1.2 Mapping of the file allocation problem to the advertisement allocation problem.
Advertisement allocation in users    File allocation in caches
Advertisement ($A$)                  Cache memory ($M$)
Users ($u_i$, $V$)                   Files ($f_i$, $F$)
Price bid ($p_j$)                    Inverse of the capacity of the link between cache and the file provider ($1/c_j$)
Social factor $g(u_i)$               Social factor $g(f_i)$
Ad budget constraint ($b_j$)         Storage capacity constraint of the cache memory ($s_j$)
Following the arguments and analysis of Subsection 1.7.1, the files’ network graph can be embedded in the hyperbolic space. After this mapping the file allocation problem takes the following form:
1.8 VISUALIZATION ANALYTICS IN HYPERBOLIC SPACE
Visual analytics consists of analytical reasoning facilitated by the visual interface, integrating the analytic capabilities of the computer and the abilities of the human analyst. The visual analytics approach relies on interactive and integrated visualizations for exploratory data analysis in order to identify unexpected trends, outliers or patterns. By putting a human back into the loop to guide the analysis, interactive data visualizations have an important role to play, e.g., as in [41].
Large datasets challenge the ability to visualize, navigate and understand relationships among data. In general, displaying large collections of data