social graph. It considered the case where no underlying graph is released and, in fact, the owner of the network would like to keep the entire structure of the graph hidden from everyone. The goal of the adversary is not to de-anonymize particular individuals from that graph but to compromise the link privacy of as many individuals as possible. Specifically, the adversary determines the link structure of the graph based on the local neighborhood views of the graph from the perspective of several non-anonymous users. Analysis showed that the number of users that need to be compromised in order to cover a constant fraction of the entire network drops exponentially as the lookahead parameter 𝑙 provided by the network data owner increases. Here a network has lookahead 𝑙 if a registered user can see all the links and nodes incident to him within distance 𝑙 from him. For example, 𝑙 = 0 if a user can see exactly who he links to; 𝑙 = 1 if a user can see exactly the friends that he links to as well as the friends that his friends link to.
Each time the adversary gains access to a user account, he immediately covers all nodes that are at distance no more than the lookahead distance 𝑙 enabled by the social network. In other words, he learns about all the edges incident to these nodes. Thus, by gaining access to the account of user 𝑢, an adversary immediately covers all nodes that are within distance 𝑙 of 𝑢. Additionally, he learns about the existence of all nodes within distance 𝑙 + 1 from 𝑢. The authors studied several attack strategies, listed below.
Benchmark-Greedy: Among all users in the social network, pick the next user to bribe as the one whose perspective on the network gives the largest possible amount of new information. Formally, at each step the adversary picks the node covering the maximum number of nodes not yet covered.
Heuristically Greedy: Pick the next user to bribe as the one who can offer the largest possible amount of new information, according to some heuristic measure. For example, Degree-Greedy picks the next user to bribe as the one with the maximum unseen degree, i.e., its degree minus the number of edges incident to it already known by the adversary.
Highest-Degree: Bribe users in descending order of their degrees.
Random: Pick the users to bribe at random.
Crawler: Similar to the Heuristically Greedy strategy, but choose the next node to bribe only from the nodes already seen (within distance 𝑙 + 1 of some bribed node). One example is Degree-Greedy-Crawler, which picks, from all users already seen, the next user to bribe as the one with the maximum unseen degree.
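The coverage rule and two of the strategies above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the cited study: the adjacency-dictionary representation, the helper names (`within_distance`, `bribe_until`, `benchmark_greedy`, `highest_degree`), and the eight-node `toy_graph` are all assumptions made for the example.

```python
from collections import deque


def within_distance(graph, source, limit):
    """Return all nodes at hop distance <= limit from source (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == limit:
            continue  # do not expand beyond the lookahead radius
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


def bribe_until(graph, lookahead, target_fraction, choose_next):
    """Bribe users chosen by choose_next until the coverage goal is reached."""
    covered, bribed = set(), []
    goal = target_fraction * len(graph)
    while len(covered) < goal and len(bribed) < len(graph):
        user = choose_next(covered, bribed)
        bribed.append(user)
        # Bribing a user covers every node within the lookahead distance.
        covered |= within_distance(graph, user, lookahead)
    return bribed


def benchmark_greedy(graph, lookahead, target_fraction):
    """Always bribe the user who covers the most not-yet-covered nodes."""
    def choose_next(covered, bribed):
        return max(
            graph,
            key=lambda u: len(within_distance(graph, u, lookahead) - covered),
        )
    return bribe_until(graph, lookahead, target_fraction, choose_next)


def highest_degree(graph, lookahead, target_fraction):
    """Bribe users in descending order of degree."""
    order = sorted(graph, key=lambda u: -len(graph[u]))

    def choose_next(covered, bribed):
        return order[len(bribed)]
    return bribe_until(graph, lookahead, target_fraction, choose_next)


# Hypothetical toy graph: two small stars ("a" and "f") joined by a path.
toy_graph = {
    "a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a", "e"},
    "e": {"d", "f"}, "f": {"e", "g", "h"}, "g": {"f"}, "h": {"f"},
}

for l in (0, 1):
    print(l, benchmark_greedy(toy_graph, l, 1.0))
```

On this toy graph, full coverage with lookahead 0 requires bribing every user, while lookahead 1 requires only the two star centers. Benchmark-Greedy recomputes every neighborhood at each step, so this form is only practical on small graphs.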
Experiments on a 572,949-node friendship graph extracted from LiveJournal.com indicated that 1) Highest-Degree yields the best performance while Random performs the worst; 2) in order to obtain 80% coverage of the graph using lookahead 2, Highest-Degree needs to bribe 6,308 users, while it only needs to bribe 36 users to obtain the same coverage using lookahead 3. The authors suggested that, as a general rule, social network owners should refrain from permitting a lookahead higher than 2. Data owners may also want to decrease the vulnerability of the social network by not showing the exact number of connections that each user has, or by varying the lookahead available to users based on their trustworthiness.
7.2 Deriving Personal Identifying Information from Social Networking Sites
Online network users often publish their profiles as well as their connections, which contain vast amounts of personal and sometimes sensitive information (e.g., photo, birth date, phone number, current residence, various interests, and their friends). Gross and Acquisti [16] studied the privacy risk associated with these networks. A user’s profile information can be used to estimate the person’s social security number, which exposes him or her to identity theft. Their studies showed that only a small number of Facebook members change the default privacy preferences. As a result, users expose themselves to various physical and cyber risks, and make it extremely easy for third parties to create digital dossiers of their behavior. Their study quantified patterns of information revelation and inferred usage of privacy settings from actual field data.
8 Conclusion and Future Work
We surveyed recent studies on anonymization techniques for privacy-preserving publishing of social network data. The research and development of privacy-preserving social network analysis is still in its early stage compared with the much better studied privacy-preserving data analysis for tabular data. We revisited the naive anonymization approach and several structural attacks that can be exploited on naively anonymized graphs. We categorized the state-of-the-art anonymization methods on simple graphs into three main categories: 𝐾-anonymity based privacy preservation via edge modification, probabilistic privacy preservation via edge randomization, and privacy preservation via generalization. We then reviewed anonymization methods on rich graphs. Since social network data is more complicated than tabular data, privacy preservation in social networks is much more challenging than privacy preservation in tabular data. While ideas and methods can be borrowed from the well-studied privacy preservation in tabular data, many serious efforts are greatly needed due to new challenges (see Sections 1.2 and 1.3) associated with the network data. We present a set of recommendations for future research in this emerging area.
Develop privacy models for graphs and networks. Investigate how well different strategies protect privacy (identity, link privacy, and attribute privacy) when adversaries exploit various complex background knowledge in their attacks. How to model various kinds of background knowledge and quantify disclosures when complex attacks are used needs to be investigated.
Since preserving utility in the released graph is an important issue in privacy-preserving social network analysis, measures and methodologies need to be developed to quantify utility and information loss. It is important to develop workload-aware metrics that adequately quantify levels of information loss of graph data. Furthermore, various anonymization strategies need to be evaluated in terms of the tradeoff between privacy and utility.
Existing studies except [52] do not consider dynamic releases. Many applications of evolutionary networks and dynamic social network analysis require publishing data periodically to support dynamic analysis. The “one-time” released network data from existing anonymization methods cannot guarantee privacy when adversaries collect historical information from multiple releases.
Investigate distributed privacy-preserving social network analysis based on secure multi-party computation [43]. Distributed privacy-preserving data analysis on tabular data has been well studied (e.g., [29]; refer to the book [1] for surveys). However, distributed privacy-preserving social network analysis has not been well reported in the literature.
Create a benchmark graph data repository. Researchers can then compare and learn how different approaches work in terms of the privacy-utility tradeoff. The scalability issue needs to be studied, and empirical evaluations need to be conducted on large social networks.
Acknowledgments
Authors Wu and Ying were supported in part by U.S. National Science Foundation grants IIS-0546027 and CNS-0831204.
References
[1] C. C. Aggarwal and P. S. Yu. Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.
[2] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th Symposium on Principles of Database Systems, 2001.
[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000.
[4] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 181–190, New York, NY, USA, 2007. ACM Press.
[5] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44–54, New York, NY, USA, 2006. ACM.
[6] J. Baumes, M. K. Goldberg, M. Magdon-Ismail, and W. A. Wallace. Discovering hidden groups in communication networks. In ISI, pages 378–389, 2004.
[7] T. Y. Berger-Wolf and J. Saia. A framework for analysis of dynamic social networks. In KDD, pages 523–528, 2006.
[8] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. In Proc. of 35th International Conference on Very Large Data Base, 2009.
[9] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social networks. In PinKDD, 2008.
[10] D. Chakrabarti, C. Faloutsos, and M. McGlohon. Graph Mining: Laws and Generators. Springer, 2010.
[11] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. In Proc. of VLDB08, pages 833–844, 2008.
[12] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances In Physics, 56:167, 2007.
[13] S. Das, Ö. Egecioglu, and A. E. Abbadi. Anonymizing edge-weighted social network graphs. Technical report, UCSB CS, March 2009.
[14] A. Fast, D. Jensen, and B. N. Levine. Creating social networks to improve peer-to-peer networking. In KDD, pages 568–573, 2005.
[15] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99(12):7821–7826, June 2002.
[16] R. Gross and A. Acquisti. Information revelation and privacy in online social networks (the Facebook case). In Proceedings of the Workshop on Privacy in the Electronic Society, 2005.
[17] S. Guo, X. Wu, and Y. Li. Determining error bounds for spectral filtering based reconstruction methods in privacy preserving data mining. Knowl. Inf. Syst., 17(2):217–240, 2008.
[18] S. Hanhijarvi, G. C. Garriga, and K. Puolamaki. Randomization techniques for graphs. In Proc. of the 9th SIAM Conference on Data Mining, 2009.
[19] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. In VLDB, 2008.
[20] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizing social networks. University of Massachusetts Technical Report, 07-19, 2007.
[21] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In Proceedings of the ACM SIGMOD Conference on Management of Data, Baltimore, MD, 2005.
[22] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Proc. of the 3rd Int’l Conf. on Data Mining, pages 99–106, 2003.
[23] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In KDD, pages 137–146, 2003.
[24] J. M. Kleinberg. Challenges in mining social network data: processes, privacy, and paradoxes. In KDD, pages 4–5, 2007.
[25] Y. Koren, S. C. North, and C. Volinsky. Measuring and extracting proximity in networks. In KDD, pages 245–255, 2006.
[26] A. Korolova, R. Motwani, S. Nabar, and Y. Xu. Link privacy in social networks. In Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, 2008.
[27] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In KDD, pages 611–617, 2006.
[28] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 556–559, New York, NY, USA, 2003. ACM.
[29] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology (CRYPTO’00), pages 36–53. Springer-Verlag, 2000.
[30] K. Liu, K. Das, T. Grandison, and H. Kargupta. Privacy-preserving data analysis on graphs and social networks, 2008.
[31] K. Liu and E. Terzi. Towards identity anonymization on graphs. In Proceedings of the ACM SIGMOD Conference, Vancouver, Canada, 2008. ACM Press.
[32] L. Liu, J. Wang, J. Liu, and J. Zhang. Privacy preservation in social networks with sensitive edge weights. In SDM, pages 954–965, 2009.
[33] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. 𝑙-diversity: privacy beyond 𝑘-anonymity. In Proceedings of the IEEE ICDE Conference, 2006.
[34] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE Security & Privacy ’09, 2009.
[35] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 2003.
[36] A. Seary and W. Richards. Spectral methods for analyzing and visualizing networks: an introduction. National Research Council, Dynamic Social Network Modelling and Analysis: Workshop Summary and Papers, pages 209–228, 2003.
[37] M. Shiga, I. Takigawa, and H. Mamitsuka. A spectral clustering approach to optimally combining numerical vectors with a modular network. In KDD, pages 647–656, 2007.
[38] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the orkut social network. In KDD, pages 678–684, 2005.
[39] C. Tantipathananandh, T. Y. Berger-Wolf, and D. Kempe. A framework for community identification in dynamic social networks. In KDD, pages 717–726, 2007.
[40] S. White and P. Smyth. Algorithms for estimating relative importance in networks. In KDD, pages 266–275, 2003.
[41] L. Wu, X. Ying, and X. Wu. Reconstruction of randomized graph via low rank approximation. Technical report, UNC-Charlotte, SIS, 2009.
[42] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 139–150, September 2006.
[43] A. C. Yao. How to generate and exchange secrets. In SFCS ’86: Proceedings of the 27th Annual Symposium on Foundations of Computer Science, pages 162–167. IEEE Computer Society, 1986.
[44] X. Ying, K. Pan, X. Wu, and L. Guo. Comparisons of randomization and k-degree anonymization schemes for privacy preserving social network publishing. In SNA-KDD ’09: Proceedings of the 3rd SIGKDD Workshop on Social Network Mining and Analysis (SNA-KDD), 2009.
[45] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In Proc. of the 8th SIAM Conference on Data Mining, April 2008.
[46] X. Ying and X. Wu. Graph generation with prescribed feature constraints. In Proc. of the 9th SIAM Conference on Data Mining, 2009.
[47] X. Ying and X. Wu. On link privacy in randomizing social networks. In PAKDD, 2009.
[48] L. Zhang and W. Zhang. Edge anonymity in social graphs. In Proceedings of the 2009 International Conference on Social Computing, 2009.
[49] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. In PinKDD, pages 153–171, 2007.
[50] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In IEEE 24th International Conference on Data Engineering, pages 506–515, 2008.
[51] B. Zhou, J. Pei, and W.-S. Luk. A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explorations, 10(2), 2009.
[52] L. Zou, L. Chen, and M. T. Özsu. K-automorphism: A general framework for privacy preserving network publication. In Proc. of 35th International Conference on Very Large Data Base, 2009.
A SURVEY OF GRAPH MINING FOR WEB APPLICATIONS
Debora Donato
Yahoo! Research
Avd Diagonal 177, Barcelona, Spain
debora@yahoo-inc.com
Aristides Gionis
Yahoo! Research
Avd Diagonal 177, Barcelona, Spain
gionis@yahoo-inc.com
Abstract Graph structures provide a general framework for modeling entities and their relationships, and they are routinely used to describe a wide variety of data such as the Internet, the web, social networks, metabolic networks, protein-interaction networks, food webs, citation networks, and many more. In recent years, there has been an increasing amount of literature on studying properties, models, and algorithms for graph data. In this chapter we provide a brief overview of graph-mining algorithms for web and social-media applications. We review a wide range of algorithms, such as those for estimating reputation and popularity of items in a network, mining query logs, and performing query recommendations. The main goal of the chapter is to provide the reader with an understanding of how graph structural mining algorithms can be exploited in the context of web applications. This highlights the challenges of, and provides an understanding of the power of, graph mining in the context of web and social-media applications.
Keywords: Graph Mining, Link Mining, Web Mining, Social Network Analysis, World
Wide Web, Query-Log Mining, Query Recommendation
C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_15, © Springer Science+Business Media, LLC 2010
1 Introduction
Graph mining has been widely used to study relationships among various types of entities. Real-world graphs are also referred to as networks, and the interactions between the entities represented in the networks are modeled as links. The problems of studying the properties of real-world networks, designing algorithms for mining such networks, and developing applications on top of network data have been of increasing interest in the past few years. This has led to the birth of a very active area of scientific research, which is known as analysis of complex networks [7, 16, 55].
One of the most pervasive properties of real-world networks is the emergence of power-law distributions that tend to characterize many of the networks’ statistical properties [6, 26]. Power laws have attracted the interest of researchers, who have proposed various models that attempt to explain the presence of power-law distributions in real graphs. For examples of such models, see [6, 25, 40].
In this chapter, we deviate from the classical exposition of properties and generative models for complex networks, and we focus on graph-mining applications that appear in the context of the web and social media. Such graphs include data that model the interaction of users in a social network. For example, this may correspond to comments of users in a blog, user activity in a question-answering portal, or query-log data that summarize the interaction of users with a search engine. Understanding the structure of such graphs, modeling the complex interactions between entities, and designing algorithms for leveraging the latent knowledge (also known as the wisdom of the crowds) in those graphs introduce new challenges in the field of graph mining. One important difference from networks that have been previously studied is that in social-media and web-usage graphs the links represent many different types of interactions and activities among nodes. For instance, in a question-answering portal, users ask questions, answer questions for other users, vote for favorite answers and interesting questions, assign answers to categories of a hierarchy, and much more. Hence graphs from such applications are characterized by having different types of nodes and a high degree of heterogeneity in the types of interactions among nodes. Consequently, algorithms and methodologies widely applied in the web and other complex networks have to be adapted to this new multifaceted scenario, in a way that accounts for the different meanings that are implicitly or explicitly captured by each link.
This chapter is organized as follows. In Section 2 we briefly introduce measures and algorithms that have been extensively used as basic tools for graph mining. Then we focus on two different areas of graph mining in the context of social-media and web applications. In Section 3, we review techniques for identifying items of high quality in social-media networks. We discuss two concrete examples: (1) predicting the number of citations of authors in a bibliographic data set, and (2) finding high-quality items in a question-answering system. In both cases, the examples rely on adapting link-mining algorithms for computing authoritativeness scores in linked environments. In Section 4 we discuss algorithms for mining graph structures that represent information collected in the query logs of search engines. We first discuss various graph representations of query logs, and then discuss how to use these representations in order to perform the task of query recommendation. The conclusions are presented in Section 5.
2 Preliminaries
An undirected graph 𝒢 = (𝑉, 𝐸) consists of a set of nodes 𝑉, also called vertices, and a set 𝐸 of pairs of distinct nodes, which are called edges or arcs. A directed graph, or digraph, is distinguished from the undirected version by the fact that its edges are ordered pairs of nodes. In an undirected graph, the degree of a node is the number of edges incident to it. For a directed graph, we define the in-degree and the out-degree of a node to be the number of incoming and outgoing edges, respectively.
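As a small illustration of these definitions, the following Python sketch counts degrees from an edge list. The edge-list representation, the function names, and the three-edge example are assumptions made for this sketch only.

```python
from collections import Counter


def degrees_undirected(edges):
    """Degree of each node: number of edges incident to it."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def degrees_directed(edges):
    """In-degree and out-degree of each node; an edge (u, v) points from u to v."""
    in_degree, out_degree = Counter(), Counter()
    for u, v in edges:
        out_degree[u] += 1  # the edge leaves u
        in_degree[v] += 1   # the edge enters v
    return in_degree, out_degree


# Hypothetical three-edge example, read once as undirected and once as directed.
sample_edges = [("a", "b"), ("a", "c"), ("b", "c")]
print(degrees_undirected(sample_edges))
print(degrees_directed(sample_edges))
```

A `Counter` returns 0 for nodes that never appear in the relevant position, so a node with no incoming edges simply has in-degree 0.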
In an undirected graph 𝒢, a set of nodes 𝑆 forms a connected component (CC) if for every pair of nodes 𝑢, 𝑣 ∈ 𝑆 there exists a path from 𝑢 to 𝑣 (which is also a path from 𝑣 to 𝑢). In a directed graph 𝒢, a set of nodes 𝑆 forms a strongly connected component (SCC) if for every pair of nodes 𝑢, 𝑣 ∈ 𝑆 there exists a (directed) path from 𝑢 to 𝑣, and a path from 𝑣 to 𝑢. A set of nodes 𝑆 forms a weakly connected component (WCC) if and only if the set 𝑆 is a connected component in the undirected graph 𝒢_𝑢 that is obtained by ignoring the directionality of the edges in 𝒢.
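The two notions can be contrasted on a small directed graph. The sketch below is one straightforward way to compute them, assuming maximal components are wanted: the weak component of a node is everything reachable when edge directions are ignored, and its strong component is the set of nodes reachable from it that can also reach it. The function names and the example graph are hypothetical, and this quadratic-time approach is chosen for clarity rather than efficiency.

```python
from collections import defaultdict


def reachable(adj, start):
    """All nodes reachable from start by following edges in adj (iterative DFS)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return seen


def components(nodes, edges):
    """Return (weakly connected components, strongly connected components)."""
    forward, backward, undirected = defaultdict(set), defaultdict(set), defaultdict(set)
    for u, v in edges:
        forward[u].add(v)
        backward[v].add(u)
        undirected[u].add(v)  # ignore direction for the weak components
        undirected[v].add(u)

    def partition(component_of):
        parts, assigned = [], set()
        for node in nodes:
            if node not in assigned:
                part = component_of(node)
                parts.append(part)
                assigned |= part
        return parts

    wcc = partition(lambda n: reachable(undirected, n))
    # u and v are strongly connected iff v is reachable from u and u from v.
    scc = partition(lambda n: reachable(forward, n) & reachable(backward, n))
    return wcc, scc


# Hypothetical example: a directed 3-cycle a -> b -> c -> a, a tail c -> d,
# and an isolated node e.
sample_nodes = ["a", "b", "c", "d", "e"]
sample_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
wcc, scc = components(sample_nodes, sample_edges)
print(wcc)
print(scc)
```

Here node d belongs to the same weak component as the cycle but forms a strong component of its own, because no directed path leads from d back to the cycle.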
Power laws and scale-free networks. Power-law distributions ubiquitously characterize real-world networks. We say that a discrete random variable 𝑋 follows a power-law distribution if the probability distribution is defined for each discrete value 𝑘 as follows:
Pr[𝑋 = 𝑘] ∝ 𝑘^{−𝛾}
The value 𝛾 is called the exponent of the power law. We assume that 𝛾 ≥ 0. Detailed surveys on power laws may be found in [45] and [46].
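If a random variable 𝑋 follows a power-law distribution, then the tail of the distribution looks the same at every scale: for 𝑘 ≥ 𝑚, the conditional probability Pr[𝑋 ≥ 𝑘 ∣ 𝑋 ≥ 𝑚] depends (up to discretization effects) only on the ratio 𝑘/𝑚, so Pr[𝑋 ≥ 𝑐𝑘 ∣ 𝑋 ≥ 𝑐𝑚] is approximately the same as Pr[𝑋 ≥ 𝑘 ∣ 𝑋 ≥ 𝑚] for any scaling factor 𝑐 > 0. In other words, conditioning on the size does not yield any additional information about the shape of the distribution. For this reason, networks that have attributes that follow a power-law distribution are also called scale-free networks.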
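One way to make the scale-free property concrete is to check numerically that the conditional tail of a power law depends essentially only on the ratio between the two thresholds. The sketch below assumes an exponent 𝛾 = 2.5 and truncates the infinite sum at a large cutoff; the function names, the exponent, and the cutoff are choices made for this illustration.

```python
def tail(k, gamma, n_max=200_000):
    """Unnormalized Pr[X >= k] for Pr[X = j] proportional to j**(-gamma).

    The infinite sum is truncated at n_max, which is an approximation.
    """
    return sum(j ** -gamma for j in range(k, n_max + 1))


def conditional_tail(k, m, gamma):
    """Pr[X >= k | X >= m] for k >= m; the normalizing constant cancels."""
    return tail(k, gamma) / tail(m, gamma)


gamma = 2.5
small_scale = conditional_tail(200, 100, gamma)    # thresholds in ratio 2
large_scale = conditional_tail(2000, 1000, gamma)  # same ratio, 10x larger
print(small_scale, large_scale, 2 ** -(gamma - 1))
```

Both conditional probabilities come out close to 2^{−(𝛾−1)}, the value suggested by the continuous approximation Pr[𝑋 ≥ 𝑘] ∝ 𝑘^{−(𝛾−1)}. The small remaining differences are due to discretization and to truncating the sum.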
Degree and Assortativeness. The degree of the nodes of a graph can be of
great interest in social-media applications The out-degree of a node might