social graph. It considered the case where no underlying graph is released and, in fact, the owner of the network would like to keep the entire structure of the graph hidden from everyone. The goal of the adversary is not to de-anonymize particular individuals in that graph but to compromise the link privacy of as many individuals as possible. Specifically, the adversary determines the link structure of the graph from the local neighborhood views of several non-anonymous users.

The analysis showed that the number of users that must be compromised in order to cover a constant fraction of the entire network drops exponentially as the lookahead parameter 𝑙 provided by the network data owner increases. Here a network has lookahead 𝑙 if a registered user can see all the links and nodes within distance 𝑙 of him. For example, 𝑙 = 0 if a user can see exactly who he links to; 𝑙 = 1 if a user can see exactly the friends that he links to as well as the friends that his friends link to.

Each time the adversary gains access to a user account, he immediately covers all nodes that are at distance no more than the lookahead distance 𝑙 enabled by the social network; in other words, he learns about all the edges incident to these nodes. Thus, by gaining access to the account of user 𝑢, an adversary immediately covers all nodes that are within distance 𝑙 of 𝑢. Additionally, he learns about the existence of all nodes within distance 𝑙 + 1 of 𝑢. The authors studied the following attack strategies:

Benchmark-Greedy: Among all users in the social network, pick the next user to bribe as the one whose perspective on the network gives the largest possible amount of new information. Formally, at each step the adversary picks the node covering the maximum number of nodes not yet covered.

Heuristically Greedy: Pick the next user to bribe as the one who can offer the largest possible amount of new information, according to some heuristic measure. For example, Degree-Greedy picks the next user to bribe as the one with the maximum unseen degree, i.e., its degree minus the number of edges incident to it already known by the adversary.

Highest-Degree: Bribe users in descending order of their degrees.

Random: Pick the users to bribe at random.

Crawler: Similar to the Heuristically Greedy strategy, but choose the next node to bribe only from the nodes already seen (within distance 𝑙 + 1 of some bribed node). One example is Degree-Greedy-Crawler, which picks, from all users already seen, the next user to bribe as the one with the maximum unseen degree.
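The coverage notion and two of the strategies above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the adjacency-dictionary representation, the function names (`within_distance`, `highest_degree_order`, `benchmark_greedy`), the toy graph, and the choice to measure coverage by counting covered nodes are all assumptions made for the example.

```python
from collections import deque

def within_distance(adj, source, limit):
    """Return the set of nodes at hop distance <= limit from source (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == limit:
            continue  # do not expand beyond the lookahead radius
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)

def highest_degree_order(adj):
    """Highest-Degree: bribe users in descending order of degree (ties by label)."""
    return sorted(adj, key=lambda u: (-len(adj[u]), u))

def benchmark_greedy(adj, lookahead, target_fraction):
    """Benchmark-Greedy: repeatedly bribe the node that covers the most uncovered nodes.

    Assumes 0 < target_fraction <= 1 and sortable node labels.
    """
    cover_of = {u: within_distance(adj, u, lookahead) for u in adj}
    covered, bribed = set(), []
    goal = target_fraction * len(adj)
    while len(covered) < goal:
        # max() keeps the first maximum, so ties go to the smallest label.
        best = max(sorted(adj), key=lambda u: len(cover_of[u] - covered))
        bribed.append(best)
        covered |= cover_of[best]
    return bribed, covered

# Hypothetical toy graph: a small star (center 0) attached to a short chain.
toy_adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3, 5], 5: [4, 6], 6: [5]}

bribed_l1, _ = benchmark_greedy(toy_adj, lookahead=1, target_fraction=1.0)
bribed_l3, _ = benchmark_greedy(toy_adj, lookahead=3, target_fraction=1.0)
# On this toy graph, lookahead 1 needs two bribes and lookahead 3 needs one.
```

Even on such a small graph, a larger lookahead reduces the number of accounts needed for full coverage, which is the qualitative effect reported in the analysis.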

Experiments on a 572,949-node friendship graph extracted from LiveJournal.com indicated that (1) Highest-Degree yields the best performance while Random performs the worst; and (2) to obtain 80% coverage of the graph using lookahead 2, Highest-Degree needs to bribe 6,308 users, whereas it needs to bribe only 36 users to obtain the same coverage using lookahead 3. The authors suggested that, as a general rule, social network owners should refrain from permitting a lookahead higher than 2. Data owners may also want to decrease the vulnerability of the social network by not showing the exact number of connections that each user has, or by varying the lookahead available to users based on their trustworthiness.

7.2 Deriving Personal Identifying Information from Social Networking Sites

Online network users often publish their profiles as well as their connections, which contain vast amounts of personal and sometimes sensitive information (e.g., photo, birth date, phone number, current residence, various interests, and their friends). Acquisti and Gross [16] studied the privacy risk associated with these networks. A user’s profile information can be used to estimate the person’s social security number, exposing him or her to identity theft. Their studies showed that only a small number of Facebook members change the default privacy preferences. As a result, users expose themselves to various physical and cyber risks, and make it extremely easy for third parties to create digital dossiers of their behavior. Their study quantified patterns of information revelation and inferred usage of privacy settings from actual field data.

8 Conclusion and Future Work

We surveyed recent studies on anonymization techniques for privacy-preserving publishing of social network data. The research and development of privacy-preserving social network analysis is still in its early stage compared with the much better studied privacy-preserving data analysis for tabular data. We revisited the naive anonymization approach and several structural attacks that can be mounted against naively anonymized graphs. We categorized the state-of-the-art anonymization methods on simple graphs into three main categories: 𝐾-anonymity based privacy preservation via edge modification, probabilistic privacy preservation via edge randomization, and privacy preservation via generalization. We then reviewed anonymization methods on rich graphs. Since social network data is more complicated than tabular data, privacy preservation in social networks is much more challenging than privacy preservation in tabular data. While ideas and methods can be borrowed from the well-studied privacy preservation in tabular data, serious efforts are still greatly needed because of the new challenges (see Sections 1.2 and 1.3) associated with network data. We present a set of recommendations for future research in this emerging area.

Develop privacy models for graphs and networks. Investigate how well different strategies protect privacy (identity, link, and attribute privacy) when adversaries exploit various kinds of complex background knowledge in their attacks. How to model such background knowledge and how to quantify disclosure under complex attacks need to be investigated.

Since preserving utility in the released graph is an important issue in privacy-preserving social network analysis, measures and methodologies need to be developed to quantify utility and information loss. It is important to develop workload-aware metrics that adequately quantify the information loss of graph data. Furthermore, various anonymization strategies need to be evaluated in terms of the tradeoff between privacy and utility.

Existing studies, except [52], do not consider dynamic releases. Many applications of evolutionary networks and dynamic social network analysis require publishing data periodically to support dynamic analysis. The “one-time” released network data from existing anonymization methods cannot guarantee privacy when adversaries collect historical information from multiple releases.

Develop distributed privacy-preserving social network analysis based on secure multi-party computation [43]. Distributed privacy-preserving data analysis on tabular data has been well studied (e.g., [29]; refer to the book [1] for surveys). However, distributed privacy-preserving social network analysis has not been well reported in the literature.

Create a benchmark graph data repository so that researchers can compare and learn how different approaches work in terms of the privacy-utility tradeoff. Scalability also needs to be studied, and empirical evaluations need to be conducted on large social networks.

Acknowledgments

Authors Wu and Ying were supported in part by U.S. National Science Foundation grants IIS-0546027 and CNS-0831204.

References

[1] C. C. Aggarwal and P. S. Yu. Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.

[2] D. Agrawal and C. Agrawal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th Symposium on Principles of Database Systems, 2001.

[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000.

[4] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pages 181–190, New York, NY, USA, 2007. ACM Press.

[5] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 44–54, New York, NY, USA, 2006. ACM.

[6] J. Baumes, M. K. Goldberg, M. Magdon-Ismail, and W. A. Wallace. Discovering hidden groups in communication networks. In ISI, pages 378–389, 2004.

[7] T. Y. Berger-Wolf and J. Saia. A framework for analysis of dynamic social networks. In KDD, pages 523–528, 2006.

[8] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. In Proc. of the 35th International Conference on Very Large Data Bases, 2009.

[9] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social networks. In PinKDD, 2008.

[10] D. Chakrabarti, C. Faloutsos, and M. McGlohon. Graph Mining: Laws and Generators. Springer, 2010.

[11] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. In Proc. of VLDB08, pages 833–844, 2008.

[12] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56:167, 2007.

[13] S. Das, Ö. Egecioglu, and A. E. Abbadi. Anonymizing edge-weighted social network graphs. Technical report, UCSB CS, March 2009.

[14] A. Fast, D. Jensen, and B. N. Levine. Creating social networks to improve peer-to-peer networking. In KDD, pages 568–573, 2005.

[15] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99(12):7821–7826, June 2002.

[16] R. Gross and A. Acquisti. Information revelation and privacy in online social networks (the Facebook case). In Proceedings of the Workshop on Privacy in the Electronic Society, 2005.

[17] S. Guo, X. Wu, and Y. Li. Determining error bounds for spectral filtering based reconstruction methods in privacy preserving data mining. Knowl. Inf. Syst., 17(2):217–240, 2008.

[18] S. Hanhijarvi, G. C. Garriga, and K. Puolamaki. Randomization techniques for graphs. In Proc. of the 9th SIAM Conference on Data Mining, 2009.

[19] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. In VLDB, 2008.

[20] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizing social networks. University of Massachusetts Technical Report 07-19, 2007.

[21] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In Proceedings of the ACM SIGMOD Conference on Management of Data, Baltimore, MD, 2005.

[22] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Proc. of the 3rd Int’l Conf. on Data Mining, pages 99–106, 2003.

[23] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In KDD, pages 137–146, 2003.

[24] J. M. Kleinberg. Challenges in mining social network data: processes, privacy, and paradoxes. In KDD, pages 4–5, 2007.

[25] Y. Koren, S. C. North, and C. Volinsky. Measuring and extracting proximity in networks. In KDD, pages 245–255, 2006.

[26] A. Korolova, R. Motwani, S. Nabar, and Y. Xu. Link privacy in social networks. In Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, 2008.

[27] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In KDD, pages 611–617, 2006.

[28] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM ’03: Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 556–559, New York, NY, USA, 2003. ACM.

[29] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology (CRYPTO’00), pages 36–53. Springer-Verlag, 2000.

[30] K. Liu, K. Das, T. Grandison, and H. Kargupta. Privacy-preserving data analysis on graphs and social networks, 2008.

[31] K. Liu and E. Terzi. Towards identity anonymization on graphs. In Proceedings of the ACM SIGMOD Conference, Vancouver, Canada, 2008. ACM Press.

[32] L. Liu, J. Wang, J. Liu, and J. Zhang. Privacy preservation in social networks with sensitive edge weights. In SDM, pages 954–965, 2009.

[33] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. 𝑙-diversity: privacy beyond 𝑘-anonymity. In Proceedings of the IEEE ICDE Conference, 2006.

[34] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE Security & Privacy ’09, 2009.

[35] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 2003.

[36] A. Seary and W. Richards. Spectral methods for analyzing and visualizing networks: an introduction. National Research Council, Dynamic Social Network Modelling and Analysis: Workshop Summary and Papers, pages 209–228, 2003.

[37] M. Shiga, I. Takigawa, and H. Mamitsuka. A spectral clustering approach to optimally combining numerical vectors with a modular network. In KDD, pages 647–656, 2007.

[38] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the orkut social network. In KDD, pages 678–684, 2005.

[39] C. Tantipathananandh, T. Y. Berger-Wolf, and D. Kempe. A framework for community identification in dynamic social networks. In KDD, pages 717–726, 2007.

[40] S. White and P. Smyth. Algorithms for estimating relative importance in networks. In KDD, pages 266–275, 2003.

[41] L. Wu, X. Ying, and X. Wu. Reconstruction of randomized graph via low rank approximation. Technical report, UNC-Charlotte, SIS, 2009.

[42] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 139–150, September 2006.

[43] A. C. Yao. How to generate and exchange secrets. In SFCS ’86: Proceedings of the 27th Annual Symposium on Foundations of Computer Science, pages 162–167. IEEE Computer Society, 1986.

[44] X. Ying, K. Pan, X. Wu, and L. Guo. Comparisons of randomization and k-degree anonymization schemes for privacy preserving social network publishing. In SNA-KDD ’09: Proceedings of the 3rd SIGKDD Workshop on Social Network Mining and Analysis (SNA-KDD), 2009.

[45] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In Proc. of the 8th SIAM Conference on Data Mining, April 2008.

[46] X. Ying and X. Wu. Graph generation with prescribed feature constraints. In Proc. of the 9th SIAM Conference on Data Mining, 2009.

[47] X. Ying and X. Wu. On link privacy in randomizing social networks. In PAKDD, 2009.

[48] L. Zhang and W. Zhang. Edge anonymity in social graphs. In Proceedings of the 2009 International Conference on Social Computing, 2009.

[49] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. In PinKDD, pages 153–171, 2007.

[50] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In IEEE 24th International Conference on Data Engineering, pages 506–515, 2008.

[51] B. Zhou, J. Pei, and W.-S. Luk. A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explorations, 10(2), 2009.

[52] L. Zou, L. Chen, and M. T. Özsu. K-automorphism: A general framework for privacy preserving network publication. In Proc. of the 35th International Conference on Very Large Data Bases, 2009.

A SURVEY OF GRAPH MINING FOR WEB APPLICATIONS

Debora Donato

Yahoo! Research

Avd Diagonal 177, Barcelona, Spain

debora@yahoo-inc.com

Aristides Gionis

Yahoo! Research

Avd Diagonal 177, Barcelona, Spain

gionis@yahoo-inc.com

Abstract: Graph structures provide a general framework for modeling entities and their relationships, and they are routinely used to describe a wide variety of data such as the Internet, the web, social networks, metabolic networks, protein-interaction networks, food webs, citation networks, and many more. In recent years, there has been an increasing amount of literature on studying properties, models, and algorithms for graph data. In this chapter we provide a brief overview of graph-mining algorithms for web and social-media applications. We review a wide range of algorithms, such as those for estimating reputation and popularity of items in a network, mining query logs, and performing query recommendations. The main goal of the chapter is to provide the reader with an understanding of how graph structural mining algorithms can be exploited in the context of web applications. This highlights the challenges of, and provides an understanding of the power of, graph mining in the context of web and social-media applications.

Keywords: Graph Mining, Link Mining, Web Mining, Social Network Analysis, World Wide Web, Query-Log Mining, Query Recommendation

C. C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_15

1 Introduction

Graph mining has been widely used to study relationships among various types of entities. Real-world graphs are also referred to as networks, and the interactions between the entities represented in the networks are modeled as links. The problems of studying the properties of real-world networks, designing algorithms for mining such networks, and developing applications on top of network data have been of increasing interest in the past few years. This has led to the birth of a very active area of scientific research, known as analysis of complex networks [7, 16, 55].

One of the most pervasive properties of real-world networks is the emergence of power-law distributions, which tend to characterize many of the networks’ statistical properties [6, 26]. Power laws have intrigued researchers, who have proposed various models that attempt to explain the presence of power-law distributions in real graphs. For examples of such models, see [6, 25, 40].

In this chapter, we deviate from the classical exposition of properties and generative models for complex networks, and we focus on graph-mining applications that appear in the context of the web and social media. Such graphs include data that model the interaction of users in a social network. For example, this may correspond to comments of users in a blog, user activity in a question-answering portal, or query-log data that summarize the interaction of users with a search engine. Understanding the structure of such graphs, modeling the complex interactions between entities, and designing algorithms for leveraging the latent knowledge (also known as the wisdom of the crowds) in those graphs introduce new challenges in the field of graph mining. One important difference from networks that have been previously studied is that in social-media and web-usage graphs the links represent many different types of interactions and activities among nodes. For instance, in a question-answering portal, users ask questions, answer questions for other users, vote for favorite answers and interesting questions, assign answers to categories of a hierarchy, and much more. Hence graphs from such applications are characterized by different types of nodes and a high degree of heterogeneity in the types of interactions among nodes. Consequently, algorithms and methodologies widely applied in the web and other complex networks have to be adapted to this new multifaceted scenario, which allows for the different meanings that are implicitly or explicitly captured by each link.

This chapter is organized as follows. In Section 2 we briefly introduce measures and algorithms that have been extensively used as basic tools for graph mining. Then we focus on two different areas of graph mining in the context of social-media and web applications. In Section 3, we review techniques for identifying items of high quality in social-media networks. We discuss two concrete examples: (1) predicting the number of citations of authors in a bibliographic data set, and (2) finding high-quality items in a question-answering system. In both cases, the examples rely on adapting link-mining algorithms for computing authoritativeness scores in linked environments. In Section 4 we discuss algorithms for mining graph structures that represent information collected in the query logs of search engines. We first discuss various graph representations of query logs, and then discuss how to use these representations to perform the task of query recommendation. The conclusions are presented in Section 5.

2 Preliminaries

An undirected graph 𝒢 = (𝑉, 𝐸) consists of a set of nodes 𝑉, also called vertices, and a set 𝐸 of pairs of distinct nodes, which are called edges or arcs. A directed graph, or digraph, is distinguished from the undirected version by the fact that its edges are ordered pairs of nodes. In an undirected graph, the degree of a node is the number of edges incident to it. For a directed graph, we define the in-degree and the out-degree of a node to be the number of incoming and outgoing edges, respectively.

In an undirected graph 𝒢, a set of nodes 𝑆 forms a connected component (CC) if for every pair of nodes 𝑢, 𝑣 ∈ 𝑆 there exists a path from 𝑢 to 𝑣 (which is also a path from 𝑣 to 𝑢). In a directed graph 𝒢, a set of nodes 𝑆 forms a strongly connected component (SCC) if for every pair of nodes 𝑢, 𝑣 ∈ 𝑆 there exists a (directed) path from 𝑢 to 𝑣 and a path from 𝑣 to 𝑢. A set of nodes 𝑆 forms a weakly connected component (WCC) if and only if the set 𝑆 is a connected component in the undirected graph 𝒢_𝑢 that is obtained by ignoring the directionality of the edges in 𝒢.
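The definitions above translate fairly directly into code. The following Python sketch is illustrative only: the edge-list representation, the function names (`degrees`, `weakly_connected_components`, `is_strongly_connected_set`), and the toy graph are assumptions made for this example, and the strong-connectivity check tests the pairwise-reachability condition for a given non-empty node set rather than enumerating maximal components.

```python
from collections import defaultdict

def degrees(edges):
    """In-degree and out-degree of every node touched by a directed edge list."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return dict(in_deg), dict(out_deg)

def reach(succ, start):
    """All nodes reachable from start by following the given successor map."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def weakly_connected_components(nodes, edges):
    """Connected components of the graph obtained by ignoring edge direction."""
    undirected = {u: set() for u in nodes}
    for u, v in edges:
        undirected[u].add(v)
        undirected[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start not in seen:
            component = reach(undirected, start)
            seen |= component
            components.append(component)
    return components

def is_strongly_connected_set(subset, edges):
    """True if every pair in the non-empty subset is mutually reachable."""
    forward, backward = defaultdict(set), defaultdict(set)
    for u, v in edges:
        forward[u].add(v)
        backward[v].add(u)
    root = next(iter(subset))
    # u reaches v via root whenever root reaches everyone and everyone reaches root.
    return subset <= reach(forward, root) and subset <= reach(backward, root)

# Hypothetical toy digraph: a directed 3-cycle, one dangling node, one isolated node.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
```

On the toy digraph, the cycle nodes form a strongly connected set, while the dangling node joins them only in the weakly connected component, because it cannot reach back into the cycle.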

Power laws and scale-free networks. Power-law distributions ubiquitously characterize real-world networks. We say that a discrete random variable 𝑋 follows a power-law distribution if its probability distribution is defined for each discrete value 𝑘 as follows:

Pr[𝑋 = 𝑘] ∝ 𝑘^{−𝛾}

The value 𝛾 is called the exponent of the power law. We assume that 𝛾 ≥ 0. Detailed surveys on power laws may be found in [45] and [46].
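As a small illustration of the definition, the sketch below builds a normalized power-law distribution and checks that it is a straight line of slope −𝛾 on a log-log plot. The truncation of the support to {1, ..., k_max} is an assumption made here so that the normalizing constant is a finite sum; the function names are likewise hypothetical.

```python
import math

def power_law_pmf(gamma, k_max):
    """Pr[X = k] proportional to k**(-gamma) on the assumed finite support 1..k_max."""
    weights = [k ** (-gamma) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_log_slope(pmf):
    """Least-squares slope of log Pr[X = k] against log k."""
    xs = [math.log(k) for k in range(1, len(pmf) + 1)]
    ys = [math.log(p) for p in pmf]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

pmf = power_law_pmf(gamma=2.5, k_max=1000)
slope = log_log_slope(pmf)  # equals -gamma up to floating-point error
```

For empirical degree data the points only approximately follow a line, and a plain least-squares fit on the log-log plot is known to be a crude estimator of the exponent; it is used here only because the input is an exact power law.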

If a random variable 𝑋 follows a power-law distribution, then the conditional probability Pr[𝑋 ≥ 𝑘 ∣ 𝑋 ≥ 𝑚] has, up to rescaling by 𝑚, the same form as Pr[𝑋 ≥ 𝑘]: to a good approximation it depends only on the ratio 𝑘/𝑚. In other words, conditioning on the size does not change the shape of the distribution. For this reason, networks that have attributes that follow a power-law distribution are also called scale-free networks.
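One way to make the scale-free property concrete is to work out the conditional tail in the continuous analogue of the power law. The derivation below is a sketch under assumptions that are not stated in the text: a density proportional to $x^{-\gamma}$ on $x \ge x_{\min} > 0$, with $\gamma > 1$ so that the density can be normalized.

```latex
\begin{align*}
\Pr[X \ge x] &= \int_x^{\infty} \frac{\gamma - 1}{x_{\min}}
   \left(\frac{t}{x_{\min}}\right)^{-\gamma} dt
   = \left(\frac{x}{x_{\min}}\right)^{-(\gamma - 1)},
   && x \ge x_{\min}, \\
\Pr[X \ge k \mid X \ge m] &= \frac{\Pr[X \ge k]}{\Pr[X \ge m]}
   = \left(\frac{k}{m}\right)^{-(\gamma - 1)},
   && k \ge m \ge x_{\min}.
\end{align*}
```

Under these assumptions the conditional tail depends on 𝑘 and 𝑚 only through the ratio 𝑘/𝑚, so rescaling the threshold leaves the form of the distribution unchanged. For a discrete power law the same statement holds only approximately.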

Degree and Assortativeness. The degree of the nodes of a graph can be of great interest in social-media applications. The out-degree of a node might
