Nghiên cứu xây dựng một số giải pháp đảm bảo an toàn thông tin trong quá trình khai phá dữ liệu (Research on developing solutions to ensure information security in the data mining process)

130 955 1
Tài liệu đã được kiểm tra trùng lặp

Đang tải... (xem toàn văn)

Tài liệu hạn chế xem trước, để xem đầy đủ mời bạn chọn Tải xuống

DOCUMENT INFORMATION

Basic information

Title: Distributed Solutions In Privacy Preserving Data Mining
Institution: Hanoi University of Science and Technology
Major: Data Privacy and Security
Document type: Graduation project
Year of publication: 2011
City: Hanoi
Number of pages: 130
File size: 593.48 KB


MINISTRY OF EDUCATION AND TRAINING      MINISTRY OF NATIONAL DEFENCE

MILITARY INSTITUTE OF SCIENCE AND TECHNOLOGY

DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING

(Nghiên cứu xây dựng một số giải pháp đảm bảo an toàn thông tin trong quá trình khai phá dữ liệu)

DOCTORAL THESIS IN MATHEMATICS

Specialty: Mathematical foundations for computers and computing systems


I promise that this thesis is a presentation of my original research work. Any of the content was written based on reliable references, such as papers published in distinguished international conferences and journals, and books published by widely-known publishers. The results and discussions of the thesis are new and have not previously been published by any other authors.


1.1 Privacy-preserving data mining: An overview 1

1.2 Objectives and contributions 5

1.3 Related works 7

1.4 Organization of thesis 12

2 METHODS FOR SECURE MULTI-PARTY COMPUTATION 13

2.1 Definitions 13

2.1.1 Computational indistinguishability 13

2.1.2 Secure multi-party computation 14

2.2 Secure computation 15

2.2.1 Secret sharing 15

2.2.2 Secure sum computation 16

2.2.3 Probabilistic public key cryptosystems 17

2.2.4 Variant ElGamal Cryptosystem 18

2.2.5 Oblivious polynomial evaluation 20

2.2.6 Secure scalar product computation 21

2.2.7 Privately computing ln x 22

3 PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING 24

3.1 Introduction 24

3.2 Privacy preserving frequency mining in 2PFD setting 27

3.2.1 Problem formulation 27

3.2.2 Definition of privacy 29

3.2.3 Frequency mining protocol 30


3.2.4 Correctness Analysis 32

3.2.5 Privacy Analysis 34

3.2.6 Efficiency of frequency mining protocol 37

3.3 Privacy Preserving Frequency-based Learning in 2PFD Setting 38

3.3.1 Naive Bayes learning problem in 2PFD setting 38

3.3.2 Naive Bayes learning Protocol 40

3.3.3 Correctness and privacy analysis 42

3.3.4 Efficiency of naive Bayes learning protocol 42

3.4 An improvement of frequency mining protocol 44

3.4.1 Improved frequency mining protocol 44

3.4.2 Protocol Analysis 45

3.5 Conclusion 46

4 ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY DISTRIBUTED DATA 49

4.1 Introduction 49

4.2 Problem formulation 51

4.2.1 Association rules and frequent itemset 51

4.2.2 Frequent itemset identifying in vertically distributed data 52

4.3 Computational and privacy model 53

4.4 Support count preserving protocol 54

4.4.1 Overview 54

4.4.2 Protocol design 56

4.4.3 Correctness Analysis 57

4.4.4 Privacy Analysis 59

4.4.5 Performance analysis 61

4.5 Support count computation-based protocol 64

4.5.1 Overview 64

4.5.2 Protocol Design 65

4.5.3 Correctness Analysis 65

4.5.4 Privacy Analysis 67

4.5.5 Performance analysis 68

4.6 Using binary tree communication structure 69


4.7 Privacy-preserving distributed Apriori algorithm 70

4.8 Conclusion 71

5 PRIVACY PRESERVING CLUSTERING 73

5.1 Introduction 73

5.2 Problem statement 74

5.3 Privacy preserving clustering for the multi-party distributed data 76

5.3.1 Overview 76

5.3.2 Private multi-party mean computation 78

5.3.3 Privacy preserving multi-party clustering protocol 80

5.4 Privacy preserving clustering without disclosing cluster centers 82

5.4.1 Overview 83

5.4.2 Privacy preserving two-party clustering protocol 85

5.4.3 Secure mean sharing 87

5.5 Conclusion 88

6 PRIVACY PRESERVING OUTLIER DETECTION 91

6.1 Introduction 91

6.2 Technical preliminaries 92

6.2.1 Problem statement 92

6.2.2 Linear transformation 93

6.2.3 Privacy model 94

6.2.4 Private matrix product sharing 95

6.3 Protocols for the horizontally distributed data 95

6.3.1 Two-party protocol 97

6.3.2 Multi-party protocol 100

6.4 Protocol for two-party vertically distributed data 101

6.5 Experiments 104

6.6 Conclusions 106


List of Phrases

Abbreviation Full name

PPDM  Privacy Preserving Data Mining

k-NN  k-nearest neighbor

FD  fully distributed setting

≡c  computational indistinguishability


List of Tables

4.1 The communication cost 62

4.2 The complexity of the support count preserving protocol 63

4.3 The parties' time for the support count preserving protocol 64

4.4 The communication cost 68

4.5 The complexity of the support count computation protocol 69

4.6 The parties' time for the support count computation protocol 70

6.1 The parties' computational time for the horizontally distributed data 105

6.2 The parties' computational time for the vertically distributed data 105


List of Figures

3.1 Frequency mining protocol 33

3.2 The time used by the miner for computing the frequency f 38

3.3 Privacy preserving protocol of naive Bayes learning 41

3.4 The computational time for the first phase and the third phase 43

3.5 The time for computing the key values in the first phase 43

3.6 The time for computing the frequency f in the third phase 44

3.7 Improved frequency mining protocol 47

4.1 Support count preserving protocol 58

4.2 The support count computation protocol 66

4.3 Privacy-preserving distributed Apriori protocol 72

5.1 Privacy preserving multi-party mean computation 79

5.2 Privacy preserving multi-party clustering protocol 81

5.3 Privacy preserving two-party clustering 86

5.4 Secure mean sharing 89

6.1 Private matrix product sharing (PMPS) 96

6.2 Protocol for two-party horizontally distributed data 98

6.3 Protocol for multi-party horizontally distributed data 101

6.4 Protocol for two-party vertically distributed data 103


Chapter 1

INTRODUCTION

1.1 Privacy-preserving data mining: An overview

Data mining plays an important role in the current world and provides us a powerful tool to efficiently discover valuable information from large databases [25]. However, the process of mining data can result in a violation of privacy; therefore, issues of privacy preservation in data mining are receiving more and more attention from this community [52]. As a result, a large number of studies have been produced on the topic of privacy-preserving data mining (PPDM) [72]. These studies deal with the problem of learning data mining models from databases while protecting data privacy at the level of individual records or the level of organizations.

Basically, there are three major problems in PPDM [8]. First, organizations such as government agencies wish to publish their data for researchers and even the community. However, they want to preserve data privacy, for example, for highly sensitive financial and health data. Second, a group of organizations (or parties) wishes to jointly obtain the mining result on their joint data without disclosing each party's private information. Third, a miner wishes to collect data or obtain data mining models from individual users while preserving the privacy of each user. Consequently, PPDM can be divided into the three following areas depending on the model of information sharing.

Privacy-preserving data publishing: The model of this research consists of only one organization, the trusted data holder. This organization wishes to publish its data to the miner or the research community such that the anonymized data are useful for data mining applications. For example, some hospitals collect records from their patients for some required medical services. These hospitals can be the trusted data holder; however, the patients may not trust the hospital when their data are sent to the miner.

When anonymized data are published, there is much evidence showing that the published data may cause privacy breaches via some attacks. One of them is called re-identification, as shown in [61]. For example, 87% of the American population has characteristics that allow identifying them uniquely based on several public attributes, namely zip code, date of birth, and sex. Consequently, privacy preserving data publishing has received much attention in recent years; it aims to prevent the re-identification attack while preserving the useful information for data mining applications in the released data.

A general technique is called k-anonymity [51], [82], [6]; its goal is to protect released data against the possibility of re-identification. Consider a private data table from which explicit identifiers (e.g., SSN and Name) have been removed. However, values of other released attributes, such as ZIP, Date of birth, Marital status, and Sex, can also appear in some external sources that may still be joined with individual users' identities. If some combinations of values for these attributes occur uniquely or rarely, then by observing the data one can determine the identity of a user or deduce a limited set that contains the user. The goal of k-anonymity is that every tuple in the released private table is indistinguishable from at least k other tuples.

Privacy-preserving distributed data mining: This research area aims to develop distributed data mining algorithms without accessing original data [33, 79, 35, 68, 80, 40]. Different from privacy preserving data publishing, each study in privacy-preserving distributed data mining often solves a specific data mining task. The model of this area usually consists of several parties instead, each party having one private data set. The general purpose is to enable the parties to mine cooperatively on their joint data sets without revealing private information to the other participating parties. Here, the way the data are distributed among the parties also plays an important role in the problem to be solved. Generally, data can be distributed into many parts either vertically or horizontally.
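The two distribution models above can be made concrete with a small sketch (a toy table with made-up attribute names; plain Python, no mining involved):

```python
# A toy customer table: each record is (name, balance, zip_code).
records = [
    ("An",   1200, "10000"),
    ("Binh",  300, "10001"),
    ("Chi",  5000, "10002"),
    ("Dung",  750, "10003"),
]

# Horizontal partitioning: each party holds complete records
# for a disjoint subset of the individuals (same attributes).
bank_a = records[:2]
bank_b = records[2:]

# Vertical partitioning: each party holds every individual but
# only a subset of the attributes (parts linked by record owner).
party_1 = [(name, balance) for name, balance, _ in records]    # financial data
party_2 = [(name, zip_code) for name, _, zip_code in records]  # demographic data

print(len(bank_a) + len(bank_b))   # horizontal parts cover all records
print(len(party_1), len(party_2))  # each vertical part covers every record
```

Nothing here is private yet; the point is only that horizontal parts share a schema, while vertical parts share the set of individuals.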

In the horizontal distribution, a data set is distributed among several parties. Every party has data records with the same set of attributes; for example, the union of the customer databases of different banks. Typically, banks have different services for their clients, such as savings accounts, a choice of credit cards, stock investments, etc. Assume that banks wish to predict who are safe customers, who may be risky ones, and who may be frauds. Gathering all kinds of financial data about their customers and their transactions can help them in the above predictions, and thus prevent huge financial losses. Using reasonable techniques for mining the gathered data can generalize over these datasets and identify possible risks for future cases or transactions. More specifically, suppose a customer Nam goes to Bank A to apply for a loan to buy a car. He needs to provide the necessary information to Bank A. An expert system of Bank A can use the k-NN algorithm to classify Nam as either a risky or a safe customer. If this system only uses the database of Bank A, it could happen that Bank A does not have enough customers that are close to Nam. Therefore, the system may produce a wrong classification. For example, Nam is a safe customer but the system recognizes him as a risky one. Consequently, Bank A has a loss of profit. It is clear that mining on a larger database can result in a more accurate classification. Thus, classification on the joint databases of Bank A and other banks might give a more accurate result, and Nam could have been classified as safe. However, the problem is that privacy restrictions would not allow the banks to access each other's databases. Privacy-preserving distributed data mining can help with this problem. This scenario is a typical case of so-called horizontally partitioned data. In the context of privacy-preserving data mining, banks do not need to reveal their databases to each other. They can still apply k-NN classification to the joint databases of the banks while preserving each bank's private information.
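As a plain, non-private reference point, the k-NN classification that the banks would like to run on their pooled data can be sketched as follows; the features and labels are made up for illustration, and no privacy protection is involved yet:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (Euclidean distance on numeric features)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical customer features (income, debt) labeled "safe"/"risk".
bank_a = [((50, 5), "safe"), ((10, 20), "risk")]
bank_b = [((45, 8), "safe"), ((60, 2), "safe"), ((8, 25), "risk")]

# Pooling both databases gives the query more close neighbors --
# exactly the sharing that privacy restrictions normally forbid.
print(knn_classify(bank_a + bank_b, (48, 6)))  # safe
```

The privacy-preserving versions discussed in this chapter aim to reach the same decision without either bank handing its rows to the other.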

In the vertical distribution, a data set is distributed among some parties. Every party owns a vertical part of every record in the database (it holds the values for a subset of the attributes). For example, financial transaction information is collected by banks, while the tax information for everyone is collected by the IRS. In [71], Jaideep Vaidya et al. show another illustrative example of two vertically distributed databases: one contains medical records of people, while the other contains cell phone information for the same set of people. Mining the joint global database might obtain information like "Cell phones with Li/Ion batteries lead to brain tumors in diabetics."

Privacy-preserving user data mining: This research involves a scenario in which a data miner surveys a large number of users to learn some data mining results based on the user data, or collects the user data, while the sensitive attributes of these users need to be protected [74, 77, 19]. In this scenario, each user maintains only one data record. This can be thought of as a horizontally partitioned database in which each transaction is owned by a different user; it is also called the fully distributed setting (FD). Unlike privacy-preserving data publishing, the miner differs from the publisher in that he is untrusted; thus he could be an attacker who attempts to identify some sensitive information from the user data. For example, Du et al. [19] studied how to build a decision tree on private data. In this study, a miner wants to collect data from users, form a central database, and then conduct data mining on this database. Thus, he gives a survey containing some questions; each user is required to answer those questions and send back the answers. However, the survey contains some sensitive questions, and users may not feel comfortable disclosing their answers. Thus, the problem is how the miner could obtain the mining model without learning sensitive information about the users. One requirement for the methods in this area is that there are no interactions between users, and each user only communicates with the data miner. However, we are still able to ensure that nothing about the sensitive data beyond the desired results is revealed to the data miner.


1.2 Objectives and contributions

Up to now, there are many available solutions for the issues in PPDM. The quality of each solution is evaluated based on three basic characteristics: privacy degree, accuracy, and efficiency. But the problem here is that each solution is only usable in a particular distributed scenario or in a concrete data mining algorithm. Although some of them can be applied to more than one scenario or algorithm, their accuracy is lower than the acceptable requirement. Other solutions reach the required accuracy, but their privacy is poor. In addition, it is easy to see the lack of PPDM solutions for various practical contexts as well as for well-known data mining techniques. In this thesis, we aim at solving some issues in PPDM as follows:

1. To introduce a new scenario for privacy-preserving user data mining and find a good privacy solution for a family of frequency-based learning algorithms in this scenario.

2. To develop novel privacy-preserving techniques for popular data mining algorithms such as association rule mining and clustering methods.

3. To present a technique to design protocols for privacy-preserving multivariate outlier detection in both horizontally and vertically distributed data models.

The developed solutions will be evaluated in terms of the degree of privacy protection, correctness, usability in real-life applications, efficiency, and scalability.

The contribution of this thesis is to provide solutions for four problems in PPDM. Each problem has a statement independent of the others, but they share a common interpretation: given a data set distributed among several parties (or users), our task is to mine knowledge from all parties' joint data while preserving the privacy of each party. The difference among these problems lies in the various distributed data models (scenarios) and the various proposed functions that keep the parties' information private. Summarizing, our contributions in this thesis are as follows.


• In the first work (Chapter 3), we propose a new scenario for privacy-preserving user data mining called the 2-part fully distributed setting (2PFD) and find a solution for a family of frequency-based learning algorithms in the 2PFD setting. In 2PFD, the dataset is distributed across a large number of users in which each record is owned by two different users: one user only knows the values for a subset of attributes, and the other knows the values for the remaining attributes. A miner aims to learn, for example, classification rules on their data, while preserving each user's privacy. In this work we develop a cryptographic solution for frequency-based learning methods in 2PFD. The crucial step in the proposed solution is the privacy-preserving computation of the frequencies of a tuple of values in the users' data, which can ensure each user's privacy without loss of accuracy. We illustrate the applicability of the method by using it to build a privacy preserving protocol for naive Bayes classifier learning, and briefly address the solution in other applications. Experimental results show that our protocol is efficient.

• The second contribution of this thesis (Chapter 4) is the novel protocols for privacy-preserving frequent itemset mining in vertically distributed data. These protocols allow a group of parties to cooperatively mine frequent itemsets in a distributed setting without revealing each party's portion of the data to the others. The important security property of our protocols is better than that of the previous protocols, in that we achieve full privacy protection for each party. This property does not require the existence of any trusted party. In addition, no collusion of parties can cause privacy breaches.

• In the third work (Chapter 5), we present an expectation maximization (EM) mixture model clustering method for distributed data that preserves the privacy of the participating parties' data. First, a privacy preserving EM-based clustering method for multi-party distributed data is proposed. Unlike the existing method, our method does not reveal the sums of the numerator and denominator in the secure computation of the parameters of the EM algorithm; therefore, the proposed method is more secure, and it allows the number of participating parties to be arbitrary. Second, we propose a better method for the case in which the dataset is horizontally partitioned into only two parts; this method allows computing the covariance matrices and the final results without revealing the private information and the means. To solve this, we present a protocol based on oblivious polynomial evaluation and the secure scalar product for addressing subproblems such as the computation of the means, the covariance matrix, and the posterior probabilities. This approach allows two or more parties to cooperatively conduct clustering on their joint data sets without disclosing each party's private data to the others.

• In the fourth work (Chapter 6), we study the setting where some parties, each with a private data set, want to conduct outlier detection on their joint data set, but none of them wants to disclose its private data to the other parties. We propose a linear transformation technique to design protocols for secure multivariate outlier detection in both horizontally and vertically distributed data models. This differs from most previous privacy preserving techniques, which target distance-based outlier detection; our focus is on other, non-distance-based techniques for detecting outliers in statistics.

1.3 Related works

Recently, a lot of solutions have been proposed for PPDM. These solutions can be categorized into two main approaches: the secure multiparty computation (SMC) approach and the randomization approach.

The basic idea of the randomization approach is to perturb the original (private) dataset, and the result is released for data analysis. The perturbation has to ensure that the original individual data values cannot be recovered, while preserving the utility of the data for statistical properties; thus it allows patterns in the original data to be mined. There are two main perturbation techniques: random transformation and randomization. The first transforms each data value (record) into a random value (record) of the same domain as the original data in ways that preserve certain statistics but hide the real values [21, 4, 19, 3]. The second adds noise to the data to prevent discovery of the real values. Randomization has to ensure that, given the distribution of the noise added to the data and the randomized data set, the distribution (but not the actual data values) of the original data set can be reconstructed [1, 36, 16].
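A minimal sketch of the noise-addition variant (made-up ages, uniform noise in [-10, 10]; real reconstruction uses the Bayesian or EM methods cited below rather than this simple zero-mean argument):

```python
import random

random.seed(7)

# Original private ages (never shown to the miner).
ages = [random.randint(18, 70) for _ in range(20_000)]

# Each user adds independent uniform noise in [-10, 10] and
# releases only the randomized value.
randomized = [a + random.uniform(-10, 10) for a in ages]

# Uniform[-10, 10] noise has mean 0, so aggregate statistics of
# the original data remain estimable from the noisy release,
# even though no individual true age is recoverable.
true_mean = sum(ages) / len(ages)
est_mean = sum(randomized) / len(randomized)
print(abs(true_mean - est_mean) < 0.5)  # True: the means agree closely
```

Individual values are hidden, but aggregate properties survive; the tension between these two facts is exactly the privacy/accuracy tradeoff discussed next.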

A typical example is the algorithm of Agrawal-Srikant [2], in which the values of an attribute are discretized into intervals and each original value is assigned to an interval. Then, the original data distribution is reconstructed by a Bayesian approach, and based on the reconstructed distribution the decision trees can be induced. Many other distribution reconstruction methods have also been introduced. In [1], Agrawal et al. developed an approach based on Expectation Maximization that also gave a better definition of privacy and an improved algorithm. Evfimievski et al. [21] used a similar technique for association rules mining. Polat et al. proposed a privacy preserving collaborative filtering method using randomized techniques [53], etc.

Although perturbation techniques are very efficient, their use generally involves a tradeoff between privacy and accuracy: if we require more privacy, the miner loses more accuracy in the data mining results, and vice versa. Even the very techniques that allow us to reconstruct distributions also reveal information about the original data values. For example, consider the case of randomizing an age attribute. In principle, there are no drivers under 18 in the general distribution of ages. Assume that randomization is implemented by adding noise randomly chosen from the range [-10, 10]. The reconstructed distribution does not show us any true age value; it only tells us, for instance, that a driver whose randomized age is 40 has a true age in the range [30, 50]. However, if an age value in the randomized set is 8, then since no drivers are under the age of 18, the driver whose age is given as 8 in the randomized data must be 18 years old in the original data. Thus, some work has been done to measure the privacy of randomization techniques so that they can be used carefully to obtain the desired privacy. Kargupta et al. [36] formally analyzed the privacy of randomization techniques and showed that in many cases they reveal private information. Evfimievski et al. [20] showed how to limit privacy breaches when using the randomization technique for privacy preserving data mining.

Many privacy preserving data mining algorithms based on SMC have been proposed as well. They can be described as a computational process where

a group of parties computes a function based on private inputs, but neither party wants to disclose its own input to any other party. The secure multiparty computation framework was developed by Goldreich [24]. In this framework, multiparty protocols fall into either the semi-honest model or the malicious adversary model. In the semi-honest model, the parties are assumed to follow the protocol rules, but after the execution of the protocol has completed, the parties may still try to learn additional information by analyzing the messages they received during the execution of the protocol. In the malicious adversary model, it is assumed that the parties can execute arbitrary operations to damage other parties. Thus, protocol design in this model is much more difficult than in the semi-honest model. However, currently the semi-honest model is usually used in the context of privacy preserving data mining. The formal definitions of SMC were stated in [24].

The secure multi-party computation problem was first proposed by Yao [78], who gave a method to solve Yao's Millionaires' problem, which allows comparing the wealth of two millionaires without revealing any private information of either person. According to the theoretical studies of Goldreich, the general SMC problem can be solved by the circuit evaluation method. However, this solution is not practical in terms of efficiency. Therefore, finding efficient problem-specific solutions is seen as an important research direction. In recent years, many specific solutions have been introduced for different research areas such as information retrieval, computational geometry, statistical analysis, etc. [17, 10, 70].
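As a concrete instance of the SMC formulation, the millionaires' problem corresponds to a two-party function with f1 = f2. The sketch below implements only the ideal functionality, i.e., what the parties are allowed to learn; an actual protocol must compute this output without ever pooling x1 and x2:

```python
def millionaires(x1, x2):
    """Ideal two-party functionality f(x1, x2) -> (f1, f2):
    both parties learn only who is richer, nothing else."""
    richer_is_party1 = x1 > x2
    return richer_is_party1, richer_is_party1  # f1 = f2 here

# Each output is a single bit; neither party's wealth appears in it.
print(millionaires(3_000_000, 5_000_000))  # (False, False)
```

Yao's garbled-circuit construction (and the general circuit evaluation method mentioned above) realizes such functionalities securely, at the cost of efficiency.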

Randomization approaches [21, 4, 19, 3, 1, 36, 16] can be used in the fully distributed scenario, where a data miner wants to obtain classification models from the data of a large set of users. These users can simply randomize their data and then submit the randomized data to the miner, who can later reconstruct some useful information. However, to obtain strong privacy without loss of accuracy, SMC techniques have been proposed in [74, 77]. The key idea of these techniques is a private frequency computation method that allows a data miner to compute the frequencies of values or tuples in the FD setting, while preserving the privacy of each user's data. In Chapter 3, we propose an SMC solution which allows the miner to learn frequency-based models in the 2PFD setting. Note that in this setting, each user may only know some values of a tuple but not all of them. Therefore, the above-mentioned cryptographic approaches cannot be used in the 2PFD setting. In the FD setting, other solutions based on k-anonymization of users' data have been proposed in [83, 77]. The advantage of these solutions is that they do not depend on the underlying data mining tasks, because the anonymous data can be used for various data mining tasks without disclosing privacy. However, these solutions are inapplicable in the 2PFD setting, because the miner cannot link the two anonymous parts of one object with each other.

SMC approaches are usually used for privacy-preserving distributed data mining as well, where data are distributed across several parties. Thus, the privacy property of privacy-preserving distributed data mining algorithms is quantified by the privacy definition of SMC, where each party involved in the privacy-preserving distributed protocols is only allowed to learn the desired data mining models without any other information. Generally, each protocol has to be designed for a specific task, for reasons of efficiency and privacy. Currently, specific privacy-preserving distributed protocols have been proposed to address different data mining problems across distributed databases. For example, in [70, 63, 18], the authors developed privacy preserving classification protocols for vertically distributed data based on a secure scalar product method, including privacy preserving protocols for learning naive Bayes classification, association rules, and decision trees. In [34], privacy preserving naive Bayes classification was addressed for horizontally distributed data by computing the secure sum of all local frequencies of the participating parties. Our work in Chapter 4 presents frequent itemset mining protocols for vertically partitioned data. Distributed association rule/itemset mining has been addressed for both vertically partitioned data and horizontally partitioned data [33, 79, 35, 68, 80]. However, to the best of our knowledge, these protocols preserve the privacy of each party but only resist the collusion of at most n − 2 corrupted parties. Our protocols for privacy preserving frequent itemset mining involving multiple parties can protect the privacy of each party against collusion of up to n − 1 corrupted parties.
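The secure-sum idea mentioned above for horizontally partitioned frequencies can be sketched with the textbook ring-based protocol (semi-honest model; the modulus and counts are illustrative):

```python
import random

def secure_sum(local_values, modulus=2**32):
    """Ring-based secure sum: the initiator masks its value with a
    random R, each party in turn adds its own value mod `modulus`,
    and at the end the initiator removes R.  Each party along the
    ring sees only a uniformly masked running total, never any
    other party's individual value."""
    r = random.randrange(modulus)
    running = (r + local_values[0]) % modulus   # initiator sends this on
    for v in local_values[1:]:                  # each party adds its count
        running = (running + v) % modulus
    return (running - r) % modulus              # initiator unmasks the total

# Local frequency counts of one itemset at three parties.
counts = [120, 45, 80]
print(secure_sum(counts))  # 245
```

Note that if the two ring neighbors of a party collude, they can recover its value by comparing the masked totals they exchanged with it; this is the kind of collusion weakness, tolerating at most n − 2 corrupted parties, that the protocols of Chapter 4 are designed to remove.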

Regarding related work on privacy preserving distributed clustering: recently, privacy preserving clustering problems have been studied by many authors. In [49] and [47], the authors focused on different transformation techniques that enable the data owner to share the data with the other party, who will cluster it. Clifton and Vaidya proposed a secure multi-party computation of the k-means algorithm on vertically partitioned data [66]. In [29], the authors proposed a solution for privacy preserving clustering on horizontally partitioned data, where they primarily focused on hierarchical clustering methods that can both discover clusters of arbitrary shapes and deal with different data types. In [59], Kruger et al. proposed a privacy preserving distributed k-means protocol on horizontally partitioned data whose key step is the privacy preserving computation of cluster means. At each iteration of the algorithm, only the means are revealed to the parties, and nothing else. However, revealing the means might allow the parties to learn some extra information about each other. To our knowledge, there is so far only one secure method for the expectation maximization (EM) mixture model on horizontally distributed sources [40], based on secure sum computation. However, this method requires at least three participating parties. Because the global model is a sum of local models, in the case of only two parties, which often happens in practice, each party could compute the other party's local model by subtracting its own local model from the global model. The aim of this work in Chapter 5 is, firstly, to develop a more general protocol which allows the number of participating parties to be arbitrary and is more secure. Secondly, we propose a better method for the case in which the dataset is horizontally partitioned into only two parts.

For privacy preserving outlier detection: while there are a number of different definitions of outliers, as well as techniques to find them, only some currently developed methods work in a privacy preserving fashion, and these target distance-based outlier detection. There are other, non-distance-based techniques for detecting outliers in statistics [12], but there is still no work on finding such outliers in a privacy preserving fashion [3]. The Mahalanobis distance has been used in several works for outlier detection [10], [12]. In Chapter 6, we propose solutions for privacy preserving outlier detection in both vertically and horizontally distributed data. Our work is related to the work on secure sound classification in [57], in which a single Gaussian model is used for classification. However, that work solves the scenario of two-party secure classification, in which the parties engage in a protocol that allows one party to classify her data using the other's classifier without revealing any of her private information; also, she learns nothing about the classifier. Therefore, our purpose and method are different from that work.
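As a plain, centralized reference for the Mahalanobis-distance test mentioned above (2-D case, made-up data, and a conventional threshold of 3; the distributed, privacy preserving versions are the subject of Chapter 6):

```python
def fit(points):
    """Sample mean and 2x2 sample covariance of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis2(point, mean, cov):
    """Squared Mahalanobis distance, with the 2x2 inverse written out."""
    (mx, my), ((sxx, sxy), (_, syy)) = mean, cov
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Mean and covariance come from a "clean" reference sample; new
# observations are flagged when their distance exceeds 3.
reference = [(10, 20), (11, 21), (9, 19), (10, 22), (12, 20), (11, 19)]
mean, cov = fit(reference)
for p in [(10, 21), (50, 80)]:
    print(p, mahalanobis2(p, mean, cov) ** 0.5 > 3)
```

In the distributed setting, the mean and covariance matrix must themselves be computed over the parties' joint data without revealing any party's points, which is what the linear transformation technique of Chapter 6 addresses.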

1.4 Organization of thesis

The thesis consists of six chapters over 109 A4 pages. Chapter 1 presents an overview of PPDM and related works. Chapter 2 presents the basic definitions of secure multi-party computation and the techniques I frequently use. Chapter 3 proposes privacy preserving frequency-based learning algorithms in the 2PFD setting. Chapter 4 presents two privacy-preserving algorithms for distributed mining of frequent itemsets. Chapter 5 discusses privacy preserving EM-based clustering protocols. Chapter 6 presents privacy preserving outlier detection for both vertically and horizontally distributed data. The summary of the thesis is presented in the last section.


Chapter 2

METHODS FOR SECURE MULTI-PARTY COMPUTATION

2.1 Definitions

In this section, we review basic definitions from computational complexity theory that will be used in this thesis [24].

The following is the standard definition of a negligible function

Definition 2.1. Let N be the set of natural numbers. We say a function ϵ(·) : N → (0, 1] is negligible in n if for every positive polynomial poly(·) there exists an integer n0 > 0 such that for all n > n0,

ϵ(n) < 1/poly(n).

Computational indistinguishability is another important concept when discussing the security properties of distributed protocols [24]. Let X = {Xn}n∈N be an ensemble indexed by a security parameter n (which usually refers to the length of the input), where the Xn are random variables.

Definition 2.2. Two ensembles, X = {Xn}n∈N and Y = {Yn}n∈N, are computationally indistinguishable in polynomial time if for every probabilistic polynomial-time algorithm A,

|Pr(A(Xn) = 1) − Pr(A(Yn) = 1)|

is a negligible function in n. In such a case, we write X ≡c Y, where ≡c denotes computational indistinguishability.

This section reviews the secure multiparty computation framework developed by Goldreich [24].

Secure multiparty computation function

Consider a distributed network with n participating parties. A secure n-party computation problem can generally be considered as the computation of a function

f(x1, x2, ..., xn) ↦ (f1(x1, x2, ..., xn), ..., fn(x1, x2, ..., xn)),

where each party i knows only its private input xi. For security, it is required that the privacy of any honest party's input is protected, in the sense that each dishonest party i learns nothing except its own output yi = fi(x1, x2, ..., xn). If there is any malicious party that may deviate from the protocol, it is also required that each honest party gets a correct result whenever possible.

Privacy in Semi-honest model

In the distributed setting, let π be an n-party protocol for computing f. Let x denote (x1, ..., xn). The view of the ith party (i ∈ [1, n]) during an execution of π on x is denoted by viewπi(x); it includes xi, all received messages, and all internal coin flips. For every subset I of [1, n], say I = {i1, ..., it}, let fI(x) denote (yi1, ..., yit) and viewπI(x) = (I, viewπi1(x), ..., viewπit(x)). Let OUTPUT(x) denote the output of all parties during the execution of π.

Definition 2.3. An n-party computation protocol π for computing f(·, ..., ·) is secure with respect to semi-honest parties if there exists a probabilistic polynomial-time algorithm S such that for every I ⊂ [1, n] we have

{S(xi1, ..., xit, fI(x)), f(x)} ≡c {viewπI(x), OUTPUT(x)}.

This definition states that the view of the parties in I can be simulated from only those parties' inputs and outputs. If the function is privately computed by the protocol, then the privacy of each party's input data is protected.

In this thesis, we focus on designing privacy-preserving protocols in the semi-honest model. The formal definition of a secure protocol in the malicious model can be found in [24].

In this thesis, we also use the composition theorem for the semi-honest model; its discussion and proof can be found in [24]. The composition theorem states that a protocol can be decomposed into several sub-protocols; the security of the protocol is then proved by showing that its sub-protocols are secure.

Theorem 2.1 (Composition theorem). Suppose that g is privately reducible to f, and that there exists a protocol for privately computing f. Then there exists a protocol for privately computing g.

Secret sharing refers to any method by which a secret can be shared among multiple parties in such a way that no single party knows the secret, but the secret can easily be reconstructed by combining the shares of some of the parties.

In a two-party case, Alice and Bob share a value z in such a way that Alice holds (x, n), Bob holds (y, m), and z is equal to (x + y)/(m + n). This is called secret mean sharing. The result of sharing allows Alice and Bob to obtain random values rA and rB, respectively, where rA + rB = z. The protocol for this problem will be described in Chapter 5.

Shamir secret sharing is a threshold scheme [56]. In Shamir secret sharing, there are n parties and a polynomial P of degree k − 1 such that P(0) = S, where S is the secret. Each of the n parties holds a point of the polynomial P. Because k points (xi, yi) (i = 1, ..., k) uniquely define a polynomial P of degree k − 1, any subset of at least k parties can reconstruct the secret S by polynomial interpolation, but fewer than k parties cannot. This scheme is also called (n, k) Shamir secret sharing.
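The interpolation argument above can be made concrete in a few lines. The following sketch is illustrative only (the field prime and the parameter values are our own choices, not from the thesis):

```python
import random

P = 2**61 - 1  # prime modulus of the field (illustrative choice)

def make_shares(secret, n, k):
    """Split `secret` into n shares so that any k of them reconstruct it."""
    # Random polynomial of degree k-1 with value `secret` at 0.
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num = den = 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * xm % P
                den = den * (xm - xj) % P
        secret = (secret + yj * num * pow(den, -1, P)) % P
    return secret

shares = make_shares(123456789, n=5, k=3)
assert reconstruct(shares[:3]) == 123456789   # any 3 of the 5 shares suffice
assert reconstruct(shares[2:]) == 123456789
```

Note that `pow(den, -1, P)` computes a modular inverse and requires Python 3.8 or later.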

A simple example of efficient SMC that illustrates the idea of privacy preserving computation is the secure sum protocol [10].

Assume that there are n parties P1, P2, ..., Pn such that each Pi has a private data item di. The parties wish to compute ∑_{i=1}^{n} di without revealing their private data di to each other. We assume that ∑_{i=1}^{n} di is in the range [0, p]. In the secure sum protocol, one party is designated as the master party and is given the identity P1. At the beginning, P1 chooses a uniform random number r from [0, p] and then sends the sum D = d1 + r mod p to party P2. Since the value of r is chosen uniformly from [0, p], the number D is also distributed uniformly across this range, so P2 learns nothing about the actual value of d1.

Each remaining party Pi (i = 2, ..., n) does the following: it receives the running total r + ∑_{l=1}^{i−1} dl mod p from party Pi−1, adds its own item di modulo p, and sends the result to the next party (party Pn sends its result back to P1).

Finally, when party P1 receives the value from party Pn, it is equal to the total sum r + ∑_{i=1}^{n} di mod p. Since r is known only to P1, it can recover the sum ∑_{i=1}^{n} di and distribute it to the other parties.
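Ignoring the cryptographic setting, the circulation of the masked total can be checked with a short simulation (the party values below are made up):

```python
import random

p = 10**6                  # public upper bound on the sum; arithmetic is mod p
d = [120, 345, 78, 901]    # private items d1..dn of parties P1..Pn

# P1 masks its item with a uniform random r and starts the round.
r = random.randrange(p)
running = (d[0] + r) % p

# Each remaining party adds its own item and passes the total on.
for di in d[1:]:
    running = (running + di) % p

# Back at P1: subtract the mask r to recover the true sum.
total = (running - r) % p
assert total == sum(d)
```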

A public key cryptosystem uses two different keys for encryption and decryption. The public key is used for encryption and is normally published and known to everybody. The private key is kept securely by the receiver. Therefore, everybody can send an encrypted message to the receiver, but only the receiver, who holds the valid private key, can decrypt the message. A public key cryptosystem is a triple of probabilistic polynomial-time algorithms (K, E, D) defined as follows [24].

1. K is the key-generation algorithm. Given a security parameter l, a probabilistic expected polynomial-time algorithm K(1^l) generates a pair (kp, ks), where kp is the public key and ks is the corresponding private key. Note that the lengths of the cryptographic keys are determined by the security parameter l.

2. E is the encryption algorithm. Given the public key kp, a plaintext m, and a random r ∈ {0, 1}^l, a probabilistic polynomial-time algorithm encrypts m as C = Ekp(m, r) ∈ {0, 1}^l.

3. D is the decryption algorithm. Given the private key ks and a ciphertext C, the probabilistic polynomial-time algorithm returns m = Dks(C) ∈ {0, 1}^l.

A public key cryptosystem is a homomorphic encryption scheme if it permits a specific algebraic operation on the plaintext to be performed via a corresponding algebraic operation on the ciphertext. We say E is additively homomorphic if, given Ekp(m1, r1) and Ekp(m2, r2), we can efficiently compute an encryption Ekp(m1 + m2, r) of the sum of the plaintexts.


The security of a public key encryption scheme is based on some intractability assumptions. For example, the well-known RSA encryption [54] presumes that factoring a large integer is difficult. The ElGamal encryption [62] is based on the assumption that the discrete logarithm problem is intractable. Note that for this conditional security, the cryptosystem is viewed as a family of encryptions indexed by the security parameter. Any improvement in computing technology and in algorithmics can be compensated for by selecting a larger security parameter (the size of the keys). This works only if the improvement is polynomial in scale. In other words, security can be guaranteed against adversaries whose computing resources are polynomially bounded. One of the most efficient currently known semantically secure homomorphic cryptosystems is the Paillier cryptosystem [50], which was later improved by Damgård and Jurik [12].

Our protocols in Chapters 3 and 4 are based on the standard variant of the ElGamal encryption scheme. The security of this cryptosystem is based on the decisional Diffie-Hellman (DDH) assumption [7]. The computations are carried out in Zp and the message space is Zq, where p and q are both prime and q|(p − 1). The ElGamal encryption is defined by the following three algorithms.

1. K is the key-generation algorithm. Given a security parameter l = log q, K(1^l) generates the tuple (kp, ks), where the public key is kp = (p, g, h, f) and ks is the corresponding private key, so that the following conditions are satisfied:

(1) p is an l-bit prime, so that q is also a prime number.

(2) g is an element of order q in Zp∗, where Zp∗ is the multiplicative group of Zp.

(3) h = g^ks, and f ∈ ⟨g⟩ is randomly selected.


2. E is the encryption algorithm. Given the public key kp, the encryption of a message m ∈ Zq is defined as

E(m, r) = (f^m h^r, g^r) = (C1, C2),

where r is uniformly chosen from {1, ..., q − 1}.

3. D is the decryption algorithm. Given the private key ks, the decryption algorithm returns C1 · C2^{−ks} mod p.

Note that the decryption will only return f^m rather than m. To recover m, an exhaustive search is needed. However, in the protocols of this thesis, we only need to test whether m is equal to a particular value (such as 0, or a constant c that is assumed to be small), which is equivalent to testing whether

C1 · C2^{−ks} ≡ 1 mod p or C1 · C2^{−ks} ≡ f^c mod p.

This cryptosystem has the additive homomorphic property, which can be used to perform computation on ciphertexts from different parties. Let E(a) = (h^{r1} f^a, g^{r1}) and E(b) = (h^{r2} f^b, g^{r2}), where a, b ∈ Zq and the corresponding r1, r2 are chosen at random from Zq \ {0}. The following relations define the multiplication ⊙ over ciphertexts and exponentiation by a constant c:

E(r1, a) ⊙ E(r2, b) = (h^{r1+r2} f^{a+b}, g^{r1+r2}) = E(r1 + r2, a + b),

E(r, a)^c = E(cr, ca).

In addition, this encryption scheme has the indistinguishability property: since the random number r can take many different values, the ciphertexts of any two messages are computationally indistinguishable. ElGamal encryption also has a randomization property, which allows computing a different encryption of M from a given encryption of M.
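A toy run of this variant, with deliberately tiny illustrative parameters, shows both decryption to f^m and the additive homomorphism:

```python
import random

# Illustrative parameters: q | (p - 1), g of order q in Zp*.
p, q = 2039, 1019
g = 4                      # 4 = 2^2 lies in (and generates) the order-q subgroup
f = pow(g, 7, p)           # f in <g>, arbitrary public choice

ks = random.randrange(1, q)    # private key
h = pow(g, ks, p)              # public key component h = g^ks

def enc(m):
    r = random.randrange(1, q)
    return (pow(f, m, p) * pow(h, r, p) % p, pow(g, r, p))

def dec_to_fm(c):
    c1, c2 = c
    return c1 * pow(c2, -ks, p) % p    # yields f^m, not m itself

# Component-wise multiplication of ciphertexts adds the plaintexts.
ca, cb = enc(2), enc(3)
prod = (ca[0] * cb[0] % p, ca[1] * cb[1] % p)
assert dec_to_fm(prod) == pow(f, 5, p)    # an encryption of 2 + 3
```

Recovering m itself would require searching over candidate exponents, which is exactly why the protocols only test against small known constants.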


Decisional Diffie-Hellman Assumption. For uniformly random a, b, c ∈ [0, q − 1], the DDH assumption states that

{g^a, g^b, g^{ab}} ≡c {g^a, g^b, g^c}.

The problem of oblivious polynomial evaluation (OPE) was first considered in [44]. As with oblivious transfer, this problem involves a sender and a receiver. The sender's input is a polynomial P of degree k over some finite field F, and the receiver's input is an element z ∈ F (the degree k of P is public). The protocol is such that the receiver obtains P(z) without learning anything else about the polynomial P, and the sender learns nothing. An efficient solution to this problem was presented in [45]. For our protocols, we use the protocol given in [45], since it requires only O(k) exponentiations in order to evaluate a polynomial of degree k (where the constant is very small). This works well since we only require evaluation of low-degree polynomials.

We now briefly describe the protocol used for oblivious polynomial evaluation. This description is excerpted from [32]. Let P(y) = ∑_{i=0}^{k} ai y^i be Alice's input and x be Bob's input. The following protocol enables Bob to compute P(x), where g is the generator of a group in which the Decisional Diffie-Hellman assumption holds. The protocol can be converted to the problem of computing P(x) itself using the methods of Paillier [50], who presented a trapdoor for computing discrete logs. The protocol is quite simple since the parties are assumed to be semi-honest; bit commitment and zero-knowledge proofs can be used to achieve security against malicious parties. The protocol is as follows.

1. Bob chooses a secret key s and does

for i = 0, ..., k do
  generate a random ri
  compute ci = (g^{ri}, g^{ri·s} g^{x^i})
end for


Bob sends c0, ..., ck and g^s to Alice.

2. Alice computes

C = ∏_{i=0}^{k} ci^{ai} = (g^R, g^{sR} g^{P(x)}), where R = ∑_{i=0}^{k} ri ai.

She then generates a random number r, computes C′ = (g^R g^r, g^{sR} g^{P(x)} (g^s)^r), and sends C′ to Bob.

3. Bob divides the second element of C′ by the first element of C′ raised to the power s, obtaining g^{P(x)}.

Based on the DDH assumption, Alice learns nothing about x^i from the messages c0, ..., ck sent by Bob. On the other hand, Bob learns nothing about P from C′.
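The three steps can be exercised end to end in a toy group; the group parameters, the polynomial, and the input below are illustrative only (a real instantiation uses cryptographically large p and q):

```python
import random

# Toy group: q | (p - 1), g of order q in Zp*.
p, q = 2039, 1019
g = 4

a = [3, 1, 4]     # Alice's polynomial P(y) = 3 + y + 4y^2
x = 5             # Bob's input
k = len(a) - 1

# Step 1: Bob encodes the powers x^i under his secret key s.
s = random.randrange(1, q)
gs = pow(g, s, p)                        # Bob also sends g^s to Alice
c = []
for i in range(k + 1):
    ri = random.randrange(1, q)
    c.append((pow(g, ri, p), pow(g, ri * s + pow(x, i, q), p)))

# Step 2: Alice raises each pair to her coefficient, multiplies, and blinds.
C1 = C2 = 1
for (u, v), ai in zip(c, a):
    C1 = C1 * pow(u, ai, p) % p
    C2 = C2 * pow(v, ai, p) % p
r = random.randrange(1, q)
Cp = (C1 * pow(g, r, p) % p, C2 * pow(gs, r, p) % p)

# Step 3: Bob unblinds with s and obtains g^{P(x)}.
gPx = Cp[1] * pow(Cp[0], -s, p) % p
assert gPx == pow(g, (3 + x + 4 * x**2) % q, p)   # P(5) = 108
```

As the text notes, Bob learns only g^{P(x)}, not P(x) itself, which is enough for the protocols in this thesis.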

Protocols for computing the secure scalar product (SSP) of two vectors are frequently used in privacy-preserving data mining applications. Assume that two vectors A = (a1, ..., an) and B = (b1, ..., bn) are owned by two corresponding parties, Alice and Bob. A privacy-preserving scalar product protocol allows one or both parties to learn the scalar product A · B = ∑_{i=1}^{n} ai bi, and neither party should learn anything about the other party's input beyond what is implied by its own input and its final result. To be useful as a building block for more complex privacy-preserving protocols, it is often desirable to use privacy-preserving scalar product protocols in which Alice and Bob learn additive secret shares of the resulting scalar product, rather than only one party learning the scalar product. For example, if Alice holds A, Bob holds B, and the scalar product A · B is known to be less than M, then Alice learns rA and Bob learns rB, where rA and rB are random integers, called shares, between 0 and M − 1 such that rA + rB mod M = A · B. Therefore, together Alice and Bob know A · B, but individually neither learns any information about its value. This protocol is called the secure scalar product share protocol. This section presents an efficient protocol based on semantically secure homomorphic encryption, proposed in [23], in which Alice learns rA and Bob learns rB, where rA and rB are random integer shares.
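The concrete protocol of [23] is not reproduced here, but the share-splitting idea works with any additively homomorphic scheme. The sketch below uses a toy Paillier instance of our own (parameters far too small for real use; `math.lcm` requires Python 3.9+):

```python
import math
import random

# Toy Paillier keypair (illustrative primes).
pp, qq = 61, 53
n, n2 = pp * qq, (pp * qq) ** 2
lam = math.lcm(pp - 1, qq - 1)
g = n + 1
mu = pow(lam, -1, n)               # valid decryption constant since g = n + 1

def enc(m):
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n2) * pow(r, n, n2) % n2

def dec(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

A = [1, 2, 3]      # Alice's private vector
B = [4, 5, 6]      # Bob's private vector; A . B = 32

# Alice sends encryptions of her entries to Bob.
cts = [enc(ai) for ai in A]

# Bob homomorphically forms E(A.B - rB) for a random share rB of his own.
rB = random.randrange(n)
c = enc((n - rB) % n)
for ci, bi in zip(cts, B):
    c = c * pow(ci, bi, n2) % n2

# Alice decrypts her share; the two shares add up to A . B (mod n).
rA = dec(c)
assert (rA + rB) % n == 32
```

Bob never sees Alice's entries in the clear, and Alice sees only a uniformly masked value, which is the share property described above.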

One of the key results excerpted in [32] is a cryptographic protocol for computing shares of ln(x1 + x2). In this section, we briefly present this protocol. Basically, the main method for sharing ln(x1 + x2) is based on the Taylor approximation. Indeed, writing x = x1 + x2 = 2^n(1 + ϵ), the Taylor approximation method gives

ln x = ln(2^n(1 + ϵ)) = n ln 2 + ln(1 + ϵ) ≈ n ln 2 + T(ϵ), where T(ϵ) = ∑_{i=1}^{k} (−1)^{i−1} ϵ^i / i.

Let n = ⌊log2 x⌋ and let 2^n represent the closest power of 2 to x; then ln x = ln 2^n + T(ϵ).

The protocol consists of two phases. Phase 1 determines an appropriate

n and ϵ. Let N be a predetermined upper bound on n. First, we use Yao's circuit evaluation, which takes x1 and x2 as input and outputs random shares of ϵ2^N and 2^N n ln 2. Note that ϵ2^n = x − 2^n, where n can be determined from the two most significant bits of x, and ϵ2^N is obtained simply by shifting the result by N − n bits to the left. Thus, the circuit outputs random values α1 and α2 where α1 + α2 = ϵ2^N, and it also outputs random values β1 and β2 such that β1 + β2 = 2^N n ln 2.

Phase 2 computes shares of the Taylor series approximation T(ϵ), which can be done as follows. Alice picks a random element w1 ∈ F and defines a polynomial Q(x) such that w1 + Q(α2) = T(ϵ), where Q(x) is built from the Taylor polynomial evaluated at (α1 + x)/2^N. Bob obtains w2 = Q(α2) using oblivious polynomial evaluation. Finally, Alice computes u1 = lcm(2, ..., k)β1 + w1 and Bob computes u2 = lcm(2, ..., k)β2 + w2, giving us that u1 + u2 ≈ 2^N lcm(2, ..., k) ln x.
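The numerical accuracy of this decomposition — setting aside the secret sharing and the circuit evaluation — can be checked directly:

```python
import math

def ln_taylor(x, k=8):
    """Approximate ln x as n*ln 2 + T(eps) with a degree-k Taylor polynomial."""
    n = round(math.log2(x))      # 2^n is (roughly) the closest power of 2 to x
    eps = x / 2**n - 1           # eps is small (|eps| < 0.42), so T converges fast
    T = sum((-1) ** (i - 1) * eps**i / i for i in range(1, k + 1))
    return n * math.log(2) + T

for x in (3, 100, 1000, 123456):
    assert abs(ln_taylor(x) - math.log(x)) < 1e-6
```

The alternating-series error is bounded by the first omitted term, |ϵ|^{k+1}/(k+1), which is why a low-degree polynomial, and hence a cheap OPE, suffices.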


Let us take some examples of 2PFD. Consider a scenario in which a sociologist wants to find out the depersonalization behavior of children depending on the parenting style of their parents [73]. The sociologist provides a sample survey to collect information about parenting style from the parents and about behavior from their children. Clearly, the information is quite sensitive: parents do not want to reveal their limitations in educating children, while it is also difficult to ask the children to answer honestly and truthfully about their depersonalization behavior. Therefore, in order to get accurate information, the researcher must ensure the confidentiality of each subject's information. In this case, each data record is privately owned jointly by the parents and their children.

Another example is a scenario where a medical researcher needs to study the relationship between living habits, clinical information, and a certain disease [31, 30]. A hospital has a clinical data set of patients that can be used for research purposes, and information about living habits can be collected by a survey of the patients; however, neither the hospital nor the patients are willing to share their data with the miner because of privacy. This scenario meets the 2PFD setting, where each data object consists of two parts: one part, consisting of living habits, belongs to a patient; the remaining part, consisting of the clinical data of this patient, is kept by the hospital. Furthermore, we can see that the 2PFD setting is quite common in practice, so privacy preserving frequency mining protocols in 2PFD are significant and can be applied to many other similar distributed data scenarios.

In this work we develop a cryptographic solution for frequency-based learning methods in 2PFD. The crucial step in the proposed solution is the privacy-preserving computation of the frequencies of a tuple of values in the users' data, which can ensure each user's privacy without loss of accuracy.

We illustrate the applicability of the method by using it to build a privacy preserving protocol for naive Bayes classifier learning, and briefly address the solution in other applications. Experimental results show that our protocol is efficient.

This work belongs to the area of privacy preserving user data mining introduced in Section 1.1.3. A variety of privacy preserving data mining solutions have been proposed in this area. Some randomization-based solutions proposed in [21, 4, 19, 3, 1, 36, 16] can be applied to classification algorithms

in the fully distributed setting. The basic idea of these solutions is that every user perturbs its data before sending it to the miner. The miner can then reconstruct the original data to obtain the mining results with some bounded error. These solutions allow each user to operate independently, and the perturbed value of a data element does not depend on those of the other data elements, but only on its initial value. Therefore, they can be used in various distributed data scenarios. Although these solutions are highly efficient, their use generally involves a tradeoff between privacy and accuracy, i.e., if we require more privacy, the miner loses more accuracy in the data mining results, and vice versa.


In [74, 77] the authors solved various privacy preserving data mining tasks such as naive Bayes learning, decision tree learning, association rule mining, etc. The proposed cryptographic approaches are able to maintain strong privacy without loss of accuracy. The key idea of these approaches is a private frequency computation method that allows a data miner to compute frequencies of values or tuples in the fully distributed data set, while preserving the privacy of each user's data. To compute the frequency of a tuple of values, each user outputs a boolean value (either 1 or 0) indicating whether or not the data it holds matches the pattern, and the miner uses the private frequency computation method to privately compute the sum of the boolean values from all users. Note that in the 2PFD setting, each user may only know some values of the tuple but not all. Therefore, the above-mentioned cryptographic approaches cannot be used in the 2PFD setting.

Some other solutions based on k-anonymization of users' data have been proposed in [83, 77]. The advantage of these solutions is that they do not depend on the underlying data mining tasks, because the anonymized data can be used for various data mining tasks without disclosing privacy. However, these solutions are inapplicable in the 2PFD setting, because the miner cannot link the two anonymous parts of one object with each other.

One of the requirements in our computation model is the connection of the two different parts of the partitioned records to obtain the desired computation results without disclosing any attribute information. This is similar to the problem of secure scalar product [63] and the problem of computing the intersection of private datasets in the two-party vertically partitioned model [22]. Indeed, consider the problem of computing the intersection of the private datasets of two parties. This problem requires combining two values belonging to the two different parts of the two-party partitioned records to obtain the matching results while preserving each party's privacy. To solve this problem based on the protocol proposed in [22], we follow the basic structure: one party defines a polynomial whose roots are her inputs, and then encrypts the coefficients of this polynomial with homomorphic encryption. The other party can then use the homomorphic properties of the encryption system to evaluate the polynomial at each of his inputs. He then multiplies each result by a random number and adds to it an encryption of the value of his input. The result allows the party with the encrypted polynomial to find the values in the intersection of the two parties' inputs while protecting the privacy of the remaining values. Here, we note that the evaluating party owns one value of each combined pair of values, so it can easily combine its values with the other party's corresponding values by evaluating the encrypted polynomial. In our problem, the miner plays the role of a combiner; however, the miner does not know any values in either partitioned record. Therefore, our problem is clearly more difficult than the similar problems in the vertically partitioned data model.

The rest of this chapter is organized as follows. In Section 3.2, we introduce a privacy preserving protocol for frequency mining in 2PFD. We also prove the correctness and privacy properties of the protocol, and present experimental results on its efficiency. In Section 3.3, we demonstrate the usefulness of the frequency mining method by using it as a building block to design a privacy preserving protocol for naive Bayes learning, again with experimental results on efficiency. In Section 3.4, we give an improvement of the frequency mining protocol using the Shamir secret sharing scheme that allows the miner to obtain frequencies without requiring the full participation of all users. We conclude in Section 3.5.

In the 2PFD setting, a data set (a data table) consists of n records, and each record is described by values of nominal attributes. The data set is distributed across two sets of users U = {U1, U2, ..., Un} and V = {V1, V2, ..., Vn}. Each pair of users (Ui, Vi) owns a record in which user Ui knows the values for a proper subset of the attributes, and user Vi knows the values for the remaining attributes. Note that in this setting, the subset of attributes known by the Ui's is the same for all i, as is the set of remaining attributes known by the Vi's.

The miner aims to mine the frequency of a tuple of values in the data set. Assume that each user's data includes some sensitive attribute values. To protect users' privacy while still enabling frequency learning, our purpose is to design a protocol that enables the miner to learn frequencies from all users' data without learning any individual's sensitive values.

Assume that the tuple consists of two parts: the first part consists of values for some attributes belonging to Ui, and the second part consists of the remaining values for attributes belonging to Vi. In this case, each Ui outputs a boolean value ui (either 1 or 0) to indicate whether or not the data it holds matches the first part, and each Vi outputs a boolean value vi to indicate whether or not the data it holds matches the second part. It is clear that the frequency of the tuple is ∑_{i=1}^{n} ui vi. Therefore, the frequency computation problem can be defined as follows.

Definition 3.1. Assume that there are n pairs of users (Ui, Vi), where each Ui has a binary number ui and each Vi has a binary number vi. The privacy-preserving frequency computation problem is to allow a miner to compute f = ∑ ui vi without disclosing any information about the ui and vi. In other words, we need a privacy-preserving protocol for computing the following function:

(u1, v1, ..., un, vn) ↦ ∑ ui vi

The definition implies that each pair Ui and Vi provides the inputs ui and vi to the protocol, and the miner receives the output ∑ ui vi without any other information.
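For intuition, here is the frequency of Definition 3.1 computed in the clear on a made-up 2PFD table (the attribute names and values are hypothetical):

```python
# Hypothetical records: Ui holds the age group, Vi holds the diagnosis.
u_part = ["young", "old", "young", "young", "old"]
v_part = ["flu",   "flu", "cold",  "flu",   "cold"]

# Frequency of the tuple (age = "young", diagnosis = "flu").
u = [1 if a == "young" else 0 for a in u_part]   # Ui's boolean outputs
v = [1 if d == "flu" else 0 for d in v_part]     # Vi's boolean outputs

f = sum(ui * vi for ui, vi in zip(u, v))
assert f == 2    # records 1 and 4 match both parts
```

The protocols in this chapter compute this same sum without any party revealing its ui or vi.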

Our problem formulation is still appropriate when the tuple consists of values for attributes that belong only to Ui (or only to Vi). For example, when the tuple consists of values for attributes belonging only to Ui, Ui outputs a boolean value ui to indicate whether the data it holds matches all values in the tuple, and Vi outputs vi = 1. Clearly, the sum f = ∑ ui = ∑ ui vi is then the frequency value which needs to be computed. However, to compute ∑ ui, we can use the privacy-preserving frequency mining protocol for the fully distributed setting proposed in [74].

To be applicable, we require that the protocol can ensure users' privacy in an environment without any secure communication channel between the user and the miner, and it should not require any communication among the users. In addition, it should minimize the number of interactions between the user and the miner. In particular, the user Ui must not interact with the miner more than twice, and the user Vi must interact with the miner exactly once. These requirements make our protocol more applicable. For example, consider a real scenario where a miner uses a web application to survey a large number of users for his research: a user only needs to use his browser to communicate with the server once or twice, and does not have to communicate with the other users.

The privacy preservation of our protocol is based on the semi-honest security model. In this model, each party participating in the protocol has to follow the rules using correct input, and cannot use what it sees during execution of the protocol to compromise security. A general definition of secure multi-party computation in the semi-honest model is stated in [24] and was reviewed in the previous chapter. This definition was adapted to give a simplified definition in the semi-honest model for privacy-preserving data mining in the fully distributed setting [74]. Since that scenario is similar to the 2PFD setting, here we consider the possibility that some corrupted users share their data with the miner to derive the private data of the honest users. One of the requirements for our protocol is that no other private information about the honest users can be revealed, except a multivariate linear equation in which each variable represents a value of an honest user. In our model, the information known by users is no more than the information known by the miner, so we do not have to consider the problem in which users share information with each other.

In particular, we are going to introduce the privacy definition for the


protocol with the following parameter model. There are n pairs of users (Ui, Vi) and a miner involved in the frequency mining protocol. Ui and Vi have the private binary inputs ui and vi, respectively. The proposed protocol is based on the ElGamal encryption scheme. Thus, assume that prior to the execution of the protocol, each Ui has obtained a set of private keys Di(u) that corresponds to a set of public keys Ei(u), and each Vi has obtained a set of private keys Di(v) that corresponds to a set of public keys Ei(v). Note that each private key is kept secret, while the public keys are public information. The definition of privacy preservation in the semi-honest model is as follows:

Definition 3.2. A protocol for the frequency mining problem in Definition 3.1 is said to protect each user's privacy against the miner as well as t1 corrupted users Ui and t2 corrupted users Vi in the semi-honest model if, for all I1, I2 ⊆ {1, ..., n} such that |I1| = t1 and |I2| = t2, there exists a probabilistic polynomial-time algorithm M such that

{M(f, [ui, Di(u)]_{i∈I1}, [Ej(u)]_{j∉I1}, [vk, Dk(v)]_{k∈I2}, [El(v)]_{l∉I2})}

is computationally indistinguishable from the joint view of the miner and the corrupted users.

Our protocol is designed based on the homomorphic property of a variant of ElGamal encryption [27]. The privacy of our protocol is based on the semantic security of the ElGamal encryption scheme under the DDH assumption, which was introduced in the previous chapter. Note that the homomorphic property allows the miner to combine the encrypted results received from the users into the desired final result.

Let p and q be two primes such that q|(p − 1), let G be the subgroup of Zp∗ of order q, and let g be a generator of G. In the proposed protocol, we assume that each user Ui has private keys xi, yi uniformly chosen from {1, ..., q − 1}, and public keys Xi = g^xi, Yi = g^yi. Each user Vi has private keys pi, qi and public keys Pi = g^pi, Qi = g^qi. We note that computations in this thesis are always taken in Zp. We define

X = ∏_{i=1}^{n} Xi Pi and Y = ∏_{i=1}^{n} Yi Qi.

In the proposed protocol, X and Y are known by all users. As presented in Section 3.2.1, our purpose is to allow the miner to securely compute the sum f = ∑_{i=1}^{n} ui vi. Our protocol consists of four phases.

In the first three phases, each pair (Ui, Vi) runs a three-step non-interactive procedure over the server. By using the homomorphic property of the encryption scheme, at the end of the three steps they output K(ui, vi) = (g^{ui vi} X^{yi+qi}, Y^{xi+pi}). In other words, the ith pair of users implements a privacy preserving procedure for computing the following function:

(ui, vi) ↦ K(ui, vi)

Once computed, we obtain two goals. Firstly, it protects each user's privacy, because based on the semantic security property of ElGamal encryption, we
