Research on developing solutions to ensure information security in the data mining process (English summary)


MINISTRY OF EDUCATION AND TRAINING, MINISTRY OF NATIONAL DEFENCE

INSTITUTE OF MILITARY SCIENCE AND TECHNOLOGY

DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING

(Research on developing solutions to ensure information security in the data mining process)

Specialization: Mathematics for computers and computing systems

Code: 62 46 35 01

SUMMARY OF DOCTORAL THESIS IN MATHEMATICS

Hanoi, 2011


Chapter 1

INTRODUCTION

1.1 Privacy-preserving data mining: An overview

Data mining plays an important role in the current world and provides us with a powerful tool to efficiently discover valuable information from large databases [Han and Kamber, 2006]. However, the process of mining data can result in a violation of privacy. As a result, a large number of studies have been produced on the topic of privacy-preserving data mining (PPDM) [Verykios et al., 2004]. These studies deal with the problem of learning data mining models from databases while protecting data privacy at the individual or organizational level.

Basically, PPDM can be divided into the following three areas [Charu and Yu, 2008]. The first area is privacy-preserving data publishing. Studies in this area allow an organization (party) to publish its data to miners, the concern being how to publish the data so that the anonymized data remain useful for data mining applications. The second area is privacy-preserving distributed data mining. The model in this area usually consists of several parties, each holding a private data set. The general purpose is to enable the parties to mine cooperatively on their joint data sets without revealing the private information of any party. Here, data can be partitioned either vertically or horizontally. The third area is a scenario in which a data miner surveys a large number of users to learn some results based on the user data, or collects the user data, while protecting the sensitive attributes of these users.

1. The first work introduces a new scenario for privacy-preserving user data mining, called the 2-part fully distributed setting (2PFD), and finds solutions for a family of frequency-based learning algorithms in the 2PFD setting.

2. The second work develops novel privacy-preserving protocols for frequent itemset mining in vertically distributed data. The security of our protocols is better than that of previous protocols in that we achieve full privacy protection for each party. This property does not require the existence of any trusted party. In addition, no collusion of parties can cause a privacy breach.

3. The third work first develops a privacy-preserving EM-based clustering protocol for the multi-party model. Our protocol is more secure than existing ones because it is collusion resistant. In addition, it works not only for three or more parties but also for two parties. Second, we propose a better protocol for the case in which the dataset is horizontally partitioned into only two parts. This protocol requires protecting the privacy of the cluster centers.

4. The fourth work is a technique for designing protocols for privacy-preserving multivariate outlier detection in both horizontally and vertically distributed data.

The developed solutions are evaluated in terms of the degree of privacy protection, correctness, efficiency, and scalability. The contributions of this thesis are solutions to four problems in PPDM. Each problem is stated independently of the others, but they share a common framework. This framework can be pictured simply: we need to find knowledge from a distributed dataset while the privacy of the involved parties must be guaranteed. The problems differ in the way the dataset is obtained from the distributed parties and in the function proposed to keep the users' information private.

1.3 Organization of thesis

The thesis consists of six chapters and 109 A4 pages. Chapter 1 introduces an overview of PPDM and related works. Chapter 2 presents the basic definitions of secure multi-party computation and the techniques used frequently in this thesis. Chapter 3 proposes privacy-preserving frequency-based learning protocols in 2PFD. Chapter 4 presents two privacy-preserving protocols for distributed mining of frequent itemsets. Chapter 5 discusses privacy-preserving EM-based clustering protocols. Chapter 6 presents the technique for designing privacy-preserving outlier detection protocols for both vertically and horizontally distributed data, and we give the conclusion in the last section of this thesis.

Chapter 2

METHODS FOR SECURE MULTI-PARTY COMPUTATION

In this thesis, we use secure multi-party computation (SMC) and cryptographic tools as the building blocks to design privacy-preserving data mining protocols. Before discussing them in detail, we first review in this chapter some important definitions of SMC. Then, we summarize the techniques that will be used in the next chapters.

2.1 Definitions

In this section, we review basic definitions from computational complexity theory and SMC that will be used in this thesis [Goldreich, 2004].

Definition 2.1 Let $\mathbb{N}$ be the set of natural numbers. We say the function $\epsilon(\cdot) : \mathbb{N} \to (0, 1]$ is negligible in $n$ if, for every positive integer polynomial $\mathrm{poly}(\cdot)$, there exists an integer $n_0 > 0$ such that for all $n > n_0$,

$$\epsilon(n) < \frac{1}{\mathrm{poly}(n)}.$$

Computational indistinguishability is another important concept when discussing the security properties of distributed protocols [Goldreich, 2004]. Let $X = \{X_n\}_{n \in \mathbb{N}}$ be an ensemble indexed by a security parameter $n$ (which usually refers to the length of the input), where the $X_i$'s are random variables.

Definition 2.2 Two ensembles, $X = \{X_n\}_{n \in \mathbb{N}}$ and $Y = \{Y_n\}_{n \in \mathbb{N}}$, are computationally indistinguishable in polynomial time if for every probabilistic polynomial-time algorithm $A$,

$$\left| \Pr[A(X_n) = 1] - \Pr[A(Y_n) = 1] \right|$$

is a negligible function in $n$.

Secure multi-party computation: A distributed protocol among $n$ parties computes a functionality

$$f(x_1, x_2, \ldots, x_n) \mapsto (f_1(x_1, x_2, \ldots, x_n), \ldots, f_n(x_1, x_2, \ldots, x_n))$$

where each party $i$ knows only its private input $x_i$. For security, it is required that the privacy of any honest party's input is protected, in the sense that each dishonest party $i$ learns nothing except its own output $y_i = f_i(x_1, x_2, \ldots, x_n)$. If there is any malicious party that may deviate from the protocol, it is also required that each honest party gets a correct result whenever possible.

Privacy in the semi-honest model: In the distributed setting, let $\pi$ be an $n$-party protocol for computing $f$. Let $x$ denote $(x_1, \ldots, x_n)$. The view of the $i$th party ($i \in [1, n]$) during an execution of $\pi$ on $x$ is denoted by $\mathrm{view}_i^{\pi}(x)$, which includes $x_i$, all received messages, and all internal coin flips. For every subset $I$ of $[1, n]$, namely $I = \{i_1, \ldots, i_t\}$, let $f_I(x)$ denote $(y_{i_1}, \ldots, y_{i_t})$ and $\mathrm{view}_I^{\pi}(x) = (I, \mathrm{view}_{i_1}^{\pi}(x), \ldots, \mathrm{view}_{i_t}^{\pi}(x))$. Let $\mathrm{OUTPUT}(x)$ denote the output of all parties during the execution of $\pi$.

Definition 2.3 An $n$-party computation protocol $\pi$ for computing $f(\cdot, \ldots, \cdot)$ is secure with respect to semi-honest parties if there exists a probabilistic polynomial-time algorithm, denoted by $S$, such that for every $I \subset [1, n]$ we have

$$\{S(x_{i_1}, \ldots, x_{i_t}, f_I(x)), f(x)\} \stackrel{c}{\equiv} \{\mathrm{view}_I^{\pi}(x), \mathrm{OUTPUT}(x)\}$$

This definition states that the view of the parties in $I$ can be simulated from only the parties' inputs and outputs. If the function is privately computed by the protocol, then the privacy of each party's input data is protected. In this thesis, we focus on designing privacy-preserving protocols in the semi-honest model. The formal definition of security in the malicious model can be found in [Goldreich, 2004]. We also use the composition theorem for the semi-honest model, whose discussion and proof can be found in [Goldreich, 2004].

Theorem 2.1 (Composition theorem) Suppose that $g$ is privately reducible to $f$, and that there exists a protocol for privately computing $f$. Then there exists a protocol for privately computing $g$.

Variant ElGamal Cryptosystem: Our protocols in Chapters 3 and 4 are based on the standard variant of the ElGamal encryption scheme. ElGamal encryption is semantically secure under the decisional Diffie-Hellman (DDH) assumption [Boneh, 1998]. The computations are carried out in $\mathbb{Z}_p$ and the message space is $\mathbb{Z}_q$, where $p$ and $q$ are prime and $q \mid (p - 1)$. We briefly review the variant of the ElGamal encryption scheme as follows.

Let $G$ be a cyclic group of order $q$ ($G$ is a subgroup of $\mathbb{Z}_p^*$). Let $g$ be a generator of $G$, let $f \in \langle g \rangle$ be randomly selected, and let $x$ be uniformly chosen in $[1, q - 1]$. In the ElGamal encryption scheme, $x$ is the private key and the public key is $h = g^x$. Each user securely keeps their own private key, whereas public keys are publicly known.

To encrypt a message $m$ using the public key $h$, one randomly chooses $k$ in $[1, q - 1]$ and then computes the ciphertext $C = (C_1 = f^m h^k, C_2 = g^k)$. The decryption of the ciphertext $C$ with the private key $x$ can be executed by computing $f^m = C_1 (C_2^x)^{-1}$ and finding $m$ from $f^m$.

Decisional Diffie-Hellman Assumption: For uniformly random $a, b, c \in [0, q - 1]$, the DDH assumption is that $\{g^a, g^b, g^{ab}\} \stackrel{c}{\equiv} \{g^a, g^b, g^c\}$.
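The encryption and decryption steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the toy parameters `P = 2039`, `Q = 1019`, `G = 4`, the function names, and the brute-force search bound `max_m` are all chosen here for readability, and real deployments would use far larger primes. The `multiply` helper shows the additive homomorphism that the later protocols rely on.

```python
import secrets

# Toy parameters, assumed for illustration only: p = 2q + 1 with p, q prime,
# and g = 4 generates the subgroup of order q in Z_p^*.
P, Q, G = 2039, 1019, 4

def keygen():
    """Return (private key x, public key h = g^x mod p)."""
    x = secrets.randbelow(Q - 1) + 1          # x in [1, q-1]
    return x, pow(G, x, P)

def encrypt(m, h, f):
    """Encrypt m as (f^m * h^k, g^k) with fresh randomness k."""
    k = secrets.randbelow(Q - 1) + 1
    return (pow(f, m, P) * pow(h, k, P)) % P, pow(G, k, P)

def decrypt(c, x, f, max_m):
    """Recover f^m = C1 * (C2^x)^-1, then search for m in [0, max_m]."""
    c1, c2 = c
    fm = (c1 * pow(pow(c2, x, P), -1, P)) % P
    candidate = 1                              # f^0
    for m in range(max_m + 1):
        if candidate == fm:
            return m
        candidate = (candidate * f) % P
    raise ValueError("message outside search range")

def multiply(c, d):
    """Component-wise product: a ciphertext of the sum of the plaintexts."""
    return (c[0] * d[0]) % P, (c[1] * d[1]) % P

x, h = keygen()
f = pow(G, 7, P)                               # some element of <g>, assumed
total = decrypt(multiply(encrypt(2, h, f), encrypt(3, h, f)), x, f, max_m=10)
```

Because the message sits in the exponent, decryption needs a search over the possible values of $m$, which is practical only when the message range is small, as it is for counts.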

Oblivious polynomial evaluation (OPE) [Naor and Pinkas, 1999]: This problem involves a sender (Alice) and a receiver (Bob). The sender's input is [...] and Bob to learn $r_B$, where $r_A$ and $r_B$ are random integers, called shares, between $0$ and $M - 1$ such that $r_A + r_B \bmod M = A \cdot B$ (where $A \cdot B \in [0, M]$). In other words, an SSP protocol computes the following function:

$$(A, B) \to (r_1, r_2) \mid r_1 + r_2 = A \cdot B$$

Privately computing ln x [Kantarcioglu, 2005]: In secure multi-party mean computation, we need to be able to privately share $\ln x$, where $x = x_1 + x_2$ with $x_1$ known to $P_1$ and $x_2$ known to $P_2$. Thus, $P_1$ should get $y_1$ and $P_2$ should get $y_2$ such that $y_1 + y_2 = \ln x = \ln(x_1 + x_2)$. In other words, a protocol for computing $\ln(x)$ constructs the following function:

$$(x_1, x_2) \to (y_1, y_2) \mid y_1 + y_2 = \ln(x_1 + x_2)$$
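To make the idea of output shares concrete, the sketch below shows only the input/output behaviour of the share-producing function $(A, B) \to (r_1, r_2)$, assuming that $A$ and $B$ are integer vectors and $A \cdot B$ is their scalar product. It is not a secure protocol: a single function sees both inputs, which stands in for an imagined trusted dealer. The name `ideal_ssp` and the example values are hypothetical.

```python
import secrets

def ideal_ssp(A, B, M):
    """Ideal functionality only (NOT a secure protocol): split A.B mod M
    into two random-looking additive shares r_a and r_b."""
    dot = sum(a * b for a, b in zip(A, B)) % M
    r_a = secrets.randbelow(M)        # uniformly random share in [0, M-1]
    r_b = (dot - r_a) % M             # the complementary share
    return r_a, r_b

r_a, r_b = ideal_ssp([1, 2, 3], [4, 5, 6], M=1000)
recombined = (r_a + r_b) % 1000       # equals 1*4 + 2*5 + 3*6
```

Each share on its own is uniformly distributed and therefore reveals nothing about the scalar product; only the sum of the two shares does.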


Chapter 3

PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING

3.1 Introduction

In this chapter, we consider privacy-preserving frequency-based learning in a so-called 2-part fully distributed setting (2PFD). In this scenario, the dataset is distributed across a large number of users, and each record is owned by two different users: one user knows only the values of a subset of attributes, while the other knows the values of the remaining attributes. A miner aims to learn frequency-based models from their data while preserving each user's sensitive attributes. Some solutions based on randomization techniques can address this problem, but they suffer from a tradeoff between privacy and accuracy. In this chapter, we develop a cryptographic method that ensures each user's privacy without loss of accuracy. Our key contribution is the privacy-preserving frequency computation method in the 2-part fully distributed setting. To illustrate the applicability of this method, we used it to build a privacy-preserving protocol for naive Bayes classifier learning and show its other applications. The experimental results show that our protocol is very efficient.

3.2 Privacy preserving frequency mining in 2PFD setting

3.2.1 Problem formulation

The frequency computation problem in 2PFD can be reduced to the following simpler problem.

Assume that there are $n$ pairs of users $(U_i, V_i)$; each $U_i$ has a binary number $u_i$ and each $V_i$ has a binary number $v_i$. The privacy-preserving frequency computation problem is to allow a miner to compute $f = \sum_{i=1}^{n} u_i v_i$ without disclosing any information about $u_i$ and $v_i$. In other words, we need a privacy-preserving protocol for constructing the following function:

$$(u_1, v_1, \ldots, u_n, v_n) \mapsto \sum_{i=1}^{n} u_i v_i$$

This notation implies that each pair $U_i$ and $V_i$ provides inputs $u_i$ and $v_i$ to the protocol, and the miner receives the output $\sum_{i=1}^{n} u_i v_i$ without any other information.


3.2.2 Definition of privacy

The definition of privacy given below can be viewed as a simplification of the general definition in the semi-honest model [Goldreich, 2004]. Basically, the definition states that the computation is secure if the joint view of the miner and the corrupted users (the $t_1$ users $U_i$ and the $t_2$ users $V_i$) during the execution of the protocol can be effectively simulated by a simulator that reproduces what the miner and the corrupted users observe in the protocol using only the result $f$, the corrupted users' knowledge, and the public keys. Therefore, the miner and the corrupted users cannot learn anything beyond $f$.

3.2.3 Frequency mining protocol

Our protocol is designed based on the homomorphic property of a variant of ElGamal encryption [Hirt and Sako, 2000]. The privacy of our protocol is based on the semantic security of the ElGamal encryption scheme under the DDH assumption, which was introduced in the previous chapter.

Let $p$ and $q$ be two primes such that $q \mid (p - 1)$, let $G$ be a subgroup of $\mathbb{Z}_p^*$ of order $q$, and let $g$ be a generator of $G$. In the proposed protocol, we assume that each user $U_i$ has private keys $x_i, y_i$ uniformly chosen from $\{1, \ldots, q - 1\}$ and public keys $X_i = g^{x_i}$, $Y_i = g^{y_i}$. Each user $V_i$ has private keys $p_i, q_i$ and public keys $P_i = g^{p_i}$, $Q_i = g^{q_i}$. We note that all computations in this thesis take place in $\mathbb{Z}_p$. We define [...]

3.2.4 Correctness and Privacy Analysis

In the thesis, we proved the correctness of the protocol and showed that, under the semantic security of the ElGamal encryption scheme, our protocol preserves each user's privacy in the semi-honest model.

Theorem 3.1 The protocol presented in Figure 3.1 correctly computes the frequency value $f = \sum_{i=1}^{n} u_i v_i$.

Theorem 3.2 Assuming that $f < n$, the protocol in Figure 3.1 preserves the privacy of the honest users against the miner and up to $2n - 2$ corrupted users. In cases with only two honest users, the conclusion remains correct as long as the two honest users do not hold the attribute values of the same record.


• Phase 1. Each user $U_i$ does as follows:

– Randomly choose $k_i$ from $\{1, \ldots, q - 1\}$.

– Compute $C^{(i)} = (C_1^{(i)}, C_2^{(i)}) = (g^{u_i} X_i^{k_i}, g^{k_i})$.

– Send $C^{(i)}$ to the miner.

• Phase 2. Each user $V_i$ does as follows:

– Get $C^{(i)}$ from the miner.

– Randomly choose $r_i$ from $\{1, \ldots, q - 1\}$.

– If $v_i = 0$, compute $R^{(i)} = (R_1^{(i)}, R_2^{(i)}, R_3^{(i)}) = (X_i^{r_i} X^{q_i}, g^{r_i}, Y^{p_i})$.

– If $v_i = 1$, compute $R^{(i)} = (R_1^{(i)}, R_2^{(i)}, R_3^{(i)}) = (g^{u_i} X_i^{r_i + k_i} X^{q_i}, g^{r_i + k_i}, Y^{p_i})$.

– Send $R^{(i)}$ to the miner.

• Phase 3. Each user $U_i$ does as follows:

– Get $R^{(i)}$ from the miner.

– Compute $K(u_i, v_i) = (K_1^{(i)}, K_2^{(i)}) = (R_1^{(i)} (R_2^{(i)})^{-x_i} X^{y_i}, R_3^{(i)} Y^{x_i})$.

– Send $K(u_i, v_i)$ to the miner.

• Phase 4. The miner does as follows: [...]

Figure 3.1: Frequency mining protocol in the 2PFD setting.
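The message flow of Phases 1 to 3 can be checked end to end with a small simulation. The sketch below rests on several assumptions that are not stated in the text: the combined public values are taken to be $X = \prod_i X_i P_i$ and $Y = \prod_i Y_i Q_i$ (one choice that makes the blinding factors cancel), the miner's Phase 4 is taken to be $d = \prod_i K_1^{(i)} / \prod_i K_2^{(i)}$ followed by a search for $f$ with $g^f = d$, and the group parameters are toy values. All names are hypothetical, and every party is simulated inside one function, so this demonstrates correctness only, not privacy.

```python
import secrets

P, Q, G = 2039, 1019, 4   # toy group (assumed): g has prime order q in Z_p^*

def rand_exp():
    return secrets.randbelow(Q - 1) + 1        # uniform in [1, q-1]

def inv(a):
    return pow(a, -1, P)

def frequency(u, v):
    """Simulate all parties and return f = sum(u_i * v_i)."""
    n = len(u)
    x = [rand_exp() for _ in range(n)]         # U_i private keys x_i, y_i
    y = [rand_exp() for _ in range(n)]
    p_ = [rand_exp() for _ in range(n)]        # V_i private keys p_i, q_i
    q_ = [rand_exp() for _ in range(n)]
    # ASSUMPTION: combined public values X = prod X_i P_i, Y = prod Y_i Q_i.
    X = Y = 1
    for i in range(n):
        X = X * pow(G, x[i], P) * pow(G, p_[i], P) % P
        Y = Y * pow(G, y[i], P) * pow(G, q_[i], P) % P
    prod_k1 = prod_k2 = 1
    for i in range(n):
        Xi = pow(G, x[i], P)
        # Phase 1: U_i encrypts u_i under its own key X_i.
        k = rand_exp()
        c1, c2 = pow(G, u[i], P) * pow(Xi, k, P) % P, pow(G, k, P)
        # Phase 2: V_i keeps the ciphertext if v_i = 1, else replaces it
        # by an encryption of 0, then re-randomizes and blinds with X^{q_i}.
        r = rand_exp()
        base1, base2 = (c1, c2) if v[i] == 1 else (1, 1)
        r1 = base1 * pow(Xi, r, P) * pow(X, q_[i], P) % P
        r2 = base2 * pow(G, r, P) % P
        r3 = pow(Y, p_[i], P)
        # Phase 3: U_i strips its own encryption layer and blinds with X^{y_i}.
        k1 = r1 * inv(pow(r2, x[i], P)) * pow(X, y[i], P) % P
        k2 = r3 * pow(Y, x[i], P) % P
        prod_k1, prod_k2 = prod_k1 * k1 % P, prod_k2 * k2 % P
    # Phase 4 (ASSUMED): the blinding factors cancel, leaving g^f.
    d = prod_k1 * inv(prod_k2) % P
    candidate = 1
    for f in range(n + 1):
        if candidate == d:
            return f
        candidate = candidate * G % P
    raise ValueError("no frequency value found")

result = frequency([1, 0, 1, 1], [1, 1, 0, 1])
```

Under these assumptions, $\prod_i K_1^{(i)} = g^{f} X^{\sum_i (q_i + y_i)}$ and $\prod_i K_2^{(i)} = Y^{\sum_i (p_i + x_i)}$, and the two blinding terms are equal, so the quotient is $g^f$.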

[...] 1000 to 5000. Before executing the protocol, we generate the key pairs for each user, with the sizes of $p$ and $q$ set at 1024 bits and 160 bits, and compute the values $X$ and $Y$. The results show that the average times used by each $U_i$ to compute the first-phase and third-phase messages are about 21 ms and 29 ms, respectively. Each $V_i$ needs about 32 ms on average to compute her messages. The miner's computation time is nearly linear in $n$; for example, when $n = 5000$, the miner needs only about 460 ms for the computation.


3.3 Frequency-based Learning in 2PFD Setting

The frequency mining method is very useful in privacy-preserving data mining applications whose learning is based on frequencies, such as naive Bayes, association rule mining, ID3 learning, and Pearson correlation analysis. In this thesis, we demonstrated the usefulness of the frequency mining method by using it as a primitive to design a privacy-preserving protocol for naive Bayes learning.

3.4 An improvement of frequency mining protocol

3.4.1 Improved frequency mining protocol

A problem of the frequency mining protocol is that a single client may be able to disrupt the system. Thus, our purpose is to improve the frequency mining protocol so that a set $S$ of $t$ user pairs can obtain the frequency without requiring the presence of all users, where $t \ge k$ and $k$ is a defined threshold. We extend the idea of the threshold decryption system [Noack and Spitz, 2009] to solve this problem. For an $(n, k)$ threshold scheme, the basic idea is that a private key is shared among $n$ users using $(n, k)$-Shamir secret sharing, so that when only a set $T$ of $k$ users is involved in the protocol, the miner can decrypt a ciphertext using Lagrange interpolation without explicitly reconstructing the private key.
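The threshold idea described above, decrypting with Lagrange interpolation in the exponent so that the shared key is never reconstructed, can be sketched as follows. This is a generic illustration under assumptions rather than the protocol of this chapter: it uses a single shared key $x_0$ with public key $h = g^{x_0}$, toy group parameters, and hypothetical function names.

```python
import secrets

P, Q, G = 2039, 1019, 4   # toy group (assumed): g has prime order q in Z_p^*

def share(secret, n, k):
    """(n, k)-Shamir sharing: evaluate a random degree-(k-1) polynomial
    with constant term `secret` at the points 1..n over Z_q."""
    coeffs = [secret] + [secrets.randbelow(Q) for _ in range(k - 1)]
    return {i: sum(c * pow(i, e, Q) for e, c in enumerate(coeffs)) % Q
            for i in range(1, n + 1)}

def lagrange_at_zero(t, T):
    """Lagrange coefficient prod_{j in T, j != t} (-j)/(t-j) mod q."""
    num = den = 1
    for j in T:
        if j != t:
            num = num * (-j) % Q
            den = den * (t - j) % Q
    return num * pow(den, -1, Q) % Q

def threshold_decrypt(c1, c2, shares, T):
    """Each user t in T contributes c2^{x_t}; the contributions are combined
    in the exponent, so the key x_0 itself is never reconstructed."""
    combined = 1
    for t in T:
        partial = pow(c2, shares[t], P)                  # computed by user t
        combined = combined * pow(partial, lagrange_at_zero(t, T), P) % P
    return c1 * pow(combined, -1, P) % P                 # equals g^m

x0 = 123                                   # shared key seed (example value)
shares = share(x0, n=5, k=3)
h = pow(G, x0, P)                          # public key g^{x0}
k_enc, m = 77, 9
c1, c2 = pow(G, m, P) * pow(h, k_enc, P) % P, pow(G, k_enc, P)
recovered = threshold_decrypt(c1, c2, shares, T=[1, 3, 5])
```

Any $k$ of the $n$ shares give the same result, which is what lets the computation proceed when some users are absent.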

In the proposed protocol, we assume that two key seeds $x_0$ and $p_0 \in [1, q - 1]$ are shared among the $n$ users $U_i$ and the $n$ users $V_i$ by $(n, k)$-Shamir secret sharing. The shares owned by $U_i$ and $V_i$ are $x_i = f(i)$ and $p_i = h(i)$ respectively, where $f(x)$ and $h(x)$ are random polynomials of degree $k - 1$ over $\mathbb{Z}_q$ such that $f(0) = x_0$ and $h(0) = p_0$. Thus, each user $U_i$ has the key pair $(x_i, X_i = g^{x_i})$ and each $V_i$ has $(p_i, P_i = g^{p_i})$. In our protocol, $H = g^{x_0 + p_0}$ is announced as the general public key. The detailed phases of the improved frequency mining protocol are presented in Figure 3.7.

3.4.2 Protocol Analysis

Unlike in the previous protocol, the private keys $y_i$ and $q_i$ of the improved protocol are temporary keys chosen at encryption time. The general key $Y$ is replaced by $g$ and $X$ is replaced by $H$. This protocol preserves the privacy of each user against up to $2k - 2$ corrupted users. In the improved protocol, the computational cost of each user increases by one modular exponentiation. The computational complexity for the miner is nearly equal to that of the previous protocol.

• Phase 1. Each user $U_i$ does as follows:

– Randomly choose $k_i$ from $\{1, \ldots, q - 1\}$.

– Compute $C^{(i)} = (C_1^{(i)}, C_2^{(i)}) = (g^{u_i} X_i^{k_i}, g^{k_i})$.

– Send $C^{(i)}$ to the miner.

• Phase 2. Each user $V_i$ does as follows:

– Get $C^{(i)}$ from the miner.

– Randomly choose $r_i$ and $q_i$ from $\{1, \ldots, q - 1\}$.

– If $v_i = 0$, compute $R^{(i)} = (R_1^{(i)}, R_2^{(i)}, R_3^{(i)}) = (X_i^{r_i} H^{q_i}, g^{r_i}, g^{q_i})$.

– If $v_i = 1$, compute $R^{(i)} = (R_1^{(i)}, R_2^{(i)}, R_3^{(i)}) = (g^{u_i} X_i^{r_i + k_i} H^{q_i}, g^{r_i + k_i}, g^{q_i})$.

– Send $R^{(i)}$ to the miner.

• Phase 3. Each user $U_i$ does as follows:

– Get $R^{(i)}$ from the miner.

– Randomly choose $y_i$ from $\{1, \ldots, q - 1\}$.

– Compute $K^{(i)} = (K_1^{(i)}, K_2^{(i)}) = (R_1^{(i)} (R_2^{(i)})^{-x_i} H^{y_i}, R_3^{(i)} g^{y_i})$.

– Send $K^{(i)}$ to the miner.

• Phase 4. The miner does as follows:

– Compute $K = \prod_{i \in S} K_2^{(i)}$.

• Phase 5. The users do as follows:

– Each $U_i$ computes $a_i = K^{x_i}$ and sends $a_i$ to the miner.

– Each $V_i$ computes $b_i = K^{p_i}$ and sends $b_i$ to the miner.

• Phase 6. The miner does as follows:

– Compute $d = \dfrac{\prod_{i=1}^{n} K_1^{(i)}}{\prod_{t \in T} (a_t b_t)^{\prod_{j \in T, j \neq t} \frac{-j}{t - j}}}$.

Figure 3.7: Improved frequency mining protocol.

3.5 Conclusion

In this chapter, we proposed a method for privacy-preserving frequency-based learning in the 2PFD setting, which has not been investigated previously. Basically, the proposed method is based on the ElGamal encryption scheme, and it can provide [...] as well. We discussed an improvement of the protocol using the Shamir secret sharing scheme to allow the miner to obtain the frequency without requiring the full participation of the $n$ user pairs.


Chapter 4

ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY DISTRIBUTED DATA

4.1 Introduction

In this chapter, we present protocols for vertically partitioned data: a portion of each transaction is present at each party, but no party holds complete information for any transaction. Several protocols have been proposed for this problem [Zhong, 2007, Vaidya and Clifton, 2005, Han and Ng, 2007]. However, some of them resist the collusion of at most $n - 2$ corrupted parties among $n$ participants, while others require at least one non-colluding party. We propose protocols for privacy-preserving frequent itemset mining that do not require any trusted party and protect the privacy of each party against the collusion of any group of corrupted parties. In addition, we give two protocols that let the parties select one of two privacy levels: one reveals only the support count, and the other reveals nothing.

4.2 Problem statement

The association rule and frequent itemset mining problem is formally stated in [Cheung et al., 1996]. Given a database $D$ with $m$ transactions, the problem is to find the association rules that have an implication of the form $X \Rightarrow Y$, where $X$ and $Y$ are subsets of the set of items of $D$ and $X \cap Y = \emptyset$. An itemset $X$ is frequent if its support count (the number of transactions containing $X$) is not less than the minimum support count $t$. The main technical problem in association rule mining is to find frequent itemsets.

Assume that $D$ is vertically distributed over $n$ parties $P_1, \ldots, P_n$. The parties wish to find the frequent itemsets from $D$, where $D$ is called the joint data set of all parties. Our aim is to design distributed protocols to obtain the frequent itemsets while preserving the privacy of each party's data. We consider privacy as protecting individual data records as well as protecting information about each party's local support count of the frequent itemsets and even the global support count of the joint database. The frequent itemset identifying problem can be formulated as follows.

In a distributed setting with $n$ parties, each party $P_i$ has a private vector $U_i = (u_{i1}, \ldots, u_{im})$, where each $u_{ij} \in \{0, 1\}$, $i = 1, \ldots, n$, and $j = 1, \ldots, m$. For a public threshold $t$, the privacy-preserving frequent itemset identifying problem is to check whether $s = \sum_{j=1}^{m} \prod_{i=1}^{n} u_{ij} \ge t$ without disclosing any private information of the participants.
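As a plain, non-private reference point, the quantity being tested can be computed directly when all vectors are available in one place. The short sketch below only fixes what the protocols must compute; the function names and the example vectors are hypothetical.

```python
from math import prod

def support_count(vectors):
    """s = sum over transactions j of the product over parties i of u_ij."""
    m = len(vectors[0])
    return sum(prod(U[j] for U in vectors) for j in range(m))

def is_frequent(vectors, t):
    """True when the support count reaches the public threshold t."""
    return support_count(vectors) >= t

# Three parties, four transactions: only transactions 0 and 3 contain
# every party's part of the itemset.
U = [[1, 1, 0, 1],
     [1, 0, 1, 1],
     [1, 1, 1, 1]]
s = support_count(U)
```

A privacy-preserving protocol has to reach the same decision as `is_frequent` without any party seeing another party's vector.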

4.3 Privacy definition

The privacy preservation of the proposed protocol is based on the semi-honest security model [Goldreich, 2004]. Thus, we give the privacy definition for the proposed protocol with the following parameters. There are $n$ parties involved in the protocol. Each party $P_i$ has the private input $U_i$, where $U_i$ is a binary vector. We assume that prior to the protocol, each party has obtained a key pair for the ElGamal encryption scheme: the private key $x_i$ and the public key $y_i$. Each party's public key is known to all members of the system, while the private key is kept secret. Basically, this definition is similar to Definition 2.3, except that the view of each party includes the public keys of the other parties, and each party's private key is a component of its input.

4.4 Support count preserving protocol

4.4.1 Overview

Assuming that $X$ is a frequent itemset, we have $t \le s \le m$. Thus, there exists a 0 in the list $\lambda = \{\lambda_1 = s - 1 - t, \lambda_2 = s - 2 - t, \ldots, \lambda_k = s - k - t\}$, where $k = m - t$. If $s$ is known by all parties, this problem can be solved immediately. However, for our purpose of strong privacy, this vector cannot be revealed. Therefore, the basic idea of the protocol is as follows. Let $p$ and $q$ be two primes such that $q \mid (p - 1)$, let $G$ be a subgroup of $\mathbb{Z}_p^*$ of order $q$, and let $g$ be a generator of $G$. All computations in this chapter take place in $\mathbb{Z}_p$. The proposed protocol implements the following function: [...]

The joint decryption technique [Hirt and Sako, 2000]: We assume that each party has a key pair $(x_i, y_i = g^{x_i})$. We define $y = \prod_{i=1}^{n} y_i = g^x$. In our protocol, the parties use $y$ as a public key to encrypt their data, and each message $m$ is changed to $g^m$ before encrypting. Decryption needs to be performed jointly by all parties.
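The joint key and joint decryption described above can be sketched as follows, assuming the ciphertext has the form $(g^m y^k, g^k)$ as in the variant reviewed in Chapter 2. The toy parameters and function names are hypothetical, and all parties are simulated in one process, so this shows only why the layers cancel.

```python
import secrets

P, Q, G = 2039, 1019, 4   # toy group (assumed): g has prime order q in Z_p^*

def joint_keygen(n):
    """Each party picks x_i; the joint public key is y = prod g^{x_i}."""
    xs = [secrets.randbelow(Q - 1) + 1 for _ in range(n)]
    y = 1
    for x in xs:
        y = y * pow(G, x, P) % P
    return xs, y

def encrypt(m, y):
    """Encrypt g^m under the joint key y (ciphertext form is an assumption)."""
    k = secrets.randbelow(Q - 1) + 1
    return pow(G, m, P) * pow(y, k, P) % P, pow(G, k, P)

def joint_decrypt(c, xs):
    """Every party removes its own layer c2^{x_i}; g^m remains at the end."""
    c1, c2 = c
    for x in xs:                       # one step per party, in any order
        c1 = c1 * pow(pow(c2, x, P), -1, P) % P
    return c1

xs, y = joint_keygen(4)
plain = joint_decrypt(encrypt(3, y), xs)      # equals g^3
zero_case = joint_decrypt(encrypt(0, y), xs)  # equals g^0 = 1
```

Since the result is $g^m$ rather than $m$, a decrypted value of 1 signals that the encrypted message was 0, which is the kind of test used on the list $\lambda$.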

Rerandomization technique [Markus and Patrick, 1996]: A rerandomization is a multi-party protocol that involves several mix servers. The input to the protocol is a list of ciphertext items $\{(a_1, h_1), \ldots, (a_m, h_m)\}$ and the output is a re-encrypted, permuted list of those ciphertext items $\{(a'_{\pi(1)}, h'_{\pi(1)}), \ldots, (a'_{\pi(m)}, h'_{\pi(m)})\}$. The security of this technique is characterized as follows: by looking at these two sequences of ciphertexts, the adversary cannot determine any information about the correspondence between the new ciphertexts and the old ciphertexts. In the proposed protocol, we use a rerandomization technique based on ElGamal encryption, in which each party plays the role of a mix server.

4.4.2 Protocol design

The protocol is presented in Figure 4.1

4.4.3 Correctness Analysis

Theorem 4.1 If all participants follow the protocol and there exists a plaintext "1" in the decryption list, then $s < t$.

4.4.4 Privacy Analysis

The important security feature of our protocol, which improves on the previous method, is that we do not assume the existence of any kind of trusted party. Moreover, no collusion of parties can lead to the revelation of any private information, unless all parties together form a single collusion, which is not significant.

Theorem 4.2 The protocol in Subsection 4.4.2 preserves the privacy of the honest parties against the collusion of up to $n - 1$ corrupted parties.

4.4.5 Performance analysis

Let the size of each party's key be $K$ bits. The upper bound on the total communication cost of the protocol is $O(nmK)$, which is equivalent to that of Zhong [Zhong, 2007]. The computation of the protocol is bounded by $O(mn)$ exponentiations and $O(mn)$ inversions. However, these operations can be computed concurrently. Therefore, the overall computational complexity is $O(m)$, which is equivalent to that of Zhong [Zhong, 2007].

Title	Research on developing solutions to ensure information security in the data mining process
University	Hanoi University of Science and Technology
Major	Mathematics for computers and computing systems
Type	Thesis
Year of publication	2011
City	Hanoi
