A Privacy-preserving Query on Outsourced Database with B-tree ⋆
Sha Ma a,∗, Bo Yang b, Kangshun Li a, Feng Xia a
a Department of Informatics, South China Agricultural University, Guangzhou 510642, China
b School of Computer Science, Shaanxi Normal University, Shaanxi 710062, China
Abstract
In an outsourced database, once the data is encrypted, query processing becomes more difficult than in a traditional plaintext database. Providing a query service that preserves privacy is therefore an essential concern in such a framework. This paper proposes a novel method for privacy-preserving queries on an outsourced database with a B-tree index: the user first searches the B-tree with PIR and then obtains the query results with PIR again. We describe a scheme that enables a user to access an encrypted database accurately, retrieve information privately, and obtain only the query results without learning any other information. Our contributions include a set of security notions for such a system and a construction that is secure under the newly introduced notions.
Keywords: Outsourced Database; Database Security; Private Information Retrieval; B-tree; Order-preserving Symmetric Encryption
The proliferation of a new breed of data management applications that store and process data at remote locations has led to the emergence of data outsourcing, or database as a service, as an important research problem [1, 2, 3]. In a typical setting, data is stored at the remote location in encrypted form. A query generated at the client side is transformed into a representation that can be evaluated directly on the encrypted data at the remote location. The client may then process the results after decryption to determine the final answers. The nature of data processing changes when the level of trust in the service provider decreases from complete to partial to (perhaps) none at all. Such varying trust scenarios necessitate various security-enhancing techniques in the context of outsourced databases [4, 5, 6, 7, 8, 9]. Our motivation is to preserve query privacy in the passive adversary model, where the adversary may be, e.g., the database administrator or the user.
⋆This work is supported by the National Natural Science Foundation of China under Grants 60973134, 61173164 and 70971043, and the Natural Science Foundation of Guangdong Province under Grant 10351806001000000
∗Corresponding author.
Email address: martin deng@163.com (Sha Ma).
1548–7741 / Copyright © 2012 Binary Information Press
March 2012
Why do we care about the privacy of a database query? Consider the following typical real-life scenario. An outsourced database server contains diagnosis information about various diseases. Alice thinks that she may have some disease, so she wants to investigate it further. After Alice sends the description of the disease to the outsourced database, it returns the corresponding diagnosis information. If Alice’s query is found in the database, the server immediately knows that Alice may have such a disease; even worse, from the description it can derive much else about Alice, such as other health problems that she might have. If the server is not trustworthy, it could disclose this information to other parties, and Alice might have difficulty getting employment, insurance, credit, etc. Even if Alice trusts the server, and it has no intention of disclosing her private information, the server itself might prefer that Alice’s query be kept private out of liability concerns: if the server knows Alice’s disease information and that information is accidentally disclosed (perhaps through an external system intrusion), the server might face an expensive lawsuit from Alice. From this perspective, a trusted server would actually prefer to know neither Alice’s query nor its response.

The technique most closely related to this problem is Private Information Retrieval (PIR), which has been widely studied [10, 11, 12, 13]. The PIR problem consists of devising a protocol involving a user and a database server, each having a secret input. The database’s secret input is called the data string, an n-bit string $B = b_1, b_2, \cdots, b_n$. The user’s secret input is an integer $i$ between 1 and $n$. The protocol should enable the user to learn $b_i$ in a communication-efficient way while hiding $i$ from the database.

However, PIR cannot be used directly in an outsourced database, for three main reasons. First, the user does not know the physical address (e.g., $i = 2$) of the data in the outsourced database. The user usually sends an SQL statement that includes predicates of the form attribute op constant (where op is one of =, <, >, ≤, ≥, etc.) to the database server, and the server retrieves the correct results according to the predicates. Second, most research on PIR focuses on user privacy and does not address data privacy. In other words, besides the single bit $b_i$, the user may obtain other physical bits of the data (i.e., $b_j$ for $j \neq i$) or other information such as the exclusive-or of certain subsets of the bits of $B$. Although an Oblivious Transfer (OT) protocol can meet this requirement, using OT on an outsourced database is not a good solution because of its significant communication complexity. Third, PIR does not guarantee data confidentiality: the data are still stored in plaintext. In an outsourced database, however, the data stored at the service provider should be encrypted, because they are outside the data owner’s control and may be stolen by a malicious adversary. A new protocol is therefore required to realize privacy-preserving queries on an outsourced database.
A trivial solution for privacy-preserving queries on an outsourced database is to send all the encrypted data to the client, which decrypts it and executes the query on the plaintext. Obviously, this weakens the advantage of database outsourcing because it drastically increases the user’s computational cost. In addition, data privacy is no longer guaranteed, because the client obtains all the information after decryption. While much research focuses on how to query encrypted data efficiently [14, 2, 15], privacy preservation in this scenario has also become an interesting research direction [16, 17, 18, 19]. This paper addresses a particular way of querying an outsourced database: PIR applied to a B-tree index. A search on a B-tree index visits specific nodes, starting from the root, and these nodes have stable physical addresses as long as the structure of the tree does not change, so we can use PIR to retrieve them while preserving query privacy. A remaining problem is that when the user receives the requested nodes of the index tree, decrypting them may still disclose other information that should be hidden from the user, because the decrypted data may lie outside the query results, which violates data privacy. Our solution is to apply order-preserving encryption to the search keys of the nodes in the B-tree index.
The rest of this paper is organized as follows. Section 2 describes the preliminaries used in our construction. In Section 3, we present a general framework for privacy-preserving queries on an outsourced database with a B-tree, together with security definitions. Section 4 describes our construction and proves that it is secure under the introduced security notions. Finally, Section 5 concludes the paper.
2.1 B-tree
To speed up data access, the B-tree index structure is very popular in modern database applications (in this paper, we refer to all variants of the B-tree, e.g., the B+tree, simply as B-tree). In [15], the authors choose to encrypt each tree node as a whole, because protecting a tree-based index by encrypting each of its fields separately would disclose the ordering relationship between the index values. The original tree is then stored as a table with two attributes: the node ID, automatically assigned by the system on insertion, and an encrypted value representing the node content. The advantage of this solution is that the content of the B-tree nodes is not visible to the untrusted DBMS. The drawback, however, is that user privacy and data privacy are not protected during query processing. Intuitively, to execute an interval query, the front end has to perform a sequence of queries that retrieve tree nodes at progressively deeper levels. The user’s access pattern may thus be disclosed, since the information collected during the retrieval of tree nodes reveals the structure of the whole tree. Fig. 1 shows an example of a B-tree on the attribute Customer with sample values.
Assume the front end produces a sequence of queries that access nodes 0, 1, and 4 in that order. The server then knows that the user accessed nodes 0, 1, and 4, and that node 0 is the root, node 1 is an internal node, and node 4 is a leaf of the tree. Using such information collected gradually, together with statistical methods, the server can rebuild the whole tree and infer sensitive information from the encrypted database. To solve this problem, we use PIR to access the nodes at each level of the B-tree, which provides user privacy. In addition, during the query the user learns more than the query result: by decrypting nodes 0, 1, and 4, she finds out that there are at least two other customers, named Jane and Donna, in the database, so data privacy is not satisfied. To solve this second problem, we encrypt each node of the B-tree twice, first with an OPE algorithm and then with a general encryption algorithm. Our solution rests on a simple idea: preserve the typical structure of the B-tree by encrypting each of its fields with OPE, and break the correlation between the data and the corresponding index items by using different identifying information and different encryption.
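The access-pattern leak described above can be made concrete with a small sketch. The tree below is hypothetical: the node numbers, keys, and dictionary layout are assumptions chosen so that one lookup visits nodes 0, 1, and 4 as in the example, and it is not the tree of Fig. 1. The list returned by `search` is what an honest-but-curious server could record if nodes were fetched by id without PIR.

```python
from bisect import bisect_right

# Hypothetical B+tree stored as "node id -> node"; all names and keys are illustrative.
NODES = {
    0: {"keys": ["Donna"], "children": [1, 2]},        # root
    1: {"keys": ["Bob"], "children": [3, 4]},          # internal node
    2: {"keys": ["Jane"], "children": [5, 6]},         # internal node
    3: {"keys": ["Alice"], "children": []},            # leaves below
    4: {"keys": ["Bob", "Chris"], "children": []},
    5: {"keys": ["Donna", "Eve"], "children": []},
    6: {"keys": ["Jane", "Mary"], "children": []},
}


def search(nodes, root_id, target):
    """Plain lookup; returns (found, ids of the nodes fetched from the server)."""
    visited = []
    node_id = root_id
    while True:
        node = nodes[node_id]
        visited.append(node_id)          # the server sees this id in the clear
        if not node["children"]:         # leaf reached
            return target in node["keys"], visited
        # Child i holds the keys smaller than separator i (and >= separator i-1).
        node_id = node["children"][bisect_right(node["keys"], target)]


found, access_pattern = search(NODES, 0, "Chris")
print(found, access_pattern)
```

Each fetched id reveals one level of the tree, which is why the scheme retrieves every level through PIR instead of by node id.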
2.2 PIR
A Private Information Retrieval (PIR) scheme allows a user to retrieve information from a database while keeping the query private from the database managers. In this model, the database is viewed as an n-bit string $x$ out of which the user retrieves the $i$-th bit $x_i$, while giving the database no information about the index $i$. The main cost measure for such a scheme is its communication complexity. The notion of PIR was introduced in Ref. [10], where it was shown that if only one copy of the database is available, then $n$ bits of communication are needed (for information-theoretic user privacy). However, if there are $k \geq 2$ non-communicating copies of the database, then there are solutions with much better communication complexity. Gertner et al. first introduced the model of Symmetrically-private Information Retrieval (SPIR) [20], in which the privacy of the data, as well as the privacy of the user, is guaranteed. That is, in every invocation of a SPIR protocol, the user learns only a single physical bit of $x$ and no other information about the data. SPIR is realized with $k$ databases ($k \geq 2$) and is the first implementation of a distributed version of 1-out-of-$n$ oblivious transfer. A single-database PIR scheme, which is a non-trivial PIR protocol, was proposed by Di Crescenzo et al. at EUROCRYPT 2000 [13]. At the end of an execution of the protocol, the following two properties must hold: (1) after applying the reconstruction function, the user obtains the $i$-th data bit $x_i$; and (2) the distributions of the queries sent to the database are computationally indistinguishable for any two indices $i$, $i'$.

Fig. 1: (a) B-tree and (b) Plaintext table and encrypted table for B-tree
Definition 1 (PIR) Let $(D, U)$ be an interactive protocol, and let $R$ be a polynomial-time algorithm. $\mathrm{Prob}[R_1; \cdots; R_n : E]$ denotes the probability of event $E$ after the execution of the random processes $R_1, \cdots, R_n$. The notation $t_{A,B}(x, r_A, y, r_B)$ denotes the transcript of an execution of an interactive protocol $(A, B)$ with input $x$ for $A$ and $y$ for $B$, and with random string $r_A$ for $A$ and $r_B$ for $B$; $(r_A, r_B, t) \leftarrow t_{A,B}(x, \cdot, y, \cdot)$ denotes the case where the random strings for both $A$ and $B$ are chosen uniformly at random. We say that $(D, U, R)$ is a private information retrieval (PIR) scheme if:
1. (Correctness) For each $n \in \mathbb{N}$, each $i \in \{1, \ldots, n\}$, each $x \in \{0, 1\}^n$, where $x = x_1 \circ \cdots \circ x_n$ and $x_l \in \{0, 1\}$ for $l = 1, \ldots, n$, for all constants $c$, and all sufficiently large $k$,
\[
\mathrm{Prob}\big[(r_D, r_U, t) \leftarrow t_{D,U}((1^k, x), \cdot, (1^k, n, i), \cdot) : R(1^k, n, i, r_U, t) = x_i\big] \geq 1 - k^{-c}.
\]
2. (User Privacy) For each $n \in \mathbb{N}$, each $i, j \in \{1, \ldots, n\}$, each $x \in \{0, 1\}^n$, where $x = x_1 \circ \cdots \circ x_n$ and $x_l \in \{0, 1\}$ for $l = 1, \ldots, n$, for each polynomial-time $D'$, for all constants $c$, and all sufficiently large $k$, it holds that $|p_i - p_j| \leq k^{-c}$, where
\[
p_i = \mathrm{Prob}\big[(r_{D'}, r_U, t) \leftarrow t_{D',U}((1^k, x), \cdot, (1^k, n, i), \cdot) : D'(1^k, x, r_{D'}, t) = 1\big],
\]
\[
p_j = \mathrm{Prob}\big[(r_{D'}, r_U, t) \leftarrow t_{D',U}((1^k, x), \cdot, (1^k, n, j), \cdot) : D'(1^k, x, r_{D'}, t) = 1\big].
\]
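As a toy illustration of the correctness and user-privacy requirements, the sketch below implements the classic two-database XOR scheme. This is an assumption made for illustration: it is information-theoretic and needs two non-communicating copies of $x$, so it is not the single-database scheme of [13], and with $n$-bit queries it does not show any communication savings. Indices are 0-based here, and all function names are hypothetical.

```python
import secrets


def make_queries(n, i):
    """User side: two selection vectors that differ only at position i."""
    s1 = [secrets.randbelow(2) for _ in range(n)]   # uniformly random bits
    s2 = list(s1)
    s2[i] ^= 1                                      # flip the wanted position
    return s1, s2


def answer(x, selection):
    """Server side: XOR of the database bits picked out by the selection vector."""
    bit = 0
    for x_l, s_l in zip(x, selection):
        bit ^= x_l & s_l
    return bit


def reconstruct(a1, a2):
    """User side: every position except i cancels, leaving x[i]."""
    return a1 ^ a2


x = [1, 0, 1, 1, 0, 0, 1, 0]          # the database string, held by both servers
s1, s2 = make_queries(len(x), 3)
print(reconstruct(answer(x, s1), answer(x, s2)) == x[3])
```

Each server on its own sees a uniformly random bit vector, so its view is distributed identically for every index, which is the two-server analogue of $|p_i - p_j|$ being small.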
2.3 OPE
Order-preserving Symmetric Encryption (OPE) is a deterministic encryption scheme whose encryption function preserves the numerical ordering of the plaintexts. More precisely, for $A, B \subseteq \mathbb{N}$ with $|A| \leq |B|$, a function $f : A \to B$ is order-preserving if for all $i, j \in A$, $f(i) > f(j)$ iff $i > j$. OPE has a long history in the form of one-part codes, which are lists of plaintexts and the corresponding ciphertexts, both arranged in alphabetical or numerical order so that only a single copy is required for efficient encryption and decryption. Agrawal et al. first suggested OPE as a primitive for allowing efficient range queries on encrypted data in the database community [21]. However, their construction is rather ad hoc and has certain limitations; namely, its encryption algorithm must take as input all the plaintexts in the database. It is not always practical to assume that users know all these plaintexts in advance, so a stateless scheme whose encryption algorithm can process single plaintexts on the fly is preferable. Moreover, [21] neither defines security nor provides any formal security analysis. Boldyreva et al. proposed an efficient OPE scheme and proved its security based on the pseudorandomness of an underlying block cipher [22]. Their construction is based on a natural relation between a random order-preserving function and the hypergeometric probability distribution. In this paper, OPE is applied to each field of the B-tree so that query processing can be done exactly as efficiently as for unencrypted data. The user can locate the desired ciphertext in a node without obtaining additional information, which satisfies data privacy.
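The order-preserving property can be illustrated with a minimal table-based sketch in the spirit of a one-part code. It is an assumption for illustration only: it is stateful, it uses Python's non-cryptographic `random` module, and it is not the scheme of [21] or [22]. The names `toy_ope_table`, `enc`, `dec`, and `child_index` are hypothetical.

```python
import random
from bisect import bisect_left, bisect_right


def toy_ope_table(key, M, N):
    """Choose M distinct ciphertexts from 1..N and sort them; entry m-1 encrypts m."""
    rng = random.Random(key)          # NOT cryptographically secure, illustration only
    return sorted(rng.sample(range(1, N + 1), M))


def enc(table, m):
    return table[m - 1]


def dec(table, c):
    return bisect_left(table, c) + 1


def child_index(node_ciphertexts, query_ciphertext):
    """Pick the child to follow using comparisons on ciphertexts only."""
    return bisect_right(node_ciphertexts, query_ciphertext)


table = toy_ope_table(2012, M=100, N=10_000)
node = [enc(table, m) for m in (10, 40, 70)]      # OPE-encrypted search keys of a node
print(child_index(node, enc(table, 55)))          # same branch as for the plaintext 55
```

Because `enc` is strictly increasing, the branch chosen from ciphertexts is the same one a plaintext comparison would choose, which is what lets a node be searched without decrypting its keys.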
Definition 2 (OPE) Let $SE = (K, Enc, Dec)$ be an order-preserving encryption scheme with plaintext space $[M]$ and ciphertext space $[N]$ for $M, N \in \mathbb{N}$ such that $2^{k-1} \leq N < 2^k$ for some $k \in \mathbb{N}$. Then there exists an IND-OCPA (indistinguishability under ordered chosen-plaintext attack) adversary $A$ against $SE$ such that
\[
\mathrm{Adv}^{\text{ind-ocpa}}_{SE}(A) \geq 1 - \frac{2k}{M - 1}.
\]
So, $k$ in the theorem should be almost as large as $M$ for $A$’s advantage to be small.
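To get a feel for this bound, the following worked numbers plug two parameter choices into the inequality. The values of $M$ and $k$ are illustrative assumptions, not parameters taken from the paper.

```latex
% Example 1: M = 2^{16} plaintexts, 32-bit ciphertexts (k = 32)
\mathrm{Adv}^{\text{ind-ocpa}}_{SE}(A) \;\geq\; 1 - \frac{2 \cdot 32}{2^{16} - 1}
  = 1 - \frac{64}{65535} \approx 0.9990

% Example 2: M = 65 plaintexts, k = 32
\mathrm{Adv}^{\text{ind-ocpa}}_{SE}(A) \;\geq\; 1 - \frac{2 \cdot 32}{65 - 1}
  = 1 - \frac{64}{64} = 0
```

In the first case the adversary wins almost always. The lower bound becomes vacuous only once $2k \geq M - 1$, that is, when the ciphertext length in bits is about half the number of plaintexts.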
3.1 Model
Fig. 2 illustrates the four primary entities of the DAS model: the Data Owner (DO), the user (U), the trusted front end (F), and the Database Service Provider (DSP). We assume that DO stores the encrypted database (DB) at the DSP and that the outsourced data allows a certain amount of query processing for U to occur at the DSP without jeopardizing privacy. Below, we propose a protocol that meets the security requirements of this DAS model with the help of F. Our assumption for this protocol is that F does not collude with DO, U, or DSP under any circumstances. Furthermore, F is usually the deputy of the DO and is responsible for query transformation. Once a user has registered to use the data owner’s service, F can send queries to the server on behalf of DO. We briefly describe the properties of our protocol below.
Fig. 2: Our model (entities: Data owner, F, User, and DB at the DSP; links labelled Trusted or Untrusted)
In our model, DO outsources information to the DSP and charges U for using the data. The outsourced information is valuable, so all of it should be encrypted to prevent analysis by the DSP and other intruders; we call this data confidentiality. Moreover, the user must not be allowed to obtain any information from DB beyond what she is querying; we call this data privacy. In addition, whenever the user accesses DB, she does not want DO or DSP to know exactly what she is interested in, neither the query nor its result; we call this user privacy.
3.2 Adversarial Model
There are three types of adversaries in our model:
1. A naive player (U or DSP), who obtains a copy of the encrypted data stored in the outsourced database and wants to infer some information from it.
2. A curious service provider, who wants to infer some information from a query or the response to a query.
3. A curious user, who wants to infer some information from the response to a query.
3.3 Storage Model
In this section we illustrate the storage model with a simple example. DSP uses a table for storing and maintaining data entries. The table stores encrypted data entries, each associated with a unique id. For example, consider a regular table that has 3 attributes, such as name, age, and salary. The encrypted table contains 2 columns, tid and etuple, where tid is the unique number of a tuple, usually assigned sequentially starting from 1, and etuple is the encrypted value of the plaintext tuple. In addition, the encrypted table for storing the B-tree consists of 2n + 2 attributes: nid, a unique number identifying the node in the B-tree, which is generated as a random number and therefore differently from tid; n search keys; and n + 1 pointers. Here n is a parameter associated with each B-tree index that determines the layout of all blocks of the B-tree. See Fig. 3. In more detail, plaintext data entries are encrypted by a general encryption algorithm, one tuple per row of the table, because encrypting by row is preferable to encrypting by field for queries from the TPC-H benchmark. The encrypted table for the B-tree supports the search functions with which the user can obtain the exact results without leakage of other information. Specifically, all pointers are encrypted once, with a key different from the one used to encrypt the plaintext entries, and all search keys are encrypted twice: first with OPE and then with the same encryption algorithm used for the pointers.
Fig 3: (a) Plaintext table and B-tree and (b) Encrypted table for data entries and B-tree
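The two table layouts of the storage model can be sketched as plain record types. This is only a layout sketch under stated assumptions: the class and field names beyond tid, etuple, and nid are hypothetical, and the byte strings are random placeholders standing in for ciphertexts, with no encryption performed.

```python
import os
import secrets
from dataclasses import dataclass
from typing import List


@dataclass
class DataRow:
    tid: int            # sequential tuple id, starting from 1
    etuple: bytes       # the whole plaintext tuple encrypted as one value


@dataclass
class NodeRow:
    nid: int                # random node id, generated independently of tid
    keys: List[bytes]       # n search keys: OPE first, then general encryption
    pointers: List[bytes]   # n + 1 pointers: general encryption only

    def attribute_count(self) -> int:
        return 1 + len(self.keys) + len(self.pointers)


def make_node_row(n: int) -> NodeRow:
    """Build a row of the B-tree table with placeholder ciphertexts."""
    return NodeRow(
        nid=secrets.randbits(32),
        keys=[os.urandom(16) for _ in range(n)],
        pointers=[os.urandom(16) for _ in range(n + 1)],
    )


row = make_node_row(3)
print(row.attribute_count())      # 2n + 2 attributes for n = 3
```

Keeping nid random rather than sequential avoids tying a node's identifier to the tid values of the data table.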
3.4 Operations
We now provide an informal description of our solution. Consider a database system D, a data owner DO, the database service provider DSP, a user U, and the trusted party F. Suppose the database $D$ consists of $m$ records $\{d_1, \cdots, d_m\}$, each of which contains $n$ attributes $\{a_1, \cdots, a_n\}$. For a record $d_i$, we use $id(d_i)$ to denote the identifying information that is uniquely associated with $d_i$, such as the value of the primary key. The DSP hosts not only the encrypted version of $D$, denoted by $D' = \{d'_1, \cdots, d'_m\}$, where $d'_i = \langle id(d_i), E(d_i) \rangle$ and $E(d_i)$ is an encryption of $d_i$, but also an encrypted B-tree, denoted by $BTree'$, which is constructed on each attribute $a_j$ ($j \in \{1, \cdots, n\}$). $Pre$ is the predicate expression of the query; its value is TRUE if the predicates are satisfied and FALSE otherwise.
Definition 3 A privacy-preserving query on outsourced database with B-tree consists of the following probabilistic polynomial-time algorithms and protocols:
1. $KeyGen(1^s)$ outputs public and private keys: $(A_{public}, A_{private})$ for the encryption and decryption of data entries, $(B_{public}, B_{private})$ for the encryption and decryption of the B-tree, and $C_{private}$ for the OPE algorithm.
2. $Store_{DO,DSP,F}(D, BTree, A_{public}, B_{public}, C_{private})$ is a protocol that allows DO to send to DSP $D'$, the encryption of $D$ under $A_{public}$, together with the associated $BTree'$ for each attribute, which is the encryption of the B-tree under $B_{public}$ and $C_{private}$. $C_{private}$ is held only by F.
3. $Query_{U,DSP,F}(Pre, A_{private}, B_{private})$ is a protocol that retrieves all records satisfying $Pre$ for U. $Pre$, $A_{private}$, and $B_{private}$ are held only by U.
3.5 Security Properties
First, we define the security of database encryption.
Definition 4 (Security of database encryption [4]) An encryption scheme $(Gen, Enc, Dec)$ for database tables, which consists of a key generation scheme $Gen$, an encryption function $Enc$, and a decryption function $Dec$, has indistinguishable encryptions if for every polynomial-size circuit family $\{C_n\}$, every polynomial $p$, all sufficiently large $n$, and every pair of databases $R_1, R_2 \in \{0, 1\}^{poly(n)}$ with the same schema and the same number of tuples (i.e., $|R_1| = |R_2|$):
\[
\big| \Pr\{C_n(Enc_{Gen(1^n)}(R_1)) = 1\} - \Pr\{C_n(Enc_{Gen(1^n)}(R_2)) = 1\} \big| < \frac{1}{p(n)}.
\]
The probability in the above terms is over the internal coin tosses of $Gen$ and $Enc$.
Next, we describe correctness and privacy for such a system.
Definition 5 (Query Correctness) Let $A_{public}, A_{private}, B_{public}, B_{private} \leftarrow KeyGen(1^s)$. Fix a finite sequence of messages and indexes: $\{\{d'_i\}_{i=1}^{m}, \{BTree'_j\}_{j=1}^{n}\}$. Suppose that, for all $i \in [m]$ and $j \in [n]$, the protocol $Store_{DO,DSP,F}(D, BTree, A_{public}, B_{public}, C_{private})$ is executed by DO, DSP and F. Denote by $R_{Pre}$ the results that U receives after the execution of $Query_{U,DSP,F}(Pre, A_{private}, B_{private})$. Then, a privacy-preserving query on outsourced database with B-tree is said to be correct on the sequence $\{\{d'_i\}_{i=1}^{m}, \{BTree'_j\}_{j=1}^{n}\}$ if
\[
\Pr\big[R_{Pre(a_w)} = \{d'_i \mid Pre(d_i.a_w) = \mathrm{TRUE}\}\big] > 1 - neg(1^s)
\]
for each predicate, where the probability is taken over all internal randomness used in the protocols Store and Query. A privacy-preserving query on outsourced database with B-tree is said to be correct if it is correct on all such finite sequences.
DO’s privacy has two aspects: first, all stored data should be indistinguishable to the DSP; second, the user cannot learn any information other than the results of her query.
Definition 6 For DO’s privacy to DSP, consider the following game between an adversary $A$ and a challenger $C$. $A$ plays the role of DSP and $C$ plays the role of DO. The game consists of the following steps:
1. $KeyGen(1^s)$ is executed by $C$, who sends the outputs $A_{public}$ and $B_{public}$ to $A$.
2. $A$ asks queries of the form $(D, BTree)$, where $D$ is the plaintext database and $BTree$ is the plaintext index on $D$; $C$ answers by executing the protocol $Store(D, BTree, A_{public}, B_{public}, C_{private})$.
3. $A$ chooses two pairs $(D_0, BTree_0)$, $(D_1, BTree_1)$ and sends them to $C$, where $D_0$ and $D_1$ are of equal size, and $BTree_0$ and $BTree_1$ are of equal size.
4. $C$ picks a random bit $b \in_R \{0, 1\}$ and executes $Store(D_b, BTree_b, A_{public}, B_{public}, C_{private})$ with $A$.
5. $A$ asks more queries of the form $(D, BTree)$ and $C$ responds by executing the protocol $Store(D_b, BTree_b, A_{public}, B_{public}, C_{private})$ with $A$.
6. $A$ outputs a bit $b' \in \{0, 1\}$.
We define the adversary’s advantage as $Adv_A(1^s) = |\Pr[b = b'] - \frac{1}{2}|$. We say that a privacy-preserving query on outsourced database achieves DO’s privacy to DSP if, for all $A \in$ PPT, $Adv_A(1^s)$ is a negligible function.
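The challenge phase of such a game, and the way an advantage is estimated, can be sketched as a small simulation. Everything here is a simplifying assumption: the oracle-query phases (steps 2 and 5) are omitted, databases are plain lists of integers, and `leaky_store`, `opaque_store`, and `MinAdversary` are hypothetical stand-ins rather than parts of the construction.

```python
import random


def play_game(store, adversary, rng):
    """One challenge round; returns True if the adversary guesses the bit b."""
    d0, d1 = adversary.choose()
    assert len(d0) == len(d1)            # challenge databases must have equal size
    b = rng.randrange(2)                 # challenger's hidden bit
    view = store(d1 if b else d0)        # what the adversary (as DSP) receives
    return adversary.guess(view) == b


def estimate_advantage(store, adversary, trials=2000, seed=0):
    """Empirical |Pr[b = b'] - 1/2| over independent rounds."""
    rng = random.Random(seed)
    wins = sum(play_game(store, adversary, rng) for _ in range(trials))
    return abs(wins / trials - 0.5)


def leaky_store(db):
    return min(db)          # hypothetical flawed scheme: reveals the smallest record


def opaque_store(db):
    return len(db)          # hypothetical ideal view: only the (equal) size is visible


class MinAdversary:
    def choose(self):
        return [1, 5, 9], [2, 5, 9]      # two databases of equal size

    def guess(self, view):
        return 1 if view == 2 else 0


print(estimate_advantage(leaky_store, MinAdversary()))
print(estimate_advantage(opaque_store, MinAdversary()))
```

Against the leaky stand-in the adversary always wins, so the estimate sits at the maximum of 1/2; against the opaque stand-in its guess is independent of $b$, so the estimate stays near 0.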
Definition 7 For DO’s privacy to U, consider the following game between an adversary $A$ and a challenger $C$. $A$ plays the role of U and $C$ plays the role of DO. The game consists of the following steps:
1. $KeyGen(1^s)$ is executed by $C$, who sends the outputs $A_{public}$, $B_{public}$ to $A$.
2. $A$ asks queries of the form $(D, BTree)$, where $D$ is the plaintext database and $BTree$ is the plaintext index on $D$; $C$ answers by executing the protocol $Store(D, BTree, A_{public}, B_{public}, C_{private})$.
3. $A$ chooses two pairs $(D_0, BTree_0)$, $(D_1, BTree_1)$ and sends them to $C$, where the two databases are of equal size and the two B-trees are of equal size.
4. $C$ picks a random bit $b \in_R \{0, 1\}$ and executes $Store(D_b, BTree_b, A_{public}, B_{public})$ with $A$.
5. $A$ asks queries of the form $Pre$, where the predicate is on a certain attribute; $C$ answers by executing the protocol $Query(Pre, A_{private}, B_{private})$ with $A$.
6. $A$ asks more queries $Pre$ and $C$ responds by executing the protocol $Query(Pre, A_{private}, B_{private})$ with $A$.
7. $A$ outputs a bit $b' \in \{0, 1\}$.
We define the adversary’s advantage as $Adv_A(1^s) = |\Pr[b = b'] - \frac{1}{2}|$. We say that a privacy-preserving query on outsourced database achieves DO’s privacy to U if, for all $A \in$ PPT, $Adv_A(1^s)$ is a negligible function.
Definition 8 For query privacy, consider the following game between an adversary $A$ and a challenger $C$. $A$ plays the role of DSP, and $C$ plays the role of U. The game proceeds as follows:
1. $KeyGen(1^s)$ is executed by $C$, who sends the outputs $A_{public}$, $B_{public}$ to $A$.
2. $A$ asks queries of the form $Pre$, where the predicate is on a certain attribute; $C$ answers by executing the protocol $Query(Pre, A_{private}, B_{private})$ with $A$.
3. $A$ chooses two predicates $Pre(a_0)$, $Pre(a_1)$ and sends them to $C$. $a_0$ and $a_1$ are both attributes of $D$.
4. $C$ picks a random bit $b \in_R \{0, 1\}$ and executes the protocol $Query(Pre(a_b), A_{private}, B_{private})$ with $A$.
5. $A$ asks more queries $Pre$ and $C$ responds by executing the protocol $Query(Pre, A_{private}, B_{private})$ with $A$.
6. $A$ outputs a bit $b' \in \{0, 1\}$.
We define the adversary’s advantage as $Adv_A(1^s) = |\Pr[b = b'] - \frac{1}{2}|$. We say that a privacy-preserving query on outsourced database achieves query privacy if, for all $A \in$ PPT, $Adv_A(1^s)$ is a negligible function.
We present a construction of a privacy-preserving query on outsourced database in a “semi-honest” model. In our context, the term “semi-honest” refers to a party that correctly executes the protocol but may collect information during the protocol’s execution. Correctness and privacy will be proved under a computational assumption. We assume the outsourced data are encrypted by a semantically secure public-key encryption scheme satisfying Definition 4. The key generation and encryption algorithms are denoted by $K$ and $E$, respectively. We define the required algorithms below. First, let us recall our assumptions about the parties involved: DO, U, DSP and F. In general, there could be many data owners but, for the purpose of describing the protocol, we need only name one. DO is assumed to hold the data, the B-tree, and the public key. U holds the private key and submits queries to the database. DSP stores the encrypted data and B-tree and provides the search service. F is the deputy of DO and assists in the execution of the user’s query.
1. $KeyGen(s)$: Run $K(1^s)$, the key generation algorithm of the underlying cryptosystem, to create public and private keys: $(A_{public}, A_{private})$ for the encryption and decryption of data entries, $(B_{public}, B_{private})$ for the encryption and decryption of the B-tree, and $C_{private}$ for the search keys of nodes in the B-tree. Private and public parameters for a PIR scheme are also generated by this algorithm.
2. $Store_{DO,DSP,F}(D, BTree, A_{public}, B_{public}, C_{private})$: DO sends the encrypted database and indexes to the DSP. The protocol consists of the following steps:
(a) DO sends the encrypted version of the database $D' = \{(id_i, E_{A_{public}}(d_i))\}_{i=1}^{m}$ and its $BTrees' = \{BTrees'_j\}_{j=1}^{n}$ to DSP. Specifically, all pointers are encrypted once using $B_{public}$, and all search keys are encrypted twice: first using $C_{private}$ for OPE, and then using $B_{public}$ for a general encryption algorithm.