KEMBAR78
Codaspy 2010 | PDF | Databases | Matrix (Mathematics)
0% found this document useful (0 votes)
6 views12 pages

Codaspy 2010

Uploaded by

kifega9328
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views12 pages

Codaspy 2010

Uploaded by

kifega9328
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

RASP: Efficient Multidimensional Range Query on

Attack-Resilient Encrypted Databases

Keke Chen, Ramakanth Kavuluru, Shumin Guo


Department of Computer Science and Engineering
Wright State University
Dayton, Ohio 45435, USA
{keke.chen,ramakanth.kavuluru, guo.18}@wright.edu

ABSTRACT 1. INTRODUCTION
Range query is one of the most frequently used queries for online With the increasing popularity of web-based applications and
data analytics. Providing such a query service could be expensive the support from widely available cloud infrastructures, service-
for the data owner. With the development of services computing based computing has become a major computing paradigm. Ser-
and cloud computing, it has become possible to outsource large vice providers take advantage of low cost cloud infrastructures,
databases to database service providers and let the providers main- while service users enjoy convenient services without worrying about
tain the range-query service. With outsourced services, the data the cost of maintaining hardware and software. On the other hand,
owner can greatly reduce the cost in maintaining computing infras- large datasets have been collected, stored, and analyzed in business
tructure and data-rich applications. However, the service provider, intelligence and scientific computing for several years. It was re-
although honestly processing queries, may be curious about the ported that maintaining data and supporting query-based services
hosted data and received queries. Most existing encryption based incur much higher cost than initial data acquisition [12]. An ap-
approaches require linear scan over the entire database, which is pealing solution is to delegate data services to a service provider,
inappropriate for online data analytics on large databases. While a which, however, raises the question: how to protect the private in-
few encryption solutions are more focused on efficiency side, they formation in the outsourced data, considering the service provider
are vulnerable to attackers equipped with certain prior knowledge. might be curious about the data.
We propose the Random Space Encryption (RASP) approach that Range query is the most frequently used query in online data ana-
allows efficient range search with stronger attack resilience than ex- lytics (OLAP) that requires the service provider to quickly respond
isting efficiency-focused approaches. We use RASP to generate in- to concurrent user queries. To efficiently process range queries, in-
dexable auxiliary data that is resilient to prior knowledge enhanced dexing is a necessary step. However, most existing encryption ap-
attacks. Range queries are securely transformed to the encrypted proaches [30, 4, 5, 29] require linear scan over the entire database,
data space and then efficiently processed with a two-stage process- thus, impractical for OLAP. Fully homomorphic encryption [13] in
ing algorithm. We thoroughly studied the potential attacks on the theory allows any operation on encrypted data that can be traced
encrypted data and queries at three different levels of prior knowl- back to an equivalent operation on the corresponding plaintexts.
edge available to an attacker. Experimental results on synthetic and However, as the author of [13] mentioned, this is still too expensive
real datasets show that this encryption approach allows efficient to be practical even for a simple application like encrypted keyword
processing of range queries with high resilience to attacks. search.
Several methods that consider different tradeoffs between data
Categories and Subject Descriptors security and efficiency of query processing were proposed in the
recent years. Both Crypto-index [20, 21] and order-preserving en-
H.2.0 [Database Management]: General—Security, integrity, and cryption (OPE) [1, 3] assume the attacker does not have sufficient
protection; E.3 [Data Encryption] prior knowledge about the data; thus powerful attacks cannot be
conducted. Specifically, they assume the attacker knows only the
General Terms ciphertext. However, we have found that if the attacker has some
prior knowledge, such as the attribute domains (maximum and min-
Security, Algorithms
imum values), the attribute distributions, and even a few pairs of
plaintext and ciphertext, these encryption methods will be vulner-
Keywords able to attacks. Therefore, although they can allow the service
Multidimensional Range Query, Random Space Encryption, Attack provider to build indices on encrypted data and perform efficient
Analysis, Outsourced Databases query processing, they can only be applied to very restricted ap-
plications. Wang and Lakshmanan [31] use OPE in querying en-
crypted XML database and address the prior-knowledge enhanced
attacks on OPE with duplicated fake index entries that point to the
Permission to make digital or hard copies of all or part of this work for same data item in the encrypted data block. However, their ap-
personal or classroom use is granted without fee provided that copies are proach requires the data owner to build indices for the server, which
not made or distributed for profit or commercial advantage and that copies
is expensive and not convenient when the database is large and fre-
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific quently updated.
permission and/or a fee.
CODASPY’11, February 21–23, 2011, San Antonio, Texas, USA.
Copyright 2011 ACM 978-1-4503-0465-8/11/02 ...$10.00.
Challenge: Therefore, the challenge is to provide an encryption cases are used to represent scalars. Each column is defined on a
scheme that allows efficient, index based query processing, and is domain. For categorical domain, we use integers to represent the
also resilient to prior knowledge enhanced attacks on both data and categorical values. A table is also represented as a k×n matrix, no-
queries. Our goal is to develop such an encryption scheme. tated with capital characters. In the following, we briefly describe
the definition of range queries and the importance of indexability
1.1 Our Contributions and Scope of Research to the performance of query processing.
We propose the RAndom SPace encryption (RASP) approach
for efficient range query processing on encrypted data. We assume Range Queries: Range query is an important type of query for
the outsourced data are multidimensional data and thus the data many data analytic tasks from simple aggregation to more sophisti-
records can be treated as vectors (or points) in the multidimen- cated machine learning tasks. Let T be a table and Xi , Xj , and Xk
sional space. The RASP method randomly transforms the multidi- be the real valued attributes in T , and a and b be some constants.
mensional space, while preserving the convexity of datasets, which Take the counting query for example. A typical SQL-style range
allows indexing and query processing with the encrypted multidi- query looks like
mensional data. The framework assumes a secure proxy server at
the client side that handles data encryption/decryption and query select count(*) from T
encryption. The data owner and authorized users submit the orig- where Xi ∈ [ai , bi ] and Xj ∈ (aj , bj ) and Xk = ak ,
inal data and queries to the proxy server; the proxy server then
sends the encrypted data/queries to the service provider. The ser- which calculates the number of records in the range defined by
vice provider is able to index the encrypted data and use it to effi- conditions on Xi , Xj , and Xk . Range queries may be applied
ciently process encrypted queries. to arbitrary number of attributes and conditions on these attributes
Our approach has several important features: (1) The RASP ap- combined with conditional operators “and”/“or”. We call each part
proach uses a random space transformation method that allows the of the query condition that involves only one attribute as a simple
service provider to build indices and process queries with multi- condition. A simple condition like Xi ∈ [ai , bi ] can be described
dimensional indices. With the support of indices, the proposed with two half space conditions Xi ≤ bi and −Xi ≤ −ai . Without
two-stage query processing algorithm can achieve much better per- loss of generality, we will discuss how to process half space condi-
formance than linear scan. (2) The existing indexable encryption tions like Xi ≤ bi in this paper. A slight modification will extend
schemes hold strong assumptions on attacker’s lack of prior knowl- the discussed algorithms to handle other conditions like Xi < bi
edge on the data; thus they are vulnerable to many attacks enhanced and Xi = bi .
with prior knowledge. Our work categorizes attacker’s prior knowl-
edge into three levels and the proposed schemes are resilient to 3. RANDOM SPACE ENCRYPTION
these knowledge-enhanced attacks. To increase resilience against
In this section, we propose the basic Random Space Encryp-
known plain text attacks we use a straightforward composition of
tion (RASP) approach for secure range query processing on the
Agarwal et al.’s OPE [1] with RASP. (3) Attacks based on queries
encrypted outsourced data. First, we give the system framework
were rarely discussed in existing schemes. We show that with prior
and assumptions held for the attack models. Second, we present
knowledge, attacks on queries can seriously undermine the encryp-
the definition of the basic random space encryption method and
tion. We design certain methods to enhance the resilience to the
distinguish it from order preserving encryption methods. Finally,
query-based attacks. (4) Some approaches, such as crypto-index,
we describe how to generate outsourced data and answer queries
may return a lot of encrypted records irrelevant to the query and
with the encrypted data.
burden the client side to filter out these irrelevant records. Our ap-
proach always returns the exact result to the client, eliminating the 3.1 System Framework and Attack Models
unnecessary additional costs.
System Framework. We assume the outsourced data are multi-
We also conduct a number of experiments on synthetic and real
dimensional data and thus the data records can be treated as vec-
datasets to evaluate the performance and the attack resilience. The
tors (or points) in the multidimensional space. Figure 1 shows the
experimental results show that the proposed method is efficient and
framework for processing range query services on outsourced data.
resilient to the knowledge enhanced attacks.
In the client side, the data owner has all rights to upload/query
The rest of this paper is organized as follows. Section 2 briefly
data, and may also grant the query right to the trusted users. The
describes range queries and the privacy problems with outsourced
proxy server receives original data and queries, encrypts and sub-
databases. Section 3 gives the definition of random space perturba-
mits them, and decrypts the query results. It keeps the security
tion. In section 4 we present the algorithms for query transforma-
key, the encryption functions ET (), EQ (), the decryption function
tion on the client side and efficient query processing on the server
D(), and controls the access rights. The traffic between the proxy
side. In Section 5, we formally analyze the security of the scheme,
server and the service provider contains only the encrypted data
describe various attacks and discuss the resilience of our scheme
and queries. Although the proxy server does not handle the large
to these attacks. The algorithms outlined in Section 4 and Section
dataset and process queries, it might still become a bottleneck for
5 are summarized at the end of Section 5. The cost of encryption,
a large number of users and frequent query submissions. However,
the resilience to attacks, and the efficiency of query processing are
the cost to scale the proxy server should be much lower than that to
further evaluated through extensive experiments in Section 6.
host the entire query processing service.
This framework includes several key components. (1) Encrypted
2. PRELIMINARIES auxiliary data generation. This approach will generate the auxil-
First, we establish the notation used in this paper. A database iary data encrypted with the proposed scheme for indexing purpose
table consists of n records and k searchable attributes. We also fre- through the encryption function ET () in Figure 1. It applies a type
quently refer to an attribute as a dimension or column. These three of multiplicative perturbation [9, 27] on the searchable attributes in
names are exchangeable in our context. Each record can be repre- the original database to generate the auxiliary data. The goal is to
sented as a vector, and notated with bold lower cases, while lower keep the topology of original data vectors in the auxiliary data but

 Service provider user may use queries to probe the encrypted database. The prior
 $
knowledge based attacks have also been used in attack analysis for
    privacy preserving data mining [2, 22, 9] and kNN queries [33] on
!" #
   outsourced data. We will study attacks under these three different
Data Y
levels after we present the basic encryption methods.
% & % Attacks
   
  3.2 Definition of Random Space Encryption
Index on Y
Random Space Encryption (RASP) is one type of multiplicative
Trust Parties Untrusted party perturbation [8], with relaxed constraints on the encryption param-
eters. Let’s consider the multidimensional data are numeric and
Figure 1: A framework for hosting range query services. in multidimensional vector space − For categorical attributes, we
use a simple mapping of integers to categorical values to convert
them to integers1 . Assume the database has k searchable dimen-
obscure the original data values so that they cannot be possibly in- sions and n records, i.e., a k × n matrix X. Let x represent a
ferred from the auxiliary data. (2) Query Encryption. A submitted k-dimensional record. Note that in the k-dimensional vector space,
query should also be appropriately transformed so that the server the range query conditions are represented as half space conditions
can use the index on the encrypted auxiliary data to process the and a range query is correspondingly translated to finding the point
query. This query transformation should be secure, not reveal any set in a hyper-cube [6]. The RASP encryption involves two steps.
information that helps curious service providers breach privacy. We For each k-dimensional input vector x,
denote it as the EQ () function. (3) Server side indexing and query
processing. The service provider is able to build multidimensional 1. the vector is first extended to k+2 dimensions as (xT , 1, v)T ,
index on the auxiliary data. However, processing the transformed where the (k + 1)-th dimension is always a 1 and v is drawn
queries requires algorithms different from the existing ones. Our from (0, α] using a random number generator Rα , with some
framework also includes the algorithms for query processing. private α and distribution 2 .
2. After this extension the (k + 2)-dimensional vector is then
Attack Models. In our framework, we study the attack models
subjected to the transformation
based on the popular honest-but-curious service provider assump-
 
tion. We assume the service provider will honestly provide the ser- x
vices and perform the computations following the protocol (e.g., ET (x, K = {A, Rv }) = A  1  , (1)
public cloud providers). However, the provider might be curious v
about the data and the users’ queries. Also, we assume that the at-
tacker knows the algorithms used to encrypt data and queries (i.e., where A is a (k + 2) × (k + 2) randomly generated invertible
the algorithms for ET (), EQ ()). Active attackers will also try to matrix (see Appendix for matrix generation) with aij ∈ R
obtain as much prior knowledge as possible to help break the en- such that there are at least two non-zero values in each row
cryption, or estimate the encrypted data and queries. To better eval- of A and the last column of A is also non-zero.
uate the strength of an encryption scheme, we categorize attacks
Note that A is shared by all vectors in the database, but v is ran-
into different levels based on the prior knowledge the attacker may
domly generated with the random number generator Rα for each
have.
individual vector. Note that only A is needed in the trusted proxy
• Level 1: the attacker observes only the encrypted data and server for decryption (and hence forms the key) - we don’t need
queries, without any additional knowledge. This corresponds to keep v in the proxy server. The design of extended data vec-
to the ciphertext-only attack (COA) in cryptography [15]. tor (xT , 1, v)T is to address the query-based attacks: the k + 1
dimension is a homogeneous dimension for hiding the query con-
• Level 2: Apart from the encrypted data, the attacker also tent; the k + 2 dimension is used to counter the inherent linearity in
knows some dimensional distribution information about the transforming the queries. The rationale behind different aspects of
original data, including the attribute domain (the maximum this encryption will be discussed clearly in later sections. Also, the
and minimum values) and distribution (e.g., the Probability structure of A will be slightly changed to withstand known plain
Density Function (PDF) or histogram). In practice, for some text attacks in Section 5.
applications such as hosting query services for census data, RASP has two important features. First, we want to show that
the dimensional distribution might be known by the public. RASP does not preserve the ordering of dimensional values, which
distinguishes itself from order preserving encryption schemes, and
• Level 3: In addition to Level 2 knowledge, the attacker ob-
thus does not suffer from the bucket-based attack (details in Section
serves a small set of plaintext tuples X and the corresponding
7). Second, we show that RASP is convexity preserving, which al-
encrypted tuples Y in the outsourced data. This corresponds
lows range queries on the encrypted data.
to the known-plaintext attack (KPA) in cryptography. In our
special context, the attacker may also observe a small num-
RASP is Not Order Preserving. In the following, we show that
ber of plain queries and the corresponding encrypted queries.
RASP is not order preserving; thus the attacks on OPE schemes
The three levels also correspond to the difficulty level of obtain- cannot be applied to RASP. An OPE scheme maps a set of dimen-
ing the required prior knowledge. Both Level 2 and Level 3 knowl- sional values to another, while keeping the value order unchanged.
edge are difficult to obtain and it is possible only when the attacker, 1
For a categorical attribute Xi , the values {c1 , . . . , cm } in the do-
e.g., the curious service provider, resorts to social engineering or main are mapped to {1, . . . , m}. A query condition Xi = cj , is
gains temporary unauthorized access to some user accounts at the converted to j − δ ≤ Xi ≤ j + δ, where δ ∈ (0, 1)
data owner location. For example, the related database applications 2
We used α = 1 and uniform random distribution for the experi-
may reveal some knowledge about the domain; the compromised ments
Let x be any record in the dataset, and f i be the selection vector Hi can be represented as wiT x ≤ ai (“=” for the closed set), and
(0, . . . , 1, . . . , 0) i.e., only the i-th dimension is 1 and other di- wi ∈ Rk and ai ∈ R are parameters for the halfspace. By re-
mensions are 0. For simplicity we use unextended vectors − the placing x with a column vector z = (xT , 1, v)T and wi with
extended dimensions are not related to this discussion and can be ui = (wiT , −ai , 0)T , the set enclosed by Hi is transformed to
safely removed. Then, (f i )T x will return the value at dimension i the set enclosed by the halfspace Hiext : uTi z ≤ 0. With the RASP
of x. function, we have y = Az, and thus this halfspace Hiext can be
further transformed to Hi′ as follows
P ROPOSITION 1. Let A be an invertible matrix with at least
two non-zero entries in each row. For any vector s, let s′ = As. uTi A−1 y ≤ 0. (3)
Then for any i ∈ {1, . . . , k} there exist vectors x, y for which A Each of the halfspace conditions, Hi′ ,
in the transformed space rep-
preserves the order of dimension i (that is, (xi − yi )(x′i − yi′ ) > resents a convex set. Thus, the intersection of them is convex as
0) and there exist vectors u, v for which A reverses the order of well. Therefore, the RASP encryption is convexity preserving.
dimension i. That is, RASP does not preserve order for arbitrary
input vector pairs. Since a range query defines a convex set, the transformation
method (Eq. 3) gives a basic method for transforming the range
P ROOF. Using the same dimensional selection vector, we have
query for the RASP encrypted data - we name it RASP query trans-
s′i = (f i )T As and t′i = (f i )T At. Thus, we get
formation method. The following proposition shows that by search-
(si − ti )(s′i − t′i ) = (si − ti )(f i )T A(s − t) ing with the transformed conditions in the encrypted data space, we
k
can get the exact set of points that is the image of the query result
= (si − ti )
X
ai,j (sj − tj ), (2) in the original data space.
j=1 P ROPOSITION 4. Let Hi and Hi′ be halfspaces defined as in the
where ai,j is the i-th row j-th column element of A. Without loss proof of Proposition 3. The RASP query transformation
T uniquely
of generality, let’s assume si > ti (for si < ti the same proof maps the convex setTS enclosed by halfspaces Hi to the convex
applies). It is straightforward to see that given the fixed values of set S ′ enclosed by Hi′ .
A, the values of sj and tj for all j 6= i can be chosen so that It is straightforward to show that by using the RASP query trans-
(si − ti ) kj=1 ai,j (sj − tj ) is either negative or positive. Note
P
formation, any point in S cannot be mapped to a point outside S ′
that since each row of A has at least two non-zero entries, even if and any point not in S cannot be mapped to a point in S ′ . So we
ai,i (si − ti )2 > 0 (or < 0), using the other non-zero value in the skip the details of the proof.
i-th row of A, say ai,k , the sign of (si − ti ) kj=1 ai,j (sj − tj ) can
P
Note that duplicate records in the original set might be mapped
be adjusted to either positive or negative by appropriately choosing to different records in the encrypted space due to the randomly gen-
the values sk and tk . erated additional dimension k +2. However, this query transforma-
tion method guarantees all of such records are still exactly found in
RASP is Convexity Preserving. Let’s treat data records as points the encrypted space. This proposition forms the basis for the pro-
in a real multidimensional space. In the following, we will show posed query processing strategy, which will be discussed in details.
that although RASP does not preserve ordering, it preserves con-
vexity, which forms the basis of our query processing strategy. The
following definitions of convex set and convexity preserving func-
Attributes
tion can be found in most textbooks on convex optimization, e.g.,
Original record R1,… , Rm Rm+1,…, Rn
[6].
G(<R1 ,…, R m>) Encrypt(<R 1,…, Rn>)
D EFINITION 1. A set S is a convex set, if and only if for ∀x1 ,
Record stored at Encrypted/
x2 ∈ S, and ∀θ ∈ [0, 1], R’1,… , R’m compressed block
server side
θx1 + (1 − θ)x2 ∈ S
Figure 2: The records stored on the server.
D EFINITION 2. A convexity preserving function E() preserves
the convexity of sets. Concretely, if S is a convex set in the orig-
Generating Auxiliary Vectors for Outsourcing. With the RASP
inal data space, the function E() always transforms S to another
encryption function, we generate the outsourced data as follows.
convex set E(S).
First, we normalize each attribute to avoid the attacks based on the
The following proposition cited from [6] is critical to the proof of value ranges. The normalization process is briefly described as fol-
convexity preserving property of RASP encryption and our query lows. Let the mean of the attribute distribution be µi and the vari-
processing algorithm. ance be σi2 . For any value x of the i-th attribute, the transformation
(x − µi )/σi is applied. For the sake of simplifying presentation,
P ROPOSITION 2. (1) Every we assume the data columns are already normalized - when we say
Tconvex set is a (possibly infinite) in-
tersection of halfspaces, i.e., Hi , where Hi defines a halfspace; the vector x we mean it is the normalized version. Second, we
and (2) the intersection of (possibly infinite) convex sets is also con- assign an unintelligible name to each attribute, e.g. “X1 ” for the
vex. first attribute. Finally, Eq. 1 is applied to the searchable dimen-
sions x to generate the encrypted auxiliary record y. y and the
With this proposition we can prove that original record that is compressed and encrypted with any exist-
ing methods are used for outsourcing (as shown in Figure 2). The
P ROPOSITION 3. RASP encryption is convexity preserving.
service provider may build indices or perform linear scan on the
P ROOF. We assume an original convexTset in Rk (that is closed) auxiliary data vectors to answer queries. The cost for generating an
is the intersection of a set of halfspaces Hi , where a halfspace outsourced record consists of one RASP encryption perturbation
(O(k2 ) multiplications; k is the number of searchable dimensions) y be the auxiliary vector, i.e., y = Az. The original condition is
and the compression/encryption operations applied to the whole transformed to the form of Eq. 3 in the encrypted space.
record. Here note that the RASP transformation is applied only As Proposition 4 shows, searching with the transformed queries
to those attributes that are actually queried. Thus like mentioned on the auxiliary vectors is equivalent to searching with the original
earlier, conventional encryption can be used on compressed values queries and data. Note that this simple query transformation is vul-
of attributes that will not be queried. nerable to attacks as shown in Section 5.2; we will eventually use
slightly different query transformation method. (In the Appendix,
we also give the details of the transformed query.) Next, we show
4. EFFICIENT RANGE QUERY PROCESS- how to efficiently process these transformed queries.
ING WITH RASP 4.2 A Two-Stage Query Processing Strategy
We have shown that the RASP encryption is convexity preserv- with Multidimensional Index Tree
ing. This result is closely related to how a query can be transformed With the transformed queries, the first important task is to pro-
and processed. A range query can be represented as a convex set cess queries efficiently. A commonly used method is to use tree
query. Thus, in the encrypted space there is a unique convex set indices to improve the search performance. However, multidimen-
that is the answer to the query. However, there are challenges in (1) sional tree indices are normally used to process axis-aligned “bound-
efficiently processing it, and (2) making sure query processing does ing boxes”; whereas, the transformed queries are in arbitrary con-
not reveal significant information about the encryption key and the vex shape, not necessarily aligned to axes. In this section, we pro-
original data. One may already notice that the simple query trans- pose a two stage query processing strategy to handle such irregular
formation method described in this section is vulnerable to attacks. shape queries in the encrypted space. First, we briefly introduce the
However, in this section, we will focus on the first challenge. It query processing algorithm based on multidimensional index trees.
will be revisited and significantly improved in security analysis in Then, we describe the two stage processing algorithm.
Section 5.
In the encrypted space, a simple dimensional condition in the Multidimensional Index Tree. Most multidimensional indexing
original space is transformed to a general halfspace condition (as algorithms are derived from R-tree like algorithms [19], where the
Figure 4 shows). It would be straightforward to scan each auxiliary minimum bounding region (MBR) is the construction block for the
vector with the transformed conditions and return the result. We multidimensional data. For 2D data, an MBR is a rectangle. For
want to explore more efficient index-based processing methods in higher dimensions, the concept of a MBR is extended to a hyper-
this section. The normal processing strategies are based on mul- cube. Figure 3 shows the MBRs in the R-tree for a 2D dataset,
tidimensional index trees, such as R-Tree [28], that handles axis- where each node is bounded by a node MBR.
aligned minimum bounding boxes (MBR). If we still depend on
multidimensional tree indexing to process the transformed queries,
the processing algorithm should be slightly modified to handle ar- c root
bitrary convex areas, the boundaries of which are not necessarily a d
Stage1:
Bounding

axis-aligned. We will start with the method of query transforma- b box

Original space Transformed space


tion, briefly discuss the normal range query processing algorithms a b c d

using multidimensional indices, and then present the proposed so-


lution for processing the transformed queries. Figure 4: Illustration of the
Figure 3: R-tree index. two-stage processing algo-
rithm.
4.1 Query Transformation
Since the auxiliary vectors are in the encrypted space, to query on Range query processing with a multidimensional indexing tree
this space, range queries should also be appropriately transformed. can be described as follows. The conjunction of a set of simple
We have mentioned that the transformation method used in proving range conditions can be represented as a query MBR. The goal is
Proposition 3 can be used for query transformation. In this section, to find the MBRs in the tree that are contained by or intersected
we discuss how to transform an original range query into the en- with the search MBR. If the query MBR contains a node MBR,
crypted space in details. all points in the subtree should be included in the query result. If
First, let’s look at the general form of a range query condition. the query MBR intersects a node MBR, further checking should be
Let Xi be an attribute in the database. A simple condition in a performed for the children nodes. If the query MBR intersects a
range query involves only one attribute and is of the form “Xi leaf MBR, each point included by the leaf node should be checked
<op> ai ”, where ai is a constant in the normalized domain of Xi and only those inside the query MBR are selected.
and op ∈ {<, >, =, ≤, ≥, 6=} is a comparison operator. For con-
venience we will only discuss how to process Xi ≤ ai , while the The Two-Stage Processing Algorithm. The transformed query
proposed method can be slightly changed for other conditions. Any describes an irregular convex shape that cannot be directly pro-
complicated range query can S be transformed into the disjunction of cessed by multidimensional tree algorithms. New tree search al-
a set of conjunctions, i.e., n
Tm
j=1 ( i=1 C i,j ), where m, n are some gorithms can be designed to use arbitrary polyhedron conditions,
integers depending on the original query conditions and Ci,j is a i.e., the transformed query, directly for search. However, we use a
simple condition about Xi . Again, to simplify the presentation we simpler two-stage solution that keeps the existing tree search algo-
restrict our discussion to single conjunction condition ∩m i=1 Ci . A rithms unchanged.
simple condition Xi ≤ ai is a halfspace condition. Following the At the first stage, the proxy in the client side finds the MBR of
previous discussion, Xi ≤ ai is converted to the extended vector polyhedron (as a part of the submitted transformed query) and the
representation first: uT z ≤ 0, where u is a k + 2 dimensional vec- server uses the MBR to find the initial result set. We use the sim-
tor with ui = 1, uk+1 = −ai , and uj = 0 for j 6= i, k + 1, (for ple vertex-based algorithm for finding the MBR of the polyhedron.
Xi ≥ ai , ui = −1, uk+1 = ai ), and z = (xT , 1, v)T . Then, let The original query condition constructs a hyper-cube shape. With
the query transformation, the vertices of the hyper cube are also when the original data have independent columns and no more than
transformed to vertices of the polyhedron. Therefore, we can cal- one column having Gaussian distribution, an attack called Indepen-
culate the MBR with only the transformed vertices. Figure 4 illus- dent Component Analysis (ICA) [23] can be applied to effectively
trates the relationship between the vertices and the MBR. There are recover the original data from the perturbed data. Originally de-
a maximum number of 2k vertices for one conjunctive range query veloped for signal processing, ICA is used to discover components
condition on k dimensions, i.e., each dimension has its lower and A (the mixing matrix) and X (the original signals) from the mixed
upper bounds. It is straightforward to construct these vertices based data Y = AX. Since ICA recovers columns in an arbitrary order,
on the dimensional bounds. In practice, the MBR of the polyhedron it has to rely on the known distributional information to distinguish
needs to be calculated by the proxy server for security reason, and the columns and order them correctly. Furthermore, the effective-
then sent to the server together with the transformed queries. ness of ICA heavily depends on the independence of the columns
At the second stage, the server uses the transformed halfspace and the number of columns having non-Gaussian distributions . In
conditions 3 to filter the initial result and find the final result. In practice, since the independence condition and the Gaussian dis-
most cases, the initial result set will be reasonably small so that it tribution condition are often not satisfied, the ICA attack can only
can be filtered in memory with linear scan. In the worst case, the result in approximate estimation to the original data. However, the
MBR of the polyhedron will possibly enclose the entire dataset and previous study [9] shows that if the matrix A is not carefully se-
the second stage is reduced to linear scan of the entire dataset. lected, the ICA attack can still result in serious damage.
Another distributional attack is to enumerate the matrix A and
Cost Analysis. Assume the query ranges are selected uniformly at then check the column distributions of A−1 Y to find the best match
random. For small ranges the first stage average cost is O(logB N ) between the known column distributions and the distributions of the
index block accesses plus a few of data block accesses [28], where estimated columns. However, since there is no constraint on the el-
N is the number of records and B is the number of children an ements of A, with uniformly discretized domains, the number of
index node has. Due to the randomness associated with the RASP candidate matrices will be extremely large. Concretely, if the dis-
transformation, the data distribution, and the unpredictable query cretized domain has d values, the total candidate matrices will be
2
ranges, the cost to get the initial result could vary, which will be d(k+2) , where k is the number of dimensions. Even for extremely
investigated in experiments. If the initial result has n records, a lin- low dimensionality, e.g., k=2, this attack could be computationally
ear scan at the 2nd stage with 2k simple conditions will cost ≤ 2kn intractable.
checks of the form in Eq. 3. In Section 6, we study the cost dis-
tribution between the two stages and experimentally demonstrate Known Input/Output Attack. With the level 3 knowledge, the at-
that this two-stage processing is efficient and orders of magnitudes tacker knows a number of input/output (plaintext/ciphertext) record
faster than the linear scan approach. pairs. Concretely, let Pk×m be the known m k-dimensional origi-
nal records (x1 , . . . , xm ), m > k + 2, that include k + 2 linearly
5. ATTACK ANALYSIS independent records, and Qk+2×m be the corresponding perturbed
k + 2-dimensional records (y1 , . . . , ym ). The typical method is
We categorize the possible attacks into two types: (1) Attacks to use the linear regression method to get an estimate of the key
on auxiliary vectors; (2) Attacks based on range queries. There and then recover the entire original data. In the following, we show
has been some related work on attack analysis methods for simi- how to use the regression method to attack the encryption. Let A
lar encryption methods, e.g., geometric data perturbation for data decomposed into blocks A = (A1 , A2 , A3 ), where A1 , A2 and A3
mining [9], which can be migrated to analyze the first type of at- have block sizes (k + 2) × k, (k +  2) × 1and (k + 2) × 1, re-
tacks. However, attacks on range queries are entirely new for our
P
approach.
spectively, and the extended data be  1  where 1 is the row
5.1 Attacks on Auxiliary Vectors v
vector with ‘1’ and v is a row vector with random positive values,
According to the three levels of knowledge the attacker may corresponding to the two additional dimensions in RASP. Thus, the
have, we categorize the attacks into three classes: (1) Naive es- encryption can be represented as
timation; (2) Distributional Attacks; and (3) Known Input/Output  
Attacks. Due to the random component in the RASP encryption, P
some attacks are actually estimation attacks, i.e., the goal of the Q = (A1 , A2 , A3 )  1  = A1 P + A2 1 + A3 v, (4)
attack is to estimate the original values. If the estimation result is v
sufficiently accurate, we say the encryption is broken.
where A2 1 is a translation matrix that adds the vector A2 to each
5.1.1 Attack Description and Analysis of the column vectors in A1 X; A3 v is a random noise matrix.
With sufficient number of known record pairs (m > k + 2), first,
Naive Estimation. With the level 1 knowledge, the attacker ob-
the translation component can be canceled out; then, the regres-
serves only the encrypted data. The only attack is to blindly guess
sion method can be applied to estimate A1 . With the estimate
the matrix A. It has been discussed to find a matrix A to maximize
Â1 of A1 , A2 can be estimated as well. Therefore, for an en-
the difference between the encrypted data and the original data [9].
crypted dataset Y , the estimate of the original data X is X̂ =
However, since there is no way to verify how accurate a random
(ÂT1 Â1 )−1 ÂT1 (Y − Â2 1).
guess is, this type of attack is ineffective, in general.
5.1.2 Countering Attacks on Auxiliary Data
Distributional Attack. With the level 2 knowledge, the attacker
Countering ICA-based Distributional Attack. Since the enumer-
also knows column domains and distributions. This knowledge can
ation based attack is computationally intractable, we focus on the
be possibly used to perform more effective attacks. In particular,
ICA-based attack. We propose two approaches to increase the re-
3 silience to the attack. The first approach is to simulate the ICA
The final form of the security-enhanced transformed query is rep-
resented with the matrices Θi s that are described in Section 5.2. attack in sufficient rounds to find a statistically resilient A matrix
as the previous work does [9]. However, a more attack-resilient 5.2 Attacks on Transformed Queries
approach is using the composition encryption scheme (CES) that As we have discussed, in query processing, the proxy server will
consists of two steps: transforming the original data with an order submit the MBR and the transformed query to the server. We re-
preserving encryption scheme Eo first; then followed by the basic fer to the original query as the input query, and the transformed
RASP encryption, which can be represented as follows. query that is submitted to the server as the output query. With the

Eo (X, Ko )
 knowledge of a number of pairs of input/output queries, the follow-
E(X, K, Ko ) = A  1  (5) ing attacks can be performed on the current query conditions, e.g.,
v uT A−1 y ≤ 0, to break the encryption.

We use the OPE scheme by Agarwal et al. [1] that allows us to 5.2.1 Attack Description and Analysis
change all column distributions to normal distributions. Thus, the First, we will show that the row vectors of A−1 can be probed
requirement of non-Gaussian distribution is not satisfied, which if the attacker has the level 3 knowledge on query conditions. Sec-
renders ICA ineffective. Since the composition scheme is not or- ond, we show a more serious attack that can reveal columns of data
der preserving, the attacks on OPE schemes will not be applicable if the attacker has the level 2 knowledge about dimension distribu-
either. However, can the two-stage query processing strategy still tions.
be applied? The following proposition indicates that it can still be
applied. A−1 Probing Attack. With the level 3 knowledge, the attacker
knows a pair of input query conditions on the same dimension, say
P ROPOSITION 5. Order preserving encryption functions trans- Xi ≤ ai and Xi ≤ bi , and their output forms, uTai A−1 y ≤ 0
form a hyper-cubic query range to another hyper-cubic query range. and uTbi A−1 y ≤ 0, respectively. Then, uTai A−1 y − uTai A−1 y =
P ROOF. Assume the original range query condition consists of uTai −bi A−1 y, where uTai −bi = (0, . . . , ai − bi , 0), only the non-
simple conditions like bi ≤ Xi ≤ ai for each dimension. Since the zero (k + 1)-th element remains. Let rj be the j-th row of A−1 .
order is preserved, each simple condition is transformed as follows: The constant part of the condition represents (ai − bi )rk+1 , thus
Eo (bi ) ≤ Eo (Xi ) ≤ Eo (ai ), which means the transformed range revealing rk+1 . While the single condition like uTai A−1 y ≤ 0 has
is still a hyper-cubic query range. the constant part ri + ai rk+1 , with known ai and rk+1 , ri can be
revealed. Repeating this process for all dimensions with known in-
When processing a query, the proxy server needs to transform the put/output conditions, the attacker can recover k + 1 rows of the
ranges to OPE encrypted ranges first, and then apply the query matrix A−1 , which leaves very weak security.
transformation method. We will show in the experiments how the
resilience to the ICA attack is improved with the composition scheme. Dimensional Selection Attack. With the level 2 knowledge, i.e.,
the column domains and the column distributions, the attacker can
Countering Known Input/Output Attacks. As we have men- perform a dimensional selection attack. Assume the condition is
tioned earlier, the random noise matrix A3 v in Eq. 4 determines applied to some unknown dimension i. Applying the query param-
how effective the linear regression estimation can be done. The eters uTi A−1 to each record y in the server, the attacker can get
more intense (in terms of variance) the noise component is, the less uTi A−1 y = xi − ai , where xi is the i-th dimension of the corre-
accurate the estimation can be. A randomly generated matrix A sponding original record x. After getting all the values, the attacker
does not allow us to control the noise intensity, however. We pro- can build up a histogram to compare with the known column distri-
pose to use the following method to generate A3 that satisfying a butions. It is thus easy to identify what dimension is queried. With
specified noise intensity. (1) generate a k + 2 by β random ma- the knowledge of the column domain, the constant ai can be easily
trix Ψ according to the required noise distribution and intensity, removed, which leads to complete breach of the entire column i.
with sample size β that is sufficiently large, or larger than the max- In summary, the original query transformation method can be
imum number of known Input/Output records that an attacker may exploited to construct very effective attacks. It needs to be carefully
have access to; (2) generate β positive random values v; (3) apply redesigned to address these attacks.
A3 = ΨvT /(vvT ) to get A3 . The last column of the randomly
generated A is then replaced with A3 . Note here that having a 5.2.2 Countering Query-based Attacks
value of zero for say the i-th entry in A3 would mean that the noise The additional dimension Xk+2 is used to construct secure query
values are never added to i-th attribute. So the random generation conditions. Instead of processing a half space condition Xi ≤ ai ,
process is repeated until all elements of A3 are not zero. Previ- we use (Xi − ai )Xk+2 ≤ 0 instead. These two conditions are
ous study shows that Principle Component Analysis (PCA) can be equivalent because the additional dimension Xk+2 satisfies Xk+2 >
used to possibly filter out the noise component or reduce the ef- 0. Using the extended vector form zT = (xT , 1, v), we have
fect of noise in some circumstances [22]. We will investigate the Xi − ai = zT u and Xk+2 = wT z, where ui = 1, uk+1 = −ai ,
relationship between the noise component and the accuracy of es- uj = 0, for j 6= i, k + 1; wk+2 = 1 and wj = 0, for j 6= k + 2.
timation, and study whether PCA can help improve the estimation With the transformation y = Az, we get the transformed quadratic
in experiments. query condition
A more attack-resilient solution is using the previous discussed
composition encryption scheme. If the OPE scheme uses a non- yT (A−1 )T uwT A−1 y ≤ 0. (6)
linear transformation, the composition of OPE and RASP will cre- −1 T T −1
Let Θ = (A ) uw A . In the two-stage processing strat-
ate a nonlinear mapping from the original data to the encrypted egy, the bounding box of the transformed query area is calculated
data. Therefore, the linear regression attack will not be effective in the proxy server as we discussed earlier. Then, this bound-
by simply using the known pairs of input and output records, if the ing box and the parameters Θi for each condition i, are submit-
OPE key is unknown. ted to the server. Thus, assume each dimension is represented
with two half space conditions, the encrypted query EQ is repre-
sented as {MBR,{Θ1 , . . . , Θ2k }}. The server will use the bound-
ing box to get the first-stage results and then use the conditions, 6.3 Resilience to Estimation Attacks
e.g., yT Θi y ≤ 0, to filter out the results. We now show that this We have discussed the methods for countering the estimation at-
query transformation is resilient to both query-based attacks. tacks, primarily the ICA attack and the known input/output attack.
Assume the attacker knows two pairs of input/output query con- In this set of experiments, we explore the resilience of both the
ditions, e.g., for Xi ≤ ai and Xi ≤ bi . We use the same method RASP-Only scheme and the composition scheme to the estima-
used in the A−1 probing attack to find the difference of the two tion attacks. Although the composition scheme is more resilient
conditions. Let Θa and Θb notate the parameters for these two to attacks, it incurs the additional cost that might not be favored by
conditions, respectively. The simplified form for a single condi- some applications. Therefore, it is worth looking at how resilient
tion, e.g., Θa , is (rTi − ai rTk+1 )rk+2 , where ri , rk+1 , and rk+2 the RASP-Only scheme is to the attacks.
are the row vectors of matrix A−1 . Thus, the result of Θa − Θb is
(bi − ai )rTk+1 rk+2 . But knowing Θa , Θb , ai , bi does not help find Metric for Evaluating Estimation Attacks. The accuracy of esti-
the unknown vectors − in fact there are an infinite number of solu- mation attacks can be evaluated with the well-known mean square
(Θa −Θb )rk+2
tions because rk+1 = (bi −a i )||rk+2 ||
. Therefore, knowing pairs of error (MSE). Let the number of records be n and the value of
input/ouput queries does not help probing A−1 . the i-th attribute in the j-th record be xi,j and the correspond-
This quadratic query transformation method counters the dimen- ing estimated value be x̂i,j . Model the i-th attribute with a ran-
sional selection attack as well. For any perturbed record y, yT Θa y dom variable Xi and its estimate as X̂i . The estimation error can
recovers (xi − ai )xk+2 , where xi and xk+2 are the dimensional be estimated with the root of mean square error (RMSE): mi =
q
values of the corresponding original vector x. Since xk+2 is a ran- 1/n n
P 2
j=1 (xi,j − x̂i,j ) , i.e., Xi = X̂i ± mi . 2mi can be used
domly generated positive value, knowing Xi ’s domain and distri-
to roughly represent the effective estimation range. Apparently, mi
bution does not help recover xi . Therefore, dimensional selection
has different meaning for different value range. For example, ±10
attack does not work either.
means ineffective estimation for an attribute “age” (in the range
In the appendix, we put together all the algorithms after consid-
[0,100]), while very effective for “salary” (often > 10000). One
ering the attacks discussed in this section.
of the common methods is to normalize all attributes to approxi-
mately the same range. For large data, the assumption that each
6. EXPERIMENTS attribute has approximately normal distribution would be reason-
In this section, we present three sets of experimental results to in- able [25]. Therefore, standardization can be used to normalize all
vestigate the following questions: (1) How costly are the RASP en- attributes to normal distribution with mean zero and standard de-
cryption scheme and the composition scheme involving OPE scheme? viation one. For a standardized domain, four times of the standard
(2) How effective are the ICA attack and the known input/output at- deviation (i.e., 4σ = 4) covers the majority of records (> 95%).
tack, if the composition scheme is not applied? (3) How efficient is Then, we can use the rate pi = 2mi /(4σ) = mi /2 to represent the
the two-phase query processing? relative effectiveness of the estimation attack. The larger the mi ,
the less effective the estimation is. To evaluate the resilience across
6.1 Setup all attributes, we also define dataset-wise metrics, such as the min-
imum security guarantee pmin = min{pi , 1 ≤ i ≤ k}, which is
Two datasets are used in experiments: (1) a synthetic dataset that
used in our experiment.
draws samples from uniform distribution in the range [0, 1]; and
(2) the Adult dataset from UCI machine learning database4 . For
Results. We simulate the ICA attack for randomly chosen matri-
the adult dataset, we assign numeric values to the categorical val-
ces A. The data used in the experiment is the 10-dimension Adult
ues using a simple one-to-one mapping scheme. For each dataset,
data with 10K records. The x-axis in Figure 6 represents the se-
we generate multiple versions with different numbers of records by
quence number of randomly chosen matrix A and the y-axis repre-
using sampling with replacement. We also change the dimensional-
sents the minimum security guarantee among all dimensions. The
ity of the datasets by randomly selecting a number of dimensions of
label “Best” means the most resilient A to the ICA attack; “Worst”
the data. All experiments were done in a quad-core AMD Opteron
means the A shows the weakest resilience; “Average” is the pro-
server (2.5GHz CPU and 120GB memory).
gressive average resilience for the generated A matrices. Figure 6
shows that the effectiveness of the ICA attack can vary with differ-
6.2 Cost of Encryption ent matrices A and we can find some ones that are more resilient
In this experiment, we study the cost of the components in the to the attack. In addition, if applied is the composition method that
composition scheme. We implement the OPE scheme [1] by map- uses the OPE scheme to change column distributions to Gaussian
ping original column distributions to normal distributions. The distributions, the resilience of a randomly chosen A is significantly
OPE algorithm partitions the target distribution into buckets, first. increased.
Then, the sorted original values are proportionally partitioned ac- We also simulate the known input/output attack with a number
cording to the target bucket distribution to create the buckets for the of randomly selected input/output records pairs (10% of the entire
original distribution. With the aligned original and target buckets, dataset). The original data is generated with the method mentioned
an original value can be mapped to the target bucket and appro- in Section 5.2.2 by generating the noise matrix with standard nor-
priately scaled. Therefore, the encryption cost mainly comes from mal distribution N (0, 1). Due to the randomness, we repeat the
the bucket search procedure (proportional to log D, where D is the experiment 10 times and record the variance of the estimation. The
number of buckets). Both encryption schemes are implemented PCA based noise filtering technique [22] is also applied as a part
with Matlab. Figure 5 shows the cost distribution for 20K records of the attack. Let Y be the encrypted data. The PCA method finds
at different number of dimensions of data for the two components the eigenvalue decomposition of Y Y T . Let Q be the eigenvectors
in the composite scheme. The dimensionality has less effect on the corresponding to the largest p preserved eigenvalues (i.e., the prin-
cost of RASP than on that of OPE. cipal components). The noise filtering algorithm uses Y QQT to
4 represent Y . Figure 7 shows the result for the known I/O attack
http://archive.ics.uci.edu/ml/
Q opqrs tpuvprwswpx yz{|u|
'(-, opqrs }~y€ x‚ƒ
IJK ` OPY „|rs }~y€ x‚ƒ ˆ‰Ž
'(-+ ` ²³´µ¶ ·¸¶¸
c\ OPX ~ |q†‡| }~y€ x‚ƒ
?3 LMNJ f ™
72 '(-* bf OPW ™ ¹º»¼½¾¿ ·¸¶¸
^ œ• ˆ‰
<4 e OPV Ÿ
>8 '(-)
d ›Ÿ
=37 [cb OPU —
ž ˆ‰Œ
2 '(- ^
<4 à 
œ”
:;9 '(', _ OPT ›—
8 ] š
67 '('+ ^ OPS
] ˜™ ˆ‰‹
52 –
43 [\ OPR —
'('* [Z
12 OPQ –
”• ˆ‰Š
'(') O ”“
' Q T W QO QS QV QY RR RU RX SQ ST SW TO TS TV TY ˆ
. + / , 0 g hi jhklmn
@ AB CDEFGH DA GH Šˆ   ‘ ’
¡¢£¤¥ ¦§ ¨¥©ª«©¬­® ¯¦¢¬¦ª¤ª°±

Figure 6: Randomly generated ma-


Figure 5: The cost distribution of the trix A and the resilience to ICA at- Figure 7: Known input/output attacks
composition encryption scheme. Data: tack. Data: Adult (10dimensions, 10K on both Adult and Uniform data.
Adult (20K records,5-9 dimensions) records)

with the PCA noise filtering step. The x-axis represents the num- ent sizes of data with different query processing methods. For clear
ber of principal components preserved. Since the data dimension- presentation, we use log10 (# of block accesses) as the y-axis. The
ality is 10, 10 principal components means no noise reduction is cost of linear scan is simply the number of blocks for storing the
applied. For both datasets, the average estimation errors are higher whole dataset. The data dimensionality is fixed to 5 and the query
than 0.2. Also, the PCA noise filtering does not help much − when range is set to 30% of the whole domain. Obviously, the first stage
the number of principle components is reduced (trying to remove with MBR for polyhedron has a cost much cheaper than the linear
the noises), the estimation error does not reduce. This result shows scan method and only moderately higher than R*tree processing on
that with appropriately set noise component, the known I/O attack the original data. Interestingly, different distributions of data result
is not effective either. in slightly different patterns. The costs of R*tree on transformed
queries are very close to those of original queries for Adult data,
6.4 Performance of Two-stage Range Query while the gap is larger on uniform data. The costs over different
Processing dimensions and different query ranges show similar patterns.
In this set of experiments, we study the performance aspects of Linear Scan R*Tree-Orig Stage-1 Stage-2 rpq purity
polyhedron range query processing. We use the two-stage process- Uniform 6.32 0.132 0.805 0.041 60 1.3%
ing strategy described in Section 4, and explore the additional cost Adult 5.42 0.091 0.20 0.017 24 2.2%
incurred by this processing strategy. We implement the two-stage
query processing based on an R*-tree implementation provided by Table 1: Wall clock cost distribution and comparison.
Dr. Hadjieleftheriou at AT&T Lab http://www2.research.att.com/ mar-
ioh/spatialindex/. The block size is 4KB and we allow each block We also studied the cost of the second stage. We use “purity”
to contain only 20 entries to mimic a large database with many disk to represent the rate (final result count)/(1st stage result count), and
blocks. Samples from the three databases in different size (10,000 records per query (RPQ) to represent the average number of records
− 50,000 records, i.e., 500-2500 data blocks) are encrypted as the per query for the first stage results. The quadratic filtering condi-
auxiliary data and then indexed for query processing. Another set tions are used in experiments. Table 1 compares the average wall-
of indices are also built on the original data for setting up the per- clock time (milliseconds) per query for the two stages, the RPQ
formance baseline of query processing on non-encrypted data. We values for stage 1, and the purity of the stage-1 result. The tests are
will use the number of disk block accesses, including index blocks run with the setting of 10K queries, 20K records, 30% dimensional
and data blocks, to assess the performance to avoid the possible query range and 5 dimensions. Since the 2nd stage is done in mem-
variation caused by other parts of the computer system. In addition, ory, its cost is much lower than the 1st-stage cost. Overall, the two
we also show the wall-clock time for some results for comparison. stage processing is much faster than linear scan and comparable to
Recall the two-stage processing strategy: (1) calculate the MBR the original R*Tree processing.
of the transformed query and use the MBR to search the indexing
tree; (2) filter the returned result with the transformed query. We 7. RELATED WORK
will study the performance of the first stage by comparing it to two We review the two most related methods: OPE and crypto-index
additional methods: (1) the original queries with the index built on first, and then give other related work.
the original data, which is used to identify how much additional
cost is paid for querying the MBR of the transformed query; (2) the OPE. As the name indicates, order preserving encryption (OPE)
linear scan approach, which is the worst case cost. Range queries [1] preserves the dimensional value order after encryption. It can
are generated randomly within the domain of the datasets, and then be described as a function y = F (x), ∀xi , xj , xi < (>, =)xj ⇔
transformed with the method described in the Section 4.1. We also yi < (>, =)yj . A well-known attack is based on attacker’s prior
control the range of the queries to be [10%,20%,30%,40%,50%] of knowledge of original distributions of attribute values. If the at-
the total range of the domain, to observe the effect of the scale of tacker knows the original distributions and manages to identify the
the range to the performance of query processing. mapping between the original attribute and its encrypted counter-
part, the following bucket-based attack can be performed to break
Results. The first pair of figures (the left subfigures of Figure 8 and the encryption for the attribute. (1) Model the original distribu-
9) shows the number of block accesses for 10,000 queries on differ- tion for the attribute with a histogram of a number of buckets; (2)
9 8.5 8.5
8.5 8 8

Log10(Block Accesses)

Log10(Block Accesses)

Log10(Block Accesses)
8 7.5 7.5
7.5 R*Tree Original R*Tree Original
R*Tree Transformed 7 7 R*Tree Transformed
7 R*Tree Original
Linear Scan 6.5 R*Tree Transformed 6.5 Linear Scan
6.5 Linear Scan
6 6
6
5.5 5.5 5.5
5 5 5
4.5 4.5 4.5
10 20 30 40 50 5 6 7 8 9 10 20 30 40 50
Number of Records (thousands) Number of Dimensions Length of Query Range(%)

Figure 8: Performance comparison on Uniform data. Left: data size vs. cost of query; Middle: data dimensionality vs. cost of query;
Right: query range (percentage of the domain) vs. cost of query

9 8.5 8.5
8.5 8 8
Log10(Block Accesses)

Log10(Block Accesses)

Log10(Block Accesses)
8 7.5 7.5
7.5 R*Tree Original
7 7 R*Tree Transformed
7 R*Tree Original R*Tree Original
R*Tree Transformed 6.5 R*Tree Transformed 6.5 Linear Scan
6.5 Linear Scan Linear Scan
6 6
6
5.5 5.5 5.5
5 5 5
4.5 4.5 4.5
10 20 30 40 50 5 6 7 8 9 10 20 30 40 50
Number of Records (thousands) Number of Dimensions Length of Query Range (%)

Figure 9: Performance comparison on Adult data. Left: data size vs. cost of query; Middle: data dimensionality vs. cost of query;
Right: query range (percentage of the domain) vs. cost of query

Calculate the percentage of each bucket to the entire distribution; Other Related Work. Private information retrieval (PIR) [10,
(3) Sort the encrypted values; (4) According the bucket’s percent- 24] tries to fully preserve the privacy of access pattern, while the
ages, sequentially partition the sorted encrypted values to generate data may not be encrypted. PIR schemes are normally very costly.
buckets; (4) Sequentially map the encrypted buckets to the original Focusing on the efficiency side of PIR, Williams et al. [32] use a
buckets and get the estimate of the encrypted value. The precision pyramid hash index to implement efficient privacy preserving data-
of estimation is determined by the width of the buckets - the nar- block operations based on the idea of Oblivious RAM [16]. It is
rower the buckets are the higher the precision will be. Since the different from our setting of high throughput range query process-
number of buckets can be arbitrarily chosen, the bucket width can ing. Another line of research [5, 29] facilitates authorized users
be very small. It is also not difficult to get the mapping between the to access only the portion of data in the authorized range with a
original attribute and the encrypted attribute, if the attacker knows a public key scheme. The underlying identity based encryption used
number of plain queries and their encrypted queries. In developing in these schemes does not produce indexable encrypted data. Also
our schemes, we have carefully studied whether the known query the setting for which Shi et al. [29] propose the multidimensional
patterns will also damage the proposed encryption scheme in our range query is different from ours. The untrusted service provider
framework. in our setting is responsible for both indexing and query processing.
Secure keyword search on encrypted documents [30, 17, 14, 4, 11]
Crypto-Index. Crypto-Index is also based on column-wise buck- scans each encrypted document in the database and finds the doc-
etization. It assigns a random ID to each bucket; the values in the uments containing the keyword, which is more like point search
bucket are replaced with the bucket ID to generate the auxiliary data in database. The research on privacy preserving data mining has
for indexing. To utilize the index for query processing, a normal discussed multiplicative perturbation methods [7, 9, 27, 26, 18],
range query condition has to be transformed to a set-based query which are similar to the RASP encryption, but with more emphasis
on the bucket IDs. For example, Xi < ai might be replaced with on preserving the utility for data mining.
Xi′ ∈ [ID1 , ID2 , ID3]. If the attacker manages to know the map-
ping between the input original query and the output bucket-based
query, the range that a bucket ID represents could be estimated. The 8. CONCLUSION AND FUTURE WORK
width of the bucket determines how precise the estimation could be In this paper we propose the random space encryption approach
done. A bucket-diffusion scheme [21] was proposed to address this to efficient range queries over encrypted data and analyze the unique
problem, which, however, has to sacrifice the precision of query attacks to this approach. Our approach uses a random space trans-
results. Another drawback of this method is that the client, not the formation to generate indexable auxiliary data. The auxiliary data
server, has to filter out the query result. Low precision results raise is exported to the service provider, indexed and used for processing
large burden on the network and the client system. Furthermore, range queries. We present an efficient server-side two-stage query
due to the randomized bucket IDs, the index built on bucket IDs is processing strategy. Experimental results show that this processing
not so efficient for processing range queries as the index on OPE strategy is highly efficient. In addition, we analyzed the attacks on
encrypted data is. encrypted data and queries. Experiments are performed to show
the resilience of the encryption to estimation attacks. Note that this
attack analysis is just the first step to rigorous analysis of security. International Conference on Applied Cryptography and
We will continue to explore more attacks and formally study the se- Network Security. Springer-Verlag, 2004, pp. 31–45.
curity. As an important extension to our approach, we would like to [18] S. Guo and X. Wu, “Deriving private information from
further investigate how database update operations, such as record arbitrarily projected data,” in Proceedings of the 11th
deletions, insertions, and updates, affect data utility and security. European Conference on Principles and Practice of
The goal is to allow the data owner and authorized users update the Knowledge Discovery in Databases (PKDD07), Warsaw,
encrypted data without undermining the security. Poland, Sept 2007.
[19] A. Guttman, “R-trees: A dynamic index structure for spatial
9. REFERENCES searching,” in SIGMOD’84, Proceedings of Annual Meeting,
Boston, Massachusetts, June 18-21, 1984, B. Yormark, Ed.
[1] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order ACM Press, 1984, pp. 47–57.
preserving encryption for numeric data,” in Proceedings of [20] H. Hacigumus, B. Iyer, C. Li, and S. Mehrotra, “Executing
ACM SIGMOD Conference, 2004. sql over encrypted data in the database-service-provider
[2] R. Agrawal and R. Srikant, “Privacy-preserving data model,” in Proceedings of ACM SIGMOD Conference, 2002.
mining,” in Proceedings of ACM SIGMOD Conference. [21] B. Hore, S. Mehrotra, and G. Tsudik, “A privacy-preserving
Dallas, Texas: ACM, 2000. index for range queries,” in Proceedings of Very Large
[3] A. Boldyreva, N. Chenette, Y. Lee, and A. O’Neill, “Order Databases Conference (VLDB), 2004.
preserving symmetric encryption,” in Proceedings of [22] Z. Huang, W. Du, and B. Chen, “Deriving private
EUROCRYPT conference, 2009. information from randomized data,” in Proceedings of ACM
[4] D. Boneh, G. D. Crescenzo, R. Ostrovsky, and G. Persiano, SIGMOD Conference, 2005.
“Public-key encryption with keyword search,” in [23] A. Hyvarinen, J. Karhunen, and E. Oja, Independent
Proceedings of Advances in Cryptology, (EUROCRYPT). Component Analysis. Wiley, 2001.
Springer, 2004. [24] E. Kushilevitz and R. Ostrovsky, “Replication is not needed:
[5] D. Boneh and B. Waters, “Conjunctive, subset, and range Single database, computationally-private information
queries on encrypted data,” in the Theory of Cryptography retrieval,” in In Proc. of the 38th Annu. IEEE Symp. on
Conference (TCC. Springer, 2007, pp. 535–554. Foundations of Computer Science, 1997, pp. 364–373.
[6] S. Boyd and L. Vandenberghe, Convex Optimization. [25] E. L. Lehmann and G. Casella, Theory of Point Estimation.
Cambridge University Press, 2004. Springer-Verlag, 1998.
[7] K. Chen and L. Liu, “A random rotation perturbation [26] K. Liu, C. Giannella, and H. Kargupta, “An attacker’s view
approach to privacy preserving data classification,” in of distance preserving maps for privacy preserving data
Proceedings of International Conference on Data Mining mining,” in European Conference on Principles and Practice
(ICDM). Houston, TX: IEEE, 2005. of Knowledge Discovery in Databases (PKDD), Berlin,
[8] K. Chen and L. Liu, “A survey of multiplicative data Germany, September 2006.
perturbation for privacy preserving data mining,” [27] K. Liu, H. Kargupta, and J. Ryan, “Random projection-based
Privacy-Preserving Data Mining: Models and Algorithms, multiplicative data perturbation for privacy preserving
Edited by Charu C. Aggarwal and Philip S. Yu, 2008. distributed data mining,” IEEE Transactions on Knowledge
[9] K. Chen, L. Liu, and G. Sun, “Towards attack-resilient and Data Engineering (TKDE), vol. 18, no. 1, pp. 92–106,
geometric data perturbation,” in SIAM Data Mining 2006.
Conference, 2007. [28] Y. Manolopoulos, A. Nanopoulos, A. Papadopoulos, and
[10] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, Y. Theodoridis, R-trees: Theory and Applications.
“Private information retrieval,” ACM Computer Survey, Springer-Verlag, 2005.
vol. 45, no. 6, pp. 965–981, 1998. [29] E. Shi, J. Bethencourt, T.-H. H. Chan, D. Song, and
[11] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, A. Perrig, “Multi-dimensional range query over encrypted
“Searchable symmetric encryption: improved definitions and data,” in IEEE Symposium on Security and Privacy, 2007.
efficient constructions,” in Proceedings of the 13th ACM [30] D. X. Song, D. Wagner, and A. Perrig, “Practical techniques
conference on Computer and communications security. for searches on encrypted data,” in IEEE Symposium on
New York, NY, USA: ACM, 2006, pp. 79–88. Security and Privacy. Washington, DC, USA: IEEE
[12] I. Gartner, “Server storage and raid worldwide,” Technical Computer Society, 2000, p. 44.
Report, 1999. [31] H. Wang and L. V. S. Lakshmanan, “Efficient secure query
[13] C. Gentry, “Fully homomorphic encryption using ideal evaluation over encrypted xml databases,” in VLDB ’06:
lattices,” in STOC ’09: Proceedings of the 41st annual ACM Proceedings of the 32nd international conference on Very
symposium on Theory of computing. New York, NY, USA: large data bases. VLDB Endowment, 2006, pp. 127–138.
ACM, 2009, pp. 169–178. [32] P. Williams, R. Sion, and B. Carbunar, “Building castles out
[14] E.-J. Goh, “Secure indexes,” Cryptology ePrint Archive, of mud: Practical access pattern privacy and correctness on
Report 2003/216, 2003, http://eprint.iacr.org/2003/216/. untrusted storage,” in ACM Conference on Computer and
[15] O. Goldreich, Foundations of Cryptography. Cambridge Communications Security, 2008.
University Press, 2001. [33] W. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, “Secure
[16] O. Goldreich and R. Ostrovsky, “Software protection and knn computation on encrypted databases,” in Proceedings of
simulation on oblivious ram,” Journal of the ACM, vol. 43, ACM SIGMOD Conference, 2009.
pp. 431–473, 1996.
[17] P. Golle, J. Staddon, and B. Waters, “Secure conjunctive
keyword search over encrypted data,” in ACNS 04: 2nd
Appendix: the Attack-Resilient Algorithms In Algorithm 2, the query transformation and encryption func-
There are four key algorithms in the proposed research − two de- tion takes the 2k simple conditions (assume two for each dimen-
ployed at the proxy server: (1) Data Encryption and Decryption; sion: the upper and lower bounds for each conjunction clause) and
(2) Query Transformation and Encryption; and two at the service the key matrix A as the input, and transforms each condition with
provider: (3) Data Indexing; (4) Query Processing. We will use the the method described in Section 5.2. The MBR is calculated by
existing multidimensional tree algorithms for data indexing, thus the following steps: (1) calculate the vertices of the original query
we skip the procedure (3). For simplicity, we process the condi- range with the dimensional bounds (the (k+1)-th dimension is 1
tions like Xi ≤ ai or Xi ≥ bi . The algorithms can be slightly and the (k+2)-th dimension is bounded by (0, α]); (2) transform
changed to handle other types of conditions. the vertices with the composite encryption; (3) Find the bounding
In Algorithm 1, the key matrix A is generated with the resilience box of the transformed vertices as the MBR.
to known input/output attack in mind. First, A is randomly gener-
Algorithm 2 Query transformation and encryption.
ated with elements drawn from a random real number generator RA
(We used the normal distribution N(0,1) for RA in experiments). 1: QueryEnc(Cond, A)
Second, according to the desired noise distribution (i.e., N (0, σ 2 )) 2: Input: Cond: 2k simple conditions, 2 for each dimensions. A:the key;
for enhancing the resilience to known input/output attack, the algo- 3: for each condition Ci in Cond do
rithm described in Section 5.1 is used to find the last column of A 4: ui ← zeros(k + 2, 1);
(i.e., A3 ) and A3 replaces the last column of the generated A. The 5: if Ci is like Xj ≤ aj then
6: uij ← 1, ui,k+1 ← −aj ;
invertibility of A is checked to make sure decryption can be done.
7: end if
The data encryption function extends each original data vector x to 8: if Ci is like Xj ≥ aj then
the (k + 2) dimensional vector with the homogeneous (k + 1)-th 9: uij ← −1, ui,k+1 ← aj ;
dimension and the random positive (k + 2)-th dimension with the 10: end if
random real number generator Rα . Then, it uses an OPE scheme 11: wi ← zeros(k + 2, 1);
Eo to transform the original k dimensions, followed by the RASP 12: wi,k+2 ← 1;
encryption with the key matrix A. Note that the proxy server needs 13: Θi ← (A−1 )T uwT A−1 ;
only A and the key for OPE in decryption. 14: end for
15: Use the vertex transformation method to find the MBR of the trans-
formed queries;
Algorithm 1 Data encryption and decryption algorithms 16: submit MBR and the filtering conditions {Θi } to the server;
1: Encrypt(X, Rα , RA , Ko , α, β, σ)
2: Input: X: k × n data records, Rα and RA : random real value genera-
tors for generating the (k + 2)-nd dimension (i.e., v) and the invertible In Algorithm 3, the two-stage query processing uses the MBR to
(k + 2) × (k + 2) matrix A with at least two non-zero values in each find the initial query result and then filters the result with the trans-
row, Ko : key for OPE Eo , α: the upper bound for v, β:sample size, formed query conditions yT Θi y ≤ 0, where the matrices {Θi } are
σ: noise intensity; Output: the matrix A passed by the client and y is a record in the first-stage query result.
3: A ← 0;
4: while A is not invertible do Algorithm 3 Two-Stage Query Processing.
5: generate the elements in A with RA ; 1: ProcessQuery(M BR, {Θi })
6: v = (v1 , . . . , vβ ) ← generate β random positive values in range 2: Input: MBR: MBR for the transformed query; {Θi }:filtering condi-
(0,α) with Rα ; tions;
7: A3 ← 0;
8: while A3 contains zero elements do 3: Y ← use the indexing tree to find answers for MBR;
9: generate the (k + 2) × β noise matrix Ψ use N (0, σ 2 ); 4: Y ′ ← ∅;
10: A3 ← ΨvT /(vvT ); 5: for each record yi in Y do
11: end while 6: success ← 1
12: Replace the last column of A with A3 ; 7: for each condition Θi do
13: Check the invertibility of the matrix A; 8: if yT Θi y > 0 then
14: end while 9: success ← 0;
15: for each record x in X do 10: break;
16: v ← random positive value in range (0, α) with Rα 11: end if
17: y ← A(Eo (xT , Ko ), 1, v)T ; 12: end for
18: submit y to the server; 13: if success = 1 then
19: end for 14: add yi into Y ′ ;
20: return A; 15: end if
16: end for
1: Decrypt(Y, A, Ko ) 17: return Y ′ to the client;
2: Input: Y : k × n matrix, the encrypted records, A:the RASP key, Ko :
the OPE key; Output: the decrypted records X
3: X ← A−1 Y ;
4: X ′ ← the first k dimensions of X;
5: return OPE decryption Do (X ′ )

You might also like