# Keyword Search On Encrypted Data Computer Science Essay


As Cloud Computing becomes prevalent, more and more sensitive information is being centralized in the cloud. To protect data privacy, sensitive data usually has to be encrypted before outsourcing, which makes effective data utilization a very challenging task. Although traditional searchable encryption schemes allow a user to securely search over encrypted data through keywords and selectively retrieve files of interest, these techniques support only exact keyword search. That is, they do not tolerate minor typos and format inconsistencies, which are typical of user search behavior and occur very frequently. This significant drawback makes existing techniques unsuitable for Cloud Computing, as it greatly affects system usability, making the search experience frustrating and system efficacy low. In this paper, for the first time, we formalize and solve the problem of effective fuzzy keyword search over encrypted cloud data while maintaining keyword privacy. Fuzzy keyword search greatly enhances system usability by returning the matching files when users' searching inputs exactly match the predefined keywords, or the closest possible matching files based on keyword similarity semantics when exact match fails. In our solution, we exploit edit distance to quantify keyword similarity and develop an advanced technique for constructing fuzzy keyword sets, which greatly reduces the storage and representation overheads. Through rigorous security analysis, we show that our proposed solution is secure and privacy-preserving, while correctly realizing the goal of fuzzy keyword search.

I. Introduction

As Cloud Computing becomes prevalent, more and more sensitive information, such as emails, personal health records, and government documents, is being centralized in the cloud. By storing their data in the cloud, data owners are relieved of the burden of data storage and maintenance and can enjoy on-demand, high-quality data storage service. However, because data owners and the cloud server are not in the same trusted domain, the outsourced data may be at risk, as the cloud server may no longer be fully trusted. It follows that sensitive data usually should be encrypted prior to outsourcing, both for data privacy and to combat unsolicited access. However, data encryption makes effective data utilization a very challenging task, given that there could be a large number of outsourced data files. Moreover, in Cloud Computing, data owners may share their outsourced data with a large number of users, and individual users might want to retrieve only the specific data files they are interested in during a given session. One of the most popular ways to do so is to selectively retrieve files through keyword-based search, instead of retrieving all the encrypted files, which is completely impractical in cloud computing scenarios. Such keyword-based search techniques allow users to selectively retrieve files of interest and have been widely applied in plaintext search scenarios, such as Google search [1]. Unfortunately, data encryption restricts the user's ability to perform keyword search and thus makes traditional plaintext search methods unsuitable for Cloud Computing. In addition, data encryption also demands the protection of keyword privacy, since keywords usually contain important information related to the data files. Although encrypting keywords can protect keyword privacy, it further renders traditional plaintext search techniques useless in this scenario.

To securely search over encrypted data, searchable encryption techniques have been developed in recent years [2]-[10]. Searchable encryption schemes usually build an index for each keyword of interest and associate the index with the files that contain the keyword. By integrating the trapdoors of keywords within the index information, effective keyword search can be realized while both file content and keyword privacy are well preserved. Although they allow searches to be performed securely and effectively, existing searchable encryption techniques are not suitable for the cloud computing scenario, since they support only exact keyword search. That is, there is no tolerance of minor typos and format inconsistencies. It is quite common that a user's searching input does not exactly match the pre-set keywords because of typos, such as Illinois and Ilinois, representation inconsistencies, such as PO BOX and P.O. Box, and/or the user's lack of exact knowledge about the data. The naive way to support fuzzy keyword search is through simple spell check mechanisms. However, this approach does not completely solve the problem and can sometimes be ineffective, for the following reasons. On the one hand, it requires additional user interaction to determine the correct word from the candidates generated by the spell check algorithm, which costs the user unnecessary extra effort. On the other hand, if the user accidentally types some other valid keyword by mistake (for example, searching for "hat" by carelessly typing "cat"), the spell check algorithm would not work at all, as it can never differentiate between two valid words. Thus, the drawbacks of existing schemes signify the need for new techniques that support searching flexibility, tolerating both minor typos and format inconsistencies.

In this paper, we focus on enabling effective yet privacy-preserving fuzzy keyword search in Cloud Computing. To the best of our knowledge, we formalize for the first time the problem of effective fuzzy keyword search over encrypted cloud data while maintaining keyword privacy. Fuzzy keyword search greatly enhances system usability by returning the matching files when users' searching inputs exactly match the predefined keywords, or the closest possible matching files based on keyword similarity semantics when exact match fails. More specifically, we use edit distance to quantify keyword similarity and develop a novel wildcard-based technique for the construction of fuzzy keyword sets. This technique eliminates the need to enumerate all the fuzzy keywords, and the resulting size of the fuzzy keyword sets is significantly reduced. Based on the constructed fuzzy keyword sets, we propose an efficient fuzzy keyword search scheme. Through rigorous security analysis, we show that the proposed solution is secure and privacy-preserving, while correctly realizing the goal of fuzzy keyword search.

The rest of the paper is organized as follows: Section II summarizes related work. Section III introduces the system model, threat model, and design goals, and briefly describes the necessary background for the techniques used in this paper. Section IV shows a straightforward construction of a fuzzy keyword search scheme. Section V provides a detailed description of our proposed schemes, including the efficient construction of fuzzy keyword sets and the fuzzy keyword search scheme. Section VI presents the security analysis. Finally, Section VII concludes the paper.

II. Related Work

Plaintext fuzzy keyword search. Recently, fuzzy search has received attention in the context of plaintext searching in the information retrieval community [11]-[13]. These works address the problem in the traditional information-access paradigm by using approximate string matching, so that users can find relevant information without a try-and-see approach. At first glance, it seems possible to directly apply these string matching algorithms to searchable encryption by computing the trapdoors character by character within an alphabet. However, this trivial construction suffers from dictionary and statistics attacks and fails to achieve search privacy.

Searchable encryption. Traditional searchable encryption [2]-[8], [10] has been widely studied in the context of cryptography. Among those works, most are focused on efficiency improvements and security definition formalizations. The first construction of searchable encryption was proposed by Song et al. [3], in which each word in the document is encrypted independently under a special two-layered encryption construction. Goh [4] proposed to use Bloom filters to construct the indexes for the data files. To achieve more efficient search, Chang et al. [7] and Curtmola et al. [8] both proposed similar "index" approaches, where a single encrypted hash table index is built for the entire file collection. In the index table, each entry consists of the trapdoor of a keyword and an encrypted set of file identifiers whose corresponding data files contain the keyword. As a complementary approach, Boneh et al. [5] presented a public-key based searchable encryption scheme, with an analogous scenario to that of [3]. Note that all these existing schemes support only exact keyword search, and thus are not suitable for Cloud Computing.

Fig 1: Architecture of Fuzzy Keyword Search

Others. Private matching [14], another related notion, has been studied mostly in the context of secure multiparty computation, where different parties collaboratively compute some function of their own data without revealing that data to the others. These functions could be, for example, the intersection or approximate private matching of two sets. Private information retrieval [15] is an often-used technique for retrieving matching items secretly. It has been widely applied in information retrieval from databases and usually incurs unexpectedly high computational complexity.

III. Problem Formulation

A. System Model

In this paper, we consider a cloud data system consisting of a data owner, data users, and a cloud server. Given a collection of $n$ encrypted data files $C = (F_1, F_2, \ldots, F_n)$ stored in the cloud server and a predefined set of distinct keywords $W = \{w_1, w_2, \ldots, w_p\}$, the cloud server provides the search service for the authorized users over the encrypted data $C$. We assume the authorization between the data owner and users is appropriately done. An authorized user types in a request to selectively retrieve data files of his/her interest. The cloud server is responsible for mapping the searching request to a set of data files, where each file is indexed by a file ID and linked to a set of keywords. The fuzzy keyword search scheme returns the search results according to the following rules: 1) if the user's searching input exactly matches a pre-set keyword, the server is expected to return the files containing the keyword; 2) if there exist typos and/or format inconsistencies in the searching input, the server will return the closest possible results based on pre-specified similarity semantics (to be formally defined in Section III-D). The architecture of fuzzy keyword search is shown in Fig. 1.

B. Threat Model

We consider a semi-trusted server. Even though data files are encrypted, the cloud server may try to derive other sensitive information from users' search requests while performing keyword-based search over C. Thus, the search should be conducted in a secure manner that allows data files to be securely retrieved while revealing as little information as possible to the cloud server. In this paper, when designing fuzzy keyword search scheme, we will follow the security definition deployed in the traditional searchable encryption [8]. More specifically, it is required that nothing should be leaked from the remotely stored files and index beyond the outcome and the pattern of search queries.

C. Design Goals

In this paper, we address the problem of supporting efficient yet privacy-preserving fuzzy keyword search services over encrypted cloud data. Specifically, we have the following goals: i) to explore a new mechanism for constructing storage-efficient fuzzy keyword sets; ii) to design an efficient and effective fuzzy search scheme based on the constructed fuzzy keyword sets; iii) to validate the security of the proposed scheme.

D. Preliminaries

Edit Distance. There are several methods to quantitatively measure string similarity. In this paper, we resort to the well-studied edit distance [16]. The edit distance $ed(w_1, w_2)$ between two words $w_1$ and $w_2$ is the minimum number of operations required to transform one of them into the other. The three primitive operations are 1) Substitution: changing one character to another in a word; 2) Deletion: deleting one character from a word; 3) Insertion: inserting a single character into a word. Given a keyword $w$, we let $S_{w,d}$ denote the set of words $w'$ satisfying $ed(w, w') \le d$ for a certain integer $d$.
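The definition above can be made concrete with a short dynamic-programming sketch. This is the standard Levenshtein recurrence over the three primitive operations, not code from the paper; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, deletions,
    and insertions needed to turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Illinois", "Ilinois"))  # the typo from the introduction
print(edit_distance("hat", "cat"))           # two valid words, one substitution
```

Both example pairs are at distance 1, so each word lies in the other's set $S_{w,1}$.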

Fuzzy Keyword Search. Using edit distance, the definition of fuzzy keyword search can be formulated as follows: Given a collection of $n$ encrypted data files $C = (F_1, F_2, \ldots, F_n)$ stored in the cloud server, a set of distinct keywords $W = \{w_1, w_2, \ldots, w_p\}$ with predefined edit distance $d$, and a searching input $(w, k)$ with edit distance $k$ ($k \le d$), the execution of fuzzy keyword search returns a set of file IDs whose corresponding data files possibly contain the word $w$, denoted as $FID_w$: if $w = w_i \in W$, then return $FID_{w_i}$; otherwise, if $w \notin W$, then return $\{FID_{w_i}\}$, where $ed(w, w_i) \le k$. Note that the above definition is based on the assumption that $k \le d$. In fact, $d$ can be different for distinct keywords, and the system will return $\{FID_{w_i}\}$ satisfying $ed(w, w_i) \le \min\{k, d\}$ if exact match fails.

IV. The Straightforward Approach

Before introducing our construction of fuzzy keyword sets, we first propose a straightforward approach that achieves all the functions of fuzzy keyword search, which aims at providing an overview of how a fuzzy search scheme works over encrypted data.

Assume $\Pi = (\mathrm{Setup}(1^\lambda), \mathrm{Enc}(sk, \cdot), \mathrm{Dec}(sk, \cdot))$ is a symmetric encryption scheme, where $sk$ is a secret key, $\mathrm{Setup}(1^\lambda)$ is the setup algorithm with security parameter $\lambda$, and $\mathrm{Enc}(sk, \cdot)$ and $\mathrm{Dec}(sk, \cdot)$ are the encryption and decryption algorithms, respectively. Let $T_{w_i}$ denote a trapdoor of keyword $w_i$. Trapdoors of the keywords can be realized by applying a one-way function $f$, similar to [2], [4], [8]: given a keyword $w_i$ and a secret key $sk$, we can compute the trapdoor of $w_i$ as $T_{w_i} = f(sk, w_i)$.
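As a minimal sketch of the trapdoor computation, the keyed one-way function $f$ can be instantiated with HMAC-SHA256 from the Python standard library. The paper does not name a concrete function, so this choice, the function name `trapdoor`, and the demo key are all assumptions made for illustration.

```python
import hashlib
import hmac


def trapdoor(sk: bytes, keyword: str) -> str:
    """T_w = f(sk, w), with HMAC-SHA256 standing in for f (an assumption)."""
    return hmac.new(sk, keyword.encode("utf-8"), hashlib.sha256).hexdigest()


sk = b"demo-key-not-for-production"  # hypothetical key, for illustration only
t1 = trapdoor(sk, "castle")
t2 = trapdoor(sk, "castle")
t3 = trapdoor(sk, "castel")
print(t1 == t2)  # same key and keyword always give the same trapdoor
print(t1 == t3)  # a slightly different keyword gives a different trapdoor
```

Because the trapdoor is deterministic, the server can match a query against the index by equality alone, which is also why a misspelled query produces no match under exact-keyword schemes.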

V. Constructions of Effective Fuzzy Keyword Search in Cloud

The key idea behind our secure fuzzy keyword search is two-fold: 1) building fuzzy keyword sets that incorporate not only the exact keywords but also the ones differing slightly due to minor typos, format inconsistencies, etc.; 2) designing an efficient and secure searching approach for file retrieval based on the resulting fuzzy keyword sets.

A. Advanced Technique for Constructing Fuzzy Keyword Sets

To provide more practical and effective fuzzy keyword search constructions with regard to both storage and search efficiency, we now propose an advanced technique that improves on the straightforward approach for constructing the fuzzy keyword set. Without loss of generality, we focus on the case of edit distance $d = 1$ to elaborate the proposed technique. For larger values of $d$, the reasoning is similar. Note that the technique is carefully designed so that shrinking the fuzzy keyword set does not affect search correctness.

Wildcard-based Fuzzy Set Construction. In the above straightforward approach, all the variants of the keywords have to be listed even if an operation is performed at the same position. Based on this observation, we propose to use a wildcard to denote edit operations at the same position. For example, for the keyword CASTLE with the pre-set edit distance 1, its wildcard-based fuzzy keyword set can be constructed as $S_{\text{CASTLE},1} = \{\text{CASTLE}, \text{*CASTLE}, \text{*ASTLE}, \text{C*ASTLE}, \text{C*STLE}, \ldots, \text{CASTL*E}, \text{CASTL*}, \text{CASTLE*}\}$. The total number of variants of CASTLE constructed in this way is only $13 + 1$, instead of $13 \times 26 + 1$ as in the above exhaustive enumeration approach when the edit distance is set to 1. Generally, for a given keyword $w_i$ with length $\ell$, the size of $S_{w_i,1}$ will be only $2\ell + 1 + 1$, as compared to $(2\ell + 1) \times 26 + 1$ obtained in the straightforward approach. The larger the pre-set edit distance, the more storage overhead can be reduced: with the same setting of the example in the straightforward approach, the proposed technique can help reduce the storage of the index from 30GB to approximately 40MB.
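The construction for $d = 1$ can be sketched in a few lines. The function name `wildcard_fuzzy_set` is hypothetical, and the sketch assumes keywords never contain the character `*` themselves.

```python
def wildcard_fuzzy_set(word: str) -> set:
    """Wildcard-based fuzzy keyword set S_{w,1} for edit distance 1."""
    variants = {word}
    for i in range(len(word) + 1):
        # A '*' inserted before position i stands for any insertion there.
        variants.add(word[:i] + "*" + word[i:])
    for i in range(len(word)):
        # A '*' replacing position i stands for any substitution there.
        variants.add(word[:i] + "*" + word[i + 1:])
    return variants


s = wildcard_fuzzy_set("CASTLE")
length = len("CASTLE")
print(sorted(s))
print(len(s), 2 * length + 1 + 1)   # measured size vs. the 2l + 1 + 1 formula
print((2 * length + 1) * 26 + 1)    # size quoted for exhaustive enumeration
```

A deletion needs no wildcard of its own in this sketch: a query that dropped a character, such as CASTE, contains the insertion variant CAST*E, which coincides with the substitution variant CAST*E of the stored keyword.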

B. The Efficient Fuzzy Keyword Search Scheme

Based on the storage-efficient fuzzy keyword sets, we show how to construct an efficient and effective fuzzy keyword search scheme. The scheme of the fuzzy keyword search goes as follows:

1) To build an index for $w_i$ with edit distance $d$, the data owner first constructs a fuzzy keyword set $S_{w_i,d}$ using the wildcard-based technique. The index table $\{(\{T_{w_i'}\}_{w_i' \in S_{w_i,d}}, \mathrm{Enc}(sk, FID_{w_i} \| w_i))\}_{w_i \in W}$ and the encrypted data files are outsourced to the cloud server for storage;

2) To search with $(w, k)$, the authorized user computes the trapdoor set $\{T_{w'}\}_{w' \in S_{w,k}}$, where $S_{w,k}$ is also derived from the wildcard-based fuzzy set construction. The user then sends $\{T_{w'}\}_{w' \in S_{w,k}}$ to the server;

3) Upon receiving the search request $\{T_{w'}\}_{w' \in S_{w,k}}$, the server compares it with the index table and returns all the possible encrypted file identifiers $\{\mathrm{Enc}(sk, FID_{w_i} \| w_i)\}$ according to the fuzzy keyword definition in Section III-D. The user decrypts the returned results and retrieves relevant files of interest.

In this construction, the technique of constructing search request for w is the same as the construction of index for a keyword. As a result, the search request is a trapdoor set based on Sw,k, instead of a single trapdoor as in the straightforward approach. In this way, the searching result correctness can be ensured.
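The three steps can be sketched end to end for $d = k = 1$. Everything concrete here is an assumption for illustration: HMAC-SHA256 plays the role of $f$, a toy HMAC-based stream cipher stands in for $\mathrm{Enc}$/$\mathrm{Dec}$ (it has no integrity protection and is not meant for real use), and all function names are hypothetical.

```python
import hashlib
import hmac
import os


def trapdoor(sk, word):
    # T_w = f(sk, w); the b"T|" prefix separates this use of sk from enc().
    return hmac.new(sk, b"T|" + word.encode(), hashlib.sha256).digest()


def fuzzy_set(word):
    # Wildcard-based S_{w,1}: the word, insertion and substitution variants.
    ins = {word[:i] + "*" + word[i:] for i in range(len(word) + 1)}
    sub = {word[:i] + "*" + word[i + 1:] for i in range(len(word))}
    return {word} | ins | sub


def _keystream(sk, nonce, n):
    out, counter = b"", 0
    while len(out) < n:
        block = b"E|" + nonce + counter.to_bytes(4, "big")
        out += hmac.new(sk, block, hashlib.sha256).digest()
        counter += 1
    return out[:n]


def enc(sk, plaintext):
    # Toy stand-in for Enc(sk, .): XOR with an HMAC-derived keystream.
    nonce = os.urandom(16)
    ks = _keystream(sk, nonce, len(plaintext))
    return nonce + bytes(p ^ k for p, k in zip(plaintext, ks))


def dec(sk, ciphertext):
    nonce, body = ciphertext[:16], ciphertext[16:]
    ks = _keystream(sk, nonce, len(body))
    return bytes(c ^ k for c, k in zip(body, ks))


def build_index(sk, keyword_files):
    """Step 1 (data owner): one (trapdoor set, Enc(sk, FID_w || w)) per keyword."""
    index = []
    for w, fids in keyword_files.items():
        traps = {trapdoor(sk, v) for v in fuzzy_set(w)}
        index.append((traps, enc(sk, (",".join(fids) + "||" + w).encode())))
    return index


def make_request(sk, query):
    """Step 2 (user): trapdoors of every element of S_{w,1}."""
    return {trapdoor(sk, v) for v in fuzzy_set(query)}


def server_search(index, request):
    """Step 3 (server): return ciphertexts whose trapdoor set meets the request."""
    return [ct for traps, ct in index if traps & request]


sk = os.urandom(32)
index = build_index(sk, {"castle": ["f1", "f7"], "hat": ["f2"]})
hits = server_search(index, make_request(sk, "casle"))  # typo: missing 't'
print([dec(sk, ct).decode() for ct in hits])
```

The server only ever compares opaque trapdoors and forwards ciphertexts; the user learns which keyword matched after decryption, because $w_i$ is stored alongside the file identifiers.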

VI. Security Analysis

In this section, we analyze the correctness and security of the proposed fuzzy keyword search scheme. First, we show the correctness of the scheme in terms of two aspects: completeness and soundness.

Theorem 1: The wildcard-based scheme satisfies both completeness and soundness. Specifically, upon receiving the request for $w$, a keyword $w_i$ will be returned if and only if $ed(w, w_i) \le k$.

The proof of this Theorem can be reduced to the following Lemma:

Lemma 1: The intersection of the fuzzy sets $S_{w_i,d}$ and $S_{w,k}$ for $w_i$ and $w$ is not empty if and only if $ed(w, w_i) \le k$.

Proof: First, we show that $S_{w_i,d} \cap S_{w,k}$ is not empty when $ed(w, w_i) \le k$. To prove this, it is enough to find an element in $S_{w_i,d} \cap S_{w,k}$. Let $w = a_1 a_2 \cdots a_s$ and $w_i = b_1 b_2 \cdots b_t$, where all the $a_j$ and $b_j$ are single characters. After $ed(w, w_i)$ edit operations, $w$ can be changed to $w_i$ according to the definition of edit distance. Let $w^* = a_1^* a_2^* \cdots a_m^*$, where $a_i^* = a_j$, or $a_i^* = \ast$ if any operation is performed at this position. Since edit operations are invertible, the operations that change $w_i$ back into $w$ are performed at the same positions that contain wildcards in $w^*$. Because $ed(w, w_i) \le k$, $w^*$ is included in both $S_{w_i,d}$ and $S_{w,k}$, so $S_{w_i,d} \cap S_{w,k}$ is not empty.

Next, we prove that $S_{w_i,d} \cap S_{w,k}$ is empty if $ed(w, w_i) > k$. The proof is by contradiction. Assume there exists a $w^*$ belonging to $S_{w_i,d} \cap S_{w,k}$. We will show that $ed(w, w_i) \le k$, which is a contradiction. First, from the assumption that $w^* \in S_{w_i,d} \cap S_{w,k}$, the number of wildcards in $w^*$, denoted by $n^*$, is not greater than $k$. Next, we prove that $ed(w, w_i) \le n^*$ by induction.

First, we prove that it holds when $n^* = 1$. There are nine cases to consider. If $w^*$ is derived by a deletion from both $w_i$ and $w$, then $ed(w_i, w) \le 1$, because all characters are the same except the one at that position. If the operation is a deletion from $w_i$ and a substitution from $w$, we have $ed(w_i, w) \le 1$, because the two words will be the same after at most one substitution from $w_i$. The other cases can be analyzed in a similar way and are omitted.

Now, assuming that it holds when $n^* = \gamma$, we need to prove that it also holds when $n^* = \gamma + 1$. Let $w^* = a_1^* a_2^* \cdots a_n^* \in S_{w_i,d} \cap S_{w,k}$, where $a_i^* = a_j$ or $a_i^* = \ast$. For a wildcard at position $t$, cancel the underlying operations and revert it to the original characters in $w_i$ and $w$ at this position. Assume two new elements $w_i^*$ and $w^*$ are derived from them, respectively. Then perform one operation at position $t$ of $w_i^*$ to make the character of $w_i$ at this position the same as that of $w$; denote the resulting word by $w_i'$. After this operation, $w_i^*$ is changed to $w^*$, which has only $\gamma$ wildcards. Therefore, we have $ed(w_i', w) \le \gamma$ from the induction assumption. We know that $ed(w_i', w) \le \gamma$ and $ed(w_i', w_i) = 1$, from which it follows that $ed(w_i, w) \le \gamma + 1$. Thus, we get $ed(w, w_i) \le n^*$. Because $n^* \le k$, this yields $ed(w, w_i) \le k$, which is the contradiction. Therefore, $S_{w_i,d} \cap S_{w,k}$ is empty if $ed(w, w_i) > k$.
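For small cases the lemma can also be checked exhaustively. The sketch below covers only $d = k = 1$ over a tiny alphabet, assumes keywords never contain `*`, and uses hypothetical helper names; it is a sanity check that complements the proof and does not replace it.

```python
from itertools import product


def edit_distance(a, b):
    # Standard Levenshtein dynamic program, kept compact.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def fuzzy_set(word):
    # Wildcard-based S_{w,1}: the word, insertion and substitution variants.
    ins = {word[:i] + "*" + word[i:] for i in range(len(word) + 1)}
    sub = {word[:i] + "*" + word[i + 1:] for i in range(len(word))}
    return {word} | ins | sub


def find_counterexample(alphabet="ab", max_len=4):
    """Return a pair violating Lemma 1 for d = k = 1, or None if none exists."""
    words = ["".join(p) for n in range(1, max_len + 1)
             for p in product(alphabet, repeat=n)]
    for w, wi in product(words, repeat=2):
        overlap = bool(fuzzy_set(w) & fuzzy_set(wi))
        if overlap != (edit_distance(w, wi) <= 1):
            return (w, wi)
    return None


print(find_counterexample())  # None means the lemma held on every tested pair
```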

Theorem 2: The fuzzy keyword search scheme is secure regarding the search privacy.

Proof: In the wildcard-based scheme, the computation of the index and of the request for the same keyword is identical. Therefore, we only need to prove index privacy, which we do by reduction. Suppose the searchable encryption scheme fails to achieve index privacy, in the sense of indistinguishability under the chosen keyword attack, which means there exists an algorithm $A$ that can get the underlying information of a keyword from the index. Then, we build an algorithm $A'$ that utilizes $A$ to determine whether some function $f'(\cdot)$ is a pseudo-random function, such that $f'(\cdot)$ is equal to $f(sk, \cdot)$, or a random function. $A'$ has access to an oracle $O_{f'(\cdot)}$ that takes as input a secret value $x$ and returns $f'(x)$. Upon receiving any request for index computation, $A'$ answers it with a request to the oracle $O_{f'(\cdot)}$. After making these trapdoor queries, the adversary outputs two challenge keywords $w_0^*$ and $w_1^*$ with the same length and edit distance, a restriction that can be relaxed by adding some redundant trapdoors. $A'$ picks a random $b \in \{0, 1\}$ and sends $w_b^*$ to the challenger. Then, $A'$ is given a challenge value $y$, which is computed either from the pseudo-random function $f(sk, \cdot)$ or from a random function. $A'$ sends $y$ back to $A$, who answers with $b' \in \{0, 1\}$. Suppose $A$ guesses $b$ correctly with non-negligible probability, which indicates that the value is not randomly computed. Then, $A'$ decides that $f'(\cdot)$ is a pseudo-random function. As a result, based on the assumption that a pseudo-random function is indistinguishable from a truly random function, $A$ guesses $b$ correctly with probability at most approximately 1/2. Thus, search privacy is obtained.

VII. Conclusion

In this paper, for the first time we formalize and solve the problem of supporting efficient yet privacy-preserving fuzzy search for achieving effective utilization of remotely stored encrypted data in Cloud Computing. We design an advanced technique (i.e., wildcard-based technique) to construct the storage-efficient fuzzy keyword sets by exploiting a significant observation on the similarity metric of edit distance. Based on the constructed fuzzy keyword sets, we further propose an efficient fuzzy keyword search scheme. Through rigorous security analysis, we show that our proposed solution is secure and privacy-preserving, while correctly realizing the goal of fuzzy keyword search.

As ongoing work, we will continue to research security mechanisms that support: 1) search semantics that take into consideration conjunctions of keywords, sequences of keywords, and even complex natural language semantics to produce highly relevant search results; and 2) search ranking that sorts the search results according to relevance criteria.