An Efficient Method For Knowledge Hiding Computer Science Essay



Abstract-In our era, knowledge is not "just" information anymore; it is an asset. Data mining is thus extensively used for knowledge discovery from large databases. The problem with data mining is that the availability of non-sensitive information can make it possible to infer sensitive information that is not to be disclosed. Thus privacy is becoming an increasingly important issue in many data mining applications. A number of techniques, such as randomization and k-anonymity, have been suggested in recent years for privacy-preserving data mining of multidimensional data records. Furthermore, the problem has been discussed in multiple communities such as the database community, the statistical disclosure control community and the cryptography community. We propose a new solution that integrates the advantages of cryptographic and database modification techniques with a view to minimizing information loss and privacy loss. By making use of cryptographic techniques to store sensitive data and providing access to the stored data based on an individual's role, we ensure that the data is safe from privacy breaches. The trade-off between data utility and data safety of our proposed method will be assessed.

Keywords-privacy preserving data mining.


Over the past few years, there has been tremendous growth in the amount of private data about individuals that can be collected and analyzed. This data comes from a variety of sources including medical, financial, library, telephone, and shopping records. With the rapid growth in database, networking, and computing technologies, such data can be integrated and analyzed digitally. On the one hand, this has led to the development of data mining tools that aim to infer useful trends from this data. On the other hand, easy access to personal data poses a threat to individual privacy.

In privacy preserving data mining (PPDM), the goal is to perform data mining operations on sets of data without disclosing the contents of the sensitive data. Since the results of the mining tell us something about the data, some information about the original data is leaked through the mining results. This leads to privacy loss. If, on the other hand, the data is perturbed to address privacy concerns, this leads to information loss, which is typically measured by how much critical information about the datasets is preserved after the perturbation [7]. Thus, we need to work towards minimizing both privacy loss and information loss.

Many approaches have emerged for privacy preserving data mining. The first approach involved perturbing the input before mining. Though it has the benefit of simplicity, it does not provide a formal framework for proving how much privacy is guaranteed. The secure computation technique [11] has the advantage of providing a well-defined model for privacy using cryptographic techniques, and it is also accurate. However, it is a slower method.

The k-anonymity [5,10] model was developed because of the possibility of indirect identification of records from public databases. This is because combinations of record attributes can be used to exactly identify individual records. In the k-anonymity method, we reduce the granularity of data representation with the use of techniques such as generalization and suppression. The granularity is reduced sufficiently that any given record is indistinguishable from at least k-1 other records in the data.
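The generalization and suppression steps described above can be sketched roughly as follows. This is only an illustrative sketch, assuming the common reading in which every combination of quasi-identifier values must be shared by at least k records; the attribute names (`age`, `zip`, `disease`), the helper functions and the sample records are hypothetical and do not come from the cited work.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Generalization: keep the first `keep` characters and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def anonymize(records, k):
    """Generalize the quasi-identifiers, then suppress groups smaller than k."""
    generalized = [
        {**r, "age": generalize_age(r["age"]), "zip": generalize_zip(r["zip"])}
        for r in records
    ]
    sizes = Counter((r["age"], r["zip"]) for r in generalized)
    # Suppression: drop every tuple whose group is still too small.
    return [r for r in generalized if sizes[(r["age"], r["zip"])] >= k]

def is_k_anonymous(records, k, quasi_identifiers=("age", "zip")):
    """True if every quasi-identifier combination occurs at least k times."""
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(n >= k for n in sizes.values())

# Hypothetical sample data.
records = [
    {"age": 34, "zip": "47677", "disease": "flu"},
    {"age": 36, "zip": "47602", "disease": "cancer"},
    {"age": 31, "zip": "47678", "disease": "flu"},
    {"age": 52, "zip": "47905", "disease": "asthma"},
]
released = anonymize(records, k=3)
```

In this sketch the single record in the 50-59 age range cannot be grouped with others and is therefore suppressed, which shows how privacy is gained at the cost of some information loss.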

The l-diversity [6] model was designed to handle some weaknesses in the k-anonymity model since protecting identities to the level of k-individuals is not the same as protecting the corresponding sensitive values, especially when there is homogeneity of sensitive values within a group. To do so, the concept of intra-group diversity of sensitive values is promoted within the anonymization scheme.
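The homogeneity weakness mentioned above can be illustrated with a small check. The sketch below assumes the simplest variant, in which each group of records sharing the same quasi-identifier values must contain at least l distinct sensitive values; the function name, attribute names and sample groups are hypothetical.

```python
from collections import defaultdict

def is_l_diverse(records, l, quasi_identifiers, sensitive):
    """True if every quasi-identifier group holds at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

# A group that is 3-anonymous but homogeneous: everyone in it has the same disease,
# so knowing that a person belongs to the group reveals the sensitive value.
homogeneous = [
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "30-39", "zip": "476**", "disease": "flu"},
]

# The same group with varied sensitive values.
mixed = [
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "30-39", "zip": "476**", "disease": "cancer"},
    {"age": "30-39", "zip": "476**", "disease": "asthma"},
]

homogeneous_ok = is_l_diverse(homogeneous, 2, ("age", "zip"), "disease")
mixed_ok = is_l_diverse(mixed, 2, ("age", "zip"), "disease")
```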

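The t-closeness model is a further enhancement of the concept of l-diversity. One characteristic of the l-diversity model is that it treats all values of a given attribute in a similar way, irrespective of their distribution in the data. This is rarely appropriate for real data sets, since the attribute values may be highly skewed. The t-closeness model therefore requires that the distribution of a sensitive attribute within each anonymized group differ from its distribution in the whole data set by no more than a threshold t.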
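The effect of skewed sensitive values can be made concrete by comparing each group's distribution with the overall one. The sketch below is an assumption-laden simplification: t-closeness is commonly stated with the Earth Mover's Distance, whereas the simpler total variation distance is swapped in here to keep the example short. All names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def max_group_distance(records, quasi_identifiers, sensitive):
    """Largest distance between any group's distribution and the overall one."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return max(total_variation(distribution(v), overall) for v in groups.values())

def satisfies_t_closeness(records, t, quasi_identifiers, sensitive):
    """True if no group deviates from the overall distribution by more than t."""
    return max_group_distance(records, quasi_identifiers, sensitive) <= t

# Hypothetical released table with two groups.
table = [
    {"age": "30-39", "disease": "flu"},
    {"age": "30-39", "disease": "flu"},
    {"age": "50-59", "disease": "flu"},
    {"age": "50-59", "disease": "cancer"},
]
worst = max_group_distance(table, ("age",), "disease")
```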

Here, we propose a new approach to privacy preserving data mining based on cryptographic access control (PPDEC), in which there are two sets of attributes: sensitive attributes (SAS) and non-sensitive attributes (NSAS). Users are allowed to mine different sets of data based on their roles. The data is first classified into sensitive and non-sensitive attributes. Sensitive attributes are encrypted and stored. A permitted user can access the sensitive attributes only after decryption, which ensures privacy.
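The role-based access to encrypted sensitive attributes described above might look roughly like the following sketch. Everything in it is an assumption made for illustration: the roles, the choice of `disease` as the only sensitive attribute, and in particular the cipher, which is a toy XOR keystream built from SHA-256 and is not a secure encryption scheme. A real system would use a vetted cryptographic library instead.

```python
import hashlib

def keystream_xor(data: bytes, key: bytes) -> bytes:
    """Toy cipher (placeholder only, NOT secure): XOR with a SHA-256 keystream.
    Applying it twice with the same key restores the original bytes."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))

SENSITIVE = {"disease"}                                   # SAS (hypothetical choice)
ROLE_MAY_DECRYPT = {"doctor": True, "analyst": False}     # hypothetical roles

def store(record, key):
    """Encrypt the sensitive attributes and keep the others in the clear."""
    stored = {}
    for name, value in record.items():
        if name in SENSITIVE:
            stored[name] = keystream_xor(str(value).encode(), key)
        else:
            stored[name] = value
    return stored

def read(stored, role, key):
    """Return the view of a stored record that the given role is allowed to see."""
    view = {}
    for name, value in stored.items():
        if name not in SENSITIVE:
            view[name] = value
        elif ROLE_MAY_DECRYPT.get(role, False):
            view[name] = keystream_xor(value, key).decode()
    return view

key = b"demo-key"                                         # hypothetical key
record = {"name": "p1", "age": "30-39", "disease": "flu"}
stored = store(record, key)
doctor_view = read(stored, "doctor", key)
analyst_view = read(stored, "analyst", key)
```

Here the permitted role sees the decrypted sensitive attribute, while any other role sees only the non-sensitive attributes.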

Related Work

Several approaches have been proposed in the context of privacy preserving data mining. The main ones include the heuristic based approach, the reconstruction based approach and the cryptographic approach. The underlying question of the heuristic based approach is how to hide sensitive rules that can be mined from the original data while maximizing the utility of the released data. In the reconstruction based approach [1,3,8], we first use some methods to distort the values of the original data and then release the distorted data. It is efficient in a centralized environment, but it causes problems in a distributed environment, and its space complexity is also high. The third approach is the cryptography based approach [11], which has been developed to solve the following problem: two or more parties want to conduct a computation based on their private inputs, but no party is willing to disclose its own input to anybody else. This problem is referred to as the Secure Multiparty Computation (SMC) problem, which requires that no more information be revealed to a participant in the computation than that participant's input and output. However, generic SMC protocols are impractical for arbitrary inputs.

It is important to realize that data modification results in degradation of the database performance. In order to quantify the degradation of the data, we mainly use two metrics. The first one measures the confidential data protection, while the second measures the loss of functionality.
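The SMC requirement described above, that each participant learns nothing beyond its own input and the output, can be illustrated with a secure sum. The sketch below uses additive secret sharing, a standard textbook technique that is named here as an illustration and is not taken from the cited work. It simulates all parties inside one process, and the modulus and function names are hypothetical choices.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this (assumed) public value

def make_shares(value, n_parties):
    """Split a private value into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def secure_sum(private_inputs):
    """Simulate a secure sum: each party only ever sees random shares of the
    other parties' inputs, yet the combined result is the true total."""
    n = len(private_inputs)
    # shares[i][j] is the share that party i sends to party j.
    shares = [make_shares(v, n) for v in private_inputs]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partial_sums) % MODULUS

total = secure_sum([12, 30, 7])  # three parties with private counts
```

The result is correct as long as the inputs are non-negative and their true sum is smaller than the modulus.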


Vertical partitioning of data refers to the method of partitioning data in which different sites may have different attributes of the same set of records.
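As a rough illustration of vertical partitioning, the sketch below splits each record by attribute across several sites while keeping a shared record identifier so that the pieces can be rejoined. The site names, attribute names and the `id` key are hypothetical.

```python
def vertical_partition(records, key, site_attributes):
    """Give each site the record key plus only its own subset of attributes."""
    sites = {}
    for site, attrs in site_attributes.items():
        sites[site] = [{key: r[key], **{a: r[a] for a in attrs}} for r in records]
    return sites

def rejoin(sites, key):
    """Reassemble full records by matching the shared key across sites."""
    merged = {}
    for rows in sites.values():
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())

# Hypothetical records and placement of attributes on two sites.
records = [
    {"id": 1, "age": 34, "zip": "47677", "disease": "flu"},
    {"id": 2, "age": 52, "zip": "47905", "disease": "asthma"},
]
sites = vertical_partition(
    records, "id", {"DB1": ["age", "zip"], "DB2": ["disease"]}
)
```

Each site holds the same set of records but a different set of attributes, so no single site sees a complete record.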






Fig 1: Vertical Partitioning

The architecture in the above figure shows the participating databases:

A Miner that decides what computation is to be done.

A Calculator that does the computation. The Calculator is unaware of what data it computes.

It is important to note that only the Miner and the participants get the mining results, while the Calculator performs only auxiliary computations without knowing their meaning.


Fig 2: PPDEC Architecture

The ideas of the cryptographic approach and database modification have motivated us to provide more secure hiding for both data and knowledge by combining the benefits of these two techniques with the idea of vertical fragmentation of the data for distributed storage. The existing methods focus on a universal approach that applies the same amount of preservation to all persons, without catering for their concrete needs. We illustrate this idea by classifying the data into sensitive and non-sensitive attributes and using cryptographic and vertical partitioning techniques to store the data securely.

Fig. 2 shows the framework of the PPDEC approach, which consists of all the above components. Initially, D0 is the original transaction database, which leads to the disclosure of some sensitive knowledge in the form of sensitive frequent patterns. This sensitive knowledge needs to be protected. The database Dx is the extension of D0 that is created by generalization and suppression techniques, which are applied only to the non-sensitive attributes. Generalization involves replacing specific values with more general ones, whereas suppression is the process of deleting cell values or entire tuples. The server identifies the sensitive attributes and decides how the data is to be partitioned and where it is stored in the databases. The Encrypter then encrypts each sensitive attribute, choosing randomly from the encryption techniques stored in it. The encrypted sensitive attribute is sent to the server, which stores it in the respective database (DB1, DB2, DB3) allocated by the vertical partitioning approach. The server maintains a database that contains the name of the sensitive attribute, the corresponding identifier name, and the database where the encrypted version of the sensitive attribute is stored. The server has no prior knowledge of the encryption techniques used, and the Encrypter, on the other hand, is not aware of the actual data that it encrypts.
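The interplay between the server, the Encrypter and the partitioned databases could be sketched along the following lines. This is a loose interpretation rather than the authors' implementation: the class and method names, the "ticket" used to remember which technique was chosen, and the two toy "techniques" (XOR keystreams derived from SHA-256 and SHA-512) are all assumptions, and the toy ciphers are not secure.

```python
import hashlib
import secrets

def _xor_stream(data, key, hash_name):
    """Toy cipher (NOT secure): XOR the data with a hash-derived keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.new(hash_name, key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))

class Encrypter:
    TECHNIQUES = ("sha256", "sha512")  # stand-ins for real encryption techniques

    def __init__(self):
        self._key = secrets.token_bytes(32)
        self._used = {}  # ticket -> technique; never shared with the server

    def encrypt(self, plaintext):
        technique = secrets.choice(self.TECHNIQUES)  # random choice per attribute
        ticket = secrets.token_hex(8)
        self._used[ticket] = technique
        return ticket, _xor_stream(plaintext, self._key, technique)

    def decrypt(self, ticket, ciphertext):
        return _xor_stream(ciphertext, self._key, self._used[ticket])

class Server:
    def __init__(self, encrypter, placement):
        self.encrypter = encrypter
        self.placement = placement  # sensitive attribute -> database name
        self.databases = {name: {} for name in set(placement.values())}
        self.catalog = []           # attribute name, identifier, database

    def store(self, identifier, attribute, value):
        ticket, ciphertext = self.encrypter.encrypt(value.encode())
        db = self.placement[attribute]
        self.databases[db][(identifier, attribute)] = (ticket, ciphertext)
        self.catalog.append(
            {"attribute": attribute, "identifier": identifier, "database": db}
        )

    def fetch(self, identifier, attribute):
        db = self.placement[attribute]
        ticket, ciphertext = self.databases[db][(identifier, attribute)]
        return self.encrypter.decrypt(ticket, ciphertext).decode()

# Hypothetical placement of two sensitive attributes on two databases.
server = Server(Encrypter(), {"disease": "DB1", "salary": "DB2"})
server.store("p1", "disease", "flu")
server.store("p1", "salary", "52000")
```

In this sketch the server's catalog records only where each encrypted attribute lives, and the technique chosen for each value stays inside the Encrypter.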


In this paper we proposed a method for privacy preserving data mining using a database extension based cryptographic approach. This method preserves a maximum level of privacy and requires less time for data transformation. Thus, illegal access to sensitive attributes is prevented, and retrieval of data occurs through a secure path. As future work, we plan to extend our knowledge hiding framework to neural networks and genetic algorithms.