Numerical Values By Data Transformation Computer Science Essay



Explosive progress in networking, storage and processor technologies has led to the creation of ultra-large databases that record an unprecedented amount of transactional information. Because data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, analyses personal data, public concerns regarding privacy are arising. Preserving the privacy of data shared for clustering is considered one of the most challenging problems. To overcome it, the data owner publishes the data after randomly modifying the original values in a way that disguises the sensitive information while preserving the data properties needed for analysis. Data transformation techniques play a vital role in preserving privacy in data mining. We propose an approach that addresses the privacy of confidential categorical and numerical data in clustering. The main goal of the proposed approach is to illustrate the effectiveness of clustering sensitive categorical and numerical data before and after the transformation.

1. Introduction

Due to the ever increasing use of information technology, large volumes of detailed personal data are regularly collected. Such data include shopping habits, criminal records, medical history and credit records, among others [1,2]. These data can be analyzed by applications which make use of data mining techniques. Hence such data is an important asset to business organizations and governments for decision making processes and also to offer social benefits, such as medical research, crime reduction, national security, etc. [3]. On the other hand, analyzing such data opens new threats to privacy and autonomy of the individual if not done properly.

With conventional data analysis methods there is a limited threat to privacy. These techniques mainly present results based on the mathematical characteristics of the data, so they may not reveal interesting patterns hidden in it. Appropriate data mining techniques make it possible to explore these hidden patterns. However, the threat to privacy then becomes real, since data mining techniques can derive highly sensitive knowledge from unclassified data, knowledge that is not even known to the database holders [4]. To avoid this, data owners may decide not to share or release such data for analysis, at the cost of giving up the exploration of hidden knowledge [5]. The privacy problem becomes worse when owners allow secondary usage of the data while unaware of the behind-the-scenes use of data mining techniques [6]. The challenging problem that we address in this study is: how can we protect against the misuse of knowledge discovered from secondary usage of data while still meeting the needs of organizations to support decision making?

To address this issue, we focus on privacy preserving clustering of confidential categorical and numerical data, particularly when personal or confidential data are shared before clustering analysis. To address privacy concerns in clustering analysis, we need to design specific data transformation methods that enforce privacy without losing the benefit of mining.

2. Literature survey

The primary goal of privacy preserving clustering is to protect sensitive data before it is released for analysis. However, the data may reside within a single organization or be distributed across different places. In such a scenario, appropriate algorithms or techniques should be used that do not reveal any sensitive information during the knowledge discovery process. Many approaches have been adopted for privacy preserving data mining. They can be classified along the following dimensions: data distribution, data modification, data mining algorithm, data or rule hiding, and privacy preservation [7].

In [8], this problem is addressed by transforming a database using Object Similarity-Based Representation (OSBR), which uses the similarity between objects, and Dimensionality Reduction-Based Transformation (DRBT), which uses random projection. Here the dissimilarity matrix is shared for analysis. Privacy preserving clustering has also been addressed for vertically partitioned and horizontally partitioned data [9,10]. Oliveira et al. [11] proposed an approach to privacy preserving clustering of numerical data using geometric data transformation. Although our proposed work will also be based on geometric data transformation methods, there will be two significant differences between our work and theirs. First, our work will deal with hybrid data transformation. Second, in their solution each sensitive attribute is numeric, whereas we will consider both categorical and numerical attributes. Our proposed work will also consider selective modification of confidential categorical and numerical data, so that the perturbed data can be released for secondary use while maintaining an appropriate level of privacy.

3. Problem definition

Consider an organization A that owns a dataset D and wants to cluster it. However, A does not have the expertise to perform the clustering, so it decides to release the dataset to another organization B for that purpose. Since organization A holds confidential data, the original dataset cannot be released to B as it is. The dataset D may also contain different types of attributes. For our problem we take a dataset consisting of sensitive categorical and numerical attributes. Before sharing D with B, organization A must transform D to preserve the privacy of individual data records. However, the transformation applied to D must not affect the similarity between objects. The problem can be stated as follows:

Let D be a relational database and let C be the set of clusters generated from D. The goal is to transform D into D' so that the following constraints hold:

A transformation T, when applied to D, must preserve the privacy of individual records, so that the released database D' conceals the values of confidential attributes.

The similarity between objects in D' must be the same as in D, or only slightly altered by the transformation process. Although the transformed database D' looks very different from D, the clusters in D and D' should be as close as possible.
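One way to make "as close as possible" concrete is to compare the cluster labels obtained from D and from D'. The sketch below uses the Rand index, which is the fraction of object pairs that the two clusterings treat the same way (both together or both apart). The essay does not name a specific measure, so the choice of the Rand index, the function name `rand_index` and the sample labels are all assumptions made for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two objects in the same
    cluster, or both put them in different clusters. Cluster numbering
    does not matter, only co-membership.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster labels for five records, before and after transformation.
clusters_in_D = [0, 0, 1, 1, 2]
clusters_in_D_prime = [1, 1, 0, 0, 0]
score = rand_index(clusters_in_D, clusters_in_D_prime)
```

A score of 1.0 means the transformation left the grouping unchanged; lower values indicate that some records moved relative to each other.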

4. Proposed approach

To address the above problem, the original database consisting of categorical and numerical data will be transformed using the following steps.

Each categorical attribute will be converted into binary attributes and mapped to numeric values.

A hybrid geometric data transformation approach will be used to transform the converted categorical attributes and the numerical attributes.

4.1. Categorical data conversion

Geometric data transformation methods cannot be applied directly to categorical values. A categorical variable can be converted into asymmetric binary variables by creating a new binary variable for each of its M nominal states [12]. For an object with a given state value, the binary variable representing that state will be set to 1 while the remaining binary variables will be set to 0. After the conversion, the binary values will be mapped to the corresponding numeric values. The transformation approaches considered are described in the following subsections.
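The conversion described above can be sketched as follows. This is a minimal illustration, assuming that the states are ordered alphabetically and that the binary values 0 and 1 are themselves used as the numeric values; the function name `one_hot_encode` and the sample attribute are hypothetical.

```python
def one_hot_encode(values):
    """Convert a list of categorical values into binary indicator rows.

    Returns (states, rows): the sorted distinct states and, for each input
    value, a list with 1 in the position of its state and 0 elsewhere.
    """
    states = sorted(set(values))
    index = {state: i for i, state in enumerate(states)}
    rows = []
    for value in values:
        row = [0] * len(states)  # one binary variable per nominal state
        row[index[value]] = 1    # mark the state this object has
        rows.append(row)
    return states, rows


# Hypothetical attribute with M = 3 nominal states.
blood_group = ["A", "B", "O", "A"]
states, encoded = one_hot_encode(blood_group)
```

Each record now carries M numeric columns in place of the single categorical one, so the geometric methods can operate on them.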

4.2. Geometric data transformation methods: In this proposal, we will consider the family of geometric data transformation methods (GDTMs) specified in [11]. The inputs to the GDTMs will be the vector subspace V, composed of the confidential converted categorical attributes and numerical attributes, and the random noise vector N, while the output will be the transformed vector subspace V'. The data transformation algorithms will have essentially two major steps:

A noise term and the operation to be applied to each confidential attribute will be chosen. In this step the random noise vector N will be created.

Using the random noise vector N, V will be transformed into V' by a geometric transformation function.

4.3. Translation data transformation: In this method the noise term applied to each confidential attribute will be constant and can be either positive or negative [11]. The set of operations takes only the value {Add}, corresponding to additive noise applied to each confidential attribute.
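As a small illustration of the {Add} operation, the sketch below shifts every observation of one attribute by the same constant. The function name `translate` and the sample values are assumptions made for this example.

```python
def translate(column, noise):
    """Add the same constant noise term to every observation of an attribute."""
    return [x + noise for x in column]


# Hypothetical confidential attribute and a negative noise term.
ages = [23.0, 31.0, 45.0]
shifted = translate(ages, -5.0)
```

Because every value moves by the same amount, the differences between observations of that attribute are unchanged, which is why translation does not disturb distance-based clustering.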

4.4. Scaling data transformation: In this method the noise term applied to each confidential attribute will be constant and can be either positive or negative [11]. The set of operations takes only the value {Multi}, corresponding to multiplicative noise applied to each confidential attribute.
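The {Multi} operation can be sketched in the same way, multiplying every observation of an attribute by one constant. The function name `scale` and the sample values are assumptions made for this example.

```python
def scale(column, noise):
    """Multiply every observation of an attribute by the same noise term."""
    return [x * noise for x in column]


# Hypothetical confidential attribute and a multiplicative noise term.
salaries = [1000.0, 2000.0, 4000.0]
scaled = scale(salaries, 0.5)
```

Within a single attribute, all differences are multiplied by the same factor, so the relative ordering of distances is kept. If different attributes receive different factors, distances between whole records can change, which is one source of misclassification after transformation.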

4.5. Rotation data transformation: This method works differently from the previous methods. Here the noise term is an angle. The rotation angle, measured clockwise, is the transformation applied to the observations of the confidential attributes [11]. The set of operations takes only the value {Rotate}, which identifies a common rotation angle between the attributes Ai and Aj. Unlike the previous methods, rotation may be applied more than once to some confidential attributes. Data reconstruction methods can be used to deduce the original data from the randomized data, so if the above transformations are applied to the original data separately, the risk of a privacy breach will be high. To overcome this, we will apply a hybrid transformation to the original data, which makes it difficult to reconstruct the sensitive data.
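The {Rotate} operation on a pair of attributes Ai and Aj can be sketched as a standard clockwise rotation in the plane spanned by the two attributes. The function name `rotate_clockwise`, the use of degrees and the sample values are assumptions made for this example.

```python
import math


def rotate_clockwise(col_i, col_j, angle_degrees):
    """Rotate each (Ai, Aj) observation pair clockwise by the given angle."""
    theta = math.radians(angle_degrees)
    c, s = math.cos(theta), math.sin(theta)
    # Clockwise rotation: x' = x cos(t) + y sin(t), y' = -x sin(t) + y cos(t)
    new_i = [c * x + s * y for x, y in zip(col_i, col_j)]
    new_j = [-s * x + c * y for x, y in zip(col_i, col_j)]
    return new_i, new_j


# Two hypothetical records, each with observations for attributes Ai and Aj.
attr_i = [1.0, 4.0]
attr_j = [0.0, 4.0]
rot_i, rot_j = rotate_clockwise(attr_i, attr_j, 90.0)
```

Rotation changes the individual values but keeps the Euclidean distance between any two records, so it hides the original observations without altering the similarity that clustering relies on.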

4.6. Noise level: To measure the effectiveness of our approach over a varying noise range, we will define a noise level for the attributes. Consider an attribute Ai, let n be the number of categories in Ai, and let e be the noise level. When the noise level is low, the probability of moving a record from its original category to a new category in the distorted database will be low. When the noise level, expressed as a percentage, is high, the probability of moving the record to a new category will also be high. Hence it will be essential to choose a noise level such that the privacy level is high and the misclassification of records in the clusters is low.

4.7. Algorithm:

Input: V, N

Output: V'

Step 1: For each confidential attribute in V do

Get the noise level e

Calculate the corresponding noise range

Randomly select the noise term in N for the confidential attribute within that range

Set the j-th operation to {Add}

Set the k-th operation to {Rotate}

Step 2: For each vector in V do

For each observation of the j-th attribute in that vector, apply the geometric transformation function with the selected operation and noise term
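The two steps of the algorithm can be sketched as follows. This is only an illustration under stated assumptions: the noise range is taken to be the noise level e multiplied by the value range of each attribute, a single pair of attributes is rotated, and the names `hybrid_transform`, `pair` and `seed` are hypothetical. The essay does not specify any of these details.

```python
import math
import random


def hybrid_transform(records, noise_level, angle_degrees, pair=(0, 1), seed=0):
    """Translate every attribute by random additive noise, then rotate one pair.

    records: list of equal-length numeric rows (the vectors in V).
    noise_level: fraction e of each attribute's value range used as the
        noise range (an assumed definition, not taken from the essay).
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*records)]

    # Step 1: pick one additive noise term per confidential attribute.
    noise = []
    for col in columns:
        spread = (max(col) - min(col)) * noise_level
        noise.append(rng.uniform(-spread, spread))

    # Step 2a: apply the {Add} operation to every observation.
    columns = [[x + n for x in col] for col, n in zip(columns, noise)]

    # Step 2b: apply the {Rotate} operation (clockwise) to the chosen pair.
    i, j = pair
    theta = math.radians(angle_degrees)
    c, s = math.cos(theta), math.sin(theta)
    rotated_i = [c * x + s * y for x, y in zip(columns[i], columns[j])]
    rotated_j = [-s * x + c * y for x, y in zip(columns[i], columns[j])]
    columns[i], columns[j] = rotated_i, rotated_j

    return [list(row) for row in zip(*columns)]


# Hypothetical V: two binary columns from a categorical attribute plus an age.
V = [[1.0, 0.0, 23.0], [0.0, 1.0, 45.0], [1.0, 0.0, 31.0]]
V_prime = hybrid_transform(V, noise_level=0.1, angle_degrees=30.0)
```

Because this particular combination uses only translation and rotation, the pairwise distances between records are preserved while every individual value changes. Adding a scaling step would trade some of that accuracy for further distortion.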


4.8. Clustering technique

To compare the results of clustering before and after the data transformation, we will use the K-means clustering algorithm. It groups objects, based on their attributes or features, into K groups, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroids. The basic steps of K-means clustering are shown in Fig. 1:


Fig. 1: Clustering process (flowchart: number of clusters k; centroid calculation; distance of objects to centroids; grouping based on minimum distance; repeat until no object moves to another group)



Iterate until stable (= no object moves to another group):

Determine the centroid coordinate

Determine the distance of each object to the centroids

Group the object based on minimum distance
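The loop above can be sketched in plain Python as follows. This is a minimal illustration, assuming that the caller supplies the initial centroids and that squared Euclidean distance is used; the names `k_means` and `squared_distance` and the sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, initial_centroids, max_iterations=100):
    """Cluster points around the given initial centroids.

    Returns (assignments, centroids) once no point changes group.
    """
    centroids = [list(c) for c in initial_centroids]
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        # Distance of each object to the centroids; group by minimum distance.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # stable: no object moved to another group
        assignments = new_assignments
        # Determine the centroid coordinates as the mean of each group.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a group is empty
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return assignments, centroids


# Hypothetical two-dimensional records forming two obvious groups.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
labels, centers = k_means(points, [[0, 0], [10, 10]])
```

Running the same routine on the original and the transformed records, and comparing the resulting labels, is one way to check how much the transformation affected the clusters.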


The family of hybrid data transformation methods introduced here is intended to preserve privacy in clustering analysis for both categorical and numerical data. The proposed methods distort confidential categorical and numerical attributes to meet privacy requirements while preserving the general features needed for clustering analysis. The data owner can select an appropriate noise level for distortion based on the categories present in the sensitive attributes. To the best of our knowledge, this will be the first effort to provide a solution to the problem of privacy preserving clustering of both categorical and numerical data. We expect the proposed methods to be effective and to provide practically acceptable values for balancing privacy and accuracy. The transformed database will be available for secondary use, with the distorted database preserving the main features of the clusters mined from the original database and striking an appropriate balance between clustering accuracy and privacy.