Scheme To Hide Sensitive Sequential Patterns Computer Science Essay


Shahzad, F. proposed a scheme to hide sensitive sequential patterns based on the FP-Growth technique. The approach implements anti-monotone and monotone constraints on the FP-tree to hide sensitive sequential patterns, and the developed scheme has shown better results for hiding such patterns.

Das, K. [12] proposed a scalable, local privacy-preserving algorithm for distributed peer-to-peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and is therefore highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network, in the context of a P2P Web mining application. The proposed optimization-based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
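The secure-sum primitive this line of work builds on can be sketched in a few lines. This is a minimal, centralized illustration, not Das's asynchronous local algorithm; the function name, modulus, and share-exchange layout are illustrative assumptions:

```python
import random

def secure_sum(private_values, modulus=10**9):
    # Each peer splits its private value into random additive shares
    # (mod modulus) and distributes them; no single party ever sees
    # another peer's raw value, yet the shares sum to the true total.
    n = len(private_values)
    shares = []
    for v in private_values:
        parts = [random.randrange(modulus) for _ in range(n - 1)]
        parts.append((v - sum(parts)) % modulus)
        shares.append(parts)
    # Peer j aggregates the j-th share from every peer; the final
    # aggregator only ever sees these partial sums.
    partials = [sum(shares[i][j] for i in range(n)) % modulus
                for j in range(n)]
    return sum(partials) % modulus

print(secure_sum([12, 45, 7, 30]))  # 94
```

In the real P2P setting the share exchange happens through local peer interactions rather than in one loop, but the cancellation principle is the same.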

Raja Kumar, R. [13] investigated the possibility of using multiplicative perturbation with sparse random projection matrices to preserve privacy in datasets that are incremented asynchronously over time from various sources. This work proposes the use of random projections with a sparse matrix to maintain a sketch of a collection of high-dimensional data streams that are updated asynchronously. This sketch allows estimating L2 (Euclidean) distances and dot products with high accuracy. A conceptual architecture is also proposed for implementing privacy preservation techniques, especially the sparse random projection matrix technique, on incremental data to improve the level of privacy protection. Tests confirm that the perturbed data still preserves certain statistical characteristics of the original unperturbed data. A generic projection-based sketch for incremental data streams is proposed that can be used not only for this application but for any other application supporting incremental databases. The work also traces the origin of PPDM, the definition of privacy preservation in data mining, and the implications of benchmark privacy principles in knowledge discovery, and advocates a few policies for PPDM based on these principles. These are vital for the development and deployment of methodological solutions, letting vendors and developers build robust information reuse and integration (IRI) in PPDM. The authors seek to capitalize on the reuse of PPDM knowledge by crafting simple, rich, and reusable knowledge representations, and accordingly investigate tactics for integrating this knowledge into legacy systems to advance the future of PPDM.
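The distance-preserving property of sparse random projections can be sketched as follows. The Achlioptas-style entry distribution and the dimensions chosen here are assumptions for illustration, not the paper's exact construction:

```python
import math
import random

def sparse_projection(d, k, s=3, rng=random):
    # Sparse matrix: entries are +sqrt(s/k) or -sqrt(s/k), each with
    # probability 1/(2s), and 0 with probability 1 - 1/s.
    scale = math.sqrt(s / k)
    def entry():
        r = rng.random()
        if r < 1.0 / (2 * s):
            return scale
        if r < 1.0 / s:
            return -scale
        return 0.0
    return [[entry() for _ in range(k)] for _ in range(d)]

def project(x, R):
    k = len(R[0])
    return [sum(xi * R[i][j] for i, xi in enumerate(x)) for j in range(k)]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

random.seed(0)
R = sparse_projection(d=20, k=400)
x = [float(i) for i in range(20)]
y = [float(2 * i) for i in range(20)]
# Euclidean distance is approximately preserved in the sketch space,
# so the ratio below should be close to 1.
print(sq_dist(project(x, R), project(y, R)) / sq_dist(x, y))
```

Because an attacker cannot invert the random projection to recover individual values, the sketch can be shared while still supporting distance-based mining.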

Harnsamut, N. [14] focuses on maintaining data quality in scenarios where the transformed data will be used to build associative classification models. They propose a data quality metric for such associative classification, along with a heuristic approach to preserve privacy while maintaining data quality. They then validate the proposed approaches with experiments.

Kargupta, H. [15] present the theoretical foundation of this filtering method and extensive experimental results to demonstrate that, in many cases, random data distortion preserves very little data privacy. They also point out possible avenues for the development of new privacy-preserving data mining techniques, such as exploiting multiplicative and colored noise for preserving privacy in data mining applications.
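The additive-noise randomization being critiqued here can be sketched in a few lines; the distributions and parameters below are hypothetical choices for illustration:

```python
import random
import statistics

def perturb(values, sigma, rng=random):
    # Classic additive-noise randomization: each value gets
    # independent zero-mean Gaussian noise before release.
    return [v + rng.gauss(0, sigma) for v in values]

random.seed(1)
original = [random.gauss(50, 10) for _ in range(10000)]
released = perturb(original, sigma=20)
# Aggregate statistics survive the distortion (the intended utility)...
print(abs(statistics.mean(released) - statistics.mean(original)) < 1)
# ...but Kargupta et al. show that the correlation structure of real
# data lets an attacker filter much of this noise back out, so the
# privacy gained can be far less than the noise level suggests.
```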

Li Liu [16] proposes an individually adaptable perturbation model, which enables individuals to choose their own privacy level. Hence the proposed model provides different privacy guarantees for different privacy preferences. The new perturbation model was tested by applying different reconstruction methods to the perturbed data sets. Furthermore, they built decision tree and Naive Bayes classifier models on the reconstructed data sets for both synthetic and real-world data sets. For the synthetic data set, their experimental results indicate that the model enables users to choose their own privacy level without reducing the accuracy of the data mining results. For the real-world data sets, they obtained very interesting results, which led them to question whether perturbation-reconstruction-based privacy preserving data mining is applicable to real-world data.

Loh, B.C.S. [17] focuses on a domain-driven data mining outsourcing scenario whereby a data owner publishes data to an application service provider who returns mining results. To ensure data privacy against an untrusted party, anonymization, a widely used technique capable of preserving true attribute values and supporting various data mining algorithms, is required. Several issues emerge when anonymization is applied in a real-world outsourcing scenario. The majority of methods have focused on the traditional data mining paradigm and therefore neither incorporate domain knowledge nor optimize data for domain-driven usage. They propose an anonymization framework for aiding users in a domain-driven data mining outsourcing scenario. The framework involves several components designed to anonymize data while preserving meaningful or actionable patterns that can be discovered after mining. In contrast with existing works for traditional data mining, this framework integrates domain ontology knowledge during domain generalization hierarchy (DGH) creation to retain value meanings after anonymization. In addition, users can implement constraints based on their mining tasks, thereby controlling how data generalization is performed. Finally, attribute correlations are calculated to ensure preservation of important features. Preliminary experiments show that an ontology-based DGH manages to preserve semantic meaning after attribute generalization. Also, using Chi-Square as a correlation measure can possibly improve attribute selection before generalization.

Bhaduri, K. [18] discuss nonlinear data distortion using potentially nonlinear random data transformations and show how it can be useful for privacy-preserving anomaly detection from sensitive data sets. They develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy using standard definitions. The highlight of this approach is that it allows a user to control the amount of privacy by varying the degree of nonlinearity. They show how the general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. They also analyze the proposed nonlinear transformation in full generality and show that, for specific cases, it is distance preserving. A main contribution of this paper is the discussion of the relationship between the invertibility of a transformation and privacy preservation, and the application of these techniques to outlier detection. The experiments conducted on real-life data sets demonstrate the effectiveness of the approach.
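A toy instance of the sigmoid-based distortion family described above may help. The slope and center parameters `a` and `b` here are hypothetical choices, not values from the paper:

```python
import math

def sigmoid_distort(x, a=0.05, b=50.0):
    # Logistic map centered at b; the slope parameter a sets the
    # degree of nonlinearity: flatter tails destroy more distance
    # information, which is what buys the privacy.
    return 1.0 / (1.0 + math.exp(-a * (x - b)))

values = [10.0, 40.0, 50.0, 60.0, 95.0]
distorted = [sigmoid_distort(v) for v in values]
# The map is monotone, so rank order survives -- useful for
# rank-based anomaly detection on the distorted data.
print(distorted == sorted(distorted))   # True
print(round(sigmoid_distort(50.0), 3))  # 0.5
```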

Jie Wang [19] proposes several efficient and flexible techniques to address this issue, utilizing unique characteristics of matrix factorization to maintain data patterns. They use support vector machine classification to compare how well accuracy is maintained after data distortion by different methods. With better performance than some classical data perturbation approaches, nonnegative matrix factorization and singular value decomposition are considered promising techniques for privacy preserving data mining. Experimental results demonstrate that mining accuracy on data distorted using these methods is almost as good as that on the original data, with the added property of privacy preservation. This indicates that the matrix factorization-based data distortion schemes perturb only confidential attributes to meet privacy requirements while preserving the general data pattern for knowledge extraction.
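The SVD side of this idea can be sketched in pure Python, with a rank-1 truncation via power iteration standing in for the general low-rank factorization (the iteration count and the rank are illustrative assumptions, not the paper's settings):

```python
def rank1_distort(A, iters=200):
    # Replace A by its best rank-1 approximation (top singular
    # component, found by power iteration on A^T A). The dominant
    # pattern useful for mining survives; fine-grained per-record
    # detail -- the confidential part -- is smoothed away.
    m, n = len(A), len(A[0])
    v = [1.0] * n
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]

# A matrix that is already rank-1 is reproduced (almost) exactly;
# higher-rank data would be blurred toward its dominant pattern.
A = [[4.0, 5.0, 6.0], [8.0, 10.0, 12.0], [12.0, 15.0, 18.0]]
print(rank1_distort(A)[2][2])  # ~18.0
```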

Wang Yan [20] present an effective method for privacy preserving association rule mining in web usage mining, Secondary Random Response Column Replacement (SRRCR), to improve privacy preservation and mining accuracy. A privacy preserving association rule mining algorithm based on SRRCR is then presented, which achieves significant improvements in terms of privacy and efficiency. Finally, they present experimental results that validate the algorithm by applying it to real datasets.

Banu, R.V. [21] explores the aspects of privacy issues in data mining, especially those related to clustering, and provides a technique for privacy preserving clustering with a hypothetical banking scenario. They propose a model for clustering horizontally partitioned or centralized data sets using a simple PCA-based transformation approach. The proposed PPC method has been implemented using Matlab and evaluated using synthetic datasets. The proposed privacy preserving transformation preserved the nature of the data even in the transformed form. The classification accuracy when using the transformed data is almost equal to that of the original dataset.
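The principle behind such transformations can be illustrated with a plain 2-D rotation, which stands in for the paper's PCA-based transform (the angle and sample points are arbitrary assumptions): an orthogonal map hides raw values while leaving all pairwise distances, and hence the clustering, unchanged.

```python
import math
import random

def random_rotation():
    # A 2x2 rotation matrix for a random angle.
    theta = random.uniform(0.0, 2.0 * math.pi)
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def transform(points, R):
    return [(R[0][0] * x + R[0][1] * y, R[1][0] * x + R[1][1] * y)
            for x, y in points]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

random.seed(7)
points = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
rotated = transform(points, random_rotation())
# Pairwise distances are untouched, so any distance-based clustering
# produces the same clusters on the transformed data.
print(abs(dist(rotated[0], rotated[1]) - dist(points[0], points[1])) < 1e-9)
```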

Ukil, A. [22] develops a scheme for secure multiparty data aggregation with the help of modular arithmetic. Specifically, they consider a scenario in which two or more parties owning confidential data need to share it with a third party only for aggregation purposes, without revealing any unnecessary information. More generally, data aggregation needs to take place at the server or aggregator without it acquiring the content of the individual data. Their work is motivated by the need to protect both privileged information and confidentiality. They show through simulation results the efficacy of their scheme and compare the results with one of the established schemes.
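One common modular-arithmetic construction for this kind of aggregation uses pairwise cancelling masks. The sketch below illustrates the flavor of the idea, not Ukil's exact protocol; the function name and modulus are assumptions:

```python
import random

def masked_aggregate(values, modulus=2**31):
    # Every pair of parties (i, j) agrees on a shared random mask r:
    # party i adds r to its value, party j subtracts it (mod modulus).
    # The masks cancel in the total, so the aggregator learns only
    # the sum, never an individual contribution.
    n = len(values)
    masked = [v % modulus for v in values]
    for i in range(n):
        for j in range(i + 1, n):
            r = random.randrange(modulus)
            masked[i] = (masked[i] + r) % modulus
            masked[j] = (masked[j] - r) % modulus
    return sum(masked) % modulus

print(masked_aggregate([10, 20, 30]))  # 60
```

Each `masked[i]` is uniformly random on its own; only the modular sum carries information.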

Chang-bin Jiang [23] analyzes the status quo of privacy preservation in web log mining, and then puts forward a privacy preserving mining model based on an evolutionary algorithm of the cloud model, combining evolutionary algorithms with cloud model theory. This model utilizes the digital features of the cloud and the transformation between its qualitative concept and quantitative value expression. Thus, the model effectively conceals sensitive data and realizes web log mining based on privacy preservation. Results of the experiment reveal the feasibility and superiority of applying an evolutionary algorithm of the cloud model to privacy preservation in web log mining.

Ting Wang [24] proposes Butterfly, a lightweight countermeasure that can effectively eliminate these breaches without explicitly detecting them, while minimizing the loss of output accuracy. They further optimize the basic scheme by taking into account two types of semantic constraints, aiming at maximally preserving utility-related semantics while maintaining hard privacy and accuracy guarantees. They have conducted extensive experiments over real-life datasets to show the effectiveness and efficiency of their approach.

Pui Kuen Fong [25] introduces a privacy preserving approach that can be applied to decision tree learning, without concomitant loss of accuracy. It describes an approach to the preservation of the privacy of collected data samples in cases where information from the sample database has been partially lost. This approach converts the original sample data sets into a group of unreal data sets, from which the original samples cannot be reconstructed without the entire group of unreal data sets. Meanwhile, an accurate decision tree can be built directly from those unreal data sets. This novel approach can be applied directly to the data storage as soon as the first sample is collected. The approach is compatible with other privacy preserving approaches, such as cryptography, for extra protection.

Na Li [26] notes that publishing online social network (OSN) data leads to critical user concerns over privacy, especially sensitive relationships with others on OSNs. Existing anonymization techniques in publishing online social data focus on user identities, since users' relationship privacy is in general automatically protected if their identities are hidden. However, in reality, some users can still be identified from an identity-anonymized OSN by an attacker, as an individual user may publish his personal information to the public, through a blog for example, which can be exploited by the attacker to re-identify the user from the published data. Therefore, the aim is to preserve relationship privacy between two users even when one of them can be identified in the released OSN data. They define the l-diversity anonymization model to preserve users' relationship privacy. Additionally, they devise two algorithms to achieve l-diversity anonymization - one only removes edges while the other only inserts vertices/edges - maintaining as many topological properties of the original social networks as possible, thus retaining the utility of the published data for third parties. Extensive experiments are conducted on both synthetic and real-world social network data sets to demonstrate that, apart from achieving privacy preservation, the utility loss caused by their proposed graph-manipulation-based techniques is acceptable. They also analyze the influence of social network topology (e.g., average degree, network scalability) on the performance of their algorithms.

Tiancheng Li [27] presents a general framework for modeling the adversary's background knowledge using kernel estimation methods. This framework subsumes different types of knowledge (e.g., negative association rules) that can be mined from the data. Under this framework, they reason about privacy using Bayesian inference techniques and propose the skyline (B, t)-privacy model, which allows the data publisher to enforce privacy requirements to protect the data against adversaries with different levels of background knowledge. Through an extensive set of experiments, they show the effects of probabilistic background knowledge in data anonymization and the effectiveness of their approach in both privacy protection and utility preservation.

In reference [28], a scheme is developed to provide privacy preservation in a much simpler way with the help of a secure key management scheme and a randomized data perturbation technique. The authors consider a scenario in which two or more parties owning confidential data need to share it with a third party only for aggregation purposes, without revealing the content of the data. Through simulation results, the efficacy of the scheme is shown by comparing the results with one of the established schemes.

Deivanai, P. [29] proposes a new method for achieving k-anonymity (based on suppression) called 'kactus'. In this method, efficient multi-dimensional suppression is performed, i.e., values are suppressed only on certain records depending on other attribute values, without the need for manually-produced domain hierarchy trees. The method identifies attributes that have less influence on the classification of the data records and suppresses them if needed in order to comply with k-anonymity. The method was evaluated on several datasets to compare its accuracy against other k-anonymity based methods. Anonymization can be integrated with perturbation for privacy preservation in a multiparty environment.
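A simplified, record-level version of suppression-based k-anonymity can be sketched as follows. kACTUS itself suppresses individual attribute values multi-dimensionally, guided by classification influence, which this toy version does not attempt:

```python
from collections import Counter

def suppress_to_k(records, quasi_ids, k):
    # Suppress (replace by '*') the quasi-identifiers of any record
    # whose quasi-identifier combination occurs fewer than k times,
    # so every released combination is shared by at least k records
    # (fully suppressed records are conventionally treated as one group).
    keys = [tuple(r[a] for a in quasi_ids) for r in records]
    counts = Counter(keys)
    released = []
    for r, key in zip(records, keys):
        if counts[key] < k:
            r = {**r, **{a: "*" for a in quasi_ids}}
        released.append(dict(r))
    return released

records = [
    {"zip": "10001", "age": 34, "disease": "flu"},
    {"zip": "10001", "age": 34, "disease": "cold"},
    {"zip": "20002", "age": 58, "disease": "flu"},
]
out = suppress_to_k(records, ["zip", "age"], k=2)
print(out[2]["zip"])  # '*'
```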

Jin Ma [30] proposes an approach to correlate and analyze intrusion alerts while preserving privacy for alert holders. The raw intrusion alerts are protected by an improved k-anonymity model, which preserves the alert regularities inside the disturbed data records. Combining this privacy preserving technique with the typical FP-tree association rule mining algorithm, the approach balances alert correlation against privacy preservation. Experimental results show that the approach is comparatively efficient and reaches a good balance between alert correlation and privacy concerns.

Shuting Xu [31] proposes a Fast Fourier Transform (FFT) based method for data distortion and compares it with the Singular Value Decomposition (SVD) based method. The experimental results show that the FFT-based method can obtain performance similar to the SVD-based method in preserving privacy as well as maintaining the utility of the datasets; however, the computational time used by the FFT-based method is much less than that of the SVD-based method. They conclude that the FFT-based method is a very promising data distortion method.
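The Fourier-based distortion idea can be sketched with a plain O(n²) DFT to keep the example dependency-free (the paper uses the FFT for speed; the frequencies kept and the sample series are illustrative choices):

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) for f in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)).real / n for t in range(n)]

def fourier_distort(x, keep):
    # Keep only the `keep` lowest frequencies (and their conjugate
    # mirrors); individual values blur while the coarse shape of the
    # series -- and hence much of its mining utility -- survives.
    X = dft(x)
    n = len(X)
    return idft([Xf if min(f, n - f) < keep else 0
                 for f, Xf in enumerate(X)])

series = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
blurred = fourier_distort(series, keep=2)
# The mean (DC component) is preserved exactly.
print(abs(sum(blurred) / 8 - sum(series) / 8) < 1e-9)  # True
```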

Shuguo Han [32] proposes a preliminary formulation of gradient descent with data privacy preservation. They present two approaches - a stochastic approach and a least squares approach - under different assumptions. Four protocols are proposed for the two approaches, incorporating various secure building blocks for both horizontally and vertically partitioned data. They have conducted experiments to evaluate the scalability of the proposed secure building blocks and the accuracy and efficiency of the protocols for four different scenarios. The experimental results show that the proposed secure building blocks are reasonably scalable and that the protocols allow determining a better secure protocol for the applications in each scenario.

Aggarwal, C.C. [33] provide a first comprehensive analysis of the randomization method in the presence of public information. A quantification of the randomization method, referred to as k-randomization of the data, is defined. The inclusion of public information in the theoretical analysis of the randomization method results in a number of interesting and insightful conclusions, which expose some vulnerabilities of the randomization method. They illustrate that the randomization method is unable to effectively achieve privacy in the high dimensional case, and theoretically quantify the degree of randomization required to guarantee privacy as a function of the underlying data dimensionality. Furthermore, they show that the randomization method is susceptible to many natural properties of real data sets such as clusters or outliers. Finally, they illustrate that the use of public information makes the choice of perturbing distribution very critical in a number of subtle ways. The concluding analysis shows that the inclusion of public information makes the goal of privacy preservation more elusive than previously thought for the randomization method.

Duy Vu [34] demonstrates how to integrate a differential privacy framework with classical statistical hypothesis testing in the domain of clinical trials, where personal information is sensitive. A concrete methodology that researchers can use has been developed. They derive rules for sample size adjustment whereby both statistical efficiency and differential privacy can be achieved for specific tests on binomial random variables and in contingency tables.
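The differential privacy primitive underlying such analyses is the Laplace mechanism; the sample-size rules are Vu's contribution on top of it. A sketch for a count query follows (the epsilon value and counts are hypothetical):

```python
import math
import random

def dp_count(true_count, epsilon, rng=random):
    # Laplace mechanism: a count query has sensitivity 1, so adding
    # Laplace(1/epsilon) noise yields epsilon-differential privacy.
    # Noise is drawn via the inverse CDF of the Laplace distribution.
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return true_count + noise

random.seed(3)
answers = [dp_count(120, epsilon=1.0) for _ in range(20000)]
# Noise is zero-mean: averaged over many runs the true count emerges,
# while any single release hides whether one individual participated.
print(abs(sum(answers) / len(answers) - 120) < 0.5)  # True
```

The noise inflates the variance of any test statistic built on such counts, which is exactly why sample sizes must be adjusted upward to retain statistical power.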

Ge, R. [35] introduces a summarization approach for numerical data based on discs formalizing the notion of quality. The main objective is to find a minimal set of discs, i.e. spheres satisfying a radius and a significance constraint, covering the given dataset. Since the proposed problem is NP-complete, they design two different approximation algorithms. These algorithms have a quality guarantee, but they do not scale well to large databases. However, the machinery from approximation algorithms allows a precise characterization of a further, heuristic algorithm. This heuristic, efficient algorithm exploits multi-dimensional index structures and can be well-integrated with database systems. The experiments show that the heuristic algorithm generates summaries that outperform the state-of-the-art data bubbles in terms of internal measures as well as in terms of external measures when using the data summaries as input for clustering methods.

Singh, L. [36] focus on understanding the impact of network topology and node substructure on the level of anonymity present in the network. They present a new measure, topological anonymity, that quantifies the amount of privacy preserved in different topological structures. The measure uses a combination of known social network metrics and attempts to identify when node and edge inference breaches arise in these graphs.

Gui Qiong [37] proposed an effective privacy preserving distributed mining algorithm for association rules that addresses both privacy preservation and mining efficiency. Combining the advantages of the RSA public key cryptosystem and a homomorphic encryption scheme, a model of hierarchical management of the cryptogram is put forward in the algorithm. By introducing a cryptogram management server and a data mining server into the mining process, the algorithm quickly generates global K-frequent itemsets using a similarity matrix of transactions and effectively protects the security of sensitive data. As shown by the theoretical analysis and the experimental results, the algorithm achieves improvements in terms of privacy, accuracy, and efficiency.

Mathew, G. [38] proposes a framework for distributed knowledge mining that results in a useful clinical decision support tool in the form of a decision tree. This framework facilitates knowledge building using statistics based on patient data from multiple sites that satisfy a certain filtering condition, without the need for actual data to leave the participating sites. Their information retrieval and diagnostics supporting tool accommodates heterogeneous data schemas associated with participating sites. It also supports prevention of personally identifiable information leakage and preservation of privacy, which are important security concerns in the management of clinical data transactions. Results of experiments conducted on 8 and 16 sites with a small number of patients per site (if any) satisfying specific partial diagnostic criteria are presented. The experiments, coupled with restricting a fraction of attributes from sharing statistics and applying different privacy constraints at various sites, demonstrate the usefulness of the tool.

Naeem, M. [39] propose a novel architecture which uses other standard statistical measures, instead of the conventional framework of support and confidence, to generate association rules. Specifically, a weighing mechanism based on central tendency is introduced. The proposed architecture is experimentally evaluated on UCI datasets for hiding sensitive association rules, and a performance comparison is made between the new technique and the existing one. The new architecture generates no ghost rules and completely avoids failure in hiding sensitive association rules. It is thereby demonstrated that support and confidence are not the only measures usable in hiding sensitive association rules. This research is aimed at contributing to data mining areas where privacy preservation is a concern.

Yanguang Shen [40] studied privacy preserving distributed data mining. Existing methods focus on a universal approach that applies the same degree of preservation for all persons, without catering for their concrete needs. In view of this, they propose a new framework combining secure multiparty computation (SMC) with k-anonymity technology, achieving personalized privacy preserving distributed data mining based on a decision tree classification algorithm. Compared with other algorithms, this method achieves a good trade-off between privacy and accuracy, with high efficiency and low overhead in computing and communication.

Tsiafoulis, S.G. [41] observes that progress, in particular in the field of electronic data storage media and processing power, has raised serious ethical, philosophical and legal issues related to privacy protection. To cope with these concerns, several privacy preserving methodologies have been proposed, classified in two categories: methodologies that aim at protecting the sensitive data and those that aim at protecting the mining results. This work focuses on sensitive data protection and compares existing techniques according to the anonymity degree achieved, the information loss suffered and their performance characteristics. The l-diversity principle is combined with k-anonymity concepts, so that background information cannot be exploited to successfully attack the privacy of the data subjects the data refer to. Based on Kohonen Self Organizing Feature Maps (SOMs), the data sets are first organized into subspaces according to their information-theoretical distance to each other, then the most relevant classes are created, paying special attention to rare sensitive attribute values, and finally attribute values are generalized to the minimum extent required so that both the data disclosure probability and the information loss are kept as low as possible. Furthermore, they propose information-theoretical measures for assessing the anonymity degree achieved and empirical tests to demonstrate it.

Kadampur, M.A. [42] presents a method for privacy preserving clustering through cluster bulging. In this method, the objects of the database are first aligned into clusters based on a similarity measure. The data in these clusters is perturbed in a controlled manner by modifying the values of various objects, so that, in the perturbed data set, the clusters are bulged in comparison to those in the original data set. To perform this perturbation, every cluster is displaced along the line joining its centroid to the centroid of the whole data set, and then every object in each cluster is shifted along the line joining that object to the centroid of the cluster. The word bulging used here refers to both positive and negative bulging. The method in essence manipulates the similarity measures and recomputes the new perturbed objects of the respective clusters. Thus, every object in the bulged cluster represents its corresponding object from the original cluster. After the application of this method, the objects are perturbed while the number of member objects and the shape of each cluster remain the same as in the original clusters; thereby the information in the two instances of the data sets is sustained while the privacy of sensitive data is preserved.
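The displacement scheme described above might be sketched as follows. The bulging factors are hypothetical parameters, and this is one reading of the description, not the paper's exact algorithm:

```python
def centroid(points):
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def bulge(clusters, cluster_factor=1.2, point_factor=1.3):
    # Displace each cluster centroid away from the global centroid,
    # then push every point away from its own cluster centroid, so
    # clusters "bulge" while membership and shape are retained.
    g = centroid([p for c in clusters for p in c])
    dims = len(g)
    bulged = []
    for c in clusters:
        m = centroid(c)
        m2 = [g[d] + cluster_factor * (m[d] - g[d]) for d in range(dims)]
        bulged.append([[m2[d] + point_factor * (p[d] - m[d])
                        for d in range(dims)] for p in c])
    return bulged

clusters = [[[0.0], [2.0]], [[10.0], [12.0]]]
out = bulge(clusters)
print(out[0])  # points moved, cluster sizes unchanged
```

Factors below 1 would give negative bulging (shrinking toward the centroids), matching the paper's remark that bulging covers both directions.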

Weiwei Fang [43] focuses on privacy-preserving research in the setting of distributed decision-tree mining, and presents a decision-tree mining algorithm based on homomorphic encryption technology, which can obtain accurate mining results without sharing private information among mining participants. Theoretical analysis and experimental results show that this algorithm provides good privacy preservation, accuracy and efficiency.

Keng-Pei Lin [44] observes that SVM training decides which instances of the training dataset are support vectors, i.e., the necessarily informative instances that form the classifier. The support vectors are intact tuples taken from the training dataset, so releasing the SVM classifier to public use or shipping it to clients will disclose the private content of support vectors, violating the privacy-preservation requirement for some legal or commercial reasons. To the best of their knowledge, no previous work had extended the notion of privacy preservation to releasing the SVM classifier. In this paper, they propose an approximation approach which post-processes the SVM classifier to protect the private content of support vectors. The approach is designed for the commonly used Gaussian radial basis function kernel. By applying this post-processor to the SVM classifier, the resulting privacy-preserving SVM classifier can be publicly released without exposing the private content of support vectors and is able to provide classification accuracy comparable to the original SVM classifier.

Dan Hu [45] discusses the relation between the cores of partitioned data and global data. A useful proposition is obtained, which shows that the union of the cores of partitioned data is necessarily included in the core of the global data. Two algorithms, DMC and PPDMC, are then proposed for distributed mining of the core on horizontally partitioned data. DMC concerns the reduction of time complexity while PPDMC focuses on privacy preservation. Experiments and propositions show the effectiveness of DMC and PPDMC both practically and theoretically. Given the pivotal status of the core in rough set theory (RST), the algorithms proposed in this paper show good prospects for distributed data mining.

Fung, B.C.M. [46] propose a k-anonymization solution for classification. Their goal is to find a k-anonymization, not necessarily optimal in the sense of minimizing data distortion, which preserves the classification structure. They conducted intensive experiments to evaluate the impact of anonymization on classification of future data. Experiments on real-life data show that the quality of classification can be preserved even for highly restrictive anonymity requirements.

Challagalla, A. [47] proposes a technique for detecting outliers while preserving privacy, using hierarchical clustering methods. They analyze the technique to quantify the privacy it preserves and also prove that reverse engineering the perturbed data is extremely difficult.

Hsiang-Cheh Huang [48] proposes a functional scheme for the above-mentioned goals. They employ reversible data hiding, a newly developed branch of DRM research, to achieve these goals. Reversibility means that data, including patients' private information and diagnosis data, can be hidden in the medical image by some means; later, the medical image containing the data can be retrieved when necessary, and both the original image and the hidden data can be perfectly recovered. Data can be authenticated for enhanced privacy protection. Simulation results demonstrate the applicability of such an implementation.

Peter Christen [49] present an overview of current approaches to privacy-preserving data linkage and discuss their limitations. Using real-world scenarios, they illustrate the significance of developing improved techniques for automated, large-scale and distributed privacy-preserving linking and geocoding. They then discuss four core research areas that need to be addressed in order to make linking and geocoding of large confidential data collections feasible.

Yingjie Wu [50] present a novel technique based on the anatomy technique and propose a simple linear-time anonymization algorithm that meets the l-diversity requirement. Simulation experiments on real datasets, together with association rule mining on the anonymized transaction data, showed that the algorithm can safely and efficiently preserve privacy in transaction data publication while ensuring high utility of the released data.

Vreeken, J. [51] propose that generated data is an inherently safer alternative. A data generator based on the models obtained by the MDL-based KRIMP algorithm (Siebes et al., 2006) is presented. These models are accurate representations of the data distributions and can thus be used to generate data with the same characteristics as the original data. Experimental results show a very large pattern similarity between the generated and the original data, ensuring that viable conclusions can be drawn from the anonymized data. Furthermore, anonymity is guaranteed for suited databases and the quality-privacy trade-off can be balanced explicitly.

Lv Pin [52] provides a practical metric framework for implementing one model of k-anonymization, called the generalization-including-suppression metric, and introduces the Datafly algorithm for the metric method. The experiments show that the generalization-including-suppression metric is more precise than existing methods that focus on generalization alone.
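
The Datafly-style interplay of generalization and suppression can be sketched as follows. This is a minimal illustrative sketch, not the algorithm of [52]: the toy generalization hierarchies (ages to decades, ZIP codes losing trailing digits) and the attribute-selection rule are assumptions for the example.

```python
from collections import Counter

def datafly_generalize(records, k, hierarchy):
    """Sketch of Datafly-style k-anonymization: repeatedly generalize the
    attribute with the most distinct values until every quasi-identifier
    combination occurs at least k times; suppress the last few outliers."""
    rows = [list(r) for r in records]
    while True:
        counts = Counter(tuple(r) for r in rows)
        rare = [t for t, c in counts.items() if c < k]
        if not rare:
            return [tuple(r) for r in rows], []
        if sum(counts[t] for t in rare) <= k:
            # few outliers left: suppress them instead of generalizing more
            kept = [tuple(r) for r in rows if counts[tuple(r)] >= k]
            gone = [tuple(r) for r in rows if counts[tuple(r)] < k]
            return kept, gone
        # generalize the attribute with the most distinct values
        attr = max(range(len(rows[0])),
                   key=lambda i: len({r[i] for r in rows}))
        for r in rows:
            r[attr] = hierarchy[attr](r[attr])

# toy hierarchies: ages generalize to decades, ZIP codes lose last digit
hier = {0: lambda a: (a // 10) * 10 if isinstance(a, int) else a,
        1: lambda z: z[:-1] + "*"}
data = [(23, "4711"), (27, "4712"), (21, "4713"), (45, "9999")]
anon, suppressed = datafly_generalize(data, k=3, hierarchy=hier)
```

A production implementation would track generalization levels per attribute and use user-supplied hierarchies; the sketch only shows the generalize-then-suppress control flow.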

Keng-Pei Lin [53] proposes an approach that post-processes an SVM classifier to transform it into a privacy-preserving classifier which does not disclose the private content of support vectors. The post-processed SVM classifier, which does not expose the private content of the training data, is called the Privacy-Preserving SVM Classifier (abbreviated as PPSVC). The PPSVC is designed for the commonly used Gaussian kernel function and precisely approximates the decision function of the Gaussian kernel SVM classifier without exposing the sensitive attribute values possessed by support vectors. By applying the PPSVC, the SVM classifier can be publicly released while preserving privacy. They prove that the PPSVC is robust against adversarial attacks, and experiments on real datasets show that its classification accuracy is comparable to the original SVM classifier.

Shaofeng Bu [54] shows that it is possible to minimize disclosure while guaranteeing no outcome change. They conduct their investigation in the context of building a decision tree and propose transformations that preserve the exact decision tree. Through a detailed set of experiments, they illustrate that substantial protection of both input data privacy and mining output privacy can be achieved.

Zhihui Wang [55] addresses the problem of better preserving private knowledge by proposing an item-based pattern sanitization to prevent the disclosure of private patterns. They also present two strategies to generate a safe and shareable pattern set for preserving private knowledge in frequent pattern mining.

Oliveira, S.R.M. [56] introduce new algorithms for balancing privacy and knowledge discovery in association rule mining. They illustrate that their algorithms require only two scans, regardless of the database size and the number of restrictive association rules that must be protected. A performance study compares the effectiveness and scalability of the proposed algorithms and analyzes the fraction of association rules preserved after sanitizing a database. They also report the main results of their performance evaluation and discuss some open research issues.

Turaga, D.S. [57] focuses on the popular k-means clustering algorithm and demonstrates how a properly constructed compression scheme based on post-clustering quantization is capable of maintaining the global cluster structure. Their analytical derivations indicate that a 1-bit moment-preserving quantizer per cluster is sufficient to retain the original data clusters. Merits of the proposed compression technique include: a) reduced storage requirements with clustering guarantees, b) data privacy on the original values, and c) shape preservation for data visualization purposes. They evaluate the quantization scheme on various high-dimensional datasets, including 1-dimensional and 2-dimensional time-series (shape datasets), and demonstrate the cluster-preservation property. They also compare with previously proposed simplification techniques in the time-series area and show significant improvements in both the clustering and shape preservation of the compressed datasets.
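
The flavor of a 1-bit per-cluster quantizer can be sketched in a few lines. This simplified sketch preserves only the cluster mean (the first moment) by replacing each member with one of two representative levels; the quantizer analyzed in [57] is more refined, so treat the split-at-the-mean rule below as an assumption of the example.

```python
def one_bit_quantize(cluster):
    """Replace each cluster member by one of two levels: the mean of the
    points below the cluster mean, or the mean of the points at/above it.
    The cluster mean (first moment) is preserved exactly."""
    mu = sum(cluster) / len(cluster)
    low = [x for x in cluster if x < mu]
    high = [x for x in cluster if x >= mu]
    lo_level = sum(low) / len(low) if low else mu
    hi_level = sum(high) / len(high) if high else mu
    return [lo_level if x < mu else hi_level for x in cluster]

vals = [1.0, 2.0, 3.0, 10.0]
q = one_bit_quantize(vals)
```

Each point needs only one bit (which of the two levels) plus the two levels per cluster, which is where the storage reduction comes from.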

Mhatre, A. [58] propose a procedure to protect the privacy of data by adding noisy items to each transaction. Experimental results indicate that this method can achieve a rather high level of accuracy. The method is applied on top of an existing frequent pattern mining algorithm, PISA, which works on both static and dynamically growing databases, thereby broadening the applicability of the module.
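
The core idea of masking true purchases with noise items can be sketched as below. This is an illustrative assumption of how such noise addition might look, not the procedure of [58]; the item universe and noise count are invented for the example.

```python
import random

def add_noise_items(transactions, universe, noise_per_txn, seed=0):
    """Append randomly chosen fake items to every transaction so an
    observer cannot tell real items from noise (illustrative sketch)."""
    rng = random.Random(seed)
    noisy = []
    for txn in transactions:
        candidates = [i for i in universe if i not in txn]
        fake = rng.sample(candidates, min(noise_per_txn, len(candidates)))
        noisy.append(sorted(set(txn) | set(fake)))
    return noisy

txns = [["bread", "milk"], ["beer"]]
universe = ["bread", "milk", "beer", "eggs", "soap"]
out = add_noise_items(txns, universe, noise_per_txn=2)
```

A miner who knows the noise rate can still estimate true item supports statistically, which is why accuracy can remain high despite the perturbation.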

Dan Hu [59] discusses the relation between partitioned data and global data. A useful proposition is obtained, which shows that the reduct of the global data has as subsets the elements of the reducts of the partitioned data. Two algorithms, DMR and PPDMR, are then proposed for distributed mining of reducts on horizontally partitioned data. DMR concerns the reduction of time complexity, while PPDMR focuses on privacy preservation. Experiments and propositions demonstrate the effectiveness of DMR and PPDMR in both practical and theoretical terms. Given the pivotal status of reducts in rough set theory (RST), the algorithms proposed in this paper show good prospects for distributed data mining.

Mumtaz, S. [60] propose a data perturbation technique called uniformly adjusted distortion, which initially distorts one cell and then uniformly distributes this distortion over the whole data cube. This uniform distribution not only preserves the aggregates but also provides maximum accuracy for range-sum queries and high availability.
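
The "distort one cell, spread the distortion" idea can be sketched for a tiny 2-D cube. This is a minimal sketch assuming that spreading the negated distortion equally over all cells is what keeps the full-cube sum unchanged; the actual technique in [60] handles multidimensional cubes and range-sum accuracy more carefully.

```python
def uniformly_adjusted_distortion(cube, cell, delta):
    """Distort one cell by +delta, then spread -delta uniformly over all
    cells so the aggregate over the full cube is unchanged (sketch)."""
    n = len(cube) * len(cube[0])
    out = [[v - delta / n for v in row] for row in cube]
    r, c = cell
    out[r][c] += delta
    return out

cube = [[10.0, 20.0], [30.0, 40.0]]
pert = uniformly_adjusted_distortion(cube, (0, 0), 8.0)
```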

Xiaofeng Zhang [61] defines the optimal abstraction task as a game and computes the Nash equilibrium as its solution. They also propose an iterative version of the game so that the Nash equilibrium can be computed by actively exploring details from the local sources in a need-to-know manner. They tested the proposed game-theoretic approach on a number of datasets for model-based clustering, with promising results.

Dong Seong Kim [62] presents privacy-preserving data mining based on Support Vector Machines (SVM). They review previous approaches to privacy-preserving data mining in distributed systems, as well as energy-efficient data mining in WSNs, and then propose an energy-efficient privacy-preserving data mining approach for WSNs. They use SVM because it has shown the best classification accuracy and offers sparse data representation using support vectors. They present a security analysis and an energy estimation of the proposed approach.

Bo Peng [63] proposes a class of novel data distortion strategies. Four schemes via attribute partition, with different combinations of singular value decomposition (SVD), nonnegative matrix factorization (NMF), and discrete wavelet transformation (DWT), are designed to perturb sub-matrices of the original datasets for privacy protection. Several metrics are used to measure the performance of the proposed strategies, and data utility is examined using a binary classifier based on the support vector machine. Their experimental results indicate that, in comparison with the individual data distortion techniques, the proposed schemes are very efficient in achieving a good trade-off between data privacy and data utility, and provide a feasible solution for collaborative data analysis.
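
The SVD ingredient of such distortion schemes works by publishing a low-rank approximation of the data: dominant structure (useful for mining) survives, while fine-grained, individually identifying detail is dropped. The sketch below keeps only the rank-1 component, using a small power-iteration routine as a stand-in for a full SVD; the choice of rank and the combination with NMF/DWT in [63] are beyond this illustration.

```python
import random

def rank1_approx(A, iters=100, seed=0):
    """Dominant rank-1 SVD component of A via power iteration
    (a tiny stand-in for a full SVD routine)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]

def svd_distort(A):
    """Publish only the dominant component; finer components, which
    carry individually identifying detail, are dropped (sketch)."""
    return rank1_approx(A)

A = [[1.0, 2.0], [2.0, 4.1], [3.0, 6.0]]
P = svd_distort(A)
```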

Karthikeswaran, D. [64] presents a novel approach that strategically modifies a few transactions in the database, altering support or confidence values to hide sensitive rules without producing many side effects. Nevertheless, undesired side effects, such as non-sensitive rules being falsely hidden and spurious rules being falsely generated, may still be produced in the rule-hiding process.

Luong The Dung [65] developed a method for the more difficult problem in which the dataset is horizontally partitioned into only two parts. The key question is how to compute and reveal only the covariance matrix at various steps of the EM iterative process to the participating parties. They propose a method consisting of several protocols that provide privacy preservation for the computation of covariance matrices and final results without revealing the private information and the means. They also extend the proposed method to give a better solution to the problem of privacy-preserving k-means clustering.

Hongjun Wang [66] introduces a model for privacy-preserving clustering which can handle the problems of privacy preservation and distributed computing. First, the latent variables in latent Dirichlet conditional Naive-Bayes (LD-CNB) models are redefined and some terminology is established. Second, variational approximate inference for LD-CNB is stated in detail. Third, based on this inference, they design a distributed EM algorithm for privacy-preserving clustering. Finally, some datasets from UCI are chosen for experiments; compared with the distributed k-means algorithm, the results show that LD-CNB works better, and since LD-CNB can run in a distributed fashion, it can protect private information.

Hoh, Baik [67] show that data mining techniques, such as clustering, can reconstruct private information from such anonymous traces. To meet this new challenge, they propose an enhanced privacy-preserving algorithm to control the release of location traces near origins/destinations and evaluate it using real-world GPS location traces.
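
Controlling release near origins and destinations can be sketched as simple endpoint suppression: drop every point within some radius of the first and last fix of a trace, so likely home/work locations stay hidden. This planar-distance sketch is an assumption for illustration; the algorithm in [67] is more sophisticated.

```python
def suppress_near_endpoints(trace, radius):
    """Drop GPS points within `radius` of the trace's first and last
    points, hiding likely origin/destination locations (sketch)."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    origin, dest = trace[0], trace[-1]
    return [p for p in trace
            if dist(p, origin) > radius and dist(p, dest) > radius]

trace = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (5.5, 0.0), (6.0, 0.0)]
released = suppress_near_endpoints(trace, radius=1.0)
```

Real GPS coordinates would need great-circle distances rather than the planar distance used here.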

Geetha Jagannathan [68] investigates privacy-preserving data imputation on distributed databases. They present a privacy-preserving protocol for filling in missing values using a lazy decision tree imputation algorithm for data that is horizontally partitioned between two parties. The participants of the protocol learn only the imputed values; the computed decision tree is not learned by either party.

Weijia Yang [69] proposed a novel method to protect data privacy in data mining. Privacy is becoming an increasingly important issue in many data mining applications. Among the current privacy-preserving techniques, data anonymization provides a simple and effective way to protect sensitive data. However, in most of the related algorithms, data details are lost and the resulting dataset is far less informative than the original one. In their method, they adopt a statistical approach to anonymize the dataset and are able to preserve not only the data details but also the useful data knowledge. They also analyze in detail the accuracy and privacy levels of their method. Experimental results further demonstrate the effectiveness of the method in comparison to existing methods.

Mark Shaneck [70] addressed the issue of privacy-preserving nearest neighbor search, which forms the kernel of many data mining applications. To this end, they present a novel algorithm based on secure multiparty computation primitives to compute the nearest neighbors of records in horizontally distributed data. They then illustrate how this algorithm can be used in three important data mining algorithms, namely LOF outlier detection, SNN clustering, and kNN classification.

Goo, S.K. [71] aims to promote user awareness of privacy policies and to tackle malicious data extraction and selling, with a focus on assistive technologies for a diverse mix of services, in applications ranging from healthcare to smart shopping.

Poovammal, E. [72] developed a method to solve the external linkage problem resulting in sensitive attribute disclosure. It is very easy to prevent sensitive attribute disclosure by simply not publishing quasi-identifiers and sensitive attributes together; the only reason to publish generalized quasi-identifiers and sensitive attributes together is to support data mining tasks that consider both types of attributes. Their goal is to eliminate the privacy breach (how much an adversary learns from the published data) and increase the utility (accuracy of the data mining task) of a released database. This is achieved by transforming part of the quasi-identifier and personalizing the sensitive attribute values. Their experiments, conducted on datasets from the UCI machine learning repository, demonstrate an incremental gain in data mining utility while preserving privacy to a great extent.

Macek, J. [73] demonstrate the practical functionality of their algorithm on the task of ECG record classification. Their results are promising, since comparable or superior accuracy is achieved when compared with results obtained by other existing methods for ECG record classification, namely the C5.0 decision tree algorithm.

Xingping Wen [74] applied the CART (classification and regression trees) and C5.0 decision tree algorithms to CBERS-02 remote sensing data. First, the remote sensing data was transformed using principal component analysis (PCA) and a multiple-band algorithm. Then, training data was collected from the combined total of 20 processed bands. Finally, decision trees were constructed by the CART and C5.0 algorithms respectively. Comparing the two results, the most important variables are clearly bands 3,4, bands 1,4, and bands 2,4. The depth of the CART tree is only two, with relatively high accuracy, and the classification outcome was computed from the CART tree.

Xiaoyu Chen [75] proposes a new discriminant diagnosis model constructed using attribute selection, the C5.0 decision tree algorithm, and discriminant analysis, which consists of two phases. One is attribute selection, in which the critical attributes are filtered out from the original attributes. The other is the modeling phase, which acquires discriminants between syndromes of chronic hepatitis B and syndrome information in TCM. In their experiments, combinations of TCM clinical symptoms and lab indicators were selected from the original 247 symptoms to provide formulas for syndrome differentiation of chronic hepatitis B in TCM, and the model shows good prospects for application in TCM diagnosis.

Borah, M.D. [76] propose a new attribute selection measure function (heuristic) on the existing C4.5 algorithm. The advantage of the heuristic is that the split information never approaches zero, hence producing a stable rule set and decision tree. In the admission process of engineering colleges in India, a student typically takes admission based on AIEEE rank and family pressure; if the student does not get admission in the desired branch of engineering, it is difficult to decide which branch would be suitable. The proposed knowledge-based decision technique guides the student toward admission in the proper branch of engineering. The accuracy rates of a decision tree algorithm (C5.0) and a back-propagation algorithm (ANN) are also compared to find out which is more accurate for decision making. The AIEEE2007 database was used in this work.

Yang Huaizhen [77] proposed a forecasting model for service marketing based on cluster analysis combined with decision trees. The work describes the k-means algorithm, the C5.0 decision tree algorithm, and the design of the model, and applies the model to predict whether users in a region will accept the interactive service of cable television, thereby finding the user group with the highest response rate. The results show that the model finds high-response-rate user groups more easily than predicting with a decision tree alone, and it also improves classification accuracy to a greater extent. The framework presented provides enterprises with an important basis for making effective market decisions in commercial competition.

Jiao-Jiao Wang [78] focused on comparing the classification performance and accuracy of decision tree algorithms (CHAID, CART, QUEST, and C5.0) for short-term urban traffic flow conditions. In building the decision tree models, the input variables were the traffic flow condition values of multiple roads at the current time, while the target variable was a certain road's condition value at a future temporal horizon of 5-30 min. The results showed that when all predictors were input without feature selection, the classification accuracy obtained by the CART algorithm was higher than the other three algorithms. When using CART and CHAID with feature selection, the accuracy was lower, but the obtained decision tree was more concise and understandable, with fewer nodes; moreover, after enlarging the training samples to about 10 times the original size, the accuracy with feature selection became higher than that without feature selection.

Gu Yu [79] first introduces the decision tree and C5.0 algorithms in data mining, then introduces financial analysis methods, the problems that need attention in application, and the attribute selection process. Finally, they study the financial ratios of listed logistics companies using the SPSS Clementine 12.0 software. The accuracy of this model is as high as 95.83%.

Tung-Hsu Hou [80] developed a fuzzy rule-based reasoning system for setting up a nano-particle milling process. A characteristic of the proposed system is that fuzzification and rule extraction are data-driven rather than relying directly on domain experts. In addition, the regulation scheme of the parameter-setting system adjusts the process parameters based on their current deviation from the optimal parameters. The verification results show that the proposed system can indeed guide engineers in setting process parameters when a nano-particle milling process has shifted, bringing the milling process from its current shifted condition back to a near-optimal condition.

Jiri, K. [81] presents one of many possibilities of decision theory that can be used in modeling the quality of life in a given city in the Czech Republic. Real data sets from citizen questionnaires for the city of Chrudim were analyzed, pre-processed, and used in the classification models. Classifier models based on the C5.0, CHAID, C&RT, and boosted C5.0 algorithms were proposed and analyzed. The aim of the analysis was to judge the suitability of decision trees for classifying the defined system.

Pourebrahimi, A. [82] presents an opportunity to significantly increase the rate at which the volumes of data generated through the maintenance process can be turned into useful information, using classification algorithms to discover patterns and correlations within a large volume of data. This study uses the C5.0 decision tree algorithm, an improved version of the C4.5 and ID3 algorithms, to diagnose machine status and detect failures of unhealthy machines in a vibration signal database containing vibration amplitudes and frequencies in three planes (horizontal, vertical, and axial) of centrifugal pumps. The results were compared with some popular classifiers such as QUEST, CART, and CHAID, showing that C5.0's performance is remarkably better than the others. C5.0 identifies all three classes of machine status with 92.08% accuracy on the test subset, and furthermore it accurately detects 4 of 5 failure classes.

Zhixian Niu [83] presents an approach to automatically identify a DBMS workload as either OLTP or OLAP. They use the C5.0 algorithm to construct a set of classifiers based on the characteristics that differentiate OLTP and OLAP, and then use the classifiers to identify the workload type. The experiments show that the classifiers can accurately identify OLTP and OLAP workloads.

Doug Won Choi [84] focuses on creating hierarchies which have a relationship between parent and child nodes but not between siblings. When evaluating or classifying certain objects (e.g., service quality), a conceptual hierarchy with various items (concepts) at its nodes is often used; if the target of the evaluation/classification has more complicated features, a more complicated conceptual hierarchy is needed, and currently most conceptual hierarchies are constructed qualitatively. This paper presents two new quantitative approaches to constructing a conceptual multi-level hierarchy. They are novel in the sense that a multi-level hierarchy (conceptual relationship knowledge) can be generated from a set of questionnaire survey data by applying factor analysis, structural equation modeling, and decision tree induction (C5.0) techniques. Through factor analysis, they discovered pattern knowledge hidden in the questionnaire survey data; considering questionnaire survey data as another form of knowledge elicitation for pattern discovery is a fresh idea. AHP (Analytic Hierarchy Process) is a widely used technique for multi-criteria decision making, and the significance of this paper is that the approach can substitute for the qualitative stage of hierarchy building in AHP, which is renowned for its weakness in hierarchy building.

Skevofilakas, M. [85] present a study to design and develop a Decision Support System (DSS), closely coupled with an Electronic Medical Record (EMR), able to predict the risk of a Type 1 Diabetes Mellitus (T1DM) patient developing retinopathy. The proposed system can store a wealth of information regarding the clinical state of the T1DM patient and continuously provide health experts with predictions regarding possible future complications the patient may present. The DSS is a hybrid infrastructure combining a Feedforward Neural Network (FNN), a Classification and Regression Tree (CART), and a Rule Induction C5.0 classifier with an improved Hybrid Wavelet Neural Network (iHWNN); a voting mechanism merges the results from the four classification models. The DSS has been trained and evaluated using data from 55 T1DM patients, acquired by the Athens Hippokration Hospital in close collaboration with the EURODIAB research team, and has shown excellent performance, resulting in an accuracy of 98%. Care has been taken to design and implement a consistent and continuously evolving Information Technology (IT) system, utilizing technologies such as smart agents periodically triggered to retrain the DSS with new cases added to the data repository.

Lee, J.W.T. [86] presents a matching method that can improve the classification performance of a fuzzy decision tree (FDT). This method takes into consideration the prediction strength of the leaf nodes of a fuzzy decision tree by combining the truth degrees (CF) of fuzzy rules generated from the tree with the membership degrees of the antecedent parts of the rules when applied to cases for classification. They illustrate the importance of CF through an example. An experiment shows that this method obtains more accurate classification results when compared to the original method and to those obtained using the C5.0 decision tree.

Fung, B.C.M. [87] presents a practical and efficient algorithm for determining a generalized version of data that masks sensitive information yet remains useful for classification modeling. The generalization is implemented by specializing, or detailing, the level of information in a top-down manner until a minimum privacy requirement would be violated. This top-down specialization is natural and efficient for handling both categorical and continuous attributes. The proposed approach exploits the fact that data usually contains redundant structures for classification: while generalization may eliminate some structures, others emerge to help. The results show that the quality of classification can be preserved even for highly restrictive privacy requirements. This work has great applicability to both public and private sectors that share information for mutual benefit and productivity.

Parmar, A.A. [88] propose a blocking-based approach for sensitive classification rule hiding. First, the transactions supporting the sensitive rules are found. Then, known values in those transactions are replaced with the unknown value ("?") to hide a given sensitive classification rule. Finally, a sanitized dataset is generated from which the sensitive classification rules can no longer be mined. Experimental results of the algorithm are also discussed.
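
The blocking idea can be sketched as replacing the class value with "?" in every record supporting the sensitive rule. This coarse all-supporting-records version is a deliberate simplification for illustration; a real algorithm, as in [88], would blank a minimal subset of values to limit side effects.

```python
def block_sensitive_rule(dataset, rule_antecedent, rule_class, class_attr):
    """In every record supporting the sensitive rule, replace the known
    class value with the unknown value '?' (blocking sketch)."""
    sanitized = []
    for rec in dataset:
        if all(rec.get(a) == v for a, v in rule_antecedent.items()) \
                and rec.get(class_attr) == rule_class:
            rec = dict(rec, **{class_attr: "?"})  # copy, then blank
        sanitized.append(rec)
    return sanitized

data = [{"age": "young", "income": "high", "buys": "yes"},
        {"age": "young", "income": "high", "buys": "yes"},
        {"age": "old", "income": "low", "buys": "no"}]
safe = block_sensitive_rule(data, {"age": "young", "income": "high"},
                            "yes", "buys")
```

After sanitization, a rule miner can no longer derive "young, high income => buys" with full confidence, since the supporting class values are unknown.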

Tripathy, B.K. [89] proposed a scheme for the anonymization of social networks, which is an initiative in this direction and provides a partial solution to the problem. The earlier algorithm cannot handle situations in which an adversary has knowledge about vertices in the second or higher hops of a vertex, in addition to its immediate neighbors. In this paper, they propose a modification to that algorithm for network anonymization which can handle such situations. In doing so, they use an algorithm for graph isomorphism based on the adjacency matrix instead of the earlier DFS-based approach. More importantly, the time complexity of their algorithm is less than that of Zhou and Pei.

Weiheng Zhu [90] discusses in depth the key concerns and implementation techniques related to privacy preservation in e-government, before providing a solution which best trades off the demand for information disclosure against the concerns about privacy preservation.

Kevenaar, T.A.M. [91] considers generating binary feature vectors from biometric face data such that their privacy can be protected using recently introduced helper data systems. They explain how the binary feature vectors can be derived and investigate their statistical properties. Experimental results on subsets of the FERET and Caltech databases show only a slight degradation in classification results when using the binary rather than the real-valued feature vectors. Finally, the scheme to extract the binary vectors is combined with a helper data scheme, leading to renewable and privacy-preserving facial templates with acceptable classification results, provided that the within-class variation is not too large.

SathiyaPriya, K. [92] propose a method to hide fuzzy association rules, in which the fuzzified data is mined using a modified Apriori algorithm in order to extract rules and identify sensitive ones. The sensitive rules are hidden by decreasing the support value of the Right Hand Side (RHS) of the rule. A framework for automated generation of membership functions is also proposed. Experimental results of the proposed approach demonstrate efficient information hiding with minimal side effects.
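
The support-decreasing idea can be illustrated on crisp transactions: remove the RHS item from supporting transactions until the rule's support falls below the mining threshold. This crisp sketch is an assumption for illustration; [92] operates on fuzzified data with membership degrees.

```python
def hide_rule_rhs(transactions, lhs, rhs, min_support):
    """Remove the RHS item from supporting transactions until the rule
    lhs => rhs drops below min_support (crisp sketch of RHS hiding)."""
    txns = [set(t) for t in transactions]

    def support():
        return sum(1 for t in txns if lhs <= t and rhs in t) / len(txns)

    for t in txns:
        if support() < min_support:
            break
        if lhs <= t and rhs in t:
            t.discard(rhs)  # weaken the rule by one supporting transaction
    return [sorted(t) for t in txns], support()

txns = [["a", "b"], ["a", "b"], ["a", "b"], ["c"]]
hidden, sup = hide_rule_rhs(txns, lhs={"a"}, rhs="b", min_support=0.5)
```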

Gao Ai-qiang [93] study the problem of utility-based anonymization, concentrating on attribute-order-sensitive workloads, where the order of the attributes is important to the analysis workload. Based on the multidimensional anonymization concept, a method is discussed for attribute-order-sensitive utility-based anonymization. A performance study using public datasets shows that efficiency is not affected by the attribute-order processing.

Aggarwal, C. [94] provides the first comprehensive analysis of the randomization method in the presence of public information, and of the effects of high dimensionality on randomization. The goal is to examine the strengths and weaknesses of randomization and to explore both the potential and the pitfalls of randomization in the presence of public information. Including public information in the theoretical analysis of the randomization method yields a number of interesting and insightful conclusions. (1) The privacy effects of randomization reduce rapidly with increasing dimensionality; at high dimensionality, practically no privacy is possible. (2) The properties of the underlying dataset can affect the anonymity level of the randomization method. For example, natural properties of real datasets such as clustering improve the effectiveness of randomization, while variations in the data density of non-empty data localities and outliers create privacy-preservation challenges. (3) The presence of public information makes the choice of perturbing distribution more critical than previously thought; in particular, Gaussian perturbations are significantly more effective than uniformly distributed perturbations in the high-dimensional case. These insights are very useful for future research and design of the randomization method.
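
The randomization method itself is simple to state: each value is released with independent noise added, here drawn from the Gaussian distribution that [94] finds more effective in high dimensions. The dataset and noise scale below are illustrative assumptions.

```python
import random

def gaussian_perturb(values, sigma, seed=42):
    """Add i.i.d. Gaussian noise to each value before release: individual
    values are masked, while the aggregate distribution can still be
    reconstructed statistically (randomization-method sketch)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

ages = [23.0, 35.0, 41.0, 58.0]
released = gaussian_perturb(ages, sigma=5.0)
```

The analysis in [94] shows that an adversary holding public information can still narrow down identities as dimensionality grows, which is why the noise distribution and scale must be chosen carefully.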

Jinfei Liu [95] presents a novel method, rating, for publishing sensitive data. Rating releases an AT (Attribute Table) and an IDT (ID Table) based on different sensitivity coefficients for different attributes. This approach not only protects privacy for multiple sensitive attributes but also keeps a large amount of the correlations of the microdata. Algorithms for computing the AT and IDT have been developed that obey the privacy requirements for multiple sensitive attributes and maximize the utility of the published data. They prove both theoretically and experimentally that the method performs better than conventional privacy-preserving methods at protecting privacy and maximizing the utility of published data. To quantify the utility of published data, they propose a new measure named the classification measurement.
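
The AT/IDT separation can be sketched in an anatomy-like fashion: quasi-identifiers and sensitive values are published in two tables linked only by a group id, so no row-level link survives. The fixed group size of 2 and the attribute split below are assumptions for illustration; the actual rating method assigns groupings from per-attribute sensitivity coefficients.

```python
def publish_rating(records, quasi_attrs, sensitive_attrs):
    """Split records into an IDT (quasi-identifiers) and an AT (sensitive
    values), linked only through a shared group id (simplified sketch)."""
    at, idt = [], []
    for i, rec in enumerate(records):
        gid = i // 2  # toy grouping: pairs of records share a group
        idt.append({"gid": gid, **{a: rec[a] for a in quasi_attrs}})
        at.append({"gid": gid, **{a: rec[a] for a in sensitive_attrs}})
    return at, idt

recs = [{"zip": "471", "age": 23, "disease": "flu"},
        {"zip": "471", "age": 27, "disease": "hiv"},
        {"zip": "999", "age": 45, "disease": "flu"},
        {"zip": "998", "age": 47, "disease": "cold"}]
AT, IDT = publish_rating(recs, ["zip", "age"], ["disease"])
```

An adversary who links a person to a group still only learns that the person's sensitive value is one of the values in that group.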

Emilin Shyni, C. [96] provides the capability to store the various types of information users reveal during their activities. A key feature of the model is that it allows multiple purposes to be associated with each data element and also supports explicit prohibitions, thus allowing privacy policies to specify that some data should not be used for certain purposes. To maintain consistency between the privacy policy and actual practice, privacy protection requirements in the privacy policy should be formally specified.

Hongjun Wang [97] introduces the model of distributed latent Dirichlet allocation (D-LDA) for an object-distributed cluster ensemble, which can handle the problems of privacy preservation, distributed computing, and knowledge reuse. First, the latent variables in D-LDA and some terminology are defined for the cluster ensemble. Second, Markov chain Monte Carlo (MCMC) approximate inference for D-LDA is stated in detail. Third, some datasets from UCI are chosen for experiments; compared with the cluster-based similarity partitioning algorithm (CSPA), the hyper-graph partitioning algorithm (HGPA), and the meta-clustering algorithm (MCLA), the results show that D-LDA works better. Furthermore, the output of D-LDA, as a soft cluster model, can not only cluster the data points but also reveal the structure of the data points.

Xuning Tang [98] introduce the KNN and EBB algorithms for constructing generalized subgraphs, and a mechanism to integrate the generalized information to conduct closeness centrality measures. The results show that the proposed technique improves the accuracy of closeness centrality measures substantially while protecting the sensitive data.

Dhiraj, S.S.S. [99] proposes a data perturbation technique for privacy preservation in k-means clustering. Data objects that have been partitioned into clusters using k-means clustering are perturbed by performing geometric transformations on the clusters in such a way that the object membership of each cluster and the orientation of objects within a cluster remain the same. This geometric transformation is achieved through cluster rotation, i.e., every cluster is rotated about its own centroid. The clusters are first displaced away from the mean of the entire dataset so that no two clusters overlap after the subsequent cluster rotation. The privacy offered by this data perturbation technique is analyzed, and it is shown that a dataset perturbed by this method cannot be easily reverse-engineered, yet remains relevant for cluster analysis.
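
The rotation step is easy to make concrete in 2-D: rotating a cluster about its own centroid changes every coordinate but preserves the centroid and all within-cluster distances, so cluster membership is unaffected. A minimal sketch (2-D only; [99] also includes the cluster-displacement step, which is omitted here):

```python
import math

def rotate_cluster(points, theta):
    """Rotate 2-D points by angle theta about their own centroid;
    the centroid and all pairwise distances are unchanged."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    c, s = math.cos(theta), math.sin(theta)
    return [(cx + c * (x - cx) - s * (y - cy),
             cy + s * (x - cx) + c * (y - cy)) for x, y in points]

pts = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0)]
rot = rotate_cluster(pts, math.pi / 3)
```

Because rotation is an isometry, any distance-based clustering run on the perturbed points recovers the same clusters as on the originals.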

Wu Xiao-dan [100] presents a generic PPDM framework and a classification scheme for centralized databases, adopted from earlier studies, to guide the review process. The frequencies of the different techniques/algorithms used are tabulated and analyzed. A set of metrics and a theoretical framework are also proposed for assessing the relative performance of selected PPDM algorithms. Finally, they share directions for future research.

Ling Guo [101] investigates the randomization approach, focusing on attribute disclosure under linking attacks. They give efficient solutions for determining optimal distortion parameters so as to maximize utility preservation while still satisfying privacy requirements, and they compare their randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects.