Anonymity-Based Privacy-Preserving Techniques


Abstract - In recent years, the development of information technology has drastically transformed the way enterprises operate, and the flow of data has become their lifeblood. In such an environment, people worry about the disclosure of their privacy and are likely to provide phony information rather than authentic data. As data analysis and processing techniques develop, the privacy of individuals and companies is inevitably put at risk when data are released or shared in order to mine useful decision information and knowledge; this has given birth to the research field of privacy preservation. Privacy-preserving technology has become a focus of researchers at home and abroad in recent years. To protect individual privacy, anonymity techniques have been proposed that decouple sensitive attributes from the corresponding identifiers. This paper reviews several anonymity-based privacy-preserving technologies and analyzes their merits and shortcomings.

Keywords - privacy preserving, anonymity

I. INTRODUCTION

With the development of data analysis and processing techniques, organizations, industries and governments are increasingly publishing microdata (i.e., data that contain unaggregated information about individuals) for purposes such as data mining, studying disease outbreaks, or analyzing economic patterns. While the released datasets provide valuable information to researchers, they also contain sensitive information about individuals whose privacy may be at risk [1].

For example, a hospital may release patients' diagnosis records so that researchers can study the characteristics of various diseases. The raw data, also called microdata, contain the identities (e.g., names) of individuals; these are removed before release to protect privacy.

However, there may exist other attributes that can be used, in combination with an external database, to recover the personal identities.

Now assume that the hospital publishes the data in Table I, which does not explicitly indicate the names of patients. However, if an adversary has access to the voter registration list in Table II, he can easily discover the identities of all patients by joining the two tables on {Age, Sex, Zipcode}.

When releasing microdata for research purposes, one needs to preserve the privacy of respondents while maximizing data utility. An approach that has been studied extensively in recent years is to use anonymity techniques. Anonymity avoids the above problem by transforming the quasi-identifier (QI) values into less specific forms so that they no longer uniquely represent individuals [2].

Many privacy-preserving technologies have been proposed, such as clustering, randomization, and data compression. In this article, we focus on anonymity-based techniques; the following sections discuss k-anonymity, l-diversity, Anatomy, and t-closeness.

TABLE I. MICRODATA

ID   Age   Sex   Zip code   Disease
1    26    M     83661      Headache
2    24    M     83634      Headache
3    31    M     83967      Toothache
4    39    F     83949      Cough

TABLE II. VOTER REGISTRATION LIST

ID   Name   Age   Sex   Zip code
1    Jim    26    M     83661
2    Jay    24    M     83634
3    Tom    31    M     83967
4    Lily   39    F     83949
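To make this linking attack concrete, the following sketch (our own illustration, not from the cited papers) performs the join in plain Python, with the values copied from Tables I and II:

```python
# Microdata released by the hospital (Table I), without names.
microdata = [
    (26, "M", 83661, "Headache"),
    (24, "M", 83634, "Headache"),
    (31, "M", 83967, "Toothache"),
    (39, "F", 83949, "Cough"),
]

# Public voter registration list (Table II), with names.
voters = [
    ("Jim", 26, "M", 83661),
    ("Jay", 24, "M", 83634),
    ("Tom", 31, "M", 83967),
    ("Lily", 39, "F", 83949),
]

# Index the microdata by the quasi-identifier {Age, Sex, Zipcode} ...
by_qi = {(age, sex, zipcode): disease
         for age, sex, zipcode, disease in microdata}

# ... and join: every voter's quasi-identifier is unique here, so
# each name is linked to exactly one disease.
for name, age, sex, zipcode in voters:
    print(name, "->", by_qi[(age, sex, zipcode)])
```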

TABLE III. A 2-ANONYMOUS TABLE

ID   Age   Sex   Zip code   Disease
1    2*    M     836**      Headache
2    2*    M     836**      Headache
3    3*    *     839**      Toothache
4    3*    *     839**      Cough

II. K-ANONYMITY

When releasing microdata for research purposes, one needs to limit disclosure risks to an acceptable level while maximizing data utility. To limit disclosure risk, Samarati [3] and Sweeney [4] introduced the k-anonymity privacy requirement, which requires each record in an anonymized table to be indistinguishable from at least k-1 other records within the dataset with respect to a set of quasi-identifier attributes. To achieve k-anonymity, they used both generalization and suppression for data anonymization. Generalization replaces a value with a "less-specific but semantically consistent" value, while tuple suppression removes an entire record from the table [5]. Unlike traditional privacy protection techniques such as data swapping and adding noise, the information in a k-anonymous table produced through generalization and suppression remains truthful.
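As a concrete illustration of generalization, the following sketch (our own; the chosen generalization levels simply mirror Table III) coarsens Age to its first digit and Zip code to its first three digits:

```python
def generalize(record):
    """Generalize quasi-identifiers as in Table III: keep the first
    digit of Age and the first three digits of Zip code."""
    out = dict(record)
    out["Age"] = str(record["Age"])[0] + "*"
    out["Zipcode"] = str(record["Zipcode"])[:3] + "**"
    return out

print(generalize({"Age": 26, "Sex": "M", "Zipcode": 83661}))
# {'Age': '2*', 'Sex': 'M', 'Zipcode': '836**'}
```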

In particular, a table is k-anonymous if the QI values of each tuple are identical to those of at least k-1 other tuples. Table III shows a 2-anonymous generalization of Table I. Even with the voter registration list, an adversary can only infer that Jim may be the person involved in the first two tuples of Table III; equivalently, Jim's real disease is discovered only with probability 50%. In general, k-anonymity guarantees that an individual can be associated with his real tuple with probability at most 1/k.
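A table can be tested for this property directly. The sketch below (an illustrative helper of our own, not an algorithm from [3], [4]) groups records by their generalized QI values and checks that every equivalence class has at least k members:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    appears in at least k records, i.e., every equivalence class
    has size >= k."""
    class_sizes = Counter(
        tuple(r[a] for a in quasi_identifiers) for r in records
    )
    return all(size >= k for size in class_sizes.values())

# The generalized records of Table III.
table3 = [
    {"Age": "2*", "Sex": "M", "Zipcode": "836**", "Disease": "Headache"},
    {"Age": "2*", "Sex": "M", "Zipcode": "836**", "Disease": "Headache"},
    {"Age": "3*", "Sex": "*", "Zipcode": "839**", "Disease": "Toothache"},
    {"Age": "3*", "Sex": "*", "Zipcode": "839**", "Disease": "Cough"},
]
print(is_k_anonymous(table3, ["Age", "Sex", "Zipcode"], k=2))  # True
```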

The limitations of the k-anonymity model stem from two assumptions [6]. First, it may be very hard for the owner of a database to determine which of the attributes are or are not available in external tables. This limitation can be overcome by adopting a strict approach that assumes much of the data is public. The second limitation is much harsher: the k-anonymity model assumes a certain method of attack, while in real scenarios there is no reason why the attacker should not try other methods, such as injecting false rows into the database [7].

While k-anonymity protects against identity disclosure, it does not provide sufficient protection against attribute disclosure. Two attacks were identified in [8]: the homogeneity attack and the background knowledge attack.

Example 1. Table IV is the original data table, and Table V is an anonymized version of it satisfying 2-anonymity; the Disease attribute is sensitive. Suppose Jay knows that Tom is a 27-year-old man living in ZIP 83634 and that Tom's record is in the table. From Table V, Jay can conclude that Tom corresponds to the first equivalence class and thus must have a headache. This is the homogeneity attack. For an example of the background knowledge attack, suppose that, by knowing Lucy's age and zip code, Jay can conclude that Lucy corresponds to a record in the last equivalence class of Table V. Furthermore, suppose that Jay knows that Lucy has a very low risk for cough. This background knowledge enables Jay to conclude that Lucy most likely has a toothache.

TABLE IV. ORIGINAL PATIENTS TABLE

ID   Zip code   Age   Disease
1    83661      26    Headache
2    83634      24    Headache
3    83967      31    Toothache
4    83949      39    Cough

TABLE V. A 2-ANONYMOUS VERSION OF TABLE IV

ID   Zip code   Age   Disease
1    836**      2*    Headache
2    836**      2*    Headache
3    839**      3*    Toothache
4    839**      3*    Cough

III. l-DIVERSITY

While k-anonymity protects against identity disclosure, it does not provide sufficient protection against attribute disclosure. The notion of l-diversity attempts to solve this problem by requiring that each equivalence class have at least l well-represented values for each sensitive attribute. l-diversity has an advantage over k-anonymity: a k-anonymous dataset still permits strong attacks when the sensitive attribute in an equivalence class lacks diversity.

The l-diversity Principle: An equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute.

A table is said to have l-diversity if every equivalence class of the table has l-diversity [9].
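In its simplest ("distinct") form, the principle can be checked as follows; this sketch is our own illustration and ignores the stronger entropy and recursive variants defined in [9]:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: every equivalence class must contain
    at least l distinct values of the sensitive attribute."""
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in quasi_identifiers)
        classes[key].add(r[sensitive])
    return all(len(values) >= l for values in classes.values())

# Table V is 2-anonymous, but its first equivalence class contains
# only "Headache", so it fails even 2-diversity (the homogeneity
# attack of Example 1 exploits exactly this).
table5 = [
    {"Zipcode": "836**", "Age": "2*", "Disease": "Headache"},
    {"Zipcode": "836**", "Age": "2*", "Disease": "Headache"},
    {"Zipcode": "839**", "Age": "3*", "Disease": "Toothache"},
    {"Zipcode": "839**", "Age": "3*", "Disease": "Cough"},
]
print(is_l_diverse(table5, ["Zipcode", "Age"], "Disease", l=2))  # False
```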

While the l-diversity principle represents an important step beyond k-anonymity in protecting against attribute disclosure, it has several shortcomings [10]. First, l-diversity may be difficult and unnecessary to achieve. Second, l-diversity is insufficient to prevent attribute disclosure, as the similarity attack shows: when the sensitive attribute values in an equivalence class are distinct but semantically similar, an adversary can still learn important information. The following is an example of the similarity attack.

Example 2. Table VI is the original table, and Table VII shows an anonymized version satisfying 3-diversity. There are two sensitive attributes: Salary and Disease. Suppose one knows that Lily's record corresponds to one of the first three records; then one knows that Lily's salary is in the range [1K-3K] and can infer that her salary is relatively low. This attack applies not only to numeric attributes like Salary, but also to categorical attributes like Disease: knowing that Lily's record belongs to the first equivalence class enables one to conclude that Lily has some stomach-related problem, because all three diseases in the class are stomach-related.

TABLE VI. ORIGINAL SALARY AND DISEASE TABLE

Sex   Zip code   Salary   Disease
M     555402     1K       Gastric ulcer
M     555722     2K       Gastritis
M     555801     3K       Stomach cancer
F     987056     19K      Cough
M     987444     8K       Toothache

TABLE VII. A 3-DIVERSE VERSION OF TABLE VI

Sex   Zip code   Salary   Disease
M     555***     1K       Gastric ulcer
M     555***     2K       Gastritis
M     555***     3K       Stomach cancer
*     987***     32K      Headache

This disclosure of sensitive information occurs because, while the l-diversity requirement ensures diversity of the sensitive values in each group, it does not take into account the semantic closeness of these values.

IV. ANATOMY

Xiaokui Xiao [11] proposed an alternative to generalization called Anatomy, which allows much more accurate data analysis while still preserving privacy.

Anatomy is an innovative technique that preserves both privacy and correlation in the microdata and hence overcomes the drawbacks of k-anonymity and l-diversity. Instead of generalizing, Anatomy releases all the quasi-identifier and sensitive values directly, in two separate tables. Combined with a grouping mechanism, this approach protects privacy while capturing a large amount of the correlation in the microdata.
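A minimal sketch of the anatomizing step is given below. It assumes the records have already been partitioned into l-diverse groups (the group-formation algorithm of [11] is omitted here) and releases a quasi-identifier table (QIT) and a sensitive table (ST) linked only by a group identifier:

```python
def anatomize(groups, quasi_identifiers, sensitive):
    """Split grouped records into a quasi-identifier table (QIT)
    and a sensitive table (ST) that share only a group id."""
    qit, st = [], []
    for gid, group in enumerate(groups, start=1):
        # QIT keeps the exact quasi-identifier values plus the group id.
        for record in group:
            qit.append(tuple(record[a] for a in quasi_identifiers) + (gid,))
        # ST stores (group id, sensitive value, count) triples.
        counts = {}
        for record in group:
            counts[record[sensitive]] = counts.get(record[sensitive], 0) + 1
        for value, count in counts.items():
            st.append((gid, value, count))
    return qit, st

# Two hypothetical 2-diverse groups drawn from Table I.
groups = [
    [{"Age": 26, "Zipcode": 83661, "Disease": "Headache"},
     {"Age": 31, "Zipcode": 83967, "Disease": "Toothache"}],
    [{"Age": 24, "Zipcode": 83634, "Disease": "Headache"},
     {"Age": 39, "Zipcode": 83949, "Disease": "Cough"}],
]
qit, st = anatomize(groups, ["Age", "Zipcode"], "Disease")
print(qit)  # exact QI values with group ids
print(st)   # (group id, disease, count) triples
```

Because the QI values are published exactly, correlations between the quasi-identifiers and the sensitive attribute can be estimated far more accurately than from a generalized table, while each individual is linked to any sensitive value in his group only probabilistically.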

The authors developed a linear-time algorithm for computing anatomized tables that obey the l-diversity privacy requirement and minimize the error of reconstructing the microdata.

Extensive experiments confirm that this technique allows significantly more effective data analysis than conventional publication methods based on generalization.

However, Anatomy also has several drawbacks [12]. For example, the technique focuses only on the case of a single sensitive attribute, not on multiple sensitive attributes. In addition, it would be highly useful to study how anatomized tables can be utilized for effective mining of interesting patterns in the microdata.

V. t-CLOSENESS

Recently, some researchers found that distributions of personal information with the same level of diversity may provide very different levels of privacy, because there are semantic relationships among the attribute values and different values have very different levels of sensitivity. They also argue that privacy is affected by the relationship between the distribution within an equivalence class and the overall distribution. For these reasons, a novel privacy notion called t-closeness was proposed.

The t-closeness Principle: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness [13].

t-closeness requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table. One key novelty of this approach is to separate the information gained from a released table into two parts: information about the whole population and information about specific individuals. A major challenge of this technique is finding a distance measure that reflects the semantic distance among values. Some researchers use the Earth Mover's Distance (EMD) [14] for the t-closeness requirement, which has the advantage of taking the semantic closeness of attribute values into consideration.
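For a numeric attribute whose values are ordered and (approximately) equally spaced, the EMD reduces to a normalized sum of cumulative differences. The sketch below uses this closed form; the salary distributions are read off Tables VI and VII, and the function itself is our own illustration:

```python
def emd_ordered(p, q):
    """Earth Mover's Distance between two distributions p and q over
    an ordered attribute with m equally spaced values:
    EMD = (1 / (m - 1)) * sum of |cumulative differences|."""
    total = cumulative = 0.0
    for pi, qi in zip(p, q):
        cumulative += pi - qi
        total += abs(cumulative)
    return total / (len(p) - 1)

# Overall salary distribution of Table VI over {1K, 2K, 3K, 8K, 19K}.
overall = [0.2, 0.2, 0.2, 0.2, 0.2]
# Salary distribution in the first equivalence class of Table VII.
first_class = [1/3, 1/3, 1/3, 0.0, 0.0]
print(emd_ordered(first_class, overall))  # 0.25: far from the overall
```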

t-closeness allows us to take advantage of anonymization techniques other than generalization of quasi-identifiers and suppression of records. For example, instead of suppressing a whole record, one can hide some sensitive attributes of the record; one advantage is that the number of records in the anonymized table remains accurate, which may be useful in some applications. Because this technique does not affect quasi-identifiers, it does not help achieve k-anonymity and hence had not been considered before; and since removing a value only decreases diversity, it does not help achieve l-diversity either. In t-closeness, however, removing an outlier may smooth a distribution and bring it closer to the overall distribution.

Data Quality. These experiments compare the data quality of the three privacy measures using the discernibility metric [2] and the minimal average group size [10, 15]. The first metric measures the number of records that are indistinguishable from each other: each record in an equivalence class of size s gets a penalty of s, while each suppressed tuple gets a penalty equal to the total number of records. The second metric is the average size of the equivalence classes generated by the anonymization algorithm.

We use the 7 regular attributes as the quasi-identifier and Occupation as the sensitive attribute. We set different parameters for k and l and compare the resulting datasets produced by the different measures. Figure 2 summarizes the results. We found that entropy l-diversity tables have worse data quality than the other measures. We also found that the data quality of k-anonymous tables without t-closeness is slightly better than that of k-anonymous tables with t-closeness.

This is because the t-closeness requirement provides extra protection for sensitive values, at the cost of decreased data quality. When t = 0.2 is chosen, the degradation in data quality is minimal.

Figure 1. Efficiency of the privacy measures.

Figure 2. Data quality analysis of the 5 measures.

VI. CONCLUSION

Neither k-anonymity nor its enhancements are entirely successful in ensuring that no privacy leakage occurs while keeping a reasonable level of data utility. In fact, while k-anonymity and l-diversity do not completely protect privacy, t-closeness offers complete privacy only at the cost of severely impairing the correlations between confidential attributes and key attributes.

Another problem with the above properties is the computational approach used to attain them for a specific dataset. The papers defining k-anonymity and l-diversity propose approaches based on generalization and suppression which, among other shortcomings, fail to preserve the nature of numerical attributes by turning them into categorical ones. In the case of t-closeness, no computational procedure for reaching it is even mentioned. There are therefore plenty of open research avenues in this area, both at the conceptual level (definition of better properties) and at the computational level (definition of less disruptive computational procedures). If, in addition, one assumes that the intruder knows the precise privacy property being pursued by the data protector (as assumed in [16]), new challenges appear.

Below we discuss some interesting open research issues.

Multiple Sensitive Attributes

Multiple sensitive attributes present additional challenges. Suppose we have two sensitive attributes U and V. One can consider the two attributes separately: an equivalence class E has t-closeness if E has t-closeness with respect to both U and V. Another approach is to consider the joint distribution of the two attributes; a sketch of the separate-attribute check appears after the next paragraph.

To use the joint approach, one has to choose a ground distance between pairs of sensitive attribute values. The Earth Mover's Distance is used for the t-closeness requirement, which has the advantage of taking the semantic closeness of attribute values into consideration; but a simple formula for calculating the EMD over pairs of values may be difficult to derive, and the relationship between t and the level of privacy becomes more complicated.
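A minimal sketch of the separate-attribute check, reusing the emd_ordered helper from the sketch in Section V (both function names are illustrative, not from the literature):

```python
def t_close_separately(class_dists, overall_dists, t):
    """Separate-attribute t-closeness: the equivalence class must be
    within distance t of the overall distribution for every sensitive
    attribute (here, one (class, overall) pair per attribute)."""
    return all(emd_ordered(c, o) <= t
               for c, o in zip(class_dists, overall_dists))
```

The joint-distribution approach would instead require a ground distance over pairs of values, which is exactly the difficulty noted above.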

Limitations of using EMD in t-closeness

The t-closeness principle can be applied using other distance measures. While the EMD is the best measure found so far, it is certainly not perfect. In particular, the relationship between the value of t and the information gained is unclear. For example, the EMD between the two distributions (0.01, 0.99) and (0.11, 0.89) is 0.1, and the EMD between (0.4, 0.6) and (0.5, 0.5) is also 0.1. However, one may argue that the change between the first pair is much more significant than that between the second: in the first pair, the probability of taking the first value increases from 0.01 to 0.11, a 1000% increase, while in the second pair the increase is only 25%. In general, what is needed is a measure that combines the distance-estimation properties of the EMD with the probability-scaling nature of the KL distance.
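The mismatch can be checked numerically. The sketch below (our own illustration) computes the EMD, which for two ordered values is simply the mass shifted off the first value, and the KL divergence for both pairs:

```python
from math import log

def emd_two_point(p, q):
    """EMD between two distributions over two ordered values:
    only the mass on the first value can move."""
    return abs(p[0] - q[0])

def kl(p, q):
    """Kullback-Leibler divergence D(P || Q)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pairs = [((0.01, 0.99), (0.11, 0.89)),
         ((0.4, 0.6), (0.5, 0.5))]
for p, q in pairs:
    # EMD rates both changes as 0.1; KL rates the first as far larger.
    print("EMD =", round(emd_two_point(p, q), 3),
          " KL(Q||P) =", round(kl(q, p), 3))
```

The EMD is 0.1 in both cases, while the KL divergence is roughly 0.17 for the first pair and only about 0.02 for the second, matching the intuition above.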

ACKNOWLEDGMENT

The authors thank the Almighty and all the resource persons who guided them in completing this paper.
