Association Rules Hiding Algorithms Computer Science Essay

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Every organization seeks to protect confidential information, such as social security numbers, income, medical conditions, and customer purchases, from unauthorized access. Solving this problem requires combining several techniques and mechanisms in an environment where data have different sensitivity levels; such data may be classified at different levels and made available only to those subjects with an appropriate clearance. A multilevel secure system ensures that data at classification levels above a user's login level are invisible to that user. Such a system is designed to prevent higher-level data from being directly encoded in lower-level data, for example by changing the values of lower-level data in a particular pattern or by selecting certain visible data for output based on the values of higher-level data. It is, however, well known that simply restricting access to sensitive data does not ensure complete protection. For example, sensitive or "high" data items may be inferred from non-sensitive or "low" data through an inference process based on the user's knowledge of the application's semantics. This problem, known as the "inference problem", has been widely investigated and possible solutions have been identified. The proposed solutions address how to prevent the disclosure of sensitive data through the combination of known inference rules with non-sensitive data. Examples of inference rules are deductive rules, functional dependencies, and material implications.

Many techniques and applications have increased the security risks one incurs when releasing data. Data mining techniques are knowledge discovery techniques; however, they can also be used to analyze data from different sources and provide useful information, for instance as a guide to customers. This capability raises a privacy problem. For example, assume we have a server and many clients, where each client holds a set of data. The clients want the server to gather statistical information about associations among their items in order to provide recommendations to the customers, but they do not want the server to learn certain sensitive rules. Sensitive rules are the frequent itemsets that contain highly sensitive knowledge. Thus, when a client sends its database to the server, some sensitive rules are hidden from that database according to specific privacy policies, and the server can only gather statistical information from the modified database. Another example illustrates what the association rule technique can do. Suppose that, as purchasing directors of BigMart, a large supermarket chain, we are negotiating a deal with the Dedtrees Paper Company. They offer their products at a reduced price if we agree to give them access to our database of customer purchases. We accept the deal and Dedtrees starts mining our data. Using an association rule mining tool, they find that people who purchase skim milk also purchase Green paper. Dedtrees then runs a coupon marketing campaign offering "50 cents off skim milk with every purchase of a Dedtrees product".

This campaign cuts heavily into the sales of Green paper, which raises its prices to us based on the lower sales. During our next negotiation with Dedtrees, we find that, with reduced competition, they are unwilling to offer us a low price. Finally, we start to lose business to our competitors, who were able to negotiate a better deal with Green paper.

Based on the previous scenario, we see the need not only to prevent the disclosure of confidential personal information from summarized or aggregated data, but also to prevent data mining techniques from discovering sensitive knowledge that is not even known to the database owners. Such sensitive data are divided into:

Sensitive Transactions:

A group of sensitive association rules mined from a database DB corresponds to a special group of transactions, which we call sensitive transactions and define as follows:

Let T be the set of all transactions in a transactional database DB and SR be a set of sensitive association rules discovered from DB. A set of transactions, denoted ST, is said to be sensitive if ST ⊂ T and ∀ t ∈ ST, ∃ sr ∈ SR such that items(sr) ⊆ t.
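This definition can be checked mechanically. The sketch below, using hypothetical toy data and a helper name of my own choosing, marks a transaction as sensitive when it contains all the items of at least one sensitive rule:

```python
def sensitive_transactions(transactions, sensitive_rules):
    """Return the transactions t for which some sensitive rule sr
    satisfies items(sr) ⊆ t; each rule is given as its itemset."""
    return [t for t in transactions
            if any(items <= t for items in sensitive_rules)]

# Hypothetical toy database and one sensitive rule's itemset.
T = [{"milk", "paper"}, {"milk", "bread"}, {"paper", "soap"}]
SR = [{"milk", "paper"}]
print(sensitive_transactions(T, SR))  # only the first transaction qualifies
```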

Sensitive Association Rules:

Let D be a transactional database, R be the set of all association rules that can be mined from D based on a minimum support σ, and RulesH be a set of decision support rules that need to be hidden according to some security policies. A set of association rules, denoted by SR, is said to be sensitive if SR ⊂ R and SR would derive the set RulesH. ~SR is the set of non-sensitive association rules such that ~SR ∪ SR = R.

Currently, many researchers and studies emphasize the importance of these privacy problems, and many algorithms have been suggested in this field. We will discuss these algorithms later in this document.

The Basics of Association Rules:

Association rules are defined as follows:

Let I = {i1, ..., in} be a set of literals, called items. Let DB be a database of transactions, where each transaction t is an itemset such that t ⊆ I. A unique identifier, called TID, is associated with each transaction. A transaction t supports X, a set of items in I, if X ⊆ t. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. Thus, we say that a rule X ⇒ Y holds in the database DB with confidence α if |X ∪ Y| / |X| ≥ α, where |A| is the number of occurrences of the set of items A in the set of transactions DB. Similarly, we say that a rule X ⇒ Y holds in the database DB with support σ if |X ∪ Y| / N ≥ σ, where N is the number of transactions in DB. While the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between sets of items.
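Both measures can be computed directly. The sketch below uses hypothetical toy transactions and helper names of my own:

```python
def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    """Strength of the rule X ⇒ Y: supp(X ∪ Y) / supp(X)."""
    return support(x | y, db) / support(x, db)

# Hypothetical toy database of four transactions.
DB = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(support({"a", "b"}, DB))              # 0.5  (2 of 4 transactions)
print(round(confidence({"a"}, {"b"}, DB), 3))  # 0.667 (2 of the 3 a-transactions)
```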

Problem Definition:

Given a database D, a set R of relevant rules mined from D, and a subset RH of R, how can we transform D into a database D' (the released database) in such a way that the rules in R can still be discovered, except for the rules in RH?

Solving this problem by reducing the support of the large itemsets via deleting items from transactions (also referred to as the "sanitization" problem) is NP-hard. Thus, we look for a transformation of D (the source database) into D' (the released database) that maximizes the number of rules in R − RH that can still be discovered.

"The sanitizing algorithms modify some transactions to hide sensitive rules based on a disclosure threshold controlled by the database owner. This threshold indirectly controls the balance between knowledge disclosure and knowledge protection by controlling the proportion of transactions to be sanitized. Therefore, sanitizing algorithms must focus on hiding sensitive rules and, at the same time, reducing the side effect on the non-sensitive rules as much as possible.

The sanitization algorithms are categorized into two main groups:

Data sharing-based algorithms:

Item Grouping Algorithm

Sliding Window Algorithm

Pattern sharing-based algorithm:

Downright Sanitizing Algorithm".

There are two main approaches that can be adopted when we try to hide a set RH of rules (i.e., prevent them from being discovered by association rule mining algorithms):

We can either prevent the rules in RH from being generated, by hiding the frequent sets from which they are derived, or

We can reduce the confidence of the sensitive rules, by bringing it below a user-specified threshold (min conf).

Hiding strategies:

The hiding algorithms are categorized into two strategies, as follows:

We decrease the confidence of the rule:

by increasing the support of the rule antecedent X through transactions that partially support it.

by decreasing the support of the rule consequent Y in transactions that support both X and Y.

We decrease the support of the rule by decreasing the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule.

In the next section we discuss the algorithms used to hide association rules.

Algorithm 1.a:

For each selected rule, this algorithm increases the support of the rule's antecedent until the rule's confidence decreases below the min conf threshold.
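One way to sketch this first strategy in Python, using hypothetical toy data and a function name of my own, is to complete transactions that partially support the antecedent X (adding X to a transaction that already supports Y would also raise supp(X ∪ Y), so those are excluded):

```python
def hide_rule_1a(db, x, y, min_conf):
    """Sketch of Algorithm 1.a: raise supp(X) by completing partial
    supporters of the antecedent X, lowering conf(X ⇒ Y)."""
    def count(s):
        return sum(1 for t in db if s <= t)
    for t in db:
        if count(x | y) / count(x) < min_conf:
            break  # the rule is hidden
        # Candidates support X only partially and do not support Y.
        if (t & x) and not x <= t and not y <= t:
            t |= x  # add the missing antecedent items
    return db

# Hypothetical toy data: hide the rule {a, c} ⇒ {b} below confidence 0.5.
DB = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"c", "d"}]
hide_rule_1a(DB, {"a", "c"}, {"b"}, min_conf=0.5)
```

After the run, supp({a, c}) has grown from 3 to 5 transactions, so the confidence falls from 2/3 to 2/5.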

Algorithm 1.b:

This algorithm hides sensitive rules according to the second of the proposed strategies. It reduces the support of each selected rule by decreasing the frequency of the consequent through transactions that support the rule. This process goes on until either the confidence or the support of the rule is below the minimum threshold.
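A minimal sketch of this idea, on hypothetical toy data and with a function name of my own, deletes the consequent from transactions that fully support the rule:

```python
def hide_rule_1b(db, x, y, min_supp, min_conf):
    """Sketch of Algorithm 1.b: delete the consequent Y from transactions
    supporting X ∪ Y until support or confidence drops below threshold."""
    n = len(db)
    def supp(s):
        return sum(1 for t in db if s <= t) / n
    for t in db:
        if supp(x | y) < min_supp or supp(x | y) / supp(x) < min_conf:
            break  # the rule is hidden
        if x | y <= t:
            t -= y  # drop the consequent items from this transaction
    return db

# Hypothetical toy data: hide the rule {a} ⇒ {b}.
DB = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a", "c"}]
hide_rule_1b(DB, {"a"}, {"b"}, min_supp=0.5, min_conf=0.6)
```

Here one deletion suffices: the confidence falls from 0.75 to 0.5, below the 0.6 threshold.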

Algorithm 2.a:

This algorithm decreases the support of the sensitive rules until either their confidence is below the min conf threshold or their support is below the min supp threshold.
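This second strategy can be sketched, on hypothetical toy data and with names of my own, by deleting an antecedent item from transactions that fully support the rule:

```python
def hide_rule_2a(db, x, y, min_supp, min_conf):
    """Sketch of Algorithm 2.a: lower the rule's support by deleting an
    antecedent item from transactions that fully support X ∪ Y."""
    n = len(db)
    def supp(s):
        return sum(1 for t in db if s <= t) / n
    def conf():
        sx = supp(x)
        return supp(x | y) / sx if sx else 0.0
    victim = min(x)  # deterministic choice of the item to delete
    for t in db:
        if supp(x | y) < min_supp or conf() < min_conf:
            break  # the rule is hidden
        if x | y <= t:
            t.discard(victim)
    return db

# Hypothetical toy data: hide the rule {a} ⇒ {b}.
DB = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
hide_rule_2a(DB, {"a"}, {"b"}, min_supp=0.3, min_conf=0.5)
```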

Algorithm 2.b:

This algorithm hides sensitive rules by decreasing the support of their generating itemsets until their support is below the minimum support threshold. In each step, the item with the maximum support is removed from the shortest transaction supporting the itemset.
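The victim selection rule can be sketched as follows, with hypothetical toy data and a function name of my own; ties in item support are broken deterministically by sorting:

```python
def hide_itemset_2b(db, itemset, min_supp):
    """Sketch of Algorithm 2.b: repeatedly remove the highest-support item
    of the itemset from the shortest transaction still supporting it."""
    n = len(db)
    def supp(s):
        return sum(1 for t in db if s <= t) / n
    while supp(itemset) >= min_supp:
        victim = max(sorted(itemset), key=lambda i: supp({i}))
        shortest = min((t for t in db if itemset <= t), key=len)
        shortest.discard(victim)
    return db

# Hypothetical toy data: hide the large itemset {a, b}.
DB = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"a"}]
hide_itemset_2b(DB, {"a", "b"}, min_supp=0.3)
```

Here item a has the maximum support (1.0), so it is removed from the shortest supporting transaction, {a, b}, and the itemset's support falls below the threshold.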

Algorithm 2.c:

This algorithm hides sensitive rules by decreasing the support of their generating itemsets until the support is below the minimum support threshold. If there is more than one large itemset to hide, the algorithm first sorts the large itemsets with respect to their size and support, as algorithm 2.b does.
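The sorting step can be sketched as below; descending order by size and then by support is one plausible reading of the text, and the itemsets and supports are hypothetical:

```python
# Hypothetical large itemsets paired with their supports.
large = [({"a", "b"}, 0.40), ({"c"}, 0.70), ({"a", "b", "c"}, 0.35)]

# Sort by size first, then by support (both descending), before
# hiding the itemsets one at a time.
ordered = sorted(large, key=lambda pair: (-len(pair[0]), -pair[1]))
print([sorted(s) for s, _ in ordered])  # [['a', 'b', 'c'], ['a', 'b'], ['c']]
```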

The Item Grouping Algorithm (IGA):

The main idea is to group sensitive rules into groups of rules sharing the same itemsets. If two sensitive rules intersect, then sanitizing the sensitive transactions containing both rules hides the two rules at once and consequently reduces the impact on the released database. Find below the IGA steps:

Step1: Identifying sensitive transactions and building index

Step2: Selecting the number of sensitive transactions

Step3: Identifying victim items for each sensitive transaction

Step4: Transactions that do not need sanitization are copied directly from D to D'
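The grouping idea behind IGA can be sketched greedily, with hypothetical itemsets and a function name of my own: itemsets that share an item land in the same group, so one sanitized transaction can hide several rules at once.

```python
def group_sensitive(itemsets):
    """Greedy sketch of IGA's grouping step: put each sensitive itemset
    into the first group containing an itemset it overlaps with."""
    groups = []
    for s in itemsets:
        for g in groups:
            if any(s & member for member in g):
                g.append(s)
                break
        else:
            groups.append([s])  # no overlap found: start a new group
    return groups

# Hypothetical sensitive itemsets: the first two share item "b".
print(group_sensitive([{"a", "b"}, {"b", "c"}, {"d", "e"}]))
```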

Sliding Window Algorithm (SWA):

The intuition behind this algorithm is that SWA scans a group of K transactions (the window size) at a time. SWA then sanitizes the set of sensitive transactions, denoted ST, considering a disclosure threshold defined by the database owner.

Unlike IGA, which has a unique disclosure threshold for all sensitive rules, SWA assigns a disclosure threshold to each sensitive association rule. Find below the SWA steps:

Step1: Identifying sensitive transactions and building index

Step2: Identifying the victim items

Step3: Selecting the number of sensitive transactions

Step4: Transactions that do not need sanitization are copied directly from D to D'
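The window-based scan can be sketched as below; this is not the full SWA, just the windowing idea, with hypothetical data and names of my own. In each window of k transactions, one victim item of the sensitive itemset is removed from a (1 − ψ) fraction of its supporters, where ψ plays the role of the rule's disclosure threshold.

```python
import math

def swa_sketch(db, rule_items, psi, k=2):
    """Window-based sketch of SWA: scan k transactions at a time and,
    per window, sanitize a (1 - psi) fraction of the transactions
    supporting the sensitive itemset."""
    victim = min(rule_items)  # deterministic victim choice for the sketch
    for start in range(0, len(db), k):
        supporters = [t for t in db[start:start + k] if rule_items <= t]
        for t in supporters[:math.ceil((1 - psi) * len(supporters))]:
            t.discard(victim)
    return db

# Hypothetical toy data: sensitive itemset {a, b}, disclosure threshold 0.5.
DB = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a", "c"}]
swa_sketch(DB, {"a", "b"}, psi=0.5, k=2)
```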

Downright Sanitizing Algorithm (DSA):

The idea behind this algorithm is to sanitize some sensitive rules while also blocking inference channels. To block inference channels, DSA removes at least one subset of each sensitive itemset in level 1 of the frequent itemset graph. The removal is done recursively up to level 1. Find below the DSA steps:

Step1: Identifying the sensitive itemsets

Step2: Selecting subsets to sanitize

Step3: Sanitizing the set of supersets of marked pairs.
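The subset-removal idea can be sketched as below, with hypothetical data and names of my own; the sketch assumes each sensitive itemset has at least two items, marks one size-2 subset (a level-1 pair of the frequent-itemset graph) per sensitive itemset, and then breaks that pair in every transaction containing it:

```python
from itertools import combinations

def dsa_sketch(db, sensitive_itemsets):
    """Sketch of DSA's idea: mark one pair per sensitive itemset, then
    remove one item of each marked pair from every transaction containing
    it, so no superset (and no derived rule) can remain frequent."""
    marked = {min(tuple(sorted(p)) for p in combinations(s, 2))
              for s in sensitive_itemsets}
    for t in db:
        for pair in marked:
            if set(pair) <= t:
                t.discard(pair[0])  # removing one item breaks the pair
    return marked

# Hypothetical toy data: the sensitive itemset is {a, b, c}.
DB = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
dsa_sketch(DB, [{"a", "b", "c"}])
```

After the run, no transaction supports the marked pair, so neither the pair nor the sensitive itemset above it can be mined as frequent.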