# Performance Evaluation On The Effectiveness Of Algorithms Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

## CHAPTER 1

## INTRODUCTION

Data mining is the process of extracting patterns from data. Data mining is becoming an increasingly important tool to transform this data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery. Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. From past few years there is a specific attention towards the development of ranking and top k- retrieval algorithms to help the buyers in searching and retrieving from the database, but only fewer techniques are available for helping sellers in selecting the attributes of a product that is to be inserted in the database. This project deals with such type of technique.

## 1.1 OBJECTIVE

The main objective of this project is to provide guidance in selecting the best attributes of the product that is to be entered in to the database. This technique must be so efficient such that the product stands out in the crowd and must also be visible to the pool of potential buyers. In addition to that, performance evaluation of real and synthetic data for demonstrating the effectiveness of algorithms is also an objective function of this project.

## 1.2 PROBLEM STATEMENT

Problem in inserting a product in the database such that it markets it in the best possible manner is being handled in this project. There is a limit in the number of attributes that is to be entered for the product in scenarios such as inserting an ad in online newspaper because of the ad costs involved. So there is a difficulty in inserting the appropriate attributes in database such that it is highly visible to all users. Since there is a large number of attributes for a product, there is a great deal of confusion, about which attributes to be inserted so that the product is of high visibility to the users.

## 1.3 CHAPTERWISE SUMMARY

The rest of this project is organized as follows. Chapter 2 explains the proposed techniques in detail. Also, it specifies the overall view of the system. It describes various problem variants along with the techniques used for those variants. Chapter 3 describes about the implementation of proposed techniques. Each sub section in chapter 3 gives an overview of how the proposed techniques are implemented. Chapter 4 gives a conclusion. The next chapter gives a detailed explanation about the proposed technique.

## CHAPTER 2

## DETERMINING ATTRIBUTES TO MAXIMIZE VISIBILITY OF OBJECTS

In the recent years there is a significant interest in the development of ranking functions and efficient top-k retrieval algorithms to help users in ad-hoc search and retrieval in databases. Generally there can be two types of users for these databases: users who search such databases, trying to locate objects of interest and users who insert new objects into these databases with an expectation that they will be discovered by the first type of users. First types of users are the potential buyers of the product. And the second types of users are the sellers of the product. Products could range from apartments for rent to job advertisements to automobiles for sale.

All the previous researches focus on effective search and retrieval techniques such as new top-k algorithms in order to support the buyers of the product. In contrast, only less research has been addressing techniques to help the sellers/manufacturers in inserting new product for sale in such databases that market it in the best possible manner i.e. such that it stands out in a crowd of competitive products and is widely visible to the pool of potential buyers. This is the main focus of the paper.

Let D be the database of products already being advertised in the marketplace. Based on the problem variants that are considering this database can be either a traditional relational table or a text document where each text document is an ad for a product. Let Q be the set of search queries that has been executed against this database in the recent past. It shows what the potential buyers are having been interested in. Thus Q can be considered as the "workload" or the "query log". Suppose a new product is to be entered in to the database, provided the seller have complete knowledge description about the product. If due to budget constraints there is limit in the number of attributes/keywords that can be entered into the database. Problem that needs to be focused is which attributes of the product must be inserted. So the problem can be defined formally as: Given a database D, a query log Q, a new tuple t, and an integer m, determine the best attributes of t to retain such that if the shortened version of t is inserted into the database, the number of queries of Q that retrieve t is maximized. For this novel optimization problem, several problem variants such as Boolean, categorical, text and numeric data are considered. The project describes several formulation of the problem. It includes exact and approximation algorithms that work well in practice.

The following example will be used throughout the paper to illustrate various concepts. Consider an inventory database of an auto dealer, which contains a single database table D with N rows and M attributes where each tuple represents a car for sale. The table has numerous attributes that describe details of the car: Boolean attributes such as AC, Four Door, Turbo, Power Doors, Auto Trans, Power Brakes, etc; categorical attributes such as Make, Color, Engine Type, Zip Code, etc; numeric attributes such as Price, Age, Fuel Mileage, etc; and text attributes such as Reviews, Accident History, and so on. Table 2.1 illustrates such a dataset (where only the Boolean attributes are shown) of seven cars already advertised for sale. The table 2.2 and 2.3 illustrates a new car t that needs to be advertised, i.e., inserted into this dataset and a query log of five queries, respectively.

Table 2.1 Dataset D

Car ID

AC

Four Door

Turbo

Power Doors

Auto trans

Power Brakes

t1

0

1

0

1

0

0

t2

0

1

1

0

0

0

t3

1

0

0

1

1

1

t4

1

1

0

1

0

1

t6

0

1

0

1

0

0

t7

0

0

1

1

0

0

Table 2.2 New Tuple to be inserted

New car

AC

Four Door

Turbo

Power Doors

Auto Trans

Power Brakes

t

1

1

0

1

1

1

Table 2.3 Query Log Q

Query ID

AC

Four Door

Turbo

Power Doors

Auto Trans

Power Brakes

q1

1

1

0

0

0

0

q2

1

0

0

1

0

0

q3

0

1

0

1

0

0

q4

0

0

0

1

0

1

q5

0

0

1

0

1

0

## 2.1 MAIN PROBLEM VARIANT: CONJUNCTIVE BOOLEAN WITH QUERY LOG (CB-QL)

Conjunctive Boolean with query log is considered to be the main problem variant that is associated with the Boolean data. The query is being considered as conjunctive query. In CB-QL, it is the query log that needs to be analyzed in solving the problem. The actual database is irrelevant. A tuple t satisfies a query q if q is a subset of t. For example a query such as {a1, a3} is equivalent to "return all tuples such that a1= 1 and a3 = 1". The problem can be formally defined as: Given a query log Q with Conjunctive Boolean Retrieval semantics, a new tuple t, and an integer m, computes a compressed tuple tâ€² having m attributes such that the number of queries that retrieve tâ€² is maximized.

In the table which is given above, t1 is the new tuple to be inserted into the database. Suppose it is required to retain m=3 attributes. Then it is possible to consider any three attributes that makes the tuple more visible to the queries. If AC, Four Door, Power Doors i.e. t'= [1, 1, 0, 1, 0, 0], is selected that can see that it can satisfy maximum of 3 queries. i.e. (q1,q2,q3). No other selections of three attributes of the new tuple will satisfy more queries.

## 2.1.1 ALGORITHMS FOR CB-QL

In this section it describes about different algorithms for the main problem variant. It includes:-Optimal Algorithm based on Integer Linear Programming, Optimal Algorithm based on Maximal Frequent Item sets which are considered to be exact algorithms and 3 greedy heuristics ConsumeAttr-CB-QL, ConsumeAttrCumul-CB-QL, ConsumeQueries-CB-QL which are the approximation algorithms. These algorithms are used also for other problem variants with slight modifications. The overall view of the functioning of the system can be shown as below

Start

Query log Q

Database D

Algorithm computation

Stop

Fig 2.1 Overview of the System

## OPTIMAL ALGORITHM BASED ON INTEGER LINEAR PROGRAMMING

CB-QL can be described in the ILP framework. Let the new tuple be the Boolean vector t={a1(t),â€¦,aM(t)}, S be the total number of queries in the query log, and let x1 ,..., xM be integer variables such that if ai(t) = 1 then xi Ð„ {0, 1}, else xi = 0. Then maximum of product of xi is considered by considering the limit from i=1 to S, subject to the constraint that xi â‰¤ m, where m is the number of attributes required after compression of tuple t.

It is easy to see that the maximum gives exactly the solution to CB-QL. The objective function is not linear. But this linearity can be shown as follows: By introducing additional 0-1 integer variables y1,..., ys, i.e., one variable for each query in Q. Maximum of summation of yi is considered subject to the constraint that xi â‰¤ m and also yi â‰¤ xj for each j such that ajqj=1. yiâ‰¤xj actually means yi= xj=1 if yi=1, that is, if qi is satisfied thus the variable yi corresponding to a query can be 1 only if all the variables xj corresponding to the attributes in the query are 1. This implies that the maximum remains the same. This algorithm is referred to as ILP-CB-QL.

The algorithm based on Integer Linear Programming has certain limitations. It is impractical for problem instances beyond a few hundred queries in the query log. The reason is that it is a very generic method for solving arbitrary integer linear programming formulations, and consequently fails to leverage the specific nature of our problem. And so it is not implemented.

## OPTIMAL ALGORITHM BASED ON MAXIMAL FREQUENT ITEM SETS

Another type of algorithm is MaxFreqItemSets-CB-QL which scales to very large data query logs. It's an adaptation of an algorithm for mining maximal frequent item set [2]. Frequency item set problem can be defined as follows: Let R be an N-row M-column Boolean table, and let r > 0 be an integer known as the threshold. Given an item set I (i.e., a subset of attributes), let freq (I) be defined as the number of rows in R that "support" I (i.e., the set of attributes corresponding to the 1's in the row is a superset of I). Let m be the number of attributes to be retained. Compute all item sets I such that freq (I) >r.

The input to the algorithm is the query log where previous user queries are being saved. The problem that is associated with the query log is that, it will be a sparse table as the queries in most search applications involve the specification of just a few attributes. So, the complement ~Q is an extremely dense table. And also, in item set mining, a row of the Boolean table is said to support an item set if the row is a superset of the item set. In our case, a query satisfies a tuple if it is a subset of the tuple. To overcome this conflict, our first task is to complement our problem instance, i.e., convert 1's to 0's and vice versa. Let ~t (~q) denote the complement of a tuple t (query q), i.e., where the 1's and 0's have been interchanged. Likewise let ~Q denote the complement of a query log Q where each query has been complemented. Now, freq (~t) can be defined as the number of rows in ~Q that support t. Rest of the approach is now seemingly clear: compute all frequent item sets of ~Q (using an appropriate threshold t), and from among all frequent item sets of size M-m, determine the item set I that is a superset of ~t with the highest frequency. The optimal compressed tuple t' is therefore the complement of I, i.e., ~I.

There are 2 approaches for setting the threshold. One is a heuristic where the threshold is set to a reasonable fixed value depending on the practicalities of the application. Second approach is an adaptive procedure of setting the threshold that is guaranteed to find the optimal compression. First initialize the threshold to a high value and compute the frequent item sets of ~Q. If there are no frequent item sets of size at least M - m, repeat the process with a smaller threshold which is half of the previous threshold. This process is guaranteed to discover the optimal tâ€². Instead of calculating the complete frequent item set, the paper focuses on calculating the maximal frequent item sets of ~Q by means of random walk. In the dense table with M attributes, most of the maximal frequent item set will exist up in the Boolean lattice. Our random walk can be divided into 2 phases. (a) Down Phase: starting from the top of the lattice (I = {a1a2â€¦aM}), walk down the lattice by removing random items from I until I becomes frequent, and (b) Up Phase: starting from I, walk up the lattice by adding random items to I (from among all items A - I such that I remains frequent), until the no further additions are possible. At this point a candidate for the maximal frequent item set I has been discovered. Random walk is iterated several number of times. Until all item set occurs at least 2 times. From among all these item sets of size M-m that identify the item set that is repeated large number of times. It is then complemented to get the attributes that is to be inserted for the new tuple. The steps for the maximal frequent item set algorithm for CB-QL can be as follows:

Step 1: Complement the problem instance.

Step 2: Perform the two phase random walk on the query log

Step 3: Compute the number of times, the item set occurs in the random walk

Step 4: Add the item set to the maximal frequent item set if it occurs over a specified number of times.

Step 5: Select the item sets which is:

A subset of maximal frequent itemset.

Have size as the difference between the total number of attributes and required number of attributes in the tuple to be inserted.

Step 6: From the item sets obtained from step 5 select item set with highest frequency.

Step 7: Return the negation of the item set obtained from step 6

## GREEDY HEURISTICS

Even though the maximal frequent item-set-based algorithm has much better scalability than ILP based algorithm, it is also becomes prohibitively slow for really large datasets (query logs). Hence additional three greedy heuristics is being developed. These heuristics is being derived from [5].

One type of greedy heuristics is consumeAttr-CB-QL. The algorithm ConsumeAttr-CB-QL first computes the number of times each individual attribute appear in the query log. It then selects the top-m attributes of the new tuple that have the highest frequencies. The steps for this greedy heuristics are as follows:

Step 1: Computes the occurrence of individual attribute in Query Log.

Step 2: Select the top- m attributes having highest occurrence for the new tuple.

The second greedy heuristic algorithm is ConsumeAttrCumul-CB-QL, first selects the attribute with the highest individual frequency in the query log. It then selects the second attribute that co-occurs most frequently with the first attribute in the query log, and so on. The steps for this greedy heuristics are as follows:

Step 1: Select the attribute with highest individual frequency in the Query Log.

Step 2: Selects the next one attribute that co-occur most frequently with the first one.

Step 3: Process is repeated till 'm' (given) attributes are selected

Instead of consuming attributes greedily, an alternative approach is to consume queries greedily. The algorithm first picks the query with minimum number of attributes, and selects all attributes specified in the query. It then picks the query with minimum number of new attributes (i.e., not already specified in the first query), and adds these new attributes to the selected list. This process is continued until m attributes have been selected. The steps for this greedy heuristics are as follows:

Step1: Picks query with minimum number of attributes.

Step 2: Selects all attributes in the query.

Step 3: Picks next query with minimum number of new attributes

Step 4: Adds new attributes to the list.

Step 5: Process is repeated till 'm' attributes are selected

## 2.2 PROBLEM VARIANT: TEXT DATA

A text database consists of a collection of documents, where each document is modeled as a bag of words as is common in Information Retrieval. Queries are sets of keywords, with top-k retrieval via query-specific scoring functions. The text data can be treated as Boolean data. And so algorithms developed for Boolean data can be also used for text data. But there is an issue that is if viewed each distinct keyword in the text corpus (or query log) as a distinct Boolean attribute, the dimension of the Boolean database is enormous. Consequently, none of the optimal algorithms, either IP-based or frequent item set-based, are feasible for text data. Fortunately, the greedy heuristics that is developed scale very well with reasonable results. To apply this algorithm on the text data, each term in the text data is assigned a 1 or 0 depending on the presence or absence of the term in that particular document. The term frequency and document frequency is being calculated and the weight of each term is determined using the formula:

(2.1)

In the formula dj is the term frequency, dfj is the document frequency of the term. And dl is the document length, avdl is the average document length across the collection, and k1 and b are free parameters (we set k1 = 1 and b = 0.5). Then average weight for each term is calculated. This average weight of each term is used to determine the top k terms among all documents.

## 2.3 OTHER PROBLEM VARIANTS

There are certain other problem variants for Boolean, categorical and numeric data.

## 2.3.1 CONJUNCTIVE BOOLEAN-DATA (CB-D)

In this type of problem variant, the access is limited only to the database of existing products and there is no access to the query log. The strategy here is to search for a compressed version of the new product that dominates as many products in the database as possible. Thus any buyer that executes a query that selects a dominated product will also get to see the new product[9]. The problem can be formally defined as follows: Given a database D, a new tuple t, and an integer m, computes a compressed tuple t' by retaining m attributes such that the number of tuples in D dominated by t' is maximized. Any of the algorithms used for CB-QL can also be used to solve CB-D, by simply replacing the query log with the database as input.

## 2.3.2 SKYLINE BOOLEAN (SB)

Given a set of points, the skyline comprises of the points that are not dominated by other points [8], [11]. For each query q in the query log, define the query skyline S(q) = {s1â€¦sL}, which is a collection of skyline points. Each skyline point s defines a subset of attributes for which any data point (tuple) remains on the skyline. For example, suppose a user poses a query q = "Select * from Cars where Make = Honda and AC = yes and Power Windows = yes", and the database has three cars t1 = <Toyota, AC, Power Windows>, t2 = <Honda, AC, Power Brakes> and t3 = <Nissan, AC, Power Brakes>. It can be seen from the skyline definition that the cars t1 and t2 will be on the skyline of q, since they are not dominated by any other cars (here t3) present in the database based on the attributes asked by the query q. Do not store the actual skyline data points such as t1 and t2 in skyline log, instead the set of attributes for which a data point is visible on the skyline. Here, t1 = <Toyota, AC, Power Windows> is visible on the skyline of q because of attributes {AC, Power Windows} asked by q. So the skyline points are s1 = {AC, Power Windows} and s2 = {Honda, AC} for which t2 is on the skyline of q. A skylines log contains all the skylines for the query log.

Table 2.4 Skylines of Queries

Skyline ID

Query ID

Car ID

Attributes for which the car is on the skyline

s1

q1

t4, t5

AC, Four Door

s2

q2

t3, t4

AC, Power Doors

s3

q3

t1, t4, t6,

Four Door, Power Doors

s4

q4

t3, t4

Power Brakes, Power Doors

s5

q5

t2, t7

Turbo

s5

q5

t3

Auto Trans

The Table 2.4 describes the skyline log for the query log Q and database D in Table 2.3 and Table 2.1 respectively. None of the tuples in D satisfies for q5 completely. It is considered that if a tuple has either 'turbo' or 'auto trans' attribute then it satisfy q5. The problem for the skyline boolean can be formally defined as follows: Given a database of competing products D, a query log Q with Skyline Query semantics, a new tuple t, and an integer m, compute a compressed tuple tâ€² by retaining m attributes such that the number of queries for which tâ€² appears on their skylines is maximized. The MaxFreqItemSets algorithm used for CB-QL problem can be used for the skyline Boolean with certain updates.

Skyline log is used instead of query log

Query is being counted only once rather than each skyline.

## 2.3.3 CATEGORICAL AND NUMERIC DATA.

Categorical databases are the natural extension of Boolean databases where each attribute ai can take one of several values from a multivalued categorical domain Domi. A query over a categorical database is a set of conditions of the form ai= xi, xi Ð„ Domi. Each categorical column ai can be replaced by |Domi| Boolean columns and consequently a categorical database/query log with M attributes is replaced by a Boolean database/query log with Î 1â‰¤iâ‰¤M Domi Boolean attributes.

The paper focus on queries that specify ranges over a subset of attributes. The problem variants for Boolean data have corresponding versions for numeric databases. For example, users browsing for used digital cameras may specify queries with ranges on price, age of product, desired resolution, etc, and the returned results may be ranked by price. Problems involving numeric ranges in queries can be reduced to Boolean problem instances as follows. We first execute each query in the query log, and reduce Q to Qâ€² by eliminating queries for which the new tuple has no chance of entering into the top-k results, Then, for each numeric attribute ai in Qâ€², replace it by a Boolean attribute bi as follows: if the ith range condition of query q contains the ith value of tuple t, then assign 1 to bi for query q, else assign 0 to bi for query q. I.e., each query has effectively been reduced to a Boolean row in a Boolean query log Qâ€². The tuple t can be converted to a Boolean tuple consisting of all 1's. It is easy to create CB-QL variant for Boolean data, whose solution will solve the corresponding problem for numeric data. The next chapter deals with the implementation of the algorithm.

## CHAPTER 3

## IMPLEMENTATION

This chapter describes about the implementation of proposed techniques. The most efficient three algorithms are considered for implementation and the rest two are avoided. The implementations for two problem variants are taken into account, i.e., conjunctive Boolean query log and for text variants are the two problem variants considered for implementation. The implementation of the paper mainly focuses on determining the performance indicators such as the time, satisfied queries. The algorithms are implemented in c#. The backend used is SQL Server.

## 3.1 BASICS OF C#

It is a multi-paradigm programming language that encompassing imperative, declarative, functional, generic, object-oriented (class-based), and component-oriented programming disciplines[13]. It was developed by Microsoft within the .NET initiative and later approved as a standard by Ecma (ECMA-334) and ISO (ISO/IEC 23270). The language is intended for use in developing software components suitable for deployment in distributed environments. Source code portability is very important, as is programmer portability, especially for those programmers already familiar with C and C++. Support for internationalization is very important. C# is intended to be suitable for writing applications for both hosted and embedded systems, ranging from the very large that use sophisticated operating systems, down to the very small having dedicated functions[13]. Although C# applications are intended to be economical with regard to memory and processing power requirements, the language was not intended to compete directly on performance and size with C or assembly language. The c# is used as the front end for this project.

## 3.2 SQL SERVER

Microsoft SQL Server is a relational model database server produced by Microsoft. Its primary query languages are T-SQL and ANSI SQL. Microsoft SQL Server is a computer application used to create desktop, enterprise, and web-based database systems [14]. It is used at different levels and with various goals. The code base for MS SQL Server (prior to version 7.0) originated in Sybase SQL Server, and was Microsoft's entry to the enterprise-level database market, competing against Oracle, IBM, and, later, Sybase. Microsoft, Sybase and Ashton-Tate originally teamed up to create and market the first version named SQL Server 1.0 for OS/2 (about 1989) which was essentially the same as Sybase SQL Server 3.0 on Unix, VMS, etc. Microsoft SQL Server 4.2 was shipped around 1992 (available bundled with IBM OS/2 version 1.3). Later Microsoft SQL Server 4.21 for Windows NT was released at the same time as Windows NT 3.1. Microsoft SQL Server v6.0 was the first version designed for NT, and did not include any direction from Sybase. Since parting ways, several revisions have been done independently. SQL Server 7.0 was a rewrite from the legacy Sybase code. It was succeeded by SQL Server 2000, which was the first edition to be launched in a variant for the IA-64 architecture.

## 3.3 DATA SETS

Two data sets have to be used. One is a Boolean data set which is used for Boolean variants and another is a text dataset which is used for text data variants.

## 3.3.1 BOOLEAN DATA SET

The Boolean data set used here is actually an automobile data set which includes car details that is fetched by the user. It includes the attributes such as AC, Power Locks, Turbo, Auto Trans etc. The values to this attributes are either zero or one. The value one for this attributes means the presence of that attribute. The user may request to the server to list out the cars that are having certain features. For example, user makes a query to list the cars having AC, Power Locks, and Turbo. Then the query log saves the query as a new row which is having value one for those three attributes. Rest attribute values are set as zero. The value zero indicates the absence of that attribute.

## 3.3.2 TEXT DATA SET

The other data set is a publication titles dataset that consists of 119,332 titles extracted from the DBLP database [8]. This dataset is used for the text data variants. The text data set consist of keywords which are considered as the attributes. The keywords with a specified tf-idf value are considered to create another data set which is Boolean data set. And for each such attributes, one's or zero's are entered based on their presence or absence in the text data set. The resulting data set is used as the input data set for which the specified algorithms in chapter 3 are implemented.

## 3.4 MAXFREQITEMSET-CB-QL OPTIMAL ALGORITHM

The first module is the optimal algorithm, MaxFreqItemSet algorithm which is a maximal frequent item set algorithm for determining the frequent item sets in the query log. The most familiar algorithm for determining frequent item set is by using apriori and FP- growth algorithms. But it does not work well for dense data set. Hence several other algorithms are being proposed in this project that work well for dense data set. In order to show that the apriori algorithm does not work well for dense data set, it is also included in the implementation of the paper.

## 3.4.1 APRIORI ALGORITHM

The algorithm which is specified in [1] is implemented. In the apriori algorithm it first sets the support value. This value specifies the minimum number of times the item should be repeated. Here, in the implemention it is set as 200. It then finds the frequent 1- itemsets using the function FindOneColSet. It checks if the 1-itemset frequency is greater than 200. If it is so then it is considered to be frequent 1-itemset and it is returned or else it is ignored. The AprioriGenerate is the next function that generate the candidates and then uses the apriori property to eliminate those having a subset that is not frequent. Once all candidates have been generated, the database is scanned. For each transaction, a subset function is used to find all subsets of the transaction that are candidates and the count for each of these candidate is accumulated. Finally, all of those candidates satisfying minimum support form the set of final frequent itemsets.

The function AprioriGenerate procedure performs two kinds of action, namely join and prune.In the join component, to generate k-itemset, k-1 itemsets is joined to itself. But before joining two condition is checked. The first condition is that to join two itemsets in k-1 itemsets, these two itemsets must havek-2 items in common. The second condition is that the element in the position k-1 of the first itemset should be less than the element in the position k-1 of the second itemset. This will generate tha candidate for k-itemset. Then the pruning component employes the apriori property to eliminate candidates that have a subset that is not frequent. The test for infrequency is done in another procedure HasInFrequentSubset. This function is used to check if the subset of the candidate itemset is frequent. It returns true if the subset is not frequent and those itemset whose subset is not frequent can be removed from the candidate set. Or else, if it returns false,then it means the candidate itemset's subsets are all frequent and no need to prune out the candidate itemset. The snapshot of the apriori algorithm is as shown below:

Fig 3.1 Apriori algorithm

## 3.4.2 MAXIMUM FREQUENT ITEM SET ALGORITHM

For the maxfreqitemset algorithm the input will be the query log and a number which indicates the limited number of attributes that can be entered in to the database. The support value is being set o a required number. First step to be followed is to negate the query log. This means 1's are set as 0's and 0's are set as 1's. Once the query log is complemented, the down phase and up phase random walk has to be performed on the negated query log. Two functions are used for down phase and up phase random walk. It considers the tuples having all features set as one. Within the function random ( ) is used to select randomly the items.

Then random walk[2] is performed from a random singleton itemset which has the minimum support value as the above specified one. It walks downward removing randomly infrequent items. The function_cal1 ( ) is used to perform down phase. The walk is continued until the itemset becomes frequent. Up phase random walk is another function for which the output of the down phase is considered to be the input of up phase. The function_cal2 () is used to perform up phase. It adds and removes the items that are frequent or infrequent as it moves downward. This random walk can be performed iteratively any number of times. This may be a simple number or may be a condition. Here, the frequent itemset count is taken and the random walk is repeated that much number of times. From the set that is obtained, item set that is most frequent is selected and all other attributes other than those in the selected itemset is retrieved as the maximal frequent item set. This will be the attributes that has to be inserted in to the database for the new tuple. The snapshots of the maximal frequent itemset algorithm are as shown below. The first snapshot shows that the support value is set, number of attributes to be retained (m value) is set and the data set is negated.

Fig 3.2 Negating the query log

The next snapshot shows that the two phase random walk is performed and the attributes to be retained.

Fig 3.3 The attributes to be retained

## 3.5 GREEDY HEURISTICS

The first greedy heuristic implemented is consumeattr. This algorithm retrieves the top m frequent attributes. For this m value has to be given as input. The output is retrieved as in the snapshot shown below.

Fig 3.4 Consume attributes algorithm

The second greedy heuristic is consumeattrcumul algorithm. This is a cumulative form of the first algorithm. It first selects the most frequent attribute. Then it selects the next attribute that co-occur more frequently with the first attribute. This process is continued till m (no. of attributes to be retained which is given as input) attributes are retrieved. The output of this algorithm is as shown below.

Fig 3.5 Consume Attributes cumulative algorithm

## 3.6 TEXT DATA

The entire text data obtained from the site is divided into 10 documents. Then the term frequency and the document frequency of all attributes are calculated. The weight of each term is obtained using the equation stated in chapter 2. And the average weight is calculated and stored in the database. An input k value is used to select the top k rows having the highest average term weight. Then for that k rows the top m attributes that occur more frequently in the data base is obtained using any of the 2 greedy heuristics. The fig 3.6 and fig 3.7 shows the output obtained using the consume attribute algorithm and consume attribute cumulative algorithm respectively.

Fig 3.6 Text data- Consume attribute algorithm

Fig 3.7 Text data- Consume attribute cumulative algorithm

After implementing the algorithms, the graph has to be generated to show the efficiency of each algorithm. This section will be implemented in the later stages of development.

## CHAPTER 4

## CONCLUSION

In this project, the problem introduced is of selecting the best attributes of a new tuple, such that this tuple will be ranked highly, given a data set, query log or both. This means tuple "stands out in the crowd". Several variants for the problem are presented, which includes Boolean, categorical, text and numerical data. The Boolean variants are those in which the attributes will be having values either 1 or 0. It represents the presence or absence of the attribute. The text variant will have attributes such as reviews, accident history, and so on. The numerical variants include attributes such as price, age, fuel, mileage, etc. The categorical variants include attributes such as make, color, engine type, etc. The optimal algorithms specified in the paper are feasible for small inputs. In addition to that, greedy algorithms, that are being specified, produce good approximation ratios.

Both the optimal algorithms and the greedy algorithms can be applied to all the problem variants. For all problem variants, there is a difference in the performance of optimal algorithm and approximation algorithms. But for the implementation only two problem variants are considered. The implementation section still includes the graph generation for evaluating the efficiency of the algorithms to be constructed. The problem considered in the paper is novel and important in the area of adhoc data exploration and retrieval. Thus it can be concluded that, the paper takes an important first step towards developing principled approaches for attribute selection in a data exploration environment.