Column-Store Tuple Reconstruction


Column-stores have gained popularity as a promising physical design alternative. The main overhead of query processing in column-stores is on-the-fly tuple reconstruction for multi-attribute queries. This paper presents an adaptive approach to improving tuple reconstruction time. Our approach exploits a decision tree algorithm to cluster the attributes of each projection. Because the proposed algorithm generates no candidates, it eliminates repeated scanning of the database.

1. INTRODUCTION
One of the main factors affecting query performance in column-stores is tuple reconstruction time [1, 2]. During query execution in column-stores, the cost of traditional row-id based tuple reconstruction grows geometrically with the volume of data. Designing a time-efficient tuple reconstruction method adapted to column-stores is therefore of great significance. The successful column-store systems C-Store and Vertica exploit projections to support tuple reconstruction. They logically divide a relation into sub-relations, called projections. Each attribute may appear in more than one projection. The attributes in a projection are sorted on one or more sort key(s) and stored in that sorted order [5]. A projection lets multiple attributes referenced by a query be stored in the same order; however, not all referenced attributes necessarily belong to the projection, and the query then pays a tuple reconstruction penalty. It is possible to improve tuple reconstruction time by clustering frequently co-accessed attributes using the available projection techniques.
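To make the reconstruction cost concrete, the following minimal sketch (our own illustration; the struct and function names are not from C-Store or Vertica) shows row-id based tuple reconstruction over separately stored columns:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Two columns of a relation, stored as separate arrays aligned by row id.
struct ColumnStore {
    std::vector<int> orderkey;
    std::vector<std::string> orderdate;
};

// Row-id based reconstruction: every qualifying row id triggers one
// probe per referenced column, so the cost grows with both the number
// of qualifying rows and the number of attributes in the query.
std::vector<std::pair<int, std::string>>
reconstruct(const ColumnStore& cs, const std::vector<std::size_t>& rowIds) {
    std::vector<std::pair<int, std::string>> tuples;
    for (std::size_t rid : rowIds)
        tuples.emplace_back(cs.orderkey[rid], cs.orderdate[rid]);
    return tuples;
}
```

A projection that stores both columns together in one sort order turns these scattered probes into a sequential merge, which is the saving that projection-based designs aim for.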

It has been observed that attribute projection plays a vital role in tuple reconstruction. Our proposed algorithm, the Decision Tree Frequent Clustering Algorithm (DTFCA), builds on existing projection techniques to improve tuple reconstruction time. These techniques are discussed in Section 2. Section 3 introduces the notation and terminology needed to describe the correlation among query attributes. Section 4 presents informal and formal descriptions of DTFCA. Section 5 describes the experimental environment, the input data based on the TPC-H schema, and an analysis of the results. Finally, Section 6 offers concluding remarks.

2. RELATED WORK

The column-store databases C-Store and MonetDB have attracted researchers to the column-store area due to their good performance [4, 5]. The C-Store family exploits projections to organize the attributes of base tables logically. Each projection may contain multiple attributes from a base table; an attribute may appear in more than one projection and be stored in different sort orders. The objective of projections is to improve tuple reconstruction time. Although partitioning strategies and techniques have not been addressed thoroughly, a good solution is to group attributes into sub-relations based on usage patterns [5]. The Bond Energy Algorithm (BEA) groups the attributes of a relation based on the affinity values in the Attribute Affinity Matrix (AAM), which measures the bond between two attributes of a relation according to how they are accessed by applications [10]. Navathe et al. developed the Binary Vertical Partitioning Algorithm by extending the results of Hoffer and Severance [11]. Most vertical fragmentation algorithms, including BEA and the Binary Vertical Partitioning Algorithm, use an AAM formed from the Attribute Usage Matrix (AUM) [1, 10]. MonetDB proposed a self-organizing tuple reconstruction strategy based on a main-memory structure called the cracker map [5, 7]. Splitting and combining pieces of the existing relevant cracker maps lets the current and subsequent queries select the qualified result set faster. For large databases, however, the cracker map pays a performance penalty due to its high maintenance cost [4]; it is therefore suited only to main-memory database systems such as MonetDB.
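The classical AUM-to-AAM construction is straightforward: the affinity of two attributes is the total access frequency of the queries that reference both. A minimal sketch (our own code following that standard definition, not an implementation from [10] or [11]):

```cpp
#include <cstddef>
#include <vector>

// usage[q][a] is 1 if query q references attribute a, else 0;
// freq[q] is the access frequency of query q.
// aff[i][j] sums the frequencies of every query that uses both Ai and Aj.
std::vector<std::vector<int>>
affinityMatrix(const std::vector<std::vector<int>>& usage,
               const std::vector<int>& freq) {
    std::size_t n = usage.empty() ? 0 : usage[0].size();
    std::vector<std::vector<int>> aff(n, std::vector<int>(n, 0));
    for (std::size_t q = 0; q < usage.size(); ++q)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (usage[q][i] && usage[q][j])
                    aff[i][j] += freq[q];
    return aff;
}
```

BEA then permutes the rows and columns of this matrix so that attributes with high mutual affinity end up adjacent, which is where the grouping into sub-relations comes from.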

3. DEFINITIONS AND NOTATIONS

In relational data warehouse applications, the general form of a query over the fact table(s) and dimension tables is as follows:

Select <attribute list>

From <Table list>

Where <condition>

Group by <attribute list>

Order by <attribute list>

Definition 1 Strong Correlation

k attributes A1, A2, …, Ak of a relation R are strongly correlated if and only if they appear together in the conjunctive conditional clause and the target list of an access query.

Definition 2 Weak Correlation

Given a collection of accesses A = {a1, a2, …, am}, in which each ai (1 ≤ i ≤ m) is an access by a query, if the correlation among the query target attributes is smaller than a threshold P, the attributes are considered weakly correlated.

In column-stores, two critical tasks must be performed to process a query. First, determine the qualifying rows based on the conditions in the Where clause and generate a set of row identifiers. Second, merge the qualifying columns on identical row identifiers and construct the target rows according to the target expressions in the Select clause. Attributes appearing in a Select clause are considered query-target correlative, and attributes appearing simultaneously in a Where clause are considered query-restrict correlative. Attributes that appear both in the conjunctive conditional clause and in the query target are considered strongly correlative. In this paper, we exploit the strong correlation among attributes.
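These two tasks can be sketched as follows (an illustrative fragment, not actual column-store internals; the predicate and column names are ours):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Task 1: evaluate the Where condition on one column and emit the
// row identifiers of the qualifying rows.
std::vector<std::size_t> qualifyingRowIds(const std::vector<int>& col,
                                          int threshold) {
    std::vector<std::size_t> ids;
    for (std::size_t i = 0; i < col.size(); ++i)
        if (col[i] > threshold) ids.push_back(i);
    return ids;
}

// Task 2: merge the target columns on identical row identifiers to
// construct the rows of the Select clause.
std::vector<std::pair<int, int>>
constructTargets(const std::vector<int>& a, const std::vector<int>& b,
                 const std::vector<std::size_t>& ids) {
    std::vector<std::pair<int, int>> rows;
    for (std::size_t i : ids) rows.emplace_back(a[i], b[i]);
    return rows;
}
```

The second task is exactly the reconstruction step whose cost DTFCA tries to reduce: when the target attributes live in the same projection, the merge on row identifiers degenerates into a sequential read.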

4. PROPOSED ALGORITHM (DTFCA)

Let each relation R be grouped into several projections based on empirical knowledge and users' requirements. After the system has been used for a period, a set of accesses A = {a1, a2, …, am} is collected by the system. DTFCA mainly uses this set of accesses to cluster the frequently projected attribute-sets, which are strongly correlated. In this section we give an informal description of DTFCA, followed by a formal one.

4.1. INFORMAL DESCRIPTION-DTFCA

DTFCA is designed around three functions and two methods, namely attrset(table(TPC_H)), map(node), upperbound(instance_pos), getParent(node), and getName(node). attrset(table(TPC_H)) is a built-in function that accepts the TPC-H queries as input and returns the set of accessed attributes. map(node) is a built-in function that maps the position of an attribute in the decision tree. The built-in function upperbound(instance_pos) checks for the last attribute to be included in a frequent attribute-set. The purpose of the built-in method getParent(node) is to find the successive attribute for a frequent attribute-set. The built-in method getName(node) generates the frequent attribute-sets whose support is greater than the minimum support.

The input to the main program is a variant of the AUM (Table 1). Variables are initialized to compute the frequent attribute-sets whose support exceeds the minimum support. Internal calls to map(node), getParent(node), and upperbound(instance_pos) compute the clusters of frequent attribute-sets, which improve tuple reconstruction time through projection. Each element of the output forms the attributes of one projection. The clustering results truly reflect the strong correlation between attributes implied by the previous collection of query accesses, and conform to the meaning of projection in column-stores.

4.2. FORMAL DESCRIPTION-DTFCA

We discuss here the formal description of DTFCA.

/* Main program begins */

void main()
{
    /* Initially generate the attribute-set by scanning A */
    String L = attrset(table(TPC_H));
    String S = DTFCA(L);
}

function DTFCA(String L)
{
    /* Decision Tree Frequent Clustering Algorithm */

    Input : a relation R(U = {A1, A2, …, An}), a collection of accesses
            A = {a1, a2, …, am}, and the support threshold min_sup.
    Output: clusters of strongly correlated attributes Cl1, Cl2, …, Clk,
            each with support no smaller than min_sup.

    /* Variable declaration */
    TreeNode currentNode;

    /* Check the limit for traversal */
    while (!currentNode.isGoal())
    {
        /* Check for the frequent attribute-set by bottom-up search */
        if (currentNode.hasParent())
        {
            TreeNode parent = currentNode.getParent();
            if (parent.support > min_sup)
            {
                /* Get the item name of the tree node for the
                   frequent attribute-set */
                int instance_pos = currentNode.getName();
                map<int, TreeNode>::iterator cur
                    = parent.children.upper_bound(instance[instance_pos]);

                /* No classification criterion for a single node */
                if (cur == parent.children.end())
                {
                    return NO_CLASSIF_CRITERIA;
                }
                currentNode = (*cur).second;
            }
        }
    }

    /* Return the classification, which is now the key */
    return currentNode.getName();
}
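As a runnable counterpart to the pseudocode above, the following simplified sketch (our own; it replaces the tree traversal with direct support counting over the access collection, which is what makes candidate generation unnecessary) clusters the attribute-sets whose support reaches min_sup:

```cpp
#include <map>
#include <set>
#include <vector>

using AttrSet = std::set<int>;

// Each access contributes its projected attribute-set directly, so a
// single scan of A suffices and no Apriori-style candidate sets are
// ever enumerated.
std::vector<AttrSet> clusterFrequent(const std::vector<AttrSet>& accesses,
                                     int min_sup) {
    std::map<AttrSet, int> support;
    for (const AttrSet& a : accesses) ++support[a];  // one pass over A
    std::vector<AttrSet> clusters;
    for (const auto& entry : support)
        if (entry.second >= min_sup) clusters.push_back(entry.first);
    return clusters;
}
```

Each returned cluster corresponds to one candidate projection; note that this sketch counts only exact attribute-sets, a deliberate simplification of the bottom-up tree search above.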

5. EXPERIMENT DETAILS

The objective of the experiment is to compare the execution time of the existing tuple reconstruction method with that of DTFCA for TPC-H schema queries on a column-store DBMS.

5.1. Experimental Environment

Experiments are conducted on a 2.20 GHz Intel® Core™2 Duo processor with 2 MB cache, 1 GB memory, a 5400 RPM hard drive, and the Windows® XP operating system.

5.2. Experimental Data

We use the TPC-H data set, produced by the TPC-H data generator, as the experimental data. Given the TPC-H schema, fourteen different queries are accessed 140 times altogether during a given window period.

5.3. Experimental Analysis

Let us consider a relation with four attributes: S_supplierkey (A1), P_partkey (A2), O_orderdate (A3), and L_orderkey (A4). These attributes are accessed by fourteen different queries of the TPC-H schema under varying minimum support. For each attribute, the AUM holds the access frequency generated by the system. The access frequency is characterized by three parameters: minimum access frequency (Min_fq), maximum access frequency (Max_fq), and average access frequency (Avg_fq). The minimum access frequency of query attributes is computed for the initial accesses of the attributes; the maximum access frequency is computed when the system is processing more queries; the average frequency is computed when the system is processing fewer. DTFCA uses the access frequency to cluster the frequent attribute-sets, where a frequent attribute-set is one whose frequency is no smaller than the minimum support.
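For instance, the per-attribute support implied by the binary usage entries of an AUM like Table 1 can be computed as follows (a hedged sketch of ours; DTFCA itself also weights attributes by the frequency parameters, which we omit here):

```cpp
#include <cstddef>
#include <vector>

// aum[q][a] is 1 if query q accesses attribute a, else 0. The support
// of an attribute is the number of queries that access it, and the
// attribute is frequent when that count reaches min_sup.
std::vector<bool> frequentAttributes(const std::vector<std::vector<int>>& aum,
                                     int min_sup) {
    std::size_t n = aum.empty() ? 0 : aum[0].size();
    std::vector<int> support(n, 0);
    for (const auto& row : aum)
        for (std::size_t a = 0; a < n; ++a)
            support[a] += row[a];
    std::vector<bool> frequent(n);
    for (std::size_t a = 0; a < n; ++a)
        frequent[a] = support[a] >= min_sup;
    return frequent;
}
```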

Table 1: Attribute Usage Matrix (each cell lists Min_fq/Max_fq/Avg_fq/Acc_fq)

| Query | S_supplierkey (A1) | P_partkey (A2) | O_orderdate (A3) | L_orderkey (A4) |
|-------|--------------------|----------------|------------------|-----------------|
| 2     | 1/1/1/100123       | 1/1/1/100123   | 0/0/0/0          | 1/0/0/331       |
| 3     | 0/0/0/0            | 0/1/0/331      | 1/1/1/100123     | 0/1/0/331       |
| 4     | 0/1/0/331          | 1/0/0/331      | 1/1/1/100123     | 1/1/1/100123    |
| 5     | 1/0/0/331          | 0/0/0/0        | 1/1/1/100123     | 0/1/0/331       |
| 6     | 0/1/0/331          | 1/1/1/100123   | 0/0/0/0          | 0/0/0/0         |
| 7     | 1/0/0/331          | 0/1/0/331      | 0/1/0/331        | 1/1/1/100123    |
| 8     | 1/0/0/331          | 0/1/0/331      | 1/1/1/100123     | 1/1/1/100123    |
| 9     | 0/1/0/331          | 1/1/1/100123   | 0/0/0/0          | 0/0/0/0         |
| 10    | 0/0/0/0            | 0/0/0/0        | 1/0/0/331        | 0/1/0/331       |
| 12    | 0/1/0/331          | 0/1/0/331      | 0/1/0/331        | 1/1/1/100123    |
| 16    | 0/0/0/0            | 1/1/1/100123   | 0/1/0/331        | 0/1/1/6612      |
| 17    | 0/1/0/331          | 0/1/1/6612     | 0/1/0/331        | 0/0/0/0         |
| 19    | 0/0/0/0            | 1/1/1/100123   | 0/1/0/331        | 1/1/1/100123    |
| 21    | 0/1/0/331          | 0/1/0/331      | 0/0/0/0          | 0/1/0/331       |

Now we explain the execution of the DTFCA algorithm for different cases of minimum support. We denote an attribute frequency that is a candidate for a frequent attribute-set by (int_val)xyz, where int_val is the integer value of the access frequency and x, y, and z are the case variable numbers. For a given minimum support, the frequent attribute-set is a combination of the four attributes A1, A2, A3, and A4.

Case-Study

Let Case 1, Case 2, and Case 3 denote three minimum-support case variables, with minimum support values of 30, 50, and 60, respectively. In Case 1, the user retains control over determining the strong correlation among attributes and hence over formulating the frequent attribute-sets. Referring to Table 1, more frequent attribute-sets are generated than in the other cases: of fourteen pairs or patterns, thirteen are frequent. This result may be considered moderate and improves tuple reconstruction time for the column-store. With Case 2 and Case 3, however, only four and one frequent attribute-sets are generated, respectively. The clustering result of DTFCA thus conforms closely to the idea of projection in column-stores.

The experiments are performed on TPC-H queries with the same selectivity and varying minimum support. As shown in Figure 1 (the horizontal axis denotes the minimum support, the vertical axis the execution time), the execution time of the TPC-H queries is inversely proportional to the minimum support, and the system performs well for query attributes that are strongly correlative (see Section 3). As Figure 2 shows, the execution time under DTFCA is much lower than under the traditional tuple reconstruction method; hence DTFCA minimizes tuple reconstruction time.

Figure 1: Minimum Support Time Cost

Figure 2: Tuple Reconstruction Time Cost

6. CONCLUSION

Our approach, DTFCA, exploits a decision tree to cluster the frequently accessed attributes of a relation based on workloads. The attributes in each cluster form a projection. The experiments show that our approach is well suited to determining projections in column-stores and hence improves tuple reconstruction time. The output of DTFCA is not a partition but a group of attributes, in keeping with the meaning of projection in column-stores.
