This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
The principal mechanism through which a database maintains an optimal level of performance is known as the database query optimizer; without a well-designed query optimizer, even a professional-quality database--such as PostgreSQL--would be noticeably sluggish. In broad terms, for a database to execute a query, the query must first be parsed and converted into an expression in relational algebra. For any given query, there may be many ways in which to convert the query into relational algebra, and some of these ways will be more computationally costly than others. Thus, the point of the query optimizer is to select the least computationally expensive query evaluation plan. In the course of considering different query evaluation plans, the optimizer must analyze how the data interacts with the various operators that exist in relational algebra. Of these operators, the most difficult one to process and optimize is the join, and so this operator will become the focus of this project. It is true that under normal circumstances a database usually processes queries that involve only a few joins, and it is in these situations that databases such as PostgreSQL excel. There are, however, fairly common applications--data mining and OLAP, for example--that would realistically require the processing of queries involving many joins.
A basic, firsthand analysis of PostgreSQL's performance in handling multiple-join queries is not particularly encouraging: it took approximately 4.8 seconds to process joins involving 10 relations, approximately 16.1 seconds to process joins involving 15 relations, and well over 12 minutes to process joins of 20 relations or more. Clearly, the effectiveness of the PostgreSQL optimizer begins to suffer when we give it queries involving numerous (20 or more) joins. It is for this reason we look to provide an alternative to the PostgreSQL query optimization algorithm--an alternative algorithm that will actually specialize in optimizing large queries. (Here, and throughout this paper, when we refer to a large query, we mean large in terms of the number of joins involved--not necessarily in terms of the size of the output or the size of the relations that are referenced). Because of the inherent complexity involved in optimizing database queries--especially those involving numerous joins--we propose the use of a genetic algorithm that aims to produce plans that outperform the plans created by PostgreSQL.
Explaining the Problem
In order to understand the problem of optimizing large queries, we must briefly investigate the basics of query optimization and genetic algorithms. Before proceeding any further, let us define exactly what constitutes a query evaluation plan: in general, a query evaluation plan is a tree with relational operators at the intermediate nodes and relations at the leaf nodes. Although the actual implementation of a database's query optimizer varies from system to system, in theory, optimizing a SQL query involves three basic steps. First, the SQL must be rewritten in terms of relational algebra. Specifically, the query is treated as a collection of projections (p), selections (s), and Cartesian products (x), with any remaining operators carried out on the result of the given p-s-x expression. Next, once such a relational algebra expression has been formed, the optimizer must enumerate various alternative plans for evaluating the expressions. A typical optimizer does not consider every possible evaluation plan of a given expression since this would require excessive overhead, thereby rendering any possible timesaving optimizations moot. Finally, the optimizer must estimate the cost associated with each of the enumerated plans, and choose the plan with the lowest estimated cost (or at least avoid the plans with the highest estimated costs).
The exact details of how an optimizer eliminates inferior plans and assigns cost is a remarkably complex subject and does not need to be addressed here. The only thing we need to concern ourselves with at this point is why optimizing a query with many joins is so much more difficult than optimizing a query with just a few joins. The join operator (denoted by j) takes as input two relations, combines their tuples one-by-one based on some give criteria, and produces a new relation as output. Complicating matters is the fact that, like most databases, PostgreSQL provides a variety of join methods. Essentially, this means that any one join can actually be processed with some number (three in PostgreSQL's case) of different algorithms--all of which may require varying costs. The dominant reason that larger queries are so costly, however, is that the join operator is both commutative and associative; this means that the number of ways to create a plan for a given query grows exponentially with the number of joins in the query. Thus, a database query's search space--which is complicated enough under normal circumstances--becomes exponentially more complex when we begin to consider numerous joins.
Fortunately for us, generic algorithms are especially suited to navigating through extremely complex search spaces; but what exactly is meant by the term genetic algorithm? Simply put, a genetic algorithm--or GA for short--is a search algorithm based on the mechanics of natural selection and genetics. A genetic algorithm operates by starting with some group (or population, in biological terms) of chromosomes, each of which represents some valid solution to the given problem. To introduce the concept of natural selection into the algorithm, we must devise and employ some function that evaluates the fitness level of each chromosome (or in terms of a problem, how close any given solution comes to the optimal solution). Once each chromosome has been assigned a fitness value, we then select the fittest individual chromosomes and enter them into the mating population. Each chromosome of the mating population is subsequently paired off at random, and the resulting pairs then undergo the process of crossover, where different pieces of the paired chromosomes are interchanged to produce two new chromosomes. This is essentially the algorithmic equivalent to the swapping of DNA that occurs during reproduction in nature. Thus, since the chromosomes are simply representations of the various solutions to our problem, a set of parent chromosomes being crossed represents a pair of relatively correct solutions that are then combined to produce two new and (assuming the theories of evolution hold true) more correct solutions. Finally, the children arising from the mating pool are then taken to be the new population of chromosomes, and the process begins again.
In addition to crossover and fitness-based mate selection, there is one other genetic operator that must be mentioned. Many computer scientists (and even biologists) disagree as to the degree of importance of the mutation operator, but classically, it plays a decidedly secondary role in the algorithmic process. Mutation is, in fact, needed because even though fitness-based mate selection and crossover effectively search through and recombine specific members the population, occasionally they may become overzealous and lose some potentially useful genetic material. Mutation, then, is simply the occasional alteration of a small, randomly chosen part of a chromosome.
As stated earlier, in order to use a GA we need to develop some way of encoding the pertinent information of our database query optimization problem into a fixed-length chromosome. Devising this encoding, as well as implementing the necessary genetic operators so that they properly modify and improve upon our initial population of solutions is the fundamental focus of this project.
Proposing a Solution
We have frequently stated that one of the most crucial steps in using a GA to optimize a database query is to devise a method for encoding the information of the query plan into a form so that it is usable by the operators in our genetic algorithm. Any feasible representation must both preserve and distinguish between two main properties: the commutativity of the operands for each gene and the associativity of the operands for each gene. Unfortunately, these complexities preclude the use of a simple string of bits, which is the canonical representation used in genetic algorithms. We must use a more complex representation, and specifically, we will adopt an encoding scheme that is a variation on a representation developed by Kristin Bennett, Michael Ferris and Yannis Ionnidis. In particular, the encoding that we will use involves taking a fixed-length chromosome string and breaking it down into a number of genes; each gene will itself represent a join that must be performed. Thus, if the given SQL query involves 23 joins, then each of the chromosomes will consist of 23 genes.
In order for each gene to model the behavior of a join, it must consist of three pieces of information: the pair of relation operands, the join method, and the relation orientation. Because of the preceding examination of the role of the join operator in query optimization, the first two of these components should be straightforward; but what does the relation orientation mean? Given the pair of relations that our join operates on, we need to know which one is the outer (left) operand and which one is the inner (right) operand; this is exactly what the orientation indicates. Although the orientation could be implemented many different ways, we will assume that the orientation has one of two possible values: alphabetical or reverse-alphabetical.
In order to fully grasp how the encoding works, it is necessary to work through an example. In this paper we adopt the convention of using numbers to denote joins, capital letters to denote relations (hence the alphabetical and reverse-alphabetical possibilities for our orientation), and either a lower-case a or r in the subscript of the appropriate join to denote the orientation. (Note that, for simplicity, the join method will not be represented visually). Now consider the following simple expression:
This expression involves two joins, so we know that the chromosome must contain two genes. To develop the encoding for this plan, we start at the lowest and leftmost join, which happens to be j1. Looking at the operands for this join, we see that the outer one is A and the inner one is C; this means that the relations have an alphabetical orientation, and so the first gene is 1a. We must now consider the second (and final) join. Looking at the operands, we see that j1 is the outer operand, and B is the inner operand; without considering the orientation yet, we can at least write 1a 2. Unlike the previous join, it is not immediately clear what the orientation should be for j2. Because it is a subjoin of j2, j1 itself has no alphabetical association. The problem here is one involving alphabetical promotion; clearly it would be easiest to simply promote either A or C to when determining the alphabetical orientation, but which one do we choose? This is a significant question because if we promote A, then j2 will have alphabetical orientation, but if we promote C, it will have reverse-alphabetical orientation. Establishing a rule for alphabetical promotion is essential in order to preserve the accuracy of the encoding; the actual rule itself is not particularly important so long as it is applied consistently. For our representation, we use the following rule: If the subjoin occurs after the current join in the chromosome string, obey the subjoin's orientation suggestion; otherwise, if the subjoin occurs before the current join the chromosome string, do the opposite of the subjoin's orientation suggestion. So, for our example, 1a represents our subjoin and 1a occurs before 2 in our encoding, so we are to disobey the orientation suggestion of 1, which means we should take the reverse-alphabetical relation of j1, giving us a promotion of C. So, now j2 effectively has C as its outer operand and B as its inner operand, so we have reverse-orientation, and our final encoding is 1a 2r.
Despite its seeming complexity, this encoding has many advantages. Because the representation is based on labeling joins instead of relations, we avoid the complexity that frequently arises when several intermediate relations exist at the same point during the encoding process. In our representation, these relations are simply attached to and considered in terms of their respective joins. Furthermore, although it may seem awkward initially, using the concept of orientation is extremely helpful because it allows us to keep track of the commutative order of the outer and the inner relations for each join with just a single variable; the relations themselves can simply remain static and unordered--all we need access to is their names. Finally, our representation of the query plans makes implementing the genetic algorithms fairly straightforward--almost as straightforward as if we were using a string of bits.
Now that we have introduced how we will encode the important database query information so that can be manipulated by the genetic algorithm, it is worthwhile to consider how we will implement the genetic operators that will interact with the chromosomes. The mutation operator is the easiest to implement, and consists of performing one of three possible alterations. First, mutation can select two genes and transpose them, switching their position in the string. Second, mutation can select one gene and change its associated orientation. Finally, mutation can select one gene and change its associated join method. Determining which of these three alterations will be used is done at random to ensure the spontaneous nature that is indicative of mutation.
Crossover is a slightly more complicated operation, and an example will be very useful. Recall that crossover operates on two chromosomes that have attained some base level of fitness so as to be entered into the mating pool. Suppose we have two chromosomes X and Y defined as follows:
The interactions between the two chromosomes in this implementation of crossover can be described as an active-passive/give-take relationship. We first have to determine which chromosome will be the active giver of information; we arbitrarily choose X to be first. Next we must define what pieces of information need to be sent to Y: for a chromosome with n genes, we randomly select a size from 1 to n for our chunk of data. In our case we need to choose some integer on the interval [1, 5], and let's suppose we select a size of 2. Now we need to determine at which gene we should begin our data chunk (keeping in mind that we can effectively wrap the chunk around to the beginning of the chromosome if we get to the end without satisfying the prescribed chunk size). For this example, assume we randomly chose to start the chunk at the 3rd position, which is the 5r gene. So, with a size of 2, our chunk looks like:
Of course, to actually complete the crossover operation, we would also need to go through the algorithm again, this time with Y as the active giver and X as the passive receiver. It should be clear that this method essentially performs all at once and on a multi-gene scale the three operations that were defined in the mutation operator. This cements the earlier statement that crossover plays a more dominant and dynamic role in the algorithmic process than does mutation.
The final genetic operator that we must consider is that of fitness-based mating. In point of fact, the implementation of this operator is governed by how we define our fitness function; every other aspect of the operator simply involves moving chromosomes from one population to another. Recall that the fitness function must capture some measure of correctness or success that we want to preserve. In terms of the problem of database query optimization, since each chromosome represents a query plan, this function is already been determined for us--it is simply the estimated cost for the plan! Chromosomes that encode plans with low costs will be labeled the most fit since we are interested in avoiding plans with exorbitant costs.
Important Project Components
In order to implement this genetic algorithm, it will be necessary to obtain information about the initial plan structure, the estimated cost of the plan, and various statistics about the plan; this information can be extracted directly from PostgreSQL. Although it only takes a moderate amount of effort to configure PostgreSQL to output the information that is needed, being able to parse, collect, and categorize this information will be very difficult given the obtuse syntax that PostgreSQL uses to output its plan information. Because parsing, file scanning, and string manipulations are necessities, PERL seems to be a natural choice as far as a language in which to base the code.
The actual programmatic structure of the genetic algorithm, however, seems to cater to the object-oriented nature of C++. In particular, there appears to be three levels of organization: the population level, the chromosome level, and the gene level. Moreover, careful analysis of the genetic operators will reveal that they can often span more than one of these levels. Mutation, for example is a chromosome-level operation when it transposes two genes, but is a gene-level operation at the point that it changes the join method or the orientation. Similarly, fitness-based mate selection is a chromosome-level operation when it applies the fitness function, but it is a population-level operation at the point of moving the fittest chromosomes from the normal population into the mating pool. Thus, we seem to require three levels of abstraction--which is about three levels more than what PERL can easily provide. Therefore, it appears logical to implement this genetic hierarchy as a series of container classes, with a population class containing a chromosome class containing a gene class. Fortunately, since we are using PERL as both our parsing language and our base language, we can simply embed these separate C++ modules in our PERL code, and we can exploit the strengths of both languages.
A final component of this project involves creating set of data and a schema that will be viable for testing. Composing a complex and properly interconnected testing schema consisting of 20 or more relations, complete with appropriate and useable test data is no easy task. In fact, this could probably be an entire project onto itself. In addition to the numerous variations that will need to be manipulated with the test schema and test data, we will most likely require significant flexibility so that the testing input can be controlled and monitored. This most likely suggests the use of a PERL script using the DBI/DBD interface to create and populate the tables in the schema; but it is important to avoid getting bogged down in this ancillary aspect of the project. Once the algorithm has been tested and finalized, it would be very interesting to note its performance while being run against a real world, industry-standard benchmarking schema--if one that includes at least 20 relations is available.
- Beasley, David, David R. Bull, Ralph R. Martin. "An Overview of Genetic Algorithms: Part 1, Fundamentals." University Computing, volume 15, issue 2 (February 1993), pp. 170-181.
- Bennett, Kristin, Michael C. Ferris, and Yannis E. Ioannidis. "A Genetic Algorithm for Database Query Optimization." The Proceedings of the Fourth International Conference on Genetic Algorithms. San Diego, California: Morgan Kaufmann Publishers, 1991, pp. 400-407.
- Celko, Joe. "Genetic Algorithms and Database Indexing." Dr. Dobb's Journal, volume 18, issue 4 (April 1993), pp. 30-34.
- Goldberg, David E. Genetic Algorithms in Search, Optimization and Machine Learning. Reading, Massachusetts: Addison-Wesley, 1989.
- Goldberg, David E. "Genetic and Evolutionary Algorithms Come of Age." Communications of the ACM, volume 37, issue 3 (March, 1994), pp. 113-119.
- Holland, John H. "Genetic Algorithms." Scientific American, volume 267, issue 1 (January 1992), pp. 66-72.
- Kim, Won. "On Optimizing an SQL-like Nested Query." ACM Transactions on Database Systems, volume 7, issue 3 (September 1982), pp. 443-469.
- Lohman, Guy M., Dean Daniels, Laura M. Haas, Ruth Kistler, Patricia G. Selinger. "Optimization of Nested Queries in a Relational Database." IMB Research Laboratory RJ4260, San Jose, Calif., April 1983.
- McGoveran, David. "Evaluating Optimizers." Database Programming and Design, January 1990, pp. 38-49.
- Momjian, Bruce. "PostgreSQL Internals Through Pictures." Software Research Associates, December 2001. Available online at http://developer.postgresql.org/pdf/internalpics.pdf
- Ramakrishnan, Rahgu, Johannes Gehrke. Database Management Systems: Second Edition. Boston, Massachusetts: McGraw-Hill, 2000.
- Selinger, Patricia G., Morton M. Astrahan, Donald D. Chamberlain, Raymond A. Lorie, Thomas G. Price. "Access Path Selection in a Relational Database Management System." Proceedings of ACM-SIGMOD, May 1979.
- Stillger, Michael, and Myra Spiliopoulou. "Genetic Programming in Database Query Optimization." Proceedings of the First Annual Conference on Genetic Programming. Stanford, California, July 1996.
- Wayner, Peter. "Genetic Algorithms: Programming Takes a Valuable Tip from Nature." Byte. January 1991, pp. 361-368.
- Worsley, John C. and Joshua D. Drake. Practical PostgreSQL. Sebastopol, California: O'Reilly & Associates, 2002.
- Yao, S. Bing. "Optimization of Query Algorithms." ACM Transactions on Database Systems, volume 4, issue 2 (June 1979), pp. 133-155.