Owing to cheaper hardware, various kinds of parallel computers have been developed by interconnecting several single-processor, single-storage computers into multi-processor, multi-storage machines. Accordingly, parallel algorithms have been proposed by adapting traditional algorithms to run in parallel. This report gives an introduction to parallel algorithms for parallel databases, covering sorting, query, join, and indices.
Compared to five or ten years ago, computer hardware is much cheaper. At the same time, our lives rely more and more on computation. As a result, the parallel computer has emerged, owning multiple processors and multiple memories, including main memory and secondary memory. There are several different parallel architectures. In order to fully utilize the parallelism of such computers, parallel algorithms have been proposed and further optimized by researchers. This report focuses on parallel database algorithms, including parallel sorting algorithms, parallel selection algorithms, and parallel join algorithms. Besides, index techniques can also be adopted in parallel databases.
The remaining sections are organized as follows. Section 2 gives an introduction to the basic concepts of the parallel database, including parallel architectures and data partitioning methods. Section 3 summarizes several algorithms for parallel data sorting, query, join, and indexing. Section 4 discusses some issues in parallel databases. Section 5 concludes this report.
2. Basic Concepts in Parallel Database
2.1. Parallel Database Architecture 
The architecture of a parallel database plays an important role in its operation, for example in query optimization and data partitioning. There are four main kinds of parallel database architectures, the last of which is a combination of the first three.
- The first one is the shared-nothing (SN) architecture. Shared-nothing means that nothing (such as memory or disk) is shared among the processors; each processor owns its own memory and disk. Communication between one processor's memory and another processor's disk is expensive. With this architecture, the general idea for parallelism is to partition the data among all processors. When operating on data, the data is first processed at each processor locally and then merged at one processor. Of course, some operations may require these two steps to be repeated.
- The second one is the shared-memory (SM) architecture, in which there is a shared memory among all processors, but each processor owns a disk, or several processors share a disk (in total there is more than one disk). When memory is shared, many problems can arise, such as data races (several processors read data in shared memory while others may want to write it). Usually a common bus and interconnection network connect the processors and the memory. The disadvantage of this architecture is that the number of processors is limited, typically to fewer than 32 or 64, because the network among the processors is difficult to manage, which further limits the speed and throughput of data transfer. However, in certain cases, such as scenarios where only a low degree of parallelism is needed, shared memory can be efficient.
- The third is the shared-disk (SD) architecture. In this model, one or more disks (no more than the number of processors) are shared by all processors, and each processor owns its own memory. It is intermediate between the shared-nothing and shared-memory architectures. The benefit of this architecture is that all processors can directly access all disks through an interconnection network. However, it still has some drawbacks, such as slow communication between different processors.
- The last one is the hierarchical architecture. This architecture combines the above three architectures into a two-level design. The first (top) level is shared-nothing: each sub-system under the top level owns its own memory and disks. The second level is shared-memory. Hence, each memory and disk can be shared among a few processors, but not all processors.
The above parallel architectures can be broadly classified into two categories: shared-nothing (SN) and shared-everything (SE); shared-disk can be viewed as an intermediate between the two. Each category has its advantages and drawbacks. The shared-nothing architecture is scalable, and can therefore grow very large, but it needs complicated data partitioning among processors. Because of data partitioning and varying data distributions, the shared-nothing architecture may face problems such as skew, so that the operations on all processors may not be fully parallel. On the other hand, the shared-everything architecture is not scalable and needs scheduling, because different processors may access the same resources at the same time. Therefore, scalability implies no sharing, while sharing implies non-scalability: a tradeoff.
Literature  proposed a scalable sharing architecture (SSA) for parallel databases which combines the advantages of both shared-nothing and shared-everything. The scalable sharing architecture is based on distributed shared memory (DSM). The difference from previous distributed shared memory is that the scalable sharing architecture provides hardware-level rather than software-level memory sharing, which reduces the cost of memory sharing. Besides, the processors in this architecture are connected through rings; the ring is also the unit of scalability. Figure 1 shows the scalable sharing architecture; the dotted parts are the scalable units. Experiments showed that this scalable architecture outperforms, or at least matches, the shared-nothing architecture. Another feature of this architecture is its ability to adapt to changes in data skew. However, this architecture also incurs an overhead, namely data movement.
2.2. Partitioning Techniques in Parallel Database
To process data in parallel, data must be stored on multiple disks. Data partitioning refers to how the data is stored on different disks so that operations on the data incur low cost.
There are three basic and practical partitioning techniques, namely hash partitioning, Round Robin (RR) partitioning, and range-based partitioning. Suppose there are m disks (disk-0, disk-1, …, disk-(m-1)), n tuples, and that partitioning is based on attribute att. The three partitioning techniques are as follows:
- Hash partitioning: Hashing is one of the most widely used techniques. Using this strategy, each tuple is stored on disk-i, where i is the value of att in this tuple modulo m (the number of disks). This partitioning method depends on the value of each tuple.
- Round Robin (RR) partitioning: Round Robin is also widely used. The general idea is to store the first tuple on the first disk, the second tuple on the second disk, …, the (m+1)-th tuple on the first disk, the (m+2)-th tuple on the second disk, and so on. Therefore, this strategy is independent of the value of each tuple.
- Range-based partitioning: The general idea of range-based partitioning is to assign a range to each disk and store each tuple on the disk whose range contains the tuple's partitioning attribute value. In detail, we split the values of att into several contiguous ranges, such as, for an integer attribute, 0 to 5, 5 to 8, and 8 to 10. This split forms a range vector [5, 8], which defines the three ranges above. The first range (0, 5] corresponds to the first disk, the second range (5, 8] to the second disk, and the third range (8, 10] to the third disk. Certainly, if there are more disks, we can split the values into more sub-ranges. Then, if a tuple's partitioning attribute value is, for example, 6, the tuple will be stored on the second disk because the second disk's range (5, 8] contains 6. This strategy also depends on the value of the tuple.
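The three strategies above can be sketched as follows (a minimal illustration, assuming the partitioning attribute is an integer; all function names are my own, and ranges are treated as right-closed, matching the (0, 5] convention above):

```python
def hash_partition(att_value, m):
    # Hash partitioning: the tuple goes to disk i = att mod m.
    return att_value % m

def round_robin_partition(tuple_index, m):
    # Round Robin: the k-th tuple (0-based) goes to disk k mod m,
    # regardless of its value.
    return tuple_index % m

def range_partition(att_value, range_vector):
    # Range-based: range_vector [5, 8] defines ranges (..,5], (5,8], (8,..);
    # disk i holds the i-th range.
    for i, upper in enumerate(range_vector):
        if att_value <= upper:
            return i
    return len(range_vector)
```

For example, `range_partition(6, [5, 8])` returns 1, the second disk, exactly as in the worked example above.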
2.3. Selection of Nodes for Distributing the Relations in a Parallel Database
The above two sections give an overview of the architectures and partitioning techniques of parallel databases. From the architecture to the partitioning of relations, some factors should be considered in order to improve the performance of a parallel database. One of these factors is the communication cost among processors. For example, if a relation is partitioned into two parts that are distributed on two processors far away from each other in the whole parallel system, then operations on this relation may be slowed down owing to the communication between these two processors. Therefore, the selection of processors (here we also call these processors nodes) can affect the efficiency of operations (especially join operations) in a parallel database. In the following parts of this section, I will first give a brief introduction to three kinds of parallelism which impact the selection of nodes when distributing relations, and then introduce some cases and algorithms for node selection proposed in .
From the view of parallelism, there are three kinds of parallelism:
- Intra-operation parallelism: a single operation's parallel execution across multiple processors.
- Inter-operation parallelism: multiple independent operations' parallel execution across multiple processors.
- Pipeline parallelism: parallel execution of multiple dependent operations across multiple processors. For example, the second operation may use the first operation's results, and the results produced by the first operation can be pipelined into the second operation.
There are two cases in which relations should be redistributed: when the relations to be joined are not distributed over the same set of nodes, and when the sets of nodes holding the relations differ from one join to another.
Literature  proposed a selection algorithm with two steps: 1) estimate the number of nodes that is sufficient to store each relation; 2) for each relation, select the nodes on which to distribute it. The second step is based on the relations and some rules: 1) the estimated number of nodes for each relation should match the size of the relation relative to the memory per node; 2) every pair of relations to be joined should be assigned the same set of nodes; 3) the sets of nodes should be assigned to different pairs of relations using the RR algorithm. These two steps help to exploit the three kinds of parallelism: inter-operation, intra-operation, and pipeline parallelism. For the full algorithm, please refer to . According to the experiments in the paper, this node selection algorithm indeed reduces the cost of communication among processors.
3. Parallel Algorithms and Optimization
Owing to the huge size of data, even in a parallel database, the algorithms widely used in traditional databases have to be revised, optimized, and optimized again. Researchers have put great effort into the following aspects of optimizing algorithms in parallel databases: selection optimization, input and output optimization, buffer usage optimization, computation reduction, hardware, parallel processing, and the design of the parallel database itself. This article does not cover all of these aspects; I will only cover the parts related to the algorithms and their optimization, such as sorting, query optimization, and join optimization.
3.1. Parallel Sorting Algorithms ( )
Data sorting is one of the well-known problems in computer science. According to , one quarter to one half of computer-based operations are sorting. Several sorting algorithms have been proposed, such as the well-known quicksort and merge sort. Within the scope of database operations, data sorting is also an important research topic. There are two kinds of sorting methods: internal sorting and external sorting. For small data sets that fit in main memory, sorting is performed in main memory; this is internal sorting. If the data is so large that it cannot be loaded into main memory at once, then sorting uses secondary memory; this is external sorting. In traditional databases, these algorithms are suitable; for parallel databases, however, more research is needed. Correspondingly, there are two kinds of parallel sorting algorithms: parallel internal sorting and parallel external sorting. Literature  proposed three sorting algorithms for parallel databases or multi-processor systems and classified the sorting algorithms for the shared-nothing parallel architecture (see section 2.1) into five categories: parallel merge-all sorting and parallel binary merge sorting (these two methods predate literature ), and parallel redistribution binary merge sorting, parallel redistribution merge-all sorting, and parallel partitioned sorting (these three methods were newly proposed in literature ). All five methods are parallel external sorting. I will introduce these five methods after a brief introduction to the external sorting algorithm.
Basic Serial External Sorting:
External sorting on a single processor is performed in two steps: sort and merge. The first step splits the data into many small portions that fit in main memory and sorts each portion; each sorted portion (a run) is stored in external memory (disk). The second step merges the sorted runs from disk, combining them in passes until a single sorted result remains. This is the basic idea of external sorting.
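The two steps can be sketched as follows (a serial illustration in which the "disk runs" are just Python lists and `run_size` plays the role of available memory):

```python
import heapq

def external_sort(data, run_size):
    # Step 1 (sort): split the input into runs that fit in "memory",
    # sort each run, and pretend each sorted run is written to disk.
    runs = [sorted(data[i:i + run_size]) for i in range(0, len(data), run_size)]
    # Step 2 (merge): a k-way merge of all sorted runs produces the result.
    return list(heapq.merge(*runs))
```

Real implementations merge only a few runs at a time, bounded by the number of disk buffers, but the sort-runs-then-merge structure is the same.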
Parallel Merge All Sorting
Parallel merge-all sorting is quite simple. As in basic serial external sorting, there are two steps: local sorting and final merge. The local sorting is done by each processor in the same way as in basic serial external sorting. After the local sorting completes, every processor transfers its locally sorted data to one processor, called the host, and the host merges the data from all processors. However, there are two problems with this method. The first is that if the data is very large, the merging, performed by a single processor, is a heavy burden. The second concerns the interconnection among the processors: when every processor transfers its locally sorted data to the host, the interconnection network may become congested and occupied by these transfers.
Parallel Binary Merge Sorting
This sorting method is slightly different from parallel merge-all sorting. The local sorting of parallel binary merge sorting is done by each processor and, during the local sorting, every two processors pipeline their temporary sorted data to a third, second-level processor. Every pair of second-level processors likewise pipelines its temporary sorted data to a third-level processor, and so on, until only one processor is left; this is the host processor, which performs the final merge. The pairing of processors at every level makes the final sorting tree binary. Compared with parallel merge-all sorting, the host does not merge the data from all other processors but only from two, which balances the workload among the processors, and the merge over all other processors is replaced by the comparison of two inputs. However, because of the pipelining, it produces some extra work and is also time-consuming. Besides, the final merge is still performed by a single host processor while the other processors simply wait for the host to finish. Therefore, the parallelism involves not all processors but only a few.
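The levels of the binary merge tree can be simulated as follows (a serial sketch; in the real method each merge runs on its own processor and the data is pipelined between levels):

```python
import heapq

def binary_merge_sort(partitions):
    # Each processor first sorts its local partition.
    level = [sorted(p) for p in partitions]
    # Then pairs of results are merged level by level, forming a binary
    # merge tree; the last merge corresponds to the host's final merge.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]  # an unpaired last entry passes through
            next_level.append(list(heapq.merge(*pair)))
        level = next_level
    return level[0]
```

With p partitions the tree has about log2(p) levels, which is why only two inputs ever meet at any merge node.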
Parallel Redistribution Binary Merge Sorting
The above two methods are not parallel at all levels, especially in the final merge, where only one processor (the host) performs the merging while the other processors are forced to wait for the host to fetch the locally sorted data. Parallel redistribution binary merge sorting eliminates this drawback and keeps the sorting parallel by using redistribution. The first step of this method is again local sorting. The second step redistributes the results from every pair of processors and transfers the redistributed data to processors in the same pool. The third step merges the data at each processor. The following steps repeat steps two and three. Finally, there is a sorting tree in which each step forms a node, and at the end the sorted results are united. This method performs the merging at all levels and on all processors and replaces the final merge by a union; therefore, the merging workload is distributed. However, at the final union, some processors may receive empty values, which causes skew, a problem known to be hard to solve. Another problem with this sorting method is that, in some cases, the height of the sorting tree may be very large.
Parallel Redistribution Merge All Sorting
This method is designed to eliminate the tall sorting tree of parallel redistribution binary merge sorting. The main difference from the previous method is in step two; the local sorting is the same. In step two of the final merge, the locally sorted data is redistributed to all processors, not just processors in the same pool, based on range partitioning, in order to eliminate the tall tree caused by the binary distribution. Because this method also adopts the redistribution technique, the parallelism again spans all processors. While redistribution makes the parallelism span all processors, it also introduces the skew problem already seen in parallel redistribution binary merge sorting.
Parallel Partitioned Sorting
Parallel partitioned sorting is the most distinct of the five sorting methods. It does not follow the local-sort-then-final-merge pattern; instead, it uses two different steps: partitioning and local sorting. In the first step, the data is partitioned into many ranges and each processor takes one range. Then each processor sorts its data locally. After the local sorting, the data is already fully sorted; when outputting the sorted data, a simple concatenation is performed. This method eliminates the merge bottleneck by not using a final merge at all and runs in parallel on all processors. The two steps form a one-level tree, so no tall tree exists. The only remaining problem is skew due to the partitioning.
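A serial simulation of the two steps (illustrative only; in the real method each local sort runs on its own processor, and ranges follow the right-closed convention of section 2.2):

```python
from bisect import bisect_left
from itertools import chain

def parallel_partitioned_sort(data, range_vector):
    # Step 1 (partitioning): each value goes to the partition whose
    # range contains it; processor i is responsible for partition i.
    n_parts = len(range_vector) + 1
    partitions = [[] for _ in range(n_parts)]
    for value in data:
        partitions[bisect_left(range_vector, value)].append(value)
    # Step 2 (local sorting): each processor sorts its partition.
    sorted_parts = [sorted(p) for p in partitions]
    # No final merge: the ranges are disjoint and ordered, so simple
    # concatenation of the locally sorted partitions is the final result.
    return list(chain.from_iterable(sorted_parts))
```

Note how a skewed `range_vector` would overload one partition: that is exactly the skew problem described above.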
Proper use of the above five sorting methods can improve the performance of a parallel database, especially the last one, parallel partitioned sorting.
From the above five algorithms, we can see that the basic idea of parallel sorting is composed of two phases: 1) parallel local sorting and local storing at each processor; 2) parallel merging of the temporarily sorted data from the processors. As literature  pointed out, parallel sorting can perform better if there is a range partitioning before the local sorting. However, partitioned sorting algorithms, although they improve total performance by making the sorting almost fully parallel, also introduce the skew problem, which may cause workload imbalance among the processors. X. Zhao et al. in  proposed an algorithm (PPS) based on range partitioning that uses a dynamic range partitioning algorithm, estimating the distribution of the data during sorting, which can resolve the workload imbalance problem. In the following sub-section, I will introduce this algorithm.
PPS - Parallel Partitioned Sorting 
PPS is designed for the shared-nothing multi-processor architecture. It works in four steps: range partitioning, data redistribution with dynamic range adjustment, local sorting, and final merging. The first step is similar to the algorithms mentioned above: it creates the range partitioning intervals. In the second step, PPS also assigns the range partitioning intervals to the corresponding processors. The difference between PPS and other partitioning algorithms is that the others use the fixed range intervals created in step 1, while PPS uses dynamic range intervals. As each tuple is sent to its corresponding processor, PPS uses a dynamic range-estimating algorithm to estimate the range intervals and re-assigns them among the processors to balance the workload. For example, if too many tuples are assigned to one processor, that range interval will be split into two. The full dynamic estimating algorithm can be found in . After the redistribution of tuples, the local sorting is performed by each processor. During the local sorting, data distribution information is also monitored and exchanged among the processors, in case some processors perform too much sorting while others are idle. In the final step, the real action is not merging but simple concatenation, if the data needs to be sent to a single processor. From these four steps, we can see that the dynamic adjustment occurs in steps two and three, data redistribution and local sorting, and is based on the observed data distribution and the dynamic range-estimating algorithm. Since it is a dynamic adjustment process, PPS does not need to know the real input data distribution; instead, it uses dynamic analysis to estimate it.
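As an illustration of the kind of dynamic adjustment described above (not the actual PPS estimator, which is in the cited paper), an overloaded range interval might simply be split at its midpoint:

```python
def rebalance_ranges(counts, ranges, threshold):
    # Hypothetical sketch: while tuples are being redistributed, any range
    # that has received more than `threshold` tuples is split in two, so a
    # second processor can take over half of the overloaded interval.
    new_ranges = []
    for (lo, hi), count in zip(ranges, counts):
        if count > threshold and hi - lo > 1:
            mid = (lo + hi) // 2
            new_ranges.extend([(lo, mid), (mid, hi)])
        else:
            new_ranges.append((lo, hi))
    return new_ranges
```

The real algorithm estimates the distribution incrementally rather than splitting blindly at midpoints, but the effect, adapting the range vector to the observed data, is the same.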
3.2. Selection Query Optimization with Indices ( )
Among the algorithms for parallel databases, selection optimization is often neglected by researchers (who may pay more attention to the join operation and its optimization). There are several reasons. One of them is that, though the selection operation is performed frequently, it is usually based on the primary key (PK) or a secondary key (SK), and relations are stored according to one of these two keys. However, using indices, which are useful structures supporting quick access to the tuples they point to, the selection operation can be further optimized.
Like joins, there are two kinds of selections: exact match selection and range selection. Exact match selection, like an equivalence join, requires that the value of each tuple at the given attribute equals the given value; this selection is usually expressed with '='. Range selection requires the value of the tuple at the given attribute to lie within a given range. There are two cases. The first selects a continuous range, using operators such as '>' and '>='. The second selects discrete values, as in the SQL predicate "color in ('red', 'blue', 'white')", where the given values form a set. In the following part of this section, I will introduce selection optimization for these two kinds of selections in parallel databases. Both optimizations rely on indices (an index tree built on the search attribute, whether a global or a local index; see section 3.4 on parallel database indices).
Exact Match Selection in Parallel Database with Indices
For exact match selection, three factors should be considered: the processors, the traversal of the index tree (such as a B+ tree), and record loading. The first factor decides which processor performs the index tree traversal; the second refers to the search for the leaf node containing the desired data; and the third refers to loading the record after the leaf node has been found by the index traversal. These three factors vary according to the index scheme discussed in section 3.4.
Regarding the processors, the best solution may be to constrain the processing to the processor that contains the data to be selected. If the data in each processor is indexed using the NRI scheme (see section 3.4), only the one processor containing the desired data will be selected to perform the index tree traversal. Within the selected processor, the selection proceeds as in a traditional database. For PRI and FRI, this approach is also applicable, in that only the processor containing the required value is selected; in that case, the processor's local index is part of the global index and involves the same operation.
Regarding the traversal of the index tree to find the leaf node containing the desired data, there are two cases. If the index scheme is NRI, meaning a local index is built within each processor, the selected processor traverses its own index and performs the selection. However, if the index scheme is PRI, several (but not all) processors contain some common nodes, including leaf nodes. Then, in the traversal of the global index tree, more than one processor may be involved, one after another. For example, if we want to search for the value '78' and the global index tree is distributed as shown in Figure 4, then all three processors are involved and the traversal jumps from one processor to another. In this case, the search cannot proceed in parallel. Fortunately, since parts of the global index tree are replicated among processors, if one processor contains the whole global index tree, the case becomes the same as with the NRI scheme. Of course, if the index scheme is FRI, meaning that every processor replicates the whole global index tree, then the traversal can be performed within one processor, after which the processor containing the desired data is selected.
After the selection of the leaf node, the records are loaded. If the leaf node found by the index traversal belongs to the same processor that performed the traversal, the record is loaded from the local disk. However, if the selected leaf node belongs to another processor, as can happen with the FRI or PRI schemes, then switching control between processors is necessary, which introduces overhead.
Range Selection in Parallel Database
For range selection, the same three factors apply. The first two affect range selection in the same way as exact match selection; the difference lies in the third factor. In exact match selection, after the desired leaf node is found, only one processor contains the desired data, even though one more step, processor switching, may be needed. For range selection, however, the desired data may span more than one leaf node. I will introduce range selection in two sub-parts: continuous range selection and discrete range selection.
In continuous range selection, after finding the leaf node containing one of the boundary values (if the leaf nodes are linked in a single direction, this should be the boundary that comes first in that direction), we simply traverse the leaf nodes until reaching the other boundary. During the traversal, crossing between different processors may be necessary if not all the leaf nodes reside on the same processor.
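Modeling the leaf level as one sorted list of keys, the leaf scan described above can be sketched as follows (an illustration only; real B+ tree leaves are linked pages, possibly spread across processors):

```python
from bisect import bisect_left

def continuous_range_select(leaf_keys, low, high):
    # Use the index to find the leaf position of the lower boundary...
    start = bisect_left(leaf_keys, low)
    result = []
    # ...then scan the (singly-linked) leaves forward until the upper
    # boundary is passed.
    for key in leaf_keys[start:]:
        if key > high:
            break  # passed the other boundary; stop scanning leaves
        result.append(key)
    return result
```

The forward scan is where processor crossings would occur in a distributed index: whenever the next leaf lives on another processor.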
In discrete range selection, however, the case becomes much more complicated. The set of values in the selection statement, such as an SQL statement, is split into several sub-sets, which then drive exact match selections to find the leaf nodes. Owing to this complexity, literature  did not consider this case and only considered exact match selection and continuous range selection.
The algorithm for parallel selection can be outlined as follows :
Given the selection (or query) S and the index tree I:
1) Initialization: select the available processors, determine the selection type (exact match, continuous range, or discrete range), and convert any discrete range selection into exact match selections.
2) Processor allocation: according to the index scheme of I, allocate the necessary number of processors.
3) Parallel search: for each search key in S, search the index I and collect the temporary results.
4) Record loading: assign the loading tasks from the temporary results of step 3) to the processors allocated in step 2) and collect the final result.
5) Output the final result produced in step 4).
A proper revision of this algorithm can also handle discrete range selection.
3.3. Join Operation Optimization 
In relational databases, the join is widely regarded as one of the most expensive and difficult operations , because joins are executed frequently and each join involves combining two relations. Therefore, researchers have put great effort into optimizing this kind of operation, both in traditional databases and in parallel databases (, ).
As discussed above, in a parallel database, data (strictly speaking, the tuples of a relational database) is partitioned among the disks using some strategy, such as hash partitioning, Round Robin partitioning, range-based partitioning, or simple random partitioning. Therefore, during a join, tuples will be composed, decomposed, and reconstructed before output, which may incur considerable cost (which can be evaluated through a cost function) due to the varying distribution of the data and updates. However, if the join is optimized before being processed, the cost can be reduced to a low level. Usually the optimization is performed on algebraic expressions.
In the following sub-sections, I first introduce several traditional join algorithms and then several algorithms for equivalence joins and non-equivalence joins (range joins) used in parallel databases.
Traditional Join Techniques
The basic join unit is the tuple; however, joining tuple by tuple costs too much data transfer. Therefore, several algorithms have been proposed to improve join performance. To simplify the problem, we introduce the join techniques on two relational tables, say R and S.
- Nested loop join: This algorithm is tuple-based. The general idea is: for each tuple r in R, for each tuple s in S, join(r, s). As we learned in the database course, this algorithm is straightforward and simple; however, it involves too many block transfers and costs too much time. It can only be used for small relations. In practice, this algorithm is implemented as the block nested loop join  (see the following part).
- Block nested loop join: This algorithm is a variant of the one above. Unlike the nested loop join, the block nested loop join is block-based: it joins every pair of tuples in every pair of blocks between R and S. Therefore, the total number of block transfers in the inner loop is reduced by a factor of m, where m is the number of tuples in each block of R. Although the block nested loop join reduces the number of block transfers compared with the nested loop join, it still involves many block transfers, because either algorithm compares every pair of tuples between R and S, which is very expensive. Hence, in my opinion, these two join algorithms can be seen as enumeration algorithms (we may even call them 'blind algorithms', because they do not examine the search value before transferring blocks). These two algorithms are suitable for natural joins .
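A minimal in-memory sketch of the block nested loop join, modeling each relation as a list of blocks and each block as a list of tuples (disk transfer itself is not modeled; the nesting order is what matters):

```python
def block_nested_loop_join(r_blocks, s_blocks, on_r, on_s):
    # For every pair of blocks (one from R, one from S), join every pair
    # of tuples whose join attributes (at positions on_r / on_s) match.
    result = []
    for r_block in r_blocks:
        for s_block in s_blocks:  # one pass over S's blocks per R block
            for r in r_block:
                for s in s_block:
                    if r[on_r] == s[on_s]:
                        result.append(r + s)
    return result
```

The plain nested loop join is the special case where every block holds exactly one tuple, which is why it transfers m times as many blocks.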
- Indexed loop join: If there are indices on the table, we can use the indexed loop join. This algorithm first finds the relevant blocks via the indices and then transfers only those blocks for further processing, which avoids a great number of redundant block transfers and saves time.
- Merge join: The merge join is very similar to merge sorting. If the relations are sorted on the join attribute, this algorithm is very efficient. As in merge sorting, two pointers advance through R and S respectively; for each tuple r in R, only the tuples s in S that satisfy the join condition with r are selected to be joined. The only disadvantage of this algorithm is that, if a relation is not sorted, additional sorting on the join attribute is necessary, which costs some time. However, some hardware is designed to support sorting, known as hardware sorters; for example, VERSO used this kind of hardware sorter . This algorithm is also efficient for implementing outer joins and inner joins.
- Hash join: A hash function can also be used to join two relations. The algorithm works as follows. Given two relations R and S, suppose R is the smaller one, so that all tuples of R fit in memory. First, a hash table is built in memory: for each tuple of R, the hash value of its join attribute is computed and the tuple is inserted into the table. Then, for each tuple of S, the hash value of its join attribute is computed and the tuple is concatenated with the tuples of R that share the same hash value in the table. This algorithm is much more efficient than the nested loop algorithms when the relations are large.
- Variants of hash partitioning join: If a relation is so large that it cannot be stored in main memory, the above hash join is not suitable. Therefore, many variants of hash join exist, such as the hashed loop join.
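The in-memory hash join described above can be sketched in a few lines of Python. This is only an illustrative sketch: the relation contents, the attribute names, and the use of a plain dict as the hash table are my own assumptions, not taken from any particular system.

```python
def hash_join(R, S, r_key, s_key):
    """Join R and S on r_key == s_key; R is assumed to be the smaller relation."""
    # Build phase: bucket every tuple of R by its join-attribute value.
    table = {}
    for r in R:
        table.setdefault(r[r_key], []).append(r)
    # Probe phase: each tuple of S is compared only against its own bucket,
    # not against every tuple of R as in the nested loop algorithms.
    result = []
    for s in S:
        for r in table.get(s[s_key], []):
            result.append({**r, **s})  # concatenate the matching tuples
    return result

# Hypothetical example relations.
customers = [{"cid": 1, "name": "a"}, {"cid": 3, "name": "c"}]
orders = [{"cid": 1, "item": "x"}, {"cid": 1, "item": "y"}, {"cid": 2, "item": "z"}]
rows = hash_join(customers, orders, "cid", "cid")
```

Only the orders of customer 1 find a build-side match here, so the probe phase produces two result tuples.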
Parallel Join Techniques
As discussed in Section 2.2, in a parallel database the data is partitioned among many disks and processed by multiple processors. Therefore, the join algorithms for a parallel database differ from those for a traditional database. The general idea is to split the data, process each part on its own processor, and collect the results, so that all processors work on the join in parallel and the total processing time is reduced. However, this idea meets some challenges in practice.
- Partitioning join: Partitioning join requires that the relations to be joined use the same partitioning method, such as the same hash partitioning function or the same range-partitioning vector. For example, under the same range-partitioning vector, each processor only processes the data from the disks that hold the same range of the partitioning vector. This join is efficient for equivalence joins (equijoins).
- Fragment and replicate join: This covers the case where the join condition is not an equality but an inequality on attributes of the relations, such as '<' or '<='. In this case, fragment and replicate is used. The general idea is: partition the two relations into fragments (existing range-partitioning vectors can be reused, and the two relations' fragmentations may differ), arrange the processors in a grid, replicate the i-th fragment of the first relation R across the i-th row of processors, and replicate the j-th fragment of the second relation S across the j-th column. The processors in one row collectively compare one fragment of R against all of S, and the processors in one column collectively compare one fragment of S against all of R, while each individual processor joins only one fragment of R with one fragment of S.
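The grid layout described above can be sketched as follows. The round-robin fragmentation, the toy '<' predicate, and the grid size are illustrative assumptions; the point is that every (r, s) pair lands on exactly one processor cell.

```python
from itertools import product

def fragment_and_replicate(R, S, m, n):
    """Assign work to an m x n processor grid: fragment i of R is replicated
    across row i, fragment j of S across column j, so cell (i, j) joins
    exactly R_i with S_j and every tuple pair is covered exactly once."""
    R_frags = [R[i::m] for i in range(m)]  # any fragmentation works; round-robin here
    S_frags = [S[j::n] for j in range(n)]
    return {(i, j): (R_frags[i], S_frags[j]) for i, j in product(range(m), range(n))}

def theta_join(grid, pred):
    """Each grid cell evaluates the (possibly non-equi) predicate locally."""
    return [(r, s) for (Ri, Sj) in grid.values() for r in Ri for s in Sj if pred(r, s)]

R = list(range(6))
S = list(range(4))
grid = fragment_and_replicate(R, S, m=3, n=2)
pairs = theta_join(grid, lambda r, s: r < s)
```

Because each tuple of R belongs to one row fragment and each tuple of S to one column fragment, no pair is duplicated or missed, which is exactly why this layout handles non-equijoin conditions that hash partitioning cannot.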
A.B.M.R. I. Sadat and P. Lecca performed a simulation of the performance of join operations in a parallel database. In their experiments, the parameters varied were the block-transfer time, the number of processors, the disk-access time, and the size of the relations (tables). Four algorithms were tested: nested loop join, block nested loop join, index loop join, and merge join. The empirical results showed that performance under dynamic scheduling is far better than under the other scheduling methods (asymmetric and symmetric, which performed almost identically to each other).
- Parallel hash join: This is a hybrid algorithm with two steps. In the first step, the join attributes of both relations are hashed at each processor, and the tuples are redistributed so that tuples of the two relations with the same hash value end up at the same processor. There is one exception: when the join attribute is the key of the relations, no hashing is performed (or the hash function simply outputs its input value). In the second step, each processor simply joins the tuples in its own storage.
- Semijoin: Semijoin is a more elaborate hash join based on the following equation; the cited literature gives the detailed algorithm. Compared to the pure parallel hash join, its advantage is that when communication among the processors is limited, semijoin performs much better. When communication between processors is not the bottleneck, however, semijoin keeps the same performance, whereas the pure hash join algorithm performs better.
(S semijoin R) = Z
(R join S) = (R join (S semijoin R)) = (R join Z)
Note: S, R, and Z are relations; the semijoin keeps only the tuples of S whose join value has a match in R, so Z is usually much smaller than S and cheaper to ship between processors.
- Parallel GRACE hash join: Parallel GRACE hash join is also based on a hash function; the assumed architecture is shared memory and the relations are larger than main memory. The algorithm involves two steps: a splitting step and a join step. In the first step, one read buffer and a certain number of write buffers are set up; the data of one relation is read and split into the write buffers, and whenever a write buffer fills up, its content is flushed to secondary memory. The second relation is read and written in the same way. In the second step, one write buffer and a certain number of read buffers are built, along with an in-memory hash table. The data stored in the first step is read back into memory and hashed on the join attribute; according to the hash value, each tuple is chained to the corresponding hash entry. The second relation is read and hashed in the same way. Within a chained bucket, if two tuples share the same attribute value, they are joined into one result tuple. The cost of this algorithm is proportional to the size of the joining relations.
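The split/join structure of the GRACE-style algorithm above can be sketched as follows. This is an in-memory approximation: Python lists stand in for the spilled bucket files, the attribute name is invented, and the per-bucket joins are written sequentially even though on a real system each bucket pair could be processed by a different processor.

```python
def grace_hash_join(R, S, key, n_buckets=4):
    # Splitting step: hash each tuple's join attribute so that matching
    # tuples of R and S land in the bucket with the same number
    # (each bucket models one spill file on secondary memory).
    r_buckets = [[] for _ in range(n_buckets)]
    s_buckets = [[] for _ in range(n_buckets)]
    for r in R:
        r_buckets[hash(r[key]) % n_buckets].append(r)
    for s in S:
        s_buckets[hash(s[key]) % n_buckets].append(s)
    # Join step: bucket pairs are independent, so the iterations of this
    # loop could run in parallel; each builds a small in-memory hash table.
    out = []
    for rb, sb in zip(r_buckets, s_buckets):
        table = {}
        for r in rb:
            table.setdefault(r[key], []).append(r)
        for s in sb:
            for r in table.get(s[key], []):
                out.append({**r, **s})
    return out

# Hypothetical relations: "k" is the join attribute.
R = [{"k": i % 3, "rv": i} for i in range(6)]
S = [{"k": 1, "sv": "a"}, {"k": 2, "sv": "b"}]
out = grace_hash_join(R, S, "k")
```

Only the k values 1 and 2 appear on both sides, each matching two R tuples, so the join yields four result tuples.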
The literature also proposed a strategy for dynamic data loading during the range-partitioning step and simulated the performance of the above hash-based algorithms (combined with the proposed strategy), such as parallel GRACE hash join. During the data-partitioning step, a certain number of buckets are created. If some buckets are too large to fit in memory, they are split into fragments small enough to fit. During the data-redistribution step, the fragment information is also sent to the corresponding processors. The simulation results on the above hash-partitioning algorithms showed that the fragment-allocation strategy performed better than bucket allocation in the presence of data skew (when some buckets are very large). However, when there is no data skew, i.e., the values are uniformly distributed, the performance under the two strategies is similar. Therefore, although the hash-partitioning algorithms may outperform one another, the allocation strategy also has an impact on the performance of the different algorithms.
3.4. Parallel Database Indices
3.4.1. Parallel Index Structure and Access Control
In a traditional database, the index is usually based on a tree structure (such as a B tree or B+ tree) or a hash structure, and the main issues in constructing it are access time and storage consumption. In a parallel database, however, the issues are not limited to access time and storage consumption but also include concurrency control. When an index structure is accessed, updated, and deleted frequently, a concurrency-control mechanism must be introduced to resolve conflicts; otherwise, different processors may see different views of the same index structure. Besides, performance and response time must also be considered under concurrent access.
T. Honishi et al. proposed a structure for the parallel database index: the hybrid index structure (HBX). The general idea of this hybrid index is as follows:
The top level of the index is a hash function, and each hash bucket points to a B-tree. Figure 2 shows the structure of this hybrid index.
Search: For an exact-match selection (see Section 3.2), the search key is hashed and one of the B-trees is selected, so a single processor performs the selection. Concurrency policy for exact-match selection: only the selected B-tree is locked in share mode. For a range selection, all B-trees are selected and traversed in parallel. Concurrency policy for range selection: all B-trees are locked in share mode.
Insertion: only one B-tree is selected and the insertion is performed by one processor. Concurrency policy: only that B-tree is locked in exclusive mode.
Deletion: only one B-tree is selected and the deletion is performed by one processor. Concurrency policy: only that B-tree is locked in exclusive mode.
HBX uses a hash function to split one index tree into several sub-B-trees. Operations on the index can therefore be performed in parallel, which reduces conflicts between different transactions and also improves the performance of parallel database processing.
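A minimal sketch of the HBX idea follows. Plain Python dicts stand in for the sub-B-trees, no locking is shown, and the class and method names are my own; the sketch only illustrates the routing behavior described above: exact-match operations touch one sub-tree, range scans touch all of them.

```python
class HybridIndex:
    """HBX-style hybrid index sketch: a top-level hash function routes each
    key to one of several sub-indexes (dicts standing in for B-trees)."""

    def __init__(self, n_trees=4):
        self.trees = [dict() for _ in range(n_trees)]

    def _tree(self, key):
        # The top-level hash selects exactly one sub-tree (and one processor).
        return self.trees[hash(key) % len(self.trees)]

    def insert(self, key, value):
        # Only the selected sub-tree would need an exclusive lock.
        self._tree(key)[key] = value

    def exact_match(self, key):
        # Only the selected sub-tree would need a share lock.
        return self._tree(key).get(key)

    def range_scan(self, lo, hi):
        # A range selection must visit every sub-tree; on HBX the
        # traversals run in parallel, all locked in share mode.
        return sorted(k for t in self.trees for k in t if lo <= k <= hi)

idx = HybridIndex(n_trees=3)
for k in range(10):
    idx.insert(k, k * k)
```

Note that the hash level sacrifices global key ordering, which is why range selections must fan out to all sub-trees while exact matches stay local to one.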
3.4.2. Parallel Index Schemes
Researchers have proposed several index schemes for parallel databases, such as the non-replicated index, the partially-replicated index, and the fully-replicated index, abbreviated NRI, PRI, and FRI, respectively. Because of the different partitioning attributes, there are several variants of these three schemes. I introduce the three schemes in the remainder of this section.
The non-replicated index scheme works like this: the data is first partitioned by some partitioning algorithm and stored on multiple disks. On each disk, an index is built on one attribute (field). This index is therefore a local index, built not on the whole table but only on the local data; a B+ tree can be used as its structure. Figure 3 shows an example of a non-replicated index.
There are several variants of this non-replicated index; I omit them here.
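The NRI layout can be sketched as follows. The range partitioning, the dict-based local index, and all names are illustrative assumptions; the sketch shows the key property that a lookup on the partitioning attribute probes exactly one node's local index.

```python
def build_nri(rows, key, ranges):
    """One node per range; each node stores only its own rows plus a
    purely local index over them (a dict standing in for a B+ tree)."""
    nodes = [{"rows": [], "index": {}} for _ in ranges]

    def route(v):
        # Range-partitioning function: which node owns value v?
        for i, (lo, hi) in enumerate(ranges):
            if lo <= v <= hi:
                return i
        raise ValueError(f"value {v} outside all ranges")

    for row in rows:
        node = nodes[route(row[key])]
        node["index"].setdefault(row[key], []).append(len(node["rows"]))
        node["rows"].append(row)
    return nodes, route

def nri_lookup(nodes, route, value):
    # On the partitioning attribute, exactly one node is visited;
    # a selection on any other attribute would have to ask every node.
    node = nodes[route(value)]
    return [node["rows"][i] for i in node["index"].get(value, [])]

rows = [{"id": i, "val": chr(97 + i)} for i in range(10)]
nodes, route = build_nri(rows, "id", ranges=[(0, 4), (5, 9)])
```

The comment inside `nri_lookup` also hints at NRI's weakness: without any global index, selections on non-partitioning attributes must be broadcast to all processors.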
Under the non-replicated index, each processor does not own the global index but only its local index. In contrast, under the partially-replicated index scheme, each processor owns a part of the global index (including leaf nodes); in other words, the global index is distributed among several processors. Figure 4 shows an example of the global view of the index as owned across all processors.
The fully-replicated index is similar to the partially-replicated index; the difference is that the global index is replicated to all processors. Although storing the whole global index on every processor is redundant, it can be useful for database operations such as selection and query.
A comparison of the above three index schemes on selection/query operations showed that each scheme has its own advantages and disadvantages. However, PRI and FRI did not provide enough extra benefit to justify their cost. There is therefore a tradeoff between storage and performance when deciding which of these (or other) index schemes to use.
4. Some Issues in Parallel Database
Data Skew 
Data skew occurs when the data distribution is not uniform. It can hurt overall parallelism, especially for operations involving range partitioning. In a shared-nothing architecture, data skew can be reduced through data redistribution, for example dynamic range partitioning, which monitors the data distribution during partitioning and splits any range that contains too many tuples into several smaller ranges, in order to balance the number of tuples on each processor.
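The skew-aware idea of choosing split points from the observed value distribution, rather than from fixed equal-width ranges, can be sketched as follows. This is a one-shot quantile simplification of the dynamic approach described above, with invented function names.

```python
import bisect

def quantile_cuts(sample, n_parts):
    """Pick range boundaries at quantiles of the observed values, so each
    partition receives roughly the same number of tuples even when the
    values themselves are unevenly spread over their domain."""
    s = sorted(sample)
    return [s[(len(s) * i) // n_parts] for i in range(1, n_parts)]

def assign(value, cuts):
    # Index of the processor whose range contains this value.
    return bisect.bisect_right(cuts, value)

# Skewed values: squares cluster near zero but spread out toward 9801.
values = [x * x for x in range(100)]
cuts = quantile_cuts(values, 4)
counts = [0] * 4
for v in values:
    counts[assign(v, cuts)] += 1
```

With equal-width ranges over [0, 9801], most of these values would crowd into the first range; the quantile cuts instead give every processor the same tuple count, which is exactly the load balance dynamic range partitioning aims for.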
5. Conclusion
This report studied the main architectures of parallel databases (shared-nothing, shared-memory, and shared-disk) and some variants: the hierarchical architecture combining the above, and the scalable sharing architecture. One of the main problems in a parallel database is how to operate it with full parallelism. Because data distributions are not uniform, data skew occurs and degrades the degree of parallelism. Researchers have therefore proposed many algorithms and optimizations for every aspect of parallel database operation, such as parallel sorting, parallel query, and parallel join. They have also proposed several index schemes that make accesses and updates to the index parallel and independent. Because of limited time, this report only outlined the main idea of each algorithm; however, a reference is given for every algorithm, so the sources are easy to find.