Temporal Performance Analysis of Batch Operations in Index Structures

4071 words (16 pages) Essay in Information Systems

23/09/19 Information Systems Reference this

Disclaimer: This work has been submitted by a student. This is not an example of the work produced by our Essay Writing Service. You can view samples of our professional work here.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UK Essays.

Temporal Performance Analysis of Batch Operations in Index Structures (November 2018)

 

Abstract— The most famous commercial database softwares are disk-based database systems, i.e. all create, read, update and delete operations have to reference the primary storage location for data storage and retrieval. The requirement for high-performance data access,coupledwiththelatestadvancesinhardwaretechnologyhasledtotheemergenceofvariousmain-memorydatabasemanagement systems such as SQLite, TimesTen, MeanSQL etc. It is now possible to accommodate entire databases in the main memory itself. This removestheneedtoreferencetheprimarystorageandresultsinupto30timesfasterperformance.However,thesoftwarerequirements for main-memory databases is still struggling to provide efficient designs and better algorithms to allow for optimization. In this paper, we have proposed the use of B-trees and red-black trees as the structure for main-memory database systems by demonstrating and analyzing the time taken for batch operations of insertion, deletion and updation. Since both of these are index data structures, they inherently provide faster performances compared to non-indexstructures.

 

Index Terms—index structures, main-memory database, performance analysis

 

 

  1. INTRODUCTION

    I

N recent years, the hardware industry has grown by leaps and bounds. Only five years ago, upto 1 terabyte of data storage space was considered to be more than enough for consumer purposes, the latest processors with two cores were considered to be top of the line, and hard disks were the pinnacle of storage technology. Now, however, terabytes of data are generated within weeks, the latest processors have upto 16 cores and the upcoming solid-state drive technology is rapidly replacing ordinary hard drives. Amidst these advances, the database systems industry is also making major breakthroughs. The most widely used database management systems, such as Oracle, MongoDB, MySQL, Microsoft Access, Firebird, Cassandra etc. are all disk-based systems. They have to reference the primary storage, i.e. hard drives, solid state drives etc. for all operations. However, primary storage references are slow, and to a large extent, have reached saturation point of optimization. Thus, the requirement of high-performance

databases has led to the development of main-memory database systems.

This category of database systems makes it feasible to manipulate hundreds of megabytes or even several gigabytes of data, and store it entirely in volatile memory. Main memory references are much faster than primary storage references, and that has been the primary driving force behind this new system. Example applications include radio-frequency identification (RFID) readers to identify and track a large number of products, animals, vehicles, and even persons within a very short duration of time. Sometimes, the number of objects such as RFID tags per truck container, could be large, and the time duration, such as the passing time of a truck through a control gate, could be very short [1][2]. Many other applications also have the characteristics of high-rate data access, such as the process line of products manufacturing management [1][2][3], logistics management [4] goods retail management [2], and inventory management [2].

In this paper, we have talked about B-trees and red-black trees as potential designs for main-memory database systems. These are index structures, and while they involve additional writes take up additional space in memory, they inherently provide faster performances due to reduced look-up time. Researchers have explored different index structures such as the AVL tree, B+-tree [5], T-tree [6][7], R-tree [19] and their derivatives [8]. B+-trees [9][10][11][12], along with T-trees and AVL trees, would suffer from significant overheads in tree rebalancing when data are updated, inserted or deleted in a main-memory (and non-main-memory databases). Moreover, it has been observed that AVL trees are, in general, no better than T-trees in manipulation of data in short range even though each T-tree node could have more data than an AVL-tree node does (because each T-tree only has two pointers). While much research has been done on B+-trees, T-trees and AVL trees, B-trees and red-black trees have remained relatively unexplored, and they are the focus of this paper. We have taken these two structures, and performed the operations of insertion, deletion and updation, in large batches. The time taken has been observed after fixed numbers of operations as well as after the completion of each operation, for both B-trees and red-back trees. Finally, these observations have been plotted on a graph and bar-chart, and the performance of these index structures for batch operations has been compared and analyzed.

  1. LITERATURE REVIEW

Much research has been done on the topic of main-memory database systems, the use of index structures to design them as well as efficient algorithms for data storage and retrieval. Some of the problems tackled in such work, and the solutions proposed are described in this section.

  1.   Models and Issues in Data Stream Systems
  2. Babcock et. al. (2002) suggested that it is not feasible to simply load the arriving dot into а traditional DBMS as they are not designed for rapid and continuous loading of individual dot items, and they do not directly support the continuous queries that are typical of dot stream applications [13]. Decision trees are one form of synopsis used for prediction. They propose a query language and architecture for a DSMS query processor designed specifically to address the issues above.
  1. The Universal B-Tree for Multi-Dimensional Indexing: General Concepts

R. Bayer (1997) suggested that even though almost all database systems use B-trees as their main access method, one of the main drawbacks of the classical B-tree is that it works well only for one-dimensional data [14]. In the paper, he presents a new access structure, called UB-tree (for universal B-tree) for multidimensional data. The UB-tree is balanced and has all the guaranteed performance characteristics of B-trees, i.e. it requires linear space for storage and logarithmic time for the basic operations of INSERT, FIND and DELETE. In addition, the UB-tree has the fundamental property, that it preserves clustering of objects with respect to Cartesian distance. Therefore, the UB-tree shows its main strengths for multidimensional data. It has very high potential for parallel processing.

  1. Organization and Maintenance of Large Ordered Indexes

    1. McCreight et. al. (1972) discussed the problem of organizing and maintaining an index for a dynamically changing random access file [15]. They assume that the index itself is so voluminous that only rather small parts of it can be kept in main store at one time. Thus, the bulk of the index must be kept on some backup store. Since the data file itself changes, it must be possible not only to search the index and to retrieve elements, but also to delete and to insert keys–more accurately index elements– economically. They propose an index organization scheme in this paper which always allows retrieval, insertion, and deletion of keys in time proportional to log k I or better, where I is the size of the index, and k is a device dependent natural number which describes the page size such that the performance of the maintenance and retrieval scheme becomes near optimal.
    2. Implementation Techniques for Main Memory Database Systems

      1. J. DeWitt et. al. (1984) discussed the availability of very large, relatively inexpensive main memories, it is becoming possible keep large databases resident in main memory [16]. In this paper, the changes necessary to permit а relational database system to take advantage of large amounts of main memory are considered. АVL vs B+-tree access methods for main memory databases, hаsh-bаsed query processing strategies vs sort-merge, and study recovery issues are evaluated when most or all of the database fits in main memory. As expected, B+-trees are the most efficient method of data retrieval.
      2. The Ubiquitous B-Tree

D. Comer (1979) argued that despite the presence of large amount of secondary memory, a computer must retrieve аn item and place it in main memory before it can be processed [5]. А sequential request requires the searcher to examine the entire file, one folder at а time. Associated with а large, randomly accessed file in а computer system is in index which, like the labels on the drawers and folders of the file cabinet, speeds retrieval by directing the searcher to the small part of the file containing the desired item. They aim to introduce a general-purpose file access method using internal storage structure called the binary search tree because of their guaranteed low retrieval cost.

  1. Autonomic Tracing of Production Processes with Mobile and Agent-Based Computing

M.G.C.A. Cimino et. al. (2011) discussed about the simplification of tracing items in a supply chain, across different enterprises and through the full processes scope by implementing a supply chain traceability system with a high level of automation [17]. The proposed system adopts an agent-based approach, in which cooperative software agents find solutions to back-end tracing problems by self-organization.

  1. Issues in Data Stream Management

M. T. Özsu et. al. (2003) discuses about Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. Although this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for on-line analysis of rapidly changing data streams.

  1. Real-Time Access Control and Reservation on B-Tree Indexed Data

Tei-Wei Kuo et. al. (2000) proposed methodologies to control the access of B+-tree-indexed data in a batch and firm real-time fashion. He discussed methodologies to reduce the number of disk I/O to improve the system performance without introducing

more priority inversion. When the schedulability of requests with critical timing constraints is highly important, he proposed a mechanism for data reservation based on the idea of preemption level and the Stack Resource Policy.

  1. B-Tree Indexes for High Update Rates

G. Graefe (2006) surveys some techniques that let B-trees sustain very high update rates, up to multiple orders of magnitude higher than traditional B-trees, at the expense of query processing performance [18].

  1. R-Trees: A Dynamic Index Structure for Spatial Searching

Antonin Guttman (1984) describes a dynamic index structure called an R-tree [19] in order to handle spatial data efficiently as required in computer aided design and geo-data applications, and give algorithms for searching and updating it.

  1. PROBLEM DEFINITION

Main-memory database management systems are the future of bulk data storage and access However, as discussed earlier, the advances in software have not been able to keep up with the leaps and bounds in hardware technology. Thus, there is still a lot of ground to cover in terms of the design of main-memory database systems, as well as algorithms for faster and more efficient data management operations.

In this context, index data structures pose as the most obvious choice when it comes to the design of database systems. The various advantages and improvements in performance that they provide are discussed in the next section. While much research has been done on B+-trees, AVL trees, T-trees, R-trees and their derivatives, B-trees and red-black trees remain relatively unexplored. In this paper, we explore the possibility of using B-trees and red-black trees to design main-memory databases. Using Python scripts, we perform batch operations on B-trees and red-black trees and record the time taken for the operations. This time is then

analyzed, and conclusions are drawn based on the same.

  1. PROPOSED METHODOLOGY

An index data structure is different from an ordinary (non-indexed) data structure in the fact that it will require less time for data retrieval operations due to the presence of a data index, an additional structure that is used to quickly locate data without having to search every row in the structure (tree, graph, database table, etc.) every time a query is made. However, this comes at the cost of additional writes and storage space to maintain the index structure. Indexes can be created using one or more parameters (e.g. multiple rows of the database table), providing the basis for both rapid random lookups and efficient access of ordered data.

In database management systems, a number of records can only be sorted based on one field. Thus, it can be state that searching on a field that isn’t sorted requires a linear search which requires N/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn’t contain unique entries) then the entire table space must be searched at N block accesses. On the other hand, with a sorted field, a binary search may be used as this has log2N block accesses. Also, since the data is sorted given a non-key field, the rest of the table doesn’t need to be searched for duplicate values once a higher value is found. Thus, the performance increase is quite substantial.

Both B-trees and red-black trees are self-balancing tree structures. The self-balancing nature implies that after every operation, the tree will attempt to keep its height (maximal number of levels below the root) small to optimize insertions and deletions. Such structures can also be used for other abstract data structures such as associative arrays, priority queues and sets.

In a red-black tree, each node has an extra bit that is often interpreted as the colour of the node. Since there are only two colours, only one extra bit is needed. Moreover, since there is no more information pertaining to it being a red-black tree, its memory footprint is almost identical to a classic binary search tree. Balance is preserved by the maintenance of certain properties by the colours which collectively constrain how unbalanced the tree might get in the worst case. Even though the balancing of the trees is not perfect, it is good enough to allow the tree to perform all operations in O(log n) time.

Even though both the structures are self-balancing in nature, there is a vital difference between the type of data they are optimized for. A B-tree is a simple self-balancing tree structure, and is optimized for systems that read large blocks of data. On the other hand, a red-black tree is a binary search tree, meaning that apart from the self-balancing nature, the data is also arranged in order in the nodes of the tree. It is optimized for sparse operations of insertion and deletion.

Initially, we perform 5000 insertions to populate the database (initialization), followed by 1000 deletions and then 1000 insertions again. All of these are batch operations, i.e. they are done in bulk using Python scripts. For each type of tree, the time taken for each of the operations is recorded, and then added to calculate the total time for all operations. Next, a bar chart is plotted for time taken for each operation, for both tree types (the time for initialization is divided by 5 since there are 5000 initialization operations, but only 1000 deletions and insertions). This enables us to compare the performance of the two tree structures. Additionally, the time is recorded at every 200 operations, and a graph is plotted to observe and compare the progress of the operations, again for both tree types. All the operations have been done using Python, and graphs have been plotted using the matplotlib package in Python.

  1. RESULTS AND ANALYSIS

The B-Tree and red-black tree structures have been made to go through a batch insertion (initialization), deletion and a second insertion phase with the involvement of 5000, 1000 and 1000 records respectively. The experiments were simulated on a personal computer having Intel Core i5-7200U CPU @ 2.50GHz and 8GB RAM. Screenshots of the output obtained, along with the graphs and bar charts plotted, are described in this section.

Fig. 1. Total and split-up of time taken for initialization, deletion and insertion in B-Tree and RB-Tree.

Fig. 1 shows an instance of the time recorded for batch initialization, deletion and insertion in B-Tree and red-black tree. The time has also been recorded at intervals of every 200 operations for plotting of the graph.

Fig. 2. Total and split-up of time taken for initialization, deletion and insertion in B-Tree and RB-Tree.

Fig. 2 shows another instance of the time recorded for batch initialization, deletion and insertion in B-Tree and red-black tree. Again, the time has also been recorded at intervals of every 200 operations. When compared to Fig. 1, it is observed that the recorded times are consistent in both the instances.

Fig. 3. Progress of batch operations (5000 insertions, 1000 deletions and 1000 insertions again) in B-Tree and red-black tree.

Fig. 4. Progress of 5000 insertions (initialization) in B-Tree and red-black tree, recorded at every 200 operations.

From Fig. 3 and Fig. 4, it is observed that the time taken for batch insertion operations is approximately the same for both B- Tree and red-black tree. Moreover, for B-Tree, insertion and deletion operations take same amounts of time, because of which the graph for B-Tree is almost a straight line.

Fig. 5. Progress of 1000 deletions and 1000 insertions in B-Tree and red-black tree, recorded at every 200 operations.

Fig. 6. Comparison of time taken for 1000 operations each of insertion (initialization), deletion and again insertion.

From Fig. 5, it is observed that while batch deletion in B-Tree takes approximately the same time as batch insertion, it is 98 times slower in red-black trees. Fig. 6 shows compares the total time taken for batch insertion and deletion in B-Tree and red-black tree structures.

Fig. 7. Detailed comparison of time taken for 1000 operations each of insertion (initialization), deletion and again insertion.

Fig. 8. Comparison of difference in time taken for 1000 deletion operations in B-Tree and red-black tree.

Fig. 9 shows a more detailed comparison of the time taken for batch insertion and deletion in B-Tree and red-black trees. While insertion takes slightly more time in red-black trees, the time for deletion extends far beyond the figure. Finally, Fig. 8 gives a true indication of the huge difference in batch deletion time between B-trees and red-black trees.

  1. CONCLUSION

This work has been motivated by a strong demand for main-memory database system design. B-Trees and red-black trees have been proposed to design the databases firstly, due to their inherently improved performance because of their nature as an index data structure, and secondly because of their self-balancing nature which provides additional optimization for data retrieval. After obtaining the results, it was observed that B-Trees performed slightly better at batch insertion and distinctly better at batch deletion than red-black trees. In fact, in the latter operation, they performed more than 98 times better than red-black trees. The results were highly conclusive in the direction that B-Trees are optimized for large blocks of data, while red-black trees are more suited for applications with sparse data manipulation, or at least where batch deletions are avoided wherever possible.

  1. FUTURE SCOPE

There is a growing need for efficient main-memory database design, in applications including but not limited to process line of products manufacturing management, logistics management goods retail management, inventory management etc. This work can be carried forward to incorporate more data structures, and finally come to a conclusion on the choice of data structure depending on the requirements and applications of the database management software. Moreover, cutting-edge algorithms can also be developed to couple these structures to fulfill high-performance requirements.

REFERENCES

[1]      B.N. Lee, Y.-W. Kim, H.J. Kim, Evolution of RFID applications and its implications-standardization perspective, in: Proceedings of Portland International Center for Management of Engineering and Technology, 2007, pp. 903–910.

[2]      D.-L. Wu, W.W.Y. Ng, D.S. Yeung, H.-L. Ding, A brief survey on current RFID applications, in: Proceedings of the International Conference on Machine Learning and Cybernetics, Baoding, 2009, pp. 2330–2335.

[3]      M.G.C.A. Cimino, F. Marcelloni, Autonomic tracing of production processes with mobile and agent-based computing, Information Sciences 181 (2011), pp. 935–953.

[4]      J. Li, Q.S. Jia, X. Guan, X. Chen, Tracking a moving object via a sensor network with a partial information broadcasting sche me, Information Sciences 181 (2011) 4733–4753.

[5]      D. Comer, The ubiquitous B-tree, ACM Computing Surveys 11 (1979) 121–137.

[6]      T.J. Lehman, M.J. Carey, A study of index structures for main memory database management systems, in: Proceedings of the Twelfth International Conference on Very Large Data Bases, 1986, pp. 294–303.

[7]      H. Lu, Y.Y. Ng, Z. Tian, T-tree or b-tree: main memory database index structure revisited, in: Proceedings of the Eleventh Australasian Database Conference, 2000, pp. 65–73.

[8]      K.R. Choi, K.C. Kim, T*-tree: a main memory database index structure for real time applications, in: Proceeding of the Third International Workshop on Real-Time Computing Systems and Applications, 1996, pp. 81–88.

[9]      J.V.D. Bercken, B. Seeger, An evaluation of generic bulk loading techniques, in: Proceedings of the 27th International Conference on Very Large Data Bases, 2001, pp. 461–470.

[10]   G. Graefe, B-tree indexes for high update rates, ACM SIGMOD Record 35 (2006) 39–44.

[11]   P.-M. Kerttu, S.-S. Eljas, Y. Tatu, Concurrency control in B-trees with batch updates, IEEE Transactions on Knowledge and Data Engineering 8 (1996) 975– 984.

[12]   T.W. Kuo, C.H. Wei, K.Y. Lam, Real-time access control and reservation on B-tree indexed data, Journal of Real-Time Systems 19 (2000) 245–281.

[13]   B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in: Proceedings of the Twenty-First ACM SIGMOD- SIGACT-SIGART Symposium on Principles of Database Systems, 2002, pp. 1–16.

[14]   R. Bayer, The universal B-tree for multi-dimensional indexing: general concepts, in: Proceedings of the International Conference on World-Wide Computing and Its Applications, Lecture Notes on Computer Science, Springer-Verlag, 1997, pp. 198–209.

[15]   R. Bayer, C. McCreight, Organization and maintenance of large ordered indexes, Acta Informatica 1 (1972) 173–189.

[16]   D.J. DeWitt, R.H. Katz, F. Olken, L.D. Shapiro, M.R. Stonebraker, D. Wood, Implementation techniques for main memory database systems, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984, pp. 1–8.

[17]   Mario G. C. A. Cimino, Francesco Marcelloni, Autonomic tracing of production processes with mobile and agent-based computing, Information Sciences 181, 2011, pp. 935–953.

[18]   Goetz Graefe, B-Tree indexes for high update rates, ACM SIGMOD Record 35, Issue 1, March 2006, pp. 39-44.

[19]   Antonin Guttman, R-Trees: A dynamic index structure for spatial searching, in Proceeding, SIGMOD ’84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data, 1984, pp. 47-57.

Cite This Work

To export a reference to this article please select a referencing stye below:

Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.

Related Services

View all

DMCA / Removal Request

If you are the original writer of this essay and no longer wish to have the essay published on the UK Essays website then please: