Multi-dimensional analysis is used in decision support systems and statistical inference to find interesting information in large databases. Multidimensional databases suit OLAP and data mining, since these applications require dimension-oriented operations on the data. Traditional multidimensional databases store data in multidimensional arrays on which analytical operations are performed. Multidimensional arrays are good for storing dense data, but most data sets are sparse in practice, so other, more efficient storage schemes are required.
The main challenge of multi-dimensional data analysis is its huge data size and the immense computation involved. The complexity becomes extreme as we consider more dimensions and the resulting intersection space. The data needs to be accessed, computed and aggregated along all dimensions, with multi-level hierarchical grouping. This demands processing power and memory of a scale normally found only in Symmetric Multi-Processing (SMP) computers.
Here we present an architecture for the analysis of multidimensional data using shared-nothing desktop servers. The architecture runs on shared-nothing desktop servers that form a cluster environment. Multi-dimensional cube building, data storage, querying and computation are all distributed across the cluster nodes, which gives unparalleled scalability and performance.
Figure 1: Multi-Dimensional Aggregation Architecture
Proposed Distributed Architecture
The main challenge for any multi-dimensional aggregation server is the data volume involved and the processing power and memory it requires. Modern multi-processor computers can address this challenge, but the costs of such machines are exorbitant.
A more economical way to address this scalability issue is to use the processing power and memory of commodity desktop machines. Since these are independent (shared-nothing) machines, harnessing their capability is a big challenge. By using the Message Passing Interface (MPI), we can build parallel applications that run on many shared-nothing desktop computers connected by an ordinary network.
In this paper, we present a parallel architecture for a Multi-Dimensional Aggregation Server using the Message Passing Interface.
Figure 2: Distributed Architecture for Multidimensional OLAP Server
The proposed server has the following characteristics:
- Is a multi-dimensional OLAP (MOLAP) engine
- Is data-parallel software that can run on a cluster of commodity desktop machines
- Handles sparsity very well
- Avoids data explosion, because aggregate data is neither pre-computed nor stored
- Aggregates data at query time and has a low memory footprint
- Supports basic Multi-Dimensional Expressions (MDX) for querying
- Supports SQL-like syntax for dimension and cube management
- Supports a multi-user environment
- Employs a client-server architecture
- Runs the server as a parallel engine using Message Passing Interface (MPI) middleware
- Uses embedded Berkeley DB for efficient data management
- Provides a command-line utility for invoking commands
- Provides client APIs for integrating with applications
As noted earlier, multi-dimensional data analysis involves huge data sizes and immense computation: the data must be accessed, computed and aggregated along all dimensions, with multi-level hierarchical grouping.
Figure 3: Message Passing Interface-Distributed Architecture
This demands processing power and memory of a scale normally found in Symmetric Multi-Processing (SMP) computers. But SMPs have the following disadvantages:
- Cost per computation is high
- Scalability is a question
SMPs do, however, have the advantage of shared-memory access among the parallel processors, which gives better performance due to lower latency.
The proposed architecture instead runs on shared-nothing desktop servers that form a cluster environment. Multi-dimensional cube building, data storage, querying and computation are all distributed across the cluster nodes, which gives excellent scalability and performance.
The downside is the latency of the network interconnect, which is addressed by the following approaches:
- Use high-bandwidth interconnects such as Myrinet / Giganet
- Adopt a distribution strategy that minimizes data communication among the nodes
- Employ concurrent file access with data partitioning
- Employ asynchronous disk I/O which is concurrent with the computations to hide the latency
Hardware - Parallel, distributed, shared-nothing commodity cluster with high-speed interconnect [Ethernet (100 Mbps) / Myrinet / Gigabit (1 Gbps)], or Symmetric Multi-Processing (SMP) shared-memory systems
File system - Native operating system file system
Middleware - Industry standard message passing interface (MPI)
Storage Format - Multi-dimensional proprietary data storage with efficient sparse data compression
Storage Engine - Berkeley DB embedded
Query language - Multi Dimensional Expression (MDX)
Client API - C++ / Java Client API (SDK)
Data Import / Export - Delimited text files
Implementation language - C++
The interconnect can be ordinary 100 Mbps Ethernet, or Gigabit Ethernet or Myrinet for better performance and lower latency.
The term On-Line Analytical Processing (OLAP) dates from 1993, when it was introduced by E. F. Codd, the father of relational databases. OLAP is an approach to quickly answering analytical queries that are dimensional in nature. Typical applications of OLAP are business reporting for sales and marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas. The term was created as a slight modification of the traditional database term OLTP (On-Line Transaction Processing). The output of an OLAP query is typically displayed in a matrix (or pivot) format: the dimensions form the rows and columns of the matrix, and the measures form the values. OLAP tools allow the user to query, browse, and summarize information in a very efficient, interactive, and dynamic way. They are a vital component of both business intelligence and data mining technology, providing an aggregated approach to analyzing large amounts of detailed data.
OLAP denotes a category of software tools that provide analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a database management system (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing that data.
OLAP systems have traditionally been categorized as multidimensional (MOLAP), relational (ROLAP), or hybrid (HOLAP).
Each type has certain benefits, although there is disagreement among providers about the specifics of those benefits.
Some MOLAP implementations are prone to database explosion. Database explosion is a phenomenon causing vast amounts of storage space to be used by MOLAP databases when certain common conditions are met: a high number of dimensions, pre-calculated results and sparse multidimensional data. The typical mitigation technique for database explosion is not to materialize all the possible aggregations, but only an optimal subset of aggregations based on the desired performance vs. storage trade-off.
MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques.
ROLAP is generally more scalable. However, large volume pre-processing is difficult to implement efficiently so it is frequently skipped. ROLAP query performance can therefore suffer.
Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use.
HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and MOLAP. It can generally pre-process quickly, scale well, and offer good function support.
Multi-Dimensional Aggregation Engine
The architecture performs multi-dimensional data aggregation using a Bit Encoded Storage Scheme (BESS), which addresses sparsity and data explosion efficiently. The architecture uses MOLAP concepts but does not keep the data in memory, which would cause a large memory footprint. Data management is handled by the Berkeley DB database library, which stores, retrieves and updates the data on the file system.
A Multi-Dimensional Expressions (MDX) interface is proposed to enable multi-dimensional querying of the data. The aggregation services are exposed as interfaces to enable a client-server architecture. CORBA is used as the middleware for exposing these interfaces; CORBA enables language independence through its concept of language mappings.
Multidimensional Expressions (MDX) as a Language
MDX emerged circa 1998, when it first began to appear in commercial applications. MDX was created to query OLAP databases, and has become widely adopted within the realm of analytical applications. MDX forms the language component of OLE DB for OLAP, and was designed by Microsoft as a standard for issuing queries to, and exchanging data with, multidimensional data sources. The language is rapidly gaining the support of many OLAP providers, and this trend is expected to continue.
The focus here is MDX as implemented in Microsoft SQL Server 2000 Analysis Services. MDX acts in two capacities within Analysis Services: as an expression language used to calculate values, and as a query language that client applications use to retrieve data.
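A representative basic MDX query of the kind such an engine would accept is shown below; the cube, measure and member names are illustrative only:

```
SELECT
  { [Measures].[Sales] }       ON COLUMNS,
  { [Time].[1998].Children }   ON ROWS
FROM [SalesCube]
WHERE ( [Region].[Asia] )
```

The query pivots the Sales measure against the months of 1998, sliced to the Asia region — the dimensions form the rows and columns of the result matrix, and the measure supplies the values.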
Oracle Berkeley DB is a family of open source embeddable databases that allows developers to incorporate within their applications a fast, scalable, transactional database engine with industrial grade reliability and availability. As a result, customers and end-users will experience an application that simply works, reliably manages data, can scale under extreme load, but requires no ongoing database administration. As a developer, you can focus on your application and be confident that Oracle Berkeley DB will manage your persistence needs.
Shared Nothing Architecture
A shared nothing architecture (SN) is a distributed computing architecture where each node is independent and self-sufficient, and there is no single point of contention across the system. People typically contrast SN with systems that keep a large amount of centrally-stored state information, whether in a database, an application server, or any other similar single point of contention. While SN is best known in the context of web development, the concept predates the web: Michael Stonebraker at UC Berkeley used the term in a 1986 database paper, and it is possible that earlier references also exist.
Shared Nothing is popular for web development because of its scalability. As Google has demonstrated, a pure SN system can scale almost indefinitely simply by adding nodes in the form of inexpensive computers, since there's no single bottleneck to slow the system down. Three popular web development technologies, PHP, Django, and Ruby on Rails, all emphasize an SN approach, in contrast to technologies like J2EE that manage a lot of central state. An SN system may partition its data among many nodes (assigning different computers to deal with different users or queries), or may require every node to maintain its own copy of the application's data, using some kind of coordination protocol.
Distributed computing is a method of computer processing in which different parts of a program run simultaneously on two or more computers that communicate with each other over a network. Distributed computing is a type of parallel computing, but the latter term most commonly refers to processing in which different parts of a program run simultaneously on two or more processors within the same computer. While both types of processing require that a program be parallelized (divided into sections that can run simultaneously), distributed computing also requires that the division of the program take into account the different environments in which the different sections will run. For example, two computers are likely to have different file systems and different hardware components.
An example of distributed computing is BOINC, a framework in which large problems can be divided into many small problems that are distributed to many computers; later, the small results are reassembled into a larger solution. Distributed computing is a natural result of the use of networks to allow computers to communicate efficiently. But distributed computing is distinct from networking: the latter refers to two or more computers interacting with each other, but not, typically, sharing the processing of a single program. The World Wide Web is an example of a network, but not an example of distributed computing.
There are numerous technologies and standards used to construct distributed computations, including some which are specially designed and optimized for that purpose, such as Remote Procedure Calls (RPC) or Remote Method Invocation (RMI) or .NET Remoting.
Conclusion and Future work
Hydra cube is open-source, parallel software that provides scalable Online Analytical Processing (OLAP) capabilities such as aggregation, slicing and dicing of multi-dimensional data. The objective is to build parallel software that can run on cheap commodity desktop machines as a cluster farm, making it highly scalable and economical.
The architecture is realized as parallel daemon software that runs across multiple shared-nothing desktop servers (or on SMPs) and executes the commands.
A Web Service interface is a possible future implementation, and it would use the client SDK to interact with the server.
Acknowledgements
If words are considered symbols of approval and tokens of acknowledgement, then let these words play the heralding role of expressing our gratitude.
First and foremost, we are thankful to the Almighty, for his blessings in the successful completion of this dissertation and course.
We would like to express our sincere and hearty thanks to our guide, Dr. S. Babu Sundar, Professor & Head, Department of Computer Applications, Cochin University of Science & Technology, Cochin, for granting us permission to work under his supervision and guidance.
Our heartfelt thanks to Mr. Radhakrishnan Maniyani for his splendid help in the successful completion of our dissertation.
We wish to express our sincere gratitude to our family members and our friends for their encouragement and help in doing this work.