Analysis of Multidimensional Data Using Shared Nothing Desktop Server


Multi-dimensional analysis is used in decision support systems and statistical inference to find interesting information in large databases. Multidimensional databases suit OLAP and data mining because these applications require dimension-oriented operations on data. Traditional multidimensional databases store data in multidimensional arrays, on which the analytical operations are performed. Multidimensional arrays store dense data well, but most data sets are sparse in practice and need other, more efficient storage schemes.

The main challenge of multi-dimensional data analysis is the huge data size and the immense amount of computation. The complexity grows sharply as more dimensions, and the resulting intersection space, are considered. The data must be accessed, computed and aggregated along all dimensions with multi-level hierarchical grouping. This demands processing power and memory on a scale normally found in Symmetric Multi-Processing (SMP) computers.

Here we present an architecture for the analysis of multidimensional data using shared nothing desktop servers. The architecture runs on shared nothing desktop servers that together form a cluster. Multi-dimensional cube building, data storage, querying and computation are all distributed across the cluster nodes, so capacity and performance can be scaled by adding nodes.

Architecture

Figure 1: Multi-Dimensional Aggregation Architecture

Proposed Distributed Architecture

The main challenge for any multi-dimensional aggregation server is the data volume involved and the processing power and memory it requires. Modern multi-processor computers can meet this challenge, but the cost of such machines is exorbitant.

A more economical way to address the scalability issue is to use the processing power and memory of commodity desktop machines. Because these machines are independent (shared nothing), harnessing their combined capability is a big challenge. With the Message Passing Interface (MPI) we can build parallel applications that run on many shared nothing desktop computers connected by an ordinary network.

In this paper, we present a parallel architecture for a Multi-Dimensional Aggregation Server built on the Message Passing Interface.

Figure 2: Distributed Architecture for Multidimensional OLAP Server

The proposed server:

- Is a multi-dimensional OLAP (MOLAP) engine

- Is data-parallel software that can run on a cluster of commodity desktop machines

- Handles sparsity very well

- Avoids data explosion, because aggregate data is neither pre-computed nor stored

- Aggregates data at query time and has a low memory footprint

- Supports basic Multidimensional Expressions (MDX) for querying

- Supports an SQL-like syntax for dimension and cube management

- Supports a multi-user environment

- Employs a client-server architecture

- Runs the server as a parallel engine using Message Passing Interface (MPI) middleware

- Uses embedded Berkeley DB for efficient data management

- Has a command line utility for invoking commands

- Has client APIs for integrating with applications
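To illustrate what aggregating at query time over sparse data can look like, the following is a minimal sketch in Python. It is not the server's implementation: the cell layout, the dimension names and the helper functions `aggregate` and `month_to_quarter` are assumptions made purely for illustration. Only non-empty cells are stored, and every query scans them instead of reading pre-computed totals.

```python
from collections import defaultdict

# Hypothetical sparse fact data: only non-empty cells are stored.
# Key = (product, region, month); value = sales measure.
cells = {
    ("laptop", "east", "2009-01"): 120.0,
    ("laptop", "west", "2009-02"): 80.0,
    ("phone", "east", "2009-01"): 50.0,
    ("phone", "east", "2009-04"): 70.0,
}

DIMENSIONS = ("product", "region", "month")


def month_to_quarter(month):
    """Roll a 'YYYY-MM' member up one hierarchy level to 'YYYY-Qn'."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def aggregate(cells, group_by, rollup=None):
    """Sum the measure over all stored cells, grouped by the given dimensions.

    Nothing is pre-computed: every query scans the stored (non-empty) cells.
    `rollup` optionally maps a dimension name to a function that lifts a
    member to a coarser hierarchy level.
    """
    rollup = rollup or {}
    positions = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(float)
    for key, value in cells.items():
        group = tuple(
            rollup.get(DIMENSIONS[p], lambda m: m)(key[p]) for p in positions
        )
        totals[group] += value
    return dict(totals)


by_region = aggregate(cells, ["region"])
by_product_quarter = aggregate(
    cells, ["product", "month"], rollup={"month": month_to_quarter}
)
```

The trade-off is that each query pays the cost of a scan, which is why the work is spread across cluster nodes in the architecture described here.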

Distributed Strategy

The main challenge of multi-dimensional data analysis is the huge data size and the immense amount of computation. The complexity grows sharply as more dimensions, and the resulting intersection space, are considered. The data must be accessed, computed and aggregated along all dimensions with multi-level hierarchical grouping.

Figure 3: Message Passing Interface-Distributed Architecture

This demands processing power and memory on a scale normally found in Symmetric Multi-Processing (SMP) computers. However, SMP systems have the following disadvantages:

- Cost per computation is high

- Scalability is questionable

SMP does have the advantage of shared memory access for its parallel processors, which gives better performance due to lower latency.

The proposed architecture instead runs on shared nothing desktop servers that form a cluster. Multi-dimensional cube building, data storage, querying and computation are all distributed across the cluster nodes, so capacity and performance can be scaled by adding nodes.

The downside is the network latency of the interconnect, which is addressed by the following approaches:

- Use high-bandwidth interconnects such as Myrinet or Giganet

- Have a distribution strategy that minimizes data communication among the nodes

- Employ concurrent file access with data partitioning

- Employ asynchronous disk I/O that runs concurrently with the computations to hide the latency
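The last approach in the list, overlapping disk I/O with computation, can be sketched as follows. This is a simplified illustration in Python, assuming a background thread stands in for true asynchronous disk I/O. The partition file format (one number per line) and the names `read_partition` and `sum_partitions` are hypothetical. While partition i is being summed, the read of partition i+1 is already in flight.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import tempfile


def read_partition(path):
    """Read one partition file and return its numeric values."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


def sum_partitions(paths):
    """Sum all values, reading partition i+1 while partition i is summed."""
    total = 0.0
    if not paths:
        return total
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_partition, paths[0])
        for next_path in paths[1:]:
            values = pending.result()  # wait for the current partition
            pending = pool.submit(read_partition, next_path)  # start next read
            total += sum(values)  # compute while the next read runs
        total += sum(pending.result())  # last partition
    return total


# Build three small partition files in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as workdir:
    demo_paths = []
    for index, rows in enumerate([[1, 2], [3, 4], [5]]):
        path = os.path.join(workdir, f"part{index}.txt")
        with open(path, "w") as handle:
            handle.write("\n".join(str(v) for v in rows))
        demo_paths.append(path)
    demo_total = sum_partitions(demo_paths)
```

The benefit depends on the computation per partition being comparable to the read time; if one dominates, little latency is hidden.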

Design Decisions

Hardware - Parallel, distributed, shared nothing commodity cluster with a high-speed interconnect [Ethernet (100 Mbps) / Myrinet / Gigabit Ethernet (1 Gbps)], or Symmetric Multi-Processing (SMP) shared memory systems

File system - Native operating system file system

Middleware - Industry standard message passing interface (MPI)

Storage Format - Proprietary multi-dimensional data storage with efficient sparse data compression

Storage Engine - Berkeley DB embedded

Query language - Multidimensional Expressions (MDX)

Client API - C++ / Java Client API (SDK)

Data Import / Export - Delimited text files

Programming language - C++

The interconnect can be ordinary 100 Mbps Ethernet, or Gigabit Ethernet or Myrinet for better performance and lower latency.

OLAP

The term OLAP, an acronym for On Line Analytical Processing, dates from 1993, when it was introduced by E.F. Codd, the father of relational databases. It was coined as a slight modification of the traditional database term OLTP (On Line Transaction Processing). OLAP is an approach to quickly answering analytical queries that are dimensional in nature. Typical applications are business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas.

The output of an OLAP query is typically displayed in a matrix (or pivot) format. The dimensions form the rows and columns of the matrix, and the measures form the values. OLAP tools allow the user to query, browse, and summarize information in an efficient, interactive, and dynamic way. They are a vital component of both business intelligence and data mining technology, and provide an aggregated approach to analyzing large amounts of detailed data.

OLAP is also a category of software tools that provide analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a database management system (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data.
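The matrix (pivot) layout of an OLAP result can be made concrete with a small sketch. The fact rows and the `pivot` helper below are invented for illustration: one dimension supplies the row members, another the column members, and the summed measure fills the cells.

```python
# Hypothetical fact rows: (region, quarter, sales).
facts = [
    ("east", "Q1", 100),
    ("east", "Q2", 150),
    ("west", "Q1", 80),
    ("east", "Q1", 20),
]


def pivot(rows):
    """Return (row_members, column_members, matrix) with summed measures."""
    row_members = sorted({r for r, _, _ in rows})
    col_members = sorted({c for _, c, _ in rows})
    matrix = [[0 for _ in col_members] for _ in row_members]
    for r, c, value in rows:
        matrix[row_members.index(r)][col_members.index(c)] += value
    return row_members, col_members, matrix


regions, quarters, grid = pivot(facts)
for region, values in zip(regions, grid):
    print(region, dict(zip(quarters, values)))
```

Note that the cell for west in Q2 is empty in the source data and appears as 0 in the matrix, a small instance of the sparsity discussed elsewhere in this essay.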

OLAP systems have traditionally been categorized using the following taxonomy:

- Multidimensional (MOLAP)

- Relational (ROLAP)

- Hybrid (HOLAP)

Each type has certain benefits, although providers disagree about the specifics.

Some MOLAP implementations are prone to database explosion. Database explosion is a phenomenon in which MOLAP databases consume vast amounts of storage space when certain common conditions are met: a high number of dimensions, pre-calculated results and sparse multidimensional data. The typical mitigation technique is not to materialize all possible aggregations, but only the optimal subset of aggregations based on the desired trade-off between performance and storage.
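A short calculation shows why pre-calculating every aggregation becomes impractical as dimensions are added. The sketch below counts the distinct group-by combinations (often called cuboids) of a cube: each dimension can be grouped at any of its hierarchy levels or aggregated away entirely, so the count is the product of (levels + 1) over all dimensions. The function name `cuboid_count` and the example cube shapes are illustrative assumptions.

```python
from math import prod


def cuboid_count(levels_per_dimension):
    """Number of distinct group-by combinations (cuboids) in a cube.

    Each dimension contributes its hierarchy levels plus one extra choice:
    aggregating the dimension away entirely ("all").
    """
    return prod(levels + 1 for levels in levels_per_dimension)


flat_3 = cuboid_count([1, 1, 1])   # 3 dimensions, no hierarchies: 2**3
flat_10 = cuboid_count([1] * 10)   # 10 dimensions, no hierarchies: 2**10
deep_10 = cuboid_count([4] * 10)   # 10 dimensions, 4 levels each: 5**10
```

Going from 3 flat dimensions to 10 dimensions with four hierarchy levels each raises the count from 8 to 9,765,625 aggregations, before even counting the cells inside each one.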

MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques.

ROLAP is generally more scalable. However, large volume pre-processing is difficult to implement efficiently so it is frequently skipped. ROLAP query performance can therefore suffer.

Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use.

HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and MOLAP. It can generally pre-process quickly, scale well, and offer good function support.

Multi-Dimensional Aggregation Engine

The architecture performs multi-dimensional data aggregation using the Bit Encoded Storage Scheme (BESS), which addresses sparsity and data explosion efficiently. It makes use of MOLAP concepts but does not hold the data in memory, which would cause a large memory footprint. Data management is handled by the Berkeley DB database library, which stores, retrieves and updates the data using the file system.

A Multidimensional Expressions (MDX) interface is proposed to enable multi-dimensional querying of data. The aggregation services are exposed as interfaces to enable a client-server architecture. CORBA is used as middleware for exposing the interfaces; it enables language independence through the concept of language mapping.
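The general idea behind a bit-encoded key for sparse cells can be sketched as follows. The essay does not specify the actual BESS layout, so this is an assumption-laden illustration: each dimension index is given just enough bits for its cardinality, the fields are concatenated into one integer, and only non-empty cells are stored under that key. The class name `BitEncoder` and the example cardinalities are hypothetical, and a plain dictionary stands in for the on-disk store.

```python
class BitEncoder:
    """Pack one index per dimension into a single integer key."""

    def __init__(self, cardinalities):
        self.cardinalities = list(cardinalities)
        # Bits needed to represent indices 0..cardinality-1 (at least 1 bit).
        self.widths = [max(1, (c - 1).bit_length()) for c in self.cardinalities]

    def encode(self, indices):
        key = 0
        for index, size, width in zip(indices, self.cardinalities, self.widths):
            if not 0 <= index < size:
                raise ValueError("index out of range for its dimension")
            key = (key << width) | index
        return key

    def decode(self, key):
        indices = []
        for width in reversed(self.widths):
            indices.append(key & ((1 << width) - 1))
            key >>= width
        return tuple(reversed(indices))


# Assumed cube shape: 1000 products, 4 regions, 12 months.
encoder = BitEncoder([1000, 4, 12])

# Sparse storage: only cells that actually hold data get an entry.
sparse_cells = {
    encoder.encode((999, 3, 11)): 42.0,
    encoder.encode((5, 1, 0)): 8.5,
}
```

With these cardinalities each key needs 10 + 2 + 4 = 16 bits, whereas a dense array would reserve space for all 48,000 cells regardless of how many are populated.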

Multidimensional Expressions (MDX) as a Language

MDX emerged circa 1998, when it first began to appear in commercial applications. MDX was created to query OLAP databases, and has become widely adopted within the realm of analytical applications. MDX forms the language component of OLE DB for OLAP, and was designed by Microsoft as a standard for issuing queries to, and exchanging data with, multidimensional data sources. The language is rapidly gaining the support of many OLAP providers, and this trend is expected to continue.

In Microsoft SQL Server 2000 Analysis Services, MDX acts in two capacities: as an expression language that is used to calculate values, and as a query language that client applications use to retrieve data.

Berkeley DB

Oracle Berkeley DB is a family of open source embeddable databases that allows developers to incorporate within their applications a fast, scalable, transactional database engine with industrial grade reliability and availability. As a result, customers and end-users will experience an application that simply works, reliably manages data, can scale under extreme load, but requires no ongoing database administration. As a developer, you can focus on your application and be confident that Oracle Berkeley DB will manage your persistence needs.
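The embedded key-value style of storage can be illustrated with a small sketch. Note that this does not use Berkeley DB itself: Python's standard `dbm.dumb` module is used here only as a stand-in for an embedded, file-backed key-value store, and the key and value encodings (an 8-byte integer key, an 8-byte floating point value) are assumptions for the example.

```python
import dbm.dumb
import os
import struct
import tempfile

# Stand-in for an embedded store: dbm.dumb, NOT Berkeley DB.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "cube")

    # Write two cells: packed integer key -> packed double value.
    with dbm.dumb.open(path, "c") as store:
        store[struct.pack(">Q", 63995)] = struct.pack(">d", 42.0)
        store[struct.pack(">Q", 17)] = struct.pack(">d", 8.5)

    # Reopen read-only and fetch one cell by key.
    with dbm.dumb.open(path, "r") as store:
        value = struct.unpack(">d", store[struct.pack(">Q", 63995)])[0]
        count = len(store)
```

The application links the store as a library and reads or writes by key, with no separate database server process to administer.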

Shared Nothing Architecture

A shared nothing architecture (SN) is a distributed computing architecture where each node is independent and self-sufficient, and there is no single point of contention across the system. People typically contrast SN with systems that keep a large amount of centrally-stored state information, whether in a database, an application server, or any other similar single point of contention. While SN is best known in the context of web development, the concept predates the web: Michael Stonebraker at UC Berkeley used the term in a 1986 database paper, and it is possible that earlier references also exist.

Shared Nothing is popular for web development because of its scalability. As Google has demonstrated, a pure SN system can scale almost indefinitely simply by adding nodes in the form of inexpensive computers, since there's no single bottleneck to slow the system down. Three popular web development technologies, PHP, Django, and Ruby on Rails, all emphasize an SN approach, in contrast to technologies like J2EE that manage a lot of central state. An SN system may partition its data among many nodes (assigning different computers to deal with different users or queries), or may require every node to maintain its own copy of the application's data, using some kind of coordination protocol.
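Partitioning data among independent nodes can be sketched in a few lines. In the sketch below, plain Python dictionaries simulate the nodes and no real network or MPI calls are involved. The function names and the sample cells are hypothetical. Each cell is assigned to a node by a stable hash of its key, each node aggregates only its own cells, and only the small partial results are combined at the end, which keeps communication between nodes low.

```python
import zlib
from collections import Counter


def node_for(key, node_count):
    """Pick the owning node from a stable hash of the cell key."""
    return zlib.crc32(repr(key).encode("utf-8")) % node_count


def partition(cells, node_count):
    """Split cells into one private dictionary per simulated node."""
    nodes = [dict() for _ in range(node_count)]
    for key, value in cells.items():
        nodes[node_for(key, node_count)][key] = value
    return nodes


def local_totals(node_cells, dim):
    """Each node aggregates only its own cells (no shared state)."""
    totals = Counter()
    for key, value in node_cells.items():
        totals[key[dim]] += value
    return totals


def merge(partials):
    """Coordinator step: combine the small per-node partial results."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)
    return dict(combined)


# Hypothetical cells keyed by (product, region), with integer measures.
cells = {
    ("laptop", "east"): 120,
    ("laptop", "west"): 80,
    ("phone", "east"): 50,
    ("tablet", "west"): 30,
    ("tablet", "east"): 10,
}

nodes = partition(cells, 3)
by_region = merge(local_totals(node, 1) for node in nodes)
```

In an MPI program the partition step would roughly correspond to distributing the data across ranks and the merge step to a reduction, but the mapping shown here is only conceptual.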

Distributed Computing

Distributed computing is a method of computer processing in which different parts of a program run simultaneously on two or more computers that communicate with each other over a network. Distributed computing is a type of parallel computing, although the latter term is most commonly used for processing in which different parts of a program run simultaneously on two or more processors that are part of the same computer. Both types of processing require that a program be parallelized, that is, divided into sections that can run simultaneously. Distributed computing additionally requires that the division of the program take into account the different environments in which the sections will run. For example, two computers are likely to have different file systems and different hardware components.

An example of distributed computing is BOINC, a framework in which large problems can be divided into many small problems that are distributed to many computers. Later, the small results are reassembled into a larger solution.

Distributed computing is a natural result of using networks to let computers communicate efficiently. However, distributed computing is distinct from networking. The latter refers to two or more computers interacting with each other but not, typically, sharing the processing of a single program. The World Wide Web is an example of a network, but not an example of distributed computing.

Numerous technologies and standards are used to construct distributed computations, including some that are specially designed and optimized for that purpose, such as Remote Procedure Calls (RPC), Remote Method Invocation (RMI) and .NET Remoting.

Conclusion and Future work

Hydra cube is open source parallel software that provides scalable Online Analytical Processing (OLAP) capabilities such as aggregation, slicing and dicing of multi-dimensional data. The objective is to build parallel software that runs on inexpensive commodity desktop machines organized as a cluster farm, which is highly scalable and economical.

The architecture is realized as parallel daemon software that runs across multiple shared nothing desktop servers (or on SMPs) and executes the commands.

A web service is a possible future implementation; it would use the client SDK to interact with the server.

Acknowledgement

If words are considered symbols of approval and tokens of acknowledgement, then let the words play the heralding role of expressing our gratitude.

First and foremost, we are thankful to the Almighty, for his blessings in the successful completion of this dissertation and course.

We would like to express our sincere and hearty thanks to our guide Dr. S. Babu Sundar, Professor & Head, Department of Computer Applications, Cochin University of Science & Technology, Cochin, for granting us permission to work under his supervision and guidance.

Our heartfelt thanks go to Mr. Radhakrishnan Maniyani for his splendid help in the successful completion of our dissertation.

We wish to express our sincere gratitude to our family members and our friends for their encouragement and help in doing this work.
