Study On The Architecture Of Column Databases Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Database is a collection of meaningful and related data usually used by an organization. Databases have revolutionized Information Technology by introducing a centralized storage of huge amounts of data that can be inserted, deleted, updated and retrieved efficiently. Along with all the above mentioned uses, databases are further being used by organizations in decision making procedures i.e. as the Data Warehouse (DW). DWs are not intelligent. They provide statistical data and assist decision making by deriving information from the database. However, this concept is relatively new and still developing.

DWss are growing fast. Nearly 40% of organizations say data volumes are increasing as much as 50% per year, while 18% say their warehouses are doubling in size annually, according to an IDC survey*. DWs are facing large data volumes, complex data transformations and huge data aggregation. Therefore, to achieve better performance, a more efficient approach to data management is desired i.e. DBMS based on Column Database structure.

Columnar Databases emerged in 1970s, but huge improvement of hardware placed relational databases in the fore front.  Nowadays, a huge expansion of analytical (business intelligence/data warehouse) solutions working with large volume of data allow columnar databases to back in the game - as approach that could improve efficiency of analytical systems**.

A Columnar DBMS stores its content by column rather that by row. However, a database program must show data in a format of two-dimensional tables, but store it as one-dimensional strings. Take the following table as an example:

Fig No.1 Sample Database

* Reinvent The Data Warehouse With Column-Store Databases And Appliances by Doug Henschen from Information Week

**Columnar databases and data warehouse by Bojan Ćirić from Global Data Consulting

A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on. In a row-oriented database, accessing specific columns requires reading all records.

Fig No.2 Row Oriented Database

Data will be read in the following series from Fig No.1:

1, Smith, John, 10000, 2, Bolton, Nick, 50000, 3, Jones, Leanne, 40000

On the contrary, a column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. In a column-oriented database, only the columns in the query need to be retrieved.

Fig No.3 Column Oriented Database

The data will be read as follows from fig No.1:

1, 2, 3, Smith, Bolton, Jones, John, Nick, Leanne, 10000, 50000, 40000

Contrast with Standard Relational Model:

The major, simplified advantage of column oriented databases over row oriented databases is in the efficiency of hard-disk access. As the standard model retrieves all rows and values in addition to the required values, it ends up reading more data than required and increases query costs and I/O burdens. In response, system architects and DBAs often tune the environment for the different queries by building additional indexes, pre-aggregating data, and creating special materialized views and cubes. These require yet more processing time and consume additional persistent data storage. Because they are often quite query-specific, these tunings only address the performance of the queries that are known, and do not even touch upon general performance of ad hoc queries. However, some benefits and challenges of columnar approach are listed below*.

Benefits of Columnar Database:

Better Analytic Performance:

Instead of requiring separate indexes for optimally tuned queries, the data values themselves within each column form the index, reducing I/O, enabling rapid access to the data without the need for expanding the database footprint, all while simultaneously and dramatically improving query performance.

Rapid Joins & Aggregation:

Data access streaming along column-oriented data allows for incrementally computing the results of aggregate functions, which is critical for data warehouse applications. In addition, there is no requirement for different columns of data to be stored together; allocating columnar data across multiple processing units and storage allows for parallel accesses and aggregations as well, increasing the overall query performance.

Suitability for compression:

The columnar storage of data not only eliminates storage of multiple indexes, views and aggregations, but also facilitates vast improvements in compression, which can result in an additional reduction in storage while maintaining high performance.

Rapid data loading:

The typical process for loading data into a data warehouse involves extracting data into a staging area, performing transformations, joining data to create denormalized representations and loading the data into the warehouse as fact and dimension tables, and then creating the collection

* Reinvent The Data Warehouse With Column-Store Databases And Appliances by Doug Henschen from Information Week

of required indexes and views. In a row-based arrangement, all of the data values in each row need to be stored together, and then indexes must be constructed by reviewing all the row data. In a columnar arrangement the system effectively allows one to segregate storage by column. This means that each column is built in one pass, and stored separately, allowing the database system to load columns in parallel using multiple threads. Further, related performance characteristics of join processing built atop a column store is often sufficiently fast that the load-time joining required to create fact tables is unnecessary, shortening the latency from receipt of new data to availability for query processing. Finally, since columns are stored separately, entire table columns can be added and dropped without downing the system, and without the need to re-tuning the system following the change.

Challenges of Columnar Database:**

Load time:

Converting the data source into columnar format can be unbearably slow where tens or hundreds of gigabytes of data are involved. Organizing data by rows does have its advantages. Writing data to disk in row format is faster than doing so by columns. That is key for high-transaction database applications where data is constantly being read and written to the database, though markedly less important for data warehouses, where data is typically written just once and accessed many times after that*.

Incremental loads:

Incremental loads can be performance problematic.

Data compression:

Some columnar systems greatly compress the source data. However, uncompressing the data to read it can slow performance.

Structural limitations:

Columnar databases use different techniques to simulate a relational structure. Some require the same primary key on all tables, meaning the database hierarchy is limited to two levels.


Columnar databases major advantage is to get good performance on large databases. It is unsuitable for medium or small sized databases.

*Relational database pioneer says technology is obsolete By Eric Lai

**Gaining a Performance Edge Using a Column-Oriented Database Management System by David Loshin from Sybase White Paper

Present Status:

A recent study presented the following comparison:

Column database have still not gained much popularity amongst organizations. Even though they reduce the query response time dramatically, they are difficult to maintain. Major column-store database vendors, include ParAccel, Sybase, and Vertica. They have also introduced software-hardware bundles built on third-party hardware.

Future Prospects:

The recent launch of InfiniDB 2.0 will break the "data growth barriers" as stated by Jeff Vogel. Some industry analysts have predicted that by 2015, column databases will dominate data warehousing environments. The fact of the matter is that row-based architectures simply can't keep up with the ever-growing amount of data that companies need to put to work in next generation analytics applications. Many are already struggling to keep up and are making difficult tradeoffs in their businesses to stay with the legacy row based products.  This is especially true for one our more important vertical markets - online Internet digital ad serving. The future favors column oriented databases.  

Although the drawbacks of this database still need to be conquered. Ways have to be researched to make columnar databases scalable and how to decrease the load times etc.