Study Of Data Warehousing And Data Mining Computer Science Essay

Q1 Delineate the concept of 'metadata' and 'access tools' by giving examples.

Ans1 Meta means "about," so metadata is "about data," or, more specifically, "information about data." There is metadata that describes the fields and formats of databases and data warehouses. There is metadata that describes documents and document elements, such as Web pages, research papers, and so on. And there are metadirectories that describe how information is organized in directories. An important feature of metadata is that it provides concise information about documents and data that improves searching. For example, compare searching entire sets of documents for keywords as opposed to searching descriptive indexes of those documents.

Most people who are vaguely familiar with metadata think of database management systems. A database contains fields such as Name, Address, City, and so on. Metadata names these fields, describes the size of the fields, and may put restrictions on what can go in the field (for example, numbers only).

If you were to transfer a database file to someone without also giving them the metadata, the file would appear to the recipient as a long string of characters. The metadata specifies, in terms of the alignment of character blocks, how the data should be extracted into fields and records. Metadata is therefore information about how data is extracted and how it may be transformed. It is also about indexing and creating pointers into data. Database design is largely a matter of defining metadata schemas.
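The point about character blocks can be illustrated with a minimal sketch. It assumes a fixed-width file layout; the field names, offsets and sample data are hypothetical and chosen only for illustration. Without the layout, the raw file is just a string of characters; with it, the same string becomes records and fields.

```python
# Hypothetical metadata: field name, start offset, and width in characters.
FIELD_LAYOUT = [
    ("name", 0, 10),
    ("city", 10, 10),
    ("zip", 20, 5),
]
RECORD_WIDTH = 25  # total characters per record (assumed)


def parse_records(raw, layout=FIELD_LAYOUT, record_width=RECORD_WIDTH):
    """Split a raw character string into records and fields using the layout."""
    records = []
    for start in range(0, len(raw), record_width):
        block = raw[start:start + record_width]
        record = {name: block[offset:offset + width].strip()
                  for name, offset, width in layout}
        records.append(record)
    return records


# What the recipient sees without metadata: one long string of characters.
raw_file = (
    "Alice".ljust(10) + "New York".ljust(10) + "10001"
    + "Bob".ljust(10) + "Boston".ljust(10) + "02108"
)

for record in parse_records(raw_file):
    print(record)
```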

Anyone who uses an advanced word processing program such as Microsoft Word can view and edit document metadata. For example, on the File menu, choose Properties. A dialog box appears that has general information about the document such as creation date, size, and so on. A Summary tab has metadata fields such as Title, Subject, Author, Manager, Company, Category, Keywords, and Comments. This information follows the definition for metadata: it is data about the data in the document and you can search for documents by searching this information.

Metadata takes the form of named element/value pairs. For example, the element City may have the value New York. A schema defines the vocabulary of a particular set of metadata (that is, element names and formatting rules). The metadata may be included with the document or stored in separate files. A schema is a separate file that is referenced from the document.
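As a rough sketch of element/value pairs checked against a schema, the following assumes a very simple schema that only lists the allowed element names and the expected type of each value. The element names and the validate function are hypothetical, not part of any particular metadata standard.

```python
# Hypothetical schema: element name -> type the value must have.
SCHEMA = {
    "Title": str,
    "Author": str,
    "City": str,
    "Pages": int,
}


def validate(metadata, schema=SCHEMA):
    """Return a list of problems; an empty list means the metadata conforms."""
    problems = []
    for element, value in metadata.items():
        if element not in schema:
            problems.append(f"unknown element: {element}")
        elif not isinstance(value, schema[element]):
            problems.append(f"wrong type for {element}")
    return problems


# Metadata as named element/value pairs.
doc_metadata = {"Title": "Sales Report", "City": "New York", "Pages": 12}
print(validate(doc_metadata))
```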

Access tools

The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the data warehouse using front-end tools. Many of these tools require an information specialist, although many end users develop expertise in the tools. Tools fall into four main categories: query and reporting tools, application development tools, online analytical processing tools, and data mining tools.

Query and Reporting tools can be divided into two groups: reporting tools and managed query tools. Reporting tools can be further divided into production reporting tools and report writers. Production reporting tools let companies generate regular operational reports or support high-volume batch jobs such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end-users.

Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database. These tools are designed for easy-to-use, point-and-click operations that either accept SQL or generate SQL database queries.
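The metalayer idea can be sketched as a mapping from business terms to physical column expressions, from which the tool generates SQL on the user's behalf. The table names, column names and the build_query helper below are all assumptions made for illustration; real managed query tools are considerably more elaborate.

```python
# Hypothetical metalayer: business terms -> physical column expressions.
METALAYER = {
    "Customer": "c.cust_name",
    "Region": "c.region_cd",
    "Revenue": "SUM(s.amount)",
}
# Assumed join between two hypothetical tables.
FROM_CLAUSE = "sales s JOIN customer c ON s.cust_id = c.cust_id"


def build_query(dimensions, measure):
    """Generate a SQL string from business terms picked by the user."""
    dim_cols = [METALAYER[d] for d in dimensions]
    select_list = ", ".join(dim_cols + [METALAYER[measure]])
    group_by = ", ".join(dim_cols)
    return f"SELECT {select_list} FROM {FROM_CLAUSE} GROUP BY {group_by}"


print(build_query(["Region"], "Revenue"))
```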

Often, the analytical needs of the data warehouse user community exceed the built-in capabilities of query and reporting tools. In these cases, organizations will often rely on the tried-and-true approach of in-house application development using graphical development environments such as PowerBuilder, Visual Basic and Forte. These application development platforms integrate well with popular OLAP tools and access all major database systems including Oracle, Sybase, and Informix.

OLAP tools are based on the concepts of dimensional data models and corresponding databases, and allow users to analyze the data using elaborate, multidimensional views. Typical business applications include product performance and profitability, effectiveness of a sales program or marketing campaign, sales forecasting and capacity planning. These tools assume that the data is organized in a multidimensional model.

A critical success factor for any business today is the ability to use information effectively. Data mining is the process of discovering meaningful new correlations, patterns and trends by digging into large amounts of data stored in the warehouse using artificial intelligence, statistical and mathematical techniques.

Q2 Discuss the concept of data warehouse administration and management in detail

Ans2 Data Warehouse Administration and Management

Data warehouses tend to be as much as 4 times as large as related operational databases, reaching terabytes in size depending on how much history needs to be saved. They are not synchronized in real time to the associated operational data but are updated as often as once a day if the application requires it.

In addition, almost all data warehouse products include gateways to transparently access multiple enterprise data sources without having to rewrite applications to interpret and utilize the data. Furthermore, in a heterogeneous data warehouse environment, the various databases reside on disparate systems, thus requiring inter-networking tools. The need to manage this environment is obvious.

Managing data warehouses includes security and priority management; monitoring updates from the multiple sources; data quality checks; managing and updating metadata; auditing and reporting data warehouse usage and status; purging data; replicating, subsetting and distributing data; backup and recovery; and data warehouse storage management.

Q3 What are the various DBMS schemas for decision support? Elaborate.

Ans3 Almost every industry has now begun to implement data warehouses. Traditional OLTP database systems were simple and were not designed to meet data warehouse requirements. Data warehouse designers were therefore forced to choose a data model and a corresponding database schema. As data warehousing continued to mature, new approaches to schema design emerged. The star schema is now widely accepted for data warehousing.

Data layout for best access - Many efficient operational systems, such as payroll and inventory, have been developed. Early database systems were complex to develop and very difficult to understand. The most powerful solution to this problem was the RDBMS, which is based on mathematical principles and predicate logic. The key element of database design expertise is now focused on developing the data model and RDBMS schema so that the corresponding RDBMS can achieve maximum operational efficiency.

Multidimensional data model - Data is viewed as a multidimensional cube, which can answer a number of questions, whereas a flat view of the same data answers only one question at a time.

Star schema - The star schema (also called star-join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema, and is more effective for handling simpler queries.

The facts that the data warehouse helps analyze are classified along different dimensions:

The fact table holds the main data. It includes a large amount of aggregated data, such as price and units sold. There may be multiple fact tables in a star schema.

Dimension tables, which are usually smaller than fact tables, include the attributes that describe the facts. Often this is a separate table for each dimension. Dimension tables can be joined to the fact table(s) as needed.

Dimension tables have a simple primary key, while fact tables have a set of foreign keys which make up a compound primary key consisting of a combination of relevant dimension keys.

It is common for dimension tables to consolidate redundant data in the most granular column, so they are usually in second normal form. Fact tables are usually in third normal form because all data depends on either one dimension or all of them, not on combinations of a few dimensions.

The star schema is a way to implement multi-dimensional database (MDDB) functionality using a mainstream relational database: given most organizations' commitment to relational databases, a specialized multi-dimensional DBMS is likely to be both expensive and inconvenient.

Another reason for using a star schema is its simplicity for users: queries are never complex because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies to other tables that are possible in a better normalized snowflake schema.

Bitmap Indexes

Bitmap indexes are widely used in data warehousing environments. The environments typically have large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:

Reduced response time for large classes of ad hoc queries

Reduced storage requirements compared to other indexing techniques

Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory

Efficient maintenance during parallel DML and loads

Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table.

An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replaces a list of rowids. Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap index provides the same functionality as a regular index. If the number of different key values is small, bitmap indexes save space.
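The structure described above can be sketched in a few lines. This is a toy illustration, not how any particular database stores its bitmaps: it assumes row ids are simply the positions 0, 1, 2, ... and uses a Python integer as the bitmap for each key value. The function names and sample column are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct key value to an integer used as a bitmap.

    Bit i of the bitmap is set when row i holds that key value.
    """
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_from_bitmap(bitmap):
    """Mapping function: convert set bit positions back to row ids."""
    row_ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            row_ids.append(position)
        bitmap >>= 1
        position += 1
    return row_ids


# Hypothetical low-cardinality column: three distinct values over six rows.
region = ["east", "west", "east", "north", "west", "east"]
region_index = build_bitmap_index(region)

print(rows_from_bitmap(region_index["east"]))
```

With only three distinct values, the index holds three bitmaps regardless of how many rows the table has, which is where the space saving for low-cardinality columns comes from.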

Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. Bitmap indexes are primarily intended for data warehousing applications where users query the data rather than update it. They are not suitable for OLTP applications with large numbers of concurrent transactions modifying the data. Parallel query and parallel DML work with bitmap indexes as they do with traditional indexes. Bitmap indexing also supports parallel create indexes and concatenated indexes.
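To see why multiple WHERE conditions suit bitmaps, consider the following sketch. It assumes one bitmap per key value, held as a Python integer, and a small hypothetical customer table stored column by column. The conditions are combined with a bitwise AND before any row is fetched.

```python
def bitmap_for(column, wanted):
    """Return a bitmap (as an int) with bit i set when column[i] == wanted."""
    bits = 0
    for row_id, value in enumerate(column):
        if value == wanted:
            bits |= 1 << row_id
    return bits


# Hypothetical customer table, one list per column.
gender = ["F", "M", "F", "F", "M", "F"]
region = ["east", "west", "west", "east", "east", "west"]
status = ["gold", "gold", "silver", "gold", "silver", "gold"]

# WHERE gender = 'F' AND region = 'east' AND status = 'gold'
matches = (bitmap_for(gender, "F")
           & bitmap_for(region, "east")
           & bitmap_for(status, "gold"))

# Only rows whose bit survived every AND need to be read from the table.
matching_rows = [i for i in range(len(gender)) if (matches >> i) & 1]
print(matching_rows)
```

An OR condition would use the `|` operator in the same way; in both cases rows that fail the combined test are discarded without touching the table.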

The advantages of using bitmap indexes are greatest for columns in which the ratio of the number of distinct values to the number of rows in the table is under 1%. We refer to this ratio as the degree of cardinality. A gender column, which has only two distinct values (male and female), is ideal for a bitmap index. However, data warehouse administrators also build bitmap indexes on columns with higher cardinalities.

For example, on a table with one million rows, a column with 10,000 distinct values is a candidate for a bitmap index. A bitmap index on this column can outperform a B-tree index, particularly when this column is often queried in conjunction with other indexed columns. In fact, in a typical data warehouse environment, a bitmap index can be considered for any non-unique column.
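The degree of cardinality is simple arithmetic, sketched below. The 1% threshold is the rule of thumb quoted in the text, not a hard limit, and the function names are hypothetical. The sample column is a scaled-down version of the example (100 distinct values in 10,000 rows gives the same 1% ratio as 10,000 in one million).

```python
def degree_of_cardinality(column):
    """Ratio of the number of distinct values to the number of rows."""
    return len(set(column)) / len(column)


def is_bitmap_candidate(column, threshold=0.01):
    """Rule of thumb from the text: a ratio of about 1% or less."""
    return degree_of_cardinality(column) <= threshold


# Scaled-down stand-in for the example: 100 distinct values in 10,000 rows.
scaled_column = [i % 100 for i in range(10_000)]

print(degree_of_cardinality(scaled_column))
print(is_bitmap_candidate(scaled_column))
```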

Characteristic of Bitmap Indexes

Columns with very few unique values (low cardinality) are good candidates. A column whose cardinality is at or below 0.1% is an ideal candidate; columns in the 0.2% to 1% range are also worth considering.

Tables that have no or little insert/update are good candidates (static data in warehouse)


Stream of bits: each bit relates to a column value in a single row of the table.

create bitmap index person_region on person (region)

Advantage of Bitmap Indexes

Bitmap indexes have a highly compressed structure, which makes them fast to read, and that structure also makes it possible for the system to combine multiple indexes for fast access to the underlying table.

Compressed indexes, like bitmap indexes, represent a trade-off between CPU usage and disk space usage. A compressed structure is faster to read from disk but takes additional CPU cycles to decompress for access - an uncompressed structure imposes a lower CPU load but requires more bandwidth to read in a short time.

One belief concerning bitmap indexes is that they are only suitable for indexing low-cardinality data. This is not necessarily true, and bitmap indexes can be used very successfully for indexing columns with many thousands of different values.

Disadvantage of Bitmap Indexes

The reason for confining bitmap indexes to data warehouses is that the overhead of maintaining them is enormous. A modification to a bitmap index requires a great deal more work on behalf of the system than a modification to a B-tree index. In addition, concurrency for modifications on bitmap indexes is very poor.


Q4 How can we map the data warehouse to multiprocessor architecture? Elaborate.

Ans4 Relational database technology for data warehouse - The volume of data in the data warehouse is increasing at a rapid rate, so better performance and scalability have become a real necessity. This technology pursues two goals:

Speed up - Taking less time to execute the same request on the same amount of data.

Scale up - The ability to obtain the same performance on the same request as the database size increases.
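The two goals are often expressed as ratios of elapsed times. The sketch below uses that common formulation; the function names and the timings in the example are hypothetical. A speed-up equal to the factor by which hardware was increased, or a scale-up of 1.0, would indicate ideal (linear) behavior.

```python
def speed_up(time_small_system, time_large_system):
    """Same request, same data: how many times faster the larger system is."""
    return time_small_system / time_large_system


def scale_up(time_small_system_small_data, time_large_system_large_data):
    """Larger data on a larger system: 1.0 means performance was preserved."""
    return time_small_system_small_data / time_large_system_large_data


# Hypothetical timings in seconds.
print(speed_up(120.0, 30.0))   # one processor vs. four processors, same data
print(scale_up(60.0, 60.0))    # 4x the data on 4x the processors
```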

Types of parallelism

Parallel hardware architectures are exploited by implementing multiserver and multithreaded systems, which can handle a large number of client requests efficiently.

Interquery parallelism - It is implemented on SMP systems, in which different server threads handle multiple requests at the same time. It helps increase throughput.

Intraquery parallelism - It decomposes a serial SQL query into lower-level operations such as scan, join, sort and aggregation. These operations are then executed concurrently.

Parallel execution is done in two ways:

Horizontal parallelism - The database is partitioned across multiple disks, so a task such as a table scan can run concurrently against different partitions.

Vertical parallelism - The component query operations are executed in parallel, in a pipelined manner.
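The two styles can be contrasted in a small sketch. It is only an illustration of the structure: the partitions, the sample data and the function names are hypothetical, Python threads stand in for database processors, and the generator pipeline runs on a single thread, so it shows the pipelined flow of rows between operations rather than true simultaneous execution.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table, horizontally partitioned across three "disks".
partitions = [
    [("east", 100), ("west", 250)],
    [("east", 75), ("north", 300)],
    [("west", 50), ("east", 25)],
]


def scan_partition(rows, region):
    """Scan one partition and return a partial sum for the given region."""
    return sum(amount for r, amount in rows if r == region)


def parallel_total(region):
    """Horizontal parallelism: the same scan runs on every partition at once."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda rows: scan_partition(rows, region), partitions)
        return sum(partials)


def pipelined_total(region):
    """Vertical parallelism in miniature: scan -> filter -> aggregate.

    Each row flows through the stages without waiting for the previous
    stage to finish the whole table.
    """
    scanned = (row for rows in partitions for row in rows)
    filtered = (amount for r, amount in scanned if r == region)
    return sum(filtered)


print(parallel_total("east"), pipelined_total("east"))
```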

Database architecture for parallel processing

To take advantage of parallelism in shared-memory and distributed-memory environments, an adaptable parallel database software architecture is required. There are three main DBMS software architecture styles:

Shared memory architecture

It is a traditional approach that is very simple to implement and is successful up to the point where it runs into scalability limitations. Here a single RDBMS server utilizes all processors, accesses all memory, and accesses the entire database, providing the user with a consistent single system image. All processors have access to all data, which is partitioned across local disks.

Shared disk architecture

It is based on the concept of shared ownership of the entire database between RDBMS servers. Each RDBMS server can read, write, update and delete records from the same shared database. This architecture imposes some constraints on scalability. On the other hand, it can help reduce performance bottlenecks resulting from data skew and can significantly increase system availability.

Shared nothing architecture

The data here is partitioned across all disks and the DBMS is partitioned across multiple co-servers. It parallelizes the execution of a SQL query across multiple processing nodes. Each processor has its own memory and disk and communicates with other processors by exchanging messages and data over the interconnection network. It offers near-linear scalability, but it is the most difficult to implement.
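A rough sketch of the shared-nothing idea follows. It assumes rows are assigned to nodes by hashing a key, that each node aggregates only its own rows, and that a coordinator merges the partial results the nodes send back. The node count, the sample data and all function names are hypothetical, and ordinary function calls stand in for messages over the network.

```python
import zlib
from collections import defaultdict

NODE_COUNT = 3  # assumed number of processing nodes


def node_for(key, node_count=NODE_COUNT):
    """Pick the node that owns a key, using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def partition(rows, node_count=NODE_COUNT):
    """Distribute rows so that each node holds a disjoint share of the data."""
    nodes = [[] for _ in range(node_count)]
    for customer, amount in rows:
        nodes[node_for(customer, node_count)].append((customer, amount))
    return nodes


def local_aggregate(rows):
    """Work done on one node, using only that node's own memory and disk."""
    totals = defaultdict(int)
    for customer, amount in rows:
        totals[customer] += amount
    return dict(totals)


def run_query(rows):
    """Coordinator: collect each node's partial result and merge them."""
    merged = {}
    for node_rows in partition(rows):
        # Keys never overlap across nodes because of the hash partitioning.
        merged.update(local_aggregate(node_rows))
    return merged


sales = [("acme", 10), ("bolt", 5), ("acme", 7), ("core", 3), ("bolt", 1)]
print(run_query(sales))
```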

Q5 What is the role of OLAP in data mining?

Ans5 In large data warehouse environments, many different types of analysis can occur. In addition to SQL queries, you may also apply more advanced analytical operations to your data. Two major types of such analysis are OLAP (Online Analytical Processing) and data mining. Rather than having a separate OLAP or data mining engine, Oracle has integrated OLAP and data mining capabilities directly into the database server. Oracle OLAP and Oracle Data Mining are options to the Oracle9i Database.


Oracle OLAP adds the query performance and calculation capability previously found only in multidimensional databases to Oracle's relational platform. In addition, it provides a Java OLAP API that is appropriate for the development of internet-ready analytical applications. Unlike other combinations of OLAP and RDBMS technology, Oracle OLAP is not a multidimensional database using bridges to move data from the relational data store to a multidimensional data store. Instead, it is truly an OLAP-enabled relational database. As a result, Oracle9i provides the benefits of a multidimensional database along with the scalability, accessibility, security, manageability, and high availability of the Oracle9i database. The Java OLAP API, which is specifically designed for internet-based analytical applications, offers productive data access.

Data Mining

Oracle enables data mining inside the database for performance and scalability. Some of the capabilities are:

An API that provides programmatic control and application integration

Analytical capabilities with OLAP and statistical functions in the database

Multiple algorithms: Naïve Bayes, decision trees, clustering, and association rules

Real-time and batch scoring modes

Multiple prediction types

Association insights

OLAP and Data Mining

OLAP and data mining are used to solve different kinds of analytic problems:

OLAP provides summary data and generates rich calculations. For example, OLAP answers questions like "How do sales of mutual funds in North America for this quarter compare with sales a year ago? What can we predict for sales next quarter? What is the trend as measured by percent change?"

Data mining discovers hidden patterns in data. Data mining operates at a detail level instead of a summary level. Data mining answers questions like "Who is likely to buy a mutual fund in the next six months, and what are the characteristics of these likely buyers?"

OLAP and data mining can complement each other. For example, OLAP might pinpoint problems with sales of mutual funds in a certain region. Data mining could then be used to gain insight about the behavior of individual customers in the region. Finally, after data mining predicts something like a 5% increase in sales, OLAP can be used to track the net income. Or, Data Mining might be used to identify the most important attributes concerning sales of mutual funds, and those attributes could be used to design the data model in OLAP.

Vendors sometimes add to the confusion when they claim their products support data mining, because these are often more appropriate for OLAP instead. OLAP involves "slicing and dicing" data using dimensions and measures of interest. For example, we may want to know how many SUVs were sold last month in a Midwest region at the sticker price. This question's dimensions include the type of vehicle, time, location, and price. With OLAP, the user directs the analysis and explores hypotheses or relationships. In most cases, the required computations are not mathematically complex but involve sorting through many rows of data.
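The SUV question can be sketched as a slice of a small cube followed by a sum. The fact rows, dimension names and helper functions below are hypothetical, and a plain Python list stands in for the OLAP store. Note that the user chooses which dimensions to fix, and the computation itself is just filtering and adding.

```python
from collections import defaultdict

# Hypothetical sales facts: (vehicle_type, month, region, price_band, units)
facts = [
    ("SUV", "2024-05", "Midwest", "sticker", 40),
    ("SUV", "2024-05", "Midwest", "discount", 25),
    ("SUV", "2024-05", "South", "sticker", 30),
    ("Sedan", "2024-05", "Midwest", "sticker", 55),
    ("SUV", "2024-04", "Midwest", "sticker", 35),
]
DIMENSIONS = ("vehicle_type", "month", "region", "price_band")


def slice_cube(rows, **fixed):
    """Slice: keep only rows whose dimensions equal the fixed values."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return [row for row in rows
            if all(row[i] == value for i, value in positions.items())]


def roll_up(rows, by):
    """Sum the units measure along one dimension."""
    i = DIMENSIONS.index(by)
    totals = defaultdict(int)
    for row in rows:
        totals[row[i]] += row[-1]
    return dict(totals)


# How many SUVs were sold in one month in the Midwest at the sticker price?
suv_slice = slice_cube(facts, vehicle_type="SUV", month="2024-05",
                       region="Midwest", price_band="sticker")
print(sum(row[-1] for row in suv_slice))

# Dice differently: all SUV sales, summed by region.
print(roll_up(slice_cube(facts, vehicle_type="SUV"), by="region"))
```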

In contrast, data mining involves the automated process of finding relationships and patterns in data. For example, a company might want to know what pattern of behaviors predicts that a customer might leave for a competitor. Using computationally complex algorithms (e.g., genetic algorithms), the software finds relationships that were previously unknown. The algorithm directs the analysis and identifies hypotheses or relationships that merit further investigation.

OLAP and data mining users have different characteristics. Those working with OLAP employ software from vendors such as Cognos, Hyperion, and MicroStrategy to access predefined reports, manipulate the data using available dimensions and measures, and (in the case of power users) create queries and reports for themselves and others.

Data mining analysts typically work with specialized software (e.g., Clementine from SPSS) to find the relationships that are important to the business. These analysts may be either highly skilled data mining professionals or businesspeople with good analytical and problem-solving skills who work with packaged data mining software in applications such as fraud detection. Analysts and the work they do can differ considerably.

Q6 Draw star schema diagram for sales database.

Ans6 A star schema consists of fact tables and dimension tables. Fact tables contain the quantitative or factual data about a business--the information being queried. This information is often numerical, additive measurements and can consist of many columns and millions or billions of rows. Dimension tables are usually smaller and hold descriptive data that reflects the dimensions, or attributes, of a business. SQL queries then use joins between fact and dimension tables and constraints on the data to return selected information.

Fact and dimension tables differ from each other only in their use within a schema. Their physical structure and the SQL syntax used to create the tables are the same. In a complex schema, a given table can act as a fact table under some conditions and as a dimension table under others. The way in which a table is referred to in a query determines whether a table behaves as a fact table or a dimension table.

Even though they are physically the same type of table, it is important to understand the difference between fact and dimension tables from a logical point of view. To demonstrate the difference between fact and dimension tables, consider how an analyst looks at business performance:

A salesperson analyzes revenue by customer, product, market, and time period.

A financial analyst tracks actuals and budgets by line item, product, and time period.

A marketing person reviews shipments by product, market, and time period.

The facts--what is being analyzed in each case--are revenue, actuals and budgets, and shipments. These items belong in fact tables. The business dimensions--the by items--are product, market, time period, and line item. These items belong in dimension tables.

For example, a fact table in a sales database, implemented with a star schema, might contain the sales revenue for the products of the company from each customer in each geographic market over a period of time. The dimension tables in this database define the customers, products, markets, and time periods used in the fact table.

A well-designed schema provides dimension tables that allow a user to browse a database to become familiar with the information in it and then to write queries with constraints so that only the information that satisfies those constraints is returned from the database.
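The sales star schema described above can be sketched with Python's built-in sqlite3 module. The table names, column names and sample rows are hypothetical. The fact table (sales) sits at the center and carries one foreign key per dimension table (customer, product, market, period), and those keys together form its compound primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, descriptive, one simple primary key each.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE market   (market_id   INTEGER PRIMARY KEY, region        TEXT);
CREATE TABLE period   (period_id   INTEGER PRIMARY KEY, quarter       TEXT);

-- Fact table: foreign keys to every dimension plus the numeric measure.
CREATE TABLE sales (
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id  INTEGER REFERENCES product(product_id),
    market_id   INTEGER REFERENCES market(market_id),
    period_id   INTEGER REFERENCES period(period_id),
    revenue     REAL,
    PRIMARY KEY (customer_id, product_id, market_id, period_id)
);

INSERT INTO customer VALUES (1, 'Acme'), (2, 'Bolt');
INSERT INTO product  VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO market   VALUES (1, 'East'), (2, 'West');
INSERT INTO period   VALUES (1, '2024-Q1'), (2, '2024-Q2');
INSERT INTO sales VALUES
    (1, 1, 1, 1, 100.0),
    (2, 1, 2, 1, 150.0),
    (1, 2, 1, 1, 50.0),
    (2, 2, 2, 2, 75.0);
""")

# Revenue by market for one quarter: joins go from the fact table
# to a single level of dimension tables, with a constraint on period.
rows = conn.execute("""
    SELECT m.region, SUM(s.revenue)
    FROM sales s
    JOIN market m ON s.market_id = m.market_id
    JOIN period p ON s.period_id = p.period_id
    WHERE p.quarter = '2024-Q1'
    GROUP BY m.region
    ORDER BY m.region
""").fetchall()
print(rows)
```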