Clustered And Nonclustered Indexes Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Indexes in databases are similar to indexes in books. In a book, an index allows you to find information quickly without reading the entire book. In a database, an index allows the database program to find data in a table without scanning the entire table. An index in a book is a list of words with the page numbers that contain each word. An index in a database is a list of values in a table with the storage locations of rows in the table that contain each value. Indexes can be created on either a single column or a combination of columns in a table and are implemented in the form of B-trees. An index contains an entry with one or more columns (the search key) from each row in a table. A B-tree is sorted on the search key, and can be searched efficiently on any leading subset of the search key. For example, an index on columns A, B, C can be searched efficiently on A, on A, B, and A, B, C.

Most books contain one general index of words, names, places, and so on. Databases contain individual indexes for selected types or columns of data: this is similar to a book that contains one index for names of people and another index for places. When it is create a database and tune it for performance, it should create indexes for the columns used in queries to find data.

Here is one example for the created table named Person in SQL. The following illustration shows how the index stores each emp_id value and points to the rows of data in the table with each value.

When SQL Server executes a statement to find data in the employee table based on a specified emp_id value, it recognizes the index for the emp_id column and uses the index to find the data. If the index is not present, it performs a full table scan starting at the beginning of the table and stepping through each row, searching for the specified emp_id value.

SQL Server automatically creates indexes for certain types of constraints (for example, PRIMARY KEY and UNIQUE constraints). You can further customize the table definitions by creating indexes that are independent of constraints.

The performance benefits of indexes, however, do come with a cost. Tables with indexes require more storage space in the database. Also, commands that insert, update, or delete data can take longer and require more processing time to maintain the indexes. When you design and create indexes, you should ensure that the performance benefits outweigh the extra cost in storage space and processing resources.

There are two indexes supported by Microsoft SQL Server: clustered and nonclustered:

Clustered indexes dictate the physical order of data stored in a table. Since a table can only have one physical order, there can only be one clustered index per table. Clustered indexes are efficient when searching for ranges of data since the data is already physically ordered.

Nonclustered indexes do not impact the underlying physical storage, but rather are made up of pointers to data rows. If a clustered index exists, the pointers in nonclustered indexes contain a reference to the location in the clustered index. These indexes are more compact than the data and can be scanned faster than the actual data table.

Clustered Indexes

Clustered index determines the physical order of data in a table. A clustered index is analogous to a telephone directory, which arranges data by last name. Table can contain only one clustered index, because the clustered index dictates the physical storage order of the data in the table. However, the index can include multiple columns (a composite index), like the way a telephone directory is ordered by last name and first name.

A clustered index is principally capable on columns that are often searched for ranges of values. After the row with the first value is found using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent. For example, if an application frequently executes a query to retrieve records between a range of dates, a clustered index can quickly locate the row containing the beginning date, and then retrieve all adjacent rows in the table until the last date is reached. This can help increase the performance of this type of query. Also, if there is a column(s) that is used frequently to sort the data retrieved from a table, it can be advantageous to cluster (physically sort) the table on that column(s) to save the cost of a sort each time the column(s) is queried.

Clustered indexes are also efficient for finding a specific row when the indexed value is unique. For example, in the table Person mention before, the fastest way to find a particular employee using the unique employee ID column emp_id is to create a clustered index or PRIMARY KEY constraint on the emp_id column.

PRIMARY KEY constraints create clustered indexes automatically if no clustered index already exists on the table and a nonclustered index is not specified when you create the PRIMARY KEY constraint.

On the other hand, a clustered index could be created on lname, fname (last name, first name), because employee records are often grouped and queried in this way rather than by employee ID.

Considerations

It is significant to define the clustered index key with as few columns as possible. If a large clustered index key is defined, any nonclustered indexes that are defined on the same table will be significantly larger because the nonclustered index entries contain the clustering key. The Index Tuning Wizard does not return an error when saving an SQL script to a disk with insufficient available space. For more information about how nonclustered indexes are implemented in Microsoft SQL Server 2000.

During analysis, The Index Tuning Wizard can consume significant CPU and memory resources. It is suggested that tuning should be performed against a test version of the production server rather than the production server. Furthermore, the wizard should be run on a separate computer from the computer running SQL Server. The wizard cannot be used to select or create indexes and statistics in databases on SQL Server version 6.5 or earlier.

Before clustered indexes are created, it should know how your data will be accessed. It has to be consider using a clustered index for:

Columns that contain a large number of distinct values.

Queries that return a range of values using operators such as BETWEEN, >, >=, <, and <=.

Columns that are accessed sequentially.

Queries that return large result sets.

Columns that are frequently accessed by queries involving join or GROUP BY clauses; typically these are foreign key columns. An index on the column(s) specified in the ORDER BY or GROUP BY clause eliminates the need for SQL Server to sort the data because the rows are already sorted. This improves query performance.

OLTP-type applications where very fast single row lookup is required, typically by means of the primary key. Create a clustered index on the primary key.

Clustered indexes are not a good choice for:

Columns that undergo frequent changes

This results in the entire row moving (because SQL Server must keep the data values of a row in physical order). This is an important consideration in high-volume transaction processing systems where data tends to be volatile.

Wide keys

The key values from the clustered index are used by all nonclustered indexes as lookup keys and therefore are stored in each nonclustered index leaf entry.

Using Clustered Indexes

Like I mention before, the written phone directory is a great example of a clustered index. Each entry in the directory represents one row of the table. A table can have only one clustered index. That is because a clustered index is the actual table sorted in order of the cluster key. At first glance, we might think that inserting a new row into the table will require all the rows after the inserted row to be moved on the disk. Luckily, this is not the case. The row will have to be inserted into the correct data page, and this might require a page split if there is not enough room on the page for the new row. A list of pointers maintains the order between the pages, so the rows in other pages will not have to actually move.

For example, if we wanted to find all the people who have the last name of "Smith" for a family reunion. How quickly could we locate all the names? Of course, it would take a matter of seconds to find all of them grouped together, possibly over a few pages. What if we wanted to locate everyone with the first name of "Jeff" in the book? Could it be done? Of course it could, but we would have to look at every entry in the entire book because the first name is the second column in our cluster key. This is called a clustered index scan, which is a very expensive operation if we work on a big table.

Here is an example using one of the tables in AdventureWorks. The Sales.SalesOrderDetail table has a clustered index on SalesOrderID plus SalesOrderDetailID. In Figure 1 we can se the the graphical estimated execution plan of the batch, paying particular attention to the Query cost when either the first or second column in the cluster key is used.

SELECT SalesOrderID, SalesOrderDetailID

FROM Sales.SalesOrderDetail

WHERE SalesOrderID = 58950

SELECT SalesOrderID, SalesOrderDetailID

FROM Sales.SalesOrderDetail

WHERE SalesOrderDetailID = 68531

Figure 1: Compare the query costs when either the first or second column is searched.

The first query in the batch will use a "Clustered Index Seek" while the second will use a "Clustered Index Scan". We notice that the first query will use 1% of the cost of the batch while the second query will use 99% of the cost. These results make sense when we are thinking about the fact that the first column of the index is SalesOrderID. When trying to find a particular row based on the SalesOrderDetailID in the second query, the entire table is searched.

One more search that we probably wouldn't attempt with the phone directory is looking for every entry with a particular last name or a particular first name. If we wanted to find all of the entries with the first name of "Jeff" or the last name of "Smith" we would have to search every name in the book. Here is a batch illustrating this point:

SELECT SalesOrderID, SalesOrderDetailID

FROM Sales.SalesOrderDetail

WHERE SalesOrderID = 43683

AND SalesOrderDetailID = 240

SELECT SalesOrderID, SalesOrderDetailID

FROM Sales.SalesOrderDetail

WHERE SalesOrderID = 43683

OR SalesOrderDetailID = 240

Figure 2: Using "OR" causes a scan.

The second query using "OR" in the WHERE clause uses 99% of the resources of the batch because the entire clustered index must be scanned.

Nonclustered Indexes

A nonclustered index is analogous to an index in a textbook. The data is stored in one place, the index in another, with pointers to the storage location of the data. The items in the index are stored in the order of the index key values, but the information in the table is stored in a different order (which can be dictated by a clustered index). If no clustered index is created on the table, the rows are not guaranteed to be in any particular order.

Analogous to the way of using an index in a book, Microsoft SQL Server 2000 searches for a data value by searching the nonclustered index to find the location of the data value in the table and then retrieves the data directly from that location. This makes nonclustered indexes the optimal choice for exact match queries because the index contains entries describing the exact location in the table of the data values being searched for in the queries. If the underlying table is sorted using a clustered index, the location is the clustering key value; otherwise, the location is the row ID (RID) comprised of the file number, page number, and slot number of the row. For example, to search for an employee ID (emp_id) in a table that has a nonclustered index on the emp_id column, SQL Server looks through the index to find an entry that lists the exact page and row in the table where the matching emp_id can be found, and then goes directly to that page and row.

Multiple Nonclustered Indexes

Some books include multiple indexes. For example, a gardening book can contain one index for the common names of plants and another index for the scientific names because these are the two most common ways in which the readers find information. The same is true for nonclustered indexes. It can be defined a nonclustered index for each of the columns commonly used to find the data in the table.

Considerations

Before nonclustered indexes are created, it should know how your data will be accessed. It has to be consider using a nonclustered index for:

Columns that contain a large number of distinct values, such as a combination of last name and first name (if a clustered index is used for other columns). If there are very few distinct values, such as only 1 and 0, most queries will not use the index because a table scan is usually more efficient.

Queries that do not return large result sets.

Columns frequently involved in search conditions of a query (WHERE clause) that return exact matches.

Decision-support-system applications for which joins and grouping are frequently required. Create multiple nonclustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.

Covering all columns from one table in a given query. This eliminates accessing the table or clustered index altogether.

Using Non-Clustered Indexes

The index in the back of a book is an example of a non-clustered index. A non-clustered index has the indexed columns and a pointer or bookmark pointing to the actual row. In this example it contains a page number. Another example could be a search done on Google or another of the search engines. The results on the page contain links to the original web pages. The thing to remember about non-clustered indexes is that it may have to retrieve part of the required information from the rows in the table. When using a book index, you will probably have to turn to the page of the book. When searching on Google, you will probably have to click the link to view the original page. If all of the information you need is included in the index, you have no need to visit the actual data.

Here is a batch and the graphical execution plan (Figure 3) when using the non-clustered index to search on the ProductID column. The second query retrieves a column that is not included in the index key:

SELECT ProductID, SalesOrderID, SalesOrderDetailID

FROM Sales.SalesOrderDetail

WHERE ProductID = 776

SELECT CarrierTrackingNumber, ProductID

FROM Sales.SalesOrderDetail

WHERE ProductID = 776

Figure 3: Retrieving a column not part of the index.

The first query cost is 0% of the batch. In this case the ProductID is retrieved from the index -- no need to look at the actual table. The SalesOrderID and SalesOrderDetailID are not defined in the index, but are automatically included since they comprise the primary key. The second query in the batch costs 100% of the resources. Even though we are searching on an indexed column, we must retrieve the CarrierTrackingNumber from the actual table. Unexpectedly, a "Clustered Index Seek" is the most expensive part of the query. To see why, we can run the following:

SET SHOWPLAN_TEXT ON

GO

SELECT CarrierTrackingNumber, ProductID

FROM Sales.SalesOrderDetail

WHERE ProductID = 776

Instead of executing the query, the execution plan is returned in text format (results abbreviated). The "LOOKUP" keyword designates that a bookmark lookup was used.

|--Clustered Index Seek

...

LOOKUP ORDERED FORWARD)

In the previous example, if the CarrierTrackingNumber was part of the index, the performance issue would be solved. We can run this script to modify the index:

USE [AdventureWorks]

GO

DROP INDEX [IX_SalesOrderDetail_ProductID] ON [Sales].[SalesOrderDetail] WITH ( ONLINE = OFF )

GO

USE [AdventureWorks]

GO

CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_ProductID] ON [Sales].[SalesOrderDetail]

(

[ProductID] ASC,

[CarrierTrackingNumber] ASC

)

Now when the query batch is run, the second query has the same cost as the first query because the CarrierTrackingNumber is now part of the index. The table doesn't have to be accessed. Below is the graphical execution plan (Figure 4):

Figure 4: Using a covering index

When all of the required columns are part of the index, it is called a "covering index". An index key can include up to 16 columns and can be up to 900 bytes wide. SQL Server 2005 has a new feature to create indexes that surpass these limits called "included columns". This allows you to include additional columns in the index over the 16 column limit or columns that would be too large to include.

While it can only have one clustered index per table, it can have up to 249 non-clustered indexes per table. If it ever have that many, rest assured, it have a design problem! An important thing to keep in mind is that while indexes can improve the performance of queries, indexes take up disk space and require resources to keep updated. If a table has four non-clustered indexes, every write to that table requires four additional writes to keep the indexes up to date.

Using two indexes in combination

One more interesting thing about non-clustered indexes is that SQL Server can use them in combination or along with the clustered index. The SalesOrderDetail table has a non-clustered index on the ModifiedDate column. This query batch shows the difference when a WHERE clause has two conditions. In the first query, there is no index on the second column used in the WHERE clause. In the second query, there is a separate index on each index in the WHERE clause. In this case, the SQL Server uses the two indexes in combination to process the query, and the performance of the second query is much better (Figure 5).

SELECT SalesOrderID

FROM Sales.SalesOrderDetail

WHERE ProductID = 804 and UnitPrice = 25

SELECT SalesOrderID

FROM Sales.SalesOrderDetail

WHERE ProductID = 804 and ModifiedDate = '6/1/2003'

Figure 5: Using two indexes in combination.

Recall how using the "OR" operator with a clustered index caused a clustered index scan. If there are individual indexes defined on two columns used in a WHERE clause with "OR", the performance is about the same as when "AND" is used. Here is an example comparing "AND" and "OR", and the costs are almost evenly divided (Figure 6):

SELECT SalesOrderID

FROM Sales.SalesOrderDetail

WHERE ProductID = 804 OR ModifiedDate = '6/1/2003'

SELECT SalesOrderID

FROM Sales.SalesOrderDetail

WHERE ProductID = 804 AND ModifiedDate = '6/1/200

Figure 6: Using "OR" with two indexes.

There are two more issues to consider concerning indexes and WHERE clauses. What happens when a function is used in the WHERE clause? In that case, the index will not be used because the function will have to be applied to every row in the table. What happens when a wildcard is used in the WHERE clause? In this case it depends on where the wildcard is located. If the first character of the search term is replaced with a wildcard, the index will not be used and a table scan will result. Think about our phone directory example for a minute. How easy would it be to find an entry if you did not know the first letter of the last name?

Conclusion

Strategically placed indexes can make a huge difference in the performance of an application, especially over time as the amount of data increases. It is important to remember that a clustered index is comprised of the actual rows sorted in order of the cluster key. A non-clustered index is comprised of the key columns plus a pointer to the actual rows. You can use the Estimated Execution Plan as a way to experiment. It allows you to compare two or more similar queries to see where most of the resources will be used in the batch and, therefore, which query will perform better. SQL Server can use non-clustered indexes in combination or along with the clustered index

Writing Services

Essay Writing
Service

Find out how the very best essay writing service can help you accomplish more and achieve higher marks today.

Assignment Writing Service

From complicated assignments to tricky tasks, our experts can tackle virtually any question thrown at them.

Dissertation Writing Service

A dissertation (also known as a thesis or research project) is probably the most important piece of work for any student! From full dissertations to individual chapters, we’re on hand to support you.

Coursework Writing Service

Our expert qualified writers can help you get your coursework right first time, every time.

Dissertation Proposal Service

The first step to completing a dissertation is to create a proposal that talks about what you wish to do. Our experts can design suitable methodologies - perfect to help you get started with a dissertation.

Report Writing
Service

Reports for any audience. Perfectly structured, professionally written, and tailored to suit your exact requirements.

Essay Skeleton Answer Service

If you’re just looking for some help to get started on an essay, our outline service provides you with a perfect essay plan.

Marking & Proofreading Service

Not sure if your work is hitting the mark? Struggling to get feedback from your lecturer? Our premium marking service was created just for you - get the feedback you deserve now.

Exam Revision
Service

Exams can be one of the most stressful experiences you’ll ever have! Revision is key, and we’re here to help. With custom created revision notes and exam answers, you’ll never feel underprepared again.