CHAPTER 1 - Company Profile
Infinity computer systems is a Sri Lanka based company engaged in selling computers, computer components and software applications to the local buyers. The company had pledged to provide the local market with the latest in products in IT for an affordable price as they appear in the world market, gaining an edge over its competitors. This well known secret has been the formula of success of the company to achieve rapid growth in a short time span.
Company has wide range of computer hardware and software products on offer to the customers. One key market sector that the company is aiming to spread their reach in future is mobile handheld devices such as smart phones.
Having started the business in 1999 with just two employees, today Infinity computer systems has grown into one of the biggest IT and computer components vendors in Sri lanka and in the South Asian subcontinent. Currently the company has 3 branches One in Mumbai, India and one in Kandy; a town in central part of Sri lanka and the head quarters situated in Colombo and employ 102 full time staff in all three branches. Infinity computer systems has a market share of about 30% in Sri lanka. Furthermore company has realize the benefits of the boom in IT sector in India and is aiming to expand the Mumbai branch to serve as a major computer hardware and software vendor in India to increase the revenue.
Colombo head office and Mumbai branches maintains two large warehouses for storing directly imported products. Mumbai branch also directly engage with suppliers and manufacturers for buying stocks with minimal supervision form the Colombo head office. Kandy branch depends on Colombo head office for obtaining stocks and when dealing with major decisions.
At Infinity computer systems there's a qualified sales and customer service team available to provide customers with expert product selection assistance and support. They try to keep an open dialogue with customers, so feedback and suggestions are always welcomed and highly appreciated.
Be it a hard core gamer, student, small or medium sized business or an IT professional, Infinity Computer System has the right solution to cater every IT need.
Due to the popularity of the company, Everyday a large number of business transactions are carried out at infinity computer systems outlets, resulting in heavy usage of databases and database applications. Stock maintenance, Human resource management as well as and sales and marketing departments all rely on database systems in day to day operations in Infinity computer systems all 3 branches.
Currently Infinity computer systems utilize a centralized database system with relational database architecture to store data. The central database system is located in Colombo head office branch. Colombo, Mumbai and Kandy branches access the database in real time through the company WAN in day to day operations.
The database system consists of following major tables which are related to other sub tables.
Database manages separate tables for each branch that are logically related to above three main tables. The basic structure of the database table architecture is as follows.
Both Mumbai branch and Kandy branch share Colombo central database.
Disadvantages of Current System
Due to the centralized nature of the current system, company faces number of difficulties and uncovered the following disadvantages.
- The main disadvantage of the current systems is single point of failure. If central database fails all the branches affect by it and all business activities comes to a halt. Furthermore breakdown in WAN line also affect the accessibility to the network.
- Slow access time is a major concern as well. Because, all three branches access the database simultaneously, current database systems has difficulties in processing queries quickly leading to frustration by many users. This affects negatively the fast phased nature of the infinity computer systems working environment as well as customer serving time.
- The sluggish nature of the current system is not appropriate in any mean to the upcoming expansions of the India branch that the company management is planning to execute in near future. specially, Mumbai branch will need a database which has quick access and has the ability to sustain a rapid growth in both capacity as well as demand.
Areas Where Current System Lacks Security
- Current WAN network has a huge security hole as none of the site is protected by a firewall. This allows hackers and other malware programs such as worms to penetrate in to the network easily and it pose a great threat to the data at rest as well as those which are travelling on the network.
- The current system does not use any type of encryption when transferring data between remote sites and the main site. This pose a great threat for data such as Passwords and usernames as user authentication is done at the main site (Colombo) rather than the local sites. Lack of encryption means, anyone who intercept data get the access to user authentication information.
- Furthermore, the absence of encryption pose a threat to other data that transferred between main and remote sites as results for use queries. Because company database stores data that is vital and confidential to Infinity computer systems. If the data such as sales records and price listings fall in to rival business organizations' hands, they can gain advantage over Infinity computer systems.
- User authentication system of the current database system has less than adequate authentication mechanism which grant access to all areas of the database system with single point of authentication. This pose a threat to data as staff of the company with all levels of position have the easy access to almost all of the company data.
To overcome the problems currently faced by Infinity Computer systems, a distributed database system can be implemented. In a distributed database environment, database is distributed over many locations where end users have quick access.
Configuration and advantages of the new database system will be described in the next chapter.
CHAPTER 2 - Distributed Database
Distributed database is a collection of multiple, logically interrelated databases distributed over a computer network. In a distributed database environment, users have the ability to access data from different sources that are located at multiple locations.
When a database is distributed over many locations, it produces the challenge of retrieving the data from many locations and present it to system user. Furthermore managing the database becomes a critical function as well. This is where "distributed database management system" (DBMS) comes in to play. DBMS is a software system that manages the Distributed Database and provides the access mechanism to the users of the database. By tightly integrating with various systems and databases in a distributed environment, DBMS make the distribution transparent to the user.
Infinity Computer Systems Distributed Database Overview
Depending on the current geographical distribution of the branches and the WAN architecture, Infinity Computer Systems' database can be distributed in to three branches to maximize the productivity and access times as well as obtaining many other advantages over existing centralized database.
Out of many available, There are two major types of database design architectures we can consider when designing our company database system. They are,
- Multiple Site Processing, Multiple Site Data architecture using either
- Homogenous or
Distributed Database Management Systems(DDBMS).
Out of above two architectures we will implement Homogenous Multiple Site Processing, Multiple Site Data architecture for Infinity Computer Systems' distributed database.
Multiple Site Processing, Multiple Site Data
Multiple Site Processing, Multiple Site Data (MPMD) scenario refers to type of database that is fully distributed with multiple data processor support which includes transaction processors at multiple sites.
When all the sites of the logically related distributed database utilize and integrates the same type of Database Management System(DBMS) at all sites of the distributed network it's called Homogenous DBMS.
Infinity Computer Systems Distributed Database Architecture
Combining the concepts described above new distributed database architecture of the company can be demonstrated as below.
As shown in the figure 4 and 5 in the new architecture, Both Mumbai and Kandy branches will maintain a copy of the database that consisting of records related to their respective branches. Colombo branch will maintain two databases. one includes data related to Colombo branch and a main database which will hold records related to all three branches. The new distributed database use relational database architecture.
With this new architecture each branch will get the ability to access their very own database which is located locally. With this implementation, processing of the data also will be decentralized to all three branches. Apart from accessing the locally located database, both Kandy and Mumbai branches will be able to access the main database that is located in Colombo head office.
The distributed database management system (DDBMS) will consist of Distributed Query Processor(DQP), that handles distributed queries, a Distributed Transaction
Manager (DTM) for processing distributed transactions, a Distributed Metadata Manager (DMM) for managing distributed metadata, a Distributed Integrity Manger (DIM) for enforcing integrity among the various components of the distributed database system and a Distributed Security Manager (DSM) for enforcing security constraints across the database.
LANs of all three branches were redesigned To facilitate the modifications to the new distributed database architecture, following section describes each LAN site with their new features and configurations.
New LAN Architecture of Colombo Site
Colombo branch function as the heart of the new distributed database system. As Colombo branch handles all management and financial decisions it's important for Colombo branch to have access to data quickly. For this purpose, Colombo LAN is revamped as shown in the following figure to facilitate the functions of new distributed database.
Being a company that is engaged in selling state of the art computers, accessories and networking products, its important to have faster access to database as well as fast access to different interconnected nodes within the LAN itself for this purpose Colombo site LAN is completely redesigned to facilitate the new database system. The old token ring based topology was replaced with a new Gigabit Ethernet LAN with Star topology. Gigabit Ethernet provides data rate of up to 1000 Mbp/s for LAN data.
The new database server and backup server is implemented as a separated segment in the LAN. The separation is done through the router. The switches that connect Accounting and human resource department, Sales department and Warehouse department connects to a central switch, which in return connects to the router. Database server and Backup server connects to a switch, which in return connects to the router which also has the built-in firewall capability. The router segments the database section of the LAN form the other sections. This way it helps to reduce the congestion in the Local LAN. It helps faster access to database within the Colombo LAN itself as well as faster processing for incoming queries that are coming from other branches through WAN.
The addition of the firewall protects the internal LAN form unauthorized access. This helps to protect the data of the company. The backup server provides continuous back up facility to the database. It helps to recover database in an event of a failure of the main database.
New LAN Architecture of Mumbai Site
Mumbai branch function as the second important branch after Colombo head office. Furthermore with management's intention to expand it in future to facilitate more storage and attain more sales targets that covers a larger customer base, makes it essential to have a good infrastructure in the LAN at Mumbai branch. For this purpose, Mumbai branch LAN also revamped as shown in the following figure to facilitate the functions of new distributed database and future additions.
Mumbai branch LAN also revamped with an architecture similar to that of the Colombo LAN. The old network topology of Token Ring based architecture is replaced with a new STAR Topology Gigabit Ethernet. Gigabit Ethernet provide the faster access to data within the LAN, which is much need in day to day communication within the organization. Furthermore it lays the foundation to future expected expansions to the Branch. The router segments the LAN area consisting of distributed database from that of the other areas of the LAN. This helps to prevent congestion and improves the data transfer efficiency of the LAN as well as providing faster a access to data for both local and distributed queries.
The router is equipped with a built in firewall which protects the internal LAN from unauthorized access, thus protects the valuable data of Infinity computer systems.
Database server is connected to a backup server which backs up the data of the main database server. It helps to recover the main server in an event of a failure.
New LAN Architecture of Kandy Site
Kandy branch also revamped to facilitate the new changes to the database system. The architecture of the LAN is nearly similar to that of the other two branches. following figure shows the new architecture.
The most notable addition to Kandy branch is the addition of the T1 line which replaced the previous ISDN line that connects the branch LAN to the company WAN. The T1 line provides the faster access to distributed data as well as internet. This makes it easier for all branches as it provides same speed of access to all three branches' data without creating any bottle necks.
The LAN is designed as a Gigabit LAN using a Star topology which provides fast data transmission within the LAN.
The router has built-in firewall which protects the internal LAN form intrusions. The database server section of the LAN is segmented using the router. This helps to control the congestion and allows the faster access to data for local and distributed queries. Backup database server provide data backup functionality for main database server which helps in quick recovery of the main database server in an event of a failure.
New Features of the Proposed WAN Network architecture and Distributed System
There are few new features were introduced to the existing WAN network to make it compatible with Distributed database system and address certain security holes that presented in the existing WAN.
- Firewalls have been introduced to each local site to protect each LANs of all three branches. This feature address the issue of network being exposed to Worms and hackers threats. Firewalls block malicious traffic that are not authorized to enter any segment of the Infinity computer system network while allowing legitimate traffic to access any part of the network
- ISDN line that connected Kandy branch to the WAN has been replaced with a high bandwidth T1 line. This allows the distributed queries to access Kandy branch data at the same speed as the of the other two branches; Colombo and Kandy as well as Mumbai branch being benefitted from accessing the other two sites data much faster than the existing ISDN line.
- Each LAN of all three branches have been introduced with a new LAN segment which consists of distributed database and processing systems and it was segmented through the router to reduce the congestion so that both local and remote data traffic gets the ability to access the database faster.
Security Enhancements Provided by New System
- The new system is designed to provide encryption for user authentication data. This prevents the data interceptors from understanding the data related to user authentication and authorization.
- The proposed system is designed with a multi level security control system. Multilevel security controls ensure users cleared at different security level access and share the company's distributed database in which data is assigned different security levels. This prevents the lower level staff from gaining access to data that are not relevant to them and it ensure the security of the data.
Advantages of the New Database System
There are number of advantages of the new distributed database system.
- The most significant advantage of the new system is, speed. with having locally available database for each branch there is no longer a need to connect to the Colombo head office database during day to day operations giving fast access to data that is located at each branch LAN. Fast access to database means rapid operations in all task in the company environment as well as quick service for customers.
- Having redundant data in two company branches apart from Colombo head office means higher availability. Even if a failure occur in the main branch database, it has the ability to quickly recover from the regional branches. In the same manner should a failure occur in a regional branch database, it has the ability to recover from main database in Colombo branch with minimal downtime.
- The distributed architecture of the database reduce the strain on the main database servers as workstations connected each branch's database server shares the processing workload between them. this result in faster processing of queries.
- There will be a tremendous reduction in network traffic as well. In the old configuration, company WAN was highly utilized for database traffic. specially Colombo branch received a huge amount of database traffic every day. with the new configuration, branch offices are no longer needed to utilize the WAN to access database. This frees up the WAN traffic and allow all branches to utilize it for other more critical tasks.
- As the database query processing is distributed between branches, there is no longer a need to maintain high cost high-end servers for processing. This reduces company expenditure in long run.
- New database system provides the ability to expand both in quantity and processing power. This provides the platform to carry out planned expansions for Mumbai branch without having to worry about recourses and infrastructure.
- Removal of Reliance on a Central Site. In the existing centralized system, both remote branches of the company as well as Colombo branch is depend on same database that is located in Colombo. But the proposed system eliminates this reliance and provide a independent database system for each branch with the ability of also connecting with the databases of the other branches through the Distributed database management system and eliminates single site of failures.
- With fragments as the unit of distribution in the new architecture, a transaction can be divided in to several sub queries that operates on fragments. This increases the degree of concurrency or parallelism in the system.
Possible Problems in the New Architecture
- Difficulties in integrity control
The new distributes database system hides the distributed nature of the system from the user. Though it provides an acceptable level of performance, reliability and availability is more complex than the existing centralized database architecture. The fact that certain data, specially Colombo and Mumbai stocks related data are replicated in Colombo and Mumbai branches, ads an extra level of complexity when dealing with synchronization between these elements. So the software must be designed to handle the data replication adequately, if not it will lead to degradation of availability, reliability and overall performance of the entire system.
The increased complexity of the new distributed database architecture leads to higher costs in hardware and software resources as well as maintenance costs.
Validity and consistency of stored data is referred to as Database integrity. Integrity is usually is expressed in terms of constraints, which are consistency rules that DBMS is not allowed to violate. Enforcing integrity constraints requires that defines the constrains but that are not related to actual update operation itself. In a Distributed DBMS environment like the proposed Infinity computer systems architecture, the processing and communication cost that are required to enforce such integrity constraints may be prohibitive.
Unlike the centralized DBMS system which the access can easily be controlled, new Distributed database system which consists of fragmented and replicated data which are located at multiple sites, the security control is more challenging. Furthermore the network itself needs to be made secure in order to protect the data that travel between three branches.
CHAPTER 3 - Detailed structure and functionality of distributed database
In this chapter, structure of the distributed database of infinity computer systems and the functionality of the distributed components will be discussed in greater detail.
There are three major tables used in the database architecture. They are Employees, sales and stocks. Following is the table format for each table.
As shown in the figure 7, Employees, Sales and Stocks tables are fragmented and located at all three branches according to the relevance of the site where the data is accessed most often. This architecture make the data access time faster and keep the communication costs down.
Furthermore data of the Stocks_CMB, are vertically fragmented and located at the Mumbai branch database site with the table name of Stocks_CMB_FRG. During the fragmentaion of Stock_CMB all attributes of the Stock_CMB table were allocated in to Stocks_CMB_FRG except UNITPRICE attribute, because it is irrelevant to INDIAN territory thus it reduce the wastage of storage space due to repetition of irrelevant data. The purpose of allocating Colombo head office Stock data in the Mumbai site is to allow faster access because Mumbai branch of Infinity computer systems, run its own warehouse and deal with manufacturers and suppliers directly. This makes it important fir Mumbai branch have the ability to access the Colombo stock data very often and quickly so that both branches can maintain a healthy stock for everyday business. Colombo branch retains a copy of the Stocks tables related to Mumbai site as well as maintaining its own stock table related to Colombo stocks. In this new design, Stock_MBI table that is located in the Colombo head office site is configured to synchronize with Stock_MBI table, two times a day, during midday and then at the end of the working day.
In a nutshell, following is the way the database is distributed across three branches.
- Employees and Sales Tables that were previously located in the Colombo branch, were fragmented according to the relevancy where data items are physically belong and located at their relevant branches.
- Stock_MBI is replicated at Mumbai (Stock_MBI_LCL) site while retaining a exact copy in Colombo branch
- Stock_KDY table is transferred to Kandy site from its previous position of Colombo
- Stock_CMB table is Vertically fragmented and located a copy at Mumbai branch.
Data Allocation Method
There are four methods to consider when choosing a data allocation method for proposed distributed database architecture for Infinity computer systems. They are,
- Complete Replication
- Selective Replication
Out of the above methods, we use Selective Replication as the data allocation method for proposed distributed database architecture.
Selective Replication is a combination of Fragmentation, replication and centralized data allocation methods. In this method some data items are fragmented to maximize high locality of reference and others, which are used at many sites and are not frequently updated, are replicated; otherwise data items are centralized. This approach gives combination of advantages of all the other three methods.
Using the selective replication method, we will only be distributing data related to Kandy and Mumbai branch to their respective branches while keeping a main database at Colombo branch which will consist of records related to all branches. This will serve as a redundant database as well as a central repository where all data related to company's all three branches can be easily retrieved.
Following section describes how the above distribution was done by using relational algebra.
Fragmentation of Database for Allocation of Data at Various Sites
When distributing a database across multiple sites, one of the main factors need to consider is the fragmentation of database items such as Tables. Fragmentation consists of breaking a relation in to smaller relations or fragments and storing the fragments at different sites. By fragmenting, data can be distributed to the sites where they used more often. There are two approaches to distribute database elements across multiple sites. They are,
- Distribute one copy of each database table in all sites
- Distribute portions of the selected tables that are important to local sites
In our company case we will be using the second method mentioned above. In that method we will be distributing only the data in the three main tables that are related to each site (Branch). When fragmenting data in a table there are three techniques used. They are,
- Horizontal fragmentation
- Vertical fragmentation
- Hybrid fragmentation
For our company distributed database we use both horizontal and vertical fragmentation to distribute table data among three branches. More precisely, For Employee table and Sales table we use Horizontal fragmentation and for Stocks_CMB table we use Vertical Fragmentation. The reason for using vertical fragmentation for stock table is because Mumbai branch deals with manufacturers and other vendors who provides hardware and software stocks directly to Mumbai branch.
In horizontal fragmentation, certain rows of the tables are put in to a base relation at one site, and other rows are put in to a base relation at another site. In other words, the rows (tuples) of a relation are distributed to many sites as disjointed fragments.
In infinity computer systems database, we use the horizontal fragmentation as follows, to fragment Employees and sales tables. When selecting the criteria to horizontally fragment the Employees and Sales tables is the relevance of data to the location.
As shown above, the current employee table, we fragment by considering the BRCODE field. BRCODE indicates the branch where the employee works. By doing so we can build three new tables out of the above database table and allocate them to each of the 3 branches of the company. We can horizontally fragment employees table in to 3 separate logically related tables as follows.
- Using relational algebra to do the horizontal fragmentation of Employees table
To do the above horizontal fragmentation of employees table in to three tables we can use relational algebra SELECT operation. Our intention is to fragment the table in to three small fragments so each table would contain Employees related to their respective branch. To achieve this, Relational algebra operations are,
- Employees_CMB = SELECT(Employees_Table)
- Employees_MBI = SELECT(Employees_Table)
- Employees_KDY = SELECT(EMP_TABLE)
WHERE BRCODE = CMB
WHERE BRCODE = MBI
WHERE BRCODE = KDY
Executing the above three formulas results in following three table fragments:
- Employees_CMB (contains 2 tuples)
- Employees_MBI (contains 2 tuples)
- Employees_KDY (contains 1 tuple)
- Relational algebra operation for fragmenting Sales Table
We can divide sales tables in to three fragments through the SELECT algebra operation as follows and it will result in three tables containing sales data related to each of the three branches.
- Sales_CMB = SELECT(Sales_Table)
- Sales_MBI = SELECT(Sales _Table)
- Sales_KDY = SELECT(Sales _Table)
WHERE BRCODE = CMB
WHERE BRCODE = MBI
WHERE BRCODE = KDY
Vertical Fragmentation works by splitting a table between attributes. Vertical fragmentation is used in situations where some sites needed to access the attributes of the tables of many data items in a table. This fragmentation is more difficult than horizontal fragmentation as more options exist. The fragmentation can be achieved by either
- Grouping attributes to fragments or
- Splitting relations in to fragments
For fragment Colmbo branch's Stock_CMB Table, we use the first method mentioned above.
- Relational algebra operation for Vertical fragmentation of Sales_CMB Table
For vertical fragmentation, relational algebra Project operation is used. We fragment the above table to form a new table called Stocks_CMB_FRG. This new table will contain all the attributes of the above table except UNITPRICE.
Following is the relational algebra Project operation.
Stocks_CMB_FRG = PROJECT(Stocks_CMB)
TOCKCODE, ITEMNO, QTY, S LASTIN, NEXTIN
Executing the above operation will result in creating the following table.
The data model consists of three layers called schemas. Each schema defines a set of views that database can be seen. The three schemas are,
- External schema layer
- Conceptual schema layer
- Internal schema layer
Represents the view of the database that users and/or applications might see
At this level the database objects such as tables, columns, views, and indexes are defined. These definitions provide mappings to the next level of the model, which is where the physical layout of the database is defined.
This layer defines the actual layout of the records and fields.
Distributed databases of all three branches are modeled according the above structure and all three branches maintain their own set of the above model. In local sites, when users access the data stored locally, they access them as defined in the external Views. Conceptual schema maps the logical structure of the tables to Internal Schema which defines the physical storage of data on the discs.
The above model should be extended to fit it in a distributed database environment so that users of all three sites get the ability to access the data regardless of their physical site where data is stored . It is done by introducing a global conceptual schema which integrate all local conceptual schemas that are related to three branches of the company. This new Global Conceptual schema is located in main database site, which is Colombo while preserving at all three branches, the lower level local schema views that was shown in the Figure 8 in previous page
As shown in the figure, Local conceptual schemas are integrated through a Global Conceptual Schema (GCS). At all three branches (Mumbai, Colombo and Kandy) their Local schemas are still preserved without any alteration. What GCS does is map the local view of each site to a global structure, so that users from all three branches get the access to the data regardless of the site they reside.GCS is located at Colombo Branch. GCS is the Union of all local conceptual schemas at each Branch.
GCS = UNION(LCS1, LCS2, LCS3)
That way GCS has the access to Local Conceptual schemas of all three branches and through that a whole new set of User views can be created by integrating local views. Users of all three branches access through this Global views when they access the data that is located at remote sites.
The Global Database Catalog (Global Conceptual Schema)
GCS can be also called as Global database catalog. As described in the early chapters, Infinity computer systems distributed database is designed as Homogeneous Relational Model. Thus all three branches have the same DBMS which are logically related. The Global catalog is divided in to two schemas,
- Global schema
- Fragment schema
The global schema contains the definitions of the global relations used to create the global user views. This portion of the catalog defines the global view of the distributed database. The fragment schema includes tables that contain localization information that is used when the distributed query is broken down to sub queries; that is, it identifies the location of all the table fragments that make up each global relation.
As shown in figure 10, global schema is made up of following tables.
- Global views
- Global tables
- Global columns
The global views table contains the following columns:
- VNAME - Name of the user external view
- GSQL-SELECT - the SQL SELECT statement used to construct the view
The global tables table contains three columns:
- TBL NAME - Name of the global relation
- SITE - Name of the site where the table is distributed
- TYPE - Type of table, view, or actual base table
As described in previous sections, Employees, Sales and Stock tables are fragmented and distributed across three sites, when we create the Global tables table, we reference to all fragmented tables that are logically related as follows.
The global columns catalog table contains the following attributes:
- COL NAME - Name of the column
- TBL NAME - Table that owns the column
- TYPE - Data type of the column
- LEN - Length in bytes of the column
The next schema of the Global database catalog is Fragment schema. the fragment schema portion of the global database catalog contains localization information used for fragmenting the distributed query into sub queries that can run at each site involved with the query. It contains two main tables,
- Fragment Tables
- Fragment Columns
The FRAGMENT TABLES table contains the following columns:
- SITE - Name of the site that contains the table fragment
- TBL NAME - Name of the global table from which this fragment was derived
- FNAME - Name of this table fragment
- NROWS - Number of rows contained in the table fragment. This statistic is used to optimize the distributed query.
The FRAGMENT COLUMNS table contains information describing the columns contained in the table fragment. It is made up of the following columns:
- COL NAME - Name of the column
- FNAME - Name of the table fragment that owns this column
- TYPE - Data type of the column
- LEN - Length of the column
The catalog management plays an important part in a distributed database system. In Infinity computer system's distributed database, the global database catalog is managed as Replicated Global Catalog. In this type of a catalog management, each site maintains its own global catalog. This greatly speeds up remote data location.
The Global Application Views
The Global application view table defines the external user views that are used to access the distributed tables. In other words, through external application views data can be accessed as if there were stored in a one site even though in fact, the data is scattered over few branches. It consists of following attributes.
- V NAME - Name of the external view
- GSQL SELECT - The attribute that stores the global distributed SQL query used to create the global view
The Local Database Catalogs
The local database catalogs are similar to the global database catalogs with some minor differences. In Infinity computer system's distributed database the catalogs are implemented with the relational database model.
The catalog is made up of the following four tables: LCAT_TABLES, LCAT_VIEWS, LCAT_COLUMNS, and LCAT_INDEXES. (This last table is not present in the global database catalogs.) The table LCAT TABLE stores definitions for the tables that are included in the local database. It is made up of the following attributes:
- TABLE_NAME Name of the table
- TYPE Table type, T = table, V = VIEW
- NUM_ROWS Number of rows contained in the tables. (This is usually referred to as the tables' cardinality)
The LCAT_VIEWS table is similar to its global counterpart. It consists of definitions for all of the views that can be found in the database. It is made up of the following columns:
- VIEW_NAME - Name of the view
- LSQL_QUERY SQL - query used to define the view
The LCAT_COLUMNS table is used to define all of the columns that can be found in the tables that make up the local database. It is also similar to its global counterpart except the for two new attributes: HKEY and LKEY.
Below are the attributes that make up this table:
- COL_NAME - Name of the column
- TABLE - Table that owns this column
- TYPE - Data type of the column
- LEN - Length of the column in bytes
- HKEY High - key value of the column
- LKEY Low - key value of the column
The HKEY and LKEY attributes are used when building an access plan for a query that requires this column in its WHERE clause. These two attributes are used in a formula that roughly determines the number of rows that will be returned from a table.
Finally, the LCAT_INDEXES table contains information describing each index stored in the database. It contains the following columns:
- INDEX_NAME - Name of the index
- KEY_COLS - Column names that make up the index
- SORT - Sort order
- PCT_UNIQUE - Percentage of unique values
The Local Physical Database Schema
The local physical database schema is used to define the actual files that will store the records on disk. It is derived from the local conceptual schema.
CHAPTER 4 - Transaction management
This chapter describes the transaction characteristics and Transaction transparencies that are related to Infinity computer systems proposed Distributed database and how the database system is designed to preserve those transparencies.
Transaction in a Distributed System
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency. Transaction management in a distributed database ensures that, is that if a database was in a consistent state prior to the initiation of a transaction, then after the transaction database return to a consistent state after the transaction is completed. irrespective of the fact that transactions were successfully executed simultaneously or there were failures during the execution, Transaction management mechanism in the distributed database always ensure that database is maintained in a consistent state. Thus, a transaction can be referred to as a unit of consistency and reliability.
Classification of Transactions
There are four classifications of transactions in a distributed database environment: Remote request, Remote transaction, distributed request, Distributed transaction.
These are read only transactions made up of an arbitrary number of SQL queries, addressed to a single remote DBMS. The remote DBMS can only be queried. Following figure shows this.
It is made up of any number of SQL commands (select, insert, delete, update) directed to a single remote DBMS and each transaction writes onto only one DP. Consider a situation where Customer and Employee tables located at site2. The transaction should have the ability to update customer and employee tables. The transaction can reference to only one DP at a time.
It is made up of any number of SQL commands (select, insert, delete, update) directed to an arbitrary number of remote DP sites, but each SQL command refers to a single DBMS.
Consider a situation where a transaction points towards two remote sites, assume site 2 and 3. The first request (Select statement) is processed by the DP at the remote site 2, and next requests (Update, Insert statements) are processed by the DP at the remote site 3. Each request can access only one request at a time.
It is made up of arbitrary transactions, in which SQL commands can refer to any DP that may contain the fragmentation. This request requires a distributed optimizer.
Let us consider the following two examples to understand this idea better.
Situation 1: Let Shop(sno, sname) is at site 2 and customer (cno, cname, bill, sno) and employee (eno, ename, sno) be at site 3. If we issue a request to fetch the tuples to find sname and cname where sno+123, the process if selection happens as illustrated in the figure.
Situation 2: The distributed request features allow a single request to reference a physically partitioned table. Assume Employee table is fragmented in to two fragments, say E1 and E2 and located at site 2 and 3. Suppose we need to obtain all the tuples whose salary exceeds $15000. The SQL request is shown in the following figure.
To ensure the smooth operation of various distributed components of the database system and to ensure the consistency of the database system, certain properties must be maintained. following section describe those properties.
In a distributed database environment, transaction transparency ensures that all distributed transactions maintain the distributed database's integrity and consistency. In a distributed database system like the one of Infinity computer systems, atomicity of the distributed transaction is fundamental to the transaction concept. A distributed transaction access data stored at more than one location. Each transaction is consisting of Sub transactions as one per each site that need to be accessed. For this reason, Distributed database System also should ensure the atomicity of each sub transaction. For the transaction to commit, all of the operations must complete successfully. If only one operation fails, the entire transaction fails. To achieve this DDBMS must ensure the synchronization of sub transactions with global transactions that are executing simultaneously at the same or different sites. Fragmentation, allocation sand replication s schemas of a DDBMS, complicates the transaction transparency.
There are two transparencies that are related to Transaction transparency. They are,
- Concurrency Transparency
- Failure Transparency
Concurrency transparency is provided by the DDBMS, if the results of all concurrent transactions (distributed and non distributed) execute independently and are logically consistent with the results that are obtained if the transactions are executed one at a time in some order. In a Distributed systems, DDBMS should ensure that global and local transactions do not interfere with each other as well as ensuring the consistency of all the sub transactions of the global transactions.
In other words, Concurrency transparency Enables several processes to operate concurrently using shared information objects without interference Among the processes.
To preserve the concurrency transparency in a distributed database two techniques that can be used are Locking and Time stamping. Out of these two methods we use Time stamping in our company database. This will be discussed in a latter chapter.
Each site in a distributed database system is vulnerable to same types of failures that are presented in centralized database systems like the current centralized database system of infinity computer systems. Furthermore there is the risk of communication failure between various replicated and fragmented data that are distributed over different sites. Therefore a mechanism must be presented in the DBMS system to detect failure and be able to system to reconfigure and allow processing to proceed. Furthermore it should be able to recover after a communication link breakdown and once the repair has been done to the link.
In other words, Failure transparency enables the concealment of faults and allow users and applications to complete their tasks despite the failure of other components in the distributed system.
Furthermore, Distributed Database management system must ensure atomicity and durability of transactions. That means in the presence of a site or network failure, DDBMS ensure sub transactions of global transaction either all commit or completely abort. In order to do that, DDBMS must synchronize global transactions to ensure that all sub transactions have completed successfully before recording a final COMMIT for global transactions.
Distributed Transaction Management
Distributed transactions span across two or more servers known as resource managers. a server component known as transaction manager coordinate between the resource managers when executing transactions. Each instance of the SQL Server Database Engine can operate as a resource manager in distributed transactions coordinated by transaction managers.
At the application level, a distributed transaction is treated and managed same way as a local transaction. Each transaction could result in two states, Commit the transaction or Rollback. A distributed Commit state carried out in two phases to ensure the consistency of the distributed database system. This is known as two-phase commit (2PC). The two phases are,
- Prepare phase
- Commit phase
Upon receive of a commit request transaction manager sends a prepare command to all the resource managers that are involved in the transaction. Each resource manager at each site then carryout the steps necessary to make the transaction durable, and all buffers holding log images for the transaction are flushed to disk. Each resource manager at the completion of the prepare phase returns success or failure of the prepare to the transaction manager.
Transaction manager then sends commit commands to each resource manager. But this is done only if all of the resource managers message back stating the prepares phase has done successfully. The resource managers then carryout the steps involved in commit phase. If all of the resource managers report a successful commit, the transaction manager then sends a success notification to the application. If any resource manager reported a failure in the prepare phase, then the transaction manager issues a rollback command to each resource manager indicating the failure of the commit to the application.
Following measures are implemented to the database to ensure the transactions are carried out properly and keep the database in a consistent state. In each branch of the distributed database a Transition manager, Scheduler, Recovery manager and a Buffer manager monitors and control the transactions carried out and maintain the database system in operational and consistent state.
In each local branch database system, Transaction Manager coordinates transactions on behalf of applications programs, communicating with the scheduler, the module responsible for implementing a particular strategy for concurrency control. The objective of the scheduler is to maximize concurrency without allowing concurrently executing transactions to interfere with one another which may compromise the consistency of the database.
Recovery Manager, In the event of a failure occurring during the transaction, ensures that the database is restored to the state it was in before the start of the transaction there by it always keeps the database in a consistence state. It also oversee the restoration of database to a consistent state after a system failure.
The Buffer manager handles the efficient transfer of data between disk storage and main memory.
There is also a module called Global Transaction Manger (Transaction coordinator) at each site which coordinate between local and global transactions initiated at the site. Transaction managers at different sites, though doesn't communicate with each other directly, Instead Inter site communication is still handled by Data communication components.
Suppose Colombo site initiate a global transaction,
- The Transaction coordinator at Colombo site divides the transaction into number of sub transactions using information held in the global system catalog.
- Data Communication components at Colombo site sends the sub transactions to appropriate sites. (according to the figure both Mumbai and Kandy branches)
- At both Mumbai and Kandy branches, the transaction coordinators manage these sub transaction and execute them. The result of the sub transactions are communicated back to Colombo Branches' Data communication components which in return handover the result data to Transaction coordinator of Colombo site.
Each transaction that is executed in above manner should poses certain qualities in order to carry out its assigned task and maintain the distributed database in a consistent state. following are the properties a transaction should have, in order to achieve this.
A transaction must execute and complete each operation in its logic before it commits its changes. As stated earlier, the transaction behaves as if it were one operation, even if it includes multiple reads, writes, and other operations.
Execution of a transaction must leave a database in either its prior stable state or a new stable state that reflects the new modifications made by the transaction. In other words, if the transaction fails, the database must be returned to the state it was in prior to the execution of the failed transaction. If the transaction commits, the database must reflect the new changes.
The transaction must act as if it is the only one running against the database. It acts as if it owned its own copy and could not affect other transactions executing against their own copies of the database. No other transaction is allowed to see the changes made by a transaction until the transaction safely terminates and returns the database to a new stable or prior stable state. (This depends on whether or not the transaction committed or aborted its changes). If distributed system failed to maintain this property, one of the following things could happen to the database system.
- Lost Updates: results when another transaction updates the same data being modified by the first transaction (TI) in such a manner that T2 reads the value prior to the writing of TI thus creating the problem of loosing this update.
- Cascading Aborts: this problem occurs when one transaction (TI) aborts, then the transactions that had read or modified data that has been used by the first transaction will also abort.
This last property states that the changes made by a transaction are permanent. They cannot be lost by either a system failure or by the erroneous operation of a faulty transaction.
The above four properties of a transaction are also referred to as the ACID properties or the ACIDITY of a transaction.
Chapter 5 - Concurrency control and Deadlock management
Concurrency control in a distributed system ensures that database transactions are performed concurrently without violating the database data integrity. Thus concurrency control helps to maintain the Distributed database system in a consistent state. concurrency control in a distributed DBMS is much challenging than a centralized DBMS due to the fact that there is a high chance that data may be replicated and partitioned. If a user wants unique access to a piece of data, say, to perform an update or a read, the DBMS must have the ability to ensure unique access to the required data, which is difficult if there are copies throughout the sites in the distributed database. But in order to achieve this unique access to data, concurrency control is needed so that DBMS can ensure each data object is being modified in a systematic and a serialized manner to ensure the consistency and accuracy of the data items being modified.
There are few main methods used for concurrency control in a distributed system. They are,
- Optimistic - in this method concurrency control is done by Delaying the checking of whether a transaction meets the isolation rules (e.g., serializability and recoverability) until its end, without blocking any of its (read, write) operations, and then abort a transaction, if the desired rules are violated.
- Pessimistic - Block operations of a transaction at the beginning of the transaction, if they may cause violation of the rules.
- Semi-optimistic - As performed in Optimistic method, delay rules checking to transaction's end, but Operation is blocked only in some situations only while does not block in other situations.
To implement the concurrency control in a distributed system many methods exist. They can be implemented under any of the three main categories described above. Few such methods for concurrency control are, Two phase locking, Conflict graph checking, Time stamping.
Out of the above methods, Infinity computer system's distributed database system use Two phase locking protocol for concurrency control.
Two-Phase Locking Protocol
Locking is a mechanism commonly used to solve the problem of synchronizing access to shared data. The concept behind the locking is simple and intuitive. Each data item in the database has a lock associated with it. A scheduler checks the associated lock for data items before any transaction may access the data item.
Each data item has two types of locks: a read lock and a write lock. Certain types of locks are compatible. This means more than one transaction can place a lock on the same object. A read lock is compatible with another read lock. If transaction T1 requests a read lock on object X and transaction T2 already has a read lock on object X, T1 is granted the lock anyway. If on the other hand, if both or any one of the transactions have requested a write lock on the object, a conflict occurs and one of the transactions is denied the lock request.
The use of the two phase locking protocol ensure the serial execution of concurrent transactions thus only one transaction can access the data item at a time. This guarantee the correct execution of a transaction. Two phase locking protocol has two states,
- Growing phase - here all the locks required by transaction are obtained.
- Shrinking phase - in this state, all the locks that were previously obtained are released by the transaction
Once a transaction has released a lock, it has no possibility to obtain another.
In real world, there might happen situations where two different transactions request access to the same lock. Both of these transactions might also possess a lock on a data object that the other transaction requires. Neither transaction will give up its current lock until it obtains a lock on the object it needs. This situation is Known as deadlock. Deadlock detection prevention is important for a database as deadlock situations could bring database in to a complete halt and freeze the functionality.
Distributed Deadlock Detection And Resolution
Distributed Deadlock can be detected using Wait for Graph technique. Infinity computer systems distributed database also use this same technique. Wait for Graph indicates the transactions that are "waiting for" lock release by other transactions, and if Wait for Graph finds a cycle it indicates a deadlock. Wait for Graph can thought of as a graph of conflicts blocked by locks from being materialized; it can be also defined as the graph of non-materialized conflicts.
Above figure illustrates a typical deadlock situation (Before) in a local site and then a global sites access and the data and carryout the scheduled transaction after the deadlock is broken.
Infinity computer systems Distributed database uses a central deadlock resolution process that collects all the local WAIT-FOR-GRAPHS, interconnects them to a global WAITFOR- GRAPH, and selects a likely candidate for termination to break the global deadlocks. In this strategy, the local sites still have the responsibility to resolve local deadlocks. But Global wait for graph coordinate between the local sites to resolve conflicting deadlock situations.
Chapter 6 - Distributed and Local Recovery Strategies and Reliability control
Distributed recovery is concerned with the coordination of the activities that occur at the local sites when a transaction fails.
There are basically two categories of failures that we must examine: local and distributed. There are three types of local failures:
- Transaction Failures
- System Failures
- Media Failures
There are few methods that can be used to recover from the above failures such as Two phase commit protocol and three phase commit protocol. Out of this, Infinity distributed database system uses two phase commit protocol to recover from failures. it is described in the following chapter.
Implementation of Two Phase Commit Protocol
Two phase commit protocol implements the strategies used in distributed systems to ensure that local sites can recover from failures in a uniform manner. Since distributed transactions involve multiple sites, either all sites successfully commit their portion of the distributed transaction or all sites abort their portion of the distributed transaction. If even one site fails, all sites must abort their transactions.
Two phase commit protocol uses intercommunications dialog to determine the outcome of a transaction and it can be implemented in the following three manners.
Out of these three architectures Infinity computer systems database uses Distributed architecture to implement two phase locking protocol.
The distributed architecture approach allows all sites in the network to communicate with each other. In this approach, each site receives the decisions of all the other sites involved in the transaction. Each local site makes its own decision by collecting the votes from the other sites. If one or more of the other sites vote to abort, the site aborts its sub transaction. With this strategy, each site acts as its own coordinator.
Functionality of Two Phase Commit Protocol
This protocol has two phases or stages that it goes through to terminate a distributed transaction. The coordinator enters a "WAIT" phase while it polls the participants for their decision as to the fate of the distributed transaction. On receiving all of the participants' votes, the coordinator enters either a global commit or global abort phase. The participants enter a ready phase after they inform the coordinator of their termination condition. This "READY" phase is where each participant waits for the coordinator's final decision on the outcome of the global transactions. On receiving the coordinator's decision, each participant enters either a local commit or a local abort phase, after which the transaction ends at each of the sites.
As shown in figure, each the coordinators are shown as stick figures for demonstration. In step 1, the coordinator stick figure sends out a prepared message to both participants and immediately enters a WAITING state. In step 3, both participants respond with a COMMIT decision and enter a READY state. The stick figure finishes its coffee, collects both commit responses from the participants, and issues a global commit response to the two participants. On receiving this response from the coordinator, both the participants locally commit their sub transaction and the dialog ends. The scenario could have been such that participant number 2 at site 2 decided to abort the transaction while the participant at site 1 decided to commit its portion of the transaction. The coordinator stick figure received one positive response and one negative response, so it decided to abort the transaction globally. It now sends out a global abort message, and the disappointment can be seen in the faces of the participants.
In a more formal manner the above process can be shown as below.
As demonstrated in the figure, The coordinator enters a START state by issuing a PREPARE command to the participants of the distributed transaction. The coordinator goes immediately to its next state, the WAITING state. The participants, on receiving the PREPARE instruction from the coordinator, send a reply to the coordinator in the form of a local abort or local commit vote. If the local site aborts, it immediately goes to the local abort state and waits for the coordinator's response. If the local site's response was a commit, it enters a READY state and waits for the coordinator's final global decision.
If both of the participants vote LOCAL COMMIT, the coordinator leaves the WAITING state and transmits a global COMMIT instruction. Back at each site that is participating in the distributed transaction, the local recovery manager receives the global instruction and leaves the READY state. Since the coordinator voted to GLOBAL COMMIT the transaction, both local sites take the appropriate measures to successfully conclude the execution of their sub transactions. Each site then acknowledges the local commit so that the coordinator can end the global transaction. Had one of the participants decided to abort just prior to entering the READY state, it would have sent a LOCAL ABORT vote to the coordinator. On tallying all the votes, the coordinator would have seen that there was one ABORT vote and would have proceeded to issue a GLOBAL ABORT instruction to all the participants of the distributed transaction. Notice that the participating site that had originally voted to COMMIT can now leave the READY state and change its mind by entering the abort state.
As described in the above process, two phase protocol can be successfully deployed to recover from errors during distributed transactions.
Using the above two phase commit method the reliability of the system may be compromised in a situation where a communication failure occur between site. That is the above algorithms work fine if all the sites remain in communication throughout the life of a distributed transaction and no network failures occur. Should a failure occur at one of the sites, the other sites could remain in a WAIT state or READY state, waiting for instructions from the coordinator that will never come, at least not until the communications failure is corrected.
One remedy to this situation is to include a timeout condition for each of the states wherein both the coordinator and the participants are vulnerable to failures. With this timer process, if a response is not received in a certain period of time, the participants or coordinator assume that the other site has failed and it can begin to abort the transaction at its site. Let's see how the coordinator handles this situation. The coordinator can first implement the timer in the WAITING state of the state transition diagram. If the timer period terminates and all the participants have not returned their decisions, the coordinator decides to terminate the global transaction by sending a GLOBAL ABORT instruction to all the sites participating in the distributed transaction. Next, the coordinator can time out in either the GLOBAL ABORT or GLOBAL COMMIT state. If all the participants have not acknowledged the global instruction after the coordinator has timed out, the coordinator keeps sending it until the offending site or sites respond. Once this occurs, it means that the remote failing site(s) has recovered and implemented its local recovery routines. It can now process the global instruction sent by the coordinator and acknowledge it. There are two states where a participant can time out—either in the INITIAL state or in the READY state. If the participant times out in the INITIAL state, it locally aborts the sub transaction and terminates. A timeout condition in this participant state indicates that the coordinator has failed. When the coordinator recovers, it will wait for a response from the site. Since this site is already finished it will not respond. The coordinator will time out and globally abort the transaction. This way it ensures the reliability and consistency of the distributed database system.
- Distributed and Multi-Database Systems - Angelo R. Bobak, Artech House Boston, London 1995
- Database systems - a practical approach to design, implementation and management - Thomas Connolly, Carolyn Beggg, Pearson education International 2010
- Distributed System Principles - Wolfgang Emmerich, University College London 1997
- Distributed DBMS - Hamilton campus of the University of the West of Scotland http://hamilton.bell.ac.uk/databases/documents/Lecture4.pdf (Accessed on 04-03-2010)
- Distributed Database - Scribbed.com, http://www.scribd.com/doc/6142386/-distributed-database (Accessed on 04-03-2010)
- Distributed Database Design - Infosys, Scribbed.com http://www.scribd.com/doc/11110339/Distributed-databases (Accessed on 20-03-2010)
- Distributed Transactions Overview - Microsoft Developer Network, msdn.com http://msdn.microsoft.com/en-us/library/ms681205%28VS.85%29.aspx (Accessed on 21-03-2010)
- Notes on Two Phase Locking and Commit Protocols http://user.it.uu.se/~arnoldp/distrib/TwoPhase.html (Accessed on 05-04-2010)