Storing and exchanging data are playing a huge role in our life nowadays. As volume of information goes up, the need for more effective ways to operate it increases. One of the most effective ways of storing information is known as "Tree structure". Tree structures arrange information in a hierarchical order. This structure easily allows you to create, edit, enumerate, and move data. The most popular way of operating tree-like structured data is Extensible Markup Language (XML). It is now widely used to store and exchange big amounts of information on the Web.
With global increasing of amounts of data the use of XML is undertaking a huge leap. Its low requirements and flexibility determine its prevalence. However, complicated structure of a document and inflating of the information it stores cause the need for refining the ways of retrieving data. Indexing of XML documents can provide significant performance boost for querying information. In this research I briefly describe two approaches of indexing of XML documents and analyze the results of their work.
XML Document consists of hierarchical ordered elements, which can be displayed as a labeled tree. Each element can be reached through a certain path. If we attach to each element its appropriate label we will be able to build label paths through the tree. If the label paths of two elements are the same then we can consider these elements equivalent. Path summary is a tree on which every node represents one label path. Paths summary helps evaluate single path queries, but it does not keep the information about hierarchical relationships of individual elements. This was a reason for proposing an index structure which also provides detailed element-level connections, which is called a Compact Tree Structure (Ctree).
Ctree is a tree on which every node contains an array of elements describing relationships of the elements. It provides a summarized view of hierarchical structures at the group level and detailed information at the element level.
Ctree index data can be modeled in relational tables and be implemented in relational database. It is also possible to transform it into XML database or store it in disk files for fast access.
For storing Ctree structure index can be used into four tables: Elements, Groups, CtreeDB, and ElmPosLen.The first table stores the mappings from elements to their parents. The next one stores the group information and the label path. The Ctree table describes the main features of Ctree. As for the ElmPosLen table, it has the information about the position and length of each element.
Since XML has different data types, Ctree proposes five types of value indexes. The number of data types can be extended, if required. All five indexes support a search operation search(value, gid?) where value and gid represent spicific value and a certain group. This operator returns a list of absolute elements, if the gid is not specified, or relative elements, if it is. For Ctree-based query processing, in order to eliminate irrelevant groups by group-level structure mapping, gid is determined and then value predicates are evaluated. Therefore, the returned results and the I/O cost are reduced.
These are the advantages of Ctree-based query processing algorithm:
- Locating frames at the group level cuts a big number of irrelevant groups at an early stage;
- Evaluating a value predicate based on a group significantly reduces the possible matches and improves the efficiency of combining the matches for value predicates and structure constraints.
- Using an array for fast mapping elements to their parents helps evaluate constraints of element-level structure. Fast access to elements' parents is extremely important for efficient XML query processing because of the tree nature of XML data and queries are rooted on sharing common parents.
XISS/R system is an implementation of the XML Indexing and Storage System (XISS) on top of a relational database. The system includes a web-based user interface, which enables stored documents to be queried via XPath. The user interface utilizes the Xpath Query Engine, which automatically translates XPath queries into efï¬cient SQL statements.
The XISS/R system consists of three components:
- A mapping of XML data to relational schema;
- An XPath Query Engine;
- A web-based user interface;
The mapping of XML data to relational schema is done by using the extended preorder numbering scheme. During the research, two relational schemas were generated to make best use of this numbering scheme. The XPath Query Engine enables XPath queries to be issued on the relational implementation of the mapping of XML data. Universal access to the system is provided via a web-based interface which allows users to visually interact with the query engine and result set.
The current implementation of XISS/R uses MySQL but also can use any commercial database system. Documents are parsed and loaded into the database by a program that uses MySQL's C interface. This loader uses the LibXML library to access XML documents such that their structural information can be encoded with the extended preorder numbering scheme.
Several schemas were created to determine whether to store elements in a large node table or store them in separate node tables divided by node name. The overall trend in performance was that schemas using horizontal partitioning were faster. As for whether to store tag names in the node tables or store them in a separate tag name table, It was found that since the number of distinct elements and attributes is usually small, the join time between a tag name table and node tables is inconsequential compared to the common total query time. In addition to the tables in the schemas described before, was also utilized database indexes to accelerate query processing.
In order to compare the efficiency of both reviewed methods of XML indexing, two databases DBLP and XMARK were tested and the results of the work of both Ctree and XISS approaches were compared. According to the results of this test, Ctree requires much less space for building an index file than XISS (78MB to 118MB on DBLP and 28MB to 54MB on XMARK database). Then, 6 queries on each database were conducted. Results of this test show that Ctree significantly excels XISS in every query made on both database systems.
XML has become an important standard for data representation and exchange in the Internet. The need for efficient storage and manipulation is an important issue nowadays. Methods that I reviewed within this paper were designed to refine operating the data. It is notable from the research that in spite of good results of the XISS indexing system it is unable to compete with the Ctree Indexing method in terms of space- and time-consuming. The Ctree approach outperforms the XISS method by operating hierarchical connections and element descriptions as main sources for indexing XML documents. Conducted research opts for this approach as for an excellent choice for operating XML files.