Abstract. Extracting relevant information from the unstructured, semi-structured and structured content of construction tender documents is essential for improving decision-making in tender evaluation. However, the variety of forms in which information appears in tender documents makes the extraction process non-trivial. Manual identification, aggregation and synthesis of information by decision makers is inefficient and time consuming. Thus, semantic analysis of content and document structure using a domain knowledge representation is proposed to overcome the problem. The ontological-based information extraction process contains three important components: document structure ontology, document preprocessing and information acquisition. In the experiments, precision reached 82.35 % (concepts), 96.10 % (attributes) and 100 % (values), while recall reached 100 % for both concepts and attributes and 91.08 % for values.
Keywords: Document Analysis, Information Extraction, Ontology, Construction Tender
Introduction
Ontology plays an essential role in knowledge management because it provides a shared, common understanding of a domain. Over the last decade, ontology has been applied successfully in many research fields, including e-commerce [1, 2], medicine, bioinformatics and others. However, knowledge management in construction tendering has been studied less intensively because of its additional challenges in information integration, information analysis and weak interoperability between stakeholders.
Construction tendering processes involve large volumes of complex text documents in paper-based and digital formats. Tender documents may consist of unstructured information (terms of contract in natural language sentences), semi-structured information (forms) and structured information (tables). These documents contain diverse information such as project specifications, terms and regulations of the contract, tendering procedures, tender forms and supporting documents. In addition, tenderers tend to provide a large number of documents to prove their ability to win the tender. All these features are challenges for automated information extraction, since the documents are designed to be read by humans. Therefore, extracting information from text documents requires knowledge of both document structure and content.
In construction tendering, recognizing relevant information in tender documents is necessary for the decision-making process, especially in multi-criteria tender evaluation. The current approach is impractical and time consuming because the evaluators need to identify, aggregate and synthesize the salient information for these criteria manually. Consequently, information that would be useful for making a decision may be missed. This paper proposes an approach that computationally extracts human-readable document structure information and maps it into a machine-readable format using ontology. Here, predefined key knowledge automates the task of extracting tender information and hence improves information finding. The approach can be customized as a general information extraction tool for other documents with a similar structure.
Ontology is a knowledge representation model of real-world concepts and the intricate relationships between those concepts. We explore a semantic representation of concepts and their relationships based on a common vocabulary of document structure. Further, complex concepts are built from simpler concept definitions using OWL operators including union, intersection and complement. The extracted information is matched and associated with domain-specific keywords and regular expression data patterns defined in the ontology. Users are permitted to query the extracted data and to infer logical rules using the Pellet OWL reasoner to recognize which concepts fit under which definitions. Based on the inferred rules, candidate information is located and extracted when it satisfies the defined rule. In this way, the meaning of a document is recognized and significant information becomes available for the evaluation process.
The paper proceeds as follows. Section 2 briefly reviews related research on ontological-based information extraction. Section 3 describes the proposed ontological-based information extraction processes. The experimental setup and the results are presented and discussed in Section 4 and Section 5 respectively. Finally, Section 6 concludes with a summary of the paper and future research directions.
Related Works
The importance of ontology in information extraction is well recognized. Information extraction is defined as any method that analyzes large volumes of unstructured text, normally in natural language, and automatically extracts salient information from the text into a pre-defined template [7-9]. Ontology and information extraction are closely related in two main tasks: either a pre-existing ontology is used to extract information, or information extraction is used to populate and improve an ontology. The first task is the focus of this research. The use of ontology gives information extraction better access to and coverage of relevant information by providing application-specific domain knowledge [10, 11]. The output of this process is useful for further processing in a wide range of applications such as text mining, text classification, ontology learning, information retrieval, decision support systems and others.
Holzinger et al. populated a domain-specific ontology with instances extracted from structural information in tabular data of Web documents. They proposed a table ontology and used fixed adjacent attribute-value pairs to identify the instances. This paper shares the same interest in instantiating a domain ontology, yet expands it to process unstructured and semi-structured information. The table ontology is improved by identifying the headers of the tabular structure, which overcomes the fixed adjacent attribute-value pairs approach. In addition, non-tabular concepts are included to give semantic knowledge for non-tabular data. Meanwhile, Shashirekha and Murali improved a similar framework proposed by Embley et al. by identifying relevant information written in short forms or abbreviations using a domain-dependent ontology, and populated the extracted information into a relational database. Nevertheless, they examined only unstructured documents, and the semantics of document structures were not considered. In our approach, the semantic representations allow information to be extracted directly from documents and queried directly from the ontology.
WeDax is a web-based data extraction tool that restructures web documents into XML schema representations and maps them to a domain-specific ontology. However, the tool only manages to extract constantly changing data with a fixed structure. Moreover, XML is mainly a syntactic language and thus lacks support for efficient sharing between semantically defined concepts. Instead, our research uses the classification capabilities of a rule-based inference engine to identify meaningful instances without having to depend on a conventional mapping process. Another information extraction application, proposed by Biletskiy et al., is the Course Outline Data Extractor. The tool used data integration to automatically transform learning syllabi stored in HTML web pages into XML format. The relevant information is then extracted by computing the similarity between the source and target syllabi in XML using a domain-specific ontology. However, the main difference is that these works do not sufficiently consider semantic relations among the fundamental concepts of document structure, and their recognition of instances depends on heuristic mapping and matching algorithms.
Ontological-based Information Extraction Processes
In this paper, ontology is used to provide semantic analysis of document concepts and their instances in order to assist information extraction from tender documents. Hence, the ontological-based information extraction processes shown in Fig. 1 are proposed. There are three important components: document structure ontology, document preprocessing and information acquisition.
Fig. 1. Ontological-based Information Extraction Processes
[Figure not reproduced. Labelled elements: Tender Documents in PDF; Document Structure Ontology; Removing irrelevant tags and punctuations; Text Tokenizing and Coordinating; Table and Non Table Recognizing; Parsing Keywords and Expression Patterns; Concepts and Properties Retrieval; Keywords and Patterns Matching; Update semantic relations.]
Document Structure Ontology
Ontological modeling of a document represents knowledge about how the document is constructed, considering that most information in the document is visually represented as full sentences, forms and tabular data. The purpose of the document structure ontology is to provide a semantic knowledge representation of each concept in a document. Fig. 2 shows the ontological modeling of basic concepts such as Document, Non-Table, Table, Paragraph, Column, Row, Cell, Header, Body, Keyword and Pattern, which are associated through relationships. Some concepts in the ontology are reused from the table ontology proposed by Holzinger et al. The ontology is populated with instances according to the structure found in the document. Each document and each table is represented by an instance of the Document and Table concepts respectively, and instances are distinguished by a string title. The transitive consists relation is modeled as an OWL object property that expresses the containment relationship among the concepts Document, Table, Row, Column, Cell, Non-Table and Paragraph. Relevant data found in the document are stored in either a Cell or a Paragraph, which holds a string value modeled as an OWL datatype property. This value is associated with matched keywords and pattern rules. The Keyword concept can be specialized by subclasses such as KeywordConcept and KeywordAttribute. Here, related words and regular expression patterns are defined to guide the extraction. Meanwhile, complex concepts can be derived from basic concept definitions. All of these concepts and relationships are encoded in OWL, the standard ontology language. This document model can represent any standard document that contains the three types of information: unstructured, semi-structured and structured. It is adequate for modeling the structure of construction tender documents.
Fig. 2. Document Structure Ontology
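The containment structure described above can be illustrated with a minimal Python sketch. It is not the authors' OWL encoding: the class name Node, the function consists_closure and all sample titles and values are assumptions made for illustration. The sketch only shows how a transitive consists relation lets a Document reach every Cell and Paragraph beneath it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One instance of a document-structure concept (hypothetical stand-in for an OWL individual)."""
    concept: str                      # e.g. "Document", "Table", "Row", "Cell"
    title: str = ""                   # string title, used by Document and Table
    value: str = ""                   # string value, used by Cell and Paragraph
    parts: List["Node"] = field(default_factory=list)  # direct 'consists' links

    def add(self, child: "Node") -> "Node":
        """Attach a direct part and return it, so calls can be chained."""
        self.parts.append(child)
        return child


def consists_closure(node: Node) -> List[Node]:
    """Return every node reachable through 'consists' (its transitive closure)."""
    found = []
    stack = list(node.parts)
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(current.parts)
    return found


# Sample instances; titles and values are made up for illustration.
doc = Node("Document", title="Tender Form")
table = doc.add(Node("Table", title="Technical Staff"))
row = table.add(Node("Row"))
cell = row.add(Node("Cell", value="Site Engineer"))
paragraph = doc.add(Node("Non-Table")).add(Node("Paragraph", value="Time to complete: 40 weeks"))

# The Document reaches the Cell although it is only linked to the Table directly.
print([n.concept for n in consists_closure(doc)])
```

In an OWL reasoner the same effect comes from declaring the property transitive; the explicit traversal here only makes that inference visible.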
Document Preprocessing
Initially, the original documents are preprocessed by removing irrelevant tags and punctuation, tokenizing the text and assigning coordinates to the tokens. Tokenizing splits all sentences into single words. Identifying the coordinate (x, y) of each token in the documents is essential because all the documents are in Portable Document Format (PDF), which carries no explicit structure, and tables in PDF documents have no common tagged characteristics by which they could be identified. This operation is accomplished using the JPedal library (http://www.jpedal.org). Furthermore, a table extraction algorithm is applied to recognize tabular structure and tabular content. The algorithm is also able to identify non-tabular text. The preprocessed tabular and non-tabular text data are transformed into a text representation matrix. Together, the two types of text cover information in unstructured, semi-structured and structured form.
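The role of token coordinates can be sketched as follows. This is a simple heuristic written for illustration, not the table extraction algorithm used in the paper: it assumes each token is given as (text, x_start, x_end, y) with y increasing down the page, and the tolerance and gap thresholds, function names and sample tokens are all hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, float, float, float]  # (text, x_start, x_end, y), an assumed layout


def group_lines(tokens: List[Token], y_tol: float = 2.0) -> List[List[Token]]:
    """Group tokens whose y coordinates are within y_tol into the same text line."""
    lines: List[List[Token]] = []
    for tok in sorted(tokens, key=lambda t: (t[3], t[1])):
        if lines and abs(lines[-1][0][3] - tok[3]) <= y_tol:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return [sorted(line, key=lambda t: t[1]) for line in lines]


def split_cells(line: List[Token], gap: float = 20.0) -> List[str]:
    """Split one line into cells wherever the horizontal gap between tokens exceeds gap."""
    cells = [[line[0]]]
    for prev, tok in zip(line, line[1:]):
        if tok[1] - prev[2] > gap:
            cells.append([tok])
        else:
            cells[-1].append(tok)
    return [" ".join(t[0] for t in cell) for cell in cells]


def to_matrix(tokens: List[Token]) -> List[Tuple[str, List[str]]]:
    """Label each line 'table' (several cells) or 'non-table' (one run of text)."""
    matrix = []
    for line in group_lines(tokens):
        cells = split_cells(line)
        kind = "table" if len(cells) > 1 else "non-table"
        matrix.append((kind, cells))
    return matrix


# Made-up tokens: two table rows followed by one sentence.
sample = [
    ("Name", 50, 80, 100), ("Position", 200, 250, 100),
    ("Ali", 50, 70, 115), ("Engineer", 200, 255, 115),
    ("The", 50, 70, 140), ("tenderer", 74, 120, 140), ("agrees", 124, 160, 140),
]
for kind, cells in to_matrix(sample):
    print(kind, cells)
```

A real tender table would need more than a gap threshold (merged cells, wrapped text, multi-line headers), which is consistent with the simple-table assumption the paper mentions for its own algorithm.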
Information Acquisition
The structural relationships of documents are expressed by applying the document structure ontology. Initially, the basic concepts and properties defined in the document structure ontology are parsed, together with all the predefined keywords and regular expression data patterns. Subsequently, a content mapping algorithm maps the document structures and data values found in the preprocessed text into the ontology as instances. The concept definitions in the ontology specify the document structures, and the instances carry the data values. Semantic relationships between these instances are recognized as well: possible relationships between identified instances are updated by matching their lexical values against the keywords and regular expression patterns. As a result, the ontology holds all the facts about document structures and data values. In order to interpret and analyze the meaning of the extracted content, the Pellet OWL inference engine is used to reason over the rules expressed in the concept definitions. In this way, complex concept definitions are inferred from simpler ones using OWL operators. An example of a rule that can be inferred by the Pellet reasoner is depicted in Fig. 3. Here, the complex class StringName is derived by union of any Cell that has a KeywordName and has a Pattern. Furthermore, a SPARQL query is executed to retrieve the extracted information. The concepts for domain-specific keyword identification and regular expression patterns are semantically associated with Cell and Paragraph.
Fig. 3. Example of Complex Class Using OWL Operator and
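The matching step can be sketched in Python with plain sets standing in for OWL classes. This is an illustration under stated assumptions, not the paper's ontology or the Pellet reasoner: the keyword list, the regular expression, the cell records and the company names are all made up, and the sketch computes both the union and the intersection of the two conditions so that the effect of each OWL operator is visible.

```python
import re

# Hypothetical keyword list and data pattern for a "name" concept.
NAME_KEYWORDS = ("name",)
NAME_PATTERN = re.compile(r"^[A-Z][\w .&]*Sdn\. Bhd\.$")

# Made-up cells: each value cell is paired with the header text it sits under.
cells = [
    {"id": "c1", "header": "Name of Contractor", "value": "Maju Bina Sdn. Bhd."},
    {"id": "c2", "header": "Name of Project", "value": "2 storey school block"},
    {"id": "c3", "header": "Registered Company", "value": "Cekap Jaya Sdn. Bhd."},
]


def cells_with_keyword(records):
    """Ids of cells whose header contains one of the predefined keywords."""
    return {c["id"] for c in records
            if any(k in c["header"].lower() for k in NAME_KEYWORDS)}


def cells_with_pattern(records):
    """Ids of cells whose value matches the regular expression data pattern."""
    return {c["id"] for c in records if NAME_PATTERN.match(c["value"])}


has_keyword = cells_with_keyword(cells)
has_pattern = cells_with_pattern(cells)

# Set analogues of the OWL class operators.
union_class = has_keyword | has_pattern          # keyword OR pattern
intersection_class = has_keyword & has_pattern   # keyword AND pattern

print("has keyword:", sorted(has_keyword))
print("has pattern:", sorted(has_pattern))
print("union:", sorted(union_class))
print("intersection:", sorted(intersection_class))
```

With these sample cells the union accepts c2, whose header contains the keyword although its value is not a company name, while the intersection keeps only c1. The choice of operator therefore directly affects how many false matches a derived class admits.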
Experimental Setup
The purpose of this experiment is to evaluate the performance of the proposed approach in terms of precision and recall. Six sets of tender documents, submitted by six different contractors for a similar building construction project under Malaysian construction tendering, are used as the experimental data. Each document has approximately fifteen pages on average. Each document contains compulsory information about the tender agreement, certified approval letter, contractor background, financial data, technical staff, list of construction plant and equipment, and past and current projects. This information is visually represented as natural language sentences, forms and tables. Fig. 4 presents samples of the tender document formats. Each document contains a document title, concepts, attributes and values that need to be identified. The experiments are run in a Java-based environment and divided into three strategies according to information type.
Fig. 4. Types of Information on Construction Tender Documents; a) Unstructured Information, b) Semi Structured Information, and c) Structured Information
Result and Evaluation
The experiment was run to extract detailed tender content of three kinds: unstructured information such as the tender title, bid price and time to complete the project; the form-based representation of the tenderer's background profile; and structured tables containing data on company staff, the list of available facilities, financial records, current projects and past projects.
In order to evaluate the extraction accuracy, the standard information extraction measures of precision, recall and f-measure have been applied. Table 1 shows the evaluation results for computerized extraction. The three parameters evaluated are concepts, attributes and values, which reflect the prime categories of information that need to be extracted. Precision for concepts, attributes and values reached 82.35 %, 96.10 % and 100 % respectively. Recall reached 100 % for both concepts and attributes, and 91.08 % for values. The corresponding f-measures are 90.32 % for concepts, 98.01 % for attributes and 95.33 % for values. Precision is lowest for concepts, meaning that some strings were recognized as concepts incorrectly, while the only missed items were values.
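The relationship between the three measures can be checked with a short Python sketch. The function names are chosen for illustration, and the f-measure is assumed to be the usual harmonic mean of precision and recall (F1); the reported precision and recall figures are taken from the text above.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of extracted items that are correct."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Share of relevant items that were extracted."""
    return true_pos / (true_pos + false_neg)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Precision and recall (in percent) as reported in the text.
reported = {
    "concepts": (82.35, 100.0),
    "attributes": (96.10, 100.0),
    "values": (100.0, 91.08),
}

for name, (p, r) in reported.items():
    print(f"{name}: f-measure = {f_measure(p, r):.2f} %")
```

Recomputing the harmonic mean from the reported precision and recall gives 90.32 %, 98.01 % and 95.33 %, which agrees with the f-measure values stated above.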
The findings show that the ontology is capable of recognizing the important concepts, attributes and values and of representing them as instances in a machine-readable ontology format that can be used in further decision-making processes. The results indicate that the approach is able to detect information that matches the predetermined keywords and regular expression patterns. In addition, the capability of the ontology to support rule reasoning improves the extraction process.
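Table 1. Comparison Results of Information Extraction based on Precision, Recall and F-Measure
Computerized Information Extraction
Parameter     Precision (%)   Recall (%)   F-Measure (%)
Concepts      82.35           100          90.32
Attributes    96.10           100          98.01
Values        100             91.08        95.33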
Conclusion
This study has proposed an approach to information extraction using a domain-dependent ontology. Relevant information is recognized by matching against keywords and regular expression patterns. Measured by precision, recall and f-measure, the extraction achieved good experimental results. However, the use of keywords and regular expression patterns allows any matching string to be associated with them, including two different strings that merely contain a similar word. This can produce redundant recognized information and ambiguous interpretation of the knowledge. In future, the meaning of each keyword will be expanded hierarchically in the ontology in order to produce better-quality keywords. Furthermore, table recognition, which currently relies on a simple-table assumption, will be enhanced to handle complex tables. We are also considering including supporting documents as document sources and proposing a model that semantically matches the content of compulsory and supporting documents.