1. Introduction

1.1. The Problem Statement

This thesis deals with the problem of automatic generation of a UML Model from Natural Language Software Requirement Specifications. It describes the development of Auto Modeler, an Automated Software Engineering tool that takes Natural Language Software Requirement Specifications as input, performs an automated OO analysis, and produces a UML Model as output (at present a partial one, i.e. static Class diagrams only). The basis for Auto Modeler is described in [2][3].

1.2. Motivation

We conducted a short survey of the software industry in Islamabad in order to determine what sorts of Automated Software Engineering tools the software houses required. The results of the survey (see Appendix-I for the survey report) indicated that there is demand for a tool such as Auto Modeler. The tools of this kind that have already been developed, i.e. [2][3], are either not available in the market or are very expensive, and thus out of the reach of most software houses. We therefore decided to build our own tool that the software industry can use to become more productive and competitive. At present Auto Modeler is not ready for commercial use, but it is hoped that future versions will be able to cater to the needs of the software houses.

1.3. Background

1.3.1. The need for Automated Software Engineering Tools: In this era of Information Technology, great demands are placed on software systems and on all those involved in the SDLC. Software should not only be of high quality but should also be developed in a minimal amount of time. As regards quality, the software must be highly reliable, meet the customer's needs and satisfy the customer's expectations.

Automated Software Engineering tools can assist software engineers and developers in producing high-quality software in a minimal amount of time.

1.3.2. Requirements Engineering: Requirements engineering consists of the following tasks [6]:

· Requirements Elicitation

· Requirements Analysis

· Requirements Specification

· Requirements Validation / Verification

· Requirements Management

Requirements engineering is recognized as a critical task, since many software failures originate from inconsistent, incomplete or simply incorrect System Requirements specifications.

1.3.3. Natural Language Requirement Specifications: Formal methods have been successfully used to express Requirements Specifications, but often the customer cannot understand them and therefore cannot validate them [4]. Natural Language is the only common medium understood by both the Customer and the Analyst [4]. So the System Requirements Specifications are often written in Natural Language.

1.3.4. Object Oriented Analysis: The system analyst must manually process the Natural Language Requirements Specifications document, perform an OO analysis and produce the results in the form of a UML Model, which has become a standard in the software industry. The manual process is laborious, time consuming and prone to errors: some specified requirements might be left out, and problems or errors in the original requirements specifications may go undiscovered.

OOA applies the OO paradigm to models of proposed systems by defining classes, objects and the relationships between them. Classes are the most important building block of an OO system and from these we instantiate objects. Once an individual object is created it inherits the same operations, relationships, semantics, and attributes identified in the class. Attributes of classes, and hence objects, hold values of properties. Operations, also called methods, describe what can be done to an object/class.[1]
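To make this correspondence concrete, the following minimal Java sketch shows how classes, objects, attributes and operations map onto code. The Book class and its members are hypothetical examples, not taken from any system discussed in this thesis.

// A hypothetical class, illustrating the OO concepts above.
public class Book {

    // Attributes hold values of properties.
    private String title;
    private boolean reserved;

    // Instantiating the class produces an object.
    public Book(String title) {
        this.title = title;
        this.reserved = false;
    }

    // Operations (methods) describe what can be done to an object.
    public void reserve() {
        this.reserved = true;
    }
}

Every object instantiated from Book, e.g. new Book("UML Distilled"), carries the same attributes and operations identified in the class.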

A relationship between classes/objects can take various forms, such as aggregation, composition, generalization and dependency. Attributes and operations represent the semantics of the class, while relationships represent the semantics of the model [1]. The KRB seven-step method, introduced by Kapur, Ravindra and Brown, proposes how to find classes and objects manually [1]:

1. Identify candidate classes (nouns in the NL text).
2. Define classes (look for instantiations of classes).
3. Establish associations (capture verbs to create an association for each pair of classes from steps 1 and 2).
4. Expand many-to-many associations.
5. Identify class attributes.
6. Normalize attributes so that they are associated with the class of objects that they truly describe.
7. Identify class operations.

From this process we can see that one goal of OOA is to identify NL concepts that can be transformed into OO concepts, which can then be used to form system models in particular notations. Here we shall concentrate on UML [1].

1.3.5. Natural Language Processing (NLP): If an automatic analysis of the NL requirements document is carried out, it is not only possible to quickly find errors in the specifications; with the right methods a UML model can also be generated quickly from the requirements.

Natural language is inherently ambiguous, imprecise and incomplete; a natural language document is often redundant, and several classes of terminological problems (e.g., jargon or specialist terms) can arise to make communication difficult [2]. Moreover, it has been proven that natural language processing with holistic objectives is a very complex task: the complexities of language range from simple synonyms and antonyms to issues as complex as idioms, anaphoric relations and metaphors. Nevertheless, it is possible to extract sufficient meaning from NL sentences to produce reliable models, and efforts in this particular area have had some success in generating static object models from quite complex NL requirement sentences.

1.3.5.1. Linguistic analysis: Linguistic analysis studies NL text at different linguistic levels, i.e. the word, sentence and meaning levels.[1]

(i) Word-tagging analyses how a word is used in a sentence. In particular, a word's class can change from one sentence to another depending on context (e.g. light can be used as a noun, verb, adjective or adverb, and while can be used as a preposition, conjunction, verb or noun). Tagging techniques are used to specify the word-form of each single word in a sentence, and each word is tagged with a Part Of Speech (POS) tag, e.g. an NN1 tag would denote a singular noun, while VBB would signify the base form of a verb.[1]

(ii) Syntactic analysis applies phrase marker, or labeled bracketing, techniques to segment NL text into phrases, clauses and sentences, so that the NL is delineated by syntactical/grammatical annotations. Hence we can show how words are grouped and connected to each other in a sentence.[1]

(iii) Semantic analysis is the study of meaning. It uses discourse annotation techniques to analyze open-class or content words and closed-class words (i.e. prepositions, conjunctions, pronouns). The POS tags and syntactic elements mentioned previously can be linked in the NL text to create relationships.

Applying these linguistic analysis techniques, NLP tools can carry out morphological processing, syntactic processing and semantic processing. The processing of NL text can be supported by Semantic Network (SN) and corpora that provide a knowledge base for text analysis.

The difficulty of OOA is not just due to the ambiguity and complexity of NL itself, but also the gap in meaning between the NL concepts and OO concepts.[1]

1.3.6. From NLP to UML Model Creation. After NLP, the sentences are simplified in order to make the identification of UML model elements from NL elements easier. Simple heuristics are used to identify UML model elements from natural text (see Chapter 7; a sketch of their application follows the list below):

* Nouns indicate classes

* Verbs indicate operations

* Possessive relationships and verbs like to have, identify and denote indicate attributes

* Determiners are used to identify the multiplicity of roles in associations.
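The following Java sketch illustrates how such heuristics might be applied to POS-tagged text. The Tagged record, the tag names and the two heuristics shown are simplifications invented for illustration; Auto Modeler's actual rules are described in Chapter 7.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// A hypothetical POS-tagged token: surface form, POS tag and lemma.
record Tagged(String word, String tag, String lemma) {}

class ModelElementHeuristics {

    // Heuristic 1: nouns (tags starting with "NN") suggest candidate classes.
    static Set<String> candidateClasses(List<Tagged> sentence) {
        Set<String> classes = new LinkedHashSet<>();
        for (Tagged t : sentence)
            if (t.tag().startsWith("NN"))
                classes.add(capitalize(t.lemma()));
        return classes;
    }

    // Heuristic 2: verbs suggest operations, except attribute-indicating
    // verbs such as "have", "identify" and "denote".
    static Set<String> candidateOperations(List<Tagged> sentence) {
        Set<String> attributeVerbs = Set.of("have", "identify", "denote");
        Set<String> operations = new LinkedHashSet<>();
        for (Tagged t : sentence)
            if (t.tag().startsWith("VB") && !attributeVerbs.contains(t.lemma()))
                operations.add(t.lemma());
        return operations;
    }

    private static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        List<Tagged> s = List.of(new Tagged("The", "DT", "the"),
                new Tagged("librarian", "NN", "librarian"),
                new Tagged("issues", "VBZ", "issue"),
                new Tagged("books", "NNS", "book"));
        System.out.println(candidateClasses(s));    // [Librarian, Book]
        System.out.println(candidateOperations(s)); // [issue]
    }
}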

1.4. Plan of the thesis

In Chapter 2 we present a brief survey of previous and related work. Chapters 3, 4, 5, 6 and 7 describe the theoretical basis for Auto Modeler. Chapter 8 describes the architecture of Auto Modeler. In Chapter 9 we present Auto Modeler in action with a case study. In Chapter 10 we present conclusions.

2. Literature Survey

The first relevant published technique attempting to provide a systematic procedure for producing design models from NL requirements was Abbott's. Abbott (1983) proposes a linguistic-based method for analyzing software requirements, expressed in English, to derive basic data types and operations. [1]

This approach was further developed by Booch (1986). Booch describes an Object-Oriented Design method where nouns in the problem description suggest objects and classes of objects, and verbs suggest operations.[1]

Saeki et al. (1987) describe a process of incrementally constructing software modules from object-oriented specifications obtained from informal natural language requirements. Their system analyses the informal requirements one sentence at a time. Nouns and verbs are automatically extracted from the informal requirements but the system cannot determine which words are relevant for the construction of the formal specification. Hence an important role is played by the human analyst who reviews and refines the system results manually after each sentence is processed.[1]

Dunn and Orlowska (1990) describe a natural language interpreter for the construction of NIAM (Nijssen's, or Natural-language, Information Analysis Method) conceptual schemas. The construction of conceptual schemas involves allocating surface objects to entity types (semantic classes) and the identification of elementary fact types. The system accepts declarative sentences only and uses grammar rules and a dictionary for type allocation and the identification of elementary fact types.[1]

Meziane (1994) implemented a system for the identification of VDM data types and simple operations from natural language software requirements. The system first generates an Entity-Relationship Model (ERM) from the input text and then generates VDM data types from the ERM.[1]

Mich and Garigliano (1994) and Mich (1996) describe an NL-based prototype system, NL-OOPS, that is aimed at the generation of object-oriented analysis models from natural language specifications. This system demonstrated how a large scale NLP system called LOLITA can be used to support the OO analysis stage.[1]

V. Ambriola and V. Gervasi [4] have developed CIRCE, an environment for the analysis of natural language requirements. It is based on the concept of successive transformations that are applied to the requirements in order to obtain concrete (i.e., rendered) views of models extracted from the requirements. CIRCE uses CICO, a domain-based, fuzzy-matching parser, which parses the requirements document and converts it into an abstract parse tree. This parse tree is encoded as tuples and stored in a shared repository by CICO. A group of related tuples constitutes a T-Model. To further enrich the tuple space, CIRCE uses internal tools that refine the encoded tuples (called extensional knowledge) with knowledge about the basic behavior of software systems (called intensional knowledge) derived from modelers. When a specific concrete view of the requirements is desired, a projector is called to build an abstract view of the data from the tuple space. A translator then converts the abstract view to a concrete view. In [5] V. Ambriola and V. Gervasi describe their experience of automatic synthesis of UML diagrams from natural language requirement specifications using their CIRCE environment.

Delisle et al., in their project DIPETT-HAIKU, capture candidate objects, linguistically differentiating between Subjects (S) and Objects (O), and processes, Verbs (V), using the syntactic S-V-O sentence structure. This work also suggests that candidate attributes can be found in the noun modifiers of compound nouns, e.g. reserved is the value of an attribute in "reserved book".[1]

Harmain and Gaizauskas developed an NLP-based CASE tool, CM-Builder [2][3], which automatically constructs an initial class model from NL text. It captures candidate classes, rather than candidate objects.

Börstler constructs an object model automatically based on pre-specified key words in a use case description. The verbs in the key words are transformed into behaviors and the nouns into objects.[1]

Overmyer and Rambow developed an NLP system to construct UML class diagrams from NL descriptions. Both of these efforts require user interaction to identify OO concepts.[1]

The prototype tool developed by Perez-Gonzalez and Kalita supports automatic OO modeling from NL problem descriptions into UML notations, and produces both static and dynamic views. The underlying methodology includes theta roles and semi-natural language.[1]

3. Software Requirements Engineering

Software requirements engineering is the science and discipline concerned with establishing and documenting software requirements [6]. It consists of:

* Software requirements elicitation:- The process through which the customers (buyers and/or users) and the developer (contractor) of a software system discover, review, articulate, and understand the users' needs and the constraints on the software and the development activity.

* Software requirements analysis:- The process of analyzing the customers' and users' needs to arrive at a definition of software requirements.

* Software requirements specification:- The development of a document that clearly and precisely records each of the requirements of the software system.

* Software requirements verification:- The process of ensuring that the software requirements specification is in compliance with the system requirements, conforms to document standards of the requirements phase, and is an adequate basis for the architectural (preliminary) design phase.

* Software requirements management:- The planning and controlling of the requirements elicitation, specification, analysis, and verification activities.

In turn, system requirements engineering is the science and discipline concerned with analyzing and documenting system requirements. It involves transforming an operational need into a system description, system performance parameters, and a system configuration. This is accomplished through the use of an iterative process of analysis, design, trade-off studies, and prototyping.

Software requirements engineering has a similar definition as the science and discipline concerned with analyzing and documenting software requirements. It involves partitioning system requirements into major subsystems and tasks, then allocating those subsystems or tasks to software. It also transforms allocated system requirements into a description of software requirements and performance parameters through the use of an iterative process of analysis, design, trade-off studies, and prototyping.

A system can be considered a collection of hardware, software, data, people, facilities, and procedures organized to accomplish some common objectives. In software engineering, a system is a set of software programs that provide the cohesiveness and control of data that enables the system to solve the problem.[6]

The major difference between system requirements engineering and software requirements engineering is that the origin of system requirements lies in user needs while the origin of software requirements lies in the system requirements and/or specifications. Therefore, the system requirements engineer works with users and customers, eliciting their needs, schedules, and available resources, and must produce documents understandable by them as well as by management, software requirements engineers, and other system requirements engineers.

The software requirements engineer works with the system requirements documents and engineers, translating system documentation into software requirements which must be understandable by management and software designers as well as by software and system requirements engineers. Accurate and timely communication must be ensured all along this chain if the software designers are to begin with a valid set of requirements. [6]

4. Automated Software Engineering Tools

Software engineering is concerned with the analysis, design, implementation, testing, and maintenance of large software systems. Automated software engineering focuses on how to automate or partially automate these tasks to achieve significant improvements in quality and productivity.

Automated software engineering applies computation to software engineering activities. The goal is to partially or fully automate these activities, thereby significantly increasing both quality and productivity. This includes the study of techniques for constructing, understanding, adapting and modeling both software artifacts and processes. Automatic and collaborative systems are both important areas of automated software engineering, as are computational models of human software engineering activities. Knowledge representations and artificial intelligence techniques applicable in this field are of particular interest, as are formal techniques that support or provide theoretical foundations.[7]

Automated software engineering approaches have been applied in many areas of software engineering. These include requirements definition, specification, architecture, design and synthesis, implementation, modeling, testing and quality assurance, verification and validation, maintenance and evolution, configuration management, deployment, reengineering, reuse and visualization. Automated software engineering techniques have also been used in a wide range of domains and application areas including industrial software, embedded and real-time systems, aerospace, automotive and medical systems, Web-based systems and computer games.[7]

Research into Automated Software Engineering includes the following areas:

* Automated reasoning techniques

* Component-based systems

* Computer-supported cooperative work

* Configuration management

* Domain modeling and meta-modeling

* Human-computer interaction

* Knowledge acquisition and management

* Maintenance and evolution

* Model-based software development

* Modeling language semantics

* Ontologies and methodologies

* Open systems development

* Product line architectures

* Program understanding

* Program synthesis

* Program transformation

* Re-engineering

* Requirements engineering

* Specification languages

* Software architecture and design

* Software visualization

* Testing, verification, and validation

* Tutoring, help, and documentation systems

5. Natural Language Processing

Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

5.1. Language Processing

Language processing can be divided into two tasks:[11]

* Processing written text, using lexical, syntactic, and semantic knowledge of the language as well as any required real world information.[11]

* Processing spoken language, using all the information needed above, plus additional knowledge about phonology as well as enough additional information to handle the further ambiguities that arise in speech.[11]

5.2. Uses for NLP:

5.2.1. User interfaces. Natural language interfaces would be better than obscure command languages: it would be nice if you could just tell the computer what you want it to do. Of course we are talking about a textual interface, not speech.[10]

5.2.2. Knowledge Acquisition. Programs that could read books, manuals or the newspaper would mean that we do not have to explicitly encode all of the knowledge they need to solve problems.[10]

5.2.3. Information Retrieval. Finding articles about a given topic requires a program that can somehow determine whether an article matches a given query.[10]

5.2.4. Translation. It would be very useful if machines could automatically translate from one language to another. This was one of the first tasks computers were applied to; it is very hard.[10]

5.3. Linguistic levels of Analysis

Language obeys regularities and exhibits useful properties at a number of somewhat separable "levels".[10]

Think of language as the transfer of information. It is much more than that, but that is a good place to start.

Suppose that the speaker has some meaning that they wish to convey to some hearer.[10]

Speech (or gesture) imposes a linearity on the signal. All you can play with is the properties of a sequence of tokens. Actually, why tokens? Well for one thing that makes it possible to learn.[10]

So the other thing to play with is the order in which the tokens can occur.

So somehow, a meaning gets encoded as a sequence of tokens, each of which has some set of distinguishable properties, and is then interpreted by figuring out what meaning corresponds to those tokens in that order.[10]

Another way to think about it is that the properties of the tokens and their sequence somehow "elicits" an understanding of the meaning. Language is a set of resources to enable us to share meanings, but isn't best thought of as a means for *encoding* meanings. This is a sort of philosophical issue perhaps, but if this point of view is true, it makes much of the AI approach to NLP somewhat suspect, as it is really based on the "encoded meanings" view of language.[10]

The lowest level is the actual properties of the signal stream:

phonology -- speech sounds and how we make them

morphology -- the structure of words

syntax -- how the sequences are structured

semantics -- meanings of the strings

There are important interfaces among all of these levels. For example sometimes the meaning of sentences can determine how individual words are pronounced.[10]

All these levels are obviously needed. But language turns out to be cleverer than this. For example, language can be more efficient by not having to say the same thing twice, so we have pronouns and other ways of making use of what has already been said:

A bear went into the woods. It found a tree.

Also, since language is most often used among people who are in the same situation, it can make use of features of the situation:

this/that

you/me/they

here/there

now/then

The mechanisms whereby features of the context are used, whether it is the context created by a sequence of sentences or the actual situation in which the speaking happens, are called "pragmatics".[10]

Another issue has to do with the fact that the simple model of language as information transfer is clearly not right. For one thing, we know there are at least the following three types of sentences:

statements

imperatives

questions

And each of them can be used to do a different kind of thing. The first *might* be called information transfer. But what about imperatives? What about questions? To some degree the analysis of such sentences involves the basic notion of speech acts.[10]

There are other, higher-levels of structuring that language exhibits. For example there is conversational structure, where people know when they get to talk in a conversation, and what constitutes a valid contribution. There is "narrative structure" whereby stories are put together in ways that make sense and are interesting. There is "expository structure" which involves the way that informative texts (like encyclopedias) are arranged so as to usefully convey information. These issues blend off from linguistics into literature and library science, among other things.[10]

Of course with hypertext and multi-media and virtual reality, these higher levels of structure are being explored in new ways.[10]

5.4. Steps in Natural Language Understanding

The steps in the process of natural language understanding are:[11]

5.4.1. Morphological analysis

Individual words are analyzed into their components, and non-word tokens (such as punctuation) are separated from the words. For example, in the phrase "Bill's house" the proper noun "Bill" is separated from the possessive suffix "'s."[11]

5.4.2. Syntactic analysis. Linear sequences of words are transformed into structures that show how the words relate to one another. This parsing step converts the flat list of words of the sentence into a structure that defines the units represented by that list. Constraints imposed include word order ("manager the key" is an illegal constituent in the sentence "I gave the manager the key"); number agreement; case agreement.[11]

5.4.3. Semantic analysis. The structures created by the syntactic analyzer are assigned meanings. In most universes, the sentence "Colorless green ideas sleep furiously" [Chomsky, 1957] would be rejected as semantically anomalous. This step must map individual words into appropriate objects in the knowledge base, and must create the correct structures to correspond to the way the meanings of the individual words combine with each other. [11]

5.4.4. Discourse integration. The meaning of an individual sentence may depend on the sentences that precede it and may influence the sentences yet to come. The entities involved in the sentence must either have been introduced explicitly or they must be related to entities that were. The overall discourse must be coherent. [11]

5.4.5. Pragmatic analysis. The structure representing what was said is reinterpreted to determine what was actually meant. [11]

5.5. Syntactic Processing

Syntactic parsing determines the structure of the sentence being analyzed. Syntactic analysis involves parsing the sentence to extract whatever information the word order contains. Syntactic parsing is computationally less expensive than semantic processing.[10]

A grammar is a declarative representation that defines the syntactic facts of a language. The most common way to represent grammars is as a set of production rules, and the simplest structure for them to build is a parse tree which records the rules and how they are matched. [10]

Sometimes backtracking is required (e.g., The horse raced past the barn fell), and sometimes multiple interpretations may exist for the beginning of a sentence (e.g., Have the students who missed the exam -- ). [10]

Example: Syntactic processing interprets the difference between "John hit Mary" and "Mary hit John."

5.6. Semantic Analysis

After (or sometimes in conjunction with) syntactic processing, we must still produce a representation of the meaning of a sentence, based upon the meanings of the words in it. The following steps are usually taken to do this: [10]

5.6.1. Lexical processing. Look up the individual words in a dictionary. It may not be possible to choose a single correct meaning, since there may be more than one. The process of determining the correct meaning of individual words is called word sense disambiguation or lexical disambiguation. For example, "I'll meet you at the diamond" can be understood since "at" requires either a time or a location. When it is not clear which definition we should prefer, this usually leads to preference semantics. [10]
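The following Java sketch illustrates this preference-semantics idea on the "diamond" example; the sense inventory and the categories preferred by "at" are hypothetical simplifications, not a real lexicon.

import java.util.*;

// Each word sense carries a semantic category; "at" prefers TIME or LOCATION.
enum Category { TIME, LOCATION, OBJECT }

class Disambiguator {

    static final Map<String, List<Category>> SENSES = Map.of(
        "diamond", List.of(Category.OBJECT,     // the gem
                           Category.LOCATION)); // the baseball field

    // Pick the first sense compatible with what the governing preposition prefers.
    static Category disambiguate(String noun, Set<Category> preferred) {
        for (Category sense : SENSES.get(noun))
            if (preferred.contains(sense)) return sense;
        return SENSES.get(noun).get(0); // fall back to the most common sense
    }

    public static void main(String[] args) {
        // "I'll meet you at the diamond": "at" requires a time or a location,
        // so the LOCATION sense (the baseball field) is chosen.
        System.out.println(disambiguate("diamond",
                EnumSet.of(Category.TIME, Category.LOCATION))); // LOCATION
    }
}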

5.6.2. Sentence-level processing. There are several approaches to sentence-level processing. These include semantic grammars, case grammars, and conceptual dependencies. [10]

Example: Semantic processing determines the difference between such sentences as "The ink is in the pen" and "The pig is in the pen", where pen denotes a writing instrument in the first and an enclosure in the second.

5.6.3. Discourse and Pragmatic Processing. To understand most sentences, it is necessary to know the discourse and pragmatic context in which it was uttered. In general, for a program to participate intelligently in a dialog, it must be able to represent its own beliefs about the world, as well as the beliefs of others (and their beliefs about its beliefs, and so on).[10]

The context of goals and plans can be used to aid understanding. Plan recognition has served as the basis for many understanding programs -- PAM is an early example. [10]

5.7. Issues in Syntax

For various reasons, a lot of attention in computational linguistics has been paid to syntax. Partly this has to do with the fact that real linguists have spent a lot of work on it; partly it is because syntax needs to be done before just about anything else can be done. I won't talk much about morphology. We will assume that words can be associated with a set of features or properties. For example the word "dog" is a noun, it is singular, and its meaning involves a kind of animal. The word "dogs" is related, obviously, but has the property of being plural. The word "eat" is a verb, it is in what we might call the "base" form, and it denotes a particular kind of action. The word "ate" is related, but is in the "past tense" form. You can imagine, I'm sure, that the techniques of knowledge representation that we have looked at can be applied to the problem of representing facts about the properties and relations among words. [11]

The key observation in the theory of syntax is that the words in a sentence can be more or less naturally grouped into what are called "phrases", and those phrases can often be treated as a unit.

So in a sentence "The dog chased the bear," the sequence "the dog" forms a natural unit. The sequence "chased the bear" is a natural unit, as is "the bear".[11]

Why do I say that "the dog" is a natural unit? Well one thing is that I can replace it by another sequence that has the same referent, or a related referent. For example I could replace it by: [11]

Snoopy (a name)

It (a pronoun)

My brother's favorite pet (a more complex description)

What about "chased the bear"? Again, I could replace it by

died (a single word)

was hit by a truck (a more complex event)

This basic structure, in English, is sometimes called the "subject-predicate" structure. The subject is a nominal, something that can refer to an object or thing, the predicate is a "verb phrase", which describes an action or event. Of course, as in the example, the verb phrase can also contain other constituents, for example another nominal. [11]

These phrases also have structure. For example a noun phrase (a kind of nominal) can have a determiner, zero or more adjectives, and a noun, maybe followed by another phrase, like:

the big dog that ate my homework

Verb phrases can have complicated "verb groups" like

will not be eaten

Syntactic theories try to predict and explain what patterns are used in a language. Sometimes this involves figuring out what patterns just don't work. For example the following sentences have something wrong with them: [11]

* the dogs runs home

* he died the book

* she saw himself in the mirror

* they told it to she

Figuring out exactly what is wrong with such sentences allows linguists to create theories that help understand the way that sentences get structured.

The general idea, in English, is that a sentence consists, as I said, of a subject and a predicate. A predicate is a verb followed by one or more nominal or prepositional phrases. Verbs often require a certain number of either nominal or prepositional phrases; these are called "complements" [11]. For example:

it died (no complements, "intransitive")

the horse kicked the farmer (one complement "transitive")

I gave her the book (two complements)

I gave the book to her (one complement is a prepositional phrase)

The sentences above are wrong for reasons that can be stated clearly. But another class of constraints was discovered in the early 60s. They generally involve sentences in which a component is moved out of its ordinary position, for example to make a question or relative clause. [11]

Consider:

I like flowers.

Can be transformed into:

What do I like?

And

He gave the fish to Ned.

Can be transformed to:

Who did he give the fish to?

(Some people say this is ungrammatical. They are wrong. But even the "grammatical" version "to whom did he give the fish?" illustrates the point I am making.)

The general rule seems to be that you can take any nominal, replace it with a question word, and move it to the front of the sentence. [11]

But consider the following sentences:

A She likes ice cream and olives.

A' * What does she like ice cream and?

B I know a Democrat who hates Clinton.

B' * Who do you know a Democrat who hates?

Now these sentences are interesting because it is not exactly clear what sort of rule is being broken: you never see such sentences in language textbooks as the sort of thing to avoid, and children never produce them, even though children often make the sorts of errors mentioned previously. [11]

Other information may also be added to a sentence which is not required by the verb but which adds detail about what is going on; these additions are called "adjuncts".

it died yesterday (gives time)

it died in the garage (gives location)

it died because nobody fed it (gives reason)

Note that in the last example a "sentence" is part of another sentence. This can happen in various ways. For example some verbs take sentence-like units as complements: [11]

he thought I liked him

Or, as above, they can be used as adjuncts. Rather than call these sentences, they are sometimes called "clauses" -- a clause is a verb with some other arguments, usually its complements, sometimes (not always) a subject.

"Phrase structure trees" are often used to represent the configuration of sentences. These can show how the structural elements are related, and the relations among nodes in the tree can be used to describe constraints that have to hold. [11]

One approach to characterizing syntactic structure involves giving rules to describe how phrases can be generated. For example here are some such rules:

S -> NP VP

NP -> Det {Adj} Noun

VP -> Verb {NP} {PP}

PP -> Prep NP

A category in braces {} is optional.

Assuming that we have a "lexicon" of words, with their categories represented, these rules could be used to generate some syntactic structures that sentences may exhibit. [11]
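As an illustrative sketch, such rules can be represented directly as data and used to generate word sequences that the grammar licenses. The rule encoding and the tiny lexicon below are invented for this example (optional constituents are written out as alternative expansions).

import java.util.*;

// The context-free rules above, encoded as data.
class TinyGrammar {

    static final Map<String, List<List<String>>> RULES = Map.of(
        "S",  List.of(List.of("NP", "VP")),
        "NP", List.of(List.of("Det", "Noun"), List.of("Det", "Adj", "Noun")),
        "VP", List.of(List.of("Verb"), List.of("Verb", "NP"),
                      List.of("Verb", "PP"), List.of("Verb", "NP", "PP")),
        "PP", List.of(List.of("Prep", "NP")));

    static final Map<String, List<String>> LEXICON = Map.of(
        "Det", List.of("the"), "Adj", List.of("big"),
        "Noun", List.of("dog", "bear"), "Verb", List.of("chased"),
        "Prep", List.of("in"));

    static final Random RNG = new Random();

    // Expand a category into a random string of words licensed by the grammar.
    static String generate(String category) {
        if (LEXICON.containsKey(category)) {
            List<String> words = LEXICON.get(category);
            return words.get(RNG.nextInt(words.size()));
        }
        List<List<String>> bodies = RULES.get(category);
        List<String> body = bodies.get(RNG.nextInt(bodies.size()));
        StringBuilder out = new StringBuilder();
        for (String c : body) out.append(generate(c)).append(" ");
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(generate("S")); // e.g. "the dog chased the bear"
    }
}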

Suppose we add this rule:

NP -> Det {Adj} Noun {PP}

For example "the man on the dock". This gives rise to the possibility that two sentences with the same sequence of words could be grouped differently.

I saw the man with a telescope.

These different configurations can be associated with different meanings. This is called "syntactic ambiguity." Ambiguity is when a word or sentence can be taken as having more than one distinct meaning. For example some words have more than one meaning:

I went to the bank.

Different meanings of words can cause sentences to be understood in very different ways:

I saw her duck.

Flying planes can be dangerous.

The sorts of rules that I have described are called "context-free" because the rewrite operation that they describe doesn't depend on any context in which the left-hand symbol occurs. But this can't capture some fairly simple regularities: [11]

Agreement: *She saw himself.

Complements: *He put the block.

Case: *They saw she.

To solve this, rules need to specify not just what tree configurations can occur but must somehow also indicate constraints that hold among the elements in the tree.

Another issue is that some sentences seem pretty directly related to others. For example consider the following pairs: [11]

he ate the fish / the fish was eaten by him

she read the book / what did she read?

the dog is at the corner / the dog at the corner barked

There is a sense in which the second sentence or phrase is a "transformed" version of the first. This observation led to a powerful theory of syntactic structure called "transformational grammar" in which a language began with some simple context-free rules and some local constraints to create a set of basic sentences, which could then be transformed in various ways. [11]

It turned out however that this didn't really work, so lately linguists are looking at a more abstract theory. The basic idea is that there is a general theory of phrase structure:

X -- lexical category (noun, preposition, verb)

X' -- "modified" lexical category (with complements)

X'' -- "specified" lexical category.

Constraints can be specified among phrases built up this way. And restrictions on movement can be stated.

The hypothesis goes even deeper than this, in that some linguists believe that this representation system is somehow innate, that it underlies all human linguistic knowledge. The evidence for this claim is the fact that all languages can be described using this terminology (more or less) and that it doesn't have to be this way. There is also evidence having to do with the fact that there are often relations between ordering rules in languages that seem to hold for all phrases, rather than for just one type of phrase. [11]

For example there are languages in which the complements of a verb go after the verb (Like English.) In many of these languages, modifiers to nouns and complements to prepositions go after the modified element (like English for prepositions, but not for nouns, French is a good example of this). Obviously this doesn't always work, but it works often enough that some researchers think that there might be something there. Others think this whole notion is totally bogus (for example most people at UCSD).

5.8. Issues in Parsing

Given all of the attention paid to syntax, it is not surprising that a lot of work has been done on getting computers to come up with a characterization of the syntactic structure of sentences. Obviously, the way that this will work depends on the specific syntactic theory you believe in, but in general a parsing program is a search through the space of possible structural characterizations of the sentence, constrained by the fact that the structural characterization must be compatible with the given sequence of words. [11]

Most of the research on automatic parsing has involved context-free grammars. Sometimes the basic ideas from context-free parsing are then augmented to make the parser able to handle non-context-free constraints. [11]

The general idea of parsing with a set of context-free rules is to start generating possible tree structures, until a rule generates a lexical category. This is then checked against the next word in the sentence. If it is of the appropriate category, the parse continues. If not, the parser must explore another node in the search space. [11]

For example:

S -> NP VP

NP -> Det Noun

VP -> Verb {NP} {PP}

PP -> Prep NP

Suppose we are parsing:

The dog barked in the yard.

We assume we have sentence, so we start with the tree:

S

We expand it using the rule

NP VP

Working from left to right, we expand the NP node:

Det Noun

Now "Det" is a lexical category, so we look at the first word of the sentence, it is indeed a determiner, so we continue. The next category "Noun" is also a lexical category, so we check, and succeed. [11]

Now we come to a non-lexical category, VP, so we find a rule for that. This rule has optional constituents, so we treat each possibility as a separate node. Our first node assumes that both optional constituents are absent: [11]

VP -> Verb

And we create a node for each of the other possibilities:

VP -> Verb NP

VP -> Verb PP

VP -> Verb NP PP

The first node predicts a verb and one is there, so we continue. However, that rule says we should be done, and we aren't yet, so it fails, and we go back to the next node. This one also predicts a verb, so we continue. We expand an NP node, which predicts a determiner, but there is none there, so that one fails. [11]

The next node predicts a verb, and we expand the PP node to predict a preposition, which is what is there, and we continue on.

Obviously there can be lots more complexity to all of this but the general idea in what is called "top down" parsing is a depth-first search down the left side of the tree until a lexical category is predicted. This is compared with the next word in the sentence. [11]

To handle non-context-free phenomena, a context-free parser is sometimes augmented with some additional tests or operations to perform after the parser succeeds on the context-free operation to possibly eliminate some sentences. For example we might have:

S -> NP VP (= (number NP) (number VP))

Here 'number' returns whether its argument is singular or plural. Of course we would have to augment our representation of the syntactic structure somehow to record this and other potentially relevant syntactic properties. [11]
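As a minimal illustration of the top-down procedure just described, the following Java sketch parses "the dog barked in the yard" with the four rules above. The lexicon, the integer-position representation and the enumeration of the optional VP constituents are simplifications invented for this sketch, not the parser used in Auto Modeler.

import java.util.Map;

// Top-down parsing with: S -> NP VP, NP -> Det Noun,
// VP -> Verb {NP} {PP}, PP -> Prep NP.
// Each function returns the input position after the constituent it
// recognized, or -1 on failure (failure sends us back to try another node).
class TopDownParser {

    static final Map<String, String> LEXICON = Map.of(
            "the", "Det", "dog", "Noun", "yard", "Noun",
            "barked", "Verb", "in", "Prep");

    static String[] words;

    static int lexical(String category, int i) {
        return (i < words.length && category.equals(LEXICON.get(words[i])))
                ? i + 1 : -1;
    }

    static int np(int i) {
        int j = lexical("Det", i);
        // An augmented parser would check number agreement here.
        return j < 0 ? -1 : lexical("Noun", j);
    }

    static int pp(int i) {
        int j = lexical("Prep", i);
        return j < 0 ? -1 : np(j);
    }

    // Try the VP expansions in order: Verb, Verb PP, Verb NP, Verb NP PP.
    // Since S -> NP VP, a VP succeeds only if it consumes the rest of the input.
    static int vp(int i) {
        int v = lexical("Verb", i);
        if (v < 0) return -1;
        for (boolean withNP : new boolean[] { false, true })
            for (boolean withPP : new boolean[] { false, true }) {
                int j = v;
                if (withNP) j = np(j);
                if (j >= 0 && withPP) j = pp(j);
                if (j == words.length) return j;
            }
        return -1;
    }

    public static void main(String[] args) {
        words = "the dog barked in the yard".split(" ");
        int afterNP = np(0);
        System.out.println(afterNP >= 0 && vp(afterNP) == words.length
                ? "parse succeeded" : "parse failed");
    }
}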

5.9. Issues in Semantics

Although it is hard to tell sometimes at linguistics talks, the only reason that people are interested in syntax is that the structure of a sentence is presumably related somehow to the meaning that it conveys. [11]

One idea in semantics we have already seen: the idea of hierarchies of objects. To some degree, the meanings of nouns and noun phrases can be understood with the sorts of knowledge representation ideas we have already looked at, and many of these ideas were developed for natural language understanding systems. [11]

The idea of the "referent" of a noun phrase--the thing that it refers to, usually by satisfying some description.

The idea of hierarchies of objects can also be extended to hierarchies of actions and events. In the theory of "conceptual dependency" the claim is that relations among complex events can be captured by composing them out of simpler events. [11]

A key idea in representing events is that certain kinds of events have specific "participants". For example a "buy" event has a buyer, a seller and a thing bought. A "move" event has the thing that moves, possibly an initial and a final location, and maybe a path along which the motion happens. [11]

These observations lead to the theory of "case frames". A case frame is a representation of an action or event, along with its participants. The reason they are called "case" frames has to do with the fact that in many languages (though not English), nouns are assigned case depending on the role that the referent of the noun phrase plays in the sentence. For example in Latin, there is a different ending to indicate if the word is the subject of the sentence, the direct object, or if it refers to a location (and some more). [11]

The idea of case frames is that each verb is associated with a specific case frame, and a set of "role mappings" which indicate how the syntactic arguments of the sentence are assigned to the participant slots in the case frame.

Here are some typical slots in case frames:

agent

object

location

source

goal

beneficiary

For example the verb "buy" might be associated with a "purchase" case frame with a buyer, a seller and a thing bought. We will assume that it uses the "source" slot for the seller, the "goal" slot for the buyer, and the "object" slot for the thing bought. [11]

Thus the verb "buy" maps the subject of the sentence to the "goal" slot, the direct object to the "object" slot, and the object of the preposition "from" to the "source" slot. Note that prepositions are often used to assign case roles: "from" often marks the "source" slot and "to" often marks the "goal" slot.

Now consider the verb "sell". This evokes the same case frame but with different mappings: the subject is now the source, the object is again the object, and the object of "to" is the goal. [11]
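As a sketch, the buy/sell example can be written down as data along the following lines; the argument names and the fill procedure are invented for illustration.

import java.util.*;

// A sketch of case frames: one "purchase" frame, two verbs mapping
// syntactic arguments onto its slots in different ways.
class CaseFrames {

    // syntactic argument -> case slot
    static final Map<String, String> BUY = Map.of(
        "subject", "goal",          // the buyer
        "direct-object", "object",  // the thing bought
        "pp-from", "source");       // the seller

    static final Map<String, String> SELL = Map.of(
        "subject", "source",        // the seller
        "direct-object", "object",  // the thing bought
        "pp-to", "goal");           // the buyer

    // Fill the purchase frame from parsed syntactic arguments.
    static Map<String, String> fill(Map<String, String> mapping,
                                    Map<String, String> args) {
        Map<String, String> frame = new HashMap<>();
        args.forEach((arg, filler) -> {
            String slot = mapping.get(arg);
            if (slot != null) frame.put(slot, filler);
        });
        return frame;
    }

    public static void main(String[] args) {
        // "John bought the car from Mary" and "Mary sold the car to John"
        // fill the same frame: {goal=John, object=the car, source=Mary}.
        System.out.println(fill(BUY, Map.of("subject", "John",
                "direct-object", "the car", "pp-from", "Mary")));
        System.out.println(fill(SELL, Map.of("subject", "Mary",
                "direct-object", "the car", "pp-to", "John")));
    }
}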

5.10. Issues in Pragmatics

Pragmatics usually refers to how contextual resources are used to work out the specific meanings of sentences. Sometimes the contextual resources are linguistics, for example referring expressions, and sometimes they are part of the speech situation, for example the speaker and hearer, and the time and place of the utterance. [11]

So for example we have in English the difference between "definite" and "indefinite" reference. An "indefinite" expression gives a description and is often used to indicate that an object satisfying that description is to be newly introduced into the discourse. A "definite" referring expression is used to refer back to a previously mentioned entity. So in: [11]

A bear came to our campsite last night.

The bear was eating our garbage.

It scared my brother.

The first expression, "a bear", is indefinite: it introduces the entity into the discourse store. "The bear" is definite: it refers to the previously introduced bear, as does "it". All of this requires some notion of a "structure" or "context" in which referring expressions are introduced. [11]

The discourse situation must also be represented for many references to be understood. For example we need to represent the speaker and hearer, and perhaps onlookers, if we are to work out the intended referents of "me", "you", "us" and "them", as well as the times of "now" and "yesterday", and the locations of "here" and "there". [11]

Different languages partition the speech situation in different ways than English does. For example many languages have a second person plural, sort of like "you all". Some have two different kinds of first person plural: one that includes the hearer, and one that doesn't. Spanish, for example, has four spatial pronouns: one for near the speaker, one for near the hearer, one for the region where both are, and one for a region far from both. [11]

5.11. Issues in Discourse

The next level of analysis is called "discourse theory". This is about the higher level relations that hold among sequences of sentences in a discourse or a narrative. It merges sometimes with literary theory, but also with pragmatics. [11]

One thing to understand is that different sentences do different kinds of "work" in a discourse. We have seen some examples of this already -- noun phrases that refer to new entities, or back to previously introduced ones. The same goes for whole sentences: some introduce new events or relations, and some use them to introduce something new. [11]

A car began rolling down the hill

It collided with a lamppost.

One important idea in discourse theory is the idea that much language is performed in the context of some mutual activity. For example two people could be working on some project together. In this case, they are probably both somewhat aware of the plan that they are both following, and so much of the pragmatic information needed to understand what they are talking about can be thought of in terms of that plan. And sometimes utterances can be understood as if they were steps in the execution of a plan. For example if I say, [11]

please pass the salt

This could be thought of as a way to get me the salt, if having salt was part of a plan.

Some people think of sentences like

can you pass the salt

As "indirect speech acts" because they look like questions, but aren't really. One way to think about sentences like this is that the hearer understands that this is probably not a question, but is a conventionalized (and polite) means of asking for the salt. [11]

Another analysis of this sort of sentence is that you are trying to avoid rejection. You do this by considering ways that your plan might fail. So you don't want to have this happen:

please pass the salt

I can't, I'm tied up with ropes.

oh, sorry.

So you ask about potential problems first -- asking about ability. So that if there is a problem, you don't have to ask directly and you won't be rejected. It is sort of like: [11]

are you doing anything saturday night?

yes, I'm feeding my goldfish

So you don't have to be rejected if you actually ask for a date. [11]

6. GATE (General Architecture for Language Engineering)

We have used GATE [8] as the NLP engine in Auto Modeler. GATE is an infrastructure for developing and deploying software components that process human language. GATE helps scientists and developers in three ways[8]:

By specifying an architecture, or organizational structure, for language processing software;
By providing a framework, or class library, that implements the architecture and can be used to embed language processing capabilities in diverse applications;
By providing a development environment built on top of the framework, made up of convenient graphical tools for developing components.

The architecture exploits component-based software development, object orientation and mobile code. The framework and development environment are written in Java and available as open-source free software under the GNU Library Licence. GATE uses Unicode [8] throughout, and has been tested on a variety of Slavic, Germanic, Romance, and Indic languages [8].

From a scientific point of view, GATE's contribution is to enable quantitative measurement of the accuracy and repeatability of results for verification purposes [8].

GATE has been in development at the University of Sheffield since 1995 and has been used in a wide variety of research and development projects [8]. Version 1 of GATE was released in 1996, was licensed by several hundred organizations, and used in a wide range of language analysis contexts including Information Extraction ([8]) in English, Greek, Spanish, Swedish, German, Italian and French. Version 3.1 of the system, a complete reimplementation and extension of the original, is available from http://gate.ac.uk/download/ [8].

GATE is distributed with an Information Extraction system called ANNIE (A Nearly-New IE system). ANNIE relies on finite state algorithms and the JAPE language [8]. ANNIE's components form a processing pipeline.

ANNIE components are included with GATE. We have used ANNIE for tokenization, sentence splitting and part-of-speech tagging. For morphological analysis we have used the GATE Morphological Analyzer, and for domain-independent semantic analysis we use GATE's SUPPLE parser. Each process is explained in detail below.
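As an illustration, embedding such a pipeline through the GATE framework looks roughly like the following sketch. Exact resource names and plugin registration vary with the GATE version (the names below are those used by ANNIE in GATE 3.x), so this is a sketch rather than Auto Modeler's actual initialization code.

import gate.*;
import gate.creole.SerialAnalyserController;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        Gate.init(); // initialise the GATE framework (GATE home must be set)

        // Assemble tokeniser -> sentence splitter -> POS tagger.
        SerialAnalyserController pipeline = (SerialAnalyserController)
            Factory.createResource("gate.creole.SerialAnalyserController");
        pipeline.add((ProcessingResource)
            Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
        pipeline.add((ProcessingResource)
            Factory.createResource("gate.creole.splitter.SentenceSplitter"));
        pipeline.add((ProcessingResource)
            Factory.createResource("gate.creole.POSTagger"));

        // Run the pipeline over one requirements sentence.
        Corpus corpus = Factory.newCorpus("requirements");
        corpus.add(Factory.newDocument("The librarian issues books to members."));
        pipeline.setCorpus(corpus);
        pipeline.execute();
    }
}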

6.1. Tokeniser

The tokeniser splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, we distinguish between words in uppercase and lowercase, and between certain types of punctuation. The aim is to limit the work of the tokeniser to maximise efficiency, and enable greater flexibility by placing the burden on the grammar rules, which are more adaptable [8].

6.1.1. Tokeniser Rules. A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular expression which has to be matched on the input; the RHS describes the annotations to be added to the Annotation Set. The LHS is separated from the RHS by '>'. The following operators can be used on the LHS [8]:

| (or)

* (0 or more occurrences)

? (0 or 1 occurrences)

+ (1 or more occurrences)

The RHS uses ';' as a separator, and has the following format:

{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}

Details about the primitive constructs available are given in the tokeniser file (DefaultTokeniser.Rules).

The following tokeniser rule is for a word beginning with a single capital letter:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* >

Token;orth=upperInitial;kind=word;

It states that the sequence must begin with an uppercase letter, followed by zero or more

lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.
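In spirit, this rule corresponds to a regular expression like the one in the Java sketch below; note that GATE itself compiles tokeniser rules into finite state machines rather than using java.util.regex.

import java.util.regex.*;

// A sketch of the upperInitial rule as a Java regular expression.
class UpperInitialSketch {

    // An uppercase letter followed by zero or more lowercase letters.
    static final Pattern UPPER_INITIAL = Pattern.compile("\\p{Lu}\\p{Ll}*");

    public static void main(String[] args) {
        Matcher m = UPPER_INITIAL.matcher("Librarian");
        if (m.matches())
            // The matched span would be annotated as:
            System.out.println("Token; orth=upperInitial; kind=word");
    }
}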

6.1.2. Token Types. In the default set of rules, the following kinds of Token and SpaceToken are possible [8]:

6.1.2.1. Word: A word is defined as any set of contiguous upper or lowercase letters, including a hyphen (but no other forms of punctuation). A word also has the attribute “orth”, for which four values are defined:[8]

• upperInitial - initial letter is uppercase, rest are lowercase

• allCaps - all uppercase letters

• lowerCase - all lowercase letters

• mixedCaps - any mixture of upper and lowercase letters not included in the above categories

6.1.2.2. Number: A number is defined as any combination of consecutive digits. There are no subdivisions of numbers.[8]

6.1.2.3. Symbol: Two types of symbol are defined: currency symbol (e.g. ‘$', ‘£') and symbol (e.g. ‘&', ‘ˆ'). These are represented by any number of consecutive currency or other symbols (respectively).[8]

6.1.2.4. Punctuation: Three types of punctuation are defined: start punctuation (e.g. ‘('), end punctuation (e.g.‘)'), and other punctuation (e.g. ‘:'). Each punctuation symbol is a separate token.[8]

6.1.2.5. SpaceToken: White spaces are divided into two types of SpaceToken - space and control - according to whether they are pure space characters or control characters. Any contiguous (and homogeneous) set of space or control characters is defined as a SpaceToken.

The above description applies to the default tokeniser. However, alternative tokenisers can be created if necessary. The choice of tokeniser is then determined at the time of text processing.[8]

6.1.3. English Tokeniser. The English Tokeniser is a processing resource that comprises a normal tokeniser and a JAPE transducer [8]. The transducer has the role of adapting the generic output of the tokeniser to the requirements of the English part-of-speech tagger. One such adaptation is the joining together in one token of constructs like “ '30s”, “ 'Cause”, “ 'em”, “ 'N”, “ 'S”, “ 's”, “ 'T”, “ 'd”, “ 'll”, “ 'm”, “ 're”, “ 'til”, “ 've”, etc. Another task of the JAPE transducer is to convert negative constructs like “don't” from three tokens (“don”, “ ' ” and “t”) into two tokens (“do” and “n't”).[8]

The English Tokeniser should always be used on English texts that need to be processed afterwards by the POS Tagger.[8]

6.2. Gazetteer

The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organizations, days of the week, etc.[8] Below is a small section of the list for units of currency:

Ecu

European Currency Units

FFr

Fr

German mark

German marks

New Taiwan dollar

New Taiwan dollars

NT dollar

NT dollars

An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type. In the example below, the first column refers to the list name, the second column to the major type, and the third to the minor type. These lists are compiled into finite state machines. Any text tokens that are matched by these machines will be annotated with features specifying the major and minor types. Grammar rules then specify the types to be identified in particular circumstances. Each gazetteer list should reside in the same directory as the index file.

currency_prefix.lst:currency_unit:pre_amount

currency_unit.lst:currency_unit:post_amount

date.lst:date:specific

day.lst:date:day

So, for example, if a specific day needs to be identified, the minor type “day” should be specified in the grammar, in order to match only information about specific days; if any kind of date needs to be identified, the major type “date” should be specified, to enable tokens annotated with any information about dates to be identified. More information about this can be found in the following section.[8]
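A much-simplified sketch of the lookup step follows; GATE compiles the lists into finite state machines, whereas this illustration uses a plain map from entry to (majorType, minorType), with entries invented for the example.

import java.util.*;

// A simplified gazetteer: each entry carries the major/minor types
// declared for its list in lists.def.
class GazetteerSketch {

    record Lookup(String majorType, String minorType) {}

    static final Map<String, Lookup> ENTRIES = Map.of(
        "German mark", new Lookup("currency_unit", "post_amount"),
        "Monday",      new Lookup("date", "day"));

    // Annotate a phrase if it matches a gazetteer entry.
    static Optional<Lookup> annotate(String phrase) {
        return Optional.ofNullable(ENTRIES.get(phrase));
    }

    public static void main(String[] args) {
        annotate("Monday").ifPresent(l ->
            System.out.println("Lookup; majorType=" + l.majorType()
                    + "; minorType=" + l.minorType()));
    }
}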

6.3. Sentence Splitter

The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the POS tagger. The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.[8]

Each sentence is annotated with the type Sentence. Each sentence break (such as a full stop) is also given a “Split” annotation. This has several possible types: “.”, “punctuation”, “CR” (a line break) or “multi” (a series of punctuation marks such as “?!?!”).

The sentence splitter is domain and application-independent.[8]

6.4. Part of Speech Tagger

The POS tagger [8] is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. The list of tags used is given in [8]. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). Both of these can be modified manually if necessary. Two additional lexicons exist - one for texts in all uppercase (lexicon cap), and one for texts in all lowercase (lexicon lower). To use these, the default lexicon should be replaced with the appropriate lexicon at load time. The default ruleset should still be used in this case.[8]

The ANNIE Part-of-Speech tagger requires the following parameters.

* encoding - encoding to be used for reading rules and lexicons (init-time)

* lexiconURL - The URL for the lexicon file (init-time)

* rulesURL - The URL for the ruleset file (init-time)

* document - The document to be processed (run-time)

* inputASName - The name of the annotation set used for input (run-time)

* outputASName - The name of the annotation set used for output (run-time). This is an optional parameter; if the user does not provide a value, new annotations are created under the default annotation set.

* baseTokenAnnotationType - The name of the annotation type that refers to Tokens in a document (run-time, default = Token)

* baseSentenceAnnotationType - The name of the annotation type that refers to Sentences in a document (run-time, default = Sentences)

* outputAnnotationType - POS tags are added as category features on the annotations of type “outputAnnotationType” (run-time, default = Token)

If (inputASName == outputASName) and (outputAnnotationType == baseTokenAnnotationType), then new features are added to the existing annotations of type “baseTokenAnnotationType”. Otherwise, the tagger searches for an annotation of type “outputAnnotationType” under the “outputASName” annotation set that has the same offsets as the annotation of type “baseTokenAnnotationType”. If it finds one, it adds the new feature to that annotation; otherwise, it creates a new annotation of type “outputAnnotationType” under the “outputASName” annotation set.[8]

6.5. GATE Morphological Analyzer

The Morphological Analyzer Processing Resource (PR) can be found in the Tools plugin [8]. It takes as input a tokenized GATE document. Considering one token and its part-of-speech tag at a time, it identifies the token's lemma and affix. These values are then added as features on the Token annotation. Morpher is based on regular expression rules. These rules were originally implemented by Kevin Humphreys in GATE version 1 in a language called Flex [8]. Morpher can interpret these rules, with the extension that users may add new rules or modify existing ones based on their requirements. In order to allow these operations with as little effort as possible, the way the rules are written was changed.[8]

Two types of parameters, Init-time and run-time, are required to instantiate and execute the PR.

* rulesFile (init-time) The rule file contains several regular expression patterns. Each pattern has two parts, an LHS and an RHS: the LHS defines the regular expression and the RHS the function name to be called when the pattern matches the word under consideration.

* caseSensitive (init-time) By default, all tokens under consideration are converted into lowercase to identify their lemma and affix. If the user selects caseSensitive to be true, words are no longer converted into lowercase.

* document (run-time) Here the document must be an instance of a GATE document.

* affixFeatureName Name of the feature that should hold the affix value.

* rootFeatureName Name of the feature that should hold the root value.

* annotationSetName Name of the annotationSet that contains Tokens.

* considerPOSTag Each rule in the rule file has a separate tag, which specifies the part-of-speech tag(s) for which the rule should be considered. If this option is set to false, all rules are matched against all words. This option is very useful. For example, the word “singing” can be used as a noun as well as a verb. In the case where it is identified as a verb, its lemma is “sing” and its affix “ing”; otherwise there is no affix.

6.5.1. Rule File. GATE provides a default rule file, called default.rul, which is available under the gate/plugins/Tools/morph/resources directory. The rule file has two sections:[8]

* Variables

* Rules

6.5.1.1. Variables: The user can define various types of variables under the section defineVars. These variables can be used as part of the regular expressions in rules. There are three types of variables:

* Range: the user specifies a range of characters, e.g. A ==> [-a-z0-9]

* Set: the user specifies a set of characters, where one character at a time from this set is used as a value for the variable. When the variable is used in a regular expression, all values are tried one by one to generate the string that is compared with the contents of the document, e.g. A ==> [abcdqurs09123]

* Strings: whereas the two types above hold only one character from the given set or range at a time, this type allows strings as possibilities for the variable, e.g. A ==> "bb" OR "cc" OR "dd". A sketch of expanding these variable types into ordinary regular expressions is given below.
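The expansion can be sketched as follows; one assumption of the sketch is that the range/set character classes carry over unchanged to Python regular-expression syntax.

import re

# Sketch: expand the part after "==>" in a variable definition into a
# Python regular-expression fragment.
def expand_variable(rhs):
    if rhs.startswith("["):
        return rhs                               # range/set: class as-is
    alts = re.findall(r'"([^"]*)"', rhs)         # "bb" OR "cc" OR "dd"
    return "(?:%s)" % "|".join(map(re.escape, alts))

print(expand_variable("[-a-z0-9]"))              # -> [-a-z0-9]
print(expand_variable('"bb" OR "cc" OR "dd"'))   # -> (?:bb|cc|dd)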

6.5.1.2. Rules: All rules are declared under the section defineRules. Every rule has two parts, an LHS and an RHS. The LHS specifies the regular expression and the RHS the function to be called when the LHS matches the given word. “==>” is used as the delimiter between the LHS and the RHS.

The LHS has the following syntax:

<"*" | "verb" | "noun"> <regular expression>

The user can specify that a rule should be considered only when the word is identified as a “verb” or a “noun”; “*” indicates that the rule should be considered for all part-of-speech tags. Whether the part-of-speech tag is used to decide if a rule applies is enabled or disabled by setting the considerPOSTag option. Regular expressions may combine any string with any of the variables declared under the defineVars section and with the Kleene operators “+” and “*”. Below we give a few examples of LHS expressions.

* <verb>"bias"

* <verb>"canvas"{ESEDING} ("ESEDING" is a variable defined under the defineVars section; note that variables are enclosed in "{" and "}")

* <noun>({A}*"metre") ("A" is a variable followed by the Kleene operator "*", which means "A" can occur zero or more times)

* <noun>({A}+"itis") ("A" is a variable followed by the Kleene operator "+", which means "A" can occur one or more times)

* <*>"aches" ("<*>" indicates that the rule should be considered for all part-of-speech tags)

On the RHS of the rule, the user has to specify one of the functions listed below. These functions are hard-coded in the Morph PR in GATE and are invoked if the regular expression on the LHS matches a particular word.[8]

* stem(n, string, affix) Here,

o n = number of characters to be truncated from the end of the word.

o string = the string to be appended to the truncated word to produce the root.

o affix = affix of the word.

* irreg_stem(root, affix) Here,

o root = root of the word.

o affix = affix of the word.

* null_stem() This means the word is itself the base form and should not be analyzed.

* semi_reg_stem(n, string) The semi_reg_stem function is used with regular expressions that end with any of the {EDING} or {ESEDING} variables defined under the variables section. If the regular expression matches the given word, this function is invoked; it returns the value of the variable (i.e. {EDING} or {ESEDING}) as the affix. To find the lemma of the word, it removes n characters from the back of the word and appends the string to the end.
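The semantics of these functions, as described above, can be sketched as follows. This is illustrative only: the example calls assume plausible rules rather than GATE's actual default.rul entries.

# Sketch of the RHS functions' semantics; each returns (root, affix).
def stem(word, n, string, affix):
    # Truncate n characters from the end and append 'string'.
    return word[:len(word) - n] + string, affix

def irreg_stem(word, root, affix):
    # Root and affix are given directly by the rule.
    return root, affix

def null_stem(word):
    # The word is already its own base form.
    return word, ""

def semi_reg_stem(word, n, string, matched_ending):
    # Remove n characters from the back, append 'string'; the matched
    # {EDING}/{ESEDING} value becomes the affix.
    return word[:len(word) - n] + string, matched_ending

print(stem("running", 4, "", "ing"))     # -> ('run', 'ing')   [assumed rule]
print(irreg_stem("went", "go", "ed"))    # -> ('go', 'ed')     [assumed rule]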

6.6. SUPPLE Parser

SUPPLE (written in Prolog) is a bottom-up parser that constructs syntax trees and logical forms for English sentences. The parser is complete in the sense that every analysis licensed by the grammar is produced. In the current version only the 'best' parse is selected at the end of the parsing process. The English grammar is implemented as an attribute-value context-free grammar which consists of subgrammars for noun phrases (NP), verb phrases (VP), prepositional phrases (PP), relative phrases (R) and sentences (S). The semantics associated with each grammar rule allow the parser to produce logical forms composed of unary predicates to denote entities and events (e.g., chase(e1), run(e2)) and binary predicates for properties (e.g., lsubj(e1,e2)). Constants (e.g., e1, e2) are used to represent entity and event identifiers. The GATE SUPPLE wrapper stores the syntactic information produced by the parser in the GATE document in the form of: SyntaxTreeNodes, which are used to display the parse tree when the sentence is 'edited'; 'parse' annotations containing a bracketed representation of the parse; and 'semantics' annotations containing the logical forms produced by the parser.[8]

6.6.1. Running the parser in GATE. In order to parse a document you will need to construct an application that has [8]:

* Tokeniser

* Sentence Splitter

* POS-tagger

* Morphology

* SUPPLE Parser with parameters

o mapping file (config/mapping.config)

o feature table file (config/feature_table.config)

o parser file (supple.plcafe or supple.sicstus or supple.swi)

o prolog implementation (shef.nlp.supple.prolog.PrologCafe, shef.nlp.supple.prolog.SICStusProlog, shef.nlp.supple.prolog.SWIProlog or shef.nlp.supple.prolog.SWIJavaProlog).

6.6.2. Configuration files. Two files are used to pass information from GATE to the SUPPLE parser: the mapping file and the feature table file [8].

6.6.2.1. Mapping file: The mapping file specifies how annotations produced using GATE are to be passed to the parser. The file is composed of a number of pairs of lines: the first line in a pair specifies a GATE annotation we want to pass to the parser; it includes the AnnotationSet (or default), the AnnotationType, and a number of features and values that depend on the AnnotationType. The second line of the pair specifies how to encode the GATE annotation in a SUPPLE syntactic category; this line also includes a number of features and values. As an example consider the mapping [8]:

Gate;AnnotationType=Token;category=DT;string=&S

SUPPLE;category=dt;m_root=&S;s_form=&S

This specifies how a determiner ('DT') will be translated into a category 'dt' for the parser.

The construct '&S' is used to represent a variable that will be instantiated to the appropriate value during the mapping process. More specifically, a token like 'The' recognized as a DT by the POS tagger will be mapped into the following category:

dt(s_form:'The',m_root:'The',m_affix:'_',text:'_').

As another example consider the mapping:

Gate;AnnotationType=Lookup;majorType=person_first;minorType=female;string=&S

SUPPLE;category=list_np;s_form=&S;ne_tag=person;ne_type=person_first;gender=female

This specifies that an annotation of type 'Lookup' in GATE is mapped into a category 'list_np' with specific features and values. More specifically, a token like 'Mary' identified in GATE as a Lookup will be mapped into the following SUPPLE category [8]:

list_np(s_form:'Mary',m_root:'_',m_affix:'_',

text:'_',ne_tag:'person',ne_type:'person_first',gender:'female').
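The substitution of '&S' can be illustrated with a short sketch. It is an illustration only; the real wrapper also fills in default values and feature order from the feature table (next section), which this sketch ignores.

# Sketch of instantiating '&S' in the SUPPLE line of a mapping pair.
def apply_mapping(supple_line, token_string):
    fields = supple_line.split(";")[1:]          # drop the 'SUPPLE' prefix
    category, features = None, []
    for field in fields:
        key, value = field.split("=", 1)
        value = value.replace("&S", token_string)
        if key == "category":
            category = value
        else:
            features.append("%s:'%s'" % (key, value))
    return "%s(%s)." % (category, ",".join(features))

print(apply_mapping("SUPPLE;category=dt;m_root=&S;s_form=&S", "The"))
# -> dt(m_root:'The',s_form:'The').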

6.6.2.2. Feature table: The feature table file specifies SUPPLE 'lexical' categories and their features. As an example, an entry in this file is [8]:

n;s_form;m_root;m_affix;text;person;number

which specifies which features a noun category should be written with, and in which order. In this case:

n(s_form:...,m_root:...,m_affix:...,text:...,person:...,number:....).
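A small sketch of how such an entry determines the written form of a category follows; treating '_' as the default for missing feature values is an assumption based on the examples above.

# Sketch: format a SUPPLE category from a feature-table entry.
def format_category(table_line, values):
    parts = table_line.split(";")
    cat, features = parts[0], parts[1:]
    body = ",".join("%s:%s" % (f, values.get(f, "'_'")) for f in features)
    return "%s(%s)." % (cat, body)

print(format_category("n;s_form;m_root;m_affix;text;person;number",
                      {"s_form": "'book'", "m_root": "'book'"}))
# -> n(s_form:'book',m_root:'book',m_affix:'_',text:'_',person:'_',number:'_').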

6.6.3. Parser and Grammar. The parser builds a semantic representation compositionally, and a ‘best parse' algorithm is applied to each final chart, providing a partial parse if no complete sentence span can be constructed. The parser uses a feature-valued grammar. Each category entry has the form [8]:

Category(Feature1:Value1,...,FeatureN:ValueN)

where the number and type of features is dependent on the category type. All categories will have the features s_form (surface form) and m_root (morphological root); nominal and verbal categories will also have person and number features; verbal categories will also have tense and vform features; and adjectival categories will have a degree feature. The list_np category has the same features as other nominal categories plus ne_tag and ne_type [8].

Syntactic rules are specified in Prolog with the predicate rule(LHS,RHS), where LHS is a syntactic category and RHS is a list of syntactic categories. A rule such as BNP_HEAD -> N (“a basic noun phrase head is composed of a noun”) is written as follows:

rule(bnp_head(sem:E^[[R,E],[number,E,N]],number:N),

[n(m_root:R,number:N)]).

where the feature 'sem' is used to construct the semantics while the parser processes input, and E, R, and N are variables to be instantiated during parsing.

The full grammar of this distribution can be found in the prolog/grammar directory; the file load.pl specifies which grammars are used by the parser. The grammars are compiled when the system is built, and the compiled version is used for parsing [8].

6.6.4. Mapping Named Entities. SUPPLE has a Prolog grammar which deals with named entities; the only information required is the Lookup annotations produced by GATE, which are specified in the mapping file. However, you may want to pass named entities identified with your own JAPE grammars in GATE. This can be done using a special syntactic category provided with this distribution. The category sem_cat is used as a bridge between GATE named entities and the SUPPLE grammar. An example of how to use it (provided in the mapping file) is:

Gate;AnnotationType=Date;string=&S

SUPPLE;category=sem_cat;type=Date;text=&S;kind=date;name=&S

which maps a named entity 'Date' into a syntactic category 'sem_cat'. A grammar file called semantic_rules.pl is provided to map sem_cat into the appropriate syntactic category expected by the phrasal rules. The following rule, for example:

rule(ne_np(s_form:F,sem:X^[[name,X,NAME],[KIND,X]]),[

sem_cat(s_form:F,text:TEXT,type:'Date',kind:KIND,name:NAME)]).

is used to parse a 'Date' into a named entity in SUPPLE which in turn will be parsed into a noun phrase [8].

7. From NLP to UML Model

After NLP, the sentences are simplified in order to make the identification of UML model elements from them easier. Heuristics are then used to identify candidate UML model elements (classes, attributes, operations and the relationships amongst them) from the NL elements.

7.1. Standardization of NL Sentences

In order to simplify the final mapping onto classes and use cases, it is helpful to create a unified sentence structure, in the form of S-V-O (subject-verb-object) triples, throughout the input text. This detracts from the human readability of the text, but helps the machine readability [9]. This is achieved by applying the rules below.

Where the Subject precedes an equal parallel structure, split into two (or more) simpler sentences, where the Subject is shared in turn by the following parts. Therefore, convert “S-V1-O1-V2-O2” to “S-V1-O1” “S-V2-O2”. As an example, “The baker kneads the dough and bakes the bread” becomes “The baker kneads the dough” “The baker bakes the bread”. [9]

Where both Subject and Verb precede an equal parallel structure, split into two (or more) simpler sentences, where the Subject-Verb is shared in turn by the following parts. Therefore, convert “S-V-O1-,/and-O2-,/and-O3” to “S-V-O1” “S-V-O2” “S-V-O3”. As an example, “The baker bakes cakes and bread” becomes “The baker bakes cakes” “The baker bakes bread”. [9]

Where a Verb leads an equal parallel structure, the Verb is shared by the following parts, which also share the potential Subject. Therefore, convert “(Subject)-V-O1-,/and-O2-,/and-O3…” to “(Subject)-V-O1” “V-O2” “V-O3”. [9]

Where the sentence contains a verb in a continuous tense, regard this as a complementary structure, where the Subject is shared. Therefore, convert “S-V1-O-V2ing” to “S-V1-O” “ S-V2”. For instance, “Bakers make bread by baking” becomes “Bakers make bread” “Bakers bake”. [9]

If the sentence uses the passive voice, convert “S-Ved” to “V-O(S)”. For instance, “the shoe is ordered by customers” is reformulated as “customers order the shoe”. [9]
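Two of these rules can be sketched as follows. The sketch is illustrative only: real input needs the parser's phrase-structure output, so plain string triples stand in for parsed sentences.

# Sketch of the shared-subject and shared-subject-verb rules above.
def split_shared_subject(subject, clauses):
    # "The baker kneads the dough and bakes the bread" ->
    # two S-V-O triples sharing the subject.
    return [(subject, verb, obj) for verb, obj in clauses]

def split_shared_subject_verb(subject, verb, objects):
    # "The baker bakes cakes and bread" ->
    # two S-V-O triples sharing subject and verb.
    return [(subject, verb, obj) for obj in objects]

print(split_shared_subject("The baker",
                           [("kneads", "the dough"), ("bakes", "the bread")]))
print(split_shared_subject_verb("The baker", "bakes", ["cakes", "bread"]))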

7.2. Heuristics to Map NL Elements to UML Model Elements

1. Translate Nouns to Classes. [9]

2. Translate Noun-Noun to Class-Attribute, i.e. translate the first noun as a class and the second noun as an attribute of that class. [9]

3. Translate Subject (S) - Verb(V) - Object(O) structure to a class diagram with the Subject and Object as classes both sharing the verb as a candidate method. [9]

4. Translate the Verb of a non-personal noun to Class-Operation, where the non-personal noun is the class and the Verb is an operation of the class. [9]

5. Treat two Consecutive Nouns after a Verb as a single noun i.e. Class. [9]

6. For every candidate class, find its frequency in the text. The most frequent candidates strongly suggest classes (a sketch of this frequency-based filtering is given after this list). [3]

7. Attributes can be found using some simple heuristics like the possessive relationships and use of the verbs to have, denote, and identify. [3]

8. Attributive adjectives denote attribute values of the nouns they modify. These are interesting language elements that give more information about the entities denoted by nouns. For example, in a sentence like “a large library has many sections,” the adjective large indicates the existence of the attribute size associated with the entity library. [3]

9. Any candidate class that has a low frequency (e.g., 1) and does not participate in any relationship is discarded from the list. [3]

10. Some sentence patterns, e.g., ‘something is made up of something', ‘something is part of something' and ‘something contains something', denote aggregation relations. [3]

11. Determiners are used to identify the multiplicity of roles in associations. Our approach identifies three types of UML multiplicities: [3]

o 1 for exactly one: identified by the presence of indefinite articles, the definite article with a singular noun, and the determiner one. [3]

o * for many: identified by the presence of any of the determiners each, all, every, many, and some. [3]
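The frequency heuristic (items 6 and 9 above) can be sketched as follows; the noun list, association triples, and threshold below are illustrative assumptions.

from collections import Counter

# Sketch of heuristics 6 and 9: rank candidate classes by frequency
# and discard isolated, low-frequency candidates.
def filter_candidates(candidate_nouns, associations, min_freq=2):
    # candidate_nouns: all noun occurrences (candidate class names);
    # associations: (class1, verb, class2) triples already identified.
    freq = Counter(candidate_nouns)
    related = {c for (a, _, b) in associations for c in (a, b)}
    return [c for c in freq if freq[c] >= min_freq or c in related]

nouns = ["library", "member", "library", "item", "barcode reader"]
assocs = {("member", "borrows", "item")}
print(filter_candidates(nouns, assocs))
# 'barcode reader' is dropped: frequency 1 and no association.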

8. System Architecture of Auto Modeler

8.1. Overview

The Auto Modeler is a modular NLP-based ASE tool, built as a multi-tier Windows desktop application. It has the following main modules:

* Windows (GUI Interface)
* NLP System
* OOA Module
* Model Viewer
* Repository

The NLP system of Auto Modeler uses the General Architecture for Language Engineering (GATE) version 3.1 (see Chapter 6) for NLP in the backend.

Auto Modeler works as follows:

* The System Analyst provides input to Auto Modeler by loading the functional specifications of the system to be built, which have been gathered in the form of a Natural Language Requirements Specification document.

* The NLP system of Auto Modeler syntactically and semantically parses the informal requirements text and saves the output in the system repository.

* The OOA module uses the output of the NLP system to produce the UML Model. At present it only identifies the object classes, their attributes and the relationships among them, and stores the result in the repository.

* The Model viewer generates and shows the UML model to the user. It also exports the Model to Rational Rose upon user request.

8.2. Architecture Specification

In this section we describe the modules of Auto Modeler:

8.2.1. Windows (GUI Interface). The windows interface is the controller module. The windows interface allows the user to:

* Create/ Edit/ Delete an Auto Modeler Project.

* Load the Software Requirement Specifications into Auto Modeler.

* Edit the specifications using The Text Editor of Auto Modeler.

* Save the specifications into the repository.

* Perform automatic OOA on the specifications.

* Generate the UML Model.

* Export the UML Model to Rational Rose.

8.2.2. NLP System: The NLP System uses GATE (see Chapter 6) in the backend to perform NLP. The NLP system consists of three main phases: Lexical Pre-Processing, Parsing and Discourse Interpretation [2][3].

8.2.2.1. Lexical Pre-Processing: The lexical preprocessor consists of four sub-modules: a tokenizer, a sentence splitter, a POS tagger, and a morphological analyzer (see sections 6.1, 6.3, 6.4 and 6.5 for details on each). The input to the lexical preprocessor is the requirements document text file. The output is a set of charts, one per sentence, to be used by the parser. The processing steps are carried out in the following order:

* Tokenization: The tokenizer splits a plain text file into tokens. This includes, e.g., separating words and punctuation, identifying numbers, and so on. (See section 6.1.)

* Sentence Splitting: The sentence splitter identifies sentence boundaries. (See section 6.3.)

* Part-of-Speech (POS) Tagging: The POS tagger assigns a POS tag to each token in the input. (See section 6.4.)

* Morphological Analysis: After POS tagging, all nouns and verbs are passed to the morphological analyzer which returns the root and suffix of each word. (See section 6.5)

8.2.2.2. Parsing and Semantic Interpretation: The parser takes the output of the lexical preprocessor and, using the grammar rules, builds a syntactic tree and in parallel generates a semantic representation for every sentence in the text. The semantic representation is simply a predicate-argument structure (first-order logical terms). The morphological roots of the simple verbs and nouns are used as predicate names in the semantic representations. Tense and number features are translated directly into this notation where appropriate. All NPs and VPs introduce unique instance constants into the semantics which serve as identifiers for the objects or events referred to in the text. We have used GATE's SUPPLE parser for this purpose. (See section 6.6.)

The results of the parser will be stored in the System Repository for further processing.

8.2.3. OOA Module: The OOA module is responsible for the identification of basic UML elements, i.e. classes, attributes, operations and the associations between them, for UML Model generation. The techniques and heuristics described in chapter 7 are used to identify UML elements from the NL elements, i.e. the output of the parser. After analyzing the output of the SUPPLE parser, the OOA module generates a list of candidate classes and candidate relationships. It also generates a list of candidate attributes and candidate operations of the classes; each attribute and operation is associated with a particular class. The module then tries to associate the classes with one another through the identified relationships. Classes which have a low frequency, have no attributes or operations, or are not associated with any other class are removed.

8.2.4. Model Viewer: The Model Viewer takes the basic output generated by the OOA module, generates the UML model and shows it on the screen for the user. This module also allows the user to export the UML Model to Rational Rose.

8.2.5. Repository: The System Repository contains the knowledge regarding the NL requirements, NL Processing, Rules regarding OOA and UML Models. We use MS SQL Server in the backend for this purpose.

9. Case Study

In this section we illustrate Auto Modeler using a case study from the domain of library information systems [2][3].

The problem statement for this case study is as follows:

A library issues loan items to customers. Each customer is known as a member and is issued a membership card that shows a unique member number. Along with the membership number, other details on a customer must be kept such as a name, address, and date of birth. The library is made up of a number of subject sections. Each section is denoted by a classification mark. A loan item is uniquely identified by a bar code. There are two types of loan items, language tapes, and books. A language tape has a title language (e.g., French), and level (e.g., beginner). A book has a title, and author(s). A customer may borrow up to a maximum of 8 items. An item can be borrowed, reserved or renewed to extend a current loan. When an item is issued the customer's membership number is scanned via a bar code reader or entered manually. If the membership is still valid and the number of items on loan less than 8, the book bar code is read, either via the barcode reader or entered manually. If the item can be issued (e.g., not reserved) the item is stamped and then issued. The library must support the facility for an item to be searched and for a daily update of records.

9.1. Callan's class model

Callan's class diagram of the library system [2][3] shows 8 classes drawn as solid rectangles. These classes are linked to each other with associations represented by lines between the class boxes. Library has been modeled as an aggregate of a number of Sections; this is represented by the diamond at the Library end of the association. Each Section is uniquely identified by a class mark, represented by a small box showing the class mark attribute at the Library end of the association. Each Section is also associated with Loan Items. Two operations, search and update, are shown in the third compartment of the Library class icon. There is an issues association between the Library and Member Card classes. This association is qualified with a member code attribute, which means every Member Card has a unique member code. The class Customer is associated with the class Member Card to show that each customer has a card. Customer is also associated with the class Loan Item via a Borrows association, which is represented as an association class. Each Customer can borrow up to 8 items; this is shown by the multiplicity 0-8 at the Loan Item end. The class Loan Item has two subclasses, Language Tape and Book [2][3].

9.2. Auto Modeler analysis of the library system

The final model produced by Auto Modeler consists of 8 classes and 10 associations. Six of the 8 classes are exactly as in Callan's model (section 9.1): Book, Library, Customer, Loan Item, Membership Card, and Language Tape. One class, Subject Section, is the same as in Callan's model but with the extra prefix Subject added in our case. One class, Member, is not mentioned in Callan's model. One class, Bar Code Reader, was discarded because it was not joined to any other class by an association. If we compare this model with the model generated by CM-Builder [2][3], the model generated by Auto Modeler is closer to Callan's model.

10. Conclusions and Future Work

10.1. Conclusions

In this thesis we have described Auto Modeler, an ASE tool that takes Natural Language Software Requirements Specifications as input and tries to generate a UML Model from them. Auto Modeler uses GATE [8] to perform NLP on the input document and saves the result in the system repository. The OOA Module uses heuristics (chapter 7) to map NL elements to UML Model elements. At present Auto Modeler only identifies classes, attributes of classes, operations of classes and the relationships between them, and generates a static Class Diagram as output. The results are presented to the analyst so that he can further refine them. As the case study has shown, the output generated by Auto Modeler is closer to Callan's reference model than the model generated by CM-Builder. [2][3]

It is hoped that future versions of Auto Modeler will provide complete coverage of the UML model, thus enabling the fast-paced development of high-quality software.

10.1.1. Strengths of Auto Modeler. The strengths are:

1. Performs a fully automated OO analysis of the input text and produces a static Class Diagram as output, which can later be modified by the analyst.

2. Produces a list of classes, attributes of classes, operations of classes and the relationships between them. This list can, of course, be modified by the analyst.

3. Is domain-independent and can be used on requirements text in any domain.

10.1.2. Weaknesses of Auto Modeler: The weaknesses of Auto Modeler are:

1. The Linguistic analysis is limited.

2. The amount of generic knowledge useful for interpreting a range of software requirements texts is limited.

3. The coverage of the UML model is not complete [1], and dynamic aspects of systems are not extracted from requirements texts.

10.2. Future Work

The working of Auto Modeler needs to be improved and there is much room for future developments and enhancements:

* The NLP engine needs to be improved and enhanced, especially the parser whose grammar rules need to be improved.

* At present Auto modeler does not use a discourse interpretation module. A discourse interpretation module needs to be added which will enhance its OOA capabilities.

* The heuristics used to map NL elements to OOA elements need to be improved.

* At present Auto Modeler only produces the Class Diagram of the UML Model, dealing only with static aspects of the system to be built. Dynamic aspects of the system and other UML diagrams need to be addressed in order to make Auto Modeler's coverage of the UML Model complete.

Appendix-I

ASE Tools Survey Report

1. Executive Summary:

This report presents the results of a short survey, which was carried out in order to assess the use of, and the requirements for, ASE (Automated Software Engineering) tools in the Pakistani software industry. The survey was not an extensive one: in all, 7 top companies in the Software Technology Parks (I & II), Islamabad were surveyed. One company, LMKR, not only uses but also markets ASE tools on behalf of Mercury Interactive. If we are going to build a tool for the local software industry, then a follow-up survey may be required in order to gather requirements for the said tool.

2. Participating Companies:

I. Digital Processing Systems Inc.
II. Elixir Technologies
III. Interprise DB (SMC-Pvt) Ltd.
IV. Knowledge Platform
V. LMKR
VI. ProSol Technologies (Pvt) Ltd.
VII. Trivor Software

3. Results:

Companies already using ASE tools: 57%
Companies satisfied with them: 100%
Companies that want improvement in their tools: 50%
Companies that require ASE tools: 71%

ASE Tools already in use:

Sr. No. | ASE Tool | Percentage
1 | Tool for Automated Requirements Tracing | 50
2 | Tool for Automated Verification of Model | 50
3 | Tool for Automated Verification of Architecture | 50
4 | Tool for Automated Code Generation | 75
5 | Tool for Automated Software Testing | 100

ASE Tools Required:

Sr. No. | ASE Tool | Percentage
1 | Tool for Automated Requirements Tracing | 40
2 | Tool for Automated Verification of Model | 40
3 | Tool for Automated Verification of Architecture | 40
4 | Tool for Automated Model Generation from Specifications | 60
5 | Tool for Automated Architecture Generation from Model | 20
6 | Tool for Automated Code Generation | 20
7 | Tool for Automated Software Testing | 80

Type of Software Developed:

Sr. No. | Software Type | Percentage
1 | Desktop Applications | 100
2 | Web-based Applications | 86

Platforms and Programming/Scripting Languages used:

Sr. No. | Platform / Language | Percentage
1 | MS Windows | 86
2 | Linux | 43
3 | Java | 57
4 | C++ | 71
5 | VB .NET | 57
6 | C# | 71
7 | VC++ | 57
8 | VB Script | 43
9 | JavaScript | 86
10 | PHP | 57

4. Conclusions:

The above data indicate that the greatest demand is for a tool for automated software testing and for a tool that can generate a model from given specifications. Since commercial software testing tools are already in use, we can work on the latter. As stated earlier, whichever tool we decide to work on, a follow-up survey will be necessary in order to determine the requirements for the said tool.

Bibliography

[1] K. Li, R.G. Dewar and R.J. Pooley, “Object-Oriented Analysis Using Natural Language Processing”, Department of Computer Science, School of Mathematical and Computer Sciences, Heriot-Watt University.

[2] H.M. Harmain and R. Gaizauskas, “CM-Builder: An Automated NL-based CASE Tool”, in Proceedings of the 15th IEEE International Conference on Automated Software Engineering (ASE'2000), 2000, pp. 45-53.

[3] H.M. Harmain and R. Gaizauskas, “CM-Builder: A Natural Language-based CASE Tool”, Journal of Automated Software Engineering, 10, 2003, pp. 157-181.

[4] V. Ambriola and V. Gervasi, “Processing Natural Language Requirements”, in Proceedings of the 1997 International Conference on Automated Software Engineering, IEEE Press, November 1997, pp. 36-45.

[5] V. Ambriola and V. Gervasi, “On the parallel refinement of NL requirements and UML diagrams”, in Proceedings of the ETAPS 2001 Workshop on Transformations in UML, Genova, Italy, April 2001.

[6] M. Dorfman, “Requirements engineering”, in Software Requirements Engineering, 2nd edition, IEEE, 1997.

[7] P. Grünbacher and Y. Ledru, “Automated Software Engineering”, in ERCIM (European Research Consortium for Informatics and Mathematics) News, No. 58, July 2004, p. 12, www.ercim.org.

[8] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, C. Ursu, M. Dimitrov, M. Dowman, N. Aswani and I. Roberts, “Developing Language Processing Components with GATE Version 3 (a User Guide)”, http://gate.ac.uk.

[9] K. Li, R.G. Dewar and R.J. Pooley, “Requirements capture in natural language problem statements”, Department of Computer Science, School of Mathematical and Computer Sciences, Heriot-Watt University, http://www.macs.hw.ac.uk.

[10] J. Batali, “Notes on Artificial Intelligence Modeling”, Department of Cognitive Science, University of California, San Diego, http://cogsci.ucsd.edu/~batali/108b/lectures/natlang.txt.

[11] P. Doyle, “Natural Language”, http://www.cs.dartmouth.edu/%7Ebrd/Teaching/AI/Lectures/Summaries/natlang.html.