
Linguistic Automatic Generation Natural Language


1. Introduction

1.1. The Problem Statement

This thesis deals with the problem of automatically generating a UML Model from Natural Language Software Requirement Specifications. It describes the development of Auto Modeler, an Automated Software Engineering tool that takes Natural Language Software System Requirement Specifications as input, performs an automated OO analysis and attempts to produce a UML Model as output (in its present state a partial one, i.e. static Class diagrams only). The basis for Auto Modeler is described in [2][3].

1.2. Motivation

We conducted a short survey of the Software Industry in Islamabad in order to determine what sorts of Automated Software Engineering Tools were required by the Software houses. The result of the Survey (see Appendix-I for the survey report) indicated that there is demand for a tool such as Auto Modeler. The tools of this kind that have already been developed, i.e. [2][3], are either not available in the market or are very expensive, and thus out of the reach of most software houses. We therefore decided to build our own tool that the software industry can use to become more productive and competitive. At present Auto Modeler is not ready for commercial use, but it is hoped that future versions will be able to cater to the needs of the Software Houses.

1.3. Background

1.3.1. The need for Automated Software Engineering Tools: In this era of Information Technology great demands are placed on Software Systems and on all those who are involved in the SDLC. The developed software should not only be of high quality but should also be developed in a minimal amount of time. When it comes to Software quality, the software must be highly reliable, meet the customer's needs and satisfy the customer's expectations.

Automated Software Engineering Tools can assist Software Engineers and Software Developers in producing High Quality Software in a minimal amount of time.

1.3.2. Requirements Engineering: Requirements engineering consists of the following tasks [6]:

· Requirements Elicitation

· Requirements Analysis

· Requirements Specification

· Requirements Validation / Verification

· Requirements Management

Requirements engineering is recognized as a critical task, since many software failures originate from inconsistent, incomplete or simply incorrect System Requirements specifications.

1.3.3. Natural Language Requirement Specifications: Formal methods have been successfully used to express Requirements Specifications, but often the customer cannot understand them and therefore cannot validate them [4]. Natural Language is the only common medium understood by both the Customer and the Analyst [4]. So the System Requirements Specifications are often written in Natural Language.

1.3.4. Object Oriented Analysis: The System Analyst must manually process the Natural Language Requirements Specifications Document, perform an OO Analysis and produce the results in the form of a UML Model, which has become a standard in the Software Industry. The manual process is laborious, time consuming and prone to errors. Some specified requirements might be left out. If there are problems or errors in the original requirements specifications, they may not be discovered in the manual process.

OOA applies the OO paradigm to models of proposed systems by defining classes, objects and the relationships between them. Classes are the most important building block of an OO system and from these we instantiate objects. Once an individual object is created it inherits the same operations, relationships, semantics, and attributes identified in the class. Attributes of classes, and hence objects, hold values of properties. Operations, also called methods, describe what can be done to an object/class.[1]

A relationship between classes/objects can take various forms such as aggregation, composition, generalization and dependency. Attributes and operations represent the semantics of the class, while relationships represent the semantics of the model [1]. The KRB seven-step method, introduced by Kapur, Ravindra and Brown, proposes how to find classes and objects manually [1]. Its steps are:

1. Identify candidate classes (nouns in NL).
2. Define classes (look for instantiations of classes).
3. Establish associations (capturing verbs to create an association for each pair of classes in 1 and 2).
4. Expand many-to-many associations.
5. Identify class attributes.
6. Normalize attributes so that they are associated with the class of objects that they truly describe.
7. Identify class operations.

From this process we can see that one goal of OOA is to identify NL concepts that can be transformed into OO concepts, which can then be used to form system models in particular notations. Here we shall concentrate on UML [1].

1.3.5. Natural Language Processing (NLP): If an automatic analysis of the NL Requirements Document is carried out then it is not only possible to quickly find errors in the Specifications but with the right methods we can quickly generate a UML model from the Requirements.

Natural language is inherently ambiguous, imprecise and incomplete. A natural language document is often redundant, and several classes of terminological problems (e.g., jargon or specialist terms) can arise to make communication difficult [2]. Natural Language processing with holistic objectives has also proven to be a very complex task. Nevertheless, it is possible to extract sufficient meaning from NL sentences to produce reliable models. Complexities of language range from simple synonyms and antonyms to such complex issues as idioms, anaphoric relations or metaphors. Efforts in this particular area have had some success in generating static object models from some complex NL requirement sentences.

Linguistic analysis: Linguistic analysis studies NL text at different linguistic levels, i.e. words, sentences and meaning.[1]

(i) Word-tagging analyses how a word is used in a sentence. In particular, the category of a word can change from one sentence to another depending on context (e.g. light can be used as noun, verb, adjective and adverb; and while can be used as preposition, conjunction, verb and noun). Tagging techniques are used to specify the word-form of each single word in a sentence, and each word is tagged with a Part Of Speech (POS), e.g. a NN1 tag would denote a singular noun, while VBB would signify the base form of a verb.[1]

(ii) Syntactic analysis applies phrase marker, or labeled bracketing, techniques to segment NL into phrases, clauses and sentences, so that the NL is delineated by syntactical/grammatical annotations. Hence we can show how words are grouped and connected to each other in a sentence.[1]

(iii) Semantic analysis is the study of the meaning. It uses discourse annotation techniques to analyze open-class or content words and closed-class words (i.e. prepositions, conjunctions, pronouns). The POS tags and syntactic elements mentioned previously can be linked in the NL text to create relationships.

Applying these linguistic analysis techniques, NLP tools can carry out morphological processing, syntactic processing and semantic processing. The processing of NL text can be supported by Semantic Network (SN) and corpora that provide a knowledge base for text analysis.

The difficulty of OOA is not just due to the ambiguity and complexity of NL itself, but also the gap in meaning between the NL concepts and OO concepts.[1]

1.3.6. From NLP to UML Model Creation. After NLP the sentences are simplified in order to make the identification of UML model elements from NL elements easy. Simple heuristics are used to identify UML Model elements from Natural Language text (see Chapter 7):

* Nouns indicate a class

* Verb indicates an operation

* Possessive relationships and Verbs like to have, identify, denote indicate attributes

* Determiners are used to identify the multiplicity of roles in associations.
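
The heuristics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Auto Modeler's implementation: the input is assumed to be one simplified sentence that has already been POS-tagged with the simplified tags DET, NOUN and VERB, and the function name `extract_model_elements`, the attribute-verb set and the determiner-to-multiplicity table are all hypothetical.

```python
# Hypothetical sketch of the noun/verb/determiner heuristics.
# The tag names and both lookup tables are illustrative assumptions.
ATTRIBUTE_VERBS = {"have", "has", "identify", "identifies", "denote", "denotes"}
MULTIPLICITY = {"a": "1", "an": "1", "the": "1", "each": "1",
                "some": "1..*", "many": "1..*", "all": "*"}


def extract_model_elements(tagged):
    """Map (word, tag) pairs of one simplified sentence to candidate UML elements."""
    classes, operations, attributes, multiplicities = [], [], [], {}
    subject = None           # first noun seen; owns operations and attributes
    last_det = None          # determiner waiting for its noun
    expect_attribute = False # set after a verb such as "has"

    for word, tag in tagged:
        w = word.lower()
        if tag == "DET":
            last_det = w
        elif tag == "NOUN":
            if expect_attribute and subject is not None:
                # Noun after an attribute verb: attribute of the subject class.
                attributes.append((subject, w))
                expect_attribute = False
            else:
                # Any other noun is a candidate class.
                if w not in classes:
                    classes.append(w)
                if last_det in MULTIPLICITY:
                    multiplicities[w] = MULTIPLICITY[last_det]
                if subject is None:
                    subject = w
            last_det = None
        elif tag == "VERB":
            if w in ATTRIBUTE_VERBS:
                expect_attribute = True
            elif subject is not None:
                # Any other verb is a candidate operation of the subject class.
                operations.append((subject, w))

    return {"classes": classes, "operations": operations,
            "attributes": attributes, "multiplicities": multiplicities}
```

For the tagged sentence "Each customer has a name" the sketch proposes the class customer with the attribute name; for "A librarian issues many books" it proposes the classes librarian and books, the operation issues on librarian, and a "1..*" multiplicity for books.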

1.5. Plan of the thesis

In Chapter 2 we present a brief survey of previous and related work. Chapters 3, 4, 5, 6 and 7 describe the theoretical basis for Auto Modeler. Chapter 8 describes the Architecture of Auto Modeler. In Chapter 9 we describe Auto Modeler in action with a case study. In Chapter 10 we present conclusions.

2. Literature Survey

The first relevant published technique attempting to provide a systematic procedure for producing design models from NL requirements was that of Abbott. Abbott (1983) proposes a linguistic based method for analyzing software requirements, expressed in English, to derive basic data types and operations. [1]

This approach was further developed by Booch (1986). Booch describes an Object-Oriented Design method where nouns in the problem description suggest objects and classes of objects, and verbs suggest operations.[1]

Saeki et al. (1987) describe a process of incrementally constructing software modules from object-oriented specifications obtained from informal natural language requirements. Their system analyses the informal requirements one sentence at a time. Nouns and verbs are automatically extracted from the informal requirements but the system cannot determine which words are relevant for the construction of the formal specification. Hence an important role is played by the human analyst who reviews and refines the system results manually after each sentence is processed.[1]

Dunn and Orlowska (1990) describe a natural language interpreter for the construction of NIAM (Nijssen's, or Natural-language, Information Analysis Method) conceptual schemas. The construction of conceptual schemas involves allocating surface objects to entity types (semantic classes) and the identification of elementary fact types. The system accepts declarative sentences only and uses grammar rules and a dictionary for type allocation and the identification of elementary fact types.[1]

Meziane (1994) implemented a system for the identification of VDM data types and simple operations from natural language software requirements. The system first generates an Entity-Relationship Model (ERM) from the input text and then generates VDM data types from the ERM.[1]

Mich and Garigliano (1994) and Mich (1996) describe an NL-based prototype system, NL-OOPS, that is aimed at the generation of object-oriented analysis models from natural language specifications. This system demonstrated how a large scale NLP system called LOLITA can be used to support the OO analysis stage.[1]

V. Ambriola and V. Gervasi [4] have developed CIRCE, an environment for the analysis of natural language requirements. It is based on the concept of successive transformations that are applied to the requirements, in order to obtain concrete (i.e., rendered) views of models extracted from the requirements. CIRCE uses CICO, a domain-based, fuzzy-matching parser, which parses the requirements document and converts it into an abstract parse tree. This parse tree is encoded as tuples and stored in a shared repository by CICO. A group of related tuples constitutes a T-Model. CIRCE uses internal tools to refine the encoded tuples (called extensional knowledge) and the knowledge about the basic behavior of software systems derived from modelers (called intensional knowledge) in order to further enrich the tuple space. When a specific concrete view on the requirements is desired, a projector is called to build an abstract view of the data from the tuple space. A translator then converts the abstract view to a concrete view. In [5] V. Ambriola and V. Gervasi describe their experience of automatic synthesis of UML diagrams from Natural Language Requirement Specifications using their CIRCE environment.

Delisle et al., in their project DIPETT-HAIKU, capture candidate objects, linguistically differentiating between Subjects (S) and Objects (O), and processes, Verbs (V), using the syntactic S-V-O sentence structure. This work also suggests that candidate attributes can be found in the noun modifier in compound nouns, e.g. reserved is the value of an attribute of “reserved book”.[1]

Harmain and Gaizauskas developed an NLP-based CASE tool, CM-Builder [2][3], which automatically constructs an initial class model from NL text. It captures candidate classes, rather than candidate objects.

Börstler constructs an object model automatically based on pre-specified key words in a use case description. The verbs in the key words are transformed to behaviors and nouns are transformed to objects.[1]

Overmyer and Rambow developed an NLP system to construct UML class diagrams from NL descriptions. Both these efforts require user interaction to identify OO concepts.[1]

The prototype tool developed by Perez-Gonzalez and Kalita supports automatic OO modeling from NL problem descriptions into UML notations, and produces both static and dynamic views. The underlying methodology includes theta roles and semi-natural language.[1]

3. Software Requirements Engineering

Software requirements engineering is the science and discipline concerned with establishing and documenting software requirements [6]. It consists of:

* Software requirements elicitation:- The process through which the customers (buyers and/or users) and the developer (contractor) of a software system discover, review, articulate, and understand the users' needs and the constraints on the software and the development activity.

* Software requirements analysis:- The process of analyzing the customers' and users' needs to arrive at a definition of software requirements.

* Software requirements specification:- The development of a document that clearly and precisely records each of the requirements of the software system.

* Software requirements verification:- The process of ensuring that the software requirements specification is in compliance with the system requirements, conforms to document standards of the requirements phase, and is an adequate basis for the architectural (preliminary) design phase.

* Software requirements management:- The planning and controlling of the requirements elicitation, specification, analysis, and verification activities.

In turn, system requirements engineering is the science and discipline concerned with analyzing and documenting system requirements. It involves transforming an operational need into a system description, system performance parameters, and a system configuration.

This is accomplished through the use of an iterative process of analysis, design, trade-off studies, and prototyping.

Software requirements engineering has a similar definition as the science and discipline concerned with analyzing and documenting software requirements. It involves partitioning system requirements into major subsystems and tasks, then allocating those subsystems or tasks to software. It also transforms allocated system requirements into a description of software requirements and performance parameters through the use of an iterative process of analysis, design, trade-off studies, and prototyping.

A system can be considered a collection of hardware, software, data, people, facilities, and procedures organized to accomplish some common objectives. In software engineering, a system is a set of software programs that provide the cohesiveness and control of data that enables the system to solve the problem.[6]

The major difference between system requirements engineering and software requirements engineering is that the origin of system requirements lies in user needs while the origin of software requirements lies in the system requirements and/or specifications. Therefore, the system requirements engineer works with users and customers, eliciting their needs, schedules, and available resources, and must produce documents understandable by them as well as by management, software requirements engineers, and other system requirements engineers.

The software requirements engineer works with the system requirements documents and engineers, translating system documentation into software requirements which must be understandable by management and software designers as well as by software and system requirements engineers. Accurate and timely communication must be ensured all along this chain if the software designers are to begin with a valid set of requirements. [6]

4. Automated Software Engineering Tools

Software engineering is concerned with the analysis, design, implementation, testing, and maintenance of large software systems. Automated software engineering focuses on how to automate or partially automate these tasks to achieve significant improvements in quality and productivity.

Automated software engineering applies computation to software engineering activities. The goal is to partially or fully automate these activities, thereby significantly increasing both quality and productivity. This includes the study of techniques for constructing, understanding, adapting and modeling both software artifacts and processes. Automatic and collaborative systems are both important areas of automated software engineering, as are computational models of human software engineering activities. Knowledge representations and artificial intelligence techniques applicable in this field are of particular interest, as are formal techniques that support or provide theoretical foundations.[7]

Automated software engineering approaches have been applied in many areas of software engineering. These include requirements definition, specification, architecture, design and synthesis, implementation, modeling, testing and quality assurance, verification and validation, maintenance and evolution, configuration management, deployment, reengineering, reuse and visualization. Automated software engineering techniques have also been used in a wide range of domains and application areas including industrial software, embedded and real-time systems, aerospace, automotive and medical systems, Web-based systems and computer games.[7]

Research into Automated Software Engineering includes the following areas:

* Automated reasoning techniques

* Component-based systems

* Computer-supported cooperative work

* Configuration management

* Domain modeling and meta-modeling

* Human-computer interaction

* Knowledge acquisition and management

* Maintenance and evolution

* Model-based software development

* Modeling language semantics

* Ontologies and methodologies

* Open systems development

* Product line architectures

* Program understanding

* Program synthesis

* Program transformation

* Re-engineering

* Requirements engineering

* Specification languages

* Software architecture and design

* Software visualization

* Testing, verification, and validation

* Tutoring, help, and documentation systems

5. Natural Language Processing

Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

5.1. Language Processing

Language processing can be divided into two tasks:[11]

* Processing written text, using lexical, syntactic, and semantic knowledge of the language as well as any required real world information.[11]

* Processing spoken language, using all the information needed above, plus additional knowledge about phonology as well as enough additional information to handle the further ambiguities that arise in speech.[11]

5.2. Uses for NLP:

5.2.1. User interfaces. Natural language can be better than obscure command languages: ideally you could just tell the computer what you want it to do. Here we are talking about a textual interface, not speech.[10]

5.2.2. Knowledge Acquisition. Programs that could read books, manuals or the newspaper, so that you do not have to explicitly encode all of the knowledge they need to solve problems or do whatever they do.[10]

5.2.3. Information Retrieval. Find articles about a given topic. The program has to be able to determine whether an article matches a given query.[10]

5.2.4. Translation. Machines that automatically translate from one language to another. This was one of the first tasks to which computers were applied, and it is very hard.[10]

5.3. Linguistic levels of Analysis

Language obeys regularities and exhibits useful properties at a number of somewhat separable "levels".[10]

Think of language as transfer of information. It is much more than that. But that is a good place to start.

Suppose that the speaker has some meaning that they wish to convey to some hearer.[10]

Speech (or gesture) imposes a linearity on the signal. All you can play with is the properties of a sequence of tokens. Actually, why tokens? Well for one thing that makes it possible to learn.[10]

So the other thing to play with is the order the tokens can occur.

So somehow, a meaning gets encoded as a sequence of tokens, each of which has some set of distinguishable properties, and is then interpreted by figuring out what meaning corresponds to those tokens in that order.[10]

Another way to think about it is that the properties of the tokens and their sequence somehow "elicits" an understanding of the meaning. Language is a set of resources to enable us to share meanings, but isn't best thought of as a means for *encoding* meanings. This is a sort of philosophical issue perhaps, but if this point of view is true, it makes much of the AI approach to NLP somewhat suspect, as it is really based on the "encoded meanings" view of language.[10]

The lowest level is the actual properties of the signal stream:

phonology -- speech sounds and how we make them

morphology -- the structure of words

syntax -- how the sequences are structured

semantics -- meanings of the strings

There are important interfaces among all of these levels. For example sometimes the meaning of sentences can determine how individual words are pronounced.[10]

These levels are all clearly needed. But language turns out to be more clever than this. For example, language can be more efficient by not having to say the same thing twice, so we have pronouns and other ways of making use of what has already been said:

A bear went into the woods. It found a tree.

Also, since language is most often used among people who are in the same situation, it can make use of features of the situation.

The study of how features of the context are used, whether it is the context created by a sequence of sentences or the actual context where the speaking happens, is called "pragmatics".[10]

Another issue has to do with the fact that the simple model of language as information transfer is clearly not right. For one thing, we know there are at least the following three types of sentences: declaratives (statements), imperatives (commands) and interrogatives (questions).

And each of them can be used to do a different kind of thing. The first *might* be called information transfer. But what about imperatives? What about questions? To some degree the analysis of such sentences can build on a basic notion of meaning together with the idea of speech acts.[10]

There are other, higher-levels of structuring that language exhibits. For example there is conversational structure, where people know when they get to talk in a conversation, and what constitutes a valid contribution. There is "narrative structure" whereby stories are put together in ways that make sense and are interesting. There is "expository structure" which involves the way that informative texts (like encyclopedias) are arranged so as to usefully convey information. These issues blend off from linguistics into literature and library science, among other things.[10]

Of course with hypertext and multi-media and virtual reality, these higher levels of structure are being explored in new ways.[10]

5.4. Steps in Natural Language Understanding

The steps in the process of natural language understanding are:[11]

5.4.1. Morphological analysis

Individual words are analyzed into their components, and non-word tokens (such as punctuation) are separated from the words. For example, in the phrase "Bill's house" the proper noun "Bill" is separated from the possessive suffix "'s."[11]
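
As a rough illustration of this step, the following Python sketch separates the possessive suffix and punctuation from words using a regular expression. It is a toy tokenizer, not a full morphological analyzer: the function name `morph_tokenize` is hypothetical, and only the straight apostrophe form of "'s" is handled.

```python
import re

# Order matters: try the possessive suffix first, then whole words,
# then any single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")


def morph_tokenize(text):
    """Split text into words, possessive suffixes and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(morph_tokenize("Bill's house."))
```

Applied to "Bill's house." this yields the four tokens Bill, 's, house and the final full stop, which mirrors the separation described above.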

5.4.2. Syntactic analysis. Linear sequences of words are transformed into structures that show how the words relate to one another. This parsing step converts the flat list of words of the sentence into a structure that defines the units represented by that list. Constraints imposed include word order ("manager the key" is an illegal constituent in the sentence "I gave the manager the key"); number agreement; case agreement.[11]

5.4.3. Semantic analysis. The structures created by the syntactic analyzer are assigned meanings. In most universes, the sentence "Colorless green ideas sleep furiously" [Chomsky, 1957] would be rejected as semantically anomalous. This step must map individual words into appropriate objects in the knowledge base, and must create the correct structures to correspond to the way the meanings of the individual words combine with each other. [11]

5.4.4. Discourse integration. The meaning of an individual sentence may depend on the sentences that precede it and may influence the sentences yet to come. The entities involved in the sentence must either have been introduced explicitly or they must be related to entities that were. The overall discourse must be coherent. [11]

5.4.5. Pragmatic analysis. The structure representing what was said is reinterpreted to determine what was actually meant. [11]

5.5. Syntactic Processing

Syntactic parsing determines the structure of the sentence being analyzed. Syntactic analysis involves parsing the sentence to extract whatever information the word order contains. Syntactic parsing is computationally less expensive than semantic processing.[10]

A grammar is a declarative representation that defines the syntactic facts of a language. The most common way to represent grammars is as a set of production rules, and the simplest structure for them to build is a parse tree which records the rules and how they are matched. [10]
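
To make the idea of production rules and parse trees concrete, here is a minimal Python sketch of a top-down parser with backtracking. The tiny grammar, the lexicon and the function names are all assumptions chosen for illustration; the approach only works for grammars without left recursion.

```python
# Toy grammar as production rules: each non-terminal maps to its alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}
# Toy lexicon: word -> pre-terminal category.
LEXICON = {"the": "Det", "dog": "N", "bear": "N", "chased": "V",
           "john": "Name", "mary": "Name", "hit": "V"}


def parse(symbol, words, i):
    """Yield (tree, next_index) for every way symbol can start at words[i]."""
    if symbol not in GRAMMAR:
        # Pre-terminal: it must match the category of the current word.
        if i < len(words) and LEXICON.get(words[i]) == symbol:
            yield (symbol, words[i]), i + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(rhs, words, i, (symbol,))


def expand(rhs, words, i, partial):
    """Match the remaining right-hand-side symbols, extending the partial tree."""
    if not rhs:
        yield partial, i
        return
    for subtree, j in parse(rhs[0], words, i):
        yield from expand(rhs[1:], words, j, partial + (subtree,))


def parse_sentence(sentence):
    """Return the first parse tree covering the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None
```

The tree is a nested tuple that records which rule matched at each level. With this grammar, "John hit Mary" and "Mary hit John" produce different trees because the names occupy different positions, while a word sequence such as "dog the chased" has no parse and returns None.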

Sometimes backtracking is required (e.g., The horse raced past the barn fell), and sometimes multiple interpretations may exist for the beginning of a sentence (e.g., Have the students who missed the exam -- ). [10]

Example: Syntactic processing interprets the difference between "John hit Mary" and "Mary hit John."

5.6. Semantic Analysis

After (or sometimes in conjunction with) syntactic processing, we must still produce a representation of the meaning of a sentence, based upon the meanings of the words in it. The following steps are usually taken to do this: [10]

5.6.1. Lexical processing. Look up the individual words in a dictionary. It may not be possible to choose a single correct meaning, since there may be more than one. The process of determining the correct meaning of individual words is called word sense disambiguation or lexical disambiguation. For example, "I'll meet you at the diamond" can be understood since at requires either a time or a location. This usually leads to preference semantics when it is not clear which definition we should prefer. [10]
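
The "diamond" example can be sketched as a filter over dictionary senses. In the Python sketch below, the sense inventory, the semantic type labels and the function name `disambiguate` are hypothetical; the only idea taken from the text is that "at" requires a time or a location.

```python
# Hypothetical sense dictionary: word -> list of (gloss, semantic type).
SENSES = {
    "diamond": [("gemstone", "physical_object"),
                ("baseball field", "location"),
                ("rhombus shape", "shape")],
    "noon": [("midday", "time")],
}
# Semantic types that each preposition accepts for its object (assumed).
PREPOSITION_REQUIRES = {"at": {"time", "location"}}


def disambiguate(preposition, noun):
    """Keep only the senses of noun that the preposition can accept."""
    candidates = SENSES.get(noun, [])
    allowed = PREPOSITION_REQUIRES.get(preposition)
    if allowed is None:
        # No known constraint: every sense remains possible.
        return candidates
    return [(gloss, kind) for gloss, kind in candidates if kind in allowed]


print(disambiguate("at", "diamond"))
```

Here only the "baseball field" sense survives. When more than one sense survives such a filter, a further preference ordering is needed, which is where preference semantics comes in.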

5.6.2. Sentence-level processing. There are several approaches to sentence-level processing. These include semantic grammars, case grammars, and conceptual dependencies. [10]

Example: Semantic processing determines the differences between such sentences as "The pig is in the pen" and "The ink is in the pen."

5.6.3. Discourse and Pragmatic Processing. To understand most sentences, it is necessary to know the discourse and pragmatic context in which they were uttered. In general, for a program to participate intelligently in a dialog, it must be able to represent its own beliefs about the world, as well as the beliefs of others (and their beliefs about its beliefs, and so on).[10]

The context of goals and plans can be used to aid understanding. Plan recognition has served as the basis for many understanding programs -- PAM is an early example. [10]

5.7. Issues in Syntax

For various reasons, a lot of attention in computational linguistics has been paid to syntax. Partly this has to do with the fact that linguists have spent a lot of work on it. Partly it is because syntax needs to be done before just about anything else can be done. I won't talk much about morphology. We will assume that words can be associated with a set of features or properties. For example the word "dog" is a noun, it is singular, and its meaning involves a kind of animal. The word "dogs" is related, obviously, but has the property of being plural. The word "eat" is a verb, it is in what we might call the "base" form, and it denotes a particular kind of action. The word "ate" is related, and is in the "past tense" form. You can imagine, I'm sure, that the techniques of knowledge representation that we have looked at can be applied to the problem of representing facts about the properties and relations among words. [11]
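
One simple way to represent such word properties is a lexicon of feature dictionaries. The Python sketch below is a minimal illustration; the feature names (`lemma`, `category`, `number`, `form`) and the helper `related` are assumptions, not part of any particular NLP toolkit.

```python
# Each word form maps to a small set of features (names are illustrative).
LEXICON = {
    "dog":  {"lemma": "dog", "category": "noun", "number": "singular"},
    "dogs": {"lemma": "dog", "category": "noun", "number": "plural"},
    "eat":  {"lemma": "eat", "category": "verb", "form": "base"},
    "ate":  {"lemma": "eat", "category": "verb", "form": "past"},
}


def related(word_a, word_b):
    """True when both words are in the lexicon and share the same lemma."""
    a, b = LEXICON.get(word_a), LEXICON.get(word_b)
    return a is not None and b is not None and a["lemma"] == b["lemma"]


print(related("dog", "dogs"), related("eat", "ate"), related("dog", "ate"))
```

The shared lemma captures the sense in which "dogs" is related to "dog" and "ate" to "eat", while the remaining features record how the forms differ.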

The key observation in the theory of syntax is that the words in a sentence can be more or less naturally grouped into what are called "phrases", and those phrases can often be treated as a unit.

So in a sentence "The dog chased the bear," the sequence "the dog" forms a natural unit. The sequence "chased the bear" is a natural unit, as is "the bear".[11]

Why do I say that "the dog" is a natural unit? Well one thing is that I can replace it by another sequence that has the same referent, or a related referent. For example I could replace it by: [11]

Snoopy (a name)

It (a pronoun)

My brother's favorite pet (a more complex description)

What about "chased the bear"? Again, I could replace it by

died (a single word)

was hit by a truck (a more complex event)

This basic structure, in English, is sometimes called the "subject-predicate" structure. The subject is a nominal, something that can refer to an object or thing, the predicate is a "verb phrase", which describes an action or event. Of course, as in the example, the verb phrase can also contain other constituents, for example another nominal. [11]

These phrases also have structure. For example a noun phrase (a kind of nominal) can have a determiner, zero or more adjectives, and a noun, maybe followed by another phrase, like:

the big dog that ate my homework
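
The core of this noun phrase pattern (a determiner, zero or more adjectives, then a noun) can be found by scanning a tagged sentence. The Python sketch below assumes the words are already tagged with the simplified tags DET, ADJ and NOUN; the function name `find_noun_phrases` is hypothetical, and the sketch deliberately does not attach a following phrase such as a relative clause.

```python
def find_noun_phrases(tagged):
    """Return the word sequences matching DET ADJ* NOUN in a tagged sentence."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        if tagged[i][1] == "DET":
            j = i + 1
            # Skip over zero or more adjectives.
            while j < n and tagged[j][1] == "ADJ":
                j += 1
            # The pattern only matches if a noun follows.
            if j < n and tagged[j][1] == "NOUN":
                phrases.append(" ".join(word for word, _ in tagged[i:j + 1]))
                i = j + 1
                continue
        i += 1
    return phrases


sentence = [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("that", "PRON"),
            ("ate", "VERB"), ("my", "DET"), ("homework", "NOUN")]
print(find_noun_phrases(sentence))
```

For the example above the scan finds "the big dog" and "my homework"; treating "that ate my homework" as part of the first noun phrase would need a real grammar rather than a flat pattern.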

Verb phrases can have complicated "verb groups" like

will not be eaten

Syntactic theories try to predict and explain what patterns are used in a language. Sometimes this involves figuring out what patterns just don't work. For example the following sentences have something wrong with them: [11]

* the dogs runs home

* he died the book

* she saw himself in the mirror

* they told it to she

Figuring out exactly what is wrong with such sentences allows linguists to create theories that help understand the way that sentences get structured.
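
The first starred sentence fails on number agreement, which is easy to check once words carry features. The Python sketch below is a minimal illustration with a hypothetical two-table lexicon; the verb table records which subject number each present-tense form requires.

```python
# Number feature of each noun form (illustrative entries only).
NOUN_NUMBER = {"dog": "singular", "dogs": "plural"}
# Subject number required by each present-tense verb form.
VERB_REQUIRES = {"runs": "singular", "run": "plural"}


def agrees(subject_noun, verb):
    """True when the subject's number matches what the verb form requires."""
    return NOUN_NUMBER[subject_noun] == VERB_REQUIRES[verb]


print(agrees("dogs", "runs"))  # the starred sentence "the dogs runs home"
print(agrees("dogs", "run"))
```

The first call reports a mismatch and the second a match. The other starred sentences violate different constraints (verb complements, reflexive binding and case) and would need additional features to detect.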

The general idea, in English, is that a sentence consists, as I said, of a subject and a predicate. A predicate is a verb followed by zero or more nominal or prepositional phrases. Verbs often require a certain number of either nominal or prepositional phrases; these are called "complements" [11]. For example:

it died (no complements, "intransitive")

the horse kicked the farmer (one complement "transitive")

I gave her the book (two complements)

I gave the book to her (one complement is a prepositional phrase)

The sentences above are wrong for reasons that can be stated clearly. But another class of constraints was discovered in the early 60s. They generally involve sentences in which a component is moved out of its ordinary position, for example to make a question or relative clause. [11]


I like flowers.

Can be transformed into:

What do I like?


He gave the fish to Ned.

Can be transformed to:

Who did he give the fish to?

(Some people say this is ungrammatical. They are wrong. But even the "grammatical" version "to whom did he give the fish?" illustrates the point I am making.)

The general rule seems to be that you can take any nominal, replace it with a question word, and move it to the front of the sentence. [11]

But consider the following sentences:

A She likes ice cream and olives.

A' * What does she like ice cream and?

B I know a Democrat who hates Clinton.

B' * Who do you know a Democrat who hates?

Now these sentences are interesting because it is not exactly clear what sort of rule is being broken, you never see such sentences in language textbooks as the sort of thing to avoid, and children never produce them - and in fact children often make the sorts of errors mentioned previously. [11]

Other information may also be added to a sentence which is not required by the verb but which says more about what is going on; such additions are called "adjuncts".

it died yesterday (gives time)

it died in the garage (gives location)

it died because nobody fed it (gives reason)

Note that in the last example a "sentence" is part of another sentence. This can happen in various ways. For example some verbs take sentence-like units as complements: [11]

he thought I liked him

Or, as above, they can be used as adjuncts. Rather than call these sentences, they are sometimes called "clauses" -- a clause is a verb with some other arguments, usually its complements, sometimes (not always) a subject.

"Phrase structure trees" are often used to represent the configuration of sentences. These can show how the structural elements are related, and the relations among nodes in the tree can be used to describe constraints that have to hold. [11]

One approach to characterizing syntactic structure involves giving rules to describe how phrases can be generated. For example here are some such rules:

S -> NP VP

NP -> Det {Adj} Noun

VP -> Verb {NP} {PP}

PP -> Prep NP

A category in braces {} is optional.

Assuming that we have a "lexicon" of words, with their categories represented, these rules could be used to generate some syntactic structures that sentences may exhibit. [11]
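To make the idea of generating structures from rules concrete, here is a minimal Python sketch. The rules are the four given above; the lexicon is a hypothetical toy one invented for illustration, and the depth limit is an assumption added only to keep the enumeration finite.

```python
from itertools import product

# Rules from the text; a category in braces is optional.
RULES = {
    "S": ["NP", "VP"],
    "NP": ["Det", "{Adj}", "Noun"],
    "VP": ["Verb", "{NP}", "{PP}"],
    "PP": ["Prep", "NP"],
}

# Hypothetical toy lexicon (not from the source).
LEXICON = {
    "Det": ["the"],
    "Adj": ["big"],
    "Noun": ["dog", "bear"],
    "Verb": ["chased"],
    "Prep": ["near"],
}

def expansions(rhs):
    """Yield every concrete right-hand side, keeping or dropping each optional item."""
    choices = [[[], [sym[1:-1]]] if sym.startswith("{") else [[sym]] for sym in rhs]
    for combo in product(*choices):
        yield [sym for part in combo for sym in part]

def generate(symbol, depth):
    """Yield word lists derivable from `symbol` with at most `depth` nested rule applications."""
    if symbol in LEXICON:
        for word in LEXICON[symbol]:
            yield [word]
        return
    if depth == 0:
        return
    for rhs in expansions(RULES[symbol]):
        parts = [list(generate(sym, depth - 1)) for sym in rhs]
        for combo in product(*parts):
            yield [word for words in combo for word in words]

sentences = {" ".join(words) for words in generate("S", 3)}
```

With this lexicon and depth limit, `sentences` contains strings such as "the dog chased the bear" and "the big dog chased". Raising the depth limit lets prepositional phrases appear as well.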

Suppose we add this rule:

NP -> Det {Adj} Noun {PP}

For example "the man on the dock". This gives rise to the possibility that two sentences with the same sequence of words could be grouped differently.

I saw the man with a telescope.

These different configurations can be associated with different meanings. This is called "syntactic ambiguity." Ambiguity is when a word or sentence can be taken as having more than one distinct meaning. For example some words have more than one meaning:

I went to the bank.

Different meanings of words can cause sentences to be understood in very different ways:

I saw her duck.

Flying planes can be dangerous.
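The two groupings of "I saw the man with a telescope" can be written out explicitly. The sketch below represents each tree as nested Python tuples; the "Pron" category for "I" is an assumption, since the rules above have no pronoun rule.

```python
# Each tree is (category, child, child, ...); a child is a word or another tree.
PP = ("PP", ("Prep", "with"), ("NP", ("Det", "a"), ("Noun", "telescope")))

# Reading 1: the PP attaches to the verb phrase (the seeing was done with a telescope).
attach_to_verb = (
    "S",
    ("NP", ("Pron", "I")),
    ("VP", ("Verb", "saw"), ("NP", ("Det", "the"), ("Noun", "man")), PP),
)

# Reading 2: the PP attaches to the noun phrase (the man has a telescope).
attach_to_noun = (
    "S",
    ("NP", ("Pron", "I")),
    ("VP", ("Verb", "saw"), ("NP", ("Det", "the"), ("Noun", "man"), PP)),
)

def leaves(tree):
    """Return the words of a tree from left to right."""
    words = []
    for child in tree[1:]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words
```

Both trees have the same leaves, so the word sequence alone cannot tell a reader (or a parser) which structure was intended.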

The sorts of rules that I have described are called "context-free" because the rewrite operation that they describe doesn't depend on any context in which the left-hand symbol occurs. But this can't capture some fairly simple regularities: [11]

Agreement: *She saw himself.

Complements: *He put the block.

Case: *They saw she.

To solve this, rules need to specify more than just what tree configurations can occur, but must somehow indicate constraints that hold among the elements in the tree.

Another issue is that some sentences seem pretty directly related to others. For example consider the following pairs: [11]

he ate the fish / the fish was eaten by him

she read the book / what did she read?

the dog is at the corner / the dog at the corner barked

There is a sense in which the second sentence or phrase is a "transformed" version of the first. This observation led to a powerful theory of syntactic structure called "transformational grammar" in which a language began with some simple context-free rules and some local constraints to create a set of basic sentences, which could then be transformed in various ways. [11]

It turned out however that this didn't really work, so lately linguists are looking at a more abstract theory. The basic idea is that there is a general theory of phrase structure:

X -- lexical category (noun, preposition, verb)

X' -- "modified" lexical category (with complements)

X'' -- "specified" lexical category.

Constraints can be specified among phrases built up this way. And restrictions on movement can be stated.

The hypothesis goes even deeper than this, in that some linguists believe that this representation system is somehow innate, that it underlies all human linguistic knowledge. The evidence for this claim is the fact that all languages can be described using this terminology (more or less) and that it doesn't have to be this way. There is also evidence having to do with the fact that there are often relations between ordering rules in languages that seem to hold for all phrases, rather than for just one type of phrase. [11]

For example there are languages in which the complements of a verb go after the verb (like English). In many of these languages, modifiers to nouns and complements to prepositions go after the modified element (like English for prepositions, but not for nouns; French is a good example of this). Obviously this doesn't always work, but it works often enough that some researchers think that there might be something there. Others think this whole notion is totally bogus (for example most people at UCSD).

5.8. Issues in Parsing

Given all of the attention paid to syntax, it is not surprising that a lot of work has been done on getting computers to come up with a characterization of the syntactic structures of sentences. Obviously, the way that this will work depends on the specific syntactic theory you believe in, but in general a parsing program is a search through the space of possible structural characterizations of the sentence, constrained by the fact that the structural characterization must be compatible with the given sequence of words. [11]

Most of the research on automatic parsing has involved context-free grammars. Sometimes the basic ideas from context-free parsing are then augmented to make the parser able to handle non-context-free constraints. [11]

The general idea of parsing with a set of context-free rules is to start generating possible tree structures, until a rule generates a lexical category. This is then checked against the next word in the sentence. If it is of the appropriate category, the parse continues. If not, the parser must explore another node in the search space. [11]

For example:

S -> NP VP

NP -> Det Noun

VP -> Verb {NP} {PP}

PP -> Prep NP

Suppose we are parsing:

The dog barked in the yard.

We assume we have a sentence, so we start with the tree:

S

We expand it using the rule

S -> NP VP

Working from left to right, we expand the NP node:

Det Noun

Now "Det" is a lexical category, so we look at the first word of the sentence. It is indeed a determiner, so we continue. The next category, "Noun", is also a lexical category, so we check, and succeed. [11]

Now we come to a non-lexical category, VP, so we find a rule for that. This rule has optional constituents, so we treat each optional possibility as a separate node. Our first node assumes that both optional constituents are absent: [11]

VP -> Verb

And we create a node for each of the other possibilities:

VP -> Verb NP

VP -> Verb PP

VP -> Verb NP PP

The first node predicts a verb and one is there, so we continue. However, that rule says we should be done, and we aren't yet, so it fails, and we go back to the next node. This one also predicts a verb, so we continue. We expand an NP node, which predicts a determiner, but there is none there, so that one fails. [11]

The next node predicts a verb, and we expand the PP node to predict a preposition, which is what is there, and we continue on.

Obviously there can be lots more complexity to all of this but the general idea in what is called "top down" parsing is a depth-first search down the left side of the tree until a lexical category is predicted. This is compared with the next word in the sentence. [11]
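The walk-through above can be sketched as a small backtracking recognizer. This is a minimal illustration, not the parser used in any particular system: the optional constituents of the VP rule are spelled out as four alternatives, in the order tried in the text, and the lexicon covers only the example sentence.

```python
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    # {NP} and {PP} are optional, so VP has four concrete alternatives.
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "PP"], ["Verb", "NP", "PP"]],
    "PP": [["Prep", "NP"]],
}

# Lexicon for the example sentence only (an assumption for illustration).
LEXICON = {"the": "Det", "dog": "Noun", "yard": "Noun", "barked": "Verb", "in": "Prep"}

def parse(symbols, words):
    """Depth-first, left-to-right search: True if `symbols` can derive exactly `words`."""
    if not symbols:
        return not words  # succeed only if every word has been consumed
    first, rest = symbols[0], symbols[1:]
    if first in RULES:
        # Non-lexical category: try each rule in turn, backtracking on failure.
        return any(parse(rhs + rest, words) for rhs in RULES[first])
    # Lexical category: compare the prediction with the next word.
    return bool(words) and LEXICON.get(words[0]) == first and parse(rest, words[1:])

def recognize(sentence):
    """Lower-case the sentence, drop the final full stop, and try to derive it from S."""
    return parse(["S"], sentence.lower().rstrip(".").split())
```

Calling `recognize("The dog barked in the yard.")` follows the same sequence of failed and successful VP alternatives as the text: "VP -> Verb" fails because words remain, "VP -> Verb NP" fails on the missing determiner, and "VP -> Verb PP" succeeds.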

To handle non-context-free phenomena, a context-free parser is sometimes augmented with some additional tests or operations to perform after the parser succeeds on the context-free operation to possibly eliminate some sentences. For example we might have:

S -> NP VP (= (number NP) (number VP))

Where 'number' returns whether its argument is singular or plural. Of course we will have to augment our representation of the syntactic structure somehow to record this and other potentially relevant syntactic properties. We will see a specific example of this next time, when we examine a parser that uses the machinery we developed for proving theorems. [11]
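As a rough illustration of such an added test, the sketch below checks number agreement after the context-free step has succeeded. The `NUMBER` table is a hypothetical feature lexicon, and treating "run" as plural is a simplification made for the example.

```python
# Hypothetical number features for a few words (not from the source).
NUMBER = {"dog": "singular", "dogs": "plural", "runs": "singular", "run": "plural"}

def number(phrase):
    """Return 'singular' or 'plural' for the first word in the phrase that carries the feature."""
    return next(NUMBER[word] for word in phrase if word in NUMBER)

def agrees(np, vp):
    """The extra test attached to S -> NP VP: (= (number NP) (number VP))."""
    return number(np) == number(vp)
```

Under these assumptions `agrees(["the", "dogs"], ["runs", "home"])` is false, which rules out the starred example "the dogs runs home" even though it satisfies the context-free rules.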

5.9. Issues in Semantics

Although it is hard to tell sometimes at linguistics talks, the only reason that people are interested in syntax is that the structure of a sentence is presumably related somehow to the meaning that it conveys. [11]

One idea in semantics that we have already seen is the idea of hierarchies of objects. To some degree, the meanings of nouns and noun phrases can be understood with the sorts of knowledge representation ideas we have already looked at, and many of these ideas were developed for natural language understanding systems. [11]

Another is the idea of the "referent" of a noun phrase: the thing that it refers to, usually by satisfying some description.

The idea of hierarchies of objects can also be extended to hierarchies of actions and events. In the theory of "conceptual dependency" the claim is that the relations among complex events can be understood by composing them out of simpler events. [11]

A key idea in representing events is that certain kinds of events have specific "participants". For example a "buy" event has a buyer and a seller and a thing bought. A "move" event has the thing that moves and possibly an initial and a final location and maybe a path along which the motion happens. [11]

These observations lead to the theory of "case frames". A case frame is a representation of an action or event, along with its participants. The reason they are called "case" frames has to do with the fact that in many languages (though not English), nouns are assigned case depending on the role that the referent of the noun phrase plays in the sentence. For example in Latin, there is a different ending to indicate if the word is the subject of the sentence, the direct object, or if it refers to a location (and some more). [11]

The idea of case frames is that each verb is associated with a specific case frame, and a set of "role mappings" which indicate how the syntactic arguments of the sentence are assigned to the participant slots in the case frame.

Here are some typical slots in case frames:







For example the verb "buy" might be associated with a "purchase" case frame with a buyer and seller and a thing bought. So we will assume that it uses the "source" slot for the seller, the "goal" slot for the buyer, and the "object" slot for the thing bought. [11]

Thus the verb "buy" maps the subject of the sentence to the "goal" slot, the direct object to the "object" slot, and the object of the preposition "from" to the "source" slot. Note that prepositions are often used to assign case roles. Obviously, "from" is often the "source" slot and "to" is often the "goal" slot.

Now consider the verb "sell". This evokes the same case frame but with different mappings: the subject is now the source, the object is again the object, and the object of "to" is the goal. [11]
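The role mappings for "buy" and "sell" described above can be sketched as two small tables over one shared frame. The key names ("subject", "direct_object") and the example fillers are illustrative choices, not part of any established notation.

```python
# Role mappings as described in the text: syntactic position -> case frame slot.
ROLE_MAPPINGS = {
    "buy": {"subject": "goal", "direct_object": "object", "from": "source"},
    "sell": {"subject": "source", "direct_object": "object", "to": "goal"},
}

def fill_frame(verb, arguments):
    """Build a 'purchase' case frame from a verb and its syntactic arguments."""
    frame = {"frame": "purchase"}
    for position, filler in arguments.items():
        frame[ROLE_MAPPINGS[verb][position]] = filler
    return frame

bought = fill_frame("buy", {"subject": "Mary", "direct_object": "a book", "from": "John"})
sold = fill_frame("sell", {"subject": "John", "direct_object": "a book", "to": "Mary"})
```

"Mary bought a book from John" and "John sold a book to Mary" produce the same filled frame here, which is the point of separating the frame from the verb-specific mappings.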

5.10. Issues in Pragmatics

Pragmatics usually refers to how contextual resources are used to work out the specific meanings of sentences. Sometimes the contextual resources are linguistic, for example referring expressions, and sometimes they are part of the speech situation, for example the speaker and hearer, and the time and place of the utterance. [11]

So for example we have in English the difference between "definite" and "indefinite" reference. An "indefinite" expression gives a description and is often used to indicate that an object satisfying that description is to be newly introduced into the discourse. A "definite" referring expression is used to refer back to a previously mentioned entity. So in: [11]

A bear came to our campsite last night.

The bear was eating our garbage.

It scared my brother.

The first expression, "a bear", is indefinite: it introduces the entity to the discourse store. "The bear" is definite: it refers to the previously introduced bear. So does "it". All of this requires some notion of a "structure" or "context" in which referring expressions are introduced. [11]
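A very small sketch of such a store follows. It assumes the referring expressions have already been extracted and labelled, and it resolves a pronoun to the most recently introduced entity, which is a crude heuristic chosen only to keep the example short.

```python
def track_references(expressions):
    """Walk (kind, noun) pairs in order and record what each one does.

    kind is "indefinite", "definite" or "pronoun"; noun is None for a pronoun.
    """
    store = []   # entities introduced so far, oldest first
    result = []
    for kind, noun in expressions:
        if kind == "indefinite":
            store.append(noun)  # introduce a new entity into the discourse
            result.append(("introduces", noun))
        elif kind == "definite":
            # Look back for the most recent entity matching the description.
            found = next((e for e in reversed(store) if e == noun), None)
            result.append(("refers to", found))
        else:
            # Pronoun: assumed here to pick up the most recent entity.
            result.append(("refers to", store[-1] if store else None))
    return result

# "A bear ...", "The bear ...", "It ..."
story = [("indefinite", "bear"), ("definite", "bear"), ("pronoun", None)]
links = track_references(story)
```

A definite expression with no matching entity in the store resolves to `None`, which is where a fuller system would need world knowledge or the speech situation.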

The discourse situation must be represented also, for many references to be understood. For example we need to represent the speaker and hearer, and perhaps onlookers, if we are to work out the intended referents of "me" and "you" and "us" and "them". Also the times of "now" and "yesterday", and the locations of "here" and "there". [11]

Different languages partition the speech situation in different ways than English. For example many languages have a second person plural, sort of like "you all". Some have two different kinds of first person plurals -- one that includes the hearer, and one that doesn't. Spanish, for example, has four spatial pronouns, one for near the speaker, one for near the hearer, one for in the region where both are, and one for a region far from both. [11]

5.11. Issues in Discourse

The next level of analysis is called "discourse theory". This is about the higher level relations that hold among sequences of sentences in a discourse or a narrative. It merges sometimes with literary theory, but also with pragmatics. [11]

One thing to understand is that different sentences do different kinds of "work" in a discourse. We have seen some examples of this already -- noun phrases that refer to new entities, or back to previously introduced ones. The same holds for whole sentences: some introduce new events or relations, and some use those already introduced to introduce something new. [11]

A car began rolling down the hill

It collided with a lamppost.

One important idea in discourse theory is the idea that much language is performed in the context of some mutual activity. For example two people could be working on some project together. In this case, they are probably both somewhat aware of the plan that they are both following, and so much of the pragmatic information needed to understand what they are talking about can be thought of in terms of that plan. And sometimes utterances can be understood as if they were steps in the execution of a plan. For example if I say, [11]

please pass the salt

This could be thought of as a way to get me the salt, if having salt was part of a plan.

Some people think of sentences like

can you pass the salt

as "indirect speech acts" because they look like questions, but aren't really. One way to think about sentences like this is that the hearer understands that this is probably not a question, but is a conventionalized (and polite) means of asking for the salt. [11]

Another analysis of this sort of sentence is that you are trying to avoid rejection. You do this by considering ways that your plan might fail. So you don't want to have this happen:

please pass the salt

I can't, I'm tied up with ropes.

oh, sorry.

So you ask about potential problems first -- asking about ability. So that if there is a problem, you don't have to ask directly and you won't be rejected. It is sort of like: [11]

are you doing anything saturday night?

yes, I'm feeding my goldfish

So you don't have to be rejected if you actually ask for a date. [11]

6. GATE (General Architecture for Language Engineering)

We have used GATE [8] as the NLP engine in Auto Modeler. GATE is an infrastructure for developing and deploying software components that process human language. GATE helps scientists and developers in three ways[8]:

By specifying an architecture, or organizational structure, for language processing software;
By providing a framework, or class library, that implements the architecture and can be used to embed language processing capabilities in diverse applications;
By providing a development environment built on top of the framework made up of convenient graphical tools for developing components.

The architecture exploits component-based software development, object orientation and mobile code. The framework and development environment are written in Java and available as open-source free software under the GNU library licence. GATE uses Unicode [8] throughout, and has been tested on a variety of Slavic, Germanic, Romance, and Indic languages [8].

From a scientific point-of-view, GATE's contribution is to quantitative measurement of accuracy and repeatability of results for verification purposes [8].

GATE has been in development at the University of Sheffield since 1995 and has been used in a wide variety of research and development projects [8]. Version 1 of GATE was released in 1996, was licensed by several hundred organizations, and used in a wide range of language analysis contexts including Information Extraction ([8]) in English, Greek, Spanish, Swedish, German, Italian and French. Version 3.1 of the system, a complete reimplementation and extension of the original, is available from http://gate.ac.uk/download/ [8].

GATE is distributed with an Information Extraction system called ANNIE (A Nearly-New IE system). ANNIE relies on finite state algorithms and the JAPE language [8]. ANNIE components form a pipeline, which appears in the figure below.

ANNIE components are included with GATE. We have used ANNIE for Tokenization, Sentence Splitting and Part of Speech Tagging. For Morphological analysis we have used the GATE Morphological Analyzer. For Domain independent Semantic analysis we have used GATE's SUPPLE parser. Each process is explained in detail below.

6.1. Tokeniser

The tokeniser splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, we distinguish between words in uppercase and lowercase, and between certain types of punctuation. The aim is to limit the work of the tokeniser to maximise efficiency, and enable greater flexibility by placing the burden on the grammar rules, which are more adaptable [8].

6.1.1. Tokeniser Rules. A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular expression which has to be matched on the input; the RHS describes the annotations to be added to the Annotation Set. The LHS is separated from the RHS by '>'. The following operators can be used on the LHS [8]:

| (or)

* (0 or more occurrences)

? (0 or 1 occurrences)

+ (1 or more occurrences)

The RHS uses ';' as a separator, and has the following format:

{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}

Details about the primitive constructs available are given in the tokeniser file (DefaultTokeniser.Rules).

The following tokeniser rule is for a word beginning with a single capital letter:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

It states that the sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.
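The rule format can be illustrated with a small Python function that splits a rule into its LHS, annotation type and attributes. This is a sketch of the textual format only, not of how GATE itself reads rule files. The rule string is written to match the description above, the primitive names in its LHS are assumptions, and the function assumes the LHS contains no '>' character.

```python
def parse_rule(rule):
    """Split 'LHS > Type;attr=value;...' into (lhs, annotation_type, attributes)."""
    lhs, rhs = rule.split(">", 1)  # assumes no '>' inside the LHS
    fields = [field for field in rhs.strip().split(";") if field]
    annotation_type = fields[0]
    attributes = dict(field.split("=", 1) for field in fields[1:])
    return lhs.strip(), annotation_type, attributes

# Illustrative rule for a word with an initial capital letter.
RULE = '"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;'
lhs, annotation_type, attributes = parse_rule(RULE)
```

Here `annotation_type` is "Token" and `attributes` holds the "orth" and "kind" values that would be added to the Annotation Set when the LHS matches.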

6.1.2. Token Types. In the default set of rules, the following kinds of Token and SpaceToken are possible [8]:

Word: A word is defined as any set of contiguous upper or lowercase letters, including a hyphen (but no other forms of punctuation). A word also has the attribute “orth”, for which four values are defined:[8]

• upperInitial - initial letter is uppercase, rest are lowercase

• allCaps - all uppercase letters

• lowerCase - all lowercase letters

• mixedCaps - any mixture of upper and lowercase letters not included in the above categories

Number: A number is defined as any combination of consecutive digits. There are no subdivisions of numbers.[8]

Symbol: Two types of symbol are defined: currency symbol (e.g. ‘$’, ‘£’) and symbol (e.g. ‘&’, ‘^’). These are represented by any number of consecutive currency or other symbols (respectively).[8]

Punctuation: Three types of punctuation are defined: start punctuation (e.g. ‘(’), end punctuation (e.g. ‘)’), and other punctuation (e.g. ‘:’). Each punctuation symbol is a separate token.[8]

SpaceToken: White spaces are divided into two types of SpaceToken - space and control - according to whether they are pure space characters or control characters. Any contiguous (and homogenous) set of space or control characters is defined as a SpaceToken.

The above description applies to the default tokeniser. However, alternative tokenisers can be created if necessary. The choice of tokeniser is then determined at the time of text processing.[8]
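The four “orth” values for words can be sketched as a simple classifier. This is an illustration of the definitions above rather than GATE's implementation. It assumes the input is a non-empty word made of letters and hyphens, and it treats a single capital letter as upperInitial, following the "zero or more lowercase letters" wording of the rule in 6.1.1.

```python
def orth(word):
    """Classify a word as upperInitial, allCaps, lowerCase or mixedCaps."""
    letters = word.replace("-", "")  # hyphens are allowed inside a word
    if letters.islower():
        return "lowerCase"
    if letters[0].isupper() and (len(letters) == 1 or letters[1:].islower()):
        return "upperInitial"
    if letters.isupper():
        return "allCaps"
    return "mixedCaps"
```

For example, "London" is classified as upperInitial, "NATO" as allCaps, "well-known" as lowerCase and "McDonald" as mixedCaps.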

6.1.3. English Tokeniser. The English Tokeniser is a processing resource that comprises a normal tokeniser and a JAPE transducer [8]. The transducer has the role of adapting the generic output of the tokeniser to the requirements of the English part-of-speech tagger. One such adaptation is the joining together in one token of constructs like “ '30s”, “ 'Cause”, “ 'em”, “ 'N”, “ 'S”, “ 's”, “ 'T”, “ 'd”, “ 'll”, “ 'm”, “ 're”, “ 'til”, “ 've”, etc. Another task of the JAPE transducer is to convert negative constructs like “don't” from three tokens (“don”, “ ' ” and “t”) into two tokens (“do” and “n't”).[8]

The English Tokeniser should always be used on English texts that need to be processed afterwards by the POS Tagger.[8]
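The conversion of negative constructs can be sketched on a plain list of token strings. In GATE this adaptation is done by a JAPE transducer over annotations; the Python function below is only a stand-in that shows the same three-to-two regrouping.

```python
def join_negatives(tokens):
    """Rewrite X, "'", "t" (where X ends in "n") as X-without-n followed by "n't"."""
    out = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        if (i + 2 < len(tokens) and len(word) > 1 and word[-1] in "nN"
                and tokens[i + 1] == "'" and tokens[i + 2] in ("t", "T")):
            out.append(word[:-1])                        # "don" becomes "do"
            out.append(word[-1] + "'" + tokens[i + 2])   # "n" + "'" + "t" becomes "n't"
            i += 3
        else:
            out.append(word)
            i += 1
    return out
```

So the generic tokeniser output `["I", "don", "'", "t", "know"]` becomes `["I", "do", "n't", "know"]`, while a possessive such as `["dog", "'", "s"]` is left alone by this function.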

6.2. Gazetteer

The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organizations, days of the week, etc.[8] Below is a small section of the list for units of currency:


European Currency Units



German mark

German marks

New Taiwan dollar

New Taiwan dollars

NT dollar

NT dollars

An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type. In the example below, the first column refers to the list name, the second column to the major type, and the third to the minor type. These lists are compiled into finite state machines. Any text tokens that are matched by these machines will be annotated with features specifying the major and minor types. Grammar rules then specify the types to be identified in particular circumstances. Each gazetteer list should reside in the same directory as the index file.





So, for example, if a specific day needs to be identified, the minor type “day” should be specified in the grammar, in order to match only information about specific days; if any kind of date needs to be identified, the major type “date” should be specified, to enable tokens annotated with any information about dates to be identified. More information about this can be found in the following section.[8]
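The difference between matching on a major type and on a minor type can be sketched as follows. The list names, entries and types below are hypothetical examples, and a plain dictionary lookup stands in for the finite state machines that GATE compiles from the lists.

```python
# Hypothetical index: list name -> (major type, minor type).
INDEX = {"day.lst": ("date", "day"), "month.lst": ("date", "month")}

# Hypothetical list contents, one entry per line in the real files.
LISTS = {"day.lst": ["Monday", "Tuesday"], "month.lst": ["January", "February"]}

def build_lookup(index, lists):
    """Map every list entry to the features its list assigns."""
    lookup = {}
    for name, (major, minor) in index.items():
        for entry in lists[name]:
            lookup[entry] = {"majorType": major, "minorType": minor}
    return lookup

def matches(tokens, lookup, major=None, minor=None):
    """Return the tokens whose gazetteer features satisfy the requested types."""
    found = []
    for token in tokens:
        features = lookup.get(token)
        if features is None:
            continue
        if major is not None and features["majorType"] != major:
            continue
        if minor is not None and features["minorType"] != minor:
            continue
        found.append(token)
    return found

lookup = build_lookup(INDEX, LISTS)
tokens = "The review is on Monday in January".split()
```

Asking for the major type "date" returns both "Monday" and "January", while asking for the minor type "day" returns only "Monday".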

6.3. Sentence Splitter

The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the POS tagger. The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.[8]

Each sentence is annotated with the type Sentence. Each sentence break (such as a full stop) is also given a “Split” annotation. This has several possible types: “.”, “punctuation”, “CR” (a line break) or “multi” (a series of punctuation marks such as “?!?!”).

The sentence splitter is domain and application-independent.[8]
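The role of the abbreviation list can be shown with a deliberately simple splitter. It works on whitespace-separated tokens rather than with a cascade of finite-state transducers, and the abbreviation set is a hypothetical stand-in for the gazetteer list.

```python
# Hypothetical abbreviation list (the real one is a gazetteer list).
ABBREVIATIONS = {"Dr.", "Mr.", "e.g.", "etc."}

def split_sentences(text):
    """Split text at sentence-final punctuation, ignoring full stops in known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences
```

Without the abbreviation check, the full stop in "Dr." would be taken as a sentence break.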

6.4. Part of Speech Tagger

The POS tagger [8] is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. The list of tags used is given in [8]. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). Both of these can be modified manually if necessary. Two additional lexicons exist - one for texts in all uppercase (lexicon cap), and one for texts in all lowercase (lexicon lower). To use these, the default lexicon should be replaced with the appropriate lexicon at load time. The default ruleset should still be used in this case.[8]

The ANNIE Part-of-Speech tagger requires the following parameters.

* encoding - encoding to be used for reading rules and lexicons (init-time)

* lexiconURL - The URL for the lexicon file (init-time)

* rulesURL - The URL for the ruleset file (init-time)

* document - The document to be processed (run-time)

* inputASName - The name of the annotation set used for input (run-time)

* outputASName - The name of the annotation set used for output (run-time). This is an optional parameter. If the user does not provide any value, new annotations are created under the default annotation set.

* baseTokenAnnotationType - The name of the annotation type that refers to Tokens in a document (run-time, default = Token)

* baseSentenceAnnotationType - The name of the annotation type that refers to Sentences in a document (run-time, default = Sentences)

* outputAnnotationType - POS tags are added as category features on the annotations of type “outputAnnotationType” (run-time, default = Token)

If (inputASName == outputASName) AND (outputAnnotationType == baseTokenAnnotationType), then new features are added on existing annotations of type “baseTokenAnnotationType”.

Otherwise, the tagger searches for an annotation of type “outputAnnotationType” under the “outputASName” annotation set that has the same offsets as the annotation of type “baseTokenAnnotationType”. If it finds one, it adds the new feature to that annotation; otherwise, it creates a new annotation of type “outputAnnotationType” under the “outputASName” annotation set.[8]
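The decision described above can be sketched in Python, with annotation sets modelled as plain lists of dictionaries. This models only the logic in the text and is not GATE's Java API; the function name and the dictionary layout are illustrative.

```python
def add_pos_tag(sets, token, tag, input_as, output_as, output_type,
                base_token_type="Token"):
    """Attach `tag` as a 'category' feature, following the three cases in the text."""
    # Case 1: same annotation set and same type, so tag the existing token.
    if input_as == output_as and output_type == base_token_type:
        token["features"]["category"] = tag
        return token
    # Case 2: look for an annotation of the output type with the same offsets.
    target_set = sets.setdefault(output_as, [])
    for annotation in target_set:
        if (annotation["type"] == output_type
                and annotation["start"] == token["start"]
                and annotation["end"] == token["end"]):
            annotation["features"]["category"] = tag
            return annotation
    # Case 3: none found, so create a new annotation in the output set.
    created = {"type": output_type, "start": token["start"], "end": token["end"],
               "features": {"category": tag}}
    target_set.append(created)
    return created
```

The function returns the annotation that ended up carrying the tag, which makes the three cases easy to tell apart when experimenting with it.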

6.5. GATE Morphological Analyzer

The Morphological Analyzer Processing Resource (PR) can be found in the Tools plugin[8]. It takes as input a tokenized GATE document. Considering one token and its part of speech tag, one at a time, it identifies its lemma and an affix. These values are then added as features on the Token annotation. Morpher is based on certain regular expression rules. These rules were originally implemented by Kevin Humphreys in GATE1 in a programming language called Flex [8]. Morpher has a capability to interpret these rules with an extension of allowing users to add new rules or modify the existing ones based on their requirements. In order to allow these operations with as little effort as possible, we changed the way these rules are written.[8]

Two types of parameters, Init-time and run-time, are required to instantiate and execute the PR.

* rulesFile (Init-time) The rule file has several regular expression patterns. Each pattern has two parts, L.H.S. and R.H.S. L.H.S. defines the regular expression and R.H.S. the function name to be called when the pattern matches with the word under consideration.

* caseSensitive (init-time) By default, all tokens under consideration are converted into lowercase to identify their lemma and affix. If the user selects caseSensitive to be true, words are no longer converted into lowercase.

* document (run-time) Here the document must be an instance of a GATE document.

* affixFeatureName Name of the feature that should hold the affix value.

* rootFeatureName Name of the feature that should hold the root value.

* annotationSetName Name of the annotationSet that contains Tokens.

* considerPOSTag Each rule in the rule file has a separate tag, which specifies which rule to consider with what part-of-speech tag. If this option is set to false, all rules are considered and matched with all words. This option is very useful. For example, the word “singing” can be used as a noun as well as a verb. Where it is identified as a verb, its lemma would be “sing” and its affix “ing”; otherwise there would not be any affix.
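The effect of considerPOSTag on the “singing” example can be sketched with a single suffix rule. The rule table, the tag names (Penn Treebank style, assumed here) and the result keys are illustrative only and do not reproduce the syntax of GATE's rule file.

```python
# Hypothetical rules: (POS tag the rule applies to, suffix to strip).
RULES = [("VBG", "ing")]

def analyse(word, pos, consider_pos_tag=True):
    """Return the root and affix of a word, optionally restricting rules by POS tag."""
    for rule_tag, suffix in RULES:
        tag_ok = (not consider_pos_tag) or pos == rule_tag
        if tag_ok and word.endswith(suffix) and len(word) > len(suffix):
            return {"root": word[:-len(suffix)], "affix": suffix}
    return {"root": word, "affix": ""}
```

With the option on, “singing” tagged as a verb yields the root “sing” and affix “ing”, while the same word tagged as a noun is returned unchanged; with the option off, the rule fires regardless of the tag.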

6.5.1. Rule File. GATE provides a default rule file, called default.rul, which is available under the gate/plugins/Tools/morph/resources directory. The rule file has two sections.[8]

* Variables

* Rules

Variables: The user can define various types of variables under the section defineVars. These variables can be used as part of the regular expressions in rules. There are three types of variables:

Range: With this type of variable, the user can specify the range of characters. e.g. A ==> [-a-z0-9]

Set: With this type of variable, the user can also specify a set of characters, where one character at a time from this set is used as a value for the given variable. When this variable is used in any regular expression, all values are tried one by one to generate the string which is compared with the contents of the document. e.g. A ==> [abcdqurs09123]

Strings: Whereas in the two types explained above variables can hold only one character from the given set or range at a time, this type allows specifying strings as possibilities for the variable. e.g. A ==> "bb" OR "cc" OR "dd"

Rules: All rules are declared under the section defineRules. Every rule has two parts, LHS and RHS. The LHS specifies the regular expression and the RHS the function to be called when the
