The article "NLTK: The Natural Language Toolkit" explains thoroughly what NLTK is, why it was created, and how it is used, illustrated with several examples. It also briefly presents some projects assigned to students, difficulties that have been overcome, and some of the basic tools and characteristics of the Natural Language Toolkit. Furthermore, the authors justify their choice to build NLTK on Python, having carefully selected the criteria the programming language should satisfy. Lastly, the key features of Python are explained, with examples of its most important modules and tools.
The main concern of Natural Language Processing (NLP) is the design and implementation of computer software that interacts with humans using natural language. Given the self-evident observation that natural language is the easiest and most effective way for humans to communicate with one another, it follows that the same holds for communication between humans and machines. The primary aim of NLP research is to design the language input-output mechanisms of artificially intelligent systems that can use language as fluently and flexibly as humans do. The Natural Language Toolkit (NLTK) is a set of open source program modules, tutorials and problem sets for symbolic and statistical natural language processing, available under an open source licence. It runs on all platforms supported by Python, including Windows, OS X, Linux and Unix.
We chose Python because it has a shallow learning curve, its syntax and semantics are transparent, and it has good string-handling functionality. As an interpreted language, Python facilitates interactive exploration. As an object-oriented language, Python permits data and methods to be encapsulated and re-used easily. Python comes with an extensive standard library, including tools for graphical programming and numerical processing. The recently added generator syntax makes it easy to create interactive implementations of algorithms.
Python is a programming language that lets you work quickly and integrate your systems effectively. Learning Python can lower maintenance costs and bring almost immediate gains in productivity. It was released by its designer, Guido van Rossum, in February 1991, while he was working at CWI (formerly the Stichting Mathematisch Centrum). Many of Python's features originated in an interpreted language called ABC; van Rossum wanted to correct some of ABC's problems while keeping some of its features. At the time he was working in the Amoeba distributed operating system group and was looking for a scripting language with a syntax like ABC's but with access to the Amoeba system calls, so he decided to create a language that was generally extensible. Since he had some experience with Modula-2+, he talked with the designers of Modula-3, which is the origin of the syntax and semantics Python uses for exceptions, among other features. In 1989, during the Christmas holidays, he decided to give it a try and design the language he later called Python.
In the article "NLTK: The Natural Language Toolkit", the authors Edward Loper and Steven Bird explain the various features of the Natural Language Toolkit as well as the reasons why they chose Python to work with, over many other programming languages.
In their search for the most suitable programming language, they set several criteria: the language should be fairly easy to learn; it must support rapid prototyping; the code should be simple and straightforward, with a programming structure that another programmer can readily understand and extend; it must be object-oriented, making programs easier to write and improving their clarity, quality, and development time, without the difficulty of use associated with languages like C++; and it must have an easy-to-use graphics library to support the design of graphical user interfaces. Python was chosen because it fulfils all of the above requirements.
In the design and implementation of the toolkit, several criteria were taken into consideration. It was therefore critical to determine what the toolkit would and would not attempt to accomplish. The requirements for the toolkit were:
Ease of use. The main objective of the toolkit is to allow students to focus on constructing natural language processing systems.
Consistency. The toolkit should use consistent data structures and interfaces.
Extensibility. The toolkit should adapt to new components with ease. Furthermore, it should be clear where these extensions would fit into the toolkit's infrastructure.
Documentation. The toolkit and all its components should be thoroughly documented.
Simplicity. The complexities of building NLP systems should be well-thought-out and clearly specified.
Modularity. The interaction among the components of the toolkit should be kept to a minimum, using simple, well-defined interfaces.
The non-requirements were:
Comprehensiveness. The toolkit should be structured so that students are encouraged to extend it where needed, instead of it providing a complete set of tools.
Efficiency. The toolkit does not need to be highly optimised; it only needs to be efficient enough that real tasks can be performed.
Cleverness. Clear designs and implementations are preferable to ingenious but opaque ones.
The toolkit is assembled from several independent modules, each of which defines a specific data structure or task. The core modules define the basic data types and processing systems used throughout the toolkit. These are:
The Token module provides basic classes for processing individual elements of text.
The Tree module defines data structures for representing tree structures over text.
The Probability module implements classes that encode frequency distributions and probability distributions.
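The ideas behind the Token and Probability modules can be illustrated with a short sketch in plain Python. The function and variable names here are my own for illustration; they are not the toolkit's actual classes.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (an illustrative stand-in
    for the toolkit's tokenisation facilities)."""
    return re.findall(r"[A-Za-z]+", text.lower())

text = "the cat sat on the mat and the cat slept"
tokens = tokenize(text)

# A frequency distribution maps each token to its count --
# the core idea behind the Probability module.
freq = Counter(tokens)
print(freq.most_common(2))  # [('the', 3), ('cat', 2)]
```

A frequency distribution like this is the starting point for the statistical techniques the toolkit supports, since relative counts can be normalised into probability estimates.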
The rest of the modules define data structures and interfaces for performing special NLP tasks. These are:
The Parsing module consists of the parser module, which produces trees that represent the structure of text; the chunkparser module, which identifies non-overlapping linguistic groups in unrestricted text; and four submodules supporting these two: srparser, chartparser, pcfgparser and rechunkparser.
The Tagging module adds extra information to each token.
The Finite State Automata module encodes finite state automata and creates them from regular expressions.
The Type checking module, which aids in debugging the code.
The Visualisation module, consisting of several submodules, which graphically displays structures such as finite state automata.
The Text Classification module, which defines interfaces for categorising and classifying text.
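To make the idea of tagging concrete, the sketch below trains a most-frequent-tag lookup tagger on a tiny hand-tagged corpus. This is a minimal illustration of what a tagger does, assuming invented example data; it is not the Tagging module's actual interface.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (hypothetical example data, not from the toolkit).
tagged_corpus = [("the", "DT"), ("dog", "NN"), ("barks", "VB"),
                 ("the", "DT"), ("cat", "NN"), ("sleeps", "VB"),
                 ("a", "DT"), ("dog", "NN"), ("sleeps", "VB")]

# Count how often each tag appears for each word,
# then keep the most frequent tag per word as the model.
counts = defaultdict(Counter)
for word, tag_label in tagged_corpus:
    counts[word][tag_label] += 1
model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(tokens, default="NN"):
    """Attach the most likely tag to each token; unseen words get a default."""
    return [(t, model.get(t, default)) for t in tokens]

print(tag(["the", "dog", "sleeps"]))  # [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VB')]
```

Even this simple lookup strategy performs surprisingly well on frequent words, which is why it is a common baseline in tagging exercises.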
The toolkit is accompanied by three types of documentation: tutorials, reference documentation and technical reports. Tutorials give detailed information, including examples, on how to use the toolkit. The reference documentation provides precise definitions for every module, interface, class, method, function and variable in the toolkit. Technical reports explain and justify the toolkit's design and implementation.
NLTK has various uses, among which are:
Assignments. NLTK supports a wide range of assignments at different levels of complexity. It is suitable both for students who want to become familiar with it through simple tasks and for more demanding tasks, in which students are required to modify built-in components to construct comprehensive systems. One assignment given to students, using chunk parsing, was to correctly identify base noun phrase chunks in a given text.
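The noun-phrase chunking task described above can be sketched in plain Python with a regular expression over part-of-speech tags. This is only an illustration of the idea under an assumed tag set (DT, JJ, NN), not the toolkit's chunkparser module.

```python
import re

def np_chunk(tagged):
    """Find base noun-phrase chunks (optional determiner, any adjectives,
    one or more nouns) in a POS-tagged sentence."""
    tags = " ".join(t for _, t in tagged) + " "
    chunks = []
    # Match spans like "DT JJ NN " or "NN NN " over the tag sequence.
    for m in re.finditer(r"\b(DT )?(JJ )*(NN )+", tags):
        start = len(tags[:m.start()].split())   # token index where the match begins
        length = len(m.group().split())         # number of tags in the match
        chunks.append([w for w, _ in tagged[start:start + length]])
    return chunks

sent = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
        ("saw", "VB"), ("a", "DT"), ("cat", "NN")]
print(np_chunk(sent))  # [['the', 'little', 'dog'], ['a', 'cat']]
```

Treating the tag sequence as a string and chunking it with a regular expression mirrors the regular-expression-based approach to chunk parsing that the assignment builds on.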
Demonstrations. The interactive graphical environment of NLTK has proved invaluable for students learning natural language processing, because it contains detailed demonstrations and examples of algorithms, displaying the in-progress state of basic data structures. An example is the chart-parsing tool, which processes a single sentence given a vocabulary and a grammar. Its display is divided into three sections, showing (from bottom to top) the chart, the sentence, and the corresponding syntax tree.
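The chart that the demonstration tool displays can be illustrated with a small recogniser. The sketch below uses the CKY algorithm, one chart-parsing method, over a toy grammar of my own in Chomsky normal form; the interactive tool in the article is not limited to this algorithm or grammar.

```python
from itertools import product

# Toy grammar and lexicon (illustrative, not the tool's actual grammar).
lexicon = {"I": "NP", "saw": "V", "the": "DT", "dog": "N"}
rules = {("DT", "N"): "NP", ("V", "NP"): "VP", ("NP", "VP"): "S"}

def cky(words):
    """Fill a chart of recognised constituents; return the labels
    that span the whole sentence."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1].add(lexicon[w])          # lexical entries
    for span in range(2, n + 1):                 # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # try every split point
                for a, b in product(chart[i][k], chart[k][j]):
                    if (a, b) in rules:
                        chart[i][j].add(rules[(a, b)])
    return chart[0][n]

print(cky("I saw the dog".split()))  # {'S'}
```

Each chart cell records the constituents found over a span of the sentence, which is exactly the kind of intermediate state the graphical tool lets students watch being built.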
Projects. Students are provided with a supportive foundation for projects of varying difficulty. Typical projects require implementing a new algorithm or task, or creating a new component. One example, created during a statistical NLP course, is a probabilistic parsing module, which creates a probabilistic version of the toolkit's grammar data structure.
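The core computation in probabilistic parsing can be shown in a few lines: the probability of a parse tree is the product of the probabilities of the rules used at each node. The tree encoding and the rule probabilities below are hypothetical values chosen for illustration, not the course project's actual data.

```python
# A parse tree as nested tuples: (label, children...); leaves are strings.
tree = ("S",
        ("NP", "I"),
        ("VP", ("V", "saw"), ("NP", ("DT", "the"), ("N", "dog"))))

# Hypothetical rule probabilities for a toy PCFG.
probs = {("S", ("NP", "VP")): 1.0,
         ("VP", ("V", "NP")): 0.7,
         ("NP", ("DT", "N")): 0.4,
         ("NP", ("I",)): 0.1,
         ("V", ("saw",)): 0.5,
         ("DT", ("the",)): 0.6,
         ("N", ("dog",)): 0.2}

def tree_prob(t):
    """Multiply the probability of the rule applied at each node of the tree."""
    label, children = t[0], t[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = probs[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)          # recurse into non-leaf children
    return p

print(round(tree_prob(tree), 6))  # 0.00168
```

A probabilistic parser uses exactly this scoring to rank the alternative trees a grammar licenses for an ambiguous sentence and return the most likely one.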
Other relevant work
There are several notable articles dealing with natural language processing, as it stands as one of the most challenging and demanding fields, combining both programming and linguistics.
Three of the articles I selected deal with biomedical research, clinical documents and molecular biology.
The first article, "Coding Neuroradiology Reports for the Northern Manhattan Stroke Study: A Comparison of Natural Language Processing and Manual Review", deals with automated systems that use natural language processing for clinical research. The main goal of the study was to compare the accuracy of automated and manual coding for extracting data in a clinical research study: a consistent imaging form was produced, along with reports containing this information. The information extracted by each method was then compared, and ROC curves were generated against the researchers' own coded interpretations of the brain images.
The aim of the second article, "Automated Encoding of Clinical Documents Based on Natural Language Processing", was to develop a technique based on natural language processing (NLP) that maps a clinical document to codes with modifiers. The NLP system MedLEE was modified to generate codes automatically: by matching against the structured output it generates, it obtains the most specific code.
The third article, "GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles", focuses on extracting and structuring information about cellular pathways from the biological literature. This study utilises an existing medical natural language processing system, MedLEE.
Other articles that have cited the one I used are "Hands-on NLP for an interdisciplinary audience" by Elizabeth D. Liddy and Nancy J. McCracken, and "Web-based interfaces for natural language processing tools" by Marc Light, Robert Arens and Xin Lu.
In my research for articles I faced minor difficulties in finding the authors of the sources I used for my report, as well as in choosing the article most appropriate for this coursework. This assignment enhanced my knowledge by teaching me how to use the Google Scholar database and the Locateit server, which were very helpful in obtaining all the information I needed. I firmly believe that through this coursework I reached a good standard of report writing, which will be invaluable experience in the years to come.