Extraction & Indexing Keyphrases

Chapter 2

Literature review

2. Related work

Due to the increasing usage of electronic documents (company databases, digital libraries, and webpages), efficient access methods have become necessary. Content-based text search requires processing tools for automatic indexing, terminology extraction, document classification, keyphrase assignment, and keyphrase extraction. Several methods have been developed for these tasks, and some of the problems are solved successfully: Jacquemin and Tzoukermann (1999) claim to achieve high recall and almost 100% precision for indexing on terms and their variants within a research domain. Keyphrase indexing, however, is still a challenging problem: the maximal reported F-measure for automatic evaluation against authors' phrases is 45% (Hulth 2004). This can be explained by the difficulty of the task and by the problematic evaluation of indexing algorithms: even professional human indexers often disagree on which phrases are significant for a given document.

This chapter begins with an overview of current achievements in keyphrase extraction, assignment, and related areas. The comparison of these two strategies for keyphrase indexing reveals the reasons for sparse efforts in both approaches. The chapter ends with background on human indexing and related work in the area of cognitive linguistics, which sheds some light on the indexing specific cognitive processes and motivates the keyphrase indexing method proposed in this study.

2.1 Keyphrase extraction

Keyphrase extraction is usually divided into two stages:

1. Selecting candidates. All content words (non-stopwords) and phrases (concatenations of content words, possibly combined with stopwords) are extracted from the article. Alternatively, a part of speech (PoS) tagger and a shallow parser are used to recognize noun phrases (NPs).

2. Filtering. The selected candidates are analyzed and filtering heuristics are applied to determine which of them are keyphrases.
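In a simplified form, the first stage can be sketched as follows; the stopword list and the maximum phrase length of three are illustrative choices, not taken from any particular system discussed below:

```python
# Sketch of candidate selection: extract all n-grams that neither
# begin nor end with a stopword. Tokenization is naive whitespace
# splitting; real systems use orthographic boundaries and stemming.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are"}

def candidate_phrases(text, max_n=3):
    tokens = [t.lower() for t in text.split()]
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # A candidate must not start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.add(" ".join(gram))
    return candidates
```

Internal stopwords are allowed, so a phrase like "efficiency of algorithms" survives even though "of" alone would be filtered out.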

Filtering methods can be classified into machine learning and symbolic approaches according to the techniques they use. While machine learning methods build a statistical model from the training data, in symbolic methods researchers deduce the best possible filtering heuristics from a manual analysis of documents and their keyphrases.

2.1.1 Machine learning methods

Keyphrase extraction can be seen as supervised learning from examples. The main problem is to determine informative features that can be computed for each phrase and used to distinguish keyphrases from non-keyphrases among the candidate terms.

A learning algorithm needs two sets of documents with manually assigned keyphrase sets: the first is used for training, i.e. to create the classification model, and the second for the evaluation of this model. The keyphrases in the documents of the training set serve as positive examples, and all other phrases as negative examples. The training scheme learns the model by analyzing the feature values of each example. The performance of this model is then evaluated on unseen documents from the test set.

KEA The machine learning research group at the University of Waikato developed the Keyphrase Extraction Algorithm KEA, based on robust and simple methods (Frank et al. (1999), Witten et al. (1999)). In the first stage of keyphrase extraction, KEA determines textual sequences delimited by orthographic boundaries such as punctuation marks, numbers and new lines, and splits these sequences into tokens. All n-grams, i.e. single words or concatenations of two or more tokens, that do not begin or end with a stopword are candidate keyphrases. KEA stems each candidate with the iterated Lovins (1968) stemmer, but keeps the most frequent full version of a phrase for the output. In the filtering stage KEA computes two features for each candidate: the TF×IDF measure (a phrase's frequency in a document compared to its frequency in the document collection (Salton and McGill 1983)) and the distance of the phrase's first occurrence from the beginning of the document. A Naïve Bayes (Domingos and Pazzani 1997) learning scheme builds a model from the training data, consisting of two sets of weights: one for authors' keyphrases and one for all other phrases appearing in the documents. In the filtering stage the overall probability of each candidate being a keyphrase is calculated based on these data. The candidate phrases are ranked according to the computed probability, and the r top-ranked phrases are included in the resulting keyphrase set, where r is the number of phrases the user requested.
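A minimal sketch of KEA's two features, assuming whitespace tokenization and omitting the stemming and Naïve Bayes steps; the add-one smoothing in the IDF term is an added assumption, not KEA's exact formula:

```python
import math

# Two features per candidate phrase, in the spirit of KEA:
#   1. TF*IDF relative to a background collection of documents
#   2. relative position of the phrase's first occurrence
def kea_features(phrase, doc, collection):
    tokens = doc.lower().split()
    text = " ".join(tokens)
    tf = text.count(phrase) / max(len(tokens), 1)
    df = sum(1 for d in collection if phrase in d.lower())
    idf = math.log((len(collection) + 1) / (df + 1))  # smoothed (assumption)
    first = text.find(phrase)
    rel_pos = first / max(len(text), 1) if first >= 0 else 1.0
    return tf * idf, rel_pos
```

A phrase that is frequent in one document but rare in the collection, and that appears early, scores high on both features.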

GenEx P. Turney from the National Research Council Canada developed GenEx, a hybrid genetic algorithm for keyphrase extraction consisting of two components: Genitor and Extractor (Turney 1999). Extractor combines a set of symbolic heuristics to create a ranked list of keyphrases. Turney's candidates are phrases consisting of up to three content words. The candidates are stemmed by truncation, i.e. by cutting all terms down to a fixed length. To filter keyphrases from among the candidates, each of them is scored by its frequency multiplied by its position in the text. The scores of candidates with more than one word are additionally boosted, because they usually have lower scores than single-word terms. After removing duplicates and selecting the most frequent full form for each stemmed phrase, Extractor presents the top-ranked phrases as output. Extractor has 12 numeric parameters and flags (e.g. the boosting factor for phrases with two stems, or the parameter for the number of resulting keyphrases). The steady-state genetic algorithm Genitor is applied to determine the best parameter settings from the training data.
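The Extractor-style scoring can be sketched roughly as follows; the exact formula and the multi-word boost factor are among GenEx's tuned parameters, so the position weighting and the value 2.0 here are purely illustrative:

```python
# Rough sketch of Extractor-style candidate scoring: frequency,
# weighted so that an early first occurrence counts for more, with
# an extra boost for multi-word candidates.
def extractor_score(candidate, tokens, boost=2.0):
    words = candidate.split()
    spans = range(len(tokens) - len(words) + 1)
    freq = sum(1 for i in spans if tokens[i:i + len(words)] == words)
    if freq == 0:
        return 0.0
    first = next(i for i in spans if tokens[i:i + len(words)] == words)
    position_weight = 1.0 - first / len(tokens)  # earlier -> higher
    score = freq * position_weight
    if len(words) > 1:
        score *= boost  # stand-in for GenEx's tuned boosting factor
    return score
```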

A. Hulth presented in her PhD thesis (Hulth 2004) another machine learning technique with the additional use of NLP tools. She compared different methods for extracting candidate words and phrases: NP chunking, PoS pattern matching and, finally, trivial n-gram extraction. Matching the candidates against the manually assigned keyphrases showed that both linguistically oriented approaches produce more correct phrases than n-grams. The NP approach additionally outperformed the other candidate extraction methods by producing fewer incorrect terms. For filtering, Hulth uses four features: term frequency, inverse document frequency (unlike in KEA, not combined into the TF×IDF measure), the position of the first occurrence, and the PoS tag (Hulth's analysis of her experimental data showed that certain PoS patterns are more likely to denote keyphrases). A combination of several prediction models, applied to NP candidates (after removing determiners at the initial position), yielded the best results.

KEA is the simplest keyphrase extraction approach among these systems. GenEx is based on more complex filtering heuristics, but it does not outperform KEA (Frank et al. 1999). Although no comparison of all three systems on the same document collection exists, Hulth's evaluation results are significantly higher than those reported for KEA and GenEx. She demonstrates that the performance of her algorithm improved after using linguistically based techniques for candidate selection and classification. Hulth's observations are a good motivation to explore further NLP techniques for keyphrase extraction and assignment.

2.1.2 Symbolic methods

Barker and Cornacchia (2000) developed a tool called B&C that does not incorporate any machine learning techniques. To extract candidates they use a simple dictionary lookup to assign basic PoS tags and a shallow parser to identify all nouns with zero or more pre-modifying adjectives and nouns. In the filtering step, Barker and Cornacchia compute the frequency of the head noun of each candidate phrase and keep all phrases whose heads are among the N top-ranked ones. For each phrase they then compute its frequency times its length. The top K highest-scoring phrases are the keyphrases for the given document. N and K are thresholds set by the user. Evaluation experiments involving human judges have shown that this simple approach performs as well as Turney's Extractor.
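A minimal sketch of the B&C filter, assuming for illustration that the head noun is simply the last word of each candidate phrase (B&C identify heads with a shallow parser):

```python
from collections import Counter

# Rank candidate noun phrases by head-noun frequency, keep phrases
# whose heads are among the top-N heads, then score each kept phrase
# as frequency * length and return the top K. N and K are the user
# thresholds from the text.
def bc_keyphrases(phrases, top_heads=2, top_k=3):
    head_freq = Counter(p.split()[-1] for p in phrases)
    best_heads = {h for h, _ in head_freq.most_common(top_heads)}
    phrase_freq = Counter(p for p in phrases if p.split()[-1] in best_heads)
    scored = sorted(phrase_freq,
                    key=lambda p: phrase_freq[p] * len(p.split()),
                    reverse=True)
    return scored[:top_k]
```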

Paice and Black (2003) extract terms from documents that are specific to a particular domain and can be seen as keyphrases. To achieve a higher conflation rate of candidate phrases they transform each extracted n-gram into a pseudo-phrase in three basic steps: removing all stopwords from the n-gram, stemming the content terms, and sorting them into alphabetical order. This matches similar phrases such as algorithm efficiency, efficiency of algorithms, the algorithm's efficiency, an efficient algorithm and even the algorithm is very efficient to the same pseudo-phrase algorithm efficient. The original forms of each pseudo-phrase are stored so that the most frequent one can be included in the final set. This is a more sophisticated conflation method than simple stemming, and it boosts the overall score of a phrase group based on the morphological similarity of its members. Paice and Black (2003) score each pseudo-phrase using the formula score = W × (F − 1) × N², where W is the sum of the weights of all words in the pseudo-phrase, F is the frequency of the phrase in the document, and N is the length of the phrase in words (with a maximum value of 4). All candidate phrases are sorted by their score.
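The pseudo-phrase conflation and the scoring formula can be sketched as follows; the naive suffix-stripper stands in for a real stemmer and the stopword list is illustrative, so the exact keys differ from Paice and Black's:

```python
# Pseudo-phrase conflation in three steps: drop stopwords, crudely
# stem each content word, and sort alphabetically, so that variant
# phrasings collapse to the same key.
STOPWORDS = {"the", "a", "an", "of", "is", "very"}
SUFFIXES = ("iency", "ient", "ency", "ness", "ity", "s")

def crude_stem(word):
    # Naive suffix-stripping as a stand-in for a real stemmer.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def pseudo_phrase(phrase):
    words = [w for w in phrase.lower().replace("'s", "").split()
             if w not in STOPWORDS]
    return " ".join(sorted(crude_stem(w) for w in words))

def pseudo_score(word_weights, freq, length):
    # score = W * (F - 1) * N^2, with N capped at 4, as in the text.
    return sum(word_weights) * (freq - 1) * min(length, 4) ** 2
```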

In a final step, they apply a pattern-based technique to establish semantic roles and relations among the remaining phrases. They concentrate on three main roles (INFLUENCE, OBJECT, and PROPERTY) covered by domain-independent patterns such as effect of INFLUENCE on PROPERTY of OBJECT. Phrases that are not covered by any of the patterns are removed. The remaining phrases are presented with information on the roles that have been detected between them. This approach is an interesting symbiosis between keyphrase extraction and information extraction. The authors do not provide any evaluation of their methods, but they present some striking examples.

Keyphrase extraction methods, and especially the last one presented, are highly related to the area of automatic terminology extraction. Such approaches are advantageous when a database of all terms in a specific field is required. This data is important for various tasks (e.g. content-based indexing, text classification) and is normally collected manually by human experts in a given field. The main difference from keyphrase extraction is that terms are extracted not from individual documents, but from the entire collection.

2.1.3 Terminology extraction

Terminology extraction and keyphrase extraction share the initial problem of term normalization, which means conflation of morphologically, syntactically or even semantically related terms into the same group.

FASTR is one of the most prominent examples in this field (Bourigault and Jacquemin (1999), Jacquemin and Tzoukermann (1999), Jacquemin and Bourigault (2003)). FASTR uses a generator of morphologically related terms and a shallow transformational parser with a number of sophisticated conflation rules for terminological indexing of French and English texts. Term extraction starts with statistical PoS tagging. After that FASTR applies derivational analysis on each term in the document collection. This analysis is based on inflectional morphological rules and generates as many lexical forms for a given term as possible. For example, for the French stem mortel the following related terms are derived:

[mortel]

[im-[mortel]A]A (immortal)

[[im-[mortel]A]A-iser]V (immortalize)

[des-[[im-[mortel]A]A-iser]V]V (unimmortalize)

Overgenerated forms that either do not exist or are unusual (e.g. unimmortalize) are removed afterwards by matching them against the dictionary and all terms that appear in the corpus. The remaining forms, grouped into derivational classes, support the morpho-syntactic conflation of terms and phrases. While syntactic variants are identified by coordination or substitution rules (e.g. high tension is a variant of high or medium tension), morpho-syntactic variants imply additional derivational changes of their components. The derivational equivalence is estimated by using the previously determined classes. The following metarule identifies variants of the same concept N1 and N5:

Metarule AtoN(N1 → N2 A3) ≡ N5 → N2 (A? P D? A?) N4

⟨N4 deriv reference⟩ = ⟨A3 reference⟩

For example, it conflates phrases such as production industrielle (industrial production) and production actuelle de l'industrie (current production of the industry), where the first phrase corresponds to the pattern N2 A3 and the second to N2 A P D N4. The logical constraint in the second line of the rule above says that it can be applied only if A3 and N4 are in the same derivational class, which is the case for industrielle and industrie.

Conflating semantic variants is an additional constraint for every type of phrase in the document collection. Two phrases can be conflated if their components are located in the same synset of the lexical database WordNet. E.g., the following rule matches phrases such as rigid lens and hard lens as semantic variants:

Metarule SemArg(N1 → A2 N3) ≡ N1 → A4 N3

⟨A2 syn⟩ = ⟨A4 syn⟩
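The SemArg metarule can be illustrated with a toy synset table standing in for WordNet; the synsets below are invented for illustration, not real WordNet data:

```python
# Two adjective-noun phrases are conflated as semantic variants when
# their heads match and their adjectives share a synset.
SYNSETS = [{"rigid", "hard", "stiff"}, {"soft", "flexible"}]

def same_synset(a, b):
    return any(a in s and b in s for s in SYNSETS)

def semantic_variants(phrase1, phrase2):
    a1, n1 = phrase1.split()
    a2, n2 = phrase2.split()
    # Same head noun, synonymous adjectives -> same concept.
    return n1 == n2 and same_synset(a1, a2)
```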

FASTR involves a variety of NLP tools, ranging from a finite-state morphological processor through a parser for the extraction of terminological transformations to a lexical database. The high amount of manual work makes this approach quite complex and also expensive, but the results show that the efforts are worthwhile. The authors report that their indexing method has a very high precision (97.7%) and increases the recall rate from 72.4% (for simple indexing) to 93.4% after term conflation (Jacquemin and Tzoukermann 1999).

MOGA is a simple machine-learning-based system proposed by Wu and Agogino (2004) for the same task. They focused not on term conflation but on distinguishing terms specific to a given field from generic ones. For this purpose they use a measure called dispersion, computed for each candidate noun phrase as M = N/E, where N is the number of textual units that contain the phrase and E is their expected number. E is computed with a formula that takes into account the total number of textual units and the total number of occurrences of the phrase in the entire document collection. The lower the dispersion M, the higher the probability that the term is content-bearing.

The authors use a genetic algorithm to determine the best cutoff value for the dispersion. The resulting terminology set was evaluated by six human experts in the given field (engineering design), who considered over 90% of the extracted phrases to be related to the domain in question. The extracted phrases also cover 80% of the keyphrases assigned to this document collection manually. Although the purpose of this approach differs from that of the present project, it would be interesting to consider dispersion as a feature. Unfortunately, the definition of E is unclear, and the authors also do not clarify their definition of textual units: these could be either paragraphs or sentences in the documents.
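Since the definition of E is left unclear, the sketch below assumes a Poisson-style expectation, E = U * (1 - e^(-T/U)), for how many of U textual units would contain a phrase with T total occurrences if those occurrences were scattered at random; this is one plausible reading, not the authors' definition:

```python
import math

# Dispersion M = N / E. A phrase whose occurrences cluster in fewer
# units than chance would predict (M < 1) is more likely to be a
# content-bearing domain term.
def dispersion(n_units_with_phrase, total_occurrences, total_units):
    # Assumed Poisson expectation of units containing the phrase.
    expected = total_units * (1 - math.exp(-total_occurrences / total_units))
    return n_units_with_phrase / expected
```

Under this reading, a term occurring 10 times but concentrated in only 2 of 100 units gets a much lower M than one spread across 10 units.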

2.2 Keyphrase assignment

Keyphrase assignment approaches keyphrase indexing in an entirely different way. In contrast to keyphrase extraction, it uses a predefined set of keyphrases, a so-called controlled vocabulary. The characteristics of the documents themselves, rather than the individual phrases in them, are analyzed to find appropriate keyphrases from this vocabulary; a keyphrase may or may not appear in the document verbatim. The task of keyphrase assignment is better known as text classification or text categorization, defined as "labelling natural language texts with thematic categories from a predefined set" (Sebastiani 2002). In the following, keyphrase indexing with a controlled vocabulary is referred to as keyphrase assignment or text classification, depending on the context.

Scientists have been developing methods for automatic text classification for around fifty years. Until the late '80s knowledge-engineering experts manually created logical classification rules that were then automatically applied to electronic texts. Since the early '90s, this task has been explored mainly in the context of machine learning. Different inductive learning schemes have been used to analyze the characteristics of manually classified documents and build the classifier automatically.

A classifier is a function that maps a document vector to a confidence value, a number between 0 and 1, expressing how strongly the document belongs to a given class. A threshold is used to decide whether a particular document can be assigned to a given category or not. A classifier for the class interest could thus be:

if (interest AND rate) OR (quarterly) then confidence(interest) = 0.9
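The rule above can be written directly as such a function; the confidence value 0.9 and the 0.5 threshold mirror the illustrative rule, not a trained classifier:

```python
# A rule-style classifier for the class "interest": map a document's
# term set to a confidence in [0, 1], then threshold it to a decision.
def interest_classifier(terms):
    if ("interest" in terms and "rate" in terms) or "quarterly" in terms:
        return 0.9
    return 0.0

def classify(terms, threshold=0.5):
    return interest_classifier(terms) >= threshold
```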

The process of building classifiers starts with representation of the document in a vector space (cf. Section 1.4). Because of the high number of index terms in a document collection, the resulting vector space is high dimensional and requires a lot of computational resources. The number of dimensions can be reduced by keeping only those terms that are especially useful for the classification process. The usefulness value can be determined computationally with a specified formula (see Sebastiani (2002) for details).

Dumais et al. (1998) compared the effectiveness of classification when using more sophisticated techniques, such as treating multi-word phrases as single index terms (e.g. New York) or using shallow or deep parsing. None of these linguistically driven methods could improve the accuracy of the classifying functions. Lewis (1992), who had a similar experience, explained this by the low statistical quality of the linguistic indexing methods.

Different learning methods have been applied to text classification. Dumais et al. (1998) compared all main types of classifiers: Find Similar (vector similarity with relevance feedback), Decision Trees, Naïve Bayes, Bayes Nets and Support Vector Machines (SVMs). The SVM method achieved the highest average precision, 0.87, for classifying around 3,300 documents into 90 categories. This technique has the advantage that it can handle more dimensions than other methods, and no parameters need to be set. Another technique is to use so-called classifier committees, where a majority vote among different classifiers makes the decision. According to Sebastiani (2002), who summarized the performance of the different techniques, SVMs and classifier committees outperform all other methods.

These experiments were carried out on a widely used collection of Reuters news articles classified under categories related to economics; the number of classification labels is 135. This scenario is far removed from a real-world keyphrase assignment problem, where vocabularies comprise thousands of descriptors. Some authors use for their experiments the OHSUMED collection, which is a part of the MEDLINE document repository (Joachims (1998), Ruiz and Srinivasan (1999), Markó et al. (2003)). Every document in this collection has one or more manually assigned descriptors from the MeSH (Medical Subject Headings) vocabulary, which has in total over 20,500 main headings (descriptors) and over 120,000 synonymous non-descriptors. Joachims (1998) and Ruiz and Srinivasan (1999) use only a small fraction of this vocabulary, around 23 and 120 descriptors respectively.

Markó et al. (2003) consider the entire vocabulary for MeSH indexing. Their system combines a sophisticated approach to orthographical, morphological and semantic term normalization with a hybrid weighting technique. First, they map every word in the document collection and in the indexing vocabulary onto one or several subwords encoded in the lexicon constituting the most important part of the MorphoSaurus system. The mapping includes an orthographical normalization of equivalent spellings, morphological segmentation (leukemia becomes {leuk, em, ia}) and semantic mapping (equivalent forms are grouped in MorphoSaurus under the same ID). Second, they retrieve the normalized MeSH headings that appear in every normalized document and score them according to predefined rules. These scores are enhanced by probabilities acquired statistically from a training corpus. The probabilities are computed for every unordered trigram sequence of subwords that co-occurs with the MeSH headings of the particular document. The MeSH descriptors assigned to a given document are ranked according to their total scores, and a threshold value defines the resulting set of top-ranked keyphrases. An evaluation on 309 abstracts yielded a precision of 55% and a recall of 26% for the top 5 assigned keyphrases.
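The subword segmentation step can be sketched with a greedy longest-match against a toy lexicon; the entries and IDs below are invented stand-ins for the MorphoSaurus lexicon, not its actual contents:

```python
# MorphoSaurus-style normalization sketch: split a word into subwords
# via greedy longest-match, mapping each subword to a language-
# independent semantic ID. Lexicon entries are illustrative only.
LEXICON = {"leuk": "#LEUK", "em": "#BLOOD", "ia": "#COND",
           "cardi": "#HEART", "itis": "#INFLAM"}

def segment(word):
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # greedy longest match
            if word[i:j] in LEXICON:
                ids.append(LEXICON[word[i:j]])
                i = j
                break
        else:
            i += 1  # skip characters with no lexicon entry
    return ids
```

Two words in different languages that segment to the same ID sequence would be conflated, which is the source of the approach's language independence.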

The particular advantage of this approach is its language independence: MeSH headings can be assigned to documents in any language that is present in the MorphoSaurus lexicon. Unfortunately, creating such a lexical knowledge base requires diligent effort from people who must be experts in both linguistics and the particular domain (e.g. medicine).

2.3 Extraction vs. assignment

The previous section demonstrates that text classification works well as long as the documents belong to relatively few classes. For example, the text classification software TextCat has been successfully used in applications ranging from language identification (trained for 70 classes) to spam filtering (only 2 classes necessary). Keyphrase assignment is a more difficult version of text classification and cannot be solved with the same strategies, because the amount of training data required increases rapidly as the number of categories grows. Controlled vocabularies, even if used in restricted domains, usually consist of several thousand keyphrases. For example, the United States National Library of Medicine uses more than 22,500 MeSH descriptors for medical articles.

Keyphrase extraction is closely related to keyphrase assignment, but it is approached with a completely different strategy: it is based on the analysis of properties of keyphrases that can be computed relatively simply and quickly. It is notable that both sophisticated machine learning approaches and simple scoring of candidate phrases or their head nouns yield similar levels of accuracy (Barker and Cornacchia 2000). Their performance is still insufficient to replace the expensive manual work of human indexers. Although a rather large percentage of author-assigned keyphrases has reportedly been extracted automatically (e.g. Hulth (2004)), these numbers should be taken with a pinch of salt, because they do not contain any information about the non-matching keyphrases. When two professional indexers assign keyphrase sets to the same document, these sets usually also match only partially, but all keyphrases are highly related to the document and are definitely grammatical phrases. There is no such guarantee in automatic keyphrase extraction. In fact, extracted keyphrases are often too general or malformed. Even approaches enhanced by rudimentary linguistic techniques such as PoS pattern matching or NP chunking tend to extract useless or non-grammatical phrases, because they cannot avoid the errors generated by the linguistic tools.

While these technical errors could be reduced by using more accurate NLP tools, keyphrase extraction still has several disadvantages. For example, there is no homogeneity among the selected keyphrases, because the extraction process is restricted to the vocabulary of the document's author. Documents that describe the same subject in different but synonymous words (e.g. Seaweed culture and sea weed farming) receive different keyphrases and cannot be grouped together according to their content. Keyphrase assignment avoids this shortcoming by using a controlled vocabulary of indexing terms. Another problem of automatic keyphrase extraction is that it is restricted to the syntactic properties of phrases without considering their meaning. These algorithms also ignore the content of the document as a whole and therefore fail to cover all its topics in the keyphrase set. Keyphrase assignment is more advanced in this sense, because it analyzes the content of the document by taking into account the co-occurrence statistics between terms.

These observations show that both keyphrase assignment and extraction have their advantages, but also their limitations. Combining the two approaches into a single keyphrase indexing system would allow one to benefit from both of them and at the same time avoid their shortcomings. The following section starts with a short insight into the indexing strategy followed by librarian experts. On the one hand, they need to understand the content of the document first; on the other hand, their choice of keyphrases is restricted by cataloguing rules. The process of understanding the content of the indexed document has not yet been explored in the context of automatic indexing. Indexing theory and the work on text comprehension in cognitive linguistics might provide some insight for improving current automatic keyphrase indexing strategies.

2.4 Indexing theory and cognitive processes

The goal of keyphrase indexing in library science is to organize library holdings based on their content and to provide fast and unified access to them. Given a document, the task of the indexer is to assign a keyphrase set that reflects its main topics in an exact and precise way and to keep the indexing consistent among all of the documents in the collection.

David et al. (1995) define the following key stages in performing this task:

1. Scanning of document: In this stage the indexer recognizes the document's nature (e.g. research report) and its main parts by using his professional knowledge.

2. Content analysis: In this stage the indexer uses his domain knowledge to determine what the document is about.

3. Concept selection: Among all concepts appearing in the document the indexer chooses those that are representative for the given document.

4. Translation into descriptors: This stage is only important for controlled indexing where central concepts need to be translated into descriptors.

5. Revision: The indexer adjusts his selection to his work environment by comparing his index terms to similar documents in the database.

The first and the last two stages are specific to the task of professional indexing. The main phase of the indexing task consists of content analysis and concept selection, which both correspond to the general process of text comprehension. The way in which humans understand a document's content is an enormously complex process. Theories in cognitive linguistics describe this process as an interaction between the perception organs, text memory and long-term memory consisting of general knowledge, episodic memory, frames etc. (van Dijk and Kintsch 1983, p. 347ff). According to van Dijk and Kintsch (1983), large amounts of knowledge that are not expressed in the text itself need to be accessed and retrieved in order to construct a mental representation of the text's content in one's memory and to understand it. Consider the example text in 2.1:

"The man went to the train station. After he bought a one-way ticket, he went to the platform and got into the train."

While perception and text memory are involved, in a modified form, in current solutions for automatic keyphrase extraction and assignment, long-term memory is not represented in any of the known approaches. To support systems involving automatic text understanding with this kind of external knowledge, an automatically accessible knowledge base, represented in a flexible and usable form, is required. A popular type of such knowledge base is the semantic network, in which concepts are connected according to the semantic relations between them. Besides the relatively simple and intuitive representation of conceptual knowledge, semantic networks also support another activity involved in text understanding: spreading activation (Collins and Loftus 1975). When a particular concept (node) in the network is activated (e.g. when the indexer reads a particular term in the text), other nodes related to it are activated as well. Activating nodes helps to avoid ambiguities and provides a framework that represents the scope of the text's content and its relatedness to the entire knowledge base.

This process is crucial for determining the concepts in a document. Concepts, in turn, build up propositions that denote the meaning of sentences and represent the text base (e.g. P1 to P7 in 2.1). According to Kintsch and van Dijk's theory of text understanding (Kintsch and van Dijk 1978), propositions are then mentally transformed into a macro-structure representing the gist of the text. This transformation includes the recursive application of so-called macro-rules, such as deletion (elimination of irrelevant propositions), generalization (converting a series of specific propositions into a more general one) and construction (constructing a new proposition from a series of propositions given in the text by activating world knowledge). The macro-rules can be represented as edges that connect the propositional nodes into a network representing the macro-structure (cf. 2.2).

By applying macro-rules, new macro-propositions (e.g. M1 and M2 in 2.2) can be constructed. Each macro-proposition summarizes the meaning of the propositions from which it is constructed. 2.2 shows a macro-structure that could be created from the text in 2.1. The form of the macro-structure depends on the particular content of the text and on the long-term knowledge of a given person. For example, 2.2 represents the construction rule that transforms the propositions P1, P6 and P7 and the macro-proposition M1 into a new macro-proposition M2. This macro-proposition is the theme of the given text snippet and could be expressed by the phrase taking the train. Although this information is not explicitly available in the text, the reader is able to deduce it from his world knowledge. David et al. (1995) claim that these operations take place when professional indexers perform the second indexing stage, the content analysis.

The third point, selecting the main concepts in a document, can also be viewed from the perspective of the text comprehension process. Given a longer text with more than one global macro-proposition, the question is which of them are the most important. Kintsch and van Dijk (1978, p. 78) proposed different strategies for how humans identify the key subjects in texts. For example, propositions that are more frequent are obviously more important; frequency is also a common feature in keyphrase extraction algorithms. Another interesting observation made by Kintsch and van Dijk is that macro-propositions with a higher number of incoming edges are more significant. If one considers the semantic content of the text not as a macro-structure but as a semantic network, this observation means that the nodes with the highest number of relational links to other nodes in the document are the most relevant for its content.
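This degree heuristic can be sketched on a small concept network; the example edges below are invented for illustration:

```python
from collections import Counter

# Rank concept nodes in a semantic network by how many relational
# links they have to other nodes: the highest-degree nodes are taken
# as the most relevant to the document's content.
def rank_by_degree(edges):
    degree = Counter()
    for a, b in edges:  # undirected edges as (node, node) pairs
        degree[a] += 1
        degree[b] += 1
    return [node for node, _ in degree.most_common()]
```

For the train-station example, the concept with the most links (train) would surface first, matching the intuition that it anchors the snippet's theme.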

The main goal in developing text-processing approaches, such as keyphrase extraction, is to supersede expensive and difficult intellectual human work. The natural way of designing such methods would be to simulate cognitive processes used for the same tasks by humans.

The background in indexing theory and cognitive linguistics presented briefly in this section provides new ideas for how the keyphrase indexing task can be solved automatically. An ideal keyphrase indexing system needs, first of all, to detect the main concepts within a text. Second, the semantic relations between the document's concepts and other information available to the system (e.g. general world knowledge or domain-specific statistics) should be determined. Third, this information on concepts and the relations between them needs to be analyzed to create the final set of the most significant terms in a particular document.

The most difficult task in designing such a system is extending it with a large but flexible database containing background knowledge. In computer science, thesauri and ontologies are typically used to encode world knowledge into computers. A thesaurus is a database with entries representing concepts connected by semantic relations such as synonymy, antonymy, hyper-/hyponymy, mero-/holonymy and others. One of the most popular thesauri, WordNet, developed in the Cognitive Science Laboratory at Princeton University, contains more than 150,000 words organized into over 110,000 groups representing concepts. WordNet became primarily a practical tool for knowledge-based natural language applications and has been explored successfully for different tasks, such as information retrieval, word sense disambiguation and knowledge engineering (Fellbaum 1998). Using domain-restricted thesauri allows one to limit the number of meanings per term and to reduce the size of the required knowledge database.