Non Linear Phrase Based Document Clustering Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract. Document Clustering plays a vital role in providing better results for document retrieval and document mining. Conventional techniques, such as vector space model have not been able to satisfy the Information Retrieval needs. They use word counting and frequency distribution to cluster documents. These techniques do not consider the semantic relationship between the words. In Phrase based model, word sequences are given importance, as the meaning of natural language strongly depends on it. Phrase based model addresses the semantic relationship between words but fails to address the semantic relationship between phrases. The proposed technique, Non linear Phrase block addresses it.

Key words. Document Clustering, Phrase based Indexing, Non Linear Phrase Indexing, word sequences, similarity, suffix tree.

1 Introduction

World Wide Web has brought unlimited data at our door step. Almost all the documents have electronic copies and are available online. We can find documents of all domains at internet. Whether the topic is Science, Engineering, Medicine, Entertainment, Sports or History, the data we require is just a click away. If we require Stephen Covey's "The 7 Habits of Highly Effective People", all we have to do is type in the title and author's name in the search engine and we will find the document. Searching is not that easy if we do not the exact title of the document. Problem arises when we want to find documents similar to a particular document or documents related to a topic of our interest.

World Wide Web can be considered as a library having countless books. These books are useless if a reader cannot find the book he is looking for. Similar to library, in order to facilitate the reader it is important that the documents are analyzed and classified into set of categories.

Document classification can be supervised or unsupervised. In supervised classification, feedback by some external entity (expert) is given on the correct classification of documents. In unsupervised classification, there is no involvement of external entity in the classification of documents.

Unsupervised document classification in which the number of groups is automatically detected is referred as Document Clustering. Document Clustering aims to discover natural grouping among documents in such a way that documents with in a cluster is similar to one another and are dissimilar to documents in other clusters.

Exploring, analyzing and correctly classifying the unknown natures of data in a document without supervision is the basic requirement of Clustering. Documents can be clustered using two techniques Hard Partition clustering and Hierarchical clustering. Hard Partition clustering directly divides documents into clusters unlike Hierarchical clustering that organizes clusters in form of a tree.

Agglomerative hierarchical clustering is a bottom-up clustering technique in which clusters generated in early stage are nested to form clusters of later stage. Vice versa technique of top down hierarchical clustering is called Divisive Hierarchical clustering. Fuzzy set theory introduces concept of C-means clustering in which an object may belong to many or all clusters with a degree of membership.

Basic questions that arise in Document clustering are: (1) What clustering technique should be used to measure the similarity between the documents (2) What suitable model should be used to represent the documents (3) How many meaningful clusters are present in the data (4) Whether the cluster should be overlapping or not

2 Literature Review

New clustering techniques are modifications of traditional clustering algorithms. Traditional clustering algorithms were designed for Relational Databases and OLTP systems. They have become impractical in real world web document clustering.

The special requirements of clustering web documents are [1]:

High dimensionality: number of relevant terms in a document are in thousands

Scalability: real world data set contains millions of documents

Accuracy: Documents within the same cluster should be similar and documents across different clusters should be dissimilar

Meaningful cluster description: the topic hierarchy should provide sensible structure.

Prior domain knowledge: Clustering algorithms should not be sensitive to input parameters.

2.1 Bisecting k means

To select a Document clustering technique, we have to first decide whether to use hard partitioning or hierarchical clustering.

Agglomerative Hierarchical clustering is usually preferred but Stienbach, Karypis and Kumar [5] proposed otherwise. They proposed that bisecting k-means, a variation of k means (hard partition) is an efficient process of clustering documents as compared to Agglomerative hierarchical clustering.

2.2 Frequent Item set

Wang introduced the concept for clustering documents using frequent item sets rather than the conventional distance function. Documents having more common frequent items are more close to each other and are thus clustered together. [10]

Hierarchical Frequent Term based Clustering (HFTC) is modification of Frequent Term Clustering. HFTC greedily selects the next frequent item set, which represents the next cluster. HFTC is comparable to Bisecting k means in accuracy but is not scalable. [1]

Frequent item based Hierarchical clustering (FIHC) satisfies all Document clustering special requirements. Conventional hierarchical and partitioning methods are document clustered. The similarity between documents plays a vital role in constructing a cluster. FIHC is "cluster-centered" in that it measures the cohesiveness of a cluster directly using frequent item sets. [1]

Consider two frequent terms "apple" and "windows". When "apple" is present in a document, the document may relate to fruits and when term "windows" is present, the document may relate to "furniture". But when "apple" and "windows" occur together, the document may relate to Operating System.

Discovering the hidden topics using the frequent terms is the goal of FIHC. It uses these semantic topics to cluster documents. An advantage that FIHC has is that it provides label to clusters.

2.3 Suffix Tree Clustering

The sequence of words plays a vital role in documents. Consider two documents D1 and D2 having text:

D1: "I love friendship and hate war"

D2: "I love war and hate friendship"

These documents have very different semantic meanings. Most of the clustering algorithms use the vector space model which treats documents as bag of words. Such algorithms will categorize the above documents are similar as both have same bag of words {love, friendship, hate, war}. The word sequences are thus ignored. These sequences are critical in understanding the semantics of a document.

Phrase based models preserves the sequential order of terms, hence conserving the semantics of the document. The idea of word sequence based clustering was proposed by Zamir and Etzoini.[9] Suffix Tree Clustering (STC) is based on this approach. STC does not treat document as a bag of words rather considers it as a strings. It uses the proximity of words to categorize documents. It builds a suffix tree to identify cluster of documents that share similar documents.

STC is a linear time clustering algorithm that is based on identifying the phrases that are common to group of documents.

2.4 CFWS and CFWMS

STC does not reduce high dimensionality of text documents; hence its complexity is quite high for large databases. Yanjun Li, Chang and Holt proposed two new clustering algorithms named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). CFWS and CFWMS reduce their dimensionality by only forming suffix tree of frequent items in Document (Generalized Suffix Tree).

Due to the use of synonyms many similar phrases are not recognized by CFWS. CFWMS can provide more compact and accurate information about these documents. It treats documents as a string of word meaning rather than as a bag of words.

CFWMS have better accuracy than Bisecting K means and FIHC [12].

2.5 Document Index Graph Clustering

Suffix tree can produce quality clusters but have high number of redundancies. An alternative approach is Document Index Graph (DIG) Clustering.

The model consists of four components [2]:

1. A Web document restructuring scheme that identifies different document parts, and assigns levels of significance to these parts according to their importance.

2. A phrase-based document indexing model, the Document Index Graph (DIG) that captures the structure of sentences in the document set. The DIG model is based on graph theory and utilizes graph properties to match an n-length phrase from a document to other documents.

3. Similarity measure that uses the combination of Single term and Phrase similarity scores and the matching phrases / terms' significance.

4. An incremental document clustering method to form new quality clusters.

3 Non linear Phrase Document Clustering

Non linear Phrase based Document clustering addresses the semantic relationship between words and also between phrases. Natural language strongly depends on it. There are four basic steps for Non linear Phrase Document Clustering: (1) Document Preprocessing (2) Phrase Representation (3) Measuring Phrase Block Similarity (4) Assigning Documents to clusters.

First step is the Document pre processing step. In it we categorize the text in the Web Document into categories according to structure provided in the document, such as the Headings and the text under the headings are in separate categories. After categorization, the sentences are extracted from the text. Then the words in the sentence are stemmed.

Second step is to capture the structure of the sentences using a model. This structure should be flexible enough so that we could build phrases and phrase blocks from it. To cater the high dimensionality of documents we can use only frequent phrase blocks.

Third step is to build a similarity measure for scoring the similarity between documents. The similarity measure should be based on the matching phrase blocks.

Final step is to build an incremental document clustering method based on Agglomerative hierarchical technique.

Cluster Evaluation


Document Preprocessing

Suffix Tree Representation

Cluster Formation

Non Linear

Phrase Similarity

Similarity Measure

Fig. . Non Linear Phrase based Document Clustering

3.1 Document pre processing

An advantage of web document is that it is semi structured. Along with the text of the documents, we can retrieve significance of the text as well. Web document restructuring based on html tags is originally proposed by Hamouda.

A document can be divided into three categories, with category 1 being most significant and category 3 being least. (1) Category 1 contains the Tile, Headings and the sub headings of the document. (2) Category 2 contains the text in the document which is bold, italic or underlined. It also contains description of the figures. (3) Category 3 contains the remaining text.

In other words, a document can be viewed as sequence of categories and a category can we viewed as sequence of sentences.

di = <ci1, ci2, ci3>


ci = <si1, si2, si3, …>


Where di is the document at index I and ci1, ci2, ci3 are its categories and sj1, sj2, sj3 are the sentences of a document category. Many sentence extraction algorithms are available to extract sentences from Text.

After extraction of sentences it is essential to stem the words in a sentence. Porter Stemmer's algorithm is mostly used for stemming.

A sentence can be viewed as sequence of phrases and a phrase can be viewed as a sequence of words.

si = <pi1, pi2, pi3, …>


pi = <wi1, wi2, wi3, …> = ∑ wij


Where p1, p2, p3 are the phrases of a sentence and w1, w2, w3 are the word sequence of a phrase. A phrase can also be written as:

pi = ∑ wij


The above equation shows that a phrase at index i, is a sequence of words having count j. A phrase block is a sequence of phrases.

bi = <p1, p2, p3, …>


Where p1, p2, p3 are the phrase in a block bi.

Applying (5) in (6) we get

bi = < ∑wij, ∑wik, ∑wil, …>


Hence, a phrase block is an m x n dimensional matrix; in which m is the number of phrases per block and n is the number of words per phrase.

3.2 Non linear Phrase Representation

To represent a document in non linear phrases from the word sequences of the database, there are two models that we could have adopted. One is Suffix tree implementation and other is Document Index Graph (DIG). Suffix tree is well known structure for pattern matching, but consumes great amount of memory. Document Index Graphs are more memory efficient but their phrase retrieval is slower. So we adopted Suffix tree. We can use Generalized Suffix tree for high dimensional documents.

For linear phrase document clustering, each compact document is treated as a sequence of words. We insert each word into the Suffix tree. For non linear document clustering, each document will be treated as a sequence of phrase blocks and we will insert phrase blocks into the suffix tree.

There are four parameters to consider while building a Phrase block. (1) First is the word sequences per phrase (2) Second is the word sequence proximity (3) Third is the phrase per block and (4) fourth is the phrase proximity.

The first parameter tells us that how many words are to be placed in each phrase. Consider the sentence: "young boy love play cricket". Two words per phrase for this sentence will be: {"young boy", "boy love", "play love", "play cricket"}. Three words per phrase will be {"young boy love", "boy love play", "love play cricket"}.

The second parameter tells us if any word should be skipped in order to make a sequence. Conventional phrase based approached do not allow to skip words of a sequence, i.e. the conventional technique have word proximity 1. Two words per phrase with word proximity 1 are given above. Two words per phrase with word proximity 2 are {"young love", "boy play", "love cricket"}.

The third parameter tells us that how many phrases should be present in a block. For the above example with words per phrase 2, words proximity 1 and phrases per block 2 is: {"young boy", "boy love"}, {"boy love", "play love"}, {"play love", "play cricket"}.

The fourth parameter allows us to skip phrases from a block. For the above example with words per phrase 2, word proximity 1, phrases per block 2 and phrase proximity 2 will be: {"young boy", "play love"}, {"boy love", "play cricket"}.

Pseudo code for building Suffix tree from Documents is defined below

Algorithm: Suffix Phrase Building

Input: D: {d1, .., dn} documents to cluster

Output: T: {T1, .., Tn} Suffix tree of Phrase blocks

for each document di in D

Ti ƒŸ Empty Tree

Si: {si1, .., sin} ƒŸ Parse_Sentences (di)

for each sentence sij in Si

Si' ƒŸ Remove_Stop_Words (Si)

for each word wijk in Si'

wijk = Stem_Words (wijk)

Phrase <- wijk

for (b=0 to wordsProximity) do

for (c=1 to wordsPerPhrase) do

m ƒŸ k + b + c

Phrase <- Phrase U wijm

end loop;

end loop;

if Phrase is not in Ti

Add Phrase to Ti

end if;

Update Phrase Reference in Metadata

end loop;

end loop;

3.3 Phrase Block Similarity Measure

Non linear Phrase tries to semantically measure the similarity between the documents. This measure does not use individual words or phrases. Rather it uses the sequence of phrases. Cosine and Jaccard measures are used for single word based similarity. The similarity for them is:

Similarity (di ,dj) = (di U dj) / (di ∩ dj)

In other words, Word Similarity between two documents is based on the ratio of how many words overlap to their union in terms of words.

For non linear phrase similarity, the similarity measure is a function of the following factors:

Number of matching Phrase blocks

The dimensions of the Phrase block (Number of words / phrase and Number of phrases / block)

The proximity measure (word / phrase proximity)

The Document category level


The measure is similar to the work done in CFWS, CFWMS [11] and DIG [2]. The difference is that Phrase blocks are used instead of Phrases. Pseudo code for Block Phrase similarity measure can be defined as:

Algorithm: Non Linear Phrase Matching

Input: D: {d1, .., dn} documents to cluster

T: {T1, .., Tn} Suffix tree of Phrase blocks

Output: S: {{S11, .., S1n}, .., {Sn1, .., Snn}}Similarity Matrix of Documents

for each document di in D

for each document dj in D

Sij ƒŸ0

for each matching phrase Pk in di and dj

for (b = 0 to phraseProximity)

c ƒŸ0

while (Pk+c match in di and dj)

Sij ƒŸ Sij + b + c*(phraseProximity - b)

c ƒŸ c + 1

end while

end loop

end loop

Sij ƒŸ (Sij + wordsPerPhrase + phrasePerBlock

+ documentCategory) / Max_Similarity

Sij ƒŸ Sji

end loop

end loop

3.4 Cluster creation and merging

Agglomerative Hierarchical Clustering is the technique used for Cluster creation and merging. It is a bottom up clustering technique. In the earliest stage we assign each document to a separate cluster. Then we merge the cluster based on their similarity (distance) with other clusters. The merging continues until only one cluster remains. Quality of each cluster level can be evaluated using F Measure, Purity and Entropy.

Pseudo code for the algorithm is:

Algorithm: Agglomerative Cluster


S: {{S11, .., S1n}, .., {Sn1, .., Snn}} Similarity Matrix of Documents


F_Measure: {f1, .., fn} F-Measure for each Cluster level

Purity: {p1, .., pn} Purity for each Cluster level

Entropy: {p1, .., pn} Purity for each Cluster level

Cluster_Level := Document_count

for each cluster ci in C

ci ƒŸ di

Evaluate_Cluster(FMeasure, Purity, Entropy)

end loop

while (true) do

Min_Cluster ƒŸ Empty

for each cluster ci in C

for each cluster cj in C

d ƒŸ Distance_Measure(ci, cj)

if (d < Min_Cluster.distance)

Min_Cluster ƒŸ ci U cj

end if

end loop

end loop

Merge_Cluster(ci, cj)

Cluster_Level ƒŸ Cluster_Level - 1

Evaluate_Cluster(FMeasure, Purity, Entropy)

end loop


4 Experimental Evaluations

In order to evaluate the quality of clusters produced from our clustering algorithm, we have implemented our algorithm in C#.

Two groups of data sets are used. One is the News20 group and other is of Reuters. News 20 group have around 20,000 documents of 20 topics.

Quality of the clusters can be evaluated using three Informational Retrieval measures (1) F Measure (2) Purity (3) Entropy.

F Measure is calculated with the help of Precision and Recall. Precision for a cluster calculates the fraction of total relevant documents retrieved with all retrieved documents. Mathematically, Precision for a cluster can be expressed as:

p(i) = (NRL ∩ NRT) / NRT


Where, NRL are the total documents retrieved and NRT are the total documents relevant. Recall calculates the fraction of the relevant retrieved documents with all relevant documents. Mathematically, Recall for a cluster can be expressed as:

r(i) = (NRL ∩ NRT) / NRT


F Measure is the weighted harmonic mean of Precesion and Recall. F Measure for a cluster can be calculated as:

F(i) = (2 p(i) r(i)) / (p(i) + r(i) )


Purity is the evaluation criteria in which cluster quality is calculated by considering the correctly classified documents of a cluster. Mathematically, Purity for a cluster can be expressed as:

P(i) = ∑ I maxJ |wI ∩ cJ| / N


Where, w is the cluster and c and is the class. Entropy tells us how homogeneous a cluster is.

Mathematically, Entropy for a cluster can be expressed as:

H(i) = - ∑ K P(wK) Log P(wK)


Where, P(wK) is the probality of document being in cluster wK.

From the News 20 data set, we selected 84 documents from 7 different classes, 12 from each class. The seven classes were: (1) Atheism (religion) (2) Computer Graphics (3) Sports Hockey (4) Computer Systems IBM Hardware (5) Computer Systems MAC hardware (6) Computer Systems Windows (7) Politics

The count of data extracted from the data set after pre processing is as follows





Cumulative Phrases

Block Phrases

Cumulative Block Phrases











































Column "Parameters" is the algorithm input parameter (Words per Phrase - Phrase per Block - Words Proximity - Phrase Proximity).

The results shown below in the graphs are a comparison of three techniques:

Series 1 - Bag of words similarity (words per phrase = 1 and phrase per block = 1), Series 2 - Phrase based similarity (words per phrase = x and phrase per block = 1) and Series 3 - Non linear similarity (words per phrase = x and phrase per block = y).

Fig. . F Measure comparison

Fig. . Purity comparision

Fig. . Entropy Comparison

We can see from the comparison that Non linear Phrase produces better F measure and Purity and lower Entropy than the other two approaches.

5 Conclusions and Future Work

Non linear Phrase based model is an attempt to improve Document clustering algorithm. Unlike linear Phrase based model, our model not only utilizes the sequential patterns of the words in the document but also the sequential pattern of phrases in the document. We used Suffix tree to represent Non linear Phrases. The memory performance of Suffix trees is not very scalable for high dimension and high volume data. To resolve this we can use Generalized Suffix Tree (GST) or Document Index Graph (DIG). We can further use word meaning sequences and phrases to further improve our similarity measure. For this we can use Word Net ontology for pre processing our document.