# Implementation of Ranking of Documents Using Term Frequency / Inverse Document Frequency


A search engine is the most important application of Information Retrieval. A search engine returns its results after ranking documents with some algorithm: more relevant documents are given a higher rank than less relevant ones. Documents can be ranked with different techniques; Term Frequency / Inverse Document Frequency (TF/IDF) combined with the Vector Space Model is the most widely used, and every search engine applies it in some form. In this paper we combine TF/IDF with the Vector Space Model and implement the combination in Java to rank documents. The similarity between each document and the query is calculated with the cosine distance of the Vector Space Model.

Term Frequency (TF) is the number of occurrences of a term in a document. Simple term frequency considers only local information within a document. Term counts are taken with reference to a query: a document containing a higher number of the query terms is considered more relevant than the others [1].

Basic term frequency alone cannot effectively discriminate between relevant and non-relevant documents; some global information must also be calculated over the corpus of documents. Inverse Document Frequency (IDF) is used to calculate the relevance of a term within the corpus. IDF is based on the concept that the rarer a term is within the corpus of documents, the more power that term has to determine the relevance of a document [6]. With TF/IDF, stop words get lower weights and rare query terms get higher weights. The weight of a term in the TF/IDF model is calculated by

wi = tfi × log(D / dfi)    (1)

where tfi is the number of occurrences of term i in a document, D is the total number of documents, and dfi is the number of documents containing term i [1]. (The logarithm here and below is taken to base 10.)

1.2 Vector Space Model

Using only the TF/IDF model, longer documents get unfairly high weights, because terms occur more often in longer documents than in shorter ones. To remove this problem, the Vector Space Model is used and the documents are normalized [5]. The cosine distance is used to calculate the similarity between the document vectors and the query vector:

Cosine θDi = (Q · Di) / (|Q| × |Di|)    (2)

where Q · Di is the dot product of the query weights with the document weights, |Q| is the length of the query vector, and |Di| is the length of the document vector. The vector lengths are used for normalization, so that longer documents cannot take unfair advantage of their weights compared to shorter documents [7].

The cosine distance expresses the similarity of a document vector with the query vector:

Sim (Q, Di) = Cosine θDi    (3)

2 Experiment

We used Java to implement the combination of TF/IDF and the Vector Space Model. One module, Documents, converts the documents into a set. An index is created when the documents are supplied at run time. Each document is checked against the index: if a term of the document matches an index term, that term is marked true for the document, otherwise false. Once we know which index terms appear in which documents, we calculate tf, idf, the weights, and the ranks of the documents in the class SearchEngine. At run time there are three documents and one query. The terms of the documents are indexed into a file created with Java, and these terms are then matched against the query. The following steps are used for ranking the documents.

2.1 Indexing

Terms are indexed into a file using Java. When documents are added to the set, the index file is updated automatically.

Suppose there are three documents

D1 = "There are two types of ranking"

D2 = "First is static"

D3 = "Second is dynamic method"

The query is

Q = "Dynamic ranking method"

The index is created in a file during the run, and this index is used for comparison with the query terms.

"are","dynamic","first","is","method","of","ranking","second","static","there","two","types" is saved in a file.

2.2 TF

After the index is created, the next step is to count the terms in each document and in the query. tfi is the number of times term i occurs in a document or in the query. From the tfi values, dfi can be derived: dfi is the number of documents containing term i.

| Terms   | q | D1 | D2 | D3 | dfi |
|---------|---|----|----|----|-----|
| are     | 0 | 1  | 0  | 0  | 1   |
| dynamic | 1 | 0  | 0  | 1  | 1   |
| first   | 0 | 0  | 1  | 0  | 1   |
| is      | 0 | 0  | 1  | 1  | 2   |
| method  | 1 | 0  | 0  | 1  | 1   |
| of      | 0 | 1  | 0  | 0  | 1   |
| ranking | 1 | 1  | 0  | 0  | 1   |
| second  | 0 | 0  | 0  | 1  | 1   |
| static  | 0 | 0  | 1  | 0  | 1   |
| there   | 0 | 1  | 0  | 0  | 1   |
| two     | 0 | 1  | 0  | 0  | 1   |
| types   | 0 | 1  | 0  | 0  | 1   |
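The tf and df counts in the table can be reproduced with a small Java sketch (the class and method names are our own, and whitespace tokenization is assumed):

```java
import java.util.*;

// Count term frequency (tf) per document and document frequency (df) per term.
public class TermCounts {
    // tf: number of occurrences of a term in one document.
    public static int tf(String term, String doc) {
        int count = 0;
        for (String t : doc.toLowerCase().split("\\s+")) {
            if (t.equals(term)) count++;
        }
        return count;
    }

    // df: number of documents that contain the term at least once.
    public static int df(String term, List<String> docs) {
        int count = 0;
        for (String doc : docs) {
            if (tf(term, doc) > 0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "There are two types of ranking",
            "First is static",
            "Second is dynamic method");
        System.out.println("tf(is, D2) = " + tf("is", docs.get(1))); // 1
        System.out.println("df(is)     = " + df("is", docs));        // 2
    }
}
```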

2.3 IDF

Term frequency carries only local information: the occurrence of terms within documents. To capture global information, IDF is required [8]. Inverse Document Frequency counts the number of documents containing a term and compares it with the total number of documents. IDF ensures that a term appearing in fewer documents has higher discriminative power to differentiate between relevant and non-relevant documents.

| Terms   | IDFi = log(D/dfi) |
|---------|-------------------|
| are     | 0.4771            |
| dynamic | 0.4771            |
| first   | 0.4771            |
| is      | 0.1761            |
| method  | 0.4771            |
| of      | 0.4771            |
| ranking | 0.4771            |
| second  | 0.4771            |
| static  | 0.4771            |
| there   | 0.4771            |
| two     | 0.4771            |
| types   | 0.4771            |
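A minimal Java sketch of the IDF computation, assuming the base-10 logarithm that the values above imply (illustrative class name):

```java
// IDFi = log10(D / dfi): rarer terms get larger values.
public class IdfCalc {
    public static double idf(int totalDocs, int docFreq) {
        return Math.log10((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        System.out.printf("idf for df=1: %.4f%n", idf(3, 1)); // 0.4771
        System.out.printf("idf for df=2: %.4f%n", idf(3, 2)); // 0.1761
    }
}
```

With three documents, a term in only one document gets log10(3) = 0.4771, while "is", which occurs in two documents, gets log10(3/2) = 0.1761.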

2.4 Weights of Terms

The weight of a term combines local and global information. The local information is the occurrence of the term in a document or in the query; the global information is the occurrence of the term across the corpus of documents, i.e. how important that term is. Following (1), the weight is the product tfi × IDFi.

| Terms   | wq     | wD1    | wD2    | wD3    |
|---------|--------|--------|--------|--------|
| are     | 0      | 0      | 0      | 0      |
| dynamic | 0.4771 | 0      | 0      | 0.4771 |
| first   | 0      | 0      | 0.4771 | 0      |
| is      | 0      | 0      | 0.1761 | 0.1761 |
| method  | 0.4771 | 0      | 0      | 0.4771 |
| of      | 0      | 0.4771 | 0      | 0      |
| ranking | 0.4771 | 0.4771 | 0      | 0      |
| second  | 0      | 0      | 0      | 0.4771 |
| static  | 0      | 0      | 0.4771 | 0      |
| there   | 0      | 0.4771 | 0      | 0      |
| two     | 0      | 0.4771 | 0      | 0      |
| types   | 0      | 0.4771 | 0      | 0      |
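The weights above are the product of the local tf count and the global IDF value; a minimal Java sketch (our own names):

```java
// w = tf * log10(D / df): local frequency scaled by global rarity.
public class WeightCalc {
    public static double weight(int tf, int totalDocs, int docFreq) {
        return tf * Math.log10((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        // "dynamic" occurs once in D3 and appears in one of the three documents.
        System.out.printf("w(dynamic, D3) = %.4f%n", weight(1, 3, 1)); // 0.4771
        // "is" occurs once in D2 but appears in two documents.
        System.out.printf("w(is, D2)      = %.4f%n", weight(1, 3, 2)); // 0.1761
    }
}
```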

2.5 Vector Lengths

So far the TF/IDF model was used; now the documents and the query are treated as vectors. To calculate the similarity, the lengths of the document and query vectors are required [7]. The vector length is calculated by:

|Di| = √( Σj wj,Di² )    (4)

|Q| = √( Σj wj,q² )    (5)

So by using (4) and (5) with the weights from the table above, the vector lengths in our experiment are:

|Q| = 0.8264, |D1| = 1.0668, |D2| = 0.6973, |D3| = 0.8449
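The vector-length computation of (4) and (5) can be sketched as (illustrative class, weights taken from the table in Section 2.4):

```java
// |V| = sqrt(sum of squared term weights) — used to normalize document length.
public class VectorLength {
    public static double length(double[] weights) {
        double sumSquares = 0.0;
        for (double w : weights) sumSquares += w * w;
        return Math.sqrt(sumSquares);
    }

    public static void main(String[] args) {
        // Only the non-zero weights contribute to the length.
        double[] q  = {0.4771, 0.4771, 0.4771};         // dynamic, method, ranking
        double[] d3 = {0.4771, 0.1761, 0.4771, 0.4771}; // dynamic, is, method, second
        System.out.printf("|Q|  = %.4f%n", length(q));  // 0.8264
        System.out.printf("|D3| = %.4f%n", length(d3)); // 0.8449
    }
}
```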

2.6 Cosine Similarity

The similarity is calculated from the cosine of the angle between the two vectors, i.e. the document vector and the query vector, using (2) [7].

First, the dot product between each document vector and the query vector is required.

Q · D1 = 0.4771 × 0.4771 = 0.2276

Q · D2 = 0

Q · D3 = 0.4771 × 0.4771 + 0.4771 × 0.4771 = 0.4552

Now, using (2), the cosine similarity can be calculated.

Sim (Q, D1) =0.2582

Sim (Q, D2) =0

Sim (Q, D3) =0.6520
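Putting the pieces together, a Java sketch of the cosine similarity of (2), using the weight vectors exactly as given in the table of Section 2.4 (our own illustrative class):

```java
// Cosine similarity: dot product of query and document weights,
// divided by the product of the two vector lengths.
public class CosineSim {
    public static double cosine(double[] q, double[] d) {
        double dot = 0.0, qSq = 0.0, dSq = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            qSq += q[i] * q[i];
            dSq += d[i] * d[i];
        }
        if (qSq == 0 || dSq == 0) return 0.0; // no shared terms or empty vector
        return dot / (Math.sqrt(qSq) * Math.sqrt(dSq));
    }

    public static void main(String[] args) {
        // Term order: are, dynamic, first, is, method, of, ranking,
        //             second, static, there, two, types.
        double w = 0.4771;
        double[] q  = {0, w, 0, 0,      w, 0, w, 0, 0, 0, 0, 0};
        double[] d1 = {0, 0, 0, 0,      0, w, w, 0, 0, w, w, w};
        double[] d2 = {0, 0, w, 0.1761, 0, 0, 0, 0, w, 0, 0, 0};
        double[] d3 = {0, w, 0, 0.1761, w, 0, 0, w, 0, 0, 0, 0};
        System.out.printf("Sim(Q,D1) = %.4f%n", cosine(q, d1)); // 0.2582
        System.out.printf("Sim(Q,D2) = %.4f%n", cosine(q, d2)); // 0.0000
        System.out.printf("Sim(Q,D3) = %.4f%n", cosine(q, d3)); // 0.6520
    }
}
```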

2.7 Ranking

A higher similarity value means a more relevant document for the given query. So,

Rank 1: Doc3

Rank 2: Doc1

Rank 3: Doc2
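The final ranking is just a sort by descending similarity; a minimal sketch (illustrative class name):

```java
import java.util.*;

// Sort documents by descending cosine similarity to produce the final ranking.
public class Ranker {
    public static List<String> rank(Map<String, Double> sims) {
        List<String> docs = new ArrayList<>(sims.keySet());
        docs.sort((a, b) -> Double.compare(sims.get(b), sims.get(a)));
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Double> sims = new HashMap<>();
        sims.put("D1", 0.2582);
        sims.put("D2", 0.0);
        sims.put("D3", 0.6520);
        System.out.println(rank(sims)); // [D3, D1, D2]
    }
}
```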

Looking at the query and the documents, it is clear that Document 3 should get the first rank because it contains two query terms, "dynamic" and "method", and Document 1 should get the second rank because it contains one query term, "ranking". Document 2 contains no term that matches the query, so its similarity value is 0.

Terms that are common in the corpus of documents get a lower IDF value and hence lower weights than uncommon words. In our experiment the term "is" occurs in two documents, so its IDF value and weight are lower.

In our experiment we also applied normalization by dividing the dot product in (2) by the vector lengths. With normalization, longer documents do not get higher weights merely because terms occur more often in them.

3 Conclusion

The Term Frequency / Inverse Document Frequency model combined with the Vector Space Model is a very strong model for checking the relevance of a document based on the keywords in a query. Many search engines and other data mining / Information Retrieval applications use this model either directly or indirectly. In our experiment we have shown that the ranking can be calculated effectively using keyword search.

3.1 Advantages/Limitations

The main advantage of this model is accuracy. The result is based on the keywords the user enters in the query. Our experiment shows that documents ranked higher by this model are genuinely relevant to the user. The calculation for this keyword-based model is not very complex [2]. The model gives better results than the simple TF model because in TF/IDF with the Vector Space Model the term weights are calculated over the whole corpus of documents (global information), whereas in the simple TF model the term weights are calculated within a single document only (local information).

The first limitation of this model is that it cannot understand synonyms: e.g. "car insurance" and "auto insurance" mean the same thing but use different terms [3], so some documents get lower ranks because the exact query terms do not occur in them. Polysemy is the second limitation: some terms are used to express different meanings, e.g. "driving car" and "driving result", so some documents appear more relevant merely because they contain more of the query terms. The third limitation is that the model cannot understand semantic content [4].

3.2 Further Research

In our experiment we have shown that documents are ranked effectively based on the keywords or terms in the query. Terms that are very common get very low weights, and terms that are rare get higher weights, because rare terms have more discriminative power to distinguish between relevant and non-relevant documents. A further improvement of our study would be the complete removal of stop words such as "is", "are", "of", "for", "the"; this would increase the effectiveness of ranking the more relevant documents. The research could also be improved by not retrieving documents below a defined cosine-similarity threshold [8]. Terms could also be stemmed to their roots. Semantic understanding could be included by setting keywords for a document within a particular domain.
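Two of the suggested improvements, stop-word removal and a cosine-similarity threshold, can be sketched as follows (the stop-word list and the threshold value are illustrative choices, not taken from the essay):

```java
import java.util.*;

// Sketch of two improvements: drop stop words before indexing,
// and discard documents whose cosine similarity falls below a threshold.
public class Improvements {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("is", "are", "of", "for", "the"));

    // Tokenize a document and keep only the non-stop-word terms.
    public static List<String> removeStopWords(String doc) {
        List<String> kept = new ArrayList<>();
        for (String t : doc.toLowerCase().split("\\s+")) {
            if (!STOP_WORDS.contains(t)) kept.add(t);
        }
        return kept;
    }

    // Keep only documents whose similarity meets the threshold.
    public static Map<String, Double> applyThreshold(Map<String, Double> sims,
                                                     double threshold) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : sims.entrySet()) {
            if (e.getValue() >= threshold) kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("There are two types of ranking"));
        // [there, two, types, ranking]
        Map<String, Double> sims = new LinkedHashMap<>();
        sims.put("D1", 0.2582); sims.put("D2", 0.0); sims.put("D3", 0.6520);
        System.out.println(applyThreshold(sims, 0.1)); // D2 is dropped
    }
}
```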