Modified Vector Space Model For Protein Retrieval Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

This paper provides an enhancement of an existing method of information retrieval which is a modification of Vector Space Model for information retrieval. This enhanced model is modified to be applied on protein sequence data whereas the normal vector space model has been applied on text data. The results show that the enhanced model achieved very good results in performance but the setup time is somehow high for a large collection of protein sequences

1. Introduction

The Vector Space Model (VSM) is a standard technique in Information Retrieval in which documents are represented through the words that they contain. It was developed by Gerard Salton in the early 1960's to avoid some of the information retrieval problems. Vector spaces models convert texts into matrices and vectors, and then employ matrix analysis techniques to find the relations and key features in the document collection. It represents queries and documents as vectors of terms which can be words from the document or the query itself. The most important thing is to represent relevance between documents in this information space, which is achieved by finding the distance between the query and the document [I ].

The weight of relevance of a query in the document can be calculated using some similarity measures such as cosine or dot product or other measurement.

Glenisson P.and Mathys J [4] have showed how the bag-of-words representation can be used successfully to represent genetic annotation and free-text information coming from different databases. They evaluated the VSM by testing and quantifying its performance on a fairly simple biological problem. They found that it can establish a powerful statistical text representation as a foundation for knowledge-based gene expression clustering [2].

In this work, we have modified the VSM technique to work with biological datasets. We used the document

frequency (DF) instead of inverse document frequency (IDF). The results of the experiments show that the modified method give good results using precision evaluation measure.

2. Vector Space Model

The VSM relies on three sets of calculations. This model can work on selected index of words or on full text.

The calculations needed for the vector space model are:

I. The weight of each indexed word across the entire

document set needs to be calculated. This answers the question of how important the word is in the entire collection.

2. The weight of every index word within a given

document (in the context of that document only) needs to be calculated for all N documents. This answers the question of how important the word is within a single document.

3. For any query, the query vector is compared to

every one of the document vectors. The results can be ranked. This answers the question of which document comes closest to the query, and ranks the others as to the closeness of the fit.

The weight can be calculated using this equation:

( "

Wi = tJi *logl21 \ dJi )

Eql:

where:

if; = term frequency (term counts) or number of times a term i occurs in a document. This accounts for local information.

df, = document frequency or number of documents containing term i

D = number of documents in a database.

The D/df, ratio is the probability of selecting a document containing a queried term from a collection of documents. This can be viewed as a global probability over the entire collection. Thus, the log (D/dfJ term is the

86

IJCSNS International Journal of Computer Science and Network Security, YOL.7 No.9, September 2007

inverse document frequency, IDF; and accounts for global information.

2.1. VSM Example

To understand Eq I, let's use a trivial example. To simplify, let's assume we deal with a basic term vector model in which we:

I. Do not take into account WHERE the terms occur in documents.

2. Use all terms, including very common terms and stop words.

3. Do not reduce terms to root terms (stemming).

The following example [3] is one of the best examples on term vector calculations available online.

Suppose we query an IR system for the query "gold silver truck"

The database collection consists of three documents with the following content:

D I: "Shipment of gold damaged in a fire"

D2: "Delivery of silver arrived in a silver truck" D3: "Shipment of gold arrived in a truck"

Q: "gold silver truck"

Vector space Model constructs the index tables as shown in Tables I and 2 by analyzing the terms of all documents into words as in Table I and find the frequency of each term in all documents; Table 2 does the same for the query.

2.2 Similarity Analysis

There are many different methods to measure how similar two documents are, or how similar a document is to a query in YSM. These methods include the: cosine, dot product, Jaccard coefficient and Euclidean distance. In this paper we will use the cosine measure which is the most common.

The similarity measure for the previous example in section 2.1 can be calculated as follows:

1. For each document and query, compute all vector lengths (zero terms ignored)

IDll = 00.47712 + 0.47712 + 0.17612 + 0.17612=~0.5173 = 0.7192 ID 21 = 00.17612 + 0.47712 + 0.95422 + 0.17612=A2001 =1.0955 ID31 = 00.17612+ 0.17612 + 0.17612 + 0.17612=~0.1240 = 0.3522

:. ID,I = ~L.. w2

'.J i

:. IQI~ )~w~.J

2. Compute all dot products (zero products ignored):

Q-DI = 0.1761 *0.1761 = 0.0310

Q - D2 = 0.4771 *0.9542 + 0.1761 *0.1761 = 0.4862 Q - D3 = 0.1761 *0.1761 + 0.1761 *0.1761 = 0.0620

.'. Q-Di = LWQ w· .

. .J 1.J

1

3. Calculate the similarity values:

Cosine e ~ Q -Dj Dj IQl*lDjl

0.0310 ~ 0.0801

0.5382 * 0.7192

Cosine e ~ Q - D 2 0.4862 ~ 0.8246

D2 1 Q 1* I D2 1 0.5382 *1.0955

Cosine e ~ Q-D3 _ 0.0620 ~0.3271

D 3 1 Q 1 * 1 D 3 1 0.5382 * 0.3522

,', Cosine 8D; ~ SUll(Q,D;) :L,vv Q,j,"vi,j

,', SUll(Q,D;) ~ ~L:.~,2 ~L:w2

Q.J . '.J

J ,

IJCSNS International Journal of Computer Science and Network Security, VOL.7 No.9 September 2007

Contribution

We can easily see from the previous example that the normal VSM will not be suitable for the protein sequence data. This is because it uses the IDF in calculating the weights, and as we saw in the example, IDF gives weight zero if the term appears in all documents and that is used for the stop or common words such as: a, an, the, of,... etc Since these words are very common they exist in all documents, IDF gives these words rank 0; because usually the words that are in all documents are not relevant. However, in protein there are no stop words as in text data. So, the original method is not suitable for protein data because the existence of a term in all protein sequences gives a meaning and a weight must be given to this term.

In this paper a small modification on VSM is proposed to fit for protein sequence data; that is to use DF instead of IDF, where DF is the frequency of the term in all documents (i.e. in how many documents this term exists). This will give each document its relevance based on the frequency even if this term exists in all documents so it will be suitable for protein data. We will use the cosine similarity measure, which is the most common and has been proved by most researchers to give the best results for similarity [5].

3. Implementation

We have implemented the algorithm described in section 2 in C programming language.

Experiments were run on a group of proteins that are known to be related. We tested the system on four protein families: ribosomal protein Ll, ribosomal protein L2, ribosomal protein L3, ribosomal protein L4, where each family has 50 proteins.

3.1 Results and Evaluation

The program has been tested on a collection of 200, 1000, 5000 and 10000 documents, where the document is a protein sequence as in Figure 1. We have a file for protein sequences that we want to search in, a file to input the query and an output file that gives us the retrieved results. The test of the program has been applied as follows:

We chose a sequence of amino acids as a query from the collection of protein sequence, for example from LI, and match it with the whole file and see the results. The relevant documents would be those from LI, because we get the query from Ll.

87

>NFOI724288 Ribosomal protein L3 [Desulfovibrio vulgaris] MAEKMGILGRKIGVTRIF ASDGSA VA VTVIK AGPCPVTQVKTV ATDGYDAIQIAFDEAKEKH LNKPEIGHLAKAGKGLFRTLREIRLEAP AA YE VGSELDVTLFATGDRVKVSGTSIGKGYQGVM RRWNF AGSKDTHGCEKVHRSGGSIGNNTFPG

Figure I: one protein sequence [6]

3.2 Evaluation

We used the standard IR evaluation to evaluate the algorithm. The precision gives the metric percentage of the number of relevant documents retrieved to the documents retrieved.

Number of relevant documents retrieved

Precision =

Number of retrieved documents

This measure gives us how accurate the method is from the number of relevant documents we retrieved. If the precision = I, this implies that the algorithm has successfully identified all relevant documents.

Applying this method on 200 documents with 50 relevant documents and using 10, 20 and 50 as the query length, we get the following results:

We can see from Tables 3- 5 that the precision for a query of length 10 is 80 % this is because the query length is not long enough and can be found in many protein sequences, whereas for a query of length 20­50, the precision is 100% for a cutoff = 5-10, and reach 39% for a cutoff = 100, and this is good results for precision measure.

3.3. Setup Time

The setup time is the time for constructing the index tables showed in Tables I and 2 in addition to the executing time of the program starting from entering the query asking for the retrieved documents until it gives the retrieved documents.

To calculate the setup time of the program, a collection of 200, 1000, 5000 and 10000 documents has been used, taking into account that this setup time is for the first run which includes the constructing of the indexes .

We can see from Table 6 that the setup time is quite reasonable for small documents up to 5000, but after that the setup time increases rapidly.

This can be improved by parallelizing the program distributing the data on multiple nodes which will decrease the setup time.

IJCSNS International Journal of Computer Science and Network Security, VOL.7 No.9 September 2007

4. Conclusion

In this paper a modified model of VSM which is applied on protein sequences data has been introduced. The modified method achieved good results and good performance for retrieving the protein data. Using the precision evaluation measure, it gives a precision of I for a cutoff =10, and 0.39 for a cutoff of 100. The results show that for a small document collection the setup time is reasonable, but for large collection it gives very big setup time. Our next step is to test the program on larger data and compare the performance of this modified method with other similar methods. We also intend to explore the application of parallel techniques to reduce the large setup time.

Writing Services

Essay Writing
Service

Find out how the very best essay writing service can help you accomplish more and achieve higher marks today.

Assignment Writing Service

From complicated assignments to tricky tasks, our experts can tackle virtually any question thrown at them.

Dissertation Writing Service

A dissertation (also known as a thesis or research project) is probably the most important piece of work for any student! From full dissertations to individual chapters, we’re on hand to support you.

Coursework Writing Service

Our expert qualified writers can help you get your coursework right first time, every time.

Dissertation Proposal Service

The first step to completing a dissertation is to create a proposal that talks about what you wish to do. Our experts can design suitable methodologies - perfect to help you get started with a dissertation.

Report Writing
Service

Reports for any audience. Perfectly structured, professionally written, and tailored to suit your exact requirements.

Essay Skeleton Answer Service

If you’re just looking for some help to get started on an essay, our outline service provides you with a perfect essay plan.

Marking & Proofreading Service

Not sure if your work is hitting the mark? Struggling to get feedback from your lecturer? Our premium marking service was created just for you - get the feedback you deserve now.

Exam Revision
Service

Exams can be one of the most stressful experiences you’ll ever have! Revision is key, and we’re here to help. With custom created revision notes and exam answers, you’ll never feel underprepared again.