Collusion Detection Using Smith Waterman Algorithm Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The aim of this project is to understand the working of the Smith-Waterman algorithm, and on the basic of Smith-waterman algorithm understanding the Local alignment and String comparison. Currently in Molecular search field the sequence of genes and their relation is very important and for this purpose there are many algorithms to define and produce results. Every algorithm has its own complexities and structure. Smith-waterman algorithm also works on the sequence of the sequences of the strings. Purpose of this project is to implement the smith-waterman Algorithm and on the basis of this algorithm comparing the strings and alignment of string in the final phase.


Mile Stones:

The whole project implementation is divided into tasks and each task is given some time stamp according to its complexity. The tasks are as followed.

Description of the Smith-Waterman-Algorithm

Understanding of main working

Understanding of the significant similarities detected by the algorithm

Checking dynamic programming arrays to improve both time and space efficiency

Getting some results of Empirical investigations

Conclusion on the bases of the results obtained

Till now I have done my work with the Smith-Waterman-Algorithm and I have also done some work with the internal working of the Algorithm. Now I am going to start with the significant similarities detection and use of the dynamic programming arrays to improve the time and efficiency of the program.


Project Proposal and plan:

First three choices for the project were submitted on 22nd January 2010. Then project preliminary report was submitted on 30th April 2010 followed by first presentation explaining the initial understanding of project. First Project proposal was submitted on date and then 2nd July 2010.2nd progress report was submitted in 6th August 2010 with the timeline that the final project will be submitted in start of the September 2010. After the submission of final project with the idea of possible outcome and further work that required to make the project error less. Then it depend on the supervisor or second examiner to call for presentation which will be held from 6th of September to 17th of September 2010.


The goal of the project is to understand the working of the smith-waterman Algorithm and then its implementation. Matching the sequences and showing their alignment as well. Output of the file is in term of plagiarism percentage and the values involved in plagiarism.


The design and implementation of the Smith - Waterman Algorithm was different from other detection algorithms. Smith-waterman algorithm is design to deal with the sequences of DNA and their comparison with each other. The algorithm is complicated and in this project, I have tried to cover as much aspect as time allowed me.


Problem Statement:

Sequence alignment is the essential tool in the field of the molecular biology. Molecular biology depends on the alignment of the sequence, So if that process if good and quick then the dependent process will automatically be efficient. There are number of algorithms that are available for use but their complexity keep on growing. Available algorithms and their efficiency is still a big issue and is chosen on empirical basis.

Smith-Waterman Algorithm also works for the alignment of the sequences of the strings. The best quality of this algorithm is that it gives best and quick result for the shorter length of the string and as the strings goes bigger and bigger processing times increases sufficiently but still the outcome is reliable as it will be shown in this paper / project.


Literature Review:

Back Ground:

Lexical Analysis:

Lexical analysis on these values means that we transform the original words into the form of symbols. The program will read the file in form of string them it will check each words of the string and put a special symbol for that word. If the any word that is similar to the previously stored word then new symbol will not take place for that word.

!= null

Words to symbol


String Word

(Henry, Fred)

Smith-Waterman Algorithm:

Plagiarism and collision is mostly associated with the student assessment. Plagiarism is associated with copying or stealing someone else work with / without the intention to deceive or knowledge of author. Collusion is mean as submitting the work that has been done somehow for fully by another party and that party is not aware of this act.

In the field of Computational Biology it is really hard to work on the DNA or protein sequence. Smith-waterman algorithm helps to work on these complex sequences. This algorithm is used to find out the distance of similarities and aligning the sequences in a way that similar values come on same index. Smith - Waterman Algorithm use matrix to where one sequence comes of top and one comes on side (like two for loop). There it check the values of each sequence with other one and place a value accordingly (method of placing values is described below).

Smith-Waterman algorithm is basically used to compare and align the two sequences of strings. Mostly used in DNA recognition and this algorithm help to find out the similarities and by the help of the alignment, It become easy to computes the strings.

The Smith-Waterman Algorithm for collision detection is one of the best algorithms to compare strings with each other. This classical algorithm is also known as Local-Algorithm because it is widely used for the purpose of the string matching to check the collusion.

This algorithm use string matching technique. Part of the one string is compared with the strings of the other text file. Algorithm count the value of hit (matches), delete (indel) and mismatch. To count these values we put these values in predefined variables named as h, d and r.

Where 'h = Match', 'd = delete' and 'r = Mismatch'.

The value of make increase in total value and value of d and r make the negative effect on the total count. So increment takes place when there is any match of value and decrement when there is any mismatch or indel occur. In the start we write the value as

h = d = r = 1

After the process of the string matching the total is counted as followed

Total = (*) h - (*) d - (*) r

Where „*‟ is the counted value for the match, indel and mismatch.


Suppose that we have two string „X = abcbadbca' and 'Y = abbdbda'. We have to compare these two and in the end the result will be shown in the form of total value carried out.

Here after the two string comparison we come up with the following values,

h = 6

d = 2

r = 1

Therefore total will be as

Total = 6 * h - 2 * d - 1 * r

We can it as follow as well

= 6 - 2 - 1

Total = 3

This is the total of comparison of two different file strings, this is the same process that is done for whole the file containing many string. The value of threshold is defined first in any plagiarism algorithm and that threshold value defines the percentage of the plagiarism involved in the submitted work. The value of threshold key depends upon the nature of the output required, (In text file less threshold value is good and in restricted environment higher value is best).

(Teacher Notes)

Detection Technique:

Strings is divided into words and each word is given a number and then the whole string provide a sequence of the numbers and then that sequence of number is compared with the strings (text file) of another publisher or author. The method proposed (Smith-Waterman-Algorithm) can be applied to any sort of textural material, like as essay, some numbers of word or written pages.


Suppose that we have a string "A horse is an animal"


Numbered Key











So now this string can be represented as "12345" and in this way we try to give number to all the string and one thing is notable that one we have given a number to one word then we can't give another number to same word. So in that way all the strings are converted into numbered strings that make it relatively easy to compare with other numbered value.

The problem in that collision algorithm was that it is bit hard to convert all the words into numbered value and for that thing it requires maximum space plus the time it will to convert it to number and then it will start processing on it. People borrow part of someone else work and after altering it, they claim it as it is their own work. This thing can easily be done while using a computer without effecting the meaning and semantics of the sentence.

In case if we apply the string comparison method then on corpus of n documents, we can make all (n/2) = n(n-1)/2 pair wise documents comparisons.


When n = 300 then total pair wise comparison will be = 45000.


String alignment:

String alignment is the process to represent the similar / particular transform of one string to another that both have maximum value at the same index.


" A (global) alignment of the two strings S1 and S2 is obtained by first inserting chosen spaces (or dashes), either into or at the end of the S1 and S2, and then placing the two resulting strings one above the other so that every character or space is opposite a unique character or unique space in the other string"

(Gusfield, 1997)

Here the term mentioned as "Global" mean that it implies on every type of strings which are used for the alignment purposes. If there are two strings given and then it require to do alignment, So after alignment the string will have the maximum matching values at the same opposite positions.

Example of the two strings 'qacdbd' and 'qawxb' will look as follow after the alignment

q a c - d b d

q a w x - b -

(gusfield, 1997)


In the above example the values of both string which are matching with each other comes under each other rather than the value of 'c and w'. Value of x and ds doesn't match so we are putting '-' with these values.


In the process of strings alignment we have to calculate the Distance matrix and then on the bases of that Distance matrix we calculate the trace back path / alignment.

Distance matrix: (Calculation of similar array):

"Distance matrix (Dij) is defined for the two strings S1[1 ... i] and S2[1 ... j]. "

Calculation of the similar array is done by using this Distance matrix where we start processing from the right - bottom of the matrix and check for the values having different possibilities (each possibility is described below).

Let's suppose that String1 (S1) have 'n' letters and string two (S2) have 'm' letter then the Distance matrix will process in (n*m) time. It can be more easy if we do it in (n+1)*(m+1) times. Because one row and one column is added to the table with value '0'.

D(i j)
















Table 1.1 (showing the values of row 0 and column 0.)

Distance matrix put the values of 0th row and column as above and then the value of the table index depends on the matching of the values with the value on row. If both the values matches then it take the same value of the index D(i-1, j-1) and if the value if not matching then it pick up the minimum value from these table index { D(i-1,j) , D(i,j-1) , D(i-1,j-1) } and then value is increased by '1' and placed in the index.

After the all processing of the values the table 1.1 will look like as follows.


























Table 1.2 (Values after the process on the strings)

(Understanding from Gusfield 1997)

(Teacher Notes)


Now the process of trace back will start to align the values of both strings. Process will start from the end of the table and continues to go back to the beginning by following the values from which index it comes. These arrow ' ' and there directions show the possible of the previous index from where the value is acquired. So after reading the last value of the table we can process to the previous index and put some values or dashes accordingly.

The values that we put on the screen for alignment purpose depends on the direction of the arrow ' '. The behaviour of the arrow and impact on the trace back is as follows.


This arrow means that the value is not match. We will put the value of the S2 in string 2 and dash '-' will be inserted on same index of the S1 string.


This arrow means that value is not match. we will put the value of the S1 in string 1 and dash '-' will be inserted on same index of the S2 string.


This arrow indicates most probability of match on the same index of the both strings. Values of the both string at that index will be place and the control will be moved on the D(i-1 , j-1).

After the whole processing the resulting string will be as follows.

. B A C


. - B A C

A B - C

. B A - C

- A B C

Above representations depends upon the direction of the arrow, because each direction of arrow will guide the processing to a different style.


Working with Smith-Waterman Algorithm:

Smith-waterman algorithm is all about calculating the value matrix according to the match and mismatch on the values of two strings of text files. Smith-Waterman Algorithm define some variables like " 'h' , 'r' , 'd'" where these are define as followed


The value of 'h' is added to the matrix with the previous values of index ( i-1 , j-1) iff there is a match of the value [ x[j] and y[i] ].


The value of 'r' is subtracted from the maximum value of the index

{ [i,j-1] , [i-1, j] , [ i-1 , j-1] }. This process do occur only if there is mismatch of the

{ x[j] and y[i]}.


The value of 'd' is subtracted from the maximum value of the index

{ [i,j-1] , [i-1, j] , [ i-1 , j-1] }. This process do occur only if there is delete between the { x[j] and y[i]}.

(Chrochmore, Rytter)

Chapter 2:

Project Structure:

The structure of the project consist upon three main Task

Implementation of the Smith-Waterman Algorithm:

Smith-waterman algorithm defines the way to calculate the values in the matrix. It works with the math and miss match of value. The basic working of algorithm is upon two strings, where the algorithm check if the values on both indexes matches with each other than we add value 2 ( + 2 ) with the value at index (i - 1 , j - 1) and if the values are not matching with each other than we take the maximum of the previous values that are attached with the current index {(i - 1,j) , (i ,j - 1) , (i-1,j-1)}

and minus value 1 ( - 1) only if the value is greater than '0' otherwise we put '0' on that index. In this way the whole matrix is calculated.

String comparison:

String comparison take place between two string and for this process we define a threshold value (user select the threshold value).

Threshold Value:

Threshold value is use to indicate the value that describe the comparison point. If in Smith-waterman algorithm matrix it found the value equal to threshold value and the previous values are continues equal to each other ( i.e x]j] = = y[i] , where 'j' and 'i' are the index number) then it mean that these string from two text file are same and it indicate one level of plagiarism.

So we put this part of both strings as output in put record which shows the similarity in some part of the string. In the end we calculate the total characters of each string and number of values match and then we calculate the percentage ( % ) of plagiarism. Same way we calculate the plagiarism foe the second string.

Showing the Value With gaps (Value Alignment):

Value alignment or value with gaps are calculated as followed,

Suppose we have two strings

q a c d b d

q a w x b

(gusfield, 1997)

And then the result will be much similar to this output

q a c - d b d

q a w x - b -

(gusfield, 1997)

This process is obtained by working on the values of each index of both string ( X and Y having 'i and j' as there index number). We add one row and one column in addition to the total number of values of X and Y strings. We put '0' in row '0' and column ' 0' and from the index value 1 to the end we check for the matching and miss matching values of both strings. If there is a match then we put the same value as it is on index


and if there is no match then we get the minimum value from the attached indexes

{(i - 1,j) , (i ,j - 1) , (i-1,j-1)}

and add value '1' ( +1) in it. This process goes till the end of the matrix.

Now we have to show the values with the gap and for this purpose we start from the end of the table. We keep on going back until we get to the index (0,0). If value on the end and value at index

(i , j) = = (i-1,j-1)

and values are same on both indexes, then we put the values of both string of that index and move up to the value


If values are not matching, then we move up , up-right or left according to the matrix values. Suppose that value of 'current index - 1'is same to the value at left then we put value of string 'X' and on string 'Y' we put '-' and control is moved to the index

(i, j-1).

(Teacher Notes)

Chapter 3:

Programming Language:

In this section we will describe the programming language chosen for this project. This project requires the implementation of the algorithm and its understanding. Understanding is done by searching the different aspect of the algorithms, its working and its output etc. This project doesn't depend on specific programming language, it just depend on the developer where they find them self easy to implement.

Programming language C#:

Visual basic C# 2010 is selected to work on this project. There are numerous reason to choose this language but the most important one is that I have base in C# and it become easy to work with C# rather than going for learning many other languages.


C# is case Sensitive.

Use of Static Classes.

Multi line comments.

Printing multi line with the token '+' and using the same token for addition as well.

Allowing the block of unsafe code by using the unsafe keyword.

There are numerous other feathers which make C# prominent to other language but as described above that most important reason is understanding of the language.



Syntax of a language is very vast topic. But here we have mention some examples,



if (condition)


// condition is true



if (condition)


// if condition is true




// If condition is false



if (condition)


// if condition is true


else if (condition)


// If 2nd condition is true




// If both conditions are false



for(int i = 0; i < number-1 ; i ++)


// code will come here


Nested Loop:

for(int i = 0; i < number-1 ; i ++)


for(int j = 0; j < number -1 ; j ++)


// code will come here



There are many more syntax operation that are widely used in C# and some of them are used in this project coding phase ( which are explained as well).


Chapter 4:

Specification and Design:

To design the whole project first we need to understand that working of the algorithm and how it computes the values and how it displays it to the user. In this project we need to have multitasking machine to make the process quick.

The way this project will run and produce the outcome is as follow.

The user will place the two documents consisting of the single / multi strings in the E:\Project folder (the file should be .TXT).

Give the name file1.txt and file11.txt to both file accordingly.

Program will start processing by picking up the data in form of words and will store it in two string arrays.

Then it will check for the indexing (hashing) of these words. It will assign the specific number to each word and if the any word is matching with previously stored word then there will be no new index number to that word (this is to avoid confusion).

After the indexing / hashing now we will have the words in representation in the form of the numbers.

e.g ( a horse , a fast horse) string will be represented as

< 0 1 0 2 1>, where

0 = a

1 = horse

2 = fast

After this process now these numbers will be stored in two arrays named 'x' and 'y' which will be accordingly to the data of two files.

Then the smith - Waterman algorithm will start work on matrix. It will work on these values stored in x and y and string1 and string2 accordingly. (the working of matrix is explained above)

On the base of that matrix a threshold matrix will be created.

Thresh hold matrix will check each values stores in smith-waterman algorithm and if the values is equal to the thresh hold values then it will check that if the values of string1 and string2 are same then it count this thing as plagiarism and will take into account.

The percentage of plagiarism is calculated on that bases and values will be shown as well.

After this alignment of the string will take place. For this thing we will create the new matrix where the values will be string1 + 1 and string2 + 1. Where the additional row and column is for storing the value '0'. (The calculation of the matrix is described above as well).

Then the output will be shown and stored in a file in same directory E:\Project\val with gap.txt

User can view the output of each matrix and each process which is also stored in that folder (E:\Project).

Chapter 5:


Comparing both input strings and storing them in new char arrays with proper alignment.

Displaying similar Values plus Plagiarism status.

Showing the Strings with Proper Alignment.

Start of project

(Picking the file1.txt and file11.txt) from directory E:\Project

Conversion into Lexical Values:

Threshold Matrix

Applying the Smith-waterman algorithm And Construction of Matrix.

Start of project

(Picking the file1.txt and file11.txt) from directory E:\Project

Conversion into Lexical Values:

Threshold Matrix

Applying the Smith-water And Construction of Matrix.

This chapter will provide the detail of the implementation and its necessary explanation as well. Implementation of project is described first in table then extra detail is in paragraph.

File Name





The folder where all files are stored



file1.txt contains the text (sequence / strings). Can be said as first input file.



File11.txt contains the text (sequence / strings). Can be said as second input file.



A text file that contain the matrix which is the result when the Smith-Waterman Algorithm process on the sequences of two files.



A text file which contain the values of threshold matrix. (explained below)


Similar Values.txt

The values which are similar in both text files and are satisfying the threshold values as well are stored in similar values.txt file


Val with Gaps

Alignment of the values is the last step. Where values are display with the proper alignment.

Table of files.

Input Files:


file1.txt is the first input file. User will store its text data into this file and then the program collects the data from this file.


file11.txt is the first input file. User will store its text data into this file and then the program collects the data from this file.

Processing Phase:

First program will take the date from both files ( file1.txt and file11.txt) and it values are stored in two different strings arrays called 'x' and 'y'. Then the program will perform the lexical analysis phase and values will be arranged into lexical sequences rather than the whole wording.

Example: a horse, my kingdom for a horse, will be represented as follow

< 1 2 3 4 5 1 2 >











First the words will be stored into two string arrays ( 2-Dimention two array) called ary[,] and ary1[,]. Then the lexical process will be done on both arrays and the lexical values will be generated with the original words and these are stored in a new string array (words and their lexical values) called hsh[,]. Then string1 and string2 (new arrays) will be created where the lexical values are stores with respect to the appearance of the words in to sequence in file, so string1[] will contain the lexical values for the files1.txt data and string2[] will have the lexical values for the second file11.txt data. Then these both string arrays (string1[] and string2[]) will be passed to the program for the actual processing. Then the user will be asked for the threshold value. Threshold values mean that maximum number in smith-waterman algorithm (in smith-waterman algorithm if the values are matching then a addition is done in the values +2 ). If the values are matching and it reach to the threshold values then we count this thing in plagiarism and number of values are counted as well, which in the end help to find out the percentage of the plagiarism in both files. Then we compare both strings and then a new matrix is created that contain the values for the alignment process ( that matrix is explained below). Then the alignment process starts and on the basis of threshold and newly created matrix values are extracted from the table and then each value is match accordingly

Example : if we are working on the index (i,j) of the matrix then values on indexes {(i,j-1), (i-1,j-1), (i-1, j-1)} are check if it match with (i-1, j-1) and values of on (i-1,j-1) index are same { x[i=1] = = x1[j-1]} then we jump to index (i-1,j-1) and placed both file values respectively into output1[pop] and output2[pop] arrays (here pop is a integer values which keep on increasing by every insertion). If values on (i,j) ! = (i-1,j-1) then we check the remaining two indexes and we moved to (i,j-1) or (i-1,j) accordingly. If we moved to (i-1,j) then we put values of x[pop] into ouptut1[pop] and '-' is inserted into output2[pop], if we moved to index (i,j-1) then '-' is inserted into output1[pop] and this process continues till we reach to the first values (one thing is important that if we are on a index where j =0 then we only check the value on index(i=1,j) for comparison and if i=0 the we only check index (i,j-1) for comparison, other than these two condition we check three indexes as described above).

Output files:

The output files are described as follow. All the files are stored in same directory ( E:\Project\).


This file contains the matrix that is generated after the actual smith-waterman algorithm.

Threshold .txt:

This file contains the matrix with respect to the threshold values as described in processing phase.

Similar Values.txt:

This file contains the similar values that occur in both files plus the percentage of plagiarism that is detected in both files.

Val with Gaps.txt:

This file contains the original data / values stored in both files (file1.txt and file11.txt) with the proper alignment.



Important note:

The working on the values is done word by word. So in alignment process for matching values full words will be displayed and in case on mismatch there will be one symbol '-' to make alignment. User has to think that space as appropriate as one word.

Chapter 6:

Testing of Project:

Working of the project was checked on every step of development. Project consists of different phase and on every phase it gives some output which is going to help for the next step. Now we will discuss here the project step by step:


User will put the data / strings in the files names (file1.txt and file11.txt) in the directory E:\Project. Here in this evaluation / testing process we have some data inserted into the both files. For getting the maximum result we have put the same data in two files to check how the program deals with same data.

Step 2:

Then the program will get the data from these two above mentioned files and store the data into two strings array. It will also show the data to the user in the form of word by word.

Step 3:

After storing the files into the array then the program will start giving the word some specific code ( e.g a = 1, horse = 2, etc). For similar word the program is not going to give new code, it will use the same code that has been assign previously. Then it will show the whole words of file with their specific code. And after this program will store the lexical code instead of the string words in two different array and then smith-waterman algorithm will start working on these files.

Lexical values:


Smith-waterman algorithm will work on these lexical values and the actual words where it is matching and will generate a matrix having entries equal to (length of stirng1 words) * (length of string 2 words). While producing the matrix it will also save these entries in a newly created file named Smith-Waterman.txt in same directory E:\Project.

Step 4:

After the successful completion of the smith-waterman algorithm the program will ask from user to enter a threshold values. The program will run on the newly generated matrix (smith-waterman matrix) and will check that values are maximally up to threshold values. It will create a new matrix with respect to that threshold value, where no values will be greater than the specified threshold value.

We can also think that threshold values is equal to the (number of worlds that should be match in both strings) * 2. Because smith-waterman algorithms do addition on 2 if values are matched. These values will be stored in a new file named Threshold.txt in same directory.

Step 4:

Threshold values will define the maximum number that can be consider in the matrix if the string is matching from any value and carries on matching till that specified threshold values is matched. If we come across such situation then the program will store that part of both strings as matching strings and it will also count the number of words / lexical values matched there. If the same thing happens for another string then program will do the same process and count of words will be increased with respect to the new lexical values matched.

Step 5:

After completion of the threshold matrix process, the program will show the similar strings and then it will show the total number of words in each string with respect to the number of words matching that comes in that strings and on the bases of these two things it will calculate the percentage (%) of the plagiarism status of that file. Same process will be done for the second string file.

Step 6:

After showing the matching values (String Comparison process completion), program will go toward the next step where the values will be align with the respect to their matching order. Any mismatch of values will cause the move toward straight up or straight right depending on the next value matching condition. If the values in output1[] and output2[] array are same and satisfied the condition set then control will be moved to position right and up (i-1,j-1). And these values will be displayed to the user on the console application.

Here size of string one is 11 and string 2 is 11 as well but the size of threshold matrix is 144 (which is 12 * 12) but not 122 (means 11 * 11), this is because threshold matrix add one row and column with the entries of '0' in it to make the process slightly easy to compute.

Testing of the program with some values different in any file.

Here we will show the same process on some different values.

(Teacher Notes)

This screen is showing the maximum processes that are going on in this project. Here in the last phase of strings with gaps, each '-' indicates that one mismatch is there with the second file.

(Teacher Notes)

Chapter 7:

Evaluation of project:

It also checks the end of file as one word. Because after numerous testing and checking it closely I come to know that last matching values if it reaches to the end of file will be one lesser than the other one or one lesser than as values are expected from threshold values. After careful checking for every values then we come to know that that it count one values more than the number of words in the file.

Chapter 8:

Conclusion and Further Work to be done:

In this project I have tried to understand the Smith-Waterman algorithm and the alignment process. Then after getting the knowledge of the algorithm and its working on strings, I started the project and tried to implement the logic with acquired knowledge. The main focus of the project is to first read the strings word by words then performing the indexing (lexical values instead of the full words). Then implementation of smith-waterman algorithm and storing values of matrix in file. After that it will ask from the user to enter threshold values. Then it will sort the matrix with respect to threshold values, and then the similar string words will be displayed along with plagiarism status. After that string comparison will be start and the values with be adjust accordingly to the matching and mismatching values.

This project gave me understanding of the string that how to extract the string from the files and then extracting the words from these values. I have tried to put my 100% efforts to work out the desired output. I am feeling that I have got maximum understanding and knowledge of the implementation that was required for this project. This project is very helpful to me for giving me a chance for getting the knowledge of these unseen aspects.

Currently this project is showing the smith-waterman algorithm matrix with the string comparison and alignment of string as well. This whole project is made out in console application. I will keep on working on this project to make an interface that make it user friendly.

I have also done some experiment on this project to satisfy my work. But these experiments are done on simple text and chosen data. So the experiments are done on this project program are limited and I still expect that there should be more work to be done to make the things more easy. I would appreciate if someone is going to provide me more supervision for making this project more good and flawless.