Hybrid Approach For Exact String Matching Problem Computer Science Essay

Published:

Abstract: Problem statement: The rapid movements of technology have necessiated the generation of intricate and enormous amount of data emanating from biological sciences making string matching patterns a difficult task. These trends have intend rendered the application of one algorithm for string searching almost ineffective as excution times and number of comparsion remain relatively very high. Therefore the answer lies in the combination of two or more algorithms to form an hybrid algorithm to enable faster performances. Approach: The realistic way is the proposal of two algorithms, that is, Raita and Berry-Ravindran (RBR) Algorithms which could make it possible for enhanced performance. Results: The new RBR hybrid algoirthm achieved the best results through lessening the number of attempts, number of character comparisons and searching time. The types of data used for evaluating the performances was Protein and English text. The percentage of the improvements of the RBR hybrid algorithm when weigh against Berry-Ravindran in Protein and English text are 43% and 44% accordling. The enhancement in percentages of RBR in excess of Raita algorithm for Protein and English text are 30% and 18% respectively. The evaluation techniques are number of attempts, number of character comparisons and searching time. Conclusion: The outcome illustrates the blending of the algorithms led to greater resulys and improved performaces when to the original algorithms no matter than size of data being used as the test data.

Lady using a tablet
Lady using a tablet

Professional

Essay Writers

Lady Using Tablet

Get your grade
or your money back

using our Essay Writing Service!

Essay Writing Service

Key words: Hybrid algorithm, string searching, RBR, Raita algorithm, string matching pattern

INTRODUCTION

The string matching issue have been generating a lot of keen interest since the commencement of computer science and it plays a critical role in ensuring the string searching and as well as matching of patterns originating from large resources such as Protein and DNA [1]. String matching importance covered a broad array of interests, for examples applications, proccessing of text, biological data, internet search engines, spell checkers and others [2]. In recent years due to advancement in technology biological science data keeps rising therefore it is no longer prudent to apply a single string search algorithm to solve these complex search problems but the integration of two algorithms to form hybrid algorithm which could enhance string search peformances. That is why in recent times the focus have been directed twoards the development of hybrid algorthms as it is seen as the best way for resolving the string matching problems and increase performance because scientific data is getting more intense [3].

In view of this development a proposal for a latest hybrid algorithm to be called Raita Berry-Ravindran algorithm (RBR) were highlighted which basically involves the proccesses of blending the good features of Raita (R) and Berry-Ravindran(BR) algorithms to boost search performance. The reason for the choice of these two algorithms to form an hybrid was that R algorithm is noted for having the best practical behaviour and performance with search patterns in English text due to the presence of character reliances [4]. It disadvantage however is that the algorithms does not check the starting point as a first step, in the case of BR algorithm it is famous for having a better a better shift value from the two successive characters, in otherwords BR attains the highest shift distance obtained mismatch information at the searching phase[5]. A string matching algorithm in simple terms is the practice of discovering all the occurrences of a given pattern of m characters (X = x1, x2…xm) built over a finite alphabet Σ size of σ. The major issue for these algorithms is on how to slide the window to match the pattern within a given large pool of texts during the searching phase[6]. All the algorithms normally function by scanning a given set of text by the aid of window , where the window comprises of the pattern length when a match or mismatch occurs. Then the text in the window are arranged in lines, before the characters in the text are compared with a pattern until a match is found. Throughout the searching phase usually either a match or a mismatch will occur, so the window is shifted to the rightmost section for the alignment of text which begins from left to right at the beginning of the search [7].

Related study: Earlier research works carry out confirmed integrating the good properties of both algorithms yeilds to enhance hybrid performance during the searching phase as each algorithm had its own strengths which when combined does gives enhance output.

Lady using a tablet
Lady using a tablet

Comprehensive

Writing Services

Lady Using Tablet

Plagiarism-free
Always on Time

Marked to Standard

Order Now

An hybrid algorithm founded on two algorithms namely Boyer Moore (BM) Knuth-Morris-Pratt (KMP) algorithms, alias KMPBS that reduces the last character or the first character does not match the situation in order to reduce the the number of comparisons and to boost pattern matching were implemented by Xian-Feng [8]. It uses the Next function of KMP and aslo the Right (p [j]) values signifying BMHS algorithm at the pre-processing phase and the searching phase. In addition, two algorithms were combined to form an hybrid algorithm called ZTFS [9] which is made up of combination of Zhu-Takaoka(ZT) and Fast Search (FS). The operations for ZTFS include two procedures namely bmGs(j) for BM good suffix heuristic and ztBc(a, b) for ZT bad character heuristic representing the characters, and with the ztBc(a, b) function providing the maximum shift values, and also bmGs( j) function grant the prefix of the pattern during the pre-processing phase.

Another hybrid algorithm were put into operation comprsing of Backward Hashing (BH) and Aho-Corasick (AC) algorithms [10] which is useful for tracking and scanning information for text virus. The function is to index the shift table from the prefix sliding window (PSW), that represent the prefix for the search window. BH also search for longer patterns that are within PSW are be examined as the shift value does not go beyond the PSW thereby saving time.

MATERIALS AND METHODS

The research begins by determining the pros and cons of all string algorithms before chocing on BR and R algorithms. Then the best properties from algorithms were extracted that is, bad character table for Berry-Ravindran and the bmBc (boyer moore bad charater) table used by Raita algorithm. The computation of the shift value in the pre-processing phase is to ensure a bigger shift value for the window during the searching phase of the hybrid algorithm. The mixing of the two algorithms ensures the best functions are used to help make the algorithm efficient therefore increasing its performance. BR provides the best shift value using two successive characters. However, the disadvantage of this algorithm is that it does not check the starting point as a first step. For the R algorithm, it is useful in searching patterns specificaally for English texts and the performance is due to the presence of character dependencies [4]. However, the disadvantage of these algorithms is that they do not check the starting point as a first step.

Hybrid algorithm pre-processing phase: The hybrid algorithm build the pre-processing phase first [11] and at this phase the hybrid algorithm consists of building two tables namely bad character table and bmBc table for R by computing the shift values. The first table to be used in the pre-processing phase of hybrid algorithm is bad character table; this bad character table is built by using Berry-Ravindran formula as shown in Eq. 1:

(1)

And the second to be created was bmBc table, which contains all the locations of the characters that exist in the pattern and the text.

Hybrid algorithm searching phase: The searching phase of the hybrid algorithm begins by applying the pattern length. The procedures invloved are illustrated as below as follows:

Scan the m-length character of the text to demarcate a possible beginning search point.

If the examining character is does not exist in the Raita bad character table, shift the pattern by Berry-Ravindran shift value by using the rightmost two consecutive characters after the examining m-length character

If the scanning character exists in the Raita bad character table, arrange the character of initial search point and the pattern with the equivalent location of the character in the bad character table.

Begin the comparison of characters from the last character of the pattern with the rightmost text character of the window, then if there is a match it compares the first character of the pattern with the leftmost text character of the window, And finally if when a match occurs it actually compares the other characters from the second character to the last

When a match or mismatch happens, the procedure is to shift the pattern by calculate the Berry-Ravindran shift value from the two rightmost consecutive characters immediately after the window.

Lady using a tablet
Lady using a tablet

This Essay is

a Student's Work

Lady Using Tablet

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Examples of our work

Analysis: The new hybrid consist of the pre-processing phase of Raita which is O(m+σ) and Berry-Ravindran representing O(m+σ2) time comlexity [12]. With the combination of these two equation the pre-processing phase time complexity of the hybrid algorithm is O(m+σ2). Below is the searching phase time complexity.

Lemma 1: Equation for complexity of time was defined to be O(mn) as the worst case.

Proof: The explanation of the worst case for the hybrid algorithm means that the entire characters within text are matched and must not be greater than m times. The worst case usually occurs throughout the processes when every characters inside the patterns are similar to the other characters in the text. For example given text T = "fffffffffffffff" and pattern P ="fffff" and so therefore O(mn) is noted to be the worst case time complexity.

Lemma 2: Representing the time complexity of O(n/(m +2)) is known to be the best case.

Proof: At each attempt once the probing character does not exist in the bmBc table, the shift value will be m+2 which is computed through Berry Ravindran function at the pre-processing phase. In the best case, when each character inside the patterns are exclusively different from the characters in text. For example given text T = "ccccccccccccccccccc" and pattern P = "yyyyyyy", the shift will be equal to m+2, which is executed during each attempt at the searching phase, for that reason the time complexity is O(n/(m +2)) as the best case.

The major factors used to determine the average time complexity is made of the size of the alphabet and the probable occurrence of every single character in the text. Consequently the maximum shift to be achived is denoted as m+2 while the the minimum character comparisons are usually between 1-m and it is the random sum founded on the input data. Owing to its random character which has no realistic estimates, the average time complexity is usually not possible to forecast.

Evaluation: For evaluation the performances of all the algorithms two different types of data were applied as the testing data for every algorithm and the value of each used as comparison among each algorithm. The data chosen for testing the algorithms are the standard creteria for evaluating all algorithms [3] when different sizes of characters and patterns are given. As a result the running of the program for the algorithms were executed 6 times. The data for Protein consist of alphabet size equal to twenty letters (σ = 20) and English text comprises of 100 different types of alphabets representing all the English language characters such as numbers and symbols. The English text were obtained from Gutenberg project [13], whereas the Protein data is taken from Swiss-Prot Database [14]. The program were implemented on Personal Computer with operating system being Microsoft Windows Vista Service Pack 2 with 1.93 GHz Intel Core 2 Duo Processor, 3GB of RAM and programming editor C++ Builder 2010 Architect. The new hybrid algorithm Raita and Berry-Ravindran algorithms performance are measured of number of attempts and number of character comparisons.

Number of attempts: This method beginning point is where the first character in the pattern is mapped unto a specific character inside the text and then continues to move tuntil the end of the text in order to establish whether a match or mismatch had happens before the text comes to an end. The expected number of shifts to the end of the text is known as the number of attempts. [15].

Number of character comparisons: It is the portions within the text of which represents the starting point inside a given text to the last letter of the text, where the characters of the pattern are take out independently and compared with the text characters to establish whether there is a match or mismatch [7].

RESULTS

Varing degrees of results was achieved for both the hybrid and the other algorithms which were tested using data size of 200MB. Additionally the results from the algorithms using different kinds of pattern lengths ranging from 3, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90 and 100 characters. The Figure from 1-9 demonstrate the two decisive factor for the evaluation, to be precise, number of attempts and number of character comparisons.

The results obtained from Fig. 1-9 illustrates the RBR hybrid algorithm proved enhance performances through number of attempts and number of character comparisons so for that reason the integration of the algorithms have utilized the good properties from each other to boost the performance of the hybrid algorithm. For example, the results gained from both the Protein and English text of the hybrid algorithm confirmed superior performance than the original algorithms and this is largely due to joining of bad-character shift function (bmBc) of Raita and Berry-Ravindran bad character enabling a larger shift values for the window and is so doing improved performance for the hybrid.

Fig. 5: Number of attempts in protein data

Fig. 6: Number of character comparisons in protein data

Fig. 7: Searching time in English text

Fig. 8: Number of attempts in English text

Fig. 9: Number of character comparisons in English text

DISCUSSION

As previously explained the aim for the research was to incorporate two different kinds algorithms to form a hybrid algorithm to ensure improved performance at the pre-processing and searching phases respectively as biological science data are becoming more colossal and cumbersome for string searching. To find out the performances of the new RBR hybrid algorithm two types of data were used, that is, Protein and English text. The results at the end of the experiments depict RBR as having significant reduction in number of attempts and number of character comparisons and also improved performances over the rest. The outcome for the new hybrid algorithm improved between 43% and 44% over Berry-Ravindran algorithm and also at the same rate improved from 30% and 18% over Raita algorithm respectively when using two different kinds of data.

The result does give a good picture of the need to paid much attention to the hybridization algorithms as a future prospect as it does ensures quicker processing and execution times rather than the application of a single algorithm for string searching which cannot meedt todays complex data. The success of the new hybrid algorithm does also shows, the choice of the right algorithms to form hybrid are of great importance as some cannot perform respectably when combine with each other so complementary algorithms must be carefully studied before being chosen to form hybrid before the needed enhanced performance and results can be achieved. The results also established the viability of combining two string search algorithms to and the benefits derived through improved search and matchin pattern performance.

From the results attained we strongly recommend the application of hybrid algorithms for future string searching developments as it is one of the best way to increase string searching performance.

CONCLUSION

The study presents a hybrid algorithm called RBR which integrate Raita and Berry-Ravindran algorithms. The hybrid algorithm confirmed improved number of attempts and number of character comparisons when all the different kinds of data size and pattern lengths, hence the proposed algorithm is useful for string searching especially Protein and English text. This also proved that the application of the hybrid algorithm will lead to better times than the normal algorithm as scitenfic data is becoming huge in present times.

Acknowledgements

I would like to express my deepest sincere gratitude to my supervisor, Associate Professor Dr. Nur'Aini Abdul Rashid for her valuable advice and assistance through useful comments during the research.My appreciation also goes to Universiti Sains Malaysia (USM) for enabling me to access its research and library facilities. I express my sincere and deepest gratitude to my family members especially my father, wife um Bader, brothers and sisters for their support and encouragement. Thanks to my friends in Saudi Arabia and Malaysia for their prayers, assistance and encouragement and support.