Memetic Algorithm for the Closest String Problem


Abstract- Sequences consensus problems are especially important in studying molecular evolution, protein structures, and drug target design. In this paper, we work on two of these problems, namely the closest string problem and the farthest string problem. These problems are NP-hard, and none of the exact algorithms proposed to solve them runs in polynomial time. Many non-exact algorithms have therefore been proposed that try to obtain 'good' solutions in acceptable time. In this paper, a Memetic Algorithm (MA) is proposed for the closest string problem which outperforms the existing algorithms. We then extend the proposed algorithm to address the farthest string problem.

Keywords- Memetic algorithm, string selection problems, Bioinformatics, metaheuristic

Introduction

Computational biology and Bioinformatics are two important interdisciplinary fields which connect biology and information technology. An important goal of these fields is to interpret genome and protein data, which has led to such problems as genome rearrangement [1], sequence alignment [2], and sequences consensus problem [3], most of which are known to be NP-hard [4]. Sequences consensus problems include closest string problem (CSP) and farthest string problem (FSP), which have applications in drug design [5][6] and coding theory [7,8].

The closest string problem seeks a string whose distance from all the strings in a given set is minimal. The distance between two strings is defined as their Hamming distance [9]. Consequently, the solution to the problem is a sequence with maximum 'similarity' to the input strings. The farthest string problem, on the other hand, is the reverse of the closest string problem in the sense that it seeks a string whose distance from the given strings is maximal. These problems are especially important in the context of studying molecular evolution, protein structures, and drug target design. For example, one of their applications in drug target design is to find a genetic sequence that kills all pathogenic bacteria without any impact on human genes [10]. The closest string problem also has applications in coding theory, data compression, error decoding, and speech coding [8].

Several algorithms have been proposed to solve the closest string problem. Since this problem is NP-hard [11], none of these algorithms is a polynomial time (PTIME) exact algorithm. That is, the existing algorithms which can solve the problem to optimality are all exponential [9][12][13][14]. However, because of their exponential complexity, these algorithms are not of much use in practice. Consequently, many non-exact algorithms have been proposed which try to obtain 'good' solutions in acceptable time.

There are, in general, two classes of non-exact algorithms for CSP: approximation algorithms [15][16][17][18] and (meta)heuristic algorithms. The distinction between these classes is that the former guarantees a bound, called the approximation ratio, on the quality of the returned solution relative to the optimal one, whereas the latter does not. In practice, however, (meta)heuristic algorithms usually provide better solutions than approximation algorithms.

One of the first metaheuristic algorithms proposed for CSP was a genetic algorithm which included a special pre-processing step on the input strings [19]. Another algorithm called iterative rounding was presented in [20]. A more sophisticated algorithm called CGSA was devised in [21]. The CGSA algorithm is based on a hybrid of Genetic Algorithm (GA) and Simulated Annealing (SA). It was shown to be superior to its predecessor GA algorithm proposed in [19]. Finally, Julstrom proposed another genetic algorithm in [22]. The main idea of this algorithm is to use the frequency of characters in each position of input set to initialize the population. The proposed technique was shown to be beneficial, but the algorithm was not compared with CGSA.

CGSA, as its name suggests, hybridizes a Genetic Algorithm with Simulated Annealing. A genetic algorithm is well known for its ability to explore the search space broadly, but it is weak in exploiting the power of neighborhood search. Simulated annealing, in contrast, has no global view of the search space but is strong in local search. Therefore, hybridizing these algorithms is a promising way of finding high-quality solutions.

In this paper, a Memetic Algorithm (MA) is proposed for the closest string problem. It is then extended to address the farthest string problem. In a memetic algorithm, local search is incorporated within an evolutionary algorithm. Memetic algorithms have been successfully applied to many optimization problems such as knapsack [23] and graph coloring [24]. In our proposed algorithm, local search is added to the GA, so that the neighborhoods of the chromosomes are also investigated.

The proposed memetic algorithm works as follows. At the end of each iteration, some of the best chromosomes of the population are selected to which a local search function is applied. In addition, the local search function is applied to the other chromosomes of the population but with a low probability. Once the local search is applied to a chromosome, if the best neighbor has a higher fitness value, it will replace the original chromosome.

We have extended our memetic algorithm to address the farthest string problem as well. Lanctot et al. proved in [25] that this problem is NP-hard. Some efforts have been made to solve it using exact and approximation algorithms [13][26]. However, to the best of our knowledge, no (meta)heuristic algorithm has yet been presented for this problem. In this paper, a memetic algorithm is proposed as the first metaheuristic for it.

The rest of the paper is organized as follows. Section 2 presents the preliminaries required for the rest of the paper. In Section 3, memetic algorithms are introduced. Our proposed algorithm is described in Section 4. Section 5 provides the experimental results. Section 6 concludes the paper.

Preliminaries

Let Σ be an alphabet of characters; then Σ^m denotes the set of all possible strings of length m over Σ. The distance between two strings x and y in Σ^m is defined as their Hamming distance, given by

d(x, y) = \sum_{i=1}^{m} \delta(x_i, y_i)    (1)

where x_i and y_i denote the ith characters of the strings x and y, respectively, and

\delta(a, b) = \begin{cases} 0, & a = b \\ 1, & a \neq b \end{cases}    (2)

The closest string problem is defined as follows:

Instance: a set S = {s_1, s_2, ..., s_n} of strings of length m over Σ

Output: a string x of length m over Σ

Minimize: \max_{s_i \in S} d(x, s_i)

The farthest string problem is defined as follows:

Instance: a set S = {s_1, s_2, ..., s_n} of strings of length m over Σ

Output: a string x of length m over Σ

Maximize: \min_{s_i \in S} d(x, s_i)

Given a set S of strings, the Hamming diameter is defined as follows.

diam(S) = \max_{s_i, s_j \in S} d(s_i, s_j)    (3)

Example. Assume S = {s_1, s_2, s_3} over Σ = {a, c, g, t}, where s_1 = acct, s_2 = agct, and s_3 = tcgt. Then d(s_1, s_2) = 1, d(s_1, s_3) = 2, and d(s_2, s_3) = 3. Therefore, diam(S) = 3.
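For concreteness, the Hamming distance of equation (1) and the Hamming diameter of equation (3) can be computed as in the following minimal C++ sketch; the function names are ours and are used only for illustration.

#include <cassert>
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Hamming distance between two equal-length strings (equation (1)).
int hammingDistance(const std::string& x, const std::string& y) {
    assert(x.size() == y.size());
    int d = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (x[i] != y[i]) ++d;
    return d;
}

// Hamming diameter of a set of strings (equation (3)):
// the maximum pairwise Hamming distance.
int hammingDiameter(const std::vector<std::string>& S) {
    int diam = 0;
    for (std::size_t i = 0; i < S.size(); ++i)
        for (std::size_t j = i + 1; j < S.size(); ++j)
            diam = std::max(diam, hammingDistance(S[i], S[j]));
    return diam;
}

int main() {
    std::vector<std::string> S = {"acct", "agct", "tcgt"};
    std::cout << hammingDistance(S[0], S[1]) << "\n"; // prints 1
    std::cout << hammingDiameter(S) << "\n";          // prints 3
}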

Memetic Algorithm

The memetic algorithm has its roots in universal Darwinism, an idea expressed by Richard Dawkins in [27] that describes a unified framework for the organization of evolution in complex systems. Universal Darwinism suggests that evolution applies not only to biological systems but to any complex system whose main principles include inheritance, mutation, and selection.

Inspired by these Darwinian principles of natural evolution, the term "Memetic Algorithm" (MA) was proposed in 1989 by Moscato [28] and was defined as a hybrid of an evolutionary algorithm with an individual learning procedure capable of local improvement. MAs have been successfully applied to many NP-hard problems, including knapsack [23], graph coloring [24], and gene expression profiles [29].

Two main components of a memetic algorithm are the neighborhood structure and the local search (LS) heuristic. The neighborhood of a chromosome is defined as the set of chromosomes reachable by applying a specific operator to the primary chromosome. When the neighbors of a chromosome have been determined, one of them with a better fitness than that of the primary chromosome, if any, is selected as its successor. An important decision must also be made about which members of the population to select for further improvement via local search.

Once these components are defined and an evolutionary algorithm has been adopted, the MA is fully specified: at each generation, local search is run on the selected members of the population. Pseudo code for a general memetic algorithm is presented in Figure 1.

Proposed Algorithm

In this section, we describe our proposed memetic algorithm for CSP and FSP. First, the underlying genetic algorithm and local search are individually described and then, the memetic algorithm for CSP is presented. Finally, the algorithm is modified to address the farthest string problem.

begin

  Initialize population P; BestSoFar = a solution in P;

  Repeat until the termination condition is met

    NewP = {};  (empty population)

    Repeat until NewP is fully populated

      Select two chromosomes from P;

      Cross over the selected chromosomes;

      Mutate the resulting offspring;

      Add the offspring to NewP;

    Select chromosomes from NewP for LS;

    Apply LS to the selected chromosomes of NewP;

    P = NewP;

    Update BestSoFar;

  Return BestSoFar;

end

Figure 1. Memetic algorithm pseudo code

Genetic algorithm

A genetic algorithm consists of three main phases, namely population initialization, crossover, and mutation [28], which are described next. To create the initial population, some pre-processing techniques are applied. The first is intended to reduce the size of the input strings and works as follows: positions in which all strings have the same character are detected and deleted from the strings. This pre-processing reduces the run time of the algorithm [21].

Another pre-processing technique is to calculate the frequency of each alphabet character at various positions within the input strings. Two strings are created based on these calculations. The first string which is called the best string consists of characters whose frequencies at each position are more than a predefined threshold. The other string, called the worst string, consists of characters whose frequencies at each position are less than another threshold. When the best and the worst strings are created, the initial population is generated randomly, except that the characters of the worst string are not allowed to appear in the initial population. The best string is used for the purpose of mutation which is described later in this section.
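As an illustration of this pre-processing, the hedged C++ sketch below removes the columns in which all input strings agree and derives a best and a worst string from per-position character frequencies. The struct and function names, the '-' placeholder for positions with no qualifying character, and the choice to keep only one representative character per position are our own simplifications; the thresholds presumably correspond to Pbest and Pworst in Tables 1 and 2.

#include <map>
#include <string>
#include <vector>

struct Preprocessed {
    std::vector<std::string> reduced; // input strings with constant columns removed
    std::string best;                 // per-position character with frequency >= pBest ('-' if none)
    std::string worst;                // per-position character with frequency <= pWorst ('-' if none)
};

Preprocessed preprocess(const std::vector<std::string>& S, double pBest, double pWorst) {
    Preprocessed out;
    std::size_t n = S.size(), m = S[0].size();
    out.reduced.assign(n, "");
    for (std::size_t p = 0; p < m; ++p) {
        bool constant = true;
        for (std::size_t i = 1; i < n && constant; ++i)
            constant = (S[i][p] == S[0][p]);
        if (constant) continue;                  // drop columns where all strings agree
        for (std::size_t i = 0; i < n; ++i)
            out.reduced[i] += S[i][p];
        std::map<char, int> freq;                // character frequencies at this position
        for (std::size_t i = 0; i < n; ++i) ++freq[S[i][p]];
        char best = '-', worst = '-';
        for (auto& kv : freq) {
            double f = double(kv.second) / n;
            if (f >= pBest)  best = kv.first;
            if (f <= pWorst) worst = kv.first;
        }
        out.best += best;
        out.worst += worst;
    }
    return out;
}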

The second phase of the genetic algorithm is crossover. In this phase, two members are selected as parents, crossover is performed, and two new children are created. Parent selection is performed using the roulette wheel method; hence, chromosomes of higher fitness have a greater chance of participating in producing the next generation. In our algorithm, crossover and mutation are applied probabilistically in a similar way as in the CGSA algorithm for CSP [21]. These probabilities are given by equations (4) and (5).

(4)

(5)

These equations involve the minimum and maximum crossover probabilities, the minimum and maximum mutation probabilities, the maximum fitness of the old population, the average fitness of the old population, and the maximum fitness of the two parent chromosomes. The minimum and maximum values of the crossover and mutation probabilities are constants determined empirically.
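The roulette wheel parent selection mentioned above is standard fitness-proportional selection; a short C++ sketch (with names of our own choosing) follows.

#include <random>
#include <vector>

// Fitness-proportional (roulette wheel) selection: returns the index of the
// selected chromosome; higher fitness means a larger slice of the wheel.
std::size_t rouletteSelect(const std::vector<double>& fitness, std::mt19937& rng) {
    double total = 0.0;
    for (double f : fitness) total += f;
    std::uniform_real_distribution<double> dist(0.0, total);
    double r = dist(rng), acc = 0.0;
    for (std::size_t i = 0; i < fitness.size(); ++i) {
        acc += fitness[i];
        if (r <= acc) return i;
    }
    return fitness.size() - 1; // guard against floating-point round-off
}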

In this genetic algorithm, a two-point crossover is used. The crossover points are selected randomly and the parents are crossed over at these points. The resulting chromosomes are mutated with the probability given by equation (5). To mutate a chromosome, a position is selected randomly and, if its character differs from the corresponding character of the best string, it is replaced with that of the best string. If the best string does not suggest any character for that position, a random character is selected to replace the current character at the mutation point.
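The two-point crossover and the mutation toward the best string can be sketched as follows. This is an illustrative rendering of the description above, assuming (as in the earlier sketch) that '-' marks positions for which the best string suggests no character.

#include <random>
#include <string>
#include <utility>

// Two-point crossover: swap the segment between two random cut points.
std::pair<std::string, std::string>
twoPointCrossover(std::string a, std::string b, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pos(0, a.size());
    std::size_t p1 = pos(rng), p2 = pos(rng);
    if (p1 > p2) std::swap(p1, p2);
    for (std::size_t i = p1; i < p2; ++i) std::swap(a[i], b[i]);
    return {a, b};
}

// Mutation as described in the text: at a random position, copy the character
// suggested by the best string; if the best string has no suggestion ('-'),
// pick a random alphabet character instead.
void mutate(std::string& chrom, const std::string& best,
            const std::string& alphabet, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pos(0, chrom.size() - 1);
    std::size_t p = pos(rng);
    if (best[p] != '-' && chrom[p] != best[p]) {
        chrom[p] = best[p];
    } else if (best[p] == '-') {
        std::uniform_int_distribution<std::size_t> ch(0, alphabet.size() - 1);
        chrom[p] = alphabet[ch(rng)];
    }
}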

When two new chromosomes have been created, a competition between them and their parents takes place. Each parent competes with one of the children for survival in the next generation. The parent is selected with the probability given by equation (6), which depends on the energies of the parent and the child, defined as their fitness values. The probability of selecting the child is one minus that of the parent.

(6)
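Equation (6) is not reproduced here; purely as a placeholder, the sketch below assumes a simple fitness-proportional rule in which the parent survives with probability e_p / (e_p + e_c). This matches the description but may differ from the exact formula used by the authors.

#include <random>
#include <string>

// Hedged sketch of the parent/child competition: the parent survives with a
// probability that grows with its energy (fitness) relative to the child's.
// The exact form of equation (6) is not reproduced in the text; a simple
// proportional rule e_p / (e_p + e_c) is assumed purely for illustration.
std::string compete(const std::string& parent, double eParent,
                    const std::string& child, double eChild,
                    std::mt19937& rng) {
    double pParent = (eParent + eChild > 0.0)
                         ? eParent / (eParent + eChild)
                         : 0.5;
    std::bernoulli_distribution keepParent(pParent);
    return keepParent(rng) ? parent : child;
}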

Local search

The local search function is intended to improve a chromosome by finding a better chromosome in its neighborhood. In our algorithm, the neighborhood of a string is defined as the set of strings whose Hamming distance from it is exactly one. The local search adopted in our algorithm operates as shown in the pseudo code of Figure 2. The weight of each string used in the algorithm is defined by equation (7), where dist[i] denotes the distance between the ith string of the input set and the string to which the local search is applied.

(7)

Input: A string X of length m with distance d from a set S of strings of the same length

Output: String X, possibly modified for a smaller distance from S

Begin

  improve = true;

  While improve is true do

    improve = false;

    For all positions p of X do

      For all alphabet characters ch do score[ch] = 0;

      For all strings s in S do

        score[s[p]] = score[s[p]] + weight[s];

      For all alphabet characters ch do

        If score[ch] > score[X[p]] then

          X[p] = ch;

          improve = true;

End

Figure 2. The pseudo code for the local search procedure
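A compilable C++ rendering of the local search in Figure 2 might look as follows; since equation (7) is not reproduced here, the per-string weights are passed in by the caller rather than computed.

#include <map>
#include <string>
#include <vector>

// Local search of Figure 2: repeatedly scan every position of X and replace its
// character with the alphabet character of highest weighted support among the
// input strings, until no position can be improved. The per-string weights
// (equation (7)) are supplied by the caller.
void localSearch(std::string& X, const std::vector<std::string>& S,
                 const std::vector<double>& weight, const std::string& alphabet) {
    bool improve = true;
    while (improve) {
        improve = false;
        for (std::size_t p = 0; p < X.size(); ++p) {
            std::map<char, double> score;
            for (char ch : alphabet) score[ch] = 0.0;   // reset scores for this position
            for (std::size_t i = 0; i < S.size(); ++i)
                score[S[i][p]] += weight[i];            // weighted support for each character
            for (char ch : alphabet) {
                if (score[ch] > score[X[p]]) {          // found a better character
                    X[p] = ch;
                    improve = true;
                }
            }
        }
    }
}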

Input: a set S of n strings of length m

Output: a string solution X whose distance from S is to be minimized; X is not necessarily optimal

Begin

  Initialize population P;

  do #iterations times

    NewP = {};

    do #population_size / 2 times

      select randomly two chromosomes par1 and par2 in P;

      calculate pcrossover and pmutate;

      with probability pcrossover do a two-point crossover of par1 and par2;

      let ch1 and ch2 be the resulting offspring;

      with probability pmutate apply mutation to ch1 and ch2;

      winner1 = compete(par1, ch1); winner2 = compete(par2, ch2);

      add winner1 and winner2 to the new population NewP;

    for each chromosome chrm in NewP do

      if fitness(chrm) > 0.95 × maximum fitness then

        local search(chrm);

      else with probability 0.05

        local search(chrm);

    P = NewP;

  return the chromosome with the maximum fitness

End

Figure 3. Pseudo code of the proposed memetic algorithm

Memetic Algorithm

Following the definition of the memetic algorithm [28], at the end of each generation of the GA, the local search is applied to the chromosomes whose fitness values are close to the maximum fitness found so far (those above 95% of it). The other chromosomes in the population may also be selected, but with a low probability. Once the local search is applied to a chromosome, the chromosome is replaced by a better one, if any, in its neighborhood. The pseudo code of the algorithm is shown in Figure 3.

Algorithm Adaptation for FSP

Some modifications to the proposed algorithm for CSP are needed to extend it to FSP. In this section, these modifications are explained.

First, in the pre-processing procedure used to reduce the size of an instance, for each position the character (if any) that does not appear in that position of any string is treated as the best character for that position of the solution; such a position is then deleted from the input set. Second, the definitions of the best and the worst strings are also changed.

The best string now contains, for each position, a character whose frequency in the input strings is less than a predefined threshold. The worst string, on the other hand, contains characters whose frequencies are more than the corresponding threshold.

Finally, in the memetic algorithm for FSP, the weight of each string is set using equation (8). The fitness of each chromosome is its minimum distance from the set of input strings. The other parts of the algorithm remain unchanged.

(8)
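Equation (8) is likewise not reproduced. Purely as an illustration, the following sketch shows one plausible pair of fitness functions consistent with the two objectives; the exact CSP fitness used by the authors is not spelled out in the text, so the form m minus the maximum distance is our assumption.

#include <algorithm>
#include <string>
#include <vector>

int hammingDistance(const std::string& x, const std::string& y) {
    int d = 0;
    for (std::size_t i = 0; i < x.size(); ++i) if (x[i] != y[i]) ++d;
    return d;
}

// CSP: the objective is to minimize the maximum distance, so we assume a
// fitness of m - max distance (to be maximized by the GA).
int cspFitness(const std::string& x, const std::vector<std::string>& S) {
    int worst = 0;
    for (const auto& s : S) worst = std::max(worst, hammingDistance(x, s));
    return static_cast<int>(x.size()) - worst;
}

// FSP: the objective is to maximize the minimum distance, so the fitness is
// simply that minimum distance.
int fspFitness(const std::string& x, const std::vector<std::string>& S) {
    int best = static_cast<int>(x.size());
    for (const auto& s : S) best = std::min(best, hammingDistance(x, s));
    return best;
}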

TABLE 1 PARAMETER VALUES IN MA FOR CSP

Pop_size   Iterations   Pbest   Pworst   Plocal_search
50         200          0.6     0.05     0.05

TABLE 2 PARAMETER VALUES IN MA FOR FSP

Pop_size   Iterations   Pbest   Pworst   Plocal_search
50         200          0.4     0.95     0.05

Computational Results

All the experiments were performed on a 2.4 GHz Core 2 Duo (T8300) CPU with 3 GB of RAM running Windows Vista. The proposed memetic algorithm was implemented in C++ using Visual Studio 6. Tables 1 and 2 show the parameter values used in our memetic algorithm for CSP and FSP, respectively.

We have compared our results with those of two recent metaheuristic algorithms proposed for CSP [21,22] with respect to solution quality. We have not made a time comparison because of the differences between the specifications of our machine and those of the machines used in [21] and [22]. However, we report our run times to show that the proposed algorithm provides good-quality solutions in reasonable time.

Table 3 compares the proposed MA with the data-based GA on random instances generated as described next. Each problem instance consists of 10 strings of length 500. Experiments have been performed with alphabet sizes of 2, 4, 10, 20, and 30. For each alphabet size, we have considered five classes of instances, each with a specified Hamming diameter, as in [22]. While the random strings of an instance are generated, the Hamming diameter is continually checked to ensure that the instance has the pre-specified value; if the measured diameter falls below this value, some characters of the strings are changed at random to restore the required diameter.
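Purely as an illustration of this generation procedure, the following C++ sketch creates an instance with a prescribed Hamming diameter d (assuming d ≤ m and an alphabet of at least two characters); the authors' exact repair step may differ.

#include <random>
#include <string>
#include <vector>

int hamming(const std::string& x, const std::string& y) {
    int d = 0;
    for (std::size_t i = 0; i < x.size(); ++i) if (x[i] != y[i]) ++d;
    return d;
}

// Generate n random strings of length m and, while the diameter is below d,
// mutate one string of a maximum-distance pair at a position where the two
// strings still agree.
std::vector<std::string> makeInstance(int n, int m, int d,
                                      const std::string& alphabet,
                                      std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> ch(0, alphabet.size() - 1);
    std::uniform_int_distribution<int> pos(0, m - 1);
    std::vector<std::string> S(n, std::string(m, ' '));
    for (auto& s : S)
        for (int p = 0; p < m; ++p) s[p] = alphabet[ch(rng)];
    for (;;) {
        int a = 0, b = 1, diam = -1;                   // track the current diameter pair
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j) {
                int dij = hamming(S[i], S[j]);
                if (dij > diam) { diam = dij; a = i; b = j; }
            }
        if (diam >= d) break;                          // required diameter reached
        int p = pos(rng);
        while (S[a][p] != S[b][p]) p = pos(rng);       // pick an agreeing position
        char c = alphabet[ch(rng)];
        while (c == S[a][p]) c = alphabet[ch(rng)];    // change it to a different character
        S[a][p] = c;
    }
    return S;
}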

The columns of Table 3 are as follows. The first column shows the size n of the alphabet Σ. The second column, Num, is the class number, and the third column gives the Hamming diameter of the class. The fourth to seventh columns present the results of the data-based GA, taken from [22]: the minimum, the average, and the standard deviation of the distances obtained over the 10 runs of each class, followed by the average run time. The next four columns show the corresponding values for our MA. Finally, the last column, α%, reports the average improvement achieved by MA.

As can be seen in Table 3, MA outperforms the data-based GA in 22 out of 25 cases, with an average improvement of nearly one percent. The three exceptional cases, in which MA does not find a better solution than the data-based GA, all have |Σ| = 2. For all other cases, where the alphabet contains more than two letters, MA provides solutions of higher quality.

TABLE 3 COMPARISON BETWEEN DATA-BASED GA AND MA ON CSP

Instance               Data-based GA                       Memetic algorithm                  α%
n=|Σ|  Num  Diam       Best   Mean    S      Time          Best   Mean    S      Time
2      1    272        188    189.3   0.69   22.5          186    189.6   1.62   1.89      -0.16
       2    267        187    188.5   0.88   22.4          188    188.8   1.60   2.90      -0.16
       3    281        191    192.9   0.91   22.4          189    191.6   2.15   4.53       0.67
       4    273        194    195.9   0.94   24.5          188    190.4   1.74   4.35       2.81
       5    275        189    191.0   0.93   22.4          189    191.8   1.94   3.55      -0.42
4      1    398        292    293.8   0.93   23.0          289    292.8   2.56   2.84       0.34
       2    394        293    294.0   0.68   23.0          289    290     1.26   5.87       1.36
       3    398        292    293.3   0.89   23.1          289    291.6   1.50   3.34       0.58
       4    392        294    295.9   0.90   23.0          288    290     2.28   6.13       1.99
       5    397        294    296.3   0.75   22.9          289    291.4   2.24   1.54       1.65
10     1    461        364    365.0   0.36   21.8          362    363.2   1.17   2.91       0.49
       2    466        363    364.2   0.65   21.9          361    363.4   1.36   2.81       0.22
       3    463        366    367.1   0.47   21.8          361    364.2   1.83   3.54       0.79
       4    466        368    368.4   0.50   21.6          361    362.4   0.80   3.80       1.63
       5    465        365    366.3   0.68   21.7          361    363.6   1.74   3.33       0.74
20     1    489        395    397.3   0.76   20.5          391    392.4   1.02   3.27       1.23
       2    486        394    395.5   0.64   20.6          391    392     1.41   1.74       0.88
       3    486        396    396.9   0.72   20.6          389    391.6   0.49   1.55       1.34
       4    485        395    396.5   0.72   20.6          390    391.8   1.33   3.69       1.19
       5    483        395    396.5   0.60   20.6          389    391.6   0.80   1.77       1.24
30     1    491        408    408.7   0.68   20.0          401    403.8   1.94   1.99       1.20
       2    489        407    408.3   0.62   20.2          403    403     0.00   6.07       1.30
       3    492        403    405.3   0.92   20.1          403    404     0.89   2.96       0.32
       4    492        409    410.5   0.81   19.9          402    403.6   3.31   1.20       1.68
       5    491        407    408.2   0.69   20.0          402    403.8   0.98   2.75       1.08
Average improvement:                                                                        0.96

Table 4 compares the performance of our algorithm with that of CGSA as reported in [21]. The instances used in [21] are over a two-character alphabet, and the CGSA algorithm was run on only one instance of each size, whereas the results of our algorithm are averages over 10 instances of each size. By size we mean a pair (n, m), where n is the number of strings and m is their length. As can be seen in Table 4, MA finds better results than CGSA in 16 out of 18 cases (about 89%). On average, MA improves on CGSA by about 3.63%.

Finally, Table 5 shows the results of our memetic algorithm customised for FSP on the same instances as used in Table 4. The algorithm was run on 10 instances of each size, and the average solution quality and run time are reported.

Conclusion

In this paper, we have studied the closest and the farthest string problems. Some heuristic and metaheuristic methods have previously been suggested for the closest string problem. We have presented a memetic algorithm for these problems which provides good-quality solutions in acceptable time.

TABLE 4 COMPARISON BETWEEN MA AND CGSA ON CSP

                 CGSA                   MA
(n,m)            Distance   Time        Distance   Time        α%
(10,300)         121        8           116.70     1.76         3.55
(10,400)         162        12          153.50     2.71         5.25
(10,500)         201        14          191.80     1.79         4.58
(10,600)         241        17          228.90     3.19         5.02
(10,700)         284        20          266.50     4.00         6.16
(10,800)         328        23          302.90     2.77         7.65
(15,300)         130        11          131.11     2.42        -0.85
(15,400)         172        14          170.30     3.70         0.99
(15,500)         215        18          215.90     2.18        -0.42
(15,600)         256        21          254.50     5.64         0.59
(15,700)         299        25          296.60     7.36         0.80
(15,800)         347        28          339.20     4.57         2.25
(20,300)         136        13          130.10     3.01         4.34
(20,400)         177        16          172.20     5.91         2.71
(20,500)         224        20          211.40     6.02         5.63
(20,600)         270        25          254.10     7.42         5.89
(20,700)         311        29          293.90     4.72         5.50
(20,800)         356        33          335.40     5.33         5.79
Average improvement:                                             3.63

TABLE 5 RESULTS OF MEMETIC ALGORITHM ON FSP

                 MA
(n,m)            Distance   Time
(10,300)         184.20     4.14
(10,400)         246.30     5.50
(10,500)         308.60     6.63
(10,600)         371.20     8.05
(10,700)         432.40     9.11
(10,800)         496.40     10.67
(15,300)         171.89     6.94
(15,400)         232.30     9.10
(15,500)         288.40     10.73
(15,600)         349.20     14.09
(15,700)         405.90     16.47
(15,800)         464.80     20.31
(20,300)         170.80     8.14
(20,400)         228.60     10.43
(20,500)         289.30     12.73
(20,600)         347.00     15.09
(20,700)         406.60     16.72
(20,800)         465.90     19.70

We have compared the results of our algorithm with those of two state-of-the-art algorithms for CSP, with positive results. Given the successful results reported in this paper, a promising avenue for future work is to develop similar algorithms for other sequences consensus problems.
