# Memetic Algorithm For Closest String Problem Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract- Sequences consensus problems are especially important in studying molecular evolution, protein structures, and drug target design. In this paper, we work on two of these problems, namely closest string problem and farthest string problem. These problems are NP-hard, and none of exact algorithms already proposed to solve them is in polynomial time. Many non-exact algorithms have been proposed which try to obtain 'good' solutions in acceptable time for these problems. In this paper, a Memetic Algorithm (MA) is proposed for the closest string problem, which outperforms the existing algorithms. We then extend the proposed algorithm to address the farthest string problem.

Keywords- Memetic algorithm, string selection problems, Bioinformatics, metaheuristic

## Introduction

Computational biology and Bioinformatics are two important interdisciplinary fields which connect biology and information technology. An important goal of these fields is to interpret genome and protein data, which has led to such problems as genome rearrangement [1], sequence alignment [2], and sequences consensus problem [3], most of which are known to be NP-hard [4]. Sequences consensus problems include closest string problem (CSP) and farthest string problem (FSP), which have applications in drug design [5][6] and coding theory [7,8].

The closest string problem seeks a string whose distance from all the strings in a given set of strings is the minimum. The distance between two strings is defined as their hamming distance [9]. Consequently, the solution to the problem is a sequence with maximum 'similarity' to the input strings. The farthest string problem, on the other hand, is the reverse of the closest string problem in the sense that it seeks a string whose distance from the given strings is the maximum. These problems are especially important in the context of studying molecular evolution, protein structures, and drug target design. For example, one of their applications in drug target design is to find a genetic sequence that kills all pathogenic bacteria without any impact on human being genes [10]. The closest string problem also has applications in the coding theory, data compression, error decoding, and speech coding [8]

Several algorithms have been proposed to solve the closest string problem. Since this problem is NP-hard [11], none of these algorithms is a polynomial time (PTIME) exact algorithm. That is, the existing algorithms which can solve the problem to optimality are all exponential [9][12][13][14]. However, because of their exponential complexity, these algorithms are not of much use in practice. Consequently, many non-exact algorithms have been proposed which try to obtain 'good' solutions in acceptable time.

There are in general two classes of non-exact algorithms for CSP, which are approximation algorithms [15][16][17][18] and (meta)heuristic algorithms. The distinction between these classes of algorithms is that the former guarantees a bound, called approximation ratio, on the optimal solution whereas the latter does not. In practice, however, (meta)heuristic algorithms usually provide better solutions than those provided by approximation algorithms.

One of the first metaheuristic algorithms proposed for CSP was a genetic algorithm which included a special pre-processing step on the input strings [19]. Another algorithm called iterative rounding was presented in [20]. A more sophisticated algorithm called CGSA was devised in [21]. The CGSA algorithm is based on a hybrid of Genetic Algorithm (GA) and Simulated Annealing (SA). It was shown to be superior to its predecessor GA algorithm proposed in [19]. Finally, Julstrom proposed another genetic algorithm in [22]. The main idea of this algorithm is to use the frequency of characters in each position of input set to initialize the population. The proposed technique was shown to be beneficial, but the algorithm was not compared with CGSA.

CGSA, as its name stands, hybridizes Genetic Algorithm and Simulated Annealing. A genetic algorithm is well-known to be capable of spreading the search space, but it is weak in exploiting the power of neighborhood search. On the contrary, simulated annealing has no information about the whole search space, whereas it is strong in local search. Therefore, hybridization of these algorithms is a promising trend in finding high quality solutions in the search space.

In this paper, a Memetic Algorithm (MA) is proposed for the closest string problem. It is then extended to address the farthest string problem. In a memetic algorithm, local search is incorporated within an evolutionary algorithm. Memetic algorithms have been successfully applied to many optimization problems such as knapsack [23] and graph coloring [24]. In our proposed algorithm, local search is added to GA, as a consequence of which neighborhood of the chromosomes are also investigated.

The proposed memetic algorithm works as follows. At the end of each iteration, some of the best chromosomes of the population are selected to which a local search function is applied. In addition, the local search function is applied to the other chromosomes of the population but with a low probability. Once the local search is applied to a chromosome, if the best neighbor has a higher fitness value, it will replace the original chromosome.

We have extended our memetic algorithm to address the farthest string problem as well. Lanctot et al. proved in [25] that this problem is NP-Hard. Some efforts have been made to solve this problem using exact and approximation algorithms [13][26]. However, to our best knowledge, no (meta)heuristic algorithm has yet been presented for this purpose. In this paper, a memetic algorithm is proposed as the first metaheuristic for this problem.

The rest of the paper is organized as follows. Section 2 presents the preliminaries required for the rest of the paper. In Section 3, memetic algorithms are introduced. Our proposed algorithm is described in Section 4. Section 5 provides the experimental results. Section 6 concludes the paper.

## preliminaries

Let be an alphabet of characters, then presents all possible strings of length using the characters in . The distance between two strings and is defined as their hamming distance, which is defined as follows.

(1)

Where and denote the ith characters of the strings and , respectively, and

(2)

The closest string problem is defined as follows:

Instance: a set of strings on

Output: a string on

## Minimize:

The farthest string problem is defined as follows:

Instance: a set of strings on

Output: a string on

## Maximize:

Given a set S of strings, the hamming diameter is defined as follows.

(3)

Example. Assume where , and

Then, ,, and . Therefore,

## memetic algorithm

Memetic algorithm has its roots in universal Darwinism. The idea of universal Darwinism was expressed by Richard Dawkins in [27] which describes a unified framework for organization of evolution in complicated systems. Universal Darwinism suggests that evolution belongs not only to biological systems but also to any complicated system whose main principles include inheritance, mutation and selection.

Inspired from Darwinism principles in natural evolution, the term "Memetic Algorithm" (MA) was proposed in 1989 by Moscato [28] and was defined as a hybrid of an evolutionary algorithm with an individual learning procedure that has the capability of local improvement. MA has been successfully applied to many NP-hard problems, including knapsack [23], graph coloring [24] and gene expression profile [29].

Two main paradigms of memetic algorithms are neighborhood structure and local search (LS) heuristic. The Neighborhood of a chromosome is defined as a set of chromosomes each of which reachable by applying a specific operator to the primary chromosome. When the neighbors of a chromosome are determined, one of them with a better fitness than that of the primary chromosome, if any, will be selected as its successor. Also an important decision should be made about which members of the population to select for further improvement via local search.

After defining these main rules, and having adopted an evolutionary algorithm, MA is well-defined: at each generation, local search is run on the selected members of the population. A pseudo code for a general memetic algorithm is presented in Figure 1.

## Proposed Algorithm

In this section, we describe our proposed memetic algorithm for CSP and FSP. First, the underlying genetic algorithm and local search are individually described and then, the memetic algorithm for CSP is presented. Finally, the algorithm is modified to address the farthest string problem.

## begin

Initialize population P; BestSoFar= a solution in P

Repeat Until the termination condition not met

newP= {}; empty population

Repeat Until newP is populated

Select two chromosomes from P;

Cross over on the selected chromosomes;

Mutate the resulting offsprings;

Select chromosomes from NewP for LS;

Apply LS to the selected chromosomes of NewP;

P= NewP;

Update BestSoFar;

Return BestSoFar;

## end

Figure 1. Memetic algorithm pseudo code

## Genetic algorithm

A genetic algorithm consists of three main phases namely population initialization, cross over and mutation [28], which are described next. To create initial population, some pre-processing techniques are applied. The first one is intended to reduce the size of the input strings. This phase works as follows. Positions in which all strings have the same character are detected and deleted from the strings. This pre-processing technique is useful to reduce the run-time of the algorithm [21].

Another pre-processing technique is to calculate the frequency of each alphabet character at various positions within the input strings. Two strings are created based on these calculations. The first string which is called the best string consists of characters whose frequencies at each position are more than a predefined threshold. The other string, called the worst string, consists of characters whose frequencies at each position are less than another threshold. When the best and the worst strings are created, the initial population is generated randomly, except that the characters of the worst string are not allowed to appear in the initial population. The best string is used for the purpose of mutation which is described later in this section.

Second phase of genetic algorithm is the crossover phase. In this phase, two members are selected as parents, cross over is performed, and two new children are created. Parent selection is performed using the roulette wheel method; hence chromosomes of larger fitness have more chance to participate in producing next generation. In our algorithm, cross over and mutation are applied probabilistically in a similar way as in CGSA algorithm for CSP [21]. These probabilities are shown by equations (4) and (5).

(4)

(5)

In these equations, is the minimum probability for crossover, is the maximum value of this probability, and show the minimum and maximum probabilities for mutation, respectively. is the maximum fitness of the old population, is the average fitness of the old population, and is the maximum fitness of the parent chromosomes. The maximum and minimum values for cross over and mutation probabilities, are constant values obtained by experience.

In this genetic algorithm, a two point crossover is used. Crossover points are selected randomly and parents are crossed over these points. Created chromosomes will mutate with the probability from equation (5). To mutate a chromosome, a position is selected randomly and if its character differs from the corresponding character of the best string, it will be replaced with that of the best string. If the best string does not suggest any character for that position, a random character will be selected to replace the current character of the mutation point.

When two new chromosomes are created, a competition between them and their parents occurs. Each parent competes with one of the children for survival in the next generation. The parent will be selected by the probability presented in equation (6). The parameter is the energy of the parent and is the energy of the child, which are defined as their fitness. The probability for selecting the child is one minus that of the parent.

(6)

## Local searrch

The local search function is intended to improve a chromosome by finding a better chromosome in its neighbourhood. In our algorithm, the neighbourhood of a string is defined as the set of strings whose hamming distance from it is exactly one. The local search adopted in our algorithm operates as shown by the pseudo code of Figure 2. The weight of each string used in the algorithm is defined using equation (7), where dist[i] represents the distance between the ith string of input set and the string to which the local search is applied.

(7)

Input: A string X of length m with distance d from a set S of strings of the same length

Output: String X, possibly modified for a smaller

distance from S.

## Begin

improve= true

While improve is true do

improve =false

For all positions p of X do

For all alphabet characters ch do

For all strings s in S do

If ch = s[p]

score[ch]=score[ch]+ weight[s];

If(score[ch] > score[X[p]]

X[p] = ch;

improve = true;

## End

Figure 2. The psedu code for the local search procedure

Input: a set S of n strings of length m

Output: a string solution X whose distance from S is

to be minimized; X is not necessarily optimal

## Begin

Initialize population p; NewP={};

do #itreration times

do #population_size / 2 times

select randomly two chromosomes par1 and par2 in p

calculate pcrossover and pmutate;

by probability pcrossover do a two-point cross over par1 and par2;

let ch1 and ch2 be the resulting offsprings

by probability pmutate apply mutation;

winner1=compete(par1, ch1);winner2= compete(par2, ch2);

add winner1and winner2 to the new population NewP;

for each chromose chrm in tmpP do

if(fitness(chrm) > maximum fitness Ã- .95)

local search(chrm)

else by probability= .05

local search(chrm)

P=NewP;

return the chromosome with the maximum fitness

## End

Figure 3. psedu code of proposed Memetic algorithm

## Memetic Algorithm

Following the definition of memetic algorithm [28], at the end of each generation of GA, the local search is applied to chromosomes whose fitness values are near the maximum found fitness (those more than 95% of the maximum fitness). Moreover, the other chromosomes in the population are also selected but by a low probability. Once the local search is applied to a chromosome, it will be replaced by a better one, if any, in its neighborhood. The pseudo code for the algorithm is shown in Figure 3.

Some modifications on the proposed algorithm for CSP are needed in order to extend it to address FSP. In this section, these modifications are explained.

First, in the pre-processing procedure used to decrease the size of an instance, for each position the character (if any) which is not used in that position of any strings is treated as the best character for that position of the solution; such a position is then deleted from the input set. Second, the definition for the best and the worst strings are also changed.

The best string now contains characters for each position whose frequencies in input strings are less than a predefined threshold. The worst string, on the other hand, contains characters whose frequencies are more than the corresponding threshold.

Finally, in the memetic algorithm for FSP, the weight of each string is set using equation (8). The fitness of each chromosome is its maximum distance from the set of input strings. Other parts of the algorithm for FSP remain intact.

(8)

TABLE 1 PARAMETER VALUES IN MA FOR CSP

## Plocal_search

50

200

.6

.05

.05

TABLE 2 PARAMETER VALUES IN MA FOR FSP

50

200

.4

.95

.05

## Computational Results

All the experiments were performed on a 2.4 core2 duo (T8300) CPU with 3GByte of RAM on Windows Vista. The proposed memetic algorithm has been implemented in C++ using Visual Studio 6. Tables 1 and 2 show the parameters used in our memetic algorithm for CSP and FSP, respectively.

We have compared our results with the results of two recent metaheuristic algorithms proposed for CSP [21,22] with respect to solution quality. We have not made time-comparison because of the difference between the specifications of our machine with those of [21] and [22]. However, we have reported our run-time to show that our proposed algorithm provides good-quality solutions in reasonable time.

Table 3 compares the proposed MA with the data-based GA over random instances generated as described next. Each problem instances includes 10 strings of 500 characters length. Experiments have been performed on alphabet sizes of 2, 4, 10, 20 and 30. For each alphabet size, we have considered five classes of instances each with a specified hamming diameter as in [22]. During the process of generating random strings of each instance of a specific class, the hamming diameter is continually checked enforcing the instance to preserve the pre specified hamming diameter. If the measured hamming diameter tends to be less than the pre specified value, some characters of the strings are randomly changed to restore the required diameter.

The columns of Table 3 are described next. The first column shows the number n of the characters in the alphabet ïƒ¥. The second column Num is the class number. Column shows the hamming diameter for the class. The fourth to the seventh columns present the result for the data-based GA. These results are obtained from [22]. The fourth to the seventh columns show, respectively, the minimum, the average, and the standard deviation of the results (distance) obtained over the 10 runs of each class. The seventh column shows the average run time. The next four columns show the respective values for our MA. Finally, the last column Î±% reports the average improvement made by MA.

As can be seen in Table 3, MA outperforms the data-based GA in 22 out of 25 cases, with the average improvement of nearly one percent. However, all the three exceptional cases, where it does not give a better solution than that of data-based GA, have the property of |ïƒ¥|=2. For all other cases, where the number of letters in ïƒ¥ is greater than 2, MA provides solutions of higher quality.

TABLE 3 COMPARISON BETWEEN DATA-BASED GA AND MA ON CSP

2

1

272

188

0.69

22.5

186

189.6

1.62

1.89

-0.16

2

267

187

0.88

22.4

188

188.8

1.60

2.90

-0.16

3

281

191

192.9

0.91

22.4

189

2.15

4.53

0.67

4

273

194

195.9

0.94

24.5

188

1.74

4.35

2.81

5

275

189

0.93

22.4

189

191.8

1.94

3.55

-0.42

4

1

398

292

293.8

0.93

23.0

289

2.56

2.84

0.34

2

394

293

294.0

0.68

23.0

289

1.26

5.87

1.36

3

398

292

293.3

0.89

23.1

289

1.50

3.34

0.58

4

392

294

295.9

0.90

23.0

288

2.28

6.13

1.99

5

397

294

296.3

0.75

22.9

289

2.24

1.54

1.65

10

1

461

364

365.0

0.36

21.8

362

1.17

2.91

0.49

2

466

363

364.2

0.65

21.9

361

1.36

2.81

0.22

3

463

366

367.1

0.47

21.8

361

1.83

3.54

0.79

4

466

368

368.4

0.50

21.6

361

0.80

3.80

1.63

5

465

365

366.3

0.68

21.7

361

1.74

3.33

0.74

20

1

489

395

397.3

0.76

20.5

391

1.02

3.27

1.23

2

486

394

395.5

0.64

20.6

391

1.41

1.74

0.88

3

486

396

396.9

0.72

20.6

389

0.49

1.55

1.34

4

485

395

396.5

0.72

20.6

390

1.33

3.69

1.19

5

483

395

396.5

0.60

20.6

389

0.80

1.77

1.24

30

1

491

408

408.7

0.68

20.0

401

1.94

1.99

1.20

2

489

407

408.3

0.62

20.2

403

0.00

6.07

1.30

3

492

403

405.3

0.92

20.1

403

0.89

2.96

0.32

4

492

409

410.5

0.81

19.9

402

3.31

1.20

1.68

5

491

407

408.2

0.69

20.0

402

0.98

2.75

1.08

## 0.96

Table 4 compares the performance of our algorithm with that of CGSA as reported in their paper [21]. In the instances used in [21], a two-character alphabet was used, and the CGSA algorithm was run on only one instance per each size. However, the results of our algorithm are based on the average distance over 10 instances per each size. By each size, we mean a pair of (n,m), where n here means the number of strings and m is their length. As can be seen in Table 4, MA could find better results than CGSA in 16 out of 18 (about 88%) of the cases. On average, MA improves CGSA by about 3.63%.

Finally, Table 5 shows the results of our memetic algorithm customised for FSP on the same instances as used in Table 4. The algorithm has been run on 10 instances of each size and the average quality and run-time have been reported.

## conclusion

In this paper we have studied the closest and the farthest string problems. Some heuristic and metaheuristic methods have previously been suggested for the closest string problem. In this paper, we have presented a memetic algorithm for these problems which provides in acceptable time solutions of good accuracy. We have compared the results of our algorithm with

TABLE 4 COMPARISON BETWEEN MA AND CGSA ON CSP

(n,m)

(10,300)

121

8

1.76

3.55

(10,400)

162

12

2.71

5.25

(10,500)

201

14

1.79

4.58

(10,600)

241

17

3.19

5.02

(10,700)

284

20

4.00

6.16

(10,800)

328

23

2.77

7.65

(15,300)

11

131.11

2.42

-0.85

(15,400)

172

14

3.70

0.99

(15,500)

18

215.90

2.18

-0.42

(15,600)

256

21

5.64

0.59

(15,700)

299

25

7.36

0.80

(15,800)

347

28

4.57

2.25

(20,300)

136

13

3.01

4.34

(20,400)

177

16

5.91

2.71

(20,500)

224

20

6.02

5.63

(20,600)

270

25

7.42

5.89

(20,700)

311

29

4.72

5.50

(20,800)

356

33

5.33

5.79

## 3.63

TABLE 5 RESULTS OF MEMETIC ALGORITHM ON FSP

## Time

(10,300)

184.20

4.14

(10,400)

246.30

5.50

(10,500)

308.60

6.63

(10,600)

371.20

8.05

(10,700)

432.40

9.11

(10,800)

496.40

10.67

(15,300)

171.89

6.94

(15,400)

232.30

9.10

(15,500)

288.40

10.73

(15,600)

349.20

14.09

(15,700)

405.90

16.47

(15,800)

464.80

20.31

(20,300)

170.80

8.14

(20,400)

228.60

10.43

(20,500)

289.30

12.73

(20,600)

347.00

15.09

(20,700)

406.60

16.72

(20,800)

465.90

19.70

the results of two state-of-the-art algorithms for CSP with positive results. Because of the successful results reported in this paper, a promising avenue for future work is to develop similar algorithms for the purpose of other sequences consensus problems.