Artificial Immune Systems (AIS) are systems and techniques inspired by the biological immune system. The biological immune system is an adaptive, complex and robust system that helps the body defend itself against foreign pathogens. Many ideas and concepts have been extracted from the biological immune system and applied to real-world engineering and scientific problems. Several biologically based theories have been used in Artificial Immune Systems, for example Negative Selection Theory, Clonal Selection Theory and Idiotypic Network Theory.
Data mining is one of the fields that has taken advantage of Artificial Immune Systems. Data mining is the process of finding and extracting interesting patterns from large datasets. Its main tasks are classification, regression, clustering and association rule mining.
The goal of this project is to show how to solve data mining classification problems using the Clonal Selection Algorithm (CLONALG). CLONALG is compared with other classification techniques to show its performance and accuracy.
It has been observed that the Clonal Selection Algorithm has some drawbacks, especially when working with large datasets. The experiments conducted showed that CLONALG has some limitations. Datasets of different sizes were used in the experiments. The Weka data mining tool was used in the experiments along with the CLONALG plug-in.
The project is intended to modify CLONALG and improve its performance on classification problems. The outcome of the project is an Artificial Immune System based data mining tool that accepts a dataset and outputs the classification results.
The main goal of the biological immune system is to protect the body from foreign invading molecules. These invaders are called antigens (antibody generators). The immune system has many features, such as adaptivity, complexity and robustness. One of its most important characteristics is the ability to distinguish between self and non-self (foreign) cells. Over the last decade scientists have been able to adapt the ideas of the immune system and apply them to engineering and scientific problems.
Artificial Immune Systems (AIS) are computational systems inspired by the principles and processes of the vertebrate immune system. The algorithms typically exploit the immune system's characteristics of learning and memory to solve a problem. Many algorithms have been inspired by the features and theories of the immune system, for example the Clonal Selection Algorithm, Negative Selection Algorithms and Immune Networks. In this report I'm going to focus on the Clonal Selection Algorithm (CLONALG) and its application in data mining.
The main objective of this project is to develop a data mining tool based on Artificial Immune Systems. To reach this main goal, the project should achieve the following sub-objectives:
Conduct a comparison between various classification techniques.
Conduct a comparison between various Immune Based techniques.
Test the performance and accuracy of the Clonal Selection Algorithm.
Modify the current CLONALG to produce better results.
Implement the data mining tool based on the modified CLONALG.
This project is intended to create an immune-based data mining tool that accepts a dataset and outputs the classification results. The immune system technique studied and investigated here is clonal selection. The implemented tool is applied to classification problems only. The experiments in the project are done using the CLONALG plug-in for Weka. Because of the time limitation, I'm going to modify the existing plug-in instead of implementing a new tool from scratch.
2. Biological and Computational Background Materials
2.1 Biological Immune System
The biological immune system is a complicated, multilayered defense system that protects the human body from foreign invaders (pathogens). It uses different mechanisms to achieve its goals, depending on the type of pathogen and the way it enters the body. Diversity, matching and distributed control are important features of the biological immune system that have caught the attention of scientists. Matching happens when antibodies bind to an antigen. To match the many different types of antigens, antibody diversity must be encouraged. The immune system has no central control unit, so control is distributed throughout the body.
As stated above, the immune system is a multilayered defense system. An invading body must pass through several defense layers when entering the human body. Immunity can be divided into two types: innate immunity and adaptive immunity. Innate immunity is the immunity the body has from birth. It takes the form of physical conditions that make life difficult for the pathogen. A rise in body temperature, changes in pH, chemical mediators, tears, coughing and sneezing are some examples of innate immunity.
Our focus in this project is on adaptive immunity. Adaptive immunity is the second line of defense in the vertebrate body. It has important characteristics such as adaptability, learning and memory. It is also called acquired immunity because it is developed under specific circumstances. Adaptive immunity is divided into two components: humoral immunity and cellular immunity. When a pathogen (antigen) invades the body, it activates T-cells, which in turn activate the B-cells. The B-cells then produce antibodies, which bind to the antigen and cause its destruction (Figure 1). Immune network theory, negative selection theory, clonal selection theory and somatic hypermutation are biological immune theories that describe the process of matching antigens and antibodies.
Negative selection theory shows how the immune system distinguishes between self and non-self cells: it detects foreign cells while not reacting to the body's own cells. Clonal selection theory describes how the immune system responds to an antigen stimulus. The idea behind this theory is that the immune system creates clones of the antibodies that match the antigen and mutates them to achieve a better match. The number of clones created depends on the degree of affinity between the antigen and the antibody.
2.2 Artificial Immune System
In the last decade, interest in Artificial Immune Systems has increased. Scientists have applied the ideas to solve engineering and scientific problems (de Castro and Timmis, 1996) (Hunt and Cooke, 1996). The implementation of any artificial immune system has to go through four steps: data encoding, similarity measure, selection and mutation.
In the first step of an AIS, the data must be encoded. The encoding is very important and can affect the success of the remaining steps. The most common encoding approach is to use binary strings, e.g. 10011 and 10101. In an Artificial Immune System the antigen is the target solution and the antibodies are the rest of the data. After encoding, the affinity must be calculated. The affinity measure shows how similar the antigen is to the rest of the data. The Hamming distance is commonly used for this purpose. After the affinity has been calculated, a selection must be made. There are several approaches to selecting the antibodies, such as negative selection and clonal selection. The last step is mutation: the selected antibodies are subjected to affinity mutation so that they better match the antigen in question.
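The four steps can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the 5-bit width, the choice of "number of matching bits" as the similarity measure (string length minus Hamming distance) and the per-bit mutation rate are all hypothetical and are not taken from any particular AIS implementation.

```python
import random

def encode(value, bits=5):
    """Step 1: encode a non-negative integer as a fixed-width binary string."""
    return format(value, "0{}b".format(bits))

def affinity(antibody, antigen):
    """Step 2: similarity measured as the number of matching bit positions."""
    return sum(a == b for a, b in zip(antibody, antigen))

def select(antibodies, antigen, k):
    """Step 3: keep the k antibodies that are most similar to the antigen."""
    ranked = sorted(antibodies, key=lambda ab: affinity(ab, antigen), reverse=True)
    return ranked[:k]

def mutate(antibody, rate, rng):
    """Step 4: flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in antibody
    )

rng = random.Random(0)                               # fixed seed for repeatability
antigen = encode(19)                                 # "10011"
antibodies = [encode(v) for v in (21, 3, 18, 12)]    # hypothetical data
best = select(antibodies, antigen, k=2)
mutants = [mutate(ab, rate=0.2, rng=rng) for ab in best]
print(best, mutants)
```

Keeping each step in its own function makes it easy to swap in a different encoding or similarity measure without touching the rest.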
2.3 Clonal Selection Algorithm
The Clonal Selection Algorithm was proposed by de Castro and Von Zuben in 2002. They took the idea from the Clonal Selection Theory of the biological immune system. The algorithm aims to develop a memory antibody pool that represents the solution, while the antigens represent the elements of evaluation. Figure 2 below gives an overview of the clonal selection process. It shows that when an antibody binds to an antigen, copies of the antibody are created.
The Clonal Selection Algorithm has several steps:
Initialization: in this step a pool of antibodies is prepared. This pool is divided into two pools, a memory pool and a remainder pool. The memory pool will represent the solution of the problem.
Generations: the algorithm goes through several iterations to expose the system to all known antigens. The number of generations is user defined. In each generation several steps are executed:
Select the antigen: an Antigen is selected randomly without replacement.
Expose the selected antigen to the antibody pool. The affinity between the antigen and each of the antibodies is calculated.
Select the antibodies with the highest affinity.
Clone the selected antibodies. The higher the affinity, the more clones are produced for the antibody.
Apply affinity mutation to the clones so that they better match the antigen.
Calculate the affinity again, this time between the antigen and the produced clones.
The antibodies with the highest affinity are selected and transferred to the memory pool.
The antibodies with the lowest affinity are replaced by another set of antibodies.
After the algorithm finishes, the memory pool is selected to represent the solution of the problem.
CLONALG uses some equations and functions to create the clones and to calculate the affinity between the antigen and the antibodies. Equation 1 gives the number of clones created for an antibody after the affinity measurement, where β is the clonal factor, N is the antibody pool size and i is the antibody's current affinity rank. Equation 2 gives the total number of clones created each time an antigen is exposed to the system, where Nc is the total number of clones and n is the number of selected antibodies.
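Assuming the standard CLONALG formulation of de Castro and Von Zuben, which uses the same symbols as defined above, Equations 1 and 2 can be written as follows, with i = 1 denoting the antibody with the highest affinity:

```latex
\begin{align}
\text{numClones}_i &= \operatorname{round}\!\left(\frac{\beta \cdot N}{i}\right) \tag{1} \\
N_c &= \sum_{i=1}^{n} \operatorname{round}\!\left(\frac{\beta \cdot N}{i}\right) \tag{2}
\end{align}
```

Under this form the best-ranked antibody receives the most clones, and the clone count falls off as 1/i for lower ranks.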
As mentioned before, the Hamming distance can be used as an affinity measure. Equation 3 shows how the Hamming distance is calculated. The Hamming distance counts the number of positions at which two binary strings differ. For example, the Hamming distance between 10101 and 10000 is 2.
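The calculation can be checked with a short Python sketch. The function name is hypothetical. It simply counts the positions where the two strings differ, which is the usual definition of the Hamming distance for strings of equal length.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

# The example from the text: the strings differ in the third and fifth bits.
print(hamming_distance("10101", "10000"))  # prints 2
```

A smaller distance means a closer match, so an algorithm that maximizes affinity can use the string length minus the Hamming distance as the affinity value.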
Clonal Selection Algorithm has a set of parameters. These parameters are:
Number of generations (G): This parameter specifies the number of iterations the algorithm should perform. At each iteration the selected antigen is exposed to the antibodies in the system. It controls the amount of learning the system will do on the specified problem.
Antibody pool size (N): This parameter specifies the total number of antibodies the system should maintain. N includes both the memory pool size (m) and the remainder pool size (r), so N = m + r.
Selection pool size (n): This parameter specifies the number of selected antibodies after each iteration. The system selects the antibodies with the highest affinity.
Remainder replacement size (d): This parameter specifies the number of antibodies with the lowest affinity which will be discarded from the system. A new set of antibodies will replace the discarded ones.
Clonal factor (β): This parameter specifies the scaling factor for the number of clones created for the chosen antibody. For example, if N is 100 and β is then by using Equation 1 the number of clones created for the antibody with the highest affinity is equal to 200.
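The effect of the clonal factor can be worked through in a few lines of Python. This is a sketch under two assumptions: that Equation 1 has the standard CLONALG form round(β·N / i), and that β = 2, a value chosen here only because it reproduces the 200 clones quoted above for N = 100. The function names are hypothetical.

```python
def clones_for_rank(beta, pool_size, rank):
    """Clones for the antibody at affinity rank `rank` (1 = highest affinity).

    Assumes the standard CLONALG form round(beta * N / i); halves round up.
    """
    return int(beta * pool_size / rank + 0.5)

def total_clones(beta, pool_size, n_selected):
    """Total clones over the n_selected best-ranked antibodies."""
    return sum(clones_for_rank(beta, pool_size, i) for i in range(1, n_selected + 1))

beta, N = 2, 100   # beta = 2 is an assumption, see the paragraph above
print([clones_for_rank(beta, N, i) for i in range(1, 5)])  # [200, 100, 67, 50]
print(total_clones(beta, N, 4))                            # 417
```

The counts drop quickly with rank, so most of the cloning effort goes to the few antibodies that already match the antigen best.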
Figure 4 below shows the Clonal Selection Algorithm pseudocode. As we can see, the algorithm accepts a set of patterns to be recognized and outputs the antibodies held in the memory pool after completing its execution.
input : S = set of patterns to be recognised, n = the number of worst elements to select for removal
output : M = set of memory detectors capable of classifying unseen patterns
Create an initial random set of antibodies, A
forall patterns in S do
    Determine the affinity with each antibody in A
    Generate clones of a subset of the antibodies in A with the highest affinity.
    The number of clones for an antibody is proportional to its affinity
    Mutate attributes of these clones, add them to the set A, and place a copy of the highest
    affinity antibodies in A into the memory set, M
    Replace the n lowest affinity antibodies in A with new randomly generated antibodies
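The pseudocode can be turned into a small runnable Python sketch. This is a simplified illustration, not the Weka plug-in's implementation. It assumes binary-string antibodies, uses the number of matching bits as affinity, keeps one memory antibody per antigen, assumes the clone count round(β·N / i), and uses a mutation rate that grows as affinity falls. All function and parameter names are hypothetical.

```python
import random

def affinity(antibody, antigen):
    """Number of matching bits (string length minus Hamming distance)."""
    return sum(a == b for a, b in zip(antibody, antigen))

def random_antibody(length, rng):
    return "".join(rng.choice("01") for _ in range(length))

def mutate(antibody, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in antibody
    )

def clonalg(antigens, generations=10, pool_size=30, n_select=5,
            n_replace=3, beta=0.1, seed=0):
    rng = random.Random(seed)
    length = len(antigens[0])
    pool = [random_antibody(length, rng) for _ in range(pool_size)]
    memory = {}  # best antibody found so far for each antigen
    for _ in range(generations):
        # Present every antigen once per generation, in random order.
        for antigen in rng.sample(antigens, len(antigens)):
            ranked = sorted(pool, key=lambda ab: affinity(ab, antigen), reverse=True)
            clones = []
            for rank, parent in enumerate(ranked[:n_select], start=1):
                n_clones = max(1, int(beta * pool_size / rank + 0.5))
                # Assumption: lower affinity means a higher mutation rate.
                rate = max(1.0 - affinity(parent, antigen) / length, 1.0 / length)
                clones += [mutate(parent, rate, rng) for _ in range(n_clones)]
            # Best candidate among the clones and the best unmutated antibody.
            best = max(clones + ranked[:1], key=lambda ab: affinity(ab, antigen))
            current = memory.get(antigen)
            if current is None or affinity(best, antigen) > affinity(current, antigen):
                memory[antigen] = best
            # Keep the pool size fixed: best survivors plus fresh random antibodies.
            survivors = sorted(pool + clones, key=lambda ab: affinity(ab, antigen),
                               reverse=True)
            pool = survivors[:pool_size - n_replace]
            pool += [random_antibody(length, rng) for _ in range(n_replace)]
    return memory

memory = clonalg(["10101", "00011"])
for antigen, antibody in memory.items():
    print(antigen, antibody, affinity(antibody, antigen))
```

A memory entry is only overwritten by a strictly better antibody, so the stored affinity for each antigen never decreases from one generation to the next.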
Data mining is the process of extracting patterns and interesting information from large datasets. Because the size of databases has increased dramatically in recent years, people need a process that lets them get at the information hidden in those databases. Data mining can be used in many fields, such as marketing and security. Companies use data mining to reduce costs and increase their revenue. Data mining consists of three stages: pre-processing, data extraction (data mining proper) and results validation. Pre-processing is performed on the dataset to clean it and to handle noise and missing data. After pre-processing, the data mining algorithms are applied to the dataset to extract the required information. Data mining has four main classes: clustering, classification, regression and association rules. Finally, the results obtained from the previous step are verified and validated.
As mentioned above, data mining has four main classes: clustering, classification, regression and association rules. Clustering divides the dataset into groups such that members of the same group share the same features while the groups differ from one another. Classification also divides the members of the dataset into groups, but unlike clustering the outcome classes are known in advance. Regression is used to forecast a value from the available data. Association rule mining is a method that helps discover interesting relationships among the data in a dataset.
3. Literature Review
3.1 Learning and Optimization Using the Clonal Selection Principle
This paper was written by Leandro N. de Castro and Fernando J. Von Zuben. In this paper the authors propose a computational implementation of the clonal selection principle that explicitly takes into account the affinity maturation of the immune response. The proposed algorithm, named CLONALG, was derived to solve pattern recognition and machine learning problems. Two versions of the algorithm were derived: the first was designed to perform machine learning and pattern recognition tasks, and the second to perform optimization tasks. The paper also presents the computational cost of the two proposed versions and a sensitivity analysis for the user-defined parameters.
3.2 Clonal Selection Algorithms: A Comparative Case Study Using Effective Mutation Potentials
This paper was written by Vincenzo Cutello, Giuseppe Narzisi, Giuseppe Nicosia and Mario Pavone. The paper presents a comparative study of two of the most important clonal selection algorithms: CLONALG and opt-IA. Four classes of problems were used in the experiments: toy problems, pattern recognition, numerical optimization and an NP-complete problem. The experiments showed that opt-IA performs better than CLONALG.
3.3 A Dynamic Adaptive Calibration of the CLONALG Immune Algorithm
This paper was written by Maria Cristina Riff and Elizabeth Montero. In this paper the authors propose a new parameter control strategy for the Clonal Selection Algorithm (CLONALG). Research in this area is wide open and many researchers have tried to tackle the parameter optimization problem. Because CLONALG uses a set of parameters, it is important to find the best combination of them. The paper focuses on controlling the number of antibody clones produced and the number of selected cells that go through the mutation process for improvement. The authors' approach provides low-cost and efficient adaptive techniques for parameter control. They tested their approach on the traveling salesman problem, which has been tackled before using CLONALG. The idea behind this research is to design a low-cost strategy to control two parameters: the population size and the number of clones.
4. Methods of Investigation
4.1 Data Collection
The datasets used in the experiments were obtained from the UCI Datasets Repository. I selected three of the most popular datasets available: Iris, Car Evaluation and Adult. The three selected datasets have three different sizes: small, medium and large.
The Iris dataset is one of the most widely used datasets in pattern recognition. It is considered a small dataset compared to the other datasets available. There are 3 classes in this dataset with 50 instances each, so the whole dataset contains 150 instances. Each class represents a type of iris plant.
The second dataset is called Car Evaluation. It consists of 6 attributes and 1728 instances. I selected this dataset because it falls into the medium-size category. It has 4 classes: unacc, acc, good and vgood.
The third dataset, Adult, is considered a large dataset. It classifies people in the US into two categories based on income (>50K and <=50K). It has 48842 instances and 14 attributes.
4.2 Tools used
The tools used in these experiments are Weka 3.4 and the immune algorithms plug-in. Weka is a very popular data mining tool. It is open source software written in Java and offers a collection of algorithms for data mining tasks. The algorithms in Weka can be applied to classification, clustering, pre-processing and association rules. The Artificial Immune System plug-in is open source code that contains several immune-based algorithms, such as the Artificial Immune Recognition System (AIRS), the Clonal Selection Algorithm (CLONALG) and Immunos-81.
4.3.1 Experiment 1
This initial experiment was conducted to see the results of using the Clonal Selection Algorithm (CLONALG) on our three datasets and to compare them with the decision tree algorithm (J48). All the parameters used in the algorithms are the default ones. The results of the experiment are shown in the tables below.
CLONALG performs well on the small dataset (Iris), but as we can see from the table above, the proportion of correctly classified instances decreases as the number of instances increases. CLONALG has a set of user-defined parameters that govern it. These parameters are:
Number of generations (G)
Antibody pool size (N)
Selection pool size (n)
Remainder replacement size (d)
Clonal factor (β)
In the next experiment I'm going to modify some of the default parameters in the CLONALG algorithm.
4.3.2 Experiment 2
In this experiment I'm going to try and modify some of the CLONALG pre-set parameters. The parameters by default are:
Number of generations (G) = 10
Antibody pool size (N) = 30
Selection pool size (n) = 20
Remainder replacement size (d) = 0.1 (ratio)
Clonal factor (β) = 0.1
The experiment is done by modifying one of the parameters while keeping the rest unchanged. The results obtained are shown in the tables below. The first parameter that we are going to modify is the number of generations (G).
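The one-parameter-at-a-time procedure can be sketched in Python as follows. The dictionary holds the default values listed above. The helper name and the candidate values for G are hypothetical and serve only as an illustration. Each generated configuration would then be passed to whatever routine trains and evaluates the classifier, which is not shown here.

```python
# Default CLONALG parameters as listed above.
DEFAULTS = {"G": 10, "N": 30, "n": 20, "d": 0.1, "beta": 0.1}

def one_at_a_time(defaults, name, values):
    """Yield one configuration per value, changing only the parameter `name`."""
    if name not in defaults:
        raise KeyError(name)
    for value in values:
        config = dict(defaults)   # copy, so the defaults stay untouched
        config[name] = value
        yield config

# Hypothetical candidate values for the number of generations.
configs = list(one_at_a_time(DEFAULTS, "G", [10, 20, 50]))
for config in configs:
    print(config)
```

Because every configuration differs from the defaults in exactly one entry, any change in accuracy or build time can be attributed to that parameter alone.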
As we can see from the table above, the number of generations does affect the number of correctly classified instances, but beyond a certain point it has only a minor effect on the results. Another observation from the above data concerns the time required to build the model: the number of generations has a large effect on the time taken to build the model. In the next run, we change the antibody pool size (N). The results are shown in the table and the graph below.
From the above table and graph, we can see that the antibody pool size has a large effect on classification correctness. Increasing N from 30 to 800 yields an increase in classification correctness of almost 18%. But once again, the larger the antibody pool, the more time is required to build the model.
5. Data Analysis
From the experiments that have been conducted, we can see that the performance of the Clonal Selection Algorithm (CLONALG) decreases when the size of the dataset increases. When we used a small dataset (Iris), the results were very good and almost as good as those of a well-known classifier (the J48 decision tree). The results of the second experiment were very encouraging. They show that we can increase the correctness of the classified instances by modifying the default parameters. Some parameters have a minor effect on the final results (such as the number of generations) and others have a major effect (such as the antibody pool size).
The conclusion of the experiments is that CLONALG can be improved by optimizing its parameters. Further experiments and investigations are needed to show the relationship between the parameters and the final results. A major disadvantage of this algorithm is the time consumed to build the model and to get the final results. Some modification should be made to the algorithm itself to overcome this problem.
The immune system is a very complex system with many interesting features. Many scientists have studied the immune system to extract its ideas and theories and apply them to engineering and scientific problems. Distribution, adaptivity and the ability to distinguish between self and non-self cells are some of the main characteristics of the immune system. Many algorithms have been developed from the theories of the immune system, such as the Clonal Selection Algorithm and the Negative Selection Algorithm. Artificial Immune Systems can be applied in many scientific fields, such as data mining and pattern recognition.
The Clonal Selection Algorithm is one of the most important algorithms derived from the immune system. It can be used to solve classification problems. CLONALG has some parameters that determine the correctness of the results. CLONALG performs well with small datasets but gives unsatisfactory results with large datasets. After conducting some experiments, I've found that the algorithm can be improved by adjusting its parameters. So, to make this algorithm reliable, we must find a way to optimize its parameters. We also need to modify the algorithm itself to reduce the time taken to generate the results.