This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Although computer hard drives are reliable data storage devices, failures can cause unrecoverable user data. As a result, hard drive failure prediction has become an important problem and the subject of active research (Hamerly and Elkan, 2001; Hughes, Murray, Kreutz-Delgado and Elkan, 2002; Murray, Hughes and Kreutz-Delgado, 2003, 2005). The prediction task involves two phases:
Selecting a subset of Self-Monitoring, Analysis and Reporting Technology (SMART) drive features or attributes, and
Using an appropriate prediction technique based on the selected subset of features or attributes.
Significant research has been done using different approaches to predict hard drive failure with the desired goal of attaining high detection rates and low false alarm rates. Hamerly and Elkan (2001) describe two learning approaches that use Bayesian methods to predict hard drive failures. First, they introduce the naive Bayes EM (Expectation-Maximization) method that works in conjunction with Autoclass (Cheeseman and Stutz, 1995), an unsupervised clustering algorithm and is used to find the highest-likelihood mixture model that fits the data. The second method they deploy is the standard supervised naÃ¯ve Bayes classifier. Both methods use the SMART drive attributes for failure prediction. For the naÃ¯ve Bayes EM method, a failed drive detection rate of between 35% and 45% is achieved. With the naÃ¯ve Bayes approach, a 55% detection rate is realized. Both approaches report low false alarm rates. Hughes et al. (2002) propose two SMART algorithms to improve drive failure prediction and lower the false alarm rates. Both algorithms are based on the Wilcoxon (Wilcoxon, 1945) rank-sum statistical test. Using a multivariate rank sum in conjunction with the ORed rank sum and ORed threshold, they achieve a modest 40% to 60% detection rate at a 0.5% false alarm rate. Hughes et al. make an interesting observation that many of the S.M.A.R.T. attributes do not demonstrate normal distribution, i.e. they are non-parametrically distributed. This means that standard parametric statistical methods (such as Normal, Weibull, etc.) cannot be used to characterize these attributes easily.
Murray et al. (2005) cast the hard disk drive failure prediction as a multiple-instance (MI) learning problem and propose using the multiple-instance naÃ¯ve Bayes (mi-NB) classifier to address the issue of false alarm rate in hard drive failure prediction. Using the mi-NB classifier on twenty five SMART attributes selected via reverse arrangements, a non-parametric statistical test, they report an initial detection rate of 35% with a false alarm rate of 1%. In comparison, the support vector machine (SVM) classifier gives the best performance at 51% detection rate with a corresponding 0% false alarm rate but it also makes it computationally expensive because of its training time.
Even though manufacturers of hard disk drives specify a mean time to failure (MTTF) ranging from 1,000,000 to 1,500,000 hours that suggest a nominal failure rate of 0.88%, field data shows annual replacement rates varying between 1% and 4%, and in some systems, up to 13% (Schroeder and Gibson, 2007). Using a conservative replacement rate of just 1%, 6 million drives would be finding their way back to repair centers on an annual basis. Manufacturers of hard disk drives would not be thrilled to learn of such large number of drives coming back for repair. Neither would the users be excited over having to return drives and wait for replacements. Although it is not known what percentage of the drive failures or returns is attributable to S.M.A.R.T., the case fits the need for a real-world pattern recognition problem using S.M.A.R.T. attributes to predict hard disk drive failure with a low false alarm rate (Murray et al. 2002, 2005).
The goal is to develop and demonstrate an evolutionary method for feature subset selection that outperforms the non-parametric reverse arrangements test method for feature selection in both the detection and false alarm rates without sacrificing computational efficiency.
This approach includes the use of a multiple-instance naÃ¯ve Bayes (mi-NB) classifier designed to work with low false alarm rates that has shown promising results in predicting failures in hard disk drives (Murray, Hughes and Kreutz-Delgado, 2005).
The proposed research in this dissertation consists of the following specific steps.
Identify the feature pool of twenty-five SMART attributes selected by the non-parametric reverse arrangements test.
Train and test the feature pool selected by the reverse arrangement test method using the mi-NB classifier on the SMART data set.
Combine the three methods with an evolutionary algorithm to find an optimal feature subset selection.
Train and test the feature pool selected by the evolutionary algorithm with the mi-NB classifier on the SMART data set.
Perform comparative analysis on the accuracy of the mi-NB classifier on the two feature pools for predicting hard disk drive failure.
Relevance, Significance, and Brief Review of the Literature
Relevance and Significance
Several approaches to failure prediction for hard disk drives have been introduced over the last decade. Hughes, Murray and Kreutz-Delgado (2000) apply their rank sum statistical hypothesis and multivariate rank sum statistical tests on 3744 drives that demonstrate three to four times higher correct prediction accuracy than error thresholds on will-fail drives. Hamerly and Elkan (2001) show that their NaÃ¯ve Bayes Expectation-Maximization (NBEM) and NaÃ¯ve Bayes classifier models perform three times better than the current industry standard methods. Murray, Hughes, and Kreutz-Delgado (2003) compare the performance of support vector machines (SVM), unsupervised clustering, and non-parametric statistical tests (rank-sum and reverse arrangements) on SMART attributes to predict hard disk drive failure. They find that the rank-sum method outperforms even the SVM method. Murray, Hughes, and Kreutz-Delgado (2005) introduce the semi-supervised multiple-instance naÃ¯ve Bayes algorithm (mi-NB) and demonstrate that it performed considerably better than the unsupervised Autoclass clustering algorithm developed by Cheeseman and Strutz (1995) which also makes the Bayes assumption.
The research proposed in this dissertation is important very relevant as it aims to minimize optimize the selection of the number of SMART attributes via the use of an evolutionary algorithm in conjunction with the mi-NB classifier without sacrificing computational efficiency. by combining some of the existing non-parametric statistical methods with an evolutionary algorithm.
In the proposed research, the author plans to demonstrate that the evolutionary method of feature selection method has the ability to maximize classification accuracy and minimize the number of SMART attributes selected. The results of this research will significantly advance the knowledge of the use of an evolutionary algorithm in the next generation of SMART attributes selection for hard disk drive failure prediction.
The literature review provided here briefly establishes the background for this research. It begins with a historical overview of the hard disk drive, follows with a brief overview of the SMART attributes, continues with an explanation of feature selection methods, provides an example of reverse arrangements, the non-parametric statistical method for feature selection and concludes with an outline of the evolutionary algorithm, the multiple-instance learning framework and the multiple-instance naÃ¯ve Bayes classifier.
Historical Overview of the Hard Disk Drive
Hard disk drives play a role in everyday life. Ever since IBM shipped the first unit of the RAMAC (Random Access Method of Accounting and Control) on 13 September, 1956, it set in motion a process that changed the way we live (Levy, 2006). Although the RAMAC, the size of two large refrigerators and weighing a ton, in no way resembles today's 1.8inch drive, it was the first to offer random data access which differentiates a hard drive from a tape drive. Today, the hard drive is a consumer electronics commodity item and has found its place in every aspect of our daily life. More and more uses of the hard disk continue to be discovered. The demand for hard disk drives continues to increase steadily every year. In 2009, the sales of hard disk drives are expected to be in the range of 600 million and there is a chance that they could exceed 700 million a year in 2010 or 2011 (Roberts, 2007).
Figure 1: Inside a Hard Disk Drive
Since a hard disk drive can store valuable data, its reliability becomes a major concern of the user (Schroeder and Gibson, 2007). Data loss from a hard disk drive for any user can be a tormenting ordeal. In an effort to reduce the probability of data loss, IBM in 1992 began shipping hard disk drives that were capable of predicting their own failure. The IBM-developed Predictive Failure Analysis (PFA) would periodically monitor certain drive operations, such as flying height between the read-write heads and the rotating media, and send a warning message to the host of an impending failure if a surge of faulty events was observed. Later, Compaq, a major computer manufacturer, announced IntelliSafe â„¢, a drive diagnostic that monitored a range of attributes and sent their threshold values to the host. The acceptance of IBM's PFA and Compaq's IntelliSafe â„¢ technologies by the storage industry eventually led to SMART (Self-Monitoring, Analysis and Reporting Technology) becoming the industry-standard reliability prediction for today's hard disk drives.
SMART attributes monitor operation and environment conditions that are relevant to impending failure. The drive microprocessor firmware runs the failure warning algorithm that checks whether maximum thresholds on these measured conditions are exceeded and issues a binary "won't fail/will fail" warning (Hughes et al., 2002). A host to the hard disk drive can issue standardized drive commands that can read this failure warning. Ideally, the user should be notified 24 hours of an impending drive failure so that a backup of the data onto another storage device could be made. Counts of drive recalibration retries, track-seek retries, write faults, read errors, head-to-disk flying height, data throughput performance, spin-up time, error rates, temperature, seek time performance, and reallocated sectors are some of the attributes that are monitored to predict drive failure. The technical documentation for SMART is in the AT Attachment (ATA) and is available for public viewing. An important point to note here is that since each drive manufacturer develops and uses its own set of attributes and algorithm for failure prediction, some drives may provide more SMART attributes than others but typically all manufacturers use a threshold trigger algorithm that triggers a SMART flag when any single attribute exceeds a predefined value (Murray et al. 2005). When a failure warning is triggered the drive in question can be returned to the factory for warranty replacement. As a result, Murray et al. state that to prevent false alarms at the expense of predictive accuracy, drive manufacturers set these thresholds conservatively with an acceptable false alarm rate in the order of 0.1% per year. They also state that with the SMART algorithm currently implemented in drives, manufacturers estimate the failure detection rate to be 3-10% which provides motivation for research in this area.
Up to 30 internal drive measurements are supported by the SMART ATA drive specification (Hughes et al., 2002). Table 1 lists the SMART attributes that are typically supported on a hard disk drive.
SMART attributes are typically collected at two hour intervals and the most recent 300 samples are saved on the drive, where each sample is all the attributes for a single time interval. If a drive did not survive 600 hours of operation, the number of valid samples saved may be less than 300. The number of samples saved on a drive is determined by each drive manufacturer so it is possible that drives from one manufacturer may have more samples saved than those from another manufacturer. The SMART dataset collected by Murray et al. consists of 369 drives, with 178 labeled good and 191 failed. Each sample in the dataset comprises of the drive's serial number, the total power-on hours, and 60 other attributes that monitor the performance of the drive. Some of these 60 attributes are sometimes called "extended" attributes because they go beyond the typical attributes that most drive manufacturers support in the standard SMART specification. Table 1 shows the names of the attributes and their meanings.
Table 1: SMART Attributes (Murray et al. 2005)
The Wilcoxon-Mann-Whitney rank sum test
The Wilcoxon rank sum test (also known as the Mann-Whitney U test or the Wilcoxon-Mann-Whitney test) is used for comparing whether two samples are drawn from the same population (Wilcoxon, 1945; Mann and Whitney, 1947). No assumptions are made about the parametric form of the distributions from which each set is drawn from. The test is performed by ranking the combined data set, dividing the ranks into two sets according the group membership of the original observations, and calculating a two sample z statistic, using the pooled variance estimate. The statistic is compared to percentiles of the standard normal distribution for large samples. When dealing with small samples, the statistic is compared to what would result if the data were combined into a single data set and assigned at random to two groups having the same number of observations as the original samples.
Liu and Motoda (1998) define feature selection as a "process that chooses an optimal subset of features according to a specific criterion", where criterion specifies the details of measuring feature subsets. They point out that an optimal subset can be a minimal subset that gives the best estimation of predictive accuracy, all other things being equal. To this end, feature selection results in:
Less data, so that the classifier can learn faster;
Higher accuracy, so that the classifier can generalize the data better;
Simple results, so that understanding them is easier, and
Fewer features, so that savings can be made in the next round of data collection by removing redundant or noisy features, if possible.
Similarly, according to Yang and Honavar (2001), the choice of features or attributes that are fed to a classifier can affect the following:
Accuracy of the classification function.
Learning time of the classifier.
Number of samples needed for learning.
Cost of classification, and
Knowledge acquired through learning.
As a result, in machine learning the use of feature selection in studying how to select informative (or discriminative) features and eliminating irrelevant, redundant or noisy ones from data becomes an important and frequently used method for preprocessing data before being fed to a classifier. In addition, feature selection can reduce the overall cost of computation and improve the performance of learning algorithms by reducing the dimensionality of data. With the aid of feature selection, machine learning algorithms become more accurate, reliable and scalable (Koller and Sahami, 1996; Kohavi and John, 1997).
Feature Subset Selection Strategies
In essence, feature subset selection is an optimization problem. Various search strategies can be adopted to find an optimal or suboptimal feature subset. An exhaustive/complete search strategy will search all 2N feature subsets to find an optimal one given that the full feature set contains N features and the total number of candidate subsets, 2N. The space complexity (the number of subsets needed to be generated) is O(2N). Naturally, when faced with large datasets, finding an optimal feature subset can be very difficult (Kohavi and John, 1997). Exhaustive search is computationally infeasible in practice (Yang and Honavar, 2001), except when working with small number of features. As a result, the use of heuristics for feature selection has been explored to circumvent this problem. Forward selection and backward elimination are the two most common branch and bound search algorithms used in feature subset selection (Narenda and Fukunaga, 1977). Other feature subset selection strategies use evolutionary algorithms (Vafaie and Jong, 1992; Kudo and Sklansky, 2000; Jirapech-Umpai and Aitken, 2005) and simulated annealing (Doak, 1992) to randomly search for solutions.
Feature subset selection algorithms can be also be classified into two categories based on whether or not feature selection is done independently of the learning algorithm used to construct the classifier (Yang and Honavar, 2001). They are the filter and the wrapper methods.
In the filter model, the feature subset selection is done as a preprocessing step. The disadvantage of the filter approach is that it totally ignores the effects of the selected feature subset on the performance of the learning algorithm (John, Kohavi, and Pfleger, 1994). The reverse arrangements test falls in the filter method category. In order to select a subset that will ultimately result in an induced structure with high predictive accuracy on unseen data, the subset selection algorithm must take into account the biases of the learning algorithm. Hence, the need arises to consider the wrapper model.
Figure 2: Filter Method (Yang and Honavar, 2001).
In the wrapper model, the feature subset algorithm exists as a wrapper around the learning algorithm and conducts a search for a good subset using the learning algorithm itself as part of the evaluation function (John, Kohavi, and Pfleger, 1994). The filter method is computationally more efficient than the wrapper method since it selects features independent of the learning algorithm. However, use of the wrapper method results in higher predictive accuracy at the expense of computational overhead. The genetic algorithm can be utilized for feature selection in the wrapper method.
Figure 3: Wrapper Method (Yang and Honavar, 2001).
Reverse Arrangements Test
SMART attributes are non-parametric and reported as time series data (Hamerly and Elkan, 2001; Murray et al. 2002, 2003, 2005) and, therefore, become useful for statistical trend analysis. To test non-parametric data for trend, Mann (1945), and Bendat and Piersol (1986) propose the reverse arrangements test which involves calculating the number of times, beginning with the first data point, that each subsequent point is less than the first data point. The inequality count is known as the reverse arrangement. To indicate a significant trend in the data, the z-score is used to test the null hypothesis at p â‰¤ 0.05. The reverse arrangements test is best illustrated by an example.
Given a time sequence of N observations of a random variable, x1, x2, â€¦, xN, the ith reverse arrangement, Ai, is the number of times that xi > xj for i < j. The sum of all reverse arrangements is,
If the sequence of xi are independent observations of the same random variable, then A is is a random variable with mean,
A z-score can then be calculated using the following equation (Siegal and Castellan, 1988),
When N â‰¥ 14, the z-score approximately follows the standard normal distribution and can be used to reject the null hypothesis with little loss of accuracy.
Consider the following series of systolic blood pressure values measured daily in a patient during a one-week monitoring period: [140; 125; 133; 140; 138; 142; 121].
In this sequence of N = 7 measures, x1 = 140 is greater than four of the following values, and thus A1 = 4; x2 = 125 is greater than one of the next five observations, and so A2 = 1. The value of each reverse arrangement, Ai is:
A1 = 4
A2 = 1
A3 = 1
A4 = 2
A5 = 1
A6 = 1
and A = A1 + A2 + A3 + A4 + A5 + A6 = 10. Using Equations 1 - 3, one gets ÂµA = 10.5, Ïƒ2A = 11.083, and z = -0.151. To reject the null hypothesis at p â‰¤ 0.05 requires a z-score of z â‰¥ 1.96 or z â‰¤ âˆ’1.96. The calculated z-score is -0.151, which does not reject the null hypothesis at p â‰¤ 0.05, indicating no significant trend in the data.
The reverse arrangements test in conjunction with attribute z-scores can be used to identify potentially useful SMART attributes for use in failure detection algorithms (Murray et al. 2005).
Genetic algorithms (GAs) are derived from natural genetics and have recently emerged as a useful and practical method of optimization and searching. They are stochastic algorithms that mimic natural evolution (Holland, 1992). To simulate biological evolution, GAs use mechanisms such as reproduction, mutation, crossover, natural selection, and survival of the fittest. Since feature selection requires exponential search space, GAs are a natural fit for use in solving feature selection problems. Siedlecki and Sklansky (1989) demonstrate that GAs are superior to other classical algorithms (such as branch and bound) in large scale feature selection. Additionally, other researchers have shown the advantages of GAs for feature selection (Brill, Brown and Martin, 1992; Yang and Honavar, 1998; Kuncheva and Jain, 1999; Raymer, Punch, Goodman, Kuhn, and Jain, 2000). Areas that have benefited from genetic algorithms in feature selection include biosystems (Paul and Iba, 2005; Debnath and Kurita, 2010), maintenance (Hadavi, 2007), pattern recognition (Sun, Bebis, and Miller, 2004), and data mining (Tan, Teoh, Yu, and Goh, 2009)Biltski, 2006; Goldberg, 1989; Holland, 1992; Mitchell, 1996). Using some form of fitness-dependent probabilistic selection of individuals from the current population, evolutionary algorithms can produce individuals for the next generation (Yang and Hanovar, 1998). The most commonly used genetic operators to enable the algorithm to explore the space of candidate solutions are mutation and crossover. By applying genetic operators to the fitness-dependent selection of individuals, successive generations of individuals are repeated many times until a satisfactory solution is found. Highly opportunistic and exploitative randomized searches explored by genetic algorithms on high-dimensional search spaces can be very effective under certain conditions (Holland, 1992). In practice, the choices of the fitness function, population size and the genetic operators determine the performance of evolutionary algorithms.
Figure 4 describes the schematic of the Simple Genetic Algorithm (SGA) (Goldberg, 1989). A population of random individuals (or chromosomes) is first initialized. Based on their fitness evaluation, these individuals are selected for recombination via crossover. Mutation follows, and the resulting child strings form the population for the next generation.
Figure 4: Schematic of the Simple Genetic Algorithm (SGA)
Individual (Chromosome) Encoding
One form of encoding is the binary string. The individual to be encoded is shown in the table below:
Table 2: Feature Subset encoding in SGA
Each bit in the string represents one feature. If the bit is 0, it means the feature is not selected; otherwise, the feature is selected. The length of each individual is determined by the number of features in the string.
A number generator that returns 1 or 0 randomly can be used to generate individuals. If a floating point number generator is used to generate random numbers between 0 and 1, a threshold value of 0.5 can be set to make the chances of getting a 0 or a 1 equal.
To maximize the classification accuracy of the feature subset , the fitness of each individual, i, F(i) is defined as:
F(i) = C(i), where C(i) is the classification accuracy of the feature subset
The roulette wheel selection is one of the most commonly used methods for genetic algorithms. It probabilistically selects individuals from a population for later breeding using the following:
where F(hi) is the fitness value of hi. The probability that an individual will be selected is proportional to its own fitness and is inversely proportional to the fitness of other competing hypothesis in the current population.
The crossover operator produces two new offsprings by copying selected bits from each parent. One point crossover is implemented by picking one position in the individuals at random and then exchanging the parents to form offsprings as shown in Table 3.
Table 3: Single-point crossover
Mutation is much easier to implement. To mutate a certain bit, a number is generated randomly and compared with the number having the mutation rate. If the random number is smaller than the mutation rate, the bit is mutated; otherwise, the bit is maintained and the next bit is considered. The point mutation is shown in Table 4.
Mutated Offspring 1
Table 4: Point Mutation
Murray et al. (2005) cast the hard drive failure prediction problem as a multiple-instance learning problem, which is a two-class semi-supervised problem. Multiple-instance (MI) learning is a variation on supervised learning, where the task is to learn a concept given positive and negative bags of instances. Each bag may contain many instances, but a bag is labeled positive even if only one of the instances in it falls within the concept. A bag is labeled negative only if all the instances in it are negative. This method of classifying a bag as positive if any of its instances is labeled positive is known as the MI assumption.
The first application of multiple-instance learning was to drug activity prediction (Dietterich, Lathrop, and Lozano-Perez, 1997). One objective in the drug activity prediction application was to predict whether a candidate drug molecule would bind strongly to a target protein known to be involved in some disease state.
Dietterich et al. term the multiple-instance algorithm they developed axis-parallel-rectangles. Other researchers (Andrews, Tsochantaridis, and Hofmann, 2003) subsequently develop support vector machines (SVM) and Wang and Zucker (2000) the Citation-k-Nearest-Neighbors algorithm for multiple-instance learning. While the multiple-instance SVM (mi-SVM) algorithm developed by Andrews, et al. (2003) adheres to the MI assumption, the Citation-k-Nearest-Neighbors algorithm does not (Xu, 2003) and, as a result, cannot be used for hard drive failure prediction.
Murray et al. point out that the hard drive problem fits nicely into the MI framework. A pattern of n samples is an instance, and the set of all patterns for a drive i is the bag (or drive) Xi. Each bag (or drive) has Ni instances of pattern labels yj, j = 1â€¦Ni, where failed drives are labeled Æ´i = 1 and good drives labeled Æ´i = 0.
Figure 3 shows the schematic of the disk drive multiple instance learning problem described by Murray et al. In the figure shown below, the drives (bags) are indicated by numbers, and each circle or square represents an instance (or pattern). Squares are instances from class 1 (failed drives) and circles are instances from class 0 (good drives). The + or - in each instance represents the hidden underlying class of each instance, 1 or 0, respectively. The classification boundary induced by a classifier is represented by the decision surface. Misclassified instances by the decision surface are shown in gray.
Bag (Drive) 1: All - instances are classified correctly, hence the bag is correctly classified as 0 (good drive).
Bag (Drive) 2: Since there is one instance that is classified as +, the bag is correctly classified as 1 (failed drive).
Bag (Drive) 3: There is one instance classified as + and another as -, so the bag is correctly classified as 1 (failed drive).
Bag (Drive) 4: Here, an instance with true class - is labeled as +, so the bag is misclassified as 1 (false alarm).
Bag (Drive) 5: All instances of the + bag (failed drive) are classified as -, hence the bag is misclassified as 0 (missed detection).
Figure 5: Multiple-instance learning (Murray et al. 2005)
Multiple-Instance NaÃ¯ve Bayes (mi-NB) Classifier
Murray et al. develop a new multiple instance learning algorithm using naÃ¯ve Bayes and specifically designed to allow control of the false alarm rate. It adheres to the MI assumption just as the mi-SVM algorithm (Andrews et al. 2003) does, but since it requires repeated learning of an SVM, the mi-SVM is considered computationally intensive. On the other hand, the mi-NB uses the fast naÃ¯ve Bayes algorithm as its base classifier and is therefore a more efficient multiple-instance learning algorithm.
Barriers and Issues
Research in the area of hard disk failure prediction using SMART attributes and utilizing non-parametric statistical methods for feature selection can be credited mostly to the work undertaken by Murray, Hughes, and Kreutz-Delgado (2002, 2003, and 2005). These researchers show that although the reverse arrangements test, a nonparametric statistical test (Mann, 1945) can be useful when performing feature selection through trial-and-error testing of attributes and combinations of attributes, it is not an ideal tool to use for automated or accurate feature selection. As a result, they report that many of the attributes that the reverse arrangements test (in combination with the z-score test, a parametric test) considered promising did not actually work well for failure prediction. Of interest in this research is a pre-processing technique (an evolutionary algorithm) that selects informative (or discriminative) attributes or features, thereby reducing the dimensionality of the data, so that the cost of the mi-NB classifier is not computationally intensive.
Tan, Fu, Zhang, and Bourgeois (2007) recently proposed a framework based on a genetic algorithm for feature subset selection that combines various existing feature subset selection methods. Extensive research has been and is continuing to be carried out in feature subset selection using machine learning methods in pattern recognition (Ferri, Pudil, Hatef, and Kittler, 1994; Jain and Zongker, 1997; Lee and Landgrebe, 1997; Biem, Katagiri, and Juang, 1997), exploratory data analysis (Mao and Jain, 1995), bioinformatics (Liu, Li, and Wong, 2002; Bacauskiene, Verikas, Gelzinis, and Valincius, 2009; Liu, Cai, Lu, Feng, Peng, and Niu, 2009), quality control (Kaya, 2009), healthcare (Skrypnik, Teziyan, Puuronen, and Tsymbal, 1999; Cheng, Wei, and Tseng, 2006), and data mining (Rayner, Punch, Goodman, Sanschargin, and Kuhn, 1997). In particular, as discussed elsewhere in the literature review, feature selection using genetic algorithms has also been the subject of ongoing research in bioinformatics, maintenance, pattern recognition, and data mining but no known work has been undertaken on the use of an evolutionary algorithm in feature subset selection for SMART attributes to which the body of knowledge available in this area is very limited.
Additionally, while the implementation of an evolution algorithm for feature selection appears very simple, tuning the parameters such as the mutation probability, crossover probability and the population size is necessary to find the "optimal" settings for the evolutionary algorithm for a given problem very difficult.
This research will need to carefully consider issues that are related to the characteristics of the SMART dataset because the data is non-parametric and therefore some attributes could exhibit poor correlation with false positive predictions (Murray et al. 2005). These characteristics will contribute to difficulty in understanding and explaining the obtained results from the application of the multiple-instance naÃ¯ve Bayes algorithm.
Scaling and binning the SMART attributes as part of the preprocessing of the data can be considered to be one of the barriers. However, in the proposed research, the author is planning on adopting the equal-frequency or equal-width binning discretization methodology demonstrated by Dougherty, Kohavi and Sahami (1995). This discretization step has shown to provide certain advantages in performance, generalization and computational efficiency of the classifiers (Frank and Witten, 1999). Hamerly and Elkan (2001) used equal-width binning into five bins (including zero-count bin) successfully in discretization of drive attribute data which the author also proposes to use for this research.
Because of the competitiveness and proprietary nature of the business, hard disk drive manufacturers are generally very wary of sharing any field data for use in research or otherwise. As a result, the author plans to use the same SMART dataset used by Murray et al. (2005) which limits the ability to randomly sample the entire scope of hard drive failure activity in the field. which may not provide a large enough number of samples for optimal feature subset selection via the proposed evolutionary algorithm but should suffice for the purpose of this dissertation.
The overall goal of this research is to goal is to develop and demonstrate an evolutionary method for feature subset selection that outperforms the non-parametric reverse arrangements test method for feature selection in both the detection and false alarm rates without sacrificing computational efficiency.
The specific steps involved in conducting the research are as follows.
Download the SMART dataset from http://cmrr.ucsd.edu/smart website. SMART data is reported by the drive in a time series format. The mi-NB classifier is designed to work with time series data. This makes the cmrr.ucsd.edu website's SMART dataset fit the need for the benchmark needed for this study. Additionally, due to the proprietary nature of the hard drive business, there is generally lack of available field data for such research purposes.
Use the reverse arrangements non-parametric statistical method on the dataset to select the feature subset.
Train and test the feature subset selected by the reverse arrangements non-parametric statistical method on the dataset with the mi-NB classifier.
Use a 10-fold cross-validation test on the dataset with the feature subset selection and determine the true error.
Apply the evolutionary algorithm on the dataset using the following fitness function and parameters:
Fitness Function: See explanation on Fitness Function below.
Selection: Roulette Wheel.
Population size: 20, 30, 40, 50.
Number of Generations: 5, 10, 15, 20, 25, 30, 35, 40, 45, 50.
Probability of crossover: 1, 0.95.
Probability of mutation: 0.001, 0.05.
Train and test the feature subset on the dataset using the mi-NB classifier.
Use a 10-fold cross-validation test on the dataset with the feature subset selection and determine the true error.
Generate comparative analyses on the accuracy of the classifier on the two feature subset selection methods and determine which method provides the highest drive failure detection rate, lowest false alarm rate and with the least number of features selected.
The novel fitness function is designed to achieve three objectives: maximize drive failure detection rate, minimize the false alarm rate and the number of features selected. To do this, the fitness function is defined as follows:
F(i) = w1 * DR(i) + (1-w1) * (1/FA(i)) + (1-w1 - w2) * (1/S(i))
where i is a feature vector representing a feature subset selected and w1 and w2 are parameters between 0 and 1. The function is composed of three parts. The first part is a weighted hard drive failure detection rate, DR(i) from the mi-NB classifier, the second part is the weighted false alarm rate FA(i), and the third part is the weighted size S(i) of the feature subset represented by i. For a given w, the fitness of an individual is increased as the detection rate and subset size of i increases and decreases as the false alarm rate of i increases. Increasing the value of w means we give a higher priority to the detection rate over the size of the feature subset and the false alarm rate. By adjusting w, we can achieve a tradeoff between the detection rate and the size/false alarm rate.
The author proposes the use of MATLAB software application to adopt some of the code routines developed by Murray, Hughes and Kreutz-Delgado (2005) to implement the multiple-instance naÃ¯ve Bayes classifier algorithm. Murray et al. have successfully used the multiple-instance naÃ¯ve Bayes classifier to predict hard drive failures. MATLAB is universally accepted as one of the top numerical computing programming languages and used by more than a million developers (Marques, 2003). In addition to using MATLAB, the author also proposes the use of JMP and MINITAB statistical software packages to perform non-parametric statistical on the SMART dataset for feature subset selection. Both JMP and MINITAB are well-known statistical packages used by universities and corporations (Beaumont and Knowles, 1996). WEKA 3.5.7 is an open source data mining software application developed by the University in Waikato in New Zealand (Larose, 2006). It will be useful in experimenting with the use of the evolutionary algorithm on the SMART dataset.
A Personal Computer will be used to train and test the SMART dataset that contains 68411 records. The system requirements include 2.4GHz Pentium processor, 2GB RAM and 750GB hard drive.
The author will acquire the student's version of MATLAB 2010a, a commercial software application, from The MathWorks, Inc. that would be used to code the routines to train and test the SMART dataset. Statistical analysis on the data will be conducted through the use of JMP and MINITAB commercial software packages. The operating system will be Windows XP.
Revision and submission of this dissertation idea paper as necessary - Target date: May 30, 2010
Completion of the dissertation proposal - Target date: September 30, 2010
Completion of the dissertation - Target date: March 30, 2011
Submission of the initial version of the dissertation - Target date: April 30, 2011
Submission of the final version of dissertation report - Target date: July 31, 2011
Relevant material from publications, books and journals will be accessed via Nova Electronics Resources and University of Colorado (Boulder) Library.
Student's version of MATLAB R2007b software from The MathWorks, Inc.
SMART dataset from http://cmrr.ucsd.edu/smart.
Commercial version of JMP software from SAS Institute.
MINITAB statistical software application.
Weka 3.5.7 Data Mining Open Source software application from http://www.cs.waikato.ac.nz/ml/weka/.