This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Todays anti-virus technology, based largely on analysis of existing viruses by human experts, is just barely able to keep pace with the more than three new computer viruses that are written daily. In a few years, intelligent agents navigating through highly connected networks are likely to form an extremely fertile medium for a new breed of viruses. At IBM, they are developing novel, biologically inspired anti-virus techniques designed to thwart both today's and tomorrow's viruses. Here we describe two of these. One, a neural network virus detector that learns to discriminate between infected and uninfected programs, and another is a computer immune system that identifies new viruses, analyzes them automatically, and uses the results of its analysis to detect and remove all copies of the virus that are present in the system. The neural-net technology has been incorporated into IBM's commercial anti-virus product; the computer immune system is in prototype.
Trigrams - a special case of the N-gram, where N is 3.
Boot sector - a sector of a hard disk, floppy disk, or similar data storage device that contains code for booting programs stored in other parts of the disk.
Generic detector - used to detect anti-malware
Each day, an army of perhaps a few hundred-virus writers around the world produces three or more new computer viruses. An army of comparable size, the anti-virus software developer's works feverishly to analyze these viruses, develop cures for them, and frequently distribute software updates to users. Currently, the battle is roughly even. The statistics, based on observation of a sample population of several hundred thousand machines for several suggest that in medium to large businesses roughly 1% of all computers become infected during any given year. The world's computer population has been inconvenienced, but despite dire predictions it has not been incapacitated. Most of the anti-virus products in common usage have been reasonably effective in detecting and removing viruses. Today, computer viruses are a manageable nuisance. Several worrisome trends threaten to turn the balance in the favor of computer virus authors. First, the rate at which new viruses are created, already on the verge of overwhelming human experts, has the potential to increase substantially.
Second, continued increases in interconnectivity and interoperability among the world's computers, designed to benefit computer users, are likely to be a boon to DOS and Macintosh viruses as well. In addition, mobile intelligent agents will soon navigate the global network, potentially serving as a fertile medium for a new breed of rapidly spreading virus that exploits the itinerancy of its host by leaving behind copies of it wherever its host goes. Traditional methods of detecting and removing viruses, which rely upon expert analysis by humans and subsequent distribution of the cure to users, would be orders of magnitude too slow to deal with viruses that spread globally within days or hours. In humans, a toxin produced by virally infected bacteria causes diptheria. Some computer viruses are similarly toxic, being deliberately programmed to cause severe harm to their hosts. One notorious example, the Michelangelo virus, destroys data on a user's hard disk whenever it is booted. To address these problems, they have developed a variety of biologically inspired anti-virus algorithms and techniques that replace many of the tasks traditionally performed by human virus experts, thus permitting much faster, automatic response to new viruses First, we will briefly describe what computer viruses are, how they replicate themselves, and why their presence in a system is undesirable. Then, we shall describe the typical procedures used by human experts to analyze computer viruses, and explain why these methods are unlikely to remain viable a few years from now. Then, we shall discuss two complementary anti-virus techniques that are inspired by biological systems that learn: a neural-network virus detector and a computer immune system.
1.Generic Detection of Viruses
Two methods of computer virus identification have already been introduced: the overly broad, ex post facto detection provided by activity monitors and integrity
management systems, and the overly specific detection offered by virus scanners. Somewhere in between is the ideal "generic detector" taking a program's code as input, it determines whether the program is viral or nonviral. Perfect generic detection is an algorithmically "undecidable" problem, as observed by it is reducible to the halting problem. However, imperfect generic detection that is good in practice is possible, and is naturally viewed as a problem in automatic pattern classification. Standard classification techniques encompass linear methods and non-linear ones such as nearest-neighbor classification, decision trees, and multiplayer neural networks. Within the problem of the generic detection of viruses, detection of "boot sector viruses" is both an important and relatively tractable sub-problem. Although there are over 4,000 different file-infecting viruses and only about 250 boot-sector viruses, of the 20 viruses most commonly seen 19 are boot viruses, and account for over 80% of all virus incidents. Boot viruses similarly dominate the rolls of newly observed viruses, so an ability to detect new boot sector viruses is significant in the war against viruses. Detecting boot viruses is a relatively limited pattern classification task. For this application, false positives are critical. False negatives mean missed viruses, and since viruses occur fairly rarely, so will false negatives. Also, if a classifier does let a virus slip by, the outcome is no worse than if no virus protection were in place. On the other hand, false positives can occur any time, and will leave a user worse off than he would have been without virus protection. Unfortunately, nearest-neighbor classification performs poorly for this problem. A viral boot sector can be just a short string of viral code written over a legitimate boot sector, so in any overall comparison, the virus will be more similar to the legitimate boot sector it happened to overwrite than to any other virus.
Using expert knowledge of viral and non-viral boot sectors and several days of extensive experimentation, they handcrafted an ad hoc classifier. The classifier scans a boot sector for the presence of patterns that provide strong or weak evidence for any of four viral functions. One point is credited for weak evidence, and two points for strong evidence. A boot sector is classified as viral if its total score is 3 or higher. This classifier performed well on the 350 examples, with a false-negative rate of about 18% and a false-positive rate too small to measure over the 100 negative examples. That is, 82% of viruses were detected, and no legitimate boot sector was classified as viral. They hoped to develop a procedure for automatically constructing a virus classifier, using similar features as inputs to a neural network. Since the ad hoc classifier incorporated knowledge of all of the available boot sectors, there was a possibility that it suffered from over-fitting, in which case it would generalize poorly on new data. It would be much easier to assess the generalization performance of an automatically constructed classifier. Also, the algorithmic extraction of features and optimization of network weights might give even better classification performance, especially in the false-positive measure. Finally, we believed that an automated procedure would adapt much more readily to new trends in boot sector viruses.
1.1 Feature selection
The first step in the construction was the selection of byte strings to act as features. Where a human expert is able to use high-level understanding of viruses, knowledge
of machine code, and natural intelligence to select complex feature patterns containing wildcards, for algorithmic feature generation we contented ourselves with simple 3-byte features. A training set with 150 512-byte viral boot sectors includes 76,500 "trigrams", of which typically 25,000 are distinct. This is where the first challenge, feature pruning, comes in. A well known principle in machine learning states that the number of training examples must be considerably larger than the number of adjustable parameters to reliably give good generalization to test examples. With 150 viral and 45 non-viral training examples, a network must have well fewer than 195 weights - say about 50 - implying a lesser or equal number of inputs. Somehow the 25,000 trigrams must be winnowed down to 50. Since what is desired are trigrams that are indicative of viral as opposed to legitimate behavior, it is natural to remove trigrams appearing too frequently in legitimate boot sectors. It is provided by selecting trigram features, which figure importantly in the viral training set. One way to do this would be to select trigrams occurring at least some number of times in the viral training set, but this leaves some viral samples unrepresented by any trigrams. A better approach comes from selecting a "cover" of trigrams. A set of trigrams with at least one trigram representing each of the viral samples. In fact, we can afford something close to a 4-cover, so that each viral sample is represented by 4 different trigrams in the set. Four-covering produces a set of about 50 trigram features, few enough to be used as input to a neural net. Reassuringly, most of the trigrams were substrings of or otherwise similar to the more complex patterns of the ad hoc classifier. However, there were a few trigrams that could not be related to any of these patterns, and on expert inspection they turned out to define a meaningful new feature class.
1.2 Classifier training and performance
By construction, the selected trigrams are very good features within the training set, no legitimate boot sector contains any of them, and most of the viral boot sectors
contain at least 4. Paradoxically, the high quality of the features poses the second challenge, what we have called the problem of ill-defined learning. Since no negative
example contains any of the features, any "positive" use of the features gives a perfect classifier. Specifically, the neural network classifier of with a threshold of 0 and any positive weights will give perfect classification on the training examples, but since even a single feature can trigger a positive, it may be susceptible to false positives on the test set and in real world use. The same problem shows up as an instability when the usual back-propagation training procedure is used to optimize the weights: larger weights are always better, because they drive the sigmoid function's outputs closer to the asymptotic ideal values of -1 and 1. In fact all that will keep a feature's ideal weighting from being infinite is the feature's presence in some negative example. Since none of the features were present in any negative example, our solution was to introduce new examples. One way is to add a set of examples defined by an identity matrix. That is, for each feature in turn, an artificial negative example is generated in which that feature's input value is 1 and all other inputs are 0. To do so, we used 512 bytes of code taken from the initial "entry point" portions of many PC programs to stand in as artificial legitimate boot sectors; the thought was that these sections of code, like real boot sectors, might be oriented to machine setup rather than performance of applications. At this point the problem is finally in the form of the most standard sort of feed-forward neural network training, which can be done by backpropagation. In typical training and testing runs, we find that the network has a false-negative rate of 10-15%, and a false-positive rate of 0.02% as measured on artificial boot sectors. Consistent with the 0.02% false-positive rate, there were no false positives on any of the 100 genuine legitimate boot sectors. Even though all the features are indicative of viral behavior, most training runs produced one or two slightly negative weights. They are not completely sure why this is so, but the simplest explanation is that if two features were perfectly correlated, only their total weight is important, so one may randomly acquire a negative weight and the other a correspondingly larger positive weight. For practical boot virus detection, the false-negative rate of 15% or less and false-positive rate of 0.02% are an excellent result: 85% of new boot sector viruses will be detected, with a tiny chance of false positives on legitimate boot sectors. In fact the classifier, incorporated into IBM Antivirus, has caught several new viruses. Of the 10 or 15% of viruses that escape detection, most do so not because they fail to contain the feature trigrams, but because the code sections containing them are obscured in various ways. If the obscured code is captured by independent means, the trigrams can be passed on to the classifier and these viruses too will be detected.
2.Computer Immune System
Although generic virus detection works well for boot sector viruses, and may eventually prove useful for file infectors as well, at least two drawbacks are inherent in the technique:
New viruses can be detected only if they have a sufficient amount of code in common with known viruses.
The method is appropriate for viral detection only; it is incapable of aiding in the removal of a virus from an infected boot sector or file. The only way to eliminate the infection is to erase or replace the infected boot sector or file.
The vertebrates have evolved a more sophisticated, adaptive immune system that works in concert with the innate immune system, and is based on recognition of specific pathogens. It exhibits the remarkable ability to detect and respond to previously un-encountered pathogens, regardless of their degree of similarity to known pathogens. This is precisely the sort of defensive capability that we seek against computer viruses. The immune system responds to virus-like anomalies by capturing and analyzing viral samples. From its analysis, it derives the means for detecting and removing the virus. Many components of the computer immune system are working in the laboratory, and are providing useful data that is incorporated into IBM Antivirus. First, we shall consider the set of components that are labeled as being currently in IBM Antivirus: anomaly detection, scanning for known viruses, and removal of known viruses. Then, we shall discuss some of the components that are labeled as being currently in the virus lab: sample capture using decoys, algorithmic virus analysis, and signature extraction. These components are all functioning prototypes. Finally, we shall discuss a mechanism by which one machine can inform its neighbors about viral infections.
2.1 Anomaly detection
The fundamental problem faced by both biological and computer immune systems are to distinguish between malignant and benign entities that enter the individual. Due to the high degree of stability of body chemistry in individual vertebrates during their lifetimes, their immune systems can replace this difficult task with the much simpler one of distinguishing self from non-self. This is a nice hack, because "self" is much easier to define and recognize than "benign". The biological immune system can simply implement the xenophobic strategy: "Know thyself (and reject all else)." In computers, the same xenophobic strategy is an important component of anomaly detection. Integrity monitors can use checksums to determine whether an existing executable has changed. However, this is only a partial solution. The nature of "self", i.e. the collection of software on an individual computer, is continually shifting over time - much more so than in biological organisms. People continually add new software to their system, and update existing software by buying new versions or compiling new source code. The fact that an executable is new or has changed is not nearly enough to warrant suspicion. An array of other monitors and heuristics employ a complementary "Know thine enemy" strategy: the nature of the anomaly must be strongly indicative of a virus. Some components of the anomaly detector trigger on suspicious dynamical behaviors; others trigger on static properties having to do with the exact nature of a change that has been identified by the integrity monitor.
2.2 Scanning for known viruses
If the anomaly detector has been triggered, the system is scanned for all known viruses. Since there are currently at least 4000 known PC DOS viruses, this means that exact or slightly inexact matches to approximately 4000 signatures, each in the range of roughly 16 to 32 bytes long, are searched in parallel. This is in itself an interesting string matching problem, and efficient search methods are an active area of research for them. Much more impressive than any string matching algorithm we could ever hope to devise, however, is the parallel search carried out by the vertebrate immune system, in which roughly 10 million different types of T-cell receptors and 100 million different types of antibodies and B-cell receptors are continually patrolling the body in search of antigen. Just as a computer virus scanner recognizes viruses on the basis of matches to a fragment of the virus (the signature), T-cell and B-cell receptors and antibodies recognize antigen by binding to fragments of the antigen. Matching to fragments rather than the entire antigen is a physical necessity in the biological immune system; in computers, this strategy is not absolutely necessary, but it has some important advantages. Matching to fragments is more efficient in time and memory, and permits the system to recognize slight variants, particularly when some mismatches are tolerated. For both biological and computer immune systems, an ability to recognize variants is essential because viruses tend to mutate frequently. If an exact match were required, immunity to one variant of a virus would confer no protection against a slightly different variant. Similarly, vaccines would not work, because they rely on the biological immune system's ability to synthesize antibodies to tamed or killed viruses that are similar in form to the more virulent one that the individual is being immunized against.
2.3 Virus removal
In the biological immune system, if an antibody encounters antigen, they bind together, and the antigen is effectively neutralized. Thus recognition and neutralization of the intruder occur simultaneously. Alternatively, a killer T cell may encounter a cell that exhibits signs of being infected with a particular infecting agent, whereupon it kills the host cell. This is a perfectly sensible course of action, because an infected host cell is
slated to die anyway, and its assassination by the killer T cell prevents the viral particles from reaching maturation. A computer immune system can take the same basic approach to virus removal: it can erase or otherwise inactivate an infected program. However, an important difference between computer viruses and biological viruses raises the possibility of a much gentler alternative. Verification is based upon checksums of regions of viral code that are known to be invariant across different instances of the virus. The exact location and structure of the virus must have been derived beforehand, and expressed in terms of a language understood by the verification algorithm. If the verification does not succeed, an attempt to remove the virus by this means is considered too risky, and another more generic virus removal method is brought into play. If the verification succeeds, a repair algorithm carries out the appropriate sequence of steps required for removing that virus, expressed in a simple repair language. The sequence of steps is easily derived from an analysis of the locations of all of the portions of the original host. Although the analysis required to extract verification and removal information has traditionally been performed by human experts, we shall discuss in a later subsection an automated technique for obtaining this information.
Suppose that the anomaly detector has found evidence of a virus, but that the scanner cannot identify it as any of the known strains. Most current anti-virus software will not be
able to recover the host program unless it was deliberately stored or analyzed prior to becoming infected. Ideally, one would like to have stronger evidence that the system really is infected, and to know more about the nature of the virus, so that all instances of it can be found and eliminated from the system. In the computer immune system, the presence of a previously unknown virus in the system can be established with much greater certainty than can be provided by the anomaly detector. The idea is to lure the virus into infecting one or more members of a diverse suite of "decoy" programs. Decoys are designed to be as attractive as possible to those types of viruses that spread most successfully. A good strategy for a virus to follow is to infect programs that are touched by the operating system in some way. Such programs are most likely to be executed by
the user, and thus serve as the most successful vehicle for further spread. Therefore, the immune system entices a putative virus to infect the decoy programs by executing, reading, writing to, copying, or otherwise manipulating them.
Immune System overview
Such activity attracts the attention of many viruses that remain active in memory even after they have returned control to their host. To catch viruses that do not remain active in memory, the decoys are placed in places where the most commonly used programs in the system are typically located, such as the root directory, the current directory, and other directories in the path. The next time the infected file is run, it is likely to select one of the decoys as its victim. From time to time, each of the decoy programs is examined to see if it has been modified. If any have been modified, it is almost certain that an unknown virus is loose in the system, and each of the modified decoys contains a sample of that virus. These virus samples are stored in such a way that they will not be executed accidentally.
2.5 Automatic virus analysis
Typically, a human expert applies a deep understanding of machine instruction sequences to virus analysis. Sometimes, this is combined with observation of the effects of the virus on a program. Our automatic virus analysis algorithm is much less sophisticated in its knowledge of machine code, but makes up for this deficiency by making use of more data: specifically, several samples of the virus. Once a few samples of the virus have been captured, the algorithm compares the infected decoys with one another and with the uninfected decoys to yield a precise description of how the virus attaches to any host. The description is completely independent of the length and contents of the host, and to some extent can accommodate self-encrypting viruses. A pictorial representation of one particularly simple infection pattern is presented in.
Automatic virus analysis provides several useful types of information:
The location of all of the pieces of the original host within an infected file, independent of the content and length of the original host. This information is automatically converted into the repair language used by the virus removal component of IBM Antivirus.
The location and structure of all components of the virus. Structural information includes the contents of all regions of the virus that are invariant across different samples. This information has two purposes:
It is automatically converted into the verification language used by the verification component of IBM Antivirus
It is passed to the automatic signature extraction component for further processing
2.6 Automatic signature extraction
The basic goal of automatic signature extraction is to choose a signature that is very likely to be found in all instances of the virus, and very unlikely to be found accidentally in uninfected programs. In other words, we wish to minimize false negatives and false positives. False negatives are dangerous because they leave the user vulnerable to attack. False positives are extremely annoying to customers, and so infuriating to vendors of falsely-accused software that they have led to at least one lawsuit. To minimize false negatives, we first start with the contents of the invariant regions that have been identified by the automatic virus analysis procedure. However, it is quite conceivable that not all of the potential variation has been captured within the samples. As a general rule, non-executable "data" portions of programs, which can include representations of numerical constants, character strings, work areas for computations, etc., are inherently more likely to vary from one instance of the virus to another than are "code" portions, which represent machine instructions. The origin of the variation may be internal to the virus, or a virus hacker might deliberately change a few data bytes in an effort to elude virus scanners. The remaining goal is to select from among the candidates one or perhaps a few signatures that are least likely to lead to false positives. We have formulated the problem of minimizing the false positive probability as follows. For each candidate signature, estimate the probability for it to match a random sequence of length S that is generated by the same probability distribution that generates legitimate software on the relevant platform. Then, we select the candidate signature for which the estimated probability is the smallest. In slightly more detail, the key steps of the algorithm are as follows:
1) Form a list of all n-grams (sequences of n bytes; 1 < n < n max) contained in the input data. (n max is typically 5 or 8.)
2) Calculate the frequency of each such n-gram in the "self" collection.
3) Use a simple formula that chains together conditional probabilities based on the measured n-gram frequencies to form a "false-positive" probability estimate for each candidate signature, i.e. the probability that it matches a random S-byte sequence chosen from code that is statistically similar to "self.
Select the signature with the lowest estimated falsepositive probability.
2.7 Immunological memory
The mechanisms by which the vertebrate immune system retains a lifelong memory of viruses to which it has been exposed are quite complex, and are still the subject of
study and debate. By contrast, immunological memory is absolutely trivial to implement in computers. During its first encounter with a new virus, a computer system may be "ill", i.e. it will devote a fair amount of time and energy to virus analysis. After the analysis is complete, the extracted signature and verification/repair information can be added to the appropriate known-virus databases. During any subsequent encounter, detection and elimination of the virus will occur very quickly. In such a case the computer can be thought of as "immune" to the virus.
2.8 Fighting self-replication with self-replication
In the biological immune system, immune cells with receptors that happen to match a given antigen reasonably well are stimulated to reproduce themselves. This provides a very strong selective pressure for good recognizers, and by bringing a degree of mutation into play, the immune cell is generally able to come up with immune cells that are extremely well-matched to the antigen in question. One can view this as a case in which self-replication is being used to fight a self-replicator in a very effective manner. They propose to use a similar mechanism, which we call the "kill signal", to quell viral spread in computer networks. When a computer discovers that it is infected, it can send a signal to neighboring machines. The signal conveys to the recipient the fact that the transmitter
was infected, plus any signature or repair information that might be of use in detecting and eradicating the virus. If the recipient finds that it is infected, it sends the signal to its neighbors, and so on.
The development of the generic virus detector and the computer immune system were primarily motivated by practical concerns: human virus experts are on the verge of being overwhelmed, and we need to automate as much of what they do as possible. The generic virus detector was incorporated into IBM Antivirus in May 1994, and since that time it has successfully identified several new boot viruses. It is the subject of a pending patent. Most of the components of the computer immune system are functioning as very useful prototypes in our virus isolation laboratory; we use them every day to process the large sets of new viruses that arrive in the mail from other virus experts around the world. The immune system itself is the subject of a pending patent, as are several of its components, including automatic virus analysis and automatic signature extraction. Our eventual goal is to incorporate the immune system into IBM Antivirus and, a few years from now, in networks inhabited by itinerant software agents.