A Similarity Test For Metamorphic Viruses Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Metamorphic virus code generates different copies of itself by changing their code. The new virus has the similar functionality as the parent but they have different structure. The goal of metamorphic virus is to produce viral copies by mutating the code structure and have no common signature. Since the viral copies are different they evade the signature detection technique which is generally used by anti-virus scanners.

Several detection techniques were used for detecting metamorphic viruses. Hidden Markov Model (HMMs) detection tool showed good results for classifying whether virus comes under the virus family or not. For the metamorphic viruses which have some similarity with the normal files evade the HMM detection technique.

In this project, we analyze the Similarity Index technique which was previously developed. We tested similarity score between viruses and normal files for viruses similar to normal files. We create an intelligent similarity score threshold that classify whether the program belongs to the family virus or not. We showed that we came up with the similarity index based detection technique which detects the metamorphic viruses which were not completely detected by previously developed HMM detection tool.


A computer virus is a program that, when executed, replicates itself without the users permission or knowledge [13]. It spreads the infection by attaching themselves to other executable code. The infected program, when launched, can replicate itself to infect other executables and change their behavior [8]. It relies in some way to other executable code to spread its infection.

A virus might perform malicious activities on the machine such as file viruses that corrupts the file system by infecting batch files, shell script and binary executable, Boot sector viruses which overwrites hard disk partition sector, damage caused by macro viruses and companion files. Modern viruses take the advantage of internet to propagate over the network and spread its infection globally.

Virus construction kit is available on the internet to make virus creation process simpler [22]. This helps user who has less knowledge of the assembly coding to produce their own viruses. There are several antivirus defense programs available today to identify malware. Most commonly used antivirus detection technique is signature detection which searches the content of the files in file system with the signatures stored in antivirus database. Another approach is code emulation where virus runs in virtual environment and its actions are recorded in log file. Based on logged action, the antivirus determines if the program performs the unusual activities and takes the appropriate action to prevent the infection [16].

To pass around the signature based detection technique, virus writer choose code obfuscation techniques which alters the structure of the code by reordering the assembly instruction, dead code insertion, instruction substitution, unconditional jumps. This resulted into metamorphic viruses with same functionality but has different structure.

To contend metamorphic viruses, Hidden Markov Model (HMM) detection tool based on statistical pattern analysis tool was developed [2]. This virus detection tool is initially trained and then used to determine if the program belongs to the virus family or non-virus (normal) family. Performance of the HMM tool is measured in terms of time spent on training the model, false negative and false positive value and gross precision of classification [2].

General metamorphic viruses with code obfuscation technique evades signature detection but are detected by HMM [9]. Comprehensive metamorphic viruses with less similarity with its virus family viruses but have some similarity with normal files were not completely detected by HMM [3].

The aim of the project is to identify comprehensive metamorphic viruses using similarity index approach. Similarity index technique is used to classify program to determine if it belong to virus family or normal files. The similarity index techniques were applied between various virus files of same virus family, normal files, and between normal files and virus files. The threshold value is obtained by testing several virus and normal files which classify the program.

This paper is organized as follows: In section 2, we provide background information on some computer viruses. Section 3 we discuss antivirus technique for possible defenses. Section 4 details evolution of virus and code obfuscation technique to generate highly morphed viruses with some similarity with normal files. Section 5 details the design and implementation of our similarity index test technique. Section 6 covers the experimental results from our similarity index technique. Section 7, presents our conclusions and possible future enhancements.


Computer virus is self-replicating program that performs malicious activities by infecting other host files. The host files, when executed, can infect other files in turn. Mostly a virus comprises of three modules [8]. For example, the file infector virus, which attach itself in code of other host program. The infected file can be any executable application or a game. On execution of infected program, the virus loads itself into PC's memory. This virus is loaded in PC's memory separately from the local host file and thus it continues to run even after the host file shutdown its execution. "Before the initiation of the internet, file infector viruses accounted for probably 85% of all virus infections" [11].

Figure 1. Computer Virus [12]


This section outlines some of the most commonly used antivirus techniques to detect computer viruses.

Signature Detection

Signature detection technique is widely used to detect the viruses. Signature is a pattern of bits found in virus [1]. This string of bits is found in virus file and stored in antivirus database. The virus scanner searches the entire file system for known signatures. If the known signature is found then the file is marked as infected. For example, executable file infected by W32/Beast virus comprise of following pattern of bits as signature [13].

"83EB 0274 EB0E 740A 81EB 0301 0000"

The virus scanner searches entire file system for this signature. It declares the file to be the Beast virus if this signature is present.

Some virus scanner support wildcard search strings, such as "??02 34C9 8CD1 429C" where '?' indicates the wildcard. Wildcard strings permits skipped bytes and regular expressions and it helps in detecting encrypted viruses in some cases [19].

Heuristic Analysis

Heuristic analysis is a method used by antivirus software to detect new or unknown computer viruses. There are two types of heuristic scanning techniques. The difference between two approaches is whether heuristic scanner makes the use of CPU emulation to scan for virus like behavior. A heuristic scanner has two phases of operation when scanning files for viruses. In first phase of operation, scanner observes the behavior of the program. A scanner looks for specific area in file where virus would attach itself. In second phase, it determines the program logic which can be executed by computer instruction in the specific area identified in first phase [10]. The program is flagged as virus if it contains the certain percentage of computer instruction similar to viral instructions.

The Heuristic analysis results many false positives as it mostly operates on basis of past experience [23]. This might not detect new viruses that contain code different from previously known virus program. Heuristic scanner creating many false positives looses users' trust and interest.


The following are different strategies to hide viruses from the antivirus detection techniques.

Encrypted Viruses

Encryption is the simplest way to conceal the virus from antivirus program. Encrypted virus contains encrypted body and decryptor logic. Most of the antivirus program attempt to find the virus by looking at specific string of bits in program. To avoid detection, viruses encrypt the body using encryption key to conceal the pattern of code. Different encryption key generates different encrypted virus body. The logic of encryption is kept simple, such as XOR, the key for encrypting the virus body [3]. Even though encrypted virus body looks different for all generation, the decryptor always remains constant from generation to generation. As a result, detection of encrypted viruses is possible. The antivirus program can detect the decryptor by its code pattern even if it cannot decrypt the virus body.

Polymorphic Viruses

Polymorphic viruses is one of the more complex technique implemented by virus coders to overcome the disadvantage of encrypted viruses. To make it more effective then encrypted viruses, polymorphic viruses have different methods of decryption by mutating the decryptor logic. More advanced version of polymorphic virus substitute the mutually independent instruction such as moving "0" to B or adding "0" to A resulting in inexact values. This evades the antivirus program looking for specific code of pattern in virus. Virus scanners based on signature detection method have to search different string of bits, for each likely decryption methods for detecting this type of viruses.

Anti-virus software even uses code emulator to detect the polymorphic virus. The code emulator let the virus execute and observe its behavior. It emulates the decryption process and detects the decrypted virus body.

Figure 2. Polymorphic virus generations [20]

Metamorphic Viruses

Virus coders developed metamorphic viruses which does not carry any decryptor or constant virus body like polymorphic viruses. This virus changes its code for each infection by inserting or removing junk code to its source. This makes it more resistant to code emulation detection technique. Unlike polymorphic viruses, encryption is not used in metamorphic viruses. The virus body has different structures with same functional behavior. Figure 3 shows the metamorphic virus with different body structures. Metamorphic virus uses various code obfuscation techniques to hide from virus scanner detection techniques.

Figure 3. Metamorphic virus generations [20]

Register Swap (Register Usage Exchange)

Register swap is one of the simplest metamorphic techniques which changes register operands in the virus body with different registers. Instructions remains constant for all virus generation, only register changes. For example, instruction "mov edi,0004h" might be substituted with "mov ebx,0004h". The W95/Regswap virus [7] is an example of the metamorphic viruses which used the register swap technique. Figure 4 shows the code, which follows Regswap technique where the sequence of opcode is unchanged. Wildcard string can usually detect such metamorphic viruses [19].

Figure 4. Two different generations of RegSwap [9]

Junk Instruction Insertion

Junk code insertion is an effective technique employed by metamorphic viruses to change the appearance of the virus body. Inserting junk code does not affect the execution of the program. Junk instructions are instructions that are either have no effect or not executed in the program [13]. Example of Do-nothing instructions are "mov edx,edx" , "add R1,0", "sub R1,0" or "nop".

Dead code insertion can be done as a single instruction or a block of instructions between the core instructions. Figure 5 shows the example of the Evol virus which implemented the junk code insertion technique by adding a block of dead code.

Figure 5. Dead Code insertion in Evol virus [14]

Instruction Substitution

Instruction substitution is another useful techniques use to substitute an instruction or a block of instruction with an equivalent instruction or an equivalent block of instructions. For example, "push edx", "pop eax" can be substituted by "add eax,1" followed by "move eax,edx". Table 1 shows the W32/MetaPhor virus [15] implementing instruction substitution. The "Mov Reg,Imm" operation is equivalent to "Mov Mem,Reg" followed by "op Mem,Reg2" and "MOV Reg,Mem".

Table 1. Example of instruction substitution used by W32.MetaPhor virus [15]

Instruction Transposition

Transposition is a method to change the order of execution of the instructions. Instruction permutation between the instructions does not affect the program outcome and it can be applied only if there is no mutual dependency between the instructions. Consider the following instruction set:

(op1 r1, r2)

(op2 r3, r4) // r1 and/or r3 register are to be modified

The instructions can be reordered only if following conditions are satisfied:

r1 is not equal to r4,

r2 is not equal to r3,

r1 is not equal to r3,

Subroutine Transposition

Subroutine is an effective technique can apply to change the appearance of the virus by reordering the subroutines. There can be n! different generation of subroutines for n different subroutines. The W32/Ghost virus [15] implements the subroutine transposition technique. The virus contains 10 subroutines generating 10! Distinct copies. Detection of such virus can be accomplished by string driven pattern detection technique.

Figure 6 Subroutine Transposition


To evade the signature based detection, metamorphic engine produce highly morphed copies of itself. Each generation of viruses are different in structure. Similarity index test technique is used to measure the similarity between the dissimilar virus copies. It gives the percentage if similarity between the morphed copies. A similarity test technique is developed by P. Mishra in [4].

Similarity Test method

To measure the similarity between the virus copies, two assembly files are compared based on the opcode sequence present in them. The following steps are followed to compute the similarity between two files and graphically illustrated in Figure 7.

Given two assembly files, file1.asm and file2.asm, we extract the sequence of opcodes from both the files, excluding labels, comments, blank lines and other directives. Let's call these resulting opcode sequences F1 and F2 for file1.asm and file2.asm respectively. Let m and n are the number of opcodes in F1 and F2, respectively. Opcode number is assigned to each of the opcode in the resulting opcode sequence: 1 for the first opcode, 2 for the second, and so on.

Opcode sequence is divided into subsequences of three consecutive opcodes. We compare the opcode sequences, F1 and F2, considering all subsequences of three consecutive opcodes from each sequence. We considered a match, if three opcodes are same in any order. For example F1 is (add,call,test,sub,mov) and F2 is (mov,add,call,sub,test). The sequence (call,test,sub) in F1 matches with (call,sub,test) of F2. The process is repeated for all the opcodes in F1 and F2.

We mark the match on the graph coordinate(X,Y) where X is the opcode number of the first opcode of the three opcode subsequence in file F1, and Y is the opcode number of the opcode subsequence in file F2. To find the total number of match in F1, all matches are computed and added together. The total number of match is divided by m to get similarity percentage of F1(X). Similarly, similarity percentage for F2(Y) is computed.

Similarity percentage between the files, file1.asm and file2.asm is obtained by taking average of F1 and F2.

A graph is plotted to on grid of dimension n Ã- m to visualize the similarity of the both files by marking all the match coordinates. The x-axis represents the opcode numbers of file F1 and the y-axis represents the opcode numbers of file F2.

Figure 7 Process of finding the similarity between two assembly programs.