A Study on Decompilers

A decompiler is a computer program that performs the reverse operation of a compiler. It translates a file containing information at a low level of abstraction that is machine readable (i.e. machine instructions) into a higher level of abstraction designed to be human readable. A decompiler's goal is to convert a compiled binary file into a source file. This conversion is done for several reasons, for example to understand how a program works, or to modify a program so that it can be enhanced or a bug can be fixed.

When a programmer writes software, and releases it to the public, he (or she) normally releases a compiled version of the application, that users can run on their own machine. Whether it is a commercial offering, or a free piece of software, the programmer has put a considerable amount of time and effort into producing it. The source code behind software is something private, that the programmer has created. Programmers don't want people looking for flaws in their software, and they don't want people to change the title of the software and then redistribute it as someone else's product. It is for this reason that programmers don't often release their source code - but few realize that every time you release compiled software, you are also giving people the opportunity to reconstruct the source code!

Software that examines software

Decompilers are programs that analyze compiled code and, from this, reconstruct an approximation of the original source code. Decompilation and reverse engineering are often prohibited by software license agreements, but this won't always stop an unscrupulous competitor or an enthusiastic hacker from analyzing your code. Decompilers are freely available for a variety of languages and platforms, including Java! Read on, and I'll introduce you to the world of decompilation.

How do they work?

Decompilers work by analyzing the byte code of software, and making educated guesses about the code that created it. Most classes also contain additional debugging information, which can help the decompiler create a more accurate representation of the original source code. This debugging information is invisible to normal users, and many programmers don't even realize just how much information can be obtained from their classes - but there are ways to protect your code.

Software is available that will strip away debugging information, and even change the names of local and member variables inside your classes. SourceGuard, for example, will rename your variables to meaningless names, which decreases the readability of decompiled source. When you protect your code with applications like SourceGuard, decompilers have less information on which to base their analysis, and it becomes harder for programmers to understand the code they produce.

// Calculate difference in dates
long numDiff = x_Date.getTime()
       - currentDate.getTime();
// Divide by 1000 to find number of seconds difference
// (getTime() returns milliseconds)
numDiff = numDiff / 1000;
// Get seconds
int sec = (int) numDiff % 60;
// Get minutes
numDiff = numDiff / 60;
int min = (int) numDiff % 60;

Original source code

p = (this.C.getTime() -
       ((java.util.Date)(obj)).getTime());
p = (p / 1000);
j4 = (((int)p) % 60);
p = (p / 60);
j3 = (((int)p) % 60);
p = (p / 60);

Source decompiled with Mocha, after SourceGuard protection

The success of decompilation depends on the amount of protection that software developers use, and on the decompiler that is used. Many decompilers fail to decompile correctly, and some will even produce code that won't compile, particularly when faced with strong protection from a product like SourceGuard, which offers a feature called byte-code range modification (BRM). BRM prevents most decompilers from decompiling methods that have try { } catch blocks, and makes most of them produce garbled code.

Preventing decompilation is a valuable feature. Of course, such protection isn't uncrackable. While there are plenty of free decompilers out there, you really get what you pay for. With free tools, the code that is produced ranges from complex to unusable when protected by a tool that is decompiler resistant. With commercial tools, you can get varying degrees of success, and at least one tool is capable of breaking the byte-code range modification technique of SourceGuard.

SourceAgain, by Ahpah Software Inc, is capable of decompiling BRM protected classes effectively, and produces much more readable code than free software like Mocha or DejaVu. SourceAgain is available in three versions: a standalone decompiler, a professional edition that integrates with Symantec Visual Cafe and Microsoft Visual J++, and a Unix version. For those interested in using decompilers, a trial of SourceAgain is accessible from the web, and it can decompile classes that are accessible from an http:// address.


Uses of decompilers

While decompilers do represent a threat, they can also be of great benefit to programmers. There are many legitimate purposes for decompilers, such as reconstructing lost source code from a binary executable. When an old application needs to be upgraded or modified, and the original source code isn't available (perhaps the company which commissioned it never received the original source code, or perhaps the source code was lost because it wasn't considered important enough to back up), reverse engineering through decompilation can be very useful. Some companies might decompile a competitor's software to establish the structure of its data files, so that they can include support for that file type in their own application. Whether or not such actions are legal is a gray area, but including support for competing spreadsheets, word processors, databases, etc. is handy for end-users.


Design Phases:

The decompilation process consists of several phases, each of which handles one specific aspect. These phases are:

Loader

The loader is the first phase of a decompiler. It parses the program's binary file and loads the machine code that it contains. The loader must also discover basic facts about the program, such as its architecture and its entry points.
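
One of the first facts a loader can discover is the file format, which is usually announced by a "magic number" in the first few bytes of the file. The sketch below is only an illustration: the class and method names are made up for this essay, and a real loader would go on to read the rest of the header (architecture, entry point, section layout) rather than stop at the signature.

```java
final class FormatSniffer {
    /** Guesses the container format from the first bytes of a file. */
    static String sniff(byte[] header) {
        // Java class files start with the bytes CA FE BA BE.
        if (header.length >= 4
                && (header[0] & 0xFF) == 0xCA && (header[1] & 0xFF) == 0xFE
                && (header[2] & 0xFF) == 0xBA && (header[3] & 0xFF) == 0xBE) {
            return "Java class file";
        }
        // ELF executables start with 7F 'E' 'L' 'F'.
        if (header.length >= 4
                && header[0] == 0x7F && header[1] == 'E'
                && header[2] == 'L' && header[3] == 'F') {
            return "ELF executable";
        }
        // DOS .EXE files start with the letters "MZ".
        if (header.length >= 2 && header[0] == 'M' && header[1] == 'Z') {
            return "DOS/Windows MZ executable";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        System.out.println(sniff(classHeader));
    }
}
```

A signature alone is not proof of the format (other file types can share the same leading bytes), so this check is best treated as a first guess that later header parsing confirms.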

Disassembly

Disassembly is the next logical phase. Here the machine-dependent code is translated into a machine-independent intermediate representation (IR).

Idioms

Idioms are machine code sequences whose combined semantics are not immediately apparent from the semantics of the individual instructions. Either as part of the disassembly phase or as part of later analyses, these idiomatic sequences should be translated into known equivalent IR (intermediate representation).
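
A well-known x86 idiom is clearing a register by XOR-ing or subtracting it with itself. The following minimal sketch rewrites just that one idiom on textual instructions. The textual format and the names used here are assumptions made for illustration, and the side effects on the processor flags are deliberately ignored.

```java
final class IdiomRewriter {
    /** Rewrites "xor r, r" and "sub r, r" into the plain assignment "r = 0". */
    static String rewrite(String instruction) {
        // Split "xor ax, ax" into the mnemonic and its two operands.
        String[] parts = instruction.trim().split("[\\s,]+");
        if (parts.length == 3 && parts[1].equals(parts[2])) {
            String op = parts[0].toLowerCase();
            if (op.equals("xor") || op.equals("sub")) {
                return parts[1] + " = 0";
            }
        }
        // Anything else is passed through unchanged.
        return instruction.trim();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("xor ax, ax"));
        System.out.println(rewrite("mov ax, bx"));
    }
}
```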

Program analysis

Various program analyses can be applied to the IR. Expression propagation, for example, combines the semantics of several instructions into a single, more complex expression.
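
Expression propagation can be sketched on a list of simple assignments. The version below assumes, purely for illustration, that temporaries are named t1, t2, ... and are each assigned exactly once, and that no operand is changed between a temporary's definition and its use. A real decompiler has to prove those conditions with data flow analysis first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;

final class ExpressionPropagation {
    /** Substitutes temporaries (t1, t2, ...) into the statements that use them. */
    static List<String> propagate(List<String> statements) {
        Map<String, String> temps = new LinkedHashMap<>();
        List<String> result = new ArrayList<>();
        for (String stmt : statements) {
            String[] sides = stmt.split("=", 2);
            String target = sides[0].trim();
            String expr = sides[1].trim();
            // Replace every known temporary by its (already expanded) expression.
            for (Map.Entry<String, String> e : temps.entrySet()) {
                String replacement = Matcher.quoteReplacement("(" + e.getValue() + ")");
                expr = expr.replaceAll("\\b" + e.getKey() + "\\b", replacement);
            }
            if (target.matches("t\\d+")) {
                temps.put(target, expr);           // remember it, emit nothing
            } else {
                result.add(target + " = " + expr); // a real variable: emit it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> code = new ArrayList<>();
        code.add("t1 = a + b");
        code.add("t2 = t1 * c");
        code.add("x = t2 - 4");
        System.out.println(propagate(code)); // prints [x = ((a + b) * c) - 4]
    }
}
```

Three instruction-sized statements collapse into one source-like expression, which is much closer to what a programmer would have written.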

Data flow analysis

Data flow analysis is used to trace the places where register contents are defined and used. A similar analysis can also be applied to the locations that are used for local data and for temporary data. Each separate set of value definitions can then be given a different name.
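
The idea of giving each definition its own name can be sketched for straight-line code as follows. The instruction layout (a destination followed by the tokens of the right-hand side) and the fixed register list are assumptions made for this example. Branches and loops, which need a proper data flow analysis, are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DefinitionRenamer {
    private static final Set<String> REGISTERS =
            new HashSet<>(Arrays.asList("ax", "bx", "cx", "dx"));

    /** Each row is {destination, token, token, ...}; straight-line code only. */
    static List<String> rename(String[][] code) {
        Map<String, Integer> version = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String[] insn : code) {
            StringBuilder rhs = new StringBuilder();
            for (int i = 1; i < insn.length; i++) {
                String tok = insn[i];
                if (REGISTERS.contains(tok)) {
                    // A use refers to the most recent definition (0 = value on entry).
                    tok = tok + "_" + version.getOrDefault(tok, 0);
                }
                rhs.append(i > 1 ? " " : "").append(tok);
            }
            // A definition always creates a fresh name.
            int next = version.getOrDefault(insn[0], 0) + 1;
            version.put(insn[0], next);
            out.add(insn[0] + "_" + next + " = " + rhs);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] code = {
            {"ax", "bx", "+", "1"},
            {"cx", "ax", "*", "2"},
            {"ax", "cx"},
        };
        System.out.println(rename(code)); // prints [ax_1 = bx_0 + 1, cx_1 = ax_1 * 2, ax_2 = cx_1]
    }
}
```

After renaming, it is obvious that the two values held in ax are unrelated, so a later phase is free to turn them into two different variables.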

Type analysis

A good machine code decompiler will perform type analysis. The way registers or memory locations are used results in constraints on the possible type of the location.
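
A minimal sketch of this constraint idea: start with every candidate type for a location and let each observed use rule some of them out. The use names and the table that maps a use to the types it allows are invented for this illustration and are far simpler than what a real decompiler needs.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class TypeConstraints {
    enum Type { INT, UNSIGNED, POINTER, CHAR }

    /** Narrows the candidate types of one location from the way it is used. */
    static Set<Type> infer(List<String> uses) {
        Set<Type> candidates = EnumSet.allOf(Type.class);
        for (String use : uses) {
            switch (use) {
                case "dereference":       // used as an address
                    candidates.retainAll(EnumSet.of(Type.POINTER));
                    break;
                case "signed-compare":    // compared with a signed jump
                    candidates.retainAll(EnumSet.of(Type.INT, Type.CHAR));
                    break;
                case "unsigned-compare":  // compared with an unsigned jump
                    candidates.retainAll(EnumSet.of(Type.UNSIGNED, Type.POINTER));
                    break;
                case "byte-load":         // loaded as a single byte
                    candidates.retainAll(EnumSet.of(Type.CHAR));
                    break;
                default:                  // unknown uses add no constraint
                    break;
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(infer(java.util.Arrays.asList("unsigned-compare", "dereference")));
    }
}
```

An empty result would mean that the uses contradict each other, which in practice usually indicates that one location is being reused for two different variables.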

Structuring

Structuring of code is essential. This phase involves structuring the IR into high level constructs such as loops and conditional statements.

Code generation

This is the final phase of a decompiler. In this phase the human readable, high level code is generated.


Decompiler for TurboC : DisC

Various decompilers are available for TurboC. One of them is DisC. The TurboC compiler takes a C language program and generates a DOS executable file. DisC can interpret this DOS executable file and turn it back into human readable form.

The logic that the decompiler uses to interpret the machine code is tailored to the TurboC compiler.

Features

  • Complete decompilation of executable files, including the functions which are called.

  • It identifies constructs of C programming like "while", "do...while", "if", "for" and "switch..case" statements. DisC even understands the "ternary" operator.

  • It also identifies "comma expressions", for example the code for an expression like "(x=20,y=10,30)".

  • DisC can check the standard C library files to identify library functions present in the executable that is to be decompiled, and substitute the appropriate library function name in all references. We can even specify our own user-defined libraries for this search.

  • It can decompile programs which are compiled using the memory models in TurboC.

  • A built-in code reorganizer takes in the raw output from the executable and reorganizes it into structured source code.

  • It contains a complete 8086 disassembler.

Features not implemented

  • Does not recognize floating point code.

  • It does not recognize strings separately.

  • Not 100% automatic: the user still needs to add a few lines of code to the output to compile it again!

  • Presently limited to TurboC (more specifically, version 2.0/2.01), though it is pretty straightforward to extend it to other compilers.

Under the Hood

From a user's point of view, we tell the software which executable file to decompile (only DOS .EXE files are supported), and it does the rest.

  1. It tries to figure out where the "main" function is located. If that function cannot be found automatically, we have to find the entry point of the program ourselves. But do not worry, DisC normally finds the location of "main".

  2. We are then shown a menu (DisC is entirely text-based; what else can we expect from a simple DOS program?) where we can choose whether we just want to look at the assembly code of the program, decompile a specific function, do a fully automatic decompilation, and so on.

  3. When we choose Full decompilation, there is nothing more to do. DisC tries on its own to understand the program, and generates equivalent C code which is saved in the output file "_DISC.C", which we can review.

  4. When we choose to Decompile, we are asked for the address of the function (normally main) which we want to decompile. To know which function to decompile, we must first have looked at the assembly listing of the executable program (which is also shown by DisC!). We can choose to go on recursively decompiling all functions called by this function, or just stop with this function. Once again, the output is saved in the file "_DISC.C".

What DisC actually does is...

  • Each and every instruction is emulated by DisC! This gives it a chance to understand the complete flow of the program, and whenever it decides that a C-language statement has completed execution, it prints out what it has found so far. Then it carries on once again. Simple, isn't it?

  • Whenever it finds a pattern which closely corresponds to one of the C-language constructs ("if", "for", ...) it notes down that position, and later on uses this information to reinterpret the code and give a better output.

  • DisC works in 2 passes. During the first pass, it tables all function calls, all branches and C-language constructs. During the second pass, the actual decompilation is done, with the help of the tables built in the first pass.

  • While comparing with the standard C libraries to find out any library functions used, DisC does an "instruction-by-instruction" compare, and not a "byte-by-byte" compare... This gives a better judgement for library functions.
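
Why would an instruction-by-instruction compare judge better than a byte-by-byte one? One plausible reason is that the same library routine contains different address bytes once it has been linked into a program, while its sequence of instructions stays the same. The sketch below illustrates that reasoning only. It is not DisC's actual algorithm, and the textual instruction format and all names are assumptions.

```java
final class LibraryMatcher {
    /** Replaces four-digit hexadecimal operands such as "1A2Ch" by a wildcard. */
    static String normalize(String instruction) {
        return instruction.trim().toLowerCase()
                .replaceAll("\\b[0-9a-f]{4}h\\b", "ADDR");
    }

    /** Compares two routines instruction by instruction, ignoring addresses. */
    static boolean sameRoutine(String[] candidate, String[] library) {
        if (candidate.length != library.length) {
            return false;
        }
        for (int i = 0; i < candidate.length; i++) {
            if (!normalize(candidate[i]).equals(normalize(library[i]))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] inExecutable = {"push bp", "mov bp, sp", "call 1A2Ch"};
        String[] inLibrary = {"push bp", "mov bp, sp", "call 0000h"};
        System.out.println(sameRoutine(inExecutable, inLibrary)); // prints true
    }
}
```

A byte-by-byte compare of the same two routines would fail on the call instruction, because the encoded target address differs.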

Structured code

When compiling, all high level constructs like "if", "for", "while", "do...while" etc. are translated to branches before the code is generated by the compiler. So when you decompile the executable, the raw output will not contain any loops or if constructs, only simple "goto"s. This is not what we want, is it? DisC has a built-in code reorganizer (I prefer to call it C-Beautifier!) which recognizes "if" blocks and "for", "while" and "do...while" loops and reorganizes them more pleasantly. The final output is very similar to the original code.
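
The simplest case of such a reorganization is a conditional jump that goes backwards to a label, which is how a compiler typically encodes a "do...while" loop. The following sketch handles only that one pattern on textual lines. It is an illustration rather than DisC's reorganizer, and it assumes that the label sits on a line of its own and is not the target of any other jump.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LoopStructurer {
    /** Rewrites "L: ... if (c) goto L;" into "do { ... } while (c);". */
    static List<String> structure(List<String> lines) {
        List<String> out = new ArrayList<>(lines);
        for (int i = 0; i < out.size(); i++) {
            String line = out.get(i).trim();
            if (!line.startsWith("if (") || !line.contains(") goto ")) {
                continue;
            }
            int split = line.lastIndexOf(") goto ");
            String cond = line.substring(4, split);
            // Assumes the line ends with ';', which is dropped from the label.
            String label = line.substring(split + 7, line.length() - 1);
            int target = out.indexOf(label + ":");
            if (target >= 0 && target < i) {   // a backward jump closes a loop
                out.set(target, "do {");
                out.set(i, "} while (" + cond + ");");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("L1:", "x = x + 1;", "if (x < 10) goto L1;");
        for (String line : structure(raw)) {
            System.out.println(line);
        }
    }
}
```

Forward jumps are left alone here. Recognizing "if" blocks and "for" or "while" loops needs further patterns of the same kind.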

Things to know (if you are planning to try out DisC on your computer)

  • All numbers in the output file are given in Hexadecimal! No signed integers are shown (except for "switch...case" statements). Please be careful with this one.

  • Variables are given automatic names which are self-explanatory. For example, a local integer is named "l_int_1" while a global character is named "g_char_3". And YOU need to declare these variables yourself wherever they are used. DisC does not generate variable declarations, because at present it doesn't know if the output generated is correct or not.

  • All functions have an "_" prefix. This is because most of the C compilers generate names in this manner, and the library function names are also like this. Before you compile the output, make sure to remove the "_" prefixes.

  • Also you need to include the C include files for all functions called.

  • For all practical purposes, don't expect to compile the output of DisC and generate a program. The purpose of DisC is to help you understand how the given executable works, and you can try writing the software on your own if you understand how it works. So don't blame me if you are unable to compile the output successfully, or if the output doesn't help you. This was only an educational project, and if you find it helpful please go ahead and use it.
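
Two of the clean-up steps listed above are mechanical enough to sketch in code: reading a hexadecimal number as a signed value, and removing the "_" prefix from a function name. The helper names below are made up for this illustration. The signed conversion assumes a 16-bit int, which is the size TurboC uses.

```java
final class DiscOutputHelpers {
    /** Interprets a hexadecimal literal such as "0xFFFE" as a signed 16-bit value. */
    static int toSigned16(String hex) {
        String digits = hex.startsWith("0x") || hex.startsWith("0X")
                ? hex.substring(2)
                : hex;
        // The cast to short keeps the low 16 bits and applies the sign.
        return (short) Integer.parseInt(digits, 16);
    }

    /** Removes the single leading underscore added to function names. */
    static String stripPrefix(String name) {
        return name.startsWith("_") ? name.substring(1) : name;
    }

    public static void main(String[] args) {
        System.out.println(toSigned16("0xFFFE"));   // prints -2
        System.out.println(stripPrefix("_printf")); // prints printf
    }
}
```

Whether a given number was really meant to be signed cannot be decided from the hexadecimal text alone, so the conversion is something to apply case by case while reading the output.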
