This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
To develop an instruction set simulator (ISS) for MIPS assembly language along with implementation of basic 5-stage pipeline and cycle accurate (including stall cycles due to data and control hazards and any potential pipeline forwarding). To further optimize the MIPS instruction set simulator by implementing the Tomasulo's Algorithm and Zero Loop Overhead.
INSTRUCTION SET SIMULATOR:
An Instruction set simulator is a software development tool that simulates the execution of programs on a specific processor. Our basic instruction set simulator consists of two main components:
The input to the instruction set simulator is a program in the intermediate format (binary format).Hence an assembler is required which will convert the program in assembly language to a program in binary format.
The instruction set simulator then executes the program and the prints the value of the registers, value at the memory locations and calculates the CPI for that program.
The Assembler takes the assembly language program as the input and outputs the program in an intermediate format that is the binary format. The output of the assembler is then fed as an input to the instruction set simulator.
The assembler that have been implemented is a two pass assembler which takes two passes to convert the code to the binary format. In the first pass of the assembler, the whole program is read line by line and the comments are ignored. All the labels are identified and they are stored in an array and corresponding to each label the line number at which the instruction following the label is present is stored so that whenever the label name is encountered the control switches to that particular line number and starts execution at that point.
In the second pass, the instructions are read line by line and their opcodes and operands are decoded and a fixed binary value is assigned. The resulting binary code is a sequence of 32 bits. The opcode is decoded to a 6 bits binary code, shamt code and function bits (11 bits) are added to the opcode in case of instructions such as Load, Add, Sub, Mul, Div and Nop. The operands are then decoded. The register names are decoded to a 5 bit binary code and the immediate address and offset are decoded to a 16 bit code. These are then concatenated to the opcode or opcode and function and shamt bits depending on the type of instruction. The 32 bit binary code for each instruction is then finally written to a text file which is then used as an input to the simulator.
The assembler has been in implemented in C++ (Assembler.cpp). All the benchmarks and the programs to be tested have been written in the MIPS code. The program to be tested can be chosen from the menu on the console and then the text file corresponding to the MIPS code of that program serves as the input text file and the Assembler Output file corresponding to that particular input file is generated.
INSTRUCTION SET ARCHITECTURE:
An instruction set architecture (ISA) is that part of computer architecture which includes data types, instructions, specification of the set of op-codes (machine language), registers, addressing modes, memory architecture, interrupt and exception handling, and external
Input - output and is visible to the programmer or compiler writer. It forms a boundary between software and hardware. The instruction set architecture implemented here is a Register- Register/Load -Store Architecture. In this form of architecture the memory can be accessed only as a part of the Load and Store instruction.
There are basically three types of instructions that have been implemented:
1. Register ' Register Type instructions: These instructions include operations between two registers. Instructions in this category include Add, Sub, Mul and Div instructions.
2. I ' Type Instructions: This category includes instructions such as Load, Store, Immediate instructions and conditional branch instructions (Branch equal to zero, Branch not equal to zero and Branch greater than equal to).
3. J-Type Instructions: This category includes the unconditional jump instructions such as jmp instruction.
INSTRUCTIONS BEING IMPLEMENTED:
1. ADD RD, RS, RT: This instruction adds the content of register RS and RT and stores it in the destination register RD.
2. SUB RD, RS, RT: This instruction subtracts the content of register RS and RT and stores it in the destination register RD.
3. MUL RD, RS, RT: This instruction multiplies the contents of register RS and RT and stores it in the destination register RD.
4. DIV RD, RS, RT: This instruction divides the content of RS by RT and stores the result (quotient of division) in the destination register RD.
5. ADDI RD, RS, IMMEDIATE: This instruction adds the value present in register RS to the immediate value in the immediate field and stores the result in the destination register RD.
6. LD RD, OFFSET (RS) : This instruction is a load instruction in which first of all the memory is address is calculated by adding the value present in RS and the offset. The content at this memory address is then loaded in the destination register RD.
7. STR RD, OFFSET (RS) : This instruction is a store instruction in which the memory address is calculated first (Memory address= value in RS + offset). The content of register RD is then stored at the calculated memory location.
8. BGE RD, RS, OFFSET: Branch Greater than Equal instruction compares the value of registers RD and RS. If the value at RD is greater than or equal to RS the branch is taken and the control jumps to the address given by the offset address.
9. BEZ RD,OFFSET: Branch Equal to zero instruction compares the value of registers RD and R0. If the value at RD is equal to R0 then the branch is taken and the control jumps to the address given by the offset address.
10. BNEZ RD,OFFSET: Branch Not Equal to zero instruction compares the value of registers RD and R0. If the value at RD is not equal to R0 then the branch is taken and the control jumps to the address given by the offset address.
11. JMP LABEL: Jmp instruction is the only unconditional branch instruction. This instruction transfers the control to the address pointed by the label.
INSTRUCTION SET AND THE CORRESPONDING OPCODES:
INSTRUCTIONS CODED FORMAT OPCODE SHAMT CODE & FUNCTON
Ld Rd, offset (Rs) 010000 0000000000
Str Rd,offset(Rs) 010001
Add Rd,Rs,Rt 100010 00000000000
Sub Rd,Rs,Rt 100010 00000000001
Mul Rd,Rs,Rt 100010 00000000100
Div Rd,Rs,Rt 100010 00000000101
Addi Rd,Rs,immediate 010011
Bge Rd,Rs,offset 011000
Bez Rd,offset 010111
Bnez Rd,offset 010110
Jmp Label 110000
Nop 000000 00000000000
In the following architecture we are using a total of 32 general purpose registers (R0 to R31) and one program counter. The general purpose registers are used to hold the values, on which a specific operation can be performed. These general purpose registers is an array of integers which can be used to store any value. The program counter used is also an integer variable which always contains the address of the next instruction to be executed.
Registers and their corresponding 5 bit codes:
Register Names Binary code
After the program has been changed to an intermediate format and written to a text file, this output text file of the Assembler is given as an input to the Simulator. Simulator is the main part of our Instruction Set Simulator which executes the instructions one by one and gives the output. The Simulator designed is based on the MIPS and uses a Reduced Instruction Set Architecture.
The basic simulator has been pipelined so that multiple instructions can be overlapped in execution. It takes advantage of parallelism that exists among the actions needed to execute an instruction. In the pipelined processor, a new instruction is issued on each clock cycle and each instruction will take almost 5 cycles to complete the execution. It should be made sure that no two different instructions should have the same data path resource on the same clock cycle. In order to ensure that no structural exists during execution, separate instruction and data memories have been used. In order to handle reads and a write to the same register, the register write is performed in the first half of the clock cycle and read in the second half.
The simulator uses a simple 5 stage pipeline:
1. Instruction Fetch Stage (IF): Sends the value of PC to memory and fetch the corresponding instruction from the memory. Updates the PC to its next value.
2. Instruction Decode Stage (ID): Decodes the instructions that is the opcode, rd, rs, rt , immediate value from the 32 bit binary opcode.
3. Execute/Effective Address Stage(exec):
For memory references. ie. Load/Store, the ALU adds the base register and the offset to form the effective address. In case of R-R type instructions the ALU performs the corresponding operation specified by the opcode on the values read from the register files. In case of immediate type instructions, the same as above would be performed with one of the values read from the registers and the other will be an immediate operand.
4. Memory Access Stage (MEM): In case of loads, the effective address computed in the previous execute stage is accessed and read. In case of stores, the memory writes the data from the second register read from the register file using the effective address.
5. Write-Back Stage (WB): Writes the results of the read performed in previous (memory) stage into the register file in case of loads. Writes the computed value into destination register Rd in case of R-R type instructions.
The following functions were implemented to execute the basic simulator:
1. IF(): The instructions are fetched one in each clock cycle .
2. ID(): Decodes the instructions by extracting the opcode ,rd(if present) ,rs ,rt, immediate value (if present) from the 32 bit binary code.
3. exec(): The required operation is performed ie. It maybe an ALU operation(add/sub/mul/div) or a control statement (branch/jmp) and the corresponding ALU. Data forwarding is implemented using alu () function.
4. mem(): In case of load/store instructions memory access is performed in this stage.
5. WB(): The result of the operation is written back to the destination register(address in case of store) in the first half clock cycle.
HAZARDS: The pipelining of instructions may lead to some hazards which may arise due to the data dependencies present in the instructions or due to the lack of resources. Structural hazards are the class of hazards that arise from resource conflicts when the hardware cannot support all possible combination of instructions simultaneously. As a result of structural hazard the whole pipeline has to be stalled which may lead to a reduction in the CPI performance. Second class of the hazards is the control hazards which arise from the pipelining of branches and other instructions that change the PC. In order to remove the control hazards, branch prediction is done. In our implementation, we are assuming that the branch is always not taken and the next instruction in the queue is fetched. Branch is resolved in the execute phase. Once the branch is resolved, if the branch is taken then the pre fetched instructions are flushed from the pipeline and the control jumps to the point at which the branching instruction points. If the branch is not taken then the normal execution continues. Third class of hazards is the data hazards which may arise due to the data dependencies. This hazard can be avoided by using at set of pipelining registers at the end of each stage. These pipelining registers are used to forward the result at the end of a particular stage to the subsequent stages. This process of forwarding the result from one stage to the next subsequent stage is known as Data Forwarding.
CPI = (No of clock cycles) / (No of clock cycles ' 4 ' No of stalls)
CPI for each program is calculated and displayed on the console.
As a part of our optimization we are implementing Tomasulo's Algorithm. In a simple MIPS five stage pipeline technique, an instruction is fetched and issued unless there is data dependence between the fetched instruction and the instruction which is already in the pipeline, and such a dependency cannot be hidden by data forwarding or bypassing. This will result in stalling of all the subsequent instructions until the hazard is cleared. No new instruction is issued or executed until the hazard is cleared. Hence, the CPI performance becomes poor. Hence, In order to improve the CPI performance, various static and dynamic scheduling techniques can be employed. Tomasulo's Algorithm is a dynamic scheduling approach which is basically a hardware based algorithm developed in 1967 by Robert Tomasulo from IBM. This algorithm allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially (out-of-order execution). Dynamic scheduling with out of order completion must preserve exception behavior in the sense that exactly those exceptions that would arise if the program were executed in strict program order do arise. Dynamically scheduled processors preserve exception behavior by ensuring that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed.
Architecture of MIPS Using Tomasulo's Algorithm:
The main components are:
Instruction unit: This unit contains all the instructions to be executed in a program.
Instruction queue: Instructions are sent from the instruction unit into the instruction queue from which they are then issued in FIFO order.
Reservation Stations: The instructions after being fetched are then decoded and sent to the particular reservation station. In this structure we assume that there are 3 reservation stations corresponding to the floating point adder and 2 reservation stations corresponding to the floating point multiplier. The reservation stations include the operation and the actual operands, as well as information used for detecting and resolving hazards.
Load and Store Buffers: In this structure we assume that we have 5 load and 5 store buffers. Load and Store buffers holds the component of effective address until it is computed, hold the value of computed loads and destination address and value to stored for the stores waiting for the CDB and keeps a track outstanding loads and stores.
Common Data Bus (CDB): It acts as a common result bus holding the result of instructions that have completed execution. It allows all units waiting for an operand to be loaded simultaneously.
Floating Point units: In this architecture we have two floating point units. The floating point adder implements addition and subtraction and the Floating point multiplier do multiplication and division.
Floating Point Registers: A set of 32 floating point registers are used to store the values. These are the general purpose registers.
Tomasulo's Algorithm Implementation:
The basic structure of the reservation stations and floating point registers is as given:
Reservation stations: In our implementation we have assumed the following fields in the reservation station:
1. Busy: If this field is 1, it indicates that the corresponding reservation station is busy that is an instruction is already present in that reservation station.
2. Opcode: This field contains the 6 bit opcode for the instruction to be executed.
3. Timer: According to the type of instruction the number of cycles it takes during execution is stored in this field.
Instruction Number of Cycles of Execution
4. Vj, Vk: resolved source operand value.
5. Qj, Qk: pending operand in terms of the reservation station identity.
6. A: Used to hold the immediate value or offset address for memory calculations.
7. Result: The instructions in the reservation station are executed and the result of the executed instructions is stored in this field.
8. Id: A unique Id has been assigned to each reservation station. This field holds the unique Id for that particular reservation station.
Floating point register:
The floating point register structure contains two fields:
1. Value: This field contains the value that has been computed and written back to the register.
2. Qi: This field contains the unique Id of the reservation station.
There are three main stages in the Tomasulo's Algorithm:
Issue: The instruction from the head of the queue is fetched sequentially in order to maintain the correct data flow. If there is a matching reservation station that is empty, the instruction is issued to the reservation station with the operand values, if they are currently in registers. If none of the reservation stations for that type of instruction is empty, then there is a structural hazard and the instruction stalls until a station or buffer is freed .If the operands are not in the registers, the functional units computing the value of that operand is tacked. This step renames the registers and hence eliminates the WAR and WAW hazard.
In our implementation, function Issue () implements the functionality of the issue stage.
Execute: As soon as all the operands for a particular instruction become available, the instruction reaches the execute stage and starts its execution at the particular functional unit. This waiting of instructions until all the operands become available helps in avoiding the RAW hazard. For the instructions which take multiple cycles for execution of instructions, a timer starts which goes on decrementing the value of timer which is initially set to the number of cycles to be taken by that particular instruction. As soon as the value of timer becomes zero, it indicates the finishing point of the execution of instruction and the instruction goes to the next stage.
Load and Store instructions require two cycles of execution. In the first cycle the effective address is computed when the base register is available. Load in the load buffer execute as soon as the memory unit is available. Stores in the store buffer will wait for the value to be stores before being sent to the memory unit.
In our implementation, all the branch instructions are assumed to be not taken. And to preserve the exception behavior no instruction is allowed to initiate execution until all branches that precede the instruction in program order have completed.
The following functions in the program implement the execute phase:
1. Execute () ' This function checks whether the operands are available. As soon as the operands become available the Alu function corresponding to that instruction is called.
2. Aluadd () - Implements the execution of Add, Sub , Addi, Bez,Bnez,Beq,Jmp
3. Alumul () - Implements the execution of Mul and Div instruction.
4. Aluld () - Implements the execution of Load instruction.
5. Alustr () - Implements the execution of Store instruction.
Write Result: When the result is available, the result is written on the CDB and from there into the registers and into any reservation stations (including store buffer) waiting for the result.
In our program, the write result is implemented by the Writeback() function.
As soon as the result for a particular instruction has been computed, it is stored in the result field of the reservation station, which is then written back into the corresponding destination register and the subsequently the busy flag for that particular reservation station and the Qi field corresponding to the destination register is reset. Use of CDB for writing back the result eliminates the WAW hazards.
Zero loop overhead is a hardware technique which can minimise the overhead without the penalty of increasing the code size. Zero-overhead loops are part of most processors, but 'hardware loop buffers' can really add increased performance in looping constructs. They act as a type of cache for instructions being executed in the loop. For example, after the first time through a loop, the instructions can be kept in the loop buffer, eliminating the need to 're-fetch' the same instructions over and over again each time through the loop. This can produce a significant savings in cycles by keeping the loop instructions in a buffer where they can be accessed in a single cycle. This feature requires no additional setup by the programmer but it is important to know the size of this buffer so that loop sizes can be selected intelligently.
In the modified MIPS structure using Tomasulo's Algorithm, as a result of implementation of out of order execution, lesser number of stalls is generated and the CPI is improved. The formula that has been used for the calculation of CPI is given as:
CPI = Number of Clock Cycles/ (Number of Clock Cycles -2).
MODIFICATIONS TO THE PIPELINED SIMULATOR:
In order to compare the CPI performance of the MIPS structure with Tomasulo's Algorithm and the existing ISS (pipelined MIPS structure) some changes have been made to the existing ISS. In the Tomasulo's Algorithm we have assumed that an instruction can take more than one cycle for execution. Hence in order to compare the execution speed of the instruction, the existing ISS is modified accordingly to match the delays in execution for the instructions that take multiple cycles for execution.
Figure 1- CPI Comparison
Figure 2-CPI Comparison
Figure 3-Percentage Improvement in CPI
TABLE 1: Measured values of CPIs
Program MIPS Modified MIPS Tomasulo Percentage Improvement
ld_delay1 3.5 4.5 1.667 62.95556
ld_delay2 3.5 4.5 1.667 62.95556
ld_delay3 3.5 4.5 1.5 66.66667
ld_reg_1 1.8 2 1.4 30
ld_reg_2 1.8 2 1.4 30
ld_reg_3 1.8 2 1.4 30
alu_reg_1 1.8 1.8 1.6667 7.405556
alu_reg_2 1.8 1.8 1.4 22.22222
alu_reg_3 1.8 1.8 1.4 22.22222
exmem_to_ex_1 2.33 2.33 1.667 28.45494
exmem_to_ex_2 2.33 2.33 1.667 28.45494
exmem_to_ex_3 2.33 2.33 1.667 28.45494
mem-wb_to_ex_1 2.33 2.33 1.667 28.45494
mem-wb_to_ex_2 2.33 2.33 1.667 28.45494
mem-wb_to_ex_3 2.33 2.33 1.667 28.45494
mem-wb_to_ex_4 2.33 2.667 1.667 37.49531
mem-wb_to_ex_5 2.33 2.667 1.667 37.49531
mem-wb_to_ex_6 2.33 2.667 1.667 37.49531
st_mem-wb_mem 2.33 2.667 1.667 37.49531
st_addr_ex-mem_ex 2.33 2.667 1.667 37.49531
st_addr_mem-wb_ex 2.33 2.667 1.5 43.75703
ld_addr_ex-mem_ex 2.33 2.667 1.667 37.49531
ld_addr_mem-wb_ex 2.33 2.667 1.5 43.75703
ld_st_mem-wb_mem 2.33 3 1.667 44.43333
big_test 1.8 1.8 1.4 22.22222
branch_1 3 3 1.5 50
branch_2 2.5 2.5 1.4 44
STEPS TO RUN THE INSTRUCTION SET SIMULATOR:
PROGRAMMING LANGUAGE USED: C++
TOOL USED FOR PROGRAMMING: Microsoft Visual studio 2008
INSTRUCTIONS FOR EXECUTION:
' To run the assembler, use assembler executable file. It contains all benchmark input files and accordingly the output will be stored in assembler_output.txt.
' To run the simulator (ISS using pipelining) alone use simulator executable file. To test individual benchmarks change the assembler_output.txt to particular output file name and run the program.
' To run the Simulator (ISS implementing Tomasulo's Algorithm) , use tomasulo executable file.
To test individual benchmarks change the assembler_output.txt to particular output file name and run the program.
INSTRUCTIONS FOR CODING (MIPS) :
' Comments should begin with '# 'after a space from the instruction
' Terminate all codes with exit instruction
' Do not use spaces after commas separating the registers.
' Give a space after exit
' Terminate labels with colon ( : ).
' Failing to implement any of these instructions will result in either infinite looping or abnormal termination or may give inappropriate results.