Disclaimer: This is an example of a student written essay.
Click here for sample essays written by our professional writers.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UKEssays.com.

Pipelining And Superscalar Architecture Information Technology Essay

Paper Type: Free Essay Subject: Information Technology
Wordcount: 3221 words Published: 1st Jan 2015

Reference this

Here some of the summary or short term of pipelining and superscalar. Instruction execution is extremely complex and involves several operations which are executed successively. This implies a large amount of hardware, but only one part of this hardware works at a given moment.

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. This is solved without additional hardware but only by letting different parts of the hardware work for different instructions at the same time.This technique rresponsible for large increases in program execution speed.

A superscalar architecture is one in which several instructions can be initiated simultaneously and executed independently.

Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates concurrently with all other segments. A pipeline can be visualized as a collection of processing segment through which binary information flows. Each segment perform partial processing dictated by the way the task is partitioned. The result obtained by the computation in each segment is transferred to the next segment in the pipeline. The final result is obtained after the data have passed through all segments. The name “pipeline” implies a flow of information analogous to an industrial assembly line. It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time. The overlapping of computation is made possible by associating a register with each segment in the pipeline. The registers provide isolation between each segment so that each can operate on distinct data simultaneously.

In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

Computer-related pipelines include:

Instruction pipelines, such as the classic RISC pipeline, which are used in processors to allow overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided up into stages, including instruction decoding, arithmetic, and register fetching stages, wherein each stage processes one instruction at a time.

Graphics pipelines, found in most graphics cards, which consist of multiple arithmetic units, or complete CPUs, that implement the various stages of common rendering operations (perspective projection, window clipping, colour and light calculation, rendering, etc.).

Software pipelines, where commands can be written so that the output of one operation is automatically used as the input to the next, following operation. The Unix command pipe is a classic example of this concept; although other operating systems do support pipes as well.


An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).

The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer’s control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the links of a pipe.)

The origin of pipelining is thought to be either the ILLIAC II project or the IBM Stretch project. The IBM Stretch Project proposed the terms, “Fetch, Decode, and Execute” that became common usage.


In 3D computer graphics, the terms graphics pipeline or rendering pipeline most commonly refer to the current state of the art method of rasterization-based rendering as supported by commodity graphics hardware[1]. The graphics pipeline typically accepts some representation of a three-dimensional scene as an input and results in a 2D raster image as output. OpenGL and Direct3D are two notable graphics pipeline models accepted as widespread industry standards.

Get Help With Your Essay

If you need assistance with writing your essay, our professional essay writing service is here to help!

Essay Writing Service

The rendering pipeline is mapped onto current graphics acceleration hardware such that the input to the graphics card (GPU) is in the form of vertices. These vertices then undergo transformation and per-vertex lighting. At this point in modern GPU pipelines a custom vertex shader program can be used to manipulate the 3D vertices prior to rasterization. Once transformed and lit, the vertices undergo clipping and rasterization resulting in fragments. A second custom shader program can then be run on each fragment before the final pixel values are output to the frame buffer for display.

The graphics pipeline is well suited to the rendering process because it allows the GPU to function as a stream processor since all vertices and fragments can be thought of as independent. This allows all stages of the pipeline to be used simultaneously for different vertices or fragments as they work their way through the pipe. In addition to pipelining vertices and fragments, their independence allows graphics processors to use parallel processing units to process multiple vertices or fragments in a single stage of the pipeline at the same time.


In software engineering, a pipeline consists of a chain of processing elements (processes, threads, co routines, etc.), arranged so that the output of each element is the input of the next. Usually some amount of buffering is provided between consecutive elements. The information that flows in these pipelines is often a stream of records, bytes or bits.

The concept is also called the pipes and filters design pattern. It was named by analogy to a physical pipeline.


Pipelining does not help in all cases. There are several possible disadvantages. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.

Advantages of Pipelining:

The cycle time of the processor is reduced, thus increasing instruction issue-rate in most cases.

Some combinational circuits such as adders or multipliers can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry vs. a more complex combinational circuit.

Disadvantages of Pipelining:

A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.

The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is due to the fact that extra flip flops must be added to the data path of a pipelined processor.

A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.


A superscalar CPU can execute more than one instruction per clock cycle. Because processing speeds are measured in clock cycles per second (megahertz), a superscalar processor will be faster than a scalar processor rated at the same megahertz.

A superscalar architecture includes parallel execution units, which can execute instructions simultaneously. This parallel architecture was first implemented in RISC processors, which use short and simple instructions to perform calculations. Because of their superscalar capabilities, RISC processors have typically performed better than CISC processors running at the same megahertz. However, most CISC-based processors (such as the Intel Pentium) now include some RISC architecture as well, which enables them to execute instructions in parallel. Nearly all processors developed after 1998 are superscalar.

Superscalar describes a microprocessor design that makes it possible for more than one instruction at a time to be executed during a single clock cycle. In a superscalar design, the processor or the instruction compiler is able to determine whether an instruction can be carried out independently of other sequential instructions, or whether it has a dependency on another instruction and must be executed in sequence with it. The processor then uses multiple execution units to simultaneously carry out two or more independent instructions at a time. Superscalar design is sometimes called “second generation RISC.”

A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.

While a superscalar CPU is typically also pipelined, pipelining and superscalar architecture are considered different performance enhancement techniques.

The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):

Instructions are issued from a sequential instruction stream

CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time)

The CPU accepts multiple instructions per clock cycle


Available performance improvement from superscalar techniques is limited by two key areas:

The degree of intrinsic parallelism in the instruction stream, i.e. limited amount of instruction-level parallelism, and

The complexity and time cost of the dispatcher and associated dependency checking logic.

The branch instruction processing.

Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of the other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; b = e + f might not be run able in parallel, depending on the order in which the instructions complete while they move through the units.

When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU’s clock rate. This cost includes additional logic gates required to implement the checks, and time delays through those gates. Research shows the gate cost in some cases may be nk gates, and the delay cost k2logn, where n is the number of instructions in the processor’s instruction set, and k is the number of simultaneously dispatched instructions. In mathematics, this is called a combinatoric problem involving permutations.

Even though the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there is no assurance otherwise and failure to detect a dependency would produce incorrect results.

No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of functional units (e.g, ALUs), the burden of checking instruction dependencies grows so rapidly that the achievable superscalar dispatch limit is fairly small. — likely on the order of five to six simultaneously dispatched instructions.

However even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.


The fifth-generation Pentium and newer processors feature multiple internal instruction execution pipelines, which enable them to execute multiple instructions at the same time. The 486 and all preceding chips can perform only a single instruction at a time. Intel calls the capability to execute more than one instruction at a time superscalar technology. This technology provides additional performance compared with the 486.

Find Out How UKEssays.com Can Help You!

Our academic experts are ready and waiting to assist with any writing project you may have. From simple essay plans, through to full dissertations, you can guarantee we have a service perfectly matched to your needs.

View our services

Superscalar architecture usually is associated with high-output RISC (Reduced Instruction Set Computer) chips. An RISC chip has a less complicated instruction set with fewer and simpler instructions. Although each instruction accomplishes less, overall the clock speed can be higher, which can usually increase performance. The Pentium is one of the first CISC (Complex Instruction Set Computer) chips to be considered superscalar. A CISC chip uses a richer, fuller- featured instruction set, which has more complicated instructions. As an example, say you wanted to instruct a robot to screw in a light bulb. Using CISC instructions you would say

Pick up the bulb.

Insert it into the socket.

Rotate clockwise until tight.

Using RISC instructions you would say something more along the lines of

Lower hand.

Grasp bulb.

Raise hand.

Insert bulb into socket.

Rotate clockwise one turn.

Is bulb tight? If not repeat step 5.


Overall many more RISC instructions are required to do the job because each instruction is simpler (reduced) and does less. The advantage is that there are fewer overall commands the robot (or processor) has to deal with, and it can execute the individual commands more quickly, and thus in many cases execute the complete task (or program) more quickly as well. The debate goes on whether RISC or CISC is really better, but in reality there is no such thing as a pure RISC or CISC chip, it is all just a matter of definition, and the lines are somewhat arbitrary.

Intel and compatible processors have generally been regarded as CISC chips, although the fifth- and sixth-generation versions have many RISC attributes and internally break CISC instructions down into RISC versions.





divides an instruction into steps, and since each step is executed in a different part of the processor, multiple instructions can be in different “phases” each clock.

involves the processor being able to issue multiple instructions in a single clock with redundant facilities to execute an instruction within a single core


once one instruction was done decoding and went on towards the next execution subunit

multiple execution subunits able to do the same thing in parallel


Sequencing unrelated activities such that they use different components at the same time

Multiple sub-components capable of doing the same task simultaneously, but with the processor deciding how to do it.


The degree of intrinsic parallelism in the instruction stream, i.e. limited amount of instruction-level parallelism, and

The complexity and time cost of the dispatcher and associated dependency checking logic.

The branch instruction processing.


The number of instructions a microprocessor can handle in a single clock cycle is a crucial factor to the processor’s performance and it depends on the design of the processor itself. Among today’s processors, the number of instructions per clock cycle has been sped up through two processes, called pipelining and superscalar execution. A microprocessor uses pipelining or superscalar technology is said to have pipeline or superscalar design.

Pipelining allows the processor to read a new instruction from memory before it is finished processing the current one. It works by splitting the instruction fetch, decode and execution into independent stages; as an instruction goes through each stage, the next instruction follows it does not need to wait until it completely finishes. The extent to which pipelined data can flow into the processor is called the pipeline depth. In Intel 80286 processor family, the pipeline depth is only 1 which means in effect, there was no pipeline at all. However, with the 80486 family, the pipeline depth increased to 4. Pentiums has a pipeline depth of 5, and MMX and other technologies enable even more.

Pipelining saves time by ensuring that the microprocessor can start the execution of a new instruction before completing the current or previous ones. However, it can still complete just one instruction per clock cycle. Another disadvantage with pipelining concerns pipeline stalls. As what we learn in cs411 class, some stalls occured in a pipeline design can be voided while others cannot. There are different strategies to handle pipeline stalls, such as simply suffer the delays, use branch delay slot, or add additional hardware, etc.

To increase efficiency and thereby save processing time, many of today’s processors (Compaq/Digital’s Alpha, IMB/Motorola’s PowerPC, and Sun’s SPARC, etc.) use superscalar architecture. The main benefit and difference of superscalar technology versus pipelining is that it allows processors to execute more than one instruction per clock cycle with multiple pipelines. In a superscalar design, the processor looks for instructions that can be handled within the same clock cycle and processes these together. This concept, coupled with available VLSI (very long scale instruction) technology, allowed the execution of up to five operations per cycle: one branch instruction, one fixed-point instruction, one condition register instruction, and one floating-point multiply-add instruction which counted as two floating-point operations.

Parallel processing certainly offers speed benefits, but superscalar design has critics. Superscalar technology increase the level of complexity in hardware designing. Some people argue that it also wastes many opportunities for parallel execution, because combining individual instructions could take very long time and individual instructions are often delayed when waiting for resources.


Cite This Work

To export a reference to this article please select a referencing stye below:

Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.

Related Services

View all

DMCA / Removal Request

If you are the original writer of this essay and no longer wish to have your work published on UKEssays.com then please: