Vector architecture is a style of CPU design built around vector processors, whose instruction set includes operations that process multiple data elements simultaneously and perform mathematical operations on them. A vector processor provides high-level instructions that operate on whole vectors: each vector instruction works through a pipeline, applying the operation sequentially to every element of a vector register. Vector architectures are common in scientific computing and form the basis of most supercomputers.
The contrasting design is the scalar architecture, built around scalar processors. The main difference is that a scalar processor handles one element at a time, using multiple instructions where a vector processor needs one. As already mentioned, vector processors are commonly used in supercomputers, while scalar processors find their application in smaller-scale machines.
Some advantages of vector processors are:
- The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without any data hazards.
- Control hazards are non-existent because an entire loop is replaced by a vector instruction whose behaviour is predetermined.
- A single vector instruction specifies a tremendous amount of work as it is the same as executing an entire loop. Thus, the instruction bandwidth requirement is reduced.
- Vector instructions that access memory have a known access pattern. If the vector elements are all adjacent, fetching the vector from a set of heavily interleaved memory banks works very well: a single access is initiated for the entire vector rather than for a single word, so the high latency of a main-memory access (versus a cache access) is amortized. The cost of the latency to main memory is thus paid only once for the entire vector, rather than once for each word.
- Examples of vector operations include adding, subtracting, multiplying, or dividing two vectors to produce a third, loading a vector from memory, and storing a vector to memory.
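The element-wise operations listed above can be mimicked in a few lines of Python. This is a toy model only: the 4-element "register" width and the helper names vadd/vmul are illustrative, not part of any real instruction set.

```python
# Toy model of element-wise vector operations, using Python lists
# as stand-ins for vector registers (illustrative only; real vector
# hardware applies one instruction across all lanes).

def vadd(v1, v2):
    # Element-wise add: one vector instruction's worth of work.
    return [a + b for a, b in zip(v1, v2)]

def vmul(v1, v2):
    # Element-wise multiply.
    return [a * b for a, b in zip(v1, v2)]

memory = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]

# Vector load: fetch a contiguous block of elements into a "register".
V1 = memory[0:4]          # [1.0, 2.0, 3.0, 4.0]
V2 = memory[4:8]          # [10.0, 20.0, 30.0, 40.0]

V3 = vadd(V1, V2)         # [11.0, 22.0, 33.0, 44.0]
V4 = vmul(V1, V2)         # [10.0, 40.0, 90.0, 160.0]

# Vector store: write the result register back to memory.
memory[0:4] = V3
```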
Styles of Vector Architectures
- Memory-memory vector processors: In this type of vector processor, all vector operations go from memory to memory.
- Vector-register processors: In this type of vector processor, all vector operations are between vector registers; this is the vector equivalent of a load-store architecture. It covers virtually all vector machines built since the late 1980s, including those from Cray, Convex, Fujitsu, Hitachi, and NEC.
A Few Important Terms
- Initiation Rate: The rate at which operands are consumed and new results are produced. It is generally one per clock cycle for an individual instruction, and can exceed one when operations proceed in parallel.
- Convoy: The set of vector instructions that could potentially begin execution together in one clock period. A convoy must complete before a new convoy can begin.
- Chime: A timing measure for a vector sequence. A sequence of m convoys operating on vectors of length n executes in roughly m x n clock cycles. A chime ignores the start-up overhead of a vector operation.
- Vector start-up time: The overhead required to start execution. It is related to the pipeline depth and is due to the time needed to clear existing vector operations out of the unit.
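As a worked example of the chime approximation above, the following sketch applies the m x n rule; the figures of 3 convoys and 64-element vectors are illustrative, not taken from any particular machine.

```python
# Rough chime estimate for a vector sequence, per the definition above:
# m convoys over vectors of length n take about m * n clock cycles,
# ignoring start-up overhead. The numbers here are illustrative.

def chime_cycles(num_convoys, vector_length):
    # One chime = one pass of one convoy over the whole vector.
    return num_convoys * vector_length

# e.g. a sequence of 3 convoys on 64-element vectors:
cycles = chime_cycles(3, 64)   # 192 clock cycles, i.e. 3 chimes
```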
COMPONENTS OF VECTOR PROCESSOR
- Vector Registers: Fixed-length banks, each holding a single vector, with at least two read ports and one write port. Typically there are 8-32 vector registers, each holding 64-128 64-bit elements.
- Vector Functional Units: Fully pipelined units that can start a new operation every clock cycle. There are generally 4-8 of them, e.g. FP add, FP multiply, FP reciprocal, integer add, logical, and shift.
- Control Unit: Detects structural and data hazards.
- Vector Load-Store Unit: Loads and stores vectors to and from memory.
- Special-Purpose Registers: For example, the vector-length and vector-mask registers.
- Set of Scalar Registers: Provide data as input to the vector functional units and compute addresses to pass to the load-store unit. VMIPS, for example, has 32 general-purpose and 32 floating-point registers.
In early designs, the data for each operation was encoded directly into the instruction; in later designs, the instruction instead carries the address of a memory location holding the data. Decoding this address and fetching the data from memory takes time, and as CPU speeds have increased, this memory latency has historically become a large impediment to performance. To reduce it, most modern CPUs use a technique known as instruction pipelining, in which instructions pass through several sub-units in turn: the first sub-unit reads and decodes the address, the next fetches the values at those addresses, and the next does the arithmetic itself. With pipelining, the processor starts decoding the next instruction before the first has left the CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction still takes the same amount of time to complete, a time known as its latency, but the CPU can process an entire batch of operations much faster than if it handled them one at a time.
Vector processors take pipelining a step further: instead of pipelining just the instructions, they also pipeline the data itself. A vector processor is fed instructions that say, for example, not just "add A to B" but "add all of the numbers in this block of memory to all of the numbers in that block of memory". Instead of constantly decoding instructions and then fetching the data needed to complete them, it reads a single instruction from memory and knows that each subsequent address will be one larger than the last. This allows significant savings in decoding time.
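That contrast can be sketched as follows, with pure Python standing in for hardware: the explicit loop models the scalar view (one fetch-decode-execute per element), while the single comprehension models one vector instruction covering the whole block.

```python
# Sketch contrasting the scalar style (one element per instruction)
# with the vector style (one operation over a whole block of memory).
# Pure Python stands in for hardware; the names are illustrative.

A = [1, 2, 3, 4]
B = [5, 6, 7, 8]

# Scalar processor view: fetch, decode, and execute once per element.
C_scalar = []
for i in range(len(A)):
    C_scalar.append(A[i] + B[i])

# Vector processor view: one "instruction" covering all elements;
# the hardware knows each address is one past the previous one.
C_vector = [a + b for a, b in zip(A, B)]

assert C_scalar == C_vector   # same result, far less decode work
```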
Vector processors work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, which were themselves generally found in settings such as weather-prediction centres, physics labs, medical applications, artificial intelligence, and aerodynamics.
Real-World Issues (Problems)
Problem 1: Vector-Length Control
Problem: How do we support operations where the length is unknown or not the vector length?
Solution: We can provide a vector-length register, but this solves the problem only if the real length is less than the Maximum Vector Length (MVL). Beyond that, we use a technique known as strip mining: generating code in which each vector operation is done for a size no greater than the MVL.
The process generally creates two loops: one handles all iterations that are a multiple of the MVL, while the other handles the remaining iterations. This makes the code vectorizable.
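A strip-mining sketch along those lines, in Python: the choice of MVL = 64 and the helper name strip_mined_add are assumptions for illustration, and the inner loop stands in for one vector operation of at most MVL elements.

```python
# Strip-mining sketch: process an arbitrary-length array in chunks of
# at most MVL (Maximum Vector Length). MVL = 64 is illustrative.

MVL = 64

def strip_mined_add(x, y):
    n = len(x)
    result = [0.0] * n
    start = 0
    # Walk the arrays one strip at a time; min() guarantees every
    # strip fits in a vector register, with the final (or first, in
    # the classic formulation) strip handling the n mod MVL leftover.
    while start < n:
        length = min(MVL, n - start)        # never exceed MVL
        for i in range(start, start + length):
            result[i] = x[i] + y[i]         # one "vector operation"
        start += length
    return result

out = strip_mined_add(list(range(100)), list(range(100)))
```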
Problem 2: Vector Stride
Problem: The elements that should be adjacent in a vector register may not be sequential in memory, which makes the set-up time enormous. Matrix multiplication is a typical example: the elements of a column are separated in memory by an entire row.
Solution: The stride is the distance separating elements in memory that will be adjacent in a vector register. By storing the stride in a register, only a single vector load or store is required.
- Non-unit strides can cause major problems for the memory system, which is designed around unit stride, i.e. elements stored one after another across interleaved memory banks.
- To account for non-unit stride, most systems have a stride register that the memory system uses when loading the elements of a vector register. However, the memory interleaving may not support rapid loading at arbitrary strides; by 1995, vector computers had from 64 to 1024 banks of memory to mitigate these problems and allow fast vector memory loads and stores.
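A strided vector load can be sketched as follows. Python stands in for the memory system here; the vector_load helper and the 3x4 row-major matrix are illustrative assumptions. Note that loading a column uses stride equal to the number of columns, while loading a row uses unit stride.

```python
# Stride sketch: loading a matrix column from row-major storage.
# Elements of one column are not adjacent; they sit `stride` apart,
# where stride = number of columns. Names are illustrative.

rows, cols = 3, 4
# Row-major flattening of the 3x4 matrix
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
memory = list(range(rows * cols))

def vector_load(mem, base, length, stride):
    # One strided vector load: start at `base`, step by `stride`.
    return [mem[base + i * stride] for i in range(length)]

row1 = vector_load(memory, base=4, length=cols, stride=1)      # unit stride
col2 = vector_load(memory, base=2, length=rows, stride=cols)   # stride = 4
# row1 == [4, 5, 6, 7]; col2 == [2, 6, 10]
```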