Parallel Pipelining For Portable Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Fast Fourier transform is widely used in the field of digital signal processing such as filtering, spectral analysis. FFT plays a critical role in modern digital communications such as digital video broadcasting and orthogonal frequency division multiplexing systems. Algorithms including radix-4,split-radix, radix-22 have been developed based on the basic radix-2 FFT approach [1],[2],[3],[4]. In this Hardware architectures are not fully utilized and require high hardware complexity [7],[11].

Folding transformation and register minimization techniques are used to derive several FFT architectures. Much parallel architecture for FFT are developed for a specific N â€" point FFT [12]. A formal method of developing these architectures from the algorithms is not well established. Most of these hardware architectures are not fully utilized and require high hardware complexity. In the high speed digital communications high throughput and low power designs are required to meet the speed and power requirements while keeping the hardware overhead to a minimum. In [5] and [6] the architectures are designed for Fast Fourier Transform in the specific applications like Ultra Wide Band and Orthogonal Frequency Division Multiplexing. In [8] and [10] the pipelined architectures are derived based on Radix-24 architectures. But these architectures are not fully utilized. In this paper a new approach to design these architectures from the FFT flow graphs. Folding transformation and register minimization techniques are used to derive several known FFT architectures. Several novel architectures are also developed using the proposed methodology which have not be presented before. When the input samples are real the spectrum is symmetric and approximately half the operations are redundant. In applications such as speech, audio, image, radar and biomedical signal processing, a specialized hardware implementation is best suited to meet the real time constraints. This type of implementation also saves power in implantable or portable devices. The different algorithms have been proposed for the computation of RFFT. These approaches are based on removing the redundancies of the CFFT when the input is real. This can be efficiently used in a digital signal processor compared to a specialized hardware implementation. RFFT is calculated using the CFFT architecture in an efficient manner. One such algorithm is the doubling algorithm where an existing CFFT is used to calculate two RFFT simultaneously. Another one is the packing algorithm which forms a complex sequences of length N/2 taking the even and odd samples of a real input sequence of length N, and computes the N/2 point CFFT of the complex sequence. In these algorithms additional operations are necessary to obtain final results. This paper presents a novel approach to design of FFT architectures based on the folding transformation. In the folding transformation many butterflies in the same column can be mapped to one butterfly unit. If the FFT size is N a folding factor of N/2 leads to a 2 parallel architecture. In another design choose a folding factor of N/4 to design a 4-parallel architecture, where four samples are processed in the same clock cycle. Different folding sets lead to a family of FFT architectures. Folding sets are designed to reduce the latency and reduce number of storage elements. All new architectures achieve full hardware utilization. Various new real FFT architectures are also presented based on higher radices It may be noted that, that the approach cannot be generalized for radices higher than 2. This drawback is removed by the proposed methodology. In this paper several novel parallel pipelined architectures for the computation of RFFT based on radix-22, radix-23 and radix-2n algorithms. It combines the advantages of radix-2n algorithms, which require fewer complex multipliers when compared to radix-2 algorithm with the reduction of operations using redundancy.


The folding transformation and register minimization methods are used to derive several fast fourier transform. The parallel pipelined architectures are derived from the Fast Fourier Transform flow graphs. Fast Fourier Transform consists of complex and real valued signals. The parallel pipelined architectures are derived for computation of both complex and real valued signals. The folding transformation and register minimization techniques are used to derive several FFT architectures

Fig.1. Block Diagram of Parallel Pipelined Architectures for Fast Fourier Transform.

The process is described using an 8-point radix-2 DIF FFT as an example. It can be extended to other radices in a similar fashion. Fig.2 shows the flow graph of a radix-2 8-point DIF FFT. The graph is divided into three stages and each of them consists of a set of butterflies and multipliers The twiddle factor in between the stages indicates a multiplication by WNk where WN denotes the Nth root of unity, with its exponent evaluated modulo N.

Fig.2. Flow Graph of a Radix-2 8-point FFT.

Fig.3. Data Flow Graph of a Radix-2 8-point FFT.

This algorithm can be represented as a data flow graph (DFG) as shown in Fig.3. The nodes in the DFG represent tasks or computations. In this case, all the nodes represent the butterfly computations of the radix-2 FFT algorithm. In particular, assume nodes A and B have the multiplier operation on the bottom edge of the butterfly. The folding transformation is used on the DFG in to derive a pipelined architecture.

Fig.4. Pipelined DFG of a 8-point DIF FFT as a preprocessing step for folding.

The Folding Transformation is a technique to reduce the area of hardware architecture. In the folding transformation many butterflies in the same column can be mapped to one butterfly unit. To transform the DFG, require a folding set, which is an ordered set of operations executed by the same functional unit. Each folding set contains K entries i.e., the number of operations folded into a single functional unit.

The operation in the jth position within the folding set (where j goes from 0 to K-1) is executed by the functional unit during the time partition. The term is the folding order, i.e., the time instance to which the node is scheduled to be executed in hardware. For example, consider the folding set A = {O,O,O,O,A0,A1,A2,A3} for K=8. The Operation A0 belongs to the folding set A with the folding order 4. The functional unit executes the operations A0,A1,A2,A3 at the respective time instances and will be idle during the null operations. This situation to use the systematic folding techniques to derive the 8-point FFT architecture.

Fig..5. Linear Lifetime Chart

From the lifetime chart, the folded architecture requires 4 registers as opposed to 16 registers in a straightforward implementation.

The register minimization is a technique is to reduce the number of registers used in hardware architecture. The register minimization technique is based on life time of the variables. Maximum Number of live variables represents the minimum number of registers.

Table.1. Register allocation tablefor Radix-2 8-point FFT

From the allocation table in Fig.1 and the folding equations, the final architecture in can be synthesized.

The control signals for all the Multiplexer scan be easily generated using a 2-bit counter. The control signals for the 1st stage and 2nd stage multiplexers are obtained by dividing clock signal by 4 and by 2, respectively.


The results can be obtained by using the simulation tools Modelsim 6.3F and Xilinx ISE 10.1i. The Modelsim is a hardware simulation software in which design the hardware in VHDL/Verilog HDL. The Xilinx ISE software is used to synthesize the hardware. The complex FFT for different word lengths were implemented on Xilinx Virtex-5 FPGA (XC5VLX20T) using 16-bit fixed-point implementation. For RFFT, the input sequence is assumed to be real. It is easy to show that, if is real, then the output is symmetric.

Fig.5. simulation result of generating the dataflow of signals.


This paper has presented a approach to derive the FFT architectures for a given algorithm. The proposed approach can be used to derive parallel architectures of any arbitrary parallelism level. Using this approach parallel architectures have been proposed for the computation of complex FFT based on radix-2n algorithms. The power consumption can be reduced by 37% and 50% in proposed architectures. For RFFT two different scheduling approaches have been proposed with one having less complexity in control logic while the other has fewer delay elements. Due to the capability of processing two input samples in parallel, the frequency of operation can be reduced by 2 which in turn reduces the power consumption upto 50%. These are very suitable for applications in implantable or portable devices due to their low area and power consumption.