Design Of A Digital Fir Filter Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract - Digital signal processing algorithms often use inner product as basic computation. This paper presents memory-efficient distributed arithmetic (DA) architecture for FIR filters. An efficient Very Large Scale Integration (VLSI) implementation of digital convolution for real-time DSP applications is an important task. Systolic architectures are extensively popular due to the simplicity of their design and using high level of pipelining in small chip area. Distributed Arithmetic (DA) is the most efficient technique for executing such computations as they eliminate the need for multiply operations using lookup tables.

In this project one dimensional systolic structure is modelled using VHDL for computation of linear convolution using Dynamic Distributed arithmetic. The proposed structure involve significantly less memory and less number of adders compared with the existing DA based structures.

Keywords - Distributed arithmetic, Field Programmable Gate Arrays (FPGA), Finite Impulse Response (FIR) filter, linear convolution, systolic array.


Finite Impulse Response (FIR) filters are one of the most common components of Digital Signal Processing (DSP) systems. FIR filtering is achieved by convolving the input data samples with the desired unit impulse response of the filter. Along with the advancement in Very Large Scale Integration (VLSI) technology as the DSP has become increasingly popular over the years, the high speed realization of FIR filters with less power consumption has become much more demanding. Since the complexity of implementation grows with the filter order and the precision of computation, real-time realization of these filters with desired level of accuracy is a challenging task. Several attempts have, therefore, been made to develop dedicated and reconfigurable architectures for realization of FIR filters in Application Specific Integrated Circuits (ASIC) and Field-Programmable Gate Arrays (FPGA) platforms.

Distributed Arithmetic provides an approach for multiplier-less implementation of FIR filters where the filter coefficients are programmable. In other words, the same filter structure can be used for a different set of coefficients. Distributed Arithmetic (DA) is a bit-serial computational operation that forms an inner (dot) product of a pair of vectors in a single direct step. This is achieved by storing all possible intermediate computations in a lookup table memory and using it to select an appropriate value for a specified bit vector. The main operations required for DA-based computation of inner product are a sequence of lookup table (LUT) accesses followed by shift-accumulation operations of the LUT output. DA-based computation is well suited for FPGA realization, because the LUT as well as the shift-add operations, can be efficiently mapped to the LUT-based FPGA logic structures.

Systolic designs represent an attractive architectural paradigm for efficient hardware implementation of computation-intensive DSP applications. They have several attractive features such as simplicity, regularity and modularity of structure. In addition, they also possess significant potential to yield high-throughput rate by exploiting high-level of concurrency using pipelining or parallel processing or both. The multipliers in these structures require a large portion of the chip-area, and consequently enforce limitation on the maximum possible number of processing elements (PEs) that can be accommodated and the highest order of the filter that can be realized. So the multiplierless DA-based technique has been used for its high-throughput processing capability and increased regularity which results in cost-effective and area-time efficient computing structures.

distributed arithmetic

Distributed Arithmetic (DA) is an efficient method for computing inner products when one of the input vectors is fixed. It uses look-up tables and accumulators instead of multipliers for computing inner products and has been widely used in many DSP applications such as convolution, DFT, DCT and digital filters. DA was originally proposed for FIR and Infinite Impulse Response (IIR) filter implementations. Distributed Arithmetic is used to design bit-level architectures for vector-vector multiplications. It is based on a bit level rearrangement of the multiply accumulate operation to replace it with set of addition and shifting operations. In DA, each word in the vectors is represented as a binary number; the multiplications are recorded and mixed such that the arithmetic becomes "distributed" through the structure. Area savings achieved by using DA technique can be up to 80% and seldom less than 50% in DSP hardware designs [5]. The first stage of this work is the design of DA as explained further.

Let us consider the inner-product of two N-point vectors A and B given by Eq. (1) as,


where A is a constant vector, while B may change from time to time.

Assuming L to be the word length, each component of B may be expressed in two's complement representation by Eq. (2),


where bkl denotes the lth bit of Bk.

Now, combining the equations (1) and (2), the inner-product can be expressed in distributed form as shown below in Eq. (3):


For simplicity, assuming the signal samples to be unsigned words of size L, the inner product given in Eq. (3) then can be expressed in a simpler form by Eq. (4a) as,




Since Ak in Eq. (4b) is assumed to be constant and each bkl may take on values of 0 and 1 only, expression in equation (4b) may have only 2k possible values. The 2k possible values of Cl can be pre-computed and can be stored in ROM, such that while computing the inner product the partial sums Cl can be read out from the ROM using the input bit sequence as address bits. The inner product can, therefore, be calculated according to equation (4a), by L cycles of shift-accumulation followed by ROM-read operations corresponding L number of bit sequences {bkl} for 0 lL-1 . This leads to a multiplier less realization of vector multiplication.

According to the decomposition scheme proposed by Meher [1], when N is a composite number given by N = PM, (where P and M may be any two positive integers) one can map the index k into (m + pM) for m = 0, 1, … , M-1 and p = 0, 1, … , P-1. Hence equation (4a) can be expressed in the following form as Eq. (5a):




for l = 0, 1, …. , L-1 and p = 0, 1, …. , P-1.

The bit vector (bn)l,p in Eq. (5b) is used as address word for the lookup table and F is the memory-read operation.

derivation of the structures

We derive the DA-based one-dimensional (1-D) and systolic array for FIR filters from Dependence Graph (DG) representation of DA-based computation.

1-D Systolic Array for FIR Filters

Systolic system consists of a set interconnected cells, each capable of performing some simple operation. The word systolic is used because data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to the memory, much as blood circulates to and from the heart [5]. Systolic approach can speed up a compute-bound computation in a relatively simple and inexpensive manner.

The DG for computation of FIR filter output is shown in Fig. 1 [1]. It consists of L rows, where each row consists of P number of node-A and one boundary node-B. The function of node-A and node-B are depicted in Fig. 1(b) and (c), respectively. A bit vector consisting of a sequence of M-bits is fed to the node-A on (l+1)th row and (p+1)th column. The node uses the sequence of M input bits of the input bit vector as address for an LUT and reads the content stored at the location specified by the address. The value read from the LUT is then added with the input available from its left, and the sum is passed to the node on its right. Node-B performs a shift-add operation such that it makes a left-shift of the bits of the input available from the top, then adds the input available from the left to the left-shifted value, and passes the result down to its adjacent node.

Fig. 1. DG for DA-based implementation of FIR filter: (a) DG; (b) function of

node-A; and (c) function of node-B.

Fig. 2. The 1-D array for DA-based implementation of FIR filter: (a) Linear systolic array; (b) function of PE; and (c) function of output cell. delta stands for a unit delay.

The input sequence {x(n)} is fed to a serial-in parallel-out input register where content of the register is serially right shifted by one position and transferred in parallel to the bit-serial word-parallel converter in every L cycles. The bits of vector are derived from the bit-serial word-parallel converter and fed to the (p+1)th PE (for p = 0,1,....,p-1) in most significant bits (MSBs) to least significant bits (LSBs) order in each cycle period (time step) such that (L-1)th bits of input values are fed to the PE at first, and the zeroth bits are fed at the end. Besides, input to each PE is staggered by one cycle period with respect to the preceding PE to meet the causality requirement. A linear array consisting of P number of PEs and an output cell is shown in Fig. 2 and the function of the PEs is described in Fig. 2(b). Each PE consists of a ROM of 2M words. During a cycle period, each PE reads the content on its ROM at the location specified by the input bit vector. The value read from the ROM is then added to the input available to the PE from its left. During every cycle period, the sum is then transferred as output to its right. The function of the output cell is shown in Fig. 2(c). Each output cell contains a shift-register and an adder. During a cycle period, it shifts the content of its register left by one position and then adds the available input to the recently shifted content in its register. After L cycles, it delivers a desired filter output. The structure will yield its first filter output (L+P) cycles after the first input is fed to the first PE, while the successive output becomes available in every L cycles. For high-throughput applications, one may, however, have a structure with N number of 1-D arrays which would yield N convolved output in every L cycles.


This section is concerned with the description of the proposed system and implementation of the FIR filter based on systolic decomposition of DA-based computation.


Comparison of performance of the proposed DA-based decomposition scheme implementation and the existing DAFIR system generator block:


































The address length M is taken to be four for the proposed implementation.

A number of observations can be made from the data presented in Table 1. It is clear that the proposed implementation significantly outperforms the existing implementations in terms of three important key metrics, namely the gate count of the design, frequency and power consumption for all the values of N. In the proposed DA-based decomposition scheme only a single shift-accumulate operation is needed, irrespective of the filter order N as a result of systolization.

The synthesis tool used is Xilinx ISE 9.2i. The simulation tool used is ModelSim XE 6.2c. The target device selected is Spartan 3 (XC3S400). The above shown Table 1 is the result of filters of different orders compared with the existing design.

Gate Count

Filter Order, N

Figure 1. Plot of variation of Gate Count with filter order of the proposed DA-based Decomposition Scheme and existing DAFIR block.

Filter Order, N

Frequency (MHz)

Figure 2. Plot of variation of Frequency with filter order of the proposed DA-based Decomposition Scheme and existing DAFIR block

Figure 3. Hardware co-simulation of 1-D DA-based Decomposition method for L = 8.


The project presents hardware-efficient designs for computation of finite digital convolution by address decomposition of DA-based inner-product computation. The advantages of DA kind of implementation are its high usable frequency, minimum gate count and minimum power consumption. The main advantage is it overcomes the usage of multipliers. This method uses adders, LUTs and shift registers.

The systolic decomposition scheme is found to offer a flexible choice of the address length of the lookup tables (LUT) for DA-based computation. The 1-D systolic array provides reduction in ROM size and the number of adders by several orders of magnitude.