In this paper, various approaches to implementing a hardware-efficient fast convolution are discussed. Long convolutions implemented directly on an FPGA are neither area nor power efficient, and may not even fit on a single FPGA chip. To increase the speed of long convolutions and to match the computational capacity of a single FPGA chip, the long coefficient sequence can be partitioned into short sub-sequences. Each short convolution can then be made area efficient, at the expense of some speed, by implementing it as a convolution ASIP. An asynchronous ASIP is still faster than a synchronous one. The speed of an asynchronous convolution processor can be increased further by applying Algorithmic Strength Reduction (ASR), in which the number of multiplications (each more time consuming than an addition) is reduced at the expense of an increase in the number of additions required by the convolution. Several ASR-based algorithms that lead to a faster convolution ASIP are discussed.
Convolution is at the very core of DSP. In real-time applications such as radar echo signal simulators, CDMA systems and mobile phones, convolution is one of the major processes involved, and the input-to-output delay (latency) of the convolution operation is the major constraint. Battery-powered portable devices such as cell phones, personal digital assistants and wireless sensor nodes run cryptographic algorithms that require costly arithmetic operations, so energy efficiency also becomes a critical issue. Thus fast and exact convolution is one of the most important digital signal processing problems.
In the above applications long convolutions are used extensively, so speeding them up is very important. Though various algorithms are available for this purpose, they are complex. The long convolution can instead be partitioned into a number of short sub-convolutions, each of which can be computed in parallel. But in systems such as CDMA and mobile handsets, where area is a constraint, this cannot be applied directly. A system for convolution can therefore be designed that is used recursively or in parallel. Though such a system becomes area efficient, it must also remain fast. This can be achieved by designing an application-specific instruction processor (ASIP) for convolution that employs an algorithm which greatly improves the speed of short convolutions. There are two kinds of convolution: linear (noncyclic/aperiodic) and cyclic (periodic). Most fast convolution algorithms compute a cyclic convolution, while most applications call for linear convolution. On the other hand, cyclic convolutions can be computed by finding the linear convolution and reducing the result modulo p^N - 1. Hence efficient ways of computing linear convolutions also lead to efficient ways of computing cyclic convolutions. The linear convolution of sequences of lengths M and N can be computed directly using MN multiplications and (M-1)(N-1) additions, so much effort has gone into developing alternative, more efficient implementations. There are several approaches for speeding up the calculation of convolution.
The conventional approach is based on the FFT. The speed advantage offered by the FFT can be very large for long convolutions. However, the FFT approach has drawbacks, relating mainly to the use of sines and cosines and to the need for complex arithmetic even when the convolutions are real. Most fast convolution algorithms, such as those based on the FFT, apply only to periodic functions and therefore compute only cyclic convolutions.
In order to overcome the limitations of the FFT method, many other fast convolution algorithms have been proposed. Toom-Cook and Karatsuba, which use algorithmic strength reduction, are algorithms that speed up the computation of short convolutions.
Asymptotically, the algorithms can be ordered from least efficient to most efficient as follows:
From the above table it can be seen that convolution via the FFT has the lowest asymptotic complexity for large values of N. For short convolutions, algorithms such as Toom-Cook and Karatsuba are most suitable. The paper is organized as follows: the second section describes the convolution process, the third discusses how an area-efficient convolution processor can be designed, the fourth deals with the various algorithms employed to speed up convolution, and the fifth deals with the implementation of the convolution ASIP.
The basic concept of convolution is to flip, multiply and add. Consider arbitrary sequences x[n] and h[n] of length N. The discrete convolution of x and h is
y[n] = x[n] * h[n] = Σk h(n-k)·x(k),   n = 0, 1, 2, …, 2N-2

For N = 5 the output samples are:
y0 = h0x0
y1 = h1x0 + h0x1
y2 = h2x0 + h1x1 + h0x2
y3 = h3x0 + h2x1 + h1x2 + h0x3
y4 = h4x0 + h3x1 + h2x2 + h1x3 + h0x4
y5 = h4x1 + h3x2 + h2x3 + h1x4
y6 = h4x2 + h3x3 + h2x4
y7 = h4x3 + h3x4
y8 = h4x4
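As a sketch, the flip-multiply-add definition can be coded directly (the function name and example values are ours, not from the paper):

```python
def direct_convolution(h, x):
    """Direct ("flip, multiply and add") linear convolution of two sequences."""
    N, M = len(h), len(x)
    y = [0] * (N + M - 1)
    for n in range(len(y)):
        for k in range(M):
            if 0 <= n - k < N:       # only overlapping samples contribute
                y[n] += h[n - k] * x[k]
    return y

# For N = 5 this reproduces the expansion y0 = h0*x0, ..., y8 = h4*x4 above.
h = [1, 2, 3, 4, 5]
x = [5, 4, 3, 2, 1]
y = direct_convolution(h, x)
# y == [5, 14, 26, 40, 55, 40, 26, 14, 5]
```

This O(N²) loop is the "schoolbook" baseline that the later algorithms improve on.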
ii. POLYNOMIAL MULTIPLICATION
Multiplication of two polynomials x(p) and h(p), each of degree N-1 (N coefficients), yields a polynomial y(p) of degree 2N-2. For N = 5:
y(p) = (h0 + h1p + h2p^2 + h3p^3 + h4p^4) * (x0 + x1p + x2p^2 + x3p^3 + x4p^4)

= h0x0 + h1x0p + h2x0p^2 + h3x0p^3 + h4x0p^4
+ h0x1p + h1x1p^2 + h2x1p^3 + h3x1p^4 + h4x1p^5
+ h0x2p^2 + h1x2p^3 + h2x2p^4 + h3x2p^5 + h4x2p^6
+ h0x3p^3 + h1x3p^4 + h2x3p^5 + h3x3p^6 + h4x3p^7
+ h0x4p^4 + h1x4p^5 + h2x4p^6 + h3x4p^7 + h4x4p^8

= h0x0 + (h1x0 + h0x1)p
+ (h2x0 + h1x1 + h0x2)p^2
+ (h3x0 + h2x1 + h1x2 + h0x3)p^3
+ (h4x0 + h3x1 + h2x2 + h1x3 + h0x4)p^4
+ (h4x1 + h3x2 + h2x3 + h1x4)p^5
+ (h4x2 + h3x3 + h2x4)p^6
+ (h4x3 + h3x4)p^7 + h4x4p^8
Binary and integer multiplication are adaptations of polynomial multiplication, with the complication that carries must be propagated. To propagate the carries and obtain the final output, recomposition is performed: the product coefficients are placed in order, the upper bits/digits of each coefficient are overlapped with the next position (only the least significant k bits/digits stay in place, where k is the number of bits/digits per coefficient), and the overlapped values are added. Integer multiplication requires an additional step called splitting: to perform long integer multiplication using a convolution (polynomial multiplication) algorithm, each long integer is first split into sub-integers that become the coefficients of a polynomial.
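A minimal sketch of the splitting and recomposition steps (the function name and the choice of 2^k-base digits are ours):

```python
def int_multiply_via_convolution(a, b, k=4):
    """Multiply non-negative integers by splitting them into base-2**k digits,
    convolving the digit sequences, then propagating carries (recomposition)."""
    base = 1 << k

    def split(n):                        # least-significant digit first
        digits = []
        while n:
            digits.append(n % base)
            n //= base
        return digits or [0]

    da, db = split(a), split(b)
    # polynomial multiplication = linear convolution of the digit sequences
    conv = [0] * (len(da) + len(db) - 1)
    for i, ai in enumerate(da):
        for j, bj in enumerate(db):
            conv[i + j] += ai * bj
    # recomposition: keep k bits per position, push the rest up as a carry
    result, carry = 0, 0
    for pos, c in enumerate(conv):
        c += carry
        result += (c % base) << (k * pos)
        carry = c // base
    result += carry << (k * len(conv))
    return result
```

Any faster convolution routine can be dropped in for the inner double loop, which is the point the section makes.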
From the above it can be seen that an efficient convolution ASIP will lead to an efficient polynomial multiplier/binary/integer multiplier/correlator.
III. SHORT LENGTH CONVOLUTION ASIP
i. BASIC MODEL FOR CONVOLUTION IN FPGA
Implementation of very long convolutions by using the basic model has the following disadvantages:
An N×N convolution in the above model requires N multipliers and N-1 adders. A multiplier occupies a large area compared to an adder, so the area required to implement the convolution on an FPGA is N times the area of a single multiplier plus the area occupied by the adders. For a long convolution this area is very large, and it cannot be accommodated on a single FPGA chip.
The model is fast because the multiplications are performed in parallel; most of the time delay is due to the sequential addition operations.
ii. COEFFICIENT-PARTITIONED MODEL
[Figure: a bank of partitioned short-convolution ASIPs, each assigned one coefficient sub-sequence h_i(0) … h_i(L-1).]
In this approach, the long coefficient sequence is partitioned into short sub-sequences, each of which fits on a single FPGA chip. The overall long convolution is then implemented by distributing the partitioned sub-convolutions across many FPGA chips. Since the convolution is decomposed into parallel processes, higher speed is achieved than with the basic model.
Though this model improves the speed, it resorts to a multi-chip environment and the total area required for the convolution cannot be reduced. It is therefore not area efficient.
iii. PROPOSED METHOD
In order to fulfill the requirement of area efficiency, each partitioned sub-convolution can be implemented as a convolution ASIP with a single multiplier, a single adder and a single subtractor. Because of this area efficiency, the partitioned convolutions that previously required a multi-chip environment can be implemented on a single chip.
iv. AREA - SPEED TRADEOFF
There is always a tradeoff between area and speed. Though this method reduces area to the maximum extent, speed decreases correspondingly, because all N^2 multiplications and (N-1)^2 additions of the convolution are executed sequentially. The area and speed of a long convolution under the various methods are as follows:
A single-ASIP system dramatically reduces the area, but speed is also reduced to a great extent. Using partitioned convolution ASIPs improves the speed; the area occupied increases, but not to the extent of the coefficient-partitioned model. Since area and power are major constraints in battery-powered systems, the proposed method can be used to make such systems highly efficient.
v. APPLICATION SPECIFIC INSTRUCTION PROCESSOR (ASIP)
There are two kinds of ASIP: synchronous and asynchronous. In a synchronous ASIP there is a common clock whose period is set by the most time-consuming operation. Since multiplication takes longer than addition, every addition in the convolution process incurs idle time. In an asynchronous ASIP there is no global clock signal; instead it operates under distributed control, with concurrent hardware components communicating and synchronizing with each other, so the idle time on additions is avoided and higher speed is achieved. Speed can be increased further by reducing the number of multiplications, the time-consuming operation. This can be done by algorithmic strength reduction: a process in which the number of stronger operations such as multiplications is reduced at the expense of an increased number of weaker operations such as additions.
IV. SEARCH FOR FAST CONVOLUTION ALGORITHMS
There are various algorithms available for performing fast convolution, which are as follows:
i. FAST FOURIER TRANSFORM
Convolution in the time domain is equivalent to multiplication in the frequency domain. Consider N-sample input sequences x(n) and h(n). Their convolution can be computed using the FFT in the following steps.
Step-1: Zero padding
To avoid aliasing, each N-sample input sequence must be extended to a 2N-1 sample sequence by zero padding, giving xz(n) and hz(n).
Step-2: FFT of the input sequences
Compute the FFT of each zero-padded sequence using the butterfly structure:
X(k) = FFT[xz(n)]
H(k) = FFT[hz(n)]
Step-3: Complex multiplications
Compute the point-wise products of X(k) and H(k) to get Y(k).
Step-4: Inverse FFT
Compute the IFFT of Y(k) to get the convolved output sequence y(n).
The above steps can be represented as follows:
Though the FFT is a very efficient method for convolving long input sequences, it is overly complex for short convolutions.
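The four steps can be sketched as follows (a plain recursive radix-2 FFT of our own; a real implementation would use an optimized FFT library):

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)   # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_convolve(h, x):
    """Linear convolution: zero-pad, FFT, pointwise product, inverse FFT."""
    out_len = len(h) + len(x) - 1
    size = 1
    while size < out_len:                              # next power of two
        size *= 2
    H = fft([complex(v) for v in h] + [0j] * (size - len(h)))
    X = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    Y = [H[k] * X[k] for k in range(size)]             # Step-3
    y = fft(Y, invert=True)                            # Step-4 (scale below)
    return [round((v / size).real) for v in y[:out_len]]
```

The rounding at the end recovers exact integer outputs from the floating-point result for integer inputs of moderate size.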
ii. ALGORITHMIC STRENGTH REDUCTION:
The computational complexity of the schoolbook method is O(N^2), i.e. it requires N^2 multiplications. As multiplication is a time-consuming operation, it is important to reduce the number of multiplications involved in a convolution. This can be done by algorithmic strength reduction: a process in which the number of stronger operations such as multiplications is reduced at the expense of an increased number of weaker operations such as additions.
iii. TOOM COOK METHOD:
Toom-Cook is a linear convolution algorithm for polynomial multiplication, based on Lagrange's interpolation theorem. The steps involved are as follows:
a. Representation as polynomials:
In this step the inputs of an N×N convolution are represented as the coefficients of two polynomials of degree N-1:
H(p) = Σ hi·p^i,   i = 0, 1, …, N-1
X(p) = Σ xi·p^i,   i = 0, 1, …, N-1
b. Evaluation:
Evaluate each polynomial at 2N-1 points r0, r1, r2, …, r(2N-2) (0, ∞ and 2N-3 small points such as 1, -1, 2, -2, …). For example, for Toom-3 the evaluation points are 0, -1, 1, 2, ∞:
X(0)  = x0 + x1(0) + x2(0)^2   = x0
X(-1) = x0 + x1(-1) + x2(-1)^2 = x0 - x1 + x2
X(1)  = x0 + x1(1) + x2(1)^2   = x0 + x1 + x2
X(2)  = x0 + x1(2) + x2(2)^2   = x0 + 2x1 + 4x2
X(∞)  = x2
X(0)        1    0    0
X(-1)       1   -1    1        x0
X(1)    =   1    1    1    *   x1
X(2)        1    2    4        x2
X(∞)        0    0    1
c. Pointwise multiplication:
In this step the evaluated values are multiplied point by point: Y(ri) = H(ri)·X(ri) for each evaluation point. For Toom-3 this gives Y(0), Y(-1), Y(1), Y(2) and Y(∞).
d. Interpolation:
Y(p) can be interpolated from Y(r0), Y(r1), …, Y(r(2N-2)) by inverting the Vandermonde system relating the output coefficients to the evaluated values:

Y(r0)          1   r0        r0^2        …   r0^(2N-2)          y0
Y(r1)          1   r1        r1^2        …   r1^(2N-2)          y1
  .        =   .   .         .               .               *   .
  .            .   .         .               .                   .
Y(r(2N-2))     1   r(2N-2)   r(2N-2)^2   …   r(2N-2)^(2N-2)     y(2N-2)

so that the coefficients are given by y = V^(-1)·Y, where V is the Vandermonde matrix above.
Thus for Toom-3 the system to be inverted is:

Y(0)        1    0    0    0    0        y0
Y(-1)       1   -1    1   -1    1        y1
Y(1)    =   1    1    1    1    1    *   y2
Y(2)        1    2    4    8   16        y3
Y(∞)        0    0    0    0    1        y4

and the interpolation step computes y = V^(-1)·Y.
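The evaluate-multiply-interpolate pipeline can be sketched generically with exact rational arithmetic (the function name is ours; for simplicity this sketch uses 2N-1 distinct finite points instead of including the point at infinity):

```python
from fractions import Fraction

def toom_convolve(h, x):
    """Toom-Cook convolution: evaluate both polynomials at 2N-1 points,
    multiply pointwise, then interpolate y(p) by solving the Vandermonde
    system exactly with Fractions."""
    n = len(h) + len(x) - 1                    # number of output coefficients
    points = [Fraction(k) for k in range(-(n // 2), n - n // 2)]

    def evaluate(poly, p):
        return sum(Fraction(c) * p**i for i, c in enumerate(poly))

    Y = [evaluate(h, p) * evaluate(x, p) for p in points]   # pointwise step
    # interpolation: solve V * y = Y by Gauss-Jordan elimination
    V = [[p**j for j in range(n)] + [Y[i]] for i, p in enumerate(points)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if V[r][col] != 0)
        V[col], V[pivot] = V[pivot], V[col]
        for r in range(n):
            if r != col and V[r][col] != 0:
                f = V[r][col] / V[col][col]
                V[r] = [a - f * b for a, b in zip(V[r], V[col])]
    return [int(V[i][n] / V[i][i]) for i in range(n)]
```

A production version would hardwire the inverse matrix for fixed points, as the explicit formulas below do, instead of solving the system at run time.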
IMPROVED VERSION OF COOK TOOM BASED ON CHINESE REMAINDER THEOREM
When the Toom-Cook algorithm is applied, a significant reduction in the operation count is obtained if the evaluation points are chosen carefully. A better algorithm results if only 2N-2 distinct finite points are chosen, the remaining information being recovered from the leading coefficient, and the convolution computed using the formula obtained from the Chinese Remainder Theorem (CRT), as follows:
For Toom-3, taking the evaluation points {0, 1, -1, 2} and writing Y0 = Y(0), Y1 = Y(1), Y2 = Y(-1), Y3 = Y(2) and Y4 = h2x2 for the leading coefficient, the CRT reconstruction is:

Y(p) = Y0(p^2-1)(p-2)/2 - Y1·p(p+1)(p-2)/2 - Y2·p(p-1)(p-2)/6 + Y3·p(p^2-1)/6 + Y4·p(p^2-1)(p-2)

The above equation can be expanded to obtain an explicit formula:

m1 = h0x0
m2 = (h0+h1+h2)(x0+x1+x2)/2
m3 = (h0-h1+h2)(x0-x1+x2)/6
m4 = (h0+2h1+4h2)(x0+2x1+4x2)/6
m5 = h2x2

y0 = m1
y1 = -m1/2 + 2m2 - 2m3 - m4 + 2m5
y2 = -m1 + m2 + 3m3 - m5
y3 = m1/2 - m2 - m3 + m4 - 2m5
y4 = m5
The formula involves a number of division operations (by 2 and by 6), which makes the algorithm more complex.
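A sketch of these explicit formulas as we derive them from the CRT reconstruction with points {0, 1, -1, 2} (the helper name and the use of exact fractions for the divisions are ours):

```python
from fractions import Fraction as F

def toom3_crt(h, x):
    """3x3 convolution via the CRT-based Cook-Toom formula with evaluation
    points {0, 1, -1, 2} plus the leading coefficient h2*x2. The divisions
    by 2 and 6 are exact, matching the ASIP's divide-by-2/divide-by-6 units."""
    h0, h1, h2 = h
    x0, x1, x2 = x
    m1 = h0 * x0                                          # Y(0)
    m2 = F((h0 + h1 + h2) * (x0 + x1 + x2), 2)            # Y(1)/2
    m3 = F((h0 - h1 + h2) * (x0 - x1 + x2), 6)            # Y(-1)/6
    m4 = F((h0 + 2*h1 + 4*h2) * (x0 + 2*x1 + 4*x2), 6)    # Y(2)/6
    m5 = h2 * x2                                          # leading coefficient
    y0 = m1
    y1 = -F(m1, 2) + 2*m2 - 2*m3 - m4 + 2*m5
    y2 = -m1 + m2 + 3*m3 - m5
    y3 = F(m1, 2) - m2 - m3 + m4 - 2*m5
    y4 = m5
    return [int(v) for v in (y0, y1, y2, y3, y4)]
```

Only five multiplications are performed instead of nine, at the cost of the extra additions and exact divisions.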
iv. CONVOLUTION BY INSPECTION:
The Toom-Cook algorithm involves a number of divisions in its pre-addition and post-addition matrices, but an efficient algorithm that still reduces the number of multiplications can be designed by inspection. The pre-addition and post-addition matrices of the 5×5 convolution by inspection are:
[Matrices omitted: the 14×5 pre-addition matrix P5, applied to both coefficient vectors, and the 9×14 post-addition matrix; the inspection method computes the 5×5 convolution with 14 multiplications and no divisions.]
v. KARATSUBA ALGORITHM:
The Toom-Cook algorithm involves division operations. The Karatsuba algorithm gives a division-free formula for fast polynomial multiplication, and it reduces the number of multiplications further than the inspection method. The algorithm is as follows:
The polynomials are split into two equal parts. If the length N of a polynomial is odd, it is padded with a zero coefficient (e.g., for N = 11 the multiplication is carried out for N = 12 with the coefficient of p^11 set to 0).
H(p) = h(N-1)p^(N-1) + … + h(N/2)p^(N/2) + h(N/2-1)p^(N/2-1) + … + h1p + h0
     = (h(N-1)p^(N/2-1) + … + h(N/2))·p^(N/2) + (h(N/2-1)p^(N/2-1) + … + h0)
     = H1(p)·p^(N/2) + H0(p)
X(p) = x(N-1)p^(N-1) + … + x(N/2)p^(N/2) + x(N/2-1)p^(N/2-1) + … + x1p + x0
     = X1(p)·p^(N/2) + X0(p)
For 2-2 polynomial multiplication
Y(p) = (h0 + h1p) * (x0 + x1p)
     = h0x0 + ((h0+h1)(x0+x1) - h0x0 - h1x1)p + h1x1p^2
     = h0x0(1-p) + (h0+h1)(x0+x1)p + h1x1(p^2-p)

Multiplications required: 3 instead of 4
Additions required: 4 instead of 1
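Applied recursively to the two halves, this gives the full Karatsuba polynomial multiplication; a sketch (function name ours, with the zero padding for odd lengths described above):

```python
def karatsuba_polymul(h, x):
    """Recursive Karatsuba polynomial multiplication of two coefficient
    lists of equal length N. Odd lengths are zero-padded, so the result
    then carries trailing zero coefficients."""
    n = len(h)
    if n == 1:
        return [h[0] * x[0]]
    if n % 2:                            # pad odd lengths with a zero
        h, x, n = h + [0], x + [0], n + 1
    half = n // 2
    h0, h1 = h[:half], h[half:]          # low and high halves
    x0, x1 = x[:half], x[half:]
    low = karatsuba_polymul(h0, x0)                      # H0*X0
    high = karatsuba_polymul(h1, x1)                     # H1*X1
    mid = karatsuba_polymul([a + b for a, b in zip(h0, h1)],
                            [a + b for a, b in zip(x0, x1)])
    mid = [m - l - g for m, l, g in zip(mid, low, high)] # (H0+H1)(X0+X1)-..
    y = [0] * (2 * n - 1)
    for i, v in enumerate(low):
        y[i] += v
    for i, v in enumerate(mid):
        y[i + half] += v                 # shifted by p^(N/2)
    for i, v in enumerate(high):
        y[i + n] += v                    # shifted by p^N
    return y
```

Each level replaces 4 half-size products with 3, giving the O(N^1.585) multiplication count.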
vi. KARATSUBA LIKE FORMULA
The Karatsuba-like formulae derived by Montgomery for quartic (5-5), quintic (6-6) and sextic (7-7) polynomials require 13, 17 and 22 multiplications respectively.
The explicit formula is given below:
Quartic (5-5) polynomials
m9 = (h0-h2-h3)(x0-x2-x3)
m10 = (h0+h1-h3-h4)(x0+x1-x3-x4)
m11 = (h0+h1+h2-h4)(x0+x1+x2-x4)
m12 = (h0-h2-h3-h4)(x0-x2-x3-x4)
y0 = m1
y1 = m6-m2-m1
y2 = m11-m8-m6-m5+m4+2m2+m1
y5= m13-2m12-m11+m10+2m9-m6-2m4-2m3 -3m1
y6 = m12-m9-m7-m5+m4+2m3+m1
y7 = m7-m4-m3
y8 = m4
Multiplications required: 13 instead of 25
Additions required: 72 instead of 16 with additional shift operations
Even though the number of multiplications decreases (13 instead of 14 with the inspection method), the number of additions increases drastically. The required reduction in the number of additions can be obtained with the algorithm given below.
vii. IMPROVED CONVOLUTION FORMULA USING CHINESE REMAINDER THEOREM
Murat Cenk and Ferruh Özbudak have given explicit formulas for multiplying polynomials of small degree over the binary field F2. These formulas were derived using irreducible polynomials as moduli in the Chinese Remainder Theorem (CRT).
For example, using the moduli polynomials (x-∞)^3, x^3, (x^2+x+1) and (x^3+x+1) in the CRT, explicit formulas for a 5-5 polynomial multiplication over F2 have been derived. Using the same moduli polynomials, we have derived an explicit formula for polynomial multiplication over fields of any characteristic, i.e. over the integers.
The explicit formula is as follows:
m1 = a0*b0
m2 = a1*b1
m3 = a2*b2
m10= (a0-a2-a3)* (b0-b2-b3)
MEDIUM LENGTH CONVOLUTIONS
Medium-length convolutions can be performed using the iterated Karatsuba method or the iterated short-convolution method given by Keshab Parhi.
Iterated Karatsuba method:
To perform an MN-MN convolution, Mul(M)·Mul(N) multiplications are required, where Mul(L) denotes the multiplication count of the L-L algorithm. For a 10-10 convolution, Mul(5)·Mul(2) = 13·3 = 39 multiplications are required.
H(p) = h0 + h1p + h2p^2 + h3p^3 + h4p^4 + h5p^5 + h6p^6 + h7p^7 + h8p^8 + h9p^9
     = c0(p) + c1(p)q
X(p) = x0 + x1p + x2p^2 + x3p^3 + x4p^4 + x5p^5 + x6p^6 + x7p^7 + x8p^8 + x9p^9
     = d0(p) + d1(p)q
where c0(p) = h0 + h1p + h2p^2 + h3p^3 + h4p^4
      c1(p) = h5 + h6p + h7p^2 + h8p^3 + h9p^4
      d0(p) = x0 + x1p + x2p^2 + x3p^3 + x4p^4
      d1(p) = x5 + x6p + x7p^2 + x8p^3 + x9p^4
      q = p^5
Y(p) = H(p)*X(p)
     = c0(p)d0(p) + ((c0(p)+c1(p))(d0(p)+d1(p)) - c0(p)d0(p) - c1(p)d1(p))q + c1(p)d1(p)q^2
The above is a 2-2 Karatsuba requiring 3 multiplications, where each multiplication is a 5-5 polynomial multiplication that in turn requires 13 multiplications.
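The decomposition above can be sketched as follows (the function names are ours; a schoolbook product stands in for the inner 13-multiplication 5-5 formula, since only the outer structure is being illustrated):

```python
def polymul_direct(a, b):
    """Schoolbook polynomial multiplication (stand-in for the 5-5
    Karatsuba-like formula with 13 multiplications)."""
    y = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def iterated_karatsuba_10(h, x):
    """10-10 convolution as a 2-2 Karatsuba over 5-coefficient blocks
    (q = p**5): three 5-5 products instead of four."""
    c0, c1 = h[:5], h[5:]                 # c0 + c1*q
    d0, d1 = x[:5], x[5:]                 # d0 + d1*q
    low = polymul_direct(c0, d0)          # c0*d0
    high = polymul_direct(c1, d1)         # c1*d1
    mid = polymul_direct([a + b for a, b in zip(c0, c1)],
                         [a + b for a, b in zip(d0, d1)])
    mid = [m - l - g for m, l, g in zip(mid, low, high)]
    y = [0] * 19                          # degree 18 result
    for i, v in enumerate(low):
        y[i] += v
    for i, v in enumerate(mid):
        y[i + 5] += v                     # shifted by q = p^5
    for i, v in enumerate(high):
        y[i + 10] += v                    # shifted by q^2 = p^10
    return y
```

Replacing `polymul_direct` with the 13-multiplication formula gives the 3·13 = 39 multiplication count quoted above.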
COMPARISON OF VARIOUS ALGORITHMS
V. IMPLEMENTATION OF ASIP FOR CONVOLUTION
The implementation consists of designing an application-specific instruction processor (ASIP) for convolution. An ASIP is a component used in system-on-chip design; its instruction set is tailored to benefit a specific application. The ASIP designed for convolution consists of a single multiplier, adder and subtractor, together with the following units:
Divide by 2 unit
Divide by 6 unit
Register bank (32 registers)
The designed ASIP is synchronous, so its speed depends on the number of instructions. Each instruction takes 4 clock cycles regardless of the operation: the instruction is read in the first cycle, decoded in the second, executed in the third, and the result is stored in the fourth. An addition therefore incurs a time lag, since it is a simpler operation than multiplication. To avoid this disadvantage, the system can be made asynchronous, in which case the total time depends purely on the time taken by each addition and multiplication, with no clock synchronization.
Thus, to make real-time systems less complex, a long convolution can be replaced by a number of short convolutions. Designing an ASIP for convolution that incorporates a fast convolution algorithm makes the system both fast and area efficient.