# Efficient Implementation Of Fast Convolution Engineering Essay


Convolution is at the very core of DSP. In real-time applications such as radar echo signal simulators, CDMA systems and mobile phones, convolution is one of the major processes involved, and the input-to-output delay (latency) of the convolution operation is the major constraint. Battery-powered portable devices such as cell phones, personal digital assistants and wireless sensor nodes also run cryptographic algorithms that require costly arithmetic operations, so energy efficiency becomes a critical issue as well. Thus fast and exact convolution is one of the most important digital signal processing problems.

In the above applications long convolutions are used extensively, so speeding them up is very important. Though various algorithms are available for this purpose, they are complex. A long convolution can instead be partitioned into a number of short sub-convolutions, each of which can be done in parallel. But in systems such as CDMA and mobile devices, where area is a constraint, this cannot be applied. Thus a convolution system can be designed that is used recursively or in parallel. Though such a system becomes area efficient, it must also be fast; this can be achieved by designing an application specific instruction processor (ASIP) for convolution which employs an algorithm that greatly improves the speed of short convolutions. There are two kinds of convolution: linear (noncyclic/aperiodic) and cyclic (periodic). Most fast convolution algorithms compute a cyclic convolution, while most applications call for linear convolution. On the other hand, cyclic convolutions can be computed by finding the linear convolutions and reducing them modulo x^N - 1; hence efficient ways of computing linear convolutions also lead to efficient ways of computing cyclic convolutions. A linear convolution of lengths M and N can be computed directly using MN multiplications and (M-1)(N-1) additions. Hence much effort has gone into developing alternative and more efficient ways of implementing convolutions, and there are several approaches for speeding up the calculation.

The conventional approach is based on the FFT. The speed advantage offered by the FFT can be very large for long convolutions [12]. However, the FFT approach has drawbacks, relating mainly to the use of sines and cosines and to the need for complex arithmetic even when the convolutions are real. Moreover, most fast convolution algorithms such as those based on the FFT apply only to periodic functions and therefore compute only cyclic convolutions.

To overcome the limitations of the FFT method, many other fast convolution algorithms have been proposed. Toom Cook and Karatsuba, which use algorithmic strength reduction, are algorithms that speed up the computation of short convolutions.

Asymptotically, the algorithms can be ordered from least efficient to most efficient as follows [12]:

| Algorithm | Complexity |
| --- | --- |
| Classical | O(N^2) |
| Karatsuba | O(N^1.5849) |
| Toom-Cook | O(N^1.4649) |
| FFT | O(N log N log log N) |

From the above table it can be seen that convolution through the FFT has the lowest asymptotic complexity for large values of N, while for short convolutions algorithms such as Toom Cook and Karatsuba are most suitable. The paper is organized as follows: the second section describes the convolution process, the third discusses how an area-efficient convolution processor can be designed, the fourth deals with the various algorithms employed to speed up convolution, and the fifth deals with the implementation of an ASIP for convolution.

## LINEAR CONVOLUTION

The basic concept of convolution is to flip, multiply and add. Consider arbitrary sequences x[n] and h[n] of length N. The discrete convolution of x and h is

y[n] = x[n] ⊛ h[n] = Σ_{k=0}^{N-1} h(n-k)·x(k),  n = 0, 1, 2, ..., 2N-2

For N = 5:

y0 = h0x0

y1 = h1x0 + h0x1

y2 = h2x0 + h1x1 + h0x2

y3 = h3x0 + h2x1 + h1x2 + h0x3

y4 = h4x0 + h3x1 + h2x2 + h1x3 + h0x4

y5 = h4x1 + h3x2 + h2x3 + h1x4

y6 = h4x2 + h3x3 + h2x4

y7 = h4x3 + h3x4

y8 = h4x4
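The flip-multiply-add expansion above can be checked with a short direct-convolution routine (a minimal sketch; the function name is illustrative):

```python
# Direct ("school-book") linear convolution: for inputs of lengths M and N it
# uses M*N multiplications, as noted earlier.
def direct_convolution(h, x):
    M, N = len(h), len(x)
    y = [0] * (M + N - 1)
    for i in range(M):
        for k in range(N):
            y[i + k] += h[i] * x[k]  # h_i * x_k contributes to y_{i+k}
    return y

# N = 5 example matching the y0..y8 expansion above
h = [1, 2, 3, 4, 5]   # h0..h4
x = [6, 7, 8, 9, 10]  # x0..x4
y = direct_convolution(h, x)
assert y[0] == h[0] * x[0]                # y0 = h0*x0
assert y[1] == h[1] * x[0] + h[0] * x[1]  # y1 = h1*x0 + h0*x1
assert y[8] == h[4] * x[4]                # y8 = h4*x4
```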

## ii. POLYNOMIAL MULTIPLICATION

Multiplication of two polynomials x(p) and h(p), each of degree N-1, yields a polynomial y(p) of degree 2N-2.

y(p) = (h0 + h1p + h2p^2 + h3p^3 + h4p^4) * (x0 + x1p + x2p^2 + x3p^3 + x4p^4)

= h0x0 + h1x0p + h2x0p^2 + h3x0p^3 + h4x0p^4

+ h0x1p + h1x1p^2 + h2x1p^3 + h3x1p^4 + h4x1p^5

+ h0x2p^2 + h1x2p^3 + h2x2p^4 + h3x2p^5 + h4x2p^6

+ h0x3p^3 + h1x3p^4 + h2x3p^5 + h3x3p^6 + h4x3p^7

+ h0x4p^4 + h1x4p^5 + h2x4p^6 + h3x4p^7 + h4x4p^8

= h0x0 + (h1x0 + h0x1)p + (h2x0 + h1x1 + h0x2)p^2 + (h3x0 + h2x1 + h1x2 + h0x3)p^3 + (h4x0 + h3x1 + h2x2 + h1x3 + h0x4)p^4 + (h4x1 + h3x2 + h2x3 + h1x4)p^5 + (h4x2 + h3x3 + h2x4)p^6 + (h4x3 + h3x4)p^7 + h4x4p^8

## BINARY/INTEGER MULTIPLICATION

Binary and integer multiplication are adaptations of polynomial multiplication, with the added complication of carry propagation. To propagate the carries and obtain the final output, recomposition is done: the coefficients are placed in order of their weights, the high-order bits/digits of each coefficient beyond the least significant k are overlapped with the next coefficient, and the overlapping columns are added (where k is the number of bits/digits in each coefficient). Integer multiplication requires an additional step called splitting: to perform long-integer multiplication using a convolution (polynomial multiplication) algorithm, each long integer is split into sub-integers which become the coefficients of a polynomial.
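A minimal sketch of splitting and recomposition for long-integer multiplication (the base and the helper names are illustrative assumptions):

```python
# Long-integer multiplication via convolution: split each integer into base-B
# digits (the polynomial coefficients), convolve the digit sequences, then
# recompose by propagating carries.
def split_digits(n, base):
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits or [0]

def multiply_via_convolution(a, b, base=10):
    ha, xb = split_digits(a, base), split_digits(b, base)
    # polynomial multiplication = linear convolution of the digit sequences
    y = [0] * (len(ha) + len(xb) - 1)
    for i, hi in enumerate(ha):
        for k, xk in enumerate(xb):
            y[i + k] += hi * xk
    # recomposition: add each coefficient at its weight, propagating carries
    result, carry = 0, 0
    for pos, coeff in enumerate(y):
        carry, digit = divmod(coeff + carry, base)
        result += digit * base ** pos
    return result + carry * base ** len(y)
```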

For N = 5, recomposition aligns the partial products as follows (h_j·x_i contributes to output y_{i+j}):

| | y8 | y7 | y6 | y5 | y4 | y3 | y2 | y1 | y0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x0 | | | | | h4x0 | h3x0 | h2x0 | h1x0 | h0x0 |
| x1 | | | | h4x1 | h3x1 | h2x1 | h1x1 | h0x1 | |
| x2 | | | h4x2 | h3x2 | h2x2 | h1x2 | h0x2 | | |
| x3 | | h4x3 | h3x3 | h2x3 | h1x3 | h0x3 | | | |
| x4 | h4x4 | h3x4 | h2x4 | h1x4 | h0x4 | | | | |

From the above it can be seen that an efficient convolution ASIP also yields an efficient polynomial/binary/integer multiplier and correlator.

## i. BASIC MODEL

[Figure: basic convolution model computing y(n) with N parallel multipliers feeding a chain of adders]

Implementation of very long convolutions by using the basic model has the following disadvantages:

## 1. Area

Convolution of N*N in the above model requires N multipliers and (N-1) adders. A multiplier occupies a large area compared to an adder, so the area required to implement the convolution on an FPGA is N times the area of a single multiplier, in addition to the area occupied by the adders. For a long convolution this area is very large and cannot be accommodated on a single FPGA chip.

## 2. Speed

It is fast due to the parallel multiplications; most of the time delay comes from the sequential addition operations.

## ii. COEFFICIENT-PARTITIONED MODEL

[Figure: coefficient-partitioned model: the input x(n) is fed to k partitioned short-convolution ASIPs; ASIP i holds the coefficient subsequence hi(0) ... hi(L-1) and produces yi(n)]

In this approach the long coefficient sequence is partitioned into short sub-sequences, each of which can be accommodated in a single FPGA chip. The overall long convolution can thus be implemented by distributing the partitioned sub-convolutions over many FPGA chips. Since the convolution is decomposed into parallel processes, higher speed is achieved compared to the basic model.

Though this model improves the speed, it resorts to a multi-chip environment and the total area required for the convolution process is not reduced. Thus it is not area efficient.

## iii. PROPOSED METHOD

In order to fulfill the requirement of area efficiency, each partitioned sub-convolution can be implemented as a convolution ASIP with a single multiplier, adder and subtractor incorporated into it. As area efficiency is achieved in this method, the partitioned convolutions that previously required a multi-chip environment can be implemented on a single chip.

## iv. AREA - SPEED TRADEOFF

There is always a tradeoff between area and speed. Though this method reduces area to the maximum extent, speed drops correspondingly, because all N^2 multiplications and (N-1)^2 additions required in a convolution operation are executed sequentially. The area and speed of a long convolution in the various methods are as follows:

| Method | Speed | Area |
| --- | --- | --- |
| Single ASIP | Slow | Less |
| Partitioned short-convolution ASIPs | Medium | Medium |
| Coefficient-partitioned model | Fast | More |

Though a single-ASIP system dramatically reduces the area, speed is also reduced to a great extent. By using partitioned convolution ASIPs we can improve the speed; though the area occupied increases, it is not as large as that of the coefficient-partitioned model. Since area and power are major constraints in several battery-powered systems, the proposed method can be used to make the system highly efficient.

## v. APPLICATION SPECIFIC INSTRUCTION PROCESSOR (ASIP)

There are two kinds of ASIPs, namely synchronous and asynchronous. In a synchronous ASIP there is a common clock whose period is set by the most time-consuming operation. Since multiplication takes more time than addition, every addition operation in the convolution process incurs a time lag. In an asynchronous ASIP there is no global clock signal; instead, it operates under distributed control, with concurrent hardware components communicating and synchronizing with each other. The time lag in the addition operations is thus avoided, so more speed can be achieved with an asynchronous ASIP. Speed can be increased further by reducing the number of multiplications, the most time-consuming operation, through algorithmic strength reduction: a process in which the number of stronger operations such as multiplications is reduced at the expense of an increased number of weaker operations such as additions.

## IV. SEARCH FOR FAST CONVOLUTION ALGORITHMS

There are various algorithms available for performing fast convolution, which are as follows:

## i. FAST FOURIER TRANSFORM

Convolution in the time domain is equivalent to multiplication in the frequency domain. Consider N-sample input sequences x(n) and h(n). Their convolution can be done using the FFT, which involves the following steps.

## Step-1:

To avoid aliasing, the N-sample input sequences must be extended to 2N-1 samples by zero padding. The padded sequences are:

xz(n) = x(n) for n = 0, 1, ..., N-1; 0 for n = N, ..., 2N-2

hz(n) = h(n) for n = 0, 1, ..., N-1; 0 for n = N, ..., 2N-2

## Step-2:

Compute the FFT of the padded sequences using butterfly modules:

X(k) = FFT[xz(n)],  k = 0, 1, 2, ..., 2N-2

H(k) = FFT[hz(n)],  k = 0, 1, 2, ..., 2N-2

## Step-3:

Compute the point-wise (complex) multiplications of X(k) and H(k) to get Y(k).

## Step-4:

Compute IFFT for Y(k) to get the convolved output sequence y(n).

The above steps can be represented as follows:

x(n) → zero padding → FFT →
                            point-wise multiplication → IFFT → y(n)
h(n) → zero padding → FFT →

## Computational complexity

Though the FFT is a very efficient method for convolving long inputs, it is too complex for short convolutions.
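Steps 1-4 can be sketched with a textbook radix-2 FFT (an illustrative implementation, not the paper's; the function names are assumptions):

```python
import cmath

def fft(a, invert=False):
    # Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 2j if invert else -2j
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w            # butterfly
        out[k + n // 2] = even[k] - w
    return out

def fft_convolve(h, x):
    # Step 1: zero-pad to at least 2N-1 samples (next power of two for radix-2)
    size = 1
    while size < len(h) + len(x) - 1:
        size *= 2
    hz = [complex(v) for v in h] + [0j] * (size - len(h))
    xz = [complex(v) for v in x] + [0j] * (size - len(x))
    # Steps 2-3: transform both sequences and multiply point-wise
    Y = [a * b for a, b in zip(fft(hz), fft(xz))]
    # Step 4: inverse transform (scaled by 1/size), round back to integers
    y = [v / size for v in fft(Y, invert=True)]
    return [round(v.real) for v in y[:len(h) + len(x) - 1]]
```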

## ii. ALGORITHMIC STRENGTH REDUCTION:

The computational complexity of the school-book method is O(N^2), i.e. it requires N^2 multiplications. As multiplication is a time-consuming operation, it is important to reduce the number of multiplications involved in a convolution. This can be done by algorithmic strength reduction, a process in which the number of stronger operations such as multiplications is reduced at the expense of an increased number of weaker operations such as additions.

## iii. TOOM COOK METHOD:

Toom Cook is a linear convolution algorithm for polynomial multiplication, based on Lagrange's interpolation theorem. The steps involved in this method are as follows:

## a. Representation as polynomials:

In this step the inputs of an N*N convolution are represented as coefficients of polynomials of degree N-1:

H(p) = Σ_{i=0}^{N-1} h_i p^i

X(p) = Σ_{i=0}^{N-1} x_i p^i

## b. Evaluation:

Evaluate each polynomial at 2N-1 points r0, r1, r2, ..., r2N-2: typically 0, ∞ and 2N-3 points such as 1, -1, 2, -2, .... For example, for a toom-3 convolution the evaluation points are 0, -1, 1, 2, ∞:

X(0) = x0 + x1(0) + x2(0)^2 = x0

X(-1) = x0 + x1(-1) + x2(-1)^2 = x0 - x1 + x2

X(1) = x0 + x1(1) + x2(1)^2 = x0 + x1 + x2

X(2) = x0 + x1(2) + x2(2)^2 = x0 + 2x1 + 4x2

X(∞) = x2

In matrix form:

X(0)    [1  0  0]
X(-1)   [1 -1  1]   [x0]
X(1)  = [1  1  1] * [x1]
X(2)    [1  2  4]   [x2]
X(∞)    [0  0  1]

## c. Multiplication:

In this step the point-wise multiplications of the evaluated values are done. For example, the point-wise multiplication for toom-3 is as follows:

Y(0) = X(0)·H(0)

Y(-1) = X(-1)·H(-1)

Y(1) = X(1)·H(1)

Y(2) = X(2)·H(2)

Y(∞) = X(∞)·H(∞)

## d. Interpolation:

Y(p) can be interpolated from Y(r0), Y(r1), ..., Y(r2N-2) explicitly using the inverse of the Vandermonde matrix:

[y0]      [1  r0      r0^2      ...  r0^(2N-2)   ]^-1  [Y0]
[y1]    = [1  r1      r1^2      ...  r1^(2N-2)   ]     [Y1]
[...]     [.........................................]  [...]
[y2N-2]   [1  r2N-2   r2N-2^2   ...  r2N-2^(2N-2)]     [Y2N-2]

Thus for toom-3, the interpolation step is as follows:

[y0]   [1  0  0  0  0]^-1  [Y(0)]
[y1]   [1 -1  1 -1  1]     [Y(-1)]
[y2] = [1  1  1  1  1]  *  [Y(1)]
[y3]   [1  2  4  8 16]     [Y(2)]
[y4]   [0  0  0  0  1]     [Y(∞)]
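The evaluate, point-wise multiply and interpolate steps for toom-3 can be sketched as follows; exact rational arithmetic absorbs the divisions that arise during interpolation (helper names are illustrative):

```python
from fractions import Fraction

# Toom-3 sketch for quadratic h(p) and x(p) with evaluation points 0, -1, 1, 2
# and infinity, as above.
POINTS = [0, -1, 1, 2]

def toom3(h, x):
    ev = lambda c, p: c[0] + c[1] * p + c[2] * p * p
    # b. evaluation and c. point-wise multiplication
    Y = [Fraction(ev(h, p) * ev(x, p)) for p in POINTS]
    y4 = h[2] * x[2]  # Y(infinity) gives the leading coefficient directly
    # d. interpolation: solve the 4x4 Vandermonde system for y0..y3 after
    # subtracting the known y4 * p^4 contribution at each finite point
    A = [[Fraction(p) ** j for j in range(4)] + [Y[i] - y4 * Fraction(p) ** 4]
         for i, p in enumerate(POINTS)]
    for col in range(4):  # Gauss-Jordan elimination with exact fractions
        piv = next(r for r in range(col, 4) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(4):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [int(A[r][4]) for r in range(4)] + [y4]
```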

## IMPROVED VERSION OF COOK TOOM BASED ON CHINESE REMAINDER THEOREM

When applying the Toom Cook algorithm, a significant reduction in operation count occurs if the evaluation points are carefully chosen. A better algorithm is produced by choosing only 2N-2 distinct finite points [5] and computing the convolution with the formula obtained from the Chinese Remainder Theorem (CRT). For toom-3 this gives:

Y(p) = Y0(p-2)(p^2-1)/2 - Y1 p(p-2)(p+1)/2 - Y2 p(p-2)(p-1)/6 + Y3 p(p^2-1)/6 + Y4 p(p^2-1)(p-2)

where Y0 = Y(0), Y1 = Y(1), Y2 = Y(-1), Y3 = Y(2) and Y4 = Y(∞).

The above equation can be expanded to obtain an explicit formula as follows:

m1 = h0x0/2

m2 = (h0+h1+h2)(x0+x1+x2)/2

m3 = (h0-h1+h2)(x0-x1+x2)/6

m4 = (h0+2h1+4h2)(x0+2x1+4x2)/6

m5 = h2x2

y0 = 2m1

y1 = -m1 + 2m2 - 2m3 - m4 + 2m5

y2 = -2m1 + m2 + 3m3 - m5

y3 = m1 - m2 - m3 + m4 - 2m5

y4 = m5
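As a check, the five products and the reconstruction can be computed directly; exact rational arithmetic handles the divisions (note the divisor 6 on m4, which is required for the reconstruction to be exact):

```python
from fractions import Fraction

# CRT-based toom-3: five "multiplications" m1..m5, then the reconstruction of
# the output coefficients, checked against a direct 3-3 convolution.
def toom3_crt(h, x):
    h0, h1, h2 = map(Fraction, h)
    x0, x1, x2 = map(Fraction, x)
    m1 = h0 * x0 / 2
    m2 = (h0 + h1 + h2) * (x0 + x1 + x2) / 2
    m3 = (h0 - h1 + h2) * (x0 - x1 + x2) / 6
    m4 = (h0 + 2 * h1 + 4 * h2) * (x0 + 2 * x1 + 4 * x2) / 6
    m5 = h2 * x2
    y0 = 2 * m1
    y1 = -m1 + 2 * m2 - 2 * m3 - m4 + 2 * m5
    y2 = -2 * m1 + m2 + 3 * m3 - m5
    y3 = m1 - m2 - m3 + m4 - 2 * m5
    y4 = m5
    return [int(v) for v in (y0, y1, y2, y3, y4)]
```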

## Computational complexity

It involves a number of division operations, which makes the algorithm more complex.

## iv. CONVOLUTION BY INSPECTION:

The Toom Cook algorithm involves a number of divisions in its pre-addition and post-addition matrices, but an efficient division-free algorithm that still reduces the number of multiplications can be designed by the inspection method [4]. The pre-addition and post-addition matrices for a 5*5 convolution are:

P5 =
[1 0 0 0 0]
[0 1 0 0 0]
[0 0 1 0 0]
[0 0 0 1 0]
[0 0 0 0 1]
[1 1 0 0 0]
[1 0 1 0 0]
[1 0 0 1 0]
[0 1 1 0 0]
[0 0 1 1 0]
[0 1 0 0 1]
[0 0 1 0 1]
[0 0 0 1 1]
[1 1 0 1 1]

Q5 =
[ 1  0  0  0  0  0  0  0  0  0  0  0  0  0]
[-1 -1  0  0  0  1  0  0  0  0  0  0  0  0]
[-1  1 -1  0  0  0  1  0  0  0  0  0  0  0]
[-1 -1 -1 -1  0  0  0  1  1  0  0  0  0  0]
[ 1  1  1  1  1 -1  0 -1  0  0 -1  0 -1  1]
[ 0 -1 -1 -1 -1  0  0  0  0  1  1  0  0  0]
[ 0  0 -1  1 -1  0  0  0  0  0  0  1  0  0]
[ 0  0  0 -1 -1  0  0  0  0  0  0  0  1  0]
[ 0  0  0  0  1  0  0  0  0  0  0  0  0  0]
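A sketch of the inspection method as m = (P5·h) multiplied point-wise with (P5·x), then y = Q5·m, using one consistent choice of the 14 products; the matrix entries below are a reconstruction verified against direct convolution, not a verbatim copy from the source:

```python
# 14-multiplication inspection method for a 5*5 convolution.
P5 = [
    [1,0,0,0,0],[0,1,0,0,0],[0,0,1,0,0],[0,0,0,1,0],[0,0,0,0,1],
    [1,1,0,0,0],[1,0,1,0,0],[1,0,0,1,0],[0,1,1,0,0],[0,0,1,1,0],
    [0,1,0,0,1],[0,0,1,0,1],[0,0,0,1,1],[1,1,0,1,1],
]
Q5 = [
    [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [-1,-1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [-1, 1,-1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [-1,-1,-1,-1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [ 1, 1, 1, 1, 1,-1, 0,-1, 0, 0,-1, 0,-1, 1],
    [ 0,-1,-1,-1,-1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [ 0, 0,-1, 1,-1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [ 0, 0, 0,-1,-1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

def matvec(M, v):
    return [sum(r * c for r, c in zip(row, v)) for row in M]

def inspection5(h, x):
    # 14 products of pre-added inputs, then post-additions to form y0..y8
    m = [a * b for a, b in zip(matvec(P5, h), matvec(P5, x))]
    return matvec(Q5, m)
```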

## v. KARATSUBA

The Toom Cook algorithm involves division operations. The Karatsuba algorithm derives a division-free formula for fast polynomial multiplication, and it further reduces the number of multiplications below that of the inspection method. The algorithm is as follows.

The polynomials are split into two equal parts. If the length N of a polynomial is odd, it is padded with a zero coefficient (i.e., if N = 11, the polynomial multiplication for N = 12 is carried out with the coefficient of p^11 set to 0).

H(p) = hN-1 .... hN/2 hN/2-1 .... h1 h0

= (hN-1 .... hN/2)·p^(N/2) + (hN/2-1 .... h1 h0) = H1·p^(N/2) + H0

X(p) = xN-1 .... xN/2 xN/2-1 .... x1 x0

= (xN-1 .... xN/2)·p^(N/2) + (xN/2-1 .... x1 x0) = X1·p^(N/2) + X0

For 2-2 polynomial multiplication

Y(p) = (h0 + h1p) * (x0 + x1p)

= h0x0 + ((h0+h1)(x0+x1) - h0x0 - h1x1)p + h1x1p^2

= h0x0(1-p) + (h0+h1)(x0+x1)p + h1x1(p^2-p)
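The recursive scheme can be sketched as follows (an illustrative implementation; odd lengths are padded with a zero coefficient as described above):

```python
# Recursive Karatsuba polynomial multiplication: three half-size products
# replace the four of the school-book split.
def karatsuba(h, x):
    n = len(h)
    if n == 1:
        return [h[0] * x[0]]
    if n % 2:  # pad to even length, then drop the trailing zero coefficients
        return karatsuba(h + [0], x + [0])[:2 * n - 1]
    half = n // 2
    h0, h1 = h[:half], h[half:]
    x0, x1 = x[:half], x[half:]
    low = karatsuba(h0, x0)    # h0 * x0
    high = karatsuba(h1, x1)   # h1 * x1
    mid = karatsuba([a + b for a, b in zip(h0, h1)],
                    [a + b for a, b in zip(x0, x1)])  # (h0+h1)(x0+x1)
    y = [0] * (2 * n - 1)
    for i in range(n - 1):
        y[i] += low[i]
        y[i + n] += high[i]
        y[i + half] += mid[i] - low[i] - high[i]  # middle term
    return y
```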

## vi. KARATSUBA LIKE FORMULA

The Karatsuba-like formulas derived by Montgomery for quartic (5-5), quintic (6-6) and sextic (7-7) polynomials require 13, 17 and 22 multiplications respectively [1].

The explicit formula is given below:

Quartic (5-5) polynomials

m1=h0x0

m2=h1x1

m3=h3x3

m4=h4x4

m5=(h0-h4)(x0-x4)

m6=(h0+h1)(x0+x1)

m7=(h3+h4)(x3+x4)

m8=(h1+h2-h4)(x1+x2-x4)

m9=(h0-h2-h3)(x0-x2-x3)

m10=(h0+h1-h3-h4)(x0+x1-x3-x4)

m11=(h0+h1+h2-h4)(x0+x1+x2-x4)

m12=(h0-h2-h3-h4)(x0-x2-x3-x4)

m13=(h0+h1+h2+h3+h4)(x0+x1+x2+x3+x4)

y0 = m1

y1 = m6-m2-m1

y2 = m11-m8-m6-m5+m4+2m2+m1

y3 = m13-m12-2m11+m10+2m8-m7+3m5-3m4-2m2-2m1

y4 = -m13+2m12+2m11-2m10-m9-m8+m7+m6-4m5+3m4+m3-m2+3m1

y5 = m13-2m12-m11+m10+2m9-m6-2m4-2m3-3m1

y6 = m12-m9-m7-m5+m4+2m3+m1

y7 = m7-m4-m3

y8 = m4

Multiplications required: 13 instead of 25.

Even though there is a decrease in the number of multiplications (13, instead of 14 in the inspection method), the number of additions increases drastically. The required decrease in the number of additions can be obtained by the algorithm given below.

## vii. IMPROVED CONVOLUTION FORMULA USING CHINESE REMAINDER THEOREM

Murat Cenk and Ferruh Ozbudak have given explicit formulas for multiplying polynomials of small degree over the finite field F2. These formulas were derived using irreducible polynomials as moduli in the Chinese Remainder Theorem (CRT) [3].

For example, using the moduli polynomials (x-∞)^3, x^3, (x^2+x+1) and (x^3+x+1) in the CRT, explicit formulas for a 5-5 polynomial multiplication over F2 have been derived. Using these moduli polynomials we have derived an explicit formula for polynomial multiplication over any field, i.e. over the integers.

The explicit formula is as follows:

m1 = a0*b0

m2 = a1*b1

m3 = a2*b2

m4=a3*b3

m5= a4*b4

m6= (a0-a1)*(b0-b1)

m7= (a0+a2)*(b0+b2)

m8= (a2+a4)*(b2+b4)

m9= (a3-a4)*(b3-b4)

m10= (a0-a2-a3)* (b0-b2-b3)

m11= (a1+a2-a4)*(b1+b2-b4)

m12= (a0-a1-a3+a4)*(b0-b1-b3+b4)

m13= (a0-a1-a2-a3+a4)*(b0-b1-b2-b3+b4)

c0 = m1

c1 = -m6+m2+m1

c2 = m7-m1+m2-m3

c3 = m13-m12-m10+m8+m4-m3+m1-m5

c4 = m13-m11-m10-m6-m9+m1+m2+2m3+m4+m5

c5 = m13-m12-m11+m7+m5-m3+m2-m1

c6 = m8-m5-m3+m4

c7 = -m9+m5+m4

c8 = m5
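A short sketch transcribing the formula above (with c0 taken as m1), verified against direct convolution:

```python
# CRT-derived 5-5 polynomial multiplication with 13 products m1..m13.
def crt5(a, b):
    a0, a1, a2, a3, a4 = a
    b0, b1, b2, b3, b4 = b
    m1, m2, m3, m4, m5 = a0*b0, a1*b1, a2*b2, a3*b3, a4*b4
    m6 = (a0-a1)*(b0-b1)
    m7 = (a0+a2)*(b0+b2)
    m8 = (a2+a4)*(b2+b4)
    m9 = (a3-a4)*(b3-b4)
    m10 = (a0-a2-a3)*(b0-b2-b3)
    m11 = (a1+a2-a4)*(b1+b2-b4)
    m12 = (a0-a1-a3+a4)*(b0-b1-b3+b4)
    m13 = (a0-a1-a2-a3+a4)*(b0-b1-b2-b3+b4)
    return [
        m1,
        -m6 + m2 + m1,
        m7 - m1 + m2 - m3,
        m13 - m12 - m10 + m8 + m4 - m3 + m1 - m5,
        m13 - m11 - m10 - m6 - m9 + m1 + m2 + 2*m3 + m4 + m5,
        m13 - m12 - m11 + m7 + m5 - m3 + m2 - m1,
        m8 - m5 - m3 + m4,
        -m9 + m5 + m4,
        m5,
    ]
```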

## MEDIUM LENGTH CONVOLUTIONS

Medium length convolutions can be performed using the iterated Karatsuba method [8] or the iterated short convolution method given in Keshab Parhi [4].

Iterated Karatsuba method:

To perform an MN-MN convolution, Mul(M)·Mul(N) multiplications are required. For a 10-10 convolution, Mul(2)·Mul(5) = 3·13 = 39 multiplications are needed.

H(p) = h0 + h1p + h2p^2 + h3p^3 + h4p^4 + h5p^5 + h6p^6 + h7p^7 + h8p^8 + h9p^9 = c0(p) + c1(p)q

X(p) = x0 + x1p + x2p^2 + x3p^3 + x4p^4 + x5p^5 + x6p^6 + x7p^7 + x8p^8 + x9p^9 = d0(p) + d1(p)q

where

c0(p) = h0 + h1p + h2p^2 + h3p^3 + h4p^4

c1(p) = h5 + h6p + h7p^2 + h8p^3 + h9p^4

d0(p) = x0 + x1p + x2p^2 + x3p^3 + x4p^4

d1(p) = x5 + x6p + x7p^2 + x8p^3 + x9p^4

q = p^5

Y(p) = H(p)*X(p) = c0(p)d0(p) + ((c0(p)+c1(p))(d0(p)+d1(p)) - c0(p)d0(p) - c1(p)d1(p))q + c1(p)d1(p)q^2

The above is a 2-2 Karatsuba requiring 3 multiplications, where each multiplication is a 5-5 polynomial multiplication that in turn requires 13 multiplications.
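A sketch of the 10-10 iterated scheme; the three 5-5 sub-products are computed directly here for brevity, and substituting a 13-multiplication 5-5 routine yields the 3·13 = 39 count:

```python
# Iterated Karatsuba for a 10-10 product: one 2-2 Karatsuba step over the
# degree-4 halves (q = p^5), with three 5-5 sub-convolutions.
def conv5(h, x):
    y = [0] * 9
    for i in range(5):
        for k in range(5):
            y[i + k] += h[i] * x[k]
    return y

def conv10(h, x):
    c0, c1 = h[:5], h[5:]
    d0, d1 = x[:5], x[5:]
    t1 = conv5(c0, d0)                               # c0(p)d0(p)
    t2 = conv5(c1, d1)                               # c1(p)d1(p)
    t3 = conv5([a + b for a, b in zip(c0, c1)],
               [a + b for a, b in zip(d0, d1)])      # (c0+c1)(d0+d1)
    t4 = [a - b - c for a, b, c in zip(t3, t1, t2)]  # middle term
    y = [0] * 19
    for i in range(9):
        y[i] += t1[i]        # t1(p)
        y[i + 5] += t4[i]    # t4(p) * p^5
        y[i + 10] += t2[i]   # t2(p) * p^10
    return y
```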

In tabular form, with t1(p) = c0(p)d0(p), t2(p) = c1(p)d1(p), t3(p) = (c0(p)+c1(p))(d0(p)+d1(p)) and t4(p) = t3(p) - t2(p) - t1(p), the total output is assembled as y(p) = t1(p) + t4(p)·p^5 + t2(p)·p^10:

| | t1(p) = c0(p)d0(p) | t3(p) = (c0(p)+c1(p))(d0(p)+d1(p)) | t2(p) = c1(p)d1(p) | Total output |
| --- | --- | --- | --- | --- |
| Inputs | h0...h4, x0...x4 | h0+h5, ..., h4+h9 and x0+x5, ..., x4+x9 | h5...h9, x5...x9 | y(p) |
| Mul(5) | 13 | 13 | 13 | 39 |
| Additions | 64 | 64+10+18 | 64 | 220+8 = 228 |

The coefficients of t1(p) (p^0 to p^8), t4(p) (shifted by p^5) and t2(p) (shifted by p^10) overlap in 8 positions, which accounts for the final 8 additions.

[Figure: the proposed system, in which three short-convolution units operate under a common control unit]

The number of multiplications required by a 5-5 convolution under each algorithm is:

| | General | FFT | Inspection | Karatsuba | CRT |
| --- | --- | --- | --- | --- | --- |
| Multiplications | 25 | 105 | 14 | 13 | 16 |

## IV. IMPLEMENTATION

The implementation part consists of designing an application specific instruction processor (ASIP) for convolution. An ASIP is a component used in system-on-chip design; the instruction set of an ASIP is tailored to benefit a specific application. The ASIP designed for convolution consists of a single:

Multiplier unit

Subtractor unit

Divide-by-2 unit

Divide-by-6 unit

Register bank (32 registers)

RAM unit

The designed ASIP is synchronous, so its speed depends on the number of instructions. Execution of each instruction takes 4 clock cycles whatever the operation may be: in the first clock cycle the instruction is read, in the second it is decoded, in the third it is executed, and in the fourth the result is stored. In the case of addition there is therefore a time lag, as it is a simple operation compared to multiplication. To avoid this disadvantage the system can be made asynchronous, where the total time taken depends purely on the time taken by each addition and multiplication, with no synchronization.
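The timing argument can be sketched with a toy cycle-count model; the 4-cycles-per-instruction figure is from the text, while the asynchronous latencies are illustrative assumptions:

```python
# Toy cycle-count model for an N-point convolution executed sequentially on a
# single ASIP: N^2 multiplications and (N-1)^2 additions, as stated earlier.
def synchronous_cycles(n, cycles_per_instruction=4):
    # Every instruction costs the same 4 cycles, whatever the operation.
    return (n * n + (n - 1) * (n - 1)) * cycles_per_instruction

def asynchronous_cycles(n, mul_latency=4, add_latency=1):
    # Each operation takes only its own (assumed) latency.
    return n * n * mul_latency + (n - 1) * (n - 1) * add_latency
```

For N = 5 this gives (25 + 16)·4 = 164 cycles synchronously versus 25·4 + 16 = 116 asynchronously, illustrating why removing the per-addition lag pays off.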

[Figure: ASIP architecture: an arithmetic unit (multiplier, subtractor, divide-by-2, divide-by-6), a register bank (32 registers), RAM, a control unit and an instruction register]

## V. CONCLUSION

Thus, to make real-time systems less complex, a long convolution can be replaced by a number of short convolutions. Designing an ASIP for convolution that incorporates a fast convolution algorithm makes the system fast and area efficient.