High Speed Parallel Crc Implementation Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract- Error correction codes provides a mean to detect and correct errors introduced by the transmission channel. Two main categories of codes exist: block codes and convolution codes. They both introduce redundancy by adding parity symbols to the message data. Cyclic redundancy check (CRC) codes are the subset of the cyclic codes that are also a subset of linear block codes.

The hardware implementation of a bit-wide CRC is a simple linear feedback shift register. While such a circuit is simple and can run at very high clock speeds, it suffers from the limitation that the stream must be bit-serial. This means that n clock cycles will be required to calculate the CRC values for an n-bit data stream. In many high speed data networking applications where data frames need to be processed at high speeds, this latency is intolerable and hence, implementation of CRC generation and checking on a parallel stream of data becomes desirable.

This project presents a high-speed parallel cyclic redundancy check (CRC) implementation based on unfolding, pipelining, and retiming algorithms. CRC architectures are first pipelined to reduce the iteration bound by using novel look-ahead pipelining methods and then unfolded and retimed to design high-speed parallel circuits. A comparison on commonly used generator polynomials between the proposed design and previously proposed parallel CRC algorithms shows that the proposed design can increase the speed by up to 25% and control or even reduce hardware cost.

This project is implemented using Verilog HDL. Simulation and synthesis are done using Xilinx ISE Tools.

Index Terms-Cyclic redundancy check (CRC), linear feedback shift register (LFSR), pipelining, retiming, unfolding, Verilog HDL.


cYCLIC redundancy check (CRC) is widely used to detect errors in data communication and storage devices. When high-speed data transmission is required, the general serial implementation cannot meet the speed requirement. Since parallel processing is a very efficientway to increase the throughput rate, parallel CRC implementations have been discussed extensively in the past decade [1], [2]. Although parallel processing increases the number of message bits that can be processed in one clock cycle, it can also lead to a long critical path (CP); thus, the increase of throughput rate that is achieved by parallel processing will be reduced by the decrease of circuit speed. Another issue is the increase of hardware cost caused by parallel processing, which needs to be controlled. This brief addresses these two issues of parallel CRCimplementations.

In [1] and [2], recursive formulas have been developed forparallel CRC hardware computation based on mathematical deduction.they have identical CPs. The parallel CRC algorithm in [2] processes an m-bit message in (m+k)/L clock cycles, where k is the order of the generator polynomial and L is the level of parallelism. However, in [1], message bits can be processed in m/L clock cycles.

High-speed architectures for parallel long Bose-Chaudhuri-Hocquenghen (BCH) encoders in [3] and [4], which are based on the multiplication and division computations on generator polynomial, are efficient in terms of speeding up the parallel linear feedback shift register (LFSR) structures.

They can also be generally used for the LFSR of any generator polynomial. However, their hardware cost is high.

We will show that our proposed design can achieve shorter CP, which leads to a parallel CRC circuit with higher processing speed, and can control or reduce the hardware cost of parallel processing, with regard to the most commonly used CRC generator polynomials. The proposed design starts from LFSR, which is generally used for serial CRC. An unfolding algorithm [5] is used to realize parallel processing. However, direct application of unfolding may lead to a parallel CRC circuit with long iteration bound, which is the lowest achievable CP [5]. Two novel look-ahead pipelining methods are developed to reduce the iteration bound of the original serial LFSR CRC structures; then, the unfolding algorithm is applied to obtain a parallel CRC structure with low iteration bound. The retiming algorithm is then applied to obtain the achievable lowest CP. This brief is organized as follows. Section II analyzes the effect of unfolding on the iteration bound of a parallel CRC circuit. Our proposed pipelining algorithms are discussed in Section III. Retimed and unfolded LFSR CRC circuits are compared in Section IV.


We start with the simple example in [1]. Consider the CRC architecture with the generator polynomial G(y) = 1+y+y^3+y^5, which is shown in Fig. 1. After applying an unfolding algorithm [5] with unfolding factors J=2 and J=3, we obtain the two-parallel and three-parallel architectures shown in Figs. 2 and 3, respectively.

Iteration bound is defined as the maximum of all the loop bounds. Loop bound is defined as t/w, where is the computation time of the loop and is the number of delay elements in the loop [5].

It is obvious that the iteration bounds for the CRC architectures in Figs. 1-3 are Txor, 2Txor, and 3Txor , respectively, where Txor is the computation time of an XOR gate. In this case, the iteration bound of a -parallel CRC architecture for G(y) = 1+y+y^3+y^5 is J.Txor Although we can use retiming to reduce the CP of a circuit, we cannot achieve a CP with a computation time that is less than the iteration bound of this circuit. In otherwords, the CP of a parallel CRC architecture cannot be less than , where is the level of parallelism and is the iteration period bound of the original data flow graph (DFG) [5]. Therefore, it is very important to reduce the iteration bound before the unfolding algorithm is applied.


A. Proposed Look-Ahead Pipelining (Method 1)

In this section, we propose a look-ahead pipelining algorithm to reduce the iteration bound of the CRC architecture.The CRC architecture with the generator polynomial G(y) = 1+y+y^8+y^9 is shown in Fig. 4. Its iteration bound is 2Txot, which corresponds to the section in the dashed square and the term y^8+y^9 in the generator polynomial. The largest iteration bound of a general serial CRC architecture is also 2Txor . For example, the serial architectures of commonly used generator polynomials CRC-16 and CRC-12 have the iteration bound of 2Txor because they have terms y^15+y^16 and y^11+y^12 in their generator polynomials, respectively.

The critical loop with the delayed input redrawn in Fig. 5 is described by

If we apply look-ahead pipelining to (1), we can obtain a twolevel pipelined version given by

The architecture for (2) is shown in Fig. 6. In Fig. 6, we can see that the loop bound in Fig. 5 has been reduced from 2Txor to Txor at the cost of two XOR gates and two flip-flops.

The CRC architecture in Fig. 4 can nowbe pipelined as shown in Fig. 7. In Fig. 7, we can see that the loop bounds of loop 1 and loop 2 are Txor and (5/8)Txor , respectively. So, the iteration bound of the two-level pipelined CRC architecture is Txor. If we apply higher levels of pipelining, as shown in (2), the loop bound of loop 1 will decrease; however, that of loop 2 will increase. For example, for three-level and four-level pipelining, the loop bound of loop 1 is (2/3)Txor and (1/2)Txor, respectively; however, that of loop 2 is (7/8)Txor and (9/8)Txor, respectively. Therefore, the iteration bounds of three-level and four-level pipelined CRC architectures are (7/8)Txor and (9/8)Txor, respectively. We can also see from this example that increasing the pipelining level will not necessarily reduce the iteration bound of the CRC architecture. For this example, the lowest iteration bound we can achieve is (7/8)Txor, when the pipelining level is three.

B. Improved Look-Ahead Pipelining (Method 2)

Consider the inner loop marked by the dashed square in Fig. 8. This loop can be redrawn as shown in Fig. 9 and can be represented by

If we apply look-ahead pipelining to (3), we can obtain a fourlevel pipelined realization of (3) as

The architecture for (4) is shown in Fig. 10. In Fig. 10, we can see that four-level pipelining can be achieved at the cost of two XOR gates and two flip-flops if the there are two delay elements in the initial loop. However, in the proposed pipelining method 1,

four-level pipelining can be achieved at the cost of four XOR gates and four flip-flops. The CRC architecture in Fig. 8 can be pipelined as shown in Fig. 11. In Fig. 11, we can see that the loop bounds of loop 1 and loop 2 are (1/2)Txor and (5/8)Txor, respectively. So, the iteration bound of the four-level pipelined CRC architecture

is (5/8)Txor. Use of four-level pipelining reduces the iteration bound of Fig. 8 from Txor to (5/8)Txor at the cost of two XOR gates and two delay elements. The efficiency of the improved pipelining method will be demonstrated in the succeeding discussions of this brief.

Loop 1 in Fig. 7 can be represented as shown in Fig. 9 and transformed as shown in Fig. 12, based on retiming and pipelining [5].

If we perform four-level pipelining to Fig. 4 by the proposed pipelining method 2, as shown in Fig. 12, we need four extra XOR gates in loop 2, instead of six, as needed by method 1. Thus, the loop bound of loop 2 is reduced from (9/8)Txor to (7/8)Txor , and the iteration bound is also reduced from (9/8)Txor to (7/8)Txor; two XOR gates are saved.

The following example illustrates the correctness of the proposed pipelining method 2: For the CRC architecture in Fig. 12(c), let the message sequence

be 101011010; Table I shows the data flow at the marked points of this architecture at different time slots. In Table I, we can see that the achieved four-level pipeline introduces a latency of three clock cycles. However, this is not a drawback, when the message bits are long.


After we apply pipelining to the original serial CRC architecture, the minimum achievable CP (iteration bound) of the unfolded CRC architecture is reduced. In this section, we carry out retiming for minimum CP to obtain fast parallel CRC architectures by using the example in Fig. 12(c). Applying three-parallel unfolding to Fig. 12(c), we obtain the design in Fig. 13, where all the numbered nodes 1,2,…. Represent XOR gates.

If the input sequence is 101011010, we can get the data flow of Fig. 13, as shown in Table II.We can see that four clock cycles are needed to encode a 9-bit message.

It is obvious that the CP of Fig. 13 is 5Txor . After we apply retiming to it, its CP can be reduced to 3Txor , which is shown in Fig. 14.

If the input sequence is still 101011010, we can get the data flow of Fig. 14, as shown in Table III. Compared with Table II, Table III shows a latency of one more clock cycle caused by retiming. One may be led to believe that three-parallel design does not process 3 bits of the message efficiently because encoding a 9-bit message requires five clock cycles. This is because we are just giving a simple example to illustrate our design. In real applications, the message length will be much longer than 9 bits. For example, if the message is 90 bits long, the CRC architecture in Fig. 14 will take 30+2 = 32 clock cycles to encode it. It is obvious that 90/32 is very closeto 3.

The example we used in Sections III and IV is not the best solution for a three-parallel implementation of G(y) = 1+y+y^8+y^9. In Fig. 14, we can see that the given solution requires 21 XOR gates to achieve a CP of 3Txor, by four-level pipelining.

However, two-level pipelining would be enough to get a CP of 3Txor, which leads to a three-parallel solution requiring only 15 XOR gates. The preceding example was only used to illustrate the complete procedure of how to apply our proposed design based on method 2.


We focus our discussion on the commonly used generator polynomials [6] that are shown in Table IV. By observing these polynomials, we can see that they all have many zero coefficients between the second and third highest order nonzero coefficients.This property is also applicable to the polynomial example

we used in Sections III and IV for G(y) = 1+y+y^8+y^9.

A comparison between previous high-speed CRC architectures and the proposed ones is shown in Table V for different parallelism levels of commonly used CRC generator polynomials. Tree structure is applied to further reduce the CP to 0(log(CP)).at the cost of some additional XOR gates.

In Table V, we can see that for all the listed commonly used generator polynomials, our design can maintain or reduce the CP. In addition to the fact that our design can process an m-bit message within almost the same number of clock cycles as previous designs, our design can speed up the throughput rate of

CRC architectures and control or even reduce the hardware cost. Our design requires more XOR gates for some generator polynomials and saves XOR gates for others. Our design also requires a small number of additional delay elements. In Table V, no optimization techniques such as identifying common subexpressions [2] are applied.

Note that as shown in Fig. 12, when the pipelining level increases, the loop bound of loop 2 will become larger than that of loop 1. Increasing the pipelining level will increase the loop bound of loop 2 and thus increase the iteration bound. The iteration bound of Fig. 12(c) can be further reduced if we move the XOR gate of y(n+2) from loop 2 to loop 1, which is shown in Fig. 15.

In Fig. 15, we can see that although moving the addition operation of y(n+2) increases the loop bound of loop 1 from 0.5Txor to (3/4)Txor , that of loop 2 is reduced to (3/4)Txor, and the iteration bound is also reduced from (7/8)Txor to (3/4)Txor . This iteration bound reduction method is also applied to the parallel implementation of the commonly used generator polynomials shown in Table V and

marked with* .

Our design can also be applied to other LFSR architectures, especially those with a generator polynomial that has many zero coefficients between the second and third highest order nonzero coefficients.


This chapter explain the Simulation Results and Analysis of various CRC architecture.

Each architecture is coded in Verilog and simulated.

The matrices are evaluated manually and verified using coding.

The simulation results and the netlist simulation are verified for each architecture.

For the message bits: 101011010 and for the generator polynomial 1+y+y8+y9

Table 5.1 Architecture and No. of cycles


No. of clk cycles

Original architecture


2-level pipelined


4-level pipelined


Retiming after pipelining


Unfolding the 4-level pipelined


Retiming the unfolded architecture


Table 5.2 Architecture and loop bounds


Iteration bound

Original architecture

2 Txor

2-level pipelined


4-level pipelined & retiming

7/8 Txor

Table 5.3 Architecture and Critical path


Critical path

Unfolding the 4-level pipelined

5 Txor

Retiming the unfolded architectue

3 Txor

The below screenshots are be the respective simulation results

 Serial Implementation of 1+y+y3+y5

Unfolded by factor 2 for CRC 1+y+y3+y5

 Unfolded by factor 3 for CRC 1+y+y3+y5

 CRC Architecture for 1+y+y8+y9

 Two level pipelined CRC for 1+y+y8+y9

CRC Architecture for 1+y+y7+y9

 Four level pipelined CRC for 1+y+y7+y9

 Retimed CRC Architecture for 1+y+y8+y9

 Pipelined CRC Architecture for 1+y+y8+y9

 Improved four-level pipelined CRC for 1+y+y8+y9

 Three-parallel CRC architecture after unfolding for 1+y+y8+y9

 Retimed three-parallel CRC architecture for 1+y+y8+y9


This project has verified pipelining method for high-speed parallel CRC Implementation. Pipelining has decreased the iteration bound of the architecture effectively. Applying unfolding technique to pipelined architecture increased the throughput of the circuit. Retiming the architecture reduced the critical path delay. so applying pipelining, unfolding and retiming to the CRC has increased the throughput to achieve high speed design.


[1] T.-B. Pei and C. Zukowski, "High-speed parallel CRC circuits in VLSI," IEEE Trans. Commun., vol. 40, no. 4, pp. 653-657, Apr. 1992.

[2] G. Campobello, G. Patané, and M. Russo, "Parallel CRC realization," IEEE Trans. Comput., vol. 52, no. 10, pp. 1312-1319, Oct. 2003.

[3] K. K. Parhi, "Eliminating the fanout bottleneck in parallel long BCH encoders," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 3, pp. 512-516, Mar. 2004.

[4] X. Zhang and K. K. Parhi, "High-speed architectures for parallel long BCH encoders," in Proc. ACM Great Lakes Symp. VLSI, Boston, MA, Apr. 2004, pp. 1-6.

[5] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Hoboken, NJ: Wiley, 1999.

[6] T. V. Ramabadran and S. S Gaitonde, "A tutorial on CRC computations,"IEEE Micro, vol. 8, no. 4, pp. 62-75, Aug. 1988.

[7] www.google.com

[8] www.xilinx.com/ise

[9] www.wikipedia.com