Crypto Processor For High Throughput Computer Science Essay


The Advanced Encryption Standard (AES) has received significant interest over the past decade due to its performance and security level, and many hardware implementations have been proposed. In this paper we propose a parallel pipelined AES architecture that achieves higher throughput. The subkeys required for each round of the Rijndael algorithm are generated in real time by the key-scheduler module, which expands the initial secret key; this reduces the amount of storage needed for buffering round keys.

The large and growing number of internet and wireless communication users has led to an increasing demand for security measures and devices that protect user data transmitted over open channels. Two types of cryptographic systems can be used for that purpose: symmetric-key cryptosystems and asymmetric-key cryptosystems. Symmetric-key cryptography (DES, 3DES and AES) uses the same key for both encryption and decryption. Asymmetric-key cryptography (RSA and elliptic curve cryptography) uses different keys for encryption and decryption. The major disadvantage of DES is its short key length of only 56 bits. In November 2001, the National Institute of Standards and Technology (NIST) of the United States adopted the Rijndael algorithm as the Advanced Encryption Standard (AES) to replace the DES algorithm.

AES encryption is considered efficient in both hardware and software implementations. Hardware implementations of the AES algorithm have been presented for FPGA [8], [9] and ASIC [6], [7]. In this paper we present a parallel pipelined AES architecture that achieves higher throughput.

The rest of the paper is organized as follows. Section 2 describes the basic AES algorithm. Section 3 describes the novel on-the-fly key expansion module. Section 4 describes the pipeline design. Section 5 compares our work with related implementations. Finally, Section 6 concludes the paper.

The AES algorithm is a symmetric block cipher that processes 128-bit data blocks using a cipher key of 128, 192, or 256 bits. It is an iterative algorithm: each iteration is called a round, and the total number of rounds, Nr, is 10, 12, or 14 when the key length is 128, 192, or 256 bits, respectively. Table 1 shows the number of rounds as a function of key length.
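The relation between key length and round count can be written compactly as Nr = Nk + 6, where Nk is the number of 32-bit words in the key. The following is a minimal Python sketch of that relation; the function name num_rounds is an illustrative assumption, not part of the proposed design.

```python
def num_rounds(key_bits):
    """Return the number of AES rounds Nr for a key length given in bits."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key length must be 128, 192, or 256 bits")
    nk = key_bits // 32  # Nk: number of 32-bit words in the cipher key
    return nk + 6        # Nr = Nk + 6 for the fixed 128-bit block size


for bits in (128, 192, 256):
    print(bits, num_rounds(bits))
```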

The 128-bit data block is divided into 16 bytes. These bytes are mapped to a 4x4 array called the State, and all the internal operations of the AES algorithm are performed on the State. Each byte in the State is denoted by S[i,j] (0 ≤ i, j < 4) and is considered an element of GF(2^8). Although different irreducible polynomials can be used to construct GF(2^8), the irreducible polynomial used in the AES algorithm is p(x) = x^8 + x^4 + x^3 + x + 1. Figure 1 shows the block diagram of the AES encryption and the equivalent decryption structures.
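As an illustration of the byte-to-State mapping, the sketch below assumes the column-by-column ordering defined in FIPS-197, where input byte number r + 4c is placed in row r, column c. The helper names bytes_to_state and state_to_bytes are hypothetical.

```python
def bytes_to_state(block):
    """Map a 16-byte block to a 4x4 State (list of rows), filled column by column."""
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte blocks")
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state):
    """Inverse mapping: read the State back out column by column."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


block = bytes(range(16))
state = bytes_to_state(block)
print(state[0])  # first row holds input bytes 0, 4, 8, 12
```

With this layout, each column of the State corresponds to one 32-bit word of the input, which is the unit that MixColumns and the key schedule operate on.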

After an initial round key addition, a round function consisting of four transformations (SubBytes, ShiftRows, MixColumns, and AddRoundKey) is applied to the data block in the encryption procedure. The decryption procedure applies the inverse transformations in reverse order. The last round of encryption omits MixColumns and contains only SubBytes, ShiftRows, and AddRoundKey. Likewise, the first round of decryption contains only InvSubBytes, InvShiftRows, and AddRoundKey.
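The round structure described above can be summarized as a list of transformation names per round. This is only a structural sketch in Python; the function name encryption_schedule is an assumption made for illustration and performs no actual encryption.

```python
def encryption_schedule(nr):
    """List the transformations applied in each step of AES encryption."""
    full_round = ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    final_round = ["SubBytes", "ShiftRows", "AddRoundKey"]  # no MixColumns
    schedule = [["AddRoundKey"]]                   # initial round key addition
    schedule += [list(full_round) for _ in range(nr - 1)]
    schedule.append(final_round)
    return schedule


for index, steps in enumerate(encryption_schedule(10)):
    print(index, steps)
```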

ShiftRows is a simple shifting transformation. The first row of the State does not change, while the second, third and fourth rows cyclically shift one byte, two bytes and three bytes to the left, respectively.

Figure 2. Shift rows transformation
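A minimal Python sketch of ShiftRows follows, with the State held as a list of four rows. The inverse, which rotates each row to the right by the same offset, is included for comparison. The function names are illustrative assumptions.

```python
def shift_rows(state):
    """Rotate row r of the State left by r bytes (row 0 is unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row r of the State right by r bytes, undoing shift_rows."""
    return [row[4 - r:] + row[:4 - r] for r, row in enumerate(state)]


state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(shift_rows(state))
```

In hardware this transformation costs no logic, since it is only a rewiring of byte positions.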

2.3. MixColumns transformation:

The MixColumns() transformation operates on the State column by column, treating each column as a four-term polynomial. The columns are considered polynomials over GF(2^8) and are multiplied modulo x^4 + 1 with a fixed polynomial a(x), given by

a(x) = {03}x^3 + {01}x^2 + {01}x + {02}.

The function xtime() represents multiplication by {02} modulo the irreducible polynomial p(x) = x^8 + x^4 + x^3 + x + 1. It is implemented as a one-bit left shift followed by a conditional XOR with {1B}. Figure 3 shows the MixColumns module. In matrix form, the MixColumns transformation of a column (s0, s1, s2, s3) can be expressed as

[s0']   [02 03 01 01] [s0]
[s1'] = [01 02 03 01] [s1]
[s2']   [01 01 02 03] [s2]
[s3']   [03 01 01 02] [s3]

AddRoundKey involves only bitwise XOR operations between the State and the round key.
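The xtime-based computation of MixColumns, together with AddRoundKey, can be sketched in Python as follows. Multiplication by {03} is obtained as xtime(x) XOR x, so only shifts and XORs are needed. The function names and the row-list representation of the State are assumptions made for this illustration, not the authors' Verilog modules.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8): shift left, conditionally XOR 0x1B."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mix_single_column(col):
    """Apply the MixColumns matrix to one 4-byte column."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ (xtime(s1) ^ s1) ^ s2 ^ s3,  # 02*s0 + 03*s1 + s2 + s3
        s0 ^ xtime(s1) ^ (xtime(s2) ^ s2) ^ s3,  # s0 + 02*s1 + 03*s2 + s3
        s0 ^ s1 ^ xtime(s2) ^ (xtime(s3) ^ s3),  # s0 + s1 + 02*s2 + 03*s3
        (xtime(s0) ^ s0) ^ s1 ^ s2 ^ xtime(s3),  # 03*s0 + s1 + s2 + 02*s3
    ]


def mix_columns(state):
    """Apply mix_single_column to every column of a State stored as rows."""
    cols = [mix_single_column([state[r][c] for r in range(4)]) for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]


def add_round_key(state, round_key):
    """XOR the State with a round key of the same 4x4 shape."""
    return [[state[r][c] ^ round_key[r][c] for c in range(4)] for r in range(4)]


print([format(b, "02x") for b in mix_single_column([0xDB, 0x13, 0x53, 0x45])])
```

Because AddRoundKey is a plain XOR, applying it twice with the same key restores the original State, which is why the same operation serves both encryption and decryption.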

The transformations in the decryption process perform the inverse of the corresponding transformations in the encryption process. In InvShiftRows, the first row of the State does not change, while the remaining rows are cyclically shifted to the right by the same offsets as in ShiftRows. InvMixColumns multiplies the polynomial formed by each column of the State with a^-1(x) modulo x^4 + 1, where

a^-1(x) = {0B}x^3 + {0D}x^2 + {09}x + {0E}.

The decryption structure can be derived by inverting the encryption structure directly. However, the sequence of transformations then differs from that in encryption, which prevents resource sharing between encryptors and decryptors. As can be observed from the operations involved in the decryption transformations, InvShiftRows and InvSubBytes can be exchanged without affecting the decryption process. Meanwhile, InvMixColumns can be moved before AddRoundKey, provided that InvMixColumns is also applied to the round keys before they are added. Taking these into consideration, the equivalent decryption structure shown in Fig. 1(b) can be used. In this figure, the mixroundkeys are the modified round keys that result from applying InvMixColumns to the round keys. The equivalent decryption structure has the same sequence of transformations as the encryption structure, and thus resource sharing between encryptors and decryptors is enabled.
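Moving InvMixColumns before AddRoundKey works because InvMixColumns is linear over XOR: InvMixColumns(s XOR k) = InvMixColumns(s) XOR InvMixColumns(k). The Python sketch below checks this property on one column. It assumes the standard FIPS-197 inverse coefficients {0E}, {0B}, {0D}, {09}; the function names and the sample column values are chosen only for illustration.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return result


# Rows of the InvMixColumns matrix (standard AES coefficients).
INV_ROWS = [
    (0x0E, 0x0B, 0x0D, 0x09),
    (0x09, 0x0E, 0x0B, 0x0D),
    (0x0D, 0x09, 0x0E, 0x0B),
    (0x0B, 0x0D, 0x09, 0x0E),
]


def inv_mix_column(col):
    """Apply the InvMixColumns matrix to one 4-byte column."""
    out = []
    for row in INV_ROWS:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


def xor_columns(a, b):
    """Byte-wise XOR of two columns (AddRoundKey on a single column)."""
    return [x ^ y for x, y in zip(a, b)]


state_col = [0x8E, 0x4D, 0xA1, 0xBC]  # illustrative State column
key_col = [0x2B, 0x7E, 0x15, 0x16]    # illustrative round key column

key_then_mix = inv_mix_column(xor_columns(state_col, key_col))
mix_then_key = xor_columns(inv_mix_column(state_col), inv_mix_column(key_col))
print(key_then_mix == mix_then_key)
```

The second ordering is the one used by the equivalent decryption structure: the transformed key column plays the role of a mixroundkey and can be prepared once per round key.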


In the AES algorithm, the key expansion module generates the round key for every round. There are two approaches to providing round keys: one is to pre-compute and store all the round keys, and the other is to produce them on the fly. The first approach consumes more area. In the second approach, the initial key is divided into Nk words (key[0], key[1], …, key[Nk-1]), which are used as the initial words. The rest of the words are generated from the initial key iteratively. Nk is 4, 6, or 8 when the key length is 128, 192, or 256 bits, respectively. Each round key has 128 bits and is formed by concatenating four words:

Roundkey(i) = {w[4i], w[4i+1], w[4i+2], w[4i+3]}.



[Figure: block labels X, Sbox(Rot(Y)), Rcon[i], Y]

Figure 4. Data path for key generator

The key expansion process can be described by the pseudocode listed below:

for i = 0 to Nk-1
    w[i] = key[i]

for i = Nk to 4(Nr + 1)-1
    temp = w[i-1]
    if (i mod Nk = 0)
        temp = SubWord(RotWord(temp)) XOR Rcon(i/Nk)
    else if (Nk > 6 and i mod Nk = 4)
        temp = SubWord(temp)
    w[i] = w[i-Nk] XOR temp
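The key expansion can also be sketched as runnable Python. This is a software illustration under stated assumptions, not the authors' Verilog key scheduler: the S-box is derived at start-up from its standard definition (multiplicative inverse followed by the affine map) instead of being stored as a table, and all function names are hypothetical. The example key is the 128-bit key from Appendix A.1 of FIPS-197.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):  # XOR with left rotations by 1..4 bits
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def rot_word(word):
    """Rotate a 32-bit word left by one byte."""
    return ((word << 8) & 0xFFFFFFFF) | (word >> 24)


def sub_word(word):
    """Apply the S-box to each of the four bytes of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def expand_key(key):
    """Return the 4*(Nr+1) expanded key words for a 16-, 24-, or 32-byte key."""
    if len(key) not in (16, 24, 32):
        raise ValueError("key must be 16, 24, or 32 bytes")
    nk = len(key) // 4
    nr = nk + 6
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 0x01
    for i in range(nk, 4 * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)  # next round constant
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)      # extra SubWord step for 256-bit keys
        w.append(w[i - nk] ^ temp)
    return w


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
words = expand_key(key)
print(len(words), format(words[4], "08x"), format(words[43], "08x"))
```

Since each new word depends only on the previous word and the word Nk positions back, an on-the-fly hardware generator needs to hold just the most recent Nk words instead of the full expanded key.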



The pipelined AES encryption design is shown in Figure 5. Pipeline registers are inserted between every round.

Figure 5. AES encryption with pipelining

A plaintext block is received at each clock cycle through the input register. A single round of the algorithm is completed in one clock cycle. Round keys are generated by the key expansion module and supplied to each round. At each clock cycle the data is shifted to the next stage, so the first output appears only at the end of the tenth clock cycle. Because this is a pipelined structure, the second output follows immediately in the next clock cycle. Internally, each round contains SubBytes, ShiftRows, MixColumns, and AddRoundKey (the final round omits MixColumns), which are explained in the previous sections.
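The timing behaviour described above can be illustrated with a small cycle-level model in Python. It tracks only which block occupies which stage, with one register slot per round, and performs no encryption. The function name simulate_pipeline and the block labels are assumptions made for this sketch.

```python
def simulate_pipeline(blocks, stages=10):
    """Return the clock cycle at which each block leaves the last stage."""
    pipeline = [None] * stages  # one slot per round stage
    pending = list(blocks)
    finished = {}
    cycle = 0
    while len(finished) < len(blocks):
        cycle += 1
        incoming = pending.pop(0) if pending else None
        # Every block advances one stage; a new block enters the first stage.
        pipeline = [incoming] + pipeline[:-1]
        if pipeline[-1] is not None:
            finished[pipeline[-1]] = cycle  # last round done in this cycle
    return finished


print(simulate_pipeline(["P0", "P1", "P2"]))
```

The model shows a latency of ten cycles for the first block and one completed block per cycle afterwards, so the steady-state throughput of such a pipeline is 128 bits multiplied by the clock frequency.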


A study reported in [6] is an ASIC implementation. The design is implemented in a 0.18-µm standard CMOS technology. The total gate count of the design is 173k, and only encryption is implemented. The chip operates at 125 MHz and achieves a throughput of 2.29 Gbit/s. Another ASIC implementation is presented in [7], in which only the 128-bit key versions of the AES finalists are implemented, using Mitsubishi Electric's 0.35-µm CMOS ASIC design libraries. The Rijndael chip uses 612,834 gates, with a critical path delay of 65.64 ns and a throughput of 1.95 Gbit/s. The encryption of a data block is completed in one clock cycle.
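The throughput quoted for [7] is consistent with one 128-bit block being produced per critical-path delay. A short Python check of that arithmetic follows; the function name throughput_gbps is an illustrative assumption.

```python
def throughput_gbps(block_bits, seconds_per_block):
    """Throughput in Gbit/s when one block completes every seconds_per_block."""
    return block_bits / seconds_per_block / 1e9


# One 128-bit block per 65.64 ns critical path, as reported for [7].
print(round(throughput_gbps(128, 65.64e-9), 2))
```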

There are some FPGA implementations using pipeline architectures. However, total throughput must be divided by the number of pipeline stages when the chip is used in the feedback modes of operation specified by NIST. An FPGA implementation in [8] achieves a pipelined throughput of 12.2 Gbit/s using both inner and outer round pipelining. Another pipelined FPGA implementation in [9] achieves 6.95 Gbit/s. Other FPGA implementations in [10], [11], and [12] achieve throughput values between 137 Mbit/s (non-pipelined) and 3.65 Gbit/s (pipelined).
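The penalty in feedback modes arises because each block depends on the previous ciphertext, so only one pipeline stage does useful work at a time. A minimal Python sketch of the resulting division is given below. The 10 Gbit/s rate and the 10-stage depth are purely hypothetical values chosen for illustration and are not taken from any of the cited designs.

```python
def feedback_mode_throughput(pipelined_gbps, stages):
    """Effective throughput when only one pipeline stage is busy at a time."""
    if stages < 1:
        raise ValueError("a pipeline needs at least one stage")
    return pipelined_gbps / stages


# Hypothetical design: 10 Gbit/s fully pipelined, 10 pipeline stages.
print(feedback_mode_throughput(10.0, 10))
```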


In this paper, we presented a hardware implementation of an efficient pipelined AES architecture that includes both encryption and decryption. The parallel architecture also helped us obtain higher throughput than earlier implementations. The design is implemented in Verilog HDL and simulated with ModelSim. Synthesis is performed with LeonardoSpectrum.