Block Matching Motion Estimation Algorithm Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract- The most popular method used for Motion Estimation is the Block Matching Algorithm. It computes the Motion Vector (MV) between the adjacent frames by estimating the absolute difference of the pixel values from current frame and the search area from the previous frame. In this work architecture for Block Matching Algorithm is designed that estimates the amount of motion on a block-by-block basis based on Mean Absolute Difference (MAD). In order to test Motion Estimation in a video coding system an Error Detection and Correction Architecture (EDCA) is also designed based on the Residue-and-Quotient (RQ) code which can detect multiple bit error. The Block matching algorithm with EDCA employing single PE have a reduced area of about 1268 from 2776 that is obtained for the BMA with EDCA employing 16 PEs.

Keyword- Processing Element, Block-Matching Algorithm, Mean-Absolute Difference (MAD), Error detection, Error correction, Motion estimation, Residue-and-quotient (RQ) code.


Some of the Video Compression standards include MPEG-1, MPEG-2 and MPEG-4. The advanced Video Coding standard is MPEG-4. Video compression is essential in various applications to reduce the total amount of data required for transmitting or storing the video data. Motion Estimation (ME) is of priority concern in removing the temporal redundancy between the successive frames in a video coding system and also it consumes more time. The most popular method used for Motion Estimation is the Block Matching Algorithm. It processes the successive frames and computes the Motion Vector (MV) based on the Mean-Absolute Difference (MAD) values.

Among the various motion estimation algorithms, block matching algorithm proves to be simple and efficient with its architecture. This method effectively estimates the motion vector between adjacent frames that is used in the video coding systems. Motion estimation process plays a vital role in Video coding systems as it eliminates the temporal redundancy between the frames. ME is considered as the intensive unit in terms of computation in any video coding system.

Testing of PEs in a ME is essential as an error in PE affects the video quality and signal-to-noise ratio. Numerous PEs in a ME can be tested concurrently using Concurrent Error Detection (CED) methods. The testing process is done concurrently with the motion estimation process. This can considerably reduce the computation time and concurrent process improves the efficiency. Logic built-in self-test (BIST) is a design for testability technique in which a portion of a circuit on a chip, board, or system is used to test the digital logic circuit itself [1]. BIST technique for testing logic circuits can be online or offline. Online BIST is performed when the functional circuitry is in normal operational mode. In concurrent online BIST, testing is conducted simultaneously during normal functional operation. The functional circuitry is usually implemented with coding techniques [2].

The full search block-matching algorithm consists of shift register arrays whose length can be varied. The inputs are fed serially which reduces the pin count. The array like arrangement is advantageous because of its regularity and flexibility in local communication between the PEs. It is flexible in adapting to the dimensional change in the search area.

The rest of this paper is organized as follows. Section II introduces the Block matching algorithm. Next, Section III illustrates the testing architecture that is concurrently processed with the algorithm. Section IV evaluates the performance of the architecture. Conclusions are finally drawn in Section V.



The most popularly used method for Motion Estimation is the block matching algorithm. This method estimates the amount of motion on a block-by-block basis. Figure 1 shows the block matching process. The current frame is divided into blocks of pixels of size n n [3]. A block of pixels called the reference block from the current frame is compared with the corresponding block called the candidate block in the search area of size (n2δ) (n2δ) from the previous frame, where p is the displacement allowed. The motion vector is obtained when the best match candidate block is found.

The basic idea of the full search block matching algorithm is to search all possible displaced candidate blocks within the search area in the previous frame to find the best match block. The concept of Mean Absolute difference (MAD) is used as a matching criterion. It is given as

where R (i , j) is the reference block of size n n at coordinates (i , j), S (ix, jy) is the candidate block within a search area in the previous frame and (x,y) represents the candidate displacement vector. The motion vector is determined by the least MAD (x,y) for all possible displacements.

The architecture of the full search block matching algorithm consists of three main stages namely, 1) the shift register array and adder, 2) a control unit and 3) best match selection unit. The input data are fed in serially and the array of processing element calculates the absolute difference and they added up using the adder to compute the mean absolute difference. The output from the adder is given to best match selection unit. The control unit is used to adjust the dimension of the search area.

2.1., Array implementation of PE

The processing elements (PEs) are arranged in the form of an array to speed up the block matching process [4]. Figure 2 gives an example of the search area and reference block and Figure 3 shows an example to illustrate the operation of the system. The search area consists of an 88 value and the reference block consists of a 44 value [3].

Each pixel value from the search area is fed serially from left to right and top to bottom to the simplified system. The function of processing element can be given as follows. First, the reference block values R (i, j) is stored in the registers. Next, the search area values are shifted to right neighbor PEi,j-1. The absolute difference value is calculated for the pixel values stored and result is added to obtain the mean absolute difference value.

The reference block of n n requires n2 processing elements and (2p1)(n1) registers for a displacement of δ. The data are inputted serially but the operation performed in the processing element is parallel which increases the speed of the algorithm.

2.2., Block matching architecture

The basic architecture of the block matching algorithm shown in Figure 4 consists of array formed by n2 processing elements PEi,j, (n1) shift register arrays (SRAi), parallel adder to compute the mean absolute difference, a best match selection unit and a control unit [3]. The pixel data of the reference block is fed to the R input line and the search area pixel value is fed to the S input line. First the reference pixel values {R (i, j), 1} are shifted to the PE array and stored before the first search pixel value arrives at the PE1, 1. The time for loading the reference block is saved because the whole reference block data can be shifted in during the period of time when the first search pixel is shifted from PEn, n to PE1, 1.Two clock signals are used in the architecture namely, tc and tr. Here tc is the system clock and tr is used to control the shift operation of the reference block data.

2.3., PE block

The processing element block shown in Figure 5 consists of the reference block shift register (RBSR) and a block to evaluate the absolute difference (ADE) and an adder to add the AD value. The ADE (Absolute difference evaluator) block calculates the difference between the search pixel value shifted from the left neighbor PEi,j+1 and the reference pixel R(i,j) stored in the RBSR. The latch L3 is used to provide a unit delay after which the AD value is added to the result shifted from the down neighbor PEi-1, j.

The output from this is delayed by the latch L1 and given to the neighbor above PEi+1, j. The search pixel value is delayed by one unit by the latch L2 and then given to the right neighbor PEi, j-1.

2.4., Shift Register Arrays

The shift registers array structure are connected to the processing element and size of which can be changed. Based on the dimension of the search area i.e., (n2δ) (n2δ) the shift register array can be programmed. The pixel values are shifted from the search area to the processing elements and then to the next neighboring PE after passing through the shift register array. The shift operation is initiated by the clock signal.

2.5., Best Match Selection Unit


Figure 6 shows the block diagram of the best match selection unit. The Absolute difference value computed from the processing element is added up in the parallel adder to compute the Mean absolute difference value. This MAD (x, y) value is sent to the BMSU where it is compared with the current champion that is stored in the champion register. The current champion is decided by the block with least MAD value. If the value from the parallel adder is smaller the current champion then the present value will be stored in the champion register and the corresponding position is stored in the row and column registers.

2.6., Control Unit

The control unit block is shown in Figure 7. It consists of the control code generator which produces the control signal. The length of the shift register arrays can be changed based on the control signal that is generated [5]. It is controlled by the value (n2δ) that is stored in the register. To change the dimension of the search area the suitable value is stored in the register. A signal h is generated to indicate that the matching process of one block is finished and motion vector can be determined.


Testing of ME is essential as an error in ME process can cause degradation in the video quality in any video coding systems. Earlier codes like Parity Codes, Berger Codes were used for detecting a single bit error. The residue code which evolved next is generally a separable arithmetic code to estimate the residue of the given data and appending it to the data [7]. This code is capable of detecting a single bit error. Additionally, error correction is not possible by using the residue codes. Therefore, this work presents a quotient code [6] to assist the residue code in detecting multiple bit errors and recovering errors.

The RQ code for X modulo m is expressed as and, respectively. Notably, denotes the largest integer not exceeding i.

Figure 8 shows the Error Detection and correction architecture (EDCA) design for testing that can be used concurrently with block matching algorithm. A PE generally consists of two ADDs (i.e. an 8-b ADD and a 12-b ADD) and an accumulator (ACC) [2]. Next, the 8-b ADD (a pixel has 8-b data) is used to estimate the addition of the current pixel (cur_pixel) and reference pixel (Ref_pixel). Additionally, a 12-b ADD and an ACC are required to accumulate the results from the 8-b ADD in order to determine the sum of absolute difference (SAD) value for video encoding application.

3.1., Test Code Generation

TCG is the main block of the proposed EDCA design. Test Code Generation (TCG) design is based on the ability of the RQCG circuit to generate corresponding test codes in order to detect errors and recovers data. The specific PEi estimates the absolute difference between the Cur_pixel of the search area and the Ref_pixel of the current macroblock. The SAD value for the macroblock with size of NxN can be evaluated:

3.2., Error Detection Process

Error Detection Circuit (EDC) is used to perform the operation of error detection in the specific PEi as shown in Figure 8. This block is used to compare the output from the TCG i.e., (RT and QT) with output from RQCG1 i.e., (RPEi and QPEi), in order to detect the occurrence of an error. If the value RPEi ≠ RT and QPEi ≠ QT, then the error in the PEi can be detected. The EDC output is then generated as a 0/1 signal to indicate that the tested PEi is error free/ with error.

3.3., Error Correction Process

The original data is recovered in the Data Recovery Circuit (DRC) during error correction process, by separating the RQ code from the TCG. The data recovery is possible by implementing the mathematical model as,


The Block Matching algorithm is simulated and synthesized using Mentor Graphics with target device as EP2S180F1508C from the family Stratix II. The reference block of size 4x4 and search area of size 8x8 is taken for the simulation. Table 1 shows the area and power for the overall block matching motion estimation algorithm and the parameter analysis of various blocks.

Testing architecture employing Residue-and-quotient code is used in the existing error detection architecture that has 16 Processing Elements (PEs) and 16 Test Code Generation (TCG) blocks for computing the test codes for each pixel value in the macroblock [6]. Though this architecture can detect multiple bit errors, the area of the architecture is high. To overcome this disadvantage, the testing architecture is used which has a single PE and a TCG for computing the test codes for the pixel values.

The performance of the EDCA design is estimated in terms of its area and power. In this architecture, the pixel values in the 44 macroblock are taken out for the code generation based on the clocking signal. At each triggering edge of the clock the 8-bit pixel value from the ref_pixel and cur_pixel is taken and given to the TCG and PE blocks for code generation. The generated code from test code generation block and the RQCG output is compared in the EDC block, output of which indicates the error. As the code generation for each 8 bit pixel value is done based on the clock signal, the need for separate PE and TCG for each 16 elements of the 44 macroblock is avoided. By this the overall area and power of the testing architecture is reduced. This testing architecture is employed in the block matching algorithm. Table 2 shows the parameter analysis of EDCA design.

The overall performance of the Motion estimation algorithm integrated with the testing architecture is improved by employing the EDCA testing architecture with single processing into the Block matching algorithm.

Table 3 shows the area comparison of the BMA algorithm integrated with EDCA employing 16 PEs and the BMA integrated with testing architecture employing single PE. The area of the BMA employing EDCA with single PE has lesser area in terms of logic elements.


The Block matching motion estimation process is done in the video coding system to find the motion vector pointing to the best prediction macroblock in a reference frame or field. An error occurring in ME can cause degradation in the video quality. In order to make the coding system efficient an Error Detection and Correction Architecture (EDCA) is designed for detecting the errors and recovering the data which is employed in the BMA design. Based on the RQ code, a RQCG-based TCG design is developed to generate the corresponding test codes that can detect multiple bit errors and recover data. The Block matching algorithm employing error detection and correction architecture is implemented using VHDL and synthesized by using Mentor Graphics with the target device as EP2S180F1508C from the family Stratix II. Experimental results indicate that the proposed EDCA design with single PE can effectively detect the errors and recover data in the BMA with reduced area and power. The Block matching algorithm with EDCA employing single PE have a reduced area of about 1268 from 2776 that is obtained for the BMA with EDCA employing 16 PEs.


The authors would like to thank All India Council for Technical Education (AICTE), India, to financially support this work under the Grant 8023/BOR/RID/ RPS-48/2009-10. The authors would also like to thank the Management and Principal of Sri Ramakrishna Engineering College, Coimbatore for providing excellent computing facility and encouragement.