An Efficient Pipeline Execution

Essay added: 22-10-2015, 20:34

In this paper, we present an implementation of an optimized H.264 intra 4x4 algorithm that reduces the time required to complete the intra 4x4 process. The main source of wasted time in the conventional intra 4x4 architecture is the serialization of the intra predictions and reconstructions of the sixteen 4x4 blocks in one macroblock; this can be replaced by a pipelined architecture while remaining consistent with the standard. In this work, we study ten alternative scanning orders, obtained by rearranging the processing order of the 4x4 blocks, and choose the one that best reduces the dependencies between consecutively executed blocks without degrading performance. The best order is implemented as a pipelined architecture in VHDL. The VHDL code is verified to work at 100 MHz in an ALTERA Stratix II EP2S60F1020C3 FPGA.

As a result, the processing time is reduced by 31.25% compared to the conventional implementation, which makes the design a good solution for real-time video applications. The H.264 intra 4x4 hardware and software are demonstrated to work together on an ALTERA NIOS-II development board with a Stratix II EP2S60F1020C3 FPGA.

Index Terms: H.264; FPGA; Intra 4x4; scanning order; pipeline; embedded Linux; NIOS II.


The H.264 video compression standard was created to provide high compression efficiency using aggressive compression techniques such as spatial and temporal prediction [1]. In this work, we are interested in spatial prediction, which can be performed in two ways: intra 4x4 and intra 16x16. We focus on intra 4x4, which generates the predicted pixels for a macroblock (MB) by exploiting spatial redundancy.

The H.264 intra 4x4 algorithm achieves better coding results than the prediction algorithms used in previous video compression standards. However, these coding gains come with an increase in encoding complexity, which makes a real-time implementation of the H.264 intra 4x4 algorithm an exciting challenge. In intra 4x4 prediction, each 4x4 luma block can select one of nine prediction modes, which are shown in Fig.1. Extensive research efforts have been made to reduce the execution time without a significant increase in hardware resources.

Among them, several researchers have focused on the intra 4x4 processing order, which involves computationally expensive operations such as intra prediction, integer transform and quantization [2-5]. [2] proposes a pipelined execution of the intra prediction and reconstruction of 4x4 blocks. [3] presents an adaptive coefficient scanning method for intra mode in H.264; the proposed adaptive scanning uses six alternative scanning orders. [4] proposes a new processing order to reduce dependencies between consecutively executed blocks. [5] presents a novel 4x4 processing order that replaces the conventional order. In this paper, the optimal processing order of intra 4x4 is implemented. As a result, the computation time is decreased by 31.25% compared to the conventional order.

Our hardware is described in VHDL (VHSIC Hardware Description Language) and implemented together with the NIOS-II softcore processor in a single Stratix II EP2S60 FPGA (Field Programmable Gate Array) device. The remaining parts are performed in software on the NIOS-II softcore processor, with uClinux-dist, an embedded Linux, as the operating system.

Fig.1. 4x4 luma prediction modes

The rest of this paper is organized as follows. Section 2 gives an overview of intra 4x4 scanning orders. Section 3 analyses the implementation performance of the best order. Finally, conclusions are drawn in Section 4.


2.1. Conventional Order

In the H.264/AVC standard, the coding order of intra 4x4 is specified as shown in Fig.2 [6].

Fig.2. Conventional order of intra 4x4

Each box represents a 4x4 block and the number inside a box is its position in the processing order. For example, the block in the upper-left corner (labeled 0) is processed first and the block to its right (labeled 1) is processed next. To generate the predicted images in the encoding process, the 13 boundary pixels (A to M in Fig.1) from the reconstructed blocks are required. For example, to perform the intra prediction of block 9 in Fig.2, the pixels in blocks 2, 3, 6 and 8 are needed.
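The block 9 example can be checked with a short sketch. It assumes the usual position layout of the conventional order (rows 0 1 4 5 / 2 3 6 7 / 8 9 12 13 / 10 11 14 15) and only looks at neighbours inside the same macroblock; the function name `neighbour_blocks` is illustrative and does not come from the paper.

```python
# Conventional coding order of the sixteen 4x4 blocks, laid out by position:
# ORDER[row][col] is the processing index of the block at that position.
ORDER = [
    [0, 1, 4, 5],
    [2, 3, 6, 7],
    [8, 9, 12, 13],
    [10, 11, 14, 15],
]


def neighbour_blocks(block):
    """Blocks of the same macroblock that lie to the left, top-left, top and
    top-right of `block` and have already been processed (smaller index)."""
    position = {ORDER[r][c]: (r, c) for r in range(4) for c in range(4)}
    r, c = position[block]
    candidates = [(r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1)]
    found = []
    for i, j in candidates:
        # Skip positions outside the macroblock and blocks not yet coded.
        if 0 <= i < 4 and 0 <= j < 4 and ORDER[i][j] < block:
            found.append(ORDER[i][j])
    return sorted(found)


print(neighbour_blocks(9))  # [2, 3, 6, 8]
```

Neighbours that belong to adjacent macroblocks are deliberately ignored here, since they are already reconstructed before the current macroblock starts.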

Therefore, the intra prediction of block 9 cannot start until the reconstruction of the needed blocks is done. To reconstruct those blocks after intra prediction, additional operations such as integer transform (ICT), quantization (Q), inverse quantization (IQ) and inverse integer transform (IICT) must be performed. This dependency introduces bubble cycles into the prediction process and wastes hardware resources (Fig.3). Consequently, the total processing time for the sixteen 4x4 blocks in an MB is given by:

Ttotal = 16 t1 + 16 t2

where t1 and t2 denote the time required for the intra prediction and the reconstruction of one 4x4 block, respectively.

Fig.3. MB processing time of conventional order

For t1 = t2 = 13 cycles, the execution time of the conventional order for one MB is equal to 416 cycles.
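As a worked check of the 416-cycle figure, the fully serialized schedule can be written out block by block. This is a minimal sketch under the stated t1 = t2 = 13 cycles; the function name `serial_schedule` is hypothetical.

```python
def serial_schedule(t1=13, t2=13, blocks=16):
    """Return (block, prediction_start, reconstruction_end) for each block
    when every prediction waits for the previous block's reconstruction."""
    schedule = []
    clock = 0
    for block in range(blocks):
        prediction_start = clock
        reconstruction_end = prediction_start + t1 + t2
        schedule.append((block, prediction_start, reconstruction_end))
        clock = reconstruction_end  # next block starts only after this one ends
    return schedule


timeline = serial_schedule()
print(timeline[-1])  # (15, 390, 416)
```

The last tuple shows block 15 starting its prediction at cycle 390 and finishing its reconstruction at cycle 416, which matches 16 t1 + 16 t2.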

This order is poorly suited to hardware implementation because all operations are serialized. To achieve the best performance in terms of execution time, an order that allows pipelining can be selected.

2.2. Proposed order of intra 4x4

Fig.4. Different scanning orders of intra 4x4

Many intra scanning orders have been proposed in the literature [2-5]. All of these orders aim to reduce the intra processing time.

In this part of the paper, we describe and study ten scanning orders of intra 4x4 in order to select the best order for a high-speed implementation of the H.264 intra 4x4 algorithm (Fig.4).

Before implementing these ten orders, we test their performance in terms of sequence quality using an H.264/AVC software reference model written in C. We then compare the output results in order to choose the best order. The quality is measured for the test sequences "foreman", "mobile", "tempete" and "akiyo" using the peak signal-to-noise ratio (PSNR) metric with a fixed quantization parameter QP=26.

Table.1 Comparing PSNR performance of different orders

Scanning order                              PSNR (dB), one column per test sequence
Conventional order                          36.99   35.59   34.75   39.87
(G-J,J-S and H-JL) order [2]                37.01   35.59   34.75   39.87
(G-J,J-S and H-JL) optimised order [2]      36.93   35.57   34.73   39.87
Zig zag order [3]                           36.88   35.58   34.74   39.82
Vertical order [3]                          37.03   35.59   34.75   39.90
Horizontal order [3]                        37.01   35.59   34.75   39.87
Diagonal order [3]                          36.57   35.41   34.55   39.16
Vert-diag order [3]                         36.81   35.51   34.69   39.64
Horiz-diag order [3]                        36.82   35.54   34.69   39.59
New processing order [4]                    36.88   35.57   34.73   39.74
(K-Y,J-Hl and K-S) order [5]                37.03   35.59   34.75   39.91

From Table.1, we can conclude that the orders proposed by [2] (G-J,J-S and H-JL), [3] (Vertical order and Horizontal order) and [5] (K-Y,J-Hl and K-S) give good compression quality compared to the other orders. Next, we therefore study the processing time of the selected scanning orders.

Fig.4.1 shows the scanning order proposed by [2]. For the intra prediction of block 4, the reconstruction results of block 1 are needed to generate the predicted block.

Therefore, block 4 can start its intra prediction while block 2 is still being reconstructed. Fig.5 shows the pipelined intra 4x4 process. With this order, the intra 4x4 process has to wait only 5 times: 0→1, 1→2, 6→9, 13→14 and 14→15. Counting one further t2 for the reconstruction of the last block, which cannot be overlapped with a prediction, the total processing time is given by:

Ttotal = 16 t1 + (5 + 1) t2 = 286 cycles

Fig.5. Pipelining of the (G-J,J-S and H-JL) order [2]

For the vertical order presented in Fig.4.4, the intra 4x4 process has to wait 11 times: 0→1, 1→2, 4→5, 3→6, 6→7, 8→9, 9→12, 12→13, 10→11, 11→14 and 14→15, as shown in Fig.6. The total processing time is given by:

Ttotal = 16 t1 + (11 + 1) t2 = 364 cycles

Fig.6. Pipelining of the Vertical order [3]

For the horizontal order presented in Fig.4.5, the intra 4x4 process has to wait 10 times: 0→2, 8→10, 3→9, 9→11, 4→6, 6→12, 12→14, 5→7, 7→13 and 13→15, as shown in Fig.7. The total processing time is given by:

Ttotal = 16 t1 + (10 + 1) t2 = 351 cycles

Fig.7. Pipelining of the Horizontal order [3]

Finally, for the last order presented in Fig.4.10, the intra 4x4 process has to wait only 6 times: 0→1, 1→2, 5→6, 9→10, 13→14 and 14→15, as shown in Fig.8. The total processing time is given by:

Ttotal = 16 t1 + (6 + 1) t2 = 299 cycles

Fig.8. Pipelining of the (K-Y, J-Hl and K-S) order [5]

We conclude that the (G-J,J-S and H-JL) order [2] and the (K-Y, J-Hl and K-S) order [5] take the fewest clock cycles, so they are the best-optimized orders, with a time reduction of 31.25% for the first and 28.125% for the second compared to the conventional order.
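The cycle counts and percentages for the four orders can be reproduced with a few lines of arithmetic. The sketch below assumes a two-stage pipeline in which each wait costs one t2 and the reconstruction of the final block adds one more t2 that no prediction can hide; under that assumption it matches every figure quoted in this section. The function names are illustrative.

```python
CONVENTIONAL_CYCLES = 416  # 16*t1 + 16*t2 with t1 = t2 = 13


def pipelined_cycles(waits, t1=13, t2=13, blocks=16):
    """Cycles per macroblock when `waits` block transitions must wait for the
    previous reconstruction. Assumes t2 <= t1, so the reconstruction stage
    never becomes the bottleneck, and adds one t2 for the last block."""
    return blocks * t1 + (waits + 1) * t2


def reduction_percent(cycles, baseline=CONVENTIONAL_CYCLES):
    """Time saved relative to the conventional order, in percent."""
    return 100.0 * (baseline - cycles) / baseline


# Number of waits per order, as listed in the text.
waits_per_order = {
    "(G-J,J-S and H-JL) [2]": 5,
    "Vertical [3]": 11,
    "Horizontal [3]": 10,
    "(K-Y,J-Hl and K-S) [5]": 6,
}

for name, waits in waits_per_order.items():
    cycles = pipelined_cycles(waits)
    print(f"{name}: {cycles} cycles, {reduction_percent(cycles):.3f}% faster")
```

With 15 waits, that is, every transition stalled, the same formula gives back the 416 cycles of the conventional order, which is a useful sanity check on the model.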


3.1. Intra 4x4 Architecture

The block diagram of the proposed hardware architecture for H.264 video coding is shown in Fig.9 [7]. The intra 4x4 module is composed of four blocks: intra prediction 4x4, coding chain, reconstruction 4x4 and control unit. The intra prediction block calculates the predicted 4x4 block of pixels for all nine intra prediction modes specified in H.264, based on the reconstructed pixels from previous 4x4 blocks.

Fig.9. Architecture of intra 4x4

It also calculates the difference (residual) and the absolute difference between the predicted pixels and the actual pixels for each prediction mode.

Summing these absolute values gives the sum of absolute differences (SAD) for each prediction mode. The block compares the SAD values of all prediction modes and selects the mode with the lowest value. Finally, it outputs the predicted pixels, the pixel residuals and the SAD of the selected prediction mode.

This block is able to process one 4x4 block every 13 clock cycles. The second block, the coding chain, is composed of the 4x4 integer transform, 4x4 quantization, 4x4 inverse quantization and 4x4 inverse integer transform. The reconstruction 4x4 block sums the predicted block and the coefficients produced by the coding chain to give the reconstructed block.

This sub-block receives sixteen predicted pixels and sixteen coding chain coefficients and produces sixteen reconstructed pixels. The coding chain and the reconstruction of a 4x4 block together also take 13 clock cycles. The control unit produces four strobes used to capture the design state at different cycles as data moves through the design, and finally produces the intra4x4_ready output flag to indicate to the downstream block that the outputs are valid [7].
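The SAD-based mode decision can be illustrated in software. The sketch below is a simplified model, not the VHDL design: it implements only three of the nine modes (vertical, horizontal and DC), assumes the top and left neighbours are all available, and uses hypothetical names (`predict`, `sad`, `select_mode`).

```python
def predict(mode, top, left):
    """Build a 4x4 predicted block from the four reconstructed pixels above
    (`top`) and the four to the left (`left`) of the current block."""
    if mode == "vertical":    # each column copies the pixel above it
        return [list(top) for _ in range(4)]
    if mode == "horizontal":  # each row copies the pixel to its left
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":          # every pixel is the rounded mean of the neighbours
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"unsupported mode: {mode}")


def sad(actual, predicted):
    """Sum of absolute differences between two 4x4 blocks."""
    return sum(abs(a - p)
               for row_a, row_p in zip(actual, predicted)
               for a, p in zip(row_a, row_p))


def select_mode(actual, top, left, modes=("vertical", "horizontal", "dc")):
    """Return (mode, predicted, residual, sad) for the lowest-SAD mode."""
    best = None
    for mode in modes:
        predicted = predict(mode, top, left)
        cost = sad(actual, predicted)
        if best is None or cost < best[3]:
            residual = [[a - p for a, p in zip(row_a, row_p)]
                        for row_a, row_p in zip(actual, predicted)]
            best = (mode, predicted, residual, cost)
    return best


top = [10, 20, 30, 40]
left = [12, 12, 12, 12]
actual = [[10, 20, 30, 41] for _ in range(4)]
mode, predicted, residual, cost = select_mode(actual, top, left)
print(mode, cost)  # vertical 4
```

In this example the block is a near copy of the row above it, so the vertical mode wins with a SAD of 4, against 228 for horizontal and 172 for DC.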
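The reconstruction step itself is a per-pixel addition. The sketch below assumes that the coding chain output has already been inverse quantized and inverse transformed back into residual samples, and that pixels are 8-bit, so sums are clipped to the range 0 to 255; both points are assumptions of this illustration, and the function name `reconstruct` is hypothetical.

```python
def reconstruct(predicted, residual, max_value=255):
    """Add the decoded residual to the predicted 4x4 block and clip each
    result to the valid pixel range [0, max_value]."""
    return [
        [min(max(p + r, 0), max_value) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(predicted, residual)
    ]


predicted = [[250, 128, 3, 60] for _ in range(4)]
residual = [[10, -8, -5, 0] for _ in range(4)]
print(reconstruct(predicted, residual)[0])  # [255, 120, 0, 60]
```

The first and third pixels show the clipping at work: 260 saturates to 255 and -2 saturates to 0.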

3.2. Implementation Results

The results discussed in this section are based on a hardware implementation of the intra 4x4 scanning orders for H.264 video coding, tested on the ALTERA NIOS-II development board with a Stratix II EP2S60F1020C3 FPGA. This FPGA provides 48352 adaptive look-up tables (ALUTs), 2484 KB of embedded memory blocks, 288 DSP blocks, 8 PLLs and 719 pins [8]. The whole design is described in VHDL (RTL level), fitted into the FPGA and works with a 100 MHz system clock.

Table.3. Execution time and synthesis results comparison


Scanning order          [6]     [2]     [5]
Number of cycles/MB     416     286     299
Logic elements          34%     34%     34%

Table.3 summarizes these results for the best orders identified in the previous section. The table shows that the hardware resource utilization is the same for the different orders as for the conventional order. The scanning order proposed in [2] takes fewer clock cycles than those presented in [3] and [5]. Simulations verified the improvement in clock cycle count for the intra 4x4 operation, as shown in Table.3. This hardware achieves a higher performance for the H.264 video encoder than the hardware architectures presented in [6] and [5].

For experimental verification, we wrote a C-language software reference model of the H.264 video encoder [9]. We compared the output of our C reference model with that of the JM 10.1 model [10] and confirmed the correctness of our model.

We used the NIOS-II softcore processor to send data to the intra 4x4 coprocessor. Our embedded system was tested on the Altera NIOS-II development board, whose core is the Altera Stratix II EP2S60F1020C3 FPGA.

For all experiments, we focused on the following video test sequences: "Foreman", "Tempete", "Mobile" and "Akiyo". These test sequences have different motion and camera characteristics. The quality of the sequences is measured by the peak signal-to-noise ratio (PSNR), which indicated the same image quality for the SW and HW solutions.
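For reference, the PSNR used in this comparison can be computed as follows. This is a generic sketch of the standard definition for 8-bit samples (peak value 255), not the code of the reference model; the function name `psnr` is illustrative.

```python
import math


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames,
    given as lists of rows of pixel values."""
    squared_error = 0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            squared_error += (o - r) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical frames
    mse = squared_error / count
    return 10 * math.log10(peak ** 2 / mse)


frame = [[100, 101], [102, 103]]
off_by_one = [[101, 102], [103, 104]]
print(round(psnr(frame, off_by_one), 2))  # 48.13
```

Two implementations that produce bit-identical reconstructed frames give exactly the same PSNR against the source sequence, which is the property relied on when comparing the SW and HW solutions.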


In this paper, a high-speed H.264/AVC intra 4x4 processing order for video conference applications was proposed. By rearranging the scanning order, the data dependency was decreased. With the proposed scanning order, intra 4x4 can be pipelined with a suitable hardware implementation without increasing hardware utilization.

The processing time of the pipelined process is reduced by up to 31.25% compared with the standard process. We have also designed an embedded system based on an Altera Stratix II FPGA platform in order to evaluate the performance of our design. The system shows that the sequence quality is the same as with the software solution.
