P. Pirsch et al., "VLSI Architectures for Image Communications," 2000 CRC Press LLC, <http://www.engnetbase.com>.
VLSI Architectures for Image Communications

59.6 Programmable Architectures: Intensive Pipelined Architectures • Parallel Data Paths • Coprocessor Concept
59.7 Conclusion • Acknowledgment • References
59.1 Introduction
Video processing has been a rapidly evolving field for the telecommunications, computer, and media industries. In particular, a growing economic significance is expected for real-time video compression applications over the next years. Besides digital TV broadcasting and videophone, services such as multimedia education, teleshopping, or video mail will become audiovisual mass applications.

To facilitate worldwide interchange of digitally encoded audiovisual data, there is a demand for international standards defining coding methods and transmission formats. International standardization committees have been working on the specification of several compression schemes. The Joint Photographic Experts Group (JPEG) of the International Standards Organization (ISO) has specified an algorithm for compression of still images [4]. The ITU proposed the H.261 standard for video telephony and video conferencing [1]. The Motion Pictures Experts Group (MPEG) of ISO has completed its first standard, MPEG-1, which will be used for interactive video and provides a picture quality comparable to VCR quality [2]. MPEG has made substantial progress on the second phase of standards, MPEG-2, which will provide the audiovisual quality of both broadcast TV and HDTV [3]. Besides the availability of international standards, the successful introduction of the named services depends on the availability of VLSI components supporting a cost-efficient implementation of video compression applications. In the following, we give a short overview of recent coding schemes and discuss implementation alternatives. Furthermore, the efficiency estimation of architectural alternatives is discussed, and implementation examples of dedicated and programmable architectures are presented.
59.2 Recent Coding Schemes
Recent video coding standards are based on a hybrid coding scheme that combines transform coding and predictive coding techniques. An overview of these hybrid encoding schemes is depicted in Fig. 59.1.
FIGURE 59.1: Hybrid encoding and decoding scheme
The encoding scheme consists of the tasks motion estimation, typically based on block-matching algorithms, computation of the prediction error, discrete cosine transform (DCT), quantization (Q), variable length coding (VLC), inverse quantization (Q^−1), and inverse discrete cosine transform (IDCT or DCT^−1). The reconstructed image data are stored in an image memory for further predictions. The decoder performs the tasks variable length decoding (VLC^−1), inverse quantization, and motion compensated reconstruction.
Generally, video processing algorithms can be classified in terms of regularity of computation and data access. This classification leads to three classes of algorithms:

• Low-Level Algorithms — These algorithms are based on a predefined sequence of operations and a predefined amount of data at the input and output. The processing sequence of low-level algorithms is predefined and does not depend on the values of the data processed. Typical examples of low-level algorithms are block matching or transforms such as the DCT.

• Medium-Level Algorithms — The sequence and number of operations of medium-level algorithms depend on the data. Typically, the amount of input data is predefined, whereas the amount of output data varies according to the input data values. With respect to hybrid coding schemes, examples of these algorithms are quantization, inverse quantization, or variable length coding.

• High-Level Algorithms — High-level algorithms are associated with a variable amount of input and output data and a data-dependent sequence of operations. As for medium-level algorithms, the sequence of operations is highly data dependent. Control tasks of the hybrid coding scheme can be assigned to this class.
Since hybrid coding schemes are applied for different video source rates, the required absolute processing power varies in the range from a few hundred MOPS (Mega Operations Per Second) for video signals in QCIF format to several GOPS (Giga Operations Per Second) for the processing of TV or HDTV signals. Nevertheless, the relative computational power of each algorithmic class is nearly independent of the processed video format. In the case of hybrid coding applications, approximately 90% of the overall processing power is required for low-level algorithms. The amount of medium-level tasks is about 7%, and nearly 3% is required for high-level algorithms.
59.3 Architectural Alternatives
In terms of a VLSI implementation of hybrid coding applications, two major requirements can be identified. First, the high computational power requirements have to be provided by the hardware. Second, low manufacturing cost of video processing components is essential for the economic success of an architecture. Additionally, implementation size and architectural flexibility have to be taken into account.

Implementations of video processing applications can either be based on standard processors from workstations or PCs, or on specialized video signal processors. The major advantage of standard processors is their availability. Applying these architectures to the implementation of video processing hardware does not require the time-consuming design of new VLSI components. The disadvantage of this implementation strategy is the insufficient processing power of recent standard processors. Video processing applications would still require the implementation of cost-intensive multiprocessor systems to meet the computational requirements. To achieve compact implementations, video processing hardware has to be based on video signal processors adapted to the requirements of the envisaged application field.
Basically, two architectural approaches for the implementation of specialized video processing components can be distinguished. Dedicated architectures aim at an efficient implementation of one specific algorithm or application. Due to the restriction of the application field, the architecture of dedicated components can be optimized by an intensive adaptation of the architecture to the requirements of the envisaged application, e.g., the arithmetic operations that have to be supported, processing power, or communication bandwidth. Thus, this strategy will generally lead to compact implementations. The major disadvantage of dedicated architectures is the associated low flexibility. Dedicated components can only be applied for one or a few applications. In contrast to dedicated approaches with limited functionality, programmable architectures enable the processing of different algorithms under software control. The particular advantage of programmable architectures is the increased flexibility. Changes of architectural requirements, e.g., due to changes of algorithms or an extension of the aimed application field, can be handled by software changes. Thus, a generally cost-intensive redesign of the hardware can be avoided. Moreover, since programmable architectures cover a wider range of applications, they can be used for low-volume applications, where the design of function-specific VLSI chips is not an economical solution.
For both architectural approaches, the computational requirements of video processing applications demand the exploitation of the algorithm-inherent independence of the basic arithmetic operations to be performed. Independent operations can be processed concurrently, which enables a decrease of processing time and thus an increased throughput rate. For the architectural implementation of concurrency, two basic strategies can be distinguished: pipelining and parallel processing.

In the case of pipelining, several tasks, operations, or parts of operations are processed in subsequent steps in different hardware modules. Depending on the selected granularity level for the implementation of pipelining, intermediate data of each step are stored in registers, register chains, FIFOs, or dual-port memories. Assuming a processing time of T_P for a non-pipelined processor module and T_D,IM for the delay of intermediate memories, we get in the ideal case the following estimation for the throughput rate R_T,Pipe of a pipelined architecture applying N_Pipe pipeline stages:

    R_T,Pipe = N_Pipe / (T_P + N_Pipe · T_D,IM)    (59.1)

In the case of parallel processing, independent operations are assigned to concurrently operating hardware units, each delivering a result every T_P, which gives in the ideal case:

    R_T,Par = N_Par / T_P    (59.2)

where N_Par = number of parallel units.
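As a numerical sanity check, the two ideal-case estimates can be evaluated directly. The sketch below assumes the stage-time argument above (a pipeline cycle takes T_P/N_Pipe plus the register delay T_D,IM) and one result per T_P from each parallel unit; the example figures are illustrative, not from the text.

```python
# Hedged sketch: ideal-case throughput of pipelining vs. parallel processing.

def throughput_pipelined(t_p, t_d_im, n_pipe):
    # Cycle time = T_P/N_Pipe + T_D,IM, hence R_T,Pipe = N_Pipe/(T_P + N_Pipe*T_D,IM)
    return n_pipe / (t_p + n_pipe * t_d_im)

def throughput_parallel(t_p, n_par):
    # N_Par units, each delivering one result every T_P
    return n_par / t_p

# Example: a 100 ns module split into 4 stages, 2 ns register delay
print(throughput_pipelined(100e-9, 2e-9, 4))  # ~37.0e6 results/s
print(throughput_parallel(100e-9, 4))         # 40.0e6 results/s
```

Note that the register delay T_D,IM keeps the pipelined throughput below the ideal N_Pipe-fold speedup, whereas parallel processing pays instead with an N_Par-fold replication of hardware.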
Generally, both alternatives are applied for the implementation of high-performance video processing components. In the following sections, the exploitation of algorithmic properties and the application of architectural concurrency are discussed considering the hybrid coding schemes.

59.4 Efficiency Estimation of Alternative VLSI Implementations
Basically, architectural efficiency can be defined by the ratio of performance over cost. To achieve a figure of merit for architectural efficiency, we assume in the following that the performance of a VLSI architecture can be expressed by the achieved throughput rate R_T, and that the cost is equivalent to the required silicon area A_Si for the implementation of the architecture:

    E = R_T / A_Si    (59.3)
Besides the architecture, efficiency mainly depends on the applied semiconductor technology and the design style (semi-custom, full-custom). Therefore, a realistic efficiency estimation has to consider the gains provided by the progress in semiconductor technology. A sensible way is the normalization of the architectural parameters according to a reference technology. In the following, we assume a reference process with a gate length λ0 = 1.0 micron. For normalization of silicon area, the following equation can be applied:

    A_Si,0 = A_Si · (λ0 / λ)^2    (59.4)

where the index 0 is used for the system with reference gate length λ0.
According to [7], the normalization of throughput can be performed by:

    R_T,0 = R_T · (λ / λ0)^1.6    (59.5)

Combining Eq. (59.3) through Eq. (59.5), the normalized efficiency becomes:

    E_0 = R_T,0 / A_Si,0 = E · (λ / λ0)^3.6    (59.6)

E can be used for the selection of the best architectural approach out of several alternatives. Moreover, assuming a constant efficiency for a specific architectural approach leads to a linear relationship between throughput rate and silicon area, and this relationship can be applied for the estimation of the required silicon area for a specific application. Due to the power of 3.6 in Eq. (59.6), the semiconductor technology chosen for the implementation of a specific application has a significant impact on the architectural efficiency.
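The normalization is easy to apply in practice. The sketch below assumes a throughput exponent of 1.6, the value implied by the quadratic area scaling of Eq. (59.4) together with the quoted power of 3.6 for the efficiency; the 0.5 µm design figures are illustrative only.

```python
# Sketch of the technology normalization to a 1.0 um reference process.
# The throughput exponent 1.6 is an assumption implied by
# area ~ lambda^2 and efficiency ~ lambda^3.6.

def normalize(a_si_mm2, r_t, lam_um, lam0_um=1.0):
    a_0 = a_si_mm2 * (lam0_um / lam_um) ** 2   # area normalization, Eq. (59.4)
    r_0 = r_t * (lam_um / lam0_um) ** 1.6      # throughput normalization
    return a_0, r_0

# A hypothetical design in a 0.5 um process: 50 mm^2, 100 Mpel/s
a0, r0 = normalize(50.0, 100.0, 0.5)
e0 = r0 / a0          # normalized efficiency E_0
e = 100.0 / 50.0      # raw efficiency E = R_T / A_Si
# Consistency with the power of 3.6: E_0 = E * (lambda/lambda_0)^3.6
assert abs(e0 - e * 0.5 ** 3.6) < 1e-12
```

The final assertion shows how the two per-parameter exponents combine into the overall power of 3.6 on the efficiency.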
In the following, examples of dedicated and programmable architectures for video processing applications are presented. Additionally, the discussed efficiency measure is applied to achieve a figure of merit for silicon area estimation.
59.5 Dedicated Architectures
Due to their algorithmic regularity and the high processing power required for the discrete cosine transform and motion estimation, these algorithms are the first candidates for a dedicated implementation. As typical examples, alternatives for a dedicated implementation of these algorithms are discussed in the following.

The discrete cosine transform (DCT) is a real-valued frequency transform similar to the discrete Fourier transform (DFT). When applied to an image block of size L × L, the two-dimensional DCT (2D-DCT) can be expressed as follows:

    Y_k,l = (2/L) · c(k) · c(l) · Σ_{i=0..L−1} Σ_{j=0..L−1} x_i,j · cos[(2i+1)kπ/(2L)] · cos[(2j+1)lπ/(2L)]    (59.7)

with c(0) = 1/√2 and c(k) = 1 for k ≠ 0, where

(i, j) = coordinates of the pixels in the initial block
(k, l) = coordinates of the coefficients in the transformed block
x_i,j = value of the pixel in the initial block
Y_k,l = value of the coefficient in the transformed block
Computing a 2D-DCT of size L × L directly according to Eq. (59.7) requires L^4 multiplications and L^4 additions.
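The L^4 operation count is visible in a direct software rendering of Eq. (59.7): each of the L^2 output coefficients needs an L^2-term sum. The sketch below assumes the common orthonormal 2D-DCT definition with c(0) = 1/√2.

```python
import math

# Direct O(L^4) evaluation of the 2D-DCT
# (orthonormal form assumed: c(0) = 1/sqrt(2), c(k) = 1 otherwise).
def dct2_direct(x):
    L = len(x)
    c = lambda k: math.sqrt(0.5) if k == 0 else 1.0
    y = [[0.0] * L for _ in range(L)]
    for k in range(L):
        for l in range(L):
            acc = 0.0
            for i in range(L):          # L^2 multiply-accumulates ...
                for j in range(L):      # ... per output coefficient
                    acc += (x[i][j]
                            * math.cos((2 * i + 1) * k * math.pi / (2 * L))
                            * math.cos((2 * j + 1) * l * math.pi / (2 * L)))
            y[k][l] = (2.0 / L) * c(k) * c(l) * acc
    return y

# A constant 8x8 block concentrates all energy in the DC coefficient Y[0][0]
y = dct2_direct([[1.0] * 8 for _ in range(8)])
```

For L = 8 this is 4096 multiplications per block, which motivates the reduced-complexity schemes discussed next.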
The required processing power for the implementation of the DCT can be reduced by the exploitation of the arithmetic properties of the algorithm. The two-dimensional DCT can be separated into two one-dimensional DCTs according to Eq. (59.8):

    Y_k,l = √(2/L) · c(k) · Σ_{i=0..L−1} cos[(2i+1)kπ/(2L)] · [ √(2/L) · c(l) · Σ_{j=0..L−1} x_i,j · cos[(2j+1)lπ/(2L)] ]    (59.8)

In the separated implementation of Fig. 59.2, two one-dimensional processor arrays are applied, with the intermediate results available at the output of each array. The results of the 1D-DCT have to be reordered for the second 1D-DCT stage. For this purpose, a transposition memory is used. Since both one-dimensional processor arrays require identical DCT coefficients, these coefficients are stored in a common ROM.

FIGURE 59.2: Separated DCT implementation according to [9].
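The row-column decomposition behind Fig. 59.2 can be sketched as two passes of a 1D-DCT with a transposition in between, cutting the multiply count from L^4 to 2L^3 (the common orthonormal normalization is assumed here).

```python
import math

# Row-column (separated) 2D-DCT: 1D-DCT on rows, transpose, 1D-DCT again.
# Reduces L^4 multiplications to 2*L^3.

def dct1(v):
    L = len(v)
    c = lambda k: math.sqrt(0.5) if k == 0 else 1.0
    return [math.sqrt(2.0 / L) * c(k)
            * sum(v[i] * math.cos((2 * i + 1) * k * math.pi / (2 * L))
                  for i in range(L))
            for k in range(L)]

def dct2_separated(x):
    rows = [dct1(list(r)) for r in x]            # first 1D-DCT stage (rows)
    cols = [dct1(list(col)) for col in zip(*rows)]  # transposition + 2nd stage
    return [list(r) for r in zip(*cols)]         # transpose back to (k, l) order
```

The intermediate transpose plays the role of the transposition memory of Fig. 59.2, and both stages reuse the same cosine coefficients, mirroring the shared coefficient ROM.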
Moving from a mathematical definition to an algorithm that minimizes the number of calculations required is a problem of particular interest in the case of transforms such as the DCT. The 1D-DCT can also be expressed by the matrix-vector product:

    [Y] = [C] · [X]    (59.9)

where [C] is an L × L matrix and [X] and [Y] are the input and output vectors (8-point vectors for L = 8). As an example, with θ = π/16, the 8-point DCT matrix can be written as denoted in Eq. (59.10).
The matrix of Eq. (59.10):

    cos 4θ   cos 4θ   cos 4θ   cos 4θ   cos 4θ   cos 4θ   cos 4θ   cos 4θ
    cos θ    cos 3θ   cos 5θ   cos 7θ  −cos 7θ  −cos 5θ  −cos 3θ  −cos θ
    cos 2θ   cos 6θ  −cos 6θ  −cos 2θ  −cos 2θ  −cos 6θ   cos 6θ   cos 2θ
    cos 3θ  −cos 7θ  −cos θ   −cos 5θ   cos 5θ   cos θ    cos 7θ  −cos 3θ
    cos 4θ  −cos 4θ  −cos 4θ   cos 4θ   cos 4θ  −cos 4θ  −cos 4θ   cos 4θ
    cos 5θ  −cos θ    cos 7θ   cos 3θ  −cos 3θ  −cos 7θ   cos θ   −cos 5θ
    cos 6θ  −cos 2θ   cos 2θ  −cos 6θ  −cos 6θ   cos 2θ  −cos 2θ   cos 6θ
    cos 7θ  −cos 5θ   cos 3θ  −cos θ    cos θ   −cos 3θ   cos 5θ  −cos 7θ

After a first stage of sums and differences of mirrored inputs, the product splits into the even-coefficient matrix of Eq. (59.11):

    cos 4θ   cos 4θ   cos 4θ   cos 4θ
    cos 2θ   cos 6θ  −cos 6θ  −cos 2θ
    cos 4θ  −cos 4θ  −cos 4θ   cos 4θ
    cos 6θ  −cos 2θ   cos 2θ  −cos 6θ

and the odd-coefficient matrix of Eq. (59.12):

    cos θ    cos 3θ   cos 5θ   cos 7θ
    cos 3θ  −cos 7θ  −cos θ   −cos 5θ
    cos 5θ  −cos θ    cos 7θ   cos 3θ
    cos 7θ  −cos 5θ   cos 3θ  −cos θ
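The even/odd decomposition can be checked numerically: the sketch below forms the sums and differences of mirrored inputs and applies the two 4 × 4 matrices, reproducing the full 8 × 8 product. The unscaled entries cos((2j+1)kθ) of Eq. (59.10) are assumed, with no output scaling.

```python
import math

TH = math.pi / 16  # theta in Eq. (59.10)

def dct8_full(x):
    # [Y] = [C][X] with C[k][j] = cos((2j+1)*k*theta), the unscaled matrix
    return [sum(math.cos((2 * j + 1) * k * TH) * x[j] for j in range(8))
            for k in range(8)]

def dct8_even_odd(x):
    # First stage: sums and differences of mirrored inputs
    s = [x[j] + x[7 - j] for j in range(4)]
    d = [x[j] - x[7 - j] for j in range(4)]
    y = [0.0] * 8
    for k in (0, 2, 4, 6):  # even-coefficient 4x4 matrix, Eq. (59.11)
        y[k] = sum(math.cos((2 * j + 1) * k * TH) * s[j] for j in range(4))
    for k in (1, 3, 5, 7):  # odd-coefficient 4x4 matrix, Eq. (59.12)
        y[k] = sum(math.cos((2 * j + 1) * k * TH) * d[j] for j in range(4))
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(dct8_full(x), dct8_even_odd(x)))
```

The identity rests on the symmetry of the even rows and the antisymmetry of the odd rows of [C], which halves the number of multiplications per output.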
Another approach that has been extensively used is based on the technique of distributed arithmetic. Distributed arithmetic is an efficient way to compute the DCT totally or partially as scalar products. To illustrate the approach, let us compute a scalar product between two length-M vectors C and X:

    Y = Σ_{i=0..M−1} c_i · x_i    (59.13)

With the x_i given as B-bit two's-complement fractions, x_i = −x_{i,0} + Σ_{j=1..B−1} x_{i,j} · 2^−j, the scalar product can be rewritten as:

    Y = −C_0 + Σ_{j=1..B−1} C_j · 2^−j,  with  C_j = Σ_{i=0..M−1} c_i · x_{i,j}    (59.14)
FIGURE 59.3: Lee FDCT flowgraph for the one-dimensional 8-point DCT [10].
The change of summing order in i and j characterizes the distributed arithmetic scheme, in which the initial multiplications are distributed to another computation pattern. Since the term C_j has only 2^M possible values (which depend on the x_{i,j} values), it is possible to store these 2^M possible values in a ROM. An input set of M bits {x_{0,j}, x_{1,j}, x_{2,j}, ..., x_{M−1,j}} is used as an address, allowing retrieval of the C_j value. These intermediate results are accumulated over B clock cycles to produce one Y value. Figure 59.4 shows a typical architecture for the computation of an M-input inner product. The inverter and the MUX are used for inverting the final output of the ROM in order to compute C_0.
FIGURE 59.4: Architecture of an M-input inner product using distributed arithmetic.
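The scheme can be simulated in software. The sketch below is a software model of the datapath of Fig. 59.4, not the hardware itself: it precomputes the 2^M-entry ROM of partial sums, addresses it with one bit plane of the inputs per cycle, and accumulates over B cycles, negating the sign-bit contribution as the inverter/MUX does.

```python
# Software model of the distributed-arithmetic inner product: a 2^M-entry
# ROM of partial sums C_j, addressed by one bit of each input per cycle,
# accumulated over B cycles with the sign-bit term subtracted.

def da_inner_product(c, xs, B=8):
    """Y = sum(c[i]*xs[i]) for B-bit two's-complement integers xs."""
    M = len(c)
    # ROM entry at 'addr' holds the sum of the coefficients whose bit is set
    rom = [sum(c[i] for i in range(M) if (addr >> i) & 1)
           for addr in range(1 << M)]
    patterns = [x & ((1 << B) - 1) for x in xs]   # two's-complement bit patterns
    y = 0
    for j in range(B):                            # one ROM access per bit plane
        addr = sum(((patterns[i] >> j) & 1) << i for i in range(M))
        term = rom[addr] << j
        y += -term if j == B - 1 else term        # MSB plane is the sign bit
    return y

assert da_inner_product([3, -2, 5, 1], [10, -7, 3, -128]) == -69
```

In hardware, the left shift is realized by the shift-accumulator, so the whole M-term inner product costs one ROM access and one addition per bit of precision, with no multiplier at all.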
Figure 59.5 illustrates two typical uses of distributed arithmetic for computing a DCT. Figure 59.5(a) implements the scalar products described by the matrix of Eq. (59.10). Figure 59.5(b) takes advantage of a first stage of additions and subtractions and the scalar products described by the matrices of Eq. (59.11) and Eq. (59.12).
Properties of several dedicated DCT implementations have been reported in [6]. Figure 59.6 shows the silicon area as a function of the throughput rate for selected design examples. The design parameters are normalized to a fictive 1.0 µm CMOS process according to the discussed normalization strategy. As a figure of merit, a linear relationship between throughput rate and required silicon area can be derived:

    A_Si,0 ≈ 0.5 (mm² · s)/Mpel · R_T,0    (59.15)
FIGURE 59.5: Architecture of an 8-point one-dimensional DCT using distributed arithmetic. (a) Pure distributed arithmetic. (b) Mixed D.A.: first stage of flowgraph decomposition of 8 points followed by 2 times 4 scalar products of 4 points.

Equation (59.15) can be applied for the silicon area estimation of DCT circuits. For example, assuming TV signals according to the CCIR-601 format and a frame rate of 25 Hz, the source rate equals 20.7 Mpel/s. As a figure of merit from Eq. (59.15), a normalized silicon area of about 10.4 mm² can be derived. For HDTV signals, the video source rate equals 110.6 Mpel/s, and approximately 55.3 mm² of silicon area is required for the implementation of the DCT. Assuming an economically sensible maximum chip size of about 100 mm² to 150 mm², we can conclude that the implementation of the DCT does not necessarily require the realization of a dedicated DCT chip; the DCT core can be combined with several other on-chip modules that perform additional tasks of the video coding scheme.
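The two numeric examples follow directly from the linear model of Fig. 59.6. The helper below is hypothetical (not from the text); its slope of 0.5 mm² per Mpel/s is the ratio implied by the quoted figures.

```python
# Hypothetical helper applying the linear area model derived from Fig. 59.6:
# roughly 0.5 mm^2 of normalized silicon per Mpel/s of source rate
# (slope inferred from the 20.7 Mpel/s -> ~10.4 mm^2 example).

MM2_PER_MPEL_S = 0.5

def dct_area_mm2(source_rate_mpel_s):
    return MM2_PER_MPEL_S * source_rate_mpel_s

print(dct_area_mm2(20.7))    # CCIR-601 at 25 Hz: ~10.4 mm^2
print(dct_area_mm2(110.6))   # HDTV: ~55.3 mm^2
```

Estimates of this kind are only first-order figures of merit: they assume the constant efficiency underlying Eq. (59.15) and a design normalized to the 1.0 µm reference process.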
FIGURE 59.6: Normalized silicon area and throughput for dedicated DCT circuits.

For motion estimation, several techniques have been proposed in the past. Today, the most important technique for motion estimation is block matching, introduced by [21]. Block matching is based on the matching of blocks between the current and a reference image. This can be done by a full (or exhaustive) search within a search window, but several other approaches have been reported in order to reduce the computation requirements by using an "intelligent" or "directed" search [17, 18, 19, 23, 25, 26, 27].
In the case of an exhaustive search block matching algorithm, a block of size N × N pels of the current image (the reference block, denoted X) is matched with all the blocks located within a search window (the candidate blocks, denoted Y). The maximum displacement will be denoted by w. The matching criterion generally consists in computing the mean absolute difference (MAD) between the blocks. Let x(i, j) be the pixels of the reference block and y(i, j) the pixels of the candidate block. The matching distance (or distortion) D is computed according to Eq. (59.16), where the indexes m and n indicate the position of the candidate block within the search window:

    D(m, n) = Σ_{i=0..N−1} Σ_{j=0..N−1} |x(i, j) − y(m + i, n + j)|,  −w ≤ m, n ≤ w    (59.16)

The distortion D is computed for all the (2w + 1)² possible positions of the candidate block within the search window, and the block corresponding to the minimum distortion is used for prediction. The position of this block within the search window is represented by the motion vector v:

    v = (m, n) such that D(m, n) = D_MIN    (59.17)
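A direct software rendering of the exhaustive search of Eq. (59.16) and Eq. (59.17) can look as follows. The layout of the search window is an assumption for this sketch: an (N + 2w) × (N + 2w) array whose top-left corner corresponds to displacement (−w, −w).

```python
# Exhaustive-search block matching: evaluate the distortion D(m, n) for all
# (2w+1)^2 displacements and return the motion vector of the minimum.
# The search window is an (N+2w) x (N+2w) array; its top-left corner
# corresponds to displacement (-w, -w) of the reference block.

def full_search(x, window, w):
    N = len(x)
    d_min, v = None, None
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = sum(abs(x[i][j] - window[m + i + w][n + j + w])
                    for i in range(N) for j in range(N))  # MAD sum, Eq. (59.16)
            if d_min is None or d < d_min:
                d_min, v = d, (m, n)                      # track the minimum, Eq. (59.17)
    return v, d_min

# Toy example: a 2x2 reference block that matches the window at offset (1, 1)
win = [[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 5, 6],
       [0, 0, 7, 8]]
v, d = full_search([[5, 6], [7, 8]], win, 1)
assert v == (1, 1) and d == 0
```

The triple loop makes the (2w + 1)² · N² operation count explicit; the architectures discussed next parallelize the inner N² sum in hardware.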
The operations involved in computing D(m, n) and D_MIN are associative. Thus, the order of exploring the index spaces (i, j) and (m, n) is arbitrary, and the block matching algorithm can be described by several different dependence graphs. As an example, Fig. 59.7 shows a possible dependence graph (DG) for w = 1 and N = 4. In this figure, AD denotes an absolute difference and an addition, and M denotes a minimum value computation.

FIGURE 59.7: Dependence graphs of the block matching algorithm. The computations of v(X, Y) and D(m, n) are performed by 2-D linear DGs.
The dependence graph for computing D(m, n) is directly mapped onto a 2-D array of processing elements (PE), while the dependence graph for computing v(X, Y) is mapped into time (Fig. 59.8).
In other words, block matching is performed by a sequential exploration of the search area, while the computation of each distortion is performed in parallel. Each of the AD nodes of the DG is implemented by an AD processing element (AD-PE). The AD-PE stores the value of x(i, j) and receives the value of y(m + i, n + j) corresponding to the current position of the reference block in the search window. It performs the subtraction and the absolute value computation, and adds the result to the partial result coming from the upper PE. The partial results are added along the columns, and a linear array of adders performs the horizontal summation of the row sums and computes D(m, n). For each position (m, n) of the reference block, the M-PE checks whether the distortion D(m, n) is smaller than the previous smallest distortion value and, in this case, updates the register which keeps the previous smallest distortion value.
To transform this naive architecture into a realistic implementation, two problems must be solved: (1) a reduction of the cycle time, and (2) the I/O management.

1. The architecture of Fig. 59.8 implicitly supposes that the computation of D(m, n) can be done combinatorially in one cycle time. While this is theoretically possible, the resulting cycle time would be very large and would increase as 2N. Thus, a pipeline scheme is generally added.

2. This architecture also supposes that each of the AD-PEs receives a new value of y(m + i, n + j) at each clock cycle.
FIGURE 59.8: Principle of the 2-D block-based architecture
Since transmitting the N² values from an external memory is clearly impossible, advantage must be taken of the fact that these values belong to the search window. A portion of the search window of size N × (2w + N) is stored in the circuit, in a 2-D bank of shift registers able to shift in the up, down, and right directions. Each of the AD-PEs has one of these registers and can, at each cycle, obtain the value of y(m + i, n + j) that it needs. To update this register bank, a new column of 2w + N pixels of the search area is serially entered into the circuit and inserted in the bank of registers. A mechanism must also be provided for loading a new reference block with a low I/O overhead: a double buffering of x(i, j) is required, with the pixels x′(i, j) of a new reference block serially loaded during the computation of the current reference block (Fig. 59.9).
Figure 59.10 shows the normalized computational rate vs. normalized chip area for block matching circuits. Since one MAD operation consists of three basic ALU operations (SUB, ABS, ADD), for a 1.0 micron CMOS process we can derive from this figure that:

The first term of this expression indicates that the block matching algorithm requires a large storage area (storage of parts of the actual and previous frames), which cannot be reduced even when the