P. Pirsch et al., "VLSI Architectures for Image Communications," 2000 CRC Press LLC, <http://www.engnetbase.com>.
VLSI Architectures for Image Communications

59.6 Programmable Architectures: Intensive Pipelined Architectures • Parallel Data Paths • Coprocessor Concept
59.7 Conclusion • Acknowledgment • References
59.1 Introduction
Video processing has been a rapidly evolving field for the telecommunications, computer, and media industries. In particular, a growing economic significance is expected for real-time video compression applications over the next years. Besides digital TV broadcasting and videophone, services such as multimedia education, teleshopping, or video mail will become audiovisual mass applications.

To facilitate worldwide interchange of digitally encoded audiovisual data, there is a demand for international standards defining coding methods and transmission formats. International standardization committees have been working on the specification of several compression schemes. The Joint Photographic Experts Group (JPEG) of the International Standards Organization (ISO) has specified an algorithm for compression of still images [4]. The ITU proposed the H.261 standard for video telephony and video conferencing [1]. The Motion Pictures Experts Group (MPEG) of ISO has completed its first standard, MPEG-1, which will be used for interactive video and provides a picture quality comparable to VCR quality [2]. MPEG has made substantial progress on the second phase of standards, MPEG-2, which will provide the audiovisual quality of both broadcast TV and HDTV [3]. Besides the availability of international standards, the successful introduction of the named services depends on the availability of VLSI components supporting a cost-efficient implementation of video compression applications. In the following, we give a short overview of recent coding schemes and discuss implementation alternatives. Furthermore, the efficiency estimation of architectural alternatives is discussed, and implementation examples of dedicated and programmable architectures are presented.
59.2 Recent Coding Schemes
Recent video coding standards are based on a hybrid coding scheme that combines transform coding and predictive coding techniques. An overview of these hybrid encoding schemes is depicted in Fig. 59.1.
FIGURE 59.1: Hybrid encoding and decoding scheme
The encoding scheme consists of the tasks motion estimation, typically based on block-matching algorithms, computation of the prediction error, discrete cosine transform (DCT), quantization (Q), variable length coding (VLC), inverse quantization (Q^−1), and inverse discrete cosine transform (IDCT or DCT^−1). The reconstructed image data are stored in an image memory for further predictions. The decoder performs the tasks variable length decoding (VLC^−1), inverse quantization, and motion compensated reconstruction.
Generally, video processing algorithms can be classified in terms of regularity of computation and data access. This classification leads to three classes of algorithms:

• Low-Level Algorithms — These algorithms are based on a predefined sequence of operations and a predefined amount of data at the input and output. The processing sequence of low-level algorithms is predefined and does not depend on the values of the data processed. Typical examples of low-level algorithms are block matching or transforms such as the DCT.

• Medium-Level Algorithms — The sequence and number of operations of medium-level algorithms depend on the data. Typically, the amount of input data is predefined, whereas the amount of output data varies according to the input data values. With respect to hybrid coding schemes, examples of these algorithms are quantization, inverse quantization, or variable length coding.

• High-Level Algorithms — High-level algorithms are associated with a variable amount of input and output data and a data-dependent sequence of operations. As for medium-level algorithms, the sequence of operations is highly data dependent. Control tasks of the hybrid coding scheme can be assigned to this class.
Since hybrid coding schemes are applied for different video source rates, the required absolute processing power varies in the range from a few hundred MOPS (Mega Operations Per Second) for video signals in QCIF format to several GOPS (Giga Operations Per Second) for the processing of TV or HDTV signals. Nevertheless, the relative computational power of each algorithmic class is nearly independent of the processed video format. In the case of hybrid coding applications, approximately 90% of the overall processing power is required for low-level algorithms. The amount of medium-level tasks is about 7%, and nearly 3% is required for high-level algorithms.
59.3 Architectural Alternatives
In terms of a VLSI implementation of hybrid coding applications, two major requirements can be identified. First, the high computational power requirements have to be provided by the hardware. Second, low manufacturing cost of video processing components is essential for the economic success of an architecture. Additionally, implementation size and architectural flexibility have to be taken into account.

Implementations of video processing applications can either be based on standard processors from workstations or PCs, or on specialized video signal processors. The major advantage of standard processors is their availability. Applying these architectures to the implementation of video processing hardware does not require the time-consuming design of new VLSI components. The disadvantage of this implementation strategy is the insufficient processing power of recent standard processors. Video processing applications would still require the implementation of cost-intensive multiprocessor systems to meet the computational requirements. To achieve compact implementations, video processing hardware has to be based on video signal processors adapted to the requirements of the envisaged application field.
Basically, two architectural approaches for the implementation of specialized video processing components can be distinguished. Dedicated architectures aim at an efficient implementation of one specific algorithm or application. Due to the restriction of the application field, the architecture of dedicated components can be optimized by an intensive adaptation of the architecture to the requirements of the envisaged application, e.g., the arithmetic operations that have to be supported, processing power, or communication bandwidth. Thus, this strategy will generally lead to compact implementations. The major disadvantage of dedicated architectures is the associated low flexibility. Dedicated components can only be applied for one or a few applications. In contrast to dedicated approaches with limited functionality, programmable architectures enable the processing of different algorithms under software control. The particular advantage of programmable architectures is the increased flexibility. Changes of architectural requirements, e.g., due to changes of algorithms or an extension of the aimed application field, can be handled by software changes. Thus, a generally cost-intensive redesign of the hardware can be avoided. Moreover, since programmable architectures cover a wider range of applications, they can be used for low-volume applications, where the design of function-specific VLSI chips is not an economical solution.
For both architectural approaches, the computational requirements of video processing applications demand the exploitation of the algorithm-inherent independence of the basic arithmetic operations to be performed. Independent operations can be processed concurrently, which enables a decrease of processing time and thus an increased throughput rate. For the architectural implementation of concurrency, two basic strategies can be distinguished: pipelining and parallel processing.

In the case of pipelining, several tasks, operations, or parts of operations are processed in subsequent steps in different hardware modules. Depending on the selected granularity level for the implementation of pipelining, intermediate data of each step are stored in registers, register chains, FIFOs, or dual-port memories. Assuming a processing time of T_P for a non-pipelined processor module and T_D,IM for the delay of intermediate memories, we get in the ideal case the following estimation for the throughput rate R_T,Pipe of a pipelined architecture applying N_Pipe pipeline stages:

    R_T,Pipe = N_Pipe / (T_P + N_Pipe · T_D,IM)    (59.1)

In the case of parallel processing, independent operations are assigned to concurrently operating hardware units, each delivering a result every T_P, which gives in the ideal case:

    R_T,Par = N_Par / T_P    (59.2)

where N_Par = number of parallel units.
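As a numerical sanity check, the two ideal-case estimates can be evaluated directly. The sketch below assumes the stage-time argument above (a pipeline cycle takes T_P/N_Pipe plus the register delay T_D,IM) and one result per T_P from each parallel unit; the example figures are illustrative, not from the text.

```python
# Hedged sketch: ideal-case throughput of pipelining vs. parallel processing.

def throughput_pipelined(t_p, t_d_im, n_pipe):
    # Cycle time = T_P/N_Pipe + T_D,IM, hence R_T,Pipe = N_Pipe/(T_P + N_Pipe*T_D,IM)
    return n_pipe / (t_p + n_pipe * t_d_im)

def throughput_parallel(t_p, n_par):
    # N_Par units, each delivering one result every T_P
    return n_par / t_p

# Example: a 100 ns module split into 4 stages, 2 ns register delay
print(throughput_pipelined(100e-9, 2e-9, 4))  # ~37.0e6 results/s
print(throughput_parallel(100e-9, 4))         # 40.0e6 results/s
```

Note that the register delay T_D,IM keeps the pipelined throughput below the ideal N_Pipe-fold speedup, whereas parallel processing pays instead with an N_Par-fold replication of hardware.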
Generally, both alternatives are applied for the implementation of high-performance video processing components. In the following sections, the exploitation of algorithmic properties and the application of architectural concurrency are discussed considering the hybrid coding schemes.

59.4 Efficiency Estimation of Alternative VLSI Implementations
Basically, architectural efficiency can be defined by the ratio of performance over cost. To achieve a figure of merit for architectural efficiency, we assume in the following that the performance of a VLSI architecture can be expressed by the achieved throughput rate R_T, and that the cost is equivalent to the required silicon area A_Si for the implementation of the architecture:

    E = R_T / A_Si    (59.3)
Besides the architecture, efficiency mainly depends on the applied semiconductor technology and the design style (semi-custom, full-custom). Therefore, a realistic efficiency estimation has to consider the gains provided by the progress in semiconductor technology. A sensible way is the normalization of the architectural parameters according to a reference technology. In the following, we assume a reference process with a gate length λ0 = 1.0 micron. For normalization of silicon area, the following equation can be applied:

    A_Si,0 = A_Si · (λ0 / λ)^2    (59.4)

where the index 0 is used for the system with reference gate length λ0.
According to [7], the normalization of throughput can be performed by:

    R_T,0 = R_T · (λ / λ0)^1.6    (59.5)

Combining Eq. (59.3) through Eq. (59.5), the normalized efficiency becomes:

    E_0 = R_T,0 / A_Si,0 = E · (λ / λ0)^3.6    (59.6)

E can be used for the selection of the best architectural approach out of several alternatives. Moreover, assuming a constant efficiency for a specific architectural approach leads to a linear relationship between throughput rate and silicon area, and this relationship can be applied for the estimation of the required silicon area for a specific application. Due to the power of 3.6 in Eq. (59.6), the semiconductor technology chosen for the implementation of a specific application has a significant impact on the architectural efficiency.
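The normalization is easy to apply in practice. The sketch below assumes a throughput exponent of 1.6, the value implied by the quadratic area scaling of Eq. (59.4) together with the quoted power of 3.6 for the efficiency; the 0.5 µm design figures are illustrative only.

```python
# Sketch of the technology normalization to a 1.0 um reference process.
# The throughput exponent 1.6 is an assumption implied by
# area ~ lambda^2 and efficiency ~ lambda^3.6.

def normalize(a_si_mm2, r_t, lam_um, lam0_um=1.0):
    a_0 = a_si_mm2 * (lam0_um / lam_um) ** 2   # area normalization, Eq. (59.4)
    r_0 = r_t * (lam_um / lam0_um) ** 1.6      # throughput normalization
    return a_0, r_0

# A hypothetical design in a 0.5 um process: 50 mm^2, 100 Mpel/s
a0, r0 = normalize(50.0, 100.0, 0.5)
e0 = r0 / a0          # normalized efficiency E_0
e = 100.0 / 50.0      # raw efficiency E = R_T / A_Si
# Consistency with the power of 3.6: E_0 = E * (lambda/lambda_0)^3.6
assert abs(e0 - e * 0.5 ** 3.6) < 1e-12
```

The final assertion shows how the two per-parameter exponents combine into the overall power of 3.6 on the efficiency.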
In the following, examples of dedicated and programmable architectures for video processing applications are presented. Additionally, the discussed efficiency measure is applied to achieve a figure of merit for silicon area estimation.
59.5 Dedicated Architectures
Due to their algorithmic regularity and the high processing power required for the discrete cosine transform and motion estimation, these algorithms are the first candidates for a dedicated implementation. As typical examples, alternatives for a dedicated implementation of these algorithms are discussed in the following.

The discrete cosine transform (DCT) is a real-valued frequency transform similar to the discrete Fourier transform (DFT). When applied to an image block of size L × L, the two-dimensional DCT (2D-DCT) can be expressed as follows:

    Y_k,l = (2/L) · c(k) · c(l) · Σ_{i=0..L−1} Σ_{j=0..L−1} x_i,j · cos[(2i+1)kπ/(2L)] · cos[(2j+1)lπ/(2L)]    (59.7)

with c(0) = 1/√2 and c(k) = 1 for k ≠ 0, where

(i, j) = coordinates of the pixels in the initial block
(k, l) = coordinates of the coefficients in the transformed block
x_i,j = value of the pixel in the initial block
Y_k,l = value of the coefficient in the transformed block
Computing a 2D-DCT of size L × L directly according to Eq. (59.7) requires L^4 multiplications and L^4 additions.
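The L^4 operation count is visible in a direct software rendering of Eq. (59.7): each of the L^2 output coefficients needs an L^2-term sum. The sketch below assumes the common orthonormal 2D-DCT definition with c(0) = 1/√2.

```python
import math

# Direct O(L^4) evaluation of the 2D-DCT
# (orthonormal form assumed: c(0) = 1/sqrt(2), c(k) = 1 otherwise).
def dct2_direct(x):
    L = len(x)
    c = lambda k: math.sqrt(0.5) if k == 0 else 1.0
    y = [[0.0] * L for _ in range(L)]
    for k in range(L):
        for l in range(L):
            acc = 0.0
            for i in range(L):          # L^2 multiply-accumulates ...
                for j in range(L):      # ... per output coefficient
                    acc += (x[i][j]
                            * math.cos((2 * i + 1) * k * math.pi / (2 * L))
                            * math.cos((2 * j + 1) * l * math.pi / (2 * L)))
            y[k][l] = (2.0 / L) * c(k) * c(l) * acc
    return y

# A constant 8x8 block concentrates all energy in the DC coefficient Y[0][0]
y = dct2_direct([[1.0] * 8 for _ in range(8)])
```

For L = 8 this is 4096 multiplications per block, which motivates the reduced-complexity schemes discussed next.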
The required processing power for the implementation of the DCT can be reduced by the exploitation of the arithmetic properties of the algorithm. The two-dimensional DCT can be separated into two one-dimensional DCTs according to Eq. (59.8):

    Y_k,l = √(2/L) · c(k) · Σ_{i=0..L−1} cos[(2i+1)kπ/(2L)] · [ √(2/L) · c(l) · Σ_{j=0..L−1} x_i,j · cos[(2j+1)lπ/(2L)] ]    (59.8)

In the separated implementation of Fig. 59.2, two one-dimensional processor arrays are applied, with the intermediate results available at the output of each array. The results of the 1D-DCT have to be reordered for the second 1D-DCT stage. For this purpose, a transposition memory is used. Since both one-dimensional processor arrays require identical DCT coefficients, these coefficients are stored in a common ROM.

FIGURE 59.2: Separated DCT implementation according to [9].
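The row-column decomposition behind Fig. 59.2 can be sketched as two passes of a 1D-DCT with a transposition in between, cutting the multiply count from L^4 to 2L^3 (the common orthonormal normalization is assumed here).

```python
import math

# Row-column (separated) 2D-DCT: 1D-DCT on rows, transpose, 1D-DCT again.
# Reduces L^4 multiplications to 2*L^3.

def dct1(v):
    L = len(v)
    c = lambda k: math.sqrt(0.5) if k == 0 else 1.0
    return [math.sqrt(2.0 / L) * c(k)
            * sum(v[i] * math.cos((2 * i + 1) * k * math.pi / (2 * L))
                  for i in range(L))
            for k in range(L)]

def dct2_separated(x):
    rows = [dct1(list(r)) for r in x]            # first 1D-DCT stage (rows)
    cols = [dct1(list(col)) for col in zip(*rows)]  # transposition + 2nd stage
    return [list(r) for r in zip(*cols)]         # transpose back to (k, l) order
```

The intermediate transpose plays the role of the transposition memory of Fig. 59.2, and both stages reuse the same cosine coefficients, mirroring the shared coefficient ROM.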
Moving from a mathematical definition to an algorithm that minimizes the number of calculations required is a problem of particular interest in the case of transforms such as the DCT. The 1D-DCT can also be expressed by the matrix-vector product:

    [Y] = [C] · [X]    (59.9)

where [C] is an L × L matrix and [X] and [Y] are the input and output vectors (8-point vectors for L = 8). As an example, with θ = π/16, the 8-point DCT matrix can be written as denoted in Eq. (59.10).
The matrix of Eq. (59.10):

    cos 4θ   cos 4θ   cos 4θ   cos 4θ   cos 4θ   cos 4θ   cos 4θ   cos 4θ
    cos θ    cos 3θ   cos 5θ   cos 7θ  −cos 7θ  −cos 5θ  −cos 3θ  −cos θ
    cos 2θ   cos 6θ  −cos 6θ  −cos 2θ  −cos 2θ  −cos 6θ   cos 6θ   cos 2θ
    cos 3θ  −cos 7θ  −cos θ   −cos 5θ   cos 5θ   cos θ    cos 7θ  −cos 3θ
    cos 4θ  −cos 4θ  −cos 4θ   cos 4θ   cos 4θ  −cos 4θ  −cos 4θ   cos 4θ
    cos 5θ  −cos θ    cos 7θ   cos 3θ  −cos 3θ  −cos 7θ   cos θ   −cos 5θ
    cos 6θ  −cos 2θ   cos 2θ  −cos 6θ  −cos 6θ   cos 2θ  −cos 2θ   cos 6θ
    cos 7θ  −cos 5θ   cos 3θ  −cos θ    cos θ   −cos 3θ   cos 5θ  −cos 7θ

After a first stage of sums and differences of mirrored inputs, the product splits into the even-coefficient matrix of Eq. (59.11):

    cos 4θ   cos 4θ   cos 4θ   cos 4θ
    cos 2θ   cos 6θ  −cos 6θ  −cos 2θ
    cos 4θ  −cos 4θ  −cos 4θ   cos 4θ
    cos 6θ  −cos 2θ   cos 2θ  −cos 6θ

and the odd-coefficient matrix of Eq. (59.12):

    cos θ    cos 3θ   cos 5θ   cos 7θ
    cos 3θ  −cos 7θ  −cos θ   −cos 5θ
    cos 5θ  −cos θ    cos 7θ   cos 3θ
    cos 7θ  −cos 5θ   cos 3θ  −cos θ
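The even/odd decomposition can be checked numerically: the sketch below forms the sums and differences of mirrored inputs and applies the two 4 × 4 matrices, reproducing the full 8 × 8 product. The unscaled entries cos((2j+1)kθ) of Eq. (59.10) are assumed, with no output scaling.

```python
import math

TH = math.pi / 16  # theta in Eq. (59.10)

def dct8_full(x):
    # [Y] = [C][X] with C[k][j] = cos((2j+1)*k*theta), the unscaled matrix
    return [sum(math.cos((2 * j + 1) * k * TH) * x[j] for j in range(8))
            for k in range(8)]

def dct8_even_odd(x):
    # First stage: sums and differences of mirrored inputs
    s = [x[j] + x[7 - j] for j in range(4)]
    d = [x[j] - x[7 - j] for j in range(4)]
    y = [0.0] * 8
    for k in (0, 2, 4, 6):  # even-coefficient 4x4 matrix, Eq. (59.11)
        y[k] = sum(math.cos((2 * j + 1) * k * TH) * s[j] for j in range(4))
    for k in (1, 3, 5, 7):  # odd-coefficient 4x4 matrix, Eq. (59.12)
        y[k] = sum(math.cos((2 * j + 1) * k * TH) * d[j] for j in range(4))
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(dct8_full(x), dct8_even_odd(x)))
```

The identity rests on the symmetry of the even rows and the antisymmetry of the odd rows of [C], which halves the number of multiplications per output.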
Another approach that has been extensively used is based on the technique of distributed arithmetic. Distributed arithmetic is an efficient way to compute the DCT totally or partially as scalar products. To illustrate the approach, let us compute a scalar product between two length-M vectors C and X:

    Y = Σ_{i=0..M−1} c_i · x_i    (59.13)

With the x_i given as B-bit two's-complement fractions, x_i = −x_{i,0} + Σ_{j=1..B−1} x_{i,j} · 2^−j, the scalar product can be rewritten as:

    Y = −C_0 + Σ_{j=1..B−1} C_j · 2^−j,  with  C_j = Σ_{i=0..M−1} c_i · x_{i,j}    (59.14)
FIGURE 59.3: Lee FDCT flowgraph for the one-dimensional 8-point DCT [10].
The change of summing order in i and j characterizes the distributed arithmetic scheme, in which the initial multiplications are distributed to another computation pattern. Since the term C_j has only 2^M possible values (which depend on the x_{i,j} values), it is possible to store these 2^M possible values in a ROM. An input set of M bits {x_{0,j}, x_{1,j}, x_{2,j}, ..., x_{M−1,j}} is used as an address, allowing retrieval of the C_j value. These intermediate results are accumulated over B clock cycles to produce one Y value. Figure 59.4 shows a typical architecture for the computation of an M-input inner product. The inverter and the MUX are used for inverting the final output of the ROM in order to compute C_0.
FIGURE 59.4: Architecture of an M-input inner product using distributed arithmetic.
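The scheme can be simulated in software. The sketch below is a software model of the datapath of Fig. 59.4, not the hardware itself: it precomputes the 2^M-entry ROM of partial sums, addresses it with one bit plane of the inputs per cycle, and accumulates over B cycles, negating the sign-bit contribution as the inverter/MUX does.

```python
# Software model of the distributed-arithmetic inner product: a 2^M-entry
# ROM of partial sums C_j, addressed by one bit of each input per cycle,
# accumulated over B cycles with the sign-bit term subtracted.

def da_inner_product(c, xs, B=8):
    """Y = sum(c[i]*xs[i]) for B-bit two's-complement integers xs."""
    M = len(c)
    # ROM entry at 'addr' holds the sum of the coefficients whose bit is set
    rom = [sum(c[i] for i in range(M) if (addr >> i) & 1)
           for addr in range(1 << M)]
    patterns = [x & ((1 << B) - 1) for x in xs]   # two's-complement bit patterns
    y = 0
    for j in range(B):                            # one ROM access per bit plane
        addr = sum(((patterns[i] >> j) & 1) << i for i in range(M))
        term = rom[addr] << j
        y += -term if j == B - 1 else term        # MSB plane is the sign bit
    return y

assert da_inner_product([3, -2, 5, 1], [10, -7, 3, -128]) == -69
```

In hardware, the left shift is realized by the shift-accumulator, so the whole M-term inner product costs one ROM access and one addition per bit of precision, with no multiplier at all.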
Figure 59.5 illustrates two typical uses of distributed arithmetic for computing a DCT. Figure 59.5(a) implements the scalar products described by the matrix of Eq. (59.10). Figure 59.5(b) takes advantage of a first stage of additions and subtractions and the scalar products described by the matrices of Eq. (59.11) and Eq. (59.12).
Properties of several dedicated DCT implementations have been reported in [6]. Figure 59.6 shows the silicon area as a function of the throughput rate for selected design examples. The design parameters are normalized to a fictive 1.0 µm CMOS process according to the discussed normalization strategy. As a figure of merit, a linear relationship between throughput rate and required silicon area can be derived:

    A_Si,0 ≈ 0.5 (mm² · s)/Mpel · R_T,0    (59.15)
FIGURE 59.5: Architecture of an 8-point one-dimensional DCT using distributed arithmetic. (a) Pure distributed arithmetic. (b) Mixed D.A.: first stage of flowgraph decomposition of 8 points followed by 2 times 4 scalar products of 4 points.

Equation (59.15) can be applied for the silicon area estimation of DCT circuits. For example, assuming TV signals according to the CCIR-601 format and a frame rate of 25 Hz, the source rate equals 20.7 Mpel/s. As a figure of merit from Eq. (59.15), a normalized silicon area of about 10.4 mm² can be derived. For HDTV signals, the video source rate equals 110.6 Mpel/s, and approximately 55.3 mm² of silicon area is required for the implementation of the DCT. Assuming an economically sensible maximum chip size of about 100 mm² to 150 mm², we can conclude that the implementation of the DCT does not necessarily require the realization of a dedicated DCT chip; the DCT core can be combined with several other on-chip modules that perform additional tasks of the video coding scheme.
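The two numeric examples follow directly from the linear model of Fig. 59.6. The helper below is hypothetical (not from the text); its slope of 0.5 mm² per Mpel/s is the ratio implied by the quoted figures.

```python
# Hypothetical helper applying the linear area model derived from Fig. 59.6:
# roughly 0.5 mm^2 of normalized silicon per Mpel/s of source rate
# (slope inferred from the 20.7 Mpel/s -> ~10.4 mm^2 example).

MM2_PER_MPEL_S = 0.5

def dct_area_mm2(source_rate_mpel_s):
    return MM2_PER_MPEL_S * source_rate_mpel_s

print(dct_area_mm2(20.7))    # CCIR-601 at 25 Hz: ~10.4 mm^2
print(dct_area_mm2(110.6))   # HDTV: ~55.3 mm^2
```

Estimates of this kind are only first-order figures of merit: they assume the constant efficiency underlying Eq. (59.15) and a design normalized to the 1.0 µm reference process.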
FIGURE 59.6: Normalized silicon area and throughput for dedicated DCT circuits.

For motion estimation, several techniques have been proposed in the past. Today, the most important technique for motion estimation is block matching, introduced by [21]. Block matching is based on the matching of blocks between the current and a reference image. This can be done by a full (or exhaustive) search within a search window, but several other approaches have been reported in order to reduce the computation requirements by using an "intelligent" or "directed" search [17, 18, 19, 23, 25, 26, 27].
In the case of an exhaustive search block matching algorithm, a block of size N × N pels of the current image (the reference block, denoted X) is matched with all the blocks located within a search window (the candidate blocks, denoted Y). The maximum displacement will be denoted by w. The matching criterion generally consists in computing the mean absolute difference (MAD) between the blocks. Let x(i, j) be the pixels of the reference block and y(i, j) the pixels of the candidate block. The matching distance (or distortion) D is computed according to Eq. (59.16), where the indexes m and n indicate the position of the candidate block within the search window:

    D(m, n) = Σ_{i=0..N−1} Σ_{j=0..N−1} |x(i, j) − y(m + i, n + j)|,  −w ≤ m, n ≤ w    (59.16)

The distortion D is computed for all the (2w + 1)² possible positions of the candidate block within the search window, and the block corresponding to the minimum distortion is used for prediction. The position of this block within the search window is represented by the motion vector v:

    v = (m, n) such that D(m, n) = D_MIN    (59.17)
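A direct software rendering of the exhaustive search of Eq. (59.16) and Eq. (59.17) can look as follows. The layout of the search window is an assumption for this sketch: an (N + 2w) × (N + 2w) array whose top-left corner corresponds to displacement (−w, −w).

```python
# Exhaustive-search block matching: evaluate the distortion D(m, n) for all
# (2w+1)^2 displacements and return the motion vector of the minimum.
# The search window is an (N+2w) x (N+2w) array; its top-left corner
# corresponds to displacement (-w, -w) of the reference block.

def full_search(x, window, w):
    N = len(x)
    d_min, v = None, None
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = sum(abs(x[i][j] - window[m + i + w][n + j + w])
                    for i in range(N) for j in range(N))  # MAD sum, Eq. (59.16)
            if d_min is None or d < d_min:
                d_min, v = d, (m, n)                      # track the minimum, Eq. (59.17)
    return v, d_min

# Toy example: a 2x2 reference block that matches the window at offset (1, 1)
win = [[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 5, 6],
       [0, 0, 7, 8]]
v, d = full_search([[5, 6], [7, 8]], win, 1)
assert v == (1, 1) and d == 0
```

The triple loop makes the (2w + 1)² · N² operation count explicit; the architectures discussed next parallelize the inner N² sum in hardware.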
The operations involved in computing D(m, n) and D_MIN are associative. Thus, the order of exploring the index spaces (i, j) and (m, n) is arbitrary, and the block matching algorithm can be described by several different dependence graphs. As an example, Fig. 59.7 shows a possible dependence graph (DG) for w = 1 and N = 4. In this figure, AD denotes an absolute difference and an addition, and M denotes a minimum value computation.

FIGURE 59.7: Dependence graphs of the block matching algorithm. The computations of v(X, Y) and D(m, n) are performed by 2-D linear DGs.
The dependence graph for computing D(m, n) is directly mapped onto a 2-D array of processing elements (PE), while the dependence graph for computing v(X, Y) is mapped into time (Fig. 59.8).
In other words, block matching is performed by a sequential exploration of the search area, while the computation of each distortion is performed in parallel. Each of the AD nodes of the DG is implemented by an AD processing element (AD-PE). The AD-PE stores the value of x(i, j) and receives the value of y(m + i, n + j) corresponding to the current position of the reference block in the search window. It performs the subtraction and the absolute value computation, and adds the result to the partial result coming from the upper PE. The partial results are added along the columns, and a linear array of adders performs the horizontal summation of the row sums and computes D(m, n). For each position (m, n) of the reference block, the M-PE checks whether the distortion D(m, n) is smaller than the previous smallest distortion value and, in this case, updates the register which keeps the previous smallest distortion value.
To transform this naive architecture into a realistic implementation, two problems must be solved: (1) a reduction of the cycle time, and (2) the I/O management.

1. The architecture of Fig. 59.8 implicitly supposes that the computation of D(m, n) can be done combinatorially in one cycle time. While this is theoretically possible, the resulting cycle time would be very large and would increase as 2N. Thus, a pipeline scheme is generally added.

2. This architecture also supposes that each of the AD-PEs receives a new value of y(m + i, n + j) at each clock cycle.
FIGURE 59.8: Principle of the 2-D block-based architecture
Since transmitting the N² values from an external memory is clearly impossible, advantage must be taken of the fact that these values belong to the search window. A portion of the search window of size N × (2w + N) is stored in the circuit, in a 2-D bank of shift registers able to shift in the up, down, and right directions. Each of the AD-PEs has one of these registers and can, at each cycle, obtain the value of y(m + i, n + j) that it needs. To update this register bank, a new column of 2w + N pixels of the search area is serially entered into the circuit and inserted in the bank of registers. A mechanism must also be provided for loading a new reference block with a low I/O overhead: a double buffering of x(i, j) is required, with the pixels x′(i, j) of a new reference block serially loaded during the computation of the current reference block (Fig. 59.9).
Figure 59.10 shows the normalized computational rate vs. normalized chip area for block matching circuits. Since one MAD operation consists of three basic ALU operations (SUB, ABS, ADD), for a 1.0 micron CMOS process we can derive from this figure that:

The first term of this expression indicates that the block matching algorithm requires a large storage area (storage of parts of the actual and previous frames), which cannot be reduced even when the