Digital Signal Processing Handbook P54


de Haan, G. “Video Scanning Format Conversion and Motion Estimation,” Digital Signal Processing Handbook, Ed. Vijay K. Madisetti and Douglas B. Williams. Boca Raton: CRC Press LLC, 1999.


Video Scanning Format Conversion and Motion Estimation

Gerard de Haan

Philips Research Laboratories

54.1 Introduction

54.2 Conversion vs. Standardization

54.3 Problems with Linear Sampling Rate Conversion Applied to Video Signals
     Temporal Interpolation • Vertical Interpolation and Interlaced Scanning

54.4 Alternatives for Sampling Rate Conversion Theory
     Simple Algorithms • Advanced Algorithms

54.5 Motion Estimation
     Pel-Recursive Estimators • Block-Matching Algorithm • Search Strategies

54.6 Motion Estimation and Scanning Format Conversion
     Hierarchical Motion Estimation • Recursive Search Block-Matching

References

54.1 Introduction

The scanning format of a video signal is a major determinant of general picture quality. Specifically, it determines such aspects as stationary and dynamic resolution, motion portrayal, aliasing, scanning structure visibility, and flicker. Various formats have been designed and standardized to strike a particular balance between quality, cost, transmission capacity, and compatibility with other standards.

The field of video scanning format conversion is concerned with the translation of video signals from one format into another. It consists of two basic parts: temporal interpolation and spatial interpolation. A particular case is de-interlacing, which poses an inseparable spatio-temporal interpolation problem.

Vertical and temporal interpolation cause practical and fundamental difficulties in achieving high-quality scanning format conversion. This is because the conditions of the sampling theorem are generally not met in video signals. If they were satisfied, standards conversion of arbitrary accuracy would be possible using suitable linear filters.

The earlier conversion methods neglected these fundamental problems and consequently degraded both the resolution and the motion portrayal. More recent algorithms apply motion vectors to predict the position of moving objects at unregistered temporal instances to improve the quality of the picture at the output format. A so-called motion estimator extracts these vectors from the input signal. The motion vectors partly solve the fundamental problems, but the demands on the motion estimator for scanning format conversion are severe.

In this section we shall first briefly indicate why we can expect that the importance of scanning format conversion will grow. Then we discuss in more detail the fundamental problems of temporal interpolation of video signals. Next we provide a concise overview of the basic methods in scanning format conversion, focused on temporal sampling rate conversion and de-interlacing. Finally, we give an overview of motion estimation algorithms, which are crucial in the more advanced scanning format converters.

54.2 Conversion vs. Standardization

Scanning formats have been designed in the past to strike a particular compromise between quality, cost, transmission capacity, and compatibility with other standards. There were three main formats in use a decade ago: 50 Hz interlaced, 60 Hz interlaced, and 24 (or 25) Hz progressive (film). With the arrival of video-conferencing, HDTV, workstations, and PCs, many new video formats have appeared. These include low-end formats such as CIF and QCIF with smaller picture size and lower frame rates, progressive and interlaced HDTV formats at 50 Hz and 60 Hz, and other video formats used on computer workstations and enhanced television displays with field rates up to 100 Hz. It will be clear that the problem of scanning format conversion is of growing importance, despite many attempts to globally standardize video formats.

54.3 Problems with Linear Sampling Rate Conversion Applied to Video Signals

High-quality scanning format conversion is difficult to achieve, as the conditions of the sampling theorem are generally not met in video signals. For systems that do satisfy the conditions of the sampling theorem, the solution to sampling rate conversion (SRC) is well known for arbitrary sampling ratios [1].

Figure 54.1 illustrates the procedure for a ratio of 2. To arrive at the double output sampling rate, zero-valued samples are first inserted between every pair of input samples. In a second step, a low-pass filter (LPF) at the output rate is applied to remove the first repeat spectrum from the input data. In case of a temporal SRC, the interpolating LPF has to be a temporal LPF, i.e., a filter including picture delays. Though feasible, this makes it a fairly expensive filter.
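The two-step procedure can be sketched for a 1-D signal as follows. This is only an illustration under assumptions that are not in the text: the function name `upsample_by_two`, the 31-tap length, and the Hamming-windowed sinc design are hypothetical choices, and a real temporal SRC would apply the filter across pictures, not along a 1-D array.

```python
import numpy as np

def upsample_by_two(x, num_taps=31):
    """Upsample a 1-D signal by 2: zero insertion followed by a low-pass filter.

    Assumes len(x) * 2 >= num_taps and an odd num_taps (hypothetical defaults).
    """
    x = np.asarray(x, dtype=float)

    # Step 1: insert a zero-valued sample after every input sample.
    up = np.zeros(2 * len(x))
    up[::2] = x

    # Step 2: low-pass filter at the output rate with cutoff at a quarter of
    # the output sampling frequency, which removes the first repeat spectrum.
    # The windowed sinc has gain 2 to compensate for the inserted zeros.
    k = np.arange(num_taps) - (num_taps - 1) // 2
    h = np.sinc(k / 2.0) * np.hamming(num_taps)

    return np.convolve(up, h, mode="same")
```

Because the sinc is zero at every even nonzero tap, the original samples pass through unchanged and only the inserted positions are interpolated.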

A more complicated, though still not fundamental, problem occurs at the signal acquisition stage. Since scenes do occur with almost unlimited spatial and/or temporal bandwidth, the sampling theorem requires that this signal be low-pass filtered prior to the scanning process. Interlaced scanning, as commonly applied, even demands two-dimensional prefiltering in the vertical-temporal frequency plane. In a video system, it is the camera that samples the scene in a vertical and temporal sense; therefore, the prefilter has to be realized in the optical path. Although there are considerable practical problems in achieving this filtering, it would apparently reduce the problem of temporal interpolation of video images to the common sampling rate conversion problem. The next section will show, however, that in addition to the practical problems there is a fundamental problem as well.

54.3.1 Temporal Interpolation

Considering the eye’s sine-wave temporal frequency response for full brightness potential and full field display [2], as shown in Fig. 54.2, temporal prefiltering with a bandwidth of 75 Hz at first sight seems sufficient. The fundamental problem now is that the relation shown in Fig. 54.2 holds for temporal frequencies as they occur at the retina of the observer. These frequencies, however, equal the frequencies at the display only if the eye is stationary with respect to this display. Particularly with the eye tracking objects moving on the screen, this assumption is no longer valid. For a tracking observer, very high temporal frequencies on the screen can be transformed to much lower frequencies, or even DC, at the retina. Consequently, suppression of these frequencies with an interpolating low-pass filter results in excessive blurring of moving objects, as will be discussed next.

FIGURE 54.1: Consecutive steps in upsampling with a factor of two.

Figure 54.3 shows, in a time-discrete representation, a simple object, a square, moving with a constant velocity. Again, in this example, we consider up-sampling with a factor of two. Therefore, the true position of the object is available at every second temporal position only (e.g., the odd-numbered samples). The “tracking observer” views along the motion trajectory, represented with a line in the illustration, which results in a stationary image of the object on the retina. If the output field sampling frequency exceeds the cutoff temporal frequency of the human visual system,¹ the viewer will have the illusion that the object is continuously present.

Therefore, the object is actually seen at a position corresponding with the motion trajectory. If now, e.g., in the 6th output field, the object is interpolated according to SRC theory, weighted copies of the object from surrounding fields resulting from the interpolating LPF are displayed. Figure 54.3 illustrates the case of a symmetrical transversal low-pass filter. In this situation, the viewer sees the object at the correct position but also various attenuated and displaced copies (the impulse response of the interpolating temporal filter) of the object in a neighborhood. The attenuation depends on the coefficients of the interpolating filter, and the distance between the copies is related to the displacement of the moving object in a field period. For the object-tracking observer, therefore, the temporal LPF is transformed into a spatial LPF. For an object velocity of one pixel per field period (1 pel/field), its frequency characteristic equals the temporal frequency characteristic of the interpolating LPF.² A velocity of 1 pel/field is slow motion; in broadcast picture material, velocities in a range exceeding 16 pel/field do occur. Thus, the spatial blur caused by the SRC process becomes unacceptable even for moderate object velocities.

¹ Actually the picture update frequency may be even as low as 16 Hz, to guarantee smooth perceived motion (see, e.g., [3]). The higher display rates are merely necessary to prevent the annoying large area flicker.

FIGURE 54.2: The contrast sensitivity of the human observer (y-axis) for large areas of uniform brightness, as a function of the temporal frequency (x-axis).

FIGURE 54.3: The effect of temporal interpolation for an object-tracking observer. The field numbers are counted at the output field rate.
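How a temporal filter turns into displaced spatial copies can be illustrated with a deliberately simplified case: a two-tap temporal average (not the longer symmetrical filter of Fig. 54.3) applied to a 1-D "square" that moves at a hypothetical 4 pel per output field. All names and numbers here are assumptions made for the sketch.

```python
import numpy as np

WIDTH = 32      # hypothetical line length in pixels
VELOCITY = 4    # hypothetical object speed in pel per output field

def square(pos, size=4):
    """A 1-D 'square' object of unit amplitude starting at pixel pos."""
    field = np.zeros(WIDTH)
    field[pos:pos + size] = 1.0
    return field

def interpolate_field(prev_field, next_field):
    """Two-tap temporal low-pass filter: average of the neighboring input fields."""
    return 0.5 * (prev_field + next_field)

prev_field = square(8)                   # object in the previous input field
next_field = square(8 + 2 * VELOCITY)    # object in the next input field
true_field = square(8 + VELOCITY)        # where a tracking observer expects it

interp = interpolate_field(prev_field, next_field)
```

In `interp` the object appears as two half-amplitude copies, one `VELOCITY` pixels before and one `VELOCITY` pixels after the position in `true_field`, and nothing at the expected position itself. The spacing of the copies grows with the velocity, which is the spatial blur described above.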

54.3.2 Vertical Interpolation and Interlaced Scanning

Much like the situation with field rate conversion, it may seem that sequential scan conversion is an up-sampling problem for which SRC theory provides an adequate solution. However, straightforward one-dimensional up-sampling in the vertical frequency domain is incorrect, as the data is clearly sub-Nyquist sampled due to interlace.

If, more correctly, the sequential scan conversion is considered as a two-dimensional up-sampling problem in the vertical-temporal frequency domain, we arrive at a discussion similar to the one in Section 54.3.1: the problem cannot be solved, as we do not know the temporal frequency at the retina of a movement-tracking observer. It is possible to disregard this problem and to perform a two-dimensional SRC, implicitly assuming a stationary viewer and prefiltered information. Such systems were described and have been implemented for studio applications. With the older image pick-up tubes the results can be satisfactory, as these devices have a poor dynamic resolution. When modern (CCD) cameras are used, however, the limitations of the assumptions become obvious.

² It is assumed here that both filters are normalized to their respective sampling frequency.

54.4 Alternatives for Sampling Rate Conversion Theory

With the problem of linear interpolation of video signals clarified, we will discuss alternative algorithms developed over time. These algorithms fall into two categories. A first category simplifies the interpolation filter prescribed by SRC theory, considering that a completely correct solution is impossible anyway. The resulting “simple algorithms” are more attractive for hardware realization than the method from which they are derived and under certain conditions can perform quite similarly. The second category includes the most “advanced algorithms” for scanning format conversion. These methods can be characterized by their common attempt to interpolate the 3-D image data in the direction in which the correlation is highest. The difference between the various options lies mainly in the number of possible directions, and dimensions, which are considered. The implementation can show various linear interpolation filters controlled by one or more detectors, or a multi-dimensional nonlinear filter that has an inherent edge adaptivity. As this description allows a large number of algorithms, we will illustrate it with some important examples.

54.4.1 Simple Algorithms

SRC theory in the temporal and vertical frequency domain is not applicable due to the missing prefilter in common video systems. A sophisticated linear interpolation filter therefore makes little sense. Any interpolating (spatio-)temporal low-pass filter will suppress aliased signal components as well as original temporal frequency components, as they occupy, by definition, the same spectrum. As the first effect is desired and the second is not, the transfer function of the filter strikes a compromise between alias and blurring. Repetition of the most recent sample in this sense is optimal for the dynamic resolution and worst for alias. A strong temporal low-pass filter suppresses much (not necessarily all) alias and yields a poor dynamic resolution. The annoyance of the temporal alias depends on the input and output picture frequency, and particularly their difference. In the easiest case, both frequencies are high and their difference is 50 Hz or more. In the worst case, input and output picture rates are low and their difference is on the order of 10 Hz. In case of an annoying beat frequency, an interpolating LPF usually improves picture quality; otherwise the best compromise is closer to repetition of the most recent sample.

54.4.2 Advanced Algorithms

As indicated before, these methods are characterized by their common attempt to interpolate the 3-D image data in the direction in which the correlation is highest. To this end they have either an explicit or an implicit detector to find this direction. In case of (1-D) temporal interpolation the explicit detector is usually called a motion detector; for 2-D spatial interpolation it is called an edge detector; and the most advanced device, estimating the optimal spatio-temporal (3-D) interpolation direction, is usually called a motion estimator. The interpolation filter can be recursive or transversal, and can have any number of taps, but a transversal filter with one or two taps is the most common choice. For a two-tap FIR approach we can write the interpolated video signal $F_{\mathrm{int}}$, in picture $n$, at spatial position $\mathbf{x} = (x, y)^T$, as a function of the input video signal $F(\mathbf{x}, n)$:

$$F_{\mathrm{int}}(\mathbf{x}, n) = 0.5\left[ F\!\left(\mathbf{x} + \begin{pmatrix}\delta_1\\ \delta_2\end{pmatrix},\, n + \delta_3\right) + F\!\left(\mathbf{x} - \begin{pmatrix}\delta_1\\ \delta_2\end{pmatrix},\, n - \delta_3\right)\right] \tag{54.1}$$

In this terminology a motion detector controls $\delta_3$, an edge detector controls $\delta_1$ and $\delta_2$, while a motion estimator can be applied to determine $\delta_1$, $\delta_2$, and $\delta_3$.
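The two-tap interpolator of Eq. (54.1) can be written down almost literally. The sketch below assumes, purely for illustration, that the video is a NumPy array indexed as `video[n, y, x]`, that the offsets are integers, and that the caller keeps all indices inside the array; the function name is hypothetical.

```python
import numpy as np

def two_tap_interpolate(video, x, y, n, d1=0, d2=0, d3=0):
    """Average two samples placed symmetrically around (x, y, n), as in Eq. (54.1).

    d1, d2: horizontal and vertical offsets (set by an edge detector).
    d3:     temporal offset in pictures (set by a motion detector).
    A motion estimator would choose all three together.
    """
    forward = video[n + d3, y + d2, x + d1]
    backward = video[n - d3, y - d2, x - d1]
    return 0.5 * (forward + backward)
```

For example, `d1=0, d2=1, d3=0` gives vertical line averaging within one picture, while `d1=d2=0, d3=1` gives purely temporal averaging of the neighboring pictures.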

Algorithms with a Motion Detector

To detect motion, the difference between two successive pictures is calculated. It is too simple, however, to expect this signal to become zero in a picture part without moving objects. The common problems with the detection are noise and alias. Additional problems occurring in some systems are color subcarriers causing nonstationarities in colored regions, interlace causing nonstationarities in vertically detailed picture parts, and timing jitter of the sampling clock, which is particularly harmful in detailed areas.

All these problems imply that the output of the motion detector usually is not a binary, but rather a multi-level signal, indicating the probability of motion. Usual (but not always valid) assumptions made to improve the detector are:

1. Noise is small and signal is large.

2. The spectrum part around the color carrier carries no motion information.

3. Low-frequency energy in the signal is larger than in the noise and alias.

4. Moving objects are large compared to a pixel.

The general structure of the motion detector resulting from these assumptions is depicted in Figure 54.4. As can be seen, the difference signal is first low-pass (and carrier reject) filtered to profit from assumptions 2 and 3. This also makes the detector less “nervous” for timing jitter in detailed areas. After the rectification, another low-pass filter improves the consistency of the motion signal, based on assumption 4. Finally, the nonlinear (but monotonic) transfer function in the last block translates the signal into a probability figure for the motion, $P_m$, using assumption 1. This last function may have to be adapted to the expected noise level. The low-pass filters are not necessarily linear. More than one detector can be used, working on more than just two pictures in the neighborhood of the current image, and a logical or linear combination of their outputs may lead to a more reliable indication of motion.

FIGURE 54.4: General structure of a motion detector.
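The chain of picture difference, low-pass filter, rectifier, second low-pass filter, and nonlinear transfer function might be sketched as below. The 3×3 box filters, the piecewise-linear transfer curve, and the thresholds `noise_level` and `full_motion` are illustrative assumptions rather than values from the text, and the carrier-reject filter is omitted.

```python
import numpy as np

def box_filter(img, size=3):
    """Simple size x size moving-average low-pass filter with edge replication."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def motion_probability(cur, prev, noise_level=2.0, full_motion=10.0):
    """Return a multi-level motion signal P_m in [0, 1] per pixel."""
    # Picture difference, low-pass filtered to reduce noise and alias.
    diff = box_filter(cur.astype(float) - prev.astype(float))
    # Rectification, then a second low-pass filter for spatial consistency.
    magnitude = box_filter(np.abs(diff))
    # Monotonic nonlinear mapping: 0 below the assumed noise level,
    # 1 above the assumed full-motion level, linear in between.
    ramp = (magnitude - noise_level) / (full_motion - noise_level)
    return np.clip(ramp, 0.0, 1.0)
```

Raising `noise_level` is one way to adapt the last stage to a noisier source, at the price of missing small or low-contrast motion.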

The motion detector (MD) is applied to switch or fade between two processing modes, one of which is optimal for stationary and the other for moving image parts. Examples are:

• De-interlacing: The MD fades between intra-field interpolation (line averaging, or edge-dependent spatial interpolation) and inter-field interpolation (repetition of the previous field, averaging of neighboring fields, etc.).

• Field rate doubling on interlaced video: The MD fades between repetition of fields (best dynamic resolution without motion compensation for moving picture parts) and repetition of frames (best spatial resolution in stationary image parts).

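To elaborate slightly on the first example of de-interlacing, we define the interpolated pixel $X_m(\mathbf{x}, n)$ in a moving picture part as:

$$X_m(\mathbf{x}, n) = 0.5\left[ F\!\left(\mathbf{x} - \begin{pmatrix}0\\ 1\end{pmatrix},\, n\right) + F\!\left(\mathbf{x} + \begin{pmatrix}0\\ 1\end{pmatrix},\, n\right)\right] \tag{54.2}$$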

while for stationary picture parts the interpolated pixel $X_s(\mathbf{x}, n)$ is taken as:

$$X_s(\mathbf{x}, n) = F(\mathbf{x}, n - 1) \tag{54.3}$$

and, taking the probability of motion $P_m$ from the motion detector into account, the output is given by:

$$F_{\mathrm{int}}(\mathbf{x}, n) = P_m\, X_m(\mathbf{x}, n) + (1 - P_m)\, X_s(\mathbf{x}, n) \tag{54.4}$$

In most practical cases the output $P_m$ has a nonlinear relation with the actual probability.
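A minimal sketch of this motion-adaptive fade for one missing line follows. It assumes that the stationary-mode estimate is the co-sited line of the previous field (field insertion), that `cur` and `prev` are 2-D arrays on the full frame grid, and that `p_m` is a scalar or a per-pixel array; these names and the data layout are hypothetical.

```python
import numpy as np

def deinterlace_line(cur, prev, y, p_m):
    """Interpolate missing line y of the current field by fading between modes.

    cur:  frame-grid array whose lines y-1 and y+1 belong to the current field.
    prev: frame-grid array whose line y belongs to the previous field.
    p_m:  motion probability in [0, 1], scalar or one value per pixel.
    """
    x_m = 0.5 * (cur[y - 1] + cur[y + 1])   # intra-field line averaging, Eq. (54.2)
    x_s = prev[y]                           # assumed stationary mode: field insertion
    return p_m * x_m + (1.0 - p_m) * x_s    # fade between both modes, Eq. (54.4)
```

With `p_m = 0` the line is copied from the previous field, which keeps full vertical resolution in still areas; with `p_m = 1` the result is pure line averaging, which avoids motion artifacts at the cost of vertical detail.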

Algorithms with an Edge Detector

To detect the orientation of a spatial edge, usually the differences between pairs of spatially neighboring pixels are calculated. Again it is a bit unrealistic to expect that a zero difference is a reliable indication of a spatial direction in which the signal is stationary. The same problems (noise, alias, carriers, timing jitter) occur as with motion detection. The edge detector (ED) is applied to switch or fade between at least two but usually more processing modes, each of them optimal for interpolation of a certain orientation of the spatial edge. Examples are:

• De-interlacing: The ED fades between vertical line averaging and diagonal averaging (±45°, or even more angles).

• Up-conversion to a higher resolution format: A simple bi-linear interpolation filter is applied with its coefficients adapted to the output of the edge detector.

FIGURE 54.5: Identification of pixels as applied for direction dependent spatial interpolation


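In Fig. 54.5, $X$ is the pixel to be interpolated for the sequential scan conversion, and the result applying pixels in a neighborhood ($A$, $B$, $C$, $D$, $E$, and $F$) is either $X_a$, $X_b$, or $X_c$, where:

$$X_a = 0.5\,[A + F] = 0.5\left[ F\!\left(\mathbf{x} - \begin{pmatrix}1\\ 1\end{pmatrix},\, n\right) + F\!\left(\mathbf{x} + \begin{pmatrix}1\\ 1\end{pmatrix},\, n\right)\right] \tag{54.5}$$

and:

$$X_b = 0.5\,[B + E] = 0.5\left[ F\!\left(\mathbf{x} - \begin{pmatrix}0\\ 1\end{pmatrix},\, n\right) + F\!\left(\mathbf{x} + \begin{pmatrix}0\\ 1\end{pmatrix},\, n\right)\right] \tag{54.6}$$

and:

$$X_c = 0.5\,[C + D] = 0.5\left[ F\!\left(\mathbf{x} + \begin{pmatrix}+1\\ -1\end{pmatrix},\, n\right) + F\!\left(\mathbf{x} + \begin{pmatrix}-1\\ +1\end{pmatrix},\, n\right)\right] \tag{54.7}$$

The selection of $X_a$, $X_b$, or $X_c$ as the interpolated output $F_{\mathrm{int}}$ is controlled by a luminance gradient indication calculated from the same neighborhood:

$$F_{\mathrm{int}}(\mathbf{x}, n) = \begin{cases} X_a, & (|A - F| < |C - D| \,\wedge\, |A - F| < |B - E|) \\ X_b, & (|B - E| \le |A - F| \,\wedge\, |B - E| \le |C - D|) \\ X_c, & (|C - D| < |A - F| \,\wedge\, |C - D| < |B - E|) \end{cases} \tag{54.8}$$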

In this example, the gradient is calculated on the same pixels that are used in the interpolation step. This is not necessarily the case. Similar to the earlier described motion detector, it is advantageous to filter the video signal prior to and/or after the rectification in Eq. (54.8). Also the decision, i.e., the optimal interpolation angle, can be low-pass filtered to improve the consistency of the interpolation angle. Finally, the edge-dependent interpolation can be combined with (motion adaptive or motion compensated) temporal interpolation to improve the interpolation quality of near-horizontal edges.
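The selection rule of Eqs. (54.5) to (54.8) can be sketched for a single pixel as follows, with A, B, C taken from the line above and D, E, F from the line below the pixel to be interpolated. The function name is hypothetical. One choice here is an assumption: when the two diagonal differences tie and are both smaller than the vertical one, Eq. (54.8) selects none of its three cases, and this sketch then falls back to vertical averaging.

```python
def edge_dependent_interpolate(A, B, C, D, E, F):
    """Pick the averaging direction with the smallest absolute pixel difference."""
    x_a = 0.5 * (A + F)   # diagonal, upper left to lower right, Eq. (54.5)
    x_b = 0.5 * (B + E)   # vertical, Eq. (54.6)
    x_c = 0.5 * (C + D)   # diagonal, upper right to lower left, Eq. (54.7)

    d_a, d_b, d_c = abs(A - F), abs(B - E), abs(C - D)

    if d_a < d_c and d_a < d_b:
        return x_a
    if d_c < d_a and d_c < d_b:
        return x_c
    # Covers the X_b case of Eq. (54.8) and, by assumption, the diagonal tie.
    return x_b
```

Because the decision is made per pixel on unfiltered differences, it is sensitive to noise; the filtering of the signal and of the decision mentioned above would be added around this core.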

Implicit Detection in Nonlinear Interpolation Filters

Many nonlinear interpolation methods have been described. Most popular is the class of order statistical filters. Combinations with linear (bandsplitting) filters are known, optimizing the interpolation for individual spectrum parts. We will limit ourselves to some basic examples here.

An illustration of a basic inherently adapting filter is shown in Figure 54.6. The line to be interpolated is found as the median of the spatially neighboring lines (a and b) and the corresponding line (c) from the previous field:

FIGURE 54.6: Sequential scan conversion with three-tap vertical-temporal median filtering. The thin lines show which pixels are input for the median filter.

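$$F_{\mathrm{int}}(\mathbf{x}, n) = \operatorname{median}[a, b, c] = \operatorname{median}\left[ F\!\left(\mathbf{x} + \begin{pmatrix}0\\ 1\end{pmatrix},\, n\right),\; F\!\left(\mathbf{x} - \begin{pmatrix}0\\ 1\end{pmatrix},\, n\right),\; F(\mathbf{x}, n - 1) \right] \tag{54.9}$$

with:

$$\operatorname{median}(X, Y, Z) = \begin{cases} X, & (Y \le X \le Z \,\vee\, Z \le X \le Y) \\ Y, & (X < Y \le Z \,\vee\, Z \le Y < X) \\ Z, & (\text{otherwise}) \end{cases} \tag{54.10}$$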

The inherent adaptation to edges is understood as follows. In case of a temporal edge (i.e., motion) larger than the spatial edge (i.e., vertical detail), the difference between a and b is relatively small compared to their difference with c. Therefore, an intra-field interpolation results (a or b is copied). In case of a non-moving vertical edge, the difference between a and b will be relatively large compared to the difference between c and a or b. In this case, the inter-field interpolation (c is copied) is most likely.
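The three-tap vertical-temporal median and its implicit switching can be tried out with a few lines of NumPy. The function name and the line-wise array layout are assumptions for this sketch; `a` and `b` are the lines above and below the missing line in the current field, and `c` is the co-sited line of the previous field.

```python
import numpy as np

def median_deinterlace_line(a, b, c):
    """Per-pixel median of the upper line, the lower line, and the previous-field line."""
    return np.median(np.stack([a, b, c]), axis=0)

# Pixel 0: motion (c differs strongly from a and b), so a or b is selected.
# Pixel 1: static vertical edge (a and b differ, c lies between), so c is selected.
a = np.array([10.0, 0.0])
b = np.array([12.0, 100.0])
c = np.array([200.0, 90.0])
result = median_deinterlace_line(a, b, c)
```

Here `result` holds 12 for the moving pixel (intra-field) and 90 for the static edge pixel (inter-field), matching the two cases described above without any explicit detector.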

It is possible to combine edge detectors with nonlinear filters, e.g., a so-called weighted median filter. In a weighted median filter, the (integer) weight given to a sample indicates the number of times its value is included in the input of the filter to the ranking stage. An increase of this weight increases the chance that this sample value is selected as the median. It therefore provides a method, using the output of an edge detector with uncertainties, to statistically improve the performance of the interpolation.

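We will again use Fig. 54.5 to identify the location of the pixels used in the interpolation. The output value for the pixel position indicated with $X$ results as:

$$F_{\mathrm{int}}(\mathbf{x}, n) = \operatorname{median}\left[ A, B, C, D, E, F,\; \alpha \cdot X_{-1},\; \beta \cdot \frac{B + E}{2} \right], \quad (\alpha, \beta \in \mathbb{N}) \tag{54.11}$$

with:

$$X_{-1} = F(\mathbf{x}, n - 1), \quad A = F\!\left(\mathbf{x} - \begin{pmatrix}1\\ 1\end{pmatrix},\, n\right), \quad B = F\!\left(\mathbf{x} - \begin{pmatrix}0\\ 1\end{pmatrix},\, n\right), \;\ldots \tag{54.12}$$

as illustrated in Fig. 54.5. The weighting ($\alpha$ and $\beta$) implies that an assumed “important” pixel is fed more than once to the median calculating circuit:

$$\alpha \cdot A = \underbrace{A, A, A, \ldots, A}_{\alpha\ \text{times}} \tag{54.13}$$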

The combination arises if a motion detector is used to control the weighting factors of the pixel from the previous field and that of the value found by line averaging. A large value of $\alpha$ increases the probability of field insertion, while a large $\beta$ causes an increased probability of line averaging.
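A weighted median in the sense of Eqs. (54.11) and (54.13) simply repeats the weighted samples before ranking. The sketch below is for one pixel; the function name and default weights are assumptions. Note that `np.median` averages the two middle values when the total number of samples is even, so choosing `alpha + beta` odd keeps the output equal to one of the input samples.

```python
import numpy as np

def weighted_median_pixel(A, B, C, D, E, F, x_prev, alpha=1, beta=2):
    """Median over A..F, alpha copies of x_prev, and beta copies of (B + E) / 2."""
    line_average = 0.5 * (B + E)
    samples = [A, B, C, D, E, F] + [x_prev] * alpha + [line_average] * beta
    return float(np.median(samples))
```

A motion detector could raise `alpha` in stationary areas, pulling the output toward the previous-field pixel, and raise `beta` in moving areas, pulling it toward the line average.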

Although the examples in this section are limited to de-interlacing, it should be noted that proposals exist for field rate conversion as well.

Algorithms with a Motion Estimator

The idea to interpolate picture content in the direction in which it is most correlated can be extended to a three-dimensional case. This results in an interpolation along the motion trajectory. Figure 54.7 defines the motion trajectory as the line that connects identical picture parts in a sequence
