


EURASIP Journal on Embedded Systems

Volume 2006, Article ID 16035, Pages 1–12

DOI 10.1155/ES/2006/16035

A Real-Time Wavelet-Domain Video Denoising Implementation in FPGA

Mihajlo Katona, 1 Aleksandra Pižurica, 2 Nikola Teslić, 1 Vladimir Kovačević, 1 and Wilfried Philips 2

1 Chair for Computer Engineering, University of Novi Sad, Fruškogorska 11, 21000 Novi Sad, Serbia and Montenegro

2 Department of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium

Received 15 December 2005; Accepted 13 April 2006

The use of field-programmable gate arrays (FPGAs) for digital signal processing (DSP) has increased with the introduction of dedicated multipliers, which allow the implementation of complex algorithms. This architecture is especially effective for data-intensive applications with extremes in data throughput. Recent studies report that FPGAs offer better solutions for real-time multiresolution video processing than any available processor, DSP or general-purpose. FPGA design of critically sampled discrete wavelet transforms has been thoroughly studied in literature over recent years. Much less research has been done towards FPGA design of overcomplete wavelet transforms and advanced wavelet-domain video processing algorithms. This paper describes the parallel implementation of an advanced wavelet-domain noise filtering algorithm, which uses a nondecimated wavelet transform and spatially adaptive Bayesian wavelet shrinkage. The implemented arithmetic is decentralized and distributed over two FPGAs. The standard composite television video stream is digitized and used as a source for real-time video sequences. The results demonstrate the effectiveness of the developed scheme for real-time video processing.

Copyright © 2006 Mihajlo Katona et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Video denoising is important in numerous applications, such as television broadcasting systems, teleconferencing, video surveillance, and restoration of old movies. Usually, noise reduction can significantly improve the visual quality of a video as well as the effectiveness of subsequent processing tasks, like video coding.

Noise filters that aim at a high visual quality make use of both the spatial and the temporal redundancy of video. Such filters are known as spatio-temporal or three-dimensional (3D) filters. Often a 2D spatial filter and a 1D temporal filter are applied separately, and usually sequentially (because spatial denoising facilitates motion detection and estimation). The temporal filtering part is often realized in a recursive fashion in order to minimize the memory requirements. Numerous existing approaches range from lower complexity solutions, like 3D rational [1] and 3D order-statistic [2, 3] algorithms, to sophisticated Bayesian methods based on 3D Markov models [4, 5].

Multiresolution video denoising has become an increasingly popular research topic over recent years. Roosmalen et al. [6] proposed video denoising by thresholding the coefficients of a specific 3D multiresolution representation, which combines a 2D steerable pyramid decomposition (of the spatial content) and a 1D wavelet decomposition (in time). Related to this, Selesnick and Li [7] investigated wavelet thresholding in a nonseparable 3D dual-tree complex wavelet representation. Rusanovskyy and Egiazarian [8] developed an efficient video denoising method using a 3D sliding window in the discrete cosine transform domain. Other recent multiresolution schemes employ separable spatial/temporal filters, where the temporal filter is a motion-adaptive recursive filter. Such schemes were proposed, for example, by Pižurica et al. [9], where a motion-selective temporal filter follows the spatial one, and by Zlokolica et al. [10], where a motion-compensated temporal filter precedes the spatial one. Less research has been done so far towards hardware design of these multiresolution video denoising schemes.

The use of FPGAs for digital signal processing has increased with the introduction of dedicated multipliers, which facilitate the implementation of complex DSP algorithms. Such architectures are especially effective for data-intensive applications with extremes in data throughput. Using examples from video processing applications, Draper et al. [11] present a performance comparison of FPGAs and general-purpose processors. Similarly, Haj [12] illustrates two different wavelet implementations in FPGAs and compares these with general-purpose and DSP processors. Both studies come to the conclusion that FPGAs are far more suitable for real-time video processing in the wavelet domain than any available processor, DSP or general-purpose.

The hardware implementation of the wavelet transform is related to finite-impulse-response (FIR) filter design. Recently, the implementation of FIR filters has become quite common in FPGAs. A detailed guide for FPGA filter design is given in [13], and techniques for area-optimized implementation of FIR filters are presented, for example, in [14]. A number of different techniques for implementing the critically sampled discrete wavelet transform (DWT) in FPGAs exist [15–21], including the implementation of an MPEG-4 wavelet-based visual texture compression system [22]. Recently, the lifting scheme [23–25] has been introduced for real-time DWT [20, 26], as well as a very-large-scale-integration (VLSI) implementation of the DWT using embedded instruction codes for symmetric filters [27]. The lifting scheme is attractive for hardware implementations because it replaces multipliers with shift operations. FPGA implementations of overcomplete wavelet transforms are much less studied in literature.

Our initial techniques and results in FPGA implementation of wavelet-domain video denoising are in [28, 29]. These two studies focused on different aspects of the developed system: implementation of the wavelet transform and distributed computing over the FPGA modules in [28], and customization of a wavelet shrinkage function by look-up tables for implementation in read-only memories (ROMs) in [29]. The description was on a more abstract level, focusing on the main concepts and not on the details of the architectural design.

In this paper, we report a full architectural design of a real-time FPGA implementation of a video denoising algorithm based on an overcomplete (nondecimated) wavelet transform and employing sophisticated locally adaptive wavelet shrinkage. We propose a novel FIR filter design for the nondecimated wavelet transform based on the algorithm à trous [30]. The implemented spatial/temporal filter is separable, where a motion-adaptive recursive temporal filter follows the spatial filter, as was proposed in [9]. We present an efficient customization of the locally adaptive spatial wavelet filter using a combination of read-only memories (ROMs) and a dedicated address generation network. We design an efficient implementation of a local window for wavelet processing using an array of delay elements. Our design of the complete denoising scheme distributes computing over two FPGA modules, which switch their functionality in time: while one module performs the direct wavelet transform of the current frame, the other module is busy with the inverse wavelet transform of the previous frame. After each two frames, the functioning of the two modules is reversed. We present a detailed data flow of the proposed scheme. For low-to-moderate noise levels, the designed FPGA implementation yields a minor performance loss compared to the software version of the algorithm. This demonstrates the potential of FPGAs for real-time implementations of highly sophisticated and complex video processing algorithms.

The paper is organized as follows. Section 2 presents an overview of the proposed FPGA design, including the memory organization (Section 2.1) and data flow (Section 2.2). Section 3 details the FPGA design of the different building blocks in our video denoising scheme. We start with some preliminaries for the hardware design of the nondecimated wavelet transform (Section 3.1) and present the proposed pipelined FPGA implementation (Section 3.2). Next, we present the FPGA design of the locally adaptive wavelet shrinkage (Section 3.3) and finally the FPGA implementation of the motion-adaptive recursive temporal filter (Section 3.4). Section 4 presents the real-time environment used in this study. The conclusions are in Section 5.

2 REAL-TIME IMPLEMENTATION WITH FPGA

An overview of our FPGA implementation is illustrated in Figure 1. We use two independent modules working in parallel. Each module is implemented in a separate FPGA. While one module performs the wavelet decomposition of an input TV frame, the other module performs the inverse wavelet transform of the previous TV frame. The two modules switch their functionality in time. The wavelet-domain denoising block is located in front of the inverse wavelet transform. The proposed distributed algorithm implementation over the two modules allows effective logic decentralization with respect to input and output data streams. Namely, while one FPGA module is handling the input video stream and performing the wavelet decomposition, the other FPGA module is reading the wavelet coefficients for denoising, sending them to the wavelet reconstruction, and building up the visually improved output video stream.

The nondecimated wavelet transform demands significant memory resources. For example, in our implementation with three decomposition levels we need to store nine frames of wavelet coefficients for every input frame. In addition, we need an input memory buffer and an output buffer for isolating data accesses from different clock domains.

The input data stream is synchronized with a 13.5 MHz clock. For three decomposition levels, the complete wavelet decomposition and reconstruction has to be completed with a clock of at least 3 × 13.5 = 40.5 MHz. The set-up of our hardware platform requires the output data stream at 27 MHz. Table 1 lists the required interfaces of the buffers that are used in the system.
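The two budget figures quoted above follow from simple arithmetic, which the short sketch below reproduces. It assumes that each decomposition level costs one full pass over the frame within a single frame period and that the nine stored frames are the three detail subbands (LH, HL, HH) of each of the three levels; the function names are illustrative and not taken from the paper.

```python
PIXEL_CLOCK_MHZ = 13.5   # input stream clock, from the text
DETAIL_SUBBANDS = 3      # LH, HL, HH per decomposition level (assumed breakdown)


def required_clock_mhz(levels):
    """Minimum processing clock if every level is one pass over the frame."""
    return levels * PIXEL_CLOCK_MHZ


def coefficient_frames(levels):
    """Frames of detail coefficients stored per input frame (nondecimated)."""
    return levels * DETAIL_SUBBANDS


print(required_clock_mhz(3))   # 40.5
print(coefficient_frames(3))   # 9
```

Because the transform is nondecimated, every subband has the size of the input frame, which is why the coefficient memory grows linearly with the number of levels.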

The most critical timing issue is at the memory buffer for storing the wavelet coefficients. It has to provide simultaneous read and write access at 40.5 MHz. Because we lack an SDRAM controller that supports this timing, the whole processing is split into two independent parallel modules. The idea is to distribute the direct and the inverse wavelet processing between these modules. While one module is performing the wavelet decomposition of the current frame, the other module is performing the inverse wavelet transform of the previous frame. With this organization, one module reads and the other module writes the coefficients. The approximation subband (LL band) during the wavelet decomposition and composition is stored in the onboard SRAM memory. This allows us to use only the read port or only the write port of the memory during one frame.

Figure 1: A detail of the FPGA implementation of the proposed wavelet-domain video denoising algorithm. (Blocks: two wavelet coefficient RAMs, control modules, temporal filter; clock domains: 13.5 MHz and 27 MHz.)

Table 1: Memory interfaces.

Buffers                        Write port (MHz)    Read port (MHz)
Wavelet coefficients buffer    40.5                40.5

The data flow through all the memory buffers and both FPGAs in our scheme is shown in Figure 2. The total delay is 4 frames. During the first 20 milliseconds, the input frame $A_0$ is stored in the input buffer at a clock rate of 13.5 MHz. During the next 20 milliseconds, this frame is read from the input buffer and is wavelet transformed in a 40.5 MHz clock domain, with 3 decomposition scales $W_1(A_0)$, $W_2(A_0)$, and $W_3(A_0)$. In parallel to this process, the next frame $A_1$ is written to the input buffer. The following time slot of 20 milliseconds is currently not used for processing $A_0$, but is reserved for future additional processing in the wavelet domain. Within this period the frame $A_1$ is read from the input buffer and is decomposed into its wavelet coefficients. The frames $A_0$ and $A_1$ are processed by FPGA1. The next input frame, $A_2$, is written to the input buffer, and is wavelet transformed in the next time frame by FPGA2.

The denoising and the inverse wavelet transform of the frame $A_0$ are performed afterwards. During this period the wavelet coefficients of the frame $A_0$ are read from the memory and denoised, and the output frame is reconstructed with the inverse wavelet transform $W^{-1}(A_0)$. During the last reconstruction stage (the reconstruction at the finest wavelet scale), the denoised output frame is written to the output memory buffer. Parallel to this process, FPGA2 performs the wavelet decomposition of the frame $A_2$ and the input frame $A_3$ is stored in the input buffer.

Finally, 4 × 20 milliseconds = 80 milliseconds after the frame $A_0$ appeared at the system input (4 frames later), it is read from the output buffer in a 27 MHz clock domain and is sent to the selective recursive temporal filter and to the system output afterwards. The output data stream is aligned with a 100 Hz refresh rate, which means that the same frame is sent twice to the output within one time frame of 20 milliseconds. Additionally, FPGA2 performs the wavelet decomposition of the frame $A_3$. Further on, the frame $A_4$ is written to the input buffer and is decomposed in the following time frame under the control of FPGA1.

In this scheme, the two FPGAs actually switch their functionality after each two frames. FPGA1 performs the wavelet decomposition for two frames, while FPGA2 performs the inverse wavelet transform of the previous two frames. After two frames, this is reversed.
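The alternation described above can be sketched as a small scheduling model. It assumes 20 ms time slots as in the text, and it additionally assumes that the FPGA which decomposed a frame is also the one that later reconstructs it; that is our reading of the data flow rather than something the text states outright. The function and key names are hypothetical.

```python
SLOT_MS = 20  # one time slot, as in the text


def fpga_for_frame(i):
    """FPGA handling frame A_i; the assignment swaps every two frames."""
    return 1 if (i // 2) % 2 == 0 else 2


def schedule(i):
    """Time slots (in units of SLOT_MS) at which A_i reaches each stage."""
    return {
        "write_input_buffer": i,
        "forward_transform": i + 1,
        "denoise_and_inverse": i + 3,   # slot i + 2 is reserved, per the text
        "read_output_buffer": i + 4,
        "fpga": fpga_for_frame(i),
    }


for i in range(5):
    s = schedule(i)
    latency = (s["read_output_buffer"] - s["write_input_buffer"]) * SLOT_MS
    print(f"A{i}: FPGA{s['fpga']}, forward in slot {s['forward_transform']}, "
          f"inverse in slot {s['denoise_and_inverse']}, latency {latency} ms")
```

Under these assumptions, the frame being decomposed in any slot and the frame being reconstructed in the same slot always belong to different FPGAs, so each coefficient memory is either only written or only read during a slot.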

REAL-TIME PROCESSING

We design an FPGA implementation of the sequential spatial/temporal video denoising scheme from [9], which is depicted in Figure 3. Note that we use an overcomplete (nondecimated) wavelet transform to guarantee high-quality spatial denoising. In this representation, with three decomposition levels the number of wavelet coefficients is 9 times the input image size. Therefore we choose to perform the temporal filtering in the image domain (after the inverse wavelet transform) in order to minimize the memory requirements.

While hardware implementations of the orthogonal wavelet transform have been extensively studied in literature [16–21, 26, 27], much less research has been done towards implementations of the nondecimated wavelet transform.

We develop a hardware implementation of the nondecimated wavelet transform based on the algorithm à trous [30] and with the classical three orientation subbands per scale. This algorithm upsamples the wavelet filters at each decomposition level. In particular, $2^j - 1$ zeros ("holes," in French, trous) are inserted between the filter coefficients at the decomposition level $j$, as shown in Figure 4. Here $j = 0$ corresponds to the first decomposition level, so that the second level has one zero and the third level has three zeros between the filter coefficients.

Figure 2: The data flow of wavelet processing. (Rows: read and write timelines of the memory buffers, FPGA1 fields, and FPGA2 fields, for the direct and inverse wavelet transforms. Legend: $W_j(A_i)$, wavelet decomposition at scale $j$; $W^{-1}(A_i)$, wavelet reconstruction of the frame $A_i$; $A_i$, processing frame with index $i$.)

Figure 3: The implemented denoising scheme. (Blocks: 2D wavelet transform, denoising by wavelet shrinkage, inverse 2D wavelet transform, pixel-based motion detector, selective recursive filter.)
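The filter upsampling of the algorithm à trous can be illustrated with a minimal sketch. It counts decomposition levels from 1, so that level 1 uses the original filter, level 2 has one zero between coefficients, and level 3 has three. The low-pass coefficients used here are the standard length-four Daubechies values; the function name is illustrative and not from the paper.

```python
import math


def upsample_filter(taps, level):
    """Insert 2**(level - 1) - 1 zeros between consecutive taps (level >= 1)."""
    holes = 2 ** (level - 1) - 1
    out = []
    for k, tap in enumerate(taps):
        out.append(tap)
        if k < len(taps) - 1:
            out.extend([0.0] * holes)
    return out


# Standard Daubechies length-4 low-pass filter (the coefficients sum to sqrt(2)).
s3, norm = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
h = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]

for level in (1, 2, 3):
    up = upsample_filter(h, level)
    nonzero_positions = [i for i, v in enumerate(up) if v != 0.0]
    print(level, len(up), nonzero_positions)
```

The nonzero positions come out as 0, 1, 2, 3 at level 1, then 0, 2, 4, 6 at level 2, and 0, 4, 8, 12 at level 3, so the effective filter length grows while the number of multiplications stays at four.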

We use the SystemC library [31] and a previously developed simulation environment [32, 33] to develop a real-time model of the wavelet decomposition and reconstruction. Figure 5 shows the simulation model. After a number of simulations and tests we have concluded that the real-time wavelet implementation with 16 bit arithmetic gives practically the same results as a reference MATLAB code of the algorithm à trous [30]. Over a number of input frames, more than 97.13% of the pixels were error-free, with a mean error of 0.0287. Analyzing those figures at the level of bit representation, we can conclude that at most 1 bit out of 16 was wrong. The wrong bit may occur at bit position 0 shown in Figure 6. Taking into account that the input pixels are 8 bit integers, we can ignore this error.

the nondecimated wavelet transform

Here we develop an FPGA implementation of a nondecimated wavelet transform with three orientation subbands per scale. We design FIR filters for the algorithm à trous [30] with the Daubechies minimum-phase wavelet of length four [34], and we implement the designed FIR filters with dedicated multipliers in the Xilinx Virtex2 FPGAs [35].

Figure 4: The nondecimated 2D discrete wavelet transform. (Filters $H_j(z)$ with $2^j - 1$ "trous" (holes) between coefficients; output subbands LL, LH, HL.)

Figure 5: The developed simulation model for the implementation of the wavelet transform. (2D wavelet transformation followed by 2D inverse wavelet transformation; 8 bit input and output data, 16 bit LL, LH, HL, and HH subbands.)

Our implementation of the 2D wavelet transform is line-based, as shown in Figure 7. We choose the line alignment in order to preserve the video sequence input format and to pipeline the whole processing in our system. The horizontal and the vertical filtering are performed within one pass of the input video stream. We avoid using independent horizontal and vertical processing, which requires two cycles and an internal memory for storing the output of the horizontal filtering. Instead, we use line-based vertical filtering with as many internal line buffers as there are taps in the FIR filter used.

The horizontal and vertical FIR filters differ only in the filter delay path implementation. The data path of the horizontal filter is a register pipeline, as shown in Figure 8. The data path of the vertical filter is the output of the line buffers. Hence, the vertical FIR filter does not include any delay elements, but only the pipelined filtering arithmetic (multipliers and an adder). Pipelining the filtering arithmetic ensures the requested timing for data processing, and we use this approach both for the horizontal and vertical filters.

The algorithm à trous [30] upsamples the wavelet filters by inserting $2^j - 1$ zeros between the filter coefficients at the decomposition level $j$ (see Figure 4). We implement this filter upsampling by using a longer filter delay path and the appropriate data selection logic. The required number of registers depends on the length of the mother wavelet function and on the number of decomposition levels used. We use a wavelet of length four and three decomposition levels, and hence our horizontal filter in Figure 8 contains 3 × 4 = 12 registers. Four registers are dedicated to the 4-tap filter and 3 times as many are needed to implement the required upsampling up to the third decomposition level. Analogously, on the vertical filtering side, each line buffer for vertical filtering is able to store up to 4 lines.

For the calculation of the first decomposition level of the wavelet transform, only the first 4 registers d0, d1, d2, and d3 in Figure 8 are used in the FIR filter register pipeline. At the second decomposition level, the wavelet filters have to be upsampled with 1 zero between the filter coefficients. In our implementation, this means that registers d0, d2, d4, and d6 are used for filtering. Figure 8 illustrates the FIR filter configuration during the calculation of the wavelet coefficients from the third decomposition level. During this period, the d0, d4, d8, and d12 registers are involved in the filtering process.
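The register selection described above can be modelled in software as a single delay line whose tap positions depend on the decomposition level. The sketch below is a behavioural model only, not the hardware description: it keeps positions d0 to d12 and treats d0 as the newest sample, which is an assumption about how the registers in Figure 8 are counted. The class name, the helper `tap_indices`, and the integer placeholder coefficients are all hypothetical.

```python
from collections import deque


def tap_indices(level, num_taps=4):
    """Delay-line positions used at a given level (levels counted from 1)."""
    stride = 2 ** (level - 1)
    return [k * stride for k in range(num_taps)]


class ATrousFir:
    """One delay line shared by all levels; only the tap selection changes."""

    def __init__(self, coeffs, max_level=3):
        self.coeffs = list(coeffs)
        depth = tap_indices(max_level, len(self.coeffs))[-1] + 1  # d0..d12
        self.delay = deque([0] * depth, maxlen=depth)

    def step(self, sample, level):
        self.delay.appendleft(sample)  # d0 holds the newest sample
        taps = tap_indices(level, len(self.coeffs))
        return sum(c * self.delay[i] for c, i in zip(self.coeffs, taps))


# Impulse response at level 3 with placeholder coefficients a(0)..a(3) = 1..4.
fir = ATrousFir([1, 2, 3, 4])
response = [fir.step(x, level=3) for x in [1] + [0] * 12]
print(tap_indices(1), tap_indices(2), tap_indices(3))
print(response)
```

The impulse response at level 3 shows the four coefficients separated by three zeros, which is the upsampled filter obtained without storing any zero coefficients or spending multipliers on them.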

Figure 6: Input and output data format.

Figure 7: A block schematic of the developed hardware implementation of the wavelet transform. (Blocks: video input; 4-tap horizontal FIR filters L and H; four line buffers; 4-tap vertical FIR filters LL, LH, HL, and HH, producing the LL, LH, HL, and HH subbands.)

We implement the inverse wavelet transform accordingly. The processing is mirrored when compared to the wavelet decomposition: the vertical filtering is done first and the horizontal processing afterwards. The FIR filter design is the same as for the direct wavelet transform, only the filter coefficients a(0), a(1), a(2), and a(3) in Figure 8 are mirrored.

Our video denoising scheme employs the spatially adaptive wavelet shrinkage approach of [36]. A brief description of this denoising method follows.

Let $y_l$ denote the noise-free wavelet coefficient and $w_l$ its observed noisy version at the spatial position $l$ in a given wavelet subband. For compactness, we suppress here the indices that denote the scale and the orientation. The method of [36] shrinks each wavelet coefficient by a factor which equals the probability that this coefficient presents a signal of interest. The signal of interest is defined as a noise-free signal component that exceeds in magnitude the standard deviation of noise $\sigma$. The probability of the presence of a signal of interest at position $l$ is estimated based on the coefficient magnitude $|w_l|$ and based on a local spatial activity indicator $z_l = \frac{1}{N_l}\sum_{k \in \partial l} |w_k|$, where $\partial l$ is the neighborhood of the pixel $l$ (within a square window) and $N_l$ is the number of the neighboring coefficients. For example, for a 3×3 window $\partial l$ consists of the 8 nearest neighbors of the pixel $l$ ($N_l = 8$). Let $H_1$ denote the hypothesis "the signal of interest is present": $|y_l| > \sigma$, and let $H_0$ denote the opposite hypothesis: $|y_l| \le \sigma$. The shrinkage estimator of [9] is

$$\hat{y}_l = P(H_1 \mid w_l, z_l)\, w_l = \frac{\rho\, \xi_l\, \eta_l}{1 + \rho\, \xi_l\, \eta_l}\, w_l, \tag{1}$$

where

$$\rho = \frac{P(H_1)}{P(H_0)}, \qquad \xi_l = \frac{p(w_l \mid H_1)}{p(w_l \mid H_0)}, \qquad \eta_l = \frac{p(z_l \mid H_1)}{p(z_l \mid H_0)}. \tag{2}$$

$p(w_l \mid H_0)$ and $p(w_l \mid H_1)$ denote the conditional probability density functions of the noisy coefficients given the absence and given the presence of a signal of interest. Similarly, $p(z_l \mid H_0)$ and $p(z_l \mid H_1)$ denote the corresponding conditional probability density functions of the local spatial activity indicator. The input-output characteristic of this wavelet denoiser is illustrated in Figure 9. This figure shows that the coefficients that are small in magnitude are strongly shrunk towards zero, while the largest ones tend to be left unchanged. The displayed family of shrinkage characteristics corresponds to different values of the local spatial activity indicator (LSAI). For the same coefficient magnitude $|w_l|$, the input coefficient will be shrunk less if the LSAI $z_l$ is bigger, and vice versa.

We now address the implementation of this shrinkage function. Under the Laplacian prior for noise-free data $p(y) = (\lambda/2)\exp(-\lambda|y|)$ we have [9] $\rho = \exp(-\lambda T)/(1 - \exp(-\lambda T))$. The analytical expressions for $\xi_l$ and $\eta_l$ seem too complex for the FPGA implementation. We efficiently implement the two likelihood ratios $\xi_l$ and $\eta_l$ as appropriate look-up tables, stored in two read-only memories (ROMs). The generation of the particular look-up tables is based on an extensive experimental study, as we explain later in this section. The developed architecture is presented in Figure 10. One ROM, containing the look-up table $\xi_l$, is addressed by the coefficient magnitude $|w_l|$, and the other ROM, containing the look-up table $\rho\eta_l$, is addressed by the LSAI $z_l$. For calculating the LSAI, we average the coefficient magnitudes from the current line and from the previous two lines within a 3×3 window. The values read from the ROMs are multiplied to produce the generalized likelihood ratio $r = \rho\xi_l\eta_l$. We found it more efficient to realize the shrinkage factor $r/(1 + r)$ using another ROM (look-up table) instead of using arithmetic operations. The output of this look-up table, denoted here as the "shrinkage ROM," is the desired wavelet shrinkage factor. Finally, the output of the shrinkage ROM multiplies the input coefficient to yield the denoised coefficient.

Figure 8: The proposed FIR filter implementation of the algorithm à trous for a mother wavelet of length 4 and supporting up to 3 decomposition levels. The particular arithmetic network using the registers d0, d4, d8, and d12 corresponds to the calculation of the wavelet coefficients at the third decomposition level (scale = 3).

Figure 9: An illustration of the employed wavelet shrinkage family. (Horizontal axis: noisy input coefficient; one curve per LSAI value.)
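The look-up-table pipeline described above can be sketched in software as follows. The table contents are the important caveat: the paper derives its tables from experiments on test images, whereas the two tables below are filled with made-up monotonically increasing functions purely so that the sketch runs. The 256-entry table size, the address clipping, and all function names are likewise assumptions. The shrinkage factor $r/(1 + r)$ is computed arithmetically here, where the paper uses a third ROM.

```python
ROM_SIZE = 256  # hypothetical table depth


def build_rom(func):
    """Tabulate func over the integer addresses 0..ROM_SIZE-1."""
    return [func(a) for a in range(ROM_SIZE)]


# Placeholder tables (NOT the paper's data): any increasing curves would do.
xi_rom = build_rom(lambda m: (m / 20.0) ** 2)             # addressed by |w_l|
rho_eta_rom = build_rom(lambda z: 0.5 * (z / 20.0) ** 2)  # addressed by LSAI


def lsai(window):
    """Mean magnitude of the 8 neighbours in a 3x3 window (centre excluded)."""
    mags = [abs(v) for row in window for v in row]
    mags.pop(4)  # drop the centre coefficient
    return sum(mags) / len(mags)


def shrink(window):
    """Denoise the centre coefficient of a 3x3 window of wavelet coefficients."""
    w = window[1][1]
    addr_w = min(int(abs(w)), ROM_SIZE - 1)
    addr_z = min(int(lsai(window)), ROM_SIZE - 1)
    r = xi_rom[addr_w] * rho_eta_rom[addr_z]  # generalized likelihood ratio
    return (r / (1.0 + r)) * w                # the paper uses a shrinkage ROM here


flat = [[2, -2, 2], [-2, 30, 2], [2, -2, 2]]         # isolated coefficient
busy = [[40, -40, 40], [-40, 30, 40], [40, -40, 40]]  # active neighbourhood
print(shrink(flat), shrink(busy))
```

With these placeholder tables, the same centre coefficient is almost removed in the flat neighbourhood and largely preserved in the active one, which is the qualitative behaviour that the family of curves in Figure 9 describes.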

We denoise in parallel the three wavelet bands LH, HL, and HH at each scale. Different resolution levels (we use three) are processed sequentially, as illustrated in Figure 2. The low-pass (LL) band is only delayed for the number of clock periods that are needed for denoising. This delay, which is 6 clock cycles in our implementation, ensures the synchronization of the inputs at the inverse wavelet transform block (see the timing in Figure 2).

The generation of the appropriate look-up tables for the two likelihood ratios resulted from our extensive experiments on different test images and different noise levels, as described in [29]. Figure 11 illustrates the likelihood ratio $\xi_l$ calculated from one test image at different noise levels. These diagrams show another interpretation of the well-known threshold selection principle in wavelet denoising: a well-chosen threshold value for the wavelet coefficients increases with the increase of the noise level. The maximum likelihood estimate of the threshold $T$ (i.e., the value for which $p(T \mid H_0) = p(T \mid H_1)$) is the abscissa of the point $\xi_l = 1$. Figure 12 displays the likelihood ratio $\xi_l$, in the diagonal subband HH at the third decomposition level, for 10 different frames with fixed noise standard deviations ($\sigma = 10$ and $\sigma = 30$). We showed in [29] that, from a practical point of view, the difference between the calculated likelihood ratios for different frames is minor, especially for lower noise levels (up to $\sigma = 20$). Therefore we average the likelihood ratios over different frames and store these values as the corresponding look-up tables for several different noise levels ($\sigma = 5$, 10, 15, and 20). In the denoising procedure, the user selects the input noise level, which enables addressing the correct set of look-up tables. The performance loss of the algorithm due to simplifications with the generated look-up tables is shown in Figure 13 for different input noise levels. These results represent peak signal-to-noise ratio (PSNR) values averaged over frames of several different video sequences. For $\sigma = 10$ the average performance loss was only 0.13 dB (and visually, the differences are difficult to notice), while for $\sigma = 20$ the performance loss is 0.55 dB and becomes visually noticeable on most frames, but is not highly disturbing. For higher noise levels, the performance loss increases.

Figure 10: Block schematic of the implemented denoising architecture. (Blocks: FIFO line buffers; LSAI coefficient magnitude window; ABS; address generation combinatorial network; ROMs KSI and ETA; shrinkage ROM; input coefficient and output coefficient.)

Figure 11: Likelihood ratio $\xi_l$ for one test frame and 4 different noise levels ($\sigma = 5$, 10, 20, 30), subband HH.

In the current implementation, the user has to select one of the available noise levels. With this approach, it is possible that the user will not choose the best possible noise reduction. If the selected noise level is smaller than the real noise level in the input signal, some of the noise will remain in the output signal. On the other hand, if the noise level is overestimated, the output signal will be blurred and the visual effect will be unsatisfactory.

This user intervention can be avoided by implementing a noise level estimator. The output of this block could be used for the look-up table selection, which further enables adjustable noise reduction according to the noise level in the input signal. For example, a robust wavelet-domain noise estimator based on the median absolute deviation [37] can be used for this purpose, or other related wavelet-domain noise estimators like [38].
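As an illustration of how such an estimator could drive the table selection, the sketch below uses the commonly cited form of the median-absolute-deviation estimate, the median of the coefficient magnitudes in the finest diagonal subband divided by 0.6745, and then picks the nearest of the four stored noise levels. Whether [37] would be applied in exactly this form, and whether the subband needs an extra scale factor for the filter normalization of the nondecimated transform, are assumptions that the text does not settle. The function names are hypothetical, and the coefficients are simulated as pure Gaussian noise.

```python
import random
from statistics import median

AVAILABLE_SIGMAS = (5, 10, 15, 20)  # noise levels with stored tables, per the text


def estimate_sigma(hh_coeffs):
    """MAD-style noise estimate from finest-scale diagonal (HH) coefficients."""
    return median(abs(c) for c in hh_coeffs) / 0.6745


def select_lut_level(sigma_hat):
    """Choose the stored noise level closest to the estimate."""
    return min(AVAILABLE_SIGMAS, key=lambda s: abs(s - sigma_hat))


# Simulated HH subband containing only Gaussian noise of standard deviation 10.
rng = random.Random(0)
hh = [rng.gauss(0.0, 10.0) for _ in range(20000)]
sigma_hat = estimate_sigma(hh)
print(round(sigma_hat, 2), select_lut_level(sigma_hat))
```

In hardware the selected level would simply replace the value that the user currently writes to the noise level register.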

The likelihood ratios $\xi_l$ and $\eta_l$ are monotonically increasing functions. We are currently investigating the approximation of these functions by a family of piecewise linear functions parameterized by the noise standard deviation and by the parameter of the marginal statistical distribution of the noise-free coefficients in a given subband.

A pixel-based motion detector with selective recursive temporal filtering is quite simple for hardware implementation. Since we first apply high-quality spatial filtering, the noise is already significantly suppressed and thus a pixel-based motion detection is efficient. In case motion is detected, the recursive filtering is switched off.

Two pixels are involved in temporal filtering at a time: one pixel from the current field and another from the same spatial position in the previous field. We store the two fields in the output buffer and read both required pixel values in the same cycle. If the absolute difference between these two pixel values is smaller than the predefined threshold value, the no-motion case is assumed and the two pixel values are subject to a weighted averaging, with the weighting factors defined in [9]. In the other case, when motion is detected, the current pixel is passed to the output. The block schematic in Figure 14 depicts the developed FPGA architecture of the selective recursive temporal filter described above. We use 8 bit arithmetic because the filter is located in the time domain, where all the pixels are represented as 8 bit integers.
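A behavioural sketch of this pixel-based selective filter is given below. The threshold and the weight are hypothetical placeholders, since the actual weighting factors are defined in [9] and not reproduced here. The sketch also assumes that the previous field is the previously filtered output, which is what "recursive" usually implies. The weight is expressed as an integer out of 256 so that the averaging stays in integer arithmetic.

```python
def temporal_filter(current, previous, threshold=10, weight=128):
    """Filter one field of 8 bit pixels against the previous (filtered) field.

    threshold and weight (out of 256) are illustrative values only.
    """
    out = []
    for cur, prev in zip(current, previous):
        if abs(cur - prev) < threshold:
            # No motion: weighted average, rounded, in integer arithmetic.
            out.append((weight * cur + (256 - weight) * prev + 128) >> 8)
        else:
            # Motion detected: pass the current pixel unchanged.
            out.append(cur)
    return out


def filter_sequence(fields, **kwargs):
    """Run the recursion over a sequence of fields."""
    filtered = [list(fields[0])]
    for field in fields[1:]:
        filtered.append(temporal_filter(field, filtered[-1], **kwargs))
    return filtered


print(temporal_filter([100, 100], [104, 200]))  # [102, 100]
```

The first pixel differs from its predecessor by 4, so it is averaged; the second differs by 100, so it is treated as motion and left untouched.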

In our implementation we use the standard television broadcasting signal as the source of the video signal. A common feature of all standard TV broadcasting technologies is that the video sequence is transmitted in the analog domain (this excludes the latest DVB and HDTV transmission standards). Thus, before digital processing of a television video sequence, digitization is needed. Also, after digital processing the sequence has to be converted back to the analog domain in order to be shown on a standard tube display. This pair of A/D and D/A converters is well known as a codec. An 8 bit codec, with 256 levels of quantization per pixel, is considered sufficient from the visual quality point of view. Figure 15 shows a block schematic of digital processing for television broadcasting systems.

We use the PAL-B broadcasting standard and an 8 bit YUV 4:2:2 codec. The hardware platform set-up consists of three separate boards. Each board corresponds to one of the blocks presented in Figure 15:

(i) Micronas IMAS-VPC 1.1 (A/D, analog front-end) [39];

(ii) CHIPit Professional Gold Edition (processing block) [40];

(iii) Micronas IMAS-DDPB 1.0 (D/A, analog back-end) [41].

Figure 12: Likelihood ratio $\xi_l$ displayed for 10 frames with fixed noise levels: $\sigma = 10$ (a) and $\sigma = 30$ (b), subband HH.

Figure 13: Performance of the designed FPGA implementation in comparison with the original software version of the algorithm, which employs exact analytical calculation of the involved shrinkage expression. (PSNR comparison; horizontal axis: standard deviation of added noise; curves: original with 3 decomposition levels, FPGA implementation.)

We made all the connections among the previously mentioned boards with a separate interconnection board designed for this purpose. This interconnection board consists of the interconnection channels and the voltage adjustments between the CHIPit board (3.3 V level) and the Micronas IMAS boards (5 V level).

The processing board consists of two Xilinx Virtex II FPGAs (XC2V6000-5) [35] and is equipped with plenty of SDRAM memory (6 banks with 32-bit access, built from 256 Mbit ICs).

All boards of the hardware platform are configured via the I2C interface. The user can set the noise level of the input signal by writing an appropriate value to the corresponding register in the FPGA, which is accessible via the I2C interface. The look-up table with the averaged likelihood ratio is then selected according to the value in this register.
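The register-driven table selection can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the set of supported noise levels, the table names, and the nearest-level matching rule are hypothetical, since the text only states that the look-up table is selected according to the register value. The table contents are placeholders, not data from the implementation.

```python
# Hypothetical mapping from noise standard deviation to a stored
# look-up table of averaged likelihood ratios (placeholder names only).
LIKELIHOOD_LUTS = {
    10: "lut_sigma_10",
    20: "lut_sigma_20",
    30: "lut_sigma_30",
}


def select_lut(register_value, luts=LIKELIHOOD_LUTS):
    """Return (noise_level, table) for the level nearest the register value.

    Nearest-level matching is an assumed policy; ties resolve to the
    lower noise level.
    """
    nearest = min(luts, key=lambda sigma: (abs(sigma - register_value), sigma))
    return nearest, luts[nearest]
```

In this model, writing 12 to the register would select the table for the nearest supported level, 10. A hardware version would more likely decode the register value directly into a table address.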

We designed a real-time FPGA implementation of an advanced wavelet-domain video denoising algorithm. The developed hardware architecture is based on innovative technical solutions that allow an implementation of sophisticated adaptive wavelet denoising in hardware. We believe that the results reported in this paper can be of interest for a number of industrial applications, including TV broadcasting systems. Our current implementation has limitations in practical use because it requires user intervention for noise level estimation. Our future work will integrate noise level estimation to remove this limitation and to allow automatic adaptation of the denoiser to noise level changes in the input signal.


Figure 14: Block schematic of implemented temporal filter. Block labels: pixel from current field, pixel from previous field, ABS(A-B), comparison A<B against the threshold, weights 0.6 and 0.4, adder (+), output.

Figure 15: A digital processing system for television broadcasting video sequences. Block labels: input video sequence, A/D, digital processing, D/A, output video sequence; the signals entering and leaving the digital processing block are labeled nT.

ACKNOWLEDGMENT

The second author is a Postdoctoral Researcher of the Fund for the Scientific Research in Flanders (FWO), Belgium.

REFERENCES

[1] F. Cocchia, S. Carrato, and G. Ramponi, “Design and real-time implementation of a 3-D rational filter for edge preserving smoothing,” IEEE Transactions on Consumer Electronics, vol. 43, no. 4, pp. 1291–1300, 1997.

[2] G. Arce, “Multistage order statistic filters for image sequence processing,” IEEE Transactions on Signal Processing, vol. 39, no. 5, pp. 1146–1163, 1991.

[3] V. Zlokolica and W. Philips, “Motion and detail adaptive denoising of video,” in Image Processing: Algorithms and Systems III, vol. 5298 of Proceedings of SPIE, pp. 403–412, San Jose, Calif, USA, January 2004.

[4] L. Hong and D. Brzakovic, “Bayesian restoration of image sequences using 3-D Markov random fields,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’89), vol. 3, pp. 1413–1416, Glasgow, UK, May 1989.

[5] J. Brailean and A. Katsaggelos, “Simultaneous recursive displacement estimation and restoration of noisy-blurred image sequences,” IEEE Transactions on Image Processing, vol. 4, no. 9, pp. 1236–1251, 1995.

[6] P. van Roosmalen, S. Westen, R. Lagendijk, and J. Biemond, “Noise reduction for image sequences using an oriented pyramid thresholding technique,” in IEEE International Conference on Image Processing, vol. 1, pp. 375–378, Lausanne, Switzerland, September 1996.

[7] I. Selesnick and K. Li, “Video denoising using 2D and 3D dual-tree complex wavelet transforms,” in Wavelets: Applications in Signal and Image Processing X, vol. 5207 of Proceedings of SPIE, pp. 607–618, San Diego, Calif, USA, August 2003.

[8] D. Rusanovskyy and K. Egiazarian, “Video denoising algorithm in sliding 3D DCT domain,” in Proceedings of the 7th International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS ’05), J. Blanc-Talon, W. Philips, D. Popescu, and P. Scheunders, Eds., vol. 3708 of Lecture Notes in Computer Science, pp. 618–625, Antwerp, Belgium, September 2005.

[9] A. Pižurica, V. Zlokolica, and W. Philips, “Noise reduction in video sequences using wavelet-domain and temporal filtering,” in Wavelet Applications in Industrial Processing, vol. 5266 of Proceedings of SPIE, pp. 48–59, Providence, RI, USA, October 2003.

[10] V. Zlokolica, A. Pižurica, and W. Philips, “Video denoising using multiple class averaging with multiresolution,” in The International Workshop on Very Low Bitrate Video Coding (VLBV ’03), pp. 172–179, Madrid, Spain, September 2003.

[11] B. A. Draper, J. R. Beveridge, A. P. W. Bohm, C. Ross, and M. Chawathe, “Accelerated image processing on FPGAs,” IEEE Transactions on Image Processing, vol. 12, no. 12, pp. 1543–1551, 2003.

[12] A. M. Al-Haj, “Fast discrete wavelet transformation using FPGAs and distributed arithmetic,” International Journal of Applied Science and Engineering, vol. 1, no. 2, pp. 160–171, 2003.

[13] G. Goslin, “A guide to using field programmable gate arrays (FPGAs) for application-specific digital signal processing performance,” XILINX Inc., 1995.

[14] C. Dick, “Implementing area optimized narrow-band FIR filters using Xilinx FPGAs,” in Configurable Computing: Technology and Applications, vol. 3526 of Proceedings of SPIE, pp. 227–238, Boston, Mass, USA, November 1998.

[15] R. D. Turney, C. Dick, and A. Reza, “Multirate filters and wavelets: from theory to implementation,” XILINX Inc.

[16] J. Ritter and P. Molitor, “A pipelined architecture for partitioned DWT based lossy image compression using FPGA’s,” in ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’01), pp. 201–206, Monterey, Calif, USA, February 2001.
