EURASIP Journal on Embedded Systems
Volume 2006, Article ID 16035, Pages 1–12
DOI 10.1155/ES/2006/16035
A Real-Time Wavelet-Domain Video Denoising
Implementation in FPGA
Mihajlo Katona, 1 Aleksandra Pižurica, 2 Nikola Teslić, 1 Vladimir Kovačević, 1 and Wilfried Philips 2
1 Chair for Computer Engineering, University of Novi Sad, Fruškogorska 11, 21000 Novi Sad, Serbia and Montenegro
2 Department of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
Received 15 December 2005; Accepted 13 April 2006
The use of field-programmable gate arrays (FPGAs) for digital signal processing (DSP) has increased with the introduction of dedicated multipliers, which allow the implementation of complex algorithms. This architecture is especially effective for data-intensive applications with extreme data-throughput requirements. Recent studies show that FPGAs offer better solutions for real-time multiresolution video processing than any available processor, DSP or general-purpose. FPGA design of critically sampled discrete wavelet transforms has been thoroughly studied in the literature over recent years. Much less research has been done towards FPGA design of overcomplete wavelet transforms and advanced wavelet-domain video processing algorithms. This paper describes the parallel implementation of an advanced wavelet-domain noise filtering algorithm, which uses a nondecimated wavelet transform and spatially adaptive Bayesian wavelet shrinkage. The implemented arithmetic is decentralized and distributed over two FPGAs. The standard composite television video stream is digitized and used as a source for real-time video sequences. The results demonstrate the effectiveness of the developed scheme for real-time video processing.
Copyright © 2006 Mihajlo Katona et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Video denoising is important in numerous applications, such as television broadcasting systems, teleconferencing, video surveillance, and restoration of old movies. Noise reduction can significantly improve the visual quality of a video as well as the effectiveness of subsequent processing tasks, such as video coding.
Noise filters that aim at a high visual quality make use of both the spatial and the temporal redundancy of video. Such filters are known as spatio-temporal or three-dimensional (3D) filters. Often a 2D spatial filter and a 1D temporal filter are applied separately, and usually sequentially (because spatial denoising facilitates motion detection and estimation). The temporal filtering part is often realized in a recursive fashion in order to minimize the memory requirements. Existing approaches range from lower-complexity solutions, like 3D rational [1] and 3D order-statistic [2, 3] algorithms, to sophisticated Bayesian methods based on 3D Markov models [4, 5].
Multiresolution video denoising has become an increasingly popular research topic over recent years. Roosmalen et al. [6] proposed video denoising by thresholding the coefficients of a specific 3D multiresolution representation, which combines a 2D steerable pyramid decomposition (of the spatial content) and a 1D wavelet decomposition (in time). Related to this, Selesnick and Li [7] investigated wavelet thresholding in a nonseparable 3D dual-tree complex wavelet representation. Rusanovskyy and Egiazarian [8] developed an efficient video denoising method using a 3D sliding window in the discrete cosine transform domain. Other recent multiresolution schemes employ separable spatial/temporal filters, where the temporal filter is a motion-adaptive recursive filter. Such schemes were proposed, for example, by Pižurica et al. [9], where a motion-selective temporal filter follows the spatial one, and by Zlokolica et al. [10], where a motion-compensated temporal filter precedes the spatial one. Less research has been done so far towards hardware design of these multiresolution video denoising schemes.
The use of FPGAs for digital signal processing has increased with the introduction of dedicated multipliers, which facilitate the implementation of complex DSP algorithms. Such architectures are especially effective for data-intensive applications with extreme data-throughput requirements. Using video processing applications as examples, Draper et al. [11] present a performance comparison of FPGAs and general-purpose processors. Similarly, Haj [12] illustrates two different wavelet implementations in FPGAs and compares these with general-purpose and DSP processors. Both studies conclude that FPGAs are far more suitable for real-time video processing in the wavelet domain than any available processor, DSP or general-purpose.
The hardware implementation of the wavelet transform is closely related to finite-impulse-response (FIR) filter design. Recently, the implementation of FIR filters in FPGAs has become quite common. A detailed guide to FPGA filter design is given in [13], and techniques for area-optimized implementation of FIR filters are presented, for example, in [14]. A number of different techniques exist for implementing the critically sampled discrete wavelet transform (DWT) in FPGAs [15–21], including the implementation of an MPEG-4 wavelet-based visual texture compression system [22]. Recently, the lifting scheme [23–25] has been introduced for real-time DWT [20, 26], as has the very-large-scale-integration (VLSI) implementation of the DWT using embedded instruction codes for symmetric filters [27]. The lifting scheme is attractive for hardware implementations because it replaces multipliers with shift operations. FPGA implementations of overcomplete wavelet transforms are much less studied in the literature.
Our initial techniques and results in FPGA implementation of wavelet-domain video denoising are reported in [28, 29]. These two studies focused on different aspects of the developed system: implementation of the wavelet transform and distributed computing over the FPGA modules in [28], and customization of a wavelet shrinkage function by look-up tables for implementation in read-only memories (ROMs) in [29]. The description was on a more abstract level, focusing on the main concepts and not on the details of the architectural design.
In this paper, we report a full architectural design of a real-time FPGA implementation of a video denoising algorithm based on an overcomplete (nondecimated) wavelet transform and employing sophisticated locally adaptive wavelet shrinkage. We propose a novel FIR filter design for the nondecimated wavelet transform based on the algorithm à trous [30]. The implemented spatial/temporal filter is separable: a motion-adaptive recursive temporal filter follows the spatial filter, as was proposed in [9]. We present an efficient customization of the locally adaptive spatial wavelet filter using a combination of read-only memories (ROMs) and a dedicated address generation network. We design an efficient implementation of a local window for wavelet processing using an array of delay elements. Our design of the complete denoising scheme distributes computing over two FPGA modules, which switch their functionality in time: while one module performs the direct wavelet transform of the current frame, the other module is busy with the inverse wavelet transform of the previous frame. After every two frames, the roles of the two modules are reversed. We present a detailed data flow of the proposed scheme. For low-to-moderate noise levels, the designed FPGA implementation yields only a minor performance loss compared to the software version of the algorithm. This demonstrates the potential of FPGAs for real-time implementations of highly sophisticated and complex video processing algorithms.
The paper is organized as follows. Section 2 presents an overview of the proposed FPGA design, including the memory organization (Section 2.1) and data flow (Section 2.2). Section 3 details the FPGA design of the different building blocks in our video denoising scheme. We start with some preliminaries for the hardware design of the nondecimated wavelet transform (Section 3.1) and present the proposed pipelined FPGA implementation (Section 3.2). Next, we present the FPGA design of the locally adaptive wavelet shrinkage (Section 3.3) and, finally, the FPGA implementation of the motion-adaptive recursive temporal filter (Section 3.4). Section 4 presents the real-time environment used in this study. The conclusions are in Section 5.
2 REAL-TIME IMPLEMENTATION WITH FPGA
An overview of our FPGA implementation is illustrated in Figure 1. We use two independent modules working in parallel, each implemented in a separate FPGA. While one module performs the wavelet decomposition of an input TV frame, the other module performs the inverse wavelet transform of the previous TV frame. The two modules switch their functionality in time. The wavelet-domain denoising block is located in front of the inverse wavelet transform. Distributing the algorithm over the two modules in this way allows effective logic decentralization with respect to the input and output data streams. Namely, while one FPGA module handles the input video stream and performs the wavelet decomposition, the other FPGA module reads the wavelet coefficients for denoising, sends them to the wavelet reconstruction, and builds up the visually improved output video stream.
The nondecimated wavelet transform demands significant memory resources. For example, in our implementation with three decomposition levels, we need to store nine frames of wavelet coefficients (three orientation subbands at each of the three scales) for every input frame. In addition, we need an input memory buffer and an output buffer to isolate data accesses in different clock domains.
The input data stream is synchronized with a 13.5 MHz clock. For three decomposition levels, the complete wavelet decomposition and reconstruction must run at a clock of at least 3 × 13.5 MHz = 40.5 MHz. The set-up of our hardware platform requires the output data stream at 27 MHz. Table 1 lists the required interfaces of the buffers used in the system.
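The clock and memory figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 16-bit word length matches the arithmetic used in our wavelet model, and the 720 × 576 frame size in the usage line is an assumption that is not stated in this section.

```python
# Back-of-the-envelope clock and memory budget for the nondecimated transform.
# Function names are hypothetical; the frame size in the demo is an assumption.

INPUT_CLOCK_MHZ = 13.5   # pixel clock of the digitized input stream
LEVELS = 3               # decomposition levels
ORIENTATIONS = 3         # LH, HL, and HH subbands per level
COEFF_BITS = 16          # coefficient word length


def processing_clock_mhz(levels: int = LEVELS) -> float:
    """Each level reprocesses a full-size frame, so the clock scales with levels."""
    return levels * INPUT_CLOCK_MHZ


def detail_frames(levels: int = LEVELS) -> int:
    """Full-size detail-coefficient frames stored per input frame (no decimation)."""
    return levels * ORIENTATIONS


def coefficient_bytes(width: int, height: int, levels: int = LEVELS) -> int:
    """Bytes needed to hold all detail subbands of one frame."""
    return width * height * detail_frames(levels) * COEFF_BITS // 8


if __name__ == "__main__":
    print(processing_clock_mhz())       # 40.5
    print(detail_frames())              # 9
    print(coefficient_bytes(720, 576))  # 7464960 for the assumed frame size
```

The LL band is not counted here, since it is kept in the onboard SRAM rather than in the coefficient buffer.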
The most critical timing issue is the memory buffer for storing the wavelet coefficients: it has to provide simultaneous read and write access at 40.5 MHz. Since no available SDRAM controller supports this timing, the whole processing is split into two independent parallel modules. The idea is to distribute the direct and the inverse wavelet processing between these modules: while one module performs the wavelet decomposition of the current frame, the other module performs the inverse wavelet transform of the previous frame. With this organization, one module reads and the other module writes the coefficients. The approximation subband (LL band) produced during the wavelet decomposition and used during the reconstruction is stored in the onboard SRAM memory. This allows us to use only the read port or only the write port of the memory during one frame.

Figure 1: A detail of the FPGA implementation of the proposed wavelet-domain video denoising algorithm (two wavelet coefficient RAMs, control modules, and the temporal filter; 13.5 MHz and 27 MHz clock domains).

Table 1: Memory interfaces.
Buffers                        Write port (MHz)    Read port (MHz)
Wavelet coefficients buffer    40.5                40.5
The data flow through all the memory buffers and both FPGAs in our scheme is shown in Figure 2. The total delay is 4 frames. During the first 20 milliseconds, the input frame A0 is stored in the input buffer at a clock rate of 13.5 MHz. During the next 20 milliseconds, this frame is read from the input buffer and wavelet transformed in the 40.5 MHz clock domain, with 3 decomposition scales W1(A0), W2(A0), and W3(A0). In parallel to this process, the next frame A1 is written into the input buffer. The following time slot of 20 milliseconds is currently not used for processing A0, but is reserved for future additional processing in the wavelet domain. Within this period, the frame A1 is read from the input buffer and decomposed into its wavelet coefficients. The frames A0 and A1 are processed by FPGA1. The next input frame, A2, is written into the input buffer and is wavelet transformed in the next time slot by FPGA2.
The denoising and the inverse wavelet transform of the frame A0 are performed afterwards. During this period, the wavelet coefficients of the frame A0 are read from the memory and denoised, and the output frame is reconstructed with the inverse wavelet transform W^-1(A0). During the last reconstruction stage (the reconstruction at the finest wavelet scale), the denoised output frame is written to the output memory buffer. In parallel to this process, FPGA2 performs the wavelet decomposition of the frame A2, and the input frame A3 is stored in the input buffer.
Finally, 4 × 20 milliseconds = 80 milliseconds after the frame A0 appeared at the system input (4 frames later), it is read from the output buffer in the 27 MHz clock domain and is sent to the selective recursive temporal filter and then to the system output. The output data stream is aligned with a 100 Hz refresh rate, which means that the same frame is sent to the output twice within one time slot of 20 milliseconds. Meanwhile, FPGA2 performs the wavelet decomposition of the frame A3. Next, the frame A4 is written into the input buffer and is decomposed in the following time slot under the control of FPGA1.
In this scheme, the two FPGAs switch their functionality after every two frames. FPGA1 performs the wavelet decomposition of two frames while FPGA2 performs the inverse wavelet transform of the previous two frames. After two frames, this is reversed.
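The timeline above can be replayed slot by slot to confirm that the pairwise frame assignment never asks one FPGA to run the direct and the inverse transform at the same time, and that the end-to-end delay is four slots. The following is a minimal sketch under that reading of Figure 2; the function names and the task encoding are hypothetical, and the reserved slot is modelled as idle time for the frame in question.

```python
# Slot-by-slot sketch of the two-FPGA schedule (one slot = 20 ms), following
# the timeline of Figure 2. Names and the task encoding are hypothetical.
from collections import defaultdict

SLOT_MS = 20


def fpga_for_frame(i: int) -> int:
    """Frames are handled in pairs: A0, A1 -> FPGA1; A2, A3 -> FPGA2; A4, A5 -> FPGA1."""
    return 1 + (i // 2) % 2


def schedule(num_frames: int) -> dict:
    """Map slot -> list of (fpga, task, frame); fpga is None for buffer-only steps."""
    slots = defaultdict(list)
    for i in range(num_frames):
        fpga = fpga_for_frame(i)
        slots[i].append((None, "write input buffer", i))
        slots[i + 1].append((fpga, "decompose", i))
        # Slot i + 2 is reserved for future wavelet-domain processing of A_i.
        slots[i + 3].append((fpga, "denoise and reconstruct", i))
        slots[i + 4].append((None, "read output buffer", i))
    return dict(slots)


def busy_conflicts(sched: dict) -> list:
    """Slots in which a single FPGA would have to run two transforms at once."""
    bad = []
    for slot, tasks in sched.items():
        fpgas = [f for f, _, _ in tasks if f is not None]
        if len(fpgas) != len(set(fpgas)):
            bad.append(slot)
    return sorted(bad)


def latency_ms(sched: dict, frame: int) -> int:
    """Time from writing A_frame into the input buffer to reading it for output."""
    start = min(s for s, ts in sched.items() if (None, "write input buffer", frame) in ts)
    end = min(s for s, ts in sched.items() if (None, "read output buffer", frame) in ts)
    return (end - start) * SLOT_MS


if __name__ == "__main__":
    sched = schedule(8)
    for slot in sorted(sched):
        print(slot, sched[slot])
    print(busy_conflicts(sched), latency_ms(sched, 0))
```

Because frame A_i is decomposed by the same FPGA that reconstructs A_(i+2) only when the pair index changes parity, the decomposition and reconstruction of any slot always land on different devices.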
REAL-TIME PROCESSING
We design an FPGA implementation of the sequential spatial/temporal video denoising scheme from [9], which is depicted in Figure 3. Note that we use an overcomplete (nondecimated) wavelet transform to guarantee high-quality spatial denoising. In this representation, with three decomposition levels, the number of wavelet coefficients is 9 times the input image size. Therefore, we choose to perform the temporal filtering in the image domain (after the inverse wavelet transform) in order to minimize the memory requirements.
While hardware implementations of the orthogonal wavelet transform have been extensively studied in the literature [16–21, 26, 27], much less research has been done towards implementations of the nondecimated wavelet transform.
We develop a hardware implementation of the nondecimated wavelet transform based on the algorithm à trous [30], with the classical three orientation subbands per scale. This algorithm upsamples the wavelet filters at each decomposition level. In particular, $2^{j-1}-1$ zeros ("holes," in French, trous) are inserted between the filter coefficients at decomposition level j (j = 1, 2, 3), as shown in Figure 4.

Figure 2: The data flow of wavelet processing. The rows show the read and write accesses of the FPGA1 and FPGA2 fields during the direct and the inverse wavelet transform. Legend: W_j(A_i), wavelet decomposition of the frame A_i at scale j; W^-1(A_i), wavelet reconstruction of the frame A_i; A_i, processing frame with index i.

Figure 3: The implemented denoising scheme: 2D wavelet transform → denoising by wavelet shrinkage → inverse 2D wavelet transform → pixel-based motion detector → selective recursive filter.
We use the SystemC library [31] and a previously developed simulation environment [32, 33] to develop a real-time model of the wavelet decomposition and reconstruction. Figure 5 shows the simulation model. After a number of simulations and tests, we concluded that the real-time wavelet implementation with 16-bit arithmetic gives practically the same results as a reference MATLAB implementation of the algorithm à trous [30]. Over a number of input frames, more than 97.13% of the pixels were error-free, with a mean error of 0.0287. Analyzing these figures at the level of the bit representation, we conclude that at most 1 bit out of 16 was wrong. The wrong bit can occur only at bit position 0 (the least significant bit) shown in Figure 6. Taking into account that the input pixels are 8-bit integers, we can ignore this error.
3.2 Pipelined FPGA implementation of the nondecimated wavelet transform
Here we develop an FPGA implementation of a nondecimated wavelet transform with three orientation subbands per scale. We design FIR filters for the algorithm à trous [30] with the Daubechies minimum-phase wavelet of length four [34], and we implement the designed FIR filters with dedicated multipliers in the Xilinx Virtex-II FPGAs [35].
Figure 4: The nondecimated 2D discrete wavelet transform (upsampled filters H_j(z) with "trous" (holes) between the coefficients; output subbands LL, LH, HL, and HH).
Figure 5: The developed simulation model for the implementation of the wavelet transform (2D wavelet transformation and 2D inverse wavelet transformation; 8-bit input and output data; 16-bit LL, LH, HL, and HH coefficients with bit fields 15, 14:7, and 6:0).
Our implementation of the 2D wavelet transform is line-based, as shown in Figure 7. We choose the line alignment in order to preserve the video sequence input format and to pipeline the whole processing in our system. The horizontal and the vertical filtering are performed within one pass of the input video stream. We avoid independent horizontal and vertical processing, which would require two passes and an internal memory for storing the output of the horizontal filtering. Instead, we use line-based vertical filtering with as many internal line buffers as there are taps in the FIR filter.
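The one-pass organization can be illustrated in software: every incoming line is filtered horizontally, pushed into a small bank of line buffers, and the vertical filter reads the lines it needs directly from that bank, so no frame store sits between the two passes. This is a minimal sketch, not the FPGA datapath; the function names are hypothetical, zero history before the first pixel and the first line is assumed, and the subband names follow the convention "first letter = horizontal filter, second letter = vertical filter."

```python
# Sketch of one-pass, line-based separable filtering (Figure 7). Each incoming
# line is filtered horizontally, pushed into a bank of line buffers, and the
# vertical 4-tap filter combines lines already held in those buffers.
# Zero history before the first pixel and the first line is an assumption.
from collections import deque


def fir_row(row, coeffs, step=1):
    """Horizontal FIR, y[n] = sum_k a(k) x[n - k*step], zero before the line start."""
    out = []
    for n in range(len(row)):
        acc = 0.0
        for k, a in enumerate(coeffs):
            if n - k * step >= 0:
                acc += a * row[n - k * step]
        out.append(acc)
    return out


def line_based_subbands(lines, low, high, step=1):
    """Yield (LL, LH, HL, HH) output rows, one per input line."""
    width = len(lines[0])
    depth = (len(low) - 1) * step + 1             # lines spanned by the vertical filter
    zero = [0.0] * width
    buf_l = deque([zero] * depth, maxlen=depth)   # index 0 holds the newest line
    buf_h = deque([zero] * depth, maxlen=depth)

    def vertical(buf, coeffs):
        # Vertical taps read lines k*step positions back in the line buffers.
        return [sum(c * buf[k * step][x] for k, c in enumerate(coeffs))
                for x in range(width)]

    for line in lines:
        buf_l.appendleft(fir_row(line, low, step))   # oldest line drops out
        buf_h.appendleft(fir_row(line, high, step))
        yield (vertical(buf_l, low), vertical(buf_l, high),
               vertical(buf_h, low), vertical(buf_h, high))
```

Feeding a single impulse at the top-left pixel reproduces the separable outer product of the horizontal and vertical impulse responses, which is a quick way to check that the line-buffer indexing matches a direct 2D convolution.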
The horizontal and vertical FIR filters differ only in the implementation of the filter delay path. The data path of the horizontal filter is a register pipeline, as shown in Figure 8. The data path of the vertical filter is the output of the line buffers. Hence, the vertical FIR filter does not include any delay elements, only the pipelined filtering arithmetic (multipliers and an adder). Pipelining the filtering arithmetic ensures the required timing for data processing, and we use this approach for both the horizontal and the vertical filters.
The algorithm à trous [30] upsamples the wavelet filters by inserting $2^{j-1}-1$ zeros between the filter coefficients at decomposition level j (see Figure 4). We implement this filter upsampling by using a longer filter delay path and the appropriate data selection logic. The required number of registers depends on the length of the mother wavelet function and on the number of decomposition levels used. We use a wavelet of length four and three decomposition levels, and hence our horizontal filter in Figure 8 contains 3 × 4 = 12 registers. Four registers are dedicated to the 4-tap filter, and 3 times as many are needed to implement the required upsampling up to the third decomposition level. Analogously, on the vertical filtering side, each line buffer for vertical filtering is able to store up to 4 lines.

For the calculation of the first decomposition level of the wavelet transform, only the first 4 registers, d0, d1, d2, and d3 in Figure 8, are used in the FIR filter register pipeline. At the second decomposition level, the wavelet filters have to be upsampled with 1 zero between the filter coefficients; in our implementation, this means that the registers d0, d2, d4, and d6 are used for filtering. Figure 8 illustrates the FIR filter configuration during the calculation of the wavelet coefficients at the third decomposition level. During this period, the registers d0, d4, d8, and d12 are involved in the filtering process.
Figure 6: Input and output data format.
Figure 7: A block schematic of the developed hardware implementation of the wavelet transform (video input → 4-tap horizontal FIR filters L and H → line buffers → 4-tap vertical FIR filters producing the LL, LH, HL, and HH subbands).
We implement the inverse wavelet transform accordingly. The processing is mirrored compared to the wavelet decomposition: the vertical filtering is done first and the horizontal filtering afterwards. The FIR filter design is the same as for the direct wavelet transform; only the filter coefficients a(0), a(1), a(2), and a(3) in Figure 8 are mirrored.
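Why mirroring the coefficients is enough can be seen in one dimension: with an orthonormal D4 pair, filtering each undecimated band again with the time-reversed coefficients in the same causal FIR structure, and averaging the two results, returns the input delayed by three tap spacings. The sketch below assumes circular boundaries and the 1/2 normalization of the undecimated transform; it illustrates the principle and is not the authors' fixed-point datapath.

```python
# Sketch: the inverse filters reuse the same causal 4-tap structure with the
# coefficients mirrored, a(3), a(2), a(1), a(0). For an orthonormal D4 pair,
# filtering each undecimated band with the mirrored coefficients and averaging
# gives back the input, delayed by 3*step samples. Circular boundaries and the
# 1/2 factor are assumptions of this sketch.
import math

_S3, _N = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
LOW = [(1 + _S3) / _N, (3 + _S3) / _N, (3 - _S3) / _N, (1 - _S3) / _N]
HIGH = [(-1) ** n * LOW[3 - n] for n in range(4)]


def fir(x, coeffs, step):
    """Causal upsampled FIR with circular indexing: y[n] = sum_k a(k) x[n - k*step]."""
    size = len(x)
    return [sum(a * x[(n - k * step) % size] for k, a in enumerate(coeffs))
            for n in range(size)]


def analyze(x, step):
    """One undecimated level: low-pass and high-pass bands, same length as x."""
    return fir(x, LOW, step), fir(x, HIGH, step)


def synthesize(low_band, high_band, step):
    """Same FIR structure with mirrored coefficients; the two outputs are averaged."""
    lo = fir(low_band, LOW[::-1], step)
    hi = fir(high_band, HIGH[::-1], step)
    return [(a + b) / 2.0 for a, b in zip(lo, hi)]


if __name__ == "__main__":
    signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0]
    rec = synthesize(*analyze(signal, step=2), step=2)
    print([round(v, 6) for v in rec])  # the signal, circularly delayed by 6 samples
```

The identity holds because the autocorrelations of the D4 low-pass and high-pass filters sum to twice a unit impulse, so only the zero-lag term survives.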
Our video denoising scheme employs the spatially adaptive wavelet shrinkage approach of [36]. A brief description of this denoising method follows.
Let y_l denote the noise-free wavelet coefficient and w_l its observed noisy version at the spatial position l in a given wavelet subband. For compactness, we suppress here the indices that denote the scale and the orientation. The method of [36] shrinks each wavelet coefficient by a factor that equals the probability that this coefficient represents a signal of interest. The signal of interest is defined as a noise-free signal component that exceeds the noise standard deviation σ in magnitude. The probability of the presence of a signal of interest at position l is estimated based on the coefficient magnitude |w_l| and on a local spatial activity indicator (LSAI)

$$z_l = \frac{1}{N_l} \sum_{k \in \partial l} |w_k|,$$

where ∂l is the neighborhood of the pixel l (within a square window) and N_l is the number of neighboring coefficients. For example, for a 3 × 3 window, ∂l consists of the 8 nearest neighbors of the pixel l (N_l = 8).
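The LSAI definition translates directly into code: average the magnitudes of the eight neighbors in a 3 × 3 window, leaving the center coefficient out. The sketch below is a software illustration with a hypothetical function name; dropping out-of-band neighbors at the borders (so that N_l shrinks there) is an assumption, since the border handling of the hardware window is not described here.

```python
# Sketch of the local spatial activity indicator z_l: the mean magnitude of the
# neighbors of coefficient (r, c) in a 3 x 3 window, centre excluded (N_l = 8
# inside the band). Dropping out-of-band neighbors at the borders, so that
# N_l shrinks there, is an assumption of this sketch.
def lsai(band, r, c):
    rows, cols = len(band), len(band[0])
    total, count = 0.0, 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue                  # the centre is not part of the neighborhood
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                total += abs(band[rr][cc])
                count += 1
    return total / count


if __name__ == "__main__":
    band = [[1, -2, 3],
            [-4, 100, 5],
            [6, -7, 8]]
    print(lsai(band, 1, 1))  # 4.5: the large centre value does not contribute
```

In the hardware the same window is assembled from the current line and two line buffers, so the coefficient being shrunk lags the incoming line by one line.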
Let H1 denote the hypothesis "the signal of interest is present," |y_l| > σ, and let H0 denote the opposite hypothesis, |y_l| ≤ σ. The shrinkage estimator of [9] is

$$\hat{y}_l = P(H_1 \mid w_l, z_l)\, w_l = \frac{\rho\, \xi_l\, \eta_l}{1 + \rho\, \xi_l\, \eta_l}\, w_l, \qquad (1)$$

where

$$\rho = \frac{P(H_1)}{P(H_0)}, \qquad \xi_l = \frac{p(w_l \mid H_1)}{p(w_l \mid H_0)}, \qquad \eta_l = \frac{p(z_l \mid H_1)}{p(z_l \mid H_0)}. \qquad (2)$$
Here p(w_l | H0) and p(w_l | H1) denote the conditional probability density functions of the noisy coefficients given the absence and given the presence of a signal of interest, respectively. Similarly, p(z_l | H0) and p(z_l | H1) denote the corresponding conditional probability density functions of the local spatial activity indicator. The input-output characteristic of this wavelet denoiser is illustrated in Figure 9. This figure shows that coefficients that are small in magnitude are strongly shrunk towards zero, while the largest ones tend to be left unchanged. The displayed family of shrinkage characteristics corresponds to different values of the local spatial activity indicator. For the same coefficient magnitude |w_l|, the input coefficient is shrunk less if the LSAI z_l is larger, and vice versa.
We now address the implementation of this shrinkage function. Under the Laplacian prior for noise-free data, p(y) = (λ/2) exp(−λ|y|), we have [9]

$$\rho = \frac{\exp(-\lambda T)}{1 - \exp(-\lambda T)},$$

where T is the threshold that defines the signal of interest (T = σ in the hypotheses above). The analytical expressions for ξ_l and η_l are too complex for a direct FPGA implementation. We therefore implement the two likelihood ratios ξ_l and η_l as look-up tables stored in two read-only memories (ROMs). The generation of these look-up tables is based on an extensive experimental study, as we explain later in this section. The developed architecture is presented in Figure 10. One ROM, containing the look-up table ξ_l, is addressed by the coefficient magnitude |w_l|, and the other ROM, containing the look-up table ρη_l, is addressed by the LSAI z_l. To calculate the LSAI, we average the coefficient magnitudes from the current line and from the previous two lines within a 3 × 3 window. The values read from the ROMs are multiplied to produce the generalized likelihood ratio r = ρξ_lη_l. We found it more efficient to realize the shrinkage factor r/(1 + r) with another ROM (look-up table) instead of with arithmetic operations. The output of this look-up table, denoted here as the "shrinkage ROM," is the desired wavelet shrinkage factor. Finally, the output of the shrinkage ROM multiplies the input coefficient to yield the denoised coefficient.

Figure 8: The proposed FIR filter implementation of the algorithm à trous for a mother wavelet of length 4, supporting up to 3 decomposition levels (shown for scale 3). The particular arithmetic network using the registers d0, d4, d8, and d12 corresponds to the calculation of the wavelet coefficients at the third decomposition level.

Figure 9: An illustration of the employed wavelet shrinkage family (output versus noisy input coefficient, one curve per LSAI value).
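The ROM-based datapath described above (two likelihood-ratio ROMs, a multiplier, a shrinkage ROM for r/(1 + r), and a final multiplier) can be mimicked in software as follows. This is a hedged sketch: the paper's ROM contents come from experiments, so the exponential tables below are hypothetical monotonic placeholders, and the 8-bit address width and the log-spaced grid of the shrinkage ROM are likewise assumptions. Only the prior ratio ρ follows the Laplacian formula given in the text.

```python
# Sketch of the ROM-based shrinkage datapath of Figure 10. The likelihood-ratio
# ROM contents in the paper come from experiments; the exponential tables below
# are HYPOTHETICAL monotonic placeholders, as are the 8-bit address width and
# the log-spaced grid of the shrinkage ROM.
import math

ADDR_BITS = 8
ROM_SIZE = 1 << ADDR_BITS


def rho_laplacian(lam: float, T: float) -> float:
    """Prior ratio P(H1)/P(H0) under p(y) = (lam/2) exp(-lam |y|), threshold T."""
    p1 = math.exp(-lam * T)          # P(|y| > T)
    return p1 / (1.0 - p1)


def make_roms(rho: float):
    """Build (KSI ROM addressed by |w_l|, RHO*ETA ROM addressed by z_l).

    The exponential contents are placeholders for the experimentally
    generated tables; only their monotonic shape matters here.
    """
    ksi = [math.exp((a - 40) / 10.0) for a in range(ROM_SIZE)]
    rho_eta = [rho * math.exp((a - 20) / 10.0) for a in range(ROM_SIZE)]
    return ksi, rho_eta


# Shrinkage ROM: r / (1 + r) tabulated at r = 10**(e / 16) for e = -64 .. 64.
SHRINK_ROM = [10 ** (e / 16.0) / (1.0 + 10 ** (e / 16.0)) for e in range(-64, 65)]


def _addr(v: float) -> int:
    """Saturating 8-bit ROM address (stands in for the address generation network)."""
    return max(0, min(ROM_SIZE - 1, int(v)))


def _shrink_addr(r: float) -> int:
    """Nearest shrinkage-ROM entry for the likelihood ratio r > 0."""
    return max(0, min(len(SHRINK_ROM) - 1, round(16.0 * math.log10(r)) + 64))


def denoise(w: float, z: float, ksi_rom, rho_eta_rom) -> float:
    """Shrink one coefficient w_l given its LSAI z_l."""
    r = ksi_rom[_addr(abs(w))] * rho_eta_rom[_addr(z)]   # r = rho * xi_l * eta_l
    return SHRINK_ROM[_shrink_addr(r)] * w               # (r / (1 + r)) * w_l


if __name__ == "__main__":
    ksi, rho_eta = make_roms(rho_laplacian(lam=0.05, T=10.0))
    for w, z in [(2, 0), (30, 5), (30, 40), (200, 50)]:
        print(w, z, round(denoise(w, z, ksi, rho_eta), 2))
```

Folding ρ into the η table, as the text does with its ρη_l ROM, removes one multiplier from the datapath; with the placeholder tables, small coefficients are driven towards zero, large ones pass almost unchanged, and a higher LSAI shrinks the same coefficient less.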
We denoise the three wavelet bands LH, HL, and HH of each scale in parallel. The different resolution levels (we use three) are processed sequentially, as illustrated in Figure 2. The low-pass (LL) band is only delayed by the number of clock periods needed for denoising. This delay, which is 6 clock cycles in our implementation, ensures the synchronization of the inputs at the inverse wavelet transform block (see the timing in Figure 2).
The look-up tables for the two likelihood ratios were generated from extensive experiments on different test images and different noise levels, as described in [29]. Figure 11 illustrates the likelihood ratio ξ_l calculated from one test image at different noise levels. These diagrams show another interpretation of the well-known threshold selection principle in wavelet denoising: a well-chosen threshold value for the wavelet coefficients increases with the noise level. The maximum likelihood estimate of the threshold T (i.e., the value for which p(T | H0) = p(T | H1)) is the abscissa of the point where ξ_l = 1. Figure 12 displays the likelihood ratio ξ_l in the diagonal subband HH at the third decomposition level for 10 different frames with fixed noise standard deviations (σ = 10 and σ = 30). We showed in [29] that, from a practical point of view, the difference between the likelihood ratios calculated for different frames is minor, especially for lower noise levels (up to σ = 20). Therefore, we average the likelihood ratios over different frames and store these values as the corresponding look-up tables for several different noise levels (σ = 5, 10, 15, and 20). In the denoising procedure, the user selects the input noise level, which determines the set of look-up tables to be addressed. Figure 13 shows, for different input noise levels, the performance loss of the algorithm caused by the simplifications introduced with the generated look-up tables. These results represent peak signal-to-noise ratio (PSNR) values averaged over frames of several different video sequences. For σ = 10, the average performance loss was only 0.13 dB (and visually, the differences are difficult to notice), while for σ = 20 the performance loss is 0.55 dB and becomes visually noticeable on most frames, but not highly disturbing. For higher noise levels, the performance loss increases.

Figure 10: Block schematic of the implemented denoising architecture (FIFO line buffers forming the LSAI coefficient-magnitude window; ABS(pixel) and energy computation; address generation combinatorial network; KSI, ETA, and shrinkage ROMs; input and output coefficients).

Figure 11: Likelihood ratio ξ_l in the HH subband for one test frame and 4 different noise levels (σ = 5, 10, 20, 30).
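The look-up-table generation step described above, averaging per-frame likelihood ratios for one noise level and reading the ML threshold off the averaged table where ξ_l crosses 1, can be sketched as follows. The function names, the list-of-lists table layout, and the linear interpolation between table entries are assumptions; the toy numbers in the demo are not measurements from the paper.

```python
# Sketch of the look-up-table generation step: average the per-frame
# likelihood-ratio tables measured for one noise level, then read the ML
# threshold off the averaged table as the abscissa where xi_l crosses 1.
# Function names, table layout, and linear interpolation are assumptions.
def average_luts(per_frame_luts):
    """Element-wise mean of equally long per-frame tables."""
    n = len(per_frame_luts)
    return [sum(vals) / n for vals in zip(*per_frame_luts)]


def ml_threshold(xi_lut, bin_width=1.0):
    """First coefficient magnitude where xi_l reaches 1, or None if it never does."""
    for a in range(1, len(xi_lut)):
        lo, hi = xi_lut[a - 1], xi_lut[a]
        if lo < 1.0 <= hi:
            # Interpolate linearly between the two table entries around the crossing.
            return bin_width * (a - 1 + (1.0 - lo) / (hi - lo))
    return None


if __name__ == "__main__":
    frames = [[0.1, 0.5, 1.2, 3.0], [0.3, 0.7, 1.6, 3.4]]   # toy data, not measurements
    xi = average_luts(frames)
    print(xi, ml_threshold(xi))
```

Repeating this for each stored noise level would give one averaged table per σ, and the crossing point moving to larger magnitudes as σ grows is the threshold behavior visible in Figure 11.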
In the current implementation, the user has to select one of the available noise levels. With this approach, the user may not choose the setting that gives the best possible noise reduction. If the selected noise level is lower than the real noise level in the input signal, some of the noise will remain in the output signal. On the other hand, if the noise level is overestimated, the output signal will be blurred and visually unsatisfying.
This user intervention can be avoided by implementing a noise level estimator. The output of this block could be used for the look-up table selection, which would enable noise reduction that adjusts to the noise level in the input signal. For example, a robust wavelet-domain noise estimator based on the median absolute deviation [37], or other related wavelet-domain noise estimators such as [38], can be used for this purpose.
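A median-absolute-deviation estimator of the kind cited above is short enough to sketch, together with a step that maps the estimate onto one of the stored look-up-table sets. Assumptions, all labeled: the finest-scale diagonal (HH) subband is dominated by noise, the filters are orthonormal so that the noise keeps its standard deviation in that subband, and the estimate is snapped to the nearest stored level (σ = 5, 10, 15, 20). The function names are hypothetical.

```python
# Sketch of a median-absolute-deviation noise estimate [37] driving the LUT
# selection. Assumptions: the finest-scale diagonal (HH) subband is dominated
# by noise, the filters are orthonormal so noise keeps its standard deviation
# there, and the estimate is snapped to the nearest stored LUT set.
import statistics

LUT_SIGMAS = (5, 10, 15, 20)   # noise levels with stored look-up tables
MAD_TO_SIGMA = 0.6745          # median of |N(0, 1)|


def mad_sigma(hh_finest):
    """sigma ~= median(|w|) / 0.6745 over the finest-scale HH coefficients."""
    return statistics.median(abs(c) for c in hh_finest) / MAD_TO_SIGMA


def nearest_lut_sigma(sigma):
    """Noise level of the stored look-up-table set closest to the estimate."""
    return min(LUT_SIGMAS, key=lambda s: abs(s - sigma))


if __name__ == "__main__":
    import random
    random.seed(1)
    noise = [random.gauss(0.0, 12.0) for _ in range(20000)]  # synthetic pure-noise subband
    est = mad_sigma(noise)
    print(round(est, 2), nearest_lut_sigma(est))
```

In the platform described in Section 4, such an estimate could replace the value the user currently writes into the FPGA register over I2C; the median makes the estimate robust to the few large coefficients that carry image structure.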
The likelihood ratios ξ_l and η_l are monotonically increasing functions. We are currently investigating the approximation of these functions by a family of piecewise-linear functions parameterized by the noise standard deviation and by the parameter of the marginal statistical distribution of the noise-free coefficients in a given subband.
A pixel-based motion detector with selective recursive temporal filtering is quite simple to implement in hardware. Since we first apply high-quality spatial filtering, the noise is already significantly suppressed, and thus pixel-based motion detection is effective. When motion is detected, the recursive filtering is switched off.
Two pixels are involved in temporal filtering at a time: one pixel from the current field and another from the same spatial position in the previous field. We store the two fields in the output buffer and read both required pixel values in the same cycle. If the absolute difference between these two pixel values is smaller than a predefined threshold value, no motion is assumed and the two pixel values are subject to a weighted averaging, with the weighting factors defined in [9]. Otherwise, when motion is detected, the current pixel is passed to the output. The block schematic in Figure 14 depicts the developed FPGA architecture of the selective recursive temporal filter described above. We use 8-bit arithmetic because the filter operates in the image domain, where all the pixels are represented as 8-bit integers.
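The per-pixel decision described above fits in a few lines of integer arithmetic. This sketch is illustrative, not the synthesized filter: Figure 14 shows the weights 0.6 and 0.4, but assigning 0.6 to the current pixel, the rounding, and the threshold value used in the demo are assumptions. Feeding the previous output field back in is what makes the filter recursive.

```python
# Sketch of the selective recursive temporal filter of Figure 14 in 8-bit
# integer arithmetic. Figure 14 shows the weights 0.6 and 0.4; assigning 0.6 to
# the current pixel, the rounding, and the threshold value are ASSUMPTIONS.
# Feeding the previous output field back in makes the filter recursive.
W_CUR, W_PREV = 6, 4   # weights in tenths: 0.6 and 0.4


def temporal_filter(cur: int, prev: int, threshold: int) -> int:
    """One pixel: average when |cur - prev| < threshold, else pass cur through."""
    if abs(cur - prev) < threshold:                        # no motion assumed
        return (W_CUR * cur + W_PREV * prev + 5) // 10     # rounded weighted average
    return cur                                             # motion: filtering off


def filter_field(cur_field, prev_out_field, threshold):
    """Apply the per-pixel filter to a field stored as a flat list of 8-bit values."""
    return [temporal_filter(c, p, threshold) for c, p in zip(cur_field, prev_out_field)]


if __name__ == "__main__":
    prev_out = [100, 100, 100]
    for field in ([104, 160, 98], [102, 162, 99]):
        prev_out = filter_field(field, prev_out, threshold=20)
        print(prev_out)
```

With these inputs the first field yields [102, 160, 99]: the middle pixel changes by more than the threshold and is passed through unfiltered, while the other two are averaged with the stored field. Since the weights sum to one, the output never leaves the 8-bit range.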
In our implementation, we use the standard television broadcasting signal as the source of the video signal. A common feature of all standard TV broadcasting technologies is that the video sequence is transmitted in the analog domain (this excludes the latest DVB and HDTV transmission standards). Thus, the television video sequence has to be digitized before digital processing. Likewise, after digital processing, the sequence has to be converted back to the analog domain in order to be shown on a standard tube display. This pair of A/D and D/A converters is well known as a codec. An 8-bit codec, with 256 quantization levels per pixel, is considered sufficient from the visual quality point of view. Figure 15 shows a block schematic of digital processing for television broadcasting systems.
We use the PAL-B broadcasting standard and an 8-bit YUV 4:2:2 codec. The hardware platform set-up consists of three separate boards. Each board corresponds to one of the blocks presented in Figure 15:
(i) Micronas IMAS-VPC 1.1 (A/D, analog front-end) [39];
(ii) CHIPit Professional Gold Edition (processing block) [40];
(iii) Micronas IMAS-DDPB 1.0 (D/A, analog back-end) [41].

Figure 12: Likelihood ratio ξ_l in the HH subband displayed for 10 frames with fixed noise levels: σ = 10 (a) and σ = 30 (b).

Figure 13: Performance of the designed FPGA implementation in comparison with the original software version of the algorithm, which employs exact analytical calculation of the involved shrinkage expression (PSNR versus the standard deviation of the added noise, 0 to 50, for the original algorithm with 3 decomposition levels and for the FPGA implementation).
We made all the connections among the previously mentioned boards with a separate interconnection board designed for this purpose. This interconnection board consists of the interconnection channels and the voltage level adjustment between the CHIPit board (3.3 V level) and the Micronas IMAS boards (5 V level).
The processing board consists of two Xilinx Virtex II FPGAs (XC2V6000-5) [35] and is equipped with ample SDRAM memory (6 banks with 32 bit access, built from 256 Mbit ICs).
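How much of this memory a delayed video field occupies follows from the 8 bit YUV 4:2:2 format; one example is the previous field used by the temporal filter. The sketch below computes the size of one active field and the resulting data rate, and shows the Cb Y Cr Y sample interleaving commonly used for 4:2:2 streams. The raster size (720 active samples per line, 576 active lines per frame, as in ITU-R BT.601) is an assumption, since the paper does not state it.

```python
# Sketch: storage needed for one field in 8 bit YUV 4:2:2, plus the Cb Y Cr Y
# sample interleaving used by ITU-R BT.656 style interfaces.
# ASSUMPTION: 720 active luma samples per line and 576 active lines per PAL
# frame (ITU-R BT.601 values); the paper does not state the raster size.

ACTIVE_SAMPLES = 720          # luma samples per active line (assumed)
ACTIVE_LINES_PER_FIELD = 288  # 576 active lines per frame / 2 fields
FIELDS_PER_SECOND = 50        # PAL: 25 interlaced frames per second
BITS_PER_SAMPLE = 8


def bytes_per_line_422(samples: int = ACTIVE_SAMPLES) -> int:
    """Y at full horizontal rate, Cb and Cr each at half rate."""
    luma = samples
    chroma = 2 * (samples // 2)
    return (luma + chroma) * BITS_PER_SAMPLE // 8


def pack_line_422(y, cb, cr) -> bytes:
    """Interleave one line of 8 bit samples as Cb Y Cr Y Cb Y Cr Y ..."""
    if len(y) != 2 * len(cb) or len(cb) != len(cr):
        raise ValueError("4:2:2 needs two Y samples per Cb/Cr pair")
    out = bytearray()
    for i in range(len(cb)):
        out += bytes((cb[i], y[2 * i], cr[i], y[2 * i + 1]))
    return bytes(out)


if __name__ == "__main__":
    field_bytes = bytes_per_line_422() * ACTIVE_LINES_PER_FIELD
    print(field_bytes)                      # 414720
    print(field_bytes * FIELDS_PER_SECOND)  # 20736000 bytes per second
```

Under these assumptions a single 256 Mbit (32 MiB) device could hold roughly 80 such fields, so field buffering alone is far from exhausting the board's SDRAM.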
All boards of the hardware platform are configured via the I2C interface. The user sets the noise level of the input signal by writing the appropriate value to the corresponding register in the FPGA, which is accessible via the I2C interface. The look-up table with the averaged likelihood ratio that corresponds to the value in this register is then selected.
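The register-driven table selection can be modeled in software as below. Everything here is hypothetical: the paper does not describe the register encoding or which noise levels have precomputed tables, so the sketch assumes the register holds the noise standard deviation σ as an 8 bit value and that the table for the nearest supported σ is chosen. The table contents are placeholders, and the I2C bus transaction itself is not modeled.

```python
# Hypothetical model of noise-level selection via an I2C-accessible register.
# ASSUMPTIONS: the register holds the noise standard deviation sigma as an
# 8 bit value, LUTs exist for the sigmas in SUPPORTED_SIGMAS, and the table
# for the nearest supported sigma is selected. Table contents are placeholders.

SUPPORTED_SIGMAS = (5, 10, 15, 20, 25, 30, 35, 40, 45, 50)  # hypothetical


class NoiseLevelRegister:
    """Simulates the FPGA register written by the host over I2C."""

    def __init__(self, luts: dict):
        self.luts = luts        # sigma -> averaged likelihood-ratio LUT
        self.value = min(luts)  # power-up default: lowest noise level

    def i2c_write(self, value: int) -> None:
        """Store a new 8 bit register value (bus transaction not modeled)."""
        if not 0 <= value <= 0xFF:
            raise ValueError("register is 8 bits wide")
        self.value = value

    def active_lut(self):
        """Return the LUT whose sigma is closest to the register value."""
        sigma = min(self.luts, key=lambda s: abs(s - self.value))
        return self.luts[sigma]


if __name__ == "__main__":
    # Placeholder strings stand in for the precomputed tables.
    reg = NoiseLevelRegister({s: f"LUT(sigma={s})" for s in SUPPORTED_SIGMAS})
    reg.i2c_write(12)
    print(reg.active_lut())  # LUT(sigma=10)
```

Selecting among precomputed tables keeps the per-pixel datapath free of any noise-dependent arithmetic; only the table base changes when the user writes a new value.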
We designed a real-time FPGA implementation of an advanced wavelet-domain video denoising algorithm. The developed hardware architecture is based on innovative technical solutions that allow sophisticated adaptive wavelet denoising to be implemented in hardware. We believe that the results reported in this paper can be of interest for a number of industrial applications, including TV broadcasting systems. Our current implementation is limited in practical use because the noise level has to be set by the user. Our future work will integrate noise level estimation to remove this limitation and to allow automatic adaptation of the denoiser to changes of the noise level in the input signal.
Figure 14: Block schematic of implemented temporal filter. (Diagram labels: pixel from current field, pixel from previous field, ABS(A-B), comparator A<B with threshold, adder (+) with weights 0.6 and 0.4, output.)
Figure 15: A digital processing system for television broadcasting video sequences. (Diagram: input video sequence, A/D, digital processing of samples at nT, D/A, output video sequence.)
ACKNOWLEDGMENT
The second author is a Postdoctoral Researcher of the Fund for the Scientific Research in Flanders (FWO), Belgium.
REFERENCES
[1] F. Cocchia, S. Carrato, and G. Ramponi, “Design and real-time implementation of a 3-D rational filter for edge preserving smoothing,” IEEE Transactions on Consumer Electronics, vol. 43, no. 4, pp. 1291–1300, 1997.
[2] G. Arce, “Multistage order statistic filters for image sequence processing,” IEEE Transactions on Signal Processing, vol. 39, no. 5, pp. 1146–1163, 1991.
[3] V. Zlokolica and W. Philips, “Motion and detail adaptive denoising of video,” in Image Processing: Algorithms and Systems III, vol. 5298 of Proceedings of SPIE, pp. 403–412, San Jose, Calif, USA, January 2004.
[4] L. Hong and D. Brzakovic, “Bayesian restoration of image sequences using 3-D Markov random fields,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’89), vol. 3, pp. 1413–1416, Glasgow, UK, May 1989.
[5] J. Brailean and A. Katsaggelos, “Simultaneous recursive displacement estimation and restoration of noisy-blurred image sequences,” IEEE Transactions on Image Processing, vol. 4, no. 9, pp. 1236–1251, 1995.
[6] P. van Roosmalen, S. Westen, R. Lagendijk, and J. Biemond, “Noise reduction for image sequences using an oriented pyramid thresholding technique,” in IEEE International Conference on Image Processing, vol. 1, pp. 375–378, Lausanne, Switzerland, September 1996.
[7] I. Selesnick and K. Li, “Video denoising using 2D and 3D dual-tree complex wavelet transforms,” in Wavelets: Applications in Signal and Image Processing X, vol. 5207 of Proceedings of SPIE, pp. 607–618, San Diego, Calif, USA, August 2003.
[8] D. Rusanovskyy and K. Egiazarian, “Video denoising algorithm in sliding 3D DCT domain,” in Proceedings of the 7th International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS ’05), J. Blanc-Talon, W. Philips, D. Popescu, and P. Scheunders, Eds., vol. 3708 of Lecture Notes in Computer Science, pp. 618–625, Antwerp, Belgium, September 2005.
[9] A. Pižurica, V. Zlokolica, and W. Philips, “Noise reduction in video sequences using wavelet-domain and temporal filtering,” in Wavelet Applications in Industrial Processing, vol. 5266 of Proceedings of SPIE, pp. 48–59, Providence, RI, USA, October 2003.
[10] V. Zlokolica, A. Pižurica, and W. Philips, “Video denoising using multiple class averaging with multiresolution,” in The International Workshop on Very Low Bitrate Video Coding (VLBV ’03), pp. 172–179, Madrid, Spain, September 2003.
[11] B. A. Draper, J. R. Beveridge, A. P. W. Bohm, C. Ross, and M. Chawathe, “Accelerated image processing on FPGAs,” IEEE Transactions on Image Processing, vol. 12, no. 12, pp. 1543–1551, 2003.
[12] A. M. Al-Haj, “Fast discrete wavelet transformation using FPGAs and distributed arithmetic,” International Journal of Applied Science and Engineering, vol. 1, no. 2, pp. 160–171, 2003.
[13] G. Goslin, “A guide to using field programmable gate arrays (FPGAs) for application-specific digital signal processing performance,” XILINX Inc., 1995.
[14] C. Dick, “Implementing area optimized narrow-band FIR filters using Xilinx FPGAs,” in Configurable Computing: Technology and Applications, vol. 3526 of Proceedings of SPIE, pp. 227–238, Boston, Mass, USA, November 1998.
[15] R. D. Turney, C. Dick, and A. Reza, “Multirate filters and wavelets: from theory to implementation,” XILINX Inc.
[16] J. Ritter and P. Molitor, “A pipelined architecture for partitioned DWT based lossy image compression using FPGA’s,” in ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’01), pp. 201–206, Monterey, Calif, USA, February 2001.