CSPIHT BASED SCALABLE VIDEO CODEC FOR
LAYERED VIDEO STREAMING
FENG WEI
(B.Eng. (Hons.), Xi’an Jiaotong University)
A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2003
ACKNOWLEDGEMENT

Last but not least, I wish to thank my boyfriend Huang Qijie for his support all the way along. Almost all of my progress was made when he was by my side.
TABLE OF CONTENTS

ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
SUMMARY
CHAPTER 1 INTRODUCTION
CHAPTER 2 IMAGE AND VIDEO CODING
  2.1 Transform Coding
    2.1.1 Linear Transforms
    2.1.2 Quantization
    2.1.3 Arithmetic Coding
    2.1.4 Binary Coding
  2.2 Video Compression Using MEMC
  2.3 Wavelet Based Image and Video Coding
    2.3.1 Discrete Wavelet Transform
    2.3.2 EZW Coding Scheme
    2.3.3 SPIHT Coding Scheme
    2.3.4 Scalability
  2.4 Image and Video Coding Standards
CHAPTER 3 VIDEO STREAMING AND NETWORK QoS
  3.1 Video Streaming Models
  3.2 Characteristics and Challenges of Video Streaming
  3.3 Quality of Service
    3.3.1 Definition of QoS
    3.3.2 IntServ Framework
    3.3.3 DiffServ Framework
  3.4 Layered Video Streaming
CHAPTER 4 LAYERED 3D-CSPIHT CODEC
  4.1 CSPIHT and 3D-CSPIHT Video Coders
  4.2 Limitations of the Original 3D-CSPIHT Codec
  4.3 Layered 3D-CSPIHT Video Codec
    4.3.1 Overview of New Features
    4.3.2 Layer IDs
    4.3.4 How the Codec Functions in the Network
    4.3.5 Layered 3D-CSPIHT Algorithm
CHAPTER 5 PERFORMANCE DATA
  5.1 Coding Performance Measurements
  5.2 PSNR Performance of the Layered 3D-CSPIHT Codec
  5.3 Coding Time and Compression Ratio
CHAPTER 6 CONCLUSIONS
REFERENCES
SUMMARY
A layered scalable codec based on the 3-D Color Set Partitioning in Hierarchical Trees (3D-CSPIHT) coder is presented in this thesis. The layered 3D-CSPIHT codec introduces layering of encoded bit streams to support layered scalable video streaming. It restricts the significance criteria of the original 3D-CSPIHT coder so as to generate separate bit streams comprised of cumulative layers, where layers are defined according to resolution subbands. The layered 3D-CSPIHT codec incorporates a new sorting algorithm to produce multi-resolution scalable bit streams, and a specially designed layer ID to identify the layer to which a particular data packet belongs. In this way, decoding of lossy data is achieved.

The layered 3D-CSPIHT codec is tested using both high motion and low motion standard QCIF video sequences at 10 frames per second. It is compared against the original 3D-CSPIHT and the 2D-CSPIHT video coders in terms of PSNR, encoding time and compression ratio. In the luminance plane, the original 3D-CSPIHT and the 2D-CSPIHT give better PSNR than the layered 3D-CSPIHT, while in the chrominance planes they give similar PSNR results. The layered 3D-CSPIHT also costs more in computational time and produces less compressed bit streams, because of the overhead incurred by the layer IDs. However, encoded video data is very likely to encounter loss in real network transmission, and when decoding lossy data the layered 3D-CSPIHT codec outperforms the original 3D-CSPIHT significantly.
LIST OF TABLES

Table 2.1 Image and video compression standards
Table 4.1 Resolution options
Table 4.2 LIP, LIS, LSP state after sorting at bit plane 2 (original CSPIHT)
Table 4.3 LIP, LIS, LSP state after sorting at bit plane 1 (original CSPIHT)
Table 4.4 LIP, LIS, LSP state after sorting at bit plane 0 (original CSPIHT)
Table 4.5 LIP, LIS, LSP state after sorting at bit plane 2 (layered CSPIHT, layer 1 effective)
Table 4.6 LIP, LIS, LSP state after sorting at bit plane 1 (layered CSPIHT, layer 1 effective)
Table 4.7 LIP, LIS, LSP state after sorting at bit plane 0 (layered CSPIHT, layer 1 effective)
Table 4.8 LIP, LIS, LSP state after sorting at bit plane 2 (layered CSPIHT, layer 2 effective)
Table 4.9 LIP, LIS, LSP state after sorting at bit plane 1 (layered CSPIHT, layer 2 effective)
Table 4.10 LIP, LIS, LSP state after sorting at bit plane 0 (layered CSPIHT, layer 2 effective)
Table 5.1 Average PSNR (dB) at 3 different resolutions
Table 5.2 Encoding time (in seconds) of the original and layered codecs
LIST OF FIGURES

Fig 1.1 A typical video streaming system
Fig 2.1 Encoding model
Fig 2.2 Decoding model
Fig 2.3 Binary coding model
Fig 2.4 Block matching motion estimation
Fig 2.5 1-D DWT decomposition
Fig 2.6 Dyadic DWT decomposition of an image
Fig 2.7 Subbands after 3-level dyadic wavelet decomposition
Fig 2.8 2-level DWT decomposed Barbara image
Fig 2.9 Spatial Orientation Tree for EZW
Fig 2.10 Spatial Orientation Tree of SPIHT
Fig 2.11 SPIHT coding algorithm
Fig 3.1 Unicast video streaming
Fig 3.2 Multicast video streaming
Fig 3.3 IntServ architecture
Fig 3.4 Leaky bucket regulator
Fig 3.5 An example of the DiffServ network
Fig 3.6 DiffServ inter-domain operations
Fig 3.7 Principle of a layered codec
Fig 4.1 CSPIHT SOT (2-D)
Fig 4.2 CSPIHT video encoder
Fig 4.3 CSPIHT video decoder
Fig 4.4 3D-CSPIHT STOT
Fig 4.6 3D-CSPIHT video decoder
Fig 4.7 Confusion when decoding lossy data using the original 3D-CSPIHT decoder
Fig 4.8 Network scenario considered for design of the layered codec
Fig 4.9 The bit stream after the layer ID is added
Fig 4.10 Resolution layers in layered 3D-CSPIHT
Fig 4.11 Progressively transmitted and decoded layers
Fig 4.12 (a) An example video frame after DWT transform
Fig 4.12 (b) SOT for Fig 4.12 (a)
Fig 4.13 Bit stream structure of the layered 3D-CSPIHT coder
Fig 4.14 Flowchart of the layered decoder algorithm
Fig 4.15 Layered 3D-CSPIHT algorithm
Fig 5.1 Frame by frame PSNR results on (a) foreman and (b) container sequences at 3 different resolutions
Fig 5.2 Rate distortion curve of the layered 3D-CSPIHT codec
Fig 5.3 PSNR (dB) comparison of the original and the layered codec in (a) luminance plane, (b) Cb plane and (c) Cr plane for the foreman sequence
Fig 5.4 Frame 1 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original
Fig 5.5 Frame 58 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original
Fig 5.6 Frame 120 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original
Fig 5.7 Frame 190 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original
Fig 5.8 Comparison on carphone sequence
Fig 5.9 Comparison on akiyo sequence
Fig 5.10 Manually formed incomplete bit streams
Fig 5.11 Reconstruction of frames (a)(b) 1, (c)(d) 5, (e)(f) 10 of the foreman sequence
CHAPTER 1 INTRODUCTION
With the growing demand for rich multimedia content on the Internet, video streaming has become popular in both academia and industry.
Video streaming technology enables real time or on-demand distribution of video resources over the network. Compressed video data are transmitted by a server application, and received and displayed in real time by the corresponding client applications. These applications normally start to display the video as soon as a certain amount of data arrives at the client’s buffer, thus allowing downloading and viewing of the video simultaneously.
A typical video streaming system consists of five core functional blocks, i.e., the coding module, network sender, network receiver, decoding module and video renderer. As shown in Fig 1.1, raw video data undergo compression in the coding module to reduce the data load on the network. The compressed video is then transmitted by the sender to the client on the other side of the network, where a decoding procedure is performed to reconstruct the video for the renderer to display.
Video streaming is advantageous because a user does not have to wait for the whole file to arrive before viewing the video. Besides, video streaming leaves no physical files on the client’s computer.
Fig 1.1 A typical video streaming system
The challenge of video streaming lies in the highly delay-sensitive nature of video applications. Video and audio data need to arrive on time to be useful. Unfortunately, current Internet service is best effort (BE) and guarantees no delay bound. Delay-sensitive applications need a new service model in which they can ask for higher assurance or priority from the network. Research in network Quality of Service (QoS) aims to investigate and provide such service models. Technical details of QoS include control protocols such as the Resource Reservation Protocol (RSVP), and individual building blocks such as traffic policing, buffer management and admission control [1]. Layered scalable streaming is one of the QoS-supportive video streaming mechanisms that provide both efficiency and flexibility.
The basic idea of layered scalable streaming is to encode raw video into multiple layers that can be separately transmitted, cumulatively received and progressively decoded [2]-[4]. Clients obtain a preferred video quality by subscribing to different layers and combining these layers into different bit streams. The base layer of the video stream must be received for any other layer to be useful, and each additional layer improves the video quality. As network clients always differ significantly in their capacities and preferences, layered scalable streaming is efficient in that it delivers one video stream over the network while, at the same time, each client receives a video that is specially “shaped” for it, as the sketch below illustrates.
Besides adaptive QoS support from the network, layered scalable video streaming requires a scalable video codec. Recent subband coding algorithms based on the Discrete Wavelet Transform (DWT) support scalability. The DWT-based Set Partitioning in Hierarchical Trees (SPIHT) scheme [5] [6] for coding of monochrome images has yielded desirable results despite its simplicity of implementation. The Color SPIHT (CSPIHT) [7]-[9] improves on SPIHT and achieves comparable compression results in color image coding. In the area of video compression, interest is focused on the removal of temporal redundancy, and 3-D subband coding schemes are one of the successful solutions. Karlsson and Vetterli implemented a 3-D subband coding system in [10] by generalizing the common 2-D filter banks to 3-D subband analysis and synthesis. As one of the embedded 3-D subband coding algorithms that follow it, 3D-CSPIHT [11] is an extension of the CSPIHT coding scheme to video coding.
The above coding schemes achieve satisfactory PSNR performance; however, they have been designed from a pure compression point of view, which creates problems for their direct application in a QoS-enabled streaming system. In this project, we extend the 3D-CSPIHT codec to address these problems and enable it to produce layered bit streams that are suitable for layered video streaming.
The rest of this thesis is organized as follows. In chapter 2 we provide background information on image and video compression, and in chapter 3 we discuss related research in multimedia communications and network QoS. The details of our extension of the 3D-CSPIHT codec, called the layered 3D-CSPIHT video codec, are presented in chapter 4. We analyze the performance of the layered codec in chapter 5. Finally, in chapter 6 we conclude this thesis.
CHAPTER 2 IMAGE AND VIDEO CODING
This chapter begins with an overview of transform coding for still images and video coding using motion compensation. Then wavelet based image and video coding is introduced and the subband coding techniques are described in detail. Finally, current image and video coding standards are briefly summarized.
2.1 Transform Coding
A typical transform coding system comprises forward transform, quantization and entropy coding, as shown in Fig 2.1. First, a reversible linear transform is used to reduce redundancy between adjacent pixels, i.e., the inter-pixel redundancy, in an image. After that, the image undergoes the quantization stage to reduce psychovisual redundancy. Lastly, the quantized image goes through entropy coding, which aims to reduce coding redundancy. Transform coding is a core technique recommended by JPEG and adopted by H.261, H.263, and MPEG-1/2/4. The corresponding decoding procedure is depicted in Fig 2.2. We will discuss the three encoding stages in this section.
Fig 2.1 Encoding model
Fig 2.2 Decoding model
2.1.1 Linear Transforms
Transform coding exploits the inter-pixel redundancy of an image by mapping the image to the transform domain using a reversible linear transform. For most natural images, a significant number of coefficients have small magnitudes after the transform. These coefficients can therefore be coarsely quantized or entirely discarded without causing much image degradation [12]. There is no information loss during the transform process, and the number of coefficients produced is equal to the number of pixels transformed. The transform itself does not directly reduce the amount of data required to represent the image. However, a set of transform coefficients is obtained in this way, which makes the inter-pixel redundancies of the input image more accessible for compression in later stages of the encoding process [12].
Expressing a signal vector x in the standard basis {a_1, a_2, …, a_N} of an N-dimensional Euclidean space, we obtain:

x = Σ_{n=1}^{N} x_n a_n   (2.1)

where A = [a_1, a_2, …, a_N] is an identity matrix of size N × N.
A different set of basis vectors [b_1, b_2, …, b_N] can be used to represent x as

x = Σ_{n=1}^{N} y_n b_n   (2.2)
Let B = [b_1, b_2, …, b_N] and y = [y_1, y_2, …, y_N]^T; we then have

x = By, and hence y = B^{-1} x   (2.3)

so the vector y holds the transform coefficients of x with respect to the new basis.
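As a concrete check of equations (2.1)-(2.3), the short NumPy sketch below re-expresses a signal in an arbitrarily chosen 2×2 orthonormal basis (the basis is illustrative, not one used by the thesis) and verifies that the transform itself is lossless:

```python
import numpy as np

# Signal x expressed in the standard basis.
x = np.array([3.0, 1.0])

# An illustrative orthonormal basis B = [b_1, b_2] (a 2-point "DCT-like" basis).
B = np.array([[1.0,  1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

y = np.linalg.inv(B) @ x   # transform coefficients: y = B^{-1} x, eq. (2.3)
x_back = B @ y             # reconstruction: x = B y

print(y)                        # most of the energy falls in the first coefficient
print(np.allclose(x, x_back))   # True: no information is lost by the transform
```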
2.1.2 Quantization
After the transform process, quantization is used to reduce the accuracy of the transform coefficients according to a pre-established fidelity criterion [14]. The effect of compression is achieved in this way. Quantization is an irreversible process.
Quantization is the mapping from the source data vector x to a code word r_k = Q[x] in a code book {r_k; 1 ≤ k ≤ L}. The criterion for choosing the proper code word is to reduce the expected distortion due to quantization with respect to a particular probability density distribution of the data. Assume the probability density function of x is f(x). The expected distortion can then be formulated as:

D = Σ_{k=1}^{L} ∫ N(x, r_k) I(x, r_k) f(x) dx   (2.4)

where N(x, r_k) = ||x − r_k||^2 is the distortion measure and I(x, r_k) is the indicator function

I(x, r_k) = 1 if Q[x] = r_k; 0 otherwise.   (2.5)
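A minimal sketch of such a quantizer, mapping each sample to the nearest code word under the squared-error criterion; the code book and sample values are illustrative, not taken from the thesis:

```python
# Illustrative code book {r_k; 1 <= k <= L} with L = 4.
code_book = [-1.5, -0.5, 0.5, 1.5]

def quantize(x):
    """Q[x]: return the code word nearest to x (squared-error criterion)."""
    return min(code_book, key=lambda r: (x - r) ** 2)

samples = [0.1, -0.9, 1.9, 0.4]
quantized = [quantize(s) for s in samples]

# Average squared-error distortion over the samples; the original values
# cannot be recovered from Q[x], which is why quantization is irreversible.
distortion = sum((s - q) ** 2 for s, q in zip(samples, quantized)) / len(samples)
print(quantized, distortion)
```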
2.1.3 Arithmetic Coding

Entropy coding exploits the statistics of the source. Assume a message source emits symbols a_i (1 ≤ i ≤ k), each with probability p(a_i). The information carried by a symbol is

I(a_i) = −log2 p(a_i),  1 ≤ i ≤ k   (2.7)

where the unit of information is the bit for a logarithm of base 2.
The entropy of the message source is then defined as

H = −Σ_{j=1}^{k} p(a_j) log2 p(a_j)   (2.8)
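A short numeric illustration of (2.7) and (2.8) for an assumed three-symbol source:

```python
import math

# Hypothetical source: three symbols and their probabilities.
p = {"a": 0.5, "b": 0.25, "c": 0.25}

# Self-information I(a_i) = -log2 p(a_i), in bits, eq. (2.7).
info = {s: -math.log2(q) for s, q in p.items()}

# Entropy H = sum over j of p(a_j) * I(a_j), eq. (2.8).
entropy = sum(p[s] * info[s] for s in p)

print(info)      # {'a': 1.0, 'b': 2.0, 'c': 2.0}
print(entropy)   # 1.5 bits/symbol: the lower bound for any lossless code
```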
Arithmetic coding is a variable length coding method based on the frequency of each character or symbol. It is suitable for encoding a long stream of symbols or long messages. In arithmetic coding, the probabilities of all code words sum to unity. The events in the data set are arranged in an interval between 0 and 1, and each code word probability can be related to a subdivision of this interval. The algorithm for arithmetic coding then works as follows:

i) Begin with a current interval [L, H) initialized to [0, 1);
ii) For each incoming event, subdivide the current interval into subintervals, one for each possible event, proportional to their probabilities of occurrence;
iii) Select the subinterval corresponding to the incoming event, make it the new current interval and go back to step ii).

Arithmetic coding reduces the information that needs to be transmitted to a single number within the final interval, which is identified after the whole data set is encoded.
The arithmetic decoder, with knowledge of the occurrence probabilities of the different events and the number received, then maps and scales the intervals accordingly to decode the data set.
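The following Python sketch implements just the interval-subdivision loop described above; the symbol model and message are illustrative, and a practical coder also needs renormalization and an explicit termination convention:

```python
def arithmetic_encode(message, probs):
    """Shrink [0, 1) around the message; return one number in the final interval."""
    low, high = 0.0, 1.0
    for symbol in message:
        span = high - low
        cum = 0.0
        # Subdivide the current interval proportionally to symbol probabilities.
        for s, p in probs.items():
            if s == symbol:
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return (low + high) / 2  # any value inside [low, high) identifies the message

probs = {"a": 0.6, "b": 0.3, "c": 0.1}   # illustrative model, must sum to 1
print(arithmetic_encode("aab", probs))   # 0.27: a single number encodes 3 symbols
```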
2.1.4 Binary Coding
Binary coding is lossless, and is a necessary step in any coding system. The process of binary coding is shown in Fig 2.3.
Fig 2.3 Binary coding model
Denote the bit rate produced by such a binary coding system as R. According to Fig 2.3, each symbol a_i with probability p_i is mapped to a code word c_i of bit length l_i, so we have

R = Σ_i p_i l_i   (2.9)
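For example, assuming the prefix code {0, 10, 11} for a three-symbol source:

```python
# Illustrative probabilities and code word lengths (code words 0, 10, 11).
p = [0.5, 0.25, 0.25]
l = [1, 2, 2]

R = sum(pi * li for pi, li in zip(p, l))  # eq. (2.9)
print(R)  # 1.5 bits/symbol, equal to the source entropy: this code is optimal
```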
2.2 Video Compression Using MEMC
Unlike still image compression, video compression attempts to exploit temporal redundancy. There are two types of coding, categorized according to the type of redundancy being exploited: intraframe coding and interframe coding. In intraframe coding, each frame is coded separately using still image compression methods such as transform coding, while interframe coding uses spatial redundancies and motion compensation to exploit the temporal redundancy of the video sequence. This is done by predicting a new frame from its previous frame, so that the original frame to be coded is reduced to the prediction error or residual frame [15]. We do this because prediction errors have smaller energy than the original pixel values and can therefore be coded with fewer bits. Regions with high motion or scene changes are coded directly using transform coding. A video compression system is evaluated using three criteria: reconstruction quality, compression rate and complexity.
The method used to predict a frame from its previous one is called Motion Estimation (ME) or Motion Compensation (MC) [16] [17]. MC uses motion vectors to eliminate or reduce the effects of motion, while ME computes the motion vectors that carry the displacement information of a moving object. The two terms are often referred to jointly as MEMC.
Fig 2.4 Block matching motion estimation
MEMC is normally done independently at the macro block (MB) level (16×16 pixels) in order to reduce computational complexity; this is called the Block Matching Algorithm. In the Block Matching Algorithm (Fig 2.4), a video frame is divided into macro blocks, and each pixel within a block is assumed to have the same amount of translational motion. Motion estimation is achieved by matching a block in the current frame against candidate blocks within a search window in the reference frame. A two-dimensional displacement vector, or motion vector (MV), is then obtained from the displaced coordinate of the matched block in the reference frame. The best prediction is found by minimizing a matching criterion such as the
Sum of Absolute Differences (SAD). SAD is defined as:

SAD(u, v) = Σ_{x=1}^{M} Σ_{y=1}^{N} |B_{i,j}(x, y) − B_{i−u,j−v}(x, y)|   (2.10)

where B_{i,j}(x, y) represents the pixel with coordinate (x, y) in an M×N block from the current frame at spatial location (i, j), while B_{i−u,j−v}(x, y) represents the pixel with coordinate (x, y) in the candidate matching block from the reference frame at spatial location (i, j) displaced by the vector (u, v).
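A minimal full-search block-matching sketch built on this SAD criterion; the frame content, block size and search range below are illustrative:

```python
import numpy as np

def best_motion_vector(cur, ref, i, j, n=4, search=2):
    """Full search: find (u, v) minimizing SAD between the n x n block of the
    current frame at (i, j) and displaced candidate blocks in the reference."""
    block = cur[i:i + n, j:j + n].astype(int)
    best, best_sad = (0, 0), np.inf
    for u in range(-search, search + 1):
        for v in range(-search, search + 1):
            if 0 <= i + u and i + u + n <= ref.shape[0] \
                    and 0 <= j + v and j + v + n <= ref.shape[1]:
                cand = ref[i + u:i + u + n, j + v:j + v + n].astype(int)
                sad = np.abs(block - cand).sum()       # eq. (2.10)
                if sad < best_sad:
                    best_sad, best = sad, (u, v)
    return best, best_sad

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (16, 16))
cur = np.roll(ref, shift=(1, 2), axis=(0, 1))   # simulate a pure translation
print(best_motion_vector(cur, ref, 8, 8))       # recovers the shift with SAD 0
```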
2.3 Wavelet Based Image and Video Coding
This section provides a brief overview of wavelet based image and video coding [18]-[22]. The Discrete Wavelet Transform (DWT) is introduced, and the subband coding schemes, including the Embedded Zerotree Wavelet (EZW) and the Set Partitioning in Hierarchical Trees (SPIHT), are discussed in detail. In the last sub-section, the concept of scalability is introduced.
2.3.1 Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) is an invertible linear transform that decomposes a signal over a set of orthogonal basis functions called wavelets. The fundamental idea behind the DWT is to represent each frequency component at a resolution matched to its scale, so that a signal can be analyzed at various scales or resolutions. In the field of image and video coding, the DWT decomposes video frames or residual frames into a multi-resolution subband representation.
We denote the wavelet basis as

φ_{j,k}(x) = 2^{j/2} φ(2^j x − k)   (2.11)

where the variables j and k are integers that are the scale and location indices, indicating the wavelet's width and position respectively. They are used to scale or “dilate” the mother function φ(x) to generate the wavelets.

The DWT transform pair is then defined as

c_{j,k} = Σ_x f(x) φ_{j,k}(x)   (2.12)

f(x) = Σ_j Σ_k c_{j,k} φ_{j,k}(x)   (2.13)

where f(x) is the signal to be decomposed and c_{j,k} are the wavelet coefficients. To span the data domain at different resolutions, the basis functions at one scale are expressed in terms of those at the next finer scale through a two-scale (dilation) relation of the form

φ(x) = Σ_k a_k φ(2x − k)   (2.14)
Fig 2.5 1-D DWT decomposition

Fig 2.6 Dyadic DWT decomposition of an image
In real applications, the DWT is often performed on a vector whose length is an integer power of 2. As Fig 2.5 shows, the process of 1-D DWT computation comprises a series of filtering and sub-sampling operations, in which H and L denote the high-pass and low-pass filters, a_j is the input vector at one level and c_{j+1} denotes the coefficients produced at the next. The 1-D DWT can be extended to 2-D for image and video processing. In this case, filtering and sub-sampling are first performed along all the rows of the image and then along all the columns; such a 2-D DWT is called the dyadic DWT. A 1-level dyadic DWT results in four different resolution subbands, namely the LL, LH, HL and HH subbands. The decomposition process is shown in Fig 2.6. The LL subband contains the low frequency image and can be further decomposed by a 2-level or 3-level dyadic DWT. Fig 2.7 depicts the subbands of an image decomposed using a 3-level dyadic DWT, and Fig 2.8 shows the Barbara image after 2-level decomposition.
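As a concrete illustration, one level of the 1-D DWT can be sketched with the Haar filter pair, the simplest choice (the thesis does not commit to a particular wavelet filter):

```python
import numpy as np

def haar_dwt_1level(a):
    """One level of 1-D DWT: low-pass and high-pass filtering, each followed
    by sub-sampling by 2 (input length assumed to be a power of 2)."""
    a = np.asarray(a, dtype=float)
    approx = (a[0::2] + a[1::2]) / np.sqrt(2)   # L: approximation coefficients
    detail = (a[0::2] - a[1::2]) / np.sqrt(2)   # H: detail coefficients
    return approx, detail

signal = [4.0, 4.0, 5.0, 7.0, 6.0, 6.0, 8.0, 2.0]
low, high = haar_dwt_1level(signal)
print(low)   # half-length approximation; decompose again for the dyadic pyramid
print(high)  # details are small where the signal is smooth
# In 2-D, the same split is applied to all rows and then all columns,
# yielding the LL, LH, HL and HH subbands of Fig 2.6.
```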
Fig 2.7 Subbands after 3-level dyadic wavelet decomposition
Fig 2.8 2-level DWT decomposed Barbara image
The advantage of the DWT is its versatile time-frequency localization: it has shorter basis functions for higher frequencies and longer basis functions for lower frequencies. The DWT also has an important advantage over the traditional Fourier Transform in that it can analyze signals containing discontinuities and sharp spikes.
2.3.2 EZW Coding Scheme
The good energy compaction property of the DWT has attracted huge research interest in DWT based image and video coding schemes. The main challenge of wavelet-based coding is to find an efficient structure to quantize and code the wavelet coefficients in the transform domain. Lewis and Knowles defined a spatial orientation tree (SOT) structure [23]-[27], and Shapiro then made use of the SOT concept and introduced the Embedded Zerotree Wavelet (EZW) encoder [28] in 1993. The idea was further improved by Said and Pearlman, who modified the EZW SOT structure; their new structure is called Set Partitioning in Hierarchical Trees (SPIHT). A brief discussion of the EZW scheme is provided in this section and a detailed description of SPIHT is provided in the next section.
Shapiro’s EZW coder contains 4 key steps:
i) the discrete wavelet transform;
ii) subband coding using the EZW SOT structure (Fig 2.9);
iii) entropy coded successive-approximation quantization;
iv) adaptive arithmetic coding
A zerotree is actually a SOT which has no significant coefficients with respect to a given threshold. For simplicity, the image in Fig 2.9 is transformed using a 2-level DWT; in most situations, however, a 3-level DWT is applied to ensure better reconstruction quality. As shown in Fig 2.9, the image is divided into 7 subbands after the 2-level wavelet transform. Nodes in the lowest subband each have 3 children nodes, one in each of the neighboring subbands. Their children, in turn, each have 4 children nodes which reside in the same spatial location of the corresponding higher subband. Thus, all the nodes are linked in SOTs, and a search through these SOTs is performed so that significant coefficients are found and coded with higher priority. The core of the EZW encoder, step ii), is based on three concepts: comparison of coefficient magnitudes to a series of decreasing thresholds representing the current bit plane, ordered bit plane coding of refinement bits, and exploitation of the correlation across subbands in the transform domain.
Fig 2.9 Spatial Orientation Tree for EZW
The EZW coding scheme has proved competitive in performance with virtually all known compression techniques, while still generating a fully embedded bit stream. It utilizes both bit plane coding and the zerotree concept.
2.3.3 SPIHT Coding Scheme
Said and Pearlman’s SPIHT coder is an enhancement of the EZW coder. Basically, SPIHT is also a sorting algorithm that codes wavelet coefficients according to a priority defined by their significance with respect to a certain threshold. This is achieved by tracking down the SPIHT SOT and comparing the coefficients against the given threshold. The SPIHT scheme inherits the basic concepts of the EZW, except that it uses a modified SOT, called the SPIHT SOT (Fig 2.10).
Fig 2.10 Spatial Orientation Tree of SPIHT
The SPIHT SOT structure is designed according to the observation that if a coefficient magnitude at a certain node of a SOT does not exceed a given threshold, it is very likely that none of the nodes at the same location in the higher subbands will exceed that threshold. The SPIHT SOT naturally defines this spatial relationship using a hierarchical pyramid. Each node is identified by the coordinates of a pixel, and its magnitude is the corresponding absolute value of that pixel. As Fig 2.10 shows, each node has either no offspring or 4 offspring, which are located at the same spatial orientation in the next finer level of the pyramid. The 4 offspring always form 2×2 adjacent pixel groups. The nodes in the lowest subband of the image, or the highest level of the pyramid, are the roots of the SOTs. There is a slight difference in the offspring branching rule for the tree roots: in each 2×2 root group, the upper left node is childless. Thus, the wavelet coefficients are organized in hierarchical trees, with nodes of common orientation across all subbands linked in one same SOT. This allows us to predict a coefficient’s significance from the magnitude of its parent node later.
We use the symbols in Said and Pearlman’s paper to denote the coordinates and the sets:

• O(i,j) denotes the set of coordinates of all offspring of node (i,j);
• D(i,j) denotes the set of coordinates of all descendants of node (i,j);
• H denotes the set of all nodes in the lowest subband, inclusive of the childless nodes;
• L(i,j) denotes the set of coordinates of all non-direct descendants of node (i,j), i.e., L(i,j) = D(i,j) − O(i,j).

Now we can express the SOT descendant branching rule by equation (2.15):
O(i,j) = {(2i, 2j), (2i+1, 2j), (2i, 2j+1), (2i+1,2j+1)} (2.15)
After the sets are defined, the set partitioning rule is used to create new partitions in order to effectively predict and code significant nodes. A magnitude test is performed on each partitioned subset to determine its significance. If significant, the subset is further partitioned into new subsets, and the magnitude test is again applied to the new subsets until each individual significant coefficient is identified. Note that an individual coefficient is significant when it is larger than the current threshold, and insignificant otherwise; for a set to be significant, at least one descendant must be significant on an individual basis. We denote the transformed coefficients as c_{i,j} and the pixel sets as τ, and use the following function to define the relationship between magnitude comparisons and message bits:
S_n(τ) = 1, if max_{(i,j)∈τ} |c_{i,j}| ≥ 2^n; 0, otherwise   (2.16)
The set partitioning rule is then defined as follows (a small sketch of these set operations is given below):

i) The initial partition is formed with the sets {(i,j)} and D(i,j), for all (i,j) ∈ H;
ii) If D(i,j) is significant, it is partitioned into L(i,j) plus the four single-element sets {(k,l)} with (k,l) ∈ O(i,j);
iii) If L(i,j) is significant, it is partitioned into the four sets D(k,l) with (k,l) ∈ O(i,j).
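The sketch below implements the offspring rule (2.15) and the significance test (2.16) on a toy 4×4 coefficient array; the special branching rule at the tree roots (the childless upper-left node of each 2×2 root group) is omitted for brevity, and the coefficient values are illustrative:

```python
import numpy as np

def offspring(i, j, h, w):
    """O(i,j): the four children of eq. (2.15), kept inside an h x w array."""
    kids = [(2*i, 2*j), (2*i + 1, 2*j), (2*i, 2*j + 1), (2*i + 1, 2*j + 1)]
    return [(k, l) for (k, l) in kids if k < h and l < w]

def descendants(i, j, h, w):
    """D(i,j): all descendants of node (i,j), gathered recursively."""
    out = []
    for (k, l) in offspring(i, j, h, w):
        out.append((k, l))
        out.extend(descendants(k, l, h, w))
    return out

def significant(coeffs, tau, n):
    """S_n(tau) of eq. (2.16): 1 iff some |c_ij| in the set reaches 2^n."""
    return int(max(abs(coeffs[k, l]) for (k, l) in tau) >= 2 ** n)

c = np.array([[26,  6, 13, 10],
              [-7,  7,  6,  4],
              [ 4, -4,  4, -3],
              [ 2, -2, -2,  0]])
tau = descendants(0, 1, 4, 4)      # the subtree rooted at node (0, 1)
print(tau)                         # [(0, 2), (1, 2), (0, 3), (1, 3)]
print(significant(c, tau, 4))      # 0: max magnitude 13 < 2^4, an insignificant set
```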
Following the SOT structure and the set partitioning rule, an image that has large coefficients at the SOT roots and zero or very small coefficients in the higher levels of the SOTs needs very little sorting and partitioning of the pixel sets. This property greatly reduces the computational complexity and allows for a better reconstruction of the image.
In implementation, the SPIHT coding algorithm uses 3 ordered lists to store the significance information, i.e.:
i) list of insignificant pixels (LIP)
ii) list of significant pixels (LSP)
iii) list of insignificant sets (LIS)
In the case of the LIP and LSP, the coordinates of the pixels are stored in the lists. In the case of the LIS, however, the list contains two types of entries, categorized according to which set an entry represents: if an entry represents the set D(i,j), we say it is a type A entry; if it represents L(i,j), we say it is a type B entry.
To initialize the coding algorithm, the maximum coefficient magnitude in the image is identified and the initial bit plane is assigned the value n = ⌊log2(max_{(i,j)}{|c_{i,j}|})⌋. The threshold value is then obtained by computing 2^n. Also, the LIS and LIP lists are initialized with the pixel coordinates in the highest subband. The set partitioning rule is then applied to the LIP and LIS lists to judge the significance status of the pixels or sets; this is called the sorting pass. Thereafter, the refinement pass goes through the LSP list to code the bits necessary to enhance the precision of the significant coefficients from previous sorting passes by one bit position. This completes coding of the first bit plane. To continue, the bit plane is decreased by 1 and the sorting and refinement passes are re-executed in the next iteration. This process is repeated until the bit plane is reduced to zero or a user-given bit budget runs out. Fig 2.11 demonstrates the above coding algorithm. Note that in step 2.2), entries added to the end of the LIS are evaluated before that same sorting pass ends; so step 2.2) sorts not only the originally initialized entries, but also the entries being added to the LIS.
The SPIHT coder improves on the performance of the EZW coder by 0.3-0.6 dB. This gain is mostly due to the fact that the original zerotree algorithm allows special symbols only for single zerotrees, while there are often other sets of zeros in reality; in particular, the SPIHT coder provides symbols for combinations of parallel zerotrees. Moreover, SPIHT produces a fully embedded bit stream whose bit rate can be precisely controlled. The SPIHT coder is very fast and has a low computational complexity. Both EZW and SPIHT belong to the subband coding schemes, and both exploit the correlation between subbands through the SOT.
1) Initialization:
   1.1) Output n = ⌊log2(max_{(i,j)}{|c_{i,j}|})⌋;
   1.2) Set the LSP as an empty list; add the coordinates (i,j) ∈ H to the LIP,
        and those with descendants to the LIS as TYPE A entries.
2) Sorting Pass:
   2.1) For each entry (i,j) in the LIP do:
        - Output S_n(i,j);
        - If S_n(i,j) = 1 then move (i,j) to the LSP and output the sign of c_{i,j};
   2.2) For each entry (i,j) in the LIS do:
        2.2.1) If the entry is TYPE A then
               - Output S_n(D(i,j));
               - If S_n(D(i,j)) = 1 then
                 + For each offspring (k,l) of (i,j) do:
                     - Output S_n(k,l);
                     - If S_n(k,l) = 1 then add (k,l) to the LSP and output the sign of c_{k,l};
                     - If S_n(k,l) = 0 then add (k,l) to the end of the LIP;
                 + If L(i,j) ≠ ∅ then move (i,j) to the end of the LIS as TYPE B
                   and go to step 2.2.2); otherwise remove (i,j) from the LIS;
        2.2.2) If the entry is TYPE B then
               - Output S_n(L(i,j));
               - If S_n(L(i,j)) = 1 then
                 + Add each (k,l) ∈ O(i,j) to the end of the LIS as TYPE A;
                 + Remove (i,j) from the LIS;
3) Refinement Pass:
   For each entry (i,j) in the LSP, except those added in the last sorting pass:
   - Output the nth most significant bit of |c_{i,j}|;
4) Quantization-Step Update:
   Decrease n by 1 and go to step 2.

Fig 2.11 SPIHT coding algorithm
2.3.4 Scalability
One advantage of the SPIHT image and video coder is its bit rate scalability. Scalability is the degree to which video and image formats can be sized in systematic proportions for distribution over communication channels of varying capacities [29]. In other words, it measures how flexible an encoded bit stream is. Scalable image and video coding has received considerable attention from the research community due to the diversity of communication networks and network users.
There are three basic types of scalability, and they refine video quality along three different dimensions, i.e.:
• Temporal scalability or temporal resolution/frame rate
• Spatial scalability or spatial resolution
• SNR scalability or amplitude resolution
Each type of scalable coding provides scalability along one dimension of the video sequence, and multiple types can be combined to provide scalability along multiple dimensions. In real applications, being temporally scalable often means supporting different frame rates, while spatial and SNR scalability mean video of different spatial resolutions and visual qualities respectively.
One common method of providing scalability is to apply subband decomposition to the video sequences. The full resolution video can then be decoded using both the low pass and high pass subbands, while half resolution video can be decoded using only the low pass subband. The resulting half resolution video can be passed through further subband decomposition to create quarter resolution video, and so on. We will use this concept in chapter 4.
2.4 Image and Video Coding Standards
As international organizations, ISO/IEC and ITU-T have been heavily involved in the standardization of image, audio and video coding. Specifically, ISO/IEC focuses on video storage, broadcast video and video streaming applications, while ITU-T caters to real time video applications. Current video standards mainly comprise the ISO MPEG family and the ITU-T H.26x family. Table 2.1 [30] provides an overview of these standards and their applications. JPEG and JPEG 2000 are also listed as still image coding standards for reference.
Standard    Application                                     Bit rate
JPEG 2000   Improved still image compression                Variable
MPEG-2      Digital television, video on DVD                2-20 Mbps
MPEG-4      Object-based coding, interactive video          28-1024 kbps
H.261       Video conferencing over ISDN                    Variable
H.263       Video conferencing over Internet and PSTN,      >= 33 kbps
            wireless video conferencing
H.26L       Improved video compression                      10-100 kbps

Table 2.1 Image and video compression standards
CHAPTER 3 VIDEO STREAMING AND NETWORK QoS
This chapter provides some fundamentals of video streaming. Network Quality of Service (QoS) is defined, and two frameworks, the Integrated Services (IntServ) and the Differentiated Services (DiffServ), are discussed in detail. Finally, the principles of layered video streaming are presented.
3.1 Video Streaming Models
Unicast and multicast are the two models of video streaming. Unicast is communication between a single sender and a single receiver. As shown in Fig 3.1, the sender sends an individual copy of the video stream to each client, even when some of the clients require the same video resource. Unicast is also called point-to-point communication because there is effectively a non-shared connection from the server to each client.

Fig 3.1 Unicast video streaming
By contrast, communication between a single sender and multiple receivers is called multicast, or point-to-multipoint communication. In the multicast scenario (Fig 3.2), the sender sends only one copy of the required video over the network; it is then routed to several destinations by the network switches or routers. A client receives the video stream by tuning in to a multicast group in its neighborhood. When the clients belong to multiple groups, the video is duplicated and branched at fork points, as shown at router R1 (Fig 3.2).
Fig 3.2 Multicast video streaming
3.2 Characteristics and Challenges of Video Streaming
Video streaming is a real time application. Unlike traditional data-oriented applications such as email, ftp, and web browsing, video streaming applications are highly delay-sensitive and need their data to arrive on time to be useful. As such, the service requirements of video streaming applications differ significantly from those of traditional data-oriented applications, and satisfying these requirements is a great challenge on today’s Internet.
First of all, the best effort (BE) service that the current Internet provides is far from sufficient for a real time application such as video streaming. Under the BE service model, there is no guarantee on delay bound or loss rate, and when the data load on the Internet is heavy, delivery results can be unacceptable. Video streaming, on the other hand, requires timely and, to some extent, correct delivery; we must ensure that the streaming is still viable, with a decreased quality, in times of congestion.
Second, client machines on the Internet normally vary significantly in their computing, display and memory capabilities. In most cases, these heterogeneous clients will require video of different qualities. It is obviously inefficient to deliver the same video stream to all clients separately; instead, streaming in response to the particular requests of each individual client is desirable.
In conclusion, suitable coding and streaming strategies are needed to support efficient real time video streaming. Scalable video coding and new network service models which support network Quality of Service have been developed to address the above challenges. We will discuss the details of QoS in the next section.
3.3 Quality of Service
3.3.1 Definition of QoS
The current Internet provides one single class of service, the BE service. BE service generally treats all data as one service class and gives priority to no particular data or users. This is not enough for new real time applications such as video streaming. We must modify the Internet to provide more service options, which can, to some extent, keep the service quality up to a certain level that has been previously agreed on by the user and the network. This new service model is provided with the support of network QoS.
There are generally two approaches to supporting QoS. One is the fine-grained approach, which provides QoS to individual applications or flows; the other is the coarse-grained approach, which aggregates traffic into classes and provides QoS to these classes. The Internet Engineering Task Force (IETF) has developed QoS frameworks through both approaches, namely the Integrated Services (IntServ) framework as an example of the fine-grained approach, and the Differentiated Services (DiffServ) framework as an example of the coarse-grained approach.
3.3.2 IntServ Framework

In the IntServ architecture (Fig 3.3), a flow requests resources through a reservation setup protocol such as RSVP, and several control modules cooperate to determine or assist in the reservation. The packet classifier classifies a packet into an appropriate QoS class. Policy control is then used to examine the packet to see whether it has administrative permission to make the requested reservation. For final reservation success, however, admission control must also be passed, to ensure that the desired resources can be granted without affecting the QoS previously requested by and admitted to other flows. Finally, the packet is scheduled to enter the network at a proper time by the packet scheduler in the primary data forwarding path.
Fig 3.3 IntServ architecture
The IntServ architecture adds two service classes to the existing BE model: guaranteed service and controlled load service.
The basic concept of guaranteed service can be described using a linear flow regulator called the leaky bucket regulator (Fig 3.4). Suppose the bucket holds up to b tokens, and new tokens are filled in at a rate of r tokens/sec. Before filtering by the regulator, packets arrive at a variable rate; under the filtering, however, they must wait at the input queue until an equivalent amount of tokens is available before they can proceed further into the network. Obviously, such a regulator passes flows with a maximum burst of b tokens and an average rate of r tokens/sec; thus, it confines the traffic entering the network to b + rt tokens over any interval of t seconds.
Fig 3.4 Leaky bucket regulator
To invoke the service, a router needs to be informed of the traffic and reservation characteristics, denoted by Tspec and Rspec respectively; a small conformance sketch follows the two parameter lists below.
Tspec contains the following parameters:
• p = peak rate of flow (bytes/s)
• b = bucket depth (bytes)
• r = token bucket rate (bytes/s)
• m = minimum policed unit (bytes)
• M = maximum datagram size (bytes)
Rspec contains the following parameters:
• R = bandwidth, i.e., service rate (bytes/s)
• S = slack term (ms)
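As promised above, here is a minimal sketch of the leaky bucket conformance test, checking the b + rt bound directly; the packet trace and the b and r values are illustrative, not drawn from the thesis:

```python
def conforms(packets, b, r):
    """Check a flow against the leaky bucket: traffic must never exceed
    b + r*t tokens over any interval of t seconds. packets is a list of
    (arrival_time_s, size_in_tokens) pairs in non-decreasing time order."""
    tokens, last_t = b, 0.0          # the bucket starts full
    for t, size in packets:
        tokens = min(b, tokens + r * (t - last_t))  # refill at rate r, cap at b
        last_t = t
        if size > tokens:
            return False             # packet would have to wait or be dropped
        tokens -= size
    return True

burst = [(0.0, 1500), (0.0, 1500), (0.1, 1500)]     # a 3-packet burst
print(conforms(burst, b=4000, r=10000))  # True: burst fits within b + r*t
print(conforms(burst, b=2000, r=1000))   # False: the bucket is too shallow
```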
Guaranteed service promises a maximum delay for a flow, provided that the flow conforms to its specified traffic parameters. This service model aims to support applications with hard real time requirements.
Unlike guaranteed service, controlled-load service provides no rigid delay or loss guarantees. Instead, it provides a QoS similar to BE service in an under-utilized network, with almost no loss or delay. When the network is overloaded, it tries to share the bandwidth among multiple streams in a controlled way so as to maintain approximately the same level of QoS. Controlled-load service is intended to support applications that can tolerate a reasonable amount of delay and loss.
3.3.3 DiffServ Framework
IntServ provides fine-grained QoS guarantees using per-flow Tspec signaling. However, maintaining a Tspec for each flow may be too expensive in implementation. Besides, incremental deployment is only possible for controlled-load service, while it is difficult to realize guaranteed service across the network. Therefore, there is a need for more flexible service models that allow for more qualitative definitions of service distinctions. The solution is DiffServ, an architecture that aims to provide scalable and flexible service differentiation.
Generally, the DiffServ architecture comprises 4 key concepts: