Visual Media Coding and Transmission
Ahmet Kondoz
© 2009 John Wiley & Sons, Ltd ISBN: 978-0-470-74057-6
Visual Media Coding and Transmission
Ahmet Kondoz
Centre for Communication Systems Research, University of Surrey, UK
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom. For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
© 1998, © 2001, © 2002, © 2003, © 2004 3GPP™. TSs and TRs are the property of ARIB, ATIS, CCSA, ETSI, TTA and TTC, who jointly own the copyright in them. They are subject to further modifications and are therefore provided to you 'as is' for information purposes only. Further use is strictly prohibited.
Library of Congress Cataloging-in-Publication Data
Set in 10/12pt Times New Roman by Thomson Digital, Noida, India.
Printed in Great Britain by CPI Antony Rowe, Chippenham, England
3 Scalable Video Coding 39
3.3.1 Scalable Coding for Shape, Texture, and Depth for 3D Video 48
3.4.2 Odd Even Frame Multiple Description Coding
4.3 Stopping Criteria for a Feedback Channel-based Transform
4.4 Rate-distortion Analysis of Motion-compensated Interpolation
4.5 Nonlinear Quantization Technique for Distributed Video Coding 129
4.6 Symmetric Distributed Coding of Stereo Video Sequences 134
4.7 Studying Error-resilience Performance for a Feedback Channel-based
5.3 Rate Control Architecture for Joint MVS Encoding and Transcoding 165
5.4 Bit Allocation and Buffer Control for MVS Encoding Rate Control 171
5.6 Spatio-temporal Scene-level Error Concealment for Segmented Video 182
5.7 An Integrated Error-resilient Object-based Video
5.7.3 Performance Evaluation 195
6.3 Inter-view Prediction using Reconstructed Disparity
6.5 Low-delay Random View Access in Multi-view Coding Using
7.2.5 Resource Management Strategy in Wireless Multimedia
7.3 Conclusions 244
9.1.12 Support of Voice over UMTS Networks 360
10.4 Performances of Video Transmission in Inter-networked Systems 442
11 Context-based Visual Media Content Adaptation 455
11.2 Overview of the State of the Art in Context-aware Content Adaptation 457
11.2.2 Standardization Efforts on Contextual Information for
11.4.1 Integrating Digital Rights Management (DRM) with Adaptation 480
11.6 The Application Scenario for Context-based Adaptation
11.6.2 Mechanisms using Contextual Information in a Virtual
11.6.4 System Architecture of a Scalable Platform for Context-aware
11.6.9 Interfaces between Modules of the Content Adaptation Platform 544
EPFL: Touradj Ebrahimi, Frederic Dufaux, Thien Ha-Minh, Michael Ansorge, Shuiming Ye, Yannick Maret, David Marimon, Ulrich Hoffmann, Mourad Ouaret, Francesca De Simone, Carlos Bandeirinha, Peter Vajda, Ashkan Yazdani, Gelareh Mohammadi, Alessandro Tortelli, Luca Bonardi, Davide Forzati
IST: Fernando Pereira, João Ascenso, Catarina Brites, Luis Ducla Soares, Paulo Nunes, Paulo Correia, Jose Diogo Areia, Jose Quintas Pedro, Ricardo Martins
UPC-TSC: Pere Joaquim Mindan, Jose Luis Valenzuela, Toni Rama, Luis Torres, Francesc Tarres
UPC-AC: Jaime Delgado, Eva Rodríguez, Anna Carreras, Ruben Tous
TRT-UK: Chris Firth, Tim Masterton, Adrian Waller, Darren Price, Rachel Craddock, Marcello Goccia, Ian Mockford, Hamid Asgari, Charlie Attwood, Peter de Waard, Jonathan Dennis, Doug Watson, Val Millington, Andy Vooght
TUB: Thomas Sikora, Zouhair Belkoura, Juan Jose Burred
IPW: Stanisław Badura, Lilla Bagińska, Jarosław Baszun, Filip Borowski, Andrzej Buchowicz, Emil Dmoch, Edyta Dąbrowska, Grzegorz Galiński, Piotr Garbat, Krystian Ignasiak, Mariusz Jakubowski, Mariusz Leszczyński, Marcin Morgoś, Jacek Naruniec, Artur Nowakowski, Adam Ołdak, Grzegorz Pastuszak, Andrzej Pietrasiewicz, Adam Pietrowcew, Sławomir Rymaszewski, Radosław Sikora, Władysław Skarbek, Marek Sutkowski, Michał Tomaszewski, Karol Wnukowicz
INESC Porto: Giorgiana Ciobanu, Filipe Sousa, Jaime Cardoso, Jaime Dias, Jorge Mamede, Jose Ruela, Luís Corte-Real, Luís Gustavo Martins, Luís Filipe Teixeira, Maria Teresa Andrade, Pedro Carvalho, Ricardo Duarte, Vítor Barbosa
VISNET II is a European Union Network of Excellence (NoE) in the 6th Framework Programme, which brings together 12 leading European organizations in the field of Networked Audiovisual Media Technologies. The consortium consists of organizations with a proven track record and strong national and international reputations in audiovisual information technologies. VISNET II integrates over 100 researchers who have made significant contributions to this field of technology, through standardization activities, international publications, conferences, workshop activities, patents, and many other prestigious achievements. The 12 integrated organizations represent 7 European states spanning across a major part of Europe, thereby promising efficient dissemination and exploitation of the resulting technological development to larger communities.

This book contains some of the research output of VISNET II in the area of Advanced Video Coding and Networking. The book contains details of video coding principles, which lead to advanced video coding developments in the form of scalable coding, distributed video coding, non-normative video coding tools, and transform-based multi-view coding. Having detailed the latest work in visual media coding, the networking aspects of video communication are presented in the second part of the book. Various wireless channel models are presented, to form the basis for following chapters. Both link-level quality of service (QoS) and cross-network transmission of compressed visual data are considered. Finally, context-based visual media content adaptation is discussed with some examples.
It is hoped that this book will be used as a reference not only for some of the advanced video coding techniques, but also for the transmission of video across various wireless systems with well-defined channel models.
Ahmet Kondoz
University of Surrey
VISNET II Coordinator
Glossary of Abbreviations
ADMITS Adaptation in Distributed Multimedia IT Systems
CC/PP Composite Capabilities/Preferences Profile
CoDAMoS Context-Driven Adaptation of Mobile Services
CoGITO Context Gatherer, Interpreter and Transformer using Ontologies
CROSLOCIS Creation of Smart Local City Services
CS/H.264/AVC Cropping and Scaling of H.264/AVC Encoded Video
DAML DARPA Agent Markup Language
DANAE Dynamic and distributed Adaptation of scalable multimedia content
DistriNet Distributed Systems and Computer Networks
ISO International Organization for Standardization
ITEC Department of Information Technology, Klagenfurt University
MDS Multimedia Description Schemes
MP3 Moving Picture Experts Group Layer-3 Audio (audio file format/extension)
UI User Item
WiFi Wireless Fidelity (IEEE 802.11b Wireless Networking)
XSLT eXtensible Stylesheet Language Transformations
Introduction
Networked Audio-Visual Technologies form the basis for the multimedia communication systems that we currently use. The communication systems that must be supported are diverse, ranging from fixed wired to mobile wireless systems. In order to enable an efficient and cost-effective Networked Audio-Visual System, two major technological areas need to be investigated: first, how to process the content for transmission purposes, which involves various media compression processes; and second, how to transport it over the diverse network technologies that are currently in use or will be deployed in the near future. In this book, therefore, visual data compression schemes are presented first, followed by a description of various media transmission aspects, including various channel models, and content and link adaptation techniques.

Raw digital video signals are very large in size, making it very difficult to transmit or store them. Video compression techniques are therefore essential enabling technologies for digital multimedia applications. Since 1984, a wide range of digital video codecs have been standardized, each of which represents a step forward either in terms of compression efficiency or in functionality. The MPEG-x and H.26x video coding standards adopt a hybrid coding approach, employing block-matching motion estimation/compensation, in addition to the discrete cosine transform (DCT) and quantization. The reasons are: first, a significant proportion of the motion trajectories found in natural video can be approximately described with a rigid translational motion model; second, fewer bits are required to describe simple translational motion; and finally, the implementation is relatively straightforward and amenable to hardware solutions. These hybrid video systems have provided interoperability in heterogeneous network systems. Considering that transmission bandwidth is still a valuable commodity, ongoing developments in video coding seek scalability solutions to achieve a one-coding multiple-decoding feature. To this end, the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) has standardized a scalability extension to the existing H.264/AVC codec. The H.264-based Scalable Video Coding (SVC) allows partial transmission and decoding of the bit stream, resulting in various options in terms of picture quality and spatial-temporal resolutions.
In this book, several advanced features/techniques relating to scalable video coding are further described, mostly to do with 3D scalable video coding applications. Applications and scenarios for the scalable coding systems, advances in scalable video coding for 3D video applications, a non-standardized scalable 2D model-based video coding scheme applied on the
Trang 22texture, and depth coding of 3D video are all discussed A scalable, multiple description coding(MDC) application for stereoscopic 3D video is detailed Multi-view coding and DistributedVideo Coding concepts representing the latest advancements in video coding are also covered
in significant depth
The definition of video coding standards is of the utmost importance because it guarantees that video coding equipment from different manufacturers will be able to interoperate. However, the definition of a standard also represents a significant constraint for manufacturers because it limits what they can do. Therefore, in order to minimize the restrictions imposed on manufacturers, only those tools that are essential for interoperability are typically specified in the standard: the normative tools. The remaining tools, which are not standardized but are also important in video coding systems, are referred to as non-normative tools, and this is where competition and evolution of the technology have been taking place. In fact, this strategy of specifying only the bare minimum that can guarantee interoperability ensures that the latest developments in the area of non-normative tools can be easily incorporated in video codecs without compromising their standard compatibility, even after the standard has been finalized.
In addition, this strategy makes it possible for manufacturers to compete against each other and
to distinguish between their products in the market. A significant amount of research effort is being devoted to the development of non-normative video coding tools, with the target of improving the performance of standard video codecs. In particular, due to their importance, rate control and error resilience non-normative tools are being researched. In this book, therefore, the development of efficient tools for the modules that are non-normative in video coding standards, such as rate control and error concealment, is discussed. For example, multiple video sequence (MVS) joint rate control addresses the development of rate control solutions for encoding video scenes formed from a composition of video objects (VOs), such as in the MPEG-4 standard, and can also be applied to the joint encoding and transcoding of multiple video sequences (VSs) to be transmitted over bandwidth-limited channels using the H.264/AVC standard.
The goal of wireless communication is to allow a user to access required services at any time with no regard to location or mobility. Recent developments in wireless communications, multimedia technologies, and microelectronics technologies have created a new paradigm in mobile communications. Third/fourth-generation (3G/4G) wireless communication technologies provide significantly higher transmission rates and service flexibility over a wide coverage area, as compared with second-generation (2G) wireless communication systems. High-compression, error-robust multimedia codecs have been designed to enable the support of multimedia applications over error-prone bandwidth-limited channels. The advances of VLSI and DSP technologies are enabling lightweight, low-cost, portable devices capable of transmitting and viewing multimedia streams. The above technological developments have shifted the service requirements of mobile communication from conventional voice telephony to business- and entertainment-oriented multimedia services in wireless communication systems. In order to successfully meet the challenges set by the latest current and future audiovisual communication requirements, the International Telecommunication Union-Radiocommunications (ITU-R) sector has elaborated on a framework for global 3G standards by recognizing a limited number of radio access technologies. These are: Universal Mobile Telecommunications System (UMTS), Enhanced Data rates for GSM Evolution (EDGE), and CDMA2000. UMTS is based on Wideband CDMA technology and is employed in Europe and Asia using the frequency band around 2 GHz. EDGE is based on TDMA technology and uses
the same air interface as the successful 2G mobile system GSM. General Packet Radio Service (GPRS) and High-Speed Circuit Switched Data (HSCSD) are introduced by Phase 2+ of the GSM standardization process. They support enhanced services with data rates up to 144 kbps in the packet-switched and circuit-switched domains, respectively. EDGE, which is the evolution of GPRS and HSCSD, provides 3G services up to 500 kbps within the GSM carrier spacing of 200 kHz. CDMA2000 is based on multi-carrier CDMA technology and provides the upgrade solution for existing IS-95 operators, mainly in North America. EDGE and UMTS are the most widely accepted 3G radio access technologies. They are standardized by the 3rd Generation Partnership Project (3GPP). Even though EDGE and UMTS are based on two different multiple-access technologies, both systems share the same core network. The evolved GSM core network serves as a common GSM/UMTS core network that supports GSM/GPRS/EDGE and UMTS access. In addition, Wireless Local Area Networks (WLANs) are becoming more and more popular for communication in homes, offices, and indoor public areas such as campus environments, airports, hotels, shopping centres, and so on. IEEE 802.11 has a number of physical layer specifications with a common MAC operation. IEEE 802.11 includes two physical layers, a frequency-hopping spread-spectrum (FHSS) physical layer and a direct-sequence spread-spectrum (DSSS) physical layer, and operates at 2 Mbps. The currently deployed IEEE 802.11b standard provides an additional physical layer based on high-rate direct-sequence spread-spectrum (HR/DSSS). It operates in the 2.4 GHz unlicensed band and provides bit rates up to 11 Mbps. The IEEE 802.11a standard for the 5 GHz band provides high bit rates up to 54 Mbps and uses a physical layer based on orthogonal frequency division multiplexing (OFDM). Recently, the IEEE 802.11g standard has also been issued to achieve such high bit rates in the 2.4 GHz band.
The Worldwide Interoperability for Microwave Access (WiMAX) is a telecommunications technology aimed at providing wireless data over long distances in different ways, from point-to-point links to full mobile cellular access. It is based on the IEEE 802.16 standard, which is also called WirelessMAN. The name WiMAX was created by the WiMAX Forum, which was formed in June 2001 to promote conformance and interoperability of the standard. The forum describes WiMAX as "a standards-based technology enabling the delivery of last mile wireless broadband access as an alternative to cable and DSL". Mobile WiMAX (IEEE 802.16e) provides fixed, nomadic, and mobile broadband wireless access systems with superior throughput performance. It enables non-line-of-sight reception, and can also cope with high mobility of the receiving station. IEEE 802.16e enables nomadic capabilities for laptops and other mobile devices, allowing users to benefit from metro-area portability of an xDSL-like service.

Multimedia services by definition require the transmission of multiple media streams, such
as video, still picture, music, voice, and text data. A combination of these media types provides a number of value-added services, including video telephony, E-commerce services, multi-party video conferencing, virtual office, and 3D video. 3D video, for example, provides more natural and immersive visual information to end users than standard 2D video. In the near future, certain 2D video application scenarios are likely to be replaced by 3D video in order to achieve a more involving and immersive representation of visual information and to provide more natural methods of communication. 3D video transmission, however, requires more resources than conventional video communication applications.
Different media types have different quality-of-service (QoS) requirements and enforce conflicting constraints on the communication networks. Still picture and text data are categorized as background services and require high data rates but have no constraints on the transmission delay. Voice services, on the other hand, are characterized by low delay. However, they can be coded using fixed low-rate algorithms operating in the 5-24 kbps range.
In contrast to voice and data services, low-bit-rate video coding involves rates at tens to hundreds of kbps. Moreover, video applications are delay sensitive and impose tight constraints on system resources. Mobile multimedia applications, consisting of multiple signal types, play an important role in the rapid penetration of future communication services and the success of these communication systems. Even though the high transmission rates and service flexibility have made wireless multimedia communication possible over 3G/4G wireless communication systems, many challenges remain to be addressed in order to support efficient communications in multi-user, multi-service environments. In addition to the high initial cost associated with the deployment of 3G systems, the move from telephony and low-bit-rate data services to bandwidth-consuming 3G services implies high system costs, as these consume a large portion of the available resources. However, for rapid market evolvement, these wideband services should not be substantially more expensive than the services offered today. Therefore, efficient system resource (mainly the bandwidth-limited radio resource) utilization and QoS management are critical in 3G/4G systems.
Efficient resource management and the provision of QoS for multimedia applications are in sharp conflict with one another. Of course, it is possible to provide high-quality multimedia services by using a large amount of radio resources and very strong channel protection. However, this is clearly inefficient in terms of system resource allocation. Moreover, the perceptual multimedia quality received by end users depends on many factors, such as source rate, channel protection, channel quality, error resilience techniques, transmission/processing power, system load, and user interference. Therefore, it is difficult to obtain an optimal source and network parameter combination for a given set of source and channel characteristics. The time-varying error characteristics of the radio access channel aggravate the problem. In this book, therefore, various QoS-based resource management systems are detailed. For comparison and validation purposes, a number of wireless channel models are described. The key QoS improvement techniques, including content and link-adaptation techniques, are covered.

The future media Internet will allow new applications with support for ubiquitous media-rich content service technologies to be realized. Virtual collaboration, extended home platforms, augmented, mixed and virtual realities, gaming, telemedicine, e-learning and so on, in which users with possibly diverse geographical locations, terminal types, connectivity, usage environments, and preferences access and exchange pervasive yet protected and trusted content, are just a few examples. These multiple forms of diversity require content to be transported and rendered in different forms, which necessitates the use of context-aware content adaptation. This avoids the alternative of predicting, generating, and storing all the different forms required for every item of content. Therefore, there is a growing need for devising adequate concepts and functionalities of a context-aware content adaptation platform that suits the requirements of such multimedia application scenarios. This platform needs to be able to consume low-level contextual information to infer higher-level contexts, and thus decide the need and type of adaptation operations to be performed upon the content. In this way, usage constraints can be met while restrictions imposed by the Digital Rights Management (DRM) governing the use of protected content are satisfied.
In this book, comprehensive discussions are presented on the use of contextual information
in adaptation decision operations, with a view to managing the DRM and the authorization for adaptation, consequently outlining the appropriate adaptation decision techniques and adaptation mechanisms. The main challenges are found by identifying integrated tools and systems that support adaptive, context-aware and distributed applications which react to the characteristics and conditions of the usage environment and provide transparent access and delivery of content, where digital rights are adequately managed. The discussions focus on describing a scalable platform for context-aware and DRM-enabled adaptation of multimedia content. The platform has a modular architecture to ensure scalability, and well-defined interfaces based on open standards for interoperability as well as portability. The modules are classified into four categories, namely: (1) Adaptation Decision Engine (ADE); (2) Adaptation Authoriser (AA); (3) Context Providers (CxPs); and (4) Adaptation Engine Stacks (AESs), which comprise Adaptation Engines (AEs). During the adaptation decision-taking stage, the platform uses ontologies to enable semantic description of real-world situations. The decision-taking process is triggered by low-level contextual information and driven by rules provided by the ontologies. It supports a variety of adaptations, which can be dynamically configured. The overall objective of this platform is to enable the efficient gathering and use of context information, ultimately in order to build content adaptation applications that maximize user satisfaction.
Since 1984, a wide range of digital video codecs have been standardized, each of which represents a step forward either in terms of compression efficiency or in functionality. This chapter describes the basic principles behind most standard block-based video codecs currently in use. It begins with a discussion of the types of redundancy present in most video signals (Section 2.2) and proceeds to describe some basic techniques for removing such redundancies (Section 2.3). Section 2.4 investigates enhancements to the basic techniques which have been used in recent video coding standards to provide improvements in video quality. This section also discusses the effects of communication channel errors on decoded video quality. Section 2.5 provides a summary of the available video coding standards and describes some of the key differences between them. Section 2.6 gives an overview of how video quality can be assessed. It includes a description of objective and subjective assessment techniques.
2.2 Redundancy in Video Signals
Compression techniques are generally based upon removal of redundancy in the original signal. In video signals, the redundancy can be classified as spatial, temporal, or source-coding. Most standard video codecs attempt to remove these types of redundancy, taking into account certain properties of the human visual system.
Spatial redundancy is present in areas of images or video frames where pixel values vary by small amounts. In the image shown in Figure 2.1, spatial redundancy is present in parts of the background, and in skin areas such as the shoulder.
Temporal redundancy is present in video signals when there is significant similarity between successive video frames. Figure 2.2 shows two successive frames from a video sequence. It is clear that the difference between the two frames is small, indicating that it would be inefficient to simply compress a video signal as a series of images.
Source-coding redundancy is present if the symbols produced by the video codec are inefficiently mapped to a binary bitstream. Typically, entropy coding techniques are used to exploit the statistics of the output video data, where some symbols occur with greater probability than others.
2.3 Fundamentals of Video Compression
This section describes how spatial redundancy and temporal redundancy can be removed from
a video signal. It also describes how a typical video codec combines the two techniques to achieve compression.
2.3.1 Video Signal Representation and Picture Structure
Video coding is usually performed with YUV 4:2:0 format video as an input. This format represents video using one luminance plane (Y) and two chrominance planes (Cb and Cr). The luminance plane represents black and white information, while the chrominance planes contain all of the color data. Because luminance data is perceptually more important than the chrominance data, the resolution of the chrominance planes is half that of the luminance in both dimensions. Thus, each chrominance plane contains a quarter of the pixels contained in the luminance plane. Downsampling the color information means that less information needs to be compressed, but it does not result in a significant degradation in quality.

Figure 2.1 Spatial redundancy is present in areas of an image or video frame where the pixel values are very similar

Figure 2.2 Temporal redundancy occurs when there is a large amount of similarity between video frames
Most video coding standards split each video frame into macroblocks (MB), which are 16 × 16 pixels in size. For the YUV 4:2:0 format, each MB contains four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks, as shown in Figure 2.3. The two chrominance blocks contain information from the Cr and Cb planes respectively. Video codecs code each video frame, starting with the MB in the top left-hand corner. The codec then proceeds horizontally along each row, from left to right.
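To make the 4:2:0 sampling and macroblock structure concrete, the sketch below computes the plane sizes for a frame and slices one 16 × 16 macroblock into its four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks. It is illustrative only: the QCIF frame size, the function names, and the array layout are assumptions, not taken from the book.

```python
import numpy as np

def yuv420_plane_sizes(width, height):
    """Return (Y, Cb, Cr) plane sizes in pixels for a 4:2:0 frame."""
    chroma = (width // 2) * (height // 2)
    return width * height, chroma, chroma

def macroblock_blocks(y_plane, cb_plane, cr_plane, mb_x, mb_y):
    """Split the macroblock at column mb_x, row mb_y into four 8x8 luma
    blocks and two 8x8 chroma blocks (one from Cb, one from Cr)."""
    y0, x0 = mb_y * 16, mb_x * 16
    luma = [y_plane[y0 + r:y0 + r + 8, x0 + c:x0 + c + 8]
            for r in (0, 8) for c in (0, 8)]
    c0, d0 = mb_y * 8, mb_x * 8          # chroma planes are half resolution
    chroma = [cb_plane[c0:c0 + 8, d0:d0 + 8], cr_plane[c0:c0 + 8, d0:d0 + 8]]
    return luma, chroma

if __name__ == "__main__":
    w, h = 176, 144                       # QCIF, an assumed example size
    print(yuv420_plane_sizes(w, h))       # (25344, 6336, 6336)
    y = np.zeros((h, w), dtype=np.uint8)
    cb = np.zeros((h // 2, w // 2), dtype=np.uint8)
    cr = np.zeros((h // 2, w // 2), dtype=np.uint8)
    luma, chroma = macroblock_blocks(y, cb, cr, mb_x=0, mb_y=0)
    print(len(luma), len(chroma))         # 4 2
```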
MBs can be grouped. Groups of MBs are known by different names in different standards. For example:
- Group of Blocks (GOB): H.263 [1-3]
- Video packet: MPEG-4 Versions 1 and 2 [4-6]
- Slice: MPEG-2 [7] and H.264 [8-10]
The grouping is usually performed to make the video bitstream more robust to packet losses in communications channels. Section 2.4.8 includes a description of how video slices can be used in error-resilient video coding.
2.3.2 Removing Spatial Redundancy
Removal of spatial redundancy can be achieved by taking into account:
- The characteristics of the human vision system: human vision is more sensitive to low-frequency image data than high-frequency data. In addition, luminance information is more important than chrominance information.
- Common features of image/video signals: Figure 2.4 shows an image that has been high-pass and low-pass filtered. It is clear from the images that the low-pass-filtered version contains more energy and more useful information than the high-pass-filtered one.
Figure 2.3 Most video codecs break up a video frame into a number of smaller units for coding
These factors suggest that it is advantageous to consider image/video compression in the frequency domain. Therefore, a transform is needed to convert the original image/video signal into frequency coefficients. The DCT is the most widely used transform in lossy image and video compression. It permits the removal of spatial redundancy by compacting most of the energy of the block into a few coefficients.
Each 8 × 8 pixel block is put through the discrete cosine transform (DCT):

$$S(k_1,k_2) = \frac{C(k_1)\,C(k_2)}{4}\sum_{n_1=0}^{7}\sum_{n_2=0}^{7} s(n_1,n_2)\cos\!\left[\frac{(2n_1+1)k_1\pi}{16}\right]\cos\!\left[\frac{(2n_2+1)k_2\pi}{16}\right] \qquad (2.1)$$

where

$$C(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (2.2)$$
An example of the DCT in action is illustrated in Figure 2.5: an input block, s(n1, n2), is taken from an image (Figure 2.5(a)) and transformed using Equation (2.1) to give the coefficient block shown in Figure 2.5(b).
The coefficients in the transformed block represent the energy contained in the block at different frequencies. The lowest frequencies, starting with the DC coefficient, are contained in the top-left corner, while the highest frequencies are contained in the bottom-right, as shown in Figure 2.5. Note that many of the high-frequency coefficients are much smaller than the low-frequency coefficients. Most of the energy in the block is now contained in a few low-frequency coefficients. This is important, as the human eye is most sensitive to low-frequency data.
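As an illustration of this energy compaction, the following sketch applies the 8 × 8 DCT of Equation (2.1) directly to a synthetic block. It is a straightforward, unoptimized implementation (real codecs use fast separable 1D transforms), and the smooth test block is an assumption chosen so that the energy visibly concentrates in the low-frequency corner.

```python
import numpy as np

def dct2_8x8(block):
    """2D DCT of an 8x8 block, computed directly from the textbook formula."""
    n = np.arange(8)
    k = np.arange(8)
    # basis[n, k] = cos((2n+1) k pi / 16); C(k) = 1/sqrt(2) for k = 0, else 1.
    basis = np.cos(np.outer(2 * n + 1, k) * np.pi / 16)
    c = np.ones(8)
    c[0] = 1 / np.sqrt(2)
    return 0.25 * np.outer(c, c) * (basis.T @ block @ basis)

if __name__ == "__main__":
    # A smoothly varying block: most energy should land near the DC coefficient.
    x = np.arange(8)
    block = 128 + 8 * np.add.outer(x, x).astype(float)
    S = dct2_8x8(block)
    print(np.round(S[:2, :2], 1))   # large low-frequency terms
    print(np.round(S[6:, 6:], 1))   # near-zero high-frequency terms
```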
Figure 2.5 Transform-based compression: (a) original image block; (b) DCT-transformed block
It should be noted that in most video codec implementations the 2D DCT calculation is replaced by 1D DCT calculations, which are performed on each row and column of the block:

$$S(u) = a(u)\sum_{n=0}^{N-1} s(n)\cos\!\left[\frac{(2n+1)u\pi}{2N}\right], \qquad a(u) = \begin{cases}\sqrt{\dfrac{1}{N}} & \text{for } u = 0 \\[4pt] \sqrt{\dfrac{2}{N}} & \text{otherwise}\end{cases}$$

The transform coefficients are then quantized: each coefficient is divided by the corresponding entry of a quantization matrix and the result is rounded,

$$S_q(k_1,k_2) = \operatorname{round}\!\left(\frac{S(k_1,k_2)}{Q(k_1,k_2)}\right)$$

where Q_Y is the matrix for the luminance (Y plane) and Q_UV is the matrix for the chrominance (U and V planes). The matrix values are set using psycho-visual measurements. Different matrices are used for luminance and chrominance because of the differing perceptual importance of the planes.
The quantization matrices determine the output picture quality and output file size. Scaling the matrices by a value greater than 1 increases the coarseness of the quantization, reducing quality. However, such scaling also reduces the number of nonzero coefficients and the size of the nonzero coefficients, which reduces the number of bits needed to code the video.
Returning to the example above, the DCT-transformed block is divided by the luminance quantization matrix and the result is rounded (Equation 2.9).
The combination of the DCT and quantization has clearly reduced the number of nonzero coefficients.
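A sketch of this quantization step follows: the DCT coefficients are divided element-wise by a quantization matrix and rounded, and a scaling factor controls the quality/bit-rate tradeoff described above. The matrix used here is the familiar JPEG luminance table, inserted purely as an illustrative assumption (it is not the matrix used in the chapter's example); any psycho-visually designed table would behave similarly.

```python
import numpy as np

# Standard JPEG luminance quantization table, used here only as an example.
Q_Y = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=float)

def quantize(dct_block, q_matrix, scale=1.0):
    """Element-wise division by the (scaled) quantization matrix, then rounding.
    A larger scale gives coarser quantization and fewer nonzero coefficients."""
    return np.round(dct_block / (q_matrix * scale)).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic DCT block with energy concentrated at low frequencies.
    falloff = 200.0 / (1 + np.add.outer(np.arange(8), np.arange(8)))
    dct_block = rng.normal(0, 1, (8, 8)) * falloff
    for s in (1.0, 2.0, 4.0):
        q = quantize(dct_block, Q_Y, scale=s)
        print(f"scale={s}: {np.count_nonzero(q)} nonzero coefficients")
```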
The next stage in the encoding process is to zigzag scan the DCT matrix coefficients into a new 1D coefficient matrix, as shown in Figure 2.6. Using the above example, the result is a sequence of 28 coefficients terminated by an EOB symbol (Equation 2.10), where the EOB symbol indicates the end of the block (i.e. all following coefficients are zero). Note that the number of coefficients to be encoded has been reduced from 64 to 28 (29 including the EOB).

Figure 2.6 Zigzag scanning of DCT coefficients
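The zigzag order itself can be generated programmatically, as in the sketch below; the scan then truncates the trailing zeros and appends an EOB marker. This is an illustrative reimplementation of the scan shown in Figure 2.6, not code taken from any standard.

```python
import numpy as np

def zigzag_order(n=8):
    """Return the (row, col) visiting order for an n x n zigzag scan."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def zigzag_scan(block, eob="EOB"):
    """Scan a quantized block in zigzag order, dropping trailing zeros."""
    seq = [int(block[r, c]) for r, c in zigzag_order(block.shape[0])]
    last = max((i for i, v in enumerate(seq) if v != 0), default=-1)
    return seq[:last + 1] + [eob]

if __name__ == "__main__":
    block = np.zeros((8, 8), dtype=int)
    block[0, 0], block[0, 1], block[1, 0], block[2, 2] = 52, -3, 4, 1
    print(zigzag_scan(block))   # coefficients up to the last nonzero, then 'EOB'
```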
The data is further reorganized, with the DC component (the top-left in the DCT matrix) being treated differently from the AC coefficients.
DPCM (Differential Pulse Code Modulation) is used on DC coefficients in the H.263 standard [1]. This method of coding generally creates a prediction for the current block's value first, and then transmits the error between the predicted value and the actual value. Thus, the reconstructed intensity for the DC at the decoder, s(n1, n2), is:

$$s(n_1, n_2) = \hat{s}(n_1, n_2) + e(n_1, n_2) \qquad (2.11)$$

where $\hat{s}(n_1, n_2)$ and $e(n_1, n_2)$ are respectively the predicted intensity and the error.
For JPEG, the predicted DC coefficient is the DC coefficient in the previous block. Thus, if the previous DC coefficient was 15, the coded value for the example given above is the difference between the current block's DC coefficient and 15.
AC coefficients are coded using run-level coding, where each nonzero coefficient is coded using a value for the intensity and a value giving the number of zero coefficients preceding the coefficient. With the above example, the coefficients are represented as shown in Table 2.1.
Variable-length coding techniques are used to encode the DC and AC coefficients. The coding scheme is arranged such that the most common values have the shortest codewords.
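The following sketch combines the two ideas just described: the DC coefficient is coded differentially against the previous block's DC value, and each nonzero AC coefficient becomes a (run, level) pair, where run counts the zeros that precede it. The function names, the example values, and the tuple representation are illustrative assumptions; an actual codec would then map these symbols to variable-length codewords.

```python
def code_dc(dc, previous_dc):
    """DPCM: transmit only the difference from the previous block's DC value."""
    return dc - previous_dc

def run_level_encode(ac_coefficients):
    """Encode AC coefficients as (run, level) pairs plus an end-of-block marker."""
    symbols, run = [], 0
    for level in ac_coefficients:
        if level == 0:
            run += 1
        else:
            symbols.append((run, level))
            run = 0
    symbols.append("EOB")          # trailing zeros are implied by EOB
    return symbols

if __name__ == "__main__":
    print(code_dc(dc=52, previous_dc=15))        # 37 (assumed current DC of 52)
    ac = [-3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    print(run_level_encode(ac))                  # [(0, -3), (0, 4), (10, 1), 'EOB']
```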
2.3.3 Removing Temporal Redundancy
Image coding attempts to remove spatial redundancy. Video coding features an additional redundancy type: temporal redundancy. This occurs because of strong similarities between successive frames. It would be inefficient to transmit a series of JPEG images. Therefore, video coding aims to transmit the differences between two successive frames, thus achieving even higher compression ratios than for image coding.
The simplest method of sending the difference between two frames would be to take the difference in pixel intensities. However, this is inefficient when the changes are simply a matter of objects moving around a scene (e.g. a car moving along a road). Here it would be better to describe the translational motion of the object. This is what most video codec standards attempt to do.
A number of different frame types are used in video coding. The two most important types are:
- Intra frames (called I frames in MPEG standards): these frames use similar compression methods to JPEG, and do not attempt to remove any temporal redundancy.
- Inter frames (called P frames in MPEG): these frames use the previous frame as a reference.

Intra frames are usually much larger than inter frames, due to the presence of temporal redundancy in them. However, inter frames rely on previous frames being successfully received to ensure correct reconstruction of the current frame. If a frame is dropped somewhere in the network then all subsequent inter frames will be incorrectly decoded. Intra frames can be sent periodically to correct this. Descriptions of other types of frame are given in Section 2.4.1.

Motion compensation is the technique used to remove much of the temporal redundancy in video coding. It is preceded by motion estimation.
2.3.3.1 Motion Estimation
Motion estimation (ME) attempts to estimate translational motion within a video scene. The output is a series of motion vectors (MVs). The aim is to form a prediction for the current frame based on the previous frame and the MVs.
The most straightforward and accurate method of determining MVs is to use block matching. This involves comparing pixels in a certain search window with those in the current frame, as shown in Figure 2.7. Typically, the Mean Square Error is employed, such that the MV can be found from:

$$\mathrm{MV} = \arg\min_{(d_x,\,d_y)} \frac{1}{N^2}\sum_{(x,y)\in B}\bigl[s_t(x,y) - s_{t-1}(x+d_x,\,y+d_y)\bigr]^2$$

where B is the N × N block being predicted, s_t and s_{t-1} are the current and reference frames, and (d_x, d_y) ranges over the search window.
Although this technique identifies MVs with reasonable accuracy, the procedure requires many calculations for a whole frame. ME is often the most computationally intensive part of a codec implementation, and has prevented digital video encoders being incorporated into low-cost devices.
Researchers have examined a variety of methods for reducing the computational complexity of ME. However, they usually result in a tradeoff between complexity and accuracy of MV determination. Suboptimal MV selection means that the coding efficiency is reduced, and therefore leads to quality degradation, where a fixed bandwidth is specified.
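A minimal full-search block-matching routine is sketched below: for a 16 × 16 block of the current frame it evaluates the MSE of every candidate displacement within a ± search range in the reference frame and returns the displacement with the smallest error. The block size, search range, and test frames are assumptions for illustration; practical encoders use the fast, suboptimal search strategies mentioned above.

```python
import numpy as np

def full_search_mv(current, reference, top, left, block=16, search=7):
    """Exhaustive block matching: return (dy, dx, mse) minimizing the MSE."""
    h, w = reference.shape
    target = current[top:top + block, left:left + block].astype(float)
    best = (0, 0, float("inf"))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue                      # candidate falls outside the frame
            candidate = reference[y:y + block, x:x + block].astype(float)
            mse = np.mean((target - candidate) ** 2)
            if mse < best[2]:
                best = (dy, dx, mse)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ref = rng.integers(0, 256, (64, 64)).astype(np.uint8)
    cur = np.roll(ref, shift=(2, -3), axis=(0, 1))
    # The content moved down 2 and left 3, so the best reference offset is (-2, +3).
    print(full_search_mv(cur, ref, top=24, left=24))   # (-2, 3, 0.0)
```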
2.3.3.2 Intra/Inter Mode Decision
Not all MBs should be coded as inter MBs, with motion vectors. For example, new objects may be introduced into a scene. In this situation the difference is so large that an intra MB should be encoded. Within an inter frame, MBs are coded as inter or intra MBs, often depending on the MSE value. If the MSE passes a certain threshold, the MB is coded as intra, otherwise inter coding is performed. The MSE-based threshold algorithm is simple, but is suboptimal, and can only be used when a limited number of MB modes are available. More sophisticated MB mode-selection algorithms are discussed in Section 2.4.3.
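A toy version of this threshold rule is shown below, reusing the kind of MSE value produced by the block-matching search above. The threshold value is an arbitrary illustrative assumption; choosing it well is exactly the weakness that the rate-distortion-based mode selection of Section 2.4.3 addresses.

```python
def choose_mb_mode(best_match_mse, threshold=900.0):
    """Simple MSE-threshold mode decision: poor prediction -> code the MB as intra."""
    return "INTRA" if best_match_mse > threshold else "INTER"

if __name__ == "__main__":
    print(choose_mb_mode(35.0))     # INTER: motion compensation predicts well
    print(choose_mb_mode(4200.0))   # INTRA: e.g. a newly appearing object
```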
2.3.4 Basic Video Codec Structure
The video codec shown in Figure 2.8 demonstrates the basic operation of many video codecs. The major components are:

- Transform and Quantizer: perform operations similar to the transform and quantization process described in Section 2.3.2.
- Entropy Coder: takes the data for each frame and maps it to binary codewords. It outputs the final bitstream.
- Encoder Control: can change the MB mode and picture type. It can also vary the coarseness of the quantization and perform rate control. Its precise operation is not standardized.
- Feedback Loop: removes temporal redundancy by using ME and MC. A simplified sketch of this loop is given below.

Figure 2.8 Basic video encoder block diagram
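To show how the components in Figure 2.8 fit together, here is a highly simplified, frame-level sketch of the hybrid coding loop: predict from the reconstructed previous frame, quantize the residual, and rebuild the reference exactly as the decoder would. Everything here is a schematic assumption (whole-frame prediction with a zero motion vector, a uniform quantizer in place of the transform and quantization matrices, and no entropy coder), intended only to make the feedback structure concrete.

```python
import numpy as np

def encode_sequence(frames, qstep=8.0):
    """Schematic hybrid encoder loop: returns quantized residuals per frame."""
    reconstructed_ref = None
    bitstream = []
    for frame in frames:
        frame = frame.astype(float)
        # "Motion-compensated" prediction: here simply the previous reconstruction.
        prediction = np.zeros_like(frame) if reconstructed_ref is None else reconstructed_ref
        residual = frame - prediction                     # remove temporal redundancy
        quantized = np.round(residual / qstep)            # stands in for transform + quantizer
        bitstream.append(quantized.astype(int))           # entropy coding would follow here
        # Feedback loop: inverse quantize and rebuild the reference as the decoder will.
        reconstructed_ref = prediction + quantized * qstep
    return bitstream

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    first = rng.integers(0, 256, (16, 16)).astype(float)
    frames = [first, first + 2, first + 4]                # slowly changing scene
    coded = encode_sequence(frames)
    print([int(np.count_nonzero(r)) for r in coded])      # residuals shrink after frame 0
```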
2.4 Advanced Video Compression Techniques
Section 2.3 discussed some of the basic video coding techniques that are common to most of the available video coding standards. This section examines some more advanced video coding techniques, which provide improved compression efficiency, additional functionality, and robustness to communication channel errors. Particular attention is paid to the H.264 video coding standard [8, 9], which is one of the most recently standardized codecs. Subsequent codecs, such as scalable H.264 and Multi-view Video Coding (MVC) [11], use the H.264 codec as a starting point. Note that scalability is discussed in Chapter 3.
2.4.1 Frame Types
Most modern video coding standards are able to code at least three different frame types:
- I frames (intra frames): these do not include any motion-compensated prediction from other frames. They are therefore coded completely independently of other frames. As they do not remove temporal redundancy they are usually much larger in size than other frame types. However, they are required to allow random access functionality, to prevent drift between the encoder and decoder picture buffers, and to limit the propagation of errors caused by packet loss (see Section 2.4.8).
- P frames (inter frames): these include motion-compensated prediction, and therefore remove much of the temporal redundancy in the video signal. As shown in Figure 2.9, P frames
generally use a motion-compensated version of the previous frame to predict the current frame. Note that P frames can include intra-coded MBs.
- B frames: the 'B' is used to indicate that bi-directional prediction can be used, as shown in Figure 2.10. A motion-compensated prediction for the current frame is formed using information from a previous frame, a future frame, or both. B frames can provide better compression efficiency than P frames. However, because future frames are referenced during encoding and decoding, they inherently incur some delay. Figure 2.11 shows that the frames must be encoded and transmitted in an order that is different from playback. This means that they are not useful in low-delay applications such as videoconferencing. They also require additional memory usage, as more reference frames must be stored.
H.264 supports a wider range of frames and MB slices. In fact, H.264 supports five types of such slice, which include I-type, P-type, and B-type slices. I-type (Intra) slices are the simplest, in which all MBs are coded without referring to other pictures within the video sequence. If previously-coded images are used to predict the current MB it is called a P-type (predictive) slice, and if both previous- and future-coded images are used then it is called a B-type (bi-predictive) slice.
Other slices supported by H.264 are the SP-type (Switching P) and the SI-type (Switching I), which are specially-coded slices that enable efficient switching between video streams and random access for video decoders [12]. A video decoder may use them to switch between one of several available encoded streams. For example, the same video material may be encoded at multiple bit rates for transmission across the Internet. A receiving terminal will attempt to decode the highest-bit-rate stream it can receive, but it may need to switch automatically to a lower-bit-rate stream if the data throughput drops.

Figure 2.9 P frames use a motion-compensated version of previous frames to form a prediction of the current frame

Figure 2.10 B frames use bi-directional prediction to obtain predictions of the current frame from past and future frames
2.4.2 MC Accuracy
Providing more accurate MC can significantly reduce the magnitude of the prediction error, and therefore fewer bits need to be used to code the transform coefficients. More accuracy can be provided either by allowing finer motion vectors to be used, or by permitting more motion vectors to be used in an MB. The former allows the magnitude of the motion to be described more accurately, while the latter allows for complex motion or for situations where there are objects smaller than an MB.
H.264 in particular supports a wider range of spatial accuracy than any of the existing coding standards, as shown in Table 2.2. Amongst earlier standards, only the latest version of MPEG-4 Part 2 (version 2) [5] can provide quarter-pixel accuracy, while others provide only half-pixel accuracy. H.264 also supports quarter-pixel accuracy.
For achieving quarter-pixel accuracy, the luminance prediction values at half-sample positions are obtained by applying a 6-tap filter to the nearest integer-value samples [9]. The luminance prediction values at quarter-sample positions are then obtained by averaging samples at integer and half-sample positions.
An important point to note is that more accurate MC requires more bits to be used to specify motion vectors. However, more accurate MC should reduce the number of bits required to code the quantized transform coefficients. There is clearly a tradeoff between the number of bits added by the motion vectors and the number of bits saved by better MC. This tradeoff depends upon the source sequence characteristics and on the amount of quantization that is used. Methods of finding the best tradeoff are dealt with in Section 2.4.3.
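The half- and quarter-sample interpolation mentioned above can be illustrated in one dimension: H.264 derives half-pel luminance values with a 6-tap filter whose weights are (1, -5, 20, 20, -5, 1)/32, and quarter-pel values by averaging neighbouring integer and half-pel samples. The sketch below applies this to a 1D row of samples; the edge-clamped border handling and the example row are simplifying assumptions rather than the standard's exact padding rules.

```python
import numpy as np

TAPS = np.array([1, -5, 20, 20, -5, 1], dtype=float)   # H.264 6-tap luma filter

def half_pel(samples):
    """Half-sample values between samples[i] and samples[i+1] (1D, clamped edges)."""
    padded = np.pad(samples.astype(float), (2, 3), mode="edge")
    out = np.empty(len(samples) - 1)
    for i in range(len(out)):
        out[i] = np.dot(TAPS, padded[i:i + 6]) / 32.0
    return np.clip(out, 0, 255)

def quarter_pel(samples):
    """Quarter-sample values: average of the integer sample and the half-pel value."""
    h = half_pel(samples)
    return (samples[:-1].astype(float) + h) / 2.0

if __name__ == "__main__":
    row = np.array([10, 12, 40, 200, 210, 205, 50, 20], dtype=float)
    print(np.round(half_pel(row), 1))
    print(np.round(quarter_pel(row), 1))
```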
2.4.3 MB Mode Selection
Most of the widely-used video coding standards allow MBs to be coded with a variety of modes. For example:
- MPEG-2: INTRA, SKIP, INTER-16×16, INTER-16×8
- H.263/MPEG-4: INTRA, SKIP, INTER-16×16, INTER-8×8
- H.264/AVC: INTRA-4×4, INTRA-16×16, SKIP, INTER-16×16, INTER-16×8, INTER-8×16, INTER-8×8; the 8×8 INTER blocks may then be partitioned into 4×4, 8×4, 4×8
Selection of the best mode is an important part of optimizing the compression efficiency of an encoder implementation. Mode selection has been the subject of a significant amount of research. It is a problem that may be solved using optimization techniques such as Lagrangian Optimization and dynamic programming [13]. The approach currently taken in the H.264 reference software uses Lagrangian Optimization [14].
For mode selection, Lagrangian Optimization may be carried out by minimizing the following Lagrangian cost for each coding unit:
$$J_{\mathrm{MODE}}(M, Q, \lambda_{\mathrm{MODE}}) = D_{\mathrm{REC}}(M, Q) + \lambda_{\mathrm{MODE}}\,R_{\mathrm{REC}}(M, Q) \qquad (2.15)$$

where R_REC(M, Q) is the rate from compressing the current coding unit with mode, M, and with quantizer value, Q. D_REC(M, Q) is the distortion obtained from compressing the current coding unit using mode, M, and quantizer, Q. The distortion can be found by taking the sum of squared differences:

$$D_{\mathrm{REC}}(M, Q) = \sum_{(x,y)}\bigl|s(x,y) - s'(x,y,M,Q)\bigr|^2 \qquad (2.16)$$

where s is the original coding unit and s' is its reconstruction after coding with mode M and quantizer Q. The Lagrange multiplier is coupled to the quantizer; a commonly used choice is $\lambda_{\mathrm{MODE}} = 0.85\,Q_{\mathrm{H.263}}^{2}$, where Q_H.263 is the quantization parameter.
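A sketch of this rate-distortion mode decision follows: for each candidate mode the encoder measures (or estimates) the SSD distortion and the bits it would spend, forms J = D + λR, and keeps the mode with the smallest cost. The candidate list, the bit counts, and the λ value derived from the quantizer are illustrative assumptions, not measurements from a real encoder.

```python
import numpy as np

def lagrange_multiplier(q_h263):
    """Commonly used coupling between the quantizer and lambda (0.85 * Q^2)."""
    return 0.85 * q_h263 ** 2

def ssd(original, reconstruction):
    """Sum of squared differences between the original and its reconstruction."""
    return float(np.sum((original.astype(float) - reconstruction.astype(float)) ** 2))

def select_mode(original, candidates, lam):
    """candidates: {mode_name: (reconstruction, rate_in_bits)}; pick min J = D + lambda*R."""
    best_mode, best_cost = None, float("inf")
    for mode, (reconstruction, rate) in candidates.items():
        cost = ssd(original, reconstruction) + lam * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    mb = rng.integers(0, 256, (16, 16))
    candidates = {
        "SKIP":        (mb + rng.normal(0, 12, mb.shape), 1),    # cheap but distorted
        "INTER-16x16": (mb + rng.normal(0, 4, mb.shape), 120),
        "INTRA-16x16": (mb + rng.normal(0, 2, mb.shape), 600),
    }
    lam = lagrange_multiplier(q_h263=10)
    print(select_mode(mb, candidates, lam))      # INTER wins the rate-distortion tradeoff
```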
Table 2.2 Comparison of the ME accuracies provided by different video codecs