The question of whether a given image or video is palatable to the human observer is an important one. The term 'quality' is used to describe the palatability of an image or a video sequence, and researchers have developed algorithms which aim to provide a measure of this quality. Automatic methods for image quality assessment (IQA) have made giant leaps over the past few years [1]. These successes suggest that this field is close to attaining saturation [2]. More complex than IQA algorithms are video quality assessment (VQA) algorithms, whose goals are similar to those for IQA but require processing of dynamically changing images. In this chapter, we focus on VQA algorithms for digital video sequences.

Digital videos comprise a set of frames (still images) played at a particular speed (frame-rate). Each frame has the same resolution and is made up of picture elements, or pixels. These pixels have a fixed bit-depth, i.e., the number of bits used to represent the value of a pixel is fixed for a video. This definition is valid for progressive videos. Interlaced videos, on the other hand, consist of a pair of 'fields', each containing alternating portions of the equivalent frame. When played out at an appropriate rate, the observer views the video as a continuous stream.

When one defines a digital video sequence as above, one is bound to question the necessity for separate VQA algorithms: can one not apply an IQA algorithm on a frame-by-frame basis (or on one of the fields) and then average out the scores to provide a quality rating? Indeed, many VQA algorithms are derived from IQA algorithms, and some of them do just that; however, the most
important difference between a still image and a video is the presence of perceived motion, suggesting that the modeling of such motion is key to the development of better VQA algorithms. As we shall see, such motion modeling should account for human perception of motion. This is validated by the improved performance of VQA algorithms that incorporate some motion modeling.
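To make the naive approach concrete, here is a minimal sketch of frame-by-frame VQA in Python. The callable `iqa_metric` is a placeholder for any still-image metric; the names and the plain averaging are illustrative, not taken from any specific published algorithm.

    # Naive VQA: apply an IQA metric to each frame pair and average.
    import numpy as np

    def naive_vqa(ref_frames, dist_frames, iqa_metric):
        """Score a video by averaging per-frame IQA scores.

        ref_frames, dist_frames: iterables of 2-D luminance arrays.
        iqa_metric: callable(ref, dist) -> float quality score.
        """
        scores = [iqa_metric(r, d) for r, d in zip(ref_frames, dist_frames)]
        return float(np.mean(scores))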
The performance of any VQA algorithm is evaluated in terms of its correlation with human perception, and we will have more to say about this towards the end of this chapter. For the applications we target, the ultimate receiver of a video is the human, and hence when one talks about 'performance', one necessarily means correlation with human perception. This leads to the question: how does one know what the human perceives? The general procedure is to ask a representative sample of the human populace to rate the quality of a given video on some rating scale. The mean score achieved by a video is then said to be representative of the human perception of quality. The International Telecommunications Union (ITU) has provided a set of recommendations on how such quality assessment by humans is to be conducted [3]. Such VQA is generally referred to as subjective quality assessment and, as one can imagine, is time-consuming and cumbersome; hence the need for automatic VQA algorithms. Algorithmic assessment of quality is called objective quality assessment. Note that the procedure used to form a quality score from a subjective study implies that perfect correlation with human perception is almost impossible, owing to inter-subject variation.
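As a small illustration of how a subjective score is formed (the ratings below are invented for the example), the mean opinion score is simply the sample mean, and inter-subject variation shows up as a confidence interval around it:

    # MOS from raw subjective ratings for one video (hypothetical data).
    import numpy as np

    ratings = np.array([72, 65, 80, 58, 70, 75, 62])  # seven hypothetical subjects
    mos = ratings.mean()
    ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
    print(f"MOS = {mos:.1f} +/- {ci95:.1f} (95% confidence interval)")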
We classify VQA algorithms as full-reference (FR), reduced-reference (RR) and no-reference (NR). FR VQA algorithms assume that a pristine reference video is available, and the quality of the video under consideration is evaluated with respect to this pristine reference; note that, by this definition, we are evaluating the relative quality of a given video. RR VQA algorithms operate under the assumption that even though the pristine video is unavailable for direct comparison, some additional information about the pristine sequence is available; this may include, for example, partial coefficient information or knowledge about the compression or distortion process [4]-[7]. NR metrics are those that have absolutely no knowledge of the processes involved in the creation of the given video: simply put, the algorithm is presented with a video and is asked to rate its quality. Such algorithms are few, even for image quality assessment [8], and NR VQA algorithms are rare [9]. Our definitions of NR and RR VQA algorithms are not universal, though; in some cases, NR algorithms assume a distortion model. The reader will observe that NR VQA algorithms have the potential to be the most useful kind of VQA algorithm, and may question the need for FR VQA algorithms. However, as we shall see through this chapter, our understanding of the process by which humans rate the quality of a video sequence is limited; indeed, we do not yet have a complete understanding of motion processing in the brain [10, 11]. Given this lack of information, truly blind NR VQA algorithms are still years away. Finally, RR VQA algorithms are a compromise between these two extremes, and are a stepping stone towards an NR VQA algorithm; see [5] and [13] for examples of RR VQA and IQA algorithms. Since most work has been done in the FR domain, and procedures and standards for the evaluation of their performance exist, in this chapter we discuss only FR VQA algorithms.
Fig. 1 Schematic model of the human visual system: visual stimulus from the eyes, through the optic nerve to the LGN, to the primary visual cortex, and on to higher-level visual processing.
Let us briefly look at how videos are processed by the human visual system (HVS), in order to better understand some key concepts of the algorithms discussed here. Note that even though there have been significant strides in understanding motion processing in the visual cortex, a complete understanding is still a long way off. What we mention here are some properties which have been confirmed by psycho-visual research; the reader is referred to [10] for a more detailed explanation of these ideas.

Figure 1 shows a schematic model of the HVS. The visual stimulus, in the form of light from the environment, passes through the optics of the eye and is imaged on the retina. Due to inherent imperfections in the eye, the image formed is blurred, which can be modeled by a point spread function (PSF) [11]. Most of the information encoded in the retina is transmitted via the optic nerve to the lateral geniculate nucleus (LGN). The neurons in the LGN then relay this information to the primary visual cortex area (V1). From V1, this information is passed on to a variety of visual areas, including the middle-temporal (MT) or V5 region. V1 neurons have receptive fields¹ which demonstrate a substantial degree of selectivity to size (spatial frequency), orientation and direction of motion of retinal stimulation. It is hypothesized that the MT/V5 region plays a significant role in motion processing [12]. Area MT/V5 also plays a role in the guidance of some eye movements, segmentation and 3-D structure computation [14], which are properties of human vision that play an important role in the visual perception of videos. Unfortunately, as we move from the optics towards V1 and MT/V5, the amount of information we have about the functioning of these regions decreases; the functioning of area MT is an area of active research [15].
¹ The receptive field of a neuron is its response to visual stimuli, which may depend on spatial frequency, movement, disparity or other properties. As used here, the receptive field response may be viewed as synonymous with the signal processing term impulse response.
In this chapter we first describe some HVS-based approaches which try to model the visual processing stream described above, since these approaches were originally used to predict visual quality. We then describe recently proposed structural and information-theoretic approaches, as well as commonly used feature-based approaches. Further, we describe recent motion-modeling based approaches, and detail performance evaluation and validation techniques for VQA algorithms. Finally, we touch upon some possible future directions for research on VQA and conclude the chapter.
HVS-Based Approaches
Much of the initial development in VQA centered on explicit modeling of the HVS. The visual pathway is represented by a computational model of the HVS, and the original and distorted videos are passed through this model. The visual quality is then defined as an error measure between the outputs produced by the model for the original and distorted videos. Many HVS-based VQA models are derived from their IQA counterparts. Some of the popular HVS-based models for IQA include the Visible Differences Predictor (VDP) developed by Daly [16], the Sarnoff JND vision model [17], the Safranek-Johnston Perceptual Image Coder (PIC) [18] and Watson's DCTune [19]. The interested reader is directed to [20] for a detailed description of these models.
A block diagram of a generic HVS-based VQA system is shown in Figure 2. The only difference between this VQA system and an HVS-based IQA system is the presence of a 'temporal filter'. This temporal filter is generally used to model the two kinds of temporal mechanisms present in early stages of processing in the visual cortex; lowpass and bandpass filters have typically been used for this purpose. The Moving Pictures Quality Metric (MPQM), an early approach to VQA, utilized a Gabor filterbank in the spatial frequency domain, along with one lowpass and one bandpass temporal filter [21]. The Perceptual Distortion Metric [22] was a modification of MPQM and used two infinite impulse response (IIR) filters to model the lowpass and bandpass mechanisms; further, the Gabor filterbank was replaced by a steerable pyramid decomposition [23]. Watson proposed the Digital Video Quality (DVQ) metric in [24], which used the Discrete Cosine Transform (DCT) and utilized a simple IIR filter implementation to represent the temporal mechanism. A scalable wavelet-based video distortion metric was proposed in [25]. In this section we describe DVQ and the scalable wavelet-based distortion metric in some detail.
Fig. 2 Block diagram of a generic HVS-based VQA system: the reference and test videos pass through pre-processing, temporal filtering, a linear transform, masking adjustment, and error normalization & pooling, producing a spatial quality map or score.
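As a rough sketch of the 'temporal filter' stage in Figure 2, one may filter a frame stack along the time axis with one lowpass and one bandpass filter. The filter orders and cutoff frequencies below are illustrative choices, not taken from any of the cited models.

    # Lowpass (sustained) and bandpass (transient) temporal channels.
    import numpy as np
    from scipy import signal

    def temporal_channels(video, fps=25.0):
        """video: array of shape (T, H, W). Returns (lowpass, bandpass) stacks."""
        nyq = fps / 2.0
        b_lo, a_lo = signal.butter(2, 5.0 / nyq, btype="low")               # sustained channel
        b_bp, a_bp = signal.butter(2, [4.0 / nyq, 9.0 / nyq], btype="band")  # transient channel
        lowpass = signal.lfilter(b_lo, a_lo, video, axis=0)
        bandpass = signal.lfilter(b_bp, a_bp, video, axis=0)
        return lowpass, bandpass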
Digital Video Quality Metric
The Digital Video Quality (DVQ) metric computes the visibility of artifacts expressed in the DCT domain. In order to evaluate human visual thresholds on dynamic DCT noise, a small study with three subjects was carried out for different DCT (spatial) and temporal frequencies. The data obtained led to a separable model in which the threshold is a product of a temporal, a spatial and an orientation function.
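As a hedged illustration of what such a separable model looks like (the exact component functions and parameter values are given in [24]; the form below only mirrors the structure described above), the threshold for a DCT coefficient at spatial frequency $f_s$, temporal frequency $f_t$ and orientation $\theta$ can be written as

$$T(f_s, f_t, \theta) \;=\; T_0 \cdot g_t(f_t)\, g_s(f_s)\, g_\theta(\theta)$$

where $T_0$ is a global threshold scale and $g_t$, $g_s$, $g_\theta$ are the temporal, spatial and orientation functions; DCT-domain differences between the reference and distorted videos are then compared against these thresholds to determine artifact visibility.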
Scalable Wavelet-Based Distortion Metric
The distortion metric proposed in [25] can be used as an FR or RR metric depending upon the application. Further, it differs from other HVS-based metrics in that its parametrization is performed using human responses to natural videos rather than to sinusoidal gratings.

The metric uses only the Y channel from the YUV color space for processing; we note that this is true of many of the metrics described in this chapter, and color and its effect on quality is another interesting area of research [27]. The reference and distorted video sequences are temporally filtered using a finite impulse response (FIR) lowpass filter. Then, a spatial frequency decomposition using an integer implementation of a Haar wavelet transform is performed and a subset of coefficients is selected for distortion measurement. Further, a contrast computation and weighting by a contrast sensitivity function (CSF) is performed, followed by a masking computation. Finally, following a summation of the differences in the decompositions for the reference and distorted videos, a quality score is computed. A detailed explanation of the algorithm and parameter selection, along with certain applications, may be found in [25].
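A hedged sketch of this processing chain is given below, assuming a lowpass FIR temporal filter and an ordinary (rather than integer) Haar decomposition via PyWavelets; the CSF weighting and masking stages of [25] are omitted, so this is only the skeleton of the metric.

    # Skeleton of the wavelet-based distortion metric: temporal FIR lowpass
    # filtering followed by a Haar decomposition and coefficient differencing.
    import numpy as np
    import pywt
    from scipy import signal

    def wavelet_distortion(ref, dist, taps=7):
        """ref, dist: arrays of shape (T, H, W). Returns a crude distortion score."""
        fir = signal.firwin(taps, cutoff=0.4)            # illustrative lowpass FIR
        ref_t = signal.lfilter(fir, [1.0], ref, axis=0)
        dist_t = signal.lfilter(fir, [1.0], dist, axis=0)
        total = 0.0
        for r, d in zip(ref_t, dist_t):                  # per filtered frame
            rc = pywt.wavedec2(r, "haar", level=2)
            dc = pywt.wavedec2(d, "haar", level=2)
            # Difference of a subset of subband coefficients
            # (CSF weighting and masking are omitted in this sketch).
            for rb, db in zip(rc[1], dc[1]):
                total += np.abs(rb - db).mean()
        return total / len(ref_t)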
In this section we explained only two of the many HVS models. Several HVS-based models have been implemented in commercial products; the reader is directed to [28] for a short description.
Structural and Information-Theoretic Approaches
In this section we describe two recent VQA paradigms that are an alternative to HVS-based approaches: the structural similarity index and video visual information fidelity. These approaches take into account certain properties of the HVS when approaching the VQA problem. Performance evaluation has shown that these algorithms correlate well with human perception, which, coupled with their simplicity of implementation, makes them attractive.
Structural Similarity Index
The Structural SIMilarity index (SSIM) was originally proposed as an IQA algorithm in [29]; in fact, SSIM builds upon the concepts of the Universal Quality Index (UQI) proposed previously [30]. The SSIM index proposed in [29] is a single-scale index, i.e., the index is evaluated only at the image resolution (we shall refer to it as SS-SSIM). In order to better evaluate quality over multiple resolutions, the multi-scale SSIM (MS-SSIM) index was proposed in [31]. SS-SSIM and MS-SSIM are space-domain indices; a related index was developed in the complex wavelet domain in [32] (see also [33]).
Given two image patches x and y drawn from the same location in the reference and distorted images respectively, SS-SSIM evaluates three terms, luminance $l(x,y)$, contrast $c(x,y)$ and structure $s(x,y)$, as:

$$l(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x,y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$

and combines them as $\mathrm{SSIM}(x,y) = l(x,y)\, c(x,y)\, s(x,y)$, where $\mu_x$ and $\mu_y$ are the means of x and y; $\sigma_x^2$, $\sigma_y^2$ are the variances of x and y; $\sigma_{xy}$ is the covariance between x and y; and $C_1$, $C_2$, and $C_3 = C_2/2$ are constants.
SS-SSIM computation is performed using a window-based approach, where the means, standard deviations and cross-correlation are computed within an 11 × 11 Gaussian window. Thus SS-SSIM provides a matrix of values, approximately the size of the image, representing local quality at each location. The final score for SSIM is typically computed as the mean of the local scores, yielding a single quality score for the test image; however, other pooling strategies have been proposed [34], [35]. Note that SSIM is symmetric, attaining the upper limit of 1 if and only if the two images being compared are exactly the same. Hence, a value of 1 corresponds to perfect quality, and any value less than one corresponds to distortion in the test image. MS-SSIM evaluates structure and contrast over multiple scales, then combines them along with luminance, which is evaluated at the finest scale [31]. Henceforth, the acronym SSIM applies to both SS-SSIM and MS-SSIM, unless it is necessary to differentiate between them.
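A compact SS-SSIM sketch follows; for brevity a uniform local window replaces the 11 × 11 Gaussian window of [29], and mean pooling produces the final score.

    # SS-SSIM sketch with local statistics from a uniform window.
    import numpy as np
    from scipy.ndimage import uniform_filter

    def ssim_map(x, y, L=255, k1=0.01, k2=0.03, win=11):
        x = x.astype(np.float64); y = y.astype(np.float64)
        C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
        mu_x = uniform_filter(x, win); mu_y = uniform_filter(y, win)
        var_x = uniform_filter(x * x, win) - mu_x ** 2
        var_y = uniform_filter(y * y, win) - mu_y ** 2
        cov_xy = uniform_filter(x * y, win) - mu_x * mu_y
        # Simplified SSIM formula with C3 = C2/2 folded in.
        return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
            (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))

    def ss_ssim(x, y):
        return float(ssim_map(x, y).mean())   # mean pooling of the local quality map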
For VQA, SSIM may be applied on a frame-by-frame basis, with the final quality score computed as the mean value across frames. Again, this pooling takes into account neither the unequal distribution of fixations across the video nor the fact that motion is an integral part of VQA. Hence, in [36], an alternative pooling based on a weighted sum of local SSIM scores was proposed, where the weights depend upon the average luminance of the patch and on the global motion. The hypotheses were: 1) regions of lower luminance do not attract many fixations, and hence these regions should be weighted with a lower value; and 2) high global motion reduces the perceivability of distortions, and hence SSIM scores from such frames should be assigned lower weights. A block-based motion estimation procedure was used to compute global motion. It was shown that SS-SSIM with this pooling performs extremely well on the VQEG dataset (see the section on performance evaluation).
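A simple sketch of this kind of weighted pooling is shown below; the weight shapes (a luminance floor and a motion attenuation scale) are illustrative stand-ins, not the functions published in [36].

    # Weighted pooling of per-frame SSIM scores by luminance and global motion.
    import numpy as np

    def weighted_video_ssim(ssim_frames, mean_lum, global_motion,
                            lum_floor=40.0, mot_scale=8.0):
        w_lum = np.clip(np.asarray(mean_lum) / lum_floor, 0.0, 1.0)   # dark frames downweighted
        w_mot = 1.0 / (1.0 + np.asarray(global_motion) / mot_scale)   # fast motion downweighted
        w = w_lum * w_mot
        return float(np.sum(w * np.asarray(ssim_frames)) / np.sum(w))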
Video Visual Information Fidelity
Natural scene statistics (NSS) have been an active area of research in the recent past; see [37], [38] for comprehensive reviews. Natural scenes are a small subset of the space of all possible visual stimuli, and NSS deals with a statistical characterization of such scenes. Video visual information fidelity (video VIF), proposed in [39], is based on the hypothesis that when natural scenes are passed through a processing system, the system causes a change in the statistical properties of these scenes, rendering them un-natural; it evolved from the VIF index used for IQA [40] (see also [41]). If one could measure this 'un-naturalness', one would be able to predict the quality of the image or video. It has been hypothesized that visual stimuli from the natural environment drove the evolution of the HVS, and hence modeling NSS and modeling the HVS may be viewed as dual problems [40]. As mentioned in the introduction, even though great strides have been made in understanding the HVS, a comprehensive model is lacking, and NSS may offer an opportunity to fill this gap. NSS have previously been used successfully for image compression [42], texture analysis and synthesis [43], image denoising [44] and so on.
Fig. 3 The model of the HVS for video VIF: the channel introduces distortions in the video sequence, which along with the reference signal is received by cognitive processes in the brain.

An extension of VIF to video, video VIF models the original video as a stochastic source which passes through the HVS, and the distorted video as having additionally passed through a channel which introduces the distortion (blur, blocking, etc.) before passing through the HVS (see Figure 3). Derivatives of the video are computed and modeled locally using the Gaussian scale mixture (GSM) model [39].
The output of each spatio-temporal derivative (channel) of the original signal is expressed as a product of two random fields (RFs) [45]: an RF of positive scalars and a zero-mean Gaussian vector RF. The channels of the distorted signal are modeled as

$$D = GC + V$$

where C is the RF from a channel in the original signal, G is a deterministic scalar field and V is a stationary additive zero-mean Gaussian RF with a diagonal covariance matrix. This distortion model expresses noise by the noise RF V and blur by the scalar-attenuation field G. The uncertainties in the HVS are represented using a visual noise term, modeled as a zero-mean multivariate Gaussian RF (N and N'), whose covariance matrix is diagonal. Then define:

$$E = C + N, \qquad F = D + N'$$

VIF then computes the mutual information between C and E and between C and F, both conditioned on the underlying scalar field S. Finally, VIF is expressed as a ratio of the two mutual informations summed over all the channels:

$$\mathrm{VIF} = \frac{\sum_{j \in \text{channels}} I(C_j; F_j \mid s_j)}{\sum_{j \in \text{channels}} I(C_j; E_j \mid s_j)}$$

where $C_j$, $F_j$, $E_j$, $s_j$ denote coefficients from one channel.
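For intuition, here is a scalar sketch of the ratio above for a single channel, using the closed-form mutual information of the scalar GSM/additive-Gaussian model; the parameter values are illustrative.

    # Scalar VIF sketch: per-channel information terms and their ratio.
    import numpy as np

    def vif_channel(s, sigma_u2, g, sigma_v2, sigma_n2):
        """s: local scalar field (array); returns (distorted, reference) info terms."""
        info_dist = 0.5 * np.log2(1.0 + (g ** 2) * (s ** 2) * sigma_u2
                                  / (sigma_v2 + sigma_n2))   # I(C; F | s)
        info_ref = 0.5 * np.log2(1.0 + (s ** 2) * sigma_u2 / sigma_n2)  # I(C; E | s)
        return info_dist.sum(), info_ref.sum()

    s = np.abs(np.random.randn(1000))   # stand-in scalar field
    num, den = vif_channel(s, sigma_u2=1.0, g=0.8, sigma_v2=0.1, sigma_n2=0.4)
    print("VIF (one channel):", num / den)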
Feature Based Approaches
Feature-based approaches extract features and statistics from the reference and distorted sequences and compare these features to predict visual quality. This definition applies equally to SSIM and VIF described earlier; however, as we shall see, feature-based approaches utilize multiple features, and are generally not based on any particular premise such as structural retention or NSS.

Swisscom/KPN research developed the Perceptual Video Quality Metric (PVQM) [47], which measures three parameters: an edginess indicator, a temporal indicator and a chrominance indicator. Edginess is compared using local gradients of the luminance of the reference and distorted videos. The temporal indicator uses the normalized cross-correlation between adjacent frames of the reference video. The chrominance indicator accounts for the perceived difference in color information between the reference and distorted videos. These scores are then mapped onto a video quality score. Perceptual Evaluation of Video Quality (PEVQ) from Opticom was based on the model used in PVQM [48]-[50]. A recent performance evaluation contest was conducted by the ITU-T for standardization of VQA algorithms [51], and the ITU-T approved and standardized four full reference VQA algorithms including PEVQ [52]. Another algorithm that uses a feature-based approach to VQA is the Video Quality Metric [53].
Video Quality Metric
Proposed by the National Telecommunications and Information Administration (NTIA) and standardized by the American National Standards Institute (ANSI), the Video Quality Metric (VQM) [53] was the top performer in the Video Quality Experts Group (VQEG) Phase-II study [54]. The International Telecommunications Union (ITU) has included VQM as a normative measure for digital cable television systems [55].
VQM applies a series of filtering operations over a spatio-temporal block which spans a certain number of rows, columns and frames of the video sequence to extract seven parameters:

1. a parameter which detects the loss of spatial information, which is essentially an edge detector, applied on the luminance;
2. a parameter which detects the shift of edges from horizontal and vertical orientation to diagonal orientation, applied on the luminance;
3. a parameter which detects the shift of diagonal edges to horizontal and vertical orientation, applied on the luminance;
4. a parameter which computes the changes in the spread of the chrominance components;
5. a quality improvement parameter, which accounts for any improvements arising from sharpening operations;
6. a parameter which is the product of a simple motion detection (absolute difference between frames) and contrast; and finally,
7. a parameter to detect severe color impairments.
Each of the above-mentioned parameters is thresholded in order to specifically account only for those distortions which are perceptible, and the thresholded parameters are then pooled using different techniques. The general model for VQM then computes a weighted sum of these parameters to arrive at a final quality index, as sketched below. For VQM, a score of 1 indicates poor quality, while 0 indicates perfect quality. A MATLAB implementation of VQM has been made available for research purposes online [56].
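A schematic sketch of the combination step follows; the weights and thresholds are placeholders rather than the NTIA general-model values.

    # Threshold each parameter, then combine via a weighted sum.
    import numpy as np

    def vqm_combine(params, weights, thresholds):
        """params, weights, thresholds: length-7 sequences; returns score in [0, 1]."""
        p = np.maximum(np.asarray(params) - np.asarray(thresholds), 0.0)  # keep perceptible part
        score = float(np.dot(np.asarray(weights), p))
        return min(max(score, 0.0), 1.0)   # 0 = perfect, 1 = poor (VQM convention)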
Motion Modeling Based Approaches
Distortions in a video can be either spatial (blocking artifacts, ringing distortions, mosaic patterns, false contouring and so on) or temporal (ghosting, motion blocking, motion compensation mismatches, the mosquito effect, jerkiness, smearing and so on) [57]. The VQA algorithms discussed so far mainly try to account for loss in quality due to spatial distortion, but fail to model temporal quality-loss accurately. For example, the only temporal component of PVQM is a correlation computation between adjacent frames, and VQM uses absolute pixel-by-pixel differences between adjacent frames of a video sequence.

The human eye is very sensitive to motion and can accurately judge the velocity and direction of motion of objects in a scene. The ability to detect motion is essential for survival and for the performance of tasks such as navigation and the detection and avoidance of danger. It is hence no surprise that spatio-temporal aspects of human vision are affected by motion.

As we discussed earlier, initial processing of visual data in the human brain takes place in the V1 region. Neurons in this front-end (comprising the retina, LGN and V1) are tuned to specific orientations and spatial frequencies and are well modeled by separable, linear, spatial and temporal filters. Many HVS-based VQA algorithms use such filters to model this area of visual processing. However, the visual data from area V1 is transported to area MT/V5, which integrates local motion information from V1 into global percepts of motion of complex patterns [58]. Even though the responses of neurons in area MT have been studied and some models of motion sensing have been proposed, none of the existing HVS-based systems incorporate these models in VQA. Further, a large number of neurons in area MT are known to be directionally selective, and hence movement information in a video sequence may be captured by a linear spatio-temporal decomposition.
Recently, a temporal pooling strategy based on motion information was proposed for SSIM [59]. We call this algorithm speed-weighted SSIM and explain some of its features in this section. Note that the original SSIM for VQA [36] used some temporal weighting based on motion information as well.
Speed-Weighted SSIM
Speed-weighted SSIM (SW-SSIM) [59] considers three kinds of motion fields: 1) absolute motion, which is the absolute pixel motion between two adjacent frames; 2) background/global motion, which is caused by movement of the image acquisition system; and 3) relative motion, which is the difference between the absolute and global motion.
It is hypothesized that the HVS is an efficient extractor of information [38]. Visual perception is modeled as an information communication process, where the HVS is the error-prone communication channel, since the HVS does not perceive all information with the same degree of certainty. A psychophysical study conducted by Stocker and Simoncelli on human visual speed perception suggested that the internal noise of human speed perception is proportional to the true stimulus speed [60]. It was found that for a given stimulus speed, a log-normal distribution provides a good description of the likelihood function (internal noise), which determines the perceptual uncertainty.
SW-SSIM proceeds as follows. First, an SS-SSIM map is constructed at each pixel location using SSIM as defined before. Then a motion vector field is computed using Black and Anandan's multi-scale optical flow estimation algorithm [61], yielding absolute pixel motion. Next, a histogram of the motion vectors in each frame is computed and the vector associated with the peak value is identified as the global motion vector for that frame; relative motion computation follows. The weight applied at every pixel is then a function of the relative velocity, the global velocity and the stimulus contrast. The weight is designed such that the importance of a visual event increases with information content and decreases with perceptual uncertainty. Finally, each pixel location is weighted, and the scores so obtained for each frame are pooled within and across frames to give a quality index for the video. Note that in this brief explanation we have skipped over some practical implementation issues; the interested reader is directed to [60] for a thorough description of the algorithm. SW-SSIM was shown to perform well on the VQEG dataset.
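The following is a simplified, indicative sketch of the SW-SSIM weighting: the global motion is taken as the histogram peak of the flow magnitudes, relative motion is the deviation from it, and the weight grows with information content while perceptual uncertainty grows with global speed. The exact log-normal formulation is given in [59], [60]; the functional forms below are assumptions made for illustration.

    # Indicative SW-SSIM-style weights from a per-pixel flow field.
    import numpy as np

    def sw_weights(flow, contrast, tau=0.5):
        """flow: (H, W, 2) per-pixel motion; contrast: (H, W) local contrast."""
        speed = np.linalg.norm(flow, axis=2)
        hist, edges = np.histogram(speed, bins=32)
        v_global = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
        v_rel = np.abs(speed - v_global)
        information = np.log2(1.0 + v_rel / (v_global + tau))  # more relative motion -> more important
        uncertainty = np.log2(1.0 + v_global)                  # faster global motion -> less certain
        return contrast * np.maximum(information - tau * uncertainty, 0.0)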
Even though SW-SSIM takes motion information into account, this information is used only to weight spatially obtained SSIM scores. We believe that the computation of the temporal quality of a video sequence is as important as, if not more important than, the spatial quality computation. Recently, a new VQA algorithm that explicitly accounts for temporal quality artifacts, motion based video integrity evaluation, was proposed [62], [63].
Motion Based Video Integrity Evaluation
Motion based video integrity evaluation (MOVIE) evaluates the quality of video sequences not only in space and in time, but also in space-time, by evaluating motion quality along motion trajectories.
First, both the reference and the distorted video sequences are spatio-temporally filtered using a family of bandpass Gabor filters. Gabor filters have been used for motion estimation in video [64], [65] and in models of human visual motion sensing [66]-[68]. It has also been shown that Gabor filters can be used to model the receptive fields of neurons in the visual cortex [69]. Additionally, Gabor filters attain the theoretical lower bound on uncertainty in the frequency and spatial variables. MOVIE uses three scales of Gabor filters, with a Gaussian filter included at the center of the Gabor structure to capture low frequencies in the signal.

A local quality computation on the bandpass-filtered outputs of the reference and test videos is then undertaken by considering a set of coefficients within a window from each of the Gabor sub-bands. The computation involves the use of a mutual masking function [70]. Mutual masking is used to model the contrast masking property of the HVS, which refers to a reduction in the visibility of a signal component due to the presence of another spatial component of the same frequency and orientation in a local neighborhood. This masking model is closely related to the MS-SSIM and information-theoretic models for IQA [71]. The quality index so obtained is termed the spatial MOVIE index, even though it captures some temporal distortions.
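For concreteness, here is a minimal sketch of a spatio-temporal Gabor kernel of the kind used in such filter banks; the scales and orientations of the actual MOVIE bank are specified in [74], so the parameters here are illustrative.

    # Complex spatio-temporal Gabor kernel: Gaussian envelope times a
    # complex sinusoid tuned to a center frequency f0 = (fx, fy, ft).
    import numpy as np

    def gabor3d(size, sigma, f0):
        ax = np.arange(size) - size // 2
        x, y, t = np.meshgrid(ax, ax, ax, indexing="ij")
        envelope = np.exp(-(x**2 + y**2 + t**2) / (2.0 * sigma**2))
        carrier = np.exp(2j * np.pi * (f0[0] * x + f0[1] * y + f0[2] * t))
        return envelope * carrier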
MOVIE uses the same filter bank to compute motion information, i.e., to estimate optical flow from the reference video. The algorithm used is a multi-scale extension of the Fleet and Jepson algorithm [64], which uses the phase of the complex Gabor outputs for motion estimation.
Translational motion has an easily accessible interpretation in the frequency domain: translation shears the spatial frequencies of the video signal along the temporal frequency dimension without affecting their magnitude, and the spectrum of a translating patch lies entirely within a plane in the frequency domain [72] (a patch moving with velocity $(v_x, v_y)$ has its spectrum on the plane $v_x\omega_x + v_y\omega_y + \omega_t = 0$). The optical flow computation provides an estimate of the local orientation of this spectral plane at each pixel. Thus, if the motion of the distorted video matches that of the reference video exactly, then the filters that lie along the motion-plane orientation defined by the flow from the reference will be activated by the distorted video, and the outputs of filters that lie far away from this plane will be negligible. In the presence of a temporal artifact, however, the motions in the reference and distorted videos do not match, and a different set of filters may be activated. Thus, motion vectors from the reference are used to construct velocity-tuned responses. This can be accomplished by a weighted sum of the Gabor responses, where positive excitatory weights are assigned to those filters that lie close to the spectral plane and negative inhibitory weights are assigned to those that lie farther away from it. This excitatory-inhibitory weighting results in a strong response when the distorted video has motion equal to that of the reference, and a weak response when there is a deviation from the reference motion. Finally, the mean squared error is computed between the response vectors from the reference video (tuned to its own motion) and those from the distorted video. The temporal MOVIE index just described essentially captures temporal quality.
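A schematic sketch of the excitatory-inhibitory weighting follows: each Gabor filter's center frequency is scored by its distance from the spectral plane implied by the reference flow, with nearby filters weighted positively and distant filters negatively. The actual weight function used by MOVIE differs in detail; see [74].

    # Velocity-tuned weights from filter-center distances to the spectral plane.
    import numpy as np

    def motion_tuned_weights(centers, vx, vy):
        """centers: (K, 3) array of (wx, wy, wt) Gabor center frequencies."""
        wx, wy, wt = centers[:, 0], centers[:, 1], centers[:, 2]
        dist = np.abs(vx * wx + vy * wy + wt) / np.sqrt(vx**2 + vy**2 + 1.0)
        w = np.max(dist) / 2.0 - dist   # excitatory near the plane, inhibitory far away
        return w - w.mean()             # zero-sum, so off-plane energy is penalized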
Application of MOVIE to a video produces a map of spatial and temporal scores at each pixel location for each frame of the video sequence. In order to pool these scores into a single quality index for the video sequence, MOVIE uses the coefficient of variation [73]. Although many alternative pooling strategies have been proposed [16], [17], [35], [36], [53], the coefficient of variation serves to capture the distribution of the distortions accurately [74]. The coefficient of variation is computed for the spatial and temporal MOVIE scores of each frame, and the values are then averaged across frames to create the spatial and temporal MOVIE indices for the video sequence (the temporal MOVIE index uses the square root of the average). The final MOVIE score is the product of the temporal and spatial MOVIE indices. A detailed description of the algorithm can be found in [74].
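A sketch of this pooling, following the description above, is given below; `eps` is an implementation convenience that guards against division by zero.

    # Coefficient-of-variation pooling of per-frame MOVIE maps.
    import numpy as np

    def movie_pool(spatial_maps, temporal_maps, eps=1e-8):
        cov = lambda m: m.std() / (m.mean() + eps)
        spatial_idx = float(np.mean([cov(m) for m in spatial_maps]))
        temporal_idx = float(np.sqrt(np.mean([cov(m) for m in temporal_maps])))
        return spatial_idx * temporal_idx   # final MOVIE score is the product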
Performance Evaluation & Validation
Practical deployment of the various VQA algorithms discussed previously requires a mutually agreed upon testing strategy for the evaluation of performance. It was in order to create such a test-bed for VQA algorithms that the VQEG FR-TV phase-I study [51] was conducted. A total of 320 distorted video sequences were used to test the performance of 10 leading VQA algorithms, along with PSNR. The study found that all of the tested algorithms were statistically indistinguishable from PSNR [51]!
The test procedure employed by the VQEG was as follows. All of the algorithms were run on the entire database, and performance was then gauged using three criteria: prediction monotonicity, prediction accuracy and prediction consistency. Monotonicity was measured by computing the Spearman rank ordered correlation coefficient (SROCC); accuracy was computed using the linear (Pearson's) correlation coefficient (LCC) and the root mean square error (RMSE). While the SROCC can be computed directly on the scores obtained from the algorithm and from subjective testing, the LCC and RMSE require a non-linear transformation of the objective scores before their computation. This is because the objective scores may be non-linearly related to the subjective scores: although an algorithm may predict quality accurately, in the absence of such a non-linear mapping the LCC and RMSE would not be truly representative of algorithm performance. Finally, consistency was measured by computing the outlier ratio (OR).
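The sketch below computes the three criteria, fitting a four-parameter logistic (a common choice; the exact VQEG fitting function may differ) before LCC and RMSE, and counting outliers as predictions that fall more than twice the inter-subject standard deviation from MOS.

    # SROCC, LCC, RMSE and outlier ratio for objective scores vs. MOS.
    import numpy as np
    from scipy import stats, optimize

    def logistic(x, b1, b2, b3, b4):
        return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / np.abs(b4)))

    def evaluate(obj, mos, mos_std):
        srocc = stats.spearmanr(obj, mos)[0]                  # monotonicity
        p0 = [mos.max(), mos.min(), np.median(obj), 1.0]
        popt, _ = optimize.curve_fit(logistic, obj, mos, p0=p0, maxfev=20000)
        fitted = logistic(obj, *popt)                         # non-linear mapping
        lcc = stats.pearsonr(fitted, mos)[0]                  # accuracy
        rmse = float(np.sqrt(np.mean((fitted - mos) ** 2)))   # accuracy
        outlier_ratio = float(np.mean(np.abs(fitted - mos) > 2 * mos_std))  # consistency
        return srocc, lcc, rmse, outlier_ratio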
The standard procedure for conducting a subjective study to obtain the mean opinion scores (MOS) representative of human perception of quality is described in [3]. A similar study assessing the quality of images was conducted soon after [75], in which leading IQA algorithms were evaluated using a procedure similar to that followed by the VQEG. The VQEG dataset and the LIVE image dataset are publicly available at [51] and [76].
Table 1 Performance of VQA algorithms on the VQEG phase-I dataset

    VQA Algorithm                  SROCC    LCC
    Proponent P8 (Swisscom) [47]   0.803    0.827
    Frame-SS-SSIM [36]             0.812    0.849
    MOVIE [62]                     0.833    0.821
In order to compare the results of various VQA algorithms, Table 1 details the performance of PVQM [47], which was the top performer on the VQEG dataset, along with Frame-SS-SSIM and MOVIE. We also include peak signal-to-noise ratio (PSNR) as the baseline for performance evaluation, since it has been argued that PSNR does not correlate well with human perception of quality [77]. Note that many of the algorithms from the VQEG study have since been altered to enhance performance; indeed, VQM, an earlier version of which was a proponent in the VQEG study, was trained on the VQEG phase-I dataset in order to obtain the parameters of the algorithm. We also note that the VQEG phase-I dataset is the only publicly available dataset for VQA testing.
Although the VQEG dataset has been used in the recent past for the performance evaluation of various VQA algorithms, it suffers from severe drawbacks. The dataset contains some non-natural video sequences (e.g., scrolling text on screen) which are not considered 'fair game' for VQA algorithms that are based on human perception of natural scenes and are not geared towards quality assessment of artificially created environments or text. For example, as demonstrated in [74], MOVIE performs significantly better when such sequences are not considered in the analysis. Further, the dataset is dated: the report was published in 2000, and the study was made specifically for TV and hence contains interlaced videos. The presence of interlaced videos complicates the prediction of quality, since the de-interlacing algorithm can introduce further distortion before the computation of algorithm scores. Further, the VQEG study included distortions only from older-generation encoders such as H.263 [78] and MPEG-2 [79], which exhibit different distortions compared with present-generation encoders like H.264 AVC/MPEG-4 Part 10 [80]. Finally, and most importantly, the VQEG phase-I database of distorted videos suffers from poor perceptual separation: both humans and algorithms have difficulty producing consistent judgments that distinguish many of the videos, lowering the correlations between humans and algorithms and the statistical confidence of the results. We also note that even though the VQEG has conducted other studies [54], oddly, none of that data has been made public.

In order to overcome these limitations, the LIVE video quality and the LIVE wireless video quality databases were created. These two databases alleviate the problems associated with the VQEG dataset and will provide a suitable testing ground for future VQA algorithms. Information regarding these databases may not be ready before this chapter is published, but will soon be provided at [76].
Conclusions & Future Directions
In this chapter we began by motivating the need for VQA algorithms and gave a brief summary of various VQA algorithms. We detailed performance evaluation techniques and validation methods for a number of leading VQA algorithms. Future research may involve a further understanding of human motion processing and its incorporation into VQA algorithms. Temporal pooling is another issue that needs to be considered. Gaze attention and region-of-interest processing remain interesting areas of research, especially in the case of video quality assessment. In this chapter we have detailed only FR VQA algorithms; however, research in the area of RR VQA algorithms is of key interest, considering its practical advantages. The Holy Grail, of course, is truly NR VQA algorithms. Further, the statistical techniques used for measuring the performance of algorithms have been questioned [35], [75]; it is of interest to evaluate possible alternatives for studying correlation with human perception.
References
1. Z. Wang and A. C. Bovik, Modern Image Quality Assessment. New York: Morgan and Claypool Publishing Co., 2006.
2. A. K. Moorthy and A. C. Bovik, "Perceptually significant spatial pooling techniques for image quality assessment," in SPIE Conference on Human Vision and Electronic Imaging, Jan. 2009.
3. "Methodology for the subjective assessment of the quality of television pictures," ITU-R Recommendation BT.500-11.
4. B. Hiremath, Q. Li, and Z. Wang, "Quality-aware video," IEEE International Conference on Image Processing, San Antonio, TX, Sept. 16-19, 2007.
5. H. R. Sheikh, A. C. Bovik, and L. Cormack, "No-reference quality assessment using natural scene statistics: JPEG2000," IEEE Transactions on Image Processing, vol. 14, no. 11, pp. 1918-1927, 2005.
6. C. M. Liu, J. Y. Lin, K. G. Wu, and C. N. Wang, "Objective image quality measure for block-based DCT coding," IEEE Trans. Consum. Electron., vol. 43, pp. 511-516, 1997.
7. Z. Wang, A. C. Bovik, and B. L. Evans, "Blind measurement of blocking artifacts in images," in IEEE Intl. Conf. Image Proc., 2000.
8. X. Li, "Blind image quality assessment," IEEE International Conference on Image Processing, New York, 2002.
9. P. Le Callet, C. Viard-Gaudin, S. Péchard, and E. Caillault, "No reference and reduced reference video quality metrics for end to end QoS monitoring," Special Issue on Multimedia QoS Evaluation and Management Technologies, E89, no. 2, pp. 289-296, February 2006.
10. W. S. Geisler and M. S. Banks, "Visual performance," in Handbook of Optics, M. Bass, Ed. McGraw-Hill, 1995.
11. B. A. Wandell, Foundations of Vision. Sunderland, MA: Sinauer Associates Inc., 1995.
12. N. C. Rust, V. Mante, E. P. Simoncelli, and J. A. Movshon, "How MT cells analyze the motion of visual patterns," Nature Neuroscience, vol. 9, no. 11, pp. 1421-1431, Nov. 2006.
13. Z. Wang, G. Wu, H. R. Sheikh, E. P. Simoncelli, E.-H. Yang, and A. C. Bovik, "Quality-aware images," IEEE Transactions on Image Processing, vol. 15, no. 6, pp. 1680-1689, June 2006.
14. R. T. Born and D. C. Bradley, "Structure and function of visual area MT," Annual Rev. Neuroscience, vol. 28, pp. 157-189, 2005.