4.1.4 Normalization Schemes
Normalization of the saliency maps was necessary for a correct quantitative analysis across movie frames on a common scale. We used z-score normalization, in which we subtract the mean saliency value from every saliency value in a given map and divide by the standard deviation of that map. This resulted in some saliency values below zero. We then removed very large saliency values by selecting a threshold expressed as a multiple of the standard deviation (X). The intuition behind thresholding saliency maps was that any region X standard deviations away from the mean saliency is just as salient as a region with an even higher saliency value. A similar threshold was applied to the negative values in a given saliency map. The resulting map thus had values in the bounded interval [−X, X]. This was akin to the Normalized Scanpath Saliency (NSS) method (Peters et al., 2005).
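As an illustration, this z-score-and-clip normalization can be sketched in a few lines (a hypothetical helper, not code from the thesis; `x` is the threshold multiple of the standard deviation):

```python
import numpy as np

def normalize_saliency(smap, x=3.0):
    """Z-score a saliency map, then clip values to [-x, x] standard deviations."""
    z = (smap - smap.mean()) / smap.std()
    return np.clip(z, -x, x)
```

After this step every frame's map lives on the same scale, so saliency values sampled at fixations can be pooled across frames.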
To assess the performance of the model against chance, we selected control fixations to compare against the human fixations. The control fixations were selected in three different ways: random bias, subject bias, and centre bias. The random bias was characterized by selecting control fixations randomly sampled from a uniform distribution over the entire saliency map.

The subject bias was defined by selecting the control fixations from a fixation pool of other subjects on movies other than the movie to which the current saliency map belongs. The subject bias represents a stricter control compared to the random bias, since we account for human eye movement patterns in the selection of the control fixations.
The centre bias was a method of selecting control fixations randomly sampled from a uniform distribution over a restricted region in the centre of the saliency map (see Figure 4.18). This type of control is the strictest when comparing the model's performance to chance performance, because it accounts for the photographer's bias.
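The three control-fixation schemes could be sketched as follows (illustrative only; the function names, centre-window fraction, and pool format are assumptions, not the thesis implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_bias(h, w, n):
    """Sample n control fixations uniformly over the whole h-by-w frame."""
    return np.column_stack([rng.integers(0, h, n), rng.integers(0, w, n)])

def subject_bias(other_movie_fixations, n):
    """Draw n control fixations from a pool of other subjects' fixations
    recorded on movies other than the one under consideration."""
    idx = rng.integers(0, len(other_movie_fixations), n)
    return other_movie_fixations[idx]

def centre_bias(h, w, n, frac=0.5):
    """Sample uniformly from a central window covering `frac` of each axis."""
    y0, y1 = int(h * (1 - frac) / 2), int(h * (1 + frac) / 2)
    x0, x1 = int(w * (1 - frac) / 2), int(w * (1 + frac) / 2)
    return np.column_stack([rng.integers(y0, y1, n), rng.integers(x0, x1, n)])
```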
We used two scoring methods to assess the performance of the different saliency models in predicting human fixations. The first is the area under the receiver operator curve (ROC), otherwise known as AUC. This scoring method has often been reported in the literature for evaluating eye fixation prediction (Bruce & Tsotsos, 2009; Gao et al., 2008). In this method, we first compute true positives from the saliency map using the human fixation data. For the false positives, we sample the saliency map at random locations drawn uniformly over the entire image. This is followed by thresholding the true positive and false positive distributions to obtain ROC values over the entire spectrum from 0 to 1. The threshold is varied over the range between the minimum and maximum saliency values in the dataset. The ROC is then plotted as the false alarm rate (labeling a non-fixated location as fixated) against the hit rate (labeling a fixated location as fixated). The advantages of this metric include being non-parametric, taking into consideration the salience at both fixated and non-fixated locations, and having upper and lower bounds (0.5 for chance discrimination; 1.0 or 0 for perfect discrimination, depending on whether the actual/human or control values are higher). The area under the receiver operator curve (AUC) indicates how well the saliency map predicts human fixations. An AUC score of 0.5 shows that it is not possible to discriminate between the two distributions (human and random), a score of 1.0 indicates perfect discrimination, and a score below 0.5 suggests the model performs worse than chance.
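A minimal sketch of this AUC procedure (hypothetical names; the number of random control samples and thresholds are arbitrary choices, not values from the thesis):

```python
import numpy as np

def auc_score(sal_map, fixations, n_random=1000, n_thresh=100, rng=None):
    """AUC: hit rate vs. false-alarm rate over a sweep of saliency thresholds.
    `fixations` is an (N, 2) integer array of (row, col) fixated locations."""
    rng = rng or np.random.default_rng(0)
    # saliency at human fixations (positives) and at random control points (negatives)
    pos = sal_map[fixations[:, 0], fixations[:, 1]]
    neg = sal_map[rng.integers(0, sal_map.shape[0], n_random),
                  rng.integers(0, sal_map.shape[1], n_random)]
    thresholds = np.linspace(sal_map.min(), sal_map.max(), n_thresh)
    hit = [(pos >= t).mean() for t in thresholds]
    fa = [(neg >= t).mean() for t in thresholds]
    # integrate hit rate over false-alarm rate (trapezoid rule, ascending fa)
    fa_r, hit_r = fa[::-1], hit[::-1]
    area = 0.0
    for i in range(1, len(fa_r)):
        area += (fa_r[i] - fa_r[i - 1]) * (hit_r[i] + hit_r[i - 1]) / 2.0
    return area
```

With a perfectly discriminative map the curve hugs the top-left corner and the area approaches 1; with no discrimination it follows the diagonal and the area is about 0.5.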
Figure 4.18: Three different types of control biases shown on a grey-scale movie frame and the corresponding face-modulated saliency map. Actual fixations are the real fixations by different subjects for this movie frame. Random bias (shown in cyan) shows control fixations sampled from a uniform distribution over the entire image. Subject bias (shown in pink) shows control fixations sampled from the fixation pool of other subjects watching movies other than the one under consideration. Centre bias (shown in blue) shows control fixations sampled from a uniform distribution over a restricted region in the centre of the frame.
92
Figure 4.19: Receiver operator curve for the movie cats. The false alarm rate shows random locations classified as fixated, while the hit rate shows human-fixated locations classified as fixated. The dotted line indicates chance-level discrimination.
A second scoring scheme, often used by Itti and colleagues (Itti & Baldi, 2009), is the Kullback-Leibler (KL) divergence (Kullback, 1959) score. It measures the difference in shape between the histogram of the saliency sampled at the fixated locations and the histogram of the saliency sampled at the control locations.
KL(h|c) = \sum_{x} h(x) \log\left(\frac{h(x)}{c(x)}\right)    (4.5)
Here h is the probability distribution deduced from the human-fixated saliency values and c is the probability distribution obtained from the control values. The control locations are drawn from a uniform spatial distribution over the entire image (random bias), from the fixation pool of subjects watching other movies (subject bias), or from a uniform spatial distribution over a restricted region of the image (centre bias). As with AUC, if the saliency sampled at the fixated locations predicted by the model is significantly better than chance level, then the KL divergence score between the two histograms will be high, and vice versa. The KL divergence ranges from zero upward. Higher values indicate more dissimilarity in the shape of the two distributions, implying the model is a better predictor of the human fixation data. A value of zero indicates chance performance, meaning that the model is doing no better than the control.
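Equation 4.5 applied to two sets of sampled saliency values might look like this (an illustrative sketch; the bin count and the small epsilon added to avoid log-of-zero in empty bins are assumptions):

```python
import numpy as np

def kl_divergence(h_samples, c_samples, bins=20, eps=1e-10):
    """KL(h || c) between histograms of saliency sampled at human-fixated
    locations (h) and at control locations (c), over a shared bin range."""
    lo = min(h_samples.min(), c_samples.min())
    hi = max(h_samples.max(), c_samples.max())
    h, _ = np.histogram(h_samples, bins=bins, range=(lo, hi))
    c, _ = np.histogram(c_samples, bins=bins, range=(lo, hi))
    # normalize counts to probabilities; eps guards against empty bins
    h = h / h.sum() + eps
    c = c / c.sum() + eps
    return float(np.sum(h * np.log(h / c)))
```

Identically shaped histograms give a score near zero; the more the human-fixated distribution shifts toward high saliency relative to the control, the larger the score.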
Figure 4.20 demonstrates qualitative comparisons between the proposed model, the gist-dependent control conditions, and previously proposed models of visual attention. The previous computational models used for comparison are the Surprise model (Itti & Baldi, 2009) and the dynamic visual attention model (D.V.A.) (Hou & Zhang, 2008). Comparisons to the gist-dependent control conditions (described in section 4.1.3.6) are made to qualitatively assess correct (labeled Gist) and incorrect (labeled Average and Gist scrambled) modulations.
We show comparisons for six different movie frames. In the first two columns, we show a movie frame along with the proposed model's output at different stages. The first stage, labeled saliency map, is obtained using the motion intensity, spatial coherency, and temporal coherency maps. In the second stage we modulate the saliency map using face information. The third and final stage of the model's output is the modulation of the face-modulated saliency map using gist information.

In the third and fourth columns, we show the control-condition modulations for the gist case. The last column shows saliency maps produced by previously proposed models of visual attention in the literature.
To give an idea of how saliency values vary across these different maps at a sampled location, we also superimpose fixation data from one subject over the maps in green. The value sampled on each map is indicated on top of the respective map. As shown, the proposed model is good at capturing visually salient locations. Moreover, the validity of the correct gist modulation is confirmed by the low saliency values in the control conditions.
We quantify our results using KL divergence and AUC scores. The quantitative analysis is based on comparing fixated locations with control fixations for a given saliency map. The control fixations (random bias, subject bias, or centre bias) were sampled 100 times for a given fixated (actual) location. It is important to note that many studies sample control values from human-fixated locations on stimuli other than the one under consideration (i.e., subject bias). The rationale behind this strategy is that randomly sampling control distributions over the entire image overestimates the model's prediction power. Moreover, due to the central bias in human eye movements, a very simple model such as a Gaussian blob centred on the image may outperform many state-of-the-art complex models (Parkhurst & Niebur, 2003; Tatler et al., 2005). We report two scores for these comparisons: KL divergence and AUC. KL divergence measures the shape similarity between two arbitrary distributions. AUC scores are based on ROC curves, which overcome the subjectivity of threshold selection; this method also takes into account the variability of saliency at fixated and non-fixated locations (Tatler, Baddeley & Gilchrist, 2005). Both scores are frequently reported in the literature for such model comparisons.
Figure 4.21 illustrates the distribution of saliency values, sampled on the different maps, for 7846 human-fixated locations versus control locations. The saliency values were z-normalized per frame. The green bars represent the distribution from control sampling, while the blue bars represent the distribution from human fixation targets in a frame. The data shown are for the movie I, Robot (2004). The error
[Figure 4.20, first page: panels for Frames 441 and 741, each map labeled with the saliency value sampled at the fixation point.
Frame 441 — Saliency (0.53), Face+Saliency (0.53); Gist 2x2 (0.74), Gist 3x3 (0.72), Gist 4x4 (0.75); Average 2x2 (0.41), Average 3x3 (0.47), Average 4x4 (0.46); Gistswap 2x2 (0.48), Gistswap 3x3 (0.46), Gistswap 4x4 (0.44); Surprise (0), D.V.A. (0.5), SUNDay (0.54).
Frame 741 — Saliency (0.28), Face+Saliency (0.28); Gist 2x2 (0.87), Gist 3x3 (0.79), Gist 4x4 (0.91); Average 2x2 (0.18), Average 3x3 (0.22), Average 4x4 (0.21); Gistswap 2x2 (0.21), Gistswap 3x3 (0.18), Gistswap 4x4 (0.24); Surprise (0.57), D.V.A. (0.03), SUNDay (0.19).]
Figure 4.20: A qualitative comparison of the proposed saliency model with previous models of visual attention in the literature. We show comparisons for six different frames from our movie data set. In all the examples, we show a fixation point (green) from one subject superimposed on the different maps, together with the saliency value sampled at that location in the respective map. As shown, the saliency maps produced by the proposed model are much sparser compared to previous models.
[Figure 4.20, second page: panels for Frames 520 and 1020.
Frame 520 — Saliency (0.86), Face+Saliency (NaN); Gist 2x2 (0.83), Gist 3x3 (0.89), Gist 4x4 (0.83); Average 2x2 (0.64), Average 3x3 (0.81), Average 4x4 (0.71); Gistswap 2x2 (0.72), Gistswap 3x3 (0.57), Gistswap 4x4 (0.76); Surprise (0.42), D.V.A. (0.7), SUNDay (0.43).
Frame 1020 — Saliency (0.21), Face+Saliency (NaN); Gist 2x2 (0.84), Gist 3x3 (0.78), Gist 4x4 (0.72); Average 2x2 (0.27), Average 3x3 (0.28), Average 4x4 (0.27); Gistswap 2x2 (0.24), Gistswap 3x3 (0.2), Gistswap 4x4 (0.38); Surprise (0.03), D.V.A. (0.12), SUNDay (0.03).]
Figure 4.20 (continued)
[Figure 4.20, third page: panels for Frames 598 and 331.
Frame 598 — Saliency (0.8), Face+Saliency (NaN); Gist 2x2 (0.7), Gist 3x3 (0.86), Gist 4x4 (0.88); Average 2x2 (0.61), Average 3x3 (0.56), Average 4x4 (0.56); Gistswap 2x2 (0.55), Gistswap 3x3 (0.58), Gistswap 4x4 (0.55); Surprise (0.53), D.V.A. (0.45), SUNDay (0.61).
Frame 331 — Saliency (0), Face+Saliency (0.87); Gist 2x2 (0.89), Gist 3x3 (0.9), Gist 4x4 (0.89); Average 2x2 (0.22), Average 3x3 (0.3), Average 4x4 (0.25); Gistswap 2x2 (0.22), Gistswap 3x3 (0.14), Gistswap 4x4 (0.33); Surprise (0.5), D.V.A. (0.17), SUNDay (0.36).]
Figure 4.20 (continued)
bars were obtained by constructing 1000 surrogates of the human and control distributions, each sampled from its respective original distribution, using the bootstrap method (Efron & Tibshirani, 1994). For each condition we report the mean KL divergence and AUC scores with ±1 standard deviation over the 1000 surrogates.
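The bootstrap procedure behind these error bars can be sketched as follows (illustrative only; `score_fn` stands in for either scoring metric, and the names are hypothetical):

```python
import numpy as np

def bootstrap_scores(h_samples, c_samples, score_fn, n_surrogates=1000, rng=None):
    """Resample both distributions with replacement and recompute the score,
    yielding a mean and standard deviation over the surrogates."""
    rng = rng or np.random.default_rng(0)
    scores = []
    for _ in range(n_surrogates):
        hb = rng.choice(h_samples, size=len(h_samples), replace=True)
        cb = rng.choice(c_samples, size=len(c_samples), replace=True)
        scores.append(score_fn(hb, cb))
    scores = np.asarray(scores)
    return scores.mean(), scores.std()
```

The spread of the surrogate scores then gives the confidence intervals (e.g., 2.5th and 97.5th percentiles) reported alongside each mean.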
We found that the KL divergence and AUC scores were significantly above chance level (95% confidence intervals were well above chance) for all three control conditions and for all the different maps. By modulating face locations in our baseline spatio-temporal saliency map, we significantly improved the performance of the proposed model. Follow-up scene-category-dependent gist modulation further improved the results, as reflected by the histograms of saliency values at human-fixated locations and by the scoring metrics. We found that gist modulation consistently improved the model's performance across the movies (see Figure 4.22). On the x-axis we plot AUC scores obtained by face modulation of the baseline saliency map, and on the y-axis AUC scores obtained by gist modulation of the face-modulated saliency maps. The diagonal marks equal performance of the two modulations: a movie point below the diagonal would indicate that gist modulation degraded performance relative to face modulation, while a point above the diagonal indicates that gist modulation improved performance over face modulation. As illustrated, the majority of movie points were well above the diagonal (t-test, p < 0.01). However, for some of the movies, especially those with frequent faces, we observed only marginal improvements with gist modulation, as shown by the 2.5th and 97.5th percentile error bars. One explanation for this result is that, with face modulation, the AUC scores were already saturating toward the theoretical limit of 1, so additional gist modulation did not make a stark difference. However, in movies with less frequent faces (such as Galapagos), we saw a significant improvement in prediction, as reflected in AUC scores well above the diagonal.