Chapter 4
Computational Modelling
We used a Spatio-Temporal Coherency model, which uses motion intensity, spatial coherency, temporal coherency, face detection and gist modulation to compute saliency maps (see Figure 4.1).

The critical part of this model is motion saliency, defined as the attention due to motion; Abrams and Christ (2003) have shown that the onset of motion captures visual attention. Motion saliency for a frame is computed over multiple preceding frames. According to our investigations, overt orienting of the fovea to any motion-induced salient location in a given frame, n, is influenced by the saliency at that location in the preceding frames (n−1, n−2, ...). This is known as saccade latency in the literature, with latencies typically in the order of 200-250 milliseconds (Becker and Jürgens, 1979; Hayhoe et al., 2003). We investigated the influence of up to ten preceding frames (approximately 500 milliseconds); however, beyond the fifth frame (i.e., n−5) we did not see any significant contribution to overt orienting. This means that the currently fixated location in frame n was indeed selected based on the saliency at that location in up to five preceding frames. The on-screen time for 5 frames was about 210 milliseconds (the video frame rate in our experiments was 24 frames per second), which is well within the bounds of the reported saccade latencies.
We computed motion vectors using the Adaptive Rood Pattern Search (ARPS) algorithm for fast block-matching motion estimation (Nie and Ma, 2002). The motion vectors are computed by dividing the frame into a matrix of blocks. The fast ARPS algorithm leverages the fact that general motion is usually coherent: if the blocks surrounding a given block moved in a particular direction, there is a high probability that the current block will have a similar motion vector. Thus, the algorithm estimates the motion vector of a given block using the motion vector of the macro block to its immediate left.
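A minimal sketch of this idea is shown below. It is our own simplification of the ARPS search, assuming grayscale frames as NumPy arrays; the function names and the reduced candidate set are illustrative, not taken from Nie and Ma (2002).

```python
import numpy as np

def arps_block(cur, ref, by, bx, bs, pred_mv, p=7):
    """Estimate the motion vector of the block at (by, bx) in `cur` relative to `ref`.

    pred_mv is the motion vector of the macro block to the immediate left, which
    ARPS uses as its prediction; p is the maximum search range in pixels.
    """
    h, w = ref.shape
    block = cur[by:by + bs, bx:bx + bs].astype(np.int32)

    def cost(dy, dx):
        # Sum of absolute differences against the candidate reference block.
        y, x = by + dy, bx + dx
        if abs(dy) > p or abs(dx) > p or y < 0 or x < 0 or y + bs > h or x + bs > w:
            return np.inf
        return np.abs(block - ref[y:y + bs, x:x + bs].astype(np.int32)).sum()

    # Large rood pattern: arm length taken from the predicted motion vector.
    arm = max(1, int(round(max(abs(pred_mv[0]), abs(pred_mv[1])))))
    candidates = [(0, 0), (arm, 0), (-arm, 0), (0, arm), (0, -arm), tuple(pred_mv)]
    best = min(candidates, key=lambda c: cost(*c))

    # Refinement with a unit-size rood pattern until the centre is the best point.
    while True:
        dy, dx = best
        neighbours = [(dy, dx), (dy + 1, dx), (dy - 1, dx), (dy, dx + 1), (dy, dx - 1)]
        nxt = min(neighbours, key=lambda c: cost(*c))
        if nxt == best:
            return best
        best = nxt
```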
The Motion Intensity map, I (Figure 4.2), which is a measure of motion-induced activity, is computed from the motion vectors (dx, dy), normalized by the maximum magnitude in the motion vector field:
\[
I(i, j) = \sqrt{dx_{i,j}^2 + dy_{i,j}^2}
\]
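For concreteness, a minimal sketch of this computation, assuming dx and dy are NumPy arrays holding the motion vector field:

```python
import numpy as np

def motion_intensity(dx, dy):
    """Motion intensity map: motion vector magnitudes normalised by the
    maximum magnitude in the motion vector field."""
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else magnitude
```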
Spatial coherency (Cs) and temporal coherency (Ct) maps were obtained by applying an entropy filter to the frames. The Cs maps captured regularity at a spatial scale of 9×9 pixels within a frame, while the Ct maps captured regularity at the same spatial scale (i.e., 9×9 pixels) but over a temporal scale of 5 frames. Spatial coherency (see Figure 4.3) measured the consistency of pixels around the point of interest, otherwise known as the correlation.
Figure 4.1: Spatio-Temporal saliency model architecture diagram.
Figure 4.2: Computation of motion intensity on adjacent frames. Three examples from different movies are shown.
The higher the correlation, the more probable it is that the pixels belong to the same object. This is computed as the entropy over the block of pixels: higher entropy indicates more randomness in the block structure, lower correlation among pixels, and hence lower spatial coherency. Cs is computed using the following equation:
\[
C_s(x, y) = -\sum_{i=1}^{n} p_s(i) \log_2 p_s(i)
\]

where p_s(i) is the probability of occurrence of the pixel intensity i and n corresponds to the 9×9 neighbourhood.
Similarly, to compute consistency in pixel correlation over time, we used
\[
C_t(x, y) = -\sum_{i=1}^{n} p_t(i) \log_2 p_t(i)
\]

where p_t(i) is the probability of occurrence of the pixel intensity i at the corresponding location in the preceding five frames (m = 5).
Figure 4.3: Examples of spatial coherency maps computed on five different movie frames.
Higher entropy implies greater motion and thus higher saliency at that location. The temporal coherency map (see Figure 4.4), in general, signifies the motion energy in each fixated frame contributed by the five preceding frames (except for boundary frames, where motion vectors are invalid due to scene or camera transitions).
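A minimal sketch of both entropy filters is given below. It is a direct, unoptimised implementation; the helper names are ours, 8-bit grayscale frames as NumPy arrays are assumed, and pooling the 9×9 neighbourhood across the five frames is our interpretation of the temporal filter.

```python
import numpy as np

def _entropy(window, bins=256):
    """Shannon entropy of the intensity distribution inside one neighbourhood."""
    hist, _ = np.histogram(window, bins=bins, range=(0, 256))
    p = hist[hist > 0] / window.size
    return -(p * np.log2(p)).sum()

def spatial_coherency(frame, k=9):
    """Cs map: entropy over the k x k spatial neighbourhood of each pixel."""
    h, w = frame.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode='reflect')
    cs = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            cs[y, x] = _entropy(padded[y:y + k, x:x + k])
    return cs

def temporal_coherency(frames, k=9):
    """Ct map: entropy over the k x k neighbourhood pooled across the current
    frame and its preceding frames (a list of m = 5 frames in our setting)."""
    stack = np.stack(frames).astype(float)                 # shape (m, H, W)
    m, h, w = stack.shape
    pad = k // 2
    padded = np.pad(stack, ((0, 0), (pad, pad), (pad, pad)), mode='reflect')
    ct = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            ct[y, x] = _entropy(padded[:, y:y + k, x:x + k])
    return ct
```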
Once all three feature maps are computed, we apply centre-surround suppression to these maps to highlight regions having higher spatial contrast. This is akin to simulating the behaviour of ganglion cells in the retina (Hubel and Wiesel, 1962). To achieve this, we first compute a dyadic Gaussian pyramid (Burt and Adelson, 1983) for each map by repeatedly low-pass filtering and subsampling the map (see Figure 4.5). For low-pass filtering, we used a 6 × 6 separable Gaussian kernel (Walther and Koch, 2006) defined as K = [1 5 10 10 5 1]/32 (see Walther, 2006, Appendix A.1 for more details).
We start with level 1 (L1), which is the actual size of the map. The image for each successive level is obtained by first low-pass filtering the image, which results in a blurry image with suppressed higher spatial frequencies. The resulting image is then subsampled to half of its current size to obtain the level 2 (L2) image. The process continues until the map cannot be subsampled any further (L9 in Figure 4.5).
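A minimal sketch of the pyramid construction is shown below, assuming feature maps as 2-D NumPy arrays. The 6-tap kernel follows the definition above, while the stopping rule and the function name are our own choices for illustration.

```python
import numpy as np
from scipy.ndimage import convolve

# 6-tap separable low-pass kernel, K = [1 5 10 10 5 1]/32 (Walther and Koch, 2006)
K = np.array([1, 5, 10, 10, 5, 1], dtype=float) / 32.0

def gaussian_pyramid(feature_map, n_levels=9):
    """Dyadic Gaussian pyramid: level 1 is the map itself; every further level is
    obtained by low-pass filtering the previous level and keeping every second
    row and column (i.e. subsampling to half the size)."""
    levels = [np.asarray(feature_map, dtype=float)]
    for _ in range(1, n_levels):
        prev = levels[-1]
        if min(prev.shape) < 2:                      # cannot be subsampled further
            break
        blurred = convolve(prev, K[np.newaxis, :], mode='nearest')    # filter rows
        blurred = convolve(blurred, K[:, np.newaxis], mode='nearest')  # filter columns
        levels.append(blurred[::2, ::2])
    return levels
```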
Figure 4.4: Examples of temporal coherency maps computed over the previous five frames, shown for three different movie examples.
Figure 4.5: Example of a temporal coherency map at nine different levels of the Gaussian pyramid. Starting at level 1 (L1 in the figure), which has the same size as the original map, each successive level is obtained by low-pass filtering and then subsampling the map to half of its size at the current level.
To simulate the behaviour of centre-surround receptive fields, we take the difference between different levels of the pyramid for a given feature map, as previously described in Itti et al. (1998). We select different levels of the pyramid to represent the centre and the surround; taking their differences results in six intermediate maps, as shown in Figure 4.6. To obtain point-wise differences across scales, the images are interpolated to a common size. All six centre-surround maps are then added across scales to get a single map per feature, as shown in Figure 4.7. All three feature maps are then combined linearly to produce a standard saliency map.
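A minimal sketch of the centre-surround step and the linear combination is shown below. The choice of level 4 as the common output size, the use of bilinear interpolation, and the equal weighting in the combination are our assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

# Centre-surround level pairs (centre, surround), as in Itti et al. (1998)
CS_PAIRS = [(2, 5), (2, 6), (3, 6), (3, 7), (4, 7), (4, 8)]

def _resize(img, shape):
    """Interpolate an image to a common size for point-wise differences."""
    return zoom(img, (shape[0] / img.shape[0], shape[1] / img.shape[1]), order=1)

def centre_surround(pyramid, out_level=4):
    """Sum of across-scale point-wise differences, giving one map per feature.

    `pyramid` is the list returned by the pyramid construction above, so
    level L lives at index L - 1.
    """
    target = pyramid[out_level - 1].shape
    total = np.zeros(target)
    for c, s in CS_PAIRS:
        centre = _resize(pyramid[c - 1], target)
        surround = _resize(pyramid[s - 1], target)
        total += np.abs(centre - surround)        # point-wise difference across scales
    return total

def standard_saliency(intensity_cs, spatial_cs, temporal_cs):
    """Linear combination of the three feature maps into a standard saliency map."""
    return (intensity_cs + spatial_cs + temporal_cs) / 3.0
```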
Figure 4.6: Taking point-wise differences across scales (2-5, 2-6, 3-6, 3-7, 4-7, 4-8) results in six intermediate maps for a given feature map.
Figure 4.7: Final feature maps (motion intensity, spatial coherency, and temporal coherency) obtained after adding across-scale centre-surround differences. The top panel shows the feature maps before centre-surround suppression is applied; the bottom row shows the final feature maps after applying centre-surround suppression via across-scale point-wise differences followed by summation. The example, shown for one movie frame, clearly demonstrates the effectiveness of centre-surround suppression in producing sparser feature maps.
Since higher entropy in the temporal coherency map indicates greater motion over a particular region, the intensity maps are directly multiplied with the temporal coherency maps. This highlights the contribution of the motion-salient regions in the saliency maps. On the contrary, higher entropy in the spatial coherency map indicates randomness in the block structure, suggesting that the region does not belong to any single entity or object. Since we are interested in motion saliency induced by spatially coherent objects, we assign higher values to the pixels belonging to regions with lower spatial entropy (i.e., spatially coherent regions).
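The exact weighting scheme is not fully spelled out above, but one possible reading of this paragraph is sketched below. The normalisation step and the inversion of the spatial coherency map are our assumptions, shown purely for illustration.

```python
import numpy as np

def _normalise(m):
    """Scale a map to the [0, 1] range."""
    m = m - m.min()
    peak = m.max()
    return m / peak if peak > 0 else m

def modulated_saliency(intensity, spatial_coh, temporal_coh):
    """One possible reading of the modulation described in the text: motion
    intensity is multiplied by temporal coherency (more temporal entropy means
    more motion), while spatial coherency enters inverted, so that spatially
    coherent (low-entropy) regions are weighted more strongly."""
    motion_term = _normalise(intensity) * _normalise(temporal_coh)
    coherent_weight = 1.0 - _normalise(spatial_coh)
    return _normalise(motion_term * coherent_weight)
```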
Figure 4.8 shows the resulting saliency maps for a randomly chosen frame from each of the movies in our database.

Figure 4.8: Saliency maps shown for a randomly selected frame from every movie in the database. Columns 1 and 3 show movie frames, while columns 2 and 4 show the saliency maps for the corresponding frames. Higher saliency values are indicated by warmer colours, as illustrated by the colour map on the right.
4.1.2 Face Modulation
We modulate the standard saliency map with high-level semantic knowledge, such as faces, using a state-of-the-art face detector (Viola and Jones, 2004). This accounts for the fact that overt attention is frequently deployed to faces (Cerf et al., 2009), and it can be argued that faces are a part of the bottom-up information, as there are cortical areas specialized for faces, in particular the fusiform gyrus (Kanwisher et al., 1997).
The Viola and Jones (2004) face detector is based on training a cascaded classifier, using a learning technique called AdaBoost, on a set of very simple visual features. These visual features have Haar-like properties, as they are computed by subtracting the sum of one sub-region from the sum of the remaining region. Figure 4.9 shows examples of Haar-like rectangle features: panels A and B show two-rectangle features (horizontal/vertical), while panels C and D show three-rectangle and four-rectangle features respectively. The value of a feature is computed by subtracting the sum of the pixel values in the white region from the sum of the pixel values in the grey region. These Haar-like features are simple and very efficient to compute using the integral image representation, which allows the computation of any rectangle sum in constant time. The Haar-like features are extracted over a 24 × 24 pixel sub-window, resulting in thousands of features per image. The goal here is to construct a strong classifier by selecting a small number of discriminant features from the limited set of labelled training images. This is achieved by employing AdaBoost to learn a cascade of weak classifiers. Each weak classifier in the cascade is trained on a single feature; the term weak signifies that no single classifier in the cascade can classify all the examples accurately. In each round of boosting, AdaBoost selects the weak classifier with the lowest error rate, controlled by the desired hit and miss rates, and then re-assigns the example weights to emphasize the examples that were poorly classified in the next round.
Figure 4.9: Example of four basic rectangular features, as shown in the Viola and Jones (2004) IJCV paper. Panels A and B show two-rectangle features, while panels C and D show three-rectangle and four-rectangle features. Panel E shows an example of two features overlaid on a face image: the first feature is a two-rectangle feature measuring the difference between the eye and upper cheek regions, while the second, a three-rectangle feature, measures the difference between the eye region and the upper nose region.
AdaBoost is regarded as a greedy algorithm, since it associates a large weight with each good feature and a small weight with poor features. The final strong classifier is then a weighted combination of the weak classifiers.
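The constant-time rectangle sums mentioned above can be sketched as follows. This is a minimal illustration; the helper names are ours and an 8-bit grayscale NumPy image is assumed.

```python
import numpy as np

def integral_image(img):
    """Integral image: ii[y, x] holds the sum of all pixels above and to the
    left of (y, x), exclusive, so a leading row/column of zeros simplifies lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x), in constant time."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w):
    """A horizontal two-rectangle Haar-like feature: the sum of the left half
    is subtracted from the sum of the right half."""
    half = w // 2
    return rect_sum(ii, y, x + half, h, half) - rect_sum(ii, y, x, h, half)
```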
The features selected in the early rounds of boosting capture intuitive properties of the image. The example shows a two-feature classifier (the top row shows the two selected features) trained over 507 faces. The first feature measures the difference in luminance between the eye region and the upper cheeks, while the second feature measures the difference in luminance between the eye region and the bridge of the nose. An intuitive rationale behind the selection of these features is that the eye region is generally darker than the skin region.
Previous findings on static images suggest that people look at face components (eyes, mouth, and nose) preferentially, with the eyes receiving more attention than the other components (Buswell, 1935; Yarbus et al., 1967; Langton et al., 2000; Birmingham and Kingstone, 2009). However, a recent study on gaze allocation in dynamic scenes (Võ et al., 2012) suggests that the eyes are not fixated preferentially. Võ et al. (2012) showed that the percentage of overall gaze distribution is not significantly different for any of the face components in vocal scenes. For mute scenes, however, they did find a significant drop in gaze distribution for the mouth compared to the eyes and nose. In fact, the nose was given priority over the eyes regardless of whether the person in the video made eye contact with the camera or not, although these differences were found to be insignificant.
To detect faces in our video database, we used trained classifiers. The detector locates the face region in each frame and returns a bounding box encompassing the complete face. This is followed by convolving the face region with a Gaussian whose size h equals the width of the box and whose peak value is at the centre of the box. This automatically assigns the highest feature value to the nose compared to the other face components. Figure 4.10 shows the process of face modulation for an example frame from the movie "The Matrix" (1999). Note that the bottom right panel highlights the salient regions in the movie frame by overlaying the face-modulated saliency map on the movie frame.
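A minimal sketch of this modulation step is given below. The additive combination, the renormalisation, and the choice of sigma relative to the box width are our assumptions; the bounding box is taken as given from the face detector.

```python
import numpy as np

def face_modulate(saliency, box, weight=1.0):
    """Boost the saliency map inside a detected face bounding box with a Gaussian
    blob whose peak sits at the centre of the box.

    `box` is (x, y, w, h) as returned by a face detector; the blob width is tied
    to the box width, and the blob is added to the map before re-normalisation.
    """
    x, y, w, h = box
    H, W = saliency.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = y + h / 2.0, x + w / 2.0
    sigma = w / 2.0                               # spread tied to the box width
    blob = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    out = saliency + weight * blob
    return out / out.max() if out.max() > 0 else out
```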
Figure 4.10: Example of saliency map modulation with the detected face region of interest (ROI). The top left panel shows the original movie frame with the face ROI bounding box. Subsequent panels show how the face modulation is applied to the spatio-temporal saliency map. The bottom right panel overlays the face-modulated saliency map on the movie frame, signifying hot spots in the frame.
We investigated an improvement to the bottom-up spatio-temporal saliency model by incorporating top-down semantics of the scene. Our hypothesis is that variability in eye movement patterns for different scene categories (O'Connell and Walther, 2012) can help improve saliency prediction for the early fixations. Earlier experiments have shown the influence of scene context in guiding visual attention (Neider and Zelinsky, 2006; Chen et al., 2006). In Neider and Zelinsky (2006), scene-constrained targets were found faster, with a higher percentage of initial saccades directed to target-consistent scene regions. Moreover, they found that contextual guidance biases eye movements towards target-consistent regions (Navalpakkam and Itti, 2005) rather than excluding target-inconsistent scene regions (Desimone and Duncan, 1995). Chen et al. (2006) showed that in the presence of both top-down (scene preview) and bottom-up (colour singleton) cues, top-down information prevails in guiding eye movements. They observed faster