Chapter 4
Computational Modelling
We used a Spatio-Temporal Coherency model, which uses motion intensity, spatial coherency, temporal coherency, face detection and gist modulation to compute saliency maps (see Figure 4.1).

The critical part of this model is motion saliency, defined as the attention due to motion; Abrams and Christ (2003) have shown that the onset of motion captures visual attention. Motion saliency for a frame is computed over multiple preceding frames. According to our investigations, overt orienting of the fovea to any motion-induced salient location in a given frame, n, is influenced by the saliency at that location in the preceding frames (n−1, n−2, ...). This is known as saccade latency in the literature, with latencies typically in the order of 200-250 milliseconds (Becker and Jürgens, 1979; Hayhoe et al., 2003). We investigated the influence of up to ten preceding frames (approximately 500 milliseconds); however, beyond the fifth frame (i.e., n−5) we did not see any significant contribution to overt orienting. This means that the currently fixated location in frame n was indeed selected based on the saliency at that location in up to five preceding frames. The on-screen time for 5 frames was about 210 milliseconds (the video frame rate in our experiments was 24 frames per second), which is well within the bounds of the reported saccade latencies.
We computed motion vectors using the Adaptive Rood Pattern Search (ARPS) algorithm for fast block-matching motion estimation (Nie and Ma, 2002). The motion vectors are computed by dividing the frame into a matrix of blocks. The fast ARPS algorithm leverages the fact that general motion is usually coherent: if the blocks surrounding a given block moved in a particular direction, there is a high probability that the current block will have a similar motion vector. Thus, the algorithm estimates the motion vector of a given block using the motion vector of the macro block to its immediate left.
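A minimal sketch of this idea is shown below. It is our own simplification of the ARPS search, assuming grayscale frames as NumPy arrays; the function names and the reduced candidate set are illustrative, not taken from Nie and Ma (2002).

```python
import numpy as np

def arps_block(cur, ref, by, bx, bs, pred_mv, p=7):
    """Estimate the motion vector of the block at (by, bx) in `cur` relative to `ref`.

    pred_mv is the motion vector of the macro block to the immediate left, which
    ARPS uses as its prediction; p is the maximum search range in pixels.
    """
    h, w = ref.shape
    block = cur[by:by + bs, bx:bx + bs].astype(np.int32)

    def cost(dy, dx):
        # Sum of absolute differences against the candidate reference block.
        y, x = by + dy, bx + dx
        if abs(dy) > p or abs(dx) > p or y < 0 or x < 0 or y + bs > h or x + bs > w:
            return np.inf
        return np.abs(block - ref[y:y + bs, x:x + bs].astype(np.int32)).sum()

    # Large rood pattern: arm length taken from the predicted motion vector.
    arm = max(1, int(round(max(abs(pred_mv[0]), abs(pred_mv[1])))))
    candidates = [(0, 0), (arm, 0), (-arm, 0), (0, arm), (0, -arm), tuple(pred_mv)]
    best = min(candidates, key=lambda c: cost(*c))

    # Refinement with a unit-size rood pattern until the centre is the best point.
    while True:
        dy, dx = best
        neighbours = [(dy, dx), (dy + 1, dx), (dy - 1, dx), (dy, dx + 1), (dy, dx - 1)]
        nxt = min(neighbours, key=lambda c: cost(*c))
        if nxt == best:
            return best
        best = nxt
```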
The Motion Intensity map, I (Figure 4.2), which is a measure of motion-induced activity, is computed from the motion vectors (dx, dy), normalized by the maximum magnitude in the motion vector field:
\[
I(i, j) = \sqrt{dx_{i,j}^2 + dy_{i,j}^2}
\]
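For concreteness, a minimal sketch of this computation, assuming dx and dy are NumPy arrays holding the motion vector field:

```python
import numpy as np

def motion_intensity(dx, dy):
    """Motion intensity map: motion vector magnitudes normalised by the
    maximum magnitude in the motion vector field."""
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else magnitude
```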
Spatial coherency (Cs) and temporal coherency (Ct) maps were obtained by applying an entropy filter to the frames. The Cs maps captured regularity at a spatial scale of 9×9 pixels within a frame, while the Ct maps captured regularity at the same spatial scale (i.e., 9×9 pixels) but over a temporal scale of 5 frames. Spatial coherency (see Figure 4.3) measured the consistency of pixels around the point of interest, otherwise known as the correlation.
Figure 4.1: Spatio-Temporal saliency model architecture diagram.
Figure 4.2: Computation of motion intensity on adjacent frames. Three examples from different movies are shown.
The higher the correlation, the more probable it is that the pixels belong to the same object. This is computed as the entropy over the block of pixels: higher entropy indicates more randomness in the block structure, lower correlation among pixels, and hence lower spatial coherency. Cs is computed using the following equation:
\[
C_s(x, y) = -\sum_{i=1}^{n} p_s(i) \log_2 p_s(i)
\]

where p_s(i) is the probability of occurrence of the pixel intensity i and n corresponds to the 9×9 neighbourhood.
Similarly, to compute consistency in pixel correlation over time, we used
\[
C_t(x, y) = -\sum_{i=1}^{n} p_t(i) \log_2 p_t(i)
\]

where p_t(i) is the probability of occurrence of the pixel intensity i at the corresponding location in the preceding five frames (m = 5).
Figure 4.3: Examples of spatial coherency maps computed on five different movie frames.
Higher entropy implies greater motion and thus higher saliency at that location. The temporal coherency map (see Figure 4.4), in general, signifies the motion energy in each fixated frame contributed by the five preceding frames (except for boundary frames, where motion vectors are invalid due to scene or camera transitions).
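A minimal sketch of both entropy filters is given below. It is a direct, unoptimised implementation; the helper names are ours, 8-bit grayscale frames as NumPy arrays are assumed, and pooling the 9×9 neighbourhood across the five frames is our interpretation of the temporal filter.

```python
import numpy as np

def _entropy(window, bins=256):
    """Shannon entropy of the intensity distribution inside one neighbourhood."""
    hist, _ = np.histogram(window, bins=bins, range=(0, 256))
    p = hist[hist > 0] / window.size
    return -(p * np.log2(p)).sum()

def spatial_coherency(frame, k=9):
    """Cs map: entropy over the k x k spatial neighbourhood of each pixel."""
    h, w = frame.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode='reflect')
    cs = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            cs[y, x] = _entropy(padded[y:y + k, x:x + k])
    return cs

def temporal_coherency(frames, k=9):
    """Ct map: entropy over the k x k neighbourhood pooled across the current
    frame and its preceding frames (a list of m = 5 frames in our setting)."""
    stack = np.stack(frames).astype(float)                 # shape (m, H, W)
    m, h, w = stack.shape
    pad = k // 2
    padded = np.pad(stack, ((0, 0), (pad, pad), (pad, pad)), mode='reflect')
    ct = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            ct[y, x] = _entropy(padded[:, y:y + k, x:x + k])
    return ct
```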
Once all three feature maps are computed, we apply centre-surround suppression to these maps to highlight regions having higher spatial contrast. This is akin to simulating the behaviour of ganglion cells in the retina (Hubel and Wiesel, 1962). To achieve this, we first compute a dyadic Gaussian pyramid (Burt and Adelson, 1983) for each map by repeatedly low-pass filtering and subsampling the map (see Figure 4.5). For low-pass filtering, we used a 6 × 6 separable Gaussian kernel (Walther and Koch, 2006) defined as K = [1 5 10 10 5 1]/32 (see Walther, 2006, Appendix A.1 for more details).
We start with level 1 (L1), which is the actual size of the map. The image for each successive level is obtained by first low-pass filtering the image, which results in a blurry image with suppressed higher spatial frequencies. The resulting image is then subsampled to half of its current size to obtain the level 2 (L2) image. The process continues until the map cannot be subsampled any further (L9 in Figure 4.5).
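A minimal sketch of the pyramid construction is shown below, assuming feature maps as 2-D NumPy arrays. The 6-tap kernel follows the definition above, while the stopping rule and the function name are our own choices for illustration.

```python
import numpy as np
from scipy.ndimage import convolve

# 6-tap separable low-pass kernel, K = [1 5 10 10 5 1]/32 (Walther and Koch, 2006)
K = np.array([1, 5, 10, 10, 5, 1], dtype=float) / 32.0

def gaussian_pyramid(feature_map, n_levels=9):
    """Dyadic Gaussian pyramid: level 1 is the map itself; every further level is
    obtained by low-pass filtering the previous level and keeping every second
    row and column (i.e. subsampling to half the size)."""
    levels = [np.asarray(feature_map, dtype=float)]
    for _ in range(1, n_levels):
        prev = levels[-1]
        if min(prev.shape) < 2:                      # cannot be subsampled further
            break
        blurred = convolve(prev, K[np.newaxis, :], mode='nearest')    # filter rows
        blurred = convolve(blurred, K[:, np.newaxis], mode='nearest')  # filter columns
        levels.append(blurred[::2, ::2])
    return levels
```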
Figure 4.4: Examples of temporal coherency maps computed over the previous five frames, shown for three different movie examples.
Figure 4.5: Example of a temporal coherency map at nine different levels of the Gaussian pyramid. Starting at level 1 (L1 in the figure), which has the same size as the original map, each successive level is obtained by low-pass filtering and then subsampling the map to half of its size at the current level.
To simulate the behaviour of centre-surround receptive fields, we take the difference between different levels of the pyramid for a given feature map, as previously described in Itti et al. (1998). We select different levels of the pyramid to represent the centre and the surround; taking their differences results in six intermediate maps, as shown in Figure 4.6. To obtain point-wise differences across scales, the images are interpolated to a common size. All six centre-surround maps are then added across scales to get a single map per feature, as shown in Figure 4.7. All three feature maps are then combined linearly to produce a standard saliency map.
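A minimal sketch of the centre-surround step and the linear combination is shown below. The choice of level 4 as the common output size, the use of bilinear interpolation, and the equal weighting in the combination are our assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

# Centre-surround level pairs (centre, surround), as in Itti et al. (1998)
CS_PAIRS = [(2, 5), (2, 6), (3, 6), (3, 7), (4, 7), (4, 8)]

def _resize(img, shape):
    """Interpolate an image to a common size for point-wise differences."""
    return zoom(img, (shape[0] / img.shape[0], shape[1] / img.shape[1]), order=1)

def centre_surround(pyramid, out_level=4):
    """Sum of across-scale point-wise differences, giving one map per feature.

    `pyramid` is the list returned by the pyramid construction above, so
    level L lives at index L - 1.
    """
    target = pyramid[out_level - 1].shape
    total = np.zeros(target)
    for c, s in CS_PAIRS:
        centre = _resize(pyramid[c - 1], target)
        surround = _resize(pyramid[s - 1], target)
        total += np.abs(centre - surround)        # point-wise difference across scales
    return total

def standard_saliency(intensity_cs, spatial_cs, temporal_cs):
    """Linear combination of the three feature maps into a standard saliency map."""
    return (intensity_cs + spatial_cs + temporal_cs) / 3.0
```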
Figure 4.6: Taking point-wise differences across scales (2-5, 2-6, 3-6, 3-7, 4-7, 4-8) results in six intermediate maps for a given feature map.
Figure 4.7: Final feature maps (motion intensity, spatial coherency, and temporal coherency) obtained after adding across-scale centre-surround differences. The top panel shows the feature maps before centre-surround suppression is applied; the bottom row shows the final feature maps after applying centre-surround suppression via across-scale point-wise differences followed by summation. The example, shown for one movie frame, clearly demonstrates the effectiveness of centre-surround suppression in producing sparser feature maps.
Since higher entropy in the temporal coherency map indicates greater motion over a particular region, the intensity maps are directly multiplied with the temporal coherency maps. This highlights the contribution of the motion-salient regions in the saliency maps. On the contrary, higher entropy in the spatial coherency map indicates randomness in the block structure, suggesting that the region does not belong to any single entity or object. Since we are interested in motion saliency induced by spatially coherent objects, we assign higher values to the pixels belonging to regions with lower spatial entropy (i.e., spatially coherent regions).
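The exact weighting scheme is not fully spelled out above, but one possible reading of this paragraph is sketched below. The normalisation step and the inversion of the spatial coherency map are our assumptions, shown purely for illustration.

```python
import numpy as np

def _normalise(m):
    """Scale a map to the [0, 1] range."""
    m = m - m.min()
    peak = m.max()
    return m / peak if peak > 0 else m

def modulated_saliency(intensity, spatial_coh, temporal_coh):
    """One possible reading of the modulation described in the text: motion
    intensity is multiplied by temporal coherency (more temporal entropy means
    more motion), while spatial coherency enters inverted, so that spatially
    coherent (low-entropy) regions are weighted more strongly."""
    motion_term = _normalise(intensity) * _normalise(temporal_coh)
    coherent_weight = 1.0 - _normalise(spatial_coh)
    return _normalise(motion_term * coherent_weight)
```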
Figure 4.8 shows the resulting saliency maps for a randomly chosen frame from each of the movies in our database.

Figure 4.8: Saliency maps shown for a randomly selected frame from every movie in the database. Columns 1 and 3 show movie frames, while columns 2 and 4 show the saliency maps for the corresponding frames. Higher saliency values are indicated by warmer colours, as illustrated by the colour map on the right.
4.1.2 Face Modulation
We modulate the standard saliency map with high-level semantic knowledge, such as faces, using a state-of-the-art face detector (Viola and Jones, 2004). This accounts for the fact that overt attention is frequently deployed to faces (Cerf et al., 2009), and it can be argued that faces are a part of the bottom-up information, as there are cortical areas specialized for faces, in particular the fusiform gyrus (Kanwisher et al., 1997).
The Viola and Jones (2004) face detector is based on training a cascaded classifier, using a learning technique called AdaBoost, on a set of very simple visual features. These visual features have Haar-like properties, as they are computed by subtracting the sum of one sub-region from the sum of the remaining region. Figure 4.9 shows examples of Haar-like rectangle features: panels A and B show two-rectangle features (horizontal/vertical), while panels C and D show three-rectangle and four-rectangle features respectively. The value of a feature is computed by subtracting the sum of the pixel values in the white region from the sum of the pixel values in the grey region. These Haar-like features are simple and very efficient to compute using the integral image representation, which allows the computation of any rectangle sum in constant time. The Haar-like features are extracted over a 24 × 24 pixel sub-window, resulting in thousands of features per image. The goal here is to construct a strong classifier by selecting a small number of discriminant features from the limited set of labelled training images. This is achieved by employing AdaBoost to learn a cascade of weak classifiers. Each weak classifier in the cascade is trained on a single feature; the term weak signifies that no single classifier in the cascade can classify all the examples accurately. In each round of boosting, AdaBoost selects the weak classifier with the lowest error rate, controlled by the desired hit and miss rates, and then re-assigns the example weights to emphasize the examples that were poorly classified in the next round.
Figure 4.9: Example of four basic rectangular features, as shown in the Viola and Jones (2004) IJCV paper. Panels A and B show two-rectangle features, while panels C and D show three-rectangle and four-rectangle features. Panel E shows an example of two features overlaid on a face image: the first feature is a two-rectangle feature measuring the difference between the eye and upper cheek regions, while the second, a three-rectangle feature, measures the difference between the eye region and the upper nose region.
AdaBoost is regarded as a greedy algorithm, since it associates a large weight with each good feature and a small weight with poor features. The final strong classifier is then a weighted combination of the weak classifiers.
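The constant-time rectangle sums mentioned above can be sketched as follows. This is a minimal illustration; the helper names are ours and an 8-bit grayscale NumPy image is assumed.

```python
import numpy as np

def integral_image(img):
    """Integral image: ii[y, x] holds the sum of all pixels above and to the
    left of (y, x), exclusive, so a leading row/column of zeros simplifies lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x), in constant time."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w):
    """A horizontal two-rectangle Haar-like feature: the sum of the left half
    is subtracted from the sum of the right half."""
    half = w // 2
    return rect_sum(ii, y, x + half, h, half) - rect_sum(ii, y, x, h, half)
```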
The features selected in the early rounds of boosting capture intuitive properties of the image. The example shows a two-feature classifier (the top row shows the two selected features) trained over 507 faces. The first feature measures the difference in luminance between the eye region and the upper cheeks, while the second feature measures the difference in luminance between the eye region and the bridge of the nose. An intuitive rationale behind the selection of these features is that the eye region is generally darker than the skin region.
Previous findings on static images suggest that people look at face components (eyes, mouth, and nose) preferentially, with the eyes receiving more attention than the other components (Buswell, 1935; Yarbus et al., 1967; Langton et al., 2000; Birmingham and Kingstone, 2009). However, a recent study on gaze allocation in dynamic scenes (Võ et al., 2012) suggests that the eyes are not fixated preferentially. Võ et al. (2012) showed that the percentage of overall gaze distribution is not significantly different for any of the face components in vocal scenes. For mute scenes, however, they did find a significant drop in gaze distribution for the mouth compared to the eyes and nose. In fact, the nose was given priority over the eyes regardless of whether the person in the video made eye contact with the camera or not, although these differences were found to be insignificant.
To detect faces in our video database, we used trained classifiers. The detector locates the face region in each frame and returns a bounding box encompassing the complete face. This is followed by convolving the face region with a Gaussian whose size h equals the width of the box and whose peak value is at the centre of the box. This automatically assigns the highest feature value to the nose compared to the other face components. Figure 4.10 shows the process of face modulation for an example frame from the movie "The Matrix" (1999). Note that the bottom right panel highlights the salient regions in the movie frame by overlaying the face-modulated saliency map on the movie frame.
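A minimal sketch of this modulation step is given below. The additive combination, the renormalisation, and the choice of sigma relative to the box width are our assumptions; the bounding box is taken as given from the face detector.

```python
import numpy as np

def face_modulate(saliency, box, weight=1.0):
    """Boost the saliency map inside a detected face bounding box with a Gaussian
    blob whose peak sits at the centre of the box.

    `box` is (x, y, w, h) as returned by a face detector; the blob width is tied
    to the box width, and the blob is added to the map before re-normalisation.
    """
    x, y, w, h = box
    H, W = saliency.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = y + h / 2.0, x + w / 2.0
    sigma = w / 2.0                               # spread tied to the box width
    blob = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    out = saliency + weight * blob
    return out / out.max() if out.max() > 0 else out
```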
Figure 4.10: Example of saliency map modulation with the detected face region of interest (ROI). The top left panel shows the original movie frame with the face ROI bounding box. Subsequent panels show how the face modulation is applied to the spatio-temporal saliency map. The bottom right panel overlays the face-modulated saliency map on the movie frame, signifying hot spots in the frame.
We investigated an improvement to the bottom-up spatio-temporal saliency model by incorporating top-down semantics of the scene. Our hypothesis is that variability in eye movement patterns for different scene categories (O'Connell and Walther, 2012) can help improve saliency prediction for the early fixations. Earlier experiments have shown the influence of scene context in guiding visual attention (Neider and Zelinsky, 2006; Chen et al., 2006). In Neider and Zelinsky (2006), scene-constrained targets were found faster, with a higher percentage of initial saccades directed to target-consistent scene regions. Moreover, they found that contextual guidance biases eye movements towards target-consistent regions (Navalpakkam and Itti, 2005) rather than excluding target-inconsistent scene regions (Desimone and Duncan, 1995). Chen et al. (2006) showed that in the presence of both top-down (scene preview) and bottom-up (colour singleton) cues, top-down information prevails in guiding eye movements. They observed faster