VISUAL ATTENTION IN DYNAMIC NATURAL SCENES


Chapter 1

Introduction

The human visual system can quickly, effortlessly, and efficiently process visual information. The study of vision has been heavily influenced by how biological visual systems encode properties of the natural environment that are important for the survival of the species. Human subjects can perform several complex tasks such as object localization, identification, and recognition in a given scene effortlessly, owing to their ability to attend to selected portions of their visual fields while ignoring others; this selectivity shapes the way perception is achieved. Human subjects can also utilize divided attention; for example, drivers can simultaneously pay attention to traffic lights, traffic signs, and vehicles in front of them while driving. It is also worth noting that perception can occur …

In this thesis, I present our research findings on the study of temporal (fixation duration) and spatial (fixation location) properties of visual attention. Fixation durations have been extensively studied using the static scene change paradigm (Land and Hayhoe, 2001; Henderson, 2003; Pannasch et al., 2010). However, the corresponding behaviour in dynamic natural scenes has received far less attention.


Part of this thesis attempts to fill this gap by looking at how fixation durations change in movies across scene changes. We also show how fixation durations can be used as an unbiased behavioural metric to quantify the centre bias in movies (Tseng et al., 2009).

The second part of this thesis focuses on a computational model of the human visual attention system. We propose a novel method of combining bottom-up (sensory information) and top-down (experience) cues. Specifically, we used unsupervised learning techniques to categorize scene-specific patterns of visual attention, and hypothesized that these patterns will be unique for different types of scenes, e.g., natural and man-made scenes. Using these patterns of eye fixations, we modulated our saliency maps to investigate whether we were able to improve our prediction of human fixations. Our results show that there are indeed scene-specific differences in visual attention patterns, and that augmenting such information (top-down knowledge) with sensory cues (bottom-up information) improves the predictive power of the proposed computational model.

The overall thesis is organized as follows. Chapter 2 reviews the current hypotheses on the control of fixation durations, and reviews the computational models used to predict visual attention in static and dynamic natural scenes. In Chapter 3, I describe our experimental methods and the analysis of fixation durations, with three major results. In Chapter 4, I describe a computational model of attention and discuss our results in comparison to previous models. In Chapter 5, I discuss our overall conclusions, with important directions for future work.


The study of attention dates back to 1694, when Descartes proposed that movements of the … (…, 1721; Leibniz, 1765; von Helmholtz, 1886). However, all these proposals were based on the central idea of the obligatory coupling of visual attention to physical eye movements, otherwise known as overt attention. Helmholtz was the first to discover covert attention, by successfully demonstrating that attention can be achieved even without moving the eyes …

The acuity of the visual system in primates drops with eccentricity, from the fovea towards the periphery. The density of cone photoreceptors is much greater at the centre than in the periphery. Peripheral vision is poor at detecting object detail, so a shift of gaze to that location is necessary, bringing the image onto the central fovea, where the population of photoreceptors is the highest. With advancements in neuroscience, different kinds of oculomotor movements have been discovered and their properties studied. Table 2.1 summarizes the well-known eye movements.

Table 2.1: Different eye movements

Tremor: Also known as nystagmus; a compensatory eye movement with the lowest amplitude of all the eye movements (Yarbus, 1967; Carpenter, 1988; Spauschus, 1999).

Drift: Also known as the opto-kinetic reflex; a pursuit-like movement that stabilizes the image of a low-velocity object (Nachmias, 1959; Fender, 1969).

Micro-saccade: Also known as the vestibulo-ocular reflex and fixational saccades; involuntary image-stabilizing eye movements in response to head movements (Horw…, 2003). They are also thought to correct eye displacements caused by drifts (Yarbus, 1967; …).

Smooth-pursuit: Voluntary tracking of a stimulus moving across the visual field, to keep it under the foveal spotlight.

Vergence: Coordinated eye movements to stabilize the target image onto the foveal region of both eyes (Sparks, 2002).

In general, gaze-stabilizing eye movements compensate for head and body movements to keep the image under the high-resolution fovea, while gaze-shifting eye movements provide high-resolution samples of the visual environment by controlling and directing eye movements.

Figure 2.1: Visual orienting graph.

All these eye movements tend to supplement an abstract concept of overt and covert attention. Overt attention is the result of a directed eye movement towards an attended location, as opposed to covert attention, where the attended location is independent of eye position. A further decomposition of these two types of attention suggests that eye movements are under the control of endogenous and exogenous sub-systems. Endogenous control refers to voluntary eye movements under the influence of visual or verbal instruction, e.g., to look at a central fixation cross preceding the presentation of the stimuli. In contrast, exogenous control refers to autonomous eye movements under the influence of the visual stimuli, e.g., a brief presentation of a target in the periphery in attention capture experiments. The difference between covert and overt orienting of attention under endogenous control was demonstrated … (… and Theeuwes, 2007). In a majority of the trials, subjects were endogenously cued (by displaying an arrow at the central fixation) to covertly attend to the upcoming target location while maintaining their gaze at the central fixation. The responses to these target onsets were sampled using key-presses. However, in a few of the trials, subject responses were sampled by instructing them to saccade to the target location (overt orienting under endogenous control). Similarly, differences between covert and overt attention under exogenous control can be exemplified by an attention capture paradigm: a brief onset of the target at a peripheral location is first covertly attended (attention capture), followed by an overt orientation, for focused processing (oculomotor capture).

A third-level decomposition has only been reported for covert attention. Two examples of the effects of covert attention are response facilitation and inhibition of return (IOR). Facilitation was demonstrated in a study in which subjects were asked to press a button when they detected a flash of light that could appear at one of 4 possible peripheral locations on a screen in front of them. To ensure that only covert attention was involved, the subjects were required to maintain central fixation throughout the trial. Before the target stimulus appeared, a cue was presented that would instruct the subject to orient his/her covert attention to one of the 4 possible locations. They found that reaction times were significantly lowered when subjects were cued to the correct location, despite the fact that their eyes were directed somewhere else. This facilitation appears to have lasted for 100 to 200 milliseconds following the cue onset. Any further delay in the onset of the target, when it was near the cued location, resulted in slower reaction times. This was later explained to be due to the effect of IOR. The term inhibition of return derives from experiments that showed the relatively impaired ability of immediate attentional shifts to a target location if attention was recently withdrawn from that cued location. The delayed onset of the target near the cued location resulted in significantly slower reaction times. The IOR effect has been discussed as a novelty-seeking mechanism (Posner & Cohen, 1984) and as facilitating visual search when the target does not pop out (Klein and MacInnes, 1999). It is also worth noting that overt and covert attention …

Over the years, other models of attention have also been put forth:

• Independent Attention Model: states that both types of attention can be deployed independently of each other.

• Sequential Attention Model: states that overt foveation is preceded by a covert shift of attention to the target location.

• Pre-Motor Theory of Attention: states that covert attention is a consequence of the preparation of an eye movement to the attended location (Rizzolatti et al., 1987).

Temporal properties of visual attention are attributed to how a presented stimulus influences fixation durations. Overt orientation is not only manifested by physical eye movements to the desired location, but also by the duration that the eyes stay at the attended location. This behavioural property (fixation duration) has been investigated by many researchers and has been found to be a function of a variety of factors. Henderson and Smith (2009) listed the different control mechanisms affecting fixation durations:

• Process Monitoring states that fixation durations are driven by the moment-to-moment visual and cognitive analysis.

– Immediate control exerts influences on fixation durations based on the ongoing analysis of the currently fixated material (Rayner and Pollatsek, 1981).

– Delayed control exerts influences on subsequent fixation durations, originating from the slow development of higher-level visual and cognitive processes.

• Autonomous Control states that most fixation durations are independent of the immediate perceptual and cognitive processing of the current fixation.

– Timing control suggests that fixation durations are determined by an internal stochastic timer that is designed to move the eyes at a constant rate, regardless of the scene type or task definition.

– Parameter control suggests that fixation durations are based on oculomotor timing parameters reflecting the global viewing conditions ….

• Mixed Control suggests that fixation durations are driven by some combination of the above-mentioned processes. As an example, an argument can be made that most of the time fixation durations are under immediate control: in reading, for instance, the fixation duration on the currently processed word was longer if the following or preceding word was skipped, compared to when it was fixated. Another argument can also be made that fixation durations are under timing control, which is sometimes overridden by delayed control due to slower-acting higher-level visual and cognitive processes.


Another interesting behavioural bias observed while watching natural stimuli (static images and videos) is the tendency to fixate near the centre more often (… Mannan et al., 1995, 1996, 1997; Reinagel and Zador, 1999; Parkhurst et al., 2002; Parkhurst and Niebur, 2003; Itti, 2004; Tatler et al., 2005; Tatler, 2007; Foulsham and Underwood, 2008; Tseng et al., 2009). At present, the reasons behind this centre bias are not fully understood …; it has been suggested that this centre bias should be largely attributed to the photographer bias, and to an expectancy-derived viewing strategy that follows from it.

The photographer bias indicates that photographers and film makers typically place objects of interest at or around the centre of the frame, presumably so that their viewing audience would be able to easily perceive the intended meaning of the scene and would not need to alter their gaze to perceive it. Moreover, as stated above, it has been suggested that the photographer bias promotes a typical viewing strategy, where viewers develop a tendency to move their eyes toward the centre of a newly presented scene, since they expect the most interesting or important content to be located there.

The precise contribution of the photographer bias to the centre bias is difficult to assess, and remains a crucial issue for the understanding of how we perceive … photographer bias. To assess the top-down influences of the photographer bias, subjects were required to rate the extent to which the interesting aspects of a scene were biased toward the centre. However, there were large differences between the subjective ratings of the viewers, alluding to the fact that perhaps the scene properties manifested in the photographer bias were not completely captured by this rating system. To measure the bottom-up influences of the photographer bias, the … model. Importantly though, the contribution of saliency to the photographer and centre biases was found to be markedly smaller than that of top-down influences, like the subjective assessment by the participants of the experiment.

A different way to tackle this issue, which is not based on subjective measures but is more objective, is to assess the amount of photographer bias in a scene as a function of the disparity in fixation durations between the centre and the periphery. When a photographer bias is present, it should influence not only the number of fixations that are made toward the centre relative to the periphery, but also the duration of individual fixations. This is because viewers are expected to remain fixated at meaningful locations for longer periods, compared to less-meaningful locations that would not especially evoke the viewer's interest. Under a photographer bias, most of the meaningful portions of the scene are positioned near the centre. Hence, when the viewer fixates these locations, the probability that he/she is fixating a meaningful spot is increased, and so is the probability that the fixation duration will be extended. In other words, since viewers are expected to remain fixated at informative, interesting, or otherwise meaningful locations relative to less-meaningful ones, we should expect a correlation between fixation duration and distance from the centre when a photographer bias is present. On the other hand, other biases of visual attention that are unrelated to the semantic content of the scene (like orbital reserve and motor bias) do not predict that fixation durations will be longer in central regions. If there were no photographer bias, viewers would not be expected to have longer fixations at central regions, as these would not contain information that is more meaningful than other parts of the image.


2.5 Models of Visual Attention

Visual attention can be driven by either bottom-up (exogenous-control) or top-down (endogenous-control) mechanisms. The bottom-up mechanisms are characterized by involuntary and unconscious eye movements with a low cognitive load. The top-down mechanisms are characterized by voluntary and conscious eye movements, under the influence of specific tasks, and are accompanied by a higher cognitive load. Research studies have found that bottom-up influences act more quickly than top-down ones. In particular, Wolfe and colleagues (2000) conducted two sets of experiments that nicely demonstrated this concept. In the first experiment, 8 images were shown sequentially at 8 different locations on the screen for a period of 53 milliseconds, interleaved with a variable-duration mask. The target could appear at one of the eight locations, distributed in a clock-like fashion. In the command condition, the images were presented sequentially in a clockwise manner, i.e., the first image appeared at the 12 o'clock location, followed by an image at the 1 o'clock location, and so on. If the target was to be presented as the fourth image in a presentation sequence, then the target would only appear at the fourth location. In the random anarchy condition, the target could appear randomly on each frame. The results showed that attention can shift much more rapidly when allowed to move randomly (anarchy condition) than when controlled in a top-down manner (command condition). Similar results were found for the second experiment, where the objective was to find mirror-reversed letters.

The bottom-up mechanisms are mainly driven by low-level processes that depend on the intrinsic features of the visual stimuli. This mechanism is also called … Any feature in the visual environment can attract attention. Some features, like colour, motion, orientation, and size (including length and spatial frequency), have proved to be reliable predictors of foveated attention, while others (e.g., shape, optical flow, luminance polarity, etc.) are less probable in attracting attention (Wolfe and Horowitz, 2004). Over the years, many attempts have been made to develop computational models based on bottom-up attention. They have shown some degree of success in simulating the fixation patterns of humans. The foundations of these models rest on the "Feature Integration Theory" (FIT). According to the FIT, the human visual system can autonomously detect discrete features of the stimulus in a parallel fashion (within the limits of visual acuity and discriminability), resulting in feature maps. These feature maps are then combined into a master map of locations that shows where things are in the scene. The pre-attentive parallel processing capability mediates figure-ground grouping and texture segregation. However, for higher-level percepts such as object identification in a scene, focused attention is necessary. This is achieved by serial scanning of the map (item-by-item) and directing focal attention …. The more discriminable the object features are, the faster this conjunction search can be completed. Thus the FIT explains the finding that reaction times tend to increase with the display size in conjunction-search tasks but remain almost constant in feature-search tasks.

… the biological plausibility of the feature-based computational model. A key proposal was the idea of a topographical map that linearly combined the different individual feature maps. This master map, termed the Saliency Map, provided a measure of global conspicuity. An alternative map description, the Activation Map, was defined as the weighted summation of feature activations. These maps essentially combined both top-down and bottom-up information. The activation maps are then used to guide visual attention from object to object, starting with the object with the highest activation, until the target is found or the current activation level falls below a threshold.

… also combines representations of bottom-up salience in a scene with the top-down relevance of objects to the subject's goal. Currently, there are three classes of models:

• Hierarchical models decompose visual information using Gaussian-based or wavelet-based decomposition. Different methods are then used to aggregate the information across the hierarchy to produce a unique saliency map ….

• Statistical models make use of the local statistical properties of the regions in the scene. The saliency map is then computed as a measure of the deviation … (Bruce and Tsotsos, 2009; Gao et al., 2008).

• Bayesian models combine the bottom-up sensory visual information with prior knowledge … (Torralba, 2003b; Zhang et al., 2009).

… (Koch and Ullman, 1985). Koch and Ullman claimed that the brain computes an explicit saliency map of the visual world. Saliency was defined using principles of the center-surround mechanism: pixels in the scene are salient if they differ from surrounding pixels in intensity. Features are computed from the responses of biologically …; the input is decomposed into three channels: color, intensity, and orientation. Subsequently, the color and intensity channel images are repeatedly sub-sampled using Gaussian-shaped kernels to create dyadic Gaussian pyramids. Four orientation Gabor pyramids are created, using four preferred orientations (0, 45, 90, and 135 degrees). A center-surround operation, implemented by taking the difference of the filter responses, yields a set of feature maps. The feature maps for each channel are then normalized (to promote maps with a few strong peaks and suppress maps with many comparable peaks) and combined across scales and orientations to create a conspicuity map for each channel. These three conspicuity maps are further normalized to enhance the conspicuous regions, and the channels are linearly combined to form an overall saliency map. The model's output then feeds into a two-layer neural network (winner-takes-all) to simulate the shifting of attention from one location to another. To avoid returning immediately to the previously processed location (IOR), DoG filters are used; the excitatory surround around the inhibitory center gives a slight preference to nearby locations (…, 2000).
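To make the pipeline concrete, below is a minimal sketch of the pyramid and center-surround stages for the intensity channel only. It assumes OpenCV and NumPy; the block size in the peak-promoting normalization, the file name, and all function names are illustrative choices of mine, not the authors' reference implementation.

```python
# Minimal sketch of the intensity channel of an Itti-Koch-style saliency map.
import cv2
import numpy as np

def normalize_map(m, block=16):
    """Rescale to [0, 1], then promote maps with few strong peaks by
    weighting with (global max - mean local max)^2."""
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)
    local_maxima = [m[i:i + block, j:j + block].max()
                    for i in range(0, m.shape[0], block)
                    for j in range(0, m.shape[1], block)]
    mean_local = (np.sum(local_maxima) - m.max()) / max(len(local_maxima) - 1, 1)
    return m * (1.0 - mean_local) ** 2

def intensity_conspicuity(image, levels=6):
    """Center-surround differences across a dyadic Gaussian pyramid."""
    pyr = [image.astype(np.float32)]
    for _ in range(levels):
        pyr.append(cv2.pyrDown(pyr[-1]))
    h, w = pyr[2].shape                      # common resolution for feature maps
    out = np.zeros((h, w), np.float32)
    for c in (2, 3, 4):                      # center scales
        for delta in (2, 3):                 # surround scale = center + delta
            s = c + delta
            if s >= len(pyr):
                continue
            center = cv2.resize(pyr[c], (w, h))
            surround = cv2.resize(pyr[s], (w, h))
            out += normalize_map(np.abs(center - surround))
    return normalize_map(out)

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
saliency = intensity_conspicuity(gray)   # full model adds color + orientation
```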

Even though this model has been shown to be successful in predicting human fixations, it is somewhat ad hoc, in that there is no objective function to be optimized and many parameters must be tuned manually. In contrast, the model … is based on the self-information sampled from the image. Features are learned from a set of natural images using independent component analysis (ICA). These have been shown qualitatively to resemble the receptive fields found in the primary visual cortex (V1), and their responses exhibit the desired property of sparsity. Furthermore, since the learned features are independent, the joint probability of the features is the product of the features' marginal probabilities. Once the basis functions and coefficients are learned, they are tested on a set of new images. First, for each location x in the image, the responses of the learned basis functions are obtained (i.e., the ICA coefficients). These ICA coefficients correspond to various basis filters that respond to different features. A histogram density estimate is used to produce the distribution of each of these coefficients over a local neighborhood. This is followed by computing the joint likelihood using the given neighborhood coefficients. In the end, the final saliency at each location is inversely related to the likelihood of the content within its given neighborhood (ensemble). It is worth noting that if the neighborhood of the point of interest is defined as the entire image, this definition of saliency becomes identical to bottom-up saliency as defined in Oliva et al. (2003), where the saliency of each location is inversely proportional to its occurrence probability in the image.
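In symbols (notation mine, following the description above): if F(x) = (f_1(x), ..., f_K(x)) are the ICA coefficients at location x, and p_k is the density of the k-th coefficient estimated by histogramming over the local neighborhood, then the independence of the learned features lets the self-information saliency factorize across coefficients:

```latex
S(x) \;=\; -\log p\big(F(x)\big) \;=\; -\sum_{k=1}^{K} \log p_k\big(f_k(x)\big)
```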

The surprise model of Itti and Baldi is formulated in terms of an observer's beliefs about their environment, and how these beliefs evolve over time. Data observations that leave the observer's beliefs unaffected carry no surprise, and thus do not get registered in the model's output. In contrast, data observations that force the observer to significantly revise their existing beliefs elicit surprise. Briefly, initial beliefs are computed in a series of small windows over the entire image, along several low-level features (color, intensity, orientation, and contrast) and at several spatial and temporal scales. Following this initial calculation of the beliefs, based on the low-level hypothesis of the environment, any abrupt visual change in subsequent frames or images causes a re-evaluation of the prior beliefs about the environment. The model then uses Bayes' rule to compute the posterior beliefs; a significant difference between the posterior and the prior implies that the observed data/image carries surprise, in terms of an abrupt low-level visual change. In contrast, if there are no differences between the posterior and prior distributions, no surprise is elicited. Thus surprise is quantified by the distance between the posterior and prior distributions, as measured using the KL divergence.
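Written out (notation mine): with M ranging over the observer's candidate beliefs about the world, P(M) the prior, and P(M | D) the posterior after observing new data D, the surprise elicited by D is

```latex
S(D) \;=\; \mathrm{KL}\big(P(M \mid D)\,\|\,P(M)\big)
     \;=\; \int P(M \mid D)\,\log\frac{P(M \mid D)}{P(M)}\,\mathrm{d}M
```

so identical prior and posterior distributions give S(D) = 0, i.e., no surprise.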

In summary, surprise reflects the computation of saliency over spatial and temporal scales. A spatial oddity (e.g., a house in a field) or a temporal oddity (e.g., a snow image appearing during a normal TV broadcast) will typically elicit surprise initially. But since the observer continuously updates its beliefs about the world through Bayes' rule, the surprise elicited by the repeated presentation of such an oddity decreases with every presentation. This definition of surprise saliency is somewhat similar to other definitions of saliency that are based on the deviation of local features … (Bruce and Tsotsos, 2006; Gao et al., 2008), except that it extends the notion to the temporal realm. However, it is important to note that statistical uniqueness is not always similar to surprise: statistically unique snow images during a TV broadcast will still elicit decreasing surprise over time.

… on learned statistics from a series of natural images. A contrasting view is to … information, but their underlying implications are different. Saliency using local image statistics argues that foreground objects are likely to have different features than the background; thus saliency is defined as the deviation of the feature values from the average statistics of the image. On the other hand, SUN's intuition states that since target objects are less frequently observed than the background in daily life, rare/novel features are more likely to be fixated by humans. This claim was substantiated using evidence from the literature describing what attracts …

… results (such as search asymmetry and parallel vs. serial search) and eye movement patterns in general, while viewing images and movies. The main formulation of their probabilistic model is made under the assumption that the aim of the human visual system is to find potential targets that are important for survival. Attention is directed to regions that have a high likelihood of belonging to the target class. Taking the logarithm of this target posterior, the first term on the right-hand side forms top-down saliency (the probability of the target's features at the currently processed location), the second term is a constant, and the third term captures bottom-up saliency (the self-information at the current location). Taken together, the top-down and bottom-up saliency terms provide the pointwise mutual information between the features and the targets. Two features were used: DoG and ICA. The DoG filters were applied to three different channels separately (intensity, red-green, and blue-yellow) at 4 scales, yielding a total of 12 features. Indepen… histograms of each feature (the histogram of each feature represents the frequency of its response, indicating how rarely or frequently a particular feature was present in natural images). This is fundamentally different from Bruce and Tsotsos (2006), where learned basis functions were used for density estimation, producing coefficients for local neighborhoods and thus yielding a distribution of values for a single coefficient. Bottom-up saliency was computed by estimating the joint probability from the features. Since the features were independent, the joint probabilities were obtained from the product of the features' marginal probabilities.
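The decomposition referred to above can be written explicitly (notation mine): with C = 1 denoting target presence, F the local features, and L the location, Bayes' rule, together with the assumption that features are independent of location, gives

```latex
\log P(C{=}1 \mid F{=}f, L{=}l)
  \;=\; \underbrace{\log P(F{=}f \mid C{=}1)}_{\text{top-down}}
  \;+\; \underbrace{\log P(C{=}1 \mid L{=}l)}_{\text{constant (location prior)}}
  \;+\; \underbrace{\big({-}\log P(F{=}f)\big)}_{\text{bottom-up self-information}}
```

The first and third terms together form the pointwise mutual information between features and target mentioned in the text.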

… approach for computing saliency maps. The approach differs from other methods using natural image statistics in that it is non-parametric and computes saliency rapidly. It does so by computing the difference (the spectral residual) between the log spectrum (the log of the amplitude spectrum) of an image and its smoothed version. This is followed by transforming the residual back to the spatial domain …. It has further been claimed that the phase spectrum of an image has even more predictive power than the amplitude spectrum.
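Since the method is only a few frequency-domain operations, it is easy to sketch. The following illustrative implementation uses my own choices for the smoothing kernels (a 3x3 box filter on the log spectrum and a final Gaussian blur); the file name is a placeholder.

```python
# Illustrative spectral-residual saliency for a grayscale image.
import cv2
import numpy as np

def spectral_residual_saliency(gray):
    f = np.fft.fft2(gray.astype(np.float32))
    log_amplitude = np.log(np.abs(f) + 1e-8)     # log amplitude spectrum
    phase = np.angle(f)                          # phase spectrum, kept as-is
    smoothed = cv2.blur(log_amplitude, (3, 3))   # locally averaged log spectrum
    residual = log_amplitude - smoothed          # the "spectral residue"
    # back to the spatial domain; squared magnitude gives the saliency map
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return cv2.GaussianBlur(sal.astype(np.float32), (9, 9), 2.5)

saliency = spectral_residual_saliency(cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE))
```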

… (… et al., 2009) share a common problem, which is sensitivity to the presence of camera motion: they lack ego-motion cancellation, and thus frequently assign high salience values to background motion (due to camera panning or zooming) in the presence of more salient foreground motion.

Top-down mechanisms are defined as relying on task or contextual information …, e.g., peripherally cued (Jonides, 1981) or centrally cued (Posner et al., 1980) attention. The classical demonstration is due to Yarbus (1967). He was the first to demonstrate that a subject's gaze-shift varies with high-level task definition. In his experiment using paintings of people in a living room, subjects were either asked no questions or specific questions, like judging the economic status of the people, what clothes they were wearing, where they are, etc. A three-minute recording of gaze showed different eye scan paths for all the conditions (elicited by the different questions asked), thus implying that top-down task … Posner and Cohen (1980, 1984) also showed how covert and overt attention might co-exist to achieve a task. A central cue, shown as an arrow, is presented to the subjects, pointing to the location of the target. The ability to detect the target, in terms of reaction times, was typically better in the trials in which the target was presented at the cued location than in the trials in which the target appeared at an uncued location. Subsequently, Posner described three major functions concerning attention: alertness, orienting of attention, and target detection. Alerting pertains to providing the ability to process high-priority signals, e.g., peripheral motion gets prioritized by the attention system for subsequent processing. Orienting improves the efficiency of target processing, in terms of acuity, by reporting more rapidly the events occurring at the foveated location. Target detection implies that observers are conscious of the presence of the stimulus.

In practice, computational models based on top-down information are more complex and require specific contextual knowledge of the task to modulate the feature maps. Recent studies in object recognition have proposed unified frameworks, combining both top-down and bottom-up information to compute saliency (Torralba, 2003b; Gao and Vasconcelos, 2005; Gao et al., 2008).

… (Torralba, 2003a). In the search for an object in a scene, the probability of interest is the joint probability that the object is present in the current scene, together with the object's probable location (if the object is present), given the observed features. This is calculated using Bayes' rule. Another example of contextual priming is the incorporation of gist (Torralba, 2003b), which represents a semantic classification of the scene. … that humans can do basic-level categorization of complex natural scenes within a …. Given these basic-level classifications at such speeds, it has been suggested that we may rely on intermediate scene representations rather than object-level ones. This is because the eyes have not moved much in such a short time, and thus no scene-wide exploration has been carried out to form object-level representations. These intermediate representations can also be used to recognize unfamiliar scenes (scenes …

… proposed saliency as a solution to a classification problem, with the aim of minimizing the classification error. They first applied this concept to the problem of object …, where high saliency should be assigned to the locations in a scene useful for the task. To accomplish this, they selected a set of discriminative features best representing the class of interest (e.g., faces or cars). The saliency is then defined as a weighted sum of the features that are salient for that class. Thus, an inherent, task-oriented definition of saliency has been … locations that are very different from their surroundings. They used difference-of-Gaussians (DoG) filters and Gabor filters, and measured the saliency of a pixel as the Kullback-Leibler (KL) divergence between the histogram of filtered responses at that pixel and the histogram of the filtered responses in the surrounding region. Their results showed better performance for motion saliency, even in the presence of ego-motion.

Earlier efforts in developing visual attention models were limited to static … (… Watson and Ahumada Jr., 2005; Le Meur et al., 2006; Walther and Koch, 2006; Kienzle et al., 2009; Najemnik and Geisler, 2008). However, more recent models can take into account the spatio-temporal dynamics of the visual input, typically expe… (… 2007; Gao et al., 2008; Itti and Baldi, 2009; Zhang et al., 2009; Seo and Milanfar, 2009; Marat et al., 2009; Bruce and Tsotsos, 2009; Vig et al., 2010; Mahadevan and Vasconcelos, 2010; Zhao and Koch, 2011). As mentioned earlier, many of these models require manually chosen parameters (… et al., 2008; Itti et al., 1998; Torralba et al., 2003; Zhang et al., 2009), e.g., the type of filters, the number of filters to use, etc. In contrast, non-parametric models tend to learn the filters directly from the training stimuli, without any need for parameter tuning.

However, most of the current models lack a simple and general top-down …. Seo and Milanfar (2009) proposed a self-resemblance measure for computing salient regions, and Bruce and Tsotsos (2008) used a self-information measure. On the other hand, Itti and Baldi's (2006) surprise model used Bayesian statistics to compute saliency; here, the prior probability modulated the saliency over time, e.g., a new object in the scene …. Zhao and Koch (2011) have shown that subjects weigh image features differently (e.g., face and orientation were given more importance than colour and intensity); thus, their proposed model accounted for the non-linear integration of feature maps. The feature weights were learned from eye movement data using least-squares regression (Zhao and Koch, 2011) and AdaBoost techniques (Zhao and Koch, 2012). Torralba et al. (2006) attempted to integrate top-down information using contextual modulation. However, the contextual modulation was implemented using image-wide horizontal regions of interest (ROIs). These ROIs were learned from the eye movements on labeled training data; e.g., in a street scene, when asked to search for people or for trees, subjects preferentially looked at the centre of the image and at the top of the image, respectively. This approach is highly specific and not easily generalizable. In addition, the use of horizontal regions of interest appears to be quite limited, as image gist is more likely to be two-dimensional than one-dimensional. Part of this thesis focused on how to integrate top-down information with bottom-up spatio-temporal saliency in a general way. To accomplish this, we first learnt scene categories on unlabeled training data consisting of scene gist descriptors, and verified that the eye movement patterns for the different scene categories were indeed different. Test images were first categorized into the different scene categories using the scene gist descriptor, and then the category-specific eye movement patterns were used to modulate the bottom-up saliency maps for these test images. Finally, we validated the saliency modulation by comparing it to a number of different controls, followed by comparisons to a number of well-known models of visual attention.
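As a rough illustration of this pipeline, the sketch below clusters gist descriptors with k-means, averages the training fixation maps within each cluster to obtain a category-specific attention pattern, and multiplies that pattern into a bottom-up saliency map. The use of k-means, k = 2, and multiplicative modulation are my assumptions for illustration, not necessarily the exact design choices made in the thesis.

```python
# Sketch: category-specific top-down modulation of bottom-up saliency.
# Inputs are assumed precomputed: gist descriptors (n, d), per-clip fixation
# density maps (n, H, W), and a bottom-up saliency map for each test frame.
import numpy as np
from sklearn.cluster import KMeans

def learn_category_patterns(gist_train, fixation_maps, k=2):
    km = KMeans(n_clusters=k, n_init=10).fit(gist_train)
    patterns = np.stack([fixation_maps[km.labels_ == c].mean(axis=0)
                         for c in range(k)])   # mean fixation pattern per category
    return km, patterns

def modulate_saliency(bottom_up_map, gist_test, km, patterns):
    c = int(km.predict(gist_test.reshape(1, -1))[0])  # categorize the test frame
    modulated = bottom_up_map * patterns[c]           # top-down modulation
    return modulated / (modulated.max() + 1e-8)
```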


Chapter 3

Experiment and Analysis

In this chapter, I will present our study on the temporal properties of visual attention (fixation duration). I will give details on our method of collecting eye movement data from human subjects and parsing the eye data into fixations and saccades, followed by the analysis of fixation durations in the context of scene transitions. Our analyses show that fixation durations vary in response to global visual interruptions. In addition, fixation durations can also be used as a behavioural metric to quantify the centre bias.

3.1.1 Participants

Eye movement traces were collected from a total of 32 university students (17 female and 15 male) between the ages of 20 and 28. They were monetarily compensated for their time. All participants had normal or corrected-to-normal vision, and were naive to eye movement experiments. Informed consent was obtained from all participants before the start of the experiment.


Table 3.1: Movie Database

… by a gray background, so it was unlikely that the different frame resolutions would have affected our results.

3.1.3 Procedure

We performed 2 sets of experiments. In the first set of experiments, 11 participants …; this analysis encouraged us to expand the collected psychophysical data by including more participants and longer movies. Thus, in the second set of experiments, 11 participants were presented with longer-duration movies from Group 2. Movies were played on a 21-inch computer monitor with a refresh rate of 120 Hz, at a distance of 56 cm from the subject (corresponding to a 40x30 degree field of view). Subjects were tested for eye dominance, to facilitate better eye tracking. For the eye dominance test, subjects were instructed to extend their arms, with palms facing away. They were then asked to bring their hands together and use the forefinger and thumb of both hands to form a small hole in the middle. This was followed by the instruction to look through the hole and focus on an object about 15 feet away, with both eyes open. Subsequently, they were instructed to close one eye at a time, to find out when the object disappeared from view within the hole. The eye for which the object remained visible in the hole was determined to be the dominant eye. The subject's head was stabilized with a chin-rest, and subjects were instructed to simply watch the movies, which were shown in random order. In order to assess the level of alertness of the subjects, they were told that they would be asked some general questions about the movies at the end of the experiment.

In each experimental session, the movies were blocked into six trials. The order of these movies was varied between participants, to minimize any potential order effects on their fixation patterns. All subjects underwent a calibration procedure before the start of each eye-tracking trial. Subjects were allowed to take breaks between the trials, and were allowed to complete a session over multiple days. In order to maintain alertness during the experiment, we limited each session to a maximum of 45 minutes. Instantaneous eye positions were tracked by a high-speed CMOS camera (CRS Research), utilizing the pupil and dual first Purkinje images … (gaze tracking ranged between -20 and +20 degrees horizontally, and -15 and +15 degrees vertically) for each participant's dominant eye.

Eye tracker time was synchronized with the stimulus onset using the …, tightly synchronized with the refresh interval of the monitor (8.33 milliseconds), to maintain a movie frame rate of 24 frames per second.

3.1.4 Classifying Eye Movements

Monocular eye movements were recorded using the High Speed Video Eye Tracker (HS-VET) from Cambridge Research Systems (CRS), at 250 Hz. Each eye recording session was preceded by a 9-point calibration routine. Eye positions were first transformed into pixel positions for easier comparison with the stimulus. Eye positions too close to the movie image boundary were filtered out to avoid boundary …. A dispersion-based algorithm was used to identify fixations/smooth pursuits, drifts, and saccades. It required two parameters: dispersion and duration. The dispersion parameter was set to have a maximum span equal to the high-acuity zone of the fovea, which subtended … in the stimulus presented on the monitor. The duration parameter was used to set the minimum duration of a valid fixation.
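A dispersion-duration parser of the kind described can be sketched as follows; the 40-pixel dispersion span and the 25-sample (100 ms at 250 Hz) minimum duration are illustrative stand-ins for the actual parameter values used in the thesis.

```python
# Sketch of a dispersion-threshold (I-DT) fixation parser.
# x, y: NumPy arrays of eye positions in pixels, sampled at 250 Hz.
import numpy as np

def idt_fixations(x, y, max_dispersion_px=40, min_duration_samples=25):
    """Return (start, end) sample indices of detected fixations.
    Dispersion = (max(x) - min(x)) + (max(y) - min(y)) over the window."""
    fixations, i, n = [], 0, len(x)
    while i + min_duration_samples <= n:
        j = i + min_duration_samples          # initial window = minimum duration
        disp = (x[i:j].max() - x[i:j].min()) + (y[i:j].max() - y[i:j].min())
        if disp <= max_dispersion_px:
            while j < n:                      # grow the window while compact
                grown = (x[i:j + 1].max() - x[i:j + 1].min()) + \
                        (y[i:j + 1].max() - y[i:j + 1].min())
                if grown > max_dispersion_px:
                    break
                j += 1
            fixations.append((i, j))          # samples i..j-1 form one fixation
            i = j                             # points between fixations: saccades
        else:
            i += 1
    return fixations
```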

A two-dimensional record of the locations visited by the subjects, called the Fixation Map, was obtained by convolving the fixated locations (x, y) with a 2D Gaussian … (40 x 40 pixels), corresponding to the high-acuity foveal zone. This is shown in …. Subsequently, a master fixation density map was obtained, taking into account …; the colour coding reflects the frequency of fixations, with blue showing a low fixation count and red showing a high fixation count.
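The construction amounts to placing an impulse at each fixated pixel and smoothing with a 2D Gaussian. In the sketch below, sigma = 10 is my choice, so that the kernel's effective support roughly matches the 40 x 40 pixel foveal zone mentioned above; the exact sigma is not recoverable from the text.

```python
# Sketch of the fixation (density) map construction described above.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density_map(fixations, height, width, sigma=10.0):
    """fixations: iterable of (x, y) pixel locations; returns a density map."""
    counts = np.zeros((height, width), np.float32)
    for fx, fy in fixations:
        xi, yi = int(round(fx)), int(round(fy))
        if 0 <= yi < height and 0 <= xi < width:
            counts[yi, xi] += 1.0              # impulse at each fixated location
    density = gaussian_filter(counts, sigma)   # convolution with a 2D Gaussian
    return density / (density.sum() + 1e-8)    # normalize to sum to one
```

A master map is then just the normalized sum of the per-subject maps.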

3.2.1 Introduction

We investigated the variability in fixation durations controlled by the onset of the … that fixation durations vary as a function of different tasks and scene onsets. More … (… et al., 2010) found that the onset of a disruption in scene representation (static images) just prior to a fixation altered fixation durations. Henderson grouped the fixation distribution into two sets: early and late fixations. Early fixations were characterized as fixations starting immediately after the onset of the delay, while late fixations were characterized as fixations starting late in the delay. According to Henderson and Smith (2009), early fixations were little influenced by the delay onset, while the late ones showed an increase in fixation durations. However, these experiments were performed using static images that were interrupted by a noise image presented for varying durations. We wanted to study the changes in fixation durations under more natural settings. Thus, we extended the idea to movie stimuli, where numerous scene cuts served as global changes in the visual stimuli, similar to those in the earlier static image experiments. These scene cuts are referred to as scene transitions; Figure 3.3 shows examples of such scene transitions in movies.

Figure 3.3: Examples of scene transitions.

3.2.2 Fixation Duration

We looked at the sequence of fixations around each scene transition. Our analysis focused on the last fixation before the scene transition, the on-going fixation at the time of the transition (the cross-over fixation), and the first fixation after the scene transition. An intuitive explanation of these different types of fixation is illustrated in Figure 3.4.

… data set is shown here as an example. We plotted the median of the fixation durations for the last fixation, the cross-over fixation, and the first fixation, for the entire data set collected over 32 subjects. The 95% confidence intervals on the medians were obtained by bootstrapping 1000 times from the fixation distributions, and subsequently computing the 2.5th and 97.5th percentiles of the distribution of the medians. From the plot, it is apparent that cross-over fixations were elongated compared to last fixations, while first fixations were shortened compared to last fixations. This means that in response to a scene transition, the primary response of the subjects was not to cut short the on-going (cross-over) fixation, but to remain at that location. This may be due to the need to process the new scene and identify new locations to fixate. We will analyze these data in greater detail in the subsequent sections.
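The bootstrap procedure for the medians is straightforward to reproduce; the sketch below follows the stated settings (1000 resamples, 2.5th/97.5th percentiles), with variable names of my own choosing.

```python
# Sketch: bootstrapped 95% confidence interval for a median fixation duration.
import numpy as np

def median_with_ci(durations, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    durations = np.asarray(durations)
    boot_medians = [np.median(rng.choice(durations, size=len(durations),
                                         replace=True))
                    for _ in range(n_boot)]    # resample with replacement
    lo, hi = np.percentile(boot_medians, [2.5, 97.5])
    return np.median(durations), lo, hi        # point estimate and 95% CI
```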

The last fixation was treated as a control fixation, since this fixation completed before the scene transition or the global change. It is considered to be of 'normal duration', since we expect it not to be influenced by any global change induced by the scene transition. We will use the names "control" set and "last" fixation set interchangeably to refer to the distribution of the last fixations from this point onwards in the text.

… the right panel), following each scene transition, and the last fixation (plotted in black in the left panel), preceding each scene transition, sorted by the duration of the cross-over fixation before and after the scene transition (plotted in red and green, respectively). Here, the scene transition boundary occurs at time = 0. The plots show that fixations that started right before the transition (fixations indicated by short red lines in the left plot) and fixations that had been ongoing for a while (long …

Figure 3.4: Fixation durations before and after the scene transition. Our analysis focused on the last fixation before the scene transition, the fixation during the transition (cross-over), and the first fixation after the scene transition. The median durations, along with the 95% confidence intervals, are plotted for each fixation under analysis.
