I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis.

This thesis has also not been submitted for any degree in any university previously.

ZHANG YING
July 2014
Foremost, I would like to express my sincere gratitude to my supervisor, Professor Roger Zimmermann, for his patience, motivation and knowledge. Without his guidance and constant feedback this Ph.D. would not have been achievable.
I would also like to thank my committee members, Prof. Michael S. Brown, Prof. Mohan Kankanhalli and Prof. Wei Tsang Ooi, for their encouragement and insightful comments.
My sincere thanks also go to Prof. Beomjoo Seo, Prof. Yu-Ling Hsueh, Prof. He Ma, Dr. David A. Shamma and Dr. Luming Zhang for discussions, suggestions and ideas for improvement.
I thank my labmates in the Multimedia Management Lab and the Media Search Lab: Jia Hao, Haiyang Ma, Zhijie Shen, Xiangyu Wang, Xiaohong Xiang, Yangyang Xiang, Tian Gan, Guanfeng Wang, and Yifang Yin for their support of my research work.
I am also very grateful to all my friends in Singapore, who are always so helpful in numerous ways.
Last but not least, I would like to thank my family for all their love and encouragement. For my parents Shusheng Zhang and Li Yu, who raised me with a love of science and supported me in all my pursuits. And for my beloved husband Haitao Zhao, whose faithful support during my Ph.D. is so appreciated. Thank you.
Contents

Summary

1 Introduction
1.1 Background and Motivation
1.2 Research Challenges and Contributions

2 Literature Review and Preliminaries
2.1 Literature Review
2.1.1 Video Summarization
2.1.2 Visual Exploration in Geo-space
2.1.3 Visual Recommendation
2.2 Preliminaries
2.2.1 Symbolic Notations and Abbreviation
2.2.2 Description of Viewable Scene in Videos
2.2.3 Geo-referenced Video Dataset Description

3 Static Summarization From Multiple Geo-referenced Videos
3.1 Introduction
3.2 Single Video Summarization
3.3 Multi-video Summarization for a Landmark
3.3.1 Subshot Selection
3.3.2 Subshot Assembly
3.4 Multi-video Summarization for an Area
3.5 Experiments
3.5.1 Classification Accuracy
3.5.2 Keyframe Summary for Single Video
3.5.3 Landmark Multi-Video Summary
3.5.4 Multi-Landmark Multi-Video Summary
3.6 Summary

4 Region of Interest Detection and Summarization from Geo-referenced Videos
4.1 Introduction
4.2 Regions of Interest Detection from Multiple Videos
4.2.1 Probabilistic Model for ROI Detection
4.2.2 Capture Intention Distribution for Video Frames
4.2.3 ROI Detection from Multiple Videos
4.3 Summarization From Multiple Geo-referenced Videos
4.4 Experiments
4.4.1 ROI Detection
4.4.2 Multi-Video Summarization
4.4.3 Algorithm Robustness and Efficiency
4.5 Summary

5 Quality-Guided Multi-Video Summarization by Graph Formulation
5.1 Introduction
5.2 Aesthetics Assessment for User Generated Videos
5.3 Multiple Video Summarization
5.3.1 Graph Construction
5.3.2 Dynamic Programming Solution
5.4 Experiments
5.4.1 Aesthetics Evaluation for User Generated Videos
5.4.2 Aesthetics-Guided Multi-Video Summarization
5.5 Summary

6 Interactive and Dynamic Exploration Among Multiple Geo-referenced Videos
6.1 Introduction
6.2 Video Segmentation
6.2.1 Landmark Popularity
6.2.2 Landmark Completeness
6.3 Video Summarization
6.4 Online Query Process
6.4.1 Data Structure Design
6.4.2 Data Indexing
6.4.3 Online Query Processing
6.5 Experiments
6.5.1 Offline Video Summarization
6.5.2 Online Query
6.6 Demo System
6.7 Summary

7 Camera Shooting Location Recommendation for Objects in Geo-Space
7.1 Introduction
7.2 Relevance Filter for Landmark Photos
7.3 Photo Aesthetic Quality Measurement
7.3.1 Color Features
7.3.2 Texture Features
7.3.3 Spatial Distribution Features
7.3.4 Feature Classification
7.4 Camera Location Recommendation
7.4.1 GMM-based Camera Location Clustering
7.4.2 Camera Location Recommendation
7.4.3 Spatio-temporal Camera Spot Recommendations
7.5 Experiments
7.5.1 Data Setup
7.5.2 Image Filtering
7.5.3 Image Quality Measurement
7.5.4 Viewing Statistics
7.5.5 Camera Shooting Location Recommendations
7.6 Demo System
7.7 Summary

8 Conclusions and Future Work
8.1 Conclusions
8.2 Future Work
8.2.1 Adaption to Updated Video Set
8.2.2 Summarizations According to Video Categories
8.2.3 Audio Quality Evaluation
8.2.4 Summary with Crowdsourcing Knowledge
Summary

In recent years, we have witnessed an overwhelming number of user-generated videos being captured on a daily basis. An essential reason is the rapid development in camera technology: videos are now easily recorded on multiple portable devices, especially mobile smartphones. Such flexibility encourages modern videos to be tagged with various additional sensor properties. In this thesis, we are interested in geo-referenced videos, whose metadata is closely tied to geographic identifications. These videos have great appeal for prospective travelers and visitors who are unfamiliar with a region, an area or a city. For example, before someone visits a place, a geo-referenced video search engine can quickly retrieve a list of videos that were captured in this place, so the visitors can obtain an overall visual impression conveniently and quickly. However, users face the prospect of an ever increasing viewing burden as the size of these video repositories keeps growing and, as a result, more videos become relevant to a search query. To manage these video retrievals and provide viewers with an efficient way to browse, we introduce a novel solution that automatically generates a summarization from multiple user-generated videos and presents their salience to viewers in an enjoyable manner.

This thesis consists of three major parts. In the first part, we introduce three pieces of work that produce a preview video summarizing a sub-area in geo-space from multiple videos. Several metrics are proposed to evaluate the summary quality, and a heuristic method is used to determine the video (segment) selection and connection. One of the key features of our technique is that it leverages geographic contexts to create a satisfactory summarization result automatically, robustly and efficiently. We also propose a graph-based model to formulate this summary problem, which can be applied to general videos. In the second part, an interactive and dynamic video exploration system is built where people can conduct personalized summary queries through direct map-based manipulations. In the third part, we investigate whether external crowdsourced databases contribute to improving the summary quality. Proposing a GMM model and integrating visual and social knowledge, we recommend a list of locations to be preferentially selected in a summarization, as they have the potential to capture appealing photos.
List of Tables

2.1 Symbolic notations and abbreviation
2.2 GeoVid dataset description
3.1 Subshot classification results
4.1 Dataset description
4.2 ROI evaluation
4.3 Dataset description for robustness test
4.4 Time analysis of our summary method
5.1 Video aesthetics evaluation comparison of three methods
6.1 User study for summary quality evaluation
6.2 Processing time comparison of three methods
6.3 Time components in the whole simulation
7.1 Classification error rates of individual features
7.2 Classification error rates of combo features
7.3 Viewing statistics for Flickr photos

List of Figures

1.1 Conceptual view of geo-referenced videos
1.2 Overview of the geo-referenced video search system
1.3 Overview of thesis contributions
2.1 Overview of related work
2.2 Illustration of FOV in 2D space
3.1 Overview of system architecture (Chapter 3)
3.2 System architecture for static video summarization
3.3 Conceptual illustration of selection metrics
3.4 Grid-based illustration for redundancy and coverage loss
3.5 Coverage detection
3.6 A coverage subshot's motion classification detection
3.7 Illustration of transition cost
3.8 Keyframe summary of a sample video
3.9 Original video trajectory
3.10 Angular difference curves
3.11 Coverage subshots
3.12 Summary length comparison
3.13 3D BaseMatrix
3.14 Video skim trajectories with different thresholds
3.15 Video skim trajectories with different factor weights
3.16 Thumbnails of a summary
3.17 Trajectory of a landmark (MBS) summary
3.18 Trajectory of a landmark (Esplanade) summary
3.19 Comparison of trajectories generated by two methods
3.20 Inconsistency factor comparison of two methods
3.21 Overall inconsistency comparison of two methods
3.22 Trajectory of a region summary
4.1 Overview of system architecture (Chapter 4)
4.2 Angular difference
4.3 Histogram presentation of a video
4.4 ROI maps generated by three methods
4.5 Summary comparison (reconstructed ROI) of three methods
4.6 Summary comparison (overall quality) of three methods
4.7 Robustness evaluation of three summary methods
4.8 Efficiency evaluation of three summary methods
4.9 Time components of our summary method
5.1 Overview of system architecture (Chapter 5)
5.2 Workflow of video aesthetics evaluation
5.3 User study for video aesthetics evaluation
5.4 Video aesthetics evaluation by video category
5.5 Video aesthetics evaluation comparison of three methods
5.6 Summary comparison (overall quality) of three methods
5.7 Summary comparison (individual factors) of three methods
5.8 Summary comparison (reconstruction rate) of three methods
5.9 Robustness evaluation (overall quality) of three methods
5.10 Robustness evaluation (individual factors) of three methods
5.11 Frame samples of our method
5.12 Frame samples of method 2
5.13 Frame samples of method 3
6.1 A conceptual example of our system
6.2 Overview of system architecture (Chapter 6)
6.3 Index structure overview
6.4 User study for summary quality and robustness evaluation
6.5 Online processing time analysis
6.6 Interface for dynamic video summarization
7.1 Overview of system architecture (Chapter 7)
7.2 Tags for sample photos
7.3 Illustration of GMM-based camera spot clustering
7.4 Evaluation of image filter results
7.5 CDF of photo quality
7.6 GMM distributions for landmarks
7.7 Sample photos for landmarks
7.8 Contour maps for landmarks
7.9 Spatial/temporal bursts for MBS (Example 1)
7.10 Spatial/temporal bursts for MBS (Example 2)
7.11 Spatial/temporal bursts for Singapore Flyer
7.12 User study results, part 1
7.13 User study results, part 2
7.14 Interface for camera location recommendation system
Introduction

1.1 Background and Motivation

In recent years, with the advancements in camera technologies, we have witnessed a flourish of videos being produced on a daily basis, and the volume of these videos is growing at a rapid rate. These videos are widely shared on the Internet and take up an increasing part of overall Internet traffic [2, 6, 90, 79]. Cisco reported that video traffic already accounted for 66% of all Internet traffic in 2013, and this share is estimated to reach 79% by 2018 [2]. Netflix, an American on-demand Internet streaming media provider, by itself accounted for 33% of North American web traffic during peak periods in 2013 [6]. It was revealed by YouTube in 2014 that over 6 billion hours of video were watched each month and that 100 hours of video were uploaded every minute [15].
With the rapid growth of video sharing websites, these video resources are instantly available to users through a simple search. Viewing relevant videos assists users in obtaining an overall visual impression of an object easily. Such convenience greatly benefits the public in various disciplines such as education and entertainment [87]. Compared to other knowledge acquisition means, searching through videos has gained popularity due to several unique features. First, videos have intensive dynamics in their contents. These movements can grab the viewer's attention more easily compared to static media resources such as images or text [13]. Second, videos contain rich information from not only visual but also audio and contextual aspects, which lays a good foundation for a thorough understanding. Third, videos are contiguous in the temporal dimension. As people's visual impression of an object comes from knowledge accumulation [49], such temporal continuity offers a better chance for comprehensive understanding.
At the same time, the rapid development in video recording technology has made it possible for videos to be captured on multiple devices, especially mobile ones. It is predicted that worldwide camera-phone shipments will grow from nearly 1.3 billion units in 2012 to over 1.6 billion units by 2017 [14]. In a survey by Photobucket, 45% of the 2,500 respondents used mobile devices to shoot video at least once per week in 2011 [10]. The flexibility provided by having a video recording function within every smartphone, which people constantly carry, endows modern videos with some new features:
Content Diversity: Traditionally, videos were mostly created by a limited number of media producers. Today, however, a much larger proportion of multimedia documents is created by the general public, and we refer to this as User Generated Content (UGC). Flickr reported that in July 2014 the camera on the Apple iPhone was the most popular camera used to capture its hosted photos and videos, with 114,693,845 uploads by the iPhone 5 alone [1]. Among User Generated Videos (UGV), statistics have demonstrated that the number of creators grew from 15.4 to 27.2 million between 2008 and 2013 in the US alone [96]. Captured by different photographers under various environmental conditions, these consumer videos reveal the real appearance of a target object [71] from diverse perspectives [29]. Additionally, and more importantly, such videos are potentially much more up-to-date: popular objects are almost constantly being recorded by users and therefore reflect their latest views.
Metadata Richness: The integration of multiple sensors on capture devices makes it possible to record additional contextual metadata in conjunction with videos. These contexts indicate various sensor properties (for example, but not limited to, location, orientation, acceleration, luminosity and temperature) and are correlated with each other. We call such videos sensor-rich videos. Among this sensor metadata, some contexts are closely tied to certain geographic identifications, and we call videos with such geographic sensor properties geo-referenced videos. These sensor contexts show great potential in video semantics analysis [32, 89, 118]. For example, the location (read from GPS) indicates where a video was captured from, while the orientation (read from a compass) narrows the visible scope where the capture intention is located. Although some of this knowledge could also be derived from video content analysis, processing contextual metadata provides an alternative solution with a much lighter computational complexity. A conceptual view of geo-referenced videos is illustrated in Fig. 1.1: the viewable scene of each sensor-tagged video frame covers an area in geo-space (represented by a pie shape), and the locations of consecutive video frames compose a trajectory.
Figure 1.1: Conceptual view of geo-referenced videos. The viewable area for each video frame is presented by a pie shape (detailed description in Section 2.2.2), and the video trajectories are displayed on a map. A rectangular range query retrieves all videos whose viewable area overlaps with the rectangle.
These geo-referenced videos have great appeal, for example, for prospective travelers and visitors who are unfamiliar with a region, an area or a city. For example, before they visit a place, a geo-referenced video search engine can quickly retrieve a list of videos that were captured in this place, so the visitors can obtain an overall visual impression conveniently and quickly. The overview of a complete geo-referenced video search system is illustrated in Fig. 1.2. Mobile users collect sensor metadata during video recording (Video Acquisition). The data (both the video contents and contexts) is uploaded and stored on a remote server (Video Storage). End-users post queries through a web-based interface (Video Retrieval) and watch the retrieved videos on various devices. The prototype of such a system was introduced in the work by Arslan Ay et al. [21], in which the posted queries can be any spatio-temporal queries, such as a point, a line, a poly-line, a circular area, a rectangular area or a polygon area (a rectangular area query is shown in Fig. 1.1). For each query, all video clips¹ whose viewable scenes overlap with the query are automatically retrieved to ensure retrieval of pertinent information. The users then need to view all the videos to arrive at their final selections.

¹The terms video segment and video section are used interchangeably in this thesis.
Figure 1.2: Overview of the geo-referenced video search system.
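To make the viewable-scene model concrete, the following is a minimal sketch of a pie-shaped FOV and a rectangular range query. It assumes a local planar x/y coordinate system in meters; the field names and the sampling-based overlap test are illustrative assumptions rather than the system's actual implementation (the exact model is defined in Section 2.2.2).

```python
import math
from dataclasses import dataclass

@dataclass
class FOV:
    x: float        # camera position (local planar coordinates, meters)
    y: float
    heading: float  # compass direction of the optical axis, degrees
    angle: float    # total viewable angle of the lens, degrees
    radius: float   # maximum visible distance, meters

def fov_overlaps_rect(fov, xmin, ymin, xmax, ymax, steps=16):
    """Approximate overlap test: sample points across the pie-shaped
    sector and report True if any falls inside the query rectangle."""
    half = math.radians(fov.angle) / 2.0
    center = math.radians(fov.heading)
    for i in range(steps + 1):
        bearing = center - half + 2.0 * half * i / steps
        for frac in (0.25, 0.5, 1.0):  # sample along each viewing ray
            px = fov.x + fov.radius * frac * math.sin(bearing)
            py = fov.y + fov.radius * frac * math.cos(bearing)
            if xmin <= px <= xmax and ymin <= py <= ymax:
                return True
    # the camera position itself may lie inside the rectangle
    return xmin <= fov.x <= xmax and ymin <= fov.y <= ymax

# A rectangular range query then reduces to filtering every frame's FOV:
# hits = [f for f in all_fovs if fov_overlaps_rect(f, 0, 0, 500, 500)]
```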
However, the volume of user-generated videos keeps increasing, and hence more relevant videos are retrieved for the same query. End-users thus face the risk of a growing viewing burden, as it takes effort to browse all retrieved videos. This problem is especially serious for mobile users, because mobile devices usually have limited screen size, storage space and battery life, so browsing massive results degrades the viewing satisfaction. On top of this, existing systems rank the results according to their relevance (spatial or temporal overlap) to the queried space, but ignore their representativeness. Redundancy therefore exists and will worsen as more videos are uploaded to the repository. All these disadvantages motivate us to reorganize the massive retrieval results and present a skim version to the viewers. To solve this problem, in this thesis we adopt the technique of video summarization.

Video summarization, also named video abstraction, is one of the most common and quickest ways to get the highlights from a long list of information pieces [66]. Video summarization creates a shorter video clip or a video poster which includes only the important scenes in the original video streams, so that users can gain a good understanding of a video document without watching the entire video.
Video summarization was first introduced by Lienhart et al. [61] and has been studied intensively in many works, with the majority leveraging computer vision techniques, for example by tracking low-level features such as color, structure, motion and audio [39, 65, 43, 69, 62, 103, 16, 111]. Such low-level feature based techniques usually achieve good results, but with the main drawback that they are computationally complex. For example, the color and illumination attributes are usually pixel-based information, and since there may be millions of pixels in each frame, the whole analysis can be time-consuming. Another obvious disadvantage is that these low-level features are more content-oriented than semantic-oriented, so many studies only work for domain-specific videos where the domain knowledge must be known in advance. This has inspired more researchers to incorporate external information from outside of the videos themselves [85, 93, 18, 105]. However, most of them gather such external information either in a logistically cumbersome way, such as user studies, or rely on intensive pre-known domain knowledge [19, 105]. Moreover, all these studies target single-video summarization. There exist only a few studies considering the multiple-video scenario [58, 60, 83, 97], most of which adopt similar techniques extended from single-video summarization. An obvious disadvantage of these methods is that the whole computational workload increases rapidly, resulting in inefficiency with a big video repository. Additionally, all of these methods produce a keyframe-based summarization rather than a segment-composed one, so they ignore the relations among different video clips, i.e., how to determine the order among the selected video clips.

All the above problems motivate us to produce a summarization from multiple geo-referenced videos; we discuss the research challenges in the next section.
1.2 Research Challenges and Contributions

In this thesis, we design a method to produce a segment-composed video summarization (video skim) from multiple user-generated geo-referenced videos. Several research challenges arise:

• There exists no previous work on video skim generation from multiple videos. Our first challenge is therefore to formulate this problem so that it reflects the unique features of the multi-video domain.

• Once we have created a summary video, how can this result be explored by users in a convenient and efficient manner?

• We are also interested in whether external crowdsourcing can improve the overall summarization quality.

• We would like to concretize the answers to the above questions for our geo-referenced videos by fully leveraging the geographic metadata.
Based on the above challenges, the main contributions of this thesis, as shown in Fig. 1.3, consist of three major parts: summarization formulation, exploration and improvement. In the first, summarization, part, we introduce three pieces of work that produce a preview video summarizing a sub-area in geo-space from multiple videos. In the second part, an interactive and dynamic video exploration system is built where people can conduct personalized summary queries through direct map-based manipulations. In the third part, we investigate whether external crowdsourced databases contribute to improving the summary quality. A brief introduction to each of these five works follows:

Figure 1.3: Overview of the main contributions of this thesis. There are three main components: summarization, exploration and improvement.
1. Static Summarization from Multiple Geo-Referenced Videos. We take a rectangular region query as an initial attempt and investigate how to create a summary for this region from its retrieved videos. We expect a summary to present the most salient contents within this target space. For geo-referenced outdoor videos, such salience is represented by landmarks, which could be buildings, statues or any other popular objects. The summarization problem is thus converted into how to generate a summary for each landmark and how to determine the traveling route among the summaries for different landmarks. We propose three metrics to evaluate the summary quality, namely coverage, redundancy and inconsistency, and propose a heuristic solution that optimizes each of these factors (a simplified sketch of this selection idea appears after this overview). The summary length is fixed (static) and depends completely on the original information among the input videos.
2. Region of Interest Detection and Summarization from Geo-referenced Videos. The above work requires users to specify which landmarks should be included in the summary. Additionally, the summary is created by a greedy solution which includes all viewing perspectives for each landmark, so there is no control over the summary length. In this work, we therefore detect the landmarks, more commonly known as regions of interest (ROIs), from the input videos in an automatic manner according to the videos' geographic properties. Furthermore, we take the summary length into consideration and refine the criteria for evaluating summary quality. An active learning technique is adopted to select the most informative video segments.
3. Quality-Guided Multi-Video Summarization by Graph Formulation. Our previous studies do not consider diversity in terms of video quality, which is a typical characteristic of user-generated content, and poor quality videos seriously degrade the viewing experience. In this work, we refine the summarization problem to include a quality factor. Additionally, our previous works adopt heuristic solutions that create a summary by optimizing each evaluation metric separately; in this work, we consider these metrics jointly. A graph formulation is adopted to model the summarization problem, with a dynamic programming based solution to achieve the best result (see the dynamic programming sketch after this overview). This solution can be applied to general videos, as the main technique does not depend on geographic metadata but is purely based on the videos' visual contents.
4. Interactive and Dynamic Exploration among Geo-Referenced Videos. Once we have obtained a summary from multiple videos, in the second part we address another critical issue: how to explore the summaries conveniently. Inspired by the popular route suggestion function in Google Maps and the panoramic view function in Street View, we propose a system to explore the summary for a user-specified route in an interactive and dynamic way. With efficient data indexing, the system can rapidly retrieve a video answer for a desired tour path in real time. Moreover, it can quickly satisfy query updates during the playback of the previous query to produce personalized and satisfying results, and it adapts elegantly when new videos are added to the database.
5. Camera Shooting Location Recommendation for Objects in Geo-Space. During the summarization for an object, an important criterion is how well a video clip presents the visual appearance of this object. Recently, community-contributed photo collections (e.g., Flickr) have attracted researchers' interest, because they provide not only multimedia documents but also rich statistics, including both location metadata and user behaviors. This arouses our interest in investigating whether the relation between the camera location and the object location affects people's preferences. This work recommends to users a list of locations where they may be able to capture appealing landmark photos by themselves (see the GMM-based sketch after this overview).
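As an illustration of the selection idea in the first work, here is a deliberately simplified greedy sketch: each candidate subshot is reduced to the set of geo-grid cells its viewable scenes cover (following the grid-based view of coverage and redundancy used in Chapter 3), and subshots are picked to maximize new coverage while penalizing redundancy. The redundancy weight and the omission of the inconsistency term are simplifications of this sketch, not the thesis method.

```python
def greedy_select(subshots, redundancy_weight=0.5):
    """subshots: dict mapping a subshot id to the set of grid cells its
    viewable scenes cover; returns subshot ids in selection order."""
    covered, summary = set(), []
    while True:
        best_id, best_gain = None, 0.0
        for sid, cells in subshots.items():
            if sid in summary:
                continue
            new = len(cells - covered)            # coverage gain
            dup = len(cells & covered)            # redundancy cost
            gain = new - redundancy_weight * dup
            if gain > best_gain:
                best_id, best_gain = sid, gain
        if best_id is None:                       # no positive gain left
            break
        summary.append(best_id)
        covered |= subshots[best_id]
    return summary

# Toy example: three overlapping subshots over a small grid
print(greedy_select({"a": {1, 2, 3}, "b": {3, 4}, "c": {2, 3}}))  # ['a', 'b']
```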
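The graph formulation of the third work can be sketched as follows: candidate segments become nodes rewarded by a quality/aesthetics score, transitions between segments carry costs, and dynamic programming finds the best-scoring summary of a target length. The score and cost functions below are placeholders; Chapter 5 defines the actual metrics and their joint weighting.

```python
def best_summary(nodes, score, trans_cost, length):
    """nodes: list of segment ids; score(n): reward of segment n;
    trans_cost(a, b): penalty for playing segment b right after a."""
    # dp[k][n] = (best value of a (k+1)-segment summary ending at n, pred)
    dp = [{n: (score(n), None) for n in nodes}]
    for _ in range(1, length):
        layer = {}
        for n in nodes:
            cands = [(v + score(n) - trans_cost(p, n), p)
                     for p, (v, _) in dp[-1].items() if p != n]
            layer[n] = max(cands)
        dp.append(layer)
    end = max(dp[-1], key=lambda n: dp[-1][n][0])  # best final segment
    path = [end]
    for k in range(length - 1, 0, -1):             # backtrack predecessors
        end = dp[k][end][1]
        path.append(end)
    return list(reversed(path))

# Toy usage: three segments with aesthetic scores and a flat transition cost
q = {"a": 0.9, "b": 0.5, "c": 0.7}
print(best_summary(list(q), q.get, lambda p, n: 0.1, length=2))  # ['c', 'a']
```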
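Finally, the clustering step of the fifth work can be sketched with an off-the-shelf Gaussian mixture: fit it to the camera locations of a landmark's photos, then rank the clusters by mixture weight combined with the mean aesthetic score of the member photos. The 50/50 combination below is an assumption for illustration; Chapter 7 derives the actual ranking.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def recommend_spots(locations, aesthetic_scores, n_spots=5):
    """locations: (N, 2) array of camera lat/lon; aesthetic_scores: (N,)
    per-photo quality scores; returns cluster centers, best spot first."""
    gmm = GaussianMixture(n_components=n_spots, random_state=0).fit(locations)
    labels = gmm.predict(locations)
    ranked = []
    for k in range(n_spots):
        member = aesthetic_scores[labels == k]
        quality = member.mean() if member.size else 0.0
        # popularity (mixture weight) blended with member photo quality
        ranked.append((0.5 * gmm.weights_[k] + 0.5 * quality, gmm.means_[k]))
    ranked.sort(key=lambda t: t[0], reverse=True)
    return [mean for _, mean in ranked]
```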
The remainder of this dissertation is organized as follows. Chapter 2 provides a comprehensive literature survey of relevant prior work and preliminaries. Chapters 3 to 7 present the five individual but correlated works investigated in this dissertation: Chapter 3 details a static solution for multiple video summarization; Chapter 4 describes how to detect regions of interest and refines the summarization problem; Chapter 5 introduces a general solution to the multi-video summarization problem for any video; Chapter 6 describes an interactive and dynamic means to explore videos; and Chapter 7 recommends a list of locations from which to capture appealing photos. Finally, conclusions are presented in Chapter 8, together with plans for future work.
Literature Review and Preliminaries
This chapter first investigates the existing studies that are relevant to our work. The review mainly covers three fields: video summarization, visual exploration in geo-space, and visual recommendation. As illustrated in Fig. 2.1, each colored circle refers to a related topic and each arrow indicates how this topic is related to the contributions of this thesis.
1. Our main target is to create a video summarization, so we look into the mainstream techniques in this field. We start from single video summarization, as it shares some basic concepts and common solutions with multi-video summarization. Then a few existing studies on multi-video summarization are discussed.
2. Our work generates summarizations from geo-referenced videos, so we investigate how current studies associate geo-properties with different media documents. A survey of visual exploration in geo-space is presented.

3. Our work investigates a priority selection among videos with certain contexts or contents. This is highly related to work on visual recommendation, so we review several related works in this field and discuss their pros and cons.
Figure 2.1: Overview of related work.
2.1 Literature Review

2.1.1 Video Summarization

Video salience, or video highlights, can be defined by either the targeted audience or the video type. From the targeted audience perspective, some professionals desire a summarization for special purposes, so they pre-define salient features according to such purposes (e.g., dominant color, camera motions) and extract video subparts according to these features. From the video type perspective, the salient information depends intensively on the concrete application, so highlights could be scenes, events or special objects. For example, in a sports video, the events with goals mostly attract the audience; in a tourism video, the landmark buildings may be appealing; and in a movie, some conversation scenes are usually important.
To detect the video highlights and divide a video into multiple segments, proper features are extracted from either internal or external sources [66]. Internal features are extracted from the video stream itself. A video usually consists of three channels: image, audio and text. If a video summary is generated by analyzing features from these three channels, we call it an internal-feature-based video summary. However, it has recently been recognized that video sources in isolation do not fully support the understanding of video semantics [66, 56]. That is why an increasingly large portion of recent research pays attention to external information that is related to the video streams. The external information is called user-based if collected through human interaction (e.g., user studies), or context-based (e.g., video metadata) otherwise.
The last step in video summarization is to present the detected video salience to the audience in a proper way. Previous work has proposed mainly two kinds of presentations: the keyframe-based video summary and the video skimming (skim). A keyframe-based summary, also called a still-image abstract or static storyboard, consists of a small collection of images extracted from the underlying video source. A video skim, also called a moving-image abstract or moving storyboard, consists of a collection of images together with their corresponding audio abstract from the original video. Both summarizations are subparts of the original video; the only difference is whether they are composed of a series of independent images or of video segments that come with audio streams.
In the next parts, we investigate the mainstream techniques for single video summarization and for multiple video summarization.
Single Video Summarization
In the field of video summarization, the majority of existing techniques are based on low-level media signal features, which are computationally complex to extract. For example, Ekin et al. [39] prepared a summarization by analyzing the dominant color in each video frame. Their algorithm worked only for soccer videos, based on the assumption that for a certain kind of view shot in a soccer game, the color distribution should stay relatively steady. Segment boundaries were therefore determined by checking the difference in dominant color among consecutive video frames. The segments were assigned to pre-defined categories, and a summarization composed of the important categories was produced. Ma et al. [65] leveraged features such as color contrast, intensity contrast and orientation contrast to model human attention on a particular image. For a video, each frame was then assigned an attention score and a time-series score curve was generated accordingly; segments with high scores were selected as video salience. Gong et al. [43] used a color histogram to measure the similarity between frames/shots, and similar contents were filtered out to reduce the overall redundancy in a summary. Xu et al. [106] presented a novel framework defining a set of audio keywords for event detection in soccer videos. Audio keywords are a middle-level representation that can bridge the gap between low-level features and high-level semantics. In their work, they defined three main audio keywords: whistling, commentator speech and audience sound. For each main class, different features were selected by a hierarchical SVM classifier from candidate audio features such as Mel-frequency cepstral coefficients, zero-crossing rates and linear predictive codings. For any other soccer video, these well-trained audio keywords and features could then be used to detect the video salience and produce a summary accordingly.
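As an aside, the color-histogram similarity mentioned above (e.g., used by Gong et al. for redundancy filtering) is easy to sketch; the bin count and threshold below are arbitrary choices for illustration, not values from the cited works.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """frame: (H, W, 3) uint8 array; returns a normalized joint RGB histogram."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist / hist.sum()

def is_near_duplicate(frame_a, frame_b, threshold=0.85):
    """Histogram intersection: near 1.0 for visually similar frames."""
    h1, h2 = color_histogram(frame_a), color_histogram(frame_b)
    return np.minimum(h1, h2).sum() >= threshold
```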
Text is also important information in videos, as it can be closely related to the video semantics and hence contributes to, for example, event detection in video summarization. Text is extracted from either direct or oblique sources. For direct sources, text is in most cases presented in superimposed captions, textual overlays, graphical inserts, tags and subtitles; such text is embedded in the video stream rather than in a separate stream. Zhang et al. [111] proposed using the on-screen sports statistics to identify important events in a sports game. For oblique sources, text is recognized by speech recognition or indicated from domain knowledge by data training, so spoken words such as "goal", "shot" and "save" can be detected to construct a keyword database, which later helps to identify events [105].
However, recent research has found that video sources in isolation cannot fully support content analysis [56, 66]. Hence, many studies have leveraged external information to detect video salience. This external information comes from various resources: for example, it can be obtained through user surveys [85, 93], from the tracking of human physical responses [18], or from public news on the Internet [105]. However, there are two major drawbacks to these methods: first, they involve users in gathering information, which is logistically cumbersome; second, the third-party information is mostly domain-dependent, so these techniques can only be applied to narrow, specific scenarios [19, 105].
Multiple Video Summarization
The earlier mentioned techniques consider summarization from a single video. For the multi-video scenario, only a few studies exist. Li et al. [59] proposed an algorithm named Video Maximal Marginal Relevance (Video-MMR), extending the Maximal Marginal Relevance criterion originally used in text summarization to the video domain. Two variants of Video-MMR were suggested, and the authors proposed a criterion to select the best combination of parameters for Video-MMR. The authors then compared two summarization strategies: Global Summarization, which summarizes all the individual videos at the same time, and Individual Summarization, which summarizes each individual video independently and concatenates the results. Extending this idea, they further presented Audio Video Maximal Marginal Relevance (AV-MMR), an algorithm for video summarization exploiting both audio and video information [58]. AV-MMR iteratively selects the segments which best represent the unselected information and are non-redundant with the previously selected information. AV-MMR is a generic algorithm suitable for both single and multiple videos of multiple genres. Several variants of AV-MMR were proposed later, and the best one was identified by experimentation. Besides, a visual representation of the coherence of audio and video information for a set of audio-visual sequences was also proposed. Combining both Video-MMR and AV-MMR, the authors estimated the balance factor between audio and visual information, and proposed a method named OB-MMR [60]. The above three methods share a similar idea for multi-video summarization but leverage different features from aural, visual, or hybrid combinations: the most distinguished frames are selected as keyframes in the final summary through a global comparison among the frames of all input videos. However, there are two primary drawbacks to these methods. The first is their computational complexity, as the global comparisons usually take great effort and are time-consuming; the methods cannot be efficiently adapted to a growing video repository, since the global decision needs a re-computation. The other drawback is that they do not relate the keyframes to the video saliency and do not determine the order in which to assemble the frames. This may be ignored for a single-video summary, as the frames can always be arranged according to their original timeline, but this simple approach fails in the multi-video case.
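The MMR-style selection loop these methods share can be rendered schematically as below; `sim` stands for any pairwise segment similarity (visual, audio, or a weighted hybrid as in AV-MMR), and the representativeness/redundancy trade-off weight is an assumption for this sketch.

```python
def mmr_select(segments, sim, k, lambda_=0.7):
    """Iteratively pick the segment that best represents the not-yet-
    selected content while staying non-redundant with prior picks."""
    selected = []
    candidates = set(segments)
    while candidates and len(selected) < k:
        def mmr_score(c):
            rep = sum(sim(c, o) for o in candidates if o != c)  # coverage
            red = max((sim(c, s) for s in selected), default=0.0)
            return lambda_ * rep - (1.0 - lambda_) * red
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```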
Similarly, Shao et al. [83] proposed an approach that analyzes both visual and textual features across a set of videos and creates a so-called circular storyboard composed of topic-representative keyframes and keywords. In this method, video segments are extracted from visual data and keywords for each segment are detected from speech transcripts. This information works together with the visual contents to construct a complex graph, and hidden topics are detected by a clustering-based graph analysis. A summary is produced to maximize the coverage of the complex graph over the original video set.
Wang et al. [97] proposed an approach for multi-document video summarization that explores the redundancy among different videos. The importance of keyframes is first measured by content inclusion based on intra- and inter-video similarities. A minimum description length (MDL) criterion then automatically determines the appropriate length of the summary. Finally, a video summary is generated for users to browse the contents of the whole video article. Results showed that multi-document video summarization was perceived to be more elegant and informative than the single-document approach.
Although a display order is decided, these content-based summarization techniques still have difficulty overcoming semantic gaps to indicate relationships among videos [66]. In contrast, our work first defines the potential highlights of videos, then extracts both keyframes and skims from sensor data with a lightweight computational workload, and finally connects these pieces in a useful way.
Summary and Discussion
In the video summarization field, the majority of existing studies work on a single video by analyzing its low-level signal features. When these techniques are extended to the multi-video domain, they face the big challenge of quickly increasing computational complexity. Secondly, in these summary videos, the composing segments are connected in the order in which they appear in the original video, but such a connection order is unknown if they come from different source videos. Additionally, many of these existing studies are heavily domain-dependent, so it is infeasible to deploy them on geo-referenced videos. All the above challenges motivate us to reformulate the summarization problem in the multi-video domain and to fully leverage the videos' geographic contextual data to improve computational efficiency.
2.1.2 Visual Exploration in Geo-space
The exploration of multimedia documents in geo-space is closely tied to the exploration target, so we divide the related work in this field into three parts: exploration for a landmark, for a region, and for a route.
Landmark Exploration
For a queried landmark, photos are mostly searched either by their names or by a sample photo through similarity matching on visual features, labeled tags or other fusions [41, 22, 81]. Fergus et al. [41] proposed a method extending the constellation model to include heterogeneous parts which may represent either the landmark appearance or the geometry of a region of the object. Such a model can be employed for ranking the output of an image search engine when searching for object categories. Berg et al. [22] demonstrated a method for identifying images containing an object across a wide range of aspects, configurations and appearances. Their results are shown to be accurate despite these variations and rely on four simple cues: text, color, shape and texture. Such techniques are used in image search engines, and the most similar results are returned. To avoid redundancy at the top of the search result list, studies have started investigating how to produce a set of representative photos for a queried object or scene [86, 54, 75]. For example, Kennedy et al. [54] proposed generating diverse and representative images for landmarks from community-contributed photo sets based on visual feature clustering. Features in each image were extracted and
Trang 36matched with the others to generate a linked graph where the edge weightsindicate their visual connection Clusters with dense connections were de-tected, so for each cluster, photos had similar visual representations and aset of diverse representative clusters were returned to the final recommen-dation list Simon et al [86] formulated a scene summarization problem byselecting a set of images that efficiently represents the visual content of agiven scene The selection covered the most interesting aspects of the scenewhile keeping minimal redundancy They examined the distribution of im-ages in the collection to select a set of canonical views to form the scenesummary, using clustering techniques on visual features In these works,
as the visual diversity and representativeness matter for their final goal themost, the connections between the location-based characteristics in geo-space and such visual features are usually weak or even ignored So moreand more studies try to organize a landmark photo collection according totheir location connectivity [82, 40, 89, 26] Epshtein et al [40] organizedthe images in a hierarchical tree from an overview to details based on scenesemantics At each level, representative images were selected and shown
to the user The proposed framework enables a summarized display of theinformation and facilitates efficient browsing Snavely et al [89] organizedthe images in 3D space using full camera parameters, recovered by match-ing visual features between images So tourists can benefit from browsingthese images according to the landmark 3D structure Because these algo-rithms were typically supported by a high overlap in contents, Brahmachari
et al [26] proposed an efficient algorithm for clustering weakly-overlappingviews when the photo-set is sparse, based on opportunistic use of epipolargeometry estimation and organized them in a tree structure graph
21
All the works above solve a landmark query from a visual perspective and return the most relevant visual contents. However, none of them start from the location perspective, i.e., where one could create appealing landmark views. Although some of the works listed above incorporate the location factor, they mainly focus on the visual connectivity among photos that are adjacent in geo-space and do not look into the location problem from a global perspective. It is observed that locations far apart from each other may also create similar views. Moreover, a detailed evaluation among different locations for a landmark has been left unexplored. Users still need to look through all images according to the constructed architecture to find the one they like the most, and then need other support to find the corresponding camera location. In Chapter 7, we look into this problem from a location perspective. To be specific, for a queried landmark, we retrieve a set of relevant images and organize them according to whether their camera locations have the potential to produce an appealing landmark view.
Region Exploration
Given a region query, most existing studies present the retrieved multimedia documents according to their semantic categories. Such semantics come from spatial features, temporal features or a combination of both. The initial solution for a region query is to augment all media documents in the queried area with various additional information [94, 40], such as location tags or textual tags, or to display them on top of route maps. Toyama et al. [94] built the World Wide Media eXchange (WWMX), a database indexing large collections of image media by several pieces of metadata, including timestamp, owner ID and, critically, a location stamp, so the system can provide users several explorations according to their preferences. However, such browsing risks leaving unfamiliar visitors lost if they are unaware of what to focus on in a tremendous database. Approaches have therefore emerged that create a summary from the whole image dataset and present it to users [50, 53]. For example, Jaffe et al. [50] proposed a system to automatically select representative and relevant photographs from a particular spatial region by mining photographic behavior patterns from multiple perspectives.
According to the temporal features, there exist works which mainly focus on "event detection" [68, 42, 34, 76]. For example, Papadopoulos et al. [76] intended to distinguish events in a photo set by building a visual graph and recommending them to users. However, they did not look into the differences between temporal slots. In contrast, our work looks into evaluations in the spatio-temporal space and suggests both good camera positions and good time slots.
Route Exploration
There exist two related sub-fields for a route-based query: route planning and associating media documents with an assigned query.

For route modeling, Lu et al. [63] proposed a trip planner. We adopted a similar design in that, for each famous destination, the method merges different parts of trajectories and discovers the most popular path among the merged candidate paths from the order of the path scores, which are derived from popularity, photo density and path lengths. However, their method is difficult to apply to a video dataset, as each frame would need to be a sample photo. Furthermore, they do not consider whether a photo is really related to the region of interest, since they only examine the longitude and latitude of the camera location to decide whether a photo shows a landmark. However, the object of interest may be some distance away from the specified geo-location, and furthermore the camera may have pointed in a different direction. Therefore, this method provides no assurance that the user can actually observe the desired landmarks along the generated path.

The technique suggested by Arase et al. [20] separated a photo trajectory taken by the same user into successive photo trips by detecting sudden changes in the spatial, temporal and user-generated tag differences using various features. The classified trips were then analyzed to extract common traveling patterns. A common trip pattern was described in a textual form consisting primarily of frequently used specific tags, while ignoring tags that were too general. However, the suggested trip patterns relied heavily on user-generated tags and the accuracy of their descriptions.

There exists other work related to trajectory partitioning using different spatio-temporal criteria such as speed, direction and location, or their combination [19, 27]. Aizawa [19] used spatio-temporal sampling for keyframe selection so that visual computation could be reduced. In spatio-temporal sampling, the location and the time of the recording were sampled using data from the GPS receiver and time data. In addition, they made use of the derivatives of the location, which define the movement of the users, such as speed and direction. Changes of location and their derivatives were triggers for the spatio-temporal sampling. Six sets of sampling parameters were pre-determined and the corresponding frames were presented. However, these methods did not analyze the relationships among different trajectories; in our application, multiple trajectories need to be considered.
To associate media documents with a route, Jing et al. [51] proposed a visual tour where users draw a route directly on the map and the system automatically returns high-quality photos along the route. However, the generated result is fixed and cannot be partially updated through user interactions unless a new query is conducted. To work with videos, Pongnumkul et al. [77] prepared a storyboard for browsing tour videos with which a user can interact through a set of controls. However, people must have rich prior knowledge of the queried geo-space, as they are required to manually drag a set of pre-processed keyframes onto a map, pin them on the proper landmarks, and designate a tour path. This system supports interactive controls such as adding or removing the provided video shots, which, however, require user knowledge as well. Our system takes all these factors into consideration, including landmark selection, concise but comprehensive summary preparation, and a dynamic and interactive system set-up. None of the processes require prior user experience; quite the opposite, users accumulate information about an unknown area and can gradually update their queries to find their points of interest.
Summary and Discussion
The majority of existing studies in this field leverage photo resources to present a static visual appearance for either a landmark, a region or a route. Instead, we would like to present a dynamic summary view of these geographic properties from video resources for better liveliness. Additionally, most of the existing studies investigate which photos should be selected