MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
PHAM QUANG HAI
CONTENT-BASED VIDEO INDEXING AND RETRIEVAL
MASTER OF ENGINEERING THESIS
INFORMATION PROCESSING AND COMMUNICATION
Hanoi - 2005
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
PHAM QUANG HAI
CONTENT-BASED VIDEO INDEXING AND RETRIEVAL
MASTER OF ENGINEERING THESIS IN INFORMATION PROCESSING AND COMMUNICATION
SCIENTIFIC SUPERVISOR:
Alain Boucher
Hanoi - 2005
Abstract
Video indexing and retrieval is an important element of any large multimedia database. Because video contains a huge amount of information and has large storage requirements, research on video is still ongoing and open. This thesis works at the low-level features of video, concentrating on the corner feature of frames. Comparing corners between frames gives us the ability to detect cuts, gradual transitions, and even many types of wipes among the shot transitions in video. Continuing with corner-based motion combined with histogram features, key frames are selected by measuring how far the motion moves, making them ready for indexing and retrieval applications.
The other side of the work uses segmentation of each frame and merges all regions that have the same motion. In this way, I separate the regions of a frame into layers, which will be used to index key objects.
One chapter of this thesis is reserved for studying how to index and retrieve video. It gives an overview of video indexing and retrieval systems: what they did, what they are doing and what they will do. This thesis is expected to contribute usefully to the multimedia system at MICA.
Acknowledgements
This work is part of the Multimedia Information Communication Application (MICA) research center.
First of all, I would like to thank Dr Alain Boucher, IT lecturer at the Institut de la Francophonie pour l'Informatique (IFI), Vietnam, and leader of the Image Processing group at the MICA center, as my supervisor. Thank you for your support and for the knowledge you passed on to me, thank you for meeting to discuss the work every week and for your patience during the time I worked, and sorry for any inconvenience I brought to you.
I also thank Le Thi Lan and Thomas Martin, members at MICA; I could not have done this thesis without your support. Thank you both for sharing your knowledge of image processing theory and of programming in C++ with a newbie like me.
I would like to thank the directors at MICA: Mr Nguyen Trong Giang, Mr Eric Castelli and Mrs Nguyen Thi Yen, who accepted me and helped me to have a good working environment at MICA. Thanks to the members of MICA who welcomed me to work at MICA as a trainee; I have a very good impression of your amiable attitudes and your help.
Finally, I want to thank my family: my two sisters who often encouraged me over the long distance from home, my parents who helped me anytime I was down, and my brother who visited me sometimes to tidy up my room because of my laziness.
Table of contents

I.1 Content-based Video Indexing and Retrieval (CBVIR)
I.2 Aims and Objectives
II.2 Content-based video indexing and retrieval system
II.2.1 Video sequence structure
II.2.2 Video data classification
II.2.3 Camera operations
II.2.4 Introduction to CBVIR system
II.2.5 CBVIR Architecture
II.3 Features Extraction
II.4 Structure Analysis
II.4.1 Shot Transitions classification
II.4.2 Shot Detection Techniques
II.6.1 Introduction to Video Abstraction
II.6.2 Key frame extraction
II.7 Video Indexing, Retrieval, and Browsing
Chapter III Video Indexing by Camera motions using Corner-based motion vector
III.2 Video and Image Parsing in MICA
III.2.1 MPEG2 Video Decoder
III.2.3 Video and Image Processing Library Combination
III.3.1 Harris Corner points detector
III.3.2 Correspondent points matching
III.4 Shot Transitions detection
III.4.1 Shot cut Detection algorithm
III.4.2 Shot cut detection description
III.4.3 Results and evaluation
III.5 Video Indexing
III.5.1 Motion Characterization
III.5.2 Corner-based motion vector
III.5.3 Global Motion calculation
III.5.4 Key frame extraction
III.5.5 Problem in object extraction
List of abbreviations
CD: Compact Disk
DVD: Digital Versatile Disk
MPEG: Moving Pictures Experts Group
CBVIR: Content Based Video Indexing and Retrieval
CBIR: Content Based Indexing and Retrieval
IEC: International Electrotechnical Commission
DCT: Discrete Cosine Transform
JPEG: Joint Photographic Experts Group
IDCT: Inverse Discrete Cosine Transform
GOP: Group of Pictures
List of figures
Figure I-1 Position of video system in MICA center
Figure II-1 Two consecutive frames from video sequence
Figure II-2 Motion Compensation from MPEG video stream
Figure II-3 Block diagram of MPEG video encoder
Figure II-4 Video Hierarchical Structure
Figure II-5 Common directions of moving video camera
Figure II-6 Common rotation and zoom of stationary video camera
Figure II-7 CBVIR common system diagram
Figure II-8 Classification of video modeling technique: Level I with video raw data, Level II with derived or logical features, and Level III with semantic level
Figure II-13 Effect of Gabor Filter on image
Figure II-16 Reducing the number of bits when calculating the histogram
Figure II-17 Cut (a) and Fade/Dissolve (b) from frame difference
Figure II-18 Twin Comparison (picture taken from [SMOLIAR])
Figure II-19 Head tracking for determining trajectories
Figure II-20 The 2D motion trajectory (third dimension is the frame time line)
Figure II-21 Optical flow: (a) two frames from video sequence, (b) optical flow
Figure II-22 Optical flow field produced by pan and track, tilt and boom, zoom and ...
Figure II-23 Motion segmentation by optical flow
Figure II-24 Local and Global Contextual Information
Figure III-1 Relation between R and eigenvalues
Figure III-2 Harris corners in image with different given corner numbers
Figure III-3 (a) Two frames extracted while camera pans right (b) correspondent points
Figure III-4 Results from no shot transition
Figure III-5 Results from shot cut transition
Figure III-7 Correspondent points matching numbers in one video sequence
Figure III-8 Two frames from two shots but similar
Figure III-9 Correspondent points in video sequence 3
Figure III-10 Frame sequence from video sequence 3
Figure III-11 Keeping motion vectors by a given threshold for magnitudes
Figure III-12 The 8 directions used for standardizing vector directions
Figure III-13 Some consecutive frames from a pan right shot
Figure III-14 Video frame from video sequence 1
Figure III-15 Video frame from video sequence 2
Figure III-16 Key frame selection from video mosaic
Figure III-17 Key frames selected from motion graph
Figure III-18 Complicated motion graph from video
Figure III-19 Cases of vector graph
Figure III-20 Results of key frame selection
Figure III-21 Hierarchical indexing for CBVIR system
List of tables
Table 1 Test data used for shot cut algorithm
Table 2 Shot detection results from test data
Table 3 Four types of detection an algorithm can make
Table 4 Vector directions rule
Table 5 Calculating global motion from a set of corner-based vectors
Table 6 Video sequences for global motion
Table 7 Three tables of motion vectors for video sequences 1, 2 and 6
Table 8 Global motion from video sequence 3
Chapter I Introduction

I.1 Content-based Video Indexing and Retrieval (CBVIR)
… the ability of rapid processing and very large storage for multimedia data became available. Each year, huge numbers of movies are created by the movie industry, and hundreds of television companies in the world produce more and more video news. And for each person or each family, being the owner of a camera became easier than ever; they make home videos of the events of their lives. For these reasons, video information became huge, difficult to archive, and messy when people tried to browse for the video information they need. It is easy to realize that it is sometimes very hard to find the shot of a "sunset scene" given a thousand videos of 3 hours each. That is why systems for retrieving video are being researched and developed more and more in the world.
In Vietnam, there is more and more research and application in multimedia data, to fit the development and requirements of modern life, and video research became important and essential. By means of this thesis, I tried to explore and summarize video research while implementing a part of a CBVIR system.
I.2 Aims and Objectives
At the MICA center, we are now developing a multimedia system including Speech Processing and Image Processing. The position of the video system is shown in Figure I-1.
The thesis is divided into two big chapters plus a conclusion. These chapters are organized as follows.
Chapter II
This chapter provides basic information and characteristics of video and of a general CBVIR system. It also introduces the techniques used in video for feature extraction and video analysis.
Chapter III
This chapter describes the techniques used in practice: Harris corners, motion derived from Harris corners, and how they fit into a CBVIR system. Results obtained from practice are shown in this chapter with evaluations.
Chapter IV
This chapter concludes the thesis and gives directions for future work.
Chapter II Background
II.1 Video Formats and Frame Extraction System
II.1.1 Video Formats
There are a number of video formats used in CBVIR systems; the choice depends on the database of the system. Some formats popularly used on storage media like VCD, DVD and hard disk are DAT, AVI, MPEG, MPG and MOV. In the CBVIR system at MICA, we used the MPEG-2 format when parsing the video stream. An advantage of the MPEG format is that it reduces the size of the video file enough to make many video processing systems feasible. When encoding an MPEG video stream, a two-step compression is used: once for spatial compression and once for motion compression (motion compensation). Applications that use MPEG video have varied requirements: digital storage media require small size and quality good enough to process because of cost; asymmetric applications require the ability to subdivide video for delivery (e.g., online video games); symmetric applications need a video format that can be compressed and decompressed at the same time. All these requirements are satisfied by using the MPEG format.
II.1.2 Introduction to MPEG
The Moving Pictures Experts Group, abbreviated MPEG, is part of the International Standards Organisation (ISO) and defines standards for digital video and digital audio. The primal task of this group was to develop a format to play back video and audio in real time from a CD.
Meanwhile the demands have risen, and besides the CD, the DVD needs to be supported, as well as transmission equipment like satellites and networks. All these operational uses are covered by a broad selection of standards. Well known are the standards MPEG-1, MPEG-2, MPEG-4 and MPEG-7. Each standard provides levels and profiles to support special applications in an optimised way.
II.1.2.1 MPEG-2 Video Standard
MPEG-2 video is an ISO/IEC standard that specifies the syntax and semantics of an encoded video bitstream. These include parameters such as bit rates, picture sizes and resolutions which may be applied, and how the bitstream is decoded to reconstruct the picture. What MPEG-2 does not define is how the decoder and encoder should be implemented, only that they should be compliant with the MPEG-2 bitstream. This leaves designers free to develop the best encoding and decoding methods whilst retaining compatibility. The range of possibilities of the MPEG-2 standard is so wide that not all features of the standard are used for all applications [KEITH].
II.1.2.2 MPEG-2 Encoding
One of the most interesting points of MPEG is reducing the size of the video as much as it can. This concerns compression algorithms including spatial compensation and temporal compensation; the method is also applied in other MPEG standards like MPEG-1, MPEG-4 and MPEG-7. For spatial compensation, JPEG, the standard that reduces the size of an image, is used. By adjusting its various parameters, compressed image size can be traded against reconstructed image quality over a wide range; image quality ranges down to "browsing" quality (with a compression ratio of 100:1) [KEITH]. For temporal compensation, motion compensation is used. Temporal compression is achieved by only encoding the difference between successive pictures. Imagine a scene where at first there is no
movement, and then an object moves across the picture. The first picture in the sequence contains all the information required until there is any movement, so there is no need to encode any of the information after the first picture until the movement occurs. Thereafter, all that needs to be encoded is the part of the picture that contains movement; the rest of the scene is not affected by the moving object because it is still the same as in the first picture. The means by which the amount of movement between two successive pictures is determined is known as motion estimation prediction. The information obtained from this process is then used by motion compensated prediction to define the parts of the picture that can be discarded. This means that pictures cannot be considered in isolation: a given picture is constructed by prediction from a previous picture, and may itself be used to predict the next picture.
Motion compensation is illustrated in Figure II-1.
Figure II-1 Two consecutive frames from video sequence
As we can see in the two consecutive frames, the man on the right and the background stay static while the man on the left is moving. All the information we need to store is the background, the figure of the man on the right, and the motion of the man on the left. Motion compensation here means that the next frame is created from the last frame plus the part revealed by the motion, minus the part overridden by it, as described visually in Figure II-2.
Figure II-2 Motion Compensation from MPEG video stream
The block diagram of the MPEG video encoder is shown in Figure II-3.
Figure II-3 Block diagram of MPEG video encoder
First of all, the frame data (raw data) is compressed by the DCT (Discrete Cosine Transform) by dividing a frame into macroblocks and calculating the block DCT coefficients. Quantization (Q), with "zig-zag" scanning, optimizes the values for each macroblock. After that, MCP (motion compensated prediction) is used to exploit redundant temporal information that does not change from picture to picture.
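To make the spatial step concrete, here is a minimal sketch (not the MPEG-2 reference implementation) of transforming one 8x8 block with the DCT and quantizing it. The single quantization step size is an illustrative assumption; MPEG-2 actually uses per-frequency quantization matrices and zig-zag ordering before entropy coding.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: C[k, m] = a(k) * cos(pi * (2m+1) * k / (2n))
    c = np.zeros((n, n))
    for k in range(n):
        a = np.sqrt(1.0 / n) if k == 0 else np.sqrt(2.0 / n)
        for m in range(n):
            c[k, m] = a * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    return c

C = dct_matrix()

def encode_block(block, qstep=16):
    # Level-shift, 2-D DCT, then uniform quantization of one 8x8 block.
    coeffs = C @ (block - 128.0) @ C.T
    return np.round(coeffs / qstep).astype(int)

block = np.random.randint(0, 256, (8, 8)).astype(float)
print(encode_block(block))  # most high-frequency entries quantize to 0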
MPEG-2 defines three picture types:
I (Intraframe) pictures: these are encoded without reference to another picture, to allow for random access. At the MICA center, we used this type of frame during processing.
P (Predictive) pictures are encoded using motion compensated prediction from the previous picture and therefore contain a reference to the previous picture. They may themselves be used in subsequent predictions.
B (Bi-directional) pictures are encoded using motion compensated prediction from the previous and next pictures, which must be either I or P pictures. B pictures are not used in subsequent predictions.
Usually, I, B and P frames are mixed into a Group of Pictures (GOP). One GOP could include I and P frames, or I, P and B frames. Depending on what is done at encoding time, the resulting MPEG type will differ: it depends on the number of picture types used in the GOP and on the order of I, B, P frames in the GOP.
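As an illustration, one common (but not mandated) layout places two B pictures between successive anchor frames; the helper below, whose parameter names are illustrative, generates such a pattern:

```python
def gop_pattern(n=12, m=3):
    # n: GOP length; m: anchor spacing (one I or P picture every m frames).
    # A typical broadcast choice is n=12, m=3 -> "IBBPBBPBBPBB".
    types = []
    for i in range(n):
        if i == 0:
            types.append("I")
        elif i % m == 0:
            types.append("P")
        else:
            types.append("B")
    return "".join(types)

print(gop_pattern())  # IBBPBBPBBPBB
```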
One more important problem in the MPEG-2 standard is motion estimation. Motion estimation prediction is a method of determining the amount of movement contained between two pictures. This is achieved by dividing the picture to be encoded into sections known as macroblocks. The size of a macroblock is 16 x 16 pixels. Each macroblock is searched for the closest match in the search area of the picture it is being compared with. Motion estimation prediction is not used on I pictures; however, B and P pictures can refer to I pictures. For P pictures, only the previous picture is searched for matching macroblocks; in B pictures, both the previous and next pictures are searched. When a match is found, the offset (or motion vector) between them is calculated. The matching parts are used to create a prediction picture, by using the motion vectors. The prediction picture is then compared in the same manner to the picture to be encoded. Macroblocks which have a match have already been encoded, and are therefore redundant.
Macroblocks which have no match to any part of the search area in the picture to be encoded represent the difference between the pictures, and these macroblocks are encoded. To understand more about the MPEG standard, see [MPEG] for details.
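The following sketch shows exhaustive block matching with the sum of absolute differences, the simplest form of the search described above; real encoders use faster search patterns, and the frame sizes and search range here are arbitrary assumptions.

```python
import numpy as np

def match_macroblock(prev, cur, bx, by, search=8, bs=16):
    """Find the motion vector of the macroblock at (bx, by) in `cur`
    within +/- `search` pixels of the co-located position in `prev`,
    using the sum of absolute differences (SAD)."""
    block = cur[by:by + bs, bx:bx + bs].astype(int)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bs > prev.shape[0] or x + bs > prev.shape[1]:
                continue
            sad = np.abs(block - prev[y:y + bs, x:x + bs].astype(int)).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dx, dy)
    return best_mv, best

prev = np.random.randint(0, 256, (64, 64))
cur = np.roll(prev, (2, 3), axis=(0, 1))  # simulate a small global shift
print(match_macroblock(prev, cur, 16, 16))
# ((-3, -2), 0): the block's content came from 3 px left, 2 px up in prev
```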
II.2 Content-based video indexing and retrieval system
II.2.1 Video sequence structure
A video stream is basically created from shots. A shot is a fundamental unit of video: it depicts a continuous capture from the camera being turned on until it is turned off for another shot. As illustrated in Figure II-4, when a video producer creates a video, they make it from shots, group them into scenes, and embed some effects between shots.
Figure II-4 Video Hierarchical Structure
One scene can be understood as a series of shots which are semantically constrained (e.g., a scene describing two talking people sitting in chairs, with interleaved shots of other people at a party). This means it is not easy to detect a scene by signal features like colour or shape; it must be detected at the semantic level (mentioned in the next section). The lowest level in the video hierarchical structure is the frame. A CBVIR system extracts a shot into frames and selects the most interesting frames in the key frame selection step.
II.2.2 Video data classification
To work with video, classification of video data is very important. Following [ROWE], Rowe et al. classified video metadata into 3 levels for each video:
Bibliographic data: this category includes information about the video (e.g., title, abstract, subject, and genre) and the individuals involved in the video (e.g., producer, director, and cast).
Structural data: videos and movies can be described by a hierarchy of movie, segment, scene, and shot, where each entry in the hierarchy is composed of one or more entries at a lower level (e.g., a segment is composed of a sequence of scenes and a scene is composed of a sequence of shots).
Content data: users want to retrieve videos based on their content (audio and visual content). In addition, because of the nature of video, the visual content is a combination of static content (frames) and dynamic content. Thus, the content indexes may be sets of key frames that represent major images, and object indexes that indicate entry and exit frames for each appearance of a significant object or individual.
With this classification, to work with a video stream we normally have to parse the video into the 3 levels above. The first level can be obtained from
other information which is embedded inside the video stream or, perhaps more simply, from text that appears in the video stream (which requires text recognition [LIENHART]). To determine the structural data, the system must rebuild the structure from primitive elements (frames) using various techniques. In most CBVIR systems, content data is used as the major element to work with video. And in my thesis, I work with the frame as the primitive element.
Another way to classify video data is based on its purpose; these are called purpose-based classes. In [LEW], Lew et al. classify video into 4 classes:
Entertainment: information in this class is highly stylized depending on the particular sub-category: fiction, non-fiction and interactive. Film stories, TV programs and cartoons belong to fiction. With non-fiction, information does not need to be "logical" and follow a story flow. Interactive video can be found in games.
Information: the most common video information on television is news. It conveys information to viewers.
Communication: video used in communication is different from playback video. It must be designed for communication, suitable for packet transmission (e.g., video conferences).
Data analysis: scientific video recording (e.g., video in chemistry, biology, psychology...).
Classifying video into these classes is important because of their different structures: it helps to classify video information at the semantic level. Classifying video shots at these levels is illustrated in [FAN]; in that paper, Fan et al. classify video into politics news, sports news, financial news... Or in [ROZENN], Rozenn et al. used features of sports to classify tennis video or snooker video. Li et al. in [LI] detect particular events in sport broadcast video, like American football, by
determining in the video stream where a player hit the baseball, where a goal is scored... In [SATOH], for news video, anchor person detection based on face detection is used; it combines features of the face and of the video caption (text) to recognize the appearance of a person in video. All of these applications try to classify a video stream into its semantic class.
II.2.3 Camera operations
There are two sources of motion in video: the camera and objects. A camera shot stores object motions only when the camera stays static with no adjustment, while objects like people, animals and vehicles move in front of it. Conversely, camera motions are generated by moving, rotating and using the zoom function of the camera. In moving-camera shots, often used in film production, the video camera lies on a support moved along a track; usually, this case is used to follow a moving object. The four common directions of camera movement are illustrated in Figure II-5. Some other cases of video motion created by a stationary camera are shown in Figure II-6.
Figure II-6 Common rotation and zoom of stationary video camera
II.2.4 Introduction to CBVIR system
How much information is stored in one video shot? It is said that "one image is worth a thousand words", and here one video could hold thousands of images. And not only that: one video shot can store other information like sound, voice and text, plus the one feature that makes video impressive: motion. Thus, any CBVIR system can be seen as an extension of image, sound, voice and text indexing and retrieval systems. Besides, motion taken from the image sequence, and information coming from a collection of images, are also used in a CBVIR system. A CBVIR system must satisfy the main targets of indexing and retrieval: end-users give queries and, according to their criteria, the CBVIR system should give back the results most similar to the queries. But imagine if we took an image or video query and matched it over the entire video stream frame by frame: this poses a big problem of very high time cost, and sometimes the result is not exactly what we want. A CBVIR system for end-users can be seen simply in Figure II-7. This system can use queries of video streams, images, sounds, texts, or a combination of all of them; the result given back to the end user is the set of most similar results.
Figure II-7 CBVIR common system diagram
In [FAISAL], Faisal I. Bashir gave the classification of video modeling schemes demonstrated in Figure II-8. Any CBVIR system uses signal features like colour, shape and texture of the raw video data at Level I. Techniques at this level tend to model the apparent characteristics of "stuff" (as opposed to "things") inside video clips. Video data is continuous and unstructured; to analyze and understand its contents, the video needs to be parsed into smaller chunks (frames, shots and scenes). At Level II, the system analyzes and synthesizes these features to obtain logical and statistical features (computer vision), and uses both kinds of derived results for semantic representation. At this semantic level, the video bit stream, which contains an audio stream and possibly closed caption text along with the sequence of images, contains a wealth of rich information about the objects and events being depicted. Once the feature-level summarization is done, a semantic-level description based on conceptual models, built on a knowledge base, is needed. In the following sections, we review techniques that try to bridge the semantic gap and present a high-level picture obtained from video data. One example can be found in [FAN], where people tried to classify each shot of video into
semantic scenes of news, sports, science information... A CBVIR system belongs to one of these levels depending on its purpose.
Figure II-8 Classification of video modeling techniques: Level I with video raw data, Level II with derived or logical features, and Level III with semantic level
… its analogy: in a CBVR system for text, to make the system efficient, documents must be decomposed into elements: paragraphs, sentences and words.
After that, a content table that maps to the document is built, keywords are extracted from the document by features, and query text is used to retrieve over all these indexed keywords. Similarly, in a CBVIR system, we should segment a video document into shots and scenes to compose a table of contents, and we should extract key frames or key sequences as index entries for scenes or stories. Therefore, the core research in content-based video retrieval is developing technologies to automatically parse video, audio, and text to identify a meaningful composition structure, and to extract and represent content attributes of any video source.
Figure II-9 Process diagram of CBVIR system
Figure II-9 illustrates the processes of a common CBVIR system. There are four main processes: feature extraction, structure analysis, abstraction and indexing. There is much research on each process, each with its own challenges; here we briefly review each process.
Feature extraction: as mentioned in previous sections, each video contains many features: image features, voice features, sound features and text features. Feature extraction separates all these features, depending on each system, to serve the parsing of the video stream (shot detection, scene detection, motion detection...). The usual and effective way is a combination of features, which generates many ways to approach a CBVIR system.
Structure analysis: in Figure II-4, the top level is the video sequence, which is also the story. A video sequence is partitioned into scenes (analogous to paragraphs in a document), each scene is composed of sets of shots (like sentences in paragraphs), and a shot is constructed from a frame sequence. Structure analysis must be able to decompose all these elements. However, scene separation is sometimes impossible because it is based on stories, and it must be recognized at a higher level (Level III) as illustrated in Figure II-8. Normally, to divide scenes from each other, most techniques are based on film production rules; but in fact, this still faces many challenges and is less successful. For shot separation, there are many approaches, such as using color or motion; most shot algorithms use visual information. The last step, frame extraction, is video-format dependent.
Video abstraction: the original video is always longer than the needed information. Abstraction is similar to the step of finding keywords in text retrieval: it reduces the information to retrieve (equivalently, reduces the time cost) in a CBVIR system. For example, in a video of a football match, scenes from the stands of the stadium are redundant if we only consider match events. Abstraction rejects all shots and scenes that the system does not need to care about; that means abstraction must be executed over entire shots, scenes and stories. Video content abstraction includes skimming, highlights and summary. A video skim is a condensed representation of the video containing keywords, frames, visual and audio sequences. Highlights normally involve detection of important events in the video. A summary means that we preserve important structural and semantic information in a short version of the video, represented via key audio, video, frames and/or segments. One of the most popular ways is key frame extraction. Key frames play an important role in the video
abstraction process. Key frames are still images, extracted from the original video data, which best represent the content of shots in an abstract manner. The representational power of a set of key frames depends on how they are chosen from all frames of a sequence. Not all image frames within a sequence are equally descriptive, and the challenge is how to automatically determine which frames are most representative. An even more challenging task is to detect a hierarchical set of key frames such that a subset at a given level represents a certain granularity of video content, which is critical for content-based video browsing. Researchers have developed many effective algorithms, although robust key frame extraction remains a challenging research topic.
Indexing for retrieval and browsing: the structural and content attributes extracted in the feature extraction, video parsing and abstraction processes, or attributes entered manually, are often referred to as metadata. Based on these attributes, we can build video indices and the table of contents through, for instance, a clustering process that classifies sequences or shots into different visual categories, or an indexing structure. As in many other database systems, we need schemes and tools to use the indices and content metadata to query, search, and browse large video databases. Researchers have developed numerous schemes and tools for video indexing and query. However, robust and effective tools tested by thorough experimental evaluation with large data sets are still lacking. Therefore, in the majority of cases, retrieving or searching video databases by keywords or phrases will be the mode of operation. In some cases, we can retrieve with reasonable performance by content similarity defined by low-level visual features of, for instance, key frames and example-based queries.
II.3 Features Extraction
Each CBVIR system chooses its own way to approach searching and browsing. The features used are image features, audio features and text features. In this section, I would like to introduce the features used widely in CBVIR systems and give an overview of each feature; within the limits of this thesis, it is impossible to discuss each subject in detail.
As we know, each frame in a video stream is equivalent to one image. That means anything we can do on an image can be applied to video. One thing that makes video differ from images is the constraint between consecutive images' features. There are three low-level features often used in image retrieval that map to video: color, texture and shape.
Color:
RGB color space
The RGB model is one of the first practical models of the color domain and contains the recipe for the creation of colors. This model emerged in an evident manner in the times of the birth of television (1908 and the following years). It is the model resulting from the receiving specification of the eye, and it is based on the fact that the impression of almost all colors in the eye can be evoked by mixing, in fixed proportions, three selected clusters of light of properly chosen widths of spectrum. The three components (R, G, B) (R-RED, G-GREEN, B-BLUE) are the identification of a color in the RGB model. The RGB color space is described in Figure II-10.
HSV color space
One more proposed model for describing colors, suggested by Alvy Ray Smith, appeared in 1978. The symbols in the name of the model are the first letters of the English names of the components of the color description. The model is considered as a cone with a round base.
The dimensions of the cone are described by the component H (Hue), the component S (Saturation) and the component V (Value); brightness is the height component. Figure II-11 illustrates the way color is created on the surface of the cylinder. The HSV color space is very useful in image retrieval systems because it is close to human eye perception.
Color Histogram:
Color histograms are compared to determine where the same histogram distribution appears. This method is very popular because of its light time cost and its high effectiveness.
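A minimal sketch of this idea, assuming greyscale frames and an L1 distance (other norms and bin counts are equally common):

```python
import numpy as np

def grey_histogram(frame, bins=64):
    # Normalized grey-level histogram of one frame.
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def histogram_distance(f1, f2, bins=64):
    # L1 distance between two frame histograms; small values suggest
    # the frames likely come from the same shot.
    return np.abs(grey_histogram(f1, bins) - grey_histogram(f2, bins)).sum()

a = np.random.randint(0, 256, (120, 160))
b = np.clip(a + 5, 0, 255)       # slightly brightened version of a
print(histogram_distance(a, b))  # small: similar content
```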
Texture:
Tamura features
Tamura features are based on the theory of texture [TAMURA]. The three Tamura features are known as coarseness, contrast and directionality; … each is given a measurement to separate textures into different levels. For example, in Figure II-12.a the right image is coarser than the left one and has a greater value.
Figure II-12 Tamura features and their values: (a) Coarseness (b) Contrast (c) Directionality
Gabor Filter:
Given an image I(x, y), with m and n the scale and orientation respectively, the Gabor-filtered output can be described as follows:

G_{mn}(x,y) = \sum_{s} \sum_{t} I(x-s, y-t) \, \psi_{mn}^{*}(s,t)

where s and t denote the extent of the filter mask and \psi_{mn} is generated from the mother wavelet function.
Changing these argument values will change the resulting image. That means the Gabor filter is used to filter the image and extract the needed textures.
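Below is a sketch of a Gabor kernel (real part) and a naive filtering pass; the kernel parameters are illustrative assumptions, not values used in this thesis.

```python
import numpy as np

def gabor_kernel(ksize=15, sigma=3.0, theta=0.0, lambd=8.0, gamma=0.5):
    # Real part of a Gabor kernel: a Gaussian envelope modulating a
    # cosine wave of wavelength `lambd` oriented at angle `theta`.
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2)) \
           * np.cos(2 * np.pi * xr / lambd)

def filter_image(img, kernel):
    # Naive sliding-window filtering, enough for a demonstration.
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.random.rand(64, 64)
response = filter_image(img, gabor_kernel(theta=np.pi / 4))
print(response.shape)  # (50, 50): strong where 45-degree texture exists
```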
Shape:
Shape representations can be classified into contour-based and region-based techniques. Both categories of techniques have been investigated and several approaches have been developed. In image retrieval systems, the shape of an object is extracted and compared across the image database.
II.4 Structure Analysis
Structure analysis includes detecting scenes and shot breaks and extracting frames. In this section, I only introduce shot detection, because this is the most important task during video analysis.
II.4.1 Shot Transitions classification
Shot transitions are techniques to lead the viewer from one shot to another [JONE, KATZ]. Normally, before movie production (concatenating shots together to make the movie story), each shot runs from the time the camera is turned on until it is turned off to change to another shot. One way to separate shots is the continuity feature: a shot boundary will appear wherever there is a sudden change of the continuous scene on the time line. After shot transition techniques are applied, the separation becomes more difficult. Detecting shot changes is a very important step in any CBVIR system. Shot transitions can be categorized as follows.
Cut: a shot changes to another instantaneously; see Figure II-14.a.
Fade-in: a shot emerges from a constant image. This constant image is usually a black frame. Most fade-ins can be found in the first shot at the beginning of a movie; see Figure II-14.b.
Fade-out: a shot gradually changes into a constant (usually black) image; see Figure II-14.c.
Dissolve: the first shot fades out while the second shot fades in. As you can see in Figure II-14.d, a dissolve effect appears after Frame#43 and before Frame#83.
Wipe: the wipe is the most complicated transition because it is generated in video production. Figure II-14.e shows just one type of wipe (push left). Figure II-15 illustrates the techniques used for shot transitions; the arrows are the directions in which the second shot appears over the first shot.
A thorough understanding of shot transitions is the way to approach shot detection. We will see how to automatically detect shot transitions in the next section; even today, research on shot detection continues because of its difficulty and interest.
Figure II-14 Shot transitions: (a) cut, (b) fade-in, (c) fade-out, (d) dissolve (Frame#43 to Frame#83), (e) wipe (push left)
II.4.2 Shot Detection Techniques
There are many ways to detect where a shot boundary appears. In general, automatic shot detection can be classified into 5 categories: pixel-based, statistics-based, transform-based, feature-based, and histogram-based.
Pixel difference method:
Pair-wise pixel comparison (also called template matching) evaluates the differences in intensity or colour values of corresponding pixels in two successive frames. With P_i(x, y) the intensity of pixel (x, y) in a frame f_i, the difference between frame f_i and frame f_{i+1} is defined as:

D(f_i, f_{i+1}) = \frac{1}{X \cdot Y} \sum_{x=1}^{X} \sum_{y=1}^{Y} | P_i(x,y) - P_{i+1}(x,y) |        (II.3)

where X and Y are the frame width and height.
Otsuji et al. [OTSUJI] and Zhang et al. [ZHANG] count the number of changed pixels, and a camera shot break is declared if the percentage of the total number of changed pixels exceeds a threshold. The per-pixel difference and the threshold test are:

DP_i(x,y) = \begin{cases} 1 & \text{if } | P_i(x,y) - P_{i+1}(x,y) | > t \\ 0 & \text{otherwise} \end{cases}        (II.4)

\frac{100}{X \cdot Y} \sum_{x=1}^{X} \sum_{y=1}^{Y} DP_i(x,y) > T
As we can see in (II.4), if the difference of a pixel is above a threshold value t, then it is counted towards a camera break. But it is easy to realize that this formula is very sensitive to camera motion: it becomes very difficult to handle pans, zooms, or even just a large movement of an object, because the difference values of corresponding pixels are high. That means it will give more false alarms among the reported camera breaks. However, this method is fast and robust for detecting abrupt camera breaks.
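A minimal sketch of this detector, following (II.3)-(II.4); the thresholds t and T are illustrative assumptions:

```python
import numpy as np

def pixel_change_ratio(f1, f2, t=20):
    # Percentage of pixels whose intensity changed by more than t
    # between two greyscale frames, as in (II.4).
    changed = np.abs(f1.astype(int) - f2.astype(int)) > t
    return 100.0 * changed.sum() / changed.size

def detect_cuts(frames, t=20, T=50.0):
    # Declare a camera break wherever the changed-pixel percentage
    # between consecutive frames exceeds T.
    return [i for i in range(len(frames) - 1)
            if pixel_change_ratio(frames[i], frames[i + 1], t) > T]

frames = [np.random.randint(0, 256, (120, 160)) for _ in range(5)]
frames[3] = 255 - frames[3]  # simulate an abrupt content change
print(detect_cuts(frames))   # [2, 3]: both sides of the inverted frame
```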
Block-based method:
In contrast to template matching, which is based on global image characteristics (pixel-by-pixel differences), block-based approaches use local characteristics to increase robustness to camera and object movement [ZHANG]. Let \mu_i and \mu_{i+1} be the mean intensity values for a given region in two consecutive frames, and \sigma_i and \sigma_{i+1} be the corresponding variances. The frame difference is defined as the percentage of regions whose likelihood ratios (II.5) exceed a pre-defined threshold t:
\lambda = \frac{\left[ \frac{\sigma_i + \sigma_{i+1}}{2} + \left( \frac{\mu_i - \mu_{i+1}}{2} \right)^2 \right]^2}{\sigma_i \cdot \sigma_{i+1}}        (II.5)
This approach is better than the previous one, as it increases tolerance against the noise associated with camera and object movement. However, it is possible that even though two corresponding blocks are different, they can have the same density function; in such cases no change is detected.
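A sketch of the block-based test of (II.5), with block size and threshold chosen arbitrarily for illustration:

```python
import numpy as np

def likelihood_ratio(b1, b2):
    # Likelihood ratio of (II.5) for one pair of corresponding blocks.
    m1, m2 = b1.mean(), b2.mean()
    s1, s2 = b1.var() + 1e-9, b2.var() + 1e-9  # avoid division by zero
    return ((s1 + s2) / 2 + ((m1 - m2) / 2) ** 2) ** 2 / (s1 * s2)

def block_difference(f1, f2, bs=16, t=100.0):
    # Percentage of blocks whose likelihood ratio exceeds threshold t.
    h, w = f1.shape
    ratios = [likelihood_ratio(f1[y:y+bs, x:x+bs].astype(float),
                               f2[y:y+bs, x:x+bs].astype(float))
              for y in range(0, h - bs + 1, bs)
              for x in range(0, w - bs + 1, bs)]
    return 100.0 * np.mean([r > t for r in ratios])

f1 = np.random.randint(0, 256, (128, 128))
f2 = np.random.randint(0, 256, (128, 128))
# ~0.0 here: independent noise blocks share the same statistics,
# which is exactly the failure case mentioned in the text.
print(block_difference(f1, f2))
```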
Histogram comparison:
A step further towards reducing sensitivity to camera and object movements can be taken by comparing the histograms of successive images. The idea behind histogram-based approaches is that two frames with an unchanging background and unchanging (although moving) objects will have little difference in their histograms. In addition, histograms are invariant to image rotation and change slowly under variations of viewing angle and scale. As a disadvantage, one can note that two images with similar histograms may have completely different content. However, the probability of such events is low enough; moreover, techniques for dealing with this problem have already been proposed in [PASS].
A grey level (color) histogram of a frame i is an n-dimensional vector H_i(j), j = 1..n, where n is the number of grey levels (colors) and H_i(j) is the number of pixels from frame i with grey level (color) j.
Global Comparison:
The simplest approach uses an adaptation of the metric from (II.3), but instead of intensity values, grey level histograms are compared. A cut is declared if the absolute sum of histogram differences between two successive frames, D(f_i, f_{i+1}), is greater than a threshold T:

D(f_i, f_{i+1}) = \sum_{j=1}^{n} | H_i(j) - H_{i+1}(j) |        (II.8)
where H_i(j) denotes the number of pixels of frame i with grey value j, and n is the total number of grey levels.
Another simple and very effective approach is to compare color histograms. Zhang et al. [ZHANG] apply (II.8) where j, instead of grey levels, denotes a code value derived from the three color intensities of a pixel. In order to reduce the bin count (3 colors x 8 bits would create histograms with 2^24 bins), only the upper two bits of each color intensity component are used to compose the color code. This solution is demonstrated in Figure II-16, where the lower bits of each component are truncated. Comparison of the resulting 64 bins has been shown to give sufficient accuracy. When the difference is larger than a given threshold T, a cut is declared.
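A sketch of this bin reduction: the upper two bits of each component are packed into a 6-bit code, giving the 64-bin histogram.

```python
import numpy as np

def color_code_histogram(rgb):
    # Keep only the upper two bits of each 8-bit R, G, B component and
    # pack them into a 6-bit code (64 possible values), as in [ZHANG].
    r = rgb[..., 0] >> 6
    g = rgb[..., 1] >> 6
    b = rgb[..., 2] >> 6
    code = (r.astype(int) << 4) | (g.astype(int) << 2) | b.astype(int)
    hist, _ = np.histogram(code, bins=64, range=(0, 64))
    return hist

frame = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
h = color_code_histogram(frame)
print(h.sum() == frame.shape[0] * frame.shape[1])  # every pixel counted once
```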
Figure II-17 Cut (a) and Fade/Dissolve (b) from frame difference
Figure II-17 shows the results of cumulative differences between frames. Figure II-17.a shows the frame differences when a cut appears, with the telltale peaks, and Figure II-17.b shows the frame differences when a fade or dissolve appears. The cut can be detected easily by a threshold, but fades and dissolves are more difficult. The next section explains how to detect fade and dissolve transitions.
Twin-comparison:
The twin-comparison method [SMOLIAR] takes into account the cumulative differences between frames of a gradual transition. In the first pass, a high threshold T_h is used to detect cuts, as shown in Figure II-18 (the first peak). In the second pass, a lower threshold T_l is employed to detect the potential starting frame F_s of a gradual transition. F_s is then compared to subsequent frames (Figure II-18); this is called an accumulated comparison, as this difference value increases during a gradual transition. The end frame F_e of the transition is detected when the difference between consecutive frames decreases to less than T_l, while the accumulated comparison has increased to a value higher than T_h. If the consecutive difference falls below T_l before the accumulated difference exceeds T_h, then the potential start frame F_s is dropped and the search continues for other gradual transitions. It was found, however, that there are some gradual transitions during which the consecutive difference falls below the lower threshold. This problem can be easily solved by setting a tolerance value that allows a certain number of consecutive frames with low difference values before rejecting the transition candidate. As can be seen, twin comparison detects both abrupt and gradual transitions at the same time.
Figure II-18 Twin Comparison (picture taken from [SMOLIAR])
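The sketch below runs twin comparison over a precomputed list of consecutive-frame difference values; the thresholds T_h and T_l and the tolerance count are illustrative assumptions.

```python
def twin_comparison(diffs, T_h=0.5, T_l=0.15, tolerance=2):
    """Return (cuts, gradual) frame indexes from consecutive-frame
    difference values `diffs`, following the two-threshold scheme."""
    cuts, gradual = [], []
    start, accum, low_run = None, 0.0, 0
    for i, d in enumerate(diffs):
        if d > T_h:                      # first pass: abrupt cut
            cuts.append(i)
            start, accum = None, 0.0
        elif d > T_l:                    # potential gradual transition
            if start is None:
                start, accum = i, 0.0
            accum += d                   # accumulated comparison
            low_run = 0
        elif start is not None:
            low_run += 1
            if low_run > tolerance:      # candidate ends
                if accum > T_h:
                    gradual.append((start, i - tolerance))
                start, accum = None, 0.0
    if start is not None and accum > T_h:
        gradual.append((start, len(diffs) - low_run))
    return cuts, gradual

diffs = [0.05, 0.9, 0.05, 0.2, 0.25, 0.2, 0.05, 0.05, 0.05]
print(twin_comparison(diffs))  # ([1], [(3, 6)]): one cut, one gradual span
```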
Local histogram comparison:
Histogram-based approaches are simple and more robust to object and camera movements, but they ignore spatial information and therefore fail when two different images have similar histograms. On the other hand, block-based comparison methods make use of spatial information. They typically perform better than pair-wise pixel comparison but are still sensitive to camera and object motion, and are also computationally expensive. By integrating the two paradigms, false alarms due to camera and object movement can be reduced while enough spatial information is retained to produce more accurate results.
The frame-to-frame difference of frame f_i and frame f_{i+1} is computed over b_r regions (blocks) as:

D(f_i, f_{i+1}) = \sum_{k=1}^{b_r} \sum_{j=1}^{n} | H_i(j,k) - H_{i+1}(j,k) |        (II.9)
where H_i(j,k) denotes the histogram value at grey level j for region (block) k, and b_r is the total number of blocks.
Some other methods are also used in shot detection: model-based segmentation, DCT-based methods, and Vector Quantization. Model-based segmentation serves not only for detecting shot cuts or dissolves but also for detecting any wipe type. In [HAMPAUR], Hampapur et al. gave a model for the variation of chrominance or brightness between two consecutive frames and detected the change. In [IDRIS], Idris and Panchanathan use vector quantization to compress a video sequence using a codebook of size 256 and 64-dimensional vectors; the histogram of the labels obtained from the codebook for each frame is used as a frame similarity measure, and a chi-squared statistic is used to detect cuts. In DCT-based methods, the DCT coefficients are used to compare two video frames. The DCT is commonly used for reducing spatial redundancy in an image in different video compression schemes such as MPEG and JPEG. Compression of the video is carried out by dividing the image into a set of 8x8 pixel blocks; the pixels in the blocks are transformed into 64 coefficients, which are quantized and Huffman entropy encoded. The DCT coefficients are analyzed to find frames where camera breaks take place. Since the coefficients in the frequency domain are mathematically related to the spatial domain, they can be used to detect changes in the video sequence.
II.5 Video motion
As mentioned before, motion is the most interesting feature that appears in video. From a video sequence, there are many ways to extract motion, depending on the purpose of each system. And what kind of motion is considered? It could be the motion of each pixel in a frame (the optical flow way), motion for each block (MPEG-2 handles this), motion of objects, global motion, motion of layers... Extracting motion became important not only in low-level analysis but also in higher-level analysis. This section discusses some types of motion approaches, with more detail on optical flow because of its popular use and its primacy.
II.5.1 Motion trajectory
The motion trajectory of an object is a simple, high-level feature, defined as the localization, in time and space, of one representative point of this object. This descriptor shows usefulness for content-based retrieval in object-oriented visual databases. It is also of help in more specific applications: in given contexts with a priori knowledge, trajectory can enable much functionality. In surveillance, alarms can be triggered if some object has a trajectory identified as dangerous (e.g., passing through a forbidden area, being unusually quick, etc.). In sports, specific actions (e.g., tennis rallies taking place at the net) can be recognized. Besides, such a description also allows enhanced data interactions/manipulations: for semiautomatic multimedia editing, a trajectory can be stretched, shifted, etc., to adapt the object motion to any given sequence's global context.
The descriptor is essentially a list of key-points (x, y, z, t) along with a set of optional interpolating functions that describe the path of the object between key-points, in terms of acceleration. The speed is implicitly known from the key-point specification. The key-points are specified by their time instant and either their 2-D or 3-D Cartesian coordinates, depending on the intended application. The interpolating functions are defined for each component x(t), y(t), and z(t) independently.
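A minimal sketch of such a descriptor, assuming 2-D key-points and linear interpolation in place of the optional interpolating functions (the class and method names are illustrative, not from any standard):

```python
import numpy as np

class Trajectory:
    def __init__(self, keypoints):
        # keypoints: list of (t, x, y) tuples, sorted by time t
        self.t, self.x, self.y = map(np.array, zip(*keypoints))

    def position(self, t):
        # Interpolate each spatial component independently over time,
        # mirroring the per-component x(t), y(t) definition above.
        return (np.interp(t, self.t, self.x),
                np.interp(t, self.t, self.y))

traj = Trajectory([(0.0, 10, 20), (1.0, 30, 25), (2.0, 60, 40)])
print(traj.position(1.5))  # (45.0, 32.5): halfway between key-points 2 and 3
```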