PHẠM QUANG HẢI
CONTENT-BASED VIDEO INDEXING AND RETRIEVAL
Master of Engineering Thesis
Information Processing and Communication
Scientific supervisor:
Alain Boucher
Hà Nội - 2005
Abstract
Video indexing and retrieval is an important element of any large multimedia database. Because video carries a huge amount of information and requires large storage, research on video is still ongoing and open. This thesis works at the low-level features of video, concentrating on the corner feature of frames. Comparing corners between frames gives us the ability to detect cuts, gradual transitions and even many types of wipes among shot transitions in video. Continuing with corner-based motion combined with histogram features, and by measuring how far the motion moves, key frames are selected and made ready for indexing and retrieval applications.
The other side of the work uses segmentation of each frame and merges all regions that have the same motion. In this way, I separate the regions of a frame into layers, which will be used to index key objects.
One chapter of this thesis is reserved for learning how to index and retrieve video. It gives an overview of video indexing and retrieval systems: what they did, what they are doing and what they will do. This thesis is expected to contribute usefully to the multimedia system at MICA.
Acknowledgements
This work is part of the research at the Multimedia Information Communication Application (MICA) research center.
First of all, I would like to thank Dr Alain Boucher, IT lecturer at the Institut de la Francophonie pour l’Informatique (IFI), Vietnam, and leader of the Image Processing group at the MICA center, as my supervisor. Thank you for your support and for sharing your knowledge with me, for meeting to discuss the work every week, and for your patience during the time I worked; I am sorry for any inconvenience I caused you.
I also thank Le Thi Lan and Thomas Martin, members of MICA; I could not have done this thesis without your support. Thank you both for sharing your knowledge of image processing theory and of programming in C++ with a newbie like me.
I would like to thank the directors of MICA, Mr Nguyen Trong Giang, Mr Eric Castelli and Mrs Nguyen Thi Yen, who accepted me and helped me to have a good working environment at MICA. Thanks to the members of MICA who welcomed me to work there as a trainee; I have a very good impression of your friendly attitude and your help.
Finally, I want to thank my family: my two sisters who often looked after me even over the long distance from home, my parents who helped me whenever I was down, and my brother who visited me sometimes to tidy up my room because of my laziness.
Contents
Abstract i
Acknowledgements ii
Contents iii
List of abbreviations v
List of figures vi
List of tables viii
Chapter I Introduction 1
I.1 Content-based Video Indexing and Retrieval (CBVIR) 1
I.2 Aims and Objectives 1
I.3 Thesis outline 2
Chapter II Background 3
II.1 Video Formats and Frame Extraction System 3
II.1.1 Video Formats 3
II.1.2 Introduction to MPEG 3
II.2 Content-based video indexing and retrieval system 8
II.2.1 Video sequence structure 8
II.2.2 Video data classification 9
II.2.3 Camera operations 11
II.2.4 Introduction to CBVIR system 12
II.2.5 CBVIR Architecture 14
II.3 Features Extraction 18
II.4 Structure Analysis 21
II.4.1 Shot Transitions classification 21
II.4.2 Shot Detection Techniques 23
II.5 Video motion 29
II.5.1 Motion trajectory 30
II.5.2 Motion Parametric 31
II.5.3 Motion activity (or object motion) 32
II.5.4 Optical Flow 33
II.6 Video Abstraction 36
II.6.1 Introduction to Video Abstraction 36
II.6.2 Key frame extraction: 40
II.6.3 Video Abstraction 41
II.7 Video Indexing, Retrieval, and Browsing 43
II.8 Thesis Scope 44
Chapter III Video Indexing by Camera motions using Corner-based motion vector 45
III.1 Introduction 45
III.2 Video and Image Parsing in MICA 45
III.2.1 MPEG2 Video Decoder 45
III.2.2 Image Processing Library 46
III.3.2 Correspond points matching 49
III.4 Shot Transitions detection 51
III.4.1 Shot cut Detection algorithm 51
III.4.2 Shot cut detection description 52
III.4.3 Results and evaluation 58
III.5 Video Indexing 64
III.5.1 Motion Characterization 64
III.5.2 Corner-based motion vector 65
III.5.3 Global Motion calculation 66
III.5.4 Key frame extraction 74
III.5.5 Problem in object extraction 77
III.5.6 Video Indexing 77
III.6 Summary 78
Chapter IV Conclusions and Future Work 80
IV.1 Thesis Summary 80
IV.2 Future Works 81
List of abbreviations
CD: Compact Disk
DVD: Digital Versatile Disk
MPEG: Moving Pictures Experts Group
CBVIR: Content Based Video Indexing and Retrieval
CBIR: Content Based Indexing and Retrieval
IEC: International Electro-technical Commission
DCT: Discrete Cosine Transform
JPEG: Joint Photographic Experts Group
IDC: Inverse Discrete Cosine Transform
GOP: Group of Pictures
List of figures
Figure I-1 Position of video system in MICA central 2
Figure II-1 Two consecutive frames from video sequence 5
Figure II-2 Motion Compensation from MPEG video stream 6
Figure II-3 Block diagram of MPEG video encoder 6
Figure II-4 Video Hierarchical Structure 8
Figure II-5 Common directions of moving video camera 11
Figure II-6 Common rotation and zoom of stationary video camera 12
Figure II-7 CBVIR common system diagram 13
Figure II-8 Classification of video modeling technique Level I with video raw data, Level II with derived or logical features, and Level III with semantic level abstraction 14
Figure II-9 Process diagram of CBVIR system 15
Figure II-10 RGB color space (picture source [SEMMIX]) 19
Figure II-11 HSV color space (picture source [SEMMIX]) 19
Figure II-12 Tamura features and their values (a) Coarseness (b) Contrast (c) Directionality 20
Figure II-13 Effect of Gabor Filter to image results 20
Figure II-14 Shot Transitions (a) cut (b) fade-in (c) fade-out (d) dissolve (e) wipe 23
Figure II-15 Some transition effects for wipe (pictures taken from Pinnacle Software) 23
Figure II-16 Reducing the number of bits when calculating the histogram 26
Figure II-17 Cut (a) and (Fade/Dissolve) from frame difference 27
Figure II-18 Twin Comparison (picture taken from [II.4 5]) 28
Figure II-19: Head tracking for determining trajectories 31
Figure II-20: The 2D motion trajectory (third direction is frame time line) 31
Figure II-21 Optical flow (a) two frames from a video sequence (b) optical flow 33
Figure II-22 Optical flow field produced by pan and track, tilt and boom, zoom and dolly 34
Figure II-23 Motion segmentation by optical flow 36
Figure II-24 Local and Global Contextual Information 42
Figure III-1 Relation between R and eigenvalues 48
Figure III-2 Harris Corner in image with different given corner number 49
Figure III-3 (a) Two frames extracted while camera pans right (b) corresponding point results, drawn as lines in frame#760 51
Figure III-4 Results from no shot transition 54
Figure III-5 Results from shot cut transition 55
Figure III-6 Results from dissolve 57
Figure III-7 Correspondent points matching numbers in one video sequence 60
Figure III-8 Two frames from two shots but similar 62
Figure III-9 Correspondent points in video sequence 3 62
Figure III-10 Frame sequence from video sequence 3 63
Figure III-11 Keep motion vectors by given threshold for magnitudes 65
Figure III-12 8 used directions for standardizing vector directions 66
Figure III-13 Some consecutive frames from pan right shot 66
Figure III-14 Video frame from video sequence 1 72
Figure III-15 Video frame from video sequence 2 72
Figure III-16 Key frame selection from video mosaic 74
Figure III-17 Key frames are selected from motion graph 75
Figure III-18 Complicated motion graph from video 75
Figure III-19 cases of vector graph 76
Figure III-20 Results of key frame selection 76
Figure III-21 Hierarchical indexing for CBVIR system 78
List of tables
Table 1 Test data used for shot cut algorithm 59
Table 2 Shot detection result from test data 60
Table 3 Four types of detection an algorithm can make 61
Table 4 Vector directions rule 69
Table 5 Calculating global motion from set of corner-based vector 70
Table 6 Video sequence for global motion 71
Table 7 Three tables of motion vectors for video sequences 1, 2 and 6 72
Table 8 Global motion from video sequence 3 73
… 3 hours each. That is why video retrieval systems are being researched and developed more and more around the world.
In Vietnam, there is more and more research and application on multimedia data to match the development and requirements of modern life, and research on video has become important and essential. By means of this thesis, I tried to explore and summarize video research while implementing a part of a CBVIR system.
I.2 Aims and Objectives
In the MICA center, we are now developing a multimedia system including Speech Processing and Image Processing. The position of the video system in the MICA center is illustrated in Figure I-1.
Figure I-1 Position of video system in MICA central
I.3 Thesis outline
The thesis is divided into two main chapters plus a conclusion. These chapters are organized as follows.
Chapter II
This chapter provides basic information and characteristics of video and of a general CBVIR system. It also introduces the techniques used in video for feature extraction and video analysis.
Chapter III
This chapter describes the techniques used in practice: Harris corners, motion from Harris corners and how they relate to the CBVIR system. The results obtained in practice are shown in this chapter along with evaluations.
Chapter IV
This chapter concludes the thesis and gives directions for future work.
Chapter II Background
II.1 Video Formats and Frame Extraction System
II.1.1 Video Formats
There are a number of video formats usable in CBVIR systems; the choice depends on the database of the system. Some formats widely used on storage media such as VCD, DVD and hard disk are DAT, AVI, MPEG, MPG and MOV. In the CBVIR system at MICA, we use the MPEG-2 format when parsing the video stream. An advantage of the MPEG format is that it reduces video files to a small size, which makes many video processing systems feasible. When encoding an MPEG video stream, a "two-step" compression is applied: once for spatial compression and once for motion compression (motion compensation). The requirements of applications using MPEG video can be met almost anywhere: digital storage media require small size with quality good enough to process at reasonable cost, asymmetric applications require the ability to subdivide the video for delivery (e.g. online video games), and symmetric applications need a video format that can be compressed and decompressed at the same time. All these requirements are satisfied by using the MPEG format.
II.1.2 Introduction to MPEG
The Moving Pictures Experts Group, abbreviated MPEG, is part of the International Standards Organisation (ISO) and defines standards for digital video and digital audio. The primary task of this group was to define a coding standard for storing digital video and audio on CD. Meanwhile the demands have risen and, besides the CD, the DVD needs to be supported, as well as transmission equipment such as satellites and networks. All these operational uses are covered by a broad selection of standards. Well known are the standards MPEG-1, MPEG-2, MPEG-4 and MPEG-7. Each standard provides levels and profiles to support special applications in an optimised way.
II.1.2.1 MPEG-2 Video Standard
MPEG-2 video is an ISO/IEC standard that specifies the syntax and semantics of an encoded video bitstream. These include parameters such as the bit rates, picture sizes and resolutions which may be applied, and how the bitstream is decoded to reconstruct the pictures. What MPEG-2 does not define is how the decoder and encoder should be implemented, only that they should be compliant with the MPEG-2 bitstream. This leaves designers free to develop the best encoding and decoding methods whilst retaining compatibility. The range of possibilities of the MPEG-2 standard is so wide that not all features of the standard are used for all applications [KEITH].
II.1.2.2 MPEG-2 Encoding
One of the most interesting points of MPEG is that it reduces the size of the video as much as it can. This involves compression algorithms including spatial compression and temporal compression (motion compensation). The same method is also applied in other MPEG standards such as MPEG-1 and MPEG-4. For spatial compression, JPEG, the standard that reduces the size of an image, is used. By adjusting the various parameters, the compressed image size can be traded against reconstructed image quality over a wide range; image quality can go down to "browsing" quality at compression ratios around 100:1 [KEITH]. For temporal compression, motion compensation is used. Temporal compression is achieved by only encoding the difference between successive pictures. Imagine a scene where at first there is no movement, and then an object moves across the picture. The first picture in the sequence contains all the information required until there is any movement, so there is no need to encode any of the information after the first picture until the movement occurs. Thereafter, all that needs to be encoded is the part of the picture that contains movement. The rest of the scene is not affected by the moving object because it is still the same as in the first picture. The means by which the amount of movement between two successive pictures is determined is known as motion estimation prediction. The information obtained from this process is then used by motion compensated prediction to define the parts of the picture that can be discarded. This means that pictures cannot be considered in isolation: a given picture is constructed from the prediction from a previous picture, and may itself be used to predict the next picture.
Motion compensation can be described with the help of Figure II-1.
Figure II-1 Two consecutive frames from video sequence
As we can see in the two consecutive frames, the man on the right and the background stay static while the man on the left is moving. All the information we need to store is the background, the figure of the man on the right, and the motion of the figure of the man on the left. Motion compensation here means that the next frame is created from the previous frame plus the new part [CALIC]. The motion compensation we consider can be described visually in Figure II-2.
Figure II-2 Motion Compensation from MPEG video stream
The block diagram of the MPEG video encoder is shown in Figure II-3:
Figure II-3 Block diagram of MPEG video encoder
First of all, the frame data (raw data) are compressed by the DCT (Discrete Cosine Transform), by dividing a frame into macroblocks and calculating the block DCT coefficients. Quantization (Q1) with "zig-zag" scanning then produces optimized values for each macroblock. After that, MCP (motion compensated prediction) is used to exploit redundant temporal information that does not change from picture to picture.
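To make the "zig-zag" scanning concrete, the following is a minimal C++ sketch of reordering the 64 quantized coefficients of an 8x8 block along diagonals, so that runs of trailing zeros can be run-length coded; the function name and plain integer types are illustrative and not taken from any MPEG reference implementation.

#include <array>
#include <cstddef>

// Illustrative sketch: zig-zag scan of an 8x8 block of quantized DCT
// coefficients into a 1-D sequence (low frequencies first).
std::array<int, 64> zigzagScan(const int block[8][8]) {
    std::array<int, 64> out{};
    int x = 0, y = 0;          // current column and row inside the block
    bool goingUp = true;       // current diagonal direction (up-right vs down-left)
    for (std::size_t i = 0; i < 64; ++i) {
        out[i] = block[y][x];
        if (goingUp) {
            if (x == 7)      { ++y; goingUp = false; }   // right edge: step down
            else if (y == 0) { ++x; goingUp = false; }   // top row: step right
            else             { ++x; --y; }
        } else {
            if (y == 7)      { ++x; goingUp = true; }    // bottom row: step right
            else if (x == 0) { ++y; goingUp = true; }    // left edge: step down
            else             { --x; ++y; }
        }
    }
    return out;
}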
MPEG-2 defines three picture types:
I (Intraframe) pictures. These are encoded without reference to another picture, to allow for random access. In the MICA center, we use this type of frame during processing.
P (Predictive) pictures are encoded using motion compensated prediction from the previous picture and therefore contain a reference to the previous picture. They may themselves be used in subsequent predictions.
B (Bi-directional) pictures are encoded using motion compensated prediction from the previous and next pictures, which must be either I or P pictures. B pictures are not used in subsequent predictions.
Usually, I, B and P frames are mixed into a Group of Pictures (GOP). One GOP could include only I and P frames, or I, P and B frames. Depending on what is done at encoding time, the resulting MPEG stream differs; it depends on the frame types used in the GOP and on the order of I, B and P frames within the GOP.
One more important issue in the MPEG-2 standard is motion estimation. Motion estimation prediction is a method of determining the amount of movement contained between two pictures. This is achieved by dividing the picture to be encoded into sections known as macroblocks; the size of a macroblock is 16 x 16 pixels. Each macroblock is searched for the closest match in the search area of the picture it is being compared with. Motion estimation prediction is not used on I pictures; however, B and P pictures can refer to I pictures. For P pictures, only the previous picture is searched for matching macroblocks. In B pictures both the previous and next pictures are searched. When a match is found, the offset (or motion vector) between them is calculated. The matching parts are used to create a prediction picture, by using the motion vectors. The prediction picture is then compared in the same way with the picture to be encoded. The macroblocks which have no match to any part of the search area in the picture to be encoded represent the difference between the pictures, and these macroblocks are encoded. To understand more about the MPEG standard, see [MPEG] for details.
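The following is a minimal C++ sketch of the block-matching idea described above: an exhaustive search over a window in the reference picture, using the sum of absolute differences (SAD) as the matching criterion. Greyscale frames stored as flat arrays, the function name and the struct are illustrative assumptions, not part of the MPEG-2 standard, which does not prescribe any particular search strategy.

#include <vector>
#include <cstdlib>
#include <climits>

struct MotionVector { int dx, dy; };

// Full-search motion estimation for one 16x16 macroblock of the current
// picture against a reference picture.
MotionVector estimateMotion(const std::vector<unsigned char>& cur,
                            const std::vector<unsigned char>& ref,
                            int width, int height,
                            int mbX, int mbY,      // top-left corner of the macroblock
                            int searchRange)       // e.g. +/-16 pixels
{
    const int N = 16;                              // macroblock size
    long bestSad = LONG_MAX;
    MotionVector best{0, 0};
    for (int dy = -searchRange; dy <= searchRange; ++dy) {
        for (int dx = -searchRange; dx <= searchRange; ++dx) {
            int rx = mbX + dx, ry = mbY + dy;
            // skip candidate positions that fall outside the reference picture
            if (rx < 0 || ry < 0 || rx + N > width || ry + N > height) continue;
            long sad = 0;
            for (int y = 0; y < N; ++y)
                for (int x = 0; x < N; ++x)
                    sad += std::abs(int(cur[(mbY + y) * width + (mbX + x)]) -
                                    int(ref[(ry  + y) * width + (rx  + x)]));
            if (sad < bestSad) { bestSad = sad; best = {dx, dy}; }
        }
    }
    return best;   // offset of the best match = motion vector
}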
II.2 Content-based video indexing and retrieval system
II.2.1 Video sequence structure
A video stream is basically built from shots. A shot is a fundamental unit of video and depicts a continuous capture from the moment the camera is turned on until it is turned off for another shot. As illustrated in Figure II-4, when a video producer creates a video, they make it from shots, group them into scenes and embed some effects between shots.
One scene can be understood as a series of shots which are semantically constrained (e.g., a scene describing two people talking, sitting in chairs, with interleaved shots of other people at a party). That means it is not easy to detect a scene from signal features like colour or shape; it must be detected at the semantic level (mentioned in the next section). The lowest level in the video hierarchical structure is the frame. A CBVIR system extracts a shot into frames and selects the most interesting frames from them in the key frame selection step.
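A minimal C++ sketch of this hierarchy (video, scenes, shots, frames, key frames), corresponding to Figure II-4; the struct and field names are illustrative assumptions, not taken from any existing system.

#include <vector>
#include <string>

// A shot is a contiguous range of frames, from which key frames are selected.
struct Shot  { int firstFrame; int lastFrame; std::vector<int> keyFrames; };
// A scene groups semantically related shots.
struct Scene { std::vector<Shot> shots; };
// A video sequence (story) is a list of scenes.
struct Video { std::string name; std::vector<Scene> scenes; };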
II.2.2 Video data classification
To work with video, classification of video data is very important. In [ROWE], Rowe et al. classified video metadata into 3 categories for each video:
Bibliographic data: This category includes information about the
video (e.g., title, abstract, subject, and genre) and the individuals involved
in the video (e.g., producer, director, and cast)
Structural data: Video and movies can be described by a hierarchy
of movie, segment, scene, and shot where each entry in the hierarchy is composed of one or more entries at a lower level (e.g., a segment is composed of a sequence of scenes and a scene is composed of a sequence
of shots)
Content data: Users want to retrieve videos based on their content (i.e., the audio and visual content). In addition, because of the nature of video, the visual content is a combination of static content (frames) and dynamic content. Thus, the content indexes may be sets of key frames that represent major images, and object indexes that indicate entry and exit frames for each appearance of a significant object or individual.
With this classification, to work with a video stream we can normally obtain the bibliographic data from other information embedded inside the video stream or, more simply, from text that appears in the video stream (using text recognition [LIENHART1]). To determine the structural data, the system must rebuild the structure from the primitive elements (frames) using some techniques (find refer). In most CBVIR systems, content data is used as the major element for working with video, and in my thesis I work with the frame as the primitive element.
Another way to classify video data is based on its purpose; these are called purpose-based classes. In [LEW], Lew et al. classify video into 4 classes:
Entertainment: Information in this class is highly stylized depending on the particular sub-category: fiction, non-fiction and interactive. Film stories, TV programs and cartoons belong to fiction. With non-fiction, the information does not need to be "logical" or follow a story flow. Interactive video can be found in games.
Information: The most common informational video on television is news. It conveys information to viewers.
Communication: Video used in communication is different from playback video. It must be designed for communication, suitable for packet transmission (e.g., video conferences).
Data analysis: Scientific video recordings (e.g., videos in chemistry, biology, psychology…).
The way of classifying video into these classes is important because of their different structures. It helps to classify video information at the semantic level. Classifying video shots at this level is illustrated in [FAN]; in that paper, Fan et al. classify video into politics news, sports news, financial news, and so on. In [ROZENN], Rozenn et al. used features of sports to classify tennis or snooker video. Li et al. in [LI] detect particular events in sports broadcast video such as American football by determining where in the video stream a player hits the ball, where a goal occurs, and so on. In [SATOH], to index news video, anchor-person detection by face detection is used; it combines face features and video captions (text) to recognize the appearance of a person in the video. All of these applications try to classify any video stream into its semantic class.
II.2.3 Camera operations
There are two sources of motion in video: the camera and the objects. A camera shot contains only object motion when the camera stays static, with no adjustment, and objects such as people, animals or vehicles are moving in front of it. Conversely, camera motions are generated by moving, rotating or using the zoom function of the camera. When moving the camera, as often done in film production, the video camera lies on a support that moves along a track; usually this case is used to follow a moving object. The four common directions of camera movement are illustrated in Figure II-5. Some other cases of video motion created by a stationary camera are shown in Figure II-6.
Figure II-5 Common directions of moving video camera
Figure II-6 Common rotation and zoom of stationary video camera
II.2.4 Introduction to CBVIR system
How much information is stored in one video shot? It is said that "one image is worth a thousand words", and here one video can hold thousands of images. Moreover, one video shot can also store other information such as sound, voice and text, plus the one feature which makes video impressive: motion. Thus, any CBVIR system can be seen as an extension of image, sound, voice and text indexing and retrieval systems. Besides, motion extracted from the image sequence, and information coming from the collection of images, are also used in a CBVIR system. A CBVIR system must satisfy the main target of indexing and retrieval: end-users give queries and, according to their criteria, the CBVIR system should return the results that are most similar to the queries. But imagine taking the image or video query and browsing the entire video stream frame by frame; this raises the big problem of very high time cost, and sometimes the result is not exactly what we want. A CBVIR system for end-users can be seen simply in Figure II-7. Such a system can use queries made of video streams, images, sounds, texts, or a combination of all of them, and returns to the end user the most similar results.
Figure II-7 CBVIR common system diagram
In [FAISAL], Faisal I. Bashir gave the classification of video modeling schemes demonstrated in Figure II-8. At Level I, a CBVIR system uses signal features such as the colour, shape and texture of the raw video data. Techniques at this level tend to model the apparent characteristics of "stuff" (as opposed to "things") inside video clips. Video data is continuous and unstructured; to analyze and understand its contents, the video needs to be parsed into smaller chunks (frames, shots and scenes). At Level II, the system analyzes and synthesizes these features to obtain logical and statistical features (computer vision), and both of these derived results are used for semantic representation. At the semantic level, the video bitstream, which contains the audio stream and possibly closed-caption text along with the sequence of images, carries a wealth of rich information about the objects and events being depicted. Once the feature-level summarization is done, a semantic-level description based on conceptual models, built on a knowledge base, is needed. In the following sections, we review techniques that try to bridge the semantic gap and present a high-level picture obtained from the video data. One example can be found in [FAN], where the authors tried to classify each shot of video into semantic scenes of news, sports, science information, and so on. A CBVIR system belongs to one of these levels depending on its purpose.
Figure II-8 Classification of video modeling techniques: Level I with video raw data, Level II with derived or logical features, and Level III with semantic level abstraction
II.2.5 CBVIR Architecture
In [NEVENKA], the authors describe a CBVIR system which can be considered the most standard one. They perceive a video as a document. They compared CBIR of text with CBIR of video and pointed out the analogy. In a CBIR system for text, to make the system efficient, documents must be decomposed into elements: paragraphs, sentences and words. After that, a content table is made that maps to the document, keywords are extracted from the document by features, and a text query is used to retrieve over all of these indexed keywords. Similarly, in a CBVIR system, we should segment a video document into shots and scenes to compose a table of contents, and we should extract key frames or key sequences as index entries for scenes or stories. Therefore, the core research in content-based video retrieval is developing technologies to automatically parse video, audio, and text to identify meaningful composition structure and to extract and represent content attributes of any video source.
Figure II-9 Process diagram of CBVIR system
Figure II-9 illustrates the processes of a common CBVIR system. There are four main processes: feature extraction, structure analysis, abstraction and indexing. There is much research on each process, each one has its own challenges, and here we briefly review each of them.
Feature extraction: As mentioned in the sections before, each video contains many features: image features, voice features, sound features and text features. Feature extraction separates all these features, depending on the system, to serve the parsing of the video stream (shot detection, scene detection, motion detection…). The usual and effective way is to combine features, which opens many ways to approach a CBVIR system.
Structure analysis: In Figure II-4, the top level is the video sequence, which also corresponds to the stories. The video sequence is partitioned into scenes (analogous to paragraphs in a document), each scene is composed of a set of shots (like sentences in paragraphs), and each shot is constructed from a frame sequence. Structure analysis must be able to decompose all these elements. However, scene separation is sometimes impossible at this stage because it is based on the stories and must be recognized at a higher level (Level III), as illustrated in Figure II-8. Normally, to separate scenes from each other, most techniques are based on film production rules; but in fact this still faces many challenges and is less successful. For shot separation, there are many ways to proceed, such as using colour or motion; most shot algorithms use visual information. The last step, frame extraction, is video format dependent.
Video abstraction: The original video is always longer than the needed information. Abstraction is similar to the step of finding keywords in CBIR of text; it reduces the information to retrieve (and hence the time cost) in a CBVIR system. For example, in a video of a football match, scenes coming from the stands of the stadium are redundant if we only consider match events. Abstraction rejects all shots and scenes that the system does not need to care about; that means abstraction must be executed over all shots, scenes and stories. Video content abstraction includes skimming, highlights and summary. A video skim is a condensed representation of the video containing keywords, frames, visual, and audio sequences. Highlights normally involve detection of important events in the video. A summary means that we preserve important structural and semantic information in a short version of the video represented via key audio, video, frames, and/or segments. One of the most popular ways is key frame extraction. Key frames play an important role in the video abstraction process. Key frames are still images, extracted from the original video data, which best represent the content of shots in an abstract manner. The representational power of a set of key frames depends on how they are chosen from all frames of a sequence. Not all image frames within a sequence are equally descriptive, and the challenge is how to automatically determine which frames are most representative. An even more challenging task is to detect a hierarchical set of key frames such that a subset at a given level represents a certain granularity of video content, which is critical for content-based video browsing. Researchers have developed many effective algorithms, although robust key frame extraction remains a challenging research topic.
Indexing for retrieval and browsing: The structural and content attributes extracted in the feature extraction, video parsing, and abstraction processes, or the attributes that are entered manually, are often referred to as metadata. Based on these attributes, we can build video indices and the table of contents through, for instance, a clustering process that classifies sequences or shots into different visual categories, or an indexing structure. As in many other database systems, we need schemes and tools to use the indices and content metadata to query, search, and browse large video databases. Researchers have developed numerous schemes and tools for video indexing and querying. However, robust and effective tools tested by thorough experimental evaluation with large data sets are still lacking. Therefore, in the majority of cases, retrieving or searching video databases by keywords or phrases will be the mode of operation. In some cases, we can retrieve with reasonable performance by content similarity defined by low-level visual features of, for instance, key frames and example-based queries.
II.3 Features Extraction
Each CBVIR system chooses its own approach to searching and browsing. The features used are image features, audio features and text features. In this section, I introduce the features used widely in CBVIR systems and give an overview of each one; within the limits of this thesis, it is impossible to discuss each subject in detail.
As we know, each frame in a video stream is equivalent to one image. That means anything we can do on an image can be applied to video. One thing that makes video differ from images is the constraint between consecutive image features. There are three low-level features often used in CBIR of images that map to video: colour, texture and shape.
Color:
RGB color space
The RGB model is one of the first practical models in the area of colours and contains the recipe for creating colours. This model emerged in an evident manner at the time of the birth of television (1908 and the following years). It is the model resulting from the receiving characteristics of the eye and is based on the fact that the impression of almost all colours in the eye can be evoked by mixing, in fixed proportions, three selected clusters of light of properly chosen widths of spectrum. The three components (R, G, B) (R=RED, G=GREEN, B=BLUE) identify a colour in the RGB model. The RGB colour space is described in Figure II-10.
HSV color space
One more proposal for a model of colour description, suggested by Alvy Ray Smith, appeared in 1978. The symbols in the name of the model are the first letters of the English names for the components of the colour description. The model can be considered as a cone with a round base. The dimensions of the cone are described by the component H (Hue), the component S (Saturation) and the component V (Value).
Figure II-10 RGB color space (picture source [SEMMIX])
Figure II-11 HSV color space (picture source [SEMMIX])
The hue H is the angular component, the saturation S is the radial component and the brightness V is the height component. Figure II-11 illustrates the way colour is created on the surface of the cylinder. The HSV colour space is very useful in CBIR of images because it is close to human eye perception.
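As an illustration of the relation between the two spaces, the following is a small C++ sketch of the standard conversion from an RGB pixel (components in [0, 1]) to HSV, with H as an angle in degrees, S the radial saturation and V the brightness; the struct and function names are illustrative.

#include <algorithm>
#include <cmath>

struct Hsv { double h, s, v; };

// Standard RGB -> HSV conversion for components in [0, 1].
Hsv rgbToHsv(double r, double g, double b) {
    double maxC = std::max({r, g, b});
    double minC = std::min({r, g, b});
    double delta = maxC - minC;

    double h = 0.0;
    if (delta > 0.0) {
        if (maxC == r)      h = 60.0 * std::fmod((g - b) / delta, 6.0);
        else if (maxC == g) h = 60.0 * ((b - r) / delta + 2.0);
        else                h = 60.0 * ((r - g) / delta + 4.0);
        if (h < 0.0) h += 360.0;          // keep the hue angle in [0, 360)
    }
    double s = (maxC > 0.0) ? delta / maxC : 0.0;  // radial saturation
    double v = maxC;                               // brightness (value)
    return {h, s, v};
}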
Color Histogram:
Colour histograms of frames are compared to measure how close their colour distributions are. This method is used very widely because of its low time cost and its high effectiveness.
Texture:
Tamura features
Tamura features are based on the theory of texture [TAMURA]. Three of them, coarseness, contrast and directionality, are commonly used. Figure II-12 illustrates these three Tamura features. Each feature has its own measurement that separates it into different levels; for example, in Figure II-12.a the right image is coarser than the left one and has a greater coarseness value.
Gabor filter
Given an image I(x, y), with m and n denoting scale and orientation respectively, the Gabor transform can be written as

G_{mn}(x, y) = \sum_{s}\sum_{t} I(x - s,\, y - t)\; g_{mn}^{*}(s, t)

where s, t are the size variants of the mask filter and g is the mother wavelet function, defined as

g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\!\left[ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + 2\pi j W x \right]

W = f(ω, θ, φ), where ω is the given frequency, θ is the deviation angle and φ is the given phase. The effect of these values can be seen in Figure II-13.
Figure II-13 Effect of Gabor Filter to image results
In Figure II-13, with θ = 0° and ω = 30 Hz, the resulting image (on the left) contains only the elements with direction 0° and frequency 30 Hz. Changing these argument values changes the resulting image; that means the Gabor filter is used to filter the image and extract the needed textures.
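A minimal C++ sketch of building such a filter is given below. It generates only the real (cosine) part of a 2-D Gabor kernel, i.e. a Gaussian envelope modulated by a cosine of frequency freq along direction theta with phase phi; the parameter names and the isotropic envelope are simplifying assumptions and do not reproduce the exact complex formulation above.

#include <vector>
#include <cmath>

// Real part of a (2*halfSize+1) x (2*halfSize+1) Gabor kernel.
std::vector<std::vector<double>> gaborKernel(int halfSize, double sigma,
                                             double freq, double theta,
                                             double phi)
{
    const double PI = 3.14159265358979323846;
    int size = 2 * halfSize + 1;
    std::vector<std::vector<double>> k(size, std::vector<double>(size));
    for (int y = -halfSize; y <= halfSize; ++y) {
        for (int x = -halfSize; x <= halfSize; ++x) {
            // rotate coordinates so the sinusoid runs along direction theta
            double xr = x * std::cos(theta) + y * std::sin(theta);
            double envelope = std::exp(-(x * x + y * y) / (2.0 * sigma * sigma));
            double carrier  = std::cos(2.0 * PI * freq * xr + phi);
            k[y + halfSize][x + halfSize] = envelope * carrier;
        }
    }
    return k;   // convolve the image with this kernel to extract one texture channel
}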
Shape:
Shape representations can be classified into contour-based and region-based techniques. Both categories have been investigated and several approaches have been developed. In CBIR of images, the shape of an object is extracted and compared across the image database.
II.4 Structure Analysis
Structure analysis includes detecting scenes and shot breaks and extracting frames. In this section, I only introduce shot detection, because this is the most important task when analyzing video.
II.4.1 Shot Transitions classification
A shot transition is a technique used to lead the viewer from one shot to another [JONE, KATZZ]. Normally, before movie production (the concatenation of shots to make the movie story), each shot runs from the time the camera is turned on until it is turned off to change to another shot. One property that separates shots from each other is continuity: a shot boundary appears wherever there is a sudden change of the continuous scene along the time line. After transition techniques are applied, separating shots becomes more difficult. Detecting shot changes is a very important step in any CBVIR system. Shot transitions can be categorized as follows:
Cut: one shot changes to another instantaneously; see Figure II-14.a.
Fade-in: a shot emerges from a constant image. This constant image is usually a black frame. Fade-ins are most often found in the first shot of a movie, at the beginning; see Figure II-14.b.
Fade-out: the reverse of a fade-in; a shot changes into a constant image. See Figure II-14.c.
Dissolve: the first shot fades out while the second shot fades in. As seen in Figure II-14.d, a dissolve effect appears after frame #43 and before frame #83.
Wipe: the wipe is the most complicated transition because it is generated in video production. Figure II-14.e shows only one type of wipe (push left). Figure II-15 illustrates techniques used for this kind of shot transition; the arrows indicate how the second shot appears over the first shot.
A thorough understanding of shot transitions is the way to approach shot detection. We will see how to automatically detect shot transitions in the next section; today, research on shot detection still continues because of its difficulty and interest.
Figure II-14 Shot Transitions (a) cut (b) fade-in (c) fade-out (d) dissolve (e) wipe
Figure II-15 Some transition effects for wipe (pictures taken from Pinnacle Software)
II.4.2 Shot Detection Techniques
There are many ways to detect where a shot boundary appears. In general, automatic shot detection methods can be classified into 5 categories: pixel-based, statistics-based, transform-based, feature-based, and histogram-based.
Pixel difference method:
Pair-wise pixel comparison (also called template matching) compares the corresponding pixels in two successive frames. With f_i(x, y) denoting the intensity of pixel (x, y) in frame f_i, the difference between frame f_i and frame f_{i+1} is defined as:

D(f_i, f_{i+1}) = \frac{1}{X \cdot Y} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \left| f_i(x, y) - f_{i+1}(x, y) \right|

where X and Y are the frame width and height.
Otsuji et al. [OTSUJI] and Zhang et al. [ZHANG] count the number of changed pixels, and a camera shot break is declared if the percentage of changed pixels over the total number of pixels exceeds a threshold. The pixel difference with threshold t is:

DP_i(x, y) = \begin{cases} 1 & \text{if } \left| f_i(x, y) - f_{i+1}(x, y) \right| > t \\ 0 & \text{otherwise} \end{cases}
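A minimal C++ sketch of this criterion follows: it counts the pixels whose intensity change between two greyscale frames exceeds t and declares a shot break when their fraction exceeds a second threshold T. Flat 8-bit frames and the function name are illustrative assumptions.

#include <vector>
#include <cstdlib>

// Pair-wise pixel comparison: returns true when a shot break is declared.
bool pixelDifferenceCut(const std::vector<unsigned char>& f1,
                        const std::vector<unsigned char>& f2,
                        int t,        // per-pixel intensity threshold
                        double T)     // fraction of changed pixels required
{
    std::size_t changed = 0;
    for (std::size_t p = 0; p < f1.size() && p < f2.size(); ++p)
        if (std::abs(int(f1[p]) - int(f2[p])) > t) ++changed;
    return double(changed) / double(f1.size()) > T;
}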
Block-based method:
In contrast to template matching, which is based on global image characteristics (pixel-by-pixel differences), block-based approaches use local characteristics to increase robustness to camera and object movement [ZHANG]. Let \mu_i and \mu_{i+1} be the mean intensity values of a given region (block) in two consecutive frames and \sigma_i and \sigma_{i+1} be the corresponding variances. The frame difference is defined as the percentage of regions whose likelihood ratio \lambda exceeds a pre-defined threshold t:

\lambda = \frac{\left[ \dfrac{\sigma_i + \sigma_{i+1}}{2} + \left( \dfrac{\mu_i - \mu_{i+1}}{2} \right)^2 \right]^2}{\sigma_i \cdot \sigma_{i+1}}

DP_i(x, y) = \begin{cases} 1 & \text{if } \lambda > t \\ 0 & \text{otherwise} \end{cases}
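A small C++ sketch of the per-block statistic, assuming the block means and variances have already been computed; the small epsilon guarding the division is my addition and not part of the formula above.

#include <cmath>

// Likelihood ratio for one block, from its mean intensities (mu1, mu2) and
// variances (var1, var2) in two consecutive frames. A block is marked as
// "changed" when the returned ratio exceeds a chosen threshold t.
double likelihoodRatio(double mu1, double var1, double mu2, double var2) {
    double m = (mu1 - mu2) / 2.0;
    double num = (var1 + var2) / 2.0 + m * m;
    return (num * num) / (var1 * var2 + 1e-9);   // epsilon avoids division by zero
}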
Histogram comparison:
A step further towards reducing sensitivity to camera and object movements can be taken by comparing the histograms of successive images. The idea behind histogram-based approaches is that two frames with an unchanging background and unchanging (although moving) objects will show little difference in their histograms. In addition, histograms are invariant to image rotation and change slowly under variations of viewing angle and scale. As a disadvantage, one can note that two images with similar histograms may have completely different content; however, the probability of such events is low enough, and techniques for dealing with this problem have already been proposed in [PASS].
A grey level (colour) histogram of a frame i is an n-dimensional vector H_i(j), j = 1, …, n, where n is the number of grey levels (colours) and H_i(j) is the number of pixels of frame i with grey level (colour) j.
Global Comparison:
The simplest approach uses an adaptation of the pixel-difference metric above, but grey level histograms are compared instead of intensity values. A cut is declared if the absolute sum of histogram differences between two successive frames, D(f_i, f_{i+1}), is greater than a threshold t:

D(f_i, f_{i+1}) = \sum_{j=1}^{n} \left| H_i(j) - H_{i+1}(j) \right|

where H_i(j) is the histogram value for grey level j in frame i, j is the grey value and n is the total number of grey levels.
Another simple and very effective approach is to compare colour histograms. Zhang et al. [ZHANG] apply the same histogram difference where j, instead of a grey level, denotes a code value derived from the three colour intensities of a pixel. In order to reduce the number of bins (3 colours × 8 bits would create histograms with 2^24 bins), only the upper two bits of each colour intensity component are used to compose the colour code. This solution is demonstrated in Figure II-16, where the lower bits of each component are truncated. Comparison of the resulting 64 bins has been shown to give sufficient accuracy. When the difference is larger than a given threshold T, a cut is declared.
Figure II-16 Reducing the number of bits when calculating the histogram
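A minimal C++ sketch of this 64-bin colour-code comparison: the upper two bits of each R, G, B component form a 6-bit code, and the absolute bin-wise difference between consecutive frames is compared to a threshold to declare a cut. Interleaved 8-bit RGB frames and the function names are illustrative assumptions.

#include <vector>
#include <array>
#include <cstdlib>

// 64-bin colour-code histogram built from the upper two bits of R, G, B.
std::array<int, 64> colorCodeHistogram(const std::vector<unsigned char>& rgb) {
    std::array<int, 64> h{};
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
        int code = ((rgb[i]     >> 6) << 4) |   // upper 2 bits of R
                   ((rgb[i + 1] >> 6) << 2) |   // upper 2 bits of G
                    (rgb[i + 2] >> 6);          // upper 2 bits of B
        ++h[code];
    }
    return h;
}

// Absolute bin-wise histogram difference; a cut is declared when the
// returned value exceeds a chosen threshold T.
long histogramDifference(const std::array<int, 64>& h1,
                         const std::array<int, 64>& h2) {
    long d = 0;
    for (int j = 0; j < 64; ++j) d += std::abs(h1[j] - h2[j]);
    return d;
}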
A more robust variant compares the histograms with a \chi^2 test:

D(f_i, f_{i+1}) = \sum_{j=1}^{n} \frac{\left( H_i(j) - H_{i+1}(j) \right)^2}{H_{i+1}(j)}

Figure II-17 Cut (a) and fade/dissolve (b) from frame differences
Figure II-17 shows the cumulative differences between frames. Figure II-17.a shows the frame differences when a cut appears, with a clear peak, and Figure II-17.b shows the frame differences when a fade or dissolve appears. Cuts can be detected easily by thresholding, but fades and dissolves are more difficult. The next section explains how to detect fade and dissolve transitions.
Twin-comparison
The twin-comparison method [SMOLIAR] takes into account the cumulative differences between the frames of a gradual transition. In the first pass, a high threshold T_h is used to detect cuts, as shown in Figure II-18 (the first peak). In the second pass, a lower threshold T_l is employed to detect the potential starting frame F_s of a gradual transition. F_s is then compared to subsequent frames (Figure II-18). This is called an accumulated comparison, since during a gradual transition this difference value increases. The end frame F_e of the transition is detected when the difference between consecutive frames decreases to less than T_l while the accumulated comparison has increased to a value higher than T_h. If the consecutive difference falls below T_l before the accumulated difference exceeds T_h, then the potential start frame F_s is dropped and the search continues for other gradual transitions. It was found, however, that there are some gradual transitions during which the consecutive difference falls below the lower threshold. This problem can easily be solved by setting a tolerance value that allows a certain number of consecutive frames with low difference values before rejecting the transition candidate. As can be seen, twin comparison detects both abrupt and gradual transitions at the same time.
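The following is a simplified C++ sketch of this idea, not the exact procedure of [SMOLIAR]: it assumes a caller-supplied diff(i, j) returning the histogram difference between frames i and j, detects cuts with T_h, flags a candidate start with T_l, allows up to `tolerance` consecutive low-difference frames, and confirms a gradual transition with the accumulated difference. All names and the bookkeeping details are illustrative assumptions.

#include <vector>
#include <functional>

struct Transition { int start, end; bool gradual; };

std::vector<Transition> twinComparison(int frameCount,
                                       const std::function<double(int, int)>& diff,
                                       double Th, double Tl, int tolerance)
{
    std::vector<Transition> result;
    for (int i = 0; i + 1 < frameCount; ++i) {
        double d = diff(i, i + 1);
        if (d >= Th) {                              // abrupt cut
            result.push_back({i, i + 1, false});
        } else if (d >= Tl) {                       // potential start of a gradual transition
            int fs = i, j = i + 1, low = 0;
            while (j + 1 < frameCount && low <= tolerance) {
                if (diff(j, j + 1) < Tl) ++low;     // consecutive difference fell below Tl
                else low = 0;
                ++j;
            }
            if (diff(fs, j) > Th)                   // accumulated difference confirms it
                result.push_back({fs, j, true});
            i = j;                                  // resume the search after the candidate
        }
    }
    return result;
}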
Figure II-18 Twin Comparison (picture taken from [II.4 5])
Local histogram comparison:
Histogram-based approaches are simple and more robust to object and camera movements, but they ignore spatial information and therefore fail when two different images have similar histograms. On the other hand, block-based comparison methods make use of spatial information; they typically perform better than pair-wise pixel comparison but are still sensitive to camera and object motion and are also computationally expensive. By integrating the two paradigms, false alarms due to camera and object movement can be reduced while enough spatial information is retained to produce more accurate results.
The frame-to-frame difference between frames f_i and f_{i+1} is computed over b_n regions (blocks) as:

D(f_i, f_{i+1}) = \sum_{k=1}^{b_n} \sum_{j=1}^{n} \left| H_i(j, k) - H_{i+1}(j, k) \right|

where H_i(j, k) denotes the histogram value at grey level j for region (block) k, and b_n is the total number of blocks.
There are also some other methods used in shot detection: model-based segmentation, DCT-based methods, and vector quantization. Model-based segmentation is suited not only to detecting shot cuts or dissolves but also to detecting any type of wipe. In [HAMPAUR], Hampapur et al. gave a model for the variation of chromaticity or brightness between two consecutive frames and detected the change. In [IDRIS], Idris and Panchanathan use vector quantization to compress a video sequence using a codebook of size 256 and 64-dimensional vectors; the histogram of the labels obtained from the codebook for each frame is used as a frame similarity measure, and a χ² statistic is used to detect cuts. In DCT-based methods, the DCT coefficients are used to compare two video frames. The DCT is commonly used for reducing spatial redundancy in an image in different video compression schemes such as MPEG and JPEG. Compression of the video is carried out by dividing the image into a set of 8x8 pixel blocks; using the DCT, the pixels in the blocks are transformed into 64 coefficients which are quantized and Huffman entropy encoded. The DCT coefficients are analyzed to find frames where camera breaks take place. Since the coefficients in the frequency domain are mathematically related to the spatial domain, they can be used to detect changes in the video sequence.
II.5 Video motion
As mentioned before, motion is the most interesting feature appearing in video. From a video sequence there are many ways to extract motion, depending on the purpose of each system. And what kind of motion is considered? It could be the motion of each pixel in a frame (the optical flow), the motion of objects, the global motion, the motion of layers, and so on. Extracting motion has become important not only in low-level analysis but also in higher-level analysis. This section discusses some types of motion representation, with more detail on optical flow because of its widespread use and its primary role.
II.5.1 Motion trajectory
The motion trajectory of an object is a simple, high-level feature, defined as the localization, in time and space, of one representative point of this object. This descriptor is useful for content-based retrieval in object-oriented visual databases. It is also of help in more specific applications: in given contexts with a priori knowledge, the trajectory can enable much functionality. In surveillance, alarms can be triggered if some object has a trajectory identified as dangerous (e.g. passing through a forbidden area, being unusually quick, etc.). In sports, specific actions (e.g. tennis rallies taking place at the net) can be recognized. Besides, such a description also allows enhanced data interaction and manipulation: for semiautomatic multimedia editing, the trajectory can be stretched, shifted, etc., to adapt the object motion to any given sequence's global context.
The descriptor is essentially a list of key-points (x, y, z, t) along with a set of optional interpolating functions that describe the path of the object between key-points, in terms of acceleration. The speed is implicitly known from the key-point specification. The key-points are specified by their time instant and either their 2-D or 3-D Cartesian coordinates, depending on the intended application. The interpolating functions are defined for each component x(t), y(t), and z(t) independently.
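A minimal C++ sketch of such a trajectory descriptor follows: a list of key-points (x, y, z, t) plus a simple per-component linear interpolation to recover the position at an arbitrary time between two key-points. The struct and method names are illustrative, and only the linear case is sketched; second-order (acceleration-based) interpolation, as allowed by the description above, is omitted.

#include <vector>
#include <cstddef>

struct KeyPoint { double x, y, z, t; };

struct Trajectory {
    std::vector<KeyPoint> keyPoints;   // assumed sorted by increasing t

    // Linear interpolation of the object position at time t.
    KeyPoint positionAt(double t) const {
        if (keyPoints.empty()) return {0, 0, 0, t};
        if (t <= keyPoints.front().t) return keyPoints.front();
        if (t >= keyPoints.back().t)  return keyPoints.back();
        for (std::size_t i = 1; i < keyPoints.size(); ++i) {
            const KeyPoint& a = keyPoints[i - 1];
            const KeyPoint& b = keyPoints[i];
            if (t <= b.t) {
                double u = (t - a.t) / (b.t - a.t);   // interpolation factor in [0, 1]
                return { a.x + u * (b.x - a.x),
                         a.y + u * (b.y - a.y),
                         a.z + u * (b.z - a.z), t };
            }
        }
        return keyPoints.back();
    }
};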