PHẠM QUANG HẢI
CONTENT-BASED VIDEO INDEXING AND RETRIEVAL
Master of Engineering Thesis
Information Processing and Communication
Scientific supervisor:
Alain Boucher
Hà Nội - 2005
Abstract
Video indexing and retrieval is an important element of any large multimedia database. Because video carries a huge amount of information and requires large storage, research on video is still ongoing and open. This thesis works at the low-level features of video, concentrating on the corner feature of frames. Comparing corners between frames gives us the ability to detect cuts, gradual transitions and even many types of wipes among shot transitions in video. Continuing with corner-based motion combined with histogram features, and by measuring how far the motion moves, key frames are selected and made ready for indexing and retrieval applications.
The other side of the work uses segmentation of each frame and merges all regions that have the same motion. In this way, I separate the regions of a frame into layers, which will be used to index key objects.
One chapter of this thesis is reserved for learning how to index and retrieve video. It gives an overview of video indexing and retrieval systems: what they did, what they are doing and what they will do. This thesis is expected to contribute usefully to the multimedia system at MICA.
Acknowledgements
This work is part of the research at the Multimedia Information Communication Application (MICA) research center.
First of all, I would like to thank Dr Alain Boucher, IT lecturer at the Institut de la Francophonie pour l’Informatique (IFI), Vietnam, and leader of the Image Processing group at the MICA center, as my supervisor. Thank you for your support and for sharing your knowledge with me, for meeting to discuss the work every week, and for your patience during the time I worked; I am sorry for any inconvenience I caused you.
I also thank Le Thi Lan and Thomas Martin, members of MICA; I could not have done this thesis without your support. Thank you both for sharing your knowledge of image processing theory and of programming in C++ with a newbie like me.
I would like to thank the directors of MICA, Mr Nguyen Trong Giang, Mr Eric Castelli and Mrs Nguyen Thi Yen, who accepted me and helped me to have a good working environment at MICA. Thanks to the members of MICA who welcomed me to work there as a trainee; I have a very good impression of your friendly attitude and your help.
Finally, I want to thank my family: my two sisters who often looked after me even over the long distance from home, my parents who helped me whenever I was down, and my brother who visited me sometimes to tidy up my room because of my laziness.
Contents
Abstract i
Acknowledgements ii
Contents iii
List of abbreviations v
List of figures vi
List of tables viii
Chapter I Introduction 1
I.1 Content-based Video Indexing and Retrieval (CBVIR) 1
I.2 Aims and Objectives 1
I.3 Thesis outline 2
Chapter II Background 3
II.1 Video Formats and Frame Extraction System 3
II.1.1 Video Formats 3
II.1.2 Introduction to MPEG 3
II.2 Content-based video indexing and retrieval system 8
II.2.1 Video sequence structure 8
II.2.2 Video data classification 9
II.2.3 Camera operations 11
II.2.4 Introduction to CBVIR system 12
II.2.5 CBVIR Architecture 14
II.3 Features Extraction 18
II.4 Structure Analysis 21
II.4.1 Shot Transitions classification 21
II.4.2 Shot Detection Techniques 23
II.5 Video motion 29
II.5.1 Motion trajectory 30
II.5.2 Motion Parametric 31
II.5.3 Motion activity (or object motion) 32
II.5.4 Optical Flow 33
II.6 Video Abstraction 36
II.6.1 Introduction to Video Abstraction 36
II.6.2 Key frame extraction: 40
II.6.3 Video Abstraction 41
II.7 Video Indexing, Retrieval, and Browsing 43
II.8 Thesis Scope 44
Chapter III Video Indexing by Camera motions using Corner-based motion vector 45
III.1 Introduction 45
III.2 Video and Image Parsing in MICA 45
III.2.1 MPEG2 Video Decoder 45
III.2.2 Image Processing Library 46
III.3.2 Correspond points matching 49
III.4 Shot Transitions detection 51
III.4.1 Shot cut Detection algorithm 51
III.4.2 Shot cut detection description 52
III.4.3 Results and evaluation 58
III.5 Video Indexing 64
III.5.1 Motion Characterization 64
III.5.2 Corner-based motion vector 65
III.5.3 Global Motion calculation 66
III.5.4 Key frame extraction 74
III.5.5 Problem in object extraction 77
III.5.6 Video Indexing 77
III.6 Summary 78
Chapter IV Conclusions and Future Work 80
IV.1 Thesis Summary 80
IV.2 Future Works 81
List of abbreviations
CD: Compact Disk
DVD: Digital Versatile Disk
MPEG: Moving Pictures Experts Group
CBVIR: Content Based Video Indexing and Retrieval
CBIR: Content Based Indexing and Retrieval
IEC: International Electro-technical Commission
DCT: Discrete Cosine Transform
JPEG: Joint Photographic Experts Group
IDC: Inverse Discrete Cosine Transform
GOP: Group of Pictures
List of figures
Figure I-1 Position of video system in MICA central 2
Figure II-1 Two consecutive frames from video sequence 5
Figure II-2 Motion Compensation from MPEG video stream 6
Figure II-3 Block diagram of MPEG video encoder 6
Figure II-4 Video Hierarchical Structure 8
Figure II-5 Common directions of moving video camera 11
Figure II-6 Common rotation and zoom of stationary video camera 12
Figure II-7 CBVIR common system diagram 13
Figure II-8 Classification of video modeling technique Level I with video raw data, Level II with derived or logical features, and Level III with semantic level abstraction 14
Figure II-9 Process diagram of CBVIR system 15
Figure II-10 RGB color space (picture source [SEMMIX]) 19
Figure II-11 HSV color space (picture source [SEMMIX]) 19
Figure II-12 Tamura features and their values (a) Coarseness (b) Contrast (c) Directionality 20
Figure II-13 Effect of Gabor Filter to image results 20
Figure II-14 Shot Transitions (a) cut (b) fade-in (c) fade-out (d) dissolve (e) wipe 23
Figure II-15 Some transition effects for wipe (pictures taken from Pinnacle Software) 23
Figure II-16 Reducing the number of bits when calculating the histogram 26
Figure II-17 Cut (a) and (Fade/Dissolve) from frame difference 27
Figure II-18 Twin Comparison (picture taken from [II.4 5]) 28
Figure II-19: Head tracking for determining trajectories 31
Figure II-20: The 2D motion trajectory (third direction is frame time line) 31
Figure II-21 Optical flow (a) two frames from a video sequence (b) optical flow 33
Figure II-22 Optical flow field produced by pan and track, tilt and boom, zoom and dolly 34
Figure II-23 Motion segmentation by optical flow 36
Figure II-24 Local and Global Contextual Information 42
Figure III-1 Relation between R and eigenvalues 48
Figure III-2 Harris Corner in image with different given corner number 49
Figure III-3 (a) Two frames extracted while camera pans right (b) corresponding point results, drawn as lines in frame#760 51
Figure III-4 Results from no shot transition 54
Figure III-5 Results from shot cut transition 55
Figure III-6 Results from dissolve 57
Figure III-7 Correspondent points matching numbers in one video sequence 60
Figure III-8 Two frames from two shots but similar 62
Figure III-9 Correspondent points in video sequence 3 62
Figure III-10 Frame sequence from video sequence 3 63
Figure III-11 Keep motion vectors by given threshold for magnitudes 65
Figure III-12 8 used directions for standardizing vector directions 66
Figure III-13 Some consecutive frames from pan right shot 66
Figure III-14 Video frame from video sequence 1 72
Figure III-15 Video frame from video sequence 2 72
Figure III-16 Key frame selection from video mosaic 74
Figure III-17 Key frames are selected from motion graph 75
Figure III-18 Complicated motion graph from video 75
Figure III-19 cases of vector graph 76
Figure III-20 Results of key frame selection 76
Figure III-21 Hierarchical indexing for CBVIR system 78
List of tables
Table 1 Test data used for shot cut algorithm 59
Table 2 Shot detection result from test data 60
Table 3 Four types of detection an algorithm can make 61
Table 4 Vector directions rule 69
Table 5 Calculating global motion from set of corner-based vector 70
Table 6 Video sequence for global motion 71
Table 7 Three tables of motion vectors for video sequences 1, 2 and 6 72
Table 8 Global motion from video sequence 3 73
… 3 hours each. That is why video retrieval systems are being researched and developed more and more around the world.
In Vietnam, there is more and more research and application on multimedia data to match the development and requirements of modern life, and research on video has become important and essential. By means of this thesis, I tried to explore and summarize video research while implementing a part of a CBVIR system.
I.2 Aims and Objectives
In the MICA center, we are now developing a multimedia system including Speech Processing and Image Processing. The position of the video system in the MICA center is illustrated in Figure I-1.
Figure I-1 Position of video system in MICA central
I.3 Thesis outline
The thesis is divided into two main chapters plus a conclusion. These chapters are organized as follows.
Chapter II
This chapter provides basic information and characteristics of video and of a general CBVIR system. It also introduces the techniques used in video for feature extraction and video analysis.
Chapter III
This chapter describes the techniques used in practice: Harris corners, motion from Harris corners and how they relate to the CBVIR system. The results obtained in practice are shown in this chapter along with evaluations.
Chapter IV
This chapter concludes the thesis and gives directions for future work.
Chapter II Background
II.1 Video Formats and Frame Extraction System
II.1.1 Video Formats
There are a number of video formats usable in CBVIR systems; the choice depends on the database of the system. Some formats widely used on storage media such as VCD, DVD and hard disk are DAT, AVI, MPEG, MPG and MOV. In the CBVIR system at MICA, we use the MPEG-2 format when parsing the video stream. An advantage of the MPEG format is that it reduces video files to a small size, which makes many video processing systems feasible. When encoding an MPEG video stream, a "two-step" compression is applied: once for spatial compression and once for motion compression (motion compensation). The requirements of applications using MPEG video can be met almost anywhere: digital storage media require small size with quality good enough to process at reasonable cost, asymmetric applications require the ability to subdivide the video for delivery (e.g. online video games), and symmetric applications need a video format that can be compressed and decompressed at the same time. All these requirements are satisfied by using the MPEG format.
II.1.2 Introduction to MPEG
The Moving Pictures Experts Group, abbreviated MPEG, is part of the International Standards Organisation (ISO) and defines standards for digital video and digital audio. The primary task of this group was to define a coding standard for storing digital video and audio on CD. Meanwhile the demands have risen and, besides the CD, the DVD needs to be supported, as well as transmission equipment such as satellites and networks. All these operational uses are covered by a broad selection of standards. Well known are the standards MPEG-1, MPEG-2, MPEG-4 and MPEG-7. Each standard provides levels and profiles to support special applications in an optimised way.
II.1.2.1 MPEG-2 Video Standard
MPEG-2 video is an ISO/IEC standard that specifies the syntax and semantics of an encoded video bitstream. These include parameters such as the bit rates, picture sizes and resolutions which may be applied, and how the bitstream is decoded to reconstruct the pictures. What MPEG-2 does not define is how the decoder and encoder should be implemented, only that they should be compliant with the MPEG-2 bitstream. This leaves designers free to develop the best encoding and decoding methods whilst retaining compatibility. The range of possibilities of the MPEG-2 standard is so wide that not all features of the standard are used for all applications [KEITH].
II.1.2.2 MPEG-2 Encoding
One of the most interesting points of MPEG is that it reduces the size of the video as much as it can. This involves compression algorithms including spatial compression and temporal compression (motion compensation). The same method is also applied in other MPEG standards such as MPEG-1 and MPEG-4. For spatial compression, JPEG, the standard that reduces the size of an image, is used. By adjusting the various parameters, the compressed image size can be traded against reconstructed image quality over a wide range; image quality can go down to "browsing" quality at compression ratios around 100:1 [KEITH]. For temporal compression, motion compensation is used. Temporal compression is achieved by only encoding the difference between successive pictures. Imagine a scene where at first there is no movement, and then an object moves across the picture. The first picture in the sequence contains all the information required until there is any movement, so there is no need to encode any of the information after the first picture until the movement occurs. Thereafter, all that needs to be encoded is the part of the picture that contains movement. The rest of the scene is not affected by the moving object because it is still the same as in the first picture. The means by which the amount of movement between two successive pictures is determined is known as motion estimation prediction. The information obtained from this process is then used by motion compensated prediction to define the parts of the picture that can be discarded. This means that pictures cannot be considered in isolation: a given picture is constructed from the prediction from a previous picture, and may itself be used to predict the next picture.
Motion compensation can be described with the help of Figure II-1.
Figure II-1 Two consecutive frames from video sequence
As we can see in the two consecutive frames, the man on the right and the background stay static while the man on the left is moving. All the information we need to store is the background, the figure of the man on the right, and the motion of the figure of the man on the left. Motion compensation here means that the next frame is created from the previous frame plus the new part [CALIC]. The motion compensation we consider can be described visually in Figure II-2.
Figure II-2 Motion Compensation from MPEG video stream
The block diagram of the MPEG video encoder is shown in Figure II-3:
Figure II-3 Block diagram of MPEG video encoder
First of all, the frame data (raw data) are compressed by the DCT (Discrete Cosine Transform), by dividing a frame into macroblocks and calculating the block DCT coefficients. Quantization (Q1) with "zig-zag" scanning then produces optimized values for each macroblock. After that, MCP (motion compensated prediction) is used to exploit redundant temporal information that does not change from picture to picture.
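To make the "zig-zag" scanning concrete, the following is a minimal C++ sketch of reordering the 64 quantized coefficients of an 8x8 block along diagonals, so that runs of trailing zeros can be run-length coded; the function name and plain integer types are illustrative and not taken from any MPEG reference implementation.

#include <array>
#include <cstddef>

// Illustrative sketch: zig-zag scan of an 8x8 block of quantized DCT
// coefficients into a 1-D sequence (low frequencies first).
std::array<int, 64> zigzagScan(const int block[8][8]) {
    std::array<int, 64> out{};
    int x = 0, y = 0;          // current column and row inside the block
    bool goingUp = true;       // current diagonal direction (up-right vs down-left)
    for (std::size_t i = 0; i < 64; ++i) {
        out[i] = block[y][x];
        if (goingUp) {
            if (x == 7)      { ++y; goingUp = false; }   // right edge: step down
            else if (y == 0) { ++x; goingUp = false; }   // top row: step right
            else             { ++x; --y; }
        } else {
            if (y == 7)      { ++x; goingUp = true; }    // bottom row: step right
            else if (x == 0) { ++y; goingUp = true; }    // left edge: step down
            else             { --x; ++y; }
        }
    }
    return out;
}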
MPEG-2 defines three picture types:
I (Intraframe) pictures. These are encoded without reference to another picture, to allow for random access. In the MICA center, we use this type of frame during processing.
P (Predictive) pictures are encoded using motion compensated prediction from the previous picture and therefore contain a reference to the previous picture. They may themselves be used in subsequent predictions.
B (Bi-directional) pictures are encoded using motion compensated prediction from the previous and next pictures, which must be either I or P pictures. B pictures are not used in subsequent predictions.
Usually, I, B and P frames are mixed into a Group of Pictures (GOP). One GOP could include only I and P frames, or I, P and B frames. Depending on what is done at encoding time, the resulting MPEG stream differs; it depends on the frame types used in the GOP and on the order of I, B and P frames within the GOP.
One more important issue in the MPEG-2 standard is motion estimation. Motion estimation prediction is a method of determining the amount of movement contained between two pictures. This is achieved by dividing the picture to be encoded into sections known as macroblocks; the size of a macroblock is 16 x 16 pixels. Each macroblock is searched for the closest match in the search area of the picture it is being compared with. Motion estimation prediction is not used on I pictures; however, B and P pictures can refer to I pictures. For P pictures, only the previous picture is searched for matching macroblocks. In B pictures both the previous and next pictures are searched. When a match is found, the offset (or motion vector) between them is calculated. The matching parts are used to create a prediction picture, by using the motion vectors. The prediction picture is then compared in the same way with the picture to be encoded. The macroblocks which have no match to any part of the search area in the picture to be encoded represent the difference between the pictures, and these macroblocks are encoded. To understand more about the MPEG standard, see [MPEG] for details.
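The following is a minimal C++ sketch of the block-matching idea described above: an exhaustive search over a window in the reference picture, using the sum of absolute differences (SAD) as the matching criterion. Greyscale frames stored as flat arrays, the function name and the struct are illustrative assumptions, not part of the MPEG-2 standard, which does not prescribe any particular search strategy.

#include <vector>
#include <cstdlib>
#include <climits>

struct MotionVector { int dx, dy; };

// Full-search motion estimation for one 16x16 macroblock of the current
// picture against a reference picture.
MotionVector estimateMotion(const std::vector<unsigned char>& cur,
                            const std::vector<unsigned char>& ref,
                            int width, int height,
                            int mbX, int mbY,      // top-left corner of the macroblock
                            int searchRange)       // e.g. +/-16 pixels
{
    const int N = 16;                              // macroblock size
    long bestSad = LONG_MAX;
    MotionVector best{0, 0};
    for (int dy = -searchRange; dy <= searchRange; ++dy) {
        for (int dx = -searchRange; dx <= searchRange; ++dx) {
            int rx = mbX + dx, ry = mbY + dy;
            // skip candidate positions that fall outside the reference picture
            if (rx < 0 || ry < 0 || rx + N > width || ry + N > height) continue;
            long sad = 0;
            for (int y = 0; y < N; ++y)
                for (int x = 0; x < N; ++x)
                    sad += std::abs(int(cur[(mbY + y) * width + (mbX + x)]) -
                                    int(ref[(ry  + y) * width + (rx  + x)]));
            if (sad < bestSad) { bestSad = sad; best = {dx, dy}; }
        }
    }
    return best;   // offset of the best match = motion vector
}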
II.2 Content-based video indexing and retrieval system
II.2.1 Video sequence structure
A video stream is basically built from shots. A shot is a fundamental unit of video and depicts a continuous capture from the moment the camera is turned on until it is turned off for another shot. As illustrated in Figure II-4, when a video producer creates a video, they make it from shots, group them into scenes and embed some effects between shots.
One scene can be understood as a series of shots which are semantically constrained (e.g., a scene describing two people talking, sitting in chairs, with interleaved shots of other people at a party). That means it is not easy to detect a scene from signal features like colour or shape; it must be detected at the semantic level (mentioned in the next section). The lowest level in the video hierarchical structure is the frame. A CBVIR system extracts a shot into frames and selects the most interesting frames from them in the key frame selection step.
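A minimal C++ sketch of this hierarchy (video, scenes, shots, frames, key frames), corresponding to Figure II-4; the struct and field names are illustrative assumptions, not taken from any existing system.

#include <vector>
#include <string>

// A shot is a contiguous range of frames, from which key frames are selected.
struct Shot  { int firstFrame; int lastFrame; std::vector<int> keyFrames; };
// A scene groups semantically related shots.
struct Scene { std::vector<Shot> shots; };
// A video sequence (story) is a list of scenes.
struct Video { std::string name; std::vector<Scene> scenes; };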
II.2.2 Video data classification
To work with video, classification of video data is very important. In [ROWE], Rowe et al. classified video metadata into 3 categories for each video:
Bibliographic data: This category includes information about the
video (e.g., title, abstract, subject, and genre) and the individuals involved
in the video (e.g., producer, director, and cast)
Structural data: Video and movies can be described by a hierarchy
of movie, segment, scene, and shot where each entry in the hierarchy is composed of one or more entries at a lower level (e.g., a segment is composed of a sequence of scenes and a scene is composed of a sequence
of shots)
Content data: Users want to retrieve videos based on their content (i.e., the audio and visual content). In addition, because of the nature of video, the visual content is a combination of static content (frames) and dynamic content. Thus, the content indexes may be sets of key frames that represent major images, and object indexes that indicate entry and exit frames for each appearance of a significant object or individual.
With this classification, to work with a video stream we can normally obtain the bibliographic data from other information embedded inside the video stream or, more simply, from text that appears in the video stream (using text recognition [LIENHART1]). To determine the structural data, the system must rebuild the structure from the primitive elements (frames) using some techniques (find refer). In most CBVIR systems, content data is used as the major element for working with video, and in my thesis I work with the frame as the primitive element.
Another way to classify video data is based on its purpose; these are called purpose-based classes. In [LEW], Lew et al. classify video into 4 classes:
Entertainment: Information in this class is highly stylized depending on the particular sub-category: fiction, non-fiction and interactive. Film stories, TV programs and cartoons belong to fiction. With non-fiction, the information does not need to be "logical" or follow a story flow. Interactive video can be found in games.
Information: The most common informational video on television is news. It conveys information to viewers.
Communication: Video used in communication is different from playback video. It must be designed for communication, suitable for packet transmission (e.g., video conferences).
Data analysis: Scientific video recordings (e.g., videos in chemistry, biology, psychology…).
The way of classifying video into these classes is important because of their different structures. It helps to classify video information at the semantic level. Classifying video shots at this level is illustrated in [FAN]; in that paper, Fan et al. classify video into politics news, sports news, financial news, and so on. In [ROZENN], Rozenn et al. used features of sports to classify tennis or snooker video. Li et al. in [LI] detect particular events in sports broadcast video such as American football by determining where in the video stream a player hits the ball, where a goal occurs, and so on. In [SATOH], to index news video, anchor-person detection by face detection is used; it combines face features and video captions (text) to recognize the appearance of a person in the video. All of these applications try to classify any video stream into its semantic class.
II.2.3 Camera operations
There are two sources of motion in video: the camera and the objects. A camera shot contains only object motion when the camera stays static, with no adjustment, and objects such as people, animals or vehicles are moving in front of it. Conversely, camera motions are generated by moving, rotating or using the zoom function of the camera. When moving the camera, as often done in film production, the video camera lies on a support that moves along a track; usually this case is used to follow a moving object. The four common directions of camera movement are illustrated in Figure II-5. Some other cases of video motion created by a stationary camera are shown in Figure II-6.
Figure II-5 Common directions of moving video camera
Figure II-6 Common rotation and zoom of stationary video camera
II.2.4 Introduction to CBVIR system
How much information is stored in one video shot? It is said that "one image is worth a thousand words", and here one video can hold thousands of images. Moreover, one video shot can also store other information such as sound, voice and text, plus the one feature which makes video impressive: motion. Thus, any CBVIR system can be seen as an extension of image, sound, voice and text indexing and retrieval systems. Besides, motion extracted from the image sequence, and information coming from the collection of images, are also used in a CBVIR system. A CBVIR system must satisfy the main target of indexing and retrieval: end-users give queries and, according to their criteria, the CBVIR system should return the results that are most similar to the queries. But imagine taking the image or video query and browsing the entire video stream frame by frame; this raises the big problem of very high time cost, and sometimes the result is not exactly what we want. A CBVIR system for end-users can be seen simply in Figure II-7. Such a system can use queries made of video streams, images, sounds, texts, or a combination of all of them, and returns to the end user the most similar results.
Figure II-7 CBVIR common system diagram
In [FAISAL], Faisal I. Bashir gave the classification of video modeling schemes demonstrated in Figure II-8. At Level I, a CBVIR system uses signal features such as the colour, shape and texture of the raw video data. Techniques at this level tend to model the apparent characteristics of "stuff" (as opposed to "things") inside video clips. Video data is continuous and unstructured; to analyze and understand its contents, the video needs to be parsed into smaller chunks (frames, shots and scenes). At Level II, the system analyzes and synthesizes these features to obtain logical and statistical features (computer vision), and both of these derived results are used for semantic representation. At the semantic level, the video bitstream, which contains the audio stream and possibly closed-caption text along with the sequence of images, carries a wealth of rich information about the objects and events being depicted. Once the feature-level summarization is done, a semantic-level description based on conceptual models, built on a knowledge base, is needed. In the following sections, we review techniques that try to bridge the semantic gap and present a high-level picture obtained from the video data. One example can be found in [FAN], where the authors tried to classify each shot of video into semantic scenes of news, sports, science information, and so on. A CBVIR system belongs to one of these levels depending on its purpose.
Figure II-8 Classification of video modeling techniques: Level I with video raw data, Level II with derived or logical features, and Level III with semantic level abstraction
II.2.5 CBVIR Architecture
In [NEVENKA], the authors describe a CBVIR system which can be considered the most standard one. They perceive a video as a document. They compared CBIR of text with CBIR of video and pointed out the analogy. In a CBIR system for text, to make the system efficient, documents must be decomposed into elements: paragraphs, sentences and words. After that, a content table is made that maps to the document, keywords are extracted from the document by features, and a text query is used to retrieve over all of these indexed keywords. Similarly, in a CBVIR system, we should segment a video document into shots and scenes to compose a table of contents, and we should extract key frames or key sequences as index entries for scenes or stories. Therefore, the core research in content-based video retrieval is developing technologies to automatically parse video, audio, and text to identify meaningful composition structure and to extract and represent content attributes of any video source.
Figure II-9 Process diagram of CBVIR system
Figure II-9 illustrates the processes of a common CBVIR system. There are four main processes: feature extraction, structure analysis, abstraction and indexing. There is much research on each process, each one has its own challenges, and here we briefly review each of them.
Feature extraction: As mentioned in the sections before, each video contains many features: image features, voice features, sound features and text features. Feature extraction separates all these features, depending on the system, to serve the parsing of the video stream (shot detection, scene detection, motion detection…). The usual and effective way is to combine features, which opens many ways to approach a CBVIR system.
Structure analysis: In Figure II-4, the top level is the video sequence, which also corresponds to the stories. The video sequence is partitioned into scenes (analogous to paragraphs in a document), each scene is composed of a set of shots (like sentences in paragraphs), and each shot is constructed from a frame sequence. Structure analysis must be able to decompose all these elements. However, scene separation is sometimes impossible at this stage because it is based on the stories and must be recognized at a higher level (Level III), as illustrated in Figure II-8. Normally, to separate scenes from each other, most techniques are based on film production rules; but in fact this still faces many challenges and is less successful. For shot separation, there are many ways to proceed, such as using colour or motion; most shot algorithms use visual information. The last step, frame extraction, is video format dependent.
Video abstraction: The original video is always longer than the needed information. Abstraction is similar to the step of finding keywords in CBIR of text; it reduces the information to retrieve (and hence the time cost) in a CBVIR system. For example, in a video of a football match, scenes coming from the stands of the stadium are redundant if we only consider match events. Abstraction rejects all shots and scenes that the system does not need to care about; that means abstraction must be executed over all shots, scenes and stories. Video content abstraction includes skimming, highlights and summary. A video skim is a condensed representation of the video containing keywords, frames, visual, and audio sequences. Highlights normally involve detection of important events in the video. A summary means that we preserve important structural and semantic information in a short version of the video represented via key audio, video, frames, and/or segments. One of the most popular ways is key frame extraction. Key frames play an important role in the video abstraction process. Key frames are still images, extracted from the original video data, which best represent the content of shots in an abstract manner. The representational power of a set of key frames depends on how they are chosen from all frames of a sequence. Not all image frames within a sequence are equally descriptive, and the challenge is how to automatically determine which frames are most representative. An even more challenging task is to detect a hierarchical set of key frames such that a subset at a given level represents a certain granularity of video content, which is critical for content-based video browsing. Researchers have developed many effective algorithms, although robust key frame extraction remains a challenging research topic.
Indexing for retrieval and browsing: The structural and content attributes extracted in the feature extraction, video parsing, and abstraction processes, or the attributes that are entered manually, are often referred to as metadata. Based on these attributes, we can build video indices and the table of contents through, for instance, a clustering process that classifies sequences or shots into different visual categories, or an indexing structure. As in many other database systems, we need schemes and tools to use the indices and content metadata to query, search, and browse large video databases. Researchers have developed numerous schemes and tools for video indexing and querying. However, robust and effective tools tested by thorough experimental evaluation with large data sets are still lacking. Therefore, in the majority of cases, retrieving or searching video databases by keywords or phrases will be the mode of operation. In some cases, we can retrieve with reasonable performance by content similarity defined by low-level visual features of, for instance, key frames and example-based queries.
II.3 Features Extraction
Each CBVIR system chooses its own approach to searching and browsing. The features used are image features, audio features and text features. In this section, I introduce the features used widely in CBVIR systems and give an overview of each one; within the limits of this thesis, it is impossible to discuss each subject in detail.
As we know, each frame in a video stream is equivalent to one image. That means anything we can do on an image can be applied to video. One thing that makes video differ from images is the constraint between consecutive image features. There are three low-level features often used in CBIR of images that map to video: colour, texture and shape.
Color:
RGB color space
The RGB model is one of the first practical models in the area of colours and contains the recipe for creating colours. This model emerged in an evident manner at the time of the birth of television (1908 and the following years). It is the model resulting from the receiving characteristics of the eye and is based on the fact that the impression of almost all colours in the eye can be evoked by mixing, in fixed proportions, three selected clusters of light of properly chosen widths of spectrum. The three components (R, G, B) (R=RED, G=GREEN, B=BLUE) identify a colour in the RGB model. The RGB colour space is described in Figure II-10.
HSV color space
One more proposal for a model of colour description, suggested by Alvy Ray Smith, appeared in 1978. The symbols in the name of the model are the first letters of the English names for the components of the colour description. The model can be considered as a cone with a round base. The dimensions of the cone are described by the component H (Hue), the component S (Saturation) and the component V (Value).
Figure II-10 RGB color space (picture source [SEMMIX])
Figure II-11 HSV color space (picture source [SEMMIX])
The hue H is the angular component, the saturation S is the radial component and the brightness V is the height component. Figure II-11 illustrates the way colour is created on the surface of the cylinder. The HSV colour space is very useful in CBIR of images because it is close to human eye perception.
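As an illustration of the relation between the two spaces, the following is a small C++ sketch of the standard conversion from an RGB pixel (components in [0, 1]) to HSV, with H as an angle in degrees, S the radial saturation and V the brightness; the struct and function names are illustrative.

#include <algorithm>
#include <cmath>

struct Hsv { double h, s, v; };

// Standard RGB -> HSV conversion for components in [0, 1].
Hsv rgbToHsv(double r, double g, double b) {
    double maxC = std::max({r, g, b});
    double minC = std::min({r, g, b});
    double delta = maxC - minC;

    double h = 0.0;
    if (delta > 0.0) {
        if (maxC == r)      h = 60.0 * std::fmod((g - b) / delta, 6.0);
        else if (maxC == g) h = 60.0 * ((b - r) / delta + 2.0);
        else                h = 60.0 * ((r - g) / delta + 4.0);
        if (h < 0.0) h += 360.0;          // keep the hue angle in [0, 360)
    }
    double s = (maxC > 0.0) ? delta / maxC : 0.0;  // radial saturation
    double v = maxC;                               // brightness (value)
    return {h, s, v};
}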
Color Histogram:
Colour histograms of frames are compared to measure how close their colour distributions are. This method is used very widely because of its low time cost and its high effectiveness.
Texture:
Tamura features
Tamura features are based on the theory of texture [TAMURA]. Three of them, coarseness, contrast and directionality, are commonly used. Figure II-12 illustrates these three Tamura features. Each feature has its own measurement that separates it into different levels; for example, in Figure II-12.a the right image is coarser than the left one and has a greater coarseness value.
Gabor filter
Given an image I(x, y), with m and n denoting scale and orientation respectively, the Gabor transform can be written as

G_{mn}(x, y) = \sum_{s}\sum_{t} I(x - s,\, y - t)\; g_{mn}^{*}(s, t)

where s, t are the size variants of the mask filter and g is the mother wavelet function, defined as

g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\!\left[ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + 2\pi j W x \right]

W = f(ω, θ, φ), where ω is the given frequency, θ is the deviation angle and φ is the given phase. The effect of these values can be seen in Figure II-13.
Figure II-13 Effect of Gabor Filter to image results
In Figure II-13, with θ = 0° and ω = 30 Hz, the resulting image (on the left) contains only the elements with direction 0° and frequency 30 Hz. Changing these argument values changes the resulting image; that means the Gabor filter is used to filter the image and extract the needed textures.
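A minimal C++ sketch of building such a filter is given below. It generates only the real (cosine) part of a 2-D Gabor kernel, i.e. a Gaussian envelope modulated by a cosine of frequency freq along direction theta with phase phi; the parameter names and the isotropic envelope are simplifying assumptions and do not reproduce the exact complex formulation above.

#include <vector>
#include <cmath>

// Real part of a (2*halfSize+1) x (2*halfSize+1) Gabor kernel.
std::vector<std::vector<double>> gaborKernel(int halfSize, double sigma,
                                             double freq, double theta,
                                             double phi)
{
    const double PI = 3.14159265358979323846;
    int size = 2 * halfSize + 1;
    std::vector<std::vector<double>> k(size, std::vector<double>(size));
    for (int y = -halfSize; y <= halfSize; ++y) {
        for (int x = -halfSize; x <= halfSize; ++x) {
            // rotate coordinates so the sinusoid runs along direction theta
            double xr = x * std::cos(theta) + y * std::sin(theta);
            double envelope = std::exp(-(x * x + y * y) / (2.0 * sigma * sigma));
            double carrier  = std::cos(2.0 * PI * freq * xr + phi);
            k[y + halfSize][x + halfSize] = envelope * carrier;
        }
    }
    return k;   // convolve the image with this kernel to extract one texture channel
}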
Shape:
Shape representations can be classified into contour-based and region-based techniques. Both categories have been investigated and several approaches have been developed. In CBIR of images, the shape of an object is extracted and compared across the image database.
II.4 Structure Analysis
Structure analysis includes detecting scenes and shot breaks and extracting frames. In this section, I only introduce shot detection, because this is the most important task when analyzing video.
II.4.1 Shot Transitions classification
A shot transition is a technique used to lead the viewer from one shot to another [JONE, KATZZ]. Normally, before movie production (the concatenation of shots to make the movie story), each shot runs from the time the camera is turned on until it is turned off to change to another shot. One property that separates shots from each other is continuity: a shot boundary appears wherever there is a sudden change of the continuous scene along the time line. After transition techniques are applied, separating shots becomes more difficult. Detecting shot changes is a very important step in any CBVIR system. Shot transitions can be categorized as follows:
Cut: one shot changes to another instantaneously; see Figure II-14.a.
Fade-in: a shot emerges from a constant image. This constant image is usually a black frame. Fade-ins are most often found in the first shot of a movie, at the beginning; see Figure II-14.b.
Fade-out: the reverse of a fade-in; a shot changes into a constant image. See Figure II-14.c.
Dissolve: the first shot fades out while the second shot fades in. As seen in Figure II-14.d, a dissolve effect appears after frame #43 and before frame #83.
Wipe: the wipe is the most complicated transition because it is generated in video production. Figure II-14.e shows only one type of wipe (push left). Figure II-15 illustrates techniques used for this kind of shot transition; the arrows indicate how the second shot appears over the first shot.
A thorough understanding of shot transitions is the way to approach shot detection. We will see how to automatically detect shot transitions in the next section; today, research on shot detection still continues because of its difficulty and interest.
Figure II-14 Shot Transitions (a) cut (b) fade-in (c) fade-out (d) dissolve (e) wipe
Figure II-15 Some transition effects for wipe (pictures taken from Pinnacle Software)
II.4.2 Shot Detection Techniques
There are many ways to detect where a shot boundary appears. In general, automatic shot detection methods can be classified into 5 categories: pixel-based, statistics-based, transform-based, feature-based, and histogram-based.
Pixel difference method:
Pair-wise pixel comparison (also called template matching) compares the corresponding pixels in two successive frames. With f_i(x, y) denoting the intensity of pixel (x, y) in frame f_i, the difference between frame f_i and frame f_{i+1} is defined as:

D(f_i, f_{i+1}) = \frac{1}{X \cdot Y} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \left| f_i(x, y) - f_{i+1}(x, y) \right|

where X and Y are the frame width and height.
Otsuji et al. [OTSUJI] and Zhang et al. [ZHANG] count the number of changed pixels, and a camera shot break is declared if the percentage of changed pixels over the total number of pixels exceeds a threshold. The pixel difference with threshold t is:

DP_i(x, y) = \begin{cases} 1 & \text{if } \left| f_i(x, y) - f_{i+1}(x, y) \right| > t \\ 0 & \text{otherwise} \end{cases}
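A minimal C++ sketch of this criterion follows: it counts the pixels whose intensity change between two greyscale frames exceeds t and declares a shot break when their fraction exceeds a second threshold T. Flat 8-bit frames and the function name are illustrative assumptions.

#include <vector>
#include <cstdlib>

// Pair-wise pixel comparison: returns true when a shot break is declared.
bool pixelDifferenceCut(const std::vector<unsigned char>& f1,
                        const std::vector<unsigned char>& f2,
                        int t,        // per-pixel intensity threshold
                        double T)     // fraction of changed pixels required
{
    std::size_t changed = 0;
    for (std::size_t p = 0; p < f1.size() && p < f2.size(); ++p)
        if (std::abs(int(f1[p]) - int(f2[p])) > t) ++changed;
    return double(changed) / double(f1.size()) > T;
}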
Block-based method:
In contrast to template matching, which is based on global image characteristics (pixel-by-pixel differences), block-based approaches use local characteristics to increase robustness to camera and object movement [ZHANG]. Let \mu_i and \mu_{i+1} be the mean intensity values of a given region (block) in two consecutive frames and \sigma_i and \sigma_{i+1} be the corresponding variances. The frame difference is defined as the percentage of regions whose likelihood ratio \lambda exceeds a pre-defined threshold t:

\lambda = \frac{\left[ \dfrac{\sigma_i + \sigma_{i+1}}{2} + \left( \dfrac{\mu_i - \mu_{i+1}}{2} \right)^2 \right]^2}{\sigma_i \cdot \sigma_{i+1}}

DP_i(x, y) = \begin{cases} 1 & \text{if } \lambda > t \\ 0 & \text{otherwise} \end{cases}
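A small C++ sketch of the per-block statistic, assuming the block means and variances have already been computed; the small epsilon guarding the division is my addition and not part of the formula above.

#include <cmath>

// Likelihood ratio for one block, from its mean intensities (mu1, mu2) and
// variances (var1, var2) in two consecutive frames. A block is marked as
// "changed" when the returned ratio exceeds a chosen threshold t.
double likelihoodRatio(double mu1, double var1, double mu2, double var2) {
    double m = (mu1 - mu2) / 2.0;
    double num = (var1 + var2) / 2.0 + m * m;
    return (num * num) / (var1 * var2 + 1e-9);   // epsilon avoids division by zero
}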
Histogram comparison:
A step further towards reducing sensitivity to camera and object movements can be taken by comparing the histograms of successive images. The idea behind histogram-based approaches is that two frames with an unchanging background and unchanging (although moving) objects will show little difference in their histograms. In addition, histograms are invariant to image rotation and change slowly under variations of viewing angle and scale. As a disadvantage, one can note that two images with similar histograms may have completely different content; however, the probability of such events is low enough, and techniques for dealing with this problem have already been proposed in [PASS].
A grey level (colour) histogram of a frame i is an n-dimensional vector H_i(j), j = 1, …, n, where n is the number of grey levels (colours) and H_i(j) is the number of pixels of frame i with grey level (colour) j.
Global Comparison:
The simplest approach uses an adaptation of the pixel-difference metric above, but grey level histograms are compared instead of intensity values. A cut is declared if the absolute sum of histogram differences between two successive frames, D(f_i, f_{i+1}), is greater than a threshold t:

D(f_i, f_{i+1}) = \sum_{j=1}^{n} \left| H_i(j) - H_{i+1}(j) \right|

where H_i(j) is the histogram value for grey level j in frame i, j is the grey value and n is the total number of grey levels.
Another simple and very effective approach is to compare colour histograms. Zhang et al. [ZHANG] apply the same histogram difference where j, instead of a grey level, denotes a code value derived from the three colour intensities of a pixel. In order to reduce the number of bins (3 colours × 8 bits would create histograms with 2^24 bins), only the upper two bits of each colour intensity component are used to compose the colour code. This solution is demonstrated in Figure II-16, where the lower bits of each component are truncated. Comparison of the resulting 64 bins has been shown to give sufficient accuracy. When the difference is larger than a given threshold T, a cut is declared.
Figure II-16 Reducing the number of bits when calculating the histogram
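A minimal C++ sketch of this 64-bin colour-code comparison: the upper two bits of each R, G, B component form a 6-bit code, and the absolute bin-wise difference between consecutive frames is compared to a threshold to declare a cut. Interleaved 8-bit RGB frames and the function names are illustrative assumptions.

#include <vector>
#include <array>
#include <cstdlib>

// 64-bin colour-code histogram built from the upper two bits of R, G, B.
std::array<int, 64> colorCodeHistogram(const std::vector<unsigned char>& rgb) {
    std::array<int, 64> h{};
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
        int code = ((rgb[i]     >> 6) << 4) |   // upper 2 bits of R
                   ((rgb[i + 1] >> 6) << 2) |   // upper 2 bits of G
                    (rgb[i + 2] >> 6);          // upper 2 bits of B
        ++h[code];
    }
    return h;
}

// Absolute bin-wise histogram difference; a cut is declared when the
// returned value exceeds a chosen threshold T.
long histogramDifference(const std::array<int, 64>& h1,
                         const std::array<int, 64>& h2) {
    long d = 0;
    for (int j = 0; j < 64; ++j) d += std::abs(h1[j] - h2[j]);
    return d;
}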
A more robust variant compares the histograms with a \chi^2 test:

D(f_i, f_{i+1}) = \sum_{j=1}^{n} \frac{\left( H_i(j) - H_{i+1}(j) \right)^2}{H_{i+1}(j)}

Figure II-17 Cut (a) and fade/dissolve (b) from frame differences
Figure II-17 shows the cumulative differences between frames. Figure II-17.a shows the frame differences when a cut appears, with a clear peak, and Figure II-17.b shows the frame differences when a fade or dissolve appears. Cuts can be detected easily by thresholding, but fades and dissolves are more difficult. The next section explains how to detect fade and dissolve transitions.
Twin-comparison
The twin-comparison method [SMOLIAR] takes into account the cumulative differences between the frames of a gradual transition. In the first pass, a high threshold T_h is used to detect cuts, as shown in Figure II-18 (the first peak). In the second pass, a lower threshold T_l is employed to detect the potential starting frame F_s of a gradual transition. F_s is then compared to subsequent frames (Figure II-18). This is called an accumulated comparison, since during a gradual transition this difference value increases. The end frame F_e of the transition is detected when the difference between consecutive frames decreases to less than T_l while the accumulated comparison has increased to a value higher than T_h. If the consecutive difference falls below T_l before the accumulated difference exceeds T_h, then the potential start frame F_s is dropped and the search continues for other gradual transitions. It was found, however, that there are some gradual transitions during which the consecutive difference falls below the lower threshold. This problem can easily be solved by setting a tolerance value that allows a certain number of consecutive frames with low difference values before rejecting the transition candidate. As can be seen, twin comparison detects both abrupt and gradual transitions at the same time.
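The following is a simplified C++ sketch of this idea, not the exact procedure of [SMOLIAR]: it assumes a caller-supplied diff(i, j) returning the histogram difference between frames i and j, detects cuts with T_h, flags a candidate start with T_l, allows up to `tolerance` consecutive low-difference frames, and confirms a gradual transition with the accumulated difference. All names and the bookkeeping details are illustrative assumptions.

#include <vector>
#include <functional>

struct Transition { int start, end; bool gradual; };

std::vector<Transition> twinComparison(int frameCount,
                                       const std::function<double(int, int)>& diff,
                                       double Th, double Tl, int tolerance)
{
    std::vector<Transition> result;
    for (int i = 0; i + 1 < frameCount; ++i) {
        double d = diff(i, i + 1);
        if (d >= Th) {                              // abrupt cut
            result.push_back({i, i + 1, false});
        } else if (d >= Tl) {                       // potential start of a gradual transition
            int fs = i, j = i + 1, low = 0;
            while (j + 1 < frameCount && low <= tolerance) {
                if (diff(j, j + 1) < Tl) ++low;     // consecutive difference fell below Tl
                else low = 0;
                ++j;
            }
            if (diff(fs, j) > Th)                   // accumulated difference confirms it
                result.push_back({fs, j, true});
            i = j;                                  // resume the search after the candidate
        }
    }
    return result;
}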
Figure II-18 Twin Comparison (picture taken from [II.4 5])
Local histogram comparison:
Histogram-based approaches are simple and more robust to object and camera movements, but they ignore spatial information and therefore fail when two different images have similar histograms. On the other hand, block-based comparison methods make use of spatial information; they typically perform better than pair-wise pixel comparison but are still sensitive to camera and object motion and are also computationally expensive. By integrating the two paradigms, false alarms due to camera and object movement can be reduced while enough spatial information is retained to produce more accurate results.
The frame-to-frame difference between frames f_i and f_{i+1} is computed over b_n regions (blocks) as:

D(f_i, f_{i+1}) = \sum_{k=1}^{b_n} \sum_{j=1}^{n} \left| H_i(j, k) - H_{i+1}(j, k) \right|

where H_i(j, k) denotes the histogram value at grey level j for region (block) k, and b_n is the total number of blocks.
There are also some other methods used in shot detection: model-based segmentation, DCT-based methods, and vector quantization. Model-based segmentation is suited not only to detecting shot cuts or dissolves but also to detecting any type of wipe. In [HAMPAUR], Hampapur et al. gave a model for the variation of chromaticity or brightness between two consecutive frames and detected the change. In [IDRIS], Idris and Panchanathan use vector quantization to compress a video sequence using a codebook of size 256 and 64-dimensional vectors; the histogram of the labels obtained from the codebook for each frame is used as a frame similarity measure, and a χ² statistic is used to detect cuts. In DCT-based methods, the DCT coefficients are used to compare two video frames. The DCT is commonly used for reducing spatial redundancy in an image in different video compression schemes such as MPEG and JPEG. Compression of the video is carried out by dividing the image into a set of 8x8 pixel blocks; using the DCT, the pixels in the blocks are transformed into 64 coefficients which are quantized and Huffman entropy encoded. The DCT coefficients are analyzed to find frames where camera breaks take place. Since the coefficients in the frequency domain are mathematically related to the spatial domain, they can be used to detect changes in the video sequence.
II.5 Video motion
As mentioned before, motion is the most interesting feature appearing in video. From a video sequence there are many ways to extract motion, depending on the purpose of each system. And what kind of motion is considered? It could be the motion of each pixel in a frame (the optical flow), the motion of objects, the global motion, the motion of layers, and so on. Extracting motion has become important not only in low-level analysis but also in higher-level analysis. This section discusses some types of motion representation, with more detail on optical flow because of its widespread use and its primary role.
II.5.1 Motion trajectory
The motion trajectory of an object is a simple, high-level feature, defined as the localization, in time and space, of one representative point of this object. This descriptor is useful for content-based retrieval in object-oriented visual databases. It is also of help in more specific applications: in given contexts with a priori knowledge, the trajectory can enable much functionality. In surveillance, alarms can be triggered if some object has a trajectory identified as dangerous (e.g. passing through a forbidden area, being unusually quick, etc.). In sports, specific actions (e.g. tennis rallies taking place at the net) can be recognized. Besides, such a description also allows enhanced data interaction and manipulation: for semiautomatic multimedia editing, the trajectory can be stretched, shifted, etc., to adapt the object motion to any given sequence's global context.
The descriptor is essentially a list of key-points (x, y, z, t) along with a set of optional interpolating functions that describe the path of the object between key-points, in terms of acceleration. The speed is implicitly known from the key-point specification. The key-points are specified by their time instant and either their 2-D or 3-D Cartesian coordinates, depending on the intended application. The interpolating functions are defined for each component x(t), y(t), and z(t) independently.
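A minimal C++ sketch of such a trajectory descriptor follows: a list of key-points (x, y, z, t) plus a simple per-component linear interpolation to recover the position at an arbitrary time between two key-points. The struct and method names are illustrative, and only the linear case is sketched; second-order (acceleration-based) interpolation, as allowed by the description above, is omitted.

#include <vector>
#include <cstddef>

struct KeyPoint { double x, y, z, t; };

struct Trajectory {
    std::vector<KeyPoint> keyPoints;   // assumed sorted by increasing t

    // Linear interpolation of the object position at time t.
    KeyPoint positionAt(double t) const {
        if (keyPoints.empty()) return {0, 0, 0, t};
        if (t <= keyPoints.front().t) return keyPoints.front();
        if (t >= keyPoints.back().t)  return keyPoints.back();
        for (std::size_t i = 1; i < keyPoints.size(); ++i) {
            const KeyPoint& a = keyPoints[i - 1];
            const KeyPoint& b = keyPoints[i];
            if (t <= b.t) {
                double u = (t - a.t) / (b.t - a.t);   // interpolation factor in [0, 1]
                return { a.x + u * (b.x - a.x),
                         a.y + u * (b.y - a.y),
                         a.z + u * (b.z - a.z), t };
            }
        }
        return keyPoints.back();
    }
};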