
Master's thesis: Content-based video indexing and retrieval


DOCUMENT INFORMATION

Basic information

Title: Content-based video indexing and retrieval
Author: Pham Quang Hai
Supervisor: Dr. Alain Boucher
Institution: Truong Dai Hoc Bach Khoa Ha Noi
Field: Information Technology
Document type: Master's thesis
Year: 2005
City: Hà Nội
Format
Pages: 97
File size: 2.23 MB


Content


MINISTRY OF EDUCATION AND TRAINING

TRUONG DAI HOC BACH KHOA HA NOI

PHAM QUANG HAI

CONTENT-BASED VIDEO INDEXING AND RETRIEVAL

MASTER'S THESIS IN ENGINEERING

INFORMATION PROCESSING AND COMMUNICATION

Hà Nội - 2005

TRUONG DAI HOC BACH KHOA HA NOI

PHAM QUANG HAI

CONTENT-BASED VIDEO INDEXING AND RETRIEVAL

MASTER'S THESIS IN ENGINEERING
INFORMATION PROCESSING AND COMMUNICATION

SCIENTIFIC SUPERVISOR:

Alain Boucher

Hà Nội - 2005


Abstract

Video indexing and retrieval is an important element of any large multimedia database. Because video contains huge amounts of information and has large storage requirements, research on video is still continuing and open. This thesis works with low-level features of video, concentrating on the corner features of frames. Comparison of corners between frames gives us the ability to detect cuts, gradual transitions and even many types of wipes among shot transitions in video. Continuing with corner-based motion combined with histogram features, and by measuring how far the motion moves, key frames are selected, ready for indexing and retrieval applications.

The other side of the work uses segmentation of each frame and merges all regions which have the same motion. In this way, I separate the regions of a frame into layers, which will be used to index key objects.

One chapter of this thesis is reserved for learning how to index and retrieve in video. It is an overview of video indexing and retrieval systems: what they did, what they are doing and what they will do. This thesis is expected to contribute usefully to the multimedia system at MICA.

Acknowledgements

This work is a part of the Multimedia Information Communication Application (MICA) research center.

First of all, I would like to thank Dr. Alain Boucher, IT lecturer at the Institut de la Francophonie pour l'Informatique (IFI), Vietnam, leader of the Image Processing group at MICA center, as my supervisor. Thank you for your support, for sharing your knowledge with me, for meeting to discuss the work every week, and for your patience during the time I worked; and sorry for any inconvenience I brought to you.

I also thank Le Thi Lan and Thomas Martin, members at MICA; I could not have done this thesis without your support. Thank you both for your knowledge of image processing theory and of programming in C++ with a newbie like me.

I would like to thank the directors at MICA: Mr. Nguyen Trong Giang, Mr. Eric Castelli and Mrs. Nguyen Thi Yen, who accepted me and helped me to have a good working environment at MICA. Thanks to the members of MICA who welcomed me to work at MICA as a trainee; I have a very good impression of your amiable attitudes and your help.

Finally, I want to thank my family, my two sisters who often supported me over the long distance from home, my parents who helped me anytime I was down, and my brother who visited me sometimes to tidy up my room because of my laziness.

Table of contents

I.1 Content-based Video Indexing and Retrieval (CBVIR)
I.2 Aims and Objectives
II.2 Content-based video indexing and retrieval system
II.2.1 Video sequence structure
II.2.2 Video data classification
II.2.3 Camera operations
II.2.4 Introduction to CBVIR system
II.2.5 CBVIR Architecture
II.3 Features Extraction
II.4 Structure Analysis
II.4.1 Shot Transitions classification
II.4.2 Shot Detection Techniques
II.6.1 Introduction to Video Abstraction
II.6.2 Key frame extraction
II.7 Video Indexing, Retrieval, and Browsing
Chapter III Video Indexing by Camera motions using Corner-based motion vector
III.2 Video and Image Parsing in MICA
III.2.1 MPEG2 Video Decoder
III.2.3 Video and Image Processing Library Combination
III.3.1 Harris Corner points detector
III.3.2 Correspondent points matching
III.4 Shot Transitions detection
III.4.1 Shot cut Detection algorithm
III.4.2 Shot cut detection description
III.4.3 Results and evaluation
III.5 Video Indexing
III.5.1 Motion Characterization
III.5.2 Corner-based motion vector
III.5.3 Global Motion calculation
III.5.4 Key frame extraction
III.5.5 Problem in object extraction


List of abbreviations

CD: Compact Disk

DVD: Digital Versatile Disk

MPEG: Moving Pictures Experts Group

CBVIR: Content Based Video Indexing and Retrieval

CBIR: Content Based Indexing and Retrieval

IEC: International Electrotechnical Commission

DCT: Discrete Cosine Transform

JPEG: Joint Photographic Experts Group

IDCT: Inverse Discrete Cosine Transform

GOP: Group of Pictures


List of figures

Figure I-1 Position of video system in MICA center
Figure II-1 Two consecutive frames from video sequence
Figure II-2 Motion Compensation from MPEG video stream
Figure II-3 Block diagram of MPEG video encoder
Figure II-4 Video Hierarchical Structure
Figure II-5 Common directions of moving video camera
Figure II-6 Common rotation and zoom of stationary video camera
Figure II-7 CBVIR common system diagram
Figure II-8 Classification of video modeling techniques: Level I with video raw data, Level II with derived or logical features, and Level III with semantic level
Figure II-13 Effect of Gabor Filter on an image
Figure II-16 Reduce the number of bits when calculating the histogram
Figure II-17 Cut (a) and Fade/Dissolve (b) from frame difference
Figure II-18 Twin Comparison
Figure II-19 Head tracking for determining trajectories
Figure II-20 The 2D motion trajectory (third direction is the frame time line)
Figure II-21 Optical flow: (a) two frames from a video sequence, (b) optical flow
Figure II-22 Optical flow field produced by pan and track, tilt and boom, zoom and ...
Figure II-23 Motion segmentation by optical flow
Figure II-24 Local and Global Contextual Information
Figure III-1 Relation between R and eigenvalues
Figure III-2 Harris Corners in an image with different given corner numbers
Figure III-3 (a) Two frames extracted while camera pans right, (b) corresponding points
Figure III-4 Results from no shot transition
Figure III-5 Results from shot cut transition
Figure III-7 Correspondent points matching numbers in one video sequence
Figure III-8 Two frames from two shots but similar
Figure III-9 Correspondent points in video sequence 3
Figure III-10 Frame sequence from video sequence 3
Figure III-11 Keep motion vectors by a given threshold on magnitudes
Figure III-12 The 8 directions used for standardizing vector directions
Figure III-13 Some consecutive frames from a pan-right shot
Figure III-14 Video frame from video sequence 1
Figure III-15 Video frame from video sequence 2
Figure III-16 Key frame selection from video mosaic
Figure III-17 Key frames selected from motion graph
Figure III-18 Complicated motion graph from video
Figure III-19 Cases of vector graph
Figure III-20 Results of key frame selection
Figure III-21 Hierarchical indexing for CBVIR system

List of tables

Table 1 Test data used for shot cut algorithm
Table 2 Shot detection result from test data
Table 3 Four types of detection an algorithm can make
Table 4 Vector directions rule
Table 5 Calculating global motion from a set of corner-based vectors
Table 6 Video sequences for global motion
Table 7 Three tables of motion vectors for video sequences 1, 2 and 6
Table 8 Global motion from video sequence 3

... ability of rapid processing and very big storage for multimedia data became available. Each year, huge numbers of movies are created by the movie industry, and hundreds of television companies in the world produce more and more video news. And for each person or each family, being the owner of a camera became easier than ever; they make home videos of events which happen in their life. These reasons made video information become huge, difficult to archive, and messy when people try to browse for the video information they need. It is easy to realize that sometimes it is very hard to find the shot of a "sunset scene" given a thousand videos of 3 hours each. That is why systems for retrieving video are now being researched and developed more and more in the world.

In Vietnam, there is more and more research and application on multimedia data to fit the development and requirements of modern life, and researching video became important and essential. By means of this thesis, I tried to discover and summarize video research while practicing a part of a CBVIR system.

I.2 Aims and Objectives

At the MICA center, we are now developing a multimedia system including Speech Processing and Image Processing. The position of the video system is shown in Figure I-1.

The thesis is divided into two big chapters plus a conclusion. These chapters are organized as follows.

Chapter II

This chapter provides basic information and characteristics of video and a general CBVIR system. It also introduces techniques used in video for feature extraction and video analysis.

Chapter III

This chapter describes techniques that are used in practice: Harris corners, motion from Harris corners and how it correlates to a CBVIR system. Results from practice are shown in this chapter with evaluations.

Chapter IV

This chapter concludes the thesis and gives directions for future work.

Chapter II Background

II.1 Video Formats and Frame Extraction System

II.1.1 Video Formats

There are a number of video formats used in CBVIR systems; it depends on the database of the system. Some formats popularly used on storage media like VCD, DVD and hard disks are DAT, AVI, MPEG, MPG, MOV. In the CBVIR systems at MICA, we used the MPEG-2 format while parsing video streams. An advantage of the MPEG format is reducing the size of the video file, which makes many video processing systems feasible. While encoding an MPEG video stream, "two steps" are used to compress video: once for spatial compression and once for motion compression (motion compensation). The requirement of applications that use MPEG is that the video can be played anywhere: digital storage media require small size and quality good enough to process because of their cost, asymmetric applications require the ability of subdivision for video delivery (e.g. online video games), and symmetric applications need a video format that can compress and decompress at the same time. All these requirements are satisfied by using the MPEG format.

II.1.2 Introduction to MPEG

The Moving Pictures Experts Group, abbreviated MPEG, is part of the International Standards Organisation (ISO), and defines standards for digital video and digital audio. The primary task of this group was to develop a format to play back video and audio in real time from a CD. Meanwhile the demands have risen and, besides the CD, the DVD needs to be supported, as well as transmission equipment like satellites and networks. All these operational uses are covered by a broad selection of standards. Well known are the standards MPEG-1, MPEG-2, MPEG-4 and MPEG-7. Each standard provides levels and profiles to support special applications in an optimised way.

II.1.2.1 MPEG-2 Video Standard

MPEG-2 video is an ISO/IEC standard that specifies the syntax and semantics of an encoded video bitstream. These include parameters such as bit rates, picture sizes and resolutions which may be applied, and how the bitstream is decoded to reconstruct the picture. What MPEG-2 does not define is how the decoder and encoder should be implemented, only that they should be compliant with the MPEG-2 bitstream. This leaves designers free to develop the best encoding and decoding methods whilst retaining compatibility. The range of possibilities of the MPEG-2 standard is so wide that not all features of the standard are used in all applications [KEITH].

II.1.2.2 MPEG-2 Encoding

One of the most interesting points of MPEG is reducing the size of video as much as possible. It relies on compression algorithms including spatial compensation and temporal compensation. This method is also applied in other MPEG standards like MPEG-1, MPEG-4, and MPEG-7. For spatial compensation, the JPEG standard, which reduces the size of an image, is used. By adjusting the various parameters, compressed image size can be traded against reconstructed image quality over a wide range. Image quality ranges down to "browsing" quality (with a compression ratio of 100:1) [KEITH]. For temporal compensation, motion compensation is used.

Temporal compression is achieved by only encoding the difference between successive pictures. Imagine a scene where at first there is no movement, and then an object moves across the picture. The first picture in the sequence contains all the information required until there is any movement, so there is no need to encode any of the information after the first picture until the movement occurs. Thereafter, all that needs to be encoded is the part of the picture that contains movement. The rest of the scene is not affected by the moving object because it is still the same as in the first picture. The means by which the amount of movement between two successive pictures is determined is known as motion estimation prediction. The information obtained from this process is then used by motion compensated prediction to define the parts of the picture that can be discarded. This means that pictures cannot be considered in isolation: a given picture is constructed from the prediction from a previous picture, and may be used to predict the next picture.

Motion compensation is illustrated in Figure II-1.

Figure II-1 Two consecutive frames from video sequence

As we can see in the two consecutive frames, the man on the right and the background stay static, while the man on the left is moving. All the information we need to store is the background, the figure of the man on the right, and the motion figure of the man on the left. Motion compensation here means that the next frame is created from the last frame plus the part which arises from motion, minus the part which is overridden by motion, as described visually in Figure II-2.

Figure II-2 Motion Compensation from MPEG video stream

A block diagram of an MPEG video encoder is shown in Figure II-3.

Figure II-3 Block diagram of MPEG video encoder

First of all, frame data (raw data) is compressed by the DCT (Discrete Cosine Transform) by dividing a frame into macroblocks and calculating the block coefficients. Quantization (Q) with "zig-zag" scanning optimizes the values for each macroblock. After that, MCP (motion compensated prediction) is used to exploit redundant temporal information that does not change from picture to picture.

MPEG-2 defines three picture types:

I (Intraframe) pictures are encoded without reference to another picture, to allow random access. At MICA center, we used this type of frame during processing.

P (Predictive) pictures are encoded using motion compensated prediction from the previous picture and therefore contain a reference to the previous picture. They may themselves be used in subsequent predictions.

B (Bi-directional) pictures are encoded using motion compensated prediction from the previous and next pictures, which must be either an I or a P picture. B pictures are not used in subsequent predictions.

Usually, I, B and P frames are mixed into a Group of Pictures (GOP). One GOP could include I and P, or I, P and B frames. Depending on what is done during encoding time, the type of MPEG will differ; it depends on the number of frame types used in the GOP and on the order of I, B and P frames in the GOP.

One more important problem in the MPEG-2 standard is motion estimation. Motion estimation prediction is a method of determining the amount of movement contained between two pictures. This is achieved by dividing the picture to be encoded into sections known as macroblocks. The size of a macroblock is 16 x 16 pixels. Each macroblock is searched for the closest match in the search area of the picture it is being compared with. Motion estimation prediction is not used on I pictures; however, B and P pictures can refer to I pictures. For P pictures, only the previous picture is searched for matching macroblocks. For B pictures, both the previous and next pictures are searched. When a match is found, the offset (or motion vector) between them is calculated. The matching parts are used to create a prediction picture, by using the motion vectors. The prediction picture is then compared in the same manner to the picture to be encoded. Macroblocks which have a match have already been encoded and are therefore redundant. Macroblocks which have no match to any part of the search area in the picture to be encoded represent the difference between the pictures, and these macroblocks are encoded. To understand more about the MPEG standard, see [MPEG] for details.

II.2 Content-based video indexing and retrieval system

II.2.1 Video sequence structure

A video stream is basically created from shots. A shot is a fundamental unit of video; it depicts a continuous capture from the moment the camera is turned on until it is turned off for another shot. As illustrated in Figure II-4, when a video producer creates a video, they make it from shots, group them into scenes, and embed some effects between shots.

Figure II-4 Video Hierarchical Structure


One scene could be understood as a series of shots which are semantically constrained (e.g., a scene describes two talking people, sitting in chairs, with interleaved shots of other people at a party). That means it is not easy to detect a scene by signal features like colour or shape; it must be detected at the semantic level (mentioned in the next section). The lowest level in the video hierarchical structure is the frame. The CBVIR system extracts a shot into frames and selects the most interesting frames from them in the key frame selection step.

II.2.2 Video data classification

To work with video, classification of video data is very important. Following [ROWE], Rowe et al. classified video metadata into 3 levels for each video:

Bibliographic data: This category includes information about the video (e.g., title, abstract, subject, and genre) and the individuals involved in the video (e.g., producer, director, and cast).

Structural data: Videos and movies can be described by a hierarchy of movie, segment, scene, and shot, where each entry in the hierarchy is composed of one or more entries at a lower level (e.g., a segment is composed of a sequence of scenes and a scene is composed of a sequence of shots).

Content data: Users want to retrieve videos based on their content (audio and visual content). In addition, because of the nature of video, the visual content is a combination of static content (frames) and dynamic content. Thus, the content indexes may be sets of key frames that represent major images, and object indexes that indicate entry and exit frames for each appearance of a significant object or individual.

With this classification, to work with a video stream, we normally have to parse the video into the 3 levels above. The first level can be obtained from other information embedded inside the video stream or, perhaps more simply, from text that appears in the video stream (which requires text recognition [LIENHART]). To determine structural data, the system must rebuild the structure from primitive elements (frames) by using some techniques. In most CBVIR systems, content data is used as the major element to work with video. In my thesis, I work with the frame as a primitive element.

Another way to classify video data is based on its purpose. These are called purpose-based classes. In [LEW], Lew et al. classify video into 4 classes:

Entertainment: Information in this class is highly stylized depending on the particular sub-category: fiction, non-fiction and interactive. Film stories, TV programs and cartoons belong to fiction. With non-fiction, the information does not need to be "logical" and follow a story flow. Interactive video can be found in games.

Information: The most common video information on television is news. It conveys information to viewers.

Communication: Video used in communication is different from playback video. It must be designed for communication, suitable for packet transmission (e.g., video conferences).

Data analysis: Scientific video recording (e.g., video for chemistry, biology, psychology ...).

Classifying video into these classes is important because of their different structure. It helps to classify video information at the semantic level. Classifying video shots at these levels is illustrated in [FAN]; in that paper, Fan et al. classify video into politics news, sports news, financial news ... Or in [ROZENN], Rozenn et al. used features of sports to classify tennis video or snooker video. Li et al. in [LI] detect particular events in sports broadcast video like American football by determining where in the video stream a player hit the ball, or where a goal is. In [SATOH], to analyze news video, detecting the anchor person by face detection is used; it combines features of the face and the video caption (text) to recognize the appearance of a person in the video. All of these applications try to classify a video stream into its semantic class.

II.2.3 Camera operations

There are two causes of motion in video: the camera and objects. A camera shot stores object motions only when the camera stays static, with no adjustment, while objects like people, animals or vehicles move in front of it. Conversely, camera motions are generated by moving, rotating or using the zoom function of the camera. When moving the camera, often used in film production, the video camera lies on a support that moves along a track; usually this case is used to follow a moving object. Four common directions of camera movement are illustrated in Figure II-5. Some other cases of video motion created by a stationary camera are shown in Figure II-6.

Figure II-5 Common directions of moving video camera

Figure II-6 Common rotation and zoom of stationary video camera

II.2.4 Introduction to CBVIR system

How much information is stored in one video shot? It is said that "one image is worth a thousand words", and here one video could hold thousands of images. Not only that: one video shot can also store other information like sound, voice and text, and one feature which makes video impressive is motion. Thus, any CBVIR system can be extended from Image, Sound, Voice and Text Indexing and Retrieval systems. Besides, motion taken from the image sequence, and information that comes from a collection of images, are used in CBVIR systems as well. A CBVIR system must satisfy the main target, which is indexing and retrieval. End-users give queries and, according to their criteria, the CBVIR system should give back to them the results that are most similar to the queries. But imagine that, if we use an image or video query and match it against the entire video stream frame by frame, it is a big problem of very high time cost, and sometimes the result is not exactly what we want. A CBVIR system for end-users can be seen simply in Figure II-7. This system can use queries of video streams, images, sounds, texts, or a combination of all of them. The result given back to the end-user consists of the most similar items.


Figure II-7 CBVIR common system diagram

In [FAISAL], Faisal I. Bashir gave the classification of video modeling schemes demonstrated in Figure II-8. A CBVIR system uses features of signals like colour, shape and texture of the raw video data at Level I. Techniques at this level tend to model the apparent characteristics of "stuff" (as opposed to "things") inside video clips. Video data is continuous and unstructured; to analyze and understand its contents, the video needs to be parsed into smaller chunks (frames, shots and scenes). At Level II, the system analyses and synthesizes these features to obtain logical and statistical features (computer vision), and uses both of these derived results for semantic representation. At this semantic level, the video bit stream, which contains an audio stream and possibly closed caption text along with the sequence of images, contains a wealth of rich information about the objects and events being depicted. Once the feature-level summarization is done, a semantic-level description based on conceptual models, built on a knowledge base, is needed. In the following sections, we review techniques that try to bridge the semantic gap and present a high-level picture obtained from video data. One example can be found in [FAN], where people tried to classify each shot of video into semantic scenes of news, sports, science information ... A CBVIR system belongs to one of these levels depending on its purpose.

Figure II-8 Classification of video modeling techniques: Level I with video raw data, Level II with derived or logical features, and Level III with semantic level

By analogy, in a content-based retrieval system for text, to make an efficient system, documents must be decomposed into elements: paragraphs, sentences and words. After that, a content table that maps to the documents is built, keywords are extracted from the documents by features, and query text is used to retrieve over all these indexed keywords. Similarly, in a CBVIR system, we should segment a video document into shots and scenes to compose a table of contents, and we should extract key frames or key sequences as index entries for scenes or stories. Therefore, the core research in content-based video retrieval is developing technologies to automatically parse video, audio, and text to identify meaningful composition structure and to extract and represent content attributes of any video source.

Figure II-9 Process diagram of CBVIR system

Figure II-9 illustrates the processes of a common CBVIR system. There are four main processes: feature extraction, structure analysis, abstraction and indexing. There is much research on each process, each one has its own challenges, and here we briefly review each process.

Feature extraction: As mentioned in the sections before, each video contains many features: image features, voice features, sound features and text features. Feature extraction separates all these features, depending on each system, to serve the parsing of the video stream (shot detection, scene detection, motion detection ...). The usual and effective way is a combination of features, which generates many ways to approach a CBVIR system.

Structure analysis: In Figure II-4, the top level is the video sequence, which is also the story. The video sequence is partitioned into scenes (analogous to paragraphs in a document), each scene is composed of sets of shots (like sentences in paragraphs), and a shot is constructed from a frame sequence. Structure analysis must be able to decompose all these elements. However, scene separation is sometimes impossible because it is based on stories and must be recognized at a higher level (level III) as illustrated in Figure II-8. Normally, to divide scenes from each other, most techniques are based on film production rules; but in fact this still presents much challenge and limited success. For shot separation, there are many ways, for instance using color or motion. Most shot algorithms use visual information, and the last step, frame extraction, is video format dependent.

Video abstraction: The original video is always longer than the needed information. Abstraction is similar to the step of finding keywords in CBIR of text; it reduces the information to retrieve (equivalent to reducing the time code) in a CBVIR system. For example, in a video of a football match, scenes from the stands of the stadium are redundant if we only consider match events. Abstraction rejects all shots and scenes that the system does not need to care about; that means abstraction must be executed over entire shots, scenes and stories. Video content abstraction includes skimming, highlights and summary. A video skim is a condensed representation of the video containing keywords, frames, visual, and audio sequences. Highlights normally involve detection of important events in the video. A summary means that we preserve important structural and semantic information in a short version of the video represented via key audio, video, frames, and/or segments. One of the most popular ways is key frame extraction. Key frames play an important role in the video abstraction process. Key frames are still images, extracted from the original video data, which best represent the content of shots in an abstract manner. The representational power of a set of key frames depends on how they are chosen from all frames of a sequence. Not all image frames within a sequence are equally descriptive, and the challenge is how to automatically determine which frames are most representative. An even more challenging task is to detect a hierarchical set of key frames such that a subset at a given level represents a certain granularity of video content, which is critical for content-based video browsing. Researchers have developed many effective algorithms, although robust key frame extraction remains a challenging research topic.

Indexing for retrieval and browsing: The structural and content attributes extracted in the feature extraction, video parsing, and abstraction processes, or the attributes that are entered manually, are often referred to as metadata. Based on these attributes, we can build video indices and the table of contents through, for instance, a clustering process that classifies sequences or shots into different visual categories, or an indexing structure. As in many other database systems, we need schemes and tools to use the indices and content metadata to query, search, and browse large video databases. Researchers have developed numerous schemes and tools for video indexing and query. However, robust and effective tools tested by thorough experimental evaluation with large data sets are still lacking. Therefore, in the majority of cases, retrieving or searching video databases by keywords or phrases will be the mode of operation. In some cases, we can retrieve with reasonable performance by content similarity defined by low-level visual features of, for instance, key frames and example-based queries.

II.3 Features Extraction

Each CBVIR system chooses its own way to approach searching and browsing. The features used are image features, audio features and text features. In this section, I would like to introduce the features used widely in CBVIR systems and give an overview of each feature. Within the limits of this thesis, it is impossible to discuss each subject in detail.

As we know, each frame in a video stream is equivalent to one image. That means anything we can do on an image can be applied to video. One thing that makes video differ from images is the constraint between consecutive image features. There are three low-level features often used in CBIR of images that map to video: color, texture and shape.

Color:

RGB color space

The RGB model is one of the first practical models of the area of colors and contains the recipe for the creation of colors. This model emerged in an evident manner in the times of the birth of television (1908 and the following years). It is the model resulting from the receiving specification of the eye, and it is based on the fact that the impressions of almost all the colors in the eye can be evoked by mixing, in fixed proportions, three selected clusters of light of properly chosen widths of spectrum. The three components (R, G, B) (R-RED, G-GREEN, B-BLUE) identify a color in the RGB model. The RGB color space can be described as in Figure II-10.

HSV color space

One more proposed model of color description, suggested by Alvy Ray Smith, appeared in 1978. The symbols in the name of the model are the first letters of the English names for the components of the color description. The model is considered as a cone with a round base. The dimensions of the cone are described by the component H (Hue), the component S (Saturation) and the component V (Value); brightness is the height component. Figure II-11 illustrates the way color is created on the surface of the cylinder. The HSV color space is very useful in CBIR systems for images because it is close to human eye perception.
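As a small illustration of the HSV description, the Python standard library already provides the RGB-to-HSV conversion; this is only a usage sketch, and the specific RGB triple is arbitrary.

    import colorsys

    # colorsys works on values in [0, 1]; hue is returned as a fraction of a full turn.
    r, g, b = 200, 120, 40
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    print(h * 360.0, s, v)   # hue in degrees, saturation and value in [0, 1]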

Color Histogram:

Color histograms are compared to determine where similar histogram distributions appear. This method is used very popularly because of its light time cost and its good effectiveness.
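A minimal sketch of histogram comparison between two grey-level frames, using NumPy; the bin count and the absolute-difference distance are common choices, not prescribed by the thesis.

    import numpy as np

    def grey_histogram(frame, bins=256):
        """Histogram of an 8-bit grey-level frame given as a 2-D array."""
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        return hist

    def histogram_distance(frame_a, frame_b, bins=256):
        """Sum of absolute bin differences between the two frame histograms."""
        ha, hb = grey_histogram(frame_a, bins), grey_histogram(frame_b, bins)
        return int(np.abs(ha - hb).sum())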

Texture:

Tamura features

Tamura features are based on a theory of texture [TAMURA]. Three Tamura features are known as coarseness, contrast and directionality.

Coarseness provides a measurement to separate textures into different levels. For example, in Figure II-12.a the right image is coarser than the left and has a greater value.

Figure II-12 Tamura features and their values: (a) Coarseness, (b) Contrast, (c) Directionality

Gabor Filter:

Given an image I(x, y), with m and n the scale and orientation respectively, the Gabor transform can be described as follows:

G_{mn}(x, y) = \sum_{s} \sum_{t} I(x - s, y - t)\, \psi_{mn}^{*}(s, t)

where s and t are the variants of the mask filter and \psi_{mn} is the mother wavelet function, defined in the following.

Changing these argument values changes the resulting image. That means the Gabor filter is used to filter the image and extract the needed textures.
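One possible way to build such a filter bank is OpenCV's Gabor kernel generator; the kernel size, sigma, wavelengths and number of orientations below are arbitrary example values, and frame.png is a hypothetical input image.

    import cv2
    import numpy as np

    image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file
    responses = []
    for theta in np.arange(0, np.pi, np.pi / 4):             # 4 orientations
        for lambd in (8.0, 16.0):                            # 2 wavelengths (scales)
            kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                        lambd=lambd, gamma=0.5, psi=0)
            responses.append(cv2.filter2D(image, cv2.CV_32F, kernel))
    # The mean and variance of each response map can serve as texture features.
    features = [(float(r.mean()), float(r.var())) for r in responses]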

Shape:

Shape representations can be classified into contour-based and region-based techniques. Both categories of techniques have been investigated and several approaches have been developed. In CBIR systems for images, the shape of an object is extracted and compared across the image database.

II.4 Structure Analysis

Structure analysis includes detecting scenes, detecting shot breaks and extracting frames. In this section, I only introduce shot detection, because this is the most important task when analyzing video.

II.4.1 Shot Transitions classification

Shot transitions are techniques used to lead the viewer from one shot to another [JONE, KATZ]. Normally, before movie production (the concatenation of shots to make the movie story), each shot runs from the time the camera is turned on until it is turned off to change to another shot. One way to separate shots is the continuity feature: a shot boundary appears wherever there is a sudden change of the continuous scene in the timeline. After shot transition techniques are applied, separation becomes more difficult. Detecting shot changes is a very important step in any CBVIR system. Shot transitions can be categorized as follows.

Cut: a shot changes to another instantaneously, see Figure II-14.a.

Fade-in: a shot changes from a constant image. This constant image is usually a black frame. Most fade-ins can be found in the first shot of a movie, at the beginning, Figure II-14.b.

Fade-out: a shot changes to a constant image, Figure II-14.c.

Dissolve: the first shot fades out while the second shot fades in. As you can see in Figure II-14.d, a dissolve effect appears after Frame#43 and before Frame#83.

Wipe: the wipe is the most complicated transition because it is generated in video production. Figure II-14.e shows just one type of wipe (push left). Figure II-15 illustrates techniques used for shot transitions; the arrows show the direction in which the second shot appears over the first shot.

A thorough understanding of shot transitions is the way to approach shot detection. We will see how to automatically detect shot transitions in the next section; today, research on shot detection still continues because of its difficulty and interest.

[Figure II-14: example frames #43, #57, #64, #71, #79, #83]

II.4.2 Shot Detection Techniques

There are many ways to detect where a shot boundary appears. In general, automatic shot detection can be classified into 5 categories: pixel-based, statistics-based, transform-based, feature-based, and histogram-based.

Pixel difference method:

Pair-wise pixel comparison (also called template matching) evaluates the differences in intensity or colour values of corresponding pixels in two successive frames. The difference between frame f_i and frame f_{i+1}, with P_i(x, y) the intensity of pixel (x, y) in frame f_i, is defined as:

D(f_i, f_{i+1}) = \frac{1}{X \cdot Y} \sum_{x=1}^{X} \sum_{y=1}^{Y} \left| P_i(x, y) - P_{i+1}(x, y) \right|    (II.3)

Otsuji et al. [OTSUJI] and Zhang et al. [ZHANG] count the number of changed pixels, and a camera shot break is declared if the percentage of the total number of changed pixels exceeds a threshold. The per-pixel difference against a threshold t is:

DP_i(x, y) = \begin{cases} 1 & \text{if } \left| P_i(x, y) - P_{i+1}(x, y) \right| > t \\ 0 & \text{otherwise} \end{cases}    (II.4)

As we can see in (II.4), a pixel is counted as changed if its difference is above the threshold value t, and a camera break is declared when the percentage of changed pixels is high enough. But it is easy to realize that this formula is very sensitive to camera motion: it will be very difficult to handle pans, zooms, or perhaps just a large movement of an object, since the difference values of corresponding pixels become high. That means it will give more false alarms in the camera break results. However, this method is fast and works well for abrupt camera breaks.
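A direct NumPy sketch of equations (II.3) and (II.4): count the pixels whose difference exceeds t and declare a break when their percentage exceeds T. The threshold values are placeholders, not values from the thesis.

    import numpy as np

    def changed_pixel_ratio(frame_a, frame_b, t=20):
        """Percentage of pixels whose grey-level difference exceeds t (eq. II.4)."""
        diff = np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32))
        return 100.0 * np.count_nonzero(diff > t) / diff.size

    def is_pixel_cut(frame_a, frame_b, t=20, T=30.0):
        """Declare a camera break when the changed-pixel percentage exceeds T."""
        return changed_pixel_ratio(frame_a, frame_b, t) > T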

Block-based method:

In contrast to template matching, which is based on global image characteristics (pixel-by-pixel differences), block-based approaches use local characteristics to increase robustness to camera and object movement [ZHANG]. Let \mu_i and \mu_{i+1} be the mean intensity values for a given region in two consecutive frames and \sigma_i^2 and \sigma_{i+1}^2 be the corresponding variances. The frame difference is defined as the percentage of the regions whose likelihood ratios (II.5) exceed a pre-defined threshold t:

\lambda = \frac{\left[ \frac{\sigma_i^2 + \sigma_{i+1}^2}{2} + \left( \frac{\mu_i - \mu_{i+1}}{2} \right)^2 \right]^2}{\sigma_i^2 \cdot \sigma_{i+1}^2}    (II.5)

This approach is better than the previous one as it increases the tolerance against noise associated with camera and object movement. However, it is possible that even though two corresponding blocks are different, they can have the same density function; in such cases no change is detected.
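The likelihood ratio of equation (II.5) and the resulting block-based frame difference can be sketched as follows; the block size, the threshold t and the small epsilon that guards against zero variance are illustrative choices.

    import numpy as np

    def likelihood_ratio(block_a, block_b):
        """Likelihood ratio (eq. II.5) for two corresponding blocks."""
        mu_a, mu_b = block_a.mean(), block_b.mean()
        var_a, var_b = block_a.var(), block_b.var()
        num = ((var_a + var_b) / 2.0 + ((mu_a - mu_b) / 2.0) ** 2) ** 2
        return num / (var_a * var_b + 1e-9)

    def block_difference(frame_a, frame_b, block=16, t=3.0):
        """Fraction of blocks whose likelihood ratio exceeds the threshold t."""
        h, w = frame_a.shape
        ratios = [likelihood_ratio(frame_a[y:y + block, x:x + block].astype(np.float64),
                                   frame_b[y:y + block, x:x + block].astype(np.float64))
                  for y in range(0, h - block + 1, block)
                  for x in range(0, w - block + 1, block)]
        return float(np.mean([r > t for r in ratios]))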

Histogram comparison:

A step further towards reducing sensitivity to camera and object movements can be made by comparing the histograms of successive images. The idea behind histogram-based approaches is that two frames with unchanging background and unchanging (although moving) objects will have little difference in their histograms. In addition, histograms are invariant to image rotation and change slowly under variations of viewing angle and scale. As a disadvantage, one can note that two images with similar histograms may have completely different content. However, the probability of such events is low enough; moreover, techniques for dealing with this problem have already been proposed in [PASS].

A grey level (color) histogram of a frame i is an n-dimensional vector H_i(j), j = 1, ..., n, where n is the number of grey levels (colors) and H_i(j) is the number of pixels of frame i with grey level (color) j.

Global comparison:

The simplest approach uses an adaptation of the metric from (II.3), but instead of intensity values, grey level histograms are compared. A cut is declared if the absolute sum of histogram differences between two successive frames D(f_i, f_{i+1}) is greater than a threshold T:

D(f_i, f_{i+1}) = \sum_{j=1}^{n} \left| H_i(j) - H_{i+1}(j) \right|    (II.8)

where j is the grey value and n is the total number of grey levels.

Another simple and very effective approach is to compare color histograms. Zhang et al. [ZHANG] apply (II.8) where j, instead of grey levels, denotes a code value derived from the three color intensities of a pixel. In order to reduce the bin number (3 colors x 8 bits create histograms with 2^24 bins), only the upper two bits of each color intensity component are used to compose the color code. This solution is demonstrated in Figure II-16, where the lower bits are truncated. The comparison of the resulting 64 bins has been shown to give sufficient accuracy. When the difference is larger than a given threshold T, a cut is declared.
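The upper-two-bits color code and the resulting 64-bin comparison can be written directly with bit operations; the normalization and the cut threshold below are illustrative choices, not values from the thesis.

    import numpy as np

    def color_code_histogram(frame_rgb):
        """64-bin histogram built from the upper two bits of each R, G, B component."""
        r = frame_rgb[:, :, 0] >> 6
        g = frame_rgb[:, :, 1] >> 6
        b = frame_rgb[:, :, 2] >> 6
        code = (r << 4) | (g << 2) | b            # 6-bit color code, values 0..63
        hist, _ = np.histogram(code, bins=64, range=(0, 64))
        return hist

    def is_histogram_cut(frame_a, frame_b, T=0.4):
        """Declare a cut when the normalized histogram difference exceeds T."""
        ha, hb = color_code_histogram(frame_a), color_code_histogram(frame_b)
        return np.abs(ha - hb).sum() / ha.sum() > T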


Figure II-17 Cut (a) and Fade/Dissolve (b) from frame difference

Figure II-17 shows the results of cumulative differences between frames. Figure II-17.a shows the frame differences when a cut appears, marked by clear peaks, and Figure II-17.b shows the frame differences when a fade or dissolve appears. A cut can be detected easily by a threshold, but fades and dissolves are more difficult. The next section explains how to detect fade and dissolve transitions.

Twin-comparison:

The twin-comparison method [SMOLIAR] takes into account the cumulative differences between frames of a gradual transition. In the first pass a high threshold T_h is used to detect cuts, as shown in Figure II-18 (the first peak). In the second pass a lower threshold T_l is employed to detect the potential starting frame F_s of a gradual transition. F_s is then compared to subsequent frames (Figure II-18). This is called an accumulated comparison, as during a gradual transition this difference value increases. The end frame F_e of the transition is detected when the difference between consecutive frames decreases to less than T_l, while the accumulated comparison has increased to a value higher than T_h. If the consecutive difference falls below T_l before the accumulated difference exceeds T_h, then the potential start frame F_s is dropped and the search continues for other gradual transitions. It was found, however, that there are some gradual transitions during which the consecutive difference falls below the lower threshold. This problem can be easily solved by setting a tolerance value that allows a certain number of consecutive frames with low difference values before rejecting the transition candidate. As can be seen, twin comparison detects both abrupt and gradual transitions at the same time.
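A compact way to express the twin-comparison pass over a precomputed list of consecutive frame differences is sketched below; the thresholds T_h and T_l, the tolerance handling, and the returned (start, end, kind) tuples are simplifications of the description above, not the thesis implementation.

    def twin_comparison(frame_diffs, T_h=50.0, T_l=10.0, tolerance=2):
        """Detect abrupt cuts and candidate gradual transitions from a list of
        consecutive frame-difference values. Returns (start, end, kind) tuples."""
        transitions = []
        i, n = 0, len(frame_diffs)
        while i < n:
            d = frame_diffs[i]
            if d > T_h:                                   # abrupt cut
                transitions.append((i, i + 1, "cut"))
                i += 1
            elif d > T_l:                                 # potential gradual start
                start, accumulated, low_count, j = i, 0.0, 0, i
                while j < n and low_count <= tolerance:   # accumulate until diffs stay low
                    accumulated += frame_diffs[j]
                    low_count = low_count + 1 if frame_diffs[j] < T_l else 0
                    j += 1
                if accumulated > T_h:                     # accumulated difference large enough
                    transitions.append((start, j, "gradual"))
                i = j
            else:
                i += 1
        return transitions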

Figure II-18 Twin Comparison

Local histogram comparison:

Histogram-based approaches are simple and more robust to object and camera movements, but they ignore spatial information and therefore fail when two different images have similar histograms. On the other hand, block-based comparison methods make use of spatial information. They typically perform better than pair-wise pixel comparison but are still sensitive to camera and object motion and are also computationally expensive. By integrating the two paradigms, false alarms due to camera and object movement can be reduced while enough spatial information is retained to produce more accurate results.

The frame-to-frame difference of frame f_i and frame f_{i+1} is computed over b_r regions (blocks) as

D(f_i, f_{i+1}) = \sum_{k=1}^{b_r} \sum_{j=1}^{n} \left| H_i(j, k) - H_{i+1}(j, k) \right|

where H_i(j, k) denotes the histogram value at grey level j for the region (block) k, and b_r is the total number of blocks.
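The block-wise histogram difference above can be sketched as follows; the 4 x 4 grid of regions and the 64 bins are example parameters.

    import numpy as np

    def local_histogram_difference(frame_a, frame_b, grid=(4, 4), bins=64):
        """Sum of histogram differences computed region by region over a grid."""
        h, w = frame_a.shape
        bh, bw = h // grid[0], w // grid[1]
        total = 0
        for by in range(grid[0]):
            for bx in range(grid[1]):
                a = frame_a[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
                b = frame_b[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
                ha, _ = np.histogram(a, bins=bins, range=(0, 256))
                hb, _ = np.histogram(b, bins=bins, range=(0, 256))
                total += int(np.abs(ha - hb).sum())
        return total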

There are some other methods used in shot detection as well: model-based segmentation, DCT-based methods, and vector quantization. With model-based segmentation, it is possible to detect not only shot cuts and dissolves but also various types of wipes. In [HAMPAPUR], Hampapur et al. gave a model for the variation of chromaticity or brightness in two consecutive frames and detected the change. In [IDRIS], Idris and Panchanathan use vector quantization to compress a video sequence using a codebook of size 256 and 64-dimensional vectors. The histogram of the labels obtained from the codebook for each frame is used as a frame similarity measure, and a chi-square statistic is used to detect cuts. In DCT-based methods, the DCT coefficients are used to compare two video frames. The DCT is commonly used for reducing spatial redundancy in an image in different video compression schemes such as MPEG and JPEG. Compression of the video is carried out by dividing the image into a set of 8x8 pixel blocks. Using the DCT, the pixels in the blocks are transformed into 64 coefficients, which are quantized and Huffman entropy encoded. The DCT coefficients are analyzed to find frames where camera breaks take place. Since the coefficients in the frequency domain are mathematically related to the spatial domain, they can be used to detect changes in the video sequence.

II.5 Video motion

As mentioned before, motion is the most interesting feature that appears in video. From a video sequence, there are many ways to extract motion, depending on the purpose of each system. And what kind of motion is considered? It could be the motion of each pixel in a frame (the way optical flow does it), the motion of each block (MPEG-2 handles this), the motion of objects, global motion, or the motion of layers. Extracting motion became important not only in low-level analysis but also in higher-level analysis. This section discusses some types of motion approaches, with more detail on optical flow because of its popular use and its primary role.

II.5.1 Motion trajectory

The motion trajectory of an object is a simple, high-level feature, defined as the localization, in time and space, of one representative point of this object. This descriptor shows its usefulness for content-based retrieval in object-oriented visual databases. It is also of help in more specific applications. In given contexts with a priori knowledge, the trajectory can enable much functionality. In surveillance, alarms can be triggered if some object has a trajectory identified as dangerous (e.g. passing through a forbidden area, being unusually quick, etc.). In sports, specific actions (e.g. tennis rallies taking place at the net) can be recognized. Besides, such a description also allows enhanced data interaction/manipulation: for semiautomatic multimedia editing, a trajectory can be stretched, shifted, etc., to adapt the object motion to any given sequence global context.

The descriptor is essentially a list of key-points (x, y, z, t) along with a set of optional interpolating functions that describe the path of the object between key-points, in terms of acceleration. The speed is implicitly known from the key-point specification. The key-points are specified by their time instant and either their 2-D or 3-D Cartesian coordinates, depending on the intended application. The interpolating functions are defined for each component x(t), y(t), and z(t) independently.
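The key-point list with per-component interpolation can be represented very simply; the class names and the use of linear interpolation (np.interp) are illustrative, since the descriptor itself allows arbitrary interpolating functions.

    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class KeyPoint:
        t: float            # time instant
        x: float
        y: float
        z: float = 0.0      # optional third coordinate for 3-D trajectories

    @dataclass
    class Trajectory:
        points: List[KeyPoint]   # assumed sorted by increasing time t

        def position(self, t: float):
            """Interpolate the object position at time t, component by component."""
            ts = [p.t for p in self.points]
            x = float(np.interp(t, ts, [p.x for p in self.points]))
            y = float(np.interp(t, ts, [p.y for p in self.points]))
            z = float(np.interp(t, ts, [p.z for p in self.points]))
            return x, y, z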
