
DOCUMENT INFORMATION

Basic information

Title: Concept-Based Video Retrieval
Authors: Cees G.M. Snoek, Marcel Worring
Institution: University of Amsterdam
Field: Intelligent Systems
Type: scientific research
Year: 2008
City: Amsterdam
Pages: 46
Size: 4.12 MB


Page 1

Concept-Based Video Retrieval

Cees Snoek and Marcel Worring

with contributions by:

many

Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands


The science of labeling

► To understand anything in science, things have to have a name that is recognized and is universal

naming chemical elements

naming the human genome

naming 'categories'

naming living organisms

naming rocks and minerals

naming textual information

What about naming video information?

Page 2

Problem statement

[Figure: video streams rendered as raw bit strings, awaiting semantic labels]

Different low-level features


color

Page 3

Basic example: color histogram

[Figure: a 380 x 640 pixel image and its color histogram of pixel counts]

Total: 243,200 pixels. The histogram is a summary of the data, in this case summarizing color characteristics.
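As a minimal illustration (not from the slides), a joint RGB color histogram in Python with NumPy; the bin count and image contents are arbitrary stand-ins:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count pixels
    per (r, g, b) bin, yielding a bins**3-dimensional summary."""
    # image: H x W x 3 uint8 array
    quantized = (image.astype(np.uint32) * bins) // 256   # values in 0..bins-1
    codes = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins**3)
    return hist / hist.sum()   # relative frequencies

# A 380 x 640 image has 243,200 pixels, matching the slide's example.
image = np.random.randint(0, 256, size=(380, 640, 3), dtype=np.uint8)
print(color_histogram(image).shape)   # (512,) for bins=8
```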

Advanced example: codebook model

► Create a codeword vocabulary

✓ Codeword annotation (e.g., Sky, Water)

References:

Leung and Malik, IJCV 2001
Sivic and Zisserman, ICCV 2003
van Gemert, PhD thesis, UvA, 2008

► Discretize the image with codewords

► Represent the image as a codebook histogram

[Figure: codebook histogram of relative codeword frequencies]
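A sketch of this pipeline, assuming k-means as the vocabulary learner (the cited papers use various clustering and annotation schemes); all descriptor arrays are random stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans

# Step 1: learn a codeword vocabulary by clustering local descriptors
# sampled from training images (descriptors: N x D, e.g. SIFT with D=128).
descriptors = np.random.rand(10000, 128)          # stand-in for real features
codebook = KMeans(n_clusters=400, n_init=4).fit(descriptors)

def codebook_histogram(image_descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    return the normalized codeword histogram (bag-of-features)."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / max(hist.sum(), 1)

one_image = np.random.rand(350, 128)              # descriptors of one image
bow = codebook_histogram(one_image, codebook)     # 400-dim representation
```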

Page 4

The goal: semantic video indexing

► The process of automatically detecting the presence of a semantic concept in a video stream

Example concept: Airplane

Semantic indexing

► The computer vision approach

✓ Building detectors one at a time

A face detector for frontal faces… 3 years later… a face detector for non-frontal faces.

One (or more) PhDs for every new concept

Page 5

So how about these?

Outdoor

And the > 1,000 others…

Generic concept detection in a nutshell

[Diagram: labeled examples (outdoor, aircraft) → Feature Extraction → Supervised Learner (training)]

Output: "It is outdoor, probability 0.95"

Page 7

Support vector machine

► Support Vector Machine

✓ Maximizes the margin between two classes

✓ SVM is usually a good choice as the supervised learner

[Figure: two classes in a feature space (axis F1 shown) separated by a maximum-margin hyperplane]

► Depends on many parameters

✓ Select the best of multiple parameter combinations

✓ Using cross validation

✓ Including the weight for the positive class and the weight for the negative class

[Diagram: feature vector → SVM → semantic concept probability]
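The recipe on this slide, an SVM whose parameters and class weights are tuned by cross validation, looks roughly like this in scikit-learn (a sketch; feature and label arrays are random stand-ins, not the TRECVID data):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
train_features = rng.random((200, 512))   # stand-in codebook histograms
train_labels = rng.integers(0, 2, 200)    # 1 = concept present

# Candidate parameter combinations; the best one is chosen by cross
# validation. Class weights compensate for scarce positive examples.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-3, 1e-2, 1e-1],
    "class_weight": [{1: w} for w in (1, 5, 10)],
}
search = GridSearchCV(SVC(kernel="rbf", probability=True),
                      param_grid, cv=5, scoring="average_precision")
search.fit(train_features, train_labels)

# Probabilistic output per shot: "it is outdoor, probability 0.95"
p_concept = search.predict_proba(train_features)[:, 1]
```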

Page 8

How to improve concept detection?

[Diagram: the basic chain, Feature Extraction → Supervised Learner, can be improved at both stages]

Feature fusion: multimodal

[Diagram: Visual Feature Extraction and Textual Feature Extraction are concatenated and feed a single Supervised Learner]

+ Only one learning phase
+ Truly a multimedia representation
− Multimodal combination often ad hoc
− One modality may dominate
− Feature vectors easily become too large
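For contrast with the diagrams, a minimal early-fusion sketch (all array names and sizes are illustrative stand-ins, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
visual = rng.random((200, 400))    # e.g. codebook histogram per shot
textual = rng.random((200, 100))   # e.g. speech-transcript term vector
labels = rng.integers(0, 2, 200)

# Early fusion: concatenate modalities into one (large) vector and
# train a single learner -- one learning phase, but the combined
# vector grows quickly and one modality may dominate the distances.
fused = np.hstack([visual, textual])
early = SVC(probability=True).fit(fused, labels)
```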

Page 9

Feature fusion: unimodal

References:

van de Sande, CIVR 2008

[Diagram: point sampling strategy (Harris-Laplace salient points) → color feature extraction → codebook model → bag-of-features histogram of relative codeword frequencies]

+ Codebook model reduces dimensionality
− Combination still ad hoc
− One feature may dominate

Classifier fusion: multimodal

References:

Wu, ACM Multimedia 2004
Snoek, ACM Multimedia 2005

[Diagram: Visual Feature Extraction → Supervised Learner and Textual Feature Extraction → Supervised Learner; the outputs are combined by Classifier Fusion]

+ Focus on modality strength
+ Fusion in semantic space
− Expensive in terms of learning effort
− Possible loss of feature space correlation

Page 10

Classifier fusion: unimodal

References:

Snoek, TRECVID 2006
Wang, ACM MIR 2007

[Diagram: Global, Regional, and Keypoint Image Feature Extraction each feed a Support Vector Machine; the outputs are aggregated by Geometric Mean, Logistic Regression, or Fisher Discriminant]

+ Aggregation functions reduce learning effort
+ Offers the opportunity to use all available examples efficiently
− Linear function likely to be sub-optimal
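One of the aggregation functions named above, the geometric mean, is easy to sketch (the per-detector probabilities below are made-up stand-ins):

```python
import numpy as np
from scipy.stats import gmean  # geometric mean aggregation

# Late fusion, unimodal: one SVM per feature type, then aggregate the
# per-shot probabilities without a second learning phase.
p_global   = np.array([0.90, 0.20, 0.60])   # SVM on global features
p_regional = np.array([0.80, 0.30, 0.40])   # SVM on regional features
p_keypoint = np.array([0.95, 0.10, 0.70])   # SVM on keypoint features

fused = gmean(np.vstack([p_global, p_regional, p_keypoint]), axis=0)
```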

Modeling relations

► Exploitation of conceptual co-occurrence

✓ Concepts do not occur in a vacuum

✓ On the contrary, they are related

References:

IBM 2003
Naphade and Huang, TMM 3(1) 2001

Page 11

− Limited scalability

✓ Implicitly learn relations: using SVM or data mining tools

− Assumes the classifier learns relations
− Suffers from error propagation

References:

IBM 2003
Naphade and Huang, TMM 3(1) 2001

IBM's pipeline

Page 12

[Diagram: IBM's pipeline — Feature Fusion → Classifier Fusion → Modeling Relations]

Semantic Pathfinder

[Diagram: three successive analysis steps. Content Analysis Step: Visual and Textual Features Extraction → Multimodal Features Combination → Supervised Learner. Style Analysis Step: Layout, Content, Capture, and Context Features Extraction → Supervised Learner. Context Analysis Step: Semantic Features Combination → Supervised Learner. The best of the 3 paths is selected after validation. Example concepts: Animal, Sports, Vehicle, Flag, Fire, Entertainment, Monologue, Weather news, Hu Jintao]

Page 13

[The Semantic Pathfinder diagram repeated, with the generic stages overlaid: Feature Fusion → Classifier Fusion → Modeling Relations]

Tsinghua University

Page 14

[Diagram: Tsinghua University's system, likewise organized as Feature Fusion → Classifier Fusion → Modeling Relations]

Fragmented research efforts…

► Video analysis researchers

✓ Until 2001 everybody defined his or her own concepts

✓ Using specific and small data sets

✓ Hard to compare methodologies

Since 2001: worldwide evaluation by NIST

Page 15

NIST TRECVID benchmark

► Benchmark objectives (anno 2001)

✓ Promote progress in video retrieval research

✓ Provide a common dataset (shots, recognized speech, key frames)

✓ Use open, metrics-based evaluation

► Large international field of participants

► Currently the de facto standard for evaluation

http://trecvid.nist.gov/

[Figure: TRECVID evolution 2001-2006 in data, tasks, and participants. Hours of train and test data grow steadily; sources move from the Prelinger archive and CNN/C-SPAN to ABC/CNN news, Arabic TV, and English, Chinese, and Arabic TV. Tasks cover shots, search, concepts, stories, and BBC rushes. Participant counts (applied vs. finished) rise over the years. Source: Paul Over, NIST]


Page 17

NIST TRECVID benchmark: concept detection task

► Given:

✓ a video dataset segmented into a set of S unique shots

✓ a set of N semantic concept definitions

► Task:

✓ How well can you detect the concepts?

✓ Rank S based on the presence of each concept from N

Page 18

TRECVID evaluation measures

► Classification procedure

✓ Training: many hours of (partly) annotated video

✓ Testing: many hours of unseen video

► Evaluation measure: Average Precision

✓ Combines precision and recall

✓ Averages precision after every relevant shot

✓ Top of the ranked list is most important
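Average Precision as described, precision evaluated at the rank of every relevant shot and then averaged, can be sketched directly:

```python
def average_precision(ranked_relevance):
    """Average Precision over a ranked shot list: precision is taken
    at the rank of every relevant shot and then averaged, so errors
    near the top of the list hurt most."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# Relevant shots at ranks 1, 2, and 5:
print(average_precision([1, 1, 0, 0, 1]))   # (1/1 + 2/2 + 3/5) / 3 = 0.8667
```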

Page 19

491 detectors, a closer look

The number of labeled image examples used at training time seems decisive for concept detector accuracy.

Demo time!

Page 20

Concept detector: requires examples

► TRECVID's collaborative research agenda has been pushing manual concept annotation efforts

[Figure: annotation lexicon sizes — LSCOM: 374 concepts; MediaMill - UvA: 491 concepts]

Page 21

Collaborative annotation tool

► Manual annotation by 100+ TRECVID participants

✓ Incomplete, but reliable

TRECVID 2005

References:

Christel, Informedia, 2005
Volkmer et al., ACM MM 2005

Manual annotations: LSCOM-lite

✓ Large-Scale Concept Ontology for Multimedia

✓ Aims for an ontology of 1,000 annotated concepts

► LSCOM-Lite: annotations for 39 semantic concepts

References:

Naphade et al., IEEE Multimedia 2006

Page 22

TRECVID Criticism

► Focus is on the final result

✓ TRECVID judges the relative merit of indexing methods

✓ Ignores repeatability of intermediate analysis steps

► Systems are becoming more complex

✓ Typically combining several features and learning methods

► Component-based optimization and comparison impossible

[Diagram: the Semantic Pathfinder components — feature extraction, feature combination, and supervised learners across the content, style, and context analysis steps. What is the contribution of these components?]

MediaMill Challenge

► The Challenge provides

✓ A manually annotated lexicon of 101 semantic concepts

✓ Pre-computed low-level multimedia features

✓ Trained classifier models

► The Challenge allows you to

✓ Gain insight into intermediate video analysis steps

✓ Foster repeatability of experiments

✓ Optimize video analysis systems on a component level

[Diagram: Challenge baseline experiments — Experiment 1: Visual Feature Extraction → Supervised Learner; Experiment 2: Textual Feature Extraction → Supervised Learner; later experiments add late fusion and combined analysis]

Page 23

▪ Pure computer vision

▪ Pure natural language processing

▪ Pure machine learning

✓ For education

Students can do:

▪ large-scale experiments

▪ compare themselves to each other

▪ … and to the state-of-the-art

Columbia374

► Baseline for 374 concept detectors

✓ Focus is on visual analysis experiments

Available online:

http://www.ee.columbia.edu/ln/dvmm/columbia374/

Page 24

Page 25

Demo: results for "drummer"

Conclusions

► An international community is building a bridge to narrow the semantic gap

✓ Currently detects more than 500 concepts in broadcast video

✓ Generalizes outside the news domain

► Important lessons

✓ No superior method for all concepts exists

✓ Best to learn the optimal approach per concept

✓ The best methods cover variation in features, classifiers, and concepts

Page 26

Concept detection challenges

► Show generality of the approach over several domains

✓ Show the benefit of web-based images/video and annotations

► Show that concept classes work with less analysis

✓ People, objects, setting

► Show the benefit of using the dynamic nature of video

► Show that an ontology can help

✓ How to connect logical relations to uncertain detectors?

► Show that 'iconological' concepts can be detected

✓ E.g., funny, sarcastic, cozy, …

Using concept detectors

► "We are now seeing researchers starting to use the confidence values from concept detectors within the shot retrieval process, and this appears to be the roadmap for future work in this area."

✓ Alan Smeaton, Information Systems, 32(4):545-559, 2007

Page 27

Measure concept detector influence

TRECVID automatic search task

[Diagram: query topics → Search Engine → ranked result]

► Automatically solve a search topic

► Return 1,000 ranked shot-based results

► Evaluate using Average Precision

► TRECVID 2005

✓ 85-hour test set of Chinese, Arabic, and English TV news

✓ 24 search topics

Page 28

Topic examples

Find shots of a hockey rink with at least one of the nets fully visible from some point of view.

Find shots of one or more helicopters.

Find shots of an office setting, i.e., one or more desks/tables and one or more computers and one or more people.

Influence of lexicon size

► Lexicon = 363 machine-learned concept detectors

► Procedure

1. Set bag size B = 10
2. Select a random bag of B detectors from the lexicon
3. Determine the maximum performance for each search topic
4. B += 10
5. Go back to step 2

► Repeat the process 100 times (see the sketch below)

✓ Reduces random positive and negative effects
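A toy simulation of this procedure (the per-detector AP values are random stand-ins; the real experiment uses the 363-detector lexicon and 24 topics):

```python
import random

def lexicon_size_curve(ap, repeats=100, step=10):
    """Simulate the slides' procedure: for growing random bags of
    detectors, record the oracle per-topic performance, averaged over
    topics and repeats. `ap[d][t]` is the average precision of
    detector d on topic t (stand-in values in the demo below)."""
    detectors = list(ap)
    n_topics = len(next(iter(ap.values())))
    curve = {}
    for size in range(step, len(detectors) + 1, step):
        total = 0.0
        for _ in range(repeats):   # repeat to reduce random effects
            bag = random.sample(detectors, size)
            total += sum(max(ap[d][t] for d in bag)
                         for t in range(n_topics)) / n_topics
        curve[size] = total / repeats
    return curve

# Demo with 40 fake detectors and 24 topics:
random.seed(0)
ap = {f"det{i}": [random.random() * 0.3 for _ in range(24)] for i in range(40)}
print(lexicon_size_curve(ap, repeats=20))
```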

Page 29

Influence of lexicon size

TRECVID 2005

[Plot: search performance vs. lexicon size; linear increase for the first 60 detectors]

► Size matters

✓ A lexicon of 150 detectors comes close to maximum performance

► Some detectors perform well for specific topics

✓ A tennis-game detector for "find two visible tennis players"

► A substantial number of detectors is not accurate enough yet

✓ Only a small increase when more than 70 detectors are used

► How to combine multiple detectors?

✓ Experiment: pair-wise oracle fusion

Influence of detector combination

[Example: fusing the Office and Computer detectors]

− Improvement for 20 out of 24 topics
− Increase per topic as high as 89%
− Overall increase of 10%
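A sketch of pair-wise oracle fusion, under the assumption that the fused AP of every detector pair is precomputed (the table below is made up):

```python
def best_pair(ap_fused, topic):
    """Oracle pair-wise fusion: evaluate every detector pair and keep
    the one whose fused ranking scores best on the topic -- an upper
    bound on what a real combination strategy could achieve.
    `ap_fused[(d1, d2)][topic]` is assumed precomputed."""
    return max(ap_fused, key=lambda pair: ap_fused[pair][topic])

# Stand-in fused-AP values for the "office setting" topic:
ap_fused = {
    ("office", "computer"): {"office setting": 0.31},
    ("office", "people"):   {"office setting": 0.24},
    ("computer", "people"): {"office setting": 0.19},
}
print(best_pair(ap_fused, "office setting"))   # ('office', 'computer')
```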

Page 30

Problem statement

Find shots of George Bush entering or leaving a vehicle (e.g., car, van, airplane, helicopter), with him and the vehicle both visible at the same time.

Iyad Allawi + rocket-propelled grenades?

Find shots of an office setting

► How to translate a query topic to concept detectors?

Page 31

Detector selection strategies

With B. Huurnink / M. de Rijke, L. Hollink / G. Schreiber, M. Worring

[Diagram: a video query ("Find shots of an office setting") is mapped onto concept detectors via three strategies — text matching, ontology querying, and semantic visual querying — over the multimedia raw data]
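The simplest of the three strategies, text matching, might look like this (a sketch; the real systems use richer linguistic analysis and ontologies):

```python
def text_match(query, lexicon):
    """Simplest detector-selection strategy: keep detectors whose
    concept name occurs in the query text. A stand-in for the
    text-matching component in the diagram above."""
    words = set(query.lower().split())
    return [c for c in lexicon if c.lower() in words]

lexicon = ["Office", "Computer", "Grass", "Building", "Hu Jintao"]
print(text_match("Find shots of an office setting", lexicon))  # ['Office']
```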

Influence of detector selection combination

► Individual selection strategies seem comparable

✓ But an oracle combination of selection strategies pays off!

TRECVID 2005

[Plot: performance of text matching, ontology querying, and visual querying, per strategy and combined. Example topic: "Find shots of an office setting, i.e., one or more desks/tables and one or more computers and one or more people"; selected detectors include Office, Computer, Building, Grass, and Emile Lahoud]

Page 32

TRECVID interactive search task

► So many choices for retrieval…

✓ Why not let the user decide interactively?

http://trecvid.nist.gov/

'Classic' Informedia system

► First multimodal video search engine

References:

Carnegie Mellon University
