Concept-Based Video Retrieval
Cees Snoek and Marcel Worring
with contributions by:
many
Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands
The science of labeling
► To understand anything in science, things have to have a name that is recognized and is universal
naming chemical elements
naming the human genome
naming ‘categories’
naming living organisms
naming rocks and minerals
naming textual information
What about naming video information?
Problem statement
[Figure: to a computer, video is just a stream of bits, e.g., 1101011011011 0110110110011 0101101111100 …]
Different low-level features
color
Basic example: color histogram
[Histogram figure: counts of pixels per color bin, e.g., 380 pixels in one bin, 243,200 pixels in total]
A histogram is a summary of the data, in this case summarizing color characteristics.
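To make the histogram summary concrete, here is a minimal sketch (not from the slides) using NumPy; the image size and the number of bins are assumptions.

# Sketch: summarize an image by a color histogram (assumed 8 bins per channel).
import numpy as np

def color_histogram(image, bins=8):
    """image: H x W x 3 uint8 array; returns a normalized histogram per channel."""
    hist = []
    for channel in range(3):
        counts, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        hist.append(counts / counts.sum())   # relative frequency per bin
    return np.concatenate(hist)              # 3 * bins numbers summarize the image

# Example: a 640 x 380 image has 243,200 pixels in total.
image = np.random.randint(0, 256, size=(380, 640, 3), dtype=np.uint8)
print(color_histogram(image).shape)          # (24,)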
Advanced example: codebook model
► Create a codeword vocabulary
✓ Codeword annotation (e.g., Sky, Water)
Leung and Malik, IJCV, 2001
Sivic and Zisserman, ICCV, 2003
van Gemert, PhD thesis, UvA, 2008
► Discretize the image with codewords
► Represent the image as a codebook histogram
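A minimal sketch of the codebook idea under assumed settings (k-means clustering via scikit-learn, 128-dimensional descriptors, 200 codewords); the descriptors themselves are placeholders.

# Sketch: codebook (bag-of-features) model.
import numpy as np
from sklearn.cluster import KMeans

# 1. Create a codeword vocabulary by clustering local descriptors.
train_descriptors = np.random.rand(10000, 128)          # placeholder descriptors
codebook = KMeans(n_clusters=200, n_init=3, random_state=0).fit(train_descriptors)

# 2. Discretize an image: assign each of its descriptors to the nearest codeword.
image_descriptors = np.random.rand(350, 128)             # descriptors of one image
words = codebook.predict(image_descriptors)

# 3. Represent the image as a codebook histogram (relative frequency per codeword).
hist = np.bincount(words, minlength=200).astype(float)
hist /= hist.sum()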
The goal: semantic video indexing
► The process of automatically detecting the presence of a semantic concept in a video stream
Airplane
Semantic indexing
► The computer vision approach
✓ Building detectors one at a time
A face detector for frontal faces
… 3 years later …
A face detector for non-frontal faces
One (or more) PhD for every new concept
So how about these?
Outdoor
And the > 1,000 others …
Generic concept detection in a nutshell
[Diagram: labeled examples (e.g., outdoor, aircraft) → feature extraction → supervised learner (training); at test time the detector outputs: “It is outdoor with probability 0.95”]
Support vector machine
An SVM is usually a good choice
► Support Vector Machine
✓ Maximizes the margin between two classes
[Figure: two classes in feature space separated by a maximum-margin hyperplane]
Supervised Learner
► Depends on many parameters
✓ Select the best of multiple parameter combinations
✓ Using cross-validation
[Diagram: feature vector → SVM → semantic concept probability; parameters include the weight for the positive class and the weight for the negative class]
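A minimal sketch, not the authors' implementation, of selecting SVM parameters (including the positive class weight) by cross-validation with scikit-learn; the data and parameter grid are placeholders.

# Sketch: SVM with parameter selection by cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(200, 24)                 # feature vectors (e.g., color histograms)
y = np.random.randint(0, 2, size=200)       # concept labels: 1 = outdoor, 0 = not

param_grid = {
    "C": [0.1, 1, 10, 100],                              # error/margin trade-off
    "gamma": [0.01, 0.1, 1],                             # RBF kernel width
    "class_weight": [{1: w} for w in (1, 2, 5, 10)],     # weight for the positive class
}
search = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid, cv=5)
search.fit(X, y)

# The tuned SVM outputs a semantic concept probability per shot.
print(search.predict_proba(X[:1]))          # e.g., "outdoor with probability 0.95"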
How to improve concept detection?
[Diagram: the basic pipeline of feature extraction followed by a supervised learner, repeated to suggest several places for improvement]
Feature fusion: multimodal
[Diagram: visual feature extraction and textual feature extraction combined into one representation → a single supervised learner]
+ Only one learning phase
+ Truly a multimedia representation
- Multimodal combination often ad hoc
- One modality may dominate
- Feature vectors become too large easily
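A minimal sketch of multimodal feature fusion under assumed feature dimensions: normalize and concatenate the visual and textual vectors so a single learner can be trained on the joint representation.

# Sketch: early fusion by concatenating modality feature vectors (assumed setup).
import numpy as np

visual = np.random.rand(500, 4000)    # e.g., bag-of-features histograms per shot
textual = np.random.rand(500, 2000)   # e.g., term-frequency vectors from transcripts

# Normalize each modality so neither dominates purely by scale,
# then concatenate into a single (possibly very large) feature vector.
def l1_normalize(m):
    return m / np.maximum(m.sum(axis=1, keepdims=True), 1e-12)

fused = np.hstack([l1_normalize(visual), l1_normalize(textual)])   # shape (500, 6000)
# A single supervised learner (e.g., the SVM above) is then trained on `fused`.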
Feature fusion: unimodal
References:
van de Sande, CIVR 2008
+ Codebook model reduces dimensionality
- Combination still ad hoc
- One feature may dominate
[Diagram: point sampling strategy (e.g., Harris-Laplace salient points) → color feature extraction → codebook model → bag-of-features histogram (relative frequency per codebook element)]
Classifier fusion: multimodal
+ Focus on modality strength
+ Fusion in semantic space
References: Wu, ACM Multimedia 2004; Snoek, ACM Multimedia 2005
[Diagram: visual feature extraction and textual feature extraction each feed their own supervised learner; the classifier outputs are then combined by classifier fusion]
- Expensive in terms of learning effort
- Possible loss of feature space correlation
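A minimal sketch of multimodal classifier fusion; averaging the per-modality posterior probabilities is an assumption, and logistic regression stands in for any supervised learner.

# Sketch: late fusion of per-modality classifier outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

visual = np.random.rand(300, 50)
textual = np.random.rand(300, 30)
y = np.random.randint(0, 2, size=300)

p_visual = LogisticRegression(max_iter=1000).fit(visual, y).predict_proba(visual)[:, 1]
p_textual = LogisticRegression(max_iter=1000).fit(textual, y).predict_proba(textual)[:, 1]

# Fuse in "semantic space": combine the two concept probabilities per shot.
p_fused = 0.5 * (p_visual + p_textual)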
Classifier fusion: unimodal
References: Snoek, TRECVID 2006; Wang, ACM MIR 2007
[Diagram: global, regional, and keypoint image feature extraction, each classified by a supervised learner (Support Vector Machine, Logistic Regression, Fisher Discriminant), combined by a geometric mean]
+ Aggregation functions reduce learning effort
+ Offers the opportunity to use all available examples efficiently
- Linear function likely to be sub-optimal
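A minimal sketch of aggregating several unimodal classifier outputs with a geometric mean; the score arrays are placeholders.

# Sketch: fuse scores of several classifiers on the same modality by geometric mean.
import numpy as np

scores = np.array([
    [0.90, 0.10, 0.60],   # e.g., SVM on global image features (3 shots)
    [0.80, 0.20, 0.70],   # e.g., logistic regression on regional features
    [0.95, 0.05, 0.40],   # e.g., Fisher discriminant on keypoint features
])
# Geometric mean per shot: an aggregation function with no extra learning phase.
geometric_mean = np.exp(np.log(np.clip(scores, 1e-12, 1.0)).mean(axis=0))
print(geometric_mean)     # one fused concept score per shot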
Modeling relations
► Exploitation of conceptual co-occurrence
✓ Concepts do not occur in a vacuum
✓ Rather, they are related
References: IBM 2003; Naphade and Huang, TMM 3(1), 2001
– Limited scalability
✓ Implicitly learn relations: using an SVM or data mining tools (see the sketch below)
– Assumes the classifier learns relations
– Suffers from error propagation
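A minimal sketch (an assumed setup, not a specific published system) of implicitly learning relations: the outputs of all first-stage detectors form a context feature vector for a second-stage SVM.

# Sketch: implicit relation modeling via a second-stage classifier.
import numpy as np
from sklearn.svm import SVC

# Context vector per shot: outputs of N first-stage concept detectors.
detector_scores = np.random.rand(400, 100)     # 400 shots, 100 detectors (placeholder)
y_target = np.random.randint(0, 2, size=400)   # labels for one target concept

# The SVM is assumed to pick up co-occurrence relations between concepts;
# errors of the first stage may propagate to this stage.
relation_model = SVC(probability=True).fit(detector_scores, y_target)
refined = relation_model.predict_proba(detector_scores)[:, 1]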
References: IBM 2003; Naphade and Huang, TMM 3(1), 2001
IBM’s pipeline
[IBM’s pipeline: feature fusion → classifier fusion → modeling relations]
Semantic Pathfinder
[Diagram: the Semantic Pathfinder analyzes video in three consecutive steps: a content analysis step (visual and textual features extraction, multimodal features combination), a style analysis step (layout, content, capture, and context features extraction), and a context analysis step (semantic features combination); each step has its own supervised learner, and the best of the 3 paths is selected after validation. Example concepts: Animal, Sports, Vehicle, Flag, Fire, Entertainment, Monologue, Weather news, Hu Jintao]
[The same Semantic Pathfinder diagram, annotated with the feature fusion, classifier fusion, and modeling relations stages]
Tsinghua University
[Pipeline: feature fusion → classifier fusion → modeling relations]
Fragmented research efforts…
► Video analysis researchers
✓ Until 2001 everybody defined her or his own concepts
✓ Using specific and small data sets
✓ Hard to compare methodologies
► Since 2001: worldwide evaluation by NIST
NIST TRECVID benchmark
► Benchmark objectives (anno 2001)
✓ Promote progress in video retrieval research
✓ Provide a common dataset (shots, recognized speech, key frames)
✓ Use open, metrics-based evaluation
► Large international field of participants
► Currently the de facto standard for evaluation
http://trecvid.nist.gov/
TRECVID evolution: data, tasks, participants, papers (Source: Paul Over, NIST)
[Chart: hours of train and test data per year, 2001 to 2006; data sources evolve from the Prelinger archive via CNN, C-SPAN, and ABC news to English, Chinese, and Arabic TV; tasks grow from shots and search to concepts, stories, and BBC rushes; the number of participants that applied and finished increases]
NIST TRECVID benchmark: concept detection task
► Given:
✓ a video dataset segmented into a set of S unique shots
✓ a set of N semantic concept definitions
► Task:
✓ How well can you detect the concepts?
✓ Rank the S shots based on the presence of each concept from N
TRECVID evaluation measures
► Classification procedure
✓ Training: many hours of (partly) annotated video
✓ Testing: many hours of unseen video
► Evaluation measure: Average Precision
✓ Combines precision and recall
✓ Averages precision after every relevant shot
✓ Top of the ranked list most important
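A minimal sketch of the average precision measure just described; the relevance judgments are a toy example.

# Sketch: average precision over a ranked list of shots.
def average_precision(ranked_relevance):
    """ranked_relevance: list of 0/1 judgments in ranked order (1 = concept present)."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank      # precision after this relevant shot
    # divide by the number of relevant shots (all assumed to appear in the list)
    return precision_sum / hits if hits else 0.0

print(average_precision([1, 0, 1, 1, 0]))     # ~0.806: top of the list weighs most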
491 detectors, a closer look
The number of labeled image examples used at training time seems decisive in concept detector accuracy.
Demo time!
Concept detector: requires examples
► TRECVID’s collaborative research agenda has been pushing manual concept annotation efforts
[Figure: growing number of manually annotated concepts, e.g., 374, 491, …; efforts include LSCOM and MediaMill - UvA]
Collaborative annotation tool
► Manual annotation by 100+ TRECVID participants
✓ Incomplete, but reliable
TRECVID 2005
References:
Christel, Informedia, 2005; Volkmer et al., ACM MM 2005
Manual annotations: LSCOM-lite
► LSCOM: Large-Scale Concept Ontology for Multimedia
✓ Aims for an ontology of 1,000 annotated concepts
References: Naphade et al., IEEE Multimedia 2006
TRECVID 2005
► LSCOM-Lite: annotations for 39 semantic concepts
TRECVID criticism
► Focus is on the final result
✓ TRECVID judges the relative merit of indexing methods
✓ Ignores repeatability of intermediate analysis steps
► Systems are becoming more complex
✓ Typically combining several features and learning methods
► Component-based optimization and comparison impossible
[Diagram: a complex system, the Semantic Pathfinder, with content, style, and context analysis steps, many feature extraction components, and several supervised learners]
What is the contribution of these components?
MediaMill Challenge
► The Challenge provides
✓ A manually annotated lexicon of 101 semantic concepts
✓ Pre-computed low-level multimedia features
✓ Trained classifier models
► The Challenge allows you to
✓ Gain insight into intermediate video analysis steps
✓ Foster repeatability of experiments
✓ Optimize video analysis systems on a component level
[Diagram: baseline experiments (Experiment 1, Experiment 2, …) combining visual feature extraction, textual feature extraction, late fusion, combined analysis, and supervised learners]
Pure computer vision
Pure natural language processing
Pure machine learning
✓ For education
– Students can do large-scale experiments, compare themselves to each other, … and to the state-of-the-art
Columbia374
► Baseline for 374 concept detectors
✓ Focus is on visual analysis experiments
Available online: http://www.ee.columbia.edu/ln/dvmm/columbia374/
Demo: results for drummer
Conclusions
► An international community is building a bridge to narrow the semantic gap
✓ Currently detects more than 500 concepts in broadcast video
✓ Generalizes outside the news domain
► Important lessons
✓ No superior method for all concepts exists
✓ Best to learn the optimal approach per concept
✓ Best methods cover variation in features, classifiers, and concepts
Concept detection challenges
► Show generality of the approach over several domains
✓ Show the benefit of web-based images/video and annotations
► Show that concept classes work with less analysis
✓ People, objects, setting
► Show the benefit of using the dynamic nature of video
► Show that an ontology can help
✓ How to connect logical relations to uncertain detectors?
► Show that ‘iconological’ concepts can be detected
✓ E.g., funny, sarcastic, cozy, …
Using concept detectors
► “We are now seeing researchers starting to use the confidence values from concept detectors, within the shot retrieval process and this appears to be the roadmap for future work in this area.”
✓ Alan Smeaton, Information Systems, 32(4):545-559, 2007
Measure concept detector influence
TRECVID automatic search task
[Diagram: query topics → search engine → result]
► Automatically solve a search topic
► Return 1,000 ranked shot-based results
► Evaluate using Average Precision
► TRECVID 2005
✓ 85 hrs test set – Chinese, Arabic, English TV news
✓ 24 search topics
Topic examples
Find shots of a hockey rink with at least one of the nets fully visible from some point of view.
Find shots of one or more helicopters
Find shots of an office setting, i.e., one or more desks/tables and one or more computers and one or more people
Influence of lexicon size
► Lexicon = 363 machine-learned concept detectors
► Procedure
1. Set bag size B = 10
2. Select a random bag of B detectors from the lexicon
3. Determine the maximum performance for each search topic
4. B += 10
5. Go back to step 2
► Repeat the process 100 times
✓ Reduces random positive and negative effects
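A minimal sketch of this bagging procedure; the per-detector, per-topic average precision table is a placeholder.

# Sketch: influence of lexicon size via random bags of detectors (placeholder data).
import numpy as np

rng = np.random.default_rng(0)
num_detectors, num_topics = 363, 24
# Placeholder: AP of each single detector on each search topic.
ap = rng.random((num_detectors, num_topics)) * 0.3

for bag_size in range(10, num_detectors + 1, 10):        # B = 10, 20, ...
    runs = []
    for _ in range(100):                                  # repeat to reduce random effects
        bag = rng.choice(num_detectors, size=bag_size, replace=False)
        runs.append(ap[bag].max(axis=0).mean())           # best detector per topic, then mean
    print(bag_size, np.mean(runs))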
Influence of lexicon size
TRECVID 2005
Linear increase for first 60 detectors
► Size matters
✓ A lexicon of 150 detectors comes close to maximum performance
► Some detectors perform well for specific topics
✓ Tennis game detector for “find two visible tennis players”
► A substantial number of detectors is not accurate enough yet
✓ Only a small increase when more than 70 detectors are used
► How to combine multiple detectors?
✓ Experiment: pair-wise oracle fusion
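A minimal sketch of pair-wise oracle fusion for one topic; combining a pair by averaging its scores is an assumption, and the scores and ground truth are placeholders.

# Sketch: pair-wise oracle fusion of detector scores.
import numpy as np
from itertools import combinations
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
scores = rng.random((20, 1000))                 # 20 detectors x 1000 shots (placeholder)
relevant = rng.random(1000) < 0.05              # placeholder ground truth for one topic

# Oracle: try every detector pair, fuse by averaging, keep the pair with the best AP.
best_ap, best_pair = max(
    (average_precision_score(relevant, (scores[i] + scores[j]) / 2), (i, j))
    for i, j in combinations(range(len(scores)), 2)
)
print(best_pair, best_ap)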
Influence of detector combination
Office
Computer
− Improvement for 20 out of 24 topics
− Increase per topic as high as 89%
− Overall increase 10%
Find shots of George Bush entering or leaving a vehicle (e.g., car, van, airplane, helicopter, etc.) (he and vehicle both visible at the same time)
Iyad Allawi + rocket propelled grenades?
Problem statement
Find shots of an office setting
► How to translate a query topic to concept detectors?
Detector selection strategies
With B. Huurnink / M. de Rijke, L. Hollink / G. Schreiber, M. Worring
Find shots of an office setting
[Diagram: a video query is translated into detectors via text matching, ontology querying, and semantic visual querying; data flow conventions distinguish multimedia raw data from the ontology]
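A minimal sketch of a hypothetical text matching strategy (not the authors' system): select detectors whose names lexically overlap with the topic.

# Sketch: hypothetical text matching between a query topic and detector names.
def select_detectors(topic, detector_names):
    # crude normalization: lowercase, strip punctuation, drop trailing 's'
    words = {w.strip(".,;()").rstrip("s") for w in topic.lower().split()}
    return [name for name in detector_names
            if any(w.rstrip("s") in words for w in name.lower().split())]

topic = ("Find shots of an office setting, i.e., one or more desks/tables "
         "and one or more computers and one or more people")
lexicon = ["Office", "Computer", "Grass", "Building", "Emile Lahoud"]
print(select_detectors(topic, lexicon))      # ['Office', 'Computer']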
Influence of detector selection combination
► Individual selection strategies seem comparable
✓ But an oracle combination of selection strategies pays off!
TRECVID 2005
[Example for the topic “Find shots of an office setting, i.e., one or more desks/tables and one or more computers and one or more people”: the text matching, ontology querying, and visual querying strategies select detectors such as Office, Building, Grass, Emile Lahoud, and Computer]
TRECVID interactive search task
► So many choices for retrieval…
✓ Why not let the user decide interactively?
http://trecvid.nist.gov/
‘Classic’ Informedia system
References:
Carnegie Mellon University
► First multimodal video search engine