
Visual Event Recognition in Videos by Learning from Web Data

Lixin Duan, Dong Xu, Member, IEEE, Ivor Wai-Hung Tsang, and Jiebo Luo, Fellow, IEEE

Abstract—We propose a visual event recognition framework for consumer videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of events, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., web video domain and consumer video domain). For each pyramid level and each type of local features, we first train a set of SVM classifiers based on the combined training set from two domains by using multiple base kernels from different kernel types and parameters, which are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or all the event classes by minimizing both the structural risk functional and the mismatch between data distributions of two domains. Extensive experiments demonstrate the effectiveness of our proposed framework, which requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation on various aspects of the proposed method A-MKL, such as the analysis of the combination coefficients of the prelearned classifiers, the convergence of the learning algorithm, and the performance variation when using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all the event classes leads to better performance when compared with A-MKL using the prelearned classifiers only from each individual event class.

Index Terms—Event recognition, transfer learning, domain adaptation, cross-domain learning, adaptive MKL, aligned space-time pyramid matching.


1 INTRODUCTION

IN recent years, digital cameras and mobile phone cameras have become popular in our daily life. Consequently, there is an increasingly urgent demand for indexing and retrieving a large amount of unconstrained consumer videos. In particular, visual event recognition in consumer videos has attracted growing attention. However, this is an extremely challenging computer vision task due to two main issues. First, consumer videos are generally captured by amateurs using hand-held cameras of unstaged events and thus contain considerable camera motion, occlusion, cluttered background, and large intraclass variations within the same type of events, making their visual cues highly variable and thus less discriminant. Second, these users are generally reluctant to annotate many consumer videos, posing a great challenge to traditional video event recognition techniques, which often cannot learn robust classifiers from a limited number of labeled training videos.

While a large number of video event recognition techniques have been proposed (see Section 2 for more details), few of them [5], [16], [17], [28], [30] focused on event recognition in the highly unconstrained consumer video domain. Loui et al. [30] developed a consumer video data set which was manually labeled for 25 concepts including activities, occasions, static concepts like scenes and objects, as well as sounds. Based on this data set, Chang et al. [5] developed a multimodal consumer video classification system by using visual features and audio features. In the web video domain, Liu et al. [28] employed strategies inspired by PageRank to effectively integrate both motion features and static features for action recognition in YouTube videos. In [16], action models were first learned from loosely labeled web images and then used for identifying human actions in YouTube videos. However, the work in [16] cannot distinguish actions like "sitting_down" and "standing_up" because it did not utilize temporal information in its image-based model. Recently, Ikizler-Cinbis and Sclaroff [17] proposed employing multiple instance learning to integrate multiple features of the people, objects, and scenes for action recognition in YouTube videos.

Most event recognition methods [5], [25], [28], [32], [41], [43], [49] follow the conventional framework. First, a sufficiently large corpus of training data is collected in which the concept labels are generally obtained through expensive human annotation. Next, robust classifiers (also called models or concept detectors) are learned from the training data. Finally, the classifiers are used to detect the presence of the events in any test data.

L. Duan, D. Xu, and I.W.-H. Tsang are with the School of Computer Engineering, Nanyang Technological University, N4-02a-29, Nanyang Avenue, Singapore 639798. E-mail: {S080003, DongXu, IvorTsang}@ntu.edu.sg.
J. Luo is with the Department of Computer Science, University of Rochester, CSB 611, Rochester, NY 14627. E-mail: jluo@cs.rochester.edu.
Manuscript received 12 Dec. 2010; revised 19 July 2011; accepted 26 Sept. 2011; published online 26 Sept. 2011.
Recommended for acceptance by T. Darrell, D. Hogg, and D. Jacobs.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMISI-2010-12-0945.
Digital Object Identifier no. 10.1109/TPAMI.2011.265.


When sufficient and strong labeled training samples are provided, these event recognition methods have achieved promising results. However, for visual event recognition in consumer videos, it is time consuming and expensive for users to annotate a large number of consumer videos. It is also well known that classifiers learned from a limited number of labeled training samples are usually not robust and do not generalize well.

In this paper, we propose a new event recognition framework for consumer videos by leveraging a large amount of loosely labeled YouTube videos. Our work is based on the observation that a large amount of loosely labeled YouTube videos can be readily obtained by using keyword (also called tag) based search. However, the quality of YouTube videos is generally lower than that of consumer videos because YouTube videos are often downsampled and compressed by the web server. In addition, YouTube videos may have been selected and edited to attract attention, while consumer videos are in their naturally captured state. In Fig. 1, we show four frames from two events (i.e., "picnic" and "sports") as examples to illustrate the considerable appearance differences between consumer videos and YouTube videos. Clearly, the visual feature distributions of samples from the two domains (i.e., web video domain and consumer video domain) can change considerably in terms of their statistical properties (such as mean, intraclass, and interclass variance).

Our proposed framework is shown in Fig. 2 and consists of two contributions. First, we extend the recent work on pyramid matching [13], [25], [26], [48], [49] and present a new matching method, called Aligned Space-Time Pyramid Matching (ASTPM), to effectively measure the distances between two video clips that may be from different domains. Specifically, we divide each video clip into space-time volumes over multiple levels. We calculate the pairwise distances between any two volumes and further integrate the information from different volumes with integer-flow Earth Mover's Distance (EMD) to explicitly align the volumes. In contrast to the fixed volume-to-volume matching used in [25], the space-time volumes of two videos across different space-time locations can be matched using our ASTPM method, making it better at coping with the large intraclass variations within the same type of events (e.g., moving objects in consumer videos can appear at different space-time locations, and the background within two different videos, even captured from the same scene, may be shifted due to considerable camera motion).

The second is our main contribution. In order to cope with the considerable variation between feature distributions of videos from the web video domain and consumer video domain, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL). Specifically, we first obtain one prelearned classifier for each event class at each pyramid level and with each type of local feature, in which existing kernel methods (e.g., SVM) can be readily employed. In this work, we adopt the prelearned average classifier obtained by equally fusing a set of SVM classifiers that are prelearned based on a combined training set from two domains by using multiple base kernels from different kernel types and parameters. For each event class, we then learn an adapted classifier based on multiple base kernels and the prelearned average classifiers from this event class or all event classes by minimizing both the structural risk functional and the mismatch between data distributions of two domains. It is noteworthy that the utilization of the prelearned average classifiers from all event classes in A-MKL is based on the observation that some events may share common motion patterns [47]. For example, the videos from some events (such as "birthday," "picnic," and "wedding") usually contain a number of people talking with each other. Therefore, it is beneficial to learn an adapted classifier for "birthday" by leveraging the prelearned classifiers from "picnic" and "wedding."

Fig. 1. Four sample frames from consumer videos and YouTube videos. Our work aims to recognize the events in consumer videos by using a limited number of labeled consumer videos and a large number of YouTube videos. The examples from two events (i.e., "picnic" and "sports") illustrate the considerable appearance differences between consumer videos and YouTube videos, which pose great challenges to conventional learning schemes but can be effectively handled by our transfer learning method A-MKL.

Fig. 2. The flowchart of the proposed visual event recognition framework. It consists of an aligned space-time pyramid matching method that effectively measures the distances between two video clips and a transfer learning method that effectively copes with the considerable variation in feature distributions between the web videos and consumer videos.


The remainder of this paper is organized as follows: Section 2 will provide a brief review of event recognition. The proposed methods ASTPM and A-MKL will be introduced in Sections 3 and 4, respectively. Extensive experimental results will be presented in Section 5, followed by conclusions and future work in Section 6.

2 RELATED WORK ON EVENT RECOGNITION

Event recognition methods can be roughly categorized into model-based methods and appearance-based techniques. Model-based approaches relied on various models, including HMM [35], coupled HMM [3], and Dynamic Bayesian Network [33], to model the temporal evolution. The relationships among different body parts and regions are also modeled in [3], [35], in which object tracking needs to be conducted first before model learning.

Appearance-based approaches employed space-time (ST) features extracted from volumetric regions that can be densely sampled or from salient regions with significant local variations in both spatial and temporal dimensions [24], [32], [41]. In [19], Ke et al. employed boosting to learn a cascade of filters based on space-time features for efficient visual event detection. Laptev and Lindeberg [24] extended the ideas of Harris interest point operators, and Dollár et al. [7] employed separable linear filters to detect the salient volumetric regions. Statistical learning methods, including SVM [41] and probabilistic Latent Semantic Analysis (pLSA) [32], were then applied by using the aforementioned space-time features to obtain the final classification. Recently, Kovashka and Grauman [20] proposed a new feature formation technique by exploiting multilevel vocabularies of space-time neighborhoods. Promising results [12], [20], [27], [32], [41] have been reported on video data sets under controlled conditions, such as the Weizmann [12] and KTH [41] data sets. Interested readers may refer to [45] for a recent survey.

Recently, researchers proposed new methods to address the more challenging event recognition task on video data sets captured under much less controlled conditions, including movies [25], [43] and broadcast news videos [49]. In [25], Laptev et al. integrated local space-time features (i.e., Histograms of Oriented Gradient (HOG) and Histograms of Optical Flow (HOF)), space-time pyramid matching, and SVM for action classification in movies. In order to locate the actions in movies, a new discriminative clustering algorithm [11] was developed based on the weakly labeled training data that can be readily obtained from movie scripts without any cost of manual annotation. Sun et al. [43] employed Multiple Kernel Learning (MKL) to efficiently fuse three types of features, including a so-called SIFT average descriptor and two trajectory-based features. To recognize events in diverse broadcast news videos, Xu and Chang [49] proposed a multilevel temporal matching algorithm for measuring video similarity.

However, all these methods followed the conventional learning framework by assuming that the training and test samples are from the same domain and feature distribution. When the total number of labeled training samples is limited, the performances of these methods would be poor. In contrast, the goal of our work is to propose an effective event recognition framework for consumer videos by leveraging a large amount of loosely labeled web videos, where we must deal with the distribution mismatch between videos from two domains (i.e., web video domain and consumer video domain). As a result, our algorithm can learn a robust classifier for event recognition requiring only a small number of labeled consumer videos.

3 ALIGNED SPACE-TIME PYRAMID MATCHING

Recently, pyramid matching algorithms were proposed for different applications, such as object recognition, scene classification, and event recognition in movies and news videos [13], [25], [26], [48], [49]. These methods involved pyramidal binning in different domains (e.g., feature, spatial, or temporal domain), and improved performances were reported by fusing the information from multiple pyramid levels. Spatial pyramid matching [26] and its space-time extension [25] used fixed block-to-block matching and fixed volume-to-volume matching (we refer to it as unaligned space-time matching), respectively. In contrast, our proposed Aligned Space-Time Pyramid Matching extends the methods of Spatially Aligned Pyramid Matching (SAPM) [48] and Temporally Aligned Pyramid Matching (TAPM) [49] from either the spatial domain or the temporal domain to the joint space-time domain, where the volumes across different space and time locations can be matched.

Similarly to [25], we divide each video clip into 8^l nonoverlapping space-time volumes over multiple levels, l = 0, ..., L-1, where the volume size is set as 1/2^l of the original video in width, height, and temporal dimension. Fig. 3 illustrates the partitions of two videos V_i and V_j at level-1. Following [25], we extract the local space-time (ST) features, including HOG and HOF, which are further concatenated together to form lengthy feature vectors. We also sample each video clip to extract image frames and then extract static local SIFT features [31] from them.
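As a rough illustration of the level-l partition (a minimal sketch with assumed inputs and names, not the authors' implementation), local features carrying (x, y, t) coordinates can be binned into the 8^l volumes as follows; each per-volume descriptor set is then quantized into the token-frequency features described next.

```python
import numpy as np

def assign_to_volumes(positions, video_shape, level):
    """Assign local features to the (2^level)^3 space-time volumes of one video.

    positions:   (n, 3) array of (x, y, t) coordinates of the local features.
    video_shape: (width, height, num_frames) of the clip.
    Returns an integer volume index in [0, 8^level) for every feature.
    """
    splits = 2 ** level
    # Normalize coordinates to [0, 1) and bin them along each dimension.
    norm = positions / np.asarray(video_shape, dtype=float)
    bins = np.minimum((norm * splits).astype(int), splits - 1)
    return bins[:, 0] * splits * splits + bins[:, 1] * splits + bins[:, 2]
```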

Our method consists of two matching stages. In the first matching stage, we calculate the pairwise distance D_rc between any two space-time volumes V_i(r) and V_j(c), where r, c = 1, ..., R, with R being the total number of volumes in a video. The space-time features are vector-quantized into visual words and then each space-time volume is represented as a token-frequency feature. As suggested in [25], we use the χ² distance to measure the distance D_rc. Noting that each space-time volume consists of a set of image blocks, we also extract token-frequency features from each image block by vector-quantizing the corresponding SIFT features into visual words. Based on the token-frequency features, as suggested in [49], the pairwise distance D_rc between two volumes V_i(r) and V_j(c) is calculated by using EMD [39] as follows:

D_{rc} = \frac{\sum_{u=1}^{H}\sum_{v=1}^{I} \hat{f}_{uv} d_{uv}}{\sum_{u=1}^{H}\sum_{v=1}^{I} \hat{f}_{uv}},

where H and I are the numbers of image blocks in V_i(r) and V_j(c), respectively, d_uv is the distance between two image blocks (the euclidean distance is used in this work), and \hat{f}_{uv} is the optimal flow that can be obtained by solving the following linear programming problem:

Trang 4

\hat{f}_{uv} = \arg\min_{f_{uv} \ge 0} \sum_{u=1}^{H}\sum_{v=1}^{I} f_{uv} d_{uv},
\quad \text{s.t.} \;\; \sum_{u=1}^{H}\sum_{v=1}^{I} f_{uv} = 1, \;\;
\sum_{v=1}^{I} f_{uv} \le \frac{1}{H}, \;\forall u, \;\;
\sum_{u=1}^{H} f_{uv} \le \frac{1}{I}, \;\forall v.
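The optimal flow above is a small linear program. The following Python sketch (an illustration only, not the authors' code; the block distance matrix and variable names are assumptions) computes the EMD-based volume distance D_rc with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def volume_emd(d):
    """EMD-style distance D_rc between two volumes.

    d: (H, I) matrix of pairwise euclidean distances between the
       token-frequency features of image blocks in V_i(r) and V_j(c).
    """
    H, I = d.shape
    c = d.ravel()                      # objective: sum_uv f_uv * d_uv

    # Equality constraint: total flow sums to 1.
    A_eq = np.ones((1, H * I))
    b_eq = np.array([1.0])

    # Inequality constraints: each row flow <= 1/H, each column flow <= 1/I.
    A_ub = np.zeros((H + I, H * I))
    for u in range(H):
        A_ub[u, u * I:(u + 1) * I] = 1.0
    for v in range(I):
        A_ub[H + v, v::I] = 1.0
    b_ub = np.concatenate([np.full(H, 1.0 / H), np.full(I, 1.0 / I)])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    f = res.x
    return float(f @ c) / float(f.sum())   # denominator equals 1 by the equality constraint
```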

In the second stage, we further integrate the information from different volumes by using integer-flow EMD to explicitly align the volumes. We solve for a flow matrix \hat{F}_{rc} containing binary elements which represent unique matches between volumes V_i(r) and V_j(c). As suggested in [48], [49], such a binary solution can be conveniently computed by using the standard Simplex method for linear programming, as stated in the following theorem:

Theorem 1 ([18]). The linear programming problem

\hat{F}_{rc} = \arg\min_{F_{rc}} \sum_{r=1}^{R}\sum_{c=1}^{R} F_{rc} D_{rc},
\quad \text{s.t.} \;\; \sum_{c=1}^{R} F_{rc} = 1, \;\forall r, \;\;
\sum_{r=1}^{R} F_{rc} = 1, \;\forall c,

will always have an integer optimal solution when solved by using the Simplex method.

Fig. 3 illustrates the matching results of two videos after using our ASTPM method, indicating the reasonable matching between similar scenes (i.e., the crowds, the playground, and the Jumbotron TV screens in the two videos). It is also worth mentioning that our ASTPM method can preserve the space-time proximity relations between volumes from two videos at level-1 when using the ST or SIFT features. Specifically, the ST features (respectively, SIFT features) in one volume can only be matched to the ST features (respectively, SIFT features) within another volume at level-1 in our ASTPM method, rather than to arbitrary ST features (respectively, SIFT features) within the entire video as in the classical bag-of-words model (e.g., ASTPM at level-0).

Finally, the distance D_l(V_i, V_j) between two video clips V_i and V_j at level-l can be directly calculated by

D_l(V_i, V_j) = \frac{\sum_{r=1}^{R}\sum_{c=1}^{R} \hat{F}_{rc} D_{rc}}{\sum_{r=1}^{R}\sum_{c=1}^{R} \hat{F}_{rc}}.
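Because the constraints in Theorem 1 force the optimal integer flow to be a permutation matrix, the second matching stage can equivalently be obtained as a linear assignment; the sketch below uses scipy.optimize.linear_sum_assignment for illustration (the paper itself solves the linear program with the Simplex method, so this is an assumed equivalent route, not the authors' code).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aligned_level_distance(D):
    """Second matching stage of ASTPM at one pyramid level.

    D: (R, R) matrix of volume-to-volume distances D_rc from the first stage.
    Returns D_l(V_i, V_j), the average distance over the matched volume pairs.
    """
    rows, cols = linear_sum_assignment(D)    # optimal one-to-one volume matching
    F = np.zeros_like(D)
    F[rows, cols] = 1.0
    return float((F * D).sum() / F.sum())    # mean of the matched D_rc values
```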

In the next section, we will propose a new transfer learning method to fuse the information from multiple pyramid levels and different types of features.

4 ADAPTIVE MULTIPLE KERNEL LEARNING

Following the terminology from prior literature, we refer to the web video domain as the auxiliary domain D^A (a.k.a. source domain) and the consumer video domain as the target domain D^T = D^T_l ∪ D^T_u, where D^T_l and D^T_u represent the labeled and unlabeled data in the target domain, respectively. In this work, we denote I_n as the n × n identity matrix and 0_n, 1_n ∈ R^n as the n × 1 column vectors of all zeros and all ones, respectively. The inequality a = [a_1, ..., a_n]' ≥ 0_n means that a_i ≥ 0 for i = 1, ..., n. Moreover, the element-wise product between vectors a and b is defined as a ∘ b = [a_1 b_1, ..., a_n b_n]'.

4.1 Brief Review of Related Learning Work

Transfer learning (a.k.a. domain adaptation or cross-domain learning) methods have been proposed for many applications [6], [8], [9], [29], [50]. To take advantage of all labeled patterns from both auxiliary and target domains, Daumé [6] proposed Feature Replication (FR) by using augmented features for SVM training. In Adaptive SVM (A-SVM) [50], the target classifier f^T(x) is adapted from an existing classifier f^A(x) (referred to as the auxiliary classifier) trained on the samples from the auxiliary domain. Specifically, the target decision function is defined as follows:

f^T(x) = f^A(x) + Δf(x),    (1)

where Δf(x) is called a perturbation function that is learned by using the labeled data from the target domain only (i.e., D^T_l). While A-SVM can also employ multiple auxiliary classifiers, these auxiliary classifiers are fused with predefined weights to obtain f^A(x) [50]. Moreover, the target classifier f^T(x) is learned based on only one kernel. Recently, Duan et al. [8] proposed Domain Transfer SVM (DTSVM) to simultaneously reduce the mismatch between the distributions of two domains and learn a target decision function.

Fig. 3. Illustration of the proposed Aligned Space-Time Pyramid Matching method at level-1: (a) Each video is divided into eight space-time volumes along the width, height, and temporal dimensions. (b) The matching results are obtained by using our ASTPM method. Each pair of matched volumes from two videos is highlighted in the same color. For better visualization, please see the colored PDF file.


The mismatch was measured by the Maximum Mean Discrepancy (MMD) [2] based on the distance between the means of the samples, respectively, from the auxiliary domain D^A and the target domain D^T in a Reproducing Kernel Hilbert Space (RKHS) spanned by a kernel function k, namely,

DIST_k(D^A, D^T) = \left\| \frac{1}{n_A}\sum_{i=1}^{n_A} \varphi(x_i^A) - \frac{1}{n_T}\sum_{i=1}^{n_T} \varphi(x_i^T) \right\|_{\mathcal{H}},    (2)

where the x_i^A's and x_i^T's are the samples from the auxiliary and target domains, respectively, and the kernel function k is induced from the nonlinear feature mapping function φ(·), i.e., k(x_i, x_j) = φ(x_i)'φ(x_j). We define a column vector s with N = n_A + n_T entries, in which the first n_A entries are set as 1/n_A and the remaining entries are set as -1/n_T. With the above notions, the square of the MMD in (2) can be simplified as follows [2], [8]:

DIST_k^2(D^A, D^T) = \mathrm{tr}(KS),    (3)

where tr(KS) represents the trace of KS, S = ss' ∈ R^{N×N}, and

K = \begin{bmatrix} K_{A,A} & K_{A,T} \\ K_{A,T}' & K_{T,T} \end{bmatrix} \in \mathbb{R}^{N \times N},

with K_{A,A} ∈ R^{n_A×n_A}, K_{T,T} ∈ R^{n_T×n_T}, and K_{A,T} ∈ R^{n_A×n_T} being the kernel matrices defined for the auxiliary domain, the target domain, and the cross-domain from the auxiliary domain to the target domain, respectively.
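As a concrete illustration of (3), the following short Python sketch (not the authors' code; the function name and inputs are assumptions) computes the squared MMD from a precomputed kernel matrix over the auxiliary samples followed by the target samples. The same routine is reused later to form the vector h in (6), one value per base kernel.

```python
import numpy as np

def squared_mmd(K, n_A, n_T):
    """Squared MMD tr(KS) = s' K s for a precomputed kernel matrix.

    K: (N, N) kernel matrix over the n_A auxiliary samples followed by
       the n_T target samples (N = n_A + n_T).
    """
    s = np.concatenate([np.full(n_A, 1.0 / n_A), np.full(n_T, -1.0 / n_T)])
    return float(s @ K @ s)      # tr(K ss') = s' K s
```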

4.2 Formulation of A-MKL

Motivated by A-SVM [50] and DTSVM [8], we propose a new transfer learning method to learn a target classifier adapted from a set of prelearned classifiers as well as a perturbation function that is based on multiple base kernels k_m. The prelearned classifiers are used as a prior for learning a robust adapted target classifier. In A-MKL, existing machine learning methods (e.g., SVM, FR, and so on) using different types of features (e.g., SIFT and ST features) can be readily used to obtain the prelearned classifiers. Moreover, in contrast to A-SVM [50], which uses predefined weights to combine the prelearned auxiliary classifiers, we learn the linear combination coefficients β_p|_{p=1}^P of the prelearned classifiers f_p(x)|_{p=1}^P in this work, where P is the total number of prelearned classifiers. Specifically, we use the average classifiers from one event class or all the event classes as the prelearned classifiers (see Sections 5.3 and 5.6 for more details). We additionally employ multiple predefined kernels to model the perturbation function in this work, because the utilization of multiple base kernels k_m instead of a single kernel can further enhance the interpretability of the decision function and improve performances [23]. We refer to our transfer learning method based on multiple base kernels as A-MKL because it can handle the distribution mismatch between the web video domain and the consumer video domain.

Following the traditional MKL assumption [23], the kernel function k is represented as a linear combination of multiple base kernels k_m as follows:

k = \sum_{m=1}^{M} d_m k_m,    (4)

where the d_m's are the linear combination coefficients with d_m ≥ 0 and Σ_{m=1}^{M} d_m = 1; each base kernel function k_m is induced from the nonlinear feature mapping function φ_m(·), i.e., k_m(x_i, x_j) = φ_m(x_i)'φ_m(x_j), and M is the total number of base kernels. Inspired by semiparametric SVM [42], we define the target decision function on any sample x as follows:

f^T(x) = \sum_{p=1}^{P} \beta_p f_p(x) + \underbrace{\sum_{m=1}^{M} d_m w_m' \varphi_m(x) + b}_{\Delta f(x)},    (5)

where Δf(x) = Σ_{m=1}^{M} d_m w_m'φ_m(x) + b is the perturbation function with b as the bias term. Note that multiple base kernels are employed in Δf(x).

As in [8], we employ the MMD criterion to reduce the mismatch between the data distributions of the two domains in this work. Let us define the linear combination coefficient vector as d = [d_1, ..., d_M]' and the feasible set of d as M = {d ∈ R^M | 1_M'd = 1, d ≥ 0_M}. With (4), (3) can be rewritten as

DIST_k^2(D^A, D^T) = \Omega(d) = h'd,    (6)

where h = [tr(K_1 S), ..., tr(K_M S)]' and K_m = [φ_m(x_i)'φ_m(x_j)] ∈ R^{N×N} is the mth base kernel matrix defined on the samples from both the auxiliary and target domains. Let us denote the labeled training samples from both the auxiliary and target domains (i.e., D^A ∪ D^T_l) as (x_i, y_i)|_{i=1}^n, where n is the total number of labeled training samples from the two domains. The optimization problem in A-MKL is then formulated as follows:

\min_{d \in \mathcal{M}} G(d) = \frac{1}{2}\Omega^2(d) + \theta J(d),    (7)

where

J(d) = \min_{w_m, \beta, b, \xi_i} \frac{1}{2}\left(\sum_{m=1}^{M} d_m \|w_m\|^2 + \lambda\|\beta\|^2\right) + C\sum_{i=1}^{n} \xi_i,
\quad \text{s.t.} \;\; y_i f^T(x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0,    (8)

β = [β_1, ..., β_P]' is the vector of the β_p's, the ξ_i's are slack variables, and λ, C > 0 are the regularization parameters. Denote w̃_m = [w_m', √λ β']' and φ̃_m(x_i) = [φ_m(x_i)', (1/√λ) f(x_i)']', where f(x_i) = [f_1(x_i), ..., f_P(x_i)]'. The optimization problem in (8) can then be rewritten as follows:

J(d) = \min_{\tilde{w}_m, b, \xi_i} \frac{1}{2}\sum_{m=1}^{M} d_m \|\tilde{w}_m\|^2 + C\sum_{i=1}^{n} \xi_i,
\quad \text{s.t.} \;\; y_i\left(\sum_{m=1}^{M} d_m \tilde{w}_m' \tilde{\varphi}_m(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0.    (9)

By defining ṽ_m = d_m w̃_m, we rewrite the optimization problem in (9) as a quadratic programming (QP) problem [37]:

J(d) = \min_{\tilde{v}_m, b, \xi_i} \frac{1}{2}\sum_{m=1}^{M} \frac{\|\tilde{v}_m\|^2}{d_m} + C\sum_{i=1}^{n} \xi_i,
\quad \text{s.t.} \;\; y_i\left(\sum_{m=1}^{M} \tilde{v}_m' \tilde{\varphi}_m(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0.    (10)

Theorem 2 ([8], [37]). The optimization problem in (7) is jointly convex with respect to d, ṽ_m, b, and ξ_i.


Proof. Note that the first term ½Ω²(d) of G(d) in (7) is a quadratic term with respect to d, and the other terms in (10) are linear except the term ½Σ_{m=1}^{M} ‖ṽ_m‖²/d_m. As shown in [37], this term is also jointly convex with respect to d and ṽ_m. Therefore, the optimization problem in (7) is jointly convex with respect to d, ṽ_m, b, and ξ_i.

With Theorem 2, the objective in (7) can reach its global minimum. By introducing the Lagrangian multipliers α = [α_1, ..., α_n]', we solve the dual form of the optimization problem in (10) as follows:

J(d) = \max_{\alpha \in \mathcal{A}} \; 1_n'\alpha - \frac{1}{2}(\alpha \circ y)'\left(\sum_{m=1}^{M} d_m \tilde{K}_m\right)(\alpha \circ y),    (11)

where y = [y_1, ..., y_n]' is the label vector of the training samples, A = {α ∈ R^n | α'y = 0, 0_n ≤ α ≤ C1_n} is the feasible set of the dual variable α, K̃_m = [φ̃_m(x_i)'φ̃_m(x_j)] ∈ R^{n×n} is defined by the labeled training data from both domains, and φ̃_m(x_i)'φ̃_m(x_j) = φ_m(x_i)'φ_m(x_j) + (1/λ) f(x_i)'f(x_j). Recall that f(x) is a vector of the predictions on x from the prelearned classifiers f_p, which resembles the label information of x and can be used to construct the idealized kernel [22]. Thus, the new kernel matrix K̃_m can be viewed as the integration of both the visual information (i.e., from K_m) and the label information, which can lead to better discriminative power. Surprisingly, the optimization problem in (11) is in the same form as the dual of SVM with the kernel matrix Σ_{m=1}^{M} d_m K̃_m. Thus, the optimization problem can be solved by existing SVM solvers such as LIBSVM [4].
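Since (11) is an SVM dual with the composite kernel Σ_m d_m K̃_m, the inner problem can be handled by any off-the-shelf SVM solver. The sketch below is illustrative only (function and variable names are assumptions); it builds the idealized kernels K̃_m = K_m + (1/λ) f(x_i)'f(x_j) and fits an SVM with a precomputed kernel using scikit-learn rather than LIBSVM directly.

```python
import numpy as np
from sklearn.svm import SVC

def build_tilde_kernels(base_kernels, F_pred, lam):
    """K~_m = K_m + (1/lambda) * f(x_i)' f(x_j) for every base kernel.

    base_kernels: list of (n, n) base kernel matrices K_m on the labeled data.
    F_pred: (n, P) matrix whose ith row is f(x_i), the prelearned
            classifiers' outputs on labeled sample x_i.
    """
    idealized = F_pred @ F_pred.T / lam
    return [Km + idealized for Km in base_kernels]

def solve_inner_svm(tilde_kernels, d, y, C=1.0):
    """Solve the SVM dual (11) for fixed kernel combination coefficients d."""
    K = sum(dm * Km for dm, Km in zip(d, tilde_kernels))
    svm = SVC(C=C, kernel="precomputed")
    svm.fit(K, y)
    return svm
```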

4.3 Learning Algorithm of A-MKL

In this work, we employ the reduced gradient descent procedure proposed in [37] to iteratively update the linear combination coefficients d and the dual variables α in (7).

Updating the dual variables α. Given the linear combination coefficients d, we solve the optimization problem in (11) to obtain the dual variables α by using LIBSVM [4].

Updating the linear combination coefficients d. Suppose the dual variables α are fixed. With respect to d, the objective function G(d) in (7) becomes

G(d) = \frac{1}{2} d'hh'd + \theta\left(1_n'\alpha - \frac{1}{2}(\alpha \circ y)'\left(\sum_{m=1}^{M} d_m \tilde{K}_m\right)(\alpha \circ y)\right)
     = \frac{1}{2} d'hh'd - \theta q'd + \mathrm{const},    (12)

where q = [½(α∘y)'K_1(α∘y), ..., ½(α∘y)'K_M(α∘y)]' and the last term is a constant that is irrelevant to d, namely, const = θ(1_n'α − (1/(2λ)) Σ_{i,j=1}^{n} α_i α_j y_i y_j f(x_i)'f(x_j)).

We adopt the second-order gradient descent method to update the linear combination coefficients d at iteration t+1 by

d_{t+1} = d_t - \eta_t g_t,    (13)

where η_t is the learning rate, which can be obtained by using a standard line search method [37], g_t = (∇²_t G)^{-1} ∇_t G is the updating direction, and ∇_t G = hh'd_t − θq and ∇²_t G = hh' are the first-order and second-order derivatives of G in (12) with respect to d at the tth iteration, respectively. Note that hh' is not of full rank, and therefore we replace hh' by hh' + εI_M to avoid numerical instability, where ε is set as 10^{-5} in the experiments. Then, the updating function (13) can be rewritten as follows:

d_{t+1} = (1 - \eta_t) d_t + \eta_t d_t^{new},    (14)

where d_t^{new} = (hh' + εI_M)^{-1} θq. Note that by replacing hh' with hh' + εI_M, the solution to ∇_t G = hh'd_t − θq = 0_M becomes d_t^{new}. Given d_t ∈ M, we project d_t^{new} onto the feasible set M to ensure d_{t+1} ∈ M as well.

The whole optimization procedure is summarized in Algorithm 1.¹ We terminate the iterative updating procedure once the objective in (7) converges (i.e., its change falls below the threshold of 10^{-5}) or the number of iterations reaches T_max, where T_max = 15 in the experiments.

Algorithm 1. Adaptive Multiple Kernel Learning
1: Input: labeled training samples (x_i, y_i)|_{i=1}^n, prelearned classifiers f_p(x)|_{p=1}^P, and predefined base kernel functions k_m|_{m=1}^M
2: Initialization: t ← 1 and d_t ← (1/M) 1_M
3: Solve for the dual variables α_t in (11) by using SVM
4: While t < T_max Do
5:   q_t ← [½(α_t∘y)'K_1(α_t∘y), ..., ½(α_t∘y)'K_M(α_t∘y)]'
6:   d_t^new ← (hh' + εI_M)^{-1} θ q_t and project d_t^new onto the feasible set M
7:   Update the base kernel combination coefficients d_{t+1} by using (14) with a standard line search
8:   Solve for the dual variables α_{t+1} in (11) by using SVM
9:   If |G(d_{t+1}) − G(d_t)| is below the convergence threshold, Break
10:  t ← t + 1
11: End While
12: Output: d_t and α_t
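A compact Python sketch of this alternating procedure is given below. It reuses build_tilde_kernels, solve_inner_svm, and squared_mmd from the earlier sketches, uses a fixed step size in place of the line search, and treats all parameter defaults and variable names as assumptions; it illustrates Algorithm 1 rather than reproducing the released implementation.

```python
import numpy as np

def project_simplex(d):
    """Euclidean projection of d onto {d : d >= 0, sum(d) = 1}."""
    u = np.sort(d)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(d) + 1) > (css - 1))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(d - tau, 0.0)

def amkl_train(base_kernels, F_pred, base_kernels_all, y, n_A_all, n_T_all,
               lam=1e-5, theta=20.0, C=1.0, eps=1e-5, T_max=15, eta=0.5):
    """Sketch of Algorithm 1 (A-MKL) with a fixed step size eta.

    base_kernels:     list of (n, n) kernels K_m on the labeled training data.
    F_pred:           (n, P) prelearned-classifier outputs f(x_i).
    base_kernels_all: list of (N, N) kernels on ALL samples (labeled and
                      unlabeled) from both domains, used for the MMD term.
    """
    M = len(base_kernels)
    tilde_kernels = build_tilde_kernels(base_kernels, F_pred, lam)
    d = np.full(M, 1.0 / M)
    # h in (6): squared MMD of every base kernel over all samples.
    h = np.array([squared_mmd(Km, n_A_all, n_T_all) for Km in base_kernels_all])
    svm = solve_inner_svm(tilde_kernels, d, y, C)
    for _ in range(T_max):
        ay = np.zeros(len(y))
        ay[svm.support_] = svm.dual_coef_.ravel()      # alpha_i * y_i
        q = np.array([0.5 * ay @ Km @ ay for Km in base_kernels])
        d_new = project_simplex(
            np.linalg.solve(np.outer(h, h) + eps * np.eye(M), theta * q))
        d = (1.0 - eta) * d + eta * d_new              # update (14)
        svm = solve_inner_svm(tilde_kernels, d, y, C)
    return d, svm
```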

Note that by setting the derivative of the Lagrangian obtained from (9) with respect to w̃_m to zero, we obtain w̃_m = Σ_{i=1}^{n} α_i y_i φ̃_m(x_i). Recall that √λ β and (1/√λ) f(x_i) are the last P entries of w̃_m and φ̃_m(x_i), respectively. Therefore, the linear combination coefficients β of the prelearned classifiers can be obtained as follows:

β = \frac{1}{\lambda} \sum_{i=1}^{n} \alpha_i y_i f(x_i).

With the optimal dual variables α and linear combination coefficients d, the target decision function (5) of our method A-MKL can be rewritten as follows:

f^T(x) = \sum_{i=1}^{n} \alpha_i y_i \left( \sum_{m=1}^{M} d_m k_m(x_i, x) + \frac{1}{\lambda} f(x_i)'f(x) \right) + b.
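Continuing the earlier sketches, prediction on a test video then only needs the base kernel values between the test sample and the labeled training samples plus the prelearned classifiers' outputs; the names below are illustrative assumptions, and the result coincides with calling the fitted SVM's decision_function on the composite test kernel.

```python
import numpy as np

def amkl_decision(svm, d, k_test, f_train, f_test, lam=1e-5):
    """Evaluate f^T(x) for one test sample.

    d:       (M,) learned kernel combination coefficients.
    k_test:  (M, n) array with k_test[m, i] = k_m(x_i, x) for labeled x_i.
    f_train: (n, P) prelearned-classifier outputs on the labeled samples.
    f_test:  (P,) prelearned-classifier outputs on the test sample.
    """
    ay = np.zeros(f_train.shape[0])
    ay[svm.support_] = svm.dual_coef_.ravel()          # alpha_i * y_i
    combined = d @ k_test + f_train @ f_test / lam     # per-sample kernel term
    return float(ay @ combined + svm.intercept_[0])    # plus the bias b
```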

¹ The source code can be downloaded from our project webpage: http://vc.sce.ntu.edu.sg/index_files/VisualEventRecognition/VisualEventRecognition.html.

4.4 Differences from Related Learning Work

A-SVM [50] assumes that the target classifier f^T(x) is adapted from existing auxiliary classifiers f^A_p(x). However, our proposed method A-MKL is different from A-SVM in several aspects:


1. In A-SVM, the auxiliary classifiers are learned by using only the training samples from the auxiliary domain. In contrast, the prelearned classifiers used in A-MKL can be learned by using the training samples either from the auxiliary domain or from both domains.

2. In A-SVM, the auxiliary classifiers are fused with predefined weights γ_p in the target classifier, i.e., f^T(x) = Σ_{p=1}^{P} γ_p f^A_p(x) + Δf(x). In contrast, A-MKL learns the optimal combination coefficients β_p in (5).

3. In A-SVM, the perturbation function Δf(x) is based on one single kernel, i.e., Δf(x) = w'φ(x) + b. However, in A-MKL, the perturbation function Δf(x) = Σ_{m=1}^{M} d_m w_m'φ_m(x) + b in (5) is based on multiple kernels, and the optimal kernel combination is automatically determined during the learning process.

4. A-SVM cannot utilize the unlabeled data in the target domain. On the contrary, the valuable unlabeled data in the target domain are used in the MMD criterion of A-MKL for measuring the data distribution mismatch between the two domains.

Our work is also different from the prior work DTSVM [8], where the target decision function f^T(x) = Σ_{m=1}^{M} d_m w_m'φ_m(x) + b is only based on multiple base kernels. In contrast, in A-MKL, we use a set of prelearned classifiers f_p(x) as the parametric functions and model the perturbation function Δf(x) based on multiple base kernels in order to better fit the target decision function. To fuse the multiple prelearned classifiers, we also learn the optimal linear combination coefficients β_p. As shown in the experiments, our A-MKL is more robust in real applications by utilizing optimally combined classifiers as the prior.

MKL methods [23], [37] utilize training data and test data drawn from the same domain. When they come from different distributions, MKL methods may fail to learn the optimal kernel, which would degrade the classification performance in the target domain. On the contrary, A-MKL can better make use of the data from two domains to improve the classification performance.

5 EXPERIMENTS

In this section, we first evaluate the effectiveness of the proposed method ASTPM. We then compare our proposed method A-MKL with the baseline SVM and three existing transfer learning algorithms: FR [6], A-SVM [50], and DTSVM [8], as well as an MKL method discussed in [8]. We also analyze the learned combination coefficients β_p of the prelearned classifiers, illustrate the convergence of the learning algorithm of A-MKL, and investigate the performance variations of A-MKL using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all event classes is better than A-MKL using the prelearned classifiers from one event class.

For all methods, we train one-versus-all classifiers with a fixed regularization parameter C = 1. For performance evaluation, we use the noninterpolated Average Precision (AP) as in [25], [49], which corresponds to the multipoint average precision value of a precision-recall curve and incorporates the effect of recall. Mean Average Precision (MAP) is the mean of APs over all the event classes.
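For reference, a minimal Python sketch of the noninterpolated AP used here (one value per event; MAP is the mean over events) could look as follows. It is only an illustration and is not tied to the authors' evaluation script; it assumes at least one positive test sample per event.

```python
import numpy as np

def average_precision(scores, labels):
    """Noninterpolated AP: mean of the precision values at each positive hit."""
    order = np.argsort(-np.asarray(scores))        # rank by decreasing score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precisions = hits / np.arange(1, len(labels) + 1)
    return float(precisions[labels > 0].mean())    # average over the positives
```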

5.1 Data Set Description and Features

In our data set, part of the consumer videos are derived (under a usage agreement) from the Kodak Consumer Video Benchmark Data Set [30], which was collected by Kodak from about 100 real users over the period of one year. There are 1,358 consumer video clips in the Kodak data set. A second part of the Kodak data set contains web videos from YouTube collected using keyword-based search. After removing TV commercial videos and low-quality videos, there are 1,873 YouTube video clips in total. An ontology of 25 semantic concepts was defined, and keyframe-based annotation was performed by students at Columbia University to assign binary labels (presence or absence) of each visual concept for both sets of videos (see [30] for more details).

In this work, six events, "birthday," "picnic," "parade," "show," "sports," and "wedding," are chosen for the experiments. We additionally collected new consumer video clips from real users on our own. Similarly to [30], we also downloaded new YouTube videos from the website. Moreover, we also annotated the consumer videos to determine whether a specific event occurred by asking an annotator, who is not involved in the algorithmic design, to watch each video clip rather than just look at the key frames as done in [30]. For video clips in the Kodak consumer data set [30], only the video clips receiving positive labels in their keyframe-based annotation are reexamined. We do not additionally annotate the YouTube videos² collected by ourselves and Kodak because in a real scenario we can only obtain loosely labeled YouTube videos and cannot use any further manual annotation. It should be clear that our consumer video set comes from two sources, the Kodak consumer video data set and our additional collection of personal videos, and our web video set is likewise a combined set of YouTube videos.

We confirm that the quality of YouTube videos is much lower than that of consumer videos directly collected from real users. Therefore, our data set is quite challenging for transfer learning algorithms. The total numbers of consumer videos and YouTube videos are 195 and 906, respectively. Note that our data set is a single-label data set, i.e., each video belongs to only one event.

In real-world applications, the labeled samples in the target domain (i.e., consumer video domain) are usually much fewer than those in the auxiliary domain (i.e., web video domain). In this work, all 906 loosely labeled YouTube videos are used as labeled training data in the auxiliary domain. We randomly sample three consumer videos from each event (18 videos in total) as the labeled training videos in the target domain, and the remaining videos in the target domain are used as the test data. We sample the labeled target training videos five times and report the means and standard deviations of MAPs or per-event APs for each method.

² The annotator felt that at least 20 percent of the YouTube videos are incorrectly labeled after checking the video clips.


For all the videos in the data sets, we extract two types of features. The first is the local ST feature [25], in which a 72D HOG and a 90D HOF are extracted by using the online tool³ and then concatenated together to form a 162D feature vector. We also sample each video clip at a rate of 2 frames per second to extract image frames from each video clip (we have 65 frames per video on average). For each frame, we extract 128D SIFT features from salient regions, which are detected by the Difference-of-Gaussian (DoG) interest point detector [31]. On average, we have 1,385 ST features and 4,144 SIFT features per video. Then, we build visual vocabularies by using k-means to group the ST features and SIFT features into 1,000 and 2,500 clusters, respectively.
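A small sketch of the vocabulary-building and quantization step is shown below; apart from the stated vocabulary sizes, the clustering settings and function names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words):
    """Cluster local descriptors (e.g., ST or SIFT) into visual words."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

def token_frequency(vocab, descriptors):
    """Normalized token-frequency histogram for one volume or image block."""
    words = vocab.predict(descriptors)              # assign each descriptor to a word
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```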

5.2 Aligned Space-Time Pyramid Matching versus Unaligned Space-Time Pyramid Matching (USTPM)

We compare our proposed Aligned Space-Time Pyramid Matching (ASTPM) discussed in Section 3 with the fixed volume-to-volume matching method used in [25], referred to as the Unaligned Space-Time Pyramid Matching (USTPM) method. In [25], the space-time volumes of one video clip are matched with the volumes of the other video at the same spatial and temporal locations at each level. In other words, the second matching stage based on integer-flow EMD is not applied, and the distance between two video clips is equal to the sum of the diagonal elements of the distance matrix, i.e., Σ_{r=1}^{R} D_rr. For computational efficiency, we set the total number of levels L = 2 in this work. Therefore, we have two ways of partition, in which one video clip is divided into 1 × 1 × 1 and 2 × 2 × 2 space-time volumes, respectively.

We use the baseline SVM classifier learned by using the combined training data set from the two domains. We test the performances with four types of kernels: the Gaussian kernel (i.e., K(i, j) = exp(−γ D²(V_i, V_j))), the Laplacian kernel (i.e., K(i, j) = exp(−√γ D(V_i, V_j))), the inverse square distance (ISD) kernel (i.e., K(i, j) = 1/(γ D²(V_i, V_j) + 1)), and the inverse distance (ID) kernel (i.e., K(i, j) = 1/(√γ D(V_i, V_j) + 1)), where D(V_i, V_j) represents the distance between videos V_i and V_j, and γ is the kernel parameter. We use the default kernel parameter γ = γ_0 = 1/A, where A is the mean value of the square distances between all training samples, as suggested in [25].
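Given a precomputed distance matrix between videos, the four kernels and the default parameter γ_0 can be built as in the sketch below (a direct transcription of the definitions above; the function name is an assumption).

```python
import numpy as np

def build_kernels(D, gamma=None):
    """Four kernel matrices from an (n, n) video distance matrix D."""
    if gamma is None:
        gamma = 1.0 / np.mean(D ** 2)               # default gamma_0 = 1/A
    return {
        "gaussian":  np.exp(-gamma * D ** 2),
        "laplacian": np.exp(-np.sqrt(gamma) * D),
        "isd":       1.0 / (gamma * D ** 2 + 1.0),
        "id":        1.0 / (np.sqrt(gamma) * D + 1.0),
    }
```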

Tables 1 and 2 show the MAPs of the baseline SVM over the six events for SIFT and ST features at different levels, according to different types of kernels with the default kernel parameter. Based on the means of MAPs, we have the following three observations: 1) In all cases, the results at level-1 using aligned matching are better than those at level-0 based on SIFT features, which demonstrates the effectiveness of the space-time partition and is also consistent with the findings for prior pyramid matching methods [25], [26], [48], [49]. 2) At level-1, our proposed ASTPM outperforms the USTPM used in [25], thanks to the additional alignment of space-time volumes. 3) The results from space-time features are not as good as those from static SIFT features. As also reported in [15], a possible explanation is that the extracted ST features may fall on cluttered backgrounds because the consumer videos are generally captured by amateurs with hand-held cameras.

5.3 Performance Comparisons of Transfer Learning Methods

We compare our method A-MKL with other methods, including the baseline SVM, FR, A-SVM, MKL, and DTSVM. For the baseline SVM, we report the results of SVM_AT and SVM_T, in which the labeled training samples are from the two domains (i.e., the auxiliary domain and the target domain) and only from the target domain, respectively. Specifically, the aforementioned four types of kernels (i.e., the Gaussian kernel, Laplacian kernel, ISD kernel, and ID kernel) are adopted. Note that in our initial conference version [10] of this paper, we demonstrated that A-MKL outperforms other methods by setting the kernel parameter as γ = 2^l γ_0, where l ∈ L = {−6, −4, ..., 2}. In this work, we test A-MKL by using another set of kernel parameters, i.e., L = {−3, −2, ..., 1}. Note that the total number of base kernels is 16|L|, from two pyramid levels, two types of local features, four types of kernels, and |L| kernel parameters, where |L| is the cardinality of L. All methods are compared in three cases: a) classifiers learned based on SIFT features, b) classifiers learned based on ST features, and c) classifiers learned based on both SIFT and ST features.

For both SVM_AT and FR (respectively, SVM_T), we train 4|L| independent classifiers with the corresponding 4|L| base kernels for each pyramid level and each type of local features using the training samples from the two domains (respectively, the training samples from the target domain). We further fuse the 4|L| independent classifiers with equal weights to obtain the average classifier f_l^SIFT or f_l^ST, where l = 0 and 1. For SVM_T, SVM_AT, and FR, the final classifier is obtained by fusing the average classifiers with equal weights (e.g., ½(f_0^SIFT + f_1^SIFT) for case a, ½(f_0^ST + f_1^ST) for case b, and ¼(f_0^SIFT + f_1^SIFT + f_0^ST + f_1^ST) for case c). For A-SVM, we learn 4|L| independent auxiliary classifiers for each pyramid level and each type of local features using the training data from the auxiliary domain and the corresponding 4|L| base kernels.

3 http://www.irisa.fr/vista/Equipe/People/Laptev/download.html.

TABLE 1. Means and Standard Deviations (Percent) of MAPs over Six Events at Different Levels Using SVM with the Default Kernel Parameter for SIFT Features

TABLE 2. Means and Standard Deviations (Percent) of MAPs over Six Events at Different Levels Using SVM with the Default Kernel Parameter for ST Features


We then independently learn four adapted target classifiers from two pyramid levels and two types of features by using the labeled training data from the target domain, based on the Gaussian kernel with the default kernel parameter [50]. Similarly to SVM_T, SVM_AT, and FR, the final A-SVM classifier is obtained by fusing two (respectively, four) adapted target classifiers for cases a and b (respectively, case c). For MKL and DTSVM, we simultaneously learn the linear combination coefficients of 8|L| base kernels (for cases a and b) or 16|L| base kernels (for case c) by using the combined training samples from both domains. Recall that for our method A-MKL, we make use of prelearned classifiers as well as multiple base kernels (see (5) in Section 4.2). In the experiment, we consider each average classifier as one prelearned classifier and learn the target decision function of A-MKL based on the two average classifiers f_l^SIFT|_{l=0}^1 or f_l^ST|_{l=0}^1 for cases a and b (respectively, all four average classifiers for case c), as well as 8|L| base kernels based on SIFT or ST features for cases a and b (respectively, 16|L| base kernels based on both types of features for case c). For A-MKL, we empirically fix λ = 10^{-5} and set θ = 20 for all three cases. Considering that DTSVM and A-MKL can take advantage of both labeled and unlabeled data by using the MMD criterion to measure the mismatch in data distributions between the two domains, we use the semi-supervised setting in this work. More specifically, all the samples (including the test samples) from the target domain and the auxiliary domain are used to calculate h in (6). Note that all test samples are used as unlabeled data during the learning process.
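As an illustration of how one prelearned average classifier f_l^SIFT or f_l^ST is formed in this setup (a sketch under assumed data structures, not the authors' code), the decision values of the 4|L| per-kernel SVMs can be fused with equal weights:

```python
import numpy as np
from sklearn.svm import SVC

def train_average_classifier(kernels_train, y, C=1.0):
    """Train one SVM per base kernel and fuse them with equal weights.

    kernels_train: list of (n, n) precomputed kernel matrices (4|L| of them
                   for one pyramid level and one feature type).
    Returns a function mapping test kernel matrices to averaged decision values.
    """
    svms = [SVC(C=C, kernel="precomputed").fit(K, y) for K in kernels_train]

    def average_decision(kernels_test):
        # kernels_test: list of (n_test, n) test-vs-train kernel matrices,
        # in the same order as kernels_train.
        scores = [clf.decision_function(K) for clf, K in zip(svms, kernels_test)]
        return np.mean(scores, axis=0)

    return average_decision
```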

Table 3 reports the means and standard deviations of MAPs over all six events in the three cases for all methods. From Table 3, we have the following observations based on the means of MAPs:

1. The best result of SVM_T is worse than that of SVM_AT, which demonstrates that SVM classifiers learned from a limited number of training samples from the target domain are not robust. We also observe that SVM_T is always better than SVM_AT for cases b and c. A possible explanation is that the ST features of video samples from the auxiliary and target domains are sparsely distributed in the ST feature space, which makes the ST feature less robust, and thus it is more likely that the data from the auxiliary domain may degrade the event recognition performance in the target domain for cases b and c.

2. In this application, A-SVM achieves the worst results in cases a and c in terms of the mean of MAPs, possibly because the limited number of labeled training samples (e.g., three positive samples per event) in the target domain is not sufficient for A-SVM to robustly learn an adapted target classifier, which is based on only one kernel.

3. DTSVM is generally better than MKL in terms of the mean of MAPs. This is consistent with [8].

4. For all methods, the MAPs based on SIFT features are better than those based on ST features. In practice, the simple ensemble method SVM_AT achieves good performances when only using the SIFT features in case a. This indicates that SIFT features are more effective for event recognition in consumer videos. However, the MAPs of SVM_AT, FR, and A-SVM in case c are much worse compared with case a, which suggests that simple late fusion with equal weights is not robust for integrating strong features and weak features. In contrast, for DTSVM and our method A-MKL, the results in case c are improved by learning optimal linear combination coefficients to effectively fuse the two types of features.

5. For each of the three cases, our proposed method A-MKL achieves the best performance by effectively fusing the average classifiers (from two pyramid levels and two types of local features) and multiple base kernels, as well as by reducing the mismatch in the data distributions between the two domains. We also believe that the utilization of multiple base kernels and prelearned average classifiers can cope well with YouTube videos with noisy labels. In Table 3, compared with the best means of MAPs of SVM_T (42.32 percent), SVM_AT (53.93 percent), FR (49.98 percent), A-SVM (38.42 percent), MKL (47.19 percent), and DTSVM (53.78 percent), the relative improvements of our best result (58.20 percent) are 37.52, 7.92, 16.54, 51.48, 23.33, and 8.22 percent, respectively.

In Fig. 4, we plot the means and standard deviations of the per-event APs for all methods. Our method achieves the best performances in three out of six events in case c, and some concepts enjoy large performance gains according to the means of per-event APs; e.g., the AP of "parade" significantly increases from 65.96 percent (DTSVM) to 75.21 percent (A-MKL).

5.4 Analysis on the Combination Coefficients β_p of the Prelearned Classifiers

Recall that we learn the linear combination coefficients β_p of the prelearned classifiers f_p in A-MKL, and the absolute value of each β_p reflects the importance of the corresponding prelearned classifier. Specifically, the larger |β_p| is, the more f_p contributes to the target decision function. For better representation, let us denote the corresponding average classifiers f_0^SIFT, f_1^SIFT, f_0^ST, and f_1^ST as f_1, f_2, f_3, and f_4, respectively.

Taking one round of training/test data split in the target domain as an example, we plot the combination coefficients β_p of the four prelearned classifiers f_p for all events in Fig. 5. In this experiment, we again set L = {−3, −2, ..., 1}.

TABLE 3 Means and Standard Deviations (Percent) of MAPs over Six Events for All Methods in Three Cases


We observe that the absolute values of β_1 and β_2 are always much larger than those of β_3 and β_4, which shows that the prelearned classifiers (i.e., f_1 and f_2) based on SIFT features play dominant roles among all the prelearned classifiers. This is not surprising because SIFT features are much more robust than ST features, as demonstrated in Section 5.3. From Fig. 5, we also observe that the values of β_3 and β_4 are generally not close to zero, which demonstrates that A-MKL can further improve the event recognition performance by effectively integrating strong and weak features. Recall that A-MKL using both types of features outperforms A-MKL with only SIFT features (see Table 3). We have similar observations for the other rounds of experiments.

5.5 Convergence of the A-MKL Learning Algorithm

Recall that we iteratively update the dual variables α and the linear combination coefficients d in A-MKL (see Section 4.3). We take one round of training/test data split as an example to discuss the convergence of the iterative algorithm of A-MKL, in which we again set L = {−3, −2, ..., 1} and use both types of features. In Fig. 6, we plot the change of the objective value of A-MKL with respect to the number of iterations. We observe that A-MKL converges after about eight iterations for all events. We have similar observations for the other rounds of experiments.

5.6 Utilization of Additional Prelearned Classifiers from Other Event Classes

In the previous experiments, for a specific event class, we only utilize the prelearned classifiers (i.e., the average classifiers f_l^SIFT|_{l=0}^1 and f_l^ST|_{l=0}^1) from this event class. As a general learning method, A-MKL can readily incorporate additional prelearned classifiers. In our event recognition application, we observe that some events may share common motion patterns [47]. For example, the videos from some events (like "birthday," "picnic," and "wedding") usually contain a number of people talking with each other. Thus, it is beneficial to learn an adapted classifier for "birthday" by leveraging the prelearned classifiers from "picnic" and "wedding."

Fig. 4. Means and standard deviations of per-event APs of six events for all methods.

Fig. 5. Illustration of the combination coefficients β_p of the prelearned classifiers for all events.

Fig. 6. Illustration of the convergence of the A-MKL learning algorithm for all events.
