
Advances in Imaging and Electron Physics, Volume 187



Peter W. Hawkes
CEMES-CNRS
Toulouse, France


CEMES-CNRS, Toulouse, France

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Academic Press is an imprint of Elsevier


Cover photo credit:

Ahmed Elgammal, Homeomorphic Manifold Analysis (HMA): Untangling Complex Manifolds. Advances in Imaging and Electron Physics (2015) 187, pp. 1–82

Academic Press is an imprint of Elsevier

125 London Wall, London EC2Y 5AS, UK

525 B Street, Suite 1800, San Diego, CA 92101-4495, USA

225 Wyman Street, Waltham, MA 02451, USA

The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK

First edition 2015

Copyright © 2015 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data

A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-802255-9

ISSN: 1076-5670

For information on all Academic Press publications visit our website at http://store.elsevier.com/


The first of the two chapters that make up this volume deals with spin-polarized scanning electron microscopy, a technique that is not new but is today of the highest interest. Teruo Kohashi has been using this approach for more than 20 years, and his chapter is therefore an authoritative account of the subject. He first explains the principle behind spin-polarization detection for the study of magnetic domains. He then describes at length the instrumental aspects. The chapter concludes with a wide range of applications. This lucid and knowledgeable text will surely be much appreciated.

In the second chapter, Ahmed Elgammal explores a very different topic: many problems in computer vision, and almost all tasks in human vision, involve analysis of image data in high-dimensional spaces. The human case is very striking, for we are often able to recognize objects whatever the viewpoint, scale, lighting, and orientation. The process that originally generated the image with which the computer or human is confronted is, however, frequently governed by a relatively small number of variables, and the data are often assumed to lie on a low-dimensional manifold. In this account of the subject, A. Elgammal first surveys the problems arising from these vision tasks and then presents homeomorphic manifold analysis in detail. This long chapter forms a monograph on the subject and will, I am sure, be of great value to readers in this active area of research.

As always, I thank the authors for the trouble they have taken to make their subjects understandable by readers from other subject areas.

Peter Hawkes


Structure and microscopy of quasicrystals

M Berz, P.M Duxbury, K Makino and C.-Y Ruan (Vol 190)

Femtosecond electron imaging and spectroscopy

C. Bobisch and R. Möller

Ballistic electron microscopy

Reflective electron beam lithography

N Chandra and R Ghosh

Quantum entanglement in electron optics

A Cornejo Rodriguez and F Granados Agustin

Ronchigram quantification

N de Jonge, Ed (Vol 189)


A.R Faruqi, G McMullan and R Henderson (Vol 190)

Direct detectors

M Ferroni

Transmission microscopy in the scanning electron microscope

R.G Forbes

Liquid metal ion sources

P Gai and E.D Boyes

Aberration-corrected environmental electron microscopy

J Grotemeyer and T Muskat (Vol 189)

Time-of-flight mass spectrometry

V.S Gurov, A.O Saulebekov and A.A Trubitsyn

Analytical, approximate analytical and numerical methods for the design of energy analyzers

M Haschke

Micro-XRF excitation in the scanning electron microscope

R Herring and B McMorran

Electron vortex beams

M.S Isaacson

Early STEM development

K Ishizuka

Contrast transfer and crystal images

K. Jensen, D. Shiffler and J. Luginsland

Physics of field emission cold cathodes


Ultrafast electron microscopy

D Paganin, T Gureyev and K Pavlov

Intensity-linear methods in inverse imaging

M Pap (Vol 189)

A special voice transform, analytic wavelets and Zernike functions

N Papamarkos and A Kesidis

The inverse Hough transform

Q Ramasse and R Brydson

The SuperSTEM laboratory

B Rieger and A.J Koster

Image formation in cryo-electron microscopy

P Rocca and M Donelli

Imaging of dielectric objects

J Rodenburg

Lensless imaging

J Rouse, H.-n Liu and E Munro

The role of differential algebra in electron optics

J Sanchez

Fisher vector encoding for the classification of natural images

P Santi

Light sheet fluorescence microscopy

C.J.R Sheppard, S.S Kou and J Lin (Vol 189)

The Hankel transform in n-dimensions, and its applications in optical propagation and imaging

R Shimizu, T Ikuta and Y Takai

Defocus image modulation processing in real time

T Soma

Focus-deflection systems and their applications

I.F Spivak-Lavrov

Analytical methods of calculation and simulation of new schemes of static and time-of-flight mass spectrometers


P Sussner and M.E Valle

Fuzzy morphological associative memories


6 Applications of Homeomorphism on 1-D Manifolds 30

6.3 A Multifactor Model for Facial Expression Analysis 41

* This work was funded by NSF award IIS-0328991 and NSF CAREER award IIS-0546372.

Advances in Imaging and Electron Physics, Volume 187
ISSN 1076-5670
http://dx.doi.org/10.1016/bs.aiep.2014.12.002
© 2015 Elsevier Inc. All rights reserved.


6.3.1 Facial Expression Synthesis and Recognition 42

7 Applications of Homeomorphism on 2-D Manifolds 44
7.1 The Topology of the Joint Configuration-viewpoint Manifold 46

7.5 Generalization to the Full-View Sphere 51

8.1 Learning Configuration-viewpoint and Shape Manifolds 62

8.3 Simultaneous Tracking on the Three Manifolds Using Particle Filtering 65
8.4 Examples: Pose and View Estimation from General Motion Manifolds 66


achieving a representation of the data that apprehends the intrinsic dimensionality of the underlying variables and degrees of freedom in the data. Learning image manifolds has been shown to be quite useful in recognition, in such situations as learning appearance manifolds from different views for object recognition (Murase & Nayar, 1995). Linear subspace methods, such as principal component analysis (PCA) (Jolliffe, 1986), provide a way to discover the fundamental modes of variations in the data, hence representing the data in the span of a small number of bases. PCA is the foundation for many traditional computer vision algorithms, such as active shape models […] with the perceptual modes of variations in the data. However, there is no guarantee of that, since the basic formulation aims at finding a subspace that best retains the variance of the data. Supervised subspace methods, such as linear discriminant analysis (LDA), provide a way to discover the underlying subspace of the data that maximizes class separation for the task of classification. Many variants of the basic subspace learning methods have been proposed, aiming at achieving better low-dimensional representations with varying objectives. Bilinear (Tenenbaum & Freeman, 2000) and multilinear (Vasilescu & Terzopoulos, 2002) methods were also suggested to model subspaces of orthogonal modes of variations that exist in the data.

The introduction of nonlinear dimensionality reduction techniques, such as local linear embedding (LLE) (Roweis & Saul, 2000), isometric feature mapping (Isomap) (Tenenbaum, 1998), and others, provided tools to represent manifolds in low-dimensional Euclidean embedding spaces. Traditional manifold learning approaches are unsupervised, and the goal is to find a low-dimensional embedding of the data that preserves the local manifold geometry. Some manifold learning techniques use supervision, in terms of class labels, to achieve better discriminative embeddings of the data. However, in practice, away from simple examples, it is hardly the case that the various orthogonal perceptual aspects can be shown to correspond to certain directions or clusters in the obtained embedding spaces.

Why is learning image manifolds difficult? Consider a simple example of images of a rigid object. Images are the result of a complex image formation process that involves several variables including, but not limited to, relative object-camera pose, illumination geometry, surface reflectance, and digitization process. More variables are introduced and the process becomes even more complex when we consider images of different objects, or images of articulated objects, different backgrounds, visual occlusion, clutter, etc. Depending on the task, some of these variables (probably one or two) are


important, and the rest of them are deemed to be nuisance variables. However, all these variables affect the geometry of the images as points in the image space. Any assumption about the image manifold structure has to deal with these variables collectively. Even if we simplify the problem to the case of translation or rotation of a simple two-dimensional (2-D) object in the image space, as pointed out by Donoho and Grimes (2005), the resulting image manifold is not going to be smooth or differentiable because of the existence of edges in objects, which cause discontinuities in the image space. Local smoothness and differentiability are basic assumptions behind the theory of Riemannian manifolds. As was shown by Donoho and Grimes (2005), the basic local manifold isometry assumption, which is the most basic definition of a manifold, is invalid when dealing with real images. This, of course, is related to the image space representation being used; certain image representations would provide easier ways to study image manifolds than others. Besides the aforementioned fundamental difficulties, in many real-world applications, the available images, despite being numerous, do not necessarily provide dense sampling of the underlying manifold of the interesting variables. Instead, plenty of images would provide dense sampling of the nuisance variables.

The approach in this chapter is based on learning the visual manifold in a way that utilizes our knowledge about the basic processes that generate the data, and the expected sources of variations on these data. The approach mainly utilizes the concept of homeomorphism between the manifolds of different instances, which collectively constitute the data. For manifolds with known topology, manifold learning can be formulated differently from the traditional way, which focuses just on achieving a low-dimensional embedding of the whole data. Manifold learning, then, is the task of learning a mapping from and to a topological structure to and from the data, where that topological structure is homeomorphic to the data. By "known topology," I do not mean knowing the topology of the whole data, but rather knowing the topology of the basic instances that constitute the data, such as the topology of the motion or the viewpoint manifold.

This chapter presents the theory and applications of the concept of homeomorphic manifold analysis (HMA). Given a set of topologically equivalent manifolds, HMA models the variation in their geometries in the space of functions that map between a topologically equivalent common representation and each of them. HMA is based on decomposing the "style" parameters of manifolds in the space of nonlinear functions that map between a unified embedded representation of the content manifold and style-dependent visual observations.
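The contrast drawn above between linear subspace methods and intrinsically nonlinear manifolds can be made concrete with a small sketch. This is my own toy illustration, not code from the chapter: points on an intrinsically one-dimensional manifold (a circle) are linearly embedded in a 20-dimensional "image" space, and PCA (computed here via SVD) needs two linear components to account for the variance of that 1-D curve.

```python
import numpy as np

# Toy illustration (not from the chapter): a 1-D manifold (circle)
# embedded in a 20-D input space.  PCA is linear, so it needs two
# basis vectors to span this intrinsically 1-D curve.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)                  # intrinsic coordinate
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
A = rng.normal(size=(2, 20))                            # random linear embedding
X = circle @ A + 0.01 * rng.normal(size=(500, 20))      # noisy observations

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)                         # variance ratios

print(np.round(explained[:3], 3))  # first two components carry nearly all variance
```

The discovered subspace is two-dimensional even though one variable (the angle) generates the data, which is the gap that the nonlinear manifold learning methods above aim to close.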


This chapter argues that this setting is suitable to different problems in visual learning, focusing in particular on the applications of the framework to modeling the manifold of human motion in the image space. To this end, I show how the framework can be utilized to learn the visual manifold for periodic human motions, such as gait motion, as well as nonperiodic motions, such as facial expressions. I also show how to approach complex manifolds with several variations due to factors such as viewpoint and personal body style. I also show that the HMA framework is suitable for modeling the object-view manifold.

There are several advantages of the HMA framework, which are highlighted here and will become more clear in the discussion within the appropriate context. The HMA framework does not assume smoothness or differentiability of the image manifold, or local isometry. The framework yields a generative model of image manifolds, where images are generated through a function of several relevant variables. Therefore, factorization of the sources of variability is a key of the framework. Low-dimensional manifold representations of each of these variables are utilized to generate the complex image manifold. Factorizing complex data to separate low-dimensional manifold representations facilitates an efficient solution to several problems, such as body-posture estimation, viewpoint estimation, tracking, activity recognition, and capturing biometric data.

This HMA framework was introduced by Elgammal and Lee (2004b) as a way to separate style and content on manifolds. The framework was applied and has been extensively validated over the last decade in the context of human motion analysis in different settings, including locomotion ([…] 2009), facial expression (Lee & Elgammal, 2005a), complex motion such as ballet dancing (Lee & Elgammal, 2007, 2010a), and others. The framework was also used to recover the image translation and rotation manifold for tracking (Elgammal, 2005). The framework was also successfully used to model the object-view manifold in the context of multiview object recognition and pose estimation in recent years (Zhang et al., 2013; Bakry & Elgammal, 2014).

The structure of this chapter is as follows. Section 2 describes the intuition behind the HMA framework with two motivating case examples, other than a biological motivation. Section 3 gives an overview of the mathematical framework, highlights the challenges, and paves the way for the following sections, which contain the details about learning the model and using it in inference. Details on learning the model are presented in section 4. Section 5 shows several methods to perform inference on the model to solve for the different variables governing the image formation process. Several applications of the model for the case of one-dimensional (1-D) manifolds are presented in section 6, with applications to gait and facial expression analysis. Modeling 2-D manifolds is described in section 7 within the context of modeling the joint configuration-viewpoint manifold. Section 8 describes using the framework to model complex human motions. Finally, section 9 details the connection between the framework and different related works in the literature.

2 MOTIVATING SCENARIOS

2.1 Case Example I: Modeling the View-Object Manifold

Consider collections of images from any of the following cases or combinations of them: (1) instances of different object classes; (2) instances of an object class (within-class variations); (3) different views of an object. The shape and appearance of an object in a given image is a function of its category, style within category, viewpoint, and several other factors. The visual manifold given all these variables collectively is impossible to model. Let us first simplify the problem. Let us assume that the object is detected in the training images (so there is no 2-D translation or in-plane rotation manifold). Let us also assume that we are dealing with rigid objects, and ignore the illumination variations (using an illumination-invariant feature representation). Basically, we are left with variations due to category, within category, and viewpoint; i.e., we are dealing with a combined view-object manifold. We will set aside some of these assumptions later in the discussion.

The aim here is to learn a factorized model (or class of models) that can parameterize each of these factors of variability independently. The shape and appearance of an object instance in an image is considered to be a function of several latent parameterizing variables: category, style within class, and object viewpoint. Given a test image and the learned model(s), such a model is supposed to be used to make simultaneous inferences about the different latent variables. Obviously, learning a latent variable model and using it in inference is not a novel idea. It is quite challenging to make inferences in a high-dimensional parameter space, and even more challenging to do so in multiple spaces. Therefore, it is essential that the learned model would represent each latent variable in a separate low-dimensional representation,

invariant of other factors (untangled), to facilitate efficient inference. Moreover, the model should explicitly exploit the manifold structure of each latent variable.

The underlying principle in this framework is that multiple views of an object lie on an intrinsically low-dimensional manifold (view manifold) in the input space. The view manifolds of different objects are distributed in that input space. To recover the category and pose of a test image, we need to know which manifold this image belongs to and what the intrinsic coordinate of that image is within that manifold. This basic view of object recognition and pose estimation is not new; it was used in the seminal work of […] to achieve linear dimensionality reduction of the visual data, and the manifolds of different objects were represented as parameterized curves in the embedding space. However, dimensionality reduction techniques, whether linear or nonlinear, will only project the data to a lower dimension and will not be able to achieve the desired untangled representation.

The main challenge is how to achieve an untangled representation of the visual manifold. The key is to utilize the low dimensionality and known topology of the view manifold of individual objects. To explain the point, let us consider the simple case where the different views are obtained from a viewing circle (e.g., a camera looking at an object on a turntable). The view manifold of each object in this case is a 1-D closed manifold embedded in the input space. However, that simple closed curve deforms in the input space as a function of the object geometry and appearance. The visual manifold can be degenerate – for example, imaging a textureless sphere from different views results in the same image; i.e., the view manifold in this case is degenerate to a single point.

Ignoring degeneracy, the view manifolds of all objects share the same topology but differ in geometry, and they are all homeomorphic to each other. Therefore, capturing and parameterizing the deformation of a given object's view manifold gives fundamental information about the object category and within category. The deformation space of these view manifolds captures a view-invariant signature of objects, and analyzing such a space provides a novel way to tackle the categorization and within-class parameterization. Therefore, a fundamental aspect to untangle the complex object-view manifold is to use view-manifold deformation as an invariant for categorization and modeling the within-class variations. If the views are obtained from a full or part of the view-sphere around the object, the resulting visual manifold should be a deformed sphere as well. In general, the dimensionality of the

view manifold of an object is bounded by the dimensionality of the viewing manifold (degrees of freedom imposed by the camera-object relative pose). Figure 1 illustrates the framework for untangling the object-view manifold by factorizing the deformation of individual objects' view manifolds in a view-invariant space, which can be the basis for recognition (Zhang et al., 2013; Bakry & Elgammal, 2014).

2.2 Case Example II: Modeling the Visual Manifold of Biological Motion

Let us consider the case of a biological motion: human motion. Concerning an articulated motion observed from a camera (stationary or moving), such a motion can be represented as a kinematic sequence Z1:T = z1, …, zT and observed as an observation sequence Y1:T = y1, …, yT. With an accurate 3-D body model, camera calibration, and geometric transformation information, Y1:T can be explained as a projection of an articulated model. However, in this chapter, I am interested in a different interpretation of the relation between the observations and the kinematics that does not involve any body model.

[Figure 1: Manifold deformation space]


For illustration, let us consider the observed motion, in the form of shape, for a gait motion. The silhouette (occluding contour) of a human walking or performing a gesture is an example of a dynamic shape, where the shape deforms over time based on the action being performed. These deformations are restricted by the physical body and the temporal constraints posed by the action being performed. Given the spatial and temporal constraints, these silhouettes, as points in a high-dimensional visual input space, are expected to lie on a low-dimensional manifold. Intuitively, the gait is a 1-D manifold that is embedded in a high-dimensional visual space. Such a manifold twists in the high-dimensional visual space. Figure 2(a) shows an embedding of the visual gait manifold in a three-dimensional (3-D) embedding space (Elgammal & Lee, 2004a). Similarly, the appearance of a face with

Figure 2 Homeomorphism of gait manifolds (Elgammal & Lee, 2004a). Visualization of gait manifolds from different viewpoints of a walker obtained using LLE embedding. (a) Embedded gait manifold for a side view of the walker. Sample frames from a walking cycle along the manifold with the frame numbers shown to indicate the order. A total of 10 walking cycles are shown (300 frames). (b) Embedded gait manifold from kinematic data (joint angle position through the walking cycles). (c) Embedded gait manifolds from five different viewpoints of the walker (Elgammal & Lee, 2004a, © IEEE). (See color plate.)


expressions is an example of a dynamic appearance that lies on a low-dimensional manifold in the visual input space.

In general, not only for the case of periodic motions such as gait, despite the high dimensionality of the body configuration space, many human motions intrinsically lie on low-dimensional manifolds. This is true for the kinematics of the body (the kinematic manifold), as well as for the observed motion through image sequences (the visual manifold). Therefore, the dynamic sequence Z1:T lies on a manifold called the kinematic manifold. The kinematic manifold is the manifold of body configuration changes in the kinematic space. In addition, the observations lie on a manifold, known as the visual manifold. Although the intrinsic body configuration manifold might be very low in dimensionality, the resulting visual manifold (in terms of shape, appearance, or both) is challenging to model, given the various aspects that affect the appearance. Examples of such aspects include the body type (slim, big, tall, etc.) of the person performing the motion, clothing, viewpoint, and illumination. Such variability makes the task of learning a visual manifold very challenging because we are dealing with data points that lie on multiple manifolds at the same time: body configuration manifold, viewpoint manifold, body shape manifold, illumination manifold, etc. However, the underlying body configuration manifold, invariant to all other factors, is low in dimensionality. In contrast, we do not know the dimensionality of the shape manifold of all people, while we know that gait is a 1-D manifold motion. Therefore, the body configuration manifold can be explicitly modeled, while all the other factors can model deformations to this intrinsic manifold.

Consequently, a key property that we will use to model complex visual manifolds is the topological equivalence, or homeomorphism, between the different realizations of the body configuration manifolds of the same motion. Ignoring the case of degeneracy, the visual manifold is homeomorphic to the kinematic manifold. In an illustrative example, Figure 2(b) shows the kinematic manifold of gait, while Figure 2(a, c) show the visual manifold of gait from different viewpoints of the walkers. Similarly, the observed shapes (same for appearance) of different people performing the same motion (e.g., gait) lie on topologically equivalent manifolds in the visual input space (ignoring degeneracy). However, these manifolds differ in their geometry. The deformation of each person's manifold depends on his or her body shape, which poses different twists on it. Therefore, parameterizing the deformation of each person's manifold provides an encoding of the body shape. The

deformation space of these projected intrinsic motion manifolds captures a view-invariant signature of the person's body shape, invariant to the motion.

Let us even consider a more complex case: observing a particular motion, performed by different people, captured from different viewpoints. The visual data given these three factors (motion, viewpoint, personal body variations) is very complex to model. If we consider a particular person and a particular viewpoint, the observed shapes will lie on a low-dimensional manifold, which is the projected motion manifold. The projected motion manifolds of different people, from different viewpoints, are topologically equivalent, but different in their geometry. Modeling the deformation space of each of these manifolds provides a motion-invariant encoding of both the viewpoint variability and the person's body shape variability.
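The gait embeddings in Figures 2 and 3 were obtained with LLE. As a hedged illustration, here is a minimal LLE implementation of my own (following Roweis & Saul, 2000; it is not the chapter's code), applied to a synthetic closed curve standing in for a projected gait manifold in a 50-dimensional input space:

```python
import numpy as np

# Minimal LLE sketch (my own, following Roweis & Saul, 2000): find local
# reconstruction weights, then the embedding that preserves them.
def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
    np.fill_diagonal(d2, np.inf)                          # exclude self
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]
    W = np.zeros((n, n))
    for i in range(n):                                    # reconstruction weights
        Z = X[nbrs[i]] - X[i]
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(n_neighbors)      # regularized Gram matrix
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs[i]] = w / w.sum()
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, vecs = np.linalg.eigh(M)                           # ascending eigenvalues
    return vecs[:, 1:n_components + 1]                    # drop the constant vector

# Synthetic stand-in for gait: a closed 1-D curve lifted into 50-D.
rng = np.random.default_rng(0)
phase = np.linspace(0, 2 * np.pi, 200, endpoint=False)    # gait-cycle phase
loop = np.stack([np.cos(phase), np.sin(phase),
                 np.cos(2 * phase), np.sin(2 * phase)], axis=1)
X = loop @ rng.normal(size=(4, 50)) + 0.01 * rng.normal(size=(200, 50))

Y = lle(X)          # 2-D embedding of the 1-D closed gait-like manifold
print(Y.shape)
```

On real silhouette data the recovered curve twists differently per person and viewpoint, which is exactly the geometric variability the deformation space above is meant to parameterize.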

2.3 Biological Motivation

Humans are able to recognize and categorize an object under wide variability in the visual stimuli (viewpoint, scale, illumination, articulation, etc.). Similarly, humans recognize activities and facial expressions despite the wide variability in visual stimuli. This ability is a fundamental mystery of perception. While the role of manifold representations in perception is still unclear, it is clear that images of the same object lie on a low-dimensional manifold in

Figure 3 Homeomorphism of gait manifolds: Visualization of gait manifolds of different people from a side-view camera.


the high-dimensional visual space defined by the retinal array (~100 million photoreceptors and ~1 million retinal ganglion cells). On the other hand, neurophysiologists have found that neural population firing is typically a function of a small number of variables, which implies that population activities also lie on low-dimensional manifolds (Seung & Lee, 2000; DiCarlo, Zoccolan, & Rust, 2012).

[…] recognize objects, invariant of different viewing conditions such as viewpoint, is fundamentally based on untangling the visual manifold encoded in the neural population of the early vision areas (retinal ganglion cells, LGN, V1). This is achieved through a series of successive transformations (re-representation) along the ventral stream (V1, V2, V4, to IT) that leads to an untangled population at IT. However, it is unknown how the ventral stream achieves this untangling. They argued that since the IT population supports tasks other than recognition, such as pose estimation, the manifold representation is somehow "flattened" and untangled in IT. DiCarlo and Cox's hypothesis is illustrated in Figure 4. In their hypothesis, they stressed the feedforward cascade of neural re-representation as a way for untangling the visual manifold. They also stressed the role of temporal information as a way to implicitly flatten the visual manifold. Several earlier studies have suggested the role of temporal information in achieving invariants (e.g., Wallis […]). […] of "finding new biologically plausible algorithms that progressively untangled object manifold."

Figure 4 Illustration of DiCarlo and Cox model (DiCarlo & Cox, 2007): Left: Tangled manifolds of different objects in early vision areas. Right: Untangled (flattened) manifold representation in IT. (See color plate.)


Inspired by this perspective, the HMA is a computational model that can achieve untangling of the complex visual manifold. Analogous to the tangled visual manifold in early vision areas, images in any feature space also exhibit this problem, which makes recognition very hard. How can we untangle such a complex manifold to achieve an effective representation that facilitates recognition? Note that the HMA framework is not by any means an attempt to implement DiCarlo and Cox's hypothesis (DiCarlo & Cox, 2007; DiCarlo, Zoccolan, & Rust, 2012), nor is it an attempt to validate it through a computational model. We are merely motivated by the hypothesis to achieve an effective computer vision solution. Unlike the DiCarlo/Cox model, the HMA framework applied to the object-view manifold case does not flatten the view manifold. Instead, it learns a view-invariant representation untangled from the view representation, as was illustrated in Figure 1.

3 FRAMEWORK OVERVIEW

This general discussion uses the notion of instance and instance manifold to denote individual realizations of the intrinsic manifold in the visual input space. For example, in the case of the object-view manifold, an instance is equivalent to an object, and an instance manifold denotes the view manifold of an object. In the case of human motion, an instance refers to a sequence of images of one person performing an activity observed from a particular viewpoint, and an instance manifold refers to the image manifold of the motion of one subject observed from one viewpoint. We refer to variations among different instance manifolds as "style," and within that context, we might refer to the intrinsic manifold as "content." In all cases, the instance manifolds live in the image space, which we refer to as the visual input space, regardless of the representation used. Representing the visual input can vary, whether the input is represented in terms of shapes or appearance.

The fundamental concept behind the HMA framework is manifold homeomorphism. Therefore, we start with a mathematical definition of homeomorphism:

Definition 1

A function f : X → Y between two topological spaces is called a homeomorphism if it is a bijection and continuous, and its inverse is continuous.

Definition 2

Two manifolds, A and B, are said to be homeomorphic to each other if a homeomorphism exists between them.

Trang 23

homeo-Let us denote the manifold of instance s in the input space byDs3Rd,where d is the dimensionality of the input space Assuming that all instancemanifoldsDsare not degenerate (I will discuss this issue shortly), then theyare all topologically equivalent and homeomorphic to each other More-over, suppose that we can obtain a common manifold representation acrossall instances, denoted byM3Re, in a Euclidean embedding space of dimen-sionality e All manifolds Ds are also homeomorphic toM Each instancemanifoldDsis a deformed version ofM Notice that, in this case, the ex-istence of the inverse is assumed but not required for computation; i.e., we

do not need the inverse function to recover the intrinsic coordinate on themanifold We mainly care about the mapping in a generative manner from

In the mapping in Eq.(1), the geometric deformation of instance fold Ds, from common manifoldM, is encoded in coefficient matrix Cs

mani-.Therefore, the space of matricesC ¼ fCsg encodes the variability betweenmanifolds of different instances and can be used to parameterize such mani-folds Notice that the dimensionality of these matrices (d Nj) does notdepend on the number of images available in each instance, but rather onthe choice of the basis points

We can parameterize the variability across different manifolds in a subspace of the space of coefficient matrices. Given a set of style-dependent functions in the form of Eq. (1), the style variables can be factorized in the mapping coefficient space. This results in a generative model, which in the simplest case (the case of a single style factor) can be written as

ys = A ×2 bs ×3 ψ(x),   (2)

where A is a third-order tensor of dimensionality d × n × Nψ. The product ×i is the mode-i tensor product, as defined in Lathauwer, de Moor, and Vandewalle (2000a). The model represents instance manifolds as deformations around the common manifold M, which is explicitly modeled. In this model, the variable bs ∈ Rn is a parameterization of manifold Ds that encodes the manifold geometry of instance s. We denote that space by "style." The variable a denotes the model parameters, which are encoded in the tensor A. A sample realization of this single-style-factor model for modeling human gait will be explained in section 6.1.

In the general case, where several style variables exist in the data, the factorization can be achieved using multilinear analysis of the coefficient tensor. Therefore, the general form for the mapping function g(·) that we use is

g(x; b1, b2, ..., br; a) = A × b1 × ⋯ × br × ψ(x),   (3)

where each bi ∈ Rni is a vector representing a parameterization of the ith style factor. A is a core tensor of order r + 2 and of dimensionality d × n1 × ⋯ × nr × Nψ. The product operator × in Eq. (3) is the mode-i tensor product, as defined in Lathauwer, de Moor, and Vandewalle (2000a), where the indices are dropped since they are implied by the dimensions of the tensor. Sample realizations of this general model for the analysis of gait and facial expressions will be detailed in sections 6.2 and 6.3, respectively.

The models in Eqs. (2) and (3) can be seen as hybrid models that use a mix of nonlinear and multilinear factors. The relation between the intrinsic coordinate x and the input is nonlinear, while the other factors are approximated linearly through high-order tensor analysis. The use of a nonlinear mapping is essential, since the representation of the intrinsic manifold is related nonlinearly to the input (instance manifolds). The main motivation behind the hybrid model is as follows: the intrinsic manifold (e.g., the motion manifold or the view manifold) itself lies on a low-dimensional manifold, which can be explicitly modeled, while it might not be possible to model the other factors explicitly using nonlinear manifolds. For example, the shapes of different people might lie on a manifold; however, we do not know the dimensionality of that manifold, and we might not have enough data to model it. The best choice is to represent it as a subspace. Therefore, the model in (3) gives a tool that combines manifold-based models, where manifolds are explicitly represented, with subspace models for the style factors if no better models are available. The framework also allows modeling any style factor on a manifold in its corresponding subspace, since the data can lie naturally on a manifold in that


subspace. This feature of the model will be utilized in section 8, where the view manifold of a motion is modeled in the subspace defined by the previous factorization.

Dealing with degeneracy: Of course, the visual manifold can be degenerate, or it can be self-intersecting, because of the projection from 3-D to 2-D and the lack of visual features. For example, in the case of the view manifold of a textureless sphere, the visual manifold degenerates to a single point. In such cases, the homeomorphism assumption does not hold. The key to tackling this challenge is learning the mapping in a generative manner from M to Ds, and not in the other direction. By enforcing the known nondegenerate topology on M, the mapping from M to Ds still exists, is still a function, and still captures the manifold deformation. In such cases, the recovery of the intrinsic coordinate within the manifold (e.g., object pose) might be ambiguous and ill-posed. In fact, such degenerate cases can be detected by rank analysis of the mapping matrix Cs.

The realization of the models in Eqs. (2) and (3) requires a pipeline of three steps. First, a representation of the common manifold M has to be established, which is denoted as the "content" manifold. This step depends on the application and the available knowledge about the instance manifold topologies. Therefore, different solutions are available for this step, which are detailed in section 4.4. The second step is manifold parameterization, where each instance manifold is parameterized using Eq. (1), as detailed in section 4.2. The third step is manifold factorization, where the coefficient space is factorized to achieve low-dimensional representations of the various style factors. Details about this step are given in section 4.3. Once the model is learned, it can be used for solving for the various factors through inference, which is detailed in section 5.

4 MANIFOLD FACTORIZATION

For the sake of clarity, and without loss of generality, this section describes fitting the model from data in the context of human motion analysis; i.e., the intrinsic manifold in this case is the body configuration manifold, and style variability includes different people, different views, etc.

4.1 Style Setting

To fit the model in Eq. (3), we need image sequences at each combination of style factors, all representing the same motion. The input sequences do not have to have the same length. Each style factor is represented by a set of discrete samples in the training data; i.e., a set of discrete views, discrete shape styles, discrete expressions, etc. We denote the set of discrete samples for the ith style factor by Bi and the number of these samples by Ni = |Bi|. A certain combination of style factors is denoted by an r-tuple: s ∈ B1 × ⋯ × Br. We call such a tuple a style setting. Overall, the training data needed to fit the model is N1 × ⋯ × Nr sequences.

We consider the case for the sth sequence. We will drop the index s when it is implied from the context, for simplicity. Given a style-specific sequence Ys and its embedding coordinates Xs, we learn a style-dependent nonlinear mapping function from the embedding space into the input space; i.e., a function gs(·) : Re → Rd that maps from the embedding space into the input (observation) space. We can learn a nonlinear mapping function gs(·) that satisfies ysi = gs(xsi) for each input image ysi and its embedding coordinate xsi.

In particular, we use a semiparametric form for the function g(·). Therefore, for the lth dimension of the input (e.g., the lth pixel), the function gl(·) is a radial basis function (RBF) interpolant from Re into R in the form

gl(x) = pl(x) + Σ_{i=1..N} w_i^l φ(|x − z_i|),   (4)

where φ(·) is a real-valued basic function, the w_i^l are coefficients, and |·| is the second norm on Re (the embedding space of M). The choice of the centers z_i is arbitrary (not necessarily data points). Therefore, this is a form of generalized radial basis function (GRBF) (Poggio & Girosi, 1990).

Typical choices for the basis (kernel) function include the thin-plate (φ(u) = u² log(u)), multiquadric (φ(u) = √(u² + a²)), Gaussian (φ(u) = e^{−au²}), biharmonic (φ(u) = u), and triharmonic (φ(u) = u³) splines. Here, pl is a linear polynomial with coefficients cl; i.e., pl(x) = [1 xᵀ] · cl. The polynomial part is needed for positive semi-definite kernels to span the null space in the corresponding RKHS. The polynomial part is an essential regularizer with the choice of specific basis functions such as the thin-plate spline (TPS) kernel. A Gaussian kernel does not need a polynomial part (Kimeldorf & Wahba, 1971).

The whole mapping can be achieved by stacking the functions gl(·), and it can be written in matrix form as

gs(x) = Cs · ψ(x),   (5)

where Cs is a d × Nψ coefficient matrix and the kernel map vector is

ψ(x) = [φ(|x − z1|), ..., φ(|x − zN|), 1, xᵀ]ᵀ.   (6)

In this case, the dimensionality of the induced kernel space is Nψ = N + e + 1. The matrix Cs represents the coefficients for d different nonlinear mappings for style setting s, each from a low-dimensional embedding space into the real numbers.

To ensure orthogonality and to make the problem well posed, the following side-condition constraints are imposed: Σ_{i=1..N} w_i p_j(x_i) = 0, j = 1, ..., m, where the p_j are the linear basis of p. Therefore, the solution for Cs can be obtained by directly solving the linear system

[ A  Px ; Pxᵀ  0 ] · Csᵀ = [ Ys ; 0 ],   (7)

where A is an ns × N matrix with entries A_ij = φ(|x_i − z_j|), Px is an ns × (e + 1) matrix whose ith row is [1 x_iᵀ], and Ys is an (ns × d) matrix containing the input images for style setting s; i.e., Ys = [ys1 ⋯ ysns]ᵀ. The solution for Cs is guaranteed under certain conditions on the basic functions used.
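To make the parameterization step concrete, here is a minimal numerical sketch (illustrative only, not the chapter's implementation): it builds the kernel map ψ(x) of Eq. (6) with a Gaussian basis φ(u) = exp(−a u²) and fits the coefficient matrix C by plain least squares, rather than solving the constrained linear system with the side conditions above. All function names and the toy data are invented.

```python
import numpy as np

def psi(X, centers, a=1.0):
    """Kernel map of Eq. (6): [phi(|x - z_1|) ... phi(|x - z_N|), 1, x^T]
    with a Gaussian basis phi(u) = exp(-a u^2)."""
    # pairwise distances between embedding points X (n x e) and centers (N x e)
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    Phi = np.exp(-a * D**2)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([Phi, ones, X])            # n x (N + e + 1)

def fit_coefficients(X, Y, centers):
    """Least-squares fit of C (d x N_psi) so that Y ~= psi(X) @ C^T."""
    Psi = psi(X, centers)
    C_T, *_ = np.linalg.lstsq(Psi, Y, rcond=None)
    return C_T.T

# toy example: a 1-D closed manifold (unit circle) mapped into a 5-D input space
t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
X = np.stack([np.cos(t), np.sin(t)], axis=1)    # embedding coordinates (40 x 2)
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))
Y = X @ W.T + 0.5                               # synthetic "images" (40 x 5)
centers = X[::5]                                # 8 arbitrary basis points
C = fit_coefficients(X, Y, centers)
Y_hat = psi(X, centers) @ C.T                   # generative reconstruction
```

Because the polynomial part [1, xᵀ] is included in ψ, the fit reproduces any input that is affine in the embedding coordinates exactly.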

Given the learned nonlinear mapping coefficients for the instance manifolds, the style parameters can be factorized by finding a low-dimensional subspace that approximates the space of coefficient matrices. Let the coefficients be arranged as a d × K × Nψ tensor C. The form of the desired decomposition is

Cᵏ = A ×2 sᵏ,   (8)

where A is a d × ds × Nψ tensor containing bases for the RBF coefficient space, and S = [s1, ..., sK] is ds × K. The columns of S contain the instance style parameterization. This decomposition can be achieved by arranging the mapping coefficients as a (d · Nψ) × K matrix

C = [c1 ⋯ cK],   (9)

where ck is the column stacking of Ck; i.e., the vectors c_j^k, j = 1, ..., Nψ, are the columns of Ck. Given C, category vectors and content bases can be obtained by singular value decomposition (SVD) as C = USVᵀ. The bases are the columns of US, and the object instance/category vectors are the rows of V. Usually, (d · Nψ) ≫ K, so the dimensionality of the instance/category vectors obtained by SVD will be K; i.e., ds = K. This factorization is unsupervised; no specific class labels are associated with the instance manifolds. In contrast, supervised factorization can be achieved by utilizing the class labels, using linear methods such as LDA or nonlinear methods such as KPLS. See Bakry and Elgammal (2013) for details about this supervised factorization.
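The one-style-factor factorization can be sketched as follows (an illustrative toy, with random matrices standing in for learned RBF coefficients, and invented names): each Cᵏ is column-stacked into one column of C, and SVD yields coefficient-space bases plus one style vector per instance.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N_psi, K = 6, 4, 3                       # input dim, kernel dim, #instances
Cs = [rng.standard_normal((d, N_psi)) for _ in range(K)]

# arrange coefficients as a (d * N_psi) x K matrix, one column per instance
C = np.stack([Ck.flatten(order='F') for Ck in Cs], axis=1)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
A = U * s                                   # bases for the coefficient space
S = Vt                                      # columns of S: style vectors (ds = K)

# each coefficient matrix is regenerated from its style vector alone
C1_hat = (A @ S[:, 1]).reshape(d, N_psi, order='F')
```

The style vector S[:, k] is all that is needed to regenerate instance k's mapping, which is what makes the style space usable for recognition.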

4.3.2 Multifactor Model

Given the learned nonlinear mapping coefficients Cs for all style settings s ∈ B1 × ⋯ × Br, the style parameters can be factorized by fitting a multilinear model (Lathauwer, de Moor, & Vandewalle, 2000a; Vasilescu & Terzopoulos, 2002). This can be achieved by higher-order singular value decomposition (HOSVD) with matrix unfolding,¹ which is a generalization of SVD (Lathauwer, de Moor, & Vandewalle, 2000a).

¹ Matrix unfolding is an operation to reshape a high-order tensor array into matrix form. Given an r-order tensor A with dimensions N1 × N2 × ⋯ × Nr, the mode-n matrix unfolding, denoted by A(n) = unfolding(A, n), flattens A into a matrix whose column vectors are the mode-n vectors (Lathauwer, de Moor, & Vandewalle, 2000a). Therefore, the dimension of the unfolded matrix A(n) is Nn × (N1 · N2 ⋯ Nn−1 · Nn+1 ⋯ Nr).


Each of the coefficient matrices Cs, with dimensionality d × Nψ, can be represented as a coefficient vector cs by column stacking; i.e., cs is an Nc = d · Nψ dimensional vector. All the coefficient vectors can then be arranged in an order-(r + 1) coefficient tensor C with dimensionality N1 × ⋯ × Nr × Nc. The coefficient tensor is then factorized using HOSVD as

C = D̃ ×1 B̃1 ×2 B̃2 ⋯ ×r B̃r ×r+1 F̃,

where the matrix B̃i is the mode-i basis of C, which represents the orthogonal basis for the space of the ith style factor. F̃ represents the basis for the mapping coefficient space. The dimensionality of each of the B̃i matrices is Ni × Ni. The dimensionality of the matrix F̃ is Nc × Nc. D̃ is a core tensor, with dimensionality N1 × ⋯ × Nr × Nc, which governs the interactions (the correlations) among the different mode basis matrices. Similar to PCA, it is desirable to reduce the dimensionality of each of the orthogonal spaces to retain a subspace representation. This can be achieved by applying higher-order orthogonal iteration for dimensionality reduction (Lathauwer, de Moor, & Vandewalle, 2000b). The reduced representation is

C ≈ D ×1 B1 ×2 B2 ⋯ ×r Br ×r+1 F,   (10)

where the reduced dimensionality for D is n1 × ⋯ × nr × nc; for Bi, it is Ni × ni; and for F, it is Nc × nc, where n1, ..., nr, and nc are the numbers of bases retained for each factor, respectively. Since the basis for the mapping coefficients, F, is not used in the analysis, we can combine it with the core tensor using tensor multiplication to obtain coefficient eigenmodes: a new core tensor formed by Z = D ×r+1 F, with dimensionality n1 × ⋯ × nr × Nc. Therefore, Eq. (10) can be rewritten as

C ≈ Z ×1 B1 ×2 B2 ⋯ ×r Br.   (11)

The columns of the matrices B1, ..., Br represent orthogonal bases for each style factor's subspace, respectively. Any style setting s can be represented by a set of style vectors b1 ∈ Rn1, ..., br ∈ Rnr for each of the style factors. The corresponding coefficient matrix C can then be generated by unstacking the vector c obtained by the tensor product

c = Z ×1 b1 ⋯ ×r br.

Therefore, we can generate any specific instant of the motion by specifying the body configuration parameter xt through the kernel map defined in Eq. (6). The whole model for generating image yst can be expressed as

yst = unstacking(Z ×1 b1 ⋯ ×r br) · ψ(xt).

This can also be expressed abstractly by arranging the tensor Z into an order-(r + 2) tensor A with dimensionality d × n1 × ⋯ × nr × Nψ. This results in the factorization in the form of Eq. (3); i.e.,

yst = A ×2 b1 ⋯ ×r+1 br ×r+2 ψ(xt).
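The multifactor decomposition can be sketched with a small HOSVD routine (a minimal illustrative implementation of mode unfolding, mode products, and the factorization, not the chapter's code; the toy tensor stands in for the coefficient tensor C):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matrix unfolding: rows indexed by the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Mode-n tensor-matrix product."""
    Tm = np.moveaxis(T, mode, 0)
    out = np.tensordot(M, Tm, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd(T):
    """Higher-order SVD: T = D x1 B1 x2 B2 ... with orthogonal mode bases,
    each obtained from the SVD of the corresponding unfolding."""
    Bs = [np.linalg.svd(unfold(T, n), full_matrices=False)[0]
          for n in range(T.ndim)]
    D = T
    for n, B in enumerate(Bs):
        D = mode_dot(D, B.T, n)                 # project onto each mode basis
    return D, Bs

rng = np.random.default_rng(2)
C = rng.standard_normal((5, 3, 4))              # toy Nc x N1 x N2 tensor
D, Bs = hosvd(C)
# reconstruct: C = D x1 B1 x2 B2 x3 B3 (exact when no truncation is applied)
C_hat = D
for n, B in enumerate(Bs):
    C_hat = mode_dot(C_hat, B, n)
```

Truncating the columns of each mode basis before reconstruction gives the reduced representation of Eq. (10); the full bases reproduce the tensor exactly.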

4.4 Content Manifold Embedding

In order to achieve the aforementioned parameterization and factorization, we need to establish a representation of the common manifold M, which is homeomorphic to all instance manifolds. There are several ways to achieve an embedded representation of M, depending on the application, the data, and our knowledge about its topology. The discussion in this section highlights the requirements for that embedding. Three approaches can be used to achieve such an embedding:

1. Nonlinear dimensionality reduction from visual data: Such an approach assumes that the instance manifolds, in the observation space, are recoverable from the visual data through the application of traditional nonlinear dimensionality reduction techniques. This might not always be true when many factors affect the visual data. It also depends on the representation of the input. This approach, its applicability, and its limitations are discussed in section 4.4.1.

2. Topological conceptual embedding: In many cases, the topology of the instance manifolds, as well as of the intrinsic manifold, is known; for example, the gait manifold is a closed 1-D manifold. While the actual manifold might not be recoverable from the data itself, our conceptual knowledge about the motion manifold allows us to model the data as lying on a distorted or deformed manifold whose topology is known. This can be achieved using a conceptual representation of the manifold and using nonlinear mapping to model the deformation of that manifold to fit the data. This approach, its applicability, and its limitations are discussed in section 4.4.2.

3. Embedding from auxiliary data: In the context of human motion, in many cases both motion-captured data and visual data are available. The motion-captured data (kinematics) can be used to achieve an embedding of the configuration manifold invariant to the aspects affecting the visual observations (viewpoint, style, etc.). The visual data is assumed to be lying on deformed manifolds that are homeomorphic to the configuration manifold. Section 8 of this chapter discusses this approach within the context of modeling complex motion manifolds.

4.4.1 Nonlinear Dimensionality Reduction from Visual Data

There are several nonlinear dimensionality reduction (NLDR) techniques that can be used to embed data lying on a manifold; e.g., LLE (Roweis & Saul, 2000), the Gaussian process latent variable model (GPLVM; Lawrence, 2003), etc. All these approaches are unsupervised, where the goal is to embed the data into a low-dimensional Euclidean space, and the data is presumed to lie on a manifold. Such approaches have been used to achieve embedded representations for tracking and pose estimation (e.g., Elgammal & Lee, 2004a; Sminchisescu & Jepson, 2004). However, such dimensionality reduction techniques cannot directly obtain a useful embedding when multiple variability factors exist in the data. For example, they cannot embed multiple people's manifolds simultaneously in a way that yields a useful representation. This is because, although such approaches try to capture the manifold geometry, the intrasubject distances are typically much smaller than the intersubject distances. An example can be seen in Figure 5(a), where the inputs are spatially aligned. As a result, the embedding shows separate manifolds (e.g., in the leftmost plot of Figure 5(a), one manifold is degenerated to a point because the embedding is dominated by the manifold with the largest intramanifold distance). Even if we force LLE to include corresponding points on different manifolds among each point's neighbors, this results in a superficial embedding that does not capture the manifold geometry. This is an instance of a problem known as manifold alignment.

Given sequences for different style settings (e.g., different people and different viewpoints), we need to obtain a unified embedding for the underlying body configuration manifold. Given style-dependent sequences of the same motion under different style settings, an embedding of each sequence can be achieved using nonlinear dimensionality reduction. Since each sequence corresponds to a single style setting (e.g., a certain view and a certain person), that sequence is expected to show mainly the intrinsic motion manifold. Once each sequence is embedded, a unified representation can be achieved by warping the individual embeddings to an average representation. Next, I illustrate an example of this process in the context of learning a unified representation of the gait manifold from multiple subjects' sequences. Each person's manifold is embedded separately using NLDR.

[Figure 5: (a) Embedding obtained by LLE for three-person data with two different K values; intermanifold distance dominates the embedding. (Elgammal & Lee, 2004b, © IEEE) (See color plate)]

Each manifold is then represented as a parametric curve. Given the embedded manifold Xk for person k, a cubic spline mk(t) is fitted to the manifold as a function of time; i.e., mk(t) : R → Re, where t ∈ [0, 1] is a time variable. The manifold for person k is sampled at N uniform time instances mk(ti), where i = 1, ..., N. For the case of periodic motion, such as gait, each cycle on the manifold is time-mapped from 0 to 1, given a corresponding origin point on the manifold, where the cycles can be computed from the geodesic distances to the origin.

Given multiple manifolds, a mean manifold Z(ti) is learned by warping each mk(ti) using regularized nonrigid transformations f(·; αk), where the objective is to minimize an energy function that combines a data-fitting term and a smoothness term. In particular, a thin-plate spline (TPS) is used for the nonrigid transformation. Given the transformation parameters αk, the entire data sets are warped to X̃k = f(Xk; αk), k = 1, ..., K. Figure 5(a, c) shows an example of three different manifolds and their warping into a unified manifold embedding. Alternative solutions for embedding multiple manifolds can be achieved through manifold alignment; for example, Torki, Elgammal, and Lee use dimensionality reduction techniques that capture both the intermanifold and intramanifold geometry to obtain a unified representation.

In general, we found that this warping solution is suitable for a single-style-factor model and 1-D manifolds. For multifactor models, the deformations can be very large among the multiple manifolds representing the different variation factors. In such cases, a conceptual embedding, discussed next, is preferred.

4.4.2 Topological Conceptual Manifold Embedding

As mentioned earlier, one essential limitation of using nonlinear dimensionality reduction to achieve an embedding of the visual manifold is that the data itself might not lie on a smooth manifold in the visual space, as we might think it should. This is due to several reasons, including the lack of dense sampling, noise, the image representation, the existence of other nuisance sources of variability that are not accounted for (e.g., image translation), etc. In contrast to using NLDR to learn an embedded representation of the common content manifold, if the topology of the manifold is known, a conceptual, topologically equivalent representation of the manifold can be used directly. Here, the term topologically equivalent means equivalent to the notion of the underlying intrinsic motion manifold. The actual data instances are deformed versions of that manifold, where such deformation is captured through the nonlinear mapping in Eq. (1) in a generative way.

For example, for the gait case, the gait manifold is a 1-D closed manifold embedded in the input space. We can think of it as a unit circle twisted and stretched in the space based on the shape and the appearance of the person under consideration, or based on the viewpoint. In general, all closed 1-D manifolds are topologically homeomorphic to a unit circle. Therefore, we can use a unit circle as a unified representation of all gait cycles, for all people, and for all viewpoints. This is true not only for gait, but for all periodic motion, and it is also true when modeling the viewpoint manifold when the images are captured from a viewing circle around the object. Given that all the manifolds under consideration are homeomorphic to the unit circle, the actual data is used to learn a nonlinear warping between the conceptual representation and the actual data manifold. One important thing to notice is that, since the mapping in Eq. (1) is from the representation to the data, it will always be a function. Therefore, even if the manifold in the observation space has a different topology (e.g., self-intersecting or collapsing), this will not be a problem in learning the manifold deformation. Section 6 shows several examples of using a unit circle as a common manifold representation for different applications of modeling activities lying on 1-D closed manifolds, such as gait, as well as on 1-D open manifolds, such as facial expressions.

Other topological structures can be used to model more complex data. For example, Section 7 shows that a torus representation can be used to model 2-D manifolds (joint posture-viewpoint manifolds) for different activities. Conceptual representations have also been used to model image translation and rotation manifolds for tracking by Elgammal (2005).
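As a concrete illustration, a conceptual embedding for any 1-D closed manifold can be constructed by placing the frames of one motion cycle at equidistant points on a unit circle (a minimal sketch; the frame count is arbitrary):

```python
import numpy as np

def conceptual_circle_embedding(n_frames):
    """Place the frames of one motion cycle at equidistant points on a
    unit circle -- a topologically equivalent representation of any
    1-D closed manifold (gait cycle, viewing circle, etc.)."""
    t = 2 * np.pi * np.arange(n_frames) / n_frames
    return np.stack([np.cos(t), np.sin(t)], axis=1)

X = conceptual_circle_embedding(30)     # 30 frames -> 30 points on the circle
```

These coordinates then serve as the x values fed into the kernel map of Eq. (6); the learned coefficient matrix absorbs the "twisting and stretching" that distinguishes one person's or one view's gait manifold from the circle.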

5 INFERENCE

Given a test image and the generative model learned from training data, it is desired to solve efficiently for each of the style factors and the intrinsic manifold coordinate. This is an inference problem. We start by describing the inference procedure for the case of a one-style-factor model, in the form of Eq. (2). We then describe inference solutions for the general multifactor case, in the form of Eq. (3). We discriminate here between two scenarios: (1) the input is a whole motion cycle, and (2) the input is a single image. For the first scenario, since we have a whole motion manifold, we can obtain a closed-form analytical solution for each of the factors by aligning the input sequence manifold to the model manifold representation. For the second scenario, we introduce an iterative deterministic annealing solution. Alternatively, sampling methods such as Markov chain Monte Carlo (MCMC) and particle filtering can be used to infer the body configuration and style parameters from a single image or through a temporal sequence of frames (Lee & Elgammal, 2007; Elgammal & Lee, 2009).

5.1 Solving for One Style Factor

Given a new input y ∈ Rd, it is required to find both the intrinsic coordinate on the manifold, x ∈ Re, and the instance style parameters bs ∈ Rn. These unknowns should minimize the reconstruction error, defined as

E(x, bs) = ||y − A ×2 bs ×3 ψ(x)||².

This is an inference problem in two unknown variables. We present two solutions for this problem: an iterative solution that alternates between solving for x and bs, and a sampling-based solution.

5.1.1 Iterative Solution

We can solve for both the style bs and the content (intrinsic manifold coordinate) x in an expectation-maximization-like (EM-like) iterative procedure: in the E-step, we calculate the content x given the style parameters, and in the M-step, we calculate new style parameters bs based on the content. The initial content can be obtained using a mean style vector b̄s. The details of the two steps are described next.

5.1.1.1 Closed-Form Linear Approximation for the Coordinate on the Manifold

Note that x is a continuous variable that is nonlinearly related to the input. If the style vector bs is known, we can solve for x efficiently under some conditions. If the dimensionality of the manifold is low (for example, for 1-D manifolds such as the gait manifold or a 1-D view manifold), effective searching can be done on the embedded manifold representation.


Alternatively, a closed-form linear approximation can be obtained for x. Each observation yields a set of d nonlinear equations in e unknowns (or d nonlinear equations in one e-dimensional unknown). Therefore, a solution for x can be obtained by a least-squares solution of the overconstrained nonlinear system

x* = argmin_x ||y − B ψ(x)||²,

where B = A ×2 bs. However, because of the linear polynomial part in the interpolation function, the vector ψ(x) has a special form [Eq. (6)] that facilitates a closed-form least-squares linear approximation and therefore avoids having to solve the nonlinear system. This can be achieved by obtaining the pseudo-inverse of B. Note that B has rank N, since N distinctive RBF centers are used. The pseudo-inverse can be obtained by decomposing B using SVD, such that B = USVᵀ; the vector ψ(x) can then be recovered simply as ψ(x) = V S̃ Uᵀ y, where S̃ is the diagonal matrix obtained by taking the inverse of the nonzero singular values in the diagonal matrix S and setting the rest to zero. A linear approximation for the embedding coordinate x* can then be obtained by taking the last e rows of the recovered vector ψ(x).
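The closed-form recovery of x can be sketched as follows (illustrative only; B here is a random full-rank stand-in for the style-specific mapping A ×2 bs, and the helper name is invented): the pseudo-inverse of B recovers ψ(x), and the last e entries give the linear approximation of x.

```python
import numpy as np

def recover_x(y, B, e):
    """Closed-form linear approximation of the manifold coordinate x:
    recover psi(x) via the pseudo-inverse of B, then read x off the last
    e entries (the linear-polynomial part [1, x^T] of Eq. (6))."""
    psi_hat = np.linalg.pinv(B) @ y
    return psi_hat[-e:]

# toy check: build psi(x) for a known x, map it through B, and recover x
rng = np.random.default_rng(3)
e, N, d = 2, 6, 20
x = np.array([0.3, -0.7])
centers = rng.standard_normal((N, e))
phi = np.exp(-np.linalg.norm(x - centers, axis=1) ** 2)
psi_x = np.concatenate([phi, [1.0], x])      # N + 1 + e entries
B = rng.standard_normal((d, N + e + 1))      # stand-in d x N_psi mapping
y = B @ psi_x
x_hat = recover_x(y, B, e)
```

np.linalg.pinv computes the pseudo-inverse through exactly the SVD construction described above (inverting only the nonzero singular values).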

5.1.1.2 Solving for Discrete Styles

If the manifold embedding coordinate x is known, we can solve for the style vector bs. We assume that there is a set of discrete style classes, represented by their mean style vectors bk, k = 1, ..., K, which are learned from the training data. Given the embedding coordinate x, the observation y can be considered to be drawn from a Gaussian mixture model, with its kth component centered at A ×2 bk ×3 ψ(x) for each style class k. Therefore, the observation likelihood given the kth class, p(y|k, x), can be computed as

p(y|k, x) ∝ exp(−||y − A ×2 bk ×3 ψ(x)||² / 2σ²).

The style conditional class probabilities can be obtained using Bayes' rule as

p(k|x, y) = p(y|k, x) p(k|x) / p(y|x),

where p(y|x) = Σ_k p(y|x, k) p(k). A new style vector can then be obtained as a linear combination of the style vectors, bs = Σ_k wk bk, where the weights wk are set to p(k|x, y).
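The discrete-style step can be sketched as a Gaussian-mixture posterior over the class means (illustrative; equal class priors and isotropic noise are assumed, and each A ×2 bk is precomputed as a d × Nψ matrix per class):

```python
import numpy as np

def style_weights(y, x_psi, class_maps, sigma=1.0):
    """Posterior class weights p(k | x, y) under equal priors and isotropic
    Gaussian noise. class_maps[k] stands for A x2 b_k, the d x N_psi
    mapping of style class k."""
    errs = np.array([np.sum((y - Bk @ x_psi) ** 2) for Bk in class_maps])
    log_lik = -errs / (2 * sigma ** 2)
    w = np.exp(log_lik - log_lik.max())     # numerically stable softmax
    return w / w.sum()

rng = np.random.default_rng(4)
d, N_psi, K = 8, 5, 3
class_maps = [rng.standard_normal((d, N_psi)) for _ in range(K)]
x_psi = rng.standard_normal(N_psi)          # psi(x) for the known coordinate
y = class_maps[1] @ x_psi                   # observation generated by class 1
w = style_weights(y, x_psi, class_maps)
```

A new style vector is then the weighted combination Σ_k w[k] · bk, as in the text.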


5.1.2 Sampling-based Solution

We can solve for both x and b, given a test image, using sampling methods such as the particle filter (Arulampalam et al., 2002). Let us denote the style samples by s1, s2, ..., sK in the style space, and denote the content samples by x1, x2, ..., xL on the unified manifold representation. To evaluate the performance of each particle, the likelihood of a particle (sk, xl) is defined from its reconstruction error, yielding normalized weights Ws and Wx for the style and content particles. Resampling of the style and content particles can then be achieved according to Ws and Wx from a normal distribution to reduce the reconstruction error. To keep the reconstruction error as small as possible, the particle with the minimum error should be kept at each iteration. This solution was used in Lee and Elgammal (2007).

5.2 Solving for Multiple Style Factors Given a Whole Sequence

This section presents the solution for the style variables in Eq. (3), given a sequence of images that represents a whole instance manifold. We can solve for the different style factors iteratively. First, the sequence is embedded and aligned to the embedded common manifold M. Then, a mapping with coefficient matrix C is learned from the aligned embedding coordinates to the input. Given such coefficients, we need to find the optimal b1, ..., br factors that can generate such coefficients, minimizing the error

E(b1, ..., br) = ||c − Z ×1 b1 ×2 ⋯ ×r br||,   (16)

where c is the column stacking of C, and Z is the core tensor in Eq. (11). If all the style vectors except the ith factor's vector are known, then we can obtain a closed-form solution for bi. This can be achieved by evaluating the product

G = Z ×1 b1 ⋯ ×i−1 bi−1 ×i+1 bi+1 ⋯ ×r br

to obtain a tensor G. The solution for bi can be obtained by solving the system c = G ×2 bi for bi, which can be written as a typical linear system by unfolding G as a matrix. Therefore, an estimate of bi can be obtained using the pseudo-inverse (computed via SVD) of the mode-2 unfolding G(2) of G; i.e.,

bi = (G(2)ᵀ)† c.   (17)

Similarly, we can analytically solve for all the other style factors, starting with a mean style estimate for each, since the style vectors are not known at the beginning. Iterative estimation of each of the style factors using Eq. (17) leads to a local minimum of the error in Eq. (16).
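The per-factor closed-form step can be sketched as follows (a minimal illustration with a random core tensor standing in for Z; the contraction is done with numpy rather than explicit matrix unfolding, which is equivalent):

```python
import numpy as np

def mode_vec(T, v, mode):
    """Contract tensor T with vector v along the given mode."""
    return np.tensordot(T, v, axes=(mode, 0))

def solve_factor(c, Z, bs, i):
    """Estimate the ith style vector given all others: contract Z with
    every b_j, j != i (descending mode order so indices stay valid),
    then solve the remaining linear system in the least-squares sense."""
    G = Z
    for j in range(len(bs) - 1, -1, -1):
        if j != i:
            G = mode_vec(G, bs[j], j)
    # G is now n_i x N_c, and c = G^T b_i; least-squares solve for b_i
    bi, *_ = np.linalg.lstsq(G.T, c, rcond=None)
    return bi

rng = np.random.default_rng(5)
Z = rng.standard_normal((2, 3, 4))             # n1 x n2 x Nc core tensor
b1, b2 = rng.standard_normal(2), rng.standard_normal(3)
c = mode_vec(mode_vec(Z, b2, 1), b1, 0)        # c = Z x1 b1 x2 b2 (Nc vector)
b2_hat = solve_factor(c, Z, [b1, np.zeros(3)], 1)   # recover b2 knowing b1
```

Alternating this solve across the factors, starting from mean style estimates, is the iterative procedure described above.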

5.3 Solving for Body Configuration and Style Factors from a Single Image

The most typical scenario is the case where the input is a single image, y ∈ Rd, and it is required to find the embedding coordinate x ∈ Re on the intrinsic manifold and the style factors b1, ..., br. These unknowns should minimize the reconstruction error, defined as

E(x, b1, ..., br) = ||y − A ×2 b1 ⋯ ×r+1 br ×r+2 ψ(x)||².   (18)

The style factors can be constrained by the training samples in their subspaces; for example, we can model the viewpoint manifold in the view factor subspace given sufficiently sampled viewpoints. For the ith style factor, let the mean vectors of the style classes in the training data be b_i^k, k = 1, ..., Ki, where Ki is the number of classes and k is the class index. Such classes can be obtained by clustering the style vectors for each style factor in its subspace. Given such classes, we need to solve for linear regression weights α_ik such that

bi = Σ_{k=1..Ki} α_ik b_i^k.

If all the style factors are known, then Eq. (18) reduces to a search problem for the x on the embedded manifold representation that minimizes the error. On the other hand, if x and all the style factors are known except the ith factor, we can obtain the conditional class probabilities p(k | y, x, s^{/b_i}), which are proportional to the observation likelihood p(y | x, s^{/b_i}, k). Here, the notation s^{/b_i} is used to denote the style factors except for the ith factor. This likelihood can be estimated by assuming a Gaussian density centered around A ×_1 b^1 ⋯ ×_i b_i^k ⋯ ×_r b^r × ψ(x) with covariance Σ_i^k; i.e.,

p(y | x, s^{/b_i}, k) ≈ N( A ×_1 b^1 ⋯ ×_i b_i^k ⋯ ×_r b^r × ψ(x), Σ_i^k ).

Given the ith factor's class probabilities, the weights are set to α_ik = p(k | y, x, s^{/b_i}). This setting favors an iterative procedure for solving for x, b^1, …, b^r. However, an incorrect estimate of any of the factors would lead to wrong estimates of the others, and hence to a local minimum. For example, in the gait model in section 6.2, later in this chapter, a wrong estimate of the view factor would lead to a wrong estimate of the body configuration and a wrong estimate of the shape style. To avoid this, we use a deterministic annealing-like procedure, where at the beginning, the weights for all the style factors are forced to be close to uniform to avoid having to make hard decisions. The weights gradually become discriminative thereafter. To achieve this, we use variable class variances, which are uniform across all classes and are defined as Σ_i = T σ_i² I for the ith factor. The temperature parameter, T, starts with a large value and is gradually reduced at each step, and a new body configuration estimate is computed. The solution framework is summarized in Figure 6. Applications of this algorithm will be described in section 6.2 for gait motion and section 6.3 for facial expressions.
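The annealed weighting step can be sketched as follows (a toy NumPy illustration with hypothetical data, not the chapter's code): each candidate reconstruction uses one class mean, the Gaussian likelihood uses covariance T·σ²·I, and the resulting weights are near-uniform at high temperature and sharpen as T decreases.

```python
import numpy as np

def annealed_weights(y, candidates, sigma2, T):
    """Class weights alpha_ik proportional to a Gaussian likelihood of y
    around each candidate reconstruction, with covariance T * sigma2 * I."""
    sq_dists = np.array([np.sum((y - c) ** 2) for c in candidates])
    log_lik = -sq_dists / (2.0 * T * sigma2)
    w = np.exp(log_lik - log_lik.max())   # subtract max for numerical stability
    return w / w.sum()

rng = np.random.default_rng(1)
candidates = [rng.standard_normal(10) for _ in range(3)]   # one per style class
y = candidates[1] + 0.05 * rng.standard_normal(10)         # noisy copy of class 1

for T in (100.0, 10.0, 1.0, 0.1):        # decreasing temperature schedule
    alpha = annealed_weights(y, candidates, sigma2=1.0, T=T)
    # near-uniform weights at high T; they concentrate on class 1 as T drops
```

In the full procedure these weights would be interleaved with re-estimating x and the other style factors at each temperature step.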

6 APPLICATIONS OF HOMOMORPHISM ON 1-D MANIFOLDS

This section illustrates several applications of the HMA framework in the context of modeling human motion, where the activity can be characterized as lying on a 1-D manifold. This includes a model for gait that factorizes personal shape styles (described in section 6.1), a multifactor gait model that factorizes the viewpoint and personal style (described in section 6.2), and a model for facial expressions that factorizes personal appearance from the facial expression motion (described in section 6.3).

Shape Representation:

One essential challenge when modeling visual data manifolds is the issue of image representation. While, in principle, the data is expected to lie on a low-dimensional manifold, the actual image representation might not exhibit that. The manifold might not be recoverable from the data if the representation does not exhibit smooth transitions between images that are supposed to be neighboring points on the manifold. We represent each shape instance as an implicit function y(x) at each pixel x, such that y(x) = 0 on the contour, y(x) > 0 inside the contour, and y(x) < 0 outside the contour. We use a signed-distance function for this purpose. Such a representation imposes smoothness on the distance between shapes. Given such a representation, an input shape is a point in R^d, where d is the dimensionality of the input space. Implicit function representation is typically used in level-set methods. For the facial expression applications, we represent the appearance directly using pixel intensities.
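As a toy illustration of the signed-distance representation (pure NumPy, brute force over boundary pixels; a practical implementation would use a fast distance transform such as `scipy.ndimage.distance_transform_edt`): the function below is positive inside the mask, negative outside, and zero on the contour.

```python
import numpy as np

def signed_distance(mask):
    """Signed distance to the contour of a nonempty boolean mask:
    >0 inside, <0 outside, ==0 on boundary pixels."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Boundary: inside pixels with at least one outside 4-neighbor.
    padded = np.pad(mask, 1, constant_values=False)
    nbr_out = ((~padded[:-2, 1:-1]) | (~padded[2:, 1:-1]) |
               (~padded[1:-1, :-2]) | (~padded[1:-1, 2:]))
    by, bx = np.nonzero(mask & nbr_out)
    # Distance from every pixel to its nearest boundary pixel.
    d = np.sqrt((ys[..., None] - by) ** 2 + (xs[..., None] - bx) ** 2).min(-1)
    return np.where(mask, d, -d)

# Example: a filled circle of radius 8 on a 32 x 32 grid.
yy, xx = np.mgrid[0:32, 0:32]
mask = ((yy - 16) ** 2 + (xx - 16) ** 2) <= 8 ** 2
sdf = signed_distance(mask)
```

Two nearby silhouettes then yield nearby points in R^d under this representation, which is what makes the manifold recoverable.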

6.1 A Single-Style-Factor Model for Gait

Here, we give an example of a single-style-factor model. Figure 7 shows an example of data with different subjects performing the same activity. The content in this case is the gait motion, while the style is the person's shape.

Figure 6 Iterative Estimation Using Deterministic Annealing.
