We now have an overconstrained system that we can solve, provided the 5-by-5 window contains more than just an edge. To solve this system, we set up a least-squares minimization of the equation, whereby min ‖Ad − b‖² is solved in standard form as:

    (A^T A) d = A^T b,    so that    d = (A^T A)^(-1) A^T b

Figure 10-8 Aperture problem: through the aperture window (upper row) we see an edge moving to the right but cannot detect the downward part of the motion (lower row)
When can this be solved?—when (A^T A) is invertible. And (A^T A) is invertible when it has full rank (2), which occurs when it has two large eigenvalues. This will happen in image regions that include texture running in at least two directions. In this case, (A^T A) will have the best properties when the tracking window is centered over a corner region in an image. This ties us back to our earlier discussion of the Harris corner detector. In fact, those corners were "good features to track" (see our previous remarks concerning cvGoodFeaturesToTrack()) for precisely the reason that (A^T A) had two large eigenvalues there! We'll see shortly how all this computation is done for us by the cvCalcOpticalFlowLK() function.
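To make the structure of that computation concrete, here is a rough illustrative sketch (ours, not OpenCV's implementation): the 2-by-2 matrix A^T A and the vector A^T b are just sums of derivative products over the window, and (u, v) follows by inverting the 2-by-2 matrix whenever it is well conditioned.

#include <math.h>

// Illustrative sketch (not OpenCV's code): solve for the motion (u,v) of one
// pixel from spatial derivatives Ix, Iy and temporal derivative It summed
// over a 5-by-5 window. Returns 0 if A'A is (nearly) singular, i.e., the
// window does not contain corner-like texture.
int lk_window_flow( const float Ix[25], const float Iy[25], const float It[25],
                    float* u, float* v )
{
    double sxx = 0, sxy = 0, syy = 0, sxt = 0, syt = 0;
    for( int i = 0; i < 25; i++ ) {
        sxx += Ix[i]*Ix[i];  sxy += Ix[i]*Iy[i];  syy += Iy[i]*Iy[i];
        sxt += Ix[i]*It[i];  syt += Iy[i]*It[i];
    }
    // A'A = [sxx sxy; sxy syy],  A'b = -[sxt; syt]
    double det = sxx*syy - sxy*sxy;
    if( fabs(det) < 1e-9 ) return 0;     // not invertible: aperture problem
    *u = (float)(( -syy*sxt + sxy*syt ) / det);
    *v = (float)((  sxy*sxt - sxx*syt ) / det);
    return 1;
}

When det(A^T A) is near zero (the aperture-problem case), no unique solution exists, which is exactly why corner-like windows are the ones worth tracking.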
The reader who understands the implications of our assuming small and coherent motions will now be bothered by the fact that, for most video cameras running at 30 Hz, large and noncoherent motions are commonplace. In fact, Lucas-Kanade optical flow by itself does not work very well for exactly this reason: we want a large window to catch large motions, but a large window too often breaks the coherent motion assumption! To circumvent this problem, we can track first over larger spatial scales using an image pyramid and then refine the initial motion velocity assumptions by working our way down the levels of the image pyramid until we arrive at the raw image pixels.

Hence, the recommended technique is first to solve for optical flow at the top layer and then to use the resulting motion estimates as the starting point for the next layer down. We continue going down the pyramid in this manner until we reach the lowest level. Thus we minimize the violations of our motion assumptions and so can track faster and longer motions. This more elaborate function is known as pyramid Lucas-Kanade optical flow and is illustrated in Figure 10-9. The OpenCV function that implements pyramid Lucas-Kanade optical flow is cvCalcOpticalFlowPyrLK(), which we examine next.
The result arrays for the simpler, non-pyramidal cvCalcOpticalFlowLK() routine are populated only by those pixels for which it is able to compute the minimum error. For the pixels for which this error (and thus the displacement) cannot be reliably computed, the associated velocity will be set to 0. In most cases, you will not want to use that routine; the pyramid-based method described next is better for most situations most of the time.
Pyramid Lucas-Kanade code
We come now to OpenCV's algorithm that computes Lucas-Kanade optical flow in a pyramid, cvCalcOpticalFlowPyrLK(). As we will see, this optical flow function makes use of "good features to track" and also returns indications of how well the tracking of each point is proceeding.

void cvCalcOpticalFlowPyrLK(
    const CvArr*     imgA,
    const CvArr*     imgB,
    CvArr*           pyrA,
    CvArr*           pyrB,
    CvPoint2D32f*    featuresA,
    CvPoint2D32f*    featuresB,
    int              count,
    CvSize           winSize,
    int              level,
    char*            status,
    float*           track_error,
    CvTermCriteria   criteria,
    int              flags
);

This function has a lot of inputs, so let's take a moment to figure out what they all do. Once we have a handle on this routine, we can move on to the problem of which points to track and how to compute them.
Figure 10-9 Pyramid Lucas-Kanade optical flow: running optical flow at the top of the pyramid first mitigates the problems caused by violating our assumptions of small and coherent motion; the motion estimate from the preceding level is taken as the starting point for estimating motion at the next layer down

The first two arguments of cvCalcOpticalFlowPyrLK() are the initial and final images; both should be single-channel, 8-bit images. The next two arguments are buffers allocated to store the pyramid images. The size of these buffers should be at least (img.width + 8)*img.height/3 bytes,* with one such buffer for each of the two input images (pyrA
and pyrB). (If these two pointers are set to NULL then the routine will allocate, use, and free the appropriate memory when called, but this is not so good for performance.) The array featuresA contains the points for which the motion is to be found, and featuresB is a similar array into which the computed new locations of the points from featuresA are to be placed; count is the number of points in the featuresA list. The window used for computing the local coherent motion is given by winSize. Because we are constructing an image pyramid, the argument level is used to set the depth of the stack of images. If level is set to 0 then the pyramids are not used. The array status is of length count; on completion of the routine, each entry in status will be either 1 (if the corresponding point was found in the second image) or 0 (if it was not). The track_error parameter is optional and can be turned off by setting it to NULL. If track_error is active then it is an array of numbers, one for each tracked point, equal to the difference between the patch around a tracked point in the first image and the patch around the location to which that point was tracked in the second image. You can use track_error to prune away points whose local appearance patch changes too much as the points move.

The next thing we need is the termination criteria. This is a structure used by many OpenCV algorithms that iterate to a solution:

cvTermCriteria(
    int    type,      // CV_TERMCRIT_ITER, CV_TERMCRIT_EPS, or both
    int    max_iter,
    double epsilon
);
Typically we use the cvTermCriteria() function to generate the structure we need. The first argument of this function is either CV_TERMCRIT_ITER or CV_TERMCRIT_EPS, which tells the algorithm that we want to terminate either after some number of iterations or when the convergence metric reaches some small value (respectively). The next two arguments set the values at which one, the other, or both of these criteria should terminate the algorithm. The reason we have both options is so we can set the type to CV_TERMCRIT_ITER | CV_TERMCRIT_EPS and thus stop when either limit is reached (this is what is done in most real code).
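For example, a criteria structure that stops after 20 iterations or when the estimate moves by less than 0.03 (these particular numbers are only an illustration) would be built as:

CvTermCriteria criteria = cvTermCriteria(
    CV_TERMCRIT_ITER | CV_TERMCRIT_EPS,   // stop on whichever limit is hit first
    20,                                   // max_iter
    0.03                                  // epsilon
);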
Finally, flags allows for some fine control of the routine's internal bookkeeping; it may be set to any or all (using bitwise OR) of the following:

CV_LKFLOW_PYR_A_READY
The image pyramid for the first frame was computed before the call and is stored in pyrA.

CV_LKFLOW_PYR_B_READY
The image pyramid for the second frame was computed before the call and is stored in pyrB.

CV_LKFLOW_INITIAL_GUESSES
The array featuresB already contains an initial guess for the feature's coordinates when the routine is called.

* If you are wondering why the funny size, it's because these scratch spaces need to accommodate not just the image itself but the entire pyramid.
These flags are particularly useful when handling sequential video. The image pyramids are somewhat costly to compute, so recomputing them should be avoided whenever possible. The final frame for the frame pair you just computed will be the initial frame for the pair that you will compute next. If you allocated those buffers yourself (instead of asking the routine to do it for you), then the pyramids for each image will be sitting in those buffers when the routine returns. If you tell the routine that this information is already computed then it will not be recomputed. Similarly, if you computed the motion of points from the previous frame then you are in a good position to make good initial guesses for where they will be in the next frame.
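As a minimal sketch of that idea (this is our illustration, not code from the book; the buffers and point arrays are assumed to have been allocated as in Example 10-1 below), a per-frame tracking step might look like this:

#include <cv.h>
#include <string.h>

// Sketch: track points from the previous frame of a sequence into the
// current one, reusing the pyramid already built for the previous frame
// and seeding the search with the previous point locations.
void track_next_frame(
    IplImage* prev, IplImage* curr,           // 8-bit, single-channel frames
    IplImage* pyr_prev, IplImage* pyr_curr,   // caller-allocated pyramid buffers
    CvPoint2D32f* pts_prev, CvPoint2D32f* pts_curr,
    int count, char* status, int prev_pyramid_ready )
{
    int flags = CV_LKFLOW_INITIAL_GUESSES;
    if( prev_pyramid_ready ) flags |= CV_LKFLOW_PYR_A_READY;

    // Seed pts_curr with the previous locations as initial guesses.
    memcpy( pts_curr, pts_prev, count * sizeof(CvPoint2D32f) );

    cvCalcOpticalFlowPyrLK(
        prev, curr, pyr_prev, pyr_curr,
        pts_prev, pts_curr, count,
        cvSize( 21, 21 ),    // search window
        3,                   // pyramid depth
        status, NULL,
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, 0.03 ),
        flags
    );
}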
So the basic plan is simple: you supply the images, list the points you want to track in featuresA, and call the routine. When the routine returns, you check the status array to see which points were successfully tracked and then check featuresB to find the new locations of those points.

This leads us back to that issue we put aside earlier: how to decide which features are good ones to track. Earlier we encountered the OpenCV routine cvGoodFeaturesToTrack(), which uses the method originally proposed by Shi and Tomasi to solve this problem in a reliable way. In most cases, good results are obtained by using the combination of cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(). Of course, you can also use your own criteria to determine which points to track.

Let's now look at a simple example (Example 10-1) that uses both cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(); see also Figure 10-10.
Example 10-1 Pyramid Lucas-Kanade optical flow code

// Pyramid L-K optical flow example
// (portions of the listing lost in extraction are filled in here in
// abbreviated, not verbatim, form)
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>

const int MAX_CORNERS = 500;

int main(int argc, char** argv) {
  // Initialize, load two images from the file system, and
  // allocate the images and other structures we will need for results.
  IplImage* imgA = cvLoadImage( "image0.jpg", CV_LOAD_IMAGE_GRAYSCALE );
  IplImage* imgB = cvLoadImage( "image1.jpg", CV_LOAD_IMAGE_GRAYSCALE );
  CvSize img_sz  = cvGetSize( imgA );
  int win_size   = 10;

  // Image on which the flow vectors will be drawn (the file name here is
  // a placeholder; it was cut off in the original listing).
  IplImage* imgC = cvLoadImage( "image1.jpg", CV_LOAD_IMAGE_UNCHANGED );

  // The first thing we need to do is get the features we want to track.
  IplImage* eig_image = cvCreateImage( img_sz, IPL_DEPTH_32F, 1 );
  IplImage* tmp_image = cvCreateImage( img_sz, IPL_DEPTH_32F, 1 );
  int corner_count = MAX_CORNERS;
  CvPoint2D32f* cornersA = new CvPoint2D32f[ MAX_CORNERS ];
  cvGoodFeaturesToTrack(
    imgA, eig_image, tmp_image, cornersA, &corner_count,
    0.01,     // quality level
    5.0,      // minimum distance between features
    0, 3, 0, 0.04
  );

  // Call the pyramid Lucas-Kanade algorithm.
  char  features_found[ MAX_CORNERS ];
  float feature_errors[ MAX_CORNERS ];
  CvSize pyr_sz = cvSize( imgA->width+8, imgB->height/3 );
  IplImage* pyrA = cvCreateImage( pyr_sz, IPL_DEPTH_32F, 1 );
  IplImage* pyrB = cvCreateImage( pyr_sz, IPL_DEPTH_32F, 1 );
  CvPoint2D32f* cornersB = new CvPoint2D32f[ MAX_CORNERS ];
  cvCalcOpticalFlowPyrLK(
    imgA, imgB, pyrA, pyrB,
    cornersA, cornersB, corner_count,
    cvSize( win_size, win_size ),
    5,                                // pyramid depth
    features_found, feature_errors,
    cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, 0.3 ),
    0
  );

  // Draw a line from each point's old location to its new one, skipping
  // points that were lost or whose error was too large.
  for( int i=0; i<corner_count; i++ ) {
    if( features_found[i]==0 || feature_errors[i]>550 ) {
      continue;
    }
    CvPoint p0 = cvPoint( cvRound( cornersA[i].x ), cvRound( cornersA[i].y ) );
    CvPoint p1 = cvPoint( cvRound( cornersB[i].x ), cvRound( cornersB[i].y ) );
    cvLine( imgC, p0, p1, CV_RGB(255,0,0), 2 );
  }
  // ... display imgA, imgB, and imgC (elided in the original listing) ...
  return 0;
}
Dense Tracking Techniques
OpenCV contains two other optical flow techniques that are now seldom used. These routines are typically much slower than Lucas-Kanade; moreover, they (could, but) do not support matching within an image scale pyramid and so cannot track large motions. We will discuss them briefly in this section.
Horn-Schunck method

The method of Horn and Schunck was developed in 1981 [Horn81]. This technique was one of the first to make use of the brightness constancy assumption and to derive the basic brightness constancy equations. The solution of these equations devised by Horn and Schunck was by hypothesizing a smoothness constraint on the velocities vx and vy. This constraint was derived by minimizing the regularized Laplacian of the optical flow velocity components, where the constant weighting coefficient α on the regularization term is known as the regularization constant. Larger values of α lead to smoother (i.e., more locally consistent) vectors of motion flow. This is a relatively simple constraint for enforcing smoothness, and its effect is to penalize regions in which the flow is changing in magnitude. As with Lucas-Kanade, the Horn-Schunck technique relies on iterations to solve the differential equations. The function that computes this is:
void cvCalcOpticalFlowHS(
    const CvArr*     imgA,
    const CvArr*     imgB,
    int              usePrevious,
    CvArr*           velx,
    CvArr*           vely,
    double           lambda,
    CvTermCriteria   criteria
);

Figure 10-10 Sparse optical flow from pyramid Lucas-Kanade: the center image is one video frame after the left image; the right image illustrates the computed motion of the "good features to track" (lower right shows flow vectors against a dark background for increased visibility)

Here imgA and imgB must be 8-bit, single-channel images. The x and y velocity results will be stored in velx and vely, which must be 32-bit, floating-point, single-channel images. The usePrevious parameter tells the algorithm to use the velx and vely velocities computed from a previous frame as the initial starting point for computing the new velocities. The parameter lambda is a weight related to the Lagrange multiplier. You are probably asking yourself: "What Lagrange multiplier?"* The Lagrange multiplier arises when we attempt to minimize (simultaneously) both the motion-brightness equation and the smoothness equations; it represents the relative weight given to the errors in each as we minimize.
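As a minimal usage sketch (ours, not from the original text; the lambda value and termination criteria are illustrative only), computing a dense Horn-Schunck flow field between two frames might look like this:

#include <cv.h>

// Sketch: dense Horn-Schunck flow between two 8-bit grayscale frames.
void dense_flow_hs( IplImage* prev, IplImage* curr,
                    IplImage** velx_out, IplImage** vely_out )
{
    CvSize sz = cvGetSize( prev );
    IplImage* velx = cvCreateImage( sz, IPL_DEPTH_32F, 1 );
    IplImage* vely = cvCreateImage( sz, IPL_DEPTH_32F, 1 );

    cvCalcOpticalFlowHS(
        prev, curr,
        0,             // usePrevious: no prior velx/vely to start from
        velx, vely,
        0.001,         // lambda (smoothness weight)
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 64, 0.01 )
    );
    *velx_out = velx;
    *vely_out = vely;
}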
Block matching method
You might be thinking: “What’s the big deal with optical fl ow? Just match where pixels
in one frame went to in the next frame.” Th is is exactly what others have done Th e term
“block matching” is a catchall for a whole class of similar algorithms in which the
im-age is divided into small regions called blocks [Huang95; Beauchemin95] Blocks are
typically square and contain some number of pixels Th ese blocks may overlap and, in
practice, oft en do Block-matching algorithms attempt to divide both the previous and
current images into such blocks and then compute the motion of these blocks
Algo-rithms of this kind play an important role in many video compression algoAlgo-rithms as
well as in optical fl ow for computer vision
Because block-matching algorithms operate on aggregates of pixels, not on individual
pixels, the returned “velocity images” are typically of lower resolution than the input
images Th is is not always the case; it depends on the severity of the overlap between the
blocks Th e size of the result images is given by the following formula:
The implementation in OpenCV uses a spiral search that works out from the location of the original block (in the previous frame) and compares the candidate new blocks with the original. This comparison is a sum of absolute differences of the pixels (i.e., an L1 distance). If a good enough match is found, the search is terminated. Here's the function prototype:

void cvCalcOpticalFlowBM(
    const CvArr*  prev,
    const CvArr*  curr,
    CvSize        block_size,
    CvSize        shift_size,
    CvSize        max_range,
    int           use_previous,
    CvArr*        velx,
    CvArr*        vely
);

* You might even be asking yourself: "What is a Lagrange multiplier?" In that case, it may be best to ignore this part of the paragraph and just set lambda equal to 1.

The arguments are straightforward. The prev and curr parameters are the previous and current images; both should be 8-bit, single-channel images. The block_size is the size of the block to be used, and shift_size is the step size between blocks (this parameter controls whether—and, if so, by how much—the blocks will overlap). The max_range parameter is the size of the region around a given block that will be searched for a corresponding block in the subsequent frame. If set, use_previous indicates that the values in velx and vely should be taken as starting points for the block searches.* Finally, velx and vely are themselves 32-bit single-channel images that will store the computed motions of the blocks. As mentioned previously, motion is computed at a block-by-block level and so the coordinates of the result images are for the blocks (i.e., aggregates of pixels), not for the individual pixels of the original image.
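A minimal usage sketch (ours, not from the original text; the block, shift, and range sizes are illustrative, and the result-image size follows the formula given earlier):

#include <cv.h>

// Sketch: block-matching flow between two 8-bit grayscale frames.
void dense_flow_bm( IplImage* prev, IplImage* curr )
{
    CvSize block = cvSize( 16, 16 );
    CvSize shift = cvSize( 8, 8 );     // blocks overlap by half a block
    CvSize range = cvSize( 16, 16 );   // how far to search around each block

    // One flow value per block position, so the result images are smaller
    // than the input images (integer division truncates, i.e., floor).
    CvSize res_sz = cvSize(
        ( prev->width  - block.width  ) / shift.width,
        ( prev->height - block.height ) / shift.height
    );
    IplImage* velx = cvCreateImage( res_sz, IPL_DEPTH_32F, 1 );
    IplImage* vely = cvCreateImage( res_sz, IPL_DEPTH_32F, 1 );

    cvCalcOpticalFlowBM( prev, curr, block, shift, range, 0, velx, vely );

    // ... use velx and vely, then release them ...
    cvReleaseImage( &velx );
    cvReleaseImage( &vely );
}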
Mean-Shift and Camshift Tracking
In this section we will look at two techniques, mean-shift and camshift (where "camshift" stands for "continuously adaptive mean-shift"). The former is a general technique for data analysis (discussed in Chapter 9 in the context of segmentation) in many applications, of which computer vision is only one. After introducing the general theory of mean-shift, we'll describe how OpenCV allows you to apply it to tracking in images. The latter technique, camshift, builds on mean-shift to allow for the tracking of objects whose size may change during a video sequence.
Mean-Shift
The mean-shift algorithm† is a robust method of finding local extrema in the density distribution of a data set. This is an easy process for continuous distributions; in that context, it is essentially just hill climbing applied to a density histogram of the data.‡ For discrete data sets, however, this is a somewhat less trivial problem.

* If use_previous==0, then the search for a block will be conducted over a region of max_range distance from the location of the original block. If use_previous!=0, then the center of that search is first displaced by Δx = velx(x, y) and Δy = vely(x, y).

† Because mean-shift is a fairly deep topic, our discussion here is aimed mainly at developing intuition for the user. For the original formal derivation, see Fukunaga [Fukunaga90] and Comaniciu and Meer [Comaniciu99].

‡ The word "essentially" is used because there is also a scale-dependent aspect of mean-shift. To be exact: mean-shift is equivalent in a continuous distribution to first convolving with the mean-shift kernel and then applying a hill-climbing algorithm.
The descriptor "robust" is used here in its formal statistical sense; that is, mean-shift ignores outliers in the data. This means that it ignores data points that are far away from peaks in the data. It does so by processing only those points within a local window of the data and then moving that window.
The mean-shift algorithm runs as follows.

1. Choose a search window:
   • its initial location;
   • its type (uniform, polynomial, exponential, or Gaussian);
   • its shape (symmetric or skewed, possibly rotated, rounded or rectangular);
   • its size (extent at which it rolls off or is cut off).
2. Compute the window's (possibly weighted) center of mass.
3. Center the window at the center of mass.
4. Return to step 2 until the window stops moving (it always will).*
To give a little more formal sense of what the mean-shift algorithm is: it is related to the discipline of kernel density estimation, where by "kernel" we refer to a function that has mostly local focus (e.g., a Gaussian distribution). With enough appropriately weighted and sized kernels located at enough points, one can express a distribution of data entirely in terms of those kernels. Mean-shift diverges from kernel density estimation in that it seeks only to estimate the gradient (direction of change) of the data distribution. When this change is 0, we are at a stable (though perhaps local) peak of the distribution. There might be other peaks nearby or at other scales.
Figure 10-11 shows the equations involved in the mean-shift algorithm. These equations can be simplified by considering a rectangular kernel,† which reduces the mean-shift vector equation to calculating the center of mass of the image pixel distribution:

    x_c = M10 / M00 ,    y_c = M01 / M00

Here the zeroth moment is calculated as:

    M00 = Σx Σy I(x, y)

and the first moments are:

    M10 = Σx Σy x·I(x, y)    and    M01 = Σx Σy y·I(x, y)
* Iterations are typically restricted to some maximum number or to some epsilon change in center shift between iterations; however, they are guaranteed to converge eventually.

† A rectangular kernel is a kernel with no falloff with distance from the center, until a single sharp transition to zero value. This is in contrast to the exponential falloff of a Gaussian kernel and the falloff with the square of distance from the center in the commonly used Epanechnikov kernel.
The mean-shift vector in this case tells us to recenter the mean-shift window over the calculated center of mass within that window. This movement will, of course, change what is "under" the window and so we iterate this recentering process. Such recentering will always converge to a mean-shift vector of 0 (i.e., where no more centering movement is possible). The location of convergence is at a local maximum (peak) of the distribution under the window. Different window sizes will find different peaks because "peak" is fundamentally a scale-sensitive construct.

In Figure 10-12 we see an example of a two-dimensional distribution of data and an initial (in this case, rectangular) window. The arrows indicate the process of convergence on a local mode (peak) in the distribution. Observe that, as promised, this peak finder is statistically robust in the sense that points outside the mean-shift window do not affect convergence—the algorithm is not "distracted" by far-away points.
Figure 10-11 Mean-shift equations and their meaning

In 1998, it was realized that this mode-finding algorithm could be used to track moving objects in video [Bradski98a; Bradski98b], and the algorithm has since been greatly extended [Comaniciu03]. The OpenCV function that performs mean-shift is implemented in the context of image analysis. This means in particular that, rather than taking some
arbitrary set of data points (possibly in some arbitrary number of dimensions), the OpenCV implementation of mean-shift expects as input an image representing the density distribution being analyzed. You could think of this image as a two-dimensional histogram measuring the density of points in some two-dimensional space. It turns out that, for vision, this is precisely what you want to do most of the time: it's how you can track the motion of a cluster of interesting features.

int cvMeanShift(
    const CvArr*       prob_image,
    CvRect             window,
    CvTermCriteria     criteria,
    CvConnectedComp*   comp
);
In cvMeanShift(), the prob_image, which represents the density of probable locations, may be only one channel but of either type (byte or float). The window is set at the initial desired location and size of the kernel window. The termination criteria has been described elsewhere and consists mainly of a maximum limit on the number of mean-shift movement iterations and a minimal movement for which we consider the window locations to have converged.* The connected component comp contains the converged search window location in comp->rect, and the sum of all pixels under the window is kept in the comp->area field.

Figure 10-12 Mean-shift algorithm in action: an initial window is placed over a two-dimensional array of data points and is successively recentered over the mode (or local peak) of its data distribution until convergence
The function cvMeanShift() is one expression of the mean-shift algorithm for rectangular windows, but it may also be used for tracking. In this case, you first choose the feature distribution to represent an object (e.g., color + texture), then start the mean-shift window over the feature distribution generated by the object, and finally compute the chosen feature distribution over the next video frame. Starting from the current window location, the mean-shift algorithm will find the new peak or mode of the feature distribution, which (presumably) is centered over the object that produced the color and texture in the first place. In this way, the mean-shift window tracks the movement of the object frame by frame.
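A minimal per-frame sketch of that loop (ours, not from the original text), assuming prob_image has already been computed for the current frame, for example as a histogram back-projection of the object's color model:

#include <cv.h>

// Sketch: one mean-shift tracking step. 'track_window' holds the object's
// last known location; the converged rectangle is returned so it can be
// used as the starting window for the next frame.
CvRect mean_shift_step( IplImage* prob_image, CvRect track_window )
{
    CvConnectedComp comp;
    cvMeanShift(
        prob_image,
        track_window,
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 10, 1.0 ),
        &comp
    );
    return comp.rect;   // where the window converged in this frame
}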
Camshift
A related algorithm is the Camshift tracker. It differs from mean-shift in that the search window adjusts itself in size. If you have well-segmented distributions (say, face features that stay compact), then this algorithm will automatically adjust itself for the size of the face as the person moves closer to and further from the camera. The form of the Camshift algorithm is:
int cvCamShift(
    const CvArr*       prob_image,
    CvRect             window,
    CvTermCriteria     criteria,
    CvConnectedComp*   comp,
    CvBox2D*           box = NULL
);

The first four parameters are the same as for the cvMeanShift() algorithm. The box parameter, if present, will contain the newly resized box, which also includes the orientation of the object as computed via second-order moments. For tracking applications, we would use the resulting resized box found on the previous frame as the window in the next frame.

Many people think of mean-shift and camshift as tracking using color features, but this is not entirely correct. Both of these algorithms track the distribution of any kind of feature that is expressed in the prob_image; hence they make for very lightweight, robust, and efficient trackers.
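A per-frame camshift step looks almost identical (again, a sketch of ours rather than code from the original text); the rectangle returned for one frame is fed back in as the search window for the next:

#include <cv.h>

// Sketch: one camshift tracking step. The probability image is assumed to
// be recomputed for each new frame.
CvRect camshift_step( IplImage* prob_image, CvRect track_window, CvBox2D* box )
{
    CvConnectedComp comp;
    cvCamShift(
        prob_image,
        track_window,
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 10, 1.0 ),
        &comp,
        box    // optional: oriented box giving the object's size and rotation
    );
    return comp.rect;
}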
Motion Templates
Motion templates were invented in the MIT Media Lab by Bobick and Davis [Bobick96; Davis97] and were further developed jointly with one of the authors [Davis99; Bradski00]. This more recent work forms the basis for the implementation in OpenCV.

* Again, mean-shift will always converge, but convergence may be very slow near the local peak of a distribution if that distribution is fairly "flat" there.
Motion templates are an effective way to track general movement and are especially applicable to gesture recognition. Using motion templates requires a silhouette (or part of a silhouette) of an object. Object silhouettes can be obtained in a number of ways.

1. The simplest method of obtaining object silhouettes is to use a reasonably stationary camera and then employ frame-to-frame differencing (as discussed in Chapter 9). This will give you the moving edges of objects, which is enough to make motion templates work.
2. You can use chroma keying. For example, if you have a known background color, you can take as foreground anything that is not that color, which lets you isolate new foreground objects/people as silhouettes.
3. You can use active silhouetting techniques—for example, creating a wall of light behind the subject and taking anything that blocks it as a silhouette.
4. You can use the segmentation techniques (e.g., pyramid segmentation or mean-shift segmentation) described in Chapter 9.
For now, assume that we have a good, segmented object silhouette as represented by the white rectangle of Figure 10-13(A). Here we use white to indicate that all the pixels are set to the floating-point value of the most recent system time stamp. As the rectangle moves, new silhouettes are captured and overlaid with the (new) current time stamp; the new silhouette is the white rectangle of Figure 10-13(B) and Figure 10-13(C). Older motions are shown in Figure 10-13 as successively darker rectangles. These sequentially fading silhouettes record the history of previous movement and thus are referred to as the "motion history image".

Figure 10-13 Motion template diagram: (A) a segmented object at the current time stamp (white); (B) at the next time step, the object moves and is marked with the (new) current time stamp, leaving the older segmentation boundary behind; (C) at the next time step, the object moves further, leaving older segmentations as successively darker rectangles whose sequence of encoded motion yields the motion history image
Silhouettes whose time stamp is more than a specified duration older than the current system time stamp are set to 0, as shown in Figure 10-14. The OpenCV function that accomplishes this motion template construction is cvUpdateMotionHistory():

void cvUpdateMotionHistory(
    const CvArr*  silhouette,
    CvArr*        mhi,
    double        timestamp,
    double        duration
);

Figure 10-14 Motion template silhouettes for two moving objects (left); silhouettes older than a specified duration are set to 0 (right)

In cvUpdateMotionHistory(), all image arrays consist of single-channel images. The silhouette image is a byte image in which nonzero pixels represent the most recent segmentation silhouette of the foreground object. The mhi image is a floating-point image that represents the motion template (aka motion history image). Here timestamp is the current system time (typically a millisecond count) and duration, as just described, sets how long motion history pixels are allowed to remain in the mhi. In other words, any mhi pixels that are older (less) than timestamp minus duration are set to 0.

Once the motion template has a collection of object silhouettes overlaid in time, we can derive an indication of overall motion by taking the gradient of the mhi image. When we take these gradients (e.g., by using the Scharr or Sobel gradient functions discussed in Chapter 6), some gradients will be large and invalid. Gradients are invalid when older or inactive parts of the mhi image are set to 0, which produces artificially large gradients around the outer edges of the silhouettes; see Figure 10-15(A). Because we know the time-step duration with which we've been introducing new silhouettes into the mhi via cvUpdateMotionHistory(), we know how large our gradients (which are just dx and dy step derivatives) should be. We can therefore use the gradient magnitude to eliminate gradients that are too large, as in Figure 10-15(B). Finally, we can collect a measure of global motion; see Figure 10-15(C). The function that effects parts (A) and (B) of the figure is cvCalcMotionGradient():

void cvCalcMotionGradient(
    const CvArr*  mhi,
    CvArr*        mask,
    CvArr*        orientation,
    double        delta1,
    double        delta2,
    int           aperture_size = 3
);

Figure 10-15 Motion gradients of the mhi image: (A) gradient magnitudes and directions; (B) large gradients are eliminated; (C) overall direction of motion is found

In cvCalcMotionGradient(), all image arrays are single-channel. The function input mhi is a floating-point motion history image, and the input variables delta1 and delta2 are (respectively) the minimal and maximal gradient magnitudes allowed. Here, the expected gradient magnitude will be just the average number of time-stamp ticks between each silhouette in successive calls to cvUpdateMotionHistory(); setting delta1 halfway below and delta2 halfway above this average value should work well. The variable aperture_size sets the size in width and height of the gradient operator. These values can be set to -1 (the 3-by-3 CV_SCHARR gradient filter), 3 (the default 3-by-3 Sobel filter), 5 (for the 5-by-5 Sobel filter), or 7 (for the 7-by-7 filter). The function outputs are mask, a single-channel 8-bit image in which nonzero entries indicate where valid gradients were found, and orientation, a floating-point image that gives the gradient direction's angle at each point.

The function cvCalcGlobalOrientation() finds the overall direction of motion as the vector sum of the valid gradient directions:

double cvCalcGlobalOrientation(
    const CvArr*  orientation,
    const CvArr*  mask,
    const CvArr*  mhi,
    double        timestamp,
    double        duration
);

When using cvCalcGlobalOrientation(), we pass in the orientation and mask image computed in cvCalcMotionGradient() along with the timestamp, duration, and resulting mhi from cvUpdateMotionHistory(); what's returned is the vector-sum global orientation,
as in Figure 10-15(C). The timestamp together with duration tells the routine how much motion to consider from the mhi and motion orientation images. One could compute the global motion from the center of mass of each of the mhi silhouettes, but summing up the precomputed motion vectors is much faster.

We can also isolate regions of the motion template mhi image and determine the local motion within each region, as shown in Figure 10-16. In the figure, the mhi image is scanned for current silhouette regions. When a region marked with the most current time stamp is found, the region's perimeter is searched for sufficiently recent motion (recent silhouettes) just outside its perimeter. When such motion is found, a downward-stepping flood fill is performed to isolate the local region of motion that "spilled off" the current location of the object of interest. Once found, we can calculate the local motion gradient direction in the spill-off region, then remove that region, and repeat the process until all regions are found (as diagrammed in Figure 10-16).

Figure 10-16 Segmenting local regions of motion in the mhi image: (A) scan the mhi image for current silhouettes (a) and, when found, go around the perimeter looking for other recent silhouettes (b); when a recent silhouette is found, perform downward-stepping flood fills (c) to isolate local motion; (B) use the gradients found within the isolated local motion region to compute local motion; (C) remove the previously found region and search for the next current silhouette region (d), scan along it (e), and perform downward-stepping flood fill on it (f); (D) compute motion within the newly isolated region and continue the process (A)-(C) until no current silhouette remains
Trang 19Th e function that isolates and computes local motion is cvSegmentMotion():
CvSeq* cvSegmentMotion(
const CvArr* mhi, CvArr* seg_mask, CvMemStorage* storage, double timestamp, double seg_thresh );
In cvSegmentMotion(), the mhi is the single-channel floating-point input. We also pass in storage, a CvMemStorage structure allocated via cvCreateMemStorage(). Another input is timestamp, the value of the most current silhouettes in the mhi from which you want to segment local motions. Finally, you must pass in seg_thresh, which is the maximum downward step (from current time to previous motion) that you'll accept as attached motion. This parameter is provided because there might be overlapping silhouettes from recent and much older motion that you don't want to connect together.

It's generally best to set seg_thresh to something like 1.5 times the average difference in silhouette time stamps. This function returns a CvSeq of CvConnectedComp structures, one for each separate motion found, which delineates the local motion regions; it also returns seg_mask, a single-channel, floating-point image in which each region of isolated motion is marked with a distinct nonzero number (a zero pixel in seg_mask indicates no motion). To compute these local motions one at a time we call cvCalcGlobalOrientation(), using the appropriate mask region selected from the appropriate CvConnectedComp or from a particular value in the seg_mask; for example,
cvCmpS(
    seg_mask,
    [value_wanted_in_seg_mask],
    [your_destination_mask],
    CV_CMP_EQ
)
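Having selected a region this way, its local direction comes from cvCalcGlobalOrientation() restricted to that mask. Here is a brief sketch (ours), with the bracketed placeholders above replaced by concrete variables and with mhi, orient, mask, timestamp, and MHI_DURATION as in the motempl.c discussion below:

// Sketch: direction of one isolated motion region selected from seg_mask.
// 'value_wanted_in_seg_mask' is the label of the region of interest.
IplImage* region_mask = cvCreateImage( cvGetSize(mhi), IPL_DEPTH_8U, 1 );
cvCmpS( seg_mask, value_wanted_in_seg_mask, region_mask, CV_CMP_EQ );
cvAnd( region_mask, mask, region_mask, NULL );   // keep only valid gradients
double local_angle = cvCalcGlobalOrientation(
    orient, region_mask, mhi, timestamp, MHI_DURATION
);
cvReleaseImage( &region_mask );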
Given the discussion so far, you should now be able to understand the motempl.c example that ships with OpenCV in the …/opencv/samples/c/ directory. We will now extract and explain some key points from the update_mhi() function in motempl.c. The update_mhi() function extracts templates by thresholding frame differences and then passing the resulting silhouette to cvUpdateMotionHistory():
cvAbsDiff( buf[idx1], buf[idx2], silh );
cvThreshold( silh, silh, diff_threshold, 1, CV_THRESH_BINARY );
cvUpdateMotionHistory( silh, mhi, timestamp, MHI_DURATION );
The gradients of the resulting mhi image are then taken, and a mask of valid gradients is produced using cvCalcMotionGradient(). Then CvMemStorage is allocated (or, if it already exists, it is cleared), and the resulting local motions are segmented into CvConnectedComp structures in the CvSeq containing structure seq:
cvCalcMotionGradient(
    mhi, mask, orient,
    MAX_TIME_DELTA,
    MIN_TIME_DELTA,
    3
);
if( !storage ) storage = cvCreateMemStorage(0);
else cvClearMemStorage(storage);
seq = cvSegmentMotion(
    mhi, segmask, storage, timestamp, MAX_TIME_DELTA
);
A "for" loop then iterates through the seq->total CvConnectedComp structures, extracting bounding rectangles for each motion. The iteration starts at -1, which has been designated as a special case for finding the global motion of the whole image. For the local motion segments, small segmentation areas are first rejected and then the orientation is calculated using cvCalcGlobalOrientation(). Instead of using exact masks, this routine restricts motion calculations to regions of interest (ROIs) that bound the local motions; it then calculates where valid motion within the local ROIs was actually found. Any such motion area that is too small is rejected. Finally, the routine draws the motion. Examples of the output for a person flapping their arms are shown in Figure 10-17, where the output is drawn above the raw image for four sequential frames going across in two rows. (For the full code, see …/opencv/samples/c/motempl.c.) In the same sequence, "Y" postures were recognized by the shape descriptors (Hu moments) discussed in Chapter 8, although the shape recognition is not included in the samples code.
for( i = -1; i < seq->total; i++ ) {
    if( i < 0 ) {            // case of the whole image
        // ... [does the whole image]
    } else {                 // i-th motion component
        comp_rect = ((CvConnectedComp*)cvGetSeqElem( seq, i ))->rect;
        // ... [reject very small components]
    }
    // ... [set component ROI regions]
    angle = cvCalcGlobalOrientation( orient, mask, mhi, timestamp, MHI_DURATION );
    // ... [find regions of valid motion]
    // ... [reset ROI regions]
    // ... [skip small valid motion regions]
    // ... [draw the motions]
}
Figure 10-17 Results of motion template routine: going across and top to bottom, a person moving and the resulting global motions indicated in large octagons and local motions indicated in small octagons; also, the "Y" pose can be recognized via shape descriptors (Hu moments)

Estimators

Suppose we are tracking a person who is walking across the view of a video camera. At each frame we make a determination of the location of this person. This could be done any number of ways, as we have seen, but in each case we find ourselves with an estimate of the position of the person at each frame. This estimation is not likely to be extremely accurate. The reasons for this are many. They may include inaccuracies in the sensor, approximations in earlier processing stages, issues arising from occlusion or shadows, or the apparent changing of shape when a person is walking due to their legs and arms swinging as they move. Whatever the source, we expect that these measurements will vary, perhaps somewhat randomly, about the "actual" values that might be received from an idealized sensor. We can think of all these inaccuracies, taken together, as simply adding noise to our tracking process.

We'd like to have the capability of estimating the motion of this person in a way that makes maximal use of the measurements we've made. Thus, the cumulative effect of our many measurements could allow us to detect the part of the person's observed trajectory that does not arise from noise. The key additional ingredient is a model for the person's motion. For example, we might model the person's motion with the following statement: "A person enters the frame at one side and walks across the frame at constant velocity." Given this model, we can ask not only where the person is but also what parameters of the model are supported by our observations.

This task is divided into two phases (see Figure 10-18). In the first phase, typically called the prediction phase, we use information learned in the past to further refine our model for what the next location of the person (or object) will be. In the second phase, the correction phase, we make a measurement and then reconcile that measurement with the predictions based on our previous measurements (i.e., our model).

Figure 10-18 Two-phase estimator cycle: prediction based on prior data followed by reconciliation of the newest measurement

The machinery for accomplishing the two-phase estimation task falls generally under the heading of estimators, with the Kalman filter [Kalman60] being the most widely used technique. In addition to the Kalman filter, another important method is the condensation algorithm, which is a computer-vision implementation of a broader class of methods known as particle filters. The primary difference between the Kalman filter and the condensation algorithm is how the state probability density is described. We will explore the meaning of this distinction in the following sections.
The Kalman Filter
First introduced in 1960, the Kalman filter has risen to great prominence in a wide variety of signal processing contexts. The basic idea behind the Kalman filter is that, under a strong but reasonable* set of assumptions, it will be possible—given a history of measurements of a system—to build a model for the state of the system that maximizes the a posteriori† probability of those previous measurements. For a good introduction, see Welsh and Bishop [Welsh95]. In addition, we can maximize the a posteriori probability without keeping a long history of the previous measurements themselves. Instead, we iteratively update our model of a system's state and keep only that model for the next iteration. This greatly simplifies the computational implications of this method.

Before we go into the details of what this all means in practice, let's take a moment to look at the assumptions we mentioned. There are three important assumptions required in the theoretical construction of the Kalman filter: (1) the system being modeled is linear, (2) the noise that measurements are subject to is "white", and (3) this noise is also Gaussian in nature. The first assumption means (in effect) that the state of the system at time k can be modeled as some matrix multiplied by the state at time k–1. The additional assumptions that the noise is both white and Gaussian mean that the noise is not correlated in time and that its amplitude can be accurately modeled using only an average and a covariance (i.e., the noise is completely described by its first and second moments). Although these assumptions may seem restrictive, they actually apply to a surprisingly general set of circumstances.‡
What does it mean to "maximize the a posteriori probability of those previous measurements"? It means that the new model we construct after making a measurement—taking into account both our previous model with its uncertainty and the new measurement with its uncertainty—is the model that has the highest probability of being correct. For our purposes, this means that the Kalman filter is, given the three assumptions, the best way to combine data from different sources or from the same source at different times. We start with what we know, we obtain new information, and then we decide to change
* Here by "reasonable" we mean something like "sufficiently unrestrictive that the method is useful for a reasonable variety of actual problems arising in the real world". "Reasonable" just seemed like less of a mouthful.

† The modifier "a posteriori" is academic jargon for "with hindsight". Thus, when we say that such and such a distribution "maximizes the a posteriori probability", what we mean is that that distribution, which is essentially a possible explanation of "what really happened", is actually the most likely one given the data we have observed; you know, looking back on it all in retrospect.

‡ OK, one more footnote. We actually slipped in another assumption here, which is that the initial distribution also must be Gaussian in nature. Often in practice the initial state is known exactly, or at least we treat it like it is, and so this satisfies our requirement. If the initial state were (for example) a 50-50 chance of being either in the bedroom or the bathroom, then we'd be out of luck and would need something more sophisticated than a single Kalman filter.
what we know based on how certain we are about the old and new information, using a weighted combination of the old and the new.

Let's work all this out with a little math for the case of one-dimensional motion. You can skip the next section if you want, but linear systems and Gaussians are so friendly that Dr. Kalman might be upset if you didn't at least give it a try.

Some Kalman math
So what's the gist of the Kalman filter?—information fusion. Suppose you want to know where some point is on a line (our one-dimensional scenario).* As a result of noise, you have two unreliable (in a Gaussian sense) reports about where the object is: locations x1 and x2. Because there is Gaussian uncertainty in these measurements, they have means of x̄1 and x̄2 together with standard deviations σ1 and σ2. The standard deviations are, in fact, expressions of our uncertainty regarding how good our measurements are. The probability distribution as a function of location is the Gaussian distribution:

    pi(x) = ( 1 / (σi √(2π)) ) · exp( −(x − x̄i)² / (2σi²) ),    i = 1, 2

Given two such measurements, each with a Gaussian probability distribution, we would
expect that the probability density for some value of x given both measurements would be proportional to p(x) = p1(x) p2(x). It turns out that this product is another Gaussian distribution, and we can compute the mean and standard deviation of this new distribution as follows. Given that

    p12(x) ∝ exp( −(x − x̄1)²/(2σ1²) − (x − x̄2)²/(2σ2²) ),

and given that a Gaussian is maximal at its mean, we can find that average value simply by computing the derivative of p(x) with respect to x. Where a function is maximal its derivative is 0, so

    dp12/dx = −[ (x − x̄1)/σ1² + (x − x̄2)/σ2² ] · p12(x) = 0    at x = x̄12

Since the probability distribution function p(x) is never 0, it follows that the term in brackets must be 0. Solving that equation for x gives us this very important relation:

    x̄12 = ( σ2²/(σ1² + σ2²) ) · x̄1 + ( σ1²/(σ1² + σ2²) ) · x̄2
J De Geeter, T Lefebvre, and H Bruyninckx, “Kalman Filters: A Tutorial” (http://citeseer.ist.psu.edu/
443226.html).
Trang 25Th us, the new mean value x– 12 is just a weighted combination of the two measured means,
where the weighting is determined by the relative uncertainties of the two
measure-ments Observe, for example, that if the uncertainty σ2 of the second measurement is
particularly large, then the new mean will be essentially the same as the mean x1 for the
more certain previous measurement
With the new mean x– 12 in hand, we can substitute this value into our expression for
p12(x) and, aft er substantial rearranging,* identify the uncertainty σ12
2 as:
1 2 2 2
=
At this point, you are probably wondering what this tells us. Actually, it tells us a lot. It says that when we make a new measurement with a new mean and uncertainty, we can combine that measurement with the mean and uncertainty we already have to obtain a new state that is characterized by a still newer mean and uncertainty. (We also now have numerical expressions for these things, which will come in handy momentarily.)
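To see the effect with concrete numbers (ours, purely for illustration): suppose the first measurement gives x̄1 = 2.0 with σ1 = 2.0, and the second gives x̄2 = 8.0 with σ2 = 1.0. Then

    x̄12 = (1/(4+1))·2.0 + (4/(4+1))·8.0 = 0.4 + 6.4 = 6.8
    σ12² = (4·1)/(4+1) = 0.8

so the fused estimate lies much closer to the more certain second measurement, and its variance (0.8) is smaller than that of either measurement alone.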
This property that two Gaussian measurements, when combined, are equivalent to a single Gaussian measurement (with a computable mean and uncertainty) will be the most important feature for us. It means that when we have M measurements, we can combine the first two, then the third with the combination of the first two, then the fourth with the combination of the first three, and so on. This is what happens with tracking in computer vision; we obtain one measure followed by another followed by another.
our estimation ( ˆ , ˆ )x i σ as follows At time step 1, we have only our fi rst measure ˆx x i 1= 1
and its uncertainty ˆσ12 σ
1 2
= Substituting this in our optimal estimation equations yields
2 1
1 2
1 2 2
= we have:
* Th e rearranging is a bit messy If you want to verify all this, it is much easier to (1) start with the equation
for the Gaussian distribution p12(x) in terms of x– 12 and σ 12, (2) substitute in the equations that relate x– 12 to x– 1
and x– 2 and those that relate σ 12 to σ 1 and σ 2 , and (3) verify that the result can be separated into the product
of the Gaussians with which we started.
Trang 26ˆ ˆˆ
=+
A rearrangement similar to what we did for ˆx2 yields an iterative equation for estimating
variance given a new measurement:
2 1 21
In their current form, these equations allow us to separate clearly the “old” information
(what we knew before a new measurement was made) from the “new” information (what
our latest measurement told us) Th e new information (x2−xˆ )1 , seen at time step 2, is
called the innovation We can also see that our optimal iterative update factor is now:
K=+
ˆˆ
σ
1 2
1 2 2 2
Th is factor is known as the update gain Using this defi nition for K, we obtain the
fol-lowing convenient recursion form:
= − K
In the Kalman filter literature, if the discussion is about a general series of measurements, then our second time step "2" is usually denoted k and the first time step is thus k – 1.
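Restating the recursion above in that notation (this is just the time-step-2 result rewritten with general indices), each new measurement x̄_k with variance σ_k² updates the running estimate (x̂_{k−1}, σ̂_{k−1}²) as:

    K_k   = σ̂_{k−1}² / ( σ̂_{k−1}² + σ_k² )
    x̂_k   = x̂_{k−1} + K_k ( x̄_k − x̂_{k−1} )
    σ̂_k²  = ( 1 − K_k ) σ̂_{k−1}²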
Systems with dynamics
In our simple one-dimensional example, we considered the case of an object being located at some point x, and a series of successive measurements of that point. In that case we did not specifically consider the case in which the object might actually be moving in between measurements. In this new case we will have what is called the prediction phase. During the prediction phase, we use what we know to figure out where we expect the system to be before we attempt to integrate a new measurement.
but before the new measurement is incorporated into our estimation of the state of the
system An example of this might be when we measure the position of a car at time t,
then again at time t + dt If the car has some velocity v, then we do not just incorporate
the second measurement directly We fi rst fast-forward our model based on what we
knew at time t so that we have a model not only of the system at time t but also of the
system at time t + dt, the instant before the new information is incorporated In this
way, the new information, acquired at time t + dt, is fused not with the old model of the
Trang 27system, but with the old model of the system projected forward to time t + dt Th is is the
meaning of the cycle depicted in Figure 10-18 In the context of Kalman fi lters, there are
three kinds of motion that we would like to consider
Th e fi rst is dynamical motion Th is is motion that we expect as a direct result of the state
of the system when last we measured it If we measured the system to be at position x
with some velocity v at time t, then at time t + dt we would expect the system to be
lo-cated at position x + v ∗ dt, possibly still with velocity.
Th e second form of motion is called control motion Control motion is motion that we
expect because of some external infl uence applied to the system of which, for whatever
reason, we happen to be aware As the name implies, the most common example of
control motion is when we are estimating the state of a system that we ourselves have
some control over, and we know what we did to bring about the motion Th is is
par-ticularly the case for robotic systems where the control is the system telling the robot
to (for example) accelerate or go forward Clearly, in this case, if the robot was at x and
moving with velocity v at time t, then at time t + dt we expect it to have moved not only
to x + v ∗ dt (as it would have done without the control), but also a little farther, since
we did tell it to accelerate
Th e fi nal important class of motion is random motion Even in our simple
one-dimensional example, if whatever we were looking at had a possibility of moving on its
own for whatever reason, we would want to include random motion in our prediction
step Th e eff ect of such random motion will be to simply increase the variance of our
state estimate with the passage of time Random motion includes any motions that are
not known or under our control As with everything else in the Kalman fi lter
frame-work, however, there is an assumption that this random motion is either Gaussian (i.e.,
a kind of random walk) or that it can at least be modeled eff ectively as Gaussian
Th us, to include dynamics in our simulation model, we would fi rst do an “update” step
before including a new measurement Th is update step would include fi rst applying any
knowledge we have about the motion of the object according to its prior state, applying
any additional information resulting from actions that we ourselves have taken or that
we know to have been taken on the system from another outside agent, and, fi nally,
incorporating our notion of random events that might have changed the state of the
system since we last measured it Once those factors have been applied, we can then
in-corporate our next new measurement
In practice, the dynamical motion is particularly important when the “state” of the
sys-tem is more complex than our simulation model Oft en when an object is moving, there
are multiple components to the “state” such as the position as well as the velocity In
this case, of course, the state evolves according to the velocity that we believe it to have
Handling systems with multiple components to the state is the topic of the next section
We will develop a little more sophisticated notation as well to handle these new aspects
of the situation
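To make the dynamical part concrete (our illustration, not an equation from the original text): for a state containing both position and velocity, x = [x, v]^T, a constant-velocity model propagates the state between measurements as

    x_k = F x_{k−1} + B u_k + w_k,    with    F = | 1  dt |
                                                  | 0   1 |

so the predicted position becomes x + v·dt while the velocity carries over; the B u_k term accounts for any control we applied, and w_k is the (Gaussian) random motion that inflates our uncertainty over time.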
Trang 28Figure 10-19 Combining our prior knowledge N(x k–1 , σ k–1 ) with our measurement observation
N(z k , σ k ); the result is our new estimate N x ( ˆ , ˆ ) kσk
Kalman equations
We can now generalize these motion equations in our toy model. Our more general discussion will allow us to factor in any model that is a linear function F of the object's state. Such a model might consider combinations of the first and second derivatives of the previous motion, for example. We'll also see how to allow for a control input u_k to our model. Finally, we will allow for a more realistic observation model z in which we might measure only some of the model's state variables and in which the measurements may be only indirectly related to the state variables.*
To get started, let's look at how K, the gain in the previous section, affects the estimates. If the uncertainty of the new measurement is very large, then the new measurement essentially contributes nothing and our equations reduce to the combined result being the same as what we already knew at time k – 1. Conversely, if we start out with a large variance in the original measurement and then make a new, more accurate measurement, then we will "believe" mostly the new measurement. When both measurements are of equal certainty (variance), the new expected value is exactly between them. All of these remarks are in line with our reasonable expectations.

Figure 10-19 shows how our uncertainty evolves over time as we gather new observations.

This idea of an update that is sensitive to uncertainty can be generalized to many state variables. The simplest example of this might be in the context of video tracking, where objects can move in two or three dimensions. In general, the state might contain

* Observe the change in notation from x_k to z_k. The latter is standard in the literature and is intended to clarify that z_k is a general measurement, possibly of multiple parameters of the model, and not just (and sometimes not even) the position x_k.