We now have an overconstrained system that we can solve, provided the 5-by-5 window contains more than just an edge. To solve this system, we set up a least-squares minimization of the equation, whereby min ‖Ad − b‖² is solved in standard form as:

    (A^T A) d = A^T b,    so that    d = (A^T A)^(-1) A^T b

Figure 10-8 Aperture problem: through the aperture window (upper row) we see an edge moving to the right but cannot detect the downward part of the motion (lower row)
When can this be solved?—when (A^T A) is invertible. And (A^T A) is invertible when it has full rank (2), which occurs when it has two large eigenvalues. This will happen in image regions that include texture running in at least two directions. In this case, (A^T A) will have the best properties when the tracking window is centered over a corner region in an image. This ties us back to our earlier discussion of the Harris corner detector. In fact, those corners were "good features to track" (see our previous remarks concerning cvGoodFeaturesToTrack()) for precisely the reason that (A^T A) had two large eigenvalues there! We'll see shortly how all this computation is done for us by the cvCalcOpticalFlowLK() function.
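To make the structure of that computation concrete, here is a rough illustrative sketch (ours, not OpenCV's implementation): the 2-by-2 matrix A^T A and the vector A^T b are just sums of derivative products over the window, and (u, v) follows by inverting the 2-by-2 matrix whenever it is well conditioned.

#include <math.h>

// Illustrative sketch (not OpenCV's code): solve for the motion (u,v) of one
// pixel from spatial derivatives Ix, Iy and temporal derivative It summed
// over a 5-by-5 window. Returns 0 if A'A is (nearly) singular, i.e., the
// window does not contain corner-like texture.
int lk_window_flow( const float Ix[25], const float Iy[25], const float It[25],
                    float* u, float* v )
{
    double sxx = 0, sxy = 0, syy = 0, sxt = 0, syt = 0;
    for( int i = 0; i < 25; i++ ) {
        sxx += Ix[i]*Ix[i];  sxy += Ix[i]*Iy[i];  syy += Iy[i]*Iy[i];
        sxt += Ix[i]*It[i];  syt += Iy[i]*It[i];
    }
    // A'A = [sxx sxy; sxy syy],  A'b = -[sxt; syt]
    double det = sxx*syy - sxy*sxy;
    if( fabs(det) < 1e-9 ) return 0;     // not invertible: aperture problem
    *u = (float)(( -syy*sxt + sxy*syt ) / det);
    *v = (float)((  sxy*sxt - sxx*syt ) / det);
    return 1;
}

When det(A^T A) is near zero (the aperture-problem case), no unique solution exists, which is exactly why corner-like windows are the ones worth tracking.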
The reader who understands the implications of our assuming small and coherent motions will now be bothered by the fact that, for most video cameras running at 30 Hz, large and noncoherent motions are commonplace. In fact, Lucas-Kanade optical flow by itself does not work very well for exactly this reason: we want a large window to catch large motions, but a large window too often breaks the coherent motion assumption! To circumvent this problem, we can track first over larger spatial scales using an image pyramid and then refine the initial motion velocity assumptions by working our way down the levels of the image pyramid until we arrive at the raw image pixels.

Hence, the recommended technique is first to solve for optical flow at the top layer and then to use the resulting motion estimates as the starting point for the next layer down. We continue going down the pyramid in this manner until we reach the lowest level. Thus we minimize the violations of our motion assumptions and so can track faster and longer motions. This more elaborate function is known as pyramid Lucas-Kanade optical flow and is illustrated in Figure 10-9. The OpenCV function that implements pyramid Lucas-Kanade optical flow is cvCalcOpticalFlowPyrLK(), which we examine next.
The result arrays for the simpler, non-pyramidal cvCalcOpticalFlowLK() routine are populated only by those pixels for which it is able to compute the minimum error. For the pixels for which this error (and thus the displacement) cannot be reliably computed, the associated velocity will be set to 0. In most cases, you will not want to use that routine; the pyramid-based method described next is better for most situations most of the time.
Pyramid Lucas-Kanade code
We come now to OpenCV's algorithm that computes Lucas-Kanade optical flow in a pyramid, cvCalcOpticalFlowPyrLK(). As we will see, this optical flow function makes use of "good features to track" and also returns indications of how well the tracking of each point is proceeding.

void cvCalcOpticalFlowPyrLK(
    const CvArr*     imgA,
    const CvArr*     imgB,
    CvArr*           pyrA,
    CvArr*           pyrB,
    CvPoint2D32f*    featuresA,
    CvPoint2D32f*    featuresB,
    int              count,
    CvSize           winSize,
    int              level,
    char*            status,
    float*           track_error,
    CvTermCriteria   criteria,
    int              flags
);

This function has a lot of inputs, so let's take a moment to figure out what they all do. Once we have a handle on this routine, we can move on to the problem of which points to track and how to compute them.
Figure 10-9 Pyramid Lucas-Kanade optical flow: running optical flow at the top of the pyramid first mitigates the problems caused by violating our assumptions of small and coherent motion; the motion estimate from the preceding level is taken as the starting point for estimating motion at the next layer down

The first two arguments of cvCalcOpticalFlowPyrLK() are the initial and final images; both should be single-channel, 8-bit images. The next two arguments are buffers allocated to store the pyramid images. The size of these buffers should be at least (img.width + 8)*img.height/3 bytes,* with one such buffer for each of the two input images (pyrA
and pyrB). (If these two pointers are set to NULL then the routine will allocate, use, and free the appropriate memory when called, but this is not so good for performance.) The array featuresA contains the points for which the motion is to be found, and featuresB is a similar array into which the computed new locations of the points from featuresA are to be placed; count is the number of points in the featuresA list. The window used for computing the local coherent motion is given by winSize. Because we are constructing an image pyramid, the argument level is used to set the depth of the stack of images. If level is set to 0 then the pyramids are not used. The array status is of length count; on completion of the routine, each entry in status will be either 1 (if the corresponding point was found in the second image) or 0 (if it was not). The track_error parameter is optional and can be turned off by setting it to NULL. If track_error is active then it is an array of numbers, one for each tracked point, equal to the difference between the patch around a tracked point in the first image and the patch around the location to which that point was tracked in the second image. You can use track_error to prune away points whose local appearance patch changes too much as the points move.

The next thing we need is the termination criteria. This is a structure used by many OpenCV algorithms that iterate to a solution:

cvTermCriteria(
    int    type,      // CV_TERMCRIT_ITER, CV_TERMCRIT_EPS, or both
    int    max_iter,
    double epsilon
);
Typically we use the cvTermCriteria() function to generate the structure we need. The first argument of this function is either CV_TERMCRIT_ITER or CV_TERMCRIT_EPS, which tells the algorithm that we want to terminate either after some number of iterations or when the convergence metric reaches some small value (respectively). The next two arguments set the values at which one, the other, or both of these criteria should terminate the algorithm. The reason we have both options is so we can set the type to CV_TERMCRIT_ITER | CV_TERMCRIT_EPS and thus stop when either limit is reached (this is what is done in most real code).
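For example, a criteria structure that stops after 20 iterations or when the estimate moves by less than 0.03 (these particular numbers are only an illustration) would be built as:

CvTermCriteria criteria = cvTermCriteria(
    CV_TERMCRIT_ITER | CV_TERMCRIT_EPS,   // stop on whichever limit is hit first
    20,                                   // max_iter
    0.03                                  // epsilon
);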
Finally, flags allows for some fine control of the routine's internal bookkeeping; it may be set to any or all (using bitwise OR) of the following:

CV_LKFLOW_PYR_A_READY
The image pyramid for the first frame was computed before the call and is stored in pyrA.

CV_LKFLOW_PYR_B_READY
The image pyramid for the second frame was computed before the call and is stored in pyrB.

CV_LKFLOW_INITIAL_GUESSES
The array featuresB already contains an initial guess for the feature's coordinates when the routine is called.

* If you are wondering why the funny size, it's because these scratch spaces need to accommodate not just the image itself but the entire pyramid.
These flags are particularly useful when handling sequential video. The image pyramids are somewhat costly to compute, so recomputing them should be avoided whenever possible. The final frame for the frame pair you just computed will be the initial frame for the pair that you will compute next. If you allocated those buffers yourself (instead of asking the routine to do it for you), then the pyramids for each image will be sitting in those buffers when the routine returns. If you tell the routine that this information is already computed then it will not be recomputed. Similarly, if you computed the motion of points from the previous frame then you are in a good position to make good initial guesses for where they will be in the next frame.
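As a minimal sketch of that idea (this is our illustration, not code from the book; the buffers and point arrays are assumed to have been allocated as in Example 10-1 below), a per-frame tracking step might look like this:

#include <cv.h>
#include <string.h>

// Sketch: track points from the previous frame of a sequence into the
// current one, reusing the pyramid already built for the previous frame
// and seeding the search with the previous point locations.
void track_next_frame(
    IplImage* prev, IplImage* curr,           // 8-bit, single-channel frames
    IplImage* pyr_prev, IplImage* pyr_curr,   // caller-allocated pyramid buffers
    CvPoint2D32f* pts_prev, CvPoint2D32f* pts_curr,
    int count, char* status, int prev_pyramid_ready )
{
    int flags = CV_LKFLOW_INITIAL_GUESSES;
    if( prev_pyramid_ready ) flags |= CV_LKFLOW_PYR_A_READY;

    // Seed pts_curr with the previous locations as initial guesses.
    memcpy( pts_curr, pts_prev, count * sizeof(CvPoint2D32f) );

    cvCalcOpticalFlowPyrLK(
        prev, curr, pyr_prev, pyr_curr,
        pts_prev, pts_curr, count,
        cvSize( 21, 21 ),    // search window
        3,                   // pyramid depth
        status, NULL,
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, 0.03 ),
        flags
    );
}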
So the basic plan is simple: you supply the images, list the points you want to track in featuresA, and call the routine. When the routine returns, you check the status array to see which points were successfully tracked and then check featuresB to find the new locations of those points.

This leads us back to that issue we put aside earlier: how to decide which features are good ones to track. Earlier we encountered the OpenCV routine cvGoodFeaturesToTrack(), which uses the method originally proposed by Shi and Tomasi to solve this problem in a reliable way. In most cases, good results are obtained by using the combination of cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(). Of course, you can also use your own criteria to determine which points to track.

Let's now look at a simple example (Example 10-1) that uses both cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(); see also Figure 10-10.
Example 10-1 Pyramid Lucas-Kanade optical flow code

// Pyramid L-K optical flow example
// (portions of the listing lost in extraction are filled in here in
// abbreviated, not verbatim, form)
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>

const int MAX_CORNERS = 500;

int main(int argc, char** argv) {
  // Initialize, load two images from the file system, and
  // allocate the images and other structures we will need for results.
  IplImage* imgA = cvLoadImage( "image0.jpg", CV_LOAD_IMAGE_GRAYSCALE );
  IplImage* imgB = cvLoadImage( "image1.jpg", CV_LOAD_IMAGE_GRAYSCALE );
  CvSize img_sz  = cvGetSize( imgA );
  int win_size   = 10;

  // Image on which the flow vectors will be drawn (the file name here is
  // a placeholder; it was cut off in the original listing).
  IplImage* imgC = cvLoadImage( "image1.jpg", CV_LOAD_IMAGE_UNCHANGED );

  // The first thing we need to do is get the features we want to track.
  IplImage* eig_image = cvCreateImage( img_sz, IPL_DEPTH_32F, 1 );
  IplImage* tmp_image = cvCreateImage( img_sz, IPL_DEPTH_32F, 1 );
  int corner_count = MAX_CORNERS;
  CvPoint2D32f* cornersA = new CvPoint2D32f[ MAX_CORNERS ];
  cvGoodFeaturesToTrack(
    imgA, eig_image, tmp_image, cornersA, &corner_count,
    0.01,     // quality level
    5.0,      // minimum distance between features
    0, 3, 0, 0.04
  );

  // Call the pyramid Lucas-Kanade algorithm.
  char  features_found[ MAX_CORNERS ];
  float feature_errors[ MAX_CORNERS ];
  CvSize pyr_sz = cvSize( imgA->width+8, imgB->height/3 );
  IplImage* pyrA = cvCreateImage( pyr_sz, IPL_DEPTH_32F, 1 );
  IplImage* pyrB = cvCreateImage( pyr_sz, IPL_DEPTH_32F, 1 );
  CvPoint2D32f* cornersB = new CvPoint2D32f[ MAX_CORNERS ];
  cvCalcOpticalFlowPyrLK(
    imgA, imgB, pyrA, pyrB,
    cornersA, cornersB, corner_count,
    cvSize( win_size, win_size ),
    5,                                // pyramid depth
    features_found, feature_errors,
    cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, 0.3 ),
    0
  );

  // Draw a line from each point's old location to its new one, skipping
  // points that were lost or whose error was too large.
  for( int i=0; i<corner_count; i++ ) {
    if( features_found[i]==0 || feature_errors[i]>550 ) {
      continue;
    }
    CvPoint p0 = cvPoint( cvRound( cornersA[i].x ), cvRound( cornersA[i].y ) );
    CvPoint p1 = cvPoint( cvRound( cornersB[i].x ), cvRound( cornersB[i].y ) );
    cvLine( imgC, p0, p1, CV_RGB(255,0,0), 2 );
  }
  // ... display imgA, imgB, and imgC (elided in the original listing) ...
  return 0;
}
Dense Tracking Techniques
OpenCV contains two other optical flow techniques that are now seldom used. These routines are typically much slower than Lucas-Kanade; moreover, they (could, but) do not support matching within an image scale pyramid and so cannot track large motions. We will discuss them briefly in this section.
Horn-Schunck method

The method of Horn and Schunck was developed in 1981 [Horn81]. This technique was one of the first to make use of the brightness constancy assumption and to derive the basic brightness constancy equations. The solution of these equations devised by Horn and Schunck was by hypothesizing a smoothness constraint on the velocities vx and vy. This constraint was derived by minimizing the regularized Laplacian of the optical flow velocity components, where the constant weighting coefficient α on the regularization term is known as the regularization constant. Larger values of α lead to smoother (i.e., more locally consistent) vectors of motion flow. This is a relatively simple constraint for enforcing smoothness, and its effect is to penalize regions in which the flow is changing in magnitude. As with Lucas-Kanade, the Horn-Schunck technique relies on iterations to solve the differential equations. The function that computes this is:
void cvCalcOpticalFlowHS(
    const CvArr*     imgA,
    const CvArr*     imgB,
    int              usePrevious,
    CvArr*           velx,
    CvArr*           vely,
    double           lambda,
    CvTermCriteria   criteria
);

Figure 10-10 Sparse optical flow from pyramid Lucas-Kanade: the center image is one video frame after the left image; the right image illustrates the computed motion of the "good features to track" (lower right shows flow vectors against a dark background for increased visibility)

Here imgA and imgB must be 8-bit, single-channel images. The x and y velocity results will be stored in velx and vely, which must be 32-bit, floating-point, single-channel images. The usePrevious parameter tells the algorithm to use the velx and vely velocities computed from a previous frame as the initial starting point for computing the new velocities. The parameter lambda is a weight related to the Lagrange multiplier. You are probably asking yourself: "What Lagrange multiplier?"* The Lagrange multiplier arises when we attempt to minimize (simultaneously) both the motion-brightness equation and the smoothness equations; it represents the relative weight given to the errors in each as we minimize.
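As a minimal usage sketch (ours, not from the original text; the lambda value and termination criteria are illustrative only), computing a dense Horn-Schunck flow field between two frames might look like this:

#include <cv.h>

// Sketch: dense Horn-Schunck flow between two 8-bit grayscale frames.
void dense_flow_hs( IplImage* prev, IplImage* curr,
                    IplImage** velx_out, IplImage** vely_out )
{
    CvSize sz = cvGetSize( prev );
    IplImage* velx = cvCreateImage( sz, IPL_DEPTH_32F, 1 );
    IplImage* vely = cvCreateImage( sz, IPL_DEPTH_32F, 1 );

    cvCalcOpticalFlowHS(
        prev, curr,
        0,             // usePrevious: no prior velx/vely to start from
        velx, vely,
        0.001,         // lambda (smoothness weight)
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 64, 0.01 )
    );
    *velx_out = velx;
    *vely_out = vely;
}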
Block matching method
You might be thinking: “What’s the big deal with optical fl ow? Just match where pixels
in one frame went to in the next frame.” Th is is exactly what others have done Th e term
“block matching” is a catchall for a whole class of similar algorithms in which the
im-age is divided into small regions called blocks [Huang95; Beauchemin95] Blocks are
typically square and contain some number of pixels Th ese blocks may overlap and, in
practice, oft en do Block-matching algorithms attempt to divide both the previous and
current images into such blocks and then compute the motion of these blocks
Algo-rithms of this kind play an important role in many video compression algoAlgo-rithms as
well as in optical fl ow for computer vision
Because block-matching algorithms operate on aggregates of pixels, not on individual
pixels, the returned “velocity images” are typically of lower resolution than the input
images Th is is not always the case; it depends on the severity of the overlap between the
blocks Th e size of the result images is given by the following formula:
The implementation in OpenCV uses a spiral search that works out from the location of the original block (in the previous frame) and compares the candidate new blocks with the original. This comparison is a sum of absolute differences of the pixels (i.e., an L1 distance). If a good enough match is found, the search is terminated. Here's the function prototype:

void cvCalcOpticalFlowBM(
    const CvArr*  prev,
    const CvArr*  curr,
    CvSize        block_size,
    CvSize        shift_size,
    CvSize        max_range,
    int           use_previous,
    CvArr*        velx,
    CvArr*        vely
);

* You might even be asking yourself: "What is a Lagrange multiplier?" In that case, it may be best to ignore this part of the paragraph and just set lambda equal to 1.

The arguments are straightforward. The prev and curr parameters are the previous and current images; both should be 8-bit, single-channel images. The block_size is the size of the block to be used, and shift_size is the step size between blocks (this parameter controls whether—and, if so, by how much—the blocks will overlap). The max_range parameter is the size of the region around a given block that will be searched for a corresponding block in the subsequent frame. If set, use_previous indicates that the values in velx and vely should be taken as starting points for the block searches.* Finally, velx and vely are themselves 32-bit single-channel images that will store the computed motions of the blocks. As mentioned previously, motion is computed at a block-by-block level and so the coordinates of the result images are for the blocks (i.e., aggregates of pixels), not for the individual pixels of the original image.
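A minimal usage sketch (ours, not from the original text; the block, shift, and range sizes are illustrative, and the result-image size follows the formula given earlier):

#include <cv.h>

// Sketch: block-matching flow between two 8-bit grayscale frames.
void dense_flow_bm( IplImage* prev, IplImage* curr )
{
    CvSize block = cvSize( 16, 16 );
    CvSize shift = cvSize( 8, 8 );     // blocks overlap by half a block
    CvSize range = cvSize( 16, 16 );   // how far to search around each block

    // One flow value per block position, so the result images are smaller
    // than the input images (integer division truncates, i.e., floor).
    CvSize res_sz = cvSize(
        ( prev->width  - block.width  ) / shift.width,
        ( prev->height - block.height ) / shift.height
    );
    IplImage* velx = cvCreateImage( res_sz, IPL_DEPTH_32F, 1 );
    IplImage* vely = cvCreateImage( res_sz, IPL_DEPTH_32F, 1 );

    cvCalcOpticalFlowBM( prev, curr, block, shift, range, 0, velx, vely );

    // ... use velx and vely, then release them ...
    cvReleaseImage( &velx );
    cvReleaseImage( &vely );
}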
Mean-Shift and Camshift Tracking
In this section we will look at two techniques, mean-shift and camshift (where "camshift" stands for "continuously adaptive mean-shift"). The former is a general technique for data analysis (discussed in Chapter 9 in the context of segmentation) in many applications, of which computer vision is only one. After introducing the general theory of mean-shift, we'll describe how OpenCV allows you to apply it to tracking in images. The latter technique, camshift, builds on mean-shift to allow for the tracking of objects whose size may change during a video sequence.
Mean-Shift
The mean-shift algorithm† is a robust method of finding local extrema in the density distribution of a data set. This is an easy process for continuous distributions; in that context, it is essentially just hill climbing applied to a density histogram of the data.‡ For discrete data sets, however, this is a somewhat less trivial problem.

* If use_previous==0, then the search for a block will be conducted over a region of max_range distance from the location of the original block. If use_previous!=0, then the center of that search is first displaced by Δx = velx(x, y) and Δy = vely(x, y).

† Because mean-shift is a fairly deep topic, our discussion here is aimed mainly at developing intuition for the user. For the original formal derivation, see Fukunaga [Fukunaga90] and Comaniciu and Meer [Comaniciu99].

‡ The word "essentially" is used because there is also a scale-dependent aspect of mean-shift. To be exact: mean-shift is equivalent in a continuous distribution to first convolving with the mean-shift kernel and then applying a hill-climbing algorithm.
The descriptor "robust" is used here in its formal statistical sense; that is, mean-shift ignores outliers in the data. This means that it ignores data points that are far away from peaks in the data. It does so by processing only those points within a local window of the data and then moving that window.
The mean-shift algorithm runs as follows.

1. Choose a search window:
   • its initial location;
   • its type (uniform, polynomial, exponential, or Gaussian);
   • its shape (symmetric or skewed, possibly rotated, rounded or rectangular);
   • its size (extent at which it rolls off or is cut off).
2. Compute the window's (possibly weighted) center of mass.
3. Center the window at the center of mass.
4. Return to step 2 until the window stops moving (it always will).*
To give a little more formal sense of what the mean-shift algorithm is: it is related to the discipline of kernel density estimation, where by "kernel" we refer to a function that has mostly local focus (e.g., a Gaussian distribution). With enough appropriately weighted and sized kernels located at enough points, one can express a distribution of data entirely in terms of those kernels. Mean-shift diverges from kernel density estimation in that it seeks only to estimate the gradient (direction of change) of the data distribution. When this change is 0, we are at a stable (though perhaps local) peak of the distribution. There might be other peaks nearby or at other scales.
Figure 10-11 shows the equations involved in the mean-shift algorithm. These equations can be simplified by considering a rectangular kernel,† which reduces the mean-shift vector equation to calculating the center of mass of the image pixel distribution:

    x_c = M10 / M00 ,    y_c = M01 / M00

Here the zeroth moment is calculated as:

    M00 = Σx Σy I(x, y)

and the first moments are:

    M10 = Σx Σy x·I(x, y)    and    M01 = Σx Σy y·I(x, y)
* Iterations are typically restricted to some maximum number or to some epsilon change in center shift between iterations; however, they are guaranteed to converge eventually.

† A rectangular kernel is a kernel with no falloff with distance from the center, until a single sharp transition to zero value. This is in contrast to the exponential falloff of a Gaussian kernel and the falloff with the square of distance from the center in the commonly used Epanechnikov kernel.
The mean-shift vector in this case tells us to recenter the mean-shift window over the calculated center of mass within that window. This movement will, of course, change what is "under" the window and so we iterate this recentering process. Such recentering will always converge to a mean-shift vector of 0 (i.e., where no more centering movement is possible). The location of convergence is at a local maximum (peak) of the distribution under the window. Different window sizes will find different peaks because "peak" is fundamentally a scale-sensitive construct.

In Figure 10-12 we see an example of a two-dimensional distribution of data and an initial (in this case, rectangular) window. The arrows indicate the process of convergence on a local mode (peak) in the distribution. Observe that, as promised, this peak finder is statistically robust in the sense that points outside the mean-shift window do not affect convergence—the algorithm is not "distracted" by far-away points.
Figure 10-11 Mean-shift equations and their meaning

In 1998, it was realized that this mode-finding algorithm could be used to track moving objects in video [Bradski98a; Bradski98b], and the algorithm has since been greatly extended [Comaniciu03]. The OpenCV function that performs mean-shift is implemented in the context of image analysis. This means in particular that, rather than taking some
arbitrary set of data points (possibly in some arbitrary number of dimensions), the OpenCV implementation of mean-shift expects as input an image representing the density distribution being analyzed. You could think of this image as a two-dimensional histogram measuring the density of points in some two-dimensional space. It turns out that, for vision, this is precisely what you want to do most of the time: it's how you can track the motion of a cluster of interesting features.

int cvMeanShift(
    const CvArr*       prob_image,
    CvRect             window,
    CvTermCriteria     criteria,
    CvConnectedComp*   comp
);
In cvMeanShift(), the prob_image, which represents the density of probable locations, may be only one channel but of either type (byte or float). The window is set at the initial desired location and size of the kernel window. The termination criteria has been described elsewhere and consists mainly of a maximum limit on the number of mean-shift movement iterations and a minimal movement for which we consider the window locations to have converged.* The connected component comp contains the converged search window location in comp->rect, and the sum of all pixels under the window is kept in the comp->area field.

Figure 10-12 Mean-shift algorithm in action: an initial window is placed over a two-dimensional array of data points and is successively recentered over the mode (or local peak) of its data distribution until convergence
The function cvMeanShift() is one expression of the mean-shift algorithm for rectangular windows, but it may also be used for tracking. In this case, you first choose the feature distribution to represent an object (e.g., color + texture), then start the mean-shift window over the feature distribution generated by the object, and finally compute the chosen feature distribution over the next video frame. Starting from the current window location, the mean-shift algorithm will find the new peak or mode of the feature distribution, which (presumably) is centered over the object that produced the color and texture in the first place. In this way, the mean-shift window tracks the movement of the object frame by frame.
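A minimal per-frame sketch of that loop (ours, not from the original text), assuming prob_image has already been computed for the current frame, for example as a histogram back-projection of the object's color model:

#include <cv.h>

// Sketch: one mean-shift tracking step. 'track_window' holds the object's
// last known location; the converged rectangle is returned so it can be
// used as the starting window for the next frame.
CvRect mean_shift_step( IplImage* prob_image, CvRect track_window )
{
    CvConnectedComp comp;
    cvMeanShift(
        prob_image,
        track_window,
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 10, 1.0 ),
        &comp
    );
    return comp.rect;   // where the window converged in this frame
}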
Camshift
A related algorithm is the Camshift tracker. It differs from mean-shift in that the search window adjusts itself in size. If you have well-segmented distributions (say, face features that stay compact), then this algorithm will automatically adjust itself for the size of the face as the person moves closer to and further from the camera. The form of the Camshift algorithm is:
int cvCamShift(
    const CvArr*       prob_image,
    CvRect             window,
    CvTermCriteria     criteria,
    CvConnectedComp*   comp,
    CvBox2D*           box = NULL
);

The first four parameters are the same as for the cvMeanShift() algorithm. The box parameter, if present, will contain the newly resized box, which also includes the orientation of the object as computed via second-order moments. For tracking applications, we would use the resulting resized box found on the previous frame as the window in the next frame.

Many people think of mean-shift and camshift as tracking using color features, but this is not entirely correct. Both of these algorithms track the distribution of any kind of feature that is expressed in the prob_image; hence they make for very lightweight, robust, and efficient trackers.
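A per-frame camshift step looks almost identical (again, a sketch of ours rather than code from the original text); the rectangle returned for one frame is fed back in as the search window for the next:

#include <cv.h>

// Sketch: one camshift tracking step. The probability image is assumed to
// be recomputed for each new frame.
CvRect camshift_step( IplImage* prob_image, CvRect track_window, CvBox2D* box )
{
    CvConnectedComp comp;
    cvCamShift(
        prob_image,
        track_window,
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 10, 1.0 ),
        &comp,
        box    // optional: oriented box giving the object's size and rotation
    );
    return comp.rect;
}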
Motion Templates
Motion templates were invented in the MIT Media Lab by Bobick and Davis [Bobick96; Davis97] and were further developed jointly with one of the authors [Davis99; Bradski00]. This more recent work forms the basis for the implementation in OpenCV.

* Again, mean-shift will always converge, but convergence may be very slow near the local peak of a distribution if that distribution is fairly "flat" there.
Motion templates are an effective way to track general movement and are especially applicable to gesture recognition. Using motion templates requires a silhouette (or part of a silhouette) of an object. Object silhouettes can be obtained in a number of ways.

1. The simplest method of obtaining object silhouettes is to use a reasonably stationary camera and then employ frame-to-frame differencing (as discussed in Chapter 9). This will give you the moving edges of objects, which is enough to make motion templates work.
2. You can use chroma keying. For example, if you have a known background color, you can take as foreground anything that is not that color, which lets you isolate new foreground objects/people as silhouettes.
3. You can use active silhouetting techniques—for example, creating a wall of light behind the subject and taking anything that blocks it as a silhouette.
4. You can use the segmentation techniques (e.g., pyramid segmentation or mean-shift segmentation) described in Chapter 9.
For now, assume that we have a good, segmented object silhouette as represented by the white rectangle of Figure 10-13(A). Here we use white to indicate that all the pixels are set to the floating-point value of the most recent system time stamp. As the rectangle moves, new silhouettes are captured and overlaid with the (new) current time stamp; the new silhouette is the white rectangle of Figure 10-13(B) and Figure 10-13(C). Older motions are shown in Figure 10-13 as successively darker rectangles. These sequentially fading silhouettes record the history of previous movement and thus are referred to as the "motion history image".

Figure 10-13 Motion template diagram: (A) a segmented object at the current time stamp (white); (B) at the next time step, the object moves and is marked with the (new) current time stamp, leaving the older segmentation boundary behind; (C) at the next time step, the object moves further, leaving older segmentations as successively darker rectangles whose sequence of encoded motion yields the motion history image
Silhouettes whose time stamp is more than a specified duration older than the current system time stamp are set to 0, as shown in Figure 10-14. The OpenCV function that accomplishes this motion template construction is cvUpdateMotionHistory():

void cvUpdateMotionHistory(
    const CvArr*  silhouette,
    CvArr*        mhi,
    double        timestamp,
    double        duration
);

Figure 10-14 Motion template silhouettes for two moving objects (left); silhouettes older than a specified duration are set to 0 (right)

In cvUpdateMotionHistory(), all image arrays consist of single-channel images. The silhouette image is a byte image in which nonzero pixels represent the most recent segmentation silhouette of the foreground object. The mhi image is a floating-point image that represents the motion template (aka motion history image). Here timestamp is the current system time (typically a millisecond count) and duration, as just described, sets how long motion history pixels are allowed to remain in the mhi. In other words, any mhi pixels that are older (less) than timestamp minus duration are set to 0.

Once the motion template has a collection of object silhouettes overlaid in time, we can derive an indication of overall motion by taking the gradient of the mhi image. When we take these gradients (e.g., by using the Scharr or Sobel gradient functions discussed in Chapter 6), some gradients will be large and invalid. Gradients are invalid when older or inactive parts of the mhi image are set to 0, which produces artificially large gradients around the outer edges of the silhouettes; see Figure 10-15(A). Because we know the time-step duration with which we've been introducing new silhouettes into the mhi via cvUpdateMotionHistory(), we know how large our gradients (which are just dx and dy step derivatives) should be. We can therefore use the gradient magnitude to eliminate gradients that are too large, as in Figure 10-15(B). Finally, we can collect a measure of global motion; see Figure 10-15(C). The function that effects parts (A) and (B) of the figure is cvCalcMotionGradient():

void cvCalcMotionGradient(
    const CvArr*  mhi,
    CvArr*        mask,
    CvArr*        orientation,
    double        delta1,
    double        delta2,
    int           aperture_size = 3
);

Figure 10-15 Motion gradients of the mhi image: (A) gradient magnitudes and directions; (B) large gradients are eliminated; (C) overall direction of motion is found

In cvCalcMotionGradient(), all image arrays are single-channel. The function input mhi is a floating-point motion history image, and the input variables delta1 and delta2 are (respectively) the minimal and maximal gradient magnitudes allowed. Here, the expected gradient magnitude will be just the average number of time-stamp ticks between each silhouette in successive calls to cvUpdateMotionHistory(); setting delta1 halfway below and delta2 halfway above this average value should work well. The variable aperture_size sets the size in width and height of the gradient operator. These values can be set to -1 (the 3-by-3 CV_SCHARR gradient filter), 3 (the default 3-by-3 Sobel filter), 5 (for the 5-by-5 Sobel filter), or 7 (for the 7-by-7 filter). The function outputs are mask, a single-channel 8-bit image in which nonzero entries indicate where valid gradients were found, and orientation, a floating-point image that gives the gradient direction's angle at each point.

The function cvCalcGlobalOrientation() finds the overall direction of motion as the vector sum of the valid gradient directions:

double cvCalcGlobalOrientation(
    const CvArr*  orientation,
    const CvArr*  mask,
    const CvArr*  mhi,
    double        timestamp,
    double        duration
);

When using cvCalcGlobalOrientation(), we pass in the orientation and mask image computed in cvCalcMotionGradient() along with the timestamp, duration, and resulting mhi from cvUpdateMotionHistory(); what's returned is the vector-sum global orientation,
as in Figure 10-15(C). The timestamp together with duration tells the routine how much motion to consider from the mhi and motion orientation images. One could compute the global motion from the center of mass of each of the mhi silhouettes, but summing up the precomputed motion vectors is much faster.

We can also isolate regions of the motion template mhi image and determine the local motion within each region, as shown in Figure 10-16. In the figure, the mhi image is scanned for current silhouette regions. When a region marked with the most current time stamp is found, the region's perimeter is searched for sufficiently recent motion (recent silhouettes) just outside its perimeter. When such motion is found, a downward-stepping flood fill is performed to isolate the local region of motion that "spilled off" the current location of the object of interest. Once found, we can calculate the local motion gradient direction in the spill-off region, then remove that region, and repeat the process until all regions are found (as diagrammed in Figure 10-16).

Figure 10-16 Segmenting local regions of motion in the mhi image: (A) scan the mhi image for current silhouettes (a) and, when found, go around the perimeter looking for other recent silhouettes (b); when a recent silhouette is found, perform downward-stepping flood fills (c) to isolate local motion; (B) use the gradients found within the isolated local motion region to compute local motion; (C) remove the previously found region and search for the next current silhouette region (d), scan along it (e), and perform downward-stepping flood fill on it (f); (D) compute motion within the newly isolated region and continue the process (A)-(C) until no current silhouette remains
Trang 19Th e function that isolates and computes local motion is cvSegmentMotion():
CvSeq* cvSegmentMotion(
const CvArr* mhi, CvArr* seg_mask, CvMemStorage* storage, double timestamp, double seg_thresh );
In cvSegmentMotion(), the mhi is the single-channel floating-point input. We also pass in storage, a CvMemStorage structure allocated via cvCreateMemStorage(). Another input is timestamp, the value of the most current silhouettes in the mhi from which you want to segment local motions. Finally, you must pass in seg_thresh, which is the maximum downward step (from current time to previous motion) that you'll accept as attached motion. This parameter is provided because there might be overlapping silhouettes from recent and much older motion that you don't want to connect together.

It's generally best to set seg_thresh to something like 1.5 times the average difference in silhouette time stamps. This function returns a CvSeq of CvConnectedComp structures, one for each separate motion found, which delineates the local motion regions; it also returns seg_mask, a single-channel, floating-point image in which each region of isolated motion is marked with a distinct nonzero number (a zero pixel in seg_mask indicates no motion). To compute these local motions one at a time we call cvCalcGlobalOrientation(), using the appropriate mask region selected from the appropriate CvConnectedComp or from a particular value in the seg_mask; for example,
cvCmpS(
    seg_mask,
    [value_wanted_in_seg_mask],
    [your_destination_mask],
    CV_CMP_EQ
)
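Having selected a region this way, its local direction comes from cvCalcGlobalOrientation() restricted to that mask. Here is a brief sketch (ours), with the bracketed placeholders above replaced by concrete variables and with mhi, orient, mask, timestamp, and MHI_DURATION as in the motempl.c discussion below:

// Sketch: direction of one isolated motion region selected from seg_mask.
// 'value_wanted_in_seg_mask' is the label of the region of interest.
IplImage* region_mask = cvCreateImage( cvGetSize(mhi), IPL_DEPTH_8U, 1 );
cvCmpS( seg_mask, value_wanted_in_seg_mask, region_mask, CV_CMP_EQ );
cvAnd( region_mask, mask, region_mask, NULL );   // keep only valid gradients
double local_angle = cvCalcGlobalOrientation(
    orient, region_mask, mhi, timestamp, MHI_DURATION
);
cvReleaseImage( &region_mask );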
Given the discussion so far, you should now be able to understand the motempl.c example that ships with OpenCV in the …/opencv/samples/c/ directory. We will now extract and explain some key points from the update_mhi() function in motempl.c. The update_mhi() function extracts templates by thresholding frame differences and then passing the resulting silhouette to cvUpdateMotionHistory():
cvAbsDiff( buf[idx1], buf[idx2], silh );
cvThreshold( silh, silh, diff_threshold, 1, CV_THRESH_BINARY );
cvUpdateMotionHistory( silh, mhi, timestamp, MHI_DURATION );
The gradients of the resulting mhi image are then taken, and a mask of valid gradients is produced using cvCalcMotionGradient(). Then CvMemStorage is allocated (or, if it already exists, it is cleared), and the resulting local motions are segmented into CvConnectedComp structures in the CvSeq containing structure seq:
cvCalcMotionGradient(
    mhi, mask, orient,
    MAX_TIME_DELTA,
    MIN_TIME_DELTA,
    3
);
if( !storage ) storage = cvCreateMemStorage(0);
else cvClearMemStorage(storage);
seq = cvSegmentMotion(
    mhi, segmask, storage, timestamp, MAX_TIME_DELTA
);
A "for" loop then iterates through the seq->total CvConnectedComp structures, extracting bounding rectangles for each motion. The iteration starts at -1, which has been designated as a special case for finding the global motion of the whole image. For the local motion segments, small segmentation areas are first rejected and then the orientation is calculated using cvCalcGlobalOrientation(). Instead of using exact masks, this routine restricts motion calculations to regions of interest (ROIs) that bound the local motions; it then calculates where valid motion within the local ROIs was actually found. Any such motion area that is too small is rejected. Finally, the routine draws the motion. Examples of the output for a person flapping their arms are shown in Figure 10-17, where the output is drawn above the raw image for four sequential frames going across in two rows. (For the full code, see …/opencv/samples/c/motempl.c.) In the same sequence, "Y" postures were recognized by the shape descriptors (Hu moments) discussed in Chapter 8, although the shape recognition is not included in the samples code.
for( i = -1; i < seq->total; i++ ) {
    if( i < 0 ) {            // case of the whole image
        // ... [does the whole image]
    } else {                 // i-th motion component
        comp_rect = ((CvConnectedComp*)cvGetSeqElem( seq, i ))->rect;
        // ... [reject very small components]
    }
    // ... [set component ROI regions]
    angle = cvCalcGlobalOrientation( orient, mask, mhi, timestamp, MHI_DURATION );
    // ... [find regions of valid motion]
    // ... [reset ROI regions]
    // ... [skip small valid motion regions]
    // ... [draw the motions]
}
Figure 10-17 Results of motion template routine: going across and top to bottom, a person moving and the resulting global motions indicated in large octagons and local motions indicated in small octagons; also, the "Y" pose can be recognized via shape descriptors (Hu moments)

Estimators

Suppose we are tracking a person who is walking across the view of a video camera. At each frame we make a determination of the location of this person. This could be done any number of ways, as we have seen, but in each case we find ourselves with an estimate of the position of the person at each frame. This estimation is not likely to be extremely accurate. The reasons for this are many. They may include inaccuracies in the sensor, approximations in earlier processing stages, issues arising from occlusion or shadows, or the apparent changing of shape when a person is walking due to their legs and arms swinging as they move. Whatever the source, we expect that these measurements will vary, perhaps somewhat randomly, about the "actual" values that might be received from an idealized sensor. We can think of all these inaccuracies, taken together, as simply adding noise to our tracking process.

We'd like to have the capability of estimating the motion of this person in a way that makes maximal use of the measurements we've made. Thus, the cumulative effect of our many measurements could allow us to detect the part of the person's observed trajectory that does not arise from noise. The key additional ingredient is a model for the person's motion. For example, we might model the person's motion with the following statement: "A person enters the frame at one side and walks across the frame at constant velocity." Given this model, we can ask not only where the person is but also what parameters of the model are supported by our observations.

This task is divided into two phases (see Figure 10-18). In the first phase, typically called the prediction phase, we use information learned in the past to further refine our model for what the next location of the person (or object) will be. In the second phase, the correction phase, we make a measurement and then reconcile that measurement with the predictions based on our previous measurements (i.e., our model).

Figure 10-18 Two-phase estimator cycle: prediction based on prior data followed by reconciliation of the newest measurement

The machinery for accomplishing the two-phase estimation task falls generally under the heading of estimators, with the Kalman filter [Kalman60] being the most widely used technique. In addition to the Kalman filter, another important method is the condensation algorithm, which is a computer-vision implementation of a broader class of methods known as particle filters. The primary difference between the Kalman filter and the condensation algorithm is how the state probability density is described. We will explore the meaning of this distinction in the following sections.
The Kalman Filter
First introduced in 1960, the Kalman filter has risen to great prominence in a wide variety of signal processing contexts. The basic idea behind the Kalman filter is that, under a strong but reasonable* set of assumptions, it will be possible—given a history of measurements of a system—to build a model for the state of the system that maximizes the a posteriori† probability of those previous measurements. For a good introduction, see Welsh and Bishop [Welsh95]. In addition, we can maximize the a posteriori probability without keeping a long history of the previous measurements themselves. Instead, we iteratively update our model of a system's state and keep only that model for the next iteration. This greatly simplifies the computational implications of this method.

Before we go into the details of what this all means in practice, let's take a moment to look at the assumptions we mentioned. There are three important assumptions required in the theoretical construction of the Kalman filter: (1) the system being modeled is linear, (2) the noise that measurements are subject to is "white", and (3) this noise is also Gaussian in nature. The first assumption means (in effect) that the state of the system at time k can be modeled as some matrix multiplied by the state at time k–1. The additional assumptions that the noise is both white and Gaussian mean that the noise is not correlated in time and that its amplitude can be accurately modeled using only an average and a covariance (i.e., the noise is completely described by its first and second moments). Although these assumptions may seem restrictive, they actually apply to a surprisingly general set of circumstances.‡
What does it mean to "maximize the a posteriori probability of those previous measurements"? It means that the new model we construct after making a measurement—taking into account both our previous model with its uncertainty and the new measurement with its uncertainty—is the model that has the highest probability of being correct. For our purposes, this means that the Kalman filter is, given the three assumptions, the best way to combine data from different sources or from the same source at different times. We start with what we know, we obtain new information, and then we decide to change
* Here by "reasonable" we mean something like "sufficiently unrestrictive that the method is useful for a reasonable variety of actual problems arising in the real world". "Reasonable" just seemed like less of a mouthful.

† The modifier "a posteriori" is academic jargon for "with hindsight". Thus, when we say that such and such a distribution "maximizes the a posteriori probability", what we mean is that that distribution, which is essentially a possible explanation of "what really happened", is actually the most likely one given the data we have observed; you know, looking back on it all in retrospect.

‡ OK, one more footnote. We actually slipped in another assumption here, which is that the initial distribution also must be Gaussian in nature. Often in practice the initial state is known exactly, or at least we treat it like it is, and so this satisfies our requirement. If the initial state were (for example) a 50-50 chance of being either in the bedroom or the bathroom, then we'd be out of luck and would need something more sophisticated than a single Kalman filter.
what we know based on how certain we are about the old and new information, using a weighted combination of the old and the new.

Let's work all this out with a little math for the case of one-dimensional motion. You can skip the next section if you want, but linear systems and Gaussians are so friendly that Dr. Kalman might be upset if you didn't at least give it a try.

Some Kalman math
So what's the gist of the Kalman filter?—information fusion. Suppose you want to know where some point is on a line (our one-dimensional scenario).* As a result of noise, you have two unreliable (in a Gaussian sense) reports about where the object is: locations x1 and x2. Because there is Gaussian uncertainty in these measurements, they have means of x̄1 and x̄2 together with standard deviations σ1 and σ2. The standard deviations are, in fact, expressions of our uncertainty regarding how good our measurements are. The probability distribution as a function of location is the Gaussian distribution:

    pi(x) = ( 1 / (σi √(2π)) ) · exp( −(x − x̄i)² / (2σi²) ),    i = 1, 2

Given two such measurements, each with a Gaussian probability distribution, we would
expect that the probability density for some value of x given both measurements would be proportional to p(x) = p1(x) p2(x). It turns out that this product is another Gaussian distribution, and we can compute the mean and standard deviation of this new distribution as follows. Given that

    p12(x) ∝ exp( −(x − x̄1)²/(2σ1²) − (x − x̄2)²/(2σ2²) ),

and given that a Gaussian is maximal at its mean, we can find that average value simply by computing the derivative of p(x) with respect to x. Where a function is maximal its derivative is 0, so

    dp12/dx = −[ (x − x̄1)/σ1² + (x − x̄2)/σ2² ] · p12(x) = 0    at x = x̄12

Since the probability distribution function p(x) is never 0, it follows that the term in brackets must be 0. Solving that equation for x gives us this very important relation:

    x̄12 = ( σ2²/(σ1² + σ2²) ) · x̄1 + ( σ1²/(σ1² + σ2²) ) · x̄2
J De Geeter, T Lefebvre, and H Bruyninckx, “Kalman Filters: A Tutorial” (http://citeseer.ist.psu.edu/
443226.html).
Trang 25Th us, the new mean value x– 12 is just a weighted combination of the two measured means,
where the weighting is determined by the relative uncertainties of the two
measure-ments Observe, for example, that if the uncertainty σ2 of the second measurement is
particularly large, then the new mean will be essentially the same as the mean x1 for the
more certain previous measurement
With the new mean x– 12 in hand, we can substitute this value into our expression for
p12(x) and, aft er substantial rearranging,* identify the uncertainty σ12
2 as:
1 2 2 2
=
At this point, you are probably wondering what this tells us. Actually, it tells us a lot. It says that when we make a new measurement with a new mean and uncertainty, we can combine that measurement with the mean and uncertainty we already have to obtain a new state that is characterized by a still newer mean and uncertainty. (We also now have numerical expressions for these things, which will come in handy momentarily.)
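To see the effect with concrete numbers (ours, purely for illustration): suppose the first measurement gives x̄1 = 2.0 with σ1 = 2.0, and the second gives x̄2 = 8.0 with σ2 = 1.0. Then

    x̄12 = (1/(4+1))·2.0 + (4/(4+1))·8.0 = 0.4 + 6.4 = 6.8
    σ12² = (4·1)/(4+1) = 0.8

so the fused estimate lies much closer to the more certain second measurement, and its variance (0.8) is smaller than that of either measurement alone.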
This property that two Gaussian measurements, when combined, are equivalent to a single Gaussian measurement (with a computable mean and uncertainty) will be the most important feature for us. It means that when we have M measurements, we can combine the first two, then the third with the combination of the first two, then the fourth with the combination of the first three, and so on. This is what happens with tracking in computer vision; we obtain one measure followed by another followed by another.
our estimation ( ˆ , ˆ )x i σ as follows At time step 1, we have only our fi rst measure ˆx x i 1= 1
and its uncertainty ˆσ12 σ
1 2
= Substituting this in our optimal estimation equations yields
2 1
1 2
1 2 2
= we have:
* Th e rearranging is a bit messy If you want to verify all this, it is much easier to (1) start with the equation
for the Gaussian distribution p12(x) in terms of x– 12 and σ 12, (2) substitute in the equations that relate x– 12 to x– 1
and x– 2 and those that relate σ 12 to σ 1 and σ 2 , and (3) verify that the result can be separated into the product
of the Gaussians with which we started.
Trang 26ˆ ˆˆ
=+
A rearrangement similar to what we did for ˆx2 yields an iterative equation for estimating
variance given a new measurement:
2 1 21
In their current form, these equations allow us to separate clearly the “old” information
(what we knew before a new measurement was made) from the “new” information (what
our latest measurement told us) Th e new information (x2−xˆ )1 , seen at time step 2, is
called the innovation We can also see that our optimal iterative update factor is now:
K=+
ˆˆ
σ
1 2
1 2 2 2
Th is factor is known as the update gain Using this defi nition for K, we obtain the
fol-lowing convenient recursion form:
= − K
In the Kalman filter literature, if the discussion is about a general series of measurements, then our second time step "2" is usually denoted k and the first time step is thus k – 1.
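Restating the recursion above in that notation (this is just the time-step-2 result rewritten with general indices), each new measurement x̄_k with variance σ_k² updates the running estimate (x̂_{k−1}, σ̂_{k−1}²) as:

    K_k   = σ̂_{k−1}² / ( σ̂_{k−1}² + σ_k² )
    x̂_k   = x̂_{k−1} + K_k ( x̄_k − x̂_{k−1} )
    σ̂_k²  = ( 1 − K_k ) σ̂_{k−1}²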
Systems with dynamics
In our simple one-dimensional example, we considered the case of an object being located at some point x, and a series of successive measurements of that point. In that case we did not specifically consider the case in which the object might actually be moving in between measurements. In this new case we will have what is called the prediction phase. During the prediction phase, we use what we know to figure out where we expect the system to be before we attempt to integrate a new measurement.
but before the new measurement is incorporated into our estimation of the state of the
system An example of this might be when we measure the position of a car at time t,
then again at time t + dt If the car has some velocity v, then we do not just incorporate
the second measurement directly We fi rst fast-forward our model based on what we
knew at time t so that we have a model not only of the system at time t but also of the
system at time t + dt, the instant before the new information is incorporated In this
way, the new information, acquired at time t + dt, is fused not with the old model of the
Trang 27system, but with the old model of the system projected forward to time t + dt Th is is the
meaning of the cycle depicted in Figure 10-18 In the context of Kalman fi lters, there are
three kinds of motion that we would like to consider
Th e fi rst is dynamical motion Th is is motion that we expect as a direct result of the state
of the system when last we measured it If we measured the system to be at position x
with some velocity v at time t, then at time t + dt we would expect the system to be
lo-cated at position x + v ∗ dt, possibly still with velocity.
Th e second form of motion is called control motion Control motion is motion that we
expect because of some external infl uence applied to the system of which, for whatever
reason, we happen to be aware As the name implies, the most common example of
control motion is when we are estimating the state of a system that we ourselves have
some control over, and we know what we did to bring about the motion Th is is
par-ticularly the case for robotic systems where the control is the system telling the robot
to (for example) accelerate or go forward Clearly, in this case, if the robot was at x and
moving with velocity v at time t, then at time t + dt we expect it to have moved not only
to x + v ∗ dt (as it would have done without the control), but also a little farther, since
we did tell it to accelerate
Th e fi nal important class of motion is random motion Even in our simple
one-dimensional example, if whatever we were looking at had a possibility of moving on its
own for whatever reason, we would want to include random motion in our prediction
step Th e eff ect of such random motion will be to simply increase the variance of our
state estimate with the passage of time Random motion includes any motions that are
not known or under our control As with everything else in the Kalman fi lter
frame-work, however, there is an assumption that this random motion is either Gaussian (i.e.,
a kind of random walk) or that it can at least be modeled eff ectively as Gaussian
Th us, to include dynamics in our simulation model, we would fi rst do an “update” step
before including a new measurement Th is update step would include fi rst applying any
knowledge we have about the motion of the object according to its prior state, applying
any additional information resulting from actions that we ourselves have taken or that
we know to have been taken on the system from another outside agent, and, fi nally,
incorporating our notion of random events that might have changed the state of the
system since we last measured it Once those factors have been applied, we can then
in-corporate our next new measurement
In practice, the dynamical motion is particularly important when the “state” of the
sys-tem is more complex than our simulation model Oft en when an object is moving, there
are multiple components to the “state” such as the position as well as the velocity In
this case, of course, the state evolves according to the velocity that we believe it to have
Handling systems with multiple components to the state is the topic of the next section
We will develop a little more sophisticated notation as well to handle these new aspects
of the situation
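To make the dynamical part concrete (our illustration, not an equation from the original text): for a state containing both position and velocity, x = [x, v]^T, a constant-velocity model propagates the state between measurements as

    x_k = F x_{k−1} + B u_k + w_k,    with    F = | 1  dt |
                                                  | 0   1 |

so the predicted position becomes x + v·dt while the velocity carries over; the B u_k term accounts for any control we applied, and w_k is the (Gaussian) random motion that inflates our uncertainty over time.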
Trang 28Figure 10-19 Combining our prior knowledge N(x k–1 , σ k–1 ) with our measurement observation
N(z k , σ k ); the result is our new estimate N x ( ˆ , ˆ ) kσk
Kalman equations
We can now generalize these motion equations in our toy model. Our more general discussion will allow us to factor in any model that is a linear function F of the object's state. Such a model might consider combinations of the first and second derivatives of the previous motion, for example. We'll also see how to allow for a control input u_k to our model. Finally, we will allow for a more realistic observation model z in which we might measure only some of the model's state variables and in which the measurements may be only indirectly related to the state variables.*
To get started, let's look at how K, the gain in the previous section, affects the estimates. If the uncertainty of the new measurement is very large, then the new measurement essentially contributes nothing and our equations reduce to the combined result being the same as what we already knew at time k – 1. Conversely, if we start out with a large variance in the original measurement and then make a new, more accurate measurement, then we will "believe" mostly the new measurement. When both measurements are of equal certainty (variance), the new expected value is exactly between them. All of these remarks are in line with our reasonable expectations.

Figure 10-19 shows how our uncertainty evolves over time as we gather new observations.

This idea of an update that is sensitive to uncertainty can be generalized to many state variables. The simplest example of this might be in the context of video tracking, where objects can move in two or three dimensions. In general, the state might contain

* Observe the change in notation from x_k to z_k. The latter is standard in the literature and is intended to clarify that z_k is a general measurement, possibly of multiple parameters of the model, and not just (and sometimes not even) the position x_k.