In order to build useful interactive applications with the Kinect, we need to write sketches that respond to the scene in a way that people find intuitive and clear. People have an inherent instinct for the objects and spaces around them. When you walk through a doorway, you don’t have to think about how to position yourself so you don’t bump into the doorframe. If I asked you to extend your arm towards me or towards some other object, you could do it without thinking. When you walk up to an object, you know when it’s within reach before you even extend your arm to pick it up.
Because people have such a powerful understanding of their surrounding spaces, physical interfaces are radically easier for them to understand than abstract screen-based ones. And this is especially true for physical interfaces that provide simple and direct feedback where the user’s movements immediately translate into action.
Having a depth camera gives us the opportunity to provide interfaces like these without needing to add anything physical to the computer. With just some simple processing of the depth data coming in from the Kinect, we can give the user the feeling of directly controlling something on the screen.
The simplest way to achieve this effect is by tracking the point that is closest to the Kinect. Imagine that you’re standing in front of the Kinect with nothing between you and it. If you extend your arm towards the Kinect then your hand will be the closest point that the Kinect sees. If our code is tracking the closest point then suddenly your hand will be controlling something on screen. If our sketch draws a line based on how you move your hand, then the interface should feel as intuitive as painting with a brush.
If your sketch moves photos around to follow your hand it should feel as intuitive as a table covered in prints.
Regardless of the application, though, all of these intuitive interfaces begin by tracking the point closest to the Kinect. How would we start going about that? What are the
steps between accessing the depth value of a single point and looking through all of the points to find the closest one? Further, all of these interfaces translate the closest point into some kind of visible output. So, once we’ve found the closest point, how do we translate that position into something that the user can see?
Later in this chapter, you’re going to write a sketch that accomplishes all of these things.
But before we dive into writing real code, I want to give you a sense of the overall procedure that we’re going to follow for finding the closest pixel. This is something that you’ll need over and over in the future when you make your own Kinect apps, so rather than simply memorizing or copying and pasting code, it’s best to understand the ideas behind it so that you can reinvent it yourself when you need it. To that end, I’m going to start by explaining the individual steps involved in plain English so that you can get an idea of what we’re trying to do before we dive into the forest of variables, for-loops, and type declarations. Hopefully this pseudocode, as it’s known, will act as a map so you don’t get lost in the nitty-gritty details when they do arise.
So now: the plan. First a high-level overview.
Finding the Closest Pixel
Look at every pixel that comes from the Kinect one at a time. When looking at a single pixel, if that pixel is the closest one we’ve seen so far, save its depth value to compare against later pixels and save its position. Once we’ve finished looking through all the pixels, what we’ll be left with is the depth value of the closest pixel and its position.
Then we can use those to display a simple circle on the screen that will follow the user’s hand (or whatever else they wave at their Kinect).
Sounds pretty straightforward. Let’s break it down into a slightly more concrete form to make sure we understand it:
get the depth array from the kinect
for each row in the depth image
    look at each pixel in the row
        for each pixel, pull out the corresponding value from the depth array
        if that pixel is the closest one we've seen so far
            save its value
            and save its position (both X and Y coordinates)
then, once we've looked at every pixel in the image,
    whatever value we saved last will be the closest depth reading in the array
    and whatever position we saved last will be the position of the closest pixel
draw the depth image on the screen
draw a red circle over it, positioned at the X and Y coordinates we saved
    of the closest pixel
Now this version is starting to look a little bit more like code. I’ve even indented it like it’s code. In fact, we could start to write our code by replacing each line in this pseudocode with a real line of code and we’d be pretty much on the right track.
But before we do that, let’s make sure that we understand some of the subtleties of this plan and how it will actually find us the closest point. The main idea here is that we’re going to loop over every point in the depth array comparing depth values as we go. If the depth value of any point is closer than the closest one we’ve seen before, then we save that new value and compare all future points against it instead.
It’s a bit like keeping track of the leader during an Olympic competition. The event starts without a leader. By definition, whoever goes first becomes the leader. Their distance or speed becomes the number to beat. During the event, if any athlete runs a faster time or jumps a farther distance, then they become the new leader and their score becomes the one to beat. At the end of the event, after all the athletes have had their turn, whoever has the best score is the winner and whatever their score is becomes the winning time or distance.
Our code is going to work exactly like the judges in that Olympic competition. But instead of looking for the fastest time or farthest distance, we’re looking for the closest point. As we go through the loop, we’ll check each point’s distance, and if it’s closer than the closest one we’ve seen so far, that point will become our leader and its distance will be the one to beat. Then, when we get to the end of the loop, after we’ve seen all the points, whichever one is left as the leader will be the winner: we’ll have found our closest point.
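If it helps to see that pattern in isolation, here’s the same leader-tracking idea boiled down into a tiny standalone sketch you can run on its own. It uses a small made-up array of distances (the particular numbers are just for illustration) instead of real Kinect data, but the logic is exactly what our real sketch will do:

// some made-up distances standing in for depth readings
int[] distances = { 1200, 850, 3000, 640, 975 };

// start the "leader" at a value higher than anything we expect to see
int closest = 8000;

for (int i = 0; i < distances.length; i++) {
  // if this distance beats the current leader, it becomes the new leader
  if (distances[i] < closest) {
    closest = distances[i];
  }
}

println("closest distance: " + closest); // prints 640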
OK. At this point you should understand the plan for our code and be ready to look at the actual sketch. I’m presenting it here with the pseudocode included as comments directly above the lines that implement the corresponding idea. Read through it, run it, see what it does, and then I’ll explain a few of the nitty-gritty details that we haven’t covered yet.
import SimpleOpenNI.*;
SimpleOpenNI kinect;

int closestValue;
int closestX;
int closestY;

void setup() {
  size(640, 480);
  kinect = new SimpleOpenNI(this);
  kinect.enableDepth();
}

void draw() {
  closestValue = 8000;

  kinect.update();

  // get the depth array from the kinect
  int[] depthValues = kinect.depthMap();

  // for each row in the depth image
  for (int y = 0; y < 480; y++) {
    // look at each pixel in the row
    for (int x = 0; x < 640; x++) {
      // pull out the corresponding value from the depth array
      int i = x + y * 640;
      int currentDepthValue = depthValues[i];

      // if that pixel is the closest one we've seen so far
      if (currentDepthValue > 0 && currentDepthValue < closestValue) {
        // save its value
        closestValue = currentDepthValue;
        // and save its position (both X and Y coordinates)
        closestX = x;
        closestY = y;
      }
    }
  }

  // draw the depth image on the screen
  image(kinect.depthImage(), 0, 0);

  // draw a red circle over it,
  // positioned at the X and Y coordinates we saved of the closest pixel
  fill(255, 0, 0);
  ellipse(closestX, closestY, 25, 25);
}
Hey, a sketch that’s actually interactive! When you run this sketch you should see a red dot floating over the depth image following whatever is closest to the Kinect. If you face the Kinect so that there’s nothing between you and the camera and extend your hand, then the red dot should follow your hand around when you move it.
For example, Figure 2-14 shows the red dot following my extended fist when I run the sketch.
The tracking is good enough that if you point a single finger at the Kinect and wag it back and forth disapprovingly, the dot should even stick to the tip of your outstretched finger.
Now you should understand most of what’s going on in this code based on our discussion of the pseudocode, but there are a few details that are worth pointing out and clarifying.
First, let’s look at closestValue. This is going to be the variable that holds the current record holder for closest pixel as we work our way through the image. Ironically, the winner of this competition will be the pixel with the lowest value, not the highest. As we saw in Example 5, the depth values range from about 450 to just under 8000, and lower depth values correspond to closer points.
At the beginning of draw(), we set closestValue to 8000. That number is so high that it’s actually outside the range of possible values that we’ll see from the depth map.
Hence, all of our actual points within the depth image will have lower values. This
guarantees that all our pixels will be considered in the competition for closest point and that one of them will actually win.
Another interesting twist comes up with this line near the middle of draw():
if (currentDepthValue > 0 && currentDepthValue < closestValue) {
Here we’re comparing the depth reading of the current point with closestValue, the current record holder, to see if the current point should be crowned as the new closest point. However, we don’t just compare currentDepthValue with closestValue; we also check to see if it is greater than zero. Why?
In general, we know that the lower a point’s depth value, the closer that point is to the Kinect. Back in “Higher Resolution Depth Data” on page 74, though, when we were first exploring these depth map readings, we discovered an exception to this rule. The closest points in the image have depth readings of around 450, but there are some other points that have readings of zero. These are the points that the Kinect can’t see and hence doesn’t have data for. They might be so close that they’re within the Kinect’s minimum range, or they might be obscured by the shadow of some closer object. Either way, we know that none of these points are the closest one, so we need to discard them. That’s why we added the check for currentDepthValue > 0 to our if statement.
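If you’re curious how many of these no-data points show up in practice, here’s a small diagnostic fragment of my own (it’s not part of the sketch above) that you could add inside draw(), after the depthValues array has been filled, to count them:

// count how many depth readings came back as zero in the current frame
int zeroCount = 0;
for (int i = 0; i < depthValues.length; i++) {
  if (depthValues[i] == 0) {
    zeroCount++;
  }
}
println("points with no depth data: " + zeroCount + " out of " + depthValues.length);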
Figure 2-14. Our red circle following my outstretched fist.
Next, let’s look at our two for-loops. We know from our pseudo-code that we want to go through every row in the image and within every row we want to look at every point in that row. How did we translate that into code?
// for each row in the depth image
for (int y = 0; y < 480; y++) {
  // look at each pixel in the row
  for (int x = 0; x < 640; x++) {
What we’ve got here is two for-loops, one inside the other. The outer one increments a variable y from zero up to 479. We know that the depth image from the Kinect is 480 pixels tall. In other words, it consists of 480 rows of pixels. This outer loop will run once for each one of those rows, setting y to the number of the current row (starting at zero).
The next line kicks off a for-loop that does almost the same thing, but with a different variable, x, and a different constraint, 640. This inner loop is going to run once per row.
We want it to cover every pixel in the row. Since the depth image from the Kinect is 640 pixels wide, we know that it’ll have to run 640 times in order to do so.
The code inside of this inner loop, then, is going to run once per pixel in the image. It will proceed across each row in turn, left to right, before jumping down to the next row until it reaches the bottom right corner of the image and stops.
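If the order of that traversal is hard to picture at full size, here’s a tiny standalone example that uses a hypothetical image just 3 pixels wide and 2 pixels tall instead of 640 by 480, and prints out each pixel it visits:

int w = 3; // a made-up image width
int h = 2; // and a made-up image height
for (int y = 0; y < h; y++) {
  for (int x = 0; x < w; x++) {
    println("x=" + x + ", y=" + y);
  }
}
// visits (0,0), (1,0), (2,0) across the top row,
// then (0,1), (1,1), (2,1) across the second row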
But as we well know from our previous experience with kinect.depthMap(), our depthValues array doesn’t store rows of pixels; it’s just a single flat list of values. Hence we need to invoke the same logic we just learned for converting between the x-y coordinates of a pixel in the image and the position of its value in the array. And that’s exactly what we do inside the inner for loop:
// pull out the corresponding value from the depth array
int i = x + y * 640;
That line should look familiar to you from the example in “Higher Resolution Depth Data” on page 74. It converts the x-y coordinates of a pixel in the image to the index of the corresponding value in the array. And once we’ve got that index, we can use it to access the depthValues array and pull out the value for the current point, which is exactly what we do on the next line. This again, should look familiar from our previous work.
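To make that concrete with one worked example: the pixel at x = 200, y = 100 sits 100 complete rows into the image (100 × 640 = 64,000 values) plus 200 values into its own row, so its index in depthValues is 200 + 100 * 640 = 64,200.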
Now, at this point you’ve made the big transition. You’ve switched from working with a single depth pixel at a time to processing the entire depth image. You understand how to write the nested loops that let your code run over every point in the depth image.
Once you’ve got that down, the only other challenge in this sketch is understanding how we use that ability to answer questions about the entire depth image as a whole.
In this case, the question we’re answering is: which point in the depth image is closest to the Kinect? In order to answer that question we need to translate from code that runs on a series of individual points to information that holds up for all points in the image.
In this sketch, our answer to that question is contained in our three main variables:
closestValue, closestX, and closestY. They’re where we store the information that we build up from processing each individual depth point.
Using Variable Scope
In order to understand how this works, how these variables can aggregate data from individual pixels into a more widely useful form, we need to talk about "scope". When it comes to code, "scope" describes how long a variable sticks around. Does it exist only inside of a particular for-loop? Does it exist only in a single function? Or does it persist for the entire sketch? In this example, we have variables that have all three of these different scopes and these variables work together to aggregate data from each pixel to create useful information about the entire depth image. The data tunnels its way out from the innermost scope where it relates only to single pixels to the outer scope where it contains information about the entire depth image: the location and distance of its closest point.
Our Processing sketch is like an onion. It has many layers and each scope covers a different set of these layers. Once a variable is assigned, it stays set for all of the layers inside of the one on which it was originally defined. So, variables defined outside of any function are available everywhere in the sketch. For example, kinect is defined at the top of this sketch and we use it in both our setup() and draw() functions. We don’t have to reset our kinect variable at the start of draw() each time; we can just use it.
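To see that outermost layer on its own, here’s a minimal sketch (separate from our Kinect code, with names I’ve made up for illustration) in which one global variable keeps accumulating across runs of draw() while a local one starts over every frame:

int totalFrames = 0; // global: declared outside any function, so it persists

void setup() {
  size(200, 200);
}

void draw() {
  int thisFrame = 1; // local: re-created from scratch on every run of draw()
  totalFrames = totalFrames + thisFrame;
  println("draw() has now run " + totalFrames + " times");
}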
Variables defined on inner layers, on the other hand, disappear whenever we leave that layer. Our variable i, for example, which gets declared just inside the inner for loop—
at the innermost core of the onion—represents the array index for each individual point in the depthMap. It disappears and gets reset for each pixel every time our inner loop runs. We wouldn’t want its value to persist because each pixel’s array index is independent of all the ones that came before. We want to start that calculation from scratch each time, not build it up over time.
Another piece of information that we want to change with every pixel is currentDepthValue. That’s the high-resolution depth reading that corresponds to each pixel. Every time the inner loop runs, we want to pull a new depth reading out of the depthValues array for each new pixel; we don’t care about the old ones. That’s why both i and currentDepthValue are declared in this innermost scope. We want them to constantly change with each pixel.
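To make this innermost, loop-level scope concrete, here’s another tiny standalone fragment (not from our sketch) that you can run by itself:

for (int x = 0; x < 5; x++) {
  int doubled = x * 2; // doubled only exists inside this loop
  println(doubled);
}
// println(doubled); // uncommenting this line causes an error:
//                   // doubled is out of scope out here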
There’s also some data that we want to change every time draw() runs, but stay the same through both of our for-loops. This data lives on an intermediate layer of the onion. The key variable here is depthValues which stores the array of depth readings from the Kinect. We want to pull in a new frame from the Kinect every time draw() runs. Hence this variable should get reset every time the draw() function restarts. But we also want the depthValues array to stick around long enough so that we can process all of its points. It needs to be available inside of our inner for-loop so we can access each point to read out its value and do our calculations. That’s why depthValues is in
this intermediate scope, available throughout each run of draw(), but not across the entire sketch.
And finally, moving out to the outermost layer of the onion, we find our three key variables: closestValue, closestX, and closestY. Just like kinect, we declared these at the very top of our sketch, outside of either the setup() or draw() function, and so they are available everywhere. And, more than that, they persist over time. No matter how many times the inner pixel-processing loop runs, no matter how many times the draw() function itself runs, these variables will retain their values until we intentionally change them. That’s why we can use them to build up an answer to find the closest pixel. Even though the inner loop only knows the distance of the current pixel, it can constantly compare this distance with closestValue, changing closestValue if necessary. And we’ve seen in our discussion of the pseudocode (and of Olympic records) how that ability leads to eventually finding the closest point in the whole depth image. This all works because of the difference in scope. If closestValue didn’t stick around as we processed all of the pixels, it wouldn’t end up with the right value when we were done processing them.
Do we actually need closestValue, closestX, and closestY to be global variables, available everywhere, just like kinect? Unlike our kinect object, we don’t access any of these three in our setup() function; we only use them within draw(). Having them be global also allows their values to persist across multiple runs of draw(). Are we taking advantage of this?
Well, certainly not for closestValue. The very first line of draw() sets closestValue to 8000, discarding whatever value it ended up with after the last run through all of the depth image’s pixels. If we didn’t reset closestValue, we would end up comparing every pixel in each new frame from the Kinect to the closest pixel that we’d ever seen since our sketch started running. Instead of constantly tracking the closest point in our scene, causing the red circle to track your extended hand, the sketch would lock onto the closest point and only move if some closer point emerged in the future. If you walked up to the Kinect and you were the closest thing in the scene, the red circle might track you, but then, when you walked away, it would get stuck at your closest point to the Kinect.
By resetting closestValue for each run of draw() we ensure that we find the closest point in each frame from the Kinect no matter what happened in the past. This tells us that we could move the scope of closestValue to be contained within draw() without changing how our sketch works.
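If you did want to make that change, the edit would be small: remove closestValue from the global declarations at the top of the sketch and declare it inside draw() instead. Just as a sketch of what that would look like (only the affected lines are shown; everything else stays the same):

void draw() {
  // closestValue is now declared here rather than at the top of the sketch,
  // so it's local to draw() and automatically starts fresh at 8000 each frame
  int closestValue = 8000;
  // ...the rest of draw() stays exactly as before...
}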
But what about closestX and closestY? We don’t set these to a default value at the top of draw(). They enter our nested for-loops still containing their values from the previous frame. But what happens then? Since closestValue is set above the range of possible depth values, any point in the depth map that has an actual depth value will cause closestX and closestY to change, getting set to that point’s x and y coordinates. Some point in the depth image has to be the closest one.