Figure 2.4.: Surfel view directions. We support up to six surfels for orthogonal view directions onto the voxel faces.
2.2. Multi-Resolution Surfel Maps
In MRSMaps, we represent the joint color and shape distribution of RGB-D measurements at multiple resolutions in 3D. We use octrees as a natural data structure for this purpose. The tree subdivides the represented 3D volume into cubic voxels at various resolutions, where resolution is defined as the inverse of the cube's side length. A node in the tree corresponds to a single voxel. Inner nodes branch to at least one of eight child nodes, dividing the voxel of the inner node into eight equally sized sub-voxels. The nodes at the same depth d in the tree share a common cube resolution ρ(d) := 2^d ρ(0), i.e., a power-of-two multiple of the cube resolution of the root node at depth 0.
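The depth/resolution relation can be sketched as follows; the function name and example values are illustrative, not taken from the reference implementation:

```python
# Sketch of the relation rho(d) = 2^d * rho(0): each octree level halves the
# voxel side length and therefore doubles the resolution (inverse side length).
def resolution_at_depth(root_resolution: float, depth: int) -> float:
    """Cube resolution of octree nodes at a given depth d."""
    return (2 ** depth) * root_resolution

# A root voxel with side length 3.2 m has resolution 1/3.2 = 0.3125 m^-1;
# three levels down the side length is 0.4 m, i.e., resolution 2.5 m^-1.
```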
In each node of the tree, i.e., inner nodes as well as leaf nodes, we store statistics on the joint spatial and color distribution of the points P within its volume. The distribution is approximated with sample mean µ and covariance Σ of the data, i.e., we model the data as normally distributed in a node's volume.
We denote the local description of voxel content as surfel s. It describes the local shape and color distribution within the voxel by the following attributes:
• mean µ_s ∈ R^6 and covariance Σ_s ∈ R^{6×6}, where the first three coordinates µ^p_s model the 3D coordinates of the points within the voxel and the latter three dimensions µ^c_s = (µ^L_s, µ^α_s, µ^β_s)^T describe color,
• a surface normal n_s ∈ R^3 pointing toward the sensor origin and normalized to unit length,
• a local shape-texture descriptor h_s.
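The attributes above can be collected in a small record type. This is a hypothetical sketch; the field names and the descriptor length are our own choices, not those of the reference implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    """Local shape and color statistics of the points inside one voxel."""
    mean: np.ndarray        # mu_s in R^6: (x, y, z, L, alpha, beta)
    cov: np.ndarray         # Sigma_s in R^{6x6}
    normal: np.ndarray      # unit normal n_s in R^3, oriented toward the sensor
    descriptor: np.ndarray  # shape-texture descriptor h_s
    num_points: int = 0     # number of points summarized by this surfel
```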
Since we build maps of scenes and objects from several perspectives, multiple distinct surfaces may be contained within a node’s volume. We model this by maintaining multiple surfels in a node that are visible from different view directions (see Fig. 2.4). We use up to six orthogonal view directions v ∈V :=
Figure 2.5.:αβ chrominances for different luminance values.
Figure 2.6.: Lαβ color space example. From left to right: Color image, L-, α-, β-channel.
{±e_x, ±e_y, ±e_z} aligned with the basis vectors e_x, e_y, e_z of the map reference frame. When adding a new point p to the map, we determine the view direction onto the point v_p = T_cm p and associate it with the surfel belonging to the most similar view direction,

v′ = argmax_{v ∈ V} { vᵀ v_p }.   (2.8)
The transform T_cm maps p from camera to map frame.
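The view-direction association of Eq. (2.8) reduces to an argmax over six dot products; a minimal sketch:

```python
import numpy as np

# The six orthogonal view directions +-e_x, +-e_y, +-e_z of the map frame.
VIEW_DIRS = np.array([
    [ 1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
    [ 0.0, 1.0, 0.0], [ 0.0,-1.0, 0.0],
    [ 0.0, 0.0, 1.0], [ 0.0, 0.0,-1.0],
])

def associate_view_direction(v_p: np.ndarray) -> np.ndarray:
    """Return v' = argmax_{v in V} v^T v_p, the direction most similar to v_p."""
    return VIEW_DIRS[np.argmax(VIEW_DIRS @ v_p)]
```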
By maintaining the joint distribution of points and color in a 6D Gaussian distribution, we also model the spatial distribution of color. In order to separate chrominance from luminance information and to represent chrominances in Cartesian space, we choose a variant of the HSL color space. We define the Lαβ color space through
L := 1/2 (max{R, G, B} + min{R, G, B}),
α := R − 1/2 G − 1/2 B, and
β := (√3/2)(G − B).   (2.9)
The chrominances α and β represent hue and saturation of the color (Hanbury, 2008) and L its luminance (see Figs. 2.5 and 2.6).
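Eq. (2.9) translates directly into code; a sketch with RGB channels assumed in [0, 1]:

```python
import math

def rgb_to_lab(r: float, g: float, b: float):
    """Convert RGB (each in [0, 1]) to the L-alpha-beta space of Eq. (2.9)."""
    L = 0.5 * (max(r, g, b) + min(r, g, b))   # luminance: mid-range of channels
    alpha = r - 0.5 * g - 0.5 * b             # first chrominance axis
    beta = (math.sqrt(3.0) / 2.0) * (g - b)   # second chrominance axis
    return L, alpha, beta
```

For gray inputs both chrominances vanish, so luminance and chrominance separate as intended.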
Figure 2.7.: Multi-resolution surfel map aggregation from an RGB-D image. Top left: RGB image of the scene. Top right: maximum voxel resolution coding; color codes the octant of the leaf in its parent's voxel (max. resolution (0.0125 m)⁻¹). Bottom: 15 samples per color and shape surfel at (0.025 m)⁻¹ (left) and at (0.05 m)⁻¹ resolution (right).
Surface normals n are determined from the eigen decomposition of the point sample covariance in a local neighborhood at the surfel. We set the surface normal to the eigenvector that corresponds to the smallest eigenvalue, and direct the normal towards the view-point. Due to the discretization of the 3D volume into voxels, surfels may only receive points on a small surface patch compared to the voxel resolution. We thus smooth the normals by determining the normal from the covariance of the surfel and adjacent surfels in the voxel grid.
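The normal estimation step can be sketched as follows: the eigenvector of the 3×3 spatial covariance with the smallest eigenvalue, flipped toward the viewpoint. The function and its signature are our own illustration:

```python
import numpy as np

def surfel_normal(cov3: np.ndarray, mean: np.ndarray, viewpoint: np.ndarray) -> np.ndarray:
    """Surface normal from the eigendecomposition of the spatial covariance:
    the direction of least variance, oriented toward the sensor viewpoint."""
    eigvals, eigvecs = np.linalg.eigh(cov3)   # eigenvalues in ascending order
    n = eigvecs[:, 0]                         # smallest-eigenvalue eigenvector
    if np.dot(n, viewpoint - mean) < 0.0:     # flip so the normal faces the viewpoint
        n = -n
    return n
```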
Neighboring voxels can efficiently be found using precalculated look-up tables (Zhou et al., 2011). We store the pointers to neighbors explicitly in each node to achieve better run-time efficiency than tracing the neighbors through the tree. The octree representation is still more memory-efficient than a multi-resolution grid, as it only allocates voxels that contain the 3D surface observed by the sensor.
2.2.1. Modeling Measurement Errors
We control the maximum resolution in the tree to consider the typical property of RGB-D sensors that measurement errors increase quadratically with depth
Figure 2.8.: 2D illustration of our local shape-texture descriptor. We determine a local description of shape, chrominance (α, β), and luminance (L) contrasts to improve the association of surfels. Each node is compared to its 26 neighbors. We smooth the descriptors between neighbors.
and with distance from the optical center on the image plane (see Sec. 2.1). We adapt the maximum resolution ρmax(p) at a point p with the squared distance to the optical center,
ρmax(p) = 1 / (λρ ‖p‖₂²),   (2.10)
where λρ is a factor that is governed by pixel as well as disparity resolution and noise and can be determined empirically. Fig. 2.7 shows the map representation of an RGB-D image in two example resolutions.
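Eq. (2.10) and its mapping to an octree depth can be sketched as follows; the conversion from ρmax to a depth via the relation ρ(d) = 2^d ρ(0) is our own illustration, and the numeric values are examples, not calibrated parameters:

```python
import math

def max_resolution(dist: float, lambda_rho: float) -> float:
    """Eq. (2.10): the usable resolution falls off with squared distance |p|."""
    return 1.0 / (lambda_rho * dist ** 2)

def max_depth(dist: float, lambda_rho: float, root_resolution: float) -> int:
    """Deepest octree level d whose resolution 2^d * rho(0) does not
    exceed rho_max(p)."""
    return max(0, math.floor(math.log2(max_resolution(dist, lambda_rho) / root_resolution)))
```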
2.2.2. Shape-Texture Descriptor
We construct descriptors of shape and texture in the local neighborhood of each surfel (see Fig. 2.8). Similar to fast point feature histograms (FPFHs) (Rusu et al., 2009), we first build three-bin histograms h^sh_s of the three angular surfel-pair relations between the query surfel s and its up to 26 neighbors s′ at the same resolution and view direction. The three angles are measured between the normals of both surfels, ∠(n, n′), and between each normal and the line ∆µ := µ − µ′ between the surfel means, i.e., ∠(n, ∆µ) and ∠(n′, ∆µ). Each surfel-pair relation is weighted with the number of points in the neighboring node.
We smooth the histograms to better cope with discretization effects by adding the histogram of neighboring surfels with a factor γ = 0.1 and normalize the histograms by the total number of points.
Similarly, we extract local histograms of luminance (h^L_s) and chrominance (h^α_s, h^β_s) contrasts. We bin luminance and chrominance differences between neighboring surfels into positive, negative, or insignificant. The shape and texture histograms are concatenated into a shape-texture descriptor h_s of the surfel. Fig. 2.9 shows
Figure 2.9.: Similarity in shape-texture descriptor for blob- (top left) and edge-like structures (top right) and in planar, textureless structures (bottom). The MRSMaps are shown as voxel centers at a single resolution each (left images). Feature similarity towards a reference point (green dot) is visualized by colored surfel means (right images; red: low, cyan: high similarity).
feature similarity on color blobs, edges, and planar structures determined using the Euclidean distance between the shape-texture descriptors.
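The angular part of the descriptor can be sketched as follows. This is a simplified illustration under our own assumptions (angles binned uniformly over [0, π]; smoothing between neighboring surfels omitted):

```python
import numpy as np

def pair_angles(n, n2, mu, mu2):
    """The three angular surfel-pair relations: angle(n, n'), angle(n, dmu),
    angle(n', dmu), with dmu the normalized line between the surfel means."""
    dmu = mu - mu2
    dmu = dmu / np.linalg.norm(dmu)
    ang = lambda a, b: np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    return ang(n, n2), ang(n, dmu), ang(n2, dmu)

def shape_histogram(query, neighbors):
    """Three three-bin histograms over [0, pi], one per angle, each surfel pair
    weighted by the neighbor's point count and normalized by the total weight."""
    hist = np.zeros((3, 3))
    total = 0.0
    for n2, mu2, weight in neighbors:
        for i, a in enumerate(pair_angles(query[0], n2, query[1], mu2)):
            hist[i, min(2, int(a / np.pi * 3))] += weight
        total += weight
    return hist / max(total, 1.0)
```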
2.2.3. Efficient RGB-D Image Aggregation
Instead of computing mean and covariance in the nodes with a two-pass algorithm, we use a one-pass update scheme with high numerical accuracy (Chan et al., 1979). It determines the sufficient statistics S(P) := Σ_{p∈P} p and S²(P) := Σ_{p∈P} p pᵀ of the normal distribution from the statistics of two point sets P_A and P_B through

S(P_{A∪B}) ← S(P_A) + S(P_B),
S²(P_{A∪B}) ← S²(P_A) + S²(P_B) + δδᵀ / (N_A N_B (N_A + N_B)),   (2.11)

where N_(·) := |P_(·)| and δ := N_B S(P_A) − N_A S(P_B). From these, we obtain the sample mean µ(P) = (1/|P|) S(P) and covariance Σ(P) = (1/(|P|−1)) S²(P) − µµᵀ.
Careful treatment of numerical stability is required when utilizing one-pass schemes for calculating the sample covariance (Chan et al., 1979). We require a minimum sample size of |P| ≥ 10 to create surfels and stop incorporating new data points if |P| ≥ 10,000¹. The discretization of disparity and color produced by the RGB-D sensor may cause degenerate sample covariances, which we robustly detect by thresholding the determinant of the covariance at a small constant.

¹Using double precision (machine epsilon 2.2 · 10⁻¹⁶) and assuming a minimum standard
The use of an update scheme allows for an efficient incremental update of the map. In the simplest implementation, each point is added individually to the tree: starting at the root node, the point's statistics are recursively added to the nodes that contain the point in their volume.
Adding each point individually is, however, not the most efficient way to generate the map. Instead, we exploit that by the projective nature of the camera, neighboring pixels in the image project to nearby points on the sampled 3D surface—up to occlusion effects. This means that neighbors in the image are likely to belong to the same octree nodes. In effect, the size of the octree is significantly reduced and the leaf nodes subsume local patches in the image (see top-right of Fig. 2.7). Through the distance-dependent resolution limit, patch size does not decrease with distance to the sensor but even increases. We exploit these properties and scan the image to aggregate the sufficient statistics of contiguous image regions that belong to the same octree node. This measurement aggregation allows the map to be constructed with only several thousand insertions of node aggregates for a 640×480 image, in contrast to 307,200 point insertions.
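The run aggregation along a scanline can be sketched as follows; identifying each point's octree node by an integer key is our own simplification:

```python
import numpy as np

def aggregate_scanline(keys, points):
    """Sum the statistics of contiguous runs of points that fall into the same
    octree node (identified here by an integer key) along one image row.
    Returns (key, count, point_sum) per run, i.e., one insertion per run."""
    runs = []
    start = 0
    for i in range(1, len(keys) + 1):
        if i == len(keys) or keys[i] != keys[start]:
            pts = points[start:i]
            runs.append((keys[start], len(pts), pts.sum(axis=0)))
            start = i
    return runs
```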
After the image content has been incorporated into the representation, we precompute mean, covariance, surface normals, and shape-texture features.
2.2.4. Handling of Image and Virtual Borders
Special care must be taken at the borders of the image and at virtual borders where background is occluded (see Fig. 2.10). Nodes that receive such border points only partially observe the underlying surface structure. When updated with these partial measurements, the true surfel distribution is distorted towards the visible points. In order to avoid this, we determine such nodes by scanning efficiently through the image, and neglect them.
Conversely, foreground depth edges describe contours of measured surfaces.
We thus mark surfels as belonging to a contour if they receive foreground points at depth discontinuities (example contours illustrated in Fig. 2.10).
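A minimal sketch of marking foreground points at depth discontinuities; the jump threshold and the 4-neighborhood are illustrative choices, not the thesis's parameters:

```python
import numpy as np

def foreground_contour_mask(depth: np.ndarray, jump: float = 0.05) -> np.ndarray:
    """Mark pixels on the foreground side of a depth discontinuity: a pixel is
    a contour point if some 4-neighbor lies deeper by more than `jump` meters.
    Note: np.roll wraps at the image border; a real implementation would
    exclude border pixels."""
    mask = np.zeros(depth.shape, dtype=bool)
    for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        shifted = np.roll(depth, (dy, dx), axis=(0, 1))
        mask |= (shifted - depth) > jump
    return mask
```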