Automatic Registration of Color Images
to 3D Geometry of Indoor Environments

LI YUNZHEN
(B.Comp.(Hons), NUS)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2008
Acknowledgements

Firstly, I would like to thank my supervisor, Dr. Low Kok Lim, for his invaluable guidance and constant support in this research. I would also like to thank Dr. Cheng Ho-lun and A/P Tan Tiow Seng for their help in my graduate life. I also thank Prashast Khandelwal for his honours-year work on this research.

Secondly, I would like to thank all my friends, especially Yan Ke and Pan Binbin. We have shared the postgraduate life for two years. My thanks to all the people in the graphics lab for their encouragement and friendship.

Lastly, I would like to thank all my family members.
Contents

Acknowledgements

Chapter 1 Introduction
1.1 Motivation and Goal
1.2 Contribution
1.3 Structure of the Thesis

Chapter 2 Related Work
2.1 Automatic Registration Methods
2.1.1 Feature-based Automatic Registration
2.1.2 Statistical-based Registration
2.1.3 Multi-view Geometry Approach

Chapter 3 Background
3.1 Camera Model
3.1.1 Intrinsic Parameters
3.1.2 Extrinsic Parameters
3.1.3 Camera Calibration
3.1.4 Image Un-distortion
3.2 Two-view Geometry
3.2.1 Essential Matrix and Fundamental Matrix Computation
3.2.2 Camera Pose Recovery from Essential Matrix
3.2.3 Three-D Point Recovery
3.3 Image Feature Detection and Matching
3.3.1 Corner Detection and Matching
3.3.2 Scale Invariant Feature Transform (SIFT)
3.4 Levenberg Marquardt Non-linear Optimization

5.1 Data Acquisition
5.1.1 Range Data Representation
5.1.2 Image Data Capturing
5.2 Camera Calibration and Image Un-distortion
5.3 SIFT-Keypoint Detection and Matching

Chapter 6 Multiview Geometry Reconstruction
6.1 Camera Pose Recovery in Two-view System
6.2 Register Two-view System to Multiview System
6.2.1 Scale Computation
6.2.2 Unregistered Camera Pose Computation
6.2.3 Last Camera Pose Refinement
6.3 Structure Extension and Optimization
6.3.1 Three-D Point Recovery from Multi-views
6.4 Outliers Detection
6.4.1 Structure Optimization

Chapter 7 Registration of Multiview Geometry with 3D Model
7.1 User Guided Registration of Multiview Geometry with 3D Model
7.1.1 Semi-automatic Registration System
7.1.2 Computing Scale between Multiview Geometry and the 3D Model
7.1.3 Deriving Poses of other Views in the Multiview System
7.2 Plane-Constrained Optimization

Chapter 8 Color Mapping and Adjustment
8.1 Occlusion Detection and Sharp Depth Boundary Mark Up
8.1.1 Depth Buffer Rendering
8.1.2 Occlusion Detection
8.1.3 Depth Boundary Mask Image Generation
8.2 Blending
8.2.1 Exposure Unification
8.2.2 Weighted Blending
8.2.3 Preservation of Details

Chapter 9 Experiment Results and Time Analysis
9.1 Results of Multiview Geometry Reconstruction
9.2 Results of Textured Room Models
9.3 Related Image Based Modeling Results
9.4 Time Analysis of the Automatic Registration Method

Chapter 10 Conclusion and Future Work

References

Appendix A Information-theoretic Metric
A.1 Mutual Information Metric
A.1.1 Basic Information Theory
A.1.2 Mutual Information Metric Evaluation between two Images
A.2 Chi-Squared Test
A.2.1 Background
A.2.2 Chi-Squared Test about Dependence between two Images
A.3 Normalized Cross Correlation (NCC)
A.3.1 Correlation
A.3.2 Normalized Cross Correlation (NCC) between two Images

Appendix B Methods to check whether a Point is inside a Triangle
List of Figures

Figure 1.1: Sculpture from the Parthenon. This model shows the presentation of the peplos, or robe of Athena. Image taken from [31].
Figure 1.2: A partially textured crime scene model from the DeltaSphere software package.
Figure 2.1: Details of texture-maps for a building. Those images verify the high accuracy of the automated algorithm. Images taken from [17].
Figure 2.3: Automatic alignment results. (a) The library model with three images rendered using their initial pose estimates. (b) The library model with all images aligned. Image taken from [39].
Figure 2.4: Cameras and 3D point reconstructions from photos of the Trevi Fountain on the Internet. Image taken from [28].
Figure 3.1: Projection of a point from camera frame to image coordinates.
Figure 3.2: The two-view system.
Figure 3.3: Difference of Gaussian images are generated by subtracting adjacent Gaussian images for each scale level. Image taken from [30].
Figure 3.4: Local extrema of DoG images are detected by comparing a pixel (red) with its 26 neighbors (blue) in 3 × 3 regions at the current and adjacent scales. Image taken from [30].
Figure 3.5: A keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into orientation histograms summarizing the contents over 4×4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region. This figure shows a 2×2 descriptor array computed from an 8×8 set of samples, whereas the experiments in this paper use 4×4 descriptors computed from a 16×16 sample array. The image and the description taken from [20].
Figure 3.6: SIFT matching result: the bottom image is the SIFT matching result of the top images.
Figure 5.1: Equipment used during data acquisition. Left is the DeltaSphere 3000 with a laptop, top right shows an NEC NP50 projector and bottom right shows a Canon 40D camera.
Figure 5.2: The recovery of a 3D point in the right-hand coordinate system.
Figure 5.3: The intensity image of the RTPI. Each pixel in the intensity image refers to a 3D point.
Figure 5.4: The SIFT pattern.
Figure 5.5: Feature connected component: an outlier case.
Figure 6.1: The associated views: one is patterned and the other is the normal view.
Figure 6.2: The multi-view system of the room. The blue points are the 3D points recovered from SIFT features. Those pyramids represent the cameras at the recovered locations and orientations.
Figure 7.1: The graphic interface of the semi-automatic registration system. The top-left sub-window shows the intensity map of the range image and the top-right sub-window shows a color image. Those colored points are user-specified feature locations.
Figure 7.2: The registration result using back-projection.
Figure 7.3: The feature point inside the projected triangle △abc.
Figure 7.4: The registered multiview system and the green planes detected from the model.
Figure 7.5: The plane-constrained multiview system together with the ...
... details. (Right) Result of weighted blending with preservation of details.
Figure 8.4: (Top) The dominant registration result. (Mid) The weighted blended result. (Bottom) Final registration result of weighted blending with preservation of details.
Figure 9.1: (Left) A view of the feature-paper-wrapped box. (Right) The reconstructed multiview geometry, which contains 26 views.
Figure 9.2: (Left) An overview of the multiview geometry of a cone-shaped object. (Right) The side view of the multiview geometry.
Figure 9.3: (Left) A feature-pattern-projected room image. (Right) The reconstructed multiview geometry.
Figure 9.4: (Left) A far view of the River Walk building. (Right) The reconstructed multiview geometry, which contains 91 views.
Figure 9.5: (Top) The result captured by a virtual camera inside the colored 3D model together with camera views recovered from the multiview geometry reconstruction. (Bottom) 3D renderings of the final colored model.
Figure 9.6: Registration of the multiview geometry with another scanned model. The top image is the intensity image of the model with color registered. The mid image is a view inside the model and the bottom two images show the 3D model.
Figure 9.7: The six images are ordered from left to right, from top to bottom. Image 1 shows the reconstructed 3D points together with camera views; Image 2 is the top view of the point set; Image 3 shows selecting the region of the model; Image 4 shows the contour linked up by lines; Image 5 shows the model reconstructed; Image 6 shows the textured model. The last image is a color image taken by a camera.
Figure 9.8: The top two images are views of the 3D reconstructed color model from different angles. The bottom two images are views of the 3D reconstructed model, in which the little red points are the 3D points recovered and the red planes represent the cameras.
Keywords: Image-to-geometry registration, 2D-to-3D registration, range scanning, multiview geometry, SIFT, image blending.
Chapter 1

Introduction

1.1 Motivation and Goal
Creating 3D, color, computer graphics models of scenes and objects from the real world has various applications, such as digital cultural heritage preservation, crime forensics, computer games and so on.

Generating digital reconstructions of historical or archaeological sites with enough fidelity has become a focus in the area of virtual heritage. With digital reconstructions, cultural heritages can be preserved and even reconstructed. In 2003, Jessi Stumpfel et al. [31] presented the digital reunification of the Parthenon and its sculptures, see Figure 1.1. Today, the modern Acropolis Parthenon is being reconstructed with the help of the digital Parthenon.

Figure 1.1: Sculpture from the Parthenon. This model shows the presentation of the peplos, or robe of Athena. Image taken from [31].
In criminal investigation, words and images are often not enough to express the spatial information needed to fully understand a crime scene. Constructing a detailed 3D digital model would be very helpful for the investigation. For example, with the digital model, some physical measurements can still be performed even after the original scene has been changed or cleaned up.

Figure 1.2: A partially textured crime scene model from the DeltaSphere software package.

Figure 1.2 shows a view of a mock-up crime scene model rendered from a colored 3D digital model acquired by a DeltaSphere range scanner. The model is reconstructed using the DeltaSphere software. However, to register an image to the digital model using the software, users are required to manually specify the correspondences between the image and the model. It would be extremely tedious if a large number of images need to be registered.
To minimize the user interaction when registering images to a model, automatic algorithms are needed. One approach is to co-locate the camera and the scanner during data acquisition [39][10] and then optimize the camera poses based on the dependency between the intensity images of the range scans and the color images. However, it sacrifices the flexibility of color image capturing. Furthermore, the optimization is time-consuming. Another commonly used approach exploits the linear features in urban scenes [17]. It works only if there are enough systematic parallel lines.

However, in indoor room environments, images should be acquired from many different locations, as our ultimate goal is to create a view-dependent room model. In this case, the precondition of the first approach does not hold. Neither does the linear-feature approach work, as there are no systematic linear features. So far, there are no automatic algorithms to register those images to the room model.

This thesis focuses on the registration of color information to the acquired 3D geometry of the scene, and the domain of interest is indoor room environments rather than small objects. During the image acquisition, multiple color images from various viewpoints are captured. Furthermore, to allow greater flexibility and feasibility, the color camera is not tracked, so each color image is acquired with an unknown camera pose. In this thesis, our goal is to find a registration method for indoor room environments that requires as little user interaction as possible.

1.2 Contribution
The main contribution of our work is the idea of taking the approach of establishing correspondences among the color images instead of directly finding corresponding features between the 2D and 3D spaces [17][29]. The latter approach works well only for higher-level features, such as parallel straight lines, and this imposes assumptions and restrictions on the types of scenes the method can handle. For most indoor environments, these higher-level features usually exist, but they are often too few or do not appear in most of the color images due to the small field of view and short shooting distance. Our approach works for more types of scenes and even for objects.

The main problem of feature correspondence is the lack of features on large uniform surfaces. This occurs a lot in indoor environments, where large plain walls, ceilings and floors are common. We avert this problem by using light projectors to project special light patterns onto the scene surfaces to artificially introduce image features.

Our method requires the user to manually input only six pairs of correspondences between one of the color images and the 3D model. This allows the sparse model to be approximately aligned with the detailed model. We detect planes in the detailed model, and by minimizing the distances between some of the points in the sparse model and these planes, we are able to refine the multiview geometry and the registration as a whole using sparse bundle adjustment (SBA) [19]. This approach is able to achieve better registration accuracy in the face of non-uniform spatial distortion in the geometric model.

Our current goal is not to render the completed model with view-dependent reflection. Instead, we assign each point on the surface of the 3D model a single color by carefully blending colors from multiple overlapping color images. Our method takes into consideration the different exposures of the color images and the occlusion of surfaces in the 3D model. It produces a colored model with very smooth color transitions and yet preserves fine details.
1.3 Structure of the Thesis
The rest of the thesis is organized as follows:

• Chapter 2 reviews existing automatic approaches to registering color images to 3D models,

• Chapter 3 introduces the necessary background on the camera model, two-view geometry and image features,

• Chapter 5 describes the data acquisition; at the same time, the format of the range data is introduced,

• Chapter 6 presents the reconstruction of the multiview geometry from the color images,

• Chapter 7 describes the registration of the multiview geometry with the 3D model,

• Chapter 8 explains the color mapping and the adjustment of the blending result,

• Chapter 9 shows more experiment results of the colored room model and the time complexity of the whole process. Furthermore, models derived from the multiview geometry are shown,

• Chapter 10 concludes the thesis and discusses future work.
Chapter 2

Related Work

This thesis studies how to build a colored 3D model of indoor room environments. Our approach is to reconstruct the multiview geometry of the scene from images first, and then register the multiview geometry to the 3D model captured using a scanner. Thus, all the images used to reconstruct the multiview geometry are registered to the 3D model.

This chapter introduces the existing automatic approaches to registering color images to 3D models. The problems of applying those approaches to indoor environments are studied.
2.1 Automatic Registration Methods
There are two major classes of automatic registration methods: feature-matching methods and statistical-based methods.

2.1.1 Feature-based Automatic Registration

In [43], Zhao uses structure-from-motion techniques to map a continuous video onto a 3D urban model. However, the most widely used feature-matching methods match linear features between images and 3D models.

In urban environments, there are lots of structured line features. Lingyun Liu and Ioannis Stamos proposed an automatic 3D-to-2D registration method [17] for the photo-realistic rendering of urban scenes; refer to Figure 2.1 for a model. It utilizes parallelism and orthogonality constraints that naturally exist in urban scenes.
The major steps of the algorithm are:

• Extract 3D features and represent them by rectangular parallelepipeds,

• Extract 2D linear features from the images and cluster them by vanishing points. After that, the rotation is computed and the linear features are represented by rectangles,

• Match the 2D and 3D rectangular features to recover the remaining pose parameters.

2.1.2 Statistical-based Registration
Besides feature-based automatic registration, a more general multi-modal registration approach is to treat the image and the 3D model as random variables and apply statistical techniques that measure the amount of dependence between the variables. This approach is widely used in many types of multi-modal registration. Several similarity metrics, such as the mutual information metric and the Chi-Squared metric, are used to find the optimal solution; refer to Appendix A.

Pong, H.K. et al. [26] explore the mutual information between the normals of objects and the intensity of color images to do the registration. The most common methods [39][10] explore the dependence between the intensity information of color images and range images. The intensity information of range images can be captured by time-of-flight scanners using an infrared laser. First, the scanner emits the laser. Then the sensor captures the returned laser and analyzes its energy and the time of flight to get the reflected intensity and the location of the scanned point respectively. For example, Figure 2.2 is the intensity map of an office range image captured by the DeltaSphere 3000 range scanner using the infrared laser.

Figure 2.2: The intensity map of an office range image.

Nathaniel Williams et al. [39] propose an automatic statistical registration method based on rigidly mounting the digital camera and the laser scanner together. Thus, an approximately correct relative camera pose is known. The camera pose is further refined through a Chi-Squared metric nonlinear optimization between the intensity of range images and color images. Then Powell's multi-dimensional direction set method is applied to maximize the chi-square statistic over the six extrinsic parameters. Experiments have shown that the optimization method is able to consistently achieve the correct alignment when a good pose is estimated initially; refer to Figure 2.3.
Figure 2.3: Automatic alignment results. (a) The library model with three images rendered using their initial pose estimates. (b) The library model with all images aligned. Image taken from [39].

However, the major limitations of this statistical registration approach are:

• The 2D camera and the 3D range sensor must be co-located in space. It limits the flexibility of the 2D color sensing because the positioning of the 3D range sensor is usually more limited. Sometimes, many color images need to be captured from various poses (angles and locations) to create a view-dependent model,

• The 2D and 3D data must be captured at the same time, so it cannot map historical photographs or color images captured at different times onto the models.

It is feasible to use a tracker to track the relative position of the scanner and the camera. However, setting up the tracker would be tedious. Moreover, it still requires 2D images and 3D images to be captured at the same time.
2.1.3 Multi-view Geometry Approach
Besides line features and video, another type of robust feature, the Scale Invariant Feature Transform (SIFT) [20], has been used in many applications, such as object recognition [15], panorama reconstruction [3] and photo tourism [28]. SIFT keypoints are the local extrema extracted from Difference of Gaussian (DoG) images. They are invariant to scale transformation, and to affine transformation up to a certain level. A recent survey [33] shows that it is generally the most robust local feature descriptor.

Besides models reconstructed from range images, there are other types of geo-models, such as satellite maps. Some works, e.g., photo tourism [28], register color images to such models through an image-based modeling approach, which is illustrated as a special registration method here.
The photo tourism work explores photo collections of tourism locations on the Internet. SIFT features are first detected and matched. With those feature correspondences, the intrinsic and extrinsic parameters of the cameras and the multiview geometry, which is a sparse point set, are reconstructed using structure from motion (SfM) [13], with the help of initial camera parameters stored in the exchangeable image file format (EXIF) files of the images. The multiview geometry is reconstructed by adding a new view incrementally. Each time, the pose of the new view is recovered and the 3D points generated by the new view are added to the structure. Through this incremental approach using structure-from-motion techniques, a sparse point set is reconstructed from multiple color images, see Figure 2.4. The sparse point set can be registered to a geo-referenced image.

Figure 2.4: Cameras and 3D point reconstructions from photos of the Trevi Fountain on the Internet. Image taken from [28].
Trang 22The estimated point set is related to the geo-referenced image by a similaritytransform (global translation, rotation and uniform scale) To determine the cor-rect transformation, the user interactively rotates, translates and scales the pointset until it ts the provided image or map.
There are several advantages of this approach First, the 3D image sensorand 2D image sensor are completely separated Second, it allows the registration
of historical images If there are enough corresponding image features in indoorenvironments, the approach is feasible for the registration between indoor modeland images
Chapter 3
Background
Registering color images to a 3D model means recovering the parameters of the camera taking each view, which include the focal length and other intrinsic values, as well as the location and orientation of the camera. Once those parameters are known, the 3D model can be textured by simply back-projecting the images. To familiarize the reader with those parameters, the camera model is briefly introduced here.

Later on, we are going to reconstruct the multiview geometry from two-view geometries. So after introducing the camera model, the geometry of two views is discussed. Then, we go through current feature detection and matching methods, which are crucial for many applications, e.g., two-view geometry. The details of the scale invariant feature transform (SIFT), used to search for feature correspondences, are introduced.

Last, the standard nonlinear optimization method, Levenberg Marquardt optimization, is reviewed.
3.1 Camera Model

The camera model involves four coordinate systems:

• the world coordinate system,

• the camera coordinate system, whose origin is at the optical center of the camera,

• the image coordinate system on the image plane, and

• the pixel coordinate system, in which pixel locations are expressed.
A 3D point p is projected to a pixel location only after passing through those four systems. Firstly, it is transformed from the world coordinate system to the camera coordinate system. Then it is projected to the image plane. Lastly, it is transformed to the pixel coordinate system.

The transformation from the world coordinate system to the camera coordinate system is represented by an extrinsic matrix, which is formalized by a simple translation and rotation. The transformation from the camera coordinate system to the pixel coordinate system, including the projection to the image plane, is determined by the intrinsic parameters.
3.1.1 Intrinsic Parameters
For a viewing camera, the intrinsic parameters are defined as the set of parameters needed to characterize its optical, geometric, and digital characteristics. Those parameters are classified into three sets according to their functions:

• the focal length f and the skew coefficient α_c (as most cameras currently manufactured do not have skewed pixel grids, α_c is usually assumed to be zero),

• the coordinates in pixels of the image center (the principal point), (o_x, o_y), and the effective size of the pixel in the horizontal and vertical directions, (s_x, s_y),

• the lens distortion coefficients, described below.
Perspective Projection from Camera Frame to Image Coordinates

In the perspective camera model, refer to Figure 3.1, given a 3-D point p = [x_3, y_3, z_3]^T in the camera frame, its projection p' = (x, y) on the image plane satisfies

x = f x_3 / z_3,   y = f y_3 / z_3,   (3.1)

where f is the focal length.

Figure 3.1: Projection of a point from camera frame to image coordinates.
Lens Distortion
The projection from the camera frame to image coordinates is not purely projective due to the presence of the lens. Often, distortions exist, and thus the assumption that straight lines in the scene remain straight in the projected image does not hold. There are two types of distortions: radial distortion and tangential distortion.

Let (x, y) be the normalized image projection from Equation (3.1), and (x_d, y_d) the coordinates of (x, y) after distortion. Let r^2 = x^2 + y^2; then (x_d, y_d) can be evaluated by

(x_d, y_d) = D_1(x, y) (x, y) + D_2(x, y),   (3.2)

where D_1(x, y), D_2(x, y) model the radial distortion and tangential distortion respectively.

Due to the symmetry and imperfection of the lens, the most common distortions are radially symmetric, which are called radial distortions. Normally, there are two types of radial distortions, the barrel distortion and the pincushion distortion. Radial distortions affect the distance between the image center and an image point p, but do not affect the direction of the vector joining the two points. The radial distortions can be modeled by a Taylor expansion, commonly truncated as

D_1(x, y) = 1 + k_1 r^2 + k_2 r^4 + k_3 r^6.   (3.3)

When the lens is not perfectly parallel to the image plane, tangential distortion occurs. Tangential distortion is commonly modeled by

D_2(x, y) = (2 p_1 x y + p_2 (r^2 + 2 x^2),  p_1 (r^2 + 2 y^2) + 2 p_2 x y).   (3.4)
Assuming the photosensitive elements of the sensor form a rectangular grid, for a point (x_d, y_d) in the virtual image plane and the corresponding point (x_i, y_i) in pixel coordinates, we have

x_d = -(x_i - o_x) s_x,   y_d = -(y_i - o_y) s_y,   (3.5)

where (s_x, s_y) are the effective sizes of the pixel in the horizontal and vertical direction respectively. The signs change in Equation (3.5) because the orientations of the axes of the virtual image plane and the physical image plane are opposite.

In homogeneous coordinates, Equation 3.5 can be represented by

(x_i, y_i, 1)^T = [[-1/s_x, 0, o_x], [0, -1/s_y, o_y], [0, 0, 1]] (x_d, y_d, 1)^T.   (3.6)

According to Equation 3.1 and Equation 3.6, without considering the distortion, the intrinsic matrix M_int, which transforms a point (x, y, z) in camera reference coordinates to pixel coordinates (x_i, y_i), is

M_int = [[-f/s_x, 0, o_x], [0, -f/s_y, o_y], [0, 0, 1]].   (3.7)
3.1.2 Extrinsic Parameters

The extrinsic parameters describe the transformation between the world reference frame and the camera reference frame. They are:

• a 3-D translation vector, T, describing the relative position of the origins of the two reference frames, and

• a 3 × 3 rotation matrix, R, an orthogonal matrix (R^T R = R R^T = I) satisfying det(R) = 1.

The relation between a point p_w in the world frame and the same point p_c in the camera frame is

p_c = R (p_w - T) = M_ext p_w,   (3.8)

where M_ext = [R | -RT] and p_w is in homogeneous coordinates.
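To make Equations 3.7 and 3.8 concrete, the following is a minimal numpy sketch of the full world-to-pixel projection with lens distortion ignored. It is an illustration only, not code from this thesis, and all names are ours.

```python
import numpy as np

def project_point(p_world, R, T, f, sx, sy, ox, oy):
    """Project a 3D world point to pixel coordinates with the pinhole model
    described above (lens distortion ignored)."""
    # Extrinsic transform (Equation 3.8): p_c = R (p_w - T).
    p_cam = R @ (p_world - T)
    # Intrinsic matrix M_int (Equation 3.7), camera coordinates to
    # homogeneous pixel coordinates.
    M_int = np.array([[-f / sx, 0.0,     ox],
                      [0.0,     -f / sy, oy],
                      [0.0,     0.0,     1.0]])
    uvw = M_int @ p_cam
    return uvw[:2] / uvw[2]          # pixel coordinates (x_i, y_i)

# Illustrative use: a point two meters in front of an untranslated camera.
pixel = project_point(np.array([0.1, 0.2, 2.0]), np.eye(3), np.zeros(3),
                      f=0.035, sx=1e-5, sy=1e-5, ox=320.0, oy=240.0)
```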
3.1.3 Camera Calibration
The objective of camera calibration is to derive the intrinsic and extrinsic parameters of a camera given a set of images taken by the camera. Given the 3D coordinates of target points, a typical camera calibration method [14] consists of the following three steps:

1. Compute the projection matrix M using the direct linear transform (DLT),

2. Estimate the camera parameters (intrinsic and extrinsic) [37] from M, neglecting lens distortion,

3. Fit the full model including all the intrinsic parameters and apply Levenberg-Marquardt nonlinear optimization.

In the case of self-calibration [32][42], the 3D coordinates of the points of interest are also unknown and should be estimated.
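For readers who want to experiment, the sketch below calibrates a camera with OpenCV using a planar checkerboard target. This is a practical stand-in rather than the DLT-based procedure [14] described above; the pattern size and file names are assumptions.

```python
import cv2
import numpy as np

# A planar checkerboard with 9x6 inner corners; its 3D target points lie on Z = 0.
pattern = (9, 6)
target = np.zeros((pattern[0] * pattern[1], 3), np.float32)
target[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points, img_size = [], [], None
for path in ["calib_01.jpg", "calib_02.jpg", "calib_03.jpg"]:  # illustrative file names
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    img_size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(target)
        img_points.append(corners)

# Returns the reprojection error, the intrinsic matrix K, the distortion
# coefficients and one extrinsic pose per calibration image.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, img_size, None, None)
```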
3.1.4 Image Un-distortion
Because of the high-degree distortion models (refer to Equations 3.3 and 3.4), there exists no algebraic inversion of Equation 3.2 to evaluate undistorted pixels from distorted pixels directly. The most common way is to undistort the whole image at once. During the undistortion, for each pixel in the undistorted image, the following steps are applied (a sketch follows the list):

1. Derive the corresponding distorted sub-pixel coordinate from the undistorted pixel coordinate,

2. Compute the color at the distorted sub-pixel coordinate using bilinear interpolation,

3. Assign the color to the undistorted pixel coordinate.
3.2 Two-view Geometry

In the two-view geometry reconstruction, only two images are concerned. The reconstruction mainly consists of three steps: (1) corresponding feature search, (2) camera intrinsic parameter and pose recovery, and (3) 3D point recovery. In this section, assuming the camera intrinsic parameters are given, we focus on the recovery of the relative camera pose and the 3D points. Without loss of generality, the left camera is placed at the world origin, so the pose of the right camera is [R | T] in the two-view system. In this section, the 3D computer vision book [37] and the multiview geometry book [13] are taken as the references.
3.2.1 Essential Matrix and Fundamental Matrix Computation
Figure 3.2: The two-view system.

In Figure 3.2, with two views, the two camera coordinate systems are related by a rotation R and a translation T:

p_r = R p_l + T.   (3.9)

Note that the vectors p_r, T and p_r - T are coplanar; then

p_r^T (T × (p_r - T)) = 0.   (3.10)

Combining with Equation (3.9), then

p_r^T (T × (R p_l)) = 0,   (3.11)

which can be written as

p_r^T E p_l = 0,   (3.12)

where E = [T]_× R is called the essential matrix. The essential matrix is a natural representation of the epipolar geometry for known calibration, and it relates corresponding image points expressed in the camera coordinate systems. However, sometimes the two cameras may not be calibrated. To generalize the relation, the fundamental matrix F, first defined by Olivier D. Faugeras [8], is introduced. Let K_l and K_r be the intrinsic matrices of the left and right cameras respectively; then

p_r^T F p_l = 0,   (3.13)

where F = K_r^{-T} E K_l^{-1}, and p_l and p_r here are the corresponding points in the respective pixel coordinates (their normalized counterparts are K_l^{-1} p_l and K_r^{-1} p_r).

Specifically, given the corresponding point pair p_l : (x, y, 1) and p_r : (x', y', 1), Equation 3.13 is equivalent to

(x'x, x'y, x', y'x, y'y, y', x, y, 1) f = 0,   (3.14)

where f is the 9-vector made up of the entries of F in row-major order. From a set of n corresponding pairs, (x_i, y_i, 1) ↔ (x'_i, y'_i, 1) for i = 1, ..., n, we obtain a set of linear equations of the form

A f = 0,

where each row of the n × 9 matrix A is built from one pair as in Equation 3.14. The least-squares solution for f is the singular vector corresponding to the smallest singular value of A, after which the rank-2 constraint is enforced on the recovered F.

Given the intrinsic matrices, the essential matrix can be obtained from the fundamental matrix by E = K_r^T F K_l. Because E = [T]_× R, two of its singular values are equal and the third is zero. So the essential matrix estimated from noisy correspondences is usually corrected by enforcing this constraint on its singular values.
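The following numpy sketch illustrates the linear estimation of F from pixel correspondences (the rows follow Equation 3.14) and the singular-value correction of E. It omits coordinate normalization and outlier rejection, which a practical implementation would add; it is an illustration, not the thesis's implementation.

```python
import numpy as np

def fundamental_from_matches(pts_l, pts_r):
    """Linear estimate of F from n >= 8 pixel correspondences.
    Each row of pts_l / pts_r is a point (x, y)."""
    rows = []
    for (x, y), (xp, yp) in zip(pts_l, pts_r):
        rows.append([xp * x, xp * y, xp, yp * x, yp * y, yp, x, y, 1.0])
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)          # f is the right singular vector of the
    F = Vt[-1].reshape(3, 3)             # smallest singular value (A f = 0)
    U, s, Vt = np.linalg.svd(F)          # enforce the rank-2 constraint
    return U @ np.diag([s[0], s[1], 0.0]) @ Vt

def essential_from_fundamental(F, K_l, K_r):
    """E = K_r^T F K_l, with the singular values forced to (1, 1, 0)."""
    E = K_r.T @ F @ K_l
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```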
3.2.2 Camera Pose Recovery from Essential Matrix

Knowing the relative camera pose, we can compute 3D points from feature correspondences. Given the left and right intrinsic matrices, we discuss here how to recover the relative camera pose from the essential matrix. First the property of the essential matrix is studied, and then the four possible candidates of the relative camera pose extracted from the essential matrix are presented algebraically.

Property of Essential Matrix

A skew-symmetric matrix such as [T]_× can be decomposed as [T]_× = α U Z U^T, where U is orthogonal, α is a scale factor and Z = [[0, 1, 0], [-1, 0, 0], [0, 0, 0]]. Noting that, up to sign, Z = diag(1, 1, 0) W with W = [[0, -1, 0], [1, 0, 0], [0, 0, 1]], we have, up to scale, [T]_× = U diag(1, 1, 0) W U^T and E = [T]_× R = U diag(1, 1, 0) (W U^T R), which is a singular value decomposition (SVD) of E. So a 3 × 3 matrix is an essential matrix if and only if its singular values are (1, 1, 0) up to scale.

Extract Camera Pose from Essential Matrix

Let the SVD of the essential matrix be E = U diag(1, 1, 0) V^T. There are two possible factorizations E = [T]_× R, namely [T]_× = U Z U^T with R = U W V^T or R = U W^T V^T. By [T]_× T = 0, it follows that T = U (0, 0, 1)^T = u_3, the last column of U. However, the sign of E, and consequently T, cannot be determined. Thus, corresponding to a given essential matrix, there are four possible choices of the camera matrix, based on the two possible choices of R and the two possible signs of T: specifically, [U W V^T | ±u_3] and [U W^T V^T | ±u_3].

Geometrically, there is only one correct relative camera pose. The ambiguity of camera poses can be removed by checking that all recovered points should be in front of both views.
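A minimal sketch of this disambiguation is shown below. It assumes an essential matrix E and one inlier correspondence given in normalized camera coordinates; `triangulate` is a placeholder for any 3D point recovery routine, for example the DLT sketch in the next subsection. The names and structure are ours, not the thesis's code.

```python
import numpy as np

W = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

def pose_candidates(E):
    """The four possible [R | T] factorizations of an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:             # keep proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    u3 = U[:, 2]
    return [(U @ W @ Vt, u3), (U @ W @ Vt, -u3),
            (U @ W.T @ Vt, u3), (U @ W.T @ Vt, -u3)]

def pick_pose(E, x_l, x_r, triangulate):
    """Resolve the ambiguity: the correct pose reconstructs the point
    in front of both views (positive depth in both cameras)."""
    for R, T in pose_candidates(E):
        X = triangulate(np.eye(3, 4), np.hstack([R, T.reshape(3, 1)]), x_l, x_r)
        if X[2] > 0 and (R @ X + T)[2] > 0:
            return R, T
    return None
```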
3.2.3 Three-D Point Recovery
Given the relative camera pose, the 3D point corresponding to any feature correspondence can be recovered. There are commonly two ways to recover 3D points: the direct linear transformation method (DLT) [13] and the triangulation method [13].

The direct linear transformation method (DLT) is commonly used. Let the projection matrices of the two views be M and M' respectively. For any inlier feature correspondence (x, y, 1) ↔ (x', y', 1), the homogeneous 3D point X satisfies x (m_3 X) - (m_1 X) = 0 and y (m_3 X) - (m_2 X) = 0, plus the two analogous equations for (x', y') and M', where m_i denotes the i-th row of M. Stacking the four equations gives a linear system whose least-squares solution, the singular vector of the smallest singular value, is the recovered point.
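A minimal numpy sketch of the DLT triangulation just described, under the assumption that M and M' are the 3 × 4 projection matrices of the two views, is given below; it can also serve as the `triangulate` placeholder used in the previous subsection's sketch.

```python
import numpy as np

def triangulate(M, M_prime, x_l, x_r):
    """DLT: recover the 3D point from one correspondence, given the two
    3x4 projection matrices M and M'.  x_l = (x, y), x_r = (x', y')."""
    A = np.vstack([
        x_l[0] * M[2] - M[0],               # x * m3 - m1 = 0
        x_l[1] * M[2] - M[1],               # y * m3 - m2 = 0
        x_r[0] * M_prime[2] - M_prime[0],   # x' * m3' - m1' = 0
        x_r[1] * M_prime[2] - M_prime[1],   # y' * m3' - m2' = 0
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                              # homogeneous least-squares solution
    return X[:3] / X[3]                     # dehomogenize
```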
3.3 Image Feature Detection and Matching
Image feature detection and matching is the first and most crucial step in many applications, such as motion tracking, object recognition and two-view geometry reconstruction.
Generally, image feature detection intends to find the feature locations. During feature matching, descriptors used to represent the features are matched. For example, after detection of corners, for each corner, the surrounding subregion can be used as the corner descriptor. To match corners, those subregions are matched using template matching techniques.

To detect the same point independently in all images, the features used should be repeatable. For each point, to recognize the correspondence correctly, a reliable and distinctive descriptor should be used.

In the rest of the current section, we study the most commonly used features, corners, and then the scale invariant feature transform (SIFT), which has proven to be the most robust local invariant feature descriptor. The subsection introducing SIFT is based on the work of David Lowe [20].
3.3.1 Corner Detection and Matching
A corner is the intersection of two edges. Unlike edge features, which suffer from the aperture problem, a corner's location is well-defined. The two most used corner detection approaches are:

• Methods that slide a small window over the image and measure the intensity change. If a large response is generated whichever direction the window moves in, a corner is detected,

• Methods based on the fact that the intensity surface has two directions with significant intensity discontinuities at corners.

When the correspondences of corners are searched, the subregions around the corners are matched by template matching. There are many template matching methods, such as squared difference, cross correlation and correlation coefficient; details can be found in the manual of the cvMatchTemplate function in the OpenCV library.
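As a small illustration (not code from this thesis), the patch around a detected corner can be matched with OpenCV's matchTemplate, the modern counterpart of cvMatchTemplate, using the normalized correlation coefficient. The window size and names are assumptions.

```python
import cv2

def match_corner(img_a, img_b, corner, half=7):
    """Find the best match in img_b for the patch around `corner` in img_a,
    using normalized-correlation-coefficient template matching."""
    x, y = corner
    patch = img_a[y - half:y + half + 1, x - half:x + half + 1]
    scores = cv2.matchTemplate(img_b, patch, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)
    # best_loc is the top-left corner of the best-matching window.
    return (best_loc[0] + half, best_loc[1] + half), best_score
```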
3.3.2 Scale Invariant Feature Transform (SIFT)
Blobs are circular regions whose gray-scale intensity differs from their surroundings. They have a center of mass and a scale. Since the Laplacian represents the second-order intensity change in the image, local extrema of certain Laplacian functions can be treated as blobs.

Mikolajczyk and Schmid [23] show that local extrema of the scale-normalized Laplacian of Gaussian provide the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner detection function. So local extrema of the Laplacian of Gaussian are treated as blobs from now on.
An image is smoothed by convolving it with a variable-scale Gaussian

G(x, y, σ) = 1 / (2πσ^2) · exp(-(x^2 + y^2) / (2σ^2)).

The scale-normalized Laplacian of Gaussian can be computed through a finite-difference approximation to ∂G/∂σ, using the difference of nearby scales at kσ and σ:

DoG(x, y, σ) = G(x, y, kσ) - G(x, y, σ) ≈ (k - 1) σ^2 ∇^2 G.

The difference-of-Gaussian function convolved with the image I(x, y) generates a Difference of Gaussians (DoG) image DoG(x, y, σ) ∗ I(x, y), which is a close approximation to the Laplacian of Gaussian images. Scale Invariant Feature Transform (SIFT) [20] interest points are local extrema of the Difference of Gaussians (DoG) images. To achieve scale invariance, images are downsized to different levels. Each level is called an octave; refer to Figure 3.3.
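A compact sketch of building one octave of Gaussian and DoG images follows. The number of scales and the factor k are illustrative choices consistent with the description above, not parameters taken from this thesis.

```python
import cv2

def dog_octave(img, sigma=1.6, scales=5):
    """Build one octave: Gaussian images at sigma, k*sigma, ... and their
    differences (DoG). `img` is expected to be a grayscale float image."""
    k = 2.0 ** (1.0 / (scales - 3))              # scale step between levels
    gaussians = [cv2.GaussianBlur(img, (0, 0), sigma * (k ** i))
                 for i in range(scales)]
    dogs = [gaussians[i + 1] - gaussians[i] for i in range(scales - 1)]
    # The next octave starts from a Gaussian image with doubled sigma,
    # downsampled by a factor of two.
    next_base = cv2.resize(gaussians[-3], None, fx=0.5, fy=0.5,
                           interpolation=cv2.INTER_NEAREST)
    return gaussians, dogs, next_base
```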
Local extrema of the DoG images are detected by comparing the intensity of a pixel of interest with the intensities of its 9 - 8 - 9 neighboring pixels at the current and adjacent scales; refer to Figure 3.4.
Figure 3.4: Local extrema of DoG images are detected by comparing a pixel (red) with its 26 neighbors (blue) in 3 × 3 regions at the current and adjacent scales. Image taken from [30].

A 3D quadratic function is fit to the local sample points to determine the sub-pixel location of the extremum using Newton's method. The quadratic function is also used to reject unstable extrema with low contrast. Another type of unstable extremum is a SIFT point lying along an edge, where the DoG surface has a large principal curvature α across the edge and a small one β in the perpendicular direction. To eliminate those SIFT points, the ratio γ = α/β is thresholded, say, if γ < τ then the SIFT point is stable. The curvatures are estimated from the Hessian matrix H, computed by taking differences of neighboring points,

H = [[D_xx, D_xy], [D_xy, D_yy]],

and the test is applied through Tr(H)^2 / Det(H) = (γ + 1)^2 / γ, where Tr(H), Det(H) are the trace and determinant of H respectively.
To assign an orientation to each keypoint, the Gaussian image G closest to the keypoint's scale is first determined. Then gradients in the area around the keypoint (x, y) are computed. The orientation histogram is built from the gradient magnitude m(x, y) and orientation θ(x, y):

m(x, y) = sqrt((G(x + 1, y) - G(x - 1, y))^2 + (G(x, y + 1) - G(x, y - 1))^2),
θ(x, y) = tan^{-1}((G(x, y + 1) - G(x, y - 1)) / (G(x + 1, y) - G(x - 1, y))).

The orientation histogram covers the 360-degree range of orientations using 36 bins. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window centered at the keypoint location. Peaks in the orientation histogram are the dominant directions of the local gradients.
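The sketch below computes the gradient magnitude, orientation and a 36-bin orientation histogram around a keypoint. For brevity, the Gaussian weighting of the samples is omitted, and the keypoint is assumed to lie away from the image border; it is an illustration rather than the thesis's implementation.

```python
import numpy as np

def orientation_histogram(G, keypoint, radius=8, bins=36):
    """Accumulate gradient magnitudes around a keypoint in the Gaussian
    image G into an orientation histogram; return the dominant direction."""
    x0, y0 = keypoint
    hist = np.zeros(bins)
    for y in range(y0 - radius, y0 + radius + 1):
        for x in range(x0 - radius, x0 + radius + 1):
            dx = G[y, x + 1] - G[y, x - 1]
            dy = G[y + 1, x] - G[y - 1, x]
            m = np.hypot(dx, dy)                     # gradient magnitude
            theta = np.arctan2(dy, dx)               # orientation in [-pi, pi]
            b = int(((theta + np.pi) / (2 * np.pi)) * bins) % bins
            hist[b] += m                             # Gaussian weighting omitted
    dominant = hist.argmax() * (2 * np.pi / bins) - np.pi
    return hist, dominant
```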
Feature Descriptor
Descriptors are necessary to match SIFT keypoints from different images. A SIFT descriptor consists of a set of orientation histograms of subregions around the keypoint; refer to Figure 3.5. Furthermore, the coordinates of the descriptor are rotated relative to the keypoint's dominant orientation to achieve orientation invariance.

Descriptors are used to ease the matching. A simple matching strategy is "exhaustive search": between two images A and B, for any descriptor p from A, the descriptor from B with the smallest Euclidean distance to p would be the optimal match. There may be similar features causing ambiguity during the matching. To eliminate the ambiguities, a threshold is applied to the ratio of the distances of the two best matches. If the ratio satisfies the constraint, the best match is selected as the correspondence of p. Otherwise, it fails to find any correspondence of p among the descriptors of B. Figure 3.6 shows a matching result where most of the correspondences are visually correct.
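A minimal sketch of this exhaustive search with the ratio test is given below; the descriptors are assumed to be rows of n × 128 arrays, and the ratio threshold is an illustrative value, not one taken from this thesis.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """For each descriptor in A, find its two nearest neighbours in B and
    accept the best one only if it is clearly better than the second best."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distances to all of B
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < ratio * dists[j2]:            # the ratio test
            matches.append((i, j1))
    return matches
```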
Figure 3.5: A keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into orientation histograms summarizing the contents over 4×4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region. This figure shows a 2×2 descriptor array computed from an 8×8 set of samples, whereas the experiments in this paper use 4×4 descriptors computed from a 16×16 sample array. The image and the description taken from [20].

3.4 Levenberg Marquardt Non-linear Optimization

Levenberg Marquardt non-linear optimization has become the standard nonlinear optimization method. It is a damped Gauss-Newton method [16][22]. The damping parameter influences both the direction and the size of the step, and it leads to a method without a specific line search. The optimization is achieved by controlling the damping parameter adaptively: it raises the damping parameter if a step fails to reduce the error; otherwise it reduces the damping parameter. In this manner, the optimization is able to alternate between a slow descent approach when far from the optimum and fast quadratic convergence in the neighborhood of the optimum.
In this thesis, Levenberg Marquardt optimization is used for the nonlinear optimization. More specifically, an existing package, SBA [19], which applies the sparse Levenberg Marquardt method to the optimization of multiview geometry, is used. A more detailed introduction to nonlinear optimization, including the Levenberg Marquardt method, can be found in [21].
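For illustration, a bare-bones version of the adaptive damping scheme for a generic least-squares problem is sketched below; `residual` and `jacobian` are placeholders supplied by the caller, and the update constants are illustrative. The actual optimization in this thesis relies on the SBA package.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x, lam=1e-3, iters=50):
    """Minimize ||residual(x)||^2 with adaptive damping: raise lambda when a
    step fails to reduce the error, lower it when the step succeeds."""
    err = np.sum(residual(x) ** 2)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        A = J.T @ J + lam * np.eye(x.size)           # damped normal equations
        step = np.linalg.solve(A, -J.T @ r)
        x_new = x + step
        err_new = np.sum(residual(x_new) ** 2)
        if err_new < err:                            # accept: behave more like Gauss-Newton
            x, err, lam = x_new, err_new, lam * 0.1
        else:                                        # reject: behave more like gradient descent
            lam *= 10.0
    return x
```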
Figure 3.6: SIFT matching result: the bottom image is the SIFT matching result of the top images.