Automatic Registration of Color Images
to 3D Geometry of Indoor Environments

LI YUNZHEN
(B.Comp.(Hons), NUS)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2008
Acknowledgements

Firstly, I would like to thank my supervisor, Dr. Low Kok Lim, for his invaluable guidance and constant support in this research. I would also like to thank Dr. Cheng Ho-lun and A/P Tan Tiow Seng for their help in my graduate life. I also thank Prashast Khandelwal for his honours-year work on this research.

Secondly, I would like to thank all my friends, especially Yan Ke and Pan Binbin. We have shared the postgraduate life for two years. My thanks to all the people in the graphics lab for their encouragement and friendship.

Lastly, I would like to thank all my family members.
Contents

Acknowledgements

Chapter 1 Introduction
1.1 Motivation and Goal
1.2 Contribution
1.3 Structure of the Thesis

Chapter 2 Related Work
2.1 Automatic Registration Methods
2.1.1 Feature-based Automatic Registration
2.1.2 Statistical-based Registration
2.1.3 Multi-view Geometry Approach

Chapter 3 Background
3.1 Camera Model
3.1.1 Intrinsic Parameters
3.1.2 Extrinsic Parameters
3.1.3 Camera Calibration
3.1.4 Image Un-distortion
3.2 Two-view Geometry
3.2.1 Essential Matrix and Fundamental Matrix Computation
3.2.2 Camera Pose Recovery from Essential Matrix
3.2.3 Three-D Point Recovery
3.3 Image Feature Detection and Matching
3.3.1 Corner Detection and Matching
3.3.2 Scale Invariant Feature Transform (SIFT)
3.4 Levenberg Marquardt Non-linear Optimization

5.1 Data Acquisition
5.1.1 Range Data Representation
5.1.2 Image Data Capturing
5.2 Camera Calibration and Image Un-distortion
5.3 SIFT-Keypoint Detection and Matching

Chapter 6 Multiview Geometry Reconstruction
6.1 Camera Pose Recovery in Two-view System
6.2 Register Two-view System to Multiview System
6.2.1 Scale Computation
6.2.2 Unregistered Camera Pose Computation
6.2.3 Last Camera Pose Refinement
6.3 Structure Extension and Optimization
6.3.1 Three-D Point Recovery from Multi-views
6.4 Outliers Detection
6.4.1 Structure Optimization

Chapter 7 Registration of Multiview Geometry with 3D Model
7.1 User Guided Registration of Multiview Geometry with 3D Model
7.1.1 Semi-automatic Registration System
7.1.2 Computing Scale between Multiview Geometry and the 3D Model
7.1.3 Deriving Poses of other Views in the Multiview System
7.2 Plane-Constrained Optimization

Chapter 8 Color Mapping and Adjustment
8.1 Occlusion Detection and Sharp Depth Boundary Mark Up
8.1.1 Depth Buffer Rendering
8.1.2 Occlusion Detection
8.1.3 Depth Boundary Mask Image Generation
8.2 Blending
8.2.1 Exposure Unification
8.2.2 Weighted Blending
8.2.3 Preservation of Details

Chapter 9 Experiment Results and Time Analysis
9.1 Results of Multiview Geometry Reconstruction
9.2 Results of Textured Room Models
9.3 Related Image Based Modeling Results
9.4 Time Analysis of the Automatic Registration Method

Chapter 10 Conclusion and Future Work

References

Appendix A Information-theoretic Metric
A.1 Mutual Information Metric
A.1.1 Basic Information Theory
A.1.2 Mutual Information Metric Evaluation between two Images
A.2 Chi-Squared Test
A.2.1 Background
A.2.2 Chi-Squared Test about Dependence between two Images
A.3 Normalized Cross Correlation (NCC)
A.3.1 Correlation
A.3.2 Normalized Cross Correlation (NCC) between two Images

Appendix B Methods to check whether a Point is inside a Triangle
List of Figures

Figure 1.1: Sculpture from the Parthenon. This model shows the presentation of the peplos, or robe of Athena. Image taken from [31].
Figure 1.2: A partially textured crime scene model from the DeltaSphere software package.
Figure 2.1: Details of texture-maps for a building. Those images verify the high accuracy of the automated algorithm. Images taken from [17].
Figure 2.3: Automatic alignment results. (a) The library model with three images rendered using their initial pose estimates. (b) The library model with all images aligned. Image taken from [39].
Figure 2.4: Cameras and 3D point reconstructions from photos of the Trevi Fountain on the Internet. Image taken from [28].
Figure 3.1: Projection of a point from camera frame to image coordinates.
Figure 3.2: The two-view system.
Figure 3.3: Difference of Gaussian images are generated by subtracting adjacent Gaussian images for each scale level. Image taken from [30].
Figure 3.4: Local extrema of DoG images are detected by comparing a pixel (red) with its 26 neighbors (blue) in 3 × 3 regions at the current and adjacent scales. Image taken from [30].
Figure 3.5: A keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into orientation histograms summarizing the contents over 4×4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region. This figure shows a 2×2 descriptor array computed from an 8×8 set of samples, whereas the experiments in this paper use 4×4 descriptors computed from a 16×16 sample array. The image and the description taken from [20].
Figure 3.6: SIFT matching result: the bottom image is the SIFT matching result of the top images.
Figure 5.1: Equipment used during data acquisition. Left is the DeltaSphere 3000 with a laptop, top right shows an NEC NP50 projector and bottom right shows a Canon 40D camera.
Figure 5.2: The recovery of a 3D point in the right-hand coordinate system.
Figure 5.3: The intensity image of the RTPI. Each pixel in the intensity image refers to a 3D point.
Figure 5.4: The SIFT pattern.
Figure 5.5: Feature connected component: an outlier case.
Figure 6.1: The associated views: one is patterned and the other is the normal view.
Figure 6.2: The multi-view system of the room. The blue points are the 3D points recovered from SIFT features. Those pyramids represent the cameras at the recovered locations and orientations.
Figure 7.1: The graphic interface of the semi-automatic registration system. The top-left sub-window shows the intensity map of the range image and the top-right sub-window shows a color image. Those colored points are user-specified feature locations.
Figure 7.2: The registration result using back-projection.
Figure 7.3: The feature point inside the projected triangle △abc.
Figure 7.4: The registered multiview system and the green planes detected from the model.
Figure 7.5: The plane-constrained multiview system together with the ...
... details. (Right) Result of weighted blending with preservation of details.
Figure 8.4: (Top) The dominant registration result. (Mid) The weighted blended result. (Bottom) Final registration result of weighted blending with preservation of details.
Figure 9.1: (Left) A view of the feature-paper-wrapped box. (Right) The reconstructed multiview geometry, which contains 26 views.
Figure 9.2: (Left) An overview of the multiview geometry of a cone-shaped object. (Right) The side view of the multiview geometry.
Figure 9.3: (Left) A feature-pattern-projected room image. (Right) The reconstructed multiview geometry.
Figure 9.4: (Left) A far view of the River Walk building. (Right) The reconstructed multiview geometry, which contains 91 views.
Figure 9.5: (Top) The result captured by a virtual camera inside the colored 3D model together with camera views recovered from the multiview geometry reconstruction. (Bottom) 3D renderings of the final colored model.
Figure 9.6: Registration of the multiview geometry with another scanned model. The top image is the intensity image of the model with color registered. The mid image is a view inside the model and the bottom two images show the 3D model.
Figure 9.7: The six images are ordered from left to right, from top to bottom. Image 1 shows the reconstructed 3D points together with camera views; Image 2 is the top view of the point set; Image 3 shows selecting the region of the model; Image 4 shows the contour linked up by lines; Image 5 shows the model reconstructed; Image 6 shows the textured model. The last image is a color image taken by a camera.
Figure 9.8: The top two images are views of the 3D reconstructed color model from different angles. The bottom two images are views of the 3D reconstructed model, in which the little red points are the 3D points recovered and the red planes represent the cameras.
Keywords: Image-to-geometry registration, 2D-to-3D registration, range scanning, multiview geometry, SIFT, image blending.
Chapter 1

Introduction

1.1 Motivation and Goal
Creating 3D, color, computer graphics models of scenes and objects from the real world has various applications, such as digital cultural heritage preservation, crime forensics, computer games and so on.

Generating digital reconstructions of historical or archaeological sites with enough fidelity has become a focus in the area of virtual heritage. With digital reconstructions, cultural heritages can be preserved and even reconstructed. In 2003, Jessi Stumpfel et al. [31] presented the digital reunification of the Parthenon and its sculptures, see Figure 1.1. Today, the modern Acropolis Parthenon is being reconstructed with the help of the digital Parthenon.

Figure 1.1: Sculpture from the Parthenon. This model shows the presentation of the peplos, or robe of Athena. Image taken from [31].
In criminal investigation, words and images are often not enough to express the spatial information needed to fully understand a crime scene. Constructing a detailed 3D digital model would be very helpful for the investigation. For example, with the digital model, some physical measurements can still be performed even after the original scene has been changed or cleaned up.

Figure 1.2: A partially textured crime scene model from the DeltaSphere software package.

Figure 1.2 shows a view of a mock-up crime scene model rendered from a colored 3D digital model acquired by a DeltaSphere range scanner. The model is reconstructed using the DeltaSphere software. However, to register an image to the digital model using the software, users are required to manually specify the correspondences between the image and the model. It would be extremely tedious if a large number of images need to be registered.
To minimize the user interaction when registering images to a model, automatic algorithms are needed. One approach is to co-locate the camera and the scanner during data acquisition [39][10] and then optimize the camera poses based on the dependency between the intensity images of the range scans and the color images. However, it sacrifices the flexibility of color image capturing. Furthermore, the optimization is time-consuming. Another commonly used approach exploits the linear features in urban scenes [17]. It works only if there are enough systematic parallel lines.

However, in indoor room environments, images should be acquired from many different locations, as our ultimate goal is to create a view-dependent room model. In this case, the precondition of the first approach does not hold. Neither does the linear-feature approach work, as there are no systematic linear features. So far, there are no automatic algorithms to register those images to the room model.

This thesis focuses on the registration of color information to the acquired 3D geometry of the scene, and the domain of interest is indoor room environments rather than small objects. During the image acquisition, multiple color images from various viewpoints are captured. Furthermore, to allow greater flexibility and feasibility, the color camera is not tracked, so each color image is acquired with an unknown camera pose. In this thesis, our goal is to find a registration method for indoor room environments that requires as little user interaction as possible.

1.2 Contribution
The main contribution of our work is the idea of taking the approach of establishing correspondences among the color images instead of directly finding corresponding features between the 2D and 3D spaces [17][29]. The latter approach works well only for higher-level features, such as parallel straight lines, and this imposes assumptions and restrictions on the types of scenes the method can handle. For most indoor environments, these higher-level features usually exist, but they are often too few or do not appear in most of the color images due to the small field of view and short shooting distance. Our approach works for more types of scenes and even for objects.

The main problem of feature correspondence is the lack of features on large uniform surfaces. This occurs a lot in indoor environments, where large plain walls, ceilings and floors are common. We avert this problem by using light projectors to project special light patterns onto the scene surfaces to artificially introduce image features.

Our method requires the user to manually input only six pairs of correspondences between one of the color images and the 3D model. This allows the sparse model to be approximately aligned with the detailed model. We detect planes in the detailed model, and by minimizing the distances between some of the points in the sparse model and these planes, we are able to refine the multiview geometry and the registration as a whole using sparse bundle adjustment (SBA) [19]. This approach is able to achieve better registration accuracy in the face of non-uniform spatial distortion in the geometric model.

Our current goal is not to render the completed model with view-dependent reflection. Instead, we assign each point on the surface of the 3D model a single color by carefully blending colors from multiple overlapping color images. Our method takes into consideration the different exposures of the color images and the occlusion of surfaces in the 3D model. It produces a colored model with very smooth color transitions and yet preserves fine details.
1.3 Structure of the Thesis
The rest of the thesis is organized as follows:

• Chapter 2 reviews existing automatic approaches to registering color images to 3D models,

• Chapter 3 introduces the necessary background on the camera model, two-view geometry and image features,

• Chapter 5 describes the data acquisition; at the same time, the format of the range data is introduced,

• Chapter 6 presents the reconstruction of the multiview geometry from the color images,

• Chapter 7 describes the registration of the multiview geometry with the 3D model,

• Chapter 8 explains the color mapping and the adjustment of the blending result,

• Chapter 9 shows more experiment results of the colored room model and the time complexity of the whole process. Furthermore, models derived from the multiview geometry are shown,

• Chapter 10 concludes the thesis and discusses future work.
Chapter 2

Related Work

This thesis studies how to build a colored 3D model of indoor room environments. Our approach is to reconstruct the multiview geometry of the scene from images first, and then register the multiview geometry to the 3D model captured using a scanner. Thus, all the images used to reconstruct the multiview geometry are registered to the 3D model.

This chapter introduces the existing automatic approaches to registering color images to 3D models. The problems of applying those approaches to indoor environments are studied.
2.1 Automatic Registration Methods
There are two major classes of automatic registration methods: feature-matching methods and statistical-based methods.

2.1.1 Feature-based Automatic Registration

In [43], Zhao uses structure-from-motion techniques to map a continuous video onto a 3D urban model. However, the most widely used feature-matching methods match linear features between images and 3D models.

In urban environments, there are lots of structured line features. Lingyun Liu and Ioannis Stamos proposed an automatic 3D-to-2D registration method [17] for the photo-realistic rendering of urban scenes; refer to Figure 2.1 for a model. It utilizes parallelism and orthogonality constraints that naturally exist in urban scenes.
The major steps of the algorithm are:

• Extract 3D features and represent them by rectangular parallelepipeds,

• Extract 2D linear features from the images and cluster them by vanishing points. After that, the rotation is computed and the linear features are represented by rectangles,

• Match the 2D and 3D rectangular features to recover the remaining pose parameters.

2.1.2 Statistical-based Registration
Besides feature-based automatic registration, a more general multi-modal registration approach is to treat the image and the 3D model as random variables and apply statistical techniques that measure the amount of dependence between the variables. This approach is widely used in many types of multi-modal registration. Several similarity metrics, such as the mutual information metric and the Chi-Squared metric, are used to find the optimal solution; refer to Appendix A.

Pong, H.K. et al. [26] explore the mutual information between the normals of objects and the intensity of color images to do the registration. The most common methods [39][10] explore the dependence between the intensity information of color images and range images. The intensity information of range images can be captured by time-of-flight scanners using an infrared laser. First, the scanner emits the laser. Then the sensor captures the returned laser and analyzes its energy and the time of flight to get the reflected intensity and the location of the scanned point respectively. For example, Figure 2.2 is the intensity map of an office range image captured by the DeltaSphere 3000 range scanner using the infrared laser.

Figure 2.2: The intensity map of an office range image.

Nathaniel Williams et al. [39] propose an automatic statistical registration method based on rigidly mounting the digital camera and the laser scanner together. Thus, an approximately correct relative camera pose is known. The camera pose is further refined through a Chi-Squared metric nonlinear optimization between the intensity of range images and color images. Then Powell's multi-dimensional direction set method is applied to maximize the chi-square statistic over the six extrinsic parameters. Experiments have shown that the optimization method is able to consistently achieve the correct alignment when a good pose is estimated initially; refer to Figure 2.3.
Figure 2.3: Automatic alignment results. (a) The library model with three images rendered using their initial pose estimates. (b) The library model with all images aligned. Image taken from [39].

However, the major limitations of this statistical registration approach are:

• The 2D camera and the 3D range sensor must be co-located in space. It limits the flexibility of the 2D color sensing because the positioning of the 3D range sensor is usually more limited. Sometimes, many color images need to be captured from various poses (angles and locations) to create a view-dependent model,

• The 2D and 3D data must be captured at the same time, so it cannot map historical photographs or color images captured at different times onto the models.

It is feasible to use a tracker to track the relative position of the scanner and the camera. However, setting up the tracker would be tedious. Moreover, it still requires 2D images and 3D images to be captured at the same time.
2.1.3 Multi-view Geometry Approach
Besides line features and video, another type of robust feature, the Scale Invariant Feature Transform (SIFT) [20], has been used in many applications, such as object recognition [15], panorama reconstruction [3] and photo tourism [28]. SIFT keypoints are the local extrema extracted from Difference of Gaussian (DoG) images. They are invariant to scale transformation, and to affine transformation up to a certain level. A recent survey [33] shows that it is generally the most robust local feature descriptor.

Besides models reconstructed from range images, there are other types of geo-models, such as satellite maps. Some works, e.g., photo tourism [28], register color images to such models through an image-based modeling approach, which is illustrated as a special registration method here.
The photo tourism work explores photo collections of tourism locations on the Internet. SIFT features are first detected and matched. With those feature correspondences, the intrinsic and extrinsic parameters of the cameras and the multiview geometry, which is a sparse point set, are reconstructed using structure from motion (SfM) [13], with the help of initial camera parameters stored in the exchangeable image file format (EXIF) files of the images. The multiview geometry is reconstructed by adding a new view incrementally. Each time, the pose of the new view is recovered and the 3D points generated by the new view are added to the structure. Through this incremental approach using structure-from-motion techniques, a sparse point set is reconstructed from multiple color images, see Figure 2.4. The sparse point set can be registered to a geo-referenced image.

Figure 2.4: Cameras and 3D point reconstructions from photos of the Trevi Fountain on the Internet. Image taken from [28].
Trang 22The estimated point set is related to the geo-referenced image by a similaritytransform (global translation, rotation and uniform scale) To determine the cor-rect transformation, the user interactively rotates, translates and scales the pointset until it ts the provided image or map.
There are several advantages of this approach First, the 3D image sensorand 2D image sensor are completely separated Second, it allows the registration
of historical images If there are enough corresponding image features in indoorenvironments, the approach is feasible for the registration between indoor modeland images
Chapter 3
Background
Registering color images to a 3D model means recovering the parameters of the camera taking each view, which include the focal length and other intrinsic values, as well as the location and orientation of the camera. Once those parameters are known, the 3D model can be textured by simply back-projecting the images. To familiarize the reader with those parameters, the camera model is briefly introduced here.

Later on, we are going to reconstruct the multiview geometry from two-view geometries. So after introducing the camera model, the geometry of two views is discussed. Then, we go through current feature detection and matching methods, which are crucial for many applications, e.g., two-view geometry. The details of the scale invariant feature transform (SIFT), used to search for feature correspondences, are introduced.

Last, the standard nonlinear optimization method, Levenberg Marquardt optimization, is reviewed.
3.1 Camera Model

The camera model involves four coordinate systems:

• the world coordinate system,

• the camera coordinate system, whose origin is at the optical center of the camera,

• the image coordinate system on the image plane, and

• the pixel coordinate system, in which pixel locations are expressed.
A 3D point p is projected to a pixel location only after passing through those four systems. Firstly, it is transformed from the world coordinate system to the camera coordinate system. Then it is projected to the image plane. Lastly, it is transformed to the pixel coordinate system.

The transformation from the world coordinate system to the camera coordinate system is represented by an extrinsic matrix, which is formalized by a simple translation and rotation. The transformation from the camera coordinate system to the pixel coordinate system, including the projection to the image plane, is determined by the intrinsic parameters.
3.1.1 Intrinsic Parameters
For a viewing camera, the intrinsic parameters are defined as the set of parameters needed to characterize its optical, geometric, and digital characteristics. Those parameters are classified into three sets according to their functions:

• the focal length f and the skew coefficient α_c (as most cameras currently manufactured do not have skewed pixel grids, α_c is usually assumed to be zero),

• the coordinates in pixels of the image center (the principal point), (o_x, o_y), and the effective size of the pixel in the horizontal and vertical directions, (s_x, s_y),

• the lens distortion coefficients, described below.
Perspective Projection from Camera Frame to Image Coordinates

In the perspective camera model, refer to Figure 3.1, given a 3-D point p = [x_3, y_3, z_3]^T in the camera frame, its projection p' = (x, y) on the image plane satisfies

x = f x_3 / z_3,   y = f y_3 / z_3,   (3.1)

where f is the focal length.

Figure 3.1: Projection of a point from camera frame to image coordinates.
Lens Distortion
The projection from the camera frame to image coordinates is not purely projective due to the presence of the lens. Often, distortions exist, and thus the assumption that straight lines in the scene remain straight in the projected image does not hold. There are two types of distortions: radial distortion and tangential distortion.

Let (x, y) be the normalized image projection from Equation (3.1), and (x_d, y_d) the coordinates of (x, y) after distortion. Let r^2 = x^2 + y^2; then (x_d, y_d) can be evaluated by

(x_d, y_d) = D_1(x, y) (x, y) + D_2(x, y),   (3.2)

where D_1(x, y), D_2(x, y) model the radial distortion and tangential distortion respectively.

Due to the symmetry and imperfection of the lens, the most common distortions are radially symmetric, which are called radial distortions. Normally, there are two types of radial distortions, the barrel distortion and the pincushion distortion. Radial distortions affect the distance between the image center and an image point p, but do not affect the direction of the vector joining the two points. The radial distortions can be modeled by a Taylor expansion, commonly truncated as

D_1(x, y) = 1 + k_1 r^2 + k_2 r^4 + k_3 r^6.   (3.3)

When the lens is not perfectly parallel to the image plane, tangential distortion occurs. Tangential distortion is commonly modeled by

D_2(x, y) = (2 p_1 x y + p_2 (r^2 + 2 x^2),  p_1 (r^2 + 2 y^2) + 2 p_2 x y).   (3.4)
Assuming the photosensitive elements of the sensor form a rectangular grid, for a point (x_d, y_d) in the virtual image plane and the corresponding point (x_i, y_i) in pixel coordinates, we have

x_d = -(x_i - o_x) s_x,   y_d = -(y_i - o_y) s_y,   (3.5)

where (s_x, s_y) are the effective sizes of the pixel in the horizontal and vertical direction respectively. The signs change in Equation (3.5) because the orientations of the axes of the virtual image plane and the physical image plane are opposite.

In homogeneous coordinates, Equation 3.5 can be represented by

(x_i, y_i, 1)^T = [[-1/s_x, 0, o_x], [0, -1/s_y, o_y], [0, 0, 1]] (x_d, y_d, 1)^T.   (3.6)

According to Equation 3.1 and Equation 3.6, without considering the distortion, the intrinsic matrix M_int, which transforms a point (x, y, z) in camera reference coordinates to pixel coordinates (x_i, y_i), is

M_int = [[-f/s_x, 0, o_x], [0, -f/s_y, o_y], [0, 0, 1]].   (3.7)
3.1.2 Extrinsic Parameters

The extrinsic parameters describe the transformation between the world reference frame and the camera reference frame. They are:

• a 3-D translation vector, T, describing the relative position of the origins of the two reference frames, and

• a 3 × 3 rotation matrix, R, an orthogonal matrix (R^T R = R R^T = I) satisfying det(R) = 1.

The relation between a point p_w in the world frame and the same point p_c in the camera frame is

p_c = R (p_w - T) = M_ext p_w,   (3.8)

where M_ext = [R | -RT] and p_w is in homogeneous coordinates.
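To make Equations 3.7 and 3.8 concrete, the following is a minimal numpy sketch of the full world-to-pixel projection with lens distortion ignored. It is an illustration only, not code from this thesis, and all names are ours.

```python
import numpy as np

def project_point(p_world, R, T, f, sx, sy, ox, oy):
    """Project a 3D world point to pixel coordinates with the pinhole model
    described above (lens distortion ignored)."""
    # Extrinsic transform (Equation 3.8): p_c = R (p_w - T).
    p_cam = R @ (p_world - T)
    # Intrinsic matrix M_int (Equation 3.7), camera coordinates to
    # homogeneous pixel coordinates.
    M_int = np.array([[-f / sx, 0.0,     ox],
                      [0.0,     -f / sy, oy],
                      [0.0,     0.0,     1.0]])
    uvw = M_int @ p_cam
    return uvw[:2] / uvw[2]          # pixel coordinates (x_i, y_i)

# Illustrative use: a point two meters in front of an untranslated camera.
pixel = project_point(np.array([0.1, 0.2, 2.0]), np.eye(3), np.zeros(3),
                      f=0.035, sx=1e-5, sy=1e-5, ox=320.0, oy=240.0)
```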
3.1.3 Camera Calibration
The objective of camera calibration is to derive the intrinsic and extrinsic parameters of a camera given a set of images taken by the camera. Given the 3D coordinates of target points, a typical camera calibration method [14] consists of the following three steps:

1. Compute the projection matrix M using the direct linear transform (DLT),

2. Estimate the camera parameters (intrinsic and extrinsic) [37] from M, neglecting lens distortion,

3. Fit the full model including all the intrinsic parameters and apply Levenberg-Marquardt nonlinear optimization.

In the case of self-calibration [32][42], the 3D coordinates of the points of interest are also unknown and should be estimated.
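For readers who want to experiment, the sketch below calibrates a camera with OpenCV using a planar checkerboard target. This is a practical stand-in rather than the DLT-based procedure [14] described above; the pattern size and file names are assumptions.

```python
import cv2
import numpy as np

# A planar checkerboard with 9x6 inner corners; its 3D target points lie on Z = 0.
pattern = (9, 6)
target = np.zeros((pattern[0] * pattern[1], 3), np.float32)
target[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points, img_size = [], [], None
for path in ["calib_01.jpg", "calib_02.jpg", "calib_03.jpg"]:  # illustrative file names
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    img_size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(target)
        img_points.append(corners)

# Returns the reprojection error, the intrinsic matrix K, the distortion
# coefficients and one extrinsic pose per calibration image.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, img_size, None, None)
```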
3.1.4 Image Un-distortion
Because of the high-degree distortion models (refer to Equations 3.3 and 3.4), there exists no algebraic inversion of Equation 3.2 to evaluate undistorted pixels from distorted pixels directly. The most common way is to undistort the whole image at once. During the undistortion, for each pixel in the undistorted image, the following steps are applied (a sketch follows the list):

1. Derive the corresponding distorted sub-pixel coordinate from the undistorted pixel coordinate,

2. Compute the color at the distorted sub-pixel coordinate using bilinear interpolation,

3. Assign the color to the undistorted pixel coordinate.
3.2 Two-view Geometry

In the two-view geometry reconstruction, only two images are concerned. The reconstruction mainly consists of three steps: (1) corresponding feature search, (2) camera intrinsic parameter and pose recovery, and (3) 3D point recovery. In this section, assuming the camera intrinsic parameters are given, we focus on the recovery of the relative camera pose and the 3D points. Without loss of generality, the left camera is placed at the world origin, so the pose of the right camera is [R | T] in the two-view system. In this section, the 3D computer vision book [37] and the multiview geometry book [13] are taken as the references.
3.2.1 Essential Matrix and Fundamental Matrix Computation
Figure 3.2: The two-view system.

In Figure 3.2, with two views, the two camera coordinate systems are related by a rotation R and a translation T:

p_r = R p_l + T.   (3.9)

Note that the vectors p_r, T and p_r - T are coplanar; then

p_r^T (T × (p_r - T)) = 0.   (3.10)

Combining with Equation (3.9), then

p_r^T (T × (R p_l)) = 0,   (3.11)

which can be written as

p_r^T E p_l = 0,   (3.12)

where E = [T]_× R is called the essential matrix. The essential matrix is a natural representation of the epipolar geometry for known calibration, and it relates corresponding image points expressed in the camera coordinate systems. However, sometimes the two cameras may not be calibrated. To generalize the relation, the fundamental matrix F, first defined by Olivier D. Faugeras [8], is introduced. Let K_l and K_r be the intrinsic matrices of the left and right cameras respectively; then

p_r^T F p_l = 0,   (3.13)

where F = K_r^{-T} E K_l^{-1}, and p_l and p_r here are the corresponding points in the respective pixel coordinates (their normalized counterparts are K_l^{-1} p_l and K_r^{-1} p_r).

Specifically, given the corresponding point pair p_l : (x, y, 1) and p_r : (x', y', 1), Equation 3.13 is equivalent to

(x'x, x'y, x', y'x, y'y, y', x, y, 1) f = 0,   (3.14)

where f is the 9-vector made up of the entries of F in row-major order. From a set of n corresponding pairs, (x_i, y_i, 1) ↔ (x'_i, y'_i, 1) for i = 1, ..., n, we obtain a set of linear equations of the form

A f = 0,

where each row of the n × 9 matrix A is built from one pair as in Equation 3.14. The least-squares solution for f is the singular vector corresponding to the smallest singular value of A, after which the rank-2 constraint is enforced on the recovered F.

Given the intrinsic matrices, the essential matrix can be obtained from the fundamental matrix by E = K_r^T F K_l. Because E = [T]_× R, two of its singular values are equal and the third is zero. So the essential matrix estimated from noisy correspondences is usually corrected by enforcing this constraint on its singular values.
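The following numpy sketch illustrates the linear estimation of F from pixel correspondences (the rows follow Equation 3.14) and the singular-value correction of E. It omits coordinate normalization and outlier rejection, which a practical implementation would add; it is an illustration, not the thesis's implementation.

```python
import numpy as np

def fundamental_from_matches(pts_l, pts_r):
    """Linear estimate of F from n >= 8 pixel correspondences.
    Each row of pts_l / pts_r is a point (x, y)."""
    rows = []
    for (x, y), (xp, yp) in zip(pts_l, pts_r):
        rows.append([xp * x, xp * y, xp, yp * x, yp * y, yp, x, y, 1.0])
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)          # f is the right singular vector of the
    F = Vt[-1].reshape(3, 3)             # smallest singular value (A f = 0)
    U, s, Vt = np.linalg.svd(F)          # enforce the rank-2 constraint
    return U @ np.diag([s[0], s[1], 0.0]) @ Vt

def essential_from_fundamental(F, K_l, K_r):
    """E = K_r^T F K_l, with the singular values forced to (1, 1, 0)."""
    E = K_r.T @ F @ K_l
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```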
3.2.2 Camera Pose Recovery from Essential Matrix

Knowing the relative camera pose, we can compute 3D points from feature correspondences. Given the left and right intrinsic matrices, we discuss here how to recover the relative camera pose from the essential matrix. First the property of the essential matrix is studied, and then the four possible candidates of the relative camera pose extracted from the essential matrix are presented algebraically.

Property of Essential Matrix

A skew-symmetric matrix such as [T]_× can be decomposed as [T]_× = α U Z U^T, where U is orthogonal, α is a scale factor and Z = [[0, 1, 0], [-1, 0, 0], [0, 0, 0]]. Noting that, up to sign, Z = diag(1, 1, 0) W with W = [[0, -1, 0], [1, 0, 0], [0, 0, 1]], we have, up to scale, [T]_× = U diag(1, 1, 0) W U^T and E = [T]_× R = U diag(1, 1, 0) (W U^T R), which is a singular value decomposition (SVD) of E. So a 3 × 3 matrix is an essential matrix if and only if its singular values are (1, 1, 0) up to scale.

Extract Camera Pose from Essential Matrix

Let the SVD of the essential matrix be E = U diag(1, 1, 0) V^T. There are two possible factorizations E = [T]_× R, namely [T]_× = U Z U^T with R = U W V^T or R = U W^T V^T. By [T]_× T = 0, it follows that T = U (0, 0, 1)^T = u_3, the last column of U. However, the sign of E, and consequently T, cannot be determined. Thus, corresponding to a given essential matrix, there are four possible choices of the camera matrix, based on the two possible choices of R and the two possible signs of T: specifically, [U W V^T | ±u_3] and [U W^T V^T | ±u_3].

Geometrically, there is only one correct relative camera pose. The ambiguity of camera poses can be removed by checking that all recovered points should be in front of both views.
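A minimal sketch of this disambiguation is shown below. It assumes an essential matrix E and one inlier correspondence given in normalized camera coordinates; `triangulate` is a placeholder for any 3D point recovery routine, for example the DLT sketch in the next subsection. The names and structure are ours, not the thesis's code.

```python
import numpy as np

W = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

def pose_candidates(E):
    """The four possible [R | T] factorizations of an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:             # keep proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    u3 = U[:, 2]
    return [(U @ W @ Vt, u3), (U @ W @ Vt, -u3),
            (U @ W.T @ Vt, u3), (U @ W.T @ Vt, -u3)]

def pick_pose(E, x_l, x_r, triangulate):
    """Resolve the ambiguity: the correct pose reconstructs the point
    in front of both views (positive depth in both cameras)."""
    for R, T in pose_candidates(E):
        X = triangulate(np.eye(3, 4), np.hstack([R, T.reshape(3, 1)]), x_l, x_r)
        if X[2] > 0 and (R @ X + T)[2] > 0:
            return R, T
    return None
```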
3.2.3 Three-D Point Recovery
Given the relative camera pose, the 3D point corresponding to any feature correspondence can be recovered. There are commonly two ways to recover 3D points: the direct linear transformation method (DLT) [13] and the triangulation method [13].

The direct linear transformation method (DLT) is commonly used. Let the projection matrices of the two views be M and M' respectively. For any inlier feature correspondence (x, y, 1) ↔ (x', y', 1), the homogeneous 3D point X satisfies x (m_3 X) - (m_1 X) = 0 and y (m_3 X) - (m_2 X) = 0, plus the two analogous equations for (x', y') and M', where m_i denotes the i-th row of M. Stacking the four equations gives a linear system whose least-squares solution, the singular vector of the smallest singular value, is the recovered point.
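A minimal numpy sketch of the DLT triangulation just described, under the assumption that M and M' are the 3 × 4 projection matrices of the two views, is given below; it can also serve as the `triangulate` placeholder used in the previous subsection's sketch.

```python
import numpy as np

def triangulate(M, M_prime, x_l, x_r):
    """DLT: recover the 3D point from one correspondence, given the two
    3x4 projection matrices M and M'.  x_l = (x, y), x_r = (x', y')."""
    A = np.vstack([
        x_l[0] * M[2] - M[0],               # x * m3 - m1 = 0
        x_l[1] * M[2] - M[1],               # y * m3 - m2 = 0
        x_r[0] * M_prime[2] - M_prime[0],   # x' * m3' - m1' = 0
        x_r[1] * M_prime[2] - M_prime[1],   # y' * m3' - m2' = 0
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                              # homogeneous least-squares solution
    return X[:3] / X[3]                     # dehomogenize
```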
3.3 Image Feature Detection and Matching
Image feature detection and matching is the first and most crucial step in many applications, such as motion tracking, object recognition and two-view geometry reconstruction.
Generally, image feature detection intends to find the feature locations. During feature matching, descriptors used to represent the features are matched. For example, after detection of corners, for each corner, the surrounding subregion can be used as the corner descriptor. To match corners, those subregions are matched using template matching techniques.

To detect the same point independently in all images, the features used should be repeatable. For each point, to recognize the correspondence correctly, a reliable and distinctive descriptor should be used.

In the rest of the current section, we study the most commonly used features, corners, and then the scale invariant feature transform (SIFT), which has proven to be the most robust local invariant feature descriptor. The subsection introducing SIFT is based on the work of David Lowe [20].
3.3.1 Corner Detection and Matching
A corner is the intersection of two edges. Unlike edge features, which suffer from the aperture problem, a corner's location is well-defined. The two most used corner detection approaches are:

• Methods that slide a small window over the image and measure the intensity change. If a large response is generated whichever direction the window moves in, a corner is detected,

• Methods based on the fact that the intensity surface has two directions with significant intensity discontinuities at corners.

When the correspondences of corners are searched, the subregions around the corners are matched by template matching. There are many template matching methods, such as squared difference, cross correlation and correlation coefficient; details can be found in the manual of the cvMatchTemplate function in the OpenCV library.
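As a small illustration (not code from this thesis), the patch around a detected corner can be matched with OpenCV's matchTemplate, the modern counterpart of cvMatchTemplate, using the normalized correlation coefficient. The window size and names are assumptions.

```python
import cv2

def match_corner(img_a, img_b, corner, half=7):
    """Find the best match in img_b for the patch around `corner` in img_a,
    using normalized-correlation-coefficient template matching."""
    x, y = corner
    patch = img_a[y - half:y + half + 1, x - half:x + half + 1]
    scores = cv2.matchTemplate(img_b, patch, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)
    # best_loc is the top-left corner of the best-matching window.
    return (best_loc[0] + half, best_loc[1] + half), best_score
```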
3.3.2 Scale Invariant Feature Transform (SIFT)
Blobs are circular regions whose gray-scale intensity differs from their surroundings. They have a center of mass and a scale. Since the Laplacian represents the second-order intensity change in the image, local extrema of certain Laplacian functions can be treated as blobs.

Mikolajczyk and Schmid [23] show that local extrema of the scale-normalized Laplacian of Gaussian provide the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner detection function. So local extrema of the Laplacian of Gaussian are treated as blobs from now on.
An image is smoothed by convolving it with a variable-scale Gaussian

G(x, y, σ) = 1 / (2πσ^2) · exp(-(x^2 + y^2) / (2σ^2)).

The scale-normalized Laplacian of Gaussian can be computed through a finite-difference approximation to ∂G/∂σ, using the difference of nearby scales at kσ and σ:

DoG(x, y, σ) = G(x, y, kσ) - G(x, y, σ) ≈ (k - 1) σ^2 ∇^2 G.

The difference-of-Gaussian function convolved with the image I(x, y) generates a Difference of Gaussians (DoG) image DoG(x, y, σ) ∗ I(x, y), which is a close approximation to the Laplacian of Gaussian images. Scale Invariant Feature Transform (SIFT) [20] interest points are local extrema of the Difference of Gaussians (DoG) images. To achieve scale invariance, images are downsized to different levels. Each level is called an octave; refer to Figure 3.3.
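A compact sketch of building one octave of Gaussian and DoG images follows. The number of scales and the factor k are illustrative choices consistent with the description above, not parameters taken from this thesis.

```python
import cv2

def dog_octave(img, sigma=1.6, scales=5):
    """Build one octave: Gaussian images at sigma, k*sigma, ... and their
    differences (DoG). `img` is expected to be a grayscale float image."""
    k = 2.0 ** (1.0 / (scales - 3))              # scale step between levels
    gaussians = [cv2.GaussianBlur(img, (0, 0), sigma * (k ** i))
                 for i in range(scales)]
    dogs = [gaussians[i + 1] - gaussians[i] for i in range(scales - 1)]
    # The next octave starts from a Gaussian image with doubled sigma,
    # downsampled by a factor of two.
    next_base = cv2.resize(gaussians[-3], None, fx=0.5, fy=0.5,
                           interpolation=cv2.INTER_NEAREST)
    return gaussians, dogs, next_base
```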
Local extrema of the DoG images are detected by comparing the intensity of a pixel of interest with the intensities of its 9 - 8 - 9 neighboring pixels at the current and adjacent scales; refer to Figure 3.4.
Figure 3.4: Local extrema of DoG images are detected by comparing a pixel (red) with its 26 neighbors (blue) in 3 × 3 regions at the current and adjacent scales. Image taken from [30].

A 3D quadratic function is fit to the local sample points to determine the sub-pixel location of the extremum using Newton's method. The quadratic function is also used to reject unstable extrema with low contrast. Another type of unstable extremum is a SIFT point lying along an edge, where the DoG surface has a large principal curvature α across the edge and a small one β in the perpendicular direction. To eliminate those SIFT points, the ratio γ = α/β is thresholded, say, if γ < τ then the SIFT point is stable. The curvatures are estimated from the Hessian matrix H, computed by taking differences of neighboring points,

H = [[D_xx, D_xy], [D_xy, D_yy]],

and the test is applied through Tr(H)^2 / Det(H) = (γ + 1)^2 / γ, where Tr(H), Det(H) are the trace and determinant of H respectively.
To assign an orientation to each keypoint, the Gaussian image G closest to the keypoint's scale is first determined. Then gradients in the area around the keypoint (x, y) are computed. The orientation histogram is built from the gradient magnitude m(x, y) and orientation θ(x, y):

m(x, y) = sqrt((G(x + 1, y) - G(x - 1, y))^2 + (G(x, y + 1) - G(x, y - 1))^2),
θ(x, y) = tan^{-1}((G(x, y + 1) - G(x, y - 1)) / (G(x + 1, y) - G(x - 1, y))).

The orientation histogram covers the 360-degree range of orientations using 36 bins. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window centered at the keypoint location. Peaks in the orientation histogram are the dominant directions of the local gradients.
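The sketch below computes the gradient magnitude, orientation and a 36-bin orientation histogram around a keypoint. For brevity, the Gaussian weighting of the samples is omitted, and the keypoint is assumed to lie away from the image border; it is an illustration rather than the thesis's implementation.

```python
import numpy as np

def orientation_histogram(G, keypoint, radius=8, bins=36):
    """Accumulate gradient magnitudes around a keypoint in the Gaussian
    image G into an orientation histogram; return the dominant direction."""
    x0, y0 = keypoint
    hist = np.zeros(bins)
    for y in range(y0 - radius, y0 + radius + 1):
        for x in range(x0 - radius, x0 + radius + 1):
            dx = G[y, x + 1] - G[y, x - 1]
            dy = G[y + 1, x] - G[y - 1, x]
            m = np.hypot(dx, dy)                     # gradient magnitude
            theta = np.arctan2(dy, dx)               # orientation in [-pi, pi]
            b = int(((theta + np.pi) / (2 * np.pi)) * bins) % bins
            hist[b] += m                             # Gaussian weighting omitted
    dominant = hist.argmax() * (2 * np.pi / bins) - np.pi
    return hist, dominant
```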
Feature Descriptor
Descriptors are necessary to match SIFT keypoints from different images. A SIFT descriptor consists of a set of orientation histograms of subregions around the keypoint; refer to Figure 3.5. Furthermore, the coordinates of the descriptor are rotated relative to the keypoint's dominant orientation to achieve orientation invariance.

Descriptors are used to ease the matching. A simple matching strategy is "exhaustive search": between two images A and B, for any descriptor p from A, the descriptor from B with the smallest Euclidean distance to p would be the optimal match. There may be similar features causing ambiguity during the matching. To eliminate the ambiguities, a threshold is applied to the ratio of the distances of the two best matches. If the ratio satisfies the constraint, the best match is selected as the correspondence of p. Otherwise, it fails to find any correspondence of p among the descriptors of B. Figure 3.6 shows a matching result where most of the correspondences are visually correct.
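A minimal sketch of this exhaustive search with the ratio test is given below; the descriptors are assumed to be rows of n × 128 arrays, and the ratio threshold is an illustrative value, not one taken from this thesis.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """For each descriptor in A, find its two nearest neighbours in B and
    accept the best one only if it is clearly better than the second best."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distances to all of B
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < ratio * dists[j2]:            # the ratio test
            matches.append((i, j1))
    return matches
```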
Figure 3.5: A keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into orientation histograms summarizing the contents over 4×4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region. This figure shows a 2×2 descriptor array computed from an 8×8 set of samples, whereas the experiments in this paper use 4×4 descriptors computed from a 16×16 sample array. The image and the description taken from [20].

3.4 Levenberg Marquardt Non-linear Optimization

Levenberg Marquardt non-linear optimization has become the standard nonlinear optimization method. It is a damped Gauss-Newton method [16][22]. The damping parameter influences both the direction and the size of the step, and it leads to a method without a specific line search. The optimization is achieved by controlling the damping parameter adaptively: it raises the damping parameter if a step fails to reduce the error; otherwise it reduces the damping parameter. In this manner, the optimization is able to alternate between a slow descent approach when far from the optimum and fast quadratic convergence in the neighborhood of the optimum.
In this thesis, Levenberg Marquardt optimization is used for the nonlinear optimization. More specifically, an existing package, SBA [19], which applies the sparse Levenberg Marquardt method to the optimization of multiview geometry, is used. A more detailed introduction to nonlinear optimization, including the Levenberg Marquardt method, can be found in [21].
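For illustration, a bare-bones version of the adaptive damping scheme for a generic least-squares problem is sketched below; `residual` and `jacobian` are placeholders supplied by the caller, and the update constants are illustrative. The actual optimization in this thesis relies on the SBA package.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x, lam=1e-3, iters=50):
    """Minimize ||residual(x)||^2 with adaptive damping: raise lambda when a
    step fails to reduce the error, lower it when the step succeeds."""
    err = np.sum(residual(x) ** 2)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        A = J.T @ J + lam * np.eye(x.size)           # damped normal equations
        step = np.linalg.solve(A, -J.T @ r)
        x_new = x + step
        err_new = np.sum(residual(x_new) ** 2)
        if err_new < err:                            # accept: behave more like Gauss-Newton
            x, err, lam = x_new, err_new, lam * 0.1
        else:                                        # reject: behave more like gradient descent
            lam *= 10.0
    return x
```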
Figure 3.6: SIFT matching result: the bottom image is the SIFT matching result of the top images.