Our work focuses on image alignment for constructing wide-angle panoramas from regular hand-held cameras. This problem was popularized in the mid-1990s [44, 65, 20] and drew the attention of the research community. Following these early works, a wide range of work has focused on different challenges in the panorama construction system, such as globally consistent alignment [66, 55, 70], feature point detection and matching for transformation estimation [45, 15, 16], post-processing for removing object
misalignment artifacts [70, 68, 3], and dealing with varying exposures [68, 36]. This area has matured to the point that image stitching for panorama construction has been integrated into many commercial software packages [50, 1, 2].
Given the diversity of work in this area, it is outside the scope of this thesis to review all work related to image mosaicing. Moreover, thorough surveys on this topic are already available, most notably that by Szeliski [61]. Instead, we review key techniques that are applied in our proposed methods and that are related to the traditional image mosaicing pipeline described in the previous chapter. In particular, we discuss feature extraction and matching, transformation estimation and image warping onto a canonical canvas, and post-processing techniques to hide misalignment artifacts.
Figure 2.1: A comparison of the performance of different feature descriptors for the case where the camera viewpoint has rotated by 60°. This figure is from [46]. In this figure, the detection rate is the number of correctly matched points with respect to the number of all possible matches for the input pair of images. The false positive rate represents the probability that a descriptor has a false match in a conventional descriptor database. From the figure we can see that SIFT features generally provide the best detection rates among all the feature descriptors.
Among different feature descriptors, Lowe's Scale-Invariant Feature Transform (SIFT) [42] generally provides the best performance, followed by Freeman and Adelson's steerable filters [25]. Figure 2.1, from [46], shows a comparison of the performance of different feature descriptors. Therefore, in the implementation of the works in this thesis, we use a SIFT feature-based method to register the images.
2.2.1 SIFT Feature Matching
Scale-invariant feature transform (SIFT) [42] is a powerful feature extraction method that achieves a scale-invariant, rotation-invariant and illumination-invariant feature descriptor. The property of scale invariance is achieved by looking for scale-space maxima of the Difference of Gaussians (DoG). Specifically, a DoG pyramid is first established at different scales, and features are located at the extrema of the DoG functions in this scale space. The descriptor of each feature point is then computed by accumulating 8-direction histograms of oriented gradients in the local patch. Essentially, there are two advantages of using gradients instead of intensity values: first, gradients are more shift-tolerant, since moving edges do not change their overall direction; second, using image gradients discounts the lighting changes of the local patch and hence achieves illumination invariance. At the same time, each local descriptor vector is aligned with the dominant gradient orientation and hence achieves rotation invariance as a result.
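As a concrete illustration, the following is a minimal sketch of one octave of the DoG pyramid construction using OpenCV; the function name and the parameter values (Lowe's commonly cited defaults of σ = 1.6 and three scales per octave) are our own illustrative choices, not taken from the thesis:

```python
import cv2
import numpy as np

def dog_octave(img, sigma=1.6, scales=3):
    # One octave: blur at geometrically spaced sigmas, then subtract
    # adjacent levels. Extrema of these responses across both space and
    # scale are the candidate SIFT feature locations.
    gray = img.astype(np.float32)
    k = 2.0 ** (1.0 / scales)
    gauss = [cv2.GaussianBlur(gray, (0, 0), sigma * k**i)
             for i in range(scales + 3)]
    return [gauss[i + 1] - gauss[i] for i in range(scales + 2)]
```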
For each point of interest, this approach computes the SIFT descriptor over a 4 × 4 grid of neighboring patches and uses the resulting 4 × 4 × 8 = 128-dimensional vector as the feature descriptor. Then, for each of the input images, all features are extracted and matched to the features in neighboring images based on the closest Euclidean distance. Figure 2.2(a) shows an example of detected and registered SIFT points for a pair of neighboring images.
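A minimal sketch of this extraction-and-matching step is given below, using OpenCV's SIFT implementation (available as cv2.SIFT_create in OpenCV 4.4 and later). Lowe's ratio test [42] is included as a common filter on the nearest-neighbor matches, although the text above describes plain closest-distance matching; the function name and the 0.8 threshold are illustrative assumptions:

```python
import cv2

def match_sift(img0, img1, max_ratio=0.8):
    # Detect keypoints and compute their 128-D SIFT descriptors.
    sift = cv2.SIFT_create()
    kp0, des0 = sift.detectAndCompute(img0, None)
    kp1, des1 = sift.detectAndCompute(img1, None)
    # Match each descriptor to its two nearest neighbors by Euclidean
    # (L2) distance and keep matches that pass the ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des0, des1, k=2)
    good = [m for m, n in pairs if m.distance < max_ratio * n.distance]
    pts0 = [kp0[m.queryIdx].pt for m in good]
    pts1 = [kp1[m.trainIdx].pt for m in good]
    return pts0, pts1
```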
(a) Initial registered SIFT features
(b) Remaining SIFT features after RANSAC filtering

Figure 2.2: An example of SIFT feature detection and RANSAC filtering. We can see that after the RANSAC filtering process, highly reliable registration correspondences are obtained.

Although computing the registration correspondences by finding the nearest neighbor of SIFT features provides locally accurate matching results, there are still incorrect matches, due to reasons such as moving objects or repeated content inside the scene. Thus, it is necessary to obtain a set of correspondences that share the same transformation. This can be done by applying a random sample consensus (RANSAC) [24] process to the correspondence set for each pair of images.
RANSAC is a robust estimation procedure that finds the solution with the most consensus. A typical RANSAC procedure can be summarized as follows:
Algorithm 1 RANSAC
1: Randomly select the minimum number of points required to determine the model parameters.
2: Solve for the parameters of the model.
3: Determine how many points from the set of all points fit with a predefined tolerance.
4: If the fraction of the number of inliers over the total number of points in the set exceeds a predefined threshold τ, re-estimate the model parameters using all the identified inliers and terminate.
5: Otherwise, repeat steps 1 through 4 (a maximum of N times).
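To make the five steps concrete, here is a minimal sketch of the procedure, specialized to the toy problem of fitting a 2D line; everything here, from the function name to the tolerance value, is an illustrative assumption rather than the thesis's implementation:

```python
import numpy as np

def ransac_line(points, n_trials=200, tol=2.0, inlier_frac=0.5, seed=None):
    # points: an (N, 2) array; we fit the line y = a*x + b (steps 1-5 above).
    rng = np.random.default_rng(seed)
    best_params, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_trials):                              # step 5: at most N trials
        i, j = rng.choice(len(points), size=2, replace=False)  # step 1: minimal sample
        (x0, y0), (x1, y1) = points[i], points[j]
        if x0 == x1:                                       # degenerate sample, redraw
            continue
        a = (y1 - y0) / (x1 - x0)                          # step 2: solve the model
        b = y0 - a * x0
        resid = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = resid < tol                              # step 3: fit within tolerance
        if inliers.sum() > best_inliers.sum():
            best_params, best_inliers = (a, b), inliers
        if inliers.mean() > inlier_frac:                   # step 4: enough inliers ->
            a, b = np.polyfit(points[inliers, 0], points[inliers, 1], 1)
            return (a, b), inliers                         # refit on inliers, terminate
    return best_params, best_inliers                       # best trial seen as fallback
```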
In our image registration setting, each trial estimates a candidate transformation from a random minimal sample of correspondences, and the correspondences that fit this transformation within a tolerance of a few pixels (around 1-3) are counted as inliers. The random selection process is repeated N times, and the sample set with the largest number of inliers is kept as the final solution. To ensure that a true set of inliers can be selected, a sufficient number of trials N must be tried. Hence the total probability of success P is

$P = 1 - (1 - p^k)^N$, (2.1)

where p is the probability of a single correspondence being an inlier and k is the size of the minimal sample. For our works, we constantly set p = 0.5, k = 4 and N = 200. We can see that the probability of accepting an incorrect set of correspondences is then lower than $10^{-5}$. Figure 2.2(b) shows an example of the computed feature correspondences after the RANSAC process.
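As a quick numerical check of Eq. (2.1) with these settings (a two-line sketch we added; the printed value is rounded):

```python
p, k, N = 0.5, 4, 200
failure = (1 - p**k) ** N    # probability that no all-inlier sample is ever drawn
print(failure)               # ~2.5e-6, indeed below 1e-5
```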
2.3 Transformation Estimation
Once the corresponding pairs are obtained, we need to compute the transformation based on these correspondences. Early image mosaicing applications (e.g., satellite photos) usually have known motion parameters and telephoto-like images. Hence, the images can be directly stitched using simple motions such as translation, rotation and so on. Table 2.1 provides an illustration of the hierarchy of 2D coordinate transformations, which are discussed in [30]. From the table we can see that any of these transformations can be represented as a 3 × 3 matrix. At the same time, a transformation preserves more properties of the coordinates when its matrix has fewer degrees of freedom (D.O.F.).

The question now is whether the input image pairs can be aligned using any of these 2D transformations. Recall the image-taking assumptions described in Chapter 1; essentially, both scenarios guarantee that all the objects can be projected onto a virtual plane. It has been proved that under such an assumption the transformation between each pair of images can be represented as a projective transform, commonly called a homography. This is done by setting the virtual plane as a reference plane, which makes the scene depth equal to 0, and hence the depth information can be ignored during the transformation. For more details, the reader can refer to Szeliski et al.'s work [62, 63].
2.3.1 Homography Estimation
Name                 Matrix       # D.O.F.   Preserves
translation          [ I | t ]        2      orientation + ...
rigid (Euclidean)    [ R | t ]        3      lengths + ...
similarity           [ sR | t ]       4      angles + ...
affine               [ A ]            6      parallelism + ...
projective           [ H̃ ]            8      straight lines

Table 2.1: An illustration of the hierarchy of 2D coordinate transformations; this table is adapted from [61]. Each 2 × 3 matrix is extended with a third [0ᵀ 1] row to form a full 3 × 3 matrix for homogeneous coordinate transformations, and "+ ..." indicates that a transformation also preserves the properties listed in the rows below it.
Thus, for each pair of registered points p and p′ in the two images, we use p̃ and p̃′ to denote their homogeneous coordinates. Then

$\tilde{p}' \sim H \tilde{p}$, (2.2)

where ∼ denotes equality up to scale. Since all correspondences should follow one single homography, we can directly estimate the homography from the correspondences in a least-squares manner.
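To make the least-squares estimation concrete, the following is the standard direct linear transformation (DLT) formulation; this derivation is our own addition, as the thesis excerpt does not spell the step out. Writing $\tilde{p} = (x, y, 1)^T$ and $\tilde{p}' = (x', y', 1)^T$, each correspondence contributes two constraints that are linear in the entries of $H$:

```latex
% Eliminating the unknown scale in \tilde{p}' \sim H\tilde{p} by
% cross-multiplication gives two linear equations per correspondence:
\begin{align*}
h_{11}x + h_{12}y + h_{13} - x'(h_{31}x + h_{32}y + h_{33}) &= 0,\\
h_{21}x + h_{22}y + h_{23} - y'(h_{31}x + h_{32}y + h_{33}) &= 0.
\end{align*}
% Stacking the equations from n >= 4 correspondences yields A h = 0 with
% h = (h_{11}, \dots, h_{33})^T; the least-squares solution under
% \|h\| = 1 is the right singular vector of A associated with its
% smallest singular value.
```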
2.3.2 Rotation Model Estimation
A more constrained case is when the camera is rotating about its center of projection. In such a case, the homography can be decomposed as

$H_{10} = K_1 R_1 R_0^{-1} K_0^{-1} = K_1 R_{10} K_0^{-1}$, (2.3)
where $K_i = \mathrm{diag}(f_i, f_i, 1)$ is the camera intrinsic matrix, which projects pixels onto an infinite plane, and $R_{10}$ is a rotation matrix representing the motion of the camera. Here we can see that there is only one unknown, $f$, in $K$, which represents the focal length of the camera, and three unknowns inside the general rotation matrix $R$. Thus, instead of the homography transformation, we get a 3-, 4- or 5-parameter rotation model depending on whether the focal lengths of the cameras are known, unknown but equal, or unknown and different. These parameters can be estimated using the Levenberg-Marquardt algorithm (LMA), as described in Brown and Lowe's work [15, 16]. The 3D rotation model is intrinsically more stable than a full 8-parameter homography [66]. A straightforward benefit of this model is that the warped images suffer from less distortion after applying the cylinder warping described in the following section.
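Below is a minimal sketch of such an estimation for the 4-parameter case (one shared unknown focal length plus a rotation vector), using SciPy's Levenberg-Marquardt solver. This is our own illustration rather than Brown and Lowe's implementation, and it assumes pts0 and pts1 are hypothetical N × 2 arrays of matched pixel coordinates measured relative to the principal point:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, pts0, pts1):
    # params = [f, rx, ry, rz]: shared focal length and a rotation vector.
    f, rotvec = params[0], params[1:]
    K = np.diag([f, f, 1.0])
    R = Rotation.from_rotvec(rotvec).as_matrix()
    H = K @ R @ np.linalg.inv(K)          # Eq. (2.3) with K1 = K0 = K
    q = (H @ np.column_stack([pts0, np.ones(len(pts0))]).T).T
    return (q[:, :2] / q[:, 2:] - pts1).ravel()

# Levenberg-Marquardt refinement from a rough initial focal length guess:
# fit = least_squares(residuals, x0=[800.0, 0.0, 0.0, 0.0],
#                     args=(pts0, pts1), method="lm")
```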
2.3.3 Cylinder Mapping
After the homographies for each pair of input images are computed, the images can be directly aligned by warping according to the estimated homographies. In that case, one image in the input set (in most cases the center image) is chosen as the reference image, and all other images can then be transformed into the coordinate system of the reference image by concatenating the pair-to-pair homographies. As described in Section 2.3.1, a homography preserves the perspective property that straight lines remain straight after the transformation, and sometimes this is a key requirement for some applications. However, for stitched results with a large field of view (FOV), the content near the border of the panorama suffers severe stretching. In practice, this problem is usually solved by applying a cylinder mapping [62, 20]. Figure 2.3 shows an illustration of the cylinder mapping.
Homography Warping
Cylinder Warping
Figure 2.3: An example of homography warping and cylinder warping. The result using the homography transformation preserves the straight-line structures of the scene. However, we can see that, as it produces a large field of view, the content near the border is stretched. This artifact can be relieved using cylinder mapping.
The idea of the cylinder mapping is to project the reference plane onto a cylindrical surface. For a given radius r of the cylinder surface, the mapped coordinates of a point p(x, y, f) can be computed as

$x' = r\theta = r\tan^{-1}\!\left(\frac{x}{f}\right), \qquad y' = rh = \frac{ry}{\sqrt{x^2 + f^2}}$, (2.4)

where r is usually set to $\bar{f}$ (the average of the focal lengths of all input images) in our implementation. Figure 2.3 shows an example of the cylinder mapping.
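A minimal sketch of this warp is given below; it is our own illustration, assuming the principal point at the image center and r = f. It computes the inverse of Eq. (2.4) for every pixel of the cylindrical canvas and resamples the source image:

```python
import numpy as np
import cv2

def cylinder_warp(img, f):
    # For each pixel (x', y') on the cylindrical canvas, invert Eq. (2.4):
    #   x = f * tan(x'/f),  y = y' / cos(x'/f)
    # then sample the source image at (x, y). Valid while |x'/f| < pi/2.
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    yc, xc = np.indices((h, w), dtype=np.float32)
    theta = (xc - cx) / f
    map_x = f * np.tan(theta) + cx
    map_y = (yc - cy) / np.cos(theta) + cy
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
```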
2.4 Post-Processing

After alignment, post-processing is applied to hide the remaining misalignment artifacts. Two families of techniques are commonly used: blending techniques, which smooth the transition between the aligned images, and seam-cutting techniques, which search for an optimal seam inside an overlapping region such that alignment artifacts are minimized. Seam-cutting can at times have the effect of removing undesirable objects that appear in the overlapped region. We describe both of these techniques in the following. Note that the research in this thesis uses the seam-cutting technique.
2.4.1 Blending
Blending techniques have played an important role since the emergence of image mosaicing. Interesting works include approaches such as gradient-domain blending [49, 36] and Laplacian pyramid blending [17, 16].

Gradient-domain blending is done by assuming that the visual content of the image can be represented using the gradients of the image, and hence concatenating