Our work focuses on image alignment for constructing wide-angle panoramas from regular hand-held cameras. This problem was popularized in the mid-1990s [44, 65, 20] and drew the attention of the research community. Following these early works, a wide range of work has focused on different challenges in the panorama construction system, such as globally consistent alignment [66, 55, 70], feature point detection and matching for transformation estimation [45, 15, 16], post-processing for removing object
misalignment artifacts [70, 68, 3], and dealing with varying exposures [68, 36]. This area has matured to the point that image stitching for panorama construction has been integrated into many commercial software packages [50, 1, 2].
Given the diversity of work in this area, it is outside the scope of this thesis to review all work related to image mosaicing. Moreover, thorough surveys on this topic are already available, most notably that by Szeliski [61]. Instead, we review key techniques that are applied in our proposed methods and that are related to the traditional image mosaicing pipeline described in the previous chapter. In particular, we discuss feature extraction and matching, transformation estimation and image warping onto a canonical canvas, and post-processing techniques to hide misalignment artifacts.
Figure 2.1: A comparison of the performance of different feature descriptors for the case where the camera viewpoint has rotated by 60°. This figure is from [46]. In this figure, the detection rate is the number of correctly matched points with respect to the number of all possible matches for the input pair of images. The false positive rate represents the probability that a descriptor has a false match in a conventional descriptor database. From the figure we can see that SIFT features generally provide the best detection rates among all the feature descriptors.
Among different feature descriptors, Lowe's Scale-Invariant Feature Transform (SIFT) [42] generally provides the best performance, followed by Freeman and Adelson's steerable filters [25]. Figure 2.1, from [46], shows a comparison of the performance of different feature descriptors. Therefore, in the implementation of the works in this thesis, we use a SIFT feature-based method to register the images.
2.2.1 SIFT Feature Matching
Scale-invariant feature transform (SIFT) [42] is a powerful feature extraction method that achieves a scale-invariant, rotation-invariant and illumination-invariant feature descriptor. The property of scale invariance is achieved by looking for scale-space maxima of the Difference of Gaussians (DoG). Specifically, a DoG pyramid is first established at different scales, and features are located at the extrema of the DoG functions in this scale space. The descriptor of each feature point is then computed by accumulating 8-direction histograms of oriented gradients in the local patch. Essentially, there are two advantages of using gradients instead of intensity values: first, gradients are more shift-tolerant, since moving edges do not change their overall direction; second, using image gradients discounts the lighting changes of the local patch and hence achieves illumination invariance. At the same time, each local descriptor vector is aligned with the dominant gradient orientation and hence achieves rotation invariance as a result.
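As a concrete illustration, the following is a minimal sketch of one octave of the DoG pyramid construction using OpenCV; the function name and the parameter values (Lowe's commonly cited defaults of σ = 1.6 and three scales per octave) are our own illustrative choices, not taken from the thesis:

```python
import cv2
import numpy as np

def dog_octave(img, sigma=1.6, scales=3):
    # One octave: blur at geometrically spaced sigmas, then subtract
    # adjacent levels. Extrema of these responses across both space and
    # scale are the candidate SIFT feature locations.
    gray = img.astype(np.float32)
    k = 2.0 ** (1.0 / scales)
    gauss = [cv2.GaussianBlur(gray, (0, 0), sigma * k**i)
             for i in range(scales + 3)]
    return [gauss[i + 1] - gauss[i] for i in range(scales + 2)]
```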
For each point of interest, this approach computes the SIFT descriptor over a 4 × 4 grid of neighboring patches and uses the resulting 4 × 4 × 8 = 128-dimensional vector as the feature descriptor. Then, for each of the input images, all features are extracted and matched to the features in neighboring images based on the closest Euclidean distance. Figure 2.2(a) shows an example of detected and registered SIFT points for a pair of neighboring images.
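A minimal sketch of this extraction-and-matching step is given below, using OpenCV's SIFT implementation (available as cv2.SIFT_create in OpenCV 4.4 and later). Lowe's ratio test [42] is included as a common filter on the nearest-neighbor matches, although the text above describes plain closest-distance matching; the function name and the 0.8 threshold are illustrative assumptions:

```python
import cv2

def match_sift(img0, img1, max_ratio=0.8):
    # Detect keypoints and compute their 128-D SIFT descriptors.
    sift = cv2.SIFT_create()
    kp0, des0 = sift.detectAndCompute(img0, None)
    kp1, des1 = sift.detectAndCompute(img1, None)
    # Match each descriptor to its two nearest neighbors by Euclidean
    # (L2) distance and keep matches that pass the ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des0, des1, k=2)
    good = [m for m, n in pairs if m.distance < max_ratio * n.distance]
    pts0 = [kp0[m.queryIdx].pt for m in good]
    pts1 = [kp1[m.trainIdx].pt for m in good]
    return pts0, pts1
```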
(a) Initial registered SIFT features
(b) Remaining SIFT features after RANSAC filtering

Figure 2.2: An example of SIFT feature detection and RANSAC filtering. We can see that after the RANSAC filtering process, highly reliable registration correspondences are obtained.

Although computing the registration correspondences by finding the nearest neighbor of SIFT features provides locally accurate matching results, there are still incorrect matches, due to reasons such as moving objects or repeated content inside the scene. Thus, it is necessary to obtain a set of correspondences that share the same transformation. This can be done by applying a random sample consensus (RANSAC) [24] process to the correspondence set for each pair of images.
RANSAC is a robust estimation procedure that finds the solution with the most consensus. A typical RANSAC procedure can be summarized as follows:
Algorithm 1 RANSAC
1: Randomly select the minimum number of points required to determine the model parameters.
2: Solve for the parameters of the model.
3: Determine how many points from the set of all points fit with a predefined tolerance.
4: If the fraction of the number of inliers over the total number of points in the set exceeds a predefined threshold τ, re-estimate the model parameters using all the identified inliers and terminate.
5: Otherwise, repeat steps 1 through 4 (a maximum of N times).
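To make the five steps concrete, here is a minimal sketch of the procedure, specialized to the toy problem of fitting a 2D line; everything here, from the function name to the tolerance value, is an illustrative assumption rather than the thesis's implementation:

```python
import numpy as np

def ransac_line(points, n_trials=200, tol=2.0, inlier_frac=0.5, seed=None):
    # points: an (N, 2) array; we fit the line y = a*x + b (steps 1-5 above).
    rng = np.random.default_rng(seed)
    best_params, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_trials):                              # step 5: at most N trials
        i, j = rng.choice(len(points), size=2, replace=False)  # step 1: minimal sample
        (x0, y0), (x1, y1) = points[i], points[j]
        if x0 == x1:                                       # degenerate sample, redraw
            continue
        a = (y1 - y0) / (x1 - x0)                          # step 2: solve the model
        b = y0 - a * x0
        resid = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = resid < tol                              # step 3: fit within tolerance
        if inliers.sum() > best_inliers.sum():
            best_params, best_inliers = (a, b), inliers
        if inliers.mean() > inlier_frac:                   # step 4: enough inliers ->
            a, b = np.polyfit(points[inliers, 0], points[inliers, 1], 1)
            return (a, b), inliers                         # refit on inliers, terminate
    return best_params, best_inliers                       # best trial seen as fallback
```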
In our image registration setting, each trial estimates a candidate transformation from a random minimal sample of correspondences, and the correspondences that fit this transformation within a tolerance of a few pixels (around 1-3) are counted as inliers. The random selection process is repeated N times, and the sample set with the largest number of inliers is kept as the final solution. To ensure that a true set of inliers can be selected, a sufficient number of trials N must be tried. Hence the total probability of success P is

$P = 1 - (1 - p^k)^N$, (2.1)

where p is the probability of a single correspondence being an inlier and k is the size of the minimal sample. For our works, we constantly set p = 0.5, k = 4 and N = 200. We can see that the probability of accepting an incorrect set of correspondences is then lower than $10^{-5}$. Figure 2.2(b) shows an example of the computed feature correspondences after the RANSAC process.
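As a quick numerical check of Eq. (2.1) with these settings (a two-line sketch we added; the printed value is rounded):

```python
p, k, N = 0.5, 4, 200
failure = (1 - p**k) ** N    # probability that no all-inlier sample is ever drawn
print(failure)               # ~2.5e-6, indeed below 1e-5
```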
2.3 Transformation Estimation
Once the corresponding pairs are obtained, we need to compute the transformation based on these correspondences. Early image mosaicing applications (e.g., satellite photos) usually have known motion parameters and telephoto-like images. Hence, the images can be directly stitched using simple motions such as translation, rotation and so on. Table 2.1 provides an illustration of the hierarchy of 2D coordinate transformations, which are discussed in [30]. From the table we can see that any of these transformations can be represented as a 3 × 3 matrix. At the same time, a transformation preserves more properties of the coordinates when its matrix has fewer degrees of freedom (D.O.F.).

The question now is whether the input image pairs can be aligned using any of these 2D transformations. Recall the image-taking assumptions described in Chapter 1; essentially, both scenarios guarantee that all the objects can be projected onto a virtual plane. It has been proved that under such an assumption the transformation between each pair of images can be represented as a projective transform, commonly called a homography. This is done by setting the virtual plane as a reference plane, which makes the scene depth equal to 0, and hence the depth information can be ignored during the transformation. For more details, the reader can refer to Szeliski et al.'s work [62, 63].
2.3.1 Homography Estimation
Name                 Matrix       # D.O.F.   Preserves
translation          [ I | t ]        2      orientation + ...
rigid (Euclidean)    [ R | t ]        3      lengths + ...
similarity           [ sR | t ]       4      angles + ...
affine               [ A ]            6      parallelism + ...
projective           [ H̃ ]            8      straight lines

Table 2.1: An illustration of the hierarchy of 2D coordinate transformations; this table is adapted from [61]. Each 2 × 3 matrix is extended with a third [0ᵀ 1] row to form a full 3 × 3 matrix for homogeneous coordinate transformations, and "+ ..." indicates that a transformation also preserves the properties listed in the rows below it.
Thus, for each pair of registered points p and p′ in the two images, we use p̃ and p̃′ to denote their homogeneous coordinates. Then

$\tilde{p}' \sim H \tilde{p}$, (2.2)

where ∼ denotes equality up to scale. Since all correspondences should follow one single homography, we can directly estimate the homography from the correspondences in a least-squares manner.
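To make the least-squares estimation concrete, the following is the standard direct linear transformation (DLT) formulation; this derivation is our own addition, as the thesis excerpt does not spell the step out. Writing $\tilde{p} = (x, y, 1)^T$ and $\tilde{p}' = (x', y', 1)^T$, each correspondence contributes two constraints that are linear in the entries of $H$:

```latex
% Eliminating the unknown scale in \tilde{p}' \sim H\tilde{p} by
% cross-multiplication gives two linear equations per correspondence:
\begin{align*}
h_{11}x + h_{12}y + h_{13} - x'(h_{31}x + h_{32}y + h_{33}) &= 0,\\
h_{21}x + h_{22}y + h_{23} - y'(h_{31}x + h_{32}y + h_{33}) &= 0.
\end{align*}
% Stacking the equations from n >= 4 correspondences yields A h = 0 with
% h = (h_{11}, \dots, h_{33})^T; the least-squares solution under
% \|h\| = 1 is the right singular vector of A associated with its
% smallest singular value.
```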
2.3.2 Rotation Model Estimation
A more constrained case is when the camera is rotating about its center of projection. In such a case, the homography can be decomposed as

$H_{10} = K_1 R_1 R_0^{-1} K_0^{-1} = K_1 R_{10} K_0^{-1}$, (2.3)
where $K_i = \mathrm{diag}(f_i, f_i, 1)$ is the camera intrinsic matrix, which projects pixels onto an infinite plane, and $R_{10}$ is a rotation matrix representing the motion of the camera. Here we can see that there is only one unknown, $f$, in $K$, which represents the focal length of the camera, and three unknowns inside the general rotation matrix $R$. Thus, instead of the homography transformation, we get a 3-, 4- or 5-parameter rotation model depending on whether the focal lengths of the cameras are known, unknown but equal, or unknown and different. These parameters can be estimated using the Levenberg-Marquardt algorithm (LMA), as described in Brown and Lowe's work [15, 16]. The 3D rotation model is intrinsically more stable than a full 8-parameter homography [66]. A straightforward benefit of this model is that the warped images suffer from less distortion after applying the cylinder warping described in the following section.
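Below is a minimal sketch of such an estimation for the 4-parameter case (one shared unknown focal length plus a rotation vector), using SciPy's Levenberg-Marquardt solver. This is our own illustration rather than Brown and Lowe's implementation, and it assumes pts0 and pts1 are hypothetical N × 2 arrays of matched pixel coordinates measured relative to the principal point:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, pts0, pts1):
    # params = [f, rx, ry, rz]: shared focal length and a rotation vector.
    f, rotvec = params[0], params[1:]
    K = np.diag([f, f, 1.0])
    R = Rotation.from_rotvec(rotvec).as_matrix()
    H = K @ R @ np.linalg.inv(K)          # Eq. (2.3) with K1 = K0 = K
    q = (H @ np.column_stack([pts0, np.ones(len(pts0))]).T).T
    return (q[:, :2] / q[:, 2:] - pts1).ravel()

# Levenberg-Marquardt refinement from a rough initial focal length guess:
# fit = least_squares(residuals, x0=[800.0, 0.0, 0.0, 0.0],
#                     args=(pts0, pts1), method="lm")
```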
2.3.3 Cylinder Mapping
After the homographies for each pair of input images are computed, the images can be directly aligned by warping according to the estimated homographies. In that case, one image in the input set (in most cases the center image) is chosen as the reference image, and all other images can then be transformed into the coordinate system of the reference image by concatenating the pair-to-pair homographies. As described in Section 2.3.1, a homography preserves the perspective property that straight lines remain straight after the transformation, and sometimes this is a key requirement for some applications. However, for stitched results with a large field of view (FOV), the content near the border of the panorama suffers severe stretching. In practice, this problem is usually solved by applying a cylinder mapping [62, 20]. Figure 2.3 shows an illustration of the cylinder mapping.
Homography Warping
Cylinder Warping
Figure 2.3: An example of homography warping and cylinder warping. The result using the homography transformation preserves the straight-line structures of the scene. However, we can see that, as it produces a large field of view, the content near the border is stretched. This artifact can be relieved using cylinder mapping.
The idea of the cylinder mapping is to project the reference plane onto a cylindrical surface. For a given radius r of the cylinder surface, the mapped coordinates of a point p(x, y, f) can be computed as

$x' = r\theta = r\tan^{-1}\!\left(\frac{x}{f}\right), \qquad y' = rh = \frac{ry}{\sqrt{x^2 + f^2}}$, (2.4)

where r is usually set to $\bar{f}$ (the average of the focal lengths of all input images) in our implementation. Figure 2.3 shows an example of the cylinder mapping.
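A minimal sketch of this warp is given below; it is our own illustration, assuming the principal point at the image center and r = f. It computes the inverse of Eq. (2.4) for every pixel of the cylindrical canvas and resamples the source image:

```python
import numpy as np
import cv2

def cylinder_warp(img, f):
    # For each pixel (x', y') on the cylindrical canvas, invert Eq. (2.4):
    #   x = f * tan(x'/f),  y = y' / cos(x'/f)
    # then sample the source image at (x, y). Valid while |x'/f| < pi/2.
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    yc, xc = np.indices((h, w), dtype=np.float32)
    theta = (xc - cx) / f
    map_x = f * np.tan(theta) + cx
    map_y = (yc - cy) / np.cos(theta) + cy
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
```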
2.4 Post-Processing

After alignment, post-processing is applied to hide the remaining misalignment artifacts. Two families of techniques are commonly used: blending techniques, which smooth the transition between the aligned images, and seam-cutting techniques, which search for an optimal seam inside an overlapping region such that alignment artifacts are minimized. Seam-cutting can at times have the effect of removing undesirable objects that appear in the overlapped region. We describe both of these techniques in the following. Note that the research in this thesis uses the seam-cutting technique.
2.4.1 Blending
Blending techniques have played an important role since the emergence of image mosaicing. Interesting works include approaches such as gradient-domain blending [49, 36] and Laplacian pyramid blending [17, 16].

Gradient-domain blending is done by assuming that the visual content of the image can be represented using the gradients of the image, and hence concatenating