objects/layers to focus, and the other parts will be naturally blurred according to their depth values in the scene.

This project can be divided into two parts. The first is how to produce a layered depth map. Computing a depth map is essentially a labeling-assignment task. This type of problem can be solved by finding a minimum of a constructed energy function, and the graph cuts algorithm is one of the most efficient optimization methods for doing so. We use it to optimize our energy function because of its fast convergence. The second part is to blur each layer that is not selected to be in focus. Several blurring algorithms are applied to achieve this goal.

In this paper, I first describe related work and background studies on labeling-assignment theories and related topics in the vision area. I then explore the refocusing-related principles in computational photography. Based on these studies, I go through our image refocusing project in detail and compare the experimental results to other existing approaches. Finally, I propose some possible future work in this research area.
Acknowledgment
First, I would like to express my sincere appreciation to my supervisor, Dr. Low Kok Lim. He has offered me a large amount of precious advice and suggestions, contributed his valuable time to review this paper, and provided thoughtful comments.

I would also like to thank my family, especially my parents. They always offer me continuous encouragement and infinite support. My acknowledgement goes to my labmates as well: when I felt tired and wanted to give up, they were by my side and supported me.
2.1.1 Single Image Input
2.1.2 Stereo Matching – Two Images Input
2.1.3 Multi-View Scene Reconstruction – Multiple Images Input
2.1.4 3D World Reconstruction from a Large Dataset
3.3.1 Preliminary Knowledge
3.3.2 Details of Graph Cuts
3.4 Concepts of Stereo Matching
3.5.1 Technical Principles of Photography
3.5.2 Effects and Processing in Photography
3.6.1 Overview and Problem Formulation
3.6.2 Results and Conclusion
Computation of Camera Parameters
Estimation of Depth Value in 3D Scene
Problem Formulation and Graph Cuts
4.4.2 Results of Single Depth Map
4.4.3 Considerations and Difficulties
Combining Blurry Layers
Results and Comparison
Comparison to Adobe Photoshop
Different Lens Apertures
Bokeh
Different Number of Input Images
Comparison to Other Software on Camera
In practice, people sometimes want to shoot a photo with a sharp foreground and a blurred background. Moreover, they would like to sharpen the objects they prefer and blur the parts they do not consider important. In this case, we need a camera with a large aperture to create a shallow depth of field. However, general point-and-shoot cameras have small sensors and lenses, which makes it somewhat difficult to generate a shallow-focus effect. To work around this problem, our project uses multiple small-aperture photos taken from slightly different viewpoints, which simulates a bigger camera aperture, to create a shallow depth of field. It means that with only one point-and-shoot camera, ordinary users without any photography technique can take an
'artistic' depth-of-field photo. Another advantage of our project is that the user can select any part of the reference photo to sharpen, while the other parts are blurred. This is achieved by producing a depth map from the given photos.
1.2 Problem Statement
Most people will ask: what kind of problem does our project solve? The first thing we consider is convenience for camera users. The top-selling point-and-shoot cameras in the current market are indeed convenient and easy to use. People just need to press the shutter button, and a beautiful photo is produced. However, merely generating a picture is not enough; users prefer more realistic effects in their photos. For example, if we choose a depth-of-field scene simulation function on the camera, the foreground objects (e.g., people) will be kept sharp while the background will be blurred, just by pressing a button on the camera.
Popular point-and-shoot cameras with a depth-of-field effect usually produce one of two kinds of photos: either the objects are sharp everywhere (Figure 1.1, left), or the camera detects the foreground objects automatically, keeps them sharp, and blurs all the other regions in the photo (Figure 1.1, right). Here is the problem: what if the user wants to sharpen the background objects instead of the foreground ones? For example, in Figure 1.1 (right), users may not want to focus on the flower; perhaps they would like to see the white object in the background more clearly. In that case we need to refocus the whole scene in the photo, i.e., change the emphasized part of the scene. In a word, our project should solve the following problem: with the fewest handling steps on the input, how can users finally obtain a photo in which the parts they prefer are sharp and the rest is blurred?
Figure 1.1 – left: all the objects are sharp everywhere; right: shallow depth of field – the flower in the foreground is sharp, while the background is largely blurred.
There is some existing work in the area of refocusing; I will describe the details of these methods in the next chapter. In this thesis, we present a simple but effective idea for implementing refocusing on current popular point-and-shoot cameras. The whole procedure can be divided into two parts. The first part is to compute a depth map based on the given input photos. The theory and problem formulation underlying the depth map computation belong to one classical research area in early vision: label assignment, which focuses on how to assign a label to each pixel (in an image) given some observed data. This kind of problem can be solved by energy minimization: a way to carry out the label assignment is to construct an energy function and minimize it.
The second part of the project is to sharpen one part of the photo while blurring the other parts, based on the depth map computed in the previous step. For this phase, we need to handle problems that arise, such as partially occluded regions, the blur kernel, and issues from the layers of the constructed 3D scene. After combining the two steps above, the user can finally obtain a new image with the sharp parts they want by shooting only several photos with a common point-and-shoot camera.

Therefore, we can summarize: the input of this project is a sequence of sharp photos taken from slightly different viewpoints; the user then selects a preferred region on the reference photo to be emphasized; and the output is a depth-of-field image.
1.3 Advantages and Contributions
We analyze the advantages of our project from three aspects: depth map generation, the refocusing procedure, and the quantity of input parameters.
(a) Depth map generation:
There are many approaches to producing a depth map in the early vision area. Most of them are stereo matching methods: the user needs to prepare two images – left and right – with only a translation between them. The output is a disparity map based on the moving distance of each pixel between the two images; i.e., a large translation corresponds to objects nearer the camera, while a small movement corresponds to scene background far away from the camera.
Another type of depth map generation is to reconstruct a 3D scene from a large set of photos, which can be taken from very different viewpoints. Actually, to produce a coarse depth map there is no need to reconstruct the corresponding 3D scene; the information a depth map requires depends on where it is applied. Sometimes a rough depth value (the distance from the camera lens to the objects) of each pixel in the image is enough, while in other cases, especially full 3D scene reconstruction, an exact depth value (e.g., an (x, y, z) point representation) of each 3D point is necessary.
Our method is a tradeoff between the previous two approaches. We use several images to generate a usual depth map instead of a full 3D reconstruction. First, users do not need to shoot a large set of photos; 5-7 images are enough. Second, the theory of multi-view scene reconstruction is simple: generate an appropriate energy function for the given problem and apply an efficient energy minimization algorithm such as graph cuts to minimize it [33]. The details and our implementation will be described in a later chapter.

In a word, our depth map production has the advantages of needing less input and using a simple algorithmic idea, instead of the many steps of a 3D reconstruction approach, such as feature point extraction, structure from motion, or bundle adjustment. Another reason we do not apply a 3D reconstruction approach is that the information we require for this project is less than what is needed to reconstruct a 3D scene of the real world – a rough depth value for each pixel in the reference image is truly enough.
(b) Refocusing procedure:
One common approach to refocusing is to acquire a sequence of images with different focus settings. Based on a spatially varying blur estimate of the whole scene, where the information is taken from the focus differences, an all-focused image can be computed and then refocused. However, from the user's point of view, they need to take several photos under different focus settings, which is not so easy if they have little knowledge of photography. What if they do not know how to adjust the focus settings? In other words, some existing refocusing work requires at least two input images with very different properties: for example, one of the input photos has sharp objects in the foreground, while in the other image the background objects are sharp and the foreground is blurred. From the different blur degrees and regions, the program can then compute a relatively accurate blur kernel to accomplish the refocusing task. Notice that a camera with a depth-of-field effect is required for this kind of approach.
(c) Quantity of input parameters:
Compared to our method, most other existing refocusing methods need more input parameters. In [8], in order to estimate a depth map with a single camera (with a wide depth of field), the author projects a sparse set of dots onto the scene (with a shallow depth-of-field projector) and refocuses based on the resulting depth map.
Another type of method modifies the camera lens: various shapes of camera aperture are used, or a coded aperture is applied to a conventional camera [1]. With knowledge of the aperture shape, the blur kernel can be computed as well.
Our project does not require extra equipment like a projector or any camera modification. We do not need to take any photos with a shallow depth of field to estimate the blur kernel either. All users need to do is shoot several all-focused photos from slightly different shooting angles using a point-and-shoot digital camera. This makes it more convenient for ordinary people without any photography technique to obtain a final refocused photo. In our project, the only interaction between the user and the computer is the user's selection of the preferred region to be emphasized (sharpened).
Chapter 2
Related Work
According to the procedure of our project – depth map computation and image refocusing based on the depth information – the literature survey in this chapter is also divided into two categories: existing work on depth estimation and existing work on image refocusing.

In this chapter, we not only introduce the existing work related to our project, but also describe some key algorithms and their corresponding applications. These algorithms, such as stereo matching concepts, play an important part in our project as well, so introducing such concepts in detail is necessary.
2.1 Depth Estimation
We divide this part into four categories according to the required input, which in my opinion makes it much clearer to compare the various methods.
2.1.1 Single Image Input
Most approaches for obtaining a depth map from a single input image require additional equipment, such as a projector, or device modifications, such as changing the shape of the camera aperture. In [8], the author uses a single camera (with a wide depth of field), and the depth value computation is based on the defocus of a sparse set of dots projected onto the scene (using a narrow depth-of-field projector). With the help of the projected dots and a color segmentation algorithm, an approximate depth map of the scene with sharp boundaries can be computed for the next step. Figure 2.1 shows an example from [8]: (a) is the image acquired from the single input image and projector; (b) is the depth map computed from the information provided by (a). The produced depth map has very sharp and accurate object boundaries. However, it cannot handle the partial occlusion problem; i.e., we are not able to see the region of the man behind the flower in the depth map.
Figure 2.1 – example result from [8]
In [1], the authors use a single image capture and a small modification to the traditional camera lens – a simple piece of cardboard suffices. On the condition that the shape of the lens aperture is already known, the corresponding blur kernel can be estimated, and thus deconvolution can be applied to the blurry part of the image in order to recover an all-focused final image (in the refocusing step). The output of this method is a coarse depth map, which is sufficient for the refocusing phase in most applications.
2.1.2 Stereo Matching – Two Images Input
Stereo matching is one of the most active research areas in computer vision, and it serves as a significant intermediate step in many applications such as view synthesis, image-based rendering, and 3D scene reconstruction. Given two images/photos taken from slightly different viewpoints, the goal of stereo matching is to assign a depth value to each pixel in the reference image, where the final result is represented as a disparity map.

Disparity indicates the difference in location between two corresponding pixels and is also considered a synonym for inverse depth. The most important first step is to find the corresponding pixels that refer to the same scene point in the given left and right images. Once these correspondences are known, we can determine how much displacement was produced by the camera movement.

Since it is hard to identify corresponding pixels under arbitrary camera motion, the image pair is commonly rectified to a pure horizontal translation to simplify the search for correspondences, so that the stereo problem is reduced to a one-dimensional search along corresponding scan lines. Therefore, we can simply view the disparity value as the offset between x-coordinates in the left and right images (Figure 2.2). Objects nearer to the viewpoint have a larger translation, while farther ones move only slightly.
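As a brief aside (a standard relation I am adding, not stated explicitly in this section): for a rectified pair with baseline B (the distance between the two camera centers) and focal length f, the disparity d and depth Z of a point are related by

Z = f · B / d,

so a large disparity means a point close to the camera and a small disparity means a distant one. This is why disparity is commonly treated as inverse depth.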
Figure 2.2 - The stereo vision is captured in a left and a right image
An excellent review of stereo work can be found in [4], which presents a taxonomy and comparison of two-frame stereo correspondence algorithms.
Stereo matching algorithms generally perform (subsets of) the following three steps [4]:

a. matching cost computation;

My own comparison and implementation focus only on pixel-based matching costs, which is enough for this project. The most common pixel-based methods are squared intensity differences (SD) [14, 18, 12, 7] and absolute intensity differences (AD) [30]. SD and AD are computed between single pixels of the given left and right images.

b. cost (support) aggregation;

Cost aggregation is usually a window-based (local) method: it aggregates the matching cost by summing or averaging over a support region. For commonly used stereo matching, a support region is a two-dimensional area, often a square window (3-by-3, 7-by-7). The cost aggregation for each pixel in an image can then be calculated over such a region.

c. disparity computation / optimization.

This step can be separated into two classes: local methods and global methods. Local methods usually perform a local "winner-take-all" (WTA) optimization at each pixel [40]. For global optimization methods, the objective is to build a disparity function that minimizes a global energy. Such an energy function includes a data term and a smoothness term, which represent the energy distribution of each pixel in the two stereo images. Given the energy function, several minimization algorithms such as belief propagation [19], graph cuts [28, 34, 17, 2], dynamic programming [20, 31], or simulated annealing [26, 11] can be used to compute the final depth map. A small sketch combining the three steps follows this list.
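To make steps (a)-(c) concrete, here is a minimal local block-matching sketch (my own illustration, not the thesis implementation; it assumes a rectified grayscale pair and uses SciPy's uniform_filter as the box filter):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def block_matching_disparity(left, right, max_disp=32, window=7):
    """Naive local stereo: SD matching cost (step a), square-window
    aggregation (step b), winner-take-all selection (step c).
    left, right: rectified grayscale images as 2D float arrays."""
    h, w = left.shape
    costs = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # (a) per-pixel squared intensity difference at disparity d
        sd = np.zeros((h, w))
        sd[:, d:] = (left[:, d:] - right[:, :w - d]) ** 2
        # (b) aggregate the cost over a square support window
        agg = uniform_filter(sd, size=window)
        agg[:, :d] = np.inf          # columns with no valid match
        costs[d] = agg
    # (c) winner-take-all: lowest aggregated cost wins at each pixel
    return np.argmin(costs, axis=0)
```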
2.1.3 Multi-View scene reconstruction - Multiple Images Input
A 3D scene can be roughly reconstructed from a small set of images (about 10 pictures). The result of this type of method is usually a dense depth map similar to those of stereo matching. However, stereo matching has difficulty dealing with partially occluded regions because only two input images provide too little information. With multiple images as input, there is enough information to produce a relatively exact result, and partial occlusion may also be resolved. In [Multi-camera; asymmetrical], the energy function has more than two terms, i.e., a data term, a smoothness term, and a visibility term; the visibility term is used to handle the partial occlusion problem. Besides, [13, 25, 23] not only take advantage of multiple images as input, but also build a new type of data structure as output – a layered depth map, which can represent the layered pattern of the real 3D scene more clearly. The whole scene is divided into several planes, each representing a distance of objects from the camera. Figure 2.3 shows an example of a layered depth map [23]. This new representation deals with partial occlusion better, and most importantly, it is more convenient and exact for the next step – refocusing.
Figure 2.3 – (a) one of the input sequence; (b) the recovered depth map; (c) the separated layers.
2.1.4 3D World Reconstruction from a large dataset
As in [16], the input is a large dataset where the photos are taken from very different views. We can therefore obtain plenty of information, including the camera position and affine matrix of each photo, and finally compute the exact depth value of each pixel in the real world. Given each pixel's x, y, z values for an image (the z value is the one computed from the large dataset), we can easily reconstruct the 3D scene of the image, and of course the computed depth map is much more accurate than those produced by stereo matching or multi-view scene reconstruction (see Figure 2.4). The result of 3D reconstruction comprises a large number of sparse points. For people who would like to see a rough sketch of a certain building, a set of sparse points is enough. To obtain a dense map, estimation over the regions without known points, or triangulation, may be performed in order to connect the discrete points together.
(a) large dataset as input (b) 3D reconstruction result
Figure 2.4 – example from [16]
2.2 Image Refocusing
In our project, this part builds on the preceding depth estimation phase: all our research and implementation of refocusing are based on the single dense depth map we have produced. Therefore, when introducing related work in image refocusing, we describe only the refocusing part of each paper; i.e., we presume that the depth map (single or layered) has already been provided and focus on how to blur or sharpen the original image.
One approach blends a foreground-focused image with a background-focused image within the boundary region; the authors combine their depth map with a matting matrix computed from the depth estimation refinement to produce a better result. [1] is another approach that uses blur kernel estimation to fulfill the refocusing work. The input is a single blurred photograph taken with the modified camera (which yields both depth information and an all-focus image), and the output is coarse depth information together with a normal high-resolution RGB image. Therefore, to reconstruct the original sharp image, the correct blur scale of the observed image has to be identified with the help of the modified camera aperture shape. The process is based on a probabilistic model that finds the maximum likelihood of a blur-scale estimation equation. To summarize, it follows the general convolution equation y = f_k * x, where y is the observed image, x is the sharp image to be recovered, and the blur filter f_k is a scaled version of the aperture shape [1]. The defocus step of this method is to find the correct kernel f_k with the help of the known coded aperture information.
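As a rough illustration of recovering x once a kernel f_k has been chosen (my own sketch of generic Wiener deconvolution in the frequency domain; the actual method in [1] uses a probabilistic blur-scale search, which is not reproduced here):

```python
import numpy as np

def wiener_deconvolve(y, kernel, snr=1e-2):
    """Recover an estimate of the sharp image x from y = kernel * x.
    y: blurred grayscale image (2D array); kernel: blur filter (2D array).
    snr: regularization constant standing in for the noise-to-signal ratio."""
    # pad the kernel to the image size and move its center to (0, 0)
    k = np.zeros_like(y, dtype=float)
    kh, kw = kernel.shape
    k[:kh, :kw] = kernel
    k = np.roll(k, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    K = np.fft.fft2(k)
    Y = np.fft.fft2(y)
    # Wiener filter: divide by K where it is large, damp where it is small
    X = Y * np.conj(K) / (np.abs(K) ** 2 + snr)
    return np.real(np.fft.ifft2(X))
```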
Another kind of method produces a layered depth map that divides the whole scene into several parts, each with a certain depth value. These separated layers are treated as individual parts, which is certainly a solution for avoiding missing regions between objects lying in different scene layers. Given a layered depth map, the only thing we need to do is blur each layer according to the depth value of that individual layer, without considering the other layers.
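A minimal sketch of this per-layer blurring idea (my illustration, assuming OpenCV is available; a real pipeline also has to handle layer boundaries and partial occlusion, which this ignores):

```python
import cv2
import numpy as np

def refocus_layers(image, depth_map, focus_label, blur_per_step=2):
    """Blur each depth layer by an amount growing with its distance
    from the focused layer, then composite the layers back together.
    image: BGR input; depth_map: integer layer label per pixel."""
    result = np.zeros_like(image)
    for label in np.unique(depth_map):
        # blur strength grows with distance from the focused layer
        dist = abs(int(label) - int(focus_label))
        if dist == 0:
            layer = image  # the selected layer stays sharp
        else:
            ksize = 2 * blur_per_step * dist + 1  # odd kernel size
            layer = cv2.GaussianBlur(image, (ksize, ksize), 0)
        mask = depth_map == label
        result[mask] = layer[mask]
    return result
```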
A discussion of light field information in the depth estimation area can be found in [32].
I also introduce some concepts related to the refocusing part of the project. The basic theories and relationships among the common camera parameters are described in section 3.5. Besides, several different methods (applications) on the topics of refocusing, defocusing, and depth of field are introduced for a direct comparison with our method.
3.1 Standard Energy Function in Vision
In early vision problems, label assignment can be explained as follows: every pixel must be assigned a label in some finite set L. In image restoration, the assignments represent pixel intensities, while for stereo vision and motion, the labels are disparities. Another example is image segmentation, where labels represent the pixel values of each segment. The goal of label assignment is to find a labeling f that assigns each pixel p a label f_p ∈ L, where f is both piecewise smooth and consistent with the observed data. In the general energy function formulation, such vision problems can be naturally expressed as
E(f) = E_smooth(f) + E_data(f)    (1)
where E_smooth(f) measures the extent to which f is not piecewise smooth, while E_data(f) measures the disagreement between f and the observed data [39]. Typically, in many papers and applications, the form of E_data(f) is defined as

E_data(f) = ∑_{p ∈ P} D_p(f_p),

where D_p measures how well label f_p fits pixel p given the observed data. In the stereo vision problem, D_p(f_p) is usually (I_left(p) − I_right(p + f_p))², where I_left and I_right are the pixel intensities at the two corresponding points in the given left and right images, respectively. In the image restoration problem, D_p(f_p) is normally (f_p − I_p)², where I_p is the observed intensity of p.
While the data term is easier to define and apply to different practical problems, the choice of smoothness term is a very important and critical issue in current research. It directly decides whether the final result is optimal, and the form of this term usually depends on the application. For example, in some approaches E_smooth(f) makes f smooth everywhere, according to the demands of the algorithm. For many other applications, E_smooth(f) has to preserve object boundaries as clearly as possible, which is often referred to as discontinuity preserving. For image segmentation and stereo vision problems, object boundaries are a necessary factor to consider first. The Potts model (described in section 3.3.1.1) is also a popularly used type of smoothness term. From the discussion of the data and smoothness terms, we consider energies of the form

E(f) = ∑_{p ∈ P} D_p(f_p) + ∑_{(p,q) ∈ N} V_{p,q}(f_p, f_q),

where N is the set of pairs of neighboring pixels. Normally, N is composed of adjacent pixels (i.e., left and right, top and bottom, etc.), but it can contain arbitrary pairs as well, depending on the problem requirements. Most applications only consider V_{p,q} under pairwise interactions, since pixel dependence and interaction usually occur between adjacent pixels.
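As a concrete reading of this energy (a minimal sketch I am adding, with a squared-difference data term and a Potts smoothness term on a 4-connected grid; not code from the thesis):

```python
import numpy as np

def energy(f, observed, lam=1.0):
    """Evaluate E(f) = sum_p D_p(f_p) + sum_{(p,q) in N} V(f_p, f_q)
    on an image grid. f, observed: 2D arrays of the same shape.
    D_p is a squared difference; V is a Potts penalty of weight lam."""
    # data term: disagreement between the labeling and the observed data
    e_data = np.sum((f.astype(float) - observed.astype(float)) ** 2)
    # smoothness term: Potts penalty over 4-connected neighbor pairs
    e_smooth = lam * (np.count_nonzero(f[:, 1:] != f[:, :-1])
                      + np.count_nonzero(f[1:, :] != f[:-1, :]))
    return e_data + e_smooth
```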
3.2 Optimization Methods
Finding a minimum value of a given energy function is a typical optimization problem. In this section, I first give an introduction to and motivation for optimization techniques, then focus on an efficient optimization method – graph cuts.
3.2.1 Introduction to Optimization
In mathematics and computer science, optimization, or mathematical programming, refers to choosing the best element from some set of available alternatives. In the simplest case, optimization means solving problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or integer variables from within an allowed set. This formulation, using a scalar, real-valued objective function, is probably the simplest example. Generally, optimization means finding the "best available" values of some objective function over a defined domain, covering a variety of types of objective functions and domains.
Greig et al. [3] were the first to use powerful min-cut/max-flow algorithms from combinatorial optimization to minimize certain typical energy functions in computer vision. Combinatorial optimization is a branch of optimization in which the feasible solutions are discrete or can be reduced to the discrete case, and the goal is to find the best possible solution – most of the time, an approximate one. Approximation algorithms run in polynomial time and find a solution that is "close" to optimal.
For a given energy function, in general, a labeling f is a local minimum of the energy E if E(f) ≤ E(f′) for any labeling f′ within a single move of f.
The graph cuts algorithm is one of the most popular optimization algorithms in the current related research areas. It can rapidly compute a local minimum with relatively good results. Figures 3.1-3.3 show some examples produced by graph cuts; the object boundaries are quite clear, which fits the requirements of image segmentation and stereo vision.

Since the graph cuts algorithm is the major topic discussed in this paper, I will introduce it in detail later.
Figure 3.1 - Results of color segmentation on the Berkeley dataset by using graph cuts
Figure 3.2 - Results of texture segmentation on the MIT VisTex and the Berkeley dataset by using graph cuts
Figure 3.3 – Disparity map of stereo vision matching using graph cuts
3.3.1.1 Metric and Semi-Metric
V is called a metric on the label space L if it satisfies

(a) V(α, β) = 0 ⇔ α = β,
(b) V(α, β) = V(β, α) ≥ 0,
(c) V(α, β) ≤ V(α, γ) + V(γ, β)

for any labels α, β, γ ∈ L. If V satisfies only (a) and (b), it is called a semi-metric.
For example, the Potts model V(α, β) = T(α ≠ β) is a metric, where T(·) is 1 if its argument is true and 0 otherwise. The Potts model encourages labelings consisting of several regions where pixels in the same region have equal labels [4]. The discontinuity-preserving results produced by this model are also called piecewise constant, which is widely used in segmentation and stereo vision problems.

Another type of model is called piecewise smooth. The truncated quadratic V(α, β) = min(K, |α − β|²) is a semi-metric, while the truncated absolute distance V(α, β) = min(K, |α − β|) is a metric, where K is some constant. The role of the constant K is to cap the discontinuity penalty imposed by the smoothness term in the energy function. These models encourage labelings consisting of several regions where pixels in the same region have similar labels.
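The three interaction models above are easy to state in code (a small sketch I am adding for concreteness):

```python
def potts(a, b):
    """Potts model: metric, encourages piecewise-constant labelings."""
    return 0 if a == b else 1

def truncated_quadratic(a, b, K=16):
    """Semi-metric: quadratic growth capped at K, piecewise smooth."""
    return min(K, (a - b) ** 2)

def truncated_absolute(a, b, K=4):
    """Metric: absolute distance capped at K, piecewise smooth."""
    return min(K, abs(a - b))
```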
3.3.1.2 Graph denotation and construction
Let 〈 〉 be a weighted graph It consists of a set of nodes V and a set of edges that connect them The set of nodes has several distinguished vertices which are called the terminals
In the context of vision problem, the nodes normally correspond to pixels, voxels or other types
of image features and terminals correspond to the set of labels which can be assigned to each pixel in the image For simplification, I will only focus on the case of two terminals (i.e two labels to be assigned) Usually the two terminal nodes can also be called source node and sink node The multiple terminals problem can be naturally extended from the two-label case In Figure 3.4, a simple example of a two terminal graph is shoed This graph construction can be used on a 3 x 3 image with two to-be-assign labels
For the edges connected between different nodes, a t-link is an edge that connects terminal nodes (source and sink) to image pixel nodes, while an n-link is an edge that connects two image nodes within a neighborhood system
Figure 3.4 – Example of a constructed graph. A similar graph cuts construction was first introduced in vision by Greig et al. [3] for binary image restoration.
A cut C is a set of edges such that the terminals are separated in the induced graph G(C) = ⟨V, E \ C⟩. After the cut, one subset of nodes belongs to the source terminal, while the other subset is assigned to the sink terminal. The cost of the cut C, denoted |C|, equals the sum of the edge weights in C. Figure 3.5 shows a typical cut on the constructed graph, represented as a green dotted line.
The minimum cut problem is to find the cut with the lowest cost among all cuts separating the terminals.
Figure 3.5 – Example of a cut on the constructed graph
3.3.1.3 Minimizing the Potts energy is NP-hard
The details of the NP-hardness proof will not be described here. All we have to know is that a polynomial-time method for finding an optimal configuration f* would provide a polynomial-time algorithm for finding the minimum cost multi-way cut, which is known to be NP-hard [38].

3.3.2 Details of Graph Cuts
… using labeling (b) as an initial labeling estimate [39].
3.3.2.2 Definition of Swap and Expansion Algorithms
For the α-β swap: given a pair of labels α and β, an α-β swap is a move from an old labeling f to a new labeling f′. If the pixel labeling difference between the before and after states leads to a decrease of the energy function, we say the α-β swap succeeds and continue to the next iteration. In other words, in an α-β swap some pixels that were labeled α change to label β, and some pixels that were labeled β change to label α.

For the α-expansion: given a label α, an α-expansion move is also a move from an old labeling f to a new labeling f′, in which some pixels that were not assigned label α become assigned to label α.

Figure 3.7 shows the α-β swap and α-expansion algorithms respectively. We call a single execution of Steps 3.1-3.2 an iteration, and an execution of Steps 2, 3, and 4 a cycle. In each cycle, an iteration is performed for every label α, or for every pair of labels α and β. The algorithms continue until they cannot find any successful labeling change. It is easy to see that a cycle of the α-β swap algorithm takes |L|² iterations, while a cycle of the α-expansion algorithm takes only |L| iterations.
Figure 3.7 – Overview of α-β swap algorithm (top) and α-expansion algorithm (bottom)
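In code form, the outer loop of the expansion algorithm looks roughly like this (a schematic I am adding; `expansion_move` stands for the graph construction and min-cut solve for one label, which is assumed rather than shown):

```python
def alpha_expansion(f, labels, energy, expansion_move):
    """Iterate expansion moves until no move lowers the energy.
    f: initial labeling; labels: the finite label set L;
    energy(f): evaluates E(f); expansion_move(f, a): returns the
    best labeling within one alpha-expansion of f (via min-cut)."""
    best = energy(f)
    improved = True
    while improved:            # one pass over all labels is a "cycle"
        improved = False
        for a in labels:       # one expansion per label is an "iteration"
            f_new = expansion_move(f, a)
            e_new = energy(f_new)
            if e_new < best:   # keep the move only if the energy drops
                f, best = f_new, e_new
                improved = True
    return f
```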
Given an initial labeling f and either a pair of labels α and β (swap algorithm) or a label α (expansion algorithm), we want to find a new labeling f′ that minimizes the given energy.
3.3.2.3 Swap
Any cut leaves each pixel in the image with exactly one t-link, which means that the result of a cut determines the labeling f′ of every pixel. Viewed another way, a cut can be described as follows: a pixel p is assigned label α when the cut C separates p from the terminal α; similarly, p is assigned label β when the cut C separates p from the terminal β. If p is not chosen to have its label changed, its original label f_p is kept.
Lemma 3.1: A labeling f^C corresponding to a cut C on the constructed graph is one α-β swap away from the initial labeling f.

Lemma 3.2: There is a one-to-one correspondence between cuts C on the constructed graph and labelings that are one α-β swap from f. Moreover, the cost of a cut C on the graph is |C| = E(f^C) plus a constant.

Corollary 3.1: The lowest-energy labeling within a single α-β swap move from f is f̂ = f^C, where C is the minimum cut on the constructed graph.
3.3.2.4 Expansion

For the expansion algorithm, a cut can be described as: a pixel p is assigned label α when the cut C separates p from the terminal α. If p is not chosen to change to label α, its original label f_p is kept.

Lemma 3.3: A labeling f^C corresponding to a cut C on the constructed graph is one α-expansion away from the initial labeling f.
Lemma 3.4: There is a one-to-one correspondence between elementary cuts on the constructed graph and labelings within one α-expansion of f. Moreover, for any elementary cut C, we have |C| = E(f^C). With these lemmas, the corresponding corollary for the expansion move can finally be proved.
Figures 3.8 and 3.9 illustrate the results of the α-β swap and α-expansion algorithms.
Figure 3.8 – Examples of α-β swap and α-expansion algorithms
Figure 3.9 – Example of α-expansion. Leftmost is the input initial labeling; an expansion move is shown in the middle; and on the right is the corresponding binary labeling.
The window-based algorithms described in this survey may produce results with a number of errors, such as black holes or mismatches in the disparity map. Using the graph cuts algorithm, object boundaries can be clearly detected thanks to the choice of smoothness term, and the label assignment for the disparity map is obtained thanks to the data term.
3.4 Concepts of Stereo Matching
3.4.1 Problem Formulation
The basic idea of stereo matching is: for every pixel in one image, find the corresponding pixel in the other image. The authors of [35] refer to this definition as the traditional stereo problem. The goal is to find the labeling f: P → {0, 1, …, |L| − 1} that minimizes

E(f) = ∑_{p ∈ P} D_p(f_p) + ∑_{(p,q) ∈ N} V(f_p, f_q).

Again, D_p is the penalty for assigning a label to the pixel p; N is the neighborhood system composed of pairs of adjacent pixels; and V is the penalty for assigning different labels to adjacent pixels.
In the traditional stereo matching problem, the location movement of each pixel is along the horizontal or vertical direction. So if we assign a label f_p to the pixel p in the reference image I, the corresponding pixel in the matching image I′ should be p + f_p. The matching penalty D_p enforces photo-consistency, which is the tendency of corresponding pixels to have similar intensities. A possible form of D_p is D_p(f_p) = ‖I(p) − I′(p + f_p)‖².
Trang 3333
For the smoothness term, the Potts model is usually used to impose a penalty when f_p ≠ f_q. The natural form of the Potts model for the smoothness term is V(f_p, f_q) = λ · [f_p ≠ f_q], where the indicator function [·] is 1 if its argument is true and 0 otherwise [35]. The terms D and V can be viewed as lookup tables of sizes |P| × |L| and |L| × |L|, respectively: for stereo with the intensity difference as data term and the Potts model as smoothness term, D holds the per-pixel intensity differences for each candidate label, and V holds the Potts penalties for each label pair.
A more efficient and fast implementation of graph cuts uses a new min-cut/max-flow algorithm [36] to find the optimal cuts. Figure 3.10 shows some experimental results of two-view stereo matching with graph cuts. We can see that even for heavily textured images, graph cuts can still detect clear object boundaries and assign correct labels to the pixels.
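For a two-label case, such a min-cut/max-flow solve can be scripted in a few lines (an illustrative sketch assuming the open-source PyMaxflow library, not the implementation used in [36]):

```python
import maxflow

def binary_graph_cut(data_cost_0, data_cost_1, lam=1.0):
    """Two-label graph cut on an image grid.
    data_cost_0[p], data_cost_1[p]: D_p for labels 0 and 1 (2D arrays).
    lam: Potts weight placed on 4-connected n-links."""
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(data_cost_0.shape)
    g.add_grid_edges(nodes, lam)                        # n-links (Potts)
    g.add_grid_tedges(nodes, data_cost_0, data_cost_1)  # t-links (data costs)
    g.maxflow()                                         # solve min-cut/max-flow
    return g.get_grid_segments(nodes)                   # boolean label per pixel
```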
Figure 3.10 – Original images (left), results, and ground truth. The top row is the "lamp" data sequence and the bottom row is the "tree" data set.
3.4.2 Conclusion and Future Work
The binocular stereo vision area is a relatively mature field. A large number of two-view stereo matching algorithms have been proposed in recent years, and the results of most approaches turn out to be quite good, with not only clear object boundaries but also accurate disparity values. An evaluation of these stereo methods provided by Middlebury [5] weighs their advantages and disadvantages, which gives readers an overall understanding of the current trends in the stereo vision area.
However, the given data have been simplified to a pair of rectified images in which the corresponding points are easy to find (one only needs to search along the horizontal or vertical line), since the data sets offered by Middlebury are strictly offset by several pixels between the left and right images. Future work may focus on two or more casually taken images without such strict horizontal or vertical translation; handling slight rotation can be viewed as a challenge for improvement.
3.5 Photography
In this section, I will introduce a series of concepts closely related to this project. Camera parameters, post-processing techniques, and technical principles are included to help better understand our project. In section 3.5.1, some related basic theories are introduced to build a fundamental picture of photography, especially the camera. In section 3.5.2, I describe some pre-processing and post-processing techniques as well, such as focusing, refocusing, and defocusing methods.
3.5.1 Technical Principles of Photography
3.5.1.1 Pinhole Camera Model
A pinhole camera is a simple camera with a single small aperture and no lens to focus light. Figure 3.11 shows a diagram of a pinhole camera. This camera model is usually used as a first-order approximation of the mapping from a 3D scene to a 2D image, and it is the main assumption in our project (described later).
Figure 3.11 – A diagram of pinhole camera
The pinhole camera model describes the mathematical relationship between the coordinates of a 3D point and its projection onto the 2D image plane of an ideal pinhole camera. The reason we use such a simple form is that even though some effects, like geometric distortion or depth of field, are not modeled, they can still be compensated by applying suitable coordinate transformations on the image coordinates. Therefore, the pinhole camera model is a reasonable description of how a common camera with a 2D image plane depicts a 3D scene.
Figure 3.12 illustrates the geometry of the pinhole camera mapping.
Figure 3.12 – The geometry of a pinhole camera
A point R is located at the intersection of the optical axis and the image plane and is referred to as the principal point or image center.

A point P somewhere in the world, at coordinates (x1, x2, x3), represents an object in the real world captured by the camera.

The projection of point P onto the image plane is denoted Q. This point is given by the intersection of the projection line (green) and the image plane. Figure 3.13 shows the geometry of the pinhole camera viewed from the X2 axis, which better demonstrates how the model works in practice.
Figure 3.13 - The geometry of a pinhole camera as seen from the X2 axis
I apply this model to our project in the first step of computing the depth map. It is very useful when calculating the camera parameters and the partially occluded parts of the 3D scene. I will describe it in detail in a later chapter.
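In formula form (standard pinhole geometry, added here for concreteness): a camera-frame point (x1, x2, x3), with x3 the distance along the optical axis, projects to image coordinates (y1, y2) = (f · x1 / x3, f · x2 / x3) for focal length f. A one-line sketch:

```python
import numpy as np

def pinhole_project(points, f):
    """Project Nx3 camera-frame points to the image plane at focal length f.
    Assumes x3 > 0 (points in front of the camera); returns Nx2 coordinates."""
    points = np.asarray(points, dtype=float)
    return f * points[:, :2] / points[:, 2:3]   # (f*x1/x3, f*x2/x3)
```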
3.5.1.2 Camera Aperture

The aperture determines how much light reaches the sensor for a given shutter speed: a fast shutter speed requires a larger aperture to ensure sufficient light exposure, and a slow shutter speed requires a smaller aperture to avoid excessive exposure. Figure 3.14 shows two different sizes of a given camera aperture.
Figure 3.14 - A large (1) and a small (2) aperture
The lens aperture is usually specified as an f-number: the ratio of focal length to effective aperture diameter. A lower f-number denotes a greater aperture opening, which allows more light to reach the film or image sensor. Figure 3.15 illustrates some standard aperture sizes. For convenience, I will use this "f / f-number" form to represent the size of the aperture.
Figure 3.15 - Diagram of decreasing aperture sizes (increasing f-numbers)
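As a quick worked example (my own numbers, for illustration): the f-number is N = f / D, where f is the focal length and D the aperture diameter. A 50 mm lens with a 25 mm aperture opening is therefore at N = 50 / 25 = 2, written f/2; stopping down to f/8 shrinks the diameter to 50 / 8 = 6.25 mm, letting in far less light.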
I will not say more about the camera aperture here; it has a strong relationship with photography effects like depth of field, and in section 3.5.2 it will be discussed again in detail together with practical techniques.
3.5.1.3 Circle of Confusion (CoC)
In photography, the circle of confusion is also used to determine the depth of field. It defines how much a point needs to be blurred in order to be perceived as unsharp by human eyes. When the circle of confusion becomes perceptible to the human eye, we say that this area is outside the depth of field and therefore no longer "acceptably sharp" under the definition of DOF. Figures 3.16 and 3.17 illustrate how the circle of confusion relates to depth of field.
Figure 3.16 – The range of circle of confusion
Figure 3.17 – Illustration of circle of confusion and depth of field
Again, the relationship between the circle of confusion and depth of field will be further described in later sections.
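One common closed form (a standard thin-lens approximation I am adding as a preview; the relationship used by the thesis is developed later): for a lens of focal length f focused at distance S1, a point at distance S2 produces a blur circle of diameter

c = A · (|S2 − S1| / S2) · (f / (S1 − f)),

where A is the aperture diameter (A = f/N for f-number N). The blur grows with aperture size and with the subject's distance from the focus plane, which is exactly the behavior a synthetic depth-of-field renderer has to imitate.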
3.5.1.4 Depth of Field

Figure 3.18 - The area within the depth of field appears sharp, while the areas in front of and behind the depth of field appear blurry.
In general, depth of field does not change abruptly from sharp to unsharp, but instead appears as a gradual transition (Figure 3.19).
Figure 3.19 – An image with very shallow depth of field, which appears as a gradual transition (from blurry to sharp, then to blurry again).