objects/layers to focus, and the other parts will be naturally blurred according to their depth values in the scene.

This project can be divided into two parts. The first is how to produce a layered depth map. Computing a depth map is essentially a labeling-assignment task. This type of problem can be solved by finding a minimum of a constructed energy function, and the graph cuts algorithm is one of the most efficient optimization methods for doing so. We use it to optimize our energy function because of its fast convergence. The second part is to blur each layer that is not selected to be in focus. Several blurring algorithms are applied to achieve this goal.

In this paper, I first describe related work and background studies on labeling-assignment theories and related topics in the vision area. I then explore the refocusing-related principles in computational photography. Based on these studies, I go through our image refocusing project in detail and compare the experimental results to other existing approaches. Finally, I propose some possible future work in this research area.
Acknowledgment
First, I would like to express my sincere appreciation to my supervisor, Dr. Low Kok Lim. He has offered me a large amount of precious advice and suggestions, contributed his valuable time to review this paper, and provided thoughtful comments.

I would also like to thank my family, especially my parents. They always offer me continuous encouragement and infinite support. My acknowledgement goes to my labmates as well: when I felt tired and wanted to give up, they were by my side and supported me.
2.1.1 Single Image Input
2.1.2 Stereo Matching – Two Images Input
2.1.3 Multi-View Scene Reconstruction – Multiple Images Input
2.1.4 3D World Reconstruction from a Large Dataset
3.3.1 Preliminary Knowledge
3.3.2 Details of Graph Cuts
3.4 Concepts of Stereo Matching
3.5.1 Technical Principles of Photography
3.5.2 Effects and Processing in Photography
3.6.1 Overview and Problem Formulation
3.6.2 Results and Conclusion
Computation of Camera Parameters
Estimation of Depth Value in 3D Scene
Problem Formulation and Graph Cuts
4.4.2 Results of Single Depth Map
4.4.3 Considerations and Difficulties
Combining Blurry Layers
Results and Comparison
Comparison to Adobe Photoshop
Different Lens Apertures
Bokeh
Different Number of Input Images
Comparison to Other Software on Camera
In practice, people sometimes want to shoot a photo with a sharp foreground and a blurred background. Moreover, they would like to sharpen the objects they prefer and blur the parts they do not consider important. In this case, we need a camera with a large aperture to create a shallow depth of field. However, general point-and-shoot cameras have small sensors and lenses, which makes it somewhat difficult to generate a shallow-focus effect. To work around this problem, our project uses multiple small-aperture photos taken from slightly different viewpoints, which simulates a bigger camera aperture, to create a shallow depth of field. It means that with only one point-and-shoot camera, ordinary users without any photography technique can take an
'artistic' depth-of-field photo. Another advantage of our project is that the user can select any part of the reference photo to sharpen, while the other parts are blurred. This is achieved by producing a depth map from the given photos.
1.2 Problem Statement
Most people will ask: what kind of problem does our project solve? The first thing we consider is convenience for camera users. The top-selling point-and-shoot cameras in the current market are indeed convenient and easy to use. People just need to press the shutter button, and a beautiful photo is produced. However, merely generating a picture is not enough; users prefer more realistic effects in their photos. For example, if we choose a depth-of-field scene simulation function on the camera, the foreground objects (e.g., people) will be kept sharp while the background will be blurred, just by pressing a button on the camera.
Popular point-and-shoot cameras with a depth-of-field effect usually produce one of two kinds of photos: either the objects are sharp everywhere (Figure 1.1, left), or the camera detects the foreground objects automatically, keeps them sharp, and blurs all the other regions in the photo (Figure 1.1, right). Here is the problem: what if the user wants to sharpen the background objects instead of the foreground ones? For example, in Figure 1.1 (right), users may not want to focus on the flower; perhaps they would like to see the white object in the background more clearly. In that case we need to refocus the whole scene in the photo, i.e., change the emphasized part of the scene. In a word, our project should solve the following problem: with the fewest handling steps on the input, how can users finally obtain a photo in which the parts they prefer are sharp and the rest is blurred?
Figure 1.1 – left: all the objects are sharp everywhere; right: shallow depth of field – the flower in the foreground is sharp, while the background is largely blurred.
There is some existing work in the area of refocusing; I will describe the details of these methods in the next chapter. In this thesis, we present a simple but effective idea for implementing refocusing on current popular point-and-shoot cameras. The whole procedure can be divided into two parts. The first part is to compute a depth map based on the given input photos. The theory and problem formulation underlying the depth map computation belong to one classical research area in early vision: label assignment, which focuses on how to assign a label to each pixel (in an image) given some observed data. This kind of problem can be solved by energy minimization: a way to carry out the label assignment is to construct an energy function and minimize it.
The second part of the project is to sharpen one part of the photo while blurring the other parts, based on the depth map computed in the previous step. For this phase, we need to handle problems that arise, such as partially occluded regions, the blur kernel, and issues from the layers of the constructed 3D scene. After combining the two steps above, the user can finally obtain a new image with the sharp parts they want by shooting only several photos with a common point-and-shoot camera.

Therefore, we can summarize: the input of this project is a sequence of sharp photos taken from slightly different viewpoints; the user then selects a preferred region on the reference photo to be emphasized; and the output is a depth-of-field image.
1.3 Advantages and Contributions
We analyze the advantages of our project from three aspects: depth map generation, the refocusing procedure, and the quantity of input parameters.
(a) Depth map generation:
There are many approaches to producing a depth map in the early vision area. Most of them are stereo matching methods: the user needs to prepare two images – left and right – with only a translation between them. The output is a disparity map based on the moving distance of each pixel between the two images; i.e., a large translation corresponds to objects nearer the camera, while a small movement corresponds to scene background far away from the camera.
Another type of depth map generation is to reconstruct a 3D scene from a large set of photos, which can be taken from very different viewpoints. Actually, to produce a coarse depth map there is no need to reconstruct the corresponding 3D scene; the information a depth map requires depends on where it is applied. Sometimes a rough depth value (the distance from the camera lens to the objects) of each pixel in the image is enough, while in other cases, especially full 3D scene reconstruction, an exact depth value (e.g., an (x, y, z) point representation) of each 3D point is necessary.
Our method is a tradeoff between the previous two approaches. We use several images to generate a usual depth map instead of a full 3D reconstruction. First, users do not need to shoot a large set of photos; 5-7 images are enough. Second, the theory of multi-view scene reconstruction is simple: generate an appropriate energy function for the given problem and apply an efficient energy minimization algorithm such as graph cuts to minimize it [33]. The details and our implementation will be described in a later chapter.

In a word, our depth map production has the advantages of needing less input and using a simple algorithmic idea, instead of the many steps of a 3D reconstruction approach, such as feature point extraction, structure from motion, or bundle adjustment. Another reason we do not apply a 3D reconstruction approach is that the information we require for this project is less than what is needed to reconstruct a 3D scene of the real world – a rough depth value for each pixel in the reference image is truly enough.
(b) Refocusing procedure:
One common approach to refocusing is to acquire a sequence of images with different focus settings. Based on a spatially varying blur estimate of the whole scene, where the information is taken from the focus differences, an all-focused image can be computed and then refocused. However, from the user's point of view, they need to take several photos under different focus settings, which is not so easy if they have little knowledge of photography. What if they do not know how to adjust the focus settings? In other words, some existing refocusing work requires at least two input images with very different properties: for example, one of the input photos has sharp objects in the foreground, while in the other image the background objects are sharp and the foreground is blurred. From the different blur degrees and regions, the program can then compute a relatively accurate blur kernel to accomplish the refocusing task. Notice that a camera with a depth-of-field effect is required for this kind of approach.
(c) Quantity of input parameters:
Compared to our method, most other existing refocusing methods need more input parameters. In [8], in order to estimate a depth map with a single camera (with a wide depth of field), the author projects a sparse set of dots onto the scene (with a shallow depth-of-field projector) and refocuses based on the resulting depth map.
Another type of method modifies the camera lens: various shapes of camera aperture are used, or a coded aperture is applied to a conventional camera [1]. With knowledge of the aperture shape, the blur kernel can be computed as well.
Our project does not require extra equipment like a projector or any camera modification. We do not need to take any photos with a shallow depth of field to estimate the blur kernel either. All users need to do is shoot several all-focused photos from slightly different shooting angles using a point-and-shoot digital camera. This makes it more convenient for ordinary people without any photography technique to obtain a final refocused photo. In our project, the only interaction between the user and the computer is the user's selection of the preferred region to be emphasized (sharpened).
Chapter 2
Related Work
According to the procedure of our project – depth map computation and image refocusing based on the depth information – the literature survey in this chapter is also divided into two categories: existing work on depth estimation and existing work on image refocusing.

In this chapter, we not only introduce the existing work related to our project, but also describe some key algorithms and their corresponding applications. These algorithms, such as stereo matching concepts, play an important part in our project as well, so introducing such concepts in detail is necessary.
2.1 Depth Estimation
We divide this part into four categories according to the required input, which in my opinion makes it much clearer to compare the various methods.
2.1.1 Single Image Input
Most approaches for obtaining a depth map from a single input image require additional equipment, such as a projector, or device modifications, such as changing the shape of the camera aperture. In [8], the author uses a single camera (with a wide depth of field), and the depth value computation is based on the defocus of a sparse set of dots projected onto the scene (using a narrow depth-of-field projector). With the help of the projected dots and a color segmentation algorithm, an approximate depth map of the scene with sharp boundaries can be computed for the next step. Figure 2.1 shows an example from [8]: (a) is the image acquired from the single input image and projector; (b) is the depth map computed from the information provided by (a). The produced depth map has very sharp and accurate object boundaries. However, it cannot handle the partial occlusion problem; i.e., we are not able to see the region of the man behind the flower in the depth map.
Figure 2.1 – example result from [8]
In [1], the authors use a single image capture and a small modification to the traditional camera lens – a simple piece of cardboard suffices. On the condition that the shape of the lens aperture is already known, the corresponding blur kernel can be estimated, and thus deconvolution can be applied to the blurry part of the image in order to recover an all-focused final image (in the refocusing step). The output of this method is a coarse depth map, which is sufficient for the refocusing phase in most applications.
2.1.2 Stereo Matching – Two Images Input
Stereo matching is one of the most active research areas in computer vision, and it serves as a significant intermediate step in many applications such as view synthesis, image-based rendering, and 3D scene reconstruction. Given two images/photos taken from slightly different viewpoints, the goal of stereo matching is to assign a depth value to each pixel in the reference image, where the final result is represented as a disparity map.

Disparity indicates the difference in location between two corresponding pixels and is also considered a synonym for inverse depth. The most important first step is to find the corresponding pixels that refer to the same scene point in the given left and right images. Once these correspondences are known, we can determine how much displacement was produced by the camera movement.

Since it is hard to identify corresponding pixels under arbitrary camera motion, the image pair is commonly rectified to a pure horizontal translation to simplify the search for correspondences, so that the stereo problem is reduced to a one-dimensional search along corresponding scan lines. Therefore, we can simply view the disparity value as the offset between x-coordinates in the left and right images (Figure 2.2). Objects nearer to the viewpoint have a larger translation, while farther ones move only slightly.
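As a brief aside (a standard relation I am adding, not stated explicitly in this section): for a rectified pair with baseline B (the distance between the two camera centers) and focal length f, the disparity d and depth Z of a point are related by

Z = f · B / d,

so a large disparity means a point close to the camera and a small disparity means a distant one. This is why disparity is commonly treated as inverse depth.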
Figure 2.2 - The stereo vision is captured in a left and a right image
An excellent review of stereo work can be found in [4], which presents a taxonomy and comparison of two-frame stereo correspondence algorithms.
Stereo matching algorithms generally perform (subsets of) the following three steps [4]:

a. matching cost computation;

My own comparison and implementation focus only on pixel-based matching costs, which is enough for this project. The most common pixel-based methods are squared intensity differences (SD) [14, 18, 12, 7] and absolute intensity differences (AD) [30]. SD and AD are computed between single pixels of the given left and right images.

b. cost (support) aggregation;

Cost aggregation is usually a window-based (local) method: it aggregates the matching cost by summing or averaging over a support region. For commonly used stereo matching, a support region is a two-dimensional area, often a square window (3-by-3, 7-by-7). The cost aggregation for each pixel in an image can then be calculated over such a region.

c. disparity computation / optimization.

This step can be separated into two classes: local methods and global methods. Local methods usually perform a local "winner-take-all" (WTA) optimization at each pixel [40]. For global optimization methods, the objective is to build a disparity function that minimizes a global energy. Such an energy function includes a data term and a smoothness term, which represent the energy distribution of each pixel in the two stereo images. Given the energy function, several minimization algorithms such as belief propagation [19], graph cuts [28, 34, 17, 2], dynamic programming [20, 31], or simulated annealing [26, 11] can be used to compute the final depth map. A small sketch combining the three steps follows this list.
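To make steps (a)-(c) concrete, here is a minimal local block-matching sketch (my own illustration, not the thesis implementation; it assumes a rectified grayscale pair and uses SciPy's uniform_filter as the box filter):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def block_matching_disparity(left, right, max_disp=32, window=7):
    """Naive local stereo: SD matching cost (step a), square-window
    aggregation (step b), winner-take-all selection (step c).
    left, right: rectified grayscale images as 2D float arrays."""
    h, w = left.shape
    costs = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # (a) per-pixel squared intensity difference at disparity d
        sd = np.zeros((h, w))
        sd[:, d:] = (left[:, d:] - right[:, :w - d]) ** 2
        # (b) aggregate the cost over a square support window
        agg = uniform_filter(sd, size=window)
        agg[:, :d] = np.inf          # columns with no valid match
        costs[d] = agg
    # (c) winner-take-all: lowest aggregated cost wins at each pixel
    return np.argmin(costs, axis=0)
```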
2.1.3 Multi-View scene reconstruction - Multiple Images Input
A 3D scene can be roughly reconstructed from a small set of images (about 10 pictures). The result of this type of method is usually a dense depth map similar to those of stereo matching. However, stereo matching has difficulty dealing with partially occluded regions because only two input images provide too little information. With multiple images as input, there is enough information to produce a relatively exact result, and partial occlusion may also be resolved. In [Multi-camera; asymmetrical], the energy function has more than two terms, i.e., a data term, a smoothness term, and a visibility term; the visibility term is used to handle the partial occlusion problem. Besides, [13, 25, 23] not only take advantage of multiple images as input, but also build a new type of data structure as output – a layered depth map, which can represent the layered pattern of the real 3D scene more clearly. The whole scene is divided into several planes, each representing a distance of objects from the camera. Figure 2.3 shows an example of a layered depth map [23]. This new representation deals with partial occlusion better, and most importantly, it is more convenient and exact for the next step – refocusing.
Figure 2.3 – (a) one of the input sequence; (b) the recovered depth map; (c) the separated layers.
2.1.4 3D World Reconstruction from a large dataset
As in [16], the input is a large dataset where the photos are taken from very different views. We can therefore obtain plenty of information, including the camera position and affine matrix of each photo, and finally compute the exact depth value of each pixel in the real world. Given each pixel's x, y, z values for an image (the z value is the one computed from the large dataset), we can easily reconstruct the 3D scene of the image, and of course the computed depth map is much more accurate than those produced by stereo matching or multi-view scene reconstruction (see Figure 2.4). The result of 3D reconstruction comprises a large number of sparse points. For people who would like to see a rough sketch of a certain building, a set of sparse points is enough. To obtain a dense map, estimation over the regions without known points, or triangulation, may be performed in order to connect the discrete points together.
(a) large dataset as input (b) 3D reconstruction result
Figure 2.4 – example from [16]
2.2 Image Refocusing
In our project, this part builds on the preceding depth estimation phase: all our research and implementation of refocusing are based on the single dense depth map we have produced. Therefore, when introducing related work in image refocusing, we describe only the refocusing part of each paper; i.e., we presume that the depth map (single or layered) has already been provided and focus on how to blur or sharpen the original image.
One approach blends a foreground-focused image with a background-focused image within the boundary region; the authors combine their depth map with a matting matrix computed from the depth estimation refinement to produce a better result. [1] is another approach that uses blur kernel estimation to fulfill the refocusing work. The input is a single blurred photograph taken with the modified camera (which yields both depth information and an all-focus image), and the output is coarse depth information together with a normal high-resolution RGB image. Therefore, to reconstruct the original sharp image, the correct blur scale of the observed image has to be identified with the help of the modified camera aperture shape. The process is based on a probabilistic model that finds the maximum likelihood of a blur-scale estimation equation. To summarize, it follows the general convolution equation y = f_k * x, where y is the observed image, x is the sharp image to be recovered, and the blur filter f_k is a scaled version of the aperture shape [1]. The defocus step of this method is to find the correct kernel f_k with the help of the known coded aperture information.
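As a rough illustration of recovering x once a kernel f_k has been chosen (my own sketch of generic Wiener deconvolution in the frequency domain; the actual method in [1] uses a probabilistic blur-scale search, which is not reproduced here):

```python
import numpy as np

def wiener_deconvolve(y, kernel, snr=1e-2):
    """Recover an estimate of the sharp image x from y = kernel * x.
    y: blurred grayscale image (2D array); kernel: blur filter (2D array).
    snr: regularization constant standing in for the noise-to-signal ratio."""
    # pad the kernel to the image size and move its center to (0, 0)
    k = np.zeros_like(y, dtype=float)
    kh, kw = kernel.shape
    k[:kh, :kw] = kernel
    k = np.roll(k, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    K = np.fft.fft2(k)
    Y = np.fft.fft2(y)
    # Wiener filter: divide by K where it is large, damp where it is small
    X = Y * np.conj(K) / (np.abs(K) ** 2 + snr)
    return np.real(np.fft.ifft2(X))
```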
Another kind of method produces a layered depth map that divides the whole scene into several parts, each with a certain depth value. These separated layers are treated as individual parts, which is certainly a solution for avoiding missing regions between objects lying in different scene layers. Given a layered depth map, the only thing we need to do is blur each layer according to the depth value of that individual layer, without considering the other layers.
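A minimal sketch of this per-layer blurring idea (my illustration, assuming OpenCV is available; a real pipeline also has to handle layer boundaries and partial occlusion, which this ignores):

```python
import cv2
import numpy as np

def refocus_layers(image, depth_map, focus_label, blur_per_step=2):
    """Blur each depth layer by an amount growing with its distance
    from the focused layer, then composite the layers back together.
    image: BGR input; depth_map: integer layer label per pixel."""
    result = np.zeros_like(image)
    for label in np.unique(depth_map):
        # blur strength grows with distance from the focused layer
        dist = abs(int(label) - int(focus_label))
        if dist == 0:
            layer = image  # the selected layer stays sharp
        else:
            ksize = 2 * blur_per_step * dist + 1  # odd kernel size
            layer = cv2.GaussianBlur(image, (ksize, ksize), 0)
        mask = depth_map == label
        result[mask] = layer[mask]
    return result
```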
A discussion of light field information in the depth estimation area can be found in [32].
I also introduce some concepts related to the refocusing part of the project. The basic theories and relationships among the common camera parameters are described in section 3.5. Besides, several different methods (applications) on the topics of refocusing, defocusing, and depth of field are introduced for a direct comparison with our method.
3.1 Standard Energy Function in Vision
In early vision problems, label assignment can be explained as follows: every pixel must be assigned a label in some finite set L. In image restoration, the assignments represent pixel intensities, while for stereo vision and motion, the labels are disparities. Another example is image segmentation, where labels represent the pixel values of each segment. The goal of label assignment is to find a labeling f that assigns each pixel p a label f_p ∈ L, where f is both piecewise smooth and consistent with the observed data. In the general energy function formulation, such vision problems can be naturally expressed as
E(f) = E_smooth(f) + E_data(f)    (1)
where E_smooth(f) measures the extent to which f is not piecewise smooth, while E_data(f) measures the disagreement between f and the observed data [39]. Typically, in many papers and applications, the form of E_data(f) is defined as

E_data(f) = ∑_{p ∈ P} D_p(f_p),

where D_p measures how well label f_p fits pixel p given the observed data. In the stereo vision problem, D_p(f_p) is usually (I_left(p) − I_right(p + f_p))², where I_left and I_right are the pixel intensities at the two corresponding points in the given left and right images, respectively. In the image restoration problem, D_p(f_p) is normally (f_p − I_p)², where I_p is the observed intensity of p.
While the data term is easier to define and apply to different practical problems, the choice of smoothness term is a very important and critical issue in current research. It directly decides whether the final result is optimal, and the form of this term usually depends on the application. For example, in some approaches E_smooth(f) makes f smooth everywhere, according to the demands of the algorithm. For many other applications, E_smooth(f) has to preserve object boundaries as clearly as possible, which is often referred to as discontinuity preserving. For image segmentation and stereo vision problems, object boundaries are a necessary factor to consider first. The Potts model (described in section 3.3.1.1) is also a popularly used type of smoothness term. From the discussion of the data and smoothness terms, we consider energies of the form

E(f) = ∑_{p ∈ P} D_p(f_p) + ∑_{(p,q) ∈ N} V_{p,q}(f_p, f_q),

where N is the set of pairs of neighboring pixels. Normally, N is composed of adjacent pixels (i.e., left and right, top and bottom, etc.), but it can contain arbitrary pairs as well, depending on the problem requirements. Most applications only consider V_{p,q} under pairwise interactions, since pixel dependence and interaction usually occur between adjacent pixels.
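As a concrete reading of this energy (a minimal sketch I am adding, with a squared-difference data term and a Potts smoothness term on a 4-connected grid; not code from the thesis):

```python
import numpy as np

def energy(f, observed, lam=1.0):
    """Evaluate E(f) = sum_p D_p(f_p) + sum_{(p,q) in N} V(f_p, f_q)
    on an image grid. f, observed: 2D arrays of the same shape.
    D_p is a squared difference; V is a Potts penalty of weight lam."""
    # data term: disagreement between the labeling and the observed data
    e_data = np.sum((f.astype(float) - observed.astype(float)) ** 2)
    # smoothness term: Potts penalty over 4-connected neighbor pairs
    e_smooth = lam * (np.count_nonzero(f[:, 1:] != f[:, :-1])
                      + np.count_nonzero(f[1:, :] != f[:-1, :]))
    return e_data + e_smooth
```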
3.2 Optimization Methods
Finding a minimum value of a given energy function is a typical optimization problem. In this section, I first give an introduction to and motivation for optimization techniques, then focus on an efficient optimization method – graph cuts.
3.2.1 Introduction to Optimization
In mathematics and computer science, optimization, or mathematical programming, refers to choosing the best element from some set of available alternatives. In the simplest case, optimization means solving problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or integer variables from within an allowed set. This formulation, using a scalar, real-valued objective function, is probably the simplest example. Generally, optimization means finding the "best available" values of some objective function over a defined domain, covering a variety of types of objective functions and domains.
Greig et al. [3] were the first to use powerful min-cut/max-flow algorithms from combinatorial optimization to minimize certain typical energy functions in computer vision. Combinatorial optimization is a branch of optimization in which the feasible solutions are discrete or can be reduced to the discrete case, and the goal is to find the best possible solution – most of the time, an approximate one. Approximation algorithms run in polynomial time and find a solution that is "close" to optimal.
For a given energy function, in general, a labeling f is a local minimum of the energy E if E(f) ≤ E(f′) for any labeling f′ within a single move of f.
The graph cuts algorithm is one of the most popular optimization algorithms in the current related research areas. It can rapidly compute a local minimum with relatively good results. Figures 3.1-3.3 show some examples produced by graph cuts; the object boundaries are quite clear, which fits the requirements of image segmentation and stereo vision.

Since the graph cuts algorithm is the major topic discussed in this paper, I will introduce it in detail later.
Figure 3.1 - Results of color segmentation on the Berkeley dataset by using graph cuts
Figure 3.2 - Results of texture segmentation on the MIT VisTex and the Berkeley dataset by using graph cuts
Figure 3.3 – Disparity map of stereo vision matching using graph cuts
3.3.1.1 Metric and Semi-Metric
V is called a metric on the label space L if it satisfies

(a) V(α, β) = 0 ⇔ α = β,
(b) V(α, β) = V(β, α) ≥ 0,
(c) V(α, β) ≤ V(α, γ) + V(γ, β)

for any labels α, β, γ ∈ L. If V satisfies only (a) and (b), it is called a semi-metric.
For example, the Potts model V(α, β) = T(α ≠ β) is a metric, where T(·) is 1 if its argument is true and 0 otherwise. The Potts model encourages labelings consisting of several regions where pixels in the same region have equal labels [4]. The discontinuity-preserving results produced by this model are also called piecewise constant, which is widely used in segmentation and stereo vision problems.

Another type of model is called piecewise smooth. The truncated quadratic V(α, β) = min(K, |α − β|²) is a semi-metric, while the truncated absolute distance V(α, β) = min(K, |α − β|) is a metric, where K is some constant. The role of the constant K is to cap the discontinuity penalty imposed by the smoothness term in the energy function. These models encourage labelings consisting of several regions where pixels in the same region have similar labels.
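The three interaction models above are easy to state in code (a small sketch I am adding for concreteness):

```python
def potts(a, b):
    """Potts model: metric, encourages piecewise-constant labelings."""
    return 0 if a == b else 1

def truncated_quadratic(a, b, K=16):
    """Semi-metric: quadratic growth capped at K, piecewise smooth."""
    return min(K, (a - b) ** 2)

def truncated_absolute(a, b, K=4):
    """Metric: absolute distance capped at K, piecewise smooth."""
    return min(K, abs(a - b))
```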
3.3.1.2 Graph denotation and construction
Let 〈 〉 be a weighted graph It consists of a set of nodes V and a set of edges that connect them The set of nodes has several distinguished vertices which are called the terminals
In the context of vision problem, the nodes normally correspond to pixels, voxels or other types
of image features and terminals correspond to the set of labels which can be assigned to each pixel in the image For simplification, I will only focus on the case of two terminals (i.e two labels to be assigned) Usually the two terminal nodes can also be called source node and sink node The multiple terminals problem can be naturally extended from the two-label case In Figure 3.4, a simple example of a two terminal graph is shoed This graph construction can be used on a 3 x 3 image with two to-be-assign labels
For the edges connected between different nodes, a t-link is an edge that connects terminal nodes (source and sink) to image pixel nodes, while an n-link is an edge that connects two image nodes within a neighborhood system
Figure 3.4 – Example of a constructed graph. A similar graph cuts construction was first introduced in vision by Greig et al. [3] for binary image restoration.
A cut C is a set of edges such that the terminals are separated in the induced graph G(C) = ⟨V, E \ C⟩. After the cut, one subset of nodes belongs to the source terminal, while the other subset is assigned to the sink terminal. The cost of the cut C, denoted |C|, equals the sum of the edge weights in C. Figure 3.5 shows a typical cut on the constructed graph, represented as a green dotted line.
The minimum cut problem is to find the cut with the lowest cost among all cuts separating the terminals.
Figure 3.5 – Example of a cut on the constructed graph
3.3.1.3 Minimizing the Potts energy is NP-hard
The details of the NP-hardness proof will not be described here. All we have to know is that a polynomial-time method for finding an optimal configuration f* would provide a polynomial-time algorithm for finding the minimum cost multi-way cut, which is known to be NP-hard [38].

3.3.2 Details of Graph Cuts
… using labeling (b) as an initial labeling estimate [39].
3.3.2.2 Definition of Swap and Expansion Algorithms
For the α-β swap: given a pair of labels α and β, an α-β swap is a move from an old labeling f to a new labeling f′. If the pixel labeling difference between the before and after states leads to a decrease of the energy function, we say the α-β swap succeeds and continue to the next iteration. In other words, in an α-β swap some pixels that were labeled α change to label β, and some pixels that were labeled β change to label α.

For the α-expansion: given a label α, an α-expansion move is also a move from an old labeling f to a new labeling f′, in which some pixels that were not assigned label α become assigned to label α.

Figure 3.7 shows the α-β swap and α-expansion algorithms respectively. We call a single execution of Steps 3.1-3.2 an iteration, and an execution of Steps 2, 3, and 4 a cycle. In each cycle, an iteration is performed for every label α, or for every pair of labels α and β. The algorithms continue until they cannot find any successful labeling change. It is easy to see that a cycle of the α-β swap algorithm takes |L|² iterations, while a cycle of the α-expansion algorithm takes only |L| iterations.
Figure 3.7 – Overview of α-β swap algorithm (top) and α-expansion algorithm (bottom)
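In code form, the outer loop of the expansion algorithm looks roughly like this (a schematic I am adding; `expansion_move` stands for the graph construction and min-cut solve for one label, which is assumed rather than shown):

```python
def alpha_expansion(f, labels, energy, expansion_move):
    """Iterate expansion moves until no move lowers the energy.
    f: initial labeling; labels: the finite label set L;
    energy(f): evaluates E(f); expansion_move(f, a): returns the
    best labeling within one alpha-expansion of f (via min-cut)."""
    best = energy(f)
    improved = True
    while improved:            # one pass over all labels is a "cycle"
        improved = False
        for a in labels:       # one expansion per label is an "iteration"
            f_new = expansion_move(f, a)
            e_new = energy(f_new)
            if e_new < best:   # keep the move only if the energy drops
                f, best = f_new, e_new
                improved = True
    return f
```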
Given an initial labeling f and either a pair of labels α and β (swap algorithm) or a label α (expansion algorithm), we want to find a new labeling f′ that minimizes the given energy.
3.3.2.3 Swap
Any cut leaves each pixel in the image with exactly one t-link, which means that the result of a cut determines the labeling f′ of every pixel. Viewed another way, a cut can be described as follows: a pixel p is assigned label α when the cut C separates p from the terminal α; similarly, p is assigned label β when the cut C separates p from the terminal β. If p is not chosen to have its label changed, its original label f_p is kept.
Lemma 3.1: A labeling f^C corresponding to a cut C on the constructed graph is one α-β swap away from the initial labeling f.

Lemma 3.2: There is a one-to-one correspondence between cuts C on the constructed graph and labelings that are one α-β swap from f. Moreover, the cost of a cut C on the graph is |C| = E(f^C) plus a constant.

Corollary 3.1: The lowest-energy labeling within a single α-β swap move from f is f̂ = f^C, where C is the minimum cut on the constructed graph.
3.3.2.4 Expansion

For the expansion algorithm, a cut can be described as: a pixel p is assigned label α when the cut C separates p from the terminal α. If p is not chosen to change to label α, its original label f_p is kept.

Lemma 3.3: A labeling f^C corresponding to a cut C on the constructed graph is one α-expansion away from the initial labeling f.
Lemma 3.4: There is a one-to-one correspondence between elementary cuts on the constructed graph and labelings within one α-expansion of f. Moreover, for any elementary cut C, we have |C| = E(f^C). With these lemmas, the corresponding corollary for the expansion move can finally be proved.
Figures 3.8 and 3.9 illustrate the results of the α-β swap and α-expansion algorithms.
Figure 3.8 – Examples of α-β swap and α-expansion algorithms
Figure 3.9 – Example of α-expansion. Leftmost is the input initial labeling; an expansion move is shown in the middle; and on the right is the corresponding binary labeling.
The window-based algorithms described in this survey may produce results with a number of errors, such as black holes or mismatches in the disparity map. Using the graph cuts algorithm, object boundaries can be clearly detected thanks to the choice of smoothness term, and the label assignment for the disparity map is obtained thanks to the data term.
3.4 Concepts of Stereo Matching
3.4.1 Problem Formulation
The basic idea of stereo matching is: for every pixel in one image, find the corresponding pixel in the other image. The authors of [35] refer to this definition as the traditional stereo problem. The goal is to find the labeling f: P → {0, 1, …, |L| − 1} that minimizes

E(f) = ∑_{p ∈ P} D_p(f_p) + ∑_{(p,q) ∈ N} V(f_p, f_q).

Again, D_p is the penalty for assigning a label to the pixel p; N is the neighborhood system composed of pairs of adjacent pixels; and V is the penalty for assigning different labels to adjacent pixels.
In the traditional stereo matching problem, the location movement of each pixel is along the horizontal or vertical direction. So if we assign a label f_p to the pixel p in the reference image I, the corresponding pixel in the matching image I′ should be p + f_p. The matching penalty D_p enforces photo-consistency, which is the tendency of corresponding pixels to have similar intensities. A possible form of D_p is D_p(f_p) = ‖I(p) − I′(p + f_p)‖².
Trang 3333
For the smoothness term, the Potts model is usually used to impose a penalty when f_p ≠ f_q. The natural form of the Potts model for the smoothness term is V(f_p, f_q) = λ · [f_p ≠ f_q], where the indicator function [·] is 1 if its argument is true and 0 otherwise [35]. The terms D and V can be viewed as lookup tables of sizes |P| × |L| and |L| × |L|, respectively: for stereo with the intensity difference as data term and the Potts model as smoothness term, D holds the per-pixel intensity differences for each candidate label, and V holds the Potts penalties for each label pair.
A more efficient and fast implementation of graph cuts uses a new min-cut/max-flow algorithm [36] to find the optimal cuts. Figure 3.10 shows some experimental results of two-view stereo matching with graph cuts. We can see that even for heavily textured images, graph cuts can still detect clear object boundaries and assign correct labels to the pixels.
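For a two-label case, such a min-cut/max-flow solve can be scripted in a few lines (an illustrative sketch assuming the open-source PyMaxflow library, not the implementation used in [36]):

```python
import maxflow

def binary_graph_cut(data_cost_0, data_cost_1, lam=1.0):
    """Two-label graph cut on an image grid.
    data_cost_0[p], data_cost_1[p]: D_p for labels 0 and 1 (2D arrays).
    lam: Potts weight placed on 4-connected n-links."""
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(data_cost_0.shape)
    g.add_grid_edges(nodes, lam)                        # n-links (Potts)
    g.add_grid_tedges(nodes, data_cost_0, data_cost_1)  # t-links (data costs)
    g.maxflow()                                         # solve min-cut/max-flow
    return g.get_grid_segments(nodes)                   # boolean label per pixel
```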
Figure 3.10 – Original images (left), results, and ground truth. The top row is the "lamp" data sequence and the bottom row is the "tree" data set.
3.4.2 Conclusion and Future Work
The binocular stereo vision area is a relatively mature field. A large number of two-view stereo matching algorithms have been proposed in recent years, and the results of most approaches turn out to be quite good, with not only clear object boundaries but also accurate disparity values. An evaluation of these stereo methods provided by Middlebury [5] weighs their advantages and disadvantages, which gives readers an overall understanding of the current trends in the stereo vision area.
However, the given data have been simplified to a pair of rectified images in which the corresponding points are easy to find (one only needs to search along the horizontal or vertical line), since the data sets offered by Middlebury are strictly offset by several pixels between the left and right images. Future work may focus on two or more casually taken images without such strict horizontal or vertical translation; handling slight rotation can be viewed as a challenge for improvement.
3.5 Photography
In this section, I will introduce a series of concepts closely related to this project. Camera parameters, post-processing techniques, and technical principles are included to help better understand our project. In section 3.5.1, some related basic theories are introduced to build a fundamental picture of photography, especially the camera. In section 3.5.2, I describe some pre-processing and post-processing techniques as well, such as focusing, refocusing, and defocusing methods.
3.5.1 Technical Principles of Photography
3.5.1.1 Pinhole Camera Model
A pinhole camera is a simple camera with a single small aperture and no lens to focus light. Figure 3.11 shows a diagram of a pinhole camera. This camera model is usually used as a first-order approximation of the mapping from a 3D scene to a 2D image, and it is the main assumption in our project (described later).
Figure 3.11 – A diagram of pinhole camera
The pinhole camera model describes the mathematical relationship between the coordinates of a 3D point and its projection onto the 2D image plane of an ideal pinhole camera. The reason we use such a simple form is that even though some effects, like geometric distortion or depth of field, are not modeled, they can still be compensated by applying suitable coordinate transformations on the image coordinates. Therefore, the pinhole camera model is a reasonable description of how a common camera with a 2D image plane depicts a 3D scene.
Figure 3.12 illustrates the geometry of the pinhole camera mapping.
Figure 3.12 – The geometry of a pinhole camera
A point R is located at the intersection of the optical axis and the image plane and is referred to as the principal point or image center.

A point P somewhere in the world, at coordinates (x1, x2, x3), represents an object in the real world captured by the camera.

The projection of point P onto the image plane is denoted Q. This point is given by the intersection of the projection line (green) and the image plane. Figure 3.13 shows the geometry of the pinhole camera viewed from the X2 axis, which better demonstrates how the model works in practice.
Figure 3.13 - The geometry of a pinhole camera as seen from the X2 axis
I apply this model to our project in the first step of computing the depth map. It is very useful when calculating the camera parameters and the partially occluded parts of the 3D scene. I will describe it in detail in a later chapter.
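In formula form (standard pinhole geometry, added here for concreteness): a camera-frame point (x1, x2, x3), with x3 the distance along the optical axis, projects to image coordinates (y1, y2) = (f · x1 / x3, f · x2 / x3) for focal length f. A one-line sketch:

```python
import numpy as np

def pinhole_project(points, f):
    """Project Nx3 camera-frame points to the image plane at focal length f.
    Assumes x3 > 0 (points in front of the camera); returns Nx2 coordinates."""
    points = np.asarray(points, dtype=float)
    return f * points[:, :2] / points[:, 2:3]   # (f*x1/x3, f*x2/x3)
```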
3.5.1.2 Camera Aperture

The aperture determines how much light reaches the sensor for a given shutter speed: a fast shutter speed requires a larger aperture to ensure sufficient light exposure, and a slow shutter speed requires a smaller aperture to avoid excessive exposure. Figure 3.14 shows two different sizes of a given camera aperture.
Figure 3.14 - A large (1) and a small (2) aperture
The lens aperture is usually specified as an f-number: the ratio of focal length to effective aperture diameter. A lower f-number denotes a greater aperture opening, which allows more light to reach the film or image sensor. Figure 3.15 illustrates some standard aperture sizes. For convenience, I will use this "f / f-number" form to represent the size of the aperture.
Figure 3.15 - Diagram of decreasing aperture sizes (increasing f-numbers)
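As a quick worked example (my own numbers, for illustration): the f-number is N = f / D, where f is the focal length and D the aperture diameter. A 50 mm lens with a 25 mm aperture opening is therefore at N = 50 / 25 = 2, written f/2; stopping down to f/8 shrinks the diameter to 50 / 8 = 6.25 mm, letting in far less light.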
I will not say more about the camera aperture here; it has a strong relationship with photography effects like depth of field, and in section 3.5.2 it will be discussed again in detail together with practical techniques.
3.5.1.3 Circle of Confusion (CoC)
In photography, the circle of confusion is also used to determine the depth of field. It defines how much a point needs to be blurred in order to be perceived as unsharp by human eyes. When the circle of confusion becomes perceptible to the human eye, we say that this area is outside the depth of field and therefore no longer "acceptably sharp" under the definition of DOF. Figures 3.16 and 3.17 illustrate how the circle of confusion relates to depth of field.
Figure 3.16 – The range of circle of confusion
Figure 3.17 – Illustration of circle of confusion and depth of field
Again, the relationship between the circle of confusion and depth of field will be further described in later sections.
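One common closed form (a standard thin-lens approximation I am adding as a preview; the relationship used by the thesis is developed later): for a lens of focal length f focused at distance S1, a point at distance S2 produces a blur circle of diameter

c = A · (|S2 − S1| / S2) · (f / (S1 − f)),

where A is the aperture diameter (A = f/N for f-number N). The blur grows with aperture size and with the subject's distance from the focus plane, which is exactly the behavior a synthetic depth-of-field renderer has to imitate.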
3.5.1.4 Depth of Field

Figure 3.18 - The area within the depth of field appears sharp, while the areas in front of and behind the depth of field appear blurry.
In general, depth of field does not change abruptly from sharp to unsharp, but instead appears as a gradual transition (Figure 3.19).
Figure 3.19 – An image with very shallow depth of field, which appears as a gradual transition (from blurry to sharp, then to blurry again).