
A 3D panoramic image guiding system design and implementation for Minimum invasive surgery




Special thanks go to Dr Kai Che Jack Liu and Dr Wayne Shih Wei Huang from IRCAD Taiwan/AITS, Taiwan, for enabling me to conduct in-vivo animal experiments. Their collaboration on the medical aspect of our research project was a great source of motivation.

I am also thankful to friends in my laboratory for the great time spent together and especially to Thang, Toan, Tuan, and Tai for their suggestions and motivation

Last but not least, I want to express gratefulness to my family and especially my wife for their unlimited support and belief in me. Their continuous spiritual support was the motivation for me to complete the work of this thesis.


Abstract

Minimally invasive surgery (MIS) is gradually replacing traditional surgical methods because of its advantages, such as causing less injury, preventing unsightly surgical scars, and resulting in faster recovery time. In MIS, the surgeon performs surgery by observing images transmitted from the endoscope. Therefore, there are three significant challenges in MIS: a limited field of view (FOV), a lack of depth perception, and viewing-angle control during surgery.

This thesis aims to explore how to solve these challenges using only images provided by the endoscopic camera, without requiring additional devices in the operating room. We propose a 3D Panoramic Image Guiding System for MIS to provide surgeons with a broad, optimal, and stable view field with Focus-Area 3D Vision in the surgical area. We have designed a new endoscope that consists of two endoscopic cameras attached to the tip of one tube. With the two-view images captured simultaneously from the two lenses, the proposed algorithm can combine the two cameras' FOVs into one larger FOV. The overlap area of the two cameras is also displayed in 3D space. Besides, our system can serve as a 3D measurement tool for endoscopic surgery. Finally, a surgical tool detection algorithm is proposed to evaluate surgical skills and control the camera position during MIS.

The experiments of the proposed system were performed on phantom-model images and in-vivo animal images. Experimental results confirm that our system is feasible and shows promise for improving existing limitations in laparoscopic surgery.

Keywords: minimally invasive surgery (MIS), image stitching, 3D reconstruction, surgical tool detection


Contents

Chapter 1 Introduction 1

1.1 Minimally Invasive Surgery 1

1.2 Computer Assisted Interventions 3

1.2.1 Image stitching 3

1.2.2 3D reconstruction 4

1.2.3 Surgical Tool Detection 5

1.3 Thesis Overview 6

1.3.1 Proposed Endoscope 6

1.3.2 Problem Description 7

1.3.3 Thesis Structure 8

1.4 Thesis Contributions 9

Chapter 2 Image Stitching in Minimally Invasive Surgery 10

2.1 Introduction 10

2.2 Features-based Image Stitching 11

2.2.1 Feature Detection 12

2.2.2 Feature Matching 14

2.2.3 Find homography matrix 14

2.2.4 Image Warping 15

2.2.5 Seam Estimation 15

2.2.6 Image Blending 16

2.3 Proposed Video Stitching 16

2.3.1 Accelerating Image Registration 17

2.3.2 Accelerating Image Composition 19

2.4 Results 20

2.4.1 Video-stitching Results 20

2.4.2 Run Time Estimation 21

2.4.3 Discussion 23

2.5 Conclusions 25

Chapter 3 3D Reconstruction in Minimally Invasive Surgery 26

3.1 Introduction 26

3.2 Proposed Stereo Reconstruction 27

3.2.1 Image Rectification 27


3.2.2 Disparity Map 31

3.2.3 Dense Reconstruction 33

3.3 Results 35

3.3.1 Experimental Description 35

3.3.2 Evaluation of the Disparity Map and 3D Reconstruction 36

3.3.3 Evaluation of Distance Measurement 39

3.3.4 Run Time Evaluation 40

3.4 Discussion 41

3.5 Conclusion 41

Chapter 4 Video Stitching in Minimally Invasive Surgery 42

4.1 Introduction 42

4.2 The Proposed Image-Stitching Algorithm 44

4.2.1 Image Registration 44

4.2.2 Image Compositing 46

4.3 The proposed Video Stitching Algorithm 46

4.3.1 Stitching Video at increased Speed 46

4.3.2 Increasing the Stability of the stitched video 48

4.4 Experimental Results 48

4.4.1 Video-stitching results 49

4.4.2 Comparison with the Previous Method 50

4.5 Discussion 53

4.6 Conclusions 54

Chapter 5 Surgical Tool Detection in Minimally Invasive Surgery 55

5.1 Introduction 55

5.2 Surgical Tool Detection 55

5.2.1 Dataset 56

5.2.2 Method 56

5.2.3 Results 58

5.3 Surgical Tool-Instance Segmentation 59

5.3.1 Dataset 59

5.3.2 Method 60

5.3.3 Results 62

5.4 Conclusions 63

Chapter 6 3D Panoramic Image Guiding System for Minimum Invasive Surgery 64

6.1 Introduction 64

6.2 Hardware 64

6.3 Software 65


6.3.1 Video Stitching 65

6.3.2 3D Image 65

6.3.3 Measurement 66

6.3.4 Tool Detection 66

6.3.5 Tool Tracking 66

6.3.6 Robot Control 66

6.4 Results 67

6.5 Conclusion 69

Chapter 7 Conclusion 70

7.1 Contributions 70

7.2 Limitations 71

7.3 Future Work 71


List of Figures

Figure 1.1: The open surgery procedure 1
Figure 1.2: (a) The minimal invasive surgery procedure (b) The set up for MIS 2
Figure 1.3: The robotic-assisted MIS 2
Figure 1.4: Examples of the image stitching from a moving camera: (a) Behrens et al [2], (b) Liu et al [6] and (c) Ali et al [4] 4
Figure 1.5: Example of the 3D reconstruction based on moving a monocular endoscope [12] 5
Figure 1.6: Example of surgical tool detection [23] 6
Figure 1.7: The proposed endoscope system consisted of two cameras, a mechanical tube, and a push-button. The figure depicts (a) the endoscopic cameras, (b) the geometric arrangement between the two cameras, (c) the primary state of the device and (d) the working state of the device 7
Figure 1.8: The schematic diagram of our endoscope system. The two images on the left side indicate the input images obtained from the two lenses. Through the USB ports on the PC in the center, three outputs can be derived by our algorithm. The three images at the right side indicate a window displaying a 3D image, a window showing an extended 2D view, and another window showing the instrument detection result 8
Figure 2.1: The combination of the two cameras' limited FOVs into a wider FOV 11
Figure 2.2: Flowchart of image-stitching process 11
Figure 2.3: Calculating the sum of pixel intensities inside any rectangular region will only require three additions and four memory accesses by using the integral image: ∑ = A - B - C + D 12
Figure 2.4: Left to right and top to bottom: the Gaussian second-order derivatives Lxx, Lyy, Lxy (top row) and their approximations Dxx, Dyy and Dxy (bottom row) 13
Figure 2.5: Conventional video-stitching algorithm 17
Figure 2.6: Proposed video-stitching algorithm 17
Figure 2.7: Overlap region during image stitching at frame t and frame t+1 (red), and ROI region during image stitching at frame t (left, yellow), and small region during image stitching at frame t+1 (right, yellow) 17
Figure 2.8: ROI of Frame-1. Four corners of Frame-2 are transformed into four points P1, P2, P3, and P4. The red rectangle is the rectangle surrounding Frame-2*'s edges and parallel to Frame-1. The ROI of Frame-1 is the intersection of Frame-1 with the red rectangle (green rectangle) 18


Figure 2.9: The image-stitching result (phantom model) The result expands the original FOV of the input image by 60% 20

Figure 2.10: The image-stitching result (animal experiment) The result expands the original FOV of the input image by 55% 20

Figure 2.11: Comparison of image registration times for conventional method (blue) and proposed method (green) on the CPU computer (a) and the computer with an additional GPU (b) 21

Figure 2.12: Comparison of seam estimation times for conventional method (blue) and proposed method (green) on the CPU computer (a) and the computer with an additional GPU (b) 22

Figure 2.13: Comparison of the stitched image for the conventional method and the proposed method: (a) input images; (b) matching feature points and (c) stitched image by conventional method; and (d) matching feature points and (f) stitched image by our method; (e) ground truth 23

Figure 2.14: Transformation of Frame-2 into Frame-2*: (a) quadrilateral and (b) non-quadrilateral 24

Figure 2.15: Images captured by four cameras 24

Figure 2.16: Result of image stitching of four input images (area expansion ratio is 300%) 24

Figure 3.1: 3D reconstruction of the cameras’ overlap by our endoscope 26

Figure 3.2: The stereo reconstruction algorithm There are three steps: 1 Image rectification, 2 Disparity calculation, 3 3D reconstruction 27

Figure 3.3: Pinhole camera model is used in this study 28

Figure 3.4: Radial distortion of the lens: (a) No distortion, (b) Positive distortion and (c) Negative distortion 29

Figure 3.5: Image rectification: (a) two input images, (b) two output aligned images 30

Figure 3.6: Disparity computation algorithm by stereoBM The sum of absolute differences (SAD) and Winner Takes All (WTA) are used for the disparity computation 31

Figure 3.7: Disparity map calculation algorithm consists of three steps: (1) Compute disparity map by StereoBM, (2) Compute WLS disparity map and confidence map by WLS, (3) Compute WLS-FBS disparity map by FBS 32

Figure 3.8: A stereo camera model 33

Figure 3.9: 3D reconstruction from ROI and disparity map 34

Figure 3.10: Phantom model datasets 35

Figure 3.11: In-vivo animal datasets 36

Figure 3.12: The qualitative evaluation results of the disparity map. Column 1: ROI image is the overlapped area of two input images. Column 2: Raw disparity map, computed by StereoBM. Column 3: WLS disparity map, filtered by WLS. Column 4: WLS-FBS disparity map, filtered by WLS+FBS 37


Figure 3.13: The qualitative evaluation results of the 3D reconstruction for four datasets. Col 1: Point cloud of raw disparity map. Col 2: Point cloud of WLS disparity map. Col 3: Point cloud of WLS-FBS disparity map 39
Figure 3.14: Comparison of the estimated distance with the actual distance. Each side of AC, AD, AE, AF, AG, AH and AK represents the estimated distance (yellow) and the actual distance (green) 39
Figure 3.15: Comparison of the estimated depth with the actual depth in the phantom model experiment 40
Figure 4.1: Proposed panoramic endoscope system 42
Figure 4.2: An illustrative example of the SURF-based stitching algorithm for MIS showing: (a) only a few matching features are obtained and distributed unequally and (b) error stitching 43
Figure 4.3: Two consecutive frames in the stitched video 43
Figure 4.4: Proposed image-stitching algorithm 44
Figure 4.5: The ROI-grid method: the corresponding point pairs are determined based on the disparity value. The ROI (dark yellow) is the region that is used to calculate disparity. The ROI is divided into a 9 × 24 grid and each grid's peak (P) is used to determine the corresponding point (Q) in the right rectified image 45
Figure 4.6: Decreasing the computing time that is required to stitch video using the downsizing technique (pink area) 47
Figure 4.7: Video stitching results from various samples for in-vivo animal trials. Left and Middle: Two input images captured from two endoscopic cameras. Right: The stitching results 49
Figure 4.8: Comparison of the stitching result for the SURF-based method and the proposed method. Left: SURF-based stitching result. Middle: The proposed stitching result. Right: Ground truth 51
Figure 4.9: Frame rate for both methods: SURF-based stitching (blue) and our stitching (orange) 52
Figure 4.10: The proposed method increases the FOV of the input image by 35% and is used to reconstruct the dense surface image of the overlapping area 53
Figure 4.11: Our endoscope is located about 2 cm from the surgical area. The proposed endoscope system can expand the FOV of input images by up to 188% 54
Figure 5.1: The surgical instrument detection based on CNN 55
Figure 5.2: The seven surgical tools used in cholecystectomy surgery (top row) and their location annotations (bottom row) 56
Figure 5.3: YOLO-based surgical tool detection 56
Figure 5.4: The YOLO architecture 57
Figure 5.5: The resized image is divided into an N × N grid. Each grid cell predicts (B) boxes with an objectness score (p0) and class scores (p1, p2, …, pC) 57
Figure 5.6: Examples of surgical tool detection using YOLO 58


Figure 5.7: Example of annotation for the instance segmentation of surgical tool using the VIA tool 60

Figure 5.8: Instance segmentation using Mask-RCNN 60

Figure 5.9: Mask R-CNN method 61

Figure 5.10: Mask R-CNN architecture 61

Figure 5.11: Sample detections from the ResNet-11-FPN model 62

Figure 6.1: 3DPIGMIS System 65

Figure 6.2: The 6 axis robot arm consists of 6 motors 66

Figure 6.3: The robot arm control algorithm: C(xc, yc) is the center of the overlapped image; T(xt, yt) is the location of the tip of the detected tool; 66 (mm) is the distance from the tool's tip to the left camera 67

Figure 6.4: Proposed system (a) real system and (b) control problem 68

Figure 6.5: Experimental results of our system on the phantom model (a) Two input images, (b) 3D image of the overlap area, (c) stitched image, (d) the coordinates in 3D space and the distance between them (e) the trajectory of the tool’s tip in 3D space 68


List of Tables

Table 1.1: Technical specifications of the endoscopic cameras 6

Table 2.1: Average times for image-registration step after 2000 frames 21

Table 2.2: Average times for image seam mask estimation step after 2000 frames 22

Table 2.3: Computational times for image stitching with/without improvements 22

Table 3.1: The parameters used in the disparity map evaluation 38

Table 3.2: The average computational time (ms) for the four datasets 41

Table 4.1: The percentage of stitchable frame of SURF and our method on the same dataset 50

Table 4.2: Alignment error of SURF and our method on the same dataset 52

Table 4.3: The computational time of SURF and the proposed method 52

Table 5.1: The annotated image number for each surgical tool 56

Table 5.2: Tool detection results 59

Table 5.3: Average detection precision (%) for all surgical tools (mAP) 59

Table 5.4: Number of annotated frames for instance segmentation of surgical tool 59

Table 5.5: Segmentation instance results on validation data 62


Chapter 1 Introduction

1.1 Minimally Invasive Surgery

There are two approaches to surgery. The first is open surgery, in which the surgeon makes an incision large enough for direct observation and manipulation (see Fig 1.1). This direct access gives the surgeon tactile cues to recognize and locate vessels, tubes, tumors, and other irregularities. Besides, the direct view also gives information about the shape, size, and position of the organs in the patient's body.

Figure 1.1: The open surgery procedure

The second approach is called minimally invasive surgery (MIS) or laparoscopic surgery. In MIS, the surgeon uses a variety of techniques to operate while ensuring that the patient is subjected to as few incisions as possible. MIS is a highly successful modern surgical method. It involves inserting the surgical instruments and an endoscopic camera into the patient's body through a small incision or a natural orifice (see Fig 1.2). The surgeon performs the surgery through an indirect view from the images provided by the endoscopic camera in the operating room.

There are two methods used in MIS: traditional laparoscopy and robotically assisted laparoscopy (see Fig 1.3). In robotic-assisted MIS, surgeons work from a console to control four robotic arms. Computer software translates the surgeon's hand movements into very accurate instrument movements. Furthermore, the surgeon can also see high-resolution 3D images on the console to perform the surgery.

Figure 1.2: (a) The minimally invasive surgery procedure (b) The setup for MIS

Figure 1.3: The robotic-assisted MIS

MIS is gradually replacing traditional surgical methods because of its advantages, such as causing less injury, preventing unsightly surgical scars, and resulting in faster recovery time. The difference between MIS and open surgery lies in their methods of observation and operation. In open surgery, a doctor can look directly at the operating area, has a wide field of view (FOV) and free viewing angles, and gets tactile feedback during the operation. However, the incision in MIS is only large enough for the instruments and the endoscope to pass through. Therefore, a doctor cannot look directly at the operating area and has to look at the image on a flat screen transmitted from the endoscope. The narrow view field of the endoscope makes it difficult for a doctor to see the full picture of the operation area. In traditional MIS, the display only provides 2D images, so it is also quite challenging to recognize the depth as well as the relative position of the organs with respect to the surgical instruments. Additionally, the endoscope is held by the assistant's hand, so sometimes the hand vibrates, causing incorrect and unstable viewing angles. This limitation increases surgical risks, making it challenging for less experienced doctors to perform a safe surgical operation.

Although robotic-assisted MIS systems, such as the DaVinci Si HD Surgical System, can provide stable 3D views of the surgical area, their adoption in hospitals is usually limited by their high cost.

1.2 Computer Assisted Interventions

The integration of image processing technology has become a part of computer-assisted interventions (CAI) for MIS. This has dramatically assisted the surgeon in addressing the challenges that arise during surgery.

1.2.1 Image stitching

In the development of image-guided surgery technologies, the issue of image registration has received the most attention, especially in gastroscopic surgery. Pioneers have tried to keep the images of the target organs fixed on the display screen and to expand the field of view (FOV) of the endoscope. Liu et al [6] built up a panoramic view for gastroscopy. They utilized a tracking device together with the image from a single-camera gastroscope and a dual-cubic projection method to create both local and panoramic views at the same time. In 2016, Hu et al [7] proposed a robust technique for image registration, named the Homographic Patch Feature Transform, for sequential gastroscopic images. Both methods make gastroscopy easier to handle and gastroscopic surgeries safer.

Thus, image mosaicing is gradually being applied in medical applications, and the feasibility and assistive nature of the technique with respect to clinical applications have been explored widely. However, image mosaicing is limited to the compositing of images with small fields of view, such as those of blood vessels and urethras. Moreover, in all the studies described above, the image sequences mosaiced were obtained from a single moving endoscopic camera (see Fig 1.4). This can yield only panoramic static images that do not reflect the changes that may occur in the shape of the organs or blood vessels being imaged outside the FOV. Therefore, it is difficult to apply the technique during laparoscopic surgery, wherein the position and shape of the internal organs change frequently.


Figure 1.4: Examples of the image stitching from a moving camera: (a) Behrens et al [2], (b) Liu et al [6] and (c) Ali et al [4]

1.2.2 3D reconstruction

In order to realize the depth information of the image, the 3D reconstruction technique was introduced. The essence of this technique is to map the available 2D image coordinates into 3D world coordinates. In human-made environments, 3D reconstruction using stereo images is a common approach for general problems [8, 9].

In the context of MIS, there are two approaches to this technique [10]. The first approach, used in traditional laparoscopy, is based on moving a monocular endoscope in order to reconstruct the 3D surface of the surgical area (see Fig 1.5). Three methods are commonly used to obtain depth information: Structure from Motion (SfM) [11], SLAM [12, 13], and Shape from Shading (SfS) [14]. However, a disadvantage of both SfM and SLAM is that the camera always needs to move to obtain 3D information. Moreover, the SfS method has the additional disadvantage that it is susceptible to specular highlights; therefore, it is difficult to obtain accurate depth information.

Figure 1.5: Example of the 3D reconstruction based on moving a monocular endoscope [12]

The second approach is Robot-Assisted Surgery, which is based on stereo endoscopes to give the surgeon depth perception. The principle of this approach is matching pixels between the left and the right image in order to calculate the depth information through triangulation. Several studies are based on this approach in MIS. For example, Stoyanov et al [15-17] presented real-time stereo reconstruction in robotically assisted MIS. Bernhardt et al [18] proposed a powerful approach for dense matching between the two stereoscopic camera views so as to produce a dense 3D reconstruction. Furthermore, a real-time dense GPU-enhanced surface reconstruction from stereo endoscopic images for intraoperative registration was also proposed in [19].

However, the 3D surface reconstruction of surgical endoscopic images is still an issue owing to certain challenges, such as the abundance of texture-less areas, occlusions introduced by the surgical tools, specular highlights, smoke, and blood produced during the interventions [20]. Hence, a few recent studies have focused on making the surface reconstruction more reliable, exact, and robust. For example, Penza et al [21] introduced a novel method to enhance dense surface reconstruction through disparity refinement based on the simple linear iterative clustering (SLIC) super-pixels algorithm. Recently, Wang et al [22] proposed some advanced techniques aimed at reconstructing the 3D liver surface based on the stereo vision technique. Besides, the use of stereo endoscopes is still not common practice in traditional MIS; thus far, their use is limited to robotic systems such as the da Vinci surgical system.

1.2.3 Surgical Tool Detection

Surgical tool detection is one of the essential tasks for analyzing the effectiveness of surgical operations [23]. For example, it helps prevent surgical instrument collisions by notifying the operator during MIS surgery [24].


There are many different approaches to surgical tool detection that have been published. Cai et al [25] used markers placed on the surgical tool and two infrared cameras for detection. Another approach is a radiofrequency-based method for laparoscopic instrument tracking in real time [26]. However, these approaches require modification of the tracked tool [27].

Figure 1.6: Example of surgical tool detection [23]

Therefore, vision-based approaches have been developed and published. These approaches rely on image features for surgical tool detection, such as color [28, 29], gradients [30], or texture [31]. Besides, there are also many studies based on convolutional neural networks (CNN). For example, Putra et al [32] were the first to use a CNN for multiple identification tasks on laparoscopic videos. Several studies [33-35] have been proposed in the tool presence detection challenge (M2CAI 2016) [36]. Jin et al [37] then developed this work by relying on Faster R-CNN to identify not only the phase but also to locate the tool's tip in cholecystectomy videos. However, these studies only address tool detection and have not yet been applied to control the position of the camera. Furthermore, these studies did not detect surgical tools in real time.

1.3 Thesis Overview

This thesis proposes a 3D Panoramic Image Guiding System for MIS. We first describe the proposed endoscope used in the thesis. We then present the research work and the problem requirements. Finally, the thesis structure and contributions are given.

1.3.1 Proposed Endoscope

Our endoscope consisted of two cameras, a push-button, and a mechanical tube (see Fig 1.7). The two cameras used in this thesis were 2.0 MP USB endoscope cameras (see Fig 1.7 (a)), the specifications of which are shown in Table 1.1.

Table 1.1: Technical specifications of the endoscopic cameras (USB endoscope camera, Shenzhen HuaSunTek Technology Inc.)


The mechanical tube had a diameter of 13 mm. Fig 1.7 (c) shows the primary state of our device, where the push-button had not yet been pushed down and the width of the gap between the two cameras was about 2 mm. In this state, our endoscope could be inserted into the patient's abdomen through a small hole about 15 mm in diameter. Fig 1.7 (d) shows the working state of our device, where the push-button was pushed down and the distance between the two cameras was 15 mm. In the working state of our device, the two cameras were placed parallel to each other, and the geometric arrangement between them was as shown in Fig 1.7 (b).


Figure 1.7: The proposed endoscope system consisted of two cameras, a mechanical tube, and a push-button. The figure depicts (a) the endoscopic cameras, (b) the geometric arrangement between the two cameras, (c) the primary state of the device and (d) the working state of the device.

1.3.2 Problem Description

This dissertation aims to simultaneously expand the endoscope's limited FOV, reconstruct 3D images, detect the surgical tools, and control the camera's position without the aid of an assistant or foot or voice controllers.


For the experiments, we first made a small incision of about 1.5 cm so that, in the primary state of the device, we could put our endoscope into the patient's abdomen. Next, we pushed the push-button downward in order to bring our device to the "working state". This state is used in all algorithms proposed in this thesis.

In the working state, our endoscope simultaneously captured the images inside the abdomen. Camera synchronization is possible in hardware by using an additional hardware trigger; however, our endoscope did not support this feature. This study only used the functions available in OpenCV in order to achieve acceptable synchronization. First, we grabbed the raw frames quickly using the grab() function of the VideoCapture class. Then, we used the retrieve() function to perform the heavier demosaicing, decoding, and Mat storage tasks. In this manner, both frames were captured from the two cameras at almost the same time.
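A minimal sketch of this software synchronization with OpenCV is shown below; the device indices, window names, and resolution handling are assumptions, not values from the thesis:

```python
import cv2

# Open the two endoscopic cameras (device indices are assumptions).
cap_left = cv2.VideoCapture(0)
cap_right = cv2.VideoCapture(1)

while True:
    # grab() only latches the raw frames, so calling it back-to-back
    # keeps the two captures close together in time.
    ok_l = cap_left.grab()
    ok_r = cap_right.grab()
    if not (ok_l and ok_r):
        break

    # retrieve() then performs the heavier demosaicing/decoding into Mat objects.
    _, frame_left = cap_left.retrieve()
    _, frame_right = cap_right.retrieve()

    # frame_left and frame_right now form an (almost) synchronized stereo pair.
    cv2.imshow("left", frame_left)
    cv2.imshow("right", frame_right)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break

cap_left.release()
cap_right.release()
cv2.destroyAllWindows()
```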

Then, with the two overlapping views captured from the two lenses, we propose an algorithm that can simultaneously create the panoramic image and the 3D image, and detect surgical tools for controlling a robot arm, thus moving the camera position to the appropriate location. All the results are displayed on a display screen, as shown in Fig 1.8.

Figure 1.8: The schematic diagram of our endoscope system. The two images on the left side indicate the input images obtained from the two lenses. Through the USB ports on the PC in the center, three outputs can be derived by our algorithm. The three images at the right side indicate a window displaying a 3D image, a window showing an extended 2D view, and another window showing the instrument detection result.

The task was performed under MIS conditions at near focus, and our endoscope could move to ensure the best observation for the surgeon. Besides, the proposed algorithm needs to ensure stability and fast processing time so that the system is feasible for practical applications.

1.3.3 Thesis Structure

This thesis is organized as follows. Chapter 2 introduces the feature-based image-stitching algorithm and improvements to the rate of video stitching; however, this method is still limited in terms of performance and quality in MIS. In Chapter 3, we propose a 3D reconstruction algorithm based on stereo vision theory. Based on the algorithm proposed in Chapter 3, Chapter 4 introduces a novel video-stitching method that is faster and more stable than the stitching method in Chapter 2. In Chapter 5, we propose an algorithm for detecting surgical tools in laparoscopic videos. In Chapter 6, we introduce the 3D Panoramic Image Guiding system based on the algorithms described in the previous chapters. Besides, we propose an algorithm to control the camera's position to the desired location. Finally, conclusions and suggestions for future work are presented in Chapter 7.

1.4 Thesis Contributions

The work of this thesis has contributed to the following publications: [38], [39], [40], [41], [42].


Chapter 2 Image Stitching in Minimally Invasive Surgery

2.1 Introduction

MIS is a highly successful modern surgical method that involves inserting the surgical instruments and an endoscopic camera into the patient's body through a small incision or a natural orifice. However, the limited field of view is the most challenging aspect of this surgical procedure. The narrow view of the endoscopic camera prevents the surgeon from imaging the entire surgical field with clarity. This can make the procedure difficult and increase the uncertainty involved. Hence, MIS is especially difficult for less-experienced surgeons.

In the field of image processing, the process of combining multiple overlapping images into a larger image is known as mosaicing or image stitching. Image-mosaicing methods can be classified into two categories: the direct method and the features-based method [43].

In the case of the direct method, all the available image data are used instead of a set of sparse features extracted from the images. Hence, the method can provide very accurate registrations. However, the initial estimation parameters must lie close to the true solution, and there must be a high degree of overlap between the images for convergence. The features-based method, on the other hand, does not require an initialization process, and algorithms that can match distinctive image features, such as the Scale Invariant Feature Transform (SIFT) [44, 45], Speeded-Up Robust Features (SURF) [46], and Oriented FAST and Rotated BRIEF (ORB) [47], are used for estimating the alignment parameters. Furthermore, the use of sparse features accelerates the estimation process and improves real-time performance. A comprehensive review of the literature on this method has been published by Szeliski [43].

Features-based image-stitching methods have been implemented for medical applications. For instance, Zanet et al [48] proposed a method for the automatic mosaicing of images from a retinal slit lamp using SURF. There have also been other studies on the mosaicing of retinal slit-lamp images [45, 49, 50]. Ji et al [51] demonstrated the fusion of images of local prostate lesions using SURF.

In MIS, there have been some studies based on this method [2, 3, 5, 52, 53]. In all these studies, the image sequences mosaiced were obtained from a single moving endoscopic camera. This can yield only panoramic static images that do not reflect the changes that may occur in the shape of the organs or blood vessels being imaged outside the field of view (FOV). Therefore, it is difficult to apply the technique during laparoscopic surgery, wherein the position and shape of the internal organs change frequently.


To increase the camera's limited viewing angle, our approach is to combine the two cameras' FOVs into a wider FOV using the features-based stitching method (see Fig 2.1). This approach can be extended with the use of multiple cameras. The stitched image represents every transformation that occurs in the FOVs of the two lenses mounted on the endoscope. This chapter introduces the features-based image-stitching algorithm for MIS. Subsequently, the proposed video stitching for our endoscope is presented and evaluated. The work in this chapter has been published in [38].

Figure 2.1: The combination of the two cameras' limited FOVs into a wider FOV

2.2 Features-based Image Stitching

The features-based image-stitching algorithm comprises two stages: image registration and image compositing, as illustrated in Fig 2.2.

Figure 2.2: Flowchart of image-stitching process


The image-registration stage has the following three steps: find the features, match these features, and then find the homography matrix. The purpose of these steps is to identify the coordinate relationship between the two source images. This stage is the most important one of the image-stitching process because it directly affects the correctness of the image-stitching results.

2.2.1 Feature Detection

This step comprises two tasks: the first is to detect the feature points and the second is to construct descriptors of these points. Feature points are the characteristic points based on which an object can be recognized within an image. Because feature points only provide the positions of distinctive elements, matching them across different images requires characterizing them based on the extracted feature descriptors (feature vectors). A feature descriptor represents a subset of the total pixels in the neighborhood of the feature point. There are many algorithms available for searching for feature points and extracting their descriptors, such as SIFT [44], SURF [46], and ORB [47]. The feature point search algorithm used in this study is SURF [46], which provides scale- and rotation-invariant feature descriptions. Therefore, changing the viewing angle and size scale of the image within certain limits will not affect the correctness of the match results.

SURF algorithm

SURF is an algorithm for detecting features that are invariant to scale and rotation. It uses an integral image, which makes the computation quite fast. The integral image I_Σ of an input image I is determined as follows:

$$I_{\Sigma}(x,y) = \sum_{i=0}^{x} \sum_{j=0}^{y} I(i,j)$$
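As a quick illustration (not taken from the thesis), the integral image and the three-addition box sum of Figure 2.3 can be sketched with NumPy as follows:

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """Cumulative sum over rows and columns: I_sigma(x, y) = sum of I[0..x, 0..y]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii: np.ndarray, top: int, left: int, bottom: int, right: int) -> float:
    """Sum of pixels in the inclusive rectangle using only a few lookups (the A - B - C + D idea)."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return float(total)

img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```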

The Hessian matrix H(u, s) at point u and scale s is defined as

$$H(u,s) = \begin{bmatrix} L_{xx}(u,s) & L_{xy}(u,s) \\ L_{xy}(u,s) & L_{yy}(u,s) \end{bmatrix}$$

where Lxx, Lxy, and Lyy are the convolutions of the Gaussian second-order derivatives with the image at point u.

Figure 2.4: Left to right and top to bottom: the Gaussian second-order derivatives Lxx, Lyy, Lxy (top row) and their approximations Dxx, Dyy and Dxy (bottom row)

SURF changes the size of the box filter to build the scale space of an image. The original default size of the filters is 9 × 9, which corresponds to the Gaussian derivative with s = 1.2. The filter size is then upscaled to 15 × 15, 21 × 21, 27 × 27, etc. Next, the determinant of the Hessian matrix is computed at each scale. Finally, non-maximum suppression in 3 × 3 × 3 neighbourhoods is employed for determining the maxima. SURF identifies the features at the locations and scales with the maximum values.

The SURF descriptor is formed by first building a square region of size 20s × 20s centred around the feature point at scale s. SURF then splits this region into smaller 4 × 4 subregions. Each subregion has a descriptor vector v = (∑dx, ∑dy, ∑|dx|, ∑|dy|), where dx and dy are the Haar-wavelet responses in the horizontal and vertical directions, respectively. When combined, this results in the 64-dimensional SURF descriptor.


2.2.2 Feature Matching

The next step is to match the feature points in two different images (feature matching). Matching features is the process of defining the similarity between two features in two separate images based on the Euclidean distance between the feature descriptors:

$$d(f^{1}, f^{2}) = \sqrt{\sum_{i=1}^{n} \left(f^{1}_{i} - f^{2}_{i}\right)^{2}}$$

where f1 and f2 are the two descriptor vectors of length n. Candidate matches are then selected following the nearest-neighbour matching strategy proposed in Ref [44].
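A minimal sketch of this detection and matching step with OpenCV is given below. SURF lives in the non-free opencv-contrib module, so ORB is used as a stand-in when SURF is unavailable; the file names and thresholds are assumptions:

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file names
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# SURF requires the non-free opencv-contrib build; fall back to ORB otherwise.
try:
    detector = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    norm = cv2.NORM_L2
except AttributeError:
    detector = cv2.ORB_create(nfeatures=2000)
    norm = cv2.NORM_HAMMING

kp1, des1 = detector.detectAndCompute(img1, None)
kp2, des2 = detector.detectAndCompute(img2, None)

# Brute-force matching with the ratio test to keep only distinctive matches.
matcher = cv2.BFMatcher(norm)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches")
```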

2.2.3 Find homography matrix

This step involves using the RANdom SAmple Consensus (RANSAC) algorithm [55] to remove the mismatched corresponding point pairs and subsequently estimate the homography matrix based on the remaining set of corresponding pairs. The homography matrix is a 3 × 3 matrix with 8 degrees of freedom (DoF), as shown below:

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix}$$

where hij represents the elements of the homography matrix.

The implementation details are described in the steps below.

ALGORITHM 2.1: Determine the homography matrix by the RANSAC algorithm


(1) RANSAC loop

(1.1) Choose four arbitrary matching pairs

(1.2) Estimate the exact homography matrix (H) for these pairs

(1.3) Identify the inliers {ui, vi}, i = 1, …, N, that satisfy d(ui, H·vi) < threshold

(2) Take the homography matrix (H) with the maximum number of inliers (N max )

(3) Re-compute the homography matrix by minimizing $\frac{1}{N_{\max}}\sum_{i=1}^{N_{\max}} d(u_i, H v_i)$ over the inliers

Assume that the probability of a correct feature match is pi; then the probability of obtaining an accurate homography matrix after n RANSAC trials is

$$p(H \text{ is correct}) = 1 - \left(1 - p_i^{\,r}\right)^{n} \qquad (2.6)$$

where r = 4 is the number of matching pairs sampled in each trial. Thus, RANSAC is able to find the homography matrix with high accuracy after a sufficiently large number of attempts.
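In practice, this whole step maps onto a single OpenCV call. A minimal sketch continuing from the matched keypoints of the previous sketch (the same variable names are reused; the 5.0 px threshold is an assumption) is:

```python
import numpy as np
import cv2

# src/dst points come from the ratio-test matches of the previous sketch.
src_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC rejects mismatched pairs; 5.0 px is the inlier reprojection threshold.
H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
print("inliers:", int(inlier_mask.sum()))
```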

After the image-registration step is complete, the image-compositing stage yields wide-angle images. The image-compositing process also has three steps: warping the images, finding the seam masks for the warped images, and blending them.

2.2.4 Image Warping

Once the homography matrix has been determined during the image-registration process, as described above, we use a perspective transformation to transform the two source images into two warped images in the same coordinate system such that they can be aligned to obtain a final composite image.
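A minimal sketch of this warping step, continuing from the homography H of the previous sketch and assuming the second image is mapped onto the plane of the first (the canvas width is a rough assumption), is:

```python
import numpy as np
import cv2

img1 = cv2.imread("frame1.png")   # hypothetical file names
img2 = cv2.imread("frame2.png")
h1, w1 = img1.shape[:2]

# Canvas large enough to hold both views; doubling the width is an assumption.
canvas_size = (w1 * 2, h1)

# Warp frame 2 onto frame 1's plane, then place frame 1 at its original position.
warped2 = cv2.warpPerspective(img2, H, canvas_size)
warped1 = np.zeros_like(warped2)
warped1[:h1, :w1] = img1
```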

2.2.5 Seam Estimation

Simply overlaying the two warped images can leave visible artifacts in the stitching results. This step is performed to find a seam that prevents the possibility of "ghosting" in the stitching results. In this study, we found the optimal seam for the images to be stitched using the graph-cut technique [56].

In this study, I0 and I1 denote the two warped images and X is the overlap area. Seam estimation assigns a label e ∈ {0, 1} to each pixel u in the overlap area, where "0" corresponds to I0 and "1" corresponds to I1.

An optimal seam is estimated by minimizing an energy function via graph cuts [56]:

where the Euclidean-metric color difference is defined as

$$d(u) = \left\| I_{0}(u) - I_{1}(u) \right\|_{2} \qquad (2.11)$$

2.2.6 Image Blending

After the images have been warped, a visible seam may remain in the output image owing to brightness differences between the two images. To solve this problem, we use a multi-band blending method [54] to effectively smooth out the stitching results.

Firstly, the blending weights for each image are initialized:

$$W^{i}_{\max}(\theta,\gamma) = \begin{cases} 1 & \text{if } W^{i}(\theta,\gamma) = \arg\max_{j} W^{j}(\theta,\gamma) \\ 0 & \text{otherwise} \end{cases} \qquad (2.12)$$

Then, the blending weights for each band are formed by successively blurring these max-weight maps.
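The thesis relies on the multi-band blender of [54]; a compact two-image Laplacian-pyramid sketch of the same idea (not the exact implementation) is given below. The mask would be the max-weight map of Eq. (2.12), and it is assumed to have the same shape as the images:

```python
import cv2
import numpy as np

def multiband_blend(img_a, img_b, mask, levels=4):
    """Blend two equally sized images with a 0..1 mask using Laplacian pyramids."""
    ga, gb, gm = [img_a.astype(np.float32)], [img_b.astype(np.float32)], [mask.astype(np.float32)]
    for _ in range(levels):
        ga.append(cv2.pyrDown(ga[-1]))
        gb.append(cv2.pyrDown(gb[-1]))
        gm.append(cv2.pyrDown(gm[-1]))

    blended = None
    for i in range(levels, -1, -1):
        # Laplacian band = Gaussian level minus upsampled coarser level (coarsest keeps the Gaussian).
        if i == levels:
            la, lb = ga[i], gb[i]
        else:
            size = (ga[i].shape[1], ga[i].shape[0])
            la = ga[i] - cv2.pyrUp(ga[i + 1], dstsize=size)
            lb = gb[i] - cv2.pyrUp(gb[i + 1], dstsize=size)
        band = la * gm[i] + lb * (1.0 - gm[i])
        if blended is None:
            blended = band
        else:
            size = (band.shape[1], band.shape[0])
            blended = cv2.pyrUp(blended, dstsize=size) + band
    return np.clip(blended, 0, 255).astype(np.uint8)
```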

2.3 Proposed Video Stitching

Video stitching is the stitching of images in a frame-by-frame manner. In this chapter, Frame-1 and Frame-2 are the labels used for the two images captured from the two endoscopic cameras during MIS. In this section, we propose a method for improving the rate of video stitching during MIS.

The conventional video-stitching algorithm is as follows: the two frames are the two input images. During the image-registration step, the algorithm searches for matching image features to determine the homography matrix. Next, the two input frames are transformed into two warped frames. The last step is to find a seam for the two warped frames and blend them. Fig 2.5 illustrates these processes with the related image results.

An important parameter with respect to video stitching is the execution time of the algorithm. For practical applications, the video-stitching process needs to run in real time. Because the computational time for image processing is proportional to the size of the image being processed, large or high-resolution images take a lot of time. Therefore, the proposed method aims to improve the video-mosaicing performance by improving both image registration and image compositing, as demonstrated in Fig 2.6.

The detailed processes are described in the subsections below.


Figure 2.5: Conventional video-stitching algorithm

Figure 2.6: Proposed video-stitching algorithm

2.3.1 Accelerating Image Registration

The purpose of image registration is to find corresponding point pairs between the two images, subsequently determine the homography matrix, and transform the two images onto the same plane and align them. Because good matching pairs appear in the overlap region of the two images, using two small regions containing the overlap region for matching increases the accuracy and also speeds up the search for the corresponding point pairs of the original two images.

Figure 2.7: Overlap region during image stitching at frame t and frame t+1 (red), and ROI region during image stitching at frame t (left, yellow), and small region during image stitching at frame t+1 (right, yellow)


During laparoscopic surgery, the movement of the two endoscopic cameras is not too fast. Hence, the position and size of the overlap area of the two current frames and the two previous ones are not very different. Furthermore, the previous frames' overlap is determined after the image-stitching process. Thus, the proposed technique uses the size and position of the overlap region of the two previous frames to determine the two small regions of the current frames for matching. This region is called the ROI. Fig 2.7 depicts how we define the ROI at frame t and the small region at frame t+1 during image stitching.

Define the region of interest (ROI) (see Fig 2.8)

After finding the homography matrix, we use a perspective transformation to transform Frame-2 into Frame-2* on the Frame-1 plane. Then, we define a rectangle that surrounds Frame-2*'s edges and is parallel to Frame-1 (the red rectangle). The ROI of Frame-1 is the intersection of Frame-1 with this rectangle (the green rectangle ABCD). In the same manner, the ROI of Frame-2 is determined by transforming Frame-1 onto the Frame-2 plane.

We assume that the four corners of Frame-2 after the transformation are P1(x1, y1), P2(x2, y2), P3(x3, y3), and P4(x4, y4). Hence, the ROI of Frame-1 is the region at position A(x0, y0) with a width AB and height AD in Frame-1.

These parameters are defined by Eqs. (2.15) and (2.16): A(x0, y0) is the top-left corner of the rectangle that bounds P1–P4 and lies within Frame-1, and AB and AD are the width and height of the intersection of Frame-1 with that rectangle.

Figure 2.8: ROI of Frame-1. Four corners of Frame-2 are transformed into four points P1, P2, P3, and P4. The red rectangle is the rectangle surrounding Frame-2*'s edges and parallel to Frame-1. The ROI of Frame-1 is the intersection of Frame-1 with the red rectangle (green rectangle).
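The exact equations (2.15) and (2.16) are not reproduced here; a minimal sketch of the same bounding-box construction with OpenCV, assuming H maps Frame-2 onto Frame-1's plane, is:

```python
import numpy as np
import cv2

def roi_of_frame1(H, frame1_shape, frame2_shape):
    """Bound the transformed corners of Frame-2 and clip the box to Frame-1."""
    h1, w1 = frame1_shape[:2]
    h2, w2 = frame2_shape[:2]
    corners = np.float32([[0, 0], [w2, 0], [w2, h2], [0, h2]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)  # P1..P4

    x0 = max(0, int(np.floor(warped[:, 0].min())))
    y0 = max(0, int(np.floor(warped[:, 1].min())))
    x1 = min(w1, int(np.ceil(warped[:, 0].max())))
    y1 = min(h1, int(np.ceil(warped[:, 1].max())))
    return x0, y0, x1 - x0, y1 - y0  # top-left corner, width, height
```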

The matching of the two ROIs can reduce the computational time if the two ROIs are reduced to a lower resolution (resized-ROI regions). For the downsizing of the images, the bilinear interpolation algorithm is used. The resized-scale value can be input manually to accelerate the image-registration process. Because the homography matrix consists of nine elements (including the final element, which is equal to 1), one needs at least four corresponding points to determine the matrix. Hence, the scale value is selected so as to ensure that the number of good matches is not less than 4. In many of the experimental cases, we chose the scale value such that the original images could be resized to a resolution of 320 × 240.

When the two resized-ROI regions are used for the matching operation, the feature points' coordinates in the image change. Therefore, the homography matrix for the two original images must be adjusted as well.

The algorithm for determining the homography matrix is as follows:

ALGORITHM 2.2: Determine the homography matrix by resized-ROI regions

Here, the homography matrix H is used to transform resized-ROI-2 onto the same plane as resized-ROI-1. Further, (x1, y1) and (x2, y2) are the coordinates of the top-left corners of ROI-1 and ROI-2, respectively, in the coordinate system, while k is the resized-scale value.
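The thesis's exact correction formula is not reproduced in this excerpt; the sketch below shows the standard coordinate adjustment under the assumption that each ROI is translated to its top-left corner and downscaled by the factor k before matching:

```python
import numpy as np

def homography_for_full_frames(H_resized, x1, y1, x2, y2, k):
    """Lift a homography found between the two resized ROIs back to the original frames."""
    def translate(tx, ty):
        return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=np.float64)

    scale = np.diag([k, k, 1.0])            # resized coords -> ROI coords (multiply by k)
    scale_inv = np.diag([1.0 / k, 1.0 / k, 1.0])

    # frame2 point -> ROI-2 -> resized-ROI-2 -> (H) -> resized-ROI-1 -> ROI-1 -> frame1 point
    H_full = translate(x1, y1) @ scale @ H_resized @ scale_inv @ translate(-x2, -y2)
    return H_full / H_full[2, 2]            # normalize so the last element equals 1
```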

2.3.2 Accelerating Image Composition

After the image-registration process has been completed, we use the homography matrix to transform the two frames onto the same plane (warped frames). The next step is to find the seam masks for these two warped frames. The aim is to determine the optimal boundary between the overlapping pixels of the two images in order to reduce visual artifacts. In this study, we use the graph-cut algorithm to determine the optimal seam between the two warped frames.

However, the graph-cut algorithm takes a lot of time. The computational time of the graph-cut algorithm from the OpenCV library for two images with an average resolution of 640 × 480 is more than 2 s for the CPU version and more than 1.5 s for the GPU version. Further, the computational time is proportional to the image size. In order to speed up the process, we propose using the ROI region to determine the seam mask. Then, we resize the two ROIs to a lower resolution in order to reduce the computational time for estimating the seam mask.

The details are presented below (see Fig 2.6):

(1) Determine the two small regions from the current frames The region is set to the whole frame for the first two frames For the subsequent frames, the regions are determined based on the size and position of the ROI of the previous two frames

(2) Resize the two small regions to a lower resolution (resized-small regions)

(3) Use the SURF algorithm to determine the homography matrix for the two resized-small regions (H matrix), as described in the image-registration step in Fig 2.1

(4) Calculate the homography matrix for the two current frames


ALGORITHM 2.3: Estimate the seam masks by resized-ROI regions

(1) Determine ROI-1 in warped-frame-1 and ROI-2 in warped-frame-2 using Eqs (2.15) and (2.16).
(2) Resize ROI-1 and ROI-2 to lower resolutions and find the seam masks on them (ROI-seam-masks).
(3) Resize these ROI-seam-masks back to the original resolution and copy these masks to position (x0, y0) using Eq (2.15) in order to obtain the two seam masks for the two warped frames.
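A minimal sketch of this resolution trick is shown below; find_seam_masks() is a hypothetical stand-in for the graph-cut seam finder, and the scale factor of 10 comes from the experiments reported later in this chapter:

```python
import cv2

def fast_seam_masks(roi1, roi2, find_seam_masks, scale=10):
    """Estimate seam masks on downscaled ROIs, then upscale them back (nearest neighbour)."""
    small1 = cv2.resize(roi1, None, fx=1.0 / scale, fy=1.0 / scale, interpolation=cv2.INTER_LINEAR)
    small2 = cv2.resize(roi2, None, fx=1.0 / scale, fy=1.0 / scale, interpolation=cv2.INTER_LINEAR)

    # find_seam_masks is a hypothetical stand-in for the graph-cut seam estimation.
    small_mask1, small_mask2 = find_seam_masks(small1, small2)

    full_size = (roi1.shape[1], roi1.shape[0])
    mask1 = cv2.resize(small_mask1, full_size, interpolation=cv2.INTER_NEAREST)
    mask2 = cv2.resize(small_mask2, full_size, interpolation=cv2.INTER_NEAREST)
    return mask1, mask2
```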

2.4 Results

The OpenCV library and the OpenCL framework are both used in the proposed technique. The program was executed on an Intel i3 CPU and a GTX750Ti Nvidia GPU with 8 GB RAM. The GPU plays a major role in the optimization of processing in PCs. The stitching performance can be accelerated by performing CPU and GPU operations simultaneously.

2.4.1 Video-stitching Results

To validate the efficacy of the proposed algorithm, we performed image mosaicing on two videos using both a phantom model and in-vivo animal experiments.

Figure 2.9: The image-stitching result (phantom model). The result expands the original FOV of the input image by 60%.

Figure 2.10: The image-stitching result (animal experiment). The result expands the original FOV of the input image by 55%.

In Fig 2.9, (a) shows two input images captured using the two endoscopic cameras during the phantom experiment, (b) shows the matching features in the two input images, and (c) shows the stitching result. Thus, the images confirm that the proposed method can expand the FOV to 160% of the original.

In Fig 2.10, (a) shows two input images captured from two videos during an in-vivo animal experiment, (b) shows the matching features in the two input images, and (c) shows the stitching result. Thus, it can be seen that the proposed method can expand the FOV to 155% of the original.

2.4.2 Run Time Estimation

The effectiveness of the proposed method was compared with that of the conventional one. The results of the comparison are described in this section. The stitched video was produced using the two endoscopic cameras at medium resolution (640 × 480). The program used for the performance comparisons was executed on both a CPU and a GPU.

Image Registration

For the videos with a resolution of 640 × 480, we chose a resized-scale value of 2 while ensuring that the number of good matches was sufficiently high. For larger-resolution videos, the selected resized-scale value is higher, allowing for further improvements in performance. Fig 2.11 shows the computational times for the frame-by-frame image-registration step performed using a CPU-only computer (CPU) and the same computer with an additional GPU (CPU+GPU), respectively.

Figure 2.11: Comparison of image registration times for conventional method (blue) and proposed method (green) on the CPU computer (a) and the computer with an additional GPU (b).

Table 2.1 shows that the proposed method increased the computation rate by 10.75 times (CPU) and 3.1 times (CPU+GPU) as compared to the conventional method.

Table 2.1: Average times for image-registration step after 2000 frames


Seam Mask Estimation

For two input videos with a resolution of 640 × 480, we chose a resized-scale value of 10 while ensuring that the seam-estimation quality remained high. For larger-resolution videos, the resized-scale value will be higher, resulting in additional performance improvements. Fig 2.12 shows the computational times for seam mask estimation (frame-by-frame) as performed using the CPU computer and the computer with an additional GPU, respectively.

Figure 2.12: Comparison of seam estimation times for conventional method (blue) and proposed method (green) on the CPU computer (a) and the computer with an additional GPU (b)

As can be seen from Table 2.2, the proposed method increased the speed of finding the seam mask by 153 times (CPU) and 140 times (GPU) in comparison with the conventional method.

Table 2.2: Average times for image seam mask estimation step after 2000 frames

The image-stitching process includes image registration and image compositing. Further, the image-compositing stage itself has three steps: warping the images, finding the seam masks, and blending the images. From the above-described results, we can calculate the total time for the image-stitching process.

Table 2.3: Computational times for image stitching with/without improvements

Table 2.3 shows the average times for the video-stitching process without improvements (i.e., the conventional method) and with improvements (i.e., the proposed method) for 2000 consecutive frames. It can be seen that the conventional method takes 2.97 s on the CPU and 1.838 s on the CPU+GPU, which means that the frame rate is approximately 0.34 fps on the CPU and 0.54 fps on the CPU+GPU. The proposed method achieves a frame rate of 3.33 fps on the CPU and 12.82 fps on the CPU+GPU.

This means that the proposed method yields an improvement of 10× on the CPU computer and 23× on the same PC with an additional GPU as compared to the conventional method.

2.4.3 Discussion

Fig 2.13 shows the stitching results for the conventional and proposed methods. The images in (a) are the input images, while those in (b, d) show the results of feature matching for both methods. The image in (c) shows the stitching result of the conventional method, while that in (e) shows the ground truth. It can be seen that the stitching results obtained using the proposed method are as natural as the ground truth. Furthermore, the stitching speed of our proposed method is higher while the image quality is not degraded.

However, the proposed method has two limitations. The first is that, when the two endoscopic cameras move fast, the overlap between the two previous frames and the two current ones varies widely in terms of location and size. Hence, the proposed method cannot be applied when using the ROI coordinates of the previous frames to estimate the small areas of the current frames. However, this is less likely to occur during laparoscopic surgery, as fast movements blur the image captured by the camera. The allowable camera velocity depends on the camera sensor and was found to be approximately 2.5 cm/s in this study. The second is that the image registration of the two current frames can fail owing to poor image quality or because the overlap region is missing. This will lead to an inaccurate or missing ROI. Hence, the image registration of the subsequent frames will also fail.

To identify the cases where image registration fails, we determine the number and quality of the matching pairs between the two frames. If this number is less than 4, there are not enough equations to determine the homography matrix. On the other hand, even when the number is greater than 4, if there are very few exactly matched pairs and most of them are inaccurate, then, even though a homography matrix can be obtained, Frame-2 will be transformed into a non-quadrilateral or an irregular shape, as shown in Fig 2.14 (a) and (b). Hence, the stitching results will be distorted and not meaningful to the surgeon. When these situations are encountered, the current frame's ROI is set to the whole frame and the following image-compositing stage is skipped. The stitching process resumes on the next frames.


In this study, we aimed to perform real-time stitching of images from multiple endoscopic cameras. Figs 2.15 and 2.16 show the stitching results for four cameras arranged in two rows. The images indicate that the camera's viewing angle can be extended by approximately 300%. However, in the present study, we only focused on stitching the images captured by two endoscopic cameras, owing to the real-time operational requirements of MIS.

2.5 Conclusions

In this chapter, we proposed a feature-based video-stitching method for MIS. We have proposed a down-sized ROI technique that can be combined with SURF to improve the speed of the registration. In addition, we also combined the down-sized ROI with the graph-cut algorithm to speed up the image composition. The experimental results showed that the MISPE can enhance the image size by up to 160%. As compared to the conventional method, the proposed one results in performance improvements of 10× on the CPU computer and 23× on the PC with an additional GPU. The proposed method can also be used for stitching video from a single camera or multiple ones. The method was confirmed to be effective with both large-sized and high-resolution images. The frame rate for the video stitched from two endoscopic cameras at a resolution of 640 × 480 was determined to be 12.82 fps.


Chapter 3 3D Reconstruction in Minimally Invasive Surgery

3.1 Introduction

In MIS, physicians cannot look directly at the operating area and have to look at the images on the flat screen transmitted from the endoscope during surgery. The screen only provides a 2D image, so it is quite challenging to recognize the depth as well as the relative position of the organs with respect to the surgical instruments. This limitation increases surgical risks, making it challenging for less experienced doctors to perform a safe surgical operation.

Therefore, several studies have reported image processing techniques to overcome this limitation of laparoscopic surgery. There are two common approaches in these studies [12]. The first approach is based on moving a monocular endoscope to reconstruct the 3D surface of the surgical area [11, 13, 14]. The second approach is based on stereo endoscopes to give the surgeon depth perception [15, 18, 21]. However, the 3D surface reconstruction of surgical endoscopic images is still an issue owing to certain challenges, such as the abundance of texture-less areas, occlusions introduced by the surgical tools, specular highlights, smoke, and blood produced during the interventions [20].

Nowadays, several commercially available stereo-endoscopic systems provide surgeons with 3D images of the surgical area [57]. However, they are usually a concern in hospitals because of their high cost.

Figure 3.1: 3D reconstruction of the cameras’ overlap by our endoscope

The structure of our endoscope is similar to that of a stereo endoscope (see Fig 1.7). Therefore, in this chapter, we propose a stereo matching algorithm to reconstruct a 3D image of the two cameras' overlap without the need for additional hardware. Our algorithm can provide a realistic 3D picture of the real scene in real time. Besides, distance and depth information is also presented and evaluated.

As shown in Fig 3.1, our system includes two endoscopic cameras that capture the two images of the surgical area simultaneously. Then, the proposed algorithm performs image processing to create the 3D images of the two cameras' overlap. The work in this chapter has been published in [40].

3.2 Proposed Stereo Reconstruction

Stereo matching (stereo correspondence) is the search for matching pixels or objects in two different camera views. However, searching the entire image is a complex process that requires high-performance computing. Images that are captured directly from the camera must also be corrected to account for lens distortion. Therefore, this study uses an image rectification technique [58] to transform our endoscope images into an aligned, undistorted configuration. The search is then simplified to a one-dimensional problem along a horizontal line parallel to the baseline between the cameras.

The stereo reconstruction algorithm consists of three steps, as described in Fig 3.2. The input images are rectified to calculate the disparity. Then, the dense 3D reconstruction is performed from the rectified images and the disparity information. The proposed stereo reconstruction algorithm is described in detail in the subsections below.

Figure 3.2: The stereo reconstruction algorithm. There are three steps: 1. Image rectification, 2. Disparity calculation, 3. 3D reconstruction
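For orientation, the three steps map onto standard OpenCV calls. The sketch below is a simplified pipeline, not the thesis implementation: the calibration values and file names are placeholders (only the 15 mm baseline comes from the working state of the endoscope), and the disparity filtering of Section 3.2.2 is omitted:

```python
import cv2
import numpy as np

# Placeholder calibration values (assumptions, not the thesis's calibrated parameters).
A1 = A2 = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
d1 = d2 = np.zeros(5)
R = np.eye(3)
T = np.array([[15.0], [0.0], [0.0]])   # 15 mm baseline in the working state

img_l = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical file names
img_r = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
size = img_l.shape[::-1]

# Step 1: rectification - align the two views so epipolar lines become horizontal.
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(A1, d1, A2, d2, size, R, T)
maps_l = cv2.initUndistortRectifyMap(A1, d1, R1, P1, size, cv2.CV_32FC1)
maps_r = cv2.initUndistortRectifyMap(A2, d2, R2, P2, size, cv2.CV_32FC1)
rect_l = cv2.remap(img_l, maps_l[0], maps_l[1], cv2.INTER_LINEAR)
rect_r = cv2.remap(img_r, maps_r[0], maps_r[1], cv2.INTER_LINEAR)

# Step 2: disparity via block matching (SAD + winner-takes-all).
bm = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = bm.compute(rect_l, rect_r).astype(np.float32) / 16.0

# Step 3: reproject the disparity to a 3D point cloud using the Q matrix.
points_3d = cv2.reprojectImageTo3D(disparity, Q)
```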

3.2.1 Image Rectification

The two basic purposes of this step are as follows: the first is image correction, which is needed owing to the distortion of the lens; the second is the alignment of the two cameras onto one viewing plane, so that the pixel rows of the two cameras are exactly aligned with each other.

Camera Model

We first assume that the camera follows the pinhole model (see Fig 3.3) [59]. The relation between a point in 3D space P(X, Y, Z) (in millimeters) and its image projection (u, v) (in pixels) is as follows:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \,[R\;T] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

where [R T] is the extrinsic parameter matrix and A is the intrinsic parameter matrix of the camera:

$$A = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

Figure 3.3: The pinhole camera model used in this study
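As a quick sanity check of the model (not from the thesis), a 3D point can be projected with NumPy as follows; the intrinsic and extrinsic values are arbitrary placeholders:

```python
import numpy as np

# Placeholder intrinsics and extrinsics (assumptions, not calibrated values).
A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
T = np.array([[0.0], [0.0], [0.0]])

P = np.array([[10.0], [5.0], [100.0], [1.0]])   # 3D point (mm), homogeneous

uvw = A @ np.hstack([R, T]) @ P                 # s * [u, v, 1]^T
u, v = (uvw[:2] / uvw[2]).ravel()
print(f"pixel coordinates: ({u:.1f}, {v:.1f})")
```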

Lens Distortions

Actual lenses often exhibit distortion, which includes radial and tangential components. Radial distortion is due to the lens's shape, while tangential distortion is due to the camera assembly process. Radial distortion is the most common kind and affects the image quality (see Fig 3.4).
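The distortion equations are not reproduced at this point in the excerpt; the standard radial/tangential model that this description refers to (as used, for example, by OpenCV) can be written as:

$$\begin{aligned}
x_{\text{dist}} &= x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2x^2) \\
y_{\text{dist}} &= y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2y^2) + 2 p_2 x y
\end{aligned}$$

where (x, y) are the normalized undistorted image coordinates, r² = x² + y², k1, k2, k3 are the radial distortion coefficients, and p1, p2 are the tangential distortion coefficients.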
