
Doctoral dissertation in Computer Science: 3-D Object Detections and Recognitions - Assisting Visually Impaired People


This doctoral dissertation presents the following content: point cloud representation and the proposed method for table plane detection; primitive shape estimation by a new robust estimator using geometrical constraints; detection and estimation of a 3-D object model for a real application; conclusions and future works.


HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

LE VAN HUNG

3-D OBJECT DETECTIONS AND RECOGNITIONS: ASSISTING VISUALLY IMPAIRED PEOPLE IN DAILY ACTIVITIES



DECLARATION OF AUTHORSHIP

I, Le Van Hung, declare that this dissertation titled "3-D Object Detections and Recognitions: Assisting Visually Impaired People in Daily Activities", and the works presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at Hanoi University of Science and Technology or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this dissertation is entirely my own work.

• I have acknowledged all main sources of help.

• Where the dissertation is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.


ACKNOWLEDGEMENT

This dissertation was written during my doctoral course at the International Research Institute MICA (Multimedia, Information, Communication and Applications), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank all the people who supported me in completing this work.

First, I would like to express my sincere gratitude to my advisors, Dr. Hai Vu and Assoc. Prof. Dr. Thi Thuy Nguyen, for their continuous support, their patience, motivation, and immense knowledge. Their guidance helped me throughout the research and writing of this dissertation. I could not imagine better advisors and mentors for my Ph.D. study.

Besides my advisors, I would like to thank Assoc. Prof. Dr. Thi-Lan Le, Assoc. Prof. Dr. Thanh-Hai Tran and the members of the Computer Vision Department at MICA Institute. These colleagues assisted me a great deal in my research and are co-authors of the published papers. Moreover, attending scientific conferences has always been a great experience for me and a source of many useful comments.

During my Ph.D. course, I received much support from the Management Board of MICA Institute. My sincere thanks to Prof. Yen Ngoc Pham, Prof. Eric Castelli and Dr. Son Viet Nguyen, who gave me the opportunity to join research works and permission to work in the laboratory at MICA Institute. Without their precious support, it would have been impossible to conduct this research.

As a Ph.D. student of the 911 program, I would like to thank this programme for its financial support. I also gratefully acknowledge the financial support for attending conferences from the Nafosted-FWO project (FWO.102.2013.08) and the VLIR project (ZEIN2012RIP19). I would like to thank the College of Statistics for its support over the years, both in my career and outside of work.

Special thanks to my family, particularly my mother and father, for all of the sacrifices they have made on my behalf. I would also like to thank my beloved wife for everything she has done to support me.

Hanoi, November 2018
Ph.D. Student

Le Van Hung


TABLE OF CONTENTS

1.1 Aided-systems for supporting visually impaired people 8

1.1.1 Aided-systems for navigation services 8

1.1.2 Aided-systems for obstacle detection 9

1.1.3 Aided-systems for locating the interested objects in scenes 11

1.1.4 Discussions 12

1.2 3-D object detection, recognition from a point cloud data 13

1.2.1 Appearance-based methods 13

1.2.1.1 Discussion 16

1.2.2 Geometry-based methods 16

1.2.3 Datasets for 3-D object recognition 17

1.2.4 Discussions 17

1.3 Fitting primitive shapes 18

1.3.1 Linear fitting algorithms 18

1.3.2 Robust estimation algorithms 19

1.3.3 RANdom SAmple Consensus (RANSAC) and its variations 20

1.3.4 Discussions 23

2 POINT CLOUD REPRESENTATION AND THE PROPOSED METHOD FOR TABLE PLANE DETECTION 24
2.1 Point cloud representations 24

2.1.1 Capturing data by a Microsoft Kinect sensor 24

2.1.2 Point cloud representation 25

2.2 The proposed method for table plane detection 28

2.2.1 Introduction 28


2.2.2 Related Work 29

2.2.3 The proposed method 30

2.2.3.1 The proposed framework 30

2.2.3.2 Plane segmentation 32

2.2.3.3 Table plane detection and extraction 34

2.2.4 Experimental results 36

2.2.4.1 Experimental setup and dataset collection 36

2.2.4.2 Table plane detection evaluation method 37

2.2.4.3 Results 40

2.3 Separating the interested objects on the table plane 46

2.3.1 Coordinate system transformation 46

2.3.2 Separating table plane and the interested objects 48

2.3.3 Discussions 48

3 PRIMITIVE SHAPES ESTIMATION BY A NEW ROBUST ESTIMATOR USING GEOMETRICAL CONSTRAINTS 51
3.1 Fitting primitive shapes by GCSAC 52

3.1.1 Introduction 52

3.1.2 Related work 53

3.1.3 The proposed new robust estimator 55

3.1.3.1 Overview of the proposed robust estimator (GCSAC) 55
3.1.3.2 Geometrical analyses and constraints for qualifying good samples 58

3.1.4 Experimental results of robust estimator 64

3.1.4.1 Evaluation datasets of robust estimator 64

3.1.4.2 Evaluation measurements of robust estimator 67

3.1.4.3 Evaluation results of a new robust estimator 68

3.1.5 Discussions 74

3.2 Fitting objects using the context and geometrical constraints 76

3.2.1 The proposed method of finding objects using the context and geometrical constraints 77

3.2.1.1 Model verification using contextual constraints 77

3.2.2 Experimental results of finding objects using the context and geometrical constraints 78

3.2.2.1 Descriptions of the datasets for evaluation 78

3.2.2.2 Evaluation measurements 81

3.2.2.3 Results of finding objects using the context and geometrical constraints 82

3.2.3 Discussions 85


4 DETECTION AND ESTIMATION OF A 3-D OBJECT MODEL FOR A REAL APPLICATION

4.1 A Comparative study on 3-D object detection 86

4.1.1 Introduction 86

4.1.2 Related Work 88

4.1.3 Three different approaches for 3-D objects detection in a complex scene 90

4.1.3.1 Geometry-based Primitive Shape detection Method (PSM) 90

4.1.3.2 Combination of Clustering objects and Viewpoint Features Histogram, GCSAC for estimating 3-D full object models (CVFGS) 91

4.1.3.3 Combination of Deep Learning based and GCSAC for estimating 3-D full object models (DLGS) 93

4.1.4 Experiments 95

4.1.4.1 Data collection 95

4.1.4.2 Evaluation method 98

4.1.4.3 Setup parameters in the evaluations 101

4.1.4.4 Evaluation results 102

4.1.5 Discussions 106

4.2 Deploying an aided-system for visually impaired people 109

4.2.1 Environment and material setup for the evaluation 111

4.2.2 Pre-built script 112

4.2.3 Performances of the real system 114

4.2.3.1 Evaluation of finding 3-D objects 115

4.2.4 Evaluation of usability and discussion 118

5 CONCLUSION AND FUTURE WORKS 121
5.1 Conclusion 121

5.2 Future works 123


LIST OF ABBREVIATIONS

No. Abbreviation Meaning

1 API Application Programming Interface

3 CVFH Clustered Viewpoint Feature Histogram

6 FPFH Fast Point Feature Histogram

8 GCSAC Geometrical Constraint SAmple Consensus

12 ICP Iterative Closest Point

13 ISS Intrinsic Shape Signatures

17 LBP Local Binary Patterns

18 LMNN Large Margin Nearest Neighbor

20 LO-RANSAC Locally Optimized RANSAC

21 LRF Local Receptive Fields

23 MAPSAC Maximum A Posteriori SAmple Consensus

24 MLESAC Maximum Likelihood Estimation SAmple Consensus

26 MSAC M-estimator SAmple Consensus

29 NAPSAC N-Adjacent Points SAmple Consensus


30 NARF Normal Aligned Radial Features

32 NNDR Nearest Neighbor Distance Ratio

33 OCR Optical Character Recognition

34 OPENCV OPEN source Computer Vision Library

36 PCA Principal Component Analysis

38 PROSAC PROgressive SAmple Consensus

39 QR code Quick Response Code

42 RFID Radio-Frequency IDentification

43 R-RANSAC Recursive RANdom SAmple Consensus

44 SDK Software Development Kit

45 SHOT Signature of Histograms of OrienTations

46 SIFT Scale-Invariant Feature Transform

48 SURF Speeded Up Robust Features

55 USAC A Universal Framework for Random SAmple Consensus

56 VFH Viewpoint Feature Histogram

57 VIP Visually Impaired Person

57 VIPs Visually Impaired People


LIST OF TABLES

Table 2.1 The number of frames of each scene 36
Table 2.2 The average result of detected table plane on our own dataset (%) 41
Table 2.3 The average result of detected table plane on the dataset [117] (%) 43
Table 2.4 The average result of detected table plane of our method with different down sampling factors on our dataset 44

Table 3.1 The characteristics of the generated cylinder, sphere, cone dataset (synthesized dataset) 66
Table 3.2 The average evaluation results of synthesized datasets. The synthesized datasets were repeated 50 times for statistically representative results 75
Table 3.3 Experimental results on the 'second cylinder' dataset. The experiments were repeated 20 times, then errors are averaged 75
Table 3.4 The average evaluation results on the 'second sphere', 'second cone' datasets. The real datasets were repeated 20 times for statistically representative results 76
Table 3.5 Average results of the evaluation measurements using GCSAC and MLESAC on three datasets. The fitting procedures were repeated 50 times for statistical evaluations 83

Table 4.1 The average result of detecting spherical objects in two stages 102
Table 4.2 The average results of detecting the cylindrical objects at the first stage in both the first and second datasets 103
Table 4.3 The average results of detecting the cylindrical objects at the second stage in both the first and second datasets 106
Table 4.4 The average processing time of detecting cylindrical objects in both the first and second datasets 106
Table 4.5 The average results of 3-D queried objects detection 116


LIST OF FIGURES

Figure 1.1 Illustration of the 3-D object recognition process towards the local feature based method [53] 13
Figure 1.2 Illustration of primitive shapes extraction from the point cloud [144] 17
Figure 1.3 Illustration of the Least squares process 19
Figure 1.4 Line presentation in image space and in Hough space [126] 20
Figure 1.5 Illustration of line estimation by the RANSAC algorithm 21
Figure 1.6 Diagram of RANSAC-based algorithms 22
Figure 2.1 Microsoft Kinect Sensor version 1 25
Figure 2.2 Illustration of the organized point cloud representation process 26
Figure 2.3 Description of the organized and unorganized point cloud 27
Figure 2.4 (a) is a RGB image, (b) is a point cloud of a scene 28
Figure 2.5 The proposed framework for table plane detection 30
Figure 2.6 (a) Computing the depth value of the center pixel based on its neighborhoods (within a (3 × 3) pixels window); (b) down sampling of the depth image 32


Figure 2.7 Illustration of estimating the normal vector of a set point in the 3-D space: (a) a set of points; (b) estimation of the normal vector of a black point; (c) selection of two points for estimating a plane; (d) the normal vector of a black point 33
Figure 2.8 Illustration of the point cloud segmentation process 33
Figure 2.9 Example of plane segmentation: (a) color image of the scene; (b) plane segmentation result with PROSAC in our publication; (c) plane segmentation result with the organized point cloud 35
Figure 2.10 Illustrating the acceleration vector provided by a Microsoft Kinect sensor; (xk, yk, zk) are the three coordinate axes of the Kinect coordinate system mounted on the chest of the VIP 35
Figure 2.11 Illustration of extracting the table plane in the complex scene 36
Figure 2.12 Examples of 10 scenes captured in our dataset 37
Figure 2.13 Scenes in the dataset [?] 37
Figure 2.14 (a) Color and depth image of the scene; (b) mask data of the table plane; (c) cropped region; (d) point cloud corresponding to the cropped region, the green point is the 3-D centroid of the region 38
Figure 2.15 (a) Illustration of the angle between the normal vector of the detected table plane and T; (b) illustration of the angle between the normal vector of the detected table plane ne and ng; (c) illustration of the overlapping and union between detected and ground-truth regions 39
Figure 2.16 Detailed results for each scene of the three plane detection methods on our dataset: (a) using the first evaluation measure; (b) the second evaluation measure and (c) the third evaluation measure 42
Figure 2.17 Illustration of the floor plane being segmented into multiple planes 43
Figure 2.18 Results of table detection with our dataset (two first rows) and the dataset in [117] (two bottom rows). The table plane is limited by the red color boundary in the image and by green color points in the point cloud. The arrow with red color is the normal vector of the detected table 44


Figure 2.19 Top line is an example detection that is defined as a true detection if using the two first evaluation measures and as a false detection if using the third evaluation measure: (a) color image; (b) point cloud of the scene; (c) the overlap area between the 2-D contour of the detected table plane and the table plane ground-truth. Bottom line is an example of a missing case with our method: (a) color image, (b) point cloud of the scene. After down sampling, the number of points belonging to the table is 276, which is lower than our threshold 45
Figure 2.20 The transformation of the original point cloud: from Kinect's original coordinate system Ok(xk, yk, zk) to a new coordinate system Ot(xt, yt, zt), such that the normal vector nt of a table plane is parallel to the y-axis 47
Figure 2.21 Setup of (a) the experiment 1 and (b) the experiment 2 48
Figure 2.22 The distribution of error (in distance measure) (ε) of object center estimation in two cases (Case 1: cylinder and Case 2: circle estimation) obtained from experiment 1 (a), and experiment 2 (b) 49
Figure 2.23 Illustrating the result of the detected table plane and separating interested objects. Left is the result of the detected table plane (green points) in the point cloud data of a scene; right is the result of the point cloud data of objects on the table 50

Figure 3.1 Illustration of using the primitive shapes estimation to estimate the full object models 51
Figure 3.2 Top panel: Overview of RANSAC-based algorithms. Bottom panel: A diagram of the GCSAC implementation 56
Figure 3.3 Geometrical parameters of a cylindrical object. (a)-(c) Explanation of the geometrical analysis to estimate a cylindrical object. (d)-(e) Illustration of the geometrical constraints applied in GCSAC. (f) Result of the estimated cylinder from a point cloud. Blue points are outliers, red points are inliers 59
Figure 3.4 (a) Setting geometrical parameters for estimating a cylindrical object from a point cloud as described above. (b) The estimated cylinder (green one) from an inlier p1 and an outlier p2. As shown, it is an incorrect estimation. (c) Normal vectors n1 and n∗2 on the plane π are specified 60


Figure 3.5 Estimating parameters of a sphere from 3-D points. Red points are inlier points. In this figure, p1, p2 are the two selected samples for estimating a sphere (two gray points); they are outlier points. Therefore, the estimated sphere has a wrong centroid and radius (see the green sphere, left bottom panel) 62
Figure 3.6 Estimating parameters of a cone from 3-D points using the geometrical analysis proposed in [131]: (a) Point cloud with three samples (p1, p2, p3) and its normal vectors. (b) Three estimated planes pli (i = 1, 2, 3). (c) Ei is calculated according to Eq. (3.8). (d) Ap and the main axis γco of the estimated cone. (e) Illustration of the proposed constraint to estimate a conical object 63
Figure 3.7 Point clouds of (a) dC1, dC2, dC3, (b) dSP1, dSP2, dSP3 and (c) dCO1, dCO2, dCO3 of the three datasets (the synthesized data) in case of 50% inlier ratio. The red points are inliers, whereas blue points are outliers 65
Figure 3.8 Examples of four cylindrical-like objects collected from the 'second cylinder' dataset 66
Figure 3.9 (a) Illustrating the separation of the point cloud data of a ball in the scene. (b) Illustrating the point cloud data of a cone and preparing the ground-truth for evaluating the fitting of a cone 66
Figure 3.10 Comparisons of the GCSAC and MLESAC algorithms. (a) Total residual errors of the estimated model and the ideal case. (b) Relative errors of the estimated model and the ideal case 69
Figure 3.11 The average number of iterations of GCSAC and MLESAC on the synthesized dataset when repeated 50 times for statistically representative results 70
Figure 3.12 Decomposition of the residual density distribution: inlier (blue) and outlier (red) density distributions of a synthesized point cloud with 50 inliers. (a) Noises are added by a uniform distribution. (b) Noises are added by a Gaussian distribution µ = 0, σ = 1.5. In each subfigure, the left panel shows the distribution of an axis (e.g., x-axis), the right panel shows the corresponding point cloud 70


Figure 3.13 An illustration of GCSAC at the kth iteration to estimate a coffee mug in the second dataset. Left: the fitting result with a random MSS. Middle: the fitting result where the random samples are updated due to the application of the geometrical constraints. Right: the current best model 71
Figure 3.14 The best estimated model using GCSAC (a) and MLESAC (b) with a 50% inlier synthesized point cloud. In each sub-figure, two different view-points are given 71
Figure 3.15 Results of coffee mug fitting. Ground-truth objects are marked as red points; estimated ones are marked as green points 72
Figure 3.16 The cup is detected by fitting a cylinder. Top row: The scenes in Dataset 3 captured at different view-points; Bottom row: The results of locating a coffee cup using GCSAC in the corresponding point clouds 72
Figure 3.17 An illustration of GCSAC at the kth iteration to estimate a coffee mug in the second dataset. Left: the fitting result with a random MSS. Middle: the fitting result where the random samples are updated due to applying the geometrical constraints. Right: the current best model 73
Figure 3.18 Illustrating the cylinder fitting of GCSAC and some RANSAC variations on the synthesized datasets, which have 15% inlier ratio. Red points are inlier points, blue points are outlier points. The estimated cylinder is a green cylinder. In this figure, GCSAC estimated a cylinder from a point cloud for which Ed, Er, Ea are smallest 73
Figure 3.19 Illustrating the sphere fitting of GCSAC and some RANSAC variations on the synthesized datasets, which have 15% inlier ratio. Red points are inlier points, blue points are outlier points. The estimated spheres are the green spheres 74
Figure 3.20 Illustrating the cone fitting of GCSAC and some RANSAC variations on the synthesized datasets, which have 15% inlier ratio. Red points are inlier points, blue points are outlier points. The estimated cones are the green cones 74
Figure 3.21 Fitting results of some instances collected from the real datasets: (a) A coffee mug; (b) A toy ball; (c) A cone object. In each sub-figure: the left panel is the RGB image for a reference, the right panel is the fitting result. Ground-truths are marked as red points; the estimated objects are marked as green points 76


Figure 3.22 Illustrations of a correct (a) and an incorrect estimation without using the verification scheme. In each sub-figure: Left panel: point cloud data; Middle panel: the normal vector of each point; Right panel: the estimated model 79
Figure 3.23 The histogram of the deviation angle with the x-axis (1, 0, 0) of a real dataset in the bottom panel of Fig. 3.22; (b) the histogram of the deviation angle with the x-axis (1, 0, 0) of a generated cylinder dataset in the top panel of Fig. 3.22 79
Figure 3.24 Illustrating the deviation angle between the estimated cylinder's axis and the normal vector of the plane 80
Figure 3.25 Some examples of scenes with cylindrical objects [117] collected within the first dataset 80
Figure 3.26 Illustration of six types of cylindrical objects in the third dataset 81
Figure 3.27 Result of the table plane detection in a pre-processing step using the methods in our previous publication and [33]: (a) RGB image of the current scene; (b) the detected table plane is marked in green points; (c) the point clouds above the table plane are located and marked in red 82
Figure 3.28 (a) The results of estimating the cylindrical objects of the 'MICA3D' dataset. (b) The results of estimating the cylindrical objects of the [68] dataset. In these scenes, there is more than one cylindrical object. They are marked in red, green, blue, yellow, and so on. The estimated cylinders include radius, position (the center of the cylinder) and main axis direction. The height can be computed using a normalization of the y-value of the estimated object 83
Figure 3.29 (a) The green estimated cylindrical object has the relative error of the estimated radius Er = 111.08%; (b) the blue estimated cylindrical object has the relative error of the estimated radius Er = 165.92% 84
Figure 3.30 Angle errors Ea of the fitting results using GCSAC with and without using the context's constraint 84
Figure 3.31 Extracting the fitting results of the video on the first scene of the first dataset 84

Figure 4.1 Top panel: the procedures of the PSM method. Bottom panel: illustration of the result of each step 91
Figure 4.2 A result of object clustering when using the method of Qingming et al. [110]: (a) RGB image; (b) the result of object clustering projected to the image space 92
Figure 4.3 (a) Presentation of the neighbors of Pq [123]. (b) Estimating parameters of the PFH descriptor [124] as in Eq. 4.2 93
Figure 4.4 Illustration of the training phase of the CVFGS method 93
Figure 4.5 Illustration of the testing phase of the CVFGS method 94
Figure 4.6 The architecture of the Faster R-CNN network [116] for object detection on an RGB image 95
Figure 4.7 The architecture of the YOLO version 2 network [114] 95
Figure 4.8 The size of the feature map when using YOLO version 2 for training the model 96
Figure 4.9 YOLO divides the image into regions and predicts bounding boxes and probabilities for each region [115] 96
Figure 4.10 Illustration of a MS Kinect mounted on the VIP's chest 97
Figure 4.11 Illustration of a Laptop mounted on the VIPs 98
Figure 4.12 Illustration of object types and scenes in our dataset and the published dataset [68] 98
Figure 4.13 Illustration of the size of the table 99
Figure 4.14 Illustration of the spherical object detection evaluation. The left column is the result of detecting spherical objects on the RGB image by the YOLO CNN. The middle column is the result of the estimated spherical object from the point cloud of the detected object in the left column. The right column is the result of spherical object detection when projecting the estimated sphere to the RGB image 99
Figure 4.15 Illustration of computing the deviation angle of the normal vector of the table plane yt and the estimated cylinder axis γc 101


Figure 4.16 A final result of detecting a spherical object in the scene. Left: the result in the RGB image, where the location of the spherical object is shown; right: the finding result in the 3-D environment (green points are the generated data points of the estimated sphere and the inlier points that estimated the blue sphere). The location and description of the estimated spherical object in the scene are x=-0.4m, y=-0.45m, z=1.77m, radius=0.098m 103
Figure 4.17 (a), (b) Illustrating the false cases of detecting spherical objects in the first stage. (c) Illustrating the false case of finding spherical objects following the query of the VIPs. The green set of points is the point cloud that estimates the blue sphere. In this case, the rectangle projected from the generated point cloud data of the blue sphere is large 104
Figure 4.18 Illustration of the result of detecting, locating and describing a spherical object in the 3-D environment 105
Figure 4.19 (a) Illustration of a RGB image; (b) the results of object detection on the RGB image when using the object classifier of YOLO [115]; (c) illustrating the results of finding cylindrical objects when not using the angle constraint of the context; (d) illustrating the results of finding cylindrical objects when using the angle constraint of the context. The deviation angle of the estimated cylinder with the normal vector of the table plane is 88 degrees for the bottle and 38 degrees for the coffee cup in (c), and 18.6 degrees for the bottle and 4.2 degrees for the coffee cup 107
Figure 4.20 (a) Illustration of the estimated cylinder from the point cloud data of the green jar in Fig. 4.19 107
Figure 4.21 Illustration of the result of detecting two cylindrical objects on the table: (a) the result on the RGB image, (b) the result in the 3-D environment 108
Figure 4.22 Illustration of the results of detecting, locating and describing a 3-D cylindrical object (green cylinder) in the 3-D environment. The red points belong to a detected object 108
Figure 4.23 The framework for deploying the complete system to detect 3-D primitive objects according to the queries of the VIPs 110
Figure 4.24 Illustration of 3-D queried primitive objects in 2-D and 3-D environments. The left panel shows the results of detecting cylindrical objects and the right panel shows the results for spherical objects in the RGB image and point cloud 111


Figure 4.25 Illustration of the setup of the system on the VIPs 112
Figure 4.26 Illustration of the VIPs coming to the sharing-room or kitchen to find the spherical or cylindrical objects on the table 113
Figure 4.27 Illustration of the trajectory of the visually impaired person 113
Figure 4.28 Illustration of object detection in the RGB images 114
Figure 4.29 Full trajectory of a volunteer in the experiments. First row: scene collected by a surveillance camera; the Frame ID is noted above each panel. Second row: results of YOLO detection on the RGB images. Last row: results of the model fitting and the object's descriptions (e.g., position, radius). There are five spherical-like objects in this scene 114
Figure 4.30 Illustration of scenes and objects when VIPs move in the real environment of the sharing-room and kitchen 115
Figure 4.31 Computation of the occluded data. Rg is the area of the object; Rs is the visible area 116
Figure 4.32 Illustration of results on the occluded data 117
Figure 4.33 A wrong detection of the table plane 117
Figure 4.34 A wrong estimation of the spherical object 118
Figure 4.35 Illustration of using GCSAC to estimate the spherical object on missing data 118
Figure 4.36 Illustration of the inclined cylinder 119
Figure 5.1 Illustration of the problems solved in our dissertation [93], [131] 123


INTRODUCTION

Motivation

Visually Impaired People (VIPs) face many difficulties in their daily living. Nowadays, many aided systems for the VIPs have been deployed, such as navigation services, obstacle detection (the iNavBelt and GuideCane products [10, 119]) and object recognition in supermarkets (EyeRing at MIT's Media Lab [74]). In their activities of daily living, the most common situation is that VIPs need to locate home facilities. However, even a simple activity such as querying common objects (e.g., a bottle, a coffee-cup, jars, and so on) in a conventional environment (e.g., a kitchen or cafeteria room) may be a challenging task. In terms of deploying an aided system for the VIPs, not only must the object's position be provided, but also more information about the queried objects, such as their size and grabbing directions, is required.

Let us consider a real scenario, as shown in Fig. 1: to look for a tea or coffee cup, a person goes into the kitchen, touches any surrounding object and picks up the right one. In terms of an aided system, that person just makes queries such as "Where is a coffee cup?", "What is the size of the cup?", "Is the cup lying/standing on the table?". The aided system should provide the information to the VIPs so that they can grasp the objects and avoid accidents. Approaching these issues, recognition on 2-D image data and adding information from depth images are presented in [81], [82], [83]. However, in these works only information about the object's label is provided. Moreover, 2-D information is able to observe only the visible part at each certain view-point, while the VIPs need information about position, size and direction for safe grasping. For this reason, we use 3-D approaches to address these critical issues.

To this end, we exploit the known shape of the queried object: for instance, a coffee cup usually has a cylindrical shape and lies on a flat surface (table plane). The aided system can resolve the query by fitting a primitive shape to the point cloud collected from the object. More generally, the objects usually placed on tables in a kitchen or tea room, such as cups, bowls, jars, fruit, funnels, etc., could be simplified by primitive shapes. The problem of detecting and recognizing complex objects is out of the scope of the dissertation. In addition, we observe that prior knowledge from the current scene, such as a cup normally standing on the table, contextual information such as walls in the scene being perpendicular to the table plane, and the limited size/height of the queried objects, would be valuable cues to improve the system performance.
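To make the last observation concrete, the fragment below is a minimal sketch, in Python/NumPy, of how such contextual cues could be checked on an already-fitted cylinder. The function name and the angle/height thresholds are illustrative assumptions, not values or code from the dissertation; in the dissertation the constraints are exploited inside the estimator itself (Chapter 3).

```python
import numpy as np

def passes_context_check(cyl_axis, table_normal, cyl_height,
                         max_angle_deg=20.0, max_height_m=0.35):
    """Accept a fitted cylinder only if its axis is roughly parallel to the
    table-plane normal (a cup usually stands upright on the table) and its
    height stays within a plausible range for table-top objects.
    Thresholds are illustrative assumptions."""
    a = np.asarray(cyl_axis, dtype=float)
    n = np.asarray(table_normal, dtype=float)
    a /= np.linalg.norm(a)
    n /= np.linalg.norm(n)
    # deviation angle between the estimated axis and the table normal
    angle = np.degrees(np.arccos(np.clip(abs(a @ n), 0.0, 1.0)))
    return angle <= max_angle_deg and cyl_height <= max_height_m

# An almost upright, 12 cm tall cylinder is kept; a lying one is rejected.
print(passes_context_check([0.05, 0.99, 0.02], [0.0, 1.0, 0.0], 0.12))  # True
print(passes_context_check([0.99, 0.05, 0.02], [0.0, 1.0, 0.0], 0.12))  # False
```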


Figure 1 Illustration of a real scenario: a VIP comes to the kitchen and gives a query, "Where is a coffee cup?", about the table. The left panel shows a Kinect mounted on the person's chest. Right panel: the developed system is built on a laptop PC.

More generally, we realize that the queried objects could be located through simplified geometric shapes: planar segments (boxes), cylinders (coffee mugs, soda cans), spheres (balls) and cones, rather than by utilizing conventional 3-D features. Approaching these ideas, a pipeline of the work "3-D Object Detection and Recognition for Assisting Visually Impaired People" is proposed in this dissertation. The proposed framework consists of several tasks including: (1) separating the queried objects from a table plane; (2) detecting candidates of the interested objects using appearance features; and (3) estimating a model of the queried object from a 3-D point cloud. Instead of matching the queried objects to 3-D models as conventional learning-based approaches do, this research work focuses on constructing a simplified geometrical model of the queried object from an unstructured set of point clouds collected by an RGB and depth sensor, wherein the last step plays the most important role.

Objective

In this dissertation, we aim to propose a robust 3-D object detection and recognition system. As a feasible solution to deploy a real application, the proposed framework should be simple, robust and friendly to the VIPs. However, it is necessary to notice that there are critical issues that might affect the performance of the proposed system. In particular, some of them are: (1) objects are queried in a complex scene where clutter and occlusion issues may appear; (2) noise in the collected data; and (3) high computational cost due to the huge number of points in a point cloud. Although in the literature a number of relevant works on 3-D object detection and recognition have been attempted for a long time, in this study we will not attempt to solve these issues separately. Instead, we aim to generate a unified solution. To this end, the concrete objectives are:

- To propose a completed 3-D query-based object detection system supporting the VIPs with high accuracy. Figure 2 illustrates the process of 3-D query-based object detection in an indoor environment.

Figure 2 Illustration of the process of 3-D query-based object detection in the indoor environment. The full object model is the estimated green cylinder from the point cloud of the coffee-cup (red points).

- To deploy a real application to locate and describe objects' information, supporting the VIPs in grasping objects. The application is evaluated in practical scenarios such as finding objects in a sharing-room or a kitchen.

An available extension of this research is to give the VIPs a feeling or a way of interaction in a simple form. The fact is that VIPs want to make optimal use of all their senses (i.e., audition, touch, and kinesthetic feedback). Through this study, informative data extracted from cameras (i.e., position, size, safe directions for object grasping) become available. As a result, the proposed method could offer an effective way to turn the large amount of collected data into a valuable and feasible resource.

Context, constraints and challenges

Figure 1 shows the context when a VIP comes to a cafeteria and uses an aided system for locating an object on the table. The input of the system is a user's query, and the output is the object's position in a 3-D coordinate system together with the object's information (size, height, normal surface). The proposed system operates using a MS Kinect sensor [72]. The Kinect sensor is mounted on the chest of the VIP and the laptop is carried in the backpack, as shown in Fig. 1. For deploying a real application, we require some constraints in the scenario, as follows:

• The MS Kinect sensor:

  – A MS Kinect sensor is mounted on the VIP's chest and he/she moves slowly around the table. This is to collect the data of the environment.

  – A MS Kinect sensor captures RGB and Depth images at a normal frame rate (from 10 to 30 fps) [95] with an image resolution of 640×480 pixels for both of those image types. With each frame obtained from the Kinect, an acceleration vector is also obtained. Because the MS Kinect collects the images in a range from 10 to 30 fps, it fits well with the slow movements of the VIPs (∼ 1 m/s). Although collecting image data via a wearable sensor can be affected by the subject's movement, such as image blur and vibrations in practical situations, there are no specific requirements for collecting the image data. For instance, VIPs are not required to stand still before collecting the image data.

  – Every queried object needs to be placed in the visible area of the MS Kinect sensor, which is at a distance of 0.8 to 4 meters and within an angle of 30° around the center axis of the MS Kinect sensor. Therefore, the distance constraint from the VIPs to the table is also about 0.8 to 4 m (a small range-filter sketch is given after this list).

• Interested (or queried) objects are assumed to have simple geometrical structures. For instance, coffee mugs, bowls, jars, bottles, etc. have a cylindrical shape, whereas balls have a spherical shape; a cube shape could be boxes, etc. They are idealized and labeled. The modular interaction between a VIP and the system has not been developed in the dissertation.

• Once a VIP wants to query an object on the table, he/she should stand in front of the table. This ensures that the current scene is in the visible area of the MS Kinect sensor, and he/she can move around the table. The proposed system computes and returns the object's information such as position, size and orientation. Sending such information to the senses (e.g., audible information, a Braille screen, or a vibrating type) is out of the scope of this dissertation.

• Some heuristic parameters are pre-determined. For instance, a VIP's height and other parameters of contextual constraints (e.g., the size of the table plane in a scene, the object's height limitations) are pre-selected.
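As a small illustration of the working-range constraint above (the range-filter sketch referred to in the list), the Python/NumPy fragment below keeps only the points whose depth falls inside the stated 0.8 to 4 m interval. The function name and the (N, 3) array layout are assumptions made for the example, not the dissertation's code.

```python
import numpy as np

def keep_points_in_kinect_range(points, z_min=0.8, z_max=4.0):
    """Keep only 3-D points whose depth (z, in metres) lies inside the
    sensor's reliable working range. `points` is an (N, 3) array in the
    Kinect coordinate system."""
    z = points[:, 2]
    mask = (z >= z_min) & (z <= z_max)
    return points[mask]

# Example: of three points, only the one at 1.5 m depth survives.
cloud = np.array([[0.1, 0.2, 0.5],    # too close
                  [0.0, 0.1, 1.5],    # inside the range
                  [0.3, 0.0, 5.2]])   # too far
print(keep_points_in_kinect_range(cloud))
```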

The above-mentioned scenarios and constraints are to cope with the following issues:

• Occlusion and clutter issues of the interested objects: In a practical situation, when a VIP comes to a cafeteria to find an object on the table, the queried objects could be occluded by others. At a certain viewpoint, the MS Kinect sensor captures only a part of an object. Therefore, the full 3-D data of the queried objects is missing. Another situation is that the collected data contains a lot of noise, because the depth image collected by the MS Kinect is often affected by illumination conditions. These issues raise challenges for fitting, detecting and classifying the queried objects from a point cloud.


• Various appearances of the same object type: The system is to support the VIPs in querying common objects. The fact is that a "blue" tea/coffee cup and a "yellow" bottle have the same type of primitive shape (e.g., a cylindrical model). These objects have the same geometrical structure but different appearances in color and texture. We exploit recent deep learning techniques to utilize appearance features (on RGB images) to address these issues.

• Computational time: A point cloud of a scene generated from an image with a size of 640 × 480 pixels consists of hundreds of thousands of points. Therefore, computations in the 3-D scene often require higher computational costs than the relevant tasks on a 2-D image (a depth-image down-sampling sketch follows this list).
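The down-sampling sketch announced in the last item is given below. It loosely follows the idea later illustrated in Fig. 2.6 (the depth of each kept pixel is computed from its 3 × 3 neighbourhood before the depth image is sub-sampled); the exact procedure and the sampling factor used in the dissertation may differ, so this Python/NumPy fragment is only an illustration.

```python
import numpy as np

def downsample_depth(depth, factor=2):
    """Replace each pixel by the mean of its 3x3 neighbourhood, then keep
    only every `factor`-th row and column of the depth image."""
    h, w = depth.shape
    padded = np.pad(depth.astype(float), 1, mode="edge")
    acc = np.zeros((h, w))
    for dy in range(3):          # 3x3 box filter built from shifted copies
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    smoothed = acc / 9.0
    return smoothed[::factor, ::factor]

# A 640x480 depth image shrinks to 320x240, i.e. a quarter of the points.
depth = np.random.randint(500, 4000, size=(480, 640)).astype(np.uint16)
print(depth.shape, "->", downsample_depth(depth, factor=2).shape)
```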

The main contributions of the dissertation are as follows:

• Contribution 1: Proposed a new robust estimator, named GCSAC (Geometrical Constraint SAmple Consensus), which utilizes geometrical constraints to select good samples for estimating the models of primitive shapes, and utilizes contextual information to validate results of the model estimation.

• Contribution 2: Proposed a comparative study on three different approaches for recognizing 3-D objects in a complex scene. Consequently, the best one is a combination of a deep-learning based technique and the proposed robust estimator (GCSAC). This method takes recent advantages of object detection using a neural network on RGB images and utilizes the proposed GCSAC to estimate the full 3-D models of the queried objects.

• Contribution 3: Deployed a successful system using the proposed methods for detecting 3-D primitive shape objects in a lab-based environment. The system combines the table plane detection technique and the proposed method of 3-D object detection and estimation. It achieves fast computation for both tasks of locating and describing the objects. As a result, it fully supports the VIPs in grasping the queried objects.


Figure 3 A general framework of detecting the 3-D queried objects on the table for the VIPs. (Blocks of the diagram: point cloud representation; acceleration vector; table plane detection; object detection on the RGB image; 3-D object location on the table plane; 3-D object model estimation; 3-D object information.)

General framework and dissertation outline

In this dissertation, we propose a unified framework for detecting the queried 3-D objects on the table to support the VIPs in an indoor environment. The proposed framework consists of three main phases, as illustrated in Fig. 3. The first phase is considered as a pre-processing step. It consists of point cloud representation from the RGB and Depth images and table plane detection in order to separate the interested objects from the current scene. The second phase aims to label the object candidates on the RGB images. The third phase is to estimate a full model from the point cloud specified by the first and the second phases. In this last phase, the 3-D objects are estimated by utilizing a new robust estimator, GCSAC, for the full geometrical models. Utilizing this framework, we deploy a real application. The application is evaluated in different scenarios, including datasets collected in lab environments and public datasets. In particular, the research works in the dissertation are composed of six chapters as follows:

• Introduction: This chapter describes the main motivations and objectives of the study. We also present the critical points of the research's context, constraints and challenges that we meet and address in the dissertation. Additionally, the general framework and main contributions of the dissertation are also presented.

• Chapter 1: A Literature Review: This chapter mainly surveys existing aided systems for the VIPs. In particular, the related techniques for developing an aided system are discussed. We also present the relevant works on estimation algorithms and a series of techniques for 3-D object detection and recognition.

• Chapter 2: In this chapter, we describe a point cloud representation from data collected by a MS Kinect sensor. A real-time table plane detection technique for separating the interested objects from a certain scene is described. The proposed table plane detection technique is adapted to the contextual constraints. The experimental results confirm the effectiveness of the proposed method on both self-collected and public datasets.

• Chapter 3: This chapter describes a new robust estimator for primitive shape estimation from point cloud data. The proposed robust estimator, named GCSAC (Geometrical Constraint SAmple Consensus), utilizes geometrical constraints to choose good samples for estimating models. Furthermore, we utilize contextual information to validate the estimation results. In the experiments, the proposed GCSAC is compared with various RANSAC-based variations on both synthesized and real datasets.

• Chapter 4: This chapter describes the completed framework for locating and providing the full information of the queried objects. In this chapter, we exploit the advantages of recent deep learning techniques for object detection. Moreover, to estimate the full 3-D model of the queried object, we utilize GCSAC on the point cloud data of the labeled object. Consequently, we can directly extract the object's information (e.g., size, normal surface, grasping direction). This scheme outperforms existing approaches such as solely using 3-D object fitting or 3-D feature learning.

• Chapter 5: We conclude the works and discuss the limitations of the proposed method. Research directions are also described for future works.


CHAPTER 1

LITERATURE REVIEW

In this chapter, we first present surveys on assisted (or aided) systems for VIPs in Sec. 1.1. The related systems are categorized into three groups: navigation services, obstacle detection, and positioning the interested objects in a scene. The related works on detecting 3-D objects in indoor environments are surveyed in Sec. 1.2. In this section, we roughly introduce and analyze the state-of-the-art 3-D object detection and recognition techniques. The readers can also refer to detailed approaches in Chapter 3. Finally, in Sec. 1.3, we focus our survey on the fitting techniques using robust estimator algorithms and their applications in robotics and computer vision.

1.1 Aided-systems for supporting visually impaired people

Nowadays, deploying aided-systems for the VIPs is an active topic in computer vision and robotics. To build an aided-system for the VIPs, one could use techniques based on RFID, GPS, and vision sensors. The techniques utilizing RFID and GPS often apply to wide-area environments, whereas the computer vision techniques often apply to indoor environments. Intuitively, the vision-based techniques have advantages in resolving different tasks such as moving, avoiding obstacles, understanding the environment as well as detecting and suggesting descriptions of the interested objects. For viewpoints of the specific aided-systems using computer vision techniques, readers can refer to a good survey in [139]. In this section, we briefly survey the completed systems. We divide the aided-systems into three categories: navigation aiding systems; obstacle detection; and positioning objects based on object detection and recognition techniques. Details of each category will be presented in the following sections.

1.1.1 Aided-systems for navigation services

Most VIPs encounter various difficult situations during their daily routine. VIPs have a different perception of the world than sighted people. Any system designed for the visually impaired has to be aware of these differences in order to provide a user interface adapted to the limitations and special needs of its users. One of the difficulties is navigation for the VIPs when they live in an unknown environment.

Zollner et al. [161] propose, at the conceptual level, a mobile navigational aided-system that uses the MS Kinect and optical markers for wayfinding of the VIPs in buildings. The VIPs receive continuous vibro-tactile feedback at the person's waist, giving an impression of the environment and warning about obstacles. The system may be used in both micro-navigation and macro-navigation settings. This system is able to detect the closest obstacles in the left, right and center regions of the MS Kinect's field of view. Jain et al. [65] propose a path-finding technique in an indoor environment for the VIPs using a cell phone. This system provides step-by-step directions to move in the building from any location to the destination using predefined infrastructure. The system carefully calibrates audio and vibration instructions so that the small wearable device helps the user navigate efficiently and is not affected by the movement of the VIPs. Mahmud et al. [118] propose an aided-system for VIP navigation using the data from a MS Kinect sensor. This system attempts to address various tasks including face detection and recognition, object location, optical character recognition and audio feedback. To detect and recognize objects, the authors in [118] learn Haar-like features for detecting and recognizing the interested objects. To determine the distance between a VIP and objects, the system uses the depth values. This system is developed based on a number of open-source tools/libraries such as OpenCV, OpenKinect, Tesseract and Espeak. The best result is face recognition with an accuracy of 90%. The poorest result is text recognition with an accuracy of 65%. The proposed system is able to detect and recognize a face/text/chair in a collected frame within 2.25 seconds. In Vietnam, Nguyen et al. [97] propose a system to assist the VIPs using robots, where predefined environmental maps and a robot help the VIPs move from one location to another. The system defines destinations in an environment via the interface of a smart-phone. The system is deployed in simple environments such as the corridor of a building or the halls of a library.

1.1.2 Aided-systems for obstacle detection

VIPs can meet accidents due to obstacle objects on their trajectories. In recent years, many systems for detecting obstacles have been proposed. These systems are built based on fusing sensors such as ultrasonic sensors, infra-red sensors, ultrasonic transducers, the MS Kinect sensor, time-of-flight cameras, etc. In this part, we briefly introduce the aided-systems for obstacle detection. Four categories are listed below.

Obstacle detection system on a smart phone: The smart-phone is a device that integrates sensors such as vision, vibration, gyroscope, etc. Especially, it may be mounted or carried easily on the VIP's body. These systems often use the smart-phone to capture images of the environment and send notifications to the VIPs. Alessandro et al. [23] exploit the hardware and software facilities of a standard smart-phone and extract a 3-D representation of the scene in order to detect obstacles. The system implements two steps: ground-plane detection of the scene and simultaneous identification of the 3-D points that lie on it. The system selects a 3-D point set that matches correspondences detected in the bottom part of the images (under the hypothesis that such points belong to the ground plane). Then a robust estimator like RANSAC is used to estimate the parameters of the ground plane. Finally, all 3-D points that do not belong to the ground plane are labeled as obstacles, and their relative depth is exploited to assign different warnings: a high level for closer objects or a lower level for distant ones. In [127], the authors propose a system that uses two hardware components: a laser pointer and an Android smart-phone. To avoid obstacles in the environment, the system uses image processing to measure distances from a VIP to objects by a laser light triangulation technique. At the same time, the obstacles are detected by edge detection within the captured images. Especially, the system is able to recognize and warn the VIPs when stairs are in the camera's field of view. The notification of this system to the end-user is sent out using an acoustic signal. Its sensitivity is about 90% for obstacles 1 cm wide.
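For reference, the ground-plane estimation described above can be sketched with a generic RANSAC plane fit such as the one below (Python/NumPy). This is not the implementation of [23]; the iteration count, the distance threshold and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

def ransac_plane(points, n_iters=200, dist_thresh=0.02, seed=None):
    """Generic RANSAC plane fit: repeatedly pick 3 points, build the plane
    through them and keep the plane supported by the most inliers.
    Returns (normal, d) with the plane written as n . x + d = 0, plus the
    inlier mask."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = None, None
    for _ in range(n_iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ p1
        dist = np.abs(points @ normal + d)   # point-to-plane distances
        inliers = dist < dist_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (normal, d)
    return best_model, best_inliers

# Synthetic example: a noisy ground plane z ~ 0 plus scattered off-plane points.
rng = np.random.default_rng(0)
ground = np.c_[rng.uniform(-1, 1, (500, 2)), rng.normal(0, 0.005, 500)]
others = rng.uniform(-1, 1, (100, 3)) + [0.0, 0.0, 0.5]
model, inliers = ransac_plane(np.vstack([ground, others]), seed=1)
print("plane normal:", np.round(model[0], 2),
      "| points off the plane:", int((~inliers).sum()))
```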

Obstacle detection system using ultrasonic and infra-red sensors: Ayush et al. [151] develop a prototype of an electronic device for obstacle detection supporting VIPs when they move in the environment. This wearable device is mounted on the person's waist and integrates ultrasonic sensors and a Raspberry Pi. The device can detect obstacles around the VIPs at a distance of up to 500 cm in three directions, i.e., front, left and right, using a network of ultrasonic sensors. The distance between the VIPs and obstacles is computed and converted to a text message, which is then converted into speech and conveyed to the VIPs through earphones/speakers. Virgil et al. [148] improve the mobility of blind persons in a limited area in which a 3-D obstacle detection system is integrated. This system performs 3-D obstacle detection by using ultrasonic sensors together with magnetic sensors and accelerometers for determining the position of the ultrasonic sensors. Sudhakar et al. [138] propose a warning system for obstructing obstacles on the way of the VIPs. The infra-red transmitter was created by amplifying the coded pulses in the photo module. The proposed aided-system is cheap, portable, user friendly and lightweight. It can warn of any obstructions in a moderate range of about 4 feet.

Obstacle detection system using the MS Kinect sensor and time-of-flight camera: A MS Kinect sensor [72] is cheap and can collect different types of data such as color images, depth images, the acceleration vector, skeletons, etc. To detect obstacles, researchers often combine the data collected by a MS Kinect sensor, i.e., the color and depth images. Huang et al. [57] propose an indoor obstacle detection system using the depth image and a region growing algorithm. This system is composed of three parts: scene detection, obstacle detection, and a vocal announcement. The best detection rate for an indoor obstacle is 97.40%. Monther et al. [92] propose an aided-system for the VIPs using devices such as a MS Kinect, a tablet PC, a micro-controller, IMU sensors, and vibration actuators to detect markers on the floor. When the VIPs move in the environment, once markers are detected, they are treated as obstacles and a warning signal is sent to the VIPs.

In Vietnam, Nguyen et al. [96] develop glasses for the VIPs called 'Haptic Eyes'. This device allows the VIPs to detect the obstacles that are within a range from the wearer to the obstacle at a farthest distance of 1.3 meters. Pham et al. [59] propose an obstacle detection method for the VIPs that uses the data collected by the MS Kinect sensor. This system also uses the ground plane as the sticking plane, and it is able to accurately detect four types of obstacle: walls, doors, stairs, and a residual class that covers other obstacles on the floor. For stair detection, the authors define a stair model that consists of at least three equally spaced steps. The system's result for detecting doors is 90.69%, and for down-stairs it is about 89.4%.

1.1.3 Aided-systems for locating the interested objects in scenes

As shown in the above sections, a number of efforts aim to help VIPs move in unfamiliar environments. However, the VIPs also meet many challenges when living and finding objects in these environments. In this section, we briefly survey related studies on this topic. Rabia et al. [63] present a good survey of the computer vision-based techniques that utilize visual tags to recognize the interested objects. Nicholson et al. [98] create a system named ShopTalk to help the VIPs in shopping: MSI barcodes on shelves are used to find product locations, and UPC barcodes on products help to identify them. This system implements object detection based on the barcode. Lanigan et al. [78] develop Trinetra, which assists the blind in identifying products in a grocery store. A product's barcode is scanned using a barcode scanning pencil and then sent via Bluetooth to a module on the user's mobile phone. Tekin et al. [38] develop a mobile phone application utilizing the phone's video camera to detect a UPC-A barcode in the scene. However, this application only sees the front of the user. The detected barcode is given to the visually impaired by audio feedback. In the above systems, the barcode digits are recognized using a Bayesian statistical approach, and the resulting barcode is compared against a customizable user database as well as a manufacturer's database. If a match is found, product information is relayed to the user via the text-to-speech module on the phone. They are not suitable when deployed with the objects appearing in the daily living of the VIPs.

Nikolakis et al. [102] develop an application providing an interface to 3-D virtual environments. The application allows blind people to understand and interact with objects in a Virtual Environment (VE) thanks to new technologies. It introduces new opportunities for training, entertainment, and working for the VIPs. The application helps blind people touch, grasp and manipulate objects seen inside the haptic-enabled VE. Schauerte et al. [130] present an aided-system based on computer vision that helps blind people find lost objects. This system is able to detect the objects based on the combination of color and SIFT-based features. Yi et al. [157] propose a prototype system of a blind assistant finding objects using a camera-based network and matching-based recognition algorithms. This system is performed as follows: a blind user sends his/her request of finding a specific object, and the system searches for the object in the images collected by the cameras in their respective regions. Each camera outputs a recognition score by comparing the objects in the captured images with the objects whose model characteristics have been learned as Speeded-Up Robust Features (SURF) and Scale Invariant Feature Transform (SIFT) feature descriptors in the dataset. After that, this system uses pre-determined thresholds to locate candidates of the queried objects.
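A minimal example of such matching-based recognition is sketched below using OpenCV's SIFT and the nearest-neighbour distance ratio test. The image file names and the match-count threshold are placeholders for illustration; the systems cited above use their own features, thresholds and object databases.

```python
import cv2

# Placeholder images: a stored object model and the current camera frame.
model_img = cv2.imread("object_model.png", cv2.IMREAD_GRAYSCALE)
scene_img = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(model_img, None)
kp2, desc2 = sift.detectAndCompute(scene_img, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for m, n in matcher.knnMatch(desc1, desc2, k=2):
    if m.distance < 0.75 * n.distance:      # Lowe's ratio (NNDR) test
        good.append(m)

# The number of surviving matches plays the role of a recognition score.
print("good matches:", len(good),
      "-> object present" if len(good) > 20 else "-> not found")
```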

1.1.4 Discussions

In the above sections, we presented three main research directions to deploy aided-systems supporting VIPs in their daily living. To develop a way-finding system, one can use GPS, RFID and Wifi technologies. However, the accuracy of these methods is poor. Therefore, how to improve the performance of the navigation systems is still an open question for future works. For detecting obstacles, researchers often use the information of devices such as a smart phone, an ultrasonic sensor, an infra-red sensor, ultrasonic transducers, a MS Kinect sensor or a time-of-flight camera. These systems are often applied in indoor environments and face many challenges when developed for outdoor environments.

To find/locate objects to aid the VIPs, the relevant techniques have been deployed in some contexts such as shopping and querying lost objects. The systems often use a defined marker such as a QR code or a barcode to identify objects and object locations. The markers are decoded following a defined dataset or using object classifiers that are trained using features of the objects. Especially, object finding on RGB images by recent deep learning techniques presents promising results. In the next section, we summarize techniques which are used for 3-D object detection. In Chapter 4, we revisit 3-D object detection, in which the relevant methods are presented in more detail.


1.2 3-D object detection, recognition from a point cloud data

3-D object detection and recognition are two fundamental research topics in the field of computer vision. These topics have been intensively studied since the 1970s. 3-D object detection and recognition in a point cloud have many challenges due to noise, contaminated data, occlusion and huge computational time. In this section, we survey two main approaches: the first one consists of the appearance-based approaches, whereas the second one is based on geometry-based analysis.

1.2.1 Appearance-based methods

The appearance-based techniques can be categorized into two main groups: (1) by using template registration techniques [77]. This approach matches the pre-prepared point cloud data of a template against the point cloud of the queried object using an optimization function; Iterative Closest Point (ICP) [19] is a common algorithm of this approach. (2) By using point features or matching point pairs [51], [18]. This method performs keypoint detection and then matches the keypoint sets. There are two main approaches for calculating point features in the second group: global and local features [53], [8].

Local feature based: These methods often perform three steps: 3-D keypoint detection, local surface feature description, and surface matching, as illustrated in Fig. 1.1 (adopted from [53]).

Figure 1.1 Illustration of the 3-D object recognition process towards the local feature based method [53]

In the first step, the 3-D keypoints are a set of points that satisfy two requirements: distinctive, i.e., suitable for effective description and matching (globally definable); and repeatable with respect to point-of-view variations, noise, etc. (locally definable). 3-D keypoints can be detected and extracted from 3-D point clouds and range maps, for example with Intrinsic Shape Signatures (ISS) [159], Normal Aligned Radial Features (NARF) [134], or Uniform Sampling (e.g., voxelization [34]), or from 2-D points such

as SIFT [85], AGAST [86], etc. In addition, keypoint detection methods are listed in Table 3 of the article [53]. Bo et al. [83], [82], [81] present and improve a family of kernel descriptors (KDES). This approach provides a principled framework for extracting image features from pixel attributes such as gradient, color, and local binary patterns. It operates on 2-D images (RGB and depth images) and 2-D features such as SIFT, SURF, and LBP. Also in [81], the authors present some local features using depth kernel descriptors (gradient, kernel PCA, size, spin, and local binary pattern kernel descriptors) for object recognition. The method in [81] can capture different cues for recognition, including size, shape, and edges (depth discontinuities). The authors use a Support Vector Machine (SVM) to train the features on the RGB-D images. It achieves a relatively high accuracy of object recognition, more specifically about 80% on the dataset presented in [76]. However, this approach needs a lot of time for preparing data and learning the object model, because training a classifier always requires a set of positive and negative examples [69].
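As a small example of the uniform-sampling (voxelization) strategy mentioned above for reducing a point cloud before keypoint extraction, the sketch below keeps one centroid per occupied voxel; it assumes NumPy, and the leaf size is an illustrative parameter.

```python
import numpy as np

def voxel_downsample(points: np.ndarray, leaf: float = 0.01) -> np.ndarray:
    """Keep one representative point (the centroid) per occupied voxel of size `leaf`."""
    voxels = {}
    # Assign each point to the integer index of the voxel it falls into.
    for p, key in zip(points, map(tuple, np.floor(points / leaf).astype(int))):
        voxels.setdefault(key, []).append(p)
    # Replace every occupied voxel by the mean of its points.
    return np.array([np.mean(v, axis=0) for v in voxels.values()])
```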

Once the keypoints are detected, geometric information such as curvatures and normal vectors of the local surface around the keypoints can be extracted and computed as feature descriptors. The 3-D local descriptors are computed for individual points. In the second step, the construction of feature descriptors can be divided into three broad categories: signature-based, histogram-based, and transform-based methods [53]. A signature-based method uses a normal vector and the tangent plane of each detected keypoint to encode a feature descriptor [135]. The histogram-based methods accumulate geometric or topological measurements of the local neighborhood of a keypoint. Johnson et al. [67] use the normal of a keypoint as the local reference axis and express each neighbor of that keypoint with two parameters, the radial distance and the signed distance, to compute the feature descriptor at the keypoint. The transform-based methods often transform a range image from the spatial space to another space, and then describe the 3-D surface neighborhood of a given keypoint by encoding the information in that transformed space [53]. For example, Knopp et al. [70] extend the 2-D Speeded Up Robust Feature (2-D SURF) to 3-D SURF. The authors first voxelize a mesh into a 3-D voxel image and apply a Haar wavelet transform to the voxel image; a Local Reference Frame (LRF) [53] is then defined for each keypoint by using the Haar wavelet responses in the neighborhood. In particular, descriptors


such as the Signature of Histograms of OrienTations (SHOT) [140], Point Feature Histogram (PFH) [125], Point Feature Histogram RGB (PFH-RGB) [7], and Fast Point Feature Histogram (FPFH) [111] are commonly used.
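Most of these descriptors are built on simple per-point geometric attributes, such as the surface normal and a curvature estimate of the local neighborhood. The following sketch computes them by a principal component analysis of each neighborhood; the neighborhood size k is an assumed parameter, and the code is only illustrative, not a reimplementation of any specific descriptor.

```python
import numpy as np
from scipy.spatial import cKDTree

def normals_and_curvature(points: np.ndarray, k: int = 20):
    """Estimate a unit normal and a curvature proxy for every point from its k neighbors."""
    tree = cKDTree(points)
    normals = np.zeros_like(points)
    curvature = np.zeros(len(points))
    for i, p in enumerate(points):
        _, idx = tree.query(p, k=k)
        nbrs = points[idx] - points[idx].mean(axis=0)
        cov = nbrs.T @ nbrs / k                               # covariance of the local neighborhood
        eigval, eigvec = np.linalg.eigh(cov)                  # eigenvalues in ascending order
        normals[i] = eigvec[:, 0]                             # smallest-variance direction = normal
        curvature[i] = eigval[0] / max(eigval.sum(), 1e-12)   # "surface variation" measure
    return normals, curvature
```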

The third step is surface matching. The surface matching methods can be divided into three steps: feature matching, hypothesis generation, and verification. Feature matching is the process of matching two sets of features: the library of pre-encoded features and the set of features of the object to be identified. There are three popular strategies for feature matching, i.e., threshold-based, Nearest Neighbor (NN) based, and Nearest Neighbor Distance Ratio (NNDR) based strategies [91]. Currently, there are many proposed algorithms for surface matching, where the kd-tree [41] is one of the robust ones. The hypothesis generation implements two tasks: one is to obtain candidate models which are potentially present in the scene, and the second is to generate transformation hypotheses for each candidate model. To build an accurate transformation hypothesis, one can use several techniques such as Random Sample Consensus (RANSAC) [46] and the generalized Hough transform [12], [141]. The verification step is implemented to distinguish true hypotheses from false ones. The Iterative Closest Point (ICP) algorithm [19] is a typical algorithm to perform the verification.
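A hedged sketch of the NN/NNDR matching strategies with a kd-tree is given below; the descriptor arrays and the ratio value 0.8 are assumptions used only for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_features(lib_desc: np.ndarray, obj_desc: np.ndarray, ratio: float = 0.8):
    """Return (object_index, library_index) pairs that pass the NNDR test."""
    tree = cKDTree(lib_desc)
    dist, idx = tree.query(obj_desc, k=2)        # the two closest library features
    # NN strategy: take idx[:, 0]. NNDR strategy: keep a match only if the best
    # distance is clearly smaller than the second-best distance.
    keep = dist[:, 0] < ratio * dist[:, 1]
    return [(i, idx[i, 0]) for i in np.flatnonzero(keep)]
```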

Global feature based: 3-D global descriptors are considered as a representation of the object geometry [5]. They are computed on a set of points, a subset of the point cloud, which is likely to be an object. This method requires a pre-processing step (i.e., segmentation) in order to retrieve possible candidates. Some features have been suggested, such as the Viewpoint Feature Histogram (VFH) [125]. VFH is a global descriptor that is based on the FPFH local descriptor. A VFH consists of two parts: a viewpoint direction component and an extended FPFH component. It computes a viewpoint feature histogram describing 3-D point cloud data captured from a stereo camera. This method was evaluated on a dataset of 60 IKEA kitchenware objects and obtained a high recognition rate (98.52%). The Clustered Viewpoint Feature Histogram (CVFH) [4] is an extension of VFH. CVFH is suitable when the clustered object is missing many points.

The integration of local and global descriptors for 3-D object recognition: Alhamzi et al. [8] propose a system for 3-D object recognition that integrates local and global features of the object. This system is evaluated on the datasets of Lai et al. [75] and Besl et al. [19]. The average result of VFH is the highest (97%), and the processing time of the combination of FPFH and VFH features is the lowest (3.17 s). The accuracy of the integration of the VFH and FPFH descriptors on the Willow Garage dataset is 89.64%.


1.2.1.1 Discussion

As presented in the above sections, the tasks of 3-D object detection and recognition from point cloud data usually use two types of characteristics: global and local features. The most important process aims to match the feature set computed on the training datasets with the feature sets computed on the testing data. The output of this approach is the object's label. A high computation time on 3-D data is required. In particular, the process of preparing data for object training is time consuming. The accuracy of labeling objects is not high because the point cloud data usually contains a lot of noise. Practically, the captured data is obtained by a Kinect sensor. Therefore, the above approaches are not well suited to the problem of detecting a queried object and providing its information to the VIPs. We will present the geometry-based methods to estimate object models from point cloud data in the next section.

1.2.2 Geometry-based methods

In geometry-based approaches, prior knowledge about the shape structure of the objects of interest should be provided as explicit models, such as cylindrical, spherical, or conical models. This approach usually aims at fitting primitive shapes to point cloud data. A complex object can be represented by one or several primitive shapes. To detect or recognize objects, the relevant studies often build a graph and use matching methods. Nieuwenhuisen et al. [100] propose a framework for grasping objects which consist of several primitive shapes, such as cylinders and spheres, as shown in Fig 1.2. For object recognition, Schnabel et al. [131] employ efficient primitive shape detection methods in 3-D point clouds. The object model is a combination of primitive shapes with a probabilistic graph-matching technique. The authors implement an efficient multi-stage process for detecting primitive shapes. The final step is planning to move and grasp objects. In [2], the authors propose a method for 3-D daily-life object localization using superquadric (SQ) models from the point cloud data acquired by an MS Kinect sensor. This method was tested with a cube-cylinder object on a very simple background in order to simplify the object detection; a quantitative evaluation has not been conducted. Hachiuma et al. [54] propose a method for 3-D object recognition that performs two steps: superquadric parameters are first estimated from the 3-D data points of an object, and a feature vector F is then built from them. To build F, the authors use the Large Margin Nearest Neighbor (LMNN) algorithm. In [36], multi-scale superquadric fitting was used to estimate the 3-D geometric shapes and to recover pose from unorganized point cloud data. A low-latency multi-scale voxelization strategy is applied to do the fitting task. To adapt to our study, we assume kitchenware objects for the grasping service. Their shapes are already known; therefore, the geometry-based


approach is more suitable.


Figure 1.2 Illustration of primitive shapes extraction from the point cloud [144]
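To give a flavor of how a single primitive can be fitted to 3-D points, the sketch below estimates a sphere by linear least squares. Real pipelines such as [131] combine several primitives and robust estimators (discussed later in this chapter); the function below is only an illustration of the geometric parameterization and is not taken from any of the cited works.

```python
import numpy as np

def fit_sphere(points: np.ndarray):
    """Fit a sphere to Nx3 points by solving x^2+y^2+z^2 = 2ax + 2by + 2cz + d."""
    A = np.hstack([2 * points, np.ones((len(points), 1))])   # unknowns: a, b, c, d
    f = (points ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, f, rcond=None)
    centre, d = sol[:3], sol[3]
    radius = np.sqrt(d + centre @ centre)                    # since d = r^2 - ||centre||^2
    return centre, radius
```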

1.2.3 Datasets for 3-D object recognition

A number of datasets for 3-D object detection and recognition tasks have been made publicly available to the 3-D vision community. Xiang et al. [154] propose the PASCAL3D dataset, which augments 12 rigid categories of the PASCAL VOC 2012 dataset [42]. On average, there are more than 3,000 object instances per category. This dataset also provides 3-D annotations for the pose estimation task. Silberman et al. [133], [132] propose the NYU Depth datasets. The NYU Depth dataset is comprised of video sequences from a variety of indoor scenes recorded by both the RGB and depth cameras of an MS Kinect. The NYU Depth V2 dataset includes 1449 densely labeled pairs of aligned RGB and depth images, 464 new scenes taken from 3 cities, and 407,024 new unlabeled frames. Each object is labeled with a class and an instance number (cup1, cup2, cup3, etc.). Geiger et al. [48] propose the KITTI dataset for autonomous driving, which aligns 2-D images with 3-D point clouds. It was captured by driving around a mid-size city, in rural areas, and on highways; up to 15 cars and 30 pedestrians are visible per image. Recently, Xiang et al. [153] proposed the ObjectNet3D dataset for recognizing the 3-D pose and 3-D shape of objects from 2-D images; it consists of 100 categories, 90,127 images, 201,888 objects in these images, and 44,147 3-D shapes.

1.2.4 Discussions

3-D object detection and recognition are fundamental problems in the fields of computer vision and robotics. Many relevant techniques have been developed to resolve critical issues such as noise, contaminated data, cluttered background, and occlusion. In this section, two main types of approaches, based on appearance and on geometrical features, have been surveyed. Each approach has its own advantages and disadvantages.


Although the appearance-based methods have high accuracy and can handle complex shapes, they rely on extracted features that require large training datasets or object models. Geometry-based approaches use the geometrical structure of objects; therefore, they cope well when the environment changes. However, they face many challenges when the data contain much noise or missing data due to occlusion.

To estimate the parameters of a model (e.g., a geometrical, statistical, or mathematical model) from various observations (e.g., data collected by different sensors), many linear and non-linear estimation algorithms have been deployed. In the computer vision community, to estimate the parameters of geometrical models [47], [1] or to compute the homography between two images [156], the three following estimation algorithms are widely used: Least Squares (LS) [79], the Hough Transform (HT) [35], and RANdom SAmple Consensus (RANSAC) [46]. Before presenting the details of the estimation algorithms, the definitions of inliers and outliers [52] in a sample set are given below:

• An inlier: an inlier is a data point lying within the general distribution of the other observed values; it generally does not perturb the results but is nevertheless non-conforming and unusual.

• An outlier: an outlier is a data point that is far from the general distribution of the other observed values, and can often perturb the results of a statistical analysis.

1.3.1 Linear fitting algorithms

Least Squares (LS) algorithm: LS is an estimator which is used to find the parameters of a vector X such that the sum of squared residuals R^2 is minimum. For instance, to fit a line y = ax + b (Eq. 1.1), the optimal fit minimizes

R^2 = Σ_{i=1}^{n} (y_i − (a x_i + b))^2

The fitting process illustrated in Fig 1.3 can be summarized as follows:

• Step 1: Choose two random points (Fig 1.3(b)).

• Step 2: Estimate a line from the two selected points (Fig 1.3(b)).


Figure 1.3 Illustration of the Least squares process.

• Step 3: Compute the distance from each point to the estimated line (Fig 1.3(c)).

• Step 4: Compute the sum of the squared distances di from the points to the estimated line.

• Step 5: Choose the best line (Fig 1.3(d)).

• Repeat from Step 1.

The Least Squares algorithm has some variations. Diniz et al. [32] introduce the Least Mean Squares (LMS) algorithm. Several implementations of that algorithm are similar to the original Least Squares algorithm; the only difference is that LMS computes the mean of the squared distances, Me. In addition, Least Squares can be combined with other estimation algorithms, such as maximum likelihood estimation, to estimate the parameters of models [16]. LS and its variations are used to estimate the parameters of geometrical structures such as planes [14], cylinders [137], spheres, circles, ellipses, and paraboloids [40].
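A minimal sketch of ordinary least-squares fitting for the line model y = ax + b used in the example above is given below, assuming NumPy; it solves the linear system in the least-squares sense and reports the residual sum of squares R^2.

```python
import numpy as np

def fit_line_ls(x: np.ndarray, y: np.ndarray):
    """Least-squares fit of y = a*x + b; returns (a, b, sum of squared residuals)."""
    A = np.column_stack([x, np.ones_like(x)])          # design matrix for y = a*x + b
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = ((y - (a * x + b)) ** 2).sum()          # R^2 as defined above
    return a, b, residual
```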

1.3.2 Robust estimation algorithms

The concept of robust estimators has been extensively presented in studies of mathematics, optimization, statistics, and computer vision. We repeat their definition in [30] as follows: robust estimators are "a general parameter estimation approach designed to cope with a large proportion of outliers in the input data." According to this definition, the models estimated by a robust estimator are not influenced by one or a few breakdown points of the data.

Hough Transform (HT): the HT is a robust estimator that was proposed by Hough


Figure 1.4 Line presentation in image space and in Hough space [126].

[56]. For instance, lines in an image are often detected by the Hough Transform algorithm. A line y = m0x + b0 in image space corresponds to a point (m0, b0) in Hough space, as shown in Fig 1.4. The Hough Transform can estimate the model of the objects in data with a high outlier ratio, and it can estimate multiple models from only a few parameters. It has been widely used in many applications such as 3-D applications, detection of objects and shapes, lane and road sign recognition, industrial and medical applications, pipe and cable inspection, and underwater tracking. Bhattacharya et al. [20] use the HT algorithm for detecting 3-D lines, boxes, and blocks. The 3-D HT is used for the extraction of planar faces from irregularly distributed point clouds. The authors also suggest two reconstruction strategies: detecting the intersection lines and the height of the rising edges; all detected planar faces should model some parts of the building. Wittrowski et al. [152] propose a 3-D Hough space voting approach for recognizing objects, in which the object model is learned from artificial 3-D models in a 3-D scene. The voting method clusters the accumulator array according to similar directions for the object reference points. Chau et al. [24] present an object detection method using the Generalized Hough Transform (GHT) with the color similarity between homogeneous segments of the object. The input of the algorithm is previously segmented regions with homogeneous color.
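The voting idea of the Hough Transform can be sketched as follows for the line model y = m0x + b0 discussed above. The parameter ranges and grid resolution are assumptions, and practical implementations usually prefer the bounded (rho, theta) parameterization; this is only an illustrative sketch.

```python
import numpy as np

def hough_lines(points, m_range=(-5.0, 5.0), b_range=(-100.0, 100.0), bins=200):
    """Return the (m, b) cell of the accumulator that receives the most votes."""
    ms = np.linspace(*m_range, bins)
    accumulator = np.zeros((bins, bins), dtype=np.int32)
    for x, y in points:
        b = y - ms * x                                   # every (m, b) passing through (x, y)
        b_idx = np.round((b - b_range[0]) / (b_range[1] - b_range[0]) * (bins - 1)).astype(int)
        ok = (b_idx >= 0) & (b_idx < bins)
        accumulator[np.arange(bins)[ok], b_idx[ok]] += 1  # cast one vote per parameter cell
    m_i, b_i = np.unravel_index(accumulator.argmax(), accumulator.shape)
    best_b = b_range[0] + b_i / (bins - 1) * (b_range[1] - b_range[0])
    return ms[m_i], best_b
```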

1.3.3 RANdom SAmple Consensus (RANSAC) and its variations

The RANSAC algorithm is a robust estimator proposed by Fischler et al. [46] in 1981. At each iteration it generates a model based on a sample consensus, and it is a general parameter estimation method that copes with a large proportion of outliers in the input data. RANSAC is considered to be very effective in the statistics and computer vision communities. It is summarized as follows (a minimal code sketch is given after the steps):

• Step 1: Randomly select the minimum number of points required to estimate the model parameters.

• Step 2: Estimate the parameters of the model from the selected samples.

• Step 3: Determine the inlier and outlier points from the set of all points, i.e., the points that are consistent with the estimated model (e.g., whose distance to the model is below a pre-defined threshold) are inliers.


• Step 4: Choose the best model, i.e., the one with the highest inlier ratio.

• Step 5: Repeat Steps 1 through 4 (a maximum of K times).
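The following is a minimal sketch of these five steps for fitting a 2-D line, assuming NumPy; the number of iterations K and the inlier threshold are illustrative parameters rather than values from [46].

```python
import numpy as np

def ransac_line(points: np.ndarray, K: int = 200, threshold: float = 0.05, seed: int = 0):
    """Fit a 2-D line to Nx2 points with RANSAC; returns ((point, direction), inlier indices)."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = np.array([], dtype=int), None
    for _ in range(K):
        # Step 1: minimal sample (two points define a line).
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        # Step 2: line through the sample, as a point and a unit direction.
        d = (p2 - p1) / np.linalg.norm(p2 - p1)
        # Step 3: inliers = points whose distance to the line is below the threshold.
        v = points - p1
        dist = np.abs(v[:, 0] * d[1] - v[:, 1] * d[0])   # 2-D point-to-line distance
        inliers = np.flatnonzero(dist < threshold)
        # Step 4: keep the model supported by the most inliers.
        if len(inliers) > len(best_inliers):
            best_inliers, best_model = inliers, (p1, d)
    return best_model, best_inliers   # Step 5: the loop above repeats K times
```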

Figure 1.5 illustrates line estimation by the RANSAC algorithm. RANSAC-based algorithms focus on two applications: estimating the parameters of models and matching two images. Most of the typical RANSAC variations have implementations integrated into the Point Cloud Library (PCL), as in [107, 106]. Over the last 35 years or so, many variants of RANSAC have been proposed, such as PROSAC [27], MLESAC [143], LO-RANSAC [29], NAPSAC [94], etc. Their diagram is illustrated in Fig 1.6. We notice that the RANSAC variations focus on improvements at 'step 1' and 'step 3' in Fig 1.6.

For example, PROSAC [27] uses semi-random sampling ('step 1') to choose good samples for estimating models. It is based on the probability condition when a given
