
3-D Object Detection and Recognition Assisting Visually Impaired People (Dissertation Abstract, English)


By knowing the queried object is a coffee cup which is usually a cylindrical shapeand lying on a flat surface table plane, the aided system could resolve the query byfitting a primitive


MINISTRY OF EDUCATION AND TRAINING HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY


The dissertation is completed at:

Hanoi University of Science and Technology

Supervisors:

1. Dr. Vu Hai

2. Assoc. Prof. Nguyen Thi Thuy

Reviewer 1: Assoc. Prof. Luong Chi Mai

Reviewer 2: Assoc. Prof. Le Thanh Ha

Reviewer 3: Assoc. Prof. Nguyen Quang Hoan

The dissertation will be defended before the approval committee

at Hanoi University of Science and Technology:

Time , date month year

The dissertation can be found at:

1. Ta Quang Buu Library

2. Vietnam National Library


Motivation

Visually Impaired People (VIPs) face many difficulties in their daily lives. Many aided systems for the VIPs have been deployed, such as navigation services, obstacle detection (the iNavBelt and GuideCane products; Andreas et al., IROS 2014; Rimon et al., 2016), and object recognition in supermarkets (EyeRing at MIT's Media Lab). The most common situation is that the VIPs need to locate home facilities. However, even a simple activity such as querying common objects (e.g., a bottle, a coffee cup, jars, and so on) in a conventional environment (e.g., a kitchen or cafeteria room) may be

a challenging task. In terms of deploying an aided system for the VIPs, not only must the object's position be provided, but more information about the queried object (e.g., its size, or how to grab objects lying on a flat surface, such as bowls and coffee cups on a kitchen table) is also required.

Let us consider a real scenario, as shown in Fig. 1: to look for a tea or coffee cup, a VIP goes into the kitchen, touches the surrounding objects and picks up the right one. With an aided system, that person just makes a query: "Where is a coffee cup?", "What is the size of the cup?", "Is the cup lying or standing on the table?". The aided system should provide the information so that the VIPs can grasp the objects and avoid accidents such as being burned. Even when 3-D object detection and recognition are performed on 2-D image data with additional depth images, as presented in (Bo et al., NIPS 2010; Bo et al., CVPR 2011; Bo et al., IROS 2011), only the object's label is provided. At the same time, the information that the system captures from the environment consists of image frames; therefore, the data of the objects on the table covers only the visible part of each object, such as the front of a cup, box or fruit, whereas the information the VIPs need concerns the position, size and direction for safe grasping. For this reason, we use a "3-D object estimation method" to estimate the information of the objects.

By knowing that the queried object is a coffee cup, which usually has a cylindrical shape and lies on a flat surface (table plane), the aided system could resolve the query by fitting a primitive shape to the point cloud collected from the object. The objects in a kitchen or tea room, such as cups, bowls, jars, fruit and funnels, are usually placed on tables; therefore, these objects can be simplified by primitive shapes. The problem of detecting and recognizing complex objects in the scene is not considered in the dissertation. The prior knowledge observed from the current scene, such as a


Figure 1. Illustration of a real scenario: a VIP comes to the kitchen and gives a query: "Where is a coffee cup?" about the table. The left panel shows a Kinect mounted on the person's chest. Right panel: the developed system is built on a laptop PC.

cup normally standing on the table, contextual constraints such as walls in the scene being perpendicular to the table plane, and limits on the size/height of the queried object, would be valuable cues to improve the system's performance.

Generally, we realize that the queried objects could be identified through simplifying geometric shapes: planar segments (boxes), cylinders (coffee mugs, soda cans), spheres (balls) and cones, without utilizing conventional 3-D features. Following these ideas, a pipeline for the work "3-D Object Detection and Recognition for Assisting Visually Impaired People" is proposed. It consists of several tasks: (1) separating the queried objects from the table plane detection result by using a coordinate system transformation technique; (2) detecting candidates for the interested objects using appearance features; and (3) estimating a model of the queried object from a 3-D point cloud, wherein the last task plays the most important role. Instead of matching the queried objects to 3-D models as conventional learning-based approaches do, this research focuses on constructing a simplified geometrical model of the queried objects from an unstructured set of point clouds collected by an RGB and range sensor.

Objective

In this dissertation, we aim to propose a robust 3-D object detection and recognition system. As a feasible solution for deploying a real application, the proposed framework should be simple, robust and friendly to the VIPs. However, it is necessary to notice that there are critical issues that might affect the performance of the proposed system, in particular: (1) objects are queried in a complex scene where clutter and occlusion may appear; (2) noise in the collected data; and (3) high computational cost due to the huge number of points in the cloud data. Although a number of relevant works on 3-D object detection and recognition have been attempted in the literature for a long time, in this study we will not attempt to solve these issues separately. Instead, we aim to build a unified solution. To this end, the concrete objectives are:

Figure 2. Illustration of the process of 3-D query-based object detection in an indoor environment. The full object model is the estimated green cylinder fitted to the point cloud of the coffee cup (red points).

- To propose a complete 3-D query-based object detection system supporting the VIPs with high accuracy. Figure 2 illustrates the process of 3-D query-based object detection in an indoor environment.

- To deploy a real application that locates and describes objects' information to support the VIPs in grasping objects. The application is evaluated in practical scenarios such as finding objects in a sharing room or a kitchen.

An available extension of this research is to give the VIPs a way of interaction in a simple form. The VIPs want to make optimal use of all their senses (i.e., audition, touch, and kinesthetic feedback). Through this study, informative data extracted from the cameras (i.e., position, size, and safe directions for object grasping) becomes available. As a result, the proposed method can offer an effective way to turn the large amount of collected data into a valuable, feasible resource.

Context, constraints and challenges

Figure 1 shows the context in which a VIP comes to a cafeteria and uses an aided system for locating an object on the table. The input of the system is a query; the output is the object's position in a 3-D coordinate system and the object's information (size, height). The proposed system operates with a MS Kinect sensor version 1. The Kinect sensor is mounted on the chest of the VIP and the laptop is wrapped in the backpack, as shown in Fig. 1 (bottom). For deploying a real application, we impose some constraints on the scenario, as follows:

- The MS Kinect sensor:

– A MS Kinect sensor is mounted on the VIP's chest while he/she moves slowly around the table, in order to collect data of the environment.

– A MS Kinect sensor captures RGB and depth images at a normal frame rate (from 10 to 30 fps) with an image resolution of 640×480 pixels for both image types. With each frame obtained from the Kinect, an acceleration vector is also obtained. Because the MS Kinect collects images at 10 to 30 fps, it fits well with the slow movement of the VIPs (∼1 m/s). Although collecting image data via a wearable sensor can be affected by the subject's movement (e.g., image blur or vibrations in practical situations), there are no specific requirements for collecting the image data; for instance, the VIPs are not required to stand still before collecting the image data.

– Every queried object needs to be placed in the visible area of the MS Kinect sensor, which is at a distance of 0.8 to 4 meters and within an angle of 30° around the center axis of the sensor. Therefore, the distance from the VIP to the table is also constrained to about 0.8 to 4 m.

- Interested (or queried) objects are assumed to have simple geometrical structures. For instance, coffee mugs, bowls, jars, bottles, etc. have a cylindrical shape, whereas balls have a spherical shape and boxes a cube shape. They are idealized and labeled. The modular interaction between a VIP and the system is not developed in the dissertation.

- Once a VIP wants to query an object on the table, he/she should stand in front of the table; this ensures that the current scene is in the visible area of the MS Kinect sensor. He/she can then move around the table. The proposed system computes and returns the object's information such as position, size and orientation. Sending such information to the senses (e.g., as audible information, on a Braille screen, or by a vibrating device) is out of the scope of this dissertation.

- Some heuristic parameters are pre-determined. For instance, a VIP's height and other parameters of contextual constraints (e.g., the size of the table plane in a scene, limits on object height) are pre-selected.
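The sensor-visibility constraint above (a range of 0.8 to 4 m and an angle of about 30° around the optical axis) can be sketched as a simple check. The threshold values follow the text; the function name is illustrative, not part of the proposed system:

```python
import numpy as np

def in_kinect_view(point, min_d=0.8, max_d=4.0, half_angle_deg=30.0):
    """Return True when a 3-D point (sensor frame, z = optical axis)
    lies inside the assumed usable sensing volume of the Kinect."""
    p = np.asarray(point, dtype=float)
    d = np.linalg.norm(p)
    if not (min_d <= d <= max_d):
        return False
    # Angle between the viewing ray and the optical (z) axis.
    angle = np.degrees(np.arccos(p[2] / d))
    return bool(angle <= half_angle_deg)

print(in_kinect_view([0.2, 0.0, 1.5]))   # inside both the range and the cone
print(in_kinect_view([0.0, 0.0, 5.0]))   # beyond the 4 m range
```

A point on the table thus has to satisfy both the radial distance bound and the cone angle bound before its depth data can be trusted.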

The above-mentioned scenarios and challenges must cope with the following issues:

- Occlusion and clutter of the interested objects: In practice, when a VIP comes to a cafeteria to find an object on the table, the queried objects could be occluded by others. At a certain viewpoint, the MS Kinect sensor captures only a part of an object; therefore, data of the queried objects is missing. Another situation is that the data contains much noise, because the depth image of a MS Kinect version 1 is often affected by illumination conditions. These issues are challenges for fitting, detecting and classifying the objects from a point cloud.

- Various appearances of the same object type: The system is to support the VIPs in querying common objects. In fact, a "blue" tea/coffee cup and a "yellow" bottle share the same type of primitive shape (a cylindrical model); these objects have the same geometric structure but different colors. We exploit learning-based techniques that utilize appearance features (on RGB images) for recognizing the queried objects.

- Computational time: A point cloud of a scene generated from an image of 640 × 480 pixels consists of hundreds of thousands of points. Therefore, computations in the 3-D environment often require a higher computational cost than a task in the 2-D environment.

to validate results of the model estimation

- Contribution 2: Proposed a comparative study of three different approaches for recognizing 3-D objects in a complex scene. Consequently, the best one is a combination of a deep-learning-based technique and the proposed robust estimator (GCSAC). This method takes recent advantages of object detection using a neural network on the RGB image and utilizes the proposed GCSAC to estimate the full 3-D models of the queried objects.

- Contribution 3: Successfully deployed a system using the proposed methods for detecting 3-D primitive-shape objects in a lab-based environment. The system combined the table plane detection technique with the proposed method for 3-D object detection and estimation. It achieved fast computation for both tasks of locating and describing the objects; as a result, it fully supports the VIPs in grasping the queried objects.

General framework and dissertation outline

In this dissertation, we propose a unified framework for detecting the queried 3-D objects on a table in order to support the VIPs in an indoor environment. The proposed framework consists of three main phases, as illustrated in Fig. 3. The first phase is considered a pre-processing step; it consists of point cloud representation from the


Figure 3. A general framework of detecting the 3-D queried objects on the table for the VIPs: point cloud representation (with the acceleration vector), table plane detection, object detection on the RGB image, 3-D object location on the table plane, and 3-D object model estimation by fitting candidates.

RGB and depth images, and table plane detection, in order to separate the interested objects from the current scene. The second phase aims to label the object candidates on the RGB images. The third phase estimates a full model from the point cloud specified by the first and second phases: the 3-D objects are estimated by utilizing a new robust estimator, GCSAC, to obtain the full geometrical models. Utilizing this framework, we deploy a real application, which is evaluated in different scenarios including datasets collected in lab environments and public datasets. The research works in the dissertation are composed of six chapters, as follows:

- Introduction: This chapter describes the main motivations and objectives of the study. We also present critical points of the research's context, constraints and challenges that we meet and address in the dissertation. Additionally, the general framework and main contributions of the dissertation are presented.

- Chapter 1: A Literature Review: This chapter mainly surveys existing aided systems for the VIPs. In particular, the related techniques for developing an aided system are discussed. We also present the relevant works on estimation algorithms and a series of techniques for 3-D object detection and recognition.

- Chapter 2: In this chapter, we describe a point cloud representation from data collected by a MS Kinect sensor. A real-time table plane detection technique for separating the interested objects from a given scene is described. The proposed table plane detection technique is adapted to the contextual constraints. The experimental results confirm the effectiveness of the proposed method on both self-collected and public datasets.

- Chapter 3: This chapter describes a new robust estimator for primitive shape estimation from point cloud data. The proposed robust estimator, named GCSAC (Geometrical Constraint SAmple Consensus), utilizes geometrical constraints to choose good samples for estimating models. Furthermore, we utilize the contextual information to validate the estimation results. In the experiments, the proposed GCSAC is compared with various RANSAC-based variants on both synthesized and real datasets.

- Chapter 4: This chapter describes the completed framework for locating and providing the full information of the queried objects. In this chapter, we exploit the advantages of recent deep learning techniques for object detection. Moreover, to estimate the full 3-D model of the queried object, we utilize GCSAC on the point cloud data of the labeled object. Consequently, we can directly extract the object's information (e.g., size, surface normal, grasping direction). This scheme outperforms existing approaches such as solely using 3-D object fitting or 3-D feature learning.

- Chapter 5: We conclude the work and discuss the limitations of the proposed method. Research directions are also described for future work.

CHAPTER 1 LITERATURE REVIEW

In this chapter, we present surveys on the related works of aided systems for the VIPs and object detection methods in indoor environments. Firstly, relevant aiding applications for the VIPs are presented in Sec. 1.1. Then, the robust estimators and their applications in robotics and computer vision are presented in Sec. 1.3. Finally, we introduce and analyze the state-of-the-art works on 3-D object detection and recognition in Sec. 1.2.


1.1 Aided systems supporting visually impaired people

1.1.1 Aided systems for navigation service

1.1.2 Aided systems for obstacle detection

1.1.3 Aided systems for locating the interested objects in scenes

1.1.4 Aided systems for detecting objects in daily activities

1.3 Fitting primitive shapes: A brief survey

1.3.1 Linear fitting algorithms

1.3.2 Robust estimation algorithms

1.3.3 RANdom SAmple Consensus (RANSAC) and its variations

1.3.4 Discussions

CHAPTER 2 POINT CLOUD REPRESENTATION AND THE PROPOSED METHOD FOR TABLE PLANE DETECTION

A common situation in the activities of daily living of visually impaired people (VIPs) is to query an object (a coffee cup, a water bottle, and so on) on a flat surface. We assume that such a flat surface could be a table plane in a sharing room or in a kitchen. To build a complete aided system supporting the VIPs, the queried objects obviously need to be separated from the table plane in the current scene. In a general framework that consists of other steps, such as detection and estimation of the full model of the queried objects, table plane detection can be considered a pre-processing step. Therefore, this chapter is organized as follows: firstly, we introduce a representation of the point clouds, which combines the data collected by the Kinect sensor, in Section 2.1; we then present the proposed method for table plane detection in Section 2.2.


2.1 Point cloud representation

2.1.1 Capturing data by a Microsoft Kinect sensor

In order to collect data from the environment for building an aided system that helps the VIPs detect and grasp objects with simple geometrical structure on a table in an indoor environment, the color and depth images are captured from a MS Kinect sensor version 1.

2.1.2 Point cloud representation

The result of the calibration is the camera's intrinsic matrix Hm, used for projecting pixels from 2-D space to 3-D space as follows:
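The projection formula itself did not survive extraction in this copy. Assuming the standard pinhole model, with typical Kinect v1 intrinsic values standing in for the calibrated Hm (illustrative numbers, not the dissertation's calibration), the back-projection can be sketched as:

```python
import numpy as np

# Assumed intrinsics for a 640x480 Kinect v1 depth camera; the exact
# values come from calibration (the matrix Hm in the text).
fx, fy = 525.0, 525.0
cx, cy = 319.5, 239.5

def depth_to_cloud(depth):
    """Back-project a depth image (in meters) to an organized point cloud.
    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.dstack((x, y, z))  # organized cloud of shape (h, w, 3)

depth = np.full((480, 640), 2.0)   # e.g., a flat wall 2 m away
cloud = depth_to_cloud(depth)
print(cloud.shape)                 # (480, 640, 3)
```

Keeping the cloud organized (one 3-D point per pixel) preserves the image neighborhood structure, which later helps the plane segmentation step.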

in which the acceleration data provided by the MS Kinect sensor is used to prune the extraction results. The proposed algorithms achieve real-time performance as well as a high detection rate of the table planes.

2.2.2 Related Work

2.2.3 The proposed method

2.2.3.1 The proposed framework

Our research context aims to develop object-finding and grasping-aided services for the VIPs. The proposed framework, as shown in Fig. 2.6, consists of four steps: down-sampling, organized point cloud representation, plane segmentation and table plane classification. Because our work utilizes only the depth feature, a simple and effective method for down-sampling and smoothing the depth data is described below.


Figure 2.6: The proposed framework for table plane detection: from the Microsoft Kinect data, depth down-sampling, organized point cloud representation, plane segmentation and plane classification yield the table plane.

Given a sliding window of size n × n pixels, the depth value of the center pixel D(x_c, y_c) is computed by Eq. 2.2:

D(x_c, y_c) = (1/N) · Σ_{i=1}^{N} D(x_i, y_i),    (2.2)

where N = n × n is the number of pixels in the window.

2.2.3.3 Table plane detection/extraction

The results of the first step are planes that are perpendicular to the acceleration vector. After rotating the coordinate system so that the y-axis is parallel to the acceleration vector, the table plane is the highest plane in the scene; that is, it is the plane with the minimum y-value.
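The highest-plane rule can be sketched as follows, assuming the cloud has already been rotated so that y is parallel to gravity and increases downward (names are illustrative):

```python
import numpy as np

def pick_table_plane(planes):
    """Among candidate planes perpendicular to gravity, the table is the
    highest one, i.e. the plane whose mean y-coordinate is minimal
    (y points down after aligning with the acceleration vector)."""
    return min(planes, key=lambda pts: np.asarray(pts)[:, 1].mean())

floor = np.array([[0.0, 1.5, 2.0], [1.0, 1.5, 2.5]])   # ~1.5 m below sensor
table = np.array([[0.0, 0.4, 1.2], [0.5, 0.4, 1.4]])   # ~0.4 m below sensor
print(pick_table_plane([floor, table])[0])              # a point of the table
```

With this convention the floor, being lower, has a larger y and is automatically rejected in favor of the table surface.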

2.2.4 Experimental results

2.2.4.1 Experimental setup and dataset collection

The first dataset, called 'MICA3D': a Microsoft Kinect version 1 is mounted on the person's chest; the person then moves around one table in the room. The distance between the Kinect and the center of the table is about 1.5 m, and the height of the Kinect above the table plane is about 0.6 m. The height of the table plane is about 60–80 cm. We captured data of 10 different scenes, including a cafeteria, a showroom, a kitchen, and so on. These scenes cover common contexts in the daily activities of visually impaired people. The second dataset was introduced in (Richtsfeld et al., IROS 2012). It contains calibrated RGB-D data of 111 scenes, each containing a table plane. The size of the images is 640×480 pixels.

2.2.4.2 Table plane detection evaluation method

Three evaluation measures are needed; they are defined below. Evaluation measure 1 (EM1): This measure evaluates the difference between the


Table 2.2: The average result of table plane detection on our own dataset (%).

Approach          EM1    EM2    EM3    Average  Missing rate  Frames per second
First Method      87.43  87.26  71.77  82.15    1.2           0.2
Second Method     98.29  98.25  96.02  97.52    0.63          0.83
Proposed Method   96.65  96.78  97.73  97.0     0.81          5

Table 2.3: The average result of table plane detection on the dataset [3] (%).

Approach          EM1    EM2    EM3    Average  Missing rate  Frames per second
First Method      87.39  68.47  98.19  84.68    0.0           1.19
Second Method     87.39  68.47  95.49  83.78    0.0           0.98
Proposed Method   87.39  68.47  99.09  84.99    0.0           5.43

normal vector extracted from the detected table plane and the normal vector extracted from the ground-truth data.

Evaluation measure 2 (EM2): In EM1, only one point (the center point of the ground truth) was used to estimate the angle. To reduce the influence of noise, more points are used to determine the normal vector of the ground truth: for EM2, three points (p1, p2, p3) are randomly selected from the ground-truth point cloud.
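Computing a normal from three sampled points, and the angle between normals compared by EM1/EM2, can be sketched as (a minimal sketch; function names are illustrative):

```python
import numpy as np

def plane_normal(p1, p2, p3):
    """Unit normal of the plane through three non-collinear points."""
    n = np.cross(np.subtract(p2, p1), np.subtract(p3, p1))
    return n / np.linalg.norm(n)

def normal_angle_deg(n1, n2):
    """Angle between two plane normals (sign-invariant), as compared
    between the detected plane and the ground truth."""
    c = abs(float(np.clip(np.dot(n1, n2), -1.0, 1.0)))
    return float(np.degrees(np.arccos(c)))

gt = plane_normal([0, 0, 0], [1, 0, 0], [0, 0, 1])   # a horizontal plane
print(normal_angle_deg(gt, [0.0, 1.0, 0.0]))          # 0.0
```

Taking the absolute value of the dot product makes the measure independent of which side of the plane each normal points to.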

Evaluation measure 3 (EM3): The two evaluation measures presented above do not take into account the area of the detected table plane. Therefore, we propose EM3, which is inspired by the Jaccard index for object detection.


2.3 Separating the interested objects on the table plane

2.3.1 Coordinate system transformation

2.3.2 Separating table plane and interested objects

2.3.3 Discussions

CHAPTER 3 PRIMITIVE SHAPES ESTIMATION BY A NEW ROBUST ESTIMATOR USING GEOMETRICAL CONSTRAINTS

is implemented. To search for good samples, we define two criteria: (1) the selected samples must be consistent with the estimated model via a rough inlier-ratio evaluation; (2) the samples must satisfy explicit geometrical constraints of the interested objects (e.g., cylindrical constraints).

3.1.2 Related work

3.1.3 The proposed new robust estimator

3.1.3.1 Overview of the proposed robust estimator (GCSAC)

To estimate the parameters of a 3-D primitive shape, an original RANSAC paradigm, as shown in the top panel of Figure 3.2, randomly selects a Minimum Sample Subset (MSS) from a point cloud; the model parameters are then estimated and validated. The algorithm is often computationally infeasible and it is unnecessary to try every possible sample. Our proposed method (GCSAC, in the bottom panel of Figure 3.2) is based on an original version of RANSAC; however, it differs in three major aspects: (1) at each iteration, the minimal sample set is conducted when the random sampling procedure is performed, so that probing the consensus data is easily achievable. In other words, a low pre-defined inlier threshold can be deployed as a weak condition of consistency. Then, after only a few random sampling iterations, the candidates
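The sampling-with-early-rejection idea can be illustrated with a toy 2-D line-fitting RANSAC. This is a hedged sketch of the general principle only, not the authors' GCSAC, which estimates 3-D primitives (cylinders, spheres, cones) under shape-specific geometrical constraints:

```python
import numpy as np

rng = np.random.default_rng(0)

def ransac_line(points, iters=200, tol=0.05, min_inlier_ratio=0.3):
    """Toy RANSAC line fit: each random 2-point sample is first screened
    by a degeneracy (constraint) check and a weak inlier-ratio check,
    and only surviving candidates compete for the best model."""
    pts = np.asarray(points, dtype=float)
    best_model, best_count = None, 0
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        p, q = pts[i], pts[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue                 # degenerate sample: reject at once
        d /= norm
        # Perpendicular distance of every point to the candidate line.
        resid = np.abs((pts - p) @ np.array([-d[1], d[0]]))
        count = int((resid < tol).sum())
        if count < min_inlier_ratio * len(pts):
            continue                 # weak consistency check: cheap early exit
        if count > best_count:
            best_model, best_count = (p, d), count
    return best_model, best_count

# 50 noisy points on the line y = x, plus 10 uniform outliers.
t = rng.uniform(0.0, 1.0, 50)
inliers = np.column_stack((t, t)) + rng.normal(0.0, 0.01, (50, 2))
outliers = rng.uniform(0.0, 1.0, (10, 2))
model, count = ransac_line(np.vstack((inliers, outliers)))
print(count)   # most of the 50 line points are recovered
```

GCSAC replaces the degeneracy test here with the object's explicit geometrical constraints, so bad samples are discarded before the expensive model validation step.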
