MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
NGUYEN VAN GIAP
VISION-BASED LOCALIZATION
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Full name of the thesis author: Nguyễn Văn Giáp
Thesis title: Localization using image information (Định vị sử dụng thông tin hình ảnh)
Major: Computer Science
Student ID: CB140975
The author, the scientific supervisor, and the Thesis Examination Committee confirm that the author has revised and supplemented the thesis according to the minutes of the Committee meeting of October 21, 2016, with the following contents:
- Added citations for the figures and added units
- Corrected typographical errors throughout the thesis
- Added a list of symbols with their meanings and units
- Added missing information in the references
- Clarified the reuse of code, specifying which parts and where

Hanoi, November ..., 2016
Supervisor: Dr. Vũ Hải
Thesis author: Nguyễn Văn Giáp
Committee Chair: Prof. Dr. Eric Castelli
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
International Research Institute MICA Computer Vision Department
Master of Science
Vision-Based Localization
by Nguyen Van Giap

Abstract
Nowadays, vision-based localization systems are widely used in public and crowded places. Positioning information extracted from the image streams of surveillance cameras can support monitoring services in different ways: for instance, detecting people who are not allowed to enter a certain place, or linking human trajectories in order to identify abnormal behaviors. These services always require positioning information about the subjects of interest. Meanwhile, other localization techniques are still limited in range and accuracy (e.g., Wi-Fi, RFID) or in environment setup and usability (e.g., Bluetooth, GPS). The vision-based localization technique, particularly in indoor environments, has many advantages: it is scalable, highly accurate, and requires no additional equipment or devices attached to the subjects.
Motivated by the above advantages, this thesis aims to study and propose a high-accuracy vision-based localization system for indoor environments. We also address detailed implementation and development, as well as testing of the proposed techniques. Note that the thesis focuses on humans moving in indoor environments monitored by a surveillance camera network. To achieve a high-accuracy positioning system, the thesis deals with the critical issues of vision-based localization. We observe that no human detector or tracker is perfect, so we use a regression model to eliminate outlier detections. The system thereby improves the detection and tracking results.
Throughout the thesis, we first give a brief overview of vision-based localization. We then present the proposed framework, which includes the following steps: background subtraction for detecting moving subjects; shadow removal techniques for improving the detection result; a linear regression method for eliminating outliers; and finally tracking the object with a Kalman filter. The most important result of the thesis is a demonstration of high accuracy and real-time computation for human positioning in indoor environments. These evaluations were carried out in several indoor environments.
ACKNOWLEDGEMENTS
I am honored to be here for the second time, at one of the finest universities in Vietnam, to write these grateful words to the people who have supported and guided me from the very first moment, when I was an undergraduate student, until now, as I write my master's thesis.

I am grateful to my supervisor, Dr. Vu Hai, whose expertise, understanding, generous guidance, and support made it possible for me to work on research topics that were of great interest to me. It has been a pleasure to work with him.

I would like to give special thanks to Dr. Le Thi Lan, Dr. Tran Thi Thanh Hai, and all the members of the Computer Vision Department, MICA Institute, for their sharp comments and guidance, which helped me greatly in learning how to study and do research the right way, and for the valuable advice and encouragement they gave me during my thesis. In particular, I would like to express my appreciation to Ph.D. student Pham Thi Thanh Thuy, who allowed me to use a valuable database of human tracking in a surveillance camera network. Without her permission, I could not have made extensive evaluations of the proposed method.

Finally, I would especially like to thank my family and friends for the continued love and support they have given me throughout my life, helping me through all the frustration, struggle, and confusion. Thanks for everything that helped me get to this day.
Hanoi, 10/2016
Nguyen Van Giap
CONTENTS
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
Chapter 1: INTRODUCTION
1.1 Context and Motivation
1.2 Objectives
1.3 Vision-based localization and main contributions
1.4 Scope and limitations of the research
1.5 Thesis Outline
Chapter 2: RELATED WORK
2.1 A brief survey on localization techniques
2.2 A brief survey on vision-based localization systems
Chapter 3: PROPOSED FRAMEWORK
3.1 Formulating the vision-based localization
3.2 Background subtraction
3.3 Post-processing procedure
3.4 Shadow removal techniques
3.4.1 Chromaticity-based feature extraction
3.4.2 Shadow-matching score utilizing physical properties
3.4.3 Existing issues after applying shadow removal
3.5 Localization estimation using regression
3.5.1 Linear regression
3.5.2 Definition of Gaussian processes
3.5.3 Regression model
3.5.4 Estimating height with the regression model
3.5.5 Outlier removal
3.6 Object tracking
Chapter 4: EXPERIMENTAL EVALUATIONS
4.1 Experimental setup
4.2 Evaluation results of the BGS and shadow removal
4.3 Evaluation results of the Gaussian process regression
4.3.1 Evaluation results with GP
4.3.2 Evaluation of suitable methods
4.3.3 Discussion
4.4 The final evaluation results of the proposed system
Chapter 5: CONCLUSION AND FUTURE WORK
5.1 Conclusion
5.2 Future work
PUBLICATION
BIBLIOGRAPHY
APPENDIX
LIST OF FIGURES
Figure 1.1: Indoor localization techniques
Figure 1.2: Surveillance camera network
Figure 1.3: Positioning a human from a video stream
Figure 1.4: Casting shadow problem: (a) original image; (b)-(c) cast shadows; (d) mask of the object; (e) shadow pixels
Figure 1.5: Some examples of the experimental environments: (a) hallway; (b) lobby; and (c) in a room
Figure 2.1: The flow chart of a common vision-based localization technique
Figure 2.2: Kinds of object shadows ([9])
Figure 2.3: An illustration of the human tracking results in [9]
Figure 2.4: Different tracking approaches
Figure 3.1: Foot-point definition
Figure 3.2: Transformation of a 2-D point to a 3-D point in the real world
Figure 3.3: Calibration procedure. Top row: the original images; bottom row: the corner points detected for calculating the homographic matrix
Figure 3.4: A wrong tracked-point detection
Figure 3.5: The general flow chart of the proposed method
Figure 3.6: Result of BGS with an adaptive Gaussian mixture
Figure 3.7: Widespread object shadow: (a) original situation; (b) mask situation
Figure 3.8: An illustration of wrong object detection results due to shadow appearances
Figure 3.9: Noise caused by illumination changes
Figure 3.10: Results of preprocessing
Figure 3.11: Example using chromaticity-based features for shadow removal
Figure 3.12: (a) Physical shadow model and examples of (b) the original image, (c) log-scale of the _(p) property, (d) log-log-scale of the _(p) property
Figure 3.13: Problems of shadow removal
Figure 3.14: Graphical model of the Gaussian regression
Figure 3.15: Position and height of the object
Figure 3.16: Position and height of the object
Figure 3.17: Ground-truth dataset to train the GP model
Figure 3.18: Detected height of the object (H_det)
Figure 3.19: Estimated height of the object (H_est)
Figure 3.20: Object height tracking results with the GP model, scenario 1
Figure 3.21: Object height tracking results, scenario 2
Figure 3.22: Outliers of the tracked object and the outlier removal result
Figure 3.23: Consensus between H_det and H_est
Figure 3.24: Result of Kalman filter using
Figure 3.25: Tracking results with and without processing
Figure 4.1: Testing environment
Figure 4.2: Processing dataset
Figure 4.3: Some frames of scenario 1
Figure 4.4: Some frames of scenario 2
Figure 4.5: Low-quality tracking results
Figure 4.6: Mapping moving-object results with BGS and shadow removal
Figure 4.7: Comparing BGS tracking H_det with H_gt
Figure 4.9: Comparing BGS and shadow removal tracking H_det with H_gt
Figure 4.10: Results with BGS-shadow removal and GP
Figure 4.11: Results with application t
Figure 4.12: Calculation in scenario 2
Figure 4.13: Result with scenario 2
Figure 4.14: Result with scenario 1
Figure 4.15: Values of scenario 1
LIST OF TABLES
Table 2.1: Some localization techniques (adapted from [1])
Table 4.2: Evaluation of tracking results with BGS-shadow removal
Table 4.3: Evaluation of tracking results with the BGS-shadow-GP system
Table 4.4: Testing correlation between position and height of the moving object
Table 4.5: Relation between position and height of the moving object
Table 4.6: ANOVA analysis of the ground-truth dataset
Table 4.7: Coefficients of independent and dependent valuation
Table 4.8: Comparison of t values and gain-loss valuation of positioning tracking
Table 4.9: Evaluation of scenario 1 without outlier removal
Table 4.10: Evaluation of scenario 1 with outlier removal
Table 4.11: Comparison of final results and evaluation
ABBREVIATIONS

RFID Radio Frequency Identification
HSI Hue, Saturation, and Intensity
H_est Estimated height of object
H_det Detected height of object
H_gt Ground truth height of object
Chapter 1: INTRODUCTION

1.1 Context and Motivation
Computer vision is the field of science and technology in which machines gain high-level understanding (that is, the meaning of what they see) using vision/camera sensors.

As a scientific discipline, computer vision is concerned with the theory for building artificial systems that obtain information from a single image or a sequence of images. By analyzing such image data, a wide variety of computer vision (or machine vision) applications have been deployed: navigation for mobile robots and autonomous vehicles, surveillance systems in both public and private environments, and diagnostic assistance using medical imaging, for instance. Deploying a vision machine may involve many related fields, such as artificial intelligence, computer graphics, optimization, and environment modeling. In an intelligent surveillance system, the views from single or multiple surveillance cameras usually underlie a series of vision and pattern recognition techniques: locating humans (detection), extracting motion trajectories (tracking), and human identification (re-identification). Consequently, there are many relevant applications, such as homeland security, crime prevention, traffic control, accident prediction and detection, and monitoring patients, the elderly, and children at home. HCI applications also require position information, for example to determine the occupancy of an area in order to turn off the lighting, or to localize a moving object and trigger a camera to record subsequent events. These applications always require extracting positioning information from the video streams collected by surveillance cameras.
A vision-based localization system (particularly in indoor environments) is one of the important solutions among localization technologies. There are many techniques that could be deployed for indoor localization, such as Bluetooth, Wi-Fi, LIDAR, GPS, and vision/camera techniques, as shown in Figure 1.1. Compared with these, a vision-based localization system is more natural for human beings and provides many extracted informative features at a given time; vision-based localization therefore shows significant advantages. In addition, vision sensors are becoming cheaper and cheaper, while security systems in public areas are becoming more and more important. These trends open more opportunities for developing intelligent vision-based localization technology.
Figure 1.1: Indoor localization techniques
This theme has been a wide and active research topic in the field of computer vision, and many vision-based localization techniques have been studied extensively. In intelligent surveillance systems, vision-based localization consists of many components, each of which critically impacts the whole system's performance. Among these components, detecting and tracking moving humans is the main target of these systems. In this thesis, we propose a vision-based indoor localization service for intelligent buildings to meet the above requirements. Vision-based localization can be performed in two different ways: with fixed camera systems or with mobile camera systems. While mobile camera systems are often carried by mobile robots (or used in navigation services supporting visually impaired people), this thesis focuses on developing localization services for fixed camera systems, as shown in Figure 1.2.
Figure 1.2: Surveillance camera network
The system extracts the human from the video stream and then transforms the 2-D position into 3-D real-world coordinates, as shown in Figure 1.3. However, to extract a correct 2-D position from the video stream, vision-based techniques must contend with many factors, such as lighting conditions, shadows, object occlusion, complicated backgrounds, and so on. To this end, the thesis deals with some critical issues, namely detecting and tracking moving objects with a common surveillance camera, with the objectives described below.
Figure 1.3: Positioning a human from video stream
1.2 Objectives
The thesis aims at researching and developing solutions for person localization in real-time camera-based surveillance systems in indoor environments. Using vision-based techniques for detecting and tracking moving humans still involves several problems: complex backgrounds, noise, occlusions, lighting conditions, cast shadows, and the quality of the image/video. Figure 1.4 shows some of these problems. Most of them degrade our detection and tracking results, since there is no perfect human detector. Therefore, we pay much attention to improving the quality of person localization in indoor environments.
Figure 1.4: Casting shadow problem: (a) original image; (b)-(c) cast shadows; (d) mask of the object; (e) shadow pixels
To this end, our objectives are detailed as below:
- Researching and utilizing techniques for human detection and tracking, such as background subtraction, shadow removal, Kalman filtering, and Gaussian processes, and proposing a suitable framework.
- Training and developing a prediction function with the Gaussian process method. This sub-task develops solutions for common human detectors, which always face issues such as object shadows or outliers among the tracked/detected points.
- Converting 2-D image points to 3-D world points through a calibration procedure.

The proposed techniques are developed in C++ using the OpenCV 2.4.9 library.
1.3 Vision-based localization and main contributions
As shown in Figure 1.3, given an image sequence or a video stream collected from a surveillance camera, we formulate vision-based localization as follows: a foot point p(x, y) extracted from the human on the 2-D image is transformed to a point P(x, y, z) on the 3-D floor in the environment. In this application we set z = 0, based on the assumption that humans stand on the floor and walk on a planar floor. While the transformation from 2-D to 3-D is implemented by a calibration procedure, extracting the 2-D human foot point is the more complicated step. We have faced several problems when localizing objects in practice:
 Shadows of objects
 Noise caused by illumination changes
 Occlusion by other objects
 Background noise from the environment (lighting, shaking branches, etc.)
 Quality of the image/video
To obtain the 2-D point from the image sequence captured by a surveillance camera, we first perform background subtraction to separate the foreground from the background. The foreground results may contain artifacts, so we prune the background subtraction results in order to obtain a precise 2-D position. After applying the post-processing procedure, we feed the 2-D point into a tracking module using a Kalman filter. To eliminate outliers, we additionally apply a learning procedure that estimates the corresponding human height from the detected position; based on the constraint between these two observations, we can eliminate outliers. The main contribution of the thesis is therefore a processing framework that improves the detection and tracking of a moving person in an indoor environment. The experimental results report that the accuracy of the localization increases by nearly 30% compared with results from common approaches.
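The consensus check between the detected height and the regressed height can be sketched as follows. This is a minimal illustration of the idea only: the function names and the relative tolerance are assumptions for the sketch, not the thesis implementation.

```cpp
#include <cmath>
#include <vector>

// Reject a detection when the height measured from the bounding box (hDet)
// disagrees with the height predicted by the regression model at that floor
// position (hEst) by more than a relative tolerance.
bool isOutlier(double hDet, double hEst, double relTol = 0.2) {
    if (hEst <= 0.0) return true;   // no valid prediction: distrust the detection
    return std::fabs(hDet - hEst) / hEst > relTol;
}

// Keep only the indices of detections whose two height observations agree.
std::vector<int> filterDetections(const std::vector<double>& hDet,
                                  const std::vector<double>& hEst) {
    std::vector<int> kept;
    for (std::size_t i = 0; i < hDet.size(); ++i)
        if (!isOutlier(hDet[i], hEst[i])) kept.push_back(static_cast<int>(i));
    return kept;
}
```

A detection whose box is stretched by an attached shadow yields an hDet far above the expected human height at that position and is dropped before the tracker sees it.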
1.4 Scope and limitations on the research
We utilize a fixed camera network: stationary CCD cameras that capture frames at a normal frame rate (from 15 to 30 fps) with an image resolution of 640 × 480 pixels. Because the proposed technique can be deployed similarly for every camera in the network, in most figures and demonstrations we show results from only one fixed camera.
The environments for implementing and evaluating the proposed techniques are limited as below:
 It is an indoor environment with space constraints: a single floor with scenarios in a hallway, a lobby, and a room. The space is one in which a person is continuously observed by a camera.
Furniture and other objects are in an office building
Illumination conditions: Both natural and artificial lighting sources are considered
o Natural lighting source changing within a day (in the morning, noon, afternoon)
o Artificial lighting sources are stable in a room
We show some images collected in the lobby, hallway, and room areas in Figure 1.5.
Figure 1.5: Some examples of the experimental environments: a) hallway; b) lobby; and c) in a room
1.5 Thesis Outline

Chapter 2: Related work
In this chapter, we present state-of-the-art research on object localization based on computer vision techniques. In addition, we report some relevant works that will be deployed and extended in our work.
Chapter 3: Proposed framework
In this chapter, we propose a framework for localization. The framework consists of relevant and improved techniques to achieve high localization accuracy for moving objects in images/video.
Chapter 4: Experimental Evaluations
After proposing the framework, we develop it and evaluate its performance. The experimental results are shown in Chapter 4, where they are also compared with previous results.
Chapter 5: Conclusion and future work
Following the experimental results, we conclude this work and discuss some limitations. We also plan the next research steps to improve the quality of localization with computer vision technology.
Chapter 2: RELATED WORK
Localization systems can be based on non-visual or visual information and need to obtain data from the navigated environment through sensors. In this chapter, we first introduce some non-visual sensor-based localization methods that might be used alongside, or compared with, vision-based solutions. We then focus on vision-based systems using a surveillance camera. Because vision-based localization spreads over several topics of computer vision, we divide it into smaller related research topics: background subtraction, shadow removal, and object tracking.
2.1 A brief survey on localization techniques
There are several methods for implementing an indoor positioning system. The most common include vision, infrared, ultrasound, Wireless Local Area Network (WLAN), RFID, and Bluetooth. Table 2.1 shows the comparative performance of these techniques [1]:
Table 2.1: Some localization techniques (Adapted from [1])
                          Wi-Fi         Bluetooth       UWB             Hybrid S./I. GPS  RFID/NFC   Scanning/QR
Scale                     according to enclosure
Necessary infrastructure  WLAN router   Specific trans. Specific trans. Specific trans.   NFC tag    QR tag
Energy spending           Medium        Medium          Medium          High              Low        Low
Cost                      High          High            High            High              Low        Low
RFID is a mature technique for indoor positioning systems; however, its anti-interference ability is poor, and the technique is more suitable for positioning goods. Bluetooth techniques are less stable in complex environments and easily disturbed by signal noise. Ultrasonic technology uses ultrasonic waves to measure the distance between a fixed station and the mobile target to be localized. These methods need equipment to be set up in the monitored environment, such as the techniques proposed in [2] and [3], which require multiple nodes for locating. Compared with these techniques, although the computational cost of vision-based methods is higher, the popularity of monitoring systems using surveillance cameras opens new opportunities for localization services; localization can be considered an added-value service in the surveillance camera network. Moreover, computational issues have recently been overcome thanks to high-power computing systems.
2.2 A brief survey on vision-based localization systems
For a vision-based system, the related works share a common framework, as shown in Figure 2.1. The framework consists of the following main components: moving object detection, pruning of the results, object tracking, and projecting the object into real-world coordinates.
Figure 2.1 The flow chart of a common vision-based localization technique
Different approaches yield results of different effectiveness. Almost all vision-based localization systems deal with moving-object localization in the first step, which is the main target. To do this, moving objects are extracted using basic algorithms such as background subtraction [4], motion extraction based on frame differencing, optical flow, and so on. However, because the inherent ambiguity of ego-motion (camera motion) makes it difficult to estimate from vision data, and because the scene structure (e.g., depth variations) can be discarded, the quality of the target detection is not guaranteed.
The second step in the common framework aims at pruning the detection results. Target detection suffers from many artifacts, such as gradual and sudden illumination changes (e.g., clouds), camera oscillations, high-frequency background objects such as tree branches or sea waves, blending of moving objects, and noise coming from the camera (especially thermal cameras). Therefore, the detection results must be processed further to handle the noise issues stated above.
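One common (assumed) form of such pruning is removing small connected components from the binary foreground mask, so that isolated noise blobs do not reach the tracker. The helper below is an illustrative sketch over a row-major mask; the names and the area threshold are not from the thesis code.

```cpp
#include <vector>

// Drop connected foreground components (4-connectivity) whose pixel count is
// below minArea. mask holds 0/1 values for a w x h image, row-major.
void pruneSmallBlobs(std::vector<int>& mask, int w, int h, int minArea) {
    std::vector<int> visited(mask.size(), 0);
    for (int start = 0; start < w * h; ++start) {
        if (mask[start] != 1 || visited[start]) continue;
        // Flood fill one component, collecting its pixel indices.
        std::vector<int> stack{start}, comp;
        visited[start] = 1;
        while (!stack.empty()) {
            int p = stack.back(); stack.pop_back();
            comp.push_back(p);
            int x = p % w, y = p / w;
            const int nx[4] = {x - 1, x + 1, x, x};
            const int ny[4] = {y, y, y - 1, y + 1};
            for (int k = 0; k < 4; ++k) {
                if (nx[k] < 0 || nx[k] >= w || ny[k] < 0 || ny[k] >= h) continue;
                int q = ny[k] * w + nx[k];
                if (mask[q] == 1 && !visited[q]) { visited[q] = 1; stack.push_back(q); }
            }
        }
        if (static_cast<int>(comp.size()) < minArea)
            for (int p : comp) mask[p] = 0;   // erase the small blob
    }
}
```

In practice this plays the same role as a morphological opening: a person-sized blob survives, a few flickering pixels do not.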
The third part of vision-based localization is tracking the moving object. Object tracking has a wide variety of applications, such as smart video surveillance, traffic video monitoring, accident prediction and detection, motion-based recognition, human-computer interaction, and human behavior understanding. Object tracking in general is a very challenging task because of complex object shapes, cluttered backgrounds, loss of information caused by projecting the real 3-D world onto a 2-D scene, noise induced by the capturing device, illumination variations, occlusions, shadows, etc. Furthermore, when tracking in a cluttered environment, the moving-object blob may be occluded, so features cannot be extracted from it. In such cases, prediction tools such as the Kalman filter or the particle filter are used to estimate the object location. A lot of work has been carried out on object tracking, and these approaches can be classified as region-based, feature-based, model-based, and hybrid [6].
In this thesis, we focus on studying and improving vision-based localization with a stationary camera. The main targets of vision-based localization are accurately detecting the object and tracking it over time. First, to extract the target object, we utilize background subtraction; then, to prune the detection results, we utilize a shadow removal technique. For tracking the human, we utilize a Kalman filter. The related works for these techniques are presented in the subsections below.
Background Subtraction Techniques:
Background subtraction is a widely used approach for detecting moving objects in videos from static cameras. Many background subtraction techniques have been proposed, such as the running Gaussian average, the temporal median filter, mixtures of Gaussians, and eigenbackgrounds. The rationale of the approach is to detect moving objects from the difference between the current frame and a reference frame, often called the "background image" or "background model". Basically, the background image must be a representation of the scene with no moving objects and must be kept regularly updated so as to adapt to varying lighting conditions and geometry settings. More complex models have extended the concept of "background subtraction" beyond its literal meaning. Several methods for performing background subtraction have been proposed in the recent literature; all of them try to effectively estimate the background model from the temporal sequence of frames. All approaches aim at real-time performance, however, because background subtraction is often the first step of a series of subsequent techniques deployed in different applications.
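The "background image" idea above can be sketched with a per-pixel running average; note this is a deliberate simplification (the thesis itself uses an adaptive Gaussian mixture), and the learning rate and threshold below are illustrative values.

```cpp
#include <cstdlib>
#include <vector>

// Minimal per-pixel running-average background model over grayscale frames
// (pixel values 0..255), stored as flat vectors.
class RunningAverageBG {
public:
    RunningAverageBG(std::size_t nPixels, double alpha = 0.05, int threshold = 30)
        : bg_(nPixels, 0.0), alpha_(alpha), threshold_(threshold), init_(false) {}

    // Returns a foreground mask (1 = moving pixel) and updates the model.
    std::vector<int> apply(const std::vector<int>& frame) {
        std::vector<int> mask(frame.size(), 0);
        if (!init_) {                       // the first frame initializes the model
            for (std::size_t i = 0; i < frame.size(); ++i) bg_[i] = frame[i];
            init_ = true;
            return mask;
        }
        for (std::size_t i = 0; i < frame.size(); ++i) {
            if (std::abs(frame[i] - static_cast<int>(bg_[i])) > threshold_)
                mask[i] = 1;                // differs from the background: foreground
            // Slowly adapt the model to illumination changes.
            bg_[i] = (1.0 - alpha_) * bg_[i] + alpha_ * frame[i];
        }
        return mask;
    }
private:
    std::vector<double> bg_;
    double alpha_;
    int threshold_;
    bool init_;
};
```

The small alpha keeps the model stable against moving people while still absorbing gradual lighting drift, which is exactly the update requirement stated above.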
Human detection technique:
Given the input frames, human detection is executed to determine the image regions that contain the targets. Although the performance of human detectors has improved tremendously in recent years, detecting partially occluded people, or detecting in dynamic or cluttered backgrounds, remains a weakness of current approaches. There are two main approaches to human detection. The first detects moving objects and considers them to be people; it is called motion-based detection. In the second, people are detected by applying a human classifier. When detecting humans with a fixed camera network, background subtraction techniques are the most popular choice for motion-based detection. For the classifier-based approach, the extracted features are modeled through statistical learning; some popular human models are built on image features such as Haar wavelets [8], Haar-like features [9], HOG, and shapelets. However, using a single feature for human detection is not as effective as fusing several of them. For classification, popular classifiers for human detection include SVM, AdaBoost, MPLBoost, linear SVM, and RBF-kernel SVM. Selecting image features together with an effective classifier is crucial; in general, HOG features with AdaBoost or a linear SVM give better human detection performance.
Shadow removal technique:
A shadow is created when direct light from a source of illumination is obstructed, either partially or totally, by an object. An area on which less light energy falls appears as a shadow region, whereas an area receiving more light energy is a non-shadow region [11]. There are two types of shadow, self shadow and cast shadow, as shown in Figure 2.2: a self shadow lies on the object itself, while a cast shadow is projected onto other surfaces [9]. Cast and self shadows have different brightness values. The brightness of all shadows in an image depends on the reflectivity of the surface upon which they are cast, as well as on the illumination from secondary light sources. Self shadows usually have a higher brightness than cast shadows, since they receive more secondary lighting from surrounding illuminated objects.
Figure 2.2 Kinds of object shadows ([9])
As described in [11], shadow removal techniques can be categorized as follows:
- Model-based techniques: These have limited applicability and are applied only to specific problems (e.g., aerial images) and simple objects. They depend on prior information about the illumination conditions and scene geometry, as well as about the object, which turns out to be a major drawback.
- Image-based techniques: These use certain shadow properties in the image, such as color/intensity, shadow structure, and boundaries. If any such information is available, it can be used to improve detection performance.
- Color/spectrum-based shadow detection: The color/spectrum model attempts to describe the color change of a shaded pixel and to find a color feature that is illumination invariant. Shadows are then discriminated from foreground objects by using empirical thresholds in the HSV color space.
- Texture-based shadow detection: The principle behind the textural model is that the texture of foreground objects differs from that of the background, while the texture of a shaded area remains the same as that of the background. Several such techniques have been developed to detect moving cast shadows in normal indoor environments.
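The color/spectrum category above can be illustrated with a classic HSV test: a foreground pixel is labeled shadow when its hue and saturation stay close to the background's while its brightness drops by a bounded ratio. The threshold values here are assumptions for the sketch, not tuned ones.

```cpp
#include <cmath>

struct Hsv { double h, s, v; };   // h in [0, 360), s and v in [0, 1]

// Shadow test on one foreground pixel against the background model pixel.
bool isShadowPixel(const Hsv& fg, const Hsv& bg,
                   double betaLow = 0.4, double betaHigh = 0.9,
                   double tauS = 0.15, double tauH = 50.0) {
    if (bg.v <= 0.0) return false;
    double ratio = fg.v / bg.v;                    // brightness attenuation
    double dh = std::fabs(fg.h - bg.h);
    if (dh > 180.0) dh = 360.0 - dh;               // hue is circular
    return ratio >= betaLow && ratio <= betaHigh   // darker, but not too dark
        && (fg.s - bg.s) <= tauS                   // saturation not increased
        && dh <= tauH;                             // hue roughly preserved
}
```

The lower bound betaLow matters: a genuinely dark object attenuates brightness more than a cast shadow does, so it fails the ratio test and is kept as foreground.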
Human tracking:
Tracking is an important task in computer vision, with many applications in surveillance, scene monitoring, navigation, sports scene analysis, and video database management. The objective of object tracking is to link the tracked object across consecutive frames, as shown in the example in Figure 2.3. Although it has been studied for many years, object tracking remains an open research problem today. The linking can be very difficult when subjects move fast relative to the video frame rate. In addition, tracking can be complicated for several reasons: interference in the image; full or partial occlusion of the object; changing lighting conditions; and complex object shapes. Another situation that increases the complexity of the problem is a person who constantly changes direction. For these cases, tracking systems often use a dynamic model that describes how the object can move, in order to account for the object's various movements.
Figure 2.3: An illustration of the human tracking results in [9]
In the vision community, there are various tracking approaches, as shown in Fig. 2.4.
Figure 2.4: Different tracking approaches
- Point tracking (Fig.2.4a): The detected objects are represented by points, and the tracking of these points is based on the previous object states which can include object positions and motion
- Appearance tracking (Fig. 2.4b): The object appearance can be, for example, a rectangular template or an elliptical shape with an associated RGB color histogram. Objects are tracked by considering the coherence of their appearances in consecutive frames. This motion is usually in the form of a parametric transformation such as a translation, a rotation, or an affinity.
- Silhouette tracking (Fig. 2.4c-d): The tracking is performed by estimating the object region in each frame. Silhouette tracking methods use the information encoded inside the object region. This information can be in the form of appearance density and shape models, which are usually in the form of edge maps. Given the object models, silhouettes are tracked by either shape matching or contour evolution.
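The point-tracking scheme used later in the thesis relies on a Kalman filter; its predict/correct cycle can be sketched with a fixed-gain (alpha-beta) filter, a simplification in which the gains are held constant instead of being derived from covariances. The structure and gain values below are illustrative assumptions.

```cpp
// Constant-velocity point track: predict the point from its last position and
// velocity, then correct with the new detection.
struct Track {
    double x, y;    // position
    double vx, vy;  // velocity (per frame)
};

// Prediction step: advance the state by one frame.
void predict(Track& t) {
    t.x += t.vx;
    t.y += t.vy;
}

// Correction step: blend the measured point (mx, my) into the state.
void correct(Track& t, double mx, double my,
             double alpha = 0.85, double beta = 0.5) {
    double rx = mx - t.x, ry = my - t.y;   // innovation (residual)
    t.x += alpha * rx;  t.y += alpha * ry;
    t.vx += beta * rx;  t.vy += beta * ry;
}
```

When a detection is missing for a frame (e.g., occlusion), only predict() runs, which is exactly how such filters bridge short gaps in the measurements.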
Chapter 3: PROPOSED FRAMEWORK

3.1 Formulating the vision-based localization
We assume that a moving subject is separated from the image sequence/video stream of a surveillance camera. Based on the detection results, the Region-Of-Interest (RoI) and a related point (e.g., FootPoints, human centers) are extracted in each frame, as shown in Fig 3.1. For example, a FootPoint is a 2-D point (x, y) in image coordinates where the human foot touches the floor plane.
Figure 3.1: Foot-point definition
Given a FootPoint P(x, y), the corresponding position in 3-D world coordinates is calculated by a transformation T, as defined below:
Figure 3.2: Transformation of a 2-D image point to a 3-D point in the real world
To identify the homography matrix H, a calibration procedure is set up. The matrix H maps a homogeneous image point x to its corresponding point x' on the world plane:

x' = Hx,  H ∈ R^(3×3)
In this thesis, we do not describe the calibration procedure in detail. The procedure is set up by collecting chessboard images, as shown in Fig 3.3. Because a fixed camera is utilized, the parameters of the transformation can be reused at different times.
Figure 3.3: Calibration procedure. Top row: the original images; Bottom row: the corner points detected for calculating the homography matrix
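For illustration, applying the homography x' = Hx to a FootPoint can be sketched in plain Python as below; the matrix values here are hypothetical (a pure pixel-to-metre scaling), not the calibrated ones:

```python
def apply_homography(H, point):
    """Map an image FootPoint (x, y) to floor-plane coordinates via
    x' = H x in homogeneous form, then normalize by the third component."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    wh = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / wh, yh / wh)

# Hypothetical homography: a pure scaling from pixels to metres.
H = [[0.01, 0.0, 0.0],
     [0.0, 0.01, 0.0],
     [0.0,  0.0, 1.0]]
print(apply_homography(H, (320, 240)))  # approximately (3.2, 2.4)
```

In practice H would be estimated from the chessboard correspondences (e.g., with a routine such as OpenCV's `findHomography`), and the normalization by the third homogeneous component handles perspective, not just scaling.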
The most important factor affecting the accuracy of the localization service is the point detected from the image sequence. The better the extracted RoIs and points, the more accurate the tracking and localization phases are. An example illustrating the effect of the tracked-point detections is shown in Fig 3.4 below:
Figure 3.4: A wrong tracked-point detection
In this example, the detected FootPoint (based on the red rectangle) is far from the correct one (marked by the yellow box). As a consequence, the corresponding 3-D position in real-world coordinates is wrongly estimated. We know that most detectors are not always perfect. The main reason is that common techniques such as background subtraction, human detection, and tracking always suffer from environmental artifacts (e.g., shadows), object occlusions, or lighting conditions. Therefore, to achieve highly accurate localization results, a detector needs to handle the following problems:
- Shadows of objects
- Noise caused by illumination changes
- Occlusion by other objects
- Background noise from the environment (light, shaking branches, etc.)
Therefore, in this work, we focus on increasing the quality of the foot-point detection through several pruning procedures. The general flow chart of the proposed system is shown in Figure 3.5 below:
Figure 3.5: The general flow chart of the proposed method
To solve this, the proposed technique has one major difference from Fig 3.5: the input of the object tracking procedure is pruned by a regression procedure. We eliminate outlier detection results based on a correlation evaluation, which infers how different the detection results are from the estimation results. A correlation that is too low means the detection results are not confident. Otherwise, a detection with high correlation is preserved, because such a position is a consensus of both observations: one from the detection results and one from the estimation results. The inliers are passed to an object tracking module. Because only inliers are utilized, the object tracking module can be a simple procedure. We show the effectiveness of the proposed techniques in the experimental results.
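A minimal sketch of this pruning idea, using a constant-velocity extrapolation as a stand-in for the regression/correlation evaluation (the actual procedure and threshold in this thesis may differ):

```python
def prune_outliers(track, detection, max_residual=30.0):
    """Accept a new FootPoint detection only if it agrees with a
    constant-velocity estimate extrapolated from the last two track points.

    track: list of (x, y) positions from previous frames (len >= 2).
    Returns True (inlier) or False (outlier to be discarded).
    """
    (x1, y1), (x2, y2) = track[-2], track[-1]
    pred = (2 * x2 - x1, 2 * y2 - y1)          # linear extrapolation
    dx = detection[0] - pred[0]
    dy = detection[1] - pred[1]
    return (dx * dx + dy * dy) ** 0.5 <= max_residual

track = [(100, 100), (110, 102)]
print(prune_outliers(track, (121, 104)))   # True: close to the prediction (120, 104)
print(prune_outliers(track, (300, 250)))   # False: likely a shadow/noise detection
```

Only detections that pass this check are fed to the tracking module, which is why a simple tracker suffices downstream.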
3.2 Background subtraction
In the project environment, the camera network is usually installed permanently, and environmental monitoring is limited to the building lobby and hallway. These are fixed environments, which minimizes environmental change. Some situations could still change, such as a door opening or objects changing position (pots, fire extinguishers, etc.). Therefore, our approach to detecting moving objects is based on background subtraction. The solution is to model the background and then compare this model with the current frame in order to extract the moving foreground. Each incoming video frame is compared with a reference background pattern (or patterns); the pixels in the current frame that deviate significantly from the background are considered to belong to moving objects. Several different background subtraction algorithms have been studied [10]. To obtain a trade-off between computational time and performance, we utilize the Adaptive Gaussian Mixture Model (GMM) algorithm for the background subtraction procedure. This technique was proposed by Stauffer and Grimson in 1999. The key idea is the observation that a single background model cannot handle continuous frames over a long time, due to lighting changes, repeated motions, and clutter in the actual scene. Instead, a mixture of Gaussians is used to model each pixel. Following that argument, we implement and integrate this approach into the surveillance system. Figure 3.6 illustrates some background subtraction results using the Adaptive GMM.
(a) Result with Cam1
(b) Result with Cam2
Figure 3.6: Results of BGS with the adaptive Gaussian Mixture Model
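To illustrate the per-pixel modeling idea, the sketch below uses a single Gaussian per pixel instead of the full adaptive mixture of Stauffer and Grimson; it is a simplified toy, not the implementation used in this work (which relies on the adaptive GMM):

```python
def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """Simplified per-pixel background model in the spirit of the
    adaptive GMM (one Gaussian per pixel instead of a mixture).

    mean, var, frame: equal-length lists of grayscale values.
    Returns (new_mean, new_var, mask) where mask[i] = 1 marks foreground.
    """
    new_mean, new_var, mask = [], [], []
    for m, v, x in zip(mean, var, frame):
        if abs(x - m) <= k * (v ** 0.5):        # pixel matches the model
            mask.append(0)
            new_mean.append((1 - alpha) * m + alpha * x)
            new_var.append((1 - alpha) * v + alpha * (x - m) ** 2)
        else:                                   # foreground pixel
            mask.append(1)
            new_mean.append(m)                  # keep the background unchanged
            new_var.append(v)
    return new_mean, new_var, mask

mean = [100.0, 100.0, 100.0]
var = [16.0, 16.0, 16.0]
frame = [102.0, 99.0, 200.0]                    # third pixel: moving object
_, _, mask = update_background(mean, var, frame)
print(mask)  # [0, 0, 1]
```

The full GMM keeps several weighted Gaussians per pixel so that repetitive background motion (e.g., shaking branches) is also absorbed into the model; in practice one would use a library routine such as OpenCV's `BackgroundSubtractorMOG2`.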
Background subtraction is a basic and simple method to detect moving objects. Consequently, there are some critical issues with the BGS results: object shadows remain, and there is a lot of small noise. Figure 3.7 and Figure 3.8 show some examples. In particular, shadows of moving objects, which spread over surfaces where the light is obscured (corners, the junction of two surfaces), are often larger than the object itself. These problems directly affect the quality of moving-object detection and tracking. Therefore, we suggest solutions to these problems in the following steps: removing noise and removing the shadows of the objects.
Figure 3.7: Widespread object shadow: a) Original situation; b) Mask situation
a) Appearance of a large shadow b) Wrong object detection result
Figure 3.8: An illustration of the wrong object detection results due to shadow appearances
3.3 Post-processing procedure
Illumination change is the cause of many noise artifacts. We apply a series of morphological operators and filtering techniques to remove them. The proposed procedures are explained below:
a1) Original image (t) b1) Original image (t+i)
Figure 3.9: Noise caused by illumination changes
Rescaling and thresholding to remove the small noise/artifacts:
o Downscale and upscale;
o Selection of a suitable threshold.
Applying median filters:
o Applying filters and rescaling to eliminate noise on the original mask image.
Finding the largest blob.
Results of the post-processing are shown in Fig 3.10.
Figure 3.10: Results of the post-processing procedure
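The final step above, keeping only the largest blob, can be sketched as follows (a toy pure-Python version on a binary mask; a real implementation would use a library routine such as connected-components labeling):

```python
from collections import deque

def largest_blob(mask):
    """Keep only the largest 4-connected blob of 1-pixels in a binary
    mask (list of lists), mimicking the last post-processing step."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    best_label, best_size, next_label = 0, 0, 1
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] == 1 and labels[sy][sx] == 0:
                # Flood-fill this blob and measure its size.
                size, q = 0, deque([(sy, sx)])
                labels[sy][sx] = next_label
                while q:
                    y, x = q.popleft()
                    size += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and \
                           mask[ny][nx] == 1 and labels[ny][nx] == 0:
                            labels[ny][nx] = next_label
                            q.append((ny, nx))
                if size > best_size:
                    best_size, best_label = size, next_label
                next_label += 1
    return [[1 if best_label and labels[y][x] == best_label else 0
             for x in range(w)] for y in range(h)]

mask = [[1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
print(largest_blob(mask))
# The isolated top-left pixel (noise) is removed; the 2x2 blob remains.
```

Keeping the single largest component is a reasonable heuristic here because each camera view is assumed to contain one dominant moving subject.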
However, after applying the post-processing procedure, some problems remain: object shadows still exist, and the object blobs do not fit the objects tightly, giving low-accuracy detection. In the next step, we therefore prune the detection result by applying a shadow removal technique.
3.4 Shadow removal techniques
In this work, we utilize a density-based score fusion scheme with a feature-based approach to remove shadow regions. This technique was proposed in a related work [26]. To make the thesis more self-contained, we explain the shadow removal technique as follows. In [26], two different types of features are extracted from the examined shadow region: chromaticity-based and physical features. Two likelihoods, or shadow-matching scores, are calculated from the corresponding features, and a likelihood ratio (shadow score per non-shadow score) is computed. The probabilities of shadow and non-shadow are estimated by approximating the distributions of the shadow-matching scores using GMMs.
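A simplified sketch of this score-fusion idea: here each class-conditional density is a single Gaussian with hypothetical parameters, whereas [26] fits full GMMs to the shadow-matching scores:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def shadow_likelihood_ratio(chrom_score, phys_score, shadow_params, nonshadow_params):
    """Likelihood ratio L = p(scores | shadow) / p(scores | non-shadow),
    with each class-conditional density approximated here by a single
    Gaussian per score (simplification of the GMMs fitted in [26]).
    A region is labelled shadow when L > 1."""
    p_sh = (gaussian_pdf(chrom_score, *shadow_params[0]) *
            gaussian_pdf(phys_score, *shadow_params[1]))
    p_ns = (gaussian_pdf(chrom_score, *nonshadow_params[0]) *
            gaussian_pdf(phys_score, *nonshadow_params[1]))
    return p_sh / p_ns

# Hypothetical (mu, sigma) pairs for the two scores under each class.
shadow_params = [(0.8, 0.1), (0.7, 0.15)]
nonshadow_params = [(0.3, 0.2), (0.2, 0.2)]
L = shadow_likelihood_ratio(0.75, 0.65, shadow_params, nonshadow_params)
print(L > 1)  # scores near the shadow means -> labelled shadow
```

The two scores are treated as independent in this sketch; the actual densities and the decision threshold would come from training data as in [26].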
3.4.1 Chromaticity-based feature extraction
Chromaticity features have been chosen in various shadow detection techniques, such as [20], because shadow regions tend to have lower light intensity than their neighboring areas. Therefore, we first convert the input RGB image into the HSV color space, which separates the chromaticity and luminosity channels. In the HSV color space, a shadow cast on the background does not change its hue (H), and shadow pixels often have lower saturation (S). Following this observation, we then calculate the difference of hue and saturation between the foreground (F) and background (B) at a shadow pixel p, as below: