Master's thesis: People counting using detection and tracking techniques for smart video surveillance


People Counting Using Detection And Tracking

Techniques For Smart Video Surveillance

Ha Thi Oanh

Ha Noi University of Science and Technology

Supervisor

Assoc Prof Tran Thi Thanh Hai

In partial fulfillment of the requirements for the degree of

Master of Computer Science

April 20, 2023

Dr. Doan Thi Huong Giang, who directly supported me.

The master's thesis is within the framework of the ministerial-level scientific research project "Research and development of an automatic system for assessing learning activities in class based on image processing technology and artificial intelligence", code CT2020.02.BKA.02, led by Assoc. Prof. Dr. Le Thi Lan. The student sincerely thanks the project.

Finally, I wish to show my appreciation to all my friends and family members who helped me finalize the project.

Abstract

Video or image-based people counting in real time has multiple applications in intelligent transportation, density estimation, class management, and so on. Although this problem has been widely studied, it still faces some main challenges due to crowded scenes and occlusion. In a common approach, this problem is carried out by detecting people using conventional detectors. However, this approach can fail when people stay in various postures or are occluded by each other. We notice that even when a main part of the human body is occluded, the face and head are still observable. In addition, a person may not be detected at a given frame but may be recovered at the previous or the next frames.

In this thesis, we attempt to improve the people counting result by building on these observations. We first deploy two detectors (Yolo and RetinaFace) for detecting heads and faces of people in the scene. We then develop a pairing technique that aligns the face and the head of each person. This alignment helps to recover the missed detection of head or face and thus increases the true positive rate. To overcome the missed detection of both face and head at a certain frame, we apply a tracking technique (i.e. SORT) on the combined detection result. Putting all of these techniques in a unified framework helps to increase the true positive rate from 90.36% to 96.21% on the ClassHead Part 2 dataset.

Contents

List of Acronyms

1 Introduction
1.2 Scientific and practical significance
1.2.1 Scientific significance
1.2.2 Practical significance
1.2.3 Challenges and Motivation
1.3 Objectives and Contributions
1.3.2 Contributions

2 Related works
2.1 Detection based people counting
2.1.1 Face detection based people counting
2.1.2 Head detection based on people counting
2.1.3 Hybrid detection based on people counting
2.2 Density estimation based people counting
2.3.1 Overview of object tracking
2.3.2 Multiple Object Tracking
2.3.3 Tracking techniques

3 Proposed method for people counting
3.1 The proposed people counting framework
3.2 Yolo-based head detection
3.4 Combination of head and face detection
3.4.1 Linear sum assignment problem
3.4.2 Head-face pairing cost
3.5 Person tracking

4 Experiments
4.1 Dataset and Evaluation Metrics
4.1.1 Our collected dataset: ClassHead
4.1.1.1 ClassHead Part 1
4.1.1.2 ClassHead Part 2
4.1.2 Hollywood Heads dataset
4.1.4 Wider Face dataset
4.1.5 Evaluation metrics
4.1.5.1 Intersection over Union (IoU)
4.1.5.3 Precision and Recall
4.2.3 Evaluation on Wider Face dataset
4.2.4 Evaluation on ClassHead Part 2 dataset

5 Conclusions
5.1 Conclusion
5.2 Future Works

References

List of Figures

Illustration of the input and output of people counting from an image
Some challenges in crowd counting [1]
Framework for people counting based on face detection and tracking
System framework for depth-assisted face detection and association for people counting
System framework for a people counting method based on head detection and tracking
Network structure of Double Anchor R-CNN
Architecture of JointDet
Examples of people density estimation
Example of Multiple Object Tracking
Hungarian Algorithm
The tracking process of the SORT algorithm
Architecture of the proposed people counting and tracking system
Flow architecture of the proposed smart surveillance system
The proposed framework for people counting by pairing head and face detection and tracking
Automatic learning of bounding box anchors [4]
Activation functions used in Yolov5 (a) SiLU function (b) Sigmoid function [4]
Example for creating dataset.yaml
An overview of the single-stage dense face localisation approach. RetinaFace is designed based on the feature pyramids with independent context modules. Following the context modules, we calculate a multi-task
Organize dataset for Yolo training
Example of RetinaFace testing on Wider Face dataset
Flowchart of combining object detection and tracking to improve the
Camera layout in the simulated classroom and an image obtained from
Results of Hollywood Heads dataset. (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
MAE measurement results on 2 proposed methods in Casablanca Heads dataset
Results of Casablanca dataset. (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
Results of Wider Face dataset. (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
MultiDetect results in ClassHead Part_2. (a) Head detections, (b) Face detections, (c) MultiDetect
Head tracking method results in ClassHead Part 2 dataset. (a) Head detections at frame 1, (b) Head tracking at frame 100
MultiDetect with Track method results in ClassHead Part 2 dataset. (a) MultiDetect with Track at frame 1, (b) MultiDetect with Track at frame

List of Tables

Setup camera parameters for data collection
ClassHead Part 1 dataset for training and testing Head detector Yolov5
ClassHead Part 2 dataset
Results of the proposed method on the Hollywood Heads dataset
Results of the proposed method on the Casablanca dataset
Results of the proposed method on Wider Face dataset
Results of the head detection method in ClassHead Part_2 dataset
Results of the MultiDetect method in ClassHead Part_2 dataset
Results of the Head Tracking in ClassHead Part_2 dataset
Results of the MultiDetect with Track method in ClassHead Part 2 dataset
Experimental results in the ClassHead Part 2 dataset after using 4 methods

List of Acronyms

CNN Convolutional Neural Network

HOG Histogram of Oriented Gradients

LSTM Long short term memory

NN Neural Network

RNN Recurrent Neural Network

RPN Region Proposal Network

YOLO You Only Look Once

Chapter 1

Introduction

Recently, people counting in images or video has become an active research topic due to its wide range of applications, from public safety to intelligent crowd flow. Manual counting is impractical since it is a tedious and time-consuming task, particularly in crowded scenes. This chapter defines the problem of people counting and its challenges, and discusses the drawbacks of existing methods to motivate our work. We then clarify our objectives and contributions to this field. Finally, we describe the organization of the thesis.

People counting in crowds refers to the process of accurately counting the number of individuals present in a densely populated area or space. This is a challenging task due to the high density of people, occlusions, overlapping individuals, and the need to track people as they move through the crowd. People counting has been extensively studied in recent years, and it has numerous real-life applications, including event management, public safety, and transportation. For instance, it can be utilized to monitor crowd density and prevent overcrowding in public spaces, and to optimize and improve security at events and transportation hubs.

To capture people in crowds, sensors such as thermal imaging cameras, RGB cameras, and lasers may be used. RGB cameras are the most commonly utilized due to their low cost and popularity in almost all public spaces. From video data, computer vision techniques such as object detection and tracking, optical flow, and background subtraction can identify and track individuals in the crowd. The problem of people counting from an image is defined as follows:

• Input: An image or a frame from a video sequence.

• Output: The number of people and/or their locations in the frame/image.

Figure 1.1 depicts the input and output of a people counting algorithm. The algorithm stores the number of people detected and determines the bounding box of each individual. Depending on the context, location data may be crucial for further processing. However, in highly crowded scenes, obtaining an exact count of people may be impractical, and an estimation of the number of individuals is sufficient. In the next chapter, we will review some related works that provide an estimation of the people count with or without location and bounding box information.

Figure 1.1: Illustration of the input and output of people counting from an image

1.2 Scientific and practical significance

1.2.1 Scientific significance

People counting in crowds of humans has several scientific implications, including:

• Crowd dynamics: People counting in crowds provides important data for studying crowd dynamics, such as how people move, how they interact with each other, and how they respond to changes in the environment. This information can be used to develop mathematical models of crowd behavior and improve our understanding of crowd dynamics.

• Social behavior: People counting in crowds can also provide insights into social behavior. It can help researchers understand how people interact with each other in crowded environments, such as how they form groups, how they communicate, and how they coordinate their movements.

• Computer vision and machine learning: People counting in crowds provides an important application for developing and evaluating computer vision and machine learning algorithms. It helps to advance the state of the art in object detection, tracking, segmentation, and classification, which are essential for people counting in crowded environments.

• Sensor technology: People counting in crowds also drives the development of new sensor technologies, such as cameras, depth sensors, and thermal sensors, that are designed to capture data in crowded environments. This helps to advance the field of sensor technology and improve our ability to capture data in challenging environments.

• Human-computer interaction: People counting in crowds can also provide insights into human-computer interaction, particularly in the context of intelligent systems. It helps to understand how people interact with technology in crowded environments and how technology can be designed to support people in these environments.

1.2.2 Practical significance

[…] organizers identify high-density areas and take action to prevent overcrowding, which is critical for public safety.

• Retail and marketing: People counting is an essential tool for retailers to optimize staffing levels, measure customer traffic, and improve customer service. It helps retailers identify high-traffic areas and monitor customer behavior, such as the time spent in specific sections or the frequency of return visits.

• Public safety and security: People counting is also an important tool for public safety and security, helping to monitor crowd density and prevent overcrowding in public spaces, as well as optimize staffing levels and improve security at events.

[…] the content and teaching methods.

1.2.3 Challenges and Motivation

A number of approaches to the people counting problem have achieved impressive accuracy. However, the problem still faces many challenges, as follows:

• Occlusion: As crowd density increases, individuals may start to occlude each other, which poses a challenge for traditional detection algorithms and motivates the development of density estimation models.

Figure 1.2: Some challenges in crowd counting [1]. Panels include (d) Rotation, (e) Illumination variation, and (f) Weather changes.

• Complex background: In a natural scene, the background may be highly cluttered and contain objects with similar appearances or colors to the foreground, which can cause confusion.

• Scale variation: One of the primary problems that should be addressed in the density […] designed to address the scale variation problem in the first step.

• […] the camera viewpoints, such as different poses and photographic angles.

• Illumination variation: The illumination varies at different times of the day, usually from dark to light and then back to dark, from dawn to dusk.

• Weather changes: Scenes in the wild are usually under various types of weather conditions, such as clear, clouds, rain, fog, thunder, and extra […].

[…] illumination change (Fig. 1.2.e) and weather change (Fig. 1.2.f). These challenges cannot be solved in one model. In this thesis, we attempt to improve the people counting performance by overcoming occlusion and scale changes, although some other challenges may be implicitly resolved thanks to the studied model itself.

1.3 Objectives and Contributions

1.3.1 Objectives

The main goal of this thesis is to improve the performance of people counting from images/video by overcoming the occlusion issue in crowded scenes. To reach this goal, the specific objectives are as follows:

• Study and develop techniques for detecting humans that can be used for people counting and localization.

• Improve the techniques to avoid missed detection in crowded scenes.

1.3.2 Contributions

The work of this thesis is within the context of a project granted by the Ministry of Education and Training (MOET), with the project code CT2020.02.BKA.02. One of the tasks of this project is to detect and count the number of students in a classroom, and then create a density map of the students. This will aid in better management of the students and improve the quality of teaching and learning. As a result, besides validating the proposed method on benchmark datasets, we also take part in building a new dataset in a classroom and test our method on that dataset. We summarize the main contributions of our work as follows:

• First, we propose a method that combines the detection results of both face and head to improve the true positive rate of people counting (called MultiDetect).

• Second, we deploy a tracking technique to handle fast-moving objects that may cause motion blur effects and missed detection (called MultiDetect with Track).

• Finally, we conduct extensive experiments to validate the MultiDetect improvement on three benchmark datasets (Wider Face, Hollywood Heads, and Casablanca). We also build a new dataset in the MOET project, in which I participated in collecting and annotating the data, and we conduct extensive experiments to validate both improvements, MultiDetect and MultiDetect with Track, on our dataset.

1.4 Thesis outline

The thesis is structured into 5 chapters:

1. Introduction: This chapter provides the definition of the people counting problem and introduces its scientific and practical significance. Then, we describe some of the main challenges that motivated the work of this thesis. Finally, we present our objectives, contributions, and thesis outline.

2. Background and Related Works: This chapter surveys deep learning-based approaches to human detection and tracking for the problem of people counting and localization. We also describe some methods that roughly estimate the people density without giving localization information. The analysis of both approaches and our constraints motivates us to follow the detection and tracking methods. We then briefly present fundamental knowledge about deep models for human detection and tracking.

3. Proposed Method: This chapter introduces our proposed framework for people detection and localization from videos. We describe each component of the framework in detail and how to implement it in practice.

4. Experiments: This chapter presents the datasets, evaluation protocol, technical setup, results, and discussions related to our experiments. In particular, we describe the process of collecting and annotating our new dataset in a classroom environment.

5. Conclusion: This chapter summarizes our work, highlights the contributions, analyzes the limitations, and provides some ideas for future research directions.


Chapter 2

Related works

This chapter presents some basic knowledge as well as related works regarding the research topic of this thesis. There are two approaches to the people counting problem from still images: i) the detection-based approach, which detects individuals and can then give the number of people in the scene; ii) the density estimation-based approach, which gives only a rough number of persons without location information. Besides, to improve the detection result, some works deploy tracking techniques. We present these approaches in sections 2.1 and 2.2 respectively. We then describe the tracking techniques in section 2.3. Finally, we conclude the chapter in section 2.4.

2.1 Detection based people counting

People counting can be carried out by detecting faces, heads, or bodies depending on the context and the scene. The most common technique is detecting the human body, but in cases where the human body is occluded or in a challenging posture, the head and face can be alternative solutions.

2.1.1 Face detection based people counting

Face detection-based people counting aims to detect and track faces in images or real-time video streams, then count the number of detected faces. Face detection algorithms typically rely on deep learning models trained on large datasets to identify and locate faces in images or videos accurately.

Tsong-Yi Chen et al. presented an automatic people-counting system based on face recognition, in which people passing through a gate or door are counted using a video camera [5]. First, they use the image difference to detect the rough edges of moving people. Then, color features are utilized to locate people's faces. Based on the NCC (Normalized Color Coordinate) color space, the face is initially obtained by detecting the skin tone area, and then the subject's facial features are analyzed to determine whether it is a real face or not. After face detection, a person is tracked by following the recognized face, and this person is added to the count when their face touches the count line.

Xi Zhao et al. presented a method of counting people based on face detection, tracking, and trajectory classification [2]. They first performed face detection and then face tracking by combining a new scale-invariant Kalman filter with a kernel-based color histogram tracking algorithm. From each face trajectory, the angles between neighboring points are extracted. Finally, to distinguish real face trajectories from fake ones, the authors used the K-NN classification method based on the Earth Mover's distance. The framework of this paper is described in Fig. 2.1.

[…] people using the Kinect camera. The authors used depth information to identify false face detections, and then a 3D data association is used to link tracks with detection results. Finally, they counted the people who enter the region of interest using a validated trajectory, as shown in Fig. 2.2.

Figure 2.2: Depth-assisted face detection and association for people counting [6]

Face detection nowadays can achieve very high accuracy. However, it still faces challenges such as variations in lighting conditions, occlusions, and face orientation. Besides, it requires the face to be turned toward the camera. Without that assumption, the performance of face detection-based people counting may drop drastically.

2.1.2 Head detection based on people counting

Head detection can be a more flexible solution to the constraint of detecting only frontal faces. Bin Li et al. [7] proposed a people-counting method based on head detection and tracking. The purpose of this proposal is to count the number of people who move under an indoor overhead camera. This framework consists of four parts: foreground extraction, head detection, head tracking, and crossing-line judgment. Firstly, the proposed method utilizes a foreground extraction method to obtain foreground regions of moving people, and some morphological operations are employed to optimize the foreground regions. After that, it exploits an LBP (local binary pattern) feature-based Adaboost classifier for head detection in the optimized foreground regions. Once a head is detected, it is tracked by a local head tracking method based on the Mean Shift algorithm. Finally, based on head tracking, the method uses crossing-line judgment to determine whether the candidate head object will be counted.
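The crossing-line judgment described above can be illustrated with a short sketch. This is not the implementation from [7]: the horizontal line position `line_y`, the downward counting direction, and the function names are assumptions made for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) centre of a tracked head, in pixels


def count_line_crossings(track: List[Point], line_y: float) -> int:
    """Count downward crossings of the horizontal line y = line_y (assumed setup)."""
    crossings = 0
    for (_, y_prev), (_, y_curr) in zip(track, track[1:]):
        # In image coordinates y grows downwards, so a downward crossing means
        # the head was above the line and is now on or below it.
        if y_prev < line_y <= y_curr:
            crossings += 1
    return crossings


def count_people(tracks: List[List[Point]], line_y: float) -> int:
    """Count each tracked head at most once if its trajectory crosses the line."""
    return sum(1 for t in tracks if count_line_crossings(t, line_y) > 0)


tracks = [
    [(10, 5), (11, 15), (12, 25)],  # crosses y = 20 between the 2nd and 3rd point
    [(40, 30), (41, 35)],           # starts below the line, never crosses it
    [(70, 2), (70, 8), (71, 12)],   # never reaches the line
]
print(count_people(tracks, line_y=20))
```

Counting per track rather than per detection is what makes tracking useful here: a head seen in many frames still contributes a single count.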

In [8], the authors proposed a deep model-based head detector that takes scale variations into account. People counting in outdoor venues faces many challenges, such as severe occlusions, few pixels per head, and significant variations in a person's head size due to wide sports areas. This method is based on the notion that the head is the most visible part in sports venues where large numbers of people are gathered. They generate scale-aware head proposals based on a scale map to cope with the problem of different scales. Scale-aware proposals are then fed to a Convolutional Neural Network (CNN), which provides a response matrix containing the presence probabilities of people observed across scene scales. Finally, they use non-maximum suppression to get accurate head positions. For the performance evaluation, they carry out extensive experiments on two standard datasets: S-HOCK and UCF-HDDC.
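The non-maximum suppression step mentioned above can be sketched in its common greedy form. This is a generic illustration rather than the procedure of [8]; the box format `(x1, y1, x2, y2)` and the threshold value 0.5 are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept, "->", len(kept), "heads")
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, so two heads remain.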

2.1.3 Hybrid detection based on people counting

In various scenarios, a single head detector or face detector may not provide accurate results. Therefore, some researchers proposed hybrid detection that combines detection results from different human parts (body, head, face) to improve the efficiency of counting people in a crowd. The Double Anchor R-CNN network, shown in Fig. 2.4 and proposed by Kevin Zhang [9], combines the head and body of a person.

Figure 2.4: Architecture of Double Anchor R-CNN

Another approach also combines head and body using the JointDet architecture [10]. The JointDet network consists of four major components, as shown in Fig. 2.5: the RPN network, the Head R-CNN, the Body R-CNN, and the RDM. The head-to-body ratio is then calculated to get whole-body proposals. The head and body proposals are then submitted to two parallel R-CNN branches to obtain interim results. These temporary results are further processed to get the final results, as follows:

• Matching head and body using the proposed strategy to output the matched body-head pairs as pair 1 to pair N;

• Extracting corresponding features of each pair for the RDM to discriminate their relation (i.e., whether they belong to the same person);

• Using the learned relationship to reduce head false positives and recall […]

2.2 Density estimation based people counting

Video surveillance systems are commonly deployed to monitor very crowded scenes, where it is impractical to detect each individual. As a consequence, density estimation is used to approximately count the number of people.

The authors in [11] estimated people density in a crowded environment. In this paper, the authors proposed a two-fold method: first, they estimate the density of the crowd to obtain the crowd size; second, they count the people in the crowd. As the density of the crowd increases, the congestion in the crowd also increases. To get around this, they use an improved adaptive K-GMM background subtraction method to extract the foreground accurately in real-time applications and avoid the estimation problem. By applying a boundary detection algorithm, they were able to estimate the size of the crowd. The number of people in a crowd was counted using the "Canny edge detector" algorithm, the "connected component labeling" method, and the "centered bounding box" method. This article proposes a real-time video surveillance system. The proposed method is compared on different datasets such as IBM, KTH, CAVIAR, PETS2009, and CROWD. It can be used for both testing and training phases.

The authors in [12] proposed a supervised learning framework for visual object counting tasks, such as estimating the number of people in a surveillance video frame. Their goal is to accurately estimate the number of people. However, they avoided the difficult task of detecting and locating individual objects. Instead, they proposed to estimate an image density whose integral over any image region gives the count of objects within that region. Learning to infer such a density can be formulated as minimizing a regularized quadratic cost function. They introduced a new loss function, well suited for such learning and efficiently computable through a maximum subarray algorithm. The problem can then be posed as a convex quadratic program that is solvable with cutting-plane optimization. The proposed framework is flexible, as it can accept any domain-specific visual feature. Once trained, their system provides the number of objects and requires only a very short amount of time for the feature extraction step. Therefore, this model becomes […]

Figure 2.6: Examples of people density estimation: counting people in a surveillance video frame. Close-ups are shown alongside the images. The bottom close-ups show examples of the dotted annotations (crosses). This framework learns to estimate the number of objects in previously unseen images based on a set of training images of the same kind augmented with dotted annotations [12].
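The idea that a count is obtained by integrating a density map over a region can be sketched as below. The density values here are a hand-made toy example, not the output of the model in [12], and the function name `count_in_region` is hypothetical.

```python
from typing import List

DensityMap = List[List[float]]  # density[row][col], one value per pixel


def count_in_region(density: DensityMap, top: int, left: int,
                    bottom: int, right: int) -> float:
    """Discrete integral: sum the density over rows [top, bottom) and cols [left, right)."""
    return sum(sum(row[left:right]) for row in density[top:bottom])


# Toy 4x4 map: two people, each spread over a 2x2 patch whose values sum to 1.
density = [
    [0.25, 0.25, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]
whole_image = count_in_region(density, 0, 0, 4, 4)
top_left = count_in_region(density, 0, 0, 2, 2)
print(whole_image, top_left)
```

In this toy map the whole image integrates to 2 people and the top-left quadrant to 1, without any individual being localized by a bounding box.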

2.3 People tracking

Tracking is an efficient technique to improve the true positive rate when an object is missed by the detector at a given frame. In this section, we briefly present an overview of the object tracking problem, then two typical tracking techniques (SORT and DeepSORT) that are widely deployed in the literature. We finally analyze some works using tracking for people counting problems.

2.3.1 Overview of object tracking

Object tracking is a technique used to assign a unique ID to each object as it moves over time. The process starts when the object appears and ends when the object has left the scene for a certain time. The goal of object tracking is to accurately identify objects of interest, estimate their trajectories in the video, and track them as they move. The object tracking problem involves:

• Object detection: The first step in real-time object tracking is to detect the object of interest in each frame of the video or image stream. There are various object detection techniques available, including feature-based methods such as scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and histograms of oriented gradients (HOG), as well as deep learning-based object detection algorithms such as YOLO (You Only Look Once), SSD (Single Shot Detector), and Faster R-CNN (Region-based Convolutional Neural Network).

• Object tracking: Once the object is detected, the next step is to track it over time. Object tracking can be achieved using various techniques, including optical flow, mean-shift, particle filters, and Kalman filters. These techniques estimate the object's motion and predict its location in subsequent frames.

• Data association: In scenarios where there are multiple objects in the video or image stream, it is essential to associate each object's location with its corresponding identity. Data association techniques, such as the Hungarian algorithm, are used to match the detected objects with their previous locations to maintain their identities over time.

• Object re-detection: In some scenarios, the object of interest may disappear from the video or image stream for a short duration. Object re-detection techniques, such as template matching or appearance modeling, can be used to re-detect the object when it reappears.
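The data association step in the list above amounts to a linear assignment problem. The sketch below solves it by brute force over permutations, which is only practical for a handful of objects; the Hungarian algorithm solves the same problem in polynomial time. The centre-distance cost, the gating threshold `max_dist`, and the function name `associate` are assumptions chosen for illustration.

```python
from itertools import permutations
from math import hypot
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) centre of a box


def associate(tracks: List[Point], detections: List[Point],
              max_dist: float = 50.0) -> Dict[int, int]:
    """Return {track_index: detection_index} minimising the total centre distance."""
    n_t, n_d = len(tracks), len(detections)
    if n_t == 0 or n_d == 0:
        return {}

    def dist(t: int, d: int) -> float:
        return hypot(tracks[t][0] - detections[d][0], tracks[t][1] - detections[d][1])

    # Enumerate every one-to-one assignment of the smaller set into the larger one.
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d))) for p in permutations(range(n_t), n_d))

    best = min(candidates, key=lambda pairs: sum(dist(t, d) for t, d in pairs))
    # Gating: a pair that is too far apart is treated as "no match".
    return {t: d for t, d in best if dist(t, d) <= max_dist}


previous = [(10.0, 10.0), (100.0, 100.0)]                # last known track positions
current = [(102.0, 98.0), (12.0, 11.0), (300.0, 300.0)]  # detections in the new frame
matches = associate(previous, current)
print(matches)
```

Detections left unmatched (index 2 here) would typically start a new identity, while tracks left unmatched are candidates for the re-detection step.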

2.3.2 Multiple Object Tracking

Single object tracking assumes there is only one object in the scene. Tracking becomes harder when there are many objects. The multiple object tracking method aims to track all objects appearing in the frame by detecting and assigning identifiers to each object, as shown in Fig. 2.7. In addition, the IDs assigned to an object need to be consistent across frames. Multiple object tracking requires handling:

• Accurate object detection: This is a critical task, especially for detection-based tracking, to ensure the presence of all objects in the scene.

• Occluded objects: Objects are partially or completely obscured. When an ID is assigned to an object, the ID should be consistent throughout the video. However, when an object is obscured, relying solely on object detection is not enough to solve this problem.

• Object absence: An object may go out of the frame and then reappear. Similar to the previous issue, this is about ID switches. It is necessary to solve the problem of object re-identification, including obscuring or disappearing, to reduce the number of ID switches to the lowest possible level.

• Overlapping object trajectories: Objects with overlapping trajectories can also lead to the wrong assignment of IDs to objects, which is also a problem to deal with when working with multiple object tracking.

Figure 2.7 illustrates an example of multiple object tracking. In the first row, people are first detected and bounded by yellow boxes. The second row presents tracked people over time. Each person is identified by a color. The last row shows the case where one person is detected in the first frame (red bounding box) but missed in the next frames due to occlusion; it is still kept by the tracking technique.

2.3.3 Tracking techniques

In the literature, a number of tracking techniques have been proposed for different tasks such as human tracking, robot tracking, and so on. In this section, we review three conventional techniques: Kalman filter, SORT, and DeepSORT, which are successively improved versions.

Figure 2.7: Multiple Object Tracking. (a) shows all the detection boxes with their scores. (b) shows the tracklets obtained by previous methods, which associate detection boxes whose scores are higher than a threshold, i.e. 0.5. The same box color represents the same identity. (c) shows the tracklets obtained by the proposed method in the paper. The dashed boxes represent the predicted box of the previous tracklets using the Kalman filter. The two low-score detection boxes are correctly matched to the previous tracklets based on the large IoU [13].

The Kalman filter [14] was proposed by R. E. Kalman in 1960. It predicts the state of an object using previous information. The Kalman filter equations are categorized into two groups: prediction (time update) and correction (measurement update). The measurement update provides a feedback value that, combined with the prior state estimate, gives the posterior state estimate.

In order to use the Kalman filter to estimate the internal state of a process given only a sequence of noisy observations, one must model the process in accordance with the following framework. This means specifying the following matrices for each time-step $k$:

• $F_k$: the state-transition model;
• $H_k$: the observation model;
• $Q_k$: the covariance of the process noise;
• $R_k$: the covariance of the observation noise;
• and sometimes $B_k$, the control-input model; if $B_k$ is included, then there is also $u_k$, the control vector.

The Kalman filter model assumes the true state at time $k$ is evolved from the state at time $(k - 1)$ according to Eq. 2.1:

$$x_k = F_k x_{k-1} + B_k u_k + w_k \tag{2.1}$$

where

• $F_k$ is the state-transition model, which is applied to the previous state $x_{k-1}$;
• $B_k$ is the control-input model, which is applied to the control vector $u_k$;
• $w_k$ is the process noise, which is assumed to be drawn from a zero-mean multivariate normal distribution with covariance $Q_k$.

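At time $k$, an observation $z_k$ of the true state $x_k$ is made according to

$$z_k = H_k x_k + v_k$$

where

• $H_k$ is the observation model, which maps the true state space into the observed space;
• $v_k$ is the observation noise, which is assumed to be zero-mean Gaussian white noise with covariance $R_k$.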
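As a small worked example of this framework, consider tracking one image coordinate $p$ of a person with a constant-velocity motion model. The choice of state, the frame interval $\Delta t$, the absence of a control input, and a detector that measures position only are all assumptions made for illustration; they are not a model stated in this thesis.

$$
x_k = \begin{pmatrix} p_k \\ \dot{p}_k \end{pmatrix}, \qquad
F_k = \begin{pmatrix} 1 & \Delta t \\ 0 & 1 \end{pmatrix}, \qquad
H_k = \begin{pmatrix} 1 & 0 \end{pmatrix}, \qquad
B_k u_k = 0
$$

With these choices the state equation reads $p_k = p_{k-1} + \Delta t\,\dot{p}_{k-1}$ and $\dot{p}_k = \dot{p}_{k-1}$, each perturbed by the corresponding component of the process noise $w_k$, while the observation equation reduces to $z_k = p_k + v_k$.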

The next processing steps of the Kalman filter can be divided into two main parts (probability-based approach):

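Step 1: Predict

Predicted (a priori) state estimate:

$$\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1} + B_k u_k$$

Predicted (a priori) estimate covariance:

$$P_{k|k-1} = F_k P_{k-1|k-1} F_k^{T} + Q_k$$

Step 2: Update

Innovation (measurement residual), its covariance, and the Kalman gain, given the observation $z_k$ and the observation model $H_k$:

$$\tilde{y}_k = z_k - H_k \hat{x}_{k|k-1}, \qquad S_k = H_k P_{k|k-1} H_k^{T} + R_k, \qquad K_k = P_{k|k-1} H_k^{T} S_k^{-1}$$

Updated (a posteriori) state estimate:

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k$$

Updated (a posteriori) estimate covariance:

$$P_{k|k} = (I - K_k H_k) P_{k|k-1}$$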
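The predict and update steps can be sketched in a few lines for the simplest possible case, a scalar state, where every matrix collapses to a number. This is a minimal illustration, not the tracker used in this thesis: the class name `ScalarKalman`, the helper `run_filter`, and the noise values `q` and `r` are assumptions chosen for the example. A measurement of `None` stands for a frame in which the detector missed the object, so only the prediction is applied.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """One-dimensional Kalman filter; every matrix collapses to a number."""
    x: float          # state estimate
    p: float          # estimate covariance P
    f: float = 1.0    # state-transition model F
    h: float = 1.0    # observation model H
    q: float = 1e-3   # process-noise covariance Q (illustrative value)
    r: float = 0.25   # observation-noise covariance R (illustrative value)

    def predict(self, bu: float = 0.0) -> None:
        # A priori estimate: x = F x + B u, P = F P F^T + Q
        self.x = self.f * self.x + bu
        self.p = self.f * self.p * self.f + self.q

    def update(self, z: float) -> None:
        y = z - self.h * self.x                 # innovation
        s = self.h * self.p * self.h + self.r   # innovation covariance
        k = self.p * self.h / s                 # Kalman gain
        self.x = self.x + k * y                 # a posteriori state estimate
        self.p = (1.0 - k * self.h) * self.p    # a posteriori covariance


def run_filter(measurements, x0=0.0, p0=1.0):
    """Filter a sequence of measurements; None marks a missed detection."""
    kf = ScalarKalman(x=x0, p=p0)
    estimates = []
    for z in measurements:
        kf.predict()
        if z is not None:
            kf.update(z)
        estimates.append(kf.x)
    return estimates, kf.p
```

Calling `run_filter([1.0, 1.0, None, 1.0])` moves the estimate from the initial guess towards the measured value, and on the frame with the missing measurement the estimate is simply carried forward by the prediction, which is the behaviour that lets a tracker keep an occluded person alive.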


SORT [15] is an acronym for Simple Online and Realtime Tracking, an algorithm belonging to tracking-by-detection (or detection-based tracking). In the tracking-by-detection problem, a common feature is to separate out the detection result and use this result to track the object. The next task is to find a way to associate the bounding boxes obtained in each frame and assign an ID to each object. Therefore, the processing steps for each new frame are as follows:

• Detection: objects are first detected in the frame. In the original version of SORT, the Faster R-CNN was utilized.

• Association: In the case of multiple object tracking, an association algorithm is needed to associate a target with a detected object. In the SORT algorithm, the Hungarian algorithm was deployed for this purpose.

Hungarian Algorithm

The Hungarian algorithm [16] was developed and published in 1955 and proposed as a solution to the assignment problem. Let n denote the number of detections (i = 1, 2, ..., n) and m the number of predicted tracks (j = 1, 2, ..., m), as shown in Fig. 2.8. The association of detection i with a track j is based on a cost function that is the distance between i and j in feature space. Details of the Hungarian algorithm can be found in the original paper [16]. In the following, we just review some concepts and ideas of the algorithm. The Hungarian algorithm tries to associate each detection with its [...]


• Theorem 1: Suppose the cost matrix of the assignment problem is non-negative and has at least n zero elements. Furthermore, if these n zeros are in n different rows and in different columns, then the assignment that selects these n zero entries is an optimal solution to the problem.

• Theorem 2: Let C = [c_ij] be the cost matrix of the assignment problem (n detections, m tracks) and X* = [x_ij] a solution (an optimal solution) to this problem. Suppose C′ is a matrix obtained from C by adding a number α ≠ 0 (positive or negative) to each element in row r of C. Then X* is also a solution to the assignment problem with the cost matrix C′.
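The row-shift property can be checked numerically on a small example. The sketch below is only an illustration: the 3×3 cost matrix is made up, and the optimal assignment is found by brute force over all permutations rather than by the Hungarian algorithm itself, which is acceptable only for such tiny inputs. The helper names `best_assignment` and `shift_row` are hypothetical.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (columns, total) of a minimum-cost assignment for a square matrix.

    Brute force over all permutations: fine for a 3x3 illustration,
    not a substitute for the Hungarian algorithm on real inputs.
    """
    n = len(cost)
    best_cols, best_total = None, float("inf")
    for cols in permutations(range(n)):
        total = sum(cost[row][cols[row]] for row in range(n))
        if total < best_total:
            best_cols, best_total = cols, total
    return best_cols, best_total


def shift_row(cost, r, alpha):
    """Build C' from C by adding alpha to every element of row r."""
    return [[c + alpha if i == r else c for c in row]
            for i, row in enumerate(cost)]


# Made-up cost matrix: rows are detections, columns are tracks.
C = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]
C_shifted = shift_row(C, r=1, alpha=-2)

assignment, total = best_assignment(C)
assignment_shifted, total_shifted = best_assignment(C_shifted)
```

Both matrices yield the same assignment; only the total cost changes, by exactly α. This is the property that allows the Hungarian algorithm to subtract row and column minima until enough zeros appear for the first theorem to apply.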

Main steps of the SORT algorithm

The processing steps of SORT are shown in Fig. 2.9:

Step 1: Detection: The first step is to detect objects in each frame of a video using a computer vision algorithm such as a neural network-based detector.

Step 2: Association: The next step is to associate the detected objects with previously tracked objects from previous frames. This is done by comparing the features of the detected objects with those of the existing tracked objects and assigning a similarity score.

Step 3: Prediction: Once the association is made, the algorithm predicts the position of the objects in the next frame using a Kalman filter or another motion model.

Step 4: Update: In this step, the tracked objects are updated with the new information from the current frame, such as the position and size of the objects.

Step 5: Track management: The final step involves managing the tracked objects, such as removing objects that are no longer in the frame or creating new tracks for newly detected objects.

Note that there are three types of output of the Hungarian algorithm: 1) Matched tracks: a detection corresponding to a target is found, and this association is then used to update the Kalman filter; 2) Unmatched tracks: no detection is found to match the track, and the track may be deleted depending on its lifetime; 3) Unmatched detections: newly detected objects that are not matched with any target are used to create new tracks.
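The association step and its three outputs can be sketched as follows. This is a hedged illustration rather than the SORT reference code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the cost is `1 - IoU`, the threshold of 0.3 is an illustrative value, and the minimum-cost pairing is found by brute force as a stand-in for the Hungarian algorithm (in practice a solver such as `scipy.optimize.linear_sum_assignment` would normally be used). The function names are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def min_cost_pairs(cost):
    """Minimum-cost one-to-one pairing of rows and columns (brute force).

    Stand-in for the Hungarian algorithm; only suitable for tiny inputs.
    """
    n, m = len(cost), (len(cost[0]) if cost else 0)
    if n == 0 or m == 0:
        return []
    if n <= m:
        candidates = ([(i, cols[i]) for i in range(n)]
                      for cols in permutations(range(m), n))
    else:
        candidates = ([(rows[j], j) for j in range(m)]
                      for rows in permutations(range(n), m))
    return min(candidates, key=lambda pairs: sum(cost[i][j] for i, j in pairs))


def associate(detections, predicted_tracks, iou_threshold=0.3):
    """Return (matches, unmatched_tracks, unmatched_detections)."""
    cost = [[1.0 - iou(d, t) for t in predicted_tracks] for d in detections]
    # Keep a pairing only if the boxes overlap enough.
    matches = [(i, j) for i, j in min_cost_pairs(cost)
               if 1.0 - cost[i][j] >= iou_threshold]
    matched_dets = {i for i, _ in matches}
    matched_trks = {j for _, j in matches}
    unmatched_tracks = [j for j in range(len(predicted_tracks))
                        if j not in matched_trks]
    unmatched_detections = [i for i in range(len(detections))
                            if i not in matched_dets]
    return matches, unmatched_tracks, unmatched_detections
```

Each `(detection, track)` pair in `matches` would feed the Kalman update of that track, indices in `unmatched_tracks` are candidates for deletion once their lifetime is exceeded, and indices in `unmatched_detections` start new tracks.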


In the original version of SORT, the cost function is defined based on the IoU distance; it does not take the appearance similarity of detection and target into account. DeepSORT was developed by Nicolai Wojke, Alex Bewley and Dietrich Paulus [17] to address this omission, which is associated with a high number of ID switches. The solution proposed by DeepSORT is based on using deep learning to extract features of objects to increase accuracy in the data association process. In addition, a linking strategy known as 'Matching Cascade' was developed to help link objects that had previously disappeared more effectively.

DeepSORT is an improvement over the SORT (Simple Online and Realtime Tracking) algorithm in multiple ways:

• Association metric: DeepSORT uses the Mahalanobis distance metric to associate detected objects with existing tracks, while SORT uses the IoU distance. The Mahalanobis distance takes into account the covariance matrix of the data, which allows better handling of data with varying scales and correlation between dimensions. This results in a more accurate association of objects with existing tracks, even when there are occlusions or other objects in the scene.


• Feature embedding: DeepSORT uses a deep neural network to embed object features into a high-dimensional space, while SORT uses no appearance features. Deep learning-based embeddings are powerful and expressive, allowing for better discrimination between objects and reducing the risk of track drift.

• Track management: DeepSORT employs a track management strategy that allows better handling of occlusions, missed detections and track fragmentation. Specifically, it uses a Kalman filter to estimate the position and velocity of the object, and a gating mechanism to filter out detections that are unlikely to belong to the track. It also uses a track initiation process to start new tracks when no existing track can be associated with a new detection.

Overall, DeepSORT's improvements over SORT result in more accurate and robust tracking, especially in challenging scenarios where objects are partially occluded or move quickly.
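The motion gate and the appearance distance described above can be sketched as follows. This is a simplified illustration, not the DeepSORT implementation: it assumes a 2-D position measurement (so the gate is the 0.95 quantile of a chi-square distribution with 2 degrees of freedom), short hand-written embedding vectors in place of network outputs, and a weight `lam` of 0.5. All function names are hypothetical.

```python
import math

# 0.95 quantile of the chi-square distribution with 2 degrees of freedom,
# used here as the gate for a 2-D position measurement.
CHI2_95_2DOF = 5.9915


def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance of a 2-D point z to N(mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is 1/det * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def cosine_distance(u, v):
    """1 - cosine similarity between two appearance embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def gated_cost(z, feat, track_mean, track_cov, track_feat, lam=0.5):
    """Weighted motion + appearance cost, or None if the gate rejects the pair."""
    d_motion = mahalanobis_sq(z, track_mean, track_cov)
    if d_motion > CHI2_95_2DOF:
        return None  # detection is too unlikely under the track's prediction
    return lam * d_motion + (1.0 - lam) * cosine_distance(feat, track_feat)
```

A detection that falls outside the gate is excluded before any assignment is attempted, so an appearance match alone cannot pull a track to a physically implausible position.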

2.3.4 Tracking-based people counting

Tracking-based people counting is a method of counting people by using computer vision techniques to track individuals as they move through a space. There are several different tracking-based people counting methods, including:

1. Single-camera tracking [18]: This method uses a single camera to track people as they move through the space. The camera captures images or video, and the software analyzes the data to identify individuals and track their movements.

2. Multiple-camera tracking [19]: This method uses multiple cameras placed throughout the space to capture images or video from different angles. The software combines the data from each camera to track people as they move between different areas.

3. Depth-based tracking [20]: This method uses cameras that can capture depth information, such as Microsoft's Kinect camera, to track people as they move [...] track their movements.
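Whichever tracking setup is used, the tracker output has to be turned into a count. One simple possibility, given here as an assumption for illustration and not as the method of [18], [19], [20] or of this thesis, is to count the distinct track IDs that survive for a minimum number of frames, so that short-lived spurious tracks are ignored. The function name `count_people` and the parameter `min_hits` are hypothetical.

```python
from collections import Counter


def count_people(frames, min_hits=3):
    """Count distinct track IDs that appear in at least `min_hits` frames.

    `frames` is an iterable of per-frame lists of track IDs.
    """
    hits = Counter()
    for track_ids in frames:
        hits.update(set(track_ids))  # count each ID at most once per frame
    return sum(1 for n in hits.values() if n >= min_hits)
```

Note that every ID switch creates an extra ID, so the quality of the count depends directly on how well the tracker preserves identities through occlusion.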

The research in the paper [21] aims to develop an accurate and efficient system capable of error-free counting and tracking in public places. The main goal of this research [...] Next, random particles are distributed, and features are extracted. Subsequently, particle flows are clustered using a self-organizing map, and people counting and tracking are performed based on motion trajectories. Tests on the PETS-2009 and TUD-Pedestrian datasets gave high results.


The smart surveillance system proposed in [22] consists of four stages, as shown in Fig. 2.11: people detection, head-to-torso template extraction, tracking, and crowd cluster analysis. Firstly, the system extracts human silhouettes using an inverse transform as well as a median filter, reducing the cost of computing and handling various complex monitoring situations. Secondly, people are detected by their heads and torsos, because these are less varied and barely occluded. Thirdly, each person is tracked through consecutive frames using the Kalman filter technique with Jaccard similarity and normalized cross-correlation. Finally, template matching is used for crowd counting with cue localization and clustering via Gaussian mapping for normal or abnormal event detection. The experimental results on two challenging video surveillance datasets, the PETS2009 and UMN crowd analysis datasets, demonstrate that the proposed system provides 88.7% and 95.5% in terms of counting accuracy and detection rate, respectively.

[Figure 2.11 blocks: frame extraction, grayscale conversion, inverse transform, morphology; background removal, connected regions, silhouette extraction, people localization; people extraction, head detection, template matching; Gaussian smoothing, crowd clustering, cluster analysis]

Figure 2.11: Flow architecture of the proposed smart surveillance system [22]


2.4 Conclusion of the chapter

This chapter presented our study of methods for people counting based on people detection and tracking and on people density estimation. The detection and tracking-based people counting techniques are suitable for normally crowded scenes, while the latter is more suitable for highly crowded scenes. In this work, we follow the first approach, which detects and tracks humans to improve the accuracy when occlusion appears. We will describe our proposed method in Chapter 3.

Title	People counting using detection and tracking techniques for smart video surveillance
Author	Ha Thi Oanh
Supervisor	Assoc. Prof. Tran Thi Thanh Hai
University	Ha Noi University of Science and Technology
Major	Computer Science
Type	Thesis
Year of publication	2023
City	Ha Noi

Format
Number of pages	92
File size	2.1 MB