People counting using detection and tracking techniques for smart video surveillance



Ha Thi Oanh

Ha Noi University of Science and Technology

Supervisor: Assoc. Prof. Tran Thi Thanh Hai

In partial fulfillment of the requirements for the degree of

Master of Computer Science

April 20, 2023


Dr Doan Thi Huong Giang who directly supported me.

This master's thesis was carried out within the framework of the ministerial-level scientific research project "Research and development of an automatic system for assessing learning activities in class based on image processing technology and artificial intelligence" (code CT2020.02.BKA.02), led by Assoc. Prof. Dr. Le Thi Lan. The student sincerely thanks the project.

Finally, I wish to show my appreciation to all my friends and family members who helped me finalize the project.


Real-time people counting from video or images has multiple applications in intelligent transportation, density estimation, class management, and so on. Although this problem has been widely studied, it still faces some major challenges due to crowded scenes and occlusion. In a common approach, the problem is addressed by detecting people using conventional detectors. However, this approach can fail when people adopt various postures or occlude each other. We notice that even when a main part of the human body is occluded, the face and head are often still observable. In addition, a person may be missed at one frame but recovered at the previous or the next frames.

In this thesis, we attempt to improve people counting results by building on these observations. We first deploy two detectors (Yolo and RetinaFace) for detecting the heads and faces of people in the scene. We then develop a pairing technique that aligns the face and the head of each person. This alignment helps to recover missed head or face detections and thus increases the true positive rate. To overcome the missed detection of both face and head at a certain frame, we apply a tracking technique (i.e., SORT) on the combined detection result. Putting all of these techniques into a unified framework increases the true positive rate from 90.36% to 96.21% on the ClassHead Part 2 dataset.


1 Introduction
1.1 Introduction to people counting
1.2 Scientific and practical significance
1.2.1 Scientific significance
1.2.2 Practical significance
1.2.3 Challenges and Motivation
1.3 Objectives and Contributions
1.3.1 Objectives
1.3.2 Contributions
1.4 Thesis outline

2 Related works
2.1 Detection based people counting
2.1.1 Face detection based people counting
2.1.2 Head detection based people counting
2.1.3 Hybrid detection based people counting
2.2 Density estimation based people counting
2.3 People tracking
2.3.1 Overview of object tracking
2.3.2 Multiple Object Tracking
2.3.3 Tracking techniques
2.3.3.1 Kalman filter
2.3.3.2 SORT
2.3.3.3 DeepSORT
2.3.4 Tracking-based people counting
2.4 Conclusion of the chapter

3 Proposed method for people counting
3.1 The proposed people counting framework
3.2 Yolo-based head detection
3.2.1 Yolo revisit
3.2.2 Yolov5
3.2.3 Implementation of Yolov5 for head detection
3.3 RetinaFace based face detection
3.3.1 RetinaFace architecture
3.3.2 Implementation of RetinaFace for face detection
3.4 Combination of head and face detection
3.4.1 Linear sum assignment problem
3.4.2 Head-face pairing cost
3.5 Person tracking
3.6 Conclusion

4 Experiments
4.1 Dataset and Evaluation Metrics
4.1.1 Our collected dataset: ClassHead
4.1.1.1 ClassHead Part 1
4.1.1.2 ClassHead Part 2
4.1.2 Hollywood Heads dataset
4.1.3 Casablanca dataset
4.1.4 Wider Face dataset
4.1.5 Evaluation metrics
4.1.5.1 Intersection over Union (IoU)
4.1.5.2 Precision and Recall
4.1.5.3 F1-score
4.1.5.4 AP and mAP
4.1.5.5 Mean Absolute Error
4.2 Experimental Results
4.2.1 Evaluation on Hollywood dataset
4.2.2 Evaluation on Casablanca dataset
4.2.3 Evaluation on Wider Face dataset
4.2.4 Evaluation on ClassHead Part 2 dataset

5 Conclusions
5.1 Conclusion
5.2 Future Works


List of Figures

1.1 Illustration of the input and output of people counting from an image
1.2 Some challenges in crowd counting [1]
2.1 Framework for a people counting based on face detection and tracking in a video [2]
2.2 System framework for depth-assisted face detection and association for people counting
2.3 System framework for a people counting method based on head detection and tracking
2.4 Network structure of Double Anchor R-CNN
2.5 Architecture of JointDet
2.6 Examples of people density estimation
2.7 Example of Multiple Object Tracking
2.8 Hungarian Algorithm
2.9 The tracking process of the SORT algorithm
2.10 Architecture of the proposed people counting and tracking system
2.11 Flow architecture of the proposed smart surveillance system
3.1 The proposed framework for people counting by pairing head and face detection and tracking
3.2 Output of Yolo network [3]
3.3 Yolov5 architecture [4]
3.4 Spatial Pyramid Pooling
3.5 Path Aggregation Network
3.6 Automatic learning of bounding box anchors [4]
3.7 Activation functions used in Yolov5: (a) SiLU function, (b) Sigmoid function [4]
3.8 Example for creating dataset.yaml
3.9 An overview of the single-stage dense face localisation approach. RetinaFace is designed based on the feature pyramids with independent context modules. Following the context modules, we calculate a multi-task loss for each anchor
3.10 Organize dataset for Yolo training
3.11 Example of RetinaFace testing on Wider Face dataset
3.12 Flowchart of combining object detection and tracking to improve the true positive rate
4.1 Camera layout in the simulated classroom and an image obtained from each camera view
4.2 Illustration of LabelMe interface and main operations to annotate an image
4.3 Illustration of images taken from five camera views in ClassHead Part 1 dataset: (a) View 1, (b) View 2, (c) View 3, (d) View 4, and (e) View 5
4.4 Some example images of ClassHead Part 2 dataset: view ch03 (a), view ch04 (b), view ch05 (c), view ch12 (d), and view ch13 (e)
4.5 Some example images of Hollywood Heads dataset (first row), Casablanca dataset (second row), Wider Face dataset (third row), and ClassHead Part 2 of our dataset (last row)
4.6 Calculating IOU
4.7 Precision and Recall metrics
4.8 MAE measurement results on 2 proposed methods in Hollywood Heads dataset
4.9 Results of Hollywood Heads dataset: (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
4.10 MAE measurement results on 2 proposed methods in Casablanca Heads dataset
4.11 Results of Casablanca dataset: (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
4.12 Results of Wider Face dataset: (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
4.13 MultiDetect results in ClassHead Part 2: (a) Head detections, (b) Face detections, (c) MultiDetect
4.14 Head tracking method results in ClassHead Part 2 dataset: (a) Head detections at frame 1, (b) Head tracking at frame 100
4.15 MultiDetect with Track method results in ClassHead Part 2 dataset: (a) MultiDetect with Track at frame 1, (b) MultiDetect with Track at frame 100
4.16 MAE measurement results on 3 proposed methods in ClassHead Part 2 dataset


List of Tables

4.1 Setup camera parameters for data collection
4.2 ClassHead Part 1 dataset for training and testing Head detector Yolov5
4.3 ClassHead Part 2 dataset
4.4 Results of the proposed method on the Hollywood Heads dataset
4.5 Results of the proposed method on the Casablanca dataset
4.6 Results of the proposed method on Wider Face dataset
4.7 Results of the head detection method in ClassHead Part 2 dataset
4.8 Results of the MultiDetect method in ClassHead Part 2 dataset
4.9 Results of the Head Tracking in ClassHead Part 2 dataset
4.10 Results of method MultiDetect with Track in ClassHead Part 2 dataset
4.11 Experimental results in the ClassHead Part 2 dataset after using 4 methods


List of Acronyms

CNN Convolutional Neural Network

HOG Histogram of Oriented Gradients

RNN Recurrent Neural Network

LSTM Long short-term memory

NN Neural Network

RPN Region Proposal Network

YOLO You Only Look Once


Chapter 1

Introduction

Recently, people counting in images or video has become an active research topic due to its wide range of applications, from public safety to intelligent crowd flow. Manual counting is impractical since it is a tedious and time-consuming task, particularly in crowded scenes. This chapter defines the problem of people counting, presents its challenges, and discusses the drawbacks of existing methods to motivate our work. We then clarify our objectives and contributions to this field. Finally, we describe the organization of the thesis.

1.1 Introduction to people counting

People counting in crowds refers to the process of accurately counting the number of individuals present in a densely populated area or space. This is a challenging task due to the high density of people, occlusions, overlapping individuals, and the need to track people as they move through the crowd. People counting has been extensively studied in recent years, and it has numerous real-life applications, including event management, public safety, and transportation. For instance, it can be utilized to monitor crowd density and prevent overcrowding in public spaces, and to optimize and improve security at events and transportation hubs.

To capture people in crowds, sensors such as thermal imaging cameras, RGB cameras, and lasers may be used. RGB cameras are the most commonly utilized due to their low cost and their presence in almost all public spaces. From the video data, computer vision techniques such as object detection and tracking, optical flow, and background subtraction can identify and track individuals in the crowd. The problem of people counting from an image is defined as follows:

• Input: An image or a frame from a video sequence.

• Output: The number of people and/or their locations in the frame/image.

Figure 1.1 depicts the input and output of a people counting algorithm. The algorithm stores the number of people detected and determines the bounding box of each individual. Depending on the context, location data may be crucial for further processing. However, in highly crowded scenes, obtaining an exact count of people may be impractical, and an estimation of the number of individuals is sufficient. In the next chapter, we will review some related works that provide an estimation of the people count with or without location and bounding box information.
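The input/output definition above can be illustrated with a minimal sketch. It assumes a detector has already produced bounding boxes with confidence scores; the `Detection` type, the `count_people` function, and the 0.5 threshold are hypothetical names and values chosen for illustration, not part of the thesis.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """One detected person (hypothetical container used for illustration)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float                            # detector confidence in [0, 1]


def count_people(detections: List[Detection], score_threshold: float = 0.5):
    """Return the people count and the kept bounding boxes for one frame."""
    kept = [d for d in detections if d.score >= score_threshold]
    return len(kept), [d.box for d in kept]


# Example frame with three candidate detections, one of them low-confidence.
detections = [
    Detection((10, 20, 50, 80), 0.92),
    Detection((60, 25, 95, 85), 0.40),
    Detection((120, 30, 160, 90), 0.77),
]
count, boxes = count_people(detections)
print(count, boxes)
```

The count is simply the number of boxes that survive the confidence threshold, which is why missed detections translate directly into counting errors.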

Figure 1.1: Illustration of the input and output of people counting from an image

1.2 Scientific and practical significance

1.2.1 Scientific significance

People counting in crowds has several scientific implications, including:

• Crowd dynamics: People counting in crowds provides important data for studying crowd dynamics, such as how people move, how they interact with each other, and how they respond to changes in the environment. This information can be used to develop mathematical models of crowd behavior and improve our understanding of crowded environments.

• Sensor technology: People counting in crowds also drives the development of new sensor technologies, such as cameras, depth sensors, and thermal sensors, that are designed to capture data in crowded environments. This helps to advance the field of sensor technology and improve our ability to capture data in challenging environments.

• Human-computer interaction: People counting in crowds can also provide insights into human-computer interaction, particularly in the context of intelligent systems. It helps to understand how people interact with technology in crowded environments and how technology can be designed to support people in these settings.

1.2.2 Practical significance

organizers identify high-density areas and take action to prevent overcrowding, which is critical for public safety.

• Retail and marketing: People counting is an essential tool for retailers to optimize staffing levels, measure customer traffic, and improve customer service. It helps retailers identify high-traffic areas and monitor customer behavior, such as the time spent in specific sections or the frequency of return visits.

• Public safety and security: People counting is also an important tool for public safety and security, helping to monitor crowd density and prevent overcrowding in public spaces, as well as optimize staffing levels and improve security at events and transportation hubs. It can also assist in tracking and identifying suspects in security footage.

• Transportation: People counting is useful in transportation systems to measure the usage of different modes of transportation and optimize public transportation routes and schedules. It helps to reduce congestion, improve efficiency, and enhance the overall user experience.

• Education: Student counting in a classroom environment is an important and reliable source of information for improving the quality of education by adapting the content and teaching methods.

1.2.3 Challenges and Motivation

To solve the people counting problem, a number of approaches exist that have achieved impressive accuracy. However, this problem still faces many challenges, as follows:

• Occlusion: As crowd density increases, individuals may start to occlude each other, which poses a challenge for traditional detection algorithms and motivates the development of density estimation models.

• Complex background: In a natural scene, the background may be highly cluttered and contain objects with similar appearances or colors to the foreground, which can cause confusion.

• Scale variation: One of the primary problems that should be addressed in density estimation models is the strong variation in the scales of objects. As a result, most existing density estimation models are designed to address the scale variation problem in the first step.

• Camera viewpoint: The issue of rotation variation is drastically increased by the camera viewpoints, such as different poses and photographic angles.

• Illumination variation: The illumination varies at different times of the day, usually from dark to light and then back to dark, from dawn to dusk.

• Weather changes: Scenes in the wild are usually captured under various types of weather conditions, such as clear, cloudy, rainy, foggy, stormy, overcast, and extra sunny.

Figure 1.2: Some challenges in crowd counting [1]

Figure 1.2 shows some examples where the people are highly occluded (Fig. 1.2.a), the background is complex (Fig. 1.2.b), the scale varies (Fig. 1.2.c), the camera view changes (Fig. 1.2.d), the illumination changes (Fig. 1.2.e), and the weather changes (Fig. 1.2.f). These challenges cannot all be solved by one model. In this thesis, we attempt to improve people counting performance by overcoming occlusion and scale changes, although some other challenges may be implicitly resolved thanks to the studied model itself.

1.3 Objectives and Contributions

1.3.1 Objectives

The main goal of this thesis is to improve the performance of people counting from images/video by overcoming the occlusion issue in crowded scenes. To reach this goal, the specific objectives are as follows:

• Conduct a survey of existing techniques for human counting, analyze their maindrawbacks, and then propose a suitable solution

• Study and develop techniques for detecting humans that can be used for peoplecounting and localization

• Improve the techniques to avoid missed detection in crowded scenes

1.3.2 Contributions

The work of this thesis is within the context of a project granted by the Ministry of Education and Training (MOET), with the project code CT2020.02.BKA.02. One of the tasks of this project is to detect and count the number of students in a classroom, and then create a density map of the students. This will aid in better management of the students and improve the quality of teaching and learning. As a result, besides validating the proposed method on benchmark datasets, we also take part in building a new dataset in a classroom and test our method on that dataset. We summarize the main contributions of our work as follows:

• First, we propose a method that combines the detection results of both face and head to improve the true positive rate of people counting (called MultiDetect).

• Second, we deploy a tracking technique to handle fast-moving objects that may cause motion blur effects and missed detections (called MultiDetect with Track).

• Finally, we conduct extensive experiments to validate the MultiDetect improvement on three benchmark datasets (Wider Face, Hollywood Heads, and Casablanca). We also build a new dataset within the MOET project, in which I participated in collecting and annotating the data, and we conduct extensive experiments to validate both improvements, MultiDetect and MultiDetect with Track, on our dataset.

1.4 Thesis outline

The thesis is structured into 5 chapters:

1. Introduction: This chapter provides the definition of the people counting problem and introduces its scientific and practical significance. Then, we describe some of the main challenges that motivated the work of this thesis. Finally, we present our objectives, contributions, and thesis outline.

2. Background and Related Works: This chapter conducts a survey on the deep learning-based approach for human detection and tracking for the problem of people counting and localization. We also describe some methods that roughly estimate the people density without giving localization information. The analysis of both approaches and our constraints motivate us to follow the detection and tracking methods. We then briefly present fundamental knowledge about deep models for human detection and tracking.

3. Proposed Method: This chapter introduces our proposed framework for people detection and localization from videos. We describe each component of the framework in detail and how to implement it in practice.

4. Experiments: This chapter presents the datasets, evaluation protocol, technical setup, results, and discussions related to our experiments. In particular, we describe the process of collecting and annotating our new dataset in a classroom environment.

5. Conclusion: This chapter summarizes our work, highlights the contributions, analyzes the limitations, and provides some ideas for future research directions.


Chapter 2

Related works

This chapter presents some basic knowledge as well as related works regarding the research topic of this thesis. There are two approaches to the people counting problem from still images: i) the detection-based approach, which detects individuals and can then give the number of people in the scene; and ii) the density estimation-based approach, which gives only a rough number of persons without location information. Besides, to improve the detection result, some works deploy tracking techniques. We present the two approaches in sections 2.1 and 2.2, respectively. We then describe the tracking techniques in section 2.3. Finally, we conclude the chapter in section 2.4.

2.1 Detection based people counting

People counting can be carried out by detecting faces, heads, or bodies, depending on the context and the scene. The most common technique is detecting the human body, but in cases where the human body is occluded or in a challenging posture, the head and face can be alternative solutions.

2.1.1 Face detection based people counting

Face detection-based people counting aims to detect and track faces in images or real-time video streams, then count the number of detected faces. Face detection algorithms typically rely on deep learning models trained on large datasets to identify and locate faces in images or videos accurately.

Tsong-Yi Chen et al. presented an automatic people-counting system based on face recognition, in which people passing through a gate or door are counted using a video camera [5]. First, they use the image difference to detect the rough edges of moving people. Then, color features are utilized to locate people's faces. Based on the NCC (Normalized Color Coordinate) color space, the face is initially obtained by detecting the skin tone area, and then the subject's facial features are analyzed to determine whether the subject is a real face or not. After face detection, a person is tracked by following the recognized face, and this person is added to the count if the face touches the counting line.

Xi Zhao et al. presented a method of counting people based on face detection, tracking, and trajectory classification [2]. They first performed face detection and then face tracking by combining a new scale-invariant Kalman filter with a kernel-based color histogram tracking algorithm. From each face trajectory, the angles between neighboring points are extracted. Finally, to distinguish real face trajectories from fake ones, the authors used the K-NN classification method based on the Earth Mover's distance. The framework of this paper is described in Fig. 2.1.

Figure 2.1: Framework for a people counting based on face detection and tracking in a video [2]

Guangyu Zhao et al. [6] developed a system capable of detecting and counting people using the Kinect camera. The authors used depth information to identify false face detections, and then a 3D data association is used to link tracks with detection results. Finally, they counted the people who enter the region of interest using validated trajectories, as shown in Fig. 2.2.

Figure 2.2: Depth-assisted face detection and association for people counting [6]

Face detection nowadays can achieve very high accuracy. However, this problem still faces challenges such as variations in lighting conditions, occlusions, and face orientation. Besides, it requires the face to be in front of the camera. Without that assumption, the performance of face detection-based people counting may drop drastically.

2.1.2 Head detection based people counting

Head detection can be a more flexible solution to deal with the constraint of detecting only frontal faces. Bin Li et al. [7] proposed a people-counting method based on head detection and tracking. The purpose of this proposal is to estimate the number of people who move under an indoor overhead camera. This framework consists of four parts: foreground extraction, head detection, head tracking, and crossing-line judgment. Firstly, the proposed method utilizes a foreground extraction method to obtain foreground regions of moving people, and some morphological operations are employed to optimize the foreground regions. After that, it exploits an LBP (local binary pattern) feature-based Adaboost classifier for head detection in the optimized foreground regions. Once a head is detected, it is tracked by a local head tracking method based on the Mean Shift algorithm. Finally, based on head tracking, the method uses crossing-line judgment to determine whether the candidate head object will be counted.

of people are gathered. They generate scale-aware head proposals based on a scale map to cope with the problem of different scales. Scale-aware proposals are then fed to a Convolutional Neural Network (CNN), which provides a response matrix containing the presence probabilities of people observed across scene scales. Finally, they use non-maximal suppression to get accurate head positions. For the performance evaluation, they carry out extensive experiments on two standard datasets: S-HOCK and UCF-HDDC.

2.1.3 Hybrid detection based people counting

In various scenarios, a single head detector or face detector may not provide accurate results. Therefore, some researchers proposed hybrid detection that combines detection results from different human parts (body, head, face).

Hybrid detection-based people counting combines human body parts to improve the efficiency of counting people in a crowd. The Double Anchor R-CNN network (Fig. 2.4), recently proposed by Kevin Zhang [9], combines the head and body of a person. This network consists of 4 stages as follows:

1. A double-anchor region proposal network to generate head and body proposals in pairs.

2. A proposal crossover module to generate high-quality training samples for the R-CNN part.

3. A module to efficiently combine head and body features.

4. A generic NMS (Non-Maximum Suppression) algorithm for post-processing.
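The NMS post-processing step mentioned above can be sketched as follows. This is a minimal sketch of standard greedy NMS on axis-aligned boxes, not the specific variant used in [9]; the function names `iou` and `nms` and the 0.5 threshold are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In a crowd, a threshold that is too low can suppress a genuinely distinct neighbouring person, which is one motivation for the part-based designs surveyed in this section.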

Figure 2.4: Architecture of Double Anchor R-CNN [9]

Another approach also combines head and body using the JointDet architecture [10]. The JointDet network consists of four major components, as shown in Fig. 2.5: the RPN network, the Head R-CNN, the Body R-CNN, and the RDM. The head-to-body ratio is then calculated to get whole-body proposals. The head and body proposals are then submitted to two parallel R-CNN branches to obtain interim results. These temporary results are further processed to get the final results, as follows:

• Matching heads and bodies using the proposed strategy to output the matched head-body pairs as pair 1 to pair N;

• Extracting corresponding features of each pair for the RDM to discriminate their relation (i.e., whether they belong to the same person);

• Using the learned relationship to reduce head false positives and recall suppressed human detections.

To evaluate the effectiveness of the proposed method, they conducted extensive experiments on the CrowdHuman, CityPersons, and Caltech-USA datasets. The results show that their method has the best results compared to the previous methods.

Figure 2.5: Architecture of JointDet [10]

Through a survey of the methods of combining human body parts, we found that the accuracy of the hybrid models improved significantly. These findings have not only provided us with new ideas but also motivated us to develop more robust people-counting algorithms for crowded scenes.


2.2 Density estimation based people counting

Video surveillance systems are commonly deployed in very crowded scenes, where it is impractical to detect each individual. As a consequence, density estimation is an approach to approximately count the number of people.

The authors in [11] estimated people density in a crowded environment. In this paper, the authors proposed a two-fold method. First, they propose a density estimate for the crowd size. Second, they count the people in the crowd. As the density of the crowd increases, the congestion in the crowd also increases. To get around this, they use an improved adaptive K-GMM background subtraction method to extract the foreground accurately in real-time applications and avoid the estimation problem. By applying a boundary detection algorithm, they were able to estimate the size of the crowd. The number of people in a crowd was counted using the "canny edge detector" algorithm, the "connected component labeling" method, and the "centered bounding box" method. This article proposes a real-time video surveillance system. The proposed methods are evaluated on different datasets such as IBM, KTH, CAVIAR, PETS2009, and CROWD, which can be used for both the training and testing phases.

The authors in [12] proposed a supervised learning framework for visual object counting tasks, such as estimating the number of people in a surveillance video frame. Their goal is to accurately estimate the number of people. However, they avoided the difficult task of detecting and locating individual objects. Instead, they proposed to estimate an image density whose integral over any image region gives the count of objects in that region. Learning how to infer such a density can be formulated as minimizing a normalized quadratic cost function. So, they introduced a new loss function, well suited for such learning and efficiently computable through a maximum subarray algorithm. The problem can then be thought of as a convex quadratic program that is solvable with cutting-plane optimization. The proposed framework is flexible, as it can accept any domain-specific visual feature. Once trained, their system provides the number of objects and requires only a very short amount of time for the feature extraction step. Therefore, this model becomes

2.3 People tracking

Tracking is an efficient technique to improve the true positive rate when an object is missed by the detector at a given frame. In this section, we briefly present an overview of the object tracking problem, then two typical tracking techniques (SORT and DeepSORT) that are widely deployed in the literature. We finally analyze some works using tracking for the people counting problem.

2.3.1 Overview of object tracking

Object tracking is a technique used to assign a unique ID to each object as it moves over time. The process starts when the object appears and ends when the object has left the scene for a certain time. The goal of object tracking is to accurately identify

• Object tracking: Once the object is detected, the next step is to track it over time. Object tracking can be achieved using various techniques, including optical flow, mean-shift, particle filters, and Kalman filters. These techniques estimate the object's motion and predict its location in subsequent frames.

• Data association: In scenarios where there are multiple objects in the video or image stream, it is essential to associate each object's location with its corresponding identity. Data association techniques, such as the Hungarian algorithm, are used to match the detected objects with their previous locations to maintain their identities over time.

• Object re-detection: In some scenarios, the object of interest may disappear from the video or image stream for a short duration. Object re-detection techniques, such as template matching or appearance modeling, can be used to re-detect the object when it reappears.
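The data association step listed above can be sketched as a one-to-one matching between existing track boxes and new detection boxes that maximizes the total IoU. For brevity, this sketch replaces the Hungarian algorithm with a brute-force search over all assignments, which yields the same optimal matching but is only practical for a handful of boxes. The names `iou` and `associate` and the `iou_min` threshold are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Return sorted (track_index, detection_index) pairs with maximal total IoU."""
    if not tracks or not detections:
        return []
    # Enumerate assignments of the smaller set into the larger one.
    swap = len(tracks) > len(detections)
    small, large = (detections, tracks) if swap else (tracks, detections)
    best_total, best_perm = -1.0, None
    for perm in permutations(range(len(large)), len(small)):
        total = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    pairs = []
    for i, j in enumerate(best_perm):
        t, d = (j, i) if swap else (i, j)
        # Discard pairs that overlap too little to be the same object.
        if iou(tracks[t], detections[d]) >= iou_min:
            pairs.append((t, d))
    return sorted(pairs)


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches = associate(tracks, detections)
print(matches)
```

Detections left unmatched (here the third one) would typically start new tracks, while unmatched tracks are candidates for the re-detection step.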

2.3.2 Multiple Object Tracking

Simple object tracking assumes there is only one object in the scene. Tracking becomes harder when there are many objects. The multiple object tracking method aims to track all objects appearing in the frame by detecting and assigning identifiers to each object,

• Object absence: An object may go out of the frame and then reappear. Similar to the previous issue, this is about ID switches. It is necessary to solve the problem of object re-identification, including after occlusion or disappearance, to reduce the number of ID switches to the lowest possible level.

• Overlapping object trajectories: Objects with overlapping trajectories can also lead to the wrong assignment of IDs to objects, which is also a problem to deal with when working with multiple object tracking.

Figure 2.7 illustrates an example of multiple object tracking. In the first row, people are first detected and bounded by yellow boxes. The second row presents tracked people over time. Each person is identified by a color. The last row shows the case where one person is detected in the first frame (red bounding box) but missed in the next frames due to occlusion; this person is still kept by the tracking technique.
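The idea that a tracker can keep a person who is temporarily missed by the detector can be sketched with a simple age counter per track. This is an illustrative sketch only; the function name `update_track_ages` and the `max_age` value are assumptions, loosely inspired by the way SORT-style trackers drop tracks after several unmatched frames.

```python
def update_track_ages(track_ages, matched_ids, max_age=3):
    """Update per-track "frames since last match" counters for one frame.

    track_ages: dict mapping track ID -> frames since the last matched detection.
    matched_ids: set of track IDs matched to a detection in the current frame.
    Tracks unmatched for more than max_age frames are dropped.
    """
    updated = {}
    for track_id, age in track_ages.items():
        new_age = 0 if track_id in matched_ids else age + 1
        if new_age <= max_age:
            updated[track_id] = new_age
    return updated


# Track 2 is occluded in this frame but is still kept and still counted.
ages = update_track_ages({1: 0, 2: 1}, matched_ids={1})
people_count = len(ages)
print(ages, people_count)
```

Counting the live tracks rather than the raw detections is what allows an occluded person to remain in the count for a few frames.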

2.3.3 Tracking techniques

In the literature, a number of tracking techniques have been proposed for different tasks such as human tracking, robot tracking, and so on. In this section, we review three conventional techniques: Kalman filter, SORT, and DeepSORT, which are successively improved versions over time.


Figure 2.7: Multiple Object Tracking. (a) shows all the detection boxes with their scores. (b) shows the tracklets obtained by previous methods, which associate detection boxes whose scores are higher than a threshold, i.e. 0.5. The same box color represents the same identity. (c) shows the tracklets obtained by the proposed method in the paper. The dashed boxes represent the predicted boxes of the previous tracklets using the Kalman filter. The two low-score detection boxes are correctly matched to the previous tracklets based on the large IoU [13]

2.3.3.1 Kalman filter

The Kalman filter [14] was proposed by R. E. Kalman in 1960. It predicts the state of an object using previous information. The Kalman filter equations are categorized into two groups: prediction (time update) and correction (measurement update). The measurement update provides a feedback value that, combined with the prior state estimate, gives a posterior state estimate.

In order to use the Kalman filter to estimate the internal state of a process given only a sequence of noisy observations, one must model the process in accordance with the following framework. This means specifying the following matrices for each time-step k:

• $F_k$: the state-transition model;

• $H_k$: the observation model;

• $Q_k$: the covariance of the process noise;

• $R_k$: the covariance of the observation noise;

• and sometimes $B_k$: the control-input model as described below; if $B_k$ is included, then there is also

• $u_k$: the control vector, representing the controlling input into the control-input model.

The Kalman filter model assumes that the true state at time $k$ evolves from the state at $(k-1)$ according to Eq. 2.1:

$$x_k = F_k x_{k-1} + B_k u_k + w_k \tag{2.1}$$

where:

• $F_k$ is the state-transition model, which is applied to the previous state $x_{k-1}$;

• $B_k$ is the control-input model, which is applied to the control vector $u_k$;

• $w_k$ is the process noise, which is assumed to be drawn from a zero-mean multivariate normal distribution.

At time $k$, an observation (or measurement) $z_k$ of the true state $x_k$ is made according to Eq. 2.2:

$$z_k = H_k x_k + v_k \tag{2.2}$$

where:

• $H_k$ is the observation model, which maps the true state space into the observed space.
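
As an illustration only (this example is not taken from the thesis), consider a target moving along one axis, with a state made of a position $p_k$ and a velocity $\dot{p}_k$, a frame interval $\Delta t$, no control input, and a sensor that observes the position alone. Under these assumptions the model matrices could be written as:

$$x_k = \begin{bmatrix} p_k \\ \dot{p}_k \end{bmatrix}, \qquad F_k = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \qquad H_k = \begin{bmatrix} 1 & 0 \end{bmatrix}$$

so that, up to the process noise, Eq. 2.1 reduces to $p_k = p_{k-1} + \Delta t\,\dot{p}_{k-1}$ and $\dot{p}_k = \dot{p}_{k-1}$, while Eq. 2.2 reduces to $z_k = p_k + v_k$.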


$$K_k = \hat{P}_{k|k-1} H_k^\top S_k^{-1} \tag{2.7}$$

Updated (a posteriori) state estimate:

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k \tag{2.8}$$

Updated (a posteriori) estimate covariance:

$$P_{k|k} = (I - K_k H_k) \hat{P}_{k|k-1} \tag{2.9}$$
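
The update equations (2.7) to (2.9) can be sketched for the simplest possible case, a scalar state, where every matrix reduces to a number. This is a minimal illustration under that assumption, not the implementation used in this thesis; the function names `kalman_predict` and `kalman_update` and the default values of `F`, `Q`, `H` and `R` are hypothetical choices made for the example. The prediction step is the standard time update of the Kalman filter.

```python
def kalman_predict(x, P, F=1.0, Q=0.0):
    """Time update: propagate the state estimate and its variance one step."""
    x_pred = F * x
    P_pred = F * P * F + Q
    return x_pred, P_pred


def kalman_update(x_pred, P_pred, z, H=1.0, R=1.0):
    """Measurement update for a scalar state, mirroring Eqs. (2.7)-(2.9)."""
    y = z - H * x_pred              # innovation (pre-fit residual)
    S = H * P_pred * H + R          # innovation variance
    K = P_pred * H / S              # Kalman gain, Eq. (2.7)
    x_new = x_pred + K * y          # a posteriori state estimate, Eq. (2.8)
    P_new = (1.0 - K * H) * P_pred  # a posteriori variance, Eq. (2.9)
    return x_new, P_new, K


# Track a constant value of 2.0 starting from a poor initial guess of 0.0.
x, P = 0.0, 1.0
for z in [2.0, 2.0, 2.0]:
    x, P = kalman_predict(x, P)
    x, P, K = kalman_update(x, P, z)
```

With these illustrative settings the estimate moves toward the measurements while the variance shrinks at every step (1.0, then 0.5, then about 0.33, then 0.25), which is the feedback behaviour described above.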

Measurement post-fit residual:

$$\tilde{y}_{k|k} = z_k - H_k \hat{x}_{k|k} \tag{2.10}$$

2.3.3.2 SORT

SORT [15] stands for Simple Online and Realtime Tracking, an algorithm belonging to the tracking-by-detection (or detection-based tracking) family. A common feature of tracking-by-detection methods is that detection is performed separately and its results are then used to track the objects. The remaining task is to associate the bounding boxes obtained in each frame and assign an ID to each object. The processing steps for each new frame are therefore as follows:

• Detection: This step detects the objects in the frame and locates their positions. Any object detector can be applied. In the original SORT paper, Faster R-CNN was used.

• Prediction: This step uses the Kalman filter to predict the new positions of objects at frame t based on the previous t − 1 frames.

• Association: In multiple object tracking, an association algorithm is needed to associate each target with a detected object. In SORT, the Hungarian algorithm is used for this purpose.

Hungarian Algorithm

The Hungarian algorithm [16] was developed and published in 1955 as a solution to the assignment problem. Let n denote the number of detections (i = 1, 2, ..., n) and m the number of predicted tracks (j = 1, 2, ..., m), as shown in Fig. 2.8. The association of a detection i with a track j is based on a cost function, namely the distance between i and j in feature space. Details of the Hungarian algorithm can be found in the original paper [16]. In the following, we only review some concepts and ideas of the algorithm. The Hungarian algorithm tries to associate each detection with its


$\sum_{j=1}^{m} x_{ij} = 1, \quad i = 1, 2, \dots$

• Theorem 2: Let $C = [c_{ij}]$ be the cost matrix of the assignment problem (n detections, m tracks) and $X^* = [x_{ij}]$ an optimal solution to this problem. Suppose $C'$ is a matrix obtained from $C$ by adding a number $\alpha \neq 0$ (positive or negative)


• Step 3: Prediction: Once the association is made, the algorithm predicts the position of the objects in the next frame using a Kalman filter or another motion model.

• Step 4: Update: In this step, the tracked objects are updated with the new information from the current frame, such as the position and size of the objects.

• Step 5: Track management: The final step involves managing the tracked objects, such as removing objects that are no longer in the frame or creating new tracks for newly detected objects.

Note that the Hungarian algorithm produces three types of output: 1) matched tracks: a detection is found for a target, and this association is used to update the Kalman filter; 2) unmatched tracks: no detection matches the track, and the track may be deleted depending on its lifetime; 3) unmatched detections: newly detected objects that do not match any target are used to create new tracks.
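
The three outputs listed above can be sketched as follows, using the IoU distance (1 − IoU) between detection boxes and predicted track boxes as the cost. This is an illustration under stated assumptions rather than the SORT implementation: the optimal assignment is found here by brute-force enumeration, which stands in for the Hungarian algorithm and is only practical for a handful of boxes. The function names `iou` and `associate`, the box format `(x1, y1, x2, y2)`, and the threshold `iou_min=0.3` are hypothetical choices made for the example.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def associate(detections, tracks, iou_min=0.3):
    """Return (matches, unmatched_detections, unmatched_tracks) as index lists."""
    n, m = len(detections), len(tracks)
    overlap = [[iou(d, t) for t in tracks] for d in detections]
    # Enumerate every one-to-one pairing of the smaller set into the larger one.
    if n <= m:
        candidates = (list(zip(range(n), p)) for p in permutations(range(m), n))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(n), m))
    # The cost of a pairing is its summed IoU distance; keep the cheapest one.
    best = min(candidates,
               key=lambda pairs: sum(1.0 - overlap[i][j] for i, j in pairs))
    # Pairs with too little overlap are rejected and treated as unmatched.
    matches = sorted((i, j) for i, j in best if overlap[i][j] >= iou_min)
    matched_d = {i for i, _ in matches}
    matched_t = {j for _, j in matches}
    unmatched_detections = [i for i in range(n) if i not in matched_d]
    unmatched_tracks = [j for j in range(m) if j not in matched_t]
    return matches, unmatched_detections, unmatched_tracks


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]  # boxes predicted by the tracker
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches, new_dets, lost_tracks = associate(detections, tracks)
```

In this toy frame the first two detections are matched to the two tracks, while the third detection overlaps no track and would start a new track. In practice the enumeration would be replaced by a polynomial-time solver for the assignment problem.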


Figure 2.9: The tracking process of the SORT algorithm

2.3.3.3 DeepSORT

In the original version of SORT, the cost function is defined based on the IoU distance; it does not take the appearance similarity between detection and target into account. DeepSORT was developed by Nicolai Wojke and Alex Bewley [17] to address the shortcomings associated with a high number of ID switches. The solution proposed by DeepSORT is to use deep learning to extract object features, which increases accuracy in the data association process. In addition, a linking strategy known as ’Matching Cascade’ was developed to link objects that had previously disappeared more effectively.

DeepSORT is an improvement over the SORT (Simple Online and Realtime Tracking) algorithm in multiple ways:

• Association metric: DeepSORT uses the Mahalanobis distance metric to associate detected objects with existing tracks, while SORT uses the IoU distance. The Mahalanobis distance takes into account the covariance matrix of the data, which allows better handling of data with varying scales and correlation between dimensions. This results in a more accurate association of objects with existing tracks, even when there are occlusions or other objects in the scene.


• Feature embedding: DeepSORT uses a deep neural network to embed object features into a high-dimensional space, while SORT does not use appearance features. Deep learning-based embeddings are powerful and expressive, allowing for better discrimination between objects and reducing the risk of track drift.

• Track management: DeepSORT employs a track management strategy that allows better handling of occlusions, missed detections, and track fragmentation. Specifically, it uses a Kalman filter to estimate the position and velocity of the object, and a gating mechanism to filter out detections that are unlikely to belong to the track. It also uses a track initiation process to start new tracks when no existing track can be associated with a new detection.

Overall, DeepSORT’s improvements over SORT result in more accurate and robust tracking, especially in challenging scenarios where objects are partially occluded or move quickly.
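
The gating mechanism and the appearance comparison described above can be sketched as follows. This is a simplified illustration, not the DeepSORT code: the measurement is assumed to be a 2-D position so that the covariance inverse has a closed form, whereas the original formulation gates in a higher-dimensional measurement space. The names `mahalanobis_sq`, `gate` and `cosine_distance` are hypothetical, the threshold is the 0.95 quantile of the chi-square distribution with 2 degrees of freedom, and comparing embeddings by cosine distance is the usual choice for this kind of appearance feature.

```python
import math

# 0.95 quantile of the chi-square distribution with 2 degrees of freedom
CHI2_95_2DOF = 5.9915


def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance of a 2-D point z from a 2-D Gaussian."""
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # (dx, dy) @ inverse(cov) @ (dx, dy)^T using the closed-form 2x2 inverse
    return (dx * (d * dx - b * dy) + dy * (a * dy - c * dx)) / det


def gate(z, mean, cov, threshold=CHI2_95_2DOF):
    """Accept a detection only if it is statistically close to the track."""
    return mahalanobis_sq(z, mean, cov) <= threshold


def cosine_distance(u, v):
    """Appearance distance between two embedding vectors (0 means same direction)."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return 1.0 - dot / norm


# Track predicted at the origin, far more uncertain along x than along y.
mean, cov = (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0))
along_x = gate((3.0, 0.0), mean, cov)  # 3 units away in the uncertain direction
along_y = gate((0.0, 3.0), mean, cov)  # 3 units away in the certain direction
```

Both candidate detections lie at the same Euclidean distance from the predicted position, yet only the one along the uncertain axis passes the gate. This is the scale-aware behaviour attributed to the Mahalanobis distance in the list above.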

2.3.4 Tracking-based people counting

Tracking-based people counting is a method of counting people by using computer vision techniques to track individuals as they move through a space. There are several different tracking-based people-counting methods, including:

1. Single-camera tracking [18]: This method uses a single camera to track people as they move through the space. The camera captures images or video, and the software analyzes the data to identify individuals and track their movements.

2. Multiple-camera tracking [19]: This method uses multiple cameras placed throughout the space to capture images or video from different angles. The software combines the data from each camera to track people as they move between different areas.

3. Depth-based tracking [20]: This method uses cameras that can capture depth information, such as Microsoft’s Kinect camera, to track people as they move

Figure 2.10: Architecture of a people counting and tracking system [21].

The authors in [22] presented a novel multi-person tracking system for crowd counting and normal or abnormal event detection in indoor and outdoor surveillance environments. The proposed system consists of four modules, as shown in Fig. 2.11: people detection, head-to-torso template extraction, tracking, and crowd cluster analysis. First, the system extracts human silhouettes using an inverse transform as well as a median filter, reducing the computational cost and handling various complex monitoring situations. Second, people are detected by their heads and torsos, which vary less and are rarely occluded. Third, each person is tracked through consecutive frames using the Kalman filter technique with Jaccard similarity and normalized cross-correlation. Finally, template matching is used for crowd counting, with cue localization and clustering via Gaussian mapping for normal or abnormal event detection. The experimental results on two challenging video surveillance datasets, PETS2009 and UMN crowd analysis, demonstrate that the proposed system achieves 88.7% and 95.5% in terms of counting accuracy and detection rate, respectively.

Figure 2.11: Flow architecture of the proposed smart surveillance system [22]


2.4 Conclusion of the chapter

This chapter presented our study of methods for people counting based on people detection and tracking and on people density estimation. Detection and tracking-based people counting techniques are suitable for normally crowded scenes, while the latter is more suitable for highly crowded scenes. In this work, we follow the first approach, which detects and tracks humans to improve accuracy when occlusion occurs. We describe our proposed method in Chapter 3.

Title	People counting using detection and tracking techniques for smart video surveillance
Author	Ha Thi Oanh
Supervisor	Assoc. Prof. Tran Thi Thanh Hai
University	Hanoi University of Science and Technology
Major	Computer Science
Type	master's thesis
Year of publication	2023
City	Hanoi
