People Counting Using Detection And Tracking
Techniques For Smart Video Surveillance
Ha Thi Oanh
Ha Noi University of Science and Technology
Supervisor
Assoc Prof Tran Thi Thanh Hai
In partial fulfillment of the requirements for the degree of
Master of Computer Science
April 20, 2023
Dr. Doan Thi Huong Giang, who directly supported me.
The master's thesis is within the framework of the ministerial-level scientific research project "Research and development of an automatic system for assessing learning activities in class based on image processing technology and artificial intelligence", code CT2020.02.BKA.02, led by Assoc. Prof. Dr. Le Thi Lan. I sincerely thank the project for its support.
Finally, I wish to show my appreciation to all my friends and family members who helped me finalize the project.
Abstract
Video- or image-based people counting in real time has multiple applications in intelligent transportation, density estimation, class management, and so on. Although this problem has been widely studied, it still faces some main challenges due to crowded scenes and occlusion. In a common approach, this problem is carried out by detecting people using conventional detectors. However, this approach can fail when people stay in various postures or are occluded by each other. We notice that even when a main part of the human body is occluded, the face and head are still observable. In addition, a person who is not detected at one frame may be recovered at the previous or the next frames.
In this thesis, we attempt to improve the people counting result based on these observations. We first deploy two detectors (YOLO and RetinaFace) for detecting heads and faces of people in the scene. We then develop a pairing technique that aligns the face and the head of each person. This alignment helps to recover the missed detection of head or face and thus increases the true positive rate. To overcome the missed detection of both face and head at a certain frame, we apply a tracking technique (i.e. SORT) on the combined detection result. Putting all of these techniques in a unified framework helps to increase the true positive rate from 90.36% to 96.21% on the ClassHead Part_2 dataset.
Contents
List of Acronyms
1 Introduction
1.2 Scientific and practical significance
1.2.1 Scientific significance
1.2.2 Practical significance
1.2.3 Challenges and Motivation
1.3 Objectives and Contributions
1.3.2 Contributions
2 Related works
2.1 Detection based people counting
2.1.1 Face detection based people counting
2.1.2 Head detection based on people counting
2.1.3 Hybrid detection based on people counting
2.2 Density estimation based people counting
2.3.1 Overview of object tracking
2.3.2 Multiple Object Tracking
2.3.3 Tracking techniques
3 Proposed method for people counting
3.1 The proposed people counting framework
3.2 Yolo-based head detection
3.4 Combination of head and face detection
3.4.1 Linear sum assignment problem
3.4.2 Head-face pairing cost
3.5 Person tracking
4 Experiments
4.1 Dataset and Evaluation Metrics
4.1.1 Our collected dataset: ClassHead
4.1.1.1 ClassHead Part_1
4.1.1.2 ClassHead Part_2
4.1.2 Hollywood Heads dataset
4.1.4 Wider Face dataset
4.1.5 Evaluation metrics
4.1.5.1 Intersection over Union (IoU)
4.1.5.3 Precision and Recall
4.2.3 Evaluation on Wider Face dataset
4.2.4 Evaluation on ClassHead Part_2 dataset
5 Conclusions
5.1 Conclusion
5.2 Future Works
References
List of Figures
Illustration of the input and output of people counting from an image
Some challenges in crowd counting [1]
Framework for people counting based on face detection and tracking
System framework for depth-assisted face detection and association for people counting
System framework for a people counting method based on head detection and tracking
Network structure of Double Anchor R-CNN
Architecture of JointDet
Examples of people density estimation
Example of Multiple Object Tracking
Hungarian Algorithm
The tracking process of the SORT algorithm
Architecture of the proposed people counting and tracking system
Flow architecture of the proposed smart surveillance system
The proposed framework for people counting by pairing head and face detection and tracking
Automatic learning of bounding box anchors [4]
Activation functions used in Yolov5: (a) SiLU function, (b) Sigmoid function [4]
Example for creating dataset.yaml
An overview of the single-stage dense face localisation approach. RetinaFace is designed based on the feature pyramids with independent context modules. Following the context modules, we calculate a multi-task
Organize dataset for Yolo training
Example of RetinaFace testing on Wider Face dataset
Flowchart of combining object detection and tracking to improve the
Camera layout in the simulated classroom and an image obtained from
Results of Hollywood Heads dataset. (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
MAE measurement results of the 2 proposed methods on the Casablanca Heads dataset
Results of Casablanca dataset. (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
Results of Wider Face dataset. (a) Results of head detection; (b) Results of face detection; (c) Matching head and face detection using the Hungarian algorithm. Heads are denoted with green, faces are yellow, missed ground truths are red, and head-face pairings are cyan
MultiDetect results in ClassHead Part_2. (a) Head detections, (b) Face detections, (c) MultiDetect
Head tracking method results in ClassHead Part_2 dataset. (a) Head detections at frame 1, (b) Head tracking at frame 100
MultiDetect with Track method results in ClassHead Part_2 dataset. (a) MultiDetect with Track at frame 1, (b) MultiDetect with Track at frame
List of Tables
Setup camera parameters for data collection
ClassHead Part_1 dataset for training and testing Head detector Yolov5
ClassHead Part_2 dataset
Results of the proposed method on the Hollywood Heads dataset
Results of the proposed method on the Casablanca dataset
Results of the proposed method on Wider Face dataset
Results of the head detection method in ClassHead Part_2 dataset
Results of the MultiDetect method in ClassHead Part_2 dataset
Results of the Head Tracking in ClassHead Part_2 dataset
Results of the MultiDetect with Track method in ClassHead Part_2 dataset
Experimental results in the ClassHead Part_2 dataset after using 4 methods
List of Acronyms
CNN Convolutional Neural Network
RNN Recurrent Neural Network
LSTM Long Short-Term Memory
NN Neural Network
RPN Region Proposal Network
YOLO You Only Look Once
Chapter 1
Introduction
Recently, people counting in images or video has become an active research topic due to its wide range of applications, from public safety to intelligent crowd flow. Manual counting is impractical since it is a tedious and time-consuming task, particularly in crowded scenes. This chapter defines the problem of people counting and its challenges, and discusses the drawbacks of existing methods to motivate our work. We then clarify our objectives and contributions to this field. Finally, we describe the organization of the thesis.
People counting in crowds refers to the process of accurately counting the number of individuals present in a densely populated area or space. This is a challenging task due to the high density of people, occlusions, overlapping individuals, and the need to track people as they move through the crowd. People counting has been extensively studied in recent years, and it has numerous real-life applications, including event management, public safety, and transportation. For instance, it can be utilized to monitor crowd density and prevent overcrowding in public spaces, and to optimize and improve security at events and transportation hubs.
To capture people in crowds, sensors such as thermal imaging cameras, RGB cameras, and lasers may be used. RGB cameras are the most commonly utilized due to their low cost and their popularity in almost all public spaces. From video data, computer vision techniques such as object detection and tracking, optical flow, and background subtraction can identify and track individuals in the crowd. The problem of people counting from an image is defined as follows:
• Input: An image or a frame from a video sequence
• Output: The number of people and/or their locations in the frame/image
Figure 1.1 depicts the input and output of a people counting algorithm. The algorithm stores the number of people detected and determines the bounding box of each individual. Depending on the context, location data may be crucial for further processing. However, in highly crowded scenes, obtaining an exact count of people may be impractical, and an estimation of the number of individuals may be sufficient. In the next chapter, we will review some related works that provide an estimation of the people count with or without location and bounding box information.
Figure 1.1: Illustration of the input and output of people counting from an image
1.2 Scientific and practical significance
1.2.1 Scientific significance
People counting in crowds has several scientific implications, including:
• Crowd dynamics: People counting in crowds provides important data for studying crowd dynamics, such as how people move, how they interact with each other, and how they respond to changes in the environment. This information can be used to develop mathematical models of crowd behavior and improve our understanding of crowd dynamics.
• Social behavior: People counting in crowds can also provide insights into social behavior. It can help researchers understand how people interact with each other in crowded environments, such as how they form groups, how they communicate, and how they coordinate their movements.
• Computer vision and machine learning: People counting in crowds provides an important application for developing and evaluating computer vision and machine learning algorithms. It helps to advance the state of the art in object detection, tracking, segmentation, and classification, which are essential for people counting in crowded environments.
• Sensor technology: People counting in crowds also drives the development of new sensor technologies, such as cameras, depth sensors, and thermal sensors, that are designed to capture data in crowded environments. This helps to advance the field of sensor technology and improve our ability to capture data in challenging environments.
• Human-computer interaction: People counting in crowds can also provide insights into human-computer interaction, particularly in the context of intelligent systems. It helps to understand how people interact with technology in crowded environments and how technology can be designed to support people in these
organizers identify high-density areas and take action to prevent overcrowding, which is critical for public safety.
• Retail and marketing: People counting is an essential tool for retailers to optimize staffing levels, measure customer traffic, and improve customer service. It helps retailers identify high-traffic areas and monitor customer behavior, such as the time spent in specific sections or the frequency of return visits.
• Public safety and security: People counting is also an important tool for public safety and security, helping to monitor crowd density and prevent overcrowding in public spaces, as well as optimize staffing levels and improve security at events.
• Education: Student counting in the classroom environment is an important and reliable source of information for improving the quality of education by changing the content and teaching methods.
1.2.3 Challenges and Motivation
To solve the people counting problem, a number of approaches exist that have achieved impressive accuracy. However, this problem still faces many challenges, as follows:
• Occlusion: As crowd density increases, individuals may start to occlude each other, which poses a challenge for traditional detection algorithms and motivates the development of density estimation models.
Figure 1.2: Some challenges in crowd counting [1], with panels including rotation, illumination variation, and weather changes
• Complex background: In a natural scene, the background may be highly cluttered and contain objects with similar appearances or colors to the foreground, which can cause confusion.
• Scale variation: One of the primary problems that should be addressed in the density [...] designed to address the scale variation problem in the first step.
• Rotation: [...] the camera viewpoints, such as different poses and photographic angles.
• Illumination variation: The illumination varies at different times of the day, usually from dark to light and then back to dark, from dawn to dusk.
• Weather changes: The scenes in the wild are usually under various types of weather conditions, such as clear, clouds, rain, fog, thunder, and extra
illumination change (Fig. 1.2.e) and weather change (Fig. 1.2). These challenges cannot be solved in one model. In this thesis, we attempt to improve the people counting performance by overcoming occlusion and scale changes, although some other challenges may be implicitly resolved thanks to the studied model itself.
1.3 Objectives and Contributions
1.3.1 Objectives
The main goal of this thesis is to improve the performance of people counting from images/video by overcoming the occlusion issue in crowded scenes. To reach this goal, the specific objectives are as follows:
• Conduct a survey of existing techniques for human counting, analyze their main drawbacks, and then propose a suitable solution.
• Study and develop techniques for detecting humans that can be used for people counting and localization.
• Improve the techniques to avoid missed detection in crowded scenes.
1.3.2 Contributions
The work of this thesis is within the context of a project granted by the Ministry of Education and Training (MOET), with the project code CT2020.02.BKA.02. One of the tasks of this project is to detect and count the number of students in the classroom, and then create a density map of the students. This will aid in better management of the students and improve the quality of teaching and learning. As a result, besides validating the proposed method on benchmark datasets, we also take part in building a new dataset in the classroom and test our method on that dataset. We summarize the main contributions of our work as follows:
• First, we propose a method that combines the detection results of both face and head to improve the true positive rate of people counting (called MultiDetect).
• Second, we deploy a tracking technique to handle fast-moving objects that may cause motion blur effects and missed detection (called MultiDetect with Track).
• Finally, we conduct extensive experiments to validate the MultiDetect improvement on three benchmark datasets (Wider Face, Hollywood Heads, and Casablanca). We also build a new dataset in the MOET project, in which I participated in collecting and annotating the data, and we conduct extensive experiments to validate both improvements, MultiDetect and MultiDetect with Track, on our dataset.
1.4 Thesis outline
The thesis is structured into 5 chapters:
1. Introduction: This chapter provides the definition of the people counting problem and introduces its scientific and practical significance. Then, we describe some of the main challenges that motivated the work of this thesis. Finally, we present our objectives, contributions, and thesis outline.
2. Background and Related Works: This chapter conducts a survey of deep learning-based approaches for human detection and tracking for the problem of people counting and localization. We also describe some methods that roughly estimate the people density without giving localization information. The analysis of both approaches and our constraints motivates us to follow the detection and tracking methods. We then briefly present fundamental knowledge about deep models for human detection and tracking.
3. Proposed Method: This chapter introduces our proposed framework for people detection and localization from videos. We describe each component of the framework in detail and how to implement it in practice.
4. Experiments: This chapter presents the datasets, evaluation protocol, technical setup, results, and discussions related to our experiments. In particular, we describe the process of collecting and annotating our new dataset in a classroom environment.
5. Conclusion: This chapter summarizes our work, highlights the contributions, analyzes the limitations, and provides some ideas for future research directions.
Chapter 2
Related works
This chapter presents some basic knowledge as well as related works regarding the research topic of this thesis. There are two approaches to the people counting problem from still images: i) the detection-based approach, which detects individuals and can then give the number of people in the scene; ii) the density estimation-based approach, which gives only a rough number of persons without location information. Besides, to improve the detection result, some works deploy tracking techniques. We present these approaches in sections 2.1 and 2.2, respectively. We then describe the tracking techniques in section 2.3. Finally, we conclude the chapter in section 2.4.
2.1 Detection based people counting
People counting can be carried out by detecting faces, heads, or bodies, depending on the context and the scene. The most common technique is detecting the human body, but in cases where the human body is occluded or in a challenging posture, the head and face can be alternative solutions.
2.1.1 Face detection based people counting
Face detection-based people counting aims to detect and track faces in images or real-time video streams, then count the number of detected faces. Face detection algorithms typically rely on deep learning models trained on large datasets to identify and locate faces in images or videos accurately.
Tsong-Yi Chen et al. presented an automatic people-counting system based on face recognition, in which a video camera is placed to count people passing through a gate or door [5]. First, they use the image difference to detect the rough edges of moving people. Then, color features are utilized to locate people's faces. Based on the NCC (Normalized Color Coordinate) color space, the face is initially obtained by detecting the skin tone area, and then the subject's facial features are analyzed to determine whether the subject is a real face. After face detection, a person is tracked by following the recognized face, and this person is counted if his or her face touches the count line.
Xi Zhao et al. presented a method of counting people based on face detection, tracking, and trajectory classification [2]. They first performed face detection and then face tracking by combining a new scale-invariant Kalman filter with a kernel-based color histogram tracking algorithm. From each face trajectory, the angles of neighboring points are extracted. Finally, to distinguish real face trajectories from fake ones, the authors used the K-NN classification method based on the Earth Mover's distance. The framework of this paper is described in Fig. 2.1.
people using the Kinect camera. The authors used depth information for false face detection, and then a 3D data association is used to link tracks with detection results. Finally, they counted the people who enter the region of interest using a validated trajectory, as shown in Fig. 2.2.
Figure 2.2: Depth-assisted face detection and association for people counting [6]
Face detection nowadays can achieve very high accuracy. However, this problem still faces challenges such as variations in lighting conditions, occlusions, and face orientation. Besides, it requires a face in front of the camera. Without that assumption, the performance of face detection-based people counting may degrade drastically.
2.1.2 Head detection based on people counting
Head detection can be a more flexible solution to overcome the constraint of detecting only frontal faces. Bin Li et al. [7] proposed a people-counting method based on head detection and tracking. The purpose of this proposal is to estimate the number of people who move under an indoor overhead camera. This framework consists of four parts: foreground extraction, head detection, head tracking, and crossing-line judgment. Firstly, the proposed method utilizes a foreground extraction method to obtain foreground regions of moving people, and some morphological operations are employed to optimize the foreground regions. After that, it exploits an LBP (local binary pattern) feature-based AdaBoost classifier for head detection in the optimized foreground regions. Once a head is detected, it is tracked by a local head tracking method based on the Mean Shift algorithm. Finally, based on head tracking, the method uses crossing-line judgment to determine whether the candidate head object will be counted.
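The crossing-line judgment described above can be illustrated with a minimal sketch. It is not the implementation of [7]: it assumes that head tracking has already produced, for each track ID, a list of head centers over time, and that the counting line is horizontal. The function name `count_line_crossings` and the parameter `line_y` are hypothetical.

```python
def count_line_crossings(tracks, line_y):
    """Count tracks whose head center crosses the horizontal line y = line_y.

    tracks maps a track ID to the list of (x, y) head centers over time.
    A track is counted once, the first time two consecutive centers lie
    on opposite sides of the line (assumption: direction is ignored).
    """
    count = 0
    for centers in tracks.values():
        for (_, y_prev), (_, y_next) in zip(centers, centers[1:]):
            # Opposite signs mean the segment between two frames crosses the line.
            if (y_prev - line_y) * (y_next - line_y) < 0:
                count += 1
                break  # count each person at most once
    return count


# Hypothetical trajectories of three tracked heads.
tracks = {
    1: [(10, 40), (11, 48), (12, 55)],  # crosses y = 50 downwards
    2: [(30, 20), (31, 25), (30, 30)],  # never reaches the line
    3: [(50, 70), (52, 60), (51, 45)],  # crosses y = 50 upwards
}
crossings = count_line_crossings(tracks, line_y=50)
print(crossings)
```

Counting entries and exits separately would only require checking the sign of `y_next - y_prev` at the crossing.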
In [8], the authors proposed a deep model-based method that works as a head detector that takes scale variations into account. People counting in outdoor venues faces many challenges, such as severe occlusions, few pixels per head, and significant variations in a person's head size due to wide sports areas. This method is based on the notion that the head is the most visible part in sports venues where large numbers of people are gathered. They generate scale-aware head proposals based on a scale map to cope with the problem of different scales. Scale-aware proposals are then fed to a Convolutional Neural Network (CNN), which provides a response matrix containing the presence probabilities of people observed across scene scales. Finally, they use non-maximal suppression to get accurate head positions. For the performance evaluation, they carry out extensive experiments on two standard datasets: S-HOCK and UCF-HDDC.
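The non-maximal suppression step mentioned above can be sketched as follows. This is the generic greedy variant, not necessarily the exact procedure of [8]; boxes are assumed to be given as (x1, y1, x2, y2) tuples, and the names `iou`, `nms`, and `iou_threshold` are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    # Visit boxes from the highest to the lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two overlapping head proposals for one person, plus one separate head.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The number of kept boxes is then the head count for the image.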
2.1.3 Hybrid detection based on people counting
In various scenarios, a single head detector or face detector may not provide accurate results. Therefore, some researchers proposed hybrid detection that combines detection results from different human parts (body, head, face) to improve the accuracy of counting people in a crowd. The Double Anchor R-CNN network shown in Fig. 2.4, proposed by Kevin Zhang et al. [9], combines the head and body of a person.
Figure 2.4: Architecture of Double Anchor R-CNN
Another approach also combines head and body using the JointDet architecture [10].
The JointDet network consists of four major components, as shown in Fig. 2.5: the RPN network, the Head R-CNN, the Body R-CNN, and the RDM. The head-to-body ratio is then calculated to get whole-body proposals. The head and body proposals are then submitted to two parallel R-CNN branches to obtain interim results. These interim results are further processed to get the final results, as follows:
• Matching head and body using the proposed strategy to output the matched body-head pairs as pair 1 to pair N;
• Extracting corresponding features of each pair for the RDM to discriminate their relation (i.e., whether they belong to the same person);
• Using the learned relationship to reduce head false positives and recall
2.2 Density estimation based people counting
Video surveillance systems are commonly deployed in very crowded scenes where it is impractical to detect each individual. As a consequence, density estimation is an approach to approximately count the number of people.
The authors in [11] conducted an estimate of people density in a crowded environment. In this paper, the authors proposed a two-fold method: first, they estimate the density of the crowd; second, they count the people in the crowd. As the density of the crowd increases, the congestion in the crowd also increases. To get around this, they use an improved adaptive K-GMM background subtraction method to extract the foreground accurately in real-time applications and avoid the estimation problem. By applying a boundary detection algorithm, they were able to estimate the size of the crowd. The number of people in a crowd was counted using the "Canny edge detector" algorithm, the "connected component labeling" method, and the "centered bounding box" method. This article proposes a real-time video surveillance system. The proposed method is compared on different datasets such as IBM, KTH, CAVIAR, PETS2009, and CROWN. It can be used for both testing and training phases.
The authors in [12] proposed a supervised learning framework for visual object counting tasks, such as estimating the number of people in a surveillance video frame. Their goal is to accurately estimate the number of people. However, they avoided the difficult task of detecting and locating individual objects. Instead, they proposed to estimate an image density whose integral over any image region gives the count of objects within that region. Learning how to infer such a density can be formulated as minimizing a regularized quadratic cost function. So, they introduced a new loss function, well suited for such learning and efficiently computable through a maximum subarray algorithm. The problem can then be posed as a convex quadratic program that is solvable with cutting-plane optimization. The proposed framework is flexible, as it can accept any domain-specific visual feature. Once trained, their system provides the number of objects and requires only a very small time overhead on top of the feature extraction step. Therefore, this model becomes
Figure 2.6: Examples of people density estimation. Counting people in a surveillance video frame. Close-ups are shown alongside the images. The bottom close-ups show examples of the dotted annotations (crosses). This framework learns to estimate the number of objects in previously unseen images based on a set of training images of the same kind augmented with dotted annotations [12]
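To make the idea of counting by density estimation concrete, the sketch below sums a per-pixel density map over a region. It only illustrates how a count is read from a density map; the learning of the density itself in [12] is not reproduced. The toy map and the function name `count_in_region` are assumptions for illustration.

```python
def count_in_region(density, top, left, bottom, right):
    """Sum a density map over rows [top, bottom) and columns [left, right)."""
    return sum(sum(row[left:right]) for row in density[top:bottom])


# Toy 4x4 density map: each person contributes a total mass of 1,
# spread over a few neighboring pixels.
density = [
    [0.0, 0.25, 0.25, 0.0],
    [0.0, 0.25, 0.25, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
]
total = count_in_region(density, 0, 0, 4, 4)     # whole image
top_half = count_in_region(density, 0, 0, 2, 4)  # upper half only
print(total, top_half)
```

The estimated count is obtained without locating any individual, which is why this family of methods gives no bounding boxes.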
2.3 People tracking
Tracking is an efficient technique to improve the true positive rate when an object is missed by the detector at a given frame. In this section, we briefly present an overview of the object tracking problem, then two typical tracking techniques (SORT and DeepSORT) that are widely deployed in the literature. We finally analyze some works using tracking for people counting problems.
2.3.1 Overview of object tracking
Object tracking is a technique used to assign a unique ID to each object as it moves over time. The process starts when the object appears and ends when the object has left the scene for a certain time. The goal of object tracking is to accurately identify objects of interest, estimate their trajectories in the video, and track them as they move. The object tracking problem involves:
• Object detection: The first step in real-time object tracking is to detect the object of interest in each frame of the video or image stream. There are various object detection techniques available, including feature-based methods such as scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and histograms of oriented gradients (HOG), as well as deep learning-based object detection algorithms such as YOLO (You Only Look Once), SSD (Single Shot Detector), and Faster R-CNN (Region-based Convolutional Neural Network).
• Object tracking: Once the object is detected, the next step is to track it over time. Object tracking can be achieved using various techniques, including optical flow, mean-shift, particle filters, and Kalman filters. These techniques estimate the object's motion and predict its location in subsequent frames.
• Data association: In scenarios where there are multiple objects in the video or image stream, it is essential to associate each object's location with its corresponding identity. Data association techniques, such as the Hungarian algorithm, are used to match the detected objects with their previous locations to maintain their identities over time.
• Object re-detection: In some scenarios, the object of interest may disappear from the video or image stream for a short duration. Object re-detection techniques, such as template matching or appearance modeling, can be used to re-detect the object when it reappears.
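The data association step listed above can be sketched as an assignment problem over IoU scores. To stay self-contained, the sketch below finds the optimal assignment by brute-force search over permutations instead of the Hungarian algorithm; both return an optimal matching, but the brute-force search is only practical for a handful of objects. The names `associate` and `min_iou` are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def associate(track_boxes, det_boxes, min_iou=0.3):
    """Return {track_index: detection_index} maximizing the total IoU.

    Brute-force stand-in for the Hungarian algorithm (assumption: few boxes).
    Pairs whose IoU is below min_iou are dropped, so the track stays unmatched.
    """
    n_t, n_d = len(track_boxes), len(det_boxes)
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p))
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d)))
                      for p in permutations(range(n_t), n_d))
    best_pairs, best_score = [], -1.0
    for pairs in candidates:
        score = sum(iou(track_boxes[t], det_boxes[d]) for t, d in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return {t: d for t, d in best_pairs
            if iou(track_boxes[t], det_boxes[d]) >= min_iou}


# Two existing tracks and three detections in the new frame.
track_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
det_boxes = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
matches = associate(track_boxes, det_boxes)
print(matches)
```

Detections that appear in no pair, such as the third one here, would start new identities.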
2.3.2 Multiple Object Tracking
Simple object tracking assumes there is only one object in the scene. Tracking becomes harder when there are many objects. The multiple object tracking method aims to track all objects appearing in the frame by detecting and assigning identifiers to each object, as shown in Fig. 2.7. In addition, the ID assigned to an object needs to be consistent across frames. Multiple object tracking requires handling:
• Accurate object detection: This is a critical task, especially for detection-based tracking, to ensure the presence of all objects in the scene.
• Occluded objects: Objects may be partially or completely obscured. When an ID is assigned to an object, the ID should be consistent throughout the video. However, when an object is obscured, relying solely on object detection is not enough to solve this problem.
• Object absence: An object may go out of the frame and then reappear. Similar to the previous issue, this is about ID switches. It is necessary to solve the problem of object re-identification, including obscuring or disappearing, to reduce the number of ID switches to the lowest possible level.
• Overlapping object trajectories: Objects with overlapping trajectories can also lead to the wrong assignment of IDs to objects, which is also a problem to deal with when working with multiple object tracking.
Figure 2.7 illustrates an example of multiple object tracking. In the first row, people are first detected and bounded by yellow boxes. The second row presents tracked people over time. Each person is identified by a color. The last row shows the case where one person is detected in the first frame (red bounding box) but missed in the next frames due to occlusion; the person is still kept by the tracking technique.
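The last case above, where a person missed by the detector is still kept by the tracker, can be illustrated with a small bookkeeping sketch. It assumes that a data association step has already produced the matches between tracks and detections for each frame; the class `TrackStore` and the parameter `max_age` are hypothetical names, and no motion prediction is included.

```python
class TrackStore:
    """Keeps track identities alive for a few frames without a detection."""

    def __init__(self, max_age=3):
        self.max_age = max_age  # frames a track may stay unmatched
        self.tracks = {}        # track ID -> {"box": box, "missed": count}
        self.next_id = 1        # IDs are never reused

    def step(self, detections, matches):
        """Update tracks for one frame and return the current people count.

        detections: list of boxes in the current frame
        matches: {track_id: detection_index} from the association step
        """
        matched_dets = set(matches.values())
        for track_id in list(self.tracks):
            if track_id in matches:
                box = detections[matches[track_id]]
                self.tracks[track_id] = {"box": box, "missed": 0}
            else:
                # Missed detection: keep the last known box for a while.
                self.tracks[track_id]["missed"] += 1
                if self.tracks[track_id]["missed"] > self.max_age:
                    del self.tracks[track_id]
        for index, box in enumerate(detections):
            if index not in matched_dets:
                self.tracks[self.next_id] = {"box": box, "missed": 0}
                self.next_id += 1
        return len(self.tracks)


store = TrackStore(max_age=2)
counts = [
    store.step([(0, 0, 10, 10), (20, 20, 30, 30)], {}),  # two people appear
    store.step([(1, 0, 11, 10)], {1: 0}),                # person 2 occluded
    store.step([(2, 0, 12, 10)], {1: 0}),                # still occluded
    store.step([(3, 0, 13, 10)], {1: 0}),                # occluded too long
]
print(counts)
```

During the short occlusion the count stays at two, which is the effect that raises the true positive rate; the choice of `max_age` trades this robustness against counting people who have actually left.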
2.3.3 Tracking techniques
In the literature, a number of tracking techniques have been proposed for different tasks such as human tracking, robot tracking, and so on. In this section, we review three conventional techniques: Kalman filter, SORT, and DeepSORT, which are successively improved versions.
Figure 2.7: Multiple Object Tracking. (a) shows all the detection boxes with their scores. (b) shows the tracklets obtained by previous methods, which associate detection boxes whose scores are higher than a threshold, i.e. 0.5. The same box color represents the same identity. (c) shows the tracklets obtained by the proposed method in the paper, which associates every detection box. The dashed boxes represent the predicted boxes of the previous tracklets using the Kalman filter. The two low-score detection boxes are correctly matched to the previous tracklets based on the large IoU [13]
The Kalman filter [14] was proposed by R. E. Kalman in 1960. The Kalman filter predicts the state of an object using previous information. Its equations fall into two groups: prediction (time update) and correction (measurement update). The measurement update provides a feedback value that, combined with the prior state estimate, gives a posterior state estimate.
In order to use the Kalman filter to estimate the internal state of a process given only a sequence of noisy observations, one must model the process in accordance with the following framework. This means specifying the following matrices for each time-step k:
• $F_k$: the state-transition model;
• $H_k$: the observation model;
• $Q_k$: the covariance of the process noise;
• $R_k$: the covariance of the observation noise;
• and sometimes $B_k$: the control-input model as described below; if $B_k$ is included, then there is also
• $u_k$: the control vector, representing the controlling input into the control-input model.
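The Kalman filter model assumes that the true state at time k evolves from the state at time (k − 1) according to Eq. 2.1:
$x_k = F_k x_{k-1} + B_k u_k + w_k$ (2.1)
where
• $F_k$ is the state-transition model, which is applied to the previous state $x_{k-1}$;
• $B_k$ is the control-input model, which is applied to the control vector $u_k$;
• $w_k$ is the process noise, which is assumed to be drawn from a zero-mean multivariate normal distribution with covariance $Q_k$.
At time k, an observation $z_k$ of the true state $x_k$ is made according to:
$z_k = H_k x_k + v_k$
where
• $H_k$ is the observation model;
• $v_k$ is the observation noise, which is assumed to be zero-mean Gaussian white noise with covariance $R_k$.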
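As a hedged illustration that is not taken from the thesis, consider a 1-D constant-velocity model in which the state collects a position $p_k$ and a velocity $\dot{p}_k$, only the position is observed, and no control input is used. The sampling interval $\Delta t$ is an assumed parameter. Under these assumptions, the matrices listed above could be chosen as:

```latex
x_k = \begin{bmatrix} p_k \\ \dot{p}_k \end{bmatrix}, \quad
F_k = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \quad
H_k = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad
B_k = 0
```

With this choice, $F_k x_{k-1}$ simply advances the position by $\Delta t\,\dot{p}_{k-1}$ and keeps the velocity unchanged, while $H_k x_k$ returns the position alone.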
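The processing steps of the Kalman filter can be divided into two main parts (probability-based approach):
Step 1: Predict
Predicted (a priori) state estimate:
$\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1} + B_k u_k$
Predicted (a priori) estimate covariance:
$P_{k|k-1} = F_k P_{k-1|k-1} F_k^{T} + Q_k$
Step 2: Update
Innovation (measurement residual), where $z_k$ is the observation at time k:
$\tilde{y}_k = z_k - H_k \hat{x}_{k|k-1}$
Innovation covariance:
$S_k = H_k P_{k|k-1} H_k^{T} + R_k$
Kalman gain:
$K_k = P_{k|k-1} H_k^{T} S_k^{-1}$
Updated (a posteriori) state estimate:
$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k$
Updated (a posteriori) estimate covariance:
$P_{k|k} = (I - K_k H_k) P_{k|k-1}$ (2.8)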
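The predict and update steps can be sketched in a few lines of pure Python. This is only a minimal illustration under stated assumptions, not the implementation used in the thesis: the state is a 1-D position and velocity, only the position is measured, there is no control input, and the noise levels `q` and `r`, the identity initial covariance, and the function names `predict`, `update` and `track` are all hypothetical choices.

```python
# Minimal 1-D constant-velocity Kalman filter (illustrative sketch only).
# State x = [position, velocity]; the measurement z is the position alone.

def predict(x, P, dt=1.0, q=1e-2):
    """A priori estimate: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]]."""
    p, v = x
    x_pred = [p + dt * v, v]
    (a, b), (c, d) = P
    # F P F^T expanded by hand for the 2x2 case; Q = q * I is an assumed noise model.
    P_pred = [
        [a + dt * (b + c) + dt * dt * d + q, b + dt * d],
        [c + dt * d, d + q],
    ]
    return x_pred, P_pred


def update(x, P, z, r=1.0):
    """A posteriori estimate with H = [1, 0] and assumed scalar measurement noise r."""
    y = z - x[0]                      # innovation: z - H x
    s = P[0][0] + r                   # innovation covariance: H P H^T + R
    k = [P[0][0] / s, P[1][0] / s]    # Kalman gain: P H^T / s
    x_new = [x[0] + k[0] * y, x[1] + k[1] * y]
    # (I - K H) P expanded for H = [1, 0]
    P_new = [
        [(1 - k[0]) * P[0][0], (1 - k[0]) * P[0][1]],
        [P[1][0] - k[1] * P[0][0], P[1][1] - k[1] * P[0][1]],
    ]
    return x_new, P_new


def track(measurements, dt=1.0):
    """Run predict/update over a list of position measurements."""
    x = [measurements[0], 0.0]        # start at the first measurement, zero velocity
    P = [[1.0, 0.0], [0.0, 1.0]]      # assumed initial uncertainty
    estimates = []
    for z in measurements[1:]:
        x, P = predict(x, P, dt)
        x, P = update(x, P, z)
        estimates.append(x)
    return estimates
```

For example, feeding `track` a sequence of positions that grow by 2 units per frame lets the velocity estimate move from its initial value 0 towards 2, although velocity is never measured directly.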
SORT [15] stands for Simple Online and Realtime Tracking, an algorithm belonging to the tracking-by-detection (or detection-based tracking) family. In tracking-by-detection, a common feature is to separate the detection step and use its result to track the objects. The next task is to find a way to associate the bounding boxes obtained in each frame and assign an ID to each object. Therefore, the processing steps for each new frame are as follows:
• Detection: This step aims to detect the objects in the frame and locate their positions precisely. Any object detector can be applied. In the original SORT paper, Faster R-CNN was used.
• Prediction: This step uses the Kalman filter to predict the new positions of objects at frame t based on the previous t − 1 frames.
• Association: In the case of multiple object tracking, an association algorithm is needed to associate a target with a detected object. In the SORT algorithm, the Hungarian algorithm was deployed for this purpose.
Hungarian Algorithm
The Hungarian algorithm [16] was developed and published in 1955 as a solution to the assignment problem. Let n denote the number of detections (i = 1, 2, ..., n) and m the number of predicted tracks (j = 1, 2, ..., m), as shown in Fig. 2.8. The association of detection i with track j is based on a cost function, which is the distance between i and j in feature space. Details of the Hungarian algorithm can be found in the original paper [16]. In the following, we just review some concepts and ideas of the algorithm. The Hungarian algorithm tries to associate each detection with a track so that the total association cost is minimized. It is based on the following results:
• Theorem 1: Suppose the cost matrix of the assignment problem is non-negative and has at least n zero elements. Furthermore, if these n zeros are in n different rows and n different columns, then the assignment corresponding to these zeros is an optimal solution.
• Theorem 2: Let $C = [c_{ij}]$ be the cost matrix of the assignment problem (n detections, m tracks) and $X^* = [x_{ij}]$ a solution (an optimal solution) to this problem. Suppose $C'$ is a matrix obtained from $C$ by adding a number $\alpha \neq 0$ (positive or negative) to each element in row r of $C$. Then $X^*$ is also the solution to the assignment problem with the cost matrix $C'$.
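The row-shift property stated above can be checked numerically on a small example. The sketch below does not implement the Hungarian algorithm itself; it uses a brute-force search over all permutations, which is only practical for tiny matrices, and the cost values, the shift of 7 and the function name `best_assignment` are made up for illustration.

```python
from itertools import permutations


def best_assignment(cost):
    """Brute-force optimal assignment for a small square cost matrix.

    Returns a tuple `cols` where row i is assigned to column cols[i].
    """
    n = len(cost)
    return min(
        permutations(range(n)),
        key=lambda cols: sum(cost[i][cols[i]] for i in range(n)),
    )


# Hypothetical 3x3 cost matrix (rows: detections, columns: tracks).
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]

# Add the same number (here 7) to every element of row r = 1.
shifted = [row[:] for row in cost]
shifted[1] = [c + 7 for c in shifted[1]]

original_solution = best_assignment(cost)
shifted_solution = best_assignment(shifted)
```

Both calls return `(1, 0, 2)`: every candidate assignment uses exactly one element of the shifted row, so all total costs increase by the same amount and the ranking of assignments is unchanged. This is the property that lets the Hungarian algorithm subtract row (and column) minima to create zeros without changing the optimal assignment.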
Main steps of the SORT algorithm
The processing steps of SORT are shown in Fig. 2.9:
Step 1: Detection: The first step is to detect objects in each frame of a video using a computer vision algorithm such as a neural network-based detector.
Step 2: Association: The next step is to associate the detected objects with previously tracked objects from previous frames. This is done by comparing the features of the detected objects with those of the existing tracked objects and assigning a similarity score.
Step 3: Prediction: Once the association is made, the algorithm predicts the position of the objects in the next frame using a Kalman filter or another motion model.
Step 4: Update: In this step, the tracked objects are updated with the new information from the current frame, such as the position and size of the objects.
Step 5: Track management: The final step involves managing the tracked objects, such as removing objects that are no longer in the frame or creating new tracks for newly detected objects.
Note that there are three types of output of the Hungarian algorithm: 1) Matched tracks: a detection corresponding to a target is found, and this association is then used to update the Kalman filter; 2) Unmatched tracks: no detection is found to match the track, so the track may be deleted depending on its lifetime; 3) Unmatched detections: new objects are detected that are not matched with any target, so they are used to create new tracks.
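The three kinds of output can be sketched with an IoU-based association. Note the simplification: for brevity this sketch uses a greedy best-IoU-first matching instead of the Hungarian algorithm, so it is not guaranteed to find the globally optimal assignment. The box format `(x1, y1, x2, y2)`, the threshold `iou_min=0.3` and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Greedy IoU matching (a stand-in for the Hungarian algorithm).

    Returns (matches, unmatched_tracks, unmatched_detections), where matches
    is a list of (track_index, detection_index) pairs.
    """
    pairs = sorted(
        (
            (iou(t, d), ti, di)
            for ti, t in enumerate(tracks)
            for di, d in enumerate(detections)
        ),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_min:
            break  # remaining pairs overlap too little to be the same object
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched_tracks = [ti for ti in range(len(tracks)) if ti not in used_t]
    unmatched_dets = [di for di in range(len(detections)) if di not in used_d]
    return matches, unmatched_tracks, unmatched_dets


# Hypothetical predicted track boxes and new detections.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(1, 1, 11, 11), (50, 50, 60, 60)]
matches, lost_tracks, new_detections = associate(tracks, detections)
```

In this example the first track is matched to the first detection (its Kalman filter would be updated), the second track is unmatched (a candidate for deletion depending on its lifetime), and the second detection is unmatched (a candidate for a new track).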
In the original version of SORT, the cost function is defined based on the IoU distance; it does not take the appearance similarity between detection and target into account. DeepSORT was developed by Nicolai Wojke and Alex Bewley [17] to address the missed-association problems that lead to a high number of ID switches. The solution proposed by DeepSORT uses deep learning to extract object features in order to increase accuracy in the data association process. In addition, a linking strategy known as 'Matching Cascade' was developed to link objects that had previously disappeared more effectively.
DeepSORT is an improvement over the SORT (Simple Online and Realtime Tracking) algorithm in multiple ways:
• Association metric: DeepSORT uses the Mahalanobis distance metric to associate detected objects with existing tracks, while SORT uses the IoU distance. The Mahalanobis distance takes into account the covariance matrix of the data, which allows better handling of data with varying scales and correlation between dimensions. This results in a more accurate association of objects with existing tracks, even when there are occlusions or other objects in the scene.
• Feature embedding: DeepSORT uses a deep neural network to embed object features into a high-dimensional space, while SORT does not use appearance features. Deep learning-based embeddings are more powerful and expressive than hand-crafted features such as color histograms, allowing for better discrimination between objects and reducing the risk of track drift.
• Track management: DeepSORT employs a track management strategy that allows better handling of occlusions, missed detections and track fragmentation. Specifically, it uses a Kalman filter to estimate the position and velocity of the object, and a gating mechanism to filter out detections that are unlikely to belong to the track. It also uses a track initiation process to start new tracks when no existing track can be associated with a new detection.
Overall, DeepSORT's improvements over SORT result in more accurate and robust tracking, especially in challenging scenarios where objects are partially occluded or move quickly.
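The two ingredients discussed above, motion gating with the Mahalanobis distance and appearance comparison of embeddings, can be sketched as follows. This is a simplified illustration rather than the DeepSORT implementation: it works on a 2-D measurement (for example a box center) instead of the full bounding-box state, it compares embeddings with the cosine distance, and the function names are hypothetical. The gate value is the 0.95 quantile of the chi-square distribution with 2 degrees of freedom.

```python
import math

# 0.95 quantile of the chi-square distribution with 2 degrees of freedom.
CHI2_95_2DOF = 5.9915


def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance of a 2-D point z from (mean, cov)."""
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Uses the closed-form inverse of a 2x2 matrix: 1/det * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def gate(z, mean, cov, threshold=CHI2_95_2DOF):
    """True if the detection z is plausible for the track's predicted state."""
    return mahalanobis_sq(z, mean, cov) <= threshold


def cosine_distance(u, v):
    """1 - cosine similarity between two appearance embeddings."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical track prediction: more uncertain along x than along y.
mean = (0.0, 0.0)
cov = [[4.0, 0.0], [0.0, 1.0]]
near_along_x = gate((2.0, 0.0), mean, cov)   # squared distance 1.0
far_along_y = gate((0.0, 3.0), mean, cov)    # squared distance 9.0
```

The example shows why the covariance matters: a displacement of 2 along the uncertain x direction passes the gate, while a displacement of 3 along the well-estimated y direction is rejected, even though the two Euclidean distances are of similar size.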
2.3.4 Tracking-based people counting
Tracking-based people counting is a method of counting people by using computer vision techniques to track individuals as they move through a space. There are several different tracking-based people counting methods, including:
1. Single-camera tracking [18]: This method uses a single camera to track people as they move through the space. The camera captures images or video, and the software analyzes the data to identify individuals and track their movements.
2. Multiple-camera tracking [19]: This method uses multiple cameras placed throughout the space to capture images or video from different angles. The software combines the data from each camera to track people as they move between different areas.
3. Depth-based tracking [20]: This method uses cameras that can capture depth information, such as Microsoft's Kinect camera, to track people and their movements as they move.
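Whatever the sensor setup, the counting step itself can be reduced to counting distinct track identities. The sketch below is a minimal illustration under assumptions that are not from the thesis: the tracker is assumed to output a list of track IDs per frame, and a track is only counted once it has been seen in at least `min_hits` frames (a hypothetical parameter) so that short-lived false tracks are ignored.

```python
from collections import Counter


def count_people(frames, min_hits=3):
    """Count distinct track IDs that appear in at least `min_hits` frames.

    `frames` is a list with one entry per frame, each entry being the list
    of track IDs reported by the tracker for that frame.
    """
    hits = Counter()
    for track_ids in frames:
        hits.update(set(track_ids))  # count each ID at most once per frame
    return sum(1 for n in hits.values() if n >= min_hits)


# Hypothetical tracker output for six frames; ID 7 is a one-frame false track.
frames = [[1, 2], [1, 2], [1, 2, 7], [1, 3], [1, 3], [3]]
total = count_people(frames)
```

Here `total` is 3: IDs 1, 2 and 3 are each seen in at least three frames, while ID 7 is discarded. The accuracy of such a count depends directly on the number of ID switches, since each switch creates an extra identity for the same person.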
The research in [21] aims to develop an accurate and efficient system capable of error-free counting and tracking in public places. In the proposed pipeline, random particles are distributed and features are extracted. Subsequently, particle flows are clustered using a self-organizing map, and people counting and tracking are performed based on motion trajectories. The tests on the PETS-2009 and TUD-Pedestrian datasets achieved high results.
The smart surveillance system proposed in [22] comprises four stages, as shown in Fig. 2.11: people detection, head-to-torso template extraction, tracking, and crowd cluster analysis. Firstly, the system extracts human silhouettes using an inverse transform as well as a median filter, reducing the cost of computing and handling various complex monitoring situations. Secondly, people are detected by their heads and torsos, since these are less varied and barely occluded. Thirdly, each person is tracked through consecutive frames using the Kalman filter technique with Jaccard similarity and normalized cross-correlation. Finally, template matching is used for crowd counting with cue localization and clustering via Gaussian mapping for normal or abnormal event detection. The experimental results on two challenging video surveillance datasets, PETS2009 and the UMN crowd analysis dataset, demonstrate that the proposed system provides 88.7% and 95.5% in terms of counting accuracy and detection rate, respectively.
[Figure 2.11 block labels: Frames Extraction, Grayscale Conversion, Inverse Transform, Morphology, Background Removal, Connected Regions, Silhouette Extraction, People Localization, People Extraction, Head Detection, Template Matching, Gaussian Smoothing, Crowd Clustering, Cluster Analysis]
Figure 2.11: Flow architecture of the proposed smart surveillance system [22]
2.4 Conclusion of the chapter
This chapter presented our study of methods for people counting based on people detection and tracking and on people density estimation. The detection- and tracking-based people counting techniques are suitable for normally crowded scenes, while the latter is more suitable for highly crowded scenes. In this work, we follow the first approach, which detects and tracks humans to improve the accuracy when occlusion appears. We will describe our proposed method in Chapter 3.