Developing a Real-Time Object Detection System on FPGA

DOCUMENT INFORMATION

Basic information

Title: Developing A Real-Time Object Detection System On FPGA
Author: Nguyễn Trung Kiên
Supervisors: Dr. Bùi Duy Hiểu, Dr. Trần Thị Thúy Quỳnh, Assoc. Prof. Dr. Erwan Libessart
Universities: Paris-Saclay University and VNU University of Engineering and Technology
Major: Electronics, Electrical Energy, Automation, Communication and Data Engineering
Document type: Master thesis
Year: 2024
City: Hà Nội
Pages: 57
File size: 1.95 MB


Structure

  • Chapter 1. Real-time Object Detection System
    • 1.1. Object Detection Overview
    • 1.2. Real-time Object Detection Challenges
    • 1.3. Related Works
    • 1.4. Conclusions
  • Chapter 2. Proposed Real-time Object Detection System
    • 2.1. Proposed System using Zipfian Estimation Technique and HOG-SVM
      • 2.1.1. Dataflow
      • 2.1.2. Proposed System
    • 2.2. Lightweight Motion Detection Algorithm used on a stationary camera
    • 2.3. Object Detection using the HOG-SVM Algorithm
      • 2.3.1. State of the Art
      • 2.3.2. Object Detection Block
    • 2.4. Conclusions
  • Chapter 3. Implementation and Evaluations
    • 3.1. Experiment setup environment
      • 3.1.1. Mean Average Precision
      • 3.1.2. Frame rate
    • 3.2. Experimental results
      • 3.2.1. Input: PETS09-S2L1 [39]
      • 3.2.2. Input: TUD-Stadtmitte [40]
      • 3.2.3. Input: TUD-Campus [41]

Content

ABSTRACT

Real-time object detection is now a key capability for embedded computer vision platforms used in numerous applications, from surveillance to transportation and defense systems.

Real-time Object Detection System

Object Detection Overview

Object detection is a sophisticated computer vision technology that allows machines to identify and locate specific classes of objects, including people, vehicles, and animals, within digital images and videos. Utilizing advanced machine learning algorithms, this technology effectively detects various objects while pinpointing their presence and location in visual data.

Object detection is a vital technology utilized across various industries, enhancing applications such as surveillance systems for automatic identification of people and vehicles, crucial for security. In agriculture, it facilitates livestock counting, crop monitoring, and disease detection. Additionally, autonomous vehicles rely on object detection to identify cars, pedestrians, traffic signals, and road signs in real time for safe navigation. This technology is also employed in advanced systems, including high-tech weaponry such as kamikaze drones.

Object detection accuracy is influenced by various factors, including color, contrast, quality, shape, and the orientation of the object. Among these, orientation poses a significant challenge for detection models, as many objects, such as humans and animals, can assume different poses and orientations, greatly affecting detection performance.

The accuracy of an AI model in object detection is influenced by various factors, including the specific image, lighting conditions, contrast, and the size of the object. To achieve optimal accuracy in image processing, it is essential to choose the right object detection model or algorithm tailored to each application.

Figure 1 - Surveillance and Public Safety Camera [3].

Figure 2 - Autonomous surveillance along the border in the USA [4]

Object detection is a crucial aspect of computer vision that allows automated systems to identify and locate objects within visual data. The accuracy of object detection can be influenced by various factors. Recent advancements in deep learning, combined with traditional feature extraction algorithms and machine learning techniques, have significantly enhanced the effectiveness of object detection systems.

This thesis explores the advancements in real-time object detection, emphasizing the significance of performance on embedded edge devices. With numerous possibilities for development in object detection applications, ensuring efficient processing and response times is essential for practical implementation.

Real-time Object Detection Challenges

Real-time object detection is a computer vision task that quickly locates and classifies objects in images or videos. Its central challenge is balancing accuracy and speed.

Real-time processing is crucial for the successful operation of various applications, enhancing their value and efficiency. Implementing real-time processing ensures that these applications function effectively, as seen in edge devices like car dash cameras equipped with vehicle detection for Advanced Driver Assistance Systems (ADAS). Similarly, tactical missiles utilize imaging infrared seekers to quickly identify and engage targets. Despite its importance, real-time object detection faces numerous challenges that need to be addressed.

Real-time systems are required to process videos at high frame rates without delays, necessitating a careful balance between speed and accuracy for optimal performance. Achieving minimal system latency across diverse components, such as input data, processors, software, and output devices, poses significant challenges.

Real-time object detection applications are frequently deployed on edge devices in large quantities, serving as satellite components within extensive systems. Key challenges include size, power supply, processing performance, and cost. While advanced deep learning techniques significantly enhance accuracy and speed, they necessitate high-performance GPUs or specialized hardware. Achieving an optimal balance between performance and device configuration remains a major challenge in real-time object detection.

Figure 4 - Imaging Infrared Seeker on the Missile [7]

Real-time object detection is essential in various fields, necessitating a careful balance of speed, accuracy, performance, and hardware configuration. By leveraging a blend of traditional methods and deep learning, this thesis focuses on creating a low-complexity real-time object detection system designed for easy deployment on edge devices, specifically utilizing a stationary camera.

Related Works

Object detection methods can be divided into traditional techniques and deep learning approaches, with the latter further classified into two categories. The first category consists of region proposal object detection algorithms, which create region proposal networks and subsequently classify these proposals. Notable examples include RCNN (2014), Fast-RCNN, and Faster-RCNN (both introduced in 2015).

The regression object detection category includes algorithms such as YOLO and SSD, introduced in 2016. Unlike traditional methods that rely on hand-crafted feature extraction techniques, such as HOG [13] in 2005, SIFT [14] in 1999, and Haar [15] in 2004, these modern approaches leverage neural networks for improved accuracy and efficiency in object detection.

Figure 6 – Hand-crafted feature method [16]

Deep learning has emerged as a solution to the challenges of real-time object detection. In 2020, the introduction of Tinier-YOLO demonstrated a compact model size of 8.9 MB, achieving real-time performance at 25 frames per second on the Jetson TX1. It attained a mean Average Precision (mAP) of 65.7% on the PASCAL VOC dataset and 34.0% on the COCO dataset, highlighting its effectiveness in object detection tasks.

In 2022, an enhanced version of YOLOv5 was introduced, incorporating CBAM and SENet for superior object representation and employing multiscale detection, achieving 90.2% mAP. By 2023, YOLOv7 emerged as the fastest and most accurate object detector, operating between 5 FPS and 120 FPS and attaining the highest accuracy of 56.8% AP among all real-time object detectors at 30 FPS.

Deep learning methods are frequently praised for their effectiveness in real-time object detection; however, traditional algorithms offer distinct advantages. Notably, hand-crafted feature algorithms can operate efficiently on standard CPU platforms, while deep neural networks typically necessitate powerful GPUs or specialized hardware. A 2017 study highlighted that the hand-crafted feature extraction algorithm HOG demonstrates a performance advantage, being 311 to 13,486 times more efficient in certain applications.

That study also demonstrates that the HOG-based method achieves six times lower power consumption and 35 times higher throughput compared to deep learning approaches, specifically CNNs. Unlike deep learning methods, hand-crafted algorithms require expert knowledge for feature design, but they minimize dependence on large, costly labeled training datasets. Research from 2020 highlighted that limited training data can lead to overfitting in machine learning models. Additionally, deep neural networks possess millions of parameters, making manual adjustment challenging.

Traditional methods, by contrast, allow engineers to transfer their insights directly into the algorithm and tweak parameters so that it generalizes to a broader range of images.

Object detection in images involves processing large amounts of data, necessitating optimization techniques for real-time execution on edge devices. The HOG-SVM algorithm is commonly utilized for high-resolution and real-time computing, and various methods have been introduced to enhance its speed. Notably, increasing the number of parallel computation blocks can accelerate calculations, as demonstrated in a 2016 study that employed multiple blocks to process several image resolutions simultaneously. Additionally, in 2012, systems emerged that operated on images at a resolution of 512x512 pixels. By 2022, a real-time pedestrian detector was introduced that uses HOG feature extraction and SVM classification to process 4K video at 60 frames per second on an AMD Xilinx Zynq UltraScale+ MPSoC device, a significant leap in real-time object detection at ultra-high-definition resolutions.

Real-time object detection presents challenges due to the need for efficient processing of large image data. While deep learning methods offer high accuracy, they require powerful GPUs or specialized hardware. In contrast, traditional computer vision algorithms like HOG-SVM are more hardware-efficient but tend to have lower accuracy. This thesis proposes the development of a real-time object detection system utilizing lightweight algorithms to achieve a balance between performance and efficiency.

The HOG-SVM object detection module is combined with the Zipfian Estimation Technique, which detects motion in stationary camera scenarios and thereby significantly reduces computational requirements. The project's objective is to implement this system on an FPGA hardware platform.

Lightweight feature extraction algorithms are ideal for FPGA hardware due to their efficiency, low power consumption, and the parallel processing capabilities of FPGAs. These devices offer high efficiency and deterministic execution times, making them suitable for real-time applications when lightweight algorithms are properly optimized. However, implementing complex algorithms or neural networks can be challenging because of FPGAs' limited on-chip resources, which may not meet the demands of large networks. Successfully designing and optimizing these algorithms requires a thorough understanding of both the algorithmic details and FPGA architecture, complicating the design process. Furthermore, there is a trade-off between FPGA flexibility and the efficiency gained from hardware-specific optimizations. The development of FPGA-based solutions is often more time-consuming and costly compared to software implementations on general-purpose processors, primarily due to the complexities involved and the specialized skills needed. Despite the advantages, the integration of complex algorithms into FPGAs presents significant challenges.

Conclusions

Real-time object detection is an essential technology with applications across many areas. However, meeting real-time speed and accuracy requirements under hardware constraints presents significant challenges.

Deep learning and traditional computer vision techniques are both employed to address these challenges, each presenting distinct tradeoffs in accuracy, speed, performance, and hardware requirements. Advanced deep learning models like YOLO, R-CNN, and SSD achieve state-of-the-art accuracy but require powerful GPUs or specialized hardware. Conversely, traditional methods such as HOG-SVM are more lightweight and hardware-efficient; however, they rely more heavily on feature engineering and may not match the accuracy of deep learning methods.

This thesis proposes a system that detects objects in real time using the HOG-SVM algorithm as its foundation, combined with multiple techniques aimed at improving its speed and efficiency, in particular motion detection based on the Zipfian Estimation method. The ultimate goal is to implement the entire system on an FPGA platform, capitalizing on the advantages of hardware design optimization and parallel processing.

Proposed Real-time Object Detection System

Proposed System using Zipfian Estimation Technique and HOG-SVM

This section outlines a system architecture for detecting humans in video images, featuring key components such as motion detection, parallel processing for human detection, non-maximum suppression, and result merging. The proposed method utilizes optimized algorithms and a parallel architecture to improve efficiency, accuracy, and throughput. The following sections discuss the dataflow and system architecture in detail.

To effectively process the two image datasets, the first step involves converting them into grayscale images. This conversion is performed by calculating the average pixel value across the Red, Green, and Blue channels, with each pixel represented by an 8-bit number ranging from 0 to 255.
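As a minimal sketch, assuming 8-bit interleaved RGB input (the buffer layout and function name are illustrative, not the thesis's exact code), the conversion looks like this in C++:

```cpp
#include <cstdint>
#include <vector>

// Convert interleaved 8-bit RGB to grayscale by averaging the three channels.
std::vector<uint8_t> rgbToGray(const std::vector<uint8_t>& rgb, int width, int height) {
    std::vector<uint8_t> gray(static_cast<size_t>(width) * height);
    for (size_t i = 0; i < gray.size(); ++i) {
        // Each pixel occupies 3 consecutive bytes: R, G, B.
        unsigned sum = rgb[3 * i] + rgb[3 * i + 1] + rgb[3 * i + 2];
        gray[i] = static_cast<uint8_t>(sum / 3);  // the average stays in 0..255
    }
    return gray;
}
```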

The Zipfian Estimation Techniques are utilized to generate a list of contours that identify motion areas in the input grayscale image, which are subsequently used for human detection processing.

In a study utilizing a 1300x982 resolution image scaled by 1.7, a total of 77,202 sliding windows were analyzed; however, only three images with resolutions of 176x180, 158x161, and 182x190 were processed in the object detection block, resulting in 556 sliding windows, which accounts for just 0.7% of the total.

Figure 7 - Pre-processing data flow

The Bilinear Interpolation Scale Generator algorithm scales each detected motion area down by a factor of 1.05 per step, continuing until the height is below 130 and the width is below 66, the minimum required for the 64x128 sliding window plus the 1-pixel border needed by the HOG-SVM algorithm. The scaled images are then divided among six parallel HOG-SVM modules. Their outputs are integrated in a combine module and processed through the NMS module to eliminate redundant sliding windows. The final result pinpoints the area identified by the system as containing an object, specifically a person.

Input: detected human areas and the raw input image.

The contour result list plays a vital role in integrating the outcomes of human detection with the stored raw RGB image. By combining these elements, it produces the final results, accurately identifying and outlining the detected objects or individuals. The result can be shown on a monitor, saved to a file, or sent to a database for future applications.

Figure 9 – Post-processing data flow

The proposed system architecture takes a sequence of images as input, which are pre-processed using Zipfian Estimation Techniques to handle dynamic objects like humans by focusing only on those present in the frame. Detected motion areas are sent to the Object Detection block, featuring a Scale Generator module, six parallel HOG-SVM computation modules, and an NMS algorithm module. The Scale Generator and multithreaded HOG-SVM computations are essential for minimizing processing time while scaling input images to identify humans across multiple scales. The results are then integrated with the original image for viewing, storage, or transmission to a server for additional processing.

The proposed system employs lightweight algorithms that operate in parallel to enable real-time human detection in video and image sequences. It begins with motion detection for pre-processing, which narrows down the search space before the HOG-SVM algorithm is applied. The system generates bounding boxes around detected individuals, making it applicable in diverse fields such as automated surveillance, advanced driver assistance systems, and human-computer interaction. An efficient and accurate human detection capability is essential for these applications.
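The NMS step mentioned above can be summarized as a greedy, IoU-based suppression of overlapping windows. The following C++ sketch illustrates the idea under assumed conventions (a simple box struct with top-left coordinates and an illustrative overlap threshold); it is not the thesis's exact module:

```cpp
#include <algorithm>
#include <vector>

struct Box { float x, y, w, h, score; };  // top-left corner, size, SVM score

// Intersection-over-Union of two axis-aligned boxes.
static float iou(const Box& a, const Box& b) {
    float ix = std::max(0.0f, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
    float iy = std::max(0.0f, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
    float inter = ix * iy;
    return inter / (a.w * a.h + b.w * b.h - inter);
}

// Greedy NMS: keep the highest-scoring window, drop any window that overlaps
// it by more than `threshold`, and repeat with the remaining windows.
std::vector<Box> nms(std::vector<Box> boxes, float threshold) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& c : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (iou(c, k) > threshold) { suppressed = true; break; }
        if (!suppressed) kept.push_back(c);
    }
    return kept;
}
```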

Lightweight Motion Detection Algorithm used on a stationary camera

Stationary cameras are essential for surveillance and monitoring in public areas due to their flexibility, cost-effectiveness, and ease of deployment. Their compact size allows for large-scale installation, making them ideal for public spaces. Unlike PTZ cameras, stationary cameras are fixed in one position with a set field of view, although some may feature adjustable zoom. By maintaining a consistent background and focusing on dynamic subjects within the frame, these cameras reduce the number of calculations needed, resulting in faster processing and higher-resolution image capture.

In 2014, the Zipfian Estimation [28] was used with MJPEG to extract moving and stationary blocks separately, resulting in a compression ratio twice as high as conventional MJPEG. Encoding only the residuals of moving blocks achieves a quality and bit rate comparable to H.264/AVC while requiring fewer operations, particularly when static scenes account for 60% or more of the content. In 2007, the Zipfian estimation emerged as one of the fastest motion detection algorithms available.

In 2009, advancements in the Σ-Δ background subtraction algorithm from 2004 highlighted the effectiveness of Zipfian Estimation. This approach shows that by selectively encoding only dynamic regions, significant compression gains can be achieved while maintaining visual quality and minimizing computational demands.

As mentioned above, the Zipfian Estimation module is a lightweight computing technique based on the Sigma-Delta algorithm for detecting motion in the frame, as shown in Algorithm 1.

Algorithm 1 - Zipfian estimation [31]

```
find the greatest 2^p that divides (t mod 2^m)
set σ = 2^m / 2^p
foreach pixel x do
    if V_{t-1}(x) > σ then
        if M_{t-1}(x) < I_t(x) then M_t(x) ← M_{t-1}(x) + 1
        if M_{t-1}(x) > I_t(x) then M_t(x) ← M_{t-1}(x) - 1
foreach pixel x do
    O_t(x) ← |M_t(x) - I_t(x)|
if t mod T_V = 0 then
    foreach pixel x do
        if V_{t-1}(x) < N · O_t(x) then V_t(x) ← V_{t-1}(x) + 1
        if V_{t-1}(x) > N · O_t(x) then V_t(x) ← V_{t-1}(x) - 1
foreach pixel x do
    if O_t(x) < V_t(x) then E_t(x) ← 0 else E_t(x) ← 1
```

The algorithm begins by calculating a threshold from the current frame index t. It finds the greatest power 2^p that divides (t mod 2^m), where m = 8 for grayscale images; the threshold is then σ = 2^m / 2^p. The background M_t is updated only where the variance V_{t-1} exceeds this threshold σ. O_t is the absolute difference between the current image I_t and the updated background M_t. To avoid self-referencing, the variance V_t is updated with a fixed period T_v, usually a power of 2 within the range 1 to 64; in our case, T_v is set to 1. N is an amplification factor for the variance V_t, typically ranging from 1 to 4; in our case, N = 2. Finally, the movement or stillness of a pixel is determined by comparing its absolute difference with the variance.
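Putting Algorithm 1 and these parameter choices together, a software sketch of one frame update might look as follows in C++ (assuming m = 8, T_v = 1, and N = 2 as above; the handling of t mod 2^m = 0 and all names are assumptions for illustration):

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// One frame update of the Zipfian (Sigma-Delta) estimator of Algorithm 1.
// M = background estimate, V = variance estimate, I = current grayscale frame,
// E = per-pixel motion label output (1 = moving), t = frame index.
void zipfianStep(const std::vector<uint8_t>& I, std::vector<int>& M,
                 std::vector<int>& V, std::vector<uint8_t>& E, int t) {
    const int m = 8, N = 2;  // Tv = 1, so the variance is updated every frame
    // sigma = 2^m / 2^p, where 2^p is the greatest power of two dividing (t mod 2^m).
    int r = t % (1 << m), p = 0;
    while (r != 0 && (r & 1) == 0) { r >>= 1; ++p; }
    int sigma = (r == 0) ? 1 : ((1 << m) >> p);  // t mod 2^m == 0 case: an assumption
    for (size_t x = 0; x < I.size(); ++x) {
        if (V[x] > sigma) {             // background update gated by sigma
            if (M[x] < I[x]) ++M[x];
            if (M[x] > I[x]) --M[x];
        }
        int O = std::abs(M[x] - I[x]);  // absolute difference to the background
        if (V[x] < N * O) ++V[x];       // Sigma-Delta variance update
        if (V[x] > N * O) --V[x];
        E[x] = (O < V[x]) ? 0 : 1;      // motion label
    }
}
```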

The Zipfian Estimation demonstrates that encoding specific dynamic regions can achieve substantial compression improvements while maintaining visual quality and minimizing computational demands. These efficient techniques align well with the objectives of this thesis.

Object Detection using the HOG-SVM Algorithm

The Histogram of Oriented Gradients (HOG) - Support Vector Machine (SVM) algorithm, introduced in 2005 for human detection, has become a foundational approach in object detection. HOG effectively analyzes edge distribution, striking a balance between detection accuracy and computational complexity, making it a standard baseline in the field. Meanwhile, SVM serves as a powerful classification tool, leveraging the features extracted by HOG for improved object recognition.

Generating the cell histograms in the HOG algorithm involves costly operations such as inverse tangent, square, square root, and floating-point multiplication. Optimizing these operations is essential for an efficient HOG-SVM algorithm module.

Gradient magnitude and angle calculation:
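For reference, the standard formulation uses central differences over the four adjacent pixels and the usual magnitude and angle definitions; the optimized method below avoids evaluating the square root and inverse tangent directly:

$$d_x = I(x+1, y) - I(x-1, y), \qquad d_y = I(x, y+1) - I(x, y-1)$$

$$\|G\| = \sqrt{d_x^2 + d_y^2}, \qquad \theta = \arctan\!\left(\frac{d_y}{d_x}\right)$$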

In 2017, a novel technique was introduced for generating cell histograms that streamlines the process while preserving accuracy. This method avoids computing the exact gradient magnitude and angle for each pixel, resulting in a more efficient pipeline.

The reconstruction error remains below 2% when using an 8-bit fractional length, highlighting the effectiveness of precision manipulation in the gradient magnitude. The process is streamlined by utilizing predefined sine and cosine values corresponding to quantized angles.

The aforementioned study introduced a technique that divides the primary gradient vector into two distinct vectors, as illustrated in Figure 11. For each pixel I(x,y), the values of these vectors are determined from the four adjacent pixels: I(x+1,y), I(x-1,y), I(x,y+1), and I(x,y-1).

Figure 11 - Decomposition of a vector into the form of two vectors [35]

The two magnitudes are the solutions of the two equations:

$$\|B_{\theta_i}\| = \frac{\sin\theta_{i+1}\, d_x - \cos\theta_{i+1}\, d_y}{\sin 20^\circ}, \qquad \|B_{\theta_{i+1}}\| = \frac{\cos\theta_i\, d_y - \sin\theta_i\, d_x}{\sin 20^\circ}$$

This thesis uses this technique to improve the HOG-SVM algorithm and decrease the number of calculations needed to process each sliding window.

(2) Implementing HOG-SVM in hardware

In 2015, a real-time and energy-efficient multi-scale object detector hardware implementation was presented [23]. Parallel detectors with a balanced workload increase the throughput, enabling voltage scaling and energy consumption reduction.

Figure 12 - Object detection system architecture [23]

In 2019, a novel high-throughput hardware architecture was introduced for human detection, optimizing HOG feature generation and SVM classification. This architecture enhances throughput through a fast, highly parallel, and cost-effective HOG feature generation process, coupled with a modified datapath for simultaneous computation of SVM and HOG feature normalization. Implemented in TSMC 65nm technology, it operates at a maximum frequency of 500 MHz, achieving a throughput of 139 frames per second at full-HD resolution, with a hardware area cost of approximately 145 kGEs and 242 kb of SRAM.

Figure 13 - Proposed block diagram of object detection [24]

In 2023, a novel human detection system utilizing a HOG-SVM module alongside Direct Memory Access (DMA) was introduced, implemented on the Xilinx FPGA Development Kit ZCU106. This design effectively reduces CPU load while enhancing overall system performance.

Figure 14 - Block diagram of the proposed human detection system [36]

In summary, this section examined research on enhancing the HOG-SVM algorithm's efficiency for real-time object detection systems. It proposes reducing complex gradient calculations by pre-quantizing sine and cosine values, thus maintaining accuracy while minimizing computational load. Utilizing parallel architectures, deep pipelining, and direct memory access in FPGA/ASIC designs can lead to improved frame rates and reduced latency. Although HOG-SVM is prevalent in object detection, it requires further optimization to meet the demands of real-time applications with power limitations.

As mentioned above, the Object Detection Block includes the Bilinear Interpolation Scale Generator module, the HOG-SVM module, the combine module, and the NMS module.

(1) Bilinear Interpolation Scale Generator Module

Bilinear interpolation is a popular method for image scaling and 2D finite element analysis, utilizing linear interpolation across two dimensions. The technique averages data from the corners of a rectangle and assigns weights based on the distance to these corners. As a point (x,y) moves closer to a corner, the weight for that corner increases. This approach is detailed in Algorithm 2.

Figure 15 - Bilinear Interpolation Scale Generator (a) and illustration (b)

Algorithm 2 - Bilinear Interpolation Scale Generator

```
get original_height and original_width of input image
set new_width  = original_width / scale
set new_height = original_height / scale
create output_image of size (new_height, new_width)
foreach pixel (x, y) in output_image:
    set src_x = x × scale and src_y = y × scale
    set x1 = floor(src_x) and y1 = floor(src_y)
    set x2 = min(x1 + 1, original_width - 1)
    set y2 = min(y1 + 1, original_height - 1)
    set w_x = src_x - x1 and w_y = src_y - y1
    set output_image[y, x] = (1 - w_x) × (1 - w_y) × input_image[y1, x1]
                           + w_x × (1 - w_y) × input_image[y1, x2]
                           + (1 - w_x) × w_y × input_image[y2, x1]
                           + w_x × w_y × input_image[y2, x2]
```

To resize an image, we calculate the new output dimensions and, for each output pixel, determine the corresponding location in the input image along with its four nearest neighbors. By applying distance-based interpolation weights to these neighboring pixels, we compute a weighted average, which becomes the value of the output pixel. This method is applied systematically to every pixel in the output image.
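A minimal C++ sketch of this scaler, assuming 8-bit grayscale input and a downscale factor greater than 1 (names are illustrative; an FPGA version would use fixed-point arithmetic):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Bilinear downscaling by `scale`, following Algorithm 2.
std::vector<uint8_t> bilinearScale(const std::vector<uint8_t>& in,
                                   int w, int h, float scale) {
    int nw = static_cast<int>(w / scale), nh = static_cast<int>(h / scale);
    std::vector<uint8_t> out(static_cast<size_t>(nw) * nh);
    for (int y = 0; y < nh; ++y) {
        for (int x = 0; x < nw; ++x) {
            float sx = x * scale, sy = y * scale;          // source position
            int x1 = static_cast<int>(sx), y1 = static_cast<int>(sy);
            int x2 = std::min(x1 + 1, w - 1), y2 = std::min(y1 + 1, h - 1);
            float wx = sx - x1, wy = sy - y1;              // interpolation weights
            float v = (1 - wx) * (1 - wy) * in[y1 * w + x1]
                    + wx * (1 - wy) * in[y1 * w + x2]
                    + (1 - wx) * wy * in[y2 * w + x1]
                    + wx * wy * in[y2 * w + x2];
            out[y * nw + x] = static_cast<uint8_t>(v + 0.5f);  // round to nearest
        }
    }
    return out;
}
```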

As mentioned above, the HOG-SVM algorithm was chosen to process object detection tasks, in this case focused on human detection.

Figure 16 - HOG-SVM algorithm for human detection [23]

The algorithm analyzes a sliding window of 64 × 128 pixels from the input image, which is segmented into 8 × 8 pixel cells. Each cell produces a histogram representing the gradient angle and magnitude by distributing gradient magnitudes into nine bins according to their angles. Additionally, 2 × 2 cells are combined to create a block, and the histograms from the cells are normalized using the data from the block.

(a) Gradient calculation and histogram generation

The initial phase of this algorithm utilizes the findings of previously published research, as outlined in Section 2.3. It simplifies the computation process by avoiding complex operations such as inverse tangent, square, square root, and floating-point multiplication, thereby reducing the overall number of calculations required for each sliding window.

To begin, the values of dx and dy are calculated. Next, the absolute values of dx and dy are combined with their respective signs to establish the two quantized angles θi and θi+1. This determination is achieved by comparing A multiplied by dx with B multiplied by dy, where A represents the dividend.

The magnitudes Bi and Bi+1 are then calculated using the divisor B mentioned in [35], according to the two equations provided. These calculations are optimized with the values from [35], where the sine and cosine values have been rounded and multiplied by 256, as shown in Table 1.

Finally, these values are multiplied by the rounded constant 1/sin 20° ≈ 2 + 15/16 = 47/16 and divided by 256, which explains the constant 47 that appears in the implementation.
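A fixed-point C++ sketch of this bin-magnitude computation (the scaled sine/cosine tables below are recomputed here as round(256·sin θ) and round(256·cos θ) for θ = 0°, 20°, ..., 180°, so individual entries may differ from Table 1 by rounding):

```cpp
#include <cstdint>

// round(256*sin(theta)) and round(256*cos(theta)) for the quantized bin angles.
static const int SIN256[10] = {0, 88, 165, 222, 252, 252, 222, 165, 88, 0};
static const int COS256[10] = {256, 241, 196, 128, 44, -44, -128, -196, -241, -256};

// Magnitudes assigned to bins i and i+1 for a gradient (dx, dy) whose angle
// lies between the quantized angles theta_i and theta_{i+1} (i in 0..8).
void binMagnitudes(int dx, int dy, int i, int& Bi, int& Bi1) {
    // Numerators are 256x the exact values because the tables are scaled by 256.
    int numI  = SIN256[i + 1] * dx - COS256[i + 1] * dy;
    int numI1 = COS256[i] * dy - SIN256[i] * dx;
    // Multiply by 1/sin(20 deg) ~ 47/16, then divide by 256: in total *47 >> 12.
    Bi  = (numI  * 47) >> 12;
    Bi1 = (numI1 * 47) >> 12;
}
```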

Conclusions

This chapter presents a real-time object detection system specifically designed for human detection. It utilizes efficient algorithms, including Zipfian Estimation for motion detection and an optimized HOG-SVM pipeline for classification. Key features of the system include parallel computation threads for HOG-SVM and post-processing techniques such as non-maximum suppression to enhance detection accuracy.

The entire process of data flow through each system stage has been analyzed.

The process begins with motion detection, which identifies key regions of interest. These regions are subsequently classified using the HOG-SVM computation technique. Following classification, the identified areas are merged and filtered to create final bounding boxes for human detection. Finally, these bounding boxes are overlaid onto the original input image for further applications.

This system utilizes lightweight algorithms for real-time object recognition on high-resolution images, serving as a foundation for future hardware implementation. The upcoming chapter will detail the execution of this system across various datasets using C/C++. The architecture presents an effective pipeline for precise and efficient human detection, making it suitable for practical applications.

Implementation and Evaluations

Experiment setup environment

The software testing program is written in C++ and utilizes the OpenCV2 library to load images from the MOT15 dataset. These images are processed through the proposed object recognition system, which records the position and size of detected objects in each frame. Additionally, the program measures the processing time for each frame to evaluate the system's real-time performance. The collected data is subsequently transferred to a Python environment for further analysis.

This program utilizes the Torch, Pandas, and NumPy libraries to analyze the True Positive data file from the dataset. The outcomes of this analysis are used to compute key metrics, including Precision, Recall, and mean Average Precision (mAP).

The experimental environment is a computer equipped with an Intel® Core™ i5-1035G4 processor, 8 GB of RAM, a 128 GB SSD, and an Intel® Iris™ Plus Graphics card, running Ubuntu 22.04.3 LTS under Windows Subsystem for Linux 2 on a Windows 11 operating system.

We consider processing accuracy and speed to evaluate the proposed system's effectiveness. Accuracy is measured by mAP, while the number of frames per second measures speed.

Mean Average Precision (mAP) is calculated as the average of Average Precision (AP) across various Intersection over Union (IoU) thresholds. To understand mAP, it is crucial to grasp the concepts of precision and recall, which are fundamental in classification tasks. In this context, if a model identifies an image as depicting a human, it is classified as a positive result; conversely, if it identifies the image as non-human, it is deemed negative. The correctness of the model's classification is indicated using True and False labels.

Table 2 - Confusion matrix

                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative

The Intersection over Union (IoU) metric is essential in object detection, as it quantifies the overlap between predicted bounding boxes and ground truth boxes. A higher IoU score indicates a closer match between the predicted and actual boxes, reflecting improved accuracy in object detection.
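Formally, for a predicted box B_p and a ground-truth box B_gt:

$$\mathrm{IoU} = \frac{\operatorname{area}(B_p \cap B_{gt})}{\operatorname{area}(B_p \cup B_{gt})}$$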

Precision is the ratio of true positives to total positive predictions. It indicates the accuracy of a model in detecting an object.

Recall is the ratio of true positives to the total number of actual objects; it indicates the model's ability to detect all relevant objects.
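In terms of the confusion-matrix counts above:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$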

Average Precision (AP) is a metric that evaluates an object detection model's precision by calculating the weighted mean precision across all recall values. This measure effectively penalizes any decreases in precision that occur at higher recall thresholds, providing a comprehensive assessment of the model's performance.

Here is a summary of the steps to calculate the AP:

1. Generate the prediction scores using the model.
2. Convert the prediction scores to class labels.
3. Calculate the confusion matrix: TP, FP, TN, FN.
4. Calculate the precision and recall metrics.
5. Calculate the area under the precision-recall curve.

The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the classes.
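A sketch of steps 4-5 and the final averaging in C++ (using the common interpolated-precision convention; the input conventions are assumptions, and this is not the thesis's exact evaluation script, which runs in Python):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Area under the precision-recall curve, given per-threshold (recall, precision)
// pairs sorted by increasing recall. Precision is first made non-increasing from
// right to left, which penalizes dips in precision at higher recall.
double averagePrecision(std::vector<std::pair<double, double>> pr) {
    for (int i = static_cast<int>(pr.size()) - 2; i >= 0; --i)
        pr[i].second = std::max(pr[i].second, pr[i + 1].second);
    double ap = 0.0, prevRecall = 0.0;
    for (const auto& [recall, precision] : pr) {
        ap += (recall - prevRecall) * precision;  // rectangle under the step
        prevRecall = recall;
    }
    return ap;
}

// mAP: average the per-class AP values (a single "person" class in this thesis).
double meanAveragePrecision(const std::vector<double>& perClassAp) {
    double sum = 0.0;
    for (double ap : perClassAp) sum += ap;
    return perClassAp.empty() ? 0.0 : sum / perClassAp.size();
}
```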

Frame rate, measured in frames per second, indicates how many frames a system can process in one second. This study employs two timestamps to assess the processing time of each frame: the first records the moment a new frame is loaded, while the second marks the completion of its processing. Subtracting the first time from the second gives the processing duration for each frame. The average processing time over all frames is then calculated, and its reciprocal gives the overall processing rate of the system for the given dataset.
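A sketch of this measurement in C++ using std::chrono (processFrame stands in for the full detection pipeline; names are illustrative):

```cpp
#include <chrono>
#include <vector>

// Time each frame from load to completion, average the durations, and invert
// the average to obtain frames per second.
template <typename Frame, typename Fn>
double measureFps(const std::vector<Frame>& frames, Fn processFrame) {
    using clock = std::chrono::steady_clock;
    double totalSeconds = 0.0;
    for (const Frame& f : frames) {
        auto start = clock::now();    // frame loaded
        processFrame(f);              // run the full pipeline on this frame
        auto end = clock::now();      // processing complete
        totalSeconds += std::chrono::duration<double>(end - start).count();
    }
    double avg = totalSeconds / frames.size();  // mean per-frame time
    return 1.0 / avg;                           // frames per second
}
```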

Mean Average Precision (mAP) and frame rate are crucial metrics for assessing object detection systems. mAP evaluates the system's accuracy through precision and recall, whereas frame rate reflects its real-time performance. Utilizing these standardized metrics makes it easier to compare various methodologies and pinpoint the strengths and weaknesses of the system.

Experimental results

The PETS09-S2L1 dataset features images captured at a resolution of 768x578 and a frame rate of 7 fps, showcasing a high-angle view of a campus or outdoor area. The primary motion in these images is attributed to humans, who are often seen walking in atypical patterns.

Figure 20 - The image extracted from input PETS09-S2L1

The study revealed that the processing time for each frame was 55 milliseconds, resulting in an average frame rate of 18 frames per second. Overall, the system demonstrated moderate performance, achieving an average precision of 55%, an average recall of 17%, and a mean Average Precision (mAP) score of 4%.

Figure 21 - Precision, recall, and mAP graphs per frame of input PETS09-S2L1

The system efficiently processes and analyzes frames, yet it struggles to accurately identify human movements. This challenge is evident in its moderate precision, low recall, and very low mean Average Precision (mAP) score.

The TUD-Stadtmitte series features a collection of images captured at a resolution of 640x480 and a frame rate of 25 fps, showcasing a vibrant urban pedestrian scene, likely set in a shopping area or street. Notably, the images highlight groups of humans as the only moving subjects, emphasizing the dynamic nature of city life.

Figure 22 - The image extracted from the input TUD-Stadtmitte

The processing time for a single frame was measured at 41 milliseconds, yielding an average frame rate of 24 frames per second. The system demonstrated moderate performance, achieving an average precision of 51%, an average recall of 20%, and a mean Average Precision (mAP) score of 7.7%.

Figure 23 - Precision, recall, and mAP graphs per frame of input TUD-Stadtmitte

The system processes images in near-real-time but struggles to accurately identify and track human groups in urban pedestrian environments. Its moderate precision, coupled with low recall and mAP scores, indicates a tendency to misidentify or overlook groups entirely.

The TUD-Campus series features a collection of images at a resolution of 640x480 pixels and a frame rate of 25 fps, showcasing a group of individuals walking outside a contemporary building. The structure is characterized by large windows that reflect the surrounding scenery.

Figure 24 - The image extracted from the input TUD-Campus

The system demonstrated a processing time of 84 milliseconds per frame, achieving an average frame rate of 12 frames per second. Performance analysis revealed moderate results, with an average precision of 23%, an average recall of 21%, and a mean Average Precision (mAP) score of 3.3%.

Figure 25 - Precision, recall, and mAP graphs per frame of input TUD-Campus

The analysis of the TUD-Campus dataset reveals limitations in processing speed and accuracy, indicated by low precision, recall, and mAP values. These metrics suggest that the algorithm struggles to accurately identify humans within the frames. Additionally, the complexity of the images, characterized by multiple overlapping individuals and mirror-like reflections, further hampers the system's performance.

The proposed system faced several limitations during testing, resulting in low precision, recall, and mAP coefficients. These issues stemmed from factors such as complex objects moving in groups, which caused occlusion, and objects blending into the background due to similar color tones, disrupting the human detection algorithms. Furthermore, recognition results were adversely affected by data interference from mirrors and shadows.

The system's performance is further hindered by the absence of an object tracking algorithm and by a limited SVM training set that restricts recognition of objects from diverse angles and directions. However, its notable advantage is its speed, which remains consistently reliable.

The MOT15 dataset serves as a robust benchmark for assessing multiple object detection algorithms, especially in the realm of human detection. It encompasses various challenges, including occlusions, changes in illumination, and varying scales of human figures. Despite technological advancements, the dataset continues to pose significant challenges, pushing the boundaries of current detection and tracking technologies. Even advanced neural network algorithms struggle with accuracy on this dataset, highlighting the difficulties faced by simpler feature extraction systems.

In this chapter, we evaluated the performance of the proposed object detection system through experiments. We assessed key metrics, including mean Average Precision and frame rate, using video inputs from the PETS09, TUD-Stadtmitte, and TUD-Campus datasets within the MOT15 benchmark.

The system achieves a processing speed of up to 24 FPS; however, its detection accuracy is limited, exhibiting medium precision (23-55%), low recall (17-21%), and low mAP scores (3.3-7.7%). Contributing factors include complex backgrounds, reflections, overlapping objects, and the inherent constraints of the SVM classifier. Additionally, suboptimal threshold selection in steps such as the Zipfian estimation, HOG-SVM, and NMS, along with the use of quantized values to enhance speed, further exacerbates these limitations.

Despite these challenges, the system has low complexity and is well suited to future hardware platforms, such as Xilinx FPGAs. By placing the system in a controlled environment, providing a suitable training set of SVM weights, and conducting extensive experiments on comparable datasets, the identified issues can be addressed. This approach allows appropriate thresholds to be selected at each step, ultimately completing the system. Additionally, implementing a tracking algorithm can significantly improve object detection accuracy, as it permits the HOG-SVM algorithm to be executed only once per object until it exits the frame, thereby reducing errors during recognition.


References

[1] S. Dasiopoulou, "Knowledge-assisted semantic video object detection," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1210-1224, 2005.
[3] "AI expands capabilities of surveillance and public safety tech." [Online]. Available: https://www.smartcitiesworld.net/ai-and-machine-learning/ai-expands-capabilities-of-surveillance-and-public-safety-tech
[4] [Online]. Available: https://smcorridornews.com/cbps-autonomous-surveillance-towers-declared-a-program-of-record-along-the-southwest-border/
[5] V. Haltakov, H. Belzner, and S. Ilic, "Scene understanding from a moving camera for object detection and free space estimation," IEEE Intelligent Vehicles Symposium, pp. 105-110, 2012.
[7] [Online]. Available: https://www.militaryaerospace.com/commercial-aerospace/article/14229347/raytheon-gets-order-for-180-aim9x-infraredguided-missiles-for-us-and-allied-forces
[8] R. Girshick, "Rich feature hierarchies for accurate object detection and semantic segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[9] R. Girshick, "Fast R-CNN," Proceedings of the IEEE International Conference on Computer Vision, 2015.
[10] S. Ren, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems 28, 2015.
[12] W. Liu, "SSD: Single Shot MultiBox Detector," Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 2016.
[13] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[14] D. G. Lowe, "Object recognition from local scale-invariant features," Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
[15] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, 2001.
[16] J. Wang, Y. Ma, L. Zhang, R. X. Gao, and D. Wu, "Deep learning for smart manufacturing: Methods and applications," Journal of Manufacturing Systems 48, pp. 144-156, 2018.
[17] W. Fang, L. Wang, and P. Ren, "Tinier-YOLO: A real-time object detection method for constrained environments," IEEE Access 8, pp. 1935-1944, 2019.
[18] V. P. Hoang, H. Ninh, and T. T. Hai, "CBAM-YOLOv5 for infrared image object detection," Artificial Intelligence and Machine Learning in Defense Applications IV, vol. 12276, pp. 116-127, 2022.
[19] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[20] A. Suleiman, Y. H. Chen, J. Emer, and V. Sze, "Towards closing the energy gap between HOG and CNN features for embedded vision," IEEE International Symposium on Circuits and Systems (ISCAS), 2017.
[21] N. O'Mahony, S. Campbell, A. Carvalho, S. Harapanahalli, G. V. Hernandez, L. Krpalkova, et al., "Deep learning vs. traditional computer vision," Advances in Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), vol. 1, pp. 128-144, 2020.
[22] K. Takagi, K. Mizuno, S. Izumi, H. Kawaguchi, and M. Yoshimoto, "A sub-100-milliwatt dual-core HOG accelerator VLSI for real-time multiple object detection," IEEE ICASSP, pp. 2533-2537, 2013.
[37] [Online]. Available: https://learnopencv.com/non-maximum-suppression-theory-and-implementation-in-pytorch/
