ADAPTIVE LEARNING SOLUTION BASED ON DEEP LEARNING FOR TRAFFIC OBJECT RECOGNITION
Doctor of Philosophy in Computer Science, Da Nang, 2022
Introduction
Artificial intelligence (AI) refers to intelligence demonstrated by artificial systems and has become ubiquitous in today's world. It is widely used in various applications, including office productivity tools, automatic answering systems, intelligent traffic management, and smart home systems. Advances in computer hardware have significantly boosted AI capabilities, enabling its broader application across all areas of life and society.
AI focuses on developing algorithms and applications that assist humans in decision-making, or make decisions autonomously, through data identification and acquisition. Key research areas include object detection, object action recognition, and human action recognition, which underpin applications such as security surveillance, remote control systems, assistive technologies for the visually impaired, sports data analysis, automated robots, and self-driving cars. Numerous solutions have been proposed for AI development, including heuristic algorithms, evolutionary algorithms, Support Vector Machines, Hidden Markov Models, expert systems, and neural networks. However, traditional approaches often require extensive human intervention and large datasets, resulting in limited accuracy and restricted case identification.
To overcome these shortcomings, machine learning, with a particular focus on Deep Learning, is now being applied in artificial intelligence for object detection and action recognition.
Deep Learning is one of the most widely discussed areas within artificial intelligence, focusing on enhancing neural network technologies to improve voice recognition, image recognition, and natural language processing. As a subset of machine learning, Deep Learning has significantly advanced fields like object perception, machine translation, and speech recognition, achieving breakthroughs that were once considered very challenging for AI researchers.
However, although Deep Learning has solved many problems related to AI, it still has limitations that need to be addressed.
Creating an effective object recognition system using Deep Learning requires vast amounts of input data to enable accurate learning. This data-driven process demands significant computational power, often relying on large, powerful server systems equipped with high-performance processors to handle the intensive training workload efficiently.
Deep Learning currently struggles to recognize complex social contexts and faces difficulties in distinguishing similar objects due to limitations in logical recognition technology. Additionally, integrating abstract knowledge, such as understanding what an object is, its uses, and how people interact with it, remains a significant challenge for machine learning systems. Unlike humans, AI has not yet acquired common knowledge and contextual understanding, highlighting the ongoing limitations of deep learning in capturing real-world complexity.
How can a machine learning system autonomously acquire, select, and update relevant knowledge while constructing a cohesive, interconnected dataset similar to human learning? Research on Adaptive Learning [9, 10, 11, 12, 13, 14] offers promising solutions to address the limitations of traditional Deep Learning, by exploring methods that enable systems to navigate complex knowledge management tasks beyond current capabilities.
A comprehensive Adaptive Learning model enables an autonomous robot system to develop self-learning and self-intelligence capabilities that mimic human cognitive processes. Over time, the system's intelligence improves through continuous operation and data processing. The system automatically selects relevant data to retrain and update its models, replacing outdated versions to enhance performance and adaptability. This evolving approach ensures the robot's intelligence becomes more accurate and efficient, making it well suited to advanced automation applications.
The proposed Adaptive Learning model shows great potential for application across various autonomous robot systems, including self-driving vehicles. This doctoral research involves conducting studies and experiments specifically on autonomous vehicles to simulate their operational processes. Recognition capabilities of self-driving cars encompass a wide range of objects in traffic environments, such as other vehicles (motorcycles, cars, trucks), pedestrians, traffic signs, roadways, and roadside elements, highlighting the importance of advanced perception in autonomous driving systems.
Research goal
The thesis focuses on exploring artificial intelligence, analyzing the existing methods and algorithms used in object detection. It aims to evaluate the limitations of current approaches and propose improved solutions to enhance the efficiency and accuracy of AI systems. By addressing these challenges, the research seeks to advance the effectiveness of object detection technologies through innovative methods and optimized algorithms.
- Study, analyze and evaluate traditional methods: Support Vector Machine, Hidden Markov Model, Neural network, and so on
- Study and evaluate the application of Deep Learning in classification and object detection in traffic (Pedestrians, traffic vehicles, traffic signs, etc.)
- Study and apply an Adaptive Learning approach to improve the performance of Deep Learning models for Advanced Driver Assistance Systems (ADAS), conducting experiments that optimize hyperparameters (learning rates, network architectures, and data sampling strategies) to obtain more robust, efficient, and safer models for real-world autonomous driving scenarios.
- Develop data sets for training and recognizing objects in traffic.
Research method
- Documentary method: Gather comprehensive information on fundamental algorithms, AI principles, Deep Learning, Adaptive Learning, and object detection by reviewing relevant documents and articles. Experimental data were collected from real-time traffic cameras and publicly available online videos to ensure practical and diverse datasets for analysis.
- Comparison method: Summarize and compare the collected documents to provide an overview of the methods, together with their advantages and disadvantages.
- Analysis method: Analyze the algorithms, their operation, and their characteristics. The effectiveness of the algorithms applied to specific cases is evaluated and analyzed to get the best results.
- Expert method: Consult AI experts to refine and complete the areas that need to be studied.
- Experimental method: Install and test the algorithms applied in each method for a better understanding. From this, the advantages and disadvantages of each method are evaluated and verified.
- Conduct experiments on Google's open-source machine learning system (TensorFlow) and MathWorks MATLAB to compare against the results of the research experiments.
- Collect and build comprehensive real-world empirical datasets, including objects such as pedestrians, vehicles, and traffic signs. These datasets are created by capturing images from real roadside photos and videos sourced from the internet, ensuring diverse and representative training and testing data. Such authentic data improves the robustness and reliability of the proposed algorithms in real traffic scenarios.
- Deploy the research results on a real system for experimental validation.
Research subject and scope
+ Deep Learning method and Adaptive Learning method
+ Propose solutions to enhance the on-road object detection quality of self-driving car systems
+ Study and propose an Adaptive Learning solution applied to on-road object detection
+ Build datasets, conduct experiments, and analyze the results.
The structure of the thesis
OVERVIEW OF ARTIFICIAL INTELLIGENCE
Overview of artificial intelligence
There have been many different definitions of artificial intelligence (AI) around the world, specifically:
Artificial intelligence (AI) refers to intelligence demonstrated by machines or computer systems, often designed for various purposes. The term encompasses both the development of AI technologies and their applications across different sectors. AI systems are capable of performing tasks that typically require human intelligence, such as problem-solving, learning, and decision-making, making it a rapidly growing field within computer science and technology.
• According to Bellman, artificial intelligence is the automation of activities that we associate with human thinking: activities such as decision-making, problem solving, learning, etc.
• Rich and Knight: "Artificial intelligence is the study of how to make computers do things at which, at the moment, people are better."
Artificial intelligence (AI) is a branch of computer science built on a strong theoretical foundation, focused on automating intelligent behavior in computers. It enables machines to mimic human intelligence, including thinking, decision-making, problem-solving, learning, and self-adaptation. Despite the different definitions, AI fundamentally aims to create systems that perform tasks requiring human-like cognitive abilities.
The history of artificial intelligence [15, 16, 17] has gone through many different stages of development, as shown in Figure 1.1.
Figure 1.1 History of artificial intelligence (Source: https://connectjaya.com/)
Machine learning and identification techniques
As an AI subfield, machine learning uses algorithms that enable computers to learn from data to perform tasks instead of being explicitly programmed [18].
Image processing problems involve analyzing information from images or performing transformations on them. Some examples are:
Image tagging: on Facebook, for example, an algorithm automatically detects your face and your friends' faces in photos. Basically, this algorithm learns from photos you have tagged before.
Optical Character Recognition (OCR) is a technology that converts images of typed, handwritten, or printed text into machine-encoded digital text. This process relies on algorithms that learn to recognize individual characters from image snapshots, enabling efficient digital text extraction and improving automation in document processing.
Self-driving cars utilize advanced image processing techniques, with machine learning algorithms enabling them to detect road edges, traffic signs, and obstacles by analyzing each video frame captured by onboard cameras.
Text analysis is the work of transforming or classifying free text. The texts here can be Facebook posts, emails, chats, documents, etc. Some common examples are:
Spam filtering is a widely used application of text classification, which aims to accurately identify and block unwanted emails. This process uses text classification techniques on the subject and content of messages to distinguish spam from legitimate email. Advanced spam filters can also personalize their detection by "learning" each user's preferences, adapting to that user's definition of spam based on how they interact with messages and subjects.
Sentiment Analysis learns how to classify an expression as positive, negative, or neutral.
Information Extraction is the process of extracting useful information from textual sources; for example, learning how to extract an address, a person's name, or a keyword.
Data mining is a valuable process for discovering insights and making predictions from data sets, where each record represents an object to learn from and each column represents a feature. By analyzing these features, the value of a new record can be predicted, or records can be grouped to identify patterns. Common data mining applications include customer segmentation, fraud detection, market analysis, and improving decision-making processes. This technique enables businesses to extract meaningful information, optimize strategies, and enhance overall operational efficiency.
Anomaly detection is a technique for finding unusual points, for example in credit card fraud detection. A suspicious transaction may be discovered based on a change in a consumer's normal behavior.
Association rules help identify patterns in customer purchasing behavior, such as which items are frequently bought together in supermarkets and e-commerce sites. By understanding which products customers typically buy next, businesses can optimize cross-selling strategies and personalized marketing campaigns. Leveraging this data enables companies to make informed decisions that enhance sales and improve the customer experience through targeted promotions.
Grouping: for example, in a SaaS platform, users are grouped by their behavior or by profile information.
Predictions: predicting the value of columns of a new record in the database. For example, the price of an apartment can be predicted based on previous price data.
Machine learning has significantly contributed to the fields of video games and robotics, particularly in enabling game characters to navigate complex environments. Reinforcement learning, a key machine learning technique, allows characters to learn obstacle avoidance by rewarding successful navigation and penalizing collisions. This approach helps develop intelligent, adaptive behavior for game characters and robotic systems, improving their ability to reach destinations efficiently while avoiding obstacles.
1.2.2 Basic recognition techniques in machine learning
Applying AI methods combined with image processing for object recognition is a critical aspect of computer vision. Machine learning techniques for this purpose are categorized into supervised and unsupervised learning. Supervised machine learning methods, including decision trees, neural networks, SVMs, boosting, and random forests, rely on labeled datasets provided by experts to train recognition models. This training process involves analyzing and developing the model using a labeled dataset called the training dataset. Conversely, unsupervised learning algorithms operate on unlabeled data, performing classification based on analysis and statistical patterns within the input data itself.
Decision trees are a fundamental area of research in machine learning, widely utilized for knowledge extraction and pattern recognition. They serve as predictive models built on a hierarchical tree structure, organizing data samples based on a series of decision rules. In these models, the leaves represent specific classification outcomes, while the branches illustrate the combinations of features that lead to those decisions, making decision trees an effective tool for data analysis and classification tasks.
A decision tree can be trained by dividing the training data set into subsets based on tests of a single attribute value or a group of attributes. Classification can then be described as a combination of simple tests using deductive techniques. Training the classification model is the process of growing the decision tree.
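As a minimal illustration of this training process, the following Python sketch (assuming scikit-learn; the Iris data stands in for a real training set) grows a tree whose internal nodes each test a single attribute:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small labeled dataset standing in for the training dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each internal node tests a single attribute value; the leaves hold
# the classification outcomes, as described above.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))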
Random forests (RF) are a powerful ensemble learning method that constructs multiple decision trees through random feature selection, enhancing model robustness and accuracy. Developed by Tin Kam Ho in 1998 and published in an IEEE journal, RF is a supervised algorithm suitable for both classification and regression tasks, including datasets with missing values. Increasing the number of trees in a random forest helps mitigate overfitting, making the model more generalizable. Due to their effectiveness, random forest techniques are widely applied in computer vision and object classification, showcasing their versatility across various applications.
Boosting is a machine learning ensemble technique that constructs multiple classifiers which are then combined using weights to form a strong predictor. Each individual classifier, known as a weak classifier, contributes to the overall model, and their combination enhances accuracy. AdaBoost (Adaptive Boosting), developed by Freund and Schapire in 1999, is a widely used boosting algorithm that creates a nonlinear strong classifier by aggregating weighted weak classifiers. This method assigns higher weights to difficult-to-classify patterns while giving less impact to easier samples, allowing the model to focus on challenging cases during training. As training progresses, the sample weights are adjusted after each weak classifier: emphasis on correctly classified (easy) samples decreases while focus on misclassified (hard) ones increases. Consequently, subsequent classifiers primarily address samples that previous classifiers struggled with, leading to improved overall performance. Finally, the weak classifiers are combined according to their accuracy, resulting in a robust strong classifier.
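The weighting scheme described above can be made concrete with a small NumPy sketch; the update rule follows the standard AdaBoost formulation, and the toy labels below are purely illustrative:

import numpy as np

def adaboost_weight_update(w, y_true, y_pred):
    """One AdaBoost round: w are sample weights, labels are in {-1, +1}."""
    err = np.sum(w * (y_true != y_pred)) / np.sum(w)    # weighted error of the weak classifier
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # classifier weight based on its accuracy
    w = w * np.exp(-alpha * y_true * y_pred)            # raise weights of misclassified samples
    return w / np.sum(w), alpha                         # renormalize the distribution

w = np.full(5, 0.2)                       # five samples, uniform initial weights
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, -1])    # this weak classifier errs on samples 2 and 5
w, alpha = adaboost_weight_update(w, y_true, y_pred)
print(w, alpha)                           # the misclassified samples now carry more weight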
The Support Vector Machine (SVM), introduced by Cortes and Vapnik in 1995, is a supervised learning algorithm primarily designed for classification tasks. SVM constructs a model that classifies data samples into two predefined categories by identifying an optimal hyperplane in a multi-dimensional space. This hyperplane maximizes the margin, ensuring the greatest possible distance between the training data points and the decision boundary, which enhances classification accuracy. SVM requires that all samples are represented in the same feature space, and it predicts the class of new data based on their position relative to the hyperplane. Renowned for its effectiveness in binary classification, SVM has since been extended to various multiclass classification problems, making it a versatile tool in machine learning.
Figure 1.2 Classification simulation of SVM (Source: https://towardsai.net)
Support Vector Machine (SVM) remains one of the most popular classification techniques in computer science and data analysis due to its effectiveness with large datasets and high-dimensional data. SVM is particularly suited to classifying image, text, and voice data, demonstrating high accuracy compared with traditional machine learning methods. Its flexibility is highlighted by its ability to utilize various kernel functions, enabling both linear and non-linear classification. Overall, SVM is a robust and widely adopted approach for accurate data classification tasks.
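A minimal sketch of this kernel flexibility, assuming scikit-learn and synthetic data, is to train the same SVM with a linear and a non-linear (RBF) kernel:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic high-dimensional data standing in for image/text features.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The kernel choice switches between linear and non-linear decision boundaries.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)   # C trades margin width against violations
    clf.fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))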
Artificial neural networks (ANNs), inspired by biological neural networks, consist of interconnected nodes called neurons and connecting arcs organized into layers, including input, hidden, and output layers. These networks transmit information through connections that process data using propagation functions and weights, which are typically set during training. Some advanced neural networks, such as Multilayer Neural Networks (MLNN) and Self-Organizing Maps (SOM), can adapt their architecture dynamically based on real data, enabling self-learning and improved performance.
Self-learning capability is a vital component of neural networks (NN), enabling them to adaptively modify their internal structure based on data. Neural networks are complex, adaptive systems that adjust their connection weights to improve performance; each connection between neurons has a specific weight that influences signal transmission. When the network produces accurate classification results, weight adjustments are unnecessary; if the results are unsatisfactory, the weights must be modified to improve the system's adaptation and accuracy.
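The following toy Python sketch illustrates this rule in its simplest form, a perceptron-style update in which weights change only when the output is wrong (the data and learning rate are illustrative):

import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Weights are adjusted only on misclassified samples; labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else -1
            if pred != yi:            # wrong output: adjust the connection weights
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])         # a linearly separable AND-like task
print(train_perceptron(X, y))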
Deep Learning and Adaptive Learning
1.3.1 Overview of Deep Learning and Adaptive Learning
Deep Learning is a rapidly evolving field within computer vision and machine learning, characterized by algorithms that utilize multi-layered neural networks to tackle complex high-level data models. Unlike traditional machine learning, Deep Learning employs nonlinear transformations and intricate architectures, enabling more advanced data processing. Although it was introduced in the early 1990s alongside other machine learning techniques, its initial effectiveness was limited by hardware constraints and computational challenges. Pioneering researchers like LeCun played a crucial role in advancing Deep Learning, proposing solutions that laid the foundation for its modern success.
Deep Learning was first introduced by Rina Dechter in 1986, with significant advancements made by LeCun and colleagues in 1989, when they developed a neural network using backpropagation to recognize handwritten digits with high accuracy. LeCun's pioneering work laid the foundation for modern Deep Learning applications and research. Deep Learning neural networks are designed to solve complex problems by mimicking the structure of the human brain, enabling feature extraction, classification, and recognition across various fields such as voice recognition, computer vision, natural language processing, and predictive analytics. Recently, Deep Learning has attracted immense interest in computer science due to its ability to deliver higher accuracy and more effective results than traditional approaches, fueling innovation across multiple technological domains.
Adaptive learning originated from the need to develop intelligent systems that emulate the human brain's capabilities. Key advancements such as AlexNet, GoogLeNet, ResNet, R-CNN, Fast R-CNN, Faster R-CNN, and VGGNet have achieved high accuracy and effective multi-object recognition. However, most of these improvements focus on altering network structures, tuning parameters, and refining training methods, with little progress toward enabling models to automatically enhance their intelligence over time. Currently, these systems require manual intervention and labeled data for learning, highlighting the need for more autonomous, AI-driven learning capabilities.
An effective Adaptive Learning model automatically recognizes objects, trains, assesses, and updates its intelligence, reducing the need for human intervention after the initial setup. Its ability to adapt is demonstrated through the incorporation of diverse data, improved recognition of complex or unfamiliar objects, and the continuous adjustment of training parameters based on evolving datasets. For instance, in autonomous vehicle systems, the initial model can identify basic objects such as vehicles, lanes, pedestrians, trees, and traffic signs; with ongoing learning, the system can recognize more unusual or complex object forms, continually updating itself to enhance accuracy and performance as the vehicle navigates different environments.
Deep Neural Networks (DNNs) are advanced artificial neural networks (ANNs) characterized by multiple hidden layers connecting input to output. Unlike simple ANNs, DNNs feature a greater number of nodes per layer and more hidden layers, enabling them to model complex non-linear relationships more effectively. This increased depth and node count allows DNNs to process and recognize intricate patterns in data, making them powerful tools for various machine learning applications.
Figure 1.4 Simple Deep Learning network with one layer and Deep Learning network with multiple hidden layers (Source: https://www.kdnuggets.com)
Early deep neural networks resembled one-layer models, consisting of an input layer, a hidden layer, and an output layer. Over time, researchers developed deeper networks with multiple hidden layers, enhancing their ability to learn complex patterns. These advanced architectures now commonly feature more than three layers, leading to the term "deep learning." Such multi-layered neural networks significantly improve performance in tasks like image recognition and natural language processing.
The concept of "deep" refers to the number of hidden layers in a neural network.
In each layer of a Deep Learning network, nodes are trained on distinct features based on the outputs of the previous layer. As data pass into the inner layers of the network, the features become more complicated: nodes recognize, synthesize, and recombine features from prior layers to represent features at higher levels. This is known as "hierarchical featuring", the hierarchy process by which data representations become increasingly complex and abstract. A deep neural learning network can therefore handle very large, multi-dimensional datasets, with billions of parameters passed through non-linear functions.
Deep neural learning networks excel at identifying latent structures within unlabeled, unstructured databases, which are prevalent in real-world scenarios. Research shows that these networks are highly effective in analyzing unstructured data such as multimedia content, images, documents, audio, and video. Consequently, deep neural learning techniques are powerful tools for solving complex problems involving the recognition, classification, and analysis of unstructured, homologous, or abnormal data.
CNNs, or Convolutional Neural Networks, are a type of deep learning model inspired by biological processes, particularly the organization of the animal visual cortex. Pioneered by LeCun, CNNs use regularized versions of multilayer perceptrons to simplify pre-processing in various applications. They mimic the way individual cortical neurons respond to stimuli within restricted regions of the visual field, known as receptive fields, which partially overlap to cover the entire visual scene.
Figure 1.5 Architecture of a simple convolution neural network (Source: https://medium.com)
A Convolutional Neural Network (CNN) architecture comprises an input layer, an output layer, and multiple hidden layers in between. These hidden layers typically include convolutional, pooling, ReLU (rectified linear unit), normalization, and fully connected layers, as illustrated in Figure 1.5. Overall, CNNs are characterized by a layered structure of convolutional, pooling, and normalization layers, with the option of fully connected layers for greater learning capacity.
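A minimal sketch of this layer pattern follows, assuming PyTorch; the filter counts and input size are illustrative and not taken from the thesis:

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                   # rectified linear unit
            nn.BatchNorm2d(16),                          # normalization layer
            nn.MaxPool2d(2),                             # pooling layer
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # fully connected layer

    def forward(self, x):                    # x: (N, 3, 32, 32)
        x = self.features(x)                 # -> (N, 16, 16, 16)
        return self.classifier(torch.flatten(x, 1))

print(SimpleCNN()(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10])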
Some CNNs which have been introduced and are commonly used are AlexNet [27], GoogLeNet [28], Microsoft ResNet [29], R-CNN [30], Fast R-CNN [31], Faster R-CNN [32], and VGGNet [33].
In Vietnam, from the 1990s to the early years of the 21st century, well-known researchers such as Assoc. Prof. Ngo Quoc Tao and Assoc. Prof. Dr. Do Nang Toan, along with researchers like Luong Chi Mai, specialized in AI research focusing on image processing and recognition. Their notable work includes handwriting recognition, Vietnamese handwriting, speech recognition, face detection, and human body simulation, primarily using classic algorithms such as SVM, Random Forest, Hidden Markov Models, and Artificial Neural Networks. These research contributions serve as important foundational references for students and graduate researchers in the field, and their publications have significantly advanced the understanding of image processing and object recognition techniques.
Since the early 21st century, advances in AI and computer hardware have driven significant progress in machine learning and object recognition. However, in Vietnam, research on artificial neural networks and convolutional neural networks remained rudimentary, with most studies conducted by overseas Vietnamese PhD students. Since 2015, Vietnam has seen a surge in publications in international journals indexed in ISI and Scopus, authored by research groups from institutions such as Hanoi University of Technology, Ton Duc Thang University, and the National University of Ho Chi Minh City. Additionally, independent researchers have contributed to applied AI fields, including health, transportation, agriculture, and national defense, with innovations such as autonomous vehicles, robotics, and human action recognition.
The history of AI and machine learning has evolved through several key phases, beginning with Alan Turing's demonstration of machine intelligence in 1950. In 1955, American computer scientist and cognitive scientist John McCarthy coined the term "Artificial Intelligence," defining it as the science of intelligent computer systems. The following year, he organized the Dartmouth Conference, the first major event dedicated to AI, bringing together experts from institutions like Carnegie Mellon University, MIT, and IBM. Since then, the term "artificial intelligence" has become widely recognized and central to advances in computer science.
Through many different stages, AI in general and machine learning in particular have continued to grow, yielding many important algorithms such as Support Vector Machines, Random Forest, neural networks, K-means, decision trees, boosting, HOG, and so on. These algorithms are the foundation for the growth of algorithms and applications in recognition, object classification, data processing, and so on. Along with the growth of computer hardware, from 1998 onward, Deep Learning and the convolutional neural network, one of the components of machine learning, have made great progress with many applications in daily life [48, 49, 50, 51, 52]. Yann LeCun is one of the pioneers in this particular field: LeNet, one of the most famous CNN networks, was developed by Yann LeCun in 1998. The structure of LeNet consists of two (convolution + max-pooling) layers and two fully connected layers plus the output (softmax) layer, with recognition accuracy up to 99%.
In 2012, Alex Krizhevsky and his colleagues introduced the groundbreaking AlexNet model, a convolutional neural network (CNN) that revolutionized image recognition. AlexNet won the ImageNet LSVRC-2012 contest by a significant margin, reducing the error rate to 15.3% compared with 26.2% for previous approaches. This deep CNN has approximately 60 million parameters, vastly exceeding the complexity of earlier networks like LeNet. Its innovative architecture and large-scale training set new standards in CNN performance and paved the way for future advances in deep learning-based image classification.
ReLU is used instead of sigmoid (or tanh) to deal with non-linearity, increasing computing speed by 6 times.
Dropout is used as a new regularization method for CNNs. Dropout not only enables the model to avoid over-fitting but also reduces model training time.
Overlapping pooling is used to reduce the size of the model (traditionally, pooling regions do not overlap).
Local response normalization is used to normalize each layer.
The data augmentation technique is used to create additional training data through translations and horizontal reflections.
AlexNet was trained for 90 epochs over 5 to 6 days on two GTX 580 GPUs, using SGD with a learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005.
The architecture of AlexNet consists of 5 convolutional layers and 3 fully connected layers. ReLU activation is used after each convolutional and fully connected layer.
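A sketch of this reported training configuration, using the AlexNet implementation shipped with torchvision (the dummy batch stands in for ImageNet data):

import torch
import torch.nn as nn
import torchvision.models as models

# 5 convolutional + 3 fully connected layers, ReLU throughout.
model = models.alexnet(num_classes=1000)
# SGD with learning rate 0.01, momentum 0.9, weight decay 0.0005, as reported.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)      # dummy batch standing in for ImageNet data
labels = torch.randint(0, 1000, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)   # one illustrative training step
loss.backward()
optimizer.step()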
RECOGNIZING OBJECTS BY DEEP LEARNING
Object recognition problems
Artificial intelligence has significantly transformed every aspect of daily life, with machine learning and its subset, deep learning, playing a crucial role. Deep learning, built on convolutional neural networks, has revolutionized technologies like voice and object recognition, advanced medical applications, smart transportation systems, and robotics.
Deep Learning has significantly advanced in image recognition and processing capabilities, enabling models to accurately identify objects through extensive training. This chapter focuses on Convolutional Neural Networks (CNNs) and their effectiveness in object recognition, especially for autonomous vehicles. Evaluating CNN performance in recognizing the various objects crucial for autonomous driving, such as the following, helps ensure improved safety and reliability on the road:
- Vehicles: motorbikes, cars and vans
- Other objects such as houses, trees and sky
Pedestrian recognition remains the most challenging aspect of autonomous vehicle navigation due to complex movement patterns and recognition difficulties. Accurately predicting pedestrian actions and walking speeds is essential for ensuring the safety of both pedestrians and autonomous vehicles. Pedestrians are categorized into three main types: crossing, walking, and waiting, each representing a different interaction with vehicles. Recognizing pedestrian gestures, locations, and scene contexts, such as the roadway, side roads, and road edges, enables effective feature extraction from images. These features are vital for training models to accurately predict and recognize pedestrian movements, enhancing autonomous vehicle safety systems.
The proposed approach encompasses two main phases: first, training a classifier model that predicts pedestrian movement using features extracted by CNN models such as AlexNet; second, applying this system to real-time video footage from an autonomous vehicle on the road, where pedestrians are detected, regions of interest (ROI) are extracted, features are derived from these ROIs, and pedestrian movement is predicted accordingly. The process begins with pedestrian detection using the ACF algorithm, followed by feature extraction from the identified ROIs, leading to movement prediction through a trained SVM model (a schematic sketch follows Figures 2.1 and 2.2). This method combines CNN-based feature extraction with traditional detection algorithms to enhance pedestrian movement prediction in autonomous driving scenarios.
Figure 2.1 The process of extracted features by CNN model from image dataset
Figure 2.2 The process of pedestrian movement prediction
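The second phase can be summarized in the following schematic Python sketch; detect_pedestrians, cnn_features, and svm are placeholders for the ACF detector, the AlexNet-based feature extractor, and the trained SVM classifier, none of which are specified in code by the thesis:

def predict_pedestrian_movements(frame, detect_pedestrians, cnn_features, svm,
                                 score_threshold=0.25):
    """One frame of the pipeline in Figure 2.2; all callables are placeholders."""
    predictions = []
    for bbox, score in detect_pedestrians(frame):        # step 1: ACF detection
        if score < score_threshold:                      # drop low-confidence detections
            continue
        x, y, w, h = bbox
        roi = frame[y:y + h, x:x + w]                    # step 2: extract the ROI
        features = cnn_features(roi)                     # step 3: CNN feature extraction
        label = svm.predict(features.reshape(1, -1))[0]  # step 4: movement prediction
        predictions.append((bbox, label))
    return predictions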
The camera used has a resolution of 2 megapixels or more, and the collected images have a minimum resolution of 72 dpi.
Detecting and recognizing vehicles is useful for traffic control and vehicle separation.
As technology advances, the rising demand for travel and the increasing number of vehicles highlight the challenges of managing and separating vehicles effectively. Implementing high-precision automatic control systems is crucial for addressing these issues within Intelligent Transportation Systems (ITS). While solutions like vehicle-mounted sensors and internet networking enable data collection and decision-making, limitations such as device production constraints, bandwidth issues, and high setup costs restrict their widespread adoption. Therefore, developing reliable automatic vehicle recognition and classification systems is essential for enhancing traffic management and operational efficiency.
The proposed solution begins with acquiring images from surveillance cameras in the ITS, which are analyzed to recognize objects of interest and identify transportation types. Instead of vehicle detection methods, this approach focuses on recognition models, primarily utilizing a semantic segmentation model based on SegNet's CNN architecture to accurately identify vehicles. Detected vehicles are then extracted to define regions of interest (ROI), representing specific vehicle samples. To enhance accuracy, the CNN model can be combined with data augmentation techniques. The recognition outcomes are integrated into the ITS to send alerts for violations, such as vehicles crossing restricted lines, ensuring efficient traffic management and enforcement.
Figure 2.3 Proposed vehicle detection model
Suggested solution
Object recognition has been introduced with three basic steps:
(1) Detecting and extracting areas of interest
(2) Extracting features and training recognition models
(3) Recognizing objects
However, step 1 may be unnecessary once the target object has already been identified. Each step can use different techniques:
- Detecting and extracting areas of interest: use image semantics to extract areas of interest (pedestrians, vehicles, traffic signs, etc.)
- Extracting features and training recognition models: build and introduce Deep Learning models to extract object features. An SVM model is suggested for training the recognition models
- Recognizing objects: use the trained recognition models to recognize and classify objects according to the individual problem
2.2.1.1 Extracting features and training classifier model
In machine learning, convolutional neural networks (CNNs) are widely used for analyzing visual imagery due to their effectiveness in image recognition tasks. Various models such as AlexNet, GoogLeNet, ResNet, and the R-CNN family have been developed, each characterized by different architectures, sizes, and depths that contribute to their low error rates. This thesis proposes the use of the AlexNet CNN model, which is known for its efficient processing time, making it suitable for rapid image analysis applications.
The AlexNet model effectively extracts and preserves fundamental features of the input images, as illustrated in Figure 2.4. A dataset of 3,000 images was used, comprising 1,000 crossing pedestrians, 1,000 walking pedestrians, and 1,000 waiting pedestrians, sourced from real street videos on the internet (http://youtube.com). Each image undergoes convolutional neural network (CNN) processing to extract detailed features such as pedestrian postures, roadways, roadsides, and pedestrian positions on the road, as shown in Figure 2.5. These rich features serve as the basis for training an SVM classifier model; visualizations of the input images and the extracted features are provided in parts a) and b) of the figure.
Figure 2.4 Input images and simulate rich features of image
In CNN models, numerous feature layers can be extracted, including convolutional and fully connected layers. Among these, layer 19 (commonly known as fc7, with 4096 units), located immediately before the classification layer, is considered the most advantageous for feature extraction. This layer captures high-level, discriminative features that are highly effective for tasks such as image recognition and transfer learning, making it a preferred choice in many deep learning applications.
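A sketch of this extraction, assuming the torchvision AlexNet implementation, where the classifier is run up to the ReLU that follows fc7:

import torch
import torchvision.models as models

# Extract the 4096-unit fc7 activations, i.e. the layer just before
# AlexNet's classification layer.
alexnet = models.alexnet(weights="IMAGENET1K_V1")
alexnet.eval()

def fc7_features(batch):                    # batch: (N, 3, 224, 224) image tensor
    with torch.no_grad():
        x = alexnet.features(batch)         # convolutional feature layers
        x = torch.flatten(alexnet.avgpool(x), 1)
        # run the classifier up to the ReLU after fc7, stopping before
        # the final 1000-way classification layer
        return alexnet.classifier[:6](x)

print(fc7_features(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 4096])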
Typically, in cases of object recognition such as animals, things, and vehicles, the object recognition rate is higher (90% to 100%). In the case of predicting pedestrian actions, the features of the input images involve not only a specific object but also others such as vehicles, buildings, trees, and things around roadsides, as shown in Figure 2.5.
Figure 2.5 Influence of other objects on the road on pedestrian movement prediction
In this regard, in terms of accuracy, the ACF algorithm is used to detect pedestrians before extracting ROIs, classifying, and predicting pedestrian actions.
Pedestrian detection uses an ACF classification model such as 'inria-100x41' or 'caltech-50x21', focused on identifying people in images. The 'inria-100x41' model, trained on the INRIA person dataset, is commonly used as the default in ACF implementations, while the 'caltech-50x21' model is trained on the Caltech pedestrian dataset. The ACF algorithm outputs detection scores, confidence values ranging from 0 to 1, indicating the likelihood that a pedestrian is present; higher scores correspond to greater detection accuracy. When a pedestrian is detected, a bounding box appears with the confidence score displayed on top, where a larger score signifies higher reliability. In complex images, the ACF model can produce false positives, so a threshold score of 0.25 is recommended during real-time detection to reduce errors: scores below this value, such as 0.1, tend to be less accurate, whereas scores of 0.25 or higher improve detection reliability (a small filtering sketch follows Figure 2.7).
Figure 2.6 Example input image for recognition
Figure 2.7 Pedestrian detection with scores = 0.1 (a) and scores = 0.25 (b)
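A minimal sketch of this thresholding rule; the detection lists are placeholders for ACF outputs:

def filter_detections(bboxes, scores, threshold=0.25):
    # Keep only detections whose confidence score reaches the recommended
    # 0.25 threshold, suppressing likely false positives.
    return [(box, s) for box, s in zip(bboxes, scores) if s >= threshold]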
When autonomous vehicles (AVs) navigate busy roads, images often contain numerous pedestrians in a single frame, which can hinder accurate detection. To enhance recognition precision, each frame is segmented into multiple ROIs (regions of interest), as shown in Figure 2.8, allowing better analysis of individual pedestrians. Since real-time images captured by AVs are large and contain extraneous data, extracting specific ROIs at appropriate scales removes irrelevant objects and focuses on pedestrians. This targeted ROI extraction improves the CNN model's ability to extract precise features and reduces error rates in action recognition, ultimately enhancing the accuracy of SVM classification.
To accurately define the region of interest (ROI) in an image, consider a rectangle covering the pedestrian object, where H and W represent its height and width. The coordinates x and y specify the top-left corner of this rectangle within the input image, which has dimensions Width and Height. The ROI is characterized by the values x1, y1, W1, and H1, which describe its position and size within the image. Properly identifying these parameters ensures precise localization of pedestrian objects for image analysis and computer vision tasks.
In special cases, when x1, y1, W1, H1 are smaller than the edge values of the frame or bigger than the size of the input image, the values are set to the edge values of the image.
On the other hand, when the ROI falls outside the input image, the ROI is offset toward the opposite side, as proposed in Figure 2.8 (a code sketch of this rule follows the figure).
Figure 2.8 ROI extraction from pedestrian image
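One way to read this boundary rule in code, as a hypothetical Python helper; the clamping and opposite-side offset follow the description above:

def clamp_roi(x1, y1, w1, h1, width, height):
    """Keep an ROI of size (w1, h1) inside an image of size (width, height)."""
    if x1 + w1 > width:               # ROI spills over the right edge:
        x1 = max(width - w1, 0)       # shift it left by the overflow
    if y1 + h1 > height:              # same for the bottom edge
        y1 = max(height - h1, 0)
    x1, y1 = max(x1, 0), max(y1, 0)   # clamp the top-left corner to the frame
    w1 = min(w1, width - x1)          # never exceed the image size
    h1 = min(h1, height - y1)
    return x1, y1, w1, h1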
Pedestrian movement prediction involves extracting the region of interest (ROI) into a single image, followed by feature extraction using a convolutional neural network (CNN) model. These features are then classified with a Support Vector Machine (SVM) classifier to determine pedestrian behavior. The system labels the outcomes according to the predicted pedestrian activity, enabling accurate, real-time pedestrian movement analysis:
(i) Pedestrian_crossing: When a pedestrian is crossing or walking in the roadway among other vehicles
(ii) Pedestrian_waiting: When a pedestrian is standing on the roadside and waiting to cross
(iii) Pedestrian_walking: When a pedestrian is walking on the edges of the road
Figure 2.9 The order of classifications of pedestrians when there are many pedestrians on the road in an input image
Reusing pre-trained models like AlexNet and GoogLeNet is often not suitable for the vehicle recognition task, due to differences in model size and the limited accuracy improvement achievable with the existing training parameters. Instead, we developed a custom 24-layer CNN architecture tailored for vehicle recognition, comprising input, convolution, ReLU, normalization, max-pooling, and fully connected layers that transform input images into hierarchical feature descriptors (a hedged sketch follows Table 2.1). The model processes 128×128×3 RGB images, with the initial filters operating on the three color channels both independently and jointly across layers. The final feature vectors extracted from the convolutional layers are used for vehicle classification, ensuring the network's effectiveness for this specific recognition problem.
Table 2.1 CNN architecture with 22 hidden layers, 1 input layer, and the final classification layer
5 Max Pooling 3x3 max pooling with stride [1 1]
8 Max Pooling 2x2 max pooling with stride [1 1]
12 Max Pooling 2x2 max pooling with stride [1 1]
15 Max Pooling 2x2 max pooling with stride [1 1]
19 Max Pooling 2x2 max pooling with stride [1 1]
20 Fully Connected 1024 fully connected layer
22 Fully Connected 4 fully connected layer
24 Classification Output crossentropyex with 4 other classes
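A hedged PyTorch reconstruction of this network follows. Only the pooling, fully connected, and output rows are reproduced in Table 2.1, so the convolution filter counts and kernel sizes below are assumptions (the 64 filters of size 7x7 in the first convolution layer are taken from the experimental description in Section 2.3); the adaptive pooling before the flatten is added only to keep the sketch tractable and is not in the table:

import torch
import torch.nn as nn

vehicle_net = nn.Sequential(
    # Layers 1-4 (assumed): convolution + ReLU + normalization on the RGB input
    nn.Conv2d(3, 64, kernel_size=7, padding=3), nn.ReLU(), nn.LocalResponseNorm(5),
    nn.MaxPool2d(3, stride=1),                                  # layer 5: 3x3 max pooling, stride [1 1]
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),     # layers 6-7 (assumed)
    nn.MaxPool2d(2, stride=1),                                  # layer 8: 2x2 max pooling, stride [1 1]
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.LocalResponseNorm(5),
    nn.MaxPool2d(2, stride=1),                                  # layer 12
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, stride=1),                                  # layer 15
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(), nn.LocalResponseNorm(5),
    nn.MaxPool2d(2, stride=1),                                  # layer 19
    nn.AdaptiveAvgPool2d(4),                                    # not in the table; keeps the sketch small
    nn.Flatten(),
    nn.Linear(256 * 4 * 4, 1024), nn.ReLU(),                    # layers 20-21: 1024-unit fully connected
    nn.Linear(1024, 4),                                         # layer 22: 4-way fully connected
    # layers 23-24 (softmax + cross-entropy output) are supplied by the loss at training time
)

logits = vehicle_net(torch.randn(1, 3, 128, 128))   # 128x128x3 RGB input
print(logits.shape)                                 # torch.Size([1, 4])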
The training data set classified during the collection is shown in Figure 2.10
To enhance vehicle recognition accuracy, we augmented the dataset roughly tenfold using several techniques. The images were rotated within the range of -50° to 50°, flipped, and had noise added, while maintaining consistent image quality during training. The expanded training dataset resulting from these augmentation methods is detailed in Table 2.5.
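A sketch of this augmentation, assuming torchvision transforms; the file name is hypothetical and the noise amplitude is illustrative:

import torch
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=50),        # random rotation within -50..50 degrees
    transforms.RandomHorizontalFlip(p=0.5),       # flipping
    transforms.ToTensor(),
    # additive noise, clamped back to the valid pixel range
    transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0.0, 1.0)),
])

image = Image.open("motor_sample.jpg")            # hypothetical file name
augmented = [augment(image) for _ in range(10)]   # roughly tenfold enlargement per image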
Experimental evaluation
2.3.1.1 Extracting features and training classifier model
This experiment uses approximately 3,000 images, whose relevant features are extracted with a CNN model. These features are then used to train an SVM classifier, enhancing the model's ability to accurately categorize images. Table 2.2 presents the dataset, including images, labels, and the extracted features used for training, illustrating the entire process from feature extraction to classifier training for improved image recognition performance.
Table 2.2 Image and label datasets of extracted and trained features
Category              Images   Label
Pedestrian crossing   1,000    Pedestrian_crossing
Pedestrian waiting    1,000    Pedestrian_waiting
Pedestrian walking    1,000    Pedestrian_walking
90% of the images from each set are used as training data and the remaining 10% are used for validation.
2.3.1.2 Pedestrian detection and action prediction
Using the ACF pedestrian detection algorithm on the input images (Figure 2.6) produces the output shown in Figure 2.11. When multiple pedestrians are present in a frame, regions of interest (ROI) are extracted into single images for action prediction using the SVM classifier. Features are extracted from each image in Figure 2.11, enabling the system to classify pedestrian actions accurately. Ultimately, the SVM classification model predicts pedestrian behaviors and issues appropriate alerts for autonomous vehicles, as depicted in Figure 2.9.
Figure 2.10 Pedestrians detected and ROI extracted
The maximum recognition-rate results after training and comparison with the dataset in Table 2.2 are as follows:
Table 2.3 Maximum confusion matrix for pedestrian action prediction
Real-time on-road video experiments demonstrate an accuracy rate ranging from 82% to 97%, showing reliable pedestrian detection. The system achieves processing speeds of 0.6 seconds per pedestrian, highlighting its potential for self-driving vehicle applications. These promising results indicate the approach's suitability for autonomous vehicle safety systems.
We conducted experiments using a real vehicle database, including motorcycles, cars, coaches, and trucks, captured from actual traffic scenarios. The dataset consists of 8,558 images collected on various practical traffic routes in Nha Trang city, Khanh Hoa province, Vietnam. The camera systems typically capture vehicles either in front of or behind, reflecting real-world traffic conditions. The data are categorized into four vehicle classes (motorcycles, cars, coaches, and trucks), as illustrated in Figure 2.10. For model training and evaluation, the dataset is divided into 60% for training and 40% for testing, as detailed in Table 2.4.
Figure 2.11 Some examples of vehicle categories
Sample size Overall Train Evaluation
Table 2.5 Training data after augmentation and balance data
The results obtained after CNN model training are as follows:
(i) Filter parameters: The first convolution layer uses 64 filters, whose weights are shown in Figure 2.12:
The first convolutional layer features 64 filters of size 7x7, each connected to the three RGB input channels, as shown in Figure 2.12. When sample images are processed through these convolution filters, the resulting feature maps highlight features distinct from the original RGB images, capturing various vehicle characteristics. The convolution outputs may include negative values, which require normalization through linear correction to enhance the network's performance. The processed layer outputs, including the input motor sample pattern, demonstrate the network's ability to extract diverse and meaningful features for vehicle recognition.
(a) The output of 64 convolutions at the first convolution layer
(b) The linear correction value after the first convolution layer
(c) The output of 64 samples at the second Convolution layer
Figure 2.13 Some results of linear convolution and linear correction for the input images being motors
Based on the experiments, three different methods have been evaluated on the same set of sample data, as shown in Table 2.4. The methods are: (i) the traditional method of HOG and SVM; (ii) the CNN network; (iii) the CNN network in combination with data augmentation.
The accuracy of the HOG and SVM method on the sample data set was 89.31%. Details of the sample size for each type and the recognition results are shown in Table 2.6.
Table 2.6 Confusion matrix of vehicle recognition using HOG and SVM
#Num Per(%) #Num Per(%) #Num Per(%) #Num Per(%)
The evaluated accuracy of the CNN method on the original data reached 90.10% on average, as shown in Table 2.7.
Table 2.7 Confusion matrix of vehicle recognition using CNN
#Num Per(%) #Num Per(%) #Num Per(%) #Num Per(%)
The evaluated accuracy of the CNN method with data augmentation reached 95.59% on average, as shown in Table 2.8.
Table 2.8 Confusion matrix of vehicle recognition using CNN and data augmentation
#Num Per(%) #Num Per(%) #Num Per(%) #Num Per(%)
This study compared the proposed CNN model with a traditional approach using HOG feature descriptors and an SVM classifier. The comparison results, illustrated in Figure 2.14, demonstrate the performance differences between the deep learning method and conventional techniques, highlighting the effectiveness of the CNN model.
Figure 2.14 Comparison of HOG+SVM, CNN model and CNN with augmenting data
DEVELOPMENT OF ADAPTIVE LEARNING TECHNIQUE IN OBJECT RECOGNITION
Adaptive learning problem in object recognition
Advances in object recognition driven by deep convolutional neural networks (CNNs) have significantly increased recognition accuracy, supported by powerful computer hardware enabling more complex, multi-layered models trained on extensive datasets. While these systems excel at identifying objects similar to the training data, their performance diminishes when object appearance varies with environmental conditions such as brightness, rain, fog, or motion-induced vibration. Large training datasets, although comprehensive, often cannot encompass all real-world object states and are constrained by limited computational resources and time. To address these challenges, adaptive approaches have been proposed to automatically update and enhance recognition models, aiming to achieve higher accuracy in diverse practical scenarios.
Suggested solutions
This chapter proposes an adaptive learning solution based on CNN models for ADAS recognition systems, enabling automatic model updates through real-time data collection during normal operation. The method focuses on retraining the recognition model with new datasets that differ from previous ones, improving its adaptability and accuracy over time. Key advantages include the system's ability to learn and incorporate new information independently, reducing reliance on manual data labeling by experts. Leveraging online storage technologies and high-speed data transmission platforms such as 5G and cloud infrastructure, the proposed solution efficiently manages data storage and model updates. The approach encompasses five main stages, ensuring continuous model enhancement and robust performance in dynamic driving environments (a schematic sketch follows the list below):
(1) Object detection with low reliability
(2) Object tracking over the following n images to determine whether they are objects of interest
(3) Labeling objects "Positive" in the dataset when they are identified with high reliability; conversely, if the tracked objects turn out not to be objects of interest, labeling all of the objects tracked across the previous images "Negative"
(4) Establishing a training dataset by combining the existing training dataset with the new dataset
(5) Retraining and updating the model if the new version has higher accuracy than the old one
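A schematic Python sketch of this five-stage loop follows; the model methods detect_low_confidence, track, and retrain are hypothetical placeholders for the system components, and confidence_h denotes the high-confidence threshold defined below:

def adaptive_learning_step(frames, model, train_set, confidence_h=0.9):
    """One pass of the five-stage loop; all model methods are placeholders."""
    new_data = []
    for obj in model.detect_low_confidence(frames[0]):       # stage 1: low-reliability detection
        track = model.track(obj, frames[1:])                 # stage 2: follow over n frames
        if track.final_confidence >= confidence_h:           # stage 3: confirmed object of interest
            new_data += [(crop, "Positive") for crop in track.crops]
        elif track.rejected:                                 # stage 3: confirmed not of interest
            new_data += [(crop, "Negative") for crop in track.crops]
    combined = train_set + new_data                          # stage 4: merge old and new datasets
    candidate = model.retrain(combined)                      # stage 5: retrain a candidate model
    if candidate.accuracy > model.accuracy:                  # replace only if it is better
        model = candidate
    return model, combined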
Extensive trials comparing the proposed PDNet model with modern architectures like AlexNet and VGG demonstrated that PDNet, which self-learns over time, achieves higher accuracy. Additionally, the adaptive learning mechanism of the proposed model can be integrated with traditional recognition models such as AlexNet and VGG to enhance their overall accuracy and performance.
3.2.2.1 Concept Definitions of System Components
Before going into the details of the system's functional blocks, some concepts are classified and defined as follows:
(1) Adaptive learning. Adaptive learning in deep learning models enables self-learning and self-adaptability, allowing systems to automatically improve their object recognition capabilities over time. This adaptive process reduces the need for manual data supplementation and expert intervention, making AI models more efficient and scalable for various applications.
(2) Interest objects (IO). The objects of interest to detect and recognize; for example, traffic signs, vehicles, etc.
(3) Confidence scores. A measure of reliability when an object is detected as an IO. The confidence score of object O is denoted Conf(O); Confidence_H is a high-confidence threshold.
(4) Confident tracking. The process of tracking an object once it has been detected as an IO.
(5) Lost objects (LO). Objects initially detected with low confidence that are tracked across multiple frames but continue to have low confidence scores; when such objects fail to appear in subsequent frames, they are considered lost. Accurate handling of lost objects improves the reliability of object detection and tracking in video analysis.
(6) Negative objects (NO). Objects initially identified as interest objects (IO) with confidence scores below the Confidence_H threshold; they are tracked across multiple frames but ultimately determined not to be genuine interest objects. Filtering out these false positives over consecutive frames improves recognition accuracy.
NO = {O1, O2, ..., On | Oi ∈ IO and Conf(Oi) < Confidence_H}