Declaration of Authenticity
I declare that this research is my own, carried out under the supervision of Assoc. Prof. Le Hong Trang and Mr. Nguyen Quang Duc. The results of this study are credible and have not yet been made public. All materials utilized in this research were gathered
by myself from various sources and are properly cited in the reference section.
Furthermore, all of the research results are properly referenced and unrelated to the initial data.
In any event, I stand by my actions and accept responsibility for any plagiarism. Thus, any copyright violations resulting from this research are not the responsibility of University of Technology-Vietnam National University Ho Chi Minh City.
Ho Chi Minh City, Dec 2024
Project Author
Ly Kim Phong
Second, I want to express my appreciation to my advisers, Assoc. Prof. Le Hong Trang and Mr. Nguyen Quang Duc. Without their assistance and support, I would not have been able to finish my paper effectively; they have been incredibly gracious and patient in guiding me through problems.
Third, I also want to express my deepest thanks to Dr. Nguyen Duc Dung, my reviewer. He pointed out my mistakes and shortcomings and provided some direct advice to improve my work.
In closing, I would like to thank all the teachers, TAs, and the Department of Computer Science for their assistance and support in getting me ready for this project; their opinions and assessments have been invaluable. Without their help, I could not have completed this job. They provided guidance for the path of my studies.
One more time, I would like to express my gratitude and admiration to everyone who has helped and inspired me. Thank you to everyone.
[...] of early detection employing deep learning models, machine learning, computer vision, and hardware monitoring. We provide a one-stage model of YOLOv10 based on reference research. In order to optimize the model and further prune the models, further enhancements for this project require obtaining more balanced datasets.
1.1 Motivation
1.2 Problem statement
1.3 Scope
1.4 Structure of this project
2 Preliminaries
2.1 Background knowledge
2.1.1 Deep Learning Neural Network
2.1.2 Basic components of a neural network
2.1.3 How does a Deep Learning Neural Network work
2.1.4 CNN
2.2 Relevant Models
2.2.1 YOLO family
2.2.2 SSD
2.2.3 RetinaNet
2.2.4 Faster R-CNN
2.3 Training Losses
2.4 Evaluating metrics
3 Pest detection
3.1 Dataset
3.2 Training results
3.3 Evaluation
4 Mobile app integration
4.1 Framework
4.2 Database framework
4.2.1 Google Firebase
4.2.2 Firestore
4.3 Authentication
4.4 Requirements
4.5 Use-case
4.6 Database structure
4.7 Architecture
4.8 Implementation
5 Conclusion
List of Figures
2.1 The basic building block of Deep Learning models - Perceptron
2.2 Convolution layer [17]
2.3 Hyperbolic tangent function [7]
2.4 Logistic-curve function [15]
2.5 ReLU function [3]
2.6 Deep Neural Network architecture [4]
2.7 Backpropagation algorithm [12]
2.8 CNN history [6]
2.9 CNN [24]
2.10 YOLOv10 architecture Dual Label Assignment
2.11 Example of YOLOv10 architecture
2.12 Example of PR curve
3.1 Original dataset with 2 classes: Miner (purple) and Brown eye spot (yellowish)
3.2 tflite data distribution
3.3 Example of tflite dataset with normal class
3.4 Original and +40% brightness applied
3.5 Original and 10% noise applied
3.6 F1 curve
3.7 Confusion matrix normalized
3.8 PR curve
4.1 Use-case diagram
4.2 Database architecture
4.3 Architecture diagram
4.4 Slider
4.5 Login/Sign up
4.6 Recover account
4.7 Home & Recent activities & Profile pages
4.8 Chatbot page
4.9 Update info & Change password & History pages
List of Tables
3.1 tflite class distribution
3.2 List of augments
3.3 Training losses
3.4 Evaluating results
3.5 Inference time
1.1 Motivation
Our research aims to develop an innovative application capable of identifying pests in coffee plants through image analysis of roots, stems, and leaves. This tool is designed to provide farmers with timely alerts, enabling them to implement preventive measures and mitigate potential losses.
The motivation for this project stems from several key factors. Coffee is a vital agricultural product, cultivated extensively in numerous countries and contributing significantly to global economies. Pests and diseases, if left unchecked, can drastically reduce yields and profitability. In addition, there is an urgent need for efficient and accessible solutions to facilitate early threat detection for farmers.
We are confident that this application will create a transformative impact on the coffee industry by enhancing productivity, protecting farmers' livelihoods, and contributing to a sustainable coffee supply chain that benefits all stakeholders.
1.2 Problem statement
Diseases such as coffee rust, coffee berry disease, and pink disease pose significant threats to coffee plants by infecting leaves and impairing the ripening process, ultimately reducing coffee bean yields. Early detection of these diseases is critical, and this can be achieved through automated systems capable of identifying symptoms at their initial stages.
Currently, most leaf disease detection systems rely on convolutional neural networks (CNNs) and their variants, including R-CNN, Fast R-CNN, Faster R-CNN, SSD, and YOLO, to identify disease-induced damage. While these methods are effective, they depend heavily on accurately labeled datasets and face limitations in adaptability. Specifically, the addition of new disease variants often necessitates retraining the model from the beginning, which is both time-consuming and resource-intensive.
To address these limitations, we first propose a one-stage approach using the YOLOv10x general object detection model to directly identify leaf diseases. This method also calculates the probability of various diseases, such as rust and miner infestations. Our approach enables seamless incorporation of new disease classes by simply adding a classification label, eliminating the need for complete model retraining. Then we develop a mobile application that takes images of coffee leaves from farmers as input, detects the diseases in those images, provides the details, and recommends treatments for those diseases.
This report presents the experiments conducted and the findings that inform the development of a robust, scalable model for our application.
1.3 Scope
In this project, my aim is to achieve these goals:
1. Assessing current document matching methods and making the necessary adjustments to determine which ones best meet our needs.
2. Using the latest model to get a more accurate model architecture.
3. Developing a basic mobile application to demonstrate our model.
1.4 Structure of this project
The rest of this paper is organized as follows. In Section 2.1, we recall some background on deep neural networks, how they work, and an introduction to CNNs. Section 2.2 is a brief introduction to relevant models. Sections 2.3 and 2.4 cover the training losses and the metrics used to evaluate the model's performance. Chapter 3 describes the dataset used in this project and our training and evaluation results. Chapter 4 discusses what we have done to develop the application. Chapters 5 and 6 discuss the results and future improvements that can be made.
Chapter 2
Preliminaries
2.1 Background knowledge
2.1.1 Deep Learning Neural Network
Before describing what we have done in this project, we need some basic knowledge about Deep Learning Neural Networks.
First of all, Machine Learning is a field of study in artificial intelligence that enables algorithms to uncover hidden patterns within datasets, allowing them to make predictions on new, similar data without explicit programming for each task.[14] Deep Learning, conceptualized by Geoffrey Hinton in the 1980s, is a subset of Machine Learning that uses functions to map input to output. These functions form a relationship between the input and the output by extracting essential information from the input data. This is called learning, and the process of learning is training.[5]
Next, Neural Networks, also known as Artificial Neural Networks, which Hinton also helped pioneer, are Deep Learning algorithms structured similarly to the organization of neurons in the brain. Hinton took this approach because the human brain is arguably the most powerful computational engine known today.[20] Like the neurons of our brain, the basic building blocks of a neural network are called neurons or perceptrons (nodes). The network consists of three kinds of layers of perceptrons: the input, hidden, and output layers. Between the hidden layers and the output layer, the calculations of the interconnected nodes are carried out, and within those calculations weights, biases, and activation functions are applied.
In the next sections, we will introduce in more detail the basic components of a neural network, how a Deep Learning Neural Network works, what CNNs are, and why we use a CNN for our problem.
2.1.2 Basic components of a neural network
First, let us have a look at a perceptron or neuron:
Figure 2.1: The basic building block of Deep Learning models - Perceptron
Inputs: The values passed to a neural network to make predictions; they are the features of a dataset.
Weights: Real values associated with the inputs that indicate the significance of each feature passed in.
Bias: Its role is to shift the activation function left or right across the plane. More information will be explained later.
Sum: A function that adds up the products of each weight and its input, plus the bias.
Layers: Layers in a deep learning model form the fundamental components of its architecture. They process data sequentially, where each layer takes input from the previous one and passes the output to the next.[22]
There are several types of layers we want to introduce in this project:
• Dense layer: Also called a fully connected layer, it uses a linear operation mainly to transform the dimensionality of the input to fit the desired output (e.g., classification probabilities in the final layer), but sometimes it is used to aggregate and process information. Below is its mathematical operation.
For an input vector x:
y= σ (W x + b)
where σ is the activation function, W is the weight matrix, b is the bias, and y is the
output.[19]
• Pooling layer: Used for scaling down the input.
• Normalization layer: Normalization layers are components in neural networks that help to stabilize and hasten the training process by normalizing the inputs to a layer.[16] They are particularly useful in deep learning architectures, including convolutional neural networks (CNNs) and fully connected networks.
• Convolutional layer: A convolutional layer is a fundamental component of convolutional neural networks (CNNs), primarily used for processing and analyzing visual data. It applies a mathematical operation called convolution to the input data, utilizing filters (or kernels) to extract features such as edges, textures, and patterns. This process helps the network learn hierarchical representations of the input data.[21]
Figure 2.2: Convolution layer [17]
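To make the layer types above more concrete, the following is a minimal NumPy sketch of each operation. It is only an illustration under simplifying assumptions: single-channel 2D inputs, no learned scale or shift in the normalization, and a hypothetical horizontal-difference kernel. The function names are ours and do not come from any framework used in this project.

```python
import numpy as np

def dense(x, W, b, activation=np.tanh):
    """Dense layer: y = sigma(W x + b)."""
    return activation(W @ x + b)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    windows = x[:h, :w].reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3))

def normalize(x, eps=1e-5):
    """Shift and scale to zero mean and unit variance (no learned parameters)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def conv2d(x, kernel):
    """Valid 2D cross-correlation, the operation deep learning libraries call convolution."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
edge_kernel = np.array([[1.0, -1.0]])              # hypothetical filter
features = conv2d(image, edge_kernel)              # shape (4, 3)
pooled = max_pool2d(image)                         # shape (2, 2)
y = dense(np.array([1.0, 2.0]), np.array([[0.5, -0.5]]), np.array([0.5]))
```

In practice these operations are batched, multi-channel, and implemented by the deep learning framework; the loops here are kept only for readability.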
Activation function: It is used to add non-linearity to the model. Here are some activation functions that are commonly used:
• Tanh function: The Tanh or hyperbolic tangent function, commonly denoted tanh(x), is a mathematical function that is widely used as an activation function in neural networks. It is defined as the ratio of the hyperbolic sine and hyperbolic cosine functions[9]:
$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Figure 2.3: Hyperbolic tangent function [7]
• Sigmoid / Logistic function: The sigmoid function, also known as the logistic function, is a mathematical function that maps any real-valued number into a value between 0 and 1. It is defined by the following formula[15]:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Figure 2.4: Logistic-curve function [15]
• ReLU function: The Rectified Linear Unit (ReLU) function is a widely used activation function in deep neural networks. It is defined as:
f(x) = max(0, x)
In other words, the ReLU function returns 0 if the input is negative, and the input value itself if it is positive.[26]
Figure 2.5: ReLU function [3]
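The three activation functions above can be written directly from their definitions. The sketch below uses NumPy and is meant only to illustrate the formulas; real frameworks ship numerically more careful versions.

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent: (e^x - e^-x) / (e^x + e^-x), output in (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def sigmoid(x):
    """Logistic function: 1 / (1 + e^-x), output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
tanh_out = tanh(x)
sigmoid_out = sigmoid(x)
relu_out = relu(x)
```

One way to see how the first two relate: tanh is a rescaled sigmoid, tanh(x) = 2 f(2x) − 1, so it has the same S shape but is centered at zero.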
Loss function: The loss function, also known as a cost function, is a mathematical function applied in machine learning to measure the discrepancy between the actual output and the output predicted by the model. The primary notion behind fitting a machine learning model is to minimize the loss function, which pushes the model to make accurate predictions rather than random ones.[2]
In this project, we will only focus on 2 loss functions for object detection: the first one is for classification, the other is the IoU loss. There are 3 types of IoU, but we only give the Complete IoU (CIoU) loss here.
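$$L_{CIoU} = 1 - IoU + \frac{d^2}{C^2} + \alpha v$$
where
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
and
$$\alpha = \frac{v}{(1 - IoU) + v}$$
Here d is the distance between the centers of the predicted and ground-truth boxes, C is the diagonal length of the smallest box enclosing both, and w, h and w^{gt}, h^{gt} are the widths and heights of the predicted and ground-truth boxes. α is a function of IoU. The above equation states that the aspect ratio factor is less important in the case of no overlap and more important in the case of more overlap.[8]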
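As a worked illustration of the CIoU terms, here is a small pure-Python sketch. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates with positive width and height, and it adds a small eps to avoid division by zero; both choices are ours and are not taken from the training code of this project.

```python
import math

def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss for two boxes in (x1, y1, x2, y2) format."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1

    # IoU: intersection area over union area
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter
    iou = inter / (union + eps)

    # d^2: squared distance between box centers
    d2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # C^2: squared diagonal of the smallest enclosing box
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2

    # Aspect-ratio consistency term and its weight
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + d2 / (c2 + eps) + alpha * v

same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))       # identical boxes
disjoint = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))   # no overlap, same shape
```

For identical boxes every term vanishes and the loss is approximately 0. For the two disjoint unit squares, IoU and v are both 0, so the loss reduces to 1 + d²/C² = 1 + 8/18: the center-distance term still provides a useful gradient even when the boxes do not overlap.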
2.1.3 How does a Deep Learning Neural Network work
Deep Neural Networks consist of several layers of interconnected artificial neurons or nodes that are arranged in a stacked configuration. Each node typically employs a simple mathematical function, often a linear function, to extract and map information. A deep neural network is composed of three types of layers: the input layer, hidden layers, and the output layer.
Data is introduced into the input layer, where each node processes the information and forwards it to the subsequent layers, known as the hidden layers. These hidden layers progressively extract features from the input data and apply transformations using the linear function. They are referred to as hidden layers because the parameters (weights and biases) within each node are not directly visible; these layers start from random parameters that modify the data, resulting in varied outputs.
Figure 2.6: Deep Neural Network architecture [4]
There are 2 key algorithms that describe how a neural network works:
• Forward Propagation: Forward propagation (or forward pass) involves the computation and retention of intermediate variables, including outputs, as data moves through a neural network from the input layer to the output layer.
• Backpropagation: The main workhorse algorithm in deep learning. It is mainly [...]
Figure 2.7: Backpropagation algorithm[12]
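The two passes can be sketched for a tiny two-layer network. This is a minimal NumPy example under our own assumptions (synthetic data, one tanh hidden layer, mean squared error, plain gradient descent with a learning rate of 0.05); it is not the training procedure used later in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 synthetic samples, 3 features
y = rng.normal(size=(8, 1))   # synthetic regression targets

params = {
    "W1": rng.normal(scale=0.5, size=(3, 4)), "b1": np.zeros(4),
    "W2": rng.normal(scale=0.5, size=(4, 1)), "b2": np.zeros(1),
}

def forward(params, X):
    """Forward pass; the hidden activation a1 is kept for the backward pass."""
    a1 = np.tanh(X @ params["W1"] + params["b1"])
    y_hat = a1 @ params["W2"] + params["b2"]
    return y_hat, a1

def loss_fn(y_hat, y):
    """Mean squared error."""
    return np.mean((y_hat - y) ** 2)

def backward(params, X, y, y_hat, a1):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    n = X.shape[0]
    d_yhat = 2.0 * (y_hat - y) / n            # dL/dy_hat
    grads = {"W2": a1.T @ d_yhat, "b2": d_yhat.sum(axis=0)}
    d_a1 = d_yhat @ params["W2"].T            # propagate to the hidden layer
    d_z1 = d_a1 * (1.0 - a1 ** 2)             # tanh'(z) = 1 - tanh(z)^2
    grads["W1"] = X.T @ d_z1
    grads["b1"] = d_z1.sum(axis=0)
    return grads

losses = []
for step in range(200):
    y_hat, a1 = forward(params, X)
    losses.append(loss_fn(y_hat, y))
    grads = backward(params, X, y, y_hat, a1)
    for name in params:
        params[name] -= 0.05 * grads[name]    # gradient descent update
```

Note that the backward pass reuses `a1` from the forward pass, which is why forward propagation retains its intermediate variables.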
Together, forward propagation and backpropagation allow the gradient to be computed efficiently through the network using the chain rule of calculus. This matters because minimizing the errors of a neural network to reach high accuracy on arbitrary tasks can admittedly be arduous, requiring many forward and backward passes. With each iteration, the algorithm gets closer to achieving higher accuracy.
2.1.4 CNN
Convolutional neural networks (CNNs) have a long and rich history, dating back to the 1980s. It was then that Yann LeCun designed and trained the first real-world convolutional neural network architecture for digit recognition in handwritten documents. It laid the groundwork for future models because it was seen as inspired by mechanisms of visual processing in the human brain. During the 1990s and 2000s, CNNs became widely used for image classification tasks thanks to advances in training methods and in CNN architectures. It was not until the 2012 ImageNet competition that CNNs truly began to show their potential, with the introduction of AlexNet, which took first place in that competition and fully showcased deep learning capabilities. Today, CNNs are used extensively to support computer vision applications, including facial recognition and medical imaging, among others.[6]
Figure 2.8: CNN history[6]
CNN is a powerful neural network for the following reasons[24]:
• Effective Feature Extraction: CNNs have the ability to automatically learn hierarchical features from raw input data, enabling them to identify spatial relationships within images. This capability minimizes the necessity for manual feature engineering, enhancing their effectiveness for tasks related to image processing.
• Robustness to Variations: CNNs are approximately invariant to translations and, when trained with suitable data, robust to rotations and other transformations, so they exhibit resilience to variations in input data. This characteristic is especially advantageous in practical applications where images may vary in scale, orientation, or lighting conditions.
• Scalability: CNNs can be scaled up with deeper architectures, allowing them to learn more complex patterns and representations. This scalability has led to [...]
• Success in Competitions: CNNs have consistently outperformed traditional methods in various benchmark competitions, such as ImageNet, which has bolstered their credibility and encouraged widespread adoption in both academia and industry.
• Versatility: While originally designed for image processing, CNNs have been successfully applied to a wide range of tasks beyond vision, including natural language processing, audio analysis, and even time-series forecasting.
• Strong Community and Research Support: The deep learning community has produced a wealth of resources, frameworks (like TensorFlow and PyTorch), and pre-trained models, making it easier for researchers and developers to implement CNNs in their projects.
2.2 Relevant Models
2.2.1 YOLO family
YOLO was introduced by Joseph Redmon and his colleagues in 2016 [23]. The original model combined bounding box prediction and class label identification into a single neural network, revolutionizing object detection.
The YOLO family includes several versions: YOLOv1, YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, and the latest, YOLOv11. Each version has built upon the previous one, introducing improvements in architecture, speed, and accuracy.
YOLOv10 was released in May 2024 by researchers from Tsinghua University. It represents recent advancements in the YOLO series, focusing on enhancing performance while reducing latency and computational requirements.
YOLOv10's architecture is designed to optimize both speed and accuracy in real-time object detection. It consists of three main components:
• Backbone: The backbone of YOLOv10 utilizes an enhanced version of CSPNet (Cross Stage Partial Network), which is effective for feature extraction. This backbone is optimized to capture rich, multi-scale features from input images while minimizing computational overhead.
• Neck: The neck of YOLOv10 is responsible for fusing features from different scales. It employs a Feature Pyramid Network (FPN) that enhances the flow of information between various feature levels, allowing the model to effectively handle objects of varying sizes.
• Head: The head of YOLOv10 is where the final predictions are made. It incorporates a dual label assignment strategy, combining one-to-one and one-to-many matching techniques. The one-to-many branch provides multiple predictions per object as a training signal, improving detection accuracy, while the one-to-one branch eliminates the need for non-maximum suppression (NMS) during inference:
– One-to-Many: It generates multiple predictions per object during training.
– One-to-One: It generates a single best prediction per object during inference.
Figure 2.10: YOLOv10 architecture Dual Label Assignment
Figure 2.11: Example of YOLOv10 architecture
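To show what the one-to-one head saves at inference time, the sketch below implements the classical greedy NMS step that detectors with many predictions per object normally need. This is not YOLOv10 code; the box format (x1, y1, x2, y2), the 0.5 threshold, and the toy boxes are our own illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate predictions of one object plus one separate object
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, so `kept` contains the indices 0 and 2. With a one-to-one head, each object already receives a single prediction, so this post-processing loop can be skipped.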
2.2.2 SSD
The Single Shot Detector (SSD) is a real-time object detection framework that utilizes a single convolutional neural network (CNN) to predict bounding boxes and class labels for multiple objects in an image. It divides the image into a grid, with each cell responsible for detecting objects within its region, and employs default boxes (also known as anchor boxes) of various aspect ratios and scales to improve detection accuracy.
Key Components of SSD:
• Backbone Network: SSD typically uses a pre-trained CNN, such as VGG16, as its backbone for feature extraction. The backbone processes the input image to generate feature maps that capture essential visual information.
• Multi-Scale Feature Maps: SSD generates feature maps at different resolutions, allowing it to detect objects of varying sizes effectively. Higher resolution maps are used for smaller objects, while lower resolution maps handle larger objects.
• Default Boxes: At each grid cell, SSD employs multiple default boxes with different aspect ratios and scales. Each default box is associated with predictions for bounding box offsets and class scores.
• Localization and Confidence Predictions: The model predicts the location of objects through bounding box offsets and class probabilities for each default box. The predictions are refined during training to minimize the localization loss and confidence loss.
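The default boxes described above can be generated with a few lines of code. The sketch below assumes a single square feature map, boxes expressed as normalized (cx, cy, w, h), and the common convention of width = scale·√ratio and height = scale/√ratio; the grid size, scale, and aspect ratios are illustrative values, not the settings of any particular SSD configuration.

```python
import math

def default_boxes(grid_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) default boxes, normalized to [0, 1], for one square feature map."""
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Box centers sit in the middle of each grid cell
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ratio in aspect_ratios:
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes

boxes = default_boxes(grid_size=4, scale=0.3, aspect_ratios=(1.0, 2.0, 0.5))
```

A 4×4 grid with three aspect ratios yields 48 default boxes, all with the same area (scale²) but different shapes. Repeating this for several feature maps with different grid sizes and scales gives the multi-scale set of boxes for which the network predicts offsets and class scores.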
Advantages of SSD:
• Real-Time Detection: SSD is designed for speed, allowing it to perform object detection in real time, making it suitable for applications like autonomous driving and video surveillance.
• High Accuracy: By leveraging multi-scale feature maps and default boxes, SSD achieves high accuracy in detecting objects across various sizes and aspect ratios.
• Simplicity: The single-shot approach simplifies the detection pipeline, reducing the computational burden compared to two-stage detectors like Faster R-CNN.
2.2.3 RetinaNet
RetinaNet is a one-stage object detection model that excels in detecting objects of varying sizes, particularly small and dense ones. It integrates a Feature Pyramid Network (FPN) for multi-scale feature extraction and employs Focal Loss to address class imbalance during training, enhancing detection accuracy. This model is efficient and effective, making it a popular choice for various applications, including aerial and satellite imagery.
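To illustrate how Focal Loss addresses class imbalance, here is a minimal NumPy sketch of its binary form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). The defaults α = 0.25 and γ = 2 follow the values commonly reported for RetinaNet; the function name and the example probabilities are our own.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss. p: predicted probability of the positive class, y: 0/1 labels."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

labels = np.array([1, 1])
probs = np.array([0.9, 0.1])      # one easy and one hard positive example
per_example = focal_loss(probs, labels)
```

The factor (1 − p_t)^γ shrinks the loss of well-classified examples (the first one here) far more than that of misclassified ones, so the many easy background anchors no longer dominate training. With γ = 0 the expression reduces to α-weighted cross-entropy.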
Architecture Overview:
• Backbone Network: The backbone, often a ResNet or ResNeXt, extracts feature maps from the input image. This network is pre-trained on large datasets and fine-tuned for object detection tasks.
• Classification and Regression Subnetworks: RetinaNet includes two subnetworks:
– Classification Subnetwork: Predicts the probability of an object being present
at each spatial location for each anchor box and object class
– Regression Subnetwork: Estimates the offsets for the bounding boxes from the
anchor boxes for each ground-truth object
• Efficiency: The single-stage design allows for faster inference times compared to two-stage models, making it suitable for real-time applications.
• Versatility: RetinaNet is effective across various domains, including autonomous
driving, surveillance, and medical imaging, due to its robust detection capabilities
2.2.4 Faster R-CNN
Faster R-CNN (Region-based Convolutional Neural Network) is a state-of-the-art object detection framework that significantly improves the speed and accuracy of object detection tasks. It is a two-stage detector that combines the strengths of region proposal networks (RPN) and Fast R-CNN for efficient and precise object detection.
Key Components of Faster R-CNN:
• Backbone Network: Faster R-CNN typically uses a deep convolutional neural network (CNN) as its backbone for feature extraction. Common choices include ResNet, VGG16, and Inception. The backbone processes the input image to produce a feature map that captures essential visual information.
• Region Proposal Network (RPN): The RPN is a crucial innovation in Faster R-CNN. It generates region proposals (potential bounding boxes) directly from the feature map produced by the backbone. The RPN uses sliding windows over the feature map to predict objectness scores (indicating whether an object is present) and bounding box coordinates for a set of predefined anchor boxes of different scales and aspect ratios.
• RoI Pooling: Once the RPN generates region proposals, these proposals are fed into the RoI (Region of Interest) pooling layer. RoI pooling converts the variable-sized proposals into fixed-size feature maps, which can then be processed by the subsequent layers of the network.
• Object Detection Head: After RoI pooling, the fixed-size feature maps are passed to fully connected layers that perform two tasks:
– Classification: Predicts the class of the object within each proposed region.
– Bounding Box Regression: Refines the coordinates of the bounding boxes to improve localization accuracy.
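The RoI pooling step, which turns variable-sized proposals into fixed-size feature maps, can be sketched as follows. This is a simplified single-channel NumPy version under our own assumptions: the RoI is given in integer feature-map cells as (x1, y1, x2, y2) and must be non-empty, and bins are formed by evenly spaced integer edges. Library implementations differ in how they quantize coordinates.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool the region roi = (x1, y1, x2, y2) to an output_size x output_size map."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    # Bin edges that split the region into a roughly even grid
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    out = np.empty((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            # Each bin is kept at least one cell wide and one cell tall
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 feature map
small = roi_max_pool(fmap, (0, 0, 3, 2))          # 3x2 proposal
large = roi_max_pool(fmap, (1, 1, 6, 6))          # 5x5 proposal
```

Both proposals, although of different sizes, come out as 2×2 maps, which is what lets the fully connected detection head accept them.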
Advantages of Faster R-CNN:
• High Accuracy: Faster R-CNN achieves state-of-the-art performance on various object detection benchmarks due to its effective use of deep learning techniques and the integration of RPN for generating high-quality region proposals.