Thesis: Application of deep learning in detecting pests and diseases on coffee leaves


Declaration of Authenticity

I declare that this research is my own, carried out under the supervision of Assoc. Prof. Le Hong Trang and Mr. Nguyen Quang Duc. The results of this study are credible and have not yet been made public. All materials utilized in this research were gathered by myself from various sources and are properly cited in the reference section.

Furthermore, all of the research results are properly referenced and unrelated to the initial data.

In any event, I stand by my actions and accept responsibility for any plagiarism. Thus, any copyright violations resulting from this research are not the responsibility of University of Technology - Vietnam National University Ho Chi Minh City.

Ho Chi Minh City, Dec 2024

Project Author

Ly Kim Phong


Second, I want to express my appreciation to my advisers, Assoc. Prof. Le Hong Trang and Mr. Nguyen Quang Duc. Without their assistance and support, I would not have been able to finish my paper effectively; they have been incredibly gracious and patient in guiding me through problems.

Third, I also want to express my deepest thanks to Dr. Nguyen Duc Dung, my reviewer. He has pointed out my mistakes and shortcomings and provided direct advice to improve my work.

In closing, I would like to thank all the teachers, the TAs and the Department of Computer Science for their assistance and support in getting me ready for this project; their opinions and assessments have been invaluable. Without their help, I could not have completed this job. They provided guidance for the path of my studies.

Once more, I would like to express my gratitude and admiration to everyone who has helped and inspired me. Thank you to everyone.


… of early detection employing deep learning models, machine learning, computer vision, and hardware monitoring. We provide a one-stage model based on YOLOv10, following reference research. To optimize and further prune the model, future enhancements to this project require more balanced datasets.

1.1 Motivation

1.2 Problem statement

1.3 Scope

1.4 Structure of this project

2 Preliminaries

2.1 Background knowledge

2.1.1 Deep Learning Neural Network

2.1.2 Basic components of a neural network

2.1.3 How does a Deep Learning Neural Network work

2.1.4 CNN

2.2 Relevant Models

2.2.1 YOLO family

2.2.2 SSD

2.2.3 RetinaNet

2.2.4 Faster R-CNN

2.3 Training Losses

2.4 Evaluating metrics

3 Pest detection

3.1 Dataset

3.2 Training results

3.3 Evaluation

4 Mobile app integration

4.1 Framework

4.2 Database framework

4.2.1 Google Firebase

4.2.2 Firestore

4.3 Authentication

4.4 Requirements

4.5 Use-case

4.6 Database structure

4.7 Architecture

4.8 Implementation

5 Conclusion

List of Figures

2.1 The basic building block of Deep Learning models - Perceptron

2.2 Convolution layer [17]

2.3 Hyperbolic tangent function [7]

2.4 Logistic-curve function [15]

2.5 ReLU function [3]

2.6 Deep Neural Network architecture [4]

2.7 Backpropagation algorithm [12]

2.8 CNN history [6]

2.9 CNN [24]

2.10 YOLOv10 architecture Dual Label Assignment

2.11 Example of YOLOv10 architecture

2.12 Example of PR curve

3.1 Original dataset with 2 classes: Miner (purple) and Brown eye spot (yellowish)

3.2 tflite data distribution

3.3 Example of tflite dataset with normal class

3.4 Original and +40% brightness applied

3.5 Original and 10% noise applied

3.6 F1 curve

3.7 Confusion matrix normalized

3.8 PR curve

4.1 Use-case diagram

4.2 Database architecture

4.3 Architecture diagram

4.4 Slider

4.5 Login/Sign up

4.6 Recover account

4.7 Home & Recent activities & Profile pages

4.8 Chatbot page

4.9 Update info & Change password & History pages

List of Tables

3.1 tflite class distribution

3.2 List of augments

3.3 Training losses

3.4 Evaluating results

3.5 Inference time

1.1 Motivation

Our research aims to develop an innovative application capable of identifying pests in coffee plants through image analysis of roots, stems, and leaves. This tool is designed to provide farmers with timely alerts, enabling them to implement preventive measures and mitigate potential losses.

The motivation for this project stems from several key factors. Coffee is a vital agricultural product, cultivated extensively in numerous countries and contributing significantly to global economies. Pests and diseases, if left unchecked, can drastically reduce yields and profitability. In addition, there is an urgent need for efficient and accessible solutions to facilitate early threat detection for farmers.

We are confident that this application will create a transformative impact on the coffee industry by enhancing productivity, protecting farmers' livelihoods, and contributing to a sustainable coffee supply chain that benefits all stakeholders.

1.2 Problem statement

Diseases such as coffee rust, coffee berry disease, and pink disease pose significant threats to coffee plants by infecting leaves and impairing the ripening process, ultimately reducing coffee bean yields. Early detection of these diseases is critical, and this can be achieved through automated systems capable of identifying symptoms at their initial stages.

Currently, most leaf disease detection systems rely on convolutional neural networks (CNNs) and their variants, including R-CNN, Fast R-CNN, Faster R-CNN, SSD, and YOLO, to identify disease-induced damage. While these methods are effective, they depend heavily on accurately labeled datasets and face limitations in adaptability. Specifically, the addition of new disease variants often necessitates retraining the model from the beginning, which is both time-consuming and resource-intensive.

To address these limitations, we first propose a one-stage approach using the YOLOv10x general object detection model to directly identify leaf diseases. This method also calculates the probability of various diseases, such as rust and miner infestations. Our approach enables seamless incorporation of new disease classes by simply adding a classification label, eliminating the need for complete model retraining. We then develop a mobile application that takes images of coffee leaves from farmers, detects the diseases in those images, provides details, and recommends treatments for those diseases.

This report presents the experiments conducted and the findings that inform the development of a robust, scalable model for our application.

1.3 Scope

In this project, my aim is to achieve these goals:

1. Assessing current document matching methods and making the necessary adjustments to determine which ones best meet our needs.

2. Using the latest model to get a more accurate model architecture.

3. Developing a basic mobile application to demonstrate our model.

1.4 Structure of this project

The rest of this paper is organized as follows. In section 2.1, we recall some background on deep neural networks, how they work, and an introduction to CNNs. Section 2.2 is a brief introduction to relevant models. Sections 2.3 and 2.4 cover the training losses and the metrics used to evaluate the model's performance. Chapter 3 describes the dataset used in this project and our training and evaluation results. Chapter 4 discusses what we have done to develop the application. Chapter 5 discusses the results and future improvements that can be made.

Chapter 2

Preliminaries

2.1 Background knowledge

2.1.1 Deep Learning Neural Network

Before going into what we have done in this project, we need some basic knowledge about Deep Learning Neural Networks.

First of all, Machine Learning is a field of study in artificial intelligence that enables algorithms to uncover hidden patterns within datasets, allowing them to make predictions on new, similar data without explicit programming for each task.[14] Deep Learning, conceptualized by Geoffrey Hinton in the 1980s, is a subset of Machine Learning that uses functions to map input into output. These functions form a relationship between the input and the output by extracting essential information from the input data. This is called learning, and the process of learning is training.[5]

Next, Neural Networks, also known as Artificial Neural Networks, were also created by Hinton; they are Deep Learning algorithms structured similarly to the organization of neurons in the brain. Hinton took this approach because the human brain is arguably the most powerful computational engine known today.[20] Like the neurons of our brain, the basic building blocks of neural networks are called neurons or perceptrons (nodes). A network consists of three types of perceptron layers: input, hidden and output. Between the hidden layers and the output layer, the calculations of the interconnected nodes are carried out; these calculations combine learned weights and biases with activation functions.

In the next sections, we introduce the basic components of a neural network in more detail, how a Deep Learning Neural Network works, what CNNs are, and why we use a CNN for our problem.

2.1.2 Basic components of a neural network

First, let us have a look at a perceptron or neuron:

Figure 2.1: The basic building block of Deep Learning models - Perceptron

Inputs: The values passed to a neural network to make predictions; they are the features of a dataset.

Weights: Real values associated with the inputs that express the significance of each feature passed in.

Bias: Its role is to shift the activation function left or right across the plane. More information will be given later.

Sum: A function that adds up the products of the weights and the inputs, plus the bias.

Layers: Layers in a deep learning model form the fundamental components of its architecture. They process data sequentially, where each layer takes input from the previous one and passes the output to the next.[22]

There are several types of layer we want to describe in this project:

• Dense layer: also called a fully connected layer, it uses a linear operation mainly to transform the dimensionality of the input to fit the desired output (e.g., classification probabilities in the final layer), but sometimes it is used to aggregate and process information. Below is its mathematical operation.

For an input vector x:

y = σ(Wx + b)

where σ is the activation function, W is the weight matrix, b is the bias, and y is the output.[19]

• Pooling layer: Used to scale down (downsample) the input.

• Normalization layer: Normalization layers are components in neural networks that help to stabilize and speed up the training process by normalizing the inputs to a layer.[16] They are particularly useful in deep learning architectures, including convolutional neural networks (CNNs) and fully connected networks.

• Convolutional layer: A convolutional layer is a fundamental component of convolutional neural networks (CNNs), primarily used for processing and analyzing visual data. It applies a mathematical operation called convolution to the input data, utilizing filters (or kernels) to extract features such as edges, textures, and patterns. This process helps the network learn hierarchical representations of the input data.[21]

Figure 2.2: Convolution layer [17]
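The dense, convolutional and pooling layers listed above can be illustrated with a minimal pure-Python sketch. The function names (`dense`, `conv2d`, `max_pool2d`) and the toy inputs are assumptions made for illustration, not code from this project; real frameworks implement the same operations on tensors with many channels.

```python
import math

def dense(x, W, b, activation=math.tanh):
    """Fully connected layer: y = activation(W x + b), one output per row of W."""
    return [activation(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def conv2d(image, kernel):
    """'Valid' 2D convolution (cross-correlation, as most DL libraries compute it)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: keeps the largest value in each size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# Hypothetical toy input: a 4x4 image with a vertical edge and a 1x2 edge filter.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_filter = [[-1, 1]]
features = conv2d(image, edge_filter)   # each row becomes [0, 1, 0]
pooled = max_pool2d(features)           # [[1], [1]]

# Dense layer mapping 2 inputs to 3 outputs (weights chosen arbitrarily).
outputs = dense([1.0, 2.0], [[1, 0], [0, 1], [1, 1]], [0, 0, 1])
```

The filter responds only where the pixel value changes from left to right, which is the sense in which a convolutional layer "extracts edges"; pooling then keeps that response while shrinking the feature map.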

Activation function: It is used to add non-linearity to the model. Here are some activation functions that are commonly used:

• Tanh function: The tanh or hyperbolic tangent function, commonly denoted tanh(x), is a mathematical function that is widely used as an activation function in neural networks. It is defined as the ratio of the hyperbolic sine and hyperbolic cosine functions[9]:

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

• Sigmoid / Logistic function: The sigmoid function, also known as the logistic function, is a mathematical function that maps any real-valued number into a value between 0 and 1. It is defined by the following formula[15]:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Figure 2.4: Logistic-curve function [15]

• ReLU function: The Rectified Linear Unit (ReLU) function is a widely used activation function in deep neural networks. It is defined as:

f(x) = max(0, x)

In other words, the ReLU function returns 0 if the input is negative, and the input value itself if it is positive.[26]

Figure 2.5: ReLU function [3]
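The three activation functions above translate directly into code. The following is a minimal sketch using only the Python standard library; it follows the textbook definitions and is not hardened for very large negative inputs, where `math.exp` can overflow.

```python
import math

def tanh(x):
    """Hyperbolic tangent: ratio of sinh to cosh, output in (-1, 1)."""
    return math.sinh(x) / math.cosh(x)

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified Linear Unit: 0 for negative inputs, identity otherwise."""
    return max(0.0, x)

# Evaluate each activation on a few sample points.
for x in (-2.0, 0.0, 2.0):
    print(x, tanh(x), sigmoid(x), relu(x))
```

Tanh and sigmoid saturate for large |x|, while ReLU stays linear for positive inputs, which is one reason it is a common default in deep networks.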

Loss function: The loss function, also known as a cost function, is a mathematical function applied in machine learning to measure the discrepancy between the actual output and the output predicted by the model. The primary idea behind fitting a machine learning model is to minimize the loss function, which pushes the model to make accurate predictions rather than random ones.[2]

In this project, we focus on only two loss functions for object detection: the first is for classification, the other is the IoU loss. There are three types of IoU loss, but we only give the Complete IoU (CIoU) loss here:

$$L_{CIoU} = 1 - IoU + \frac{d^{2}}{C^{2}} + \alpha v$$

where

$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}$$

and

$$\alpha = \frac{v}{(1 - IoU) + v}$$

Here d is the distance between the centers of the predicted and ground-truth boxes, C is the diagonal length of the smallest box enclosing both, and (w, h) and (w^{gt}, h^{gt}) are the widths and heights of the predicted and ground-truth boxes. α is a function of IoU. The above equation states that the aspect ratio factor is less important in the case of no overlap and more important in the case of more overlap.[8]
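To make the CIoU terms concrete, here is a small sketch that evaluates the loss for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `ciou_loss` are assumptions for illustration; training frameworks compute the same quantity on batches of tensors.

```python
import math

def ciou_loss(box, gt):
    """Complete-IoU loss for two boxes given as (x1, y1, x2, y2) corners."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    wgt, hgt = gx2 - gx1, gy2 - gy1

    # IoU: intersection area divided by union area
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    iou = inter / (w * h + wgt * hgt - inter)

    # d^2: squared distance between the two box centres
    d2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # C^2: squared diagonal of the smallest box enclosing both boxes
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2

    # Aspect-ratio consistency term v and its trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan(wgt / hgt) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + d2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes: loss is 0.0
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))   # disjoint boxes: loss exceeds 1
```

Unlike plain IoU loss, the centre-distance term still gives a useful gradient when the boxes do not overlap at all, as the second call shows.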


2.1.3 How does a Deep Learning Neural Network work

Deep Neural Networks consist of several layers of interconnected artificial neurons or nodes that are arranged in a stacked configuration. Each node typically employs a simple mathematical function, often a linear function, to extract and map information. A deep neural network is composed of three types of layers: the input layer, hidden layers, and the output layer.

Data is introduced into the input layer, where each node processes the information and forwards it to the subsequent layers, known as the hidden layers. These hidden layers progressively extract features from the input data and apply transformations using the linear function. They are referred to as hidden layers because the parameters (weights and biases) within each node are not visible; these layers start from randomly initialized parameters to modify the data, resulting in varied outputs.

Figure 2.6: Deep Neural Network architecture [4]

Two key algorithms describe how a neural network works:

• Forward Propagation: Forward propagation (or forward pass) involves the computation and retention of intermediate variables, including outputs, as data moves through a neural network from the input layer to the output layer.

• Backpropagation: The main workhorse algorithm in deep learning. It is mainly …

Figure 2.7: Backpropagation algorithm[12]
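The forward and backward passes can be sketched on the smallest possible network: one tanh hidden unit followed by a linear output, trained on a single example with a squared-error loss. The network shape, parameter values, learning rate and function names below are illustrative assumptions, not the configuration used in this project.

```python
import math

def forward(params, x):
    """Forward pass; returns the intermediates that the backward pass reuses."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)   # hidden activation
    y = w2 * h + b2              # network output
    return h, y

def loss_fn(params, x, t):
    """Squared-error loss between the output and the target t."""
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2

def backward(params, x, t):
    """Gradients of the loss w.r.t. (w1, b1, w2, b2), obtained with the chain rule."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = y - t                   # dL/dy
    dw2, db2 = dy * h, dy
    dh = dy * w2                 # dL/dh
    dz = dh * (1 - h ** 2)       # tanh'(z) = 1 - tanh(z)^2
    dw1, db1 = dz * x, dz
    return [dw1, db1, dw2, db2]

def train(params, x, t, lr=0.1, steps=200):
    """Plain gradient descent: repeat forward pass, backward pass, parameter update."""
    for _ in range(steps):
        grads = backward(params, x, t)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

initial = [0.5, 0.1, -0.3, 0.2]
trained = train(initial, x=1.0, t=0.8)
print(loss_fn(initial, 1.0, 0.8), loss_fn(trained, 1.0, 0.8))
```

Each training step reuses the hidden activation `h` stored by the forward pass, which is why forward propagation retains its intermediate variables.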

The processes of forward propagation and backpropagation allow for efficient computation of the gradient through the network using the chain rule of calculus. This matters because minimizing the errors of a neural network to reach high accuracy on arbitrary tasks requires many forward and backward passes. With each iteration, the algorithm gets closer to achieving higher accuracy.

2.1.4 CNN

Convolutional neural networks (CNNs) have a long and rich history, dating back to the 1980s. It was then that Yann LeCun designed and trained the first real-world convolutional neural network architecture for digit recognition in handwritten documents. It laid the groundwork for future models because it was seen as inspired by mechanisms of visual processing in the human brain. Fast forward to the 1990s and 2000s, during which CNNs became the standard for most image classification tasks due to advancements in training methods as well as in CNN architectures. It was not until the 2012 ImageNet competition that CNNs truly began to show their potential with the introduction of AlexNet, which took first place in that competition and fully showcased deep learning capabilities. Today, companies use CNNs extensively to support computer vision applications, including facial recognition and medical imaging, among others.[6]

Figure 2.8: CNN history[6]

CNN is a powerful neural network for the following reasons[24]:

• Effective Feature Extraction: CNNs have the ability to automatically learn hierarchical features from raw input data, enabling them to identify spatial relationships within images. This capability minimizes the necessity for manual feature engineering, enhancing their effectiveness for tasks related to image processing.

• Robustness to Variations: Designed to be invariant to translations, rotations, and other transformations, CNNs exhibit resilience to variations in input data. This characteristic is especially advantageous in practical applications where images may vary in scale, orientation, or lighting conditions.

• Scalability: CNNs can be scaled up with deeper architectures, allowing them to learn more complex patterns and representations. This scalability has led to significant …

• Success in Competitions: CNNs have consistently outperformed traditional methods in various benchmark competitions, such as ImageNet, which has bolstered their credibility and encouraged widespread adoption in both academia and industry.

• Versatility: While originally designed for image processing, CNNs have been successfully applied to a wide range of tasks beyond vision, including natural language processing, audio analysis, and even time-series forecasting.

• Strong Community and Research Support: The deep learning community has produced a wealth of resources, frameworks (like TensorFlow and PyTorch), and pre-trained models, making it easier for researchers and developers to implement CNNs in their projects.

2.2 Relevant Models

2.2.1 YOLO family

YOLO was introduced by Joseph Redmon and his colleagues in 2016 [23]. The original model combined bounding box prediction and class label identification into a single neural network, revolutionizing object detection.

The YOLO family includes several versions: YOLOv1, YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10 and the latest, YOLOv11. Each version has built upon the previous one, introducing improvements in architecture, speed, and accuracy.

YOLOv10 was released in May 2024 by researchers from Tsinghua University. It represents recent advancements in the YOLO series, focusing on enhancing performance while reducing latency and computational requirements.

YOLOv10's architecture is designed to optimize both speed and accuracy in real-time object detection. It consists of three main components:

• Backbone: The backbone of YOLOv10 utilizes an enhanced version of CSPNet (Cross Stage Partial Network), which is effective for feature extraction. This backbone is optimized to capture rich, multi-scale features from input images while minimizing computational overhead.

• Neck: The neck of YOLOv10 is responsible for fusing features from different scales. It employs a Feature Pyramid Network (FPN) that enhances the flow of information between various feature levels, allowing the model to effectively handle objects of varying sizes.

• Head: The head of YOLOv10 is where the final predictions are made. It incorporates a dual label assignment strategy, combining one-to-one and one-to-many matching techniques. This approach gives the model rich supervision from multiple predictions per object during training, improving detection accuracy while eliminating the need for non-maximum suppression (NMS) during inference:

– One-to-Many: generates multiple predictions per object during training.

– One-to-One: generates a single best prediction per object during inference.

Figure 2.10: YOLOv10 architecture Dual Label Assignment

Figure 2.11: Example of YOLOv10 architecture
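The difference between the one-to-many and one-to-one branches of the dual label assignment can be sketched as "keep the top k candidates" versus "keep only the top candidate" for each ground-truth object. The scoring function below, a product of class probability and IoU raised to the exponents `alpha` and `beta`, is a simplified stand-in chosen for illustration; the exponent values, the `(class_probability, iou)` tuple format and all names are assumptions, not the exact YOLOv10 implementation.

```python
def match_score(cls_prob, iou, alpha=0.5, beta=6.0):
    """Simplified matching score: rewards confident and well-localised predictions."""
    return (cls_prob ** alpha) * (iou ** beta)

def assign(predictions, k):
    """Indices of the k best-scoring predictions for one ground-truth object."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: match_score(*predictions[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical candidates for a single leaf lesion: (class_probability, iou).
candidates = [(0.9, 0.5), (0.6, 0.9), (0.8, 0.8), (0.3, 0.4)]

one_to_many = assign(candidates, k=3)   # several positives, used as training signal
one_to_one = assign(candidates, k=1)    # a single positive, so no NMS is needed
print(one_to_many, one_to_one)          # [1, 2, 0] [1]
```

Because both branches rank candidates with the same score, the single prediction kept by the one-to-one branch is also the best candidate of the one-to-many branch, which is what keeps the two heads consistent.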

2.2.2 SSD

The Single Shot Detector (SSD) is a real-time object detection framework that utilizes a single convolutional neural network (CNN) to predict bounding boxes and class labels for multiple objects in an image. It divides the image into a grid, with each cell responsible for detecting objects within its region, and employs default boxes (also known as anchor boxes) of various aspect ratios and scales to improve detection accuracy.

Key Components of SSD:

• Backbone Network: SSD typically uses a pre-trained CNN, such as VGG16, as its backbone for feature extraction. The backbone processes the input image to generate feature maps that capture essential visual information.

• Multi-Scale Feature Maps: SSD generates feature maps at different resolutions, allowing it to detect objects of varying sizes effectively. Higher resolution maps are used for smaller objects, while lower resolution maps handle larger objects.

• Default Boxes: At each grid cell, SSD employs multiple default boxes with different aspect ratios and scales. Each default box is associated with predictions for bounding box offsets and class scores.

• Localization and Confidence Predictions: The model predicts the location of objects through bounding box offsets and class probabilities for each default box. The predictions are refined during training to minimize the localization loss and confidence loss.
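As an illustration of the default boxes described above, the sketch below places one box per aspect ratio at the centre of every cell of a square feature map. The feature-map size, scale and aspect ratios are example values chosen here, and `default_boxes` is a hypothetical helper; SSD repeats this for several feature maps with different scales.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default (anchor) boxes as (cx, cy, w, h), all normalised to [0, 1]."""
    boxes = []
    for i in range(fmap_size):          # grid rows
        for j in range(fmap_size):      # grid columns
            cx = (j + 0.5) / fmap_size  # box centre sits at the cell centre
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Width and height keep the area at scale^2 for every ratio
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

boxes = default_boxes(fmap_size=4, scale=0.2)
print(len(boxes))   # 4 * 4 cells * 3 aspect ratios = 48
print(boxes[0])     # (0.125, 0.125, 0.2, 0.2)
```

The network then predicts, for every one of these boxes, a set of class scores and four offsets that move and resize the box toward a nearby object.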

Advantages of SSD:

• Real-Time Detection: SSD is designed for speed, allowing it to perform object detection in real-time, making it suitable for applications like autonomous driving and video surveillance.

• High Accuracy: By leveraging multi-scale feature maps and default boxes, SSD achieves high accuracy in detecting objects across various sizes and aspect ratios.

• Simplicity: The single-shot approach simplifies the detection pipeline, reducing the computational burden compared to two-stage detectors like Faster R-CNN.

2.2.3 RetinaNet

RetinaNet is a one-stage object detection model that excels in detecting objects of varying sizes, particularly small and dense ones. It integrates a Feature Pyramid Network (FPN) for multi-scale feature extraction and employs Focal Loss to address class imbalance during training, enhancing detection accuracy. This model is efficient and effective, making it a popular choice for various applications, including aerial and satellite imagery.

Architecture Overview:

• Backbone Network: The backbone, often a ResNet or ResNeXt, extracts feature maps from the input image. This network is pre-trained on large datasets and fine-tuned for object detection tasks.

• Classification and Regression Subnetworks: RetinaNet includes two subnetworks:

– Classification Subnetwork: Predicts the probability of an object being present at each spatial location for each anchor box and object class.

– Regression Subnetwork: Estimates the offsets of the bounding boxes from the anchor boxes for each ground-truth object.

Advantages of RetinaNet:

• Efficiency: The single-stage design allows for faster inference times compared to two-stage models, making it suitable for real-time applications.

• Versatility: RetinaNet is effective across various domains, including autonomous driving, surveillance, and medical imaging, due to its robust detection capabilities.
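The Focal Loss that RetinaNet uses against class imbalance is not written out in this section, so the following sketch shows its standard binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The defaults α = 0.25 and γ = 2 are the values commonly quoted for RetinaNet and are assumptions here, not settings taken from this project.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a predicted probability p and a label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p               # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    # (1 - p_t)^gamma shrinks the loss of examples that are already well classified
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # confident, correct prediction
hard = focal_loss(0.1, 1)   # confident, wrong prediction
print(easy, hard)
```

With γ = 0 and α = 0.5 the expression reduces to half of the ordinary cross-entropy; increasing γ makes the many easy background anchors contribute far less than the few hard examples.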

2.2.4 Faster R-CNN

Faster R-CNN (Region-based Convolutional Neural Network) is a state-of-the-art object detection framework that significantly improves the speed and accuracy of object detection tasks. It is a two-stage detector that combines the strengths of region proposal networks (RPN) and Fast R-CNN for efficient and precise object detection.

Key Components of Faster R-CNN:

• Backbone Network: Faster R-CNN typically uses a deep convolutional neural network (CNN) as its backbone for feature extraction. Common choices include ResNet, VGG16, and Inception. The backbone processes the input image to produce a feature map that captures essential visual information.

• Region Proposal Network (RPN): The RPN is a crucial innovation in Faster R-CNN. It generates region proposals (potential bounding boxes) directly from the feature map produced by the backbone. The RPN uses sliding windows over the feature map to predict objectness scores (indicating whether an object is present) and bounding box coordinates for a set of predefined anchor boxes of different scales and aspect ratios.

• RoI Pooling: Once the RPN generates region proposals, these proposals are fed into the RoI (Region of Interest) pooling layer. RoI pooling converts the variable-sized proposals into fixed-size feature maps, which can then be processed by the subsequent layers of the network.

• Object Detection Head: After RoI pooling, the fixed-size feature maps are passed to fully connected layers that perform two tasks:

– Classification: Predicts the class of the object within each proposed region.

– Bounding Box Regression: Refines the coordinates of the bounding boxes to improve localization accuracy.

Advantages of Faster R-CNN:

• High Accuracy: Faster R-CNN achieves state-of-the-art performance on various object detection benchmarks due to its effective use of deep learning techniques and the integration of RPN for generating high-quality region proposals.
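The RoI pooling step of Faster R-CNN can be sketched as follows: a region of any size is divided into a fixed grid of bins, and the maximum of each bin is kept. The end-exclusive `(row0, col0, row1, col1)` region format on integer feature-map cells and the name `roi_max_pool` are simplifying assumptions for illustration.

```python
def roi_max_pool(fmap, roi, out_size=2):
    """Max-pool the region roi = (r0, c0, r1, c1) of a 2D feature map
    into a fixed out_size x out_size grid (r1 and c1 are exclusive)."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_size):
        # Row bin: floor for the start, ceil for the end, at least one cell tall
        rs = r0 + (i * h) // out_size
        re = max(rs + 1, r0 + ((i + 1) * h + out_size - 1) // out_size)
        row = []
        for j in range(out_size):
            cs = c0 + (j * w) // out_size
            ce = max(cs + 1, c0 + ((j + 1) * w + out_size - 1) // out_size)
            row.append(max(fmap[r][c] for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled

# Hypothetical 4x6 feature map filled with the values 1..24.
fmap = [[6 * r + c + 1 for c in range(6)] for r in range(4)]

print(roi_max_pool(fmap, (0, 0, 4, 6)))   # whole map  -> [[9, 12], [21, 24]]
print(roi_max_pool(fmap, (1, 1, 4, 4)))   # 3x3 region -> [[15, 16], [21, 22]]
```

Both regions produce a 2x2 output even though their sizes differ, which is what allows the fully connected detection head to process every proposal with the same weights.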
