VIETNAM NATIONAL UNIVERSITY, HANOI
ID: 23070414
Class: AIT2023B
TEAM LEADER INFORMATION
- Program: Applied Information Technology
- Address: Tay Mo, Nam Tu Liem, Ha Noi
- Phone no./Email: 0963294472 / biayciii@gmail.com
II. Academic Results (from the first year to now)
Academic year | Overall score | Academic rating
III. Other achievements:
Advisor Hanoi, April 15, 2024
We would like to express our most sincere gratitude to PhD Kim Dinh Thai and PhD Ha Manh Hung, who guided us on the right track with our research assignment. Thanks to their thorough support and care, we have completed this scientific research. PhD Kim Dinh Thai and PhD Ha Manh Hung have supported us step by step, from the ideation to the completion phase of the research. They not only inspired us to generate creative ideas for the research but also motivated us when we were trying to overcome difficulties in the process.
Without their support, we would not have been able to complete this research. Once more, we sincerely thank them for their huge contribution and look forward to collaborating with them in future projects.
Vu Xuan Bac
Table of Contents
List of figures
List of tables
INTRODUCTION & ABSTRACT
Student's Information
I. INTRODUCTION
1.1 Concerning rationale of the study
1.2 Research Questions
1.3 Object and Scope of the Study
1.4 Research Methods
1.5 Structure
II. LITERATURE REVIEW
III. METHODOLOGY
3.1 Basic theory about Image Object Detection
3.1.1 Basic theory about Convolutional Neural Networks (CNNs)
3.1.2 Object detection problem
3.1.3 Image object detection for medical images
3.2 Models for Object Detection
3.2.1 YOLO
3.2.2 Faster R-CNN
3.2.3 SSD (Single Shot Multibox Detector)
3.3 Image Object Detection for Pill
3.3.1 Problem Statement
3.3.2
3.3.3 Dataset and Features
3.3.4 Evaluation Metrics
3.3.5 Implementing Method
Lists of figures
Figure 1: Basic CNN architecture
Figure 2: RPN architecture
Figure 3: SSD Model
Figure 4: YOLOv8 architecture
Figure 5: Example images of drug samples in Dataset
Figure 6: Histogram of Object Count by Image
Figure 7: The Model of Varifocal Loss
Figure 8: An illustration of IACS method
Figure 9: Graph of Focal Loss
Figure 10: Image shows the final result of our Google Colab model
Figure 11: The benchmark of our dataset
Lists of tables
Table 1: YOLOv8 accuracy
Table 2: The dice score of the Google Colab
Full Name | Class | ID
Luong Dinh Tung | AIT2023B | 23070385
Nguyen Nhat Minh | AIT2023B | 23070461
Nguyen Duc Manh | AIT2023A | 23070205
3. Advisor(s):
Thai Kim Dinh is a lecturer in the International School of Vietnam National University.
Hung Manh Ha is a lecturer in the International School of Vietnam National University.
4. Abstract (300 words or less):
These days, technology is vital. Computer vision and artificial intelligence (AI) have made major contributions to numerous disciplines, particularly the medical sector. The use of AI in this sector has several advantages and enormous promise for both customers and industry. Developing a system to inspect, select, and assess the quality of pharmaceutical and medical supplies is one of those advantages.
In this research, we label classes on Roboflow and use the YOLOv8 deep neural network, one of the most advanced models in the field of object recognition, which recognizes and classifies objects in images at high speed and with high accuracy and precision. We have focused on creating and developing models to detect, classify, and manage the quantity and quality of items. Our goal is to create a system that uses the information captured by the camera to identify, select, classify, and evaluate drugs as accurately and effectively as possible. By developing and testing this system, we hope to contribute to improving the quality and safety of medical products and to help optimize the drug testing and evaluation process in the pharmaceutical industry.
"Building a Computer Vision Assisted Pill Inspection System Using YOLOv8 Model" presents a deep learning-based model for object recognition, drug evaluation, and classification. This is highly helpful in giving pharmaceutical firms an automated and effective way to monitor the quality of their products. The purpose of this work is to train the YOLOv8 model to correctly identify pills in photos using the preprocessed dataset. This approach requires a significant amount of data and a correspondingly long training time. After training, the experimental results demonstrate that the YOLOv8 models outperform earlier research findings on multiple classes, with an average mAP50 larger than 97.6% across all classes.
5. Keywords (3-5 words)
YOLO, Computer Vision, Artificial Intelligence
I. INTRODUCTION
1.1 Concerning rationale of the study
These days, technology is vital. Computer vision and artificial intelligence (AI) have made major contributions to numerous disciplines, particularly the medical sector. The use of AI in this sector has several advantages and enormous promise for both customers and industry. Developing a system to inspect, select, and assess the quality of pharmaceutical and medical supplies is one of those advantages.
Medicine has been a significant part of illness prevention and treatment since ancient times. Even though the form of traditional Chinese medicine has changed from bags to compressed pills, consumers still require assurances of quality and safety before using them. However, drug testing and evaluation demand human precision and scrupulousness. Manufacturers lose a lot of money with this manual approach because it is time-consuming and error-prone. Therefore, there is a lot of promise in medicine and pharmacy for the development of drug testing, selection, and evaluation systems that use artificial intelligence (AI) and computer vision.
In this research, we label classes on Roboflow and use the YOLOv8 deep neural network, one of the most advanced models in the field of object recognition, which recognizes and classifies objects in images at high speed and with high accuracy and precision. We have focused on creating and developing models to detect, classify, and manage the quantity and quality of items. Our goal is to create a system that uses the information captured by the camera to identify, select, classify, and evaluate drugs as accurately and effectively as possible. By developing and testing this system, we hope to contribute to improving the quality and safety of medical products and to help optimize the drug testing and evaluation process in the pharmaceutical industry. We used numerous pictures and documents with information about various medications during our work and investigation. After collecting the data on Roboflow, we fed it to the YOLOv8 model, enabling the model to screen and recognize samples. We were not let down: training the model produced a large amount of data that allowed us to assess it.
1.2 Research questions
In order to complete the development of a diagnostic model for pill inspection, we proposed two main research questions as follows:
(1) What are the most effective data sources and types of images that should be used to train the pill inspection model to ensure reliable and robust performance in real-world applications?
(2) Which models detect pills most accurately and reliably?
1.3 Object and Scope of the Study
- The object of the study is pictures of various medications.
1.5 Structure
Besides introducing this research paper, we explain the theory of the algorithms used in the paper, then apply it to the problem and compare the results obtained.
The main body of the paper: Part II Methodology; Part II: Experiments & Results
II. LITERATURE REVIEW
Drug prescription and inventory management are very important tasks for safe drug dispensing, and promptness and accuracy are also essential. Approximately 1000 types of pills are handled in large hospitals. The pills used by patients change depending on the patient's degree of improvement. In many existing hospitals and pharmacies, the pharmacist manually sorts and packs the pills according to the prescription, which is a time-consuming process. In addition, simple repetitive tasks can cause fatigue, leading to mistakes during pill sorting; such situations can lead to medical accidents.
In recent years, automated equipment, such as automated medication dispensing machines [1,2,3], has rapidly spread in pharmacies and hospitals where multiple dispensing tasks need to be performed, such as sorting and packaging pills. An automated medication dispensing machine is a device that sorts and packs drugs based on a prescription that is input from a computerized program. However, the automatic dispensing machine also requires a function to inspect the prepared product because there is a risk of erroneous formulation. A vision inspection method using a digital camera is widely used. The vision inspection method uses two forms of analysis. First, there is a rule-based analysis method that compares and analyzes product characteristics [4]. The second is a template-based method that compares similarity with a reference image [5,6,7]. Recently, deep learning-based object detection algorithms have been developed and investigated [8,9,10,11].
The template matching method finds the region with the highest similarity to a reference image in an input image. The methods used for comparing the input image with the reference image are divided into two categories: pixel-based and shape-based matching methods. The pixel-based matching method calculates the difference between the pixels of the reference image and the input image. Its representative methods include the sum of squared differences and normalized cross-correlation [12,13]. The pixel-based matching method is robust against distortions such as blurring caused by camera shake during capture. However, the pixel-based method is not effective for changes in the size and rotation of the inspection object because it calculates the difference between the pixels.
The shape-based matching method extracts a region of interest (ROI) of a test subject from a reference image and compares it with an input image. The shape-based matching method does not use all the pixels of the reference image. Rather, it uses only the representative features and is effective when the size and rotation of an object change. The shape-based matching method is superior to the pixel-based matching method in relation to lighting changes. Its representative methods include scale-invariant feature transform and shape-based matching included in the MVTec HALCON library [14,15,16].
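The two pixel-based similarity measures named above can be illustrated with a minimal sketch. It is not taken from the cited works: the function names, the toy image, and the choice of the zero-mean variant of normalized cross-correlation are assumptions made for illustration, and a practical system would more likely rely on an optimized image-processing library than on pure Python lists.

```python
import math


def ssd(patch, template):
    """Sum of squared differences: 0 for identical patches, larger = less similar."""
    return sum((p - t) ** 2
               for prow, trow in zip(patch, template)
               for p, t in zip(prow, trow))


def ncc(patch, template):
    """Zero-mean normalized cross-correlation in [-1, 1]; 1 is a perfect match."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    p_mean, t_mean = sum(p) / len(p), sum(t) / len(t)
    p = [v - p_mean for v in p]
    t = [v - t_mean for v in t]
    denom = math.sqrt(sum(v * v for v in p) * sum(v * v for v in t))
    if denom == 0:  # flat patch or template: correlation is undefined
        return 0.0
    return sum(a * b for a, b in zip(p, t)) / denom


def best_match(image, template, measure="ncc"):
    """Slide the template over the image; return ((row, col), score) of the best spot."""
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, None
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            # Negate SSD so that "higher is better" holds for both measures.
            score = ncc(patch, template) if measure == "ncc" else -ssd(patch, template)
            if best_score is None or score > best_score:
                best_pos, best_score = (y, x), score
    return best_pos, best_score


# Toy grayscale image (hypothetical values) containing the template at row 1, column 2.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 9, 1, 0, 0],
    [0, 0, 2, 8, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
template = [[9, 1], [2, 8]]
print(best_match(image, template, "ssd"))  # best position is (1, 2)
print(best_match(image, template, "ncc"))  # best position is (1, 2)
```

Because both measures compare pixels position by position, a rotated or rescaled pill no longer lines up with the template, which is the weakness of pixel-based matching noted above. The zero-mean NCC variant still scores a uniformly brightened copy of the template as a perfect match, while SSD does not.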
Deep learning technology is an artificial neural network technology that can learn and make judgments on its own based on data. This technology shows excellent performance in the field of object detection. Object detection refers to a method that specifies not only the presence of an object in an image but also the type and location of the object. Representative deep learning methods include you only look once (YOLO) and region-based convolutional neural network (R-CNN) [17,18]. To improve the detection performance in deep learning, scientists use various methods, such as data processing, loss function improvement, convolution layer control, and activation
neural network (CNN), a complex structure involving color, shape, and difference images is used. However, the deep learning method requires a large amount of data for training despite its excellent detection performance. Substantial effort is required to create new data that are not public. Therefore, it is important to investigate how to effectively extend a small amount of data.
III. METHODOLOGY
3.1 Basic theory about Image Object Detection
An essential component of computer vision is image object detection. A closely related task, image segmentation, breaks an image into its component pieces in order to make the image representation easier to understand or more straightforward. The objective of image segmentation and object detection is the same: to identify image regions that contain objects and give them the proper labels.
3.1.1 Basic theory about Convolutional Neural Networks (CNNs)
Convolutional neural networks (CNNs) are a kind of deep neural network that is frequently used in image processing and computer vision. Their purpose is to efficiently identify and extract features from image data. They have attained great efficiency in image analysis and classification. These days, they are widely used in practically every industry, particularly in cutting-edge applications like medical imaging and self-driving cars.
Convolutional neural networks are most commonly applied to visual image analysis. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, and brain-computer interfaces. To perform the above tasks, a CNN includes several layers with specific functions, including convolutional layers, pooling layers, fully connected layers, and dropout layers.
The layers of the CNN architecture for image classification are described below:
• Convolutional Layers: The most crucial component of the CNN, this layer is in charge of processing the image, recognizing features, and performing most of the computation. A CNN takes an input image, processes it, classifies it based on predetermined categories, and then outputs the results. The computer views the input image as an array of pixels whose size depends on the image resolution: height, width, and depth (number of channels).
• ReLU Layer: ReLU is the activation function of the CNN, which simulates the rate at which neurons transmit impulses through the axon. There are many activation functions, but ReLU is the most commonly used. It is preferred for neural network training because it supports faster computation. ReLU layers are used after the feature map is calculated, and ReLU is applied to the feature map values.
• Pooling Layer: A component of the CNN architecture whose primary job is to reduce the size of the feature map produced by the convolutional layers, thereby lowering computation and improving model generalization. Pooling comes in various forms, including:
+ Max Pooling: takes the largest element of each feature map region
+ Average Pooling: takes the average value of each feature map region
+ Sum Pooling: takes the total of the values in each feature map region
• Fully connected layer: In a CNN, fully connected layers are used to classify input images or perform other tasks such as segmentation. Each neuron in this layer is connected to all neurons in the previous and subsequent layers, forming a fully connected network. In addition, when this layer receives image features, it converts them into scores for each class in order to find the class that best matches the image.
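The convolution, ReLU, and pooling operations listed above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the 5x5 input, the 2x2 filter values, and the function names are hypothetical, the convolution uses no padding and a stride of 1, and real CNN frameworks implement these layers far more efficiently.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D convolution as used in CNNs."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def relu(feature_map):
    """Replace every negative value with 0."""
    return [[max(0, v) for v in row] for row in feature_map]


def pool2d(feature_map, size, reduce):
    """Non-overlapping pooling: apply `reduce` to each size x size region."""
    pooled = []
    for y in range(0, len(feature_map) - size + 1, size):
        row = []
        for x in range(0, len(feature_map[0]) - size + 1, size):
            region = [feature_map[y + i][x + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(region))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical single-channel 5x5 input and 2x2 filter.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 3, 2, 1],
    [2, 0, 1, 0, 2],
    [1, 3, 2, 1, 0],
    [0, 1, 0, 2, 1],
]
kernel = [[1, -1],
          [0, 1]]

features = relu(conv2d(image, kernel))   # 4x4 feature map after ReLU
max_pooled = pool2d(features, 2, max)    # [[5, 3], [5, 3]]
avg_pooled = pool2d(features, 2, mean)   # [[1.25, 1.25], [1.75, 1.75]]
sum_pooled = pool2d(features, 2, sum)    # [[5, 5], [7, 7]]
```

Each pooling variant shrinks the 4x4 feature map to 2x2, which is the size reduction described for the pooling layer; only the value kept per region differs.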
Here is a diagram of the basic CNN architecture built from these layers:
The structure shown above consists of different layers, including an input layer, two convolution layers, two max pooling layers, and an activation layer. After passing through the activation functions, these layers have weights in their nodes and pass increasingly abstract information to the next layers in the network. In particular, the pooling layer provides invariance to translation, scaling, and rotation. Local connectivity yields representations of the data from low to high levels of abstraction through convolution with the filters. The layers of a CNN are linked together through the convolution mechanism: the next layer is the result of the convolution computed on the previous layer, which gives local connections. Thus, each neuron in the next layer is generated from the result of a filter applied to a local image region of the previous layer. Each layer uses different filters, usually hundreds to thousands of such filters, and combines their results. In addition, there are other layers, such as pooling/subsampling layers, used to distill more useful information. The last layer is used to classify the image.
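To make the stacking of two convolution layers and two max pooling layers concrete, the sketch below traces how the spatial size of the feature map shrinks from layer to layer using the standard output-size formula, floor((size + 2 * padding - kernel) / stride) + 1. The 28x28 input, the 5x5 convolution filters, and the 2x2 pooling windows are assumed values for illustration; the text does not state the actual sizes used.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (layer name, kernel size, stride); all values are hypothetical.
layers = [
    ("conv1", 5, 1),
    ("pool1", 2, 2),
    ("conv2", 5, 1),
    ("pool2", 2, 2),
]


def trace_sizes(input_size, layers):
    """Return the feature map side length after each layer."""
    sizes = [("input", input_size)]
    size = input_size
    for name, kernel, stride in layers:
        size = output_size(size, kernel, stride)
        sizes.append((name, size))
    return sizes


for name, size in trace_sizes(28, layers):
    print(f"{name}: {size} x {size}")
# input: 28 x 28, conv1: 24 x 24, pool1: 12 x 12, conv2: 8 x 8, pool2: 4 x 4
```

The shrinking spatial size is what lets the final classification layer operate on a compact, more abstract representation of the image.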
3.1.2 Object detection problem
Object detection is an important area of computer vision. Various machine learning (ML) and deep learning (DL) models are used to enhance performance in object detection and related tasks. Previously, two-stage object detectors were quite popular and effective. With recent developments in single-stage object detection and the underlying algorithms, single-stage detectors have become significantly better than most two-stage object detectors. Furthermore, with the advent of YOLO, various applications have used YOLO for object detection and recognition in a variety of contexts, where it performed very well compared with the respective two-stage detectors.
Image classification is the task of classifying an image or an object within an image into one of a set of predefined categories. This problem is often solved with the help of supervised machine learning or deep learning algorithms, where the model is trained on a large labeled dataset. Some commonly used machine learning models for this task include ANN, SVM, decision trees, and KNN [24]. However, many problems arise when using some of those machine learning models. Therefore, we used CNNs; CNNs and their architectural successors and variants dominate other deep models for classifying images and related tasks. Apart from using CNNs, we also use object localization to determine the location of one or multiple objects in an image/frame with the help of a rectangular box around each object, often called a bounding box. In this article, we will focus on CNNs and related models.
CNNs used for semantic segmentation use a fully convolutional network (FCN) architecture that differs from traditional CNNs by replacing fully connected layers with convolutional layers. This allows the network to process input images of any size and produce a corresponding output map of the same size. In the FCN architecture, an encoder structure with skip (bypass) connections is often used. The encoder gradually reduces the spatial resolution of the feature map while increasing the number of channels through a series of convolutional and pooling layers.
Object localization refers to the task of accurately identifying and locating objects of interest in an image. It plays an important role in computer vision applications for object detection, tracking, and segmentation. Object localization involves training a convolutional neural network to predict the coordinates of bounding boxes that tightly enclose objects in an image. The process takes place in two steps, with the CNN extracting image features and the regression head predicting the bounding box coordinates.
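How tightly a predicted bounding box encloses an object is commonly measured with intersection over union (IoU). The sketch below is an illustration rather than the evaluation code of this study; the box formats and function names are assumptions. For context, the mAP50 metric counts a predicted box as correct when its IoU with a ground-truth box is at least 0.5.

```python
def xywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth and predicted boxes, in pixels.
ground_truth = (0, 0, 10, 10)
prediction = (5, 5, 15, 15)
print(iou(ground_truth, prediction))  # 25 / 175, roughly 0.143
```

An IoU of 1.0 means the predicted box coincides with the ground truth, and 0.0 means the boxes do not overlap at all.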
By using those machine learning models, we can more easily analyze images and extract features to produce good results for our research.
3.1.3 Image object detection for medical images
Medical image detection refers to the use of artificial intelligence (AI), one of the most promising tools for advancing the field of medical imaging and healthcare. In addition, modern 3D imaging tools, cloud computing, and imaging solutions such as MRI and CTA are also used. The purpose of detecting and classifying this data is to identify objects that may be unsuitable during the drug manufacturing process. This job can take a lot of time, effort, and money, but the trade-off is the outstanding advantages mentioned above: AI and other software have been applied to research and develop systems that serve many tasks requiring meticulousness, precision, and budget savings in the health industry in general and the pharmaceutical industry in particular.
Why is it beneficial? Because it allows objects to be analyzed and identified without the manual steps previously required. This is very beneficial in large production processes. It also helps manufacturers determine product quality and the quantity of damaged items.
In our study, this process includes data collection and class labeling on Roboflow. Once the data was labeled, we exported it and trained the model with Ultralytics. The model's ability to identify foreign objects in the images is very good, giving positive results that can serve the medical and pharmaceutical industry.
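A training run of this kind can be sketched with the Ultralytics Python package. The sketch is not the configuration used in this study: the dataset directory, the class names, the folder layout (train/images and valid/images), the yolov8n.pt starting weights, and the epoch count are all assumptions chosen for illustration.

```python
def build_dataset_yaml(dataset_dir, class_names):
    """Build the text of an Ultralytics dataset YAML file.

    Assumes the exported dataset has train/images and valid/images folders.
    """
    lines = [
        f"path: {dataset_dir}",
        "train: train/images",
        "val: valid/images",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


def train_pill_detector(data_yaml, epochs=50, image_size=640):
    """Fine-tune a pretrained YOLOv8 model and return mAP50 on the validation split."""
    # Imported here so the YAML helper above works without Ultralytics installed.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")  # assumed starting weights (YOLOv8 nano)
    model.train(data=data_yaml, epochs=epochs, imgsz=image_size)
    metrics = model.val()       # evaluate on the validation images
    return metrics.box.map50    # mean average precision at IoU 0.5


# Hypothetical dataset location and class names.
yaml_text = build_dataset_yaml("datasets/pills", ["capsule", "tablet"])
print(yaml_text)
```

To start training, the YAML text would be saved to a file such as data.yaml and its path passed to train_pill_detector; this step needs the ultralytics package, the exported images and labels, and ideally a GPU.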
3.2 Models for Object Detection
In this section, we examine the YOLOv8, Faster R-CNN, and SSD (Single Shot Multibox Detector) models. Each of these models can be applied to our dataset.
3.2.1 YOLO
In scientific study, "YOLO," or "You Only Look Once," is an object identification system that uses deep learning to achieve remarkable accuracy and
presented this approach in 2016. Furthermore, from the first publication of the model until today, YOLO has undergone several revisions, advances, enhancements, and optimizations while maintaining the model's accuracy. The original iteration of YOLO, known as YOLOv1 (2015) [25], was less accurate than later iterations. It relied on a single neural network to predict the bounding boxes and class probabilities directly from the whole image. More object types could be detected with YOLOv2 (2016), and batch normalization and anchor boxes increased accuracy. The CNN architecture used by YOLOv3 (2018) is called Darknet-53, and anchor boxes are resized to fit the object's dimensions. YOLOv4 (2020) [26] employs anchor boxes with k-means clustering and a novel CNN architecture known as CSPNet. GHM loss is a novel loss function that enhances FPN's design over YOLOv3.
To enhance object identification performance across various object types, YOLOv5 (2020) employs spatial pyramid pooling (SPP), CIoU loss, and dynamic anchor boxes. By utilizing improved training techniques and fresh knowledge, YOLOv6 (2022) with EfficientNet-L2 significantly improved object detection. In order to maximize parameter utilization, YOLOv7 focused on model scaling and re-parameterized convolution modules. The YOLOv8 (2023) model is the most recent model in the YOLO family, having been pre-trained on extensive datasets like COCO and ImageNet. Moreover, YOLO models are smaller, more accurate, and have a faster training rate. With the addition of the C3 convolution layer, anchor-free detection, and mosaic augmentation, YOLOv8 has improved upon several aspects of its predecessor.
3.2.2 Faster R-CNN
Faster R-CNN [27] is an object detection model using a Region Proposal Network (RPN) with a CNN model. The RPN shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals. It is a fully convolutional network capable of simultaneously predicting object bounds and objectness scores at each location. The RPN is trained end-to-end to produce high-quality region proposals, which are used by Fast R-CNN for detection. The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look.
Below is the structure of RPN: