Scientific research report: Building a Computer Vision Assisted Pill Inspection System



VIETNAM NATIONAL UNIVERSITY, HANOI

ID: 23070414 Class: AIT2023B


TEAM LEADER INFORMATION

- Program: Applied Information Technology

- Address: Tay Mo, Nam Tu Liem, Ha Noi

- Phone no /Email: 0963294472/ biayciii@ gmail.com

II Academic Results (from the first year to now)

Academic year | Overall score | Academic rating

III Other achievements:

Advisor Hanoi, April 15, 2024


We would like to express our most sincere gratitude to PhD Kim Dinh Thai and PhD Ha Manh Hung, who guided us on the right track with our research assignment. Thanks to their thorough support and care, we have completed this scientific research. PhD Kim Dinh Thai and PhD Ha Manh Hung have always cared for and supported us step by step, from the ideation to the completion phase of the research. They not only inspired us to generate creative ideas for the research but also motivated us when we were trying to overcome difficulties in the process.

Without their support, we would not have been able to complete this research. Once more, we sincerely thank them for their huge contribution and look forward to collaborating with them in future projects.

Vu Xuan Bac


Table of Contents

List of figures
List of tables
INTRODUCTION & ABSTRACT
Student's Information
I INTRODUCTION
1.1 Rationale of the study
1.2 Research Questions
1.3 Object and Scope of the Study
1.4 Research Methods
1.5 Structure
II LITERATURE REVIEW
III METHODOLOGY
3.1 Basic theory about Image Object Detection
3.1.1 Basic theory about Convolutional Neural Networks (CNNs)
3.1.2 Object detection problem
3.1.3 Image object detection for medical images
3.2 Models for Object Detection
3.2.1 YOLO
3.2.2 Faster R-CNN
3.2.3 SSD (Single Shot Multibox Detector)
3.3 Image Object Detection for Pill
3.3.1 Problem Statement
3.3.2 ...
3.3.3 Dataset and Features
3.3.4 Evaluation Metrics
3.3.5 Implementing Method


List of figures

Figure 1: Basic CNN architecture
Figure 2: RPN architecture
Figure 3: SSD Model
Figure 4: YOLOv8 architecture
Figure 5: Example images of drug samples in Dataset
Figure 6: Histogram of Object Count by Image
Figure 7: The Model of Varifocal Loss
Figure 8: An illustration of the IACS method
Figure 9: Graph of Focal Loss
Figure 10: Image shows the final result of our Google Colab model
Figure 11: The benchmark of our dataset

List of tables

Table 1: YOLOv8 accuracy
Table 2: The dice score of the Google Colab


Full Name | Class | ID

Luong Dinh Tung | AIT2023B | 23070385

Nguyen Nhat Minh | AIT2023B | 23070461

Nguyen Duc Manh | AIT2023A | 23070205

3 Advisor(s):

Thai Kim Dinh is a lecturer in the International School of Vietnam National University.

Hung Manh Ha is a lecturer in the International School of Vietnam National University.

4 Abstract (300 words or less):

These days, technology is vital. Computer vision and artificial intelligence (AI) have made major contributions to numerous disciplines, particularly the medical sector. The use of AI in this sector has several advantages and enormous promise for both customers and industry. One of those advantages is the development of systems that inspect, select, and assess the quality of pharmaceutical and medical supplies.

In this research, we label classes on Roboflow and use the YOLOv8 deep neural network, one of the most advanced models in the field of object recognition, which can recognize and classify objects in images at high speed and with high accuracy and precision. We focus on creating and developing models to detect, classify, and manage the quantity and quality of items. Our goal is to create a system that uses the information captured by the camera to identify, select, classify, and evaluate drugs as accurately and effectively as possible. By developing and testing this system, we hope to contribute to improving the quality and safety of medical products and to help optimize the drug testing and evaluation process in the pharmaceutical industry.

Building a Computer Vision Assisted Pill Inspection System Using YOLOv8 Model is a deep learning-based approach to object recognition, drug evaluation, and classification. It gives pharmaceutical firms an automated and effective way to monitor the quality of their products. The purpose of this work is to train the YOLOv8 model to correctly identify pills in photos using the preprocessed dataset. This approach requires a significant amount of data and a correspondingly long training time. After training, the experimental results demonstrate that the YOLOv8 models outperform earlier research findings on multiple classes, with an average mAP50 above 97.6% across all classes.

5 Keywords (3-5 words)

YOLO, Computer Vision, Artificial Intelligence


I INTRODUCTION

1.1 Rationale of the study

These days, technology is vital. Computer vision and artificial intelligence (AI) have made major contributions to numerous disciplines, particularly the medical sector. The use of AI in this sector has several advantages and enormous promise for both customers and industry. One of those advantages is the development of systems that inspect, select, and assess the quality of pharmaceutical and medical supplies.

Medicine has been a significant part of illness prevention and treatment since ancient times. Even though the form of traditional Chinese medicine has changed from bags to compressed pills, consumers still require assurances of quality and safety before using it. However, drug testing and evaluation demand human precision and scrupulousness. Manual inspection is time-consuming and error-prone, so relying on it can cost the manufacturer a lot of money. Therefore, drug testing, selection, and evaluation systems built on artificial intelligence (AI) and computer vision hold a lot of promise in medicine and pharmacy.

In this research, we label classes on Roboflow and use the YOLOv8 deep neural network, one of the most advanced models in the field of object recognition, which can recognize and classify objects in images at high speed and with high accuracy and precision. We focus on creating and developing models to detect, classify, and manage the quantity and quality of items. Our goal is to create a system that uses the information captured by the camera to identify, select, classify, and evaluate drugs as accurately and effectively as possible. By developing and testing this system, we hope to contribute to improving the quality and safety of medical products and to help optimize the drug testing and evaluation process in the pharmaceutical industry.

During our work and investigation, we used numerous pictures and documents with information about various medications. After collecting the data on Roboflow, we passed it to the YOLOv8 model, enabling the model to screen and recognize samples. We were not let down: training the model produced a large amount of data that allowed us to assess it.


1.2 Research questions

To complete the development of a diagnostic model for pill inspection, we propose two main research questions:

(1) What are the most effective data sources and types of images that should be used to train the pill inspection model to ensure reliable and robust performance in real-world applications?

(2) Which models detect pills most accurately and reliably?

1.3 Object and Scope of the Study

- The object of the study is images of various medications.

1.5 Structure

Besides this introduction, we explain the theory behind the algorithms used in the paper, then apply them to the problem and compare the results obtained.

The main body of the paper consists of the Methodology part, followed by the Experiments & Results part.

II LITERATURE REVIEW

Drug prescription and inventory management are very important tasks for safe drug dispensing, and promptness and accuracy are essential. Approximately 1000 types of pills are handled in large hospitals. The pills used by a patient change depending on the patient's degree of improvement. In many existing hospitals and pharmacies, the pharmacist manually sorts and packs the pills according to the prescription, which is a time-consuming process. In addition, simple repetitive tasks can cause fatigue, leading to mistakes during pill sorting; such situations can lead to medical accidents.

In recent years, automated equipment, such as automated medication dispensing machines [1,2,3], has rapidly spread in pharmacies and hospitals where multiple dispensing tasks need to be performed, such as sorting and packaging pills. An automated medication dispensing machine is a device that sorts and packs drugs based on a prescription that is input from a computerized program. However, the automatic dispensing machine also requires a function to inspect the prepared product because there is a risk of erroneous formulation. A vision inspection method using a digital camera is widely used. The vision inspection method uses two forms of analysis. First, there is a rule-based analysis method that compares and analyzes product characteristics [4]. The second method involves a template and compares the similarity with a reference image [5,6,7]. Recently, deep learning-based object detection algorithms have been developed and investigated [8,9,10,11].

The template matching method finds the region of an input image that has the highest similarity to a reference image. The methods used for comparing the input image with the reference image are divided into two categories: pixel-based and shape-based matching methods. The pixel-based matching method calculates the difference between the pixels of the reference image and the input image. Its representative methods include the sum of squared differences and normalized cross-correlation [12,13]. The pixel-based matching method is robust against distortions such as blurring caused by the shaking of the camera during capturing. However, the pixel-based method is not effective for changes in the size and rotation of the inspection object because it calculates the difference between the pixels.

The shape-based matching method extracts a region of interest (ROI) of a test subject from a reference image and compares it with an input image. The shape-based matching method does not use all the pixels of the reference image. Rather, it uses only the representative features and is robust to changes in the size and rotation of an object. The shape-based matching method is also superior to the pixel-based matching method under lighting changes. Its representative methods include the scale-invariant feature transform and the shape-based matching included in the MVTec HALCON library [14,15,16].
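The two pixel-based measures named above, the sum of squared differences and normalized cross-correlation, can be illustrated with a small sketch. It is not taken from the cited works: the toy image, the template, and the helper names (`crop`, `ssd`, `ncc`, `best_match`) are assumptions chosen for illustration, and the cross-correlation shown is the zero-mean variant.

```python
import math
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def crop(image: Image, top: int, left: int, h: int, w: int) -> Image:
    """Return the h x w window of `image` whose top-left corner is (top, left)."""
    return [row[left:left + w] for row in image[top:top + h]]


def ssd(window: Image, template: Image) -> float:
    """Sum of squared differences: 0 means identical, larger means less similar."""
    return sum((a - b) ** 2
               for wr, tr in zip(window, template)
               for a, b in zip(wr, tr))


def ncc(window: Image, template: Image) -> float:
    """Zero-mean normalized cross-correlation in [-1, 1]; 1 is a perfect match."""
    a = [p for row in window for p in row]
    b = [p for row in template for p in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [p - mean_a for p in a]
    db = [p - mean_b for p in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:  # flat window or flat template: correlation is undefined
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match(image: Image, template: Image) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Slide the template over the image; return the best (row, col) by SSD and by NCC."""
    h, w = len(template), len(template[0])
    positions = [(r, c)
                 for r in range(len(image) - h + 1)
                 for c in range(len(image[0]) - w + 1)]
    by_ssd = min(positions, key=lambda p: ssd(crop(image, p[0], p[1], h, w), template))
    by_ncc = max(positions, key=lambda p: ncc(crop(image, p[0], p[1], h, w), template))
    return by_ssd, by_ncc


# Toy 4x5 "image" containing a bright 2x2 pill-like blob at row 1, column 2.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 7, 0],
    [0, 0, 8, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 7], [8, 9]]
ssd_pos, ncc_pos = best_match(image, template)
```

Both measures compare pixels at fixed offsets, which is why a rotated or rescaled pill no longer lines up with the template and the match degrades.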

Deep learning technology is an artificial neural network technology that can learn and make judgments on its own based on data. This technology shows excellent performance in the field of object detection. Object detection refers to a method that specifies not only the presence of an object in an image but also the type and location of the object. Representative deep learning methods include you only look once (YOLO) and the region-based convolutional neural network (R-CNN) [17,18]. To improve the detection performance in deep learning, scientists use various methods, such as data processing, loss function improvement, convolution layer control, and activation functions. For example, in one convolutional neural network (CNN) design, a complex structure involving color, shape, and difference images is used. However, despite its excellent detection performance, the deep learning method requires a large amount of data for training. Substantial effort is required to create new data that are not public. Therefore, it is important to investigate how to effectively extend a small amount of data.

III METHODOLOGY

3.1 Basic theory about Image Object Detection

Image object detection is an essential component of computer vision. It is closely related to image segmentation, which partitions an image into its component regions to make the image representation simpler and easier to understand. Image segmentation and object detection share the same objective: to identify the image regions that contain objects and give them the proper labels.

3.1.1 Basic theory about Convolutional Neural Networks (CNNs)


Convolutional neural networks (CNNs) are a kind of deep neural network that is frequently used in image processing and computer vision. Their purpose is to efficiently identify and extract features from image data. They have attained great efficiency in image analysis and classification. These days, they are widely used in practically every industry, particularly in cutting-edge applications like medical imaging and self-driving cars.

Convolutional neural networks are most commonly applied to visual image analysis. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, and brain-computer interfaces. To perform these tasks, a CNN includes several layers with specific functions, including convolutional layers, pooling layers, fully connected layers, and dropout layers.

The layers of the CNN architecture for image classification are described below:

- Convolutional Layers: The most crucial component of the CNN, this layer is in charge of processing, recognizing, and computing on the image. The CNN takes an input image, processes it, classifies it into predetermined categories, and outputs the result. The computer views the input image as an array of pixels whose height, width, and depth (number of channels) depend on the image resolution.

- ReLU Layer: ReLU is the activation function in the CNN; it simulates the rate at which neurons transmit impulses through the axon. There are many activation functions, but ReLU is the most commonly used. It is favored for neural network training because it supports faster computation. ReLU layers are placed after the feature maps are computed, and ReLU is applied to the feature map values.

- Pooling Layer: A component of the CNN architecture whose primary job is to reduce the size of the feature map produced by the convolutional layers, thereby lowering computation and improving the model's generalization. Pooling comes in various forms, including:

+ Max Pooling: takes the largest element of each feature map region

+ Average Pooling: takes the average value of each feature map region

+ Sum Pooling: takes the total of the values in each feature map region

- Fully connected layer: In a CNN, fully connected layers are used to classify input images or perform other tasks such as segmentation. Each neuron in this layer is connected to all neurons in the previous and subsequent layers, forming a fully connected network. When this layer receives image feature data, it converts them into a score for each class so that the highest-scoring class can be selected.
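The layer operations listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this research: the 5x5 input, the 2x2 filter, and the function names are assumptions, and real frameworks run the same operations on batched, multi-channel tensors.

```python
from typing import Callable, List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding), stride-1 cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map: Matrix) -> Matrix:
    """ReLU keeps positive values and replaces negative ones with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def pool2d(feature_map: Matrix, size: int,
           reduce: Callable[[List[float]], float]) -> Matrix:
    """Apply `reduce` to each non-overlapping size x size region."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[reduce([feature_map[r + i][c + j]
                     for i in range(size) for j in range(size)])
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


image = [
    [1, 2, 0, 1, 3],
    [0, 1, 3, 2, 1],
    [2, 0, 1, 0, 2],
    [1, 3, 2, 1, 0],
    [0, 1, 0, 2, 1],
]
kernel = [[1, 0], [0, -1]]  # hypothetical 2x2 filter: difference along the diagonal

features = relu(conv2d(image, kernel))   # 4x4 feature map with negatives zeroed
max_pooled = pool2d(features, 2, max)    # largest value per 2x2 region
avg_pooled = pool2d(features, 2, mean)   # average value per 2x2 region
sum_pooled = pool2d(features, 2, sum)    # total per 2x2 region
```

Each pooling variant halves the height and width of the 4x4 feature map, which is the size reduction the pooling layer is meant to provide.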

Here is a diagram of the basic CNN architecture built from these layers:


The structure shown above consists of different layers, including an input layer, two convolution layers, two max pooling layers, and an activation layer. After passing through the activation functions, these layers carry weights in their nodes and pass increasingly abstract information to the next layers in the network. In particular, the pooling layers provide a degree of invariance to translation, scaling, and rotation. Local connectivity lets the network build representations from low to high levels of abstraction through convolution with the filters. The layers of a CNN are linked together through the convolution mechanism: the next layer is the result of the convolution computed on the previous layer, which gives local connections. Thus, each neuron in the next layer is generated by applying a filter to a local region of the previous layer. Each layer uses different filters, usually hundreds to thousands of them, and combines their results. In addition, there are other layers, such as pooling/subsampling layers, used to distill more useful information. The last layer is used to classify the image.
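To make the effect of stacking two convolution layers and two max pooling layers concrete, the sketch below traces the feature-map size through such a stack using the standard output-size formula. The input size, kernel sizes, strides, padding, and channel counts are hypothetical values chosen for illustration, not the settings of the network in the figure.

```python
from typing import List, Optional, Tuple


def output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical stack: (name, kernel, stride, padding, output channels or None for pooling).
layers: List[Tuple[str, int, int, int, Optional[int]]] = [
    ("conv1", 3, 1, 1, 16),
    ("pool1", 2, 2, 0, None),
    ("conv2", 3, 1, 1, 32),
    ("pool2", 2, 2, 0, None),
]

size, channels = 64, 3  # assumed 64x64 RGB input
shapes = [("input", size, size, channels)]
for name, kernel, stride, padding, out_channels in layers:
    size = output_size(size, kernel, stride, padding)
    if out_channels is not None:  # a convolution changes the channel count
        channels = out_channels
    shapes.append((name, size, size, channels))

flattened = size * size * channels  # number of inputs to the final classification layer
```

Under these assumptions the spatial size shrinks from 64 to 16 while the channel count grows from 3 to 32, so the last layer classifies a vector of 8192 values.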

3.1.2 Object detection problem

Object detection is an important area of computer vision. Various machine learning (ML) and deep learning (DL) models are used to enhance performance in object detection and related tasks. Previously, two-stage object detectors were popular and effective. With recent developments in single-stage object detection and its underlying algorithms, single-stage detectors have become significantly better than most two-stage detectors. Furthermore, since the advent of YOLO, various applications have used it for object detection and recognition in a variety of contexts, where it has performed very well compared with two-stage detectors.

Image classification is the task of classifying an image, or an object within an image, into one of a set of predefined categories. This problem is often solved with supervised machine learning or deep learning algorithms, where the model is trained on a large labeled dataset. Commonly used machine learning models for this task include ANN, SVM, decision trees, and KNN [24]. However, we encounter many problems when using some of those machine learning models. Therefore, we use CNNs, which together with their architectural successors and variants dominate other deep models for image classification and related tasks. Apart from using CNNs, we also use object localization to determine the location of one or more objects in an image or frame by drawing a rectangular box, often called a bounding box, around each object. In this paper, we focus on CNNs and related models.

CNNs used for semantic segmentation use a fully convolutional network (FCN) architecture that differs from traditional CNNs by replacing fully connected layers with convolutional layers. This allows the network to process input images of any size and produce a corresponding output map of the same size. In the FCN architecture, an encoder-decoder structure with skip (bypass) connections is often used. The encoder gradually reduces the spatial resolution of the feature map while increasing the number of channels through a series of convolutional and pooling layers.

Object localization refers to the task of accurately identifying and locating objects of interest in an image. It plays an important role in computer vision applications for object detection, tracking, and segmentation. Object localization involves training a convolutional neural network to predict the coordinates of bounding boxes that tightly enclose objects in an image. The process takes place in two steps: the CNN extracts image features, and a regression head predicts the bounding box coordinates.
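How tightly a predicted bounding box encloses an object is commonly measured with intersection over union (IoU), and the mAP50 figure quoted in the abstract counts a detection as correct when its IoU with a labeled box is at least 0.5. The sketch below assumes boxes in pixel coordinates; the box values and helper names are illustrative and are not taken from this research.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def center_to_corners(cx: float, cy: float, w: float, h: float) -> Box:
    """Convert a (center x, center y, width, height) box to corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 means identical boxes, 0.0 means no overlap."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (10.0, 10.0, 50.0, 50.0)                # hypothetical labeled pill
predicted = center_to_corners(35.0, 30.0, 40.0, 40.0)  # hypothetical regression output
overlap = iou(ground_truth, predicted)
is_match_at_50 = overlap >= 0.5  # the IoU threshold behind the mAP50 metric
```

Here the predicted box is shifted 5 pixels to the right of the label, which still leaves an IoU of about 0.78 and therefore counts as a match at the 0.5 threshold.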

By using those machine learning models, we can more easily analyze images and extract features to produce good results for our research.


3.1.3 Image object detection for medical images

Medical image detection refers to the use of artificial intelligence (AI), one of the most promising tools for advancing the field of medical imaging and healthcare. In addition, we also use modern 3D imaging tools, cloud computing, and imaging solutions such as MRI and CTA. The purpose of detecting and classifying this data is to identify objects that may be unsuitable during the drug manufacturing process. This work can take a lot of time, effort, and money, but in return AI and related software bring outstanding advantages: a system that serves many tasks requiring meticulousness and precision while saving costs in the health industry in general and the pharmaceutical industry in particular.

Why is it beneficial? Because it allows objects to be analyzed and identified without the manual steps previously required. This is very beneficial in large production processes. It also helps manufacturers determine product quality and the quantity of damaged product.

In our study, this process includes collecting data and labeling classes on Roboflow. Once the data was labeled, we exported it and trained the model with Ultralytics. The model's ability to identify foreign objects in an image is very good, giving positive results. We are pleased that these favorable results can serve the medical and pharmaceutical industry.
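A workflow of this kind (label on Roboflow, export, train with Ultralytics) might look roughly like the sketch below. It is not the authors' training script: the class names, folder layout, checkpoint, and hyperparameters are assumptions, and the training call is left commented out because it needs the `ultralytics` package and an exported dataset on disk.

```python
from pathlib import Path
from typing import List


def build_dataset_yaml(root: str, class_names: List[str]) -> str:
    """Build the text of a YOLO-format dataset config (paths and class names)."""
    lines = [
        f"path: {root}",
        "train: train/images",
        "val: valid/images",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


def train_and_evaluate(data_yaml: Path, epochs: int = 50, image_size: int = 640) -> float:
    """Fine-tune a pretrained YOLOv8 checkpoint and return mAP50 on the validation split."""
    from ultralytics import YOLO  # imported lazily: requires `pip install ultralytics`

    model = YOLO("yolov8n.pt")  # smallest pretrained YOLOv8 detection checkpoint
    model.train(data=str(data_yaml), epochs=epochs, imgsz=image_size)
    metrics = model.val()       # evaluates on the `val` split from the YAML file
    return float(metrics.box.map50)


# Hypothetical class names and dataset folder; the real ones come from the Roboflow export.
config_text = build_dataset_yaml("datasets/pills", ["capsule", "tablet"])
# Path("data.yaml").write_text(config_text)
# print(train_and_evaluate(Path("data.yaml")))
```

A Roboflow export in YOLOv8 format normally ships its own `data.yaml`, so `build_dataset_yaml` is only needed when the config has to be written by hand.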

3.2 Models for Object Detection

In this section, we examine the YOLOv8, Faster R-CNN, and SSD (Single Shot Multibox Detector) models. Each of these models can be applied to our dataset.

3.2.1 YOLO

In scientific study, "YOLO," or "You Only Look Once," is an object detection system that uses deep learning to achieve remarkable accuracy. This approach was presented in 2016. Since the first publication of the model, YOLO has undergone several revisions, advances, enhancements, and optimizations while maintaining the model's accuracy. The original iteration of YOLO, known as YOLOv1 (2015) [25], was less accurate than later iterations. It relied on a single neural network to predict bounding boxes and class probabilities directly from the full image. More object types could be detected with YOLOv2 (2016), and batch normalization and anchor boxes increased accuracy. The CNN architecture used by YOLOv3 (2018) is called Darknet-53, and anchor boxes are resized to fit the object's dimensions. YOLOv4 (2020) [26] employs anchor boxes with k-means clustering and a novel CNN architecture known as CSPNet. GHM loss is a novel loss function that enhances FPN's design over YOLOv3.

To enhance object detection performance across various object types, YOLOv5 (2020) employs spatial pyramid pooling (SPP), CIoU loss, and dynamic anchor boxes. By utilizing improved training techniques, YOLOv6 (2022), with its EfficientRep backbone, significantly improved object detection. In order to maximize parameter utilization, YOLOv7 focused on model scaling and re-parameterized convolution modules. The YOLOv8 (2023) model is the most recent model in the YOLO family, having been pre-trained on extensive datasets like COCO and ImageNet. Moreover, its models are smaller, more accurate, and faster to train. With the addition of the C2f convolution module, anchor-free detection, and mosaic augmentation, YOLOv8 has improved upon several aspects of its predecessor.


3.2.2 Faster R-CNN

Faster R-CNN [27] is an object detection model that combines a Region Proposal Network (RPN) with a CNN detector. The RPN shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each location. The RPN is trained end-to-end to produce high-quality region proposals, which are used by Fast R-CNN for detection. The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look.
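In the original Faster R-CNN design, the RPN makes its per-location predictions relative to a fixed set of reference boxes, called anchors, centered on every feature-map cell. The sketch below generates such anchors. The stride of 16 and the three scales and three aspect ratios follow the commonly cited defaults, but they are assumptions here, as are the function names and the tiny 2x3 feature map.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def anchors_at(cx: float, cy: float, scales: List[float], ratios: List[float]) -> List[Box]:
    """Anchors centered at (cx, cy); each has area scale**2 and height/width = ratio."""
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def generate_anchors(feat_h: int, feat_w: int, stride: int,
                     scales: List[float], ratios: List[float]) -> List[Box]:
    """One set of anchors per feature-map cell, centered on that cell in image pixels."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            anchors.extend(anchors_at(cx, cy, scales, ratios))
    return anchors


# Assumed settings: stride 16, 3 scales x 3 aspect ratios = 9 anchors per location.
anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           scales=[128, 256, 512], ratios=[0.5, 1.0, 2.0])
```

For every anchor the RPN outputs an objectness score and four box offsets, so the number of proposals grows with the feature-map size rather than with the number of objects in the image.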

Below is the structure of RPN:

Title	Building a Computer Vision Assisted Pill Inspection System
Author	Vu Xuan Bac
Advisors	PhD. Kim Dinh Thai, PhD. Ha Manh Hung
Institution	Vietnam National University, Hanoi
Major	Applied Information Technology
Type	Student Research Report
Year of publication	2024
City	Hanoi
