VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
STUDENT RESEARCH REPORT
Computer Vision-based Detection and Tracking of Football Players and Balls using YOLOv8
Project code: CN.NC.SV.23_34
Team Leader: Vu Ba Quoc Hung
ID: 23070355 Class: AIT2023A
Hanoi, April 10th, 2024
TEAM LEADER INFORMATION
- Program: Applied Information Technology
- Address: 58 To Huu Street, Nam Tu Liem, Hanoi
- Phone/Email: 0373701205 / vubaquochung@gmail.com
II. Academic Results
Academic Year | Overall score | Academic rating
III. Other Achievements:
Hanoi, April 16th, 2024
Team Leader
(Sign and write fullname)
Vu Ba Quoc Hung
ACKNOWLEDGEMENT
We would like to send our best regards to PhD Kim Dinh Thai, who guided us along the right track with our research assignment. Without your help, we would not have been able to complete this research project. PhD Kim Dinh Thai not only helped and guided us, but also inspired us to be as creative as possible. As a group of freshmen, we cannot thank you enough for being so patient in leading us in the right direction.
Once more, we sincerely thank you for your huge contribution and we are looking forward to having a chance to work with you again!
Vu Ba Quoc Hung
Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
I. LITERATURE REVIEW
II. METHODOLOGY
  1. Basic theory about Convolutional Neural Networks (CNNs)
  2. Object Detection
  3. Football players and ball detection and tracking
  4. Results
    a. Environmental Setup
    b. Evaluation Metrics
    c. Detection Results
    d. Tracking Results
III. CONCLUSION AND FUTURE WORK
IV. REFERENCES
COMPUTER VISION-BASED DETECTION AND TRACKING OF
FOOTBALL PLAYERS AND BALLS
1. Project Code
CN.NC.SV.23_34
2. Member list:
3. Advisor(s)
- PhD Kim Dinh Thai - Lecturer at the International School of Vietnam National University
4. Abstract (300 words or less)
- English: This report presents an effective method for detecting and tracking the players and the ball, which is used to determine ball possession for football data analytics, using object detection and object tracking. In this report, we use YOLOv8m because of its precision compared with other tools. The input is a video of a football match captured by a camera, which is then passed to the object detection module. Every object on the pitch is tracked individually, and ball possession is calculated per player. A team's possession of the ball is specified as the sum of the possession of all players in that team. Even though YOLOv8m is the best option available to us, we still encountered some errors; most noticeably, the instability of ball tracking directly affects the overall results. The best results we obtained were with the YOLOv8m model, with Precision, Recall, and mAP50 scores of 0.92971, 0.92026, and 0.94106, respectively. The lowest-scoring model was YOLOv8n, with 0.91321, 0.89288, and 0.91396 on the same three metrics.
- Vietnamese (translated): This report introduces an effective method for detecting and tracking the players and the ball on the pitch, from which the ball-possession ratio of the two teams is determined. These data can be used for in-depth match analysis and statistics. This report uses YOLOv8m for its superior accuracy compared with other tools. The input is a video of a match recorded by a camera; the detection system then recognises and continuously tracks the ball and each player. Every object on the pitch is thus tracked independently, and each player's ball-possession ratio is computed. Each team's possession is calculated as the sum of the ratios of its players. Although YOLOv8m is currently the most suitable choice, we still faced some errors, and the lack of continuity in ball tracking directly affects the final results. The best results we obtained were with the YOLOv8m model, with Precision, Recall, and mAP50 of 0.92971, 0.92026, and 0.94106, respectively. The lowest results came from the YOLOv8n model, with 0.91321, 0.89288, and 0.91396 on the same metrics.
5. Keywords (3-4 words)
YOLO, football, tracking, CNN
List of Figures:
Fig 1. Examples of tactics in football
Fig 2. An example of the progress of CNN
Fig 3. A brief review of the progress of YOLO
Fig 4. Examples of the dataset
Fig 5. Computing the Intersection over Union by dividing the area of overlap between the bounding boxes by the area of union
Fig 6. Graph of training results
Fig 7. Confusion matrix of YOLOv8m
Fig 8. Precision-Confidence curve of YOLOv8m
Fig 9. Recall-Confidence curve of YOLOv8m
Fig 10. Example of players and ball detection on the test data
List of Tables:
Table 1. Performance comparison of different YOLOv8 models
Table 2. Detailed results using the YOLOv8m model
I. LITERATURE REVIEW
Football is one of the most popular sports worldwide, and its popularity has only increased with the widespread availability of technology. Many people consider football the "king of sports". Several of the football codes are among the most popular team sports in the world [9]. Globally, association football is played by over 250 million players in over 200 nations [10] and has the highest television audience in sport [11], making it the most popular sport in the world [1]. Because of this, many technological improvements have been applied to the sport, such as the Video Assistant Referee (VAR), Semi-Automated Offside Technology (SAOT), goal-line technology, and the smart ball system [13]. Each of these improvements has proved its usefulness, but none helps fans, coaches, and players understand their tactics more deeply. In recent years, computer vision-based object detection and tracking has emerged as an effective tool for distinguishing many objects at once as well as tracking them continuously. Football is perhaps the most suitable sport in which to apply computer vision, since it relies heavily on tactics and all its statistics are measured logically.
In the 4.0 (digital) era, we have witnessed the evolution of Computer Vision (CV) in general and Artificial Intelligence (AI) in particular in almost every field, such as agriculture, the economy, and education. Examples of AI applications include image recognition and perception in agriculture [12] and computer vision-based human action recognition in sports [16]. Statistics are vital in every football match. As shown in Fig 1, every movement of the players on the pitch is marked to help the coaching staff gain better insight into how the match will go. Although football is one of the most popular sports worldwide, only a few studies actually dive into tracking football players and the ball. Understanding the necessity of the idea, we started to develop an AI capable of detecting and tracking the ball and the individual players of both teams.
Fig 1. Examples of tactics in football
II. METHODOLOGY
1. Basic theory about Convolutional Neural Networks (CNNs)
A convolutional neural network (CNN) is a neural network that learns feature engineering by itself via filter (or kernel) optimization. Vanishing and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections [14, 15].
The convolutional layer is the core building block of a CNN, and it is where the majority of computation occurs. It requires input data, a filter, and a feature map. The input data will be a color image, which is made up of a 3-D matrix of pixels. This means that the input has three dimensions (height, width, and depth), which correspond to the RGB channels of an image. Moreover, we have a kernel, or filter, which moves across the receptive fields of the image, checking whether a feature is present. This process is known as a convolution.
The structure of a Convolutional Neural Network consists of four layers:
a. Convolutional layer
The convolutional layer is where the action begins. It is designed to discover image features, usually progressing from the general (e.g., shapes) to the specific (e.g., identifying elements of an object, or recognizing the face of a certain person).
b. Rectified Linear Unit layer (ReLU)
This layer is considered an extension of a convolutional layer. The goal of ReLU is to increase the image's non-linearity. It is a technique for removing redundant information from a picture in order to improve feature extraction.
c. Pooling layer
The pooling layer is used to reduce the number of parameters, i.e., to downsample the feature maps. In other words, it keeps the most important aspects of the information obtained.
d. Fully connected layer
It is a standard feed-forward neural network. It is the last straight line before the finish line, where everything is already visible. It is only a matter of time until the results are confirmed.
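The data flow through these layers can be sketched in a few lines of NumPy. This is a toy illustration of convolution, ReLU, and max pooling only, not the network used in this project; the 6x6 input and the 2x2 gradient-detecting kernel are invented for the example:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise non-linearity: negative responses are zeroed."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Keep only the strongest response in each size-by-size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])     # responds to left-to-right gradients
features = max_pool(relu(conv2d(image, kernel)))  # this map would feed the fully connected layer
print(features.shape)  # (2, 2)
```

In a real CNN the pooled feature maps are flattened and passed to the fully connected layer described above.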
Fig 2. An example of the progress of CNN
2. Object Detection
Object detection is a branch of computer vision: a technique for locating instances of objects in images or videos [17]. When humans look at an image or a video, we instantly recognize objects such as people, animals, buildings, and cars. The goal of object detection is to achieve the same outcome, accomplished by a computer using various algorithms. A variety of techniques can be used to perform object detection. Popular deep learning-based approaches using convolutional neural networks (CNNs), such as R-CNN, SSD, and YOLO, automatically learn to detect objects within images.
a. R-CNN and Fast R-CNN
Region-based Convolutional Neural Network (R-CNN) is a type of deep learning architecture used for object detection in computer vision tasks. R-CNN was one of the pioneering models that helped advance the object detection field by combining the power of convolutional neural networks and region-based approaches [18]. Fast R-CNN is proposed based on SPPNet. SPPNet removes the crop/warp step of R-CNN, replaces the last pooling layer before the FC layer with SPP, and keeps the output at m*n parts regardless of the resolution of the input image. These features accelerate the test speed by 24 to 102 times [19]. The entire SPP training challenge is resolved by Fast R-CNN's Region of Interest (RoI) pooling and proposal reflection. Additionally, it makes use of a multi-task loss layer, in which the bounding-box regression uses SmoothL1Loss and the SVM classifier is replaced by SoftmaxLoss. These techniques combine regression with classification, increasing the algorithm's precision. It also accelerates the fully connected layers using SVD. As a result, Fast R-CNN trains and tests 3 and 10 times faster than SPPNet, respectively. On the VOC07 dataset, Fast R-CNN reaches a mAP of 70%.
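The SmoothL1 loss mentioned above is simple enough to write down directly. This sketch follows the standard formulation (quadratic below a threshold beta, linear above it); the beta parameter generalises the original formulation, which fixes it at 1:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """SmoothL1 loss for bounding-box regression: quadratic for small
    errors (|x| < beta), linear for large ones, so outliers do not
    dominate training the way a squared loss would."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

print(float(smooth_l1(0.5, 0.0)))  # 0.125 (quadratic region)
print(float(smooth_l1(2.0, 0.0)))  # 1.5   (linear region)
```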
b. Faster R-CNN
Faster R-CNN, which is based on Fast R-CNN, solves the region proposal problem by adding the Region Proposal Network (RPN), the key contribution of Faster R-CNN [19]. The RPN generates region proposals (bounding boxes) based on anchor boxes and scores them using a classification subnetwork. The Fast R-CNN detector refines these proposals, extracts features using RoI pooling, and performs classification and bounding box regression. Moreover, Faster R-CNN achieves high accuracy by leveraging the region-based approach while benefiting from convolutional features shared across the proposal and detection stages.
c. SSD
SSD is a single-shot detector for multiple categories that is faster than the previous state-of-the-art single-shot detector (YOLO) and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN). The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes, using small convolutional filters applied to feature maps. To achieve high detection accuracy, it produces predictions at different scales from feature maps of different scales and explicitly separates predictions by aspect ratio [20].
3. Football players and ball detection and tracking
Fig 3. A brief review of the progress of YOLO
You Only Look Once (YOLO) is a popular and widely used algorithm [2]. YOLOv8 is a computer vision model built by Ultralytics, the creators of YOLOv5. In 2015, Redmon et al. introduced the first YOLO version [3]. The YOLOv8 model supports object detection, classification, and segmentation tasks. It is accessible through a Python package as well as a command line interface. The YOLO model outperforms other architectures in terms of processing speed, and thanks to this advantage it has been consistently enhanced across many versions. In this study, we use the YOLOv8 model: although not the newest, it is the most stable version of YOLO. YOLO architectures operate on the principle of performing object detection in a single forward pass of the network, making them notably faster and more suitable for real-time tracking and detection. The input image is split into a grid, and the network predicts bounding boxes and class probabilities for each grid cell. The key components are:
- Backbone: the feature extractor that processes the input image. Its duty is to capture features at different scales.
- Neck: the component that merges features from different levels of the backbone. It often makes use of mechanisms like Feature Pyramid Networks (FPN) or Path Aggregation Networks (PAN) to enhance the detection of objects of different sizes.
- Head: the last part of the network, which predicts the bounding boxes and class probabilities. This is where detection actually takes place.
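The grid-based, single-pass principle can be illustrated with a short sketch: divide the image into cells and assign each ground-truth box to the cell containing its centre. The 640-pixel image size and the 20x20 grid below are arbitrary example values, not YOLOv8's actual feature-map strides:

```python
import numpy as np

def assign_to_grid(boxes, img_size=640, grid=20):
    """Map each box (x1, y1, x2, y2) to the (col, row) grid cell that
    contains its centre, i.e. the cell responsible for predicting it."""
    cell = img_size / grid                  # pixels per grid cell
    cx = (boxes[:, 0] + boxes[:, 2]) / 2    # box centre, x
    cy = (boxes[:, 1] + boxes[:, 3]) / 2    # box centre, y
    return np.stack([cx // cell, cy // cell], axis=1).astype(int)

boxes = np.array([[0.0, 0.0, 64.0, 64.0],       # centre (32, 32)
                  [300.0, 100.0, 340.0, 180.0]]) # centre (320, 140)
cells = assign_to_grid(boxes)  # cells (1, 1) and (10, 4)
```

The head then emits box offsets and class probabilities for every cell, and non-responsible cells are trained toward low objectness.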
Loss Function: Positive samples are assigned based on a combination of classification and regression scores, following YOLOv8's task-aligned assignment metric:

t = s^alpha * u^beta

where s is the predicted classification score, u is the IoU between the predicted and the ground-truth box, and alpha and beta weight the two terms.
The CIoU Loss incorporates the aspect ratio of both the predicted and the ground-truth bounding boxes; it enhances the DIoU Loss with an additional influence factor:

CIoU Loss = 1 - CIoU = 1 - IoU + d^2 / d_C^2 + v^2 / ((1 - IoU) + v)

where d is the distance between the centres of the two boxes, d_C is the diagonal length of the smallest box enclosing both, and v is the parameter measuring the consistency of aspect ratio, defined as follows:

v = (4 / pi^2) * (arctan(w_gt / h_gt) - arctan(w / h))^2
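Written out as code, the loss can be sketched as follows. This is a plain-Python sketch for two axis-aligned boxes, not the vectorised implementation inside YOLOv8's training code:

```python
import math

def ciou_loss(box1, box2):
    """CIoU loss for two boxes in (x1, y1, x2, y2) format:
    1 - IoU + d^2/d_C^2 + v^2/((1 - IoU) + v)."""
    # Intersection and union areas.
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    iou = inter / (w1 * h1 + w2 * h2 - inter)
    # Squared centre distance d^2 and enclosing-box diagonal d_C^2.
    d2 = ((box1[0] + box1[2] - box2[0] - box2[2]) / 2) ** 2 \
       + ((box1[1] + box1[3] - box2[1] - box2[3]) / 2) ** 2
    cw = max(box1[2], box2[2]) - min(box1[0], box2[0])
    ch = max(box1[3], box2[3]) - min(box1[1], box2[1])
    dc2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term v and its weighted penalty.
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    penalty = v ** 2 / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + d2 / dc2 + penalty

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 (perfect overlap)
```

Unlike plain IoU loss, the centre-distance and aspect-ratio terms still provide a gradient when the boxes barely overlap, which matters for a target as small as the ball.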
Fig 4. Examples of the dataset
We used a total of 1082 images, comprising 684 images of the ball alone and the remaining 398 images of the pitch. The annotation classes were labeled '0' for 'FCB' (FC Barcelona), '1' for 'RMA' (Real Madrid CF), '2' for 'ball', '3' for 'goalkeeper', and '4' for 'referee'. The players are detected in almost all the images, which is understandable since they outnumber the other classes of objects by a large margin. The referee is the second most highly represented class, because there are 3 different referees on 3 different parts of the pitch. Even though the ball is one of the lowest-represented classes, it is present in most of the frames. Finally, the goalkeepers are the lowest-represented class.
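The class IDs above translate directly into YOLO-format label files: one text line per object, giving the class ID and the normalised box geometry. The coordinate values in this snippet are invented for illustration:

```python
# Class IDs used when annotating the dataset.
CLASSES = {0: "FCB", 1: "RMA", 2: "ball", 3: "goalkeeper", 4: "referee"}

# YOLO-format annotation line: class_id x_center y_center width height,
# with all four coordinates normalised to [0, 1] by the image size.
label_line = "2 0.512 0.430 0.015 0.020"  # hypothetical annotation of the ball
class_id, cx, cy, w, h = label_line.split()
print(CLASSES[int(class_id)])  # ball
```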
The parameters for the experimental process of the YOLOv8m model are selected as follows: epochs=150, time=None, patience=100, batch=16, imgsz=640. We adjusted