
Semantic Segmentation of Pet Images Using Segmentation Model Pytorch


DOCUMENT INFORMATION

Basic information

Title: Semantic Segmentation of Pet Images Using Segmentation Model Pytorch
Author: Cao Tran Hung
Supervisor: Le Minh Thanh, M.Eng.
University: Ho Chi Minh City University of Technology and Education
Major: Electronics and Telecommunication Engineering Technology
Document type: Graduation project
Year: 2024
City: Ho Chi Minh City
Format
Number of pages: 50
File size: 3.73 MB


Structure

  • Chapter 1 Introduction
    • 1.1 Overview
    • 1.2 Objectives
    • 1.3 Scope
    • 1.4 Outline
  • Chapter 2 Background
    • 2.1 Overview of image processing
    • 2.2 Image segmentation
      • 2.2.1 Input and Output of Image Segmentation
      • 2.2.2 Different types of image segmentation
      • 2.2.3 Applications of Image Segmentation
    • 2.3 Network and Libraries
      • 2.3.1 U-Net
      • 2.3.2 Torchmetrics library
      • 2.3.3 Albumentations library
    • 2.4 Dataset
    • 2.5 Performance metrics
      • 2.5.1 Intersection over Union (IoU)
      • 2.5.2 Pixel Accuracy
      • 2.5.3 DICE score
  • Chapter 3 Model Design and Methodology
    • 3.1 Requirements
    • 3.2 Block diagram
    • 3.3 Dataset usage plan
    • 3.4 U-Net
    • 3.5 Flowchart
  • Chapter 4 Results and Evaluation
    • 4.1 CNN network review after training
      • 4.1.2 Result parameters after the training process
    • 4.2 Scenarios
      • 4.2.1 Cat scenarios
      • 4.2.2 Dog scenarios
      • 4.2.3 Dog and cat scenarios
  • Chapter 5 Conclusion and further works
    • 5.1 Conclusion
    • 5.2 Further Works

Content

MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

GRADUATION THESIS
MAJOR: ELECTRONICS AND COMMUNICATIONS ENGINEERING TECHNOLOGY
INSTRUCTOR: LE MINH THANH, M.Eng.

Introduction

Overview

Image segmentation is a vital technique in computer vision that includes methods like semantic, instance, and panoptic segmentation, aiming to classify each pixel in an image. Recent advancements, particularly through deep learning and convolutional neural networks (CNNs), have propelled this field forward. Key innovations, such as Fully Convolutional Networks (FCNs) and U-Net architectures, have greatly enhanced the accuracy and efficiency of image segmentation. Additionally, the integration of attention mechanisms and transformer models has improved performance by capturing global context. This progress aligns with a trend towards more precise and context-aware image analysis, benefiting applications in autonomous driving, medical imaging, and augmented reality. Ongoing research in image segmentation highlights the significance of pixel-level understanding in the evolving landscape of artificial intelligence and computer vision.

Semantic segmentation is a technique that labels each pixel in an image according to its object type, without distinguishing between individual instances. For instance, in an image with multiple cars and people, it assigns a "car" label to all vehicle pixels and a "person" label to all human pixels. Common deep learning models used for semantic segmentation include Fully Convolutional Networks (FCN), U-Net, SegNet, DeepLab, and Mask R-CNN (when not generating specific masks). This technique is widely applied in areas such as object recognition, autonomous driving, obstacle detection, and medical image classification, making it essential in computer vision and artificial intelligence.

As autonomous driving technology develops, acquiring traffic scene information is essential for enhancing driving safety. Semantic segmentation technology has emerged as a key approach to this task.

The advancement of semantic segmentation, particularly through models like DeepLabv3+, is crucial for enhancing the safety of autonomous driving systems by allowing vehicles to accurately perceive and analyze real-time traffic environments at the pixel level. Despite its capabilities, DeepLabv3+ struggles with accurately segmenting small targets and distinguishing similarly shaped objects. An improved version has been proposed, integrating a fused spatial attention mechanism, which enhances segmentation performance by increasing the weights of segmented regions and utilizing Focal Loss alongside enhancements to the ASPP structure. Experimental results on the Cityscapes dataset indicate a 1.56% increase in average accuracy for this improved model, showcasing its significance in autonomous driving applications. Additionally, the "ABSSNet: Attention-based Spatial Segmentation Network for Traffic Scene Understanding" study introduces a convolutional attention module and a Spatial Convolutional Neural Network (SCNN) to enhance the understanding of spatial location distribution and modeling abilities. This method effectively improves the neural network's application of spatial information, particularly in detecting road and lane lines in complex traffic scenes, supported by the creation of the NWPU Road Dataset. However, this solution requires substantial computational resources due to the complexity of the attention modules and SCNN.

Objectives

Semantic image segmentation focuses on labeling each pixel in an image with its corresponding class, identifying the category that each pixel represents. It is important to clarify that this process does not involve distinguishing between different instances of the same class; rather, it emphasizes the classification of pixels based solely on their category. For example, if an input image contains two objects from the same category, the segmentation treats the pixels of both as belonging to that single class: the segmentation map does not inherently distinguish them as separate objects. A different class of models, known as instance segmentation models, distinguishes between separate objects of the same class.

Scope

The project's constraints in Semantic Segmentation with PyTorch are primarily focused on image processing, limiting its application in continuous video tasks such as object tracking. The dataset is simplified to only include dogs and cats, which may hinder the model's adaptability for more complex segmentation challenges encountered in medical or industrial contexts. Additionally, the straightforward U-net model may lack the necessary complexity to effectively manage intricate segmentation tasks that involve variations in object morphology or lighting conditions. The goal is to process grayscale or RGB images to produce a segmentation map, where each pixel is assigned a class label represented by an integer in a height x width x 1 format.

Outline

The project is divided into five chapters:

• Chapter 1: Introduction, which briefly introduces the technological trends in image segmentation and deep learning networks.

• Chapter 2: Background, which provides information about image processing, image segmentation, the network and libraries, the dataset, and the performance metrics.

• Chapter 3: Model Design and Methodology, which covers the system requirements, the system block diagram, the data usage plan, the U-Net architecture, and the system flowchart.

• Chapter 4: Results and Evaluation, which presents and analyzes the results under the different system scenarios and requirements.

• Chapter 5: Conclusion and Further Work, which summarises the achievements and limitations of the thesis and proposes potential improvements and developments for semantic segmentation.

Background

Overview of image processing

Image processing is a vital area in computer science and engineering that utilizes algorithms and mathematical techniques to enhance and analyze images. Its primary objectives include improving image quality, restoring images, detecting features, and transforming images for specific applications. This field is categorized into two main types: digital image processing, which is more prevalent due to technological advancements, and analog image processing. The applications of image processing are extensive and span various sectors, including healthcare, security, entertainment, and scientific research.

Figure 2.1: Block diagram of image processing

Here are the detailed functions of each block in the image-processing diagram:

• Original Image: This is the input image for the processing pipeline. This image can come from various sources such as cameras, scanners, or even images already stored on a computer.

Pre-processing is a crucial step in image processing that aims to enhance image quality by reducing noise, improving contrast, and normalizing images for easier processing. Common techniques employed in this phase include noise filtering, brightness and contrast adjustments, and color space transformations.

Image processing serves as the foundational step in the framework, focusing on enhancing image quality and extracting valuable information. Various techniques, including semantic and instance segmentation, are employed to analyze images, identify objects, and extract key features effectively.

Displaying results after image processing is a crucial step in the workflow, allowing users to visualize enhancements or specific features through masks. This is particularly important in image segmentation, where key areas are highlighted for easier identification. By comparing images before and after processing, users can evaluate improvements in quality, clarity, and accuracy of object recognition. Ultimately, visualizing post-processed images provides essential feedback, enabling users to assess and refine their processing techniques for optimal results.

Image segmentation

Image Segmentation refers to the process of dividing an image into distinct regions, focusing on identifying and labeling areas that contain objects. Unlike object detection, which provides broader classifications, Image Segmentation requires pixel-level accuracy for a more detailed analysis. This method offers a comprehensive understanding of an image by accurately depicting each object's location, shape, and pixel composition.

Figure 2.2: Distinguish between Object Detection and Instance Segmentation algorithms [3]

2.2.1 Input and Output of Image Segmentation

In supervised learning for computer vision, image segmentation involves labeling images to create a mask matrix, where each pixel is assigned a specific value based on the input image.

The input is shown in Figure 2.3 and the output in Figure 2.4 of the Image Segmentation model for a single-object problem. A different color represents each segment label: gray is the background, yellow is the object's border, and purple is the inside of the object.

2.2.2 Different types of image segmentation

Figure 2.5: The difference between Semantic and Instance segmentation [4]

There are two main image segmentation problems:

Semantic segmentation involves categorizing image regions based on distinct labels, without differentiating between individual objects within those labels. For instance, in an image, we can identify which pixels correspond to a person versus the background. However, when an image contains multiple people, semantic segmentation does not specify which pixels belong to each individual.

Instance segmentation involves the precise division of image regions corresponding to each object within a specific label. For instance, in an image featuring multiple individuals, the "people" label is segmented into distinct categories for each person, such as person 1, person 2, person 3, person 4, and person 5.

Image Segmentation has many applications in medicine, autonomous vehicles, and satellite image processing

Image Segmentation algorithms play a crucial role in medicine by aiding doctors in the diagnosis of tumors from X-ray images. This technology not only identifies the location of tumors but also provides valuable information about their shape, enhancing diagnostic accuracy.

Autonomous vehicles rely on constant perception, processing, and planning to navigate dynamic environments safely. To ensure safety and precision in decision-making, these systems must accurately identify various objects in traffic, including pedestrians, traffic lights, signs, road markings, and other vehicles.

Figure 2.7: Zürich’s (Switzerland) Street after Instance Segmentation [5]

Satellite image processing involves the continuous collection of images of the Earth's surface by orbiting satellites. Utilizing an Image Segmentation model, these satellite photos are analyzed to categorize various elements such as routes, neighborhoods, seas, and trees.

Figure 2.6: Histology Microscopy Samples and their Segmentation [7]

An automatic pesticide spraying system can significantly reduce pesticide usage in agriculture by utilizing image segmentation algorithms to differentiate between grass and crop areas. The system activates automatically when grass encroaches upon crops, ensuring targeted treatment and promoting sustainable farming practices.

Forest fire control systems utilize satellite imagery to precisely identify fire locations, enabling the issuance of large-scale warnings regarding the extent and spread of forest fires.

These are some typical applications of image segmentation; many other potential applications of Image Segmentation algorithms are being explored.

Network and Libraries

U-Net is a convolutional neural network (CNN) architecture for medical image segmentation. First introduced in 2015 by Olaf Ronneberger and colleagues in the paper "U-Net: Convolutional Networks for Biomedical Image Segmentation," U-Net has quickly become one of the most representative and influential methods in the field of medical image segmentation.

Figure 2.8: Satellite images of a property before and after zoning [8]

Figure 2.9: Medical Image Segmentation with U-Net [6]

The U-Net architecture includes two main parts: the encoder and the decoder. The encoding part uses convolutional and pooling layers to extract features and reduce their spatial size. In contrast, the decoding part uses convolutional and up-sampling layers to restore the image size and create detailed segmentation maps. A unique feature of U-Net is the long skip-connection structure between the corresponding layers in the encoding and decoding parts, which helps preserve crucial feature information through the encoding-decoding process. U-Net is designed to work well with limited data, a common challenge in the medical field, thanks to its efficient connection structure and data augmentation techniques. This makes U-Net a popular choice for many applications in medical image segmentation, from cell and tissue segmentation to segmentation of larger structures such as lungs, liver, and heart in CT and MRI scans.
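To make the encoder-decoder structure and skip connections concrete, below is a minimal U-Net-style sketch in PyTorch. The channel widths, depth, and three-class output (background, border, pet) are illustrative assumptions, not the exact configuration used in this thesis.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class MiniUNet(nn.Module):
    """Small U-Net: two encoder stages, a bottleneck, two decoder stages with skip connections."""
    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        self.enc1 = DoubleConv(in_ch, 64)
        self.enc2 = DoubleConv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = DoubleConv(128, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)            # 128 (skip) + 128 (upsampled)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)             # 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, num_classes, 1)   # per-pixel class logits

    def forward(self, x):
        s1 = self.enc1(x)                           # encoder: extract features
        s2 = self.enc2(self.pool(s1))
        b = self.bottleneck(self.pool(s2))
        d2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))   # decoder: restore size, reuse skips
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)                        # (N, num_classes, H, W)

# Quick shape check on a dummy RGB image
model = MiniUNet()
logits = model(torch.randn(1, 3, 256, 256))
print(logits.shape)   # torch.Size([1, 3, 256, 256])
```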

Torchmetrics is a comprehensive machine learning and deep learning library built on the PyTorch framework, offering a wide range of standard metrics for evaluating and comparing deep learning model performance. It simplifies the process for users by minimizing the need for extensive code rewrites, making it suitable for both research and production environments. With robust integration capabilities, Torchmetrics supports various tasks, including classification, object detection, and segmentation, making it a versatile tool for deep learning practitioners. Additionally, its extensible design allows users to easily define custom metrics, enhancing its functionality for diverse applications.

Torchmetrics offers deep learning developers and researchers the flexibility to add new metrics tailored to their project needs. Its ease of use and seamless integration make it an essential tool in the field.
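As a sketch of how the library is typically used for this task, the snippet below evaluates a batch of predictions with TorchMetrics. The metric classes and arguments follow the library's public API in recent versions (they may differ in older releases), and the three-class setup is an assumption carried over from the pet trimap labels.

```python
import torch
from torchmetrics import Accuracy, Dice, JaccardIndex

# Assumed setup: 3 segmentation classes (background, border, pet); adjust to your labels.
num_classes = 3
iou_metric = JaccardIndex(task="multiclass", num_classes=num_classes)   # mean IoU
dice_metric = Dice(num_classes=num_classes, average="macro")            # DICE score
pixel_acc = Accuracy(task="multiclass", num_classes=num_classes)        # pixel accuracy

# Fake batch: model logits (N, C, H, W) and integer ground-truth masks (N, H, W)
logits = torch.randn(2, num_classes, 64, 64)
target = torch.randint(0, num_classes, (2, 64, 64))
preds = logits.argmax(dim=1)

print("IoU:", iou_metric(preds, target).item())
print("Dice:", dice_metric(preds, target).item())
print("Pixel accuracy:", pixel_acc(preds, target).item())
```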

The Albumentations library is a powerful Python tool for image augmentation in deep learning and computer vision, offering a wide range of techniques from basic transformations like rotation and flipping to advanced shape and color distortions. Its primary aim is to enhance the diversity of training data, thereby improving the generalization of deep learning models. Designed for speed and efficiency, Albumentations utilizes libraries like OpenCV and NumPy, making it ideal for handling large volumes of image data in tasks such as image classification, object detection, and segmentation. Additionally, its high customizability allows users to combine various augmentation techniques to suit specific project needs, establishing Albumentations as an essential resource for developers and researchers in the field.
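A minimal augmentation pipeline might look like the sketch below. The thesis mentions rotation, flipping, brightness adjustment, and cropping, but the specific transforms and parameter values here are illustrative assumptions, not its exact configuration. Passing the image and mask together keeps the labels aligned with the augmented image.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2
import numpy as np

# Illustrative augmentation pipeline for segmentation (parameters are assumptions).
train_transform = A.Compose([
    A.Resize(256, 256),
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=30, p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

# Applying the same spatial transforms to image and mask keeps labels aligned.
image = np.random.randint(0, 255, (500, 375, 3), dtype=np.uint8)
mask = np.random.randint(0, 3, (500, 375), dtype=np.uint8)
out = train_transform(image=image, mask=mask)
print(out["image"].shape, out["mask"].shape)   # (3, 256, 256) and (256, 256) tensors
```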

Dataset

The "Oxford-IIIT Pet Dataset," provided by the Visual Geometry Group at the University of Oxford, is a prominent resource for semantic segmentation research This dataset comprises 7,349 images representing 37 distinct pet breeds, including 12 cat breeds and 25 dog breeds Each breed features approximately 200 photos captured from various angles, poses, and backgrounds, making it a valuable asset for advancing image recognition and classification in the field of computer vision.

Below are some example images in the dataset and their sizes:

Figure 2.11: German shorthaired red dog (500, 333, 3) [9]

Most images in the data set have H (height) equal to 500 or W (width) equal to 500, but there are still some rare cases, as shown in Figures 2.13 and 2.14.
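For reference, recent torchvision releases ship a ready-made loader for this dataset; whether the thesis used it or downloaded and decompressed the archives manually is not stated, so the snippet below is only one convenient way to obtain the images and their trimap masks.

```python
from torchvision import datasets

# Assumes a recent torchvision (>= 0.13) that provides datasets.OxfordIIITPet.
dataset = datasets.OxfordIIITPet(
    root="data",
    split="trainval",
    target_types="segmentation",   # returns the trimap mask (pet / border / background)
    download=True,
)

image, mask = dataset[0]           # PIL images: RGB photo and its trimap
print(image.size, mask.size)       # PIL reports (width, height) for both
```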

Performance metrics

The Intersection over Union (IoU) metric, also referred to as the Jaccard index, measures the percentage of overlap between a target mask and its predicted output. This metric is akin to the Dice coefficient, which is commonly utilized as a loss function in training processes.

In essence, the IoU metric calculates the ratio of the pixels common to both the target and prediction masks to the total number of pixels present in either mask (their union).

IoU = (target ∩ prediction) / (target ∪ prediction) (1)

Considering the ground truth labeled mask, let's calculate the IoU score of the following prediction

Figure 2.14: Illustrative examples of the intersection (A ∩ B) and the union (A ∪ B) in human pictures [10]

The intersection (A ∩ B) encompasses the pixels present in both the prediction and ground truth masks, while the union (A ∪ B) comprises all pixels found in either the prediction mask or the ground truth mask.

For a semantic segmentation prediction, the IoU is calculated for each class separately and then averaged over all classes to provide a global mean IoU score.
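As a minimal sketch (assuming integer label maps and three classes, matching the trimap setup), per-class IoU and its mean can be computed directly in PyTorch:

```python
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int = 3) -> float:
    """Per-class IoU = |pred ∩ target| / |pred ∪ target|, averaged over classes.

    pred and target are integer label maps of shape (H, W) or (N, H, W).
    """
    ious = []
    for cls in range(num_classes):
        pred_cls = pred == cls
        target_cls = target == cls
        intersection = (pred_cls & target_cls).sum().item()
        union = (pred_cls | target_cls).sum().item()
        if union == 0:            # class absent from both masks: skip it
            continue
        ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")

# Toy example with 3 classes (0 = background, 1 = border, 2 = pet)
pred = torch.randint(0, 3, (128, 128))
target = torch.randint(0, 3, (128, 128))
print(f"mean IoU: {mean_iou(pred, target):.3f}")
```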

Pixel accuracy is an alternative approach to evaluating semantic segmentation, focusing on the percentage of correctly identified pixels within an image. This metric can be reported for specific classes as well as aggregated to provide an overall assessment across all categories.

In evaluating pixel accuracy for each class, we analyze a binary mask. A true positive is a pixel accurately predicted to be part of a specific class (based on the reference mask).

In contrast, a true negative is a pixel correctly recognized as not part of that class

Pixel accuracy is computed as:

Pixel Accuracy = (TP + TN) / (TP + TN + FP + FN) (2)

Here's an explanation of the components within the formula:

• TP (True Positives): The number of positive instances correctly predicted as positive

• TN (True Negatives): The number of negative instances correctly predicted as negative

• FP (False Positives): The number of negative instances incorrectly predicted as positive

• FN (False Negatives): The number of positive instances incorrectly predicted as negative

This metric can sometimes produce misleading results, particularly when the representation of a class within an image is minimal. This occurs because the metric often skews towards highlighting the success in identifying negative cases, where the class is not present.
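A minimal sketch of overall pixel accuracy, assuming integer label maps for the prediction and the ground truth:

```python
import torch

def pixel_accuracy(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Fraction of pixels whose predicted class equals the ground-truth class."""
    correct = (pred == target).sum().item()
    total = target.numel()
    return correct / total

pred = torch.randint(0, 3, (128, 128))
target = torch.randint(0, 3, (128, 128))
print(f"pixel accuracy: {pixel_accuracy(pred, target):.4f}")
```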

Figure 2.15: Visual example of the DICE score evaluation metrics [11]

The DICE score is a widely used metric in computer vision and medical imaging for evaluating the overlap between two sets. It is particularly effective in applications such as image segmentation and object detection, measuring how accurately the predicted areas, like object edges, align with their actual positions in an image.

The DICE score, also called the Sørensen–Dice coefficient, is derived from the following formula:

DICE = 2|A ∩ B| / (|A| + |B|) (3)

In this equation, |A ∩ B| represents the size of the intersection between the two sets.

In image analysis, A and B represent the predicted and ground truth regions, respectively, with |A| and |B| indicating the sizes of these sets. The DICE score, a similarity measure ranging from 0 to 1, is calculated by taking twice the intersection of the two sets and dividing it by the sum of their sizes. This score is commonly used to compare predicted segmentations against ground truth in binary classification tasks.

A DICE score of 0 indicates a complete lack of overlap between the compared sets, meaning that the predicted segmentation or classification does not share any common elements with the ground truth data.

A DICE score of 1 signifies perfect similarity between two sets, indicating complete overlap. This outcome reflects an ideal scenario where the predicted segmentation or classification aligns perfectly with the ground truth, showing no discrepancies.

A higher DICE score reflects improved accuracy in segmentation tasks, making it a vital metric for evaluating models that delineate object shapes and boundaries in images. This score is particularly significant in medical imaging applications, such as tumor identification, cell detection, and lesion localization, where precise object demarcation is crucial.
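A per-class DICE computation, as a minimal sketch over integer label maps (the convention of returning 1.0 when a class is absent from both masks is an assumption):

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, cls: int) -> float:
    """DICE = 2|A ∩ B| / (|A| + |B|) for one class, given integer label maps."""
    pred_cls = (pred == cls).float()
    target_cls = (target == cls).float()
    intersection = (pred_cls * target_cls).sum()
    denom = pred_cls.sum() + target_cls.sum()
    if denom == 0:                 # class absent from both masks
        return 1.0
    return (2.0 * intersection / denom).item()

pred = torch.randint(0, 3, (128, 128))
target = torch.randint(0, 3, (128, 128))
print(f"Dice for pet class (2): {dice_score(pred, target, cls=2):.3f}")
```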

Diagnosing a cat's color by classifying their eyes as either yellow or orange can lead to misleading accuracy metrics. For instance, if a model predicts all 100 cats as yellow, despite there being ten orange cats, it would appear to have an impressive accuracy of 90%. However, this high accuracy is deceptive, as it fails to account for the misclassification of important cases, similar to missing ten cancer diagnoses in a medical context. Therefore, relying solely on accuracy is insufficient; alternative evaluation methods, such as the Confusion Matrix, are essential for a more reliable assessment in classification tasks.

A confusion matrix is an essential tool for assessing the performance of classification models, as it evaluates both accuracy and coverage metrics for each class. It comprises four key indices that provide insights into the predictions made for every classification category.

Figure 2.16: Confusion matrix with 2 class labels [12]

In the context of cat color classification, we can illustrate four key indicators using a binary classification system. Here, the orange class is identified as Positive, while the yellow class is classified as Negative.

• TP (True Positive): The number of correct positive predictions. This is when the model correctly predicts an orange cat.

• TN (True Negative): The number of correct negative predictions. This is when the model correctly predicts a yellow cat, meaning that not choosing the orange case is correct.

• FP (False Positive - Type 1 Error): The number of false positive predictions. This is when the model predicts an orange cat, but that cat is yellow.

• FN (False Negative - Type 2 Error): The number of missed positive predictions. This is when the model predicts a yellow cat, but the cat is actually orange.

To assess the reliability of a model, we focus on two key indicators:

a) Precision: Of all the cases predicted as Positive, how many are actually Positive? This index is calculated according to the formula:

Precision = TP / (TP + FP) (4)

b) Recall: Of all the actual Positive cases, how many were predicted correctly? This index is calculated according to the formula:

Recall = TP / (TP + FN) (5)

In a cat color classification scenario with a dataset of 100 cats, consisting of 90 yellow (Negative) and 10 orange (Positive), the model accurately predicts 2 out of the 10 orange cats, achieving a Precision of 1. However, it fails to identify the remaining 8 orange cats, resulting in a Recall of only 0.2. To assess the model's overall reliability, both Precision and Recall are combined into a single metric known as the F1-score, the harmonic mean of the two (F1 = 2 × Precision × Recall / (Precision + Recall)), which provides a more comprehensive evaluation of the model's performance.
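Reproducing that toy example numerically (the counts below are taken from the text; the arithmetic is a sketch, not output from the thesis):

```python
# Cat-colour example: 100 cats, 10 orange (Positive); the model flags 2 cats
# as orange and both of those predictions are correct.
TP, FP = 2, 0          # orange cats correctly / incorrectly flagged as orange
FN = 8                 # orange cats missed (predicted yellow)
TN = 90                # yellow cats correctly left alone

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.92 precision=1.00 recall=0.20 f1=0.33
```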

Model Design and Methodology

Requirements

The system requires an RGB color image of a single pet, either a dog or a cat, with dimensions not exceeding 500 pixels in height and width. In cases where the image includes two animals, the output will focus on segmenting only one area, as the system is unable to distinguish between dogs and cats. This initial semantic segmentation of the single pet serves as the foundation for behavioral recognition, which is essential for the subsequent training of pets.

Block diagram

• Original pet RGB image block: Includes one RGB pet image in the dataset that meets the system requirements.

• Trained model block: The model has learned the features from the data and is ready for prediction tasks.

• Results block: Contains the original image, the ground truth image, and the predicted image.

To better understand where to get a trained model, here is the process of its formation:

RGB image → Trained model → Results

Figure 3.2: Training model block diagram

• Dataset block: Collects pictures and labels to train the model. The dataset includes the necessary information for the model to learn and recognize objects.

• Data Preprocessing & Augmentation block: Processes the raw data: it reads data from the source and applies data augmentation techniques, including rotation, flipping, and brightness adjustments. These methods generate diverse versions of the original dataset, allowing the model to learn more effectively.

• Data Loader block: Loads data from the pre-processed dataset and divides it into smaller batches for training. This approach not only optimizes the training process but also ensures effective memory management.

• Training block: The phase where the model learns from the input data provided by the Data Loader. During this stage, the model fine-tunes its parameters in response to the data and associated labels, enhancing its ability to make accurate predictions.

In addition to the Trained model block, the Training block also produces two results:

• Training logs: During the training process, key information such as accuracy and loss function values is logged to evaluate the model's performance and monitor its learning progress.

• Deep learning model: The structure in which the learning process takes place; it may be an artificial neural network, a convolutional neural network (CNN), or a recurrent neural network (RNN).
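Putting these blocks together, a minimal training-loop sketch in PyTorch is shown below. The stand-in network, the random tensors used as data, and the hyperparameters are all assumptions for illustration; in the actual design the U-Net model described in Section 3.4 would take the model's place.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Random tensors standing in for the pre-processed pet dataset:
# images (N, 3, H, W) and integer masks (N, H, W) with classes 0-2.
images = torch.randn(16, 3, 64, 64)
masks = torch.randint(0, 3, (16, 64, 64))
loader = DataLoader(TensorDataset(images, masks), batch_size=4, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Tiny stand-in network producing per-pixel class logits; a U-Net would be used in practice.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 1),
).to(device)
criterion = nn.CrossEntropyLoss()                      # per-pixel classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):                                 # a couple of epochs just to show the loop
    running_loss, correct, total = 0.0, 0, 0
    for batch_images, batch_masks in loader:
        batch_images = batch_images.to(device)
        batch_masks = batch_masks.to(device)
        optimizer.zero_grad()
        logits = model(batch_images)                   # (B, C, H, W)
        loss = criterion(logits, batch_masks)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * batch_images.size(0)
        correct += (logits.argmax(dim=1) == batch_masks).sum().item()
        total += batch_masks.numel()

    print(f"epoch {epoch}: loss={running_loss / len(loader.dataset):.4f} "
          f"pixel_acc={correct / total:.4f}")
```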

U-net

The U-Net adopted for this design is the encoder-decoder architecture with skip connections described in Section 2.3.1: the encoder extracts and down-samples image features, while the decoder up-samples them back to the input resolution to produce the segmentation map.


Flowchart

Figure 3.4: Flowchart of the system

The workflow for semantic segmentation of pet images using PyTorch begins by checking the allocated GPU to assess the potential for hardware acceleration during model training. Following this, essential libraries like PyTorch and torchvision are installed to support the process.

The dog and cat image datasets are then downloaded and decompressed in preparation for processing. Important libraries are imported into the source code, including PyTorch, torchvision, numpy, and matplotlib.

A custom dataset class is created to efficiently handle data loading, labels, and essential transformations. To enhance model robustness, data augmentation techniques like rotation, flipping, brightness adjustment, and random cropping are utilized to generate diverse variations of the original dataset. Additionally, a function is implemented to convert images from their normalized state back to their original form, facilitating straightforward display and evaluation.
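A sketch of such a dataset class and a de-normalization helper is shown below; the directory layout, file naming, and Albumentations-style transform interface are assumptions, since the thesis code itself is only available in the linked notebook.

```python
import os
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class PetSegmentationDataset(Dataset):
    """Pairs each pet image with its segmentation mask (assumed file layout)."""
    def __init__(self, image_dir, mask_dir, transform=None):
        self.image_dir, self.mask_dir = image_dir, mask_dir
        self.names = sorted(os.listdir(image_dir))
        self.transform = transform            # e.g. an Albumentations pipeline

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = np.array(Image.open(os.path.join(self.image_dir, name)).convert("RGB"))
        mask = np.array(Image.open(os.path.join(self.mask_dir, name.replace(".jpg", ".png"))))
        if self.transform is not None:
            out = self.transform(image=image, mask=mask)
            image, mask = out["image"], out["mask"].long()
        return image, mask

# Helper that undoes ImageNet-style normalization so images can be displayed.
def denormalize(tensor, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    mean = torch.tensor(mean).view(3, 1, 1)
    std = torch.tensor(std).view(3, 1, 1)
    return (tensor * std + mean).clamp(0, 1)
```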

Before training the model, a pair of original and corresponding segmentation images is checked to ensure the data and labels match correctly. Then, the basic U-Net model is defined; its architecture incorporates convolution, pooling, and upsampling layers. To track essential metrics, an AverageMeter tool is implemented, which records and averages accuracy and loss function values throughout the training and testing phases. Additionally, a dedicated function is designed to compute the model's accuracy.
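The AverageMeter mentioned here is a common small utility; a typical implementation (assumed, not copied from the thesis) looks like this:

```python
class AverageMeter:
    """Keeps a running average of a metric (e.g. loss or accuracy) over batches."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.sum += value * n     # value is the batch average, n the batch size
        self.count += n

    @property
    def average(self):
        return self.sum / self.count if self.count else 0.0

# Usage: loss_meter = AverageMeter(); loss_meter.update(loss.item(), batch_size)
```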

The training process involves preparing essential components such as dataloaders, loss functions, optimizers, and various training parameters. During this phase, the model is trained on the dataset, enabling it to learn image segmentation through multiple epochs.

Following training, the model undergoes evaluation using a test dataset to assess its performance. The segmentation results on various test images are then presented to visualize and evaluate the quality of the model's segmentation before concluding the process.

Results and Evaluation

Scenarios

In my project, the system identifies a single cat in an image, meeting the system's criteria. The dataset included various cat images, each featuring one cat in diverse poses and shooting styles. Below are examples of the results I gathered after the training process and my subsequent evaluation.

Based on Figure 4.2, let's break down and evaluate the segmentation results in detail:

The original photo (a) shows a cat lying on a tree, with the surrounding objects left in the background.

The ground truth mask clearly illustrates the anticipated segmentation, effectively outlining the cat against its background. This binary mask features the cat's area highlighted in white, while the background is represented in black.

The model's predicted mask (c) illustrates the cat outlined against its background, but the prediction notably diverges from the actual ground truth, with even the cat's head absent from the result.

• DICE score: This high value demonstrates a significant overlap between the predicted and actual masks, indicating strong segmentation accuracy.

• IoU: A fairly high value that further validates that the predicted mask closely aligns with the ground truth.

• Accuracy: Indicates that 94.21% of the total pixels are correctly classified.

In Figure 4.3, the cat is depicted in a standing, pouncing pose in the original image (a). While the ground truth (b) demonstrates effective segmentation, the prediction (c) reveals some detail loss, particularly in features like the tail and whiskers.

• DICE score: This substantial value indicates a strong alignment between the predicted and actual masks, showcasing high segmentation accuracy.

• IoU: A high value that further confirms that the predicted mask closely matches the ground truth.

• Accuracy: Indicates that 95.92% of all pixels have been accurately classified.

Just like with cats, my project is designed to identify only one dog per image. When two or more dogs are present in a picture, the system segments them as one region, because semantic segmentation divides the image based on class and pixels rather than on individual instances.

Based on the information provided in Figure 4.4, let's analyze and assess the segmentation results thoroughly:

The original photo (a) shows a bulldog lying down.

The ground truth mask effectively highlights the dog against its background, using a binary format where the dog's area is represented in white and the background in black.

The model-generated mask (c) outlines the dog against its background, but it reveals some issues, such as the absence of the dog's rear and a blending of background details with the dog's head. Additionally, the front legs of the dog are not distinctly separated, as illustrated in (b).

• DICE score: This high value indicates a significant overlap between the predicted and ground truth masks, demonstrating excellent segmentation performance.

• IoU: An exceptionally high value that further affirms that the predicted mask closely aligns with the ground truth.

• Accuracy: Indicates that 91.46% of the total pixels are correctly classified.

The results depicted in Figure 4.5 showcase a moment of a running bull terrier, as evident in the original image (a). The semantic segmentation results reveal that the ground truth (b) and the predicted mask (c) are quite similar; however, the ground truth (b) accurately captures the dog's collar and clearly separates its legs, while these details are absent in the predicted mask (c), which also fails to include the dog's tail.

• DICE score: This high value indicates a significant overlap between the predicted and ground truth masks, demonstrating excellent segmentation performance.

• IoU: A relatively high value that further affirms that the predicted mask closely aligns with the ground truth.

• Accuracy: Indicates that 96.82% of the total pixels are correctly classified.

My project is currently unable to differentiate between two distinct animals within the same frame; it can only identify the background and the pet. Enhancing this capability is a key focus for further development, which will be discussed in Section 5.2, Further Works.

Conclusion and further works

Conclusion

This project involved the development and assessment of a straightforward U-Net-based model for semantic segmentation of pet images, utilizing the PyTorch framework. We implemented various evaluation metrics, such as Intersection over Union (IoU), pixel accuracy, and Dice score, to effectively measure the model's performance.

Before training the model, we curated a dataset of pet images with pixel-level annotations. The data preprocessing phase included resizing, normalization, and data augmentation techniques to improve the model's generalization and stability.

The U-Net model was selected and optimized for semantic segmentation tasks, with careful tuning of hyperparameters including the loss function, optimizer, and learning rate to ensure peak performance on the test dataset.

In our model evaluation, we assessed performance using Intersection over Union (IoU), pixel accuracy, and the Dice score. The Dice score is a key metric in image segmentation, quantifying the overlap between predicted and actual masks. By incorporating both IoU and the Dice score, we achieved a thorough understanding of the model's segmentation capabilities.

The project showcased effective segmentation of pet regions in images, highlighting the potential for future enhancements. While a basic model was employed, the findings and methods can be further developed by investigating more sophisticated models, fine-tuning parameters, and incorporating advanced techniques to boost accuracy and overall performance.

Overall, this project provided a solid base for researching and applying deep learning methods to practical challenges in computer vision and image analysis.

Further Works

• Change the network architecture: adopt a more suitable architecture for higher accuracy, such as ResU-Net, PSPNet, or a feature pyramid network. From there, the problem can be extended to segmenting each animal in a frame separately, which is called "instance segmentation."

• Use the segmentation results as a basis for identifying pet behavior from recorded footage, which can be applied in many fields, such as training domestic pets or circus animals.

References

[1] R. L. and D. He, "Semantic Segmentation Based on Deeplabv3+ and Attention Mechanism," 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 2021, pp. 255-259.

[2] X. Li, Z. Zhao, and Q. Wang, "ABSSNet: Attention-Based Spatial Segmentation Network for Traffic Scene Understanding," IEEE Transactions on Cybernetics, vol. 52, no. 9, pp. 9352-9362, Sept. 2022, doi: 10.1109/TCYB.2021.3050558.

[3] A. Rosebrock, "Instance segmentation with OpenCV," [Online]. Available: https://pyimagesearch.com/2018/11/26/instance-segmentation-with-opencv/

[4] A. Thakur, "Image Segmentation Using Keras and Weights & Biases," [Online]. Available: https://wandb.ai/ayush-thakur/image-segmentation/reports/Image-Segmentation-Using-Keras-and-Weights-Biases--VmlldzoyNTE1Njc

[5] Y. Xu, Y. Li, M. Liu, Y. Wang, M. Lai, and E. Chang, "Gland Instance Segmentation by Deep Multichannel Side Supervision," 2016.

[6] A. Chen and C. Asawa, "Going beyond the bounding box with semantic segmentation," [Online]. Available: https://thegradient.pub/semantic-segmentation/

[7] G. Chhor and C. Bartolome Aramburu, "Satellite Image Segmentation for Building Detection using U-net," 2017.

[8] Fritz, "Deep Learning for Image Segmentation: U-Net Architecture," [Online]. Available: https://fritz.ai/deep-learning-for-image-segmentation-u-net-architecture/

[9] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, "The Oxford-IIIT Pet Dataset," [Online]. Available: https://www.robots.ox.ac.uk/~vgg/data/pets/

[10] "The Difference Between Dice and Dice Loss," [Online]. Available: https://pycad.co/the-difference-between-dice-and-dice-loss/

[11] N. S. Chauhan, "Model Evaluation Metrics in Machine Learning," [Online]. Available: https://aiplanet.com/blog/model-evaluation-metrics-in-machine-learning/

[12] S. K. Koshike, "Semantic Segmentation of Lung Cancer Using Custom U-Net Model," 2023.

APPENDIX

https://colab.research.google.com/drive/1YT7S3PU5MRK8FGCUVHNyWpErNAUB6Ful?usp=sharing
