Graduation project using AI and Raman spectroscopy to measure glucose

DOCUMENT INFORMATION

Basic information

Title: Graduation Project Using AI and Raman Spectroscopy to Measure Glucose
Author: Ngo Quang Truong
Supervisor: PhD. Nguyen Thanh Tung
Institution: Vietnam National University, Hanoi International School
Major: Informatics and Computer Engineering
Type: Graduation project
Year of publication: 2023
City: Hanoi
Pages: 62
File size: 2.42 MB

Structure

  • Chapter 1 Overview
  • Chapter 2 Theory
    • 2.1. Diabetes
      • 2.1.1. Introduction of Diabetes
      • 2.1.2. Types of Diabetes
      • 2.1.3. Blood sugar concentration classification table
    • 2.2. Artificial Intelligence
      • 2.2.1. Introduction of Artificial Intelligence
      • 2.2.2. History of Artificial Intelligence in medicine
      • 2.2.3. Machine learning – neural networks and deep learning
      • 2.2.4. Convolutional Neural Networks
      • 2.2.5. Learning Methods
      • 2.2.6. Training problem
      • 2.2.7. Accuracy, Loss, Validation Accuracy, and Validation Loss
      • 2.2.8. The future of AI in healthcare
    • 2.3. Raman Spectroscopy
      • 2.3.1. Introduction of Raman scattering
      • 2.3.2. Application of Raman scattering in healthcare
  • Chapter 3 Results
    • 3.1. Input data
    • 3.2. Training model
      • 3.2.1. Type of model
      • 3.2.2. Language and built-in library
      • 3.2.3. Functions and Variables
      • 3.2.4. Prepare data for training
      • 3.2.5. Model’s architecture
    • 3.3. Training result
      • 3.3.1. Accuracy, Loss, Precision, Recall and F1-Score
      • 3.3.2. Results after training
    • 3.4. Source code
  • Chapter 4 Discussion and Evaluation
  • Chapter 5 Conclusion

Content

Overview

Insulin is a vital hormone that regulates blood sugar levels and maintains metabolic balance in the body. Diabetes occurs when the body cannot effectively use insulin or fails to produce enough of it. This condition often leads to hyperglycemia, marked by high blood glucose levels, which can cause significant damage to various bodily systems, especially blood vessels and nerves.

Global data from the Institute for Health Metrics and Evaluation indicates a significant rise in diabetes cases, with numbers soaring from 108 million in 1980 to 422 million in 2014, nearly quadrupling in about 35 years. This alarming trend is particularly evident in low- and middle-income countries, driven by increasing obesity rates and a widespread decline in physical activity.

Statistics indicate that, as of 2014, 8.5% of adults aged 18 and over have diabetes, which directly causes approximately 1.5 million deaths, nearly half of them in people under 70 years old. Diabetes also contributes to a further 460,000 deaths from kidney disease, and roughly 1 in 5 deaths from cardiovascular disease can be attributed to diabetes (Institute for Health Metrics and Evaluation, 2019).

Between 2000 and 2019, the global age-standardized mortality rate from diabetes rose by 3%, with lower-middle-income countries experiencing a significant 13% increase. Conversely, the likelihood of death from major non-communicable diseases decreased by 22% worldwide during the same period.

As of 2021, the International Diabetes Federation (IDF) reports that approximately 537 million people worldwide have diabetes, representing 1 in 10 adults aged 20 to 79. Additionally, 1 in 6 infants is affected by diabetes during fetal development, and nearly 50% of adults with diabetes are undiagnosed, according to the Institute for Health Metrics and Evaluation (2019).

For individuals aged 30 to 70, the global risk of dying from the four major noncommunicable diseases (cancer, chronic respiratory diseases, diabetes, and cardiovascular diseases) decreased by 22% between 2000 and 2019.

In particular, up to 50% of adults have undiagnosed diabetes (MINISTRY OF HEALTH,

In Vietnam, nearly 5 million people are living with diabetes, and over 55% of them face complications. A 2021 survey by the Ministry of Health puts the prevalence of diabetes among adults at approximately 7.1%. Alarmingly, only 35% of these cases have been diagnosed, and even fewer, 23.3%, are receiving proper management and treatment. Projections from the International Diabetes Federation indicate a rapid rise in diabetes cases both in Vietnam and worldwide.

Traditional diabetes detection methods, while accurate, are invasive and come with drawbacks such as high costs, lengthy wait times for results, discomfort, and the risk of blood-borne diseases. In response to these limitations, non-invasive testing methods are gaining popularity. My project uses artificial intelligence to analyze Raman spectra, aiming to advance non-invasive glucose measurement as a viable alternative to conventional techniques. As a student at the International School, Hanoi National University, I am excited to contribute to this important healthcare challenge.

Theory

Diabetes

Diabetes is a chronic condition that occurs when the body struggles to effectively use insulin produced by the pancreas or does not produce enough insulin to maintain healthy blood sugar levels.

Glucose (C6H12O6) is a vital monosaccharide and a key player in conditions like hyperglycemia and diabetes. Plants and most algae produce glucose through photosynthesis, using sunlight, water, and CO2. This sugar is crucial for energy metabolism in all living organisms, as it is used in cellular respiration to generate energy. Plants store glucose mainly as starch and use it to build cellulose, while animals store it as glycogen. D-glucose occurs naturally, whereas L-glucose is produced artificially in smaller amounts and is less significant.

Figure 2.1 Haworth projection of α-D-glucopyranose (Wikipedia, 2023)

The liver serves as the body's primary glucose reserve, crucial for balancing energy sources. It synthesizes and stores glucose based on the body's needs, with key hormones like insulin and glucagon regulating its storage and release.

During meals, increased insulin and decreased glucagon levels lead the body to store glucose as glycogen. In contrast, during fasting periods, such as overnight or between meals, the liver converts glycogen back into glucose through glycogenolysis. Additionally, the liver can perform gluconeogenesis, synthesizing glucose from waste products, lipid byproducts, and amino acids.

Figure 2.2 Glucose production by liver during fasting conditions (Gluconeogenesis and Glycogenolysis) (University of

Insulin is a crucial hormone produced by the pancreas that regulates blood glucose levels and facilitates glucose storage in the liver, fat, and muscles, while also managing the metabolism of carbohydrates, fats, and proteins. When the body cannot produce sufficient insulin or use it effectively, glucose accumulates in the bloodstream, leading to difficulties in fat production and making glucose unavailable to cells. Inadequate insulin levels can result in blood sugar imbalances and increase the risk of diabetes, which may lead to severe complications, including damage to the eyes, kidneys, and nerves, heart disease, and certain cancers.

Diabetes is categorized into four main types: type 1, type 2, gestational, and other specific types, with most diagnosed individuals falling into the first two categories. Historically, diabetes was referred to by various names, such as juvenile-onset, adult-onset, ketosis-prone, and insulin-dependent, which reflects how difficult it is to classify the disease accurately from patient characteristics. This complexity has led to the simplified type 1 and type 2 nomenclature. Gestational diabetes, as its name indicates, is diagnosed during pregnancy, while other types encompass a range of specific conditions.

Secondary or specific types of diabetes have many causes, such as genetics, pancreatic diseases, and endocrine diseases. Each type is described below (Cowie CC, 2018).

Type 1 diabetes accounts for about 5% of diabetes cases in the United States and is primarily caused by an autoimmune attack on pancreatic beta cells, leading to significant insulin deficiency. Autoantibodies produced by B cells that target islet antigens are important indicators of the disease and may play a pathogenic role, although T cells are the main drivers of beta-cell destruction. In clinical trials, detection of these autoantibodies is often required to diagnose type 1 diabetes. Nonimmune factors may also contribute to beta cell loss, with fulminant diabetes, which has a rapid onset and severe symptoms, being notably more common in Asian populations.

Type 2 diabetes, which accounts for about 90 to 95 percent of diabetes cases worldwide, is primarily caused by inadequate insulin production and insulin resistance, often linked to obesity. This condition is characterized by relative insulin deficiency, where the amount of insulin secreted is insufficient to overcome the level of insulin resistance. Although the exact cause of the reduced insulin secretion is not fully understood, it is typically associated with metabolic factors rather than autoimmune processes.

Gestational diabetes is a pregnancy-related condition affecting 3% to 9% of expectant mothers, depending on the diagnostic criteria used. It typically occurs when the body cannot produce enough insulin to manage the increased insulin resistance that develops during the second and third trimesters of pregnancy. While gestational diabetes usually resolves after childbirth, women diagnosed with this condition are at significant risk (over 50%) of developing permanent type 2 diabetes later in life due to underlying beta cell issues.

2.1.2.4 Secondary or Other Specific Types of Diabetes

The fourth category of diabetes includes secondary or specific types, which involve a range of conditions such as exocrine pancreatic diseases, endocrinopathies, infectious and immune-mediated disorders, and rare genetic disorders linked to diabetes. This classification also covers monogenic deficiencies in beta cell function and genetic abnormalities that impact insulin action. Advances in genetic research continue to enhance our understanding of these complex diabetes types.

As understanding of the underpinnings of diabetes becomes more refined, this list is likely to expand with the discovery of more precise genetic causes.

2.1.3 Blood sugar concentration classification table

Figure 2.3 Blood sugar concentration classification table (Vinmec, 2023)

Artificial Intelligence

Artificial intelligence (AI) refers to the capability of machines to exhibit intelligence, as distinct from human and animal intelligence. John McCarthy defines AI as the science and engineering of creating intelligent machines, particularly computer programs. While AI is connected to the pursuit of understanding human intelligence, it is not limited to biologically observable methods.

Alan Turing's 1950 publication "Computing Machinery and Intelligence" marks the start of the artificial intelligence discourse, well before the term was coined. Often hailed as the "father of computer science," Turing raises the pivotal question, "Can machines think?" and introduces the now-famous "Turing Test," in which a human interrogator tries to distinguish between computer-generated and human-written text. Despite facing considerable criticism, the Turing Test, rooted in language concepts, remains a crucial milestone in the evolution of artificial intelligence and continues to inspire philosophical discussions.

"Artificial Intelligence: A Modern Approach" by Stuart Russell and Peter Norvig significantly influences AI discussions, presenting four key definitions of artificial intelligence. The authors categorize computer systems along two axes: whether they think or act, and whether they do so like humans or rationally.

In recent years, artificial intelligence has seen significant advancements, becoming increasingly important in scientific and business sectors. The launch of OpenAI's ChatGPT represents a key milestone, heralding a new and dynamic era in AI evolution.

2.2.2 History of Artificial Intelligence in medicine

Vivek Kaul, Sarah Enslin, and Seth A. Gross note in their Gastrointestinal Endoscopy article that the limitations of early AI models restricted their acceptance in medicine. The advent of deep learning has addressed many of these challenges, drawing on advanced computing power, complex algorithms, and self-learning capabilities to transform the medical field.

The integration of AI in clinical practice through risk assessment models aims to improve diagnostic accuracy and enhance workflow efficiency. This advancement, particularly the application of deep learning in medicine, marks a significant transformation, providing new opportunities to address past challenges and greatly enhance AI's capabilities in healthcare.

2.2.3 Machine learning – neural networks and deep learning

Machine learning, a key aspect of artificial intelligence and computer science, uses statistical methods to create models that learn from data, improving their accuracy through training. According to IBM (2023) and Kalakota (2019), this technology emulates human learning, progressively refining its performance over time.

Recent data from Cross River Therapy reveals that around 77% of businesses have engaged with artificial intelligence, with 35% actively using it and 42% still researching its integration. Since 2000, the number of AI-related startups has increased 14 times, highlighting the rapid expansion of AI technology. Notably, China and India are at the forefront of this trend, with adoption rates of nearly 60%.

Machine learning is essential in data science, as it enables algorithms to predict, classify, and extract valuable insights in data mining projects, ultimately informing business decisions. The surge in big data has led to a heightened need for data scientists capable of identifying key business questions and sourcing the information required to answer them.

In the healthcare sector, traditional machine learning is widely used in precision medicine to predict the most effective treatment plans based on patient characteristics and treatment scenarios. These applications primarily use supervised learning, which relies on labeled datasets and predefined outcome variables.

Neural networks, a key aspect of machine learning, have been used in healthcare research for many years, especially for patient classification tasks. Advanced machine learning techniques, including deep learning and neural network models, are at the forefront of this innovation.

Deep learning, characterized by its multiple layers, plays a crucial role in the medical field, particularly in radiology. It is used to detect potentially malignant tumors in X-ray images and to uncover significant features in radiological data that may not be visible to the human eye. The application of deep learning in cancer-focused image analysis demonstrates promising advances in diagnostic accuracy, surpassing earlier technologies such as computer-aided detection (CAD).

In this project, I used convolutional neural networks (CNNs), a prevalent type of deep neural network, to assess diabetes. The term "convolutional" refers to a mathematical operation involving matrices. A CNN comprises various layers, including convolutional, non-linearity, pooling, and fully connected layers. While pooling and non-linearity layers do not have parameters, convolutional and fully connected layers do. CNNs excel in machine learning tasks, particularly in computer vision, where they achieve remarkable results on large image classification datasets like ImageNet, and they have also been applied to natural language processing (NLP) (Saad Albawi, 2017).

Neural networks consist of three types of layers, which are:

The input layer is the initial stage of the model, responsible for receiving the input data. The number of neurons in this layer corresponds to the total number of features in the dataset, or the number of pixels for image data.

The hidden layer in a neural network processes input from the input layer, and multiple hidden layers may be used depending on the complexity and volume of the data. Each hidden layer generally contains more neurons than the total number of input objects. The output of each layer is determined by multiplying the output matrix from the preceding layer by the layer's learnable weights, adding the learnable biases, and applying an activation function, which is crucial for enabling nonlinearity in the network.
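The layer computation described above (multiply by weights, add biases, apply an activation) can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the weight values, and the choice of ReLU as the activation are assumptions made for the example, not details taken from the project.

```python
import numpy as np

def dense_forward(x, weights, biases):
    """Output of one fully connected layer: ReLU(x @ weights + biases)."""
    z = x @ weights + biases       # weighted sum of the previous layer plus bias
    return np.maximum(z, 0.0)      # ReLU activation introduces the nonlinearity

# Hypothetical sizes: 2 samples, 3 input features, 4 hidden neurons.
x = np.array([[1.0, 2.0, 3.0],
              [0.5, -1.0, 0.0]])
weights = np.full((3, 4), 0.1)
biases = np.array([0.0, 0.1, -1.0, 0.2])

hidden = dense_forward(x, weights, biases)
print(hidden.shape)  # one row per sample, one column per hidden neuron
```

Without the activation, stacking several such layers would collapse into a single linear transformation, which is why the nonlinearity matters.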

Raman Spectroscopy

Raman spectroscopy is named after the Indian physicist Sir Chandrasekhara Venkata Raman, who won the Nobel Prize in Physics in 1930, and is rooted in his contributions to the study of light scattering. Born on November 7, 1888, in the former Madras Province, Raman made a crucial observation in 1921 while traveling in Europe, where he noted the unique blue hues of the Mediterranean Sea and glaciers. This sparked his curiosity, leading him to conduct experiments with monochromatic light from a mercury arc lamp to analyze the spectrum of transparent materials, ultimately discovering what are now known as Raman lines.

On March 16, 1928, during a scientific conference in Bangalore, C. V. Raman presented his groundbreaking findings, which were initially met with skepticism because some physicists found it difficult to replicate his results. However, Peter Pringsheim became the first German scientist to successfully reproduce Raman's work, helping to dispel doubts and coining the terms "Raman effect" and "Raman lines" for the scientific community.

Raman spectroscopy is a non-destructive analytical technique that uses scattered light to analyze the vibrational energy modes of a sample. C. V. Raman, together with his research partner K. S. Krishnan, first observed Raman scattering in 1928 (Horiba Scientific, 2022).

Figure 2.13 Sir CV Raman (Wikipedia, 2023)

Figure 2.14 Scheme of Raman scattering (University of Tartu, 2023)

Raman spectroscopy involves the scattering of incident light from a high-intensity laser. Most of the scattered light (Rayleigh scattering) retains the same wavelength as the laser and offers limited information. However, a tiny fraction of this light (approximately 0.0000001%) scatters at different wavelengths, a phenomenon known as Raman scattering, which provides valuable insights into the sample's chemical structure and specific properties.
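The difference between the laser wavelength and the scattered wavelength is conventionally reported as a Raman shift in wavenumbers (cm^-1). A minimal sketch of that conversion is given below. The function name and the 785 nm and 850 nm wavelengths are hypothetical values chosen for illustration, not instrument settings from this project.

```python
def raman_shift_cm1(excitation_nm, scattered_nm):
    """Raman shift in wavenumbers (cm^-1) from two wavelengths given in nm."""
    # 1 nm = 1e-7 cm, so (1 / wavelength_nm) * 1e7 is a wavenumber in cm^-1.
    return 1e7 * (1.0 / excitation_nm - 1.0 / scattered_nm)

# Rayleigh scattering: same wavelength as the laser, so the shift is zero.
rayleigh = raman_shift_cm1(785.0, 785.0)

# Stokes Raman scattering: the scattered light has a longer wavelength.
stokes = raman_shift_cm1(785.0, 850.0)

print(rayleigh, round(stokes, 1))
```

Expressing spectra as shifts rather than absolute wavelengths makes them comparable across instruments that use different excitation lasers.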

2.3.2 Application of Raman scattering in healthcare

Raman spectroscopy is an effective tool for early cancer detection, as it analyzes the molecular composition of tissues to identify subtle biochemical changes linked to cancerous growth.

Raman scattering analyzes the chemical composition of cells and tissues by interacting with the vibrational modes of molecular bonds. This technique allows for noninvasive and label-free detection of alterations in the molecular fingerprints of cells or tissues affected by disease transformation.

Raman spectroscopy is a valuable tool in the field of infectious diseases, aiding in the identification and characterization of microorganisms to enhance diagnostic processes. Viruses, frequently described as "organisms at the edge of life," have been linked to various human pandemics. Traditional methods for viral detection often require skilled labor, rely on culture-based techniques, and can be time-consuming. Although polymerase chain reaction (PCR) is the gold standard for detection, it presents challenges in resource-limited settings.

Raman spectroscopy has emerged as a powerful tool for detecting viral infections in biological samples, particularly from bodily fluids. This technique captures nanoscale biochemical signatures associated with viral presence. Various methods, including surface-enhanced Raman spectroscopy (SERS), Raman tweezers, tip-enhanced Raman spectroscopy (TERS), and coherent anti-Stokes Raman scattering (CARS), facilitate the identification of viral components, immune responses, and biomolecular changes in body fluids. SERS, in particular, has shown promise for human viral detection, highlighting the significant potential of Raman spectroscopy in virology.

Raman spectroscopy is an essential tool in the pharmaceutical industry, enabling the assessment of drug quality by analyzing molecular composition. This technique is crucial for confirming the presence of active ingredients and identifying potential contaminants, thereby ensuring the safety and efficacy of pharmaceutical products.

Raman spectroscopy has become an integral tool in industrial pharmaceutical manufacturing, enhancing various stages of the drug product life cycle and value chain. Its applications range from drug discovery in laboratories to production under good manufacturing practice (GMP) conditions, facilitating real-time measurements that improve active pharmaceutical ingredient (API) reaction analytics, release testing, and statistical process control. This technology is particularly beneficial for innovative manufacturing concepts like flow and continuous manufacturing, which prioritize continuous real-time quality assurance. This approach aligns with the US Food and Drug Administration's (FDA) principles of process analytical technology (PAT), which emphasize the importance of promptly measuring critical process parameters to ensure quality attributes are maintained throughout the manufacturing process.

The pharmaceutical industry is experiencing a rising demand for rapid post-market testing procedures to confirm the identity, safety, and efficacy of drug products. As the industry advances, it faces complex challenges, particularly in quality assurance for biological drugs and non-biological complex drugs (NBCD), including nanomaterials. These challenges stem from the sophisticated structures and innovative production processes that often involve more steps than traditional drug delivery systems. Chemometrics-assisted Raman spectroscopy is expected to be pivotal in overcoming these challenges, especially in addressing the analytical gaps for heterogeneous sample matrices and complex biologicals, such as the mRNA vaccine technology developed for the SARS-CoV-2 pandemic.

Future challenges in the pharmaceutical industry will necessitate a focus on follow-on products, commonly referred to as "bio- and nano-similar" products, as well as combination therapies and the advancement of personalized medicine. These developments present new production challenges and are actively being discussed in international working groups.

Traditional intracellular imaging methods like electron microscopy, cryo-electron microscopy, and immunofluorescence microscopy are invasive and often damage the cells due to fixation, freezing, or the use of dyes. To address these drawbacks, researchers have developed label-free and nondestructive imaging techniques for biochemical studies, including coherent anti-Stokes Raman (CARS) microscopy, multiphoton microscopy, and confocal Raman microscopy.

The confocal Raman system, built on a standard confocal light microscope, has a resolution bounded by the diffraction limit. Recent improvements in acquisition times allow real-time visualization of cellular compartments in living cells without fixation or drying. Traditionally, high-resolution Raman imaging employed near-infrared laser excitation to reduce tissue damage and minimize autofluorescence in biological samples. However, ongoing research is expanding the usable spectral range into the visible spectrum.

Living cells and microorganisms possess unique Raman spectra that serve as fingerprint-like signatures, crucial for accurately identifying different species and analyzing their physiological and metabolic responses to environmental stress. Advanced in situ Raman imaging techniques, supported by highly sensitive instruments, convert these spectral signatures into detailed snapshots, allowing for the visualization of molecular species and specific physiological reactions.

Results

Input data

To carry out my project, simulating blood sugar levels became essential; however, I faced challenges due to the lack of authentic human-derived data and my limited ability to process raw data. Additionally, Raman signals from the human body are vulnerable to interference from noise and extraneous signals from skin, muscles, and bones. Acquiring data from human volunteers requires permissions and evaluations by medical professionals, so I adopted a pragmatic approach and used a diluted glucose solution as a temporary measure until a more effective way of handling real human data could be developed.

Raman spectroscopy has been widely recognized for its ability to quantitatively distinguish substances, particularly in the analysis of molecular structures. As noted by M. J. Pelletier, while it is often associated with qualitative analysis, there have been over 50 years of documented applications for quantitative analysis in the technical literature. Typically, analyte concentrations can range from 0.1% to 100%, with accurate measurements achievable even below 100 ppm in optimal conditions. This highlights the method's effectiveness in determining the concentration of substances, such as sugar in distilled water solutions.

The training dataset consists of 50 samples divided into 10 classes based on the concentration of dissolved sugar, with each class labeled from 0 to 9 and represented by five samples. Each signal measurement is standardized to a length of 2048 units. The data are organized into two CSV files: one containing the signal data obtained from a Raman signal meter and the other holding the corresponding sequential labels. Visual representations of the data structure and example signal extracts from the dataset are provided below.

Figure 3.1 Raman signal in data source file

Figure 3.2 Raman signal's label in data source file (part 1)

Figure 3.3 Raman signal's label in data source file (part 2)

Figure 3.4 The graph of a sample signal

Training model

Before building a machine learning model, it's essential to choose the right model type based on the data and an understanding of the problem. I decided to use convolutional neural networks for this project due to their significant advantages. Commonly used for image recognition and classification, these models are also highly adaptable to one-dimensional data, as supported by both instructional resources and my own research.

The model's versatility makes it suitable for various problem domains, and it is particularly well suited to processing one-dimensional data such as the Raman signals used here.

3.2.2 Language and built-in library

In this project, I used key AI development libraries, including NumPy for array and matrix processing, Pandas for efficient data handling, and Matplotlib for data visualization. A crucial component of this toolkit is TensorFlow, which serves as a foundational library for building machine learning model architectures.

From the Keras.callbacks module, I imported EarlyStopping to reduce overfitting and save computational resources by halting model training when the monitored performance stops improving. This prevents overlearning and limits the memorization of noise in the data. Additionally, I used TensorBoard for tracking and storing training results, which aids in fine-tuning model accuracy. Lastly, I incorporated the Adam optimizer, a popular algorithm for training neural networks.
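The stopping rule behind early stopping can be illustrated in plain Python. This is a simplified sketch of the idea, not the Keras implementation; the monitored quantity (validation loss), the `patience` value, and the loss history are all assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")   # best validation loss seen so far
    wait = 0              # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation-loss history: improves, then plateaus.
history = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_at = early_stopping_epoch(history, patience=3)
print(stop_at)
```

A larger `patience` tolerates longer plateaus before stopping, at the cost of extra training epochs.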

Keras.regularizers offers the L2 module, which adds a loss component based on the sum of squared weights to the model. This term helps keep weight magnitudes in check, reducing the risk of overfitting, especially in cases where the model might otherwise learn complex and large weight values.
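As a small worked illustration of that penalty, the sketch below computes a factor times the sum of squared weights. The factor of 0.01 and the weight values are hypothetical and are not the settings used in this project.

```python
def l2_penalty(weights, factor=0.01):
    """L2 regularization term: factor * sum of squared weights."""
    return factor * sum(w * w for w in weights)

# The same weight pattern at two different scales (hypothetical values).
small = l2_penalty([0.1, -0.2, 0.3])
large = l2_penalty([1.0, -2.0, 3.0])

print(small, large)
```

Scaling the weights by 10 scales the penalty by 100, so the optimizer is pushed much harder away from large weights than from small ones.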

In constructing the architecture of my project with the Keras.layers module, I used several essential layers. The Input layer defines the shape of the input data at the start of the model. The Dense layer is a fully connected layer that links every neuron to all neurons in the previous layer and is frequently used in hidden layers. For processing sequence data, such as text or time series, I used the Conv1D layer, which extracts features from the input through filters. Lastly, the Flatten layer transforms multi-dimensional tensor data into a 1D vector and is typically placed between the convolutional and fully connected layers to prepare the data for further processing.

To monitor machine learning models effectively, it is crucial to track accuracy and loss metrics after each training session. I developed a dedicated function using the Matplotlib library to visualize these outcomes. As shown in Figure 3.6, the function generates two sets of graphs: one displaying accuracy alongside validation accuracy (val_accuracy), and the other showing loss alongside validation loss (val_loss).

In machine learning, it is essential to pay close attention to all four key parameters: accuracy, loss, validation accuracy, and validation loss. Focusing only on accuracy and loss can lead to significant oversights, because these metrics are measured on the training data and do not show how well the model generalizes to data it has not seen. A comprehensive evaluation of all four parameters is therefore vital for achieving good results.
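To make the roles of the convolutional and flattening steps concrete, the sketch below applies two hand-written filters to a toy signal with NumPy and then flattens the result. It mimics the idea behind Conv1D and Flatten rather than calling Keras, and the signal, filter values, and sizes are hypothetical; in a real network the filter values are learned.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide one filter over a 1D signal (no padding, stride 1)."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

signal = np.array([0.0, 1.0, 4.0, 1.0, 0.0, 0.0])   # toy spectrum with one peak
kernels = [np.array([-1.0, 2.0, -1.0]),              # responds strongly to peaks
           np.array([1.0, 1.0, 1.0]) / 3.0]          # local average

# One column per filter, analogous to the channel axis of a Conv1D output.
feature_maps = np.stack([conv1d_valid(signal, k) for k in kernels], axis=-1)

# Flatten: collapse (positions, filters) into one vector for a dense layer.
flat = feature_maps.reshape(-1)

print(feature_maps.shape, flat.shape)
```

The first filter gives its largest response where the window is centered on the peak, which is the kind of local feature a trained convolutional layer can pick up in a spectrum.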

Figure 3.6 Graphing functions include accuracy, loss, validation loss, and validation accuracy

To improve the training process, I implemented the shuffle_data function to randomize the dataset prior to training. This step is crucial when the training and testing data are contained within the same file, and it is particularly relevant when using the validation_split option, which will be explained in the following section. Keras takes the validation portion from the last samples of the arrays it is given, so data left in class order would keep some classes out of training entirely.

This function takes two data series: the Raman signal data from my own measurements and their corresponding labels. It creates a new_index list holding the positions of the labels, which is then shuffled randomly. Both the Raman data series and the label series are reordered according to the shuffled new_index values, which randomizes the dataset while keeping each signal paired with its label. The function returns three lists: the reordered data, the reordered labels, and the new positions of each data point.

Figure 3.7 below shows the function in detail.
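Based only on the description above, the behaviour of such a function could be sketched as follows. This is a reconstruction for illustration, not the code from Figure 3.7; the `seed` parameter and the toy data are additions made here so that the example is reproducible.

```python
import random

def shuffle_data(data, labels, seed=None):
    """Reorder data and labels with the same random permutation."""
    rng = random.Random(seed)              # seed is an assumption, for repeatability
    new_index = list(range(len(labels)))   # original positions 0..n-1
    rng.shuffle(new_index)                 # random permutation of the positions
    shuffled_data = [data[i] for i in new_index]
    shuffled_labels = [labels[i] for i in new_index]
    return shuffled_data, shuffled_labels, new_index

# Toy stand-in for the Raman signals and their class labels.
data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
labels = [0, 0, 1, 1]

new_data, new_labels, order = shuffle_data(data, labels, seed=42)
print(order)
```

Because both lists are indexed with the same permutation, every signal stays paired with its label, and the returned order makes it possible to trace a shuffled sample back to its original row.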

In the initial phase of training, I acquired data from two CSV files and defined relevant variables and weights. Using the Pandas library, I extracted data from "Tenlabels_RamanData," containing the signal data, and "Tenlabels_labels," containing the label data, storing them in the "data" and "target" lists, respectively, as illustrated in Figure 3.8.

Figure 3.8 Get data from files

Subsequently, I employed the Shuffle_data function, discussed earlier, to randomize the dataset before each training iteration. The outputs of this function, namely "after_suffle_data," "after_suffle_label," and "after_suffle_order," were then saved, and the values of "data" and "target" were updated after conversion from lists to arrays.

Figure 3.9 Shuffle the data using the Shuffle_data function

To configure individual data dimensions, I defined three crucial variables: "sample_size", "time_steps", and "input_dimension". The "sample_size" and "time_steps" are the number of samples and the data length, respectively, and are obtained using the "shape" attribute. In this case, there are 50 samples, and the data length is 2048. The "input_dimension" is specified as 1, indicating that each point of a spectrum carries a single value.

Figure 3.10 "sample_size" and "time_steps" value

To comply with TensorFlow's requirements, I utilized the "reshape" function to add a dimension to the "data" list. Additionally, I implemented one-hot encoding on the class labels in the "target" list using the "keras.utils.to_categorical" function, converting the class labels into vectors whose length matches the total number of classes and simplifying the processing of label data. Figures 3.12 and 3.13 visually depict the "data" and "target" lists before and after reshaping.

Figure 3.11 Reshape "data" and "target"

Figure 3.12 Shape of "data" list and the "target" list before reshaping

Figure 3.13 Shape of "data" list and the "target" list after reshaping
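To make the two transformations concrete, here is a dependency-free sketch of what they do to the data. It is a stand-in for the "reshape" and "keras.utils.to_categorical" calls used in the project, not their implementation; the helper names `add_channel_dimension` and `to_one_hot` and the toy values are assumptions.

```python
def add_channel_dimension(samples):
    """Turn shape (sample_size, time_steps) into (sample_size, time_steps, 1)."""
    return [[[value] for value in sample] for sample in samples]


def to_one_hot(class_ids, num_classes):
    """Map each integer class id to a vector with a single 1.0 entry."""
    vectors = []
    for class_id in class_ids:
        if not 0 <= class_id < num_classes:
            raise ValueError(f"class id {class_id} outside 0..{num_classes - 1}")
        row = [0.0] * num_classes
        row[class_id] = 1.0
        vectors.append(row)
    return vectors


# Toy example: 2 samples of length 3 and 3 classes
# (the project itself uses 50 samples of length 2048).
toy_data = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_target = [2, 0]

reshaped = add_channel_dimension(toy_data)
encoded = to_one_hot(toy_target, num_classes=3)
```

After these steps each sample is a column of single-value time steps, and each label is a vector with one entry per class.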

To effectively monitor and analyze my model's performance, I utilized TensorBoard from the TensorFlow library, setting the path for the data storage directory. TensorBoard provides valuable visualizations, including charts and graphs that illustrate key metrics like loss function and accuracy, along with detailed model structure diagrams.

Figure 3.14 Define and set up the TensorBoard function

Next, I proceed to define and set up more important parameters. Figure 3.15 below is one of the code snippets I used in this project that demonstrates them.

Figure 3.15 Define settings for the machine learning model

The loss function is a crucial component in machine learning, particularly for classification tasks. In my project, focused on classifying and predicting samples based on sugar concentration in solutions, I opted for the "CategoricalCrossentropy" loss function. This function is specifically designed for multiclass classification problems, especially when the label data has been encoded using one-hot encoding. It is part of the "tf.keras.losses" module in TensorFlow, which offers a variety of built-in loss functions for different applications.

Proceeding to the definition of crucial parameters, the loss function was specified as "CategoricalCrossentropy", which suits multiclass classification tasks with one-hot encoded labels. For optimization, the Adam algorithm is employed with a learning rate of 0.0001. An appropriately set learning rate improves model performance and speeds up learning, whereas an excessively high rate may cause overshooting and a very low rate can slow the learning process considerably.
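As an illustration of what this loss measures, the sketch below computes categorical cross-entropy for a single sample in plain Python. It is not TensorFlow's implementation; the clipping constant `eps` and the example probabilities are assumptions chosen for numerical safety and readability.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    if len(one_hot_target) != len(predicted_probs):
        raise ValueError("target and prediction must have the same length")
    loss = 0.0
    for t, p in zip(one_hot_target, predicted_probs):
        # Clip so that log(0) can never occur (eps is an assumed constant).
        p = min(max(p, eps), 1.0 - eps)
        loss -= t * math.log(p)
    return loss


target = [0.0, 1.0, 0.0]  # true class is index 1

# High probability on the true class gives a small loss ...
confident = categorical_crossentropy(target, [0.05, 0.90, 0.05])
# ... low probability on the true class gives a larger one.
unsure = categorical_crossentropy(target, [0.40, 0.30, 0.30])
```

With a one-hot target only the true class contributes, so the loss reduces to the negative logarithm of the probability assigned to that class.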

The "EarlyStopping" callback was integrated into the model training process and configured to halt training when a stopping condition is met. Key parameters include "monitor='val_loss'", specifying the metric to monitor, "patience=50", determining the number of epochs without improvement before stopping, and "restore_best_weights=True", to retain the model state with the best performance.

Training result

3.3.1 Accuracy, Loss, Precision, Recall and F1-Score

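Average accuracy is the mean of the accuracy values recorded during training or evaluation, offering a single overview of a model's performance across a session. The average accuracy and average validation accuracy are computed as follows:

$$\text{average accuracy} = \frac{1}{N} \sum_{i=1}^{N} \text{accuracy}_i, \qquad \text{average val\_accuracy} = \frac{1}{N} \sum_{i=1}^{N} \text{val\_accuracy}_i$$

where $N$ is the number of epochs in the training session and the subscript $i$ denotes the value recorded at epoch $i$.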

When the "EarlyStopping" function is not activated, a training session consists of 200 epochs, during which the model's accuracy is calculated for each epoch. After completing all iterations, the per-epoch accuracies are summed and divided by 200 to obtain the average accuracy per epoch. The same calculation is used for "val_accuracy". This structured approach allows for a detailed evaluation of the model's performance throughout the training period, providing insights into its learning dynamics.

In machine learning, the "loss" value quantifies the difference between predicted outputs and actual targets, with the primary objective during training being to minimize this loss for more accurate predictions. The calculation of loss varies based on the task type, such as classification or regression, and relies on a specific loss function. The model's loss calculation during training proceeds through the following sequential steps.

- Step 1: Forward Pass: Starting with the input data or a batch of it, the model executes a forward pass through the neural network. For each individual input, the model generates a predicted output.

- Step 2: Loss Calculation: The predicted output is compared to the actual target values using a loss function, which quantifies the dissimilarity between the predicted values and the ground truth.

- Step 3: Backward Pass (Backpropagation): The gradients of the loss with respect to the model's parameters are computed. This involves calculating how much each parameter contributed to the error.

- Step 4: Gradient Descent: The model's weights and biases are updated in the direction that reduces the loss.

- Step 5: Iterative Process: Steps 1-4 are repeated for multiple iterations (epochs) until the model converges to a state where the loss is minimized.
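The five steps can be traced in a deliberately small example. The sketch below fits a one-variable linear model with plain gradient descent on mean squared error; it is an assumption-laden toy, not the project's network, which is trained by TensorFlow with the Adam optimizer and a cross-entropy loss.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=500):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    loss_history = []
    for _ in range(epochs):  # Step 5: repeat for many epochs
        # Step 1: forward pass, one prediction per input.
        preds = [w * x + b for x in xs]

        # Step 2: loss calculation (mean squared error).
        errors = [p - y for p, y in zip(preds, ys)]
        loss = sum(e * e for e in errors) / n
        loss_history.append(loss)

        # Step 3: backward pass, gradients of the loss w.r.t. w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n

        # Step 4: gradient descent update.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss_history


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, loss_history = train_linear_model(xs, ys)
```

The recorded `loss_history` plays the same role as the "loss" curve plotted for the real model: it should fall as the parameters approach values that fit the data.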

The loss function serves as a crucial metric for evaluating model performance, with lower loss values indicating a closer match between predictions and the actual outcomes. Conversely, higher loss values highlight the necessity for refining the model's parameters to improve accuracy.

Precision is a crucial metric in classification problems, quantifying the accuracy of positive predictions. It is defined as the ratio of true positive predictions (TP) to the total number of predicted positives, which includes both true positives and false positives (FP). The formula for calculating precision is:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall evaluates a classification model's effectiveness in identifying all relevant instances within a dataset. It is calculated as the ratio of True Positives (correctly predicted positive instances) to the total of True Positives and False Negatives (FN, positive instances incorrectly predicted as negative). The formula for calculating recall is:

$$\text{Recall} = \frac{TP}{TP + FN}$$

High recall indicates that the model is effective at capturing relevant instances, but it may also produce more false positives.

F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is particularly useful when there is an uneven class distribution.

It is calculated using the following formula:

$$\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

F1-Score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 means either precision or recall (or both) is zero.
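The three definitions above can be checked with a short calculation. The following sketch computes them from raw counts; the function name and the counts (4 true positives, 0 false positives, 1 false negative) are hypothetical values chosen for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1-score from raw counts.

    A metric whose denominator is zero is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical class with 5 samples: 4 found, 1 missed, no false alarms.
p, r, f1 = precision_recall_f1(tp=4, fp=0, fn=1)
```

Here precision is perfect because nothing was wrongly assigned to the class, while the single missed sample lowers recall and therefore the F1-score.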

To enhance the reliability of the model's accuracy, multiple training sessions were conducted to derive an average accuracy value, effectively reducing the impact of random noise This approach involved training the model on diverse datasets, minimizing reliance on a specific data split and decreasing the risk of bias from randomness By utilizing various data subsets, the model demonstrated improved generalization to new data, which was essential given the limited dataset Furthermore, the iterative training method allowed for a more precise estimation of the model's performance The results obtained after five training sessions are illustrated in Figures 3.20 to 3.24.

The results of the training sessions, illustrated in the accompanying figures, show the accuracy, loss, and validation loss of the developed model. The final five optimal trials consistently achieved an average training accuracy of around 85.17%, with closely aligned values for both training loss and validation loss. As previously emphasized, these variations in performance metrics offer valuable insights to developers, guiding the formulation of an effective optimization strategy.

Figures 3.25 and 3.26 provide a detailed graphical representation of the model's performance following the fifth training iteration. The training accuracy reaches 92%, showing minimal fluctuations across iterations, while the validation accuracy remains lower at around 70%. Both training and validation loss curves display a consistent pattern with slight fluctuations, highlighting the model's stability and robustness.

Figure 3.25 One of the graphs represents accuracy and validation accuracy

Figure 3.26 One of the graphs represents loss and validation loss

In a multiclass classification problem, such as the one presented here, the computation of Precision, Recall, and F1-Score becomes more intricate than in the binary case, because each metric has to be computed separately for every label.

To address this, I utilized the `classification_report` function from the Scikit-learn library, which quickly generates the key metrics for each label. The results, which include evaluations from both the training and validation datasets, are provided below, offering a thorough assessment of the entire dataset.

Figure 3.27 Precision, Recall and F1-Score
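To show how per-label values of this kind arise, the sketch below counts true positives, false positives, and false negatives for each label in a one-vs-rest fashion and then averages them. It is a dependency-free illustration, not Scikit-learn's implementation; the function names and the toy labels are assumptions and do not reproduce the project's predictions.

```python
def per_class_report(y_true, y_pred, labels):
    """Per-label precision, recall, F1 and support (one-vs-rest counting)."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true samples with this label
        }
    return report


def macro_average(report, metric):
    """Unweighted mean of one metric over all labels."""
    return sum(row[metric] for row in report.values()) / len(report)


# Toy three-class example with one mistake (a true 1 predicted as 2).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
report = per_class_report(y_true, y_pred, labels=[0, 1, 2])
macro_f1 = macro_average(report, "f1")
```

A single misclassification lowers the recall of the true label and the precision of the predicted label, which is the same pattern visible when two labels in a report share a precision below 1 while two others share a recall below 1.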

The image reports Precision, Recall, F1-Score, and Support for each label, with each label represented by five samples. Eight labels (0, 1, 4, 5, 6, 7, 8, and 9) achieve a perfect Precision of 1, while labels 2 and 3 have a Precision of 0.83. For Recall, eight labels (0, 2, 3, 5, 6, 7, 8, and 9) reach a value of 1, whereas labels 1 and 4 have a Recall of 0.8. Regarding F1-Score, six labels (0, 5, 6, 7, 8, and 9) have a perfect score of 1, whereas labels 1, 2, 3, and 4 achieve F1-Score values of 0.89, 0.91, 0.91, and 0.89, respectively.

In summary, the Precision attains 96%, and both macro average (macro avg) and weighted average (weighted avg) also achieve 96%, offering a comprehensive overview of the model's performance across all classes.

Source code

Posted: 28/02/2025, 22:52
