
DOCUMENT INFORMATION

Basic information

Title: Application of AI and Raman Spectroscopy for Non-Invasive Diabetes Diagnosis
Author: Nguyen Minh Tuan
Supervisor: Assoc. Prof. Dr. Nguyen Thanh Tung
University: Vietnam National University, Hanoi
Major: Informatics and Computer Engineering
Document type: Master thesis
Year of publication: 2025
City: Hanoi
Number of pages: 95
File size: 2.88 MB


Structure

  • CHAPTER 1: INTRODUCTION
    • 1.1 Rationale
    • 1.2 Aim and Objectives of the Study
    • 1.3 Research Questions
    • 1.4 Methods of the Study
    • 1.5 Scope of the Study
    • 1.6 Significance of the Study
    • 1.7 Structure of the Study
  • CHAPTER 2: LITERATURE REVIEW
    • 2.1 Diabetes
      • 2.1.1 Introduction of Diabetes
      • 2.1.2 Types of Diabetes
        • 2.1.2.1 Type 1 Diabetes
        • 2.1.2.2 Type 2 Diabetes
        • 2.1.2.3 Gestational Diabetes
        • 2.1.2.4 Secondary or Other Specific Types of Diabetes
    • 2.2 Substance of diabetes mellitus
    • 2.3 Traditional invasive methods for type 2 diabetes mellitus detection
      • 2.3.1 Point of care Hemoglobin A1C Test (POCT)
      • 2.3.2 The fasting plasma glucose (FPG) test
      • 2.3.3 The oral glucose tolerance test (OGTT)
      • 2.3.4 Real Time Continuous Glucose Monitoring
    • 2.4 Artificial Intelligence
      • 2.4.1 Definition and Concepts
      • 2.4.2 AI in Medical Diagnostics
      • 2.4.3 Applications in Diagnostics
    • 2.5 AI Algorithms
      • 2.5.1 Support Vector Machines (SVM)
      • 2.5.2 ExtraTreesClassifier
  • CHAPTER 3: METHODOLOGY
    • 3.1 Integrated Development Environment
      • 3.1.1 Jupyter Notebook
      • 3.1.2 Google Colab
      • 3.1.3 JetBrains Datalore
      • 3.1.4 StandardScaler
      • 3.1.5 Penalized poly
    • 3.2 Data acquisition
    • 3.3 Raman spectroscopy
    • 3.4 Fluorescence background subtraction
    • 3.5 Machine Learning model selection
    • 3.6 Experimental setups
    • 3.7 Polynomial Fitting Method for baseline determination
    • 3.8 Jupyter setup
    • 3.9 Data denoising
    • 3.10 Data shuffle
  • CHAPTER 4: FINDINGS AND DISCUSSIONS
    • 4.1 Results with SVM
      • 4.1.1 SVM configurations
      • 4.1.2 Accuracy of the unprocessed data
      • 4.1.3 Accuracy of the processed data
      • 4.1.4 Adjustments and improvements
      • 4.1.5 IMP
    • 4.2 Results with Extra Trees Classifier
      • 4.2.1 Accuracy of the unprocessed data
      • 4.2.2 Accuracy with penalized polynomial regression
      • 4.2.3 Adjustments and improvement
      • 4.2.4 Using IMP
    • 4.3 Results with 1D-CNN
      • 4.3.2 Accuracy with Unprocessed data
    • 4.4 Results
  • CHAPTER 5: CONCLUSION
    • 5.1 Concluding Remarks
    • 5.2 Limitations
    • 5.3 Recommendations


INTRODUCTION

Rationale

Diabetes is a chronic disease characterized by the body's inability to produce sufficient insulin or efficiently utilize the insulin it produces, leading to elevated blood glucose levels. Recognized worldwide as one of the four major non-communicable diseases (NCDs), diabetes has seen a significant increase in both prevalence and incidence over recent decades. Addressing diabetes is essential to reducing the global burden of NCDs and improving public health outcomes.

Early detection of diabetes is possible through affordable blood sugar monitoring methods. Glucometer devices offer a widely accessible and user-friendly way to perform invasive blood glucose testing using blood samples analyzed in vitro. These devices are suitable for use in both hospital and home environments, utilizing an electrochemical approach that requires a finger prick or lancet to obtain blood. However, many individuals with diabetes find frequent blood sampling uncomfortable and inconvenient, highlighting the need for alternative, more patient-friendly solutions.

Non-invasive blood glucose monitoring methods provide a promising, cost-effective, and innovative alternative for diabetes diagnosis. Numerous studies have demonstrated encouraging results in accurately identifying diabetes using these techniques. Many approaches utilize optical sensors to measure glucose levels without invasive procedures, ensuring patient comfort. Importantly, these methods prevent harm or injury to human tissues, making them a safe and patient-friendly diagnostic solution.

Aim and Objectives of the Study

In the realm of artificial intelligence, non-invasive approaches play a crucial role by enabling the generation of intermediate data for distillation and correlation analysis. This data serves as the foundation for diabetic classification, providing a less invasive alternative to traditional diagnostic methods. Briganti et al. [8] have highlighted the transformative potential of AI-driven medical technologies, emphasizing their emergence as indispensable therapeutic solutions. These advancements underscore the application of AI techniques to preventive diagnostics, paving the way for early intervention and improved patient outcomes.

The primary aim of this study is to develop a non-invasive diagnostic method for diabetes by integrating AI algorithms with Raman Spectroscopy. The specific objectives of the study are:

• To review existing literature on the use of AI and Raman Spectroscopy in medical diagnostics.

• To select and implement appropriate AI algorithms for analyzing Raman Spectroscopy data.

• To collect and preprocess data using Raman Spectroscopy.

• To develop and validate a diagnostic model for diabetes using the integrated AI and Raman Spectroscopy approach.

• To evaluate the effectiveness and accuracy of the developed diagnostic tool.

Research Questions

The study seeks to answer the following research questions:

1. How can AI be integrated with Raman Spectroscopy to detect diabetes biomarkers non-invasively?

2. What are the key advantages and challenges of using this integrated approach for diabetes diagnosis?

3. How effective is the developed diagnostic tool in identifying diabetic conditions compared to traditional methods?

Methods of the Study

The study employs a combination of literature review, experimental data collection, and algorithm development. The methodology includes:

• Conducting a comprehensive literature review to identify relevant studies and technologies

• Selecting suitable AI algorithms for data analysis

• Acquiring Raman Spectroscopy equipment and collecting data from biological samples

• Preprocessing the collected data to ensure its quality and suitability for analysis

• Developing and validating the diagnostic model using machine learning techniques

• Analyzing the results to assess the accuracy and effectiveness of the diagnostic tool.

Scope of the Study

This study focuses on the integration of AI and Raman Spectroscopy for non-invasive diabetes diagnosis. It includes:

• A detailed review of relevant literature and technologies

• Experimental data collection and analysis using Raman Spectroscopy

• Development and validation of an AI-based diagnostic model

• Evaluation of the developed diagnostic tool's effectiveness.

Significance of the Study

This study is significant because it has the potential to revolutionize diabetes diagnosis by offering a non-invasive, accurate, and efficient diagnostic method. Its findings could improve patient care, minimize discomfort associated with traditional testing, and make diabetes detection more accessible. Furthermore, the research advances the growing field of AI and Raman Spectroscopy applications in medical diagnostics, paving the way for innovative healthcare solutions.

Structure of the Study

This thesis is structured as follows:

• Chapter 1: Introduction - Provides an overview of the research, including its rationale, aims, objectives, research questions, methods, scope, significance, and structure.

• Chapter 2: Literature Review - Summarizes existing research on AI and Raman Spectroscopy in medical diagnostics, highlighting gaps and opportunities for further study.

• Chapter 3: Methodology - Describes the research design, data collection methods, and analytical techniques used in the study.

• Chapter 4: Findings and Discussions - Presents the findings of the study, including data analysis and interpretation, discusses their implications, and compares them with existing literature.

• Chapter 5: Conclusion - Summarizes the key points of the research, reflects on the overall progress and achievements, and suggests directions for future research.

LITERATURE REVIEW

Diabetes

Diabetes is a chronic condition characterized by the body's inability to effectively use insulin or produce enough insulin to regulate blood sugar levels. Glucose, the most common monosaccharide, plays a vital role in hyperglycemia and diabetes management. Through photosynthesis, plants and algae produce glucose from water and CO2 using sunlight, highlighting its importance in energy metabolism. In living organisms, glucose is essential for cellular respiration, generating energy necessary for various bodily functions. Plants store glucose as cellulose and starch, while animals store it as glycogen. Naturally occurring D-glucose is key to metabolic processes, whereas L-glucose is artificially produced and less significant in biological functions.

Figure 1: Haworth projections (James Ashenhurst, 2024)

The liver is essential for maintaining stable blood glucose levels by acting as a glucose reservoir, synthesizing and storing glucose as glycogen in response to hormonal signals such as insulin and glucagon. After meals, high insulin and low glucagon levels prompt the liver to store excess glucose as glycogen, while during fasting periods it converts glycogen back into glucose through glycogenolysis. Additionally, the liver produces new glucose via gluconeogenesis, synthesizing it from waste products, lipid byproducts, and amino acids. This dual functionality ensures a continuous supply of glucose, providing energy to the body even during fasting and supporting overall energy balance and metabolic health.

Diabetes can be categorized into three main types: type 1, type 2, and gestational diabetes. Most patients are diagnosed with either type 1 or type 2 diabetes.

Figure 2: Types of Diabetes (North Dakota Department of Health and Human Services)

2.1.2.1 Type 1 Diabetes

Type 1 Diabetes is a chronic autoimmune condition in which the immune system mistakenly attacks and destroys the insulin-producing beta cells in the pancreas. As a result, the body cannot produce insulin, a vital hormone for regulating blood sugar levels, leading to elevated glucose in the bloodstream. It often develops during childhood or adolescence but can occur at any age, with symptoms including frequent urination, excessive thirst, extreme hunger, weight loss, fatigue, and blurry vision. Individuals with type 1 diabetes require daily insulin therapy through injections or an insulin pump to effectively manage blood sugar levels. The exact cause remains unknown but is believed to result from a combination of genetic predisposition and environmental triggers such as viral infections. Though less common than type 2 diabetes, proper management—including blood sugar monitoring, a healthy diet, and regular exercise—is essential to prevent complications.

2.1.2.2 Type 2 Diabetes

Type 2 Diabetes is a chronic metabolic disorder that occurs when the body becomes resistant to insulin or when the pancreas does not produce enough insulin to regulate blood sugar levels. Unlike type 1 diabetes, type 2 often develops gradually over time and is more common in adults, though it is increasingly being diagnosed in children and adolescents. The primary risk factors for type 2 diabetes include obesity, lack of physical activity, poor diet, age, family history, and certain ethnic backgrounds. Symptoms may include increased thirst, frequent urination, fatigue, blurred vision, and slow healing of wounds, though it can sometimes be asymptomatic in its early stages. Management of type 2 diabetes typically involves lifestyle modifications such as a healthy diet, regular exercise, and weight loss. In some cases, medication or insulin therapy may be required to maintain blood sugar levels. Early diagnosis and consistent management are crucial in preventing complications like cardiovascular disease, kidney damage, and nerve damage [11].

2.1.2.3 Gestational Diabetes

Gestational diabetes is a form of diabetes that develops during pregnancy in women with no prior diagnosis of the condition, caused by the body's inability to produce sufficient insulin to meet increased pregnancy demands. This condition usually occurs in the second or third trimester and often resolves after childbirth, but careful management is essential to protect both maternal and fetal health. Key risk factors include a history of gestational diabetes in previous pregnancies, being overweight, and having a family history of diabetes. Managing gestational diabetes through diet, exercise, and medical supervision helps prevent complications and ensures healthy pregnancy outcomes.

Certain ethnic groups also have a higher susceptibility to the condition. Symptoms are often mild or absent but may include increased thirst, frequent urination, or fatigue in some cases. Effective management involves maintaining a healthy diet, engaging in regular physical activity, and monitoring blood sugar levels. Depending on the severity, insulin therapy or medication may be necessary. Although gestational diabetes typically resolves after childbirth, women who experience it face a higher risk of developing type 2 diabetes later in life, making regular follow-up essential for early detection and management.

2.1.2.4 Secondary or Other Specific Types of Diabetes

Secondary or other specific types of diabetes are caused by underlying medical conditions, genetic factors, or external influences, rather than the typical mechanisms seen in type 1, type 2, or gestational diabetes. Although these forms are less common, understanding them is crucial for accurate diagnosis and effective management of the condition.

• Genetic Defects of Beta Cell Function: Known as monogenic diabetes, this group includes conditions like Maturity-Onset Diabetes of the Young (MODY), caused by single-gene mutations that impair insulin production.

• Pancreatic Disorders: Diseases such as pancreatitis, cystic fibrosis, or pancreatic cancer can damage the pancreas, reducing insulin production and leading to diabetes

• Endocrine Disorders: Conditions like Cushing's syndrome or acromegaly can cause elevated levels of hormones like cortisol or growth hormone, which counteract insulin's effects and result in diabetes

• Drug or Chemical-Induced Diabetes: Certain medications, such as corticosteroids or antipsychotics, or exposure to toxins, can impair glucose metabolism and induce diabetes

• Infections and Other Conditions: Rare infections or autoimmune diseases can trigger diabetes by affecting insulin production or action

• Genetic Syndromes: Some syndromes, including Down syndrome, Turner syndrome, and Klinefelter syndrome, have a higher prevalence of diabetes.

Management of secondary diabetes typically involves addressing the root cause, such as treating the underlying condition or adjusting medications, alongside standard diabetes care.

Substance of diabetes mellitus

Type 2 Diabetes Mellitus (T2DM) is the most common form of diabetes, accounting for 90–95% of cases, and is characterized by insulin resistance and a relative deficiency of insulin. Unlike other types, individuals with T2DM often do not initially require insulin treatment, highlighting the progressive nature of the disease.

The exact causes of this type of diabetes remain unclear, but it is unlikely to involve autoimmune destruction of β-cells (as seen in Type 1 Diabetes). Furthermore, none of the other known causes of diabetes apply to this specific type.

In the early 1980s, HbA1C was initially recommended as a diagnostic test for diabetes, but limited availability and lack of standardized assays slowed its adoption. It was not until 2009 that an international expert panel officially included HbA1C in diagnostic criteria, recommending a threshold of 48 mmol/mol (6.5% DCCT) for diagnosis.

The American Diabetes Association (ADA) and the World Health Organization (WHO) have both endorsed these diagnostic criteria, establishing them as the gold standard for type 2 diabetes mellitus (T2DM) detection. Consequently, many countries worldwide now adopt these guidelines for accurate T2DM diagnosis, as outlined in Table 1.

Category        FPG (mg/dL)    HbA1C (% DCCT)
Normal          < 100          < 5.7
Pre-diabetes    100–125        5.7–6.4
Diabetes        ≥ 126          ≥ 6.5

Table 1: ADA diabetes diagnostic criteria in 2015

Traditional invasive methods for type 2 diabetes mellitus detection

2.3.1 Point of care Hemoglobin A1C Test (POCT):

Point-of-care Hemoglobin A1C (HbA1c) tests, also known as POCT, are essential diagnostic tools for assessing average blood glucose levels over the past two to three months. These rapid tests play a vital role in the diagnosis and management of diabetes by providing immediate results during patient consultations, enabling timely medical decisions and improved patient care.

• Immediate Results: POCT devices provide rapid results, allowing healthcare providers to make timely therapeutic decisions during the same visit. This reduces the need for additional appointments and follow-ups.

• Convenience: These tests can be performed in various settings, including clinics, pharmacies, and even at home, making them accessible and convenient for patients

• Improved Glycemic Control: By offering immediate feedback, POCT devices help in better monitoring and management of blood glucose levels, leading to improved glycemic control and reduced risk of complications

• Standardization: Modern POCT devices are standardized and certified by programs like the National Glycohemoglobin Standardization Program (NGSP), ensuring accuracy and consistency in results [16-17]

• Diagnosis of Diabetes: POCT HbA1c tests are used to diagnose diabetes by measuring the percentage of glycated hemoglobin in the blood An HbA1c level of 6.5% or higher indicates diabetes [18]

• Monitoring Diabetes: These tests are also used to monitor the effectiveness of diabetes treatment plans, helping to adjust medications and lifestyle changes as needed

• Screening: POCT HbA1c tests can be used for screening individuals at risk of developing diabetes, enabling early intervention and prevention strategies

Ensuring accurate results with POCT devices requires strict adherence to proper testing procedures, including device calibration, proper sample handling, and comprehensive operator training, as these factors significantly influence the reliability of test outcomes.

• Cost: The cost of POCT devices and test cartridges can vary, and it is important to consider the cost-effectiveness of these tests in different healthcare settings [17]

• Limitations: POCT HbA1c tests may have limitations in certain clinical situations, such as in patients with hemoglobin variants or conditions affecting red blood cell turnover [16]

Overall, point-of-care HbA1c tests are valuable tools in the management of diabetes, offering convenience, immediate results, and improved patient outcomes. By enabling timely therapeutic decisions and better glycemic control, these tests play a crucial role in diabetes care.

2.3.2 The fasting plasma glucose (FPG) test:

The fasting plasma glucose (FPG) test, also called the fasting blood glucose (FBG) test, measures blood sugar levels after at least 8 hours of fasting. It is a simple, accurate, and cost-effective screening tool for detecting diabetes and evaluating insulin function.

The diagnosis of diabetes is commonly confirmed through the Fasting Plasma Glucose (FPG) test. An FPG level of 126 mg/dL (7.0 mmol/L) or higher on two separate occasions is indicative of diabetes. This test is a standard and reliable method for detecting both diabetes and pre-diabetes.

• Screening: It is recommended as a screening test for individuals aged 35 or older and those with symptoms or risk factors for diabetes

• Monitoring: The FPG test can be used to monitor blood glucose levels in individuals already diagnosed with diabetes, helping to evaluate the effectiveness of their management plan

• Preparation: Patients are instructed to fast for at least 8 hours before the test, during which they can only drink water

• Blood Sample: A blood sample is taken from the patient's arm, usually in the morning to ensure adequate fasting time

Blood tests are conducted to measure glucose levels, with normal fasting blood glucose being below 100 mg/dL (5.6 mmol/L). Levels ranging from 100 to 125 mg/dL (5.6 to 6.9 mmol/L) indicate impaired fasting glucose, a form of pre-diabetes. Monitoring these levels is essential for early detection and management of potential blood sugar issues.

• Accuracy: The FPG test is regarded as accurate and more sensitive than the A1C test, though it is not as sensitive as the oral glucose tolerance test (OGTT)

• Risks: As a standard blood draw, the FPG test is considered safe, with minimal risks such as bruising or infection at the puncture site

• Limitations: The FPG test may not be suitable for individuals with certain medical conditions or those who cannot fast for extended periods

Overall, the fasting plasma glucose test is a valuable tool in the diagnosis and management of diabetes, providing essential information about blood glucose levels and helping to guide treatment decisions [19-20].
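To make the thresholds described in this section concrete, the sketch below (values in mg/dL, with a hypothetical helper function) maps a single FPG reading to the categories above; in practice the diagnosis would be confirmed on a second occasion.

```python
def classify_fpg(fpg_mg_dl: float) -> str:
    """Map a fasting plasma glucose value (mg/dL) to the categories above."""
    if fpg_mg_dl >= 126:      # 126 mg/dL (7.0 mmol/L) or higher indicates diabetes
        return "diabetes"
    if fpg_mg_dl >= 100:      # 100-125 mg/dL indicates impaired fasting glucose
        return "impaired fasting glucose (pre-diabetes)"
    return "normal"           # below 100 mg/dL is considered normal


print(classify_fpg(118))  # -> impaired fasting glucose (pre-diabetes)
```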

2.3.3 The oral glucose tolerance test (OGTT):

The oral glucose tolerance test (OGTT) is a vital diagnostic tool for assessing the body's ability to metabolize glucose, aiding in the detection of diabetes and related metabolic disorders. This test requires fasting overnight followed by the ingestion of a glucose-rich solution, after which blood samples are collected at specific intervals. Monitoring these samples helps evaluate how effectively the body processes glucose over time, providing essential insights into glucose metabolism and overall metabolic health.

The Oral Glucose Tolerance Test (OGTT) is recognized as the gold standard for diagnosing type 2 diabetes, gestational diabetes, and prediabetes. It is highly effective in detecting abnormalities in glucose metabolism that other diagnostic tests may overlook. Utilizing the OGTT provides accurate and reliable results essential for early intervention and management of diabetes-related conditions.

• Screening for Gestational Diabetes: The OGTT is commonly used during pregnancy to screen for gestational diabetes, typically between 24 and 28 weeks of gestation

• Assessment of Glucose Tolerance: The test helps assess how well the body handles glucose, providing insights into insulin sensitivity and beta-cell function

• Preparation: Patients are required to fast for at least 8 hours before the test. It is usually scheduled in the morning to ensure adequate fasting time.

During a glucose tolerance test, after fasting blood samples are collected, patients consume a glucose solution containing 75 grams of glucose for non-pregnant adults. For pregnant women, the glucose solution may vary between 50 to 100 grams, depending on the specific medical protocol. This procedure helps assess how effectively the body processes glucose, aiding in the diagnosis of conditions like diabetes.

• Blood Samples: Blood samples are taken at multiple intervals, typically at 0, 1, and 2 hours after consuming the glucose solution. These samples measure the blood glucose levels to see how the body processes the glucose.

• Normal Glucose Tolerance: Blood glucose levels return to normal within 2 hours of consuming the glucose solution.

• Impaired Glucose Tolerance (Prediabetes): Blood glucose levels are higher than normal but not high enough to be classified as diabetes

• Diabetes: Blood glucose levels remain elevated, indicating a problem with glucose metabolism

• Accuracy: The OGTT is highly accurate but requires proper preparation and adherence to the testing protocol to ensure reliable results

The glucose test is generally safe; however, some patients may experience side effects such as nausea or dizziness from the glucose solution. Additionally, there is a small risk of bruising or infection at the blood draw site.

• Limitations: The OGTT may not be suitable for individuals with certain medical conditions or those who cannot fast for extended periods

Overall, the oral glucose tolerance test is a valuable tool in diagnosing and managing diabetes and other glucose metabolism disorders. It provides comprehensive insights into how the body handles glucose, helping healthcare providers make informed decisions about treatment and management [20-21].

2.3.4 Real Time Continuous Glucose Monitoring

Real-time Continuous Glucose Monitoring (CGM) is an innovative technology that enables individuals with diabetes to constantly track their blood glucose levels around the clock. By providing real-time data, CGM helps users manage their blood sugar more effectively, supporting better decision-making related to diet, exercise, and medication. This advanced monitoring system enhances overall diabetes management, promoting improved health outcomes and increased confidence in daily blood sugar control.

• Continuous Monitoring: CGM devices provide continuous glucose readings, typically every few minutes, allowing users to see trends and patterns in their glucose levels

• Immediate Feedback: Real-time data helps users make immediate adjustments to their lifestyle or treatment plan, improving glycemic control and reducing the risk of complications

CGM devices feature customizable alerts and alarms that notify users when their glucose levels are too high or too low. These timely notifications enable prompt action, helping to prevent dangerous conditions like hyperglycemia and hypoglycemia and ensuring better blood sugar management.

• Data Integration: CGM systems often integrate with smartphones, insulin pumps, and other devices, providing a comprehensive view of glucose levels and facilitating better diabetes management

Continuous Glucose Monitoring (CGM) reduces the frequency of finger-prick tests, making diabetes management less invasive and more convenient. Although occasional blood glucose testing may still be necessary, CGM significantly minimizes the need for frequent finger pricks, enhancing comfort for individuals with diabetes.

A CGM system typically consists of three main components:

• Sensor: A small sensor is inserted under the skin, usually on the abdomen or arm, to measure glucose levels in the interstitial fluid

• Transmitter: The sensor sends glucose data wirelessly to a transmitter, which then relays the information to a receiver or a compatible device, such as a smartphone or insulin pump

• Receiver: The receiver displays the glucose readings in real-time, allowing users to monitor their levels continuously

• Diabetes Management: CGM is particularly beneficial for individuals with type 1 diabetes, type 2 diabetes, and gestational diabetes, helping them maintain optimal blood glucose levels and avoid complications

• Personalized Treatment: By providing detailed glucose data, CGM enables healthcare providers to tailor treatment plans to individual needs, improving overall diabetes care

• Research and Development: CGM technology is also used in clinical research to study glucose metabolism and develop new diabetes treatments

Continuous Glucose Monitoring (CGM) devices are typically accurate; however, they may experience a slight lag compared to traditional blood glucose meters. To maintain precise readings, regular calibration and periodic finger-prick tests are recommended, ensuring optimal device performance.

• Cost: CGM systems can be expensive, and not all insurance plans cover them. It is important to consider the cost and potential benefits when deciding whether to use a CGM device.

• Maintenance: Sensors need to be replaced regularly, typically every 7 to 14 days depending on the device, and proper maintenance and adherence to the manufacturer's guidelines are essential for optimal performance.

Overall, real-time Continuous Glucose Monitoring (CGM) is a valuable technology that provides continuous, real-time data on glucose levels, empowering individuals with diabetes to take control of their health, make informed decisions, and improve their overall quality of life.

Artificial Intelligence

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The basic concepts of AI include:

• Machine Learning (ML): A subset of AI that involves training algorithms to learn from and make predictions based on data. Machine learning models can improve their performance over time without being explicitly programmed [23].

Neural networks are inspired by the human brain and consist of interconnected nodes, or neurons, that process data and learn patterns. They are highly effective at handling complex data, making them a fundamental component of many deep learning models.

Deep learning is a specialized subset of machine learning that leverages deep neural networks with multiple layers to enhance data analysis. These models automatically extract and learn features directly from raw data, significantly improving performance in tasks such as image and speech recognition. Its ability to handle large, complex datasets makes deep learning a powerful tool for advancing artificial intelligence applications.

Artificial Intelligence (AI) is revolutionizing healthcare by significantly enhancing diagnostic accuracy and enabling early disease detection, leading to improved patient outcomes. Its powerful ability to process vast amounts of data swiftly and accurately is transforming the methods used to diagnose and treat various medical conditions. AI-driven technologies are increasingly optimizing healthcare delivery, making diagnoses faster, more precise, and more personalized.

Several studies have successfully utilized AI for diagnosing diseases, including diabetes. For instance, AI algorithms have been developed to autonomously screen for diabetic retinopathy from fundus photography, achieving sensitivity and specificity greater than 85% compared to human graders. Another study explored the use of AI for automatic retinal screening, clinical diagnosis support, and patient self-management tools, which have been approved by the US Food and Drug Administration [27].

AI-driven diagnostics enhance healthcare by improving diagnostic accuracy, enabling early disease detection, and supporting personalized treatment plans. These technologies democratize access to quality healthcare, particularly in regions with limited access to specialized medical professionals. Additionally, AI reduces variability in diagnostic outcomes by offering consistent, data-driven insights, ensuring more reliable and equitable healthcare solutions.

Recent advances in non-invasive blood glucose monitoring offer a promising alternative to traditional invasive methods, which can cause infections and discomfort for individuals with diabetes. Emerging research highlights the potential of non-invasive techniques, leveraging fields such as chemistry, biology, optics, electromagnetic waves, and computer science, to overcome the challenges associated with accurate glucose detection. While non-invasive methods provide significant benefits, they also present limitations compared to invasive technologies, particularly in detection accuracy and reliability, as discussed in recent studies. The development of transdermal biosensors and wearable devices is improving the efficiency, affordability, and resilience of non-invasive glucose monitoring solutions, making them more competitive in the healthcare market.

Over the past two decades, extensive research has focused on non-invasive blood glucose testing methods, with optical techniques leading the way. These include near-infrared (NIR) spectroscopy, photoacoustic imaging, Raman spectroscopy, polarized optical rotation, and optical coherence tomography, all showing promise for accurate glucose monitoring. Specifically, Raman spectroscopy employs a transilluminated laser beam as the excitation source, directed at a specific body point to monitor glucose levels, either in vivo or in vitro.

Raman spectroscopy, a non-destructive and label-free fingerprinting technique, is increasingly enhancing biomedical diagnostics both in vivo and in vitro. Its benefits include relatively short acquisition time, non-invasiveness, and the ability to provide biochemical molecular information. By selecting a source within the red area of the spectrum or near-infrared (NIR), both the source and the signal can fall within the optical window of tissue transparency, which allows penetration depths of several millimeters into human tissue and enables advanced non-invasive diagnostic techniques. Notably, Shao et al. demonstrated that Raman spectra obtained with a 785 nm diode laser can be used for accurate, non-invasive blood glucose analysis in living animals. Their study revealed a strong correlation between Raman intensity and blood glucose concentration, with an R-squared value of 0.91 and a Mean Absolute Error of 5.7%, highlighting the potential of Raman spectroscopy for reliable glucose monitoring.

Habibullah et al. (2022) proposed an SVM classification technique for non-invasive blood glucose monitoring that achieved 85% accuracy, utilizing Gaussian filtering and histogram-based feature extraction for image database analysis. Additionally, Villa-Manríquez et al. demonstrated that Raman spectroscopy, combined with PCA-SVM classification, can attain over 80% accuracy in blood glucose measurement, highlighting its significant potential and advantage over NIR spectroscopy for non-invasive glucose assessment.

Shokrekhodaei et al. proposed several additional techniques to enhance classification accuracy. Notably, their Support Vector Machine (SVM) model for classifying glucose levels into 21 discrete concentration classes achieved an impressive mean F1-score of 99%, demonstrating high precision and reliability in glucose level prediction.

AI algorithms have been applied in medical diagnostics in various ways, focusing on pattern recognition and predictive analytics:

Pattern recognition through AI algorithms such as CNNs plays a crucial role in analyzing medical images like X-rays, MRIs, and CT scans to identify abnormalities and diagnose conditions including cancer, pneumonia, and retinal diseases. These advanced models can detect subtle patterns often overlooked by human experts, enabling earlier detection and more accurate medical diagnoses.

Predictive analytics using machine learning models like SVMs and random forests plays a crucial role in healthcare by accurately forecasting disease likelihood based on patient data such as electronic health records and genetic information. These models help identify key risk factors and predict disease progression, thereby enabling the development of personalized treatment plans and proactive medical interventions for improved patient outcomes.

• Automated Diagnostics: AI-driven tools are used to provide real-time diagnostic support for clinicians. For example, AI algorithms can analyze blood test results, detect diabetic retinopathy from retinal images, and classify skin lesions as benign or malignant. These tools enhance the efficiency and accuracy of clinical decision-making.

AI Algorithms

Several AI algorithms are relevant to this study, including:

Support Vector Machine (SVM) is a highly effective supervised machine learning algorithm used for classification and regression tasks, known for its ability to efficiently handle high-dimensional data. Its primary goal is to identify the optimal decision boundary that separates different classes within a dataset. In classification tasks, SVM constructs a hyperplane that maximizes the separation margin between data points of different categories, which enhances the model's ability to generalize to new, unseen data and reduces misclassification errors. Support vectors—the data points closest to the hyperplane—play a crucial role in determining the position of the boundary, making them essential components of the SVM decision-making process.

Support Vector Machines (SVM) utilize a kernel trick to transform non-linearly separable data into a higher-dimensional space, enabling linear separation. Different kernel functions serve specific purposes: the linear kernel for linearly separable data, the polynomial and radial basis function (RBF) kernels for modeling complex, curved decision boundaries, and the sigmoid kernel for neural network-inspired classification tasks. These transformations enhance SVM's ability to classify highly non-linear datasets with greater accuracy. Consequently, SVM is a powerful tool in pattern recognition, image analysis, and biomedical signal processing.

Support Vector Machines (SVM) are highly effective for both linear and nonlinear classification tasks due to their adaptability in high-dimensional feature spaces. The kernel trick enables SVM to project input features into a higher-dimensional space, allowing a linear hyperplane to separate classes that are not linearly separable in the original space. Commonly used kernels include the linear kernel, ideal for straightforward classification with minimal decision boundary curvature, and the polynomial kernel, which introduces curvature to handle more complex data patterns.

The Radial Basis Function (RBF) kernel is widely recognized as one of the most flexible and powerful options for non-linear classification. However, selecting the appropriate kernel for a specific dataset remains a significant challenge, as there is no universal method for kernel selection. Researchers typically rely on experimental evaluations and dataset-specific tuning to identify the most effective kernel, as highlighted by Ho [25].
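As an illustration of this dataset-specific tuning, the sketch below compares candidate kernels with cross-validation in scikit-learn; the spectra X and labels y here are random placeholders, not the thesis dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: rows are spectra, columns are wavenumber positions
X = np.random.rand(40, 900)
y = np.random.randint(0, 2, size=40)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],  # candidate kernels to compare
    "svc__C": [0.1, 1, 10],                    # regularization strength
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```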

Support Vector Machine (SVM) has become a widely researched machine learning model for biomedical applications of Raman spectroscopy, showing remarkable effectiveness in disease diagnostics, molecular fingerprinting, and spectral classification. The ability of SVM to process high-dimensional spectral data has made it a valuable tool for non-invasive medical analysis, allowing researchers to detect subtle biochemical changes indicative of disease.

Support Vector Machines (SVM) effectively distinguish various pathological conditions using Raman spectral variations, facilitating accurate classification of diseases such as cancer, autoimmune disorders, and bacterial infections. Their capability to construct optimal hyperplanes ensures precise separation between different classes, reducing misclassification errors. The kernel trick transforms spectral features into higher-dimensional spaces, significantly enhancing SVM's ability to handle complex, non-linearly separable biomedical data.

The integration of SVM with advanced preprocessing techniques like baseline correction, intensity modulation, and feature extraction has substantially enhanced the accuracy of biomedical spectral analysis. By refining raw spectral signals, these methods improve SVM's capacity to accurately differentiate molecular structures. This optimization strengthens SVM's role in AI-driven Raman spectroscopy, advancing its application in medical diagnostics.

Support Vector Machine (SVM) is widely utilized in biomedical applications, notably in Raman spectroscopy for analyzing high-dimensional spectral data efficiently. It plays a crucial role in non-invasive medical diagnostics by classifying diseases, identifying molecular fingerprints, and analyzing spectral variations. A notable study published in Nature demonstrated the use of SVM for diagnosing primary Sjögren's syndrome (pSS) through Raman spectroscopy of blood samples. By combining Principal Component Analysis (PCA) for feature selection and Particle Swarm Optimization (PSO) to optimize SVM parameters, the researchers achieved an outstanding classification accuracy of 94.44%, highlighting SVM's effectiveness in medical spectral analysis. These findings confirm that SVM is highly effective in distinguishing autoimmune disorders, reinforcing its role in spectral-based diagnostics [49].

A study published in BMC Cancer demonstrates the potential of serum-based Raman spectroscopy as an effective tool for lung cancer screening. The research highlights that combining Raman spectroscopy with Support Vector Machine (SVM) algorithms significantly improves the accuracy of distinguishing cancerous from non-cancerous serum samples, offering a promising non-invasive method for early lung cancer detection.

By leveraging advanced spectral classification techniques, the researchers achieved 91.67% sensitivity and 92.22% specificity, confirming SVM's ability to detect subtle biochemical variations that differentiate malignant and benign cases. These findings reinforce the potential of Raman spectroscopy combined with AI-driven analysis as a promising tool for non-invasive cancer diagnostics. The study also emphasized the importance of spectral preprocessing techniques, such as baseline correction and noise filtering, which significantly improved the model's classification accuracy [50].

Another study, published in Springer, investigated the application of Surface-Enhanced Raman Spectroscopy (SERS), which enables detection of trace-level analytes in biological samples by providing detailed molecular fingerprinting and high chemical specificity. The integration of Support Vector Machines (SVM) significantly improves spectral interpretation, allowing for the discrimination of subtle differences within complex biological environments. Combining chemometric techniques with machine learning models like SVM results in higher sensitivity and greater accuracy in molecular detection. This approach underscores the importance of machine learning integration in spectral analysis, advancing the development of more precise and reliable biomedical applications, particularly in disease diagnostics.

Support Vector Machine (SVM) demonstrates significant strengths in biomedical diagnostics by effectively managing high-dimensional spectral data, enhancing classification accuracy, and enabling non-invasive disease detection. However, challenges such as computational complexity, noise sensitivity—particularly from background fluorescence interference in Raman spectroscopy signals—and difficulties in hyperparameter tuning persist. Processing large spectral datasets demands substantial computational power, while selecting optimal kernel functions and hyperparameters often involves iterative, resource-intensive procedures. Robust baseline correction techniques are essential to mitigate noise sensitivity and ensure reliable classification performance in Raman spectral analysis.

In this research, SVM was implemented to analyze Raman spectral data, assessing classification accuracy before and after preprocessing. The results show that effective baseline correction techniques such as PenPoly (penalized polynomial regression) and IMP (Intensity Modulation Processing) significantly improve classification accuracy by refining spectral signals prior to SVM application. Proper preprocessing enhances feature extraction and class separability, leading to higher prediction accuracy. This underscores the efficiency of SVM in handling high-dimensional data, affirming its vital role in AI-driven medical diagnostics.
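To make the baseline-correction idea concrete, the sketch below applies a plain least-squares polynomial baseline subtraction of the kind PenPoly refines; it is a simplified stand-in, not the exact penalized routine used in this thesis.

```python
import numpy as np

def polynomial_baseline_correct(wavenumbers, intensities, degree=3):
    """Fit a low-order polynomial to the spectrum and subtract it as the baseline.

    Penalized variants iteratively down-weight Raman peaks so the fitted curve
    follows only the broad fluorescence background.
    """
    coeffs = np.polyfit(wavenumbers, intensities, deg=degree)
    baseline = np.polyval(coeffs, wavenumbers)
    return intensities - baseline

# Synthetic spectrum: sloping fluorescence background plus one Raman-like peak
wn = np.linspace(400, 1800, 1400)
spectrum = 0.002 * wn + np.exp(-((wn - 1100) ** 2) / 200.0)
corrected = polynomial_baseline_correct(wn, spectrum)
```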

The linear kernel is one of the simplest and most widely used kernel functions in Support Vector Machines (SVM), especially for linearly separable data. It creates a straight-line decision boundary in the feature space, offering computational efficiency and ease of interpretation. Unlike non-linear kernels such as the polynomial or RBF kernels, the linear kernel relies solely on the dot product between feature vectors without transforming data into higher-dimensional spaces. Mathematically, it is expressed as:

K(x, y) = x · y, where x and y are feature vectors. This simple kernel offers a straightforward yet highly effective approach for high-dimensional datasets and is particularly well-suited for applications such as text classification, gene expression analysis, and spectral data processing, where the number of features greatly exceeds the number of samples. Its simplicity combined with efficiency makes it a popular choice for handling complex, high-dimensional data in various machine learning tasks.
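A quick numerical check of this definition, using two small illustrative vectors: the kernel value is simply their dot product, which matches what scikit-learn's linear_kernel helper returns.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

print(np.dot(x, y))                                       # 4.5 = K(x, y) = x . y
print(linear_kernel(x.reshape(1, -1), y.reshape(1, -1)))  # [[4.5]], same value
```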

METHODOLOGY

Integrated Development Environment

Jupyter Notebook is an open-source web application that enables users to create, share, and collaborate on documents containing live code, equations, visualizations, and narrative text, making it a popular choice in data science, scientific computing, machine learning, and academic research. Its interactive coding feature supports real-time data analysis and visualization, allowing users to write and execute code seamlessly within the platform. Supporting multiple programming languages—primarily Python—and extendable to R, Julia, and others via kernels, Jupyter Notebook offers great flexibility for diverse projects. It integrates smoothly with visualization libraries like Matplotlib, Seaborn, and Plotly, facilitating the creation of rich visual content directly within notebooks. The platform also supports rich text formatting, including equations using LaTeX, images, and links, making it ideal for detailed reporting and documentation. Additionally, Jupyter Notebook promotes collaboration and reproducibility by allowing users to share notebooks through email or GitHub, or to convert them into formats such as HTML, PDF, and slides.

Jupyter Notebook is commonly used for a variety of purposes, including data analysis and exploration, where users can analyze datasets, perform statistical calculations, and visualize results. In machine learning, it is utilized to develop, train, and evaluate models. For educational purposes, Jupyter Notebook serves as a platform for creating interactive teaching materials and tutorials. It is also widely used in research documentation, allowing researchers to document their processes and findings in a reproducible format. Overall, Jupyter Notebook is an indispensable tool for data-intensive fields, providing a flexible and interactive environment for coding and data analysis [29].

Google Colab, or Google Colaboratory, is a cloud-based platform that offers an environment similar to Jupyter Notebook for executing Python code, making it highly popular among data scientists, machine learning practitioners, and researchers. Its key advantages include free access to powerful GPUs and TPUs, enabling efficient handling of deep learning and other resource-intensive tasks. Additionally, being a cloud-based service, Google Colab eliminates the need for local installation of Python and libraries, saving users time and effort while ensuring easy access to essential tools for data analysis and machine learning development.

Google Colab's collaboration features, similar to Google Docs, enable multiple users to work on the same notebook simultaneously, enhancing teamwork and productivity. Seamless integration with Google Drive allows easy saving and sharing of notebooks, while pre-installed libraries like NumPy, pandas, TensorFlow, and PyTorch facilitate data analysis and machine learning projects. Additionally, users can install extra libraries or connect to a local runtime, making Google Colab a flexible and powerful tool for a wide range of projects.

JetBrains Datalore is a versatile collaborative data science platform that combines the flexibility of Jupyter-compatible notebooks with intelligent coding assistance and team collaboration tools. It supports multiple programming languages, including Python, R, Scala, and Kotlin, to accommodate diverse data analysis needs. With features like code auto-completion, quick fixes, and built-in documentation, Datalore significantly enhances productivity for data scientists. The platform also offers robust data integration capabilities, allowing users to connect to databases, execute SQL queries within cells, and seamlessly integrate with cloud storage services such as Amazon S3 and Google Cloud Storage, making it an all-in-one solution for modern data science workflows.

The platform excels in collaboration, enabling real-time teamwork on notebooks with customizable access levels. Interactive reporting is another key feature, allowing users to transform notebooks into polished, shareable reports with hidden code cells. Additionally, it supports scheduling and automation for repetitive tasks, offers custom environments via pip, Conda, or Docker images, and provides an on-premises hosting option for added security. These features make Datalore a compelling choice for both individual and organizational use cases.

For this experiment, Jupyter was chosen for its versatility and user-friendly platform, well suited to conducting experiments. Its interactive notebooks allow code to be written, tested, and debugged in manageable chunks, making the process seamless and efficient. In addition, its ability to combine code, visualizations, and explanatory text in one place is ideal for documenting the experiment clearly.

StandardScaler, a key preprocessing tool in machine learning and part of the scikit-learn library, standardizes features by removing the mean and scaling them to unit variance, which helps improve model performance and convergence. Using StandardScaler ensures that each feature contributes equally to the model, especially when features are on different scales. Proper feature scaling with StandardScaler enhances the effectiveness of algorithms sensitive to feature magnitude, such as support vector machines and logistic regression. Implementing StandardScaler in the data pipeline can lead to more accurate and reliable machine learning models.

StandardScaler transforms the data in such a way that each feature's mean is centered at 0 and the standard deviation is scaled to 1. This is done using the formula:

z = (x − μ) / σ

where x is the original feature value, μ is the mean of the feature, and σ is the standard deviation of the feature.

1. Normalization: Standardizing features ensures that they are on the same scale, which is especially important for algorithms sensitive to the scale of data, like SVMs with RBF kernels.

2. Improved Performance: It can lead to faster convergence during training and better overall performance of the model.

3. Handling Outliers: While it does not eliminate outliers, it can mitigate their impact, since the data is rescaled relative to the standard deviation.

When working with features that have different units or scales, using StandardScaler is essential for ensuring that each feature contributes equally to the distance metrics in machine learning algorithms. This normalization process improves model robustness and accuracy by preventing features with larger scales from dominating the learning process. Incorporating StandardScaler helps optimize model performance, especially in scenarios where feature scale disparity could negatively impact the results.

In the SVM model setup, StandardScaler is applied to preprocess the Raman spectroscopy data, normalizing features to have a mean of 0 and a standard deviation of 1. This data scaling enhances the consistency and reliability of the classification results, leading to improved model performance.
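A minimal sketch of this scaling step (the array below is illustrative, not the actual Raman data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Each row is one spectrum, each column one wavenumber position
X = np.array([[120.0, 340.0,  95.0],
              [110.0, 360.0,  80.0],
              [130.0, 320.0, 105.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # applies z = (x - mu) / sigma per column

print(X_scaled.mean(axis=0))  # ~0 for every feature
print(X_scaled.std(axis=0))   # ~1 for every feature
```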

Penalized polynomial regression is an advanced technique that incorporates a penalty term into the polynomial regression model to prevent overfitting. It combines polynomial features with regularization methods such as Lasso (L1), Ridge (L2), or Elastic Net, which merges both L1 and L2 penalties to enhance model stability. By generating polynomial features up to a specified degree, penalized polynomial regression captures complex non-linear relationships within the data, making it a powerful tool for robust predictive modeling.

Regularization is a vital technique in modeling that helps reduce model complexity and prevent overfitting. Lasso regularization applies an L1 penalty, promoting sparsity by shrinking some coefficients to zero, which effectively performs feature selection. In contrast, Ridge regularization uses an L2 penalty that penalizes the sum of squared coefficients, resulting in smaller and more evenly distributed coefficients. Elastic Net combines both L1 and L2 penalties, offering a balanced approach that enhances model flexibility and performance by incorporating feature selection and coefficient shrinkage.

Penalized polynomial regression offers numerous advantages by incorporating a regularization term that prevents overfitting, leading to better generalization and improved performance on unseen data. This approach enhances model interpretability by selecting key features and diminishing the influence of less important ones, providing valuable insights into the data's underlying structure. Overall, penalized polynomial regression is an effective technique for building robust, understandable models that perform well on new datasets.

Penalized polynomial regression is highly effective for modeling complex, non-linear relationships in data that cannot be captured by simple linear models. By incorporating polynomial features, this approach allows the model to adapt to a wider variety of patterns and interactions between variables, enhancing its predictive capabilities. It is especially valuable in scenarios where capturing intricate data relationships is crucial for accurate analysis and forecasting.
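A minimal scikit-learn sketch of the idea on synthetic data: polynomial features are generated and then fitted with a ridge (L2-penalized) regressor; Lasso or ElasticNet could be substituted to change the penalty.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data with noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - 3.0 * x.ravel() + rng.normal(0.0, 2.0, size=100)

# Degree-5 polynomial features with an L2 penalty (alpha controls shrinkage)
model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=1.0))
model.fit(x, y)
print(model.predict([[4.0]]))  # prediction near 0.5*16 - 12 = -4
```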

Data acquisition

A Raman spectrometer comprises four key components that work together to measure the Raman effect:

• Laser: The laser is the essential light source, emitting a focused, monochromatic beam that is crucial for inducing Raman scattering when it interacts with a sample. It provides the necessary energy to excite molecules within the sample, making it a vital component in Raman spectroscopy analysis.

• Sample Holding Chamber: A crucial component designed to securely contain samples in various forms, including solids, liquids, and gases, during measurement. Equipped with precise positioning mechanisms, it ensures the stability and accuracy of measurements by maintaining the correct sample orientation, making it essential for accurate scientific analysis and testing.

• Spectrometer Chamber: Using a diffraction grating, the spectrometer chamber separates the scattered light into its component wavelengths after the laser interacts with the sample. This dispersion enables the simultaneous measurement of a broad spectrum of wavelengths, allowing accurate identification of the unique Raman shifts that correspond to the sample's molecular vibrations.

• Amplification System and Collection: The dispersed light is then collected and amplified to enhance the signal strength for better detection. To ensure accurate measurements, the amplified signal is filtered to remove photons that are not shifted by the Raman effect, using notch or edge filters that selectively transmit only the desired Raman scattered photons. These filtered photons are then directed into a grating spectrometer, which captures the wavenumbers across the entire spectrum simultaneously. This comprehensive spectral analysis enables detailed insights into the sample's molecular composition, revealing valuable information about its chemical and physical properties.

To simulate blood sugar levels for my project, I initially faced challenges due to the lack of genuine human data, which requires consent and medical review. To overcome this, I used a diluted glucose solution as a practical interim solution, enabling progress until more accurate data collection methods can be implemented. The experimental dataset was obtained from a Raman spectrometer provided by my supervisor, with data collected at two stages from volunteers. This dataset includes the labels used to train the machine learning model, facilitating accurate blood sugar level prediction.

1. Before consuming food: represents samples from patients who have not consumed any food.

2. After consuming food: represents samples from patients who have consumed food.

We obtained a CSV file containing data from 20 Raman spectroscopic scans, which are crucial for training an SVM (Support Vector Machine) model to accurately classify different blood glucose levels. These labeled Raman spectra serve as the foundational input data, enabling the machine learning model to analyze and learn distinctive spectral patterns associated with each glucose level. By examining these spectra, the SVM model can identify key correlations, enhancing its ability to distinguish between varying blood glucose concentrations effectively.

Utilizing these labels enables machine learning models to precisely predict patients' blood glucose levels from Raman spectroscopy samples, offering an effective tool for diabetes classification and management. This approach enhances diagnostic accuracy and supports personalized treatment strategies, making it a valuable advancement in diabetes monitoring.

Figure 3: Schematic diagram of a Raman spectrometer setup using a CCD detection module and excitation laser (Andrew Downes, 2024)

Raman spectroscopy

Professor C.V. Raman, along with K.S. Krishnan, pioneered the discovery of the Raman effect, a groundbreaking advancement in spectroscopy. Their initial publication marked a significant milestone in scientific research, introducing a versatile technique used to analyze a wide range of evidence, including chemical compounds and biological materials. The Raman effect remains one of the most adaptable and widely used methods in modern analytical science.

Raman spectroscopy is a highly versatile and non-invasive technique that plays a crucial role across various fields such as chemistry, physics, biology, and medicine. It offers detailed molecular insights, enabling early disease diagnosis, monitoring disease progression, and conducting environmental analysis. The pioneering work of Raman and Krishnan highlights the significance of innovation and collaboration in scientific research, paving the way for future advancements in spectroscopy.

Figure 4: Sir Chandrasekhara Venkata Raman (nobelprize, 2024)

This technique surpasses many constraints of current spectroscopic methods and is valuable for quantitative and qualitative analysis. Quantitative analysis measures the intensity and frequency of scattered radiations, providing detailed molecular composition information. This is crucial in fields like chemistry and biochemistry, where precise measurements are needed. Qualitative analysis identifies specific molecules or functional groups by examining Raman shifts corresponding to molecular vibrations. This method can identify unknown compounds, verify mixtures, and monitor molecular structure changes, making it a versatile and powerful tool in research and industry [42].

Raman spectroscopy, based on the Raman effect, is a scattering technique wherein monochromatic laser light interacts with molecules in a sample, producing scattered light. The wavelength of this Raman-scattered light depends on the wavelength of the excitation light. Consequently, to compare spectra obtained with different lasers, the spectrum is expressed relative to the excitation wavelength: the separation of the scattered light from the excitation line is reported as the Raman shift.

The wavelength of the excitation light directly influences the wavelength of the Raman scattered light. As a result, Raman spectra obtained with different lasers are standardized to a common reference wavelength, ensuring consistent and comparable measurements across various experimental setups. This standardization is essential for accurate interpretation and analysis of Raman spectroscopy data.

In practice, this is done by reporting how far the scattered light is shifted away from the excitation wavelength, resulting in what is known as a Raman shift.

Raman shifts are commonly expressed in wavenumbers (cm⁻¹), which are units of inverse length directly related to the energy of the scattering process, providing a clear connection between spectral data and energy levels. This unit is preferred in Raman spectroscopy because it allows for precise interpretation of molecular vibrations and material properties. Additionally, wavenumbers can be easily scaled for conversion to nanometers (nm), enabling researchers to compare Raman spectra across different laser sources and experimental conditions. This flexibility enhances the versatility of Raman analysis and improves the accuracy of data interpretation.

Using wavenumbers in Raman spectroscopy allows scientists to easily interpret and compare spectra across different studies, regardless of the laser wavelengths used. This standardization enhances consistency and accuracy in Raman analysis, enabling precise identification and characterization of diverse molecular compounds and structures.
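To make the relationship concrete, the sketch below converts an excitation wavelength and a scattered wavelength (both in nm) into a Raman shift in cm⁻¹ using the standard relation Δν̃ = (1/λ_excitation − 1/λ_scattered) × 10⁷; the 785 nm example values are purely illustrative.

```python
def raman_shift_cm1(excitation_nm: float, scattered_nm: float) -> float:
    """Raman shift in cm^-1 from excitation and scattered wavelengths given in nm."""
    return (1.0 / excitation_nm - 1.0 / scattered_nm) * 1e7

# Example: a 785 nm laser and a photon scattered at ~857 nm correspond to a shift of ~1070 cm^-1
print(round(raman_shift_cm1(785.0, 857.0), 1))
```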

Fluorescence background subtraction

Fluorescence interference in Raman spectroscopy can arise due to two main factors: molecule interference and sample contamination. Molecule interference occurs when high-energy photons are absorbed, causing molecules to be excited to a higher electronic state. As these molecules return to a lower energy state, they emit fluorescence light. On the other hand, sample contamination can introduce foreign substances that fluoresce under the excitation light.

Raman shifts are independent of the excitation wavelength, setting them apart from fluorescence. In contrast, fluorescence emission is significantly affected by the excitation wavelength used during Raman spectroscopy. Understanding this distinction is essential for accurate spectroscopic analysis, as it allows for effective differentiation between Raman signals and fluorescence interference, ensuring precise identification of molecular characteristics.

Raman scattering and fluorescence emission can compete when the excitation laser energy approaches the electronic transition energy of the material. The fluorescence background is intensified by higher-energy visible excitation sources, such as green (514 nm) or red (633 nm) lasers, which can cause strong fluorescence signals that interfere with Raman detection. Conversely, near-infrared lasers operating at 785 nm or 1064 nm lack sufficient energy to excite molecules to higher electronic states, effectively reducing fluorescence interference and enhancing the clarity of Raman spectra. This makes near-infrared excitation wavelengths preferable for applications requiring minimal fluorescence background in Raman spectroscopy.

Such excitation does not promote molecules to a higher electronic state, which effectively suppresses fluorescing molecules in the material, resulting in a reduced or eliminated fluorescence effect [43].

Numerous polynomial fitting algorithms were examined to eliminate background fluorescence in the original signal. To maintain data clarity, the Improved Modified Polynomial (IMP) and penalized polynomial (penalized_poly) fitting methods were both employed for the experiment. These methods have proven effective in recent investigations, even for spectra with a low signal-to-noise ratio. The pure Raman spectrum is calculated by subtracting the corrected polynomial from the original Raman spectra, and the resulting values are then summed.

Machine Learning model selection

To automate diabetes classification using Raman spectroscopy, machine learning models are employed to distinguish between diabetic and non-diabetic patients effectively. While the relationship between Raman intensity at specific wavelengths and blood glucose levels can be linear, variations in environmental conditions and subject differences make fixed wavenumber approaches unreliable, potentially leading to prediction errors. Therefore, interpreting a broad range of wavenumbers simultaneously is essential, highlighting the advantage of machine learning models in accurately predicting diabetes status by analyzing complex, high-dimensional Raman spectral data and capturing implicit correlations across multiple Raman shifts.

Before inputting data into machine learning models, the fluorescence subtraction method is used as a crucial preprocessing step that leverages the inherent characteristics of Raman spectroscopy. This technique effectively removes visible fluorescence signals from the dataset, minimizing variance at specific wavelengths and preventing prediction bias. Implementing fluorescence subtraction enhances data quality and ensures more accurate and reliable Raman spectral analysis for machine learning applications.

Applying the fluorescence subtraction method effectively removes visible fluorescence signals from the data. The data is then shuffled to ensure proper randomization, which prevents bias during machine learning model training. This step is crucial for improving the robustness and generalizability of the models, enabling accurate differentiation between three patient stages based on processed Raman spectroscopy samples: those who have not eaten, those who have consumed one cup of sugar water, and those who have consumed two cups of sugar water.

Consuming two cups of sugar water provides valuable data for machine learning models. By shuffling this data, the model can better learn and predict the different stages, resulting in more accurate and reliable results. This process enhances the effectiveness of machine learning in analyzing behavioral or physiological responses related to sugar intake.

The Support Vector Machine (SVM) model is implemented with the radial basis function (RBF) kernel, a widely used choice for capturing complex non-linear relationships. To enhance model performance, the data is standardized using StandardScaler, which removes the mean and scales features to unit variance. This essential preprocessing step ensures that all features contribute equally, preventing features with larger scales from dominating the model's results and improving overall accuracy.

To evaluate the effectiveness of the SVM model, key performance metrics such as accuracy, sensitivity, and specificity are used to provide a comprehensive assessment of its classification capabilities. Accuracy indicates the overall correctness of predictions, while sensitivity (recall) measures the model's ability to correctly identify positive cases, such as patients who consumed one or two cups of sugar water. Specificity evaluates the model's effectiveness in correctly identifying negative cases, like patients who did not consume any sugar water. Monitoring these metrics enables researchers to understand the model's strengths and weaknesses, ensuring optimal performance across various classification tasks.

Using repeated stratified k-fold cross-validation ensures the model is trained and tested on multiple data subsets, leading to a more robust performance evaluation. This method minimizes bias and variance, providing a reliable estimate of the model's ability to generalize to unseen data.

This study presents a robust framework for distinguishing between non-fasted patients, patients who consumed one cup of sugar water, and those who consumed two cups using Raman spectroscopy. The combination of Support Vector Machine (SVM) with the RBF kernel, data scaling, and comprehensive performance metrics, validated through cross-validation, ensures high model accuracy and reliability. Additionally, an Extra Trees Classifier was used for comparison to evaluate the overall effectiveness of the models in disease classification.
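A minimal sketch of this classification setup is shown below, assuming the processed spectra form a samples-by-wavenumbers matrix X with stage labels y; random stand-in data and illustrative hyperparameters are used here so the sketch runs on its own and does not reproduce the exact configuration of the thesis experiments.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Stand-in data: in the real workflow X holds the processed Raman intensities
# (n_samples x n_wavenumbers) and y the stage labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 1901))
y = np.array([0] * 10 + [1] * 10)

# Repeated stratified k-fold keeps the class balance in every split
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

models = {
    "SVM (RBF kernel)": make_pipeline(StandardScaler(),
                                      SVC(kernel="rbf", C=1.0, gamma="scale")),
    "Extra Trees": ExtraTreesClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```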

Experimental setups

To conduct the experiment, the initial step involves preparing the data for machine learning. The raw dataset comprises 20 samples, with 10 samples collected before the volunteers had breakfast and 10 samples collected after breakfast. The dataset is labeled accordingly: 0 for the samples taken before breakfast and 1 for those taken after breakfast.

The dataset used in this research consists of Raman spectral data, capturing the interaction between laser light and biological molecules. Each data sample contains a series of intensity values measured at different wavelengths, forming spectral curves that indicate chemical composition. These spectral signals reflect glucose concentration variations, enabling classification between different glucose levels or between diabetic and non-diabetic subjects. The dataset is high-dimensional, meaning it contains multiple spectral variables, each contributing unique information about molecular vibrations. Since Raman spectroscopy relies on detecting subtle shifts in scattered light, the dataset inherently includes natural fluctuations in intensity, which need careful handling for accurate interpretation. Furthermore, the spectral characteristics may vary based on environmental conditions, measurement settings, and biological differences, influencing the distribution of intensity values across samples. The dataset provides a foundation for machine learning models to identify patterns, correlations, and diagnostic insights, making it a valuable resource for non-invasive medical analysis using AI-driven spectral classification.

This research utilizes a spectral dataset comprising 20 samples and 1,901 features derived from Raman spectroscopy analysis. Each sample represents an individual spectral measurement, potentially collected from various subjects or experimental conditions, ensuring diverse and comprehensive data coverage. The dataset's 1,901 columns capture essential spectral attributes, with each column representing the intensity of the Raman signal at a specific wave number, forming a detailed spectral profile that provides insights into molecular vibrations and chemical composition.

Raw data collected from the human body is stored in CSV format, featuring two main columns: wave number (cm⁻¹) and Raman signal intensity, which are essential for molecular analysis. Raman spectral intensity data enables the detection of specific molecular vibrations, providing valuable insights into the presence and concentration of biological molecules. While the original dataset covers wave numbers from 400 to 2300 cm⁻¹, the processed dataset focuses on the 800 to 1800 cm⁻¹ range, highlighting vibrational bands associated with key chemical groups in proteins, lipids, nucleic acids, and carbohydrates. This targeted spectral window is particularly useful for analyzing biological tissue structure and chemical composition, supporting advanced biomedical research and diagnostics.

Effective preprocessing techniques are crucial for improving classification accuracy when working with high-dimensional spectral datasets. Machine learning models, such as Support Vector Machines (SVM) and ExtraTreesClassifier, are well-suited to classify glucose concentration levels and distinguish biochemical differences between diabetic and non-diabetic samples. By applying these advanced algorithms to complex spectral data, this research aims to enhance AI-driven Raman Spectroscopy analysis, ensuring a robust and reliable approach to medical diagnostics.

Figure 6: Part of the raw data, cut from 400 to 2300 cm⁻¹

In the dataset, labels of 0 and 1 are assigned to indicate measurements taken before and after the specified condition, enabling effective comparison of noise reduction techniques. This labeling system facilitates machine learning analysis by clearly distinguishing between pre- and post-condition data, ultimately helping to evaluate the effectiveness of noise reduction methods and guide future improvements. Proper labeling is essential for accurate analysis and optimization of noise reduction strategies.

Polynomial Fitting Method for baseline determination

The Polynomial Fitting Method is an essential approach for baseline determination in spectral analysis, particularly in Raman spectroscopy, where unwanted variations in intensity can interfere with the accurate identification of molecular structures. Baseline distortions in spectral data often arise due to instrumental noise, fluorescence interference, and environmental factors, necessitating robust correction techniques to refine spectral accuracy. Polynomial fitting models the baseline as a smooth polynomial function, which is subtracted from the raw spectral data to remove background signals while preserving the integrity of significant spectral peaks.

A polynomial can be expressed as follows:

$$p(x) = \sum_{j=0}^{k} \beta_j x^{j}$$

where $\beta$ is the array of coefficients for the polynomial and $k$ is the polynomial order.

For regular polynomial fitting, the polynomial coefficients that best fit the data are obtained by minimizing the least-squares error

$$E = \sum_{i=1}^{N} \left[ y_i - p(x_i) \right]^2,$$

where

$y_i$ and $x_i$ are the measured data,

$p(x_i)$ is the polynomial estimate at $x_i$,

$N$ is the number of data points.
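For illustration, the least-squares fit above can be reproduced with numpy on made-up data; np.polyfit returns the coefficients that minimize the error E defined above.

```python
import numpy as np

# Illustrative data points (x_i, y_i) with some added noise
x = np.linspace(0, 10, 50)
y = 0.5 * x**2 - 2 * x + 3 + np.random.default_rng(1).normal(scale=0.5, size=x.size)

k = 2                                  # polynomial order
beta = np.polyfit(x, y, deg=k)         # coefficients minimizing sum_i (y_i - p(x_i))^2
p = np.poly1d(beta)                    # the fitted polynomial p(x)

E = np.sum((y - p(x)) ** 2)            # least-squares error of the fit
print("coefficients:", np.round(beta, 3), " E =", round(E, 3))
```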

The primary advantage of polynomial fitting is its simplicity and effectiveness.

Polynomial fitting is widely used in in vivo biomedical Raman applications due to its faster processing speed compared to other methods. However, its effectiveness depends heavily on selecting an appropriate spectral fitting range and polynomial order, which can limit its accuracy.

Polynomial fitting offers a smooth and continuous correction curve, making it highly effective for baseline correction and fluorescence removal in spectral signals. Its flexibility allows for tuning the polynomial degree to preserve essential spectral features while eliminating background noise. However, selecting the optimal polynomial degree can be challenging, as inadequate fitting may either fail to correct baseline distortions or distort the spectral structure. The method can be sensitive to noise, often requiring additional smoothing techniques for improved accuracy. Additionally, higher-degree polynomials increase computational complexity, especially with large spectral datasets.

In Raman spectroscopy, polynomial fitting is essential for refining spectral intensity data and ensuring precise peak detection, which is crucial for accurate biochemical classification. Advanced methods like piecewise polynomial fitting (PPF) enhance this process by dividing spectral data into smaller sections for localized baseline corrections. Moreover, adaptive polynomial fitting techniques incorporate optimization algorithms to dynamically refine polynomial parameters, enhancing baseline accuracy across varied spectral conditions.

To enhance background correction in Raman spectroscopy, I compare the effectiveness of penalized polynomial regression and the Improved Modified Polynomial (IMP) method. The process begins with fitting a single polynomial P(v) to the raw Raman signal O(v), where v represents the Raman shift in cm⁻¹. Residuals R(v) are then calculated to assess the accuracy of the fit, along with their standard deviation (DEV). These methods are evaluated to determine which provides superior background correction and more accurate spectral analysis.

$$R(v) = O(v) - P(v), \qquad \mathrm{DEV} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(R(v_i) - \bar{R}\big)^{2}},$$

where $n$ represents the number of data points on the spectral curve and $\bar{R}$ is the mean of the residuals. This approach ensures a more accurate baseline correction by accounting for the residuals and their variations.

Figure 8: Denoised data after combining all of the denoised CSV files

After transforming the data as shown in Figure 5, each of the 20 CSV files undergoes penalized polynomial regression for effective data denoising and signal enhancement. This process results in 20 refined CSV files, similar to those depicted in Figure 8. Subsequently, the processed data are combined and reorganized by swapping from columns to rows, as illustrated in Figure 7, facilitating comprehensive data analysis and visualization.
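A possible way to perform this combination and column-to-row reorganization with pandas is sketched below; the folder and file names are hypothetical placeholders, and each two-column (wavenumber, intensity) file becomes one row of the combined matrix.

```python
import glob
import pandas as pd

# Hypothetical folder of denoised per-sample CSV files, each with columns: wavenumber, intensity
files = sorted(glob.glob("denoised/*.csv"))

rows = []
for path in files:
    df = pd.read_csv(path, header=None, names=["wavenumber", "intensity"])
    # One spectrum becomes one row: wavenumbers as column names, intensities as values
    rows.append(pd.Series(df["intensity"].values, index=df["wavenumber"].values, name=path))

combined = pd.DataFrame(rows)          # shape: (n_samples, n_wavenumbers)
combined.to_csv("combined_spectra.csv")
print(combined.shape)
```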

Jupyter setup

To implement the baseline determination algorithm in my project, the first step is to install and import the Pybaselines library, an essential tool for data calculation and preprocessing. I have also integrated the numpy library, which offers robust functions and data structures for efficient array and matrix operations, improving overall calculation performance.

I'm also using the matplotlib.pyplot library to create high-quality graphs and charts from the data. Furthermore, I utilize the csv library to read and write CSV files.

CSV is a widely used format for tabular data, facilitating easy reading and writing of information across various applications. The os library is utilized to interact with the operating system, enabling operations such as creating, deleting, or moving files seamlessly.

Figure 10: Libraries for penalized polynomial regression and graph drawing

After that, I start by retrieving the cut CSV files by scanning the folder in which they are located.

This function reads data from a CSV file to extract either x-axis or y-axis values based on the specified parameter. It processes the file efficiently, skipping invalid entries to maintain data integrity. The function ensures that all data is correctly parsed into appropriate types, such as integers and floats. By doing so, it provides reliable data extraction for various analytical or visualization purposes.

1) Initialization: The function initializes two empty lists, datay and datax, to store the y-axis and x-axis values.

2) Opening the File: It opens the specified CSV file using the with open statement, ensuring the file is properly closed after reading.

3) Reading the File: The csv.reader is used to read the CSV file, and the function iterates over each row in the file.

4) Data Parsing: For each row, if the first element is not empty, the function attempts to convert it to an integer and append it to datax. It also converts the second element to a float and appends it to datay. If a ValueError occurs during conversion, the function skips the invalid entry and prints a message.

5) Returning Data: Depending on the value of the axit parameter ('x' or 'y'), the function returns the corresponding list of values.

This ensures that the data is read and parsed correctly from the CSV file, ready for further processing or analysis.
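Based on the steps listed above, a minimal reconstruction of such a reader could look like the following; the function name read_axis and its parameter names are illustrative, not the exact code used in the thesis.

```python
import csv

def read_axis(file_path, axis="y"):
    """Read a two-column Raman CSV (wavenumber, intensity) and return one axis as a list."""
    datax, datay = [], []
    with open(file_path, newline="") as f:          # file is closed automatically
        for row in csv.reader(f):
            if not row or row[0] == "":
                continue                            # skip empty lines
            try:
                datax.append(int(float(row[0])))    # wavenumber -> int
                datay.append(float(row[1]))         # intensity  -> float
            except (ValueError, IndexError):
                print("Skipping invalid entry:", row)
    return datax if axis == "x" else datay
```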

I developed an additional function called Draw_Graph_Raw to visualize raw signal line graphs, enabling detailed observation of signal changes before and after processing. This function accepts Raman shift values and intensities, providing clear labels for each component, including axis titles, graph labels, distinct colors, and data point values. Implementing this visualization tool enhances analysis accuracy and aids in better interpretation of Raman spectroscopy data.

Figure 12: Functions that draw graphs

Draw_Graph_Raw is designed to plot a graph using the given x-axis and y-axis data. Here's a breakdown of how the function works:

Create a New Figure: The function starts by creating a new figure using plt.figure() to ensure that each call to the function generates a new plot.

To plot data effectively, use the plt.plot function, specifying x_axis and y_axis for the respective axes' values. Customize the visualization by setting the color parameter for line color and using the label parameter to create descriptive labels, such as combining the text "Processed signal" with the line_name parameter for clear identification in the legend. This approach ensures clear, informative, and visually appealing plots suitable for data analysis and presentation.

1) Set Labels: The plt.xlabel and plt.ylabel functions set the labels for the x-axis and y-axis, respectively, using the x_label and y_label parameters.

2) Set Title: The plt.title function sets the graph's title using the graph_name parameter.

3) Custom X-Axis Ticks: The plt.xticks function customizes the x-axis ticks, displaying them at intervals of 50 units and rotating them vertically for better readability.

4) Add Legend: The plt.legend function adds a legend to the plot, which helps in identifying the plotted line based on the label provided.
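Putting these steps together, a minimal reconstruction of such a plotting function might look like this; the default axis labels and the lower-case function name are assumptions for illustration rather than the exact thesis code.

```python
import matplotlib.pyplot as plt

def draw_graph_raw(x_axis, y_axis, line_name, graph_name,
                   x_label="Raman shift (cm$^{-1}$)", y_label="Intensity (a.u.)",
                   color="tab:blue"):
    """Plot one signal line with labelled axes, a title, rotated x ticks and a legend."""
    plt.figure()                                               # a new figure for every call
    plt.plot(x_axis, y_axis, color=color, label="Processed signal " + line_name)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(graph_name)
    # Ticks every 50 units along the x-axis, rotated vertically for readability
    plt.xticks(range(int(min(x_axis)), int(max(x_axis)) + 1, 50), rotation="vertical")
    plt.legend()
    plt.show()
```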

The SaveFlattenData function, similar to the read function, is used to write processed data into a CSV file line by line. Instead of searching for lines that contain data to read, this function identifies empty lines in which to insert data, ensuring proper placement within the file. When combined with the custom Autoname function, it enables the creation of a complete CSV file. The resulting CSV file contains processed signal data, providing a streamlined solution for data organization and analysis.

This function is designed to save processed data into a CSV file. Here's a step-by-step breakdown of what the function does:

Specify File Location: The function constructs the file path by combining the directory (dir) and filename (filename).

Convert Data to Strings: It converts the datay values to strings and stores them in data_convert. However, this step isn't actually used later in the code.

To efficiently combine data, create a list named "data" that stores sublists of corresponding x and y values. Each sublist represents a row in the CSV file, containing paired x and y data points. This method organizes the data for easy processing and analysis, ensuring each data pair is neatly encapsulated within its own sublist, facilitating straightforward export or further manipulation.

To write data to a CSV file, the function opens the designated file in write mode and utilizes csv.writer to record each row from the data list into the CSV. The use of the newline="" parameter prevents extra blank lines from appearing between rows, ensuring a clean and correctly formatted CSV output.
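A compact reconstruction of this saving routine is sketched below; the function and argument names are illustrative rather than the exact thesis code.

```python
import csv
import os

def save_flatten_data(directory, filename, datax, datay):
    """Write paired (x, y) values into a CSV file, one pair per row."""
    file_path = os.path.join(directory, filename)        # combine directory and filename
    data = [[x, y] for x, y in zip(datax, datay)]         # each sublist becomes one CSV row
    with open(file_path, "w", newline="") as f:           # newline="" avoids blank lines between rows
        csv.writer(f).writerows(data)
```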

To effectively process noisy signals, the initial step is to establish the baseline, which is crucial for background subtraction and noise filtering. In this project, I utilize the penalized polynomial regression method, following the algorithm chart provided, to accurately estimate the baseline, enhancing signal clarity and analysis accuracy. This approach ensures that the baseline is accurately identified, enabling effective noise reduction and signal enhancement.

Figure 14: Determine signals’ baselines and draw graphs

This code effectively processes each file, removes noise by calculating the baseline using penalized polynomial regression, and visualizes both the raw and baseline-corrected data.
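A simplified version of this baseline-correction step is sketched below, assuming the pybaselines package exposes penalized_poly and imodpoly in its polynomial module (the functional API); a synthetic spectrum stands in for a sample's CSV data so the example is self-contained, and the chosen polynomial orders mirror the ones examined later.

```python
import numpy as np
import matplotlib.pyplot as plt
from pybaselines.polynomial import penalized_poly, imodpoly

# Synthetic stand-in for one raw spectrum: a single Raman-like band on top of a broad,
# slowly varying fluorescence background (in the real workflow x and y come from a CSV file).
x = np.linspace(400, 2300, 1901)
background = 1e-6 * (x - 400) * (2300 - x) + 0.2
peak = 1.5 * np.exp(-0.5 * ((x - 1450) / 15) ** 2)
y = background + peak + np.random.default_rng(0).normal(scale=0.02, size=x.size)

for order in (3, 7, 12, 16):
    baseline, _ = penalized_poly(y, x_data=x, poly_order=order)
    plt.plot(x, y - baseline, label=f"penalized_poly, order {order}")   # background-corrected signal

# The Improved Modified Polynomial (IMP) baseline can be obtained the same way:
imp_baseline, _ = imodpoly(y, x_data=x, poly_order=7)

plt.xlabel("Raman shift (cm$^{-1}$)")
plt.ylabel("Intensity (a.u.)")
plt.legend()
plt.show()
```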

This project aimed to establish baseline signals using polynomial orders ranging from 3 to 16. Manually selecting the optimal polynomial order based on visual graph inspection proved challenging, so key points within the range were used to determine sample baselines for background correction. The processed data was then input into a machine learning model to assess the results. This method efficiently avoided complex manual calculations involving weights and standard deviations, saving time given the large volume of signal samples. Ultimately, polynomial orders 3, 7, 12, and 16 were selected for baseline determination.

Figure 15: Determine the signal's baseline using polynomial order of 3

Figure 16: Determine the signal's baseline using polynomial order of 7

Figure 17: Determine the signal's baseline using polynomial order of 12

Figure 18: Determine the signal's baseline using polynomial order of 16

Polynomial orders refer to the degree of a polynomial function, which is essentially the highest power of the variable in the equation. In the context of data analysis, such as signal processing or regression models, choosing the right polynomial order is crucial.

• Lower-order polynomials are simpler and can underfit the data by missing important details in complex patterns

Choosing the appropriate polynomial order is crucial for accurately modeling data without overfitting; higher-order polynomials offer flexibility but risk capturing noise instead of true trends. Typically, a polynomial order between 7 and 12 provides a balanced baseline, avoiding underfitting at lower orders like 3 and overfitting at higher orders such as 16, as demonstrated in Figures 16 and 17. However, selecting the best polynomial degree is scenario-dependent, requiring experimental testing and visual or statistical evaluation. To optimize results, machine learning techniques can be employed to compare the processed signal against original data and refine the model based on performance, ensuring the most accurate and reliable baseline fitting.

Data denoising

After establishing the baseline, the next crucial step in Raman spectral preprocessing is background correction, which involves the subtraction of background noise from the signal intensity. This noise is identified through the baseline and is removed to enhance the clarity of Raman peaks, ensuring accurate spectral analysis. Typically, after background correction, the received signal intensity decreases, since the subtracted background noise often has a positive intensity. The sources of background noise include fluorescence interference, Rayleigh scattering, detector artifacts, and environmental distortions, all of which can significantly impact the accuracy of spectral classification. Removing these unwanted signals allows the Raman spectrum to more effectively represent the true biochemical composition of a sample, making it essential for applications in disease diagnostics and molecular identification.

Beyond background correction, data denoising is another vital preprocessing step that ensures Raman spectral signals remain free from random fluctuations caused by instrumental limitations or environmental factors.

Spectral noise, caused by instrumental limitations or environmental factors, can obscure weak spectral features in biomedical samples, complicating the extraction of meaningful information. Denoising techniques are essential for reducing spectral variability while preserving critical peaks that represent biomolecular structures. Effective noise removal enhances the accuracy of classification models, such as Support Vector Machines (SVM), by ensuring precise spectral features crucial for accurate disease identification.

In this study, two advanced preprocessing techniques, PenPoly (penalized polynomial regression for baseline correction) and IMP (Improved Modified Polynomial fitting), were utilized to refine Raman spectral data before classification.

PenPoly utilizes polynomial fitting techniques to effectively remove low-frequency baseline distortions in Raman spectra, preserving crucial peaks while eliminating background signals. Polynomial regression enhances spectral clarity by reducing fluorescence and baseline drift, resulting in more uniform spectral signals. Selecting the optimal polynomial degree is essential; too low a degree may inadequately correct complex variations, whereas too high a degree can cause unwanted oscillations, such as the Runge phenomenon.

IMP is a crucial technique for normalizing and standardizing spectral intensity in Raman spectroscopy, ensuring consistency across diverse samples and measurement conditions. By adjusting signal intensity values, IMP enhances peak visibility and reduces fluctuations caused by instrument variability. This method improves peak detection accuracy by modulating intensity over a predefined spectral range, ultimately making Raman spectral data more reliable for machine learning classification models and ensuring accurate, consistent analysis.

PenPoly and IMP significantly enhance SVM classification accuracy for Raman spectral data by reducing baseline distortions and improving spectral clarity. These preprocessing techniques provide high-quality input for SVM, leading to more accurate feature extraction and better class separation. Ultimately, effective preprocessing with PenPoly and IMP strengthens AI-driven Raman spectroscopy, enabling more precise glucose monitoring and disease diagnosis.

Figure 20: Background correction result with polynomial order of 3

Figure 21: Background correction result with polynomial order of 7

Figure 22: Background correction result with polynomial order of 12

Figure 23: Background correction result with polynomial order of 16

I focus on x values ranging from 800 cm⁻¹ to 1800 cm⁻¹ because this range contains important vibrational bands that help analyze key biological molecules, including proteins, lipids, nucleic acids, and carbohydrates. These molecules play a vital role in cellular function, and their unique spectral signatures provide valuable insights into the structure and composition of biological tissues. By focusing on this specific range, I can ensure that the Raman spectroscopy analysis captures chemically relevant signals that are most useful for biomedical applications.

This range is highly relevant for protein analysis, as it contains amide I and amide II bands that reveal secondary structures such as alpha-helices and beta-sheets, crucial for understanding protein folding and associated biological abnormalities like misfolding diseases. Additionally, lipid-associated vibrations within this range enable detailed examination of cell membranes, lipid metabolism, and structural characteristics that distinguish healthy tissues from diseased ones, making it essential for biomedical research.

Nucleic acid vibrations from DNA and RNA, along with proteins and lipids, are detectable in this spectral window, providing valuable insights into genetic material interactions, molecular stability, and potential mutations linked to diseases. Additionally, carbohydrates (essential for cell signaling, energy storage, and maintaining structural integrity) exhibit prominent vibrational signals, highlighting their crucial role in cellular functions.

Their vibrational activity within this range makes them useful markers for biochemical differentiation.

The selected Raman spectral range of 800 cm⁻¹ to 1800 cm⁻¹ is crucial for minimizing water interference, which can distort signals at lower wave numbers. Water molecules strongly absorb light below 800 cm⁻¹, creating background noise that affects spectral accuracy. By focusing analysis within this range, the study effectively avoids interference signals, ensuring the examination of only relevant molecular features for more precise results.

Optimizing the spectral range in Raman spectroscopy enhances the accuracy and reliability of biomedical diagnostics by focusing on key vibrational bands linked to biologically significant molecules. This targeted approach improves signal clarity, facilitating more effective classification of diseased versus healthy tissues using machine learning models like Support Vector Machines (SVM). By selecting the most relevant spectral features, this method provides precise molecular insights, advancing AI-driven applications such as glucose monitoring and disease diagnosis with increased accuracy.

Data shuffle

To enhance the robustness and generalizability of the training model, the data was randomly shuffled using the shuffle_data function, which rearranges samples and their labels to prevent potential biases. This preprocessing step ensures that each training iteration exposes the model to a different order of data, promoting improved model performance and generalization. The shuffled dataset was then used as input for the training process, contributing to a more reliable and unbiased model.

By incorporating this step, we ensure that the training model is trained on a randomly ordered dataset, which contributes to the overall accuracy and reliability of the machine learning model.
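A minimal sketch of such a shuffle step, using scikit-learn's shuffle utility to keep samples and labels aligned, is shown below; the stand-in arrays are illustrative only.

```python
import numpy as np
from sklearn.utils import shuffle

def shuffle_data(samples, labels, seed=42):
    """Return the samples and their labels rearranged in the same random order."""
    return shuffle(samples, labels, random_state=seed)

# Small illustration: 20 spectra (rows) and their stage labels stay paired after shuffling
X = np.arange(20).reshape(20, 1)          # stand-in for the spectra matrix
y = np.array([0] * 10 + [1] * 10)
X_shuffled, y_shuffled = shuffle_data(X, y)
print(y_shuffled)
```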

FINDINGS AND DISCUSSIONS

CONCLUSION


References

19. Manzella, D., 2024. What Is the Fasting Plasma Glucose Test? [Online] Available at: https://www.verywellhealth.com/understanding-the-fasting-plasma-glucose-test-1087680 [Accessed 2025].

20. Anon., n.d. Glucose tolerance test. [Online] Available at: https://www.mayoclinic.org/tests-procedures/glucose-tolerance-test/about/pac-20394296?form=MG0AV3 [Accessed 2025].

21. Close, K., 2023. Oral Glucose Tolerance Test: Uses and Results. [Online] Available at: https://www.verywellhealth.com/the-oral-glucose-tolerance-test-1087684 [Accessed 2024].

22. NHS, 2024. Continuous glucose monitoring and hybrid closed loop for diabetes. [Online] Available at: https://www.nhs.uk/conditions/cgm-and-hcl-for-diabetes/?form=MG0AV3 [Accessed 2024].

23. Hadhazy, A., 2024. Can AI Improve Medical Diagnostic Accuracy? [Online] Available at: https://hai.stanford.edu/news/can-ai-improve-medical-diagnostic-accuracy?form=MG0AV3 [Accessed 2024].

25. Ho, B., 2021. Kernel methods and support vector machines. Japan. [Online] Available at: https://dokumen.tips/documents/kernel-methods-and-support-

27. Nomura, A., Noguchi, M., Kometani, M. et al., 2021. Artificial Intelligence in Current Diabetes Management and Prediction. [Online] Available at: https://link.springer.com/article/10.1007/s11892-021-01423-2?form=MG0AV3 [Accessed 2024].

28. SpectralAI, 2024. Artificial Intelligence in Medical Diagnosis: How Medical Diagnostics are Improving through AI. [Online] Available at: https://www.spectral-ai.com/blog/artificial-intelligence-in-medical-diagnosis-how-medical-diagnostics-are-improving-through-ai/?form=MG0AV3 [Accessed 2024].

29. Jupyter, 2024. Project Jupyter Documentation. [Online] Available at: https://docs.jupyter.org/en/latest/?form=MG0AV3

30. Gonzale, W. V., Mobashsher, A. T. & Abbosh, A., 2019. The Progress of Glucose Monitoring-A Review of Invasive to Minimally and Non-Invasive Techniques, Devices and Sensors. Sensors.

31. Tang, L., Chen, S. J. C. C.-J. & Liu, J.-T., 2020. Non-Invasive Blood Glucose Monitoring Technology: A Review. [Online] Available at: https://pmc.ncbi.nlm.nih.gov/articles/PMC7731259/ [Accessed 2024].

35. A, Y., 2022. Glucose Content Analysis using Image Processing and Machine Learning Techniques. International Conference on Information and Communications, Issue 5.

37. Podgornyy, A., 2025. Five AI Innovations That Will Redefine Healthcare In 2025. [Online]

38. Aayush, 2025. AI in Healthcare 2025: Real-World Data, Impact Analysis, and Future Trends. [Online] Available at: https://www.alltechnerd.com/ai-in-healthcare-2025-real-world-data/?form=MG0AV3&form=MG0AV3 [Accessed 2025].

39. Dadoo, J. K., 2025. Can AI Cheat Death? Future Of Healthcare In Algorithmic World. [Online] Available at: https://www.businessworld.in/article/can-ai-cheat-death-future-of-healthcare-in-algorithmic-world-549084?form=MG0AV3&form=MG0AV3 [Accessed 2025].

46. Jianhua Zhao, Harvey Lui, David I. McLean, Haishan Zeng, 2007. Automated Autofluorescence Background Subtraction Algorithm. The Laboratory for Advanced Medical Photonics (LAMP), Department of Dermatology and Skin Science, University of British Columbia.

48. Do Cong Tuan, 2024. Digital signal processing combined with machine learning in diabetes diagnosis.

50. Yan, L., Su, H., Liu, J. et al., 2024. Rapid detection of lung cancer based on serum Raman spectroscopy and a support vector machine: a case-control study. [Online] Available at: https://doi.org/10.1186/s12885-024-12578-y [Accessed 2025].

51. dos Santos, D.P., Sena, M.M., Almeida, M.R. et al., 2023. Unraveling surface-enhanced Raman spectroscopy results through chemometrics and machine learning: principles, progress, and trends. [Online] Available at: https://doi.org/10.1007/s00216-023-04620-y [Accessed 2025].

54. Kiranyaz, S., Ince, T. & Gabbouj, M., 2016. Real-Time Patient-Specific ECG Classification by 1-D Convolutional Neural Networks. [Online] Available at: 10.1109/TBME.2015.2468589 [Accessed 2025].
