
DOCUMENT INFORMATION

Basic information

Title: Medically Applied Artificial Intelligence: From Bench to Bedside
Author: Nicholas Chedid
Advisor: Richard Andrew Taylor, MD
Institution: Yale School of Medicine
Field: Medicine
Document type: Thesis
Year: 2019
City: New Haven
Pages: 67
File size: 673.96 KB

Structure

  • 1.1 Introduction (12)
    • 1.1.1 Ultrasound for Pericardial Effusion (12)
    • 1.1.2 Use of Neural Networks in Medical Imaging (13)
    • 1.1.3 Need for Data: a Call for Multicenter Collaboration (15)
  • 1.2 Methods (15)
    • 1.2.1 Image Acquisition and Classification (15)
    • 1.2.2 ResNet 20 (16)
  • 1.3 Results (18)
  • 1.4 Discussion (18)
  • 2.1 Introduction (20)
    • 2.1.1 Fractures in the Emergency Department (20)
    • 2.1.2 Image-to-Image Synthesis (22)
    • 2.1.3 Prior Work (22)
  • 2.2 Methods (23)
    • 2.2.1 Network Architecture (23)
    • 2.2.2 Image Acquisition and Preprocessing (25)
    • 2.2.3 Training (25)
    • 2.2.4 Postprocessing: Denoising (29)
    • 2.2.5 Visual Turing Test (29)
    • 2.2.6 Structural Similarity Index Measurement (SSIM) (30)
  • 2.3 Results (30)
    • 2.3.1 Visual Turing Test (33)
    • 2.3.2 Structural Similarity Index Measurement (SSIM) (35)
  • 2.4 Discussion (35)
  • 3.1 Introduction (37)
    • 3.1.1 Depression and its Diagnosis (37)
    • 3.1.2 Prior Work (39)
    • 3.1.3 Proposed Solution (41)
  • 3.2 Methods (43)
    • 3.2.1 Overview (43)
    • 3.2.2 Video Analysis (44)
    • 3.2.3 Audio Analysis (45)
    • 3.2.4 Pilot Studies for Gathering of First-in-Class Data (46)
    • 3.2.5 Need for Additional Data (46)
    • 3.2.6 Pilot Study with Medical Residents (48)
    • 3.2.7 Pilot Study at Ponce Health Sciences University (51)
    • 3.2.8 Pilot Study with Yale Emergency Department Patients (54)
  • 3.3 Results (56)
  • 3.4 Discussion (57)
  • 2.1 Multi-scale Discriminator
  • 2.2 X-ray Preprocessing
  • 2.3 Segmentation Preprocessing
  • 2.4 Pix2pix Generated X-ray Images Prior to Implementation of Leave-One-Out Cross-Validation
  • 2.5 Examples of Generated X-rays
  • 2.6 Generated vs Real X-rays Visual Turing Test Grid
  • 3.1 Video and Audio Neural Networks Accuracy

Content

Medically Applied Artificial Intelligence: From Bench to Bedside. Yale University, EliScholar – A Digital Platform for Scholarly Publishing at Yale, Yale Medicine Thesis Digital Library, School of Medicine.

Introduction

Ultrasound for Pericardial Effusion

Ultrasound technology was first introduced in the 1950s, but it gained widespread clinical use in the 1970s. The development of real-time ultrasound in the 1980s enabled its application in emergency situations. Today, Point-of-Care Ultrasound (POCUS) is a vital diagnostic tool in emergency departments, with extensive research focused on enhancing ultrasound techniques for assessing various clinical conditions.

Ultrasound is the preferred diagnostic method for pericardial effusion due to its speed, accuracy, wide availability, and non-invasive nature.

However, while some physicians have specific extended training in ultrasonography, there is concern regarding diagnostic variability between those who have received such training and those who have not.

Chapter 1 Deep Learning for the Detection of Pericardial Effusions in the Emergent Setting

A study published in Academic Emergency Medicine highlighted concerns regarding the accuracy of diagnosing pericardial effusion in the emergency room, revealing an overall sensitivity of 73% and specificity of 44% among residents and faculty at a Level 1 trauma center. Given the urgent nature of pericardial effusion, implementing a diagnostic support tool could significantly reduce errors. Convolutional neural networks (CNNs), a form of artificial intelligence increasingly utilized in imaging, may provide an effective solution for enhancing diagnostic accuracy.

Use of Neural Networks in Medical Imaging

Medical imaging consists of two main components: image acquisition and image interpretation. While advancements in image acquisition have led to faster and more accurate results, improvements in image interpretation have lagged behind. This delay is largely due to the reliance on human interpretation, primarily performed by physicians, which introduces limitations such as subjectivity, human error, fatigue, and variability among providers. Recently, technological aids for enhancing the image interpretation process have started to emerge.

Machine learning (ML), a subset of artificial intelligence (AI), enables systems to learn and enhance their performance from experience without explicit programming. Its application in medical imaging, especially in radiology and pathology, has grown significantly. Notably, deep convolutional neural networks (CNNs) have emerged as the leading machine learning technique in medical imaging research.

Neural networks mimic the structure and function of biological nervous systems, consisting of layers of interconnected neurons. Each neuron connects to those in the previous and next layers, with connections assigned specific weight values. Functioning similarly to logistic regression, neurons process inputs and produce an output, leading to a final error value. The model iteratively adjusts weights based on this error through a process known as backpropagation until the error reaches a minimum. After training on a dataset of images for maximum accuracy, the neural network is evaluated on a new set of images to assess its ability to generalize its predictions.
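
The iterative weight-adjustment loop described above can be sketched for a single logistic neuron. The data below are synthetic and the loop is a minimal illustration of the principle, not the thesis's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 toy examples, 3 input features
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w > 0).astype(float)       # labels derived from a known rule

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(3), 0.0, 0.5         # weights, bias, learning rate
losses = []
for epoch in range(200):
    p = sigmoid(X @ w + b)               # forward pass: neuron output
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)
    grad_w = X.T @ (p - y) / len(y)      # gradient of the error w.r.t. weights
    grad_b = np.mean(p - y)
    w -= lr * grad_w                     # adjust weights against the gradient
    b -= lr * grad_b
```

Each pass propagates the error back into weight updates until the loss settles near a minimum, which is the same principle backpropagation applies layer by layer in a deep network.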

Neural networks are increasingly utilized in the medical field for diverse applications, such as classifying skin cancer from pathology images, detecting pneumonia in chest X-rays, and identifying polyps during colonoscopy procedures.

The application of neural networks in ultrasound imaging is still in its infancy due to various challenges, including the complexity of ultrasound data, which consists of multiple frames rather than a single still image, and the limited availability of labeled information. Compared to other imaging techniques such as CT and MRI, ultrasound also suffers from lower resolution. In the case of echocardiograms, the variability in measurements and visible anatomy caused by the heart's movement adds another layer of difficulty. Initial studies have demonstrated that neural networks can effectively identify conditions such as hypertrophic cardiomyopathy and cardiac amyloidosis, achieving C-statistics of 0.93 and 0.84, respectively. However, research on ultrasound data collected in point-of-care settings remains scarce.

Need for Data: a Call for Multicenter Collaboration

The key factor in developing a high-performing neural network is the amount of labeled data available, such as ultrasounds categorized by the presence or absence of effusion. A larger dataset offers more information for the neural network to learn from, resulting in improved accuracy.

The demand for extensive data to train effective machine learning algorithms often exceeds the resources available at individual institutions, prompting experts to advocate for enhanced multicenter collaborations.

This paper presents a proof-of-concept neural network designed as a clinical decision support tool for diagnosing pericardial effusion in emergency situations. It emphasizes the necessity of enhanced multicenter collaboration to develop high-performing neural networks in this field.

Methods

Image Acquisition and Classification

Image acquisition and classification was done primarily by Nicholas Chedid.

Echocardiograms in DICOM format were collected from the Emergency Department's picture archiving and communication system (QPath) for adult patients (≥18 years) who underwent echocardiograms between March 2013 and May 2017. Only echocardiograms taken in the parasternal long axis view were included, to ensure optimal visualization of various cardiac pathologies. Furthermore, only echocardiograms with a minimum of two documented readings by physicians were considered for analysis.

A total of 1545 echocardiogram videos from 1515 patients were reviewed, with only those specifically addressing pericardial effusion included in the final dataset, resulting in 272 videos. Each DICOM was organized in a Yale Secure Box folder and documented in an Excel spreadsheet capturing essential details: medical record number (MRN), account number, accession number, study date, effusion status, strain presence, exit status, ejection fraction status, and the number of studies per encounter.

The videos were processed using a Docker package developed by Adrian Haimovich, which included anonymization by removing all identifying metadata and converting the videos into still frames. The final dataset comprised 12,942 still frames, with 80% (10,299 frames) allocated for training and the remaining 20% (2,643 frames) designated for testing.

ResNet 20

Work for building and tuning the ResNet architecture was done primarily by Nicholas Chedid.

A neural network was developed using Python and Keras with a Theano backend, specifically a 20-layer residual network (ResNet-20), an architecture recognized as a gold standard for image classification and computer vision, having won the ImageNet challenge in 2015. This architecture addresses the challenges of training plain deep networks, which often suffer from vanishing and exploding gradients, by utilizing stacked residual blocks. Because residual blocks use skip connections that take activations of one layer and feed them to much deeper layers, ResNets can be built much deeper.

Our network features 20 weighted layers with shortcut connections, utilizing primarily 3x3 convolutional filters. Each layer maintains the same number of filters for consistent output feature map sizes, while the number of filters is doubled whenever the feature map size is halved, to preserve time complexity. Downsampling is achieved through convolutional layers with a stride of 2. The architecture concludes with a global average pooling layer followed by a 1000-way fully-connected layer with softmax activation. To mitigate overfitting, L2 regularization is applied, and batch normalization enhances training speed and accuracy. The activation functions are predominantly rectified linear units (ReLU), except for the softmax classifier layer.
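
The skip connection at the heart of a residual block can be illustrated with a minimal NumPy forward pass. Dense layers stand in for the 3x3 convolutions to keep the sketch short, so this is illustrative only, not the thesis's Keras code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Identity-shortcut residual block: output = relu(F(x) + x)."""
    f = relu(x @ W1)        # first weighted layer + ReLU
    f = f @ W2              # second weighted layer (pre-activation)
    return relu(f + x)      # skip connection adds the input back

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16))                    # batch of 4, 16 features
W1 = rng.normal(scale=0.1, size=(16, 16))
W2 = rng.normal(scale=0.1, size=(16, 16))
out = residual_block(x, W1, W2)

# If the residual branch contributes nothing (all-zero weights), the block
# falls back to the identity path; this fallback is what keeps gradients
# flowing in very deep stacks of such blocks.
fallback = residual_block(x, np.zeros((16, 16)), np.zeros((16, 16)))
```
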

To enhance the accuracy of the neural network on the test set, various training iterations were conducted for hyperparameter tuning. The adjustable parameters included the number of epochs, the number of ResNet model layers, the learning rate, the L2 regularization coefficient, the batch size, and several data augmentation techniques: featurewise and samplewise centering, standard normalization, ZCA whitening, rotation range, width and height shift ranges, and horizontal and vertical flipping.

The optimal neural network configuration included 50 epochs, a ResNet model with 20 layers, a learning rate of 0.001, and an L2 regularization coefficient of 0.001. A batch size of 16 was used, along with the following data augmentation settings: featurewise center = off, samplewise center = off, featurewise standard normalization = off, samplewise standard normalization = off, ZCA whitening = off, rotation range = 180, width shift range = 0.15, height shift range = 0.15, horizontal flip = on, and vertical flip = on.
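
The flip and shift settings above can be mimicked in a few lines of NumPy. Rotation by an arbitrary angle requires an image library, so this sketch covers only the flip and shift components, and it is not the Keras augmentation generator the thesis actually used:

```python
import numpy as np

def augment(frame, rng):
    """Randomly flip and shift one frame, per the reported settings."""
    if rng.random() < 0.5:                                # horizontal flip = on
        frame = frame[:, ::-1]
    if rng.random() < 0.5:                                # vertical flip = on
        frame = frame[::-1, :]
    h, w = frame.shape
    dy = rng.integers(-int(0.15 * h), int(0.15 * h) + 1)  # height shift 0.15
    dx = rng.integers(-int(0.15 * w), int(0.15 * w) + 1)  # width shift 0.15
    return np.roll(np.roll(frame, dy, axis=0), dx, axis=1)

rng = np.random.default_rng(42)
frame = rng.random((64, 64))          # stand-in for one echocardiogram frame
augmented = augment(frame, rng)
```

Note that np.roll wraps shifted pixels around the border, whereas Keras pads; the point here is only that each epoch sees randomly perturbed copies of the same frames.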

Training was performed over nearly 19 hours on a desktop computer with three Titan X NVIDIA graphics cards with 8 GB of RAM each.

Due to system constraints, the number of layers and batch size could not be increased further. However, deepening the ResNet beyond 20 layers did not lead to a significant improvement in test accuracy, which remained at 92% across 200 epochs, as demonstrated in He et al. [14]. The ResNet code is publicly accessible on GitHub.

Results

After running the aforementioned ResNet for 200 epochs, we were able to achieve a final test accuracy of 71%. Our results can be seen in Table 1.1.

The table presents three columns: the percentage of the total dataset used for training, the final test accuracy, and the final train accuracy. The final train accuracy consistently ranged from 74% to 81%. Notably, the test accuracy showed significant improvement, rising from 49% to 71% as the training data increased from 20% to 80% of the dataset, highlighting the benefit of utilizing more training data for our ResNet model.

TABLE 1.1: Neural Network Performance in Identifying Presence or Absence of Pericardial Effusion

% of dataset used | Final Test Accuracy | Final Train Accuracy

Discussion

We have demonstrated the creation of a proof-of-concept neural network for a clinical decision support tool for pericardial effusion in the emergent setting, with an accuracy of 71%.

By comparison, the detection of pericardial effusions by academic emergency medicine physicians shows a sensitivity of 73% and a specificity of 44%. We are in the process of developing code to evaluate the sensitivity and specificity of our program as well.

Our neural network's accuracy improved progressively with larger percentages of the available data. Since our training data originated from Yale New Haven Hospital, one of the busiest emergency departments in the U.S. (with the third-highest number of ER visits in 2016), our findings indicate that further enhancements are possible with additional data. This underscores the importance of multicenter collaboration in gathering enough training data to develop high-performance algorithms that can support clinical decision-making.

Future developments will focus on code to evaluate the sensitivity and specificity of our program. To enhance accuracy, we plan to implement transfer learning from a ConvNet pre-trained on ImageNet, reformat the input data from still frames to short video clips, and explore the use of a Generative Adversarial Network (GAN) in place of a ResNet. Additionally, incorporating segmentations is expected to further boost performance.

Introduction

Fractures in the Emergency Department

Fractures are a leading cause of emergency department visits; some are easily identifiable on x-rays, while others require a radiologist's expertise for accurate diagnosis. In the high-pressure setting of the emergency department, subtle signs of fractures can be overlooked or misinterpreted, resulting in potential medical errors. A four-year study conducted in a busy district general emergency department quantified this issue: it identified 953 diagnostic errors, of which 760 (79.7%) were missed fractures [16]. The primary reason for diagnostic error in 624 of 760 (82.1%) of these patients with fractures was a failure to interpret radiographs correctly [16].

The annual incidence of fractures has been estimated to be as high as 100.2 per 10,000 in males and 81.0 per 10,000 in females [17].

Chapter 2 Fracture X-Ray Synthesis with Generative Adversarial Networks

Additionally, a delay in appropriate diagnosis may lead to worsened clinical outcomes and increased healthcare costs. Medical errors cost the United States $17 billion in 2008 [18].

A technology capable of automatically detecting fractures could significantly decrease medical errors, costs, and waiting times in emergency departments. However, training image analysis algorithms typically necessitates hundreds or thousands of manually annotated examples, making the annotation process both labor-intensive and time-consuming.

Developing automatic fracture detectors is challenging due to the variety of fracture types, which necessitates training multiple detectors. Each detector requires hundreds to thousands of manually annotated images for effective training. However, we present a method that simplifies this process by generating synthetic x-rays from procedurally created segmentations, allowing for the creation of annotated datasets with significantly reduced human effort.

Data augmentation is the process of increasing the total information provided by a training dataset by generating many variants of datapoints within the dataset.

In image processing, basic transformations such as rotation, scaling, and translation are crucial. By training algorithms on various examples of the same image with different rotations, they can learn to achieve rotational invariance. Similarly, exposing the algorithm to multiple resized versions of an image helps it develop scale invariance.

Simple image transformations fall short in teaching invariance to subtle features. To enhance this invariance, generating synthetic images for data augmentation can be beneficial. This study illustrates the feasibility of creating synthetic x-ray images through image-to-image synthesis to augment training datasets.

Image-to-Image Synthesis

Image-to-Image synthesis is the process of converting an image from an element of one domain to an equivalent image from an element of another domain.

Training image-to-image synthesis algorithms presents significant challenges due to the underconstrained nature of the problem, where multiple valid solutions exist for any given task.

A generative adversarial network (GAN) consists of two sub-networks: a generative model (G) and a discriminative model (D), which are trained through an adversarial process. The generative model learns to create synthetic images from a specific domain, while the discriminative model focuses on distinguishing between real images and the synthetic ones produced by G. This competitive training leads to a scenario where improving one model's performance typically results in a decline in the other's. Ultimately, the goal is a unique solution in which G accurately replicates the training data distribution and D outputs 1/2 for all inputs.
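
The adversarial process described above is the standard GAN minimax objective (after Goodfellow et al.); writing the generator as G and the discriminator as D:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

At the unique equilibrium of this game, G reproduces the training data distribution and D(x) = 1/2 for every input, matching the balanced outcome described above.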

Prior Work

Chuquicusma et al. [19] pioneered the use of generative adversarial networks (GANs) to synthesize lung cancer nodules within computed tomography (CT) images. They assessed the quality of these synthetic nodules through a "Visual Turing test," in which two radiologists attempted to differentiate between real and synthetic nodules. This approach to creating synthetic lung nodules suggests several future directions, such as incorporating quantitative image synthesis measures like the structural similarity index, utilizing the pix2pixHD method for higher resolution images, and generating entirely synthetic images rather than just components such as lung nodules.

Korkinof et al. [20] pioneered the use of GANs to create synthetic mammograms, assessing their quality through visual comparisons with real images. This approach to generating high-resolution mammograms also suggests future directions: implementing more stringent qualitative evaluations, such as the Visual Turing Test utilized by Chuquicusma et al., alongside quantitative assessments using metrics such as the structural similarity index. Additionally, the pix2pixHD method may yield even higher resolution images.

Methods

Network Architecture

This study employs the pix2pixHD network architecture, as outlined by Wang et al. The pix2pixHD approach enhances generative adversarial networks (GANs) by implementing a coarse-to-fine generator and a multi-scale discriminator, enabling the generation of high-resolution images while significantly reducing memory usage.

The coarse-to-fine generator comprises a global generator network, G₁, and a local enhancer network, G₂. The design of the global generator G₁ follows the architecture introduced by Johnson et al., which includes a convolutional front-end, a series of residual blocks, and a transposed convolutional back-end.

The input to the residual blocks of G₂ is the element-wise sum of two feature maps: the output of G₂'s own convolutional front-end and the last feature map from the transposed convolutional back-end of G₁. This integration carries information from the global network into the local enhancer.

The coarse-to-fine moniker describes the training method of the generator. First, G₁ is trained on lower-resolution versions of the original training images; then G₂ is appended to G₁; and finally the two networks are trained together on the full-resolution original images.

Utilizing a coarse-to-fine generator for producing higher resolution synthetic images presents a unique challenge, as traditional GAN discriminator designs struggle with these images. To effectively differentiate between real and synthetic higher-resolution images, a discriminator with a large receptive field is essential. This can be achieved through deeper networks or larger convolutional kernels, but these approaches risk overfitting and demand significantly more training memory. Wang et al. tackled this issue by designing a multi-scale discriminator that incorporates three discriminators.

D1, D2, and D3 are structured in a pyramid formation, featuring identical network architectures. Each discriminator processes images at a different scale, transitioning from lower to higher resolutions, as illustrated in Figure 2.1.

The multi-scale discriminator consists of three discriminators, D1, D2, and D3, all sharing the same architecture. The structure is pyramid-like, with each discriminator functioning at a progressively smaller scale and having a correspondingly smaller receptive field, ranging from D3 to D1.

Image Acquisition and Preprocessing

Image acquisition and preprocessing work was done primarily by Nicholas Chedid.

A total of 50 x-rays of femoral fractures were sourced from an online search, with the aim of demonstrating that a pix2pixHD pipeline can serve as a scalable tool for data augmentation across various fracture types. This approach significantly minimizes the manual effort and time required compared to traditional datasets, which often necessitate hundreds to thousands of labeled images for training fracture detection algorithms. By leveraging a smaller dataset, the pix2pixHD pipeline not only streamlines data acquisition but may also enable the training of accurate neural networks that previously struggled due to insufficient original data.

The 22 highest-quality images were then chosen for training and testing purposes. Afterwards, artifacts and labels were removed from these 22 x-rays using the GNU Image Manipulation Program (GIMP). Segmentations of these images were created in GIMP by drawing arcs and lines to represent bones and soft tissue. Both the x-ray images and the segmentations were then converted to squares and resized to 1024 x 1024 pixels to serve as input to the pix2pixHD model. As a final step, the segmentations' RGB pixels were programmatically converted to all 0s and 1s in order to use them as input to the pix2pixHD model. This work can be seen in Figures 2.2 and 2.3.
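
The final binarization step can be sketched in NumPy. The random array below merely stands in for a GIMP-exported RGB segmentation, and the any-channel-nonzero rule is an assumption about how the conversion was done:

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-in for a 1024 x 1024 RGB segmentation exported from GIMP.
seg_rgb = rng.integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)

# Map every RGB pixel to 0 or 1: pixels that were drawn on (nonzero in any
# channel) become 1, untouched background stays 0.
seg_binary = (seg_rgb.max(axis=2) > 0).astype(np.uint8)
```
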

Training

Coding, debugging, and training of the pix2pixHD neural network was done by both Nicholas Chedid and collaborator Praneeth Sadda.

Given our limited dataset of 22 x-rays, and in order to improve the accuracy of our pix2pixHD model, we utilized the leave-one-out cross-validation method, which is commonly used in machine learning research to improve accuracy for models trained on smaller databases [23, 24].

FIGURE 2.2: X-ray Preprocessing. The first row displays the original x-ray images; the second row shows the cleaned versions, free from artifacts and labels, produced using GIMP; the third row presents the processed x-rays, programmatically resized to 1024 x 1024 pixel squares for input into the pix2pixHD model.

FIGURE 2.3: Segmentation Preprocessing. The first row contains the bone and soft tissue segmentations of the x-ray images created using GIMP; the segmentations are resized to 1024 x 1024 pixel squares, and their RGB pixels are converted to binary 0s and 1s for input into the pix2pixHD model.

The leave-one-out cross-validation method involves training the machine learning model, pix2pixHD, on all data points except one, which is set aside for testing. This process is repeated until each data point has been used as a test image, resulting in 22 iterations with 21 images for training and 1 for testing. This approach improves performance on smaller datasets at the cost of increased computation, since the model must be retrained for each held-out data point, multiplying the total training cost by the number of data points in the dataset.
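
The fold structure of leave-one-out cross-validation over the 22-image dataset can be written out directly. The training call itself is omitted, since it stands in for a full pix2pixHD run:

```python
# Indices of the 22 x-ray/segmentation pairs.
dataset = list(range(22))

folds = []
for i, test_item in enumerate(dataset):
    train_items = dataset[:i] + dataset[i + 1:]   # the other 21 images
    folds.append((train_items, test_item))
    # here: train pix2pixHD on train_items, then evaluate on test_item
```
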

This trade-off suits our project, which emphasizes the scalability of our approach across various fracture types: our limiting factor is data, not computation.

The leave-one-out cross-validation method offers a significant advantage by utilizing almost the entire dataset for training in each iteration, which is thought to provide the most precise parameter estimates. This in turn enhances the model's generalizability, leading to better predictions on new data.

The training data was assembled by pairing the segmentations with their asso- ciated x-ray images while leaving one out for testing in the method described above.

Our networks underwent training for 200 epochs, starting with a learning rate of 0.0002 for the initial 100 epochs. The learning rate was then linearly decayed to zero over the following 100 epochs. Weights were randomly initialized from a Gaussian distribution with a mean of 0.

Postprocessing: Denoising

To enhance image quality and minimize noise artifacts, images generated by the pix2pix model will be processed through a convolutional denoising autoencoder. The quality of these images will then be evaluated using the Visual Turing Test and the Structural Similarity Index Measurement algorithm. Convolutional denoising autoencoders have proven effective in denoising medical images.
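
As a stand-in for the proposed convolutional denoising autoencoder, the sketch below trains a small dense autoencoder in NumPy to map noisy vectors back to their clean originals. It illustrates the principle only; the toy data, layer sizes, and learning rate are arbitrary choices, not the thesis's design:

```python
import numpy as np

rng = np.random.default_rng(3)
basis = rng.random((8, 32))
clean = rng.random((200, 8)) @ basis            # structured "clean images"
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

d, h, lr = 32, 16, 0.005                        # input dim, bottleneck, step size
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)

def forward(x):
    z = np.maximum(x @ W1 + b1, 0.0)            # ReLU encoder
    return z, z @ W2 + b2                       # linear decoder

losses = []
for _ in range(500):
    z, out = forward(noisy)
    err = out - clean                           # reconstruction error vs. clean
    losses.append(np.mean(err ** 2))
    gW2 = z.T @ err / len(err); gb2 = err.mean(axis=0)
    dz = (err @ W2.T) * (z > 0)                 # backprop through the ReLU
    gW1 = noisy.T @ dz / len(err); gb1 = dz.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

Because the targets are the clean vectors rather than the noisy inputs, the network is rewarded for stripping the noise away, which is the same training signal a convolutional denoising autoencoder uses on images.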

Visual Turing Test

Nicholas Chedid led the recruitment of radiologists for this study, while collaborator Praneeth Sadda developed the code to display both real and synthetic x-rays for their assessment.

A Visual Turing Test for assessing the quality of synthetic images produced by GANs was proposed by Chuquicusma et al. [19]. We follow a similar methodology here to evaluate our synthetic images.

We developed 10 Visual Turing Test experiments involving three radiologists: one resident and two attending MSK radiologists, including the division chief. The code for displaying the x-rays in these experiments has been implemented.

Our study involves five experiments featuring entirely generated x-rays and five with a combination of generated and real x-rays. Each experiment presents nine images arranged in a 3 by 3 grid, allowing radiologists to zoom in or adjust their view. Participants will be informed that a grid may contain all generated images, all real images, or a mix of both. Radiologists will then identify which images are real and which are generated, with an estimated completion time of under 30 minutes per radiologist.

We will quantitatively assess the outcomes of our Visual Turing Test to evaluate the quality of our synthetic x-rays by analyzing inter-observer variations, as well as calculating the False Recognition Rate (FRR) and True Recognition Rate (TRR).
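
With the responses in hand, FRR and TRR reduce to simple counts. The responses below are invented for illustration, and the definitions used (FRR as the fraction of generated images mistaken for real, TRR as the fraction of images of either kind called correctly) are our reading of common usage, not taken verbatim from Chuquicusma et al.:

```python
# (truth, radiologist_call) pairs for each displayed image — hypothetical data.
responses = [
    ("generated", "real"), ("generated", "generated"), ("generated", "real"),
    ("real", "real"), ("real", "generated"), ("real", "real"),
]

generated = [(t, c) for t, c in responses if t == "generated"]
frr = sum(c == "real" for _, c in generated) / len(generated)   # fooled rate
trr = sum(t == c for t, c in responses) / len(responses)        # correct rate
```

A high FRR means the synthetic x-rays routinely pass for real ones, which is the outcome the synthesis pipeline is aiming for.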

Structural Similarity Index Measurement (SSIM)

Assessment of pix2pix accuracy using the structural similarity assay will be done primarily by Nicholas Chedid.

A quantitative evaluation of image synthesis quality can be achieved through the structural similarity index measurement (SSIM), as outlined by Wang et al. [26]. Unlike traditional methods such as mean squared error (MSE) and peak signal-to-noise ratio, which focus on absolute errors, SSIM offers an objective assessment of perceptual image quality. This predictive model of perceived image quality is particularly valuable for our research.
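
The SSIM of Wang et al. can be computed globally over a pair of images in a few lines. Production implementations (e.g., scikit-image's) slide a local window across the image and average, so this single-window form is a simplification:

```python
import numpy as np

def ssim_global(x, y, L=255.0):
    """Single-window SSIM: compares luminance, contrast, and structure."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2   # stabilizing constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()          # covariance of the two images
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

rng = np.random.default_rng(0)
real = rng.integers(0, 256, size=(64, 64)).astype(float)
noisy = real + rng.normal(scale=25.0, size=real.shape)

perfect = ssim_global(real, real)    # identical images score 1.0
degraded = ssim_global(real, noisy)  # added noise pushes the score below 1.0
```
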

Once post-processing with the convolutional denoising autoencoder is completed, the SSIM will be computed between the resulting synthetic x-rays and their corresponding original x-rays.

Results

Visual Turing Test

After completing the postprocessing with a convolutional denoising autoencoder, the enhanced synthetic x-rays will be utilized for our Visual Turing Tests, as detailed in section 2.2.5. The code for these tests is already prepared. An example of a 3 by 3 grid comparing generated and real x-ray images for the Visual Turing Test is illustrated in Figure 2.6.

We envision displaying our results from the Visual Turing Test experiments (including FRR) in a manner similar to Chuquicusma et al. [19].

The Visual Turing Test Grid, depicted in FIGURE 2.6, features a 3 x 3 arrangement of both generated and real X-ray images To facilitate comparison between the two, the generated X-ray images are highlighted with a pink outline.

Structural Similarity Index Measurement (SSIM)

After completing the post-processing with a convolutional denoising autoencoder, I will utilize the Structural Similarity Index (SSIM) to quantitatively assess the perceptual similarity between the updated synthetic x-rays and the original x-rays, as outlined in section 2.2.6.

SSIM results will be reported in the format seen in Table 2.1.

Algorithm  | Avg MSE     | Avg SSIM
pix2pixHD  | 97.1 ± 34.6 | 97.1 ± 34.6

TABLE 2.1: Example table for SSIM results.

Discussion

The pix2pix method enables the synthesis of realistic x-rays from procedurally generated segmentations, demonstrating qualitative similarities to the original x-rays. The quality of these synthesized images will be rigorously assessed through Visual Turing Test experiments and the Structural Similarity Index (SSIM).

This study is the first to quantify synthetically generated medical images using the Structural Similarity Index (SSIM) It also pioneers the generation of complete synthetic x-ray images from scratch through the pix2pixHD method, which offers higher resolution Additionally, it is the first to assess the quality of these entire synthetic x-ray images using the Visual Turing Test.

Our demonstrated image synthesis method has the potential to enhance automated fracture detectors by addressing a core limitation of neural networks: their reliance on supervised learning and the availability of labeled data. Image synthesis can serve as an effective tool for data augmentation, particularly in scenarios where a fracture detector struggles to classify an image correctly. In such cases, retraining the detector with multiple closely related synthesized images could significantly improve its performance.

Image synthesis thus addresses the challenge of limited available data by generating relevant examples for retraining detection algorithms. Beyond improving detection accuracy, this approach also serves as a valuable method for creating training data for x-ray segmentation algorithms.

The use of Generative Adversarial Networks (GANs) to create synthetic examples can also enhance out-of-domain and novelty detection, enabling classifiers to identify unknown inputs. This approach can significantly improve the generalizability of automated fracture detectors.

Our trained synthesizer can also be used to better describe images (e.g., by learning features from the trained synthesizer).

The results were obtained using a small dataset sourced from easily accessible online information. This approach aimed to showcase the effectiveness of the pix2pixHD pipeline as a scalable solution for enhancing data augmentation in automated fracture detection across various fracture types, thereby minimizing manual effort and the reliance on extensive databases.

With this pipeline, acquiring and utilizing a large dataset for fracture detection requires significantly less effort than obtaining and labeling a conventional dataset of hundreds to thousands of images for training a single algorithm. Moreover, this tool allows for seamless adaptation to various fracture types, necessitating much less work to transition between them compared to traditional methods.

A promising next step involves training multiple neural networks on various tasks using small to moderately sized original datasets, augmented with synthetic images generated through customized iterations of the pix2pixHD pipeline. The goal is to determine whether this efficient method, which saves both time and manual effort, leads to improved performance of the neural networks.

Introduction

Depression and its Diagnosis

Depression is a disease with tremendous impact upon the human race. Globally, approximately 350 million people experience depression each year, and around 16.2 million adults in the United States had at least one major depressive episode in 2016. Notably, one in five adults in the US is estimated to have experienced depression at some point in their lives. The burden is particularly heavy among specific groups, including high-functioning professionals, adolescents, and individuals with chronic illnesses.

Depression is a prevalent mental health condition that leads to considerable suffering, disability, and increased mortality. It is estimated to cause more "years lost" to disability than any other chronic illness, surpassing conditions like back pain, lung disease, and alcohol abuse. Recent research indicates that individuals with depression experience higher rates of obesity, heart disease, and diabetes [31,32]. Finally, untreated major depression is well known to be the highest risk factor for suicide [33,34].

Despite the availability of various psychopharmacologic and psychotherapeutic treatments for depression, self-recognition and diagnosis pose significant challenges, with approximately two-thirds of depression cases in the U.S. remaining undiagnosed. Diagnoses are primarily based on clinical assessments by physicians, which can introduce bias and variability; for instance, field trials of the DSM-5 revealed a low intraclass Kappa of 0.28, indicating substantial differences in assessments among physicians. Additionally, obtaining a formal diagnosis necessitates a clinician visit, which can be a barrier for individuals with limited healthcare access, contributing to higher rates of undiagnosed depression. The subjective nature of depression assessments, coupled with the infrequency of mental health appointments, complicates the ability to track changes in psychiatric conditions over time.

Tracking the response to therapy in major depressive disorder poses a significant challenge for clinicians, as they must wait weeks to evaluate the effectiveness of treatment. This delay can result in lapses in care, leading to the unintentional neglect of patients with severe depressive symptoms. Alarmingly, over a quarter of individuals who complete suicide are reportedly receiving treatment for their condition. Therefore, there is a critical need for improved monitoring of patients who do not meet the criteria for mandatory inpatient admission.

The widespread occurrence and seriousness of undiagnosed and undertreated depression highlight the necessity for accessible and effective screening tools to identify depression symptoms. Implementing these tools is crucial for improving public health outcomes.

In response, the U.S. Preventive Services Task Force (USPSTF) has recommended routine depression screening in primary care clinical practices [40]. Survey-based methods such as the Patient Health Questionnaire (PHQ)-2 and PHQ-9 are the most commonly used screening tools in the primary care setting [41]. However, these surveys take time, rely upon patient interaction with a primary care doctor, and do not adequately address the risk of developing depression symptoms between often infrequent clinical visits. They also do not allow for monitoring of treatment between difficult-to-obtain clinical visits. What is needed, and does not currently exist, is a solution to depression screening that is scalable, easy to administer, timely, and allows continual assessment.

We introduce an innovative digital tool that leverages AI-driven facial and language recognition to assess depression risk. Using a 30-second video recorded on any smartphone with a front-facing camera, the technology analyzes both video and audio to deliver real-time insights into the user's depression risk. Individuals identified as high risk for major depressive episodes will receive tailored resources and potential referrals to clinicians or telepsychiatry services for further support. The technology is designed to be easy to use, broadly accessible, and accurate.

Prior Work

Major depression poses a significant public health challenge, yet it remains underrecognized compared to other diseases in terms of research and treatment strategies. While artificial intelligence is increasingly utilized in medicine, its application to mental health care has only recently been investigated. Our hypothesis is that machine learning can identify microexpressions and auditory cues, similar to those used by experienced psychiatrists, thereby contributing to the development of effective screening tools in mental health.

Others have suggested AI analysis can predict and diagnose depression [42], but most applications focus on using only a single modality, such as audio or text, and do not track changes in mood longitudinally.

Previous research on automatic depression detection has utilized the Audio-Visual Emotion Challenge (AVEC) datasets, which are currently the sole source of audiovisual data with verified depression labels. The AVEC 2013 and 2014 datasets offer video samples linked to BDI-II scores, while later AVEC challenges include interactive video samples and transcriptions related to PHQ-8 scores. Our focus is on the BDI-II survey, specifically the AVEC 2014 dataset, as the BDI-II is the most widely used depression assessment in research and provides a more detailed scoring range of 0-63 (compared to the 0-23 range on PHQ-8; these additional questions provide a possible source of metadata).

The AVEC 2014 dataset includes 150 interviews aimed at assessing patients for depression, organized into three equal sets: 50 for training, 50 for development, and 50 for testing. Each video in the dataset is linked to a BDI-II score, which ranges from 0 to 63, with actual scores varying from 0 to 45, indicating a predominance of non-depressed individuals. A higher BDI-II score signifies an increased risk of depression, and a cutoff score of 14 has been established in preliminary studies in accordance with NIH (NINDS) guidelines.

Previous approaches utilizing AVEC to analyze depression were limited by outdated techniques. These methods involved inefficient feature engineering, including the combination of speech style, eye activity, and head pose modalities in Support Vector Machines, the use of hand-engineered features in Random Forests, and the application of facial and head movement dynamics along with vocal prosody in logistic regression. Additionally, they relied on topic-modeling-based learning, which further constrained their effectiveness.

Neural networks have shown far greater accuracy than the aforementioned methods at analyzing complex behaviors, and our hypothesis is that the same will hold true in depression. Thus, we believe that they may outperform these previous techniques in screening for depression. While the advent of neural networks has offered a novel opportunity to improve accuracy, current approaches using neural nets have methodological flaws of their own. Two recent papers have used neural networks to predict PHQ-8 scores over audio, text, and visual data with a best reported Root Mean Square Error (RMSE) of 5.4 on a 27 point scale [51, 52]. RMSE can be simply understood as the average distance of a predicted value from the true value. For example, an RMSE of 6 on a 27 point scale would indicate that the values predicted by the neural network are on average 6 points away from the true values. Using text and audio analysis, Hanai and colleagues [53] achieved an RMSE of 6.7 on the same scale. By leaving out video analysis, their peak RMSE and ability to expand to other areas of application are limited. Additionally, the decision to use PHQ-8 as the ground truth misses out on the granularity inherent to the more comprehensive BDI-II. The best predictive results were achieved by Jan et al. [54], who used a multi-modal approach (video and audio), used BDI-II as ground truth, and achieved the best RMSE in the literature (7.4 on the 63 point BDI-II scale). However, further improvement for clinical utility will require an algorithm to reach an RMSE of 5 or less, since a BDI-II change of five points is considered "minimally clinically significant" [55]. We believe that further improvement is hampered by the small size of the AVEC training set and that the way forward is not only refinement of the algorithm but also the creation of larger and more robust datasets.
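For concreteness, the RMSE described above reduces to a few lines of Python:

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error: the square root of the mean squared
    difference between predicted and true scores."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Predictions that are each 6 points off give an RMSE of exactly 6,
# matching the worked example in the text.
assert rmse([10, 20, 30], [16, 14, 36]) == 6.0
```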

Proposed Solution

We are creating multi-modal deep neural networks to predict BDI-II scores by integrating visual and audio neural networks into a unified master model that combines the features and predicted scores from each individual network. Additionally, we plan to include text-based NLP analysis in the future to enhance our models.

Our models will make predictions using longitudinal data from multiple sessions, taking into account previous history and predictions, a novelty in this area of work. Additionally, our data collection will alleviate the data sparsity problems associated with current deep learning models and enable more expressive models to be developed.

Our innovative technology aims to enhance predictive capabilities by integrating various inputs to monitor individual mood fluctuations over time By enabling each participant to act as their own control, our method improves predictive accuracy and minimizes the subjectivity and inter-user variability commonly associated with psychiatric diagnoses.

Our approach also addresses the critical gap in access to psychiatric care: 75% of Americans own smartphones, yet many people remain unaware of their own mental health issues. By seamlessly integrating our solution into daily life, we can track mood changes over time, predict major depressive episodes, and monitor treatment effectiveness. This technology enables more frequent assessments than traditional clinic visits, ultimately democratizing mental healthcare for underserved populations and significantly reducing the prevalence of undiagnosed depression.

Our technology offers new insights into treatment efficacy and disease segmentation, which is especially beneficial for evaluating patients in clinical trials. We have developed an algorithm that utilizes multimodal inputs to detect signs of depression, with plans to extend our screening and treatment monitoring capabilities to other mental health disorders such as burnout, bipolar disorder, schizophrenia, Alzheimer's, and Parkinson's Disease. Furthermore, our innovative methodology will enable the application of our technology in clinical and acute care settings.

In the clinic, we imagine our active video analysis functionality providing valuable, non-invasive, immediate diagnostic and clinical support information during telepsychiatry, telemedicine, and in-person clinical encounters. Passive tracking, such as monitoring weeks of phone activity, is not able to do this.

We aim to create a cutting-edge neural network for predicting depression while also developing a unique longitudinal audio-visual database linked to depression scores. This pioneering data collection, which has historically been challenging to achieve, will provide videos associated with verified depression metrics. We anticipate that this resource will not only enhance our research but also contribute significantly to future academic studies and translational efforts in the field.

Methods

Overview

As described in the introductory sections, our overall neural network analysis can currently be split primarily into video and audio analysis.

The current dataset comprises 150 videos from the academic AVEC 2014 database, featuring individuals speaking to a webcam in German. The collection is split evenly into 50 videos each for training, development, and testing. Each video is associated with a BDI-II score ranging from 0 to 63, with actual scores predominantly falling between 0 and 45, indicating a bias towards non-depressed individuals.

We currently operate two distinct networks for processing audio and visual data. Our goal is to generate a final score by determining the optimal weighting between the outputs of these networks and the target BDI-II score. Additionally, we are exploring the possibility of integrating audio and video features into a unified neural network to enhance the learning of the BDI-II score.

Chapter 3 Neural Networks for Depression Screening & Treatment Monitoring 33

The main objective of this research is to acquire a substantially larger dataset to push accuracy beyond current literature standards. This effort will not only increase the volume of data, one of the most effective ways to boost neural network performance, but also ensure that the quality of the data surpasses all previously collected datasets, as detailed in section 3.2.4.

Video Analysis

Work for building and tuning the video neural net architecture was done primarily by my colleague Michael Day.

A 19-layer convolutional neural network was developed using Python scripts and Keras packages, operating on a TensorFlow backend. This style of neural network is a gold standard for image classification and computer vision tasks.

Our network comprises 19 weighted layers, featuring three sets of three convolution layers, each set followed by a 2x2 max-pooling layer. The 2D convolution layers utilize a 3x3 convolution with 32, 64, and 128 output filters, respectively. We then flatten the 3D feature maps into a 1D feature vector, which is processed through two fully-connected dense layers with 64 nodes each, incorporating 20% dropout. The network concludes with a single-node dense layer that outputs a predicted BDI-II score for each input set. Dense layers classify features extracted by the convolutional layers and downsampled by the pooling layers. Faces are detected using Haar cascade classifiers and resized to 48x48 grayscale images to enhance processing speed and standardize inputs across varying resolutions. The model is compiled with an Adam optimizer and employs mean squared error as the loss metric, primarily using rectified linear units (ReLU) as activation functions.
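The architecture above can be sketched in Keras roughly as follows. Only the layers named in the text are included (the thesis counts 19 weighted layers overall, so some details may be omitted here), and the "same" padding choice is an assumption, not stated in the text:

```python
# A sketch of the described video network in Keras; hyperparameters not
# given in the text (e.g. padding) are assumptions for illustration.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([keras.Input(shape=(48, 48, 1))])  # 48x48 grayscale face crops
for n_filters in (32, 64, 128):
    for _ in range(3):                        # three 3x3 convolutions per block
        model.add(layers.Conv2D(n_filters, 3, padding="same", activation="relu"))
    model.add(layers.MaxPooling2D(2))         # 2x2 max-pooling after each block
model.add(layers.Flatten())                   # 3D feature maps -> 1D vector
for _ in range(2):                            # two dense layers with 20% dropout
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.2))
model.add(layers.Dense(1))                    # single-node regression output (BDI-II)
model.compile(optimizer="adam", loss="mse")
```

With "same" padding, the three pooling steps reduce the 48x48 input to 6x6x128 feature maps before flattening, and the model ends in a single continuous output suitable for MSE regression against BDI-II scores.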


To enhance the neural network's accuracy on the test set, multiple training iterations were conducted for hyperparameter tuning. The key tunable variables included the number of epochs, batch size, samples per video, and various data augmentation features such as minimum face size, Haar cascade classifier scale factor, minimum number of neighbors, horizontal flip, and face image rescale size.

The ideal configuration used an effectively unlimited number of epochs, with training halting when no improvement was observed over several epochs; only the models exhibiting the lowest mean squared error were retained. The final parameters included a batch size of 128, 500 samples per video, a minimum face size of 30x30, a Haar cascade classifier scale factor of 1.1, a minimum of 5 neighbors for the Haar cascade classifier, horizontal flipping turned off, and a face image rescale size of 48x48.
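This early-stopping scheme can be expressed with standard Keras callbacks. The patience value and checkpoint filename below are assumptions for illustration, not values stated in the text:

```python
# Train for an effectively unlimited number of epochs, stop once the
# validation MSE stops improving, and keep only the best-scoring weights.
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
checkpoint = ModelCheckpoint("best_model.keras", monitor="val_loss",
                             save_best_only=True)

# model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
#           epochs=100_000, batch_size=128,
#           callbacks=[early_stop, checkpoint])
```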

Training was performed over 781 epochs for more than 72 hours on a desktop computer with one NVIDIA 980ti graphics card with 6 GB RAM.

Audio Analysis

Work for building and tuning the audio neural net architecture was done primarily by my colleague Alexander Fabbri.

The audio architecture currently extracts Mel-frequency cepstral coefficients (MFCCs), computed via the Fourier transform over the entire audio stream, yielding 13 features per timestep. Moving forward, additional features will be extracted, followed by feature selection to identify the most relevant ones. These selected features are then input into a Gated Recurrent Unit (GRU) with a hidden dimension of 100, as part of a gated recurrent neural network. By training on audio data labeled with BDI-II scores, the networks are designed to predict BDI-II scores for the test set. The datasets are organized into training, development, and test sets.
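The GRU at the core of this audio model can be sketched at the level of a single timestep in plain numpy. The weights here are random purely for illustration; the real network learns them from the MFCC features:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step: x is the MFCC feature vector for the current timestep,
    h the previous hidden state. W, U, b stack the update (z), reset (r),
    and candidate (c) parameters along axis 0."""
    z = sigmoid(W[0] @ x + U[0] @ h + b[0])        # update gate
    r = sigmoid(W[1] @ x + U[1] @ h + b[1])        # reset gate
    c = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
    return (1.0 - z) * h + z * c                   # new hidden state

rng = np.random.default_rng(0)
n_features, hidden = 13, 100   # 13 MFCCs in, hidden dimension 100 as in the text
W = rng.normal(size=(3, hidden, n_features)) * 0.1
U = rng.normal(size=(3, hidden, hidden)) * 0.1
b = np.zeros((3, hidden))

h = np.zeros(hidden)
for t in range(50):            # run over 50 MFCC frames
    h = gru_step(rng.normal(size=n_features), h, W, U, b)
```

The final hidden state summarizes the whole utterance; in the full model a dense layer maps it to a BDI-II prediction.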


Alex is enhancing accuracy by implementing cross-validation to assess algorithm results and exploring undersampling and oversampling methods for skewed data. To address data sparsity, binning techniques will be utilized, with the ultimate goal of conducting a fine-grained analysis of depression. The team plans to employ L1 and L2 regularization, along with Elastic Nets and dropout, to improve neural network generalization. Standard optimization methods like gradient descent and its variants, as well as advanced techniques such as super-convergence and 1-cycle learning rate scheduling, will be applied. Additionally, the analysis will incorporate textual input related to BDI-II scores through speech-to-text conversion and Natural Language Processing. The impact of "out of domain" data on neural network performance will also be investigated, with a focus on unbiased subgroup identification and on enabling algorithms to learn continuously without the need for retraining with each new data point.

Pilot Studies for Gathering of First-in-Class Data

Work for designing and implementing these pilot studies was done primarily by NicholasChedid.

Need for Additional Data

Currently, the only openly accessible audiovisual datasets related to depression are those provided by AVEC. However, these datasets exhibit several notable weaknesses, underscoring the necessity for more comprehensive data in this field.


1) The AVEC datasets correlated to BDI-2 scores consist of only 150 videos; it is a generally accepted maxim in machine learning that increasing training data is one of the most effective ways to improve algorithm performance.

2) The audio in these videos is only in German; having audio in several languages could enhance the generalizability of our algorithms.

3) This dataset, similar to many in medical research and the facial recognition space, consists of a relatively racially and ethnically monolithic participant population; facial recognition algorithms and medical research in general are hampered by non-diverse data, which limits the generalizability and applicability of such research. For example, prior studies show that the accuracy of facial recognition algorithms is sensitive to the demographic composition of both training and test data [57,58]. Numerous papers describe the importance of diverse patient populations in medical studies in general as well [59,60,61]. Inclusion of minority participants in NIH-funded research continues to be an ongoing issue; for example, since the NIH passed the Revitalization Act in 1993 to address this, less than 2% of the greater than 10,000 cancer clinical trials funded by the National Cancer Institute included sufficient minority participants to meet the NIH's own criteria [60].

4) The AVEC database does not contain longitudinal data, i.e., multiple videos and BDI-2 scores from participants over time. We hypothesize that audiovisual data collected longitudinally allows for more accurate prediction of BDI-II scores compared to data from a single encounter. One possible reason for improved performance would be the ability to measure a patient's delta, or relative change from assessment to assessment, as opposed to relying on just an absolute BDI-2 score.

5) The AVEC dataset lacks extreme BDI-2 scores, particularly at the higher end of scoring, with the majority of scores clustered in the low to intermediate range. This lack of data significantly hampers the ability of any algorithm trained on this data to detect more significant depression.


We are nearing the completion of IRB approval for our inaugural pilot study at Ponce Health Sciences University in Puerto Rico, aimed at addressing the identified shortcomings Additionally, we are initiating several other pilot studies focused on emergency room patients and medical residents to further tackle these issues.

Pilot Study with Medical Residents

The development of our pilot studies with medical residents was driven by the alarming rates of depression and burnout observed among medical trainees. A meta-analysis by Mata et al. published in JAMA revealed a 28.8% prevalence of depression among resident physicians. Furthermore, a multicenter study by Williford et al. in JAMA Surgery found that 75% of surgical residents experienced burnout, with 39% suffering from depression. In contrast, the general population has a depression prevalence of about 9%, and for adults aged 25-34, it ranges from 12-13%. In response to these concerning statistics, the Accreditation Council for Graduate Medical Education (ACGME) has implemented new requirements mandating residency programs to prioritize physician well-being and enhance mental health screening for residents.

A pilot study focused on enhancing mental health screening in medical trainees would not only fulfill a critical need but also rectify several limitations of the AVEC dataset. This initiative would significantly expand our data volume, incorporate English alongside German, enhance ethnic and racial diversity, provide longitudinal insights, and potentially yield a broader range of data with higher BDI-2 scores.

The prevalence of depression in medical trainees is more than double that of their same-age peers (29% vs. 12-13%), as described above within this same subsection.

I collaborated with Dr. Rosemary Fischer, Director of Resident and Fellow Well-Being at Yale New Haven Hospital, to initiate a pilot study aimed at enhancing the accuracy of our neural networks through additional training data. Subsequently, I presented our proposed pilot study and technology at the Yale Innovation Summit, where I also showcased a poster titled "Artificial Intelligence for the Detection of Psychiatric Disease," which received the Best Tech Poster award.

The feedback from my presentations proved invaluable, leading to an oral presentation titled "An AI-enabled mobile gaming platform for the early detection of psychiatric disease" at the Stanford Medicine X ED conference. During this event, I met the President of Ponce Health Sciences University (PHSU) in Puerto Rico, and we discussed our shared interest in conducting a pilot study at PHSU. He subsequently invited me to present a pilot proposal to the Deans of PHSU, which is elaborated on in section 3.2.7.

Our primary objective is to validate the accuracy of our algorithm in predicting mild depression or greater, as measured by the BDI-II instrument. We aim for our algorithm to achieve a sensitivity of 75% and a specificity of 85% when identifying individuals with a BDI-II score of 14 or above.

∗ A BDI-II score of 14 or greater corresponds to depression ranging from mild to severe

∗ Primary care physicians have a sensitivity of 51% and specificity of 87% at detecting depression without an instrument [65]


∗ The most common screening instrument (PHQ-9) has a sensitivity of 74% and specificity of 91% at detecting depression [66]

We aim to achieve comparable specificity while enhancing the sensitivity of our screening technology beyond that of the PHQ-9 and primary care standards, as this is our primary focus during the initial development phase.
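Given predicted and true BDI-II scores, the sensitivity and specificity targets above reduce to a simple confusion-matrix calculation at the cutoff of 14. The scores below are illustrative, not study data:

```python
def screen_metrics(scores_true, scores_pred, cutoff=14):
    """Sensitivity and specificity of predicted BDI-II scores against
    true scores, treating the study's cutoff of 14 as a positive screen."""
    pairs = list(zip(scores_true, scores_pred))
    tp = sum(1 for t, p in pairs if t >= cutoff and p >= cutoff)  # caught
    fn = sum(1 for t, p in pairs if t >= cutoff and p < cutoff)   # missed
    tn = sum(1 for t, p in pairs if t < cutoff and p < cutoff)    # cleared
    fp = sum(1 for t, p in pairs if t < cutoff and p >= cutoff)   # false alarm
    return tp / (tp + fn), tn / (tn + fp)

# Four depressed cases with three caught; five non-depressed with four cleared.
true = [20, 30, 15, 14, 5, 0, 10, 13, 8]
pred = [22, 28, 16, 9, 4, 2, 15, 10, 7]
sens, spec = screen_metrics(true, pred)
assert sens == 0.75 and spec == 0.8
```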

The objective of this aim is to show that longitudinal analysis of users' audiovisual data can effectively identify significant changes in BDI-II scores. To achieve this, our algorithm must predict BDI-II scores with a root mean square error (RMSE) of under 7, an accuracy sufficient to detect clinically relevant changes.

Measuring changes over time is crucial for identifying individuals at risk of transitioning between depressive and non-depressive states, as well as for monitoring the progress of patients with depression who are receiving treatment.

∗ An approximate 5-point change in the BDI-II score corresponded to a minimal clinically meaningful change in severity according to DSM-

We aim to achieve a root mean square error (RMSE) of less than 7 by the end of Phase 1 of our STTR grant; an RMSE below 7 on real-world data would already exceed existing standards. Our goal is then to reach an RMSE of 5 within twelve months of completing Phase 1.

This study will include all residents at Yale New Haven Hospital, with a recruitment target of 150 participants using a consecutive sampling strategy. Our goal is to have 100 medical residents complete the study, defined as submitting one survey each month. Participants will receive reimbursement upon successful completion of the study.


Participants will receive an email link to download the Sol application, while those without smartphones will access a weekly Qualtrics survey. They will answer a simple question, such as "How was your day yesterday?", in either Spanish or English, based on their preference. Care will be taken to avoid potentially triggering questions. After submitting their video response through the Sol app or Qualtrics, participants will complete a BDI-2 survey. All responses will be securely tagged to the corresponding video and stored on HIPAA-compliant servers for analysis by predictive AI algorithms.

We are in the process of applying for an NIMH STTR grant to support this pilot and the Emergency Department pilot, with plans to initiate the pilot in October as grant funds become available. The pilot will last 12 months: 3 months for enrollment, 6 months for data collection, and 3 months for data analysis.

Pilot Study at Ponce Health Sciences University

I was invited to present a pilot study proposal on an AI technology designed for screening depression among healthcare students to the President and Deans of PHSU in Puerto Rico.

PHSU recognized the urgent need for enhanced mental health support for healthcare trainees and enthusiastically agreed to collaborate on our proposal. We recruited two psychology PhD students to conduct the pilot study locally, with Dr. Nydia Ortiz, Dean of the School of Behavioral and Brain Sciences and former Director of the Puerto Rico Mental Health and Substance Abuse Administration, serving as the principal investigator.


The data collected from the PHSU pilot will enhance the AVEC dataset by increasing its size, incorporating Spanish data alongside English and German, and improving racial and ethnic diversity, particularly among Hispanic participants. This initiative also aims to obtain longitudinal data and to gather a broader range of data with higher BDI-2 scores, which are expected to be more varied than those in the AVEC dataset. Additionally, the resident pilot is anticipated to capture higher BDI-2 scores due to the significant prevalence of depression among resident trainees.

Our pilot is titled: An AI-enabled mobile application for the rapid assessment and risk stratification of depression in medical professionals.

• Objective 1: Gather audiovisual data to identify patterns in facial and linguistic expressions, along with other significant predictors, that aid in recognizing depression within the study population.

• Objective 2: Compare the effectiveness of an AI-powered facial and linguistic analysis algorithm to detect signs of depression as compared to a BDI-2 questionnaire.

• Objective 3: Validate the feasibility and utility of rapid, automated psychiatric risk stratification via a mobile interface

This study will include healthcare students aged 21 and older from Ponce Health Sciences University (PHSU), 21 being the legal age for medical consent in Puerto Rico. We will use a consecutive sampling strategy to recruit an estimated 300 to 400 students, with the goal of having 150 participants complete the study by filling out one survey each month. The study will consist of two primary arms, each with an equal number of participants, one conducted in English and the other in Spanish.


Participants in the study will receive an email link to download the Sol application, which features a user-friendly touch interface for video recording. This app serves solely as a data collection tool, not for diagnostics. For those without smartphones, a weekly Qualtrics survey link will be provided. Both the application and survey will be available in Spanish and English. During registration, participants will rate their proficiency in both languages on a 5-point Likert scale, with scores ranging from basic to native. Participants scoring 3 or higher in only one language will complete the study in that language, while those scoring 3 or above in both will be randomly assigned to one of the two languages.

Participants will use the Sol app or Qualtrics to answer a simple question, such as "How was your day yesterday?", every other week; care will be taken to ensure questions are non-triggering. Each week, they will also indicate whether they have a clinical diagnosis of depression or are receiving treatment for it. After submitting their video response, either automatically through the Sol app or manually via Qualtrics, participants will complete a BDI-2 survey. The entire process is expected to take about 5 minutes.

Responses will be linked to the corresponding video and stored on secure, HIPAA-compliant servers for analysis by a predictive AI algorithm. These servers run on Amazon Web Services' HIPAA-compliant platform. Access to the information on these servers is restricted to study programmers, who will use the data to improve the AI algorithm.

The IRB review is in the final stages of approval. We aim to begin recruitment this March. The pilot will run for 6 months.


Pilot Study with Yale Emergency Department Patients

While preparing our NIMH STTR grant, we identified a key limitation in the AVEC data: too few extreme BDI-II scores, especially at the higher end. This gap could hinder our algorithms' ability to detect severe depression. Conducting a pilot study in the emergency department would let us selectively recruit depressed patients, addressing this issue.

A limitation of conducting a pilot study in the emergency department is the absence of longitudinal data; this is counterbalanced by the earlier pilot studies, which do provide such data. The non-longitudinal design also lowers the reimbursement per participant, allowing a larger number of participants to be recruited. Consequently, while the data will lack longitudinal depth, the diversity of participants offers valuable breadth for analysis.

Our specific aim is the same as specific aim 1 in Section 3.2.6, since both pilots are part of the same NIMH STTR grant application:

Our primary objective is to validate the accuracy of our algorithm in predicting mild depression or worse, as measured by the BDI-II instrument. We aim for the algorithm to demonstrate a sensitivity of 75% and a specificity of 85% when identifying individuals with a BDI-II score of 14 or above.

∗ A BDI-II score of 14 or greater corresponds to depression ranging from mild to severe

∗ Primary care physicians have a sensitivity of 51% and specificity of 87% at detecting depression without an instrument [65]


∗ The most common screening instrument (PHQ-9) has a sensitivity of 74% and specificity of 91% at detecting depression [66]

We aim to achieve comparable specificity while surpassing the sensitivity of the PHQ-9 and primary care, as our primary focus is the initial development of a screening technology.
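The targets above would be checked against held-out labels as a standard confusion-matrix calculation. A minimal sketch, assuming binary ground-truth labels derived from the BDI-II score-of-14 cutoff; this is illustrative, not the study's evaluation code:

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity and specificity for a binary screening test.

    y_true: 1 if the participant's BDI-II total is >= 14 (mild or worse),
            else 0. y_pred: 1 if the algorithm flags the participant.
    Function name and label encoding are assumptions for illustration.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Against the stated aims, a run of the pilot would pass if sensitivity is at least 0.75 and specificity at least 0.85 on the held-out set.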

This study targets patients over 18 years old in the Yale New Haven Hospital Emergency Department and Crisis Intervention Unit (CIU) who exhibit clinical signs of depression. Individuals with excessive agitation or a history of schizophrenia or schizoaffective disorder will be excluded. Participants will complete the study immediately after enrollment by recording a video response and filling out the BDI-II survey, a process that takes under 5 minutes. Upon completion, participants will receive reimbursement. The study aims to enroll 400 participants over a 7-month period.

Participants in the study will complete the survey using either the Sol app or Qualtrics on designated Emergency Department iPads. They will answer a simple question, such as "How was your day yesterday?", in their preferred language, with options available in both Spanish and English. Care will be taken to avoid potentially triggering questions. After submitting their video response, participants will complete a BDI-II survey, with each response linked to the corresponding video. All data will be securely stored on HIPAA-compliant servers for analysis by predictive AI algorithms.

We are preparing to submit an NIMH STTR translational grant on April 1st to secure funding for both this pilot and the medical resident pilot, with the goal of launching the pilot in October once funds from the grant disburse. The pilot will run for 12 months: 9 months of simultaneous enrollment and data collection and 3 months of data analysis.
