Predicting suicidal behaviour without asking about suicidal ideation: Machine Learning and
the role of Borderline Personality Disorder criteria
Adam Horvath, Mark Dras, Catie C.W. Lai, Simon Boag
Macquarie University
Author Note
Adam Horvath, Department of Psychology, Macquarie University; Mark Dras, Department of Computing, Macquarie University; Catie C.W. Lai, Department of Psychology, Macquarie University; Simon Boag, Department of Psychology, Macquarie University.
Correspondence concerning this article should be addressed to Adam Horvath, Department of Psychology, Macquarie University, New South Wales, 2019. Email:
adam.horvath@mq.edu.au
Abstract
Identifying factors that predict who may be at risk of suicide could help prevent suicides via targeted interventions. It is difficult at present, however, to predict which individuals are likely to attempt suicide, even in high-risk populations such as Borderline Personality Disorder (BPD) sufferers. The complexity of person-situation dynamics means that relying on known risk factors may not yield accurate enough results for prevention strategies to be successful. Furthermore, risk models typically rely on suicidal thoughts, even though it has been shown that people often intentionally withhold this information. To address these challenges, this study compared the performance of six machine learning and categorisation models in terms of accurately identifying suicidal behaviour in a prison population (n = 353), by including or excluding questions about previous suicide attempts and suicidal ideation. Results revealed that modern machine learning algorithms, especially gradient tree boosting (AUC = .875, F1 = .846), can accurately identify individuals with suicidal behaviour, even without relying on questions about suicidal thoughts, and this accuracy can be maintained with as few as 29 risk factors. Additionally, based on this evidence, it may be possible to implement a decision tree model using known predictors to assess individuals at risk of suicide. These findings highlight that modern classification algorithms do not necessarily require information about suicidal thoughts for modelling suicide and self-harm behaviour.
Keywords: suicide prevention, borderline personality disorder, machine learning,
prediction, classification, bpd, tree boosting
Predicting suicidal behaviour without asking about suicidal ideation: Machine Learning and
the role of Borderline Personality Disorder criteria
Suicide is a major global health issue, with 800,000 deaths by suicide each year. Additionally, for each suicide, there are an estimated 20 or more suicide attempts (World Health Organization, 2014). Identifying factors that predict who may be at risk of suicide could help prevent suicides via targeted interventions and so reduce deaths (Mann et al., 2005). There is broad agreement, however, that it is particularly challenging to identify who will die by suicide (Franklin et al., 2017; Pestian et al., 2017). Among these challenges are methodological limitations, which have prevented testing the complex interaction of factors associated with suicide risk (Franklin et al., 2017). Regardless, some populations are known to be more at risk than others. Borderline personality disorder (BPD), for instance, is especially associated with elevated suicide risk (Chesney et al., 2014) and has an estimated community prevalence of around 1–2% (Black et al., 2004; Gunderson et al., 2013). BPD is characterised by instability of self-image, interpersonal relationships, and affects, accompanied by impulsivity, risk-taking, and potential hostility and suicidal ideation (American Psychiatric Association, 2013). Compared to the rest of the population, rates of deliberate self-harm (69–80%), suicide attempts (75%), and completed suicide (10%) are much higher in persons with BPD (Black et al., 2004; Brown & Chapman, 2007). Suicide attempt rates are especially high when BPD sufferers are in their twenties (American Psychiatric Association, 2006), while the suicide completion rate is highest for individuals in their thirties (Biskin, 2015). Presently, however, it is difficult to predict which individuals are likely to attempt suicide, even in high-risk populations such as BPD sufferers.
There are known risk factors for suicide which, in theory, could help with prediction. Feelings of helplessness, sadness, anxiety, and negative affect, for instance, are known to be associated with suicidality (Podlogar et al., 2018). Regarding BPD, crisis-generating behaviour, interpersonal conflict, and stressors resulting from past impulsive or avoidant behaviours (e.g., accruing a large debt) are also known risk factors (Brown & Chapman, 2007). Merely relying on known risk factors, however, may not yield accurate enough results for prevention strategies to be successful (Franklin et al., 2017). For example, even seemingly important cues, such as expressed suicidal ideation, can have low predictive validity: only around 40% of people who die by suicide express suicidal thoughts at an earlier time (McHugh et al., 2019). This may be partly due to factors such as impulsive, unplanned suicide attempts, or concerns regarding the outcome of disclosure (e.g., stigma, and being hospitalised or medicated) (Richards et al., 2019). In the latter case, evidence indicates that certain populations are reluctant to disclose such information. For example, older people (Heisel et al., 2010), some cultures (Takeuchi & Nakao, 2013), outpatients (Earle et al., 1994), and children and adolescents (Bolger et al., 1989) are all known to withhold information about suicidal ideation. Suicide questionnaires may consequently fail to identify the majority of people who would attempt suicide shortly after answering the questions (Richards et al., 2019). Furthermore, health professionals may feel uncomfortable asking about suicidal ideation, fearing that such questions might increase the likelihood of a suicide attempt, even though this does not appear to be the case (Bajaj et al., 2008; Stoppe et al., 1999). As such, suicide prevention strategies that could accurately identify at-risk individuals without relying on reported suicidal ideation may have practical advantages.
Franklin et al. (2017) state that prediction accuracy has not improved significantly over the past 50 years. For example, suicide and self-harm prediction models employing logistic regression tend to reach AUC scores of only .6–.7 (e.g., Horton, 2018; Kessler et al., 2017; Walsh et al., 2017). An AUC score can be interpreted as the probability that any randomly selected suicidal subject would receive a higher prediction score than any randomly selected non-suicidal subject (Fawcett, 2006). As such, an AUC score of .5 indicates random classification performance, and models producing scores of .6–.7 generally provide only poor levels of discrimination. Furthermore, around 80% of studies investigating the predictability of suicide rely on only five broad risk factors for predicting suicidal behaviour, and the typical accuracy of these algorithms is only slightly better than chance (Franklin et al., 2017). This suggests overall that simple models built on relatively few risk factors, and not accounting for complex person-situation dynamics, cannot provide high enough accuracy to target individuals at risk.
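This probabilistic reading of AUC can be checked directly. The sketch below, on synthetic scores rather than any study data, computes AUC once with scikit-learn and once by counting how often a randomly chosen positive case outscores a randomly chosen negative case; the two calculations agree.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic scores: in this toy example, higher means "more at risk".
rng = np.random.default_rng(0)
y_true = np.array([0] * 50 + [1] * 20)
scores = np.concatenate([rng.normal(0.3, 0.2, 50), rng.normal(0.6, 0.2, 20)])

# AUC via the library implementation.
auc = roc_auc_score(y_true, scores)

# AUC via its probabilistic interpretation: the chance that a randomly chosen
# positive case receives a higher score than a randomly chosen negative case
# (ties counted as 0.5).
pos, neg = scores[y_true == 1], scores[y_true == 0]
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_pairwise = wins / (len(pos) * len(neg))

print(round(auc, 4), round(auc_pairwise, 4))
```

The equality holds because the area under the ROC curve is exactly the Mann–Whitney U statistic normalised by the number of positive/negative pairs.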
These findings have led some researchers to believe it is not possible to predict suicide attempts (Black et al., 2004). However, such conclusions are based on algorithms, discussed above, which are not suitable for addressing the complex nonlinear interactions between a large number of predictors. Although algorithms identifying suicidal behaviour using relatively few predictors may not yield practical results, more recent attempts using machine learning (ML) have demonstrated much higher prediction accuracy (Burke et al., 2019). Walsh et al. (2017), for instance, achieved relatively high suicide prediction accuracy using ML models based on health records of adults with deliberate self-harm (AUC = .84). Jung et al. (2019) also achieved similar accuracy (AUC = .86) using gradient tree boosting modelling, identifying adolescents with suicidal ideation and suicidal behaviour.
Unlike conventional statistical models, such ML methods can be applied even where there are a large number of predictors relative to the sample size. ML methods in these circumstances avoid overfitting the training data (i.e., learning a set of model parameters that may give zero error on the training data but does not generalise to other data) through a range of techniques (Hastie et al., 2001). For example, regularisation, which is frequently used in ML models, penalises model complexity; cross-validation, or testing the model's accuracy on a held-out dataset, allows an estimate of the average generalisation error when the method is applied to a new, independent test sample. Hastie et al. (2001, pp. 247–9) showed that these approaches yield accurate models even for a large number of predictors relative to sample size.
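As a minimal illustration of these two safeguards, the sketch below fits an L2-regularised logistic regression with 5-fold cross-validation on synthetic data whose predictor-to-sample ratio loosely mirrors this setting; all numbers are illustrative, not from the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Many predictors relative to sample size (cf. the study's 641 predictors
# for 353 participants); the counts here are illustrative only.
X, y = make_classification(n_samples=300, n_features=600, n_informative=20,
                           random_state=0)

# L2 regularisation (C is the inverse penalty strength) discourages overly
# complex fits; 5-fold cross-validation estimates generalisation error by
# scoring each fold's held-out data.
model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(cv_auc.mean())
```

Lower values of C apply a stronger penalty; in practice the penalty strength itself would be tuned inside the cross-validation loop.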
Given the complexity of person-situation dynamics, accurate prediction of suicidal thoughts and behaviours may need to take into account hundreds of risk factors (Franklin et al., 2017). It is also possible, however, that after the aforementioned regularisation, accurate prediction would only require a much smaller subset of risk factors. For instance, Ribeiro et al. (2019) achieved good accuracy predicting suicidal ideation and attempts (AUC = .83–.89) using random forest with 51 variables. This finding demonstrates that ML methods can accurately predict suicide-related behaviours using a relatively small number of risk factors. Nevertheless, to translate these risk algorithms into clinical practice, the predictors need to be reduced to a more manageable and practical number.
Reducing the number of predictors is also beneficial with respect to interpretation. Even though ML techniques may potentially achieve high enough accuracy to target individuals with suicidal behaviour (Walsh et al., 2017), they often work as a black box. This makes it difficult to understand which risk factors predict suicide attempts, and such models can be prohibitively complex to interpret. Recent ML research has started to address this by focusing on both the interpretability and visualisation of these complex models, regaining some insight into how the combination of predictors contributed to the output of a model (Lundberg & Lee, 2017; Tan et al., 2018). As such, there are avenues available for addressing this potential limitation. Some of these easy-to-interpret models provide a much clearer, simpler indication of how various parts of the complete model operate. These include decision trees, which can provide a compromise between the complexity and accuracy of ML models and the interpretability and simplicity of more traditional models (Quinlan, 1987). Decision tree-based modelling involves creating a tree-like structure that represents questions and answers, and their consequences, such as the chance of any given outcome. Recent research focusing on the prediction of suicidal ideation and suicidal behaviour has been able to build decision tree models with potential clinical significance (e.g., Batterham & Christensen, 2012; Handley et al., 2014; Mann et al., 2008). For instance, the decision tree model by Handley et al. (2014) accurately predicted (AUC = .81) suicidal ideation in older adults in a five-year follow-up study. Consequently, decision trees might provide a viable middle ground between modern ML algorithms and simpler models for successfully predicting suicidal behaviour.
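A small scikit-learn sketch of this idea, using a stand-in public dataset rather than the prison data, shows how a shallow decision tree can be printed as plain question-and-answer rules:

```python
from sklearn.datasets import load_breast_cancer  # stand-in clinical-style data
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# A shallow tree trades some accuracy for a rule set a clinician can read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested if/else questions, with the
# predicted class at each leaf.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

Each branch reads as a threshold question on one variable, which is exactly the question-and-answer structure described above.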
Given that results to date using simple models indicate generally poor predictive validity of suicidal behaviour (e.g., Horton et al., 2018; Kessler et al., 2017), and that modern ML algorithms show promising potential in identifying individuals at risk of suicide (Burke et al., 2019), the present study compared the performance of different ML and categorisation models in terms of predicting suicide attempts. The ML models studied include random forest and tree boosting, which typically yield better prediction than traditional algorithms such as logistic regression (Couronné et al., 2018; Neumann et al., 2004). We also tested the accuracy of neural network models which, in some cases, outperform even random forest models (Jaimes et al., 2005; Raczko & Zagajewski, 2017; Were et al., 2015). The present study applied these ML algorithms to retrospective modelling based on prison data to determine whether suicidal behaviour could be successfully predicted. Previous findings indicate that retrospective and predictive modelling of suicidal behaviour achieve similar accuracy (Walsh et al., 2017), which suggests that the same datasets and models can be used for both predictive and retrospective modelling of suicidal behaviour, addressing a potential limitation of our retrospective approach. Prisons tend to have a higher suicide rate than the rest of the population (Naud & Daigle, 2013), and prisoners have exceptionally high rates of personality disorders, including antisocial personality disorder (APD) and BPD (Fazel & Danesh, 2002). Some researchers argue that these disorders are merely different representations of the same underlying psychopathology (Paris, 1997), and further studies show that APD and BPD share at least some behavioural and neurobiological background (Black et al., 2010; Buchheim et al., 2013). As such, ML models could potentially help both with building accurate suicide prediction models and with identifying how BPD and APD contribute to the risk of suicide attempts.
In summary, the present study sought to improve on existing ML models by comparing the performance of six machine learning and categorisation models in terms of accurately identifying suicidal behaviour in a prison population (n = 353), both with and without questions about previous suicide attempts and suicidal ideation. Despite the recent advances in ML models for testing prediction of suicide-related behaviours, translation into clinical practice requires risk algorithms and ML models that use a relatively small number of predictors. There are also limitations with including suicidal ideation as a predictor, since around 60% of people who die by suicide do not express suicidal ideation at an earlier time (McHugh et al., 2019), and some populations are known to withhold this information, as discussed above. Accordingly, it is important to test the models' accuracy after excluding suicidal ideation. We hypothesised that modern ML algorithms, especially random forest and gradient tree boosting, could accurately identify individuals with suicidal behaviour within a prison population even after excluding questions about suicidality, given enough personal details such as factors related to demographics, physical and mental health, substance abuse, and criminal history. We further expected that BPD diagnostic criteria would be important predictors in these models. We were also interested in how APD diagnostic criteria would contribute to the model, given the similarities between the disorders. Finally, in order to help translate risk algorithms into clinical practice, we wanted to investigate whether a smaller number of predictors could be used to predict suicide attempts. To better understand how many personal variables are required to build accurate risk models, we examined how reducing the number of input variables would affect the models' performance.
Method
Materials
For this exploratory study, the Inter-university Consortium for Political and Social Research catalogue was searched for datasets that contained detailed BPD diagnostic data and a large number of participants (n > 100) on which to build ML models. In the identified dataset (Sacks & Melnick, 2011), one dependent variable, 'suicide ever attempted' as reported by the participants, and 915 independent variables were identified. These were reduced to 641 after removing any suicide-related and non-numeric variables. Some of the variables in the dataset, such as the Mental Health Screening Form total score and BPD diagnosis, were derived from other variables using simple formulas, such as if-else and addition (e.g., bpddiag = pdborder, 0 through 4 = 0, ELSE = 1). We kept both the source variables and the derived predictors in the dataset.
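The quoted recode rule can be sketched in pandas; the column names below follow the rule as printed, and the toy values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the per-participant BPD criteria count; the
# column names mirror those quoted from the dataset's recode rule.
df = pd.DataFrame({"pdborder": [0, 2, 4, 5, 7, 9]})

# "bpddiag = pdborder, 0 through 4 = 0, ELSE = 1": criteria counts of 0-4
# map to no diagnosis (0), anything else to a positive diagnosis (1).
df["bpddiag"] = (df["pdborder"] > 4).astype(int)

print(df["bpddiag"].tolist())  # [0, 0, 0, 1, 1, 1]
```

Keeping both `pdborder` and `bpddiag` in the predictor set, as the study did, lets the models use either the raw count or the derived diagnosis.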
Participants
The dataset described US prisoners who were participating in prison-based substance abuse treatment programs across 14 facilities. Table 1 shows high-level demographic and mental health data about the individuals, and how the total dataset was randomly split into a training dataset and a validation dataset. Data about income, education, and socioeconomic status were not available.
Table 1
Descriptive statistics

                          Total          Training       Validation
Gender
  Male                    207 (58.64)    163 (57.80)    44 (61.97)
  Female                  146 (41.36)    119 (42.20)    27 (38.03)
Ethnicity
  White                   137 (38.81)    109 (38.65)    28 (39.44)
  Latino                  120 (33.99)     97 (34.40)    23 (32.39)
  African American         96 (27.20)     76 (26.95)    20 (28.17)
Suicide attempt
  Yes                      59 (16.71)     45 (15.96)    14 (19.72)
  No                      294 (83.29)    237 (84.04)    57 (80.28)
BPD
  BPD 0–4                 303 (85.84)    246 (87.23)    57 (80.28)
  BPD 5–9                  50 (14.16)     36 (12.77)    14 (19.72)
    Suicide attempt
      Yes                  27 (54.00)     18 (50.00)     9 (64.29)
      No                   23 (46.00)     18 (50.00)     5 (35.71)
    Gender
      Male                 18 (36.00)     13 (36.11)     5 (35.71)
      Female               32 (64.00)     23 (63.89)     9 (64.29)
APD
  APD 0–2                 199 (56.37)    164 (58.16)    35 (49.30)
  APD 3–7                 154 (43.63)    118 (41.84)    36 (50.70)
    Suicide attempt
      Yes                  32 (20.78)     23 (19.49)     9 (25.00)
      No                  122 (79.22)     95 (80.51)    27 (75.00)
    Gender
      Male                103 (66.88)     78 (66.10)    25 (69.44)
      Female               51 (33.12)     40 (33.90)    11 (30.56)
Total                     353 (100)      282 (79.89)    71 (20.11)

Note. Items are raw counts, percentages in parentheses. APD = antisocial personality disorder, diagnostic criteria ≥ 3; BPD = borderline personality disorder, diagnostic criteria ≥ 5.
We built six predictive models in the Python programming language to compare their accuracy, both against each other and against other published results from similar studies. We wanted to compare modelling approaches with different bias-variance trade-offs (Hastie et al., 2001); in other words, models that can infer simple rules but potentially underfit the data (high bias, low variance, such as generalised linear models), and models that can infer arbitrarily complex rules but might overfit the data (low bias, high variance, such as tree-based models). The six models were: gradient tree boosting, implemented in the XGBoost toolkit (Chen & Guestrin, 2016); a fully connected three-layer neural network (multilayer perceptron), implemented in Keras on top of TensorFlow (Chollet et al., 2015); random forest; decision tree; logistic regression; and linear regression with a simple cutoff classifier (Pedregosa et al., 2011). The dataset was split into a training and a validation dataset, the latter of which was used to verify the generalisability of the models after training. Due to the small sample size, the validation dataset was also used to parameter-tune the neural network model. The neural network model was trained on normalised predictors, as this type of model is known not to learn well on raw input scores, especially when some of the scores have a special meaning (e.g., 9: missing data). The rest of the models were trained on the raw dataset.
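The model line-up can be sketched with freely available stand-ins: scikit-learn's GradientBoostingClassifier in place of XGBoost and MLPClassifier in place of the Keras network, on synthetic data shaped roughly like the study sample. This illustrates the comparison, not the study's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data mimicking the sample size and the ~17% positive rate.
X, y = make_classification(n_samples=353, n_features=50, weights=[0.83],
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "tree boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    # Neural networks learn poorly on raw scores, so normalise first.
    "neural network": make_pipeline(StandardScaler(),
                                    MLPClassifier(hidden_layer_sizes=(32, 32),
                                                  max_iter=1000, random_state=0)),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    aucs[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Linear regression with a simple cutoff classifier: rank by raw prediction.
lin = LinearRegression().fit(X_train, y_train)
aucs["linear regression"] = roc_auc_score(y_val, lin.predict(X_val))

for name, auc in sorted(aucs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: AUC = {auc:.3f}")
```

AUC only needs a ranking, so the linear model's raw output can be scored directly; turning it into hard labels would require choosing a cutoff, as discussed below.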
Analysis of the models. We planned to evaluate the models according to several different criteria. We included AUC because it is the most typically reported measure in the literature. AUC, the total area under the Receiver Operating Characteristic (ROC) curve, shows the relationship between sensitivity (true positive rate) and specificity (true negative rate) at different cutoff values. On a balanced dataset, where the numbers of positive and negative samples are roughly equal, a random classifier would yield an AUC of .5, and a perfect classifier would yield 1.0. However, we note a major limitation of AUC, namely that it is misleading in the case of imbalanced classes, such as the ratio of suicide attempters to non-attempters (Raeder et al., 2012). We also included positive predictive value (PPV) and sensitivity, so that our findings can be easily compared to other papers' results (Belsher et al., 2019). Finally, we also reported F1 scores, the harmonic mean of precision and recall, which is a standard measure in ML. F1 scores focus purely on the positive class of interest, not obscuring it like some other metrics, such as AUC. While a high AUC generally means a high F1 score, it is also possible that a model with an overall lower AUC could outperform another model with a higher AUC at a certain classification-cutoff threshold. The F1 score was calculated based on the positive cases (suicide ever attempted) on the validation dataset. This means that the reported F1 scores are specific to the validation dataset and may not be indicative of the models' performance on new datasets. During the parameterisation phase, we optimised the classification-cutoff thresholds for the highest F1 score possible on the validation dataset, as opposed to the highest AUC score. Our dataset was imbalanced, as discussed above, and AUC scores can be misleading in these cases (e.g., Cook, 2007; Saito & Rehmsmeier, 2015).
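The cutoff optimisation can be sketched as a simple sweep over candidate thresholds, keeping the one with the best validation F1; the risk scores below are simulated purely for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Simulated raw risk scores from a model on a 71-case imbalanced validation
# set (~20% positive), standing in for real model output.
rng = np.random.default_rng(1)
y_val = (rng.random(71) < 0.2).astype(int)
scores = np.clip(0.3 * y_val + rng.normal(0.3, 0.15, 71), 0, 1)

# Sweep candidate cutoffs and keep the one with the best validation F1,
# instead of using a library's default threshold of 0.5.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_val, (scores >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print(best_t, max(f1s))
```

On imbalanced data the F1-optimal cutoff typically sits below 0.5, since a lower threshold recovers more of the rare positive class.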
One of the issues with ML models, as noted earlier, is that they often work like a black box, not giving any insight into how the final scores were reached. To address this, we analysed how each risk factor contributed to the final score in the tree boosting model using the SHAP (SHapley Additive exPlanations) library (Lundberg & Lee, 2017). The idea behind SHAP is that, after the model is built, it revisits the data and evaluates how the values of each input variable contribute to the final score of the model, thereby gaining insight into otherwise black-box models. Using this technique, we also investigated how the BPD diagnostic criteria contributed to the final score. Finally, we tested how much effect removing input parameters would have on the final model (i.e., model reduction).
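SHAP itself requires the shap package; as a self-contained sketch of the same post-hoc idea (asking how much each input drives a fitted model), the snippet below uses scikit-learn's permutation importance, a simpler, global technique rather than SHAP's per-prediction attributions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with a handful of genuinely informative predictors.
X, y = make_classification(n_samples=353, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in validation AUC:
# features whose shuffling hurts most contributed most to the model.
result = permutation_importance(model, X_val, y_val, scoring="roc_auc",
                                n_repeats=10, random_state=0)
top5 = np.argsort(result.importances_mean)[::-1][:5]
print(top5)
```

SHAP additionally decomposes each individual prediction into per-feature contributions, which is what makes dependency plots like Figure 1 possible; permutation importance only ranks features overall.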
Results
As expected, all modern ML algorithms (tree boosting, random forest, and neural network) outperformed the decision tree and linear models (see Table 2 for accuracy details across all models for suicide prediction, including F1, AUC, true and false positive, and true and false negative rates).
The correlation between lifetime suicidal ideation and attempted suicide was high (r = .752, p < .001). Having a predictor with r = .75 in the input variables might be useful for building high-accuracy models; however, using suicidal ideation as a predictor may nevertheless not be useful in clinical practice. As discussed earlier, healthcare professionals might be uncomfortable asking about suicidal ideation, and certain populations may withhold this information. Regarding the accuracy of the models, as measured by AUC, the tree boosting model improved after adding suicidal ideation to the variables (from .875 to .955). This did not translate to better practical discrimination between individuals, however, as shown by the F1 score, which unexpectedly dropped somewhat (from .846 to .786). At the same time, suicidal ideation in the input variables improved the random forest model in terms of both AUC and F1, but it still was not able to match the performance of the tree boosting model (see Table 2).
Given the small differences in F1 accuracy scores between including and excluding suicidal ideation, and the concerns regarding the availability of this information across different populations, we excluded suicidal ideation from the input parameters for further analysis. The analysis of the tree boosting model using SHAP revealed, in line with our expectations, that the BPD diagnostic criteria were among the top five most important predictors. The number of diagnostic criteria met for BPD, based on the Structured Clinical Interview for DSM-IV Axis II Personality Disorders (SCID-II), showed a slightly increased risk for a suicide attempt when three and four criteria were met, and a marked but even increase for five criteria and above (see Figure 1). APD diagnostic criteria did not, however, appear among the first 20 most influential risk factors in the tree boosting model, despite the general association of APD with suicidality.
Even though both the derived scores and the individual questionnaire items were present in the dataset, the most significant contributors to the final score were the total scores on the scales rather than answers to individual items. It is also worth noting that building models that output a raw risk value, rather than a default binary classification score (true/false), helped to achieve higher F1 scores. This suggests that the default cutoff values used by the software libraries may not be suitable for imbalanced datasets, and that carefully fine-tuning the cutoff values would yield more accurate models. Similarly, in line with previous research, AUC scores did not seem to be the most relevant measure when comparing the performance of the models.
Apart from BPD criteria, the most important risk factors and cutoffs for suicide in the tree boosting model were: the number of times hospitalised for psychiatric problems (with a threshold of 1 or more); the Mental Health Screening Form (with a threshold of 11 or more); and the Texas Christian University Drug Screen score (with a threshold of 5 or more, especially if combined with psychiatric hospitalisation).

Table 2
Accuracy of the models on the validation dataset (n = 71)

              XGB    XGB-si  RF     RF-si  NN(a)  DT     LOG    LIN
F1 score(b)   .846   .786    .714   .800   .720   .667   .417   .438

Note. XGBoost (XGB), Random forest (RF), Neural Network (NN), Decision tree (DT), Logistic regression (LOG), Linear regression (LIN); excluding suicidal ideation, unless noted otherwise.
(a) Results varied slightly due to random initialisation.
(b) Models were optimised for F1 scores instead of AUC.
(c) Also called recall in ML context.
(si) Including suicidal ideation in the model.
Our decision tree model (see Figure 2) identified similar risk factors to the tree boosting model, including the number of BPD criteria met. This resulted in a relatively easy to interpret and administer, although less accurate, test (F1 = .667, compared to the tree boosting model's F1 = .846).
Figure 1. SHAP dependency plot of the number of BPD diagnostic criteria met (x-axis). The vertical dispersion of the dots indicates interaction with other variables in the dataset. A broader dispersion means more significant interactions. Higher overall risk values indicate a higher risk of a suicide attempt.