Automated high confidence compound identification of electron ionization mass spectra for nontargeted analysis

Nontargeted analysis based on mass spectrometry is a rising practice in environmental monitoring for identifying contaminants of emerging concern. Nontargeted analysis performed using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) generates large numbers of possible analytes.


Journal homepage: www.elsevier.com/locate/chroma


Joseph Bendik a,1, Richa Kalia a,1, Jeet Sukumaran b, William H. Richardot c,d, Eunha Hoh d, Scott T. Kelley a,b,∗

a Department of Biology, San Diego State University, San Diego, CA, USA

b Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego, CA 92104, USA

c San Diego State University Research Foundation, San Diego, CA, USA

d School of Public Health, San Diego State University, San Diego, CA, USA

Article history:

Received 30 July 2021

Revised 26 October 2021

Accepted 27 October 2021

Available online 31 October 2021

Keywords:

ChromaTOF

PyAutoGUI

Mass spectral comparison

Nontargeted analysis

Suspect screening

Machine learning

Nontargeted analysis based on mass spectrometry is a rising practice in environmental monitoring for identifying contaminants of emerging concern. Nontargeted analysis performed using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) generates large numbers of possible analytes. Moreover, the default spectral library similarity score-based search algorithm used by LECO® ChromaTOF® does not ensure that high similarity scores result in correct library matches. Therefore, an additional manual screening is necessary, but leads to human errors especially when dealing with large amounts of data. To improve the speed and accuracy of the chemical identification, we developed CINeMA.py (Classification Is Never Manual Again). This programming suite automates GC×GC/TOF-MS data interpretation by determining the confidence of a match between the observed analyte mass spectrum and the LECO® ChromaTOF® software generated library hit from the NIST Electron Ionization Mass Spectral (NIST EI-MS) library. Our script allows the user to evaluate the confidence of the match using an algorithmic method that mimics the manual curation process and two different machine learning approaches (neural networks and random forest). The script allows the user to adjust various parameters (e.g., similarity threshold) and study their effects on prediction accuracy. To test CINeMA.py, we used data from two different environmental contaminant studies: an EPA study on household dust and a study on stormwater runoff. Using a reference set based on the analysis performed by highly trained users of the ChromaTOF and GC×GC/TOF-MS systems, the random forest model had the highest prediction accuracies of 86% and 83% on the EPA and Stormwater data sets, respectively. The algorithmic approach had the second-best prediction accuracy (82% and 79%), while the neural network accuracy had the lowest (63% and 67%). All the approaches required less than 1 min to classify 986 observed analytes, whereas manual data analysis required hours or days to complete. Our methods were also able to detect high confidence matches missed during the manual review. Overall, CINeMA.py provides users with a powerful suite of tools that should significantly speed up data analysis while reducing the possibilities of manual errors and discrepancies among users, and can be applicable to other GC/EI-MS instrument based nontargeted analysis.

1 Introduction

Environmental monitoring for chemical contaminants typically requires using targeted analysis, in which a priori information (mass spectra, retention times, etc.) on specific chemicals is used to detect compounds of interest. While these methods are sensitive and quantitative for a known set of compounds, they miss undefined compounds regardless of their abundance. Nontargeted analysis (NTA), including suspect screening, was developed to detect multiple compounds simultaneously, including novel compounds, and involves comprehensive sample preparation and chromatography followed by full mass spectrometry analysis [1–3]. Comprehensive two-dimensional gas chromatography coupled with time-

∗ Corresponding author at: Department of Biology, San Diego State University, San Diego, CA, USA.

E-mail address: skelley@sdsu.edu (S.T. Kelley).

1 These authors contributed equally to this work.

https://doi.org/10.1016/j.chroma.2021.462656


In GC×GC/TOF-MS based NTA, the raw data is analyzed using data processing software such as LECO® ChromaTOF®. ChromaTOF’s “automatic peak search” first identifies features based upon certain conditions (e.g., S/N ratio, GC retention time, etc.). Additionally, ChromaTOF’s peak table alignment feature, “Statistical Compare”, enables users to make comparisons between groups of samples (e.g., Samples vs. Controls) to efficiently isolate compounds of interest. “Statistical Compare” aligns peaks across sample groups based upon 1st and 2nd dimension GC retention times, as well as mass spectral similarity. In order to identify compounds of interest, each peak is compared against the National Institute of Standards and Technology electron ionization mass spectral (NIST EI-MS) library (or custom MS library depending on the user), generating a list of ranked suggested compounds (Library Hits) and “similarity score” by ChromaTOF utilizing the NIST Similarity score based on the relative abundances of the matched pairs of masses and the abundance ratios of adjacent matching peaks [10,11]. Afterwards, each library hit must be manually reviewed to further evaluate the confidence of a match between the library hit mass spectra and the observed mass spectra (after deconvolution), known as the Peak True mass spectra in ChromaTOF.

Currently, the observed mass spectra and library hit mass spectra are either manually reviewed in ChromaTOF, or the data can be exported as a PDF. Fig. 1A shows the workflow for manual data analysis [12]. Once the best matches are obtained using the spectral library search algorithms, analytical reference standards are procured, and their respective retention times and mass spectra are obtained from the same instrumental condition of GC×GC/TOF-MS. The verification success rates were 94% and 96% in our studies [4,13]. This supports the notion that the manual review works for determination of high confidence identification. However, this manual review can be time consuming and error prone when the data size is large, and results can be inconsistent among users. For instance, reviewing thousands of compounds’ mass spectra and their matching mass spectra from an MS library (e.g., the NIST EI-MS library) can take many hours or even days depending on user experience. This high level of manual data handling leads to numerous errors, necessitating multiple independent reviews of all results to minimize them. Thus, automation of these tasks would be extremely valuable to improve the accuracy and increase the analysis throughput [14].

To improve the speed and accuracy of identification based on mass spectral matching, we developed two programs: chromaTOF_auto.py and CINeMA.py (Classification Is Never Manual Again). The chromaTOF_auto.py script automates GC×GC/TOF-MS data download from LECO® ChromaTOF® software, while CINeMA.py facilitates the confirmation of analyte matches between the NIST mass spectral library and the experimental mass spectra using two different approaches: an algorithmic method based directly on the manual curation method and machine learning approaches using neural networks and random forests trained on manually curated data sets. These machine learning techniques have been used for similar mass spectrometry applications in

pre-

was used for data processing. The stormwater runoff samples (aka the Stormwater data set) were collected by the San Francisco Estuary Institute (SFEI) from Napa, Sonoma, and Santa Rosa counties in California following the 2017 Northern California wildfires [13]. The household dust samples (aka the EPA data set) were provided as part of the U.S. Environmental Protection Agency (EPA)’s Non-targeted Analysis Collaborative Trial (ENTACT), an inter-laboratory study designed to compare the various workflow techniques implemented within the NTA research community [18,19]. In brief, participants were given a series of samples in a blind trial, some of which had been spiked with a cocktail of various compounds, and were instructed to conduct NTA. The EPA data set contained 986 observed analytes from the analysis of LECO® ChromaTOF® software and the Stormwater data set contained 892 observed analytes. In the EPA data set, 409 compounds were manually reviewed to be high confidence matches, and 577 were reviewed as low confidence. In the Stormwater data set, 373 were reviewed as high confidence and 519 were reviewed as low confidence.

The LECO® ChromaTOF® software assigns each chromatographic peak a name based upon mass spectral similarity to compounds within the 2011 NIST EI-MS library. After isolating all compounds of interest during review, the user sorts the “peak table” in ChromaTOF so that each compound of interest is in sequential order. To do so, the “peak table” was sorted by “comment” and “peak number”. The “peak true” (deconvoluted mass spectra) data of all compounds of interest were then exported in MSP format (peak_true.msp). Next, the mass spectra of each compound’s assigned name from the 2011 NIST EI-MS library (library hit) were exported using the chromaTOF_auto.py script. The chromaTOF_auto.py script is based on PyAutoGUI, a python module to control the use of mouse and keyboard for automation of any Graphical User Interface. PyAutoGUI reproduces human actions such as moving, clicking and dragging the mouse, pressing and holding keys, and pressing keyboard hotkey combinations [20]. Using this script an analyst can easily extract the GC×GC/TOF-MS library hits data from the LECO® ChromaTOF® software for further analysis in a significantly reduced time and with negligible human effort. The chromaTOF_auto.py script does not modify, manipulate, or extend the software or databases of the LECO® ChromaTOF® software.

Fig. 1B shows the workflow for automated data download with chromaTOF_auto.py. The LECO® ChromaTOF® workspace is composed in left to right order with the following components: the directory for accessing tools and options (Acquisition Que, GC and MS Methods, Acquired Samples, etc.), peak table, and the library hit mass spectrum (Fig. S1). The chromaTOF_auto.py script saves the library hit files sequentially in the most recent directory used by the user, renaming the files (1.msp, 2.msp, etc.) for easy access.
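The automated download loop can be pictured with a minimal sketch. This is not the chromaTOF_auto.py source: the function name `export_library_hits`, the `Gui` protocol, and above all the keystroke sequence are hypothetical placeholders, since the text does not state which ChromaTOF actions the script replays. The only PyAutoGUI calls assumed are `hotkey`, `write`, and `press`, and the GUI object is passed in so the loop can be exercised without a display.

```python
import time
from typing import List, Protocol


class Gui(Protocol):
    """The subset of PyAutoGUI's module-level functions used below."""

    def hotkey(self, *keys: str) -> None: ...
    def press(self, key: str) -> None: ...
    def write(self, text: str) -> None: ...


def export_library_hits(gui: Gui, n_hits: int,
                        start_delay: float = 10.0,
                        step_delay: float = 1.0) -> List[str]:
    """Save one library-hit file per peak-table row as 1.msp, 2.msp, ..."""
    # Delay timer: lets the analyst arrange the ChromaTOF window first.
    time.sleep(start_delay)
    saved = []
    for i in range(1, n_hits + 1):
        name = f"{i}.msp"
        gui.hotkey("ctrl", "s")   # HYPOTHETICAL shortcut opening a save dialog
        time.sleep(step_delay)    # pause so the GUI can keep up
        gui.write(name)           # type the sequential file name
        gui.press("enter")        # confirm the dialog
        gui.press("down")         # HYPOTHETICAL: move to the next peak-table row
        time.sleep(step_delay)
        saved.append(name)
    return saved
```

With PyAutoGUI installed, the module itself could be passed as the `gui` argument (for example `export_library_hits(pyautogui, n_hits=986)`), after replacing the placeholder keystrokes with the ones the ChromaTOF workspace actually requires.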

2.2 Data parsing

The data obtained from the GC×GC/TOF-MS data analysis by the LECO® ChromaTOF® software on both the EPA and Stormwater data sets was parsed using CINeMA.py to extract: (1) Analyte name (“Name”); (2) Mass-to-charge ratio (m/z) of the ions and their respective intensities; (3) Similarity Score between the observed analyte and library hit #1 from the LECO® ChromaTOF® software (only present in library hits); and (4) Total number of ions in the mass spectrum. This data was necessary for the script CINeMA.py to analyze the confidence of a match between the observed analytes and library hits. In addition, since the lowest mass spectral acquisition ion was m/z 50, the manual review of matches ignores all ions below m/z 50 present in the library hit. CINeMA.py parsed all the files in the given data directory into the required data structure to train, test, or make predictions, using either the algorithmic model or the machine learning models [21–23]. Depending on the user action (predict, train, or test), CINeMA.py requires the data directory to have a specific organizational structure (Fig. 2).

Fig. 1. Workflow for manual (A) or automated (B) data analysis. An environmental sample once collected is processed using GC×GC/TOF-MS and analyzed using the LECO® ChromaTOF® software. The LECO® ChromaTOF® software outputs a list of observed analytes present in the sample. For a manual analysis, this processed data for each observed analyte and their respective library hits are then manually downloaded by the analyst. Next, the analyst reviews this manually downloaded data to evaluate the confidence of the match (High or Low) between the mass spectra of each observed analyte and their corresponding library hit. For the automated analysis, the user creates a directory to save the observed analyte’s (SA) library hit files and then downloads them sequentially using the chromaTOF_auto.py script.

CINeMA.py results were benchmarked with those obtained via manual analysis to establish the reliability of our CINeMA.py results and the effectiveness of CINeMA.py in reducing GC×GC/TOF-MS data analysis time. The peak_true.msp file contains data for all the observed analytes together as shown in Fig. S2. To verify the completeness of the analyte data, the script parses the peak_true.msp file using a state machine as shown in Fig. S3. Finally, each compound’s library hit is output to an individual file as shown in Fig. S4.
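The state-machine parsing with a completeness check can be sketched as follows. This is an illustration under stated assumptions, not the CINeMA.py parser: it assumes a NIST-style MSP layout with `Name:` and `Num Peaks:` fields followed by whitespace-separated "m/z intensity" pairs, and the `Similarity:` field name is hypothetical because the ChromaTOF export layout (Figs. S2 to S4) is not reproduced in the text.

```python
import re
from typing import Dict, Iterable, List

MIN_MZ = 50.0  # lowest acquired m/z in the study; lower ions are ignored
PAIR = re.compile(r"(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)")  # "m/z intensity"


def parse_msp(lines: Iterable[str], min_mz: float = MIN_MZ) -> List[Dict]:
    """Two-state parser: HEADER reads 'Key: value' fields, PEAKS reads ion pairs."""
    records: List[Dict] = []
    record = None          # record currently being assembled
    expected = seen = 0    # declared vs. actually parsed peak counts
    state = "HEADER"

    for raw in list(lines) + [""]:   # trailing blank line flushes the last record
        line = raw.strip()
        if state == "HEADER":
            if not line:
                continue
            key, _, value = line.partition(":")
            key, value = key.strip().lower(), value.strip()
            if key == "name":
                record = {"name": value, "similarity": None, "peaks": {}}
            elif record is None:
                raise ValueError(f"field found before any Name line: {line!r}")
            elif key == "similarity":      # hypothetical field name
                record["similarity"] = int(value)
            elif key == "num peaks":
                expected, seen = int(value), 0
                if expected == 0:          # nothing to read for this record
                    records.append(record)
                    record = None
                else:
                    state = "PEAKS"
        else:  # state == "PEAKS"
            for mz, intensity in PAIR.findall(line):
                seen += 1
                if float(mz) >= min_mz:
                    record["peaks"][float(mz)] = float(intensity)
            if not line or seen >= expected:
                if seen != expected:       # completeness check
                    raise ValueError(
                        f"{record['name']}: expected {expected} peaks, found {seen}")
                records.append(record)
                record, state = None, "HEADER"

    if record is not None:
        raise ValueError(f"{record['name']}: record has no 'Num Peaks' field")
    return records
```

A truncated record raises `ValueError` rather than being silently accepted, which is the role the completeness check plays in the workflow described above.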

2.3 Algorithmic model

The algorithmic model, outlined in Fig. 3, begins by checking the similarity score against a threshold, which is set to 600 (out of 999) by default in this study but can be changed. This similarity score from NIST is an output from the LECO® ChromaTOF® software describing the measure of similarity between the observed analyte mass spectrum and the library hit from the 2011 NIST EI-MS library matches. The user can alter this similarity score threshold using the command line inputs for CINeMA.py. The algorithm compares the library hit mass spectra with the observed mass spectra from LECO® ChromaTOF® software. A match is deemed a “high confidence” match if the following are true: the similarity score is greater than or equal to the user provided similarity score, the most abundant three ions of the library hit are present in the observed mass spectra (and vice versa), the molecular ion is present, and the correlation percentage between the spectra of the library hit and the observed mass spectra is at least 80%.
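The decision rules above can be written down compactly. The sketch below is one possible reading, not the CINeMA.py implementation: normalization to a base peak of 100 and the use of Pearson correlation (times 100) over the union of m/z values are assumptions, because the text does not define how the "correlation percentage" is computed. All function names are illustrative, and spectra are assumed to be non-empty.

```python
from math import sqrt
from typing import Dict, Set

Spectrum = Dict[int, float]  # m/z -> intensity


def normalize(spec: Spectrum) -> Spectrum:
    """Scale intensities so the base peak equals 100 (assumed normalization)."""
    top = max(spec.values())
    return {mz: 100.0 * i / top for mz, i in spec.items()}


def top_ions(spec: Spectrum, n: int = 3) -> Set[int]:
    """m/z values of the n most abundant ions."""
    return set(sorted(spec, key=spec.get, reverse=True)[:n])


def correlation_percent(a: Spectrum, b: Spectrum) -> float:
    """Pearson correlation x 100 over the union of m/z values (assumed definition)."""
    mzs = sorted(set(a) | set(b))
    x = [a.get(mz, 0.0) for mz in mzs]
    y = [b.get(mz, 0.0) for mz in mzs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    denom = sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))
    return 100.0 * cov / denom if denom else 0.0


def classify(observed: Spectrum, hit: Spectrum, similarity: int,
             molecular_ion: int, sim_threshold: int = 600,
             corr_threshold: float = 80.0) -> str:
    """Return 'high' only if every rule from the text holds, else 'low'."""
    if similarity < sim_threshold:
        return "low"
    obs, lib = normalize(observed), normalize(hit)
    checks = (
        top_ions(lib) <= set(obs),          # top 3 library ions seen in analyte
        molecular_ion in obs,               # molecular ion of the hit is present
        top_ions(obs) <= set(lib),          # top 3 analyte ions seen in the hit
        correlation_percent(obs, lib) >= corr_threshold,
    )
    return "high" if all(checks) else "low"
```

Exposing `sim_threshold` and `corr_threshold` as parameters mirrors the tunability the text describes; the molecular ion m/z would come from a per-sample file such as compounds.tsv.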

2.4 Machine learning models

Two types of machine learning approaches were used to determine if the best library hit is a high- or low-confidence match to the observed mass spectra: a random forest algorithm, and a neural network. Random forest and neural networks were both selected for this study primarily because of their effectiveness when working with classification problems such as this. Neural networks can analyze complex relationships between inputs, which makes them a good choice to detect differences in mass spectra that can contain large amounts of ion intensity data. However, neural networks usually require vast amounts of samples for training. Conversely, random forest works well with smaller amounts of data with more clearly defined features, such as the spectra features a reviewer looks for during a manual review. In addition, feature importance can be easily provided with random forest, allowing the user to visualize the aspects of their manual review that the machine considers the most important.


Fig. 2. Data directory structures. (A) Under the sample directory there is a subdirectory called ‘hits’ and the peak_true.msp file that contains the data for observed analytes. The user should use the ‘hits’ directory to save all the library hits files obtained through using chromaTOF_auto.py. Each sample subdirectory should contain a compounds.tsv file, which contains the m/z ratio for the molecular ion in the library hit file. (B) For training or testing the accuracy of a machine learning model with a new data set, the root directory should contain sub directories, which are sample names. Each sample subdirectory should contain a ground_truth.tsv file, which contains the manual interpretation of the confidence of a match of observed analytes and library hits obtained from GC×GC/TOF-MS data analysis by the LECO® ChromaTOF® software.

Fig. 3. Algorithmic model. If the similarity score from the LECO® ChromaTOF® software is less than the similarity score threshold, the algorithm classifies the match as a low confidence match. If the similarity score is higher, then the model normalizes the spectrum data for both the observed analyte (SA) and the library hit (LH) and checks the following set of conditions: (1) presence of most abundant three ions (Top 3 ions) of the library hit in the observed analyte, (2) presence of molecular ion of the library hit in the observed analyte, (3) presence of top three ions of the observed analyte in the library hit and (4) correlation (≥ 80) between the spectra of the library hit and the observed analyte. If all these conditions are met, it interprets the match as a “high confidence match.”


Fig. 4. Neural network model’s structure. The first 1000 inputs are the library hit ion intensities and the next 1000 are the observed analytes’ ion intensities. There are three hidden layers of size 1000, 100 and 10 neurons, and have softsign activation functions. The last layer of the network uses a softmax activation function and is composed of two neurons for high or low predictions. The model was trained with 5 epochs and a batch size of 128.
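The 2000-value input layout described in the Fig. 4 caption can be illustrated with a short sketch. How intensities are binned and scaled is not stated in the text, so two assumptions are made here: each intensity is stored at the index equal to its integer m/z, and each spectrum is scaled so that its base peak equals 1. The function names are illustrative.

```python
from typing import Dict, List

N_BINS = 1000  # inputs per spectrum, as in Fig. 4


def to_vector(spec: Dict[int, float], n_bins: int = N_BINS) -> List[float]:
    """Dense intensity vector indexed by integer m/z (assumed binning)."""
    vec = [0.0] * n_bins
    for mz, intensity in spec.items():
        if 0 <= mz < n_bins:          # ions outside the range are dropped
            vec[mz] = float(intensity)
    peak = max(vec)
    # Assumed scaling: base peak becomes 1.0; an empty spectrum stays all zeros.
    return [v / peak for v in vec] if peak > 0 else vec


def network_input(hit: Dict[int, float], observed: Dict[int, float]) -> List[float]:
    """Library-hit intensities first, observed-analyte intensities second."""
    return to_vector(hit) + to_vector(observed)
```

A vector built this way has a fixed length regardless of how many ions each spectrum contains, which is what a fully connected input layer of 2000 units requires.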

The input data for random forest consisted of the same mass spectra features checked when using the algorithmic model: similarity score, correlation percentage, molecular ion presence, and the number of top ions present in the hit that are also present in the observed analyte (and vice versa). The random forest model was built in python using the Scikit-Learn package [24,25]. The hyperparameters for the model were tuned based on optimizing the accuracy metric, resulting in 100 trees and a max depth of 4. The input data for the neural network consisted of the ion intensities for each observed analyte and its best hit to detect if the two spectra are similar enough to be considered a high-confidence match. This model was built in python using the Keras and Tensorflow packages [26,27]. Fig. 4 illustrates the structure of the neural network model. Activation functions, the number of epochs, and the batch size were selected for the neural network based on the accuracy metric, aided with the use of GridSearchCV in the Scikit-Learn package. Model performance was examined through confusion matrices, receiver operating characteristic (ROC) curves, and 10-fold cross validation. All models were trained on one of the two data sets and tested on the other using an expert’s manual review of high and low confidence for the data labels. Additionally, to provide more data for model training, these data sets were also combined into one large data set and then trained and tested in three ways: (1) Train on 80% of the combined set and test on the remaining 20%; (2) Train on 80% of the EPA data set plus 100% of the Stormwater data set, and then test the remaining 20% of the EPA data set; (3) Train on 80% of the Stormwater data set plus 100% of the EPA data set, and then test the remaining 20% of the Stormwater data set. Random splits were performed on all train test split cases. CINeMA.py also allows the analyst to train and save their own machine learning model on a given data set. The saved models can then be used for testing or making predictions for new data sets.
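The three combined-set schemes can be expressed with two small helpers. This is a sketch only: the function names and the fixed seed are illustrative, and truncating the 80% cut to an integer is an assumption (it does reproduce the 1502/376 split of the combined 1878 analytes reported in Table 1).

```python
import random
from typing import List, Sequence, Tuple


def split_80_20(items: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Scheme (1): random 80/20 split of a single (possibly combined) data set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))       # assumed rounding: truncate
    return shuffled[:cut], shuffled[cut:]


def augmented_split(target: Sequence, other: Sequence,
                    seed: int = 0) -> Tuple[List, List]:
    """Schemes (2) and (3): train on 80% of `target` plus all of `other`,
    test on the held-out 20% of `target`."""
    train, test = split_80_20(target, seed)
    return train + list(other), test
```

For example, `augmented_split(epa, stormwater)` corresponds to scheme (2) and `augmented_split(stormwater, epa)` to scheme (3); in both cases the test set contains only analytes from the target data set.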

2.5 Report generation

CINeMA.py generates reports in the form of two files: report.tsv and report.pdf. The report.tsv file contains information about the peak number, name of the observed analyte and the predicted match between the library hit and the observed analyte. The report.pdf file contains mirror plots between each observed analyte’s mass spectrum and its corresponding library hit’s mass spectrum [28]. Fig. 5 shows example plots of high and low confidence matches. The plots allow the analyst to visually compare the observed analyte’s mass spectrum and the corresponding library hit’s mass spectrum if desired. The mirror-plot of the two mass spectra makes visual comparison easy while comparing the two separate plots produced by LECO® ChromaTOF® software.

Fig. 5. Mirror plots comparing observed analyte and library hit mass spectra. The mirror plots are provided by CINeMA.py for all matches from the non-targeted analysis to the given library spectra, allowing straightforward manual confirmation. The top spectrum (positive values in blue) is the spectrum from the observed analyte in the sample, while the bottom mirrored spectrum (negative values in red) is the spectrum of the corresponding library hit for the observed analyte. (A) An example of a high confidence library match. (B) Example of a low confidence match. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

When training a neural network model, CINeMA.py produces model_performance.pdf containing loss curves for each fold during cross-validation, shown in Fig. 6. When testing either of the machine learning models, the script will produce measures.pdf containing the confusion matrix and the ROC curve, as in Fig. 7 [29]. By considering low-confidence matches as “negatives,” and high-confidence matches as “positives,” the user can use the confusion matrix to calculate performance metrics such as accuracy, sensitivity, specificity, and balanced accuracy. When training the random forest model with feature input data, the script will produce importance.pdf containing a bar plot with the relative importance for each feature (Fig. 8). Source code for chromaTOF_auto.py and CINeMA.py, along with tutorials and test data, is available on GitHub at https://github.com/sharmaricha200/thesis.git
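The performance metrics named above follow directly from the four confusion matrix cells when "high" is treated as the positive class and "low" as the negative class. The sketch below uses the standard definitions; the function names are illustrative and the counts in the usage example are made up, not results from the study.

```python
from typing import Dict, Sequence


def confusion_counts(truth: Sequence[str], predicted: Sequence[str]) -> Dict[str, int]:
    """Count TP/FP/TN/FN with 'high' as positive and 'low' as negative."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(truth, predicted):
        if p == "high":
            counts["tp" if t == "high" else "fp"] += 1
        else:
            counts["tn" if t == "low" else "fn"] += 1
    return counts


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and balanced accuracy."""
    sensitivity = tp / (tp + fn)          # share of true 'high' matches recovered
    specificity = tn / (tn + fp)          # share of true 'low' matches recovered
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Illustrative counts only (not taken from the paper).
example = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

Balanced accuracy is useful here because both data sets contain more low-confidence than high-confidence matches, so plain accuracy can be dominated by the majority class.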


Fig. 6. Example training loss generated during one fold of the 10-fold cross validation on the EPA data set. The blue curve (top) indicates the loss on the training samples, and the orange curve indicates the loss on the samples held out for validation in that fold. This result shows the neural network model loss using ion intensity data, trained for 5 epochs with a batch size of 128. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 7. Example efficacy outputs following random forest model training on the EPA data set and testing on the Stormwater data set. (A) Confusion matrix. (B) Receiver Operating Characteristic (ROC) curve.

Fig. 8. Example feature importance for the random forest model trained on the EPA data set and tested on the Stormwater data set.

3 Results and discussion

The automated data collection workflow process implemented in chromaTOF_auto.py needed only a few minutes on an Intel® Core™ i7-6700 Quad CPU, with 8 GB RAM running Windows® 10, 64-bit to download library hit data (.msp) files from a given GC×GC/TOF-MS data output analysis from the LECO® ChromaTOF® software. Because of computational speed, chromaTOF_auto.py initially caused the ChromaTOF® GUI to crash. To overcome this issue, we included a delay timer in the chromaTOF_auto.py script, allowing the user to set up the screen as described above before the automation takes over to download the library hit files.

For generating predictions, CINeMA.py was able to produce results within a minute. When testing the algorithm model on the complete data sets, an accuracy of 81.54% was achieved on the 986 compounds in the EPA set and an accuracy of 78.70% was achieved on the 892 compounds in the Stormwater set. For the machine learning models, the highest accuracy value and Area under the ROC curve score (AUC) seen on the complete EPA set was achieved using the random forest model on the algorithm’s feature data. This model had an accuracy of 85.60% and had an AUC score of 0.887. The highest accuracy value and AUC score seen on the Stormwater set was also achieved using the random forest model on the algorithm’s feature data. This model had an accuracy value of 82.85% and an AUC score of 0.899 (Table 1). The neural network did not perform as well as the other models based on the testing accuracies, AUC scores, and cross-validation accuracies (Tables 1 and 2). Combining data sets did somewhat improve the testing accuracy and AUC score for this model, however (Table 1). Agreement rates between the human user’s decision vs. a model decision per “High” and “Low” confidence were similar, with a slightly higher agreement by the algorithm model in “High” than in “Low” (Table S1). This demonstrates that the models work equally for compound identification regardless of “High” and “Low” confidence matching.

To identify reasons for discrepancy between classifications (human vs. computer), we manually reviewed “incorrect” classifications. The main source of discrepancy when comparing human classifications to the algorithm’s classifications appeared to come from instances in which observed mass spectra and library hit mass spectra were very similar, but were on the cusp of either high or low confidence. This often occurred in instances in which the library hit mass spectra contained numerous ions with low relative abundance. Since NTA of environmental samples often involves the detection of trace contaminants, compounds present at low concentrations may not produce enough low abundance ions to be detected by the mass spectrometer. As the algorithm is confined by a strict set of rules (i.e., correlation percentage ≥ 80%), some compounds may be classified as “low” while a human user may take additional factors into account and classify as “high”.

Table 1. Random forest and neural network model performances across the EPA dust and Stormwater data sets. Includes the number of compounds present in the training and test sets, the accuracy on the test set, and the Area Under the ROC Curve (AUC) score.

Model (input) | Train / test split | # Training compounds | # Test compounds | Testing accuracy | AUC
RF (Features) | Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater) | 1502 | 376 | 82.45% | 0.873
NN (Intensities) | Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater) | 1502 | 376 | 70.48% | 0.761

Table 2. 10-fold cross validation mean accuracy +/- standard deviation on the two neural network models across the EPA dust and Stormwater data sets.

NN (Intensities): train / test split | Mean accuracy (+/- standard deviation)
Train EPA, Test Stormwater | 74.44% (+/- 3.28%)
Train Stormwater, Test EPA | 71.30% (+/- 4.46%)
Train 80% EPA, Test 20% EPA | 72.21% (+/- 5.31%)
Train 80% Stormwater, Test 20% Stormwater | 70.83% (+/- 4.79%)
Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater) | 73.50% (+/- 2.68%)
Train EPA + 80% Stormwater, Test 20% Stormwater | 75.40% (+/- 2.99%)
Train Stormwater + 80% EPA, Test 20% EPA | 73.10% (+/- 2.70%)

Additionally, both the algorithm and random forest model corrected human errors. As shown in Fig. S5, some compounds in which the observed mass spectra and library hit mass spectra were near perfect matches were erroneously classified as “not a match (low)” by the human user but classified correctly as a “match (high)” by the algorithm. Conversely, there were instances in which the observed mass spectra and library hit mass spectra did not match well but were classified as “high” by the human user and classified as “low” by the algorithm. Such errors were due to fatigue experienced by the human user comparing hundreds of mass spectral matches in succession.

While the random forest model had the highest accuracy scores, there are still some benefits to the use of a simplified algorithm over the machine learning techniques. The simplified algorithm is capable of working with extremely small data sets and does not require an outside source of data for training. Both types of machine learning techniques require data for training and, especially in the case of neural networks, large amounts of data may be necessary. The algorithmic approach, however, avoids this issue, meaning users may prefer this method over training their own machine learning model. This data requirement may also explain the low performance metrics of the neural network compared to the other models, as the number of samples contained in the data sets was relatively small for this type of model. Furthermore, the algorithm is easily tunable, allowing the user to specify their own similarity score and correlation percentage thresholds when testing their own data sets. This ability to easily tune the algorithm makes it applicable for use with programs other than ChromaTOF, as their spectral matching components may use a scale different than ChromaTOF’s similarity score (0–999).

4 Conclusions

Overall, the random forest model provided the best accuracy value for both data sets, and we showed that compounds missed by the algorithm were often recognized by machine learning. Furthermore, by ranking feature importance a machine learning approach can highlight ways to improve the algorithmic approach by illustrating which feature thresholds can be tuned in the algorithm. The neural network model with intensities has the potential to predict unknown rules and patterns for analyzing the data set, which the feature-using models lack. Feature models are based on man-made rules and likely have room for improvement since it could be difficult to hardcode all possible rules. Thus, in principle, with larger data sets a neural network approach using ion intensities has the potential to find patterns and rules that cannot be coded via an algorithm. Furthermore, it can be improved by increasing the size and accuracy of training data sets. In future work, we will continue to explore the potential of neural networks with intensity data to enhance the accuracy of NTA.

In terms of speed, CINeMA.py is able to provide prediction results within a minute. Manual data analysis by multiple people required hours or even days for the same data sets of observed analytes. CINeMA.py’s capacity to rapidly evaluate the confidence of a match between observed analytes and library matches represents a significant improvement over manual analysis that can take substantial time depending on the data size and can be error-prone during heavy data handling. CINeMA.py gives the user the flexibility to not only automate the interpretation of the confidence of the match of observed analytes and their corresponding library matches, but also to experiment with various test parameters to study their effects on the analysis. In addition, the user can choose to use either or both the algorithmic model and any of the machine learning models to analyze their data and compare their predictions. The user can also train the machine learning models with relevant data sets to improve predictions on new data sets. Because


CRediT authorship contribution statement

Joseph Bendik: Software, Investigation, Formal analysis, Validation, Visualization, Writing – original draft. Richa Kalia: Software, Investigation, Visualization, Formal analysis, Writing – original draft. Jeet Sukumaran: Software, Methodology. William H. Richardot: Validation, Data curation, Resources. Eunha Hoh: Methodology, Validation, Funding acquisition, Writing – review & editing. Scott T. Kelley: Conceptualization, Writing – original draft, Writing – review & editing, Supervision, Project administration.

Funding

This work was funded in part by the California Tobacco Related Disease Research Program funded grant (27IP-0028C).

Acknowledgments

We would like to thank Dr. Nathan Dodder, Ying Xu, Bryan Ho, and Basilin Benson for their valuable insights during the study design.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.chroma.2021.462656

References

[1] L Chibwe, I.A Titaley, E Hoh, S.L.M Simonich, Integrated framework for iden-

tifying toxic transformation products in complex environmental mixtures, En-

viron Sci Technol Lett 4 (2017) 32–43, doi: 10.1021/acs.estlett.6b00455

[2] J Hollender, E.L Schymanski, H.P Singer, P.L Ferguson, Nontarget screening

with high resolution mass spectrometry in the environment: ready to go? En-

viron Sci Technol 51 (2017) 11505–11512, doi: 10.1021/acs.est.7b02184

[3] J.R Sobus, J.F Wambaugh, K.K Isaacs, A.J Williams, A.D McEachran,

A.M Richard, C.M Grulke, E.M Ulrich, J.E Rager, M.J Strynar, S.R Newton,

Integrating tools for non-targeted analysis research and chemical safety eval-

uations at the US EPA, J Expo Sci Environ Epidemiol 28 (2018) 411–426,

doi: 10.1038/s41370- 017- 0012- y

[4] C.D Tran, N.G Dodder, P.J.E Quintana, K Watanabe, J.H Kim, M.F Hovell,

C.D Chambers, E Hoh, Organic contaminants in human breast milk identi-

ﬁed by non-targeted analysis, Chemosphere 238 (2020) 124677, doi: 10.1016/

j.chemosphere.2019.124677

[5] M.B Alonso, K.A Maruya, N.G Dodder, J Lailson-Brito, A Azevedo, E Santos-

Neto, J.P.M Torres, O Malm, E Hoh, Nontargeted screening of halogenated

organic compounds in bottlenose dolphins (tursiops truncatus) from Rio de

Janeiro, Brazil, Environ Sci Technol 51 (2017) 1176–1185, doi: 10.1021/acs.est

6b04186

[6] C.A Manzano, N.G Dodder, E Hoh, R Morales, Patterns of personal exposure

to urban pollutants using personal passive samplers and GC × GC/ToF-MS, En-

viron Sci Technol 53 (2019) 614–624, doi: 10.1021/acs.est.8b06220

T Novotny, D Schlenk, R.M Gersberg, E Hoh, Assessing toxicity and in vitro bioactivity of smoked cigarette leachate using cell-based assays and chemical analysis, Chem Res Toxicol 32 (2019) 1670–1679, doi: 10.1021/acs.chemrestox 9b00201

[13] D Chang, W.H Richardot, E.L Miller, N.G Dodder, M.D Sedlak, E Hoh, R Sut- ton, Framework for non-targeted investigation of contaminants released by wildﬁres into stormwater runoff: case study in the Northern San Francisco Bay area, Integr Environ Assess Manag (2021) Online ahead of print, doi: 10.1002/ ieam.4461

[14] H Mol, Non-targeted is our target, The Anal Scientist (2013) https://theanalyticalscientist.com/techniques-tools/nontargeted-is-our-target

[15] E.D Strozier, D.D Mooney, D.A Friedenberg, T.P Klupinski, C.A Triplett, Use of comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometric detection and random forest pattern recognition techniques for classifying chemical threat agents and detecting chemical attribution signatures, Anal Chem 88 (2016) 7068–7075, doi: 10.1021/acs.analchem.6b00725

[16] F Allen, A Pon, R Greiner, D Wishart, Computational prediction of electron ionization mass spectra to assist in GC/MS compound identification, Anal Chem 88 (2016) 7689–7697, doi: 10.1021/acs.analchem.6b01622

[17] D.D Matyushin, A.Y Sholokhova, A.K Buryak, Deep learning driven GC-MS library search and its application for metabolomics, Anal Chem 92 (2020) 11818–11825, doi: 10.1021/acs.analchem.0c02082

[18] E.M Ulrich, J.R Sobus, C.M Grulke, A.M Richard, S.R Newton, M.J Strynar, K Mansouri, A.J Williams, EPA’s non-targeted analysis collaborative trial (ENTACT): genesis, design, and initial findings, Anal Bioanal Chem 411 (2019) 853–866, doi: 10.1007/s00216-018-1435-6

[19] S.R Newton, J.R Sobus, E.M Ulrich, R.R Singh, A Chao, J McCord, S Laughlin-Toth, M Strynar, Examining NTA performance and potential using fortified and reference house dust as part of EPA’s non-targeted analysis collaborative trial (ENTACT), Anal Bioanal Chem 412 (2020) 4221–4233, doi: 10.1007/s00216-020-02658-w

[20] A Sweigart, PyAutoGUI, GitHub Repository, 2014 https://github.com/asweigart/pyautogui

[21] V Keleshev, Docopt, GitHub Repository, 2012 https://github.com/docopt/docopt

[22] C.R Harris, K.J Millman, S.J van der Walt, R Gommers, P Virtanen, D Cournapeau, E Wieser, J Taylor, S Berg, N.J Smith, R Kern, M Picus, S Hoyer, M.H van Kerkwijk, M Brett, A Haldane, J.F del Río, M Wiebe, P Peterson, P Gérard-Marchant, K Sheppard, T Reddy, W Weckesser, H Abbasi, C Gohlke, T.E Oliphant, Array programming with NumPy, Nature 585 (2020) 357–362, doi: 10.1038/s41586-020-2649-2

[23] W McKinney, Data structures for statistical computing in Python, in: Proceedings of the 9th Python in Science Conference, 1, 2010, pp 56–61, doi: 10.25080/majora-92bf1922-00a

[24] F Pedregosa, O Grisel, R Weiss, A Passos, M Brucher, G Varoquaux, A Gramfort, V Michel, B Thirion, O Grisel, M Blondel, P Prettenhofer, R Weiss, V Dubourg, M Brucher, Scikit-learn: machine learning in Python, J Mach Learn Res 12 (2011) 2825–2830

[25] G Varoquaux, Joblib, GitHub Repository, 2009 https://github.com/joblib/joblib

[26] F Chollet, Keras, GitHub Repository, 2015 https://github.com/fchollet/keras

[27] M Abadi, A Agarwal, P Barham, E Brevdo, Z Chen, C Citro, G.S Corrado, A Davis, J Dean, M Devin, S Ghemawat, I Goodfellow, A Harp, G Irving, M Isard, Y Jia, R Jozefowicz, L Kaiser, M Kudlur, J Levenberg, D Mane, R Monga, S Moore, D Murray, C Olah, M Schuster, J Shlens, B Steiner, I Sutskever, K Talwar, P Tucker, V Vanhoucke, V Vasudevan, F Viegas, O Vinyals, P Warden, M Wattenberg, M Wicke, Y Yu, X Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, (2016) http://arxiv.org/abs/1603.04467

[28] J.D Hunter, Matplotlib: a 2D graphics environment, Comput Sci Eng 9 (2007) 90–95, doi: 10.1109/MCSE.2007.55

[29] M Waskom, Seaborn, GitHub Repository, 2013 https://github.com/mwaskom/seaborn

