Nontargeted analysis based on mass spectrometry is a rising practice in environmental monitoring for identifying contaminants of emerging concern. Nontargeted analysis performed using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) generates large numbers of possible analytes.
journal homepage: www.elsevier.com/locate/chroma
ionization mass spectra for nontargeted analysis
Joseph Bendik a,1, Richa Kalia a,1, Jeet Sukumaran b, William H. Richardot c,d, Eunha Hoh d, Scott T. Kelley a,b,∗
a Department of Biology, San Diego State University, San Diego, CA, USA
b Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego, CA 92104, USA
c San Diego State University Research Foundation, San Diego, CA, USA
d School of Public Health, San Diego State University, San Diego, CA, USA
Article history:
Received 30 July 2021
Revised 26 October 2021
Accepted 27 October 2021
Available online 31 October 2021
Keywords:
ChromaTOF
PyAutoGUI
Mass spectral comparison
Nontargeted analysis
Suspect screening
Machine learning
Nontargeted analysis based on mass spectrometry is a rising practice in environmental monitoring for identifying contaminants of emerging concern. Nontargeted analysis performed using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) generates large numbers of possible analytes. Moreover, the default spectral library similarity score-based search algorithm used by LECO® ChromaTOF® does not ensure that high similarity scores result in correct library matches. Therefore, an additional manual screening is necessary, but it leads to human errors, especially when dealing with large amounts of data. To improve the speed and accuracy of the chemical identification, we developed CINeMA.py (Classification Is Never Manual Again). This programming suite automates GC×GC/TOF-MS data interpretation by determining the confidence of a match between the observed analyte mass spectrum and the LECO® ChromaTOF® software generated library hit from the NIST Electron Ionization Mass Spectral (NIST EI-MS) library. Our script allows the user to evaluate the confidence of the match using an algorithmic method that mimics the manual curation process and two different machine learning approaches (neural networks and random forest). The script allows the user to adjust various parameters (e.g., similarity threshold) and study their effects on prediction accuracy. To test CINeMA.py, we used data from two different environmental contaminant studies: an EPA study on household dust and a study on stormwater runoff. Using a reference set based on the analysis performed by highly trained users of the ChromaTOF and GC×GC/TOF-MS systems, the random forest model had the highest prediction accuracies of 86% and 83% on the EPA and Stormwater data sets, respectively. The algorithmic approach had the second-best prediction accuracy (82% and 79%), while the neural network had the lowest accuracy (63% and 67%). All the approaches required less than 1 min to classify 986 observed analytes, whereas manual data analysis required hours or days to complete. Our methods were also able to detect high confidence matches missed during the manual review. Overall, CINeMA.py provides users with a powerful suite of tools that should significantly speed up data analysis while reducing the possibilities of manual errors and discrepancies among users, and can be applicable to other GC/EI-MS instrument based nontargeted analysis.
© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
1 Introduction
Environmental monitoring for chemical contaminants typically requires using targeted analysis, in which a priori information (mass spectra, retention times, etc.) on specific chemicals is used to detect compounds of interest. While these methods are sensitive and quantitative for a known set of compounds, they miss undefined compounds regardless of their abundance. Nontargeted analysis (NTA), including suspect screening, was developed to detect multiple compounds simultaneously, including novel compounds, and involves comprehensive sample preparation and chromatography followed by full mass spectrometry analysis [1–3]. Comprehensive two-dimensional gas chromatography coupled with time-
∗ Corresponding author at: Department of Biology, San Diego State University, San Diego, CA, USA.
E-mail address: skelley@sdsu.edu (S.T. Kelley).
1 These authors contributed equally to this work.
https://doi.org/10.1016/j.chroma.2021.462656
0021-9673/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
In GC×GC/TOF-MS based NTA, the raw data is analyzed using data processing software such as LECO® ChromaTOF®. ChromaTOF’s “automatic peak search” first identifies features based upon certain conditions (e.g., S/N ratio, GC retention time). Additionally, ChromaTOF’s peak table alignment feature, “Statistical Compare”, enables users to make comparisons between groups of samples (e.g., Samples vs. Controls) to efficiently isolate compounds of interest. “Statistical Compare” aligns peaks across sample groups based upon 1st and 2nd dimension GC retention times, as well as mass spectral similarity. In order to identify compounds of interest, each peak is compared against the National Institute of Standards and Technology electron ionization mass spectral (NIST EI-MS) library (or a custom MS library, depending on the user), generating a list of ranked suggested compounds (Library Hits) and a “similarity score” computed by ChromaTOF utilizing the NIST Similarity score, which is based on the relative abundances of the matched pairs of masses and the abundance ratios of adjacent matching peaks [10,11]. Afterwards, each library hit must be manually reviewed to further evaluate the confidence of a match between the library hit mass spectra and the observed mass spectra (after deconvolution), known as the Peak True mass spectra in ChromaTOF.
Currently, the observed mass spectra and library hit mass spectra are either manually reviewed in ChromaTOF, or the data can be exported as a PDF. Fig. 1A shows the workflow for manual data analysis [12]. Once the best matches are obtained using the spectral library search algorithms, analytical reference standards are procured, and their respective retention times and mass spectra are obtained under the same GC×GC/TOF-MS instrumental conditions. The verification success rates were 94% and 96% in our studies [4,13]. This supports the notion that the manual review works for determination of high confidence identification. However, this manual review can be time consuming and error prone when the data size is large, and results can be inconsistent among users. For instance, reviewing thousands of compounds’ mass spectra and their matching mass spectra from an MS library (e.g., the NIST EI-MS library) can take many hours or even days depending on user experience. This high level of manual data handling leads to numerous errors, necessitating multiple independent reviews of all results. Thus, automation of these tasks would be extremely valuable to improve the accuracy and increase the analysis throughput [14].
To improve the speed and accuracy of identification based on mass spectral matching, we developed two programs: chromaTOF_auto.py and CINeMA.py (Classification Is Never Manual Again). The chromaTOF_auto.py script automates GC×GC/TOF-MS data download from LECO® ChromaTOF® software, while CINeMA.py facilitates the confirmation of analyte matches between the NIST mass spectral library and the experimental mass spectra using two different approaches: an algorithmic method based directly on the manual curation method and machine learning approaches using neural networks and random forests trained on manually curated data sets. These machine learning techniques have been used for similar mass spectrometry applications in pre-
pre-wasusedfordataprocessing.Thestormwaterrunoff samples(aka theStormwater dataset) were collectedby theSanFrancisco Es-tuaryInstitute(SFEI)fromNapa,Sonoma,andSantaRosacounties
inCaliforniafollowingthe2017NorthernCaliforniawildfires[13] Thehouseholddustsamples(aka.theEPAdataset)wereprovided
aspartof theU.S Environmental Protection Agency (EPA)’s Non-targetedAnalysis Collaborative Trial(ENTACT),an inter-laboratory studydesigned to compare the various workflow techniques im-plemented within the NTA research community [18,19] In brief, participants were givena seriesof samplesin a blind trialsome
of whichhad been spiked witha cocktail ofvarious compounds andwere instructed to conductNTA The EPAdata setcontained
986 observed analytes from the analysis of LECO® ChromaTOF® softwareandtheStormwaterdatasetcontained892observed ana-lytes.IntheEPAdataset,409compoundsweremanuallyreviewed
tobehighconfidencematches,and577werereviewedaslow con-fidence In the Stormwater data set, 373 were reviewed as high confidenceand519were reviewedaslow confidence.TheLECO® ChromaTOF® softwareassignseachchromatographicpeakaname baseduponmassspectralsimilaritytocompoundswithinthe2011 NISTEI-MSlibrary.Afterisolatingallcompoundsofinterestduring review,theusersortsthe“peaktable” inChromaTOF sothat each compound of interest is insequential order To do so, the “peak table” was sorted by “comment” and “peak number” The “peak true” (deconvoluted mass spectra) data of all compounds of in-terest were then exported in MSP format (peak_true.msp) Next, the mass spectra of each compound’s assigned name from the
2011NISTEI-MSlibrary(libraryhit)wereexportedusingthe chro-maTOF_auto.pyscript.ThechromaTOF_auto.pyisbasedon PyAuto-GUI, apython module tocontrol theuse ofmouseandkeyboard forautomationofanyGraphical UserInterface PyAutoGUI repro-duces human actions such as moving, clicking and dragging the mouse, pressing andholding keys, and pressingkeyboard hotkey combinations [20] Using thisscript an analyst can easily extract theGC×GC/TOF-MSlibraryhitsdatafromtheLECO® ChromaTOF® software for further analysis in a significantly reduced time and withnegligible humaneffort The chromaTOF_auto.py scriptdoes notmodify,manipulate,orextendthesoftwareordatabasesofthe LECO® ChromaTOF® software
Fig. 1B shows the workflow for automated data download with chromaTOF_auto.py. The LECO® ChromaTOF® workspace is composed, in left to right order, of the following components: the directory for accessing tools and options (Acquisition Que, GC and MS Methods, Acquired Samples, etc.), the peak table, and the library hit mass spectrum (Fig. S1). The chromaTOF_auto.py script saves the library hit files sequentially in the most recent directory used by the user, renaming the files (1.msp, 2.msp, etc.) for easy access.
2.2 Data parsing
The data obtained from the GC×GC/TOF-MS data analysis by the LECO® ChromaTOF® software on both the EPA and Stormwater data sets was parsed using CINeMA.py to extract: (1) Analyte name (“Name”); (2) Mass-to-charge ratio (m/z) of the ions and their respective intensities; (3) Similarity Score between the observed analyte and library hit #1 from the LECO® ChromaTOF® software (only present in library hits); and (4) Total number of ions in the mass spectrum. This data was necessary for the script CINeMA.py to analyze the confidence of a match between the observed analytes and library hits. In addition, since the lowest mass spectral acquisition ion was m/z 50, the manual review of matches ignores all ions below m/z 50 present in the library hit. CINeMA.py parsed all the files in the given data directory into the required data structure to train, test, or make predictions, using either the algorithmic model or the machine learning models [21–23]. Depending on the user action (predict, train, or test), CINeMA.py requires the data directory to have a specific organizational structure (Fig. 2).

Fig. 1. Workflow for manual (A) or automated (B) data analysis. An environmental sample, once collected, is processed using GC×GC/TOF-MS and analyzed using the LECO® ChromaTOF® software. The LECO® ChromaTOF® software outputs a list of observed analytes present in the sample. For a manual analysis, this processed data for each observed analyte and their respective library hits are then manually downloaded by the analyst. Next, the analyst reviews this manually downloaded data to evaluate the confidence of the match (High or Low) between the mass spectra of each observed analyte and their corresponding library hit. For the automated analysis, the user creates a directory to save the observed analyte’s (SA) library hit files and then downloads them sequentially using the chromaTOF_auto.py script.
CINeMA.py results were benchmarked against those obtained via manual analysis to establish the reliability of our CINeMA.py results and the effectiveness of CINeMA.py in reducing GC×GC/TOF-MS data analysis time. The peak_true.msp file contains data for all the observed analytes together, as shown in Fig. S2. To verify the completeness of the analyte data, the script parses the peak_true.msp file using a state machine, as shown in Fig. S3. Finally, each compound’s library hit is output to an individual file, as shown in Fig. S4.
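The state-machine parsing described above can be illustrated with a minimal sketch. It is not the CINeMA.py implementation: the function name `parse_msp` and the two-state design are hypothetical, and the sketch assumes the common MSP layout in which each record has a `Name:` line, a `Num Peaks:` line, and then whitespace- or semicolon-separated m/z and intensity pairs. The exact layout of a ChromaTOF export may differ.

```python
from enum import Enum, auto


class State(Enum):
    HEADER = auto()  # reading "Key: value" lines of a record
    PEAKS = auto()   # reading "m/z intensity" pairs


def parse_msp(text):
    """Parse MSP-style text into a list of {'name', 'peaks', ...} records.

    Raises ValueError when a record declares more peaks than it contains,
    which is the completeness check this sketch is meant to illustrate.
    """
    records = []
    current = None
    expected = 0
    state = State.HEADER
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if state is State.HEADER:
            key, sep, value = line.partition(":")
            key = key.strip().lower()
            if key == "name":
                current = {"name": value.strip(), "peaks": {}}
            elif key == "num peaks":
                if current is None:
                    raise ValueError("peak count found before any Name line")
                expected = int(value)
                if expected == 0:
                    records.append(current)
                    current = None
                else:
                    state = State.PEAKS
            elif current is not None and sep:
                current[key] = value.strip()  # keep any other header field
        else:
            if line.lower().startswith("name:"):
                raise ValueError(f"incomplete record: {current['name']}")
            tokens = line.replace(";", " ").split()
            for mz, intensity in zip(tokens[0::2], tokens[1::2]):
                current["peaks"][int(float(mz))] = float(intensity)
            if len(current["peaks"]) >= expected:
                records.append(current)
                current = None
                state = State.HEADER
    if state is State.PEAKS:
        raise ValueError(f"incomplete record: {current['name']}")
    return records
```

A record that ends before its declared peak count is reached raises an error instead of being silently accepted, which mirrors the completeness check the text attributes to the state machine.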
2.3 Algorithmic model
The algorithmic model, outlined in Fig. 3, begins by checking the similarity score against a threshold, which is set to 600 (out of 999) by default in this study but can be changed. This similarity score from NIST is an output from the LECO® ChromaTOF® software describing the measure of similarity between the observed analyte mass spectrum and the library hit from the 2011 NIST EI-MS library. The user can alter this similarity score threshold using the command line inputs for CINeMA.py. The algorithm compares the library hit mass spectra with the observed mass spectra from LECO® ChromaTOF® software. A match is deemed a “high confidence” match if the following are true: the similarity score is greater than or equal to the user provided similarity score threshold, the most abundant three ions of the library hit are present in the observed mass spectra (and vice versa), the molecular ion is present, and the correlation percentage between the spectra of the library hit and the observed mass spectra is at least 80%.
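The decision rule above can be sketched in a few lines of Python. This is an illustrative reading of the rule, not the CINeMA.py source: the function names are hypothetical, spectra are assumed to be dictionaries mapping integer m/z to intensity, "present" is taken to mean any nonzero intensity, and the correlation is assumed to be a Pearson correlation computed over the union of m/z values in the two spectra (the text does not specify the correlation measure).

```python
from math import sqrt


def normalize(spectrum, min_mz=50):
    """Drop ions below min_mz and scale so the base peak is 100."""
    kept = {mz: i for mz, i in spectrum.items() if mz >= min_mz and i > 0}
    if not kept:
        return {}
    top = max(kept.values())
    return {mz: 100.0 * i / top for mz, i in kept.items()}


def top_ions(spectrum, n=3):
    """Return the m/z values of the n most abundant ions."""
    return sorted(spectrum, key=spectrum.get, reverse=True)[:n]


def correlation_percent(a, b):
    """Pearson correlation, as a percentage, over the union of m/z values."""
    mzs = sorted(set(a) | set(b))
    x = [a.get(mz, 0.0) for mz in mzs]
    y = [b.get(mz, 0.0) for mz in mzs]
    n = len(mzs)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return 100.0 * sxy / sqrt(sxx * syy)


def classify(observed, hit, similarity, molecular_ion,
             sim_threshold=600, corr_threshold=80.0):
    """Return 'high' only if every check in the rule passes, else 'low'."""
    if similarity < sim_threshold:
        return "low"
    obs, lib = normalize(observed), normalize(hit)
    checks = (
        all(mz in obs for mz in top_ions(lib)),   # hit's top 3 in observed
        molecular_ion in obs,                     # molecular ion present
        all(mz in lib for mz in top_ions(obs)),   # observed's top 3 in hit
        correlation_percent(obs, lib) >= corr_threshold,
    )
    return "high" if all(checks) else "low"
```

Both thresholds are ordinary keyword arguments, so the effect of changing the similarity or correlation cutoff can be explored by calling `classify` with different values.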
2.4 Machine learning models
Two types of machine learning approaches were used to determine if the best library hit is a high- or low-confidence match to the observed mass spectra: a random forest algorithm and a neural network. Random forest and neural networks were both selected for this study primarily because of their effectiveness when working with classification problems such as this. Neural networks can analyze complex relationships between inputs, which makes them a good choice to detect differences in mass spectra that can contain large amounts of ion intensity data. However, neural networks usually require vast amounts of samples for training. Conversely, random forest works well with smaller amounts of data with more clearly defined features, such as the spectra features a reviewer looks for during a manual review. In addition, feature importance can be easily provided with random forest, allowing the user to visualize the aspects of their manual review that the machine considers the most important.
Fig. 2. Data directory structures. (A) Under the sample directory there is a subdirectory called ‘hits’ and the peak_true.msp file that contains the data for observed analytes. The user should use the ‘hits’ directory to save all the library hits files obtained through using chromaTOF_auto.py. Each sample subdirectory should contain a compounds.tsv file, which contains the m/z ratio for the molecular ion in the library hit file. (B) For training or testing the accuracy of a machine learning model with a new data set, the root directory should contain subdirectories, which are sample names. Each sample subdirectory should contain a ground_truth.tsv file, which contains the manual interpretation of the confidence of a match of observed analytes and library hits obtained from GC×GC/TOF-MS data analysis by the LECO® ChromaTOF® software.
Fig. 3. Algorithmic model. If the similarity score from the LECO® ChromaTOF® software is less than the similarity score threshold, the algorithm classifies the match as a low confidence match. If the similarity score is higher, then the model normalizes the spectrum data for both the observed analyte (SA) and the library hit (LH) and checks the following set of conditions: (1) presence of most abundant three ions (Top 3 ions) of the library hit in the observed analyte, (2) presence of molecular ion of the library hit in the observed analyte, (3) presence of top three ions of the observed analyte in the library hit and (4) correlation (≥ 80) between the spectra of the library hit and the observed analyte. If all these conditions are met, it interprets the match as a “high confidence match.”
Fig. 4. Neural network model’s structure. The first 1000 inputs are the library hit ion intensities and the next 1000 are the observed analytes’ ion intensities. There are three hidden layers of size 1000, 100 and 10 neurons, which have softsign activation functions. The last layer of the network uses a softmax activation function and is composed of two neurons for high or low predictions. The model was trained with 5 epochs and a batch size of 128.
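The 2000-value input layout described in the caption can be illustrated with a small sketch. The mapping from m/z to input position is an assumption made here for illustration (position equals the integer m/z), as are the base-peak scaling and the function names; the source states only that the first 1000 inputs hold library hit intensities and the next 1000 hold observed analyte intensities.

```python
N_BINS = 1000  # inputs per spectrum, as stated in the caption


def spectrum_to_bins(spectrum, n_bins=N_BINS, min_mz=50):
    """Place intensities into a fixed-length vector indexed by integer m/z.

    Ions below min_mz or at/above n_bins are ignored (assumption), and
    intensities are scaled so the largest retained ion equals 1.0.
    """
    kept = {}
    for mz, intensity in spectrum.items():
        idx = int(round(mz))
        if min_mz <= idx < n_bins:
            kept[idx] = float(intensity)
    bins = [0.0] * n_bins
    top = max(kept.values(), default=0.0)
    if top > 0:
        for idx, intensity in kept.items():
            bins[idx] = intensity / top
    return bins


def network_input(hit, observed):
    """Concatenate library hit bins (first) and observed analyte bins (second)."""
    return spectrum_to_bins(hit) + spectrum_to_bins(observed)
```

With this layout, position `i` of the vector holds the library hit intensity at m/z `i`, and position `1000 + i` holds the observed analyte intensity at the same m/z.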
The input data for random forest consisted of the same mass spectra features checked when using the algorithmic model: similarity score, correlation percentage, molecular ion presence, and the number of top ions present in the hit that are also present in the observed analyte (and vice versa). The random forest model was built in Python using the Scikit-Learn package [24,25]. The hyperparameters for the model were tuned based on optimizing the accuracy metric, resulting in 100 trees and a max depth of 4. The input data for the neural network consisted of the ion intensities for each observed analyte and its best hit to detect if the two spectra are similar enough to be considered a high-confidence match. This model was built in Python using the Keras and Tensorflow packages [26,27]. Fig. 4 illustrates the structure of the neural network model. Activation functions, the number of epochs, and the batch size were selected for the neural network based on the accuracy metric, aided with the use of GridSearchCV in the Scikit-Learn package. Model performance was examined through confusion matrices, receiver operating characteristic (ROC) curves, and 10-fold cross validation. All models were trained on one of the two data sets and tested on the other using an expert’s manual review of high and low confidence for the data labels. Additionally, to provide more data for model training, these data sets were also combined into one large data set and then trained and tested in three ways: (1) Train on 80% of the combined set and test on the remaining 20%; (2) Train on 80% of the EPA data set plus 100% of the Stormwater data set, and then test on the remaining 20% of the EPA data set; (3) Train on 80% of the Stormwater data set plus 100% of the EPA data set, and then test on the remaining 20% of the Stormwater data set. Random splits were performed on all train test split cases. CINeMA.py also allows the analyst to train and save their own machine learning model on a given data set. The saved models can then be used for testing or making predictions for new data sets.
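The three combined training schemes can be sketched with standard-library code. This is a hypothetical illustration rather than the procedure used in CINeMA.py: the function names and the fixed seed are assumptions, and the test set size is taken as 20% rounded up, which for the combined 1878 compounds yields the 1502 training and 376 test compounds reported in Table 1.

```python
import random


def split_80_20(items, seed=0):
    """Randomly split items into (train, test) with 20% (rounded up) held out."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = (len(shuffled) + 4) // 5  # ceiling of len / 5, integer arithmetic
    return shuffled[n_test:], shuffled[:n_test]


def augmented_split(primary, secondary, seed=0):
    """Schemes (2) and (3): train on 80% of primary plus all of secondary,
    then test on the held-out 20% of primary."""
    train, test = split_80_20(primary, seed)
    return train + list(secondary), test
```

Scheme (1) corresponds to calling `split_80_20` on the concatenation of both data sets, while schemes (2) and (3) call `augmented_split` with the EPA or Stormwater set as `primary`, respectively.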
2.5 Report generation
CINeMA.py generates reports in the form of two files: report.tsv and report.pdf. The report.tsv file contains information about the peak number, name of the observed analyte and the predicted match between the library hit and the observed analyte. The report.pdf file contains mirror plots between each observed analyte’s mass spectrum and its corresponding library hit’s mass spectrum [28]. Fig. 5 shows example plots of high and low confidence matches. The plots allow the analyst to visually compare the observed analyte’s mass spectrum and the corresponding library hit’s mass spectrum if desired. The mirror plot of the two mass spectra makes visual comparison easier than comparing the two separate plots produced by the LECO® ChromaTOF® software.

Fig. 5. Mirror plots comparing observed analyte and library hit mass spectra. The mirror plots are provided by CINeMA.py for all matches from the non-targeted analysis to the given library spectra, allowing straightforward manual confirmation. The top spectrum (positive values in blue) is the spectrum from the observed analyte in the sample, while the bottom mirrored spectrum (negative values in red) is the spectrum of the corresponding library hit for the observed analyte. (A) An example of a high confidence library match. (B) Example of a low confidence match. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

When training a neural network model, CINeMA.py produces model_performance.pdf containing loss curves for each fold during cross-validation, shown in Fig. 6. When testing either of the machine learning models, the script will produce measures.pdf containing the confusion matrix and the ROC curve, as in Fig. 7 [29]. By considering low-confidence matches as “negatives,” and high-confidence matches as “positives,” the user can use the confusion matrix to calculate performance metrics such as accuracy, sensitivity, specificity, and balanced accuracy. When training the random forest model with feature input data, the script will produce importance.pdf containing a bar plot with the relative importance for each feature (Fig. 8). Source code for chromaTOF_auto.py and CINeMA.py, along with tutorials and test data, is available on GitHub at https://github.com/sharmaricha200/thesis.git
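The performance metrics named above follow directly from the four confusion matrix counts. The sketch below uses the standard definitions with high-confidence matches as positives; the function names and label strings are illustrative choices, not taken from CINeMA.py.

```python
def confusion_counts(truth, predicted, positive="high"):
    """Count true/false positives and negatives for paired label lists."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and balanced accuracy."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }
```

Balanced accuracy is the mean of sensitivity and specificity, which makes it less sensitive than plain accuracy to the unequal numbers of high and low confidence matches in a data set.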
Fig. 6. Example training loss generated during one fold of the 10-fold cross validation on the EPA data set. The blue curve (top) indicates the loss on the training samples, and the orange curve indicates the loss on the samples held out for validation in that fold. This result shows the neural network model loss using ion intensity data, trained for 5 epochs with a batch size of 128. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 7. Example efficacy outputs following random forest model training on the EPA data set and testing on the Stormwater data set. (A) Confusion matrix. (B) Receiver Operating Characteristic (ROC) curve.
Fig. 8. Example feature importance for the random forest model trained on the EPA data set and tested on the Stormwater data set.
3 Results and discussion
The automated data collection workflow process implemented in chromaTOF_auto.py needed only a few minutes on an Intel® Core™ i7–6700 Quad CPU with 8 GB RAM running Windows® 10, 64-bit, to download library hit data (.msp) files from a given GC×GC/TOF-MS data output analysis from the LECO® ChromaTOF® software. Because of computational speed, chromaTOF_auto.py initially caused the ChromaTOF® GUI to crash. To overcome this issue, we included a delay timer in the chromaTOF_auto.py script, allowing the user to set up the screen as described above before the automation takes over to download the library hit files.

For generating predictions, CINeMA.py was able to produce results within a minute. When testing the algorithm model on the complete data sets, an accuracy of 81.54% was achieved on the 986 compounds in the EPA set and an accuracy of 78.70% was achieved on the 892 compounds in the Stormwater set. For the machine learning models, the highest accuracy value and Area under the ROC curve score (AUC) seen on the complete EPA set were achieved using the random forest model on the algorithm’s feature data. This model had an accuracy of 85.60% and an AUC score of 0.887. The highest accuracy value and AUC score seen on the Stormwater set were also achieved using the random forest model on the algorithm’s feature data. This model had an accuracy value of 82.85% and an AUC score of 0.899 (Table 1). The neural network did not perform as well as the other models based on the testing accuracies, AUC scores, and cross-validation accuracies (Tables 1 and 2). Combining data sets did, however, somewhat improve the testing accuracy and AUC score for this model (Table 1). Agreement rates between the human user’s decision vs. a model decision per “High” and “Low” confidence were similar, with a slightly higher agreement by the algorithm model in “High” than in “Low” (Table S1). This demonstrates that the models work equally well for compound identification regardless of “High” and “Low” confidence matching.
To identify reasons for discrepancy between classifications (human vs. computer), we manually reviewed “incorrect” classifications. The main source of discrepancy when comparing human classifications to the algorithm’s classifications appeared to come from instances in which observed mass spectra and library hit mass spectra were very similar, but were on the cusp of either high or low confidence. This often occurred in instances in which the library hit mass spectra contained numerous ions with low relative abundance. Since NTA of environmental samples often involves the detection of trace contaminants, compounds present at low concentrations may not produce enough low abundance ions to be detected by the mass spectrometer. As the algorithm is confined by a strict set of rules (e.g., correlation percentage ≥ 80%), some compounds may be classified as “low” while a human user may take additional factors into account and classify them as “high”.

Table 1
Random forest and neural network model performances across the EPA dust and Stormwater data sets. Includes the number of compounds present in the training and test sets, the accuracy on the test set, and the Area Under the ROC Curve (AUC) score.
Model            Train / test split                                            #Training Compounds   #Test Compounds   Testing Accuracy   AUC
RF Features      Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater)     1502                  376               82.45%             0.873
NN Intensities   Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater)     1502                  376               70.48%             0.761

Table 2
10-fold cross validation mean accuracy +/- standard deviation on the two neural network models across the EPA dust and Stormwater data sets.
Model            Train / test split                                            Mean accuracy (+/- SD)
NN Intensities   Train EPA, Test Stormwater                                    74.44% (+/- 3.28%)
NN Intensities   Train Stormwater, Test EPA                                    71.30% (+/- 4.46%)
NN Intensities   Train 80% EPA, Test 20% EPA                                   72.21% (+/- 5.31%)
NN Intensities   Train 80% Stormwater, Test 20% Stormwater                     70.83% (+/- 4.79%)
NN Intensities   Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater)     73.50% (+/- 2.68%)
NN Intensities   Train EPA + 80% Stormwater, Test 20% Stormwater               75.40% (+/- 2.99%)
NN Intensities   Train Stormwater + 80% EPA, Test 20% EPA                      73.10% (+/- 2.70%)
Additionally, both the algorithm and random forest model corrected human errors. As shown in Fig. S5, some compounds in which the observed mass spectra and library hit mass spectra were near perfect matches were erroneously classified as “not a match (low)” by the human user but classified correctly as a “match (high)” by the algorithm. Conversely, there were instances in which the observed mass spectra and library hit mass spectra did not match well but were classified as “high” by the human user and classified as “low” by the algorithm. Such errors were due to fatigue experienced by the human user comparing hundreds of mass spectral matches in succession.

While the random forest model had the highest accuracy scores, there are still some benefits to the use of a simplified algorithm over the machine learning techniques. The simplified algorithm is capable of working with extremely small data sets and does not require an outside source of data for training. Both types of machine learning techniques require data for training and, especially in the case of neural networks, large amounts of data may be necessary. The algorithmic approach avoids this issue, meaning users may prefer this method over training their own machine learning model. This data requirement may also explain the low performance metrics of the neural network compared to the other models, as the number of samples contained in the data sets was relatively small for this type of model. Furthermore, the algorithm is easily tunable, allowing the user to specify their own similarity score and correlation percentage thresholds when testing their own data sets. This ability to easily tune the algorithm makes it applicable for use with programs other than ChromaTOF, as their spectral matching components may use a scale different from ChromaTOF’s similarity score (0–999).
4 Conclusions
Overall, the random forest model provided the best accuracy value for both data sets, and we showed that compounds missed by the algorithm were often recognized by machine learning. Furthermore, by ranking feature importance a machine learning approach can highlight ways to improve the algorithmic approach by illustrating which feature thresholds can be tuned in the algorithm. The neural network model with intensities has the potential to predict unknown rules and patterns for analyzing the data set, which the feature-using models lack. Feature models are based on man-made rules and likely have room for improvement since it could be difficult to hardcode all possible rules. Thus, in principle, with larger data sets a neural network approach using ion intensities has the potential to find patterns and rules that cannot be coded via an algorithm. Furthermore, it can be improved by increasing the size and accuracy of training data sets. In future work, we will continue to explore the potential of neural networks with intensity data to enhance the accuracy of NTA.

In terms of speed, CINeMA.py is able to provide prediction results within a minute. Manual data analysis by multiple people required hours or even days for the same data sets of observed analytes. CINeMA.py’s capacity to rapidly evaluate the confidence of a match between observed analytes and library matches represents a significant improvement over manual analysis that can take substantial time depending on the data size and can be error-prone during heavy data handling. CINeMA.py gives the user the flexibility to not only automate the interpretation of the confidence of the match of observed analytes and their corresponding library matches, but also to experiment with various test parameters to study their effects on the analysis. In addition, the user can choose to use either or both the algorithmic model and any of the machine learning models to analyze their data and compare their predictions. The user can also train the machine learning models with relevant data sets to improve predictions on new data sets. Because
CRediT authorship contribution statement
Joseph Bendik: Software, Investigation, Formal analysis, Validation, Visualization, Writing – original draft. Richa Kalia: Software, Investigation, Visualization, Formal analysis, Writing – original draft. Jeet Sukumaran: Software, Methodology. William H. Richardot: Validation, Data curation, Resources. Eunha Hoh: Methodology, Validation, Funding acquisition, Writing – review & editing. Scott T. Kelley: Conceptualization, Writing – original draft, Writing – review & editing, Supervision, Project administration.
Funding
This work was funded in part by the California Tobacco Related Disease Research Program funded grant (27IP-0028C).
Acknowledgments
We would like to thank Dr. Nathan Dodder, Ying Xu, Bryan Ho, and Basilin Benson for their valuable insights during the study design.
Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.chroma.2021.462656
References
[1] L. Chibwe, I.A. Titaley, E. Hoh, S.L.M. Simonich, Integrated framework for identifying toxic transformation products in complex environmental mixtures, Environ. Sci. Technol. Lett. 4 (2017) 32–43, doi: 10.1021/acs.estlett.6b00455
[2] J. Hollender, E.L. Schymanski, H.P. Singer, P.L. Ferguson, Nontarget screening with high resolution mass spectrometry in the environment: ready to go? Environ. Sci. Technol. 51 (2017) 11505–11512, doi: 10.1021/acs.est.7b02184
[3] J.R. Sobus, J.F. Wambaugh, K.K. Isaacs, A.J. Williams, A.D. McEachran, A.M. Richard, C.M. Grulke, E.M. Ulrich, J.E. Rager, M.J. Strynar, S.R. Newton, Integrating tools for non-targeted analysis research and chemical safety evaluations at the US EPA, J. Expo. Sci. Environ. Epidemiol. 28 (2018) 411–426, doi: 10.1038/s41370-017-0012-y
[4] C.D. Tran, N.G. Dodder, P.J.E. Quintana, K. Watanabe, J.H. Kim, M.F. Hovell, C.D. Chambers, E. Hoh, Organic contaminants in human breast milk identified by non-targeted analysis, Chemosphere 238 (2020) 124677, doi: 10.1016/j.chemosphere.2019.124677
[5] M.B. Alonso, K.A. Maruya, N.G. Dodder, J. Lailson-Brito, A. Azevedo, E. Santos-Neto, J.P.M. Torres, O. Malm, E. Hoh, Nontargeted screening of halogenated organic compounds in bottlenose dolphins (Tursiops truncatus) from Rio de Janeiro, Brazil, Environ. Sci. Technol. 51 (2017) 1176–1185, doi: 10.1021/acs.est.6b04186
[6] C.A. Manzano, N.G. Dodder, E. Hoh, R. Morales, Patterns of personal exposure to urban pollutants using personal passive samplers and GC×GC/ToF-MS, Environ. Sci. Technol. 53 (2019) 614–624, doi: 10.1021/acs.est.8b06220
T. Novotny, D. Schlenk, R.M. Gersberg, E. Hoh, Assessing toxicity and in vitro bioactivity of smoked cigarette leachate using cell-based assays and chemical analysis, Chem. Res. Toxicol. 32 (2019) 1670–1679, doi: 10.1021/acs.chemrestox.9b00201
[13] D. Chang, W.H. Richardot, E.L. Miller, N.G. Dodder, M.D. Sedlak, E. Hoh, R. Sutton, Framework for non-targeted investigation of contaminants released by wildfires into stormwater runoff: case study in the Northern San Francisco Bay area, Integr. Environ. Assess. Manag. (2021) Online ahead of print, doi: 10.1002/ieam.4461
[14] H. Mol, Non-targeted is our target, The Anal. Scientist (2013) https://theanalyticalscientist.com/techniques-tools/non-targeted-is-our-target
[15] E.D. Strozier, D.D. Mooney, D.A. Friedenberg, T.P. Klupinski, C.A. Triplett, Use of comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometric detection and random forest pattern recognition techniques for classifying chemical threat agents and detecting chemical attribution signatures, Anal. Chem. 88 (2016) 7068–7075, doi: 10.1021/acs.analchem.6b00725
[16] F. Allen, A. Pon, R. Greiner, D. Wishart, Computational prediction of electron ionization mass spectra to assist in GC/MS compound identification, Anal. Chem. 88 (2016) 7689–7697, doi: 10.1021/acs.analchem.6b01622
[17] D.D. Matyushin, A.Y. Sholokhova, A.K. Buryak, Deep learning driven GC-MS library search and its application for metabolomics, Anal. Chem. 92 (2020) 11818–11825, doi: 10.1021/acs.analchem.0c02082
[18] E.M. Ulrich, J.R. Sobus, C.M. Grulke, A.M. Richard, S.R. Newton, M.J. Strynar, K. Mansouri, A.J. Williams, EPA’s non-targeted analysis collaborative trial (ENTACT): genesis, design, and initial findings, Anal. Bioanal. Chem. 411 (2019) 853–866, doi: 10.1007/s00216-018-1435-6
[19] S.R. Newton, J.R. Sobus, E.M. Ulrich, R.R. Singh, A. Chao, J. McCord, S. Laughlin-Toth, M. Strynar, Examining NTA performance and potential using fortified and reference house dust as part of EPA’s non-targeted analysis collaborative trial (ENTACT), Anal. Bioanal. Chem. 412 (2020) 4221–4233, doi: 10.1007/s00216-020-02658-w
[20] A. Sweigart, PyAutoGUI, GitHub Repository, 2014. https://github.com/asweigart/pyautogui
[21] V. Keleshev, Docopt, GitHub Repository, 2012. https://github.com/docopt/docopt
[22] C.R. Harris, K.J. Millman, S.J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N.J. Smith, R. Kern, M. Picus, S. Hoyer, M.H. van Kerkwijk, M. Brett, A. Haldane, J.F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T.E. Oliphant, Array programming with NumPy, Nature 585 (2020) 357–362, doi: 10.1038/s41586-020-2649-2
[23] W. McKinney, Data structures for statistical computing in python, in: Proceedings of the 9th Python in Science Conference, 1, 2010, pp. 56–61, doi: 10.25080/majora-92bf1922-00a
[24] F. Pedregosa, O. Grisel, R. Weiss, A. Passos, M. Brucher, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, M. Brucher, Scikit-learn: machine learning in python, J. Mach. Learn. Res. 12 (2011) 2825–2830
[25] G. Varoquaux, Joblib, GitHub Repository, 2009. https://github.com/joblib/joblib
[26] F. Chollet, Keras, GitHub Repository, 2015. https://github.com/fchollet/keras
[27] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, (2016) http://arxiv.org/abs/1603.04467
[28] J.D. Hunter, Matplotlib: a 2D graphics environment, Comput. Sci. Eng. 9 (2007) 90–95, doi: 10.1109/MCSE.2007.55
[29] M. Waskom, Seaborn, GitHub Repository, 2013. https://github.com/mwaskom/seaborn