The 4th Science and Technology Conference, SEMREGG 2018
APPLICATION OF MLR, PCR AND ANN MODELS FOR THE PREDICTION OF STABILITY CONSTANTS OF DIVERSE METAL CATIONS WITH THIOSEMICARBAZONE DERIVATIVES IN ENVIRONMENTAL MONITORING
Nguyen Minh Quang 1,2, Huynh Nhat Lam 2, Tran Xuan Mau 1, Tran Thi Thanh Ngoc 3, Pham Van Tat 4*
1 Faculty of Chemistry, University of Sciences, Hue University, 77 Nguyen Hue, Hue City
2 Faculty of Chemical Engineering, Industrial University of Ho Chi Minh City, 12 Nguyen Van Bao, Go Vap district, Ho Chi Minh City
3 Faculty of Geology and Mineralogy, Ho Chi Minh City University of Natural Resources and Environment, 236B Le Van Sy, Tan Binh district, Ho Chi Minh City
4 Faculty of Science and Technology, Hoa Sen University, 8 Dinh Cong Trang, District 1, Ho Chi Minh City
*Email: vantat@gmail.com
ABSTRACT
Multivariate linear regression (MLR), principal component regression (PCR) and artificial neural network (ANN) methods were used to construct quantitative relationships between molecular structure and the stability constants (logβ11) of metal-thiosemicarbazone complexes. In this study, the QSPR models were built with the descriptors knotp, SHBa, HOMO, xvpc4, N4, LUMO, ionization potential, dipole, molecular weight, Maxneg and Hf, selected from the descriptor sets. The quality of the QSPR models was demonstrated by the statistical parameters R²train = 0.9296, Q²LOO = 0.8673, MSE = 0.5878 and Fstat = 45.5829 for the QSPRMLR model, and R²train = 0.9236, R²CV = 0.9423, MSE = 0.4190 and Fstat = 30.7886 for the QSPRPCR model. The neural network model QSPRANN, with architecture I(11)-HL(14)-O(1), was constructed from the descriptors of the 11-variable QSPRMLR model. The QSPRANN model gave R²train = 0.9912, Q²CV = 0.9938 and R²test = 0.9948. The logβ11 values predicted by the QSPRMLR, QSPRPCR and QSPRANN models were validated externally for ten randomly selected complexes. The results can guide the design of new thiosemicarbazones for the rapid analysis of metal ions in environmental monitoring.
Keywords: QSPR models, stability constants logβ11, multivariate linear regression, artificial neural network, thiosemicarbazone
1 INTRODUCTION
Pollution is the introduction of contaminants into the natural environment that cause adverse change. Pollution can take the form of chemical substances or energy, such as noise, heat or light. Major forms of pollution include air pollution, light pollution, littering, noise pollution, plastic pollution, soil contamination, radioactive contamination, thermal pollution, visual pollution and water pollution [1]. Here we are concerned with environmental pollution caused by chemical agents, especially heavy metals.
In water pollution, heavy metal ions come from a variety of sources. Most of them are emitted by the metallurgical and electroplating industries [2-3]. Trace amounts of metal ions are important in industry [3], as environmental pollutants [3, 4] and as occupational hazards [3, 4]. There are many ways to monitor them in the environment. One of them is the spectrophotometric method with UV-VIS equipment, which is widely used because it is cheap and easy to handle. This method uses organic substances as complexing reagents [5] to determine various metals. In addition, thiosemicarbazones have rapidly grown in popularity in environmental chemistry for determining metal ions [6]. A survey of the literature reveals that a few thiosemicarbazones are employed for the direct spectrophotometric determination of metals in aqueous solution. In published studies, the authors proposed new thiosemicarbazone analytical reagents for the spectrophotometric analysis of metals. In the complex formation of metal ions with a thiosemicarbazone ligand, the stability constant plays a very important role. This quantity is related to the metal ion and to the structural properties of the ligand. The relationship can be built with the quantitative structure-property relationship (QSPR) method, a modeling approach that has been successfully applied in many fields [7]. Furthermore, QSPR modeling provides an effective method for establishing the relationship between the structural descriptors of molecules and their properties, with a view to designing new compounds [8].
The present study focuses on the construction of QSPR models for reliably predicting the stability constants of metal-thiosemicarbazone complexes. The QSPR models were developed from the experimental stability constants and the chemical structures. The structural descriptors were calculated using the semi-empirical quantum chemistry method in its new PM7 and PM7/sparkle versions [9] and molecular geometry calculations. The QSPR models comprise the QSPRMLR, QSPRPCR and QSPRANN models. The QSPRMLR model is established by multivariate linear regression, the QSPRPCR model is built by principal component regression, and the QSPRANN model is constructed by error back-propagation using the multilayer perceptron algorithm, with an input layer consisting of the variables of the best QSPRMLR model. The stability constants logβ11 of the complexes between the metal ions and thiosemicarbazones resulting from the QSPR models are also validated externally against experimental data from the literature. This work applies QSPR modeling to the stability constants of metal-thiosemicarbazone complexes for the first time.
2 COMPUTATIONAL METHODS
2.1 Experimental Datasets
The experimental stability constants (logβ11) for the (M:L) complexes of transition metal ions (Ni2+, Co2+, Mo6+, Cu2+, Mn2+, Zn2+, Ag+, Pb2+ and Fe2+) with different thiosemicarbazone ligands in water were taken from the published literature [10-21], at temperatures from 288 K to 323 K, pH from 2.4 to 10 and an average ionic strength I = 0.1 M. The 50 metal-thiosemicarbazone complexes of the training set comprise 3 (Cd2+), 5 (Co2+), 2 (Cr3+), 1 (Cr6+), 3 (Cu2+), 4 (Fe2+), 1 (Mg2+), 9 (Mn2+), 12 (Ni2+), 5 (Pb2+) and 5 (Zn2+) complexes, as presented in Table 1. The test set includes 10 substances with the 8 metal ions Cu2+, Ho3+, Dy3+, Co2+, Zn2+, Mn2+, Ni2+ and Fe3+ [15,16,18,22,23], as shown at the end of Table 1. The metal-thiosemicarbazone complexes (ML) are formed by the reaction between a metal ion (M) and a thiosemicarbazone ligand (L) in aqueous solution [25]; the general structures of a thiosemicarbazone and its complex are shown in Fig. 1 [6].
Figure 1. Schematic of complex formation and general structures of (a) thiosemicarbazone and (b) the metal-thiosemicarbazone complex
Here, the logβ11 values are the log-transformed stability constants of the metal-thiosemicarbazone complexes, and the stability constant β11 is calculated by the following equation [25]:
$$\beta_{11} = \frac{[ML]}{[M][L]} \qquad (1)$$
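As a small worked illustration of the stability constant defined above, the sketch below evaluates logβ11 for the 1:1 equilibrium M + L ⇌ ML. The function name `log_beta11` and the concentrations are hypothetical and chosen only to show the arithmetic; they are not taken from the data set.

```python
import math

def log_beta11(conc_ml, conc_m, conc_l):
    """Return log10 of beta11 = [ML] / ([M][L]) for equilibrium concentrations in mol/L."""
    return math.log10(conc_ml / (conc_m * conc_l))

# Hypothetical equilibrium concentrations (mol/L), for illustration only.
example = log_beta11(conc_ml=1.0e-4, conc_m=1.0e-5, conc_l=1.0e-5)
print(f"log(beta11) = {example:.3f}")
```

A larger logβ11 means that more of the metal is bound as ML at equilibrium, which is why the quantity serves as the modeled property.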
Table 1. Experimental logβ11,exp values for the ML complexes and the cross-validated and externally predicted logβ11,cal values resulting from the QSPRMLR, QSPRPCR and QSPRANN models
No  Thiosemicarbazone ligand  Metal ion  logβ11,exp  logβ11,cal (MLR  PCR  ANN)  Ref
Training set and internal test set
1 H H H -C6H3(OH)OCH3 Co2+ 11.970 12.598 12.784 11.926 [10]
2 H H H -C6H3(OH)OCH3 Mn2+ 10.550 11.042 11.371 10.555 [10]
3 H -CH3 -CH3 -C5H4N Cu2+ 6.114 6.648 6.608 6.077 [11]
4 H H H -C6H3BrOH Cu2+ 5.633 6.075 6.061 5.778 [12]
5 H H -CH3 -CH=N-NHC6H5 Cu2+ 11.950 10.353 10.388 11.662 [13]
6 H H -CH3 -CH=N-NHC6H5 Co2+ 10.220 11.463 11.320 10.206 [13]
7 H H -CH3 -CH=N-NHC6H5 Ni2+ 10.890 10.914 11.058 10.873 [13]
8 H H H -C6H3(OH)OCH3 Cr6+ 4.842 5.125 5.202 4.871 [14]
9 H H -CH3 -CH=N-NHC6H5 Mn2+ 9.870 9.324 9.439 9.716 [15]
10 H H -CH3 -CH=N-NHC6H5 Mn2+ 9.720 9.324 9.439 9.716 [15]
11 H H -CH3 -CH=N-NHC6H5 Mn2+ 9.600 9.324 9.439 9.716 [15]
12 H H H -C9H5NOH Zn2+ 6.680 6.761 6.671 6.701 [16]
13 H H H -C6H3(OH)OCH3 Mn2+ 4.120 5.432 5.492 3.436 [17]
14 H H H -C6H3(OH)OCH3 Fe2+ 8.150 7.454 7.619 7.952 [17]
15 H H H -C6H3(OH)OCH3 Fe2+ 7.990 7.454 7.619 7.952 [17]
16 H H H -C6H3(OH)OCH3 Fe2+ 7.840 7.454 7.619 7.952 [17]
17 H H H -C6H3(OH)OCH3 Fe2+ 7.690 7.454 7.619 7.952 [17]
18 H H H -C6H3(OH)OCH3 Ni2+ 8.650 8.229 8.228 8.402 [17]
19 H H H -C6H3(OH)OCH3 Ni2+ 8.480 8.229 8.228 8.402 [17]
20 H H H -C6H3(OH)OCH3 Ni2+ 8.370 8.229 8.228 8.402 [17]
21 H H H -C6H3(OH)OCH3 Ni2+ 8.110 8.229 8.228 8.402 [17]
22 H H -CH3 -C6H4OH Ni2+ 5.940 5.492 5.433 5.558 [18]
23 H H -CH3 -C6H4OH Ni2+ 5.310 5.492 5.433 5.558 [18]
24 H H -CH3 -C6H4OH Ni2+ 5.140 5.492 5.433 5.558 [18]
25 H H H -C10H6OH Mg2+ 3.250 3.916 3.858 4.081 [19]
26 H H H -C10H6OH Mn2+ 4.660 3.709 3.759 4.665 [19]
27 H H H -C10H6OH Pb2+ 6.680 7.061 7.235 6.644 [19]
28 H H H -C10H6OH Pb2+ 6.570 7.061 7.235 6.644 [19]
29 H H - -C9H8NO Ni2+ 8.221 7.964 7.913 8.098 [20]
30 H H - -C9H8NO Ni2+ 8.124 7.964 7.913 8.098 [20]
31 H H - -C9H8NO Ni2+ 7.910 7.964 7.913 8.098 [20]
32 H H - -C9H8NO Ni2+ 7.709 7.964 7.913 8.098 [20]
33 H H - -C9H8NO Pb2+ 7.861 7.357 7.176 7.536 [20]
34 H H - -C9H8NO Pb2+ 7.653 7.357 7.176 7.536 [20]
35 H H - -C9H8NO Pb2+ 7.307 7.357 7.176 7.536 [20]
36 H H - -C9H8NO Co2+ 7.668 7.388 7.203 7.463 [20]
37 H H - -C9H8NO Co2+ 7.591 7.388 7.203 7.463 [20]
38 H H - -C9H8NO Co2+ 7.251 7.388 7.203 7.463 [20]
39 H H - -C9H8NO Zn2+ 7.820 7.269 7.241 7.272 [20]
40 H H - -C9H8NO Zn2+ 7.534 7.269 7.241 7.272 [20]
41 H H - -C9H8NO Zn2+ 7.423 7.269 7.241 7.272 [20]
42 H H - -C9H8NO Zn2+ 7.039 7.269 7.241 7.272 [20]
43 H H - -C9H8NO Cd2+ 7.015 6.924 6.927 6.774 [20]
44 H H - -C9H8NO Cd2+ 6.863 6.924 6.927 6.774 [20]
45 H H - -C9H8NO Cd2+ 6.611 6.924 6.927 6.774 [20]
46 H H - -C9H8NO Mn2+ 5.820 5.860 5.971 5.529 [20]
47 H H - -C9H8NO Mn2+ 5.621 5.860 5.971 5.529 [20]
48 H H - -C9H8NO Mn2+ 5.439 5.860 5.971 5.529 [20]
49 H H H -C6H4NO2 Cr3+ 10.150 11.007 10.952 10.696 [21]
50 H H H -C6H4NO2 Cr3+ 11.250 11.007 10.952 10.696 [21]
External test set
1 -CH3 -CH3 -C5H4N -C5H4N Cu2+ 7.080 7.1140 7.0279 6.560 [22]
2 H H -CH3 -C5H4N Ho3+ 8.640 8.9132 8.6831 9.198 [23]
3 H H -CH3 -C5H4N Dy3+ 8.240 8.6043 8.3225 8.765 [23]
4 H H -CH3 -CH=N-NHC6H5 Cu2+ 11.700 10.3534 10.3882 11.662 [15]
5 H H -CH3 -CH=N-NHC6H5 Co2+ 10.020 11.4634 11.3203 10.206 [15]
6 H -CH3 -CH3 -CH=N-NHC6H5 Cu2+ 12.300 12.2869 12.4456 11.903 [15]
7 H -C2H5 H -C9H5NOH Zn2+ 6.130 7.0925 7.1365 6.623 [16]
8 H H -CH3 -C6H4OH Mn2+ 5.000 4.7429 4.8538 5.280 [18]
9 H H H -C6H4NH2 Ni2+ 12.710 12.1015 12.2011 11.523 [22]
10 H H -C6H4OH -C6H4OH Fe3+ 5.496 6.2633 6.3243 6.251 [24]
2.2 Calculation and selection of descriptors
Based on the experimental results, the 2D structures of the metal-thiosemicarbazone complexes were sketched using ChemBioDraw Ultra 2013 [26]. The structures were then optimized by quantum mechanics with MOPAC2016 [27]. The quantum descriptors were also computed with MOPAC2016, using the semi-empirical quantum method in its new PM7 version and PM7/sparkle for lanthanides [9]. The 2D and 3D topological descriptors were obtained with the QSARIS system [28, 29].
After all the necessary parameters had been computed, unsuitable variables were filtered out to assemble a data set that contains, for each observation, the logβ11 value and the structural parameters. This data set was then used to develop the models.
2.3 QSPR modeling method
2.3.1 MLR method
In a quantitative structure-property relationship, the dependent variable (Y), here the logβ11 values, is correlated with independent quantitative or qualitative variables (X), here the structural descriptors. The relationship is assumed to be well represented by a multivariate linear regression (MLR) model. In this case, the regression model with k explanatory variables is expressed as [28, 30, 31]:
$$Y = \beta_0 + \sum_{j=1}^{k} \beta_j X_j + \varepsilon \qquad (2)$$
where β0 is the intercept of the model, βj are the regression coefficients (slopes) and ε is the random error with expectation 0 and variance σ².
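The MLR model above can be illustrated with an ordinary least-squares fit. The following is a minimal sketch in Python with NumPy, not the Regress or MS-EXCEL workflow used in this study; the helper names `fit_mlr` and `predict_mlr` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least-squares fit of y = b0 + sum_j b_j * x_j."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept b0.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # [b0, b1, ..., bk]

def predict_mlr(coef, X):
    """Apply fitted coefficients to a descriptor matrix."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]

# Synthetic illustration only (not the descriptors or data of this study).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coef = np.array([6.0, 2.0, -1.0, 0.5])
y_demo = true_coef[0] + X_demo @ true_coef[1:] + rng.normal(scale=0.1, size=50)

coef_demo = fit_mlr(X_demo, y_demo)
print(np.round(coef_demo, 3))
```

With 50 observations and little noise, the fitted coefficients land close to the values used to generate the data, which is the behaviour the model assumes when the error has zero expectation.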
2.3.2 PCR method
In statistics, principal component regression (PCR) is a regression analysis technique based on principal component analysis (PCA). Typically, the dependent variable is regressed on a set of explanatory (independent) variables using a standard linear regression model, with PCA used to estimate the unknown regression coefficients of the model.
PCR can be divided into three steps: first, a PCA is run on the table of explanatory variables; second, an MLR is run on the selected components; third, the parameters of the model that correspond to the input variables are computed [32].
In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components (PCs) of the explanatory variables are used as regressors. Normally only a subset of all the PCs is used for regression, which makes PCR a kind of regularized procedure. Often the PCs with higher variances are selected as regressors. However, for the purpose of predicting the dependent variable, the PCs with low variances may also be important, in some cases even more important [32].
One major use of PCR lies in overcoming the multicollinearity problem, which arises when two or more of the explanatory variables are close to being collinear [33]. PCR can deal with such situations by excluding some of the low-variance PCs in the regression step. In addition, by regressing on only a subset of the principal components, PCR achieves dimension reduction by substantially lowering the effective number of parameters that characterize the underlying model. This can be particularly useful in settings with high-dimensional covariates. Also, through appropriate selection of the PCs used for regression, PCR can lead to efficient prediction of the outcome based on the assumed model. If there are linear relationships between the independent variables of an MLR model, the independent variables are multicollinear. The main advantage of the PCR model is that it eliminates this collinearity among the independent variables [32].
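The three PCR steps described above can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions (descriptors centred but not scaled, PCs ranked by variance, the name `fit_pcr` and the synthetic collinear data are hypothetical); it is not the software used in this study.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Principal component regression; returns (intercept, coefficients)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    # Step 1: PCA through the SVD of the centred descriptor matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T        # loadings, shape (p, m)
    scores = Xc @ V                # principal-component scores, shape (n, m)
    # Step 2: least-squares regression of centred y on the retained scores.
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    # Step 3: map the component coefficients back to the input variables.
    beta = V @ gamma
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic, nearly collinear descriptors: x3 is almost x1 + x2.
rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
x2 = rng.normal(size=40)
x3 = x1 + x2 + rng.normal(scale=1e-3, size=40)
X_demo = np.column_stack([x1, x2, x3])
y_demo = 5.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=40)

b0, b = fit_pcr(X_demo, y_demo, n_components=2)
y_hat = b0 + X_demo @ b
rmse = float(np.sqrt(np.mean((y_demo - y_hat) ** 2)))
```

Dropping the third, low-variance component removes the direction along which the coefficients are poorly determined, while the fit to `y_demo` stays essentially unchanged. Keeping all components would reproduce the ordinary MLR solution.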
2.3.3 Artificial neural network
For the ANN analysis, a multilayer perceptron (MLP), for which many learning algorithms exist, is normally used. An MLP can have more than one hidden layer. However, several studies have shown that one hidden layer is sufficient for an ANN to approximate any complex non-linear function [34]. In the present work, suitable numbers of hidden layers and epochs were determined by trial and error. The hyperbolic tangent sigmoid function was used as the transfer function for the input and output data. It is written as equation (3) [35]:
$$f(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad (3)$$
In this study, all ANN calculations were performed in Matlab version 2016a [36] with the Neural Network tool (nntool) toolbox on the data set of compounds. A typical feed-forward neural network was trained with the Levenberg-Marquardt learning algorithm [37] (trainlm). This algorithm appears to be the fastest method for training moderate-sized feed-forward neural networks.
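To make the transfer function and the network shape concrete, the sketch below implements the hyperbolic tangent sigmoid and a single forward pass through an I(11)-HL(14)-O(1) network in NumPy. It is not the Matlab nntool workflow used here: the weights are random and untrained, Levenberg-Marquardt training is not implemented, and the linear output neuron and the names `tansig` and `mlp_forward` are assumptions for illustration.

```python
import numpy as np

def tansig(x):
    """Hyperbolic tangent sigmoid, 2 / (1 + exp(-2x)) - 1, equal to tanh(x)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer with tansig activation, then a linear output (assumed)."""
    hidden = tansig(W1 @ x + b1)
    return W2 @ hidden + b2

# Architecture I(11)-HL(14)-O(1); random, untrained weights for illustration.
rng = np.random.default_rng(2)
n_in, n_hidden = 11, 14
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(1, n_hidden))
b2 = np.zeros(1)

x_demo = rng.normal(size=n_in)   # one hypothetical descriptor vector
y_demo = mlp_forward(x_demo, W1, b1, W2, b2)
```

Training would adjust `W1`, `b1`, `W2` and `b2` to minimize the squared error between the network output and the experimental logβ11 values.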
2.3.4 Models Validation
The modeling methods minimize the sum of squared differences between the observed and predicted values, and this minimization yields the estimates of the model parameters. The models were screened using R²train for the training set, Q²LOO or Q²CV for the cross-validation set, R²test for the independent test of the ANN model only, and Q²test for the external validation set of all models [28, 30, 31]. All of these were calculated with the same formula:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \qquad (4)$$
where Yi, Ŷi and Ȳ are the experimental, calculated and average values, respectively.
The mean squared error (MSE) of the regression methods is defined by equation (5) [28, 30, 31]:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \qquad (5)$$
The ANN model is trained until the mean squared error (MSEANN) is minimized, and the network output is then compared with the actual output values obtained from experiment [38]. MSEANN is the average squared error between the network outputs (o) and the target outputs (t). It is written as follows [38]:
$$MSE_{ANN} = \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i)^2 \qquad (6)$$
To compare the quality of the models, we use the mean absolute relative error MARE (%), where ARE (%) is the absolute value of the relative error. They are calculated by the following expressions [28, 29]:
$$ARE_i\,(\%) = \frac{\left|\log\beta_{11,exp} - \log\beta_{11,cal}\right|}{\log\beta_{11,exp}} \times 100, \qquad MARE\,(\%) = \frac{1}{n} \sum_{i=1}^{n} ARE_i\,(\%) \qquad (7)$$
where n is the number of test substances, and β11,exp and β11,cal are the experimental and calculated stability constants.
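The validation statistics of this section can be computed in a few lines. The sketch below assumes the plain mean of squared residuals for MSE and applies the functions to the first three external-test rows of Table 1, taking the first predicted column as the QSPRMLR prediction (an assumption based on the order of models in the table caption). The function names are hypothetical.

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Equation (4): 1 - SS_res / SS_tot."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mse(y_obs, y_pred):
    """Mean of squared residuals (1/N normalisation assumed)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_obs - y_pred) ** 2)

def mare_percent(y_obs, y_pred):
    """Mean of the absolute relative errors, in percent."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    are = 100.0 * np.abs(y_obs - y_pred) / y_obs
    return are.mean()

# External test set, rows 1-3 of Table 1: experimental values and the
# first predicted column (assumed to be the QSPRMLR predictions).
y_exp = [7.080, 8.640, 8.240]
y_cal = [7.1140, 8.9132, 8.6043]

r2_demo = r_squared(y_exp, y_cal)
mse_demo = mse(y_exp, y_cal)
mare_demo = mare_percent(y_exp, y_cal)
print(f"R2 = {r2_demo:.4f}, MSE = {mse_demo:.4f}, MARE = {mare_demo:.2f} %")
```

Three rows are far too few for a meaningful statistic; the example only shows how each quantity is formed from paired experimental and calculated values.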
3 RESULTS AND DISCUSSION
3.1 Constructing QSPRMLR models
To build the QSPRMLR model, the data set was divided into a training set and a test set, with the test set making up 20 % of the data. The QSPRMLR models were constructed using the backward-elimination and forward regression techniques in Regress [39] and MS-EXCEL [28, 40, 41]. The QSPR models were cross-validated by the leave-one-out (LOO) method using the statistic Q²LOO.
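Leave-one-out cross-validation can be sketched as follows: each observation is removed in turn, the linear model is refitted on the rest, and the left-out value is predicted. The sketch assumes Q²LOO = 1 - PRESS / SS_tot with the mean of the full training set in SS_tot; the function name `q2_loo` and the synthetic data are hypothetical, not the procedure as run in Regress or MS-EXCEL.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out Q2 for a linear model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        # Refit without observation i, then predict it.
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot

# Synthetic illustration only (not the data of this study).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 2))
y_demo = 6.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.3, size=50)

# Training R2 from the fit on all observations, for comparison.
A_full = np.column_stack([np.ones(50), X_demo])
coef_full, *_ = np.linalg.lstsq(A_full, y_demo, rcond=None)
resid = y_demo - A_full @ coef_full
r2_demo = 1.0 - np.sum(resid ** 2) / np.sum((y_demo - y_demo.mean()) ** 2)
q2_demo = q2_loo(X_demo, y_demo)
```

For a least-squares model, Q²LOO never exceeds the training R², so a large gap between the two is a common sign of overfitting.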
The quality of the models was evaluated through statistical values such as R²train, Q²LOO, Fstat (Fisher's value) and MSE. A good calibration model has high R²train, Q²LOO and Fstat values and a low MSE value, with the smallest number of descriptors. The results of the QSPRMLR models are shown in Table 2. The number of descriptors k ranged from 1 to 12.
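As a consistency check on the reported statistics, Fstat can be recomputed from R²train, n and k. The sketch assumes the usual overall F statistic of a linear regression, F = (R²/k) / ((1 - R²)/(n - k - 1)), which is not stated explicitly in the text, and uses the first three models listed in Table 2.

```python
def f_statistic(r2, n, k):
    """Overall regression F statistic from R2, sample size n and k descriptors."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# (k, R2train, reported Fstat) for the first three models of Table 2, n = 50.
table2_rows = [
    (1, 0.2106, 12.8093),
    (2, 0.3421, 12.2220),
    (3, 0.4493, 12.5088),
]

recomputed = [f_statistic(r2, 50, k) for k, r2, _ in table2_rows]
for (k, r2, f_reported), f_calc in zip(table2_rows, recomputed):
    print(f"k = {k}: recomputed F = {f_calc:.3f}, reported F = {f_reported}")
```

The recomputed values agree with the reported ones to within the rounding of R²train, which supports the assumed formula.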
Table 2. Results of the QSPRMLR models (k = 1 to 12) with statistical values. The best model is in bold.
No QSPRMLR models
1 logβ11 = 10.9658 + 2.0345knopt
n = 50; k = 1; MSE = 1.7505; R²train = 0.2106; Q²LOO = 0.1526; Fstat = 12.8093
2 logβ11 = 6.1372 + 2.0769knopt + 0.2107SHBa
n = 50; k = 2; MSE = 1.6150; R²train = 0.3421; Q²LOO = 0.2696; Fstat = 12.2220
3 logβ11 = 16.2732 + 2.8514knopt + 0.2374SHBa + 1.2022HOMO
n = 50; k = 3; MSE = 1.4937; R²train = 0.4493; Q²LOO = 0.3635; Fstat = 12.5088
4 logβ11 = 16.2307 + 3.6618knopt + 0.2864SHBa + 1.3207HOMO + 0.3637xvpc4