
This paper is available online at http://stdb.hnue.edu.vn

A NEW HYBRID METHOD TO IMPROVE THE EFFECTIVENESS OF CANCER DATA CLASSIFICATION

Nguyen Thi Chinh (1), Dao Thi Minh (2), Le Xuan Ly (3) and Dang Xuan Tho (4)

(1) Gifted High School, Hanoi National University of Education
(2) Department of Natural Sciences, Thai Binh Teacher Training College
(3) School of Applied Mathematics and Informatics, Hanoi University of Science and Technology
(4) Faculty of Information Technology, Hanoi National University of Education

Abstract. Imbalanced data classification is one of the most difficult issues in the machine learning and data mining community. The problem becomes even harder on data sets with a large number of features, where many redundant features reduce the efficiency of the classification process. In particular, many biomedical data sets, such as those for cancer diagnosis, are both highly imbalanced and contain thousands of features, so finding a way to overcome these difficulties is extremely important and meaningful. In this paper, we present an overview of imbalanced data classification and the difficulties encountered by current approaches, and from this we propose a new method, SMOTE-PLS. To evaluate the effectiveness of the new method, we conducted experiments on standard cancer data sets from the UCI repository: breast-p, coil2000, leukemia, colon-cancer, and yeast. Empirical results show that the number of correctly classified minority samples improves significantly, which proves that the new method deals with imbalanced data with a large number of features more effectively than previous ones.

Keywords: data mining, SMOTE, imbalanced data classification, PLS.

1. Introduction

Data classification is a problem widely applied in practice; however, many problems involve imbalanced data, meaning that there is a huge difference in the number of samples between the two labels. Imbalanced data classification is one of the most difficult issues in the machine learning and data mining community. Class imbalance usually occurs in binary classification, where the samples of the class of interest occupy a very small proportion compared to the other class, and it greatly affects the efficiency of the classification model. Practical applications encounter this problem more and more, such as fraud detection, network intrusion detection, oil spill detection from satellite radar images, risk management, and medical diagnostics [1-3]. In particular, in medical databases the number of people with cancer accounts for a very small proportion of the whole population, yet diagnosing people with cancer, the minority, plays a very important role: misdiagnosing ill people as healthy seriously affects human health and life.

Received March 24, 2020. Revised May 4, 2020. Accepted May 11, 2020.

Contact: Nguyen Thi Chinh, e-mail address: chinh.nguyenthi@gmail.com


Regarding the data imbalance problem, many scientists have taken an interest and there are many studies on it. Currently, there are two main approaches to solving the imbalanced data classification problem: data-level and algorithm-level approaches. In the algorithm-level approach, standard classification algorithms are adjusted so that they remain highly effective when applied to an imbalanced data set. For example, with the SVM algorithm, it has been suggested to use different penalty constants for different classes or to adjust the class boundaries based on the idea of kernel alignment. The data-level approach adjusts the data distribution of the labels to reduce or eliminate the imbalance so that standard classification algorithms can be applied [4, 5]. There are many ways to adjust the data, such as generating synthetic samples for the minority class (randomly selecting samples to replicate, selecting samples according to some criterion, or creating entirely new synthetic samples), removing samples from the majority class (randomly or according to some criterion), or combining these methods. Among these approaches, creating new synthetic samples for the minority class has attracted many scientists and has given solid results. Several data adjustment algorithms based on this idea have been proposed, such as SMOTE [6], SPY [7], Random Border Oversampling and Random Safe Undersampling [8], and Tomek-link [9].

In addition to class imbalance, many data sets in biomedicine also have a large number of features, often thousands. However, many of these features are redundant and not useful for predicting minority class samples; in many cases they also reduce the efficiency of classification and slow down the experiments.

To solve both the imbalance problem and the huge number of features, we propose a new method, SMOTE-PLS. The new method combines a reduction in the number of features with the generation of synthetic samples in the minority class: it generates synthetic samples to balance the labels while removing redundant features, thereby improving the efficiency of cancer data classification.

2. Content

2.1. Framework

The general workflow of the method in this paper consists of four steps: (1) divide the data set into training data and test data (k-fold cross-validation); (2) apply synthetic sample generation with SMOTE; (3) apply data dimension reduction with PLS; (4) train supervised machine learning models to predict cancer. An illustration of the workflow is presented in Figure 1, and a code sketch of step 1 follows the figure.

Figure 1. A new hybrid method procedure (the data set is split into a training set and a testing set; SMOTE balances the training set, and PLS reduces it to produce the new training set).
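As a rough sketch of step 1 in R (the experiments in Section 2.6 were run in R; caret's createFolds is our assumed choice of splitting utility, and y stands for the label vector):

```r
# Step 1: split the data into 10 folds for cross-validation.
library(caret)

set.seed(1)                        # for reproducible fold assignment
folds <- createFolds(y, k = 10)    # y: the vector of class labels
# Each folds[[j]] holds the test indices of one fold; the remaining samples
# form the training set that SMOTE (step 2) and PLS (step 3) operate on.
```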

2.2. The Synthetic Minority Over-sampling Technique (SMOTE) method

The Synthetic Minority Over-sampling Technique (SMOTE) [6] was proposed by N. V. Chawla et al. The minority class is over-sampled by taking each minority sample and choosing its k nearest neighbors; depending on the amount of over-sampling required, neighbors are selected at random from among these k. A synthetic sample is created as follows: take the difference between the feature vector (sample) under consideration and one of its nearest neighbors, multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This generates a random point along the line segment between the two samples. The SMOTE algorithm has three input variables: the number of minority class samples (T), the ratio of synthetic samples to be generated (N%), and the number of nearest neighbors (k). For each set of input values, the algorithm generates synthetic samples for the new training set. Specifically, the SMOTE algorithm is as follows [6]:

Algorithm SMOTE(T, N, k)

Input: number of minority class samples T; ratio of synthetic samples to be generated N%; number of nearest neighbors k.

Output: (N / 100) * T synthetic samples (in addition to the T original ones).

1. If N < 100 then T = (N / 100) * T; N = 100.

2. If N >= 100 then for i = 1 to T:
- compute the k nearest minority samples of sample i and store them in the array nnarray;
- call the function Populate(N, i, nnarray).

3. Function Populate(N, i, nnarray)   (* creates the synthetic samples *)
- While N != 0:
+ pick a random index nn between 1 and k;
+ synthesize a new sample along the line segment joining sample i and neighbor nnarray[nn];
+ N = N - 1.
- End while.

4. Return.
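The following R function is a minimal sketch of this algorithm for the common case N >= 100 (our own illustrative implementation, not the authors' code):

```r
# SMOTE sketch: generate (N/100) synthetic samples per minority sample, each
# lying on the segment between the sample and a random one of its k nearest
# minority neighbors. X_min is the matrix of minority-class feature vectors.
smote_sketch <- function(X_min, N = 100, k = 5) {
  stopifnot(N >= 100)                    # the N < 100 branch is omitted here
  per_sample <- round(N / 100)
  D <- as.matrix(dist(X_min))            # pairwise distances within the minority
  synth <- NULL
  for (i in seq_len(nrow(X_min))) {
    nnarray <- order(D[i, ])[2:(k + 1)]  # k nearest neighbors, skipping i itself
    for (s in seq_len(per_sample)) {
      nn  <- X_min[sample(nnarray, 1), ] # a randomly chosen neighbor
      gap <- runif(1)                    # random number between 0 and 1
      synth <- rbind(synth, X_min[i, ] + gap * (nn - X_min[i, ]))
    }
  }
  synth                                  # (N/100) * T new minority samples
}
```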

Unlike previously known methods, the SMOTE algorithm does not remove samples from either the minority or the majority class; it only acts on the minority label to produce synthetic samples. Each synthetic sample is generated at random and is labeled as minority, thereby balancing the data between the two labels. SMOTE generates additional synthetic samples according to the algorithm presented above, and for each value of N the number of samples changes: the larger N is, the more synthetic samples are generated.

On the other hand, if N is too large, the data become imbalanced in the opposite direction (the majority class becomes a minority, and the minority class, enlarged by the added synthetic samples, becomes the majority), so the imbalance is still not resolved. Therefore, for each specific data set, the SMOTE algorithm works well only at certain values of the parameter N.

2.3. The Partial Least Squares (PLS) method

Currently, there are many methods for reducing the number of data dimensions, such as PCA and PLS. The idea of this group of methods is to create a new feature set that represents the old one: the new feature set carries the information of the old features but is many times smaller than the number of old features. This dimension reduction minimizes information loss and, in particular, does not change the nature of the data.

Principal Component Analysis (PCA) [10] belongs to the group of unsupervised learning methods in machine learning and artificial intelligence. The idea of PCA is to build a subspace of smaller dimension from the original multi-dimensional space by constructing new variables, each a linear combination of the original variables. The axes of the new space are thus combinations of the coordinate axes of the old space and are called the "principal components" of the new space. The data points of the original space are mapped onto the subspace so that the mean squared distance between the data points of the old space and their projections on the new space is minimal. This means that PCA builds a new space with fewer dimensions whose coordinate axes represent the data better than the old space, ensuring that the variance of the data along each axis is as large as possible.
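In R this corresponds directly to the standard prcomp function; a minimal sketch, with X standing for the feature matrix:

```r
# PCA: build orthogonal principal components, each capturing as much of the
# remaining variance as possible, and keep only the first few.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
Z   <- pca$x[, 1:4]   # the data projected onto the first four components
```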

The Partial Least Squares method (PLS) [11] is a group of techniques for building a model of the relationship between two multidimensional variables (a training data set and a set of labels), i.e., constructing a regression function between the dependent and independent variables of a regression problem, or a discriminant function that determines the class value taken by the dependent variable. PLS is a supervised learning method: when reducing the number of data dimensions, it relies on both the feature data set (X) and the information in the class label data set (Y). This ensures that the reduction is "oriented" by the available information gained from practical experience or through experiments.

The idea of PLS is to represent the class variable Y and the features X through the values of intermediate variables (hidden variables). The hidden variables are determined as linear combinations of the initial variables related to each other. As a result, the number of variables decreases considerably compared to the initial number, which eliminates subjective errors in selecting the variables that participate in the problem. The choice of the number of hidden variables depends on how many dimensions the user wants the observed objects to be expressed in. PLS is therefore mainly used to reduce the number of data dimensions of a feature set.

In other words, PLS constructs a new space many times smaller than the original one, whose coordinate system is orthogonal. PLS finds the score vectors of the new space by solving a maximum-covariance problem between the variables; that is, the problem reduces to an eigenvalue problem from which the eigenvectors are determined (presented in detail in the NIPALS algorithm). The number of eigenvectors retained is the number of dimensions to use, chosen according to the size of the corresponding eigenvalues.

PLS is particularly effective at reducing the number of data dimensions compared to traditional methods such as PCA, so it is widely applied to biomedical data with a large number of features. The data reduced by PLS are more reliable than those reduced by PCA because PLS uses both the feature information and the class labels, which keeps the adjusted data "oriented" in accordance with the actual values collected.
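A small R sketch of such a supervised reduction, using the pls package as an assumed implementation choice (the paper does not name its PLS library):

```r
library(pls)  # plsr fits PLS regression via a NIPALS-type algorithm

# Supervised reduction: the latent variables are built from both the feature
# matrix X and the label vector y (coded numerically, e.g. 0/1).
pls_reduce <- function(X, y, n_comp = 4) {
  fit <- plsr(y ~ X, ncomp = n_comp)
  list(scores     = unclass(fit$scores),  # the low-dimensional training data
       projection = fit$projection,       # maps centered data to latent space
       center     = colMeans(X))
}

# Project new (test) data into the same latent space.
pls_project <- function(red, X_new) {
  sweep(X_new, 2, red$center) %*% red$projection
}
```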

2.4. A new hybrid method

SMOTE is one of the typical methods for increasing classification efficiency by generating additional synthetic samples of the minority class, but for the same reason it increases the size of the data set by the number of synthetic samples added. On the other hand, data in real applications often have a very large number of features. This leads to long classification times, and the classification of such data sets may no longer be accurate, or its accuracy will not be high.

To counter this growth in data set size and to reduce the number of features, we came up with the idea of combining two algorithms: generating more minority samples and reducing the number of data dimensions with the PLS method. The benefit of this combination is that the data set does not grow, or stays roughly at the size of the original data set, the running time of the classifier is greatly reduced, and, more importantly, combining sample generation with dimension reduction yields a higher and more efficient classification result than either method alone.

The combination of the SMOTE and PLS methods unites the advantages of the two algorithms: increasing the samples of the minority label while reducing the number of data dimensions improves the accuracy of imbalanced data classification and shortens the running time on the data sets.
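Using the two sketches above, the hybrid step can be illustrated in a few lines of R (assuming the minority class is coded 1 and N = 200, i.e. two synthetic samples per minority sample):

```r
# Hybrid step: oversample the minority with SMOTE, then reduce all features
# with PLS to obtain the new, balanced, low-dimensional training set.
X_syn <- smote_sketch(X_train[y_train == 1, ], N = 200, k = 5)
X_aug <- rbind(X_train, X_syn)
y_aug <- c(y_train, rep(1, nrow(X_syn)))   # synthetic rows keep the minority label
red   <- pls_reduce(X_aug, y_aug, n_comp = 4)
```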

2.5. Evaluation criteria

In the two-class case, the class with very few training samples but greater importance is called the positive class; the other class, which is popular but has much less meaning and importance, is called the negative class. During the classification process, samples fall into four groups, denoted as in the following confusion matrix:

Table 1. Confusion matrix

                     Predicted Positive    Predicted Negative
Actual Positive      TP (true positive)    FN (false negative)
Actual Negative      FP (false positive)   TN (true negative)

Several evaluation criteria are based on the confusion matrix. G-mean is a measure used to evaluate the efficiency of data classification across the two labels [4, 5, 7]: it is the geometric mean of the true positive rate and the true negative rate,

G-mean = sqrt( (TP / (TP + FN)) * (TN / (TN + FP)) ).
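In R, the measure can be computed directly from the four confusion matrix counts:

```r
# G-mean: geometric mean of the per-class recalls, so a classifier scores
# well only if it performs well on BOTH the minority and the majority class.
g_mean <- function(tp, fn, fp, tn) {
  sensitivity <- tp / (tp + fn)   # recall on the positive (minority) class
  specificity <- tn / (tn + fp)   # recall on the negative (majority) class
  sqrt(sensitivity * specificity)
}
```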

2.6. Experiments

2.6.1. Datasets

To evaluate the effectiveness of the combined method, we implemented and ran the programs in the R and Perl languages and tested them on five imbalanced cancer data sets from UCI (University of California, Irvine) [12]: breast-p, coil2000, leukemia, colon-cancer, and yeast.

Table 2. Cancer imbalanced data sets (columns: Dataset, Number of samples, Number of features, Imbalance rate; the row values did not survive extraction)


2.6.2. Results

We first ran 10-fold cross-validation, dividing each of the above data sets into 10 roughly equal random parts; of those 10 parts, one is used as test data and the others as training data. We then prepared the data in 7 different ways in order to compare the G-mean results and p-values: original data, SMOTE, PCA, PLS, RSO-RBO [8], SMOTE-PCA, and SMOTE-PLS. The final comparison result is the average G-mean over 20 runs of 10-fold cross-validation. For each data set, the PLS, PCA, and SMOTE methods take different input parameters. For the PLS method, the input parameter (the number of hidden variables) is chosen from the root mean squared error of prediction (RMSEP): the number of hidden variables is considered sufficient for the PLS model once the RMSEP levels off, for example at 4 hidden variables, and the choice is similar for PCA. For the SMOTE method, the input parameter is N; each value of N generates a different number of synthetic samples. For example, if N = 100 and the minority class has 50 samples, SMOTE creates 50 new synthetic minority samples. In this experiment, N was chosen to balance the ratio between the minority and the majority class.
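Both parameter choices can be sketched in R (variable names such as X_train, y_train, n_minority, and n_majority are illustrative):

```r
library(pls)
# PLS/PCA: pick the number of hidden variables at the point where the
# cross-validated RMSEP curve levels off (e.g. around 4 components).
fit <- plsr(y_train ~ X_train, ncomp = 10, validation = "CV")
plot(RMSEP(fit))   # read the elbow of the error curve off this plot

# SMOTE: choose N so that the oversampled minority roughly matches the majority.
N <- 100 * round((n_majority - n_minority) / n_minority)
```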

To see the effectiveness of the proposed method (SMOTE-PLS), we compared the results based on G-mean values. Each of the figures below contains a graph of the G-mean value of one data set under the 7 methods: original, SMOTE, PCA, PLS, RSO-RBO [8], SMOTE-PCA, and SMOTE-PLS.

With the breast-p data set, the G-mean value of the SMOTE-PLS method was 68.45%, higher than SMOTE-PCA (62.88%) and much higher than the original data (32.99%).

Figure 2. G-mean values of the breast-p data set.

With the coil2000 data set, the G-mean value of the SMOTE-PLS method is 29.72%, much higher than SMOTE-PCA (4.12%), the original data (0), and SMOTE (11.86%).

Figure 3. G-mean values of the coil2000 data set.


In the leukemia data set, the G-mean value of the SMOTE-PLS method was 94.76%, higher than PLS (92.04%) and SMOTE-PCA (90.08%), and much higher than the original data (75.07%).

Figure 4. G-mean values of the leukemia data set.

Similarly, in the colon-cancer data set, the G-mean value of the SMOTE-PLS method was 87.12%, higher than the original data (86.46%) and higher than SMOTE-PCA and RSO-RBO.

Figure 5. G-mean values of the colon-cancer data set.

In addition, in the yeast data set, the G-mean value of the SMOTE-PLS method was 76.39%, much higher than the original data (18.85%) and higher than SMOTE-PCA and RSO-RBO.

Figure 6. G-mean values of the yeast data set.


From the above results, we can see that our combined method (SMOTE-PLS) is more effective than the other methods. To evaluate whether the improvement of the combined method is statistically significant, we apply a t-test: if the p-value of this test is less than or equal to 0.05, we say that the two mean values are different and the difference is statistically significant. In this paper, we use the t.test function in the stats package of R to calculate the p-values.
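For example, given two vectors holding the per-run average G-means of two methods over the 20 runs (hypothetical variable names), the test is a single call:

```r
# Two-sample t-test on G-mean values from 20 runs of 10-fold cross-validation.
res <- t.test(gmean_smote_pls, gmean_original)
res$p.value   # the difference is considered significant when <= 0.05
```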

Table 3. P-values comparing G-mean values (each row compares the original data of one data set against the other methods; the method column headers did not survive extraction)

Breast-p      Original   x   <2.20E-16   5.02E-05    <2.20E-16
Leukemia      Original   x   2.09E-12    <2.20E-16   <2.20E-16
Colon-cancer  Original   (row values did not survive extraction)
Yeast         Original   x   <2.20E-16   2.04E-06    <2.20E-16

Based on the results of the G-mean calculation, we evaluated the statistical significance on the above data sets. Because the G-mean value of the SMOTE-PLS combination method is much higher than those of the PCA and SMOTE-PCA methods on every data set, we only computed p-values comparing the methods: original, PLS, and SMOTE. Some of the p-values are shown in Table 3; they show that the proposed method combining dimension reduction with the generation of additional samples (SMOTE-PLS) is statistically significant.

3. Conclusions

In this paper, we studied a method combining dimensionality reduction with synthetic sample generation (SMOTE-PLS) in order to improve the accuracy of imbalanced data classification. The results show that with our method the accuracy of minority class classification is significantly improved. Based on the results of the proposed method (SMOTE-PLS), it is possible to explore large-scale databases, improve the efficiency of the computations, and increase the accuracy of the data classification results.

In future work, we will continue to explore and study data imbalance. We will study methods that combine PLS with the removal of redundant features in order to further improve classification efficiency.


REFERENCES

[1] G. Chen, Y. Li, G. Sun, and Y. Zhang, 2017. Application of Deep Networks to Oil Spill Detection Using Polarimetric Synthetic Aperture Radar Images. Appl. Sci., Vol. 7, No. 10, p. 968.

[2] M. Khalilia, S. Chakraborty, and M. Popescu, 2011. Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inform. Decis. Mak., Vol. 11, No. 1, p. 51.

[3] M. Zeng, B. Zou, F. Wei, X. Liu, and L. Wang, 2016. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. IEEE International Conference of Online Analysis and Computing Science (ICOACS), pp. 225-228.

[4] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, 2017. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl., Vol. 73, pp. 220-239.

[5] Y. Sun, A. K. C. Wong, and M. S. Kamel, 2009. Classification of Imbalanced Data: A Review. Int. J. Pattern Recognit., Vol. 23, No. 4, pp. 687-719.

[6] N. V. Chawla, K. W. Bowyer, and L. O. Hall, 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res., Vol. 16, pp. 321-357.

[7] X. T. Dang, D. H. Tran, O. Hirose, and K. Satou, 2015. SPY: A Novel Resampling Method for Improving Classification Performance in Imbalanced Data. Seventh International Conference on Knowledge and Systems Engineering (KSE), pp. 280-285.

[8] K. Q. Huong, D. T. Hien, and D. X. Tho, 2015. The new method is based on boundary and safety zone to improve the efficiency of data classification unbalance. Journal of Science of HNUE, Vol. 60, No. 7A, pp. 103-111.

[9] D. Devi, S. K. Biswas, and B. Purkayastha, 2017. Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance. Pattern Recognit. Lett., Vol. 93, pp. 3-12.

[10] H. Zou, T. Hastie, and R. Tibshirani, 2006. Sparse principal component analysis. J. Comput. Graph. Stat., Vol. 15, No. 2, pp. 265-286.

[11] C. Nitzl, J. L. Roldan, and G. Cepeda, 2016. Mediation analysis in partial least squares path modeling. Ind. Manag. Data Syst.

[12] D. Dheeru and E. K. Taniskidou, 2017. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Sciences.
