This paper is available online at http://stdb.hnue.edu.vn
A NEW METHOD BASED ON CLUSTERING IMPROVES THE EFFICIENCY
OF IMBALANCED DATA CLASSIFICATION
Nguyen Thi Hong and Dang Xuan Tho
Faculty of Information Technology, Hanoi National University of Education
Abstract. Imbalanced data classification is an important problem in practice and has become a new research direction for many researchers. In medical diagnosis in particular, ill people account for only a very small percentage of the total population, so failing to detect them causes serious consequences and can even cost lives. Classification under class imbalance therefore demands high accuracy, and data preprocessing is a common solution with good results. This paper introduces some approaches to imbalanced data classification and proposes a new method based on clustering. We implemented this method and ran experiments on UCI international datasets: Blood, Glass, Haberman, Heart, Pima, and Yeast. For example, on the Yeast data the G-mean of the original data is 21.41%, but after applying the new method it increases to 83.06%. The experimental results show that the new method increases classification efficiency significantly.
Keywords: imbalanced data classification, data mining, clustering-based undersampling.
1. Introduction
Many classification algorithms have been published, such as k-nearest neighbors, decision trees, Naïve Bayes, and support vector machines. These standard algorithms apply to balanced classification problems and have been validated experimentally. However, applying them to data with a large disparity in the number of samples between classes is not effective [1-3]. Therefore, new approaches are needed for the case of data imbalance.
Data imbalance is the case where the data have a significant difference in the number of samples between classes. The class with many samples is called the majority class; the class with few samples is called the minority class. Because the majority class overwhelms the minority class in sample count, the efficiency of the classification process is significantly reduced. For example, the Mammography dataset consists of 11,183 data samples, of which 10,923 are labeled "negative" (no cancer) and 260 are labeled "positive" (cancer). A classification model with only 10% accuracy on the minority class means that 234 minority samples are misclassified into the majority class, resulting in 234 people with cancer being diagnosed as cancer-free [4, 5]. Clearly, misclassifying such patients has far more serious consequences than misclassifying non-cancer cases as cancer. Therefore, the problem of imbalanced
Received March 20, 2020. Revised May 5, 2020. Accepted May 12, 2020.
Contact: Nguyen Thi Hong, e-mail address: nguyenhong@hnue.edu.vn
data classification has important practical applications and attracts many scientists in the field of data mining.
In this paper, in order to increase the accuracy of the prediction model for the imbalanced data classification problem, we propose a new cluster-based sampling method. In tests on a number of datasets, we achieved notable improvements compared to using no data balancing strategy and compared to a previous method.
2. Content
2.1. Related works
The problem of imbalanced class distribution concerns many researchers, and many scientific papers have been published on it. In general, there are two main approaches: the algorithm level and the data level. At the algorithm level, authors propose new algorithms or improve existing classification methods to handle data with an imbalanced class distribution, for example by using different decision thresholds or one-class methods. Specific examples include cost-sensitive learning algorithms that add weight to the minority class [6], adjusting the predictive probability at the leaves of decision trees [7], and adding different penalty constants for each class or adjusting the class boundary in support vector machine algorithms [8].
On the other hand, data-level methods preprocess the data to rebalance the training set with an over-sampling or under-sampling strategy. Over-sampling approaches increase the number of minority class samples, while under-sampling approaches reduce the number of majority class samples. Several common methods are used, such as SPY [9], SMOTE [10], Borderline-SMOTE [11], Safe-Level-SMOTE [12], Add-Border-SMOTE [13], and BAM [14]. Specifically, in 2017, Wei-Chao Lin et al. [15] proposed a clustering-based undersampling method that reduces the number of majority class samples by clustering. As shown in Figure 1, the number of clusters is first set equal to the number of minority class samples. Then the cluster centers are found by the k-means clustering algorithm on the majority class samples, and these cluster centers replace all majority class samples, so that the majority and minority classes contain the same number of samples. Empirical results on real data have shown that this method is highly effective; however, in some cases we found that it not only fails to improve classification efficiency but even reduces accuracy. In the next section, we analyze a major drawback of this method and thereby propose an improvement.
Figure 1. Clustering-based undersampling procedure
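The procedure in Figure 1 can be sketched as follows. This is a minimal illustration assuming scikit-learn's KMeans; the function name `cluster_undersample` and the synthetic data are ours, not from the paper.

```python
# Sketch of clustering-based undersampling (Lin et al., 2017):
# replace the majority class by k = |minority| cluster centers.
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, X_min, random_state=0):
    """Return a balanced set: k-means centers of the majority + all minority."""
    k = len(X_min)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_maj)
    X_maj_reduced = km.cluster_centers_          # k synthetic majority samples
    X = np.vstack([X_maj_reduced, X_min])
    y = np.array([0] * k + [1] * len(X_min))     # 0 = majority, 1 = minority
    return X, y

rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))      # 200 majority samples
X_min = rng.normal(3.0, 1.0, size=(20, 2))       # 20 minority samples
X_bal, y_bal = cluster_undersample(X_maj, X_min)
print(X_bal.shape)                               # prints (40, 2)
```

After this step both classes contribute 20 samples, so a standard classifier can be trained on `(X_bal, y_bal)` without further rebalancing.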
2.2. Main Drawback of Clustering-based undersampling
To illustrate this approach, Figure 2(a) shows the majority class samples clustered by the k-means algorithm, with the number of clusters set to the number of minority class samples (i.e., k = N). Figure 2(b) shows how these k cluster centers replace the entire majority class dataset. With this method, all majority samples of a cluster are represented by their cluster center, which is then used for learning. It is therefore important to select cluster centers that represent the samples in their cluster well. As we know, the k-means algorithm updates the cluster center as the average of all samples in the cluster:

$$o_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$$

where $x_j$ is a data sample, $C_i$ is a cluster (a set of data samples), and $o_i$ is the center of cluster $C_i$. However, some clusters contain noise samples that distort the cluster center, as shown in Figure 2(c). Therefore, in order to address this drawback and improve the classification accuracy of the clustering-based undersampling method, we focus on choosing a better cluster center that represents the samples in the cluster. This idea is illustrated in Figure 3; the new method, which we term Priority-Center (PC), is presented in more detail in the next section.
Figure 2. Main Drawback of Clustering-based undersampling
2.3. The Priority-Center method
In order to overcome the drawback of clustering-based undersampling described previously, we focus on choosing a good cluster center that represents the samples in the cluster. The ideal cluster center should lie close to the densely populated regions of the cluster; samples located in high-density areas therefore receive a higher weight than the other samples. Based on that idea, we propose a new method, Priority-Center.
Figure 3. The idea of Priority-Center
Firstly, for each sample in the cluster we count its number of neighbors ($n_j$) within a radius $r$. Then, because the cluster center should represent the important samples in the cluster, we calculate the cluster center from the key samples ($x'_j$) only, not from all samples in the cluster. The priority order of the key samples in a cluster is determined by the value of $n_j$ of each sample. Finally, the new center of the cluster is updated by the following formula:

$$o_i = \frac{\sum_{x'_j \in C_i} n_j \, x'_j}{\sum_{x'_j \in C_i} n_j}$$

where $x'_j$ are the key samples, $n_j$ is the number of neighbors of $x'_j$ within radius $r$, and $o_i$ is the new cluster center (the center of cluster $C_i$). By applying this new cluster center formula, we obtain a better cluster center, as shown in Figure 3(b).
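The Priority-Center update for a single cluster can be sketched as below. This is a minimal illustration: the radius `r` and the fraction of samples treated as "key" are our own illustrative choices, since the paper leaves them as parameters.

```python
# Sketch of the Priority-Center update for one cluster: weight samples
# by their neighbor count within radius r and average only the key samples.
import numpy as np

def priority_center(cluster, r, key_fraction=0.5):
    """Weighted cluster center using only high-density ("key") samples."""
    # n_j: number of neighbors of each sample within radius r
    dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    n = (dists <= r).sum(axis=1) - 1                 # exclude the sample itself
    # key samples: highest neighbor counts first
    order = np.argsort(-n, kind="stable")
    m = max(1, int(np.ceil(key_fraction * len(cluster))))
    key_idx = order[:m]
    weights = n[key_idx].astype(float)
    if weights.sum() == 0:                           # degenerate cluster: plain mean
        return cluster[key_idx].mean(axis=0)
    # o_i = sum(n_j * x'_j) / sum(n_j) over the key samples only
    return (weights[:, None] * cluster[key_idx]).sum(axis=0) / weights.sum()

# A dense core near (0, 0) plus one distant noise point:
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(np.round(priority_center(pts, r=0.5), 3))      # prints [0.033 0.033]
```

The plain k-means mean of these five points would be pulled toward (1.04, 1.04) by the noise point; the priority center stays near the dense core.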
2.4. Experiments
2.4.1. Evaluation metrics
Table 1. Confusion matrix

                     Predicted Positive    Predicted Negative
Actual Positive      TP                    FN
Actual Negative      FP                    TN
The confusion matrix in Table 1 is used to evaluate the performance of binary classification methods. Labeled data are classified to obtain the values TP, FN, FP, and TN. The effectiveness of classification on balanced data is assessed by measures such as Recall, Precision, and Accuracy, given by the following formulas:

$$\text{Recall} = \frac{TP}{TP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

For datasets with a high imbalance rate, if most minority class elements are misclassified while most majority class elements are correctly classified, the accuracy may still be high. Therefore, these measures are no longer suitable for imbalanced datasets. The G-mean measure was proposed to balance the accuracy between the majority and minority classes [7, 16, 17]:

$$\text{G-mean} = \sqrt{TP_{rate} \times TN_{rate}}$$

This value is high only if both $TP_{rate}$ and $TN_{rate}$ are high.
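These measures can be computed directly from the confusion-matrix counts. The counts below are an illustrative imbalanced example of our own, not results from the paper; they show why accuracy is misleading while G-mean is not.

```python
# Recall, Precision, Accuracy, and G-mean from confusion-matrix counts.
import math

def scores(tp, fn, fp, tn):
    recall    = tp / (tp + fn)                  # TP rate
    precision = tp / (tp + fp)
    accuracy  = (tp + tn) / (tp + fn + fp + tn)
    tn_rate   = tn / (tn + fp)
    g_mean    = math.sqrt(recall * tn_rate)     # balances both class rates
    return recall, precision, accuracy, g_mean

# Highly imbalanced example: 26 positives, 1092 negatives,
# with most positives misclassified.
rec, prec, acc, gm = scores(tp=3, fn=23, fp=10, tn=1082)
print(round(acc, 3), round(gm, 3))  # accuracy looks high, G-mean does not
```

Here accuracy is about 0.97 even though only 3 of 26 positives are found, while the G-mean of about 0.34 exposes the poor minority-class performance.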
In this paper, we use the G-mean value to evaluate the classification efficiency of the datasets after adjusting the imbalance ratio.
2.4.2. Experimental method and datasets
To evaluate the effectiveness of the proposed method, we used six imbalanced datasets from the UCI international standard data repository: Blood, Glass, Haberman, Heart, Pima, and Yeast. Among them, Haberman, Heart, and Pima concern breast cancer, heart disease, and diabetes, respectively. Blood contains information about whether Taiwanese donors donated blood in March 2007. The Yeast dataset concerns protein localization sites in yeast. The Glass dataset records whether or not a glass sample is a "float" type. All of these datasets have an imbalance between the number of elements in the majority and minority classes. In particular, Glass (1:6) and Yeast (1:28) have the highest imbalance rates. Classifying these datasets with standard algorithms such as SVM, KNN, and Naïve Bayes is inefficient. Therefore, we adjust these datasets with the proposed method to reduce the imbalance. Information about the datasets is given in Table 2.
Table 2. UCI imbalanced datasets

Dataset | Number of elements | Number of Minority elements | Number of Majority elements | Percentage of Minority class
We conducted experiments using 20 runs of 10-fold cross-validation. In each run, the dataset is divided into 10 parts; each part in turn is used as the testing data and the remaining 9 as training data. The values of the confusion matrix are calculated from the testing data classification results, and the G-mean averaged over the 20 runs of 10-fold is used to evaluate the performance of the classification.
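This protocol can be sketched with scikit-learn's RepeatedStratifiedKFold; the dataset here is a synthetic placeholder and the classifier step is left as a comment.

```python
# Sketch of the 20 x 10-fold evaluation protocol on a placeholder dataset.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 80 + [1] * 20)                 # imbalanced labels

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
n_folds = 0
for train_idx, test_idx in cv.split(X, y):
    # Resample ONLY the training part (CU or PC), fit a classifier,
    # then accumulate the confusion-matrix counts on test_idx.
    n_folds += 1
print(n_folds)                                    # prints 200
```

Resampling must happen inside the loop, on the training folds only, so that the testing data keeps its original imbalanced distribution.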
In this experiment, we adjusted the original datasets with the clustering-based undersampling (CU) and Priority-Center (PC) methods, then classified the data with the SVM, Random Forest, Naïve Bayes, and k-nearest neighbor (KNN) algorithms using the R packages kernlab, randomForest, e1071, and class.
2.4.3. Results
The G-mean values in Figure 4 show the classification efficiency of the original datasets and of the datasets adjusted by clustering-based undersampling and Priority-Center. The proposed method is better on most datasets when classified by the SVM, Random Forest, and KNN algorithms, with the exception of Naïve Bayes. In particular, on the Haberman dataset the G-mean values of the PC method increase significantly, by 5-8%, in comparison to the CU method. There are slight increases of 1-4% in classification efficiency on the other datasets. After applying the undersampling methods, the number of elements in the datasets decreases sharply, so the efficiency of classification by the Naïve Bayes algorithm improves only insignificantly. Information on the number of elements of the datasets before and after undersampling is given in Table 3.
Table 3. UCI imbalanced datasets after adjusting

Dataset | Number of Minority elements after adjusting | Number of Majority elements after adjusting | Number of elements after adjusting
Figure 4. Comparison of the G-mean values obtained when classifying the original datasets and the datasets adjusted by CU and PC
Figure 5 shows 3D illustrations of the Haberman, Heart, and Yeast datasets before and after adjustment by the clustering-based undersampling and Priority-Center algorithms. The Heart dataset changes little after adjustment because the numbers of majority and minority class elements do not differ much (120:150), so its classification efficiency is hardly improved. The Yeast dataset has a high imbalance ratio (51:1433); after adjustment by PC only 102 elements remain (51:51). Therefore, the efficiency of classification on this dataset with the Naïve Bayes algorithm decreases significantly. However, its classification efficiency improves significantly with both the Random Forest and SVM algorithms because the adjusted dataset has a less overlapping distribution.
Figure 5. 3D illustration of the Haberman, Heart and Yeast datasets before and after being adjusted by CU and PC
Table 4 shows the P-values calculated with the t.test function in R when comparing the G-mean values in Figure 4. Most of the P-values are less than 5%, so the comparisons of the G-mean values between the methods are statistically significant.
Table 4. P-values comparing G-mean values (PC vs. the original data and PC vs. CU, for each classifier)

Dataset    | SVM (Origin / CU)       | Random Forest (Origin / CU) | Naïve Bayes (Origin / CU) | KNN (Origin / CU)
Blood      | < 2.2e-16 / 0.0301      | < 2.2e-16 / 0.0329          | < 2.2e-16 / 0.0007        | < 2.2e-16 / 0.0647
Glass      | < 2.2e-16 / 2.70e-09    | 1.12e-05 / 0.3913           | 3.40e-08 / 0.00017        | 6.16e-11 / 0.000767
Haberman   | < 2.2e-16 / 3.92e-14    | < 2.2e-16 / 0.07081         | < 2.2e-16 / 4.15e-11      | < 2.2e-16 / 2.08e-08
Heart      | 0.05397 / 0.156         | 0.202 / 0.4161              | 7.23e-04 / 0.176          | 0.003911 / 0.465
Pima       | < 2.2e-16 / 4.57e-03    | 3.76e-16 / 0.02276          | < 2.2e-16 / 0.422         | < 2.2e-16 / 0.490
Yeast      | < 2.2e-16 / 4.33e-04    | 3.76e-16 / 0.02276          | < 2.2e-16 / 1.34e-06      | < 2.2e-16 / 0.490
3. Conclusions
In this study, we presented a new algorithm based on clustering to improve data classification efficiency. By replacing the majority class samples with the center of each cluster, the method makes it possible to explore large databases, improves computational efficiency, reduces the level of data imbalance, and thereby increases the accuracy of the classification results.
Based on this research and the results achieved, several issues need further study. Specifically, we will study combining noise removal with other methods such as Safe-Level-SMOTE, Borderline-SMOTE, and Add-Border-SMOTE, or further develop the Priority-Center algorithm to achieve higher efficiency in solving class imbalance problems.
REFERENCES
[1] M. Zareapoor, 2015. Application of Credit Card Fraud Detection: Based on Bagging Ensemble Classifier. Int. Conf. Intell. Comput. Commun. Converg., Vol. 48, No. 12, pp. 679-686.
[2] W. Xiao, J. Zhang, Y. Li, S. Zhang, and W. Yang, Oct. 2017. Class-specific cost regulation extreme learning machine for imbalanced classification. Neurocomputing, Vol. 261, pp. 70-82.
[3] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, 2017. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl., Vol. 73, pp. 220-239.
[4] X. Yuan, L. Xie, and M. Abouelenien, 2018. A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data. Pattern Recognit., Vol. 77, pp. 160-172.
[5] M. Arafat, A. Qusef, and G. Sammour, 2019. Detection of Wangiri Telecommunication Fraud Using Ensemble Learning. IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), pp. 330-335.
[6] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, 2015. A novel ensemble method for classifying imbalanced data. Pattern Recognit., Vol. 48, No. 5, pp. 1623-1637.
[7] C. X. Ling, Q. Yang, J. Wang, and S. Zhang, 2004. Decision Trees with Minimal Costs. Proc. 21st Int. Conf. Mach. Learn., p. 69.
[8] Y. Zhang, P. Fu, W. Liu, and G. Chen, 2014. Imbalanced data classification based on scaling kernel-based support vector machine. Neural Comput. Appl., Vol. 25, No. 3-4, pp. 927-935.
[9] X. T. Dang, D. H. Tran, O. Hirose, and K. Satou, 2019. SPY: A Novel Resampling Method for Improving Classification Performance in Imbalanced Data. Seventh International Conference on Knowledge and Systems Engineering (KSE), pp. 280-285.
[10] N. V. Chawla, K. W. Bowyer, and L. O. Hall, 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res., Vol. 16.
[11] H. Han, W. Wang, and B. Mao, 2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Lect. Notes Comput. Sci., Vol. 3644, pp. 878-887.
[12] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, 2009. Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling Technique. Lect. Notes Comput. Sci., Vol. 5476, pp. 475-482.
[13] X. T. Dang et al., 2013. A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data. Chem-Bio Informatics J., Vol. 13, pp. 19-29.
[14] H. Nguyen Thi and T. Dang Xuan, 2019. BAM: border adjustment method improve the efficiency of imbalanced biological data classification. HNUE J. Sci., Vol. 64, No. 6, pp. 173-182.
[15] W. C. Lin, C. F. Tsai, Y. H. Hu, and J. S. Jhang, 2017. Clustering-based undersampling in class-imbalanced data. Inf. Sci., Vol. 409-410, pp. 17-26.
[16] C.-M. Vong, J. Du, C.-M. Wong, and J.-W. Cao, 2018. Postboosting using extended G-Mean for online sequential multiclass imbalance learning. IEEE Trans. Neural Networks Learn. Syst., Vol. 29, No. 12, pp. 6163-6177.
[17] H. Guo, H. Liu, C. Wu, W. Zhi, Y. Xiao, and W. She, 2016. Logistic discrimination based on G-mean and F-measure for imbalanced problem. J. Intell. Fuzzy Syst., Vol. 31, No. 3, pp. 1155-1166.