ACADEMY OF MILITARY SCIENCE AND TECHNOLOGY
Scientific supervisors:
1. Assoc. Prof. Dr. Le Ba Dung
2. Dr. Nguyen Doan Cuong
Reviewer 1: Assoc. Prof. Dr. Bui Thu Lam
Military Technical Academy
Reviewer 2: Assoc. Prof. Phung Trung Nghia
Thai Nguyen University
Reviewer 3: Dr. Nguyen Do Van
Academy of Military Science and Technology
The thesis was defended before the Doctoral Evaluation Council at Academy level, held at the Academy of Military Science and Technology on ……., 2019.
The thesis can be found at:
- The library of Academy of Military Science and Technology
- Vietnam National Library
INTRODUCTION
1. The necessity of the thesis
Fuzzy semi-supervised clustering is an extension of fuzzy clustering that uses prior knowledge to increase the quality of clusters. This prior information, also known as additional information, is intended to guide, monitor and control the clustering process.
The fuzzy min-max neural network (FMNN) model proposed by Patrick K. Simpson combines the advantages of fuzzy logic, artificial neural networks and fuzzy min-max theory to solve classification and clustering problems. FMNN is an incremental learning model based on fuzzy hyperboxes, with the ability to process large data sets.
Liver disease diagnosis based on data from liver enzyme test results can be formulated as a pattern recognition problem, and the use of FMNN is considered an effective approach. One reason FMNN is used in disease diagnostic support is its ability to generate very simple if-then decision rules: each FMNN hyperbox is transformed into a rule described by the min and max values of the data attributes. However, FMNN itself still has many shortcomings that limit its practical application. The main research on FMNN focuses on major directions such as improving the network structure, optimizing parameters, pruning, reducing the number of hyperboxes in the network, improving the learning method, or incorporating other methods to improve quality.
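The hyperbox-to-rule transformation described above can be sketched as follows. This is a minimal illustration assuming each hyperbox is stored as its min point, max point and class label; the function name and output format are hypothetical, not from the thesis:

```python
def extract_rules(hyperboxes):
    """Turn each hyperbox (v, w, class_label) into a fuzzy if-then rule.

    v and w are the min and max points of the hyperbox; each rule
    quantifies the [min, max] range of every data attribute.
    """
    rules = []
    for v, w, label in hyperboxes:
        conditions = [
            "%.2f <= x%d <= %.2f" % (vi, i + 1, wi)
            for i, (vi, wi) in enumerate(zip(v, w))
        ]
        rules.append("IF " + " AND ".join(conditions) + " THEN class = %s" % label)
    return rules

# Example: one 2-dimensional hyperbox with a hypothetical class label
boxes = [([0.10, 0.25], [0.40, 0.60], "hepatitis")]
print(extract_rules(boxes)[0])
# IF 0.10 <= x1 <= 0.40 AND 0.25 <= x2 <= 0.60 THEN class = hepatitis
```

The simplicity of this mapping, one readable rule per hyperbox, is what makes FMNN attractive for diagnostic support.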
Based on research into FMNN's development, and in order to improve its efficiency, the thesis focuses on proposing and improving FMNN methodology through semi-supervised learning. In the new methods presented in the thesis, additional information is defined as labels assigned to part of the data to guide and monitor the clustering process. This is a new approach that earlier methods have not addressed.
2. Objectives of the research
1) Develop an advanced fuzzy semi-supervised clustering algorithm based on label spreading, where the additional information is a small percentage of labeled samples.
2) Propose a novel combined semi-supervised clustering model that automatically defines the additional information; in our research, part of the samples of the fuzzy semi-supervised clustering algorithm are labeled.
3) Develop a fuzzy clustering algorithm that takes the distribution of the data into account.
4) Apply the fuzzy min-max neural network to the generation of fuzzy if-then decision rules in the design of a liver disease diagnostic support system, using data from liver enzyme test results.
3. Objects and scope of the research
The thesis focuses on the following issues:
- An overview of the fuzzy min-max neural network and its variations.
- Analysis of its limitations and of the solutions used by researchers to overcome them.
- Application of the fuzzy min-max neural network, with generation of fuzzy if-then decision rules, to disease diagnosis.
4. Research methods
The thesis uses theoretical research methods; in particular, it studies the FMNN model for classifying and clustering data, and on that basis focuses on the proposed semi-supervised clustering algorithms. The thesis also uses simulation-based empirical methods in combination with analysis, statistics and evaluation of empirical data.
5. Contributions of the thesis
- Develop the advanced SS-FMM algorithm for fuzzy semi-supervised clustering based on a label spreading process.
- Propose a novel model of semi-supervised clustering combining FMNN and SS-FMM; this model automatically defines additional information for semi-supervised clustering algorithms.
- Develop a fuzzy clustering algorithm that takes the distribution of the data into account.
6. Structure of the thesis
Apart from the introduction and conclusion, the main content of the thesis consists of three chapters:
- Chapter 1 presents an overview of the thesis, including the basic concepts of FMNN and its extensions. From the general characteristics and limitations of these extensions, it establishes the direction of the subsequent research. Throughout this chapter, the thesis gives an overview of the research problem and of the concepts and basic algorithms used in the research.
- Chapter 2 presents proposals for improving the learning method in FMNN using semi-supervised algorithm models for data clustering. The additional information is a labeled part of the samples in the training data set; labels from this part of the data are then spread to the unlabeled data samples. A fuzzy semi-supervised clustering model combined with FMNN automatically defines the additional information, which is then used as the input of the fuzzy semi-supervised algorithm. A data clustering model in the fuzzy min-max neural network that takes the distribution of the data into account is presented as well.
- Chapter 3 presents the application of the proposed models, with the generation of fuzzy if-then decision rules, in a liver disease diagnostic support system on a real dataset.
Chapter 1: Overview of fuzzy min-max neural network
1.1 Fundamental knowledge of fuzzy min-max neural network
* Hyperbox membership function
The membership function b_j(A_h, B_j) measures the degree to which sample A_h belongs to hyperbox B_j. It is defined by Eq. (1.2) or Eq. (1.3) below.
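The membership computation can be sketched as follows. This is a minimal illustration using the ramp-based form found in Simpson's original FMNN papers; the exact Eqs. (1.2)/(1.3) of the thesis may differ in detail (for example in the sensitivity parameter gamma, which is an assumption here):

```python
def membership(a, v, w, gamma=4.0):
    """Degree b_j(A_h, B_j) to which sample a belongs to hyperbox [v, w].

    Ramp-style formulation: 1.0 inside the box, decreasing with the
    distance outside it, controlled by the sensitivity parameter gamma.
    """
    n = len(a)
    total = 0.0
    for ai, vi, wi in zip(a, v, w):
        total += max(0.0, 1.0 - max(0.0, gamma * min(1.0, ai - wi)))
        total += max(0.0, 1.0 - max(0.0, gamma * min(1.0, vi - ai)))
    return total / (2.0 * n)

inside = membership([0.3, 0.3], v=[0.2, 0.2], w=[0.4, 0.4])
outside = membership([0.9, 0.9], v=[0.2, 0.2], w=[0.4, 0.4])
print(inside, outside)  # inside is 1.0; outside is smaller
```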
* Fuzzy min-max neural network structure
FMNN uses a straightforward neural network structure: a two-layer structure (Fig 1.4) for unsupervised learning and a three-layer structure (Fig 1.5) for supervised learning.
* Overlapping between hyperboxes
The FMNN algorithm creates and modifies hyperboxes in n-dimensional space. If an expansion creates overlap between hyperboxes, a contraction process is performed to eliminate the overlap. Overlap between B_j and B_k occurs in one of the four following cases:
- Case 1: the max point of B_j overlaps the min point of B_k
- Case 2: the min point of B_j overlaps the max point of B_k
- Case 3: B_k is contained within B_j
- Case 4: B_j is contained within B_k
If B_j and B_k overlap, a contraction of the hyperboxes is performed in the corresponding direction to eliminate the overlap:
- Case 1: If v_ji < v_ki < w_ji < w_ki, then:
  v_ki^new = w_ji^new = (v_ki^old + w_ji^old) / 2
- Case 2: If v_ki < v_ji < w_ki < w_ji, then:
  v_ji^new = w_ki^new = (v_ji^old + w_ki^old) / 2
- Case 3: If v_ji < v_ki < w_ki < w_ji, consider the following cases:
  + If (w_ki - v_ji) < (w_ji - v_ki), then: v_ji^new = w_ki^old
  + Otherwise: w_ji^new = v_ki^old
- Case 4: If v_ki < v_ji < w_ji < w_ki, consider the following cases:
  + If (w_ki - v_ji) < (w_ji - v_ki), then: w_ki^new = v_ji^old
  + Otherwise: v_ki^new = w_ji^old
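The four contraction cases can be sketched in code. This is a reconstruction based on the standard FMNN contraction rules; the `contract` helper and its calling convention are illustrative, not taken from the thesis:

```python
def contract(v_j, w_j, v_k, w_k, dim):
    """Adjust two overlapping hyperboxes along dimension `dim`.

    The overlap is split in half (cases 1-2), or one box boundary is
    moved onto the other's (cases 3-4), whichever removes less volume.
    """
    i = dim
    if v_j[i] < v_k[i] < w_j[i] < w_k[i]:          # case 1
        v_k[i] = w_j[i] = (v_k[i] + w_j[i]) / 2.0
    elif v_k[i] < v_j[i] < w_k[i] < w_j[i]:        # case 2
        v_j[i] = w_k[i] = (v_j[i] + w_k[i]) / 2.0
    elif v_j[i] < v_k[i] <= w_k[i] < w_j[i]:       # case 3: B_k inside B_j
        if w_k[i] - v_j[i] < w_j[i] - v_k[i]:
            v_j[i] = w_k[i]
        else:
            w_j[i] = v_k[i]
    elif v_k[i] < v_j[i] <= w_j[i] < w_k[i]:       # case 4: B_j inside B_k
        if w_k[i] - v_j[i] < w_j[i] - v_k[i]:
            w_k[i] = v_j[i]
        else:
            v_k[i] = w_j[i]
    return v_j, w_j, v_k, w_k

# Case 1 example: the boxes overlap on [0.3, 0.5] in dimension 0
v_j, w_j = [0.0], [0.5]
v_k, w_k = [0.3], [0.8]
contract(v_j, w_j, v_k, w_k, dim=0)
print(w_j[0], v_k[0])  # both boundaries move to the midpoint 0.4
```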
* The learning algorithm in the fuzzy min-max neural network
The learning algorithm in the fuzzy min-max neural network consists only of the creation and modification of hyperboxes in the sample space. It comprises three steps: creation and expansion of hyperboxes, overlap test, and hyperbox contraction. Each step is repeated for all samples in the dataset.
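The creation-and-expansion step can be sketched as follows, assuming the commonly used size-limit criterion with parameter theta; the function name and return convention are illustrative assumptions:

```python
def try_expand(v, w, a, theta):
    """Expansion test and update for one hyperbox and one sample.

    The hyperbox [v, w] absorbs sample a only if the expanded box
    still satisfies the size-limit criterion controlled by theta;
    otherwise the learning algorithm creates a new hyperbox instead.
    """
    n = len(a)
    new_v = [min(vi, ai) for vi, ai in zip(v, a)]
    new_w = [max(wi, ai) for wi, ai in zip(w, a)]
    if sum(nw - nv for nv, nw in zip(new_v, new_w)) <= n * theta:
        return new_v, new_w, True
    return v, w, False  # expansion rejected

v, w, ok = try_expand([0.2, 0.2], [0.3, 0.3], a=[0.35, 0.25], theta=0.2)
print(ok, v, w)  # the box grows to [0.2, 0.2]-[0.35, 0.3]
```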
1.2 Some research on improving the quality of FMNN
* Adjusting the size limit of hyperboxes
To overcome the phenomenon of hyperboxes exceeding the size limit during network training, caused by the averaging method, D. Ma proposed replacing the size limit function of formula (1.24) with one that is checked in all dimensions, calculated according to formula (1.29).
* Modifying the FMNN structure to manage overlapping areas
The FMCN (Fuzzy Min-max neural network classifier with Compensatory Neurons) and DCFMN (Data-Core-Based Fuzzy Min-Max Neural Network) models overcome the problems caused by contraction of the hyperboxes. Rather than contracting the hyperboxes, FMCN and DCFMN handle overlapping areas by using additional hyperboxes to manage each separate overlapping area.
* Improving the learning method in FMNN
The semi-supervised models GFMM (General Fuzzy Min-Max) and RFMN (Reflex Fuzzy Min-max Neural network) use additional information in the form of labels accompanying some input patterns. GFMM and RFMN use this prior knowledge to monitor and guide clustering.
1.5 Conclusion of Chapter 1
Chapter 1 presented an overview of research on FMNN and its development trends, and synthesized and compared studies on structural improvements of the FMNN algorithm.
The following chapters present proposals on issues that remain open in the development of FMNN, and the application of FMNN to support disease diagnosis.
Chapter 2: The development of semi-supervised clustering algorithm using fuzzy min-max neural network
This chapter presents three algorithms that improve the learning method, together with the experimental results used to evaluate them. The novel models include:
- An improvement of the SS-FMM semi-supervised learning method; results announced in [3].
- A novel model of semi-supervised clustering combining FMNN and SS-FMM; results announced in [5].
- A fuzzy clustering algorithm that takes the distribution of the data into account. In addition, this algorithm uses a set of additional rules in the training process; results announced in [2, 4].
2.1 SS-FMM semi-supervised fuzzy clustering algorithm
The GFMM model and its modified model (RFMN) have the advantage of using prior information to monitor the clustering process, thereby improving clustering quality. However, both GFMM and RFMN can produce hyperboxes that never receive a label: when GFMM and RFMN create a new hyperbox for a first sample without a label, the new hyperbox is unlabeled. This hyperbox then waits for labeled samples to set its label to the label of the sample. However, unlabeled hyperboxes may remain unedited due to the absence of labeled samples. Figure 2.1 illustrates a case where GFMM and RFMN produce unlabeled hyperboxes.
Hyperbox U
Hyperbox V
Fig 2.1 Unlabeled hyperboxes of GFMM and RFMN
where V is a hyperbox created from labeled samples, or whose label was adjusted by labeled samples, and U is a hyperbox created from unlabeled samples without any label adjustment.
The SS-FMM algorithm overcomes this disadvantage of GFMM and RFMN: it prevents the algorithm from producing unlabeled hyperboxes by using the β limit threshold. The initial threshold is defined by the user, but the algorithm is able to redefine the threshold automatically to fit the training process. The framework is described in Figure 2.2.
When creating a new hyperbox from an unlabeled pattern, SS-FMM only creates the hyperbox if it satisfies the β criterion defined in (2.2).
SS-FMM creates labeled hyperboxes from labeled data samples and spreads the labels from the labeled hyperboxes to the hyperboxes created from unlabeled samples. SS-FMM then merges all hyperboxes with the same label to form a full cluster.
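The label-spreading idea can be sketched as follows. This is a simplified illustration only: here an unlabeled hyperbox simply takes the label of the nearest labeled hyperbox by centroid distance, whereas SS-FMM uses the β threshold and hyperbox geometry; the data layout and function name are assumptions:

```python
def spread_labels(boxes):
    """Propagate labels from labeled hyperboxes to unlabeled ones.

    Each box is a dict with 'v', 'w', 'label' (None if unlabeled).
    Simplified rule: an unlabeled box takes the label of the labeled
    box whose centroid is closest to its own centroid.
    """
    def centroid(b):
        return [(vi + wi) / 2.0 for vi, wi in zip(b["v"], b["w"])]

    def dist2(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    labeled = [b for b in boxes if b["label"] is not None]
    for b in boxes:
        if b["label"] is None and labeled:
            c = centroid(b)
            nearest = min(labeled, key=lambda lb: dist2(centroid(lb), c))
            b["label"] = nearest["label"]
    return boxes

boxes = [
    {"v": [0.0, 0.0], "w": [0.2, 0.2], "label": 1},
    {"v": [0.7, 0.7], "w": [0.9, 0.9], "label": 2},
    {"v": [0.1, 0.1], "w": [0.3, 0.3], "label": None},  # receives label 1
]
spread_labels(boxes)
print([b["label"] for b in boxes])  # [1, 2, 1]
```

After spreading, all hyperboxes sharing a label would be merged into one full cluster.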
Fig 2.2 General diagram of the SS-FMM algorithm
* Complexity evaluation of the SS-FMM algorithm
The SS-FMM algorithm has a time complexity of O(M(M(M-1)/2 + NK)), where M is the total number of samples in the training data set, N is the number of attributes of a data sample, and K is the total number of hyperboxes generated in the SS-FMM network.
2.2 Combined fuzzy semi-supervised clustering algorithm (SCFMN)
The SS-FMM algorithm generates hyperboxes, with each hyperbox treated as a cluster, and uses many small hyperboxes to classify samples on the cluster boundaries. However, when the value of the maximum hyperbox size parameter θ_max decreases, the number of hyperboxes in the network increases and the complexity of the algorithm increases as well. SS-FMM also requires a certain rate of labeled samples in the training set.
To overcome this limitation of SS-FMM, SCFMN uses two different values of the θ_max parameter in two stages to improve clustering results with fewer hyperboxes. The values θ1_max and θ2_max are the maximum sizes of the large and small hyperboxes, respectively. In the first stage, SCFMN generates hyperboxes and labels for the samples fully attached to hyperboxes. In the second stage, SCFMN spreads labels from the hyperboxes created in the previous stage to hyperboxes created from unlabeled samples. Large and small hyperboxes with the same label form a full cluster. Figure 2.3 illustrates the idea of using large hyperboxes at the centers of clusters in conjunction with smaller hyperboxes at the boundaries. The hyperboxes are shown in 2-dimensional space for a data set consisting of two clusters. Denote by B a large hyperbox, by G a small hyperbox (dashed line) obtained from labeled samples, and by R a small hyperbox (dot-cross line) obtained from unlabeled samples.
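As a toy illustration of why a smaller maximum hyperbox size produces more hyperboxes (and hence higher complexity), the following 1-D sketch builds hyperboxes greedily under a size limit. The greedy rule and `build_boxes` are illustrative assumptions, not the SCFMN algorithm itself:

```python
def build_boxes(samples, theta):
    """Greedy 1-D hyperbox construction: each sample either expands an
    existing box (if the expanded width stays within theta) or starts
    a new one. Illustrates why a smaller theta yields more hyperboxes."""
    boxes = []  # list of [v, w] intervals
    for a in samples:
        for box in boxes:
            if max(box[1], a) - min(box[0], a) <= theta:
                box[0] = min(box[0], a)
                box[1] = max(box[1], a)
                break
        else:
            boxes.append([a, a])
    return boxes

data = [0.05, 0.1, 0.15, 0.5, 0.52, 0.9]
large = build_boxes(data, theta=0.5)   # few large hyperboxes
small = build_boxes(data, theta=0.1)   # more small hyperboxes
print(len(large), len(small))  # 2 3
```

In the same spirit, SCFMN covers cluster centers with few large hyperboxes and uses small hyperboxes only near the boundaries.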
Fig 2.3 SCFMN uses the large and small hyperboxes
2.2.2 Methodology of the SCFMN algorithm
Figure 2.5 shows the general diagram of the SCFMN algorithm.
* The complexity of the SCFMN algorithm
SCFMN has a time complexity of O(KN(M(K+1)+1) + M(M-1)/2), where M is the total number of samples in the training data set, N is the number of attributes of a data sample, and K is the total number of hyperboxes generated in the SCFMN network.
Fig 2.5 General diagram of the SCFMN algorithm
2.3 CFMNN fuzzy min-max clustering algorithm based on data cluster centers
The value of the FMNN membership function does not decrease once a sample is sufficiently far from a hyperbox. To overcome this disadvantage, CFMNN relies on the distances between samples and the centroids of the corresponding hyperboxes. The centroid-based distance is used when a sample is far from the hyperbox and its membership value is less than 0.6, the region where the membership function value no longer decreases. Apart from the min and max points, each hyperbox therefore has a center, defined as in (2.8).
For each sample A_h that satisfies the size limit condition (1.24) and whose membership function value is b_j < 0.6, its distance to each centroid is calculated and compared; the sample is assigned to the closest hyperbox.
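This assignment rule can be sketched as follows. The sketch assumes memberships have already been computed for each hyperbox; the function name, data layout, and the exact fallback condition are illustrative assumptions:

```python
def assign_by_centroid(a, boxes, b_threshold=0.6):
    """CFMNN-style assignment sketch: when a sample's membership in all
    hyperboxes falls below the threshold (0.6 in the thesis), assign it
    to the hyperbox with the closest centroid instead.

    Each box is (v, w, membership_of_a); centroid = midpoint of v and w.
    """
    # If some hyperbox already gives a high membership, keep the best one.
    best = max(range(len(boxes)), key=lambda j: boxes[j][2])
    if boxes[best][2] >= b_threshold:
        return best

    # Otherwise fall back to centroid distance.
    def dist2(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def centroid(v, w):
        return [(vi + wi) / 2.0 for vi, wi in zip(v, w)]

    return min(range(len(boxes)),
               key=lambda j: dist2(a, centroid(boxes[j][0], boxes[j][1])))

boxes = [([0.0, 0.0], [0.2, 0.2], 0.3),   # centroid (0.1, 0.1)
         ([0.6, 0.6], [0.8, 0.8], 0.4)]   # centroid (0.7, 0.7)
print(assign_by_centroid([0.55, 0.55], boxes))  # 1: closer to (0.7, 0.7)
```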
* Complexity of the CFMNN algorithm
The CFMNN algorithm has a time complexity of O(MKN), where M is the total number of samples in the training data set, N is the number of attributes of a data sample, and K is the total number of hyperboxes generated in CFMNN.
2.4 Experiment and evaluation
The Accuracy and CCC measures are used to evaluate the performance of the algorithms and compare them with other methods. Accuracy is calculated by (2.12) and CCC is calculated by (2.13).
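Assuming (2.12) is the usual clustering accuracy, the fraction of correctly labeled samples, a minimal sketch follows; the exact CCC formula (2.13) is not reproduced here:

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the true label
    (the usual form of the Accuracy measure)."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

print(accuracy([1, 1, 2, 2, 3], [1, 1, 2, 3, 3]))  # 4 of 5 correct -> 0.8
```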