THE MINISTRY OF EDUCATION AND TRAINING
Major : COMPUTER SCIENCE Code : 62 48 01 01
Da Nang - 2017
THE RESEARCH WAS ACCOMPLISHED AT
Advisors:
1 Assoc Prof Dr Vo Trung Hung
2 Assoc Prof Dr Doan Van Ban
Reviewer 1: Prof Dr Nguyen Mau Han
Reviewer 2: Prof Dr Phan Huy Khanh
Reviewer 3: Prof Dr Huynh Thi Thanh Binh
The dissertation was defended before the University of Danang-level Dissertation Grading Council at The University of Danang on September 29th, 2017
You can find the dissertation at:
- National Library of Vietnam;
- Learning Information Center, The University of Danang
INTRODUCTION
1 Reasons for choosing the topic
Nowadays, the rapid development of science and technology, and of information technology in particular, has given people many ways to access information quickly and conveniently, such as electronic libraries, electronic portals and search applications. These tools make it more convenient to exchange, update and search for information all over the world through the Internet. Therefore, automatic document classification is now considered an urgent problem and attracts many researchers. In this dissertation, the author focuses on investigating new, more effective methods for Vietnamese text classification based on semi-supervised learning.
2 Literature review
In computer science, semi-supervised learning is a class of machine learning techniques that use both labeled and unlabeled data in training. The quantity of labeled data is usually smaller than the quantity of unlabeled data because labeling data requires a lot of time. Many machine learning researchers have reported that combining unlabeled data with a small quantity of labeled data can considerably improve learning accuracy.
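To make the idea of combining a small labeled set with a larger unlabeled set concrete, here is a minimal self-training sketch. It is only an illustration under stated assumptions: the base classifier is a simple nearest-centroid rule rather than the SVM studied in the dissertation, and the names `self_train` and `centroid` are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def self_train(labeled, unlabeled, rounds=10):
    """Self-training with a nearest-centroid base classifier (an assumption).

    labeled:   list of (vector, label) pairs
    unlabeled: list of vectors without labels
    Returns the enlarged list of (vector, label) pairs.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # 1. Fit the base classifier on the current labeled set.
        by_label = {}
        for vec, lab in labeled:
            by_label.setdefault(lab, []).append(vec)
        centroids = {lab: centroid(vs) for lab, vs in by_label.items()}
        # 2. Score every unlabeled point; the confidence is the gap
        #    between its two nearest class centroids.
        best = None
        for idx, vec in enumerate(pool):
            dists = sorted((math.dist(vec, c), lab) for lab, c in centroids.items())
            margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
            if best is None or margin > best[0]:
                best = (margin, idx, dists[0][1])
        # 3. Move the single most confident prediction into the labeled set.
        _, idx, lab = best
        labeled.append((pool.pop(idx), lab))
    return labeled
```

Adding one point per round keeps the sketch easy to follow; a practical setup would typically add a batch of confident predictions per round and stop when confidence falls below a threshold.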
a Domestic literature review
b International literature review
3 Research target
The general target of this study is to investigate the application of semi-supervised learning technology in Vietnamese text classification
4 Research objects and scope
Research objects:
- Semi-supervised learning technology;
- Classification algorithms and data clustering in structured and semi-structured data spaces;
- A focus on Vietnamese text classification
5 Research content
- Determining a function or a method that classifies data into classes efficiently (usually two classes);
- Making predictions about the classes of unlabeled data;
- Examining the impact of the amount of unlabeled data on the results of the algorithm;
- Developing testing software for Vietnamese text classification
6 Research methodology
- Documentation methodology
- Empirical methodology
- Expert methodology
7 Main contributions of the dissertation
Main contributions of the dissertation include:
1 Proposing a new methodology in text classification based on Geodesic model and graph theory
2 Proposing solutions reducing the dimensionality of a vector for text classification based on Dendrogram
3 Building a data warehouse for Vietnamese text classification
8 Dissertation structure
Main contents of the dissertation are presented in 4 chapters:
Chapter 1: Literature review
Chapter 2: Building a data warehouse
Chapter 3: Text classification based on Geodesic model
Chapter 4: Reducing the dimensionality of a vector based on Dendrogram
Chapter 1 LITERATURE REVIEW
1.1 Machine learning
1.1.1 Definition
1.1.2 Application of machine learning
1.2 Machine learning methodologies
1.3 Overview of semi-supervised learning
1.3.1 Semi-supervised learning methodologies
- Expectation–maximization algorithm
- Transductive SVM
- Self-training algorithm
Figure 1.1 Maximum-margin hyperplane
Figure 1.2 Visual performance of Self-training setup
1.3.2 SVM supervised learning algorithm and SVM semi-supervised learning algorithm
- Introduction
- Support vector machine (SVM) algorithm
Figure 1.4 Example of binary classification
1.3.3 SVM in text classification
1.3.4 Semi-supervised SVM and website classification
1.3.5 Typical text classification algorithm
1.4 Text classification
1.4.1 Text
1.4.2 Representing text as vectors
Figure 1.5 Model for representing text by feature vectors
1.4.3 Text classification
Chapter 2 BUILDING A DATA WAREHOUSE
2.1 Introduction of data warehouse for Vietnamese text classification
a Introduction
b Purpose of the data warehouse for Vietnamese text classification
2.2 Overview of the data warehouse
2.2.1 Definition of the data warehouse
2.2.2 Characteristics of the data warehouse
2.2.3 Purpose of the data warehouse
2.2.4 Data warehouse architectures
a Basic data warehouse architecture
Figure 2.1 Architecture of a data warehouse
b Data warehouse architecture with a staging area
Figure 2.2 Architecture of a data warehouse with a staging area
Components of the data warehouse:
- Data Sources
- Staging Area
- Metadata
- Data Warehouse
- Data Marts
2.3 Requirements Analysis
2.3.1 Data warehouse building
Table 2.1 Downloaded raw data
2.3.2 Data warehouse exploration
2.3.3 Data warehouse update
2.4 Data analysis and specification
2.5 Data warehouse building methodology
2.5.1 A proposed general model
Figure 2.3 The proposed general data warehouse model
2.5.2 Process of building a data warehouse
2.5.3 Process of text classification program
Step 1
Step 2
Step 3
Figure 2.4 Text classification process
a Data preprocessing
b Text representation
Vector space model
Figure 2.5 Vector model in 3D space
2.5.4 Text classification using Naïve Bayes algorithm
Table 2.2 Training data
2.5.5 Formatting the data outputs in data warehouses
a Formatting the sample text
b Example of the text’s format
2.6 Test result and evaluation of data warehouse
2.6.1 Test result of data warehouse
Table 2.3 Test result of data warehouse
Chapter 3 TEXT CLASSIFICATION BASED ON GEODESIC MODEL
3.1.1 Geodesic model
Figure 3.1 Illustrations of Euclidean and Geodesic distances
Figure 3.2 The proposed model
3.1.2 Geodesic distance-based manifold clustering technology
3.1.3 Geodesic distance calculation methodology
3.1.4 Kernel functions in Geodesic distance-based support vector machine
For support vector machines, there are many kernel functions, such as:
- Polynomial function (homogeneous): $k(x_k, x_l) = (x_k \cdot x_l)^d$
- Polynomial function (inhomogeneous): $k(x_k, x_l) = (x_k \cdot x_l + 1)^d$
- Hyperbolic tangent function: $k(x_k, x_l) = \tanh(\beta\, x_k \cdot x_l + c)$, with $\beta > 0$ and $c < 0$
- Gaussian function: $k(x_k, x_l) = \exp(-\gamma \|x_k - x_l\|^2)$, with $\gamma > 0$
In this study, I propose a kernel function for the support vector machine that combines the Geodesic distance with the Gaussian function as follows:
$k(x_k, x_l) = \exp(-\gamma D_{kl})$
$k(x_k, x_l) = \exp(-\gamma D_k(x))$
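The kernel (multiple) functions listed above can be sketched in a few lines of Python. This is an illustration only: the function names and default parameter values are hypothetical, and `distance_kernel` assumes that the distances D_kl have already been computed and stored in a matrix `D`. Whether such a distance-based kernel is positive semidefinite depends on the distance matrix and is not checked here.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2, inhomogeneous=False):
    """(x . y)^d, or (x . y + 1)^d when inhomogeneous is True."""
    return (dot(x, y) + (1.0 if inhomogeneous else 0.0)) ** d

def tanh_kernel(x, y, beta=1.0, c=-1.0):
    """tanh(beta * x . y + c), with beta > 0 and c < 0."""
    return math.tanh(beta * dot(x, y) + c)

def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2), with gamma > 0."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_norm)

def distance_kernel(D, k, l, gamma=0.5):
    """exp(-gamma * D[k][l]) for a precomputed distance matrix D (assumed input)."""
    return math.exp(-gamma * D[k][l])
```

Passing a matrix of precomputed distances keeps the kernel independent of how the Geodesic distance itself is obtained.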
3.2 Text classification methodology based on Geodesic model
Proposed model:
Figure 3.3 Text classification model based on Geodesic distance
3.3 Testing text classification based on Geodesic model
a The first experiment
Table 3.2 The first classification result with the use of the traditional SVM
Average rate of successful classification 69.9%
Table 3.3 The first classification result with the use of the proposed SVM
The average rate of successful classification on all topics is 69.9% with the traditional SVM and 74.4% with the proposed method
b The second experiment
Table 3.4 The second classification result with the use of the traditional SVM
Average rate of successful classification 76.5%
Table 3.5 The second classification result with the use of the proposed SVM
Average rate of successful classification 70.3%
c The third experiment
Table 3.6 The third classification result with the use of the traditional SVM
Average rate of successful classification 72.4%
d The fourth experiment
Table 3.8 The fourth classification result with the use of the traditional SVM
Average rate of successful classification 70.9%
Table 3.9 The fourth classification result with the use of the proposed SVM
Average rate of successful classification 72.9%
e The fifth experiment
Table 3.10 The fifth classification result with the use of the traditional SVM
Table 3.11 The fifth classification result with the use of the proposed SVM
Average rate of successful classification 73.5%
Figure 3.4 The average value and the variance of the classification rate for the traditional SVM and the proposed method
The figure above shows the average value and the variance of the successful classification rate using the traditional SVM and the proposed method
3.4 Conclusion
In this chapter, the author presented the results of text classification based on the proposed model, which combines the Geodesic model and the support vector machine. The Geodesic model uses the shortest correlation (the adjacency level between texts) to calculate the distance between two vectors. This Geodesic distance differs from the Euclidean distance and helps to increase the accuracy of automatic text classification; it also allows classification into many classes instead of only two (based on binary subclassification).
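The difference between a Geodesic and a Euclidean distance can be illustrated with a small sketch. It assumes one common construction, which the summary does not spell out: each point is linked to its k nearest neighbors, and the Geodesic distance is the length of the shortest path through that neighborhood graph. The function names and the choice of k are hypothetical.

```python
import heapq
import math

def knn_graph(points, k):
    """Undirected graph linking each point to its k nearest neighbors.

    Edge weights are Euclidean distances (an assumption of this sketch).
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for weight, j in nearest:
            graph[i][j] = weight
            graph[j][i] = weight
    return graph

def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from source; unreachable nodes stay at infinity."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, weight in graph[u].items():
            candidate = d + weight
            if candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist

# Points laid out along a "U" shape: the two ends are close in a straight
# line but far apart when travelling along the data.
points = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (3, 2), (3, 1), (3, 0)]
graph = knn_graph(points, k=2)
geodesic = geodesic_distances(graph, 0)[7]
euclidean = math.dist(points[0], points[7])
print(geodesic, euclidean)  # 7.0 versus 3.0
```

In this toy layout the Geodesic distance follows the shape of the data, which is the property the chapter relies on when comparing texts.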
Chapter 4 REDUCING THE DIMENSIONALITY OF A VECTOR BASED ON DENDROGRAM
This chapter presents the proposed solution for reducing the dimensionality of the vectors that represent Vietnamese text, based on a Dendrogram and documents taken from Wikipedia. The dimensionality reduction is then applied to Vietnamese text classification and evaluated through experiments.
Figure 4.2 An example about Dendrogram
4.2 Building Dendrogram from Wikipedia data
4.2.1 Wikipedia processing algorithm
Figure 4.3 Diagram of Wikipedia data processing algorithm
4.2.2 Dictionary processing algorithm
Figure 4.4 Diagram of dictionary processing algorithm
4.2.3 P matrix calculation algorithm for common appearing frequency
4.2.4 Algorithm for building Dendrogram
4.2.5 Cluster analysis
a Wikipedia processing
b Dictionary
c Calculating the matrix of common appearing frequency
d Data organizing in program
4.2.6 Experiment
4.2.6.1 System structure
4.2.6.2 Functions
a Clustering function
Figure 4.5 Example of cutting a Dendrogram to obtain three groups
b Building classification model function
c Classification function
4.2.6.3 Results
Clustering the dictionary gives the following results
Figure 4.6 The number of pairs of words according to the common appearing frequency
Figure 4.7 The number of groups based on clustering on Dendrogram
Cutting the dendrogram at 20% of the maximum distance gives sets of related words or synonyms as follows:
Figure 4.8 The result of using the dendrogram for clustering
Figure 4.9 Another example shows words related to music
Figure 4.10 An example of a Dendrogram of words
Figure 4.11 An example shows words related to medicine
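Cutting a dendrogram at a fraction of its maximum distance, as described above, can be sketched as follows. The sketch assumes single-linkage agglomerative clustering over a precomputed distance matrix; the summary does not state which linkage criterion or word distance the dissertation uses, and the function names are hypothetical.

```python
def single_linkage(dist):
    """Naive agglomerative clustering with single linkage (an assumption).

    dist: symmetric matrix of pairwise distances.
    Returns the merges as (height, members_a, members_b), in merge order.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                height = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((height, clusters[a], clusters[b]))
        merged = clusters[a] | clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

def cut(merges, n_items, fraction):
    """Cut the dendrogram at fraction * (maximum merge height) and return the groups."""
    threshold = fraction * max(height for height, _, _ in merges)
    groups = {i: frozenset([i]) for i in range(n_items)}
    for height, a, b in merges:
        if height <= threshold:
            merged = a | b
            for i in merged:
                groups[i] = merged
    return sorted(set(groups.values()), key=min)

# Five items placed on a line; distances are absolute differences.
positions = [0.0, 0.1, 1.0, 1.15, 5.0]
dist = [[abs(p - q) for q in positions] for p in positions]
groups = cut(single_linkage(dist), len(positions), fraction=0.20)
print(groups)  # three groups: {0, 1}, {2, 3}, {4}
```

A lower cutting fraction yields more and smaller groups, while a higher one merges more words together, so the fraction directly controls how strongly the dictionary is reduced.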
4.3 Applying words clustering into text classification
Figure 4.12 The storage capacity of vectors depends on the number of words
Figure 4.13 Labeling time over 5 training runs
b Text classification
Figure 4.14 Average time for classifying text over 5 training runs
c Accuracy of text classification
Figure 4.15 Classification rates over 5 training runs
d The average accuracy of text classification
Figure 4.16 The change of results according to the classification rate
Based on the figure above, reducing the dictionary can improve the accuracy of classification. If we choose a suitable reduction rate for the dictionary (from 30% to 70% of the initial vector space), the classification rate is higher than before the words were clustered and reduced.
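How word clustering shortens a text vector can be shown with a small sketch: every word of the dictionary is mapped to the index of its group, and the text is counted per group instead of per word. The word-to-group mapping below is a hypothetical example with English placeholder words, not data from the dissertation.

```python
def reduced_vector(tokens, word_to_group, n_groups):
    """Count tokens per word group; words outside the dictionary are ignored."""
    vector = [0] * n_groups
    for token in tokens:
        group = word_to_group.get(token)
        if group is not None:
            vector[group] += 1
    return vector

# Hypothetical grouping: 5 dictionary words collapse into 3 dimensions.
word_to_group = {"piano": 0, "guitar": 0, "doctor": 1, "hospital": 1, "goal": 2}
tokens = "piano guitar doctor piano".split()
print(reduced_vector(tokens, word_to_group, n_groups=3))  # [3, 1, 0]
```

The vector length now equals the number of groups rather than the number of dictionary words, which is where the savings in storage and classification time come from.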
4.4 Conclusion
The proposed methodologies aim to enhance the quality of automatic Vietnamese text classification. The first methodology uses the Wikipedia encyclopedia and a Dendrogram to reduce the dimensionality of the vectors that represent Vietnamese text. The second methodology applies the reduced vectors to text classification. Experiments show that using the reduced vector space based on the Dendrogram and Wikipedia not only saves storage capacity and time for Vietnamese text classification but also preserves classification accuracy: the classification rate is higher than without clustering.
The limitation of the proposed methodology is that clustering was tested only with the co-occurrence frequency (the common appearing frequency) of word pairs within a single Wikipedia page. This can lead to semantic errors if a Wikipedia page contains too much information, for example a page that covers Sport, Law, Education and so on. Future research will address this limitation.
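The page-level frequency discussed here can be sketched as a simple pair count. The sketch assumes that the frequency of a word pair is the number of pages in which both words appear; the exact definition of the P matrix is not given in this summary, and the function name and sample pages are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(pages, dictionary):
    """For every unordered word pair, count the pages in which both words appear.

    pages:      list of token lists, one per page
    dictionary: iterable of the words to track
    Pairs are stored as alphabetically sorted tuples.
    """
    vocabulary = set(dictionary)
    counts = Counter()
    for page in pages:
        present = sorted(vocabulary.intersection(page))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

pages = [
    ["piano", "guitar", "music"],
    ["piano", "music"],
    ["doctor", "hospital", "music"],
]
counts = cooccurrence_counts(pages, ["piano", "guitar", "music", "doctor", "hospital"])
print(counts[("music", "piano")])  # 2
```

The third sample page shows the limitation named in the text: a page that mixes topics pairs "music" with "doctor" even though the words are not semantically related.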
CONCLUSION
Achieved results
In this dissertation, the author presents research results in Vietnamese text classification that combine semi-supervised learning with the support vector machine (SVM). The achieved results are as follows:
- Building a data warehouse for Vietnamese text classification.
- Proposing and testing a text classification methodology based on Geodesic distance.
- Proposing and testing a methodology for reducing the dimensionality of the vectors that represent Vietnamese text, which increases processing speed while still ensuring accuracy when classifying text.
Based on the results, the dissertation compared the proposed methodology, which is based on Geodesic distance, with the traditional SVM model on the same data set. The average classification rates of the two methodologies are not significantly different; however, the variance of the proposed method (± 2%) is smaller than that of the traditional SVM (± 4%). This suggests that the proposed method is more reliable than the traditional SVM for Vietnamese text classification. Experiments show that using the vector space reduced by the Dendrogram and Wikipedia not only saves storage capacity and time for Vietnamese text classification but also preserves the correct classification rate compared with the unclustered case. At a 30% - 70% reduction rate of the initial vector space, the correct classification rate is higher than without clustering.
Limitation of the dissertation
- The text classification program has largely completed the proposed functions, such as helping users build the classification model for Vietnamese texts and automatically classifying new texts based on the established model. However, the initial data collection is still at the experimental stage.
- The dissertation does not use WORDNET or build a graph to consider the semantic correlation among words before building feature vectors for texts. This can reduce the quality of the clustering.
- The dimensionality reduction of text vectors was tested only with the co-occurrence frequency (the common appearing frequency) of word pairs within a single Wikipedia page to divide words into groups, so it can produce wrong meanings if a Wikipedia page contains too much information, for example a page that includes information about Sport, Education, Law, International, Society and so on.
- The dissertation has been tested only on the support vector machine (SVM).
- The dissertation has not yet compared different Dendrogram algorithms.
In future work, I will add several new functions and complete the program to enhance its effectiveness, and at the same time build a data warehouse large enough to classify text more accurately.
Proposal for future research
Nowadays, text summarization is a research trend that attracts many scientists, especially for the Vietnamese language, which has many