AN APPLIED RESEARCH OF SEMI SUPERVISED LEARNING TECHNOLOGY IN VIETNAMESE TEXT CLASSIFICATION FIELD

THE MINISTRY OF EDUCATION AND TRAINING

Major : COMPUTER SCIENCE Code : 62 48 01 01

Da Nang - 2017

THE RESEARCH WAS ACCOMPLISHED AT

Advisors:

1 Assoc Prof Dr Vo Trung Hung

2 Assoc Prof Dr Doan Van Ban

Reviewer 1: Prof Dr Nguyen Mau Han

Reviewer 2: Prof Dr Phan Huy Khanh

Reviewer 3: Prof Dr Huynh Thi Thanh Binh

The dissertation was defended before the Dissertation Grading Council at the level of The University of Danang, held at The University of Danang on September 29th, 2017

You can find the dissertation at:

- National Library of Vietnam;

- Learning Information Center, The University of Danang

INTRODUCTION

1 Reasons for choosing the topic

The rapid development of science and information technology has given people many ways to access information quickly and conveniently, such as electronic libraries, electronic portals, and search applications. These tools make it easier to exchange, update, and search for information all over the world through the Internet. Consequently, automatic document classification is now considered an urgent problem and attracts many researchers. In this dissertation, the author investigates new methods, based on semi-supervised learning technology, for classifying Vietnamese text more effectively.

2 Literature review

In computer science, semi-supervised learning is a class of machine learning techniques that uses both labeled and unlabeled data in training. Labeled data are usually scarcer than unlabeled data because labeling takes a lot of time. Many machine learning researchers have reported that combining unlabeled data with a small quantity of labeled data can considerably improve learning accuracy.
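As an illustration of how labeled and unlabeled data can be combined, the following is a minimal self-training sketch. It is not the dissertation's method: the nearest-centroid classifier, the function names, and the `min_margin` confidence threshold are all assumptions chosen to keep the example short (the dissertation itself works with SVM).

```python
from math import dist


def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit_centroids(labeled):
    """Toy classifier (assumption): one centroid per class."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_class.items()}


def predict_with_margin(centroids, x):
    """Return the nearest class and the distance gap to the runner-up class."""
    ranked = sorted((dist(x, c), y) for y, c in centroids.items())
    best_d, best_y = ranked[0]
    margin = ranked[1][0] - best_d if len(ranked) > 1 else float("inf")
    return best_y, margin


def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Repeatedly move confidently predicted unlabeled points into the labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        confident, rest = [], []
        for x in pool:
            y, margin = predict_with_margin(centroids, x)
            (confident if margin >= min_margin else rest).append((x, y))
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return fit_centroids(labeled), pool


# Hypothetical toy data: two labeled points and three unlabeled ones.
model, still_unlabeled = self_train(
    labeled=[((0, 0), "a"), ((10, 10), "b")],
    unlabeled=[(1, 0), (9, 10), (5, 5)],
)
```

In this toy run the two unlabeled points near a labeled example are absorbed into the training set, while the ambiguous point `(5, 5)` stays unlabeled because its margin is below the threshold.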

a Domestic literature review

b International literature review

3 Research target

The general objective of this study is to investigate the application of semi-supervised learning technology to Vietnamese text classification.

4 Research objects and scope

Research objects:

- Semi-supervised learning technology;

- Classification algorithms and data clustering in structured and semi-structured data spaces;

Research scope: focusing on Vietnamese text classification

5 Research content

- Determining a function or a method that can classify data classes efficiently (usually two classes);

- Predicting the classes of unlabeled data;

- Examining the impact of the amount of unlabeled data on the results of the algorithm;

- Developing test software for Vietnamese text classification

6 Research methodology

- Documentation methodology

- Empirical methodology

- Expert methodology

7 Main contributions of the dissertation

Main contributions of the dissertation include:

1 Proposing a new text classification method based on the Geodesic model and graph theory

2 Proposing a solution for reducing the dimensionality of text vectors based on the Dendrogram

3 Building a data warehouse for Vietnamese text classification

8 Dissertation structure

Main contents of the dissertation are presented in 4 chapters:

Chapter 1: Literature review

Chapter 2: Building a data warehouse

Chapter 3: Text classification based on Geodesic model

Chapter 4: Reducing the dimensionality of a vector based on Dendrogram

Chapter 1 LITERATURE REVIEW

1.1 Machine learning

1.1.1 Definition

1.1.2 Application of machine learning

1.2 Machine learning methodologies

1.3 Overview of semi-supervised learning

1.3.1 Semi-supervised learning methodologies

- Expectation–maximization algorithm

- Transductive SVM

- Self-training algorithm

Figure 1.1 Maximum-margin hyperplane

Figure 1.2 Visual performance of Self-training setup

1.3.2 SVM supervised learning algorithm and SVM semi-supervised learning algorithm

- Introduction

- Support vector machine (SVM) algorithm

Figure 1.4 Example of binary classification

1.3.3 SVM in text classification

1.3.4 Semi-supervised SVM and website classification

1.3.5 Typical text classification algorithm

1.4 Text classification

1.4.1 Text

1.4.2 Representing text as vectors

Figure 1.5 Model for representing text by feature vectors

1.4.3 Text classification

Chapter 2 BUILDING A DATA WAREHOUSE

2.1 Introduction of data warehouse for Vietnamese text classification

a Introduction

b Purpose of the data warehouse for Vietnamese text classification

2.2 Overview of the data warehouse

2.2.1 Definition of the data warehouse

2.2.2 Characteristics of the data warehouse

2.2.3 Purpose of the data warehouse

2.2.4 Data warehouse architectures

a Basic data warehouse architecture

Figure 2.1 Architecture of a data warehouse

b Data warehouse architecture with a staging area

Figure 2.2 Architecture of a data warehouse with a staging area

Components of the data warehouse:

- Data Sources

- Staging Area

- Metadata

- Data Warehouse

- Data Marts

2.3 Requirements Analysis

2.3.1 Data warehouse building

Table 2.1 Downloaded raw data

2.3.2 Data warehouse exploration

2.3.3 Data warehouse update

2.4 Data analysis and specification

2.5 Data warehouse building methodology

2.5.1 A proposed general model

Figure 2.3 The proposed general data warehouse model

2.5.2 Process of building a data warehouse

2.5.3 Process of text classification program

Step 1

Step 2

Step 3

Figure 2.4 Text classification process

a Data preprocessing

b Text representation

Vector space model

Figure 2.5 Vector model in 3D space

2.5.4 Text classification using Naïve Bayes algorithm

Table 2.2 Training data

2.5.5 Formatting the data outputs in data warehouses

a Formatting the sample text

b Example of the text’s format

2.6 Test result and evaluation of data warehouse

2.6.1 Test result of data warehouse

Table 2.3 Test result of data warehouse

Chapter 3 TEXT CLASSIFICATION BASED ON GEODESIC MODEL

3.1.1 Geodesic model

Figure 3.1 Illustrations of Euclidean and Geodesic distances

Figure 3.2 The proposed model

3.1.2 Geodesic distance-based manifold clustering technology

3.1.3 Geodesic distance calculation methodology

3.1.4 Kernel functions in the Geodesic distance-based support vector machine

Many kernel functions can be used with the support vector machine, such as:

- Polynomial function (homogeneous): $k(x_k, x_l) = (x_k \cdot x_l)^d$

- Polynomial function (inhomogeneous): $k(x_k, x_l) = (x_k \cdot x_l + 1)^d$

- Hyperbolic tangent function: $k(x_k, x_l) = \tanh(\beta\, x_k \cdot x_l + c)$, with $\beta > 0$ and $c < 0$

- Gaussian function: $k(x_k, x_l) = \exp(-\gamma \|x_k - x_l\|^2)$, with $\gamma > 0$

In this study, the author proposes a kernel function for the support vector machine that combines the Geodesic distance with the Gaussian function as follows:

$k(x_k, x_l) = \exp(-\gamma D_{kl})$, also written as $k(x_k, x_l) = \exp(-\gamma D_k(x))$, where $D$ denotes the Geodesic distance between the two texts.
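The kernel functions listed above can be written down directly. The sketch below is only illustrative: the function names and default parameter values are assumptions, and `geodesic_kernel_matrix` assumes that a matrix of pairwise Geodesic distances has already been computed by some other step.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, d=2, inhomogeneous=False):
    """(u . v)^d, or (u . v + 1)^d for the inhomogeneous form."""
    return (dot(u, v) + (1 if inhomogeneous else 0)) ** d


def tanh_kernel(u, v, beta=1.0, c=-1.0):
    """tanh(beta * u . v + c), with beta > 0 and c < 0."""
    return math.tanh(beta * dot(u, v) + c)


def gaussian_kernel(u, v, gamma=0.5):
    """exp(-gamma * ||u - v||^2), with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def geodesic_kernel_matrix(D, gamma=0.5):
    """Proposed form exp(-gamma * D_kl), applied to a precomputed distance matrix D."""
    return [[math.exp(-gamma * d_kl) for d_kl in row] for row in D]


# Hypothetical 2 x 2 Geodesic distance matrix, for illustration only.
K = geodesic_kernel_matrix([[0.0, 2.0], [2.0, 0.0]], gamma=0.5)
```

A matrix such as `K` could then be handed to any SVM implementation that accepts a precomputed kernel; which implementation the dissertation used is not stated in this summary.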

3.2 Text classification methodology based on Geodesic model

Proposed model:

Figure 3.3 Text classification model based on Geodesic distance

3.3 Testing text classification based on Geodesic model

a The first experiment

Table 3.2 The first classification result with the use of the traditional SVM

Average rate of successful classification 69.9%

Table 3.3 The first classification result with the use of the proposed SVM

The average rate of successful classification over all topics is 69.9% with the traditional SVM and 74.4% with the proposed method.

b The second experiment

Table 3.4 The second classification result with the use of the traditional SVM

Table 3.5 The second classification result with the use of the proposed SVM

c The third experiment

Table 3.6 The third classification result with the use of the traditional SVM

d The fourth experiment

Table 3.8 The fourth classification result with the use of the traditional SVM

Table 3.9 The fourth classification result with the use of the proposed SVM

e The fifth experiment

Table 3.10 The fifth classification result with the use of the traditional SVM

Table 3.11 The fifth classification result with the use of the proposed SVM

Figure 3.4 The average value and the variance of the classification rate for the traditional SVM and the proposed method

The figure above shows the average value and the variance of the rate of successful classification using the traditional SVM and the proposed method.

3.4 Conclusion

In this chapter, the author presented the results of text classification based on the proposed model, which combines the Geodesic model with the support vector machine. The Geodesic model calculates the distance between two vectors along the shortest chain of adjacent texts (the adjacency relation between texts). This Geodesic distance differs from the Euclidean distance and helps to increase the accuracy of automatic text classification. The method also allows classification into many classes instead of only two (by building on binary sub-classifiers).
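The difference between the Geodesic and Euclidean distances can be illustrated with a small sketch. It assumes an Isomap-style construction, which this summary does not confirm: each point is linked to its `k` nearest neighbours, and the Geodesic distance is the shortest path through that neighbourhood graph. The function names and the value of `k` are illustrative choices.

```python
import heapq
from math import dist, inf


def knn_graph(points, k):
    """Undirected graph linking each point to its k nearest neighbours (assumption)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph symmetric
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from `source` to every node."""
    best = {node: inf for node in graph}
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            candidate = d + w
            if candidate < best[v]:
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return best


# Hypothetical points laid out along an L-shaped curve.
points = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
geo = geodesic_distances(knn_graph(points, k=2), source=0)
euclid = dist(points[0], points[4])
```

Here the Euclidean distance between the two end points is about 2.83, while the Geodesic distance along the curve is 4.0, because the path has to follow neighbouring points instead of cutting across.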

Chapter 4 REDUCING THE DIMENSIONALITY OF A VECTOR BASED ON DENDROGRAM

This chapter presents the proposed solution for reducing the dimensionality of the vectors that represent Vietnamese text, based on the Dendrogram and documents taken from Wikipedia. The dimensionality reduction is then applied to Vietnamese text classification and evaluated through experiments.

Figure 4.2 An example about Dendrogram

4.2 Building Dendrogram from Wikipedia data

4.2.1 Wikipedia processing algorithm

Figure 4.3 Diagram of Wikipedia data processing algorithm

4.2.2 Dictionary processing algorithm

Figure 4.4 Diagram of dictionary processing algorithm

4.2.3 Algorithm for calculating the matrix P of co-occurrence frequencies

4.2.4 Algorithm for building Dendrogram

4.2.5 Cluster analysis

a Wikipedia processing

b Dictionary

c Calculating the matrix of co-occurrence frequencies

d Data organizing in program

4.2.6 Experiment

4.2.6.1 System structure

4.2.6.2 Functions

a Clustering function

Figure 4.5 Example of cutting the Dendrogram to obtain three groups

b Building classification model function

c Classification function

4.2.6.3 Results

Clustering the dictionary gives the following results:

Figure 4.6 The number of word pairs by co-occurrence frequency

Figure 4.7 The number of groups based on clustering on Dendrogram

Cutting the dendrogram at 20% of the maximum distance gives a set of related words or synonyms as follows:

Figure 4.8 The result of using the Dendrogram for clustering

Figure 4.9 Another example shows words related to music

Figure 4.10 An example of a Dendrogram of words

Figure 4.11 An example shows words related to medicine
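The idea of cutting the Dendrogram at 20% of the maximum distance can be sketched as follows. This is a sketch under assumptions, not the dissertation's implementation: average linkage is assumed because the linkage rule is not stated in this summary, and the word list and distance matrix are hypothetical values invented for the example.

```python
def build_dendrogram(dist_matrix):
    """Average-linkage agglomerative clustering (assumed linkage).

    Returns the merges as (height, left_members, right_members) in merge order.
    """
    clusters = {i: [i] for i in range(len(dist_matrix))}
    merges = []
    next_id = len(dist_matrix)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                height = sum(dist_matrix[i][j] for i, j in pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((height, clusters[a], clusters[b]))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


def cut_dendrogram(n_items, merges, fraction):
    """Keep only merges whose height is at most fraction * maximum height."""
    threshold = fraction * max(height for height, _, _ in merges)
    parent = list(range(n_items))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for height, left, right in merges:
        if height <= threshold:
            parent[find(left[0])] = find(right[0])
    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical words and pairwise distances (small = strongly related).
words = ["guitar", "piano", "doctor", "nurse", "law"]
D = [
    [0.00, 0.10, 0.90, 0.90, 1.00],
    [0.10, 0.00, 0.90, 0.90, 1.00],
    [0.90, 0.90, 0.00, 0.15, 1.00],
    [0.90, 0.90, 0.15, 0.00, 1.00],
    [1.00, 1.00, 1.00, 1.00, 0.00],
]
groups = cut_dendrogram(len(words), build_dendrogram(D), fraction=0.20)
word_groups = [[words[i] for i in group] for group in groups]
```

With these hypothetical distances, the cut yields three groups: the two music words, the two medical words, and "law" on its own.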

4.3 Applying word clustering to text classification

Figure 4.12 The storage capacity of vectors depends on the number of words

Figure 4.13 Labeling time over 5 training runs

b Text classification

Figure 4.14 Average time for classifying text over 5 training runs

c Accuracy of text classification

Figure 4.15 Classification rates over 5 training runs

d The average accuracy of text classification

Figure 4.16 The change of results according to the classification rate

The figure above shows that reducing the dictionary can improve classification accuracy. If an appropriate reduction rate is chosen (from 30% to 70% of the initial vector space), the text classification rate is higher than before the words were clustered and reduced.

4.4 Conclusion

The proposed methods aim to improve the quality of automatic Vietnamese text classification. The first method uses the Wikipedia encyclopedia and the Dendrogram to reduce the dimensionality of the vectors that represent Vietnamese text. The second method applies the reduced vectors to text classification. Experiments show that using the vector space reduced with the Dendrogram and Wikipedia not only saves storage capacity and time in Vietnamese text classification but also preserves the rate of correct classification; the classification rate is even higher than without clustering.
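One way to picture how word groups shrink a text vector is to let every group of related words share a single dimension. The sketch below assumes that the counts of the words in a group are simply summed; this aggregation rule, the group names, and the sample counts are assumptions, since the summary does not state how the reduced vector is computed.

```python
from collections import Counter


def reduce_vector(word_counts, word_to_group):
    """Merge the counts of words that belong to the same group (assumed rule).

    Words without a group keep their own dimension.
    """
    reduced = Counter()
    for word, count in word_counts.items():
        reduced[word_to_group.get(word, word)] += count
    return dict(reduced)


# Hypothetical grouping, e.g. obtained by cutting a dendrogram of words.
word_to_group = {
    "guitar": "group_music",
    "piano": "group_music",
    "doctor": "group_medicine",
    "nurse": "group_medicine",
}
document = {"guitar": 2, "piano": 1, "doctor": 3, "law": 1}
reduced = reduce_vector(document, word_to_group)
```

In this example the four original dimensions collapse to three, which is the source of the storage and time savings described above.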

A limitation of the proposed method is that it clusters words using only the co-occurrence frequency of word pairs within a single Wikipedia page. This can lead to semantic errors when a page contains too much information, for example a page that covers Sport, Law, Education, and other topics. Future research will address this limitation.
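The page-level counting described in this limitation can be sketched as follows. The function name, the tokenised pages, and the dictionary are hypothetical; the sketch only assumes that a pair of dictionary words is counted once for every page on which both words appear.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(pages, dictionary):
    """Count, for each pair of dictionary words, the pages containing both."""
    counts = Counter()
    for page_words in pages:
        present = sorted(set(page_words) & dictionary)
        counts.update(combinations(present, 2))  # each pair once per page
    return counts


# Hypothetical tokenised pages; the third page mixes two unrelated topics.
pages = [
    ["guitar", "piano", "song"],
    ["guitar", "piano"],
    ["doctor", "nurse", "piano"],
]
dictionary = {"guitar", "piano", "doctor", "nurse"}
counts = cooccurrence_counts(pages, dictionary)
```

The third page shows the problem: because it mentions both medicine and music, the unrelated pair ("doctor", "piano") receives a count of 1, just like the genuinely related pair ("doctor", "nurse").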

CONCLUSION

Achieved results

In this dissertation, the author presents research results on Vietnamese text classification that combine semi-supervised learning technology with the support vector machine (SVM). The achieved results are as follows:

- Building a data warehouse for Vietnamese text classification

- Proposing and testing a text classification method based on Geodesic distance

- Proposing and testing a method for reducing the dimensionality of the vectors that represent Vietnamese text, which increases processing speed while still ensuring classification accuracy

Based on these results, the dissertation compared the proposed method, which is based on Geodesic distance, with the traditional SVM model on the same data set. The average classification rates of the two methods are not significantly different; however, the variance of the proposed method (± 2%) is smaller than that of the traditional SVM (± 4%). This suggests that the proposed method is more reliable than the traditional SVM for Vietnamese text classification. Experiments also show that using the vector space reduced with the Dendrogram and Wikipedia not only saves storage capacity and time in Vietnamese text classification but also preserves the correct classification rate compared with no clustering. At a reduction rate of 30% to 70% of the initial vector space, the correct classification rate is higher than without clustering.

Limitation of the dissertation

- The text classification program has largely completed the proposed functions, such as helping users build a classification model for Vietnamese texts and automatically classifying new texts based on the established model. However, the initial data collection is still at the experimental stage.

- The dissertation does not use WORDNET or build a graph to consider the semantic correlation among words before building feature vectors for text. This may reduce the effectiveness of clustering.

- The dimensionality reduction of text vectors considers only the co-occurrence frequency of word pairs within a single Wikipedia page when dividing words into groups, so it can produce wrong meanings if a Wikipedia page contains too much information, for example a page that includes information about Sport, Education, Law, International, Society…

- The dissertation has been tested only on the support vector machine (SVM).

- The dissertation has not yet compared different Dendrogram algorithms.

In future work, the author will add several new functions and complete the program to enhance its effectiveness and, at the same time, build a data warehouse large enough to classify text more accurately.

Proposal for future research

Nowadays, text summarization is a research trend that attracts many scientists, especially for the Vietnamese language, which has many
