Transductive Support Vector Machines for Cross-lingual Sentiment Classification
Nguyen Thi Thuy Linh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by Professor Ha Quang Thuy
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
December, 2009
Table of Contents
1.1 Introduction
1.2 What might be involved?
1.3 Our approach
1.4 Related works
1.4.1 Sentiment classification
1.4.1.1 Sentiment classification tasks
1.4.1.2 Sentiment classification features
1.4.1.3 Sentiment classification techniques
1.4.1.4 Sentiment classification domains
1.4.2 Cross-domain text classification
2 Background
2.1 Sentiment Analysis
2.1.1 Applications
2.2 Support Vector Machines
2.3 Semi-supervised techniques
2.3.1 Generate maximum-likelihood models
2.3.2 Co-training and bootstrapping
2.3.3 Transductive SVM
3 The semi-supervised model for cross-lingual approach
3.1 The semi-supervised model
3.2 Review Translation
3.3 Features
3.3.1 Words Segmentation
3.3.2 Part of Speech Tagging
3.3.3 N-gram model
4.1 Experimental set up
4.2 Data sets
4.3 Evaluation metric
4.4 Features
4.5 Results
4.5.1 Effect of cross-lingual corpus
4.5.2 Effect of extraction features
4.5.2.1 Using stopword list
4.5.2.2 Segmentation and Part of speech tagging
4.5.2.3 Bigram
4.5.3 Effect of features size
5 Conclusion and Future Works
Abstract
Sentiment classification has received much attention and has many useful applications in business and intelligence. This thesis investigates the sentiment classification problem using machine learning techniques. Vietnamese sentiment corpora are limited, while many English sentiment corpora are available on the Web. We therefore combine English corpora as training data with a number of unlabeled Vietnamese documents in a semi-supervised model. Machine learning eliminates the language gap between the training set and the test set in our model. Moreover, we examine several types of features to obtain the best performance.
The results show that the semi-supervised classifier is quite good at leveraging the cross-lingual corpus, compared with the classifier trained without it. In terms of features, we find that using only the unigram model gives the best performance.