Transductive Support Vector Machines for Cross-lingual Sentiment Classification
Nguyen Thi Thuy Linh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by Professor Ha Quang Thuy
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
December, 2009
Table of Contents
1 Introduction
1.1 Introduction
1.2 What might be involved?
1.3 Our approach
1.4 Related works
1.4.1 Sentiment classification
1.4.1.1 Sentiment classification tasks
1.4.1.2 Sentiment classification features
1.4.1.3 Sentiment classification techniques
1.4.1.4 Sentiment classification domains
1.4.2 Cross-domain text classification
2 Background
2.1 Sentiment Analysis
2.1.1 Applications
2.2 Support Vector Machines
2.3 Semi-supervised techniques
2.3.1 Generate maximum-likelihood models
2.3.2 Co-training and bootstrapping
2.3.3 Transductive SVM
3 The semi-supervised model for cross-lingual approach
3.1 The semi-supervised model
3.2 Review Translation
3.3 Features
3.3.1 Words Segmentation
3.3.2 Part of Speech Tagging
3.3.3 N-gram model
4.5.1 Effect of cross-lingual corpus
4.5.2 Effect of extraction features
4.5.2.1 Using stopword list
4.5.2.2 Segmentation and Part of speech tagging
Abstract
Sentiment classification has received much attention and has many useful applications in business and intelligence. This thesis investigates the sentiment classification problem using machine learning techniques. Vietnamese sentiment corpora are limited, while many English sentiment corpora are available on the Web. We therefore combine English corpora as training data with a number of unlabeled Vietnamese data in a semi-supervised model. Machine learning eliminates the language gap between the training set and the test set in our model. Moreover, we also examine different types of features to obtain the best performance.
The results show that the semi-supervised classifier is quite good at leveraging the cross-lingual corpus, compared with the classifier trained without the cross-lingual corpus. In terms of features, we find that using only the unigram model outperforms the other feature settings.