Transductive support vector machines for cross-lingual sentiment classification
Nguyễn Thị Thùy Linh
Trường Đại học Công nghệ
Master's thesis, major: Computer Science; Code: 60 48 01
Supervisor: Assoc. Prof. Dr. Hà Quang Thụy
Year of defense: 2009
Abstract
Sentiment classification has received much attention and has many useful applications in business and intelligence. This thesis investigates the sentiment classification problem using machine learning techniques. Vietnamese sentiment corpora are limited, while many English sentiment corpora are available on the Web. We therefore combine English corpora as training data with a number of unlabeled Vietnamese documents in a semi-supervised model. Machine learning eliminates the language gap between the training set and the test set in our model. Moreover, we examine several types of features to obtain the best performance. The results show that the semi-supervised classifier leverages the cross-lingual corpus well compared with a classifier trained without the cross-lingual corpus. In terms of features, we find that using only the unigram model gives the best performance.
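As a rough illustration of the setup described in the abstract, the sketch below encodes labeled and unlabeled reviews as unigram count features. It is not taken from the thesis. The toy reviews, the helper names (`ngrams`, `build_vocab`, `to_svmlight`), and the choice of the SVM-light text format, in which a target of 0 marks an example to be classified by transduction, are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n=1):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(docs, n=1):
    """Map each distinct n-gram to a 1-based feature id, in first-seen order."""
    vocab = {}
    for tokens in docs:
        for gram in ngrams(tokens, n):
            vocab.setdefault(gram, len(vocab) + 1)
    return vocab


def to_svmlight(tokens, label, vocab, n=1):
    """Encode one document as '<label> <id>:<count> ...' with ascending ids."""
    counts = Counter(vocab[g] for g in ngrams(tokens, n) if g in vocab)
    feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {feats}".strip()


# Hypothetical toy data: labeled reviews carry +1 / -1,
# unlabeled target-language reviews carry 0 (assumed transductive marker).
labeled = [("good good phone".split(), 1), ("bad battery".split(), -1)]
unlabeled = ["good battery".split()]

vocab = build_vocab([tokens for tokens, _ in labeled] + unlabeled, n=1)
lines = [to_svmlight(tokens, label, vocab) for tokens, label in labeled]
lines += [to_svmlight(tokens, 0, vocab) for tokens in unlabeled]
```

Passing `n=2` to the same helpers would yield bigram features instead, which is the kind of feature comparison the abstract refers to.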
Keywords: Computer science; Information technology; Data; Language
Content
Table of Contents
1.1 Introduction
1.2 What might be involved?
1.3 Our approach
1.4 Related works
1.4.1 Sentiment classification
1.4.1.1 Sentiment classification tasks
1.4.1.2 Sentiment classification features
1.4.1.3 Sentiment classification techniques
1.4.1.4 Sentiment classification domains
1.4.2 Cross-domain text classification
2 Background
2.1 Sentiment Analysis
2.1.1 Applications
2.2 Support Vector Machines
2.3 Semi-supervised techniques
2.3.1 Generate maximum-likelihood models
2.3.2 Co-training and bootstrapping
2.3.3 Transductive SVM
3 The semi-supervised model for cross-lingual approach
3.1 The semi-supervised model
3.2 Review Translation
3.3 Features
3.3.1 Words Segmentation
3.3.2 Part of Speech Tagging
3.3.3 N-gram model
4 Experiments
4.1 Experimental set up
4.2 Data sets
4.3 Evaluation metric
4.4 Features
4.5 Results
4.5.1 Effect of cross-lingual corpus
4.5.2 Effect of extraction features
4.5.2.1 Using stopword list
4.5.2.2 Segmentation and Part of speech tagging
4.5.2.3 Bigram
4.5.3 Effect of features size
5 Conclusion and Future Works
Appendix A
Appendix B