Automatically Learning Patterns in Subjectivity Classification for Vietnamese

Tran-Thai Dang1, Nguyen Thi Xuan Huong1,2, Anh-Cuong Le1, and Van-Nam Huynh3

1 University of Engineering and Technology, Vietnam National University, Hanoi
144 Xuanthuy, Caugiay, Hanoi, Vietnam
2 Haiphong Private University
36 Danlap, Duhangkenh, Lechan, Haiphong, Vietnam
3 Japan Advanced Institute of Science and Technology
1-1 Asahidai, Nomi, Ishikawa, Japan
{thaidangtran12@gmail.com; huong_ntxh@hpu.edu.vn; cuongla@vnu.edu.vn; huynh@jaist.ac.jp}
Abstract. Opinions are subjective expressions that describe people's viewpoints, perspectives, or feelings about entities and events. They are essential information for sentiment analysis. Therefore, opinion detection, which is also called subjectivity classification, is an important task. In this paper, we propose a statistical method to automatically create patterns for determining opinions from various resources on the web. The learned patterns are more flexible and adaptive to the domain than manually created ones. In this work, we obtained approximately 84% accuracy on Vietnamese comment data.
1 Introduction

The sentiment analysis process includes crawling, extracting, and analyzing people's opinions shared on forums, news portals, and social networks. It helps manufacturers gain real feedback to improve their products, and helps customers get useful information to make decisions when buying products.

After crawling data from the Internet, we have to determine whether a comment is subjective (it contains an opinion) or objective (it only states a fact). Subjectivity classification is considered the first step in the sentiment analysis process. The subjective comments are normally used in the next step to determine which are positive, negative, or neutral.

Several methods have been introduced to find words and phrases that express opinions. Most previous works are carried out on English data. However, those methods are not fully effective when applied to Vietnamese data. In this paper, we focus on determining subjective comments in Vietnamese data.
Through investigating comments on several Vietnamese forums and blogs, we observed that people usually use adjectives and verbs to express their opinions. For example, some common adjectives and verbs used in people's comments are "đẹp" (nice), "xấu" (ugly), "tốt" (good), "mượt mà" (smooth), "thích" (like), "ghét" (hate), "cảm thấy" (feel), etc. Therefore, adjectives and verbs can be strong clues that help to distinguish between subjective and objective comments.
The words and phrases that express opinions can be extracted based on a sentiment dictionary, n-grams, or syntactic patterns. Among these, syntactic patterns are useful to enrich the set of features. The patterns can be created manually based on knowledge of a specific language, such as its grammar and POS. For example, we can build a POS pattern from the following sentence:

"Con/Nu Nokia/Np này/P nhìn/V rất/R đẹp/A." (This Nokia looks very nice)
(Nu: unit noun; Np: proper noun; P: pronoun; V: verb; R: adverb; A: adjective)

From the above example, the phrase "nhìn rất đẹp" (looks very nice) is the component that expresses the opinion. This phrase can be extracted by the pattern V-R-A.
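As an illustration only (not the paper's implementation; the function name is hypothetical), the following Python sketch shows how a fixed POS pattern such as V-R-A can be matched against a word/POS-tagged sentence to pull out the opinion phrase:

```python
def match_pattern(tagged_sentence, pattern):
    """Return the word sequences whose POS tags match `pattern` exactly."""
    words = [w for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]
    n = len(pattern)
    matches = []
    for i in range(len(tags) - n + 1):
        if tags[i:i + n] == pattern:
            matches.append(" ".join(words[i:i + n]))
    return matches

# "Con/Nu Nokia/Np này/P nhìn/V rất/R đẹp/A"
tagged = [("Con", "Nu"), ("Nokia", "Np"), ("này", "P"),
          ("nhìn", "V"), ("rất", "R"), ("đẹp", "A")]
print(match_pattern(tagged, ["V", "R", "A"]))   # ['nhìn rất đẹp']
```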
Manual creation not only requires much time but also makes it difficult to cover all the rules needed to find subjective features. On Vietnamese forums and blogs, people often use spoken language and slang, which is short and informal. Hence, we need to propose suitable patterns, then investigate and evaluate their influence.
To deal with this problem, we introduce a statistical method that helps the system learn syntactic patterns and evaluate these patterns from labeled training data. The learning process includes two main steps: pattern identification and pattern evaluation. The system determines whether the patterns are frequently used to express opinions or not. In our work, the training data are tagged with two labels, "<sub>" (subjectivity) and "<obj>" (objectivity). After that, the system extracts and evaluates the subjective patterns to build the feature set. The patterns may be created using a syntactic parse tree or POS information. In our work, we chose POS for two reasons: firstly, people often use spoken language that includes incomplete sentences (sentences lacking a subject or predicate), so it is difficult to obtain a correct parse tree; secondly, POS information is easier to adapt to a domain than a parse tree because we can use a statistical approach to learn POS tags. This method can be applied to many languages without deep knowledge of their syntactic information. Moreover, POS patterns learned from the training data are more flexible and adaptive to the domain than manually created ones.
2 Related Work

Subjectivity classification focuses on how to build a good feature set to improve the system's performance. Janyce Wiebe [6] identified strong clues of subjectivity based on distributional similarity; she used a small seed set of manually annotated data to develop promising adjective features. In [1], Pang et al. used n-grams as features for polarity classification. In another direction, to enrich the feature set, Peter D. Turney [10] used patterns to extract phrases which contain adjectives or verbs. Similarly to English, people usually use adjectives and verbs to express their opinions in Vietnamese data, so these are important information for extracting features. E. Riloff et al. [3] used two bootstrapping algorithms that extract patterns to learn a set of subjective nouns. In [4], [5], [7], syntactic information is used to create the patterns. To create patterns for extracting features, we need linguistic knowledge or a small set of sentiment seed words [3]. In contrast, we propose a statistical method which learns the patterns from labeled training data.

POS information is also used in various previous works to determine features. Gamon [14] performed sentiment analysis on feedback data and analyzed the role of linguistic features like POS tags. Pak and Paroubek [13] reported that both POS and bigrams help to perform subjectivity classification of tweets. Barbosa and Feng [15] proposed the use of syntax features of tweets like retweets, hashtags, links, punctuation, and exclamation marks in conjunction with features like the prior polarity of words and POS tags. Agarwal et al. [16] extended Barbosa and Feng (2010) by using a combination of real-valued polarity with POS and reported that POS features are important to the classification accuracy. In [9], M. Sokolova and G. Lapalme proposed a hierarchical text representation and built domain-independent rules that do not rely on domain content words and emotional words. Other observed characteristics may also be used to recognize subjective expressions, such as word lengthening in [12] and emoticons in [11].
After extracting the features, previous works use various classification techniques to assign labels to comments. Pang and Lee used Support Vector Machines (SVM), Naive Bayes (NB), and Maximum Entropy (ME) in [1]. Riloff used SVM in [4]. We also investigate some classification algorithms, such as SVM and NB, in our empirical work.
3 Learning POS Patterns

Motivated by previous works that use syntactic patterns and POS information to extract features on English data, we build a set of Vietnamese patterns that helps to enrich the feature set. Different from previous research, in our approach POS information is chosen to build the patterns because it is easy to adapt to a domain. In the learning process, we determine the forms of the patterns, which are called templates. This process is illustrated in Figure 1.
Figure 1. POS patterns learning process
In the first stage of this process, we extract all patterns from the labeled training data using the predefined templates. In the second stage, the patterns are evaluated to select the best set of patterns. The evaluation process contains two steps:

1. Evaluating to get sets of acceptable patterns. A pattern is acceptable if and only if it appears more frequently in subjective comments than in objective comments.
2. Evaluating to get the best set of patterns. The best set of patterns is selected from the sets of acceptable patterns by classifying on the training data with the set of features (adjectives and verbs) extracted from the patterns.
3.1 Training Data
In the training data, each comment is tagged with one of two labels, "<sub>" (subjectivity) or "<obj>" (objectivity), and the number of subjective comments is equal to the number of objective comments. For example:
<sub> Giá quá tốt (good price)
<sub> Xấu tệ hại (very ugly)
<sub> Tôi không thích thiết kế của con này lắm (I don't like the design of this machine)
<obj> Tôi đang tính mua em này thay cho em Qmobile M45 (I am considering buying this machine instead of the Qmobile M45)
<obj> Lại đổi giao diện (changed the interface again)
<obj> Có rồi nhưng màu bạc thì không có hàng (got this product but the silver one is out of stock)
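For illustration only (not the authors' code; the file name and helper are hypothetical), such a labeled file can be read into (label, comment) pairs as follows:

```python
def load_labeled_comments(path):
    """Read lines such as '<sub> Giá quá tốt' into (label, text) pairs."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            label, text = line.split(" ", 1)   # "<sub>" or "<obj>", then the comment
            examples.append((label.strip("<>"), text))
    return examples

# load_labeled_comments("training_comments.txt")
# -> [("sub", "Giá quá tốt"), ("obj", "Lại đổi giao diện"), ...]
```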
Because we work on Vietnamese text, the training data is spell-checked and segmented into words before being tagged with POS labels.
3.2 Templates Definition
We mainly use adjectives and verbs as the features, so the templates are created from them and their surrounding POS labels. The surrounding POS may be a noun (N), proper noun (Np), another adjective (A) or verb (V), adverb (R), coordinating conjunction (Cc), or auxiliary (T).
We propose two types of templates to learn:

• Type 1: The templates are built to extract patterns which contain only POS labels. We consider the verb or adjective together with its surrounding POS labels on the left side, the right side, or both sides.
• Type 2: Type 2 is similar to type 1, but it is more specific: the templates extract patterns which contain the words themselves (the adjective or verb) together with their surrounding POS labels.
We can use templates of type 1 or type 2. In this paper, we will show experimental results when applying both types of templates for extracting patterns. Table 1 and Table 2 illustrate examples of the templates of the two types.

For example, the sentence in Section 1 has one adjective; if we use the template in line 2 of Table 1 (tag-tag[-1]), we extract the pattern R-A. We are able to expand the templates shown in the two tables by increasing the number of surrounding POS labels.
Table 1. Templates of type 1

tag-tag[+1] | if the current tag is an adjective or verb, consider the template which contains the current tag and the next tag
tag-tag[-1] | if the current tag is an adjective or verb, consider the template which contains the current tag and the previous tag
tag-tag[-1] & tag[+1] | if the current tag is an adjective or verb, consider the template which contains the current tag, the previous tag, and the next tag (the tags on both sides of the current tag)
tag-tag[+2] | if the current tag is an adjective or verb, consider the template which contains the current tag and the next two tags
tag-tag[-2] | if the current tag is an adjective or verb, consider the template which contains the current tag and the previous two tags
Table 2. Templates of type 2

word-tag[+1] | if the current tag is an adjective or verb, consider the template which contains the current word and the next tag
word-tag[-1] | if the current tag is an adjective or verb, consider the template which contains the current word and the previous tag
word-tag[-1] & tag[+1] | if the current tag is an adjective or verb, consider the template which contains the current word, the previous tag, and the next tag (the tags on both sides of the current word)
word-tag[+2] | if the current tag is an adjective or verb, consider the template which contains the current word and the next two tags
word-tag[-2] | if the current tag is an adjective or verb, consider the template which contains the current word and the previous two tags
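To make the templates concrete, the sketch below (our illustration under the templates in Tables 1 and 2, not the paper's implementation; names are hypothetical) generates type-1 and type-2 pattern instances around each adjective or verb in a tagged sentence:

```python
# A sketch of extracting pattern instances from the templates in Tables 1 and 2.
# Type 1 keeps only POS tags; type 2 keeps the anchor word (adjective/verb)
# itself plus the surrounding tags. The offsets mirror tag[+1], tag[-1],
# tag[-1] & tag[+1], tag[+2], tag[-2].

OFFSETS = [(0, 1), (-1, 0), (-1, 1), (0, 2), (-2, 0)]   # (left, right) context sizes

def extract_patterns(tagged_sentence, pattern_type=1):
    patterns = []
    tags = [t for _, t in tagged_sentence]
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag not in ("A", "V"):                # anchor only on adjectives and verbs
            continue
        for left, right in OFFSETS:
            lo, hi = i + left, i + right
            if lo < 0 or hi >= len(tags):
                continue
            anchor = [tag] if pattern_type == 1 else [word]
            patterns.append("-".join(tags[lo:i] + anchor + tags[i + 1:hi + 1]))
    return patterns

tagged = [("nhìn", "V"), ("rất", "R"), ("đẹp", "A")]
print(extract_patterns(tagged, pattern_type=1))   # ['V-R', 'V-R-A', 'R-A', 'V-R-A']
```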
Similarly to previous works, n-grams (unigrams, bigrams) are used as features for classification. Word n-grams (obtained after segmenting the training data into words) are learned in the same way as the patterns. We also consider an n-gram to be equivalent to a phrase extracted from the patterns.
3.3 Extraction and Patterns Evaluation
The predefined templates (Section 3.2) are applied to the POS-tagged training data to extract all possible patterns. After that, we evaluate them to get the best set of patterns, which is the result of the learning process. The evaluation process contains the two steps mentioned in Section 3.
Acceptable Patterns Evaluation: We aim to find the patterns in the training data which characterize subjective expressions. That means we only consider the patterns which satisfy the following constraint:

• A pattern is believed to express subjectivity if and only if:

P(<sub> | pattern_i) > P(<obj> | pattern_i)

This constraint means that a pattern is acceptable if and only if it appears more frequently in subjective comments than in objective comments in the training data.
The formula below is proposed to get the sets of acceptable patterns:

P(<sub> | pattern_i) / (P(<sub> | pattern_i) + P(<obj> | pattern_i)) > threshold
In order to satisfy the constraint above, the threshold must be at least 0.5, since the formula uses a strict inequality; the range of the threshold is [0.5, 1.0) (0.5 ≤ threshold < 1.0). The threshold can be increased to get different sets of acceptable patterns. In other words, we can generate a new set by changing the threshold. By increasing the threshold, the newly generated set may contain fewer patterns than the old one, so the set of acceptable patterns is narrowed.
This evaluation step is considered the first filter in the learning process. It helps us remove a large number of patterns that are unlikely to express opinions. The best set of patterns is selected from the remaining patterns.
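A minimal sketch of this first filter, assuming pattern occurrences have already been extracted and paired with the label of the comment they came from (an illustration of the formula above, not the authors' code):

```python
from collections import Counter

def acceptable_patterns(labeled_patterns, threshold=0.7):
    """labeled_patterns: iterable of (label, pattern) pairs, label in {"sub", "obj"}.
    Keep a pattern when count_sub / (count_sub + count_obj) > threshold, i.e. the
    estimated P(<sub>|pattern) dominates P(<obj>|pattern)."""
    sub_counts, obj_counts = Counter(), Counter()
    for label, pattern in labeled_patterns:
        (sub_counts if label == "sub" else obj_counts)[pattern] += 1
    accepted = set()
    for pattern in set(sub_counts) | set(obj_counts):
        s, o = sub_counts[pattern], obj_counts[pattern]
        if s / (s + o) > threshold:
            accepted.add(pattern)
    return accepted
```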
Best Pattern Set Evaluation: This evaluation step aims to get the best set of patterns from the sets of acceptable patterns. We use the training data for the evaluation; in this case, the training data plays the role of development data. A set of acceptable patterns is selected as the best set if it obtains the highest accuracy when classifying on the training data.

We assume that the best set of patterns on the training data will also be the best set on other data. That means other data is assumed to be similar to the training data in sentence grammar, so the training data must cover most syntactic structures of subjective sentences. However, in practice, the set of patterns that performs best on the training data can be worse on other data because of differences in the distribution and grammar of sentences.
From each set of acceptable patterns, we extract phrases and then take the adjectives and verbs in these phrases as the features. These features are used to classify the training data, evaluated by 10-fold cross-validation. After that, we select the set with the highest accuracy.

We use the training data as the development data to evaluate the acceptable patterns in order to adapt to the domain and satisfy our assumption. This helps us build a set of patterns that is more flexible and diverse. The quality of the patterns depends on the training data.
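A rough sketch of this selection loop follows. The paper uses the liblinear SVM in Weka; here scikit-learn's LinearSVC is substituted purely for illustration, and the helper names are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def select_best_threshold(comments, labels, thresholds, features_for_threshold):
    """comments: word-segmented comment strings; labels: 'sub'/'obj'.
    features_for_threshold(t) -> adjectives/verbs extracted from the set of
    acceptable patterns at threshold t (a hypothetical helper)."""
    best_t, best_acc = None, -1.0
    for t in thresholds:
        vocab = sorted(set(features_for_threshold(t)))
        clf = make_pipeline(CountVectorizer(vocabulary=vocab), LinearSVC())
        acc = cross_val_score(clf, comments, labels, cv=10).mean()   # 10-fold CV
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```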
4 Experiments

Our experiments are conducted on technical product review data (reviews of mobile phones, laptops, tablets, cameras, and TVs). We collected data from Vietnamese technical forums such as tinhte.vn, voz.vn, and thegioididong.com using the Scrapy framework (http://scrapy.org). After that, we remove the non-diacritic comments, then correct spelling errors in the comments.
We manually labeled 9000 collected Vietnamese comments with the two kinds of labels "<sub>" (subjective) and "<obj>" (objective) (see Section 3.1). After that, we divided this annotated data into two parts. The first part contains 3000 subjective comments and 3000 objective comments and serves as the training data to learn the patterns. The remaining 3000 comments are used to test the quality of the learned patterns. The training data and testing data are segmented into words and tagged with POS.
We used classification tools in Weka (http://www.cs.waikato.ac.nz/ml/weka/) for evaluation in the learning process and for evaluating the quality of the learned patterns on the test data.
Learning Process: First, we learned word n-grams (unigrams, bigrams) from the training data with the threshold taking values in {0.5, 0.6, 0.7, 0.8, 0.9}. In order to reduce the number of bigrams, we only use the bigrams that appear at least twice in the training data to build the feature set. The results of this process are illustrated in Table 3. Unigrams and bigrams are used as features to classify the training data. We used the liblinear library in Weka, which implements SVM, for classification. We performed 10-fold cross-validation, then evaluated the classification performance.
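As a small illustration of the bigram frequency cutoff mentioned above (not the authors' code; names are hypothetical):

```python
from collections import Counter

def candidate_bigrams(segmented_comments, min_count=2):
    """Keep bigrams occurring at least `min_count` times in the word-segmented
    training comments before the subjectivity threshold filter is applied."""
    counts = Counter()
    for words in segmented_comments:               # each comment: a list of words
        counts.update(zip(words, words[1:]))
    return {bg for bg, c in counts.items() if c >= min_count}
```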
Table 3. Classification results of unigram and bigram
The results of this experiment are shown in Figure 2. From Figure 2, we can see that unigrams are better than bigrams for building the feature set. Moreover, we would like to investigate whether the combination of unigrams and bigrams can make a better feature set. We combined the best set of unigrams (threshold = 0.7) and the best set of bigrams (threshold = 0.9) into one feature set and obtained 85.03% accuracy. Although using bigrams without unigrams makes the system's performance decline, they can enrich the set of unigram features.
Second, we learned the POS patterns of the two types mentioned in Section 3.2. The predefined templates are applied to the training data to extract all patterns. We also use thresholds in {0.5, 0.6, 0.7, 0.8, 0.9} to generate the sets of acceptable patterns. The patterns in each set are applied to the training data to extract phrases, adjectives, and verbs as features for subjectivity classification. We also used liblinear in Weka to evaluate each set of patterns.
Figure 2. Classification results of unigram and bigram (%)
The results are shown in Table 4 (patterns from templates of type 1) and Table 5 (patterns from templates of type 2).
Table 4. Classification results of learning patterns of type 1

threshold | tag-tag[+1] | tag-tag[-1] | tag-tag[+2] | tag-tag[-1] & tag[+1] | tag-tag[-2]
In Table 4, the template tag-tag[+2] with threshold 0.5 is the best template. We extracted 72 patterns; some of them and their extracted phrases are illustrated in Table 6.
In Table 5, the template word-tag[-1] with threshold 0.9 is the best case. We obtained 2175 patterns; some of them with their extracted phrases are illustrated in Table 7. Note: A (adjective); V (verb); R (adverb); N (noun).
Testing Learned Patterns: We investigated the quality of the feature sets built from n-grams, and from the words or words and phrases extracted from the learned patterns. The results are shown in Table 8.

From Table 8, we see that:

• Based on the results in lines 1, 2, 3, 4, and 5, we can compare the quality of the features from POS patterns and n-grams. Compared with unigrams, bigrams, or the combination of unigrams and bigrams, the features from POS patterns of the two types are better, but not significantly so.
Table 5. Classification results of learning patterns of type 2

threshold | word-tag[+1] | word-tag[-1] | word-tag[+2] | word-tag[-1] & tag[+1] | word-tag[-2]
Table 6. Learned patterns of type 1

V-R-A | nhìn cũng bóng_bẩy; chụp cũng tốt; nhìn khá lạ
V-A-A | ốp chắc cực; nhìn đẹp thiệt; xài tốt hơn
A-R-A | thực_sự rất ấn tượng; đen rất menly; tiện_lợi lại sang_trọng
Table 7. Learned patterns of type 2

R yếu | quá yếu; không yếu; rất yếu
R hay | khá hay; rất hay; không hay
V xấu | trông xấu; thiết_kế xấu; nhìn xấu
V yếu_ớt | nhìn yếu_ớt; thấy yếu_ớt
N khỏe | cấu_hình khỏe; sóng khỏe; máy khỏe
Table 8. Classification results on testing data

Features | # of features | SVM | Naive Bayes
words and phrases of patterns (type 1) | 4472 | 82.56% | 77.46%
words and phrases of patterns (type 2) | 3062 | 82.68% | 68.27%
words (type 1) + unigram + bigram | 9847 | 83.47% | 78.22%
words (type 2) + unigram + bigram | 8421 | 84.03% | 76.72%