VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VU TIEN THANH
A FEATURE-BASED OPINION MINING
MODEL ON PRODUCT REVIEWS IN
VIETNAMESE
MASTER THESIS OF INFORMATION TECHNOLOGY
Hanoi - 2012
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VU TIEN THANH
A FEATURE-BASED OPINION MINING
MODEL ON PRODUCT REVIEWS IN
VIETNAMESE
Major: Computer Science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
Supervisor: Assoc. Prof. Ha Quang Thuy
Hanoi - 2012
2.1 Opinion Mining
2.1.2 The basic concepts in the opinion mining field
2.1.3 Opinion mining problems
3.1 Introduction
3.2.2 Token Segmenting and POS Tagging
3.3 Phase 2: Product Features and Opinion Words Extraction
Explicit Product Features Extraction
Opinion Word Extraction
Implicit Features Identification
4.4 The Whole System Evaluation
A FEATURE-BASED OPINION MINING MODEL ON PRODUCT REVIEWS IN VIETNAMESE

K16 Computer Science Master Course
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
tienthanh_dhcn@vnu.edu.vn

Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
thuyhq@vnu.edu.vn
Abstract
Feature-based opinion mining and summarizing (FOMS) of reviews is an interesting and attractive issue in the opinion mining field. With the development of e-commerce in Vietnam, there are more and more commercial sites and technical forums where people can review or express their opinions on the products they have used. As a result, the number of reviews for a hot product has been increasing rapidly to hundreds or even thousands in recent years. This makes it difficult not only for customers to read the reviews when deciding whether to buy a product, but also for producers to handle customers' opinions in order to improve their products. In this thesis, we describe a feature-based opinion mining and summarizing model on Vietnamese customer reviews. Experimental results on Vietnamese reviews in the mobile phone domain demonstrate the effectiveness of the model.
Keywords
feature-word; feature-based opinion mining system; opinion summarization; opinion-word; reviews; syntax rules; VietSentiWordnet dictionary
PUBLICATIONS
• Huyen-Trang Pham, Tien-Thanh Vu, Mai-Vu Tran and Quang-Thuy Ha. A Solution for Grouping Vietnamese Synonym Feature Words in Product Reviews. In Proceedings of the 6th International Conference on Asia-Pacific Services Computing (APSCC 2011).
• Quang-Thuy Ha, Tien-Thanh Vu, Huyen-Trang Pham and Cong-To Luu. An Upgrading Feature-based Opinion Mining Model on Vietnamese Product Reviews. In Proceedings of the 7th International Conference on Active Media Technology (AMT 2011), pp. 173-185.
• Tien-Thanh Vu, Huyen-Trang Pham, Cong-To Luu and Quang-Thuy Ha. A Feature-Based Opinion Mining Model on Product Reviews in Vietnamese. In Semantic Methods for Knowledge Management and Communication (SCI 381), pp. 23-33.
I. INTRODUCTION
Feature-based opinion mining and summarizing (FOMS) of product reviews is an interesting and attractive issue in the opinion mining field [1][2][3][4]. Much research has been done to improve FOMS systems [5][3][2].
In this thesis, we propose a feature-based opinion mining and summarizing model on Vietnamese customer reviews that overcomes some drawbacks of recent FOMS systems. Given an input set of customer reviews of products, our task is performed in four steps: (1) pre-processing the input customer reviews by standardizing reviews, segmenting tokens, and POS tagging; (2) extracting explicit product features and opinion words by using Vietnamese syntax rules, identifying implicit product features by using their relationships with opinion words, and automatically grouping synonym product features by combining the HAC clustering method with the semi-supervised SVM-kNN classification method; (3) identifying opinion sentences in each review and deciding whether each opinion sentence is positive, negative or neutral by using a VietSentiWordNet extended from an initial SentiWordNet 3.0; (4) summarizing the results.
The rest of this thesis is organized as follows. In the second chapter, we review the related literature. In the next chapter, the FOMS model with four steps is described. Experimental results and remarks are described in the fourth chapter. Conclusions are presented in the last chapter.
II. RELATED WORKS
A positive opinionated document on a particular object does not mean that the author has positive opinions on all features of the object, and vice versa. In a typical opinionated text, the author writes about both positive and negative features of the object, although the general sentiment on the object may be positive or negative. Document-level and sentence-level classification do not provide such information. Thus, feature-based opinion mining is needed to determine positive, negative or neutral opinions at the feature level. Feature-based opinion mining focuses on two main tasks [6]:
• Identify object features (product features). For example, in the sentence “The touch screen of this mobile phone is great”, the product feature is “touch screen”.
• Determine the orientation of opinions on features (positive, negative, or neutral). In the above sentence, the opinion on “touch screen” is positive.
A. Features Extraction
The approach applied in early feature-based opinion mining systems to identify features is based on association mining [7]. The main idea of this approach is that although different customers usually write different reviews about product features, the words they use to express a given feature are consistent. Thus, the approach uses association mining to find nouns/noun phrases (N/NP) that frequently occur in reviews and considers those N/NP as product features. A disadvantage of the association mining based approach is that it does not identify implicit features.
Other related works on feature extraction mainly use topic modeling and clustering to extract topics/features in customer reviews [8]. The main idea of these approaches is to cluster synonym features based on the context of reviews.
B. Opinion Orientation Identification
Opinion Words Extraction: The first approach applied to extract opinion words is based on syntactic or co-occurrence patterns together with a seed list of opinion words, which are used to find other opinion words in a large corpus [9]. The approach starts with a list of seed opinion adjectives, and uses them and a set of linguistic constraints such as “AND”, “OR”, “BUT”, etc. to identify additional adjective opinion words and their orientations (positive, negative, or neutral). For example, given the sentence “This car is beautiful and spacious,” if “beautiful” is known to be positive, it can be inferred that “spacious” is also positive.
Other approaches are based on dictionaries. One of the simple techniques in this family is bootstrapping using a small set of seed opinion words and an online dictionary, e.g., WordNet [7][10]. The approach first collects a small set of opinion words with known orientations manually, and then grows this set by searching WordNet for their synonyms and antonyms. The newly found words are added to the seed list and the next iteration starts. The iterative process stops when no more new words are found.
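The dictionary-based bootstrapping described above can be illustrated with a minimal sketch. It assumes a hypothetical toy lexicon (`TOY_LEXICON`) in place of WordNet, and assumes that synonyms inherit the orientation of the word that led to them while antonyms receive the opposite one; the function name `expand_seeds` is illustrative and does not come from the cited systems.

```python
from collections import deque

# Hypothetical toy lexicon standing in for WordNet: word -> (synonyms, antonyms).
TOY_LEXICON = {
    "good": ({"great", "fine"}, {"bad"}),
    "great": ({"good", "superb"}, set()),
    "bad": ({"poor"}, {"good"}),
    "poor": ({"bad"}, set()),
}


def expand_seeds(seeds, lexicon):
    """Grow a seed list {word: +1 or -1} through synonyms and antonyms."""
    orientation = dict(seeds)
    queue = deque(seeds)
    while queue:  # the loop ends when no new words are found
        word = queue.popleft()
        synonyms, antonyms = lexicon.get(word, (set(), set()))
        related = [(s, 1) for s in sorted(synonyms)]
        related += [(a, -1) for a in sorted(antonyms)]
        for other, sign in related:
            if other not in orientation:
                # Synonyms keep the orientation, antonyms flip it.
                orientation[other] = sign * orientation[word]
                queue.append(other)
    return orientation


expanded = expand_seeds({"good": 1}, TOY_LEXICON)
```

Starting from the single seed "good", the sketch labels "great", "fine" and "superb" as positive and "bad" and "poor" as negative.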
Aggregating opinions: This step applies an opinion aggregation function to the resulting opinion scores to determine the final orientation of the opinion on each object feature in the sentence. Let the sentence be $s$, which contains a set of object features $f_1, \ldots, f_m$ and a set of opinion words or phrases $op_1, \ldots, op_n$ with their opinion scores obtained in previous steps. The opinion orientation on each feature $f_i$ in $s$ is determined by the opinion aggregation function (different systems use different functions). [6] defines the function as follows:
$$\mathrm{score}(f_i, s) = \sum_{op_j \in s} \frac{op_j.so}{d(op_j, f_i)}$$
where $op_j$ is an opinion word in $s$, $d(op_j, f_i)$ is the distance between feature $f_i$ and opinion word $op_j$ in $s$, and $op_j.so$ is the orientation or the opinion score of $op_j$.
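As a minimal sketch of this aggregation function, the code below assumes that the distance is measured as the difference between token positions in the sentence and that an opinion word at distance zero is skipped; both choices, and the name `feature_score`, are assumptions made for illustration.

```python
def feature_score(feature_pos, opinions):
    """Sum op.so / d(op, f) over the opinion words of one sentence.

    feature_pos: token position of the feature in the sentence.
    opinions: list of (token position, orientation score) pairs.
    """
    total = 0.0
    for pos, so in opinions:
        distance = abs(pos - feature_pos)
        if distance == 0:
            continue  # assumption: ignore an opinion word on the feature itself
        total += so / distance
    return total


# Feature at position 1; a positive word 5 tokens away, a negative one 9 tokens away.
score = feature_score(1, [(6, 1.0), (10, -1.0)])
```

Opinion words closer to the feature contribute more, so in this example the nearer positive word outweighs the farther negative one and the score is positive.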
III. OUR FEATURE-BASED OPINION MINING MODEL
A. Introduction
Figure 1 describes the proposed model for feature-based opinion mining and summarizing on Vietnamese product reviews. The system performs the following four phases: (1) pre-processing; (2) extracting explicit/implicit product features and opinion words, and grouping synonym product features; (3) identifying the orientation of opinions; (4) summarizing the results. Each phase is implemented by several modules.
[Figure 1: The proposed model. Recoverable labels: Vietnamese customer reviews (input), VietSentiWordNet, Phase 4: Results Summarization.]
B. Phase 1: Pre-processing
1) Data Standardizing: Customers often use a combination of standard spelling, apparently accidental mistakes, slang, sentence fragments, “typographic slang” and interjections in their reviews [11]. Therefore, we adopted a Vietnamese accent restoration system that combines an N-gram statistical model and a Hidden Markov Model (HMM) to convert a sentence without accents into an accented Vietnamese sentence. For example, “Chiec camera nay that tien loi” is converted into “Chiếc camera này thật tiện lợi” (This camera is convenient).
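The idea of restoring accents with an N-gram model decoded in HMM fashion can be illustrated with a simplified sketch. It is not the system used in the thesis: it assumes a bigram model only, a hypothetical candidate table (`CANDIDATES`), invented bigram probabilities (`BIGRAM`) and a floor probability for unseen pairs, and it treats each accented candidate as a hidden state that deterministically emits its unaccented form.

```python
import math

# Hypothetical toy data: accented candidates for each unaccented syllable.
CANDIDATES = {
    "chiec": ["chiếc"],
    "camera": ["camera"],
    "nay": ["này", "nay", "nấy"],
    "that": ["thật", "thất"],
    "tien": ["tiện", "tiền", "tiên"],
    "loi": ["lợi", "lỗi", "lời"],
}

# Hypothetical bigram probabilities P(current | previous); unseen pairs get FLOOR.
BIGRAM = {
    ("camera", "này"): 0.6,
    ("này", "thật"): 0.5,
    ("thật", "tiện"): 0.4,
    ("tiện", "lợi"): 0.7,
}
FLOOR = 0.01


def restore_accents(sentence):
    """Viterbi search for the most probable accented sequence (bigram model)."""
    tokens = sentence.lower().split()
    if not tokens:
        return ""
    # best[candidate] = (log-probability of the best path ending here, that path)
    best = {c: (0.0, [c]) for c in CANDIDATES.get(tokens[0], [tokens[0]])}
    for token in tokens[1:]:
        new_best = {}
        for cand in CANDIDATES.get(token, [token]):
            score, path = max(
                (prev_score + math.log(BIGRAM.get((prev, cand), FLOOR)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return " ".join(max(best.values())[1])


restored = restore_accents("Chiec camera nay that tien loi")
```

With the toy probabilities above, the decoder selects “chiếc camera này thật tiện lợi” (lowercased, since the sketch does not restore capitalization). A real system would estimate the probabilities from a large accented corpus.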
2) Token Segmenting and POS Tagging: Because product features are often nouns or noun phrases composed of several words, the reviews need to be segmented and tagged. To achieve this, we use a Vietnamese word segmentation tool [12]. For example, given the review sentence “Các tính năng nói chung là tốt” (Features are generally good), after token segmenting and POS tagging we obtain the following result: “Các/Nn | tính năng (features)/Na | nói chung (generally)/X | là (are)/Cc | tốt (good)/Aa”. All the segmented and tagged sentences are then stored in the database along with the POS tag information.
C. Phase 2: Product Features and Opinion Words Extraction
This phase extracts product features and opinion words from Vietnamese customer reviews. In this phase, we consider product features to be nouns or noun phrases, and opinion words to be not only adjectives as in [7] but also verbs, because Vietnamese verbs sometimes express opinions as well. For example, in the sentence “Tôi thích màu sắc chiếc điện thoại này” (I love the color of this phone), “màu sắc” (color, a noun phrase) is a product feature and “thích” (love, a verb) is an opinion word.
Therefore, we combine Vietnamese syntax rules with the feature extraction method proposed by [2] to obtain Vietnamese product features. In addition, we address some drawbacks of FOMS systems: identifying co-references in subsection III-C2, extracting implicit features from opinion words in subsection III-C3, and grouping synonym product features in subsection III-C4.
1) Explicit Product Features Extraction: Explicit product features are expressed directly in the sentences of customer reviews. For example, in “Màn hình cảm ứng của chiếc Iphone 4 này rất tuyệt” (The touch screen of the Iphone 4 is great), “touch screen” is an explicit product feature. This module extracts product features based on three syntax rules: the part-whole relation, “No” patterns, and the double propagation rule.
2) Opinion Word Extraction: This module extracts not only the adjectives and verbs nearest to an identified product feature, but also sentiment strength words (gradable words) such as “rất” (very) and negation words such as “không” (not) in the sentence. If adjectives are connected to each other by commas, semicolons or conjunctions, we extract all of these adjectives and consider them as opinion words.
3) Implicit Features Identification: Implicit features are product features that do not appear directly in a sentence but are implied by opinion words in the sentence. For example, in “Điện thoại này đắt quá” (This phone is too expensive), the opinion word “đắt” (expensive) refers to the product price, which is not expressed directly in the sentence. For the domain of “mobile phone”, we construct a mapping dictionary to identify implicit features by mapping them to the corresponding opinion words.
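A mapping dictionary of this kind can be sketched as a plain lookup table. Only the pair “đắt” (expensive) and price comes from the example above; the other entries, the feature labels and the function name `resolve_feature` are hypothetical.

```python
# Hypothetical mapping dictionary for the "mobile phone" domain:
# opinion word -> implicit product feature.
IMPLICIT_FEATURES = {
    "đắt": "giá",           # expensive -> price (from the example above)
    "rẻ": "giá",            # cheap -> price (hypothetical entry)
    "nặng": "trọng lượng",  # heavy -> weight (hypothetical entry)
}


def resolve_feature(explicit_feature, opinion_word):
    """Return the explicit feature if present, else look up an implicit one."""
    if explicit_feature is not None:
        return explicit_feature
    # None is returned when the opinion word implies no known feature.
    return IMPLICIT_FEATURES.get(opinion_word)


implicit = resolve_feature(None, "đắt")
```

The lookup is only consulted when no explicit feature was found in the sentence, so explicit features always take precedence in this sketch.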
4) Grouping Synonym Features: We use two concepts from [1]. Firstly, a feature expression of a feature is a word or phrase that actually appears in a review to indicate the feature. Secondly, a feature group (or feature for short) is the name of a feature (given by the user). For example, a feature group could be named “Chất lượng ảnh” (picture quality), but there are many possible expressions indicating the feature, e.g., “ảnh” (picture), “hình ảnh” (image), and even “Chất lượng ảnh” (picture quality) itself. All the feature expressions in a feature group signify the same feature.
Customers can refer to the same product feature with many different words and phrases; for example, both “mẫu mã” (type) and “kiểu dáng” (style) belong to the “hình thức” (appearance) group. To make the summarization phase more useful, the words or phrases that express the same feature need to be grouped into a synonym feature group [1]. Our grouping method is based on SVM-kNN semi-supervised learning [13][1][14], with the HAC clustering method generating the training set for SVM-kNN. Therefore, the method requires no manually labeled data and is fully automatic.
5) Frequent Features Identification: This step determines the frequent features in reviews and removes redundant features. To do this, we compute the frequency with which each feature appears in customer reviews. If the frequency is greater than a given threshold, the feature is a frequent feature. Otherwise, the feature is considered redundant and is eliminated.
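The thresholding step can be sketched as follows, assuming that the frequency of a feature is the number of reviews that mention it at least once (the text does not specify whether repeated mentions within one review count separately) and using a hypothetical function name `frequent_features`.

```python
from collections import Counter


def frequent_features(review_features, threshold):
    """Keep features mentioned in more than `threshold` reviews.

    review_features: one list of extracted feature names per review.
    """
    # set() ensures a feature is counted at most once per review (assumption).
    counts = Counter(f for features in review_features for f in set(features))
    return {f: n for f, n in counts.items() if n > threshold}


reviews = [
    ["screen", "battery", "screen"],
    ["screen", "price"],
    ["battery", "screen"],
    ["strap"],
]
kept = frequent_features(reviews, threshold=1)
```

Here "screen" (3 reviews) and "battery" (2 reviews) are kept, while "price" and "strap" fall at or below the threshold and are dropped.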
D. Phase 3: Determining the opinion orientation
The opinion orientation of each customer on each product feature is determined in this phase in two steps. Firstly, the opinion weight of the customer on each feature about which the customer expresses an opinion is determined. Secondly, the opinion orientation on the feature is determined by classifying it into one of three classes: positive, negative or neutral.
• In the first step, an initial VietSentiWordNet, a Vietnamese sentiment dictionary, was constructed by extending SentiWordNet 3.0. The customer's opinion weights on product features are then calculated from it.
The initial VietSentiWordNet, which has 977 sentiment synsets and 1179 sentiment words, was extended by using a semi-supervised learning method [15][16]. After normalization of all opinion words, the extended VietSentiWordNet has 9333 synsets and 9533 words.
Let $ts$ denote the opinion weight of the feature in a customer's review, $ts_i$ the weight of the $i$-th opinion word on the feature in the review (denoted by $word_i$), and $w_i$ the opinion weight of $word_i$ obtained from the VietSentiWordNet dictionary by subtracting the negative score of $word_i$ from its positive score. Then $ts$ is determined as $ts = \sum_{i=1}^{m} ts_i$, where $m$ is the number of opinion words on the feature in the review. If there is a negation word such as “không” (not), the value of $ts_i$ is reversed (that is, $ts_i$ is replaced by $-1 \times ts_i$). In other cases, $ts_i$ equals $w_i$ if there is no gradable word such as “rất” (very), and $ts_i$ is determined as $h \times w_i$ if there is a gradable word with weight $h$.
• In the second step, the opinion orientation on the feature is classified into one of three classes, positive, negative or neutral, based on the weight $ts$:
- if $ts > +0.2$, the opinion is positive
- if $-0.2 < ts < +0.2$, the opinion is neutral
- if $ts < -0.2$, the opinion is negative
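The two steps of this phase can be sketched as below. The sketch makes several assumptions that the text does not settle: a negation is applied after any gradable weight, values exactly equal to ±0.2 are treated as neutral, and the weight h = 1.5 and the dictionary scores are invented for illustration.

```python
def opinion_word_score(pos_score, neg_score, gradable_weight=None, negated=False):
    """Compute ts_i for one opinion word from its dictionary scores."""
    w = pos_score - neg_score  # w_i: positive minus negative score
    ts_i = w if gradable_weight is None else gradable_weight * w
    return -ts_i if negated else ts_i


def classify(ts, threshold=0.2):
    """Map the summed weight ts to an orientation class."""
    if ts > threshold:
        return "positive"
    if ts < -threshold:
        return "negative"
    return "neutral"  # assumption: boundary values count as neutral


# "rất tốt" (very good): hypothetical scores 0.5 / 0.125 and h = 1.5.
very_good = opinion_word_score(0.5, 0.125, gradable_weight=1.5)
# "không tốt" (not good): the same word under negation.
not_good = opinion_word_score(0.5, 0.125, negated=True)
# ts for a review that contains both opinion words on the same feature.
ts = sum([very_good, not_good])
```

With these numbers the two opinion words score 0.5625 and -0.375, so a review containing both yields ts = 0.1875, which falls in the neutral band.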
E. Phase 4: Summarization
The summary is produced by counting the customers' opinion orientations on every product feature. The result is shown as a table, as in Figure 2.
[Figure 2: Summary table of opinion counts per product feature. Recoverable column labels: Positive, Negative.]
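The counting behind such a summary table can be sketched in a few lines, assuming that Phase 3 yields one (feature, orientation) pair per opinion; the function name `summarize` and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize(opinions):
    """Count orientations per feature from (feature, orientation) pairs."""
    table = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for feature, orientation in opinions:
        table[feature][orientation] += 1
    return dict(table)


summary = summarize([
    ("touch screen", "positive"),
    ("touch screen", "positive"),
    ("touch screen", "negative"),
    ("price", "negative"),
])
```

Each row of the resulting dictionary corresponds to one product feature and can be rendered directly as a table row or a column chart.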
IV. EXPERIMENTS
[...] Vietnamese FOMS system on “mobile phone” product reviews. In this chapter, we describe the results of two experiments: product feature extraction and whole-system evaluation. After the two experiments, we carry out the summarization task and show the summarized results in column charts.
A. Environment and Experimental Data
• Programming tool: Java Eclipse SDK
2) Experimental Data: We crawled 743 customer reviews on ten popular “mobile phone” products from the website http://www.thegioididong.com. Table I shows the number of crawled and standardized reviews for each product.
TABLE I: TOTAL OF CRAWLED REVIEWS
Product names | Number of comments
Subsequently, we evaluate the results of the feature extraction phase, which uses Vietnamese syntax rules. Table II illustrates the effectiveness of the feature extraction. For each product, we read all of its reviews and manually listed all product features mentioned in them. Then we counted the correct features returned by the system. The precision, recall and F1 are given in columns 2, 3 and 4 respectively. It can be seen that the results of the frequent feature extraction step are good, with all values of F1 above 80%.
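For reference, the three measures can be computed from the set of features returned by the system and the manually listed set, as in the following sketch; the function name and the sample sets are hypothetical and are not the thesis data.

```python
def precision_recall_f1(extracted, gold):
    """Compare system-extracted features against a manually listed gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(
    extracted={"screen", "battery", "price", "strap"},
    gold={"screen", "battery", "price", "camera", "keyboard"},
)
```

In this toy case three of the four extracted features are correct (precision 0.75) and three of the five gold features are found (recall 0.6).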
PRECISION, RECALL AND F1 OF FEATURE-BASED OPINION MINING MODEL ON VIETNAMESE MOBILE PHONE REVIEWS
Product names | Precision (%) | Recall (%) | F1 (%)
LG GS290 Cookie Fresh | 7.12 Tĩ.T8 Tras
C. The Whole System Evaluation
For each feature extracted in the previous experiment, the system first extracts opinion words from the reviews that mention this feature among the 743 crawled reviews. Secondly, the system calculates the opinion weights of the opinion words. Finally, we obtain the positive, negative and neutral comments for all features of each product. According to Table III, the precision and recall of our system are quite satisfactory, with both values at approximately 69%. For the summarization task, Figure 3 shows a summary of the customer reviews on each feature of the product LG Wink Touch T300.
V. CONCLUSION
In this thesis, we presented, in chapter III, an approach to building an opinion mining system for customer reviews according to product features, based on Vietnamese syntax rules and the VietSentiWordNet dictionary, with three main contributions as follows:
• Firstly, in phase 1, we built a Vietnamese accent restoration system that combines an N-gram statistical model and a Hidden Markov Model (HMM) to convert a sentence without accents into an accented Vietnamese sentence.
• Secondly, in phase 2, we proposed a method that uses SVM-kNN semi-supervised learning, with the HAC clustering method generating the training set for SVM-kNN, to group synonym features; after that, co-reference was resolved by using some Vietnamese rules.