HSUM-HC: Integrating Bert-based hidden aggregation to hierarchical classifier for Vietnamese aspect-based sentiment analysis
Tri Cong-Toan Tran
Ho Chi Minh City University of Technology
Vietnam National University Ho Chi Minh City
Ho Chi Minh, Viet Nam
tri.tran.1713657@hcmut.edu.vn

Thien Phu Nguyen
Ho Chi Minh City University of Technology
Vietnam National University Ho Chi Minh City
Ho Chi Minh, Viet Nam
thien.nguyen.phu@hcmut.edu.vn

Thanh-Van Le*
Ho Chi Minh City University of Technology
Vietnam National University Ho Chi Minh City
Ho Chi Minh, Viet Nam
ltvan@hcmut.edu.vn

*Corresponding Author
Abstract—Aspect-Based Sentiment Analysis (ABSA), which aims to identify sentiment polarity towards specific aspects in customers' comments or reviews, has been an attractive topic of research in social listening. In this paper, we construct a specialized model utilizing PhoBert's top-level hidden layers integrated into a hierarchical classifier, taking advantage of these components to propose an effective classification method for the ABSA task. We evaluated our model's performance on two public datasets in Vietnamese, and the results show that our implementation outperforms previous models on both datasets.
Index Terms—aspect-based sentiment analysis, PhoBert, BERT, hidden layer aggregation, hierarchical classifier, Vietnamese corpus
I. INTRODUCTION
The fast growth of e-commerce, particularly the B2C (business-to-customer) model, has resulted in a rise in online purchasing habits. It makes day-to-day transactions extremely simple for the general public, and it has ultimately become one of the most popular forms of purchasing, especially during a global pandemic like COVID-19. Due to the sheer development of social media platforms, customers are encouraged to provide reviews and comments expressing their positive or negative sentiments about the products or services that they experienced. Analyzing such a huge amount of data for mining public opinion is a time-consuming and labor-intensive operation. As a result, building an automatic sentiment analysis system can help consumers exploit the quality judgments of others about products of interest. Moreover, such a system supports businesses in managing their reputation, understanding business requirements so that they adapt well to customers' needs, and avoiding marketing disasters. For this reason, sentiment analysis has become one of the most attractive fields of study in machine learning among academic and business researchers in recent years.
There has been interesting prior research on sentiment analysis for Vietnamese text using the VLSP 2016 datasets1. However, plain sentiment analysis no longer provides enough information, since it assumes that an entire review has only one topic and one sentiment, whereas a product can have both pros and cons across many aspects. The challenge of Aspect-based sentiment analysis (ABSA) is not only detecting the aspects in a review but also the sentiment attached to each aspect. A review can consist of dozens or hundreds of words covering multiple aspects with a different sentiment for each, and determining which sentiment words go with which aspect can be very difficult. With ABSA, reviews about a product can be analyzed in detail, showing the reviewer's opinion on each aspect of that product.

1 https://vlsp.org.vn/vlsp2016/eval/sa
The main problem of ABSA is as follows: given a customer review about a domain (e.g. hotel or restaurant), the goal is to identify the set of (Aspect, Polarity) pairs that fit the opinions mentioned in the review. Each aspect is a pair of an entity and an attribute, and polarity is one of negative, neutral, or positive sentiment. For each domain, all possible combinations of entities and attributes are predefined. The ABSA task is divided into two phases: (i) identify pairs of entity and attribute, and (ii) analyze the sentiment polarity towards the corresponding aspect (entity#attribute) identified in the previous phase.
For example, the review "Nơi đây có quang cảnh tuyệt đẹp, đồ ăn cũng ngon nhưng phục vụ hơi tệ" (This place has an amazing view, the food is great too but the service is bad) will output (Entity#Attribute: Polarity) pairs as follows: (Hotel#Design&Features: Positive), (Food&Drinks#Quality: Positive), (Service#General: Negative).
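For illustration, a single labeled review can be represented as a review string paired with a list of (entity#attribute, polarity) tuples. The minimal Python sketch below mirrors the example above; the dictionary layout and field names are illustrative, not a format prescribed by either dataset.

```python
# Minimal sketch of one ABSA example; the field names are illustrative.
review = "Nơi đây có quang cảnh tuyệt đẹp, đồ ăn cũng ngon nhưng phục vụ hơi tệ"

example = {
    "review": review,
    "labels": [                      # (entity#attribute, polarity) pairs
        ("Hotel#Design&Features", "Positive"),
        ("Food&Drinks#Quality", "Positive"),
        ("Service#General", "Negative"),
    ],
}
```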
In this paper, we propose a method that uses multiple top-level hidden layers of Bert for classification, combined with an intuitive hierarchical classifier for the ABSA task. Our results demonstrate that a large model with many hidden layers contains useful information which can be exploited to obtain better results. We achieved the highest scores when applying our method to two Vietnamese ABSA datasets, VLSP2 and the UIT ABSA dataset [1].

2 https://vlsp.org.vn/vlsp2018/eval/sa
II. RELATED WORK
In recent years, Sentiment Analysis has taken off and has developed strongly thanks to advanced research in social listening. Many corpora and tasks have been developed, such as SemEval 2015 (Task 12) [2] and 2016 (Task 5) [3] for various languages, including English, Chinese, etc. The first public Vietnamese benchmark datasets were released by the VLSP (Vietnamese Language and Speech Processing) community in 2018. The organizers built two benchmark document-level corpora with 4,751 and 5,600 reviews for the restaurant and hotel domains, respectively.
Several interesting methods have been proposed to handle these tasks. The earliest works relied heavily on feature engineering (Wagner et al. [4]; Kiritchenko et al. [5]), which made use of a combination of n-grams and sentiment lexicon features to solve various ABSA tasks in SemEval 2014. Nguyen and Shirai [6], Wang et al. [7], and Tang et al. [8] were able to achieve higher accuracy by improving on neural networks with hierarchical structure, integrating dependency relations and phrases [6], an attention module [7], or a target-dependent mechanism [8]. Ma et al. [9] incorporated useful commonsense knowledge into a deep neural network to further enhance the model.
Recently, pre-trained language models trained over large text corpora, such as ELMo (Peters et al. [10]), OpenAI GPT (Radford et al. [11]), and especially BERT (Devlin et al. [12]), have shown their effectiveness in alleviating the effort of feature engineering. Chi Sun et al. [13] proposed four methods of converting the ABSA task, such as question answering (QA) and natural language inference (NLI), into a sentence-pair classification task by constructing auxiliary sentences and fine-tuning a BERT model to solve the task. The sentence pair is created by concatenating the original sentence with an auxiliary sentence generated from the target-aspect pair by one of several methods. Karimi et al. [14] proposed two modules called Parallel Aggregation and Hierarchical Aggregation, utilizing the hidden layers of the BERT language model to produce deeper semantic representations of input sequences. A prediction and its loss are computed on each of the selected modules, and these losses are then aggregated to produce the final loss of the model. They used Conditional Random Fields (CRFs) for the sequence labeling task, which yielded better results. In addition, their experiments also show that training BERT for a large number of epochs does not cause the model to overfit.
For a low-resource language such as Vietnamese, there have been relatively few studies of aspect-based sentiment analysis over the years, but progress has been steady. Oanh et al. [15] proposed a BERT-based hierarchical model which integrates the context information of the entity layer into the prediction of the aspect layer, optimizing a global loss function to capture the information from all layers. Their model consists of two main components: a Bert component that encodes the context information of the review into a representation vector, and a hierarchical model that takes this representation vector as input to generate multiple outputs (entity, aspect, polarity) corresponding to each layer. Thin et al. [16] investigated the performance of various monolingual pre-trained language models compared with multilingual models on the Vietnamese aspect category detection problem. This research showed the effectiveness of PhoBert compared to several models, including XLM-R [17], mBERT [12], and another BERT variant for Vietnamese.
III. PROPOSED MODEL
In this section we introduce HSUM-HC, our ABSA approach that inherits the benefits of PhoBert together with hidden layer aggregation and hierarchical classifiers for Vietnamese text (Fig. 1). By analyzing the characteristics of each component, we believe this combination yields a model well suited to the ABSA task. PhoBert is a monolingual pre-trained model built specifically for the Vietnamese language. Input sequences are tokenized and fed into the PhoBert model; we then take the top n hidden layers as meaningful context input for the hierarchical aggregation layer. The output of that layer is finally fed into a hierarchical classifier for predicting the set of aspects and sentiment polarities.
1) Bert Model: Many multilingual pre-trained Bert models support Vietnamese, but as pointed out by [18], these models have two main problems: little Vietnamese pre-training data and no handling of compound words. PhoBert was made to address these problems and is the first monolingual Bert model pre-trained for Vietnamese. PhoBert's pre-training approach is based on RoBerta [19], which improves Bert's training procedure for better performance. The pre-training was done with 20GB of monolingual text (Vietnamese Wikipedia and a Vietnamese news corpus3) and employs a word segmenter, VnCoreNLP4, to mark compound words (e.g. khách_sạn, thức_uống). We use PhoBert as the pre-trained model in our research because we aim to process Vietnamese text for the ABSA task. For fine-tuning, we follow the steps taken when pre-training the model: we use VnCoreNLP for word segmentation and PhoBert's tokenizer to split sequences into tokens and map tokens to their indices, adding the [CLS] token at the start and the [SEP] token at the end of each sequence. This tokenizer also produces the attention masks and pads sequences to ensure equal length. The resulting lists of token ids and attention masks are then input into the Bert model.

3 https://github.com/binhvq/news-corpus
4 https://github.com/vncorenlp/VnCoreNLP

Figure 1: Our HSUM-HC model for the ABSA task.
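As a rough illustration of this preprocessing step, the sketch below segments a review with VnCoreNLP and encodes it with PhoBert's tokenizer. It is an outline under assumed package versions (the vncorenlp Python wrapper and Hugging Face transformers), not our exact code; the jar path, model name, and maximum sequence length are placeholders.

```python
# Sketch of the preprocessing described above (paths and lengths are illustrative).
from vncorenlp import VnCoreNLP
from transformers import AutoTokenizer

# Word segmentation so compound words become single tokens, e.g. "khách sạn" -> "khách_sạn".
segmenter = VnCoreNLP("VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size="-Xmx2g")
review = "Khách sạn này có thức uống rất ngon"
segmented = " ".join(tok for sent in segmenter.tokenize(review) for tok in sent)

# PhoBert's tokenizer adds its start/end special tokens (playing the role of
# [CLS]/[SEP]), builds the attention mask, and pads sequences to equal length.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-large")
encoded = tokenizer(segmented, padding="max_length", truncation=True,
                    max_length=256, return_tensors="pt")
input_ids, attention_mask = encoded["input_ids"], encoded["attention_mask"]
```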
2) Hidden layer aggregation with hierarchical classifiers: A Bert-based model with a hierarchical classifier was introduced by Oanh et al. [15] to deal with ABSA. Its architecture is based on how a human would manually annotate the same task. It carries out classification in three layers: Entity, Aspect, and Sentiment. The process is to first label the entity (e.g. Hotel, Room, ...), then identify the entity's attribute (e.g. Design, Comfort, ...) to form an aspect, and lastly analyze the sentiment for that aspect in the review. Every layer contributes its output as context to the next layer, as in the sketch below. With this architecture, we can solve ABSA with a single end-to-end model, without the need for multiple separate classifiers.
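The sketch below shows one way such a three-level classifier can be wired up, with each level's logits concatenated to the shared representation before the next level. The layer sizes and the use of simple linear heads are our assumptions for illustration, not the exact architecture of [15].

```python
# Minimal sketch of a three-level hierarchical classifier: entity -> aspect -> sentiment,
# where each level's output is passed as extra context to the next level.
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    def __init__(self, hidden_size, n_entities, n_aspects, n_polarities):
        super().__init__()
        self.entity_head = nn.Linear(hidden_size, n_entities)
        self.aspect_head = nn.Linear(hidden_size + n_entities, n_aspects)
        self.polarity_head = nn.Linear(hidden_size + n_entities + n_aspects, n_polarities)

    def forward(self, pooled):  # pooled: (batch, hidden_size) sentence representation
        entity_logits = self.entity_head(pooled)
        aspect_logits = self.aspect_head(torch.cat([pooled, entity_logits], dim=-1))
        polarity_logits = self.polarity_head(
            torch.cat([pooled, entity_logits, aspect_logits], dim=-1))
        return entity_logits, aspect_logits, polarity_logits
```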
In the original Bert-with-hierarchical-classifier implementation from Oanh et al. [15], we observe that some improvements can be made to achieve better performance on this task. Firstly, they used a multilingual Bert model and further trained it on Vietnamese to create a pre-trained model accustomed to Vietnamese. However, it is still not specialized, since without a word segmenter Vietnamese compound words are not handled properly. We experimented with the model architecture in their paper and found that we could improve the result by around 3% by using PhoBert as the pre-trained model and VnCoreNLP for word segmentation. Secondly, in their implementation only the last hidden layer is used to make the prediction; this means that the top layer is considered most important and the information in the previous hidden layers is not utilized. [20] showed that all hidden layers of BERT contain information, with the higher-level layers carrying valuable semantic information. Thus, we can enhance the Bert-based model by using these layers. For that reason, we implement the hierarchical hidden-layer aggregation architecture of [14], which adds a BERT layer on top of each selected hidden layer. Its output is aggregated with the previous hidden layer and then goes through the hierarchical classifier, and the total loss is the sum of every classifier's losses, as sketched below.
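A compact sketch of this aggregation is given below, assuming the encoder is run with output_hidden_states=True and that each aggregated layer is classified by a hierarchical head such as the HierarchicalClassifier sketched earlier. The choice of BertLayer blocks and of the first-token representation for classification are assumptions made to illustrate the idea from [14], not their verbatim code.

```python
# Sketch of hierarchical hidden-layer aggregation (HSUM): the top n encoder layers
# are summed into a running aggregate, re-encoded by an extra BertLayer, and each
# aggregated representation is classified.
import torch.nn as nn
from transformers.models.bert.modeling_bert import BertLayer

class HSUM(nn.Module):
    def __init__(self, config, n_layers, make_classifier):
        super().__init__()
        self.layers = nn.ModuleList([BertLayer(config) for _ in range(n_layers)])
        self.classifiers = nn.ModuleList([make_classifier() for _ in range(n_layers)])

    def forward(self, hidden_states, extended_attention_mask):
        # hidden_states: tuple of per-layer encoder outputs, each (batch, seq_len, hidden).
        aggregated, outputs = 0, []
        for i, (layer, clf) in enumerate(zip(self.layers, self.classifiers)):
            aggregated = aggregated + hidden_states[-(i + 1)]         # add next-deepest layer
            aggregated = layer(aggregated, extended_attention_mask)[0]
            outputs.append(clf(aggregated[:, 0]))                     # classify first-token state
        return outputs  # one (entity, aspect, polarity) logits triple per classifier
```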
The Binary Cross-Entropy loss $L_i$ for each layer $i$ of the classifier is calculated as follows:

$$L_i = -\sum_{c=1}^{C} \left[ y_c \cdot \log(\sigma(\hat{y}_c)) + (1 - y_c) \cdot \log(1 - \sigma(\hat{y}_c)) \right] \quad (1)$$

with $C$ being the number of classes for that layer.
The loss for each classifier is the sum of the three prediction layers' losses calculated above:

$$\text{classifier\_loss} = L_1 + L_2 + L_3 \quad (2)$$

The total loss is the sum of all classifiers' losses, with $H$ being the number of classifiers:
$$\text{total\_loss} = \sum_{h=1}^{H} \text{classifier\_loss}_h \quad (3)$$
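One way to realize Eqs. (1)-(3) is sketched below, using BCEWithLogitsLoss (which applies the sigmoid of Eq. (1) internally) over multi-hot targets for each level. The function and argument names are ours for illustration, not the exact training code.

```python
# Sketch of the training loss: BCE per level (Eq. 1), summed over the three levels
# of each classifier (Eq. 2) and over all H classifiers (Eq. 3).
import torch.nn as nn

bce = nn.BCEWithLogitsLoss(reduction="sum")

def total_loss(classifier_outputs, entity_y, aspect_y, polarity_y):
    # classifier_outputs: list of (entity_logits, aspect_logits, polarity_logits),
    # one triple per classifier; targets are multi-hot float tensors.
    loss = 0.0
    for entity_logits, aspect_logits, polarity_logits in classifier_outputs:
        loss = loss + bce(entity_logits, entity_y) \
                    + bce(aspect_logits, aspect_y) \
                    + bce(polarity_logits, polarity_y)
    return loss
```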
With this implementation, we obtain an enhanced model designed to achieve the best possible performance on the aspect-based sentiment analysis task: a monolingual pre-trained model for Vietnamese text, a mechanism to exploit this pre-trained model to its full potential, and a hierarchical classifier. Our promising results are presented in detail in the experiments section.
IV. EXPERIMENTS
A. Datasets
We evaluated our model's performance on the VLSP 2018 ABSA dataset, which was the first public Vietnamese dataset for the ABSA task. This dataset was collected from user comments on Agoda5 and consists of document-level reviews. The length of each review varies widely: some are short sentences, while others contain hundreds of words, with the longest containing around 1,000 words.
We also evaluated our model on the UIT ABSA dataset, which consists of sentence-level reviews containing relatively short sentences, with only 1.65 aspects per review on average. The data was collected from mytour6. In the construction of both datasets, multiple annotators were employed and the raw data were manually annotated following strict guidelines.

5 https://www.agoda.com/vi-vn
6 https://mytour.vn
The datasets deal with the hotel and restaurant domains and are divided into training, development, and testing sets with similar label ratios. There are 34 aspects for the hotel domain, and each review can have a varying number of aspects. Details about the datasets can be seen in Table I and Table II. From the standard deviation for each dataset, it is apparent that the aspect distribution is very uneven, with the most frequent aspect appearing around 2,000 times and the rarest aspect appearing only 2 or 3 times.
Table I: Dataset details for VLSP 2018 ABSA
Type | #Reviews | #Aspects | Avg. aspects/review | σ | Avg. length

Table II: Dataset details for UIT ABSA
Type | #Reviews | #Aspects | Avg. aspects/review | σ | Avg. length
train | 7180 | 11812 | 1.65 | 469.00 | 18.25
B. Evaluation Metrics
To evaluate the performance of ABSA models, we use the micro-average method. The evaluation is done in two phases: Phase A evaluates the model's ability to detect the aspects of a review, and Phase B evaluates the detection of (aspect, polarity) pairs. The Precision, Recall, and F1 scores are computed with the following formulas:
$$Precision = \frac{\sum_{c_i \in C} TP_{c_i}}{\sum_{c_i \in C} (TP_{c_i} + FP_{c_i})}$$

$$Recall = \frac{\sum_{c_i \in C} TP_{c_i}}{\sum_{c_i \in C} (TP_{c_i} + FN_{c_i})}$$

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
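In code, the micro-averaged scores reduce to counting true positives, false positives, and false negatives over all predicted pairs. The sketch below assumes predictions and gold labels are given as sets of (review_id, aspect) tuples for Phase A and (review_id, aspect, polarity) tuples for Phase B; the function name is ours for illustration.

```python
# Micro-averaged precision, recall, and F1 over all classes, as in the formulas above.
def micro_prf(pred, gold):
    tp = len(pred & gold)               # correctly predicted pairs
    fp = len(pred - gold)               # predicted pairs not in the gold labels
    fn = len(gold - pred)               # gold pairs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```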
C. Experimental Setup
As mentioned above, we use VnCoreNLP's segmenter to segment each review before applying PhoBert. We then use PhoBert's tokenizer to obtain token ids and attention masks, and perform padding. We use Hugging Face's AdamW optimizer7 together with the constant scheduler8 for warmup; the base learning rate is 2e-5 for the document-level dataset and 5e-6 for the sentence-level dataset. We set the warmup ratio to 0.25 and the batch size to 10, then train each model for 100 epochs. The BERT model we use is PhoBert-large with 25 Transformer blocks and a hidden size of 1024. We test the performance of two settings: 4-layer aggregation (HSUM-HC_4) and 8-layer aggregation (HSUM-HC_8).

8 https://huggingface.co/transformers/main_classes/optimizer_schedules.html
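The optimization setup can be outlined as follows. The sketch uses torch.optim.AdamW for portability (the Hugging Face AdamW implementation referred to above behaves analogously), get_constant_schedule_with_warmup from transformers, and the document-level hyperparameters; it assumes a model that returns the total loss of Eq. (3) and a standard PyTorch DataLoader, so it is an outline rather than our exact training script.

```python
# Sketch of the training loop with the settings described above (document-level values).
import torch
from transformers import get_constant_schedule_with_warmup

def train(model, train_loader, lr=2e-5, epochs=100, warmup_ratio=0.25):
    total_steps = epochs * len(train_loader)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)   # lr=5e-6 for sentence level
    scheduler = get_constant_schedule_with_warmup(
        optimizer, num_warmup_steps=int(warmup_ratio * total_steps))
    for _ in range(epochs):
        for batch in train_loader:
            loss = model(**batch)            # assumed to return the total loss of Eq. (3)
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```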
D. Experimental Results and Discussion
We compared our model's performance with previous work on the same datasets. For the UIT ABSA dataset, all results other than ours are the baseline results from [1]; these results are taken from the multi-task approach (except for SVM).
1) Experimental Results: Results for the two datasets can be seen in Table III and Table IV. Overall, we find that our implementation outperforms previous methods on the same task. For the VLSP 2018 dataset, our model achieved an F1 score of 85.20% for Phase A and 80.08% for Phase B, a significant improvement over previous deep learning models. Notably, compared to [15], our model performs considerably better when applying a hierarchical classifier with a language-specific pre-trained model and hidden-layer aggregation, improving by 3.14% in Phase A and 5.39% in Phase B. The F1 score of our model is also 6.04% higher than that of [16], which used PhoBert-base with a linear layer for aspect detection. For the UIT ABSA dataset, our model obtained 80.78% and 75.25% in Phase A and Phase B, respectively, improving by at least 1.68% in Phase A and 1.56% in Phase B over the baseline models in [1]. Using the top 8 layers for hidden-layer aggregation also gives better performance than using only 4: since we use a large model with more hidden layers, more layers can contain useful semantic information.
From the results on the UIT ABSA sentence-level dataset, we can see that our implementation can have lower precision but much higher recall than previous models, which leads to a higher F1 score than the deep learning models, meaning it outperforms them overall. This is even more apparent on the document-level dataset, which has longer reviews requiring the model to capture long-range dependencies and whose reviews contain more aspects on average; this task can therefore be considered more challenging than the sentence-level one. However, our model scores significantly higher at document level than at sentence level. This means that our model, instead of being challenged by long sequences and forgetting information, can actually learn the extra information in these sequences and make use of it to achieve better results. Our model shows its true potential when put through a more demanding task with more information to learn.
Overall, the results show that our implementation is effective in dealing with ABSA, and that all three components (PhoBert, HSUM, and the hierarchical classifier) are essential for improving the model's performance.
2) Loss and performance curve: In our experiments, we trained our model for a large number of epochs with relatively little data. The training loss curves can be seen in Fig. 2; at first glance, it appears that our model started to overfit very early and that the validation loss kept increasing. However, we observe that this is not the case: even though the validation loss was increasing, performance still increased slowly, as can be seen in Fig. 3. This behaviour was also observed by [14] and [21], indicating that the model still learns at a slow and steady pace. At some point the performance plateaus and the learning process stops. This can be explained by the fact that BERT was pre-trained on an enormous amount of data and therefore does not easily overfit.
Table III: Results on the test set of the VLSP 2018 dataset, Hotel domain

                       Phase A (Aspect Detection)        Phase B (Aspect, Polarity Detection)
Models                 Precision   Recall   F1           Precision   Recall   F1
BiLSTM + CNN           84.03       72.52    77.85        76.53       66.04    70.90
Our method
HSUM-HC_8              86.79       83.66    85.20        84.52       76.08    80.08
HSUM-HC_4              85.59       83.39    84.67        83.50       74.65    78.83

Table IV: Results on the test set of the UIT ABSA dataset, Hotel domain

                       Phase A (Aspect Detection)        Phase B (Aspect, Polarity Detection)
Models                 Precision   Recall   F1           Precision   Recall   F1
Multiple SVM           76.68       74.70    75.68        69.06       67.28    68.16
LSTM + Attention       83.47       69.07    75.59        76.22       63.07    69.03
BiLSTM + Attention     82.02       72.08    76.73        74.68       65.63    69.86
CNN-LSTM + Attention   76.92       70.76    73.71        69.02       63.50    66.14
BiLSTM-CNN             77.11       78.22    77.66        70.23       71.23    70.72
PhoBert-base           83.46       75.18    79.10        77.75       70.03    73.69
Our method
HSUM-HC_8              80.26       81.31    80.78        76.87       73.71    75.25
HSUM-HC_4              79.75       80.96    80.34        76.89       72.97    74.88
Figure 2: The loss curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right)
Figure 3: The F1 curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right)
E. Conclusion
We implemented an effective method that utilizes the hidden layers of Bert with a hierarchical classifier to deal with the Vietnamese ABSA task. We experimented on two datasets at different review levels and significantly outperformed previous methods, achieving state-of-the-art results on both datasets. We find that, since Bert-large has 25 hidden layers, using 8 layers for aggregation gives better performance than the original 4-layer usage. For future work, we plan to apply our model to different domains and languages and to test it with online customer reviews to explore its potential applications.
ACKNOWLEDGMENT
We would like to thank the VLSP 2018 organizers and the UIT NLP Group for providing us with the ABSA datasets.
REFERENCES
[1] D. Van Thin, N. L.-T. Nguyen, T. M. Truong, L. S. Le, and D. T. Vo, "Two new large corpora for Vietnamese aspect-based sentiment analysis at sentence level," ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 20, no. 4, May 2021. [Online]. Available: https://doi.org/10.1145/3446678
[2] B. Phạm and S. McLeod, "Consonants, vowels and tones across Vietnamese dialects," International Journal of Speech-Language Pathology, vol. 18, no. 2, pp. 122–134, 2016, PMID: 27172848. [Online]. Available: https://doi.org/10.3109/17549507.2015.1101162
[3] M. Pontiki, D. Galanis, H. Papageorgiou, I. Androutsopoulos, S. Manandhar, M. AL-Smadi, M. Al-Ayyoub, Y. Zhao, B. Qin, O. De Clercq, V. Hoste, M. Apidianaki, X. Tannier, N. Loukachevitch, E. Kotelnikov, N. Bel, S. M. Jiménez-Zafra, and G. Eryiğit, "SemEval-2016 task 5: Aspect based sentiment analysis," in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). San Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 19–30.
[4] J. Wagner, P. Arora, S. Cortes, U. Barman, D. Bogdanova, J. Foster, and L. Tounsi, "DCU: Aspect-based polarity classification for SemEval task 4," in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). Dublin, Ireland: Association for Computational Linguistics, Aug. 2014, pp. 223–229.
[5] S. Kiritchenko, X. Zhu, C. Cherry, and S. Mohammad, "NRC-Canada-2014: Detecting aspects and sentiment in customer reviews," in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). Dublin, Ireland: Association for Computational Linguistics, Aug. 2014, pp. 437–442.
[6] T. H. Nguyen and K. Shirai, "PhraseRNN: Phrase recursive neural network for aspect-based sentiment analysis," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, Sep. 2015, pp. 2509–2514.
[7] Y. Wang, M. Huang, X. Zhu, and L. Zhao, "Attention-based LSTM for aspect-level sentiment classification," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 606–615.
[8] D. Tang, B. Qin, X. Feng, and T. Liu, "Effective LSTMs for target-dependent sentiment classification," 2016.
[9] Y. Ma, H. Peng, and E. Cambria, "Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, Apr. 2018. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/12048
[10] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 2227–2237.
[11] A. Radford and K. Narasimhan, "Improving language understanding by generative pre-training," 2018.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2019.
[13] C. Sun, L. Huang, and X. Qiu, "Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 380–385.
[14] A. Karimi, L. Rossi, and A. Prati, "Improving BERT performance for aspect-based sentiment analysis," arXiv preprint arXiv:2010.11731, 2020.
[15] O. T. Tran and V. T. Bui, "A BERT-based hierarchical model for Vietnamese aspect based sentiment analysis," in 2020 12th International Conference on Knowledge and Systems Engineering (KSE), Can Tho, Viet Nam, 2020, pp. 269–274.
[16] D. V. Thin, L. S. Le, V. X. Hoang, and N. L.-T. Nguyen, "Investigating monolingual and multilingual BERT models for Vietnamese aspect category detection," 2021.
[17] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," 2020.
[18] D. Q. Nguyen and A. T. Nguyen, "PhoBERT: Pre-trained language models for Vietnamese," CoRR, vol. abs/2003.00744, 2020. [Online]. Available: https://arxiv.org/abs/2003.00744
[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," CoRR, vol. abs/1907.11692, 2019. [Online]. Available: http://arxiv.org/abs/1907.11692
[20] G. Jawahar, B. Sagot, and D. Seddah, "What does BERT learn about the structure of language?" in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, Jul. 2019, pp. 3651–3657.
[21] X. Li, L. Bing, W. Zhang, and W. Lam, "Exploiting BERT for end-to-end aspect-based sentiment analysis," in Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Hong Kong, China, Nov. 2019, pp. 34–41.