Improving Intent Extraction Using Ensemble Neural Network
1st Thai-Le Luong
Faculty of Information Technology, VNU University of Engineering and Technology
University of Transport and Communications
Hanoi, Vietnam
luongthaile80@utc.edu.vn

2nd Nhu-Thuat Tran
Faculty of Information Technology, VNU University of Engineering and Technology
Hanoi, Vietnam
thuattn@vnu.edu.vn

3rd Xuan-Hieu Phan
Faculty of Information Technology, VNU University of Engineering and Technology
Hanoi, Vietnam
hieupx@vnu.edu.vn
Abstract—User intent extraction from social media texts aims to identify the user intent keyword and its related information. This topic has attracted a lot of research owing to its various applications in online marketing, e-commerce and business services. One line of such studies models this problem as a sequence labeling task and applies state-of-the-art sequential tagging models such as BiLSTM [12] and BiLSTM-CRFs [12]. In this paper, we take a further step to enhance intent extraction results based on tri-training [23] and ensemble learning [2]. Specifically, we simultaneously use three BiLSTM-CRFs models, each differing from the others in the type of word embeddings, and apply a majority voting scheme over their predicted labels when decoding the final labels. Extensive experiments on data from three domains, Real Estate, Tourism and Transportation, show that our proposed methods enjoy better performance than a single-model-based approach.
Index Terms—intent mining, intent identification, tri-training,
ensemble, information extraction
I. INTRODUCTION
In recent years, if one wants to look for a restaurant, a place of entertainment or even an apartment, he/she will come straight to a forum or a social network to share his/her intent and preferences, hoping to get some recommendations from others. These data can bring a lot of potential benefit to businesses if there is a mechanism to automatically understand and extract user intents as soon as they appear on social media in the form of posts or comments. Most research formerly focused on classifying user intents into some pre-defined categories using various approaches such as [6], [9], [10]. A minority of studies attempt to exploit the semantic or linguistic features of online texts to understand the user's intention more deeply [3], [4], [13], [15]. In our previous paper [15], we formulated the intent extraction task as a sequential labeling problem. For a particular intent domain, we attempted to extract the intent head and intent properties. Terminologically, the intent head consists of the intent keyword and the object, while the intent properties include all possible constraints or preferences such as brand, price, date and so on.
In this paper, we follow the above-described idea to extract user intent from Vietnamese online texts (i.e., posts/comments). For example, given a post in the Tourism domain like “Our family is going to Da Nang from 14/6 to 18/6, we have 5 adults and 1 child (1-year-old), could you recommend us the hotel, the best places to visit there and the total cost is about 20 million dong. Tks. Phone number: 0913 456 233", the intent extraction model will produce the following output: intent keyword = “is going to"; and intent properties = {destination = “Da Nang"; agenda = “from 14/6 to 18/6"; number of people = “5 adults and 1 child"; price = “20 million dong"; contact = “0913 456 233"}.

Previously, ensemble learning methods have shown promising results in classification problems [2] and sequential labeling problems [18]. We expect similar results when applying these methods to the intent extraction task. Therefore, we explore tri-training and ensemble learning for intent extraction in the context of deep neural networks. Because BiLSTM-CRFs (Bidirectional Long Short-Term Memory – Conditional Random Fields) has achieved remarkable results in sequence labeling tasks [12], it is reasonable to explore this model and its variations to improve task performance. In particular, we come up with a novel model consisting of three BiLSTM-CRFs components. This idea is based on the tri-training technique proposed in [23], [21] and on ensemble learning [18].

In our approach, we train the above-mentioned model and apply a majority voting scheme over the three outputs when decoding the final label for each token. To ensure the voting scheme works as expected, the three BiLSTM-CRFs components need to be as diverse as possible. For this reason, in each BiLSTM-CRFs component we initialize the word embedding layer with word embeddings trained by different methods, namely FastText [1], GloVe [20] and Word2Vec [16]. These word embeddings are fine-tuned in the training phase. Although this model shows some improvements in the intent extraction task, as presented in Section V-D, it requires a large amount of time to train.
Therefore, we further explore this model with the idea of sharing layers: the three components can share the LSTM-based character encoding layer, the LSTM-based word layer or the CRFs decoding layer. The results of these explorations are discussed in Section V-D.
Contribution. Our contributions are: (a) we propose a novel method to improve intent extraction results based on ensemble learning and the tri-training technique in the context of deep neural networks, and we explore variations of the proposed architecture in order to reduce training time; (b) we explore the role of word embeddings in the intent extraction task on our collected dataset, and we recognized that, on our dataset, using each word embedding type independently is sometimes more effective than concatenating them; (c) we propose definitions of tag/label sets for intent information in three specific domains: Real Estate, Transportation and Tourism. We perform an extensive evaluation of the proposed method compared to previous state-of-the-art methods on datasets from these domains.
II. RELATED WORKS

A. User intent understanding
To the best of our knowledge, before 2012 most research attempted to mine user goals by exploiting user queries and/or behaviors on online channels. The most popular approach during this period was to classify user intents into some pre-defined intent categories based on keywords or personalized data [6], [9], [10]. In recent years, there have been some other works focusing on understanding user intent from user posts/comments, such as [11], [15], [17], [22], but their number remains modest. Among those previous studies, several are highly similar to our work in exploiting linguistic features and/or some high-level features of user posts/comments. In 2010, X. Li [13] confirmed that determining the semantic intent of web queries involves not only identifying their semantic class but also understanding their semantic structure. They formally defined the semantic structure of noun phrase queries as comprised of intent heads (IH) and intent modifiers (IM), in which an intent head (IH) is a query segment that corresponds to an attribute name of an intent class, while an intent modifier (IM) is a query segment that corresponds to an attribute value (of some attribute name). In 2012, the study of M. Castellanos et al. [3] on mining user intent from social media comments was relatively similar to ours. The authors tried to identify the intent phrase, like “would like to see" or “are planning a trip", and to extract intent attributes, like “Ages = 7; Date = June". Although some entities could be extracted automatically by a CRFs method supported by rules, manual work was required to identify some intents. This is the key difference from our end-to-end method.
B. Tri-training technique
Tri-training, proposed by Zhou et al. [23], is a classic method that reduces the bias of predictions on unlabeled data by utilizing the agreement of three independently trained models. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. In [21], Sebastian Ruder et al. proposed a novel method, called multi-task tri-training, that reduces the time and space complexity of classic tri-training. They chose two NLP tasks with different characteristics, namely a sequence prediction task and a classification task (POS tagging and sentiment analysis), and treated them as multi-task in the model. After conducting extensive experiments, they found that the multi-task tri-training model outperforms both traditional tri-training and recent alternatives on sentiment analysis; however, classic tri-training is superior on POS tagging. This is one of the reasons why we explore classic tri-training in the context of deep neural networks to deal with our sequence labeling task.
C. Ensemble learning
An ensemble is a collection of models whose predictions are combined by weighted averaging or voting. In 2004, R. Caruana et al. presented a method for constructing ensembles from libraries of thousands of models and achieved promising results on classification tasks [2]. Rather than combining good and bad models in an ensemble, they used forward stepwise selection from the library of models to find a subset of models that, when averaged together, yield excellent performance. Nam et al. [18] expected similar results when applying ensemble learning to structured learning problems. They presented Structured Learning Ensemble (SLE), a novel combination method for sequence predictions that incorporates correlations of label sequences. As a result, on both POS (part-of-speech) tagging and OCR (handwritten character recognition) tasks, SLE exhibited superior performance compared with the single best model, SVM-struct, as verified in their paper. Our work shares the same idea of exploring an ensemble technique for sequence labeling, but we mainly focus on exploiting neural networks, an approach that has recently achieved state-of-the-art results in sequential tagging.
III. INTENT HEADS AND INTENT PROPERTIES IN THREE PARTICULAR DOMAINS
As stated above, we chose three domains for extracting intent information: Real Estate, Transportation and Tourism. With the motivation to extract the intent heads, including the intent keyword and intent object, and the intent properties or constraints, we had to survey our data collection carefully. We then decided to build a set of 18 labels for the Real Estate domain, 17 labels for the Transportation domain and 15 labels for the Tourism domain. This work took us much more time than we expected: because a user post usually contains a lot of surrounding information supporting the main intention, it is hard to decide which information needs to be revealed and which does not. After completing the three label sets, we annotated the data with these labels. Some examples of tagged posts are presented in Figure 1. A pair of HTML-like tags is used to mark a word/phrase that represents an intent keyword or intent attribute. For example, we tag the phrase "Cho thuê" (for rent) with <intent> and </intent> to indicate an intent keyword. In Figure 1, the intent keyword is marked in red and each type of other intent attributes is marked with a different color.
Figure 1. Example of tagged posts.
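To make the annotation scheme concrete, the sketch below shows one way such HTML-like tags could be mapped to token-level BIO labels for sequence labeling. The function name, the regular expression and the tag-to-label mapping are our own illustrative assumptions, not the authors' tooling.

```python
import re

def tags_to_bio(annotated):
    """Map a post annotated with HTML-like tags to BIO labels (a sketch).

    Hypothetical mapping: the tag name's first three letters become the
    label suffix, e.g. <intent>Cho thuê</intent> ->
    [('Cho', 'B-INT'), ('thuê', 'I-INT')]; untagged tokens get 'O'.
    """
    pairs = []
    for span, tag, phrase, plain in re.findall(r'(<(\w+)>(.*?)</\2>)|(\S+)', annotated):
        if span:  # a tagged phrase: first token gets B-, the rest I-
            tokens = phrase.split()
            label = tag[:3].upper()
            pairs.append((tokens[0], 'B-' + label))
            pairs.extend((t, 'I-' + label) for t in tokens[1:])
        else:     # an untagged token
            pairs.append((plain, 'O'))
    return pairs

# tags_to_bio('<intent>Cho thuê</intent> nhà 2 phòng ngủ')
# -> [('Cho','B-INT'), ('thuê','I-INT'), ('nhà','O'), ('2','O'), ('phòng','O'), ('ngủ','O')]
```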
IV. INTENT EXTRACTION APPROACH BASED ON TRI-TRAINING AND ENSEMBLE LEARNING
In this section, we describe our proposed model, beginning at the bottom layers and going up to the top layers. Our proposed architecture consists of three BiLSTM-CRFs components, as depicted in Figure 2, each of which has three layers. The lowest layer is the input word embedding layer, followed by a BiLSTM layer. The BiLSTM layers are responsible for producing input to the third layer, the CRFs layer, which decodes labels for its own component.
Figure 2. Our proposed model based on the tri-training technique in the context of deep neural networks and ensemble learning.
A. Input word embeddings
The input to all three components in our model is the vector representation of individual words. The word vector representation consists of a character-based word representation and a pre-trained word representation. We integrate the character-based word representation into our proposed model because our data was collected from social media and discussion forums; such data is sensitive to the spelling of words, leading to cases where two words have exactly the same meaning but different spellings. One obvious example is “tr" and “triệu" (million). From the pre-trained word representation perspective, initializing word embeddings with meaningful values produces better results than random initialization, as stated in [5]. The pre-trained word representations in our model, which are 100-dimensional dense vectors, are trained by the three techniques mentioned above, namely FastText, Word2Vec and GloVe. Their role is not only to improve the task performance of their own component but also to make the three components diverge, resulting in better performance of the whole model. These pre-trained embeddings will be fine-tuned during training. Mathematically, the i-th word has the following representation:
$$w_i = \begin{bmatrix} h_i^f \\ h_i^b \\ e_{\text{pre-trained}} \end{bmatrix} \qquad (1)$$
in which $h_i^f$ and $h_i^b$ are the forward and backward representations of word $w_i$ respectively (the outputs of the forward and backward char-LSTMs). Since the three components are independent, $h_i^f$ and $h_i^b$ are learned independently during the training phase. If word $w_i$ goes to the GloVe word-embedding component, $e_{\text{pre-trained}}$ is looked up from the GloVe word embedding lookup table, as described in [12]. The same holds for the remaining components.
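As a concrete illustration, the following PyTorch sketch builds the word vector of equation (1) for a single word. The class and variable names are ours, and details such as padding and batching are omitted, so this is a minimal sketch under our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Sketch of the input layer in Sec. IV-A (names are ours).

    A word vector is [h_f; h_b; e_pretrained]: the final states of a forward
    and a backward character LSTM (25 dims each) concatenated with a
    100-dimensional pre-trained embedding, giving 150 dims in total.
    """
    def __init__(self, n_chars, pretrained):  # pretrained: (vocab, 100) tensor
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, 25)
        self.char_lstm = nn.LSTM(25, 25, bidirectional=True, batch_first=True)
        # freeze=False so the embeddings are fine-tuned, as described above
        self.word_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

    def forward(self, char_ids, word_id):
        # char_ids: (1, word_len) character indices; word_id: 0-d word index
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        h_fb = torch.cat([h_n[0], h_n[1]], dim=-1)   # (1, 50) char-based part
        e = self.word_emb(word_id).unsqueeze(0)      # (1, 100) pre-trained part
        return torch.cat([h_fb, e], dim=-1)          # (1, 150) word vector
```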
B. BiLSTM Encoding Layer
For each word $w_i$ in a sentence $(w_1, w_2, \dots, w_n)$ of $n$ words, each word is represented as described above. The forward LSTM layer generates a representation $h_i^l$ of its left context. Similarly, the backward LSTM generates a representation $h_i^r$ of its right context. We follow the settings in [12]: the forward and backward LSTMs have different parameters. The output of the BiLSTM is
$$h_i = \begin{bmatrix} h_i^l \\ h_i^r \end{bmatrix} \qquad (2)$$
C. CRFs Decoding Layer
Instead of tagging each word independently, a CRFs decoding layer is added to jointly decode the labels of the words in a sentence. This comes from the fact that labels are not fully independent: each word's label depends on its neighbors. For example, in the intent extraction task, the three labels B-INT, B-OBJ and I-OBJ frequently come together, while an I-LOC cannot follow an I-INT. This idea has shown its efficiency in the NER task in [12]. In each component of our model, the CRFs layer takes the output of the BiLSTM encoding layer as input and decodes a label for each word.
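A single component can then be sketched as follows. We use the third-party pytorch-crf package for the CRFs layer, which is an assumption on our part (Lample et al.'s original code implements the CRF directly); the class is a minimal sketch, not the authors' code.

```python
import torch.nn as nn
from torchcrf import CRF  # pytorch-crf package (assumed available)

class BiLSTMCRF(nn.Module):
    """One ensemble component (Sec. IV-B/IV-C), a minimal sketch.

    A single-layer BiLSTM (100 dims per direction) encodes the 150-dim word
    vectors; a linear projection plus a CRF layer jointly decodes the labels.
    """
    def __init__(self, n_labels, input_dim=150):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, 100, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(200, n_labels)
        self.crf = CRF(n_labels, batch_first=True)

    def loss(self, words, tags, mask):
        emissions, _ = self.encoder(words)           # (B, T, 200)
        return -self.crf(self.proj(emissions), tags, mask=mask)  # neg. log-lik.

    def decode(self, words, mask):
        emissions, _ = self.encoder(words)
        return self.crf.decode(self.proj(emissions), mask=mask)  # best paths
```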
D. Loss Function in Training Phase

Each BiLSTM-CRFs component calculates its own loss as in formula (1) of [12]. The overall loss of the proposed model is:
$$\text{overall\_loss} = \sum_{i=1}^{3} \text{loss}_i \qquad (3)$$

where $\text{loss}_i$ is the loss of the $i$-th component.
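A minimal sketch of this training objective, reusing the hypothetical BiLSTMCRF class above; inputs[k] denotes the word-vector batch built with the k-th embedding type (names are ours).

```python
# Eq. (3): the overall loss is the sum of the three components' CRF losses,
# so a single backward pass updates all three components at once.
overall_loss = sum(component.loss(inputs[k], tags, mask)
                   for k, component in enumerate(components))
overall_loss.backward()
```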
E. Majority Voting Scheme over the Outputs of the Three BiLSTM-CRFs Components
Given the input $x = (x_1, x_2, \dots, x_N)$, the outputs of the three components in our model are three label sequences, denoted $\{y^{(1)}, y^{(2)}, y^{(3)}\}$. The final output is constructed from these label sequences:
$$y = \{\,\text{majority}(y^{(1)}_1, y^{(2)}_1, y^{(3)}_1), \dots, \text{majority}(y^{(1)}_N, y^{(2)}_N, y^{(3)}_N)\,\} \qquad (4)$$
in which $\{y^{(1)}_i, y^{(2)}_i, y^{(3)}_i\}$ are the output labels for token $x_i$ from the three components. If, for a specific token, the three components output three different labels, we choose the label of the component with the highest Viterbi score given by its CRFs layer during decoding. We tried other strategies, such as constantly choosing one of the three outputs, but this did not show any improvement.
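A sketch of the decoding step under our reading of this scheme: we break three-way disagreements with the sequence-level Viterbi score of each component; all names are illustrative.

```python
from collections import Counter

def vote(label_seqs, viterbi_scores):
    """Combine three components' label sequences by per-token majority vote.

    label_seqs: three equal-length label sequences, one per component
    viterbi_scores: the three sequence-level Viterbi scores (tie-breaker)
    """
    best = max(range(3), key=lambda k: viterbi_scores[k])
    final = []
    for token_labels in zip(*label_seqs):
        label, freq = Counter(token_labels).most_common(1)[0]
        # if all three components disagree, fall back on the component
        # whose CRF produced the highest Viterbi score (our reading of IV-E)
        final.append(label if freq >= 2 else token_labels[best])
    return final
```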
F. Exploration of sharing layers
Figure 3. Ensemble model with a shared character LSTM layer.
Although the above-described model enjoys better performance in the intent extraction task than the single-model-based approach, it requires a large amount of training time. Therefore, we tried sharing the character-based word representation across components, as depicted in Figure 3. This makes the character-based word representation jointly supervised by all components. Despite reducing training time, this model shows slightly lower performance than the one above, as shown in Figures 4, 5 and 6. We further explored the idea of sharing layers, such as sharing the BiLSTM layer or the CRFs layer. However, sharing these top layers caused performance to go down when testing on our data.
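In PyTorch terms, sharing the character layer can be sketched by letting the three input modules reference one LSTM instance, so its parameters receive gradients from all three losses. This reuses our earlier hypothetical classes; glove_vecs, fasttext_vecs and w2v_vecs are assumed pre-trained tensors.

```python
import torch.nn as nn

# One character BiLSTM instance, referenced by all three input layers, so
# its parameters are jointly supervised by the three CRF losses.
shared_char_lstm = nn.LSTM(25, 25, bidirectional=True, batch_first=True)

representations = [WordRepresentation(n_chars, vecs)
                   for vecs in (glove_vecs, fasttext_vecs, w2v_vecs)]
for rep in representations:
    rep.char_lstm = shared_char_lstm   # share, rather than copy, the module
```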
V. EXPERIMENTAL EVALUATION

A. Data
As stated above, we chose three intent domains for extracting user intention: Real Estate, Transportation and Tourism. The main reason for this choice is that internet users in Vietnam seem to share much more intention about these three domains than about others, as found in our previous survey [14]; thus, the data for these three domains is more diverse. We automatically crawled data from some well-known forums, websites and public Facebook groups, such as webtretho.com/forum, dulich.vnexpress.net, batdongsan.com.vn and facebook.com/groups/xemaycuhanoi. We had a group of students annotate the data based on the three label sets mentioned in Section III. After careful cross-checking among these students' work to ensure consistency in data annotation, we obtained a collection of about 3000 annotated posts for each domain. These data were then divided into training, development and test sets with proportions of 60%, 20% and 20% respectively.
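A 60/20/20 split of this kind can be sketched with scikit-learn; posts is the list of annotated posts for one domain, and the fixed random seed is our assumption.

```python
from sklearn.model_selection import train_test_split

# hold out 40%, then split that portion half-and-half into dev and test
train, rest = train_test_split(posts, test_size=0.4, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)
```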
B. Evaluation metric
For all experiments, precision, recall and F1-score at the segment (or chunk-based) level are adopted as the official task evaluation. Specifically, assume that the true segment sequence of an instance is $s = (s_1, s_2, \dots, s_N)$ and the decoded segment sequence is $s' = (s'_1, s'_2, \dots, s'_K)$. Then $s'_k$ is called a true positive if $s'_k \in s$. Precision and recall are the fractions of the total number of true positives over the total number of decoded and true segments respectively. We report the F1-score, computed as $2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$. Besides, the support is the number of true segments corresponding to each label in the test set. The average/total of precision, recall and F1-score is calculated as the weighted average of the per-label precision, recall and F1-score, in which the weight is the support value.
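A sketch of this metric for a single instance, where each segment is represented as a (label, start, end) triple; this representation is our choice, as the paper does not fix one.

```python
def chunk_prf(true_chunks, pred_chunks):
    """Segment-level precision/recall/F1 (sketch of the metric in Sec. V-B).

    A predicted chunk (label, start, end) counts as a true positive only
    if the identical chunk appears in the gold set.
    """
    gold, pred = set(true_chunks), set(pred_chunks)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```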
C. Training Parameters
Word Embeddings: To the best of our knowledge, there are no publicly available pre-trained embeddings for Vietnamese online texts. Therefore, in our experiments we treated the training dataset of each domain as a corpus for building the models that generate word embeddings. Specifically, we used the public libraries glove (https://pypi.org/project/glove/), fasttext (https://pypi.org/project/fasttext/) and gensim (https://pypi.org/project/gensim/), all with a window size of 7, to produce the GloVe, FastText and Word2Vec word embeddings respectively.
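As a sketch of how two of these embedding sets could be trained with gensim (which also provides FastText), assuming corpus is a list of tokenized posts; the separate glove package has a different API that we do not reproduce here, and the argument names below follow gensim 4.x.

```python
from gensim.models import Word2Vec, FastText

# window size of 7 and 100-dimensional vectors, matching the settings above
w2v = Word2Vec(sentences=corpus, vector_size=100, window=7, min_count=1)
ft = FastText(sentences=corpus, vector_size=100, window=7, min_count=1)
w2v.wv.save_word2vec_format("word2vec.vec")
ft.wv.save_word2vec_format("fasttext.vec")
```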
Character-based Word Representation: The character embeddings corresponding to every character in a word are given in direct and reverse order to a forward and a backward LSTM. The embedding of a word derived from its characters is the concatenation of its forward and backward representations from the bidirectional LSTM [12]. In our model, each character embedding has a dimension of 25. The dimension of the forward and backward character LSTMs is 25, resulting in a 50-dimensional character-based word representation.
BiLSTM Encoding Layer: Our model uses a single layer for the forward and backward LSTMs, whose dimension is 100. We apply dropout to mitigate overfitting [8]. A dropout mask is applied to the final embedding layer just before the input to the BiLSTM encoding layer. In all of our experiments, the dropout rate was fixed at 0.5.
Optimization and Fine-tuning: We used Adam optimization [7] with learning rate 0.001, β1 = 0.9, β2 = 0.999 and gradient clipping of 10. The effectiveness of fine-tuning embeddings has been explored in sequential prediction problems [19]. In our model, each of the initial embeddings is fine-tuned, i.e., modified during the gradient updates of the neural network model by back-propagating gradients.
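These settings can be sketched as a single training step in PyTorch; we assume norm-based clipping, and compute_overall_loss is a hypothetical helper standing in for the summed CRF loss of Section IV-D.

```python
import torch

# Adam with lr=0.001, beta1=0.9, beta2=0.999, clipping at 10. `components`
# are the hypothetical BiLSTMCRF modules from the earlier sketches; their
# embedding parameters are included, so the embeddings get fine-tuned.
params = [p for c in components for p in c.parameters()]
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))

optimizer.zero_grad()
loss = compute_overall_loss(batch)   # summed CRF loss, Eq. (3) (hypothetical)
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=10.0)
optimizer.step()
```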
Our implementation was mostly based on that of Lample et al. [12].
D. Experimental results and discussion
Figure 4. Average F1-score over 5 runs for each model in the Real Estate domain.
As our motivation is to explore ensemble and tri-training techniques for the intent extraction task, we built the following six models for each intent domain: (1), (2), (3) three BiLSTM-CRFs models proposed by Lample et al. [12] (GLOVE, FASTTEXT, WORD2VEC), whose word embeddings are initialized with GloVe, FastText and Word2Vec embeddings respectively; (4) a single BiLSTM-CRFs model proposed by Lample et al. [12] in which the word embedding initialization is the concatenation of the GloVe, FastText and Word2Vec embeddings (3-EMBEDDINGS); (5) our proposed model, based on the tri-training technique and ensemble learning, as presented in Figure 2; (6) the model depicted in Figure 3, in which the character BiLSTM layer is jointly learned by the three components of the network (SHARING CHAR-LAYER MODEL). After carefully running all of the above-described models for each intent domain, the F1-scores averaged over 5 different runs on the test sets are presented in Figures 4, 5 and 6 respectively.
Figure 5. Average F1-score over 5 runs for each model in the Tourism domain.
Figure 6. Average F1-score over 5 runs for each model in the Transportation domain.
In all three domains, we observed that our proposed models reached better results than the four single models. The biggest improvement was seen in the Transportation domain, where our proposed method achieved a 1.15% higher F1-score than the single BiLSTM-CRFs with GloVe word embedding initialization and a nearly 3% higher F1-score than the single BiLSTM-CRFs with Word2Vec word embedding initialization. The ensemble model with the character BiLSTM layer shared by all components of the network, however, showed smaller improvements than the model without the shared character BiLSTM layer. At its best, it improves the F1-score by 0.88% over the single BiLSTM-CRFs with GloVe word embedding initialization in the Real Estate domain and by 2.62% over the single BiLSTM-CRFs with Word2Vec word embedding initialization in the Transportation domain. Since data on social media is generated massively on a daily basis, this model is worth exploring to save training time.
Speaking of the four single models, the BiLSTM-CRFs with GloVe word embedding initialization topped the list in terms of F1-score. Interestingly, when these single models are combined to build the ensemble model, the components based on the models with lower F1-scores (those with word embeddings initialized by FastText and Word2Vec) contribute positively and boost the overall results of the whole ensemble model. This means that each component in the proposed models can support its counterparts to reach a better overall performance.

TABLE I. THE BEST CHUNK-BASED RESULT ACHIEVED WHEN APPLYING OUR PROPOSED MODEL IN THE TOURISM DOMAIN
Specific Label Precision Recall F1-score Support
Description of Obj 42.06 48.18 44.92 110
Name of Accom 56.73 68.60 62.11 86
Number of Objects 93.83 93.83 93.83 81
Number of People 88.29 87.78 88.03 352
Point of Departure 75.31 75.31 75.31 81
Point of Time 88.26 91.81 90.00 794
avg/total 84.06 85.29 84.57 3691
We also recognize that the concatenation of the three word embedding types as the input to a single model is not always effective on our data. Experiments show that it always enjoys better results than the worst single model but performs worse than the best one; therefore, it does not bring an improvement to the intent extraction task.
The Real Estate domain experienced the lowest results in all experiments. One reason is that posts/comments in Real Estate are usually long and complicated (e.g., Figure 1), so we have much more information to extract and also much more noise to deal with than in the two remaining domains. Finally, we present in Table I the best results among the three domains for extracting intent heads and intent properties, achieved when applying our proposed model to the Tourism domain. This is because the Tourism domain has the smallest number of labels compared to Real Estate and Transportation (15, 18 and 17 labels respectively). Moreover, after carefully analyzing the data from the three domains, we found that the Tourism domain contains less noisy data, such as improper abbreviations and emoticons, than the two remaining domains.
VI. CONCLUSION
In this paper, we presented a novel approach to extracting user intention from social media texts, motivated by the tri-training technique and ensemble learning in the context of deep neural networks. In this approach, the outputs of three independent BiLSTM-CRFs components are aggregated to produce the final prediction through a majority voting scheme. To ensure the diversity of these BiLSTM-CRFs components, each of them leverages word embeddings initialized by a different method, namely GloVe, FastText and Word2Vec. In all experiments, our proposed models achieve higher F1-scores than the single-model approaches proposed by Lample et al. [12]. Despite its better performance, one drawback of our method is its time complexity. Therefore, we explored a variation of the above model in which the character-based word representation layer is shared and jointly learned across components. Besides reducing training time, this model still achieves promising results. Overall, our proposed ensemble models are effective for the user intent extraction task.
REFERENCES
[1] P. Bojanowski et al., “Enriching word vectors with subword information", Transactions of the Association for Computational Linguistics, pp. 135-146, 2017.
[2] R. Caruana, A. Niculescu-Mizil, G. Crew and A. Ksikes, “Ensemble selection from libraries of models", In Proc. of the 21st ICML, p. 18, 2004.
[3] M. Castellanos et al., “Intention insider: discovering people's intentions in the social channel", In Proc. of the 15th ICEDT, pp. 614-617, 2012.
[4] Y.S. Chang et al., “Identifying user goals from Web search results", In IEEE/WIC/ACM International Conference, pp. 1038-1041, 2006.
[5] R. Collobert et al., “Natural language processing (almost) from scratch", Journal of Machine Learning Research, vol. 12, pp. 2493-2537, 2011.
[6] H.K. Dai, L. Zhao, Z. Nie, J.R. Wen, L. Wang, and Y. Li, “Detecting online commercial intention (OCI)", In Proc. of the 15th WWW, pp. 829-837, ACM, 2006.
[7] K. Diederik and B. Jimmy, “Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, 2014.
[8] G.E. Hinton et al., “Improving neural networks by preventing co-adaptation of feature detectors", arXiv preprint arXiv:1207.0580, 2012.
[9] D.H. Hu, D. Shen, J.T. Sun, Q. Yang and Z. Chen, “Context-aware online commercial intention", In ACML, pp. 135-149, 2009.
[10] B.J. Jansen, D.L. Booth and A. Spink, “Determining the User Intent of Web Search Engine Queries", In Proc. of the 16th WWW, pp. 1149-1150, ACM, 2007.
[11] N. Labidi, T. Chaari, and R. Bouaziz, “An NLP-Based Ontology Population for Intentional Structure", In International Conference on Intelligent Systems Design and Applications, pp. 900-910, 2016.
[12] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami and C. Dyer, “Neural architectures for named entity recognition", arXiv preprint arXiv:1603.01360, 2016.
[13] X. Li, “Understanding the semantic structure of noun phrase queries", In Proc. of the 48th Annual Meeting of the ACL, pp. 1337-1345, 2010.
[14] Th.L. Luong, Qu.T. Truong, H.Tr. Dang and X.H. Phan, “Domain identification for intention posts on online social media", In Proc. of SoICT, pp. 52-57, 2016.
[15] Th.L. Luong, M.S. Cao, D.T. Le and X.H. Phan, “Intent extraction from social media texts using sequential segmentation and deep learning models", In Proc. of the 9th KSE, pp. 215-220, 2017.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space", arXiv preprint arXiv:1301.3781, 2013.
[17] X.B. Ngo, C.L. Le, and M.Ph. Tu, “Cross-Domain Intention Detection in Discussion Forums", In Proc. of the 8th SoICT, pp. 173-180, 2017.
[18] N. Nguyen and Y. Guo, “Comparisons of sequence labeling algorithms and extensions", In Proc. of the 24th ICML, pp. 681-688, 2007.
[19] N. Peng and M. Dredze, “Named entity recognition for Chinese social media with jointly trained embeddings", In Proc. of EMNLP, pp. 548-554, 2015.
[20] J. Pennington, R. Socher and C. Manning, “GloVe: Global vectors for word representation", In Proc. of EMNLP, pp. 1532-1543, 2014.
[21] S. Ruder and B. Plank, “Strong baselines for neural semi-supervised learning under domain shift", arXiv preprint arXiv:1804.09530, 2018.
[22] J. Wang, G. Cong, W.X. Zhao and X. Li, “Mining user intents in Twitter: a semi-supervised approach to inferring intent categories for tweets", In Proc. of the 29th AAAI, 2015.
[23] Z.H. Zhou and M. Li, “Tri-training: Exploiting unlabeled data using three classifiers", IEEE Transactions on Knowledge & Data Engineering, vol. 17, no. 11, pp. 1529-1541, 2005.