Improving Intent Extraction Using Ensemble Neural Network
1st Thai-Le Luong
Faculty of Information Technology, VNU University of Engineering and Technology
University of Transport and Communications
Hanoi, Vietnam
luongthaile80@utc.edu.vn

2nd Nhu-Thuat Tran
Faculty of Information Technology, VNU University of Engineering and Technology
Hanoi, Vietnam
thuattn@vnu.edu.vn

3rd Xuan-Hieu Phan
Faculty of Information Technology, VNU University of Engineering and Technology
Hanoi, Vietnam
hieupx@vnu.edu.vn
Abstract—User intent extraction from social media texts aims to identify the user intent keyword and its related information. This topic has attracted a lot of research owing to its various applications in online marketing, e-commerce and business services. One line of such studies models this problem as a sequence labeling task and applies state-of-the-art sequential tagging models such as BiLSTM [12] and BiLSTM-CRFs [12]. In this paper, we take a further step to enhance intent extraction results based on tri-training [23] and ensemble learning [2]. Specifically, we simultaneously use three BiLSTM-CRFs models, each differing from the others in the type of word embeddings, and apply a majority voting scheme over their predicted labels when decoding the final labels. Extensive experiments on data from three domains, Real Estate, Tourism and Transportation, show that our proposed methods enjoy better performance than a single-model-based approach.
Index Terms—intent mining, intent identification, tri-training,
ensemble, information extraction
I. INTRODUCTION
In recent years, if one wants to look for a restaurant, a place of entertainment or even an apartment, he/she will come straight to a forum or a social network to share his/her intent and preferences, hoping to get some recommendations from others. These data can bring a lot of potential benefit to businesses if there is a mechanism to automatically understand and extract user intents as soon as they appear on social media in the form of posts or comments. Most research formerly focused on classifying user intents into some pre-defined categories using various approaches such as [6], [9], [10]. A minority of studies attempt to exploit the semantic or linguistic features of online texts to understand the user's intention more deeply [3], [4], [13], [15]. In our previous paper [15], we formulated the intent extraction task as a sequential labeling problem. For a particular intent domain, we attempted to extract the intent head and intent properties. Terminologically, the intent head consists of the intent keyword and the object, while the intent properties include all possible constraints or preferences such as brand, price, date and so on.
In this paper, we follow the above-described idea to extract user intent from Vietnamese online texts (i.e., posts/comments). For example, given a post in the Tourism domain like “Our family is going to Da Nang from 14/6 to 18/6, we have 5 adults and 1 child (1-year-old), could you recommend us the hotel, the best places to visit there and the total cost is about 20 million dong. Tks. Phone number: 0913 456 233", the intent extraction model will produce the following output: intent keyword = “is going to"; and intent properties = {destination = “Da Nang"; agenda = “from 14/6 to 18/6"; number of people = “5 adults and 1 child"; price = “20 million dong"; contact = “0913 456 233"}.

Previously, ensemble learning methods have shown promising results in classification problems [2] and sequential labeling problems [18]. We expect similar results when applying these methods to the intent extraction task. Therefore, we explore tri-training and ensemble learning for intent extraction in the context of deep neural networks. Because BiLSTM-CRFs (Bidirectional Long Short-Term Memory – Conditional Random Fields) has achieved remarkable results in sequence labeling tasks [12], it is reasonable to explore this model and its variations to improve task performance. In particular, we come up with a novel model consisting of three BiLSTM-CRFs components. This idea is based on the tri-training technique proposed in [23], [21] and on ensemble learning [18].

In our approach, we train the above-mentioned model and apply a majority voting scheme over the three outputs when decoding the final label for each token. To ensure the voting scheme works as expected, the three BiLSTM-CRFs components need to be as diverse as possible. For this reason, in each BiLSTM-CRFs component we initialize the word embedding layer with word embeddings trained by different methods, namely FastText [1], GloVe [20] and Word2Vec [16]. These word embeddings are fine-tuned in the training phase. Although this model shows some improvements in the intent extraction task, as presented in Section V-D, it requires a large amount of time to train.
Therefore, we further explore this model with the idea of sharing layers: the three components can share the LSTM-based character encoding layer, the LSTM-based word layer or the CRFs decoding layer. The results of these explorations are discussed in Section V-D.
Contribution. Our contributions are: (a) we propose a novel method to improve intent extraction results based on ensemble learning and the tri-training technique in the context of deep neural networks, and we explore variations of the proposed architecture in order to reduce training time; (b) we explore the role of word embeddings in the intent extraction task on our collected dataset, and we recognized that, on our dataset, using each word embedding type independently is sometimes more effective than concatenating them; (c) we propose definitions of tag/label sets for intent information in three specific domains: Real Estate, Transportation and Tourism. We perform an extensive evaluation of the proposed method compared to previous state-of-the-art methods on datasets from these domains.
II. RELATED WORKS

A. User intent understanding
To the best of our knowledge, before 2012 most research attempted to mine user goals by exploiting user queries and/or behaviors on online channels. The most popular approach during this period was to classify user intents into some pre-defined intent categories based on keywords or personalized data [6], [9], [10]. In recent years, there have been some other works focusing on understanding user intent from user posts/comments, such as [11], [15], [17], [22], but their number remains modest. Among those previous studies, several are highly similar to our work in exploiting linguistic features and/or some high-level features of user posts/comments. In 2010, X. Li [13] confirmed that determining the semantic intent of web queries involves not only identifying their semantic class but also understanding their semantic structure. They formally defined the semantic structure of noun phrase queries as comprised of intent heads (IH) and intent modifiers (IM), in which an intent head (IH) is a query segment that corresponds to an attribute name of an intent class, while an intent modifier (IM) is a query segment that corresponds to an attribute value (of some attribute name). In 2012, the study of M. Castellanos et al. [3] on mining user intent from social media comments was relatively similar to ours. The authors tried to identify the intent phrase, like “would like to see" or “are planning a trip", and to extract intent attributes, like “Ages = 7; Date = June". Although some entities could be extracted automatically by a CRFs method supported by rules, manual work was required to identify some intents. This is the key difference from our end-to-end method.
B. Tri-training technique
Tri-training, proposed by Zhou et al. [23], is a classic method that reduces the bias of predictions on unlabeled data by utilizing the agreement of three independently trained models. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. In [21], Sebastian Ruder et al. proposed a novel method, called multi-task tri-training, that reduces the time and space complexity of classic tri-training. They chose two NLP tasks with different characteristics, namely a sequence prediction task and a classification task (POS tagging and sentiment analysis), and treated them as multi-task in the model. After conducting extensive experiments, they found that the multi-task tri-training model outperforms both traditional tri-training and recent alternatives on sentiment analysis; however, classic tri-training is superior on POS tagging. This is one of the reasons why we explore classic tri-training in the context of deep neural networks to deal with our sequence labeling task.
C. Ensemble learning
An ensemble is a collection of models whose predictions are combined by weighted averaging or voting. In 2004, R. Caruana et al. presented a method for constructing ensembles from libraries of thousands of models and achieved promising results on classification tasks [2]. Rather than combining good and bad models in an ensemble, they used forward stepwise selection from the library of models to find a subset of models that, when averaged together, yield excellent performance. Nam et al. [18] expected similar results when applying ensemble learning to structured learning problems. They presented Structured Learning Ensemble (SLE), a novel combination method for sequence predictions that incorporates correlations of label sequences. As a result, on both POS (part-of-speech) tagging and OCR (handwritten character recognition) tasks, SLE exhibited superior performance compared with the single best model, SVM-struct, as verified in their paper. Our work shares the same idea of exploring an ensemble technique for sequence labeling, but we mainly focus on exploiting neural networks, an approach that has recently achieved state-of-the-art results in sequential tagging.
III. INTENT HEADS AND INTENT PROPERTIES IN THREE PARTICULAR DOMAINS
As stated above, we chose three domains for extracting intent information: Real Estate, Transportation and Tourism. With the motivation to extract the intent heads, including the intent keyword and intent object, and the intent properties or constraints, we had to survey our data collection carefully. We then decided to build a set of 18 labels for the Real Estate domain, 17 labels for the Transportation domain and 15 labels for the Tourism domain. This work took us much more time than we expected: because a user post usually contains a lot of surrounding information supporting the main intention, it is hard to decide which information needs to be revealed and which does not. After completing the three label sets, we annotated the data with these labels. Some examples of tagged posts are presented in Figure 1. A pair of HTML-like tags is used to mark a word/phrase that represents an intent keyword or intent attribute. For example, we tag the phrase "Cho thuê" (for rent) with <intent> and </intent> to indicate an intent keyword. In Figure 1, the intent keyword is marked in red and each type of other intent attributes is marked with a different color.
Figure 1. Example of tagged posts.
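To make the annotation scheme concrete, the sketch below shows one way such HTML-like tags could be mapped to token-level BIO labels for sequence labeling. The function name, the regular expression and the tag-to-label mapping are our own illustrative assumptions, not the authors' tooling.

```python
import re

def tags_to_bio(annotated):
    """Map a post annotated with HTML-like tags to BIO labels (a sketch).

    Hypothetical mapping: the tag name's first three letters become the
    label suffix, e.g. <intent>Cho thuê</intent> ->
    [('Cho', 'B-INT'), ('thuê', 'I-INT')]; untagged tokens get 'O'.
    """
    pairs = []
    for span, tag, phrase, plain in re.findall(r'(<(\w+)>(.*?)</\2>)|(\S+)', annotated):
        if span:  # a tagged phrase: first token gets B-, the rest I-
            tokens = phrase.split()
            label = tag[:3].upper()
            pairs.append((tokens[0], 'B-' + label))
            pairs.extend((t, 'I-' + label) for t in tokens[1:])
        else:     # an untagged token
            pairs.append((plain, 'O'))
    return pairs

# tags_to_bio('<intent>Cho thuê</intent> nhà 2 phòng ngủ')
# -> [('Cho','B-INT'), ('thuê','I-INT'), ('nhà','O'), ('2','O'), ('phòng','O'), ('ngủ','O')]
```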
IV. INTENT EXTRACTION APPROACH BASED ON TRI-TRAINING AND ENSEMBLE LEARNING
In this section, we describe our proposed model, beginning at the bottom layers and going up to the top layers. Our proposed architecture consists of three BiLSTM-CRFs components, as depicted in Figure 2, each of which has three layers. The lowest layer is the input word embedding layer, followed by a BiLSTM layer. The BiLSTM layers are responsible for producing input to the third layer, the CRFs layer, which decodes labels for its own component.
Figure 2. Our proposed model based on the tri-training technique in the context of deep neural networks and ensemble learning.
A. Input word embeddings
The input to all three components in our model is the vector representation of individual words. The word vector representation consists of a character-based word representation and a pre-trained word representation. We integrate the character-based word representation into our proposed model because our data was collected from social media and discussion forums; such data is sensitive to the spelling of words, leading to cases where two words have exactly the same meaning but different spellings. One obvious example is “tr" and “triệu" (million). From the pre-trained word representation perspective, initializing word embeddings with meaningful values produces better results than random initialization, as stated in [5]. The pre-trained word representations in our model, which are 100-dimensional dense vectors, are trained by the three techniques mentioned above, namely FastText, Word2Vec and GloVe. Their role is not only to improve the task performance of their own component but also to make the three components diverge, resulting in better performance of the whole model. These pre-trained embeddings will be fine-tuned during training. Mathematically, the i-th word has the following representation:
$$w_i = \begin{bmatrix} h_i^f \\ h_i^b \\ e_{\text{pre-trained}} \end{bmatrix} \qquad (1)$$
in which $h_i^f$ and $h_i^b$ are the forward and backward representations of word $w_i$ respectively (the outputs of the forward and backward char-LSTMs). Since the three components are independent, $h_i^f$ and $h_i^b$ are learned independently during the training phase. If word $w_i$ goes to the GloVe word-embedding component, $e_{\text{pre-trained}}$ is looked up from the GloVe word embedding lookup table, as described in [12]. The same holds for the remaining components.
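As a concrete illustration, the following PyTorch sketch builds the word vector of equation (1) for a single word. The class and variable names are ours, and details such as padding and batching are omitted, so this is a minimal sketch under our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Sketch of the input layer in Sec. IV-A (names are ours).

    A word vector is [h_f; h_b; e_pretrained]: the final states of a forward
    and a backward character LSTM (25 dims each) concatenated with a
    100-dimensional pre-trained embedding, giving 150 dims in total.
    """
    def __init__(self, n_chars, pretrained):  # pretrained: (vocab, 100) tensor
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, 25)
        self.char_lstm = nn.LSTM(25, 25, bidirectional=True, batch_first=True)
        # freeze=False so the embeddings are fine-tuned, as described above
        self.word_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

    def forward(self, char_ids, word_id):
        # char_ids: (1, word_len) character indices; word_id: 0-d word index
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        h_fb = torch.cat([h_n[0], h_n[1]], dim=-1)   # (1, 50) char-based part
        e = self.word_emb(word_id).unsqueeze(0)      # (1, 100) pre-trained part
        return torch.cat([h_fb, e], dim=-1)          # (1, 150) word vector
```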
B. BiLSTM Encoding Layer
For each word $w_i$ in a sentence $(w_1, w_2, \dots, w_n)$ of $n$ words, each word is represented as described above. The forward LSTM layer generates a representation $h_i^l$ of its left context. Similarly, the backward LSTM generates a representation $h_i^r$ of its right context. We follow the settings in [12]: the forward and backward LSTMs have different parameters. The output of the BiLSTM is
$$h_i = \begin{bmatrix} h_i^l \\ h_i^r \end{bmatrix} \qquad (2)$$
C. CRFs Decoding Layer
Instead of tagging each word independently, a CRFs decoding layer is added to jointly decode the labels of the words in a sentence. This comes from the fact that labels are not fully independent: each word's label depends on its neighbors. For example, in the intent extraction task, the three labels B-INT, B-OBJ and I-OBJ frequently come together, while an I-LOC cannot follow an I-INT. This idea has shown its efficiency in the NER task in [12]. In each component of our model, the CRFs layer takes the output of the BiLSTM encoding layer as input and decodes a label for each word.
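A single component can then be sketched as follows. We use the third-party pytorch-crf package for the CRFs layer, which is an assumption on our part (Lample et al.'s original code implements the CRF directly); the class is a minimal sketch, not the authors' code.

```python
import torch.nn as nn
from torchcrf import CRF  # pytorch-crf package (assumed available)

class BiLSTMCRF(nn.Module):
    """One ensemble component (Sec. IV-B/IV-C), a minimal sketch.

    A single-layer BiLSTM (100 dims per direction) encodes the 150-dim word
    vectors; a linear projection plus a CRF layer jointly decodes the labels.
    """
    def __init__(self, n_labels, input_dim=150):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, 100, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(200, n_labels)
        self.crf = CRF(n_labels, batch_first=True)

    def loss(self, words, tags, mask):
        emissions, _ = self.encoder(words)           # (B, T, 200)
        return -self.crf(self.proj(emissions), tags, mask=mask)  # neg. log-lik.

    def decode(self, words, mask):
        emissions, _ = self.encoder(words)
        return self.crf.decode(self.proj(emissions), mask=mask)  # best paths
```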
D. Loss Function in Training Phase

Each BiLSTM-CRFs component calculates its own loss as in formula (1) of [12]. The overall loss of the proposed model is:
$$\text{overall\_loss} = \sum_{i=1}^{3} \text{loss}_i \qquad (3)$$

where $\text{loss}_i$ is the loss of the $i$-th component.
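A minimal sketch of this training objective, reusing the hypothetical BiLSTMCRF class above; inputs[k] denotes the word-vector batch built with the k-th embedding type (names are ours).

```python
# Eq. (3): the overall loss is the sum of the three components' CRF losses,
# so a single backward pass updates all three components at once.
overall_loss = sum(component.loss(inputs[k], tags, mask)
                   for k, component in enumerate(components))
overall_loss.backward()
```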
E. Majority Voting Scheme over the Outputs of the Three BiLSTM-CRFs Components
Given the input $x = (x_1, x_2, \dots, x_N)$, the outputs of the three components in our model are three label sequences, denoted $\{y^{(1)}, y^{(2)}, y^{(3)}\}$. The final output is constructed from these label sequences:
$$y = \{\,\text{majority}(y^{(1)}_1, y^{(2)}_1, y^{(3)}_1), \dots, \text{majority}(y^{(1)}_N, y^{(2)}_N, y^{(3)}_N)\,\} \qquad (4)$$
in which $\{y^{(1)}_i, y^{(2)}_i, y^{(3)}_i\}$ are the output labels for token $x_i$ from the three components. If, for a specific token, the three components output three different labels, we choose the label of the component with the highest Viterbi score given by its CRFs layer during decoding. We tried other strategies, such as constantly choosing one of the three outputs, but this did not show any improvement.
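A sketch of the decoding step under our reading of this scheme: we break three-way disagreements with the sequence-level Viterbi score of each component; all names are illustrative.

```python
from collections import Counter

def vote(label_seqs, viterbi_scores):
    """Combine three components' label sequences by per-token majority vote.

    label_seqs: three equal-length label sequences, one per component
    viterbi_scores: the three sequence-level Viterbi scores (tie-breaker)
    """
    best = max(range(3), key=lambda k: viterbi_scores[k])
    final = []
    for token_labels in zip(*label_seqs):
        label, freq = Counter(token_labels).most_common(1)[0]
        # if all three components disagree, fall back on the component
        # whose CRF produced the highest Viterbi score (our reading of IV-E)
        final.append(label if freq >= 2 else token_labels[best])
    return final
```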
F. Exploration of sharing layers
Figure 3. Ensemble model with a shared character LSTM layer.
Although the above-described model enjoys better performance in the intent extraction task than the single-model-based approach, it requires a large amount of training time. Therefore, we tried sharing the character-based word representation across components, as depicted in Figure 3. This makes the character-based word representation jointly supervised by all components. Despite reducing training time, this model shows slightly lower performance than the one above, as shown in Figures 4, 5 and 6. We further explored the idea of sharing layers, such as sharing the BiLSTM layer or the CRFs layer. However, sharing these top layers caused performance to go down when testing on our data.
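In PyTorch terms, sharing the character layer can be sketched by letting the three input modules reference one LSTM instance, so its parameters receive gradients from all three losses. This reuses our earlier hypothetical classes; glove_vecs, fasttext_vecs and w2v_vecs are assumed pre-trained tensors.

```python
import torch.nn as nn

# One character BiLSTM instance, referenced by all three input layers, so
# its parameters are jointly supervised by the three CRF losses.
shared_char_lstm = nn.LSTM(25, 25, bidirectional=True, batch_first=True)

representations = [WordRepresentation(n_chars, vecs)
                   for vecs in (glove_vecs, fasttext_vecs, w2v_vecs)]
for rep in representations:
    rep.char_lstm = shared_char_lstm   # share, rather than copy, the module
```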
V. EXPERIMENTAL EVALUATION

A. Data
As stated above, we chose three intent domains for extracting user intention: Real Estate, Transportation and Tourism. The main reason for this choice is that internet users in Vietnam seem to share much more intention about these three domains than about others, as found in our previous survey [14]; thus, the data for these three domains is more diverse. We automatically crawled data from some well-known forums, websites and public Facebook groups, such as webtretho.com/forum, dulich.vnexpress.net, batdongsan.com.vn and facebook.com/groups/xemaycuhanoi. We had a group of students annotate the data based on the three label sets mentioned in Section III. After careful cross-checking among these students' work to ensure consistency in data annotation, we obtained a collection of about 3000 annotated posts for each domain. These data were then divided into training, development and test sets with proportions of 60%, 20% and 20% respectively.
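A 60/20/20 split of this kind can be sketched with scikit-learn; posts is the list of annotated posts for one domain, and the fixed random seed is our assumption.

```python
from sklearn.model_selection import train_test_split

# hold out 40%, then split that portion half-and-half into dev and test
train, rest = train_test_split(posts, test_size=0.4, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)
```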
B. Evaluation metric
For all experiments, precision, recall and F1-score at the segment (or chunk-based) level are adopted as the official task evaluation. Specifically, assume that the true segment sequence of an instance is $s = (s_1, s_2, \dots, s_N)$ and the decoded segment sequence is $s' = (s'_1, s'_2, \dots, s'_K)$. Then $s'_k$ is called a true positive if $s'_k \in s$. Precision and recall are the fractions of the total number of true positives over the total number of decoded and true segments respectively. We report the F1-score, computed as $2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$. Besides, the support is the number of true segments corresponding to each label in the test set. The average/total of precision, recall and F1-score is calculated as the weighted average of the per-label precision, recall and F1-score, in which the weight is the support value.
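A sketch of this metric for a single instance, where each segment is represented as a (label, start, end) triple; this representation is our choice, as the paper does not fix one.

```python
def chunk_prf(true_chunks, pred_chunks):
    """Segment-level precision/recall/F1 (sketch of the metric in Sec. V-B).

    A predicted chunk (label, start, end) counts as a true positive only
    if the identical chunk appears in the gold set.
    """
    gold, pred = set(true_chunks), set(pred_chunks)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```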
C. Training Parameters
Word Embeddings: To the best of our knowledge, there are no publicly available pre-trained embeddings for Vietnamese online texts. Therefore, in our experiments we treated the training dataset of each domain as a corpus for building the models that generate word embeddings. Specifically, we used the public libraries glove (https://pypi.org/project/glove/), fasttext (https://pypi.org/project/fasttext/) and gensim (https://pypi.org/project/gensim/), all with a window size of 7, to produce the GloVe, FastText and Word2Vec word embeddings respectively.
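As a sketch of how two of these embedding sets could be trained with gensim (which also provides FastText), assuming corpus is a list of tokenized posts; the separate glove package has a different API that we do not reproduce here, and the argument names below follow gensim 4.x.

```python
from gensim.models import Word2Vec, FastText

# window size of 7 and 100-dimensional vectors, matching the settings above
w2v = Word2Vec(sentences=corpus, vector_size=100, window=7, min_count=1)
ft = FastText(sentences=corpus, vector_size=100, window=7, min_count=1)
w2v.wv.save_word2vec_format("word2vec.vec")
ft.wv.save_word2vec_format("fasttext.vec")
```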
Character-based Word Representation: The character embeddings corresponding to every character in a word are given in direct and reverse order to a forward and a backward LSTM. The embedding of a word derived from its characters is the concatenation of its forward and backward representations from the bidirectional LSTM [12]. In our model, each character embedding has a dimension of 25. The dimension of the forward and backward character LSTMs is 25, resulting in a 50-dimensional character-based word representation.
BiLSTM Encoding Layer: Our model uses a single layer for the forward and backward LSTMs, whose dimension is 100. We apply dropout to mitigate overfitting [8]. A dropout mask is applied to the final embedding layer just before the input to the BiLSTM encoding layer. In all of our experiments, the dropout rate was fixed at 0.5.
Optimization and Fine-tuning: We used Adam optimization [7] with learning rate 0.001, β1 = 0.9, β2 = 0.999 and gradient clipping of 10. The effectiveness of fine-tuning embeddings has been explored in sequential prediction problems [19]. In our model, each of the initial embeddings is fine-tuned, i.e., modified during the gradient updates of the neural network model by back-propagating gradients.
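These settings can be sketched as a single training step in PyTorch; we assume norm-based clipping, and compute_overall_loss is a hypothetical helper standing in for the summed CRF loss of Section IV-D.

```python
import torch

# Adam with lr=0.001, beta1=0.9, beta2=0.999, clipping at 10. `components`
# are the hypothetical BiLSTMCRF modules from the earlier sketches; their
# embedding parameters are included, so the embeddings get fine-tuned.
params = [p for c in components for p in c.parameters()]
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))

optimizer.zero_grad()
loss = compute_overall_loss(batch)   # summed CRF loss, Eq. (3) (hypothetical)
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=10.0)
optimizer.step()
```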
Our implementation was mostly based on that of Lample et al. [12].
D. Experimental results and discussion
Figure 4. Average F1-score over 5 runs for each model in the Real Estate domain.
As our motivation is to explore ensemble and tri-training techniques for the intent extraction task, we built the following six models for each intent domain: (1), (2), (3) three BiLSTM-CRFs models proposed by Lample et al. [12] (GLOVE, FASTTEXT, WORD2VEC), whose word embeddings are initialized with GloVe, FastText and Word2Vec embeddings respectively; (4) a single BiLSTM-CRFs model proposed by Lample et al. [12] in which the word embedding initialization is the concatenation of the GloVe, FastText and Word2Vec embeddings (3-EMBEDDINGS); (5) our proposed model, based on the tri-training technique and ensemble learning, as presented in Figure 2; (6) the model depicted in Figure 3, in which the character BiLSTM layer is jointly learned by the three components of the network (SHARING CHAR-LAYER MODEL). After carefully running all of the above-described models for each intent domain, the F1-scores averaged over 5 different runs on the test sets are presented in Figures 4, 5 and 6 respectively.
Figure 5. Average F1-score over 5 runs for each model in the Tourism domain.
Figure 6. Average F1-score over 5 runs for each model in the Transportation domain.
In all three domains, we observed that our proposed models reached better results than the four single models. The biggest improvement was seen in the Transportation domain, where our proposed method achieved a 1.15% higher F1-score than the single BiLSTM-CRFs with GloVe word embedding initialization and a nearly 3% higher F1-score than the single BiLSTM-CRFs with Word2Vec word embedding initialization. The ensemble model with the character BiLSTM layer shared by all components of the network, however, showed smaller improvements than the model without the shared character BiLSTM layer. At its best, it improves the F1-score by 0.88% over the single BiLSTM-CRFs with GloVe word embedding initialization in the Real Estate domain and by 2.62% over the single BiLSTM-CRFs with Word2Vec word embedding initialization in the Transportation domain. Since data on social media is generated massively on a daily basis, this model is worth exploring to save training time.
Speaking of the four single models, the BiLSTM-CRFs with GloVe word embedding initialization topped the list in terms of F1-score. Interestingly, when these single models are combined to build the ensemble model, the components based on the models with lower F1-scores (those with word embeddings initialized by FastText and Word2Vec) contribute positively and boost the overall results of the whole ensemble model. This means that each component in the proposed models can support its counterparts to reach a better overall performance.

TABLE I. THE BEST CHUNK-BASED RESULT ACHIEVED WHEN APPLYING OUR PROPOSED MODEL IN THE TOURISM DOMAIN
Specific Label Precision Recall F1-score Support
Description of Obj 42.06 48.18 44.92 110
Name of Accom 56.73 68.60 62.11 86
Number of Objects 93.83 93.83 93.83 81
Number of People 88.29 87.78 88.03 352
Point of Departure 75.31 75.31 75.31 81
Point of Time 88.26 91.81 90.00 794
avg/total 84.06 85.29 84.57 3691
We also recognize that the concatenation of the three word embedding types as the input to a single model is not always effective on our data. Experiments show that it always enjoys better results than the worst single model but performs worse than the best one; therefore, it does not bring an improvement to the intent extraction task.
The Real Estate domain experienced the lowest results in all experiments. One reason is that posts/comments in Real Estate are usually long and complicated (e.g., Figure 1), so we have much more information to extract and also much more noise to deal with than in the two remaining domains. Finally, we present in Table I the best results among the three domains for extracting intent heads and intent properties, achieved when applying our proposed model to the Tourism domain. This is because the Tourism domain has the smallest number of labels compared to Real Estate and Transportation (15, 18 and 17 labels respectively). Moreover, after carefully analyzing the data from the three domains, we found that the Tourism domain contains less noisy data, such as improper abbreviations and emoticons, than the two remaining domains.
VI. CONCLUSION
In this paper, we presented a novel approach to extracting user intention from social media texts, motivated by the tri-training technique and ensemble learning in the context of deep neural networks. In this approach, the outputs of three independent BiLSTM-CRFs components are aggregated to produce the final prediction through a majority voting scheme. To ensure the diversity of these BiLSTM-CRFs components, each of them leverages word embeddings initialized by a different method, namely GloVe, FastText and Word2Vec. In all experiments, our proposed models achieve higher F1-scores than the single-model approaches proposed by Lample et al. [12]. Despite its better performance, one drawback of our method is its time complexity. Therefore, we explored a variation of the above model in which the character-based word representation layer is shared and jointly learned across components. Besides reducing training time, this model still achieves promising results. Overall, our proposed ensemble models are effective for the user intent extraction task.
REFERENCES
[1] P. Bojanowski et al., “Enriching word vectors with subword information", Transactions of the Association for Computational Linguistics, pp. 135-146, 2017.
[2] R. Caruana, A. Niculescu-Mizil, G. Crew and A. Ksikes, “Ensemble selection from libraries of models", In Proc. of the 21st ICML, p. 18, 2004.
[3] M. Castellanos et al., “Intention insider: discovering people's intentions in the social channel", In Proc. of the 15th ICEDT, pp. 614-617, 2012.
[4] Y.S. Chang et al., “Identifying user goals from Web search results", In IEEE/WIC/ACM International Conference, pp. 1038-1041, 2006.
[5] R. Collobert et al., “Natural language processing (almost) from scratch", Journal of Machine Learning Research, vol. 12, pp. 2493-2537, 2011.
[6] H.K. Dai, L. Zhao, Z. Nie, J.R. Wen, L. Wang, and Y. Li, “Detecting online commercial intention (OCI)", In Proc. of the 15th WWW, pp. 829-837, ACM, 2006.
[7] K. Diederik and B. Jimmy, “Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, 2014.
[8] G.E. Hinton et al., “Improving neural networks by preventing co-adaptation of feature detectors", arXiv preprint arXiv:1207.0580, 2012.
[9] D.H. Hu, D. Shen, J.T. Sun, Q. Yang and Z. Chen, “Context-aware online commercial intention", In ACML, pp. 135-149, 2009.
[10] B.J. Jansen, D.L. Booth and A. Spink, “Determining the User Intent of Web Search Engine Queries", In Proc. of the 16th WWW, pp. 1149-1150, ACM, 2007.
[11] N. Labidi, T. Chaari, and R. Bouaziz, “An NLP-Based Ontology Population for Intentional Structure", In International Conference on Intelligent Systems Design and Applications, pp. 900-910, 2016.
[12] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami and C. Dyer, “Neural architectures for named entity recognition", arXiv preprint arXiv:1603.01360, 2016.
[13] X. Li, “Understanding the semantic structure of noun phrase queries", In Proc. of the 48th Annual Meeting of the ACL, pp. 1337-1345, 2010.
[14] Th.L. Luong, Qu.T. Truong, H.Tr. Dang and X.H. Phan, “Domain identification for intention posts on online social media", In Proc. of SoICT, pp. 52-57, 2016.
[15] Th.L. Luong, M.S. Cao, D.T. Le and X.H. Phan, “Intent extraction from social media texts using sequential segmentation and deep learning models", In Proc. of the 9th KSE, pp. 215-220, 2017.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space", arXiv preprint arXiv:1301.3781, 2013.
[17] X.B. Ngo, C.L. Le, and M.Ph. Tu, “Cross-Domain Intention Detection in Discussion Forums", In Proc. of the 8th SoICT, pp. 173-180, 2017.
[18] N. Nguyen and Y. Guo, “Comparisons of sequence labeling algorithms and extensions", In Proc. of the 24th ICML, pp. 681-688, 2007.
[19] N. Peng and M. Dredze, “Named entity recognition for Chinese social media with jointly trained embeddings", In Proc. of EMNLP, pp. 548-554, 2015.
[20] J. Pennington, R. Socher and C. Manning, “GloVe: Global vectors for word representation", In Proc. of EMNLP, pp. 1532-1543, 2014.
[21] S. Ruder and B. Plank, “Strong baselines for neural semi-supervised learning under domain shift", arXiv preprint arXiv:1804.09530, 2018.
[22] J. Wang, G. Cong, W.X. Zhao and X. Li, “Mining user intents in Twitter: a semi-supervised approach to inferring intent categories for tweets", In Proc. of the 29th AAAI, 2015.
[23] Z.H. Zhou and M. Li, “Tri-training: Exploiting unlabeled data using three classifiers", IEEE Transactions on Knowledge & Data Engineering, vol. 17, no. 11, pp. 1529-1541, 2005.